Search Syntax
Canopy’s search engine is built on the powerful Apache Lucene library, which allows for a wide range of search capabilities, including fuzzy searches, wildcard searches, proximity searches, and more. In this guide, we will cover the important query string syntax and search operators that you can use to refine your searches.
Query String Syntax commonly consists of Terms, Fields, and Operators.
- Terms are single words or phrases you want to search for. For example, `tree` and `work` are single search terms that allow you to search for documents with the words “tree” and “work”. When running a query, search Terms are entered into a Field.
- Fields: When performing a search, you may select a Field from the Fields guide. If no field is specified, Canopy searches across all relevant text fields containing extracted or OCRed text. This search is optimized to be as complete as possible.
- Operators allow you to customize your search. Common Operators include `AND`, `OR`, and `NOT` (must be capitalized). You can also use `+` for `AND` and `-` for `NOT`.
Canopy’s powerful search helps users find documents quickly and easily. Here are some of the basic searches you can use:
- Keyword Search: Canopy’s search engine uses a built-in English stemmer to find documents based on root and base words. Typing `run` in the search bar will return documents containing variations such as “running,” “runner,” “ran,” etc.
- Phrase Search:
  - Use double quotes to search for an exact phrase and its close variations. For example, `"red delicious apple"` (in quotes) returns results containing a variation of the phrase, such as “red delicious apple” or “red delicious apples”, due to stemming.
  - Without quotes, the search will return documents containing each word in any order. For example, `red delicious apple` (without quotes) returns documents containing each word (“red”, “delicious”, “apple”) and phrases containing those words in any order (“delicious apple”, “delicious red apple”, etc.).
These basic searches are not case-sensitive. Whether you type “apple,” “Apple,” or “APPLE,” you’ll get the same results.
You can use wildcards to search for partial terms. Wildcards are useful when you are unsure of the spelling or want to find variations of a word.
Wildcard Operators | Description | Examples |
---|---|---|
`*` | Matches zero or more characters | `appl*` matches “apple”, “apples”, etc. |
`?` | Matches exactly one character | `ex?mple` matches “example”, etc. |
`*` and `?` can be used at the beginning, middle, or end of a term.
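To get a feel for which terms a wildcard pattern covers, the two operators can be translated into a regular expression. The following is only an illustrative Python sketch: `wildcard_matches` is a hypothetical helper, not part of Canopy or Lucene, and it ignores indexing steps such as stemming.

```python
import re

def wildcard_matches(pattern: str, term: str) -> bool:
    """Check a term against a wildcard pattern (hypothetical helper).

    * is translated to ".*" (zero or more characters) and
    ? is translated to "." (exactly one character).
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # treat everything else literally
    return re.fullmatch("".join(parts), term) is not None

terms = ["apple", "apples", "appl", "example", "exmple", "pineapple"]
print([t for t in terms if wildcard_matches("appl*", t)])    # trailing wildcard
print([t for t in terms if wildcard_matches("ex?mple", t)])  # wildcard in the middle
print([t for t in terms if wildcard_matches("*apple", t)])   # leading wildcard
```

Note that `appl*` also covers the bare term “appl” because `*` may match zero characters, while `ex?mple` does not cover “exmple” because `?` always requires one character.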
You can use the fuzzy operator to search for terms that are similar but not an exact match. This is useful for finding documents with misspelled words or variations of a term.
- Use `~` after a term to enable fuzzy matching (e.g., `aple~` matches “apple”, and `tre~ wrk~` matches “tree work”).
- You can specify the edit distance (e.g., `aple~2` finds words that are up to 2 edits away from “aple”). The default edit distance is 2.
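An "edit" here means inserting, deleting, or substituting a single character. The Python sketch below computes that count with a plain Levenshtein distance so you can estimate which terms a fuzzy search would reach. It is a simplified illustration under stated assumptions: the function names are hypothetical, and Lucene's fuzzy matching additionally counts a swap of two adjacent characters as a single edit, which this version does not.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def fuzzy_candidates(term, vocabulary, max_edits=2):
    """Return the vocabulary words within max_edits of the search term."""
    return [w for w in vocabulary if edit_distance(term, w) <= max_edits]

vocabulary = ["apple", "apples", "maple", "banana"]
print(fuzzy_candidates("aple", vocabulary, max_edits=1))
print(fuzzy_candidates("aple", vocabulary, max_edits=2))
```

With one edit allowed, “aple” reaches “apple” and “maple”; with two edits it also reaches “apples”, which shows how a larger distance broadens the result set.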
Proximity Search finds two or more words within a specific distance apart in a document. It also allows text to be in a different order than when searching for a quoted phrase.
- You may specify the maximum distance between the words in a phrase by using the tilde (`~`) operator followed by a number (e.g., `"tree work"~4` will find documents that contain the words “tree” and “work” within 4 words of each other, regardless of their order).
- Documents with text that more closely matches the original specified order will be considered more relevant to your search.
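The idea behind a proximity search can be sketched as a check on word positions. The Python snippet below is a deliberately simplified model, not Lucene's actual algorithm: `within_proximity` is a hypothetical helper that treats the number as the maximum count of words allowed between the two terms, splits on whitespace only, and does not penalize reversed order the way the real scoring does.

```python
def within_proximity(text: str, first: str, second: str, max_gap: int) -> bool:
    """Return True if both words occur with at most max_gap words between them."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        i != j and abs(i - j) - 1 <= max_gap  # words strictly between the pair
        for i in first_positions
        for j in second_positions
    )

text = "the tree crew finished the work"
print(within_proximity(text, "tree", "work", max_gap=4))  # 3 words in between
print(within_proximity(text, "tree", "work", max_gap=2))  # too far apart
print(within_proximity("work near the tree", "tree", "work", max_gap=4))  # reversed order
```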
Click here for more information on Field Search
To gain more control over your search results and achieve greater precision, you can target your searches to specific fields based on the Analyzer used to index the text data within those fields.
To ensure high-quality and consistent indexing of text data, Canopy utilizes the Standard Tokenizer in conjunction with many other specialized Language Tokenizers optimized for the linguistic nuances of specific languages.
Different Analyzers handle text indexing and searching differently.
- The Standard Analyzer breaks text into tokens based on whitespace, punctuation, and other non-alphanumeric characters, preserving the tokens’ original form.
- Language Analyzers (e.g., English, French) perform similar tokenization but also apply language-specific token filters, such as:
  - Stemming: Reducing words to their root form (e.g., “running” becomes “run”).
  - Stopword Removal: Ignoring common words that have little semantic value (e.g., “the,” “a,” “is”).
  - Lowercase: Converting all characters to lowercase.
  - Other linguistic normalizations: Handling elisions, case variations, etc.
Click here to learn more about How we Index the text data in Canopy
The following table shows the mapping of the different Analyzers used by Canopy and the corresponding data fields:
Analyzer | Field Mapping |
---|---|
Standard Analyzer | content.text |
English Analyzer | content.text_english |
French Analyzer | content.text_french |
German Analyzer | content.text_german |
Italian Analyzer | content.text_italian |
Kuromoji Analyzer | content.text_japanese |
Nori Analyzer | content.text_korean |
Smart Chinese Analyzer | content.text_chinese |
Default Search: When you enter a search term without specifying a field, Canopy searches across all relevant text fields, regardless of the analyzer used. This search is optimized to be as complete as possible.
Analyzer-Specific Field Search: When you specify a field mapping in your search query, you instruct Canopy to search only within the text indexed by the Analyzer associated with that specific field. This allows you to leverage the unique processing capabilities of each Analyzer for more targeted results.
Consider documents with the following content in different fields:
1. Account: Main Account
2. Accounting: Finance Department
3. Account No: 123-456
4. Accounting No: ACC-123
5. Account #: ABC-124
6. Accounting #: FY25-123
Search Term | Expected Result | Explanation |
---|---|---|
`account` | 1, 2, 3, 4, 5, 6 | Searches across all text fields for variations of the word “account” |
`account*` (e.g., `accounting`) | 1, 2, 3, 4, 5, 6 | Searches across all text fields for variations of the word “account” |
`"account no"` | 1, 2, 3, 4, 5, 6 | The stopword “no” is removed while indexing. This search term yields the same result as `account` |
`"account* no"` (e.g., `"accounting no"`) | 1, 2, 3, 4, 5, 6 | The stopword “no” is removed while indexing. This search term yields the same result as `accounting` |
`"account #"` | 1, 2, 3, 4, 5, 6 | The symbol # is removed while indexing. This search term yields the same result as `account` |
`"account* #"` (e.g., `"accounting #"`) | 1, 2, 3, 4, 5, 6 | The symbol # is removed while indexing. This search term yields the same result as `accounting` |
`content.text:account` | 1, 3, 5 | Searches the `content.text` field (Standard Analyzer) for the exact word “account” |
`content.text:accounting` | 2, 4, 6 | Searches the `content.text` field (Standard Analyzer) for the exact word “accounting” |
`content.text:"account no"` | 3 | Searches the `content.text` field for the exact phrase “account no”. The Standard Analyzer does not remove the stopword “no” |
`content.text:"accounting no"` | 4 | Searches the `content.text` field for the exact phrase “accounting no”. The Standard Analyzer does not remove the stopword “no” |
`content.text:"account #"` | 1, 3, 5 | The symbol # is removed while indexing. This search term yields the same result as `content.text:account` |
`content.text:"accounting #"` | 2, 4, 6 | The symbol # is removed while indexing. This search term yields the same result as `content.text:accounting` |
`content.text_english:account` | 1, 2, 3, 4, 5, 6 | The English Analyzer uses a stemmer, so this syntax matches variations of the word “account” |
`content.text_english:accounting` | 1, 2, 3, 4, 5, 6 | The English Analyzer uses a stemmer, so this syntax matches variations of the word “account” |
`content.text_english:"account no"` | 1, 2, 3, 4, 5, 6 | The stopword “no” is removed while indexing. This search term yields the same result as `content.text_english:account` |
`content.text_english:"accounting no"` | 1, 2, 3, 4, 5, 6 | The stopword “no” is removed while indexing. This search term yields the same result as `content.text_english:accounting` |
`content.text_english:"account #"` | 1, 2, 3, 4, 5, 6 | The symbol # is removed while indexing. This search term yields the same result as `content.text_english:account` |
`content.text_english:"accounting #"` | 1, 2, 3, 4, 5, 6 | The symbol # is removed while indexing. This search term yields the same result as `content.text_english:accounting` |
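The difference between the two analyzers in the table above can be imitated with a small Python model. This is a toy sketch under loud assumptions: the stopword list, the naive "ing" stemmer, and the helper names are all invented for illustration and are far simpler than the rules the real Standard and English Analyzers apply. It only aims to reproduce the document numbers shown for the six sample documents.

```python
import re

# Assumption: a tiny stopword list, not the analyzer's real one.
STOPWORDS = {"a", "is", "no", "the"}

DOCS = {
    1: "Account: Main Account",
    2: "Accounting: Finance Department",
    3: "Account No: 123-456",
    4: "Accounting No: ACC-123",
    5: "Account #: ABC-124",
    6: "Accounting #: FY25-123",
}

def standard_analyze(text):
    """Toy standard analysis: split on non-alphanumerics, lowercase, no stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())

def english_analyze(text):
    """Toy English analysis: drop stopwords, then strip a trailing 'ing'."""
    tokens = [t for t in standard_analyze(text) if t not in STOPWORDS]
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]

def search(analyze, query):
    """Return ids of documents whose tokens contain the query tokens in sequence."""
    wanted = analyze(query)
    hits = []
    for doc_id, text in DOCS.items():
        tokens = analyze(text)
        windows = (tokens[i:i + len(wanted)]
                   for i in range(len(tokens) - len(wanted) + 1))
        if wanted and any(window == wanted for window in windows):
            hits.append(doc_id)
    return hits

print(search(standard_analyze, "account"))        # exact word only
print(search(standard_analyze, "account no"))     # "no" is kept, so only one hit
print(search(english_analyze, "accounting no"))   # stemmed, "no" dropped
```

Because the same analysis function is applied to both the documents and the query, a query such as `"account #"` collapses to the single token “account” in either model, which mirrors the behavior described in the table.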
Click here for more information on Regular Expression Syntax
You can use different brackets to denote specific ranges for date, numeric, and string fields.
Range Search | Examples |
---|---|
Use square brackets to specify inclusive ranges: `[min TO max]` | `date:["2018-01-01 00:00:00.000" TO "2018-12-31 00:00:00.000"]` searches for all days in 2018. `count:[100 TO *]` searches for numbers from 100 upwards. |
Use curly brackets to specify exclusive ranges: `{min TO max}` | `tag:{delta TO sigma}` searches for tags between delta and sigma, excluding delta and sigma. `date:{* TO "2018-01-01 00:00:00.000"}` searches for all dates before 2018. |
Combine curly and square brackets | `count:[1 TO 8}` searches for numbers from 1 up to but not including 8. |
Range search with one side unbounded | `age:>30` searches for ages greater than 30. `age:>=30` searches for ages greater than or equal to 30. |
Combine a lower and an upper bound by joining two clauses with the AND operator | `age:(>=10 AND <30)` or `age:(+>=10 +<30)` |
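When range queries are assembled programmatically, picking the right bracket for each side is easy to get wrong. The Python sketch below shows one possible way to build the clauses from the table above. `range_clause` is a hypothetical helper, not a Canopy API, and the field names are taken from the examples only.

```python
def _format_bound(value) -> str:
    """Quote bounds that contain whitespace; leave numbers and * untouched."""
    text = str(value)
    return f'"{text}"' if any(ch.isspace() for ch in text) else text

def range_clause(field, lower="*", upper="*",
                 include_lower=True, include_upper=True) -> str:
    """Build field:[lower TO upper] with [ ] for inclusive and { } for exclusive."""
    opening = "[" if include_lower else "{"
    closing = "]" if include_upper else "}"
    return f"{field}:{opening}{_format_bound(lower)} TO {_format_bound(upper)}{closing}"

print(range_clause("count", 100))                        # count:[100 TO *]
print(range_clause("count", 1, 8, include_upper=False))  # count:[1 TO 8}
print(range_clause("tag", "delta", "sigma",
                   include_lower=False, include_upper=False))  # tag:{delta TO sigma}
print(range_clause("date", "2018-01-01 00:00:00.000", "2018-12-31 00:00:00.000"))
```

Leaving a bound at its default `*` produces an open-ended range, which is equivalent in intent to the `>` and `>=` shorthand.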
The boost operator (`^`) allows you to increase the relevance of a term or phrase in your search results. By default, all terms are given equal weight, but you can adjust the weight of specific terms to prioritize them in the search results.
Boost Operator can be used on | Examples |
---|---|
Individual Terms | sugar^2 maple |
Phrases | "tree work"^2 |
Groups of Terms | (sugar maple)^4 |
Although the default boost value is 1, it can be any positive floating point number. Boosts between 0 and 1 reduce relevance.
Boolean operators are used to combine or exclude keywords in a search, allowing you to refine your search results.
Boolean operators include `+` (this term must be included) and `-` (this term must not be included), while all other terms are optional. For example, `sugar maple +tree -work` states that:
- `tree` must be included
- `work` must not be included
- `sugar` and `maple` are optional; their inclusion increases relevance
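The required, prohibited, and optional roles described above can be modeled in a few lines of Python. This is a rough sketch, not how the search engine scores documents: the function names are hypothetical, text is split on whitespace only, and the "score" is just a count of matched terms standing in for real relevance.

```python
def parse_simple_query(query: str):
    """Split a query into required (+), prohibited (-), and optional terms."""
    must, must_not, should = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+"):
            must.add(token[1:])
        elif token.startswith("-"):
            must_not.add(token[1:])
        else:
            should.add(token)
    return must, must_not, should

def evaluate(query: str, text: str):
    """Return a toy score for a matching document, or None if it is excluded."""
    must, must_not, should = parse_simple_query(query)
    words = set(text.lower().split())
    if not must <= words or must_not & words:
        return None  # a required term is missing or a prohibited term is present
    if not must and not should & words:
        return None  # nothing required, and no optional term matched
    return len(must & words) + len(should & words)

query = "sugar maple +tree -work"
print(evaluate(query, "sugar maple tree"))  # all optional terms raise the score
print(evaluate(query, "tree"))              # matches on the required term alone
print(evaluate(query, "tree work"))         # excluded by -work
```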
Users may also use Operators such as `AND`, `OR`, and `NOT` (also written `&&`, `||`, and `!` respectively) to combine or exclude keywords in a search. Some important rules to note:
- NOT takes precedence over AND, which takes precedence over OR.
- `+` and `-` only affect the term to the operator’s right. However, AND and OR affect the terms to the left and right.
For example, consider three attempts to rewrite the original query `sugar maple +tree -work` using these operators:
- `sugar OR maple AND tree AND NOT work`. This example will yield an inaccurate result because `maple` is now a required term.
- `(sugar OR maple) AND tree AND NOT work`. This example will yield an inaccurate result because at least one of `sugar` or `maple` is now required, and the search for those terms would now be scored differently from the original query.
- `((sugar AND tree) OR (maple AND tree) OR tree) AND NOT work`. This example replicates the logic of the original query, but the relevance scoring will not match that of the original query.
The operators AND, NOT, and OR must be in upper case.
You can group terms or phrases using parentheses `()` to form sub-queries (e.g., `(sugar OR maple) AND tree`).
- Groups can be used to focus on a particular field or boost results of a sub-query (e.g., `piitag:(name OR phone) title:(full text search)^3`).
- Groups can be used to find a list of values in a field (e.g., `id:(2FG2G55FGF OR 2FG2G55CGF OR 3FG2G55FGF)` or alternately `id:(2FG2G55FGF 2FG2G55CGF 3FG2G55FGF)`).