Search Syntax

Canopy’s search engine is built on the powerful Apache Lucene library, which allows for a wide range of search capabilities, including fuzzy searches, wildcard searches, proximity searches, and more. In this guide, we will cover the important query string syntax and search operators that you can use to refine your searches.

Overview

Query String Syntax commonly consists of Terms, Fields, and Operators.

Terms are single words or phrases you want to search for.

For example, tree and work are single search terms that allow you to search for documents with the words “tree” and “work”. When running a query, search Terms will be entered into a Field.

Fields are where search Terms are entered. When performing a search, you may select a Field from the Fields guide. If no field is specified, Canopy searches across all relevant text fields containing extracted or OCRed text. This search is optimized to be as complete as possible.
Operators allow you to customize your search. Common Operators include AND, OR, and NOT (must be capitalized). You can also use + to require a term and - to exclude a term.

Basic Search

Canopy’s powerful search helps users find documents quickly and easily. Here are some of the basic searches you can use:

Keyword Search: Canopy’s search engine uses a built-in English stemmer to find documents based on root and base words. Typing run in the search bar returns documents containing variations of the word, such as “running,” “runner,” and “ran.”
Phrase Search:
- Use double quotes to search for an exact phrase and its close variations. For example, "red delicious apple" (in quotes) returns documents containing the phrase itself or a stemmed variation of it, such as "red delicious apples".
- Without quotes, the search returns documents containing each word in any order. For example, red delicious apple (without quotes) returns documents containing each of the words ("red", "delicious", "apple"), including phrases that contain those words in any order ("delicious apple", "delicious red apple", etc.).

These basic searches are not case-sensitive. Whether you type “apple,” “Apple,” or “APPLE,” you’ll get the same results.

Wildcard Search

You can use wildcards to search for partial terms. Wildcards are useful when you are unsure of the spelling or want to find variations of a word.

Wildcard Operators	Description	Examples
`*`	Matches zero or more characters	`appl*` matches “apple”, “apples”, etc.
`?`	Matches exactly one character	`ex?mple` matches “example”, etc.

* and ? can be used at the beginning, middle, or end of a term.
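
The wildcard rules above can be approximated outside Canopy with Python's standard `fnmatch` module. This is only an illustrative sketch, not Canopy's implementation: the helper name `wildcard_match` is hypothetical, and `fnmatch` also gives special meaning to square brackets, which the table above does not mention.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern: str, term: str) -> bool:
    """Approximate `*` (zero or more characters) and `?` (exactly one
    character) matching against a single term, ignoring case."""
    return fnmatchcase(term.lower(), pattern.lower())


if __name__ == "__main__":
    for pattern, term in [
        ("appl*", "apples"),    # trailing wildcard
        ("ex?mple", "example"),  # single-character wildcard
        ("ex?mple", "exmple"),   # `?` needs exactly one character
        ("*ork", "work"),        # leading wildcard
    ]:
        print(pattern, term, wildcard_match(pattern, term))
```

Note that `appl*` also matches the bare term "appl", because `*` may stand for zero characters.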

Fuzzy Search

You can use the fuzzy operator to search for terms that are similar but not an exact match. This is useful for finding documents with misspelled words or variations of a term.

Use ~ after a term to enable fuzzy matching (e.g., aple~ matches “apple”; tre~ wrk~ matches “tree work”).
You can specify the maximum edit distance (e.g., aple~2 finds words that are up to 2 edits away from “aple”). The default edit distance is 2.
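
To make "edit distance" concrete, here is a minimal sketch of the classic Levenshtein distance, which counts single-character insertions, deletions, and substitutions. It is an illustration only: the function name `edit_distance` is hypothetical, and Lucene-based engines typically also count a swap of two adjacent characters as a single edit, which this plain version does not.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("aple", "apple"))  # 1
    print(edit_distance("tre", "tree"))    # 1
    print(edit_distance("wrk", "work"))    # 1
```

Under this measure, aple~1 would reach "apple" and "maple", while "apples" is 2 edits away and needs the default distance of 2.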

Proximity Search

Proximity Search finds two or more words within a specified distance of each other in a document. Unlike a quoted phrase search, it also allows the words to appear in a different order.

You may specify a maximum edit distance of words in a phrase by using the tilde (~) operator followed by a number (e.g., "tree work"~4 finds documents that contain the words “tree” and “work” within 4 words of each other, regardless of their order).
Documents with text that more closely matches the original specified order will be considered more relevant to your search.
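
The idea of two words being "within N words of each other" can be sketched as follows. This is a deliberately simplified model, not the engine's actual algorithm: it only compares word positions, whereas the real slop calculation also accounts for word order. The function name `within_distance` and its tokenization rule are assumptions made for illustration.

```python
import re


def within_distance(text: str, first: str, second: str, max_distance: int) -> bool:
    """Return True if `first` and `second` occur at most `max_distance`
    word positions apart in `text`, in either order (simplified model)."""
    tokens = re.findall(r"\w+", text.lower())
    pos_first = [i for i, t in enumerate(tokens) if t == first.lower()]
    pos_second = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(
        i != j and abs(i - j) <= max_distance
        for i in pos_first
        for j in pos_second
    )


if __name__ == "__main__":
    print(within_distance("Tree removal work is scheduled", "tree", "work", 4))
    print(within_distance("Work on the old tree", "tree", "work", 4))
    print(within_distance("tree a b c d e work", "tree", "work", 4))
```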

Field Search

Click here for more information on Field Search

Analyzer-Specific Field Search

To gain more control over your search results and achieve greater precision, you can target your searches to specific fields based on the Analyzer used to index the text data within those fields.

Understanding Analyzers and Fields

To ensure high-quality and consistent indexing of text data, Canopy uses the Standard Analyzer in conjunction with several specialized Language Analyzers optimized for the linguistic nuances of specific languages.

Different Analyzers handle text indexing and searching differently.

The Standard Analyzer breaks text into tokens based on whitespace, punctuation, and other non-alphanumeric characters, preserving the tokens’ original form.
Language Analyzers (e.g., English, French) perform similar tokenization but also apply language-specific token filters, such as:
- Stemming: Reducing words to their root form (e.g., “running” becomes “run”).
- Stopword Removal: Ignoring common words that have little semantic value (e.g., “the,” “a,” “is”).
- Lowercasing: Converting all characters to lowercase.
- Other linguistic normalizations: Handling elisions, case variations, etc.

Click here to learn more about How we Index the text data in Canopy

The following table shows the mapping of the different Analyzers used by Canopy and the corresponding data fields:

Analyzer	Field Mapping
Standard Analyzer	`content.text`
English Analyzer	`content.text_english`
French Analyzer	`content.text_french`
German Analyzer	`content.text_german`
Italian Analyzer	`content.text_italian`
Kuromoji Analyzer	`content.text_japanese`
Nori Analyzer	`content.text_korean`
Smart Chinese Analyzer	`content.text_chinese`

Default Search: When you enter a search term without specifying a field, Canopy searches across all relevant text fields, regardless of the analyzer used. This search is optimized to be as complete as possible.

Analyzer-Specific Field Search: When you specify a field mapping in your search query, you instruct Canopy to search only within the text indexed by the Analyzer associated with that specific field. This allows you to leverage the unique processing capabilities of each Analyzer for more targeted results.

Example Search Behavior

Consider six documents with the following content in different fields, numbered 1 to 6 so that the Expected Result column can refer to them:

1. Account: Main Account
2. Accounting: Finance Department
3. Account No: 123-456
4. Accounting No: ACC-123
5. Account #: ABC-124
6. Accounting #: FY25-123

Search Term	Expected Result	Explanation
`account`	1, 2, 3, 4, 5, 6	Searches across all text fields for variations of the word “account”.
`account*` e.g., `accounting`	1, 2, 3, 4, 5, 6	Searches across all text fields for variations of the word “account”.
`"account no"`	1, 2, 3, 4, 5, 6	The stopword “no” is removed during indexing. This search term yields the same result as “account”.
`"account* no"` e.g., `"accounting no"`	1, 2, 3, 4, 5, 6	The stopword “no” is removed during indexing. This search term yields the same result as “accounting”.
`"account #"`	1, 2, 3, 4, 5, 6	The symbol # is removed during indexing. This search term yields the same result as “account”.
`"account* #"` e.g., `"accounting #"`	1, 2, 3, 4, 5, 6	The symbol # is removed during indexing. This search term yields the same result as “accounting”.
`content.text:account`	1, 3, 5	Searches only the Standard Analyzer field for the exact word “account”.
`content.text:accounting`	2, 4, 6	Searches only the Standard Analyzer field for the exact word “accounting”.
`content.text:"account no"`	3	Searches only the Standard Analyzer field for the exact phrase “account no”. The Standard Analyzer does not remove the stopword “no”.
`content.text:"accounting no"`	4	Searches only the Standard Analyzer field for the exact phrase “accounting no”. The Standard Analyzer does not remove the stopword “no”.
`content.text:"account #"`	1, 3, 5	The symbol # is removed during indexing. This search term yields the same result as `content.text:account`.
`content.text:"accounting #"`	2, 4, 6	The symbol # is removed during indexing. This search term yields the same result as `content.text:accounting`.
`content.text_english:account`	1, 2, 3, 4, 5, 6	The English Analyzer uses a stemmer, so this syntax searches the English Analyzer field for variations of the word “account”.
`content.text_english:accounting`	1, 2, 3, 4, 5, 6	The English Analyzer uses a stemmer, so this syntax searches the English Analyzer field for variations of the word “account”.
`content.text_english:"account no"`	1, 2, 3, 4, 5, 6	The stopword “no” is removed during indexing. This search term yields the same result as `content.text_english:account`.
`content.text_english:"accounting no"`	1, 2, 3, 4, 5, 6	The stopword “no” is removed during indexing. This search term yields the same result as `content.text_english:accounting`.
`content.text_english:"account #"`	1, 2, 3, 4, 5, 6	The symbol # is removed during indexing. This search term yields the same result as `content.text_english:account`.
`content.text_english:"accounting #"`	1, 2, 3, 4, 5, 6	The symbol # is removed during indexing. This search term yields the same result as `content.text_english:accounting`.
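
The difference between the two analyzers in the table can be reproduced with a toy model. The sketch below is not Canopy's indexing pipeline: the tokenizer regex, the tiny stopword list, and the crude "-ing" stemmer are all stand-ins chosen only so that the six sample documents behave as described above.

```python
import re

DOCS = {
    1: "Account: Main Account",
    2: "Accounting: Finance Department",
    3: "Account No: 123-456",
    4: "Accounting No: ACC-123",
    5: "Account #: ABC-124",
    6: "Accounting #: FY25-123",
}

# Assumed stopword list, far smaller than a real analyzer's.
STOPWORDS = {"no", "the", "a", "is"}


def standard_tokens(text: str) -> list:
    """Toy Standard Analyzer: lowercase, split on non-alphanumerics
    (so symbols such as # disappear), keep stopwords, no stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())


def english_tokens(text: str) -> list:
    """Toy English Analyzer: as above, plus stopword removal and a crude
    stemmer that only strips a trailing 'ing'."""
    kept = [t for t in standard_tokens(text) if t not in STOPWORDS]
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in kept]


def phrase_hits(analyze, query: str) -> list:
    """IDs of documents whose analyzed tokens contain the analyzed query
    as a consecutive run (a single-term query is a run of length 1)."""
    q = analyze(query)
    hits = []
    for doc_id, text in DOCS.items():
        tokens = analyze(text)
        if any(tokens[i:i + len(q)] == q for i in range(len(tokens) - len(q) + 1)):
            hits.append(doc_id)
    return hits


if __name__ == "__main__":
    print(phrase_hits(standard_tokens, "account"))        # [1, 3, 5]
    print(phrase_hits(standard_tokens, "account no"))     # [3]
    print(phrase_hits(english_tokens, "accounting no"))   # [1, 2, 3, 4, 5, 6]
```

In this model, `standard_tokens` plays the role of `content.text` and `english_tokens` plays the role of `content.text_english`.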

Regular Expressions (regex)

Click here for more information on Regular Expression Syntax

Ranges

You can use different brackets to denote specific ranges for date, numeric, and string fields.

Ranges Search	Examples
Use square brackets to specify inclusive ranges [min TO max]	`date:["2018-01-01 00:00:00.000" TO "2018-12-31 00:00:00.000"]` searches for all days in 2018. `count:[100 TO *]` searches for numbers from 100 upwards.
Use curly brackets to specify exclusive ranges {min TO max}	`tag:{delta TO sigma}` searches for tags between `delta` and `sigma`, excluding `delta` and `sigma`. `date:{* TO "2018-01-01 00:00:00.000"}` searches for all dates before 2018.
Combine curly and square brackets	`count:[1 TO 8}` searches for numbers from 1 up to but not including 8.
Range search with one side unbounded	`age:>30` searches for ages greater than 30. `age:>=30` searches for ages greater than or equal to 30.
Combine a lower bound and an upper bound, joined by the AND operator	`age:(>=10 AND <30)` or `age:(+>=10 +<30)`
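
The bracket semantics in the table can be summarized as a small membership check. This is a sketch for intuition only: the function `in_range` and its parameters are hypothetical, with `None` standing in for the `*` (unbounded) side.

```python
def in_range(value, low=None, high=None, include_low=True, include_high=True):
    """Return True if `value` falls in the range.

    Square brackets correspond to include_low / include_high = True,
    curly brackets to False, and `*` to a bound of None.
    """
    if low is not None:
        if value < low or (value == low and not include_low):
            return False
    if high is not None:
        if value > high or (value == high and not include_high):
            return False
    return True


if __name__ == "__main__":
    # count:[1 TO 8}  -> 1 is included, 8 is not
    print(in_range(1, 1, 8, include_high=False), in_range(8, 1, 8, include_high=False))
    # age:>30  -> lower bound only, exclusive
    print(in_range(30, low=30, include_low=False), in_range(31, low=30, include_low=False))
    # tag:{delta TO sigma}  -> string comparison, both ends excluded
    print(in_range("kappa", "delta", "sigma", False, False))
```

The same check works for numbers, strings, and dates, provided the values being compared are of the same type.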

Boost Operator

The boost operator (^) allows you to increase the relevance of a term or phrase in your search results. By default, all terms are given equal weight, but you can adjust the weight of specific terms to prioritize them in the search results.

Boost Operator can be used on	Examples
Individual Terms	`sugar^2 maple`
Phrases	`"tree work"^2`
Groups of Terms	`(sugar maple)^4`

Although the default boost value is 1, it can be any positive floating point number. Boosts between 0 and 1 reduce relevance.

Boolean Operators

Boolean operators are used to combine or exclude keywords in a search, allowing you to refine your search results.

Boolean operators include + (this term must be included) and - (this term must not be included), while all other terms are optional. For example, sugar maple +tree -work states that:

tree must be included
work must not be included
sugar and maple are optional; their inclusion increases relevance
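
The must / must not / optional behavior can be sketched with a toy matcher. It is only an illustration under simple assumptions: the function name `match` is hypothetical, terms are compared literally after lowercasing (no stemming), and the "score" is just a count of optional terms found rather than Canopy's real relevance score.

```python
def match(query: str, text: str):
    """Return None if `text` does not match `query`; otherwise return the
    number of optional terms found (a stand-in for relevance)."""
    tokens = set(text.lower().split())
    required, prohibited, optional = [], [], []
    for part in query.lower().split():
        if part.startswith("+"):
            required.append(part[1:])
        elif part.startswith("-"):
            prohibited.append(part[1:])
        else:
            optional.append(part)

    if any(term not in tokens for term in required):
        return None  # a required term is missing
    if any(term in tokens for term in prohibited):
        return None  # a prohibited term is present
    hits = sum(term in tokens for term in optional)
    if not required and optional and hits == 0:
        return None  # with no required terms, at least one optional term must match
    return hits


if __name__ == "__main__":
    query = "sugar maple +tree -work"
    print(match(query, "sugar maple tree"))  # 2
    print(match(query, "old tree"))          # 0
    print(match(query, "tree work"))         # None
```

A document containing only "tree" still matches, but ranks below one that also contains "sugar" or "maple".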

Users may also use Operators such as AND, OR and NOT (also written &&, || and ! respectively) to combine or exclude keywords in a search. Some important rules to note:

NOT takes precedence over AND, which takes precedence over OR.
+ and - only affect the term to the operator’s right. However, AND and OR affect the terms to the left and right.

For example, consider rewriting the query sugar maple +tree -work using AND, OR, and NOT:

sugar OR maple AND tree AND NOT work. This rewrite does not replicate the original query, because maple is now a required term.
(sugar OR maple) AND tree AND NOT work. This rewrite does not replicate the original query, because at least one of sugar or maple is now required, and those terms are scored differently than in the original query.
((sugar AND tree) OR (maple AND tree) OR tree) AND NOT work. This rewrite replicates the logic of the original query, but its relevance scoring will not match that of the original query.

The operators AND, NOT, and OR must be in upper case.

Grouping

You can group terms or phrases using parentheses () to form sub-queries (e.g., (sugar OR maple) AND tree).

Groups can be used to target a particular field or boost the results of a sub-query (e.g., piitag:(name OR phone) title:(full text search)^3).
Groups can be used to find a list of values in a field (e.g., id:(2FG2G55FGF OR 2FG2G55CGF OR 3FG2G55FGF), or alternatively id:(2FG2G55FGF 2FG2G55CGF 3FG2G55FGF)).
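
When a value list is long, the grouped form can be generated rather than typed by hand. The helper below is a hypothetical convenience, not part of Canopy: it simply joins values with OR inside a field group, and it assumes the values contain no spaces or reserved characters that would need quoting or escaping.

```python
from typing import Iterable


def any_of(field: str, values: Iterable[str]) -> str:
    """Build a grouped query matching any of `values` in `field`.

    Assumes values need no quoting or escaping.
    """
    return f"{field}:({' OR '.join(values)})"


if __name__ == "__main__":
    print(any_of("id", ["2FG2G55FGF", "2FG2G55CGF", "3FG2G55FGF"]))
    # id:(2FG2G55FGF OR 2FG2G55CGF OR 3FG2G55FGF)
```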