How We Index
To understand searching and the results you can expect, it helps to know how Canopy indexes text. Canopy runs text through the analysis process depicted below:
```mermaid
flowchart TD
    Text --> Analysis -- token --> Index
    subgraph Analysis
        direction TB
        Tokenizers --> token[Token Filters]
    end
```
Canopy’s grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as described in Unicode Standard Annex #29) provides a standard tokenizer that works well for the majority of languages.
The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.
Tokenizing the above text produces the following tokens:
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
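The tokenization above can be approximated with a short sketch. The real tokenizer implements the full UAX #29 segmentation rules; this regex-based version only mimics the behavior shown in the example (splitting on whitespace, hyphens, and trailing punctuation while keeping word-internal apostrophes):

```python
import re

def standard_tokenize(text):
    """Approximate grammar-based tokenization: keep runs of letters,
    digits, and word-internal apostrophes; split on everything else."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = standard_tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
# → ['The', '2', 'QUICK', 'Brown', 'Foxes', 'jumped', 'over',
#    'the', 'lazy', "dog's", 'bone']
```

Note that the hyphen in Brown-Foxes splits the word, the possessive in dog's is preserved at this stage, and the trailing period is discarded.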
Token filters accept a stream of tokens from a tokenizer and have the ability to modify tokens, delete tokens, or add tokens.
Canopy uses the following token filters when creating the index:
For chains of alphanumeric characters separated by non-alphanumeric delimiters, the filter generates catenated tokens.
For example,
super-duper-xl-500
[ super, superduperxl500, duper, xl, 500 ]
For chains of numeric characters separated by non-numeric delimiters, the filter generates catenated tokens.
For example,
01-02-03
[ 01, 010203, 02, 03 ]
For chains of alphabetical characters separated by non-alphabetic delimiters, the filter generates catenated tokens.
For example,
super-duper-xl
[ super, superduperxl, duper, xl ]
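All three catenation behaviors above can be sketched with one helper that splits a token on its delimiters and inserts the catenated form after the first sub-token. This is a minimal illustration, not Canopy’s implementation; the index distinguishes the alphanumeric, numeric, and alphabetic cases, while the sketch treats them uniformly:

```python
import re

def catenate(token):
    """Split a delimited token into sub-tokens and emit them along with
    a catenated form, positioned after the first sub-token."""
    parts = re.split(r"[^A-Za-z0-9]+", token)
    if len(parts) < 2:
        return [token]
    return [parts[0], "".join(parts)] + parts[1:]

catenate("super-duper-xl-500")  # → ['super', 'superduperxl500', 'duper', 'xl', '500']
catenate("01-02-03")            # → ['01', '010203', '02', '03']
catenate("super-duper-xl")      # → ['super', 'superduperxl', 'duper', 'xl']
```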
Canopy specifies the English stemmer.
The filter removes the English possessive (’s) from the end of each token.
For example,
O'Neil's
[ O, Neil ]
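The [ O, Neil ] result above combines two steps: removing the possessive, then splitting at the remaining apostrophe. A minimal sketch of that combination (the helper names remove_possessive and split_on_apostrophe are illustrative, not Canopy APIs):

```python
def remove_possessive(token):
    """Strip a trailing English possessive ('s) from a token."""
    return token[:-2] if token.endswith("'s") else token

def split_on_apostrophe(token):
    """Split a token at apostrophes, dropping empty pieces."""
    return [part for part in token.split("'") if part]

split_on_apostrophe(remove_possessive("O'Neil's"))
# → ['O', 'Neil']
```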
Canopy uses the following English stop words when indexing:
| Letter | Words |
| --- | --- |
| A | “a”, “an”, “and”, “are”, “as”, “at” |
| B | “be”, “but”, “by” |
| F | “for” |
| I | “if”, “in”, “into”, “is”, “it” |
| N | “no”, “not” |
| O | “of”, “on”, “or” |
| S | “such” |
| T | “that”, “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to” |
| W | “was”, “will”, “with” |
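A stop-word filter simply drops any token found in the list above. A minimal sketch, using the stop words from the table:

```python
# The English stop words from the table above.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
    "such", "that", "the", "their", "then", "there", "these", "they",
    "this", "to", "was", "will", "with",
}

def remove_stop_words(tokens):
    """Drop any token whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

remove_stop_words(["the", "lazy", "dog", "is", "not", "fast"])
# → ['lazy', 'dog', 'fast']
```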
The filter changes token text to lowercase.
For example,
THE Lazy DoG
[ the, lazy, dog ]
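Lowercasing applies per token. A one-line sketch of the filter:

```python
def lowercase(tokens):
    """Lowercase every token in the stream."""
    return [t.lower() for t in tokens]

lowercase(["THE", "Lazy", "DoG"])  # → ['the', 'lazy', 'dog']
```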
The filter splits tokens at changes in letter case.
For example,
camelCase
[ camel, Case ]
The filter splits tokens at letter-number transitions.
For example,
j2se
[ j, 2, se ]
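Both splitting rules above can be expressed as zero-width boundaries: a lowercase-to-uppercase change, or any transition between a letter and a digit. A minimal sketch covering the camelCase and j2se examples (requires Python 3.7+, which allows splitting on empty matches):

```python
import re

# Boundaries: lowercase→uppercase, and any letter↔digit transition.
SPLIT_BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])"
)

def split_token(token):
    """Split a token at case changes and letter-number transitions."""
    return SPLIT_BOUNDARY.split(token)

split_token("camelCase")  # → ['camel', 'Case']
split_token("j2se")       # → ['j', '2', 'se']
```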