How We Index
To understand searching and the results you can expect, it helps to know how Canopy indexes text. Canopy runs text through the analysis process depicted below:
```mermaid
flowchart TD
    Text --> Analysis -- token --> Index
    subgraph Analysis
        direction TB
        Tokenizers --> token[Token Filters]
    end
```
Canopy’s grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as described in Unicode Standard Annex #29) provides a standard tokenizer that works well for the majority of languages.
The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.
Tokenizing the above text produces the following tokens:
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
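The tokenization above can be approximated with a short sketch. The real tokenizer implements the full UAX #29 segmentation rules; this regex-based version only mimics the behavior shown in the example (splitting on whitespace, hyphens, and trailing punctuation while keeping word-internal apostrophes):

```python
import re

def standard_tokenize(text):
    """Approximate grammar-based tokenization: keep runs of letters,
    digits, and word-internal apostrophes; split on everything else."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = standard_tokenize("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
# → ['The', '2', 'QUICK', 'Brown', 'Foxes', 'jumped', 'over',
#    'the', 'lazy', "dog's", 'bone']
```

Note that the hyphen in Brown-Foxes splits the word, the possessive in dog's is preserved at this stage, and the trailing period is discarded.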
Token filters accept a stream of tokens from a tokenizer and have the ability to modify tokens, delete tokens, or add tokens.
Canopy uses the following token filters when creating the index:
For chains of alphanumeric characters separated by non-alphanumeric delimiters, the filter generates catenated tokens.
For example,
super-duper-xl-500
[ super, superduperxl500, duper, xl, 500 ]
For chains of numeric characters separated by non-numeric delimiters, the filter generates catenated tokens.
For example,
01-02-03
[ 01, 010203, 02, 03 ]
For chains of alphabetical characters separated by non-alphabetic delimiters, the filter generates catenated tokens.
For example,
super-duper-xl
[ super, superduperxl, duper, xl ]
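All three catenation behaviors above can be sketched with one helper that splits a token on its delimiters and inserts the catenated form after the first sub-token. This is a minimal illustration, not Canopy’s implementation; the index distinguishes the alphanumeric, numeric, and alphabetic cases, while the sketch treats them uniformly:

```python
import re

def catenate(token):
    """Split a delimited token into sub-tokens and emit them along with
    a catenated form, positioned after the first sub-token."""
    parts = re.split(r"[^A-Za-z0-9]+", token)
    if len(parts) < 2:
        return [token]
    return [parts[0], "".join(parts)] + parts[1:]

catenate("super-duper-xl-500")  # → ['super', 'superduperxl500', 'duper', 'xl', '500']
catenate("01-02-03")            # → ['01', '010203', '02', '03']
catenate("super-duper-xl")      # → ['super', 'superduperxl', 'duper', 'xl']
```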
Canopy specifies the English stemmer.
The filter removes the English possessive (’s) from the end of each token.
For example,
O'Neil's
[ O, Neil ]
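The [ O, Neil ] result above combines two steps: removing the possessive, then splitting at the remaining apostrophe. A minimal sketch of that combination (the helper names remove_possessive and split_on_apostrophe are illustrative, not Canopy APIs):

```python
def remove_possessive(token):
    """Strip a trailing English possessive ('s) from a token."""
    return token[:-2] if token.endswith("'s") else token

def split_on_apostrophe(token):
    """Split a token at apostrophes, dropping empty pieces."""
    return [part for part in token.split("'") if part]

split_on_apostrophe(remove_possessive("O'Neil's"))
# → ['O', 'Neil']
```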
Canopy uses the following English stop words when indexing:
| Letter | Words |
| --- | --- |
| A | “a”, “an”, “and”, “are”, “as”, “at” |
| B | “be”, “but”, “by” |
| F | “for” |
| I | “if”, “in”, “into”, “is”, “it” |
| N | “no”, “not” |
| O | “of”, “on”, “or” |
| S | “such” |
| T | “that”, “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to” |
| W | “was”, “will”, “with” |
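A stop-word filter simply drops any token found in the list above. A minimal sketch, using the stop words from the table:

```python
# The English stop words from the table above.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
    "such", "that", "the", "their", "then", "there", "these", "they",
    "this", "to", "was", "will", "with",
}

def remove_stop_words(tokens):
    """Drop any token whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

remove_stop_words(["the", "lazy", "dog", "is", "not", "fast"])
# → ['lazy', 'dog', 'fast']
```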
The filter changes token text to lowercase.
For example,
THE Lazy DoG
[ the, lazy, dog ]
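Lowercasing applies per token. A one-line sketch of the filter:

```python
def lowercase(tokens):
    """Lowercase every token in the stream."""
    return [t.lower() for t in tokens]

lowercase(["THE", "Lazy", "DoG"])  # → ['the', 'lazy', 'dog']
```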
The filter splits tokens at changes in letter case.
For example,
camelCase
[ camel, Case ]
The filter splits tokens at letter-number transitions.
For example,
j2se
[ j, 2, se ]
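Both splitting rules above can be expressed as zero-width boundaries: a lowercase-to-uppercase change, or any transition between a letter and a digit. A minimal sketch covering the camelCase and j2se examples (requires Python 3.7+, which allows splitting on empty matches):

```python
import re

# Boundaries: lowercase→uppercase, and any letter↔digit transition.
SPLIT_BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])"
)

def split_token(token):
    """Split a token at case changes and letter-number transitions."""
    return SPLIT_BOUNDARY.split(token)

split_token("camelCase")  # → ['camel', 'Case']
split_token("j2se")       # → ['j', '2', 'se']
```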