Regular Expressions

A regular expression is a way to match patterns in data using placeholder characters, called operators.

Canopy uses two different types of regular expressions: Apache Lucene’s RegEx syntax in the search bar, and Python’s RegEx syntax for custom detection rules.

Testing an Elasticsearch RegEx Search Pattern

You can use Elasticsearch regular expressions to search for specific types of documents in the application. To help validate your search pattern or query, you can use general-purpose regex testing websites such as regex101.com or regexr.com.

For the closest approximation, choose the Java flavor.

When using these general-purpose RegEx testing websites, remember that Canopy uses Apache Lucene’s RegEx syntax. Lucene’s regex implementation behaves differently with respect to anchoring and has limited lookaround support.

These online testers give a general sense of your pattern’s structure; however, they may not perfectly reflect how the pattern behaves in Elasticsearch.

Regular Expression Syntax for Use in the Search Bar

Use Apache Lucene regular expression syntax to search with regular expressions.

To run a regular expression in the search bar, surround the expression with forward slashes (/).

For example, a simple regular expression to find strings that match a three-digit number:

/[0-9]{3}/

By default, the expression will be run against the extracted or OCR text found in the content.text field.

Regular expressions (RegEx) can also be run against the other compatible fields listed here.

For example, the following will find all documents with three-digit file names: short_name:/[0-9]{3}/
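
The slash-delimited form can also be assembled programmatically. The following Python sketch is illustrative only: build_regex_query is a hypothetical helper, not part of Canopy, and it assumes the pattern itself contains no forward slashes.

```python
from typing import Optional


def build_regex_query(pattern: str, field: Optional[str] = None) -> str:
    """Wrap a Lucene regex in forward slashes, optionally scoped to a field.

    Hypothetical helper for illustration; assumes `pattern` contains no '/'.
    """
    wrapped = f"/{pattern}/"
    return f"{field}:{wrapped}" if field else wrapped


print(build_regex_query("[0-9]{3}"))                # /[0-9]{3}/
print(build_regex_query("[0-9]{3}", "short_name"))  # short_name:/[0-9]{3}/
```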

RegEx Search Default (content) vs. Specific Field
You can run a regex search in the search bar with or without specifying a field. The content used for the pattern match is prepared differently in each case, so the same data can produce different results.

Regex patterns are always case-sensitive, but when a field is specified, the field's content is converted to lower case before matching. For example, take Rio de Janeiro:

Without specifying a field -> the case of the content is not changed.

/Rio de Janeiro/ -> matches

/rio de janeiro/ -> does not match

Rio de Janeiro -> non-regex search matches

rio de janeiro -> non-regex search matches

Specifying a field -> the contents of the field are all converted to lower case.

/Rio de Janeiro/ -> does not match

/rio de janeiro/ -> matches

Rio de Janeiro -> non-regex search matches

rio de janeiro -> non-regex search matches
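
The case behavior above can be modeled in a few lines of Python. This is a rough sketch under stated assumptions: Python's re module stands in for Lucene's engine, re.fullmatch imitates whole-term matching, and the hypothetical regex_matches function lowercases the value only when a field is specified. It is not Canopy's actual implementation.

```python
import re


def regex_matches(pattern: str, value: str, field_specified: bool) -> bool:
    """Model of the documented behavior (assumption, not Canopy's code)."""
    # With a field, the stored value is lowercased before matching.
    candidate = value.lower() if field_specified else value
    # The pattern itself is never case-folded.
    return re.fullmatch(pattern, candidate) is not None


value = "Rio de Janeiro"
print(regex_matches("Rio de Janeiro", value, field_specified=False))  # True
print(regex_matches("rio de janeiro", value, field_specified=False))  # False
print(regex_matches("Rio de Janeiro", value, field_specified=True))   # False
print(regex_matches("rio de janeiro", value, field_specified=True))   # True
```

In practice this suggests writing field-scoped patterns entirely in lower case.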

Tokenization also differs between the two cases. When a field is specified, chains of numeric characters separated by non-alphabetic delimiters are tokenized separately, for example, 01-02-03 → [01],[02],[03]. Without a field, the chain is left intact.

Without specifying a field -> leaves chains of numeric characters separated by non-alphabetic delimiters intact. Thus, the following applies when pattern searching:

/[0-9]{2}[0-9]{2}[0-9]{2}/ → matches on 01-02-03

/[0-9]{2}/ → does not match any of the individual strings 01, 02, or 03

Specifying a field -> numeric characters separated by non-alphabetic delimiters are separated into tokens for pattern searches:

name:/[0-9]{2}/ → matches on each of the strings 01, 02, and 03

name:/[0-9]{2}[0-9]{2}[0-9]{2}/ → does not match on file name 01-02-03
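
The field-scoped case can be approximated in Python as follows. This is a sketch, not the real analyzer: the tokenize function is a hypothetical stand-in that splits on any run of non-alphanumeric characters, and re.fullmatch imitates matching a pattern against each whole token.

```python
import re
from typing import List


def tokenize(value: str) -> List[str]:
    """Split on runs of non-alphanumeric characters (assumed tokenizer model)."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", value) if tok]


def field_regex_matches(pattern: str, value: str) -> List[str]:
    """Return the tokens of `value` that the pattern matches in full."""
    return [tok for tok in tokenize(value) if re.fullmatch(pattern, tok)]


print(tokenize("01-02-03"))                                          # ['01', '02', '03']
print(field_regex_matches(r"[0-9]{2}", "01-02-03"))                  # ['01', '02', '03']
print(field_regex_matches(r"[0-9]{2}[0-9]{2}[0-9]{2}", "01-02-03"))  # []
```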

Reserved characters

Lucene’s regular expression engine supports all Unicode characters. However, the following characters are reserved as operators:

. ? + * | { } [ ] ( ) " \

To use one of these characters literally, escape it with a preceding backslash or surround it with double quotes. For example:

\@ # renders as a literal '@'

\\ # renders as a literal '\'

john"@smith.com" # renders as 'john@smith.com'
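
When a literal search string comes from user input, the backslash form of escaping can be applied automatically. The helper below is a hypothetical sketch: escape_lucene_regex is not a Canopy or Lucene function, and its RESERVED set covers only the characters listed above. If optional operators are enabled, their characters (# @ & < > ~) would need to be added to the set.

```python
# Reserved operator characters listed above.
RESERVED = set('.?+*|{}[]()"\\')


def escape_lucene_regex(text: str) -> str:
    """Backslash-escape every reserved character so `text` matches literally."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


print(escape_lucene_regex("a.b"))       # a\.b
print(escape_lucene_regex("(1+1)"))     # \(1\+1\)
print(escape_lucene_regex("C:\\temp"))  # C:\\temp
```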

Character Classes

Lucene regex allows you to specify a set or range of characters in ascending order as they are defined in Unicode. Ranges of potential characters may be represented as character classes by enclosing them in square brackets []. Here are the common character ranges that can be used in Lucene regex:

/[abc]/ # Matches any single character that is either a, b, or c.

/[a-z]/ # Matches any single character between a and z (inclusive), i.e., lowercase English letters.

/[A-Z]/ # Matches any single character between A and Z (inclusive), i.e., uppercase English letters.

/[0-9]/ # Matches any single digit from 0 to 9.

/[a-zA-Z0-9]/ # Matches any single character that is either a lowercase or uppercase letter or a digit.

Also, be aware that the character range [A-z] (with a lowercase z) is not ideal, because the characters [, \, ], ^, _, and ` are defined between Z and a in Unicode and would therefore also match. Instead, use the character range [a-zA-Z] to match only the English letters in both upper and lower case.
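
The gap between Z and a can be checked directly from the code points. The short Python sketch below lists the characters that fall in that gap and shows that Python's own regex engine, which also orders ranges by code point, lets [A-z] match an underscore. Python is used here only as a convenient way to inspect the ordering.

```python
import re

# Code points strictly between 'Z' (U+005A) and 'a' (U+0061).
between = [chr(cp) for cp in range(ord("Z") + 1, ord("a"))]
print(between)  # ['[', '\\', ']', '^', '_', '`']

# [A-z] accepts the underscore; [a-zA-Z] does not.
print(re.fullmatch(r"[A-z]", "_") is not None)     # True
print(re.fullmatch(r"[a-zA-Z]", "_") is not None)  # False
```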

A leading ^ negates the character class.

/[^abc]/ # Matches any single character that is not a, b, or c.

Standard operators

Lucene’s regular expression engine does not use the Perl Compatible Regular Expressions (PCRE) library, but it does support the following standard operators.

. – Matches any character. For example:

ab. # matches 'aba', 'abb', 'abz', etc.

? – Repeat the preceding character zero or one times. Often used to make the preceding character optional. For example:

abc? # matches 'ab' and 'abc'

+ – Repeat the preceding character one or more times. For example:

ab+ # matches 'ab', 'abb', 'abbb', etc.

* – Repeat the preceding character zero or more times. For example:

ab* # matches 'a', 'ab', 'abb', 'abbb', etc.

{} – Minimum and maximum number of times the preceding character can repeat. For example:

a{2} # matches 'aa'

a{2,4} # matches 'aa', 'aaa', and 'aaaa'

a{2,} # matches 'a' repeated two or more times

| – OR operator. The match will succeed if the longest pattern on either the left side OR the right side matches. For example:

abc|xyz # matches 'abc' and 'xyz'

( … ) – Forms a group. You can use a group to treat part of the expression as a single character. For example:

abc(def)? # matches 'abc' and 'abcdef' but not 'abcd'

[ … ] – Match one of the characters in the brackets. For example:

[abc] # matches 'a', 'b', 'c'

Inside the brackets, - indicates a range unless - is the first character or escaped. For example:

[a-c] # matches 'a', 'b', or 'c'

[-abc] # '-' is first character. Matches '-', 'a', 'b', or 'c'

[abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'

A ^ before a character in the brackets negates the character or range. For example:

[^abc] # matches any character except 'a', 'b', or 'c'

[^a-c] # matches any character except 'a', 'b', or 'c'

[^-abc] # matches any character except '-', 'a', 'b', or 'c'

[^abc\-] # matches any character except 'a', 'b', 'c', or '-'
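
The standard operators above behave the same way in Python's re module, so a quick local check is possible before running a search. The sketch below assumes that re.fullmatch is a fair stand-in for Lucene's whole-term matching; it covers only these shared operators and says nothing about Lucene's optional operators.

```python
import re

# (pattern, candidate term, expected whole-term match)
cases = [
    ("ab.", "abz", True),
    ("abc?", "ab", True),
    ("abc?", "abcc", False),
    ("ab+", "a", False),
    ("ab*", "a", True),
    ("a{2,4}", "aaa", True),
    ("a{2,4}", "aaaaa", False),
    ("abc|xyz", "xyz", True),
    ("abc(def)?", "abcdef", True),
    ("abc(def)?", "abcd", False),
    ("[^a-c]", "d", True),
    ("[abc\\-]", "-", True),
]

for pattern, term, expected in cases:
    matched = re.fullmatch(pattern, term) is not None
    assert matched == expected, (pattern, term)
    print(f"{pattern!r:14} {term!r:10} -> {matched}")
```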

Optional operators

You can use the flags parameter to enable optional operators for Lucene’s regular expression engine.

To enable multiple operators, use a | separator. For example, a flags value of COMPLEMENT|INTERVAL enables the COMPLEMENT and INTERVAL operators.
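
For orientation, in Elasticsearch's query DSL the flags option is set on a regexp query. The Python sketch below builds such a request body as a dictionary. It assumes direct access to an Elasticsearch-style regexp query, which Canopy's search bar may not expose; the field name content.text is the default field mentioned earlier, and the pattern is only a placeholder.

```python
import json

# Request body for an Elasticsearch `regexp` query with two optional operators enabled.
query_body = {
    "query": {
        "regexp": {
            "content.text": {
                "value": "foo<1-100>",
                "flags": "COMPLEMENT|INTERVAL",
            }
        }
    }
}

print(json.dumps(query_body, indent=2))
```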

Valid values

ALL (Default)

Enables all optional operators.

"" (empty string)

Alias for the ALL value.

COMPLEMENT

Enables the ~ operator. You can use ~ to negate the shortest following pattern. For example:

a~bc # matches 'adc' and 'aec' but not 'abc'

EMPTY

Enables the # (empty language) operator. The # operator doesn’t match any string, not even an empty string.

If you create regular expressions by programmatically combining values, you can pass # to specify “no string.” This lets you avoid accidentally matching empty strings or other unwanted strings. For example:

#|abc # matches 'abc' but nothing else, not even an empty string

INTERVAL

Enables the <> operators. You can use <> to match a numeric range. For example:

foo<1-100> # matches 'foo1', 'foo2' ... 'foo99', 'foo100'

foo<01-100> # matches 'foo01', 'foo02' ... 'foo99', 'foo100'

INTERSECTION

Enables the & operator, which acts as an AND operator. The match will succeed if the patterns on both the left side AND the right side match. For example:

aaa.+&.+bbb # matches 'aaabbb'

ANYSTRING

Enables the @ operator. You can use @ to match any entire string.

You can combine the @ operator with & and ~ operators to create an “everything except” logic. For example:

@&~(abc.+) # matches everything except terms beginning with 'abc'

NONE

Disables all optional operators.
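
Python's re module has no &, ~, or @ operators, but the INTERSECTION and ANYSTRING examples above can be reasoned about with ordinary boolean logic over whole-term matches. The functions below are hypothetical models written for illustration, not Lucene APIs.

```python
import re


def full(pattern: str, term: str) -> bool:
    """Whole-term match, standing in for Lucene's implicit anchoring."""
    return re.fullmatch(pattern, term) is not None


def intersection(left: str, right: str, term: str) -> bool:
    """Model of left&right: both sides must match the whole term."""
    return full(left, term) and full(right, term)


def everything_except(excluded: str, term: str) -> bool:
    """Model of @&~(excluded): any term that `excluded` does not match."""
    return not full(excluded, term)


print(intersection("aaa.+", ".+bbb", "aaabbb"))  # True
print(intersection("aaa.+", ".+bbb", "aaaccc"))  # False
print(everything_except("abc.+", "abcdef"))      # False
print(everything_except("abc.+", "xyz"))         # True
print(everything_except("abc.+", "abc"))         # True: abc.+ needs a character after 'abc'
```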

Unsupported operators

Lucene’s regular expression engine does not support anchor operators, such as ^ (beginning of line) or $ (end of line). To match a term, the regular expression must match the entire string.
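
The practical effect of whole-term matching is easiest to see by comparison with an unanchored search. The Python sketch below uses re.search for the unanchored case and re.fullmatch as an assumed approximation of Lucene's behavior; the whole_term helper is hypothetical.

```python
import re


def whole_term(pattern: str, term: str) -> bool:
    """Approximate Lucene: the pattern must consume the entire term."""
    return re.fullmatch(pattern, term) is not None


term = "abc1234"

# An unanchored search finds three digits somewhere inside the term...
print(re.search(r"[0-9]{3}", term) is not None)  # True
# ...but a whole-term match does not succeed.
print(whole_term(r"[0-9]{3}", term))             # False
# Allowing arbitrary text on both sides makes the whole term match.
print(whole_term(r".*[0-9]{3}.*", term))         # True
```

To find a pattern anywhere inside a term, surround it with .* on both sides rather than relying on anchors.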