Custom Detection Rules

Getting Started

You can create custom detection rules when the data mining team needs to augment Canopy’s existing detection. Custom detection rules are written as Python regular expressions (regex).

The regex (Python format) used when building custom detection rules is different from the regex (Lucene format) used to search the document list.

Useful Sites to Get Started with Regex
Use any of these guides as a quick reference for regular expressions. These third-party sites are not owned by Canopy, so do not enter any customer data into them when testing.

Python Regex Cheatsheet

Regex 101

Python Word Distance Guide

Python Regex Tester

Detecting PII with Custom Detection

To create custom detection rules:

Download the sample .csv file
Edit the .csv file and define your rules
Upload your .csv file with custom detection rules
Upload & process data or re-run PII detection

Download Sample Rules

Sample Low Confidence Rules

The following are examples of over-inclusive rules that you can use in sampling to find detections that the standard detection methods missed. As the name implies, these low confidence rules are expected to return many more false positives than the standard detection methods.

Download low confidence detection rules .csv.

RuleName	TagValue	QueryType	Query
US_SSN_VLC	US_SSN_VLC	regex	"\b[0-9]{3}[-][0-9]{2}[-][0-9]{4}\b"
US_SSN_LC	US_SSN_LC	regex	"\b(?!123456789)(?!000)(?!666)(?!9)\d{3}[- ]?\d{2}[- ]?\d{4}\b"
US_Passport_LC	US_Passport_LC	regex	"\b(?!123456789)\d{9}|[AXYZ]\d{8}\b"
CA_Passport_LC	CA_Passport_LC	regex	"\b[A-Z]{2}(?!([0-9])\1{5,6})[0-9]{6,7}\b"
CA_SIN_LC	CA_SIN_LC	regex	"\b(?!0)(?!8)(?!123456789)\d{3}[- ]?\d{3}[- ]?\d{3}\b"
UK_Passport_LC	UK_Passport_LC	regex	"\b(?!([0-9])\1{8})(?!123456789)\d{9}\b"
AU_Passport_LC	AU_Passport_LC	regex	"\b[AC-FNUX](?!([0-9])\1{6})\d{7}|P[A-FUWXZ](?!([0-9])\1{6})\d{7}\b"
AU_TFN_LC	AU_TFN_LC	regex	"\b(\d{8}|\d{3}[\s-]?\d{3}[\s-]?\d{3})\b"
NZ_Passport_LC	NZ_Passport_LC	regex	"\b(LA|LD|LF|N|EA|LH|EP)(?!([0-9])\1{5})\d{6}\b"
NZ_IRD_LC	NZ_IRD_LC	regex	"\b\d{2}-?\d{3}-?\d{3,4}\b"
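
One way to see how over-inclusive these rules are is to run them against made-up text before uploading them. The sketch below uses Python's standard re module with two of the rules from the table. The sample text and the find_all helper are hypothetical, and Canopy's own matching may differ from a plain re.finditer scan.

```python
import re

# Two sample rules from the table, copied without the surrounding quotation marks.
US_SSN_VLC = r"\b[0-9]{3}[-][0-9]{2}[-][0-9]{4}\b"
US_SSN_LC = r"\b(?!123456789)(?!000)(?!666)(?!9)\d{3}[- ]?\d{2}[- ]?\d{4}\b"

# Made-up sample text; never test with real customer data.
text = "IDs on file: 212-55-0199, 212550199, 900-12-3456, order 1234567890."


def find_all(pattern, text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in re.finditer(pattern, text)]


vlc_hits = find_all(US_SSN_VLC, text)
lc_hits = find_all(US_SSN_LC, text)

print(vlc_hits)  # ['212-55-0199', '900-12-3456']
print(lc_hits)   # ['212-55-0199', '212550199']
```

The very low confidence rule accepts any dashed 3-2-4 digit group, including one that starts with 9. The low confidence rule rejects that value but also accepts the undashed form.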

Sample Rules For Practice

You can practice using regex by downloading our sample rules.

Download the sample custom detection rule .csv.

RuleName	TagValue	QueryType	Query
PAT	PAT	regex	"[\d]{4}\w[\d]{3}"
PAT2	PAT2	regex	"[A-F]?[\d]{4}\w[\d]{3}"
PAT3	PAT3	regex	"[ABDEF]?[\d]{4}\w[\d]{3}"
PAT4	PAT4	regex	"[ABDEF]?[\d]{4,8}\w[\d]{3}"
PAT5	PAT5	regex	"[ABDEF]?[\d]{4,8}\w[\d]{3}(-[1-9][\d]{3})?"
PAT6	PAT6	regex	"[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?"
PAT7	PAT7	regex	"\b[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?"
PAT8	TFN	regex	"\b(SSN|TFN)\b"

These sample custom detection rules were created using the examples in the tutorial below.

Define Your Rules

To define a custom rule, each of the following values is required:

RuleName: This field contains the short name to identify the rule.
TagValue: This field contains the name of the tag to apply when an element is detected. Tag values cannot contain spaces.
QueryType: This field defines the query type and must be regex.
Query: This field contains the regular expression.

Regular Expressions Must Be Surrounded by Double Quotation Marks ("")
Open the .csv file in a plain text editor and confirm that double quotation marks surround each regular expression. Many spreadsheet applications strip or alter the quotation marks when they open and save a .csv file.
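
The requirements above can also be checked with a short script before uploading. The following is a minimal sketch, not part of Canopy: check_rule_line is a hypothetical helper, and it assumes the file is comma-delimited with the fields in the order RuleName, TagValue, QueryType, Query, and that only the Query field is quoted.

```python
import re


def check_rule_line(line):
    """Return a list of problems found in one raw rule line (empty if none)."""
    # Split on the first three commas only, so commas inside the regex
    # (for example in {4,8}) stay in the Query field.
    parts = line.rstrip("\r\n").split(",", 3)
    if len(parts) != 4:
        return ["expected 4 fields: RuleName, TagValue, QueryType, Query"]
    rule_name, tag_value, query_type, query = parts

    problems = []
    if not rule_name:
        problems.append("RuleName is empty")
    if not tag_value or " " in tag_value:
        problems.append("TagValue is empty or contains spaces")
    if query_type != "regex":
        problems.append("QueryType must be regex")
    if len(query) < 2 or not (query.startswith('"') and query.endswith('"')):
        problems.append("Query is not surrounded by double quotation marks")
    else:
        try:
            re.compile(query[1:-1])  # check that the regex itself is valid
        except re.error as exc:
            problems.append(f"Query does not compile: {exc}")
    return problems


# Made-up rule lines: the second one lost its quotation marks.
print(check_rule_line(r'PAT4,PAT4,regex,"[ABDEF]?[\d]{4,8}\w[\d]{3}"'))
print(check_rule_line(r'PAT4,PAT4,regex,[ABDEF]?[\d]{4,8}\w[\d]{3}'))
```

Reading the file as raw text lines, rather than through a CSV parser, keeps the quotation marks visible so the check can see whether they are still there.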

Regex Tutorial

You can refer to this tutorial, as well as the table and websites listed above, to practice your regex skills. Here are some common patterns and the regular expressions used to detect them:

PAT can be detected with [\d]{4}\w[\d]{3}: PAT is 4 digits (0-9), followed by a single letter, followed by another 3 digits (0-9). Note that \w matches any word character (a letter of either case, a digit, or an underscore), so it is broader than a lowercase letter; use [a-z] to accept lowercase letters only.
PAT2 can be detected with [A-F]?[\d]{4}\w[\d]{3}: PAT2 can now start with an uppercase letter between A and F. The letter is optional, so the older values from PAT are still valid.
PAT3 can be detected with [ABDEF]?[\d]{4}\w[\d]{3}: PAT3 allows any uppercase letter between A and F except C as the leading letter. A PAT2 value that starts with C is no longer valid as a whole, although the part after the C can still match because the leading letter is optional.
PAT4 can be detected with [ABDEF]?[\d]{4,8}\w[\d]{3}: PAT4 can have between 4 and 8 digits between the optional leading letter and the middle letter. PAT and PAT3 are still valid.
PAT5 can be detected with [ABDEF]?[\d]{4,8}\w[\d]{3}(-[1-9][\d]{3})?: PAT5 can end with a “-” followed by a 4-digit code whose first digit is 1-9 (not 0). The 4-digit code is optional, so PAT, PAT3, and PAT4 are still valid.
PAT6 can be detected with [ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?: PAT6 allows a single space or dash after the leading letter and before the middle letter. PAT, PAT3, PAT4, and PAT5 are still valid.
PAT7 can be detected with \b[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?: PAT7 adds a leading word boundary (\b), so a match must start at the beginning of a word rather than in the middle of a longer sequence of letters, digits, or underscores. There is no trailing \b, so other characters may still follow the match. PAT, PAT3, PAT4, PAT5, and PAT6 are still valid.
PAT8 can be detected with \b(SSN|TFN)\b: PAT8 detects whether the word “SSN” or “TFN” is present in a document (case-sensitive).

Here are some examples based on these patterns:
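
The patterns above can be tried locally with Python's standard re module. This is a minimal sketch: the sample strings are made up, first_match is a hypothetical helper, and re.search is used on the assumption that detection looks for the pattern anywhere in the text.

```python
import re

# The eight tutorial patterns, copied without the surrounding quotation marks.
PATTERNS = {
    "PAT": r"[\d]{4}\w[\d]{3}",
    "PAT2": r"[A-F]?[\d]{4}\w[\d]{3}",
    "PAT3": r"[ABDEF]?[\d]{4}\w[\d]{3}",
    "PAT4": r"[ABDEF]?[\d]{4,8}\w[\d]{3}",
    "PAT5": r"[ABDEF]?[\d]{4,8}\w[\d]{3}(-[1-9][\d]{3})?",
    "PAT6": r"[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?",
    "PAT7": r"\b[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?",
    "PAT8": r"\b(SSN|TFN)\b",
}


def first_match(rule, text):
    """Return the first substring matched by the named rule, or None."""
    m = re.search(PATTERNS[rule], text)
    return m.group(0) if m else None


print(first_match("PAT", "1234a567"))         # 1234a567
print(first_match("PAT", "12345678"))         # 12345678 (\w also matches a digit)
print(first_match("PAT2", "C1234a567"))       # C1234a567
print(first_match("PAT3", "C1234a567"))       # 1234a567 (the C is left out)
print(first_match("PAT4", "B12345678a567"))   # B12345678a567
print(first_match("PAT5", "B1234a567-1234"))  # B1234a567-1234
print(first_match("PAT5", "B1234a567-0234"))  # B1234a567 (code cannot start with 0)
print(first_match("PAT6", "B 1234-a567"))     # B 1234-a567
print(first_match("PAT6", "XB1234a567"))      # B1234a567
print(first_match("PAT7", "XB1234a567"))      # None (no word boundary inside XB...)
print(first_match("PAT8", "TFN on file"))     # TFN
print(first_match("PAT8", "tfn on file"))     # None (case-sensitive)
```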

Upload and Process Your Rules

To upload and process with custom detection, click the gear icon at the top right of the screen and select Templates and Layouts from the Project Settings menu. Then click Manage Processing Templates in the Processing Templates window.

Check the box for Use My Custom Rules under the PII Detection Options section. Click the Custom Rules icon to upload your custom rules.

You can upload a new rules list or select one you have previously created and uploaded.

You can save your custom rules templates and make them available to other projects. You can also update a saved template or save a custom rules template as your default setting.

Exception Handling

When your team uploads a list of custom detection rules, the application will validate the tag names against the list of custom tags.

Duplicate System Tag: If a system tag with the exact same name already exists, the app highlights the rules that failed validation. The system produces an error message instructing your team to rename the tags that duplicate a system tag and re-upload the custom detection rules:

Duplicate system tag hover message
Duplicate Custom Tag: If more than one of your team's custom rules uses the same custom tag, the combined rules act as if they were separated by OR operators and are flagged with a warning:

Duplicate custom tag hover message
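
To illustrate what "separated by OR operators" means for duplicate custom tags, the sketch below joins the queries of rules that share a tag value with the regex alternation operator (|). It only illustrates the described behavior and does not reflect how Canopy implements it. The rule names, tag values, and the combine_by_tag helper are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical rules as (RuleName, TagValue, Query); two share the tag TAX_ID.
rules = [
    ("TaxID_A", "TAX_ID", r"\bTFN\b"),
    ("TaxID_B", "TAX_ID", r"\bSSN\b"),
    ("Code", "PAT", r"[\d]{4}\w[\d]{3}"),
]


def combine_by_tag(rules):
    """Join the queries of rules that share a tag into one OR-ed pattern."""
    grouped = defaultdict(list)
    for _name, tag, query in rules:
        grouped[tag].append(query)
    # Wrap each query in a non-capturing group so that | separates whole rules.
    return {
        tag: "|".join(f"(?:{q})" for q in queries)
        for tag, queries in grouped.items()
    }


combined = combine_by_tag(rules)
print(combined["TAX_ID"])  # (?:\bTFN\b)|(?:\bSSN\b)

# Text matching either rule is detected under the shared tag.
print(bool(re.search(combined["TAX_ID"], "SSN on file")))  # True
print(bool(re.search(combined["TAX_ID"], "TFN on file")))  # True
```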