Product Documentation

Custom Detection Rules

Getting Started

Users can create custom detection rules when the data mining team needs to augment Canopy’s existing detection. Custom detection rules are written in Python regular expression (regex) syntax.

The regex (Python format) used when building custom detection rules is different from the regex (Lucene format) used to search the document list.

Useful Sites to Get Started with Regex
Use any of these guides as a quick reference for regex. These third-party sites are not owned by Canopy, so do not enter any customer data into them when testing.

Python Regex Cheatsheet

Regex 101

Python Word Distance Guide

Python Regex Tester

Detecting PII with Custom Detection

To create custom detection rules:

  1. Download the sample .csv file
  2. Edit .csv and define your rules
  3. Upload your .csv with custom detection rules
  4. Upload & process data or re-run PII detection

Download Sample Rules

Sample Low Confidence Rules

The following are examples of overinclusive rules that you can use in sampling to find detections that the standard detection methods missed. As the name implies, these detection rules are expected to return many more false positives than the standard detection methods.

Download low confidence detection rules .csv.
RuleName TagValue QueryType Query
US_SSN_VLC US_SSN_VLC regex "\b[0-9]{3}[-][0-9]{2}[-][0-9]{4}\b"
US_SSN_LC US_SSN_LC regex "\b(?!123456789)(?!000)(?!666)(?!9)\d{3}[- ]?\d{2}[- ]?\d{4}\b"
US_Passport_LC US_Passport_LC regex "\b(?!123456789)\d{9}|[AXYZ]\d{8}\b"
CA_Passport_LC CA_Passport_LC regex "\b[A-Z]{2}(?!([0-9])\1{5,6})[0-9]{6,7}\b"
CA_SIN_LC CA_SIN_LC regex "\b(?!0)(?!8)(?!123456789)\d{3}[- ]?\d{3}[- ]?\d{3}\b"
UK_Passport_LC UK_Passport_LC regex "\b(?!([0-9])\1{8})(?!123456789)\d{9}\b"
AU_Passport_LC AU_Passport_LC regex "\b[AC-FNUX](?!([0-9])\1{6})\d{7}|P[A-FUWXZ](?!([0-9])\1{6})\d{7}\b"
AU_TFN_LC AU_TFN_LC regex "\b(\d{8}|\d{3}[\s-]?\d{3}[\s-]?\d{3})\b"
NZ_Passport_LC NZ_Passport_LC regex "\b(LA|LD|LF|N|EA|LH|EP)(?!([0-9])\1{5})\d{6}\b"
NZ_IRD_LC NZ_IRD_LC regex "\b\d{2}-?\d{3}-?\d{3,4}\b"
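Before uploading, it can help to try a rule locally against made-up text. The following is a minimal sketch using Python's standard re module; the helper name find_detections and the sample text are illustrative assumptions, and the way Canopy applies rules internally may differ.

```python
import re

# Two of the sample low-confidence rules, copied from the table above
# (without the surrounding double quotation marks).
RULES = {
    "US_SSN_VLC": r"\b[0-9]{3}[-][0-9]{2}[-][0-9]{4}\b",
    "US_SSN_LC": r"\b(?!123456789)(?!000)(?!666)(?!9)\d{3}[- ]?\d{2}[- ]?\d{4}\b",
}


def find_detections(text):
    """Return {rule name: [matched strings]} for every rule with a match.

    Hypothetical helper for local experiments only.
    """
    results = {}
    for name, pattern in RULES.items():
        matches = [m.group(0) for m in re.finditer(pattern, text)]
        if matches:
            results[name] = matches
    return results


# Made-up sample text; never test with customer data.
sample = "Ref 078-05-1120 on file; order 900-12-3456 shipped; id 078051120."
detections = find_detections(sample)
for rule_name, found in detections.items():
    print(rule_name, found)
```

In this sample, US_SSN_VLC flags both dashed numbers, while US_SSN_LC skips the value starting with 9 but also picks up the undashed nine-digit value. Differences like these are worth reviewing when sampling.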

Sample Rules For Practice

You can practice using regex by downloading our sample rules.

Download the sample custom detection rule .csv.
RuleName TagValue QueryType Query
PAT PAT regex "[\d]{4}\w[\d]{3}"
PAT2 PAT2 regex "[A-F]?[\d]{4}\w[\d]{3}"
PAT3 PAT3 regex "[ABDEF]?[\d]{4}\w[\d]{3}"
PAT4 PAT4 regex "[ABDEF]?[\d]{4,8}\w[\d]{3}"
PAT5 PAT5 regex "[ABDEF]?[\d]{4,8}\w[\d]{3}(-[1-9][\d]{3})?"
PAT6 PAT6 regex "[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?"
PAT7 PAT7 regex "\b[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?"
PAT8 TFN regex "\b(SSN|TFN)\b"

These sample custom detection rules were created using the examples in the tutorial below.

Define Your Rules

To define a custom rule, each of the following values is required:

RuleName
This field contains the short name to identify the rule.
TagValue
This field contains the name of the tag to apply when an element is detected. Tag values cannot contain spaces.
QueryType
This field defines the query type and must be regex.
Query
This field contains the regular expression.
Regular Expressions Must Be Surrounded by Double Quotation Marks ("")
Open the .csv file in a plain text editor and confirm that double quotation marks surround the regular expression. Many spreadsheet applications strip or alter the quotes when opening and saving the .csv file.
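The checks described above can also be scripted. The sketch below is not part of Canopy; it assumes a comma-separated file whose header row is RuleName,TagValue,QueryType,Query and whose Query column is the last field on each line. The function name check_rules_csv is hypothetical. It inspects the raw text rather than parsing it with a CSV library, because a CSV parser would silently remove the quotation marks being checked.

```python
import re

# Assumed header row, based on the column names in the sample tables.
EXPECTED_HEADER = ["RuleName", "TagValue", "QueryType", "Query"]


def check_rules_csv(raw_text):
    """Return a list of problems found in the raw text of a rules .csv."""
    problems = []
    lines = [ln for ln in raw_text.splitlines() if ln.strip()]
    if not lines or lines[0].strip().split(",") != EXPECTED_HEADER:
        problems.append("row 1: unexpected header")
    for number, line in enumerate(lines[1:], start=2):
        # Split on the first three commas only: the regex may contain commas.
        parts = line.split(",", 3)
        if len(parts) != 4:
            problems.append(f"row {number}: expected 4 fields")
            continue
        rule_name, tag_value, query_type, query = (p.strip() for p in parts)
        if not rule_name:
            problems.append(f"row {number}: RuleName is empty")
        if " " in tag_value:
            problems.append(f"row {number}: TagValue contains a space")
        if query_type != "regex":
            problems.append(f"row {number}: QueryType must be regex")
        if len(query) < 2 or not (query.startswith('"') and query.endswith('"')):
            problems.append(f"row {number}: Query is not wrapped in double quotation marks")
            continue
        inner = query[1:-1]
        if '"' in inner:
            # Often a sign that a spreadsheet application re-quoted the field.
            problems.append(f"row {number}: Query contains extra quotation marks")
        try:
            re.compile(inner)
        except re.error as exc:
            problems.append(f"row {number}: regex does not compile ({exc})")
    return problems


good = "RuleName,TagValue,QueryType,Query\n" + r'PAT4,PAT4,regex,"[ABDEF]?[\d]{4,8}\w[\d]{3}"' + "\n"
bad = "RuleName,TagValue,QueryType,Query\n" + r"PAT,My Tag,regex,[\d]{4}\w[\d]{3}" + "\n"
good_problems = check_rules_csv(good)
bad_problems = check_rules_csv(bad)
print(good_problems)
print(bad_problems)
```

The second sample row is reported twice: once for the space in the tag value and once for the missing quotation marks.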

Regex Tutorial

You can refer to this tutorial, as well as the table and websites listed above, to practice your regex skills. Here are some common patterns and the regular expressions used to detect them:

PAT can be detected with [\d]{4}\w[\d]{3}
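Detect a pattern (PAT) as follows: 4 digits (0-9), followed by one word character, followed by another 3 digits (0-9). Note that \w matches any letter, digit, or underscore, so it is broader than a single lowercase letter (a-z); use [a-z] to require a lowercase letter.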
PAT2 can be detected with [A-F]?[\d]{4}\w[\d]{3}
PAT2 can now start with an optional uppercase character between A and F; the older values from PAT are still valid.
PAT3 can be detected with [ABDEF]?[\d]{4}\w[\d]{3}
PAT3 allows all uppercase characters from A to F except C. PAT2 values that start with C are no longer valid.
PAT4 can be detected with [ABDEF]?[\d]{4,8}\w[\d]{3}
PAT4 can have between 4 and 8 digits between the first and second letters, but PAT and PAT3 are still valid.
PAT5 can be detected as [ABDEF]?[\d]{4,8}\w[\d]{3}(-[1-9][\d]{3})?
PAT5 can end with a “-” followed by a 4-digit code whose first digit can be any digit except 0. The 4-digit code is optional, so PAT, PAT3, and PAT4 are still valid.
PAT6 can be detected as [ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?
PAT6 allows spaces or dashes between each group of letters and numbers. PAT, PAT3, PAT4 and PAT5 are still valid.
PAT7 can be detected as \b[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?
PAT7 adds a leading \b (word boundary), so PAT6 is detected only when it starts at a word boundary rather than in the middle of another sequence of letters or digits. PAT, PAT3, PAT4, PAT5 and PAT6 are still valid.
PAT8 can be detected as \b(SSN|TFN)\b
PAT8 allows us to detect whether the words “SSN” OR “TFN” are present in a document (case-sensitive).

Here are some examples based on these patterns:

Sample Detections
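The tutorial patterns can be tried locally with Python's re module. This is a minimal sketch with made-up sample values; the helper name matches_whole is hypothetical. It uses re.fullmatch to test whether an entire value fits a pattern, which is an assumption made for illustration and may differ from how the application searches document text.

```python
import re

# Patterns copied from the tutorial above.
PATTERNS = {
    "PAT": r"[\d]{4}\w[\d]{3}",
    "PAT2": r"[A-F]?[\d]{4}\w[\d]{3}",
    "PAT3": r"[ABDEF]?[\d]{4}\w[\d]{3}",
    "PAT4": r"[ABDEF]?[\d]{4,8}\w[\d]{3}",
    "PAT5": r"[ABDEF]?[\d]{4,8}\w[\d]{3}(-[1-9][\d]{3})?",
    "PAT6": r"[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?",
    "PAT7": r"\b[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?",
    "PAT8": r"\b(SSN|TFN)\b",
}


def matches_whole(rule, value):
    """True if the entire value fits the named pattern (hypothetical helper)."""
    return re.fullmatch(PATTERNS[rule], value) is not None


# (rule, made-up value) pairs to try against the whole value.
checks = [
    ("PAT", "1234a567"),
    ("PAT", "12345567"),          # \w also accepts a digit
    ("PAT2", "C1234a567"),
    ("PAT3", "C1234a567"),        # C is excluded from [ABDEF]
    ("PAT4", "B12345678a567"),
    ("PAT5", "B1234a567-1234"),
    ("PAT5", "B1234a567-0234"),   # the optional code cannot start with 0
    ("PAT6", "B 1234-a567-1234"),
]
results = {(rule, value): matches_whole(rule, value) for rule, value in checks}
for (rule, value), ok in results.items():
    print(f"{rule:5} {value!r:22} {ok}")

# Searching inside longer text: the leading \b in PAT7 rejects a match that
# starts in the middle of a word, while PAT6 still finds one.
pat6_hit = re.search(PATTERNS["PAT6"], "XB1234a567")
pat7_hit = re.search(PATTERNS["PAT7"], "XB1234a567")

# PAT8 is case-sensitive and requires word boundaries on both sides.
pat8_hits = [bool(re.search(PATTERNS["PAT8"], t)) for t in ("SSN: 123", "ssn: 123", "SSNs")]
```

Note that an unanchored search with PAT3 still finds "1234a567" inside "C1234a567"; only the leading C is left out of the match.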

Upload and Process Your Rules

To upload and process with custom detection, click the gear icon at the top right of the screen and select Templates and Layouts from the Project Settings menu. Then click Manage Processing Templates in the Processing Templates window:

img_13.png

Check the box for Use My Custom Rules under the PII Detection Options section. Click on the Custom Rules icon to upload your custom rules:

img_9.png

You can upload a new rules list or one you have previously created and uploaded:

img_4.png

You can save your custom rules templates and make them available to other projects. You can also update a saved template or save a custom rules template as your default setting:

img_22.png

Exception Handling

When your team uploads a list of custom detection rules, the application will validate the tag names against the list of custom tags.

Duplicate System Tag
If a system tag with the exact same name already exists, the app will highlight the rules that failed validation. The system will produce an error message instructing your team to rename the tags that duplicate a system tag and re-upload the custom detection rules:

Duplicate system tag hover message

Duplicate Custom Tag
If your team uses duplicate custom rules, the combined rules will act as if they were separated by OR operators and will be flagged with a warning:

Duplicate custom tag hover message
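To picture the OR behavior described above, the following sketch groups rules and joins their patterns with the regex alternation operator. It assumes that "duplicate" means rules sharing the same TagValue, and the function name combine_by_tag is hypothetical; it illustrates the described effect and is not Canopy's implementation.

```python
import re
from collections import defaultdict


def combine_by_tag(rules):
    """Group (tag_value, pattern) pairs by tag and OR their patterns together."""
    grouped = defaultdict(list)
    for tag_value, pattern in rules:
        grouped[tag_value].append(pattern)
    # Wrap each pattern in a non-capturing group so that "|" applies to the
    # whole pattern rather than to its last term.
    return {
        tag: re.compile("|".join(f"(?:{p})" for p in patterns))
        for tag, patterns in grouped.items()
    }


# Made-up rules: two share the tag TFN, one uses the tag PAT.
rules = [
    ("TFN", r"\bTFN\b"),
    ("TFN", r"\bSSN\b"),
    ("PAT", r"[\d]{4}\w[\d]{3}"),
]
combined = combine_by_tag(rules)
print(combined["TFN"].pattern)
```

One caveat with this kind of joining: patterns that use numbered backreferences such as \1 may no longer point at the intended group once several patterns are concatenated.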

Unicode Support for Custom Detection

Unicode characters are supported when creating custom detection rules:

Example Unicode custom detection rules
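As a local illustration, Python 3 regular expressions are Unicode-aware by default for text patterns. The rule below is made up for this sketch, and whether the application applies any additional flags (such as ASCII-only matching) is not stated in this guide, so treat the behavior shown as that of plain Python only.

```python
import re

# Hypothetical rule: the German word "Straße" followed by a house number.
street = re.compile(r"\bStraße\s+\d{1,4}\b")
street_hit = street.search("Musterweg, Straße 12")

# In plain Python 3, \d also matches non-ASCII decimal digits, for example
# the Arabic-Indic digits U+0661 to U+0663, unless re.ASCII is set.
arabic_indic = "\u0661\u0662\u0663"
unicode_digits = re.fullmatch(r"\d{3}", arabic_indic) is not None
ascii_only_digits = re.fullmatch(r"\d{3}", arabic_indic, flags=re.ASCII) is not None

# Likewise, \w matches letters outside a-z, such as "ß".
unicode_word_char = re.fullmatch(r"\w", "ß") is not None

print(street_hit.group(0), unicode_digits, ascii_only_digits, unicode_word_char)
```

If a rule should match only the digits 0-9, writing [0-9] instead of \d avoids depending on this behavior.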

Upload and Process with Custom Detection

While uploading data, you can load custom detection rules when configuring your processing settings:

Re-Running PII with Custom Detection

From the kebab menu on the Document List page, you can choose to re-run PII detection and select custom detection: