Custom Detection Rules
Users can create custom detection rules when the data mining team needs to augment Canopy’s existing detection. Custom detection rules are written in Python regular expression (regex) syntax.
Note that the Python regex syntax used to build custom detection rules differs from the Lucene regex syntax used to search the document list.
Useful Sites to Get Started with Regex
Use any of these guides as a quick reference for regex. These third-party sites are not owned by Canopy, so do not enter any customer data into them when testing.
To create custom detection rules:
- Download the sample .csv file
- Edit the .csv file and define your rules
- Upload your .csv file with the custom detection rules
- Upload & process data or re-run PII detection
The following are examples of over-inclusive rules that you can use in sampling to find detections that the standard detection methods missed. As the name implies, these detection rules are expected to return many more false positives than the standard detection methods.
RuleName | TagValue | QueryType | Query |
---|---|---|---|
US_SSN_VLC | US_SSN_VLC | regex | "\b[0-9]{3}[-][0-9]{2}[-][0-9]{4}\b" |
US_SSN_LC | US_SSN_LC | regex | "\b(?!123456789)(?!000)(?!666)(?!9)\d{3}[- ]?\d{2}[- ]?\d{4}\b" |
US_Passport_LC | US_Passport_LC | regex | "\b(?!123456789)\d{9}|[AXYZ]\d{8}\b" |
CA_Passport_LC | CA_Passport_LC | regex | "\b[A-Z]{2}(?!([0-9])\1{5,6})[0-9]{6,7}\b" |
CA_SIN_LC | CA_SIN_LC | regex | "\b(?!0)(?!8)(?!123456789)\d{3}[- ]?\d{3}[- ]?\d{3}\b" |
UK_Passport_LC | UK_Passport_LC | regex | "\b(?!([0-9])\1{8})(?!123456789)\d{9}\b" |
AU_Passport_LC | AU_Passport_LC | regex | "\b[AC-FNUX](?!([0-9])\1{6})\d{7}|P[A-FUWXZ](?!([0-9])\1{6})\d{7}\b" |
AU_TFN_LC | AU_TFN_LC | regex | "\b(\d{8}|\d{3}[\s-]?\d{3}[\s-]?\d{3})\b" |
NZ_Passport_LC | NZ_Passport_LC | regex | "\b(LA|LD|LF|N|EA|LH|EP)(?!([0-9])\1{5})\d{6}\b" |
NZ_IRD_LC | NZ_IRD_LC | regex | "\b\d{2}-?\d{3}-?\d{3,4}\b" |
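Before uploading rules like these, it can help to try them locally with Python's built-in re module. The following is a minimal sketch that compares the two US SSN rules from the table. The sample strings are made up for illustration, and the helper name matching_rules is hypothetical. How Canopy applies the rules internally (flags, tokenization) is not covered here, so treat the results as an approximation.

```python
import re

# Patterns copied from the US_SSN_VLC and US_SSN_LC rows of the table.
RULES = {
    "US_SSN_VLC": re.compile(r"\b[0-9]{3}[-][0-9]{2}[-][0-9]{4}\b"),
    "US_SSN_LC": re.compile(
        r"\b(?!123456789)(?!000)(?!666)(?!9)\d{3}[- ]?\d{2}[- ]?\d{4}\b"
    ),
}

# Made-up sample strings, not real identifiers.
SAMPLES = ["123-45-6789", "000-12-3456", "912-34-5678", "078 05 1120", "123456789"]


def matching_rules(text):
    """Return the names of the rules whose pattern is found anywhere in text."""
    return [name for name, pattern in RULES.items() if pattern.search(text)]


for sample in SAMPLES:
    print(f"{sample!r}: {matching_rules(sample)}")
```

In this sketch, US_SSN_VLC only requires the dashed 3-2-4 digit layout, while US_SSN_LC also accepts spaces or no separators but rejects values that start with 000, 666, or 9, as well as the placeholder 123456789.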
You can practice using regex by downloading our sample rules.
RuleName | TagValue | QueryType | Query |
---|---|---|---|
PAT | PAT | regex | "[\d]{4}\w[\d]{3}" |
PAT2 | PAT2 | regex | "[A-F]?[\d]{4}\w[\d]{3}" |
PAT3 | PAT3 | regex | "[ABDEF]?[\d]{4}\w[\d]{3}" |
PAT4 | PAT4 | regex | "[ABDEF]?[\d]{4,8}\w[\d]{3}" |
PAT5 | PAT5 | regex | "[ABDEF]?[\d]{4,8}\w[\d]{3}(-[1-9][\d]{3})?" |
PAT6 | PAT6 | regex | "[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?" |
PAT7 | PAT7 | regex | "\b[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?" |
PAT8 | TFN | regex | "\b(SSN|TFN)\b" |
These sample custom detection rules were created using the examples in the tutorial below.
To define a custom rule, each of the following values is required:
- RuleName
- This field contains the short name to identify the rule.
- TagValue
- This field contains the name of the tag to apply when an element is detected. Tag values cannot contain spaces.
- QueryType
- This field defines the query type and must be regex.
- Query
- This field contains the regular expression.
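The field requirements above can be checked before upload with a short script. This is a sketch only: it assumes the .csv file has a header row named RuleName, TagValue, QueryType, and Query (taken from the tables in this article, not from the sample file itself), and the function name validate_rule and the sample rows are hypothetical.

```python
import csv
import io
import re

REQUIRED_FIELDS = ["RuleName", "TagValue", "QueryType", "Query"]


def validate_rule(row):
    """Return a list of problems found in one rule row (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not (row.get(field) or "").strip():
            problems.append(f"{field} is missing")
    tag = (row.get("TagValue") or "").strip()
    if any(ch.isspace() for ch in tag):
        problems.append("TagValue must not contain spaces")
    query_type = (row.get("QueryType") or "").strip()
    if query_type and query_type != "regex":
        problems.append("QueryType must be regex")
    query = (row.get("Query") or "").strip()
    if query:
        try:
            re.compile(query)  # Checks Python regex syntax only.
        except re.error as exc:
            problems.append(f"Query does not compile: {exc}")
    return problems


# Hypothetical rules file content, used here instead of a real upload.
SAMPLE_CSV = r"""
RuleName,TagValue,QueryType,Query
PAT,PAT,regex,"[\d]{4}\w[\d]{3}"
BAD TAG,BAD TAG,regex,"\b(SSN|TFN)\b"
BROKEN,BROKEN,lucene,"[A-F"
""".lstrip()

for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    print(row["RuleName"], validate_rule(row))
```

A rule that passes this check is only known to be well-formed. Canopy's own validation on upload remains the authority.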
Regular Expressions Must Be Surrounded by Double Quotation Marks ("")
Confirm that double quotation marks are surrounding the regular expression using a plain text editor to open the .csv file. Many spreadsheet applications may strip or augment the quotes when opening and saving the .csv file.
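As an alternative to opening the file by eye, the raw text of each line can be checked with a few lines of Python. This sketch assumes the Query is the fourth column and that the first three columns contain no commas. The file name rules.csv and both function names are hypothetical.

```python
def query_is_quoted(raw_line):
    """True if the fourth column of a raw CSV line starts and ends with a straight double quote."""
    parts = raw_line.rstrip("\r\n").split(",", 3)  # Keep commas inside the regex intact.
    if len(parts) < 4:
        return False
    query = parts[3].strip()
    return len(query) >= 2 and query.startswith('"') and query.endswith('"')


def unquoted_rule_lines(path):
    """Return the 1-based line numbers of rules whose Query is not quoted."""
    with open(path, encoding="utf-8", newline="") as handle:
        lines = handle.read().splitlines()
    # Line 1 is assumed to be the header row, so rules start at line 2.
    return [
        number
        for number, line in enumerate(lines[1:], start=2)
        if line.strip() and not query_is_quoted(line)
    ]


# Example usage with a hypothetical file name:
# print(unquoted_rule_lines("rules.csv"))
```

Note that curly quotation marks (“ ”), which some editors substitute automatically, do not pass this check, because only the straight double quote character is accepted.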
You can refer to this tutorial, as well as the table and websites listed above, to practice your regex skills. Here are some common patterns and the regular expressions used to detect them:
- PAT can be detected with
[\d]{4}\w[\d]{3}
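- Detect a pattern (PAT) as follows: 4 digits (0-9), followed by a single word character, followed by another 3 digits (0-9). The intended character is a lowercase letter (a-z), but \w also matches uppercase letters, digits, and underscores.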
- PAT2 can be detected with
[A-F]?[\d]{4}\w[\d]{3}
- PAT2 will now start with an uppercase character between A and F, but the older numbers from PAT are still valid.
- PAT3 can be detected with
[ABDEF]?[\d]{4}\w[\d]{3}
- PAT3 allows any uppercase letter from A to F except C as the optional prefix. The C prefix accepted by PAT2 is no longer valid.
- PAT4 can be detected with
[ABDEF]?[\d]{4,8}\w[\d]{3}
- PAT4 can have between 4 and 8 digits between the first and second letters, but PAT and PAT3 are still valid.
- PAT5 can be detected as
[ABDEF]?[\d]{4,8}\w[\d]{3}(-[1-9][\d]{3})?
- PAT5 will have a “-” followed by a 4-digit code at the end, which can start with all digits except 0. The 4-digit code is optional and PAT, PAT3, and PAT4 are still valid.
- PAT6 can be detected as
[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?
- PAT6 allows spaces or dashes between each group of letters and numbers. PAT, PAT3, PAT4 and PAT5 are still valid.
- PAT7 can be detected as
\b[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?
- PAT7 allows us to detect PAT6 if it exists by itself and is not part of any other sequence of characters. PAT, PAT3, PAT4, PAT5 and PAT6 are still valid.
- PAT8 can be detected as
\b(SSN|TFN)\b
- PAT8 allows us to detect whether the words “SSN” OR “TFN” are present in a document (case-sensitive).
Here are some examples based on these patterns:
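The patterns in this tutorial can be tried out locally with Python's re module. The sketch below copies the eight sample patterns and searches a few made-up strings. The helper name first_match is hypothetical, and the search is a plain re.search with no flags, which may differ from how Canopy runs the rules.

```python
import re

# Patterns copied from the sample rules table.
PATTERNS = {
    "PAT": r"[\d]{4}\w[\d]{3}",
    "PAT2": r"[A-F]?[\d]{4}\w[\d]{3}",
    "PAT3": r"[ABDEF]?[\d]{4}\w[\d]{3}",
    "PAT4": r"[ABDEF]?[\d]{4,8}\w[\d]{3}",
    "PAT5": r"[ABDEF]?[\d]{4,8}\w[\d]{3}(-[1-9][\d]{3})?",
    "PAT6": r"[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?",
    "PAT7": r"\b[ABDEF]?[ -]?[\d]{4,8}[- ]?\w[\d]{3}(-[1-9][\d]{3})?",
    "PAT8": r"\b(SSN|TFN)\b",
}


def first_match(rule, text):
    """Return the first substring of text matched by the named rule, or None."""
    match = re.search(PATTERNS[rule], text)
    return match.group(0) if match else None


print(first_match("PAT", "1234a567"))         # 1234a567
print(first_match("PAT", "12345567"))         # 12345567 (\w also matches a digit)
print(first_match("PAT2", "C1234a567"))       # C1234a567
print(first_match("PAT3", "C1234a567"))       # 1234a567 (the C prefix is not part of the match)
print(first_match("PAT5", "A1234b567-1234"))  # A1234b567-1234
print(first_match("PAT5", "A1234b567-0123"))  # A1234b567 (suffix cannot start with 0)
print(first_match("PAT6", "A 1234-b567"))     # A 1234-b567
print(first_match("PAT6", "XA1234b567"))      # A1234b567 (matched inside a longer token)
print(first_match("PAT7", "XA1234b567"))      # None (\b requires a word boundary at the start)
print(first_match("PAT8", "SSN on file"))     # SSN
print(first_match("PAT8", "ssn on file"))     # None (case-sensitive)
```

Because these patterns are not anchored at both ends, a rule can match part of a longer string. PAT7 adds \b only at the start, so trailing characters after the match are still allowed.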
To upload and process with custom detection, click the gear icon at the top right of the screen and select Templates and Layouts from the Project Settings menu. Click Manage Processing Templates in the Processing Templates window.
Check the box for Use My Custom Rules under the PII Detection Options section. Click the Custom Rules icon to upload your custom rules.
You can upload a new rules list or one you have previously created and uploaded.
You can save your custom rules templates and make them available to other projects. You can also update a saved template or save a custom rules template as your default setting.
When your team uploads a list of custom detection rules, the application will validate the tag names against the list of custom tags.
- Duplicate System Tag
- If a system tag with the exact same name already exists, the app will highlight the rules that failed validation. The system will produce an error message instructing your team to rename the tags that duplicate a system tag and re-upload the custom detection rules.
- Duplicate Custom Tag
- If your team uses duplicate custom rules, the combined rules will act as if they were separated by OR operators and will be flagged with a warning.
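The OR behavior described above can be pictured with a small Python sketch. This illustrates the idea only and is not Canopy's implementation: the two rules, their shared purpose, and the helper name combine_with_or are all hypothetical.

```python
import re


def combine_with_or(patterns):
    """Join several regexes into one alternation, each wrapped in a non-capturing group."""
    # Patterns that use numbered backreferences such as \1 would need extra care,
    # because group numbers shift when patterns are joined.
    return "|".join(f"(?:{pattern})" for pattern in patterns)


# Two hypothetical rules that share the same tag.
rule_a = r"\bSSN\b"
rule_b = r"\bTFN\b"
combined = re.compile(combine_with_or([rule_a, rule_b]))

for text in ["SSN: 123", "TFN: 456", "no identifier here"]:
    either = bool(re.search(rule_a, text) or re.search(rule_b, text))
    print(f"{text!r}: combined={bool(combined.search(text))}, either={either}")
```

In this sketch a document is tagged when either rule matches, which is the same result as running the single combined pattern.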
Unicode characters are supported when creating custom detection rules.
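For reference, this is how standard Python 3 treats Unicode in regex patterns. The sketch assumes default flags. Whether Canopy applies the same defaults is an assumption, so confirm the behavior on a small sample before relying on it. The sample strings are made up.

```python
import re

text = "Straße Москва"

# With default flags, \w matches letters from any script.
print(re.findall(r"\w+", text))                  # ['Straße', 'Москва']

# With re.ASCII, \w is limited to [a-zA-Z0-9_].
print(re.findall(r"\w+", text, flags=re.ASCII))  # ['Stra', 'e']

# Non-ASCII literals can be used directly in a pattern.
print(bool(re.search(r"\bМосква\b", text)))      # True

# With default flags, \d also matches non-ASCII decimal digits (here Arabic-Indic).
print(bool(re.fullmatch(r"\d{3}", "١٢٣")))                  # True
print(bool(re.fullmatch(r"\d{3}", "١٢٣", flags=re.ASCII)))  # False
```

If a rule must match only the digits 0-9, writing [0-9] instead of \d avoids depending on this behavior.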
While uploading data, you can load custom detection rules when configuring your processing settings.
From the kebab menu on the Document List page, you can choose to re-run PII detection and select custom detection.