Entity Propagation

Entity Propagation Overview

Propagation is the name for Canopy’s automated process of searching, detecting, and matching entities found in documents uploaded into our system by our clients. Entity Propagation is an effective data mining tool for use alongside Gallery View, Statistical Sampling, Smart Mapping, and Database Mapping to efficiently handle variations found in compromised data.

One of the most important tasks in data mining is to find the association between sensitive data and a specific person. Our goal is to discover these associations as efficiently as possible. Once an association has been found on one document, there is little value in spending time collecting the same PII (“Personally Identifiable Information,” including SSN, email address, etc.) on multiple documents. Canopy’s Entity Propagation capability helps eliminate the need to manually add an element multiple times by automatically propagating data found in the consolidated master entities to all PII detected in the project.

Canopy gathers entity information for propagation in two ways: from “seed” entity data and from “raw” entity data. Seed entity data includes lists of a client’s known entities, such as a list of names and associated PII from their own personnel or current customers. Raw entity data is entity information that is identified and manually entered into Canopy’s system by reviewers.

Seed Data

A client’s seed entity data is uploaded to Canopy and added to its database. This data is then propagated once for each tranche of uploaded documents. Clients may add new entities to the seed data and update or delete entities from seed data after the initial upload. Canopy’s system synchronizes every twenty (20) minutes to incorporate new seed entity data into propagation.

Professional Services Required
Seed data should not be uploaded through the normal uploading and processing method. Doing so would mix seed data with compromised data entities. Please contact Customer Support for assistance uploading seed data.

Raw Entities

Reviewers can also manually enter entity information as they sort through tranches of uploaded documents. This “raw” entity data is tracked and matched to the existing seed entity data in the system, and then propagated throughout all uploaded documents. Based on the fields present in the detected entities, Canopy will check for the presence of matching PII and PUIDs (Personally Unique Identifiers) and propagate confirmed matches.

Entity Propagation Workflow

To match an entity, all personally unique identifiers (PUIDs) from the PII found in a searched entity are queried in the detected PII list compiled from the seed and raw entity data. If any one PUID matches on a document, that document is fetched for further processing. The PII fields that are present in both the searched entity and the detected PII in the document will be propagated.
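The matching rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Canopy's implementation: the `PUID_TYPES` set, the `Document` class, and the `propagate` function are hypothetical names, and entities are modeled as plain dictionaries of element type to value.

```python
from dataclasses import dataclass, field

# Hypothetical set of element types treated as PUIDs for this sketch.
PUID_TYPES = {"ssn", "passport", "medical_id"}


@dataclass
class Document:
    doc_id: str
    # Detected PII on the document, stored as (element_type, value) pairs.
    detected: set = field(default_factory=set)


def propagate(entity: dict, documents: list) -> dict:
    """Return {doc_id: {element_type: value}} for documents matching any PUID."""
    # Collect the entity's PUIDs as (element_type, value) pairs.
    puids = {(t, v) for t, v in entity.items() if t in PUID_TYPES}
    result = {}
    for doc in documents:
        # A single matching PUID is enough to fetch the document.
        if puids & doc.detected:
            # Propagate only the fields present in both the entity
            # and the PII detected on the document.
            result[doc.doc_id] = {
                t: v for t, v in entity.items() if (t, v) in doc.detected
            }
    return result


entity = {"ssn": "333-22-4444", "email": "pat@example.com", "phone": "555-0100"}
docs = [
    Document("doc-1", {("ssn", "333-22-4444"), ("email", "pat@example.com")}),
    Document("doc-2", {("email", "pat@example.com")}),  # no PUID match
]
matches = propagate(entity, docs)
```

In this sketch, doc-1 is fetched because its detected SSN matches, and only the SSN and email are propagated because the phone number was not detected there. doc-2 is skipped because an email match alone is not a PUID match.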

Propagation is triggered in the following cases:

When new entities are added
When new documents are uploaded
If a user re-runs PII detection, all previous propagation will be cleared. Propagation will then be run with the new PII values.

Hash values are assigned to known entities to find and track them throughout uploaded documents. Each tranche of documents uploaded by the client is assigned an ID number. Canopy’s system uses these hash values and ID numbers to accurately propagate newly added entities into all previously uploaded tranches of documents.

If a user deletes propagated entities from seed data, the deletion details will be noted and the deleted entities will be removed the next time propagation is run. If a user deletes original entities from seed data, the deletion details will be noted and the deleted entities will be removed from propagation across all documents, unless a user has already marked documents as “Reviewed.” If a user updates propagated entities, the update details will be noted and the updates will be added the next time propagation is run. Values for updated or deleted seed entities will be updated every fifteen (15) minutes.

Deduplication

Canopy tracks all uploaded entity information and automatically de-duplicates any repeated information to isolate specific entities for use in building a master entity list. When entity information is added or updated, the deduplication process will run, and, upon completion, will trigger consolidation.
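As a rough illustration of how repeated entity information could be collapsed, the sketch below clusters entities by a hash of their normalized fields and keeps one representative per cluster. The normalization rules (trimming whitespace, lowercasing), the use of SHA-256, and the function names `entity_hash` and `deduplicate` are all assumptions made for this example and may differ from what Canopy actually does.

```python
import hashlib
from collections import defaultdict


def entity_hash(entity: dict) -> str:
    """Hash an entity's normalized fields (normalization rules are assumed)."""
    normalized = sorted((k, str(v).strip().lower()) for k, v in entity.items())
    return hashlib.sha256(repr(normalized).encode("utf-8")).hexdigest()


def deduplicate(entities: list) -> list:
    """Cluster entities by hash and keep the first entity from each cluster."""
    clusters = defaultdict(list)
    for entity in entities:
        clusters[entity_hash(entity)].append(entity)
    # Dictionaries preserve insertion order, so the output order is stable.
    return [group[0] for group in clusters.values()]


entities = [
    {"name": "Pat Lee", "ssn": "333-22-4444"},
    {"name": " pat lee ", "ssn": "333-22-4444"},  # duplicate after normalization
    {"name": "Sam Roe", "ssn": "111-22-3333"},
]
unique = deduplicate(entities)
```

Here the second record differs from the first only in case and whitespace, so both fall into the same cluster and a single entry is kept.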

Consolidation

A master entity is a collection of merged entities that identify a single, unique person. Consolidation uses all confirmed entity information from the seed and raw entity data to compile a consolidated master entity list to be used for meeting notification requirements. Based on the hash values assigned to known entities, duplicate entities will get clustered for de-duplication, and then the consolidation rule will run. Master entities are now stored in a persistent PostgreSQL database, so subsequent consolidation runs will be faster, as there is no longer a need to copy all the data on every run. If the system is in the process of copying added, updated, or deleted entity information, consolidation will be queued and will start after copying has completed. Please refer to Entity Consolidation product documentation for detailed information on the consolidation workflow.

Starting and Running Entity Propagation

Step 1. Entity Propagation Setup

Navigate to Project Settings→Manage Review Settings:

Toggle ON Enable Entity Propagation.
Toggle OFF Show Auto Suggestion Panel.
Select Add to Document to add raw entities to exact duplicate documents.

You can find out more about the Entity Propagation settings in the Project Settings→Review Management section.

Step 2. Upload, Configure Settings, and Process

When configuring processing settings, make sure you are detecting the PUIDs that you expect to be present in the dataset.

Canopy’s defined list of Personally Unique Identifiers (PUIDs) can be found by navigating to Project Settings→Templates and Layouts→Manage Entity Layouts and clicking on Add New. Toggle on Show only personally unique identifiers and three categories containing PUIDs will appear with checkboxes: Identification, Medical, and Financial. Click on a category to see fields containing PUIDs indicated with a green dot.

More details can be found in the Project Settings→Templates and Layouts section.

Step 3. Filter and Batch Documents

The value of Entity Propagation can be maximized by reviewing documents with a high number of PII, PUIDs, or mappable data, because these documents will have the benefit of both accelerating review and automating the consolidation process.

Here are several batching ideas that may be useful to perform during the assessment phase, prior to the full team review, or during the full team review:

Batch Documents with the Highest Number of PII First

To create a batch set of documents with the highest number of detected PII first, navigate to the Document List page, click on the filter panel, and select the following:

Processing and Detection→Sensitive Document Overview→Any Sensitive Data
Review and Analysis→Review Status→Not Batched

You may want to review high impact documents prior to initiating a full team review. In the below example, we chose 5% (about 12 documents). You can limit the scope of documents by searching meta.total_pii:>20. When you create batch sets from this sorted list, the batches will be organized in the same way.

Batch Mappable Documents with the Highest PII Count First

Alternatively, you may want to take advantage of both Smart Map and Entity Propagation to map high impact documents before you start a full team review. This could provide value early in the assessment phase, because it will tell you something about the complexity of the data. For example, if the data is repetitive, mapping a few key spreadsheets would propagate a high percentage of the detected PII as raw entities.

To batch mappable documents with the highest number of PII first, navigate to the Document List page and click on the filter panel. Select the following:

Processing and Detection→Sensitive Document Overview→Any Sensitive Data
Processing and Detection→Mappable Types→All Mappable
Review and Analysis→Review Status→Not Batched

Batch High Volume PUIDs First

Another way to use Entity Propagation is to create a batch set of documents with detected PUIDs. This technique is similar to the previous approaches, but aims to drive consolidation early by prioritizing the review of specific PUIDs with the highest volume or importance.

To batch high volume PUIDs first, navigate to the Document List page, click on the filter panel, and then select the following:

Processing and Detection→PII Elements by Category→PUID
Review and Analysis→Review Status→Not Batched

You can sort these documents by highest total PII and batch these documents out.

You can search and independently batch each PUID in order of priority. For example, say your priorities are the following:

SSN
Medical ID

You could scope your search and batch documents that contain more than the following:

1000 social security numbers (meta.pii_density.socialsecuritynumber)
10000 medical identification numbers (meta.pii_density.mid)
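If you batch several PUIDs in priority order, it can help to generate the search strings rather than type them by hand. The sketch below assumes the density fields accept the same `field:>N` syntax as the meta.total_pii:>20 search shown earlier in this step. That is an assumption, so confirm it in your project before relying on it. The helper name `more_than` and the `PRIORITIES` list are hypothetical.

```python
def more_than(field: str, threshold: int) -> str:
    """Build a 'greater than' search string, e.g. meta.total_pii:>20."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    return f"{field}:>{threshold}"


# Priority order from the example above: SSN first, then Medical ID.
# Assumed: the density fields accept the same ":>" operator as meta.total_pii.
PRIORITIES = [
    ("meta.pii_density.socialsecuritynumber", 1000),
    ("meta.pii_density.mid", 10000),
]

queries = [more_than(field, threshold) for field, threshold in PRIORITIES]
for query in queries:
    print(query)
```

Each generated string can then be used as a search on the Document List page, in priority order, before creating the corresponding batch.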

Step 4. Review

Checking out batches in the order of priority will provide the biggest impact. Provided that the batches were created from a list sorted by priority, a user can check out batches in priority order, starting with the lowest-numbered batches first.

When a user is reviewing a document that is checked out, they may see entities already added in green lettering. These are automatically propagated entities.

When Entity Propagation is running, raw entities can concurrently be propagated to a document a user is reviewing.

Step 5. Propagate Entities

After running consolidation, our system will propagate all the elements in the master entities to unreviewed documents, as long as the following conditions are met:

The document has not been marked as Reviewed
The detected element is a PUID or is a detected non-PUID element associated with a detected PUID
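The two conditions above amount to a simple eligibility check. The sketch below is illustrative only: the `DetectedElement` class, its `linked_puid` flag, and the `PUID_TYPES` set are hypothetical names, and how Canopy actually determines that a non-PUID element is associated with a detected PUID is not described here.

```python
from dataclasses import dataclass

# Hypothetical set of element types treated as PUIDs for this sketch.
PUID_TYPES = frozenset({"ssn", "passport", "medical_id"})


@dataclass(frozen=True)
class DetectedElement:
    element_type: str
    value: str
    # True when this element is associated with a detected PUID (assumed flag).
    linked_puid: bool = False


def is_eligible(document_reviewed: bool, element: DetectedElement) -> bool:
    """Apply both conditions: document not Reviewed, and PUID or PUID-linked."""
    if document_reviewed:
        return False
    return element.element_type in PUID_TYPES or element.linked_puid


ssn = DetectedElement("ssn", "333-22-4444")
linked_email = DetectedElement("email", "pat@example.com", linked_puid=True)
lone_email = DetectedElement("email", "sam@example.com")
```

Under these assumptions, `ssn` and `linked_email` are eligible on an unreviewed document, `lone_email` is not, and nothing is eligible once the document has been marked as Reviewed.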

Known Conditions Where Entities Are Not Detected

Images and PDFs - Low-quality images can result in poor OCR output and less precise detection. Use Gallery View to scan and sweep images and PDFs in and out of the review set.

Excel/CSVs - Detection focuses on determining whether spreadsheets may contain PII, but due to high volumes of PII, detection is not exhaustive or necessary on these types of files. Use Smart Map to add entities to these documents. Entity Propagation does not propagate raw entities to spreadsheets and CSV files.

Entity Propagation Summary

Entities are added after each entity consolidation run.
Any element in a consolidated master entity can be propagated, whether the element was detected or manually added.
Elements will only be propagated to a document when both the value and element type match. For example, an entity with the Social Security Number 333-22-4444:
- will be added to a document when the number 333-22-4444 was detected as a Social Security Number
- will not be added if 333-22-4444 was detected as a passport
- will not be added if 333-22-4444 appears in a text search, but was not detected as an SSN
Detected elements from the same master entity will be merged as a single entity.
When two master entities contain the same element, the element will be propagated to both master entities.
If you edit a propagated entity on a document that is marked reviewed, your edits will remain.
If you edit a propagated entity on a document that is not marked as reviewed, your edits will be removed the next time consolidation and propagation are run. If you delete a propagated entity or its elements, it/they will be added back in the next consolidation run.
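The value-and-type rule from the summary can be expressed as a small check. This is a minimal sketch, assuming detections on a document are available as (element type, value) pairs. The function name `can_propagate` and the type labels are hypothetical.

```python
def can_propagate(element_type: str, value: str, detections: list) -> bool:
    """True only when the same value was detected with the same element type."""
    return (element_type, value) in detections


SSN = "Social Security Number"

# The three cases from the summary above.
detected_as_ssn = [(SSN, "333-22-4444")]
detected_as_passport = [("Passport", "333-22-4444")]
text_only = []  # the value appears in the text but was not detected as an SSN

results = [
    can_propagate(SSN, "333-22-4444", detected_as_ssn),
    can_propagate(SSN, "333-22-4444", detected_as_passport),
    can_propagate(SSN, "333-22-4444", text_only),
]
```

Only the first case qualifies. A matching value with a different element type, or a value that was never detected, does not.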

Roadmap

Planned Features
Propagation of non-PUID elements using last name, first name, and date of birth.
Propagation of custom elements.
Run Auto Propagation independently of Entity Consolidation.

Step 6. Measure and QC

In order to manage the review, you may want to identify documents where raw entities have been added.

Propagated Raw Entities: From the Entity Raw View page, the user can filter on Manually Added or Auto Propagated entities in the Entry Method column.
Documents with Auto Propagated Entities: From the Document List page, the Propagated Entity filter can be selected from the filter panel on the right of the screen.

Roadmap

Planned Features
More advanced analytics and reporting.