Automatically inventory personal information

Updated on 29 Jan 2025

Metadata Extractors

Constellio's metadata extractors detect the personal information contained in your files. Because they rely on regular expressions, a single extractor can search for several formats of the same information. If your Constellio server is configured with OCR, detection covers both text and image documents (e.g. scanned PDF documents).

Here are some example expressions for different types of sensitive information:

Title	Regular Expression	Detected Formats
Social Insurance Number	\b((\d{3}[- ]\d{3}[- ]\d{3})|(\d{9}))\b	999999999, 999 999 999, 999-999-999
Credit card	\b((\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4})|(\d{16}))\b	9999999999999999, 9999 9999 9999 9999, 9999-9999-9999-9999
Telephone number	(\b[1-9][-\s]|\b)([(]\d{3}[)]|\d{3})[-\s]?\d{3}[-\s]?\d{4}\b	9 (999) 999-9999, (999) 999-9999, 999-999-9999, 999 999 9999, 9-999-999-9999, 9 999 999 9999
E-mail address	\b[_A-Za-z0-9-\+]+(\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\.[A-Za-z0-9]+)*(\.[A-Za-z]{2,})\b	XXXXXXXXXXX@XXXXX.XXX, XXXXXXXXXXX@XXXXX.XXX.XXX
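
Before entering an expression in Constellio, it can help to try it on sample text. The following is a minimal sketch in Python, not part of Constellio: it uses the Social Insurance Number and credit card expressions from the table, written as Python raw strings, and the `detect` helper is a hypothetical name introduced for illustration. Constellio's own regular expression engine may differ from Python's in some details.

```python
import re

# Expressions from the examples table, written as Python raw strings.
PATTERNS = {
    "Social Insurance Number": r"\b((\d{3}[- ]\d{3}[- ]\d{3})|(\d{9}))\b",
    "Credit card": r"\b((\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4})|(\d{16}))\b",
}


def detect(text):
    """Return the titles of the patterns found at least once in text.

    Hypothetical helper for local testing only; it is not a Constellio API.
    """
    return [title for title, regex in PATTERNS.items() if re.search(regex, text)]


if __name__ == "__main__":
    sample = "SIN on file: 999-999-999. Card: 9999 9999 9999 9999."
    for title in detect(sample):
        print("Detected:", title)
```

Running the expressions against a few known-good and known-bad samples in this way is a quick check that the word boundaries (`\b`) and separators behave as intended before the extractor is applied to real documents.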

You can either simply detect the presence of personal information or extract the value.

The available parameters are:

Field	Definition	Possible Values
Metadata	The metadata in which the analysis is done	To parse the text of a file (PDF, DOCX, etc.), select the File metadata
Regex	Regular expression used to detect the targeted data	See the examples above
Type	Determines whether the information is only detected or is also extracted	Substitution: if the information is detected, write a predefined value in the metadata, e.g. "Contains a Social Insurance Number"; Transformation: if the information is detected, extract its value into the metadata
Value	Determines what is written to the metadata	Substitution: enter a preset value such as "Contains a Social Insurance Number"; Transformation: the written value refers to the position of the match. For example, if a credit card number is detected three times in the text, write $0 for the first match, $1 for the second match, and $2 for the third match
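
The difference between the two types can be sketched as follows. This is an illustration in Python under stated assumptions, not Constellio's implementation: the function names `substitution` and `transformation` are hypothetical, and reading `$n` as the zero-based index of the match follows the description in the table rather than any confirmed internal behavior.

```python
import re

# Credit card expression from the examples table, as a Python raw string.
CREDIT_CARD = r"\b((\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4})|(\d{16}))\b"


def substitution(text, regex, preset):
    """Return the preset value if the regex matches anywhere, else None."""
    return preset if re.search(regex, text) else None


def transformation(text, regex, value):
    """Resolve a '$n' value to the n-th match (zero-based), else None.

    Assumption: '$0' is the first match, '$1' the second, and so on.
    """
    index = int(value.lstrip("$"))
    matches = [m.group(0) for m in re.finditer(regex, text)]
    return matches[index] if 0 <= index < len(matches) else None


if __name__ == "__main__":
    text = "Cards: 1111 2222 3333 4444, 5555-6666-7777-8888 and 9999000011112222."
    print(substitution(text, CREDIT_CARD, "Contains a credit card number"))
    print(transformation(text, CREDIT_CARD, "$1"))
```

Substitution is the safer choice when the goal is only to flag documents, since the sensitive value itself is never copied into the metadata; transformation copies the matched value, which then needs the same protection as the source file.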

For more information on metadata extractors, see the Metadata Extractor article.
