Metadata extractors
  • 11 Apr 2024

1. Metadata extractors

Metadata extractors let Constellio extract metadata from the items you add, using styles, regular expressions, or properties. The order of priority for populating a metadata is defined in the system configurations. If the metadata extractor defines no data for styles and regular expressions, Constellio automatically extracts the property data.

The metadata extractor form, described in section 2, is where you define styles, properties, and regular expressions for a specific metadata.

For accurate metadata, it helps to specify extraction information for all three methods (styles, properties, and regular expressions). For a specific Word template that uses styles, however, it may be more useful to create a metadata schema specific to that template and define precisely how to extract each metadata with one or more styles from the Word template.
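
As a rough illustration of style-based extraction, here is a minimal Python sketch. It is not Constellio's implementation: the function names, the `(style, text)` pairs, and the sample values are all hypothetical. It only assumes the rule stated in this article that style names are entered in lowercase and without spaces.

```python
def normalize_style_name(name: str) -> str:
    """Lowercase a Word style name and remove all whitespace."""
    return "".join(name.split()).lower()


def extract_by_styles(paragraphs, configured_styles):
    """Return the text of the first paragraph whose style is configured.

    paragraphs: iterable of (style_name, text) pairs, in document order.
    configured_styles: style names as they would be entered in the form.
    """
    wanted = {normalize_style_name(s) for s in configured_styles}
    for style_name, text in paragraphs:
        if normalize_style_name(style_name) in wanted and text.strip():
            return text.strip()
    return None  # no configured style carries data


# Hypothetical paragraphs from a Word template, as (style, text) pairs.
paragraphs = [
    ("Normal", "Internal memo"),
    ("Proper Title", "Annual budget review"),
    ("Author Name", "J. Tremblay"),
]

title = extract_by_styles(paragraphs, ["propertitle"])
author = extract_by_styles(paragraphs, ["authorname", "writer"])
print(title, "/", author)
```

Registering several styles for one metadata, as in the `author` call, simply widens the set of styles that can supply the value.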

2. Create a metadata extractor

  1. Click on "Administration" in the navigation menu;
  2. Click on "Metadata Extractor";
  3. In the "Metadata Extractor" window, click "Add";
  4. In the second "Metadata Extractor" window, fill in the fields needed to create the metadata extractor, whether with styles, properties, regular expressions, or only the desired elements, then click "Save".
Pane 1: Metadata selection
Field Name | Type | Description
Schema Type | Required | Select a schema type.
Schema | Required | If multiple schemas exist for the selected schema type, choose the precise schema that contains the metadata for which the extractor is to be created.
Metadata | Required | Select the precise metadata (e.g., title, author, description).
Pane 2: Define styles, properties, and regular expressions
Field Name | Type | Description
Styles | Optional | Enter the name given to the style in Word. The name must be written in lowercase and without spaces (e.g., if the style is named "Proper Title", write "propertitle"). Several styles can be registered for one metadata.
Properties | Optional | Enter the name of the property that is equivalent to the metadata. For the document and email schemas, the properties equivalent to metadata are already specified by default in the metadata extractor. If you add a new schema, you can rely on the ones indicated for the document.
Regexes (regular expressions) | Optional | Allows you to define one or more regular expressions, each for a specific metadata. For each regular expression, when the target metadata matches, the extractor can be configured to use the value found or another specified value. Each regular expression has the following fields:
  • Metadata: The metadata in which the analysis is done. To analyze the text of a PDF, DOCX, or similar file, select the File metadata.
  • Regex: Allows you to enter the regular expression.
  • Type: Determines whether the information is only detected or also extracted.
    • Substitution: If the information is detected, write a predefined value in the metadata, for example "Contains a social insurance number".
    • Transformation: If the information is detected, extract the detected value into the metadata.
  • Value:
    • Substitution: Enter a predefined value such as "Contains a social insurance number".
    • Transformation: The value entered is the position of the detected value. For example, if a credit card number is detected 3 times in the text, write
      • $0 for the first match
      • $1 for the second match
      • $2 for the third match.
Enabled only at creation | Optional | Allows you to specify whether the extraction is done only when the document is created, or each time it is modified.
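
The difference between the two regular expression types can be sketched in Python as follows. This is only an illustration under stated assumptions: `apply_regex_extractor` and `SIN_PATTERN` are hypothetical names, the pattern is deliberately simplified, and `$0`, `$1`, `$2` are read as zero-based match positions as described above. Constellio's own regular expression engine may behave differently.

```python
import re


def apply_regex_extractor(text, pattern, kind, value):
    """Return the metadata value produced by one regex rule, or None.

    kind: "substitution" or "transformation".
    value: the predefined text (substitution) or a position such as "$0"
           (transformation), read here as a zero-based match index.
    """
    matches = [m.group(0) for m in re.finditer(pattern, text)]
    if not matches:
        return None  # nothing detected, so the rule does not apply
    if kind == "substitution":
        return value  # write the predefined value
    if kind == "transformation":
        index = int(value.lstrip("$"))
        return matches[index] if index < len(matches) else None
    raise ValueError(f"unknown rule type: {kind}")


# Simplified pattern: a 9-digit number written as 3 groups of 3 digits.
SIN_PATTERN = r"\b\d{3}[ -]\d{3}[ -]\d{3}\b"
text = "Employee A: 123-456-789. Employee B: 987 654 321."

flag = apply_regex_extractor(
    text, SIN_PATTERN, "substitution", "Contains a social insurance number"
)
first = apply_regex_extractor(text, SIN_PATTERN, "transformation", "$0")
second = apply_regex_extractor(text, SIN_PATTERN, "transformation", "$1")
print(flag, first, second, sep=" | ")
```

Substitution is useful for flagging sensitive content without copying it into the metadata, while transformation copies the detected value itself.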

2.1 Properties analyzer

The properties analyzer lets you select a document of your choice, analyze its properties, and choose the metadata you want to extract automatically.

  1. Click on "Administration" in the navigation menu;
  2. Click on "Metadata Extractor";
  3. In the "Metadata Extractor" window, click "Add";
  4. Click on the "Properties analyzer" option;
  5. Select a document using the button, or drag it onto the page;
  6. The properties and styles metadata are displayed; click on the metadata of your choice;
  7. A confirmation that the property has been added to the list appears;
  8. Close the window to return to the metadata extraction page. In this example, the "Page-Count" property has been added to the list of properties;
  9. Fill in the other fields to determine into which existing metadata "Page Count" should be extracted;
  10. The metadata is now defined as an extracted metadata;
  11. The metadata is now extracted automatically as soon as a document is added to Constellio.
Metadata
The metadata to which the extraction is linked must be created beforehand. To learn more about creating metadata, see "Add metadata."
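
To get a feel for what "properties" means for a Word document, the following Python sketch lists the properties stored in a .docx file, which is a ZIP package whose `docProps/core.xml` and `docProps/app.xml` parts hold them. This is an independent illustration, not how the properties analyzer is implemented: `read_docx_properties` is a hypothetical helper, and the sample archive only mimics the property parts of a real document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Standard property parts inside a .docx package (Office Open XML).
PROPERTY_PARTS = ("docProps/core.xml", "docProps/app.xml")


def read_docx_properties(source):
    """Return {property name: text} from the property parts of a .docx file.

    source: a path or a binary file-like object.
    """
    properties = {}
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())
        for part in PROPERTY_PARTS:
            if part not in names:
                continue
            root = ET.fromstring(archive.read(part))
            for element in root:
                local_name = element.tag.split("}")[-1]  # drop the XML namespace
                if element.text and element.text.strip():
                    properties[local_name] = element.text.strip()
    return properties


# Demo on a tiny in-memory archive with hypothetical values.
CORE_XML = (
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Budget 2024</dc:title><dc:creator>J. Tremblay</dc:creator>"
    "</cp:coreProperties>"
)
APP_XML = (
    '<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties">'
    "<Pages>3</Pages></Properties>"
)

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("docProps/core.xml", CORE_XML)
    archive.writestr("docProps/app.xml", APP_XML)
buffer.seek(0)

properties = read_docx_properties(buffer)
print(properties)
```

The names printed here are raw XML element names. The names shown by the properties analyzer (such as "Page-Count") come from Constellio's own parser and may differ.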

3. Edit a metadata extractor

  1. Click on "Administration" in the navigation menu;
  2. Click on "Metadata Extractor";
  3. In the "Metadata Extractor" window, click on the notebook icon to the right of the item to be modified;
  4. Make the changes and click "Save".

4. Delete a metadata extractor

  1. Click on "Administration" in the navigation menu;
  2. Click on "Metadata Extractor";
  3. In the "Metadata Extractor" window, click on the red X to the right of the item to be deleted;
  4. A confirmation window appears; click on "Save".

5. Configurations

In this section, you will find all the system configurations impacting metadata extractors. To learn more about configurations, see the "Systems configurations" article.

Advanced tab

Configuration: Remove the extension in the title of a document
Description: Removes the extension (e.g., .txt, .doc) from the "Title" field of a document when it is populated using metadata extractors (extraction by properties).
Possible values and impacts:
  • Activated: The title of the metadata card will not include the file extension.
  • Disabled: The title of the metadata card will include the file extension.

Configuration: Priority when populating metadata
Description: Determines the order of priority for populating the metadata during the automatic extraction of the title when documents are imported.
Possible values and impacts:
  • Styles: For a Word document, the style created in the Word document is taken into account first.
  • File name: The file name will be used.
  • Properties: The title defined in the properties will be used.
Example: For the choice Styles -> Regular expressions -> Properties, Constellio checks the sources in the following order, if the data is available:
  • Styles
  • Regular expressions
  • Properties
If there is no data for styles and regular expressions, Constellio automatically extracts the property data.

Configuration: Priority when populating the title
Description: Specifies the order in which the title metadata will be extracted when importing documents. To use it, you must configure the Metadata Extractors module.
Possible values and impacts:
  • Styles: For a Word document, the style created in the Word document is taken into account first.
  • File name: The file name will be used.
  • Properties: The title defined in the properties will be used.
Example: For the choice Styles -> File name -> Properties, Constellio checks the sources in the following order, if the data is available:
  • Styles
  • File name
  • Properties
If the sources checked first contain no data, Constellio automatically extracts the property data.
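
The combined effect of the title priority and the extension removal can be sketched in Python as follows. This is a simplified model, not Constellio's code: the source keys (`styles`, `file_name`, `properties`), the function names, and the extension pattern are assumptions made for the example.

```python
import re

# Simplified heuristic: a dot followed by a letter and up to 4 more
# letters or digits at the end of the string (e.g. ".txt", ".docx").
EXTENSION = re.compile(r"\.[A-Za-z][A-Za-z0-9]{0,4}$")


def first_available(candidates, priority):
    """Return (source, value) for the first source in priority order with data."""
    for source in priority:
        value = (candidates.get(source) or "").strip()
        if value:
            return source, value
    return None, None


def populate_title(candidates, priority, remove_extension):
    """Pick the title by priority, optionally stripping a file extension."""
    _, value = first_available(candidates, priority)
    if value is not None and remove_extension:
        value = EXTENSION.sub("", value)
    return value


# Hypothetical values found for one imported document.
candidates = {
    "styles": "",  # no style data in this document
    "file_name": "budget-2024.docx",
    "properties": "Budget 2024",
}
TITLE_PRIORITY = ["styles", "file_name", "properties"]

with_removal = populate_title(candidates, TITLE_PRIORITY, remove_extension=True)
without_removal = populate_title(candidates, TITLE_PRIORITY, remove_extension=False)
properties_first = populate_title(
    candidates, ["styles", "properties", "file_name"], remove_extension=True
)
print(with_removal, without_removal, properties_first, sep=" | ")
```

Because the styles source is empty here, the next source in the configured order supplies the title, which is the fallback behaviour the configurations describe.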


