Metadata Extractor
The MetadataRegexExtractor module is designed to extract structured metadata from documents by applying regular expression (regex) patterns. It identifies elements such as proper names, dates, numbers, emails, URLs, and custom patterns.
Coming Soon
This feature is currently under development and will be available soon.
Installation
To use the MetadataRegexExtractor, you first need to install the purecpp_extract
Python package.
Before you begin, ensure your environment meets the following requirements:
- Python 3.9, 3.10, 3.11: PureCPP is compatible with the latest versions of Python.
- Linux/WSL support: The library is fully compatible with Linux-based systems and Windows Subsystem for Linux (WSL).
- pip: Ensure pip is installed and updated to the latest version.
Initialization
You can initialize the MetadataRegexExtractor without any parameters to use the default extraction patterns:
Default Extraction Patterns
When you initialize the MetadataRegexExtractor, it comes with the following default regex patterns:
Name | Regex Pattern | Description |
---|---|---|
ProperName | [A-Z][a-z]+(?:\s[A-Z][a-z]+)* | Extracts proper names. |
Date | \b\d{1,2}/\d{1,2}/\d{2,4}\b|\b\d{4}-\d{2}-\d{2}\b | Extracts dates. |
Number | \b\d+\b | Extracts numbers. |
\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b | Extracts email addresses. | |
URL | \bhttps?://[^\s]+\b | Extracts URLs. |
These patterns help automatically enrich documents with relevant metadata.
Extract default metadata
To process a document and extract metadata, use the ProcessDocument method.
⚠️ Note: The example below manually creates a Document
object for demonstration purposes. In a real-world scenario, you would typically use the output of a dataloader directly. See the Dataloader Documentation
Adding Custom Extraction Patterns
You can add custom patterns to extract specific metadata by using the AddPattern method.
Example
Let’s add a custom pattern to extract a “Category” field: