Content Cleaner
This module cleans document content with regex patterns, removing unwanted characters, extra whitespace, and other artifacts before further processing.
Installation
To use the ContentCleaner, you first need to install the purecpp_extract Python package.
Before you begin, ensure your environment meets the following requirements:
- Python 3.9, 3.10, or 3.11: PureCPP supports these Python versions.
- Linux/WSL support: The library is fully compatible with Linux-based systems and Windows Subsystem for Linux (WSL).
- pip: Ensure pip is installed and updated to the latest version.
Default Cleaning Patterns
When you initialize the ContentCleaner without passing any patterns, it uses the following default regex patterns:
Name | Regex Pattern | Description |
---|---|---|
Extra Spaces | \s+ | Replaces each run of whitespace with a single space. |
Non-ASCII Characters | [^\x00-\x7F]+ | Removes all non-ASCII characters. |
Symbols at Line Edges | ^\W+|\W+$ | Removes symbols at the start/end of lines. |
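To make the effect of these patterns concrete, here is a minimal sketch that applies them with Python's standard `re` module. It does not call the ContentCleaner itself: the helper name `apply_default_rules`, the replacement strings, and the order of application (table order) are assumptions made for illustration, since the table only lists the patterns and what each one does.

```python
import re

# Stand-ins for the three default patterns, in table order.
# Replacement strings and ordering are assumptions of this sketch.
DEFAULT_RULES = [
    (re.compile(r"\s+"), " "),            # collapse each whitespace run to one space
    (re.compile(r"[^\x00-\x7F]+"), ""),   # drop non-ASCII characters
    (re.compile(r"^\W+|\W+$"), ""),       # strip symbols at the start and end
]

def apply_default_rules(text):
    """Apply each (pattern, replacement) pair in turn (hypothetical helper)."""
    for pattern, replacement in DEFAULT_RULES:
        text = pattern.sub(replacement, text)
    return text

sample = "-- Caf\u00e9   menu   2024 --"
cleaned = apply_default_rules(sample)
print(cleaned)  # Caf menu 2024
```

Note that the non-ASCII pattern removes accented letters outright rather than transliterating them, so "Café" becomes "Caf". If your documents contain non-English text, that default may be worth replacing.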
Usage
Initialization
You can initialize the ContentCleaner using default patterns or provide custom patterns.
Cleaning a Document
To clean the text of a document, use the ProcessDocument method. To create documents, use the RAGDocument class from purecpp_libs.
Using Custom Cleaning Patterns
You can add extra regex patterns to further clean your document. These custom patterns are applied after the default patterns.
Example
Suppose we want to remove the word “test” from the string:
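The idea of running a custom pattern after the defaults can be sketched with the standard `re` module alone. This is an illustration, not the ContentCleaner API: the `clean` helper, the `(pattern, replacement)` pairs, and the custom pattern `\btest\b\s*` are all hypothetical, and the defaults are assumed to run in the order they were listed earlier.

```python
import re

# Assumed stand-ins for the default patterns described earlier.
DEFAULTS = [
    (r"\s+", " "),            # collapse whitespace runs
    (r"[^\x00-\x7F]+", ""),   # drop non-ASCII characters
    (r"^\W+|\W+$", ""),       # strip symbols at the edges
]

# Hypothetical custom pattern: the whole word "test" plus any whitespace
# that follows it, so no double space is left behind.
CUSTOM = [(r"\btest\b\s*", "")]

def clean(text, custom=()):
    """Apply the defaults first, then any custom patterns (sketch only)."""
    for pattern, replacement in [*DEFAULTS, *custom]:
        text = re.sub(pattern, replacement, text)
    return text

raw = "  This   is a test string!!  "
defaults_only = clean(raw)
with_custom = clean(raw, CUSTOM)
print(defaults_only)  # This is a test string
print(with_custom)    # This is a string
```

The word boundaries (`\b`) keep the pattern from touching words that merely contain "test", such as "contest". Because the custom pattern runs last in this sketch, it has to clean up its own trailing whitespace; a bare `test` pattern would leave two adjacent spaces in the result.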