Coming Soon
This feature is currently under development and will be available soon.
Installation
To use the ContentCleaner, you first need to install the purecpp_extract Python package.
Before you begin, ensure your environment meets the following requirements:
- Python 3.9, 3.10, or 3.11: PureCPP is compatible with these Python versions.
- Linux/WSL support: The library is fully compatible with Linux-based systems and Windows Subsystem for Linux (WSL).
- pip: Ensure pip is installed and updated to the latest version.
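The requirements above can be checked from Python before installing. This is a minimal sketch using only the standard library; the helper names `is_supported_python` and `pip_is_available` are illustrative and are not part of PureCPP.

```python
import importlib.util
import sys

# Supported (major, minor) pairs, taken from the requirements list above.
SUPPORTED_VERSIONS = {(3, 9), (3, 10), (3, 11)}


def is_supported_python(version=None):
    """Return True if the (major, minor) pair is one of the listed versions.

    `version` defaults to the running interpreter's version.
    """
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_VERSIONS


def pip_is_available():
    """Return True if the current interpreter can locate the pip module."""
    return importlib.util.find_spec("pip") is not None


if __name__ == "__main__":
    print("Python version supported:", is_supported_python())
    print("pip available:", pip_is_available())
    # sys.platform is "linux" on both native Linux and WSL.
    print("Platform:", sys.platform)
```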
Default Cleaning Patterns
When you initialize the ContentCleaner without passing any patterns, it uses the following default regex patterns:

| Name | Regex Pattern | Description |
|---|---|---|
| Extra Spaces | \s+ | Replaces runs of whitespace with a single space. |
| Non-ASCII Characters | [^\x00-\x7F]+ | Removes all non-ASCII characters. |
| Symbols at Line Edges | ^\W+\|\W+$ | Removes symbols at the start/end of lines. |
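To see what these three patterns do to a piece of text, the sketch below applies them with Python's standard `re` module. It is an illustration only, not the ContentCleaner implementation: the order of application, the replacement strings, and the function name `clean_text` are assumptions.

```python
import re

# (pattern, replacement) pairs mirroring the table above.
# The order and the replacement strings are assumptions made for this sketch.
DEFAULT_PATTERNS = [
    (re.compile(r"[^\x00-\x7F]+"), ""),  # drop non-ASCII characters
    (re.compile(r"\s+"), " "),           # collapse whitespace runs to one space
    (re.compile(r"^\W+|\W+$"), ""),      # strip symbols at the string edges
]


def clean_text(text, patterns=DEFAULT_PATTERNS):
    """Apply each (regex, replacement) pair to the text, in order."""
    for regex, replacement in patterns:
        text = regex.sub(replacement, text)
    return text


print(clean_text("  ** Hello   wörld!! ** \n"))  # prints "Hello wrld"
```

Because whitespace is collapsed before the edge pattern runs, newlines are already gone and `^` and `$` match the edges of the whole string. Cleaning each line separately would require `re.MULTILINE` and a different ordering.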
Usage
Initialization
You can initialize the ContentCleaner using default patterns or provide custom patterns.
Cleaning a Document
To clean the text of a document, use the ProcessDocument method. For creating documents, use the RAGDocument class from purecpp_libs.
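The workflow described here (initialize with default or custom patterns, then clean a document) can be pictured with a small pure-Python stand-in. Everything in this sketch is hypothetical: `SketchCleaner`, `SketchDocument`, the `page_content` and `metadata` fields, and `process_document` are illustrative names, not the actual ContentCleaner or RAGDocument API, whose signatures should be taken from the library itself.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SketchDocument:
    """Hypothetical stand-in for RAGDocument: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SketchCleaner:
    """Hypothetical stand-in for ContentCleaner (not the library's code)."""

    DEFAULT_PATTERNS = [r"\s+", r"[^\x00-\x7F]+", r"^\W+|\W+$"]

    def __init__(self, patterns=None):
        # Fall back to the defaults when no custom patterns are given.
        chosen = self.DEFAULT_PATTERNS if patterns is None else patterns
        self._compiled = [re.compile(p) for p in chosen]

    def process_document(self, doc):
        """Return a new document with every pattern match blanked out."""
        text = doc.page_content
        for regex in self._compiled:
            text = regex.sub(" ", text)
        # Normalize the spaces left behind by the substitutions.
        text = " ".join(text.split())
        return SketchDocument(text, dict(doc.metadata))


default_cleaner = SketchCleaner()                 # default patterns
digit_cleaner = SketchCleaner(patterns=[r"\d+"])  # custom patterns

doc = SketchDocument("  ## Intro   to  PureCPP ##  ", {"source": "demo"})
print(default_cleaner.process_document(doc).page_content)
```

Returning a new document instead of mutating the input is a design choice of this sketch; it keeps the original text available for comparison.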