Coming Soon
This feature is currently under development and will be available soon.
Installation
To use the ContentCleaner, you first need to install the purecpp_extract Python package.
Before you begin, ensure your environment meets the following requirements:
- Python 3.9, 3.10, 3.11: PureCPP is compatible with the latest versions of Python.
- Linux/WSL support: The library is fully compatible with Linux-based systems and Windows Subsystem for Linux (WSL).
- pip: Ensure pip is installed and updated to the latest version.
pip install purecpp_chunks_clean
Default Cleaning Patterns
When you initialize the ContentCleaner without passing any patterns, it uses the following default regex patterns:
| Name | Regex Pattern | Description |
|---|
| Extra Spaces | \s+ | Replaces multiple spaces with a single one. |
| Non-ASCII Characters | [^\x00-\x7F]+ | Removes all non-ASCII characters. |
| Symbols at Line Edges | ^W+|\W+$ | Removes symbols at the start/end of lines. |
Usage
Initialization
You can initialize the ContentCleaner using default patterns or provide custom patterns.
from purecpp_chunks_clean import ContentCleaner
# Initialize with default patterns
cleaner = ContentCleaner()
Cleaning a Document
To clean the text of a document, use the ProcessDocument method. For creating documents, use the RAGDocument class from purecpp_libs.
from purecpp_libs import RAGDocument
input_data = "This is a test doc with multiple spaces."
doc = RAGDocument(page_content=input_data, metadata={})
cleaned_doc = cleaner.ProcessDocument(doc)
print(cleaned_doc.page_content)
# Output: "This is a test doc with multiple spaces."
Using Custom Cleaning Patterns
You can add extra regex patterns to further clean your document. These custom patterns are applied after the default patterns.
Example
Suppose we want to remove the word “test” from the string:
regex_pattern_to_remove = r"test"
# Apply custom patterns during processing
cleaned_doc = cleaner.ProcessDocument(doc, [regex_pattern_to_remove])
print(cleaned_doc.page_content)
# Output: "This is a doc with multiple spaces."