This module cleans document content with regex patterns, removing unwanted characters, extra whitespace, and other artifacts before further processing.
`purecpp_extract` Python package.
Before you begin, ensure your environment meets the following requirements:
Name | Regex Pattern | Description |
---|---|---|
Extra Spaces | \s+ | Replaces each run of whitespace with a single space. |
Non-ASCII Characters | [^\x00-\x7F]+ | Removes all non-ASCII characters. |
Symbols at Line Edges | ^\W+\|\W+$ | Removes symbols at the start and end of lines. |
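
To make the effect of the patterns in the table above concrete, here is a minimal sketch that applies them with Python's standard `re` module. It does not use the `purecpp_extract` API: the function name `clean_text`, the constant names, and the order in which the patterns run are assumptions made for illustration only.

```python
import re

# Patterns taken from the table above. The constant names are illustrative,
# not identifiers from purecpp_extract.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")
# re.MULTILINE makes ^ and $ match at the edges of every line, not just the string.
EDGE_SYMBOLS = re.compile(r"^\W+|\W+$", flags=re.MULTILINE)
EXTRA_SPACES = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Hypothetical helper: apply the three patterns in a fixed order."""
    text = NON_ASCII.sub("", text)      # drop non-ASCII characters
    text = EDGE_SYMBOLS.sub("", text)   # strip symbols at line edges
    text = EXTRA_SPACES.sub(" ", text)  # collapse whitespace runs (newlines included)
    return text.strip()


if __name__ == "__main__":
    print(clean_text("  ***Hello,   wörld!***  \n--next   line--"))
```

Note that `\s+` also matches newlines, so in this sketch the line-edge pattern has to run before whitespace is collapsed; otherwise the text would already be a single line by the time it is applied.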
`RAGDocument` class from `purecpp_libs`.