Chunks
ChunkCount
Split text based on a count pattern.
Installation
To use ChunkCount, you first need to install the purecpp_chunks_clean Python package:
Initialization
To initialize ChunkCount, define count_unit, overlap, and count_threshold.
Parameter | Description |
---|---|
count_unit | The element to count before splitting (word, character, regex). |
overlap | Number of characters shared between consecutive chunks. |
count_threshold | Number of times the count_unit must appear before splitting. |
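
To make the interaction between the three parameters concrete, here is a minimal plain-Python sketch of count-based splitting. It is not the library's implementation: the function name `split_by_count`, the choice to express count_unit as a regular expression, and the exact handling of overlap are all assumptions made for illustration.

```python
import re


def split_by_count(text, count_unit=r"\S+", count_threshold=3, overlap=0):
    """Illustrative only: cut `text` after every `count_threshold` matches
    of the regex `count_unit`, prefixing each later chunk with the last
    `overlap` characters of the text that precedes it."""
    if count_threshold < 1 or overlap < 0:
        raise ValueError("count_threshold must be >= 1 and overlap >= 0")

    chunks = []
    start = 0  # index where the current chunk's own text begins
    seen = 0   # matches counted since the last cut

    for match in re.finditer(count_unit, text):
        seen += 1
        if seen == count_threshold:
            end = match.end()
            # Reach back `overlap` characters into the previous chunk.
            chunks.append(text[max(0, start - overlap):end])
            start = end
            seen = 0

    # Keep whatever is left after the final cut.
    if start < len(text):
        chunks.append(text[max(0, start - overlap):])
    return chunks


# Count single characters, cut every 4, share 1 character between chunks.
print(split_by_count("abcdefghij", count_unit=r".", count_threshold=4, overlap=1))
# ['abcd', 'defgh', 'hij']
```

In this sketch a word-based split would use a pattern such as `\S+`, and a character-based split uses `.`; how ChunkCount itself names or interprets these units may differ.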
Example:
Processing Documents
To process a list of documents and split them into chunks, use the ProcessDocuments method. Each resulting chunk will also be an instance of Document.
The max_workers parameter controls the number of concurrent threads used during processing.
Each input document is split into one or more chunks, and the output is a list of Document instances containing the chunked text.
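
The thread-pool pattern described above can be sketched with the standard library alone. This is a hedged illustration, not ChunkCount's code: `Doc` is a hypothetical stand-in for the library's Document class, and `chunk_one` and `process_documents` are hypothetical helpers that use a simple fixed-size split in place of count-based splitting.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for the library's Document class."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_one(doc, size=10):
    """Split one document into fixed-size pieces (placeholder strategy)."""
    return [
        Doc(doc.text[i:i + size], dict(doc.metadata))
        for i in range(0, len(doc.text), size)
    ]


def process_documents(docs, max_workers=4):
    """Chunk every document on a thread pool and flatten the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so chunk order is stable.
        per_doc = pool.map(chunk_one, docs)
        return [chunk for chunks in per_doc for chunk in chunks]


docs = [Doc("a" * 25, {"id": 1}), Doc("b" * 5, {"id": 2})]
chunks = process_documents(docs, max_workers=2)
print(len(chunks))  # 4
```

Each chunk here copies its parent's metadata so it can be traced back to the source document; whether ProcessDocuments does the same is not stated in this page.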
Using with a Data Loader
You can also use ChunkCount with a data loader to process text files directly: