ChunkDefault
The ChunkDefault module splits large pieces of text into manageable chunks, using overlap to maintain context between segments. This is particularly useful in Retrieval-Augmented Generation (RAG) pipelines and other text processing tasks where continuity matters.
Installation
To use ChunkDefault, you first need to install the purecpp_chunks_clean Python package:
Initialization
To use the ChunkDefault module, you first need to create an instance by specifying the chunk_size and overlap. These parameters define how the text will be split, ensuring that each chunk stays within the defined size and shares context with the following chunk.
Parameter | Description |
---|---|
chunk_size | Maximum size of each chunk (in characters). |
overlap | Number of characters shared between consecutive chunks. |
Note: overlap must be smaller than chunk_size; otherwise an error will be raised.
Example:
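To make the interaction between chunk_size and overlap concrete, here is a minimal pure-Python sketch of overlap-based splitting. It is an illustration only, not the library's implementation: the function name chunk_text is hypothetical, and the choice of ValueError for an invalid overlap is an assumption (the text above says only that an error is raised).

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters. Hypothetical helper
    for illustration; not part of purecpp_chunks_clean.
    """
    # Assumption: an invalid overlap is reported with ValueError.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

With chunk_size=4 and overlap=2 the window advances two characters at a time, so the last two characters of each chunk reappear at the start of the next one.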
Processing Documents
To process a list of documents and split them into chunks, use the ProcessDocuments method. Each resulting chunk will also be an instance of Document.
The max_workers parameter controls the number of concurrent threads used during processing.
Each input document is processed into multiple chunks, and the output is a list of Document instances containing the chunked text data.
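The flow described above (a list of documents in, a flat list of chunk documents out, with a thread pool sized by max_workers) can be sketched in plain Python as follows. Everything here is a stand-in for illustration: the Document dataclass with page_content and metadata fields, and the functions split_document and process_documents, are hypothetical and do not reproduce the library's actual classes or signatures.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical stand-in for the library's Document type.
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_document(doc: Document, chunk_size: int, overlap: int) -> list[Document]:
    """Split one document into overlapping chunk documents."""
    step = chunk_size - overlap
    pieces = []
    start = 0
    while start < len(doc.page_content):
        text = doc.page_content[start:start + chunk_size]
        # Each chunk gets its own copy of the parent's metadata.
        pieces.append(Document(text, dict(doc.metadata)))
        if start + chunk_size >= len(doc.page_content):
            break
        start += step
    return pieces


def process_documents(docs, chunk_size, overlap, max_workers=4):
    """Chunk every document using up to max_workers threads."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so chunks stay grouped
        # by their source document.
        per_doc = pool.map(lambda d: split_document(d, chunk_size, overlap), docs)
        return [chunk for chunks in per_doc for chunk in chunks]


if __name__ == "__main__":
    docs = [
        Document("abcdefghij", {"source": "a.txt"}),
        Document("xyz", {"source": "b.txt"}),
    ]
    for chunk in process_documents(docs, chunk_size=4, overlap=2, max_workers=2):
        print(chunk.metadata["source"], chunk.page_content)
```

Copying the metadata onto every chunk is a design choice of this sketch; it keeps each chunk traceable to its source document when the chunks are later indexed or retrieved.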
Using with a Data Loader
You can also use ChunkDefault with a data loader to process text files directly: