Installation

To use ChunkDefault, you first need to install the purecpp_chunks_clean Python package:

pip install purecpp_chunks_clean

Initialization

To use the ChunkDefault module, you first need to create an instance by specifying the chunk_size and overlap. These parameters define how the text will be split, ensuring that each chunk stays within the defined size and shares context with the following chunk.

ParameterDescription
chunk_sizeMaximum size of each chunk (in characters).
overlapNumber of characters shared between consecutive chunks.

Note: overlap must be smaller than chunk_size, otherwise an error will be raised.

Example:

from purecpp_chunks_clean import ChunkDefault

chunk_default = ChunkDefault(chunk_size=200, overlap=10)

Processing Documents

To process a list of documents and split them into chunks, use the ProcessDocuments method. Each resulting chunk will also be an instance of Document.

The max_workers parameter controls the number of concurrent threads used during processing.

from purecpp_chunks_clean import ChunkDefault

chunk_default = ChunkDefault(chunk_size=200, overlap=10)
documents = chunk_default.ProcessDocuments(
    items=[single_input_data, single_input_data_2],
    max_workers=5
)

print(documents)

Each input document is processed into multiple chunks, and the output is a list of Document instances containing the chunked text data.

Using with a Data Loader

You can also use ChunkDefault with a data loader to process text files directly:

from purecpp_libs import TXTLoader
from purecpp_chunks_clean import ChunkDefault

# Load text file
txt_loader = TXTLoader("path/to/your/textfile.txt")
input_data = txt_loader.Load()

# Process documents
chunk_default = ChunkDefault(chunk_size=200, overlap=10)
documents = chunk_default.ProcessDocuments(items=input_data)

print(documents)