Installation

To use ChunkCount, you first need to install the purecpp_chunks_clean Python package:

pip install purecpp_chunks_clean

Initialization

To initialize ChunkCount, define the count_unit, overlap, and count_threshold.

ParameterDescription
count_unitThe element to count before splitting (word, character, regex).
overlapNumber of characters shared between consecutive chunks.
count_thresholdNumber of times the count_unit must appear before splitting.

Example:

from purecpp_chunks_clean import ChunkCount

chunk_count = ChunkCount(count_unit="keyword", overlap=500, count_threshold=2)

Processing Documents

To process a list of documents and split them into chunks, use the ProcessDocuments method. Each resulting chunk will also be an instance of Document.

The max_workers parameter controls the number of concurrent threads used during processing.

from purecpp_chunks_clean import ChunkCount

chunk_count = ChunkCount(count_unit="keyword", overlap=500, count_threshold=2)
documents = chunk_count.ProcessDocuments(
    items=[input_data_1, input_data_2],
    max_workers=4
)

print(documents)

Each input document is processed into multiple chunks, and the output is a list of Document instances containing the chunked text data.

Using with a Data Loader

You can also use ChunkCount with a data loader to process text files directly:

from purecpp_libs import TXTLoader
from purecpp_chunks_clean import ChunkCount

# Load text file
txt_loader = TXTLoader("path/to/your/textfile.txt")
input_data = txt_loader.Load()

# Process documents
chunk_count = ChunkCount(count_unit="text", overlap=200, count_threshold=1)
documents = chunk_count.ProcessDocuments(items=input_data)

print(documents)