Installation

To use ChunkQuery, you first need to install the purecpp_chunks_clean Python package:

pip install purecpp_chunks_clean
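
As a quick sanity check, the classes used throughout this guide should import without errors:

from purecpp_chunks_clean import ChunkQuery, EmbeddingModel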

Initialization

To initialize ChunkQuery, set the chunk_size and overlap parameters and provide an embedding_model. If you are using OpenAI embeddings, also provide your openai_api_key.

Parameters:

  • chunk_size: Maximum size of each chunk (in characters).
  • overlap: Number of characters shared between consecutive chunks.
  • embedding_model: Embedding model used for similarity calculation (HuggingFace or OpenAI).
  • openai_api_key: API key required if using the OpenAI embedding model.
  • similarity_threshold: Minimum similarity score for a chunk to be included.

Example:

from purecpp_chunks_clean import ChunkQuery, EmbeddingModel

# HuggingFace model
embedding_model = EmbeddingModel(0)
chunk_query = ChunkQuery(chunk_size=200, overlap=2, embedding_model=embedding_model)

# OpenAI model
embedding_model = EmbeddingModel(1)
chunk_query = ChunkQuery(
    chunk_size=200,
    overlap=2,
    embedding_model=embedding_model,
    openai_api_key="your_openai_api_key"
)

Embedding Model

The embedding_model parameter specifies which model to use for generating embeddings. The embeddings are used to calculate the similarity between the query and document chunks. Two options are available:

  • HuggingFace (0): Uses the SentenceTransformer model (all-MiniLM-L6-v2).
  • OpenAI (1): Uses OpenAI’s embedding model (text-embedding-ada-002). Requires an openai_api_key.

Example:

from purecpp_chunks_clean import EmbeddingModel

# HuggingFace model
embedding_model = EmbeddingModel(0)

# OpenAI model
embedding_model = EmbeddingModel(1)

Setting the OpenAI API Key

If you are using OpenAI’s embedding model, set the OPENAI_API_KEY environment variable in the terminal:

export OPENAI_API_KEY=<your_key>
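
Alternatively, you can read the key in Python and pass it to ChunkQuery explicitly. A minimal sketch using the standard library's os.getenv and the constructor shown above:

import os
from purecpp_chunks_clean import ChunkQuery, EmbeddingModel

# Read the key from the environment and pass it to the constructor
api_key = os.getenv("OPENAI_API_KEY")

embedding_model = EmbeddingModel(1)  # OpenAI
chunk_query = ChunkQuery(
    chunk_size=200,
    overlap=2,
    embedding_model=embedding_model,
    openai_api_key=api_key
)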

Processing Documents with ProcessDocuments

The ProcessDocuments method processes multiple input documents and filters the chunks based on their similarity to a query.

Parameters:

  • items: List of LoaderDataStruct documents to process.
  • query: The search query used for similarity comparison.
  • similarity_threshold: Minimum similarity score required for a chunk to be included in the output.
  • max_workers: Number of parallel workers for processing multiple documents.

Example with HuggingFace:

from purecpp_chunks_clean import ChunkQuery, EmbeddingModel

embedding_model = EmbeddingModel(0)
chunk_query = ChunkQuery(chunk_size=200, overlap=2, embedding_model=embedding_model)

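# input_data_1 and input_data_2 are assumed to be LoaderDataStruct instances created beforehand (see the 'items' parameter above)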
documents = chunk_query.ProcessDocuments(
    items=[input_data_1, input_data_2],
    query="atomic models",
    similarity_threshold=0.5,
    max_workers=4
)

for doc in documents:
    print(doc.StringRepr())

Example with OpenAI:

from purecpp_chunks_clean import ChunkQuery, EmbeddingModel

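# Assumes OPENAI_API_KEY is set in the environment (see 'Setting the OpenAI API Key' above)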
embedding_model = EmbeddingModel(1)
chunk_query = ChunkQuery(chunk_size=200, overlap=2, embedding_model=embedding_model)

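# input_data is assumed to be a LoaderDataStruct instance created beforehand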
documents = chunk_query.ProcessDocuments(
    items=[input_data],
    query="atomic models",
    similarity_threshold=0.5,
    max_workers=4
)

for doc in documents:
    print(doc.StringRepr())