Installation

To use ChunkSimilarity, you first need to install the purecpp_chunks_clean Python package:

pip install purecpp_chunks_clean

Initialization

To initialize the ChunkSimilarity, set the chunk_size, overlap, and provide an embedding_model. If using OpenAI embeddings, provide your openai_api_key.

ParameterDescription
chunk_sizeMaximum size of each chunk (in characters).
overlapNumber of characters shared between consecutive chunks.
embedding_modelEmbedding model used for similarity calculation (HuggingFace or OpenAI).
openai_api_keyAPI key required if using the OpenAI embedding model.

Example:

from purecpp_chunks_clean import ChunkSimilarity, EmbeddingModel

# HuggingFace model
embedding_model = EmbeddingModel(0)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model)

# OpenAI model
embedding_model = EmbeddingModel(1)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model, openai_api_key="your_openai_api_key")

Embedding Model

The embedding_model parameter specifies which model to use for generating embeddings. The embeddings are used to calculate the similarity between document chunks. Two options are available:

  • HuggingFace (0): Uses the SentenceTransformer model (all-MiniLM-L6-v2).
  • OpenAI (1): Uses OpenAI’s embedding model (text-embedding-ada-002). Requires an openai_api_key.

Example:

from purecpp_chunks_clean import EmbeddingModel

# HuggingFace model
embedding_model = EmbeddingModel(0)

# OpenAI model
embedding_model = EmbeddingModel(1)

Setting the OpenAI API Key

If you are using OpenAI’s embedding model, set the OPENAI_API_KEY environment variable in the terminal:

export OPENAI_API_KEY=<your_key>

Processing a Single Document with ProcessSingleDocument

The ProcessSingleDocument method processes a single document and sorts the chunks based on their similarity.

Example with HuggingFace:

from purecpp_chunks_clean import ChunkSimilarity, EmbeddingModel

embedding_model = EmbeddingModel(0)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model)

documents = chunk_similarity.ProcessSingleDocument(single_input_data)
for doc in documents:
    print(doc.StringRepr())

Example with OpenAI:

from purecpp_chunks_clean import ChunkSimilarity, EmbeddingModel

embedding_model = EmbeddingModel(1)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model, openai_api_key="your_openai_api_key")

documents = chunk_similarity.ProcessSingleDocument(single_input_data)
for doc in documents:
    print(doc.StringRepr())

Processing Multiple Documents with ProcessDocuments

The ProcessDocuments method processes multiple input documents and sorts the chunks based on their similarity.

Parameters:

ParameterDescription
itemsList of LoaderDataStruct documents to process.
max_workersNumber of parallel workers for processing.

Example with HuggingFace:

from purecpp_chunks_clean import ChunkSimilarity, EmbeddingModel

embedding_model = EmbeddingModel(0)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model)

documents = chunk_similarity.ProcessDocuments(
    items=[single_input_data, single_input_data_2],
    max_workers=4
)

for doc in documents:
    print(doc.StringRepr())

Example with OpenAI:

from purecpp_chunks_clean import ChunkSimilarity, EmbeddingModel

embedding_model = EmbeddingModel(1)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model, openai_api_key="your_openai_api_key")

documents = chunk_similarity.ProcessDocuments(
    items=[single_input_data, single_input_data_2],
    max_workers=4
)

for doc in documents:
    print(doc.StringRepr())