Installation
To use ChunkSimilarity, you first need to install the purecpp_chunks_clean Python package:
pip install purecpp_chunks_clean
Initialization
To initialize ChunkSimilarity, set chunk_size and overlap, and provide an embedding_model. If you use OpenAI embeddings, also provide your openai_api_key.
| Parameter | Description |
|---|---|
| chunk_size | Maximum size of each chunk (in characters). |
| overlap | Number of characters shared between consecutive chunks. |
| embedding_model | Embedding model used for similarity calculation (HuggingFace or OpenAI). |
| openai_api_key | API key required if using the OpenAI embedding model. |
Example:
from purecpp_chunks_clean import ChunkSimilarity, EmbeddingModel
# HuggingFace model
embedding_model = EmbeddingModel(0)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model)
# OpenAI model
embedding_model = EmbeddingModel(1)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model, openai_api_key="your_openai_api_key")
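To make the chunk_size and overlap parameters concrete, here is a minimal pure-Python sketch of character-based chunking with overlap. It is an illustration only and not the library's implementation: the function name split_with_overlap and the exact boundary handling are assumptions made for this example.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters. Hypothetical helper,
    not part of purecpp_chunks_clean.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghij"
chunks = split_with_overlap(sample, chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With chunk_size=200 and overlap=2, as in the examples above, each window would advance by 198 characters under this scheme.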
Embedding Model
The embedding_model parameter specifies which model to use for generating embeddings. The embeddings are used to calculate the similarity between document chunks. Two options are available:
- HuggingFace (0): Uses the SentenceTransformer model all-MiniLM-L6-v2.
- OpenAI (1): Uses OpenAI’s embedding model text-embedding-ada-002. Requires an openai_api_key.
Example:
from purecpp_chunks_clean import EmbeddingModel
# HuggingFace model
embedding_model = EmbeddingModel(0)
# OpenAI model
embedding_model = EmbeddingModel(1)
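The text above says embeddings are used to calculate similarity between chunks but does not name the metric. A common choice for embedding vectors is cosine similarity; the sketch below assumes that metric and uses tiny hand-written vectors in place of real model output. The function names and the idea of ranking against a single reference vector are assumptions for illustration, not the library's documented behavior.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(reference_vec, chunk_vecs):
    """Return chunk indices ordered from most to least similar."""
    scores = [cosine_similarity(reference_vec, v) for v in chunk_vecs]
    return sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 2-dimensional "embeddings"; real models produce much longer vectors.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order = rank_by_similarity([1.0, 0.1], toy_vectors)
print(order)  # [0, 2, 1]
```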
Setting the OpenAI API Key
If you are using OpenAI’s embedding model, set the OPENAI_API_KEY environment variable in the terminal:
export OPENAI_API_KEY=<your_key>
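The examples in this document pass the key through the openai_api_key argument, and the text does not say whether the library reads OPENAI_API_KEY on its own. One cautious approach is to read the variable in Python and pass it explicitly. The helper below is a hypothetical sketch (resolve_openai_key is not part of purecpp_chunks_clean); it prefers an explicit key and falls back to the environment.

```python
import os


def resolve_openai_key(explicit_key=None, env=None):
    """Return an API key from the argument or the OPENAI_API_KEY variable.

    Hypothetical helper for illustration. `env` defaults to os.environ and
    can be replaced with a plain dict, which is convenient for testing.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "No OpenAI API key found: pass one explicitly or set OPENAI_API_KEY."
        )
    return key


# Demonstration with a stand-in environment instead of the real one.
demo_key = resolve_openai_key(env={"OPENAI_API_KEY": "example-key"})
print(demo_key)  # example-key
```

The returned value can then be supplied as openai_api_key when constructing ChunkSimilarity with EmbeddingModel(1).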
Processing a Single Document with ProcessSingleDocument
The ProcessSingleDocument method processes a single document and sorts the chunks based on their similarity.
Example with HuggingFace:
from purecpp_chunks_clean import ChunkSimilarity, EmbeddingModel
embedding_model = EmbeddingModel(0)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model)
# single_input_data is a previously loaded document (not defined in this snippet)
documents = chunk_similarity.ProcessSingleDocument(single_input_data)
for doc in documents:
    print(doc.StringRepr())
Example with OpenAI:
from purecpp_chunks_clean import ChunkSimilarity, EmbeddingModel
embedding_model = EmbeddingModel(1)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model, openai_api_key="your_openai_api_key")
documents = chunk_similarity.ProcessSingleDocument(single_input_data)
for doc in documents:
    print(doc.StringRepr())
Processing Multiple Documents with ProcessDocuments
The ProcessDocuments method processes multiple input documents and sorts the chunks based on their similarity.
Parameters:
| Parameter | Description |
|---|---|
| items | List of LoaderDataStruct documents to process. |
| max_workers | Number of parallel workers for processing. |
Example with HuggingFace:
from purecpp_chunks_clean import ChunkSimilarity, EmbeddingModel
embedding_model = EmbeddingModel(0)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model)
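# single_input_data and single_input_data_2 are LoaderDataStruct documents
# loaded beforehand (not defined in this snippet)
documents = chunk_similarity.ProcessDocuments(
    items=[single_input_data, single_input_data_2],
    max_workers=4
)
for doc in documents:
    print(doc.StringRepr())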
Example with OpenAI:
from purecpp_chunks_clean import ChunkSimilarity, EmbeddingModel
embedding_model = EmbeddingModel(1)
chunk_similarity = ChunkSimilarity(chunk_size=200, overlap=2, embedding_model=embedding_model, openai_api_key="your_openai_api_key")
documents = chunk_similarity.ProcessDocuments(
    items=[single_input_data, single_input_data_2],
    max_workers=4
)
for doc in documents:
    print(doc.StringRepr())
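To give a rough idea of what a max_workers setting controls, the sketch below fans a list of inputs out to a worker pool using Python's standard concurrent.futures module. This is an analogy only: the document does not say how ProcessDocuments parallelizes its work internally, and process_one is a stand-in that simply cuts text into fixed-size pieces.

```python
from concurrent.futures import ThreadPoolExecutor


def process_one(text: str) -> list[str]:
    """Stand-in for per-document work: cut text into 5-character pieces."""
    return [text[i:i + 5] for i in range(0, len(text), 5)]


def process_many(items, max_workers=4):
    """Process each item in a pool of at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        per_document = list(pool.map(process_one, items))
    # Flatten the per-document lists into one list of pieces.
    return [piece for pieces in per_document for piece in pieces]


results = process_many(["hello world", "abc"], max_workers=2)
print(results)  # ['hello', ' worl', 'd', 'abc']
```

A larger max_workers value lets more documents be handled at once, at the cost of more concurrent resource use; a sensible value depends on the machine and, for the OpenAI model, on API rate limits.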