Installation
To use ChunkSimilarity, you first need to install the purecpp_chunks_clean Python package, for example with pip install purecpp_chunks_clean.
Initialization
To initialize ChunkSimilarity, set chunk_size and overlap, and provide an embedding_model. If you use OpenAI embeddings, also provide your openai_api_key.
| Parameter | Description |
|---|---|
| chunk_size | Maximum size of each chunk (in characters). |
| overlap | Number of characters shared between consecutive chunks. |
| embedding_model | Embedding model used for similarity calculation (HuggingFace or OpenAI). |
| openai_api_key | API key required if using the OpenAI embedding model. |
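
To make the relationship between chunk_size and overlap concrete, here is a minimal sketch of character-based chunking with a sliding window. It is not the library's implementation; the function name split_into_chunks is hypothetical, and the exact boundary handling inside ChunkSimilarity may differ.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Hypothetical helper: split text into chunks of at most chunk_size
    characters, where consecutive chunks share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = split_into_chunks("abcdefghij", chunk_size=4, overlap=2)
print(chunks)
```

With chunk_size=4 and overlap=2, each chunk repeats the last two characters of the previous one, so a larger overlap produces more chunks for the same text.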
Embedding Model
The embedding_model parameter specifies which model to use for generating embeddings. The embeddings are used to calculate the similarity between document chunks. Two options are available:
- HuggingFace (0): Uses the SentenceTransformer model (all-MiniLM-L6-v2).
- OpenAI (1): Uses OpenAI’s embedding model (text-embedding-ada-002). Requires an openai_api_key.
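
The two options can be summarized as a small lookup with a validation step. This is only an illustrative sketch of the rule described above (option 1 needs a key); the constant names and the resolve_embedding_model function are hypothetical and are not part of the package.

```python
from typing import Optional

# Hypothetical names for the two documented option values.
HUGGINGFACE = 0
OPENAI = 1

# Model names as listed in the options above.
MODEL_NAMES = {
    HUGGINGFACE: "all-MiniLM-L6-v2",
    OPENAI: "text-embedding-ada-002",
}


def resolve_embedding_model(embedding_model: int,
                            openai_api_key: Optional[str] = None) -> str:
    """Return the model name for an option value, enforcing the key rule."""
    if embedding_model not in MODEL_NAMES:
        raise ValueError("embedding_model must be 0 (HuggingFace) or 1 (OpenAI)")
    if embedding_model == OPENAI and not openai_api_key:
        raise ValueError("openai_api_key is required for the OpenAI model")
    return MODEL_NAMES[embedding_model]


print(resolve_embedding_model(HUGGINGFACE))
```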
Setting the OpenAI API Key
If you are using OpenAI’s embedding model, set the OPENAI_API_KEY environment variable in the terminal before running your code (for example, with export OPENAI_API_KEY=<your-key> in a POSIX shell).
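
Once the variable is set, Python code can read it from the environment instead of hard-coding the key. The sketch below shows one way to do that and fail early with a clear message; the helper name get_openai_api_key is hypothetical, and whether ChunkSimilarity reads the variable itself is not stated here.

```python
import os
from typing import Mapping


def get_openai_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: read OPENAI_API_KEY or raise a clear error."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it in the terminal first"
        )
    return key


# Example with an explicit mapping and a placeholder value (not a real key).
print(get_openai_api_key({"OPENAI_API_KEY": "sk-placeholder"}))
```

The returned value can then be passed as the openai_api_key parameter.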
Processing a Single Document with ProcessSingleDocument
The ProcessSingleDocument method processes a single document and sorts the chunks based on their similarity.
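
As a rough illustration of what sorting chunks by similarity can look like, the sketch below ranks chunks by the cosine similarity of their embeddings to a reference vector. The toy vectors, the choice of reference, and the function names are all assumptions made for illustration; this section does not specify which similarity measure or reference ProcessSingleDocument actually uses.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sort_chunks_by_similarity(
    chunks: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    reference: Sequence[float],
) -> List[Tuple[float, str]]:
    """Hypothetical helper: return (score, chunk) pairs, most similar first."""
    scored = [
        (cosine_similarity(emb, reference), chunk)
        for chunk, emb in zip(chunks, embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Toy 2-dimensional "embeddings"; real models produce much longer vectors.
chunks = ["chunk A", "chunk B", "chunk C"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = sort_chunks_by_similarity(chunks, embeddings, reference=[1.0, 0.0])
print([chunk for _, chunk in ranked])
```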
Example with HuggingFace:
Example with OpenAI:
Processing Multiple Documents with ProcessDocuments
The ProcessDocuments method processes multiple input documents and sorts the chunks based on their similarity.
Parameters:
| Parameter | Description |
|---|---|
| items | List of LoaderDataStruct documents to process. |
| max_workers | Number of parallel workers for processing. |
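
The effect of max_workers can be illustrated with Python's standard thread pool. This is a sketch of the general pattern only: process_in_parallel and the per-item function are hypothetical, and whether ProcessDocuments uses threads, processes, or native workers internally is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_in_parallel(
    items: Sequence[T],
    process_one: Callable[[T], R],
    max_workers: int = 4,
) -> List[R]:
    """Hypothetical helper: apply process_one to every item using up to
    max_workers threads. Results come back in the same order as the input."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, items))


# Stand-in documents and a stand-in per-document function (string length).
documents = ["first document", "second", "third one"]
print(process_in_parallel(documents, len, max_workers=2))
```

A higher max_workers value lets more documents be processed at once, at the cost of more concurrent load on the machine or the embedding API.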