# ChunkQuery

Split text into chunks and filter them by similarity to a query.
## Installation

To use `ChunkQuery`, you first need to install the `purecpp_chunks_clean` Python package:
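A typical installation, assuming the package is published on PyPI under this name:

```shell
pip install purecpp_chunks_clean
```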
## Initialization

To initialize `ChunkQuery`, set the `chunk_size` and `overlap`, and provide an `embedding_model`. If using OpenAI embeddings, also provide your `openai_api_key`.
| Parameter | Description |
|---|---|
| `chunk_size` | Maximum size of each chunk, in characters. |
| `overlap` | Number of characters shared between consecutive chunks. |
| `embedding_model` | Embedding model used for similarity calculation (HuggingFace or OpenAI). |
| `openai_api_key` | API key, required when using the OpenAI embedding model. |
| `similarity_threshold` | Minimum similarity score for a chunk to be included. |
Example:
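A minimal initialization sketch. The keyword names below follow the parameter table above, and the values (`500`, `100`) are illustrative; the exact constructor signature may differ in your installed version:

```python
from purecpp_chunks_clean import ChunkQuery

# HuggingFace embeddings (option 0) need no API key.
chunk_query = ChunkQuery(
    chunk_size=500,     # maximum characters per chunk
    overlap=100,        # characters shared between consecutive chunks
    embedding_model=0,  # 0 = HuggingFace (all-MiniLM-L6-v2)
)
```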
## Embedding Model

The `embedding_model` parameter specifies which model is used to generate embeddings. The embeddings are used to calculate the similarity between the query and document chunks. Two options are available:

- HuggingFace (`0`): uses the SentenceTransformer model `all-MiniLM-L6-v2`.
- OpenAI (`1`): uses OpenAI's embedding model `text-embedding-ada-002`. Requires an `openai_api_key`.
Example:
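A sketch showing both options. As above, the keyword names are taken from the parameter table and the placeholder key is illustrative:

```python
from purecpp_chunks_clean import ChunkQuery

# Option 0: HuggingFace SentenceTransformer (all-MiniLM-L6-v2), no key needed.
hf_query = ChunkQuery(chunk_size=500, overlap=100, embedding_model=0)

# Option 1: OpenAI text-embedding-ada-002, requires an API key.
openai_query = ChunkQuery(
    chunk_size=500,
    overlap=100,
    embedding_model=1,
    openai_api_key="<your-openai-api-key>",  # placeholder, not a real key
)
```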
## Setting the OpenAI API Key

If you are using OpenAI's embedding model, set the `OPENAI_API_KEY` environment variable in the terminal:
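On Linux or macOS this is done with `export` (Windows users would use `setx` or `$env:` instead):

```shell
export OPENAI_API_KEY="<your-openai-api-key>"
```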
## Processing Documents with ProcessDocuments

The `ProcessDocuments` method processes multiple input documents and filters their chunks based on similarity to a query.
Parameters:
| Parameter | Description |
|---|---|
| `items` | List of `LoaderDataStruct` documents to process. |
| `query` | The search query used for similarity comparison. |
| `similarity_threshold` | Minimum similarity score required for a chunk to be included in the output. |
| `max_workers` | Number of parallel workers for processing multiple documents. |
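A usage sketch for `ProcessDocuments`. It assumes `documents` is a list of `LoaderDataStruct` instances produced elsewhere (e.g. by one of the library's loaders); the parameter names come from the table above, and the threshold and worker count are illustrative:

```python
from purecpp_chunks_clean import ChunkQuery

chunk_query = ChunkQuery(chunk_size=500, overlap=100, embedding_model=0)

# `documents` is assumed to be a pre-loaded list of LoaderDataStruct items.
filtered_chunks = chunk_query.ProcessDocuments(
    items=documents,
    query="What is retrieval-augmented generation?",
    similarity_threshold=0.6,  # keep only chunks scoring at least 0.6
    max_workers=4,             # process up to 4 documents in parallel
)
```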