Installation

To use EmbeddingOpenAI, you need to install the purecpp_embed Python package:

Before you begin, ensure your environment meets the following requirements:

  • Python 3.9, 3.10, 3.11: PureCPP is compatible with the latest versions of Python.
  • Linux/WSL support: The library is fully compatible with Linux-based systems and Windows Subsystem for Linux (WSL).
  • pip: Ensure pip is installed and updated to the latest version.
pip install purecpp_embed

Introduction

The EmbeddingOpenAI module generates vector embeddings using OpenAI’s embedding models. These embeddings are numerical representations of text, useful for semantic search, document similarity, and information retrieval.

Initialization

To use EmbeddingOpenAI, set your OpenAI API key using the os.environ environment variable.

Example:

import os
from purecpp_embed import EmbeddingOpenAI

os.environ["OPENAI_API_KEY"] = "your_openai_api_key"
embedding = EmbeddingOpenAI()

Generating Embeddings

The EmbeddingOpenAI class provides a method GenerateEmbeddings that expects two arguments:

  • items: A list of Documents (objects with page_content and optional metadata).
  • model: A string representing the OpenAI embedding model you wish to use (e.g., "text-embedding-ada-002").

It returns a list of Documents of the same length, preserving the metadata and page_content, but with an additional field embedding containing the embedding vector (a list of floats).

Example using a text loader and chunk

import os
from purecpp_embed import EmbeddingOpenAI
from purecpp_extract import TXTLoader
from purecpp_chunks_clean import ChunkDefault

os.environ["OPENAI_API_KEY"] = "your_openai_api_key"

# Load a text file
txt_loader = TXTLoader("path/to/your/textfile.txt")
input_data = txt_loader.Load()

# Chunk the loaded document
chunk_default = ChunkDefault(chunk_size=200, overlap=10)
chunked_docs = chunk_default.ProcessDocuments(items=input_data)

# Generate embeddings using a specific model
embedding = EmbeddingOpenAI()
embedding_docs = embedding.GenerateEmbeddings(chunked_docs, model="text-embedding-ada-002")

# Each document now includes an `embedding` field
print(embedding_docs)

For more details on how to chunk documents, refer to the Chunk documentation.