Split text into chunks and filter them based on similarity to a query.
This functionality is provided by the `purecpp_chunks_clean` Python package.
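Assuming the package is published on PyPI under that name, a typical install is:

```bash
pip install purecpp_chunks_clean
```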
Set the `chunk_size` and `overlap`, and provide an `embedding_model`. If using OpenAI embeddings, also provide your `openai_api_key`.
| Parameter | Description |
| --- | --- |
| `chunk_size` | Maximum size of each chunk (in characters). |
| `overlap` | Number of characters shared between consecutive chunks. |
| `embedding_model` | Embedding model used for similarity calculation (HuggingFace or OpenAI). |
| `openai_api_key` | API key required if using the OpenAI embedding model. |
| `similarity_threshold` | Minimum similarity score for a chunk to be included. |
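A minimal configuration sketch; the class name `ChunkSimilarity` used below is illustrative and may differ in the installed version of the package:

```python
from purecpp_chunks_clean import ChunkSimilarity  # hypothetical class name

# HuggingFace embeddings (embedding_model=0); no API key needed.
chunker = ChunkSimilarity(
    chunk_size=500,            # maximum characters per chunk
    overlap=100,               # characters shared between consecutive chunks
    embedding_model=0,         # 0 = SentenceTransformer, 1 = OpenAI
    similarity_threshold=0.5,  # minimum similarity for a chunk to be kept
)

# OpenAI embeddings (embedding_model=1) additionally require an API key.
openai_chunker = ChunkSimilarity(
    chunk_size=500,
    overlap=100,
    embedding_model=1,
    openai_api_key="your-openai-api-key",
    similarity_threshold=0.5,
)
```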
The `embedding_model` parameter specifies which model to use for generating embeddings. The embeddings are used to calculate the similarity between the query and document chunks. Two options are available:

- `0` (HuggingFace): Uses the SentenceTransformer model (`all-MiniLM-L6-v2`).
- `1` (OpenAI): Uses OpenAI's embedding model (`text-embedding-ada-002`). Requires an `openai_api_key`.

Set the `OPENAI_API_KEY` environment variable in the terminal:
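For example, in a bash/zsh shell (the key value below is a placeholder):

```bash
export OPENAI_API_KEY="your-openai-api-key"
```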
ProcessDocuments

The `ProcessDocuments` method processes multiple input documents and filters the chunks based on their similarity to a query.
| Parameter | Description |
| --- | --- |
| `items` | List of `LoaderDataStruct` documents to process. |
| `query` | The search query used for similarity comparison. |
| `similarity_threshold` | Minimum similarity score required for a chunk to be included in the output. |
| `max_workers` | Number of parallel workers for processing multiple documents. |
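A usage sketch, assuming the `chunker` instance configured above; `documents` stands in for a list of `LoaderDataStruct` items produced elsewhere (for example by a document loader), and keyword arguments are shown for readability even though the binding may expect positional arguments:

```python
# 'documents' is a list of LoaderDataStruct items obtained from a loader.
results = chunker.ProcessDocuments(
    items=documents,
    query="What are the key findings of the report?",
    similarity_threshold=0.5,  # keep only chunks with similarity >= 0.5
    max_workers=4,             # process up to 4 documents in parallel
)

# Inspect the retained chunks.
for chunk in results:
    print(chunk)
```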