Installation

Initialization

Processing Documents

Using with a Data Loader

ChunkCount

PureCPP

Explore PureCPP, a solution designed for Retrieval-Augmented Generation (RAG) workflows. PureCPP provides techniques and integrations for RAG pipelines, with the speed and on-premises capabilities of a C++ implementation.

PureAI Solutions

Welcome to the Quickstart Guide for **PureCPP**, your all-in-one solution for building **Retrieval-Augmented Generation (RAG)** pipelines with ease and efficiency. This guide will walk you through the steps to get started quickly.

Quickstart Guide

This module allows cleaning document content using regex patterns, removing unwanted characters, extra whitespace, or other artifacts before further processing.

Content Cleaner
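The regex-based cleaning described above can be sketched in plain Python. This is only an illustration using the standard `re` module, not the PureCPP API: the function name `clean_text` and the default pattern list are assumptions made for the example.

```python
import re

# Hypothetical default rules (pattern, replacement); not taken from PureCPP.
DEFAULT_PATTERNS = [
    (r"<[^>]+>", " "),                        # strip HTML-like tags
    (r"[\x00-\x08\x0b\x0c\x0e-\x1f]", ""),    # drop control characters
    (r"[ \t]+", " "),                         # collapse runs of spaces and tabs
    (r" *\n *", "\n"),                        # trim spaces around line breaks
    (r"\n{3,}", "\n\n"),                      # keep at most one blank line
]


def clean_text(text, patterns=DEFAULT_PATTERNS):
    """Apply each (pattern, replacement) rule in order, then trim the ends."""
    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text)
    return text.strip()


if __name__ == "__main__":
    raw = "<p>Hello   world</p>\x00\n\n\n\nBye"
    print(repr(clean_text(raw)))
```

Rules run in order, so later rules see the output of earlier ones. Passing a custom `patterns` list replaces the defaults entirely.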

The **MetadataRegexExtractor** module is designed to extract structured metadata from documents by applying regular expression (regex) patterns. It identifies elements such as proper names, dates, numbers, emails, URLs, and custom patterns.

Metadata Extractor
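As a rough illustration of regex-driven metadata extraction, the sketch below uses Python's standard `re` module. It is not the **MetadataRegexExtractor** interface: the function name `extract_metadata`, the pattern names, and the patterns themselves are assumptions, and they are deliberately simple.

```python
import re

# Hypothetical, simplified patterns; real-world emails, URLs and dates vary more.
PATTERNS = {
    "emails": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "urls": r"https?://[^\s<>]+",
    "dates": r"\b\d{4}-\d{2}-\d{2}\b",  # ISO-style dates only
}


def extract_metadata(text, patterns=PATTERNS):
    """Return a dict mapping each pattern name to the list of matches found."""
    return {name: re.findall(pattern, text) for name, pattern in patterns.items()}


if __name__ == "__main__":
    sample = "Contact ana@example.com on 2024-03-15 or visit https://example.com/docs for details"
    print(extract_metadata(sample))
```

Custom patterns can be added by passing a dict with extra keys. Note that the URL pattern here would also capture trailing punctuation, so production code would need a stricter rule.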

Generate text embeddings using OpenAI's embedding model.

Embedding

Data loaders convert raw data into the standardized PureAI format, ensuring consistency across different data sources. Each loader follows a unified structure, offering a consistent set of methods and a seamless usage experience.

Introduction - Data Loader
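One way to picture a "unified structure" for loaders is an abstract base class that every loader implements. The sketch below is purely illustrative: `Document`, `BaseLoader`, `TextFileLoader`, and the `load()` method are hypothetical names, and the actual PureAI document format is not described in this text.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical standardized record: text plus free-form metadata."""
    content: str
    metadata: dict = field(default_factory=dict)


class BaseLoader(ABC):
    """Hypothetical common interface shared by all loaders."""

    @abstractmethod
    def load(self) -> list:
        """Return a list of Document objects."""


class TextFileLoader(BaseLoader):
    """Loads one local text file as a single Document."""

    def __init__(self, path, encoding="utf-8"):
        self.path = Path(path)
        self.encoding = encoding

    def load(self) -> list:
        text = self.path.read_text(encoding=self.encoding)
        return [Document(content=text, metadata={"source": str(self.path)})]
```

Because every loader returns the same record type, downstream steps such as cleaning or chunking would not need to know where the text came from.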

This data loader allows loading webpages from the internet.

WEB Loader

This data loader allows loading text files from local storage.

TXT Loader

This data loader allows loading PDF files from local storage.

PDF Loader

This data loader allows loading DOCX files from local storage.

DOCX Loader

Chunking modules split large pieces of text into smaller, manageable segments, and overlap between consecutive chunks helps maintain context across them. This makes chunking essential for **Retrieval-Augmented Generation (RAG)** pipelines and other text-processing tasks.

Introduction - Chunks

The **ChunkDefault** module splits large pieces of text into manageable chunks, using overlap to maintain context between segments. This is particularly useful in **Retrieval-Augmented Generation (RAG)** pipelines and other text processing tasks where continuity matters.
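To make the idea of overlapping chunks concrete, here is a minimal character-based sketch in plain Python. It is not the **ChunkDefault** implementation: the function name `chunk_text` and the `chunk_size` parameter are assumptions chosen for the example.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into chunks of up to chunk_size characters.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

A larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more duplicated text.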

| Parameter | Description |
| --- | --- |
| `count_unit` | The element to count before splitting (word, character, regex). |
| `overlap` | Number of characters shared between consecutive chunks. |
| `count_threshold` | Number of times the `count_unit` must appear before splitting. |
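The three parameters above can be read as "split after every `count_threshold` occurrences of `count_unit`, then prepend `overlap` characters from the previous chunk". The sketch below follows that reading in plain Python. It is an assumption about the behavior, not the module's actual implementation: the function name `chunk_by_count` is hypothetical, and `count_unit` is treated here as a regex pattern.

```python
import re


def chunk_by_count(text, count_unit=r"\.", count_threshold=2, overlap=0):
    """Split after every `count_threshold` matches of the regex `count_unit`.

    Each chunk after the first is extended backwards by `overlap` characters
    so that it shares some context with the previous chunk.
    """
    if count_threshold <= 0:
        raise ValueError("count_threshold must be positive")
    if overlap < 0:
        raise ValueError("overlap must not be negative")

    chunks = []
    start = 0   # where the current chunk begins (before overlap is applied)
    seen = 0    # matches counted since the last split
    for match in re.finditer(count_unit, text):
        seen += 1
        if seen == count_threshold:
            end = match.end()
            chunks.append(text[max(0, start - overlap):end])
            start = end
            seen = 0
    if start < len(text):
        # Remaining text that never reached the threshold.
        chunks.append(text[max(0, start - overlap):])
    return chunks


if __name__ == "__main__":
    print(chunk_by_count("A. B. C. D. E", count_unit=r"\.", count_threshold=2))
```

With `count_unit=r"\."` and `count_threshold=2`, each chunk holds roughly two sentences. A word-based split could use a pattern such as `r"\s+"` instead.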

Introduction

Build with PureCPP
