Installation

Default Cleaning Patterns

Initialization

Usage

Cleaning a Document

Example

Using Custom Cleaning Patterns

Content Cleaner

This module cleans document content with regex patterns, removing unwanted characters, extra whitespace, and other artifacts before further processing.

| Name | Regex Pattern | Description |
| --- | --- | --- |
| Extra Spaces | `\s+` | Replaces each run of whitespace with a single space. |
| Non-ASCII Characters | `[^\x00-\x7F]+` | Removes all non-ASCII characters. |
| Symbols at Line Edges | `^\W+\|\W+$` | Removes non-word characters at the start and end of lines. |
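
The effect of these three patterns can be sketched with Python's standard `re` module. This is an illustration only and not PureCPP's API: the helper name `clean_text`, the order in which the patterns are applied, the use of multiline mode, and the reading of the line-edge pattern as `^\W+|\W+$` are all assumptions.

```python
import re

# Assumed order: non-ASCII first, then line edges, then whitespace.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")
# MULTILINE (an assumption) makes ^ and $ match at every line boundary.
LINE_EDGE_SYMBOLS = re.compile(r"^\W+|\W+$", re.MULTILINE)
EXTRA_SPACES = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Apply the three default patterns from the table (hypothetical helper)."""
    text = NON_ASCII.sub("", text)          # drop non-ASCII characters
    text = LINE_EDGE_SYMBOLS.sub("", text)  # strip non-word characters at line edges
    text = EXTRA_SPACES.sub(" ", text)      # collapse each whitespace run to one space
    return text.strip()


print(clean_text("***  Héllo   wörld!!!  "))
# prints: Hllo wrld
```

Note that the non-ASCII pattern deletes accented letters outright rather than transliterating them, so it is likely a poor fit for non-English text.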
