Installation

To use the PDFLoader, you first need to install the purecpp_extract Python package:

pip install purecpp_extract

Initialization

You can initialize the PDFLoader by providing the path to a .pdf file or a directory containing .pdf files.

from purecpp_extract import PDFLoader

# Load a single PDF file
pdf_loader = PDFLoader("/path/to/file.pdf")

# Load all PDF files from a directory
pdf_loader = PDFLoader("/path/to/directory")
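Before handing a path to PDFLoader, it can help to check what that path actually refers to. The following is a minimal sketch using only the standard library. The helper find_pdf_inputs is hypothetical and not part of purecpp_extract, and it assumes that only the top level of a directory is considered; whether PDFLoader itself descends into subdirectories is not stated here.

```python
from pathlib import Path


def find_pdf_inputs(path):
    """Return the .pdf files that a path refers to.

    Hypothetical helper for illustration; not part of purecpp_extract.
    """
    p = Path(path)
    # A single file counts only if it has a .pdf extension (case-insensitive).
    if p.is_file() and p.suffix.lower() == ".pdf":
        return [p]
    if p.is_dir():
        # Assumption: only the top level of the directory is scanned.
        return sorted(
            f for f in p.iterdir()
            if f.is_file() and f.suffix.lower() == ".pdf"
        )
    # Missing paths and non-PDF files yield an empty list.
    return []
```

An empty result means the path is missing, is not a .pdf file, or is a directory with no .pdf files at its top level, so the caller can report the problem before constructing the loader.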

Load

Once initialized, use the Load() method to extract the contents of the files. This method returns a list of Document objects.

Each Document contains the following attributes:

  • metadata: A dictionary with metadata about the document
  • page_content: The full text content of the document

documents = pdf_loader.Load()

for doc in documents:
    print(doc.metadata)
    print(doc.page_content)

The number of Documents returned depends on the path given at initialization:

  • If a single file path was provided, the returned list will contain one Document.
  • If a directory path was provided, the list will contain one Document per .pdf file found in the directory.
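
Because Load() returns a plain list, the result can be filtered like any other Python list. The sketch below keeps only the documents whose extracted text is not blank. So that it runs without the package installed, it uses a stand-in Document dataclass that carries just the two attributes described above; the "source" metadata key and the file names are assumptions made for illustration, not documented fields.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for the loader's Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def non_empty(documents):
    """Keep documents whose page_content contains non-whitespace text."""
    return [doc for doc in documents if doc.page_content.strip()]


# Hypothetical data; with the real package this would be pdf_loader.Load().
docs = [
    Document("First report text", {"source": "a.pdf"}),
    Document("   ", {"source": "blank.pdf"}),
]

kept = non_empty(docs)
for doc in kept:
    print(doc.metadata, len(doc.page_content))
```

With the real loader, the same non_empty function should apply to the returned list unchanged, provided each Document exposes page_content as a string.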