Installation

To use DOCXLoader, first install the purecpp_extract Python package:

pip install purecpp_extract

Initialization

You can initialize the DOCXLoader by providing the path to a .docx file or a directory containing .docx files.

from purecpp_extract import DOCXLoader

# Load a single DOCX file
docx_loader = DOCXLoader("/path/to/file.docx")

# Load all DOCX files from a directory
docx_loader = DOCXLoader("/path/to/directory")
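
Before constructing the loader, it can help to check what a given path actually refers to. The following is a minimal sketch using only the Python standard library. The helper name docx_files_at is hypothetical and not part of purecpp_extract, and the sketch assumes that only the top level of a directory is scanned, which this documentation does not specify.

```python
from pathlib import Path


def docx_files_at(path):
    """Return the .docx files a path refers to (hypothetical helper, not a library API)."""
    p = Path(path)
    if p.is_file():
        # A single file counts only if it has the .docx extension.
        return [p] if p.suffix.lower() == ".docx" else []
    if p.is_dir():
        # Assumption: subdirectories are not searched.
        return sorted(
            f for f in p.iterdir()
            if f.is_file() and f.suffix.lower() == ".docx"
        )
    raise FileNotFoundError(f"No such file or directory: {path}")
```

An empty result suggests the path is probably not worth passing to DOCXLoader; how the loader itself reacts to such a path is not covered here.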

Load

After initializing the loader, call its Load() method to extract the contents of the files. This method returns a list of Document objects.

Each Document contains the following attributes:

  • metadata: A dictionary with metadata about the document
  • page_content: The full text content of the document

documents = docx_loader.Load()

for doc in documents:
    print(doc.metadata)
    print(doc.page_content)

The length of the returned list depends on the path given at initialization:

  • If a single file path was provided during initialization, the returned list will contain one Document.
  • If a directory path was provided, the list will contain one Document per .docx file found in the directory.
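
As an illustration of working with the returned list, the sketch below computes a word count per document. So that it runs without the package installed, it defines a stand-in Document class with only the two attributes described above. This stand-in is an assumption for the example and is not the library's class. The metadata keys are not documented here, so the sketch treats metadata as an opaque dictionary and uses placeholder values.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in with the two documented attributes; not the library's class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize(documents):
    """Return one (metadata, word_count) pair per document."""
    return [(doc.metadata, len(doc.page_content.split())) for doc in documents]


# With the real loader this list would come from: documents = docx_loader.Load()
documents = [
    Document("First report body text", {"note": "placeholder metadata"}),
    Document("Second file", {}),
]

for metadata, words in summarize(documents):
    print(metadata, words)
```

Because summarize only reads page_content and metadata, it should work unchanged on the objects returned by Load(), provided they expose those attributes as described.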