Installation

To use the WebLoader, you first need to install the purecpp_extract Python package:

pip install purecpp_extract

Initialization

You can initialize the WebLoader by providing a single URL.

from purecpp_extract import WebLoader

# Load a webpage from the internet
web_loader = WebLoader("https://pt.wikipedia.org/wiki/Brasil")
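Because the loader takes exactly one URL, it can help to check that the string is a well-formed absolute URL before constructing the loader. The sketch below uses only the Python standard library. The helper name is_http_url is hypothetical and is not part of purecpp_extract, and whether WebLoader performs any validation of its own is not stated here.

```python
from urllib.parse import urlparse


def is_http_url(url: str) -> bool:
    """Return True if url is an absolute http(s) URL with a host part."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


url = "https://pt.wikipedia.org/wiki/Brasil"
if not is_http_url(url):
    raise ValueError(f"Not an absolute http(s) URL: {url!r}")
# At this point the URL can be passed to WebLoader(url).
```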

Load

Once initialized, use the Load() method to fetch and extract content from the webpage. This method returns a list containing one Document object.

Each Document contains the following attributes:

  • metadata: A dictionary with metadata about the document
  • page_content: The full text content of the webpage

documents = web_loader.Load()

for doc in documents:
    print(doc.metadata)
    print(doc.page_content)

Note: Each WebLoader instance accepts only a single URL, so the returned list always contains exactly one Document.
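Since one instance handles one URL, loading several pages means creating one loader per URL. The following is a minimal sketch of that pattern, not part of the library: load_pages and its make_loader parameter are hypothetical names, and the only assumption about the loader is what is described above (a constructor taking a URL, and a Load() method returning a one-element list).

```python
from typing import Any, Callable, Dict, Iterable


def load_pages(urls: Iterable[str], make_loader: Callable[[str], Any]) -> Dict[str, Any]:
    """Load each URL with its own loader and return a {url: Document} mapping."""
    pages = {}
    for url in urls:
        loader = make_loader(url)    # one loader instance per URL
        (document,) = loader.Load()  # raises ValueError unless exactly one item is returned
        pages[url] = document
    return pages


# Intended use with the real class (requires purecpp_extract and network access):
#   from purecpp_extract import WebLoader
#   pages = load_pages(["https://pt.wikipedia.org/wiki/Brasil"], WebLoader)
#   print(pages["https://pt.wikipedia.org/wiki/Brasil"].metadata)
```

Passing the loader class as an argument keeps the helper independent of the package, so it can be exercised with a stand-in class when the network is unavailable.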