
Document Ingestion

% pip install novastack-workflows[prebuilt]

A prebuilt workflow for document ingestion that handles the complete pipeline of loading, transforming, and storing documents in a vector store.

This workflow orchestrates the document ingestion pipeline with support for:

- Multiple document loaders (e.g., Docx, PDF, S3)
- Multiple transformation components (e.g., chunking, embedding)
- Deduplication strategies
- Vector store integration

This workflow provides a streamlined approach to building document ingestion pipelines with support for custom transformers and flexible document processing strategies.
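The deduplication strategies mentioned above are not spelled out on this page. As a rough illustration of one common approach (content hashing), here is a minimal sketch in plain Python. `Doc`, `content_hash`, and `deduplicate` are hypothetical names used only for this example; they are not part of novastack, and the library's own strategies may work differently.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a document; not a novastack class."""
    doc_id: str
    text: str


def content_hash(doc: Doc) -> str:
    # Hash only the text, so identical content under a different id
    # still counts as a duplicate.
    return hashlib.sha256(doc.text.encode("utf-8")).hexdigest()


def deduplicate(docs: list[Doc], seen: Optional[set[str]] = None) -> list[Doc]:
    """Keep the first document for each distinct content hash."""
    seen = set() if seen is None else seen
    unique = []
    for doc in docs:
        digest = content_hash(doc)
        if digest in seen:
            continue  # same content was already ingested
        seen.add(digest)
        unique.append(doc)
    return unique


docs = [
    Doc("a", "hello world"),
    Doc("b", "hello world"),      # same text as "a"
    Doc("c", "something else"),
]
unique_docs = deduplicate(docs)
```

Passing a persistent `seen` set between runs would let repeated ingestion skip documents that were stored earlier; whether the real workflow tracks this in the vector store or elsewhere is not stated here.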

Attributes

| Parameter | Type | Description |
| --- | --- | --- |
| transformers | list[TransformerComponent] | Transformer components to apply to the documents. Transformers are applied in sequence during the ingestion pipeline. |
| doc_strategy | DocStrategy, optional | Strategy that defines how documents are processed and managed throughout the workflow. |
| post_transformer | bool, optional | Whether to apply post-transformation processing. When enabled, additional processing steps run after the main transformers. |
| loaders | list[BaseLoader], optional | Loader components for reading documents from various sources. If not provided, documents must be supplied directly to the workflow. |
| vector_store | BaseVectorStore, optional | Vector store for persisting processed documents. When provided, documents are stored automatically after transformation. |
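To make the roles of these attributes concrete, the following is a minimal, library-free sketch of the control flow they describe: load, apply transformers in sequence, then optionally store. Every name in it (`ingest`, `split_on_blank_lines`, `lowercase`, the list used as a store) is a hypothetical stand-in, not the novastack implementation. In particular, how the real workflow combines loaded documents with directly supplied ones is an assumption; the sketch simply concatenates them.

```python
from typing import Callable, Iterable, Optional

# Hypothetical types: a transformer maps a list of texts to a new list of texts,
# and a loader is a zero-argument callable that returns texts.
Transformer = Callable[[list[str]], list[str]]
Loader = Callable[[], list[str]]


def split_on_blank_lines(docs: list[str]) -> list[str]:
    """Toy chunker: split each document on blank lines."""
    return [part.strip() for doc in docs for part in doc.split("\n\n") if part.strip()]


def lowercase(docs: list[str]) -> list[str]:
    """Toy normalizer: lowercase every chunk."""
    return [doc.lower() for doc in docs]


def ingest(
    documents: Iterable[str],
    transformers: list[Transformer],
    loaders: Optional[list[Loader]] = None,
    store: Optional[list[str]] = None,
) -> list[str]:
    docs = list(documents)
    # Loaders are optional; without them, only the supplied documents are used.
    for load in loaders or []:
        docs.extend(load())
    # Transformers run in order, each receiving the previous one's output.
    for transform in transformers:
        docs = transform(docs)
    # Persist only when a store was provided.
    if store is not None:
        store.extend(docs)
    return docs


store: list[str] = []
result = ingest(
    ["First Part\n\nSecond Part"],
    transformers=[split_on_blank_lines, lowercase],
    store=store,
)
```

Because each transformer receives the output of the previous one, order matters: chunking before embedding, for instance, produces one embedding per chunk rather than per document.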

Example

from novastack_workflows.prebuilt import DocumentIngestionWorkflow
from novastack.loaders import DirectoryLoader
from novastack.text_chunkers import TokenChunker
from novastack.vector_stores import ChromaVectorStore
from novastack.embeddings import HuggingFaceEmbeddings

# Initialize components
dir_loader = DirectoryLoader(input_dir="./documents")
chunker = TokenChunker(chunk_size=512, chunk_overlap=50)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = ChromaVectorStore(
    collection_name="my_documents",
    embedding_function=embeddings
)

# Create ingestion workflow
workflow = DocumentIngestionWorkflow(
    transformers=[chunker],
    doc_strategy="merge",
    post_transformer=True,
    loaders=[dir_loader],
    vector_store=vector_store
)

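# Run the workflow. Top-level await works in a notebook or async REPL;
# in a plain script, call this inside an async function run with asyncio.run().
result = await workflow.run()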

Workflows

Full workflow documentation here