babylon.rag.chunker
Document chunking and preprocessing for the RAG system.
Classes
DocumentChunk | Represents a chunk of a document with metadata.
DocumentProcessor | High-level document processor that combines preprocessing and chunking.
Preprocessor | Preprocesses documents for chunking and embedding.
TextChunker | Chunks text into smaller, contextually meaningful pieces.
- class babylon.rag.chunker.DocumentChunk(**data)
Bases: BaseModel
Represents a chunk of a document with metadata.
- Parameters:
- model_config: ClassVar[ConfigDict] = {'validate_assignment': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class babylon.rag.chunker.Preprocessor(min_content_length=50, max_content_length=100000, remove_extra_whitespace=True, normalize_unicode=True)
Bases: object
Preprocesses documents for chunking and embedding.
- Parameters:
- __init__(min_content_length=50, max_content_length=100000, remove_extra_whitespace=True, normalize_unicode=True)
Initialize the preprocessor.
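The constructor arguments suggest three kinds of cleanup: Unicode normalization, whitespace collapsing, and length bounds. The following is a minimal sketch of how such options could behave, not babylon's actual implementation. The function name `preprocess`, the choice of NFKC, and the use of `ValueError` for out-of-range lengths are all assumptions made for illustration.

```python
import re
import unicodedata


def preprocess(
    text: str,
    min_content_length: int = 50,
    max_content_length: int = 100_000,
    remove_extra_whitespace: bool = True,
    normalize_unicode: bool = True,
) -> str:
    """Hypothetical stand-in for a preprocessing step (not the library's code)."""
    if normalize_unicode:
        # Assumption: NFKC, which folds compatibility characters
        # (for example ligatures) into their plain equivalents.
        text = unicodedata.normalize("NFKC", text)
    if remove_extra_whitespace:
        # Collapse runs of spaces and tabs into a single space.
        text = re.sub(r"[ \t]+", " ", text)
        # Reduce any run of blank lines to one paragraph break.
        text = re.sub(r"\n\s*\n+", "\n\n", text)
        text = text.strip()
    # Assumption: out-of-range content is rejected with ValueError.
    if len(text) < min_content_length:
        raise ValueError(f"content too short: {len(text)} < {min_content_length}")
    if len(text) > max_content_length:
        raise ValueError(f"content too long: {len(text)} > {max_content_length}")
    return text
```

Keeping paragraph breaks intact during whitespace cleanup matters if a later chunking step relies on them as split points.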
- class babylon.rag.chunker.TextChunker(chunk_size=1000, overlap_size=100, preserve_paragraphs=True, preserve_sentences=True)
Bases: object
Chunks text into smaller, contextually meaningful pieces.
- Parameters:
- __init__(chunk_size=1000, overlap_size=100, preserve_paragraphs=True, preserve_sentences=True)
Initialize the chunker.
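To illustrate how `chunk_size`, `overlap_size`, and sentence preservation might interact, here is a rough sketch of a sentence-aware chunker with character overlap. It is not the TextChunker implementation: the function `chunk_text`, the regex-based sentence split, and the treatment of `chunk_size` as a soft limit are assumptions, and paragraph preservation is left out.

```python
import re


def chunk_text(text: str, chunk_size: int = 1000, overlap_size: int = 100) -> list[str]:
    """Hypothetical sentence-preserving chunker with character overlap."""
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one.
            # The tail is cut by characters, so it may start mid-word.
            tail = current[-overlap_size:] if overlap_size else ""
            current = f"{tail} {sentence}".strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

In this sketch `chunk_size` is a soft target: a single long sentence, or an overlap tail plus a sentence, can exceed it, because sentences are never split.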
- class babylon.rag.chunker.DocumentProcessor(preprocessor=None, chunker=None)
Bases: object
High-level document processor that combines preprocessing and chunking.
- Parameters:
preprocessor (Preprocessor | None)
chunker (TextChunker | None)
- __init__(preprocessor=None, chunker=None)
Initialize the document processor.
- Parameters:
preprocessor (Preprocessor | None) – Optional custom preprocessor (uses default if None)
chunker (TextChunker | None) – Optional custom chunker (uses default if None)
- process_text(content, source_file=None, metadata=None)
Process raw text into document chunks.
- Parameters:
- Return type:
- Returns:
List of processed DocumentChunk objects
- Raises:
PreprocessingError – If preprocessing fails
ChunkingError – If chunking fails
- process_file(file_path, encoding='utf-8')
Process a text file into document chunks.
- Parameters:
- Return type:
- Returns:
List of processed DocumentChunk objects
- Raises:
FileNotFoundError – If the file doesn’t exist
PreprocessingError – If preprocessing fails
ChunkingError – If chunking fails
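The relationship between the file and text entry points can be pictured with a small stand-in pipeline. This sketch assumes that the file path simply reads the file with the given encoding and then hands the content to the text path; the source does not confirm that. The names `process_text_stub` and `process_file_stub`, the paragraph-based splitting, and the dictionary fields are hypothetical and do not reflect the real DocumentChunk model.

```python
from pathlib import Path
from typing import Any, Optional


def process_text_stub(
    content: str,
    source_file: Optional[str] = None,
    metadata: Optional[dict[str, Any]] = None,
) -> list[dict[str, Any]]:
    """Hypothetical stand-in for the text path: one dict per paragraph."""
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    return [
        {
            "content": paragraph,
            "chunk_index": index,  # hypothetical field name
            "source_file": source_file,
            "metadata": dict(metadata or {}),
        }
        for index, paragraph in enumerate(paragraphs)
    ]


def process_file_stub(file_path: str, encoding: str = "utf-8") -> list[dict[str, Any]]:
    """Hypothetical stand-in for the file path: read, then delegate."""
    path = Path(file_path)
    if not path.is_file():
        # Mirrors the documented FileNotFoundError for a missing file.
        raise FileNotFoundError(f"no such file: {file_path}")
    content = path.read_text(encoding=encoding)
    return process_text_stub(content, source_file=str(path))
```

Recording the source path on every chunk, as this sketch does, lets a retrieval step trace a chunk back to its originating document.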