babylon.rag.chunker

Document chunking and preprocessing for the RAG system.

Classes

DocumentChunk(**data)

Represents a chunk of a document with metadata.

DocumentProcessor([preprocessor, chunker])

High-level document processor that combines preprocessing and chunking.

Preprocessor([min_content_length, ...])

Preprocesses documents for chunking and embedding.

TextChunker([chunk_size, overlap_size, ...])

Chunks text into smaller, contextually meaningful pieces.

class babylon.rag.chunker.DocumentChunk(**data)[source]

Bases: BaseModel

Represents a chunk of a document with metadata.

model_config: ClassVar[ConfigDict] = {'validate_assignment': True}

Configuration for the model; a dictionary conforming to pydantic's ConfigDict.

Fields:
id: str
content: str
source_file: str | None
chunk_index: int
start_char: int
end_char: int
metadata: dict[str, Any] | None
embedding: list[float] | None
generate_id_if_empty()[source]

Generate ID if not provided.

Return type:

DocumentChunk

class babylon.rag.chunker.Preprocessor(min_content_length=50, max_content_length=100000, remove_extra_whitespace=True, normalize_unicode=True)[source]

Bases: object

Preprocesses documents for chunking and embedding.

Parameters:
  • min_content_length (int)

  • max_content_length (int)

  • remove_extra_whitespace (bool)

  • normalize_unicode (bool)

__init__(min_content_length=50, max_content_length=100000, remove_extra_whitespace=True, normalize_unicode=True)[source]

Initialize the preprocessor.

Parameters:
  • min_content_length (int) – Minimum acceptable content length

  • max_content_length (int) – Maximum acceptable content length

  • remove_extra_whitespace (bool) – Whether to normalize whitespace

  • normalize_unicode (bool) – Whether to normalize unicode characters

preprocess(content, content_id=None)[source]

Preprocess content for chunking.

Parameters:
  • content (str) – Raw content to preprocess

  • content_id (str | None) – Optional identifier for error reporting

Return type:

str

Returns:

Preprocessed content

Raises:

PreprocessingError – If content fails validation or preprocessing
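The core of preprocessing can be sketched as whitespace normalization followed by length validation. This is an assumed implementation, not the library's: unicode normalization is omitted and a plain `ValueError` stands in for `PreprocessingError`.

```python
import re

def preprocess(content: str, min_content_length: int = 50,
               max_content_length: int = 100_000,
               remove_extra_whitespace: bool = True) -> str:
    """Sketch of Preprocessor.preprocess(): normalize whitespace, then validate length."""
    if remove_extra_whitespace:
        # Collapse runs of whitespace into single spaces and trim the ends.
        content = re.sub(r"\s+", " ", content).strip()
    if not (min_content_length <= len(content) <= max_content_length):
        raise ValueError(f"content length {len(content)} outside allowed range")
    return content

cleaned = preprocess("A   document\n\nwith   messy    spacing " * 4, min_content_length=10)
```

Validating after normalization means the length bounds apply to the content as it will actually be chunked.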

class babylon.rag.chunker.TextChunker(chunk_size=1000, overlap_size=100, preserve_paragraphs=True, preserve_sentences=True)[source]

Bases: object

Chunks text into smaller, contextually meaningful pieces.

Parameters:
  • chunk_size (int)

  • overlap_size (int)

  • preserve_paragraphs (bool)

  • preserve_sentences (bool)

__init__(chunk_size=1000, overlap_size=100, preserve_paragraphs=True, preserve_sentences=True)[source]

Initialize the chunker.

Parameters:
  • chunk_size (int) – Target size for each chunk in characters

  • overlap_size (int) – Number of overlapping characters between chunks

  • preserve_paragraphs (bool) – Try to avoid splitting paragraphs

  • preserve_sentences (bool) – Try to avoid splitting sentences

chunk_text(content, source_file=None, metadata=None)[source]

Chunk text content into DocumentChunk objects.

Parameters:
  • content (str) – Text content to chunk

  • source_file (str | None) – Optional source file path

  • metadata (dict[str, Any] | None) – Optional metadata to attach to chunks

Return type:

list[DocumentChunk]

Returns:

List of DocumentChunk objects

Raises:

ChunkingError – If chunking fails
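The sliding-window arithmetic behind `chunk_size` and `overlap_size` can be sketched as follows. This is an illustrative simplification: the real chunker also tries to respect paragraph and sentence boundaries, and returns `DocumentChunk` objects rather than the plain dicts used here.

```python
def chunk_text(content: str, chunk_size: int = 1000, overlap_size: int = 100) -> list[dict]:
    """Sketch of TextChunker.chunk_text(): fixed-size windows with character overlap."""
    chunks = []
    step = chunk_size - overlap_size  # each new chunk starts this far after the last
    start = 0
    index = 0
    while start < len(content):
        end = min(start + chunk_size, len(content))
        chunks.append({"chunk_index": index, "start_char": start,
                       "end_char": end, "content": content[start:end]})
        if end == len(content):
            break
        start += step
        index += 1
    return chunks

chunks = chunk_text("x" * 2500, chunk_size=1000, overlap_size=100)
```

Each chunk shares its last `overlap_size` characters with the start of the next, so context straddling a boundary appears in both chunks.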

class babylon.rag.chunker.DocumentProcessor(preprocessor=None, chunker=None)[source]

Bases: object

High-level document processor that combines preprocessing and chunking.

Parameters:
  • preprocessor (Preprocessor | None)

  • chunker (TextChunker | None)

__init__(preprocessor=None, chunker=None)[source]

Initialize the document processor.

Parameters:
  • preprocessor (Preprocessor | None) – Optional custom preprocessor (uses default if None)

  • chunker (TextChunker | None) – Optional custom chunker (uses default if None)

process_text(content, source_file=None, metadata=None)[source]

Process raw text into document chunks.

Parameters:
  • content (str) – Raw text content

  • source_file (str | None) – Optional source file path

  • metadata (dict[str, Any] | None) – Optional metadata to attach to chunks

Return type:

list[DocumentChunk]

Returns:

List of processed DocumentChunk objects

Raises:
  • PreprocessingError – If preprocessing fails

  • ChunkingError – If chunking fails
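The pipeline this method wraps can be sketched in a few lines. The helper below is hypothetical and the chunk sizes are shrunk for demonstration; it only shows the preprocess-then-chunk ordering, which matters because chunk offsets refer to the cleaned text, not the raw input.

```python
import re

def process_text(content: str) -> list[str]:
    """Sketch of DocumentProcessor.process_text(): preprocess, then chunk."""
    cleaned = re.sub(r"\s+", " ", content).strip()   # preprocessing step
    chunk_size, overlap = 40, 10                     # small sizes for demonstration
    step = chunk_size - overlap
    # Slide a window over the cleaned text; each slice overlaps the next by `overlap`.
    return [cleaned[i:i + chunk_size]
            for i in range(0, max(len(cleaned) - overlap, 1), step)]

pieces = process_text("Many   short words " * 10)
```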

process_file(file_path, encoding='utf-8')[source]

Process a text file into document chunks.

Parameters:
  • file_path (str) – Path to the text file

  • encoding (str) – File encoding (default: utf-8)

Return type:

list[DocumentChunk]

Returns:

List of processed DocumentChunk objects

Raises:
  • FileNotFoundError – If file doesn’t exist

  • PreprocessingError – If preprocessing fails

  • ChunkingError – If chunking fails
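A sketch of the file path: read the file with the given encoding, then hand the text to the processing pipeline. `Path.read_text` raises `FileNotFoundError` for a missing file, matching the documented behavior; the chunking step is reduced here to a line split, which is an assumption, not the library's logic.

```python
import os
import tempfile
from pathlib import Path

def process_file(file_path: str, encoding: str = "utf-8") -> list[str]:
    """Sketch of DocumentProcessor.process_file(): read, then delegate to text processing."""
    text = Path(file_path).read_text(encoding=encoding)
    # Stand-in for process_text(): one "chunk" per non-empty line.
    return [line for line in text.splitlines() if line.strip()]

# Demonstrate with a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("first line\n\nsecond line\n")
    tmp_path = f.name
chunks = process_file(tmp_path)
os.unlink(tmp_path)
```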