babylon.rag.chunker

Document chunking and preprocessing for the RAG system.

Classes

DocumentChunk(**data)

Represents a chunk of a document with metadata.

DocumentProcessor([preprocessor, chunker])

High-level document processor that combines preprocessing and chunking.

Preprocessor([min_content_length, ...])

Preprocesses documents for chunking and embedding.

TextChunker([chunk_size, overlap_size, ...])

Chunks text into smaller, contextually meaningful pieces.

class babylon.rag.chunker.DocumentChunk(**data)[source]

Bases: BaseModel

Represents a chunk of a document with metadata.

model_config: ClassVar[ConfigDict] = {'validate_assignment': True}

Configuration for the model; a dictionary conforming to pydantic's ConfigDict.

Fields:
id: str
content: str
source_file: str | None
chunk_index: int
start_char: int
end_char: int
metadata: dict[str, Any] | None
embedding: list[float] | None
generate_id_if_empty()[source]

Generate ID if not provided.

Return type:

DocumentChunk

class babylon.rag.chunker.Preprocessor(min_content_length=50, max_content_length=100000, remove_extra_whitespace=True, normalize_unicode=True)[source]

Bases: object

Preprocesses documents for chunking and embedding.

Parameters:
  • min_content_length (int)

  • max_content_length (int)

  • remove_extra_whitespace (bool)

  • normalize_unicode (bool)

__init__(min_content_length=50, max_content_length=100000, remove_extra_whitespace=True, normalize_unicode=True)[source]

Initialize the preprocessor.

Parameters:
  • min_content_length (int) – Minimum acceptable content length

  • max_content_length (int) – Maximum acceptable content length

  • remove_extra_whitespace (bool) – Whether to normalize whitespace

  • normalize_unicode (bool) – Whether to normalize unicode characters

preprocess(content, content_id=None)[source]

Preprocess content for chunking.

Parameters:
  • content (str) – Raw content to preprocess

  • content_id (str | None) – Optional identifier for error reporting

Return type:

str

Returns:

Preprocessed content

Raises:

PreprocessingError – If content fails validation or preprocessing
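The core of preprocessing can be sketched as whitespace normalization followed by length validation. This is an assumed implementation, not the library's: unicode normalization is omitted and a plain `ValueError` stands in for `PreprocessingError`.

```python
import re

def preprocess(content: str, min_content_length: int = 50,
               max_content_length: int = 100_000,
               remove_extra_whitespace: bool = True) -> str:
    """Sketch of Preprocessor.preprocess(): normalize whitespace, then validate length."""
    if remove_extra_whitespace:
        # Collapse runs of whitespace into single spaces and trim the ends.
        content = re.sub(r"\s+", " ", content).strip()
    if not (min_content_length <= len(content) <= max_content_length):
        raise ValueError(f"content length {len(content)} outside allowed range")
    return content

cleaned = preprocess("A   document\n\nwith   messy    spacing " * 4, min_content_length=10)
```

Validating after normalization means the length bounds apply to the content as it will actually be chunked.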

class babylon.rag.chunker.TextChunker(chunk_size=1000, overlap_size=100, preserve_paragraphs=True, preserve_sentences=True)[source]

Bases: object

Chunks text into smaller, contextually meaningful pieces.

Parameters:
  • chunk_size (int)

  • overlap_size (int)

  • preserve_paragraphs (bool)

  • preserve_sentences (bool)

__init__(chunk_size=1000, overlap_size=100, preserve_paragraphs=True, preserve_sentences=True)[source]

Initialize the chunker.

Parameters:
  • chunk_size (int) – Target size for each chunk in characters

  • overlap_size (int) – Number of overlapping characters between chunks

  • preserve_paragraphs (bool) – Try to avoid splitting paragraphs

  • preserve_sentences (bool) – Try to avoid splitting sentences

chunk_text(content, source_file=None, metadata=None)[source]

Chunk text content into DocumentChunk objects.

Parameters:
  • content (str) – Text content to chunk

  • source_file (str | None) – Optional source file path

  • metadata (dict[str, Any] | None) – Optional metadata to attach to chunks

Return type:

list[DocumentChunk]

Returns:

List of DocumentChunk objects

Raises:

ChunkingError – If chunking fails
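The sliding-window arithmetic behind `chunk_size` and `overlap_size` can be sketched as follows. This is an illustrative simplification: the real chunker also tries to respect paragraph and sentence boundaries, and returns `DocumentChunk` objects rather than the plain dicts used here.

```python
def chunk_text(content: str, chunk_size: int = 1000, overlap_size: int = 100) -> list[dict]:
    """Sketch of TextChunker.chunk_text(): fixed-size windows with character overlap."""
    chunks = []
    step = chunk_size - overlap_size  # each new chunk starts this far after the last
    start = 0
    index = 0
    while start < len(content):
        end = min(start + chunk_size, len(content))
        chunks.append({"chunk_index": index, "start_char": start,
                       "end_char": end, "content": content[start:end]})
        if end == len(content):
            break
        start += step
        index += 1
    return chunks

chunks = chunk_text("x" * 2500, chunk_size=1000, overlap_size=100)
```

Each chunk shares its last `overlap_size` characters with the start of the next, so context straddling a boundary appears in both chunks.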

class babylon.rag.chunker.DocumentProcessor(preprocessor=None, chunker=None)[source]

Bases: object

High-level document processor that combines preprocessing and chunking.

Parameters:
  • preprocessor (Preprocessor | None)

  • chunker (TextChunker | None)

__init__(preprocessor=None, chunker=None)[source]

Initialize the document processor.

Parameters:
  • preprocessor (Preprocessor | None) – Optional custom preprocessor (uses default if None)

  • chunker (TextChunker | None) – Optional custom chunker (uses default if None)

process_text(content, source_file=None, metadata=None)[source]

Process raw text into document chunks.

Parameters:
  • content (str) – Raw text content

  • source_file (str | None) – Optional source file path

  • metadata (dict[str, Any] | None) – Optional metadata to attach to chunks

Return type:

list[DocumentChunk]

Returns:

List of processed DocumentChunk objects

Raises:
  • PreprocessingError – If preprocessing fails

  • ChunkingError – If chunking fails
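The pipeline this method wraps can be sketched in a few lines. The helper below is hypothetical and the chunk sizes are shrunk for demonstration; it only shows the preprocess-then-chunk ordering, which matters because chunk offsets refer to the cleaned text, not the raw input.

```python
import re

def process_text(content: str) -> list[str]:
    """Sketch of DocumentProcessor.process_text(): preprocess, then chunk."""
    cleaned = re.sub(r"\s+", " ", content).strip()   # preprocessing step
    chunk_size, overlap = 40, 10                     # small sizes for demonstration
    step = chunk_size - overlap
    # Slide a window over the cleaned text; each slice overlaps the next by `overlap`.
    return [cleaned[i:i + chunk_size]
            for i in range(0, max(len(cleaned) - overlap, 1), step)]

pieces = process_text("Many   short words " * 10)
```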

process_file(file_path, encoding='utf-8')[source]

Process a text file into document chunks.

Parameters:
  • file_path (str) – Path to the text file

  • encoding (str) – File encoding (default: utf-8)

Return type:

list[DocumentChunk]

Returns:

List of processed DocumentChunk objects

Raises:
  • FileNotFoundError – If file doesn’t exist

  • PreprocessingError – If preprocessing fails

  • ChunkingError – If chunking fails
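A sketch of the file path: read the file with the given encoding, then hand the text to the processing pipeline. `Path.read_text` raises `FileNotFoundError` for a missing file, matching the documented behavior; the chunking step is reduced here to a line split, which is an assumption, not the library's logic.

```python
import os
import tempfile
from pathlib import Path

def process_file(file_path: str, encoding: str = "utf-8") -> list[str]:
    """Sketch of DocumentProcessor.process_file(): read, then delegate to text processing."""
    text = Path(file_path).read_text(encoding=encoding)
    # Stand-in for process_text(): one "chunk" per non-empty line.
    return [line for line in text.splitlines() if line.strip()]

# Demonstrate with a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("first line\n\nsecond line\n")
    tmp_path = f.name
chunks = process_file(tmp_path)
os.unlink(tmp_path)
```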