babylon.rag.pre_embeddings.preprocessor

Content preprocessing for the RAG system.

This module normalizes and validates content before it is chunked and embedded.

Classes

ContentPreprocessor([config])    Preprocesses content before chunking and embedding.
PreprocessingConfig(**data)      Configuration for content preprocessing.

class babylon.rag.pre_embeddings.preprocessor.PreprocessingConfig(**data)[source]

Bases: BaseModel

Configuration for content preprocessing.

Parameters:
  • normalize_whitespace (bool)
  • normalize_case (bool)
  • remove_special_chars (bool)
  • language_detection (bool)
  • min_content_length (int)
  • max_content_length (int)

normalize_whitespace

Whether to normalize whitespace in the content.

normalize_case

Whether to convert the content to lowercase.

remove_special_chars

Whether to remove special characters.

language_detection

Whether to detect and validate the content's language.

min_content_length

Minimum allowed content length.

max_content_length

Maximum allowed content length.

model_config: ClassVar[ConfigDict] = {'frozen': True}

Configuration for the model; it should be a dictionary conforming to pydantic.config.ConfigDict. Because frozen is True, instances cannot be modified after they are created.

normalize_whitespace: bool
normalize_case: bool
remove_special_chars: bool
language_detection: bool
min_content_length: int
max_content_length: int
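
Because the model is frozen, a setting is changed by deriving a new configuration rather than editing an existing one. The sketch below illustrates that pattern with a standard-library frozen dataclass so that it runs without babylon or pydantic installed. SketchConfig and every default value in it are assumptions for illustration; the source does not state the real defaults.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class SketchConfig:
    """Stdlib stand-in that mirrors the documented field names; not the real class."""

    normalize_whitespace: bool = True   # default is an assumption
    normalize_case: bool = False        # default is an assumption
    remove_special_chars: bool = False  # default is an assumption
    language_detection: bool = False    # default is an assumption
    min_content_length: int = 1         # default is an assumption
    max_content_length: int = 10_000    # default is an assumption


base = SketchConfig()

# A frozen instance rejects attribute assignment.
try:
    base.normalize_case = True
    mutated = True
except FrozenInstanceError:
    mutated = False

# To change a setting, derive a new instance instead of editing in place.
lowercased = replace(base, normalize_case=True)
```

With the real PreprocessingConfig, the analogous pydantic call for deriving a variant would be model_copy(update={...}), and a rejected assignment would surface as a pydantic validation error rather than FrozenInstanceError.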

class babylon.rag.pre_embeddings.preprocessor.ContentPreprocessor(config=None)[source]

Bases: object

Preprocesses content before chunking and embedding.

This class handles text normalization, validation, and preparation for the chunking and embedding processes.

Parameters:

config (PreprocessingConfig | None) – Configuration for preprocessing behavior

__init__(config=None)[source]

Initialize with configuration options.

Parameters:

config (PreprocessingConfig | None) – Configuration for preprocessing behavior

preprocess(content)[source]

Process raw content into normalized form.

Parameters:

content (str) – Raw content to preprocess

Return type:

str

Returns:

Preprocessed content

Raises:

PreprocessingError – If content validation fails
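
The documented contract (normalize, validate, raise on failure) can be sketched as a small standalone function. This is not the library's implementation: sketch_preprocess, SketchConfig, the default values, the order of the steps, and the choice to validate length after normalization are all assumptions, and language detection is left out because the source does not describe how it works.

```python
import re
from dataclasses import dataclass


class PreprocessingError(Exception):
    """Stand-in for the documented exception; its real base class is not shown."""


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical config with a subset of the documented fields."""

    normalize_whitespace: bool = True   # default is an assumption
    normalize_case: bool = False        # default is an assumption
    remove_special_chars: bool = False  # default is an assumption
    min_content_length: int = 1         # default is an assumption
    max_content_length: int = 10_000    # default is an assumption


def sketch_preprocess(content: str, config: SketchConfig) -> str:
    text = content
    if config.normalize_whitespace:
        # Collapse each whitespace run to a single space and trim the ends.
        text = re.sub(r"\s+", " ", text).strip()
    if config.normalize_case:
        text = text.lower()
    if config.remove_special_chars:
        # Keep only word characters and whitespace.
        text = re.sub(r"[^\w\s]", "", text)
    # Validate the length of the normalized text (assumed to happen last).
    if not config.min_content_length <= len(text) <= config.max_content_length:
        raise PreprocessingError(
            f"content length {len(text)} is outside the allowed range"
        )
    return text


cleaned = sketch_preprocess(
    "  Hello,\n\tWorld!  ",
    SketchConfig(normalize_case=True, remove_special_chars=True),
)
```

Here cleaned is "hello world", while whitespace-only input is reduced to an empty string and fails the minimum-length check.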

preprocess_batch(contents)[source]

Process multiple content items efficiently.

Parameters:

contents (list[str]) – List of content items to preprocess

Return type:

list[str]

Returns:

List of preprocessed content items
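
The source does not say how preprocess_batch treats an item that fails validation. Where per-item outcomes matter, one option is for the caller to loop over a single-item preprocessing function and collect failures by index. The sketch below shows that caller-side pattern; preprocess_each and toy_preprocess are hypothetical helpers, and PreprocessingError is a local stand-in for the documented exception.

```python
class PreprocessingError(Exception):
    """Stand-in for the documented exception."""


def preprocess_each(contents, preprocess):
    """Apply `preprocess` to each item, separating successes from failures by index."""
    results: dict[int, str] = {}
    failures: dict[int, str] = {}
    for index, item in enumerate(contents):
        try:
            results[index] = preprocess(item)
        except PreprocessingError as exc:
            failures[index] = str(exc)
    return results, failures


def toy_preprocess(content: str) -> str:
    """Toy single-item preprocessor: collapse whitespace and reject empty text."""
    text = " ".join(content.split())
    if not text:
        raise PreprocessingError("content is empty after normalization")
    return text


ok, bad = preprocess_each(["a  b", "   ", "c"], toy_preprocess)
```

In this run ok maps indices 0 and 2 to their cleaned strings and bad records the failure for index 1, so the caller can tell which inputs were rejected.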