babylon.rag.pre_embeddings.preprocessor
Content preprocessing for the RAG system.
This module provides functionality for normalizing and preprocessing content before it is chunked and embedded.
Classes
| ContentPreprocessor | Preprocesses content before chunking and embedding. |
| PreprocessingConfig | Configuration for content preprocessing. |
- class babylon.rag.pre_embeddings.preprocessor.PreprocessingConfig(**data)
Bases: BaseModel
Configuration for content preprocessing.
- Parameters:
- normalize_whitespace
Whether to normalize whitespace in content
- normalize_case
Whether to convert content to lowercase
- remove_special_chars
Whether to remove special characters
- language_detection
Whether to detect and validate language
- min_content_length
Minimum allowed content length
- max_content_length
Maximum allowed content length
- model_config: ClassVar[ConfigDict] = {'frozen': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict. With frozen set to True, instances are immutable after creation.
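Because model_config sets frozen to True, a PreprocessingConfig instance cannot be modified once it has been created. The sketch below illustrates that behavior with a stdlib frozen dataclass as a stand-in, so it runs without pydantic or babylon installed. The class name ConfigSketch is hypothetical, and the field types and default values are assumptions; only the field names come from the parameter list above.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


# Stand-in for PreprocessingConfig: a stdlib frozen dataclass, NOT the real
# pydantic model. Field names follow the documented parameters; the types
# and default values are assumptions made for illustration only.
@dataclass(frozen=True)
class ConfigSketch:
    normalize_whitespace: bool = True
    normalize_case: bool = False
    remove_special_chars: bool = False
    language_detection: bool = False
    min_content_length: int = 1
    max_content_length: int = 10_000


config = ConfigSketch(normalize_case=True)

# A frozen instance rejects attribute assignment after construction.
try:
    config.normalize_case = False
    mutated = True
except FrozenInstanceError:
    mutated = False

# To change a setting, build a modified copy instead of mutating in place.
relaxed = replace(config, min_content_length=0)
```

With the real pydantic model, the analogous way to derive a changed copy would likely be model_copy(update={...}), assuming pydantic v2. Immutability means a config can be shared between components without one of them silently changing the behavior of another.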
- class babylon.rag.pre_embeddings.preprocessor.ContentPreprocessor(config=None)
Bases: object
Preprocesses content before chunking and embedding.
This class handles text normalization, validation, and preparation for the chunking and embedding processes.
- Parameters:
config (PreprocessingConfig | None)
- __init__(config=None)
Initialize with configuration options.
- Parameters:
config (PreprocessingConfig | None) – Configuration for preprocessing behavior
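The methods of ContentPreprocessor are not listed on this page, so the following is only a rough sketch of how the documented configuration flags might drive text normalization and length validation. The function preprocess_sketch, its defaults, and the order of the steps are all assumptions, not the library's actual implementation; language detection is left out because it would need a third-party dependency.

```python
import re


def preprocess_sketch(
    text,
    normalize_whitespace=True,
    normalize_case=False,
    remove_special_chars=False,
    min_content_length=1,
    max_content_length=10_000,
):
    """Hypothetical helper mirroring the documented config flags."""
    if remove_special_chars:
        # Keep word characters and whitespace; drop punctuation and symbols.
        text = re.sub(r"[^\w\s]", "", text)
    if normalize_whitespace:
        # Collapse any run of whitespace to one space and trim both ends.
        text = re.sub(r"\s+", " ", text).strip()
    if normalize_case:
        text = text.lower()
    # Validate the length of the normalized text (bounds are inclusive here).
    if not (min_content_length <= len(text) <= max_content_length):
        raise ValueError(
            f"content length {len(text)} is outside "
            f"[{min_content_length}, {max_content_length}]"
        )
    return text


cleaned = preprocess_sketch(
    "  Hello,\n\tWorld!  ",
    normalize_case=True,
    remove_special_chars=True,
)
# cleaned == "hello world"
```

Special characters are removed before whitespace is collapsed so that deleting punctuation between two spaces does not leave a double space behind. Checking the length after normalization means whitespace-only input is rejected rather than passed on to chunking.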