babylon.rag.pre_embeddings.manager

Pre-embeddings management for the RAG system.

This module provides PreEmbeddingsManager, which integrates preprocessing, chunking, and caching components to prepare content for embedding.

Classes

`PreEmbeddingsConfig`(**data)	Configuration for the pre-embeddings system.
`PreEmbeddingsManager`([config, preprocessor, ...])	Manages the pre-embeddings pipeline for the RAG system.

class babylon.rag.pre_embeddings.manager.PreEmbeddingsConfig(**data)

Bases: BaseModel

Configuration for the pre-embeddings system.

Parameters:

preprocessing_config (PreprocessingConfig | None) – Configuration for content preprocessing
chunking_config (ChunkingConfig | None) – Configuration for content chunking
cache_config (CacheConfig | None) – Configuration for embedding cache management

model_config: ClassVar[ConfigDict] = {'frozen': True}

Pydantic model configuration; must be a dictionary conforming to pydantic.config.ConfigDict. Setting frozen to True makes instances immutable after creation.
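
Because the configuration is frozen, it is meant to be built once and then left alone. The sketch below illustrates that behaviour with a frozen standard-library dataclass standing in for the Pydantic model. SketchConfig, its dict-valued fields, and the key names are hypothetical; the real class reports a rejected assignment through Pydantic rather than FrozenInstanceError.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for PreEmbeddingsConfig (not the real class)."""

    preprocessing_config: Optional[dict] = None
    chunking_config: Optional[dict] = None
    cache_config: Optional[dict] = None


# The "chunk_size" and "max_size" keys are assumptions made for illustration.
config = SketchConfig(chunking_config={"chunk_size": 512})

try:
    config.cache_config = {"max_size": 100}  # assignment on a frozen instance
    mutated = True
except FrozenInstanceError:
    mutated = False  # the assignment was rejected

# To change a setting, derive a new instance instead of mutating the old one.
updated = replace(config, cache_config={"max_size": 100})
```

Deriving a new instance keeps any manager that already holds the original configuration unaffected by later changes.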

class babylon.rag.pre_embeddings.manager.PreEmbeddingsManager(config=None, preprocessor=None, chunker=None, cache_manager=None, lifecycle_manager=None, metrics=None)

Bases: object

Manages the pre-embeddings pipeline for the RAG system.

This class integrates preprocessing, chunking, and caching components to prepare content for embedding generation.

Parameters:

config (PreEmbeddingsConfig | None)
preprocessor (ContentPreprocessor | None)
chunker (ChunkingStrategy | None)
cache_manager (EmbeddingCacheManager | None)
lifecycle_manager (Any | None)
metrics (MetricsCollectorProtocol | None)

__init__(config=None, preprocessor=None, chunker=None, cache_manager=None, lifecycle_manager=None, metrics=None)

Initialize with configuration and optional component instances.

Parameters:

config (PreEmbeddingsConfig | None) – Configuration for the pre-embeddings system
preprocessor (ContentPreprocessor | None) – Custom preprocessor instance
chunker (ChunkingStrategy | None) – Custom chunker instance
cache_manager (EmbeddingCacheManager | None) – Custom cache manager instance
lifecycle_manager (Any | None) – Lifecycle manager for object state tracking
metrics (MetricsCollectorProtocol | None) – Optional metrics collector supplied by dependency injection; if omitted, a new MetricsCollector is created
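
Every constructor argument is optional, so a caller can inject only the collaborators it wants to control. The following is a minimal sketch of that pattern using hypothetical stand-ins (SketchMetrics, SketchManager, and the record method are not part of the library); it shows only the "use the injected instance, otherwise create a default" behaviour described for metrics.

```python
from typing import Optional


class SketchMetrics:
    """Hypothetical metrics collector that counts named events."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def record(self, name: str) -> None:
        self.counts[name] = self.counts.get(name, 0) + 1


class SketchManager:
    """Hypothetical stand-in showing the optional-collaborator pattern."""

    def __init__(self, metrics: Optional[SketchMetrics] = None) -> None:
        # Use the injected collector if given, otherwise create a private one.
        self.metrics = metrics if metrics is not None else SketchMetrics()

    def process_content(self, content: str) -> list[str]:
        self.metrics.record("process_content")
        return [content.strip()]


shared = SketchMetrics()
manager = SketchManager(metrics=shared)  # inject a collector the caller can inspect
manager.process_content("  hello  ")
default_manager = SketchManager()  # falls back to its own collector
```

Injecting a shared collector is mainly useful in tests and when several components should report into the same place.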

process_content(content)

Process a single content item through the pre-embeddings pipeline.

Parameters: content (str) – Raw content to process
Return type: list[dict[str, Any]]
Returns: List of processed chunks with metadata
Raises: PreEmbeddingError – If processing fails at any stage
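
As a rough illustration of the contract above (raw string in, list of chunk dictionaries out, a single exception type on failure), here is a self-contained sketch. The whitespace normalization, the fixed-width chunking, the metadata keys, and the rejection of empty input are all assumptions; PreEmbeddingError is redefined locally so the example runs without the library.

```python
from typing import Any


class PreEmbeddingError(Exception):
    """Local stand-in for the library's exception of the same name."""


def sketch_process_content(content: str, chunk_size: int = 20) -> list[dict[str, Any]]:
    """Hypothetical pipeline: normalize whitespace, then cut fixed-width chunks."""
    try:
        cleaned = " ".join(content.split())  # preprocessing stage
        if not cleaned:
            raise ValueError("content is empty after preprocessing")
        # Chunking stage: each slice is tagged with assumed metadata keys.
        return [
            {"content": cleaned[start:start + chunk_size], "index": i, "start": start}
            for i, start in enumerate(range(0, len(cleaned), chunk_size))
        ]
    except Exception as exc:
        # Any stage failure surfaces as one exception type.
        raise PreEmbeddingError(f"processing failed: {exc}") from exc


chunks = sketch_process_content("Retrieval   augmented\ngeneration needs clean chunks.")
```

Wrapping stage failures in one exception type lets callers handle pipeline errors without knowing which component raised them.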

process_batch(contents)

Process multiple content items efficiently.

Parameters: contents (list[str]) – List of raw content items to process
Return type: list[list[dict[str, Any]]]
Returns: List of lists of processed chunks with metadata
Raises: PreEmbeddingError – If batch processing fails
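
The nested return type is easy to misread, so the sketch below shows the shape only: one inner list per input item. The helper names, the line-based chunking, and the assumption that results keep input order are hypothetical and say nothing about how the real method batches work internally.

```python
from typing import Any


def sketch_process_one(content: str) -> list[dict[str, Any]]:
    """Hypothetical per-item step: one chunk per non-empty line."""
    return [{"content": line} for line in content.splitlines() if line.strip()]


def sketch_process_batch(contents: list[str]) -> list[list[dict[str, Any]]]:
    """Return one inner list per input item, in input order."""
    return [sketch_process_one(content) for content in contents]


results = sketch_process_batch(["alpha\nbeta", "gamma"])

# Flatten when the caller needs every chunk regardless of its source item.
flat = [chunk for item_chunks in results for chunk in item_chunks]
```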

prepare_for_embedding(obj)

Prepare an object for embedding by processing its content.

This method is designed to work with objects that follow the Embeddable protocol from the embedding system.

Parameters: obj (Any) – Object with content to prepare for embedding
Return type: dict[str, Any]
Returns: Dictionary with processed content and metadata
Raises: PreEmbeddingError – If preparation fails
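
The Embeddable protocol is not described in this section, so the sketch below assumes its simplest plausible shape: an object exposing id and content attributes. SketchEmbeddable, Note, sketch_prepare, and the keys of the returned dictionary are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SketchEmbeddable(Protocol):
    """Assumed shape of an embeddable object: an id and a content string."""

    id: str
    content: str


@dataclass
class Note:
    id: str
    content: str


def sketch_prepare(obj: Any) -> dict[str, Any]:
    """Hypothetical preparation step returning cleaned content plus metadata."""
    # isinstance on a runtime-checkable protocol checks attribute presence only.
    if not isinstance(obj, SketchEmbeddable):
        raise TypeError("object must expose 'id' and 'content'")
    cleaned = obj.content.strip()
    return {"id": obj.id, "content": cleaned, "length": len(cleaned)}


prepared = sketch_prepare(Note(id="n1", content="  some text  "))
```

A structural protocol keeps the manager decoupled from concrete domain classes: any object with the expected attributes can be prepared.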

prepare_batch_for_embedding(objects)

Prepare multiple objects for embedding by processing their content.

Parameters: objects (list[Any]) – List of objects with content to prepare
Return type: list[dict[str, Any]]
Returns: List of dictionaries with processed content and metadata
Raises: PreEmbeddingError – If batch preparation fails

get_stats()

Get statistics about the pre-embeddings system.

Return type: dict[str, Any]
Returns: Dictionary of statistics
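
The keys of the statistics dictionary are not listed here. The sketch below shows one way such a dictionary could be assembled from running counters; SketchPipeline and every key name in it are illustrative assumptions, not the real output of get_stats.

```python
from typing import Any


class SketchPipeline:
    """Hypothetical pipeline that tracks simple counters."""

    def __init__(self) -> None:
        self._items = 0
        self._chunks = 0

    def process_content(self, content: str) -> list[str]:
        chunks = content.split()  # whitespace split stands in for real chunking
        self._items += 1
        self._chunks += len(chunks)
        return chunks

    def get_stats(self) -> dict[str, Any]:
        # Key names are illustrative; the real dictionary may differ.
        average = self._chunks / self._items if self._items else 0.0
        return {
            "items_processed": self._items,
            "chunks_created": self._chunks,
            "avg_chunks_per_item": average,
        }


pipeline = SketchPipeline()
pipeline.process_content("one two three")
pipeline.process_content("four")
stats = pipeline.get_stats()
```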