babylon.ai.judge
Narrative evaluation via LLM-as-judge pattern.
The NarrativeCommissar evaluates narrative text quality using structured prompting to extract consistent metrics. This enables automated verification of the “Dialectical U-Curve” hypothesis - that narrative certainty follows a U-shape across economic conditions (high at stability, low at inflection, high at collapse).
Components:
- MetaphorFamily: Enum categorizing metaphorical language in narratives.
- JudgmentResult: Immutable Pydantic model holding evaluation metrics.
- NarrativeCommissar: LLM-powered judge that evaluates narrative quality.
The Commissar follows the same pattern as NarrativeDirector - it uses the LLMProvider protocol for swappable LLM backends (MockLLM for testing, DeepSeekClient for production).
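The swappable-backend pattern described above can be sketched with a structural protocol. This is an illustrative stdlib-only sketch, not the babylon implementation: the method name `generate` and the stand-in classes `CannedLLM` and `Judge` are assumptions made for the example.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Structural interface; the real protocol's method signature is assumed."""
    def generate(self, prompt: str) -> str: ...


class CannedLLM:
    """Stand-in for MockLLM: returns queued responses in order."""
    def __init__(self, responses: list[str]) -> None:
        self._responses = iter(responses)

    def generate(self, prompt: str) -> str:
        return next(self._responses)


class Judge:
    """Stand-in for NarrativeCommissar: depends only on the protocol,
    so any conforming backend (mock or production client) can be injected."""
    def __init__(self, llm: LLMProvider) -> None:
        self.llm = llm

    def evaluate_raw(self, text: str) -> str:
        return self.llm.generate(f"Evaluate this narrative:\n{text}")


judge = Judge(llm=CannedLLM(responses=['{"ominousness": 7}']))
print(judge.evaluate_raw("The empire crumbles..."))  # '{"ominousness": 7}'
```

Because `Judge` only sees the protocol, swapping `CannedLLM` for a production client requires no changes to the judge itself.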
Example
>>> from babylon.ai import MockLLM, NarrativeCommissar
>>> mock = MockLLM(responses=['{"ominousness": 7, "certainty": 8, "drama": 6, "metaphor_family": "biological"}'])
>>> commissar = NarrativeCommissar(llm=mock)
>>> result = commissar.evaluate("The empire crumbles...")
>>> result.ominousness
7
Classes
- JudgmentResult: Result of narrative evaluation by the Commissar.
- MetaphorFamily: Categories of metaphorical language in narratives.
- NarrativeCommissar: LLM-powered judge that evaluates narrative quality.
- class babylon.ai.judge.MetaphorFamily(*values)[source]
-
Categories of metaphorical language in narratives.
Narratives about economic crisis tend to cluster around certain metaphorical domains. Tracking these helps identify rhetorical patterns in how the AI describes class struggle.
- BIOLOGICAL
Bodies, organs, disease, parasites, health.
- PHYSICS
Pressure, tension, phase transitions, energy.
- MECHANICAL
Gears, machines, breaking, grinding.
- NONE
No strong metaphorical clustering detected.
- BIOLOGICAL = 'biological'
- PHYSICS = 'physics'
- MECHANICAL = 'mechanical'
- NONE = 'none'
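Tracking rhetorical patterns, as described above, amounts to tallying the dominant metaphor family across many judged narratives. A minimal stdlib sketch, with a hypothetical batch of results (in practice these would come from `NarrativeCommissar.evaluate()`):

```python
from collections import Counter
from enum import Enum


class MetaphorFamily(Enum):
    """Mirrors the documented enum values."""
    BIOLOGICAL = "biological"
    PHYSICS = "physics"
    MECHANICAL = "mechanical"
    NONE = "none"


# Hypothetical families extracted from a batch of JudgmentResults.
families = [
    MetaphorFamily.BIOLOGICAL,
    MetaphorFamily.BIOLOGICAL,
    MetaphorFamily.PHYSICS,
    MetaphorFamily.NONE,
]

# Count occurrences of each family and report the dominant one.
tally = Counter(f.value for f in families)
print(tally.most_common(1))  # [('biological', 2)]
```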
- class babylon.ai.judge.JudgmentResult(**data)[source]
Bases: BaseModel
Result of narrative evaluation by the Commissar.
An immutable record of how the LLM-judge evaluated a narrative text. All metrics use a 1-10 scale for consistency.
- Parameters:
- ominousness
How threatening/foreboding the narrative is (1-10).
- certainty
How confident/absolute the assertions are (1-10).
- drama
Emotional intensity of the narrative (1-10).
- metaphor_family
Dominant metaphorical domain used.
Example
>>> from babylon.ai.judge import JudgmentResult, MetaphorFamily
>>> result = JudgmentResult(
...     ominousness=8,
...     certainty=9,
...     drama=7,
...     metaphor_family=MetaphorFamily.BIOLOGICAL,
... )
>>> result.ominousness
8
- model_config: ClassVar[ConfigDict] = {'frozen': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- ominousness: int
- certainty: int
- drama: int
- metaphor_family: MetaphorFamily
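The `frozen` config above means instances reject attribute assignment after construction. The same immutability semantics can be illustrated with a stdlib `dataclass` analogue (Pydantic is not required for the sketch; note that Pydantic's frozen models raise `ValidationError` on assignment, whereas the dataclass analogue raises `FrozenInstanceError`):

```python
from dataclasses import dataclass, FrozenInstanceError


# Stdlib analogue of Pydantic's frozen=True model_config.
@dataclass(frozen=True)
class FrozenResult:
    ominousness: int
    certainty: int


r = FrozenResult(ominousness=8, certainty=9)
try:
    r.ominousness = 1  # mutation attempt on a frozen instance
except FrozenInstanceError:
    print("immutable")  # frozen instances reject attribute assignment
```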
- class babylon.ai.judge.NarrativeCommissar(llm)[source]
Bases: object
LLM-powered judge that evaluates narrative quality.
The Commissar uses structured prompting to extract consistent metrics from narrative text. It follows the LLM-as-judge pattern where the LLM acts as a consistent evaluator rather than a generator.
This enables automated verification of narrative hypotheses like the “Dialectical U-Curve” - that narrative certainty follows a U-shape across economic conditions.
- Parameters:
llm (LLMProvider)
- name
Identifier for logging (“NarrativeCommissar”).
Example
>>> from babylon.ai import MockLLM
>>> from babylon.ai.judge import NarrativeCommissar
>>> mock = MockLLM(responses=['{"ominousness": 5, "certainty": 5, "drama": 5, "metaphor_family": "none"}'])
>>> commissar = NarrativeCommissar(llm=mock)
>>> result = commissar.evaluate("The workers unite.")
>>> isinstance(result.ominousness, int)
True
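Checking the "Dialectical U-Curve" hypothesis against judged narratives could look like the following sketch. The certainty scores and condition labels are hypothetical; in practice each score would come from `commissar.evaluate(...).certainty`.

```python
# Hypothetical certainty scores (1-10) at three economic conditions.
certainty_by_condition = {
    "stability": 9,
    "inflection": 3,
    "collapse": 8,
}


def is_u_shaped(scores: dict[str, int]) -> bool:
    """U-curve check: certainty at both endpoints exceeds the inflection point."""
    return (scores["stability"] > scores["inflection"]
            and scores["collapse"] > scores["inflection"])


print(is_u_shaped(certainty_by_condition))  # True
```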
- __init__(llm)[source]
Initialize the NarrativeCommissar.
- Parameters:
llm (LLMProvider) – LLMProvider implementation for text generation. Use MockLLM for testing, DeepSeekClient for production.
- Return type:
None
- evaluate(text)[source]
Evaluate a narrative text and return structured metrics.
Sends the text to the LLM with a structured prompt requesting JSON output. Parses the response and returns a JudgmentResult.
- Parameters:
text (str) – The narrative text to evaluate.
- Return type:
JudgmentResult
- Returns:
JudgmentResult containing the evaluation metrics.
- Raises:
json.JSONDecodeError – If LLM response is not valid JSON.
pydantic.ValidationError – If JSON values are out of bounds.
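The documented exceptions come from the parse-and-validate step. A minimal sketch of that step, assuming the LLM response is parsed with `json.loads` and range-checked before building the result (the bounds check stands in for Pydantic's validation):

```python
import json

# A raw LLM response as a MockLLM would return it.
raw = '{"ominousness": 7, "certainty": 8, "drama": 6, "metaphor_family": "biological"}'

# json.loads raises json.JSONDecodeError on malformed input;
# in evaluate() this propagates to the caller as documented.
data = json.loads(raw)

# Out-of-range values would fail Pydantic validation; sketched here
# as an explicit bounds check on the 1-10 scale.
for key in ("ominousness", "certainty", "drama"):
    if not 1 <= data[key] <= 10:
        raise ValueError(f"{key} out of bounds")  # ValidationError in the real model

print(data["ominousness"])  # 7
```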
Example
>>> from babylon.ai import MockLLM
>>> from babylon.ai.judge import NarrativeCommissar
>>> mock = MockLLM(responses=['{"ominousness": 7, "certainty": 8, "drama": 6, "metaphor_family": "biological"}'])
>>> commissar = NarrativeCommissar(llm=mock)
>>> result = commissar.evaluate("The crisis deepens.")
>>> result.ominousness
7