babylon.ai.judge

Narrative evaluation via LLM-as-judge pattern.

The NarrativeCommissar evaluates narrative text quality using structured prompting to extract consistent metrics. This enables automated verification of the “Dialectical U-Curve” hypothesis: that narrative certainty follows a U-shape across economic conditions (high at stability, low at inflection, high at collapse).
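A minimal sketch of what checking the U-shape might look like once certainty scores have been collected across stages. The helper name `is_u_shaped` and the shape test itself are illustrative assumptions, not part of babylon.ai:

```python
def is_u_shaped(certainty_by_stage: list[int]) -> bool:
    """Return True if certainty dips in the middle and recovers.

    certainty_by_stage holds 1-10 certainty scores ordered from
    economic stability through inflection to collapse.
    (Hypothetical helper for illustration only.)
    """
    trough = min(certainty_by_stage)
    trough_index = certainty_by_stage.index(trough)
    # A U-shape: the minimum sits strictly inside the sequence,
    # and both endpoints score higher than it.
    return (
        0 < trough_index < len(certainty_by_stage) - 1
        and certainty_by_stage[0] > trough
        and certainty_by_stage[-1] > trough
    )

# Certainty high at stability, low at the inflection point, high at collapse.
print(is_u_shaped([9, 7, 3, 6, 8]))  # True
# A monotonic rise is not a U-curve.
print(is_u_shaped([3, 5, 7, 8, 9]))  # False
```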

Components:

MetaphorFamily: Enum categorizing metaphorical language in narratives.

JudgmentResult: Immutable Pydantic model holding evaluation metrics.

NarrativeCommissar: LLM-powered judge that evaluates narrative quality.

The Commissar follows the same pattern as NarrativeDirector - it uses the LLMProvider protocol for swappable LLM backends (MockLLM for testing, DeepSeekClient for production).
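The swappable-backend pattern can be sketched with typing.Protocol. The real LLMProvider's method name and signature are not shown on this page, so the single-method `generate` shape and the `CannedLLM` test double below are illustrative stand-ins, not the actual babylon.ai API:

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Illustrative protocol: one generate() method (assumed shape)."""

    def generate(self, prompt: str) -> str: ...


class CannedLLM:
    """Test double in the spirit of MockLLM: replays scripted responses."""

    def __init__(self, responses: list[str]) -> None:
        self._responses = iter(responses)

    def generate(self, prompt: str) -> str:
        return next(self._responses)


def evaluate_with(llm: LLMProvider, prompt: str) -> str:
    # Any object with a matching generate() satisfies the protocol,
    # so a mock and a production client are interchangeable here.
    return llm.generate(prompt)


print(evaluate_with(CannedLLM(["scripted reply"]), "Rate this narrative."))  # scripted reply
```

Because Protocol checks are structural, neither MockLLM nor DeepSeekClient would need to inherit from LLMProvider; matching the method signature is enough.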

Example

>>> from babylon.ai import MockLLM, NarrativeCommissar
>>> mock = MockLLM(responses=['{"ominousness": 7, "certainty": 8, "drama": 6, "metaphor_family": "biological"}'])
>>> commissar = NarrativeCommissar(llm=mock)
>>> result = commissar.evaluate("The empire crumbles...")
>>> result.ominousness
7

Classes

JudgmentResult(**data)

Result of narrative evaluation by the Commissar.

MetaphorFamily(*values)

Categories of metaphorical language in narratives.

NarrativeCommissar(llm)

LLM-powered judge that evaluates narrative quality.

class babylon.ai.judge.MetaphorFamily(*values)[source]

Bases: str, Enum

Categories of metaphorical language in narratives.

Narratives about economic crisis tend to cluster around certain metaphorical domains. Tracking these helps identify rhetorical patterns in how the AI describes class struggle.

BIOLOGICAL

Bodies, organs, disease, parasites, health.

PHYSICS

Pressure, tension, phase transitions, energy.

MECHANICAL

Gears, machines, breaking, grinding.

NONE

No strong metaphorical clustering detected.

BIOLOGICAL = 'biological'
PHYSICS = 'physics'
MECHANICAL = 'mechanical'
NONE = 'none'
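Because MetaphorFamily subclasses both str and Enum, the raw string the LLM emits can be converted by value lookup and compared against plain strings. A self-contained stand-in of the same pattern, mirroring the values listed above:

```python
from enum import Enum


class MetaphorFamily(str, Enum):
    """Stand-in mirroring babylon.ai.judge.MetaphorFamily for illustration."""

    BIOLOGICAL = "biological"
    PHYSICS = "physics"
    MECHANICAL = "mechanical"
    NONE = "none"


# Value lookup converts the LLM's raw JSON string into an enum member.
family = MetaphorFamily("biological")
print(family is MetaphorFamily.BIOLOGICAL)  # True
# str subclassing means members compare equal to plain strings.
print(family == "biological")  # True
```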
class babylon.ai.judge.JudgmentResult(**data)[source]

Bases: BaseModel

Result of narrative evaluation by the Commissar.

An immutable record of how the LLM-judge evaluated a narrative text. All metrics use a 1-10 scale for consistency.

Parameters:
ominousness

How threatening/foreboding the narrative is (1-10).

certainty

How confident/absolute the assertions are (1-10).

drama

Emotional intensity of the narrative (1-10).

metaphor_family

Dominant metaphorical domain used.

Example

>>> from babylon.ai.judge import JudgmentResult, MetaphorFamily
>>> result = JudgmentResult(
...     ominousness=8,
...     certainty=9,
...     drama=7,
...     metaphor_family=MetaphorFamily.BIOLOGICAL,
... )
>>> result.ominousness
8
model_config: ClassVar[ConfigDict] = {'frozen': True}

Configuration for the model; a dictionary conforming to pydantic.config.ConfigDict.

ominousness: int
certainty: int
drama: int
metaphor_family: MetaphorFamily
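The frozen configuration means any attempt to mutate a result after construction raises an error. A stdlib analog of the same guarantee, using dataclasses rather than Pydantic so the sketch needs no third-party install (the class name `JudgmentRecord` is a stand-in):

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class JudgmentRecord:
    """Stdlib stand-in for the frozen JudgmentResult."""

    ominousness: int
    certainty: int
    drama: int
    metaphor_family: str


record = JudgmentRecord(8, 9, 7, "biological")
try:
    record.ominousness = 1  # mutation is rejected on a frozen instance
except FrozenInstanceError:
    print("immutable")  # immutable
```

Pydantic's `frozen=True` raises its own error type on mutation, but the design intent is the same: evaluation records are fixed at creation time.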
class babylon.ai.judge.NarrativeCommissar(llm)[source]

Bases: object

LLM-powered judge that evaluates narrative quality.

The Commissar uses structured prompting to extract consistent metrics from narrative text. It follows the LLM-as-judge pattern where the LLM acts as a consistent evaluator rather than a generator.

This enables automated verification of narrative hypotheses like the “Dialectical U-Curve”: that narrative certainty follows a U-shape across economic conditions.

Parameters:

llm (LLMProvider)

name

Identifier for logging (“NarrativeCommissar”).

Example

>>> from babylon.ai import MockLLM
>>> from babylon.ai.judge import NarrativeCommissar
>>> mock = MockLLM(responses=['{"ominousness": 5, "certainty": 5, "drama": 5, "metaphor_family": "none"}'])
>>> commissar = NarrativeCommissar(llm=mock)
>>> result = commissar.evaluate("The workers unite.")
>>> isinstance(result.ominousness, int)
True
__init__(llm)[source]

Initialize the NarrativeCommissar.

Parameters:

llm (LLMProvider) – LLMProvider implementation for text generation. Use MockLLM for testing, DeepSeekClient for production.

Return type:

None

property name: str

Return identifier for logging.

Returns:

The string “NarrativeCommissar”.

evaluate(text)[source]

Evaluate a narrative text and return structured metrics.

Sends the text to the LLM with a structured prompt requesting JSON output. Parses the response and returns a JudgmentResult.

Parameters:

text (str) – The narrative text to evaluate.

Return type:

JudgmentResult

Returns:

JudgmentResult containing the evaluation metrics.

Raises:
  • json.JSONDecodeError – If LLM response is not valid JSON.

  • pydantic.ValidationError – If JSON values are out of bounds.
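A minimal sketch of the parse-and-validate flow using only the stdlib: json.loads supplies the JSONDecodeError behavior, while a plain bounds check stands in for Pydantic's validation. The helper name `parse_judgment` is hypothetical:

```python
import json


def parse_judgment(raw: str) -> dict:
    """Parse an LLM response into validated metrics (illustrative helper).

    Raises json.JSONDecodeError on malformed JSON, and ValueError
    (standing in for pydantic.ValidationError) on out-of-range scores.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError if not valid JSON
    for field in ("ominousness", "certainty", "drama"):
        if not 1 <= data[field] <= 10:
            raise ValueError(f"{field} must be in 1-10, got {data[field]}")
    return data


metrics = parse_judgment(
    '{"ominousness": 7, "certainty": 8, "drama": 6, "metaphor_family": "biological"}'
)
print(metrics["ominousness"])  # 7
```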

Example

>>> from babylon.ai import MockLLM
>>> from babylon.ai.judge import NarrativeCommissar
>>> mock = MockLLM(responses=['{"ominousness": 7, "certainty": 8, "drama": 6, "metaphor_family": "biological"}'])
>>> commissar = NarrativeCommissar(llm=mock)
>>> result = commissar.evaluate("The crisis deepens.")
>>> result.ominousness
7