babylon.ai.judge

Narrative evaluation via LLM-as-judge pattern.

The NarrativeCommissar evaluates narrative text quality using structured prompting to extract consistent metrics. This enables automated verification of the “Dialectical U-Curve” hypothesis: that narrative certainty follows a U-shape across economic conditions (high at stability, low at inflection, high at collapse).
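A minimal sketch of what checking the U-shape might look like once certainty scores have been collected across stages. The helper name `is_u_shaped` and the shape test itself are illustrative assumptions, not part of babylon.ai:

```python
def is_u_shaped(certainty_by_stage: list[int]) -> bool:
    """Return True if certainty dips in the middle and recovers.

    certainty_by_stage holds 1-10 certainty scores ordered from
    economic stability through inflection to collapse.
    (Hypothetical helper for illustration only.)
    """
    trough = min(certainty_by_stage)
    trough_index = certainty_by_stage.index(trough)
    # A U-shape: the minimum sits strictly inside the sequence,
    # and both endpoints score higher than it.
    return (
        0 < trough_index < len(certainty_by_stage) - 1
        and certainty_by_stage[0] > trough
        and certainty_by_stage[-1] > trough
    )

# Certainty high at stability, low at the inflection point, high at collapse.
print(is_u_shaped([9, 7, 3, 6, 8]))  # True
# A monotonic rise is not a U-curve.
print(is_u_shaped([3, 5, 7, 8, 9]))  # False
```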

Components:

MetaphorFamily: Enum categorizing metaphorical language in narratives.

JudgmentResult: Immutable Pydantic model holding evaluation metrics.

NarrativeCommissar: LLM-powered judge that evaluates narrative quality.

The Commissar follows the same pattern as NarrativeDirector - it uses the LLMProvider protocol for swappable LLM backends (MockLLM for testing, DeepSeekClient for production).
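The swappable-backend pattern can be sketched with typing.Protocol. The real LLMProvider's method name and signature are not shown on this page, so the single-method `generate` shape and the `CannedLLM` test double below are illustrative stand-ins, not the actual babylon.ai API:

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Illustrative protocol: one generate() method (assumed shape)."""

    def generate(self, prompt: str) -> str: ...


class CannedLLM:
    """Test double in the spirit of MockLLM: replays scripted responses."""

    def __init__(self, responses: list[str]) -> None:
        self._responses = iter(responses)

    def generate(self, prompt: str) -> str:
        return next(self._responses)


def evaluate_with(llm: LLMProvider, prompt: str) -> str:
    # Any object with a matching generate() satisfies the protocol,
    # so a mock and a production client are interchangeable here.
    return llm.generate(prompt)


print(evaluate_with(CannedLLM(["scripted reply"]), "Rate this narrative."))  # scripted reply
```

Because Protocol checks are structural, neither MockLLM nor DeepSeekClient would need to inherit from LLMProvider; matching the method signature is enough.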

Example

>>> from babylon.ai import MockLLM, NarrativeCommissar
>>> mock = MockLLM(responses=['{"ominousness": 7, "certainty": 8, "drama": 6, "metaphor_family": "biological"}'])
>>> commissar = NarrativeCommissar(llm=mock)
>>> result = commissar.evaluate("The empire crumbles...")
>>> result.ominousness
7

Classes

JudgmentResult(**data)

Result of narrative evaluation by the Commissar.

MetaphorFamily(*values)

Categories of metaphorical language in narratives.

NarrativeCommissar(llm)

LLM-powered judge that evaluates narrative quality.

class babylon.ai.judge.MetaphorFamily(*values)[source]

Bases: str, Enum

Categories of metaphorical language in narratives.

Narratives about economic crisis tend to cluster around certain metaphorical domains. Tracking these helps identify rhetorical patterns in how the AI describes class struggle.

BIOLOGICAL

Bodies, organs, disease, parasites, health.

PHYSICS

Pressure, tension, phase transitions, energy.

MECHANICAL

Gears, machines, breaking, grinding.

NONE

No strong metaphorical clustering detected.

BIOLOGICAL = 'biological'
PHYSICS = 'physics'
MECHANICAL = 'mechanical'
NONE = 'none'
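Because MetaphorFamily subclasses both str and Enum, the raw string the LLM emits can be converted by value lookup and compared against plain strings. A self-contained stand-in of the same pattern, mirroring the values listed above:

```python
from enum import Enum


class MetaphorFamily(str, Enum):
    """Stand-in mirroring babylon.ai.judge.MetaphorFamily for illustration."""

    BIOLOGICAL = "biological"
    PHYSICS = "physics"
    MECHANICAL = "mechanical"
    NONE = "none"


# Value lookup converts the LLM's raw JSON string into an enum member.
family = MetaphorFamily("biological")
print(family is MetaphorFamily.BIOLOGICAL)  # True
# str subclassing means members compare equal to plain strings.
print(family == "biological")  # True
```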
class babylon.ai.judge.JudgmentResult(**data)[source]

Bases: BaseModel

Result of narrative evaluation by the Commissar.

An immutable record of how the LLM-judge evaluated a narrative text. All metrics use a 1-10 scale for consistency.

Parameters:
ominousness

How threatening/foreboding the narrative is (1-10).

certainty

How confident/absolute the assertions are (1-10).

drama

Emotional intensity of the narrative (1-10).

metaphor_family

Dominant metaphorical domain used.

Example

>>> from babylon.ai.judge import JudgmentResult, MetaphorFamily
>>> result = JudgmentResult(
...     ominousness=8,
...     certainty=9,
...     drama=7,
...     metaphor_family=MetaphorFamily.BIOLOGICAL,
... )
>>> result.ominousness
8
model_config: ClassVar[ConfigDict] = {'frozen': True}

Configuration for the model; a dictionary conforming to pydantic.config.ConfigDict.

ominousness: int
certainty: int
drama: int
metaphor_family: MetaphorFamily
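The frozen configuration means any attempt to mutate a result after construction raises an error. A stdlib analog of the same guarantee, using dataclasses rather than Pydantic so the sketch needs no third-party install (the class name `JudgmentRecord` is a stand-in):

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class JudgmentRecord:
    """Stdlib stand-in for the frozen JudgmentResult."""

    ominousness: int
    certainty: int
    drama: int
    metaphor_family: str


record = JudgmentRecord(8, 9, 7, "biological")
try:
    record.ominousness = 1  # mutation is rejected on a frozen instance
except FrozenInstanceError:
    print("immutable")  # immutable
```

Pydantic's `frozen=True` raises its own error type on mutation, but the design intent is the same: evaluation records are fixed at creation time.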
class babylon.ai.judge.NarrativeCommissar(llm)[source]

Bases: object

LLM-powered judge that evaluates narrative quality.

The Commissar uses structured prompting to extract consistent metrics from narrative text. It follows the LLM-as-judge pattern where the LLM acts as a consistent evaluator rather than a generator.

This enables automated verification of narrative hypotheses like the “Dialectical U-Curve”: that narrative certainty follows a U-shape across economic conditions.

Parameters:

llm (LLMProvider)

name

Identifier for logging (“NarrativeCommissar”).

Example

>>> from babylon.ai import MockLLM
>>> from babylon.ai.judge import NarrativeCommissar
>>> mock = MockLLM(responses=['{"ominousness": 5, "certainty": 5, "drama": 5, "metaphor_family": "none"}'])
>>> commissar = NarrativeCommissar(llm=mock)
>>> result = commissar.evaluate("The workers unite.")
>>> isinstance(result.ominousness, int)
True
__init__(llm)[source]

Initialize the NarrativeCommissar.

Parameters:

llm (LLMProvider) – LLMProvider implementation for text generation. Use MockLLM for testing, DeepSeekClient for production.

Return type:

None

property name: str

Return identifier for logging.

Returns:

The string “NarrativeCommissar”.

evaluate(text)[source]

Evaluate a narrative text and return structured metrics.

Sends the text to the LLM with a structured prompt requesting JSON output. Parses the response and returns a JudgmentResult.

Parameters:

text (str) – The narrative text to evaluate.

Return type:

JudgmentResult

Returns:

JudgmentResult containing the evaluation metrics.

Raises:
  • json.JSONDecodeError – If LLM response is not valid JSON.

  • pydantic.ValidationError – If JSON values are out of bounds.
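A minimal sketch of the parse-and-validate flow using only the stdlib: json.loads supplies the JSONDecodeError behavior, while a plain bounds check stands in for Pydantic's validation. The helper name `parse_judgment` is hypothetical:

```python
import json


def parse_judgment(raw: str) -> dict:
    """Parse an LLM response into validated metrics (illustrative helper).

    Raises json.JSONDecodeError on malformed JSON, and ValueError
    (standing in for pydantic.ValidationError) on out-of-range scores.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError if not valid JSON
    for field in ("ominousness", "certainty", "drama"):
        if not 1 <= data[field] <= 10:
            raise ValueError(f"{field} must be in 1-10, got {data[field]}")
    return data


metrics = parse_judgment(
    '{"ominousness": 7, "certainty": 8, "drama": 6, "metaphor_family": "biological"}'
)
print(metrics["ominousness"])  # 7
```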

Example

>>> from babylon.ai import MockLLM
>>> from babylon.ai.judge import NarrativeCommissar
>>> mock = MockLLM(responses=['{"ominousness": 7, "certainty": 8, "drama": 6, "metaphor_family": "biological"}'])
>>> commissar = NarrativeCommissar(llm=mock)
>>> result = commissar.evaluate("The crisis deepens.")
>>> result.ominousness
7