Design Alternatives & Rationale

Status

Draft · v0.2.0 · 2026-06-30

Overview

This chapter documents the systematic design space exploration conducted for the IOA-ORM specification. It follows the QOC (Questions, Options, Criteria) methodology (MacLean et al., 1991) to make explicit the design decisions, alternatives considered, evaluation criteria grounded in literature, and trade-offs accepted.

The design space exploration identified 12 primary design decisions, each evaluated against 4–6 criteria drawn from assessment theory (Joughin, 1998; Akimov & Malin, 2020; Pearce & Chiavaroli, 2020; Fenton, 2025; Bayley et al., 2024), information systems literature (Buneman et al., 2001; van der Aalst et al., 2003; Hohpe & Woolf, 2003; Young, 2010; Fowler, 2005), agent systems research (Yao et al., 2023; Schick et al., 2023), AI safety literature (Bai et al., 2022; Rebedea et al., 2023; Greshake et al., 2023), and software engineering (Bass et al., 2012; Gamma et al., 1995; Lattner et al., 2020).

The methodology guards against design fixation, described as “a deeply ingrained psychological tendency where designers unconsciously adhere to the influence of prior designs” (Jansson & Smith, 1991, cited in Grisold et al., 2021). By systematically exploring alternatives, we ensure that the chosen design is justified by its merits, not by path dependence.

Decision 1: IR as Compilation Target

Question

Why introduce an Intermediate Representation (IR) as the compilation target from the authoring studio? Why not have the authoring studio produce flowJson directly?

Options Considered

Option	Description
A: Direct Authoring → Runtime	Authoring studio communicates directly with runtime. No intermediate artifact.
B: Authoring → flowJson → Runtime	flowJson (Pipecat FlowManager config) serves as the source of truth.
C: Authoring → IR → Multi-Target	Rich IR compiles to Pipecat adapter, runtime controller config, and marking config.
D: Authoring → IR → Single Target	IR compiles only to the runtime controller.

Evaluation Criteria (with literature grounding)

Criterion	Source
Separation of concerns	Bass, Clements & Kazman (2012) — layered architecture enables independent evolution
Multi-target compilation	Lattner et al. (2020) — MLIR concept: one IR, multiple compilation targets
Versionability & diffability	Fowler (2005) — event sourcing: stable artifacts enable change tracking
Lossless semantic capture	van der Aalst et al. (2003) — workflow patterns: the specification must capture all assessment semantics
Authoring independence	Gamma et al. (1995) — adapter pattern decouples authoring from execution
Testability	Bass et al. (2012) — compile-time validation of IR packages

Evaluation Matrix

Criterion	A: Direct	B: flowJson	C: Multi-Target	D: Single-Target
Separation of concerns	Weak	Moderate	Strong	Strong
Multi-target compilation	None	Weak	Strong	Moderate
Versionability	Weak	Moderate	Strong	Strong
Lossless semantic capture	Moderate	Weak	Strong	Strong
Authoring independence	Weak	Moderate	Strong	Strong
Testability	Weak	Moderate	Strong	Strong

Decision & Rationale

Chosen: Option C — Multi-Target IR. The IR serves as the canonical, versioned, executable specification of a published oral assessment. It is the single source of truth consumed by the Pipecat adapter, runtime controller, and marking runtime. From 00-overview.md: “flowJson is a serialization convenience — a bag of nodes and edges consumed by Pipecat’s FlowManager. It was designed to describe conversational flow, not to serve as the canonical executable specification of a high-stakes oral assessment.”

The multi-target property is critical because different consumers need different views: the Pipecat adapter needs conversational flow, the runtime controller needs policy enforcement, and the marking runtime needs evidence targets and rubric mappings.
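
The multi-target idea can be illustrated with a minimal sketch in which one IR package is projected into three consumer-specific views. The types `IrNodeSpec` and `IrPackage`, their fields, and the three `to...View` functions are hypothetical names chosen for illustration; the actual IR schema is defined elsewhere in the specification and may differ.

```typescript
// Hypothetical IR shape: field names are illustrative, not taken from the spec.
interface IrNodeSpec {
  id: string;
  prompt: string;              // conversational content (Pipecat adapter)
  minTurns: number;            // structural policy (runtime controller)
  evidenceTargetIds: string[]; // what gets marked (marking runtime)
}

interface IrPackage {
  version: string;
  nodes: IrNodeSpec[];
}

// Target 1: the Pipecat adapter only needs conversational flow.
function toPipecatView(ir: IrPackage): { id: string; prompt: string }[] {
  return ir.nodes.map((n) => ({ id: n.id, prompt: n.prompt }));
}

// Target 2: the runtime controller only needs policy.
function toControllerView(ir: IrPackage): { id: string; minTurns: number }[] {
  return ir.nodes.map((n) => ({ id: n.id, minTurns: n.minTurns }));
}

// Target 3: the marking runtime only needs evidence targets per node.
function toMarkingView(ir: IrPackage): Record<string, string[]> {
  const view: Record<string, string[]> = {};
  for (const n of ir.nodes) {
    view[n.id] = n.evidenceTargetIds.slice();
  }
  return view;
}

const sampleIr: IrPackage = {
  version: "0.2.0",
  nodes: [
    { id: "intro", prompt: "Describe your project.", minTurns: 2, evidenceTargetIds: ["t1", "t2"] },
  ],
};
```

Each projection drops the fields its consumer has no use for, so the Pipecat view never carries policy or marking data.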

Rejected Options

Option A (Direct): Tight coupling between authoring and runtime makes independent evolution impossible.
Option B (flowJson): flowJson lacks runtime state schema, hard constraint vocabulary, event contract, and versioning (00-overview.md §1).
Option D (Single-Target): The separation between evidence collection (runtime) and evidence evaluation (marking) requires distinct compilation targets (Akimov & Malin, 2020).

Decision 2: Three-Layer Architecture

Question

Why three layers (Specification / Runtime Controller / Pipecat Adapter) instead of two or four?

Options Considered

Option	Description
A: Two Layers	IR compiles to combined Runtime+Pipecat component.
B: Three Layers	IR → Runtime Controller → Pipecat Adapter. LLM is a tool invoked by the Runtime Controller.
C: Four Layers	IR → Policy Engine → Runtime Controller → Pipecat Adapter.
D: Pipecat + Overlay	Pipecat FlowManager is primary; lightweight overlay handles what Pipecat cannot.

Evaluation Criteria

Criterion	Source
Single responsibility	Bass et al. (2012) — each layer should have one reason to change
Domain logic isolation	Young (2010) CQRS — domain logic separated from infrastructure
LLM boundary enforcement	Rebedea et al. (2023) — proxy layer enables programmable constraint enforcement
Testability	Bass et al. (2012) — independently testable layers
Pipecat independence	Gamma et al. (1995) — adapter pattern decouples domain from Pipecat

Evaluation Matrix

Criterion	A: Two Layers	B: Three Layers	C: Four Layers	D: Pipecat+Overlay
Single responsibility	Weak	Strong	Moderate	Weak
Domain logic isolation	Weak	Strong	Strong	Weak
LLM boundary enforcement	Moderate	Strong	Strong	Moderate
Testability	Weak	Strong	Strong	Weak
Pipecat independence	Weak	Strong	Strong	Weak

Decision & Rationale

Chosen: Option B — Three Layers. The Runtime Controller acts as a programmable proxy between the LLM and the environment (Rebedea et al., 2023). The LLM is a tool invoked by the Runtime Controller — it does NOT own state, does NOT decide transitions unilaterally, and does NOT persist evidence directly (03-runtime-semantics.md §1.2).
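
The dependency direction described above can be sketched as follows. All names (`LlmTool`, `VoiceAdapter`, `RuntimeControllerSketch`) are hypothetical, and the advance rule is a deliberately simplified stand-in for real policy evaluation. The point is only that the controller owns state and calls the LLM as a stateless tool.

```typescript
// All names below are hypothetical; they only illustrate the dependency direction.
interface LlmReply {
  utterance: string;           // wording is the LLM's to choose
  evidenceSufficient: boolean; // an opinion, not a decision
}

// The LLM is a tool: a call with no access to session state.
type LlmTool = (nodeId: string, candidateText: string) => LlmReply;

// The adapter is the only layer that would touch the voice pipeline.
interface VoiceAdapter {
  speak(text: string): void;
}

class RuntimeControllerSketch {
  private turns = 0;
  private nodeIndex = 0;
  private readonly nodeIds: string[];
  private readonly minTurns: number;
  private readonly llm: LlmTool;
  private readonly voice: VoiceAdapter;

  constructor(nodeIds: string[], minTurns: number, llm: LlmTool, voice: VoiceAdapter) {
    this.nodeIds = nodeIds;
    this.minTurns = minTurns;
    this.llm = llm;
    this.voice = voice;
  }

  get currentNode(): string {
    return this.nodeIds[this.nodeIndex];
  }

  onCandidateTurn(text: string): void {
    const reply = this.llm(this.currentNode, text);
    this.turns += 1;
    this.voice.speak(reply.utterance);
    // The controller, not the LLM, decides whether to move on.
    const canAdvance = this.nodeIndex < this.nodeIds.length - 1;
    if (reply.evidenceSufficient && this.turns >= this.minTurns && canAdvance) {
      this.nodeIndex += 1;
      this.turns = 0;
    }
  }
}

const spoken: string[] = [];
const eagerLlm: LlmTool = (_nodeId, text) => ({
  utterance: `Tell me more about: ${text}`,
  evidenceSufficient: true,
});
const controller = new RuntimeControllerSketch(["warmup", "core"], 2, eagerLlm, {
  speak: (t) => { spoken.push(t); },
});
controller.onCandidateTurn("my project");  // stays on "warmup": minTurns not met
controller.onCandidateTurn("more detail"); // advances to "core"
```

Even though the stub LLM claims sufficiency on every turn, the controller holds the node until its own turn requirement is met.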

Rejected Options

Option A: Domain logic mixed with Pipecat integration makes it impossible to swap the voice pipeline.
Option C: Policy evaluation is lightweight and doesn’t justify a separate layer.
Option D: The overlay pattern is reactive rather than proactive; the NeMo Guardrails work (Rebedea et al., 2023) favors a proxy architecture for reliable constraint enforcement.

Decision 3: Agent Communication Model

Question

How should the LLM communicate with the Runtime Controller?

Options Considered

Option	Description
A: Single `report_observation`	One function bundles all observations per turn.
B: Multiple Functions	`report_evidence_signal`, `report_candidate_command`, `request_transition`.
C: Free-Text + Parsing	LLM produces free-text; runtime parses for signals/commands.
D: Structured Output	LLM produces structured JSON (no function calling).

Evaluation Criteria

Criterion	Source
Hallucination risk	Schick et al. (2023) — more tools = more hallucination
Latency	Yao et al. (2023) — multiple calls = multiple round-trips
Atomicity	Young (2010) — all observations processed atomically
Interpretability	Yao et al. (2023) — transparent reasoning for audit trails

Evaluation Matrix

Criterion	A: Single Function	B: Multiple Functions	C: Free-Text	D: Structured Output
Hallucination risk	Strong	Weak	Moderate	Strong
Latency	Strong	Weak	Strong	Strong
Atomicity	Strong	Weak	Moderate	Strong
Interpretability	Strong	Moderate	Weak	Strong

Decision & Rationale

Chosen: Option A — Single report_observation. From 04-agent-boundary.md: “Previous design had request_transition + report_evidence_signal + report_candidate_command. This caused multiple LLM round-trips per turn, increased latency, and hallucination risk. One function = one call = one Runtime Controller evaluation = atomic decision-making.”
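
A minimal sketch of what a single-call payload and its handler might look like. Only `report_observation`, `evidenceSufficient`, and `needsFollowUp` are named in this chapter; the other fields, the `NextStep` values, and the priority order among them are assumptions made for illustration.

```typescript
// Payload of the single `report_observation` call. `evidenceSufficient` and
// `needsFollowUp` are named in the spec; every other field is an assumption.
interface ObservedSignal {
  targetId: string;
  kind: string;
  confidence: number; // 0.0 to 1.0
}

interface ReportObservationArgs {
  signals: ObservedSignal[];
  candidateCommand?: string; // hypothetical, e.g. "repeat_question"
  evidenceSufficient: boolean;
  needsFollowUp: boolean;
}

type NextStep = "handle_command" | "follow_up" | "evaluate_completion" | "continue";

interface TurnOutcome {
  recorded: ObservedSignal[];
  nextStep: NextStep;
}

// One call in, one decision out: signals, commands, and the transition hint
// are evaluated together, so nothing has to be correlated across calls.
function handleReportObservation(args: ReportObservationArgs): TurnOutcome {
  // Drop signals whose confidence is outside the valid range.
  const recorded = args.signals.filter(
    (s) => s.confidence >= 0 && s.confidence <= 1,
  );
  // Assumed priority: command, then follow-up, then completion check.
  let nextStep: NextStep = "continue";
  if (args.candidateCommand !== undefined) {
    nextStep = "handle_command";
  } else if (args.needsFollowUp) {
    nextStep = "follow_up";
  } else if (args.evidenceSufficient) {
    nextStep = "evaluate_completion";
  }
  return { recorded, nextStep };
}
```

Because everything for a turn arrives in one object, the handler can reach its decision in a single pass without waiting for further calls.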

Rejected Options

Option B: Correlation problem (matching signals to transitions) and latency cost.
Option C: Parsing complexity and prompt injection risk (Greshake et al., 2023).
Option D: Functionally similar to A; A was preferred for its Pipecat integration.

Decision 4: Evidence Signal Model

Question

How should evidence of candidate competence be represented?

Options Considered

Option	Description
A: Discrete Signal Kinds Only	`positive`, `partial`, `absent`, `misconception`. No confidence.
B: Continuous Confidence Only	0.0–1.0 confidence score. No discrete classification.
C: Rich Taxonomy	Discrete kinds + confidence + process quality + provenance + scaffolding intensity.
D: Rubric-Level Judgments	LLM directly maps evidence to rubric levels.

Evaluation Criteria

Criterion	Source
Assessment validity	Joughin (1998) — must capture what oral assessment measures
Process quality capture	Fenton (2025) — “the process of learning rather than the output”
Provenance tracking	Buneman et al. (2001) — why/where/how provenance
Moderation support	Akimov & Malin (2020) — human review and override

Evaluation Matrix

Criterion	A: Discrete Only	B: Confidence Only	C: Rich Taxonomy	D: Rubric-Level
Assessment validity	Moderate	Moderate	Strong	Moderate
Process quality capture	Weak	Weak	Strong	Weak
Provenance tracking	Weak	Weak	Strong	Weak
Moderation support	Moderate	Moderate	Strong	Weak

Decision & Rationale

Chosen: Option C — Rich Taxonomy. Eight signal kinds (positive, partial, absent, misconception, flawed_reasoning, process_positive, process_negative, self_correction) capture the full spectrum of assessment evidence. Confidence scores enable calibration. Provenance fields (proposedBy, sttConfidenceSummary, scaffoldingIntensity) support auditing and moderation. Fenton (2025): “Oral assessments reveal the process of learning rather than the output.”
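
As a sketch, the rich taxonomy could be typed as below. The eight kinds and the field names `proposedBy`, `sttConfidenceSummary`, `scaffoldingIntensity`, and `turnIds` come from this chapter. The value types, the `"llm"` literal, the `targetId` field, and the rule that a signal must reference at least one turn are assumptions.

```typescript
// The eight kinds and the provenance field names come from the text above;
// the value types and the "llm" literal are assumptions.
const SIGNAL_KINDS = [
  "positive", "partial", "absent", "misconception",
  "flawed_reasoning", "process_positive", "process_negative", "self_correction",
] as const;

type EvidenceSignalKind = (typeof SIGNAL_KINDS)[number];

interface EvidenceSignal {
  targetId: string;              // hypothetical link to an evidence target
  kind: EvidenceSignalKind;
  confidence: number;            // 0.0 to 1.0
  proposedBy: "llm" | "manual_marker";
  sttConfidenceSummary: number;  // assumed: mean STT confidence, 0.0 to 1.0
  scaffoldingIntensity: number;  // assumed: 0 means no help was given
  turnIds: string[];
}

function isSignalKind(value: string): value is EvidenceSignalKind {
  return (SIGNAL_KINDS as readonly string[]).indexOf(value) !== -1;
}

// Returns a list of problems; an empty list means the input is well-formed.
function validateSignal(raw: { kind: string; confidence: number; turnIds: string[] }): string[] {
  const problems: string[] = [];
  if (!isSignalKind(raw.kind)) {
    problems.push(`unknown kind: ${raw.kind}`);
  }
  if (!(raw.confidence >= 0 && raw.confidence <= 1)) {
    problems.push("confidence must be within 0.0 to 1.0");
  }
  if (raw.turnIds.length === 0) {
    // Assumed rule: every signal links back to the transcript.
    problems.push("signal must reference at least one transcript turn");
  }
  return problems;
}

const exampleSignal: EvidenceSignal = {
  targetId: "t1",
  kind: "self_correction",
  confidence: 0.8,
  proposedBy: "llm",
  sttConfidenceSummary: 0.93,
  scaffoldingIntensity: 1,
  turnIds: ["turn-7"],
};
```

Keeping the kind discrete and the confidence continuous lets a moderator filter by type of evidence and by certainty independently.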

Rejected Options

Option A: Misses process quality (self-correction, reasoning process).
Option B: Misses what type of evidence was observed.
Option D: Conflates evidence collection with evidence interpretation.

Decision 5: Policy Expression

Question

How should assessment policies be expressed?

Options Considered

Option	Description
A: Structured Data Objects	TypeScript interfaces: `CompletionPolicy`, `FollowUpPolicy`, etc.
B: Code	Executable Python/TypeScript functions.
C: Prompt Instructions	Natural language in LLM system prompt.
D: Rule Engine	Domain-specific language (e.g., Colang).

Evaluation Criteria

Criterion	Source
Machine enforceability	Rebedea et al. (2023) — programmable rails over embedded rails
Authorability	Bass et al. (2012) — accessible to assessment designers
Compile-time validation	Lattner et al. (2020) — catch errors before runtime
LLM independence	Bai et al. (2022) — enforced regardless of LLM behavior

Evaluation Matrix

Criterion	A: Structured Data	B: Code	C: Prompts	D: Rule Engine
Machine enforceability	Strong	Strong	Weak	Strong
Authorability	Strong	Weak	Strong	Moderate
Compile-time validation	Strong	Moderate	Weak	Strong
LLM independence	Strong	Strong	Weak	Strong

Decision & Rationale

Chosen: Option A — Structured Data Objects. Policies are typed data structures evaluated deterministically by the Runtime Controller. From 03-runtime-semantics.md: “The Runtime MUST evaluate transition conditions in the order they appear in the specification. The FIRST matching condition wins.” Structured data enables compile-time validation (08-validation-rules.md) and is authorable via the authoring studio UI.
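
The first-match-wins rule quoted above can be sketched with policies held as plain data. The `TransitionCondition` shape and the sample policy are hypothetical; the spec's own policy interfaces (`CompletionPolicy`, `FollowUpPolicy`, and so on) may be structured differently.

```typescript
// Hypothetical condition shape; the spec's actual policy interfaces may differ.
interface TransitionCondition {
  label: string;
  minTurns?: number;
  requiredEvidenceTargetIds?: string[];
  goTo: string;
}

interface NodeSnapshot {
  turns: number;
  satisfiedTargetIds: string[];
}

function conditionMatches(c: TransitionCondition, snap: NodeSnapshot): boolean {
  if (c.minTurns !== undefined && snap.turns < c.minTurns) {
    return false;
  }
  const required = c.requiredEvidenceTargetIds ?? [];
  return required.every((id) => snap.satisfiedTargetIds.indexOf(id) !== -1);
}

// Conditions are tried in specification order; the first match wins.
function evaluateTransitions(
  conditions: TransitionCondition[],
  snap: NodeSnapshot,
): TransitionCondition | null {
  for (const c of conditions) {
    if (conditionMatches(c, snap)) {
      return c;
    }
  }
  return null;
}

const transitionPolicy: TransitionCondition[] = [
  { label: "all evidence collected", requiredEvidenceTargetIds: ["t1", "t2"], goTo: "next_topic" },
  { label: "turn cap reached", minTurns: 5, goTo: "wrap_up" },
];
```

Since the policy is data, the same array can be validated at compile time and rendered in an authoring UI, and reordering it changes behavior in an auditable way.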

Rejected Options

Option B: Not auditable by non-engineers; security risk.
Option C: Not enforceable; Greshake et al. (2023) show prompt constraints can be overridden.
Option D: Near-miss; Colang-style DSLs are powerful but add a learning curve.

Decision 6: State Machine vs Implicit Tracking

Question

Should the exam lifecycle be governed by explicit state machines or by implicit state derived from the event log?

Options Considered

Option	Description
A: Explicit State Machines	Formal SMs for exam/node/turn lifecycle.
B: Event Sourcing	State derived by replaying event log.
C: Hybrid	Explicit SMs for runtime + event log for audit/recovery.
D: Reactive State	Observable streams, no central authority.

Evaluation Criteria

Criterion	Source
Invariant enforcement	van der Aalst et al. (2003) — structural invariants
Recovery capability	Fowler (2005) — state reconstructable from events
Auditability	Young (2010) — every state change recorded
Performance	Young (2010) — fast reads for real-time interaction

Evaluation Matrix

Criterion	A: Explicit SM	B: Event Sourcing	C: Hybrid	D: Reactive
Invariant enforcement	Strong	Moderate	Strong	Weak
Recovery capability	Moderate	Strong	Strong	Weak
Auditability	Moderate	Strong	Strong	Weak
Performance	Strong	Weak	Strong	Strong

Decision & Rationale

Chosen: Option C — Hybrid. Explicit state machines (exam/node/turn lifecycle) enforce invariants at runtime. Event log provides audit trail and recovery capability. From 02-schema.md §5: “RuntimeStateSchema is mutable per-session state tracked by the runtime controller. NOT persisted as a log — this is working memory.”
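
A minimal sketch of the hybrid approach, assuming a four-phase exam lifecycle. The phase names, the transition table, and the class API are illustrative only. The working phase is held in memory for fast reads, while each change is also appended to a log from which the phase can be rebuilt.

```typescript
// Phase names and the allowed-transition table are illustrative assumptions.
type ExamPhase = "not_started" | "in_progress" | "paused" | "completed";

const ALLOWED_TRANSITIONS: Record<ExamPhase, ExamPhase[]> = {
  not_started: ["in_progress"],
  in_progress: ["paused", "completed"],
  paused: ["in_progress", "completed"],
  completed: [],
};

interface PhaseChange {
  seq: number;
  from: ExamPhase;
  to: ExamPhase;
}

class ExamLifecycle {
  private phase: ExamPhase = "not_started";     // working memory
  private readonly changes: PhaseChange[] = []; // append-only audit log

  get current(): ExamPhase {
    return this.phase;
  }

  get log(): readonly PhaseChange[] {
    return this.changes;
  }

  // The state machine rejects illegal moves before anything is logged.
  transition(to: ExamPhase): void {
    if (ALLOWED_TRANSITIONS[this.phase].indexOf(to) === -1) {
      throw new Error(`illegal transition: ${this.phase} -> ${to}`);
    }
    this.changes.push({ seq: this.changes.length + 1, from: this.phase, to });
    this.phase = to;
  }
}

// Recovery path: rebuild the current phase from the log alone.
function replayPhase(log: readonly PhaseChange[]): ExamPhase {
  let phase: ExamPhase = "not_started";
  for (const change of log) {
    phase = change.to;
  }
  return phase;
}
```

Reads during a live session touch only the in-memory phase; replay is needed only after a crash or during audit.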

Rejected Options

Option A: Limited audit trail and recovery capability without an event log.
Option B: Replay too slow for real-time voice interaction.
Option D: Unnecessary complexity for single-threaded sessions.

Decision 7: Agent Autonomy Gradient

Question

Where should the line be drawn between LLM autonomy and Runtime control?

Options Considered

Option	Description
A: Two Levels	Autonomous / Controlled.
B: Three Levels	Autonomous / Advisory / Controlled.
C: Five Levels	Fully Autonomous / Guided / Advisory / Constrained / Forbidden.
D: Full Autonomy	LLM decides everything; post-hoc validation.

Evaluation Criteria

Criterion	Source
Assessment naturalness	Joughin (1998) — “bidirectional adaptation”
Fairness	Akimov & Malin (2020) — consistent treatment
Safety	Bai et al. (2022) — constitutional principles
Agent systems alignment	Yao et al. (2023) — action space augmentation

Evaluation Matrix

Criterion	A: Two Levels	B: Three Levels	C: Five Levels	D: Full Autonomy
Assessment naturalness	Moderate	Strong	Strong	Strong
Fairness	Strong	Strong	Strong	Weak
Safety	Strong	Strong	Strong	Weak
Agent systems alignment	Moderate	Strong	Moderate	Weak

Decision & Rationale

Chosen: Option B — Three Levels. The LLM fully decides wording and dialogue strategy (autonomous), advises on evidence sufficiency and follow-up need (advisory), and is fully controlled on transitions and scoring (controlled). This maps to ReAct’s (Yao et al., 2023) action space augmentation: Thoughts are autonomous, Actions that affect the environment are advisory or controlled.
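
The three levels can be written down as a lookup table. The concern identifiers below paraphrase the paragraph above and are hypothetical, as are the two helper functions.

```typescript
type AutonomyLevel = "autonomous" | "advisory" | "controlled";

// Concern names paraphrase the paragraph above; the identifiers are hypothetical.
type Concern =
  | "wording"
  | "dialogueStrategy"
  | "evidenceSufficiency"
  | "followUpNeed"
  | "transitions"
  | "scoring";

const AUTONOMY_BY_CONCERN: Record<Concern, AutonomyLevel> = {
  wording: "autonomous",
  dialogueStrategy: "autonomous",
  evidenceSufficiency: "advisory",
  followUpNeed: "advisory",
  transitions: "controlled",
  scoring: "controlled",
};

// Who has the final say on a given concern.
function finalAuthority(concern: Concern): "llm" | "runtime" {
  return AUTONOMY_BY_CONCERN[concern] === "autonomous" ? "llm" : "runtime";
}

// Whether the LLM's output on this concern is taken into account at all.
function llmInputConsidered(concern: Concern): boolean {
  return AUTONOMY_BY_CONCERN[concern] !== "controlled";
}
```

The advisory level is the only one where the two functions disagree: the LLM's input is considered, but the Runtime decides.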

Rejected Options

Option A: Too coarse for nuanced evidence assessment.
Option C: Over-engineered; three levels provide sufficient granularity.
Option D: Violates safety requirements for summative assessment.

Decision 8: Event Protocol

Question

What event architecture should the system use?

Options Considered

Option	Description
A: Push Only	Runtime pushes events via WebSocket/LiveKit.
B: Pull Only	Consumers poll for state changes.
C: Event Sourcing	Append-only event store; consumers read from store.
D: Hybrid	Push for real-time + event store for persistence.

Evaluation Criteria

Criterion	Source
Real-time delivery	Hohpe & Woolf (2003) — frontend needs low-latency updates
Auditability	Young (2010) — all events persisted for audit
Replay capability	Fowler (2005) — sessions reconstructable from events
Transport agnosticism	Hohpe & Woolf (2003) — same event over multiple transports

Evaluation Matrix

Criterion	A: Push Only	B: Pull Only	C: Event Sourcing	D: Hybrid
Real-time delivery	Strong	Weak	Moderate	Strong
Auditability	Weak	Moderate	Strong	Strong
Replay capability	Weak	Weak	Strong	Strong
Transport agnosticism	Moderate	Moderate	Strong	Strong

Decision & Rationale

Chosen: Option D — Hybrid. Push events to real-time consumers (frontend via LiveKit data channel, WebSocket). Persist all events to an append-only store for audit and marking. Events are transport-agnostic (05-event-protocol.md Principle E3): the same envelope works over all transports.
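
A sketch of the hybrid protocol under stated assumptions: an in-memory array stands in for the append-only store, and transports are plain callbacks. The envelope fields `source`, `timestamp`, and `correlationId` are named in this chapter; `type`, `payload`, and the `HybridEventBus` class are hypothetical.

```typescript
// `source`, `timestamp`, and `correlationId` are named in this chapter;
// the remaining fields and the class itself are assumptions.
interface EventEnvelope {
  type: string;
  source: string;
  timestamp: string; // ISO 8601
  correlationId: string;
  payload: unknown;
}

// Any transport (WebSocket, data channel, ...) is just a sink for envelopes.
type Transport = (envelope: EventEnvelope) => void;

class HybridEventBus {
  private readonly store: EventEnvelope[] = []; // stand-in for an append-only store
  private readonly transports: Transport[] = [];

  attach(transport: Transport): void {
    this.transports.push(transport);
  }

  publish(envelope: EventEnvelope): void {
    // Persist first, so a failing transport cannot lose the event.
    this.store.push(Object.freeze({ ...envelope }));
    for (const send of this.transports) {
      try {
        send(envelope);
      } catch {
        // Push is best-effort; the stored copy remains authoritative.
      }
    }
  }

  // Audit and marking read from the store, not from the live push path.
  replay(correlationId: string): EventEnvelope[] {
    return this.store.filter((e) => e.correlationId === correlationId);
  }
}
```

The same envelope object goes to every transport unchanged, which is what makes the event transport-agnostic in this sketch.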

Rejected Options

Option A: No persistence; events lost on failure.
Option B: Unacceptable latency for real-time UI.
Option C: Replay too slow for real-time consumers.

Decision 9: Evidence as First-Class Output

Question

Should evidence be a separate artifact, embedded in transcript, or extracted post-hoc?

Options Considered

Option	Description
A: Separate EvidenceLedger	Real-time structured ledger with provenance.
B: Embedded in Transcript	Evidence as metadata on transcript turns.
C: Post-Hoc Extraction	Evidence derived from transcript after exam.
D: Hybrid	Real-time signals + post-hoc enrichment.

Evaluation Criteria

Criterion	Source
Marking determinism	Akimov & Malin (2020) — reproducible marking
Provenance	Buneman et al. (2001) — why/where/how provenance
Moderation support	Akimov & Malin (2020) — human review and override
Separation of concerns	Young (2010) — collection separate from evaluation

Evaluation Matrix

Criterion	A: Separate Ledger	B: Embedded	C: Post-Hoc	D: Hybrid
Marking determinism	Strong	Moderate	Weak	Strong
Provenance	Strong	Moderate	Weak	Strong
Moderation support	Strong	Moderate	Weak	Strong
Separation of concerns	Strong	Weak	Moderate	Strong

Decision & Rationale

Chosen: Option A — Separate EvidenceLedger. Evidence signals are written to a dedicated ledger in real-time. Each signal carries provenance (proposedBy, sttConfidenceSummary, scaffoldingIntensity) and is linked to transcript turns via turnIds. From 06-evidence-ledger.md: “The evidence ledger is not a post-processing step over the transcript. It is a structured, real-time, authoritative record of assessment evidence.”
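
One way to picture a separate ledger that supports independent override is sketched below. `proposedBy`, `turnIds`, and the `"manual_marker"` value appear in this chapter; the `supersedes` link, the `"llm"` literal, and the `EvidenceLedger` API shown here are assumptions, not the interface defined in 06-evidence-ledger.md.

```typescript
// `proposedBy`, `turnIds`, and "manual_marker" appear in this chapter;
// the `supersedes` link and the class API are assumptions.
interface LedgerEntry {
  entryId: number;
  targetId: string;
  kind: string;
  proposedBy: "llm" | "manual_marker";
  turnIds: string[];
  supersedes?: number; // entryId of the entry this one overrides
}

class EvidenceLedger {
  private readonly entries: LedgerEntry[] = [];

  append(entry: Omit<LedgerEntry, "entryId">): LedgerEntry {
    const stored: LedgerEntry = { ...entry, entryId: this.entries.length + 1 };
    this.entries.push(stored);
    return stored;
  }

  // Entries still in force: everything that no later entry supersedes.
  effective(): LedgerEntry[] {
    const overridden = this.entries
      .map((e) => e.supersedes)
      .filter((id): id is number => id !== undefined);
    return this.entries.filter((e) => overridden.indexOf(e.entryId) === -1);
  }

  // The full history, including overridden entries, stays available for audit.
  all(): readonly LedgerEntry[] {
    return this.entries;
  }
}
```

An override is a new entry, never an edit, so the transcript and the original signal both remain untouched.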

Rejected Options

Option B: Conflating evidence with transcript prevents independent override and provenance tracking.
Option C: Non-deterministic; no real-time feedback for formative assessment.
Option D: Near-miss; the spec supports post-hoc enrichment via proposedBy: "manual_marker".

Decision 10: Recovery Strategy

Question

How should the system handle failures?

Options Considered

Option	Description
A: Fully Automated	All failures handled by runtime.
B: Human-in-the-Loop	All failures require proctor decision.
C: Categorized	Technical = automated; Assessment = LLM-assisted with runtime guardrails.
D: Graduated	Severity-based: minor → automated; moderate → LLM-assisted; severe → human.

Evaluation Criteria

Criterion	Source
Assessment validity	Fenton (2025) — recovery must not compromise assessment
Candidate welfare	Akimov & Malin (2020) — recovery must not cause distress
Operational feasibility	Bayley et al. (2024) — must work at scale
Assessment neutrality	Pearce & Chiavaroli (2020) — recovery must not reveal rubric

Evaluation Matrix

Criterion	A: Automated	B: Human	C: Categorized	D: Graduated
Assessment validity	Moderate	Strong	Strong	Strong
Candidate welfare	Moderate	Strong	Strong	Strong
Operational feasibility	Strong	Weak	Strong	Strong
Assessment neutrality	Strong	Moderate	Strong	Strong

Decision & Rationale

Chosen: Option C — Categorized Recovery. Technical failures (network, STT, TTS, silence) are handled automatically. Assessment failures (candidate confusion, distress, off-topic) are handled by the LLM with runtime guardrails. From 03-runtime-semantics.md §6.1: “Recovery MUST NOT reveal model answers, rubric scoring logic, or the ‘correct’ response.”
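
The categorization can be sketched as a dispatch on failure kind. The failure names follow the examples in the paragraph above; the `RecoveryPlan` shape, the action strings, and the guardrail wording (paraphrased from the quoted rule) are hypothetical.

```typescript
// Failure names follow the examples in the paragraph above; the plan shape
// and action strings are hypothetical.
type TechnicalFailure = "network" | "stt" | "tts" | "silence";
type AssessmentFailure = "candidate_confusion" | "candidate_distress" | "off_topic";
type FailureKind = TechnicalFailure | AssessmentFailure;

type RecoveryPlan =
  | { handledBy: "runtime"; action: string }
  | { handledBy: "llm"; guardrails: string[] };

// Paraphrase of the quoted rule on what recovery must not reveal.
const RECOVERY_GUARDRAILS = [
  "do not reveal model answers",
  "do not reveal rubric scoring logic",
  "do not reveal the correct response",
];

function planRecovery(failure: FailureKind): RecoveryPlan {
  switch (failure) {
    // Technical failures: deterministic, handled by the runtime alone.
    case "network":
      return { handledBy: "runtime", action: "reconnect" };
    case "stt":
      return { handledBy: "runtime", action: "ask_candidate_to_repeat" };
    case "tts":
      return { handledBy: "runtime", action: "retry_synthesis" };
    case "silence":
      return { handledBy: "runtime", action: "prompt_after_timeout" };
    // Assessment failures: the LLM responds, bounded by runtime guardrails.
    case "candidate_confusion":
    case "candidate_distress":
    case "off_topic":
      return { handledBy: "llm", guardrails: RECOVERY_GUARDRAILS.slice() };
  }
}
```

The two branches return differently shaped plans, which reflects the rationale that technical and assessment recovery are different mechanisms, not different severities of one mechanism.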

Rejected Options

Option A: Cannot handle nuanced affective/pedagogical recovery.
Option B: Infeasible at scale (Bayley et al., 2024: 600+ students).
Option D: Near-miss; categorization (technical vs. assessment) was chosen over severity because the recovery mechanisms are fundamentally different.

Decision 11: Transition Authority

Question

Who decides when to transition between nodes?

Options Considered

Option	Description
A: Runtime Only	Deterministic policy evaluation. No LLM input.
B: LLM Proposes + Runtime Approves	LLM signals readiness; runtime validates against policy.
C: LLM Decides	LLM controls transitions.
D: Runtime + LLM Delay	Runtime decides; LLM can request more time.

Evaluation Criteria

Criterion	Source
Fairness	Akimov & Malin (2020) — consistent treatment
Auditability	Buneman et al. (2001) — traceable to policy
Assessment naturalness	Joughin (1998) — natural transitions
Agent boundary	Rebedea et al. (2023) — LLM should not control structure

Evaluation Matrix

Criterion	A: Runtime Only	B: Propose+Approve	C: LLM Decides	D: Runtime+Delay
Fairness	Strong	Strong	Weak	Strong
Auditability	Strong	Strong	Weak	Strong
Assessment naturalness	Moderate	Strong	Strong	Strong
Agent boundary	Strong	Strong	Weak	Strong

Decision & Rationale

Chosen: Option B — LLM Proposes + Runtime Approves. The LLM signals evidenceSufficient: true in report_observation. The Runtime validates against CompletionPolicy (requiredEvidenceTargetIds, minTurns, timeBudget). The LLM is the best judge of assessment quality; the Runtime is the best judge of structural policy. From 04-agent-boundary.md: “Judging evidence sufficiency: LLM emits opinion; Runtime applies policy.”
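
A minimal sketch of propose-and-approve. The `CompletionPolicy` field names come from the paragraph above; the units, the `NodeProgress` shape, the reason strings, and the rule that an exhausted time budget forces the transition are assumptions.

```typescript
// Field names come from the paragraph above; units and the forced-advance
// rule on an exhausted time budget are assumptions.
interface CompletionPolicy {
  requiredEvidenceTargetIds: string[];
  minTurns: number;
  timeBudget: number; // seconds
}

interface NodeProgress {
  collectedTargetIds: string[];
  turns: number;
  elapsedSeconds: number;
}

interface TransitionVerdict {
  approved: boolean;
  reason:
    | "time_budget_exhausted"
    | "not_proposed"
    | "min_turns_not_met"
    | "missing_evidence"
    | "policy_satisfied";
}

// The LLM's `evidenceSufficient` flag is a proposal; policy has the last word.
function reviewProposal(
  evidenceSufficient: boolean,
  policy: CompletionPolicy,
  progress: NodeProgress,
): TransitionVerdict {
  if (progress.elapsedSeconds >= policy.timeBudget) {
    return { approved: true, reason: "time_budget_exhausted" };
  }
  if (!evidenceSufficient) {
    return { approved: false, reason: "not_proposed" };
  }
  if (progress.turns < policy.minTurns) {
    return { approved: false, reason: "min_turns_not_met" };
  }
  const missing = policy.requiredEvidenceTargetIds.filter(
    (id) => progress.collectedTargetIds.indexOf(id) === -1,
  );
  if (missing.length > 0) {
    return { approved: false, reason: "missing_evidence" };
  }
  return { approved: true, reason: "policy_satisfied" };
}

const completionPolicy: CompletionPolicy = {
  requiredEvidenceTargetIds: ["t1", "t2"],
  minTurns: 2,
  timeBudget: 300,
};
```

Every verdict carries a reason, so each approved or refused transition can be traced back to a specific policy clause.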

Rejected Options

Option A: Misses assessment quality dimension.
Option C: Violates fairness and reliability (Joughin, 1998).
Option D: needsFollowUp signal already serves this purpose.

Decision 12: Context Management

Question

What context does the LLM see between nodes?

Options Considered

Option	Description
A: RESET Only	Fresh context per node. No carryover.
B: APPEND Only	Full conversation history across all nodes.
C: Hybrid	RESET + previous node summary.
D: Policy-Driven	ContextPolicy governs per-node visibility. RESET default.

Evaluation Criteria

Criterion	Source
Information leakage prevention	Greshake et al. (2023) — minimize attack surface
Conversational naturalness	Joughin (1998) — continuity across topics
Context window efficiency	Schick et al. (2023) — manage token limits
Security	Greshake et al. (2023) — sanitize candidate input

Evaluation Matrix

Criterion	A: RESET	B: APPEND	C: Hybrid	D: Policy-Driven
Leakage prevention	Strong	Weak	Moderate	Strong
Naturalness	Weak	Strong	Strong	Strong
Window efficiency	Strong	Weak	Strong	Strong
Security	Strong	Weak	Moderate	Strong

Decision & Rationale

Chosen: Option D — Policy-Driven (RESET default). Context is reset between nodes (preventing cross-node contamination). The ContextPolicy allows per-node overrides (includePreviousNodes, includeEvidenceStatus, etc.). Rubric criteria ARE shared as evidence vocabulary; scoring logic is NOT shared. From 07-pipecat-adapter.md §6: “Between Nodes: RESET. Within a Node: APPEND.”
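
The RESET-between, APPEND-within rule can be sketched as follows. `ContextPolicy`, `includePreviousNodes`, and `includeEvidenceStatus` are named above, but their types are assumed here (a list of node ids and a boolean), and `NodeContext` with its message format is purely illustrative.

```typescript
// `includePreviousNodes` and `includeEvidenceStatus` are named in the text;
// their types and everything else here are assumptions.
interface ContextPolicy {
  includePreviousNodes: string[]; // node ids whose summaries may carry over
  includeEvidenceStatus: boolean;
}

interface CompletedNode {
  nodeId: string;
  summary: string;
}

class NodeContext {
  private messages: string[] = [];

  // Between nodes: RESET, then add only what the policy allows.
  enterNode(policy: ContextPolicy, completed: CompletedNode[], evidenceStatus: string): void {
    this.messages = [];
    for (const node of completed) {
      if (policy.includePreviousNodes.indexOf(node.nodeId) !== -1) {
        this.messages.push(`[summary ${node.nodeId}] ${node.summary}`);
      }
    }
    if (policy.includeEvidenceStatus) {
      this.messages.push(`[evidence] ${evidenceStatus}`);
    }
  }

  // Within a node: APPEND.
  addTurn(text: string): void {
    this.messages.push(text);
  }

  visible(): readonly string[] {
    return this.messages;
  }
}
```

With an empty `includePreviousNodes` list and `includeEvidenceStatus` set to `false`, this reduces to the plain RESET behavior of Option A.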

Rejected Options

Option A: No conversational continuity.
Option B: Unbounded context growth; prompt injection risk.
Option C: Near-miss; the policy-driven approach adds per-node configurability on top of the hybrid base.

Cross-Cutting Themes

1. Determinism Where It Matters

Across all 12 decisions, the spec consistently chooses determinism over flexibility for structural concerns (transitions, policies, evidence validation) while preserving flexibility for conversational concerns (wording, dialogue strategy, tone). This reflects the fundamental tension in AI-powered assessment: the conversation must feel natural, but the assessment must be fair and auditable.

Pattern: LLM decides how; Runtime decides when and whether.

2. Separation of Collection and Evaluation

The spec separates evidence collection (runtime) from evidence evaluation (marking) at every level:

IR layer: Defines what to assess (evidence targets) separately from how to score (marking rubric)
Runtime layer: Collects evidence signals; does not assign marks
Marking layer: Evaluates signals; does not collect evidence

This mirrors CQRS (Young, 2010): the command side (runtime) and query side (marking) have different models optimized for their specific concerns.

3. Defense in Depth

No single mechanism is sufficient for safety. The spec implements multiple overlapping layers:

IR constraints: Policies defined at compile time
Runtime guardrails: Enforced at execution time
Output validation: Filters LLM output before it reaches the candidate
Context policy: Controls what the LLM can see
Audit trail: Records everything for post-hoc review

This directly follows Greshake et al. (2023): “Defense is a difficult ‘whack-a-mole’ game. RLHF and input filtering are not sufficient.”

4. LLM as Sensor, Runtime as Judge

The LLM is treated as a sensor — it observes candidate responses and proposes evidence signals. The Runtime is the judge — it validates proposals against policy and makes structural decisions. This separation ensures that:

The LLM’s creative freedom is preserved (it can adapt dialogue naturally)
The Runtime’s authority is preserved (it enforces hard constraints)
The marking pipeline receives validated, structured evidence

5. Provenance as First-Class

Every artifact carries provenance metadata:

Evidence signals: proposedBy, sttConfidenceSummary, scaffoldingIntensity
Runtime events: source, timestamp, correlationId
Policy decisions: condition evaluated, reason for transition

This follows Buneman et al. (2001): why-provenance (why was this signal emitted?), where-provenance (where did the evidence come from?), and how-provenance (how was it derived?).

Emergent Design Principles

The specification is the contract: It binds authoring and execution, collection and evaluation, human and machine. It is the single source of truth.
Policies are data, not code or prompts: Machine-enforceable, auditable, testable, authorable.
The LLM proposes, the Runtime disposes: The LLM has judgment; the Runtime has authority. Neither alone is sufficient.
Events are immutable facts: Every state change is recorded. The event log is the audit trail, the recovery mechanism, and the integration backbone.
Evidence is structured, not derived: Signals are produced during the exam, not extracted after. They carry provenance, confidence, and classification.
Context is controlled, not accumulated: The LLM sees what it needs, not everything it could see. Minimizing context minimizes risk.
Recovery is categorized, not uniform: Technical failures require infrastructure actions; assessment failures require pedagogical actions. Different problems need different solutions.
Fairness through determinism: Structural decisions (transitions, completion, time budgets) are deterministic. Conversational decisions (wording, tone, follow-up type) are adaptive. This preserves both fairness and naturalness.

Revision History

Version	Date	Changes
v0.2.0	2026-06-30	Updated design rationale for IOA-ORM framing. Added evaluation criteria for anxiety management and Bloom’s integration. Updated terminology from ‘Exam Runtime IR’ to ‘IOA-ORM’.
v0.1.0	2026-05-06	Initial release.