Design Alternatives & Rationale

Draft · v0.2.0 · 2026-06-30

This chapter documents the systematic design space exploration conducted for the IOA-ORM specification. It follows the QOC (Questions, Options, Criteria) methodology (MacLean et al., 1991) to make explicit the design decisions, alternatives considered, evaluation criteria grounded in literature, and trade-offs accepted.

The design space exploration identified 12 primary design decisions, each evaluated against 4–6 criteria drawn from assessment theory (Joughin, 1998; Akimov & Malin, 2020; Fenton, 2025; Bayley et al., 2024), information systems literature (van der Aalst et al., 2003; Young, 2010; Fowler, 2005), agent systems research (Yao et al., 2023; Schick et al., 2023), AI safety literature (Bai et al., 2022; Rebedea et al., 2023; Greshake et al., 2023), and software engineering (Bass et al., 2012; Gamma et al., 1995; Lattner et al., 2020).

The methodology prevents design fixation — “a deeply ingrained psychological tendency where designers unconsciously adhere to the influence of prior designs” (Jansson & Smith, 1991, cited in Grisold et al., 2021). By systematically exploring alternatives, we ensure that the chosen design is justified by its merits, not by path dependence.


Why introduce an Intermediate Representation (IR) as the compilation target from the authoring studio? Why not have the authoring studio produce flowJson directly?

| Option | Description |
| --- | --- |
| A: Direct Authoring → Runtime | Authoring studio communicates directly with runtime. No intermediate artifact. |
| B: Authoring → flowJson → Runtime | flowJson (Pipecat FlowManager config) serves as the source of truth. |
| C: Authoring → IR → Multi-Target | Rich IR compiles to Pipecat adapter, runtime controller config, and marking config. |
| D: Authoring → IR → Single Target | IR compiles only to the runtime controller. |

Evaluation Criteria (with literature grounding)

| Criterion | Source |
| --- | --- |
| Separation of concerns | Bass, Clements & Kazman (2012): layered architecture enables independent evolution |
| Multi-target compilation | Lattner et al. (2020): MLIR concept (one IR, multiple compilation targets) |
| Versionability & diffability | Fowler (2005): event sourcing; stable artifacts enable change tracking |
| Lossless semantic capture | van der Aalst et al. (2003): workflow patterns; the specification must capture all assessment semantics |
| Authoring independence | Gamma et al. (1995): adapter pattern decouples authoring from execution |
| Testability | Bass et al. (2012): compile-time validation of IR packages |

| Criterion | A: Direct | B: flowJson | C: Multi-Target | D: Single-Target |
| --- | --- | --- | --- | --- |
| Separation of concerns | Weak | Moderate | Strong | Strong |
| Multi-target compilation | None | Weak | Strong | Moderate |
| Versionability | Weak | Moderate | Strong | Strong |
| Lossless semantic capture | Moderate | Weak | Strong | Strong |
| Authoring independence | Weak | Moderate | Strong | Strong |
| Testability | Weak | Moderate | Strong | Strong |

Chosen: Option C — Multi-Target IR. The IR serves as the canonical, versioned, executable specification of a published oral assessment. It is the single source of truth consumed by the Pipecat adapter, runtime controller, and marking runtime. From 00-overview.md: “flowJson is a serialization convenience — a bag of nodes and edges consumed by Pipecat’s FlowManager. It was designed to describe conversational flow, not to serve as the canonical executable specification of a high-stakes oral assessment.”

The multi-target property is critical because different consumers need different views: the Pipecat adapter needs conversational flow, the runtime controller needs policy enforcement, and the marking runtime needs evidence targets and rubric mappings.
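
The multi-target idea can be sketched as three projections over one IR value. This is a minimal illustration, not the spec's schema: `IrNode`, `IrPackage`, the three `compile*` functions, and every field name are hypothetical.

```typescript
// Hypothetical IR shape: field names are illustrative, not taken from the spec.
interface IrNode {
  id: string;
  prompt: string;
  next: string | null;
  minTurns: number;
  evidenceTargetIds: string[];
}

interface IrPackage {
  version: string;
  nodes: IrNode[];
}

// Each compile step projects the same IR into the view one consumer needs.
function compileFlow(ir: IrPackage): { id: string; prompt: string; next: string | null }[] {
  return ir.nodes.map((n) => ({ id: n.id, prompt: n.prompt, next: n.next }));
}

function compilePolicies(ir: IrPackage): { nodeId: string; minTurns: number }[] {
  return ir.nodes.map((n) => ({ nodeId: n.id, minTurns: n.minTurns }));
}

function compileMarking(ir: IrPackage): { nodeId: string; evidenceTargetIds: string[] }[] {
  return ir.nodes.map((n) => ({ nodeId: n.id, evidenceTargetIds: [...n.evidenceTargetIds] }));
}

const sampleIr: IrPackage = {
  version: "0.2.0",
  nodes: [
    { id: "q1", prompt: "Explain your approach.", next: "q2", minTurns: 2, evidenceTargetIds: ["t1"] },
    { id: "q2", prompt: "What would you change?", next: null, minTurns: 1, evidenceTargetIds: ["t2", "t3"] },
  ],
};

const flowView = compileFlow(sampleIr); // conversational flow only
const policyView = compilePolicies(sampleIr); // policy enforcement only
const markingView = compileMarking(sampleIr); // evidence targets only
```

Under these assumptions, the flow view never carries policy or marking fields, which is the property a single flowJson artifact cannot offer.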

  • Option A (Direct): Tight coupling between authoring and runtime makes independent evolution impossible.
  • Option B (flowJson): flowJson lacks runtime state schema, hard constraint vocabulary, event contract, and versioning (00-overview.md §1).
  • Option D (Single-Target): The separation between evidence collection (runtime) and evidence evaluation (marking) requires distinct compilation targets (Akimov & Malin, 2020).

Why three layers (Specification / Runtime Controller / Pipecat Adapter) instead of two or four?

| Option | Description |
| --- | --- |
| A: Two Layers | IR compiles to combined Runtime+Pipecat component. |
| B: Three Layers | IR → Runtime Controller → Pipecat Adapter. LLM is a tool invoked by the Runtime Controller. |
| C: Four Layers | IR → Policy Engine → Runtime Controller → Pipecat Adapter. |
| D: Pipecat + Overlay | Pipecat FlowManager is primary; lightweight overlay handles what Pipecat cannot. |

| Criterion | Source |
| --- | --- |
| Single responsibility | Bass et al. (2012): each layer should have one reason to change |
| Domain logic isolation | Young (2010), CQRS: domain logic separated from infrastructure |
| LLM boundary enforcement | Rebedea et al. (2023): proxy layer enables programmable constraint enforcement |
| Testability | Bass et al. (2012): independently testable layers |
| Pipecat independence | Gamma et al. (1995): adapter pattern decouples domain from Pipecat |

| Criterion | A: Two Layers | B: Three Layers | C: Four Layers | D: Pipecat+Overlay |
| --- | --- | --- | --- | --- |
| Single responsibility | Weak | Strong | Moderate | Weak |
| Domain logic isolation | Weak | Strong | Strong | Weak |
| LLM boundary enforcement | Moderate | Strong | Strong | Moderate |
| Testability | Weak | Strong | Strong | Weak |
| Pipecat independence | Weak | Strong | Strong | Weak |

Chosen: Option B — Three Layers. The Runtime Controller acts as a programmable proxy between the LLM and the environment (Rebedea et al., 2023). The LLM is a tool invoked by the Runtime Controller — it does NOT own state, does NOT decide transitions unilaterally, and does NOT persist evidence directly (03-runtime-semantics.md §1.2).
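
A rough sketch of that layering follows, assuming a controller that depends only on two narrow interfaces. `RuntimeController`, `LlmTool`, `VoiceAdapter`, the node ids, and the `minTurns` rule are all hypothetical names chosen for illustration.

```typescript
// All names here are hypothetical; they illustrate the layering only.
interface LlmReply {
  utterance: string;
  evidenceSufficient: boolean;
}
type LlmTool = (candidateText: string) => LlmReply;

// Stand-in for the Pipecat Adapter layer: the controller only sees this interface.
interface VoiceAdapter {
  say(text: string): void;
}

class RuntimeController {
  private readonly llm: LlmTool;
  private readonly voice: VoiceAdapter;
  private readonly minTurns: number;
  private turns = 0;
  private nodeId = "node-1";

  constructor(llm: LlmTool, voice: VoiceAdapter, minTurns: number) {
    this.llm = llm;
    this.voice = voice;
    this.minTurns = minTurns;
  }

  // The controller owns state; the LLM is called like any other tool.
  handleCandidateTurn(candidateText: string): void {
    this.turns += 1;
    const reply = this.llm(candidateText);
    this.voice.say(reply.utterance);
    // The LLM's opinion is one input; the transition is decided here.
    if (reply.evidenceSufficient && this.turns >= this.minTurns) {
      this.nodeId = "node-2";
      this.turns = 0;
    }
  }

  get currentNode(): string {
    return this.nodeId;
  }
}

// Fakes make the controller testable without a voice pipeline or a model.
const spoken: string[] = [];
const fakeVoice: VoiceAdapter = { say: (text) => { spoken.push(text); } };
const eagerLlm: LlmTool = () => ({ utterance: "Go on.", evidenceSufficient: true });

const controller = new RuntimeController(eagerLlm, fakeVoice, 2);
controller.handleCandidateTurn("First answer");
const nodeAfterFirstTurn = controller.currentNode;
controller.handleCandidateTurn("Second answer");
const nodeAfterSecondTurn = controller.currentNode;
```

Even though the fake LLM claims sufficiency on every turn, the controller holds the node until its own rule is met, and swapping the voice layer means supplying a different `VoiceAdapter`.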

  • Option A: Domain logic mixed with Pipecat integration makes it impossible to swap voice pipeline.
  • Option C: Policy evaluation is lightweight and doesn’t justify a separate layer.
  • Option D: The overlay pattern is reactive rather than proactive; the NeMo Guardrails work (Rebedea et al., 2023) favors a programmable proxy layer that enforces constraints up front.

How should the LLM communicate with the Runtime Controller?

| Option | Description |
| --- | --- |
| A: Single report_observation | One function bundles all observations per turn. |
| B: Multiple Functions | report_evidence_signal, report_candidate_command, request_transition. |
| C: Free-Text + Parsing | LLM produces free-text; runtime parses for signals/commands. |
| D: Structured Output | LLM produces structured JSON (no function calling). |

| Criterion | Source |
| --- | --- |
| Hallucination risk | Schick et al. (2023): more tools = more hallucination |
| Latency | Yao et al. (2023): multiple calls = multiple round-trips |
| Atomicity | Young (2010): all observations processed atomically |
| Interpretability | Yao et al. (2023): transparent reasoning for audit trails |

| Criterion | A: Single Function | B: Multiple Functions | C: Free-Text | D: Structured Output |
| --- | --- | --- | --- | --- |
| Hallucination risk | Strong | Weak | Moderate | Strong |
| Latency | Strong | Weak | Strong | Strong |
| Atomicity | Strong | Weak | Moderate | Strong |
| Interpretability | Strong | Moderate | Weak | Strong |

Chosen: Option A — Single report_observation. From 04-agent-boundary.md: “Previous design had request_transition + report_evidence_signal + report_candidate_command. This caused multiple LLM round-trips per turn, increased latency, and hallucination risk. One function = one call = one Runtime Controller evaluation = atomic decision-making.”
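
To make "one function = one call = one evaluation" concrete, here is a minimal sketch of a bundled payload and a single evaluation step. Only `evidenceSufficient` and `needsFollowUp` are named in this chapter; `ObservationReport`, `signals`, `candidateCommand`, `TurnState`, and the decision labels are assumptions.

```typescript
// Hypothetical payload shape. Only evidenceSufficient and needsFollowUp are
// named in this chapter; the other fields are assumptions for illustration.
interface ObservationReport {
  signals: { targetId: string; kind: string }[];
  candidateCommand: "repeat" | null;
  evidenceSufficient: boolean;
  needsFollowUp: boolean;
}

interface TurnState {
  turn: number;
  signalCount: number;
}

type Decision = "repeat_question" | "follow_up" | "propose_transition" | "continue";

// One call in, one evaluation out: state update and decision happen together.
function evaluateObservation(
  state: TurnState,
  obs: ObservationReport,
): { state: TurnState; decision: Decision } {
  const next: TurnState = {
    turn: state.turn + 1,
    signalCount: state.signalCount + obs.signals.length,
  };
  let decision: Decision = "continue";
  if (obs.candidateCommand === "repeat") decision = "repeat_question";
  else if (obs.needsFollowUp) decision = "follow_up";
  else if (obs.evidenceSufficient) decision = "propose_transition";
  return { state: next, decision };
}

const outcome = evaluateObservation(
  { turn: 0, signalCount: 0 },
  {
    signals: [{ targetId: "t1", kind: "positive" }],
    candidateCommand: null,
    evidenceSufficient: true,
    needsFollowUp: false,
  },
);
```

Because signals, commands, and the sufficiency opinion arrive together, there is nothing to correlate across calls, which is the problem noted for Option B.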

  • Option B: Correlation problem (matching signals to transitions) and latency cost.
  • Option C: Parsing complexity and prompt injection risk (Greshake et al., 2023).
  • Option D: Functionally similar to A; Option A was chosen for its Pipecat integration.

How should evidence of candidate competence be represented?

| Option | Description |
| --- | --- |
| A: Discrete Signal Kinds Only | positive, partial, absent, misconception. No confidence. |
| B: Continuous Confidence Only | 0.0–1.0 confidence score. No discrete classification. |
| C: Rich Taxonomy | Discrete kinds + confidence + process quality + provenance + scaffolding intensity. |
| D: Rubric-Level Judgments | LLM directly maps evidence to rubric levels. |

| Criterion | Source |
| --- | --- |
| Assessment validity | Joughin (1998): must capture what oral assessment measures |
| Process quality capture | Fenton (2025): “the process of learning rather than the output” |
| Provenance tracking | Buneman et al. (2001): why/where/how provenance |
| Moderation support | Akimov & Malin (2020): human review and override |

| Criterion | A: Discrete Only | B: Confidence Only | C: Rich Taxonomy | D: Rubric-Level |
| --- | --- | --- | --- | --- |
| Assessment validity | Moderate | Moderate | Strong | Moderate |
| Process quality capture | Weak | Weak | Strong | Weak |
| Provenance tracking | Weak | Weak | Strong | Weak |
| Moderation support | Moderate | Moderate | Strong | Weak |

Chosen: Option C — Rich Taxonomy. Eight signal kinds (positive, partial, absent, misconception, flawed_reasoning, process_positive, process_negative, self_correction) capture the full spectrum of assessment evidence. Confidence scores enable calibration. Provenance fields (proposedBy, sttConfidenceSummary, scaffoldingIntensity) support auditing and moderation. Fenton (2025): “Oral assessments reveal the process of learning rather than the output.”
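
One way to picture the rich taxonomy is as a typed record plus a small validator. The eight kinds and the provenance field names come from the paragraph above; the value types, the `"llm"` value for `proposedBy`, and `validateSignal` itself are assumptions.

```typescript
const SIGNAL_KINDS = [
  "positive", "partial", "absent", "misconception",
  "flawed_reasoning", "process_positive", "process_negative", "self_correction",
] as const;
type SignalKind = (typeof SIGNAL_KINDS)[number];

// Field names follow the chapter; the value types are assumptions.
interface EvidenceSignal {
  targetId: string;
  kind: SignalKind;
  confidence: number; // assumed range 0.0 to 1.0
  proposedBy: "llm" | "manual_marker"; // "llm" is an assumed value
  sttConfidenceSummary: number; // assumed: mean STT confidence for the turn
  scaffoldingIntensity: number; // assumed: 0 = unprompted, higher = more help
  turnIds: string[];
}

// Returns a list of problems; an empty list means the signal is acceptable.
function validateSignal(s: EvidenceSignal): string[] {
  const errors: string[] = [];
  if (!(s.confidence >= 0 && s.confidence <= 1)) errors.push("confidence out of range");
  if (s.turnIds.length === 0) errors.push("signal must link to at least one turn");
  if (s.scaffoldingIntensity < 0) errors.push("scaffoldingIntensity must be non-negative");
  return errors;
}

const selfCorrection: EvidenceSignal = {
  targetId: "t1",
  kind: "self_correction",
  confidence: 0.8,
  proposedBy: "llm",
  sttConfidenceSummary: 0.93,
  scaffoldingIntensity: 1,
  turnIds: ["turn-4", "turn-5"],
};

const okErrors = validateSignal(selfCorrection);
const badErrors = validateSignal({ ...selfCorrection, confidence: 1.5, turnIds: [] });
```

A `self_correction` signal like this one is exactly what Options A and B would lose: it is neither a plain outcome kind nor a bare confidence number.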

  • Option A: Misses process quality (self-correction, reasoning process).
  • Option B: Misses what type of evidence was observed.
  • Option D: Conflates evidence collection with evidence interpretation.

How should assessment policies be expressed?

| Option | Description |
| --- | --- |
| A: Structured Data Objects | TypeScript interfaces: CompletionPolicy, FollowUpPolicy, etc. |
| B: Code | Executable Python/TypeScript functions. |
| C: Prompt Instructions | Natural language in LLM system prompt. |
| D: Rule Engine | Domain-specific language (e.g., Colang). |

| Criterion | Source |
| --- | --- |
| Machine enforceability | Rebedea et al. (2023): programmable rails over embedded rails |
| Authorability | Bass et al. (2012): accessible to assessment designers |
| Compile-time validation | Lattner et al. (2020): catch errors before runtime |
| LLM independence | Bai et al. (2022): enforced regardless of LLM behavior |

| Criterion | A: Structured Data | B: Code | C: Prompts | D: Rule Engine |
| --- | --- | --- | --- | --- |
| Machine enforceability | Strong | Strong | Weak | Strong |
| Authorability | Strong | Weak | Strong | Moderate |
| Compile-time validation | Strong | Moderate | Weak | Strong |
| LLM independence | Strong | Strong | Weak | Strong |

Chosen: Option A — Structured Data Objects. Policies are typed data structures evaluated deterministically by the Runtime Controller. From 03-runtime-semantics.md: “The Runtime MUST evaluate transition conditions in the order they appear in the specification. The FIRST matching condition wins.” Structured data enables compile-time validation (08-validation-rules.md) and is authorable via the authoring studio UI.
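
The first-match-wins rule quoted above can be sketched over plain data. The shapes `SessionFacts` and `TransitionCondition`, the `atLeast` comparison, and the node names are hypothetical; the spec's actual policy interfaces may look quite different.

```typescript
// Hypothetical condition shape; the spec's actual policy interfaces may differ.
interface SessionFacts {
  turns: number;
  elapsedSec: number;
}

interface TransitionCondition {
  field: keyof SessionFacts;
  atLeast: number;
  goto: string;
}

// Conditions are evaluated in declaration order; the first match wins.
function resolveTransition(
  conditions: TransitionCondition[],
  facts: SessionFacts,
): string | null {
  for (const c of conditions) {
    if (facts[c.field] >= c.atLeast) return c.goto;
  }
  return null;
}

// Because policies are plain data, they can be checked before any session runs.
function validateConditions(conditions: TransitionCondition[], knownNodes: string[]): string[] {
  return conditions
    .filter((c) => !knownNodes.includes(c.goto))
    .map((c) => `unknown target node: ${c.goto}`);
}

const nodeConditions: TransitionCondition[] = [
  { field: "elapsedSec", atLeast: 300, goto: "wrap_up" },
  { field: "turns", atLeast: 3, goto: "next_topic" },
];
```

With both conditions satisfied, `resolveTransition` returns `"wrap_up"` because it is declared first. The validator shows the kind of check that is hard to make against executable code or prompt text.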

  • Option B: Not auditable by non-engineers; security risk.
  • Option C: Not enforceable; Greshake et al. (2023) show prompt constraints can be overridden.
  • Option D: Near-miss; Colang-style DSLs are powerful but add a learning curve.

Decision 6: State Machine vs Implicit Tracking


Should the exam lifecycle be governed by explicit state machines or by implicit state derived from the event log?

| Option | Description |
| --- | --- |
| A: Explicit State Machines | Formal SMs for exam/node/turn lifecycle. |
| B: Event Sourcing | State derived by replaying event log. |
| C: Hybrid | Explicit SMs for runtime + event log for audit/recovery. |
| D: Reactive State | Observable streams, no central authority. |

| Criterion | Source |
| --- | --- |
| Invariant enforcement | van der Aalst et al. (2003): structural invariants |
| Recovery capability | Fowler (2005): state reconstructable from events |
| Auditability | Young (2010): every state change recorded |
| Performance | Young (2010): fast reads for real-time interaction |

| Criterion | A: Explicit SM | B: Event Sourcing | C: Hybrid | D: Reactive |
| --- | --- | --- | --- | --- |
| Invariant enforcement | Strong | Moderate | Strong | Weak |
| Recovery capability | Moderate | Strong | Strong | Weak |
| Auditability | Moderate | Strong | Strong | Weak |
| Performance | Strong | Weak | Strong | Strong |

Chosen: Option C — Hybrid. Explicit state machines (exam/node/turn lifecycle) enforce invariants at runtime. Event log provides audit trail and recovery capability. From 02-schema.md §5: “RuntimeStateSchema is mutable per-session state tracked by the runtime controller. NOT persisted as a log — this is working memory.”
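
A small sketch of the hybrid follows: an explicit transition table guards the in-memory state, and every accepted transition is also appended to a log that can rebuild the state later. The state names, `ExamLifecycle`, `LifecycleEvent`, and `replay` are hypothetical and cover only the exam-level lifecycle.

```typescript
// State names are hypothetical; only the exam lifecycle is sketched here.
type ExamState = "not_started" | "in_progress" | "paused" | "completed";

const ALLOWED: Record<ExamState, ExamState[]> = {
  not_started: ["in_progress"],
  in_progress: ["paused", "completed"],
  paused: ["in_progress"],
  completed: [],
};

interface LifecycleEvent {
  from: ExamState;
  to: ExamState;
}

class ExamLifecycle {
  private current: ExamState = "not_started"; // working memory, read on every turn
  readonly log: LifecycleEvent[] = []; // append-only record for audit and recovery

  get state(): ExamState {
    return this.current;
  }

  transition(to: ExamState): void {
    if (!ALLOWED[this.current].includes(to)) {
      throw new Error(`illegal transition ${this.current} -> ${to}`);
    }
    this.log.push({ from: this.current, to });
    this.current = to;
  }
}

// Recovery path: rebuild the last known state from the log alone.
function replay(log: LifecycleEvent[]): ExamState {
  return log.reduce<ExamState>((_, e) => e.to, "not_started");
}

const exam = new ExamLifecycle();
exam.transition("in_progress");
exam.transition("paused");

let rejected = false;
try {
  exam.transition("completed"); // not allowed while paused
} catch {
  rejected = true;
}
```

Reads during a live session hit `state` directly, so no replay sits on the real-time path; `replay` is only needed after a crash or during an audit.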

  • Option A: No audit trail or recovery capability.
  • Option B: Replay too slow for real-time voice interaction.
  • Option D: Unnecessary complexity for single-threaded sessions.

Where should the line be drawn between LLM autonomy and Runtime control?

| Option | Description |
| --- | --- |
| A: Two Levels | Autonomous / Controlled. |
| B: Three Levels | Autonomous / Advisory / Controlled. |
| C: Five Levels | Fully Autonomous / Guided / Advisory / Constrained / Forbidden. |
| D: Full Autonomy | LLM decides everything; post-hoc validation. |

| Criterion | Source |
| --- | --- |
| Assessment naturalness | Joughin (1998): “bidirectional adaptation” |
| Fairness | Akimov & Malin (2020): consistent treatment |
| Safety | Bai et al. (2022): constitutional principles |
| Agent systems alignment | Yao et al. (2023): action space augmentation |

| Criterion | A: Two Levels | B: Three Levels | C: Five Levels | D: Full Autonomy |
| --- | --- | --- | --- | --- |
| Assessment naturalness | Moderate | Strong | Strong | Strong |
| Fairness | Strong | Strong | Strong | Weak |
| Safety | Strong | Strong | Strong | Weak |
| Agent systems alignment | Moderate | Strong | Moderate | Weak |

Chosen: Option B — Three Levels. The LLM fully decides wording and dialogue strategy (autonomous), advises on evidence sufficiency and follow-up need (advisory), and is fully controlled on transitions and scoring (controlled). This maps to the action space augmentation in ReAct (Yao et al., 2023): Thoughts are autonomous, while Actions that affect the environment are advisory or controlled.

  • Option A: Too coarse for nuanced evidence assessment.
  • Option C: Over-engineered; three levels provide sufficient granularity.
  • Option D: Violates safety requirements for summative assessment.

What event architecture should the system use?

| Option | Description |
| --- | --- |
| A: Push Only | Runtime pushes events via WebSocket/LiveKit. |
| B: Pull Only | Consumers poll for state changes. |
| C: Event Sourcing | Append-only event store; consumers read from store. |
| D: Hybrid | Push for real-time + event store for persistence. |

| Criterion | Source |
| --- | --- |
| Real-time delivery | Hohpe & Woolf (2003): frontend needs low-latency updates |
| Auditability | Young (2010): all events persisted for audit |
| Replay capability | Fowler (2005): sessions reconstructable from events |
| Transport agnosticism | Hohpe & Woolf (2003): same event over multiple transports |

| Criterion | A: Push Only | B: Pull Only | C: Event Sourcing | D: Hybrid |
| --- | --- | --- | --- | --- |
| Real-time delivery | Strong | Weak | Moderate | Strong |
| Auditability | Weak | Moderate | Strong | Strong |
| Replay capability | Weak | Weak | Strong | Strong |
| Transport agnosticism | Moderate | Moderate | Strong | Strong |

Chosen: Option D — Hybrid. Push events to real-time consumers (frontend via LiveKit data channel, WebSocket). Persist all events to an append-only store for audit and marking. Events are transport-agnostic (05-event-protocol.md Principle E3): the same envelope works over all transports.
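
The push-plus-store arrangement might look like the sketch below. `source`, `timestamp`, and `correlationId` are named elsewhere in this chapter; the `type` and `payload` fields, the event type strings, and the `HybridEventBus` class are assumptions, and the in-memory array stands in for a real append-only store.

```typescript
// source, timestamp and correlationId are named in this chapter;
// type and payload are assumed envelope fields.
interface EventEnvelope {
  type: string;
  source: string;
  timestamp: string;
  correlationId: string;
  payload: unknown;
}

type Subscriber = (event: EventEnvelope) => void;

class HybridEventBus {
  private readonly store: EventEnvelope[] = [];
  private readonly subscribers: Subscriber[] = [];

  subscribe(fn: Subscriber): void {
    this.subscribers.push(fn);
  }

  publish(event: EventEnvelope): void {
    const frozen = Object.freeze({ ...event });
    this.store.push(frozen); // persist first so a failed push loses nothing
    for (const fn of this.subscribers) {
      try {
        fn(frozen); // real-time push; the transport is behind the subscriber
      } catch {
        // a failing consumer must not block persistence or other consumers
      }
    }
  }

  // Late or recovering consumers read from the store instead of the push path.
  readFrom(offset: number): readonly EventEnvelope[] {
    return this.store.slice(offset);
  }
}

const bus = new HybridEventBus();
const pushed: string[] = [];
bus.subscribe((e) => { pushed.push(e.type); });
bus.subscribe(() => { throw new Error("consumer offline"); });

bus.publish({ type: "node.entered", source: "runtime", timestamp: "2026-06-30T10:00:00Z", correlationId: "c-1", payload: { nodeId: "q1" } });
bus.publish({ type: "turn.completed", source: "runtime", timestamp: "2026-06-30T10:00:20Z", correlationId: "c-1", payload: { turnId: "turn-1" } });
```

Each subscriber would wrap one transport (a LiveKit data channel, a WebSocket), and all of them receive the same envelope object that was stored.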

  • Option A: No persistence; events lost on failure.
  • Option B: Unacceptable latency for real-time UI.
  • Option C: Replay too slow for real-time consumers.

Decision 9: Evidence as First-Class Output


Should evidence be a separate artifact, embedded in transcript, or extracted post-hoc?

| Option | Description |
| --- | --- |
| A: Separate EvidenceLedger | Real-time structured ledger with provenance. |
| B: Embedded in Transcript | Evidence as metadata on transcript turns. |
| C: Post-Hoc Extraction | Evidence derived from transcript after exam. |
| D: Hybrid | Real-time signals + post-hoc enrichment. |

| Criterion | Source |
| --- | --- |
| Marking determinism | Akimov & Malin (2020): reproducible marking |
| Provenance | Buneman et al. (2001): why/where/how provenance |
| Moderation support | Akimov & Malin (2020): human review and override |
| Separation of concerns | Young (2010): collection separate from evaluation |

| Criterion | A: Separate Ledger | B: Embedded | C: Post-Hoc | D: Hybrid |
| --- | --- | --- | --- | --- |
| Marking determinism | Strong | Moderate | Weak | Strong |
| Provenance | Strong | Moderate | Weak | Strong |
| Moderation support | Strong | Moderate | Weak | Strong |
| Separation of concerns | Strong | Weak | Moderate | Strong |

Chosen: Option A — Separate EvidenceLedger. Evidence signals are written to a dedicated ledger in real-time. Each signal carries provenance (proposedBy, sttConfidenceSummary, scaffoldingIntensity) and is linked to transcript turns via turnIds. From 06-evidence-ledger.md: “The evidence ledger is not a post-processing step over the transcript. It is a structured, real-time, authoritative record of assessment evidence.”
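
As an illustration of why a separate ledger helps moderation, the sketch below appends a human override as a new entry instead of editing the original. `proposedBy` and `turnIds` are named in this chapter; `LedgerEntry`, `supersedes`, `signalId`, and the `EvidenceLedger` methods are hypothetical.

```typescript
// Hypothetical ledger API; entry fields beyond those named in the chapter are assumptions.
interface LedgerEntry {
  signalId: string;
  targetId: string;
  kind: string;
  proposedBy: "llm" | "manual_marker";
  turnIds: string[];
  supersedes: string | null; // assumed: an override points at the entry it replaces
}

class EvidenceLedger {
  private readonly entries: LedgerEntry[] = [];

  append(entry: LedgerEntry): void {
    if (entry.turnIds.length === 0) throw new Error("entry must reference a transcript turn");
    this.entries.push({ ...entry, turnIds: [...entry.turnIds] });
  }

  // Overrides are new entries; superseded entries stay in the ledger for audit.
  effective(): LedgerEntry[] {
    const superseded = new Set(this.entries.map((e) => e.supersedes));
    return this.entries.filter((e) => !superseded.has(e.signalId));
  }

  all(): readonly LedgerEntry[] {
    return this.entries;
  }
}

const ledger = new EvidenceLedger();
ledger.append({ signalId: "s1", targetId: "t1", kind: "partial", proposedBy: "llm", turnIds: ["turn-3"], supersedes: null });
// A moderator later reviews turn-3 and records a different judgment.
ledger.append({ signalId: "s2", targetId: "t1", kind: "positive", proposedBy: "manual_marker", turnIds: ["turn-3"], supersedes: "s1" });
```

Marking would read `effective()`, while an audit reads `all()`; the transcript itself is never touched by either operation.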

  • Option B: Conflating evidence with transcript prevents independent override and provenance tracking.
  • Option C: Non-deterministic; no real-time feedback for formative assessment.
  • Option D: Near-miss; the spec supports post-hoc enrichment via proposedBy: "manual_marker".

How should the system handle failures?

| Option | Description |
| --- | --- |
| A: Fully Automated | All failures handled by runtime. |
| B: Human-in-the-Loop | All failures require proctor decision. |
| C: Categorized | Technical = automated; Assessment = LLM-assisted with runtime guardrails. |
| D: Graduated | Severity-based: minor → automated; moderate → LLM-assisted; severe → human. |

| Criterion | Source |
| --- | --- |
| Assessment validity | Fenton (2025): recovery must not compromise assessment |
| Candidate welfare | Akimov & Malin (2020): recovery must not cause distress |
| Operational feasibility | Bayley et al. (2024): must work at scale |
| Assessment neutrality | Pearce & Chiavaroli (2020): recovery must not reveal rubric |

| Criterion | A: Automated | B: Human | C: Categorized | D: Graduated |
| --- | --- | --- | --- | --- |
| Assessment validity | Moderate | Strong | Strong | Strong |
| Candidate welfare | Moderate | Strong | Strong | Strong |
| Operational feasibility | Strong | Weak | Strong | Strong |
| Assessment neutrality | Strong | Moderate | Strong | Strong |

Chosen: Option C — Categorized Recovery. Technical failures (network, STT, TTS, silence) are handled automatically. Assessment failures (candidate confusion, distress, off-topic) are handled by the LLM with runtime guardrails. From 03-runtime-semantics.md §6.1: “Recovery MUST NOT reveal model answers, rubric scoring logic, or the ‘correct’ response.”
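
The categorization can be sketched as a discriminated union with one handler per category. The failure kinds mirror the examples in the paragraph above; the action names, `planRecovery`, and the phrase-matching guardrail are hypothetical and far cruder than a production check would need to be.

```typescript
// Failure kinds follow the chapter's examples; action names are hypothetical.
type Failure =
  | { category: "technical"; kind: "network" | "stt" | "tts" | "silence" }
  | { category: "assessment"; kind: "confusion" | "distress" | "off_topic" };

interface RecoveryPlan {
  handler: "runtime" | "llm_with_guardrails";
  action: string;
}

function planRecovery(f: Failure): RecoveryPlan {
  if (f.category === "technical") {
    // Infrastructure problems get deterministic, automated handling.
    const action = f.kind === "silence" ? "reprompt_candidate" : "retry_pipeline";
    return { handler: "runtime", action };
  }
  // Pedagogical problems go to the LLM, but its output is checked below.
  return { handler: "llm_with_guardrails", action: `address_${f.kind}` };
}

// Crude guardrail, assuming the runtime holds a list of protected phrases.
function passesNeutralityCheck(utterance: string, protectedPhrases: string[]): boolean {
  const lower = utterance.toLowerCase();
  return !protectedPhrases.some((p) => lower.includes(p.toLowerCase()));
}
```

Branching on category instead of severity keeps the two recovery mechanisms apart: the technical branch never calls the LLM, and the assessment branch never bypasses the neutrality check.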

  • Option A: Cannot handle nuanced affective/pedagogical recovery.
  • Option B: Infeasible at scale (Bayley et al., 2024: 600+ students).
  • Option D: Near-miss; categorization (technical vs. assessment) was chosen over severity because the recovery mechanisms are fundamentally different.

Who decides when to transition between nodes?

| Option | Description |
| --- | --- |
| A: Runtime Only | Deterministic policy evaluation. No LLM input. |
| B: LLM Proposes + Runtime Approves | LLM signals readiness; runtime validates against policy. |
| C: LLM Decides | LLM controls transitions. |
| D: Runtime + LLM Delay | Runtime decides; LLM can request more time. |

| Criterion | Source |
| --- | --- |
| Fairness | Akimov & Malin (2020): consistent treatment |
| Auditability | Buneman et al. (2001): traceable to policy |
| Assessment naturalness | Joughin (1998): natural transitions |
| Agent boundary | Rebedea et al. (2023): LLM should not control structure |

| Criterion | A: Runtime Only | B: Propose+Approve | C: LLM Decides | D: Runtime+Delay |
| --- | --- | --- | --- | --- |
| Fairness | Strong | Strong | Weak | Strong |
| Auditability | Strong | Strong | Weak | Strong |
| Assessment naturalness | Moderate | Strong | Strong | Strong |
| Agent boundary | Strong | Strong | Weak | Strong |

Chosen: Option B — LLM Proposes + Runtime Approves. The LLM signals evidenceSufficient: true in report_observation. The Runtime validates against CompletionPolicy (requiredEvidenceTargetIds, minTurns, timeBudget). The LLM is the best judge of assessment quality; the Runtime is the best judge of structural policy. From 04-agent-boundary.md: “Judging evidence sufficiency: LLM emits opinion; Runtime applies policy.”
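
A minimal sketch of the propose-and-approve check follows. The `CompletionPolicy` field names come from the paragraph above; their types and units, the `NodeProgress` shape, and the rule that an exhausted time budget forces a transition are assumptions made for the example.

```typescript
// Field names come from the chapter; types and units are assumptions.
interface CompletionPolicy {
  requiredEvidenceTargetIds: string[];
  minTurns: number;
  timeBudget: number; // assumed: seconds allowed for the node
}

interface NodeProgress {
  turns: number;
  elapsedSec: number;
  coveredTargetIds: string[];
}

interface TransitionVerdict {
  approved: boolean;
  reason: string; // recorded so the decision is traceable to policy
}

function reviewProposal(
  policy: CompletionPolicy,
  progress: NodeProgress,
  evidenceSufficient: boolean,
): TransitionVerdict {
  // Assumption: an exhausted time budget forces the transition.
  if (progress.elapsedSec >= policy.timeBudget) {
    return { approved: true, reason: "time budget exhausted" };
  }
  if (!evidenceSufficient) return { approved: false, reason: "LLM has not proposed a transition" };
  if (progress.turns < policy.minTurns) return { approved: false, reason: "minTurns not reached" };
  const missing = policy.requiredEvidenceTargetIds.filter(
    (id) => !progress.coveredTargetIds.includes(id),
  );
  if (missing.length > 0) {
    return { approved: false, reason: `missing evidence: ${missing.join(", ")}` };
  }
  return { approved: true, reason: "policy satisfied" };
}

const q1Policy: CompletionPolicy = { requiredEvidenceTargetIds: ["t1", "t2"], minTurns: 2, timeBudget: 180 };

const tooEarly = reviewProposal(q1Policy, { turns: 3, elapsedSec: 60, coveredTargetIds: ["t1"] }, true);
const satisfied = reviewProposal(q1Policy, { turns: 3, elapsedSec: 90, coveredTargetIds: ["t1", "t2"] }, true);
const timedOut = reviewProposal(q1Policy, { turns: 1, elapsedSec: 200, coveredTargetIds: [] }, false);
```

The LLM's `evidenceSufficient` flag is necessary but not sufficient here: `tooEarly` is refused despite the proposal, and every verdict carries a reason that can be written to the audit trail.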

  • Option A: Misses assessment quality dimension.
  • Option C: Violates fairness and reliability (Joughin, 1998).
  • Option D: The needsFollowUp signal already serves this purpose.

What context does the LLM see between nodes?

| Option | Description |
| --- | --- |
| A: RESET Only | Fresh context per node. No carryover. |
| B: APPEND Only | Full conversation history across all nodes. |
| C: Hybrid | RESET + previous node summary. |
| D: Policy-Driven | ContextPolicy governs per-node visibility. RESET default. |

| Criterion | Source |
| --- | --- |
| Information leakage prevention | Greshake et al. (2023): minimize attack surface |
| Conversational naturalness | Joughin (1998): continuity across topics |
| Context window efficiency | Schick et al. (2023): manage token limits |
| Security | Greshake et al. (2023): sanitize candidate input |

| Criterion | A: RESET | B: APPEND | C: Hybrid | D: Policy-Driven |
| --- | --- | --- | --- | --- |
| Leakage prevention | Strong | Weak | Moderate | Strong |
| Naturalness | Weak | Strong | Strong | Strong |
| Window efficiency | Strong | Weak | Strong | Strong |
| Security | Strong | Weak | Moderate | Strong |

Chosen: Option D — Policy-Driven (RESET default). Context is reset between nodes (preventing cross-node contamination). The ContextPolicy allows per-node overrides (includePreviousNodes, includeEvidenceStatus, etc.). Rubric criteria ARE shared as evidence vocabulary; scoring logic is NOT shared. From 07-pipecat-adapter.md §6: “Between Nodes: RESET. Within a Node: APPEND.”
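
The RESET-between, APPEND-within behavior with per-node overrides could be sketched as a filter over the transcript. `includePreviousNodes` and `includeEvidenceStatus` are named above, but their types here are guesses; `Turn`, `RESET_DEFAULT`, and `buildContext` are hypothetical.

```typescript
// includePreviousNodes and includeEvidenceStatus are named in the chapter;
// their types here are assumptions.
interface ContextPolicy {
  includePreviousNodes: string[]; // assumed: ids of earlier nodes to carry over
  includeEvidenceStatus: boolean;
}

const RESET_DEFAULT: ContextPolicy = { includePreviousNodes: [], includeEvidenceStatus: false };

interface Turn {
  nodeId: string;
  text: string;
}

function buildContext(
  currentNodeId: string,
  history: Turn[],
  evidenceStatus: string,
  policy: ContextPolicy = RESET_DEFAULT,
): string[] {
  // Within the current node every turn is kept (APPEND); other nodes are
  // dropped (RESET) unless the policy explicitly lists them.
  const visible = history
    .filter((t) => t.nodeId === currentNodeId || policy.includePreviousNodes.includes(t.nodeId))
    .map((t) => t.text);
  return policy.includeEvidenceStatus ? [`[evidence] ${evidenceStatus}`, ...visible] : visible;
}

const transcript: Turn[] = [
  { nodeId: "q1", text: "I would start with a loop." },
  { nodeId: "q1", text: "Then check the edge case." },
  { nodeId: "q2", text: "Complexity is linear." },
];

const defaultContext = buildContext("q2", transcript, "t1 covered");
const overriddenContext = buildContext("q2", transcript, "t1 covered", {
  includePreviousNodes: ["q1"],
  includeEvidenceStatus: true,
});
```

With the default policy the LLM sees only the `q2` turn; the override opts a single node back in, so context growth stays bounded by what the policy lists.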

  • Option A: No conversational continuity.
  • Option B: Unbounded context growth; prompt injection risk.
  • Option C: Near-miss; the policy-driven approach adds per-node configurability on top of the hybrid base.

Across all 12 decisions, the spec consistently chooses determinism over flexibility for structural concerns (transitions, policies, evidence validation) while preserving flexibility for conversational concerns (wording, dialogue strategy, tone). This reflects the fundamental tension in AI-powered assessment: the conversation must feel natural, but the assessment must be fair and auditable.

Pattern: LLM decides how; Runtime decides when and whether.

2. Separation of Collection and Evaluation


The spec separates evidence collection (runtime) from evidence evaluation (marking) at every level:

  • IR layer: Defines what to assess (evidence targets) separately from how to score (marking rubric)
  • Runtime layer: Collects evidence signals; does not assign marks
  • Marking layer: Evaluates signals; does not collect evidence

This mirrors CQRS (Young, 2010): the command side (runtime) and query side (marking) have different models optimized for their specific concerns.

No single mechanism is sufficient for safety. The spec implements multiple overlapping layers:

  • IR constraints: Policies defined at compile time
  • Runtime guardrails: Enforced at execution time
  • Output validation: Filters LLM output before it reaches the candidate
  • Context policy: Controls what the LLM can see
  • Audit trail: Records everything for post-hoc review

This directly follows Greshake et al. (2023): “Defense is a difficult ‘whack-a-mole’ game. RLHF and input filtering are not sufficient.”

The LLM is treated as a sensor — it observes candidate responses and proposes evidence signals. The Runtime is the judge — it validates proposals against policy and makes structural decisions. This separation ensures that:

  • The LLM’s creative freedom is preserved (it can adapt dialogue naturally)
  • The Runtime’s authority is preserved (it enforces hard constraints)
  • The marking pipeline receives validated, structured evidence

Every artifact carries provenance metadata:

  • Evidence signals: proposedBy, sttConfidenceSummary, scaffoldingIntensity
  • Runtime events: source, timestamp, correlationId
  • Policy decisions: condition evaluated, reason for transition

This follows Buneman et al. (2001): why-provenance (why was this signal emitted?), where-provenance (where did the evidence come from?), and how-provenance (how was it derived?).


  1. The specification is the contract: Between authoring and execution, between collection and evaluation, between human and machine. It is the single source of truth.

  2. Policies are data, not code or prompts: Machine-enforceable, auditable, testable, authorable.

  3. The LLM proposes, the Runtime disposes: The LLM has judgment; the Runtime has authority. Neither alone is sufficient.

  4. Events are immutable facts: Every state change is recorded. The event log is the audit trail, the recovery mechanism, and the integration backbone.

  5. Evidence is structured, not derived: Signals are produced during the exam, not extracted after. They carry provenance, confidence, and classification.

  6. Context is controlled, not accumulated: The LLM sees what it needs, not everything it could see. Minimizing context minimizes risk.

  7. Recovery is categorized, not uniform: Technical failures require infrastructure actions; assessment failures require pedagogical actions. Different problems need different solutions.

  8. Fairness through determinism: Structural decisions (transitions, completion, time budgets) are deterministic. Conversational decisions (wording, tone, follow-up type) are adaptive. This preserves both fairness and naturalness.

| Version | Date | Changes |
| --- | --- | --- |
| v0.2.0 | 2026-06-30 | Updated design rationale for IOA-ORM framing. Added evaluation criteria for anxiety management and Bloom’s integration. Updated terminology from ‘Exam Runtime IR’ to ‘IOA-ORM’. |
| v0.1.0 | 2026-05-06 | Initial release. |