Design Alternatives & Rationale
Status
Draft · v0.2.0 · 2026-06-30
Overview
This chapter documents the systematic design space exploration conducted for the IOA-ORM specification. It follows the QOC (Questions, Options, Criteria) methodology (MacLean et al., 1991) to make explicit the design decisions, alternatives considered, evaluation criteria grounded in literature, and trade-offs accepted.
The design space exploration identified 12 primary design decisions, each evaluated against 4–6 criteria drawn from assessment theory (Joughin, 1998; Akimov & Malin, 2020; Fenton, 2025; Bayley et al., 2024), information systems literature (van der Aalst et al., 2003; Young, 2010; Fowler, 2005), agent systems research (Yao et al., 2023; Schick et al., 2023), AI safety literature (Bai et al., 2022; Rebedea et al., 2023; Greshake et al., 2023), and software engineering (Bass et al., 2012; Gamma et al., 1995; Lattner et al., 2020).
The methodology is intended to counter design fixation, “a deeply ingrained psychological tendency where designers unconsciously adhere to the influence of prior designs” (Jansson & Smith, 1991, cited in Grisold et al., 2021). Systematically exploring alternatives makes it more likely that the chosen design is justified by its merits rather than by path dependence.
Decision 1: IR as Compilation Target
Question
Why introduce an Intermediate Representation (IR) as the compilation target from the authoring studio? Why not have the authoring studio produce flowJson directly?
Options Considered
| Option | Description |
|---|---|
| A: Direct Authoring → Runtime | Authoring studio communicates directly with runtime. No intermediate artifact. |
| B: Authoring → flowJson → Runtime | flowJson (Pipecat FlowManager config) serves as the source of truth. |
| C: Authoring → IR → Multi-Target | Rich IR compiles to Pipecat adapter, runtime controller config, and marking config. |
| D: Authoring → IR → Single Target | IR compiles only to the runtime controller. |
Evaluation Criteria (with literature grounding)
| Criterion | Source |
|---|---|
| Separation of concerns | Bass, Clements & Kazman (2012) — layered architecture enables independent evolution |
| Multi-target compilation | Lattner et al. (2020) — MLIR concept: one IR, multiple compilation targets |
| Versionability & diffability | Fowler (2005) — event sourcing: stable artifacts enable change tracking |
| Lossless semantic capture | van der Aalst et al. (2003) — workflow patterns: the specification must capture all assessment semantics |
| Authoring independence | Gamma et al. (1995) — adapter pattern decouples authoring from execution |
| Testability | Bass et al. (2012) — compile-time validation of IR packages |
Evaluation Matrix
| Criterion | A: Direct | B: flowJson | C: Multi-Target | D: Single-Target |
|---|---|---|---|---|
| Separation of concerns | Weak | Moderate | Strong | Strong |
| Multi-target compilation | None | Weak | Strong | Moderate |
| Versionability | Weak | Moderate | Strong | Strong |
| Lossless semantic capture | Moderate | Weak | Strong | Strong |
| Authoring independence | Weak | Moderate | Strong | Strong |
| Testability | Weak | Moderate | Strong | Strong |
Decision & Rationale
Chosen: Option C (Multi-Target IR). The IR serves as the canonical, versioned, executable specification of a published oral assessment. It is the single source of truth consumed by the Pipecat adapter, runtime controller, and marking runtime. From 00-overview.md: “flowJson is a serialization convenience — a bag of nodes and edges consumed by Pipecat’s FlowManager. It was designed to describe conversational flow, not to serve as the canonical executable specification of a high-stakes oral assessment.”
The multi-target property is critical because different consumers need different views: the Pipecat adapter needs conversational flow, the runtime controller needs policy enforcement, and the marking runtime needs evidence targets and rubric mappings.
Rejected Options
- Option A (Direct): Tight coupling between authoring and runtime makes independent evolution impossible.
- Option B (flowJson): flowJson lacks runtime state schema, hard constraint vocabulary, event contract, and versioning (00-overview.md §1).
- Option D (Single-Target): The separation between evidence collection (runtime) and evidence evaluation (marking) requires distinct compilation targets (Akimov & Malin, 2020).
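The multi-target idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `IrPackage` shape, the field names, and the `compileIr` function are hypothetical and far simpler than the schema the specification defines. The point is only that one IR yields three views, each carrying just what its consumer needs.

```typescript
// Hypothetical, heavily simplified IR shape (not the spec's schema).
interface IrNode {
  id: string;
  prompt: string;              // conversational content
  maxTurns: number;            // policy data
  evidenceTargetIds: string[]; // what the node should elicit
}

interface IrPackage {
  version: string;
  nodes: IrNode[];
  rubric: Record<string, string>; // evidenceTargetId -> rubric criterion
}

// View for the Pipecat adapter: conversational flow only.
function toPipecatFlow(ir: IrPackage) {
  return ir.nodes.map((n) => ({ id: n.id, prompt: n.prompt }));
}

// View for the runtime controller: policy data only.
function toControllerConfig(ir: IrPackage) {
  return ir.nodes.map((n) => ({
    id: n.id,
    maxTurns: n.maxTurns,
    evidenceTargetIds: n.evidenceTargetIds,
  }));
}

// View for the marking runtime: evidence targets mapped to rubric criteria.
function toMarkingConfig(ir: IrPackage) {
  const rows: { nodeId: string; targetId: string; criterion: string }[] = [];
  for (const n of ir.nodes) {
    for (const t of n.evidenceTargetIds) {
      rows.push({ nodeId: n.id, targetId: t, criterion: ir.rubric[t] ?? "unmapped" });
    }
  }
  return rows;
}

// One IR in, three target artifacts out, all stamped with the IR version.
function compileIr(ir: IrPackage) {
  return {
    irVersion: ir.version,
    pipecatFlow: toPipecatFlow(ir),
    controllerConfig: toControllerConfig(ir),
    markingConfig: toMarkingConfig(ir),
  };
}

const sampleIr: IrPackage = {
  version: "0.2.0",
  nodes: [{ id: "n1", prompt: "Explain recursion.", maxTurns: 4, evidenceTargetIds: ["t1"] }],
  rubric: { t1: "Defines a base case" },
};
const compiled = compileIr(sampleIr);
```

In this sketch the Pipecat view never sees `maxTurns` or rubric data, which is the separation the evaluation matrix rewards under Option C.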
Decision 2: Three-Layer Architecture
Question
Why three layers (Specification / Runtime Controller / Pipecat Adapter) instead of two or four?
Options Considered
| Option | Description |
|---|---|
| A: Two Layers | IR compiles to combined Runtime+Pipecat component. |
| B: Three Layers | IR → Runtime Controller → Pipecat Adapter. LLM is a tool invoked by the Runtime Controller. |
| C: Four Layers | IR → Policy Engine → Runtime Controller → Pipecat Adapter. |
| D: Pipecat + Overlay | Pipecat FlowManager is primary; lightweight overlay handles what Pipecat cannot. |
Evaluation Criteria
| Criterion | Source |
|---|---|
| Single responsibility | Bass et al. (2012) — each layer should have one reason to change |
| Domain logic isolation | Young (2010) CQRS — domain logic separated from infrastructure |
| LLM boundary enforcement | Rebedea et al. (2023) — proxy layer enables programmable constraint enforcement |
| Testability | Bass et al. (2012) — independently testable layers |
| Pipecat independence | Gamma et al. (1995) — adapter pattern decouples domain from Pipecat |
Evaluation Matrix
| Criterion | A: Two Layers | B: Three Layers | C: Four Layers | D: Pipecat+Overlay |
|---|---|---|---|---|
| Single responsibility | Weak | Strong | Moderate | Weak |
| Domain logic isolation | Weak | Strong | Strong | Weak |
| LLM boundary enforcement | Moderate | Strong | Strong | Moderate |
| Testability | Weak | Strong | Strong | Weak |
| Pipecat independence | Weak | Strong | Strong | Weak |
Decision & Rationale
Chosen: Option B (Three Layers). The Runtime Controller acts as a programmable proxy between the LLM and the environment (Rebedea et al., 2023). The LLM is a tool invoked by the Runtime Controller: it does NOT own state, does NOT decide transitions unilaterally, and does NOT persist evidence directly (03-runtime-semantics.md §1.2).
Rejected Options
- Option A: Mixing domain logic with Pipecat integration would make it impractical to swap the voice pipeline.
- Option C: Policy evaluation is lightweight and doesn’t justify a separate layer.
- Option D: The overlay pattern is reactive rather than proactive; NeMo Guardrails research (Rebedea et al., 2023) suggests proxy architectures are more reliable.
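A rough sketch of the layering follows, under stated assumptions: `VoiceAdapter`, `LlmTool`, and `RuntimeController` are hypothetical names, and the single `maxTurns` rule stands in for the full policy set. It shows the controller owning state and invoking the LLM as a tool, with the voice layer reduced to a swappable interface.

```typescript
// Adapter layer (hypothetical interface): anything that can speak to the candidate.
interface VoiceAdapter {
  say(text: string): void;
}

// The LLM is modelled as a tool: it returns text and never touches state.
type LlmTool = (instruction: string) => string;

// Runtime Controller layer: owns session state and decides what happens next.
class RuntimeController {
  private turnCount = 0;

  constructor(
    private llm: LlmTool,
    private adapter: VoiceAdapter,
    private maxTurns: number,
  ) {}

  step(nodePrompt: string): "continued" | "node_complete" {
    // Policy, not the LLM, ends the node.
    if (this.turnCount >= this.maxTurns) return "node_complete";
    this.turnCount += 1;
    this.adapter.say(this.llm(nodePrompt));
    return "continued";
  }
}

// A test double stands in for the Pipecat adapter, so the controller
// can be exercised without any voice pipeline.
const spoken: string[] = [];
const fakeAdapter: VoiceAdapter = { say: (t) => { spoken.push(t); } };
const controller = new RuntimeController((p) => "Q: " + p, fakeAdapter, 2);
const outcomes = [
  controller.step("Explain X"),
  controller.step("Explain X"),
  controller.step("Explain X"),
];
```

Because the controller depends only on the `VoiceAdapter` interface, replacing the fake with a real adapter would not touch the domain logic, which is the testability and Pipecat-independence argument made in the matrix.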
Decision 3: Agent Communication Model
Question
How should the LLM communicate with the Runtime Controller?
Options Considered
| Option | Description |
|---|---|
| A: Single report_observation | One function bundles all observations per turn. |
| B: Multiple Functions | report_evidence_signal, report_candidate_command, request_transition. |
| C: Free-Text + Parsing | LLM produces free-text; runtime parses for signals/commands. |
| D: Structured Output | LLM produces structured JSON (no function calling). |
Evaluation Criteria
| Criterion | Source |
|---|---|
| Hallucination risk | Schick et al. (2023) — more tools = more hallucination |
| Latency | Yao et al. (2023) — multiple calls = multiple round-trips |
| Atomicity | Young (2010) — all observations processed atomically |
| Interpretability | Yao et al. (2023) — transparent reasoning for audit trails |
Evaluation Matrix
| Criterion | A: Single Function | B: Multiple Functions | C: Free-Text | D: Structured Output |
|---|---|---|---|---|
| Hallucination risk | Strong | Weak | Moderate | Strong |
| Latency | Strong | Weak | Strong | Strong |
| Atomicity | Strong | Weak | Moderate | Strong |
| Interpretability | Strong | Moderate | Weak | Strong |
Decision & Rationale
Chosen: Option A (Single report_observation). From 04-agent-boundary.md: “Previous design had request_transition + report_evidence_signal + report_candidate_command. This caused multiple LLM round-trips per turn, increased latency, and hallucination risk. One function = one call = one Runtime Controller evaluation = atomic decision-making.”
Rejected Options
- Option B: Introduces a correlation problem (matching signals to transitions) and a latency cost.
- Option C: Parsing complexity and prompt injection risk (Greshake et al., 2023).
- Option D: Functionally similar to A; A was chosen for Pipecat integration.
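The single-call design might look roughly like the sketch below. Only `evidenceSufficient` and `needsFollowUp` are field names taken from this chapter; the rest of the `Observation` shape, the command values, and `handleReportObservation` are assumptions for illustration. The payload is validated as a whole before anything is derived from it, so a turn is either fully processed or rejected.

```typescript
type SignalKind = "positive" | "partial" | "absent" | "misconception";

// Hypothetical payload for one report_observation call per turn.
interface Observation {
  evidenceSignals: { targetId: string; kind: SignalKind }[];
  candidateCommand?: "repeat" | "skip" | "pause"; // assumed command vocabulary
  evidenceSufficient: boolean;
  needsFollowUp: boolean;
}

interface TurnResult {
  recorded: number;
  command: string | null;
  transitionRequested: boolean;
}

function handleReportObservation(obs: Observation): TurnResult {
  // Validate the whole payload first so partial application never occurs.
  for (const s of obs.evidenceSignals) {
    if (s.targetId.length === 0) {
      throw new Error("evidence signal without targetId");
    }
  }
  // One call, one evaluation: signals, command, and readiness are read together.
  return {
    recorded: obs.evidenceSignals.length,
    command: obs.candidateCommand ?? null,
    transitionRequested: obs.evidenceSufficient && !obs.needsFollowUp,
  };
}

const result = handleReportObservation({
  evidenceSignals: [{ targetId: "t1", kind: "positive" }],
  evidenceSufficient: true,
  needsFollowUp: false,
});
```

Note that `transitionRequested` is still only a request here; whether a transition actually happens is left to the Runtime Controller's policy evaluation.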
Decision 4: Evidence Signal Model
Question
How should evidence of candidate competence be represented?
Options Considered
| Option | Description |
|---|---|
| A: Discrete Signal Kinds Only | positive, partial, absent, misconception. No confidence. |
| B: Continuous Confidence Only | 0.0–1.0 confidence score. No discrete classification. |
| C: Rich Taxonomy | Discrete kinds + confidence + process quality + provenance + scaffolding intensity. |
| D: Rubric-Level Judgments | LLM directly maps evidence to rubric levels. |
Evaluation Criteria
| Criterion | Source |
|---|---|
| Assessment validity | Joughin (1998) — must capture what oral assessment measures |
| Process quality capture | Fenton (2025) — “the process of learning rather than the output” |
| Provenance tracking | Buneman et al. (2001) — why/where/how provenance |
| Moderation support | Akimov & Malin (2020) — human review and override |
Evaluation Matrix
| Criterion | A: Discrete Only | B: Confidence Only | C: Rich Taxonomy | D: Rubric-Level |
|---|---|---|---|---|
| Assessment validity | Moderate | Moderate | Strong | Moderate |
| Process quality capture | Weak | Weak | Strong | Weak |
| Provenance tracking | Weak | Weak | Strong | Weak |
| Moderation support | Moderate | Moderate | Strong | Weak |
Decision & Rationale
Chosen: Option C (Rich Taxonomy). Eight signal kinds (positive, partial, absent, misconception, flawed_reasoning, process_positive, process_negative, self_correction) capture the full spectrum of assessment evidence. Confidence scores enable calibration. Provenance fields (proposedBy, sttConfidenceSummary, scaffoldingIntensity) support auditing and moderation. Fenton (2025): “Oral assessments reveal the process of learning rather than the output.”
Rejected Options
- Option A: Misses process quality (self-correction, reasoning process).
- Option B: Misses what type of evidence was observed.
- Option D: Conflates evidence collection with evidence interpretation.
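As an illustration, the rich taxonomy could be typed along the following lines. The eight kinds and the field names `proposedBy`, `sttConfidenceSummary`, `scaffoldingIntensity`, and `turnIds` come from this chapter; their value types, the `"llm"` value, the intensity scale, and the validation rules are assumptions, not the specification's schema.

```typescript
const SIGNAL_KINDS = [
  "positive", "partial", "absent", "misconception",
  "flawed_reasoning", "process_positive", "process_negative", "self_correction",
] as const;
type SignalKind = (typeof SIGNAL_KINDS)[number];

// Runtime guard, useful when the kind arrives as untrusted LLM output.
function isSignalKind(value: string): value is SignalKind {
  return (SIGNAL_KINDS as readonly string[]).indexOf(value) >= 0;
}

interface EvidenceSignal {
  targetId: string;
  kind: SignalKind;
  confidence: number;                               // assumed range 0.0 to 1.0
  proposedBy: "llm" | "manual_marker";              // "llm" is an assumed value
  sttConfidenceSummary: number;                     // assumed to be one 0.0 to 1.0 figure
  scaffoldingIntensity: "none" | "light" | "heavy"; // assumed scale
  turnIds: string[];
}

// Returns a list of problems; an empty list means the signal is acceptable.
function validateSignal(s: EvidenceSignal): string[] {
  const problems: string[] = [];
  if (s.confidence < 0 || s.confidence > 1) problems.push("confidence out of range");
  if (s.turnIds.length === 0) problems.push("no transcript turn linked");
  return problems;
}

const example: EvidenceSignal = {
  targetId: "t1",
  kind: "self_correction",
  confidence: 0.8,
  proposedBy: "llm",
  sttConfidenceSummary: 0.93,
  scaffoldingIntensity: "light",
  turnIds: ["turn-7"],
};
```

Keeping the kind discrete and the confidence continuous lets a moderator filter by what was observed and, separately, by how sure the proposer was.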
Decision 5: Policy Expression
Question
How should assessment policies be expressed?
Options Considered
| Option | Description |
|---|---|
| A: Structured Data Objects | TypeScript interfaces: CompletionPolicy, FollowUpPolicy, etc. |
| B: Code | Executable Python/TypeScript functions. |
| C: Prompt Instructions | Natural language in LLM system prompt. |
| D: Rule Engine | Domain-specific language (e.g., Colang). |
Evaluation Criteria
| Criterion | Source |
|---|---|
| Machine enforceability | Rebedea et al. (2023) — programmable rails over embedded rails |
| Authorability | Bass et al. (2012) — accessible to assessment designers |
| Compile-time validation | Lattner et al. (2020) — catch errors before runtime |
| LLM independence | Bai et al. (2022) — enforced regardless of LLM behavior |
Evaluation Matrix
| Criterion | A: Structured Data | B: Code | C: Prompts | D: Rule Engine |
|---|---|---|---|---|
| Machine enforceability | Strong | Strong | Weak | Strong |
| Authorability | Strong | Weak | Strong | Moderate |
| Compile-time validation | Strong | Moderate | Weak | Strong |
| LLM independence | Strong | Strong | Weak | Strong |
Decision & Rationale
Chosen: Option A (Structured Data Objects). Policies are typed data structures evaluated deterministically by the Runtime Controller. From 03-runtime-semantics.md: “The Runtime MUST evaluate transition conditions in the order they appear in the specification. The FIRST matching condition wins.” Structured data enables compile-time validation (08-validation-rules.md) and is authorable via the authoring studio UI.
Rejected Options
- Option B: Not auditable by non-engineers; security risk.
- Option C: Not enforceable; Greshake et al. (2023) show prompt constraints can be overridden.
- Option D: Near-miss; Colang-style DSLs are powerful but add a learning curve.
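The "first matching condition wins" rule quoted above can be sketched with policies held as plain data. The condition vocabulary (`time_exceeded`, `targets_collected`, `min_turns`), the `NodeState` fields, and the function names are hypothetical; only the ordered, first-match evaluation is taken from the quoted text.

```typescript
interface NodeState {
  turns: number;
  collectedTargetIds: string[];
  elapsedSeconds: number;
}

// Conditions are plain data, so they can be validated before runtime.
type Condition =
  | { type: "time_exceeded"; seconds: number }
  | { type: "targets_collected"; targetIds: string[] }
  | { type: "min_turns"; turns: number };

interface TransitionRule {
  when: Condition;
  goto: string;
}

function holds(c: Condition, s: NodeState): boolean {
  switch (c.type) {
    case "time_exceeded":
      return s.elapsedSeconds > c.seconds;
    case "targets_collected":
      return c.targetIds.every((t) => s.collectedTargetIds.indexOf(t) >= 0);
    case "min_turns":
      return s.turns >= c.turns;
  }
}

// Rules are evaluated in specification order; the first match wins.
function nextNode(rules: TransitionRule[], s: NodeState): string | null {
  for (const r of rules) {
    if (holds(r.when, s)) return r.goto;
  }
  return null;
}

const rules: TransitionRule[] = [
  { when: { type: "time_exceeded", seconds: 600 }, goto: "wrap_up" },
  { when: { type: "targets_collected", targetIds: ["t1"] }, goto: "next_topic" },
];

// Both rules match here, but the earlier one takes precedence.
const overTime = nextNode(rules, { turns: 5, collectedTargetIds: ["t1"], elapsedSeconds: 700 });
const onTime = nextNode(rules, { turns: 3, collectedTargetIds: ["t1"], elapsedSeconds: 100 });
const notReady = nextNode(rules, { turns: 1, collectedTargetIds: [], elapsedSeconds: 100 });
```

Because rule order is part of the data, reordering two rules is a reviewable change in the specification rather than a hidden change in code or prompt wording.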
Decision 6: State Machine vs Implicit Tracking
Question
Should the exam lifecycle be governed by explicit state machines or by implicit state derived from an event log?
Options Considered
| Option | Description |
|---|---|
| A: Explicit State Machines | Formal SMs for exam/node/turn lifecycle. |
| B: Event Sourcing | State derived by replaying event log. |
| C: Hybrid | Explicit SMs for runtime + event log for audit/recovery. |
| D: Reactive State | Observable streams, no central authority. |
Evaluation Criteria
| Criterion | Source |
|---|---|
| Invariant enforcement | van der Aalst et al. (2003) — structural invariants |
| Recovery capability | Fowler (2005) — state reconstructable from events |
| Auditability | Young (2010) — every state change recorded |
| Performance | Young (2010) — fast reads for real-time interaction |
Evaluation Matrix
| Criterion | A: Explicit SM | B: Event Sourcing | C: Hybrid | D: Reactive |
|---|---|---|---|---|
| Invariant enforcement | Strong | Moderate | Strong | Weak |
| Recovery capability | Moderate | Strong | Strong | Weak |
| Auditability | Moderate | Strong | Strong | Weak |
| Performance | Strong | Weak | Strong | Strong |
Decision & Rationale
Chosen: Option C (Hybrid). Explicit state machines (exam/node/turn lifecycle) enforce invariants at runtime. The event log provides an audit trail and recovery capability. From 02-schema.md §5: “RuntimeStateSchema is mutable per-session state tracked by the runtime controller. NOT persisted as a log — this is working memory.”
Rejected Options
- Option A: No audit trail or recovery capability.
- Option B: Replay too slow for real-time voice interaction.
- Option D: Unnecessary complexity for single-threaded sessions.
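A compact sketch of the hybrid approach, with assumed names throughout: the four exam states, the event types, and the `ExamSession` class are illustrative and not the lifecycle the specification defines. The transition table enforces invariants on the hot path, while the appended log allows the state to be rebuilt after a failure.

```typescript
type ExamState = "not_started" | "in_progress" | "paused" | "completed";
type ExamEventType = "start" | "pause" | "resume" | "finish";

// Allowed transitions; anything absent from the table is rejected.
const TRANSITIONS: Record<ExamState, Partial<Record<ExamEventType, ExamState>>> = {
  not_started: { start: "in_progress" },
  in_progress: { pause: "paused", finish: "completed" },
  paused: { resume: "in_progress" },
  completed: {},
};

function applyEvent(state: ExamState, event: ExamEventType): ExamState {
  const next = TRANSITIONS[state][event];
  if (next === undefined) {
    throw new Error("illegal transition: " + state + " + " + event);
  }
  return next;
}

class ExamSession {
  state: ExamState = "not_started"; // working memory, read directly at runtime
  readonly log: ExamEventType[] = []; // append-only record for audit and recovery

  dispatch(event: ExamEventType): void {
    // applyEvent throws before anything is logged, so the log holds only legal events.
    this.state = applyEvent(this.state, event);
    this.log.push(event);
  }
}

// Recovery path: rebuild the state by replaying the log from the initial state.
function replay(log: ExamEventType[]): ExamState {
  return log.reduce<ExamState>(applyEvent, "not_started");
}

const session = new ExamSession();
session.dispatch("start");
session.dispatch("pause");
const recovered = replay(session.log);
```

Reads during the exam touch only `session.state`, so replay cost is paid only on recovery, which is the performance argument against pure event sourcing in the matrix.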
Decision 7: Agent Autonomy Gradient
Question
Where should the line be drawn between LLM autonomy and Runtime control?
Options Considered
| Option | Description |
|---|---|
| A: Two Levels | Autonomous / Controlled. |
| B: Three Levels | Autonomous / Advisory / Controlled. |
| C: Five Levels | Fully Autonomous / Guided / Advisory / Constrained / Forbidden. |
| D: Full Autonomy | LLM decides everything; post-hoc validation. |
Evaluation Criteria
| Criterion | Source |
|---|---|
| Assessment naturalness | Joughin (1998) — “bidirectional adaptation” |
| Fairness | Akimov & Malin (2020) — consistent treatment |
| Safety | Bai et al. (2022) — constitutional principles |
| Agent systems alignment | Yao et al. (2023) — action space augmentation |
Evaluation Matrix
| Criterion | A: Two Levels | B: Three Levels | C: Five Levels | D: Full Autonomy |
|---|---|---|---|---|
| Assessment naturalness | Moderate | Strong | Strong | Strong |
| Fairness | Strong | Strong | Strong | Weak |
| Safety | Strong | Strong | Strong | Weak |
| Agent systems alignment | Moderate | Strong | Moderate | Weak |
Decision & Rationale
Chosen: Option B (Three Levels). The LLM fully decides wording and dialogue strategy (autonomous), advises on evidence sufficiency and follow-up need (advisory), and is fully controlled on transitions and scoring (controlled). This maps to ReAct’s (Yao et al., 2023) action space augmentation: Thoughts are autonomous, whereas Actions that affect the environment are advisory or controlled.
Rejected Options
- Option A: Too coarse for nuanced evidence assessment.
- Option C: Over-engineered; three levels provide sufficient granularity.
- Option D: Violates safety requirements for summative assessment.
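One way to picture the three levels is as a lookup that determines whose value ends up being used. The `DecisionKind` names follow the examples in the rationale above; the `resolve` helper and its signature are assumptions made purely for illustration.

```typescript
type AutonomyLevel = "autonomous" | "advisory" | "controlled";
type DecisionKind =
  | "wording" | "dialogue_strategy"
  | "evidence_sufficiency" | "follow_up_need"
  | "transition" | "scoring";

const AUTONOMY: Record<DecisionKind, AutonomyLevel> = {
  wording: "autonomous",
  dialogue_strategy: "autonomous",
  evidence_sufficiency: "advisory",
  follow_up_need: "advisory",
  transition: "controlled",
  scoring: "controlled",
};

// Hypothetical resolver: the level decides how much the LLM's proposal counts.
function resolve<T>(
  kind: DecisionKind,
  llmProposal: T,
  runtimeRule: (advice: T | null) => T,
): T {
  switch (AUTONOMY[kind]) {
    case "autonomous":
      return llmProposal;              // used as-is
    case "advisory":
      return runtimeRule(llmProposal); // runtime decides, taking the advice into account
    case "controlled":
      return runtimeRule(null);        // runtime decides without the proposal
  }
}

const minTurnsMet = false;
const wording = resolve<string>("wording", "Could you say more?", () => "fallback");
const sufficient = resolve<boolean>(
  "evidence_sufficiency",
  true,
  (advice) => advice === true && minTurnsMet,
);
const transition = resolve<boolean>("transition", true, () => false);
```

In this sketch the LLM's `true` for evidence sufficiency is heard but overruled because the assumed turn minimum is not met, and its opinion on the transition is never consulted.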
Decision 8: Event Protocol
Question
What event architecture should the system use?
Options Considered
| Option | Description |
|---|---|
| A: Push Only | Runtime pushes events via WebSocket/LiveKit. |
| B: Pull Only | Consumers poll for state changes. |
| C: Event Sourcing | Append-only event store; consumers read from store. |
| D: Hybrid | Push for real-time + event store for persistence. |
Evaluation Criteria
| Criterion | Source |
|---|---|
| Real-time delivery | Hohpe & Woolf (2003) — frontend needs low-latency updates |
| Auditability | Young (2010) — all events persisted for audit |
| Replay capability | Fowler (2005) — sessions reconstructable from events |
| Transport agnosticism | Hohpe & Woolf (2003) — same event over multiple transports |
Evaluation Matrix
| Criterion | A: Push Only | B: Pull Only | C: Event Sourcing | D: Hybrid |
|---|---|---|---|---|
| Real-time delivery | Strong | Weak | Moderate | Strong |
| Auditability | Weak | Moderate | Strong | Strong |
| Replay capability | Weak | Weak | Strong | Strong |
| Transport agnosticism | Moderate | Moderate | Strong | Strong |
Decision & Rationale
Chosen: Option D (Hybrid). Push events to real-time consumers (frontend via LiveKit data channel, WebSocket). Persist all events to an append-only store for audit and marking. Events are transport-agnostic (05-event-protocol.md Principle E3): the same envelope works over all transports.
Rejected Options
- Option A: No persistence; events lost on failure.
- Option B: Unacceptable latency for real-time UI.
- Option C: Replay too slow for real-time consumers.
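The hybrid protocol can be sketched as a bus that persists first and pushes second. The envelope fields `source`, `timestamp`, and `correlationId` are named elsewhere in this chapter; `type`, `payload`, the `Transport` signature, and the `HybridEventBus` class are assumptions, and the in-memory array merely stands in for a real append-only store.

```typescript
interface EventEnvelope {
  source: string;
  timestamp: string;
  correlationId: string;
  type: string;     // hypothetical field
  payload: unknown; // hypothetical field
}

// A transport is anything that can deliver an envelope (WebSocket, data channel, ...).
type Transport = (e: EventEnvelope) => void;

class HybridEventBus {
  private store: EventEnvelope[] = [];
  private transports: Transport[] = [];

  attach(t: Transport): void {
    this.transports.push(t);
  }

  publish(e: EventEnvelope): void {
    // Persist first, so a failing transport cannot lose the event.
    this.store.push(e);
    for (const send of this.transports) {
      try {
        send(e); // the same envelope goes to every transport
      } catch (err) {
        // Delivery is best effort; the stored copy remains for audit and marking.
      }
    }
  }

  // Pull path for consumers that read from the store after the fact.
  replay(correlationId: string): EventEnvelope[] {
    return this.store.filter((e) => e.correlationId === correlationId);
  }
}

const delivered: EventEnvelope[] = [];
const bus = new HybridEventBus();
bus.attach((e) => { delivered.push(e); });
bus.attach(() => { throw new Error("socket closed"); });
bus.publish({
  source: "runtime",
  timestamp: "2026-06-30T10:00:00Z",
  correlationId: "session-1",
  type: "node_entered",
  payload: { nodeId: "n1" },
});
const persisted = bus.replay("session-1");
```

Here one transport fails, yet the event is still both delivered to the healthy transport and available from the store.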
Decision 9: Evidence as First-Class Output
Question
Should evidence be a separate artifact, embedded in transcript, or extracted post-hoc?
Options Considered
| Option | Description |
|---|---|
| A: Separate EvidenceLedger | Real-time structured ledger with provenance. |
| B: Embedded in Transcript | Evidence as metadata on transcript turns. |
| C: Post-Hoc Extraction | Evidence derived from transcript after exam. |
| D: Hybrid | Real-time signals + post-hoc enrichment. |
Evaluation Criteria
| Criterion | Source |
|---|---|
| Marking determinism | Akimov & Malin (2020) — reproducible marking |
| Provenance | Buneman et al. (2001) — why/where/how provenance |
| Moderation support | Akimov & Malin (2020) — human review and override |
| Separation of concerns | Young (2010) — collection separate from evaluation |
Evaluation Matrix
| Criterion | A: Separate Ledger | B: Embedded | C: Post-Hoc | D: Hybrid |
|---|---|---|---|---|
| Marking determinism | Strong | Moderate | Weak | Strong |
| Provenance | Strong | Moderate | Weak | Strong |
| Moderation support | Strong | Moderate | Weak | Strong |
| Separation of concerns | Strong | Weak | Moderate | Strong |
Decision & Rationale
Chosen: Option A (Separate EvidenceLedger). Evidence signals are written to a dedicated ledger in real-time. Each signal carries provenance (proposedBy, sttConfidenceSummary, scaffoldingIntensity) and is linked to transcript turns via turnIds. From 06-evidence-ledger.md: “The evidence ledger is not a post-processing step over the transcript. It is a structured, real-time, authoritative record of assessment evidence.”
Rejected Options
- Option B: Conflating evidence with transcript prevents independent override and provenance tracking.
- Option C: Non-deterministic; no real-time feedback for formative assessment.
- Option D: Near-miss; the spec supports post-hoc enrichment via proposedBy: "manual_marker".
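A minimal sketch of a separate ledger with manual override follows. `proposedBy`, `"manual_marker"`, and `turnIds` come from this chapter; the `supersedes` link, the `"llm"` value, and the `EvidenceLedger` class API are assumptions chosen to show how an override can be recorded without rewriting history.

```typescript
interface LedgerEntry {
  entryId: number;
  targetId: string;
  kind: string;
  proposedBy: "llm" | "manual_marker"; // "llm" is an assumed value
  turnIds: string[];                   // links back to transcript turns
  supersedes?: number;                 // hypothetical override link to an earlier entry
}
type NewEntry = Omit<LedgerEntry, "entryId">;

class EvidenceLedger {
  private entries: LedgerEntry[] = [];

  // Entries are only ever appended; nothing is edited or removed.
  append(e: NewEntry): LedgerEntry {
    const entry: LedgerEntry = { ...e, entryId: this.entries.length + 1 };
    this.entries.push(entry);
    return entry;
  }

  // Full history, including overridden entries, for audit.
  all(): LedgerEntry[] {
    return this.entries.slice();
  }

  // Entries that no later entry has superseded, for marking.
  effective(): LedgerEntry[] {
    const superseded = this.entries
      .filter((e) => e.supersedes !== undefined)
      .map((e) => e.supersedes);
    return this.entries.filter((e) => superseded.indexOf(e.entryId) < 0);
  }
}

const ledger = new EvidenceLedger();
const first = ledger.append({
  targetId: "t1", kind: "positive", proposedBy: "llm", turnIds: ["turn-3"],
});
ledger.append({
  targetId: "t1", kind: "partial", proposedBy: "manual_marker",
  turnIds: ["turn-3"], supersedes: first.entryId,
});
const forMarking = ledger.effective();
```

The transcript is never touched by the override, which is the independence that embedding evidence in transcript metadata would make harder.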
Decision 10: Recovery Strategy
Question
How should the system handle failures?
Options Considered
| Option | Description |
|---|---|
| A: Fully Automated | All failures handled by runtime. |
| B: Human-in-the-Loop | All failures require proctor decision. |
| C: Categorized | Technical = automated; Assessment = LLM-assisted with runtime guardrails. |
| D: Graduated | Severity-based: minor → automated; moderate → LLM-assisted; severe → human. |
Evaluation Criteria
| Criterion | Source |
|---|---|
| Assessment validity | Fenton (2025) — recovery must not compromise assessment |
| Candidate welfare | Akimov & Malin (2020) — recovery must not cause distress |
| Operational feasibility | Bayley et al. (2024) — must work at scale |
| Assessment neutrality | Pearce & Chiavaroli (2020) — recovery must not reveal rubric |
Evaluation Matrix
| Criterion | A: Automated | B: Human | C: Categorized | D: Graduated |
|---|---|---|---|---|
| Assessment validity | Moderate | Strong | Strong | Strong |
| Candidate welfare | Moderate | Strong | Strong | Strong |
| Operational feasibility | Strong | Weak | Strong | Strong |
| Assessment neutrality | Strong | Moderate | Strong | Strong |
Decision & Rationale
Chosen: Option C (Categorized Recovery). Technical failures (network, STT, TTS, silence) are handled automatically. Assessment failures (candidate confusion, distress, off-topic) are handled by the LLM with runtime guardrails. From 03-runtime-semantics.md §6.1: “Recovery MUST NOT reveal model answers, rubric scoring logic, or the ‘correct’ response.”
Rejected Options
- Option A: Cannot handle nuanced affective/pedagogical recovery.
- Option B: Infeasible at scale (Bayley et al., 2024: 600+ students).
- Option D: Near-miss; categorization (technical vs. assessment) was chosen over severity because the recovery mechanisms are fundamentally different.
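Categorized recovery can be sketched as a dispatch on failure category plus a guardrail check on any LLM-produced recovery text. The failure kinds mirror the examples in the rationale; the action names, `planRecovery`, and the phrase-matching `passesGuardrail` are assumptions, and a real guardrail would likely be more sophisticated than substring matching.

```typescript
type Failure =
  | { category: "technical"; kind: "network" | "stt" | "tts" | "silence" }
  | { category: "assessment"; kind: "confusion" | "distress" | "off_topic" };

interface RecoveryPlan {
  handledBy: "runtime" | "llm_with_guardrails";
  action: string;
}

function planRecovery(f: Failure): RecoveryPlan {
  if (f.category === "technical") {
    // Hypothetical infrastructure actions, chosen without any LLM involvement.
    const actions = {
      network: "reconnect",
      stt: "ask_to_repeat",
      tts: "retry_speech",
      silence: "prompt_candidate",
    };
    return { handledBy: "runtime", action: actions[f.kind] };
  }
  // Assessment failures need a pedagogical response, produced by the LLM under guardrails.
  return { handledBy: "llm_with_guardrails", action: "rephrase_without_hints" };
}

// Naive guardrail: reject recovery text that contains any protected phrase
// (for example, wording taken from a model answer).
function passesGuardrail(recoveryText: string, protectedPhrases: string[]): boolean {
  const lower = recoveryText.toLowerCase();
  return protectedPhrases.every((p) => lower.indexOf(p.toLowerCase()) < 0);
}

const sttPlan = planRecovery({ category: "technical", kind: "stt" });
const confusionPlan = planRecovery({ category: "assessment", kind: "confusion" });
const safe = passesGuardrail("Let me rephrase the question.", ["base case"]);
const leaky = passesGuardrail("Think about the Base Case.", ["base case"]);
```

The two branches share no mechanism, which reflects the reason given for preferring categories over a single severity scale.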
Decision 11: Transition Authority
Question
Who decides when to transition between nodes?
Options Considered
| Option | Description |
|---|---|
| A: Runtime Only | Deterministic policy evaluation. No LLM input. |
| B: LLM Proposes + Runtime Approves | LLM signals readiness; runtime validates against policy. |
| C: LLM Decides | LLM controls transitions. |
| D: Runtime + LLM Delay | Runtime decides; LLM can request more time. |
Evaluation Criteria
| Criterion | Source |
|---|---|
| Fairness | Akimov & Malin (2020) — consistent treatment |
| Auditability | Buneman et al. (2001) — traceable to policy |
| Assessment naturalness | Joughin (1998) — natural transitions |
| Agent boundary | Rebedea et al. (2023) — LLM should not control structure |
Evaluation Matrix
| Criterion | A: Runtime Only | B: Propose+Approve | C: LLM Decides | D: Runtime+Delay |
|---|---|---|---|---|
| Fairness | Strong | Strong | Weak | Strong |
| Auditability | Strong | Strong | Weak | Strong |
| Assessment naturalness | Moderate | Strong | Strong | Strong |
| Agent boundary | Strong | Strong | Weak | Strong |
Decision & Rationale
Chosen: Option B (LLM Proposes + Runtime Approves). The LLM signals evidenceSufficient: true in report_observation. The Runtime validates against CompletionPolicy (requiredEvidenceTargetIds, minTurns, timeBudget). The LLM is the best judge of assessment quality; the Runtime is the best judge of structural policy. From 04-agent-boundary.md: “Judging evidence sufficiency: LLM emits opinion; Runtime applies policy.”
Rejected Options
- Option A: Misses assessment quality dimension.
- Option C: Violates fairness and reliability (Joughin, 1998).
- Option D: The needsFollowUp signal already serves this purpose.
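The propose-and-approve flow might be sketched as below. `CompletionPolicy` and its three fields are named in this chapter; their types, the unit of `timeBudget` (seconds), the `NodeProgress` shape, the reason strings, and in particular the assumption that an exhausted time budget forces a transition are all illustrative guesses rather than the specified semantics.

```typescript
interface CompletionPolicy {
  requiredEvidenceTargetIds: string[];
  minTurns: number;
  timeBudget: number; // assumed to be seconds
}

interface NodeProgress {
  collectedTargetIds: string[];
  turns: number;
  elapsedSeconds: number;
}

interface Verdict {
  transition: boolean;
  reason: string; // recorded so the decision is traceable to policy
}

function decideTransition(
  evidenceSufficient: boolean, // the LLM's opinion from report_observation
  policy: CompletionPolicy,
  progress: NodeProgress,
): Verdict {
  // Assumption: running out of time ends the node regardless of the LLM's opinion.
  if (progress.elapsedSeconds >= policy.timeBudget) {
    return { transition: true, reason: "time_budget_exhausted" };
  }
  if (!evidenceSufficient) {
    return { transition: false, reason: "llm_not_ready" };
  }
  if (progress.turns < policy.minTurns) {
    return { transition: false, reason: "min_turns_not_met" };
  }
  const missing = policy.requiredEvidenceTargetIds.filter(
    (t) => progress.collectedTargetIds.indexOf(t) < 0,
  );
  if (missing.length > 0) {
    return { transition: false, reason: "missing_evidence: " + missing.join(",") };
  }
  return { transition: true, reason: "policy_satisfied" };
}

const policy: CompletionPolicy = { requiredEvidenceTargetIds: ["t1", "t2"], minTurns: 2, timeBudget: 300 };
const denied = decideTransition(true, policy, { collectedTargetIds: ["t1"], turns: 3, elapsedSeconds: 120 });
const approved = decideTransition(true, policy, { collectedTargetIds: ["t1", "t2"], turns: 3, elapsedSeconds: 120 });
const timedOut = decideTransition(false, policy, { collectedTargetIds: [], turns: 1, elapsedSeconds: 301 });
```

Every verdict carries a reason, so a later audit can trace each transition, or refusal to transition, back to a specific policy clause.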
Decision 12: Context Management
Question
What context does the LLM see between nodes?
Options Considered
| Option | Description |
|---|---|
| A: RESET Only | Fresh context per node. No carryover. |
| B: APPEND Only | Full conversation history across all nodes. |
| C: Hybrid | RESET + previous node summary. |
| D: Policy-Driven | ContextPolicy governs per-node visibility. RESET default. |
Evaluation Criteria
| Criterion | Source |
|---|---|
| Information leakage prevention | Greshake et al. (2023) — minimize attack surface |
| Conversational naturalness | Joughin (1998) — continuity across topics |
| Context window efficiency | Schick et al. (2023) — manage token limits |
| Security | Greshake et al. (2023) — sanitize candidate input |
Evaluation Matrix
| Criterion | A: RESET | B: APPEND | C: Hybrid | D: Policy-Driven |
|---|---|---|---|---|
| Leakage prevention | Strong | Weak | Moderate | Strong |
| Naturalness | Weak | Strong | Strong | Strong |
| Window efficiency | Strong | Weak | Strong | Strong |
| Security | Strong | Weak | Moderate | Strong |
Decision & Rationale
Chosen: Option D (Policy-Driven, RESET default). Context is reset between nodes (preventing cross-node contamination). The ContextPolicy allows per-node overrides (includePreviousNodes, includeEvidenceStatus, etc.). Rubric criteria ARE shared as evidence vocabulary; scoring logic is NOT shared. From 07-pipecat-adapter.md §6: “Between Nodes: RESET. Within a Node: APPEND.”
Rejected Options
- Option A: No conversational continuity.
- Option B: Unbounded context growth; prompt injection risk.
- Option C: Near-miss; the policy-driven approach adds per-node configurability on top of the hybrid base.
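A small sketch of policy-driven context assembly, under stated assumptions: `ContextPolicy`, `includePreviousNodes`, and `includeEvidenceStatus` are named in this chapter, but typing `includePreviousNodes` as a list of node ids, the `NodeSummary` shape, and the `buildContext` function are guesses for illustration.

```typescript
interface ContextPolicy {
  includePreviousNodes: string[]; // assumed: ids of earlier nodes whose summaries are visible
  includeEvidenceStatus: boolean;
}

// RESET default: nothing carries over unless a node opts in.
const RESET_DEFAULT: ContextPolicy = {
  includePreviousNodes: [],
  includeEvidenceStatus: false,
};

interface NodeSummary {
  nodeId: string;
  summary: string;
}

function buildContext(
  nodePrompt: string,
  overrides: Partial<ContextPolicy>,
  summaries: NodeSummary[],
  evidenceStatus: string,
): string[] {
  // Per-node overrides are layered on top of the RESET default.
  const policy: ContextPolicy = {
    includePreviousNodes: overrides.includePreviousNodes ?? RESET_DEFAULT.includePreviousNodes,
    includeEvidenceStatus: overrides.includeEvidenceStatus ?? RESET_DEFAULT.includeEvidenceStatus,
  };
  const messages = [nodePrompt];
  for (const s of summaries) {
    if (policy.includePreviousNodes.indexOf(s.nodeId) >= 0) {
      messages.push("Summary of " + s.nodeId + ": " + s.summary);
    }
  }
  if (policy.includeEvidenceStatus) {
    messages.push("Evidence status: " + evidenceStatus);
  }
  return messages;
}

const summaries: NodeSummary[] = [
  { nodeId: "n1", summary: "Candidate defined recursion." },
  { nodeId: "n2", summary: "Candidate traced a call stack." },
];
const resetContext = buildContext("Discuss topic C.", {}, summaries, "t1 collected");
const linkedContext = buildContext(
  "Discuss topic C.",
  { includePreviousNodes: ["n1"], includeEvidenceStatus: true },
  summaries,
  "t1 collected",
);
```

With no overrides the LLM sees only the node prompt; continuity has to be requested per node, which keeps the default attack surface small.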
Cross-Cutting Themes
1. Determinism Where It Matters
Across all 12 decisions, the spec consistently chooses determinism over flexibility for structural concerns (transitions, policies, evidence validation) while preserving flexibility for conversational concerns (wording, dialogue strategy, tone). This reflects the fundamental tension in AI-powered assessment: the conversation must feel natural, but the assessment must be fair and auditable.
Pattern: LLM decides how; Runtime decides when and whether.
2. Separation of Collection and Evaluation
The spec separates evidence collection (runtime) from evidence evaluation (marking) at every level:
- IR layer: Defines what to assess (evidence targets) separately from how to score (marking rubric)
- Runtime layer: Collects evidence signals; does not assign marks
- Marking layer: Evaluates signals; does not collect evidence
This mirrors CQRS (Young, 2010): the command side (runtime) and query side (marking) have different models optimized for their specific concerns.
3. Defense in Depth
No single mechanism is sufficient for safety. The spec implements multiple overlapping layers:
- IR constraints: Policies defined at compile time
- Runtime guardrails: Enforced at execution time
- Output validation: Filters LLM output before it reaches the candidate
- Context policy: Controls what the LLM can see
- Audit trail: Records everything for post-hoc review
This directly follows Greshake et al. (2023): “Defense is a difficult ‘whack-a-mole’ game. RLHF and input filtering are not sufficient.”
4. LLM as Sensor, Runtime as Judge
The LLM is treated as a sensor: it observes candidate responses and proposes evidence signals. The Runtime is the judge: it validates proposals against policy and makes structural decisions. This separation ensures that:
- The LLM’s creative freedom is preserved (it can adapt dialogue naturally)
- The Runtime’s authority is preserved (it enforces hard constraints)
- The marking pipeline receives validated, structured evidence
5. Provenance as First-Class
Every artifact carries provenance metadata:
- Evidence signals: proposedBy, sttConfidenceSummary, scaffoldingIntensity
- Runtime events: source, timestamp, correlationId
- Policy decisions: condition evaluated, reason for transition
This follows Buneman et al. (2001): why-provenance (why was this signal emitted?), where-provenance (where did the evidence come from?), and how-provenance (how was it derived?).
Design Principles Emerged
- The specification is the contract: Between authoring and execution, between collection and evaluation, between human and machine. It is the single source of truth.
- Policies are data, not code or prompts: Machine-enforceable, auditable, testable, authorable.
- The LLM proposes, the Runtime disposes: The LLM has judgment; the Runtime has authority. Neither alone is sufficient.
- Events are immutable facts: Every state change is recorded. The event log is the audit trail, the recovery mechanism, and the integration backbone.
- Evidence is structured, not derived: Signals are produced during the exam, not extracted after. They carry provenance, confidence, and classification.
- Context is controlled, not accumulated: The LLM sees what it needs, not everything it could see. Minimizing context minimizes risk.
- Recovery is categorized, not uniform: Technical failures require infrastructure actions; assessment failures require pedagogical actions. Different problems need different solutions.
- Fairness through determinism: Structural decisions (transitions, completion, time budgets) are deterministic. Conversational decisions (wording, tone, follow-up type) are adaptive. This preserves both fairness and naturalness.
Revision History
| Version | Date | Changes |
|---|---|---|
| v0.2.0 | 2026-06-30 | Updated design rationale for IOA-ORM framing. Added evaluation criteria for anxiety management and Bloom’s integration. Updated terminology from ‘Exam Runtime IR’ to ‘IOA-ORM’. |
| v0.1.0 | 2026-05-06 | Initial release. |