Design Principles

These 10 principles emerged from the systematic design space exploration (morphological analysis + QOC deliberations) and literature integration. They codify the recurring patterns across all 12 design decisions.

Note on terminology: This specification is formally titled the Interactive Oral Assessment Ontology and Reference Model (IOA-ORM). It comprises three layers: a domain ontology (shared vocabulary), a reference model (system abstraction), and an executable specification (machine-readable schema). The term “IR” (intermediate representation) refers specifically to the engineering role played by the executable specification — as a compilation target from authoring tools and a compilation source for runtime engines. “IR” describes a role, not the specification’s primary identity.

P1: The Specification Is the Contract

Between authoring and execution, between collection and evaluation, between human and machine. It is the single source of truth.

The IOA-ORM serves as the canonical, versioned, executable specification of a published oral assessment. The executable specification layer (the IR) is the compilation target from the authoring studio and the compilation source for the runtime controller and marking runtime. The ontology layer provides the shared vocabulary. The reference model layer provides the system abstraction.

Embodied by: ExamRuntimePackage, multi-target compilation, dual versioning scheme, IOA Domain Ontology

P2: Policies Are Data, Not Code or Prompts

Machine-enforceable, auditable, testable, authorable.

Assessment policies (CompletionPolicy, FollowUpPolicy, TransitionPolicy, RecoveryPolicy) are typed data structures evaluated deterministically by the Runtime Controller. They are NOT prompt instructions (which can be overridden) and NOT executable code (which is opaque to non-engineers).

Embodied by: All policy types in 02-schema.md §6–8, compile-time validation rules in 08-validation-rules.md

Grounding: Greshake et al. (2023) demonstrate that prompt-based constraints can be overridden by adversarial input. Structured data policies are enforced regardless of LLM behavior.
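
As a minimal sketch of what "policy as data" could look like, the example below models a CompletionPolicy as an immutable record and evaluates it with a pure function. The field names (min_evidence_targets, max_turns) and the NodeState type are assumptions made for illustration; the actual fields are defined in 02-schema.md and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletionPolicy:
    """Hypothetical policy shape; the field names are illustrative assumptions."""
    min_evidence_targets: int  # targets that must be evidenced before completion
    max_turns: int             # hard cap on candidate turns in the node


@dataclass(frozen=True)
class NodeState:
    """Hypothetical snapshot of the runtime state for one node."""
    evidenced_targets: int
    turns_taken: int


def is_complete(policy: CompletionPolicy, state: NodeState) -> bool:
    """Pure function: the same policy and state always give the same answer."""
    if state.turns_taken >= policy.max_turns:
        return True
    return state.evidenced_targets >= policy.min_evidence_targets


policy = CompletionPolicy(min_evidence_targets=2, max_turns=6)
print(is_complete(policy, NodeState(evidenced_targets=1, turns_taken=3)))  # False
print(is_complete(policy, NodeState(evidenced_targets=2, turns_taken=3)))  # True
```

Because the policy is plain data, it can be serialized, diffed between versions, and checked by a validator without running any prompt or opaque code.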

P3: The LLM Proposes, the Runtime Disposes

The LLM has judgment; the Runtime has authority. Neither alone is sufficient.

The AI examiner observes candidate responses and proposes evidence signals. The Runtime Controller validates proposals against policy and makes structural decisions (transitions, completion, time enforcement). This separation preserves both conversational naturalness and assessment fairness.

Embodied by: report_observation() function, Transition Authority (B: LLM Proposes + Runtime Approves)

Grounding: ReAct (Yao et al., 2023) — Thoughts are autonomous (LLM creativity), Actions that affect the environment are controlled (Runtime authority).
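
The propose/dispose split can be sketched roughly as follows. Only the name report_observation() comes from this specification; the RuntimeController shape, its fields, and the validation rules shown here are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class RuntimeController:
    """Hypothetical controller holding authority over evidence and transitions."""
    valid_targets: frozenset[str]     # EvidenceTarget ids declared for the node
    required_targets: frozenset[str]  # targets needed before leaving the node
    accepted: set[str] = field(default_factory=set)

    def report_observation(self, target_id: str, confidence: float) -> bool:
        """Validate an LLM proposal; accept it only if it passes the checks."""
        if target_id not in self.valid_targets:
            return False  # unknown target: the proposal is rejected
        if not 0.0 <= confidence <= 1.0:
            return False  # malformed proposal: rejected
        self.accepted.add(target_id)
        return True

    def may_transition(self) -> bool:
        """Structural decision taken by the runtime alone."""
        return self.required_targets <= self.accepted


controller = RuntimeController(
    valid_targets=frozenset({"t1", "t2"}),
    required_targets=frozenset({"t1"}),
)
print(controller.report_observation("t9", 0.8))  # False
print(controller.may_transition())               # False
print(controller.report_observation("t1", 0.9))  # True
print(controller.may_transition())               # True
```

The LLM never calls may_transition() in this sketch; it can only submit observations, and the runtime decides what they unlock.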

P4: Events Are Immutable Facts

Every state change is recorded. The event log is the audit trail, the recovery mechanism, and the integration backbone.

All significant state changes (node entered, turn completed, evidence collected, command processed, policy violation) produce a RuntimeEvent. Events are append-only, transport-agnostic, and persist for audit and replay.

Embodied by: RuntimeEvent, Event Store, Hybrid push+persist event protocol

Grounding: Event sourcing (Fowler, 2005; Young, 2010) — state is reconstructable from events. CQRS separates write (event production) from read (state queries).
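
A rough sketch of the event-sourcing idea, under stated assumptions: the RuntimeEvent fields, the event kind strings, and the EventStore interface below are hypothetical and stand in for whatever the schema actually defines. The point is only that state is derived by replaying an append-only log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeEvent:
    """Immutable event record; the field names are illustrative assumptions."""
    seq: int
    kind: str     # e.g. "node_entered", "turn_completed" (hypothetical names)
    node_id: str


class EventStore:
    """Append-only in-memory store; it exposes no update or delete."""

    def __init__(self) -> None:
        self._events: list[RuntimeEvent] = []

    def append(self, kind: str, node_id: str) -> RuntimeEvent:
        event = RuntimeEvent(seq=len(self._events), kind=kind, node_id=node_id)
        self._events.append(event)
        return event

    def replay(self) -> tuple[RuntimeEvent, ...]:
        return tuple(self._events)  # callers receive an immutable copy


def rebuild_state(events) -> dict:
    """Reconstruct the current node and its turn count from the log alone."""
    state = {"current_node": None, "turns": 0}
    for event in events:
        if event.kind == "node_entered":
            state["current_node"] = event.node_id
            state["turns"] = 0
        elif event.kind == "turn_completed":
            state["turns"] += 1
    return state


store = EventStore()
store.append("node_entered", "n1")
store.append("turn_completed", "n1")
store.append("node_entered", "n2")
store.append("turn_completed", "n2")
print(rebuild_state(store.replay()))  # {'current_node': 'n2', 'turns': 1}
```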

P5: Evidence Is Structured, Not Derived

Signals are produced during the exam, not extracted after. They carry provenance, confidence, and classification.

Evidence signals are first-class runtime outputs written to a dedicated ledger in real-time. Each signal references an EvidenceTarget, carries provenance (proposedBy, sttConfidenceSummary, scaffoldingIntensity), and is classified by type. The marking runtime reads the ledger, not the transcript.

Embodied by: EvidenceSignal, EvidenceLedger, 8 signal kinds, provenance chain

Grounding: Buneman et al. (2001) — why/where/how provenance. Akimov & Malin (2020) — structured evidence supports moderation and inter-rater reliability.
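
The following sketch shows one way a signal and its provenance might be represented. The provenance fields mirror proposedBy, sttConfidenceSummary, and scaffoldingIntensity from the text, but their Python names and types (a scalar confidence, an integer intensity) are assumptions, as are the signal kind strings and the ledger interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    proposed_by: str               # mirrors proposedBy
    stt_confidence_summary: float  # mirrors sttConfidenceSummary (assumed scalar)
    scaffolding_intensity: int     # mirrors scaffoldingIntensity (assumed integer scale)


@dataclass(frozen=True)
class EvidenceSignal:
    target_id: str  # id of the EvidenceTarget the signal refers to
    kind: str       # one of the signal kinds; the values used below are hypothetical
    provenance: Provenance


class EvidenceLedger:
    """Append-only ledger written during the exam and read later by marking."""

    def __init__(self) -> None:
        self._signals: list[EvidenceSignal] = []

    def record(self, signal: EvidenceSignal) -> None:
        self._signals.append(signal)

    def for_target(self, target_id: str) -> list[EvidenceSignal]:
        return [s for s in self._signals if s.target_id == target_id]


ledger = EvidenceLedger()
ledger.record(EvidenceSignal("t1", "demonstrated", Provenance("llm", 0.93, 0)))
ledger.record(EvidenceSignal("t2", "partial", Provenance("llm", 0.71, 2)))
print(len(ledger.for_target("t1")))  # 1
```

A marking runtime built on this shape would query the ledger by target and could discount signals with low STT confidence or heavy scaffolding, without re-reading the transcript.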

P6: Context Is Controlled, Not Accumulated

The LLM sees what it needs, not everything it could see. Minimizing context minimizes risk.

Context is reset between nodes (preventing cross-node contamination). The ContextPolicy allows per-node overrides (includePreviousNodes, includeEvidenceStatus, etc.). Rubric criteria ARE shared as evidence vocabulary; scoring logic is NOT shared.

Embodied by: ContextPolicy (RESET default), per-node overrides, rubric-as-vocabulary pattern

Grounding: Greshake et al. (2023) — minimizing the LLM’s context window minimizes the attack surface for prompt injection.
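
A minimal sketch of reset-by-default context assembly, assuming a simplified ContextPolicy with two boolean flags that mirror includePreviousNodes and includeEvidenceStatus. The build_context function and its parameters are hypothetical; the real policy has further options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextPolicy:
    """Both flags default to False, which corresponds to the RESET default."""
    include_previous_nodes: bool = False   # mirrors includePreviousNodes
    include_evidence_status: bool = False  # mirrors includeEvidenceStatus


def build_context(
    policy: ContextPolicy,
    node_prompt: str,
    rubric_criteria: list[str],
    previous_transcripts: list[str],
    evidence_status: str,
) -> list[str]:
    """Assemble only what the policy permits the LLM to see for this node."""
    context = [node_prompt]
    context.extend(rubric_criteria)  # criteria as vocabulary; no scoring logic
    if policy.include_previous_nodes:
        context.extend(previous_transcripts)
    if policy.include_evidence_status:
        context.append(evidence_status)
    return context


default_ctx = build_context(
    ContextPolicy(), "Explain X", ["accuracy"], ["earlier node transcript"], "1 of 2"
)
print(default_ctx)  # ['Explain X', 'accuracy']
```

Scoring weights never appear as a parameter of build_context in this sketch, so there is no code path by which they could reach the LLM.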

P7: Recovery Is Categorized, Not Uniform

Technical failures require infrastructure actions; assessment failures require pedagogical actions. Different problems need different solutions.

Technical failures (network, STT, TTS, silence) are handled automatically by the runtime. Assessment failures (candidate confusion, distress, off-topic) are handled by the LLM with runtime guardrails. Recovery MUST NOT reveal model answers, rubric scoring logic, or the “correct” response.

Embodied by: RecoveryPolicy with two categories (technical / assessment), predefined recovery sequences

Grounding: Fenton (2025) — recovery must not compromise assessment validity. Bayley et al. (2024) — recovery must scale to 600+ students.
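
The two-category split could be sketched as a simple classification table plus a router. The failure names are taken from the examples in the text; the identifiers, the FailureCategory enum, and the handler labels ("runtime", "llm_with_guardrails") are assumptions for illustration.

```python
from enum import Enum


class FailureCategory(Enum):
    TECHNICAL = "technical"
    ASSESSMENT = "assessment"


# Classification of the example failures named in the text (identifiers hypothetical).
FAILURE_CATEGORIES = {
    "network": FailureCategory.TECHNICAL,
    "stt": FailureCategory.TECHNICAL,
    "tts": FailureCategory.TECHNICAL,
    "silence": FailureCategory.TECHNICAL,
    "confusion": FailureCategory.ASSESSMENT,
    "distress": FailureCategory.ASSESSMENT,
    "off_topic": FailureCategory.ASSESSMENT,
}


def route_recovery(failure: str) -> str:
    """Return which component handles the failure; unknown failures are errors."""
    category = FAILURE_CATEGORIES.get(failure)
    if category is None:
        raise ValueError(f"unclassified failure: {failure}")
    if category is FailureCategory.TECHNICAL:
        return "runtime"  # handled automatically by infrastructure actions
    return "llm_with_guardrails"  # pedagogical action, bounded by the runtime


print(route_recovery("stt"))       # runtime
print(route_recovery("distress"))  # llm_with_guardrails
```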

P9: Cognitive Depth Is a Design Parameter

Not all questions are equal. The specification must know what cognitive level each evidence target assesses.

Bloom’s Taxonomy (1956), in the revised form published by Anderson and Krathwohl (2001), defines six cognitive levels from Remember to Create. The specification encodes these as BloomLevel on EvidenceTarget, enabling compile-time validation of cognitive coverage, runtime follow-up escalation toward higher-order thinking, and marking rubrics that weight higher-order responses more heavily.

This principle addresses a key argument for AI-era oral assessment: generative AI performs well at lower Bloom levels but struggles at Create (Fenton, 2025). The specification must formalize this to validate that exams test the intended cognitive depth.

Embodied by: BloomLevel enum, EvidenceTarget.cognitiveLevel, FollowUpPolicy.cognitiveEscalationStrategy

Grounding: Bloom (1956); Fenton (2025): “Generative AI tools have been found to perform well at the lower levels of Bloom’s taxonomy but struggle at the create level.”
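
Compile-time validation of cognitive coverage might look roughly like the sketch below. BloomLevel and EvidenceTarget.cognitiveLevel are named in this specification; the integer ordering, the validate_cognitive_coverage function, and the rule it applies (at least one target at or above a required level) are assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class BloomLevel(IntEnum):
    """Six levels, ordered so that higher values mean higher-order thinking."""
    REMEMBER = 1
    UNDERSTAND = 2
    APPLY = 3
    ANALYZE = 4
    EVALUATE = 5
    CREATE = 6


@dataclass(frozen=True)
class EvidenceTarget:
    target_id: str
    cognitive_level: BloomLevel  # mirrors EvidenceTarget.cognitiveLevel


def validate_cognitive_coverage(
    targets: list[EvidenceTarget], required_minimum: BloomLevel
) -> list[str]:
    """Return validation errors; an empty list means the exam is deep enough."""
    if not targets:
        return ["exam declares no evidence targets"]
    highest = max(t.cognitive_level for t in targets)
    if highest < required_minimum:
        return [
            f"highest level is {highest.name}, "
            f"required at least {required_minimum.name}"
        ]
    return []


targets = [
    EvidenceTarget("t1", BloomLevel.REMEMBER),
    EvidenceTarget("t2", BloomLevel.APPLY),
]
print(validate_cognitive_coverage(targets, BloomLevel.EVALUATE))
# ['highest level is APPLY, required at least EVALUATE']
```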

P10: Transparency Builds Trust

Candidates who know what to expect perform better and feel less anxious.

Joughin (1998) notes that “students need to know in advance what to expect of the shape of the assessment in order to prepare adequately.” Fenton (2025) recommends providing information about format, criteria, and expectations beforehand. Akimov and Malin (2020) report that 100% of students were nervous — unfamiliarity with the format exacerbates anxiety.

The specification encodes transparency through CandidateBriefing (candidate-facing exam information), warmup nodes with isPractice and anxietyMitigation properties, and FollowUpPolicy.promptingPrinciples.transparency.

Embodied by: CandidateBriefing, warmup node isPractice/anxietyMitigation, promptingPrinciples.transparency

Grounding: Joughin (1998) Dimension 4; Fenton (2025) Recommendations 1, 8; Akimov & Malin (2020) anxiety findings.

P8: Fairness Through Determinism

Structural decisions are deterministic. Conversational decisions are adaptive. This preserves both fairness and naturalness.

Transitions, completion, time budgets, and evidence validation are evaluated deterministically by the Runtime Controller. Wording, tone, follow-up type, and dialogue strategy are adaptive — the LLM generates them naturally within bounded parameters.

Pattern: LLM decides how; Runtime decides when and whether.

Embodied by: Deterministic CompletionPolicy evaluation + adaptive FollowUpPolicy.allowedPromptingLevels

Grounding: Joughin (1998) — reliability requires consistent treatment. Akimov & Malin (2020) — fairness requires same rules for all candidates.
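
The "bounded parameters" idea can be illustrated with a small sketch in which the LLM requests a prompting level and the runtime accepts it only if the policy allows it. Only FollowUpPolicy.allowedPromptingLevels is named in this specification; the level names and the fallback rule (use the first allowed level) are assumptions.

```python
def resolve_prompting_level(requested: str, allowed: tuple[str, ...]) -> str:
    """Accept the LLM's requested level only when the policy allows it.

    Falling back to the first allowed level is an illustrative assumption,
    not a rule taken from the specification.
    """
    if not allowed:
        raise ValueError("policy must allow at least one prompting level")
    return requested if requested in allowed else allowed[0]


# Hypothetical level names standing in for allowedPromptingLevels.
allowed_levels = ("rephrase", "hint")

print(resolve_prompting_level("hint", allowed_levels))        # hint
print(resolve_prompting_level("direct_cue", allowed_levels))  # rephrase
```

The same inputs always resolve to the same level, so two candidates under the same policy are bounded identically even though the wording the LLM produces within that level varies.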