
Executive Summary

Draft · v0.2.0 · 2026-06-30


The Rising Need for Interactive Oral Assessment


Oral examinations assess what written tests cannot: the ability to reason under questioning, defend a position, respond to probing follow-ups, and demonstrate competence through live interaction. Sotiriadou et al. (2020) define this as the “interactive oral” — “a form of assessment asking students to perform real-world tasks to demonstrate meaningful application of necessary knowledge and skills.” Unlike written exams, interactive oral assessments (IOA) probe higher-order thinking: Bloom’s (1956) levels of Analyze, Evaluate, and Create — where candidates must defend, justify, and produce, not merely recall.

The need for IOA is growing. Academic integrity concerns make written exams increasingly unreliable indicators of student competence (Fenton, 2025). Professional accreditation bodies demand assessment of communication, critical thinking, and interpersonal skills — competencies that only live interaction can demonstrate. And as generative AI tools become capable of passing written assessments at the Remember and Understand levels (Fenton, 2025), the case for oral examination as a complement — or alternative — to written assessment strengthens.

But IOA has historically been limited by scale. A human examiner for every candidate is expensive, inconsistent across examiners, and impractical for cohorts of hundreds. This creates a need for systematic, machine-executable oral assessment — where the exam’s structure, evidence capture, and policy enforcement are formally specified and executed by a runtime system, whether that runtime is a human examiner following a script, a rule-based machine, or an AI-powered voice agent.

Kōrero (korero.thesteder.com) is one such platform: an AI-powered system where lecturers design interactive exam flows that an AI examiner conducts with candidates via real-time voice. But Kōrero is one instantiation of a broader pattern. The problem is general: any system that executes interactive oral assessments needs a formal specification that bridges assessment design with runtime execution, regardless of whether the executor is human, machine, or AI.

The Gap: Assessment Theory Meets Runtime Execution


The oral assessment literature defines what makes an exam reliable, valid, and fair. Joughin (1998) identifies six dimensions that shape assessment quality: content type, interaction mode, authenticity, structure, examiner configuration, and degree of orality. Akimov and Malin (2020) formalize the validity/reliability/fairness matrix. Bayley et al. (2024) demonstrate scalable oral exam administration for 600+ students. Fenton (2025) defines interactive oral assessment (IOA) components including prompting taxonomy, scaffolding, and moderation.

But translating these theoretical requirements into a running system is hard. Current approaches fall into two categories, both insufficient:

  1. Hard-coded runtime logic. The exam’s behavior is embedded directly in application code. This works for one exam but is opaque, non-portable, and impossible to validate at compile time. Every new exam requires reimplementation. Assessment-theoretic properties (e.g., “this exam uses structured dialogue with moderate openness”) exist only as implicit assumptions in code, not as inspectable, versioned artifacts.

  2. Generic workflow engines. Dialogue graphs, state machines, or workflow DSLs can describe conversational flow — but they lack assessment-specific concepts. They have no notion of evidence targets, candidate commands, completion policies, scaffolding budgets, or moderation workflows. The runtime must improvise, and each improvisation is a potential validity threat.

The Vision: A Formal, Interoperable Exam Specification


This specification proposes a new kind of artifact: a formal, machine-processable, platform-independent specification for interactive oral assessments. Drawing from the semantic web tradition — where ontologies provide “a formal, explicit specification of a shared conceptualization” (Gruber, 1993) — this specification defines a shared vocabulary and formal semantics for what an oral assessment is, independent of how any particular system executes it.

The specification is implementation-agnostic. An exam specified in this reference model could be executed by:

  • A human examiner following a structured script with policy enforcement
  • A rule-based machine that drives a branching dialogue with deterministic transitions
  • An AI-powered voice agent that generates natural follow-ups within bounded policies

When the executor is an AI agent, the specification provides additional primitives — agent boundaries, evidence provenance, and runtime policy enforcement — that treat the generative model as a first-class component whose behavior must be formally bounded and auditable. But these AI-specific constructs are extensions, not prerequisites. The core specification applies to any execution model.

The key properties of this specification are:

  • Formal semantics. Every construct (evidence target, completion policy, transition rule) has a precise, machine-enforceable meaning — not just a human-readable description. The specification is grounded in oral assessment theory (Joughin, 1998; Akimov & Malin, 2020), encoding theoretical dimensions as executable parameters.
  • Semantic interoperability. The specification provides a shared vocabulary that bridges the conceptual models of assessment designers (rubric criteria, evidence targets, interaction patterns) and runtime engineers (nodes, edges, state transitions). These two communities currently lack a common language; the specification is that common language.
  • Platform independence. The IR is a compilation source, not an execution config. It compiles to platform-specific formats for any runtime engine — making exam specifications portable across systems and preserving them beyond any single platform’s lifecycle.
  • Versionability and auditability. Each published exam is a versioned, immutable artifact with a stable identity, changelog, and structural diff — enabling inspection, regression analysis, and regulatory audit.
  • Structured evidence capture. The specification defines structured evidence capture during live exams, with provenance tracking and confidence scoring — not post-hoc transcript analysis. Evidence is a first-class output, not a byproduct.

Despite the growing adoption of interactive oral assessment platforms, no existing specification provides these properties. Assessment designers think in terms of rubric criteria, evidence targets, and interaction patterns. Runtime engineers think in terms of nodes, edges, and state transitions. These two communities do not speak the same language.

Existing assessment standards (QTI, xAPI, IMS Caliper) were designed for machine-graded written assessments — they cannot express the runtime behavior of an interactive oral exam. Existing dialogue management formalisms (state machines, workflow DSLs) lack assessment-specific concepts: evidence targets, candidate commands, completion policies, scaffolding budgets, and moderation workflows. The result is that every IOA platform must invent its own ad-hoc specification, its own evidence model, and its own policy rules — with no interoperability, no formal validation, and no shared vocabulary.

The Interactive Oral Assessment Ontology and Reference Model (IOA-ORM) is a design science artifact that formalizes the core concepts, relationships, system responsibilities, evidence semantics, runtime policies, and governance boundaries of interactive oral assessment systems. Its machine-processable manifestation is the Interactive Oral Assessment Executable Specification, represented by the ExamRuntimePackage. In the engineering pipeline, this package functions as an intermediate representation between authoring tools, runtime controllers, execution adapters, and marking systems.

The IOA-ORM has four complementary roles:

  1. Domain ontology — it defines the core vocabulary and semantics of interactive oral assessment, including evidence targets, evidence signals, candidate commands, assessment profiles, completion policies, moderation policies, runtime events, and agent boundaries.

  2. Reference model — it defines the reusable system abstraction for IOA platforms, including authoring tools, executable specification packages, runtime controllers, voice runtimes, event stores, evidence ledgers, marking runtimes, and moderation workflows.

  3. Executable specification — it provides a machine-processable, versioned package that encodes exam structure, policies, evidence requirements, runtime semantics, validation constraints, and audit requirements.

  4. Intermediate representation — within the engineering pipeline, the executable specification acts as an intermediate representation between authoring tools, runtime engines, policy enforcement layers, and marking systems.

Note: We use “ontology-grounded” rather than simply “ontology” because the artifact defines a shared vocabulary and formal semantics grounded in assessment theory, but does not currently provide OWL/RDF axioms or description-logic reasoning. The term acknowledges the ontological contribution without over-claiming a full semantic-web implementation.

The canonical package produced by this artifact is the ExamRuntimePackage — a published, versioned, machine-readable specification of an oral assessment. The artifact is not tied to any specific platform. Kōrero is one consumer; any system that conducts interactive oral assessments could adopt this as its canonical exam specification.

This artifact is organized into four layers:

┌─────────────────────────────────────────────────────────────┐
│  Domain Ontology — shared vocabulary and semantics          │
│  (AssessmentProfile, EvidenceTarget, CandidateCommand, …)   │
├─────────────────────────────────────────────────────────────┤
│  Reference Model — reusable system abstraction              │
│  (Authoring → IR → Runtime → Evidence → Marking → Audit)    │
├─────────────────────────────────────────────────────────────┤
│  Executable Specification — machine-readable package        │
│  (ExamRuntimePackage, schema, validation rules)             │
├─────────────────────────────────────────────────────────────┤
│  Intermediate Representation — engineering pipeline role    │
│  (Authoring Model → ExamRuntimePackage → Runtime Config)    │
└─────────────────────────────────────────────────────────────┘

Following Design Science Research (March & Smith, 1995; Gregor & Hevner, 2013), this artifact contributes at multiple levels:

| Artifact Component | DSR Artifact Type | IOA-ORM Layer |
| --- | --- | --- |
| EvidenceTarget, EvidenceSignal, CandidateCommand, AssessmentProfile, RuntimeEvent | Constructs | IOA Domain Ontology |
| ExamRuntimePackage, object model, architecture, component relationships | Model | IOA Reference Model |
| Validation rules, transition rules, policy evaluation, recovery procedures, compilation mappings | Method | Specification and Validation Method |
| Kōrero implementation, runtime adapter, controller, evidence ledger integration | Instantiation | Platform Instantiation |

An oral assessment is not a chatbot conversation. It has structural requirements that generic dialogue systems cannot express:

Assessment structure must be enforceable. An exam has a defined sequence of sections, each with time budgets, completion criteria, and transition rules. These are hard constraints — not suggestions to the AI. A runtime controller must enforce them deterministically, regardless of what the generative model produces.

Evidence must be captured during the exam, not derived after. When a candidate demonstrates competence (or fails to), the system must record structured evidence in real time — not rely on post-hoc transcript analysis. A transcript shows what was said; an evidence ledger records what was demonstrated.

The AI examiner must be bounded. An AI examiner needs creative freedom to generate natural follow-ups, handle unexpected responses, and adapt to candidate behavior. But it must not skip exam sections, reveal rubric criteria, score candidates directly, or ignore candidate commands (e.g., “can you repeat that?”). Autonomy must exist within explicit boundaries.
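
The boundary requirement above can be sketched as a check that runs before any examiner action takes effect. This is a minimal illustration, not the specification's action catalog: the action identifiers, the `BoundaryEvent` shape, and the `enforceBoundary` helper are all hypothetical names chosen to mirror the forbidden behaviours listed in the paragraph.

```typescript
// Hypothetical identifiers: the text lists the forbidden behaviours
// but does not name them, so these strings are assumptions.
type ExaminerAction =
  | "ask_follow_up"
  | "bridge_to_next_node"
  | "skip_section"
  | "reveal_rubric"
  | "assign_score"
  | "ignore_command";

const FORBIDDEN_ACTIONS: ExaminerAction[] = [
  "skip_section",
  "reveal_rubric",
  "assign_score",
  "ignore_command",
];

// Assumed event shape for a logged violation.
interface BoundaryEvent {
  type: "policy_violation";
  action: ExaminerAction;
  nodeId: string;
}

// Returns true if the action may proceed; otherwise logs a violation and returns false.
function enforceBoundary(
  action: ExaminerAction,
  nodeId: string,
  eventLog: BoundaryEvent[]
): boolean {
  if (FORBIDDEN_ACTIONS.indexOf(action) !== -1) {
    eventLog.push({ type: "policy_violation", action, nodeId });
    return false;
  }
  return true;
}

const boundaryLog: BoundaryEvent[] = [];
const followUpOk = enforceBoundary("ask_follow_up", "n1", boundaryLog);
const scoringOk = enforceBoundary("assign_score", "n1", boundaryLog);
```

In this sketch the rejection and the audit record are produced in the same step, so a forbidden action cannot be blocked without leaving a trace.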

Assessment properties must be inspectable. An exam that claims to assess “interpersonal competence through structured dialogue” (Joughin’s interaction dimension) should have that claim encoded in its specification — not buried in code. The runtime should be able to verify that the exam actually operates as designed.

Fairness and moderation must be built in. At scale, AI-conducted exams need human moderation workflows, calibration profiles, and fairness auditing across demographic dimensions. These cannot be afterthoughts — they must be first-class properties of the exam specification.

This specification addresses the following gaps in current practice:

| Gap | What This Specification Provides |
| --- | --- |
| No formal specification for AI-conducted oral assessments | A versioned, machine-readable exam specification with 26 schema sections covering structure, policies, evidence, and runtime behavior |
| Assessment theory disconnected from runtime execution | AssessmentProfile encoding Joughin’s (1998) six dimensions as first-class runtime parameters |
| No structured evidence capture during live exams | EvidenceLedger with real-time EvidenceSignal emission, provenance tracking, and confidence scoring |
| AI examiner behavior not formally bounded | Three-layer agent boundary model with allowed/forbidden action catalog and runtime enforcement |
| No compile-time validation of exam designs | 117 validation rules across 10 categories, checking structural, semantic, and assessment-theoretic consistency |
| Candidate commands not consumed by runtime | CandidateCommand as runtime primitives (repeat, clarification, pause, raise-hand) with processing rules |
| No event contract for downstream consumers | Typed event protocol with 20+ event types, delivery guarantees, and audit trail |
| Exams not versioned or diffable | Dual versioning scheme (schema version + assessment-theoretic version) with published package immutability |
| Recovery from anomalies not standardized | RecoveryPolicy with categorized strategies for silence, unclear answers, off-topic responses, anxiety, and technical failures |
| Moderation and fairness not built into exam spec | ModerationPolicy, CalibrationProfile, and FairnessAudit as first-class constructs |

The AI examiner MUST be autonomous within a bounded creative space:

  • Follow-up generation. Given a candidate response, the examiner SHOULD generate natural, contextually appropriate follow-ups — but MUST respect maxFollowUps and forbiddenFollowUpPatterns from the runtime policy.
  • Evidence judgment. The examiner MAY assess whether a candidate response satisfies an EvidenceTarget, producing an EvidenceSignal — but MUST NOT override explicit rubric thresholds or fabricate signals.
  • Repair and recovery. The examiner SHOULD handle silence, unclear answers, off-topic responses, and candidate anxiety with natural language repair — but MUST follow the prescribed RecoveryPolicy sequence, not invent ad-hoc interventions.
  • Bridging. The examiner MAY generate natural transitions between nodes — but MUST NOT skip nodes, reorder the exam structure, or jump to topics not defined in the graph.

Autonomy lives inside nodes, bounded by policies. The runtime controller enforces boundaries at node entry, during turns, and at transitions.
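
The follow-up constraint in the list above can be illustrated with a small sketch. Only `maxFollowUps` and `forbiddenFollowUpPatterns` come from the text; the `FollowUpPolicy` shape, the `checkFollowUp` helper, and the decision to treat patterns as regular-expression sources are assumptions made for illustration.

```typescript
// Illustrative only: maxFollowUps and forbiddenFollowUpPatterns are named in the text;
// the types, helper name, and regex semantics are assumed.
interface FollowUpPolicy {
  maxFollowUps: number;
  forbiddenFollowUpPatterns: string[]; // assumed: regular-expression sources
}

type FollowUpDecision =
  | { allowed: true }
  | { allowed: false; reason: "cap_reached" | "forbidden_pattern" };

function checkFollowUp(
  policy: FollowUpPolicy,
  followUpsUsed: number,
  proposedText: string
): FollowUpDecision {
  // The cap is checked first, so a capped node rejects any proposed text.
  if (followUpsUsed >= policy.maxFollowUps) {
    return { allowed: false, reason: "cap_reached" };
  }
  const violates = policy.forbiddenFollowUpPatterns.some((source) =>
    new RegExp(source, "i").test(proposedText)
  );
  return violates
    ? { allowed: false, reason: "forbidden_pattern" }
    : { allowed: true };
}

const policy: FollowUpPolicy = {
  maxFollowUps: 2,
  forbiddenFollowUpPatterns: ["rubric", "your (score|mark)"],
};
const first = checkFollowUp(policy, 0, "Why does that step require light?");
const leaked = checkFollowUp(policy, 1, "The rubric expects you to mention ATP.");
const capped = checkFollowUp(policy, 2, "Can you expand on that?");
```

The examiner still writes the follow-up text; the check only decides whether that text may be spoken.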

The runtime controller is the policy enforcement layer between the AI examiner’s generative freedom and the exam’s structural integrity. It MUST:

  1. Gate node transitions. No transition occurs without evaluating the CompletionPolicy of the current node and the TransitionPolicy of the target edge.
  2. Count and cap follow-ups. Every follow-up increments a counter. When maxFollowUps is reached, the controller forces transition — not the LLM.
  3. Enforce time budgets. Per-node and global time limits are hard constraints. The controller MUST force-transition or terminate when budgets expire.
  4. Consume candidate commands. Repeat, clarification, raise-hand, pause — these are runtime primitives, not UI decorations. The controller MUST process them and inject appropriate responses.
  5. Persist the evidence ledger. Every EvidenceSignal produced during the exam MUST be written to the ledger before the exam can complete. The ledger is not a transcript byproduct — it is a first-class output.
  6. Emit structured events. Every state change (node entered, turn completed, evidence collected, command processed, policy violation) MUST produce a RuntimeEvent for the event store.
  7. Enforce the agent boundary. The controller MUST reject any examiner action that violates AllowedAction / ForbiddenAction policies, logging the violation as an event.
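
A minimal sketch of how the first three responsibilities might combine into a single transition decision follows. `maxFollowUps` is named in the text; `timeBudgetSeconds`, `requiredTargetIds` (a stand-in for a CompletionPolicy), the type names, and the order in which the checks run are assumptions, since the text does not state a priority between them.

```typescript
// Illustrative sketch. Field and type names other than maxFollowUps are assumed.
interface NodePolicy {
  maxFollowUps: number;
  timeBudgetSeconds: number;
  requiredTargetIds: string[]; // stand-in for a CompletionPolicy
}

interface NodeState {
  followUpsUsed: number;
  elapsedSeconds: number;
  satisfiedTargetIds: string[];
}

type TransitionDecision =
  | { action: "stay" }
  | { action: "advance" }
  | { action: "force_advance"; cause: "time_budget" | "follow_up_cap" };

function decideTransition(policy: NodePolicy, state: NodeState): TransitionDecision {
  // Hard constraint: an expired time budget forces the transition.
  if (state.elapsedSeconds >= policy.timeBudgetSeconds) {
    return { action: "force_advance", cause: "time_budget" };
  }
  // Completion: every required evidence target has been satisfied.
  const complete = policy.requiredTargetIds.every(
    (id) => state.satisfiedTargetIds.indexOf(id) !== -1
  );
  if (complete) {
    return { action: "advance" };
  }
  // Follow-up cap: the controller, not the LLM, ends the node.
  if (state.followUpsUsed >= policy.maxFollowUps) {
    return { action: "force_advance", cause: "follow_up_cap" };
  }
  return { action: "stay" };
}

const nodePolicy: NodePolicy = {
  maxFollowUps: 3,
  timeBudgetSeconds: 300,
  requiredTargetIds: ["photosynthesis-mechanism"],
};
const midNode = decideTransition(nodePolicy, {
  followUpsUsed: 1, elapsedSeconds: 90, satisfiedTargetIds: [],
});
const completed = decideTransition(nodePolicy, {
  followUpsUsed: 1, elapsedSeconds: 90, satisfiedTargetIds: ["photosynthesis-mechanism"],
});
const cappedNode = decideTransition(nodePolicy, {
  followUpsUsed: 3, elapsedSeconds: 120, satisfiedTargetIds: [],
});
const timedOut = decideTransition(nodePolicy, {
  followUpsUsed: 0, elapsedSeconds: 300, satisfiedTargetIds: [],
});
```

Because the function is pure, the same inputs always produce the same decision, which makes it straightforward to replay against a recorded event log.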

A transcript records what was said. An oral assessment requires recording what was demonstrated. The gap:

  • A transcript shows “Candidate discussed photosynthesis for 3 minutes.” The evidence ledger records: EvidenceSignal { targetId: "photosynthesis-mechanism", confidence: 0.85, source: "ai-judgment", turnRange: [12, 15] }.
  • A transcript cannot distinguish between a candidate who gave one brilliant answer and one who needed five follow-ups to reach the same conclusion. The runtime state (follow-up count, recovery attempts) carries assessment-critical information.
  • A transcript is flat. The exam structure (which node, which rubric criterion, which time budget) is lost without the runtime context.

Transcript is necessary but insufficient. The evidence ledger, runtime state, and event log together form the complete marking input.
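
To make the contrast concrete, here is a hedged sketch of the two records a transcript cannot supply. The `EvidenceSignal` field names follow the example above; the field types, the `NodeRuntimeSummary` record, the node identifier `n1`, and the well-formedness rules are assumptions.

```typescript
// Field names follow the EvidenceSignal example in the text; types and rules are assumed.
interface EvidenceSignal {
  targetId: string;
  confidence: number; // assumed to lie in [0, 1]
  source: string; // e.g. "ai-judgment"
  turnRange: [number, number]; // assumed inclusive [firstTurn, lastTurn]
}

// Hypothetical per-node runtime state that the transcript alone does not carry.
interface NodeRuntimeSummary {
  nodeId: string;
  followUpCount: number;
  recoveryAttempts: number;
}

// Returns a list of problems; an empty list means the signal is well formed.
function signalProblems(signal: EvidenceSignal): string[] {
  const problems: string[] = [];
  if (signal.targetId.length === 0) {
    problems.push("targetId is empty");
  }
  if (!(signal.confidence >= 0 && signal.confidence <= 1)) {
    problems.push("confidence outside [0, 1]");
  }
  const [firstTurn, lastTurn] = signal.turnRange;
  if (firstTurn > lastTurn) {
    problems.push("turnRange is reversed");
  }
  return problems;
}

const exampleSignal: EvidenceSignal = {
  targetId: "photosynthesis-mechanism",
  confidence: 0.85,
  source: "ai-judgment",
  turnRange: [12, 15],
};

// Two candidates can end with the same signal but reach it by different paths.
const unaided: NodeRuntimeSummary = { nodeId: "n1", followUpCount: 0, recoveryAttempts: 0 };
const scaffolded: NodeRuntimeSummary = { nodeId: "n1", followUpCount: 5, recoveryAttempts: 1 };
const pathsDiffer = unaided.followUpCount !== scaffolded.followUpCount;
```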

Why the Evidence Ledger Should Be First-Class


The evidence ledger is not a post-processing step over the transcript. It is a structured, real-time, authoritative record of assessment evidence:

  • Signals are emitted during the exam, not derived after. The AI examiner produces EvidenceSignal objects as it judges candidate responses. These are written to the ledger immediately.
  • Signals carry provenance. Each signal records whether it came from AI judgment, explicit rubric match, candidate self-report, or external trigger.
  • Signals are linked to structure. Each signal references an EvidenceTarget defined in the exam specification, connecting evidence to rubric criteria.
  • The marking runtime reads the ledger, not the transcript. The marking pipeline consumes structured signals with confidence scores and turn references — not raw STT output.

When the evidence ledger is first-class, the marking pipeline becomes deterministic, auditable, and separable from the conversational runtime.
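
The properties above can be sketched as an append-only ledger with a marking step that reads only from it. This is an illustration under stated assumptions: `"ai-judgment"` appears in the text, but the other three source labels, the `LedgerSignal` and `EvidenceLedger` shapes, and the "highest confidence per target" aggregation are hypothetical and are not the specification's marking logic.

```typescript
// "ai-judgment" is from the text; the other labels are hypothetical names
// for the provenance kinds the text lists.
type SignalSource =
  | "ai-judgment"
  | "rubric-match"
  | "candidate-self-report"
  | "external-trigger";

interface LedgerSignal {
  targetId: string; // references an EvidenceTarget in the exam specification
  confidence: number;
  source: SignalSource;
  turnRange: [number, number];
}

class EvidenceLedger {
  private readonly entries: LedgerSignal[] = [];

  // Signals are appended as they are emitted; a shallow copy is stored.
  append(signal: LedgerSignal): void {
    this.entries.push({ ...signal });
  }

  forTarget(targetId: string): LedgerSignal[] {
    return this.entries.filter((s) => s.targetId === targetId);
  }
}

// An assumed marking step that reads only the ledger, never the transcript.
function strongestConfidence(ledger: EvidenceLedger, targetId: string): number | undefined {
  const signals = ledger.forTarget(targetId);
  if (signals.length === 0) {
    return undefined;
  }
  return signals.reduce((best, s) => Math.max(best, s.confidence), 0);
}

const ledger = new EvidenceLedger();
ledger.append({ targetId: "photosynthesis-mechanism", confidence: 0.6, source: "ai-judgment", turnRange: [12, 13] });
ledger.append({ targetId: "photosynthesis-mechanism", confidence: 0.85, source: "ai-judgment", turnRange: [14, 15] });
const strongest = strongestConfidence(ledger, "photosynthesis-mechanism");
const missing = strongestConfidence(ledger, "light-reactions");
```

Returning `undefined` for a target with no signals keeps "no evidence" distinct from "evidence with zero confidence".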


This specification is grounded in the oral assessment literature. Its design decisions are informed by five key works:

| Paper | Key Insight | Design Impact |
| --- | --- | --- |
| Joughin (1998) | Six dimensions of oral assessment: content type, interaction, authenticity, structure, examiners, orality | AssessmentProfile on ExamRuntimePackage |
| Akimov & Malin (2020) | Validity/reliability/fairness matrix. Recording + moderation for reliability. Question banking for inter-case reliability. | ModerationPolicy, QuestionPool, CalibrationProfile |
| Bayley et al. (2024) | ConVOE model for 600+ students: parallel administration, batch grading, practice sessions. | expectedCandidateCount, QuestionPool.allowReuseAcrossConcurrentSessions |
| Fenton (2025) | IOA components. Prompting taxonomy. Formative vs. summative. Examiner training. Communication skills. | PromptingLevel, assessmentPurpose, scaffoldingBudget, identity_check node |
| Bloom (1956) | Six cognitive levels: Remember → Understand → Apply → Analyze → Evaluate → Create. AI struggles at higher levels. | BloomLevel on EvidenceTarget; cognitiveEscalationStrategy on FollowUpPolicy |

The inclusion of Bloom’s Taxonomy as a design parameter addresses a key argument for AI-era oral assessment: generative AI tools perform well at the lower levels of Bloom’s taxonomy (Remember, Understand) but struggle at the Create level and at making arguments built on theoretical frameworks (Fenton, 2025). By encoding cognitive levels on evidence targets, the specification enables validation that an exam tests the intended range of cognitive demands — and enables the AI examiner to escalate follow-up probing toward higher-order thinking.
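
As a rough illustration of the validation and escalation described above, the sketch below checks which intended cognitive levels no evidence target covers and finds the next level to probe. `BloomLevel` and `EvidenceTarget` are named in this document; the `bloomLevel` field name, the lowercase level strings, and both helper functions are assumptions.

```typescript
// Level order follows the six levels listed in this document; the string values are assumed.
const BLOOM_ORDER = ["remember", "understand", "apply", "analyze", "evaluate", "create"] as const;
type BloomLevel = (typeof BLOOM_ORDER)[number];

// Hypothetical minimal shape; the real EvidenceTarget carries more fields.
interface EvidenceTarget {
  id: string;
  bloomLevel: BloomLevel;
}

// Returns the intended levels that no evidence target covers.
function missingLevels(targets: EvidenceTarget[], intended: BloomLevel[]): BloomLevel[] {
  return intended.filter((level) => !targets.some((t) => t.bloomLevel === level));
}

// Returns the next level up for escalation, or undefined at the top level.
function nextLevel(level: BloomLevel): BloomLevel | undefined {
  const index = BLOOM_ORDER.indexOf(level);
  return index + 1 < BLOOM_ORDER.length ? BLOOM_ORDER[index + 1] : undefined;
}

const targets: EvidenceTarget[] = [
  { id: "photosynthesis-mechanism", bloomLevel: "understand" },
  { id: "experiment-design", bloomLevel: "create" },
];
const uncovered = missingLevels(targets, ["analyze", "evaluate", "create"]);
const escalateTo = nextLevel("understand");
const atTop = nextLevel("create");
```

Here an exam that intends to test Analyze, Evaluate, and Create but only has an Understand target and a Create target would be flagged for the two missing levels.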

The specification’s most important theoretical move is encoding Joughin’s six dimensions as the AssessmentProfile — a first-class property of the exam package. These dimensions are not metadata decorations; they are design parameters that constrain runtime behavior, inform evidence interpretation, and support validity arguments. An exam that declares interactionMode: "structured_dialogue" and structureProfile.opennessScore: 0.2 is making a claim about its reliability profile that the runtime can verify.
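
One way such a claim could be checked is sketched below. `interactionMode`, `structureProfile.opennessScore`, and the value `"structured_dialogue"` appear in the text; the second mode value, the `[0, 1]` range, and the `0.5` threshold are assumptions chosen only to make the check concrete.

```typescript
// Assumed partial shape: only two of the profile's dimensions are modelled here.
interface AssessmentProfile {
  interactionMode: "structured_dialogue" | "open_dialogue"; // "open_dialogue" is hypothetical
  structureProfile: { opennessScore: number };
}

// Assumed threshold above which a "structured" exam looks inconsistent.
const MAX_OPENNESS_FOR_STRUCTURED = 0.5;

// Returns a list of consistency issues; an empty list means none were found.
function profileIssues(profile: AssessmentProfile): string[] {
  const issues: string[] = [];
  const openness = profile.structureProfile.opennessScore;
  if (!(openness >= 0 && openness <= 1)) {
    issues.push("opennessScore must lie in [0, 1]");
  } else if (
    profile.interactionMode === "structured_dialogue" &&
    openness > MAX_OPENNESS_FOR_STRUCTURED
  ) {
    issues.push("structured_dialogue declared with a high opennessScore");
  }
  return issues;
}

const consistent = profileIssues({
  interactionMode: "structured_dialogue",
  structureProfile: { opennessScore: 0.2 },
});
const inconsistent = profileIssues({
  interactionMode: "structured_dialogue",
  structureProfile: { opennessScore: 0.8 },
});
```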

The evidence model separates collection from scoring (the AI examiner proposes signals; the marking runtime assigns marks). This addresses Akimov & Malin’s (2020) concern about intra-rater reliability: the AI’s judgments are recorded but not final — human moderation can override them.
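
The separation can be sketched as a resolution step that keeps the AI's proposal unless a moderator has overridden it. All names here (`ProposedSignal`, `ModeratorOverride`, `ResolvedSignal`, `resolveSignals`, and the identifiers in the example) are hypothetical; the text states only that AI judgments are recorded and that human moderation can override them.

```typescript
// Hypothetical shapes for illustration only.
interface ProposedSignal {
  signalId: string;
  targetId: string;
  confidence: number;
}

interface ModeratorOverride {
  signalId: string;
  confidence: number;
  moderatorId: string;
}

interface ResolvedSignal {
  targetId: string;
  confidence: number;
  decidedBy: "ai" | "moderator";
}

function resolveSignals(
  proposed: ProposedSignal[],
  overrides: ModeratorOverride[]
): ResolvedSignal[] {
  return proposed.map((signal): ResolvedSignal => {
    // Assumed rule: if a signal was overridden more than once, the last override wins.
    const matching = overrides.filter((o) => o.signalId === signal.signalId);
    const last = matching.length > 0 ? matching[matching.length - 1] : undefined;
    return last
      ? { targetId: signal.targetId, confidence: last.confidence, decidedBy: "moderator" }
      : { targetId: signal.targetId, confidence: signal.confidence, decidedBy: "ai" };
  });
}

const resolved = resolveSignals(
  [
    { signalId: "s1", targetId: "photosynthesis-mechanism", confidence: 0.85 },
    { signalId: "s2", targetId: "experiment-design", confidence: 0.4 },
  ],
  [{ signalId: "s2", confidence: 0.7, moderatorId: "m1" }]
);
```

The proposed signals are left untouched, so the original AI judgment remains available for audit alongside the moderated value.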


3. What This Artifact Is (Technical Summary)


The IOA-ORM is the canonical, versioned, executable specification of a published oral assessment. It is:

  • The single source of truth for an exam’s structure, policies, evidence targets, and runtime behavior.
  • A compilation target from the authoring studio’s high-level exam model.
  • A compilation source for the runtime controller configuration, the execution adapter, and the marking runtime configuration.
  • A versioned artifact with a stable identity, changelog, and diffability between versions.

| This artifact is NOT… | Because… |
| --- | --- |
| A UI schema | It does not describe frontend layout, styling, or component tree. The frontend consumes runtime events and state; it does not render the specification. |
| An execution engine config | The specification compiles to execution-specific configurations. It carries richer semantics (policies, evidence, constraints) that execution engines typically cannot express. |
| A prompt template | The AI examiner’s system prompt is derived from this specification at runtime. The specification defines what the examiner must do; the prompt describes how to speak. |
| A marking rubric | Rubric criteria inform EvidenceTarget definitions, but the specification is the runtime executable spec, not the scoring model. |
| A chatbot workflow | Generic dialogue graphs lack assessment-specific concepts: evidence targets, candidate commands, completion policies, time budgets, recovery strategies. |

| # | Goal | Description |
| --- | --- | --- |
| G1 | Authoring-friendly | The IR is a natural compilation target from the authoring studio’s exam flow model. No manual authoring SHOULD be required. |
| G2 | Runtime-controllable | Hard constraints on node progression, follow-ups, transitions, time budgets, candidate commands, and evidence capture. Policies are machine-enforceable. |
| G3 | Agentic but bounded | The AI examiner has creative freedom inside nodes, but policies, guardrails, and the runtime controller enforce structural boundaries. |
| G4 | Observable and auditable | Every significant state change produces a structured RuntimeEvent. The event log is the audit trail. |
| G5 | Marking-ready | The evidence ledger provides structured, linked, confidence-scored signals to the marking runtime, not raw transcript. |
| G6 | Execution-agnostic | The IR compiles to execution-specific configurations. It is not tied to any particular runtime engine or voice pipeline. |
| G7 | Versioned and diffable | Each published exam has a stable specification version. Changes between versions are inspectable. |
| G8 | Assessment-theoretically grounded | The specification encodes Joughin’s (1998) six dimensions as design parameters. Design decisions are traceable to the assessment theory knowledge base. |
| G9 | Validity-aware | The specification supports structured validity claims (face, content, construct, concurrent), moderation workflows, and calibration profiles. |
| G10 | Fairness-auditable | The specification supports fairness auditing across demographic dimensions. The evidence model captures enough data for post-hoc disparity analysis. |

| # | Non-Goal | Rationale |
| --- | --- | --- |
| NG1 | Replace the execution engine | The runtime handles the real-time voice pipeline (STT, LLM, TTS). The specification is the domain spec; the runtime is the execution engine. |
| NG2 | Define UI components | The frontend is a consumer of runtime events, not a renderer of the specification. |
| NG3 | Define scoring algorithms | The specification provides evidence; scoring logic lives in the marking runtime. |
| NG4 | Support non-oral assessments | This specification is designed for interactive oral exams. Written, MCQ, or portfolio assessments have different runtime semantics. |
| NG5 | Replace the authoring studio | The authoring studio is the human-facing tool. The specification is the machine-facing spec it produces. |
| NG6 | Define session management at scale | The specification defines exam structure; session orchestration, batch processing, and cohort management are platform/runtime concerns. |
| NG7 | Define signal processing pipelines | Paralinguistic analysis (prosody, speaking rate, pitch) is a runtime/STT concern, not a specification concern. The specification captures assessment-level semantics, not acoustic features. |

┌─────────────────────────────────────────────────────────────────┐
│                     AUTHORING STUDIO                            │
│  Lecturers design exam flows, define rubrics, set policies      │
│                                                                 │
│  Exam Flow Model ──compile──► ExamRuntimePackage (IR)           │
└──────────────────────────────┬──────────────────────────────────┘

                    ┌──────────▼──────────┐
                    │  IOA-ORM            │
                    │  (canonical spec)   │
                    │  versioned, stable  │
                    └──┬──────┬──────┬────┘
                       │      │      │
          ┌────────────▼┐  ┌─▼──────▼──────────────┐
          │  Execution   │  │  Runtime Controller    │
          │  Adapter     │  │  (policy enforcement)  │
          │              │  │                        │
          │ Compiles IR  │  │ Enforces: transitions, │
          │ to engine    │  │ follow-up caps, time,  │
          │ config +     │  │ commands, evidence     │
          │ node config  │  │ writes, agent boundary │
          └──────┬───────┘  └────────┬───────────────┘
                 │                   │
    ┌────────────▼───────────────────▼────────────┐
    │           REAL-TIME VOICE RUNTIME            │
    │  STT · LLM · TTS · Voice Pipeline           │
    │  (e.g., Pipecat + LiveKit, or equivalent)   │
    └────────────┬───────────────────┬────────────┘
                 │                   │
    ┌────────────▼────────┐  ┌──────▼──────────────┐
    │  Event Store        │  │  Evidence Ledger     │
    │  (RuntimeEvents)    │  │  (EvidenceSignals)   │
    │                     │  │                      │
    │  Audit trail,       │  │  Structured evidence │
    │  analytics, replay  │  │  for marking runtime │
    └─────────────────────┘  └──────────────────────┘

                              ┌───────▼───────────┐
                              │  Marking Runtime   │
                              │  (reads ledger)    │
                              │  produces scores   │
                              └───────────────────┘

| Component | Reads From | Writes To | Role |
| --- | --- | --- | --- |
| Authoring Studio | Lecturer input | ExamRuntimePackage | Human-facing design tool |
| IOA-ORM | | | Canonical versioned spec |
| Runtime Controller | ExamRuntimePackage, RuntimeState | RuntimeState, EventStore, EvidenceLedger | Policy enforcement engine |
| Execution Adapter | ExamRuntimePackage | Engine-specific config, node config | IR → execution config compiler |
| Voice Runtime | Engine config, node config | Transcript, audio | Real-time voice pipeline |
| Event Store | RuntimeEvent stream | Persisted event log | Audit, analytics, replay |
| Evidence Ledger | EvidenceSignal stream | Persisted evidence records | Marking input |
| Marking Runtime | EvidenceLedger, Transcript | Scores, reports | Assessment outcomes |
| Frontend Exam Room | EventStore, RuntimeState (via data channel) | CandidateCommands | Candidate interface |

| Document | Content |
| --- | --- |
| 00-overview.md | This file. Purpose, gaps addressed, theoretical grounding, architecture. |
| 01-concepts.md | Theoretical foundations, glossary, domain entities, conceptual object model. |
| 02-schema.md | TypeScript interfaces for all core objects (26 sections). |
| 03-runtime-semantics.md | State machine, transition rules, policy evaluation. |
| 04-agent-boundary.md | Allowed/forbidden actions, guardrail enforcement. |
| 05-event-protocol.md | Event types, payloads, delivery guarantees. |
| 06-evidence-ledger.md | Signal lifecycle, ledger schema, marking integration. |
| 07-pipecat-adapter.md | Compilation rules, FlowManager mapping, limitations. |
| 08-validation-rules.md | Compile-time validation of specification packages. |
| 09-versioning.md | Version scheme, migration, compatibility. |
| 10-examples.md | Complete worked examples. |
| 11-migration-plan.md | Incremental migration from existing runtime configs. |
| 12-testing-strategy.md | Unit, integration, simulation testing. |
| 13-open-questions.md | Unresolved design decisions. |

| Version | Date | Changes |
| --- | --- | --- |
| v0.2.0 | 2026-06-30 | Reframed as IOA-ORM. Added 4-layer artifact model, DSR contribution table, formal definition. Replaced ‘Exam Runtime IR’ with ‘IOA-ORM’. Added IOA-centric framing (implementation-agnostic). Added Bloom’s Taxonomy rationale. |
| v0.1.0 | 2026-05-06 | Initial release. |