Design Brief
Oral examinations are a proven method for assessing competencies that written tests cannot measure — reasoning under pressure, defending a position, responding to probing questions, and demonstrating live interaction skills. Sotiriadou et al. (2020) define this as the “interactive oral” — a form of assessment where students perform real-world tasks to demonstrate meaningful application of knowledge and skills.
The need for interactive oral assessment (IOA) is growing. Academic integrity concerns, professional accreditation requirements, and the limitations of written assessment in the AI era all drive demand for IOA at scale. But IOA has historically been limited by the need for one-on-one human examiner time.
This creates a need for systematic, machine-executable oral assessment — where the exam’s structure, evidence capture, and policy enforcement are formally specified and executed by a runtime system. That runtime could be a human examiner following a script, a rule-based machine, or an AI-powered voice agent.
Kōrero (korero.thesteder.com) is one such platform — an AI-powered system where lecturers design interactive exam flows that an AI examiner conducts with candidates via real-time voice. But Kōrero is one instantiation of a broader pattern. The core problem is general: any system that executes interactive oral assessments faces the same fundamental challenge — bridging assessment design with runtime execution.
The Problem
Current approaches to specifying how an oral exam should behave fall into two categories:
- Application-specific runtime configs. The exam’s behavior is defined in a format specific to the runtime engine (e.g., a conversational flow graph). This works but creates a tight coupling between the exam specification and the execution platform. The config carries no assessment-theoretic semantics: it describes how conversations flow, not what should be assessed and how. Assessment properties like “this exam uses structured dialogue to assess interpersonal competence” exist only as implicit assumptions in code.
- Hard-coded logic. Exam behavior is embedded directly in application code. This is opaque, non-portable, and impossible to validate at compile time. Every new exam requires reimplementation. There is no separation between the exam specification and the execution engine.
Neither approach provides a formal, portable, versioned specification that:
- Encodes assessment-theoretic properties (validity, reliability, fairness) as first-class parameters
- Separates what the exam assesses from how the runtime executes it
- Enables compile-time validation of exam designs
- Supports versioning, diffing, and auditability
- Defines structured evidence capture during live exams
- Formally bounds AI examiner behavior with enforceable policies
What We Need
An Interactive Oral Assessment Ontology and Reference Model (IOA-ORM): a domain-specific specification that sits between the authoring tool and the runtime engine. The specification should be:
- The single source of truth for a published oral assessment’s structure, policies, evidence targets, and runtime behavior
- A compilation target from the authoring studio — lecturers design exams in human-friendly terms; the specification encodes those designs as machine-readable, versioned artifacts
- A compilation source for the runtime execution engine, the policy enforcement layer, and the marking pipeline
- Platform-agnostic — not tied to any specific runtime engine or voice pipeline
- Assessment-theoretically grounded — encoding established oral assessment theory (Joughin, 1998; Akimov & Malin, 2020; Bayley et al., 2024; Fenton, 2025) as executable design parameters
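To make the idea of a machine-readable, compile-time-checkable artifact concrete, here is a minimal TypeScript sketch. The `ExamNode` and `ExamSpec` shapes, the `validateSpec` function, and the specific checks are assumptions made for illustration; they are not the IOA-ORM schema or its validation rules.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface ExamNode {
  id: string;
  next: string[];        // ids of nodes this node may transition to
  maxFollowUps: number;  // hard cap on examiner follow-ups in this node
  timeBudgetSec: number; // hard time budget for this node
}

interface ExamSpec {
  specVersion: string;
  startNodeId: string;
  nodes: ExamNode[];
}

// Returns human-readable problems; an empty list means the spec passed.
function validateSpec(spec: ExamSpec): string[] {
  const problems: string[] = [];
  const byId: Record<string, ExamNode | undefined> = Object.create(null);

  for (const node of spec.nodes) {
    if (byId[node.id] !== undefined) problems.push(`duplicate node id: ${node.id}`);
    byId[node.id] = node;
  }
  const startExists = byId[spec.startNodeId] !== undefined;
  if (!startExists) problems.push(`start node not found: ${spec.startNodeId}`);

  for (const node of spec.nodes) {
    if (node.timeBudgetSec <= 0) problems.push(`${node.id}: time budget must be positive`);
    if (node.maxFollowUps < 0) problems.push(`${node.id}: maxFollowUps must not be negative`);
    for (const target of node.next) {
      if (byId[target] === undefined) {
        problems.push(`${node.id}: unknown transition target ${target}`);
      }
    }
  }

  // Reachability: walk the transition graph from the start node.
  const seen: Record<string, boolean> = Object.create(null);
  const stack: string[] = startExists ? [spec.startNodeId] : [];
  while (stack.length > 0) {
    const id = stack.pop();
    if (id === undefined || seen[id]) continue;
    seen[id] = true;
    const node = byId[id];
    for (const target of node ? node.next : []) stack.push(target);
  }
  for (const node of spec.nodes) {
    if (!seen[node.id]) problems.push(`${node.id}: unreachable from start node`);
  }
  return problems;
}

// A deliberately broken draft: "case" points at a missing node, "orphan" is unreachable.
const draft: ExamSpec = {
  specVersion: "0.1.0",
  startNodeId: "intro",
  nodes: [
    { id: "intro", next: ["case"], maxFollowUps: 1, timeBudgetSec: 60 },
    { id: "case", next: ["wrap_up"], maxFollowUps: 3, timeBudgetSec: 300 },
    { id: "orphan", next: [], maxFollowUps: 0, timeBudgetSec: 30 },
  ],
};
const problems = validateSpec(draft);
```

In this sketch the authoring studio would refuse to publish `draft` because `problems` is non-empty. Returning a list rather than throwing on the first failure lets a lecturer see every issue in one pass.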
Kōrero as Initial Use Case
Kōrero serves as the initial instantiation of this specification. The current system compiles exam designs to a conversational flow configuration for a real-time voice pipeline. While functional, this approach has the limitations described above. IOA-ORM will become the canonical exam specification, with a compilation step producing the execution-specific configuration.
However, the specification is designed to be general. Any platform conducting interactive oral assessments, whether powered by AI voice agents, rule-based machines, or structured human examiner scripts, could adopt it as its exam specification format.
Design Goals
- Authoring-friendly: The specification is a natural compilation target from the authoring studio’s high-level exam model. Manual authoring of the specification SHOULD NOT be required.
- Runtime-controllable: Hard constraints on node progression, follow-ups, transitions, time budgets, candidate commands, and evidence capture. Policies are machine-enforceable.
- Agentic but bounded: The AI examiner has creative freedom inside nodes — but policies, guardrails, and the runtime controller enforce structural boundaries. The examiner CAN follow up naturally, judge evidence, handle repairs, and generate bridges. The examiner CANNOT skip sections, reveal rubrics, score directly, ignore commands, or bypass guardrails.
- Observable and auditable: Every significant state change produces a structured event. The event log is the audit trail.
- Marking-ready: The evidence ledger provides structured, linked, confidence-scored signals to the marking runtime — not raw transcript.
- Execution-agnostic: The specification compiles to execution-specific configurations. It is not tied to any particular runtime engine.
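The "agentic but bounded" and "observable and auditable" goals above can be sketched together in TypeScript. This is a minimal illustration under stated assumptions: the action names are derived from the CAN and CANNOT lists, while `RuntimeController.authorize`, `NodePolicy`, and `AuditEvent` are hypothetical names, not the specification's actual API.

```typescript
// Hypothetical action vocabulary derived from the CAN / CANNOT lists above.
type AllowedAction = "follow_up" | "judge_evidence" | "repair" | "bridge";
type ForbiddenAction =
  | "skip_section"
  | "reveal_rubric"
  | "score_directly"
  | "ignore_command"
  | "bypass_guardrail";
type ExaminerAction = AllowedAction | ForbiddenAction;

interface NodePolicy {
  nodeId: string;
  maxFollowUps: number; // hard constraint enforced by the controller
}

interface AuditEvent {
  seq: number;
  nodeId: string;
  action: ExaminerAction;
  decision: "allowed" | "denied";
  reason: string;
}

const ALLOWED: ExaminerAction[] = ["follow_up", "judge_evidence", "repair", "bridge"];

class RuntimeController {
  readonly events: AuditEvent[] = []; // the event log doubles as the audit trail
  private readonly policy: NodePolicy;
  private followUpsUsed = 0;

  constructor(policy: NodePolicy) {
    this.policy = policy;
  }

  // Decide whether the examiner may perform the action, and log the decision.
  authorize(action: ExaminerAction): boolean {
    let decision: "allowed" | "denied" = "allowed";
    let reason = "within node policy";

    if (ALLOWED.indexOf(action) === -1) {
      decision = "denied";
      reason = "action is outside the agent boundary";
    } else if (action === "follow_up") {
      if (this.followUpsUsed >= this.policy.maxFollowUps) {
        decision = "denied";
        reason = "follow-up budget exhausted";
      } else {
        this.followUpsUsed += 1;
      }
    }

    this.events.push({
      seq: this.events.length + 1,
      nodeId: this.policy.nodeId,
      action,
      decision,
      reason,
    });
    return decision === "allowed";
  }
}

const controller = new RuntimeController({ nodeId: "case_discussion", maxFollowUps: 1 });
const outcomes = [
  controller.authorize("follow_up"),     // true: first follow-up is within budget
  controller.authorize("follow_up"),     // false: budget of 1 is exhausted
  controller.authorize("reveal_rubric"), // false: forbidden regardless of budget
];
```

Denied requests are logged as well as allowed ones, so the audit trail records what the examiner attempted and not only what it did.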
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                        AUTHORING STUDIO                         │
│  Lecturers design exam flows, define rubrics, set policies      │
│                                                                 │
│  Exam Flow Model ──compile──► ExamRuntimePackage (IOA-ORM)      │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                    ┌──────────▼──────────┐
                    │       IOA-ORM       │
                    │  (canonical spec)   │
                    │  versioned, stable  │
                    └──┬──────┬──────┬────┘
                       │      │      │
          ┌────────────▼┐   ┌─▼──────▼──────────────┐
          │  Execution  │   │  Runtime Controller   │
          │   Adapter   │   │  (policy enforcement) │
          └──────┬──────┘   └────────┬──────────────┘
                 │                   │
    ┌────────────▼───────────────────▼────────────┐
    │           REAL-TIME VOICE RUNTIME           │
    └────────────┬───────────────────┬────────────┘
                 │                   │
    ┌────────────▼────────┐   ┌──────▼──────────────┐
    │     Event Store     │   │   Evidence Ledger   │
    └─────────────────────┘   └──────┬──────────────┘
                                     │
                             ┌───────▼───────────┐
                             │  Marking Runtime  │
                             └───────────────────┘
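The separation between the canonical specification and the execution adapter in the diagram can be sketched as one package compiled by interchangeable adapters. Everything here is an assumption for illustration: the `ExamRuntimePackage` fields are a heavily simplified stand-in, and `ExecutionAdapter`, `FlowGraph`, and both adapters are hypothetical names rather than the interfaces defined in the spec.

```typescript
// Hypothetical, heavily simplified stand-in for the ExamRuntimePackage.
interface PackageNode {
  id: string;
  prompt: string;
  next: string | null; // null marks the final node
  timeBudgetSec: number;
}

interface ExamRuntimePackage {
  specVersion: string;
  startNodeId: string;
  nodes: PackageNode[]; // assumed to be listed in exam order
}

// An adapter turns the canonical package into one execution-specific artifact.
interface ExecutionAdapter<TConfig> {
  readonly target: string;
  compile(pkg: ExamRuntimePackage): TConfig;
}

// Target 1: a conversational flow graph for a voice pipeline (shape is assumed).
interface FlowGraph {
  entry: string;
  states: Record<string, { say: string; goto: string | null; timeoutMs: number }>;
}

const flowGraphAdapter: ExecutionAdapter<FlowGraph> = {
  target: "flow-graph",
  compile(pkg) {
    const states: FlowGraph["states"] = {};
    for (const node of pkg.nodes) {
      states[node.id] = {
        say: node.prompt,
        goto: node.next,
        timeoutMs: node.timeBudgetSec * 1000,
      };
    }
    return { entry: pkg.startNodeId, states };
  },
};

// Target 2: a numbered script that a human examiner could follow.
const examinerScriptAdapter: ExecutionAdapter<string[]> = {
  target: "human-script",
  compile(pkg) {
    return pkg.nodes.map(
      (node, i) => `${i + 1}. [${node.timeBudgetSec}s] ${node.prompt}`
    );
  },
};

const pkg: ExamRuntimePackage = {
  specVersion: "0.1.0",
  startNodeId: "intro",
  nodes: [
    { id: "intro", prompt: "Please introduce your case.", next: "probe", timeBudgetSec: 60 },
    { id: "probe", prompt: "Why did you choose that intervention?", next: null, timeBudgetSec: 180 },
  ],
};

const graph = flowGraphAdapter.compile(pkg);
const script = examinerScriptAdapter.compile(pkg);
```

Both outputs derive from the same `pkg`, so the exam design is stated once and only the adapter changes per runtime.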
Output Structure
spec/
00-overview.md — Purpose, gaps, theoretical grounding, architecture
01-concepts.md — Domain model, glossary, theoretical foundations
02-schema.md — TypeScript interfaces (26 sections)
03-runtime-semantics.md — State machine, transitions, policy evaluation
04-agent-boundary.md — Allowed/forbidden actions, guardrail enforcement
05-event-protocol.md — Event types, payloads, delivery guarantees
06-evidence-ledger.md — Signal lifecycle, ledger schema, marking integration
07-pipecat-adapter.md — Compilation rules for Pipecat execution adapter
08-validation-rules.md — Compile-time validation (117 rules)
09-versioning.md — Version scheme, migration, compatibility
10-examples.md — Complete worked examples
11-migration-plan.md — Incremental migration path
12-testing-strategy.md — Psychometric and integration testing
13-open-questions.md — Unresolved design decisions
14-design-alternatives.md — Design space exploration (12 QOC decisions)
Research Grounding
The specification is grounded in four key works from the oral assessment literature:
| Paper | Key Insight | Design Impact |
|---|---|---|
| Joughin (1998) | Six dimensions of oral assessment: content type, interaction, authenticity, structure, examiners, orality | AssessmentProfile — encoding these dimensions as first-class runtime parameters |
| Akimov & Malin (2020) | Validity/reliability/fairness matrix. Recording + moderation for reliability. Question banking for inter-case reliability. | ModerationPolicy, QuestionPool, CalibrationProfile |
| Bayley et al. (2024) | ConVOE model for 600+ students: parallel administration, batch grading, practice sessions. | expectedCandidateCount, QuestionPool.allowReuseAcrossConcurrentSessions |
| Fenton (2025) | IOA components. Prompting taxonomy. Formative vs. summative. Examiner training. Communication skills. | PromptingLevel, assessmentPurpose, scaffoldingBudget, identity_check node |
Additionally, the design space exploration engages with dialogue management, agent workflow architectures, AI safety/governance frameworks, and real-time voice agent systems.
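As an illustration of how the dimensions in the table might become first-class parameters, here is a hedged TypeScript sketch of an `AssessmentProfile`. Only the type name and `assessmentPurpose` come from the table; the other field names and all value sets are placeholders chosen for illustration and should be checked against Joughin (1998) and the actual schema. The `diffProfiles` helper is likewise hypothetical.

```typescript
// Field names and value sets are illustrative assumptions, not the real schema.
interface AssessmentProfile {
  contentType:
    | "knowledge"
    | "applied_problem_solving"
    | "interpersonal_competence"
    | "intrapersonal_qualities";
  interaction: "presentation" | "dialogue";
  authenticity: "contextualised" | "decontextualised";
  structure: "closed" | "open";
  examiners: "self" | "peer" | "authority";
  orality: "purely_oral" | "oral_secondary";
  assessmentPurpose: "formative" | "summative";
}

// List the dimensions that changed between two versions of a profile.
function diffProfiles(before: AssessmentProfile, after: AssessmentProfile): string[] {
  const keys = Object.keys(before) as (keyof AssessmentProfile)[];
  return keys
    .filter((key) => before[key] !== after[key])
    .map((key) => `${key}: ${before[key]} -> ${after[key]}`);
}

const v1: AssessmentProfile = {
  contentType: "interpersonal_competence",
  interaction: "dialogue",
  authenticity: "contextualised",
  structure: "closed",
  examiners: "authority",
  orality: "purely_oral",
  assessmentPurpose: "formative",
};

// A later version opens up the structure and makes the exam summative.
const v2: AssessmentProfile = { ...v1, structure: "open", assessmentPurpose: "summative" };

const changes = diffProfiles(v1, v2);
```

Because each dimension is an explicit, typed field, a change such as moving from formative to summative shows up in a diff instead of remaining an implicit assumption in runtime code.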
Quality Requirements
- Use clear markdown structure
- Provide schemas, JSON examples, event protocols, migration plans — not just concepts
- Designs must be pragmatic and engineering-ready
- Suggest incremental migration, not full rewrite
- Focus on oral assessment / interactive oral exam, not generic chatbot workflow
- Use MUST / SHOULD / MAY / MUST NOT normative language for semantic contracts