Design Brief

Oral examinations are an established method for assessing competencies that written tests cannot measure: reasoning under pressure, defending a position, responding to probing questions, and demonstrating live interaction skills. Sotiriadou et al. (2020) describe the “interactive oral” as a form of assessment in which students perform real-world tasks to demonstrate meaningful application of knowledge and skills.

The need for interactive oral assessment (IOA) is growing. Academic integrity concerns, professional accreditation requirements, and the limitations of written assessment in the AI era all drive demand for IOA at scale. But IOA has historically been limited by the need for one-on-one human examiner time.

This creates a need for systematic, machine-executable oral assessment — where the exam’s structure, evidence capture, and policy enforcement are formally specified and executed by a runtime system. That runtime could be a human examiner following a script, a rule-based machine, or an AI-powered voice agent.

Kōrero (korero.thesteder.com) is one such platform — an AI-powered system where lecturers design interactive exam flows that an AI examiner conducts with candidates via real-time voice. But Kōrero is one instantiation of a broader pattern. The core problem is general: any system that executes interactive oral assessments faces the same fundamental challenge — bridging assessment design with runtime execution.

The Problem

Current approaches to specifying how an oral exam should behave fall into two categories:

Application-specific runtime configs. The exam’s behavior is defined in a format specific to the runtime engine (e.g., a conversational flow graph). This works but creates a tight coupling between the exam specification and the execution platform. The config carries no assessment-theoretic semantics — it describes how conversations flow, not what should be assessed and how. Assessment properties like “this exam uses structured dialogue to assess interpersonal competence” exist only as implicit assumptions in code.
Hard-coded logic. Exam behavior is embedded directly in application code. This is opaque, non-portable, and impossible to validate at compile time. Every new exam requires reimplementation. There is no separation between the exam specification and the execution engine.

Neither approach provides a formal, portable, versioned specification that:

Encodes assessment-theoretic properties (validity, reliability, fairness) as first-class parameters
Separates what the exam assesses from how the runtime executes it
Enables compile-time validation of exam designs
Supports versioning, diffing, and auditability
Defines structured evidence capture during live exams
Formally bounds AI examiner behavior with enforceable policies
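
To make the compile-time validation requirement concrete, here is a minimal sketch of checking a spec before publication. The `ExamSpec` and `ExamNode` shapes, their field names, and the three rules are hypothetical illustrations, not the IOA-ORM schema or its actual rule set.

```typescript
// Hypothetical, heavily simplified spec shape (an assumption for illustration).
interface ExamNode {
  id: string;
  next: string[];            // ids of nodes reachable from this node
  timeBudgetSeconds: number;
}

interface ExamSpec {
  version: string;
  entryNodeId: string;
  totalTimeBudgetSeconds: number;
  nodes: ExamNode[];
}

interface ValidationIssue {
  rule: string;
  message: string;
}

// Checks a spec before publication so that design errors surface at compile
// time rather than during a live exam.
function validateSpec(spec: ExamSpec): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  const hasNode = (id: string): boolean => spec.nodes.some((n) => n.id === id);

  // Rule: the entry node must be declared.
  if (!hasNode(spec.entryNodeId)) {
    issues.push({ rule: "entry-exists", message: `Unknown entry node "${spec.entryNodeId}"` });
  }

  // Rule: every transition must point at a declared node.
  for (const node of spec.nodes) {
    for (const target of node.next) {
      if (!hasNode(target)) {
        issues.push({
          rule: "transition-target",
          message: `Node "${node.id}" points to unknown node "${target}"`,
        });
      }
    }
  }

  // Rule: node budgets must fit the exam budget. This assumes every node is
  // visited; a branching exam would need a per-path check instead.
  const allocated = spec.nodes.reduce((sum, n) => sum + n.timeBudgetSeconds, 0);
  if (allocated > spec.totalTimeBudgetSeconds) {
    issues.push({
      rule: "time-budget",
      message: `Nodes allocate ${allocated}s but the exam allows ${spec.totalTimeBudgetSeconds}s`,
    });
  }
  return issues;
}
```

An empty result means the design passes these checks. Because the function only reads the spec, the same checks could run in the authoring studio, in CI, or when diffing two versions.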

What We Need

An Interactive Oral Assessment Ontology and Reference Model (IOA-ORM) — a domain-specific specification that sits between the authoring tool and the runtime engine. The specification should be:

The single source of truth for a published oral assessment’s structure, policies, evidence targets, and runtime behavior
A compilation target from the authoring studio — lecturers design exams in human-friendly terms; the specification encodes those designs as machine-readable, versioned artifacts
A compilation source for the runtime execution engine, the policy enforcement layer, and the marking pipeline
Platform-agnostic — not tied to any specific runtime engine or voice pipeline
Assessment-theoretically grounded — encoding established oral assessment theory (Joughin, 1998; Akimov & Malin, 2020; Bayley et al., 2024; Fenton, 2025) as executable design parameters
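
The two compilation steps described above (authoring model to specification, then specification to an engine-specific configuration) can be sketched as pure functions. Every name here (`AuthoringExam`, `SpecPackage`, `compileToSpec`, `ExecutionAdapter`, `flowGraphAdapter`) is a hypothetical stand-in, and the flow-graph target is an imaginary engine rather than any real runtime.

```typescript
// Hypothetical authoring-side model: what a lecturer edits (assumption).
interface AuthoringExam {
  title: string;
  questions: { prompt: string; maxFollowUps: number }[];
}

// Hypothetical, simplified spec package: the versioned, platform-neutral artifact.
interface SpecNode {
  id: string;
  prompt: string;
  maxFollowUps: number;
  next: string | null; // null marks the final node
}

interface SpecPackage {
  specVersion: string;
  title: string;
  nodes: SpecNode[];
}

// Step 1: authoring model -> specification (the spec as compilation target).
function compileToSpec(exam: AuthoringExam, specVersion: string): SpecPackage {
  const nodes: SpecNode[] = exam.questions.map((q, i) => ({
    id: `q${i + 1}`,
    prompt: q.prompt,
    maxFollowUps: q.maxFollowUps,
    next: i + 1 < exam.questions.length ? `q${i + 2}` : null,
  }));
  return { specVersion, title: exam.title, nodes };
}

// Step 2: specification -> engine-specific config (the spec as compilation source).
// Each runtime supplies its own adapter; the spec itself does not change.
interface ExecutionAdapter<Config> {
  target: string;
  compile(spec: SpecPackage): Config;
}

// Illustrative config for an imaginary flow-graph engine.
interface FlowGraphConfig {
  start: string | null;
  edges: [string, string][];
}

const flowGraphAdapter: ExecutionAdapter<FlowGraphConfig> = {
  target: "example-flow-graph",
  compile(spec: SpecPackage): FlowGraphConfig {
    const edges: [string, string][] = [];
    for (const node of spec.nodes) {
      if (node.next !== null) edges.push([node.id, node.next]);
    }
    const first = spec.nodes[0];
    return { start: first ? first.id : null, edges };
  },
};
```

Supporting a second runtime would then mean writing another `ExecutionAdapter`, leaving the published spec and the authoring studio unchanged.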

Kōrero as Initial Use Case

Kōrero serves as the initial instantiation of this specification. The current system compiles exam designs to a conversational flow configuration for a real-time voice pipeline. While functional, this approach has the limitations described above. IOA-ORM will become the canonical exam format, with a compilation step producing the execution-specific configuration.

However, the specification is designed to be general. Any platform conducting interactive oral assessments, whether powered by AI voice agents, rule-based machines, or structured human examiner scripts, could adopt it as its exam definition format.

Design Goals

Authoring-friendly: The specification is a natural compilation target from the authoring studio’s high-level exam model. Manual authoring of the specification SHOULD NOT be required.
Runtime-controllable: Hard constraints on node progression, follow-ups, transitions, time budgets, candidate commands, and evidence capture. Policies are machine-enforceable.
Agentic but bounded: The AI examiner has creative freedom inside nodes, but policies, guardrails, and the runtime controller enforce structural boundaries. The examiner MAY follow up naturally, judge evidence, handle repairs, and generate bridges. The examiner MUST NOT skip sections, reveal rubrics, score directly, ignore commands, or bypass guardrails.
Observable and auditable: Every significant state change produces a structured event. The event log is the audit trail.
Marking-ready: The evidence ledger provides structured, linked, confidence-scored signals to the marking runtime — not raw transcript.
Execution-agnostic: The specification compiles to execution-specific configurations. It is not tied to any particular runtime engine.
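
The "agentic but bounded" and "observable and auditable" goals can be illustrated together: a controller that decides whether each examiner action is permitted and logs every decision as a structured event. This is a minimal sketch under assumed names; the action identifiers, `NodePolicy`, `AuditEvent`, and `RuntimeController` are hypothetical and only mirror the permitted and prohibited behaviours listed above.

```typescript
// Action names paraphrase the design goal above; the identifiers are hypothetical.
type ExaminerAction =
  | "follow_up" | "judge_evidence" | "handle_repair" | "generate_bridge"  // permitted
  | "skip_section" | "reveal_rubric" | "score_directly"                   // prohibited
  | "ignore_command" | "bypass_guardrail";                                // prohibited

const FORBIDDEN: ExaminerAction[] = [
  "skip_section", "reveal_rubric", "score_directly", "ignore_command", "bypass_guardrail",
];

interface NodePolicy {
  nodeId: string;
  maxFollowUps: number; // hard cap enforced by the controller, not by the agent
}

// Structured audit event (assumed shape).
interface AuditEvent {
  seq: number;
  nodeId: string;
  action: ExaminerAction;
  decision: "permitted" | "blocked";
  reason: string;
}

class RuntimeController {
  readonly events: AuditEvent[] = [];
  private followUpsUsed = 0;
  private readonly policy: NodePolicy;

  constructor(policy: NodePolicy) {
    this.policy = policy;
  }

  // Decides whether the examiner may perform an action inside the current node.
  // Every request is logged, whether it is permitted or blocked.
  request(action: ExaminerAction): boolean {
    let permitted = true;
    let reason = "within policy";
    if (FORBIDDEN.indexOf(action) !== -1) {
      permitted = false;
      reason = "action is prohibited for the examiner";
    } else if (action === "follow_up" && this.followUpsUsed >= this.policy.maxFollowUps) {
      permitted = false;
      reason = "follow-up budget exhausted";
    }
    if (permitted && action === "follow_up") this.followUpsUsed += 1;
    this.events.push({
      seq: this.events.length + 1,
      nodeId: this.policy.nodeId,
      action,
      decision: permitted ? "permitted" : "blocked",
      reason,
    });
    return permitted;
  }
}
```

Keeping the decision in the controller rather than in the examiner's prompt is what makes the policy machine-enforceable: the agent can ask for anything, but only the controller's answer takes effect, and the `events` array doubles as the audit trail.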

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     AUTHORING STUDIO                            │
│  Lecturers design exam flows, define rubrics, set policies      │
│                                                                 │
│  Exam Flow Model ──compile──► ExamRuntimePackage (IOA-ORM)      │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                    ┌──────────▼──────────┐
                    │  IOA-ORM            │
                    │  (canonical spec)   │
                    │  versioned, stable  │
                    └──┬──────┬──────┬────┘
                       │      │      │
          ┌────────────▼─┐  ┌─▼──────▼───────────────┐
          │  Execution   │  │  Runtime Controller    │
          │  Adapter     │  │  (policy enforcement)  │
          └──────┬───────┘  └────────┬───────────────┘
                 │                   │
    ┌────────────▼───────────────────▼────────────┐
    │           REAL-TIME VOICE RUNTIME           │
    └────────────┬───────────────────┬────────────┘
                 │                   │
    ┌────────────▼────────┐  ┌───────▼──────────────┐
    │  Event Store        │  │  Evidence Ledger     │
    └─────────────────────┘  └────────┬─────────────┘
                                      │
                              ┌───────▼───────────┐
                              │  Marking Runtime  │
                              └───────────────────┘
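
The diagram's final hop, from Evidence Ledger to Marking Runtime, is where structured signals replace raw transcript. The following is a minimal sketch of that hand-off, assuming a hypothetical `EvidenceSignal` shape and a hypothetical confidence threshold; routing low-confidence signals to human review is one possible policy, not a documented IOA-ORM behaviour.

```typescript
// Hypothetical signal shape: linked to a rubric criterion, an exam node, and
// the originating event, and scored with a confidence in the range 0..1.
interface EvidenceSignal {
  signalId: string;
  criterionId: string;
  nodeId: string;
  sourceEventSeq: number; // link back to the event store for audit
  confidence: number;
  summary: string;
}

interface CriterionEvidence {
  criterionId: string;
  signals: EvidenceSignal[];
}

interface MarkingInput {
  byCriterion: CriterionEvidence[]; // what the marking runtime consumes
  needsReview: EvidenceSignal[];    // below threshold: held for a human marker
}

// Groups ledger signals by rubric criterion and separates out low-confidence
// signals instead of silently discarding them.
function prepareForMarking(ledger: EvidenceSignal[], minConfidence: number): MarkingInput {
  const result: MarkingInput = { byCriterion: [], needsReview: [] };
  for (const signal of ledger) {
    if (signal.confidence < minConfidence) {
      result.needsReview.push(signal);
      continue;
    }
    let bucket = result.byCriterion.filter((c) => c.criterionId === signal.criterionId)[0];
    if (!bucket) {
      bucket = { criterionId: signal.criterionId, signals: [] };
      result.byCriterion.push(bucket);
    }
    bucket.signals.push(signal);
  }
  return result;
}
```

Because each signal carries `sourceEventSeq`, a marker or moderator can trace any piece of evidence back to the event that produced it, which ties the marking pipeline to the audit trail.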

Output Structure

spec/
  00-overview.md          — Purpose, gaps, theoretical grounding, architecture
  01-concepts.md          — Domain model, glossary, theoretical foundations
  02-schema.md            — TypeScript interfaces (26 sections)
  03-runtime-semantics.md — State machine, transitions, policy evaluation
  04-agent-boundary.md    — Allowed/forbidden actions, guardrail enforcement
  05-event-protocol.md    — Event types, payloads, delivery guarantees
  06-evidence-ledger.md   — Signal lifecycle, ledger schema, marking integration
  07-pipecat-adapter.md   — Compilation rules for Pipecat execution adapter
  08-validation-rules.md  — Compile-time validation (117 rules)
  09-versioning.md        — Version scheme, migration, compatibility
  10-examples.md          — Complete worked examples
  11-migration-plan.md    — Incremental migration path
  12-testing-strategy.md  — Psychometric and integration testing
  13-open-questions.md    — Unresolved design decisions
  14-design-alternatives.md — Design space exploration (12 QOC decisions)

Research Grounding

The specification is grounded in four key works from the oral assessment literature:

Paper	Key Insight	Design Impact
Joughin (1998)	Six dimensions of oral assessment: content type, interaction, authenticity, structure, examiners, orality	`AssessmentProfile` — encoding these dimensions as first-class runtime parameters
Akimov & Malin (2020)	Validity/reliability/fairness matrix. Recording + moderation for reliability. Question banking for inter-case reliability.	`ModerationPolicy`, `QuestionPool`, `CalibrationProfile`
Bayley et al. (2024)	ConVOE model for 600+ students: parallel administration, batch grading, practice sessions.	`expectedCandidateCount`, `QuestionPool.allowReuseAcrossConcurrentSessions`
Fenton (2025)	IOA components. Prompting taxonomy. Formative vs. summative. Examiner training. Communication skills.	`PromptingLevel`, `assessmentPurpose`, `scaffoldingBudget`, `identity_check` node
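
As an illustration of the first table row, the six dimensions could be encoded as a typed `AssessmentProfile`. The field names follow the dimensions listed above, but the value sets are an assumed paraphrase that should be checked against Joughin (1998), and `deriveRuntimeHints` with its two flags is purely hypothetical.

```typescript
// One field per dimension named in the table. The allowed values are an
// assumption for illustration, not a quotation from the source paper.
interface AssessmentProfile {
  contentType:
    | "knowledge_and_understanding"
    | "applied_problem_solving"
    | "interpersonal_competence"
    | "personal_qualities";
  interaction: "presentation" | "dialogue";
  authenticity: "contextualised" | "decontextualised";
  structure: "closed" | "open";
  examiners: "self" | "peer" | "authority";
  orality: "purely_oral" | "oral_secondary";
}

// The implicit assumption "this exam uses structured dialogue to assess
// interpersonal competence", written down as explicit data.
const structuredDialogueExam: AssessmentProfile = {
  contentType: "interpersonal_competence",
  interaction: "dialogue",
  authenticity: "contextualised",
  structure: "closed",
  examiners: "authority",
  orality: "purely_oral",
};

// Hypothetical example of a profile driving runtime behaviour.
interface RuntimeHints {
  allowFollowUps: boolean;     // dialogue implies examiner follow-up questions
  fixedQuestionOrder: boolean; // closed structure implies a fixed sequence
}

function deriveRuntimeHints(profile: AssessmentProfile): RuntimeHints {
  return {
    allowFollowUps: profile.interaction === "dialogue",
    fixedQuestionOrder: profile.structure === "closed",
  };
}
```

Typing the dimensions as closed unions means an invalid profile fails type-checking, and a versioned spec can be diffed dimension by dimension.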

Additionally, the design space exploration draws on work in dialogue management, agent workflow architectures, AI safety and governance frameworks, and real-time voice agent systems.

Quality Requirements

Use clear markdown structure
Provide schemas, JSON examples, event protocols, migration plans — not just concepts
Designs must be pragmatic and engineering-ready
Suggest incremental migration, not full rewrite
Focus on oral assessment / interactive oral exam, not generic chatbot workflow
Use MUST / SHOULD / MAY / MUST NOT normative language for semantic contracts