
Runtime Semantics

Draft · v0.2.0 · 2026-06-30

Status: Draft

Scope: Defines the execution model, state machines, lifecycle transitions, and semantic contracts that govern how an IOA-ORM is executed. All normative statements use RFC 2119 language (MUST / SHOULD / MAY / MUST NOT).


  1. Runtime Execution Model
  2. Exam Lifecycle State Machine
  3. Node Lifecycle State Machine
  4. Turn Lifecycle State Machine
  5. Candidate Readiness State
  6. Recovery State
  7. Completion Semantics
  8. Transition Semantics
  9. Follow-up Semantics
  10. Candidate Command Semantics
  11. Guardrail Enforcement Semantics
  12. Transcript and Evidence Capture Semantics

Appendices


Interactive Oral Assessment (IOA) research (Ward et al., 2023; Sotiriadou et al., 2016) establishes that effective oral assessments are scenario-based, free-flowing conversations — not scripted question-and-answer sessions. The Runtime MUST respect this principle: nodes represent assessment scenarios with rubric-aligned conversation objectives, not rigid question slots. The LLM examiner drives an authentic, professionally-focused dialogue, using rubric criteria as conversation guides (sentence-starters), not as a quiz script.

The runtime semantics are grounded in four key dimensions from the oral assessment literature:

Interaction (Joughin, 1998, pp. 370–371). Oral assessment’s principal advantage is the capacity for bidirectional adaptation — “each [statement] includes a response to that made by the other participant” creating “inherent unpredictability in which neither party knows in advance exactly what questions will be asked or what responses will be made.” The Runtime preserves this by giving the LLM autonomy over dialogue strategy within nodes while maintaining structural control over transitions. However, Joughin warns that “the social interaction entailed in oral assessment may distort communication and affect both a candidate’s performance and how that performance is perceived by the examiner” (p. 370). The guardrails in §11 address this risk.

Authenticity (Joughin, 1998, pp. 371–372; Fenton, 2025, p. 431). IOA research positions oral assessments as “a form of assessment asking students to perform real-world tasks to demonstrate meaningful application of necessary knowledge and skills” (Sotiriadou et al., 2020, cited in Fenton, 2025). The scenario-based node design directly implements Joughin’s “contextualised” pole of the authenticity continuum. The persona system (§11) ensures the examiner stays in character, maintaining the professional context throughout the interaction.

Reliability through Structure (Joughin, 1998, p. 376; Akimov & Malin, 2020, Table 4). Joughin identifies that “reliability is threatened when the ‘interaction’ dimension tends towards the ‘dialogue’ pole, when the ‘structure’ dimension tends towards the ‘open’ pole.” The spec addresses this through a deliberate design choice: closed structure between nodes (Runtime-enforced transitions, deterministic ordering) with open dialogue within nodes (LLM-driven adaptive questioning). This hybrid preserves the validity benefits of dialogue while maintaining the reliability benefits of structure. The follow-up counter and time budget (§9) are hard structural constraints that prevent unbounded dialogue.

Prompting Neutrality (Pearce & Chiavaroli, 2020, cited in Fenton, 2025, p. 434). The spec operationalises four guiding principles for examiner prompting: neutrality (guardrails prevent reassurance or discouragement), consistency (same follow-up policy across nodes), transparency (candidates receive scaffolding practice before assessment), and reflexivity (examiner utterances are logged for post-hoc quality review).

The six IOA components (scaffolding, scenario-based, aligned to program, learning outcomes, accessible/equitable, professionally-focused) inform the following runtime semantics:

  • Scenario context is provided to the LLM as scene-setting, not as a question prompt.
  • Rubric criteria are shared with the LLM as evidence vocabulary (what to listen for), not as scoring weights or model answers.
  • Conversation flow is emergent — the LLM adapts its prompting based on what the candidate demonstrates, nudging toward higher rubric levels when opportunity arises.
  • Scaffolding operates at two levels: (1) pre-exam familiarisation (§2.1), where the candidate practises the IOA format without stakes, and (2) in-assessment scaffolding (§9.5), where the examiner adjusts support based on candidate performance within the Zone of Proximal Development.
  • Equity is enforced by not penalising communication style unless it is a declared learning outcome (§11.2). Fairness is further supported by deterministic policy enforcement for transitions and completion, preventing LLM-driven inconsistency across candidates.

The Runtime MUST maintain exactly one active node at any time during an in-progress exam. When the active node completes or transitions, the Runtime MUST atomically deactivate the current node and activate the next node — there MUST NOT be a gap where zero or more than one node is active.

A Runtime Controller owns all authoritative state. The LLM agent is a tool invoked by the Runtime Controller — it does NOT own state, does NOT decide transitions unilaterally, and does NOT persist evidence directly. The boundary is:

| Concern | Owner |
| --- | --- |
| Which node is active | Runtime Controller |
| When a turn starts/ends | Runtime Controller |
| How many follow-ups have been asked | Runtime Controller |
| Time budget remaining | Runtime Controller |
| Evidence ledger writes | Runtime Controller (LLM emits signals, Runtime persists) |
| Question wording, follow-up phrasing, bridge text | LLM (within constraints) |
| Judging answer sufficiency | LLM emits opinion; Runtime applies policy |

For each active node, the Runtime Controller executes the following loop. Note that this is a conversation loop, not a rigid Q&A loop — the LLM drives an authentic dialogue, and the loop represents the runtime’s state tracking, not the conversation’s structure.

Pipecat Mapping: Steps marked with 🟢 execute in Pipecat pipeline/FlowManager. Steps marked with 🔵 execute in Runtime Controller. The LLM calls report_observation (the single allowed function) after processing each candidate response; the Runtime Controller handler evaluates the observation and decides the next action.

while node is active:
    1. 🔵 Enter node → emit node_entered, set time budget, reset counters
    2. 🔵 Build NodeConfig from specification (scenario, rubric criteria, persona, constraints)
    3. 🟢 FlowManager.set_node_from_config(config)
    4. 🟢 LLM initiates conversation (task_messages drive the opening)
    5. 🟢 STT captures candidate response → text flows to LLM
    6. 🟢 LLM processes response, calls report_observation(...)
    7. 🔵 Runtime Controller handler receives observation:
        a. If commandDetected → dispatch command (see §10), return
        b. Validate spokenText through output filters (content/topic/action/length)
        c. Write evidence signals to Evidence Ledger
        d. Check guardrails:
            - Time budget exceeded? → transition to best-effort
            - Follow-up requested?
                - Counter < maxFollowUps → inject follow-up context, continue
                - Counter >= maxFollowUps → transition to best-effort
            - Off-topic?
                - Count < maxOffTopicRedirects → inject redirect, continue
                - Count >= maxOffTopicRedirects → transition to best-effort
            - Evidence sufficient + requiredEvidenceCount met? → transition to completed
            - Anxiety detected? → extend time budget
        e. If transitioning:
            - Finalize current node evidence
            - Build next node's NodeConfig
            - 🔵 call flow_manager.set_node_from_config(next_config)
        f. If continuing: return validated spokenText to LLM for TTS
    8. 🟢 TTS speaks the response to candidate
    9. Loop back to 5
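
The guardrail checks in step 7d can be sketched as a pure decision function. This is a minimal illustration, not the reference implementation: `NodeState`, `Observation`, `Action`, and their field names are hypothetical and are not defined by this specification. The anxiety-based time extension is omitted for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    FOLLOW_UP = "follow_up"
    REDIRECT = "redirect"
    BEST_EFFORT = "best_effort"
    COMPLETED = "completed"


@dataclass
class NodeState:
    # Runtime-owned counters and limits for the active node (hypothetical names).
    follow_up_count: int = 0
    max_follow_ups: int = 2
    off_topic_count: int = 0
    max_off_topic_redirects: int = 2
    evidence_count: int = 0
    required_evidence_count: int = 1
    time_remaining_s: float = 300.0


@dataclass
class Observation:
    # Assumed shape of the report_observation payload.
    follow_up_requested: bool = False
    off_topic: bool = False
    evidence_sufficient: bool = False


def decide(state: NodeState, obs: Observation) -> Action:
    """Apply the step 7d checks in the order they are listed in the loop."""
    if state.time_remaining_s <= 0:
        return Action.BEST_EFFORT
    if obs.follow_up_requested:
        if state.follow_up_count < state.max_follow_ups:
            state.follow_up_count += 1
            return Action.FOLLOW_UP
        return Action.BEST_EFFORT
    if obs.off_topic:
        if state.off_topic_count < state.max_off_topic_redirects:
            state.off_topic_count += 1
            return Action.REDIRECT
        return Action.BEST_EFFORT
    if obs.evidence_sufficient and state.evidence_count >= state.required_evidence_count:
        return Action.COMPLETED
    return Action.CONTINUE
```

The LLM only supplies the observation; every counter comparison happens in Runtime-owned code, which matches the ownership boundary described above.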

The Runtime MUST be single-threaded per exam instance. All state mutations for a given exam MUST be serialised. Event emission MUST occur before the corresponding state mutation is considered committed.

Every Runtime operation that mutates state MUST be idempotent with respect to its event sequence number. Re-applying the same event (e.g., during recovery) MUST produce the same resulting state.
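
One way to satisfy the idempotency requirement is to track the highest sequence number already applied and ignore anything at or below it. The sketch below assumes a hypothetical `Event` shape and a single event type; the real event schema is defined elsewhere in the specification set.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    seq: int       # monotonically increasing sequence number (assumed field name)
    type: str      # e.g. "follow_up_issued"
    node_id: str


@dataclass
class ExamState:
    last_applied_seq: int = 0
    follow_up_counts: dict[str, int] = field(default_factory=dict)


def apply_event(state: ExamState, event: Event) -> ExamState:
    """Apply an event at most once; a replayed sequence number is a no-op."""
    if event.seq <= state.last_applied_seq:
        return state
    if event.type == "follow_up_issued":
        current = state.follow_up_counts.get(event.node_id, 0)
        state.follow_up_counts[event.node_id] = current + 1
    state.last_applied_seq = event.seq
    return state
```

Re-applying the same event during recovery leaves the follow-up counter unchanged, so replay cannot inflate it.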


| State | Description |
| --- | --- |
| created | Exam instance created from published AssessmentPackage. Candidate assigned. |
| scheduled | Exam scheduled with a specific time window. |
| scaffolding | Optional practice phase. Candidate experiences the IOA format with a practice scenario that does NOT count toward the score. Emits scaffolding_started / scaffolding_completed. |
| ready | All pre-conditions met (candidate authenticated, audio/video checks passed, session token issued, scaffolding completed or skipped). |
| in_progress | Exam is live; at least one node has been entered. |
| paused | Exam temporarily suspended (candidate-initiated pause or system-initiated). |
| completed | All nodes processed or exam explicitly ended. Terminal. |
| aborted | Exam ended prematurely due to violation, timeout, or system failure. Terminal. |
| expired | Exam time window closed before completion. Terminal. |

created ──[assign_candidate]──► scheduled
scheduled ──[preconditions_met, scaffolding_enabled]──► scaffolding
scheduled ──[preconditions_met, scaffolding_disabled]──► ready
scaffolding ──[practice_complete OR candidate_skips_practice]──► ready
ready ──[exam_start]──► in_progress
in_progress ──[pause_requested]──► paused
paused ──[resume]──► in_progress
in_progress ──[all_nodes_complete OR explicit_end]──► completed
in_progress ──[violation OR system_failure]──► aborted
in_progress ──[time_window_expired]──► expired
paused ──[time_window_expired]──► expired
scheduled ──[time_window_expired]──► expired
  • scheduled → scaffolding: Runtime MUST verify preconditions are met AND scaffolding.enabled === true in the specification. Scaffolding turns MUST use a separate practice scenario (defined in the specification) and MUST NOT produce evidence signals for the marking pipeline.
  • scaffolding → ready: Runtime MUST emit scaffolding_completed with a count of practice turns taken. Scaffolding transcript MAY be retained for QA but MUST be excluded from the MarkingPackage.
  • ready → in_progress: Runtime MUST verify candidate identity is confirmed and all pre-checks (audio, video, network) have passed.
  • in_progress → paused: Runtime MUST record the pause timestamp and remaining time budget. The active node MUST be suspended, not reset.
  • paused → in_progress: Runtime MUST restore the active node state exactly as it was at pause.
  • in_progress → completed: Runtime MUST verify that every required node has been processed (either completed or marked best-effort per §7).
  • in_progress → aborted: Runtime MUST record the abort reason. Abort reasons include: candidate violation (e.g., third-party assistance detected), repeated guardrail violations, or irrecoverable system failure.
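
The exam lifecycle above can be encoded as a lookup table keyed by (state, trigger). This is a sketch under assumptions: the trigger strings are taken from the diagram, but modelling the scaffolding guard as a keyword argument and raising `ValueError` on illegal transitions are illustrative choices, not requirements of this specification.

```python
TRANSITIONS: dict[tuple[str, str], str] = {
    ("created", "assign_candidate"): "scheduled",
    ("scaffolding", "practice_complete"): "ready",
    ("scaffolding", "candidate_skips_practice"): "ready",
    ("ready", "exam_start"): "in_progress",
    ("in_progress", "pause_requested"): "paused",
    ("paused", "resume"): "in_progress",
    ("in_progress", "all_nodes_complete"): "completed",
    ("in_progress", "explicit_end"): "completed",
    ("in_progress", "violation"): "aborted",
    ("in_progress", "system_failure"): "aborted",
    ("in_progress", "time_window_expired"): "expired",
    ("paused", "time_window_expired"): "expired",
    ("scheduled", "time_window_expired"): "expired",
}


def next_state(state: str, trigger: str, scaffolding_enabled: bool = False) -> str:
    """Return the state reached from `state` on `trigger`, or raise ValueError."""
    # The scheduled state branches on the scaffolding.enabled guard.
    if state == "scheduled" and trigger == "preconditions_met":
        return "scaffolding" if scaffolding_enabled else "ready"
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(f"illegal transition: {trigger!r} from {state!r}") from None
```

Terminal states have no outgoing entries, so any trigger received after completion, abort, or expiry is rejected.
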
Action | created | scheduled | scaffolding | ready | in_progress | paused | completed | aborted | expired
Assign candidate
Schedule exam
Start scaffolding
Skip scaffolding
Start exam
Candidate speaks
LLM generates question
Transition node
Pause
Resume
Abort
Read transcript
Read evidence

| State | Description |
| --- | --- |
| pending | Node has not yet been reached in the exam flow. |
| active | Node is the current active node; candidate is being assessed on it. |
| completed | Node assessment finished successfully (sufficient evidence gathered). |
| best_effort | Node ended with incomplete evidence (follow-ups exhausted or time budget hit). |
| skipped | Node skipped due to conditional routing (e.g., prerequisite not met). |

pending ──[node_enter]──► active
active ──[evidence_sufficient]──► completed
active ──[followups_exhausted OR time_budget_hit]──► best_effort
active ──[transition_condition_skip]──► skipped
pending ──[transition_condition_skip]──► skipped
  • pending → active: Runtime MUST emit node_entered event with node ID, sequence index, and timestamp. Runtime MUST initialise the node’s follow-up counter to 0 and start the time budget timer.
  • active → completed: Runtime MUST emit node_completed event. Evidence signals collected during this node MUST be finalised and written to the Evidence Ledger (see §12).
  • active → best_effort: Runtime MUST emit node_best_effort event with a reason (followups_exhausted, time_budget_hit). All partial evidence MUST still be written to the Evidence Ledger.
  • active → skipped: Runtime MUST emit node_skipped event with the transition condition that triggered the skip. Runtime MUST NOT count skipped nodes as completed for completion purposes (see §7).

A turn is a single exchange: candidate speaks → LLM processes → LLM responds (or transitions).

| State | Description |
| --- | --- |
| awaiting_candidate | Runtime is waiting for the candidate to speak. LLM has finished presenting the question or follow-up. |
| candidate_speaking | STT is capturing the candidate’s audio. |
| processing | Candidate’s response is being processed (STT finalisation, LLM evaluation). |
| llm_responding | LLM is generating and TTS is playing the response. |
| turn_complete | Turn ended; ready for next turn or node transition. |

awaiting_candidate ──[speech_detected]──► candidate_speaking
candidate_speaking ──[speech_ended OR silence_timeout]──► processing
processing ──[evaluation_complete, needs_followup]──► llm_responding
processing ──[evaluation_complete, sufficient]──► turn_complete
processing ──[evaluation_complete, command_detected]──► turn_complete
llm_responding ──[tts_finished]──► awaiting_candidate   (loop for follow-up)
llm_responding ──[tts_finished, transitioning]──► turn_complete
  • If the candidate does not start speaking (that is, candidate_speaking is not entered) within silenceTimeoutMs (configurable per node), the Runtime MUST transition to processing with a silence_detected flag.
  • If processing exceeds processingTimeoutMs, the Runtime MUST emit a turn_timeout event and the turn MUST be treated as best-effort for that candidate utterance.
  • llm_responding MUST complete within llmResponseTimeoutMs. If exceeded, Runtime MUST emit llm_timeout and either retry once or fall back to a canned transition message.
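
The llmResponseTimeoutMs rule can be sketched with `asyncio.wait_for`. The sketch assumes one particular policy (retry once, then fall back) and emits llm_timeout on every timeout; `generate`, `emit`, and the canned text are hypothetical stand-ins for the real LLM call, event emitter, and specification fallback.

```python
import asyncio
from typing import Awaitable, Callable

CANNED_TRANSITION = "Thank you. Let's move on."  # hypothetical fallback text


async def respond_with_timeout(
    generate: Callable[[], Awaitable[str]],
    timeout_ms: int,
    emit: Callable[[str], None],
) -> str:
    """Try the LLM twice (one retry); after that, return the canned message."""
    for _attempt in range(2):
        try:
            return await asyncio.wait_for(generate(), timeout=timeout_ms / 1000)
        except asyncio.TimeoutError:
            emit("llm_timeout")
    return CANNED_TRANSITION
```

`asyncio.wait_for` cancels the pending generation when the timeout fires, so a slow first attempt does not keep running behind the retry.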

The turn lifecycle states map to the canonical event types defined in 02-schema.md §14 (RuntimeEventType) and the event protocol in 05-event-protocol.md. Internal state transitions (e.g., entering candidate_speaking) MAY be tracked internally by the Runtime Controller but are not necessarily emitted as protocol-level events. The canonical turn-related event types are:

  • examiner_turn — examiner produces a turn (maps to transcript_final with speaker “examiner”)
  • candidate_turn — candidate produces a turn (maps to transcript_final with speaker “candidate”)
  • turn_completed — turn cycle completed (both sides have spoken or timeout)
  • turn_timeout — turn processing exceeded processingTimeoutMs

Additionally, the event protocol defines transcript_delta (streaming STT partials), transcript_final (canonical persisted utterance), examiner_utterance_started, and examiner_utterance_final for real-time UI consumption. See 05-event-protocol.md §4.4–4.7.


Before the exam begins, the Runtime MUST verify the candidate is technically and cognitively ready. This prevents starting a high-stakes assessment with broken audio, a confused candidate, or an unverified identity.

| Check | Required | Failure Behaviour |
| --- | --- | --- |
| Identity verification (face match, ID check, or proctoring token) | MUST pass | Block exam start; emit readiness_identity_failed |
| Audio input device active | MUST pass | Block exam start; emit readiness_audio_failed |
| Audio output device active (TTS audible) | MUST pass | Block exam start; emit readiness_audio_output_failed |
| Video input device active (if required) | MUST pass | Block exam start; emit readiness_video_failed |
| Network connectivity (latency < threshold) | SHOULD pass | Warn candidate; allow start with degraded flag |
| Candidate confirms instructions understood | MUST pass | Re-present instructions; emit readiness_instructions_not_understood |

not_ready ──[check_identity]──► identity_verified
identity_verified ──[check_audio]──► audio_ok
audio_ok ──[check_video]──► video_ok
video_ok ──[check_instructions]──► ready

All states are sequential. Runtime MUST NOT skip a check. Each failed check MUST emit an event and block progression. The Runtime MAY allow a configurable number of retries per check before blocking the exam entirely.
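
The sequential, no-skip rule with per-check retries can be sketched as follows. The check names follow the four states of the readiness chain above; `run_check`, `emit`, and `max_retries` are hypothetical, and the audio-output and network checks from the table are omitted to keep the example short.

```python
from typing import Callable

# (check name, failure event) in the order of the readiness state chain.
CHECKS = [
    ("identity", "readiness_identity_failed"),
    ("audio", "readiness_audio_failed"),
    ("video", "readiness_video_failed"),
    ("instructions", "readiness_instructions_not_understood"),
]


def run_readiness(
    run_check: Callable[[str], bool],
    emit: Callable[[str], None],
    max_retries: int = 2,
) -> bool:
    """Run every check in order; return True only if all of them pass."""
    for name, failure_event in CHECKS:
        for _attempt in range(1 + max_retries):
            if run_check(name):
                break
            emit(failure_event)  # each failed attempt emits an event
        else:
            return False  # retries exhausted: block the exam
    return True
```

Because the outer loop never continues past a failed check, a later check cannot run until every earlier one has passed.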


Recovery scenarios are divided into technical failures (infrastructure issues) and assessment failures (pedagogical situations where the assessment interaction goes wrong). Both categories MUST be handled by the Runtime Controller.

| Scenario | Detection | Recovery Action |
| --- | --- | --- |
| Network disconnection | WebSocket/LiveKit disconnect event | Pause exam; attempt reconnect within reconnectTimeoutMs. If reconnected, resume from last committed state. If timeout, abort. |
| STT failure | STT returns empty/error for N consecutive turns | Retry STT pipeline. If persistent, emit stt_failure event, present written question as fallback, log degraded mode. |
| STT low confidence | transcript_segment.confidence < 0.6 | Runtime MUST emit stt_low_confidence event. LLM MAY offer the candidate a chance to repeat. Evidence signals MUST NOT be recorded from segments with confidence below 0.5 (see §12.3). |
| LLM failure | LLM timeout or error response | Retry once with backoff. If persistent, use canned follow-up from specification fallback config. If no fallback, pause exam. |
| TTS failure | TTS returns error | Retry once. If persistent, present question as text on data channel. Emit tts_failure event. |
| Silence (candidate unresponsive) | Silence exceeds silenceTimeoutMs | Runtime MUST prompt candidate (via LLM or canned). After maxSilencePrompts, transition to best-effort for current node. |
| Candidate disconnects | LiveKit participant leave event | Pause exam immediately. If candidate reconnects within reconnectTimeoutMs, resume. If not, abort with candidate_disconnect. |
| Audio loop / echo | Audio energy level anomaly detection | Mute TTS, present text, attempt audio reset. Emit audio_loop_detected. |

Assessment failures are situations where the pedagogical interaction breaks down. Unlike technical failures, these require the LLM and Runtime to collaborate on recovery while preserving assessment validity.

| Scenario | Detection | Recovery Action |
| --- | --- | --- |
| Candidate misunderstands scenario | LLM detects candidate’s response is inconsistent with the scenario role (e.g., candidate acts as manager when they should be the employee) | LLM re-establishes scenario context without revealing assessment content. Emits scenario_clarification event. MUST NOT reveal what the “correct” interpretation is; it only re-states the scenario framing. This preserves assessment validity while correcting the misunderstanding. |
| Question difficulty mismatch | LLM signals difficulty_mismatch in report_observation (candidate’s response suggests question was too hard or too easy) | Runtime MAY allow one question rephrase at lower or higher complexity level. Emits difficulty_adjusted event. The rephrased question MUST assess the same evidence targets; only the complexity framing changes. |
| Candidate emotional distress | LLM signals distress_detected (beyond anxiety, e.g., crying, aggressive tone, refusal to continue) | Runtime offers pause with a welfare message: “We can take a break whenever you need. Would you like to pause?” If candidate continues, Runtime logs distress_event for post-exam review. If candidate does not respond within silenceTimeoutMs, Runtime pauses automatically. Emits welfare_check event. |
| Examiner gives contradictory information | Output validation detects contradiction with prior statements in conversation history | Runtime intercepts and re-prompts the LLM with the contradictory statement flagged. Emits consistency_violation event. If second attempt also contradicts, uses canned fallback. |
| Candidate gives consistently off-topic answers | LLM signals off_topic for 3+ consecutive turns despite redirects | Runtime emits persistent_off_topic event. LLM MAY re-state the question more explicitly (without revealing the answer). If still off-topic after maxOffTopicRedirects, transition to best-effort. |

Assessment Failure Principle: Recovery MUST NOT reveal model answers, rubric scoring logic, or the “correct” response. The goal is to restore the assessment interaction to a productive state, not to guide the candidate to the right answer. This preserves the assessment validity principle from Fenton (2025): the examiner should “neither discourage nor reassure the student” during prompting.

healthy ──[failure_detected]──► recovering
recovering ──[recovery_successful]──► healthy
recovering ──[recovery_failed]──► degraded
degraded ──[manual_intervention OR timeout]──► aborted
recovering ──[reconnect_timeout]──► aborted
  • The Runtime MUST preserve all events emitted before the failure. Events are the source of truth for recovery.
  • On recovery, the Runtime MUST replay from the last committed event to restore consistent state.
  • The LLM session context MUST be reconstructed from the event log, not from LLM memory.
  • Recovery MUST NOT cause the candidate to lose credit for answers already given. Evidence already written to the ledger is permanent.
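
Reconstructing the LLM session context from the event log, rather than from LLM memory, might look like the sketch below. The event field names (`seq`, `type`, `speaker`, `text`) and the mapping onto chat roles are assumptions made for illustration; the canonical event shapes live in the event protocol document.

```python
def rebuild_context(events: list[dict]) -> list[dict]:
    """Rebuild chat messages from transcript_final events, in sequence order."""
    role_for = {"examiner": "assistant", "candidate": "user"}
    messages = []
    for ev in sorted(events, key=lambda e: e["seq"]):
        # Only persisted utterances feed the context; other events are skipped.
        if ev["type"] == "transcript_final":
            messages.append({"role": role_for[ev["speaker"]], "content": ev["text"]})
    return messages
```

Sorting by sequence number makes the result independent of the order in which events were read back from storage.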

A node may end (leave the active state) when ALL of the following are true:

  1. The main question has been asked at least once.
  2. The candidate has provided at least one substantive response (not silence, not a command).
  3. Either:
    • (a) Evidence signals collected meet the node’s CompletionPolicy.requiredEvidenceCount threshold, OR
    • (b) Follow-ups have been exhausted (followUpCount >= maxFollowUps), OR
    • (c) Time budget for the node has been exhausted, OR
    • (d) The LLM has judged evidence sufficient AND the Runtime’s autoCompleteOnSufficient policy is enabled.

When a node ends without meeting condition (a) — i.e., CompletionPolicy.requiredEvidenceCount not reached — the Runtime MUST mark it as best_effort, NOT completed. Best-effort nodes:

  • MUST still have their partial evidence written to the Evidence Ledger.
  • MUST be distinguishable from fully completed nodes in the ledger (via a completionStatus field).
  • MUST count toward overall exam progress for time-budget purposes.
  • SHOULD trigger a review flag for markers if requiredEvidenceCount was not met.
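
The node-ending conditions and the completed versus best_effort distinction can be sketched as one function. This reads the rules literally: conditions 1 and 2 gate everything, any of (a) to (d) ends the node, and only (a) yields `completed`. All parameter names are hypothetical, and silence-driven endings covered in §6 are out of scope here.

```python
from typing import Optional


def node_end_status(
    *,
    question_asked: bool,
    substantive_responses: int,
    evidence_count: int,
    required_evidence_count: int,
    follow_up_count: int,
    max_follow_ups: int,
    time_remaining_s: float,
    llm_sufficient: bool = False,
    auto_complete_on_sufficient: bool = False,
) -> Optional[str]:
    """Return None while the node must stay active, else its completionStatus."""
    if not question_asked or substantive_responses < 1:
        return None
    evidence_met = evidence_count >= required_evidence_count    # condition (a)
    exhausted = follow_up_count >= max_follow_ups               # condition (b)
    out_of_time = time_remaining_s <= 0                         # condition (c)
    llm_done = llm_sufficient and auto_complete_on_sufficient   # condition (d)
    if not (evidence_met or exhausted or out_of_time or llm_done):
        return None
    return "completed" if evidence_met else "best_effort"
```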

An exam is complete when ALL of the following are true:

  1. All required nodes have been processed (completed, best_effort, or skipped per routing).
  2. No node is in active state.
  3. Runtime has emitted exam_completed event.

An exam is aborted when:

  1. A hard violation occurs (see §11), OR
  2. A recovery failure forces termination (see §6), OR
  3. An explicit abort is issued by the proctoring system.

If the exam is aborted, all nodes processed up to the abort point MUST still have their evidence written to the ledger. The marking pipeline MUST be able to score a partial exam. The Runtime MUST emit exam_partial with the list of completed and best-effort nodes.


| Type | Trigger | Description |
| --- | --- | --- |
| sequential | Current node completes | Move to the next node in the specification sequence. |
| conditional | Edge condition evaluates to true | Jump to a non-adjacent node based on evidence or state. |
| branch | Branching node selects path | Follow a specific branch based on candidate profile or answer. |
| skip | Skip condition met | Skip one or more nodes. |
| abort | Abort condition met | End exam immediately. |

  • The Runtime MUST evaluate transition conditions in the order they appear in the specification. The FIRST matching condition wins.
  • Transition conditions MAY reference:
    • Evidence signals collected so far (signals:has("topic_x_signal"))
    • Node completion status (node:status("node_1") === "completed")
    • Time elapsed (time:elapsed > 1800)
    • Follow-up count on current node (node:followUpCount)
    • Candidate commands received (commands:received("repeat"))
  • Transition conditions MUST NOT reference LLM internal state or raw transcript content directly. They MUST operate on structured Runtime state only.
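
First-match-wins evaluation over structured Runtime state can be sketched as below. The specification's condition expressions (for example `signals:has(...)`) are represented here as plain Python callables, which is an illustrative simplification; `RuntimeState`, `Edge`, and the node names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RuntimeState:
    # Structured state only: no raw transcript text, no LLM internals.
    signals: set[str] = field(default_factory=set)
    node_status: dict[str, str] = field(default_factory=dict)
    elapsed_s: float = 0.0


@dataclass
class Edge:
    target: str
    condition: Callable[[RuntimeState], bool]


def select_transition(edges: list[Edge], state: RuntimeState) -> Optional[str]:
    """Return the target of the first edge whose condition holds, else None."""
    for edge in edges:  # specification order is evaluation order
        if edge.condition(state):
            return edge.target
    return None


edges = [
    Edge("wrap_up", lambda s: s.elapsed_s > 1800),
    Edge("node_3", lambda s: "topic_x_signal" in s.signals),
    Edge("node_2", lambda s: True),  # sequential default
]
```

With both the time condition and the signal condition true, the earlier edge wins and `select_transition` returns `"wrap_up"`.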

When transitioning between nodes, the Runtime MAY instruct the LLM to generate a transition bridge — a natural-language statement connecting the previous topic to the next. The LLM:

  • MUST NOT reveal the next question’s content in the bridge.
  • MUST NOT reference rubric criteria or model answers.
  • SHOULD produce a brief, natural sentence (e.g., “Thank you. Now let’s move on to a different area.”).
  • The Runtime MUST provide the LLM with the previous node’s topic label and the next node’s topic label (not question text) for bridge generation.

A node transition MUST be atomic:

  1. Finalise evidence for the departing node.
  2. Write evidence to ledger.
  3. Emit node_completed (or node_best_effort).
  4. Emit node_entered for the arriving node.
  5. Reset per-node state (follow-up counter, time budget).

Steps 1–5 MUST occur as a single committed transaction. If any step fails, the Runtime MUST roll back to the departing node’s active state and retry.
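
One simple way to get all-or-nothing behaviour for steps 1 to 5 is to run them on a copy of the state and swap the copy in only on success. The sketch below uses an in-memory dict with hypothetical keys; a production Runtime would more likely use a database transaction, and the retry loop is left to the caller.

```python
import copy


class TransitionError(Exception):
    """Raised when a transition step fails; the original state is unchanged."""


def transition_atomically(
    state: dict, departing: str, arriving: str, status: str = "node_completed"
) -> dict:
    """Run steps 1 to 5 on a copy and return it only if every step succeeds."""
    draft = copy.deepcopy(state)  # the caller's state stays untouched on failure
    evidence = draft["pending_evidence"].pop(departing, [])              # 1. finalise
    draft["ledger"].extend(evidence)                                     # 2. write to ledger
    draft["events"].append({"type": status, "node": departing})          # 3. emit
    draft["events"].append({"type": "node_entered", "node": arriving})   # 4. emit
    if arriving not in draft["nodes"]:
        raise TransitionError(f"unknown node {arriving!r}")
    draft["active_node"] = arriving
    draft["follow_up_count"] = 0                                         # 5. reset per-node state
    draft["time_remaining_s"] = draft["nodes"][arriving]["time_budget_s"]
    return draft
```

If any step raises, the caller still holds the departing node's active state and can retry, which is the rollback behaviour required above.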


Follow-ups allow the LLM examiner to probe deeper when a candidate’s initial answer is incomplete, unclear, or insufficiently evidenced. They are the primary mechanism for agentic behaviour within the exam.

In IOA practice, the most important follow-up strategy is rubric-level nudging: when a candidate demonstrates competence at a lower rubric level (e.g., “description” → credit), the assessor uses a follow-up to open the door to a higher level (e.g., “analysis” → distinction). The Runtime MUST support this pattern by providing the LLM with rubric level information as evidence vocabulary, not as scoring logic.

main_question_asked
    └── candidate answers
        ├── sufficient → transition (§8)
        └── insufficient → follow-up check:
            ├── followUps_remaining > 0 AND time_budget > 0 → issue follow-up
            └── followUps_remaining === 0 OR time_budget === 0 → best_effort
  • Each node maintains an independent followUpCount counter, starting at 0.
  • The Runtime MUST increment followUpCount each time a follow-up is issued.
  • followUpCount MUST be persisted in the event log before the follow-up is presented to the candidate.
  • The Runtime MUST NOT allow followUpCount to exceed maxFollowUps for the node. This is a hard constraint — the LLM MUST NOT be asked to generate a follow-up when the counter is at max.
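
The counter rules can be sketched as a small class that refuses to go past the limit and logs the new count before returning control to the caller. The class name, the exception, and the log record fields are hypothetical.

```python
class FollowUpLimitReached(Exception):
    """Raised instead of issuing a follow-up once maxFollowUps is reached."""


class FollowUpCounter:
    def __init__(self, max_follow_ups: int, event_log: list):
        self.max_follow_ups = max_follow_ups
        self.count = 0
        self._log = event_log

    def issue(self, follow_up_type: str) -> int:
        """Increment and persist the count; the caller presents the follow-up after."""
        if self.count >= self.max_follow_ups:
            raise FollowUpLimitReached("transition to best_effort instead")
        self.count += 1
        # Persist before the follow-up is presented to the candidate.
        self._log.append(
            {"type": "follow_up_issued", "followUpType": follow_up_type,
             "followUpCount": self.count}
        )
        return self.count
```

Because the limit check happens before the LLM is asked for wording, the hard constraint holds regardless of what the LLM requests.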

The LLM generates follow-up wording, but the Runtime enforces:

  • Follow-up MUST be on-topic (same node’s assessment objective).
  • Follow-up MUST NOT provide the answer or a strong hint (see §11 for hint refusal).
  • Follow-up MUST NOT introduce a new assessment topic (that requires a new node).
  • Follow-up SHOULD be a question, not a statement.
  • Follow-up SHOULD reference what the candidate said (for naturalness), but MUST NOT quote rubric criteria.

The follow-up taxonomy is derived from oral assessment literature. Pearce & Chiavaroli (2020, cited in Fenton, 2025) establish a prompting continuum from neutral presentation to leading questions; Joughin (1998) describes the bidirectional adaptation that characterises the “dialogue” pole of interaction. The types below span this range.

| Type | Example | When Used | Literature Basis |
| --- | --- | --- | --- |
| probe | “Can you elaborate on what you mean by X?” | Candidate’s answer was vague. | Pearce & Chiavaroli: probing questions |
| redirect | “Let me rephrase: what would happen if…?” | Candidate misunderstood the question. | Pearce & Chiavaroli: clarifying questions |
| scaffold | “Think about it from the perspective of Y.” | Candidate is stuck; provides a graduated nudge. See §9.6 for scaffolding intensity. | Vygotsky ZPD; Fenton (2025): “simplify questions or prompt students who are struggling” |
| challenge | “What about the counterargument that…?” | Candidate’s answer is one-dimensional. | Joughin: probing reasoning depth |
| nudge | “You’ve described the situation well. Can you tell me more about why you think that’s the case?” | Candidate demonstrated lower rubric level; opens door to higher level (e.g., description → analysis). This is the core IOA prompting strategy. | Pearce & Chiavaroli: probing questions; Fenton (2025): higher-order skills |
| confirm | “So what you’re saying is [paraphrase]. Is that right?” | LLM paraphrases candidate’s answer for confirmation before proceeding. Serves dual function: confirms understanding AND gives candidate chance to correct. | Joughin: interaction as reciprocal adaptation |
| extend | “That’s a solid analysis. Now, how would you apply this in [different context]?” | Candidate gave a good answer; deepens the exploration at the same rubric level by asking for breadth or application. Distinct from challenge (which questions the answer) and nudge (which pushes to a higher level). | Joughin: applied problem solving; probing boundaries of understanding |
| concede | “That’s alright, let’s move on to something else.” | Candidate is clearly stuck and scaffolding hasn’t helped. Graceful abandonment of the current line of questioning within the node. Distinct from node-level best-effort completion; this is a turn-level move. | Fenton (2025): managing anxiety; examiner warmth |
| closing | “Is there anything else you’d like to add?” | Near time budget or follow-up limit. | Standard exam practice |

The Runtime MUST emit follow_up_issued with the follow-up type. The LLM chooses the type based on context; the Runtime does not enforce type selection but MAY log it for quality assurance.

Scaffolding within the assessment (distinct from pre-exam familiarisation in §2.1) is the primary mechanism for supporting candidates within their Zone of Proximal Development (Vygotsky). When the LLM issues a scaffold follow-up, the amount of scaffolding provided is itself evidence of the candidate’s competence level.

Scaffolding intensity captures how much support the examiner provided:

  • 0 — No scaffolding: candidate answered independently.
  • 1 — Minimal scaffolding: slight rephrasing or redirection.
  • 2 — Moderate scaffolding: provided a conceptual hint or perspective shift.
  • 3 — Heavy scaffolding: significantly simplified the question or broke it into sub-parts.

The Runtime MUST record scaffolding intensity on the evidence signal when a scaffold follow-up is issued (see §12.2). The marking pipeline uses this as evidence: a candidate who needed heavy scaffolding demonstrated a different competence level than one who needed minimal support.

Graduated withdrawal: The LLM SHOULD reduce scaffolding intensity over the course of a node. If the candidate demonstrates competence after scaffolding, subsequent follow-ups SHOULD probe at the original difficulty level. This mirrors the educational scaffolding principle of gradually removing support as competence develops.

Fenton (2025, p. 433): “Educators have the flexibility to simplify questions or prompt students who are struggling” and this flexibility results in “higher grades than would have been achieved with a written assessment.” The spec models this flexibility as a first-class concept, not an exception.

The Runtime SHOULD track conversation quality metrics within each node to support post-hoc assessment quality review. These metrics do not affect runtime behaviour but are included in the MarkingPackage for psychometric analysis (see §12.6).

| Metric | Description |
| --- | --- |
| candidateTurnCount | Number of substantive candidate responses in this node |
| examinerFollowUpDepth | Number of follow-ups issued in this node |
| avgCandidateResponseLatencyMs | Mean time between examiner prompt and candidate speech start |
| longestCandidateMonologueSec | Longest uninterrupted candidate speech |
| followUpTypeDistribution | Count of each follow-up type used |
| scaffoldingTrajectory | Sequence of scaffolding intensities across the node (should trend downward) |

Joughin (1998, p. 376): Reliability is threatened when there is “inconsistency between the questions asked of different candidates.” These metrics enable post-hoc analysis of whether conversation paths were comparable across candidates, even when the LLM adapted its questioning style.
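
Three of the metrics above can be derived directly from a node's follow-up records, as in this sketch. The input record shape (`type`, `intensity`) and the derived `scaffoldingWithdrawn` flag are hypothetical additions for illustration, not fields defined by this specification.

```python
from collections import Counter


def follow_up_metrics(follow_ups: list[dict]) -> dict:
    """Summarise the follow-up records issued within one node."""
    types = Counter(f["type"] for f in follow_ups)
    # Only scaffold follow-ups carry an intensity (0 to 3).
    trajectory = [f["intensity"] for f in follow_ups if f["type"] == "scaffold"]
    non_increasing = all(a >= b for a, b in zip(trajectory, trajectory[1:]))
    return {
        "examinerFollowUpDepth": len(follow_ups),
        "followUpTypeDistribution": dict(types),
        "scaffoldingTrajectory": trajectory,
        "scaffoldingWithdrawn": non_increasing,  # hypothetical derived flag
    }
```

A trajectory such as `[3, 1]` is non-increasing, which is the downward trend that graduated withdrawal is expected to produce.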


Candidates may issue commands during the exam (e.g., “repeat the question”, “can you clarify?”, “I need a moment”). These MUST be handled as structured commands, not as assessment evidence.

| Command | Detection Method | Behaviour |
| --- | --- | --- |
| repeat | Keyword/phrase detection + LLM intent classification | Re-present the current question or follow-up verbatim. MUST NOT count as a follow-up. MUST NOT reset the time budget. |
| clarification | LLM intent classification | LLM rephrases or explains the question instructions. MUST NOT reveal the model answer. Counts as one clarification_used. |
| request_rephrase | LLM intent classification | Candidate asks “can you say that differently?”. Distinct from repeat (which re-presents verbatim) and clarification (which explains instructions). The LLM generates a different phrasing of the same question. MUST NOT reveal the model answer. Counts as one clarification_used. |
| slow_down | Keyword detection | LLM reduces speech rate. Runtime adjusts TTS speed. |
| pause | Explicit request | Transition to paused state (see §2). |
| thinking_aloud | LLM intent classification | Candidate says “let me think about that for a moment” or similar. Signals metacognitive awareness. Runtime emits candidate_thinking event. MUST NOT consume a pause, because the candidate is still engaged. LLM waits silently. Time budget continues. |
| help | LLM intent classification | Provide general exam instructions (not question-specific help). |
| skip | LLM intent classification | Request to skip current node. Runtime MAY honour if policy allows; otherwise MUST refuse and explain. |
| revise_earlier_answer | LLM intent classification | Candidate wants to revisit or amend a previous answer. Runtime MAY honour if the previous node is still within a configurable revision window. If honoured, emits answer_revision event. MUST NOT be used to revisit nodes from a different topic area. |
| finish | LLM intent classification | Candidate wants to end the exam. Runtime MUST confirm (“Are you sure?”) before processing. |

  1. Commands MUST be detected before the candidate’s utterance is evaluated for evidence. A repeat command MUST NOT generate evidence signals.
  2. Commands MUST be processed by the Runtime Controller, not directly by the LLM. The LLM detects intent; the Runtime decides the action.
  3. The Runtime MUST emit a candidate_command event for every detected command, including the command type and raw utterance.
  4. Commands MUST NOT count toward maxFollowUps.
  5. Commands MUST consume time from the time budget (the candidate is using exam time).
  6. If the LLM cannot classify an utterance as either a command or an answer with high confidence, the Runtime SHOULD treat it as an answer and let the LLM proceed accordingly.

The Runtime MUST enforce:

  • Maximum 3 repeat commands per node. After that, the Runtime MUST emit command_repeat_limit_reached and present the question in written form via data channel instead.
  • Maximum 2 clarification commands per node. After that, the Runtime MUST emit command_clarify_limit_reached and proceed.
  • No rate limit on pause, but pause duration counts against the exam time budget.
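
The per-node limits above can be sketched as a small counter owned by the Runtime Controller. This is an illustrative sketch only: `NodeCommandLimiter`, `RuntimeEvent`, and the event payload shape are hypothetical, and routing request_rephrase through the clarification counter is an assumption based on both commands counting as clarification_used.

```typescript
type CandidateCommand = "repeat" | "clarification" | "request_rephrase";
type LimitedCounter = "repeat" | "clarification";

// Hypothetical event payload; the spec only names the events.
interface RuntimeEvent {
  name: string;
  commandType: CandidateCommand;
  rawUtterance: string;
}

const LIMITS: Record<LimitedCounter, { max: number; limitEvent: string }> = {
  repeat: { max: 3, limitEvent: "command_repeat_limit_reached" },
  clarification: { max: 2, limitEvent: "command_clarify_limit_reached" },
};

// Assumption: request_rephrase shares the clarification counter.
const COUNTER_FOR: Record<CandidateCommand, LimitedCounter> = {
  repeat: "repeat",
  clarification: "clarification",
  request_rephrase: "clarification",
};

class NodeCommandLimiter {
  private readonly used: Record<LimitedCounter, number> = { repeat: 0, clarification: 0 };

  // Returns whether the command is honoured and which events would be emitted.
  handle(command: CandidateCommand, rawUtterance: string): { honoured: boolean; events: RuntimeEvent[] } {
    const counter = COUNTER_FOR[command];
    const limit = LIMITS[counter];
    // Every detected command is reported, whether or not it is honoured.
    const events: RuntimeEvent[] = [{ name: "candidate_command", commandType: command, rawUtterance }];
    if (this.used[counter] >= limit.max) {
      events.push({ name: limit.limitEvent, commandType: command, rawUtterance });
      return { honoured: false, events };
    }
    this.used[counter] += 1;
    return { honoured: true, events };
  }
}
```

A new limiter would be created each time a node becomes active, so the counts reset per node. What happens after the limit event (for example, presenting the question in written form) stays with the caller.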

11. Guardrail Enforcement Semantics

Guardrails ensure the LLM examiner operates within the boundaries defined in the specification. Guardrails are hard constraints enforced by the Runtime, not prompt-level instructions to the LLM.

| Guardrail | Scope | Enforcement |
| --- | --- | --- |
| Max follow-ups | Per node | Runtime counter (§9.3). LLM MUST NOT be invoked for follow-up when counter at max. |
| Time budget | Per node, per exam | Runtime timer. Enforced even if LLM wants to continue. |
| Hint refusal | Per node | Runtime filters LLM output. If LLM response contains content matching the node’s modelAnswer or rubricPhrases, the response MUST be intercepted and replaced with a neutral re-prompt. |
| No rubric reveal | Global | Runtime MUST NOT pass rubric scoring weights, grade boundaries, or model answers to the LLM. However, the Runtime MUST pass rubric criteria and evidence vocabulary to the LLM: in IOA practice, rubric criteria are the conversation guide (sentence-starters), not a secret. The LLM uses criteria to know what to listen for, not how to score. The distinction: criteria describe observable competencies (“explains the mechanism”, “evaluates trade-offs”); scoring logic maps those to marks (criterion X = 5 marks if excellent, 3 if adequate). Only the former is shared. |
| No scoring | Global | Runtime MUST NOT pass scoring logic (grade boundaries, mark weights, score ranges) to the LLM. The LLM emits evidence signals as observations; scoring happens in the marking pipeline. The LLM MAY know rubric criteria as evidence vocabulary (what to listen for), but MUST NOT know how those criteria map to marks. |
| No structure change | Global | Runtime MUST NOT allow the LLM to add, remove, or reorder nodes. The LLM operates within the current node only. |
| No premature end | Per node | Runtime MUST NOT end a node before the main question has been asked and at least one candidate response received. |
| No topic jump | Per node | Runtime MUST constrain the LLM’s context to the current node’s topic. The LLM MUST NOT reference content from future nodes. |
| Off-topic handling | Per turn | If the LLM signals off_topic, Runtime MUST increment a per-node offTopicCount. After maxOffTopicRedirects (default: 2), Runtime MUST mark node as best-effort. |
| Silence handling | Per turn | If silence exceeds silenceTimeoutMs, Runtime MUST trigger a silence prompt (not an LLM follow-up). After maxSilencePrompts (default: 2), Runtime MUST mark node as best-effort. |
| Candidate anxiety | Per turn | If the LLM signals candidate_anxiety, Runtime MAY extend the time budget by anxietyTimeExtensionMs (configurable). The Runtime MUST NOT reduce difficulty or simplify questions. |
| Technical failure | Per event | See §6. Guardrails apply even in degraded mode: the Runtime MUST NOT bypass max-follow-ups or time budgets during recovery. |
| Persona consistency | Per node | If the specification defines a persona for the node (e.g., “hotel manager”), the Runtime MUST validate that every LLM spokenText output stays in character. The output validation pipeline MUST check for persona-break patterns (e.g., “As your examiner…”, “In this assessment…”). On violation, Runtime re-prompts with persona reminder. Emits guardrail_triggered with type persona_break. |
| Equity: communication style | Global | Unless communicationStyleIsLearningOutcome: true in the specification, the LLM MUST NOT penalise or comment on accent, fluency, verbal confidence, or speech patterns. The Runtime MUST filter evidence signals that reference communication quality when the flag is false. |
| Rapport and tone calibration | Per node | The LLM SHOULD build rapport through natural dialogue moves (acknowledgement, encouragement, reassurance) that are distinct from follow-ups. These moves MUST NOT count toward maxFollowUps. The LLM SHOULD adapt warmth based on context: warmer at node start, warmer when candidate struggles, more formal during technical probing. Rapport moves MUST NOT cross into assessment bias: encouragement like “Take your time” is permitted; “You’re doing great” is NOT (it provides implicit evaluative feedback). The Runtime MUST log rapport moves for quality assurance. |
| Neutrality in prompting | Per node | The LLM’s follow-up prompts MUST aim for neutrality as defined by Pearce & Chiavaroli (2020): “neither discourages nor reassures the student.” This is distinct from rapport: an examiner can be warm (rapport) while remaining assessment-neutral (prompting). The LLM MUST NOT provide evaluative feedback during the assessment (e.g., “Good answer”, “That’s not quite right”). The Runtime’s output validation pipeline MUST check for evaluative language patterns. |
| Examiner-initiated pause for welfare | Per turn | If the LLM signals distress_detected (not just anxiety), the Runtime MAY initiate a pause with a welfare message. This is distinct from candidate-initiated pause: the examiner proactively offers a break. Emits welfare_pause_offered event. |
| Time budget fairness | Per node | The Runtime MUST track whether the candidate received substantially different time-on-task compared to the node’s configured budget. If the LLM’s follow-up strategy causes a node to end significantly early (e.g., < 50% of time budget used), the Runtime SHOULD log this for fairness review. Different candidates should have comparable opportunities to demonstrate competence. |

When a guardrail is triggered:

  1. Runtime MUST emit a guardrail_triggered event with the guardrail type, context, and action taken.
  2. If the violation is LLM-caused (e.g., LLM attempted to reveal rubric), Runtime MUST intercept the output, replace it, and log the attempt.
  3. If the violation is structural (e.g., time budget exceeded), Runtime MUST enforce the hard limit regardless of LLM state.
  4. Repeated guardrail violations (configurable threshold) SHOULD trigger an alert to the proctoring system.
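
The trigger-handling steps above might be modelled as follows. This is a sketch under stated assumptions: `GuardrailMonitor`, the event field names, and the single exam-wide violation counter are hypothetical; the spec only requires that the event carries the guardrail type, context, and action taken, and that the alert threshold is configurable.

```typescript
// Hypothetical event shape for guardrail_triggered.
interface GuardrailTriggered {
  event: "guardrail_triggered";
  guardrailType: string;
  nodeId: string;      // context in which the guardrail fired
  actionTaken: string; // e.g. "output_replaced", "hard_limit_enforced"
}

class GuardrailMonitor {
  readonly emitted: GuardrailTriggered[] = [];
  private violations = 0;
  private readonly alertThreshold: number;

  constructor(alertThreshold: number) {
    this.alertThreshold = alertThreshold;
  }

  // Records the event and reports whether the proctoring alert threshold is reached.
  trigger(guardrailType: string, nodeId: string, actionTaken: string): { alertProctoring: boolean } {
    this.emitted.push({ event: "guardrail_triggered", guardrailType, nodeId, actionTaken });
    this.violations += 1;
    return { alertProctoring: this.violations >= this.alertThreshold };
  }
}
```

Intercepting and replacing LLM output, or enforcing a hard limit, would happen in the caller before `trigger` is invoked; the monitor only records what was done.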

Every LLM response during an exam turn MUST pass through the Runtime’s output validation pipeline before being presented to the candidate:

  1. Content filter: Check against forbiddenPhrases (model answer fragments, rubric terms).
  2. Topic filter: Check that the response references only the current node’s topic scope.
  3. Action filter: Check that the response does not attempt to transition, score, or end the exam (these are Runtime actions).
  4. Length filter: Check that the response does not exceed maxResponseLength.

If validation fails, the Runtime MUST:

  • Log the violation.
  • Invoke the LLM again with a corrected prompt (e.g., “Please rephrase without mentioning [X]”).
  • If re-invocation also fails, use a canned fallback response.
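
A minimal sketch of the validate, re-invoke, and fall-back sequence is shown below. It covers only the content and length filters, since the topic and action filters would need a classifier that is out of scope here. `validateOutput`, `respondWithValidation`, and the `invokeLlm` callback are hypothetical names, the LLM call is synchronous for brevity, and measuring maxResponseLength in characters is an assumption.

```typescript
interface ValidationConfig {
  forbiddenPhrases: string[];
  maxResponseLength: number; // assumed to be measured in characters
}

interface Violation {
  filter: "content" | "length";
  detail: string;
}

function validateOutput(text: string, cfg: ValidationConfig): Violation[] {
  const violations: Violation[] = [];
  const lower = text.toLowerCase();
  for (const phrase of cfg.forbiddenPhrases) {
    if (lower.includes(phrase.toLowerCase())) {
      violations.push({ filter: "content", detail: phrase });
    }
  }
  if (text.length > cfg.maxResponseLength) {
    violations.push({ filter: "length", detail: `${text.length} chars` });
  }
  return violations;
}

function respondWithValidation(
  invokeLlm: (correction?: string) => string,
  cfg: ValidationConfig,
  fallback: string,
  log: (violations: Violation[]) => void,
): string {
  const first = invokeLlm();
  const firstViolations = validateOutput(first, cfg);
  if (firstViolations.length === 0) return first;
  log(firstViolations);

  // One corrected re-invocation, then the canned fallback.
  const details = firstViolations.map((v) => v.detail).join(", ");
  const second = invokeLlm(`Please rephrase without mentioning: ${details}`);
  const secondViolations = validateOutput(second, cfg);
  if (secondViolations.length === 0) return second;
  log(secondViolations);
  return fallback;
}
```

Capping the loop at one retry keeps the worst-case latency bounded, which matters when the candidate is waiting on a spoken reply.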

12. Transcript and Evidence Capture Semantics

Every exam MUST produce a structured transcript. The transcript is NOT just raw STT output — it is a structured sequence of turns, each annotated with metadata.

interface TranscriptTurn {
  turnId: string;                    // Unique turn identifier
  nodeId: string;                    // Node this turn belongs to
  turnIndex: number;                 // Sequential index within the node
  role: "candidate" | "examiner";    // Speaker
  content: string;                   // Text (STT output or LLM output)
  timestamp: number;                 // Unix ms
  durationMs: number;                // Duration of the utterance
  metadata: {
    isCommand: boolean;              // Was this a candidate command?
    commandType?: string;            // If command, which type?
    isFollowUp: boolean;             // Was this a follow-up question?
    followUpIndex?: number;          // If follow-up, which one?
    isSilence: boolean;              // Was this a silence event?
    isOffTopic: boolean;             // Was the candidate off-topic?
    confidence: number;              // STT confidence score (0–1)
  };
}

Evidence signals are the structured observations that feed the marking pipeline. They are NOT scores — they are facts about what the candidate demonstrated.

In IOA practice, evidence signals are rubric criteria. The rubric defines what competencies to look for; the evidence signal records whether (and how well) the candidate demonstrated that competency. The Runtime MUST ensure that every signalType in the specification maps to a rubric criterion, and that the LLM receives these as the vocabulary of “what to listen for.”

interface EvidenceSignal {
  signalId: string;                  // Unique signal identifier
  nodeId: string;                    // Node where signal was observed
  signalType: string;                // Type from specification — MUST map to a rubric criterion
  rubricLevel?: string;              // Observed rubric level (e.g., "description", "analysis", "evaluation")
  transversalSkills?: string[];      // Cross-cutting skills observed (e.g., ["critical_thinking", "professional_reasoning"])
  confidence: number;                // LLM's confidence that this signal was observed (0–1)
  turnId: string;                    // Which turn triggered this signal
  excerpt: string;                   // Short excerpt from candidate's response
  timestamp: number;                 // When the signal was observed
  source: "llm_observed" | "runtime_detected";  // Who observed it
  scaffoldingIntensity?: number;     // 0–3 scale: amount of scaffolding provided before this signal
                                     // 0 = no scaffolding (independent answer)
                                     // 1 = minimal (rephrasing/redirect)
                                     // 2 = moderate (conceptual hint)
                                     // 3 = heavy (simplified question/broken into parts)
  scaffoldingEffective?: boolean;    // Did the candidate improve after scaffolding? Only set when scaffoldingIntensity > 0.
}

Scaffolding as Evidence: The scaffoldingIntensity and scaffoldingEffective fields capture assessment information that would otherwise be lost. Joughin (1998) identifies that oral assessment can measure “applied problem solving”, and the amount of scaffolding a candidate needs is itself evidence of their problem-solving competence. A candidate who answers correctly after heavy scaffolding (intensity 3) demonstrated a different competence level than one who answers independently (intensity 0). The marking pipeline MUST use scaffolding intensity as a modifier when evaluating evidence signals, not as a separate score.

Transversal Skills: IOA research identifies cross-cutting competencies (critical thinking, communication, problem-solving, professional reasoning) that span multiple nodes. The transversalSkills field allows the LLM to tag evidence signals with these cross-cutting observations. The marking pipeline MUST aggregate transversal skill signals across all nodes to produce a holistic competency profile. Transversal skill vocabulary is defined in the specification at the exam level, not per-node.

  1. Evidence signals MUST be emitted by the LLM during the processing state (see §4).
  2. The Runtime MUST validate that the signalType is defined in the current node’s evidenceSignals in the specification. Unknown signal types MUST be logged and discarded.
  3. The Runtime MUST write validated signals to the Evidence Ledger immediately — not batched at node completion.
  4. The LLM MAY emit multiple signals per turn (candidate may demonstrate several competencies in one answer).
  5. The Runtime MUST NOT allow the LLM to emit signals for a node that is not currently active.
  6. Duplicate signals (same signalType, same turnId) MUST be deduplicated by the Runtime.
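
The validation, active-node, and deduplication rules above can be illustrated with a small in-memory ledger. This is a sketch, not the reference implementation: `EvidenceLedger`, `EmittedSignal`, the result labels, and the `allowedByNode` lookup (standing in for each node's evidenceSignals list) are hypothetical.

```typescript
// Reduced signal shape: only the fields the rules above depend on.
interface EmittedSignal {
  signalType: string;
  nodeId: string;
  turnId: string;
}

type SubmitResult = "written" | "inactive_node" | "unknown_type" | "duplicate";

class EvidenceLedger {
  readonly entries: EmittedSignal[] = [];
  readonly discardLog: EmittedSignal[] = [];
  private readonly seenKeys = new Set<string>();
  private readonly allowedByNode: Map<string, Set<string>>;

  constructor(allowedByNode: Map<string, Set<string>>) {
    this.allowedByNode = allowedByNode;
  }

  submit(signal: EmittedSignal, activeNodeId: string): SubmitResult {
    // Signals for a node that is not currently active are rejected.
    if (signal.nodeId !== activeNodeId) return "inactive_node";

    // Unknown signal types are logged and discarded.
    const allowed = this.allowedByNode.get(signal.nodeId);
    if (allowed === undefined || !allowed.has(signal.signalType)) {
      this.discardLog.push(signal);
      return "unknown_type";
    }

    // Same signalType on the same turn is a duplicate.
    const key = JSON.stringify([signal.signalType, signal.turnId]);
    if (this.seenKeys.has(key)) return "duplicate";
    this.seenKeys.add(key);

    this.entries.push(signal); // written immediately, not batched at node completion
    return "written";
  }
}
```

A production ledger would persist each entry durably at the point marked "written immediately"; the array here only stands in for that store.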

The LLM MAY signal evidence_sufficient during processing if it believes enough signals have been collected. However:

  • The Runtime MUST NOT auto-complete the node based solely on LLM judgment unless autoCompleteOnSufficient is enabled in the specification policy.
  • Even with autoCompleteOnSufficient, the Runtime MUST verify that at least CompletionPolicy.requiredEvidenceCount distinct evidence targets have been satisfied.
  • The LLM’s sufficiency judgment is advisory; the Runtime’s requiredEvidenceCount check is authoritative.
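
The relationship between the advisory LLM judgment and the authoritative Runtime check can be expressed as a single predicate. In this sketch, `mayAutoComplete` is a hypothetical helper, and placing autoCompleteOnSufficient on the same policy object as requiredEvidenceCount is an assumption.

```typescript
interface CompletionPolicy {
  requiredEvidenceCount: number;
  autoCompleteOnSufficient: boolean; // assumed to live alongside requiredEvidenceCount
}

// The LLM's evidence_sufficient signal is necessary but never sufficient on its own.
function mayAutoComplete(
  policy: CompletionPolicy,
  satisfiedTargets: ReadonlySet<string>, // distinct evidence targets satisfied so far
  llmSignalsSufficient: boolean,
): boolean {
  if (!llmSignalsSufficient || !policy.autoCompleteOnSufficient) return false;
  return satisfiedTargets.size >= policy.requiredEvidenceCount;
}
```

Using a set for the satisfied targets makes the "distinct" requirement explicit: repeated signals for the same target do not raise the count.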

At exam completion (or abort), the Runtime MUST:

  1. Write all pending transcript turns to the transcript store.
  2. Write all pending evidence signals to the Evidence Ledger.
  3. Emit a transcript_finalised event.
  4. Compute and store a transcriptHash (SHA-256 of the canonicalised transcript) for integrity verification.

The transcript MUST be immutable after transcript_finalised. No post-hoc edits are permitted. If a correction is needed, it MUST be appended as a correction record with its own hash chain.
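
One possible way to compute transcriptHash and chain correction records is sketched below using Node.js's built-in `node:crypto` module. The spec does not define the canonical form, so the recursive key-sorted JSON used here is an assumption, as are the `canonicalise`, `transcriptHash`, and `appendCorrection` names and the `CorrectionRecord` shape.

```typescript
import { createHash } from "node:crypto";

// Assumed canonical form: JSON with object keys sorted recursively and
// undefined properties dropped, so key order never changes the hash.
function canonicalise(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalise).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .filter((k) => obj[k] !== undefined)
      .map((k) => `${JSON.stringify(k)}:${canonicalise(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value ?? null);
}

function sha256Hex(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex");
}

function transcriptHash(turns: unknown[]): string {
  return sha256Hex(canonicalise(turns));
}

// Hypothetical correction record: each one commits to the previous hash.
interface CorrectionRecord {
  correction: unknown;
  prevHash: string;
  hash: string;
}

function appendCorrection(prevHash: string, correction: unknown): CorrectionRecord {
  return { correction, prevHash, hash: sha256Hex(prevHash + canonicalise(correction)) };
}
```

The first correction would take the transcriptHash as its `prevHash`, and each later correction would take the previous record's `hash`, so altering any earlier record invalidates every hash after it.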

The Runtime MUST produce a MarkingPackage containing:

  • The full structured transcript.
  • The evidence ledger (all signals, with node association, transversal skill tags, and scaffolding intensity).
  • Exam metadata (candidate ID, exam ID, start/end timestamps, duration).
  • Node completion statuses (completed / best_effort / skipped).
  • Any guardrail events triggered during the exam.
  • Any assessment failure events (scenario clarification, difficulty adjustment, welfare checks).
  • The transcript hash for integrity verification.
  • Conversation fingerprint (SHA-256 of the ordered conversation path: node sequence + follow-up types + turn count per node). This gives auditors evidence that each exam instance followed its own path, which is critical for academic integrity auditing. Two candidates on the same specification will usually have different fingerprints because the conversation unfolds differently; identical paths yield identical fingerprints.
  • Assessment equivalence (e.g., equivalentWrittenWordCount: 3000) from the specification, for calibration reference.
  • Scaffolding metadata (if scaffolding was used): number of practice turns, whether candidate skipped early. Scaffolding transcript is NOT included.
  • Conversation quality metrics per node: candidate turn count, examiner follow-up depth, average candidate response latency, longest candidate monologue, follow-up type distribution, and scaffolding trajectory. See §9.7.
  • Psychometric equivalence summary: aggregate statistics for marking pipeline analysis — average follow-ups per node, follow-up type distribution, average time per node, scaffolding intensity distribution, and a conversation path variance score (0.0 = identical paths across candidates, 1.0 = maximally different). This enables the marking pipeline to detect whether different candidates received substantially different assessment experiences (see Joughin, 1998, on reliability threats from inconsistent questioning).

This package is the sole input to the marking pipeline. The marking pipeline MUST NOT need to reconstruct state from raw STT output or LLM conversation history.
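
As an illustration of the conversation fingerprint listed in the package contents, the sketch below hashes the ordered path with Node.js's `node:crypto`. The `NodePath` shape, the `conversationFingerprint` name, and the JSON serialisation of the path are assumptions; the spec only states which inputs are hashed.

```typescript
import { createHash } from "node:crypto";

// Hypothetical summary of one visited node, in visit order.
interface NodePath {
  nodeId: string;
  followUpTypes: string[]; // in the order the follow-ups were issued
  turnCount: number;
}

function conversationFingerprint(path: NodePath[]): string {
  // Arrays keep ordering significant: node sequence, then follow-up sequence.
  const serialised = JSON.stringify(
    path.map((n) => [n.nodeId, n.followUpTypes, n.turnCount]),
  );
  return createHash("sha256").update(serialised, "utf8").digest("hex");
}
```

Because only structural features are hashed, not the words spoken, the fingerprint can be shared with auditors without exposing transcript content.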


Appendix A: Time Budget Calibration Guidance

Non-normative. This appendix provides heuristics for setting per-node and per-exam time budgets. These are derived from the oral assessment literature and are intended as starting points, not hard requirements.

| Factor | Guideline | Source |
| --- | --- | --- |
| Per-question time | 5–7 minutes for theoretical questions | Akimov & Malin (2020): “around five minutes’ response time” per theoretical question |
| Follow-up time | 5–10 minutes total per node | Akimov & Malin (2020): “five to ten minutes were allocated for follow-up questions” |
| Total exam duration | 15–30 minutes for a clear picture of understanding | Fenton (2025, citing Sayre, 2014): “it should take no more than 20 minutes to get a clear picture”; Akimov & Malin (2020): 30-minute blocks |
| Response time per question | 60 seconds max for concise answers | Bayley et al. (2024): “only the first 60 seconds of each of their responses would be graded” |
| Anxiety extension | Configurable, default 2 minutes | Spec design; Akimov & Malin (2020): students found shorter exams “rushed” |
| Practice/familiarisation | 5–10 minutes before assessment begins | Bayley et al. (2024): practice ConVOE; Fenton (2025): “opportunities to practice where no marks are allocated” |

  • Too short: Creates time pressure that disadvantages candidates who think more slowly or are answering in a second language (Akimov & Malin, 2020: students found 10-minute exams “rushed”).
  • Too long: Leads to fatigue and reduced engagement. Fenton (2025) recommends “less is more.”
  • Time as fairness variable: If two candidates get the same node but one uses 3 minutes and the other uses 8 minutes (because the LLM asked different follow-ups), the Runtime tracks this difference in the conversation quality metrics (§9.7). The marking pipeline SHOULD consider whether time-on-task differences correlate with score differences.
  • Proportionality: Time budget SHOULD be proportional to the number of evidence targets on the node. A node with 3 evidence targets needs more time than one with 1.

Fairness Note: The psychometric equivalence summary does not constrain the LLM’s adaptiveness — it measures it. The marking pipeline can use this data to assess whether conversation path variance is correlated with score variance. If candidates who received harder follow-ups systematically score lower, this indicates a fairness problem that requires investigation (Akimov & Malin, 2020, Table 4: “It is hard to determine whether students perform significantly differently if follow-up questions are different”).

| Version | Date | Changes |
| --- | --- | --- |
| v0.2.0 | 2026-06-30 | Added anxiety detection and distress handling semantics. Updated terminology from ‘Exam Runtime IR’ to ‘IOA-ORM’. Refined state machine transitions for welfare checks. |
| v0.1.0 | 2026-05-06 | Initial release. |