Runtime Semantics

Status

Draft · v0.2.0 · 2026-06-30

Status: Draft

Scope: Defines the execution model, state machines, lifecycle transitions, and semantic contracts that govern how an IOA-ORM is executed. All normative statements use RFC 2119 language (MUST / SHOULD / MAY / MUST NOT).

Runtime Execution Model
Exam Lifecycle State Machine
Node Lifecycle State Machine
Turn Lifecycle State Machine
Candidate Readiness State
Recovery State
Completion Semantics
Transition Semantics
Follow-up Semantics
Candidate Command Semantics
Guardrail Enforcement Semantics
Transcript and Evidence Capture Semantics

Appendices

Appendix A: Time Budget Calibration Guidance

1. Runtime Execution Model

1.0 IOA Design Alignment

Interactive Oral Assessment (IOA) research (Ward et al., 2023; Sotiriadou et al., 2016) establishes that effective oral assessments are scenario-based, free-flowing conversations — not scripted question-and-answer sessions. The Runtime MUST respect this principle: nodes represent assessment scenarios with rubric-aligned conversation objectives, not rigid question slots. The LLM examiner drives an authentic, professionally-focused dialogue, using rubric criteria as conversation guides (sentence-starters), not as a quiz script.

Theoretical Foundations

The runtime semantics are grounded in four key dimensions from the oral assessment literature:

Interaction (Joughin, 1998, pp. 370–371). Oral assessment’s principal advantage is the capacity for bidirectional adaptation — “each [statement] includes a response to that made by the other participant” creating “inherent unpredictability in which neither party knows in advance exactly what questions will be asked or what responses will be made.” The Runtime preserves this by giving the LLM autonomy over dialogue strategy within nodes while maintaining structural control over transitions. However, Joughin warns that “the social interaction entailed in oral assessment may distort communication and affect both a candidate’s performance and how that performance is perceived by the examiner” (p. 370). The guardrails in §11 address this risk.

Authenticity (Joughin, 1998, pp. 371–372; Fenton, 2025, p. 431). IOA research positions oral assessments as “a form of assessment asking students to perform real-world tasks to demonstrate meaningful application of necessary knowledge and skills” (Sotiriadou et al., 2020, cited in Fenton, 2025). The scenario-based node design directly implements Joughin’s “contextualised” pole of the authenticity continuum. The persona system (§11) ensures the examiner stays in character, maintaining the professional context throughout the interaction.

Reliability through Structure (Joughin, 1998, p. 376; Akimov & Malin, 2020, Table 4). Joughin identifies that “reliability is threatened when the ‘interaction’ dimension tends towards the ‘dialogue’ pole, when the ‘structure’ dimension tends towards the ‘open’ pole.” The spec addresses this through a deliberate design choice: closed structure between nodes (Runtime-enforced transitions, deterministic ordering) with open dialogue within nodes (LLM-driven adaptive questioning). This hybrid preserves the validity benefits of dialogue while maintaining the reliability benefits of structure. The follow-up counter and time budget (§9) are hard structural constraints that prevent unbounded dialogue.

Prompting Neutrality (Pearce & Chiavaroli, 2020, cited in Fenton, 2025, p. 434). The spec operationalises four guiding principles for examiner prompting: neutrality (guardrails prevent reassurance or discouragement), consistency (same follow-up policy across nodes), transparency (candidates receive scaffolding practice before assessment), and reflexivity (examiner utterances are logged for post-hoc quality review).

IOA Component Mapping

The six IOA components (scaffolding, scenario-based, aligned to program, learning outcomes, accessible/equitable, professionally-focused) inform the following runtime semantics:

Scenario context is provided to the LLM as scene-setting, not as a question prompt.
Rubric criteria are shared with the LLM as evidence vocabulary (what to listen for), not as scoring weights or model answers.
Conversation flow is emergent — the LLM adapts its prompting based on what the candidate demonstrates, nudging toward higher rubric levels when opportunity arises.
Scaffolding operates at two levels: (1) pre-exam familiarisation (§2.1), where the candidate practises the IOA format without stakes, and (2) in-assessment scaffolding (§9.6), where the examiner adjusts support based on candidate performance within the Zone of Proximal Development.
Equity is enforced by not penalising communication style unless it is a declared learning outcome (§11.2). Fairness is further supported by deterministic policy enforcement for transitions and completion, preventing LLM-driven inconsistency across candidates.

1.1 Single-Active-Node Principle

The Runtime MUST maintain exactly one active node at any time during an in-progress exam. When the active node completes or transitions, the Runtime MUST atomically deactivate the current node and activate the next node — there MUST NOT be a gap where zero or more than one node is active.

1.2 Runtime Controller

A Runtime Controller owns all authoritative state. The LLM agent is a tool invoked by the Runtime Controller — it does NOT own state, does NOT decide transitions unilaterally, and does NOT persist evidence directly. The boundary is:

Concern	Owner
Which node is active	Runtime Controller
When a turn starts/ends	Runtime Controller
How many follow-ups have been asked	Runtime Controller
Time budget remaining	Runtime Controller
Evidence ledger writes	Runtime Controller (LLM emits signals, Runtime persists)
Question wording, follow-up phrasing, bridge text	LLM (within constraints)
Judging answer sufficiency	LLM emits opinion; Runtime applies policy

1.3 Execution Loop

For each active node, the Runtime Controller executes the following loop. Note that this is a conversation loop, not a rigid Q&A loop — the LLM drives an authentic dialogue, and the loop represents the runtime’s state tracking, not the conversation’s structure.

Pipecat Mapping: Steps marked with 🟢 execute in Pipecat pipeline/FlowManager. Steps marked with 🔵 execute in Runtime Controller. The LLM calls report_observation (the single allowed function) after processing each candidate response; the Runtime Controller handler evaluates the observation and decides the next action.

while node is active:
    1. 🔵 Enter node → emit node_entered, set time budget, reset counters
    2. 🔵 Build NodeConfig from specification (scenario, rubric criteria, persona, constraints)
    3. 🟢 FlowManager.set_node_from_config(config)
    4. 🟢 LLM initiates conversation (task_messages drive the opening)
    5. 🟢 STT captures candidate response → text flows to LLM
    6. 🟢 LLM processes response, calls report_observation(...)
    7. 🔵 Runtime Controller handler receives observation:
        a. If commandDetected → dispatch command (see §10), return
        b. Validate spokenText through output filters (content/topic/action/length)
        c. Write evidence signals to Evidence Ledger
        d. Check guardrails:
            - Time budget exceeded? → transition to best-effort
            - Follow-up requested?
                - Counter < maxFollowUps → inject follow-up context, continue
                - Counter >= maxFollowUps → transition to best-effort
            - Off-topic?
                - Count < maxOffTopicRedirects → inject redirect, continue
                - Count >= maxOffTopicRedirects → transition to best-effort
            - Evidence sufficient + requiredEvidenceCount met? → transition to completed
            - Anxiety detected? → extend time budget
        e. If transitioning:
            - Finalize current node evidence
            - Build next node's NodeConfig
            - 🔵 call flow_manager.set_node_from_config(next_config)
        f. If continuing: return validated spokenText to LLM for TTS
    8. 🟢 TTS speaks the response to candidate
    9. Loop back to 5
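The guardrail checks in step 7d can be read as a pure decision function over Runtime-owned counters. The following is a minimal sketch, not the normative implementation: the names `NodeState`, `NodeLimits`, `Observation`, `Action`, and `decide` are hypothetical, the observation fields are an assumed shape for the `report_observation` payload, and the anxiety-driven time extension is left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    FOLLOW_UP = "follow_up"
    REDIRECT = "redirect"
    COMPLETE = "completed"
    BEST_EFFORT = "best_effort"


@dataclass
class NodeState:
    # Hypothetical per-node counters owned by the Runtime Controller.
    follow_up_count: int = 0
    off_topic_count: int = 0
    evidence_count: int = 0
    elapsed_ms: int = 0


@dataclass
class NodeLimits:
    max_follow_ups: int
    max_off_topic_redirects: int
    required_evidence_count: int
    time_budget_ms: int


@dataclass
class Observation:
    # Assumed shape of the report_observation payload (illustrative only).
    follow_up_requested: bool = False
    off_topic: bool = False
    evidence_sufficient: bool = False


def decide(state: NodeState, limits: NodeLimits, obs: Observation) -> Action:
    """Apply the step 7d checks in the order they are listed in the loop."""
    if state.elapsed_ms >= limits.time_budget_ms:
        return Action.BEST_EFFORT  # time budget exhausted
    if obs.follow_up_requested:
        if state.follow_up_count < limits.max_follow_ups:
            return Action.FOLLOW_UP
        return Action.BEST_EFFORT
    if obs.off_topic:
        if state.off_topic_count < limits.max_off_topic_redirects:
            return Action.REDIRECT
        return Action.BEST_EFFORT
    if obs.evidence_sufficient and state.evidence_count >= limits.required_evidence_count:
        return Action.COMPLETE
    return Action.CONTINUE
```

Keeping the function free of side effects means the LLM's opinion (`obs`) is only an input; the caller (the Runtime Controller) remains responsible for incrementing counters and emitting events.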

1.4 Thread Model

The Runtime MUST be single-threaded per exam instance. All state mutations for a given exam MUST be serialised. Event emission MUST occur before the corresponding state mutation is considered committed.

1.5 Idempotency

Every Runtime operation that mutates state MUST be idempotent with respect to its event sequence number. Re-applying the same event (e.g., during recovery) MUST produce the same resulting state.
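One way to satisfy this requirement is to record the last applied sequence number alongside the state and ignore any event at or below it. The sketch below assumes a hypothetical `Event` record and a reduced `ExamState` that tracks only follow-up counters; the real state would carry more fields.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    seq: int       # monotonically increasing per exam instance (assumed)
    type: str
    node_id: str


@dataclass
class ExamState:
    last_applied_seq: int = 0
    follow_up_counts: dict[str, int] = field(default_factory=dict)


def apply(state: ExamState, event: Event) -> ExamState:
    """Apply an event once; re-applying the same seq is a no-op."""
    if event.seq <= state.last_applied_seq:
        return state  # already applied, e.g. during recovery replay
    if event.type == "node_entered":
        state.follow_up_counts[event.node_id] = 0
    elif event.type == "follow_up_issued":
        current = state.follow_up_counts.get(event.node_id, 0)
        state.follow_up_counts[event.node_id] = current + 1
    state.last_applied_seq = event.seq
    return state
```

Replaying the same event list twice leaves the counters unchanged after the first pass, which is the property recovery relies on.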

2. Exam Lifecycle State Machine

2.1 States

State	Description
`created`	Exam instance created from published AssessmentPackage. Candidate assigned.
`scheduled`	Exam scheduled with a specific time window.
`scaffolding`	Optional practice phase. Candidate experiences the IOA format with a practice scenario that does NOT count toward the score. Emits `scaffolding_started` / `scaffolding_completed`.
`ready`	All pre-conditions met (candidate authenticated, audio/video checks passed, session token issued, scaffolding completed or skipped).
`in_progress`	Exam is live; at least one node has been entered.
`paused`	Exam temporarily suspended (candidate-initiated pause or system-initiated).
`completed`	All nodes processed or exam explicitly ended. Terminal.
`aborted`	Exam ended prematurely due to violation, timeout, or system failure. Terminal.
`expired`	Exam time window closed before completion. Terminal.

2.2 Transitions

created ──[assign_candidate]──► scheduled
scheduled ──[preconditions_met, scaffolding_enabled]──► scaffolding
scheduled ──[preconditions_met, scaffolding_disabled]──► ready
scaffolding ──[practice_complete OR candidate_skips_practice]──► ready
ready ──[exam_start]──► in_progress
in_progress ──[pause_requested]──► paused
paused ──[resume]──► in_progress
in_progress ──[all_nodes_complete OR explicit_end]──► completed
in_progress ──[violation OR system_failure]──► aborted
in_progress ──[time_window_expired]──► expired
paused ──[time_window_expired]──► expired
scheduled ──[time_window_expired]──► expired
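The transitions above can be encoded as a lookup table so that any pair not listed is rejected. This is an illustrative sketch: `TRANSITIONS` and `next_state` are hypothetical names, and the `scaffolding_enabled` guard is passed as a plain boolean rather than read from the specification.

```python
TRANSITIONS: dict[tuple[str, str], str] = {
    ("created", "assign_candidate"): "scheduled",
    ("scaffolding", "practice_complete"): "ready",
    ("scaffolding", "candidate_skips_practice"): "ready",
    ("ready", "exam_start"): "in_progress",
    ("in_progress", "pause_requested"): "paused",
    ("paused", "resume"): "in_progress",
    ("in_progress", "all_nodes_complete"): "completed",
    ("in_progress", "explicit_end"): "completed",
    ("in_progress", "violation"): "aborted",
    ("in_progress", "system_failure"): "aborted",
    ("in_progress", "time_window_expired"): "expired",
    ("paused", "time_window_expired"): "expired",
    ("scheduled", "time_window_expired"): "expired",
}


def next_state(state: str, trigger: str, scaffolding_enabled: bool = False) -> str:
    """Return the target state, or raise ValueError for an unlisted transition."""
    # The only guarded edge: scheduled branches on the scaffolding flag.
    if state == "scheduled" and trigger == "preconditions_met":
        return "scaffolding" if scaffolding_enabled else "ready"
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} --[{trigger}]") from None
```

Terminal states (`completed`, `aborted`, `expired`) have no outgoing entries, so any trigger applied to them raises.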

2.3 Transition Guards

scheduled → scaffolding: Runtime MUST verify preconditions are met AND scaffolding.enabled === true in the specification. Scaffolding turns MUST use a separate practice scenario (defined in the specification) and MUST NOT produce evidence signals for the marking pipeline.
scaffolding → ready: Runtime MUST emit scaffolding_completed with a count of practice turns taken. Scaffolding transcript MAY be retained for QA but MUST be excluded from the MarkingPackage.
ready → in_progress: Runtime MUST verify candidate identity is confirmed and all pre-checks (audio, video, network) have passed.
in_progress → paused: Runtime MUST record the pause timestamp and remaining time budget. The active node MUST be suspended, not reset.
paused → in_progress: Runtime MUST restore the active node state exactly as it was at pause.
in_progress → completed: Runtime MUST verify that every required node has been processed (either completed or marked best-effort per §7).
in_progress → aborted: Runtime MUST record the abort reason. Abort reasons include: candidate violation (e.g., third-party assistance detected), repeated guardrail violations, or irrecoverable system failure.

2.4 Allowed Actions Per State

Action	`created`	`scheduled`	`scaffolding`	`ready`	`in_progress`	`paused`	`completed`	`aborted`	`expired`
Assign candidate	✓
Schedule exam		✓
Start scaffolding			✓
Skip scaffolding			✓
Start exam				✓
Candidate speaks			✓		✓
LLM generates question					✓
Transition node					✓
Pause					✓
Resume						✓
Abort					✓	✓
Read transcript							✓	✓	✓
Read evidence							✓	✓	✓

3. Node Lifecycle State Machine

3.1 States

State	Description
`pending`	Node has not yet been reached in the exam flow.
`active`	Node is the current active node; candidate is being assessed on it.
`completed`	Node assessment finished successfully (sufficient evidence gathered).
`best_effort`	Node ended with incomplete evidence (follow-ups exhausted or time budget hit).
`skipped`	Node skipped due to conditional routing (e.g., prerequisite not met).

3.2 Transitions

pending ──[node_enter]──► active
active ──[evidence_sufficient]──► completed
active ──[followups_exhausted OR time_budget_hit]──► best_effort
active ──[transition_condition_skip]──► skipped
pending ──[transition_condition_skip]──► skipped

3.3 Transition Guards

pending → active: Runtime MUST emit node_entered event with node ID, sequence index, and timestamp. Runtime MUST initialise the node’s follow-up counter to 0 and start the time budget timer.
active → completed: Runtime MUST emit node_completed event. Evidence signals collected during this node MUST be finalised and written to the Evidence Ledger (see §12).
active → best_effort: Runtime MUST emit node_best_effort event with a reason (followups_exhausted, time_budget_hit). All partial evidence MUST still be written to the Evidence Ledger.
active → skipped: Runtime MUST emit node_skipped event with the transition condition that triggered the skip. Runtime MUST NOT count skipped nodes as completed for completion purposes (see §7).

4. Turn Lifecycle State Machine

A turn is a single exchange: candidate speaks → LLM processes → LLM responds (or transitions).

4.1 States

State	Description
`awaiting_candidate`	Runtime is waiting for the candidate to speak. LLM has finished presenting the question or follow-up.
`candidate_speaking`	STT is capturing the candidate’s audio.
`processing`	Candidate’s response is being processed (STT finalisation, LLM evaluation).
`llm_responding`	LLM is generating and TTS is playing the response.
`turn_complete`	Turn ended; ready for next turn or node transition.

4.2 Transitions

awaiting_candidate ──[speech_detected]──► candidate_speaking
candidate_speaking ──[speech_ended OR silence_timeout]──► processing
processing ──[evaluation_complete, needs_followup]──► llm_responding
processing ──[evaluation_complete, sufficient]──► turn_complete
processing ──[evaluation_complete, command_detected]──► turn_complete
llm_responding ──[tts_finished]──► awaiting_candidate   (loop for follow-up)
llm_responding ──[tts_finished, transitioning]──► turn_complete

4.3 Turn Timeout

If the candidate does not begin speaking (that is, candidate_speaking is not entered) within silenceTimeoutMs (configurable per node), the Runtime MUST transition to processing with a silence_detected flag.
If processing exceeds processingTimeoutMs, the Runtime MUST emit a turn_timeout event and the turn MUST be treated as best-effort for that candidate utterance.
llm_responding MUST complete within llmResponseTimeoutMs. If exceeded, Runtime MUST emit llm_timeout and either retry once or fall back to a canned transition message.
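The LLM response timeout can be sketched with `asyncio.wait_for`. This example combines both permitted options (one retry, then the canned message) and emits `llm_timeout` on each expiry; that combination is an assumption, as are the names `respond_with_timeout`, `generate`, `emit`, and the wording of `CANNED_TRANSITION`.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical canned message; the real text would come from the specification.
CANNED_TRANSITION = "Thank you. Let's move on."


async def respond_with_timeout(
    generate: Callable[[], Awaitable[str]],
    timeout_ms: int,
    emit: Callable[[str], None],
) -> str:
    """Try the LLM, retry once on timeout, then fall back to a canned message."""
    for _attempt in range(2):  # first attempt plus one retry
        try:
            return await asyncio.wait_for(generate(), timeout=timeout_ms / 1000)
        except asyncio.TimeoutError:
            emit("llm_timeout")
    return CANNED_TRANSITION
```

`asyncio.wait_for` cancels the pending call when the timeout expires, so a slow generation does not keep running behind the fallback.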

4.4 Turn-Level Events

The turn lifecycle states map to the canonical event types defined in 02-schema.md §14 (RuntimeEventType) and the event protocol in 05-event-protocol.md. Internal state transitions (e.g., entering candidate_speaking) MAY be tracked internally by the Runtime Controller but are not necessarily emitted as protocol-level events. The canonical turn-related event types are:

examiner_turn — examiner produces a turn (maps to transcript_final with speaker “examiner”)
candidate_turn — candidate produces a turn (maps to transcript_final with speaker “candidate”)
turn_completed — turn cycle completed (both sides have spoken or timeout)
turn_timeout — turn processing exceeded processingTimeoutMs

Additionally, the event protocol defines transcript_delta (streaming STT partials), transcript_final (canonical persisted utterance), examiner_utterance_started, and examiner_utterance_final for real-time UI consumption. See 05-event-protocol.md §4.4–4.7.

5. Candidate Readiness State

5.1 Purpose

Before the exam begins, the Runtime MUST verify the candidate is technically and cognitively ready. This prevents starting a high-stakes assessment with broken audio, a confused candidate, or an unverified identity.

5.2 Readiness Checks

Check	Required	Failure Behaviour
Identity verification (face match, ID check, or proctoring token)	MUST pass	Block exam start; emit `readiness_identity_failed`
Audio input device active	MUST pass	Block exam start; emit `readiness_audio_failed`
Audio output device active (TTS audible)	MUST pass	Block exam start; emit `readiness_audio_output_failed`
Video input device active (if required)	MUST pass	Block exam start; emit `readiness_video_failed`
Network connectivity (latency < threshold)	SHOULD pass	Warn candidate; allow start with degraded flag
Candidate confirms instructions understood	MUST pass	Re-present instructions; emit `readiness_instructions_not_understood`

5.3 Readiness State Machine

not_ready ──[check_identity]──► identity_verified
identity_verified ──[check_audio]──► audio_ok
audio_ok ──[check_video]──► video_ok
video_ok ──[check_instructions]──► ready

All states are sequential. Runtime MUST NOT skip a check. Each failed check MUST emit an event and block progression. The Runtime MAY allow a configurable number of retries per check before blocking the exam entirely.
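A sequential check runner with a per-check retry allowance might look like the sketch below. The failure event names come from the table in §5.2; `CHECKS`, `run_readiness`, `run_check`, and `max_retries` are hypothetical names, and only the four checks that appear in the state machine are modelled.

```python
from typing import Callable

# (check name, state reached on success, event emitted on failure)
CHECKS = [
    ("identity", "identity_verified", "readiness_identity_failed"),
    ("audio", "audio_ok", "readiness_audio_failed"),
    ("video", "video_ok", "readiness_video_failed"),
    ("instructions", "ready", "readiness_instructions_not_understood"),
]


def run_readiness(
    run_check: Callable[[str], bool],
    emit: Callable[[str], None],
    max_retries: int = 1,
) -> str:
    """Run checks in order; stop at the first check that fails all attempts."""
    state = "not_ready"
    for name, success_state, failure_event in CHECKS:
        for _attempt in range(1 + max_retries):
            if run_check(name):
                state = success_state
                break
            emit(failure_event)
        else:
            return state  # blocked: no later check is attempted
    return state
```

The `for ... else` branch runs only when every attempt for a check failed, which keeps the "MUST NOT skip a check" rule explicit in the control flow.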

6. Recovery State

6.1 Recovery Scenarios

Recovery scenarios are divided into technical failures (infrastructure issues) and assessment failures (pedagogical situations where the assessment interaction goes wrong). Both categories MUST be handled by the Runtime Controller.

Technical Failures

Scenario	Detection	Recovery Action
Network disconnection	WebSocket/LiveKit disconnect event	Pause exam; attempt reconnect within `reconnectTimeoutMs`. If reconnected, resume from last committed state. If timeout, abort.
STT failure	STT returns empty/error for N consecutive turns	Retry STT pipeline. If persistent, emit `stt_failure` event, present written question as fallback, log degraded mode.
STT low confidence	`transcript_segment.confidence < 0.6`	Runtime MUST emit `stt_low_confidence` event. LLM MAY offer the candidate a chance to repeat. Evidence signals MUST NOT be recorded from segments with confidence below 0.5 (see §12.3).
LLM failure	LLM timeout or error response	Retry once with backoff. If persistent, use canned follow-up from specification fallback config. If no fallback, pause exam.
TTS failure	TTS returns error	Retry once. If persistent, present question as text on data channel. Emit `tts_failure` event.
Silence (candidate unresponsive)	Silence exceeds `silenceTimeoutMs`	Runtime MUST prompt candidate (via LLM or canned). After `maxSilencePrompts`, transition to best-effort for current node.
Candidate disconnects	LiveKit participant leave event	Pause exam immediately. If candidate reconnects within `reconnectTimeoutMs`, resume. If not, abort with `candidate_disconnect`.
Audio loop / echo	Audio energy level anomaly detection	Mute TTS, present text, attempt audio reset. Emit `audio_loop_detected`.
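The STT low-confidence row uses two thresholds: an event is emitted below 0.6, while evidence is withheld only below 0.5. A small sketch of that split follows; the constant and function names are hypothetical.

```python
LOW_CONFIDENCE_EVENT_THRESHOLD = 0.6  # below this, emit stt_low_confidence
EVIDENCE_MIN_CONFIDENCE = 0.5         # below this, record no evidence signals


def classify_segment(confidence: float) -> tuple[bool, bool]:
    """Return (emit_low_confidence_event, may_record_evidence)."""
    emit_event = confidence < LOW_CONFIDENCE_EVENT_THRESHOLD
    may_record = confidence >= EVIDENCE_MIN_CONFIDENCE
    return emit_event, may_record
```

Under this reading, a segment at 0.55 triggers the event (and possibly an offer to repeat) but can still contribute evidence.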

Assessment Failures

Assessment failures are situations where the pedagogical interaction breaks down. Unlike technical failures, these require the LLM and Runtime to collaborate on recovery while preserving assessment validity.

Scenario	Detection	Recovery Action
Candidate misunderstands scenario	LLM detects candidate’s response is inconsistent with the scenario role (e.g., candidate acts as manager when they should be the employee)	LLM re-establishes scenario context without revealing assessment content. Emits `scenario_clarification` event. MUST NOT reveal what the “correct” interpretation is — only re-state the scenario framing. This preserves assessment validity while correcting the misunderstanding.
Question difficulty mismatch	LLM signals `difficulty_mismatch` in `report_observation` (candidate’s response suggests question was too hard or too easy)	Runtime MAY allow one question rephrase at lower or higher complexity level. Emits `difficulty_adjusted` event. The rephrased question MUST assess the same evidence targets — only the complexity framing changes.
Candidate emotional distress	LLM signals `distress_detected` (beyond anxiety — e.g., crying, aggressive tone, refusal to continue)	Runtime offers `pause` with a welfare message: “We can take a break whenever you need. Would you like to pause?” If candidate continues, Runtime logs `distress_event` for post-exam review. If candidate does not respond within `silenceTimeoutMs`, Runtime pauses automatically. Emits `welfare_check` event.
Examiner gives contradictory information	Output validation detects contradiction with prior statements in conversation history	Runtime intercepts and re-prompts the LLM with the contradictory statement flagged. Emits `consistency_violation` event. If second attempt also contradicts, uses canned fallback.
Candidate gives consistently off-topic answers	LLM signals `off_topic` for 3+ consecutive turns despite redirects	Runtime emits `persistent_off_topic` event. LLM MAY re-state the question more explicitly (without revealing the answer). If still off-topic after `maxOffTopicRedirects`, transition to best-effort.

Assessment Failure Principle: Recovery MUST NOT reveal model answers, rubric scoring logic, or the “correct” response. The goal is to restore the assessment interaction to a productive state, not to guide the candidate to the right answer. This preserves the assessment validity principle from Fenton (2025): the examiner should “neither discourage nor reassure the student” during prompting.

6.2 Recovery State Machine

healthy ──[failure_detected]──► recovering
recovering ──[recovery_successful]──► healthy
recovering ──[recovery_failed]──► degraded
degraded ──[manual_intervention OR timeout]──► aborted
recovering ──[reconnect_timeout]──► aborted

6.3 Recovery Guarantees

The Runtime MUST preserve all events emitted before the failure. Events are the source of truth for recovery.
On recovery, the Runtime MUST replay from the last committed event to restore consistent state.
The LLM session context MUST be reconstructed from the event log, not from LLM memory.
Recovery MUST NOT cause the candidate to lose credit for answers already given. Evidence already written to the ledger is permanent.
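Rebuilding the LLM session context from the event log, rather than from LLM memory, could be sketched as follows. The event dictionary keys (`seq`, `type`, `speaker`, `text`) and the role/content message shape are assumptions for illustration; only the `transcript_final` event type and its speaker values come from §4.4.

```python
def rebuild_context(event_log: list[dict]) -> list[dict]:
    """Derive the chat history from persisted transcript_final events only."""
    messages = []
    for event in sorted(event_log, key=lambda e: e["seq"]):
        if event["type"] != "transcript_final":
            continue  # other events restore Runtime state, not dialogue
        role = "assistant" if event["speaker"] == "examiner" else "user"
        messages.append({"role": role, "content": event["text"]})
    return messages
```

Because the log is the only input, two recoveries from the same log produce the same context.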

7. Completion Semantics

7.1 Node Completion Criteria

A node is complete when ALL of the following are true:

The main question has been asked at least once.
The candidate has provided at least one substantive response (not silence, not a command).
Either:
- (a) Evidence signals collected meet the node’s CompletionPolicy.requiredEvidenceCount threshold, OR
- (b) Follow-ups have been exhausted (followUpCount >= maxFollowUps), OR
- (c) Time budget for the node has been exhausted, OR
- (d) The LLM has judged evidence sufficient AND the Runtime’s autoCompleteOnSufficient policy is enabled.

7.2 Best-Effort Completion

When a node ends without meeting condition (a) — i.e., CompletionPolicy.requiredEvidenceCount not reached — the Runtime MUST mark it as best_effort, NOT completed. Best-effort nodes:

MUST still have their partial evidence written to the Evidence Ledger.
MUST be distinguishable from fully completed nodes in the ledger (via a completionStatus field).
MUST count toward overall exam progress for time-budget purposes.
SHOULD trigger a review flag for markers if requiredEvidenceCount was not met.
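The distinction between `completed` and `best_effort` reduces to whether the evidence threshold was reached when the node ended. The sketch below follows §7.2 literally and does not model condition (d); `ledger_entry` and the `reviewFlag` key are hypothetical, while `completionStatus` is the field named above.

```python
def completion_status(evidence_count: int, required_evidence_count: int) -> str:
    """Classify a node that has just ended, per condition (a)."""
    if evidence_count >= required_evidence_count:
        return "completed"
    return "best_effort"


def ledger_entry(node_id: str, signals: list[str], required_evidence_count: int) -> dict:
    """Build a ledger record; partial evidence is kept either way."""
    status = completion_status(len(signals), required_evidence_count)
    return {
        "nodeId": node_id,
        "completionStatus": status,
        "signals": list(signals),
        "reviewFlag": status == "best_effort",  # hypothetical marker hint
    }
```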

7.3 Exam Completion

An exam is complete when ALL of the following are true:

All required nodes have been processed (completed, best_effort, or skipped per routing).
No node is in active state.
Runtime has emitted exam_completed event.

An exam is aborted when:

A hard violation occurs (see §11), OR
A recovery failure forces termination (see §6), OR
An explicit abort is issued by the proctoring system.

7.4 Partial Exam

If the exam is aborted, all nodes processed up to the abort point MUST still have their evidence written to the ledger. The marking pipeline MUST be able to score a partial exam. The Runtime MUST emit exam_partial with the list of completed and best-effort nodes.

8. Transition Semantics

8.1 Node Transition Types

Type	Trigger	Description
`sequential`	Current node completes	Move to the next node in the specification sequence.
`conditional`	Edge condition evaluates to true	Jump to a non-adjacent node based on evidence or state.
`branch`	Branching node selects path	Follow a specific branch based on candidate profile or answer.
`skip`	Skip condition met	Skip one or more nodes.
`abort`	Abort condition met	End exam immediately.

8.2 Transition Evaluation

The Runtime MUST evaluate transition conditions in the order they appear in the specification. The FIRST matching condition wins.
Transition conditions MAY reference:
- Evidence signals collected so far (signals:has("topic_x_signal"))
- Node completion status (node:status("node_1") === "completed")
- Time elapsed (time:elapsed > 1800)
- Follow-up count on current node (node:followUpCount)
- Candidate commands received (commands:received("repeat"))
Transition conditions MUST NOT reference LLM internal state or raw transcript content directly. They MUST operate on structured Runtime state only.
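First-match-wins evaluation over structured Runtime state can be sketched as below. The condition syntax shown above (`signals:has(...)` and so on) is not parsed here; conditions are represented as plain callables, and `RuntimeView`, `Edge`, and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RuntimeView:
    # Structured state only: no transcript text, no LLM internals.
    signals: set[str] = field(default_factory=set)
    node_status: dict[str, str] = field(default_factory=dict)
    elapsed_s: int = 0
    follow_up_count: int = 0


@dataclass
class Edge:
    target: str
    condition: Callable[[RuntimeView], bool]


def evaluate(edges: list[Edge], view: RuntimeView) -> Optional[str]:
    """Return the target of the first edge whose condition holds, else None."""
    for edge in edges:  # specification order is preserved by the list
        if edge.condition(view):
            return edge.target
    return None
```

Restricting the condition input to `RuntimeView` is one way to make the "structured Runtime state only" rule hard to violate by accident.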

8.3 Transition Bridge

When transitioning between nodes, the Runtime MAY instruct the LLM to generate a transition bridge — a natural-language statement connecting the previous topic to the next. The LLM:

MUST NOT reveal the next question’s content in the bridge.
MUST NOT reference rubric criteria or model answers.
SHOULD produce a brief, natural sentence (e.g., “Thank you. Now let’s move on to a different area.”).
The Runtime MUST provide the LLM with the previous node’s topic label and the next node’s topic label (not question text) for bridge generation.

8.4 Transition Atomicity

A node transition MUST be atomic:

1. Finalise evidence for the departing node.
2. Write evidence to ledger.
3. Emit node_completed (or node_best_effort).
4. Emit node_entered for the arriving node.
5. Reset per-node state (follow-up counter, time budget).

Steps 1–5 MUST occur as a single committed transaction. If any step fails, the Runtime MUST roll back to the departing node’s active state and retry.
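One way to obtain all-or-nothing behaviour is to stage every step on a copy of the state and adopt the copy only if no step raised. The sketch below assumes the state, ledger, and event list live in one in-memory dictionary with hypothetical keys; a real Runtime would need a transactional store for the ledger and event log.

```python
import copy


def transition(state: dict, next_node: str, status: str, time_budget_ms: int) -> dict:
    """Stage the five steps on a copy; the input state is never mutated."""
    if status not in ("completed", "best_effort"):
        raise ValueError(f"unexpected departing status: {status}")
    draft = copy.deepcopy(state)
    departing = draft["active_node"]
    # Step 1: finalise evidence for the departing node.
    evidence = draft["pending_evidence"].pop(departing, [])
    # Step 2: write evidence to the ledger.
    draft["ledger"].append(
        {"nodeId": departing, "completionStatus": status, "signals": evidence}
    )
    # Step 3: emit node_completed or node_best_effort.
    draft["events"].append({"type": f"node_{status}", "nodeId": departing})
    # Step 4: emit node_entered for the arriving node.
    if next_node not in draft["nodes"]:
        raise KeyError(f"unknown node: {next_node}")
    draft["events"].append({"type": "node_entered", "nodeId": next_node})
    draft["active_node"] = next_node
    # Step 5: reset per-node state.
    draft["follow_up_count"] = 0
    draft["time_budget_ms"] = time_budget_ms
    return draft
```

If any step raises, the caller still holds the original `state` with the departing node active, which corresponds to the rollback-and-retry requirement.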

9. Follow-up Semantics

9.1 Purpose

Follow-ups allow the LLM examiner to probe deeper when a candidate’s initial answer is incomplete, unclear, or insufficiently evidenced. They are the primary mechanism for agentic behaviour within the exam.

In IOA practice, the most important follow-up strategy is rubric-level nudging: when a candidate demonstrates competence at a lower rubric level (e.g., “description” → credit), the assessor uses a follow-up to open the door to a higher level (e.g., “analysis” → distinction). The Runtime MUST support this pattern by providing the LLM with rubric level information as evidence vocabulary, not as scoring logic.

9.2 Follow-up Lifecycle

main_question_asked
    └── candidate answers
        ├── sufficient → transition (§8)
        └── insufficient → follow-up check:
            ├── followUps_remaining > 0 AND time_budget > 0 → issue follow-up
            └── followUps_remaining === 0 OR time_budget === 0 → best_effort

9.3 Follow-up Counter

Each node maintains an independent followUpCount counter, starting at 0.
The Runtime MUST increment followUpCount each time a follow-up is issued.
followUpCount MUST be persisted in the event log before the follow-up is presented to the candidate.
The Runtime MUST NOT allow followUpCount to exceed maxFollowUps for the node. This is a hard constraint — the LLM MUST NOT be asked to generate a follow-up when the counter is at max.
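The counter rules can be sketched as a small class that refuses to issue a follow-up at the limit and logs the new count before reporting success. `FollowUpCounter`, its `issue` method, and the event dictionary keys are hypothetical; the in-memory list stands in for the persisted event log.

```python
class FollowUpCounter:
    """Per-node follow-up counter owned by the Runtime Controller."""

    def __init__(self, max_follow_ups: int, event_log: list[dict]) -> None:
        self.max_follow_ups = max_follow_ups
        self.count = 0
        self._event_log = event_log

    def issue(self, follow_up_type: str) -> bool:
        """Return True if a follow-up may be presented, False at the limit."""
        if self.count >= self.max_follow_ups:
            return False  # hard constraint: do not ask the LLM for another
        self.count += 1
        # Log before the caller presents the follow-up to the candidate.
        self._event_log.append(
            {
                "type": "follow_up_issued",
                "followUpType": follow_up_type,
                "followUpCount": self.count,
            }
        )
        return True
```

Callers would check the return value before prompting the LLM, so the limit is enforced on the Runtime side rather than by the prompt.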

9.4 Follow-up Content Constraints

The LLM generates follow-up wording, but the Runtime enforces:

Follow-up MUST be on-topic (same node’s assessment objective).
Follow-up MUST NOT provide the answer or a strong hint (see §11 for hint refusal).
Follow-up MUST NOT introduce a new assessment topic (that requires a new node).
Follow-up SHOULD be a question, not a statement.
Follow-up SHOULD reference what the candidate said (for naturalness), but MUST NOT quote rubric criteria.

9.5 Follow-up Types

The follow-up taxonomy is derived from oral assessment literature. Pearce & Chiavaroli (2020, cited in Fenton, 2025) establish a prompting continuum from neutral presentation to leading questions; Joughin (1998) describes the bidirectional adaptation that characterises the “dialogue” pole of interaction. The types below span this range.

Type	Example	When Used	Literature Basis
`probe`	“Can you elaborate on what you mean by X?”	Candidate’s answer was vague.	Pearce & Chiavaroli: probing questions
`redirect`	“Let me rephrase: what would happen if…?”	Candidate misunderstood the question.	Pearce & Chiavaroli: clarifying questions
`scaffold`	“Think about it from the perspective of Y.”	Candidate is stuck; provides a graduated nudge. See §9.6 for scaffolding intensity.	Vygotsky ZPD; Fenton (2025): “simplify questions or prompt students who are struggling”
`challenge`	“What about the counterargument that…?”	Candidate’s answer is one-dimensional.	Joughin: probing reasoning depth
`nudge`	“You’ve described the situation well. Can you tell me more about why you think that’s the case?”	Candidate demonstrated lower rubric level; opens door to higher level (e.g., description → analysis). This is the core IOA prompting strategy.	Pearce & Chiavaroli: probing questions; Fenton (2025): higher-order skills
`confirm`	“So what you’re saying is [paraphrase]. Is that right?”	LLM paraphrases candidate’s answer for confirmation before proceeding. Serves dual function: confirms understanding AND gives candidate chance to correct.	Joughin: interaction as reciprocal adaptation
`extend`	“That’s a solid analysis. Now, how would you apply this in [different context]?”	Candidate gave a good answer; deepens the exploration at the same rubric level by asking for breadth or application. Distinct from `challenge` (which questions the answer) and `nudge` (which pushes to a higher level).	Joughin: applied problem solving; probing boundaries of understanding
`concede`	“That’s alright, let’s move on to something else.”	Candidate is clearly stuck and scaffolding hasn’t helped. Graceful abandonment of the current line of questioning within the node. Distinct from node-level best-effort completion: this is a turn-level move.	Fenton (2025): managing anxiety; examiner warmth
`closing`	“Is there anything else you’d like to add?”	Near time budget or follow-up limit.	Standard exam practice

The Runtime MUST emit follow_up_issued with the follow-up type. The LLM chooses the type based on context; the Runtime does not enforce type selection but MAY log it for quality assurance.

9.6 In-Assessment Scaffolding

Scaffolding within the assessment (distinct from pre-exam familiarisation in §2.1) is the primary mechanism for supporting candidates within their Zone of Proximal Development (Vygotsky). When the LLM issues a scaffold follow-up, the amount of scaffolding provided is itself evidence of the candidate’s competence level.

Scaffolding intensity captures how much support the examiner provided:

0 — No scaffolding: candidate answered independently.
1 — Minimal scaffolding: slight rephrasing or redirection.
2 — Moderate scaffolding: provided a conceptual hint or perspective shift.
3 — Heavy scaffolding: significantly simplified the question or broke it into sub-parts.

The Runtime MUST record scaffolding intensity on the evidence signal when a scaffold follow-up is issued (see §12.2). The marking pipeline uses this as evidence: a candidate who needed heavy scaffolding demonstrated a different competence level than one who needed minimal support.

Graduated withdrawal: The LLM SHOULD reduce scaffolding intensity over the course of a node. If the candidate demonstrates competence after scaffolding, subsequent follow-ups SHOULD probe at the original difficulty level. This mirrors the educational scaffolding principle of gradually removing support as competence develops.
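The four intensity levels and the graduated-withdrawal expectation can be sketched as an enum plus a trajectory check. `ScaffoldingIntensity`, `tag_signal`, `is_withdrawing`, and the `scaffoldingIntensity` key are hypothetical names chosen for illustration.

```python
from enum import IntEnum


class ScaffoldingIntensity(IntEnum):
    NONE = 0      # candidate answered independently
    MINIMAL = 1   # slight rephrasing or redirection
    MODERATE = 2  # conceptual hint or perspective shift
    HEAVY = 3     # question simplified or broken into sub-parts


def tag_signal(signal: dict, intensity: ScaffoldingIntensity) -> dict:
    """Return a copy of the evidence signal with the intensity recorded."""
    return {**signal, "scaffoldingIntensity": int(intensity)}


def is_withdrawing(trajectory: list[int]) -> bool:
    """True when intensity never increases from one scaffold to the next."""
    return all(later <= earlier for earlier, later in zip(trajectory, trajectory[1:]))
```

A check like `is_withdrawing` would suit post-hoc quality review rather than runtime enforcement, since graduated withdrawal is a SHOULD.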

Fenton (2025, p. 433): “Educators have the flexibility to simplify questions or prompt students who are struggling” and this flexibility results in “higher grades than would have been achieved with a written assessment.” The spec models this flexibility as a first-class concept, not an exception.

9.7 Conversation Quality Tracking

The Runtime SHOULD track conversation quality metrics within each node to support post-hoc assessment quality review. These metrics do not affect runtime behaviour but are included in the MarkingPackage for psychometric analysis (see §12.6).

Metric	Description
`candidateTurnCount`	Number of substantive candidate responses in this node
`examinerFollowUpDepth`	Number of follow-ups issued in this node
`avgCandidateResponseLatencyMs`	Mean time between examiner prompt and candidate speech start
`longestCandidateMonologueSec`	Longest uninterrupted candidate speech
`followUpTypeDistribution`	Count of each follow-up type used
`scaffoldingTrajectory`	Sequence of scaffolding intensities across the node (should trend downward)

Joughin (1998, p. 376): Reliability is threatened when there is “inconsistency between the questions asked of different candidates.” These metrics enable post-hoc analysis of whether conversation paths were comparable across candidates, even when the LLM adapted its questioning style.
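
As an illustration, the metrics in the table above could be derived from a node's turn records roughly as follows. This is a sketch under stated assumptions: `NodeTurn` is a hypothetical subset of the turn record in §12.1 extended with `followUpType` and `scaffoldingIntensity`, "substantive" is approximated as any non-command candidate turn, and latency is measured from the end of the examiner's utterance to the start of the candidate's.

```typescript
// Hypothetical input shape: a subset of TranscriptTurn (§12.1) plus two
// fields the Runtime would need to carry alongside it.
interface NodeTurn {
  role: "candidate" | "examiner";
  timestamp: number;              // Unix ms at the start of the utterance
  durationMs: number;
  isCommand: boolean;
  isFollowUp: boolean;
  followUpType?: string;          // assumption: recorded with follow_up_issued
  scaffoldingIntensity?: number;  // assumption: set on scaffold follow-ups
}

interface ConversationQualityMetrics {
  candidateTurnCount: number;
  examinerFollowUpDepth: number;
  avgCandidateResponseLatencyMs: number | null; // null when no sample exists
  longestCandidateMonologueSec: number;
  followUpTypeDistribution: Record<string, number>;
  scaffoldingTrajectory: number[];
}

function computeNodeMetrics(turns: readonly NodeTurn[]): ConversationQualityMetrics {
  const followUpTypeDistribution: Record<string, number> = {};
  const scaffoldingTrajectory: number[] = [];
  const latencies: number[] = [];
  let candidateTurnCount = 0;
  let examinerFollowUpDepth = 0;
  let longestMonologueMs = 0;
  let promptEndMs: number | undefined;

  for (const turn of turns) {
    if (turn.role === "examiner") {
      promptEndMs = turn.timestamp + turn.durationMs;
      if (turn.isFollowUp) {
        examinerFollowUpDepth += 1;
        const type = turn.followUpType ?? "unknown";
        followUpTypeDistribution[type] = (followUpTypeDistribution[type] ?? 0) + 1;
        if (turn.scaffoldingIntensity !== undefined) {
          scaffoldingTrajectory.push(turn.scaffoldingIntensity);
        }
      }
    } else if (!turn.isCommand) {
      // Assumption: "substantive" means any non-command candidate turn.
      candidateTurnCount += 1;
      longestMonologueMs = Math.max(longestMonologueMs, turn.durationMs);
      if (promptEndMs !== undefined) {
        latencies.push(Math.max(0, turn.timestamp - promptEndMs));
        promptEndMs = undefined; // one latency sample per examiner prompt
      }
    }
  }

  const totalLatency = latencies.reduce((sum, ms) => sum + ms, 0);
  return {
    candidateTurnCount,
    examinerFollowUpDepth,
    avgCandidateResponseLatencyMs: latencies.length > 0 ? totalLatency / latencies.length : null,
    longestCandidateMonologueSec: longestMonologueMs / 1000,
    followUpTypeDistribution,
    scaffoldingTrajectory,
  };
}
```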

10. Candidate Command Semantics

10.1 Purpose

Candidates may issue commands during the exam (e.g., “repeat the question”, “can you clarify?”, “I need a moment”). These MUST be handled as structured commands, not as assessment evidence.

10.2 Command Vocabulary

Command	Detection Method	Behaviour
`repeat`	Keyword/phrase detection + LLM intent classification	Re-present the current question or follow-up verbatim. MUST NOT count as a follow-up. MUST NOT reset the time budget.
`clarification`	LLM intent classification	LLM rephrases or explains the question instructions. MUST NOT reveal the model answer. Counts as one `clarification_used`.
`request_rephrase`	LLM intent classification	Candidate asks “can you say that differently?” — distinct from `repeat` (which re-presents verbatim) and `clarification` (which explains instructions). The LLM generates a different phrasing of the same question. MUST NOT reveal the model answer. Counts as one `clarification_used`.
`slow_down`	Keyword detection	LLM reduces speech rate. Runtime adjusts TTS speed.
`pause`	Explicit request	Transition to `paused` state (see §2).
`thinking_aloud`	LLM intent classification	Candidate says “let me think about that for a moment” or similar. Signals metacognitive awareness. Runtime emits `candidate_thinking` event. MUST NOT be treated as a `pause`: the candidate is still engaged. The LLM waits silently and the time budget continues to run.
`help`	LLM intent classification	Provide general exam instructions (not question-specific help).
`skip`	LLM intent classification	Request to skip current node. Runtime MAY honour if policy allows; otherwise MUST refuse and explain.
`revise_earlier_answer`	LLM intent classification	Candidate wants to revisit or amend a previous answer. Runtime MAY honour if the previous node is still within a configurable revision window. If honoured, emits `answer_revision` event. MUST NOT be used to revisit nodes from a different topic area.
`finish`	LLM intent classification	Candidate wants to end the exam. Runtime MUST confirm (“Are you sure?”) before processing.

10.3 Command Processing Rules

Commands MUST be detected before the candidate’s utterance is evaluated for evidence. A repeat command MUST NOT generate evidence signals.
Commands MUST be processed by the Runtime Controller, not directly by the LLM. The LLM detects intent; the Runtime decides the action.
The Runtime MUST emit a candidate_command event for every detected command, including the command type and raw utterance.
Commands MUST NOT count toward maxFollowUps.
Commands MUST consume time from the time budget (the candidate is using exam time).
If the LLM cannot classify an utterance as either a command or an answer with high confidence, the Runtime SHOULD treat it as an answer and let the LLM proceed accordingly.
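
The division of labour in these rules (the LLM classifies, the Runtime decides) might look like the sketch below. The names `IntentClassification`, `routeUtterance`, and the 0.8 threshold are assumptions introduced for illustration; the specification only says "high confidence" and does not fix a value.

```typescript
type CommandType =
  | "repeat" | "clarification" | "request_rephrase" | "slow_down" | "pause"
  | "thinking_aloud" | "help" | "skip" | "revise_earlier_answer" | "finish";

// Hypothetical shape of the LLM's intent classification result.
interface IntentClassification {
  kind: "command" | "answer";
  commandType?: CommandType;
  confidence: number; // 0 to 1
}

interface RuntimeEvent {
  type: string;
  payload: Record<string, unknown>;
}

type Routing =
  | { route: "command"; commandType: CommandType }
  | { route: "answer" };

// Assumption: the spec says "high confidence" without giving a number.
const COMMAND_CONFIDENCE_THRESHOLD = 0.8;

// Runs before any evidence evaluation. Only the "answer" route may go on
// to produce evidence signals.
function routeUtterance(
  rawUtterance: string,
  intent: IntentClassification,
  emit: (event: RuntimeEvent) => void,
): Routing {
  const commandType = intent.commandType;
  if (
    intent.kind !== "command" ||
    commandType === undefined ||
    intent.confidence < COMMAND_CONFIDENCE_THRESHOLD
  ) {
    // Ambiguous utterances are treated as answers.
    return { route: "answer" };
  }
  emit({ type: "candidate_command", payload: { commandType, rawUtterance } });
  return { route: "command", commandType };
}
```

Keeping the routing decision in one Runtime function makes it straightforward to guarantee that a command never reaches the evidence path and never touches the follow-up counter.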

10.4 Command Rate Limiting

The Runtime MUST enforce:

Maximum 3 repeat commands per node. After that, the Runtime MUST emit command_repeat_limit_reached and present the question in written form via data channel instead.
Maximum 2 clarification commands per node. After that, the Runtime MUST emit command_clarify_limit_reached and proceed.
No rate limit on pause, but pause duration counts against the exam time budget.
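
A per-node limiter for these two limits could be sketched as below. The class name `NodeCommandLimiter` is hypothetical. The sketch also assumes that `request_rephrase` is counted against the clarification limit, since §10.2 says it counts as one `clarification_used`; the caller would map it to `"clarification"` before calling `request`.

```typescript
type LimitedCommand = "repeat" | "clarification";
type LimitEvent = "command_repeat_limit_reached" | "command_clarify_limit_reached";

interface LimitDecision {
  allowed: boolean;
  event?: LimitEvent; // set only when the request is refused
}

const PER_NODE_LIMITS: Record<LimitedCommand, number> = { repeat: 3, clarification: 2 };

const LIMIT_EVENTS: Record<LimitedCommand, LimitEvent> = {
  repeat: "command_repeat_limit_reached",
  clarification: "command_clarify_limit_reached",
};

// One instance per node: counters start at zero whenever a node is entered.
class NodeCommandLimiter {
  private readonly used: Record<LimitedCommand, number> = { repeat: 0, clarification: 0 };

  request(command: LimitedCommand): LimitDecision {
    if (this.used[command] >= PER_NODE_LIMITS[command]) {
      return { allowed: false, event: LIMIT_EVENTS[command] };
    }
    this.used[command] += 1;
    return { allowed: true };
  }
}
```

On a refused `repeat`, the Runtime would then present the question in written form via the data channel, as required above.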

11. Guardrail Enforcement Semantics

11.1 Purpose

Guardrails ensure the LLM examiner operates within the boundaries defined in the specification. Guardrails are hard constraints enforced by the Runtime, not prompt-level instructions to the LLM.

11.2 Guardrail Catalogue

Guardrail	Scope	Enforcement
Max follow-ups	Per node	Runtime counter (§9.3). The LLM MUST NOT be invoked for a follow-up once the counter has reached `maxFollowUps`.
Time budget	Per node, per exam	Runtime timer. Enforced even if LLM wants to continue.
Hint refusal	Per node	Runtime filters LLM output. If LLM response contains content matching the node’s `modelAnswer` or `rubricPhrases`, the response MUST be intercepted and replaced with a neutral re-prompt.
No rubric reveal	Global	Runtime MUST NOT pass rubric scoring weights, grade boundaries, or model answers to the LLM. However, the Runtime MUST pass rubric criteria and evidence vocabulary to the LLM — in IOA practice, rubric criteria are the conversation guide (sentence-starters), not a secret. The LLM uses criteria to know what to listen for, not how to score. The distinction: criteria describe observable competencies (“explains the mechanism”, “evaluates trade-offs”); scoring logic maps those to marks (criterion X = 5 marks if excellent, 3 if adequate). Only the former is shared.
No scoring	Global	Runtime MUST NOT pass scoring logic (grade boundaries, mark weights, score ranges) to the LLM. The LLM emits evidence signals as observations; scoring happens in the marking pipeline. The LLM MAY know rubric criteria as evidence vocabulary (what to listen for), but MUST NOT know how those criteria map to marks.
No structure change	Global	Runtime MUST NOT allow the LLM to add, remove, or reorder nodes. The LLM operates within the current node only.
No premature end	Per node	Runtime MUST NOT end a node before the main question has been asked and at least one candidate response received.
No topic jump	Per node	Runtime MUST constrain the LLM’s context to the current node’s topic. The LLM MUST NOT reference content from future nodes.
Off-topic handling	Per turn	If the LLM signals `off_topic`, Runtime MUST increment a per-node `offTopicCount`. After `maxOffTopicRedirects` (default: 2), Runtime MUST mark node as best-effort.
Silence handling	Per turn	If silence exceeds `silenceTimeoutMs`, Runtime MUST trigger a silence prompt (not an LLM follow-up). After `maxSilencePrompts` (default: 2), Runtime MUST mark node as best-effort.
Candidate anxiety	Per turn	If the LLM signals `candidate_anxiety`, Runtime MAY extend the time budget by `anxietyTimeExtensionMs` (configurable). The Runtime MUST NOT reduce difficulty or simplify questions.
Technical failure	Per event	See §6. Guardrails apply even in degraded mode — the Runtime MUST NOT bypass max-follow-ups or time budgets during recovery.
Persona consistency	Per node	If the specification defines a `persona` for the node (e.g., “hotel manager”), the Runtime MUST validate that every LLM `spokenText` output stays in character. The output validation pipeline MUST check for persona-break patterns (e.g., “As your examiner…”, “In this assessment…”). On violation, the Runtime re-prompts the LLM with a persona reminder and emits `guardrail_triggered` with type `persona_break`.
Equity — communication style	Global	Unless `communicationStyleIsLearningOutcome: true` in the specification, the LLM MUST NOT penalise or comment on accent, fluency, verbal confidence, or speech patterns. The Runtime MUST filter evidence signals that reference communication quality when the flag is false.
Rapport and tone calibration	Per node	The LLM SHOULD build rapport through natural dialogue moves (acknowledgement, encouragement, reassurance) that are distinct from follow-ups. These moves MUST NOT count toward `maxFollowUps`. The LLM SHOULD adapt warmth based on context: warmer at node start, warmer when candidate struggles, more formal during technical probing. Rapport moves MUST NOT cross into assessment bias — encouragement like “Take your time” is permitted; “You’re doing great” is NOT (it provides implicit evaluative feedback). The Runtime MUST log rapport moves for quality assurance.
Neutrality in prompting	Per node	The LLM’s follow-up prompts MUST aim for neutrality as defined by Pearce & Chiavaroli (2020): “neither discourages nor reassures the student.” This is distinct from rapport — an examiner can be warm (rapport) while remaining assessment-neutral (prompting). The LLM MUST NOT provide evaluative feedback during the assessment (e.g., “Good answer”, “That’s not quite right”). The Runtime’s output validation pipeline MUST check for evaluative language patterns.
Examiner-initiated pause for welfare	Per turn	If the LLM signals `distress_detected` (not just anxiety), the Runtime MAY initiate a pause with a welfare message. This is distinct from candidate-initiated `pause` — the examiner proactively offers a break. Emits `welfare_pause_offered` event.
Time budget fairness	Per node	The Runtime MUST track whether the candidate received substantially different time-on-task compared to the node’s configured budget. If the LLM’s follow-up strategy causes a node to end significantly early (e.g., < 50% of time budget used), the Runtime SHOULD log this for fairness review. Different candidates should have comparable opportunities to demonstrate competence.
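
The off-topic and silence rows of the catalogue are both counter-based and could be enforced along the following lines. This is a sketch only: the class and action names are hypothetical, and it reads "after `maxOffTopicRedirects`" as meaning that the configured number of redirects (or silence prompts) is issued first and the next occurrence marks the node as best-effort.

```typescript
interface TurnGuardrailConfig {
  maxOffTopicRedirects: number;
  maxSilencePrompts: number;
}

// Defaults taken from the catalogue above.
const DEFAULT_TURN_GUARDRAILS: TurnGuardrailConfig = {
  maxOffTopicRedirects: 2,
  maxSilencePrompts: 2,
};

type TurnGuardrailAction = "redirect" | "silence_prompt" | "mark_best_effort";

// One instance per node. The Runtime, not the LLM, owns these counters.
class NodeTurnGuardrails {
  private readonly config: TurnGuardrailConfig;
  private offTopicCount = 0;
  private silencePromptCount = 0;

  constructor(config: TurnGuardrailConfig = DEFAULT_TURN_GUARDRAILS) {
    this.config = config;
  }

  // Called when the LLM signals off_topic for the current turn.
  onOffTopic(): TurnGuardrailAction {
    this.offTopicCount += 1;
    return this.offTopicCount > this.config.maxOffTopicRedirects
      ? "mark_best_effort"
      : "redirect";
  }

  // Called when silence exceeds silenceTimeoutMs.
  onSilenceTimeout(): TurnGuardrailAction {
    if (this.silencePromptCount >= this.config.maxSilencePrompts) {
      return "mark_best_effort";
    }
    this.silencePromptCount += 1;
    return "silence_prompt";
  }
}
```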

11.3 Guardrail Violation Handling

When a guardrail is triggered:

Runtime MUST emit a guardrail_triggered event with the guardrail type, context, and action taken.
If the violation is LLM-caused (e.g., LLM attempted to reveal rubric), Runtime MUST intercept the output, replace it, and log the attempt.
If the violation is structural (e.g., time budget exceeded), Runtime MUST enforce the hard limit regardless of LLM state.
Repeated guardrail violations (configurable threshold) SHOULD trigger an alert to the proctoring system.
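
One possible shape for the `guardrail_triggered` event and the repeated-violation check is sketched here. The field names and the `GuardrailViolationLog` class are hypothetical, and the sketch assumes the configurable threshold is applied per guardrail type; the specification does not say whether it is per type or per exam.

```typescript
interface GuardrailTriggeredEvent {
  type: "guardrail_triggered";
  guardrailType: string; // e.g. "persona_break"
  context: string;       // e.g. node and turn identifiers
  actionTaken: string;   // e.g. "output_replaced"
}

interface ViolationOutcome {
  event: GuardrailTriggeredEvent;
  alertProctoring: boolean;
}

class GuardrailViolationLog {
  private readonly alertThreshold: number;
  private readonly events: GuardrailTriggeredEvent[] = [];

  constructor(alertThreshold: number) {
    this.alertThreshold = alertThreshold;
  }

  // Records the violation and reports whether the number of violations of
  // this guardrail type has reached the alert threshold.
  record(guardrailType: string, context: string, actionTaken: string): ViolationOutcome {
    const event: GuardrailTriggeredEvent = {
      type: "guardrail_triggered",
      guardrailType,
      context,
      actionTaken,
    };
    this.events.push(event);
    const sameTypeCount = this.events.filter((e) => e.guardrailType === guardrailType).length;
    return { event, alertProctoring: sameTypeCount >= this.alertThreshold };
  }

  all(): readonly GuardrailTriggeredEvent[] {
    return this.events;
  }
}
```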

11.4 LLM Output Validation

Every LLM response during an exam turn MUST pass through the Runtime’s output validation pipeline before being presented to the candidate:

Content filter: Check against forbiddenPhrases (model answer fragments, rubric terms).
Topic filter: Check that the response references only the current node’s topic scope.
Action filter: Check that the response does not attempt to transition, score, or end the exam (these are Runtime actions).
Length filter: Check that the response does not exceed maxResponseLength.

If validation fails, the Runtime MUST:

Log the violation.
Invoke the LLM again with a corrected prompt (e.g., “Please rephrase without mentioning [X]”).
If re-invocation also fails, use a canned fallback response.
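
The validate, re-invoke, fall back sequence above can be sketched as a small pipeline. Everything named here is illustrative: `Filter`, `contentFilter`, `lengthFilter`, and `deliverValidated` are hypothetical, `maxResponseLength` is assumed to be measured in characters, and the topic and action filters are left as further `Filter` functions to be supplied by the implementer. The LLM call is modelled as a synchronous callback for brevity; a real call would be asynchronous.

```typescript
// A filter returns a violation description, or null when the text passes.
type Filter = (text: string) => string | null;

// Content filter: case-insensitive match against forbiddenPhrases.
function contentFilter(forbiddenPhrases: readonly string[]): Filter {
  return (text) => {
    const lower = text.toLowerCase();
    const hit = forbiddenPhrases.find((phrase) => lower.includes(phrase.toLowerCase()));
    return hit === undefined ? null : `forbidden phrase "${hit}"`;
  };
}

// Length filter. Assumption: maxResponseLength is a character count.
function lengthFilter(maxResponseLength: number): Filter {
  return (text) =>
    text.length > maxResponseLength
      ? `length ${text.length} exceeds ${maxResponseLength}`
      : null;
}

// Tries the LLM at most twice: once normally and once with a corrective
// prompt. If both attempts fail validation, the canned fallback is used.
function deliverValidated(
  generate: (correction?: string) => string,
  filters: readonly Filter[],
  cannedFallback: string,
  log: (violation: string) => void,
): string {
  let correction: string | undefined;
  for (let attempt = 0; attempt < 2; attempt++) {
    const candidate = generate(correction);
    const violations = filters
      .map((filter) => filter(candidate))
      .filter((v): v is string => v !== null);
    if (violations.length === 0) {
      return candidate;
    }
    violations.forEach((v) => log(v));
    correction = `Please rephrase without: ${violations.join("; ")}`;
  }
  return cannedFallback;
}
```

Because nothing is returned until every filter passes, the candidate never hears an unvalidated response, which is the property the section requires.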

12. Transcript and Evidence Capture Semantics

12.1 Transcript Structure

Every exam MUST produce a structured transcript. The transcript is NOT just raw STT output — it is a structured sequence of turns, each annotated with metadata.

Turn Record Schema

interface TranscriptTurn {
  turnId: string;                    // Unique turn identifier
  nodeId: string;                    // Node this turn belongs to
  turnIndex: number;                 // Sequential index within the node
  role: "candidate" | "examiner";    // Speaker
  content: string;                   // Text (STT output or LLM output)
  timestamp: number;                 // Unix ms
  durationMs: number;                // Duration of the utterance
  metadata: {
    isCommand: boolean;              // Was this a candidate command?
    commandType?: string;            // If command, which type?
    isFollowUp: boolean;             // Was this a follow-up question?
    followUpIndex?: number;          // If follow-up, which one?
    isSilence: boolean;              // Was this a silence event?
    isOffTopic: boolean;             // Was the candidate off-topic?
    confidence: number;              // STT confidence score (0–1)
  };
}

12.2 Evidence Signal Capture

Evidence signals are the structured observations that feed the marking pipeline. They are NOT scores — they are facts about what the candidate demonstrated.

In IOA practice, evidence signals correspond directly to rubric criteria. The rubric defines which competencies to look for; an evidence signal records whether (and how well) the candidate demonstrated one of them. The Runtime MUST ensure that every signalType in the specification maps to a rubric criterion, and that the LLM receives these as the vocabulary of “what to listen for.”

Evidence Signal Schema

interface EvidenceSignal {
  signalId: string;                  // Unique signal identifier
  nodeId: string;                    // Node where signal was observed
  signalType: string;                // Type from specification — MUST map to a rubric criterion
  rubricLevel?: string;              // Observed rubric level (e.g., "description", "analysis", "evaluation")
  transversalSkills?: string[];      // Cross-cutting skills observed (e.g., ["critical_thinking", "professional_reasoning"])
  confidence: number;                // LLM's confidence that this signal was observed (0–1)
  turnId: string;                    // Which turn triggered this signal
  excerpt: string;                   // Short excerpt from candidate's response
  timestamp: number;                 // When the signal was observed
  source: "llm_observed" | "runtime_detected";  // Who observed it
  scaffoldingIntensity?: number;     // 0–3 scale: amount of scaffolding provided before this signal
                                     // 0 = no scaffolding (independent answer)
                                     // 1 = minimal (rephrasing/redirect)
                                     // 2 = moderate (conceptual hint)
                                     // 3 = heavy (simplified question/broken into parts)
  scaffoldingEffective?: boolean;    // Did the candidate improve after scaffolding? Only set when scaffoldingIntensity > 0.
}

Scaffolding as Evidence: The scaffoldingIntensity and scaffoldingEffective fields capture assessment information that a record of the candidate’s answers alone would miss. Joughin (1998) identifies that oral assessment can measure “applied problem solving”, and the amount of scaffolding a candidate needs is itself evidence of their problem-solving competence. A candidate who answers correctly after heavy scaffolding (intensity 3) demonstrated a different competence level than one who answers independently (intensity 0). The marking pipeline MUST use scaffolding intensity as a modifier when evaluating evidence signals, not as a separate score.

Transversal Skills: IOA research identifies cross-cutting competencies (critical thinking, communication, problem-solving, professional reasoning) that span multiple nodes. The transversalSkills field allows the LLM to tag evidence signals with these cross-cutting observations. The marking pipeline MUST aggregate transversal skill signals across all nodes to produce a holistic competency profile. Transversal skill vocabulary is defined in the specification at the exam level, not per-node.
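
The cross-node aggregation described above might be as simple as the following sketch. The specification does not define the shape of the competency profile, so `SkillSummary` and `aggregateTransversalSkills` are assumptions; the sketch counts tagged signals per skill and records which nodes contributed, ignoring tags that are not in the exam-level vocabulary.

```typescript
// Hypothetical subset of EvidenceSignal needed for aggregation.
interface TaggedSignal {
  nodeId: string;
  transversalSkills?: string[];
}

interface SkillSummary {
  signalCount: number; // number of signals tagged with this skill
  nodeIds: string[];   // distinct nodes in which the skill was observed
}

function aggregateTransversalSkills(
  signals: readonly TaggedSignal[],
  examVocabulary: readonly string[],
): Map<string, SkillSummary> {
  const profile = new Map<string, SkillSummary>();
  for (const skill of examVocabulary) {
    profile.set(skill, { signalCount: 0, nodeIds: [] });
  }
  for (const signal of signals) {
    for (const skill of signal.transversalSkills ?? []) {
      const summary = profile.get(skill);
      if (summary === undefined) {
        continue; // tag is not in the exam-level vocabulary
      }
      summary.signalCount += 1;
      if (summary.nodeIds.indexOf(signal.nodeId) === -1) {
        summary.nodeIds.push(signal.nodeId);
      }
    }
  }
  return profile;
}
```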

12.3 Evidence Capture Rules

Evidence signals MUST be emitted by the LLM during the processing state (see §4).
The Runtime MUST validate that the signalType is defined in the current node’s evidenceSignals in the specification. Unknown signal types MUST be logged and discarded.
The Runtime MUST write validated signals to the Evidence Ledger immediately — not batched at node completion.
The LLM MAY emit multiple signals per turn (candidate may demonstrate several competencies in one answer).
The Runtime MUST NOT allow the LLM to emit signals for a node that is not currently active.
Duplicate signals (same signalType, same turnId) MUST be deduplicated by the Runtime.
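
Taken together, the Runtime-side rules above amount to a small gate in front of the Evidence Ledger. The sketch below is illustrative: `EvidenceCapture`, `CaptureResult`, and the in-memory `ledger` array are hypothetical stand-ins for whatever store an implementation uses.

```typescript
// Hypothetical subset of EvidenceSignal needed for validation.
interface IncomingSignal {
  signalType: string;
  nodeId: string;
  turnId: string;
}

type CaptureResult =
  | "written"
  | "discarded_inactive_node"
  | "discarded_unknown_type"
  | "discarded_duplicate";

// One instance per active node.
class EvidenceCapture {
  readonly ledger: IncomingSignal[] = [];   // stand-in for the Evidence Ledger
  readonly discardLog: string[] = [];       // unknown types are logged, then dropped
  private readonly activeNodeId: string;
  private readonly allowedTypes: Set<string>;
  private readonly seen = new Set<string>();

  constructor(activeNodeId: string, nodeEvidenceSignals: readonly string[]) {
    this.activeNodeId = activeNodeId;
    this.allowedTypes = new Set(nodeEvidenceSignals);
  }

  capture(signal: IncomingSignal): CaptureResult {
    if (signal.nodeId !== this.activeNodeId) {
      return "discarded_inactive_node";
    }
    if (!this.allowedTypes.has(signal.signalType)) {
      this.discardLog.push(`unknown signalType: ${signal.signalType}`);
      return "discarded_unknown_type";
    }
    const key = JSON.stringify([signal.signalType, signal.turnId]);
    if (this.seen.has(key)) {
      return "discarded_duplicate";
    }
    this.seen.add(key);
    this.ledger.push(signal); // written immediately, not batched
    return "written";
  }
}
```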

12.4 Evidence Sufficiency

The LLM MAY signal evidence_sufficient during processing if it believes enough signals have been collected. However:

The Runtime MUST NOT auto-complete the node based solely on LLM judgment unless autoCompleteOnSufficient is enabled in the specification policy.
Even with autoCompleteOnSufficient, the Runtime MUST verify that at least CompletionPolicy.requiredEvidenceCount distinct evidence targets have been satisfied.
The LLM’s sufficiency judgment is advisory; the Runtime’s requiredEvidenceCount check is authoritative.
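
The relationship between the advisory LLM judgment and the authoritative Runtime check can be expressed in a few lines. The function name `mayAutoComplete` and the use of a set of satisfied target identifiers are assumptions made for this sketch.

```typescript
// Hypothetical subset of the specification's completion policy.
interface SufficiencyPolicy {
  autoCompleteOnSufficient: boolean;
  requiredEvidenceCount: number;
}

// Decides whether an evidence_sufficient signal from the LLM may complete
// the node. The LLM's judgment alone is never enough.
function mayAutoComplete(
  policy: SufficiencyPolicy,
  llmSignalledSufficient: boolean,
  satisfiedTargets: ReadonlySet<string>, // distinct evidence targets satisfied so far
): boolean {
  if (!llmSignalledSufficient || !policy.autoCompleteOnSufficient) {
    return false;
  }
  return satisfiedTargets.size >= policy.requiredEvidenceCount;
}
```

Other completion paths, such as time budget expiry, are outside this check and are handled by the completion semantics defined elsewhere in the specification.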

12.5 Transcript Closure

At exam completion (or abort), the Runtime MUST:

Write all pending transcript turns to the transcript store.
Write all pending evidence signals to the Evidence Ledger.
Emit a transcript_finalised event.
Compute and store a transcriptHash (SHA-256 of the canonicalised transcript) for integrity verification.

The transcript MUST be immutable after transcript_finalised. No post-hoc edits are permitted. If a correction is needed, it MUST be appended as a correction record with its own hash chain.
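
The specification does not define the canonical form, so the sketch below assumes one: JSON with object keys sorted recursively and no insignificant whitespace. The function names are hypothetical; `createHash` is the standard Node.js `node:crypto` API. The correction record shows one way to chain hashes, by hashing the previous hash together with the canonicalised correction.

```typescript
import { createHash } from "node:crypto";

// Assumed canonical form: JSON with recursively sorted object keys.
// Properties whose value is undefined are omitted.
function canonicalise(value: unknown): string {
  if (value === undefined) {
    return "null";
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalise(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const keys = Object.keys(record)
      .filter((key) => record[key] !== undefined)
      .sort();
    const members = keys.map((key) => `${JSON.stringify(key)}:${canonicalise(record[key])}`);
    return `{${members.join(",")}}`;
  }
  return JSON.stringify(value);
}

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// transcriptHash: SHA-256 over the canonicalised list of turn records.
function computeTranscriptHash(turns: readonly unknown[]): string {
  return sha256Hex(canonicalise(turns));
}

interface CorrectionRecord {
  correction: unknown;
  previousHash: string; // transcriptHash, or the hash of the prior correction
  hash: string;
}

// Appends to the chain without touching the finalised transcript.
function appendCorrection(previousHash: string, correction: unknown): CorrectionRecord {
  const hash = sha256Hex(previousHash + canonicalise(correction));
  return { correction, previousHash, hash };
}
```

Sorting keys makes the hash independent of property order, so two stores that serialise the same turns differently still verify against the same `transcriptHash`.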

12.6 Marking Pipeline Handoff

The Runtime MUST produce a MarkingPackage containing:

The full structured transcript.
The evidence ledger (all signals, with node association, transversal skill tags, and scaffolding intensity).
Exam metadata (candidate ID, exam ID, start/end timestamps, duration).
Node completion statuses (completed / best_effort / skipped).
Any guardrail events triggered during the exam.
Any assessment failure events (scenario clarification, difficulty adjustment, welfare checks).
The transcript hash for integrity verification.
Conversation fingerprint (SHA-256 of the ordered conversation path: node sequence + follow-up types + turn count per node). This gives auditors a compact record of the path each exam instance took, which matters for academic integrity auditing. Two candidates on the same specification will normally have different fingerprints because the conversation unfolds differently, although identical paths would produce identical fingerprints.
Assessment equivalence (e.g., equivalentWrittenWordCount: 3000) from the specification, for calibration reference.
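Scaffolding metadata (if pre-exam scaffolding was used; see §2.1): number of practice turns, whether the candidate skipped early. The pre-exam scaffolding transcript is NOT included. In-assessment scaffolding (§9.6) is carried on the evidence ledger instead.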
Conversation quality metrics per node: candidate turn count, examiner follow-up depth, average candidate response latency, longest candidate monologue, follow-up type distribution, and scaffolding trajectory. See §9.7.
Psychometric equivalence summary: aggregate statistics for marking pipeline analysis — average follow-ups per node, follow-up type distribution, average time per node, scaffolding intensity distribution, and a conversation path variance score (0.0 = identical paths across candidates, 1.0 = maximally different). This enables the marking pipeline to detect whether different candidates received substantially different assessment experiences (see Joughin, 1998, on reliability threats from inconsistent questioning).

This package is the sole input to the marking pipeline. The marking pipeline MUST NOT need to reconstruct state from raw STT output or LLM conversation history.
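
For the conversation fingerprint item in the list above, the specification names the inputs (node sequence, follow-up types, turn count per node) but not how they are serialised. The sketch below assumes a JSON array of `[nodeId, followUpTypes, turnCount]` tuples in conversation order; `NodePathEntry` and `conversationFingerprint` are hypothetical names, and `createHash` is the standard Node.js `node:crypto` API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical per-node summary of the conversation path.
interface NodePathEntry {
  nodeId: string;
  followUpTypes: string[]; // in the order the follow-ups were issued
  turnCount: number;
}

// SHA-256 over an assumed serialisation of the ordered conversation path.
function conversationFingerprint(path: readonly NodePathEntry[]): string {
  const serialised = JSON.stringify(
    path.map((entry) => [entry.nodeId, entry.followUpTypes, entry.turnCount]),
  );
  return createHash("sha256").update(serialised, "utf8").digest("hex");
}
```

The fingerprint deliberately excludes utterance text: it summarises the shape of the conversation, while the transcript hash covers its content.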

Appendix A: Time Budget Calibration Guidance

Non-normative. This appendix provides heuristics for setting per-node and per-exam time budgets. These are derived from the oral assessment literature and are intended as starting points, not hard requirements.

Factor	Guideline	Source
Per-question time	5–7 minutes for theoretical questions	Akimov & Malin (2020): “around five minutes’ response time” per theoretical question
Follow-up time	5–10 minutes total per node	Akimov & Malin (2020): “five to ten minutes were allocated for follow-up questions”
Total exam duration	15–30 minutes for a clear picture of understanding	Fenton (2025, citing Sayre, 2014): “it should take no more than 20 minutes to get a clear picture”; Akimov & Malin (2020): 30-minute blocks
Response time per question	60 seconds max for concise answers	Bayley et al. (2024): “only the first 60 seconds of each of their responses would be graded”
Anxiety extension	Configurable, default 2 minutes	Spec design; Akimov & Malin (2020): students found shorter exams “rushed”
Practice/familiarisation	5–10 minutes before assessment begins	Bayley et al. (2024): practice ConVOE; Fenton (2025): “opportunities to practice where no marks are allocated”

Calibration Considerations

Too short: Creates time pressure that disadvantages candidates who think more slowly or are answering in a second language (Akimov & Malin, 2020: students found 10-minute exams “rushed”).
Too long: Leads to fatigue and reduced engagement. Fenton (2025) recommends “less is more.”
Time as fairness variable: If two candidates get the same node but one uses 3 minutes and the other uses 8 minutes (because the LLM asked different follow-ups), the Runtime tracks this difference in the conversation quality metrics (§9.7). The marking pipeline SHOULD consider whether time-on-task differences correlate with score differences.
Proportionality: Time budget SHOULD be proportional to the number of evidence targets on the node. A node with 3 evidence targets needs more time than one with 1.

Fairness Note: The psychometric equivalence summary does not constrain the LLM’s adaptiveness — it measures it. The marking pipeline can use this data to assess whether conversation path variance is correlated with score variance. If candidates who received harder follow-ups systematically score lower, this indicates a fairness problem that requires investigation (Akimov & Malin, 2020, Table 4: “It is hard to determine whether students perform significantly differently if follow-up questions are different”).

Revision History

Version	Date	Changes
v0.2.0	2026-06-30	Added anxiety detection and distress handling semantics. Updated terminology from ‘Exam Runtime IR’ to ‘IOA-ORM’. Refined state machine transitions for welfare checks.
v0.1.0	2026-05-06	Initial release.