Runtime Semantics
Status
Draft · v0.2.0 · 2026-06-30
Status: Draft
Scope: Defines the execution model, state machines, lifecycle transitions, and semantic contracts that govern how an IOA-ORM is executed. All normative statements use RFC 2119 language (MUST / SHOULD / MAY / MUST NOT).
Table of Contents
- Runtime Execution Model
- Exam Lifecycle State Machine
- Node Lifecycle State Machine
- Turn Lifecycle State Machine
- Candidate Readiness State
- Recovery State
- Completion Semantics
- Transition Semantics
- Follow-up Semantics
- Candidate Command Semantics
- Guardrail Enforcement Semantics
- Transcript and Evidence Capture Semantics
Appendices
1. Runtime Execution Model
1.0 IOA Design Alignment
Interactive Oral Assessment (IOA) research (Ward et al., 2023; Sotiriadou et al., 2016) establishes that effective oral assessments are scenario-based, free-flowing conversations, not scripted question-and-answer sessions. The Runtime MUST respect this principle: nodes represent assessment scenarios with rubric-aligned conversation objectives, not rigid question slots. The LLM examiner drives an authentic, professionally-focused dialogue, using rubric criteria as conversation guides (sentence-starters), not as a quiz script.
Theoretical Foundations
The runtime semantics are grounded in four key dimensions from the oral assessment literature:
Interaction (Joughin, 1998, pp. 370–371). Oral assessment’s principal advantage is the capacity for bidirectional adaptation — “each [statement] includes a response to that made by the other participant” creating “inherent unpredictability in which neither party knows in advance exactly what questions will be asked or what responses will be made.” The Runtime preserves this by giving the LLM autonomy over dialogue strategy within nodes while maintaining structural control over transitions. However, Joughin warns that “the social interaction entailed in oral assessment may distort communication and affect both a candidate’s performance and how that performance is perceived by the examiner” (p. 370). The guardrails in §11 address this risk.
Authenticity (Joughin, 1998, pp. 371–372; Fenton, 2025, p. 431). IOA research positions oral assessments as “a form of assessment asking students to perform real-world tasks to demonstrate meaningful application of necessary knowledge and skills” (Sotiriadou et al., 2020, cited in Fenton, 2025). The scenario-based node design directly implements Joughin’s “contextualised” pole of the authenticity continuum. The persona system (§11) ensures the examiner stays in character, maintaining the professional context throughout the interaction.
Reliability through Structure (Joughin, 1998, p. 376; Akimov & Malin, 2020, Table 4). Joughin identifies that “reliability is threatened when the ‘interaction’ dimension tends towards the ‘dialogue’ pole, when the ‘structure’ dimension tends towards the ‘open’ pole.” The spec addresses this through a deliberate design choice: closed structure between nodes (Runtime-enforced transitions, deterministic ordering) with open dialogue within nodes (LLM-driven adaptive questioning). This hybrid preserves the validity benefits of dialogue while maintaining the reliability benefits of structure. The follow-up counter and time budget (§9) are hard structural constraints that prevent unbounded dialogue.
Prompting Neutrality (Pearce & Chiavaroli, 2020, cited in Fenton, 2025, p. 434). The spec operationalises four guiding principles for examiner prompting: neutrality (guardrails prevent reassurance or discouragement), consistency (same follow-up policy across nodes), transparency (candidates receive scaffolding practice before assessment), and reflexivity (examiner utterances are logged for post-hoc quality review).
IOA Component Mapping
The six IOA components (scaffolding, scenario-based, aligned to program, learning outcomes, accessible/equitable, professionally-focused) inform the following runtime semantics:
- Scenario context is provided to the LLM as scene-setting, not as a question prompt.
- Rubric criteria are shared with the LLM as evidence vocabulary (what to listen for), not as scoring weights or model answers.
- Conversation flow is emergent — the LLM adapts its prompting based on what the candidate demonstrates, nudging toward higher rubric levels when opportunity arises.
- Scaffolding operates at two levels: (1) pre-exam familiarisation (§2.1) where the candidate practices the IOA format without stakes, and (2) in-assessment scaffolding (§9.6) where the examiner adjusts support based on candidate performance within the Zone of Proximal Development.
- Equity is enforced by not penalising communication style unless it is a declared learning outcome (§11.2). Fairness is further supported by deterministic policy enforcement for transitions and completion, preventing LLM-driven inconsistency across candidates.
1.1 Single-Active-Node Principle
The Runtime MUST maintain exactly one active node at any time during an in-progress exam. When the active node completes or transitions, the Runtime MUST atomically deactivate the current node and activate the next node; there MUST NOT be any point at which zero or more than one node is active.
1.2 Runtime Controller
A Runtime Controller owns all authoritative state. The LLM agent is a tool invoked by the Runtime Controller: it does NOT own state, does NOT decide transitions unilaterally, and does NOT persist evidence directly. The boundary is:
| Concern | Owner |
|---|---|
| Which node is active | Runtime Controller |
| When a turn starts/ends | Runtime Controller |
| How many follow-ups have been asked | Runtime Controller |
| Time budget remaining | Runtime Controller |
| Evidence ledger writes | Runtime Controller (LLM emits signals, Runtime persists) |
| Question wording, follow-up phrasing, bridge text | LLM (within constraints) |
| Judging answer sufficiency | LLM emits opinion; Runtime applies policy |
1.3 Execution Loop
For each active node, the Runtime Controller executes the following loop. Note that this is a conversation loop, not a rigid Q&A loop: the LLM drives an authentic dialogue, and the loop represents the runtime’s state tracking, not the conversation’s structure.
Pipecat Mapping: Steps marked with 🟢 execute in the Pipecat pipeline/FlowManager. Steps marked with 🔵 execute in the Runtime Controller. The LLM calls `report_observation` (the single allowed function) after processing each candidate response; the Runtime Controller handler evaluates the observation and decides the next action.
while node is active:
    1. 🔵 Enter node → emit node_entered, set time budget, reset counters
    2. 🔵 Build NodeConfig from specification (scenario, rubric criteria, persona, constraints)
    3. 🟢 FlowManager.set_node_from_config(config)
    4. 🟢 LLM initiates conversation (task_messages drive the opening)
    5. 🟢 STT captures candidate response → text flows to LLM
    6. 🟢 LLM processes response, calls report_observation(...)
    7. 🔵 Runtime Controller handler receives observation:
        a. If commandDetected → dispatch command (see §10), return
        b. Validate spokenText through output filters (content/topic/action/length)
        c. Write evidence signals to Evidence Ledger
        d. Check guardrails:
            - Time budget exceeded? → transition to best-effort
            - Follow-up requested?
                - Counter < maxFollowUps → inject follow-up context, continue
                - Counter >= maxFollowUps → transition to best-effort
            - Off-topic?
                - Count < maxOffTopicRedirects → inject redirect, continue
                - Count >= maxOffTopicRedirects → transition to best-effort
            - Evidence sufficient + requiredEvidenceCount met? → transition to completed
            - Anxiety detected? → extend time budget
        e. If transitioning:
            - Finalize current node evidence
            - Build next node's NodeConfig
            - 🔵 call flow_manager.set_node_from_config(next_config)
        f. If continuing: return validated spokenText to LLM for TTS
    8. 🟢 TTS speaks the response to candidate
    9. Loop back to 5
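The guardrail checks in step 7d can be read as a decision function over counters that only the Runtime Controller owns. The following is a minimal Python sketch under stated assumptions: `Observation`, `NodeState`, `Action`, and all field names are hypothetical, the real `report_observation` payload is not defined in this section, and the anxiety-based time extension is left out.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    FOLLOW_UP = "follow_up"
    REDIRECT = "redirect"
    COMPLETE = "completed"
    BEST_EFFORT = "best_effort"


@dataclass
class Observation:
    # Hypothetical fields; the real report_observation schema may differ.
    follow_up_requested: bool = False
    off_topic: bool = False
    evidence_sufficient: bool = False


@dataclass
class NodeState:
    # Limits and counters held by the Runtime Controller, never by the LLM.
    max_follow_ups: int
    max_off_topic_redirects: int
    required_evidence_count: int
    follow_up_count: int = 0
    off_topic_count: int = 0
    evidence_count: int = 0
    time_remaining_s: float = 60.0


def decide(obs: Observation, node: NodeState) -> Action:
    """Apply the step 7d checks in the order listed in the loop."""
    if node.time_remaining_s <= 0:
        return Action.BEST_EFFORT
    if obs.follow_up_requested:
        if node.follow_up_count < node.max_follow_ups:
            node.follow_up_count += 1
            return Action.FOLLOW_UP
        return Action.BEST_EFFORT
    if obs.off_topic:
        if node.off_topic_count < node.max_off_topic_redirects:
            node.off_topic_count += 1
            return Action.REDIRECT
        return Action.BEST_EFFORT
    if obs.evidence_sufficient and node.evidence_count >= node.required_evidence_count:
        return Action.COMPLETE
    return Action.CONTINUE
```

Keeping the counters on the Runtime side means the LLM can only request a follow-up; whether one is granted is decided by policy.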
1.4 Thread Model
The Runtime MUST be single-threaded per exam instance. All state mutations for a given exam MUST be serialised. Event emission MUST occur before the corresponding state mutation is considered committed.
1.5 Idempotency
Every Runtime operation that mutates state MUST be idempotent with respect to its event sequence number. Re-applying the same event (e.g., during recovery) MUST produce the same resulting state.
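One common way to get this property is to record the highest sequence number already applied and ignore anything at or below it. The sketch below assumes a hypothetical `Event` shape and a hypothetical `ExamState` holding per-node follow-up counts; it illustrates the idea and is not the specification's data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    seq: int      # monotonically increasing event sequence number
    kind: str     # e.g. "follow_up_issued"
    node_id: str


@dataclass
class ExamState:
    last_applied_seq: int = 0
    follow_up_counts: dict[str, int] = field(default_factory=dict)


def apply(state: ExamState, event: Event) -> ExamState:
    """Apply an event at most once, keyed on its sequence number."""
    if event.seq <= state.last_applied_seq:
        return state  # already applied, so a replay changes nothing
    if event.kind == "follow_up_issued":
        current = state.follow_up_counts.get(event.node_id, 0)
        state.follow_up_counts[event.node_id] = current + 1
    state.last_applied_seq = event.seq
    return state
```

Replaying the same event twice during recovery therefore leaves the follow-up count unchanged.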
2. Exam Lifecycle State Machine
2.1 States
| State | Description |
|---|---|
| `created` | Exam instance created from published AssessmentPackage. Candidate assigned. |
| `scheduled` | Exam scheduled with a specific time window. |
| `scaffolding` | Optional practice phase. Candidate experiences the IOA format with a practice scenario that does NOT count toward the score. Emits scaffolding_started / scaffolding_completed. |
| `ready` | All pre-conditions met (candidate authenticated, audio/video checks passed, session token issued, scaffolding completed or skipped). |
| `in_progress` | Exam is live; at least one node has been entered. |
| `paused` | Exam temporarily suspended (candidate-initiated pause or system-initiated). |
| `completed` | All nodes processed or exam explicitly ended. Terminal. |
| `aborted` | Exam ended prematurely due to violation, timeout, or system failure. Terminal. |
| `expired` | Exam time window closed before completion. Terminal. |
2.2 Transitions
created ──[assign_candidate]──► scheduled
scheduled ──[preconditions_met, scaffolding_enabled]──► scaffolding
scheduled ──[preconditions_met, scaffolding_disabled]──► ready
scaffolding ──[practice_complete OR candidate_skips_practice]──► ready
ready ──[exam_start]──► in_progress
in_progress ──[pause_requested]──► paused
paused ──[resume]──► in_progress
in_progress ──[all_nodes_complete OR explicit_end]──► completed
in_progress ──[violation OR system_failure]──► aborted
in_progress ──[time_window_expired]──► expired
paused ──[time_window_expired]──► expired
scheduled ──[time_window_expired]──► expired
2.3 Transition Guards
- `scheduled → scaffolding`: Runtime MUST verify preconditions are met AND `scaffolding.enabled === true` in the specification. Scaffolding turns MUST use a separate practice scenario (defined in the specification) and MUST NOT produce evidence signals for the marking pipeline.
- `scaffolding → ready`: Runtime MUST emit `scaffolding_completed` with a count of practice turns taken. Scaffolding transcript MAY be retained for QA but MUST be excluded from the MarkingPackage.
- `ready → in_progress`: Runtime MUST verify candidate identity is confirmed and all pre-checks (audio, video, network) have passed.
- `in_progress → paused`: Runtime MUST record the pause timestamp and remaining time budget. The active node MUST be suspended, not reset.
- `paused → in_progress`: Runtime MUST restore the active node state exactly as it was at pause.
- `in_progress → completed`: Runtime MUST verify that every required node has been processed (either completed or marked best-effort per §7).
- `in_progress → aborted`: Runtime MUST record the abort reason. Abort reasons include: candidate violation (e.g., third-party assistance detected), repeated guardrail violations, or irrecoverable system failure.
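The exam lifecycle lends itself to a lookup table keyed on (state, trigger), with the scaffolding branch handled as a guard. The following Python sketch transcribes the transitions from §2.2; the `Exam` enum, the `next_state` helper, and the choice to pass `scaffolding_enabled` as an argument are illustrative assumptions, and the remaining guards from §2.3 are not modelled.

```python
from enum import Enum


class Exam(Enum):
    CREATED = "created"
    SCHEDULED = "scheduled"
    SCAFFOLDING = "scaffolding"
    READY = "ready"
    IN_PROGRESS = "in_progress"
    PAUSED = "paused"
    COMPLETED = "completed"
    ABORTED = "aborted"
    EXPIRED = "expired"


# (current state, trigger) -> next state, transcribed from §2.2.
TRANSITIONS = {
    (Exam.CREATED, "assign_candidate"): Exam.SCHEDULED,
    (Exam.SCAFFOLDING, "practice_complete"): Exam.READY,
    (Exam.SCAFFOLDING, "candidate_skips_practice"): Exam.READY,
    (Exam.READY, "exam_start"): Exam.IN_PROGRESS,
    (Exam.IN_PROGRESS, "pause_requested"): Exam.PAUSED,
    (Exam.PAUSED, "resume"): Exam.IN_PROGRESS,
    (Exam.IN_PROGRESS, "all_nodes_complete"): Exam.COMPLETED,
    (Exam.IN_PROGRESS, "explicit_end"): Exam.COMPLETED,
    (Exam.IN_PROGRESS, "violation"): Exam.ABORTED,
    (Exam.IN_PROGRESS, "system_failure"): Exam.ABORTED,
    (Exam.IN_PROGRESS, "time_window_expired"): Exam.EXPIRED,
    (Exam.PAUSED, "time_window_expired"): Exam.EXPIRED,
    (Exam.SCHEDULED, "time_window_expired"): Exam.EXPIRED,
}


def next_state(state: Exam, trigger: str, scaffolding_enabled: bool = False) -> Exam:
    """Return the next state, or raise if the trigger is not allowed here."""
    if state is Exam.SCHEDULED and trigger == "preconditions_met":
        # Guard: the branch taken depends on scaffolding.enabled.
        return Exam.SCAFFOLDING if scaffolding_enabled else Exam.READY
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(f"{trigger!r} is not allowed in state {state.value!r}") from None
```

Terminal states have no outgoing entries, so any trigger received after completion, abort, or expiry is rejected.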
2.4 Allowed Actions Per State
| Action | created | scheduled | scaffolding | ready | in_progress | paused | completed | aborted | expired |
|---|---|---|---|---|---|---|---|---|---|
| Assign candidate | ✓ | ||||||||
| Schedule exam | ✓ | ||||||||
| Start scaffolding | ✓ | ||||||||
| Skip scaffolding | ✓ | ||||||||
| Start exam | ✓ | ||||||||
| Candidate speaks | ✓ | ✓ | |||||||
| LLM generates question | ✓ | ||||||||
| Transition node | ✓ | ||||||||
| Pause | ✓ | ||||||||
| Resume | ✓ | ||||||||
| Abort | ✓ | ✓ | |||||||
| Read transcript | ✓ | ✓ | ✓ | ||||||
| Read evidence | ✓ | ✓ | ✓ |
3. Node Lifecycle State Machine
3.1 States
| State | Description |
|---|---|
| `pending` | Node has not yet been reached in the exam flow. |
| `active` | Node is the current active node; candidate is being assessed on it. |
| `completed` | Node assessment finished successfully (sufficient evidence gathered). |
| `best_effort` | Node ended with incomplete evidence (follow-ups exhausted or time budget hit). |
| `skipped` | Node skipped due to conditional routing (e.g., prerequisite not met). |
3.2 Transitions
pending ──[node_enter]──► active
active ──[evidence_sufficient]──► completed
active ──[followups_exhausted OR time_budget_hit]──► best_effort
active ──[transition_condition_skip]──► skipped
pending ──[transition_condition_skip]──► skipped
3.3 Transition Guards
- `pending → active`: Runtime MUST emit `node_entered` event with node ID, sequence index, and timestamp. Runtime MUST initialise the node’s follow-up counter to 0 and start the time budget timer.
- `active → completed`: Runtime MUST emit `node_completed` event. Evidence signals collected during this node MUST be finalised and written to the Evidence Ledger (see §12).
- `active → best_effort`: Runtime MUST emit `node_best_effort` event with a reason (followups_exhausted, time_budget_hit). All partial evidence MUST still be written to the Evidence Ledger.
- `active → skipped`: Runtime MUST emit `node_skipped` event with the transition condition that triggered the skip. Runtime MUST NOT count skipped nodes as completed for completion purposes (see §7).
4. Turn Lifecycle State Machine
A turn is a single exchange: candidate speaks → LLM processes → LLM responds (or transitions).
4.1 States
| State | Description |
|---|---|
| `awaiting_candidate` | Runtime is waiting for the candidate to speak. LLM has finished presenting the question or follow-up. |
| `candidate_speaking` | STT is capturing the candidate’s audio. |
| `processing` | Candidate’s response is being processed (STT finalisation, LLM evaluation). |
| `llm_responding` | LLM is generating and TTS is playing the response. |
| `turn_complete` | Turn ended; ready for next turn or node transition. |
4.2 Transitions
awaiting_candidate ──[speech_detected]──► candidate_speaking
candidate_speaking ──[speech_ended OR silence_timeout]──► processing
processing ──[evaluation_complete, needs_followup]──► llm_responding
processing ──[evaluation_complete, sufficient]──► turn_complete
processing ──[evaluation_complete, command_detected]──► turn_complete
llm_responding ──[tts_finished]──► awaiting_candidate (loop for follow-up)
llm_responding ──[tts_finished, transitioning]──► turn_complete
4.3 Turn Timeout
- If `candidate_speaking` does not start within `silenceTimeoutMs` (configurable per node), the Runtime MUST transition to `processing` with a `silence_detected` flag.
- If `processing` exceeds `processingTimeoutMs`, the Runtime MUST emit a `turn_timeout` event and the turn MUST be treated as best-effort for that candidate utterance.
- `llm_responding` MUST complete within `llmResponseTimeoutMs`. If exceeded, Runtime MUST emit `llm_timeout` and either retry once or fall back to a canned transition message.
4.4 Turn-Level Events
The turn lifecycle states map to the canonical event types defined in 02-schema.md §14 (RuntimeEventType) and the event protocol in 05-event-protocol.md. Internal state transitions (e.g., entering candidate_speaking) MAY be tracked internally by the Runtime Controller but are not necessarily emitted as protocol-level events. The canonical turn-related event types are:
- `examiner_turn`: examiner produces a turn (maps to `transcript_final` with speaker “examiner”)
- `candidate_turn`: candidate produces a turn (maps to `transcript_final` with speaker “candidate”)
- `turn_completed`: turn cycle completed (both sides have spoken or timeout)
- `turn_timeout`: turn processing exceeded `processingTimeoutMs`
Additionally, the event protocol defines transcript_delta (streaming STT partials), transcript_final (canonical persisted utterance), examiner_utterance_started, and examiner_utterance_final for real-time UI consumption. See 05-event-protocol.md §4.4–4.7.
5. Candidate Readiness State
5.1 Purpose
Before the exam begins, the Runtime MUST verify the candidate is technically and cognitively ready. This prevents starting a high-stakes assessment with broken audio, a confused candidate, or an unverified identity.
5.2 Readiness Checks
| Check | Required | Failure Behaviour |
|---|---|---|
| Identity verification (face match, ID check, or proctoring token) | MUST pass | Block exam start; emit readiness_identity_failed |
| Audio input device active | MUST pass | Block exam start; emit readiness_audio_failed |
| Audio output device active (TTS audible) | MUST pass | Block exam start; emit readiness_audio_output_failed |
| Video input device active (if required) | MUST pass | Block exam start; emit readiness_video_failed |
| Network connectivity (latency < threshold) | SHOULD pass | Warn candidate; allow start with degraded flag |
| Candidate confirms instructions understood | MUST pass | Re-present instructions; emit readiness_instructions_not_understood |
5.3 Readiness State Machine
not_ready ──[check_identity]──► identity_verified
identity_verified ──[check_audio]──► audio_ok
audio_ok ──[check_video]──► video_ok
video_ok ──[check_instructions]──► ready
All states are sequential. Runtime MUST NOT skip a check. Each failed check MUST emit an event and block progression. The Runtime MAY allow a configurable number of retries per check before blocking the exam entirely.
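The sequential rule and the optional retry budget can be sketched as a small runner that stops at the first check that cannot be passed. This is an illustration under assumptions: `run_readiness`, `CHECK_ORDER`, and the generic `readiness_check_failed:<state>` event string are hypothetical names, and a real Runtime would emit the specific events listed in §5.2.

```python
from typing import Callable

# Target states in the order given by the §5.3 state machine.
CHECK_ORDER = ["identity_verified", "audio_ok", "video_ok", "ready"]


def run_readiness(
    checks: dict[str, Callable[[], bool]],
    max_retries: int = 2,
    emit: Callable[[str], None] = print,
) -> str:
    """Run every check in order and return the last state reached."""
    state = "not_ready"
    for target in CHECK_ORDER:
        for _attempt in range(1 + max_retries):
            if checks[target]():
                state = target
                break
            emit(f"readiness_check_failed:{target}")  # hypothetical event name
        else:
            # Retries exhausted: progression is blocked at the current state.
            return state
    return state
```

Because the loop never jumps ahead, a failing audio check leaves the candidate at `identity_verified` regardless of whether later checks would pass.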
6. Recovery State
6.1 Recovery Scenarios
Recovery scenarios are divided into technical failures (infrastructure issues) and assessment failures (pedagogical situations where the assessment interaction goes wrong). Both categories MUST be handled by the Runtime Controller.
Technical Failures
| Scenario | Detection | Recovery Action |
|---|---|---|
| Network disconnection | WebSocket/LiveKit disconnect event | Pause exam; attempt reconnect within reconnectTimeoutMs. If reconnected, resume from last committed state. If timeout, abort. |
| STT failure | STT returns empty/error for N consecutive turns | Retry STT pipeline. If persistent, emit stt_failure event, present written question as fallback, log degraded mode. |
| STT low confidence | transcript_segment.confidence < 0.6 | Runtime MUST emit stt_low_confidence event. LLM MAY offer the candidate a chance to repeat. Evidence signals MUST NOT be recorded from segments with confidence below 0.5 (see §12.3). |
| LLM failure | LLM timeout or error response | Retry once with backoff. If persistent, use canned follow-up from specification fallback config. If no fallback, pause exam. |
| TTS failure | TTS returns error | Retry once. If persistent, present question as text on data channel. Emit tts_failure event. |
| Silence (candidate unresponsive) | Silence exceeds silenceTimeoutMs | Runtime MUST prompt candidate (via LLM or canned). After maxSilencePrompts, transition to best-effort for current node. |
| Candidate disconnects | LiveKit participant leave event | Pause exam immediately. If candidate reconnects within reconnectTimeoutMs, resume. If not, abort with candidate_disconnect. |
| Audio loop / echo | Audio energy level anomaly detection | Mute TTS, present text, attempt audio reset. Emit audio_loop_detected. |
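The STT low-confidence row uses two thresholds, which creates a middle band where a warning is emitted but evidence may still be recorded. A short Python sketch makes the bands explicit; the function name and the returned keys are hypothetical, and only the values 0.6 and 0.5 come from the table.

```python
WARN_THRESHOLD = 0.6      # below this, stt_low_confidence is emitted
EVIDENCE_THRESHOLD = 0.5  # below this, evidence signals are not recorded


def classify_segment(confidence: float) -> dict[str, bool]:
    """Map a transcript segment's confidence to the two Runtime decisions."""
    return {
        "emit_stt_low_confidence": confidence < WARN_THRESHOLD,
        "may_record_evidence": confidence >= EVIDENCE_THRESHOLD,
    }
```

For example, a segment at 0.55 triggers the event and the optional repeat offer, yet remains eligible as evidence, while a segment at 0.4 is excluded.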
Assessment Failures
Assessment failures are situations where the pedagogical interaction breaks down. Unlike technical failures, these require the LLM and Runtime to collaborate on recovery while preserving assessment validity.
| Scenario | Detection | Recovery Action |
|---|---|---|
| Candidate misunderstands scenario | LLM detects candidate’s response is inconsistent with the scenario role (e.g., candidate acts as manager when they should be the employee) | LLM re-establishes scenario context without revealing assessment content. Emits scenario_clarification event. MUST NOT reveal what the “correct” interpretation is — only re-state the scenario framing. This preserves assessment validity while correcting the misunderstanding. |
| Question difficulty mismatch | LLM signals difficulty_mismatch in report_observation (candidate’s response suggests question was too hard or too easy) | Runtime MAY allow one question rephrase at lower or higher complexity level. Emits difficulty_adjusted event. The rephrased question MUST assess the same evidence targets — only the complexity framing changes. |
| Candidate emotional distress | LLM signals distress_detected (beyond anxiety — e.g., crying, aggressive tone, refusal to continue) | Runtime offers pause with a welfare message: “We can take a break whenever you need. Would you like to pause?” If candidate continues, Runtime logs distress_event for post-exam review. If candidate does not respond within silenceTimeoutMs, Runtime pauses automatically. Emits welfare_check event. |
| Examiner gives contradictory information | Output validation detects contradiction with prior statements in conversation history | Runtime intercepts and re-prompts the LLM with the contradictory statement flagged. Emits consistency_violation event. If second attempt also contradicts, uses canned fallback. |
| Candidate gives consistently off-topic answers | LLM signals off_topic for 3+ consecutive turns despite redirects | Runtime emits persistent_off_topic event. LLM MAY re-state the question more explicitly (without revealing the answer). If still off-topic after maxOffTopicRedirects, transition to best-effort. |
Assessment Failure Principle: Recovery MUST NOT reveal model answers, rubric scoring logic, or the “correct” response. The goal is to restore the assessment interaction to a productive state, not to guide the candidate to the right answer. This preserves the assessment validity principle from Fenton (2025): the examiner should “neither discourage nor reassure the student” during prompting.
6.2 Recovery State Machine
healthy ──[failure_detected]──► recovering
recovering ──[recovery_successful]──► healthy
recovering ──[recovery_failed]──► degraded
degraded ──[manual_intervention OR timeout]──► aborted
recovering ──[reconnect_timeout]──► aborted
6.3 Recovery Guarantees
- The Runtime MUST preserve all events emitted before the failure. Events are the source of truth for recovery.
- On recovery, the Runtime MUST replay from the last committed event to restore consistent state.
- The LLM session context MUST be reconstructed from the event log, not from LLM memory.
- Recovery MUST NOT cause the candidate to lose credit for answers already given. Evidence already written to the ledger is permanent.
7. Completion Semantics
7.1 Node Completion Criteria
A node is complete when ALL of the following are true:
- The main question has been asked at least once.
- The candidate has provided at least one substantive response (not silence, not a command).
- Either:
  - (a) Evidence signals collected meet the node’s `CompletionPolicy.requiredEvidenceCount` threshold, OR
  - (b) Follow-ups have been exhausted (`followUpCount >= maxFollowUps`), OR
  - (c) Time budget for the node has been exhausted, OR
  - (d) The LLM has judged evidence sufficient AND the Runtime’s `autoCompleteOnSufficient` policy is enabled.
7.2 Best-Effort Completion
When a node ends without meeting condition (a), i.e. `CompletionPolicy.requiredEvidenceCount` is not reached, the Runtime MUST mark it as `best_effort`, NOT `completed`. Best-effort nodes:
- MUST still have their partial evidence written to the Evidence Ledger.
- MUST be distinguishable from fully completed nodes in the ledger (via a `completionStatus` field).
- MUST count toward overall exam progress for time-budget purposes.
- SHOULD trigger a review flag for markers if `requiredEvidenceCount` was not met.
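Read together, §7.1 and §7.2 separate "the node may end" from "the node ended with enough evidence". The sketch below is one possible reading in Python, with hypothetical names (`NodeSnapshot`, `completion_status`). It covers conditions (a) to (c) only: condition (d) is left out because its interaction with the best-effort rule is not spelled out here, and silence handling from §6.1 is out of scope.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    main_question_asked: bool
    substantive_responses: int
    evidence_count: int
    required_evidence_count: int
    follow_up_count: int
    max_follow_ups: int
    time_remaining_s: float


def completion_status(n: NodeSnapshot) -> str:
    """Return 'active', 'completed', or 'best_effort' for a node."""
    if not n.main_question_asked or n.substantive_responses < 1:
        return "active"  # the first two criteria are not yet met
    if n.evidence_count >= n.required_evidence_count:
        return "completed"  # condition (a)
    if n.follow_up_count >= n.max_follow_ups or n.time_remaining_s <= 0:
        return "best_effort"  # (b) or (c) reached without (a)
    return "active"
```

The returned string is what a ledger entry's `completionStatus` field would carry in this reading.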
7.3 Exam Completion
An exam is complete when ALL of the following are true:
- All required nodes have been processed (completed, best_effort, or skipped per routing).
- No node is in `active` state.
- Runtime has emitted `exam_completed` event.
An exam is aborted when:
- A hard violation occurs (see §11), OR
- A recovery failure forces termination (see §6), OR
- An explicit abort is issued by the proctoring system.
7.4 Partial Exam
If the exam is aborted, all nodes processed up to the abort point MUST still have their evidence written to the ledger. The marking pipeline MUST be able to score a partial exam. The Runtime MUST emit exam_partial with the list of completed and best-effort nodes.
8. Transition Semantics
8.1 Node Transition Types
| Type | Trigger | Description |
|---|---|---|
| `sequential` | Current node completes | Move to the next node in the specification sequence. |
| `conditional` | Edge condition evaluates to true | Jump to a non-adjacent node based on evidence or state. |
| `branch` | Branching node selects path | Follow a specific branch based on candidate profile or answer. |
| `skip` | Skip condition met | Skip one or more nodes. |
| `abort` | Abort condition met | End exam immediately. |
8.2 Transition Evaluation
- The Runtime MUST evaluate transition conditions in the order they appear in the specification. The FIRST matching condition wins.
- Transition conditions MAY reference:
  - Evidence signals collected so far (`signals:has("topic_x_signal")`)
  - Node completion status (`node:status("node_1") === "completed"`)
  - Time elapsed (`time:elapsed > 1800`)
  - Follow-up count on current node (`node:followUpCount`)
  - Candidate commands received (`commands:received("repeat")`)
- Transition conditions MUST NOT reference LLM internal state or raw transcript content directly. They MUST operate on structured Runtime state only.
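First-match evaluation over structured state can be sketched as an ordered list of predicates. In this Python illustration, `RuntimeState`, `Edge`, `select_transition`, and the example node names are hypothetical; the conditions are written as plain callables rather than in the condition syntax shown above, which this section does not fully define.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RuntimeState:
    # Structured state only: no transcript text and no LLM internals.
    signals: set[str] = field(default_factory=set)
    node_status: dict[str, str] = field(default_factory=dict)
    elapsed_s: float = 0.0


@dataclass
class Edge:
    target: str
    condition: Callable[[RuntimeState], bool]


def select_transition(edges: list[Edge], state: RuntimeState) -> Optional[str]:
    """Return the target of the first edge whose condition holds."""
    for edge in edges:  # iterate in specification order
        if edge.condition(state):
            return edge.target
    return None


# Example edge list; order matters because the first match wins.
edges = [
    Edge("node_5", lambda s: "topic_x_signal" in s.signals),
    Edge("wrap_up", lambda s: s.elapsed_s > 1800),
    Edge("node_2", lambda s: True),  # sequential default
]
```

With both the signal present and the time limit exceeded, `select_transition` returns `node_5`, since that edge is listed first.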
8.3 Transition Bridge
When transitioning between nodes, the Runtime MAY instruct the LLM to generate a transition bridge (a natural-language statement connecting the previous topic to the next). The LLM:
- MUST NOT reveal the next question’s content in the bridge.
- MUST NOT reference rubric criteria or model answers.
- SHOULD produce a brief, natural sentence (e.g., “Thank you. Now let’s move on to a different area.”).
- The Runtime MUST provide the LLM with the previous node’s topic label and the next node’s topic label (not question text) for bridge generation.
8.4 Transition Atomicity
A node transition MUST be atomic:
1. Finalise evidence for the departing node.
2. Write evidence to ledger.
3. Emit `node_completed` (or `node_best_effort`).
4. Emit `node_entered` for the arriving node.
5. Reset per-node state (follow-up counter, time budget).
Steps 1–5 MUST occur as a single committed transaction. If any step fails, the Runtime MUST roll back to the departing node’s active state and retry.
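One way to obtain all-or-nothing behaviour in memory is to stage the five steps on a copy and swap it in only when every step has succeeded. The Python sketch below assumes a hypothetical in-memory `Runtime` object and an `inject_failure` flag that exists purely to demonstrate rollback; a production Runtime would more likely rely on a transactional event store.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Runtime:
    active_node: str
    follow_up_count: int = 0
    ledger: list[tuple[str, str]] = field(default_factory=list)
    events: list[str] = field(default_factory=list)


def transition(rt: Runtime, next_node: str, evidence: list[str],
               status: str = "node_completed", inject_failure: bool = False) -> bool:
    """Stage steps 1-5 on a copy; commit only if all of them succeed."""
    work = copy.deepcopy(rt)
    try:
        finalised = list(evidence)                                    # 1. finalise evidence
        work.ledger.extend((work.active_node, e) for e in finalised)  # 2. write to ledger
        work.events.append(f"{status}:{work.active_node}")            # 3. emit completion event
        if inject_failure:                                            # demonstration hook only
            raise RuntimeError("simulated failure between steps 3 and 4")
        work.events.append(f"node_entered:{next_node}")               # 4. emit node_entered
        work.active_node, work.follow_up_count = next_node, 0         # 5. reset per-node state
    except RuntimeError:
        return False  # rt is untouched, so the departing node is still active
    rt.__dict__.update(vars(work))  # commit the staged state in one step
    return True
```

A `False` return leaves the caller free to retry, which matches the roll-back-and-retry requirement.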
9. Follow-up Semantics
9.1 Purpose
Follow-ups allow the LLM examiner to probe deeper when a candidate’s initial answer is incomplete, unclear, or insufficiently evidenced. They are the primary mechanism for agentic behaviour within the exam.
In IOA practice, the most important follow-up strategy is rubric-level nudging: when a candidate demonstrates competence at a lower rubric level (e.g., “description” → credit), the assessor uses a follow-up to open the door to a higher level (e.g., “analysis” → distinction). The Runtime MUST support this pattern by providing the LLM with rubric level information as evidence vocabulary, not as scoring logic.
9.2 Follow-up Lifecycle
main_question_asked
└── candidate answers
    ├── sufficient → transition (§8)
    └── insufficient → follow-up check:
        ├── followUps_remaining > 0 AND time_budget > 0 → issue follow-up
        └── followUps_remaining === 0 OR time_budget === 0 → best_effort
9.3 Follow-up Counter
- Each node maintains an independent `followUpCount` counter, starting at 0.
- The Runtime MUST increment `followUpCount` each time a follow-up is issued.
- `followUpCount` MUST be persisted in the event log before the follow-up is presented to the candidate.
- The Runtime MUST NOT allow `followUpCount` to exceed `maxFollowUps` for the node. This is a hard constraint: the LLM MUST NOT be asked to generate a follow-up when the counter is at max.
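The counter rules can be sketched as a small class that refuses to go past the limit and writes to the log before returning. The class name, exception, and event field names below are hypothetical; only `followUpCount`, `maxFollowUps`, and the `follow_up_issued` event name come from the specification text.

```python
class FollowUpLimitReached(Exception):
    """Raised when a follow-up is requested at the node's limit."""


class FollowUpCounter:
    def __init__(self, node_id: str, max_follow_ups: int, event_log: list[dict]):
        self.node_id = node_id
        self.max_follow_ups = max_follow_ups
        self.event_log = event_log
        self.count = 0  # every node starts at 0

    def issue(self, follow_up_type: str) -> int:
        """Reserve one follow-up; call this before asking the LLM for wording."""
        if self.count >= self.max_follow_ups:
            raise FollowUpLimitReached(self.node_id)
        self.count += 1
        # Persist first: the log entry exists before anything reaches the candidate.
        self.event_log.append({
            "type": "follow_up_issued",
            "nodeId": self.node_id,            # hypothetical field name
            "followUpType": follow_up_type,    # hypothetical field name
            "followUpCount": self.count,
        })
        return self.count
```

Calling `issue()` before invoking the LLM means a request at the limit fails without the LLM ever being asked to generate a follow-up.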
9.4 Follow-up Content Constraints
The LLM generates follow-up wording, but the Runtime enforces:
- Follow-up MUST be on-topic (same node’s assessment objective).
- Follow-up MUST NOT provide the answer or a strong hint (see §11 for hint refusal).
- Follow-up MUST NOT introduce a new assessment topic (that requires a new node).
- Follow-up SHOULD be a question, not a statement.
- Follow-up SHOULD reference what the candidate said (for naturalness), but MUST NOT quote rubric criteria.
9.5 Follow-up Types
The follow-up taxonomy is derived from oral assessment literature. Pearce & Chiavaroli (2020, cited in Fenton, 2025) establish a prompting continuum from neutral presentation to leading questions; Joughin (1998) describes the bidirectional adaptation that characterises the “dialogue” pole of interaction. The types below span this range.
| Type | Example | When Used | Literature Basis |
|---|---|---|---|
| `probe` | “Can you elaborate on what you mean by X?” | Candidate’s answer was vague. | Pearce & Chiavaroli: probing questions |
| `redirect` | “Let me rephrase: what would happen if…?” | Candidate misunderstood the question. | Pearce & Chiavaroli: clarifying questions |
| `scaffold` | “Think about it from the perspective of Y.” | Candidate is stuck; provides a graduated nudge. See §9.6 for scaffolding intensity. | Vygotsky ZPD; Fenton (2025): “simplify questions or prompt students who are struggling” |
| `challenge` | “What about the counterargument that…?” | Candidate’s answer is one-dimensional. | Joughin: probing reasoning depth |
| `nudge` | “You’ve described the situation well. Can you tell me more about why you think that’s the case?” | Candidate demonstrated lower rubric level; opens door to higher level (e.g., description → analysis). This is the core IOA prompting strategy. | Pearce & Chiavaroli: probing questions; Fenton (2025): higher-order skills |
| `confirm` | “So what you’re saying is [paraphrase], is that right?” | LLM paraphrases candidate’s answer for confirmation before proceeding. Serves dual function: confirms understanding AND gives candidate chance to correct. | Joughin: interaction as reciprocal adaptation |
| `extend` | “That’s a solid analysis. Now, how would you apply this in [different context]?” | Candidate gave a good answer; deepens the exploration at the same rubric level by asking for breadth or application. Distinct from challenge (which questions the answer) and nudge (which pushes to a higher level). | Joughin: applied problem solving; probing boundaries of understanding |
| `concede` | “That’s alright, let’s move on to something else.” | Candidate is clearly stuck and scaffolding hasn’t helped. Graceful abandonment of the current line of questioning within the node. Distinct from node-level best-effort completion; this is a turn-level move. | Fenton (2025): managing anxiety; examiner warmth |
| `closing` | “Is there anything else you’d like to add?” | Near time budget or follow-up limit. | Standard exam practice |
The Runtime MUST emit follow_up_issued with the follow-up type. The LLM chooses the type based on context; the Runtime does not enforce type selection but MAY log it for quality assurance.
9.6 In-Assessment Scaffolding
Scaffolding within the assessment (distinct from pre-exam familiarisation in §2.1) is the primary mechanism for supporting candidates within their Zone of Proximal Development (Vygotsky). When the LLM issues a scaffold follow-up, the amount of scaffolding provided is itself evidence of the candidate’s competence level.
Scaffolding intensity captures how much support the examiner provided:
- `0`, no scaffolding: candidate answered independently.
- `1`, minimal scaffolding: slight rephrasing or redirection.
- `2`, moderate scaffolding: provided a conceptual hint or perspective shift.
- `3`, heavy scaffolding: significantly simplified the question or broke it into sub-parts.
The Runtime MUST record scaffolding intensity on the evidence signal when a scaffold follow-up is issued (see §12.2). The marking pipeline uses this as evidence: a candidate who needed heavy scaffolding demonstrated a different competence level than one who needed minimal support.
Graduated withdrawal: The LLM SHOULD reduce scaffolding intensity over the course of a node. If the candidate demonstrates competence after scaffolding, subsequent follow-ups SHOULD probe at the original difficulty level. This mirrors the educational scaffolding principle of gradually removing support as competence develops.
Fenton (2025, p. 433): “Educators have the flexibility to simplify questions or prompt students who are struggling” and this flexibility results in “higher grades than would have been achieved with a written assessment.” The spec models this flexibility as a first-class concept, not an exception.
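Non-normative sketch. The graduated-withdrawal guidance above can be checked mechanically once scaffolding intensities are recorded per node. The TypeScript below assumes a hypothetical `ScaffoldingIntensity` type and hypothetical helper names (`escalationPoints`, `followsGraduatedWithdrawal`) that this specification does not define; only the 0–3 scale comes from the text. Because the guidance is a SHOULD, the result is meant for quality-assurance logging rather than enforcement.

```typescript
// Hypothetical names; only the 0–3 scale is taken from the specification.
type ScaffoldingIntensity = 0 | 1 | 2 | 3;

// Returns the positions in the trajectory where support increased
// compared with the previous scaffold follow-up in the same node.
function escalationPoints(trajectory: ScaffoldingIntensity[]): number[] {
  const points: number[] = [];
  let previous: ScaffoldingIntensity | undefined;
  let index = 0;
  for (const current of trajectory) {
    if (previous !== undefined && current > previous) {
      points.push(index);
    }
    previous = current;
    index++;
  }
  return points;
}

// True when intensity never rises across the node (flat or decreasing).
function followsGraduatedWithdrawal(trajectory: ScaffoldingIntensity[]): boolean {
  return escalationPoints(trajectory).length === 0;
}
```

A trajectory such as `[3, 2, 2, 1]` satisfies the check, while `[1, 3, 2, 3]` reports escalations at positions 1 and 3, which a reviewer could then inspect in the transcript.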
9.7 Conversation Quality Tracking
The Runtime SHOULD track conversation quality metrics within each node to support post-hoc assessment quality review. These metrics do not affect runtime behaviour but are included in the MarkingPackage for psychometric analysis (see §12.6).
| Metric | Description |
|---|---|
| `candidateTurnCount` | Number of substantive candidate responses in this node |
| `examinerFollowUpDepth` | Number of follow-ups issued in this node |
| `avgCandidateResponseLatencyMs` | Mean time between examiner prompt and candidate speech start |
| `longestCandidateMonologueSec` | Longest uninterrupted candidate speech |
| `followUpTypeDistribution` | Count of each follow-up type used |
| `scaffoldingTrajectory` | Sequence of scaffolding intensities across the node (should trend downward) |
Joughin (1998, p. 376): Reliability is threatened when there is “inconsistency between the questions asked of different candidates.” These metrics enable post-hoc analysis of whether conversation paths were comparable across candidates, even when the LLM adapted its questioning style.
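Non-normative sketch. Most of the metrics in the table can be derived from the ordered turns of a node. The TypeScript below uses a simplified, hypothetical `QualityTurn` shape (including a `followUpType` field that the turn record schema in §12.1 does not define) and assumes that `timestamp` marks the start of an utterance, so latency is measured from the end of the examiner's prompt. `scaffoldingTrajectory` is omitted for brevity.

```typescript
// Hypothetical, simplified turn shape used only for this sketch.
interface QualityTurn {
  role: "candidate" | "examiner";
  timestamp: number;      // Unix ms at utterance start (assumption)
  durationMs: number;
  isCommand: boolean;     // candidate commands are not substantive responses
  followUpType?: string;  // hypothetical: set on examiner follow-ups only
}

interface NodeQualityMetrics {
  candidateTurnCount: number;
  examinerFollowUpDepth: number;
  avgCandidateResponseLatencyMs: number | null; // null when no latency was measurable
  longestCandidateMonologueSec: number;
  followUpTypeDistribution: Record<string, number>;
}

function computeNodeQualityMetrics(turns: QualityTurn[]): NodeQualityMetrics {
  const distribution: Record<string, number> = {};
  const latencies: number[] = [];
  let candidateTurnCount = 0;
  let followUps = 0;
  let longestMs = 0;
  let lastExaminerEnd: number | undefined;

  for (const turn of turns) {
    if (turn.role === "examiner") {
      lastExaminerEnd = turn.timestamp + turn.durationMs;
      if (turn.followUpType !== undefined) {
        followUps++;
        distribution[turn.followUpType] = (distribution[turn.followUpType] ?? 0) + 1;
      }
    } else if (!turn.isCommand) {
      candidateTurnCount++;
      longestMs = Math.max(longestMs, turn.durationMs);
      if (lastExaminerEnd !== undefined) {
        latencies.push(Math.max(0, turn.timestamp - lastExaminerEnd));
        lastExaminerEnd = undefined; // measure latency once per examiner prompt
      }
    }
  }

  const total = latencies.reduce((sum, value) => sum + value, 0);
  return {
    candidateTurnCount,
    examinerFollowUpDepth: followUps,
    avgCandidateResponseLatencyMs: latencies.length === 0 ? null : total / latencies.length,
    longestCandidateMonologueSec: longestMs / 1000,
    followUpTypeDistribution: distribution,
  };
}
```

Excluding command turns from `candidateTurnCount` is one reading of "substantive candidate responses"; an implementation could reasonably draw that line differently.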
10. Candidate Command Semantics
10.1 Purpose
Candidates may issue commands during the exam (e.g., “repeat the question”, “can you clarify?”, “I need a moment”). These MUST be handled as structured commands, not as assessment evidence.
10.2 Command Vocabulary
| Command | Detection Method | Behaviour |
|---|---|---|
| `repeat` | Keyword/phrase detection + LLM intent classification | Re-present the current question or follow-up verbatim. MUST NOT count as a follow-up. MUST NOT reset the time budget. |
| `clarification` | LLM intent classification | LLM rephrases or explains the question instructions. MUST NOT reveal the model answer. Counts as one clarification_used. |
| `request_rephrase` | LLM intent classification | Candidate asks “can you say that differently?”; distinct from repeat (which re-presents verbatim) and clarification (which explains instructions). The LLM generates a different phrasing of the same question. MUST NOT reveal the model answer. Counts as one clarification_used. |
| `slow_down` | Keyword detection | LLM reduces speech rate. Runtime adjusts TTS speed. |
| `pause` | Explicit request | Transition to paused state (see §2). |
| `thinking_aloud` | LLM intent classification | Candidate says “let me think about that for a moment” or similar. Signals metacognitive awareness. Runtime emits candidate_thinking event. DOES NOT consume a pause, because the candidate is still engaged. LLM waits silently. Time budget continues. |
| `help` | LLM intent classification | Provide general exam instructions (not question-specific help). |
| `skip` | LLM intent classification | Request to skip current node. Runtime MAY honour if policy allows; otherwise MUST refuse and explain. |
| `revise_earlier_answer` | LLM intent classification | Candidate wants to revisit or amend a previous answer. Runtime MAY honour if the previous node is still within a configurable revision window. If honoured, emits answer_revision event. MUST NOT be used to revisit nodes from a different topic area. |
| `finish` | LLM intent classification | Candidate wants to end the exam. Runtime MUST confirm (“Are you sure?”) before processing. |
10.3 Command Processing Rules
- Commands MUST be detected before the candidate’s utterance is evaluated for evidence. A `repeat` command MUST NOT generate evidence signals.
- Commands MUST be processed by the Runtime Controller, not directly by the LLM. The LLM detects intent; the Runtime decides the action.
- The Runtime MUST emit a `candidate_command` event for every detected command, including the command type and raw utterance.
- Commands MUST NOT count toward `maxFollowUps`.
- Commands MUST consume time from the time budget (the candidate is using exam time).
- If the LLM cannot classify an utterance as either a command or an answer with high confidence, the Runtime SHOULD treat it as an answer and let the LLM proceed accordingly.
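Non-normative sketch. The rules above separate intent detection (LLM) from the routing decision (Runtime). The TypeScript below illustrates that split under stated assumptions: `IntentClassification`, `routeUtterance`, and the `minConfidence` threshold of 0.8 are hypothetical, since this section does not define what "high confidence" means numerically.

```typescript
type CommandType =
  | "repeat" | "clarification" | "request_rephrase" | "slow_down" | "pause"
  | "thinking_aloud" | "help" | "skip" | "revise_earlier_answer" | "finish";

// Hypothetical shape of the LLM's intent output.
interface IntentClassification {
  kind: "command" | "answer";
  commandType?: CommandType;
  confidence: number; // 0–1
}

interface CandidateCommandEvent {
  type: "candidate_command";
  commandType: CommandType;
  rawUtterance: string;
}

type RoutedUtterance =
  | { route: "command"; event: CandidateCommandEvent } // no evidence evaluation
  | { route: "answer" };                               // evaluated for evidence

// The Runtime, not the LLM, decides how the utterance is handled.
function routeUtterance(
  rawUtterance: string,
  intent: IntentClassification,
  minConfidence: number = 0.8, // assumed threshold
): RoutedUtterance {
  if (
    intent.kind !== "command" ||
    intent.commandType === undefined ||
    intent.confidence < minConfidence
  ) {
    // Uncertain classifications fall back to "answer", per the last rule above.
    return { route: "answer" };
  }
  return {
    route: "command",
    event: { type: "candidate_command", commandType: intent.commandType, rawUtterance },
  };
}
```

Routing happens before any evidence evaluation, so an utterance routed as a command never reaches the evidence path and does not touch the follow-up counter.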
10.4 Command Rate Limiting
The Runtime MUST enforce:
- Maximum 3 `repeat` commands per node. After that, the Runtime MUST emit `command_repeat_limit_reached` and present the question in written form via the data channel instead.
- Maximum 2 `clarification` commands per node. After that, the Runtime MUST emit `command_clarify_limit_reached` and proceed.
- No rate limit on `pause`, but `pause` duration counts against the exam time budget.
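Non-normative sketch. The per-node limits above amount to two counters that reset when a new node starts. In the TypeScript below, `NodeCommandLimiter` and its return shape are hypothetical; the limits and event names are the ones stated in this section.

```typescript
type LimitedCommand = "repeat" | "clarification";

const COMMAND_LIMITS: Record<LimitedCommand, { max: number; limitEvent: string }> = {
  repeat: { max: 3, limitEvent: "command_repeat_limit_reached" },
  clarification: { max: 2, limitEvent: "command_clarify_limit_reached" },
};

// One instance per node; create a fresh limiter when the node changes.
class NodeCommandLimiter {
  private counts: Record<LimitedCommand, number> = { repeat: 0, clarification: 0 };

  // Returns whether the command is honoured and, if not, which event to emit.
  register(command: LimitedCommand): { honoured: boolean; emit?: string } {
    const limit = COMMAND_LIMITS[command];
    if (this.counts[command] >= limit.max) {
      return { honoured: false, emit: limit.limitEvent };
    }
    this.counts[command] += 1;
    return { honoured: true };
  }
}
```

Since §10.2 says `request_rephrase` counts as one clarification_used, a caller might register it as `"clarification"`; that mapping is an interpretation, not something this section states explicitly.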
11. Guardrail Enforcement Semantics
11.1 Purpose
Guardrails ensure the LLM examiner operates within the boundaries defined in the specification. Guardrails are hard constraints enforced by the Runtime, not prompt-level instructions to the LLM.
11.2 Guardrail Catalogue
| Guardrail | Scope | Enforcement |
|---|---|---|
| Max follow-ups | Per node | Runtime counter (§9.3). The LLM MUST NOT be invoked for a follow-up when the counter is at its maximum. |
| Time budget | Per node, per exam | Runtime timer. Enforced even if LLM wants to continue. |
| Hint refusal | Per node | Runtime filters LLM output. If LLM response contains content matching the node’s modelAnswer or rubricPhrases, the response MUST be intercepted and replaced with a neutral re-prompt. |
| No rubric reveal | Global | Runtime MUST NOT pass rubric scoring weights, grade boundaries, or model answers to the LLM. However, the Runtime MUST pass rubric criteria and evidence vocabulary to the LLM — in IOA practice, rubric criteria are the conversation guide (sentence-starters), not a secret. The LLM uses criteria to know what to listen for, not how to score. The distinction: criteria describe observable competencies (“explains the mechanism”, “evaluates trade-offs”); scoring logic maps those to marks (criterion X = 5 marks if excellent, 3 if adequate). Only the former is shared. |
| No scoring | Global | Runtime MUST NOT pass scoring logic (grade boundaries, mark weights, score ranges) to the LLM. The LLM emits evidence signals as observations; scoring happens in the marking pipeline. The LLM MAY know rubric criteria as evidence vocabulary (what to listen for), but MUST NOT know how those criteria map to marks. |
| No structure change | Global | Runtime MUST NOT allow the LLM to add, remove, or reorder nodes. The LLM operates within the current node only. |
| No premature end | Per node | Runtime MUST NOT end a node before the main question has been asked and at least one candidate response received. |
| No topic jump | Per node | Runtime MUST constrain the LLM’s context to the current node’s topic. The LLM MUST NOT reference content from future nodes. |
| Off-topic handling | Per turn | If the LLM signals off_topic, Runtime MUST increment a per-node offTopicCount. After maxOffTopicRedirects (default: 2), Runtime MUST mark node as best-effort. |
| Silence handling | Per turn | If silence exceeds silenceTimeoutMs, Runtime MUST trigger a silence prompt (not an LLM follow-up). After maxSilencePrompts (default: 2), Runtime MUST mark node as best-effort. |
| Candidate anxiety | Per turn | If the LLM signals candidate_anxiety, Runtime MAY extend the time budget by anxietyTimeExtensionMs (configurable). The Runtime MUST NOT reduce difficulty or simplify questions. |
| Technical failure | Per event | See §6. Guardrails apply even in degraded mode — the Runtime MUST NOT bypass max-follow-ups or time budgets during recovery. |
| Persona consistency | Per node | If the specification defines a persona for the node (e.g., “hotel manager”), the Runtime MUST validate that every LLM spokenText output stays in character. The output validation pipeline MUST check for persona-break patterns (e.g., “As your examiner…”, “In this assessment…”). On violation, Runtime re-prompts with persona reminder. Emits guardrail_triggered with type persona_break. |
| Equity — communication style | Global | Unless communicationStyleIsLearningOutcome: true in the specification, the LLM MUST NOT penalise or comment on accent, fluency, verbal confidence, or speech patterns. The Runtime MUST filter evidence signals that reference communication quality when the flag is false. |
| Rapport and tone calibration | Per node | The LLM SHOULD build rapport through natural dialogue moves (acknowledgement, encouragement, reassurance) that are distinct from follow-ups. These moves MUST NOT count toward maxFollowUps. The LLM SHOULD adapt warmth based on context: warmer at node start, warmer when candidate struggles, more formal during technical probing. Rapport moves MUST NOT cross into assessment bias — encouragement like “Take your time” is permitted; “You’re doing great” is NOT (it provides implicit evaluative feedback). The Runtime MUST log rapport moves for quality assurance. |
| Neutrality in prompting | Per node | The LLM’s follow-up prompts MUST aim for neutrality as defined by Pearce & Chiavaroli (2020): “neither discourages nor reassures the student.” This is distinct from rapport — an examiner can be warm (rapport) while remaining assessment-neutral (prompting). The LLM MUST NOT provide evaluative feedback during the assessment (e.g., “Good answer”, “That’s not quite right”). The Runtime’s output validation pipeline MUST check for evaluative language patterns. |
| Examiner-initiated pause for welfare | Per turn | If the LLM signals distress_detected (not just anxiety), the Runtime MAY initiate a pause with a welfare message. This is distinct from candidate-initiated pause — the examiner proactively offers a break. Emits welfare_pause_offered event. |
| Time budget fairness | Per node | The Runtime MUST track whether the candidate received substantially different time-on-task compared to the node’s configured budget. If the LLM’s follow-up strategy causes a node to end significantly early (e.g., < 50% of time budget used), the Runtime SHOULD log this for fairness review. Different candidates should have comparable opportunities to demonstrate competence. |
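
Non-normative sketch. The off-topic and silence guardrails in the catalogue are both per-node counters with a best-effort fallback. The TypeScript below shows one reading of "after `maxOffTopicRedirects`" and "after `maxSilencePrompts`": the configured number of redirects or prompts is issued first, and the next occurrence marks the node best-effort. `NodeTurnGuard` and `TurnGuardAction` are hypothetical names.

```typescript
interface TurnGuardPolicy {
  maxOffTopicRedirects: number; // catalogue default: 2
  maxSilencePrompts: number;    // catalogue default: 2
}

type TurnGuardAction = "redirect" | "silence_prompt" | "mark_best_effort";

// One instance per node.
class NodeTurnGuard {
  private offTopicCount = 0;
  private silencePromptCount = 0;
  private policy: TurnGuardPolicy;

  constructor(policy: TurnGuardPolicy = { maxOffTopicRedirects: 2, maxSilencePrompts: 2 }) {
    this.policy = policy;
  }

  // Called when the LLM signals off_topic for the current turn.
  onOffTopic(): TurnGuardAction {
    this.offTopicCount += 1;
    return this.offTopicCount > this.policy.maxOffTopicRedirects
      ? "mark_best_effort"
      : "redirect";
  }

  // Called when silence exceeds silenceTimeoutMs.
  onSilenceTimeout(): TurnGuardAction {
    if (this.silencePromptCount >= this.policy.maxSilencePrompts) {
      return "mark_best_effort";
    }
    this.silencePromptCount += 1;
    return "silence_prompt"; // a Runtime prompt, not an LLM follow-up
  }
}
```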
11.3 Guardrail Violation Handling
When a guardrail is triggered:
- Runtime MUST emit a `guardrail_triggered` event with the guardrail type, context, and action taken.
- If the violation is LLM-caused (e.g., LLM attempted to reveal rubric), Runtime MUST intercept the output, replace it, and log the attempt.
- If the violation is structural (e.g., time budget exceeded), Runtime MUST enforce the hard limit regardless of LLM state.
- Repeated guardrail violations (configurable threshold) SHOULD trigger an alert to the proctoring system.
11.4 LLM Output Validation
Every LLM response during an exam turn MUST pass through the Runtime’s output validation pipeline before being presented to the candidate:
- Content filter: Check against `forbiddenPhrases` (model answer fragments, rubric terms).
- Topic filter: Check that the response references only the current node’s topic scope.
- Action filter: Check that the response does not attempt to transition, score, or end the exam (these are Runtime actions).
- Length filter: Check that the response does not exceed `maxResponseLength`.
If validation fails, the Runtime MUST:
- Log the violation.
- Invoke the LLM again with a corrected prompt (e.g., “Please rephrase without mentioning [X]”).
- If re-invocation also fails, use a canned fallback response.
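Non-normative sketch. The validate, re-invoke, then fall back sequence above can be expressed as a small wrapper around the LLM call. The TypeScript below implements only the content and length filters, because the topic and action filters would need classifiers that this section does not describe. `findViolation`, `presentableResponse`, and the synchronous `generate` callback are hypothetical; a real LLM call would be asynchronous.

```typescript
interface ValidationConfig {
  forbiddenPhrases: string[];
  maxResponseLength: number; // assumed to be a character count
}

// Returns a description of the first violation, or null if the text passes.
function findViolation(text: string, config: ValidationConfig): string | null {
  const lower = text.toLowerCase();
  for (const phrase of config.forbiddenPhrases) {
    if (lower.indexOf(phrase.toLowerCase()) !== -1) {
      return 'forbidden phrase "' + phrase + '"';
    }
  }
  if (text.length > config.maxResponseLength) {
    return "response longer than " + config.maxResponseLength + " characters";
  }
  return null;
}

// Try once, re-invoke once with a correction hint, then use the canned fallback.
function presentableResponse(
  generate: (correctionHint?: string) => string, // stand-in for the LLM call
  config: ValidationConfig,
  fallback: string,
  violationLog: string[],
): string {
  const first = generate();
  const firstViolation = findViolation(first, config);
  if (firstViolation === null) return first;
  violationLog.push(firstViolation);

  const second = generate("Please rephrase; the previous response had a " + firstViolation);
  const secondViolation = findViolation(second, config);
  if (secondViolation === null) return second;
  violationLog.push(secondViolation);

  return fallback;
}
```

Substring matching is the simplest possible content filter and will miss paraphrases of a model answer; it is used here only to keep the control flow visible.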
12. Transcript and Evidence Capture Semantics
12.1 Transcript Structure
Every exam MUST produce a structured transcript. The transcript is NOT just raw STT output; it is a structured sequence of turns, each annotated with metadata.
Turn Record Schema
interface TranscriptTurn {
turnId: string; // Unique turn identifier
nodeId: string; // Node this turn belongs to
turnIndex: number; // Sequential index within the node
role: "candidate" | "examiner"; // Speaker
content: string; // Text (STT output or LLM output)
timestamp: number; // Unix ms
durationMs: number; // Duration of the utterance
metadata: {
isCommand: boolean; // Was this a candidate command?
commandType?: string; // If command, which type?
isFollowUp: boolean; // Was this a follow-up question?
followUpIndex?: number; // If follow-up, which one?
isSilence: boolean; // Was this a silence event?
isOffTopic: boolean; // Was the candidate off-topic?
confidence: number; // STT confidence score (0–1)
};
}
12.2 Evidence Signal Capture
Evidence signals are the structured observations that feed the marking pipeline. They are NOT scores; they are facts about what the candidate demonstrated.
In IOA practice, evidence signals are rubric criteria. The rubric defines what competencies to look for; the evidence signal records whether (and how well) the candidate demonstrated that competency. The Runtime MUST ensure that every signalType in the specification maps to a rubric criterion, and that the LLM receives these as the vocabulary of “what to listen for.”
Evidence Signal Schema
interface EvidenceSignal {
signalId: string; // Unique signal identifier
nodeId: string; // Node where signal was observed
signalType: string; // Type from specification — MUST map to a rubric criterion
rubricLevel?: string; // Observed rubric level (e.g., "description", "analysis", "evaluation")
transversalSkills?: string[]; // Cross-cutting skills observed (e.g., ["critical_thinking", "professional_reasoning"])
confidence: number; // LLM's confidence that this signal was observed (0–1)
turnId: string; // Which turn triggered this signal
excerpt: string; // Short excerpt from candidate's response
timestamp: number; // When the signal was observed
source: "llm_observed" | "runtime_detected"; // Who observed it
scaffoldingIntensity?: number; // 0–3 scale: amount of scaffolding provided before this signal
// 0 = no scaffolding (independent answer)
// 1 = minimal (rephrasing/redirect)
// 2 = moderate (conceptual hint)
// 3 = heavy (simplified question/broken into parts)
scaffoldingEffective?: boolean; // Did the candidate improve after scaffolding? Only set when scaffoldingIntensity > 0.
}
Scaffolding as Evidence: The `scaffoldingIntensity` and `scaffoldingEffective` fields capture assessment information that a record of the answer alone would miss. Joughin (1998) identifies that oral assessment can measure “applied problem solving”, and the amount of scaffolding a candidate needs is itself evidence of their problem-solving competence. A candidate who answers correctly after heavy scaffolding (intensity 3) demonstrated a different competence level than one who answers independently (intensity 0). The marking pipeline MUST use scaffolding intensity as a modifier when evaluating evidence signals, not as a separate score.
Transversal Skills: IOA research identifies cross-cutting competencies (critical thinking, communication, problem-solving, professional reasoning) that span multiple nodes. The `transversalSkills` field allows the LLM to tag evidence signals with these cross-cutting observations. The marking pipeline MUST aggregate transversal skill signals across all nodes to produce a holistic competency profile. Transversal skill vocabulary is defined in the specification at the exam level, not per-node.
12.3 Evidence Capture Rules
- Evidence signals MUST be emitted by the LLM during the `processing` state (see §4).
- The Runtime MUST validate that the `signalType` is defined in the current node’s `evidenceSignals` in the specification. Unknown signal types MUST be logged and discarded.
- The Runtime MUST write validated signals to the Evidence Ledger immediately, not batched at node completion.
- The LLM MAY emit multiple signals per turn (candidate may demonstrate several competencies in one answer).
- The Runtime MUST NOT allow the LLM to emit signals for a node that is not currently active.
- Duplicate signals (same `signalType`, same `turnId`) MUST be deduplicated by the Runtime.
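Non-normative sketch. The capture rules above reduce to three checks before an immediate write. In the TypeScript below, `EvidenceLedgerSketch`, `IncomingSignal`, and the outcome labels are hypothetical; `IncomingSignal` carries only the fields the checks need, not the full schema from §12.2.

```typescript
// Hypothetical subset of the evidence signal schema.
interface IncomingSignal {
  nodeId: string;
  signalType: string;
  turnId: string;
  excerpt: string;
}

type CaptureOutcome =
  | "written"
  | "rejected_inactive_node"
  | "discarded_unknown_type" // the caller is expected to log this
  | "duplicate";

class EvidenceLedgerSketch {
  private entries: IncomingSignal[] = [];
  private seenKeys: { [key: string]: boolean } = {};

  capture(
    signal: IncomingSignal,
    activeNodeId: string,
    nodeSignalTypes: string[], // the active node's evidenceSignals
  ): CaptureOutcome {
    if (signal.nodeId !== activeNodeId) return "rejected_inactive_node";
    if (nodeSignalTypes.indexOf(signal.signalType) === -1) return "discarded_unknown_type";

    // Duplicates are defined as the same signalType on the same turnId.
    const key = JSON.stringify([signal.signalType, signal.turnId]);
    if (this.seenKeys[key]) return "duplicate";

    this.seenKeys[key] = true;
    this.entries.push(signal); // written immediately, not batched per node
    return "written";
  }

  size(): number {
    return this.entries.length;
  }
}
```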
12.4 Evidence Sufficiency
The LLM MAY signal evidence_sufficient during processing if it believes enough signals have been collected. However:
- The Runtime MUST NOT auto-complete the node based solely on LLM judgment unless `autoCompleteOnSufficient` is enabled in the specification policy.
- Even with `autoCompleteOnSufficient`, the Runtime MUST verify that at least `CompletionPolicy.requiredEvidenceCount` distinct evidence targets have been satisfied.
- The LLM’s sufficiency judgment is advisory; the Runtime’s `requiredEvidenceCount` check is authoritative.
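Non-normative sketch. The advisory and authoritative roles above combine into a single predicate. The TypeScript below assumes that `autoCompleteOnSufficient` sits alongside `requiredEvidenceCount` in one policy object, which this section does not confirm; `mayAutoComplete` is a hypothetical name.

```typescript
// Assumed shape: the spec names both settings but not their exact location.
interface CompletionPolicySketch {
  autoCompleteOnSufficient: boolean;
  requiredEvidenceCount: number;
}

function mayAutoComplete(
  llmSignalledSufficient: boolean,
  satisfiedTargets: string[], // evidence target ids; may contain repeats
  policy: CompletionPolicySketch,
): boolean {
  // The LLM signal alone never completes the node.
  if (!llmSignalledSufficient || !policy.autoCompleteOnSufficient) return false;

  // Count distinct targets; the Runtime's count is authoritative.
  const distinct: { [target: string]: boolean } = {};
  for (const target of satisfiedTargets) {
    distinct[target] = true;
  }
  return Object.keys(distinct).length >= policy.requiredEvidenceCount;
}
```

Returning `false` here only means "do not auto-complete on sufficiency"; other completion paths (time budget, follow-up limit) are outside this predicate.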
12.5 Transcript Closure
At exam completion (or abort), the Runtime MUST:
- Write all pending transcript turns to the transcript store.
- Write all pending evidence signals to the Evidence Ledger.
- Emit a `transcript_finalised` event.
- Compute and store a `transcriptHash` (SHA-256 of the canonicalised transcript) for integrity verification.
The transcript MUST be immutable after transcript_finalised. No post-hoc edits are permitted. If a correction is needed, it MUST be appended as a correction record with its own hash chain.
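Non-normative sketch. This section requires a SHA-256 over the canonicalised transcript but does not define the canonical form. The TypeScript below assumes one possible canonicalisation (JSON with object keys sorted recursively and no insignificant whitespace) and uses Node.js's built-in `node:crypto` module. `canonicalise` and `computeTranscriptHash` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Assumed canonical form: sorted keys, no whitespace. Not defined by the spec.
function canonicalise(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalise(item)).join(",") + "]";
  }
  const parts: string[] = [];
  for (const key of Object.keys(value).sort()) {
    parts.push(JSON.stringify(key) + ":" + canonicalise(value[key]));
  }
  return "{" + parts.join(",") + "}";
}

// Hex-encoded SHA-256 of the canonical transcript.
function computeTranscriptHash(transcript: Json): string {
  return createHash("sha256").update(canonicalise(transcript), "utf8").digest("hex");
}
```

With a canonical form, two serialisations of the same transcript that differ only in key order produce the same hash, so a verifier can recompute it independently. A production system would likely adopt a published canonicalisation scheme rather than this hand-rolled one.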
12.6 Marking Pipeline Handoff
The Runtime MUST produce a MarkingPackage containing:
- The full structured transcript.
- The evidence ledger (all signals, with node association, transversal skill tags, and scaffolding intensity).
- Exam metadata (candidate ID, exam ID, start/end timestamps, duration).
- Node completion statuses (completed / best_effort / skipped).
- Any guardrail events triggered during the exam.
- Any assessment failure events (scenario clarification, difficulty adjustment, welfare checks).
- The transcript hash for integrity verification.
- Conversation fingerprint (SHA-256 of the ordered conversation path: node sequence + follow-up types + turn count per node). This gives auditors a compact record of how each exam instance unfolded, which is critical for academic integrity auditing. Two candidates on the same specification will usually have different fingerprints because the conversation unfolds differently, although identical paths (and therefore identical fingerprints) remain possible.
- Assessment equivalence (e.g., `equivalentWrittenWordCount: 3000`) from the specification, for calibration reference.
- Scaffolding metadata (if scaffolding was used): number of practice turns, whether candidate skipped early. Scaffolding transcript is NOT included.
- Conversation quality metrics per node: candidate turn count, examiner follow-up depth, average candidate response latency, longest candidate monologue, follow-up type distribution, and scaffolding trajectory. See §9.7.
- Psychometric equivalence summary: aggregate statistics for marking pipeline analysis — average follow-ups per node, follow-up type distribution, average time per node, scaffolding intensity distribution, and a conversation path variance score (0.0 = identical paths across candidates, 1.0 = maximally different). This enables the marking pipeline to detect whether different candidates received substantially different assessment experiences (see Joughin, 1998, on reliability threats from inconsistent questioning).
This package is the sole input to the marking pipeline. The marking pipeline MUST NOT need to reconstruct state from raw STT output or LLM conversation history.
Appendix A: Time Budget Calibration Guidance
Non-normative. This appendix provides heuristics for setting per-node and per-exam time budgets. These are derived from the oral assessment literature and are intended as starting points, not hard requirements.
| Factor | Guideline | Source |
|---|---|---|
| Per-question time | 5–7 minutes for theoretical questions | Akimov & Malin (2020): “around five minutes’ response time” per theoretical question |
| Follow-up time | 5–10 minutes total per node | Akimov & Malin (2020): “five to ten minutes were allocated for follow-up questions” |
| Total exam duration | 15–30 minutes for a clear picture of understanding | Fenton (2025, citing Sayre, 2014): “it should take no more than 20 minutes to get a clear picture”; Akimov & Malin (2020): 30-minute blocks |
| Response time per question | 60 seconds max for concise answers | Bayley et al. (2024): “only the first 60 seconds of each of their responses would be graded” |
| Anxiety extension | Configurable, default 2 minutes | Spec design; Akimov & Malin (2020): students found shorter exams “rushed” |
| Practice/familiarisation | 5–10 minutes before assessment begins | Bayley et al. (2024): practice ConVOE; Fenton (2025): “opportunities to practice where no marks are allocated” |
Calibration Considerations
- Too short: Creates time pressure that disadvantages candidates who think more slowly or are answering in a second language (Akimov & Malin, 2020: students found 10-minute exams “rushed”).
- Too long: Leads to fatigue and reduced engagement. Fenton (2025) recommends “less is more.”
- Time as fairness variable: If two candidates get the same node but one uses 3 minutes and the other uses 8 minutes (because the LLM asked different follow-ups), the Runtime tracks this difference in the conversation quality metrics (§9.7). The marking pipeline SHOULD consider whether time-on-task differences correlate with score differences.
- Proportionality: Time budget SHOULD be proportional to the number of evidence targets on the node. A node with 3 evidence targets needs more time than one with 1.
Fairness Note: The psychometric equivalence summary does not constrain the LLM’s adaptiveness — it measures it. The marking pipeline can use this data to assess whether conversation path variance is correlated with score variance. If candidates who received harder follow-ups systematically score lower, this indicates a fairness problem that requires investigation (Akimov & Malin, 2020, Table 4: “It is hard to determine whether students perform significantly differently if follow-up questions are different”).
Revision History
| Version | Date | Changes |
|---|---|---|
| v0.2.0 | 2026-06-30 | Added anxiety detection and distress handling semantics. Updated terminology from ‘Exam Runtime IR’ to ‘IOA-ORM’. Refined state machine transitions for welfare checks. |
| v0.1.0 | 2026-05-06 | Initial release. |