01 - Core Concepts & Domain Model

Status

Draft · v0.2.0 · 2026-06-30

0. Theoretical Foundations

This specification is grounded in the oral assessment literature. The design decisions documented here are informed by the following key works:

Paper	Key Contribution to This Specification
Joughin (1998; 2010) - Dimensions of Oral Assessment	Six dimensions (content type, interaction, authenticity, structure, examiners, orality) as design parameters for the `AssessmentProfile`. Reliability/validity trade-offs along continua. Three-way classification (presentations, interrogations, applications) from Joughin (2010).
Akimov & Malin (2020) - Oral Examination as Online Assessment Tool	Validity/reliability/fairness matrix. Question banking for inter-case reliability. Recording and moderation for intra-rater reliability. Identity verification. Anxiety management.
Bayley et al. (2024) - Implementing Large-Scale Oral Exams (ConVOEs)	Scalability patterns for 600+ students: parallel administration, batch grading, cross-section consistency. Practice sessions for anxiety reduction.
Fenton (2025) - Reconsidering Oral Exams and Assessments	IOA definition and components. Prompting taxonomy (Pearce & Chiavaroli, 2020). Formative vs. summative distinction. Examiner training. Communication skills as learning outcome.

0.1 Bloom’s Taxonomy

Bloom’s Taxonomy (Bloom, 1956), in its revised form (Anderson & Krathwohl, 2001), defines six levels of cognitive engagement, progressing from lower-order thinking skills (Remember, Understand, Apply) to higher-order thinking skills (Analyze, Evaluate, Create). This matters for AI-powered oral assessment because generative AI tools perform well at the lower levels but struggle at the Create level (Fenton, 2025). Oral assessment, which can probe higher-order thinking through interactive dialogue, is therefore a valuable complement to written assessment in the AI era.

The specification encodes Bloom’s levels as the BloomLevel enum and attaches them to EvidenceTarget via the optional cognitiveLevel field. This enables:

Compile-time validation that an exam covers the intended range of cognitive demands
Runtime follow-up escalation strategy (cognitiveEscalationStrategy) that targets higher-order thinking
Marking rubric alignment that weights higher-order responses more heavily
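
The coverage check in the first bullet can be sketched as follows. This is a minimal illustration, not the specification's schema: the snake_case field names, the integer ordering of the levels, and the `missing_levels` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class BloomLevel(IntEnum):
    """Revised Bloom levels, ordered from lower- to higher-order (ordering assumed)."""
    REMEMBER = 1
    UNDERSTAND = 2
    APPLY = 3
    ANALYZE = 4
    EVALUATE = 5
    CREATE = 6


@dataclass(frozen=True)
class EvidenceTarget:
    target_id: str
    cognitive_level: Optional[BloomLevel] = None  # optional, as in the spec


def missing_levels(targets: list[EvidenceTarget],
                   intended: set[BloomLevel]) -> set[BloomLevel]:
    """Return the intended Bloom levels that no evidence target covers."""
    covered = {t.cognitive_level for t in targets if t.cognitive_level is not None}
    return intended - covered


targets = [
    EvidenceTarget("t1", BloomLevel.UNDERSTAND),
    EvidenceTarget("t2", BloomLevel.ANALYZE),
    EvidenceTarget("t3"),  # no cognitive level declared
]
gaps = missing_levels(targets, {BloomLevel.ANALYZE, BloomLevel.EVALUATE})
print(sorted(level.name for level in gaps))  # ['EVALUATE']
```

A compiler could treat a non-empty result as a warning or an error; which of the two is appropriate is left open here.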

0.2 Interactive Oral Assessment

Sotiriadou et al. (2020) define the ‘interactive oral’ as ‘a form of assessment asking students to perform real-world tasks to demonstrate meaningful application of necessary knowledge and skills.’ This positions oral assessment as scenario-driven, real-world-task-based, and conversation-oriented — not merely a spoken examination.

The specification supports interactive oral assessment through task and scenario node kinds, which carry persona, promptSeed, and scenario context for immersive role-play. The AssessmentProfile.authenticityProfile captures where the exam sits on the decontextualised → authentic spectrum.

0.3 Anxiety Management

Student anxiety is one of the most discussed challenges in the oral assessment literature. Akimov and Malin (2020) report that 100% of students were nervous, with 53% ‘very nervous.’ Fenton (2025) notes that ‘the anxiety some students experience may be linked to the fact that they are unfamiliar with the format’ and recommends practice sessions, format familiarization, and clear communication of expectations.

The specification addresses anxiety through:

CandidateBriefing - candidate-facing exam information for preparation
warmup node kind with isPractice and anxietyMitigation properties
RecoveryPolicy with anxiety-specific recovery strategies
FormativeFeedbackPolicy for learning-oriented feedback that reduces uncertainty

How Theory Maps to Schema

Joughin’s (1998) six dimensions are encoded as the AssessmentProfile on ExamRuntimePackage:

Joughin Dimension	Schema Construct	Rationale
1. Primary Content Type	`AssessmentProfile.contentTypes`	Determines what counts as valid evidence. Knowledge can be assessed from a single response; interpersonal competence requires evaluating interaction quality across turns.
2. Interaction	`AssessmentProfile.interactionMode`	Reliability is threatened when interaction tends toward dialogue (Joughin, 1998, p. 376). The specification must capture where on the continuum the exam sits.
3. Authenticity	`AssessmentProfile.authenticityProfile`	Authenticity relates to face and construct validity (Akimov & Malin, 2020). The specification must express what professional context is being simulated.
4. Structure	`AssessmentProfile.structureProfile`	Closed structure improves reliability; open structure improves validity for probing understanding (Joughin, 1998, p. 376).
5. Examiners	`AssessmentProfile.examinerConfig`	Supports AI solo, human solo, panel, and AI-with-moderator. Enables inter-rater reliability tracking (Akimov & Malin, 2020).
6. Orality	`AssessmentProfile.oralityProfile`	Many oral exams involve supplementary written work (Joughin, 1998, p. 367). The specification must support oral defense of prior submissions.

Additional theory-driven constructs:

Prompting taxonomy (Pearce & Chiavaroli, 2020, via Fenton 2025) → FollowUpPolicy.allowedPromptingLevels
Question banking (Akimov & Malin, 2020; Bayley et al., 2024) → QuestionPool
Moderation (Akimov & Malin, 2020) → ModerationPolicy, ModerationRecord
Assessment-significant moments (Fenton, 2025) → hesitation_detected, self_correction_detected events
Identity verification (Akimov & Malin, 2020; Fenton, 2025) → identity_check node kind
Formative vs. summative (Fenton, 2025; Akimov & Malin, 2020) → ExamMetadata.assessmentPurpose

1. Glossary

Term	Definition
Exam	A published, versioned oral assessment with defined structure, policies, and evidence targets.
Exam Runtime Package	The canonical specification artifact representing a complete published exam. The single source of truth.
Runtime Node	A discrete unit of the exam flow - a question, task, scenario segment, or transition point.
Runtime Session	A single candidate’s attempt at an exam. One exam may have many sessions.
Runtime State	The mutable, per-session state tracked by the runtime controller during execution.
Runtime Event	An immutable record of a significant state change during a session.
Candidate Command	A structured input from the candidate that the runtime controller MUST process (e.g., repeat, pause, clarification).
Transcript Turn	A single utterance in the conversation, attributed to examiner or candidate, with timing and node context.
Evidence Target	A rubric-aligned definition of what the exam is trying to assess at a given node.
Evidence Signal	A runtime-emitted record that a specific evidence target was (or was not) demonstrated, with confidence and provenance.
Evidence Ledger	The authoritative, structured collection of all evidence signals produced during a session.
Completion Policy	Rules governing when a node is “done” - how many turns, what evidence is required, time limits.
Follow-Up Policy	Rules governing how many follow-ups the examiner may issue, and under what conditions.
Transition Policy	Rules governing how and when the runtime moves from one node to another.
Recovery Policy	Rules governing how the runtime handles anomalies - silence, unclear answers, off-topic, anxiety, network issues.
Telemetry Policy	Rules governing what operational data is emitted and where.
Context Policy	Rules governing what exam context (rubric, previous nodes, candidate history) the AI examiner may access.
Pipecat Adapter Output	The compiled Pipecat-specific configuration (FlowManager config + NodeConfig) generated from the specification.
Agent Boundary	The explicit set of allowed and forbidden actions for the AI examiner, enforced by the runtime controller.
Marking Runtime	The downstream system that reads the evidence ledger and produces assessment scores.
Authoring Studio	The lecturer-facing tool for designing exam flows. Compiles to the specification on publish.
Assessment Profile	A structured declaration of the exam’s position on Joughin’s (1998) six dimensions of oral assessment. Captures design parameters that determine what the exam measures and how validity/reliability claims are supported.
Question Pool	A set of equivalent question variants from which one or more are drawn per session. Enables inter-case reliability (Akimov & Malin, 2020).
Prompting Level	A classification of examiner follow-up moves based on Pearce & Chiavaroli’s (2020) taxonomy: from neutral presentation to leading guidance.
Scaffolding Budget	Maximum scaffolding intensity permitted at a node (0-10). The amount of scaffolding provided is itself evidence of candidate competence (Fenton, 2025).
Moderation Policy	Rules for human review of AI-generated evidence signals. Supports inter-rater reliability (Akimov & Malin, 2020).
Calibration Profile	References to calibration exercises and measured accuracy metrics for the AI examiner. Ensures consistent assessment quality (Fenton, 2025).
Fairness Audit	Structured analysis of assessment outcomes across demographic dimensions to detect systematic disparities (Akimov & Malin, 2020; Fenton, 2025).
Content Type	Joughin’s (1998) four primary categories of what oral assessment can measure: knowledge/understanding, applied problem solving, interpersonal competence, intrapersonal qualities.
Validity Claim	A structured declaration of how the exam addresses face, content, construct, concurrent, inter-rater, inter-case, or fairness validity.

2. Core Domain Entities

2.1 ExamRuntimePackage

The top-level artifact. A published, versioned, complete specification of an oral exam. Contains metadata, the node graph, global policies, evidence target definitions, and the optional assessment profile.

Key properties:

Stable identity (examId) and version (version)
Metadata (title, subject, duration, institution, assessment purpose)
AssessmentProfile - Joughin’s six dimensions as design parameters (optional in v1)
The ordered graph of ExamRuntimeNode objects
GlobalRuntimePolicies that apply across all nodes
Evidence target registry
Question pools for randomized delivery
Pipecat adapter configuration hints

Theoretical grounding: The assessmentProfile field encodes Joughin’s (1998) six dimensions as first-class design parameters. This is not metadata decoration - these dimensions constrain runtime behavior, inform evidence interpretation, and support validity arguments. The assessmentPurpose field (formative/summative/diagnostic) affects whether evidence contributes to grades, whether candidates receive real-time feedback, and whether sessions are recorded (Fenton, 2025; Akimov & Malin, 2020).

2.2 AssessmentProfile

A structured declaration of the exam’s position on Joughin’s (1998) six dimensions of oral assessment. Optional in v1 - when absent, defaults are inferred from node-level policies. When present, it constrains runtime behavior, informs evidence interpretation, and supports validity/reliability arguments.

Joughin’s Six Dimensions (encoded as schema properties):

Dimension	Property	Why It Matters
1. Primary Content Type	`contentTypes`	Determines what counts as valid evidence. Knowledge/understanding can be assessed from a single correct response; interpersonal competence requires evaluating interaction quality across multiple turns (Joughin, 1998, p. 369).
2. Interaction	`interactionMode`	Ranges from presentation (one-way) to free dialogue. Reliability is threatened when interaction tends toward dialogue (Joughin, 1998, p. 376). The runtime should report on interaction mode consistency.
3. Authenticity	`authenticityProfile`	Ranges from decontextualised (abstract questions) to contextualised (genuine professional practice). Relates to face and construct validity (Akimov & Malin, 2020).
4. Structure	`structureProfile`	Ranges from closed (set questions, fixed order) to open (examiner follows responses). Closed structure improves reliability; open structure improves validity for probing understanding (Joughin, 1998, p. 376).
5. Examiners	`examinerConfig`	Supports self, peer, authority, panel, and AI-with-moderator. Enables inter-rater reliability tracking and moderation workflows (Akimov & Malin, 2020).
6. Orality	`oralityProfile`	Ranges from purely oral to oral-as-secondary (defending written work). Supports viva voce and multi-modal assessments (Joughin, 1998, p. 367).

Additional properties:

validityClaims - structured declarations of validity/reliability/fairness evidence
moderationPolicy - rules for human review of AI-generated signals
calibrationProfile - AI examiner accuracy metrics and calibration references
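
As a rough illustration of how the profile could inform runtime reporting, the sketch below models the six dimensions as optional fields and derives reliability notes from the trade-offs named in the table. The string vocabularies (`"dialogue"`, `"open"`, and so on) and the `reliability_notes` helper are assumptions for the example; the specification does not fix them here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AssessmentProfile:
    # Field names mirror the properties table; the value vocabularies are assumed.
    content_types: list[str] = field(default_factory=list)
    interaction_mode: Optional[str] = None      # e.g. "presentation" or "dialogue"
    authenticity_profile: Optional[str] = None  # e.g. "decontextualised" or "contextualised"
    structure_profile: Optional[str] = None     # e.g. "closed" or "open"
    examiner_config: Optional[str] = None
    orality_profile: Optional[str] = None


def reliability_notes(profile: AssessmentProfile) -> list[str]:
    """Flag the design choices the table associates with reduced reliability."""
    notes = []
    if profile.interaction_mode == "dialogue":
        notes.append("dialogue interaction: reliability is at risk")
    if profile.structure_profile == "open":
        notes.append("open structure: favours validity over reliability")
    return notes


profile = AssessmentProfile(interaction_mode="dialogue", structure_profile="closed")
print(reliability_notes(profile))
```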

2.3 QuestionPool

A set of equivalent question variants from which one or more are drawn per session. Addresses inter-case reliability: when different candidates receive different questions, the questions must be of equivalent difficulty.

Key properties:

Pool ID, label
List of QuestionVariant objects (each with prompt seed, difficulty estimate, evidence targets)
Draw count (how many variants per session)
Whether reuse across concurrent sessions is allowed

Theoretical grounding: Akimov & Malin (2020) describe a bank of 69 questions from which students draw randomly. Bayley et al. (2024) note that question-sharing via group chat is a real concern at scale. The question pool model enables anti-collusion measures (no reuse across concurrent sessions) and difficulty calibration (estimated difficulty per variant).
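
The anti-collusion behavior described above might be sketched like this. The `in_use` bookkeeping, the `draw`/`release` method names, and the difficulty scale are hypothetical; the specification only states that reuse across concurrent sessions can be disallowed.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QuestionVariant:
    variant_id: str
    prompt_seed: str
    difficulty: float  # estimated difficulty; the 0..1 scale is an assumption


@dataclass
class QuestionPool:
    pool_id: str
    variants: list[QuestionVariant]
    draw_count: int = 1
    allow_concurrent_reuse: bool = False
    in_use: set[str] = field(default_factory=set)  # variant IDs held by live sessions

    def draw(self, rng: random.Random) -> list[QuestionVariant]:
        """Draw `draw_count` variants, skipping ones held by other live sessions."""
        available = [
            v for v in self.variants
            if self.allow_concurrent_reuse or v.variant_id not in self.in_use
        ]
        if len(available) < self.draw_count:
            raise RuntimeError("not enough unused variants for a new session")
        drawn = rng.sample(available, self.draw_count)
        if not self.allow_concurrent_reuse:
            self.in_use.update(v.variant_id for v in drawn)
        return drawn

    def release(self, drawn: list[QuestionVariant]) -> None:
        """Return variants to the pool when a session ends."""
        self.in_use.difference_update(v.variant_id for v in drawn)


pool = QuestionPool(
    pool_id="pool-1",
    variants=[QuestionVariant(f"v{i}", f"Variant {i}", 0.5) for i in range(3)],
)
rng = random.Random(0)
session_a = pool.draw(rng)
session_b = pool.draw(rng)  # never the variant held by session_a
```

Raising when the pool is exhausted is one possible choice; a real runtime might instead queue the session or fall back to reuse, depending on policy.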

2.4 ExamRuntimeNode

A single unit in the exam flow. Nodes are the vertices of the exam graph. Each node has a type (kind), local policies, evidence targets, and transition rules.

Node kinds:

question - A direct question to the candidate
scenario - A scenario presentation (read aloud, display, etc.)
task - A structured task (role-play, problem-solving, demonstration)
discussion - An open-ended discussion segment
warmup - Pre-assessment rapport building
wrapup - Closing segment
branch - Conditional routing node (no candidate interaction)
identity_check - Pre-exam identity verification (not assessed)

Key properties:

Unique node ID within the package
kind - the node type
promptSeed - the base content/prompt for this node (not the full system prompt)
questionPoolId - optional reference to a question pool for randomized delivery
Local CompletionPolicy, FollowUpPolicy, RecoveryPolicy
evidenceTargets - what this node is trying to assess
transitions - edges to successor nodes with conditions
candidateCommands - which commands are valid at this node
timeBudget - maximum time for this node
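
Because nodes form a graph, a publish-time check that every transition points at a known node is a natural validation step. The sketch below is illustrative only: it reduces `transitions` to a list of target node IDs and introduces a hypothetical `dangling_transitions` helper.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(str, Enum):
    QUESTION = "question"
    SCENARIO = "scenario"
    TASK = "task"
    DISCUSSION = "discussion"
    WARMUP = "warmup"
    WRAPUP = "wrapup"
    BRANCH = "branch"
    IDENTITY_CHECK = "identity_check"


@dataclass
class ExamRuntimeNode:
    node_id: str
    kind: NodeKind
    prompt_seed: str = ""
    transitions: list[str] = field(default_factory=list)  # target node IDs only (simplified)


def dangling_transitions(nodes: list[ExamRuntimeNode]) -> list[tuple[str, str]]:
    """Return (source, target) pairs whose target is not a node in the package."""
    known = {n.node_id for n in nodes}
    return [(n.node_id, t) for n in nodes for t in n.transitions if t not in known]


nodes = [
    ExamRuntimeNode("warmup", NodeKind.WARMUP, transitions=["q1"]),
    ExamRuntimeNode("q1", NodeKind.QUESTION, transitions=["wrapup", "q9"]),
    ExamRuntimeNode("wrapup", NodeKind.WRAPUP),
]
print(dangling_transitions(nodes))  # [('q1', 'q9')]
```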

2.5 RuntimeSession

A candidate’s live attempt at an exam. Created when a session starts, persists until completion or termination. Contains the mutable runtime state and references the immutable package.

Key properties:

Session ID, candidate ID, package ID + version
RuntimeState - current mutable state
TranscriptTurn[] - full conversation transcript
EvidenceLedger - accumulated evidence
RuntimeEvent[] - event log for this session
Start time, end time, status

2.6 RuntimeState

The mutable state tracked by the runtime controller during a session. This is NOT persisted as a log - it is the working memory of the controller.

Key properties:

currentNodeId - which node the session is in
currentNodeTurnCount - turns in the current node
currentNodeFollowUpCount - follow-ups issued in the current node
globalElapsedMs - total session time
nodeElapsedMs - time in current node
candidateCommandHistory - commands issued by the candidate
evidenceCoverage - which evidence targets have signals
recoveryAttempts - recovery actions taken
status - active | paused | completed | terminated
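
A minimal sketch of this working memory, assuming snake_case names and two hypothetical helper methods (`record_turn`, `enter_node`). Elapsed time is approximated here as the sum of turn durations, which is a simplification rather than something the specification prescribes.

```python
from dataclasses import dataclass


@dataclass
class RuntimeState:
    current_node_id: str
    current_node_turn_count: int = 0
    current_node_follow_up_count: int = 0
    global_elapsed_ms: int = 0
    node_elapsed_ms: int = 0
    status: str = "active"  # active | paused | completed | terminated

    def record_turn(self, duration_ms: int, is_follow_up: bool = False) -> None:
        """Update counters and clocks after a turn in the current node."""
        self.current_node_turn_count += 1
        if is_follow_up:
            self.current_node_follow_up_count += 1
        self.global_elapsed_ms += duration_ms
        self.node_elapsed_ms += duration_ms

    def enter_node(self, node_id: str) -> None:
        """Move to a new node and reset node-scoped counters only."""
        self.current_node_id = node_id
        self.current_node_turn_count = 0
        self.current_node_follow_up_count = 0
        self.node_elapsed_ms = 0


state = RuntimeState(current_node_id="q1")
state.record_turn(4_000)
state.record_turn(2_500, is_follow_up=True)
state.enter_node("q2")
print(state.global_elapsed_ms, state.node_elapsed_ms)  # 6500 0
```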

2.7 RuntimeEvent

An immutable record of a significant state change. Events are the audit trail and the mechanism by which downstream systems (frontend, analytics, evidence ledger) learn about session activity.

Event categories:

Lifecycle: session_started, session_paused, session_resumed, session_completed, session_terminated
Node: node_entered, node_exited, node_timeout
Turn: examiner_turn, candidate_turn, turn_completed
Evidence: evidence_signal_emitted, evidence_target_satisfied, evidence_target_missed
Command: candidate_command_received, candidate_command_processed
Policy: follow_up_limit_reached, time_budget_warning, time_budget_exceeded, transition_forced, recovery_triggered, policy_violation
Agent: agent_action_allowed, agent_action_blocked
Assessment-significant: hesitation_detected, self_correction_detected
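
One way to make events immutable and to keep event types within the catalogue above is a frozen dataclass with a lookup table. The category keys, the `category_of` helper, and the three fields shown are assumptions for the sketch; a real event would carry more payload.

```python
from dataclasses import dataclass, FrozenInstanceError

EVENT_CATEGORIES = {
    "lifecycle": {"session_started", "session_paused", "session_resumed",
                  "session_completed", "session_terminated"},
    "node": {"node_entered", "node_exited", "node_timeout"},
    "turn": {"examiner_turn", "candidate_turn", "turn_completed"},
    "evidence": {"evidence_signal_emitted", "evidence_target_satisfied",
                 "evidence_target_missed"},
    "command": {"candidate_command_received", "candidate_command_processed"},
    "policy": {"follow_up_limit_reached", "time_budget_warning",
               "time_budget_exceeded", "transition_forced",
               "recovery_triggered", "policy_violation"},
    "agent": {"agent_action_allowed", "agent_action_blocked"},
    "assessment_significant": {"hesitation_detected", "self_correction_detected"},
}


def category_of(event_type: str) -> str:
    """Map an event type to its category, rejecting unknown types."""
    for category, types in EVENT_CATEGORIES.items():
        if event_type in types:
            return category
    raise ValueError(f"unknown event type: {event_type}")


@dataclass(frozen=True)
class RuntimeEvent:
    session_id: str
    event_type: str
    timestamp_ms: int

    def __post_init__(self) -> None:
        category_of(self.event_type)  # reject event types outside the catalogue

    @property
    def category(self) -> str:
        return category_of(self.event_type)


event = RuntimeEvent("s-1", "node_timeout", 42_000)
print(event.category)  # node
try:
    event.timestamp_ms = 0  # type: ignore[misc]
except FrozenInstanceError:
    print("events are immutable")
```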

2.8 CandidateCommand

A structured input from the candidate that the runtime controller MUST process. These are not free-text — they are semantic intents recognized from candidate speech or UI interactions.

Command types:

repeat - “Can you repeat that?”
clarification - “What do you mean by…?”
request_rephrase - “Can you say that differently?” (signals active engagement)
pause - “Can I have a moment?”
thinking_aloud - “Let me think about this…” (assessment-significant metacognitive signal)
raise_hand - Candidate signals they want to speak / interrupt
skip - “Can I skip this?” (subject to policy)
volume_up / volume_down - Technical adjustment
language_switch - If multi-language support is enabled
challenge_premise - Candidate questions the framing of a question (extended)
revise_earlier_answer - Candidate wants to revisit a previous answer (extended)

Commands are runtime primitives, not UI decorations. The runtime controller MUST process them according to the CandidateCommandPolicy.

Theoretical grounding: Joughin (1998) identifies dialogue as a key dimension — candidates in a dialogue may redirect conversation, challenge premises, or revisit earlier points. Fenton (2025) notes that oral assessments allow “self-correction” — the revise_earlier_answer command supports this. The thinking_aloud command captures metacognitive awareness, which is assessment-significant evidence.
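
The "MUST process" requirement can be read as: every recognized command produces an explicit outcome and is never silently dropped. The sketch below illustrates that reading. The shape of `CandidateCommandPolicy` (a set of allowed commands) and the `accepted`/`declined` outcome labels are assumptions; only the command names and the two event names come from this document.

```python
from dataclasses import dataclass
from enum import Enum


class CandidateCommand(str, Enum):
    REPEAT = "repeat"
    CLARIFICATION = "clarification"
    REQUEST_REPHRASE = "request_rephrase"
    PAUSE = "pause"
    THINKING_ALOUD = "thinking_aloud"
    RAISE_HAND = "raise_hand"
    SKIP = "skip"
    VOLUME_UP = "volume_up"
    VOLUME_DOWN = "volume_down"
    LANGUAGE_SWITCH = "language_switch"
    CHALLENGE_PREMISE = "challenge_premise"
    REVISE_EARLIER_ANSWER = "revise_earlier_answer"


@dataclass(frozen=True)
class CandidateCommandPolicy:
    allowed: frozenset[CandidateCommand]  # hypothetical shape of the policy


def process_command(command: CandidateCommand,
                    policy: CandidateCommandPolicy) -> tuple[str, list[tuple[str, str]]]:
    """Return an outcome plus the events emitted; no command is dropped silently."""
    events = [("candidate_command_received", command.value)]
    outcome = "accepted" if command in policy.allowed else "declined"
    events.append(("candidate_command_processed", f"{command.value}:{outcome}"))
    return outcome, events


policy = CandidateCommandPolicy(
    allowed=frozenset({CandidateCommand.REPEAT, CandidateCommand.PAUSE})
)
print(process_command(CandidateCommand.SKIP, policy)[0])    # declined
print(process_command(CandidateCommand.REPEAT, policy)[0])  # accepted
```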

2.9 TranscriptTurn

A single attributed utterance in the conversation. Richer than raw STT output - carries node context, timing, and semantic metadata.

Key properties:

turnIndex - sequential index in the session
role - examiner | candidate | system
text - the transcribed text
nodeId - which node this turn occurred in
timestampMs - when the turn started
durationMs - how long the turn lasted
isFollowUp - whether this examiner turn was a follow-up
followUpIndex - if follow-up, which one (0-based)
candidateCommandDetected - if a candidate command was detected in this turn

2.10 EvidenceTarget

A rubric-aligned definition of what the exam is trying to assess. Defined at the package level, referenced by nodes.

Key properties:

targetId - unique identifier
label - human-readable name (e.g., “Explain the mechanism of photosynthesis”)
description - detailed description of what constitutes evidence
rubricCriteriaIds - links to rubric criteria in the marking model
requiredConfidence - minimum confidence a signal must reach for this target to be considered satisfied
maxSignals - maximum signals this target can receive (prevents over-counting)
isRequired - whether this target MUST be satisfied for the exam to be valid

2.11 EvidenceSignal

A runtime-emitted record that a specific evidence target was demonstrated (or not). Produced by the AI examiner during conversation, written to the ledger immediately.

Key properties:

signalId - unique identifier
targetId - which EvidenceTarget this signal addresses
nodeId - which node the evidence was gathered in
turnRange - [startTurnIndex, endTurnIndex] - the transcript turns containing this evidence
confidence - 0.0 to 1.0, how confident the AI is that this target was met
source - ai_judgment | rubric_match | candidate_self_report | external_trigger
rationale - brief explanation of why this signal was emitted
timestampMs - when the signal was emitted

2.12 EvidenceLedger

The authoritative, structured collection of all evidence signals for a session. First-class output consumed by the marking runtime.

Key properties:

sessionId - which session this ledger belongs to
signals - ordered list of EvidenceSignal objects
coverageSummary - which EvidenceTarget IDs have at least one signal
satisfiedTargets - which required targets have signals meeting requiredConfidence
unsatisfiedTargets - which required targets lack sufficient signals

The ledger is not a transcript derivative. It is a real-time, structured, machine-readable evidence record maintained by the runtime controller.
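
A sketch of how the ledger's derived views could be computed from its signals, assuming snake_case names and a reduced set of fields. Treating a required target as satisfied when at least one signal meets `requiredConfidence`, and rejecting signals beyond `maxSignals`, are interpretations of the property descriptions rather than confirmed rules.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvidenceTarget:
    target_id: str
    required_confidence: float
    max_signals: int = 3
    is_required: bool = True


@dataclass(frozen=True)
class EvidenceSignal:
    signal_id: str
    target_id: str
    confidence: float  # 0.0 to 1.0


@dataclass
class EvidenceLedger:
    session_id: str
    targets: dict[str, EvidenceTarget]
    signals: list[EvidenceSignal] = field(default_factory=list)

    def add(self, signal: EvidenceSignal) -> bool:
        """Append a signal unless its target already holds max_signals."""
        target = self.targets[signal.target_id]
        held = sum(1 for s in self.signals if s.target_id == signal.target_id)
        if held >= target.max_signals:
            return False
        self.signals.append(signal)
        return True

    def coverage_summary(self) -> set[str]:
        return {s.target_id for s in self.signals}

    def satisfied_targets(self) -> set[str]:
        return {
            s.target_id for s in self.signals
            if self.targets[s.target_id].is_required
            and s.confidence >= self.targets[s.target_id].required_confidence
        }

    def unsatisfied_targets(self) -> set[str]:
        required = {t.target_id for t in self.targets.values() if t.is_required}
        return required - self.satisfied_targets()


ledger = EvidenceLedger("s-1", {
    "t1": EvidenceTarget("t1", required_confidence=0.7),
    "t2": EvidenceTarget("t2", required_confidence=0.7, max_signals=1),
})
ledger.add(EvidenceSignal("sig-1", "t1", 0.9))
ledger.add(EvidenceSignal("sig-2", "t2", 0.4))
print(ledger.satisfied_targets(), ledger.unsatisfied_targets())  # {'t1'} {'t2'}
```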

2.13 CompletionPolicy

Rules governing when a node is considered “done.” The runtime controller evaluates this policy after every turn to determine whether to allow or force transition.

Completion conditions (any/all):

minTurns - minimum candidate turns before completion is possible
maxTurns - hard cap on turns (forces completion)
requiredEvidenceTargets - specific targets that MUST have signals before completion
requiredEvidenceThreshold - minimum number of satisfied targets
timeBudgetMs - maximum time in this node (forces completion on expiry)
explicitExaminerComplete - examiner explicitly signals “we’re done with this”
candidateDecline - candidate declines to continue (subject to policy)
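
The list above leaves open how the conditions combine. The sketch below takes one plausible reading as an explicit assumption: `maxTurns` and `timeBudgetMs` force completion on their own, while normal completion requires both `minTurns` and all required evidence targets. The return labels are hypothetical, and only a subset of the conditions is modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompletionPolicy:
    min_turns: int = 1
    max_turns: Optional[int] = None
    required_evidence_targets: frozenset[str] = frozenset()
    time_budget_ms: Optional[int] = None


def evaluate_completion(policy: CompletionPolicy, candidate_turns: int,
                        satisfied: set[str], node_elapsed_ms: int) -> str:
    """Return 'forced', 'complete', or 'continue' (labels assumed for this sketch)."""
    # Hard limits win regardless of evidence.
    if policy.max_turns is not None and candidate_turns >= policy.max_turns:
        return "forced"
    if policy.time_budget_ms is not None and node_elapsed_ms >= policy.time_budget_ms:
        return "forced"
    # Normal completion: enough turns and every required target satisfied.
    if (candidate_turns >= policy.min_turns
            and policy.required_evidence_targets <= satisfied):
        return "complete"
    return "continue"


policy = CompletionPolicy(min_turns=2, max_turns=6,
                          required_evidence_targets=frozenset({"t1"}),
                          time_budget_ms=120_000)
print(evaluate_completion(policy, 1, {"t1"}, 10_000))   # continue
print(evaluate_completion(policy, 2, {"t1"}, 20_000))   # complete
print(evaluate_completion(policy, 3, set(), 125_000))   # forced
```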

2.14 FollowUpPolicy

Rules governing the AI examiner’s follow-up behavior within a node.

Key properties:

maxFollowUps - hard cap on follow-ups per node
followUpStyle - probing | scaffolding | clarifying | redirecting | free
minIntervalMs - minimum time between follow-ups
requireEvidenceGap - only follow up if an evidence target is unsatisfied
forbiddenFollowUpPatterns - patterns the examiner MUST NOT use (e.g., “giving away the answer”)
escalationRule - what to do when max follow-ups is reached (transition, wrap-up, etc.)
allowedPromptingLevels - constrains the examiner’s follow-up moves based on Pearce & Chiavaroli’s (2020) taxonomy
requireConsistentPrompting - whether prompting must be consistent across candidates
disclosePromptingStyle - whether candidates should be informed about prompting style in advance
scaffoldingBudget - maximum scaffolding intensity (0-10); the amount of scaffolding provided is itself evidence of candidate competence

Theoretical grounding: The prompting taxonomy is based on Pearce & Chiavaroli (2020), cited in Fenton (2025), which defines five levels from neutral presentation to leading guidance. The guiding principles are neutrality, consistency, transparency, and reflexivity. The scaffolding budget draws on Vygotsky’s Zone of Proximal Development (ZPD) theory: the examiner adjusts support based on the candidate’s demonstrated competence level (Fenton, 2025).
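
A controller-side gate for follow-ups might look like the sketch below. It covers only four of the properties, and the `intensity` argument (a per-follow-up scaffolding score on the same 0-10 scale) and the reason strings other than `follow_up_limit_reached` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FollowUpPolicy:
    max_follow_ups: int
    min_interval_ms: int = 0
    require_evidence_gap: bool = True
    scaffolding_budget: int = 10  # 0-10, per the spec


def check_follow_up(policy: FollowUpPolicy, issued: int, ms_since_last: int,
                    has_evidence_gap: bool, intensity: int) -> str:
    """Return 'ok' or the reason a proposed follow-up is refused."""
    if issued >= policy.max_follow_ups:
        return "follow_up_limit_reached"
    if issued > 0 and ms_since_last < policy.min_interval_ms:
        return "too_soon"
    if policy.require_evidence_gap and not has_evidence_gap:
        return "no_evidence_gap"
    if intensity > policy.scaffolding_budget:
        return "over_scaffolding_budget"
    return "ok"


policy = FollowUpPolicy(max_follow_ups=2, min_interval_ms=3_000, scaffolding_budget=4)
print(check_follow_up(policy, 0, 0, True, 2))       # ok
print(check_follow_up(policy, 1, 1_000, True, 2))   # too_soon
print(check_follow_up(policy, 2, 9_000, True, 2))   # follow_up_limit_reached
```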

2.15 TransitionPolicy

Rules governing how the runtime moves between nodes.

Key properties:

targetNodeId - the destination node
condition - a structured condition that must be true for this transition to fire
priority - when multiple transitions are eligible, which wins
isForced - whether this transition can override completion policy (used for timeout, error recovery)
bridgePrompt - optional prompt seed for the examiner to generate a natural transition utterance

Condition types:

always - unconditional
evidence_satisfied - specific evidence targets are met
turn_count_reached - minimum turns completed
time_elapsed - time threshold crossed
candidate_command - candidate issued a specific command
policy_escalation - a policy limit was reached (e.g., max follow-ups)
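
Transition selection can be sketched as a filter followed by a priority pick. Two assumptions are made loudly here because the text does not settle them: a larger `priority` value wins, and while the node's completion policy is unmet only `isForced` transitions are eligible. Conditions are reduced to their type names, with the caller supplying the set that currently holds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TransitionPolicy:
    target_node_id: str
    condition: str          # one of the condition types listed above
    priority: int = 0       # assumption: larger value wins
    is_forced: bool = False


def select_transition(transitions: list[TransitionPolicy],
                      true_conditions: set[str],
                      node_complete: bool) -> Optional[TransitionPolicy]:
    """Pick the highest-priority eligible transition, or None if none may fire."""
    eligible = [
        t for t in transitions
        if (t.condition == "always" or t.condition in true_conditions)
        and (node_complete or t.is_forced)
    ]
    return max(eligible, key=lambda t: t.priority, default=None)


transitions = [
    TransitionPolicy("q2", "evidence_satisfied", priority=1),
    TransitionPolicy("q2_fallback", "always"),
    TransitionPolicy("wrapup", "time_elapsed", priority=5, is_forced=True),
]
chosen = select_transition(transitions, {"time_elapsed"}, node_complete=False)
print(chosen.target_node_id if chosen else None)  # wrapup
```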

2.16 RecoveryPolicy

Rules governing how the runtime handles anomalies.

Recovery scenarios:

silence - candidate is not responding
unclear_answer - STT confidence is low or response is ambiguous
off_topic - candidate is not addressing the question
anxiety - candidate signals stress or discomfort
interruption - candidate interrupts the examiner
network_issue - audio/connection degradation
repetition_loop - candidate keeps asking for repeats

Key properties:

scenario - which recovery scenario this rule addresses
maxAttempts - how many times to attempt recovery before escalation
escalation - retry | rephrase | skip_node | pause_session | terminate
recoveryPrompt - prompt seed for the examiner’s recovery utterance
cooldownMs - minimum wait before next recovery attempt
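
The interplay of `maxAttempts`, `cooldownMs`, and `escalation` can be illustrated with a small decision function. The class name `RecoveryRule` and the intermediate labels `wait` and `attempt_recovery` are hypothetical; the escalation values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryRule:
    scenario: str       # e.g. "silence", "off_topic"
    max_attempts: int
    escalation: str     # retry | rephrase | skip_node | pause_session | terminate
    cooldown_ms: int = 0


def next_recovery_action(rule: RecoveryRule, attempts_so_far: int,
                         ms_since_last_attempt: int) -> str:
    """Decide whether to try recovery again, wait out the cooldown, or escalate."""
    if attempts_so_far >= rule.max_attempts:
        return rule.escalation
    if attempts_so_far > 0 and ms_since_last_attempt < rule.cooldown_ms:
        return "wait"
    return "attempt_recovery"


silence = RecoveryRule("silence", max_attempts=2, escalation="skip_node",
                       cooldown_ms=5_000)
print(next_recovery_action(silence, 0, 0))        # attempt_recovery
print(next_recovery_action(silence, 1, 2_000))    # wait
print(next_recovery_action(silence, 2, 10_000))   # skip_node
```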

2.17 TelemetryPolicy

Rules governing what operational data is emitted and where.

Key properties:

emitTurnEvents - whether to emit events for every turn
emitEvidenceEvents - whether to emit events for evidence signals
emitStateTransitions - whether to emit events for state changes
emitPolicyViolations - whether to emit events for policy violations (SHOULD always be true)
samplingRate - for high-frequency events, what fraction to emit
destinations - where events go (event store, analytics, debug console)
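
A possible emission filter is sketched below. It assumes that turn events are the "high-frequency" class subject to `samplingRate`, and that policy violations bypass sampling; both are interpretations, as is the pass-through default for other event types. State-transition events and destinations are omitted.

```python
import random
from dataclasses import dataclass

TURN_EVENTS = {"examiner_turn", "candidate_turn", "turn_completed"}
EVIDENCE_EVENTS = {"evidence_signal_emitted", "evidence_target_satisfied",
                   "evidence_target_missed"}


@dataclass(frozen=True)
class TelemetryPolicy:
    emit_turn_events: bool = True
    emit_evidence_events: bool = True
    emit_policy_violations: bool = True
    sampling_rate: float = 1.0  # fraction of high-frequency events to emit


def should_emit(policy: TelemetryPolicy, event_type: str,
                rng: random.Random) -> bool:
    if event_type == "policy_violation":
        return policy.emit_policy_violations  # never sampled
    if event_type in EVIDENCE_EVENTS:
        return policy.emit_evidence_events
    if event_type in TURN_EVENTS:
        # Sampling applies only to the high-frequency class (assumed: turn events).
        return policy.emit_turn_events and rng.random() < policy.sampling_rate
    return True  # other event types pass through in this sketch


rng = random.Random(0)
quiet = TelemetryPolicy(sampling_rate=0.0)
print(should_emit(quiet, "candidate_turn", rng))    # False
print(should_emit(quiet, "policy_violation", rng))  # True
```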

2.18 ContextPolicy

Rules governing what context the AI examiner can access during the session.

Key properties:

includeRubric - whether the examiner can see rubric criteria
includePreviousNodes - whether the examiner can see transcript from prior nodes
includeEvidenceStatus - whether the examiner can see which evidence targets are satisfied
includeCandidateHistory - whether the examiner can see prior session data for this candidate
maxContextTokens - token budget for context injection
redactedFields - fields that MUST NOT appear in the examiner’s context

This is a critical agent boundary mechanism. The examiner’s context window is shaped by this policy - what it doesn’t see, it can’t leak or misuse.
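
The "what it doesn't see" principle can be sketched as a filter applied before anything reaches the examiner's prompt. The source keys (`rubric`, `previous_nodes`, and so on), the recursive `_redact` helper, and the example field `score_weight` are assumptions; `maxContextTokens` is left out because token counting depends on the model in use.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ContextPolicy:
    include_rubric: bool = False
    include_previous_nodes: bool = True
    include_evidence_status: bool = True
    include_candidate_history: bool = False
    redacted_fields: frozenset[str] = frozenset()


def _redact(value: Any, redacted: frozenset[str]) -> Any:
    """Drop redacted keys at any depth of nested dicts and lists."""
    if isinstance(value, dict):
        return {k: _redact(v, redacted) for k, v in value.items() if k not in redacted}
    if isinstance(value, list):
        return [_redact(v, redacted) for v in value]
    return value


def build_context(policy: ContextPolicy, sources: dict[str, Any]) -> dict[str, Any]:
    """Keep only the sources the policy enables, then strip redacted fields."""
    enabled = {
        "rubric": policy.include_rubric,
        "previous_nodes": policy.include_previous_nodes,
        "evidence_status": policy.include_evidence_status,
        "candidate_history": policy.include_candidate_history,
    }
    visible = {k: v for k, v in sources.items() if enabled.get(k, False)}
    return _redact(visible, policy.redacted_fields)


policy = ContextPolicy(redacted_fields=frozenset({"score_weight"}))
sources = {
    "rubric": {"criteria": ["c1"]},
    "evidence_status": {"t1": {"satisfied": True, "score_weight": 0.4}},
}
context = build_context(policy, sources)
print(context)  # {'evidence_status': {'t1': {'satisfied': True}}}
```

Unknown source keys are excluded by default, so a new data source stays invisible to the examiner until the policy explicitly enables it.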

2.19 PipecatAdapterOutput

The compiled output of running the specification through the Pipecat adapter. Contains everything needed to configure Pipecat’s FlowManager and per-node behavior.

Key properties:

flowManagerConfig - the FlowManager-compatible graph structure
nodeConfigs - per-node configuration (system prompt, voice, STT settings)
dataChannelSchema - schema for runtime events sent via LiveKit data channel
controllerOverlay - configuration for the runtime controller that sits alongside Pipecat
compilationWarnings - any degradation or lossy mappings during compilation

3. Conceptual Object Model

3.1 Entity Relationship Diagram

erDiagram
    ExamRuntimePackage ||--o| AssessmentProfile : has
    ExamRuntimePackage ||--o{ ExamRuntimeNode : contains
    ExamRuntimePackage ||--|| GlobalRuntimePolicies : has
    ExamRuntimePackage ||--o{ EvidenceTarget : defines
    ExamRuntimePackage ||--o{ QuestionPool : has

    AssessmentProfile ||--o| AuthenticityProfile : has
    AssessmentProfile ||--o| StructureProfile : has
    AssessmentProfile ||--o| ExaminerConfiguration : has
    AssessmentProfile ||--o| OralityProfile : has
    AssessmentProfile ||--o{ ValidityClaim : declares
    AssessmentProfile ||--o| ModerationPolicy : has
    AssessmentProfile ||--o| CalibrationProfile : has

    QuestionPool ||--o{ QuestionVariant : contains

    ExamRuntimeNode ||--o| CompletionPolicy : has
    ExamRuntimeNode ||--o| FollowUpPolicy : has
    ExamRuntimeNode ||--o| RecoveryPolicy : has
    ExamRuntimeNode ||--o{ TransitionPolicy : has_transitions
    ExamRuntimeNode ||--o{ EvidenceTarget : assesses
    ExamRuntimeNode ||--o| CandidateCommandPolicy : allows_commands
    ExamRuntimeNode }o--o| QuestionPool : draws_from

    RuntimeSession ||--|| RuntimeState : tracks
    RuntimeSession ||--|| ExamRuntimePackage : references
    RuntimeSession ||--o{ TranscriptTurn : contains
    RuntimeSession ||--|| EvidenceLedger : maintains
    RuntimeSession ||--o{ RuntimeEvent : emits
    RuntimeSession ||--o| SessionRecording : has
    RuntimeSession ||--o| ModerationRecord : reviewed_by

    EvidenceLedger ||--o{ EvidenceSignal : contains
    EvidenceSignal }o--|| EvidenceTarget : addresses

    RuntimeEvent }o--|| RuntimeSession : belongs_to
    TranscriptTurn }o--|| RuntimeSession : belongs_to

    ExamRuntimePackage ||--|| PipecatAdapterOutput : compiles_to
    ExamRuntimePackage ||--o{ FairnessAudit : audited_by

3.2 Node Lifecycle

stateDiagram-v2
    [*] --> Waiting : session created
    Waiting --> Active : node_entered
    Active --> Active : candidate_turn / examiner_turn
    Active --> Evaluating : completion_check triggered
    Evaluating --> Active : completion criteria NOT met
    Evaluating --> Transitioning : completion criteria met
    Active --> Recovering : anomaly detected
    Recovering --> Active : recovery successful
    Recovering --> Transitioning : max recovery attempts
    Active --> Timeout : time budget exceeded
    Timeout --> Transitioning : forced transition
    Transitioning --> Active : next node entered
    Transitioning --> Completed : no more nodes
    Completed --> [*]
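
The lifecycle above maps directly onto a transition table. The state names come from the diagram; the trigger identifiers are simplified labels chosen for this sketch, not event names defined by the specification.

```python
# (state, trigger) -> next state, transcribed from the lifecycle diagram.
TRANSITIONS = {
    ("Waiting", "node_entered"): "Active",
    ("Active", "turn"): "Active",
    ("Active", "completion_check"): "Evaluating",
    ("Evaluating", "criteria_not_met"): "Active",
    ("Evaluating", "criteria_met"): "Transitioning",
    ("Active", "anomaly_detected"): "Recovering",
    ("Recovering", "recovery_successful"): "Active",
    ("Recovering", "max_recovery_attempts"): "Transitioning",
    ("Active", "time_budget_exceeded"): "Timeout",
    ("Timeout", "forced_transition"): "Transitioning",
    ("Transitioning", "next_node_entered"): "Active",
    ("Transitioning", "no_more_nodes"): "Completed",
}


def step(state: str, trigger: str) -> str:
    """Advance the lifecycle, rejecting transitions the diagram does not define."""
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {trigger}") from None


state = "Waiting"
for trigger in ["node_entered", "turn", "completion_check",
                "criteria_met", "no_more_nodes"]:
    state = step(state, trigger)
print(state)  # Completed
```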

3.3 Agent Agency Boundary

The AI examiner operates within a bounded creative space. The boundary is defined at multiple levels:

┌──────────────────────────────────────────────────────────┐
│                    GLOBAL POLICIES                        │
│  (apply to entire exam - agent boundary, telemetry, etc.) │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │              NODE-LOCAL POLICIES                    │  │
│  │  (per-node overrides - completion, follow-up,       │  │
│  │   recovery, commands)                               │  │
│  │                                                     │  │
│  │  ┌──────────────────────────────────────────────┐  │  │
│  │  │         AGENT CREATIVE SPACE                 │  │  │
│  │  │                                              │  │  │
│  │  │  - Generate natural follow-ups               │  │  │
│  │  │  - Judge evidence signals                    │  │  │
│  │  │  - Produce repair utterances                 │  │  │
│  │  │  - Create natural bridges between nodes      │  │  │
│  │  │  - Adapt tone and pace to candidate          │  │  │
│  │  │                                              │  │  │
│  │  │  CANNOT:                                     │  │  │
│  │  │  - Jump topics or skip nodes                 │  │  │
│  │  │  - Reveal rubric or scoring                  │  │  │
│  │  │  - Exceed follow-up limits                   │  │  │
│  │  │  - Ignore candidate commands                 │  │  │
│  │  │  - Change exam structure                     │  │  │
│  │  │  - Fabricate evidence                        │  │  │
│  │  │  - End exam prematurely                      │  │  │
│  │  └──────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
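
Enforcement of this boundary by the runtime controller could be as simple as an allow-list gate that emits the two agent events from section 2.7. The action identifiers below are hypothetical labels for the items in the diagram, and default-deny for unknown actions is a design choice of this sketch.

```python
ALLOWED_ACTIONS = {
    "generate_follow_up",
    "judge_evidence",
    "produce_repair_utterance",
    "bridge_between_nodes",
    "adapt_tone_and_pace",
}

FORBIDDEN_ACTIONS = {
    "skip_node",
    "reveal_rubric",
    "exceed_follow_up_limit",
    "ignore_candidate_command",
    "change_exam_structure",
    "fabricate_evidence",
    "end_exam_prematurely",
}


def gate(action: str) -> str:
    """Return the event to emit for a proposed agent action (default deny)."""
    if action in ALLOWED_ACTIONS:
        return "agent_action_allowed"
    # Forbidden and unrecognized actions are both blocked.
    return "agent_action_blocked"


print(gate("generate_follow_up"))  # agent_action_allowed
print(gate("reveal_rubric"))       # agent_action_blocked
print(gate("something_new"))       # agent_action_blocked
```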

Authoring vs. Runtime Concepts:

Authoring Concept	Runtime Concept	Notes
Question bank	QuestionPool + ExamRuntimeNode	Questions become pools of variants with nodes as draw targets
Rubric criterion	EvidenceTarget	Rubric maps to evidence targets
Follow-up template	FollowUpPolicy + promptSeed	Templates become policy-constrained generation
Marking scheme	EvidenceLedger + MarkingRuntime input	Marking scheme defines targets; runtime produces signals
Exam duration	Global time budget + per-node budgets	Duration is distributed across nodes
Exam instructions	ContextPolicy + node promptSeeds	Instructions shape what the examiner knows
Assessment design	AssessmentProfile	Joughin’s six dimensions as design parameters
Moderation plan	ModerationPolicy	Rules for human review of AI-generated evidence
Calibration exercises	CalibrationProfile	Accuracy metrics and calibration references
Fairness review	FairnessAudit	Demographic disparity analysis

Persistent vs. Transient Runtime Data:

Persistent (in specification)	Transient (in Runtime State)
ExamRuntimeNode definitions	Current node pointer
Policies and constraints	Turn/follow-up counters
Evidence targets	Evidence coverage map
Transition rules	Recovery attempt counts
Command policies	Command history
Time budgets	Elapsed time trackers

The specification is immutable once published. Runtime state is ephemeral: it is created fresh per session and destroyed when the session completes or terminates. The evidence ledger and event log are persistent outputs derived from runtime execution.

Revision History

Version	Date	Changes
v0.2.0	2026-06-30	Added §0.1 Bloom’s Taxonomy, §0.2 Interactive Oral Assessment, §0.3 Anxiety Management. Updated Joughin reference to include 2010. Added IOA-ORM terminology.
v0.1.0	2026-05-06	Initial release.