Concepts & Domain Model

Draft · v0.2.0 · 2026-06-30


This specification is grounded in the oral assessment literature. The design decisions documented here are informed by the following key works:

| Paper | Key Contribution to This Specification |
| --- | --- |
| Joughin (1998; 2010) - Dimensions of Oral Assessment | Six dimensions (content type, interaction, authenticity, structure, examiners, orality) as design parameters for the AssessmentProfile. Reliability/validity trade-offs along continua. Three-way classification (presentations, interrogations, applications) from Joughin (2010). |
| Akimov & Malin (2020) - Oral Examination as Online Assessment Tool | Validity/reliability/fairness matrix. Question banking for inter-case reliability. Recording and moderation for intra-rater reliability. Identity verification. Anxiety management. |
| Bayley et al. (2024) - Implementing Large-Scale Oral Exams (ConVOEs) | Scalability patterns for 600+ students: parallel administration, batch grading, cross-section consistency. Practice sessions for anxiety reduction. |
| Fenton (2025) - Reconsidering Oral Exams and Assessments | IOA definition and components. Prompting taxonomy (Pearce & Chiavaroli, 2020). Formative vs. summative distinction. Examiner training. Communication skills as learning outcome. |

Bloom’s Taxonomy (Bloom, 1956), in its revised form (Anderson & Krathwohl, 2001), defines six levels of cognitive engagement, progressing from lower-order thinking skills (Remember, Understand, Apply) to higher-order critical thinking (Analyze, Evaluate, Create). In the context of AI-powered oral assessment, the taxonomy is particularly significant: generative AI tools perform well at the lower levels but struggle at the Create level (Fenton, 2025). This makes oral assessment, which can probe higher-order thinking through interactive dialogue, a valuable complement to written assessment in the AI era.

The specification encodes Bloom’s levels as the BloomLevel enum and attaches them to EvidenceTarget via the optional cognitiveLevel field. This enables:

  • Compile-time validation that an exam covers the intended range of cognitive demands
  • Runtime follow-up escalation strategy (cognitiveEscalationStrategy) that targets higher-order thinking
  • Marking rubric alignment that weights higher-order responses more heavily
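
The coverage check in the first bullet could be sketched as follows. This is a minimal illustration rather than the specification's schema: the `BloomLevel` members follow the six levels named above, while the trimmed-down `EvidenceTarget` shape and the `missing_levels` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class BloomLevel(IntEnum):
    """Six cognitive levels, ordered from lower- to higher-order."""
    REMEMBER = 1
    UNDERSTAND = 2
    APPLY = 3
    ANALYZE = 4
    EVALUATE = 5
    CREATE = 6


@dataclass(frozen=True)
class EvidenceTarget:
    # Hypothetical, trimmed-down shape: only the fields this check needs.
    target_id: str
    cognitive_level: Optional[BloomLevel] = None  # optional, as in the spec


def missing_levels(targets, intended):
    """Return the intended levels that no evidence target covers."""
    covered = {t.cognitive_level for t in targets if t.cognitive_level is not None}
    return sorted(set(intended) - covered)


targets = [
    EvidenceTarget("t1", BloomLevel.UNDERSTAND),
    EvidenceTarget("t2", BloomLevel.ANALYZE),
    EvidenceTarget("t3"),  # no cognitive level declared
]
gaps = missing_levels(
    targets, [BloomLevel.UNDERSTAND, BloomLevel.ANALYZE, BloomLevel.EVALUATE]
)
print([level.name for level in gaps])  # ['EVALUATE']
```

A publish step could reject the package, or only warn, when `gaps` is non-empty; which of the two applies is a policy decision this sketch does not settle.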

Sotiriadou et al. (2020) define the ‘interactive oral’ as ‘a form of assessment asking students to perform real-world tasks to demonstrate meaningful application of necessary knowledge and skills.’ This positions oral assessment as scenario-driven, real-world-task-based, and conversation-oriented — not merely a spoken examination.

The specification supports interactive oral assessment through task and scenario node kinds, which carry persona, promptSeed, and scenario context for immersive role-play. The AssessmentProfile.authenticityProfile captures where the exam sits on the decontextualised → authentic spectrum.

Student anxiety is one of the most discussed challenges in the oral assessment literature. Akimov and Malin (2020) report that 100% of students were nervous, with 53% ‘very nervous.’ Fenton (2025) notes that ‘the anxiety some students experience may be linked to the fact that they are unfamiliar with the format’ and recommends practice sessions, format familiarization, and clear communication of expectations.

The specification addresses anxiety through:

  • CandidateBriefing — candidate-facing exam information for preparation
  • warmup node kind with isPractice and anxietyMitigation properties
  • RecoveryPolicy with anxiety-specific recovery strategies
  • FormativeFeedbackPolicy for learning-oriented feedback that reduces uncertainty

Joughin’s (1998) six dimensions are encoded as the AssessmentProfile on ExamRuntimePackage:

| Joughin Dimension | Schema Construct | Rationale |
| --- | --- | --- |
| 1. Primary Content Type | AssessmentProfile.contentTypes | Determines what counts as valid evidence. Knowledge can be assessed from a single response; interpersonal competence requires evaluating interaction quality across turns. |
| 2. Interaction | AssessmentProfile.interactionMode | Reliability is threatened when interaction tends toward dialogue (Joughin, 1998, p. 376). The specification must capture where on the continuum the exam sits. |
| 3. Authenticity | AssessmentProfile.authenticityProfile | Authenticity relates to face and construct validity (Akimov & Malin, 2020). The specification must express what professional context is being simulated. |
| 4. Structure | AssessmentProfile.structureProfile | Closed structure improves reliability; open structure improves validity for probing understanding (Joughin, 1998, p. 376). |
| 5. Examiners | AssessmentProfile.examinerConfig | Supports AI solo, human solo, panel, and AI-with-moderator. Enables inter-rater reliability tracking (Akimov & Malin, 2020). |
| 6. Orality | AssessmentProfile.oralityProfile | Many oral exams involve supplementary written work (Joughin, 1998). The specification must support oral defense of prior submissions. |

Additional theory-driven constructs:

  • Prompting taxonomy (Pearce & Chiavaroli, 2020, via Fenton 2025) → FollowUpPolicy.allowedPromptingLevels
  • Question banking (Akimov & Malin, 2020; Bayley et al., 2024) → QuestionPool
  • Moderation (Akimov & Malin, 2020) → ModerationPolicy, ModerationRecord
  • Assessment-significant moments (Fenton, 2025) → hesitation_detected, self_correction_detected events
  • Identity verification (Akimov & Malin, 2020; Fenton, 2025) → identity_check node kind
  • Formative vs. summative (Fenton, 2025; Akimov & Malin, 2020) → ExamMetadata.assessmentPurpose

| Term | Definition |
| --- | --- |
| Exam | A published, versioned oral assessment with defined structure, policies, and evidence targets. |
| Exam Runtime Package | The canonical specification artifact representing a complete published exam. The single source of truth. |
| Runtime Node | A discrete unit of the exam flow - a question, task, scenario segment, or transition point. |
| Runtime Session | A single candidate’s attempt at an exam. One exam may have many sessions. |
| Runtime State | The mutable, per-session state tracked by the runtime controller during execution. |
| Runtime Event | An immutable record of a significant state change during a session. |
| Candidate Command | A structured input from the candidate that the runtime controller MUST process (e.g., repeat, pause, clarification). |
| Transcript Turn | A single utterance in the conversation, attributed to examiner or candidate, with timing and node context. |
| Evidence Target | A rubric-aligned definition of what the exam is trying to assess at a given node. |
| Evidence Signal | A runtime-emitted record that a specific evidence target was (or was not) demonstrated, with confidence and provenance. |
| Evidence Ledger | The authoritative, structured collection of all evidence signals produced during a session. |
| Completion Policy | Rules governing when a node is “done” - how many turns, what evidence is required, time limits. |
| Follow-Up Policy | Rules governing how many follow-ups the examiner may issue, and under what conditions. |
| Transition Policy | Rules governing how and when the runtime moves from one node to another. |
| Recovery Policy | Rules governing how the runtime handles anomalies - silence, unclear answers, off-topic, anxiety, network issues. |
| Telemetry Policy | Rules governing what operational data is emitted and where. |
| Context Policy | Rules governing what exam context (rubric, previous nodes, candidate history) the AI examiner may access. |
| Pipecat Adapter Output | The compiled Pipecat-specific configuration (FlowManager config + NodeConfig) generated from the specification. |
| Agent Boundary | The explicit set of allowed and forbidden actions for the AI examiner, enforced by the runtime controller. |
| Marking Runtime | The downstream system that reads the evidence ledger and produces assessment scores. |
| Authoring Studio | The lecturer-facing tool for designing exam flows. Compiles to the specification on publish. |
| Assessment Profile | A structured declaration of the exam’s position on Joughin’s (1998) six dimensions of oral assessment. Captures design parameters that determine what the exam measures and how validity/reliability claims are supported. |
| Question Pool | A set of equivalent question variants from which one or more are drawn per session. Enables inter-case reliability (Akimov & Malin, 2020). |
| Prompting Level | A classification of examiner follow-up moves based on Pearce & Chiavaroli’s (2020) taxonomy: from neutral presentation to leading guidance. |
| Scaffolding Budget | Maximum scaffolding intensity permitted at a node (0-10). The amount of scaffolding provided is itself evidence of candidate competence (Fenton, 2025). |
| Moderation Policy | Rules for human review of AI-generated evidence signals. Supports inter-rater reliability (Akimov & Malin, 2020). |
| Calibration Profile | References to calibration exercises and measured accuracy metrics for the AI examiner. Ensures consistent assessment quality (Fenton, 2025). |
| Fairness Audit | Structured analysis of assessment outcomes across demographic dimensions to detect systematic disparities (Akimov & Malin, 2020; Fenton, 2025). |
| Content Type | Joughin’s (1998) four primary categories of what oral assessment can measure: knowledge/understanding, applied problem solving, interpersonal competence, intrapersonal qualities. |
| Validity Claim | A structured declaration of how the exam addresses face, content, construct, concurrent, inter-rater, inter-case, or fairness validity. |

The ExamRuntimePackage is the top-level artifact: a published, versioned, complete specification of an oral exam. It contains metadata, the node graph, global policies, evidence target definitions, and the optional assessment profile.

Key properties:

  • Stable identity (examId) and version (version)
  • Metadata (title, subject, duration, institution, assessment purpose)
  • AssessmentProfile - Joughin’s six dimensions as design parameters (optional in v1)
  • The ordered graph of ExamRuntimeNode objects
  • GlobalRuntimePolicies that apply across all nodes
  • Evidence target registry
  • Question pools for randomized delivery
  • Pipecat adapter configuration hints

Theoretical grounding: The assessmentProfile field encodes Joughin’s (1998) six dimensions as first-class design parameters. This is not metadata decoration - these dimensions constrain runtime behavior, inform evidence interpretation, and support validity arguments. The assessmentPurpose field (formative/summative/diagnostic) affects whether evidence contributes to grades, whether candidates receive real-time feedback, and whether sessions are recorded (Fenton, 2025; Akimov & Malin, 2020).

The AssessmentProfile is a structured declaration of the exam’s position on Joughin’s (1998) six dimensions of oral assessment. It is optional in v1: when absent, defaults are inferred from node-level policies. When present, it constrains runtime behavior, informs evidence interpretation, and supports validity/reliability arguments.

Joughin’s Six Dimensions (encoded as schema properties):

| Dimension | Property | Why It Matters |
| --- | --- | --- |
| 1. Primary Content Type | contentTypes | Determines what counts as valid evidence. Knowledge/understanding can be assessed from a single correct response; interpersonal competence requires evaluating interaction quality across multiple turns (Joughin, 1998, p. 369). |
| 2. Interaction | interactionMode | Ranges from presentation (one-way) to free dialogue. Reliability is threatened when interaction tends toward dialogue (Joughin, 1998, p. 376). The runtime should report on interaction mode consistency. |
| 3. Authenticity | authenticityProfile | Ranges from decontextualised (abstract questions) to contextualised (genuine professional practice). Relates to face and construct validity (Akimov & Malin, 2020). |
| 4. Structure | structureProfile | Ranges from closed (set questions, fixed order) to open (examiner follows responses). Closed structure improves reliability; open structure improves validity for probing understanding (Joughin, 1998, p. 376). |
| 5. Examiners | examinerConfig | Supports self, peer, authority, panel, and AI-with-moderator. Enables inter-rater reliability tracking and moderation workflows (Akimov & Malin, 2020). |
| 6. Orality | oralityProfile | Ranges from purely oral to oral-as-secondary (defending written work). Supports viva voce and multi-modal assessments (Joughin, 1998, p. 367). |

Additional properties:

  • validityClaims - structured declarations of validity/reliability/fairness evidence
  • moderationPolicy - rules for human review of AI-generated signals
  • calibrationProfile - AI examiner accuracy metrics and calibration references

A QuestionPool is a set of equivalent question variants from which one or more are drawn per session. It addresses inter-case reliability: when different candidates receive different questions, the questions must be of equivalent difficulty.

Key properties:

  • Pool ID, label
  • List of QuestionVariant objects (each with prompt seed, difficulty estimate, evidence targets)
  • Draw count (how many variants per session)
  • Whether reuse across concurrent sessions is allowed

Theoretical grounding: Akimov & Malin (2020) describe a bank of 69 questions from which students draw randomly. Bayley et al. (2024) note that question-sharing via group chat is a real concern at scale. The question pool model enables anti-collusion measures (no reuse across concurrent sessions) and difficulty calibration (estimated difficulty per variant).
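
A minimal sketch of how a draw with the no-reuse rule might work is shown below. The class and field names mirror the properties listed above, but the in-memory `_in_use` bookkeeping, the `release` method, and the 0.0-1.0 difficulty scale are assumptions for illustration; a real deployment would likely track reservations in shared storage.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QuestionVariant:
    variant_id: str
    prompt_seed: str
    difficulty: float  # estimated difficulty; the 0.0-1.0 scale is assumed


@dataclass
class QuestionPool:
    pool_id: str
    variants: list
    draw_count: int = 1
    allow_concurrent_reuse: bool = False
    # Variant IDs currently held by live sessions (hypothetical bookkeeping).
    _in_use: set = field(default_factory=set, repr=False)

    def draw(self, rng):
        """Draw `draw_count` variants, skipping in-use ones when reuse is off."""
        candidates = [
            v for v in self.variants
            if self.allow_concurrent_reuse or v.variant_id not in self._in_use
        ]
        if len(candidates) < self.draw_count:
            raise RuntimeError("not enough unused variants for a new session")
        drawn = rng.sample(candidates, self.draw_count)
        self._in_use.update(v.variant_id for v in drawn)
        return drawn

    def release(self, drawn):
        """Return variants to the pool once the session has ended."""
        self._in_use.difference_update(v.variant_id for v in drawn)


pool = QuestionPool(
    pool_id="pool-1",
    variants=[QuestionVariant(f"v{i}", f"Question {i}", 0.5) for i in range(4)],
    draw_count=2,
)
rng = random.Random(7)  # seeded only to make the example repeatable
session_a = pool.draw(rng)
session_b = pool.draw(rng)  # cannot overlap with session_a
```

With four variants and a draw count of two, a third concurrent session fails until one of the first two releases its variants, which makes the pool size a capacity constraint on parallel administration.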

An ExamRuntimeNode is a single unit in the exam flow. Nodes are the vertices of the exam graph. Each node has a type (kind), local policies, evidence targets, and transition rules.

Node kinds:

  • question - A direct question to the candidate
  • scenario - A scenario presentation (read aloud, display, etc.)
  • task - A structured task (role-play, problem-solving, demonstration)
  • discussion - An open-ended discussion segment
  • warmup - Pre-assessment rapport building
  • wrapup - Closing segment
  • branch - Conditional routing node (no candidate interaction)
  • identity_check - Pre-exam identity verification (not assessed)

Key properties:

  • Unique node ID within the package
  • kind - the node type
  • promptSeed - the base content/prompt for this node (not the full system prompt)
  • questionPoolId - optional reference to a question pool for randomized delivery
  • Local CompletionPolicy, FollowUpPolicy, RecoveryPolicy
  • evidenceTargets - what this node is trying to assess
  • transitions - edges to successor nodes with conditions
  • candidateCommands - which commands are valid at this node
  • timeBudget - maximum time for this node

A RuntimeSession is a candidate’s live attempt at an exam. It is created when a session starts and persists until completion or termination. It contains the mutable runtime state and references the immutable package.

Key properties:

  • Session ID, candidate ID, package ID + version
  • RuntimeState - current mutable state
  • TranscriptTurn[] - full conversation transcript
  • EvidenceLedger - accumulated evidence
  • RuntimeEvent[] - event log for this session
  • Start time, end time, status

RuntimeState is the mutable state tracked by the runtime controller during a session. It is NOT persisted as a log; it is the working memory of the controller.

Key properties:

  • currentNodeId - which node the session is in
  • currentNodeTurnCount - turns in the current node
  • currentNodeFollowUpCount - follow-ups issued in the current node
  • globalElapsedMs - total session time
  • nodeElapsedMs - time in current node
  • candidateCommandHistory - commands issued by the candidate
  • evidenceCoverage - which evidence targets have signals
  • recoveryAttempts - recovery actions taken
  • status - active | paused | completed | terminated

A RuntimeEvent is an immutable record of a significant state change. Events are the audit trail and the mechanism by which downstream systems (frontend, analytics, evidence ledger) learn about session activity.

Event categories:

  • Lifecycle: session_started, session_paused, session_resumed, session_completed, session_terminated
  • Node: node_entered, node_exited, node_timeout
  • Turn: examiner_turn, candidate_turn, turn_completed
  • Evidence: evidence_signal_emitted, evidence_target_satisfied, evidence_target_missed
  • Command: candidate_command_received, candidate_command_processed
  • Policy: follow_up_limit_reached, time_budget_warning, time_budget_exceeded, transition_forced, recovery_triggered, policy_violation
  • Agent: agent_action_allowed, agent_action_blocked
  • Assessment-significant: hesitation_detected, self_correction_detected
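
One way to make the immutability guarantee concrete is a frozen record held in an append-only log, sketched below. The event type strings come from the categories above; the `sequence` and `payload` fields and the `EventLog` class are assumptions for the example, not part of the specification.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from types import MappingProxyType


@dataclass(frozen=True)
class RuntimeEvent:
    session_id: str
    sequence: int      # position in the session's event log (assumed field)
    event_type: str    # e.g. "node_entered", "evidence_signal_emitted"
    timestamp_ms: int
    # Read-only mapping so the payload cannot be edited after emission.
    payload: MappingProxyType = field(
        default_factory=lambda: MappingProxyType({})
    )


class EventLog:
    """Append-only log: events are never updated or removed."""

    def __init__(self, session_id):
        self._session_id = session_id
        self._events = []

    def emit(self, event_type, timestamp_ms, **payload):
        event = RuntimeEvent(
            session_id=self._session_id,
            sequence=len(self._events),
            event_type=event_type,
            timestamp_ms=timestamp_ms,
            payload=MappingProxyType(dict(payload)),
        )
        self._events.append(event)
        return event

    def events(self):
        return tuple(self._events)  # callers get a read-only snapshot


log = EventLog("session-1")
log.emit("session_started", 0)
entered = log.emit("node_entered", 120, node_id="q1")

try:
    entered.event_type = "node_exited"  # any mutation attempt is rejected
except FrozenInstanceError:
    print("events are immutable")
```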

A CandidateCommand is a structured input from the candidate that the runtime controller MUST process. Commands are not free text; they are semantic intents recognized from candidate speech or UI interactions.

Command types:

  • repeat — “Can you repeat that?”
  • clarification — “What do you mean by…?”
  • request_rephrase — “Can you say that differently?” (signals active engagement)
  • pause — “Can I have a moment?”
  • thinking_aloud — “Let me think about this…” (assessment-significant metacognitive signal)
  • raise_hand — Candidate signals they want to speak / interrupt
  • skip — “Can I skip this?” (subject to policy)
  • volume_up / volume_down — Technical adjustment
  • language_switch — If multi-language support is enabled
  • challenge_premise — Candidate questions the framing of a question (extended)
  • revise_earlier_answer — Candidate wants to revisit a previous answer (extended)

Commands are runtime primitives, not UI decorations. The runtime controller MUST process them according to the CandidateCommandPolicy.

Theoretical grounding: Joughin (1998) identifies dialogue as a key dimension — candidates in a dialogue may redirect conversation, challenge premises, or revisit earlier points. Fenton (2025) notes that oral assessments allow “self-correction” — the revise_earlier_answer command supports this. The thinking_aloud command captures metacognitive awareness, which is assessment-significant evidence.

A TranscriptTurn is a single attributed utterance in the conversation. It is richer than raw STT output: it carries node context, timing, and semantic metadata.

Key properties:

  • turnIndex - sequential index in the session
  • role - examiner | candidate | system
  • text - the transcribed text
  • nodeId - which node this turn occurred in
  • timestampMs - when the turn started
  • durationMs - how long the turn lasted
  • isFollowUp - whether this examiner turn was a follow-up
  • followUpIndex - if follow-up, which one (0-based)
  • candidateCommandDetected - if a candidate command was detected in this turn

An EvidenceTarget is a rubric-aligned definition of what the exam is trying to assess. Targets are defined at the package level and referenced by nodes.

Key properties:

  • targetId - unique identifier
  • label - human-readable name (e.g., “Explain the mechanism of photosynthesis”)
  • description - detailed description of what constitutes evidence
  • rubricCriteriaIds - links to rubric criteria in the marking model
  • requiredConfidence - minimum confidence for the signal to be considered satisfied
  • maxSignals - maximum signals this target can receive (prevents over-counting)
  • isRequired - whether this target MUST be satisfied for the exam to be valid

An EvidenceSignal is a runtime-emitted record that a specific evidence target was demonstrated (or not). It is produced by the AI examiner during conversation and written to the ledger immediately.

Key properties:

  • signalId - unique identifier
  • targetId - which EvidenceTarget this signal addresses
  • nodeId - which node the evidence was gathered in
  • turnRange - [startTurnIndex, endTurnIndex] - the transcript turns containing this evidence
  • confidence - 0.0 to 1.0, how confident the AI is that this target was met
  • source - ai_judgment | rubric_match | candidate_self_report | external_trigger
  • rationale - brief explanation of why this signal was emitted
  • timestampMs - when the signal was emitted

The EvidenceLedger is the authoritative, structured collection of all evidence signals for a session. It is a first-class output consumed by the marking runtime.

Key properties:

  • sessionId - which session this ledger belongs to
  • signals - ordered list of EvidenceSignal objects
  • coverageSummary - which EvidenceTarget IDs have at least one signal
  • satisfiedTargets - which required targets have signals meeting requiredConfidence
  • unsatisfiedTargets - which required targets lack sufficient signals

The ledger is not a transcript derivative. It is a real-time, structured, machine-readable evidence record maintained by the runtime controller.
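
The three derived properties can be computed directly from the signals, as in the sketch below. It assumes that a required target counts as satisfied when its highest-confidence signal meets `requiredConfidence`; the specification text does not say how multiple signals for one target are combined, so that rule and the `summarize` helper are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceTarget:
    target_id: str
    required_confidence: float
    is_required: bool = True


@dataclass(frozen=True)
class EvidenceSignal:
    signal_id: str
    target_id: str
    confidence: float  # 0.0 to 1.0


def summarize(targets, signals):
    """Derive coverageSummary, satisfiedTargets and unsatisfiedTargets."""
    best = {}  # highest confidence seen per target (assumed combination rule)
    for s in signals:
        best[s.target_id] = max(best.get(s.target_id, 0.0), s.confidence)
    required = [t for t in targets if t.is_required]
    satisfied = [
        t.target_id for t in required
        if best.get(t.target_id, -1.0) >= t.required_confidence
    ]
    unsatisfied = [t.target_id for t in required if t.target_id not in satisfied]
    return {
        "coverageSummary": sorted(best),
        "satisfiedTargets": satisfied,
        "unsatisfiedTargets": unsatisfied,
    }


targets = [
    EvidenceTarget("mechanism", 0.7),
    EvidenceTarget("application", 0.8),
    EvidenceTarget("rapport", 0.5, is_required=False),
]
signals = [
    EvidenceSignal("s1", "mechanism", 0.6),
    EvidenceSignal("s2", "mechanism", 0.9),
    EvidenceSignal("s3", "application", 0.4),
    EvidenceSignal("s4", "rapport", 0.9),
]
summary = summarize(targets, signals)
print(summary["satisfiedTargets"], summary["unsatisfiedTargets"])
```

Here `application` is covered (it has a signal) but not satisfied (its best confidence is below the threshold), which is the distinction between `coverageSummary` and `satisfiedTargets`.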

A CompletionPolicy holds the rules governing when a node is considered “done.” The runtime controller evaluates this policy after every turn to determine whether to allow or force transition.

Completion conditions (any/all):

  • minTurns - minimum candidate turns before completion is possible
  • maxTurns - hard cap on turns (forces completion)
  • requiredEvidenceTargets - specific targets that MUST have signals before completion
  • requiredEvidenceThreshold - minimum number of satisfied targets
  • timeBudgetMs - maximum time in this node (forces completion on expiry)
  • explicitExaminerComplete - examiner explicitly signals “we’re done with this”
  • candidateDecline - candidate declines to continue (subject to policy)
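
The per-turn evaluation could look roughly like the sketch below. It covers only four of the conditions above and assumes one particular combination rule: hard limits (`maxTurns`, `timeBudgetMs`) force completion, while the remaining conditions must all hold before completion is allowed. The return labels `forced`, `allowed`, and `continue` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompletionPolicy:
    min_turns: int = 1
    max_turns: Optional[int] = None
    required_evidence_targets: frozenset = frozenset()
    time_budget_ms: Optional[int] = None


def check_completion(policy, candidate_turns, satisfied_targets, node_elapsed_ms):
    """Return 'forced', 'allowed' or 'continue' for the current node."""
    # Hard limits force completion regardless of evidence.
    if policy.max_turns is not None and candidate_turns >= policy.max_turns:
        return "forced"
    if policy.time_budget_ms is not None and node_elapsed_ms >= policy.time_budget_ms:
        return "forced"
    # Soft conditions: all must hold before completion is possible.
    if candidate_turns < policy.min_turns:
        return "continue"
    if not policy.required_evidence_targets <= set(satisfied_targets):
        return "continue"
    return "allowed"


policy = CompletionPolicy(
    min_turns=2,
    max_turns=5,
    required_evidence_targets=frozenset({"t1"}),
    time_budget_ms=60_000,
)
print(check_completion(policy, 2, {"t1"}, 1_000))  # allowed
print(check_completion(policy, 5, set(), 1_000))   # forced
```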

A FollowUpPolicy holds the rules governing the AI examiner’s follow-up behavior within a node.

Key properties:

  • maxFollowUps - hard cap on follow-ups per node
  • followUpStyle - probing | scaffolding | clarifying | redirecting | free
  • minIntervalMs - minimum time between follow-ups
  • requireEvidenceGap - only follow up if an evidence target is unsatisfied
  • forbiddenFollowUpPatterns - patterns the examiner MUST NOT use (e.g., “giving away the answer”)
  • escalationRule - what to do when max follow-ups is reached (transition, wrap-up, etc.)
  • allowedPromptingLevels - constrains the examiner’s follow-up moves based on Pearce & Chiavaroli’s (2020) taxonomy
  • requireConsistentPrompting - whether prompting must be consistent across candidates
  • disclosePromptingStyle - whether candidates should be informed about prompting style in advance
  • scaffoldingBudget - maximum scaffolding intensity (0-10); the amount of scaffolding provided is itself evidence of candidate competence

Theoretical grounding: The prompting taxonomy is based on Pearce & Chiavaroli (2020), cited in Fenton (2025), which defines five levels from neutral presentation to leading guidance. The guiding principles are neutrality, consistency, transparency, and reflexivity. The scaffolding budget draws on Vygotsky’s Zone of Proximal Development (ZPD) theory: the examiner adjusts support based on the candidate’s demonstrated competence level (Fenton, 2025).
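
As an illustration of how the controller might gate a follow-up before the examiner speaks, the sketch below checks three of the properties above (`maxFollowUps`, `minIntervalMs`, `requireEvidenceGap`). The function name, the order of checks, and the reason strings other than `follow_up_limit_reached` are assumptions; prompting levels and the scaffolding budget are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FollowUpPolicy:
    max_follow_ups: int
    min_interval_ms: int = 0
    require_evidence_gap: bool = False


def may_follow_up(policy, follow_ups_issued, ms_since_last, unsatisfied_targets):
    """Return (allowed, reason) for a proposed follow-up in the current node."""
    if follow_ups_issued >= policy.max_follow_ups:
        # Matches the follow_up_limit_reached policy event.
        return False, "follow_up_limit_reached"
    if follow_ups_issued > 0 and ms_since_last < policy.min_interval_ms:
        return False, "min_interval_not_elapsed"
    if policy.require_evidence_gap and not unsatisfied_targets:
        return False, "no_evidence_gap"
    return True, "ok"


policy = FollowUpPolicy(max_follow_ups=2, min_interval_ms=5_000, require_evidence_gap=True)

print(may_follow_up(policy, 0, 0, {"t1"}))       # (True, 'ok')
print(may_follow_up(policy, 1, 2_000, {"t1"}))   # (False, 'min_interval_not_elapsed')
print(may_follow_up(policy, 1, 6_000, set()))    # (False, 'no_evidence_gap')
print(may_follow_up(policy, 2, 9_000, {"t1"}))   # (False, 'follow_up_limit_reached')
```

When the limit is reached, the `escalationRule` would decide what happens next; that step is outside this sketch.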

A TransitionPolicy holds the rules governing how the runtime moves between nodes.

Key properties:

  • targetNodeId - the destination node
  • condition - a structured condition that must be true for this transition to fire
  • priority - when multiple transitions are eligible, which wins
  • isForced - whether this transition can override completion policy (used for timeout, error recovery)
  • bridgePrompt - optional prompt seed for the examiner to generate a natural transition utterance

Condition types:

  • always - unconditional
  • evidence_satisfied - specific evidence targets are met
  • turn_count_reached - minimum turns completed
  • time_elapsed - time threshold crossed
  • candidate_command - candidate issued a specific command
  • policy_escalation - a policy limit was reached (e.g., max follow-ups)
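
Selecting among several eligible transitions might be sketched as below. Two simplifications are assumed: conditions are reduced to plain strings matched against a set of currently true conditions (the specification describes structured conditions), and a higher `priority` number wins (the text does not state the direction).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    target_node_id: str
    condition: str        # one of the condition types listed above
    priority: int = 0     # assumption: larger value wins
    is_forced: bool = False


def select_transition(transitions, active_conditions, node_complete):
    """Pick the eligible transition with the highest priority, or None."""
    eligible = [
        t for t in transitions
        if (t.condition == "always" or t.condition in active_conditions)
        # Non-forced transitions must wait for the completion policy.
        and (node_complete or t.is_forced)
    ]
    return max(eligible, key=lambda t: t.priority, default=None)


transitions = [
    Transition("q2", "evidence_satisfied", priority=1),
    Transition("wrapup", "time_elapsed", priority=10, is_forced=True),
    Transition("q3", "always", priority=0),
]

normal = select_transition(transitions, {"evidence_satisfied"}, node_complete=True)
timeout = select_transition(transitions, {"time_elapsed"}, node_complete=False)
print(normal.target_node_id, timeout.target_node_id)  # q2 wrapup
```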

A RecoveryPolicy holds the rules governing how the runtime handles anomalies.

Recovery scenarios:

  • silence - candidate is not responding
  • unclear_answer - STT confidence is low or response is ambiguous
  • off_topic - candidate is not addressing the question
  • anxiety - candidate signals stress or discomfort
  • interruption - candidate interrupts the examiner
  • network_issue - audio/connection degradation
  • repetition_loop - candidate keeps asking for repeats

Key properties:

  • scenario - which recovery scenario this rule addresses
  • maxAttempts - how many times to attempt recovery before escalation
  • escalation - retry | rephrase | skip_node | pause_session | terminate
  • recoveryPrompt - prompt seed for the examiner’s recovery utterance
  • cooldownMs - minimum wait before next recovery attempt
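
The interplay of `maxAttempts`, `cooldownMs`, and `escalation` can be illustrated with a small decision function. The `RecoveryRule` shape follows the properties above; the intermediate labels `attempt_recovery` and `wait` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryRule:
    scenario: str       # e.g. "silence", "off_topic"
    max_attempts: int
    escalation: str     # retry | rephrase | skip_node | pause_session | terminate
    cooldown_ms: int = 0


def next_recovery_action(rule, attempts_so_far, ms_since_last_attempt):
    """Decide what the controller does when the rule's scenario is detected."""
    if attempts_so_far >= rule.max_attempts:
        return rule.escalation          # recovery exhausted, escalate
    if attempts_so_far > 0 and ms_since_last_attempt < rule.cooldown_ms:
        return "wait"                   # still inside the cooldown window
    return "attempt_recovery"           # examiner speaks the recoveryPrompt


rule = RecoveryRule("silence", max_attempts=2, escalation="skip_node", cooldown_ms=3_000)

print(next_recovery_action(rule, 0, 0))       # attempt_recovery
print(next_recovery_action(rule, 1, 1_000))   # wait
print(next_recovery_action(rule, 2, 5_000))   # skip_node
```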

The Telemetry Policy holds the rules governing what operational data is emitted and where.

Key properties:

  • emitTurnEvents - whether to emit events for every turn
  • emitEvidenceEvents - whether to emit events for evidence signals
  • emitStateTransitions - whether to emit events for state changes
  • emitPolicyViolations - whether to emit events for policy violations (SHOULD always be true)
  • samplingRate - for high-frequency events, what fraction to emit
  • destinations - where events go (event store, analytics, debug console)

The ContextPolicy holds the rules governing what context the AI examiner can access during the session.

Key properties:

  • includeRubric - whether the examiner can see rubric criteria
  • includePreviousNodes - whether the examiner can see transcript from prior nodes
  • includeEvidenceStatus - whether the examiner can see which evidence targets are satisfied
  • includeCandidateHistory - whether the examiner can see prior session data for this candidate
  • maxContextTokens - token budget for context injection
  • redactedFields - fields that MUST NOT appear in the examiner’s context

This is a critical agent boundary mechanism. The examiner’s context window is shaped by this policy - what it doesn’t see, it can’t leak or misuse.
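
A sketch of how the policy might shape the examiner's context is given below. The section names (`rubric`, `previous_nodes`, and so on), the recursive key-based redaction, and the function names are assumptions for illustration; `maxContextTokens` is omitted because token counting depends on the model in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextPolicy:
    include_rubric: bool = False
    include_previous_nodes: bool = True
    include_evidence_status: bool = True
    include_candidate_history: bool = False
    redacted_fields: frozenset = frozenset()


def _redact(value, redacted):
    """Recursively drop dictionary keys named in `redacted`."""
    if isinstance(value, dict):
        return {k: _redact(v, redacted) for k, v in value.items() if k not in redacted}
    if isinstance(value, list):
        return [_redact(v, redacted) for v in value]
    return value


def build_examiner_context(policy, available):
    """Return only the context sections the policy permits."""
    gates = {
        "rubric": policy.include_rubric,
        "previous_nodes": policy.include_previous_nodes,
        "evidence_status": policy.include_evidence_status,
        "candidate_history": policy.include_candidate_history,
    }
    return {
        section: _redact(available[section], policy.redacted_fields)
        for section, allowed in gates.items()
        if allowed and section in available
    }


policy = ContextPolicy(redacted_fields=frozenset({"candidate_email"}))
available = {
    "rubric": {"criteria": ["c1", "c2"]},
    "previous_nodes": [{"node_id": "q1", "candidate_email": "a@example.org"}],
    "candidate_history": {"prior_sessions": 2},
}
context = build_examiner_context(policy, available)
print(context)  # {'previous_nodes': [{'node_id': 'q1'}]}
```

Filtering before prompt assembly, rather than instructing the model to ignore fields, is what makes the boundary enforceable: excluded data never reaches the examiner.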

The PipecatAdapterOutput is the compiled output of running the specification through the Pipecat adapter. It contains everything needed to configure Pipecat’s FlowManager and per-node behavior.

Key properties:

  • flowManagerConfig - the FlowManager-compatible graph structure
  • nodeConfigs - per-node configuration (system prompt, voice, STT settings)
  • dataChannelSchema - schema for runtime events sent via LiveKit data channel
  • controllerOverlay - configuration for the runtime controller that sits alongside Pipecat
  • compilationWarnings - any degradation or lossy mappings during compilation

erDiagram
    ExamRuntimePackage ||--o| AssessmentProfile : has
    ExamRuntimePackage ||--o{ ExamRuntimeNode : contains
    ExamRuntimePackage ||--|| GlobalRuntimePolicies : has
    ExamRuntimePackage ||--o{ EvidenceTarget : defines
    ExamRuntimePackage ||--o{ QuestionPool : has

    AssessmentProfile ||--o| AuthenticityProfile : has
    AssessmentProfile ||--o| StructureProfile : has
    AssessmentProfile ||--o| ExaminerConfiguration : has
    AssessmentProfile ||--o| OralityProfile : has
    AssessmentProfile ||--o{ ValidityClaim : declares
    AssessmentProfile ||--o| ModerationPolicy : has
    AssessmentProfile ||--o| CalibrationProfile : has

    QuestionPool ||--o{ QuestionVariant : contains

    ExamRuntimeNode ||--o| CompletionPolicy : has
    ExamRuntimeNode ||--o| FollowUpPolicy : has
    ExamRuntimeNode ||--o| RecoveryPolicy : has
    ExamRuntimeNode ||--o{ TransitionPolicy : has_transitions
    ExamRuntimeNode ||--o{ EvidenceTarget : assesses
    ExamRuntimeNode ||--o| CandidateCommandPolicy : allows_commands
    ExamRuntimeNode }o--o| QuestionPool : draws_from

    RuntimeSession ||--|| RuntimeState : tracks
    RuntimeSession ||--|| ExamRuntimePackage : references
    RuntimeSession ||--o{ TranscriptTurn : contains
    RuntimeSession ||--|| EvidenceLedger : maintains
    RuntimeSession ||--o{ RuntimeEvent : emits
    RuntimeSession ||--o| SessionRecording : has
    RuntimeSession ||--o| ModerationRecord : reviewed_by

    EvidenceLedger ||--o{ EvidenceSignal : contains
    EvidenceSignal }o--|| EvidenceTarget : addresses

    RuntimeEvent }o--|| RuntimeSession : belongs_to
    TranscriptTurn }o--|| RuntimeSession : belongs_to

    ExamRuntimePackage ||--|| PipecatAdapterOutput : compiles_to
    ExamRuntimePackage ||--o{ FairnessAudit : audited_by

stateDiagram-v2
    [*] --> Waiting : session created
    Waiting --> Active : node_entered
    Active --> Active : candidate_turn / examiner_turn
    Active --> Evaluating : completion_check triggered
    Evaluating --> Active : completion criteria NOT met
    Evaluating --> Transitioning : completion criteria met
    Active --> Recovering : anomaly detected
    Recovering --> Active : recovery successful
    Recovering --> Transitioning : max recovery attempts
    Active --> Timeout : time budget exceeded
    Timeout --> Transitioning : forced transition
    Transitioning --> Active : next node entered
    Transitioning --> Completed : no more nodes
    Completed --> [*]
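
The lifecycle in the state diagram can be expressed as a transition table, as in the sketch below. The state names are taken from the diagram; the trigger strings are shortened forms of the diagram labels and, like the `step` function, are assumptions rather than identifiers defined by the specification.

```python
from enum import Enum


class NodeState(Enum):
    WAITING = "Waiting"
    ACTIVE = "Active"
    EVALUATING = "Evaluating"
    RECOVERING = "Recovering"
    TIMEOUT = "Timeout"
    TRANSITIONING = "Transitioning"
    COMPLETED = "Completed"


S = NodeState
# (current state, trigger) -> next state, one entry per arrow in the diagram.
TRANSITIONS = {
    (S.WAITING, "node_entered"): S.ACTIVE,
    (S.ACTIVE, "candidate_turn"): S.ACTIVE,
    (S.ACTIVE, "examiner_turn"): S.ACTIVE,
    (S.ACTIVE, "completion_check"): S.EVALUATING,
    (S.EVALUATING, "criteria_not_met"): S.ACTIVE,
    (S.EVALUATING, "criteria_met"): S.TRANSITIONING,
    (S.ACTIVE, "anomaly_detected"): S.RECOVERING,
    (S.RECOVERING, "recovery_successful"): S.ACTIVE,
    (S.RECOVERING, "max_recovery_attempts"): S.TRANSITIONING,
    (S.ACTIVE, "time_budget_exceeded"): S.TIMEOUT,
    (S.TIMEOUT, "forced_transition"): S.TRANSITIONING,
    (S.TRANSITIONING, "next_node_entered"): S.ACTIVE,
    (S.TRANSITIONING, "no_more_nodes"): S.COMPLETED,
}


def step(state, trigger):
    """Apply one trigger; reject anything the diagram does not allow."""
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(
            f"illegal transition: {trigger!r} in state {state.value}"
        ) from None


state = S.WAITING
for trigger in ["node_entered", "candidate_turn", "completion_check",
                "criteria_met", "no_more_nodes"]:
    state = step(state, trigger)
print(state.value)  # Completed
```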

The AI examiner operates within a bounded creative space. The boundary is defined at multiple levels:

┌──────────────────────────────────────────────────────────┐
│                    GLOBAL POLICIES                        │
│  (apply to entire exam - agent boundary, telemetry, etc.) │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │              NODE-LOCAL POLICIES                    │  │
│  │  (per-node overrides - completion, follow-up,       │  │
│  │   recovery, commands)                               │  │
│  │                                                     │  │
│  │  ┌──────────────────────────────────────────────┐  │  │
│  │  │         AGENT CREATIVE SPACE                 │  │  │
│  │  │                                              │  │  │
│  │  │  - Generate natural follow-ups               │  │  │
│  │  │  - Judge evidence signals                    │  │  │
│  │  │  - Produce repair utterances                 │  │  │
│  │  │  - Create natural bridges between nodes      │  │  │
│  │  │  - Adapt tone and pace to candidate          │  │  │
│  │  │                                              │  │  │
│  │  │  CANNOT:                                     │  │  │
│  │  │  - Jump topics or skip nodes                 │  │  │
│  │  │  - Reveal rubric or scoring                  │  │  │
│  │  │  - Exceed follow-up limits                   │  │  │
│  │  │  - Ignore candidate commands                 │  │  │
│  │  │  - Change exam structure                     │  │  │
│  │  │  - Fabricate evidence                        │  │  │
│  │  │  - End exam prematurely                      │  │  │
│  │  └──────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
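
Enforcement of the boundary by the runtime controller might be sketched as a default-deny check that yields one of the two agent events. The action names below are hypothetical labels for the items in the diagram, and default-deny is a design choice of this example, not something the specification mandates.

```python
# Hypothetical action names standing in for the items in the diagram.
ALLOWED_ACTIONS = frozenset({
    "generate_follow_up",
    "emit_evidence_signal",
    "repair_utterance",
    "bridge_utterance",
})
FORBIDDEN_ACTIONS = frozenset({
    "skip_node",
    "reveal_rubric",
    "change_structure",
    "fabricate_evidence",
    "end_exam",
})


def authorize(action, follow_ups_issued, max_follow_ups):
    """Return the event name the controller would emit for this action."""
    # Default-deny: anything not explicitly allowed is blocked, which also
    # covers every entry in FORBIDDEN_ACTIONS.
    if action not in ALLOWED_ACTIONS:
        return "agent_action_blocked"
    # Allowed actions are still subject to node-local limits.
    if action == "generate_follow_up" and follow_ups_issued >= max_follow_ups:
        return "agent_action_blocked"
    return "agent_action_allowed"


print(authorize("generate_follow_up", 1, 3))   # agent_action_allowed
print(authorize("generate_follow_up", 3, 3))   # agent_action_blocked
print(authorize("reveal_rubric", 0, 3))        # agent_action_blocked
```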

Authoring vs. Runtime Concepts:

| Authoring Concept | Runtime Concept | Notes |
| --- | --- | --- |
| Question bank | QuestionPool + ExamRuntimeNode | Questions become pools of variants with nodes as draw targets |
| Rubric criterion | EvidenceTarget | Rubric maps to evidence targets |
| Follow-up template | FollowUpPolicy + promptSeed | Templates become policy-constrained generation |
| Marking scheme | EvidenceLedger + MarkingRuntime input | Marking scheme defines targets; runtime produces signals |
| Exam duration | Global time budget + per-node budgets | Duration is distributed across nodes |
| Exam instructions | ContextPolicy + node promptSeeds | Instructions shape what the examiner knows |
| Assessment design | AssessmentProfile | Joughin’s six dimensions as design parameters |
| Moderation plan | ModerationPolicy | Rules for human review of AI-generated evidence |
| Calibration exercises | CalibrationProfile | Accuracy metrics and calibration references |
| Fairness review | FairnessAudit | Demographic disparity analysis |

Persistent vs. Transient Runtime Nodes:

| Persistent (in specification) | Transient (in Runtime State) |
| --- | --- |
| ExamRuntimeNode definitions | Current node pointer |
| Policies and constraints | Turn/follow-up counters |
| Evidence targets | Evidence coverage map |
| Transition rules | Recovery attempt counts |
| Command policies | Command history |
| Time budgets | Elapsed time trackers |

The specification is immutable once published. Runtime state is ephemeral - created fresh per session, destroyed on completion. The evidence ledger and event log are persistent outputs derived from runtime execution.

| Version | Date | Changes |
| --- | --- | --- |
| v0.2.0 | 2026-06-30 | Added §0.1 Bloom’s Taxonomy, §0.2 Interactive Oral Assessment, §0.3 Anxiety Management. Updated Joughin reference to include 2010. Added IOA-ORM terminology. |
| v0.1.0 | 2026-05-06 | Initial release. |