01 - Core Concepts & Domain Model
Status
Draft · v0.2.0 · 2026-06-30
0. Theoretical Foundations
This specification is grounded in the oral assessment literature. The design decisions documented here are informed by the following key works:
| Paper | Key Contribution to This Specification |
|---|---|
| Joughin (1998; 2010) - Dimensions of Oral Assessment | Six dimensions (content type, interaction, authenticity, structure, examiners, orality) as design parameters for the AssessmentProfile. Reliability/validity trade-offs along continua. Three-way classification (presentations, interrogations, applications) from Joughin (2010). |
| Akimov & Malin (2020) - Oral Examination as Online Assessment Tool | Validity/reliability/fairness matrix. Question banking for inter-case reliability. Recording and moderation for intra-rater reliability. Identity verification. Anxiety management. |
| Bayley et al. (2024) - Implementing Large-Scale Oral Exams (ConVOEs) | Scalability patterns for 600+ students: parallel administration, batch grading, cross-section consistency. Practice sessions for anxiety reduction. |
| Fenton (2025) - Reconsidering Oral Exams and Assessments | IOA definition and components. Prompting taxonomy (Pearce & Chiavaroli, 2020). Formative vs. summative distinction. Examiner training. Communication skills as learning outcome. |
§0.1 Bloom’s Taxonomy
Bloom’s Taxonomy (Bloom, 1956), in the revised form whose level names are used here (Anderson & Krathwohl, 2001), defines six levels of cognitive engagement, progressing from lower-order thinking skills (Remember, Understand, Apply) to higher-order critical thinking (Analyze, Evaluate, Create). In the context of AI-powered oral assessment, the taxonomy is particularly significant: generative AI tools perform well at the lower levels but struggle at the Create level (Fenton, 2025). This makes oral assessment, which can probe higher-order thinking through interactive dialogue, a valuable complement to written assessment in the AI era.
The specification encodes Bloom’s levels as the BloomLevel enum and attaches them to EvidenceTarget via the optional cognitiveLevel field. This enables:
- Compile-time validation that an exam covers the intended range of cognitive demands
- Runtime follow-up escalation strategy (cognitiveEscalationStrategy) that targets higher-order thinking
- Marking rubric alignment that weights higher-order responses more heavily
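As an illustration of the coverage validation described above, the following is a minimal sketch, not the specification's actual schema. The snake_case field names are Python renderings of the spec's camelCase properties, and `missing_levels` is a hypothetical helper name.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class BloomLevel(IntEnum):
    """The six levels named in this section, ordered lower- to higher-order."""
    REMEMBER = 1
    UNDERSTAND = 2
    APPLY = 3
    ANALYZE = 4
    EVALUATE = 5
    CREATE = 6


@dataclass(frozen=True)
class EvidenceTarget:
    target_id: str
    label: str
    cognitive_level: Optional[BloomLevel] = None  # optional, as in the spec


def missing_levels(targets, required):
    """Return the required Bloom levels that no evidence target covers."""
    covered = {t.cognitive_level for t in targets if t.cognitive_level is not None}
    return sorted(set(required) - covered)


targets = [
    EvidenceTarget("t1", "Define photosynthesis", BloomLevel.REMEMBER),
    EvidenceTarget("t2", "Critique an experimental design", BloomLevel.EVALUATE),
    EvidenceTarget("t3", "Untagged target"),  # no cognitive level attached
]
gaps = missing_levels(
    targets, [BloomLevel.REMEMBER, BloomLevel.ANALYZE, BloomLevel.EVALUATE]
)
print([level.name for level in gaps])  # ['ANALYZE']
```

A compiler could surface a non-empty result as a publish-time warning or error, depending on how strictly the exam author wants coverage enforced.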
§0.2 Interactive Oral Assessment
Sotiriadou et al. (2020) define the ‘interactive oral’ as ‘a form of assessment asking students to perform real-world tasks to demonstrate meaningful application of necessary knowledge and skills.’ This positions oral assessment as scenario-driven, real-world-task-based, and conversation-oriented, not merely a spoken examination.
The specification supports interactive oral assessment through task and scenario node kinds, which carry persona, promptSeed, and scenario context for immersive role-play. The AssessmentProfile.authenticityProfile captures where the exam sits on the decontextualised → authentic spectrum.
§0.3 Anxiety Management
Student anxiety is one of the most discussed challenges in the oral assessment literature. Akimov and Malin (2020) report that 100% of students were nervous, with 53% ‘very nervous.’ Fenton (2025) notes that ‘the anxiety some students experience may be linked to the fact that they are unfamiliar with the format’ and recommends practice sessions, format familiarization, and clear communication of expectations.
The specification addresses anxiety through:
- CandidateBriefing - candidate-facing exam information for preparation
- warmup node kind with isPractice and anxietyMitigation properties
- RecoveryPolicy with anxiety-specific recovery strategies
- FormativeFeedbackPolicy for learning-oriented feedback that reduces uncertainty
How Theory Maps to Schema
Joughin’s (1998) six dimensions are encoded as the AssessmentProfile on ExamRuntimePackage:
| Joughin Dimension | Schema Construct | Rationale |
|---|---|---|
| 1. Primary Content Type | AssessmentProfile.contentTypes | Determines what counts as valid evidence. Knowledge can be assessed from a single response; interpersonal competence requires evaluating interaction quality across turns. |
| 2. Interaction | AssessmentProfile.interactionMode | Reliability is threatened when interaction tends toward dialogue (Joughin, p. 376). The specification must capture where on the continuum the exam sits. |
| 3. Authenticity | AssessmentProfile.authenticityProfile | Authenticity relates to face and construct validity (Akimov & Malin, 2020). The specification must express what professional context is being simulated. |
| 4. Structure | AssessmentProfile.structureProfile | Closed structure improves reliability; open structure improves validity for probing understanding (Joughin, p. 376). |
| 5. Examiners | AssessmentProfile.examinerConfig | Supports AI solo, human solo, panel, and AI-with-moderator. Enables inter-rater reliability tracking (Akimov & Malin, 2020). |
| 6. Orality | AssessmentProfile.oralityProfile | Many oral exams involve supplementary written work (Joughin). The specification must support oral defense of prior submissions. |
Additional theory-driven constructs:
- Prompting taxonomy (Pearce & Chiavaroli, 2020, via Fenton 2025) → FollowUpPolicy.allowedPromptingLevels
- Question banking (Akimov & Malin, 2020; Bayley et al., 2024) → QuestionPool
- Moderation (Akimov & Malin, 2020) → ModerationPolicy, ModerationRecord
- Assessment-significant moments (Fenton, 2025) → hesitation_detected, self_correction_detected events
- Identity verification (Akimov & Malin, 2020; Fenton, 2025) → identity_check node kind
- Formative vs. summative (Fenton, 2025; Akimov & Malin, 2020) → ExamMetadata.assessmentPurpose
1. Glossary
| Term | Definition |
|---|---|
| Exam | A published, versioned oral assessment with defined structure, policies, and evidence targets. |
| Exam Runtime Package | The canonical specification artifact representing a complete published exam. The single source of truth. |
| Runtime Node | A discrete unit of the exam flow - a question, task, scenario segment, or transition point. |
| Runtime Session | A single candidate’s attempt at an exam. One exam may have many sessions. |
| Runtime State | The mutable, per-session state tracked by the runtime controller during execution. |
| Runtime Event | An immutable record of a significant state change during a session. |
| Candidate Command | A structured input from the candidate that the runtime controller MUST process (e.g., repeat, pause, clarification). |
| Transcript Turn | A single utterance in the conversation, attributed to examiner or candidate, with timing and node context. |
| Evidence Target | A rubric-aligned definition of what the exam is trying to assess at a given node. |
| Evidence Signal | A runtime-emitted record that a specific evidence target was (or was not) demonstrated, with confidence and provenance. |
| Evidence Ledger | The authoritative, structured collection of all evidence signals produced during a session. |
| Completion Policy | Rules governing when a node is “done” - how many turns, what evidence is required, time limits. |
| Follow-Up Policy | Rules governing how many follow-ups the examiner may issue, and under what conditions. |
| Transition Policy | Rules governing how and when the runtime moves from one node to another. |
| Recovery Policy | Rules governing how the runtime handles anomalies - silence, unclear answers, off-topic, anxiety, network issues. |
| Telemetry Policy | Rules governing what operational data is emitted and where. |
| Context Policy | Rules governing what exam context (rubric, previous nodes, candidate history) the AI examiner may access. |
| Pipecat Adapter Output | The compiled Pipecat-specific configuration (FlowManager config + NodeConfig) generated from the specification. |
| Agent Boundary | The explicit set of allowed and forbidden actions for the AI examiner, enforced by the runtime controller. |
| Marking Runtime | The downstream system that reads the evidence ledger and produces assessment scores. |
| Authoring Studio | The lecturer-facing tool for designing exam flows. Compiles to the specification on publish. |
| Assessment Profile | A structured declaration of the exam’s position on Joughin’s (1998) six dimensions of oral assessment. Captures design parameters that determine what the exam measures and how validity/reliability claims are supported. |
| Question Pool | A set of equivalent question variants from which one or more are drawn per session. Enables inter-case reliability (Akimov & Malin, 2020). |
| Prompting Level | A classification of examiner follow-up moves based on Pearce & Chiavaroli’s (2020) taxonomy: from neutral presentation to leading guidance. |
| Scaffolding Budget | Maximum scaffolding intensity permitted at a node (0-10). The amount of scaffolding provided is itself evidence of candidate competence (Fenton, 2025). |
| Moderation Policy | Rules for human review of AI-generated evidence signals. Supports inter-rater reliability (Akimov & Malin, 2020). |
| Calibration Profile | References to calibration exercises and measured accuracy metrics for the AI examiner. Ensures consistent assessment quality (Fenton, 2025). |
| Fairness Audit | Structured analysis of assessment outcomes across demographic dimensions to detect systematic disparities (Akimov & Malin, 2020; Fenton, 2025). |
| Content Type | Joughin’s (1998) four primary categories of what oral assessment can measure: knowledge/understanding, applied problem solving, interpersonal competence, intrapersonal qualities. |
| Validity Claim | A structured declaration of how the exam addresses face, content, construct, concurrent, inter-rater, inter-case, or fairness validity. |
2. Core Domain Entities
2.1 ExamRuntimePackage
The top-level artifact. A published, versioned, complete specification of an oral exam. Contains metadata, the node graph, global policies, evidence target definitions, and the optional assessment profile.
Key properties:
- Stable identity (examId) and version (version)
- Metadata (title, subject, duration, institution, assessment purpose)
- AssessmentProfile - Joughin’s six dimensions as design parameters (optional in v1)
- The ordered graph of ExamRuntimeNode objects
- GlobalRuntimePolicies that apply across all nodes
- Evidence target registry
- Question pools for randomized delivery
- Pipecat adapter configuration hints
Theoretical grounding: The assessmentProfile field encodes Joughin’s (1998) six dimensions as first-class design parameters. This is not metadata decoration - these dimensions constrain runtime behavior, inform evidence interpretation, and support validity arguments. The assessmentPurpose field (formative/summative/diagnostic) affects whether evidence contributes to grades, whether candidates receive real-time feedback, and whether sessions are recorded (Fenton, 2025; Akimov & Malin, 2020).
2.2 AssessmentProfile
A structured declaration of the exam’s position on Joughin’s (1998) six dimensions of oral assessment. Optional in v1 - when absent, defaults are inferred from node-level policies. When present, it constrains runtime behavior, informs evidence interpretation, and supports validity/reliability arguments.
Joughin’s Six Dimensions (encoded as schema properties):
| Dimension | Property | Why It Matters |
|---|---|---|
| 1. Primary Content Type | contentTypes | Determines what counts as valid evidence. Knowledge/understanding can be assessed from a single correct response; interpersonal competence requires evaluating interaction quality across multiple turns (Joughin, 1998, p. 369). |
| 2. Interaction | interactionMode | Ranges from presentation (one-way) to free dialogue. Reliability is threatened when interaction tends toward dialogue (Joughin, 1998, p. 376). The runtime should report on interaction mode consistency. |
| 3. Authenticity | authenticityProfile | Ranges from decontextualised (abstract questions) to contextualised (genuine professional practice). Relates to face and construct validity (Akimov & Malin, 2020). |
| 4. Structure | structureProfile | Ranges from closed (set questions, fixed order) to open (examiner follows responses). Closed structure improves reliability; open structure improves validity for probing understanding (Joughin, 1998, p. 376). |
| 5. Examiners | examinerConfig | Supports self, peer, authority, panel, and AI-with-moderator. Enables inter-rater reliability tracking and moderation workflows (Akimov & Malin, 2020). |
| 6. Orality | oralityProfile | Ranges from purely oral to oral-as-secondary (defending written work). Supports viva voce and multi-modal assessments (Joughin, 1998, p. 367). |
Additional properties:
- validityClaims - structured declarations of validity/reliability/fairness evidence
- moderationPolicy - rules for human review of AI-generated signals
- calibrationProfile - AI examiner accuracy metrics and calibration references
2.3 QuestionPool
A set of equivalent question variants from which one or more are drawn per session. Addresses inter-case reliability: when different candidates receive different questions, the questions must be of equivalent difficulty.
Key properties:
- Pool ID, label
- List of QuestionVariant objects (each with prompt seed, difficulty estimate, evidence targets)
- Draw count (how many variants per session)
- Whether reuse across concurrent sessions is allowed
Theoretical grounding: Akimov & Malin (2020) describe a bank of 69 questions from which students draw randomly. Bayley et al. (2024) note that question-sharing via group chat is a real concern at scale. The question pool model enables anti-collusion measures (no reuse across concurrent sessions) and difficulty calibration (estimated difficulty per variant).
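The anti-collusion behaviour described above (no reuse across concurrent sessions) could be sketched roughly as follows. This is an illustrative assumption about how a draw might work, not the specified algorithm; `in_use`, `draw`, and `release` are hypothetical names, and the 0.0-1.0 difficulty scale is assumed.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QuestionVariant:
    variant_id: str
    prompt_seed: str
    difficulty: float  # estimated difficulty; the 0.0-1.0 scale is an assumption


@dataclass
class QuestionPool:
    pool_id: str
    variants: list
    draw_count: int = 1
    allow_concurrent_reuse: bool = False
    in_use: set = field(default_factory=set)  # variant IDs held by live sessions

    def draw(self, rng):
        """Draw variants for one session, honouring the reuse setting."""
        candidates = [
            v for v in self.variants
            if self.allow_concurrent_reuse or v.variant_id not in self.in_use
        ]
        if len(candidates) < self.draw_count:
            raise RuntimeError("not enough unused variants for this session")
        drawn = rng.sample(candidates, self.draw_count)
        if not self.allow_concurrent_reuse:
            self.in_use.update(v.variant_id for v in drawn)
        return drawn

    def release(self, drawn):
        """Return variants to the pool when a session ends."""
        self.in_use.difference_update(v.variant_id for v in drawn)


pool = QuestionPool(
    pool_id="pool-1",
    variants=[QuestionVariant(f"v{i}", f"Question seed {i}", 0.5) for i in range(4)],
    draw_count=2,
)
rng = random.Random(7)  # seeded only to make the sketch reproducible
session_a = pool.draw(rng)
session_b = pool.draw(rng)  # shares no variant with session_a
```

With four variants and a draw count of two, a third concurrent session would fail until an earlier session releases its variants, which makes pool sizing an explicit design decision.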
2.4 ExamRuntimeNode
A single unit in the exam flow. Nodes are the vertices of the exam graph. Each node has a type (kind), local policies, evidence targets, and transition rules.
Node kinds:
- question - A direct question to the candidate
- scenario - A scenario presentation (read aloud, display, etc.)
- task - A structured task (role-play, problem-solving, demonstration)
- discussion - An open-ended discussion segment
- warmup - Pre-assessment rapport building
- wrapup - Closing segment
- branch - Conditional routing node (no candidate interaction)
- identity_check - Pre-exam identity verification (not assessed)
Key properties:
- Unique node ID within the package
- kind - the node type
- promptSeed - the base content/prompt for this node (not the full system prompt)
- questionPoolId - optional reference to a question pool for randomized delivery
- Local CompletionPolicy, FollowUpPolicy, RecoveryPolicy
- evidenceTargets - what this node is trying to assess
- transitions - edges to successor nodes with conditions
- candidateCommands - which commands are valid at this node
- timeBudget - maximum time for this node
2.5 RuntimeSession
A candidate’s live attempt at an exam. Created when a session starts, persists until completion or termination. Contains the mutable runtime state and references the immutable package.
Key properties:
- Session ID, candidate ID, package ID + version
- RuntimeState - current mutable state
- TranscriptTurn[] - full conversation transcript
- EvidenceLedger - accumulated evidence
- RuntimeEvent[] - event log for this session
- Start time, end time, status
2.6 RuntimeState
The mutable state tracked by the runtime controller during a session. This is NOT persisted as a log - it is the working memory of the controller.
Key properties:
- currentNodeId - which node the session is in
- currentNodeTurnCount - turns in the current node
- currentNodeFollowUpCount - follow-ups issued in the current node
- globalElapsedMs - total session time
- nodeElapsedMs - time in current node
- candidateCommandHistory - commands issued by the candidate
- evidenceCoverage - which evidence targets have signals
- recoveryAttempts - recovery actions taken
- status - active | paused | completed | terminated
2.7 RuntimeEvent
An immutable record of a significant state change. Events are the audit trail and the mechanism by which downstream systems (frontend, analytics, evidence ledger) learn about session activity.
Event categories:
- Lifecycle: session_started, session_paused, session_resumed, session_completed, session_terminated
- Node: node_entered, node_exited, node_timeout
- Turn: examiner_turn, candidate_turn, turn_completed
- Evidence: evidence_signal_emitted, evidence_target_satisfied, evidence_target_missed
- Command: candidate_command_received, candidate_command_processed
- Policy: follow_up_limit_reached, time_budget_warning, time_budget_exceeded, transition_forced, recovery_triggered, policy_violation
- Agent: agent_action_allowed, agent_action_blocked
- Assessment-significant: hesitation_detected, self_correction_detected
2.8 CandidateCommand
A structured input from the candidate that the runtime controller MUST process. These are not free-text; they are semantic intents recognized from candidate speech or UI interactions.
Command types:
- repeat - “Can you repeat that?”
- clarification - “What do you mean by…?”
- request_rephrase - “Can you say that differently?” (signals active engagement)
- pause - “Can I have a moment?”
- thinking_aloud - “Let me think about this…” (assessment-significant metacognitive signal)
- raise_hand - Candidate signals they want to speak / interrupt
- skip - “Can I skip this?” (subject to policy)
- volume_up / volume_down - Technical adjustment
- language_switch - If multi-language support is enabled
- challenge_premise - Candidate questions the framing of a question (extended)
- revise_earlier_answer - Candidate wants to revisit a previous answer (extended)
Commands are runtime primitives, not UI decorations. The runtime controller MUST process them according to the CandidateCommandPolicy.
Theoretical grounding: Joughin (1998) identifies dialogue as a key dimension — candidates in a dialogue may redirect conversation, challenge premises, or revisit earlier points. Fenton (2025) notes that oral assessments allow “self-correction” — the revise_earlier_answer command supports this. The thinking_aloud command captures metacognitive awareness, which is assessment-significant evidence.
2.9 TranscriptTurn
A single attributed utterance in the conversation. Richer than raw STT output - carries node context, timing, and semantic metadata.
Key properties:
- turnIndex - sequential index in the session
- role - examiner | candidate | system
- text - the transcribed text
- nodeId - which node this turn occurred in
- timestampMs - when the turn started
- durationMs - how long the turn lasted
- isFollowUp - whether this examiner turn was a follow-up
- followUpIndex - if follow-up, which one (0-based)
- candidateCommandDetected - if a candidate command was detected in this turn
2.10 EvidenceTarget
A rubric-aligned definition of what the exam is trying to assess. Defined at the package level, referenced by nodes.
Key properties:
- targetId - unique identifier
- label - human-readable name (e.g., “Explain the mechanism of photosynthesis”)
- description - detailed description of what constitutes evidence
- rubricCriteriaIds - links to rubric criteria in the marking model
- requiredConfidence - minimum confidence for the signal to be considered satisfied
- maxSignals - maximum signals this target can receive (prevents over-counting)
- isRequired - whether this target MUST be satisfied for the exam to be valid
2.11 EvidenceSignal
A runtime-emitted record that a specific evidence target was demonstrated (or not). Produced by the AI examiner during conversation, written to the ledger immediately.
Key properties:
- signalId - unique identifier
- targetId - which EvidenceTarget this signal addresses
- nodeId - which node the evidence was gathered in
- turnRange - [startTurnIndex, endTurnIndex] - the transcript turns containing this evidence
- confidence - 0.0 to 1.0, how confident the AI is that this target was met
- source - ai_judgment | rubric_match | candidate_self_report | external_trigger
- rationale - brief explanation of why this signal was emitted
- timestampMs - when the signal was emitted
2.12 EvidenceLedger
The authoritative, structured collection of all evidence signals for a session. First-class output consumed by the marking runtime.
Key properties:
- sessionId - which session this ledger belongs to
- signals - ordered list of EvidenceSignal objects
- coverageSummary - which EvidenceTarget IDs have at least one signal
- satisfiedTargets - which required targets have signals meeting requiredConfidence
- unsatisfiedTargets - which required targets lack sufficient signals
The ledger is not a transcript derivative. It is a real-time, structured, machine-readable evidence record maintained by the runtime controller.
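To make the derived ledger fields concrete, here is a minimal sketch of how coverageSummary, satisfiedTargets, and unsatisfiedTargets might be computed from the signal list. It assumes a target counts as satisfied when at least one of its signals meets requiredConfidence; `summarize` is a hypothetical helper, and the snake_case fields are Python renderings of the spec's property names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceTarget:
    target_id: str
    required_confidence: float
    is_required: bool = True


@dataclass(frozen=True)
class EvidenceSignal:
    signal_id: str
    target_id: str
    confidence: float  # 0.0 to 1.0


def summarize(targets, signals):
    """Derive coverage, satisfied, and unsatisfied target IDs from signals."""
    best = {}  # highest confidence seen per target
    for s in signals:
        best[s.target_id] = max(best.get(s.target_id, 0.0), s.confidence)
    required = [t for t in targets if t.is_required]
    satisfied = sorted(
        t.target_id for t in required
        # -1.0 default: a target with no signal is never satisfied
        if best.get(t.target_id, -1.0) >= t.required_confidence
    )
    unsatisfied = sorted(t.target_id for t in required if t.target_id not in satisfied)
    return {
        "coverageSummary": sorted(best),
        "satisfiedTargets": satisfied,
        "unsatisfiedTargets": unsatisfied,
    }


targets = [
    EvidenceTarget("t1", 0.7),
    EvidenceTarget("t2", 0.8),
    EvidenceTarget("t3", 0.5, is_required=False),
]
signals = [
    EvidenceSignal("s1", "t1", 0.6),
    EvidenceSignal("s2", "t1", 0.9),
    EvidenceSignal("s3", "t2", 0.4),
    EvidenceSignal("s4", "t3", 0.9),
]
summary = summarize(targets, signals)
```

Here t2 is covered (it has a signal) but unsatisfied (its best confidence is below the threshold), which is the distinction the three fields are meant to preserve for the marking runtime.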
2.13 CompletionPolicy
Rules governing when a node is considered “done.” The runtime controller evaluates this policy after every turn to determine whether to allow or force transition.
Completion conditions (any/all):
- minTurns - minimum candidate turns before completion is possible
- maxTurns - hard cap on turns (forces completion)
- requiredEvidenceTargets - specific targets that MUST have signals before completion
- requiredEvidenceThreshold - minimum number of satisfied targets
- timeBudgetMs - maximum time in this node (forces completion on expiry)
- explicitExaminerComplete - examiner explicitly signals “we’re done with this”
- candidateDecline - candidate declines to continue (subject to policy)
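One possible reading of the per-turn evaluation is sketched below, covering only four of the conditions listed above (minTurns, maxTurns, requiredEvidenceTargets, timeBudgetMs). The three return labels and the precedence of hard caps over evidence checks are assumptions made for illustration, not part of the specification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CompletionPolicy:
    min_turns: int = 1
    max_turns: int = 10
    required_evidence_targets: frozenset = field(default_factory=frozenset)
    time_budget_ms: int = 300_000


def evaluate(policy, turn_count, targets_with_signals, node_elapsed_ms):
    """Return 'force', 'allow', or 'continue' for the current node."""
    # Hard caps force completion regardless of evidence (assumed precedence).
    if turn_count >= policy.max_turns or node_elapsed_ms >= policy.time_budget_ms:
        return "force"
    # Completion is not possible before the minimum number of candidate turns.
    if turn_count < policy.min_turns:
        return "continue"
    # Every required target must already have a signal.
    if policy.required_evidence_targets <= set(targets_with_signals):
        return "allow"
    return "continue"


policy = CompletionPolicy(
    min_turns=2,
    max_turns=5,
    required_evidence_targets=frozenset({"t1"}),
    time_budget_ms=60_000,
)
print(evaluate(policy, 2, {"t1"}, 1_000))  # allow
```

Separating "allow" from "force" mirrors the text's distinction between a transition the controller may take and one it must take.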
2.14 FollowUpPolicy
Rules governing the AI examiner’s follow-up behavior within a node.
Key properties:
- maxFollowUps - hard cap on follow-ups per node
- followUpStyle - probing | scaffolding | clarifying | redirecting | free
- minIntervalMs - minimum time between follow-ups
- requireEvidenceGap - only follow up if an evidence target is unsatisfied
- forbiddenFollowUpPatterns - patterns the examiner MUST NOT use (e.g., “giving away the answer”)
- escalationRule - what to do when max follow-ups is reached (transition, wrap-up, etc.)
- allowedPromptingLevels - constrains the examiner’s follow-up moves based on Pearce & Chiavaroli’s (2020) taxonomy
- requireConsistentPrompting - whether prompting must be consistent across candidates
- disclosePromptingStyle - whether candidates should be informed about prompting style in advance
- scaffoldingBudget - maximum scaffolding intensity (0-10); the amount of scaffolding provided is itself evidence of candidate competence
Theoretical grounding: The prompting taxonomy is based on Pearce & Chiavaroli (2020), cited in Fenton (2025), which defines five levels from neutral presentation to leading guidance. The guiding principles are neutrality, consistency, transparency, and reflexivity. The scaffolding budget draws on Vygotsky’s Zone of Proximal Development (ZPD) theory: the examiner adjusts support based on the candidate’s demonstrated competence level (Fenton, 2025).
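A rough sketch of how a controller might gate a proposed follow-up against this policy is shown below. It covers only maxFollowUps, minIntervalMs, requireEvidenceGap, and allowedPromptingLevels. Representing the five prompting levels as integers 1-5 is an assumption, and every reason string other than follow_up_limit_reached (an event name from section 2.7) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FollowUpPolicy:
    max_follow_ups: int
    min_interval_ms: int = 0
    require_evidence_gap: bool = True
    # Assumption: the five prompting levels are encoded as integers 1-5.
    allowed_prompting_levels: frozenset = frozenset({1, 2, 3})


def may_follow_up(policy, follow_up_count, ms_since_last, has_evidence_gap, level):
    """Return (allowed, reason) for a proposed follow-up."""
    if follow_up_count >= policy.max_follow_ups:
        return False, "follow_up_limit_reached"
    if ms_since_last < policy.min_interval_ms:
        return False, "min_interval_not_elapsed"
    if policy.require_evidence_gap and not has_evidence_gap:
        return False, "no_evidence_gap"
    if level not in policy.allowed_prompting_levels:
        return False, "prompting_level_not_allowed"
    return True, "ok"


policy = FollowUpPolicy(max_follow_ups=2, min_interval_ms=500)
print(may_follow_up(policy, 0, 1_000, True, 2))  # (True, 'ok')
print(may_follow_up(policy, 2, 1_000, True, 2))  # (False, 'follow_up_limit_reached')
```

Returning a reason alongside the decision would let the controller emit the matching policy event instead of silently dropping the follow-up.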
2.15 TransitionPolicy
Rules governing how the runtime moves between nodes.
Key properties:
- targetNodeId - the destination node
- condition - a structured condition that must be true for this transition to fire
- priority - when multiple transitions are eligible, which wins
- isForced - whether this transition can override completion policy (used for timeout, error recovery)
- bridgePrompt - optional prompt seed for the examiner to generate a natural transition utterance
Condition types:
- always - unconditional
- evidence_satisfied - specific evidence targets are met
- turn_count_reached - minimum turns completed
- time_elapsed - time threshold crossed
- candidate_command - candidate issued a specific command
- policy_escalation - a policy limit was reached (e.g., max follow-ups)
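Priority-based selection among eligible transitions might look like the following sketch. It simplifies each structured condition to its type name and assumes that a larger priority number wins; the specification text does not state the ordering direction, so treat that as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    target_node_id: str
    condition: str     # one of the condition types listed above
    priority: int = 0  # assumption: larger number wins


def select_transition(transitions, true_conditions):
    """Pick the eligible transition with the highest priority, or None."""
    eligible = [
        t for t in transitions
        if t.condition == "always" or t.condition in true_conditions
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda t: t.priority)


transitions = [
    Transition("next_question", "always", priority=0),
    Transition("deep_dive", "evidence_satisfied", priority=5),
    Transition("wrapup", "time_elapsed", priority=10),
]
chosen = select_transition(transitions, {"evidence_satisfied"})
print(chosen.target_node_id)  # deep_dive
```

An `always` transition with the lowest priority then acts as a default edge that fires only when nothing more specific is eligible.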
2.16 RecoveryPolicy
Rules governing how the runtime handles anomalies.
Recovery scenarios:
- silence - candidate is not responding
- unclear_answer - STT confidence is low or response is ambiguous
- off_topic - candidate is not addressing the question
- anxiety - candidate signals stress or discomfort
- interruption - candidate interrupts the examiner
- network_issue - audio/connection degradation
- repetition_loop - candidate keeps asking for repeats
Key properties:
- scenario - which recovery scenario this rule addresses
- maxAttempts - how many times to attempt recovery before escalation
- escalation - retry | rephrase | skip_node | pause_session | terminate
- recoveryPrompt - prompt seed for the examiner’s recovery utterance
- cooldownMs - minimum wait before next recovery attempt
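The interplay of maxAttempts, cooldownMs, and escalation could be sketched as below. The "wait" and "attempt_recovery" labels are hypothetical controller actions introduced for illustration; only the escalation values come from the property list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryRule:
    scenario: str
    max_attempts: int
    escalation: str  # retry | rephrase | skip_node | pause_session | terminate
    cooldown_ms: int = 0


def next_action(rule, attempts_so_far, ms_since_last_attempt):
    """Decide what the controller does when the rule's scenario recurs."""
    if attempts_so_far >= rule.max_attempts:
        return rule.escalation      # attempts exhausted: escalate
    if attempts_so_far > 0 and ms_since_last_attempt < rule.cooldown_ms:
        return "wait"               # still inside the cooldown window
    return "attempt_recovery"       # issue the recovery prompt


silence_rule = RecoveryRule("silence", max_attempts=2, escalation="skip_node",
                            cooldown_ms=5_000)
print(next_action(silence_rule, 0, 0))       # attempt_recovery
print(next_action(silence_rule, 1, 1_000))   # wait
print(next_action(silence_rule, 2, 10_000))  # skip_node
```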
2.17 TelemetryPolicy
Rules governing what operational data is emitted and where.
Key properties:
- emitTurnEvents - whether to emit events for every turn
- emitEvidenceEvents - whether to emit events for evidence signals
- emitStateTransitions - whether to emit events for state changes
- emitPolicyViolations - whether to emit events for policy violations (SHOULD always be true)
- samplingRate - for high-frequency events, what fraction to emit
- destinations - where events go (event store, analytics, debug console)
2.18 ContextPolicy
Rules governing what context the AI examiner can access during the session.
Key properties:
- includeRubric - whether the examiner can see rubric criteria
- includePreviousNodes - whether the examiner can see transcript from prior nodes
- includeEvidenceStatus - whether the examiner can see which evidence targets are satisfied
- includeCandidateHistory - whether the examiner can see prior session data for this candidate
- maxContextTokens - token budget for context injection
- redactedFields - fields that MUST NOT appear in the examiner’s context
This is a critical agent boundary mechanism. The examiner’s context window is shaped by this policy - what it doesn’t see, it can’t leak or misuse.
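As a sketch of how this policy might shape the examiner's context, the function below filters a set of available context sections by the include flags and strips redacted fields. The section keys (`rubric`, `previousNodes`, and so on) and the assumption that each section is a flat dictionary are hypothetical, and maxContextTokens is deliberately left out.

```python
def build_examiner_context(policy, available):
    """Filter the available context down to what the policy permits."""
    # Map each hypothetical context section to the policy flag that gates it.
    gates = {
        "rubric": policy.get("includeRubric", False),
        "previousNodes": policy.get("includePreviousNodes", False),
        "evidenceStatus": policy.get("includeEvidenceStatus", False),
        "candidateHistory": policy.get("includeCandidateHistory", False),
    }
    redacted = set(policy.get("redactedFields", []))
    context = {}
    for section, allowed in gates.items():
        if not allowed or section not in available:
            continue
        # Drop redacted fields from each permitted section.
        context[section] = {
            k: v for k, v in available[section].items() if k not in redacted
        }
    return context


policy = {
    "includeRubric": True,
    "includePreviousNodes": False,
    "includeEvidenceStatus": True,
    "redactedFields": ["modelAnswer"],
}
available = {
    "rubric": {"criteria": ["c1"], "modelAnswer": "secret"},
    "previousNodes": {"n1": "transcript text"},
    "evidenceStatus": {"t1": "satisfied"},
}
context = build_examiner_context(policy, available)
# context == {"rubric": {"criteria": ["c1"]}, "evidenceStatus": {"t1": "satisfied"}}
```

Defaulting every flag to False means a section is withheld unless the policy explicitly grants it, which fits the deny-by-default posture the paragraph above describes.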
2.19 PipecatAdapterOutput
The compiled output of running the specification through the Pipecat adapter. Contains everything needed to configure Pipecat’s FlowManager and per-node behavior.
Key properties:
- flowManagerConfig - the FlowManager-compatible graph structure
- nodeConfigs - per-node configuration (system prompt, voice, STT settings)
- dataChannelSchema - schema for runtime events sent via LiveKit data channel
- controllerOverlay - configuration for the runtime controller that sits alongside Pipecat
- compilationWarnings - any degradation or lossy mappings during compilation
3. Conceptual Object Model
3.1 Entity Relationship Diagram
erDiagram
ExamRuntimePackage ||--o| AssessmentProfile : has
ExamRuntimePackage ||--o{ ExamRuntimeNode : contains
ExamRuntimePackage ||--|| GlobalRuntimePolicies : has
ExamRuntimePackage ||--o{ EvidenceTarget : defines
ExamRuntimePackage ||--o{ QuestionPool : has
AssessmentProfile ||--o| AuthenticityProfile : has
AssessmentProfile ||--o| StructureProfile : has
AssessmentProfile ||--o| ExaminerConfiguration : has
AssessmentProfile ||--o| OralityProfile : has
AssessmentProfile ||--o{ ValidityClaim : declares
AssessmentProfile ||--o| ModerationPolicy : has
AssessmentProfile ||--o| CalibrationProfile : has
QuestionPool ||--o{ QuestionVariant : contains
ExamRuntimeNode ||--o| CompletionPolicy : has
ExamRuntimeNode ||--o| FollowUpPolicy : has
ExamRuntimeNode ||--o| RecoveryPolicy : has
ExamRuntimeNode ||--o{ TransitionPolicy : has_transitions
ExamRuntimeNode ||--o{ EvidenceTarget : assesses
ExamRuntimeNode ||--o| CandidateCommandPolicy : allows_commands
ExamRuntimeNode }o--o| QuestionPool : draws_from
RuntimeSession ||--|| RuntimeState : tracks
RuntimeSession ||--|| ExamRuntimePackage : references
RuntimeSession ||--o{ TranscriptTurn : contains
RuntimeSession ||--|| EvidenceLedger : maintains
RuntimeSession ||--o{ RuntimeEvent : emits
RuntimeSession ||--o| SessionRecording : has
RuntimeSession ||--o| ModerationRecord : reviewed_by
EvidenceLedger ||--o{ EvidenceSignal : contains
EvidenceSignal }o--|| EvidenceTarget : addresses
RuntimeEvent }o--|| RuntimeSession : belongs_to
TranscriptTurn }o--|| RuntimeSession : belongs_to
ExamRuntimePackage ||--|| PipecatAdapterOutput : compiles_to
ExamRuntimePackage ||--o{ FairnessAudit : audited_by
3.2 Node Lifecycle
stateDiagram-v2
[*] --> Waiting : session created
Waiting --> Active : node_entered
Active --> Active : candidate_turn / examiner_turn
Active --> Evaluating : completion_check triggered
Evaluating --> Active : completion criteria NOT met
Evaluating --> Transitioning : completion criteria met
Active --> Recovering : anomaly detected
Recovering --> Active : recovery successful
Recovering --> Transitioning : max recovery attempts
Active --> Timeout : time budget exceeded
Timeout --> Transitioning : forced transition
Transitioning --> Active : next node entered
Transitioning --> Completed : no more nodes
Completed --> [*]
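The state diagram above can be transcribed into a transition table that a controller checks before changing state. The sketch below copies the edges directly from the diagram; the `step` and `run` helpers are hypothetical names, and raising on an illegal edge is one possible handling choice.

```python
# Edges copied from the state diagram: state -> set of legal next states.
ALLOWED = {
    "Waiting": {"Active"},
    "Active": {"Active", "Evaluating", "Recovering", "Timeout"},
    "Evaluating": {"Active", "Transitioning"},
    "Recovering": {"Active", "Transitioning"},
    "Timeout": {"Transitioning"},
    "Transitioning": {"Active", "Completed"},
    "Completed": set(),
}


def step(state, target):
    """Move to `target` if the diagram has that edge, else raise."""
    if target not in ALLOWED[state]:
        raise ValueError(f"illegal transition: {state} -> {target}")
    return target


def run(path, start="Waiting"):
    """Replay a sequence of target states from the start state."""
    state = start
    for target in path:
        state = step(state, target)
    return state


final = run(["Active", "Evaluating", "Transitioning", "Completed"])
print(final)  # Completed
```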
3.3 Agent Agency Boundary
The AI examiner operates within a bounded creative space. The boundary is defined at multiple levels:
┌──────────────────────────────────────────────────────────┐
│ GLOBAL POLICIES │
│ (apply to entire exam - agent boundary, telemetry, etc.) │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ NODE-LOCAL POLICIES │ │
│ │ (per-node overrides - completion, follow-up, │ │
│ │ recovery, commands) │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────┐ │ │
│ │ │ AGENT CREATIVE SPACE │ │ │
│ │ │ │ │ │
│ │ │ - Generate natural follow-ups │ │ │
│ │ │ - Judge evidence signals │ │ │
│ │ │ - Produce repair utterances │ │ │
│ │ │ - Create natural bridges between nodes │ │ │
│ │ │ - Adapt tone and pace to candidate │ │ │
│ │ │ │ │ │
│ │ │ CANNOT: │ │ │
│ │ │ - Jump topics or skip nodes │ │ │
│ │ │ - Reveal rubric or scoring │ │ │
│ │ │ - Exceed follow-up limits │ │ │
│ │ │ - Ignore candidate commands │ │ │
│ │ │ - Change exam structure │ │ │
│ │ │ - Fabricate evidence │ │ │
│ │ │ - End exam prematurely │ │ │
│ │ └──────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
Authoring vs. Runtime Concepts:
| Authoring Concept | Runtime Concept | Notes |
|---|---|---|
| Question bank | QuestionPool + ExamRuntimeNode | Questions become pools of variants with nodes as draw targets |
| Rubric criterion | EvidenceTarget | Rubric maps to evidence targets |
| Follow-up template | FollowUpPolicy + promptSeed | Templates become policy-constrained generation |
| Marking scheme | EvidenceLedger + MarkingRuntime input | Marking scheme defines targets; runtime produces signals |
| Exam duration | Global time budget + per-node budgets | Duration is distributed across nodes |
| Exam instructions | ContextPolicy + node promptSeeds | Instructions shape what the examiner knows |
| Assessment design | AssessmentProfile | Joughin’s six dimensions as design parameters |
| Moderation plan | ModerationPolicy | Rules for human review of AI-generated evidence |
| Calibration exercises | CalibrationProfile | Accuracy metrics and calibration references |
| Fairness review | FairnessAudit | Demographic disparity analysis |
Persistent vs. Transient Runtime Concepts:
| Persistent (in specification) | Transient (in Runtime State) |
|---|---|
| ExamRuntimeNode definitions | Current node pointer |
| Policies and constraints | Turn/follow-up counters |
| Evidence targets | Evidence coverage map |
| Transition rules | Recovery attempt counts |
| Command policies | Command history |
| Time budgets | Elapsed time trackers |
The specification is immutable once published. Runtime state is ephemeral - created fresh per session, destroyed on completion. The evidence ledger and event log are persistent outputs derived from runtime execution.
Revision History
| Version | Date | Changes |
|---|---|---|
| v0.2.0 | 2026-06-30 | Added §0.1 Bloom’s Taxonomy, §0.2 Interactive Oral Assessment, §0.3 Anxiety Management. Updated Joughin reference to include 2010. Added IOA-ORM terminology. |
| v0.1.0 | 2026-05-06 | Initial release. |