
Testing Strategy

Draft · v0.2.0 · 2026-06-30

This chapter defines the comprehensive testing strategy for the IOA-ORM, covering every layer from schema validation to adversarial candidate simulation. Tests are organised by category, with specific test cases, expected outcomes, and tooling recommendations.


         ╱  E2E / Chaos  ╲           (few, slow, high confidence)
        ╱  Integration     ╲          (moderate count, moderate speed)
       ╱  Contract / Adapter ╲        (many, fast)
      ╱  Unit / Schema / State ╲      (many, very fast)
     ╱______________________________╲

| Layer | Count Target | Speed Target | Runs On |
| --- | --- | --- | --- |
| Unit / Schema | 200+ | <5s total | Every commit |
| State Machine | 50+ | <10s total | Every commit |
| Contract / Adapter | 80+ | <15s total | Every PR |
| Integration | 30+ | <2min total | Every PR |
| E2E / Scripted Simulation | 15+ | <10min total | Nightly |
| Adversarial | 20+ | <5min total | Nightly |
| Chaos | 5+ | <5min total | Weekly |
| Psychometric | 15+ | Pre-deployment | Per exam package |

Theoretical grounding: The psychometric testing track addresses a critical gap identified by the review: the testing strategy validates technical correctness but not assessment quality. Akimov & Malin (2020) evaluate oral exams against a formal validity/reliability/fairness matrix with eight criteria. The spec’s testing strategy must include analogous psychometric validation to ensure that technically correct specification packages also produce valid, reliable, and fair assessments (Hevner et al., 2004, Guideline 3: “the utility, quality, and efficacy of a design artifact must be rigorously demonstrated”).


Purpose: Ensure every specification document conforms to the published JSON Schema.

| ID | Test | Expected |
| --- | --- | --- |
| SV-001 | Valid minimal specification (opening → end) | Passes validation |
| SV-002 | Valid full specification (opening → 2 questions → closing → end) | Passes validation |
| SV-003 | Missing irVersion | Rejects with "irVersion is required" |
| SV-004 | Invalid irVersion format (e.g., “1.0”) | Rejects with "irVersion must be semver" |
| SV-005 | Node with nodeId containing spaces | Rejects with "nodeId must be [a-z0-9-]" |
| SV-006 | Duplicate nodeId across nodes | Rejects with "nodeId must be unique" |
| SV-007 | Transition target references non-existent nodeId | Rejects with "target nodeId not found" |
| SV-008 | maxFollowUps set to 0 on a question node | Passes (valid: no follow-ups allowed) |
| SV-009 | maxFollowUps set to -1 | Rejects with "maxFollowUps must be >= 0" |
| SV-010 | evidenceTargets array with duplicate IDs within one node | Rejects with "evidenceTarget ID must be unique within node" |
| SV-011 | evidenceTargets with missing level field | Rejects with "level is required" |
| SV-012 | candidateCommands with unknown command type | Rejects with "unknown command type" |
| SV-013 | transitionPolicy.conditions referencing undefined evidence target | Rejects with "evidence target not found" |
| SV-014 | timeBudgetSeconds set to 0 on a question node | Rejects with "timeBudgetSeconds must be > 0" |
| SV-015 | Node of type end with transitions defined | Rejects with "end node must not have transitions" |
| SV-016 | Node of type question without questionStem | Rejects with "questionStem is required for question nodes" |
| SV-017 | guardrails.forbidden contains unknown value | Rejects with "unknown forbidden value" |
| SV-018 | Specification with examinerPersona containing valid tone and style | Passes |
| SV-019 | Specification with metadata.language set to unsupported locale | Warns but does not reject (MAY support) |
| SV-020 | followUp.triggerCondition uses undefined variable | Rejects with "undefined variable in triggerCondition" |

  • JSON Schema Draft 2020-12 with custom format validators.
  • ajv (Node.js) or jsonschema (Python) for validation.
  • CI: validate all fixture IRs on every commit.
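
Cases such as SV-006 and SV-007 involve cross-references between nodes, which are awkward to express in JSON Schema alone and are usually checked in a separate pass after structural validation. The following is a minimal sketch of such a pass, not the project's validator. The document layout (`nodes`, `transitions`, `target`) and the function name `check_references` are assumptions made for illustration.

```python
from collections import Counter


def check_references(spec: dict) -> list[str]:
    """Return error messages for duplicate or dangling nodeId references."""
    errors = []
    nodes = spec.get("nodes", [])  # assumed top-level key
    counts = Counter(node["nodeId"] for node in nodes)

    # SV-006: every nodeId must be unique across the specification.
    for node_id, occurrences in counts.items():
        if occurrences > 1:
            errors.append(f"nodeId must be unique: {node_id}")

    # SV-007: every transition target must name an existing node.
    for node in nodes:
        for transition in node.get("transitions", []):  # assumed field names
            target = transition["target"]
            if target not in counts:
                errors.append(f"target nodeId not found: {target}")
    return errors


# Illustrative fixture: "opening" is duplicated and "q1" does not exist.
spec = {
    "nodes": [
        {"nodeId": "opening", "transitions": [{"target": "q1"}]},
        {"nodeId": "opening", "transitions": [{"target": "end"}]},
        {"nodeId": "end", "transitions": []},
    ]
}
errors = check_references(spec)
```

Running the same function over every `ir-invalid-*.json` fixture in CI would give one assertion per expected error message.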

Purpose: Test individual functions and modules in isolation.

| ID | Test | Expected |
| --- | --- | --- |
| USM-001 | Initialise state for a question node | followUpCount = 0, timeElapsed = 0, evidenceCovered = [] |
| USM-002 | Increment follow-up counter | followUpCount increases by 1 |
| USM-003 | Follow-up counter at max, increment attempted | Counter stays at max; returns blocked status |
| USM-004 | Follow-up counter NEVER decrements | After increment, decrement is a no-op |
| USM-005 | Candidate repeat command leaves follow-up counter unchanged | followUpCount same before and after |
| USM-006 | Candidate clarification command leaves follow-up counter unchanged | followUpCount same before and after |
| USM-007 | Candidate raise_hand pauses the timer | timeElapsed stops increasing during pause |
| USM-008 | Timer resumes after pause duration | timeElapsed resumes from pre-pause value |
| USM-009 | raise_hand exceeds maxPerNode | Command rejected; returns max_commands_reached |
| USM-010 | Transition to allowed target | Returns approved |
| USM-011 | Transition to disallowed target | Returns blocked with reason |
| USM-012 | Transition when condition not yet satisfied | Returns blocked with unsatisfied conditions |
| USM-013 | Time budget warning at 80% | Emits time_budget_warning |
| USM-014 | Time budget exceeded at 100% | Emits time_budget_exceeded; triggers forced transition |
| USM-015 | State reconciliation: runtime and FlowManager agree | Returns consistent |
| USM-016 | State reconciliation: runtime and FlowManager disagree | Returns mismatch with details |
| USM-017 | Evidence covered set updated on signal | evidenceCovered includes new target ID |
| USM-018 | Duplicate evidence signal for same target | evidenceCovered still contains target once; latest confidence/rationale stored |
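
The counter rules in USM-002 to USM-006 and the set semantics in USM-017 and USM-018 can be illustrated with a small state object. This is a sketch under stated assumptions: the class `QuestionNodeState`, its method names, and the `"ok"`/`"blocked"` return values are hypothetical and not taken from the runtime.

```python
from dataclasses import dataclass, field


@dataclass
class QuestionNodeState:
    """Hypothetical per-node state; initial values follow USM-001."""
    max_follow_ups: int
    follow_up_count: int = 0
    evidence_covered: list = field(default_factory=list)

    def record_follow_up(self) -> str:
        # USM-003: at the cap the counter stays put and the call is blocked.
        if self.follow_up_count >= self.max_follow_ups:
            return "blocked"
        self.follow_up_count += 1  # USM-002
        return "ok"

    def handle_command(self, command: str) -> None:
        # USM-005/006: candidate commands never touch the follow-up counter.
        if command not in {"repeat", "clarification", "raise_hand"}:
            raise ValueError(f"unknown command type: {command}")

    def record_evidence(self, target_id: str) -> None:
        # USM-017/018: each target appears in the covered set at most once.
        if target_id not in self.evidence_covered:
            self.evidence_covered.append(target_id)


state = QuestionNodeState(max_follow_ups=2)
results = [state.record_follow_up() for _ in range(3)]
state.handle_command("repeat")
state.record_evidence("ev-q1-scheduling-concept")
state.record_evidence("ev-q1-scheduling-concept")
```

The sketch exposes no decrement operation at all, which is one way to satisfy USM-004 by construction rather than by a runtime check.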
| ID | Test | Expected |
| --- | --- | --- |
| UCC-001 | “Can you repeat that?” → repeat | Classified as repeat with confidence > 0.9 |
| UCC-002 | “Sorry, say that again” → repeat | Classified as repeat |
| UCC-003 | “What do you mean by starvation?” → clarification | Classified as clarification |
| UCC-004 | “Could you explain that term?” → clarification | Classified as clarification |
| UCC-005 | “Can I have a moment?” → raise_hand | Classified as raise_hand |
| UCC-006 | “I need to think for a second” → raise_hand | Classified as raise_hand |
| UCC-007 | “Process scheduling is when the OS…” → answer | Classified as answer (not a command) |
| UCC-008 | “I think Round Robin is better because…” → answer | Classified as answer |
| UCC-009 | “Repeat? I mean, um, scheduling is…” → answer | Classified as answer (false-positive suppression) |
| UCC-010 | “Can you repeat the question and also explain what you mean?” → ambiguous | Returns top-2 with confidence; runtime uses primary |
| UCC-011 | Empty string / silence | Returns none |
| UCC-012 | “Stop the exam” → raise_hand or special | Classified as raise_hand (closest match); logged for review |

| ID | Test | Expected |
| --- | --- | --- |
| UEV-001 | Emit node_entered with valid payload | Event persisted with correct schema |
| UEV-002 | Emit transcript_final with spanId | Span ID is unique, monotonic |
| UEV-003 | Emit evidence_signal with valid transcriptSpanId | Signal persisted, span exists in transcript store |
| UEV-004 | Emit evidence_signal with invalid transcriptSpanId | Signal rejected; error logged |
| UEV-005 | Emit exam_completed exactly once | Second emission is rejected |
| UEV-006 | Emit events out of order (node_exited before node_entered) | Rejected with invalid_sequence error |
| UEV-007 | Emit transition_decision with decision: "blocked" | Event persisted with reason |
| UEV-008 | Emit guardrail_violation | Event persisted; original text captured |

| ID | Test | Expected |
| --- | --- | --- |
| UGR-001 | LLM output contains “rubric” | Blocked; guardrail_violation emitted |
| UGR-002 | LLM output contains “key criteria for this question” | Blocked (rubric reveal variant) |
| UGR-003 | LLM output says “you scored 8 out of 10” | Blocked (reveal_score) |
| UGR-004 | LLM output suggests “you might want to mention X” | Blocked (suggest_answer) |
| UGR-005 | LLM output discusses “exam format policy” | Blocked (forbidden_topics) |
| UGR-006 | LLM output is clean, on-topic response | Passes; no violation |
| UGR-007 | LLM output attempts transition to closing from q1 | Blocked (unauthorized_transition) |
| UGR-008 | LLM output contains rubric text embedded in a story | Blocked (semantic detection, not just keyword) |
| UGR-009 | LLM output contains grading threshold numbers | Blocked (forbidden_topics) |
| UGR-010 | LLM output provides a helpful clarification | Passes (clarification is allowed within guardrails) |
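
The ordering rules in the event-emission cases above (UEV-005 and UEV-006) are easy to state as a small guard in front of the event store. The sketch below is illustrative only: `EventLog`, `EventSequenceError`, and `try_emit` are hypothetical names, and rejecting every event after `exam_completed` is an assumption that goes slightly beyond what UEV-005 requires.

```python
from typing import Optional


class EventSequenceError(Exception):
    """Raised when an event would violate the ordering rules."""


class EventLog:
    """Hypothetical in-memory log; not the real event store API."""

    def __init__(self) -> None:
        self.events = []
        self.open_nodes = set()
        self.completed = False

    def emit(self, event_type: str, node_id: Optional[str] = None) -> None:
        if self.completed:
            # UEV-005: a second exam_completed (or anything later) is rejected.
            raise EventSequenceError("exam_completed already emitted")
        if event_type == "node_entered":
            self.open_nodes.add(node_id)
        elif event_type == "node_exited":
            if node_id not in self.open_nodes:
                # UEV-006: a node cannot be exited before it was entered.
                raise EventSequenceError("invalid_sequence")
            self.open_nodes.discard(node_id)
        elif event_type == "exam_completed":
            self.completed = True
        self.events.append((event_type, node_id))


def try_emit(log: EventLog, event_type: str, node_id: Optional[str] = None) -> str:
    """Emit an event and report whether it was persisted or rejected."""
    try:
        log.emit(event_type, node_id)
        return "persisted"
    except EventSequenceError as error:
        return f"rejected: {error}"


log = EventLog()
outcomes = [
    try_emit(log, "node_exited", "q1"),   # exit before enter
    try_emit(log, "node_entered", "q1"),
    try_emit(log, "node_exited", "q1"),
    try_emit(log, "exam_completed"),
    try_emit(log, "exam_completed"),      # second emission
]
```

Rejected events are never appended, so the persisted log stays a valid sequence even when the caller misbehaves.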

Purpose: Verify that the specification-to-Pipecat adapter produces correct FlowManager configurations.

| ID | Test | Expected |
| --- | --- | --- |
| PAD-001 | Opening node → Pipecat config | task_messages contains system + assistant prompt; edges has one target |
| PAD-002 | Question node with 2 follow-ups → Pipecat config | System prompt mentions follow-up guidance; runtime_config.maxFollowUps = 2 |
| PAD-003 | Question node with maxFollowUps: 0 → Pipecat config | System prompt says “do not ask follow-ups”; no follow-up guidance in config |
| PAD-004 | End node → Pipecat config | No task_messages; post_actions includes exam_completed event |
| PAD-005 | Node with time budget → Pipecat config | runtime_config.timeBudgetSeconds present |
| PAD-006 | Node with evidence targets → Pipecat config | runtime_config.evidenceTargets array present |
| PAD-007 | Node with guardrails.forbidden → Pipecat config | System prompt includes forbidden action list |
| PAD-008 | Node with candidateCommands → Pipecat config | System prompt includes command handling instructions |
| PAD-009 | Adapter preserves node order | initial_node matches specification’s first node; edge order matches |
| PAD-010 | Adapter rejects invalid specification | Returns clear error, does not produce partial config |
| PAD-011 | Adapter round-trip: IR → Pipecat config → (mock) execution | Execution produces expected events |
| PAD-012 | Adapter handles specification with no examinerPersona | System prompt uses default persona |
| PAD-013 | Adapter handles specification with overrunPolicy: "warn_at_80pct_hard_at_100pct" | Runtime config includes overrun settings |

Purpose: Run a complete exam session with a scripted candidate (no LLM) to verify end-to-end behaviour.

Approach: A “candidate simulator” plays predetermined utterances at predetermined times. The runtime, events, evidence ledger, and marking input are all verified against expected outcomes.

| Step | Candidate Action | Expected Runtime Behaviour |
| --- | --- | --- |
| 1 | (opening plays) | node_entered(opening); node_exited(opening); node_entered(q1) |
| 2 | Answers Q1 well | evidence_signal(ev-q1-scheduling-concept, covered) |
| 3 | Follow-up 1 asked | node_progress(followUpCount: 1) |
| 4 | Answers follow-up 1 | evidence_signal(ev-q1-preemptive-cooperative, covered) |
| 5 | Follow-up 2 asked | node_progress(followUpCount: 2) |
| 6 | Answers follow-up 2 | evidence_signal(ev-q1-context-switch, covered) |
| 7 | Runtime approves move to q2 | transition_decision(move_to_next_node) |
| 8 | Q2 stem plays | node_entered(q2) |
| 9 | Answers Q2 partially | evidence_signal(ev-q2-algorithm-choice, covered) |
| 10 | Follow-up 1 asked | node_progress(followUpCount: 1) |
| 11 | Answers follow-up 1 | evidence_signal(ev-q2-starvation, covered) |
| 12 | Runtime decides sufficient evidence | transition_decision(move_to_next_node) |
| 13 | Closing plays | node_entered(closing); node_exited(closing); node_entered(end) |
| 14 | (end) | exam_completed emitted; marking input assembled |

Verify: All events present and in order. Evidence ledger has 5/6 covered. Marking input is complete.
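
The "present and in order" check and the 5/6 coverage count can be written as two small assertions over the recorded session. This is a sketch, assuming events are available as `(type, node)` tuples and the ledger as a mapping from target ID to status. Both shapes, the helper names, and the sixth target ID (a placeholder, since this chapter does not name it) are illustrative.

```python
def is_subsequence(expected, actual) -> bool:
    """True if every expected item appears in actual, in the same order."""
    remaining = iter(actual)
    # `item in remaining` consumes the iterator, which enforces ordering.
    return all(item in remaining for item in expected)


def coverage(ledger: dict) -> tuple[int, int]:
    """Return (covered, total) for an evidence ledger."""
    covered = sum(1 for status in ledger.values() if status == "covered")
    return covered, len(ledger)


expected_events = [
    ("node_entered", "opening"), ("node_exited", "opening"),
    ("node_entered", "q1"), ("node_entered", "q2"),
    ("node_entered", "closing"), ("node_exited", "closing"),
    ("node_entered", "end"), ("exam_completed", None),
]

# Recorded stream: the expected events with other events interleaved.
actual_events = [
    ("node_entered", "opening"), ("node_exited", "opening"),
    ("node_entered", "q1"), ("evidence_signal", "q1"),
    ("node_progress", "q1"), ("transition_decision", "q1"),
    ("node_entered", "q2"), ("evidence_signal", "q2"),
    ("node_entered", "closing"), ("node_exited", "closing"),
    ("node_entered", "end"), ("exam_completed", None),
]

ledger = {
    "ev-q1-scheduling-concept": "covered",
    "ev-q1-preemptive-cooperative": "covered",
    "ev-q1-context-switch": "covered",
    "ev-q2-algorithm-choice": "covered",
    "ev-q2-starvation": "covered",
    "ev-q2-placeholder-sixth-target": "not_covered",  # placeholder ID
}

in_order = is_subsequence(expected_events, actual_events)
covered, total = coverage(ledger)
```

A subsequence check tolerates extra events between the required ones, which keeps the test stable when new event types are added.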

Scenario B — Candidate Uses All Commands

| Step | Candidate Action | Expected |
| --- | --- | --- |
| 1 | “Can you repeat that?” after Q1 stem | candidate_command(repeat), followUpCount unchanged |
| 2 | “What do you mean by scheduling?” | candidate_command(clarification), LLM clarifies, followUpCount unchanged |
| 3 | “I need a moment” before answering | candidate_command(raise_hand), timer pauses for 10s |
| 4 | Answers normally after pause | followUpCount at 0 (none of the commands counted) |

| Step | Candidate Action | Expected |
| --- | --- | --- |
| 1 | Candidate gives slow, rambling answers | Timer approaches 80% of q2 budget |
| 2 | (automatic) | time_budget_warning emitted |
| 3 | Candidate continues | Timer reaches 100% |
| 4 | (automatic) | time_budget_exceeded emitted |
| 5 | (automatic) | transition_decision(move_to_next_node, reason: "time budget exceeded") |
| 6 | LLM delivers graceful bridge | “We’re running short on time, so let’s move on” |
| 7 | Closing plays | node_entered(closing) |
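
The time-budget behaviour exercised above (warning at 80%, exceeded at 100%, and the raise_hand pause from USM-007 and USM-008) can be sketched as a tick-driven counter. The class `TimeBudget`, its integer-second ticks, and the 60-second budget are assumptions chosen for illustration, not the runtime's timer.

```python
class TimeBudget:
    """Hypothetical per-node timer driven by one-second ticks."""

    def __init__(self, budget_seconds: int, warn_percent: int = 80) -> None:
        self.budget_seconds = budget_seconds
        self.warn_percent = warn_percent
        self.elapsed = 0
        self.paused = False
        self.emitted = []

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    def tick(self, seconds: int = 1) -> None:
        if self.paused:
            return  # USM-007: elapsed time does not advance while paused
        self.elapsed += seconds
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        warn_reached = self.elapsed * 100 >= self.budget_seconds * self.warn_percent
        if warn_reached and "time_budget_warning" not in self.emitted:
            self.emitted.append("time_budget_warning")
        if self.elapsed >= self.budget_seconds and "time_budget_exceeded" not in self.emitted:
            self.emitted.append("time_budget_exceeded")


budget = TimeBudget(budget_seconds=60)
for _ in range(47):
    budget.tick()
emitted_before_pause = list(budget.emitted)  # 47s is below the 48s threshold

budget.pause()
for _ in range(10):
    budget.tick()                            # raise_hand pause: ignored ticks
elapsed_after_pause = budget.elapsed

budget.resume()
for _ in range(13):
    budget.tick()                            # 48s triggers warning, 60s exceeded
```

Each event is emitted once, so a test can assert on the exact contents of `emitted` rather than on counts.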

Purpose: Verify that the runtime handles hostile, confused, or edge-case candidate behaviour without breaking assessment integrity.

12.6.1 LLM Attempts Third Follow-Up When max=2

| ID | Test | Expected |
| --- | --- | --- |
| ADV-001 | After 2 follow-ups, LLM generates a third follow-up question | Runtime blocks the follow-up; injects “move to next node” instruction |
| ADV-002 | LLM generates third follow-up phrased as a statement | Runtime detects follow-up intent; blocks |
| ADV-003 | LLM embeds a follow-up inside a transition sentence | Runtime detects; blocks and forces transition |

| ID | Test | Expected |
| --- | --- | --- |
| ADV-004 | Candidate asks to repeat 10 times in a row | First 3 succeed (maxPerNode: 3); remaining 7 rejected with polite message |
| ADV-005 | Candidate asks for clarification on a term that IS the answer | LLM provides a safe explanation without revealing the answer; guardrail passes |
| ADV-006 | Candidate says “raise hand” then immediately starts answering | Timer paused for 10s regardless; answer recorded after resume |
| ADV-007 | Candidate sends command via data channel while LLM is speaking | Command queued; processed after current utterance completes |

| ID | Test | Expected |
| --- | --- | --- |
| ADV-008 | Candidate stays silent for 16 seconds (threshold is 15s) | silenceAction: "gentle_prompt" triggered; LLM says “Take your time” or similar |
| ADV-009 | Candidate stays silent for 45 seconds (3 gentle prompts) | Runtime escalates: emits candidate_silence_extended; LLM says “Would you like me to move on?” |
| ADV-010 | Candidate gives an off-topic answer (“I like pizza”) | LLM gently redirects: “That’s interesting, but let’s focus on the question about scheduling.” No follow-up counted. |
| ADV-011 | Candidate gives a partially on-topic answer | LLM asks a follow-up to fill the gap; follow-up counted normally |
| ADV-012 | Candidate says “I don’t know” | LLM offers a gentle hint within guardrails; follow-up counted |

| ID | Test | Expected |
| --- | --- | --- |
| ADV-013 | STT partial arrives but final never comes | After timeout (5s), runtime treats last partial as final with degraded confidence flag |
| ADV-014 | STT produces two overlapping finals for the same utterance | Deduplicated by span ID; second one ignored |
| ADV-015 | STT produces empty final (candidate mumbled) | Emitted as transcript_final with empty text; LLM handles gracefully |
| ADV-016 | STT garbles text (low-confidence transcription) | transcript_final includes confidence field; low confidence logged for review |

| ID | Test | Expected |
| --- | --- | --- |
| ADV-017 | LLM attempts to jump from q1 directly to closing | Runtime blocks; guardrail_violation(unauthorized_transition) emitted |
| ADV-018 | LLM attempts to jump from q1 to q2 (correct) without runtime approval | Runtime intercepts; approves only after transition condition check |
| ADV-019 | LLM attempts to loop back to q1 from q2 | Runtime blocks; q1 is not in q2’s allowedTargets |
| ADV-020 | LLM attempts to end exam mid-question | Runtime blocks; forces continuation or graceful time-budget handling |

| ID | Test | Expected |
| --- | --- | --- |
| ADV-021 | Evidence signal references invalid span ID | Signal rejected; error logged; does not corrupt ledger |
| ADV-022 | Evidence signal claims “covered” but transcript excerpt contradicts | Confidence check: if excerpt is empty or contradictory, signal downgraded to “uncertain” |
| ADV-023 | Multiple evidence signals for same target with increasing confidence | Latest signal overwrites earlier; confidence increases |
| ADV-024 | Evidence signal arrives after exam_completed | Rejected; exam_completed is final |
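
Three of the evidence-ledger rules above (ADV-021, ADV-023, ADV-024) are simple write-time checks. The sketch below illustrates them with a hypothetical `EvidenceLedger` class. The method name, return strings, and span IDs are assumptions, and the contradiction check from ADV-022 is deliberately left out because it needs semantic analysis.

```python
class EvidenceLedger:
    """Hypothetical ledger enforcing write-time rules only."""

    def __init__(self, known_span_ids) -> None:
        self.known_span_ids = set(known_span_ids)
        self.entries = {}
        self.exam_completed = False

    def record(self, target_id: str, span_id: str, confidence: float) -> str:
        if self.exam_completed:
            # ADV-024: nothing is written once the exam has completed.
            return "rejected: exam_completed is final"
        if span_id not in self.known_span_ids:
            # ADV-021: an unknown span must not reach the ledger.
            return "rejected: unknown transcriptSpanId"
        # ADV-023: the latest signal for a target replaces the earlier one.
        self.entries[target_id] = {
            "transcriptSpanId": span_id,
            "confidence": confidence,
        }
        return "accepted"


ledger = EvidenceLedger(known_span_ids=["span-1", "span-2"])
results = [
    ledger.record("ev-q1-context-switch", "span-1", 0.6),
    ledger.record("ev-q1-context-switch", "span-2", 0.8),
    ledger.record("ev-q1-context-switch", "span-99", 0.9),  # invalid span
]
ledger.exam_completed = True
results.append(ledger.record("ev-q2-starvation", "span-1", 0.7))
```

Because rejected signals return early, the entry written by the second call survives both later rejections unchanged.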

12.7 Transcript & Evidence Consistency Tests


Purpose: Ensure the transcript and evidence ledger are internally consistent and can be independently verified.

| ID | Test | Expected |
| --- | --- | --- |
| TEC-001 | Every transcriptSpanId referenced in evidence ledger exists in transcript store | 100% match |
| TEC-002 | Every evidence target marked “covered” has at least one transcript span | 100% match |
| TEC-003 | Evidence “not_covered” has zero transcript spans (or empty excerpt) | Consistent |
| TEC-004 | Transcript is complete: every node_entered has corresponding transcript spans | No gaps |
| TEC-005 | Transcript order matches event order | Monotonic timestamps |
| TEC-006 | Evidence ledger is complete: every evidenceTarget in specification has a ledger entry | 100% coverage (even if “not_covered”) |
| TEC-007 | Evidence confidence values are in [0, 1] range | All valid |
| TEC-008 | Evidence rationale is non-empty for all “covered” entries | Present |
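
Several of these consistency checks (TEC-001, TEC-002, TEC-006, TEC-007, TEC-008) reduce to a single pass over the ledger after the session ends. The following sketch shows one possible audit function. The entry fields (`status`, `transcriptSpanIds`, `confidence`, `rationale`) and the function name are assumptions about the ledger shape, not the published schema.

```python
def audit_ledger(spec_target_ids, ledger, transcript_span_ids) -> list[str]:
    """Return a list of consistency problems; an empty list means clean."""
    problems = []
    known_spans = set(transcript_span_ids)

    # TEC-006: every target declared in the specification needs an entry.
    for target_id in spec_target_ids:
        if target_id not in ledger:
            problems.append(f"TEC-006 {target_id}: no ledger entry")

    for target_id, entry in ledger.items():
        spans = entry.get("transcriptSpanIds", [])
        covered = entry.get("status") == "covered"
        for span_id in spans:
            if span_id not in known_spans:  # TEC-001
                problems.append(f"TEC-001 {target_id}: unknown span {span_id}")
        if covered and not spans:  # TEC-002
            problems.append(f"TEC-002 {target_id}: covered without a span")
        if not 0.0 <= entry.get("confidence", 0.0) <= 1.0:  # TEC-007
            problems.append(f"TEC-007 {target_id}: confidence out of range")
        if covered and not entry.get("rationale", "").strip():  # TEC-008
            problems.append(f"TEC-008 {target_id}: empty rationale")
    return problems


ledger = {
    "ev-q1-scheduling-concept": {
        "status": "covered", "transcriptSpanIds": ["span-1"],
        "confidence": 0.9, "rationale": "Defines scheduling correctly.",
    },
    "ev-q1-context-switch": {
        "status": "covered", "transcriptSpanIds": ["span-7"],
        "confidence": 1.2, "rationale": "",
    },
}
problems = audit_ledger(
    spec_target_ids=["ev-q1-scheduling-concept", "ev-q1-context-switch", "ev-q2-starvation"],
    ledger=ledger,
    transcript_span_ids=["span-1", "span-2"],
)
```

Prefixing each problem with its test ID makes a failing audit point directly at the violated rule.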

Purpose: Test the full flow from candidate utterance through command classification to runtime action and event emission.

| ID | Test | Expected |
| --- | --- | --- |
| CCI-001 | Candidate says “repeat” during Q1 stem | Command classified → re-prompt issued → candidate_command event → followUpCount unchanged |
| CCI-002 | Candidate says “what does that mean?” during follow-up | Command classified as clarification → LLM clarifies → candidate_command event → followUpCount unchanged |
| CCI-003 | Candidate says “give me a second” mid-answer | Command classified as raise_hand → timer pauses → time_budget_paused event → resumes after 10s |
| CCI-004 | Candidate sends raw data channel command { command: "repeat" } | Processed same as spoken command; event emitted |
| CCI-005 | Candidate sends malformed data channel command | Rejected with error; logged; no crash |
| CCI-006 | Candidate uses command during opening (non-question node) | Command rejected or handled gracefully (e.g., repeat works, raise_hand is ignored) |

Purpose: Test guardrails in context — LLM generation → guardrail check → block/allow → event emission.

| ID | Test | Expected |
| --- | --- | --- |
| GI-001 | LLM generates rubric-revealing response during Q1 | Blocked; guardrail_violation event; LLM regenerates clean response |
| GI-002 | LLM generates unauthorised transition text from q1 | Blocked; guardrail_violation; runtime forces q2 transition |
| GI-003 | LLM generates a helpful clarification (allowed) | Passes; no violation event |
| GI-004 | LLM generates response that hints at score (“about 70%”) | Blocked; guardrail_violation(reveal_score) |
| GI-005 | LLM generates response discussing exam logistics | Blocked; guardrail_violation(forbidden_topics) |
| GI-006 | Guardrail check latency | MUST complete in <50ms; logged if exceeded |

12.10 Regression Tests for Published Packages


Purpose: Ensure existing published packages continue to work through every phase of the migration.

| ID | Test | Expected |
| --- | --- | --- |
| REG-001 | Package published with flowJson v1 (pre-specification), Phase 1 | Events emitted; transcript persisted; no candidate-facing change |
| REG-002 | Package published with flowJson v1, Phase 2 | Commands not configured; runtime state tracked but commands don’t activate |
| REG-003 | Package published with flowJson v1, Phase 3 | No evidence targets; evidence ledger is empty; marking uses raw transcript |
| REG-004 | Package published with flowJson v1, Phase 4 | No transition policy; LLM decides transitions (current behaviour) |
| REG-005 | Package published with flowJson v1, Phase 5 | Auto-migrated to specification v1.0.0; verify Pipecat adapter output matches original flowJson |
| REG-006 | Package published with specification v1.0.0, after Phase 5 | Full specification pipeline; all features available |
| REG-007 | Package published with specification v1.1.0 (future), loaded by runtime v1.0 | Backward compatible; new fields ignored; warning logged |
| REG-008 | Package published with specification v2.0.0 (breaking), loaded by runtime v1.0 | Rejected with clear error: “specification version 2.0.0 requires runtime >= 2.0.0” |
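
REG-007 and REG-008 describe a conventional semantic-versioning gate: a newer minor version loads with a warning, a newer major version is rejected. A minimal sketch of that rule follows. The function names and status strings are hypothetical, and the handling of specifications older than the runtime's major version is an assumption, since the cases above do not cover it.

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into integers (no pre-release support)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def check_compatibility(spec_version: str, runtime_version: str) -> tuple[str, str]:
    """Return (status, message) for loading a specification on a runtime."""
    spec_major, spec_minor, _ = parse_semver(spec_version)
    rt_major, rt_minor, _ = parse_semver(runtime_version)

    if spec_major > rt_major:
        # REG-008: a breaking version needs a newer runtime.
        return (
            "rejected",
            f"specification version {spec_version} requires runtime >= {spec_major}.0.0",
        )
    if spec_major == rt_major and spec_minor > rt_minor:
        # REG-007: newer minor version loads; unknown fields are ignored.
        return ("accepted_with_warning", "new fields ignored")
    # Assumption: same or older versions load without a warning.
    return ("accepted", "")


same = check_compatibility("1.0.0", "1.0.0")
future_minor = check_compatibility("1.1.0", "1.0.0")
breaking = check_compatibility("2.0.0", "1.0.0")
```

Comparing parsed integer tuples rather than strings avoids the classic mistake of ordering "1.10.0" before "1.9.0".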

Purpose: Verify that the frontend correctly consumes and renders all event types.

| ID | Test | Expected |
| --- | --- | --- |
| UIC-001 | bot_ready received | UI shows “Connected” status |
| UIC-002 | node_entered(q1) received | UI updates to “Question 1” display |
| UIC-003 | node_progress(followUpCount: 1, maxFollowUps: 2) received | UI shows “Follow-up 1 of 2” |
| UIC-004 | node_progress(evidenceCovered: ["ev-q1-scheduling-concept"]) received | UI updates evidence progress indicator |
| UIC-005 | time_budget_warning received | UI shows time warning (e.g., yellow indicator) |
| UIC-006 | time_budget_exceeded received | UI shows time exceeded (e.g., red indicator) |
| UIC-007 | candidate_command(repeat) sent via UI button | Data channel message sent; command acknowledged |
| UIC-008 | candidate_command(raise_hand) sent via UI button | Timer pause acknowledged; UI shows “Paused” |
| UIC-009 | exam_completed received | UI shows “Assessment Complete” screen |
| UIC-010 | guardrail_violation event (admin view only) | Admin UI shows violation details; candidate UI unaffected |
| UIC-011 | Unknown event type received | UI ignores gracefully; logs warning |
| UIC-012 | Events arrive out of order | UI reorders by timestamp before rendering |

Purpose: Verify the marking input package is complete, correct, and consumable by the marking pipeline.

| ID | Test | Expected |
| --- | --- | --- |
| MRI-001 | Marking input contains all evidence ledger entries | Count matches specification’s evidenceTargets count |
| MRI-002 | Marking input contains full transcript | All transcript_final spans present |
| MRI-003 | Marking input contains runtime audit | nodesVisited, followUpsUsed, transitionDecisions all present |
| MRI-004 | Marking input contains specification snapshot | Frozen copy of specification used for session |
| MRI-005 | Marking input for exam with no evidence targets | Evidence ledger empty; transcript and audit still present |
| MRI-006 | Marking input for exam that ended early (timeout) | Partial evidence; not_covered entries for missing targets |
| MRI-007 | Marking input for exam with guardrail violations | Violations listed in runtime audit |
| MRI-008 | Marking input for exam with candidate commands | Commands listed in runtime audit |
| MRI-009 | Marking input schema validation | Passes JSON Schema for marking input |
| MRI-010 | Marking input version compatibility | inputVersion matches expected version |

Purpose: Verify the system handles failures gracefully.

| ID | Test | Expected |
| --- | --- | --- |
| CHA-001 | Bot crashes mid-question | exam_completed fires from guaranteed hook; partial transcript preserved |
| CHA-002 | STT service disconnects during candidate answer | Last partial treated as final (degraded); session continues after reconnect |
| CHA-003 | Event store unavailable | Events queued in memory; flushed on recovery; session does not block |
| CHA-004 | Candidate disconnects and reconnects | Session resumes from current node; state preserved |
| CHA-005 | LLM service timeout during follow-up | Retry once; if still fails, skip follow-up and move to next node |
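
The retry-once-then-skip policy in CHA-005 can be sketched as a small wrapper around whatever call produces the follow-up. Everything here is illustrative: `follow_up_with_retry` and `make_flaky` are hypothetical helpers, and mapping an LLM service timeout to Python's built-in `TimeoutError` is an assumption.

```python
def follow_up_with_retry(generate, max_retries: int = 1):
    """Call generate(); on timeout retry up to max_retries, then skip."""
    for _ in range(max_retries + 1):
        try:
            return "asked", generate()
        except TimeoutError:
            continue  # try again if attempts remain
    # All attempts timed out: skip the follow-up and move to the next node.
    return "skipped", None


def make_flaky(failures: int):
    """Build a fake generator that times out for the first N calls."""
    calls = {"count": 0}

    def generate() -> str:
        calls["count"] += 1
        if calls["count"] <= failures:
            raise TimeoutError("LLM service timeout")
        return "Can you expand on that?"

    return generate


recovered = follow_up_with_retry(make_flaky(failures=1))
gave_up = follow_up_with_retry(make_flaky(failures=2))
```

A chaos test can inject the fake generator in place of the real LLM client and assert on the returned status.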

| ID | Test | Target |
| --- | --- | --- |
| PERF-001 | Specification schema validation latency | <10ms per specification |
| PERF-002 | Pipecat adapter compilation latency | <500ms per specification |
| PERF-003 | Command classification latency | <200ms per utterance |
| PERF-004 | Guardrail check latency | <50ms per LLM response |
| PERF-005 | Event emission throughput | >1000 events/sec to event store |
| PERF-006 | Evidence signal generation latency | <2s (async, does not block dialogue) |
| PERF-007 | 100 concurrent exam sessions | No degradation in event latency or dialogue responsiveness |
| PERF-008 | Marking input assembly latency | <5s per exam |

Purpose: Validate that specification packages produce assessments that are not only technically correct but also psychometrically sound — valid, reliable, and fair. This track addresses the “psychometrically blind” gap identified in the specification review.

Theoretical grounding: Akimov & Malin (2020) evaluate their oral exam against a validity/reliability/fairness matrix with eight criteria: face validity, content validity, construct validity, concurrent validity, inter-item consistency, inter-case reliability, inter-rater reliability, and fairness. Joughin (1998) warns that “reliability is threatened when examiners are poorly prepared” (p. 376) and when “interaction tends towards the dialogue pole” (p. 376). Fenton (2025) notes that “careful preparation is recommended to avoid any bias” and identifies risks around “gender, ethnicity, language skills, speed of answering, and subjective grading.”

These tests are not run on every commit. They are run per exam package before deployment to high-stakes summative assessment. Low-stakes or formative exams MAY skip psychometric validation.

| ID | Test | Method | Expected |
| --- | --- | --- | --- |
| PSY-001 | Evidence targets align with declared learning outcomes | Expert panel review (≥3 subject matter experts) | ≥90% agreement that targets cover declared LOs |
| PSY-002 | Question stem elicits the intended cognitive level | Bloom’s taxonomy classification by 2+ independent raters | Inter-rater agreement ≥ 0.8 (Cohen’s κ) |
| PSY-003 | Follow-up questions probe deeper understanding, not just recall | Expert review of follow-up bank against Bloom’s levels | Follow-ups escalate at least one Bloom level from stem |
| PSY-004 | Assessment covers the full scope of the learning outcomes | Content coverage map: LO → evidence target → node | Every declared LO has ≥1 evidence target mapped |

| ID | Test | Method | Expected |
| --- | --- | --- | --- |
| PSY-010 | Specification evidence signals correlate with independent measures of the same construct | Correlation between specification evidence signals and independent written exam scores on same topics | Moderate positive correlation (r > 0.4) |
| PSY-011 | Different evidence targets measure different constructs | Factor analysis of evidence signal patterns across ≥50 sessions | Evidence targets cluster into distinct factors matching declared content types |
| PSY-012 | Transversal skills (communication, critical thinking) are distinguishable from content knowledge | Partial correlation: transversal signals vs content signals controlling for overall ability | Transversal and content signals are not perfectly correlated (r < 0.9) |

| ID | Test | Method | Expected |
| --- | --- | --- | --- |
| PSY-020 | AI examiner produces consistent evidence signals across sessions | Two independent LLM instances assess the same transcript (≥30 transcripts) | Cohen’s κ ≥ 0.8 for signal classification (covered/partial/absent) |
| PSY-021 | AI examiner agrees with human marker on evidence signals | AI-generated signals vs human-annotated signals on same transcripts | Cohen’s κ ≥ 0.75 |
| PSY-022 | Confidence calibration: 0.8-confidence signals are correct ~80% of the time | Binomial test: proportion of correct signals at each confidence level | Proportion within ±10% of declared confidence |
| PSY-023 | Confidence drift detection | Monitor average confidence over session duration; test for monotonic trend | No significant drift (p > 0.05, Mann-Kendall test) |

| ID | Test | Method | Expected |
| --- | --- | --- | --- |
| PSY-030 | Candidates receiving different follow-up paths get comparable evidence opportunities | Compare evidence coverage rates across candidates on same node | Coverage rate variance < 15% |
| PSY-031 | Follow-up type distribution is consistent across candidates | Chi-squared test on follow-up type distribution across ≥30 sessions | No significant deviation from expected distribution (p > 0.05) |
| PSY-032 | Conversation path variance does not correlate with final scores | Correlation between path variance metric and evidence coverage | r < 0.3 |

| ID | Test | Method | Expected |
| --- | --- | --- | --- |
| PSY-040 | No significant difference in evidence signal accuracy by language background | Compare AI marker agreement with human marker for native vs non-native speakers | No statistically significant difference (p > 0.05) |
| PSY-041 | No significant difference in assessment outcomes by gender | Compare mean evidence coverage rates across gender groups | No statistically significant difference (p > 0.05) |
| PSY-042 | Follow-up count not correlated with demographic variables | Regression: follow-up count ~ demographics + ability | Demographic coefficients not significant (p > 0.05) |
| PSY-043 | Time budget is adequate for all candidates | Compare completion rates across language backgrounds | Non-native speakers do not have significantly lower completion rates |
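
PSY-002, PSY-020, and PSY-021 all set thresholds on Cohen’s κ, which corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the chance agreement computed from each rater's label frequencies. The sketch below computes it with the standard library only. The label values come from PSY-020, while the ten sample ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented sample: the raters disagree on one of ten signals.
rater_a = ["covered"] * 5 + ["partial"] * 3 + ["absent"] * 2
rater_b = ["covered"] * 4 + ["partial"] * 4 + ["absent"] * 2
kappa = cohens_kappa(rater_a, rater_b)
```

With nine of ten labels matching, observed agreement is 0.9 and chance agreement is 0.36, giving κ of about 0.84, which would pass the 0.8 threshold.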

12.15.6 Face Validity and Candidate Experience

| ID | Test | Method | Expected |
| --- | --- | --- | --- |
| PSY-050 | Candidates perceive the assessment as testing what it claims | Post-exam survey: “This assessment tested my understanding of [topic]” | ≥80% agree or strongly agree |
| PSY-051 | Candidates find the AI examiner interaction natural | Post-exam survey: “The examiner’s questions felt natural and relevant” | ≥70% agree or strongly agree |
| PSY-052 | Anxiety levels are manageable | Pre/post anxiety survey (GAD-7 or equivalent) | Post-exam anxiety not significantly higher than pre-exam |

| ID | Test | Method | Expected |
| --- | --- | --- | --- |
| PSY-060 | Students adopt deeper learning strategies when oral exams are introduced | Pre/post survey of study strategies (surface vs deep approaches) | Shift toward deeper strategies |
| PSY-061 | Students value the oral format over written alternatives | Post-exam preference survey | ≥60% prefer oral format or find it equally valuable |

12.15.8 Psychometric Test Execution Protocol

  1. When to run: Before deploying a new exam package for summative assessment. Formative exams MAY skip psychometric validation.
  2. Minimum sample size: 30 candidate sessions for reliability tests; 50+ for fairness tests.
  3. Who runs: Assessment design team with psychometric support. Results documented in metadata.assessmentProfile.validityEvidence.
  4. What happens on failure: If any PSY test fails, the exam package MUST NOT be used for summative assessment until the issue is resolved.
  5. Re-validation triggers: Changing evidence targets, follow-up policies, time budgets, or examiner persona requires re-running affected PSY tests.

| Fixture | Description |
| --- | --- |
| ir-minimal.json | Opening → end (no questions) |
| ir-single-question.json | Opening → 1 question → closing → end |
| ir-two-questions.json | Opening → 2 questions → closing → end (§10 example) |
| ir-no-followups.json | Questions with maxFollowUps: 0 |
| ir-many-followups.json | Questions with maxFollowUps: 5 |
| ir-time-budget.json | Questions with aggressive time budgets |
| ir-with-commands.json | Full candidateCommands configuration |
| ir-with-evidence.json | Full evidenceTargets configuration |
| ir-invalid-*.json | Various invalid IRs for schema validation tests |

| Fixture | Description |
| --- | --- |
| transcript-normal.json | Complete, well-structured transcript |
| transcript-partial.json | Transcript with missing spans |
| transcript-overlapping.json | Transcript with duplicate/overlapping finals |
| transcript-empty-candidate.json | Candidate says nothing |
| transcript-long-rambling.json | Candidate gives very long answers |

| Fixture | Description |
| --- | --- |
| utterances-repeat.json | 50 variations of repeat requests |
| utterances-clarification.json | 50 variations of clarification requests |
| utterances-raise-hand.json | 30 variations of pause requests |
| utterances-answers.json | 100 normal answers across topics |
| utterances-adversarial.json | Edge cases: off-topic, silence, confusion |

| Stage | Tests Run | Gate? |
| --- | --- | --- |
| Pre-commit | Schema validation, unit tests | Block on failure |
| PR | All of above + adapter tests + contract tests | Block on failure |
| Merge to main | All of above + integration tests | Block on failure |
| Nightly | All of above + scripted simulation + adversarial | Alert on failure |
| Weekly | All of above + chaos + performance | Alert on failure |
| Pre-release | Full suite + regression for all published packages | Block on failure |

| Version | Date | Changes |
| --- | --- | --- |
| v0.2.0 | 2026-06-30 | Added test cases for new schema fields (anxietyMitigation, BloomLevel, etc.). Updated terminology from ‘Exam Runtime IR’ to ‘IOA-ORM’. |
| v0.1.0 | 2026-05-06 | Initial release. |