Draft · v0.2.0 · 2026-06-30
This chapter defines the comprehensive testing strategy for the IOA-ORM,
covering every layer from schema validation to adversarial candidate simulation.
Tests are organised by category, with specific test cases, expected outcomes,
and tooling recommendations.

Test pyramid, from apex to base:
- E2E / Chaos (few, slow, high confidence)
- Integration (moderate count, moderate speed)
- Contract / Adapter (many, fast)
- Unit / Schema / State (many, very fast)


| Layer | Count Target | Speed Target | Runs On |
|---|---|---|---|
| Unit / Schema | 200+ | <5s total | Every commit |
| State Machine | 50+ | <10s total | Every commit |
| Contract / Adapter | 80+ | <15s total | Every PR |
| Integration | 30+ | <2min total | Every PR |
| E2E / Scripted Simulation | 15+ | <10min total | Nightly |
| Adversarial | 20+ | <5min total | Nightly |
| Chaos | 5+ | <5min total | Weekly |
| Psychometric | 15+ | Pre-deployment | Per exam package |
Theoretical grounding: The psychometric testing track addresses a critical
gap identified by the review: the testing strategy validates technical
correctness but not assessment quality. Akimov & Malin (2020) evaluate oral
exams against a formal validity/reliability/fairness matrix with eight criteria.
The spec’s testing strategy must include analogous psychometric validation to
ensure that technically correct specification packages also produce valid, reliable, and
fair assessments (Hevner et al., 2004, Guideline 3: “the utility, quality, and
efficacy of a design artifact must be rigorously demonstrated”).
Purpose: Ensure every specification document conforms to the published JSON Schema.

| ID | Test | Expected |
|---|---|---|
| SV-001 | Valid minimal specification (opening → end) | Passes validation |
| SV-002 | Valid full specification (opening → 2 questions → closing → end) | Passes validation |
| SV-003 | Missing irVersion | Rejects with "irVersion is required" |
| SV-004 | Invalid irVersion format (e.g., “1.0”) | Rejects with "irVersion must be semver" |
| SV-005 | Node with nodeId containing spaces | Rejects with "nodeId must be [a-z0-9-]" |
| SV-006 | Duplicate nodeId across nodes | Rejects with "nodeId must be unique" |
| SV-007 | Transition target references non-existent nodeId | Rejects with "target nodeId not found" |
| SV-008 | maxFollowUps set to 0 on a question node | Passes (valid — no follow-ups allowed) |
| SV-009 | maxFollowUps set to -1 | Rejects with "maxFollowUps must be >= 0" |
| SV-010 | evidenceTargets array with duplicate IDs within one node | Rejects with "evidenceTarget ID must be unique within node" |
| SV-011 | evidenceTargets with missing level field | Rejects with "level is required" |
| SV-012 | candidateCommands with unknown command type | Rejects with "unknown command type" |
| SV-013 | transitionPolicy.conditions referencing undefined evidence target | Rejects with "evidence target not found" |
| SV-014 | timeBudgetSeconds set to 0 on a question node | Rejects with "timeBudgetSeconds must be > 0" |
| SV-015 | Node of type end with transitions defined | Rejects with "end node must not have transitions" |
| SV-016 | Node of type question without questionStem | Rejects with "questionStem is required for question nodes" |
| SV-017 | guardrails.forbidden contains unknown value | Rejects with "unknown forbidden value" |
| SV-018 | Specification with examinerPersona containing valid tone and style | Passes |
| SV-019 | Specification with metadata.language set to unsupported locale | Warns but does not reject (MAY support) |
| SV-020 | followUp.triggerCondition uses undefined variable | Rejects with "undefined variable in triggerCondition" |

- JSON Schema Draft 2020-12 with custom format validators.
- ajv (Node.js) or jsonschema (Python) for validation.
- CI: validate all fixture IRs on every commit.
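
Checks such as SV-006 and SV-007 relate one node to another, which JSON Schema keywords alone do not express well (`uniqueItems` compares whole array items, and there is no referential-integrity keyword). A common approach is a second pass after schema validation. The following is a minimal sketch of such a pass; the `nodes`, `transitions`, and `target` field names and the `check_references` helper are assumptions for illustration, not the published schema.

```python
from collections import Counter


def check_references(spec: dict) -> list[str]:
    """Return error messages for duplicate nodeIds and dangling transition targets."""
    # Assumed layout: {"nodes": [{"nodeId": str, "transitions": [{"target": str}]}]}
    nodes = spec.get("nodes", [])
    ids = [node.get("nodeId") for node in nodes]
    errors = []

    # SV-006: every nodeId must appear exactly once.
    for node_id, count in Counter(ids).items():
        if count > 1:
            errors.append(f"nodeId must be unique: {node_id}")

    # SV-007: every transition target must name an existing node.
    known = set(ids)
    for node in nodes:
        for transition in node.get("transitions", []):
            target = transition.get("target")
            if target not in known:
                errors.append(f"target nodeId not found: {target}")
    return errors


minimal = {
    "nodes": [
        {"nodeId": "opening", "transitions": [{"target": "end"}]},
        {"nodeId": "end"},
    ]
}
broken = {
    "nodes": [
        {"nodeId": "opening", "transitions": [{"target": "q9"}]},
        {"nodeId": "opening"},
    ]
}
print(check_references(minimal))
print(check_references(broken))
```

Returning a list rather than raising on the first problem lets a fixture test assert on every expected message at once.
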
Purpose: Test individual functions and modules in isolation.

| ID | Test | Expected |
|---|---|---|
| USM-001 | Initialise state for a question node | followUpCount = 0, timeElapsed = 0, evidenceCovered = [] |
| USM-002 | Increment follow-up counter | followUpCount increases by 1 |
| USM-003 | Follow-up counter at max — attempt increment | Counter stays at max; returns blocked status |
| USM-004 | Follow-up counter NEVER decrements | After increment, decrement is a no-op |
| USM-005 | Candidate repeat command — follow-up counter unchanged | followUpCount same before and after |
| USM-006 | Candidate clarification command — follow-up counter unchanged | followUpCount same before and after |
| USM-007 | Candidate raise_hand — timer pauses | timeElapsed stops increasing during pause |
| USM-008 | Timer resumes after pause duration | timeElapsed resumes from pre-pause value |
| USM-009 | raise_hand exceeds maxPerNode — command rejected | Returns max_commands_reached |
| USM-010 | Transition to allowed target | Returns approved |
| USM-011 | Transition to disallowed target | Returns blocked with reason |
| USM-012 | Transition when condition not yet satisfied | Returns blocked with unsatisfied conditions |
| USM-013 | Time budget warning at 80% | Emits time_budget_warning |
| USM-014 | Time budget exceeded at 100% | Emits time_budget_exceeded; triggers forced transition |
| USM-015 | State reconciliation — runtime and FlowManager agree | Returns consistent |
| USM-016 | State reconciliation — runtime and FlowManager disagree | Returns mismatch with details |
| USM-017 | Evidence covered set updated on signal | evidenceCovered includes new target ID |
| USM-018 | Duplicate evidence signal for same target | evidenceCovered still contains target once; latest confidence/rationale stored |
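
The counter and evidence rules in USM-001 to USM-006 and USM-017 to USM-018 can be pictured with a small state object. This is only a sketch: `NodeState` and its method names are hypothetical, and the real runtime state is likely richer.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """Hypothetical per-node runtime state; names are illustrative only."""
    max_follow_ups: int
    follow_up_count: int = 0
    evidence_covered: dict = field(default_factory=dict)

    def request_follow_up(self) -> str:
        # USM-003: at the cap the counter stays put and the caller is told so.
        if self.follow_up_count >= self.max_follow_ups:
            return "blocked"
        # USM-002: increment by one. No method decrements it (USM-004).
        self.follow_up_count += 1
        return "approved"

    def handle_command(self, command: str) -> str:
        # USM-005/006: repeat and clarification never touch the counter.
        return f"handled:{command}"

    def record_evidence(self, target_id: str, confidence: float, rationale: str) -> None:
        # USM-017/018: keyed by target, so a repeated signal replaces the stored
        # confidence and rationale instead of adding a second entry.
        self.evidence_covered[target_id] = {
            "confidence": confidence,
            "rationale": rationale,
        }


state = NodeState(max_follow_ups=2)
state.handle_command("repeat")
print([state.request_follow_up() for _ in range(3)], state.follow_up_count)
```

Keeping the increment behind a single method makes the "never decrements" property easy to assert, because there is no other write path to the counter.
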

| ID | Test | Expected |
|---|---|---|
| UCC-001 | “Can you repeat that?” → repeat | Classified as repeat with confidence > 0.9 |
| UCC-002 | “Sorry, say that again” → repeat | Classified as repeat |
| UCC-003 | “What do you mean by starvation?” → clarification | Classified as clarification |
| UCC-004 | “Could you explain that term?” → clarification | Classified as clarification |
| UCC-005 | “Can I have a moment?” → raise_hand | Classified as raise_hand |
| UCC-006 | “I need to think for a second” → raise_hand | Classified as raise_hand |
| UCC-007 | “Process scheduling is when the OS…” → answer | Classified as answer (not a command) |
| UCC-008 | “I think Round Robin is better because…” → answer | Classified as answer |
| UCC-009 | “Repeat? I mean, um, scheduling is…” → answer | Classified as answer (false-positive suppression) |
| UCC-010 | “Can you repeat the question and also explain what you mean?” → ambiguous | Returns top-2 with confidence; runtime uses primary |
| UCC-011 | Empty string / silence | Returns none |
| UCC-012 | “Stop the exam” → raise_hand or special | Classified as raise_hand (closest match); logged for review |

| ID | Test | Expected |
|---|---|---|
| UEV-001 | Emit node_entered with valid payload | Event persisted with correct schema |
| UEV-002 | Emit transcript_final with spanId | Span ID is unique, monotonic |
| UEV-003 | Emit evidence_signal with valid transcriptSpanId | Signal persisted, span exists in transcript store |
| UEV-004 | Emit evidence_signal with invalid transcriptSpanId | Signal rejected; error logged |
| UEV-005 | Emit exam_completed exactly once | Second emission is rejected |
| UEV-006 | Emit events out of order (node_exited before node_entered) | Rejected with invalid_sequence error |
| UEV-007 | Emit transition_decision with decision: "blocked" | Event persisted with reason |
| UEV-008 | Emit guardrail_violation | Event persisted; original text captured |
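
The sequencing rules in UEV-003 to UEV-006 amount to a few guards in front of the event store. The sketch below uses an in-memory list as a stand-in for the store; `EventLog`, `EventRejected`, and the payload keys are assumptions made for illustration.

```python
class EventRejected(Exception):
    """Raised when an event violates a sequencing or reference rule."""


class EventLog:
    """Hypothetical in-memory stand-in for the event store."""

    def __init__(self, known_span_ids):
        self.events = []
        self.open_nodes = set()
        self.completed = False
        self.known_span_ids = set(known_span_ids)

    def emit(self, event_type, **payload):
        # UEV-005: nothing is accepted once exam_completed has been recorded,
        # including a second exam_completed.
        if self.completed:
            raise EventRejected("exam_completed is final")
        if event_type == "node_entered":
            self.open_nodes.add(payload["nodeId"])
        elif event_type == "node_exited":
            # UEV-006: a node cannot be exited before it was entered.
            if payload["nodeId"] not in self.open_nodes:
                raise EventRejected("invalid_sequence")
            self.open_nodes.remove(payload["nodeId"])
        elif event_type == "evidence_signal":
            # UEV-003/004: the referenced span must exist in the transcript store.
            if payload["transcriptSpanId"] not in self.known_span_ids:
                raise EventRejected("unknown transcriptSpanId")
        elif event_type == "exam_completed":
            self.completed = True
        self.events.append({"type": event_type, **payload})


log = EventLog(known_span_ids=["span-1"])
log.emit("node_entered", nodeId="q1")
log.emit("evidence_signal", transcriptSpanId="span-1")
log.emit("node_exited", nodeId="q1")
log.emit("exam_completed")
print(len(log.events))
```

Rejected events are never appended, so a test can assert both on the raised error and on the unchanged length of `events`.
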

| ID | Test | Expected |
|---|---|---|
| UGR-001 | LLM output contains “rubric” | Blocked; guardrail_violation emitted |
| UGR-002 | LLM output contains “key criteria for this question” | Blocked (rubric reveal variant) |
| UGR-003 | LLM output says “you scored 8 out of 10” | Blocked (reveal_score) |
| UGR-004 | LLM output suggests “you might want to mention X” | Blocked (suggest_answer) |
| UGR-005 | LLM output discusses “exam format policy” | Blocked (forbidden_topics) |
| UGR-006 | LLM output is clean, on-topic response | Passes; no violation |
| UGR-007 | LLM output attempts transition to closing from q1 | Blocked (unauthorized_transition) |
| UGR-008 | LLM output contains rubric text embedded in a story | Blocked (semantic detection, not just keyword) |
| UGR-009 | LLM output contains grading threshold numbers | Blocked (forbidden_topics) |
| UGR-010 | LLM output provides a helpful clarification | Passes (clarification is allowed within guardrails) |
Purpose: Verify that the specification-to-Pipecat adapter produces correct
FlowManager configurations.

| ID | Test | Expected |
|---|---|---|
| PAD-001 | Opening node → Pipecat config | task_messages contains system + assistant prompt; edges has one target |
| PAD-002 | Question node with 2 follow-ups → Pipecat config | System prompt mentions follow-up guidance; runtime_config.maxFollowUps = 2 |
| PAD-003 | Question node with maxFollowUps: 0 → Pipecat config | System prompt says “do not ask follow-ups”; no follow-up guidance in config |
| PAD-004 | End node → Pipecat config | No task_messages; post_actions includes exam_completed event |
| PAD-005 | Node with time budget → Pipecat config | runtime_config.timeBudgetSeconds present |
| PAD-006 | Node with evidence targets → Pipecat config | runtime_config.evidenceTargets array present |
| PAD-007 | Node with guardrails.forbidden → Pipecat config | System prompt includes forbidden action list |
| PAD-008 | Node with candidateCommands → Pipecat config | System prompt includes command handling instructions |
| PAD-009 | Adapter preserves node order | initial_node matches specification’s first node; edge order matches |
| PAD-010 | Adapter rejects invalid specification | Returns clear error, does not produce partial config |
| PAD-011 | Adapter round-trip: IR → Pipecat config → (mock) execution | Execution produces expected events |
| PAD-012 | Adapter handles specification with no examinerPersona | System prompt uses default persona |
| PAD-013 | Adapter handles specification with overrunPolicy: "warn_at_80pct_hard_at_100pct" | Runtime config includes overrun settings |
Purpose: Run a complete exam session with a scripted candidate (no LLM) to
verify end-to-end behaviour.
Approach: A “candidate simulator” plays predetermined utterances at
predetermined times. The runtime, events, evidence ledger, and marking input
are all verified against expected outcomes.

| Step | Candidate Action | Expected Runtime Behaviour |
|---|---|---|
| 1 | (opening plays) | node_entered(opening) → node_exited(opening) → node_entered(q1) |
| 2 | Answers Q1 well | evidence_signal(ev-q1-scheduling-concept, covered) |
| 3 | Follow-up 1 asked | node_progress(followUpCount: 1) |
| 4 | Answers follow-up 1 | evidence_signal(ev-q1-preemptive-cooperative, covered) |
| 5 | Follow-up 2 asked | node_progress(followUpCount: 2) |
| 6 | Answers follow-up 2 | evidence_signal(ev-q1-context-switch, covered) |
| 7 | Runtime approves move to q2 | transition_decision(move_to_next_node) |
| 8 | Q2 stem plays | node_entered(q2) |
| 9 | Answers Q2 partially | evidence_signal(ev-q2-algorithm-choice, covered) |
| 10 | Follow-up 1 asked | node_progress(followUpCount: 1) |
| 11 | Answers follow-up 1 | evidence_signal(ev-q2-starvation, covered) |
| 12 | Runtime decides sufficient evidence | transition_decision(move_to_next_node) |
| 13 | Closing plays | node_entered(closing) → node_exited(closing) → node_entered(end) |
| 14 | (end) | exam_completed emitted; marking input assembled |
Verify: All events present and in order. Evidence ledger has 5/6 covered.
Marking input is complete.
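
One way to phrase the "all events present and in order" check is as an ordered-subsequence test: the recorded stream may contain extra events, but the expected ones must appear in the given order. The sketch below assumes events are reduced to `(type, detail)` tuples; that shape and the `first_missing` helper are illustrative only.

```python
def first_missing(expected, actual):
    """Return the first expected event not found in order, or None if all match."""
    remaining = iter(actual)
    for wanted in expected:
        # any() consumes `remaining` up to and including the first match,
        # so later expectations can only match later events.
        if not any(event == wanted for event in remaining):
            return wanted
    return None


# Event names and IDs follow the happy-path table; the tuple shape is assumed.
expected = [
    ("node_entered", "opening"),
    ("node_exited", "opening"),
    ("node_entered", "q1"),
    ("evidence_signal", "ev-q1-scheduling-concept"),
    ("transition_decision", "move_to_next_node"),
    ("node_entered", "q2"),
]
recorded = [
    ("bot_ready", None),
    ("node_entered", "opening"),
    ("node_exited", "opening"),
    ("node_entered", "q1"),
    ("transcript_final", "span-1"),
    ("evidence_signal", "ev-q1-scheduling-concept"),
    ("node_progress", "followUpCount: 1"),
    ("transition_decision", "move_to_next_node"),
    ("node_entered", "q2"),
]
print(first_missing(expected, recorded))
```

Returning the first missing event, rather than a boolean, gives a failing simulation run a direct pointer to the step that went wrong.
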

| Step | Candidate Action | Expected |
|---|---|---|
| 1 | “Can you repeat that?” after Q1 stem | candidate_command(repeat), followUpCount unchanged |
| 2 | “What do you mean by scheduling?” | candidate_command(clarification), LLM clarifies, followUpCount unchanged |
| 3 | “I need a moment” before answering | candidate_command(raise_hand), timer pauses for 10s |
| 4 | Answers normally after pause | followUpCount at 0 (none of the commands counted) |

| Step | Candidate Action | Expected |
|---|---|---|
| 1 | Candidate gives slow, rambling answers | Timer approaches 80% of q2 budget |
| 2 | (automatic) | time_budget_warning emitted |
| 3 | Candidate continues | Timer reaches 100% |
| 4 | (automatic) | time_budget_exceeded emitted |
| 5 | (automatic) | transition_decision(move_to_next_node, reason: "time budget exceeded") |
| 6 | LLM delivers graceful bridge | “We’re running short on time, so let’s move on” |
| 7 | Closing plays | node_entered(closing) |
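
The warning and overrun steps above, together with the pause behaviour in USM-007 and USM-008, can be sketched as a timer that is advanced explicitly by the test instead of reading a real clock. `TimeBudget` and its attributes are hypothetical names, and the 80% threshold is taken from USM-013.

```python
class TimeBudget:
    """Hypothetical per-node timer driven by explicit ticks (no wall clock)."""

    def __init__(self, budget_seconds, warn_fraction=0.8):
        self.budget = budget_seconds
        self.warn_at = budget_seconds * warn_fraction
        self.elapsed = 0.0
        self.paused = False
        self._warned = False
        self._exceeded = False

    def tick(self, seconds):
        """Advance the clock and return the events newly triggered by this tick."""
        # While paused (raise_hand) or after overrun, elapsed time does not move.
        if self.paused or self._exceeded:
            return []
        self.elapsed += seconds
        events = []
        if not self._warned and self.elapsed >= self.warn_at:
            self._warned = True
            events.append("time_budget_warning")
        if self.elapsed >= self.budget:
            self._exceeded = True
            events.append("time_budget_exceeded")
        return events


timer = TimeBudget(budget_seconds=100)
print(timer.tick(79), timer.tick(1), timer.tick(20))
```

Driving the timer with explicit ticks keeps the scenario deterministic and lets the whole table run in well under the unit-test time target.
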
Purpose: Verify that the runtime handles hostile, confused, or edge-case
candidate behaviour without breaking assessment integrity.

| ID | Test | Expected |
|---|---|---|
| ADV-001 | After 2 follow-ups, LLM generates a third follow-up question | Runtime blocks the follow-up; injects “move to next node” instruction |
| ADV-002 | LLM generates third follow-up phrased as a statement | Runtime detects follow-up intent; blocks |
| ADV-003 | LLM embeds a follow-up inside a transition sentence | Runtime detects; blocks and forces transition |

| ID | Test | Expected |
|---|---|---|
| ADV-004 | Candidate asks to repeat 10 times in a row | First 3 succeed (maxPerNode: 3); remaining 7 rejected with polite message |
| ADV-005 | Candidate asks for clarification on a term that IS the answer | LLM provides a safe explanation without revealing the answer; guardrail passes |
| ADV-006 | Candidate says “raise hand” then immediately starts answering | Timer paused for 10s regardless; answer recorded after resume |
| ADV-007 | Candidate sends command via data channel while LLM is speaking | Command queued; processed after current utterance completes |

| ID | Test | Expected |
|---|---|---|
| ADV-008 | Candidate stays silent for 16 seconds (threshold is 15s) | silenceAction: "gentle_prompt" triggered — LLM says “Take your time” or similar |
| ADV-009 | Candidate stays silent for 45 seconds (3 gentle prompts) | Runtime escalates: emits candidate_silence_extended; LLM says “Would you like me to move on?” |
| ADV-010 | Candidate gives an off-topic answer (“I like pizza”) | LLM gently redirects: “That’s interesting, but let’s focus on the question about scheduling.” No follow-up counted. |
| ADV-011 | Candidate gives a partially on-topic answer | LLM asks a follow-up to fill the gap; follow-up counted normally |
| ADV-012 | Candidate says “I don’t know” | LLM offers a gentle hint within guardrails; follow-up counted |

| ID | Test | Expected |
|---|---|---|
| ADV-013 | STT partial arrives but final never comes | After timeout (5s), runtime treats last partial as final with degraded confidence flag |
| ADV-014 | STT produces two overlapping finals for the same utterance | Deduplicated by span ID; second one ignored |
| ADV-015 | STT produces empty final (candidate mumbled) | Emitted as transcript_final with empty text; LLM handles gracefully |
| ADV-016 | STT garbles text (low-confidence transcription) | transcript_final includes confidence field; low confidence logged for review |
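
ADV-013 and ADV-014 describe two small rules: promote a stale partial to a degraded final, and ignore a second final for a span that already has one. The sketch below is one possible shape for that logic. `TranscriptAssembler` is a hypothetical name, time is passed in explicitly, and ignoring a real final that arrives after promotion is an assumption rather than stated behaviour.

```python
class TranscriptAssembler:
    """Hypothetical collector for STT partials and finals, keyed by span ID."""

    def __init__(self, partial_timeout=5.0):
        self.partial_timeout = partial_timeout
        self.finals = {}   # span_id -> {"text", "confidence", "degraded"}
        self.pending = {}  # span_id -> (latest partial text, time last seen)

    def on_partial(self, span_id, text, now):
        if span_id not in self.finals:
            self.pending[span_id] = (text, now)

    def on_final(self, span_id, text, confidence):
        """Store a final; return False if the span already has one (ADV-014)."""
        self.pending.pop(span_id, None)
        if span_id in self.finals:
            return False
        self.finals[span_id] = {"text": text, "confidence": confidence, "degraded": False}
        return True

    def flush_stale(self, now):
        """Promote partials older than the timeout to degraded finals (ADV-013)."""
        promoted = []
        for span_id, (text, seen) in list(self.pending.items()):
            if now - seen >= self.partial_timeout:
                del self.pending[span_id]
                self.finals[span_id] = {"text": text, "confidence": None, "degraded": True}
                promoted.append(span_id)
        return promoted


assembler = TranscriptAssembler()
assembler.on_partial("span-1", "process sched", now=0.0)
print(assembler.flush_stale(now=4.0), assembler.flush_stale(now=5.0))
```

An empty final (ADV-015) passes through `on_final` unchanged, since the text is stored as given.
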

| ID | Test | Expected |
|---|---|---|
| ADV-017 | LLM attempts to jump from q1 directly to closing | Runtime blocks; guardrail_violation(unauthorized_transition) emitted |
| ADV-018 | LLM attempts to jump from q1 to q2 (correct) without runtime approval | Runtime intercepts; approves only after transition condition check |
| ADV-019 | LLM attempts to loop back to q1 from q2 | Runtime blocks; q1 is not in q2’s allowedTargets |
| ADV-020 | LLM attempts to end exam mid-question | Runtime blocks; forces continuation or graceful time-budget handling |

| ID | Test | Expected |
|---|---|---|
| ADV-021 | Evidence signal references invalid span ID | Signal rejected; error logged; does not corrupt ledger |
| ADV-022 | Evidence signal claims “covered” but transcript excerpt contradicts | Confidence check: if excerpt is empty or contradictory, signal downgraded to “uncertain” |
| ADV-023 | Multiple evidence signals for same target with increasing confidence | Latest signal overwrites earlier; confidence increases |
| ADV-024 | Evidence signal arrives after exam_completed | Rejected; exam_completed is final |
Purpose: Ensure the transcript and evidence ledger are internally consistent
and can be independently verified.

| ID | Test | Expected |
|---|---|---|
| TEC-001 | Every transcriptSpanId referenced in evidence ledger exists in transcript store | 100% match |
| TEC-002 | Every evidence target marked “covered” has at least one transcript span | 100% match |
| TEC-003 | Evidence “not_covered” has zero transcript spans (or empty excerpt) | Consistent |
| TEC-004 | Transcript is complete — every node_entered has corresponding transcript spans | No gaps |
| TEC-005 | Transcript order matches event order | Monotonic timestamps |
| TEC-006 | Evidence ledger is complete — every evidenceTarget in specification has a ledger entry | 100% coverage (even if “not_covered”) |
| TEC-007 | Evidence confidence values are in [0, 1] range | All valid |
| TEC-008 | Evidence rationale is non-empty for all “covered” entries | Present |
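
Several of the TEC checks can run as a single pass over the ledger once a session has finished. The function below is a sketch under assumed data shapes: the ledger is a dict keyed by evidence target ID, and the `status`, `spanIds`, `confidence`, and `rationale` keys are hypothetical names rather than the published ledger schema.

```python
def check_ledger(spec_targets, ledger, transcript_span_ids):
    """Return a list of consistency problems; an empty list means the ledger is clean."""
    spans = set(transcript_span_ids)
    problems = []

    # TEC-006: every target declared in the specification needs a ledger entry.
    for target in spec_targets:
        if target not in ledger:
            problems.append(f"{target}: no ledger entry")

    for target, entry in ledger.items():
        refs = entry.get("spanIds", [])
        # TEC-001: every referenced span must exist in the transcript store.
        for span_id in refs:
            if span_id not in spans:
                problems.append(f"{target}: unknown span {span_id}")
        if entry.get("status") == "covered":
            # TEC-002 and TEC-008: covered entries need a span and a rationale.
            if not refs:
                problems.append(f"{target}: covered without a span")
            if not entry.get("rationale"):
                problems.append(f"{target}: covered without rationale")
        # TEC-007: confidence, when present, must lie in [0, 1].
        confidence = entry.get("confidence")
        if confidence is not None and not 0.0 <= confidence <= 1.0:
            problems.append(f"{target}: confidence out of range")
    return problems


ledger = {
    "ev-q1-scheduling-concept": {
        "status": "covered", "spanIds": ["span-1"],
        "confidence": 0.9, "rationale": "Defines scheduling correctly",
    },
    "ev-q2-starvation": {"status": "not_covered", "spanIds": [], "confidence": 0.0},
}
print(check_ledger(list(ledger), ledger, ["span-1", "span-2"]))
```
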
Purpose: Test the full flow from candidate utterance through command
classification to runtime action and event emission.

| ID | Test | Expected |
|---|---|---|
| CCI-001 | Candidate says “repeat” during Q1 stem | Command classified → re-prompt issued → candidate_command event → followUpCount unchanged |
| CCI-002 | Candidate says “what does that mean?” during follow-up | Command classified as clarification → LLM clarifies → candidate_command event → followUpCount unchanged |
| CCI-003 | Candidate says “give me a second” mid-answer | Command classified as raise_hand → timer pauses → time_budget_paused event → resumes after 10s |
| CCI-004 | Candidate sends raw data channel command { command: "repeat" } | Processed same as spoken command; event emitted |
| CCI-005 | Candidate sends malformed data channel command | Rejected with error; logged; no crash |
| CCI-006 | Candidate uses command during opening (non-question node) | Command rejected (or handled gracefully) — e.g., repeat works, raise_hand is ignored |
Purpose: Test guardrails in context — LLM generation → guardrail check →
block/allow → event emission.

| ID | Test | Expected |
|---|---|---|
| GI-001 | LLM generates rubric-revealing response during Q1 | Blocked; guardrail_violation event; LLM regenerates clean response |
| GI-002 | LLM generates unauthorised transition text from q1 | Blocked; guardrail_violation; runtime forces q2 transition |
| GI-003 | LLM generates a helpful clarification (allowed) | Passes; no violation event |
| GI-004 | LLM generates response that hints at score (“about 70%”) | Blocked; guardrail_violation(reveal_score) |
| GI-005 | LLM generates response discussing exam logistics | Blocked; guardrail_violation(forbidden_topics) |
| GI-006 | Guardrail check latency | MUST complete in <50ms; log if exceeds |
Purpose: Ensure existing published packages continue to work through every
phase of the migration.

| ID | Test | Expected |
|---|---|---|
| REG-001 | Package published with flowJson v1 (pre-specification) — Phase 1 | Events emitted; transcript persisted; no candidate-facing change |
| REG-002 | Package published with flowJson v1 — Phase 2 | Commands not configured; runtime state tracked but commands don’t activate |
| REG-003 | Package published with flowJson v1 — Phase 3 | No evidence targets; evidence ledger is empty; marking uses raw transcript |
| REG-004 | Package published with flowJson v1 — Phase 4 | No transition policy; LLM decides transitions (current behaviour) |
| REG-005 | Package published with flowJson v1 — Phase 5 | Auto-migrated to specification v1.0.0; verify Pipecat adapter output matches original flowJson |
| REG-006 | Package published with specification v1.0.0 — after Phase 5 | Full specification pipeline; all features available |
| REG-007 | Package published with specification v1.1.0 (future) — loaded by runtime v1.0 | Backward compatible; new fields ignored; warning logged |
| REG-008 | Package published with specification v2.0.0 (breaking) — loaded by runtime v1.0 | Rejected with clear error: “specification version 2.0.0 requires runtime >= 2.0.0” |
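
REG-007 and REG-008 imply a version gate at load time: a newer minor version is accepted with a warning, and a newer major version is rejected. The sketch below shows one way to express that. `parse_version` and `check_compatibility` are hypothetical helpers, pre-release tags are not handled, and treating every other combination as compatible is an assumption the table does not cover.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse 'v1.0' or '1.1.0' into (major, minor, patch), padding with zeros."""
    parts = [int(piece) for piece in text.lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))
    return parts[0], parts[1], parts[2]


def check_compatibility(spec_version: str, runtime_version: str) -> str:
    spec = parse_version(spec_version)
    runtime = parse_version(runtime_version)
    # REG-008: a newer major version is a breaking change and is rejected.
    if spec[0] > runtime[0]:
        raise ValueError(
            f"specification version {spec_version} requires runtime >= {spec[0]}.0.0"
        )
    # REG-007: a newer minor version loads, but unknown fields are ignored
    # and the caller is expected to log a warning.
    if spec[0] == runtime[0] and spec[1] > runtime[1]:
        return "warn_unknown_fields"
    # Assumption: everything else is treated as compatible in this sketch.
    return "ok"


print(check_compatibility("1.0.0", "1.0"))
print(check_compatibility("1.1.0", "1.0"))
```
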
Purpose: Verify that the frontend correctly consumes and renders all event
types.

| ID | Test | Expected |
|---|---|---|
| UIC-001 | bot_ready received | UI shows “Connected” status |
| UIC-002 | node_entered(q1) received | UI updates to “Question 1” display |
| UIC-003 | node_progress(followUpCount: 1, maxFollowUps: 2) received | UI shows “Follow-up 1 of 2” |
| UIC-004 | node_progress(evidenceCovered: ["ev-q1-scheduling-concept"]) received | UI updates evidence progress indicator |
| UIC-005 | time_budget_warning received | UI shows time warning (e.g., yellow indicator) |
| UIC-006 | time_budget_exceeded received | UI shows time exceeded (e.g., red indicator) |
| UIC-007 | candidate_command(repeat) sent via UI button | Data channel message sent; command acknowledged |
| UIC-008 | candidate_command(raise_hand) sent via UI button | Timer pause acknowledged; UI shows “Paused” |
| UIC-009 | exam_completed received | UI shows “Assessment Complete” screen |
| UIC-010 | guardrail_violation event (admin view only) | Admin UI shows violation details; candidate UI unaffected |
| UIC-011 | Unknown event type received | UI ignores gracefully; logs warning |
| UIC-012 | Events arrive out of order | UI reorders by timestamp before rendering |
Purpose: Verify the marking input package is complete, correct, and
consumable by the marking pipeline.

| ID | Test | Expected |
|---|---|---|
| MRI-001 | Marking input contains all evidence ledger entries | Count matches specification’s evidenceTargets count |
| MRI-002 | Marking input contains full transcript | All transcript_final spans present |
| MRI-003 | Marking input contains runtime audit | nodesVisited, followUpsUsed, transitionDecisions all present |
| MRI-004 | Marking input contains specification snapshot | Frozen copy of specification used for session |
| MRI-005 | Marking input for exam with no evidence targets | Evidence ledger empty; transcript and audit still present |
| MRI-006 | Marking input for exam that ended early (timeout) | Partial evidence; not_covered entries for missing targets |
| MRI-007 | Marking input for exam with guardrail violations | Violations listed in runtime audit |
| MRI-008 | Marking input for exam with candidate commands | Commands listed in runtime audit |
| MRI-009 | Marking input schema validation | Passes JSON Schema for marking input |
| MRI-010 | Marking input version compatibility | inputVersion matches expected version |
Purpose: Verify the system handles failures gracefully.

| ID | Test | Expected |
|---|---|---|
| CHA-001 | Bot crashes mid-question | exam_completed fires from guaranteed hook; partial transcript preserved |
| CHA-002 | STT service disconnects during candidate answer | Last partial treated as final (degraded); session continues after reconnect |
| CHA-003 | Event store unavailable | Events queued in memory; flushed on recovery; session does not block |
| CHA-004 | Candidate disconnects and reconnects | Session resumes from current node; state preserved |
| CHA-005 | LLM service timeout during follow-up | Retry once; if still fails, skip follow-up and move to next node |
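
CHA-005 describes a retry-once-then-skip policy. A chaos test can exercise it by injecting a callable that times out on demand, as in the sketch below. `follow_up_or_skip` and the returned tuples are hypothetical, and the built-in `TimeoutError` stands in for whatever error the real LLM client raises.

```python
def follow_up_or_skip(generate, retries=1):
    """Call generate(); on timeout retry up to `retries` times, then give up."""
    for _ in range(retries + 1):
        try:
            return ("follow_up", generate())
        except TimeoutError:
            continue
    # Both attempts timed out: skip the follow-up and move on (CHA-005).
    return ("move_to_next_node", None)


def make_flaky(failures):
    """Build a fake LLM call that times out `failures` times, then answers."""
    state = {"calls": 0}

    def generate():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TimeoutError("simulated LLM timeout")
        return "Why does Round Robin avoid starvation?"

    generate.state = state
    return generate


print(follow_up_or_skip(make_flaky(failures=1)))
print(follow_up_or_skip(make_flaky(failures=5)))
```

Injecting the failure through a callable keeps the test independent of any particular LLM client or network fault injector.
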

| ID | Test | Target |
|---|---|---|
| PERF-001 | Specification schema validation latency | <10ms per specification |
| PERF-002 | Pipecat adapter compilation latency | <500ms per specification |
| PERF-003 | Command classification latency | <200ms per utterance |
| PERF-004 | Guardrail check latency | <50ms per LLM response |
| PERF-005 | Event emission throughput | >1000 events/sec to event store |
| PERF-006 | Evidence signal generation latency | <2s (async, does not block dialogue) |
| PERF-007 | 100 concurrent exam sessions | No degradation in event latency or dialogue responsiveness |
| PERF-008 | Marking input assembly latency | <5s per exam |
Purpose: Validate that specification packages produce assessments that are not only
technically correct but also psychometrically sound — valid, reliable, and fair.
This track addresses the “psychometrically blind” gap identified in the
specification review.
Theoretical grounding: Akimov & Malin (2020) evaluate their oral exam
against a validity/reliability/fairness matrix with eight criteria: face
validity, content validity, construct validity, concurrent validity, inter-item
consistency, inter-case reliability, inter-rater reliability, and fairness.
Joughin (1998) warns that “reliability is threatened when examiners are poorly
prepared” (p. 376) and when “interaction tends towards the dialogue pole”
(p. 376). Fenton (2025) notes that “careful preparation is recommended to
avoid any bias” and identifies risks around “gender, ethnicity, language
skills, speed of answering, and subjective grading.”
These tests are not run on every commit. They are run per exam package before
deployment to high-stakes summative assessment. Low-stakes or formative exams
MAY skip psychometric validation.

| ID | Test | Method | Expected |
|---|---|---|---|
| PSY-001 | Evidence targets align with declared learning outcomes | Expert panel review (≥3 subject matter experts) | ≥90% agreement that targets cover declared LOs |
| PSY-002 | Question stem elicits the intended cognitive level | Bloom’s taxonomy classification by 2+ independent raters | Inter-rater agreement ≥ 0.8 (Cohen’s κ) |
| PSY-003 | Follow-up questions probe deeper understanding, not just recall | Expert review of follow-up bank against Bloom’s levels | Follow-ups escalate at least one Bloom level from stem |
| PSY-004 | Assessment covers the full scope of the learning outcomes | Content coverage map: LO → evidence target → node | Every declared LO has ≥1 evidence target mapped |

| ID | Test | Method | Expected |
|---|---|---|---|
| PSY-010 | Specification evidence signals correlate with independent measures of the same construct | Correlation between specification evidence signals and independent written exam scores on same topics | Moderate positive correlation (r > 0.4) |
| PSY-011 | Different evidence targets measure different constructs | Factor analysis of evidence signal patterns across ≥50 sessions | Evidence targets cluster into distinct factors matching declared content types |
| PSY-012 | Transversal skills (communication, critical thinking) are distinguishable from content knowledge | Partial correlation: transversal signals vs content signals controlling for overall ability | Transversal and content signals are not perfectly correlated (r < 0.9) |

| ID | Test | Method | Expected |
|---|---|---|---|
| PSY-020 | AI examiner produces consistent evidence signals across sessions | Two independent LLM instances assess the same transcript (≥30 transcripts) | Cohen’s κ ≥ 0.8 for signal classification (covered/partial/absent) |
| PSY-021 | AI examiner agrees with human marker on evidence signals | AI-generated signals vs human-annotated signals on same transcripts | Cohen’s κ ≥ 0.75 |
| PSY-022 | Confidence calibration: 0.8-confidence signals are correct ~80% of the time | Binomial test: proportion of correct signals at each confidence level | Proportion within ±10% of declared confidence |
| PSY-023 | Confidence drift detection | Monitor average confidence over session duration; test for monotonic trend | No significant drift (p > 0.05, Mann-Kendall test) |
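
PSY-020 and PSY-021 both rest on Cohen's κ, defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between two raters and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below computes it for two label sequences; the function name and the sample labels are illustrative, and a statistics package would normally be preferred for real analyses.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # p_o: share of items on which the two raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement from the product of each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels only, using the categories named in PSY-020.
ai_signals = ["covered", "covered", "partial", "absent"]
human_signals = ["covered", "partial", "partial", "absent"]
print(round(cohens_kappa(ai_signals, human_signals), 3))
```

In this example the raters agree on 3 of 4 items (p_o = 0.75) and chance agreement is 5/16, giving κ = 7/11, which falls short of both thresholds in the table.
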

| ID | Test | Method | Expected |
|---|---|---|---|
| PSY-030 | Candidates receiving different follow-up paths get comparable evidence opportunities | Compare evidence coverage rates across candidates on same node | Coverage rate variance < 15% |
| PSY-031 | Follow-up type distribution is consistent across candidates | Chi-squared test on follow-up type distribution across ≥30 sessions | No significant deviation from expected distribution (p > 0.05) |
| PSY-032 | Conversation path variance does not correlate with final scores | Correlation between path variance metric and evidence coverage | Weak or no correlation (r between −0.3 and 0.3) |

| ID | Test | Method | Expected |
|---|---|---|---|
| PSY-040 | No significant difference in evidence signal accuracy by language background | Compare AI marker agreement with human marker for native vs non-native speakers | No statistically significant difference (p > 0.05) |
| PSY-041 | No significant difference in assessment outcomes by gender | Compare mean evidence coverage rates across gender groups | No statistically significant difference (p > 0.05) |
| PSY-042 | Follow-up count not correlated with demographic variables | Regression: follow-up count ~ demographics + ability | Demographic coefficients not significant (p > 0.05) |
| PSY-043 | Time budget is adequate for all candidates | Compare completion rates across language backgrounds | Non-native speakers do not have significantly lower completion rates |

| ID | Test | Method | Expected |
|---|---|---|---|
| PSY-050 | Candidates perceive the assessment as testing what it claims | Post-exam survey: “This assessment tested my understanding of [topic]” | ≥80% agree or strongly agree |
| PSY-051 | Candidates find the AI examiner interaction natural | Post-exam survey: “The examiner’s questions felt natural and relevant” | ≥70% agree or strongly agree |
| PSY-052 | Anxiety levels are manageable | Pre/post anxiety survey (GAD-7 or equivalent) | Post-exam anxiety not significantly higher than pre-exam |

| ID | Test | Method | Expected |
|---|---|---|---|
| PSY-060 | Students adopt deeper learning strategies when oral exams are introduced | Pre/post survey of study strategies (surface vs deep approaches) | Shift toward deeper strategies |
| PSY-061 | Students value the oral format over written alternatives | Post-exam preference survey | ≥60% prefer oral format or find it equally valuable |
- When to run: Before deploying a new exam package for summative
assessment. Formative exams MAY skip psychometric validation.
- Minimum sample size: 30 candidate sessions for reliability tests;
50+ for fairness tests.
- Who runs: Assessment design team with psychometric support. Results
documented in metadata.assessmentProfile.validityEvidence.
- What happens on failure: If any PSY test fails, the exam package
MUST NOT be used for summative assessment until the issue is resolved.
- Re-validation triggers: Changing evidence targets, follow-up policies,
time budgets, or examiner persona requires re-running affected PSY tests.

| Fixture | Description |
|---|---|
| ir-minimal.json | Opening → end (no questions) |
| ir-single-question.json | Opening → 1 question → closing → end |
| ir-two-questions.json | Opening → 2 questions → closing → end (§10 example) |
| ir-no-followups.json | Questions with maxFollowUps: 0 |
| ir-many-followups.json | Questions with maxFollowUps: 5 |
| ir-time-budget.json | Questions with aggressive time budgets |
| ir-with-commands.json | Full candidateCommands configuration |
| ir-with-evidence.json | Full evidenceTargets configuration |
| ir-invalid-\*.json | Various invalid IRs for schema validation tests |

| Fixture | Description |
|---|---|
| transcript-normal.json | Complete, well-structured transcript |
| transcript-partial.json | Transcript with missing spans |
| transcript-overlapping.json | Transcript with duplicate/overlapping finals |
| transcript-empty-candidate.json | Candidate says nothing |
| transcript-long-rambling.json | Candidate gives very long answers |

| Fixture | Description |
|---|---|
| utterances-repeat.json | 50 variations of repeat requests |
| utterances-clarification.json | 50 variations of clarification requests |
| utterances-raise-hand.json | 30 variations of pause requests |
| utterances-answers.json | 100 normal answers across topics |
| utterances-adversarial.json | Edge cases: off-topic, silence, confusion |

| Stage | Tests Run | Gate? |
|---|---|---|
| Pre-commit | Schema validation, unit tests | Block on failure |
| PR | All of above + adapter tests + contract tests | Block on failure |
| Merge to main | All of above + integration tests | Block on failure |
| Nightly | All of above + scripted simulation + adversarial | Alert on failure |
| Weekly | All of above + chaos + performance | Alert on failure |
| Pre-release | Full suite + regression for all published packages | Block on failure |

| Version | Date | Changes |
|---|---|---|
| v0.2.0 | 2026-06-30 | Added test cases for new schema fields (anxietyMitigation, BloomLevel, etc.). Updated terminology from ‘Exam Runtime IR’ to ‘IOA-ORM’. |
| v0.1.0 | 2026-05-06 | Initial release. |