Testing Strategy

Status

Draft · v0.2.0 · 2026-06-30

This chapter defines the comprehensive testing strategy for the IOA-ORM, covering every layer from schema validation to adversarial candidate simulation. Tests are organised by category, with specific test cases, expected outcomes, and tooling recommendations.

12.1 Test Pyramid Overview

         ╱  E2E / Chaos  ╲           (few, slow, high confidence)
        ╱  Integration     ╲          (moderate count, moderate speed)
       ╱  Contract / Adapter ╲        (many, fast)
      ╱  Unit / Schema / State ╲      (many, very fast)
     ╱______________________________╲

Layer	Count Target	Speed Target	Runs On
Unit / Schema	200+	<5s total	Every commit
State Machine	50+	<10s total	Every commit
Contract / Adapter	80+	<15s total	Every PR
Integration	30+	<2min total	Every PR
E2E / Scripted Simulation	15+	<10min total	Nightly
Adversarial	20+	<5min total	Nightly
Chaos	5+	<5min total	Weekly
Psychometric	15+	n/a (not time-boxed)	Pre-deployment, per exam package

Theoretical grounding: The psychometric testing track addresses a critical gap identified by the specification review: the testing strategy validated technical correctness but not assessment quality. Akimov & Malin (2020) evaluate oral exams against a formal validity/reliability/fairness matrix with eight criteria. The spec’s testing strategy must include analogous psychometric validation to ensure that technically correct specification packages also produce valid, reliable, and fair assessments (Hevner et al., 2004, Guideline 3: “the utility, quality, and efficacy of a design artifact must be rigorously demonstrated”).

12.2 Schema Validation Tests

Purpose: Ensure every specification document conforms to the published JSON Schema.

Test Cases

ID	Test	Expected
SV-001	Valid minimal specification (opening → end)	Passes validation
SV-002	Valid full specification (opening → 2 questions → closing → end)	Passes validation
SV-003	Missing `irVersion`	Rejects with `"irVersion is required"`
SV-004	Invalid `irVersion` format (e.g., “1.0”)	Rejects with `"irVersion must be semver"`
SV-005	Node with `nodeId` containing spaces	Rejects with `"nodeId must be [a-z0-9-]"`
SV-006	Duplicate `nodeId` across nodes	Rejects with `"nodeId must be unique"`
SV-007	Transition target references non-existent `nodeId`	Rejects with `"target nodeId not found"`
SV-008	`maxFollowUps` set to 0 on a question node	Passes (valid — no follow-ups allowed)
SV-009	`maxFollowUps` set to -1	Rejects with `"maxFollowUps must be >= 0"`
SV-010	`evidenceTargets` array with duplicate IDs within one node	Rejects with `"evidenceTarget ID must be unique within node"`
SV-011	`evidenceTargets` with missing `level` field	Rejects with `"level is required"`
SV-012	`candidateCommands` with unknown command type	Rejects with `"unknown command type"`
SV-013	`transitionPolicy.conditions` referencing undefined evidence target	Rejects with `"evidence target not found"`
SV-014	`timeBudgetSeconds` set to 0 on a question node	Rejects with `"timeBudgetSeconds must be > 0"`
SV-015	Node of type `end` with transitions defined	Rejects with `"end node must not have transitions"`
SV-016	Node of type `question` without `questionStem`	Rejects with `"questionStem is required for question nodes"`
SV-017	`guardrails.forbidden` contains unknown value	Rejects with `"unknown forbidden value"`
SV-018	Specification with `examinerPersona` containing valid tone and style	Passes
SV-019	Specification with `metadata.language` set to unsupported locale	Warns but does not reject (MAY support)
SV-020	`followUp.triggerCondition` uses undefined variable	Rejects with `"undefined variable in triggerCondition"`

Tooling

JSON Schema Draft 2020-12 with custom format validators.
ajv (Node.js) or jsonschema (Python) for validation.
CI: validate all specification fixtures on every commit.
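
Cases such as SV-006 and SV-007 involve cross-references between nodes, which are awkward to express in JSON Schema alone and may be easier to handle in a custom check that runs after schema validation. The following is a minimal sketch of such a check, not the project's validator. It assumes a hypothetical specification shape with a top-level `nodes` array whose entries carry `nodeId` and a `transitions` list with a `target` field; `check_references` is an illustrative name.

```python
import re
from typing import Any

NODE_ID_PATTERN = re.compile(r"^[a-z0-9-]+$")  # mirrors the SV-005 message


def check_references(spec: dict[str, Any]) -> list[str]:
    """Return error messages for nodeId format, uniqueness, and dangling targets."""
    errors: list[str] = []
    seen: set[str] = set()
    nodes = spec.get("nodes", [])  # assumed top-level key
    for node in nodes:
        node_id = node.get("nodeId", "")
        if not NODE_ID_PATTERN.match(node_id):
            errors.append(f"nodeId must be [a-z0-9-]: {node_id!r}")
        if node_id in seen:
            errors.append(f"nodeId must be unique: {node_id!r}")
        seen.add(node_id)
    for node in nodes:
        for transition in node.get("transitions", []):  # assumed field name
            target = transition.get("target")
            if target not in seen:
                errors.append(f"target nodeId not found: {target!r}")
    return errors
```

Returning a list rather than raising on the first problem lets a fixture test assert on every expected message at once.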

12.3 Unit Tests

Purpose: Test individual functions and modules in isolation.

12.3.1 Runtime State Machine

ID	Test	Expected
USM-001	Initialise state for a question node	`followUpCount = 0`, `timeElapsed = 0`, `evidenceCovered = []`
USM-002	Increment follow-up counter	`followUpCount` increases by 1
USM-003	Follow-up counter at max — attempt increment	Counter stays at max; returns `blocked` status
USM-004	Follow-up counter NEVER decrements	After increment, decrement is a no-op
USM-005	Candidate `repeat` command — follow-up counter unchanged	`followUpCount` same before and after
USM-006	Candidate `clarification` command — follow-up counter unchanged	`followUpCount` same before and after
USM-007	Candidate `raise_hand` — timer pauses	`timeElapsed` stops increasing during pause
USM-008	Timer resumes after pause duration	`timeElapsed` resumes from pre-pause value
USM-009	`raise_hand` exceeds `maxPerNode` — command rejected	Returns `max_commands_reached`
USM-010	Transition to allowed target	Returns `approved`
USM-011	Transition to disallowed target	Returns `blocked` with reason
USM-012	Transition when condition not yet satisfied	Returns `blocked` with unsatisfied conditions
USM-013	Time budget warning at 80%	Emits `time_budget_warning`
USM-014	Time budget exceeded at 100%	Emits `time_budget_exceeded`; triggers forced transition
USM-015	State reconciliation — runtime and FlowManager agree	Returns `consistent`
USM-016	State reconciliation — runtime and FlowManager disagree	Returns `mismatch` with details
USM-017	Evidence covered set updated on signal	`evidenceCovered` includes new target ID
USM-018	Duplicate evidence signal for same target	`evidenceCovered` still contains target once; latest confidence/rationale stored
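
The follow-up counter rules (USM-001 to USM-006) and the evidence set rule (USM-018) can be pictured with a small state object. This is a sketch under assumptions: the `NodeState` class, its method names, and the `"ok"` return value are hypothetical, and only the behaviours listed in the table are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """Per-node runtime state; the class itself is hypothetical."""
    max_follow_ups: int
    follow_up_count: int = 0
    evidence_covered: dict[str, float] = field(default_factory=dict)

    def increment_follow_up(self) -> str:
        """Count one follow-up, or report 'blocked' once the cap is reached."""
        if self.follow_up_count >= self.max_follow_ups:
            return "blocked"
        self.follow_up_count += 1
        return "ok"

    def handle_command(self, command: str) -> None:
        """Candidate commands never touch the follow-up counter (USM-005, USM-006)."""
        # Intentionally leaves the counter alone; a real runtime would act on the command.

    def record_evidence(self, target_id: str, confidence: float) -> None:
        """Keep one entry per target; the latest confidence wins (USM-018)."""
        self.evidence_covered[target_id] = confidence
```

No decrement method is exposed, which is one way to make USM-004 hold by construction.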

12.3.2 Command Classifier

ID	Test	Expected
UCC-001	“Can you repeat that?” → `repeat`	Classified as `repeat` with confidence > 0.9
UCC-002	“Sorry, say that again” → `repeat`	Classified as `repeat`
UCC-003	“What do you mean by starvation?” → `clarification`	Classified as `clarification`
UCC-004	“Could you explain that term?” → `clarification`	Classified as `clarification`
UCC-005	“Can I have a moment?” → `raise_hand`	Classified as `raise_hand`
UCC-006	“I need to think for a second” → `raise_hand`	Classified as `raise_hand`
UCC-007	“Process scheduling is when the OS…” → `answer`	Classified as `answer` (not a command)
UCC-008	“I think Round Robin is better because…” → `answer`	Classified as `answer`
UCC-009	“Repeat? I mean, um, scheduling is…” → `answer`	Classified as `answer` (false-positive suppression)
UCC-010	“Can you repeat the question and also explain what you mean?” → ambiguous	Returns top-2 with confidence; runtime uses primary
UCC-011	Empty string / silence	Returns `none`
UCC-012	“Stop the exam” → `raise_hand` or special	Classified as `raise_hand` (closest match); logged for review
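
To make the expected labels concrete, here is a toy rule-based stand-in that satisfies UCC-001 to UCC-009 and UCC-011. It is only a sketch: the phrase patterns are invented for illustration, it returns no confidence score, and the real classifier may well be model-based.

```python
import re

# Hypothetical phrase patterns; these are illustrative, not the production rules.
COMMAND_PATTERNS = {
    "repeat": [r"\brepeat (that|the question)\b", r"\bsay that again\b"],
    "clarification": [r"\bwhat do you mean\b", r"\bexplain that term\b"],
    "raise_hand": [r"\b(have|need) a moment\b", r"\bneed to think\b"],
}


def classify(utterance: str) -> str:
    """Return a command label, 'answer' for ordinary speech, or 'none' for silence."""
    text = utterance.strip().lower()
    if not text:
        return "none"
    for label, patterns in COMMAND_PATTERNS.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return label
    return "answer"
```

Matching on whole phrases rather than the bare word "repeat" is what keeps UCC-009 classified as `answer` in this sketch.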

12.3.3 Event Emission

ID	Test	Expected
UEV-001	Emit `node_entered` with valid payload	Event persisted with correct schema
UEV-002	Emit `transcript_final` with `spanId`	Span ID is unique, monotonic
UEV-003	Emit `evidence_signal` with valid `transcriptSpanId`	Signal persisted, span exists in transcript store
UEV-004	Emit `evidence_signal` with invalid `transcriptSpanId`	Signal rejected; error logged
UEV-005	Emit `exam_completed` exactly once	Second emission is rejected
UEV-006	Emit events out of order (`node_exited` before `node_entered`)	Rejected with `invalid_sequence` error
UEV-007	Emit `transition_decision` with `decision: "blocked"`	Event persisted with reason
UEV-008	Emit `guardrail_violation`	Event persisted; original text captured
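
The ordering rules in UEV-005 and UEV-006 can be sketched as a small in-memory log that refuses invalid emissions. The `EventLog` class, the `EventSequenceError` exception, and the `nodeId` payload key are assumptions made for this example; persistence and schema checks are left out.

```python
from typing import Optional


class EventSequenceError(Exception):
    """Raised when an event violates the ordering rules sketched here."""


class EventLog:
    """In-memory event log enforcing two rules: UEV-005 and UEV-006."""

    def __init__(self) -> None:
        self.events: list[tuple[str, dict]] = []
        self._open_nodes: set[str] = set()
        self._completed = False

    def emit(self, event_type: str, payload: Optional[dict] = None) -> None:
        payload = payload or {}
        if self._completed:
            # Nothing may follow exam_completed, including a second exam_completed.
            raise EventSequenceError("exam_completed already emitted")
        node_id = payload.get("nodeId")  # assumed payload key
        if event_type == "node_entered":
            self._open_nodes.add(node_id)
        elif event_type == "node_exited":
            if node_id not in self._open_nodes:
                raise EventSequenceError("invalid_sequence")
            self._open_nodes.remove(node_id)
        elif event_type == "exam_completed":
            self._completed = True
        self.events.append((event_type, payload))
```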

12.3.4 Guardrail Checks

ID	Test	Expected
UGR-001	LLM output contains “rubric”	Blocked; `guardrail_violation` emitted
UGR-002	LLM output contains “key criteria for this question”	Blocked (rubric reveal variant)
UGR-003	LLM output says “you scored 8 out of 10”	Blocked (`reveal_score`)
UGR-004	LLM output suggests “you might want to mention X”	Blocked (`suggest_answer`)
UGR-005	LLM output discusses “exam format policy”	Blocked (`forbidden_topics`)
UGR-006	LLM output is clean, on-topic response	Passes; no violation
UGR-007	LLM output attempts transition to closing from q1	Blocked (`unauthorized_transition`)
UGR-008	LLM output contains rubric text embedded in a story	Blocked (semantic detection, not just keyword)
UGR-009	LLM output contains grading threshold numbers	Blocked (`forbidden_topics`)
UGR-010	LLM output provides a helpful clarification	Passes (clarification is allowed within guardrails)
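
A first-pass keyword layer covering cases like UGR-001 to UGR-004 might look like the sketch below. The pattern set and the `reveal_rubric` label are hypothetical (the table names only `reveal_score`, `suggest_answer`, `forbidden_topics`, and `unauthorized_transition`), and semantic cases such as UGR-008 would need a further check that this example does not attempt.

```python
import re

# Hypothetical first-pass patterns keyed by violation type.
GUARDRAIL_PATTERNS = {
    "reveal_rubric": re.compile(r"\b(rubric|key criteria)\b", re.IGNORECASE),
    "reveal_score": re.compile(r"\b\d+\s*out of\s*\d+\b", re.IGNORECASE),
    "suggest_answer": re.compile(r"\byou might want to mention\b", re.IGNORECASE),
}


def check_guardrails(llm_output: str) -> list[str]:
    """Return the violation types matched by the output (empty list means pass)."""
    return [name for name, pattern in GUARDRAIL_PATTERNS.items()
            if pattern.search(llm_output)]
```

Precompiled patterns keep this layer cheap, which matters if the check has to fit inside a tight latency budget.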

12.4 Pipecat Adapter Tests

Purpose: Verify that the specification-to-Pipecat adapter produces correct FlowManager configurations.

ID	Test	Expected
PAD-001	Opening node → Pipecat config	`task_messages` contains system + assistant prompt; `edges` has one target
PAD-002	Question node with 2 follow-ups → Pipecat config	System prompt mentions follow-up guidance; `runtime_config.maxFollowUps = 2`
PAD-003	Question node with `maxFollowUps: 0` → Pipecat config	System prompt says “do not ask follow-ups”; no follow-up guidance in config
PAD-004	End node → Pipecat config	No `task_messages`; `post_actions` includes `exam_completed` event
PAD-005	Node with time budget → Pipecat config	`runtime_config.timeBudgetSeconds` present
PAD-006	Node with evidence targets → Pipecat config	`runtime_config.evidenceTargets` array present
PAD-007	Node with `guardrails.forbidden` → Pipecat config	System prompt includes forbidden action list
PAD-008	Node with `candidateCommands` → Pipecat config	System prompt includes command handling instructions
PAD-009	Adapter preserves node order	`initial_node` matches specification’s first node; edge order matches
PAD-010	Adapter rejects invalid specification	Returns clear error, does not produce partial config
PAD-011	Adapter round-trip: specification → Pipecat config → (mock) execution	Execution produces expected events
PAD-012	Adapter handles specification with no `examinerPersona`	System prompt uses default persona
PAD-013	Adapter handles specification with `overrunPolicy: "warn_at_80pct_hard_at_100pct"`	Runtime config includes overrun settings

12.5 Scripted Candidate Simulation

Purpose: Run a complete exam session with a scripted candidate (no LLM) to verify end-to-end behaviour.

Approach: A “candidate simulator” plays predetermined utterances at predetermined times. The runtime, events, evidence ledger, and marking input are all verified against expected outcomes.

Scenario A — Normal Completion

Step	Candidate Action	Expected Runtime Behaviour
1	(opening plays)	`node_entered(opening)` → `node_exited(opening)` → `node_entered(q1)`
2	Answers Q1 well	`evidence_signal(ev-q1-scheduling-concept, covered)`
3	Follow-up 1 asked	`node_progress(followUpCount: 1)`
4	Answers follow-up 1	`evidence_signal(ev-q1-preemptive-cooperative, covered)`
5	Follow-up 2 asked	`node_progress(followUpCount: 2)`
6	Answers follow-up 2	`evidence_signal(ev-q1-context-switch, covered)`
7	Runtime approves move to q2	`transition_decision(move_to_next_node)`
8	Q2 stem plays	`node_entered(q2)`
9	Answers Q2 partially	`evidence_signal(ev-q2-algorithm-choice, covered)`
10	Follow-up 1 asked	`node_progress(followUpCount: 1)`
11	Answers follow-up 1	`evidence_signal(ev-q2-starvation, covered)`
12	Runtime decides sufficient evidence	`transition_decision(move_to_next_node)`
13	Closing plays	`node_entered(closing)` → `node_exited(closing)` → `node_entered(end)`
14	(end)	`exam_completed` emitted; marking input assembled

Verify: All events present and in order. Evidence ledger has 5/6 covered. Marking input is complete.
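
The "present and in order" check can be expressed as a subsequence test over the emitted event names. The helper below is a sketch; `occurs_in_order` is an illustrative name, and it compares event names only, not payloads.

```python
from typing import Iterable, Sequence


def occurs_in_order(expected: Sequence[str], actual: Iterable[str]) -> bool:
    """True if every expected event name appears in `actual`, in the same relative order."""
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to the first match,
    # so later names can only match later events.
    return all(name in remaining for name in expected)
```

A subsequence test tolerates extra events between the expected ones (for example `node_progress`), so the scenario does not break when new event types are added.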

Scenario B — Candidate Uses All Commands

Step	Candidate Action	Expected
1	“Can you repeat that?” after Q1 stem	`candidate_command(repeat)`, `followUpCount` unchanged
2	“What do you mean by scheduling?”	`candidate_command(clarification)`, LLM clarifies, `followUpCount` unchanged
3	“I need a moment” before answering	`candidate_command(raise_hand)`, timer pauses for 10s
4	Answers normally after pause	`followUpCount` at 0 (none of the commands counted)
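
Step 3 depends on a timer that stops accumulating while paused. One way to test this without real waiting is to inject the clock, as in the sketch below. The `PausableTimer` class is hypothetical, and the 10s pause duration would be driven by the caller.

```python
class PausableTimer:
    """Elapsed-time tracker driven by an injected clock, so tests need no real waiting."""

    def __init__(self, clock) -> None:
        self._clock = clock          # callable returning seconds as a float
        self._started_at = clock()
        self._paused_at = None       # clock reading when the current pause began
        self._paused_total = 0.0     # seconds spent in completed pauses

    def pause(self) -> None:
        if self._paused_at is None:
            self._paused_at = self._clock()

    def resume(self) -> None:
        if self._paused_at is not None:
            self._paused_total += self._clock() - self._paused_at
            self._paused_at = None

    def elapsed(self) -> float:
        # While paused, time is frozen at the moment the pause began.
        now = self._paused_at if self._paused_at is not None else self._clock()
        return now - self._started_at - self._paused_total
```

In production the clock could be `time.monotonic`; in a scripted simulation it can be a fake that the test advances by hand.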

Scenario C — Time Budget Exceeded

Step	Candidate Action	Expected
1	Candidate gives slow, rambling answers	Timer approaches 80% of q2 budget
2	(automatic)	`time_budget_warning` emitted
3	Candidate continues	Timer reaches 100%
4	(automatic)	`time_budget_exceeded` emitted
5	(automatic)	`transition_decision(move_to_next_node, reason: "time budget exceeded")`
6	LLM delivers graceful bridge	“We’re running short on time, so let’s move on”
7	Closing plays	`node_entered(closing)`
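
Steps 2 and 4 amount to two thresholds on elapsed time, each firing once. A minimal sketch follows, assuming the 80% and 100% thresholds from USM-013 and USM-014; the `budget_events` function and its `emitted` bookkeeping set are hypothetical.

```python
def budget_events(elapsed: float, budget: float, emitted: set[str]) -> list[str]:
    """Return the time-budget events newly due at `elapsed` seconds.

    `emitted` is mutated so that each event fires at most once per node.
    """
    due: list[str] = []
    if elapsed >= 0.8 * budget and "time_budget_warning" not in emitted:
        due.append("time_budget_warning")
    if elapsed >= budget and "time_budget_exceeded" not in emitted:
        due.append("time_budget_exceeded")
    emitted.update(due)
    return due
```

The forced transition in step 5 would be triggered by the caller when `time_budget_exceeded` appears in the returned list.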

12.6 Adversarial Candidate Tests

Purpose: Verify that the runtime handles hostile, confused, or edge-case candidate behaviour without breaking assessment integrity.

12.6.1 LLM Attempts Third Follow-Up When max=2

ID	Test	Expected
ADV-001	After 2 follow-ups, LLM generates a third follow-up question	Runtime blocks the follow-up; injects “move to next node” instruction
ADV-002	LLM generates third follow-up phrased as a statement	Runtime detects follow-up intent; blocks
ADV-003	LLM embeds a follow-up inside a transition sentence	Runtime detects; blocks and forces transition

12.6.2 Candidate Commands

ID	Test	Expected
ADV-004	Candidate asks to repeat 10 times in a row	First 3 succeed (`maxPerNode: 3`); remaining 7 rejected with polite message
ADV-005	Candidate asks for clarification on a term that IS the answer	LLM provides a safe explanation without revealing the answer; guardrail passes
ADV-006	Candidate says “raise hand” then immediately starts answering	Timer paused for 10s regardless; answer recorded after resume
ADV-007	Candidate sends command via data channel while LLM is speaking	Command queued; processed after current utterance completes

12.6.3 Silence & Off-Topic

ID	Test	Expected
ADV-008	Candidate stays silent for 16 seconds (threshold is 15s)	`silenceAction: "gentle_prompt"` triggered — LLM says “Take your time” or similar
ADV-009	Candidate stays silent for 45 seconds (3 gentle prompts)	Runtime escalates: emits `candidate_silence_extended`; LLM says “Would you like me to move on?”
ADV-010	Candidate gives an off-topic answer (“I like pizza”)	LLM gently redirects: “That’s interesting, but let’s focus on the question about scheduling.” No follow-up counted.
ADV-011	Candidate gives a partially on-topic answer	LLM asks a follow-up to fill the gap; follow-up counted normally
ADV-012	Candidate says “I don’t know”	LLM offers a gentle hint within guardrails; follow-up counted

12.6.4 STT Edge Cases

ID	Test	Expected
ADV-013	STT partial arrives but final never comes	After timeout (5s), runtime treats last partial as final with degraded confidence flag
ADV-014	STT produces two overlapping finals for the same utterance	Deduplicated by span ID; second one ignored
ADV-015	STT produces empty final (candidate mumbled)	Emitted as `transcript_final` with empty text; LLM handles gracefully
ADV-016	STT garbles text (low-confidence transcription)	`transcript_final` includes `confidence` field; low confidence logged for review
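
ADV-014 reduces to keeping the first final per span ID. The sketch below assumes each final is a dict with a `spanId` key, matching the field named in UEV-002; the function name is illustrative.

```python
def dedupe_finals(finals: list[dict]) -> list[dict]:
    """Keep the first `transcript_final` per span ID, preserving arrival order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for final in finals:
        span_id = final["spanId"]
        if span_id in seen:
            continue  # ADV-014: the second final for a span is ignored
        seen.add(span_id)
        unique.append(final)
    return unique
```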

12.6.5 Unauthorised Transitions

ID	Test	Expected
ADV-017	LLM attempts to jump from q1 directly to closing	Runtime blocks; `guardrail_violation(unauthorized_transition)` emitted
ADV-018	LLM attempts to jump from q1 to q2 (correct) without runtime approval	Runtime intercepts; approves only after transition condition check
ADV-019	LLM attempts to loop back to q1 from q2	Runtime blocks; q1 is not in q2’s `allowedTargets`
ADV-020	LLM attempts to end exam mid-question	Runtime blocks; forces continuation or graceful time-budget handling

12.6.6 Evidence Signal Edge Cases

ID	Test	Expected
ADV-021	Evidence signal references invalid span ID	Signal rejected; error logged; does not corrupt ledger
ADV-022	Evidence signal claims “covered” but transcript excerpt contradicts	Confidence check: if excerpt is empty or contradictory, signal downgraded to “uncertain”
ADV-023	Multiple evidence signals for same target with increasing confidence	Latest signal overwrites earlier; confidence increases
ADV-024	Evidence signal arrives after `exam_completed`	Rejected; `exam_completed` is final

12.7 Transcript & Evidence Consistency Tests

Purpose: Ensure the transcript and evidence ledger are internally consistent and can be independently verified.

ID	Test	Expected
TEC-001	Every `transcriptSpanId` referenced in evidence ledger exists in transcript store	100% match
TEC-002	Every evidence target marked “covered” has at least one transcript span	100% match
TEC-003	Evidence “not_covered” has zero transcript spans (or empty excerpt)	Consistent
TEC-004	Transcript is complete — every `node_entered` has corresponding transcript spans	No gaps
TEC-005	Transcript order matches event order	Monotonic timestamps
TEC-006	Evidence ledger is complete — every `evidenceTarget` in specification has a ledger entry	100% coverage (even if “not_covered”)
TEC-007	Evidence confidence values are in [0, 1] range	All valid
TEC-008	Evidence rationale is non-empty for all “covered” entries	Present
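
Several of these checks (TEC-001, TEC-002, TEC-006, TEC-007) can run as one pass over the ledger. The sketch below assumes hypothetical ledger entry fields `targetId`, `status`, `transcriptSpanIds`, and `confidence`; the actual ledger schema may differ.

```python
def check_ledger(spec_targets: list[str], ledger: list[dict], span_ids: set[str]) -> list[str]:
    """Return consistency problems between specification, ledger, and transcript."""
    problems: list[str] = []
    by_target = {entry["targetId"]: entry for entry in ledger}
    for target in spec_targets:
        if target not in by_target:
            problems.append(f"{target}: no ledger entry")  # TEC-006
    for entry in ledger:
        target = entry["targetId"]
        spans = entry.get("transcriptSpanIds", [])
        missing = [s for s in spans if s not in span_ids]
        if missing:
            problems.append(f"{target}: unknown spans {missing}")  # TEC-001
        if entry["status"] == "covered" and not spans:
            problems.append(f"{target}: covered without span")  # TEC-002
        if not 0.0 <= entry.get("confidence", 0.0) <= 1.0:
            problems.append(f"{target}: confidence out of range")  # TEC-007
    return problems
```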

12.8 Candidate Command Integration Tests

Purpose: Test the full flow from candidate utterance through command classification to runtime action and event emission.

ID	Test	Expected
CCI-001	Candidate says “repeat” during Q1 stem	Command classified → re-prompt issued → `candidate_command` event → `followUpCount` unchanged
CCI-002	Candidate says “what does that mean?” during follow-up	Command classified as `clarification` → LLM clarifies → `candidate_command` event → `followUpCount` unchanged
CCI-003	Candidate says “give me a second” mid-answer	Command classified as `raise_hand` → timer pauses → `time_budget_paused` event → resumes after 10s
CCI-004	Candidate sends raw data channel command `{ command: "repeat" }`	Processed same as spoken command; event emitted
CCI-005	Candidate sends malformed data channel command	Rejected with error; logged; no crash
CCI-006	Candidate uses command during opening (non-question node)	Command rejected or handled gracefully (e.g., `repeat` works, `raise_hand` is ignored)

12.9 Guardrail Integration Tests

Purpose: Test guardrails in context — LLM generation → guardrail check → block/allow → event emission.

ID	Test	Expected
GI-001	LLM generates rubric-revealing response during Q1	Blocked; `guardrail_violation` event; LLM regenerates clean response
GI-002	LLM generates unauthorised transition text from q1	Blocked; `guardrail_violation`; runtime forces q2 transition
GI-003	LLM generates a helpful clarification (allowed)	Passes; no violation event
GI-004	LLM generates response that hints at score (“about 70%”)	Blocked; `guardrail_violation(reveal_score)`
GI-005	LLM generates response discussing exam logistics	Blocked; `guardrail_violation(forbidden_topics)`
GI-006	Guardrail check latency	MUST complete in <50ms; log if exceeds

12.10 Regression Tests for Published Packages

Purpose: Ensure existing published packages continue to work through every phase of the migration.

ID	Test	Expected
REG-001	Package published with flowJson v1 (pre-specification) — Phase 1	Events emitted; transcript persisted; no candidate-facing change
REG-002	Package published with flowJson v1 — Phase 2	Commands not configured; runtime state tracked but commands don’t activate
REG-003	Package published with flowJson v1 — Phase 3	No evidence targets; evidence ledger is empty; marking uses raw transcript
REG-004	Package published with flowJson v1 — Phase 4	No transition policy; LLM decides transitions (current behaviour)
REG-005	Package published with flowJson v1 — Phase 5	Auto-migrated to specification v1.0.0; verify Pipecat adapter output matches original flowJson
REG-006	Package published with specification v1.0.0 — after Phase 5	Full specification pipeline; all features available
REG-007	Package published with specification v1.1.0 (future) — loaded by runtime v1.0	Backward compatible; new fields ignored; warning logged
REG-008	Package published with specification v2.0.0 (breaking) — loaded by runtime v1.0	Rejected with clear error: “specification version 2.0.0 requires runtime >= 2.0.0”
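
REG-007 and REG-008 describe a semver-based compatibility decision. A minimal sketch follows, assuming plain `MAJOR.MINOR.PATCH` strings with no pre-release tags. The `check_compatibility` function and its `accept`/`warn`/`reject` labels are hypothetical, and accepting older major versions is an assumption this chapter does not state.

```python
def check_compatibility(spec_version: str, runtime_version: str) -> tuple[str, str]:
    """Compare semver strings and return (decision, message)."""
    spec_major, spec_minor, _ = (int(part) for part in spec_version.split("."))
    run_major, run_minor, _ = (int(part) for part in runtime_version.split("."))
    if spec_major > run_major:
        # REG-008: a breaking version needs a newer runtime.
        return ("reject",
                f"specification version {spec_version} requires runtime >= {spec_major}.0.0")
    if spec_major == run_major and spec_minor > run_minor:
        # REG-007: newer minor version loads, with unknown fields ignored.
        return ("warn", "newer minor version; unknown fields ignored")
    return ("accept", "")  # assumption: same or older versions load without warning
```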

12.11 UI Event Contract Tests

Purpose: Verify that the frontend correctly consumes and renders all event types.

ID	Test	Expected
UIC-001	`bot_ready` received	UI shows “Connected” status
UIC-002	`node_entered(q1)` received	UI updates to “Question 1” display
UIC-003	`node_progress(followUpCount: 1, maxFollowUps: 2)` received	UI shows “Follow-up 1 of 2”
UIC-004	`node_progress(evidenceCovered: ["ev-q1-scheduling-concept"])` received	UI updates evidence progress indicator
UIC-005	`time_budget_warning` received	UI shows time warning (e.g., yellow indicator)
UIC-006	`time_budget_exceeded` received	UI shows time exceeded (e.g., red indicator)
UIC-007	`candidate_command(repeat)` sent via UI button	Data channel message sent; command acknowledged
UIC-008	`candidate_command(raise_hand)` sent via UI button	Timer pause acknowledged; UI shows “Paused”
UIC-009	`exam_completed` received	UI shows “Assessment Complete” screen
UIC-010	`guardrail_violation` event (admin view only)	Admin UI shows violation details; candidate UI unaffected
UIC-011	Unknown event type received	UI ignores gracefully; logs warning
UIC-012	Events arrive out of order	UI reorders by timestamp before rendering

12.12 markRuntime Input Tests

Purpose: Verify the marking input package is complete, correct, and consumable by the marking pipeline.

ID	Test	Expected
MRI-001	Marking input contains all evidence ledger entries	Count matches specification’s `evidenceTargets` count
MRI-002	Marking input contains full transcript	All `transcript_final` spans present
MRI-003	Marking input contains runtime audit	`nodesVisited`, `followUpsUsed`, `transitionDecisions` all present
MRI-004	Marking input contains specification snapshot	Frozen copy of specification used for session
MRI-005	Marking input for exam with no evidence targets	Evidence ledger empty; transcript and audit still present
MRI-006	Marking input for exam that ended early (timeout)	Partial evidence; `not_covered` entries for missing targets
MRI-007	Marking input for exam with guardrail violations	Violations listed in runtime audit
MRI-008	Marking input for exam with candidate commands	Commands listed in runtime audit
MRI-009	Marking input schema validation	Passes JSON Schema for marking input
MRI-010	Marking input version compatibility	`inputVersion` matches expected version

12.13 Chaos & Resilience Tests

Purpose: Verify the system handles failures gracefully.

ID	Test	Expected
CHA-001	Bot crashes mid-question	`exam_completed` fires from guaranteed hook; partial transcript preserved
CHA-002	STT service disconnects during candidate answer	Last partial treated as final (degraded); session continues after reconnect
CHA-003	Event store unavailable	Events queued in memory; flushed on recovery; session does not block
CHA-004	Candidate disconnects and reconnects	Session resumes from current node; state preserved
CHA-005	LLM service timeout during follow-up	Retry once; if still fails, skip follow-up and move to next node
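
The retry-once-then-skip behaviour in CHA-005 can be sketched as a small wrapper. It assumes the LLM client raises Python's built-in `TimeoutError` on timeout; the `generate_follow_up` name and the use of `None` to signal "skip" are illustrative choices.

```python
from typing import Callable, Optional


def generate_follow_up(call_llm: Callable[[], str], retries: int = 1) -> Optional[str]:
    """Call the LLM, retrying on timeout; return None to signal 'skip the follow-up'."""
    for _attempt in range(retries + 1):
        try:
            return call_llm()
        except TimeoutError:
            continue  # try again until the retry allowance is used up
    return None
```

A chaos test can pass a stub that times out a fixed number of times and assert both the returned value and the call count.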

12.14 Performance & Load Tests

ID	Test	Target
PERF-001	Specification schema validation latency	<10ms per specification
PERF-002	Pipecat adapter compilation latency	<500ms per specification
PERF-003	Command classification latency	<200ms per utterance
PERF-004	Guardrail check latency	<50ms per LLM response
PERF-005	Event emission throughput	>1000 events/sec to event store
PERF-006	Evidence signal generation latency	<2s (async, does not block dialogue)
PERF-007	100 concurrent exam sessions	No degradation in event latency or dialogue responsiveness
PERF-008	Marking input assembly latency	<5s per exam

12.15 Psychometric Testing Track

Purpose: Validate that specification packages produce assessments that are not only technically correct but also psychometrically sound — valid, reliable, and fair. This track addresses the “psychometrically blind” gap identified in the specification review.

Theoretical grounding: Akimov & Malin (2020) evaluate their oral exam against a validity/reliability/fairness matrix with eight criteria: face validity, content validity, construct validity, concurrent validity, inter-item consistency, inter-case reliability, inter-rater reliability, and fairness. Joughin (1998) warns that “reliability is threatened when examiners are poorly prepared” (p. 376) and when “interaction tends towards the dialogue pole” (p. 376). Fenton (2025) notes that “careful preparation is recommended to avoid any bias” and identifies risks around “gender, ethnicity, language skills, speed of answering, and subjective grading.”

These tests are not run on every commit. They are run per exam package before deployment to high-stakes summative assessment. Low-stakes or formative exams MAY skip psychometric validation.

12.15.1 Content Validity

ID	Test	Method	Expected
PSY-001	Evidence targets align with declared learning outcomes	Expert panel review (≥3 subject matter experts)	≥90% agreement that targets cover declared LOs
PSY-002	Question stem elicits the intended cognitive level	Bloom’s taxonomy classification by 2+ independent raters	Inter-rater agreement ≥ 0.8 (Cohen’s κ)
PSY-003	Follow-up questions probe deeper understanding, not just recall	Expert review of follow-up bank against Bloom’s levels	Follow-ups escalate at least one Bloom level from stem
PSY-004	Assessment covers the full scope of the learning outcomes	Content coverage map: LO → evidence target → node	Every declared LO has ≥1 evidence target mapped

12.15.2 Construct Validity

ID	Test	Method	Expected
PSY-010	Specification evidence signals correlate with independent measures of the same construct	Correlation between specification evidence signals and independent written exam scores on same topics	Moderate positive correlation (r > 0.4)
PSY-011	Different evidence targets measure different constructs	Factor analysis of evidence signal patterns across ≥50 sessions	Evidence targets cluster into distinct factors matching declared content types
PSY-012	Transversal skills (communication, critical thinking) are distinguishable from content knowledge	Partial correlation: transversal signals vs content signals controlling for overall ability	Transversal and content signals are not perfectly correlated (r < 0.9)

12.15.3 Inter-Rater Reliability

ID	Test	Method	Expected
PSY-020	AI examiner produces consistent evidence signals across sessions	Two independent LLM instances assess the same transcript (≥30 transcripts)	Cohen’s κ ≥ 0.8 for signal classification (covered/partial/absent)
PSY-021	AI examiner agrees with human marker on evidence signals	AI-generated signals vs human-annotated signals on same transcripts	Cohen’s κ ≥ 0.75
PSY-022	Confidence calibration: 0.8-confidence signals are correct ~80% of the time	Binomial test: proportion of correct signals at each confidence level	Proportion within ±10 percentage points of declared confidence
PSY-023	Confidence drift detection	Monitor average confidence over session duration; test for monotonic trend	No significant drift (p > 0.05, Mann-Kendall test)
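
PSY-020 and PSY-021 rely on Cohen’s κ, defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater’s label frequencies. The sketch below computes it for two label sequences; the function name and the handling of the degenerate single-label case are choices made for this example.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

As a worked case: with 10 items, rater A labelling 5 as covered and 5 as absent, and rater B agreeing on all but one covered item, p_o = 0.9 and p_e = (5·4 + 5·6)/100 = 0.5, giving κ = 0.8, which sits exactly on the PSY-020 threshold.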

12.15.4 Inter-Case Reliability

ID	Test	Method	Expected
PSY-030	Candidates receiving different follow-up paths get comparable evidence opportunities	Compare evidence coverage rates across candidates on same node	Coverage rate variance < 15%
PSY-031	Follow-up type distribution is consistent across candidates	Chi-squared test on follow-up type distribution across ≥30 sessions	No significant deviation from expected distribution (p > 0.05)
PSY-032	Conversation path variance does not correlate with final scores	Correlation between path variance metric and evidence coverage	|r| < 0.3

12.15.5 Fairness

ID	Test	Method	Expected
PSY-040	No significant difference in evidence signal accuracy by language background	Compare AI marker agreement with human marker for native vs non-native speakers	No statistically significant difference (p > 0.05)
PSY-041	No significant difference in assessment outcomes by gender	Compare mean evidence coverage rates across gender groups	No statistically significant difference (p > 0.05)
PSY-042	Follow-up count not correlated with demographic variables	Regression: follow-up count ~ demographics + ability	Demographic coefficients not significant (p > 0.05)
PSY-043	Time budget is adequate for all candidates	Compare completion rates across language backgrounds	Non-native speakers do not have significantly lower completion rates

12.15.6 Face Validity and Candidate Experience

ID	Test	Method	Expected
PSY-050	Candidates perceive the assessment as testing what it claims	Post-exam survey: “This assessment tested my understanding of [topic]”	≥80% agree or strongly agree
PSY-051	Candidates find the AI examiner interaction natural	Post-exam survey: “The examiner’s questions felt natural and relevant”	≥70% agree or strongly agree
PSY-052	Anxiety levels are manageable	Pre/post anxiety survey (GAD-7 or equivalent)	Post-exam anxiety not significantly higher than pre-exam

12.15.7 Washback Effect

ID	Test	Method	Expected
PSY-060	Students adopt deeper learning strategies when oral exams are introduced	Pre/post survey of study strategies (surface vs deep approaches)	Shift toward deeper strategies
PSY-061	Students value the oral format over written alternatives	Post-exam preference survey	≥60% prefer oral format or find it equally valuable

12.15.8 Psychometric Test Execution Protocol

When to run: Before deploying a new exam package for summative assessment. Formative exams MAY skip psychometric validation.
Minimum sample size: 30 candidate sessions for reliability tests; 50+ for fairness tests.
Who runs: Assessment design team with psychometric support. Results documented in `metadata.assessmentProfile.validityEvidence`.
What happens on failure: If any PSY test fails, the exam package MUST NOT be used for summative assessment until the issue is resolved.
Re-validation triggers: Changing evidence targets, follow-up policies, time budgets, or examiner persona requires re-running affected PSY tests.

12.16 Test Data & Fixtures

Specification Fixtures

Fixture	Description
`ir-minimal.json`	Opening → end (no questions)
`ir-single-question.json`	Opening → 1 question → closing → end
`ir-two-questions.json`	Opening → 2 questions → closing → end (§10 example)
`ir-no-followups.json`	Questions with `maxFollowUps: 0`
`ir-many-followups.json`	Questions with `maxFollowUps: 5`
`ir-time-budget.json`	Questions with aggressive time budgets
`ir-with-commands.json`	Full `candidateCommands` configuration
`ir-with-evidence.json`	Full `evidenceTargets` configuration
`ir-invalid-*.json`	Various invalid specifications for schema validation tests

Transcript Fixtures

Fixture	Description
`transcript-normal.json`	Complete, well-structured transcript
`transcript-partial.json`	Transcript with missing spans
`transcript-overlapping.json`	Transcript with duplicate/overlapping finals
`transcript-empty-candidate.json`	Candidate says nothing
`transcript-long-rambling.json`	Candidate gives very long answers

Candidate Utterance Fixtures

Fixture	Description
`utterances-repeat.json`	50 variations of repeat requests
`utterances-clarification.json`	50 variations of clarification requests
`utterances-raise-hand.json`	30 variations of pause requests
`utterances-answers.json`	100 normal answers across topics
`utterances-adversarial.json`	Edge cases: off-topic, silence, confusion

12.17 CI/CD Integration

Stage	Tests Run	Gate?
Pre-commit	Schema validation, unit tests	Block on failure
PR	All of above + adapter tests + contract tests	Block on failure
Merge to main	All of above + integration tests	Block on failure
Nightly	All of above + scripted simulation + adversarial	Alert on failure
Weekly	All of above + chaos + performance	Alert on failure
Pre-release	Full suite + regression for all published packages	Block on failure

Revision History

Version	Date	Changes
v0.2.0	2026-06-30	Added test cases for new schema fields (anxietyMitigation, BloomLevel, etc.). Updated terminology from ‘Exam Runtime IR’ to ‘IOA-ORM’.
v0.1.0	2026-05-06	Initial release.