Migration Plan

Status

Draft · v0.2.0 · 2026-06-30

This chapter defines an incremental migration from the current flowJson-based architecture to the full IOA-ORM: five engineering phases (Phases 1–5) plus a parallel institutional readiness workstream (Phase 0). Each phase is independently shippable, backward-compatible, and testable. No phase requires a full rewrite; each builds on the previous one.

Guiding principles:

Incremental over revolutionary. Each phase delivers user-visible or engineering-visible value on its own.
Backward-compatible by default. Existing published packages MUST continue to work. New features are additive.
Feature-flagged. Each phase uses server-side feature flags so it can be enabled per-exam or per-tenant.
Reversible. Any phase can be rolled back without data loss.
Pedagogy-driven. Each phase is justified by assessment quality goals, not just engineering goals. The migration improves validity, reliability, and fairness — not just technical capability.

Phase 0 — Institutional Readiness (Parallel Workstream)

Goal: Prepare the human and institutional infrastructure for AI-powered oral assessment. This phase runs in parallel with Phases 1–3 and addresses the adoption barriers that Akimov & Malin (2020) and Fenton (2025) identify as critical to successful oral exam implementation.

Theoretical grounding: Akimov & Malin (2020) report that “administering oral examinations for larger classes would present several additional challenges because it would either require multiple examiners, which in turn could raise inter-rater consistency issues” (p. 1205). Fenton (2025) explicitly recommends “a training or shadowing program with experienced instructors leading novices” and “all examiners should receive training in oral assessment procedures.” Bayley et al. (2024) found that approximately 75% of students engaged with a practice ConVOE before the real assessment, highlighting the importance of student preparation.

0.1 Examiner Training Programme

Deliverable	Description	Owner
Training materials	How to author evidence targets, design follow-up banks, set time budgets, and calibrate assessment standards	Assessment design team
Calibration exercises	Pre-scored sample exams where new examiners practice generating evidence signals and compare against ground truth	Assessment design team
Shadowing protocol	New examiners observe experienced examiners conducting IOAs before designing their own	Academic development
Bias awareness module	Training on examiner bias (Akimov & Malin, 2020: “a large number of examiners acknowledged the presence of various biases”), language sensitivity, and cultural communication styles	DEI / accessibility team

0.2 Student Preparation

Deliverable	Description	Owner
Student-facing documentation	What to expect, how to prepare, what commands are available, time structure	Assessment design team
Practice exam	A low-stakes familiarisation exam using the specification’s scaffolding feature (see INFOSYS110 example, §10.11.8)	Assessment design team
Accessibility assessment	Identify candidates who may need accommodations (extra time, alternative formats, language support)	Accessibility team
Anxiety reduction resources	Information about the exam format, sample questions, and tips for managing exam anxiety (Fenton, 2025: “as much student-facing support as you can to reduce anxiety”)	Student wellbeing

0.3 Pilot Planning

Deliverable	Description	Owner
Pilot course selection	Select 2–3 courses for initial deployment; prefer low-stakes or formative assessments first	Assessment design team
Feedback loops	Establish candidate and examiner feedback mechanisms (surveys, focus groups)	Quality assurance
Baseline metrics	Collect pre-migration data on student satisfaction, assessment outcomes, and marking consistency for comparison	Analytics team

0.4 Scale Readiness (for 100+ candidates)

Theoretical grounding: Bayley et al. (2024) implemented ConVOEs for 600+ students. Their key lesson: concurrent administration requires standardised question formats, parallel grading, and technology failure handling. The migration plan must account for these at the institutional level.

Deliverable	Description	Owner
Capacity planning	Estimate concurrent session capacity; load test infrastructure	Platform team
Question pool design	For large cohorts, design question pools with equivalent difficulty variants (Bayley et al., 2024)	Assessment design team
Moderation workflow	Design human review process for AI-generated evidence signals	Assessment design + QA
Technology failure protocol	Establish SLA for bot crashes, STT failures, and disconnections during concurrent sessions	Platform + support teams

Phase 1 — Event Protocol & Transcript Closure

Goal: Establish a reliable event stream and ensure exam transcripts are complete, structured, and persisted.

1.1 What Changes

Area	Before	After
Event emission	Ad-hoc; some events emitted via data channel, some lost	Every key lifecycle moment emits a typed event to the event store
`bot_ready`	Not consistently emitted	MUST emit when the bot session is established and ready
`node_entered`	Implicit in flowJson node switch	Explicit event with `nodeId`, `nodeType`, `timestamp`
`node_exited`	Not emitted	MUST emit before next `node_entered`
`transcript_delta`	Partial STT results sent to UI	Formalised event with `nodeId`, `speaker`, `isFinal` flag
`transcript_final`	Emitted but not reliably persisted	MUST be persisted server-side with `nodeId`, `speaker`, `spanId`
`exam_completed`	Inconsistent; some sessions end without it	MUST emit exactly once, guaranteed by runtime controller
UI event consumption	UI parses raw data channel messages	UI consumes typed events from a standardised event contract
Server transcript persistence	Transcripts scattered across logs	Unified transcript store, queryable by `examId` + `sessionId`
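
The event contract in the table above can be sketched as a validated envelope. This is only an illustration, not the schema from §05-event-protocol.md: `ExamEvent`, `REQUIRED_FIELDS`, and `validate_event` are hypothetical names, and the fields required for `bot_ready`, `node_exited`, and `exam_completed` are assumptions because the table does not list any. The specification calls for a TypeScript interface plus JSON Schema; Python is used here purely as a compact runnable sketch.

```python
from dataclasses import dataclass
from typing import Any

# Required payload fields per event type, taken from the "After" column above.
# Entries for bot_ready, node_exited and exam_completed are assumptions.
REQUIRED_FIELDS: dict[str, tuple[str, ...]] = {
    "bot_ready": (),
    "node_entered": ("nodeId", "nodeType", "timestamp"),
    "node_exited": ("nodeId",),
    "transcript_delta": ("nodeId", "speaker", "isFinal"),
    "transcript_final": ("nodeId", "speaker", "spanId"),
    "exam_completed": (),
}


@dataclass(frozen=True)
class ExamEvent:
    exam_id: str
    session_id: str
    event_type: str
    payload: dict[str, Any]


def validate_event(event: ExamEvent) -> list[str]:
    """Return human-readable problems; an empty list means the event is valid."""
    if event.event_type not in REQUIRED_FIELDS:
        return [f"unknown event type: {event.event_type!r}"]
    required = REQUIRED_FIELDS[event.event_type]
    return [
        f"{event.event_type}: missing field {name!r}"
        for name in required
        if name not in event.payload
    ]
```

Returning a list of problems rather than raising on the first one makes it easier to produce the clear error messages that the testing strategy asks for.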

1.2 Why It Matters

Without reliable events, observability is impossible — debugging failed exams requires guessing.
Without transcript closure, the marking pipeline receives incomplete or unstructured data, producing unreliable marks.
Consistent exam_completed emission is a prerequisite for triggering post-exam workflows (marking, analytics, candidate notification).

1.3 Engineering Tasks

Define event schema (EventProtocol types in §05-event-protocol.md). Implement as a TypeScript interface + JSON Schema for validation.
Instrument the Pipecat bot lifecycle:
- Emit bot_ready at session start.
- Emit node_entered / node_exited around each FlowManager node transition.
- Emit transcript_delta / transcript_final from STT pipeline.
Implement event store persistence layer:
- New exam_events table (or equivalent) with examId, sessionId, eventType, payload, timestamp.
- Index on (examId, sessionId, timestamp).
Guarantee exam_completed emission:
- Add a finally-style hook in the runtime controller that fires exam_completed regardless of how the session ends (normal, timeout, error, disconnection).
Update the frontend exam room to consume typed events instead of raw data channel messages. Create an event dispatcher that maps event types to UI update functions.
Add transcript aggregation service:
- Collects transcript_final events per session.
- Produces an ordered, deduplicated transcript with span IDs.
- Persists to exam_transcripts table.
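
The transcript aggregation step could look roughly like the sketch below. It assumes each `transcript_final` event carries a `text` field and a per-session monotonic `timestamp`, and that a later event with the same `spanId` supersedes the earlier one; none of these details are fixed by this chapter, and `aggregate_transcript` is a hypothetical name.

```python
from typing import TypedDict


class TranscriptFinal(TypedDict):
    spanId: str
    nodeId: str
    speaker: str
    text: str
    timestamp: float  # assumed monotonic within a session


def aggregate_transcript(events: list[TranscriptFinal]) -> list[TranscriptFinal]:
    """Order transcript_final events by timestamp and keep one entry per spanId."""
    by_span: dict[str, TranscriptFinal] = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        # Assumed policy: the latest event for a span replaces earlier ones.
        by_span[event["spanId"]] = event
    return sorted(by_span.values(), key=lambda e: e["timestamp"])
```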

1.4 Risks

Risk	Mitigation
STT produces duplicate or overlapping final transcripts	Deduplicate by span ID; use monotonic timestamps
`exam_completed` fires before all transcripts are persisted	Add a flush-and-wait step before emitting `exam_completed`
Increased event volume impacts bot latency	Event emission MUST be async (fire-and-forget to a queue); MUST NOT block the dialogue loop
Existing bots emit different event formats	Adapter normalises legacy events during migration window
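
The finally-style hook and the flush-and-wait mitigation can be combined as in the following sketch. `SessionController`, `persist_async`, and the `reason` strings are hypothetical; the real runtime controller and its integration with the bot are not described here. A hard process kill bypasses `finally`, so the chaos test in §1.5 would additionally need something outside the process, which this sketch does not cover.

```python
import asyncio


class SessionController:
    """Hypothetical runtime controller; only the completion guarantee is sketched."""

    def __init__(self) -> None:
        self.emitted: list[str] = []
        self._completed = False
        self._pending_writes: list[asyncio.Task] = []

    def persist_async(self, coro) -> None:
        # Track in-flight transcript writes so they can be flushed before completion.
        self._pending_writes.append(asyncio.create_task(coro))

    async def _complete(self, reason: str) -> None:
        if self._completed:  # exactly-once guard
            return
        self._completed = True
        # Flush-and-wait: let outstanding writes finish (or fail) first.
        await asyncio.gather(*self._pending_writes, return_exceptions=True)
        self.emitted.append(f"exam_completed:{reason}")

    async def run(self, dialogue) -> None:
        reason = "aborted"  # stays in place on cancellation, e.g. a disconnection
        try:
            await dialogue(self)
            reason = "normal"
        except asyncio.TimeoutError:
            reason = "timeout"
        except Exception:
            reason = "error"
        finally:
            await self._complete(reason)
```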

1.5 Testing Strategy

Unit tests: Schema validation for every event type. Malformed payloads MUST be rejected with clear error messages.
Integration tests: Spin up a test bot session, verify every expected event is emitted in order and persisted.
Contract tests: Frontend event consumer tests — verify UI renders correctly for each event type.
Regression: Run existing exam flows; verify no change in candidate-facing behaviour.
Chaos test: Kill the bot mid-session; verify exam_completed still fires (from the guaranteed hook) and transcript is recoverable.

1.6 Effect on Published Packages

None. This phase adds events and persistence. It does not change the flowJson format or the candidate-facing experience. Existing published packages continue to work — they simply don’t emit the new events until re-published.

Phase 2 — Node State, Progress & Candidate Commands

Goal: Introduce runtime-managed node state, question progress tracking, and candidate command consumption.

2.1 What Changes

Area	Before	After
Runtime node state	Bot tracks which FlowManager node is active; no richer state	Runtime controller maintains per-node state: `followUpCount`, `timeElapsed`, `evidenceCovered`
Question progress	Not tracked; LLM decides when to move on	Runtime emits `node_progress` events; UI can show “Question 1 of 2 — Follow-up 1/2”
Candidate commands	`repeat`, `clarification`, `raise_hand` are UI-only or handled ad-hoc by LLM prompt	Runtime intercepts candidate commands, applies policy, emits `candidate_command` events
Data channel command protocol	No standardised candidate→bot command channel	Formalised command protocol: candidate sends typed command via data channel, runtime validates and routes
`node_progress` event	Doesn’t exist	New event emitted on every state change within a node

2.2 Why It Matters

Progress visibility: Candidates and proctors need to see where they are in the exam. Without runtime state, the UI is blind.
Command determinism: Candidate commands currently rely on the LLM correctly interpreting intent from natural language. This is fragile. Runtime interception provides deterministic, auditable command handling.
Follow-up counting: Without runtime tracking, the LLM can exceed the author’s intended follow-up limit. This is a fairness issue.

2.3 Engineering Tasks

Implement runtime node state store:
- In-memory state object per active session.
- Schema: { nodeId, followUpCount, maxFollowUps, timeBudgetSeconds, timeElapsed, evidenceCovered: string[], candidateCommandsUsed: [...] }.
- Emit node_progress on every state mutation.
Implement candidate command classifier:
- Receives STT output for candidate utterances.
- Classifies intent: repeat, clarification, raise_hand, or answer.
- Uses a lightweight classifier (rule-based or small model) — NOT the main LLM, to avoid latency and cost.
Implement data channel command protocol:
- Candidate UI sends: { type: "candidate_command", command: "repeat" }.
- Runtime validates: Is this command allowed at this point? Is it within maxPerNode? Does it cost a follow-up?
- Routes to appropriate handler (re-prompt, clarification, pause timer).
Wire candidate commands to runtime state:
- repeat → re-emit the current prompt, do NOT increment followUpCount.
- clarification → allow LLM to clarify within guardrails, do NOT increment followUpCount.
- raise_hand → pause timeBudgetSeconds countdown for configured duration.
Update the frontend to display node_progress data and send candidate commands via the data channel protocol.
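
A minimal sketch of the command validation step follows, assuming a per-command policy with a `max_per_node` limit and a `costs_follow_up` flag. `NodeState`, `CommandPolicy`, `handle_command`, and the `accepted`/`reason` fields on the emitted event are illustrative names, not part of the specification.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NodeState:
    node_id: str
    max_follow_ups: int
    follow_up_count: int = 0
    commands_used: dict[str, int] = field(default_factory=dict)


@dataclass
class CommandPolicy:
    max_per_node: int              # hypothetical per-command limit
    costs_follow_up: bool = False  # repeat and clarification do not cost one


def handle_command(state: NodeState, command: str,
                   policies: dict[str, CommandPolicy]) -> dict:
    """Validate a candidate command against policy and return the event to emit."""
    def event(accepted: bool, reason: Optional[str]) -> dict:
        return {"type": "candidate_command", "command": command,
                "accepted": accepted, "reason": reason}

    policy = policies.get(command)
    if policy is None:
        return event(False, "not_allowed")
    used = state.commands_used.get(command, 0)
    if used >= policy.max_per_node:
        return event(False, "max_per_node_exceeded")
    state.commands_used[command] = used + 1
    if policy.costs_follow_up:
        state.follow_up_count += 1
    return event(True, None)
```

Because the limits live in runtime state rather than in the prompt, the same utterance always produces the same decision, which is the determinism argued for in §2.2.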

2.4 Risks

Risk	Mitigation
Command classifier misclassifies an answer as a command (or vice versa)	Confidence threshold; fallback to treating ambiguous utterances as answers. Log misclassifications for retraining.
Candidate uses command strategically to waste time (e.g., repeated `raise_hand`)	`maxPerNode` limits enforced by runtime. Exceeded commands logged and ignored.
Runtime state diverges from FlowManager state	Add reconciliation check: runtime state and FlowManager node MUST agree. Emit `state_mismatch` alert if they diverge.
Latency increase from command classification	Command classifier MUST complete in <200ms. Use a fast local model or rule-based system, not the main LLM.
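
A rule-based classifier with a confidence threshold and an answer fallback might be sketched as below. The patterns, the coverage-based confidence measure, and the 0.5 default threshold are all illustrative assumptions; a real classifier would be tuned against the utterance corpus described in §2.5.

```python
import re

# Illustrative patterns only; a labelled corpus would drive the real ones.
_PATTERNS: dict[str, list[str]] = {
    "repeat": [r"\b(repeat|say) (that|the question)( again)?\b", r"\bcome again\b"],
    "clarification": [r"\bwhat do you mean\b", r"\bcan you clarify\b"],
    "raise_hand": [r"\bi need a (moment|minute|break)\b", r"\bplease pause\b"],
}


def classify_utterance(text: str, threshold: float = 0.5) -> tuple[str, float]:
    """Return (intent, confidence); below the threshold the utterance is an answer."""
    normalized = text.lower().strip()
    total_words = max(len(normalized.split()), 1)
    best_intent, best_conf = "answer", 0.0
    for intent, patterns in _PATTERNS.items():
        for pattern in patterns:
            match = re.search(pattern, normalized)
            if match is None:
                continue
            # Confidence = share of the utterance covered by the command phrase.
            conf = len(match.group(0).split()) / total_words
            if conf > best_conf:
                best_intent, best_conf = intent, conf
    if best_conf < threshold:
        return "answer", 1.0 - best_conf
    return best_intent, best_conf
```

Measuring how much of the utterance the command phrase covers means a long answer that happens to contain "repeat the question" falls back to `answer`, which matches the mitigation of treating ambiguous utterances as answers.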

2.5 Testing Strategy

Unit tests: Command classifier accuracy — test with a corpus of 500+ candidate utterances across all command types and edge cases.
State machine tests: Verify followUpCount increments correctly, pauses work, maxPerNode is enforced.
Integration tests: Full session with candidate commands — verify events are emitted, state is updated, and LLM responds correctly.
Adversarial tests: Candidate sends 10 repeat commands in a row with maxPerNode set to 3. Verify: the first 3 are honoured, the rest are rejected with a polite message, and followUpCount never increments for repeat.
Regression: Existing exams without commands continue to work normally.

2.6 Effect on Published Packages

Minimal. Published packages that don’t declare candidateCommands continue to work unchanged. Packages that want to support commands need to be re-published with the new candidateCommands section in the specification. This is opt-in.

Phase 3 — Evidence Target & Evidence Ledger

Goal: Attach structured evidence targets to questions, emit evidence_signal events during the exam, and produce a complete evidence ledger for the marking pipeline.

3.1 What Changes

Area	Before	After
Evidence targets in specification	Not present in flowJson	`evidenceTargets` array on each question node with `id`, `description`, `rubric`, `level`
Evidence detection	LLM judges evidence ad-hoc in its context; no structured output	LLM emits structured `evidence_signal` events during the exam; runtime validates and persists
Transcript span mapping	No link between evidence and transcript	`evidence_signal` includes `transcriptSpanId` linking to the exact transcript excerpt
Evidence ledger	Doesn’t exist; marking uses raw transcript	Structured ledger: per-evidence-target, with signal status, confidence, rationale, transcript excerpts
markingRuntime input	Raw transcript only	Structured input: evidence ledger + transcript + runtime audit

3.2 Why It Matters

Marking quality: Without structured evidence, the marking pipeline must re-analyse the entire transcript. This is expensive, slow, and inconsistent.
Auditability: Evidence signals with transcript links enable human markers to verify the AI’s assessment quickly.
Rubric alignment: Evidence targets in the specification create an explicit contract between the author’s intent and the runtime’s execution.

3.3 Engineering Tasks

Extend the specification schema with evidenceTargets on question nodes (see §02-schema.md).
Implement evidence detection in the LLM pipeline:
- After each candidate answer, the LLM evaluates which evidence targets have been addressed.
- Outputs a structured evidence_signal (not free text).
- This can be a separate LLM call (judge model) or a structured output from the main dialogue LLM.
Implement transcript span mapping:
- When evidence_signal is emitted, link it to the most recent transcript_final span(s) that contain the relevant content.
- Store transcriptSpanIds in the signal payload.
Implement evidence ledger persistence:
- New exam_evidence table keyed by (examId, sessionId, evidenceTargetId).
- Upsert on each evidence_signal — later signals can override earlier ones if confidence increases.
Build the markingRuntime input assembly:
- New service that, on exam_completed, assembles the full marking input: evidence ledger + transcript + runtime audit + specification snapshot.
- Persist and make available to the marking pipeline.
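
The upsert rule for the evidence ledger (later signals override earlier ones only if confidence increases) can be sketched with an in-memory dictionary standing in for the exam_evidence table. `EvidenceSignal`, its `status` values, and `upsert_signal` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvidenceSignal:
    evidence_target_id: str
    status: str                      # assumed values, e.g. "covered" or "uncertain"
    confidence: float
    rationale: str
    transcript_span_ids: list[str]


LedgerKey = tuple[str, str, str]     # (examId, sessionId, evidenceTargetId)


def upsert_signal(ledger: dict[LedgerKey, EvidenceSignal], exam_id: str,
                  session_id: str, signal: EvidenceSignal) -> bool:
    """Store the signal unless the existing entry has equal or higher confidence."""
    key = (exam_id, session_id, signal.evidence_target_id)
    current = ledger.get(key)
    if current is not None and signal.confidence <= current.confidence:
        return False
    ledger[key] = signal
    return True
```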

3.4 Risks

Risk	Mitigation
Evidence detection LLM hallucinates — marks evidence as “covered” when it isn’t	Use confidence threshold (e.g., 0.7); below threshold, mark as “uncertain” for human review. Always include rationale.
Evidence detection adds latency to each answer turn	Run evidence detection asynchronously after `transcript_final`; do not block the dialogue loop. Evidence signal may arrive seconds after the transcript.
Transcript span mapping is wrong — links evidence to the wrong excerpt	Use the most recent candidate `transcript_final` before the signal. Validate that the span text actually contains content relevant to the evidence target.
Marking pipeline doesn’t use the evidence ledger	Phase 3 delivers the data; marking pipeline integration is a separate workstream. Ensure the marking team is aligned on consuming the new input format.
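
Two of the mitigations above, the confidence threshold and the "most recent candidate span" rule, are simple enough to sketch directly. The function names, the status strings, and the `timestamp` field on spans are hypothetical; the 0.7 default mirrors the example threshold in the table.

```python
from typing import Optional


def resolve_status(covered: bool, confidence: float, threshold: float = 0.7) -> str:
    """Downgrade low-confidence judgements to 'uncertain' for human review."""
    if confidence < threshold:
        return "uncertain"
    return "covered" if covered else "not_covered"


def latest_candidate_span(spans: list[dict], signal_time: float) -> Optional[str]:
    """Return the spanId of the most recent candidate transcript_final before the signal."""
    candidates = [
        span for span in spans
        if span["speaker"] == "candidate" and span["timestamp"] <= signal_time
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda span: span["timestamp"])["spanId"]
```

The span lookup only narrows the search; the relevance check mentioned in the table (does the span text actually address the target?) would still need to run on the returned span.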

3.5 Testing Strategy

Unit tests: Evidence signal schema validation. Transcript span linking correctness.
Integration tests: Full session with evidence detection — verify ledger is complete and accurate.
Accuracy tests: Run 50+ recorded exams through the evidence detector; compare against human-annotated ground truth. Target: >85% agreement.
Edge case tests: Candidate gives a one-word answer. Candidate gives a rambling answer that partially covers multiple evidence targets. Candidate contradicts themselves.
Regression: Existing exams without evidence targets continue to work. The evidence ledger is simply empty for those exams.
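
The accuracy target above can be checked with simple percent agreement between detector output and human annotation, as in this sketch. `agreement_rate` and the per-target status dictionaries are assumptions; the chapter does not say which agreement statistic is intended.

```python
def agreement_rate(predicted: dict[str, str], annotated: dict[str, str]) -> float:
    """Share of evidence targets where detector and human annotator agree."""
    shared = predicted.keys() & annotated.keys()
    if not shared:
        raise ValueError("no evidence targets in common")
    matches = sum(predicted[target] == annotated[target] for target in shared)
    return matches / len(shared)


def meets_target(rate: float, target: float = 0.85) -> bool:
    # The stated target is strictly greater than 85% agreement.
    return rate > target
```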

3.6 Effect on Published Packages

Opt-in additive. Existing published packages don’t have evidenceTargets and continue to work. Packages re-published with evidenceTargets get structured evidence collection. The marking pipeline MUST handle both: exams with evidence ledgers (new) and exams with raw transcripts only (legacy).

Phase 4 — Hard Follow-Up & Transition Policy

Goal: Enforce follow-up limits and transition policies at the runtime level, preventing the LLM from exceeding author-defined constraints.

4.1 What Changes

Area	Before	After
Follow-up enforcement	Follow-up limit is a prompt instruction; LLM may exceed it	Runtime tracks `followUpCount`; blocks LLM from generating follow-up when limit reached
Transition authority	LLM decides when to move to the next node	Runtime approves all transitions; LLM proposes, runtime decides
Transition blocking	No mechanism to prevent LLM from jumping nodes	Runtime blocks unauthorised transitions; emits `guardrail_violation`
Transition decision log	No record of why transitions happened	`transition_decision` event with `decision`, `reason`, `targetNodeId`
Time budget enforcement	Time budget is a prompt hint	Runtime enforces: warns at 80%, hard-moves at 100%
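
The 80% warning and 100% hard move can be expressed as a small pure function. `time_budget_action` and the `warned` flag are hypothetical; the event names follow the engineering tasks in §4.3.

```python
def time_budget_action(elapsed_seconds: float, budget_seconds: float,
                       warned: bool) -> str:
    """Return 'none', 'time_budget_warning', or 'time_budget_exceeded'."""
    if budget_seconds <= 0:
        raise ValueError("budget_seconds must be positive")
    ratio = elapsed_seconds / budget_seconds
    if ratio >= 1.0:
        return "time_budget_exceeded"
    if ratio >= 0.8 and not warned:
        # Warn once; the caller records that the warning was sent.
        return "time_budget_warning"
    return "none"
```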

4.2 Why It Matters

Fairness: If the LLM can exceed follow-up limits, some candidates get more chances than others. This is a serious assessment integrity issue.
Structural integrity: The author designed a specific flow. The LLM should not be able to deviate from it. Runtime enforcement guarantees this.
Auditability: Transition decisions are now logged with reasons. This is essential for appeals, quality assurance, and exam reviews.

4.3 Engineering Tasks

Implement follow-up counter in runtime controller:
- Increment on each LLM-generated follow-up question.
- Decrement policy: NEVER (follow-ups are permanent).
- When followUpCount >= maxFollowUps, inject a “move to next question” instruction into the LLM context instead of allowing another follow-up.
Implement transition approval gate:
- LLM signals intent to transition (via structured output or a special token).
- Runtime checks: Is the target in allowedTargets? Is the transition condition satisfied?
- If approved: emit transition_decision with decision: "move_to_next_node".
- If blocked: emit transition_decision with decision: "blocked" and re-inject the current node’s prompt.
Implement time budget enforcement:
- Runtime tracks elapsed time per node.
- At 80% of budget: emit time_budget_warning event.
- At 100%: emit time_budget_exceeded and force transition (per overrunPolicy).
Implement guardrail violation handling:
- When the LLM generates text that violates a forbidden rule, block it.
- Emit guardrail_violation event.
- Regenerate the response without the violation.
Add transition_decision event to the event protocol.
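
The follow-up counter and the transition approval gate could be sketched as follows. `NodeRuntime`, `request_follow_up`, `decide_transition`, and the `reason` strings are illustrative; how the LLM signals its intent and how transition conditions are evaluated are left open by this chapter, so `condition_met` is simply passed in.

```python
from dataclasses import dataclass


@dataclass
class NodeRuntime:
    node_id: str
    allowed_targets: set[str]
    max_follow_ups: int
    follow_up_count: int = 0


def request_follow_up(node: NodeRuntime) -> bool:
    """Count a follow-up if the limit allows it; the counter never decrements."""
    if node.follow_up_count >= node.max_follow_ups:
        return False
    node.follow_up_count += 1
    return True


def decide_transition(node: NodeRuntime, target_node_id: str,
                      condition_met: bool) -> dict:
    """Build the transition_decision event for an LLM-proposed transition."""
    if target_node_id not in node.allowed_targets:
        decision, reason = "blocked", "target_not_allowed"
    elif not condition_met:
        decision, reason = "blocked", "condition_not_satisfied"
    else:
        decision, reason = "move_to_next_node", "approved"
    return {"type": "transition_decision", "decision": decision,
            "reason": reason, "targetNodeId": target_node_id}
```

Both checks are plain in-memory comparisons with no network calls, which is what keeps the gate within the latency bound stated in §4.4.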

4.4 Risks

Risk	Mitigation
Runtime blocks a transition that the LLM correctly identified as appropriate	Transition conditions are authored; if they’re too strict, the author should adjust. Log blocked transitions for review.
Forced transition due to time budget feels abrupt to the candidate	The LLM is instructed to provide a graceful bridge: “We’re running short on time, so let’s move to the next question.”
LLM ignores the “move to next question” instruction after follow-up limit	If the LLM generates another follow-up despite the instruction, the runtime MUST block it and inject the next node’s stem directly.
Transition approval adds latency	Transition checks are pure in-memory logic — MUST complete in <10ms. No network calls.

4.5 Testing Strategy

Unit tests: Follow-up counter: increments, caps at max, never decrements. Transition approval: allowed and blocked cases.
State machine tests: Full state machine simulation — verify all paths through the node graph with various follow-up counts and time budgets.
Adversarial tests: LLM prompt injection attempts — try to get the LLM to skip a question, reveal the rubric, or exceed follow-up limits. Verify runtime blocks all of these.
Integration tests: Full session with hard transitions — verify the candidate experience is smooth even when transitions are forced.
Regression: Existing exams continue to work. The new enforcement is additive — it only activates for IRs that declare transitionPolicy and maxFollowUps.

4.6 Effect on Published Packages

Backward-compatible. Existing published packages that don’t declare transitionPolicy or maxFollowUps continue to use the current LLM-decided transitions. Packages that declare these get runtime enforcement. This is opt-in until Phase 5 makes it mandatory.

Phase 5 — Promote flowJson to Formal IOA-ORM

Goal: Make the IOA-ORM the single source of truth for exam runtime configuration. flowJson becomes a legacy compatibility layer.

5.1 What Changes

Area	Before	After
Source of truth	`flowJson` is compiled and passed to Pipecat directly	`IOA-ORM` is the source of truth; Pipecat config is an adapter output
Compilation pipeline	`AssessmentPackage → flowJson → Pipecat`	`AssessmentPackage → IOA-ORM → (adapter) → Pipecat config`
Versioning	No formal versioning on flowJson	`irVersion` field in the specification; semantic versioning; backward compatibility rules
Backward compatibility	N/A	Published packages with old flowJson are auto-migrated to IOA-ORM v1.0.0 on first use
Schema validation	Ad-hoc	JSON Schema for ExamRuntimeIR; CI validation; runtime validation on load
Pipecat adapter	flowJson IS the Pipecat config	Separate adapter module that compiles specification → Pipecat config; isolates Pipecat-specific concerns
Documentation	Scattered across code comments	Formal specification (this document suite); API docs; migration guides

5.2 Why It Matters

Single source of truth: Eliminates the drift between what the author intended and what the runtime executes.
Portability: If Pipecat is replaced or supplemented, only the adapter changes. The specification and the runtime controller are unaffected.
Ecosystem: Other tools (analytics, reporting, quality assurance) can consume the specification directly, without understanding Pipecat internals.
Governance: Versioned specification enables controlled evolution, deprecation policies, and migration tooling.

5.3 Engineering Tasks

Finalise the specification schema (all fields from Phases 1–4, plus metadata, versioning, and any remaining gaps).
Implement the Pipecat adapter module:
- Input: ExamRuntimeIR.
- Output: Pipecat FlowManager config.
- The adapter MUST NOT add domain logic — it is a pure translation layer.
Implement IR versioning and migration:
- irVersion follows semver.
- Migration tool converts old flowJson → IOA-ORM v1.0.0.
- Breaking changes increment major version; migration tool provided.
Implement schema validation:
- JSON Schema published and versioned.
- CI: validate specification on build.
- Runtime: validate specification on load; reject invalid specifications with clear errors.
Auto-migrate existing published packages:
- On first access, detect old flowJson format.
- Convert to IOA-ORM v1.0.0 and persist.
- Original flowJson preserved for rollback.
Update the Assessment Studio to compile to IOA-ORM instead of flowJson.
Deprecate flowJson:
- Add deprecation warnings to flowJson code paths.
- Set a sunset date (e.g., 6 months after Phase 5 ships).
- After sunset, flowJson code paths are removed.
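
The auto-migration steps (detect, convert, persist, keep the original for rollback) can be outlined as below. The detection rule (a legacy package has no `irVersion` field), the record layout, and all function names are assumptions; the actual flowJson-to-IOA-ORM conversion is not specified here, so it is passed in as a caller-supplied `convert` function.

```python
import copy
from typing import Callable

TARGET_IR_VERSION = "1.0.0"


def is_legacy_flow_json(package: dict) -> bool:
    """Assumption: legacy packages carry no irVersion field."""
    return "irVersion" not in package


def migrate_package(package: dict, convert: Callable[[dict], dict]) -> dict:
    """Return a migrated record; the original flowJson is kept for rollback."""
    if not is_legacy_flow_json(package):
        return package  # already migrated, so repeated access is a no-op
    original = copy.deepcopy(package)
    # Give the converter its own copy so it cannot alter the preserved original.
    ir = convert(copy.deepcopy(package))
    return {"irVersion": TARGET_IR_VERSION, "ir": ir, "legacyFlowJson": original}


def rollback_package(record: dict) -> dict:
    """Recover the preserved flowJson, or return the input if nothing was migrated."""
    return copy.deepcopy(record.get("legacyFlowJson", record))
```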

5.4 Risks

Risk	Mitigation
Auto-migration introduces bugs in existing exams	Run migration in dry-run mode first; compare generated specification against expected output. Validate 100% of existing published packages before enabling auto-migration.
Breaking change in specification forces all packages to be re-published	Semver policy: major version bump for breaking changes. Migration tool provided. Each major version stays supported until at least two newer major versions have shipped.
Pipecat adapter introduces bugs	Adapter is a pure function — easily testable. Comprehensive test suite mapping specification → expected Pipecat config for every node type.
Team resists the migration because flowJson “works fine”	Phase 5 is the culmination — by this point, the team has already seen the value of events, evidence, and runtime control in Phases 1–4. Phase 5 just formalises it.
External integrations depend on flowJson format	Provide a compatibility shim that produces flowJson from specification. Deprecate the shim on the same timeline.
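
The semver policy in the table can be made concrete with a small compatibility check. This is one possible reading, assuming a runtime keeps loading specifications whose major version is at most two behind its own; `parse_semver`, `is_breaking`, and `is_supported` are hypothetical helpers.

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse 'v1.2.3', '1.2.3-beta' or '1.2.3+build' into (major, minor, patch)."""
    core = version.removeprefix("v").split("-", 1)[0].split("+", 1)[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def is_breaking(old_version: str, new_version: str) -> bool:
    """A change is breaking when the major version increases."""
    return parse_semver(new_version)[0] > parse_semver(old_version)[0]


def is_supported(ir_version: str, runtime_version: str, majors_back: int = 2) -> bool:
    """Assumed policy: accept specifications up to `majors_back` major versions old."""
    gap = parse_semver(runtime_version)[0] - parse_semver(ir_version)[0]
    return 0 <= gap <= majors_back
```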

5.5 Testing Strategy

Migration tests: Run migration on every existing published package. Verify: specification is valid, Pipecat adapter output matches original flowJson (where applicable), no candidate-facing changes.
Schema validation tests: Valid IRs pass; invalid IRs are rejected with specific error messages.
Adapter tests: For every node type, verify adapter output matches expected Pipecat config.
End-to-end tests: Full exam session using IOA-ORM as source of truth. Verify: events, evidence, commands, transitions, marking input — all correct.
Performance tests: Specification compilation + adapter MUST complete in <500ms for a typical exam.
Regression: All existing tests continue to pass.

5.6 Effect on Published Packages

This is the migration phase. All existing published packages are affected: they are auto-migrated from flowJson to IOA-ORM v1.0.0. After migration, they continue to work exactly as before — but now through the specification pipeline.

Post-migration:

New packages MUST be published as IOA-ORM.
Old packages continue to work via auto-migration.
flowJson is deprecated with a sunset date.

Phase Summary

Phase	Duration Estimate	Key Deliverable	Breaking?
0 — Institutional Readiness	2–3 weeks (parallel)	Examiner training, student prep, pilot planning	No
1 — Event Protocol & Transcript Closure	3–4 weeks	Reliable event stream + persisted transcripts	No
2 — Node State & Candidate Commands	3–4 weeks	Runtime state + candidate command handling	No (opt-in)
3 — Evidence Target & Ledger	4–5 weeks	Structured evidence for marking	No (opt-in)
4 — Hard Follow-Up & Transition Policy	3–4 weeks	Runtime-enforced constraints	No (opt-in)
5 — Promote to IOA-ORM	4–6 weeks	IOA-ORM as source of truth + migration	Auto-migration for all packages

Total estimated duration: 17–23 weeks (roughly 4–5 months) for Phases 1–5 run back to back by one team, with the Phase 0 institutional readiness workstream running in parallel.

Phases 1–4 can partially overlap — they build on each other but each delivers independent value. Phase 0 runs in parallel with Phases 1–3. Phase 5 depends on all previous phases being stable.

Dependency Graph

Phase 0 ───────────────────────────────┐  (runs in parallel with Phases 1–3)
                                       ▼
Phase 1 ──▶ Phase 2 ──▶ Phase 4 ──▶ Phase 5
   │                      ▲
   └──────▶ Phase 3 ──────┘

Phase 0 runs in parallel with Phases 1–3. It must complete before Phase 5 (which involves all published packages) but does not block engineering phases.
Phase 2 depends on Phase 1 (events are needed for state tracking).
Phase 3 depends on Phase 1 (transcript spans are needed for evidence linking).
Phase 4 depends on Phases 2 and 3 (follow-up counting needs node state; transition conditions may reference evidence coverage).
Phase 5 depends on all previous phases being stable and shipped.
Phase 5 also depends on Phase 0: examiner training and student preparation must be complete before mass migration.
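
The dependencies listed above can be encoded as data and checked mechanically, for example to confirm there are no cycles and to see which phases may run side by side. The sketch uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); `DEPENDS_ON` and `parallel_waves` are illustrative names.

```python
from graphlib import TopologicalSorter

# Each phase maps to the set of phases it depends on, as listed above.
DEPENDS_ON: dict[int, set[int]] = {
    0: set(),
    1: set(),
    2: {1},
    3: {1},
    4: {2, 3},
    5: {0, 1, 2, 3, 4},
}


def parallel_waves(depends_on: dict[int, set[int]]) -> list[list[int]]:
    """Group phases into waves; phases within a wave do not depend on each other."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves: list[list[int]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For the dependencies above this yields the waves `[[0, 1], [2, 3], [4], [5]]`: Phases 2 and 3 can proceed side by side once Phase 1 is in place, and Phase 5 comes last.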

Revision History

Version	Date	Changes
v0.2.0	2026-06-30	Updated migration plan for IOA-ORM naming. Adjusted phase dependencies.
v0.1.0	2026-05-06	Initial release.