Migration Plan

Draft · v0.2.0 · 2026-06-30

This chapter defines an incremental migration from the current flowJson-based architecture to the full IOA-ORM: five engineering phases (Phases 1–5) plus a parallel institutional readiness workstream (Phase 0). Each phase is independently shippable, backward-compatible, and testable. No phase requires a full rewrite; each builds on the previous one.

Guiding principles:

  • Incremental over revolutionary. Each phase delivers user-visible or engineering-visible value on its own.
  • Backward-compatible by default. Existing published packages MUST continue to work. New features are additive.
  • Feature-flagged. Each phase uses server-side feature flags so it can be enabled per-exam or per-tenant.
  • Reversible. Any phase can be rolled back without data loss.
  • Pedagogy-driven. Each phase is justified by assessment quality goals, not just engineering goals. The migration improves validity, reliability, and fairness — not just technical capability.

Phase 0 — Institutional Readiness (Parallel Workstream)

Goal: Prepare the human and institutional infrastructure for AI-powered oral assessment. This phase runs in parallel with Phases 1–3 and addresses the adoption barriers that Akimov & Malin (2020) and Fenton (2025) identify as critical to successful oral exam implementation.

Theoretical grounding: Akimov & Malin (2020) report that “administering oral examinations for larger classes would present several additional challenges because it would either require multiple examiners, which in turn could raise inter-rater consistency issues” (p. 1205). Fenton (2025) explicitly recommends “a training or shadowing program with experienced instructors leading novices” and “all examiners should receive training in oral assessment procedures.” Bayley et al. (2024) found that approximately 75% of students engaged with a practice ConVOE before the real assessment, highlighting the importance of student preparation.

| Deliverable | Description | Owner |
|---|---|---|
| Training materials | How to author evidence targets, design follow-up banks, set time budgets, and calibrate assessment standards | Assessment design team |
| Calibration exercises | Pre-scored sample exams where new examiners practice generating evidence signals and compare against ground truth | Assessment design team |
| Shadowing protocol | New examiners observe experienced examiners conducting IOAs before designing their own | Academic development |
| Bias awareness module | Training on examiner bias (Akimov & Malin, 2020: “a large number of examiners acknowledged the presence of various biases”), language sensitivity, and cultural communication styles | DEI / accessibility team |

| Deliverable | Description | Owner |
|---|---|---|
| Student-facing documentation | What to expect, how to prepare, what commands are available, time structure | Assessment design team |
| Practice exam | A low-stakes familiarisation exam using the specification’s scaffolding feature (see INFOSYS110 example, §10.11.8) | Assessment design team |
| Accessibility assessment | Identify candidates who may need accommodations (extra time, alternative formats, language support) | Accessibility team |
| Anxiety reduction resources | Information about the exam format, sample questions, and tips for managing exam anxiety (Fenton, 2025: “as much student-facing support as you can to reduce anxiety”) | Student wellbeing |

| Deliverable | Description | Owner |
|---|---|---|
| Pilot course selection | Select 2–3 courses for initial deployment; prefer low-stakes or formative assessments first | Assessment design team |
| Feedback loops | Establish candidate and examiner feedback mechanisms (surveys, focus groups) | Quality assurance |
| Baseline metrics | Collect pre-migration data on student satisfaction, assessment outcomes, and marking consistency for comparison | Analytics team |

Theoretical grounding: Bayley et al. (2024) implemented ConVOEs for 600+ students. Their key lesson: concurrent administration requires standardised question formats, parallel grading, and technology failure handling. The migration plan must account for these at the institutional level.

| Deliverable | Description | Owner |
|---|---|---|
| Capacity planning | Estimate concurrent session capacity; load test infrastructure | Platform team |
| Question pool design | For large cohorts, design question pools with equivalent difficulty variants (Bayley et al., 2024) | Assessment design team |
| Moderation workflow | Design human review process for AI-generated evidence signals | Assessment design + QA |
| Technology failure protocol | Establish SLA for bot crashes, STT failures, and disconnections during concurrent sessions | Platform + support teams |


Phase 1 — Event Protocol & Transcript Closure

Goal: Establish a reliable event stream and ensure exam transcripts are complete, structured, and persisted.

| Area | Before | After |
|---|---|---|
| Event emission | Ad-hoc; some events emitted via data channel, some lost | Every key lifecycle moment emits a typed event to the event store |
| bot_ready | Not consistently emitted | MUST emit when the bot session is established and ready |
| node_entered | Implicit in flowJson node switch | Explicit event with nodeId, nodeType, timestamp |
| node_exited | Not emitted | MUST emit before next node_entered |
| transcript_delta | Partial STT results sent to UI | Formalised event with nodeId, speaker, isFinal flag |
| transcript_final | Emitted but not reliably persisted | MUST be persisted server-side with nodeId, speaker, spanId |
| exam_completed | Inconsistent; some sessions end without it | MUST emit exactly once, guaranteed by runtime controller |
| UI event consumption | UI parses raw data channel messages | UI consumes typed events from a standardised event contract |
| Server transcript persistence | Transcripts scattered across logs | Unified transcript store, queryable by examId + sessionId |

  • Without reliable events, observability is impossible — debugging failed exams requires guessing.
  • Without transcript closure, the marking pipeline receives incomplete or unstructured data, producing unreliable marks.
  • exam_completed consistency is a prerequisite for triggering post-exam workflows (marking, analytics, candidate notification).
  1. Define event schema (EventProtocol types in §05-event-protocol.md). Implement as a TypeScript interface + JSON Schema for validation.
  2. Instrument the Pipecat bot lifecycle:
    • Emit bot_ready at session start.
    • Emit node_entered / node_exited around each FlowManager node transition.
    • Emit transcript_delta / transcript_final from STT pipeline.
  3. Implement event store persistence layer:
    • New exam_events table (or equivalent) with examId, sessionId, eventType, payload, timestamp.
    • Index on (examId, sessionId, timestamp).
  4. Guarantee exam_completed emission:
    • Add a finally-style hook in the runtime controller that fires exam_completed regardless of how the session ends (normal, timeout, error, disconnection).
  5. Update the frontend exam room to consume typed events instead of raw data channel messages. Create an event dispatcher that maps event types to UI update functions.
  6. Add transcript aggregation service:
    • Collects transcript_final events per session.
    • Produces an ordered, deduplicated transcript with span IDs.
    • Persists to exam_transcripts table.
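
As a rough illustration of steps 1 and 6, the sketch below models a few of the events as a TypeScript discriminated union and aggregates transcript_final events into an ordered, deduplicated transcript. Only the event names and the nodeId, nodeType, speaker, spanId, examId, sessionId and timestamp fields come from this plan; the text and reason fields, the Speaker values, and the helper names aggregateTranscript and finalSpan are assumptions made for the example, and the normative schema remains the one in §05-event-protocol.md.

```typescript
// Illustrative event shapes (assumed, not the normative EventProtocol types).
type Speaker = "candidate" | "examiner";

interface BaseEvent {
  examId: string;
  sessionId: string;
  timestamp: number; // assumed: milliseconds, monotonic within a session
}

interface NodeEnteredEvent extends BaseEvent {
  type: "node_entered";
  nodeId: string;
  nodeType: string;
}

interface TranscriptFinalEvent extends BaseEvent {
  type: "transcript_final";
  nodeId: string;
  speaker: Speaker;
  spanId: string;
  text: string; // assumed field carrying the recognised speech
}

interface ExamCompletedEvent extends BaseEvent {
  type: "exam_completed";
  reason: "normal" | "timeout" | "error" | "disconnection"; // assumed field
}

type ExamEvent = NodeEnteredEvent | TranscriptFinalEvent | ExamCompletedEvent;

// Keep the first transcript_final seen for each spanId, then order by timestamp.
function aggregateTranscript(events: ExamEvent[]): TranscriptFinalEvent[] {
  const bySpan = new Map<string, TranscriptFinalEvent>();
  for (const examEvent of events) {
    if (examEvent.type !== "transcript_final") continue;
    if (!bySpan.has(examEvent.spanId)) bySpan.set(examEvent.spanId, examEvent);
  }
  return Array.from(bySpan.values()).sort((a, b) => a.timestamp - b.timestamp);
}

// Hypothetical helper that builds a candidate transcript_final event for the demo.
const finalSpan = (spanId: string, text: string, timestamp: number): TranscriptFinalEvent => ({
  type: "transcript_final",
  examId: "exam-1",
  sessionId: "session-1",
  nodeId: "q1",
  speaker: "candidate",
  spanId,
  text,
  timestamp,
});

const sampleEvents: ExamEvent[] = [
  { type: "node_entered", examId: "exam-1", sessionId: "session-1", nodeId: "q1", nodeType: "question", timestamp: 10 },
  finalSpan("span-b", "It scales horizontally.", 30),
  finalSpan("span-a", "I would start with a queue.", 20),
  finalSpan("span-a", "I would start with a queue.", 25), // duplicate delivery of span-a
];

const orderedTranscript = aggregateTranscript(sampleEvents);
console.log(orderedTranscript.map((span) => `${span.spanId}: ${span.text}`));
```

Keeping the first event per span ID is one possible deduplication policy; a real aggregator might instead prefer the latest revision of a span if the STT provider re-issues corrected finals.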

| Risk | Mitigation |
|---|---|
| STT produces duplicate or overlapping final transcripts | Deduplicate by span ID; use monotonic timestamps |
| exam_completed fires before all transcripts are persisted | Add a flush-and-wait step before emitting exam_completed |
| Increased event volume impacts bot latency | Event emission MUST be async (fire-and-forget to a queue); MUST NOT block the dialogue loop |
| Existing bots emit different event formats | Adapter normalises legacy events during migration window |

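
The flush-and-wait mitigation and the exactly-once guarantee for exam_completed could be combined along the following lines. This is a minimal sketch, not the runtime controller's actual design: createFinalizer, runSession, the flushFailed payload field and the EndReason values are all names invented for the example.

```typescript
type EndReason = "normal" | "timeout" | "error" | "disconnection";
type Emit = (eventType: string, payload: Record<string, unknown>) => void;

// Hypothetical helper: returns a function that emits exam_completed at most once,
// and only after pending transcripts have been flushed.
function createFinalizer(flushTranscripts: () => Promise<void>, emit: Emit) {
  let completed = false;
  return async function completeExam(reason: EndReason): Promise<boolean> {
    if (completed) return false; // a second caller (e.g. timeout racing a disconnect) is a no-op
    completed = true;
    let flushFailed = false;
    try {
      await flushTranscripts(); // flush-and-wait before announcing completion
    } catch {
      flushFailed = true; // a failed flush must not suppress exam_completed
    }
    emit("exam_completed", { reason, flushFailed });
    return true;
  };
}

// Runs a session body and always finalizes, however the body ends.
async function runSession(
  body: () => Promise<void>,
  completeExam: (reason: EndReason) => Promise<boolean>,
): Promise<void> {
  let reason: EndReason = "normal";
  try {
    await body();
  } catch {
    reason = "error";
  } finally {
    await completeExam(reason);
  }
}

const emittedLog: string[] = [];
const completeExam = createFinalizer(
  async () => { emittedLog.push("flush"); },
  (eventType, payload) => { emittedLog.push(`${eventType}:${String(payload["reason"])}`); },
);

// Simulate a crashing session followed by a late, duplicate disconnect signal.
const demoRun: Promise<string[]> = runSession(async () => { throw new Error("bot crashed"); }, completeExam)
  .then(() => completeExam("disconnection"))
  .then(() => emittedLog);

demoRun.then((log) => console.log(log));
```

An in-process hook of this kind cannot run if the bot process is killed outright, so the chaos-test scenario would presumably also need a supervisor or server-side watchdog that finalizes orphaned sessions.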
  • Unit tests: Schema validation for every event type. Malformed payloads MUST be rejected with clear error messages.
  • Integration tests: Spin up a test bot session, verify every expected event is emitted in order and persisted.
  • Contract tests: Frontend event consumer tests — verify UI renders correctly for each event type.
  • Regression: Run existing exam flows; verify no change in candidate-facing behaviour.
  • Chaos test: Kill the bot mid-session; verify exam_completed still fires (from the guaranteed hook) and transcript is recoverable.

Impact on published packages: none. This phase adds events and persistence. It does not change the flowJson format or the candidate-facing experience. Existing published packages continue to work; they simply don’t emit the new events until re-published.


Phase 2 — Node State, Progress & Candidate Commands

Goal: Introduce runtime-managed node state, question progress tracking, and candidate command consumption.

| Area | Before | After |
|---|---|---|
| Runtime node state | Bot tracks which FlowManager node is active; no richer state | Runtime controller maintains per-node state: followUpCount, timeElapsed, evidenceCovered |
| Question progress | Not tracked; LLM decides when to move on | Runtime emits node_progress events; UI can show “Question 1 of 2 — Follow-up 1/2” |
| Candidate commands | repeat, clarification, raise_hand are UI-only or handled ad-hoc by LLM prompt | Runtime intercepts candidate commands, applies policy, emits candidate_command events |
| Data channel command protocol | No standardised candidate→bot command channel | Formalised command protocol: candidate sends typed command via data channel, runtime validates and routes |
| node_progress event | Doesn’t exist | New event emitted on every state change within a node |

  • Progress visibility: Candidates and proctors need to see where they are in the exam. Without runtime state, the UI is blind.
  • Command determinism: Candidate commands currently rely on the LLM correctly interpreting intent from natural language. This is fragile. Runtime interception provides deterministic, auditable command handling.
  • Follow-up counting: Without runtime tracking, the LLM can exceed the author’s intended follow-up limit. This is a fairness issue.
  1. Implement runtime node state store:
    • In-memory state object per active session.
    • Schema: { nodeId, followUpCount, maxFollowUps, timeBudgetSeconds, timeElapsed, evidenceCovered: string[], candidateCommandsUsed: [...] }.
    • Emit node_progress on every state mutation.
  2. Implement candidate command classifier:
    • Receives STT output for candidate utterances.
    • Classifies intent: repeat, clarification, raise_hand, or answer.
    • Uses a lightweight classifier (rule-based or small model) — NOT the main LLM, to avoid latency and cost.
  3. Implement data channel command protocol:
    • Candidate UI sends: { type: "candidate_command", command: "repeat" }.
    • Runtime validates: Is this command allowed at this point? Is it within maxPerNode? Does it cost a follow-up?
    • Routes to appropriate handler (re-prompt, clarification, pause timer).
  4. Wire candidate commands to runtime state:
    • repeat → re-emit the current prompt, do NOT increment followUpCount.
    • clarification → allow LLM to clarify within guardrails, do NOT increment followUpCount.
    • raise_hand → pause timeBudgetSeconds countdown for configured duration.
  5. Update the frontend to display node_progress data and send candidate commands via the data channel protocol.
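
The interaction between the node state store and command handling (steps 1, 3 and 4) might look roughly like this. The NodeState fields follow the schema in step 1; the handleCommand function, the RuntimeEvent shape, the action names, and the reading of maxPerNode as a limit per command type are assumptions made for illustration.

```typescript
type CandidateCommand = "repeat" | "clarification" | "raise_hand";
type CommandAction = "re_prompt" | "clarify" | "pause_timer"; // assumed handler names

interface NodeState {
  nodeId: string;
  followUpCount: number;
  maxFollowUps: number;
  timeBudgetSeconds: number;
  timeElapsed: number;
  evidenceCovered: string[];
  candidateCommandsUsed: CandidateCommand[];
}

interface RuntimeEvent {
  type: "candidate_command" | "node_progress";
  nodeId: string;
  payload: Record<string, unknown>;
}

const ACTION_FOR: Record<CandidateCommand, CommandAction> = {
  repeat: "re_prompt",
  clarification: "clarify",
  raise_hand: "pause_timer",
};

// Validates a command against maxPerNode (assumed to apply per command type),
// records it, and returns the handler to route to, or null when rejected.
function handleCommand(
  nodeState: NodeState,
  command: CandidateCommand,
  maxPerNode: number,
  emit: (runtimeEvent: RuntimeEvent) => void,
): CommandAction | null {
  const used = nodeState.candidateCommandsUsed.filter((c) => c === command).length;
  const accepted = used < maxPerNode;
  emit({ type: "candidate_command", nodeId: nodeState.nodeId, payload: { command, accepted } });
  if (!accepted) return null;

  // State mutation: followUpCount is deliberately left untouched for all three commands.
  nodeState.candidateCommandsUsed.push(command);
  emit({
    type: "node_progress",
    nodeId: nodeState.nodeId,
    payload: { followUpCount: nodeState.followUpCount, maxFollowUps: nodeState.maxFollowUps },
  });
  return ACTION_FOR[command];
}

const questionOne: NodeState = {
  nodeId: "q1",
  followUpCount: 0,
  maxFollowUps: 2,
  timeBudgetSeconds: 180,
  timeElapsed: 0,
  evidenceCovered: [],
  candidateCommandsUsed: [],
};

const emittedEvents: RuntimeEvent[] = [];
const commandResults = [1, 2, 3, 4, 5].map(() =>
  handleCommand(questionOne, "repeat", 3, (runtimeEvent) => { emittedEvents.push(runtimeEvent); }),
);
console.log(commandResults, questionOne.followUpCount);
```

Emitting candidate_command for rejected commands as well as accepted ones keeps the audit trail complete, which matches the requirement that exceeded commands are logged.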

| Risk | Mitigation |
|---|---|
| Command classifier misclassifies an answer as a command (or vice versa) | Confidence threshold; fallback to treating ambiguous utterances as answers. Log misclassifications for retraining. |
| Candidate uses command strategically to waste time (e.g., repeated raise_hand) | maxPerNode limits enforced by runtime. Exceeded commands logged and ignored. |
| Runtime state diverges from FlowManager state | Add reconciliation check: runtime state and FlowManager node MUST agree. Emit state_mismatch alert if they diverge. |
| Latency increase from command classification | Command classifier MUST complete in <200ms. Use a fast local model or rule-based system, not the main LLM. |

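
To make the confidence-threshold fallback concrete, here is a deliberately small rule-based classifier. The phrase lists, the scoring rule (share of the utterance's words accounted for by a known phrase) and the 0.5 threshold are all assumptions chosen for the example; a production classifier would be tuned against the utterance corpus described in the testing plan.

```typescript
type Intent = "repeat" | "clarification" | "raise_hand" | "answer";
type CommandIntent = Exclude<Intent, "answer">;

// Hypothetical trigger phrases; a real system would derive these from logged utterances.
const PHRASES: Record<CommandIntent, string[]> = {
  repeat: ["repeat the question", "say that again", "repeat that"],
  clarification: ["what do you mean", "can you clarify", "could you clarify"],
  raise_hand: ["i need a moment", "i need a pause", "raise hand"],
};

interface Classification {
  intent: Intent;
  confidence: number;
}

function classifyUtterance(utterance: string, threshold = 0.5): Classification {
  // Normalise: lowercase, strip punctuation, collapse whitespace.
  const text = utterance.toLowerCase().replace(/[^a-z\s]/g, "").replace(/\s+/g, " ").trim();
  const wordCount = text === "" ? 0 : text.split(" ").length;

  let best: Classification = { intent: "answer", confidence: 0 };
  for (const intent of Object.keys(PHRASES) as CommandIntent[]) {
    for (const phrase of PHRASES[intent]) {
      if (!text.includes(phrase)) continue;
      // Confidence = share of the utterance's words that the phrase accounts for,
      // so a command phrase buried in a long answer scores low.
      const confidence = phrase.split(" ").length / wordCount;
      if (confidence > best.confidence) best = { intent, confidence };
    }
  }
  // Ambiguous or unmatched utterances fall back to "answer".
  return best.confidence >= threshold ? best : { intent: "answer", confidence: 1 - best.confidence };
}

const shortCommand = classifyUtterance("Could you repeat the question?");
const politeClarify = classifyUtterance("Sorry, can you clarify?");
const longAnswer = classifyUtterance(
  "I would repeat the question to every stakeholder before designing the schema.",
);
console.log(shortCommand.intent, politeClarify.intent, longAnswer.intent);
```

Because the rules are plain string operations, this style of classifier runs in well under a millisecond per utterance, which leaves most of the 200ms budget for STT and transport.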
  • Unit tests: Command classifier accuracy — test with a corpus of 500+ candidate utterances across all command types and edge cases.
  • State machine tests: Verify followUpCount increments correctly, pauses work, maxPerNode is enforced.
  • Integration tests: Full session with candidate commands — verify events are emitted, state is updated, and LLM responds correctly.
  • Adversarial tests: Candidate sends 10 repeat commands in a row. Verify: with maxPerNode set to 3, the first 3 work and the rest are rejected with a polite message. Verify followUpCount never increments for repeat.
  • Regression: Existing exams without commands continue to work normally.

Impact on published packages: minimal. Published packages that don’t declare candidateCommands continue to work unchanged. Packages that want to support commands need to be re-published with the new candidateCommands section in the specification. This is opt-in.


Phase 3 — Evidence Target & Evidence Ledger

Goal: Attach structured evidence targets to questions, emit evidence_signal events during the exam, and produce a complete evidence ledger for the marking pipeline.

| Area | Before | After |
|---|---|---|
| Evidence targets in specification | Not present in flowJson | evidenceTargets array on each question node with id, description, rubric, level |
| Evidence detection | LLM judges evidence ad-hoc in its context; no structured output | LLM emits structured evidence_signal events during the exam; runtime validates and persists |
| Transcript span mapping | No link between evidence and transcript | evidence_signal includes transcriptSpanId linking to the exact transcript excerpt |
| Evidence ledger | Doesn’t exist; marking uses raw transcript | Structured ledger: per-evidence-target, with signal status, confidence, rationale, transcript excerpts |
| markingRuntime input | Raw transcript only | Structured input: evidence ledger + transcript + runtime audit |

  • Marking quality: Without structured evidence, the marking pipeline must re-analyse the entire transcript. This is expensive, slow, and inconsistent.
  • Auditability: Evidence signals with transcript links enable human markers to verify the AI’s assessment quickly.
  • Rubric alignment: Evidence targets in the specification create an explicit contract between the author’s intent and the runtime’s execution.
  1. Extend the specification schema with evidenceTargets on question nodes (see §02-schema.md).
  2. Implement evidence detection in the LLM pipeline:
    • After each candidate answer, the LLM evaluates which evidence targets have been addressed.
    • Outputs a structured evidence_signal (not free text).
    • This can be a separate LLM call (judge model) or a structured output from the main dialogue LLM.
  3. Implement transcript span mapping:
    • When evidence_signal is emitted, link it to the most recent transcript_final span(s) that contain the relevant content.
    • Store transcriptSpanIds in the signal payload.
  4. Implement evidence ledger persistence:
    • New exam_evidence table keyed by (examId, sessionId, evidenceTargetId).
    • Upsert on each evidence_signal — later signals can override earlier ones if confidence increases.
  5. Build the markingRuntime input assembly:
    • New service that, on exam_completed, assembles the full marking input: evidence ledger + transcript + runtime audit + specification snapshot.
    • Persist and make available to the marking pipeline.
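
The upsert rule in step 4 (a later signal replaces an earlier one only if its confidence is higher) can be sketched with an in-memory ledger. The EvidenceLedger class, the status values, and the use of a 0.7 review threshold to downgrade low-confidence signals to "uncertain" are assumptions for illustration; the real ledger would sit on the exam_evidence table.

```typescript
type SignalStatus = "covered" | "partial" | "not_covered"; // assumed status values
type LedgerStatus = SignalStatus | "uncertain";

interface EvidenceSignal {
  examId: string;
  sessionId: string;
  evidenceTargetId: string;
  status: SignalStatus;
  confidence: number; // assumed range 0..1
  rationale: string;
  transcriptSpanIds: string[];
}

interface LedgerEntry extends Omit<EvidenceSignal, "status"> {
  status: LedgerStatus;
}

// In-memory stand-in for the exam_evidence table, keyed by (examId, sessionId, evidenceTargetId).
class EvidenceLedger {
  private readonly entries = new Map<string, LedgerEntry>();
  private readonly reviewThreshold: number;

  constructor(reviewThreshold = 0.7) {
    this.reviewThreshold = reviewThreshold;
  }

  // Returns true when the signal was stored, false when an equal or stronger signal already exists.
  upsert(signal: EvidenceSignal): boolean {
    const key = [signal.examId, signal.sessionId, signal.evidenceTargetId].join("|");
    const existing = this.entries.get(key);
    if (existing !== undefined && signal.confidence <= existing.confidence) return false;
    // Below the threshold the entry is flagged for human review instead of trusted.
    const ledgerStatus: LedgerStatus = signal.confidence < this.reviewThreshold ? "uncertain" : signal.status;
    this.entries.set(key, { ...signal, status: ledgerStatus });
    return true;
  }

  list(): LedgerEntry[] {
    return Array.from(this.entries.values());
  }
}

// Hypothetical helper that builds a "covered" signal for target et-1.
const coveredSignal = (confidence: number, spanId: string): EvidenceSignal => ({
  examId: "exam-1",
  sessionId: "session-1",
  evidenceTargetId: "et-1",
  status: "covered",
  confidence,
  rationale: "Candidate described the trade-off explicitly.",
  transcriptSpanIds: [spanId],
});

const ledger = new EvidenceLedger();
const stored = [
  ledger.upsert(coveredSignal(0.6, "span-a")), // stored, but marked uncertain
  ledger.upsert(coveredSignal(0.9, "span-b")), // higher confidence overrides
  ledger.upsert(coveredSignal(0.8, "span-c")), // lower than current, ignored
];
console.log(stored, ledger.list());
```

Whether a lower-confidence signal with a different status should ever displace a higher-confidence one (for example a later contradiction by the candidate) is a policy question this sketch does not settle.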

| Risk | Mitigation |
|---|---|
| Evidence detection LLM hallucinates: marks evidence as “covered” when it isn’t | Use confidence threshold (e.g., 0.7); below threshold, mark as “uncertain” for human review. Always include rationale. |
| Evidence detection adds latency to each answer turn | Run evidence detection asynchronously after transcript_final; do not block the dialogue loop. Evidence signal may arrive seconds after the transcript. |
| Transcript span mapping is wrong: links evidence to the wrong excerpt | Use the most recent candidate transcript_final before the signal. Validate that the span text actually contains content relevant to the evidence target. |
| Marking pipeline doesn’t use the evidence ledger | Phase 3 delivers the data; marking pipeline integration is a separate workstream. Ensure the marking team is aligned on consuming the new input format. |

  • Unit tests: Evidence signal schema validation. Transcript span linking correctness.
  • Integration tests: Full session with evidence detection — verify ledger is complete and accurate.
  • Accuracy tests: Run 50+ recorded exams through the evidence detector; compare against human-annotated ground truth. Target: >85% agreement.
  • Edge case tests: Candidate gives a one-word answer. Candidate gives a rambling answer that partially covers multiple evidence targets. Candidate contradicts themselves.
  • Regression: Existing exams without evidence targets continue to work. The evidence ledger is simply empty for those exams.

Impact on published packages: opt-in and additive. Existing published packages don’t have evidenceTargets and continue to work. Packages re-published with evidenceTargets get structured evidence collection. The marking pipeline MUST handle both: exams with evidence ledgers (new) and exams with raw transcripts only (legacy).


Phase 4 — Hard Follow-Up & Transition Policy

Goal: Enforce follow-up limits and transition policies at the runtime level, preventing the LLM from exceeding author-defined constraints.

| Area | Before | After |
|---|---|---|
| Follow-up enforcement | Follow-up limit is a prompt instruction; LLM may exceed it | Runtime tracks followUpCount; blocks LLM from generating follow-up when limit reached |
| Transition authority | LLM decides when to move to the next node | Runtime approves all transitions; LLM proposes, runtime decides |
| Transition blocking | No mechanism to prevent LLM from jumping nodes | Runtime blocks unauthorised transitions; emits guardrail_violation |
| Transition decision log | No record of why transitions happened | transition_decision event with decision, reason, targetNodeId |
| Time budget enforcement | Time budget is a prompt hint | Runtime enforces: warns at 80%, hard-moves at 100% |

  • Fairness: If the LLM can exceed follow-up limits, some candidates get more chances than others. This is a serious assessment integrity issue.
  • Structural integrity: The author designed a specific flow. The LLM should not be able to deviate from it. Runtime enforcement guarantees this.
  • Auditability: Transition decisions are now logged with reasons. This is essential for appeals, quality assurance, and exam reviews.
  1. Implement follow-up counter in runtime controller:
    • Increment on each LLM-generated follow-up question.
    • Decrement policy: NEVER (follow-ups are permanent).
    • When followUpCount >= maxFollowUps, inject a “move to next question” instruction into the LLM context instead of allowing another follow-up.
  2. Implement transition approval gate:
    • LLM signals intent to transition (via structured output or a special token).
    • Runtime checks: Is the target in allowedTargets? Is the transition condition satisfied?
    • If approved: emit transition_decision with decision: "move_to_next_node".
    • If blocked: emit transition_decision with decision: "blocked" and re-inject the current node’s prompt.
  3. Implement time budget enforcement:
    • Runtime tracks elapsed time per node.
    • At 80% of budget: emit time_budget_warning event.
    • At 100%: emit time_budget_exceeded and force transition (per overrunPolicy).
  4. Implement guardrail violation handling:
    • When the LLM generates text that violates a forbidden rule, block it.
    • Emit guardrail_violation event.
    • Regenerate the response without the violation.
  5. Add transition_decision event to the event protocol.
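
Steps 1 and 2 above amount to a small piece of in-memory policy logic. The sketch below shows one way to express it; the NodeGate class, the reason strings, and the idea of passing the transition condition in as a pre-evaluated boolean are assumptions, while maxFollowUps, allowedTargets and the transition_decision fields (decision, reason, targetNodeId) come from this plan.

```typescript
interface NodePolicy {
  nodeId: string;
  maxFollowUps: number;
  allowedTargets: string[];
}

interface TransitionDecision {
  decision: "move_to_next_node" | "blocked";
  reason: string; // assumed free-form reason code
  targetNodeId: string;
}

class NodeGate {
  private followUpCount = 0;
  private readonly policy: NodePolicy;

  constructor(policy: NodePolicy) {
    this.policy = policy;
  }

  // Returns true if another follow-up is allowed. The counter only ever increases.
  tryFollowUp(): boolean {
    if (this.followUpCount >= this.policy.maxFollowUps) return false;
    this.followUpCount += 1;
    return true;
  }

  // The LLM proposes a target; the runtime decides. conditionSatisfied stands in for
  // whatever authored transition condition the runtime has already evaluated.
  decideTransition(targetNodeId: string, conditionSatisfied: boolean): TransitionDecision {
    if (this.policy.allowedTargets.indexOf(targetNodeId) === -1) {
      return { decision: "blocked", reason: "target_not_allowed", targetNodeId };
    }
    if (!conditionSatisfied) {
      return { decision: "blocked", reason: "condition_not_satisfied", targetNodeId };
    }
    return { decision: "move_to_next_node", reason: "condition_satisfied", targetNodeId };
  }
}

const gate = new NodeGate({ nodeId: "q1", maxFollowUps: 2, allowedTargets: ["q2"] });

const followUps = [gate.tryFollowUp(), gate.tryFollowUp(), gate.tryFollowUp()];
const decisions = [
  gate.decideTransition("q3", true), // LLM tries to skip ahead
  gate.decideTransition("q2", false), // allowed target, condition not yet met
  gate.decideTransition("q2", true), // approved
];
console.log(followUps, decisions.map((d) => `${d.decision} (${d.reason})`));
```

Each TransitionDecision returned here would be emitted as a transition_decision event, so blocked proposals are as visible in the audit log as approved ones.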

| Risk | Mitigation |
|---|---|
| Runtime blocks a transition that the LLM correctly identified as appropriate | Transition conditions are authored; if they’re too strict, the author should adjust. Log blocked transitions for review. |
| Forced transition due to time budget feels abrupt to the candidate | The LLM is instructed to provide a graceful bridge: “We’re running short on time, so let’s move to the next question.” |
| LLM ignores the “move to next question” instruction after follow-up limit | If the LLM generates another follow-up despite the instruction, the runtime MUST block it and inject the next node’s stem directly. |
| Transition approval adds latency | Transition checks are pure in-memory logic and MUST complete in <10ms. No network calls. |

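
The time budget rule (warn at 80%, force a transition at 100%) is likewise pure in-memory logic. A minimal sketch, assuming a per-node timer object that the runtime advances on each tick; the NodeTimer shape, the advanceTimer function and the paused flag (standing in for a raise_hand pause) are hypothetical.

```typescript
type TimeBudgetEvent = "time_budget_warning" | "time_budget_exceeded";

interface NodeTimer {
  timeBudgetSeconds: number;
  timeElapsed: number;
  warned: boolean;
  exceeded: boolean;
  paused: boolean; // assumed flag set while a raise_hand pause is active
}

// Advances the timer and returns the events that should be emitted for this tick.
// Each event fires at most once per node.
function advanceTimer(timer: NodeTimer, deltaSeconds: number): TimeBudgetEvent[] {
  if (timer.paused || timer.exceeded) return [];
  timer.timeElapsed += deltaSeconds;

  const fired: TimeBudgetEvent[] = [];
  if (!timer.warned && timer.timeElapsed >= 0.8 * timer.timeBudgetSeconds) {
    timer.warned = true;
    fired.push("time_budget_warning");
  }
  if (timer.timeElapsed >= timer.timeBudgetSeconds) {
    timer.exceeded = true;
    fired.push("time_budget_exceeded"); // caller then applies the node's overrunPolicy
  }
  return fired;
}

const nodeTimer: NodeTimer = {
  timeBudgetSeconds: 100,
  timeElapsed: 0,
  warned: false,
  exceeded: false,
  paused: false,
};

const timeline: TimeBudgetEvent[][] = [];
timeline.push(advanceTimer(nodeTimer, 50)); // 50s elapsed: nothing yet
timeline.push(advanceTimer(nodeTimer, 35)); // 85s elapsed: warning
nodeTimer.paused = true;
timeline.push(advanceTimer(nodeTimer, 30)); // paused: time does not count
nodeTimer.paused = false;
timeline.push(advanceTimer(nodeTimer, 20)); // 105s elapsed: exceeded
timeline.push(advanceTimer(nodeTimer, 5)); // already exceeded: no repeat events
console.log(JSON.stringify(timeline));
```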
  • Unit tests: Follow-up counter: increments, caps at max, never decrements. Transition approval: allowed and blocked cases.
  • State machine tests: Full state machine simulation — verify all paths through the node graph with various follow-up counts and time budgets.
  • Adversarial tests: LLM prompt injection attempts — try to get the LLM to skip a question, reveal the rubric, or exceed follow-up limits. Verify runtime blocks all of these.
  • Integration tests: Full session with hard transitions — verify the candidate experience is smooth even when transitions are forced.
  • Regression: Existing exams continue to work. The new enforcement is additive — it only activates for IRs that declare transitionPolicy and maxFollowUps.

Impact on published packages: backward-compatible. Existing published packages that don’t declare transitionPolicy or maxFollowUps continue to use the current LLM-decided transitions. Packages that declare these get runtime enforcement. This is opt-in until Phase 5 makes it mandatory.


Phase 5 — Promote flowJson to Formal IOA-ORM

Goal: Make the IOA-ORM the single source of truth for exam runtime configuration. flowJson becomes a legacy compatibility layer.

| Area | Before | After |
|---|---|---|
| Source of truth | flowJson is compiled and passed to Pipecat directly | IOA-ORM is the source of truth; Pipecat config is an adapter output |
| Compilation pipeline | AssessmentPackage → flowJson → Pipecat | AssessmentPackage → IOA-ORM → (adapter) → Pipecat config |
| Versioning | No formal versioning on flowJson | irVersion field in the specification; semantic versioning; backward compatibility rules |
| Backward compatibility | N/A | Published packages with old flowJson are auto-migrated to IOA-ORM v1.0.0 on first use |
| Schema validation | Ad-hoc | JSON Schema for ExamRuntimeIR; CI validation; runtime validation on load |
| Pipecat adapter | flowJson IS the Pipecat config | Separate adapter module that compiles specification → Pipecat config; isolates Pipecat-specific concerns |
| Documentation | Scattered across code comments | Formal specification (this document suite); API docs; migration guides |

  • Single source of truth: Eliminates the drift between what the author intended and what the runtime executes.
  • Portability: If Pipecat is replaced or supplemented, only the adapter changes. The specification and the runtime controller are unaffected.
  • Ecosystem: Other tools (analytics, reporting, quality assurance) can consume the specification directly, without understanding Pipecat internals.
  • Governance: Versioned specification enables controlled evolution, deprecation policies, and migration tooling.
  1. Finalise the specification schema (all fields from Phases 1–4, plus metadata, versioning, and any remaining gaps).
  2. Implement the Pipecat adapter module:
    • Input: ExamRuntimeIR.
    • Output: Pipecat FlowManager config.
    • The adapter MUST NOT add domain logic — it is a pure translation layer.
  3. Implement IR versioning and migration:
    • irVersion follows semver.
    • Migration tool converts old flowJson → IOA-ORM v1.0.0.
    • Breaking changes increment major version; migration tool provided.
  4. Implement schema validation:
    • JSON Schema published and versioned.
    • CI: validate specification on build.
    • Runtime: validate specification on load; reject invalid specifications with clear errors.
  5. Auto-migrate existing published packages:
    • On first access, detect old flowJson format.
    • Convert to IOA-ORM v1.0.0 and persist.
    • Original flowJson preserved for rollback.
  6. Update the Assessment Studio to compile to IOA-ORM instead of flowJson.
  7. Deprecate flowJson:
    • Add deprecation warnings to flowJson code paths.
    • Set a sunset date (e.g., 6 months after Phase 5 ships).
    • After sunset, flowJson code paths are removed.
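
Steps 3 and 5 (version checking and auto-migration on first access) could be wired together roughly as follows. The shapes of LegacyFlowJson, ExamRuntimeIR and StoredPackage here are placeholders invented for the example, since the real flowJson and IR schemas are defined elsewhere in this document suite; only irVersion, the v1.0.0 target and the rule that the original flowJson is preserved come from this plan.

```typescript
// Placeholder shapes: the real flowJson and ExamRuntimeIR schemas are richer than this.
interface LegacyFlowJson {
  nodes: { id: string; prompt: string }[];
}

interface ExamRuntimeIR {
  irVersion: string;
  nodes: { nodeId: string; stem: string }[];
}

interface StoredPackage {
  packageId: string;
  flowJson?: LegacyFlowJson; // kept after migration so a rollback remains possible
  ir?: ExamRuntimeIR;
}

const SUPPORTED_MAJOR = 1; // assumed: this runtime understands IR major version 1

function majorOf(irVersion: string): number {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(irVersion);
  if (match === null) throw new Error(`irVersion is not MAJOR.MINOR.PATCH: ${irVersion}`);
  return Number(match[1]);
}

// Pure translation from the legacy format to IR v1.0.0 (illustrative field mapping).
function migrateFlowJson(flow: LegacyFlowJson): ExamRuntimeIR {
  return {
    irVersion: "1.0.0",
    nodes: flow.nodes.map((node) => ({ nodeId: node.id, stem: node.prompt })),
  };
}

// On first access, detect a legacy package, migrate it, and keep the original flowJson.
function loadIR(pkg: StoredPackage): ExamRuntimeIR {
  if (pkg.ir === undefined) {
    if (pkg.flowJson === undefined) {
      throw new Error(`package ${pkg.packageId} has neither an IR nor a flowJson`);
    }
    pkg.ir = migrateFlowJson(pkg.flowJson); // a real system would persist this
  }
  if (majorOf(pkg.ir.irVersion) !== SUPPORTED_MAJOR) {
    throw new Error(`unsupported irVersion ${pkg.ir.irVersion} for package ${pkg.packageId}`);
  }
  return pkg.ir;
}

const legacyPackage: StoredPackage = {
  packageId: "pkg-legacy",
  flowJson: { nodes: [{ id: "q1", prompt: "Explain normalisation." }] },
};
const futurePackage: StoredPackage = {
  packageId: "pkg-future",
  ir: { irVersion: "2.0.0", nodes: [] },
};

const migratedIr = loadIR(legacyPackage);
console.log(migratedIr.irVersion, legacyPackage.flowJson !== undefined);
```

Rejecting an unknown major version at load time is one way to apply the "breaking changes increment major version" rule; a migration tool for the newer major would then be the route to make such a package loadable.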

| Risk | Mitigation |
|---|---|
| Auto-migration introduces bugs in existing exams | Run migration in dry-run mode first; compare generated specification against expected output. Validate 100% of existing published packages before enabling auto-migration. |
| Breaking change in specification forces all packages to be re-published | Semver policy: major version bump for breaking changes. Migration tool provided. Each major version stays supported for at least two subsequent major releases. |
| Pipecat adapter introduces bugs | Adapter is a pure function, so it is easily testable. Comprehensive test suite mapping specification → expected Pipecat config for every node type. |
| Team resists the migration because flowJson “works fine” | Phase 5 is the culmination: by this point, the team has already seen the value of events, evidence, and runtime control in Phases 1–4. Phase 5 just formalises it. |
| External integrations depend on flowJson format | Provide a compatibility shim that produces flowJson from specification. Deprecate the shim on the same timeline. |

  • Migration tests: Run migration on every existing published package. Verify: specification is valid, Pipecat adapter output matches original flowJson (where applicable), no candidate-facing changes.
  • Schema validation tests: Valid IRs pass; invalid IRs are rejected with specific error messages.
  • Adapter tests: For every node type, verify adapter output matches expected Pipecat config.
  • End-to-end tests: Full exam session using IOA-ORM as source of truth. Verify: events, evidence, commands, transitions, marking input — all correct.
  • Performance tests: Specification compilation + adapter MUST complete in <500ms for a typical exam.
  • Regression: All existing tests continue to pass.

Impact on published packages: this is the migration phase. All existing published packages are affected: they are auto-migrated from flowJson to IOA-ORM v1.0.0. After migration, they continue to work exactly as before, but now through the specification pipeline.

Post-migration:

  • New packages MUST be published as IOA-ORM.
  • Old packages continue to work via auto-migration.
  • flowJson is deprecated with a sunset date.

| Phase | Duration Estimate | Key Deliverable | Breaking? |
|---|---|---|---|
| 0 — Institutional Readiness | 2–3 weeks (parallel) | Examiner training, student prep, pilot planning | No |
| 1 — Event Protocol & Transcript Closure | 3–4 weeks | Reliable event stream + persisted transcripts | No |
| 2 — Node State & Candidate Commands | 3–4 weeks | Runtime state + candidate command handling | No (opt-in) |
| 3 — Evidence Target & Ledger | 4–5 weeks | Structured evidence for marking | No (opt-in) |
| 4 — Hard Follow-Up & Transition Policy | 3–4 weeks | Runtime-enforced constraints | No (opt-in) |
| 5 — Promote to IOA-ORM | 4–6 weeks | IOA-ORM as source of truth + migration | Auto-migration for all packages |

Total estimated duration: 19–25 weeks (5–6 months), assuming one team + parallel institutional readiness workstream.

Phases 1–4 can partially overlap — they build on each other but each delivers independent value. Phase 0 runs in parallel with Phases 1–3. Phase 5 depends on all previous phases being stable.


Phase 0 ──────────────────────────────▶ (runs in parallel with Phases 1–3)

Phase 1 ──▶ Phase 2 ──▶ Phase 4 ──▶ Phase 5
   │                      ▲
   └──────▶ Phase 3 ──────┘

  • Phase 0 runs in parallel with Phases 1–3. It must complete before Phase 5 (which involves all published packages) but does not block engineering phases.
  • Phase 2 depends on Phase 1 (events are needed for state tracking).
  • Phase 3 depends on Phase 1 (transcript spans are needed for evidence linking).
  • Phase 4 depends on Phases 2 and 3 (follow-up counting needs node state; transition conditions may reference evidence coverage).
  • Phase 5 depends on all previous phases being stable and shipped.
  • Phase 5 also depends on Phase 0: examiner training and student preparation must be complete before mass migration.
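
The dependency rules above can be checked mechanically. The sketch below encodes them as a graph and derives the earliest "wave" in which each phase can start; the earliestWaves function and the wave concept are illustrative and not part of the plan, and Phase 0 appears in the first wave only because it has no prerequisites (in practice it keeps running alongside Phases 1–3).

```typescript
// Each phase mapped to the phases it depends on, as listed in the bullets above.
const dependsOn: Record<string, string[]> = {
  "Phase 0": [],
  "Phase 1": [],
  "Phase 2": ["Phase 1"],
  "Phase 3": ["Phase 1"],
  "Phase 4": ["Phase 2", "Phase 3"],
  "Phase 5": ["Phase 0", "Phase 1", "Phase 2", "Phase 3", "Phase 4"],
};

// Groups phases into waves: a phase joins the first wave in which all of its
// dependencies have completed in an earlier wave. Throws on a dependency cycle.
function earliestWaves(graph: Record<string, string[]>): string[][] {
  const done = new Set<string>();
  const remaining = new Set<string>(Object.keys(graph));
  const waves: string[][] = [];

  while (remaining.size > 0) {
    const ready = Array.from(remaining).filter((phase) =>
      (graph[phase] ?? []).every((dependency) => done.has(dependency)),
    );
    if (ready.length === 0) throw new Error("dependency cycle detected");
    for (const phase of ready) {
      remaining.delete(phase);
      done.add(phase);
    }
    waves.push(ready);
  }
  return waves;
}

const phaseWaves = earliestWaves(dependsOn);
phaseWaves.forEach((wave, index) => console.log(`Wave ${index + 1}: ${wave.join(", ")}`));
```

Phases 2 and 3 land in the same wave, which reflects that both depend only on Phase 1 and can proceed side by side if team capacity allows.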

| Version | Date | Changes |
|---|---|---|
| v0.2.0 | 2026-06-30 | Updated migration plan for IOA-ORM naming. Adjusted phase dependencies. |
| v0.1.0 | 2026-05-06 | Initial release. |