
Open Questions

Draft · v0.2.0 · 2026-06-30

This chapter collects unresolved questions that require product, engineering, or stakeholder input before they can be incorporated into the specification. Each question includes context, trade-offs, and a recommended default position.

These questions are intentionally left open — they represent design decisions that should not be made unilaterally by the spec author.


Q1: Evidence Sufficiency — LLM Judgment vs Deterministic Policy + LLM Rationale


Context: When the runtime evaluates whether sufficient evidence has been collected for a question, there are two approaches:

  • (A) LLM judgment: The LLM itself decides “I have enough evidence to move on.” This is flexible but non-deterministic — the same exam could produce different transitions for different candidates.
  • (B) Deterministic policy + LLM rationale: The runtime evaluates a boolean expression over evidence coverage (e.g., evidence_covered(['ev-q1-required']) AND count(covered) >= 2). The LLM provides the evidence signals, but the transition decision is deterministic.

Trade-offs:

| | LLM Judgment | Deterministic Policy |
|---|---|---|
| Flexibility | High: handles nuance | Low: may miss edge cases |
| Fairness | Low: two candidates may get different treatment | High: same rules for everyone |
| Auditability | Low: “the LLM decided” | High: “policy X was applied” |
| Complexity | Low: just prompt engineering | Higher: policy authoring + expression evaluation |
| Failure mode | LLM may never transition (or transition too early) | Policy may be too strict (blocks valid transitions) or too loose |

Default position: (B) Deterministic policy + LLM rationale. Fairness and auditability are non-negotiable for assessment. The LLM’s role is to provide high-quality evidence signals; the runtime’s role is to enforce transition policy. However, the deterministic policy SHOULD be configurable per question, and a “policy: none” mode MAY be offered for low-stakes practice exams.
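As an illustration of option (B), the example expression from the context above can be evaluated as a plain function over the set of covered evidence targets. This is a minimal sketch, not the runtime's actual evaluator: the `EvidenceState` type, the function names, and the `ev-q1-optional` ID are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceState:
    """Evidence target IDs the LLM has signalled as covered for the current node."""
    covered: frozenset


def evidence_covered(state: EvidenceState, required: list) -> bool:
    """True only if every required evidence target has been covered."""
    return all(target in state.covered for target in required)


def may_transition(state: EvidenceState) -> bool:
    """Hypothetical policy: evidence_covered(['ev-q1-required']) AND count(covered) >= 2."""
    return evidence_covered(state, ["ev-q1-required"]) and len(state.covered) >= 2


# The LLM supplies the signals; the decision itself is a pure function of them.
partial = EvidenceState(covered=frozenset({"ev-q1-required"}))
complete = EvidenceState(covered=frozenset({"ev-q1-required", "ev-q1-optional"}))
```

Because `may_transition` depends only on the covered set, two candidates with the same evidence coverage always receive the same transition decision, which is the fairness property the default position relies on.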

Open question: Should the deterministic policy be authorable in the Assessment Studio UI, or only as a code-level expression? The UI approach is more accessible to lecturers but limits expressiveness. The code approach is more powerful but requires engineering support.


Q2: Clarification Boundary

Context: When a candidate asks for clarification (e.g., “What do you mean by starvation?”), the LLM must explain without revealing the answer or rubric. But the line is blurry:

  • “Starvation means a process waits indefinitely” — safe clarification.
  • “Starvation is when short jobs keep arriving and long jobs never get scheduled, which is a key problem with SJF” — this is partly the answer.
  • “You’re on the right track — starvation is what I’m looking for here” — this reveals the rubric.

Open questions:

  1. Should clarification responses be pre-authored per question (deterministic) or generated by the LLM at runtime (flexible)?
  2. Should there be a hard list of “safe clarification templates” for common terms, with the LLM only falling back to generation for unanticipated questions?
  3. How do we handle a candidate who repeatedly asks for clarification as a strategy to extract hints? Should the runtime intervene after N clarifications?

Default position: Clarification is LLM-generated with guardrails. A bank of pre-authored safe clarifications is optional and may improve consistency. The maxPerNode limit on clarification commands (currently 2) provides a basic safeguard against strategic abuse.
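The maxPerNode safeguard can be pictured as a per-node counter that the runtime consults before passing a clarification request to the LLM. The sketch below assumes a limit of 2, as quoted above; the `ClarificationGuard` class and its method names are hypothetical.

```python
from collections import defaultdict

MAX_CLARIFICATIONS_PER_NODE = 2  # mirrors the maxPerNode value quoted above


class ClarificationGuard:
    """Counts clarification commands per node and refuses requests past the limit."""

    def __init__(self, max_per_node: int = MAX_CLARIFICATIONS_PER_NODE) -> None:
        self.max_per_node = max_per_node
        self._used = defaultdict(int)

    def request(self, node_id: str) -> bool:
        """Return True and count the request if allowed, otherwise return False."""
        if self._used[node_id] >= self.max_per_node:
            return False
        self._used[node_id] += 1
        return True
```

What the runtime does on a refused request (for example, a fixed "let's continue with the question" response) is exactly the kind of intervention open question 3 asks about; the sketch leaves it undecided.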

Needs input from: Assessment design team, academic integrity team.


Q3: Branching

Context: Some oral assessments may want adaptive flows. For example, if a candidate struggles with Q1, Q2 could be simplified; if they excel, Q2 could be harder. This is “branching.”

The specification currently supports a linear graph (opening → q1 → q2 → closing). Branching would require:

  • Conditional transitions based on runtime state (e.g., evidence coverage, follow-up count, time used).
  • Multiple candidate nodes for the same logical question position.
  • Authoring complexity: the lecturer must design multiple paths.

Trade-offs:

| | No Branching | Branching |
|---|---|---|
| Authoring simplicity | High | Low |
| Fairness | High: all candidates get same questions | Lower: different candidates face different difficulty |
| Assessment validity | Easier to validate | Harder to validate (are branches equivalent?) |
| Adaptive assessment | Not possible | Possible |

Default position: Defer to v2. Branching is valuable but significantly increases authoring complexity and validation burden. The current specification should be designed to not preclude branching (e.g., allowedTargets is already an array, not a singleton), but the Assessment Studio and runtime should not implement it in v1.
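One way to keep allowedTargets as an array while still refusing branching in v1 is a validation pass that rejects any node with more than one target. The following is a sketch under that assumption; the graph representation (a mapping from node ID to its allowedTargets list) and the `is_linear` helper are illustrative, not part of the specification.

```python
def is_linear(nodes: dict, start: str) -> bool:
    """True if following allowedTargets from start visits every node once, with no branching."""
    seen = []
    current = start
    while True:
        if current in seen or current not in nodes:
            return False  # cycle, or a target that is not a declared node
        seen.append(current)
        targets = nodes[current]
        if len(targets) > 1:
            return False  # branching: more than one allowed target
        if not targets:
            return len(seen) == len(nodes)  # terminal node; all nodes must be visited
        current = targets[0]


# The linear flow described above: opening -> q1 -> q2 -> closing.
linear = {"opening": ["q1"], "q1": ["q2"], "q2": ["closing"], "closing": []}

# A hypothetical v2-style package with two candidate nodes for the Q2 position.
branching = {
    "opening": ["q1"],
    "q1": ["q2_easy", "q2_hard"],
    "q2_easy": ["closing"],
    "q2_hard": ["closing"],
    "closing": [],
}
```

A v1 runtime could run such a check at package load time, so that a branching package fails fast rather than being executed partially.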

Needs input from: Assessment design team, psychometricians.


Q4: Examiner Persona

Context: The examinerPersona field in the specification controls the LLM’s tone and style. The current schema is:

{
  "tone": "supportive_encouraging",
  "style": "asks_for_clarification_when_vague"
}

Open questions:

  1. How many persona dimensions should be configurable? Tone, style, formality, verbosity, cultural sensitivity, language complexity?
  2. Should personas be reusable presets (e.g., “supportive_undergrad”, “formal_postgrad”) or fully custom?
  3. Should the persona affect evidence detection (e.g., a supportive persona might be more generous in interpreting ambiguous answers)?
  4. Should the persona be allowed to vary during the exam (e.g., more formal during questions, warmer during closing)?

Default position: Keep it simple in v1 — tone (enum) and style (string). Presets can be added later. The persona MUST NOT affect evidence detection — that’s a separate, deterministic concern.

Needs input from: UX research, linguistics/pedagogy team.


Q5: Per-Node Time Budget

Context: The specification defines timeBudgetSeconds per question node. The runtime enforces this with warnings and hard stops.

Open questions:

  1. Should the time budget be a hard limit (candidate is cut off) or a soft limit (warning only, no forced transition)?
  2. Should the candidate see a visible timer?
  3. Should the time budget account for candidate commands (repeat, clarification, raise_hand)? Currently, raise_hand pauses the timer, but repeat and clarification do not — should they?
  4. Should the total exam time budget be enforced independently of per-node budgets? (e.g., “10 minutes total, however distributed”)
  5. Should the time budget be adjustable by a proctor during the exam?

Default position:

  • Hard limit with overrunPolicy (configurable).
  • Candidate MAY see a timer — this is a UI decision, not a specification decision.
  • repeat and clarification SHOULD count toward the time budget (they use real time). raise_hand pauses because it’s explicitly a “I need a moment” signal.
  • Total time budget SHOULD be enforced as a fallback (if all per-node budgets are exhausted but total isn’t, the exam can continue; if total is exhausted, the exam MUST end).
  • Proctor adjustment is out of scope for v1.
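The default position above can be sketched as a clock that charges elapsed time to both the node and the total budget, except while a pausing command is active. This is an illustrative model only: the `BudgetClock` class, the status strings, and the assumption that raise_hand pauses the total budget as well as the node budget are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional

PAUSING_COMMANDS = {"raise_hand"}  # repeat and clarification keep the clock running


@dataclass
class BudgetClock:
    node_budget_s: float
    total_budget_s: float
    node_used_s: float = 0.0
    total_used_s: float = 0.0

    def tick(self, seconds: float, active_command: Optional[str] = None) -> None:
        """Charge elapsed time unless a pausing command is in effect."""
        if active_command in PAUSING_COMMANDS:
            return
        self.node_used_s += seconds
        self.total_used_s += seconds

    def status(self) -> str:
        """Total budget takes precedence: once it is exhausted the exam must end."""
        if self.total_used_s >= self.total_budget_s:
            return "exam_must_end"
        if self.node_used_s >= self.node_budget_s:
            return "node_overrun"  # handled according to the node's overrunPolicy
        return "ok"
```

For example, with a 60-second node budget and a 100-second total, 30 seconds spent on a repeat command are charged, 20 seconds under raise_hand are not, and the node only overruns once 60 charged seconds have accumulated.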

Needs input from: Assessment design team, accessibility team.


Q6: Multilingual

Context: The specification has a metadata.language field, but the current design assumes a single language per exam.

Open questions:

  1. Should the specification support multilingual exams (e.g., candidate answers in Mandarin, examiner speaks English)?
  2. Should the specification support code-switching (candidate mixes languages)?
  3. How does STT handle multilingual input? Is this a runtime concern or a specification concern?
  4. Should evidence detection be language-aware? (A Chinese answer and an English answer about the same concept should be evaluated equivalently.)

Default position: The specification’s language field sets the primary language. Multilingual support is a runtime/STT concern, not a specification concern. The specification should not encode language-specific rules — evidence targets are language-agnostic (they describe concepts, not words). Multilingual exams are out of scope for v1 but the specification should not preclude them.

Needs input from: Internationalisation team, STT vendor.


Q7: EvidenceSignal — Auto-Generated, Human-Reviewed, or Marking Runtime Confirmed?


Context: The evidence_signal event is the bridge between the exam runtime and the marking pipeline. Who produces the final signal?

Options:

  • (A) Auto-generated only: The exam runtime’s LLM generates evidence signals during the exam. These go directly to marking. Fast, cheap, but potentially inaccurate.
  • (B) Auto-generated + human-reviewed: The LLM generates signals, but a human marker reviews and confirms/overrides them before final marking. Slow, expensive, but accurate.
  • (C) Auto-generated + marking runtime confirmed: The LLM generates signals, and the marking runtime (which may itself be an AI) independently evaluates the transcript and confirms/overrides. Middle ground.

Trade-offs:

| | Auto-only | Human-reviewed | Marking runtime confirmed |
|---|---|---|---|
| Speed | Instant | Days | Minutes |
| Cost | Low | High | Medium |
| Accuracy | Medium | High | Medium-High |
| Scalability | High | Low | High |

Default position: (C) as default, with (B) as an option for high-stakes exams. The exam runtime generates signals with confidence scores. The marking runtime independently evaluates the transcript and can override signals where confidence is low or where it disagrees. For high-stakes exams (e.g., professional certification), human review is mandatory.

Needs input from: Assessment design team, academic integrity team, marking pipeline team.


Q8: Transient Nodes

Context: In some scenarios, the runtime might want to insert a node that wasn’t in the original specification:

  • A “repair” node when the candidate is confused and the LLM needs to re-explain the question from scratch.
  • A “buffer” node when transitioning between topics (e.g., “Great, now let’s move on to a different topic”).
  • An “anxiety intervention” node when the candidate shows signs of distress.

Open questions:

  1. Should the runtime be allowed to insert nodes not in the specification?
  2. If yes, what constraints apply? (e.g., transient nodes cannot have evidence targets; they cannot extend the exam duration; they must be logged.)
  3. Should transient nodes be visible in the evidence ledger and transcript?
  4. Should the candidate know they’re in a transient node?

Default position: Transient nodes are allowed but heavily constrained.

  • They MUST NOT have evidence targets (they don’t assess anything).
  • They MUST NOT extend the total exam duration (they borrow time from the current node’s budget).
  • They MUST be logged as transient_node_entered / transient_node_exited events.
  • They MUST appear in the transcript with a transient: true flag.
  • The candidate does not need to know — from their perspective, the examiner is just being helpful.
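The constraints above lend themselves to a small validation step before the runtime inserts a transient node. The sketch below is one possible shape; the `TransientNode` fields, the `violations` helper, and the transcript entry layout (apart from the `transient: true` flag) are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TransientNode:
    node_id: str
    kind: str  # e.g. "repair", "buffer", "anxiety_intervention"
    evidence_targets: list = field(default_factory=list)
    extra_time_s: float = 0.0  # time requested beyond the current node's budget


def violations(node: TransientNode) -> list:
    """Return the constraints a proposed transient node would break (empty if valid)."""
    problems = []
    if node.evidence_targets:
        problems.append("transient nodes MUST NOT have evidence targets")
    if node.extra_time_s > 0:
        problems.append("transient nodes MUST NOT extend the total exam duration")
    return problems


def transcript_entry(node: TransientNode, text: str) -> dict:
    """Transcript record carrying the transient flag required above."""
    return {"nodeId": node.node_id, "text": text, "transient": True}
```

Emitting the transient_node_entered and transient_node_exited events would sit around the node's execution; that part is omitted here because the event payloads are not defined in this chapter.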

Needs input from: Assessment design team, UX team.


Q9: Candidate Anxiety / Accessibility — Policy or Prompt?


Context: The runtime should handle candidates who show signs of anxiety (e.g., long silence, repeated “I don’t know”, stuttering, asking to stop).

Options:

  • (A) Prompt-only: The LLM’s system prompt includes instructions to be supportive and handle anxiety gracefully. No runtime enforcement.
  • (B) Policy + prompt: The runtime detects anxiety signals (e.g., silence threshold, negative sentiment) and triggers specific actions (e.g., pause timer, offer to skip, emit anxiety_detected event). The LLM is also prompted to be supportive.

Default position: (B) for accessibility and duty of care.

  • The runtime SHOULD detect basic anxiety signals (extended silence, repeated “I don’t know”, explicit requests to stop).
  • On detection: pause timer, emit anxiety_detected event, prompt LLM to offer support (“Would you like to take a moment? We can also move on to the next question if you’d prefer.”).
  • If the candidate explicitly requests to stop, the exam MUST end gracefully (exam_completed with reason candidate_requested_stop).
  • The LLM prompt is a secondary layer — it handles the nuance, but the runtime handles the structural response.
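A rule-based detector for the three basic signals named above might look like the following. All thresholds and phrases are placeholder values chosen for the example, and the `Turn` type and `classify` function are hypothetical; a real deployment would need values agreed with the accessibility and wellbeing teams.

```python
from dataclasses import dataclass

SILENCE_THRESHOLD_S = 20.0  # hypothetical: seconds of silence treated as "extended"
IDK_REPEAT_THRESHOLD = 3    # hypothetical: number of "I don't know" turns
STOP_PHRASES = ("i want to stop", "please stop", "stop the exam")  # hypothetical


@dataclass
class Turn:
    text: str
    preceding_silence_s: float


def _normalise(text: str) -> str:
    """Lowercase and map typographic apostrophes to plain ones before matching."""
    return text.lower().replace("\u2019", "'")


def classify(turns: list) -> str:
    """Return 'stop_requested', 'anxiety_detected', or 'none' for the latest turn."""
    if not turns:
        return "none"
    latest = _normalise(turns[-1].text)
    if any(phrase in latest for phrase in STOP_PHRASES):
        return "stop_requested"  # runtime ends the exam with candidate_requested_stop
    idk_count = sum("i don't know" in _normalise(t.text) for t in turns)
    if turns[-1].preceding_silence_s >= SILENCE_THRESHOLD_S or idk_count >= IDK_REPEAT_THRESHOLD:
        return "anxiety_detected"  # runtime pauses the timer and emits anxiety_detected
    return "none"
```

Keeping this layer deliberately simple matches the split described above: the runtime handles the structural response, and the LLM prompt handles nuance that keyword rules will miss.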

Needs input from: Accessibility team, student wellbeing team, legal team.


Q10: Transcript Redaction / Privacy Policy in the Specification


Context: Transcripts may contain sensitive information — candidate names, student IDs, or personal details inadvertently spoken. Some jurisdictions require data minimisation.

Open questions:

  1. Should the specification define a redaction policy (e.g., “redact candidate name from transcript before storage”)?
  2. Should redaction happen at the STT layer, the event store layer, or the marking pipeline layer?
  3. Should the candidate have the right to request transcript deletion after the exam?
  4. Should the specification encode data retention periods (e.g., “delete transcript after 90 days”)?

Default position: Privacy is a platform-level concern, not a specification concern. The specification should not encode redaction or retention policies — these belong in the platform’s data governance framework. However, the specification SHOULD provide hooks:

  • The event store SHOULD support redaction as a post-processing step.
  • The marking input SHOULD include a redacted: true flag if redaction was applied.
  • Retention policies are enforced at the storage layer, not the specification layer.
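To make the first two hooks concrete, redaction can be modelled as a pure post-processing function over stored events that also reports whether anything was changed. This is a sketch only: the event shape, the placeholder tokens, and the student ID pattern are assumptions, and real redaction would need far more than name and ID matching.

```python
import re

STUDENT_ID_PATTERN = re.compile(r"\b\d{7,9}\b")  # hypothetical student ID format


def redact_events(events: list, candidate_name: str) -> dict:
    """Return marking input with the name and ID-like numbers masked, plus a redacted flag."""
    if not candidate_name:
        raise ValueError("candidate_name must be non-empty")
    name_pattern = re.compile(re.escape(candidate_name), re.IGNORECASE)
    redacted_any = False
    cleaned_events = []
    for event in events:
        if "text" not in event:
            cleaned_events.append(dict(event))  # non-textual events pass through unchanged
            continue
        cleaned = name_pattern.sub("[REDACTED_NAME]", event["text"])
        cleaned = STUDENT_ID_PATTERN.sub("[REDACTED_ID]", cleaned)
        redacted_any = redacted_any or cleaned != event["text"]
        cleaned_events.append({**event, "text": cleaned})
    return {"events": cleaned_events, "redacted": redacted_any}
```

The function returns new event dictionaries rather than mutating the input, which keeps the original event store untouched until the platform's governance rules decide what to persist.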

Needs input from: Legal team, data protection officer, platform team.


Q11: Post-Exam Marking Referencing Runtime Transition Rationale


Context: The transition_decision event records why each transition happened (e.g., “all evidence covered,” “time budget exceeded”). Should the marking pipeline use this information?

Open questions:

  1. Should a “time budget exceeded” transition affect the mark? (e.g., penalise for running out of time, or just note it as context?)
  2. Should a “blocked unauthorised transition” event (guardrail violation) affect the mark?
  3. Should the marking pipeline be aware of how many follow-ups were used? (e.g., “candidate needed 2 follow-ups to cover evidence” vs “candidate covered evidence on the stem question”)
  4. Should the marking pipeline see the raw transition_decision events, or a summary?

Default position: The marking pipeline SHOULD receive the full runtime audit, including transition decisions. How it uses this information is a marking policy decision:

  • Time budget exceeded: MAY affect marks (e.g., partial credit for not-covered evidence). SHOULD be flagged for human marker review.
  • Guardrail violations: SHOULD be flagged for academic integrity review. SHOULD NOT automatically reduce marks (the violation is the LLM’s, not the candidate’s).
  • Follow-up count: MAY inform marking (e.g., “required prompting” is weaker evidence than “volunteered”). SHOULD be available as metadata.

Needs input from: Marking pipeline team, academic integrity team, assessment design team.


Q12: Published Package Allowing Patch Event Protocol Adapter


Context: When a published package is loaded by a newer runtime version, the runtime may emit new event types that the package wasn’t designed for. Should the runtime “downgrade” its event output to match the package’s expected event protocol version?

Options:

  • (A) Always emit latest events: The runtime emits all events it supports. Older packages ignore unknown events. The UI and marking pipeline handle missing events gracefully.
  • (B) Emit package-compatible events: The runtime checks the package’s specification version and only emits events that version supports. Newer events are suppressed.
  • (C) Emit both: The runtime emits latest events AND a compatibility layer that maps them to the older event format.

Default position: (A) with graceful degradation.

  • The runtime ALWAYS emits the latest event protocol.
  • Consumers (UI, marking pipeline, event store) MUST handle unknown event types by ignoring them gracefully.
  • The specification’s irVersion tells the runtime which features the package expects, but it does NOT limit which events the runtime emits.
  • A compatibility adapter (C) MAY be provided as an optional utility for consumers that cannot be updated, but it is not part of the core runtime.

This is the simplest approach and avoids the combinatorial explosion of maintaining multiple event protocol versions.
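On the consumer side, "ignoring unknown event types gracefully" can be as simple as a dispatch table with a no-op fallback. The sketch below assumes events are dictionaries with a `type` key; the `EventConsumer` class and the `future_event_v9` type are made up for the example.

```python
from typing import Callable

Handler = Callable[[dict], None]


class EventConsumer:
    """Dispatches known event types and skips unknown ones without raising."""

    def __init__(self) -> None:
        self._handlers = {}
        self.ignored = []  # unknown types seen, kept for diagnostics

    def on(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type] = handler

    def consume(self, event: dict) -> bool:
        """Return True if the event was handled, False if it was ignored."""
        handler = self._handlers.get(event.get("type", ""))
        if handler is None:
            self.ignored.append(str(event.get("type")))
            return False
        handler(event)
        return True


seen = []
consumer = EventConsumer()
consumer.on("transition_decision", seen.append)
handled = consumer.consume({"type": "transition_decision", "reason": "all evidence covered"})
skipped = consumer.consume({"type": "future_event_v9"})  # emitted by a newer runtime
```

Recording ignored types rather than discarding them silently gives operators a cheap signal that a consumer is running behind the runtime's event protocol.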

Needs input from: Platform team, frontend team, marking pipeline team.


Q13: Examiner Calibration

Context: Fenton (2025) recommends “a training or shadowing program with experienced instructors leading novices” and “all examiners should receive training in oral assessment procedures, including the need to focus only on professional attributes, language issues for candidates taking oral assessments in a second language, and the nature and source of bias.” Akimov & Malin (2020) found that “none of the 30 examiners surveyed had any training on how to conduct oral examinations” and that “a large number of examiners acknowledged the presence of various biases.”

The specification’s LLM-as-examiner model inherits this concern — the LLM needs calibration just as human examiners do.

Open questions:

  1. Should the specification include a calibration mode where the LLM assesses pre-annotated sample exams and its performance is measured against ground truth?
  2. Should calibration results be stored as metadata on the published package?
  3. How should the system handle drift — when the LLM’s evidence detection accuracy degrades over time due to model updates?
  4. Should the specification support multiple LLM assessors (analogous to Joughin’s panel of examiners) with inter-rater reliability tracking?
  5. How should the system handle examiner persona consistency across candidates? (Joughin, 1998: “reliability is threatened when examiners are poorly prepared.”)

Default position: Calibration is a runtime/platform concern, but the specification SHOULD provide hooks:

  • A calibrationProfile field in metadata referencing calibration exam IDs and accuracy metrics.
  • A multiAssessor option that runs evidence detection through multiple LLM instances and reports agreement.
  • A driftDetection mechanism that monitors evidence signal accuracy over time.
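For the multiAssessor hook, "reports agreement" could start with something as simple as mean pairwise agreement over the covered/not-covered verdicts. The sketch below uses raw agreement, which is not chance-corrected; a production system would more likely report a statistic such as Cohen's or Fleiss' kappa. The data layout and function name are assumptions.

```python
from itertools import combinations


def pairwise_agreement(verdicts: dict) -> float:
    """Mean fraction of evidence targets on which each pair of assessors agrees.

    verdicts maps assessor ID -> {evidence target ID -> covered (bool)}.
    """
    pairs = list(combinations(verdicts.values(), 2))
    if not pairs:
        raise ValueError("need at least two assessors")
    scores = []
    for first, second in pairs:
        shared = first.keys() & second.keys()
        if not shared:
            raise ValueError("assessors share no evidence targets")
        scores.append(sum(first[t] == second[t] for t in shared) / len(shared))
    return sum(scores) / len(scores)


# Three hypothetical LLM assessors judging the same transcript.
verdicts = {
    "assessor_a": {"ev-1": True, "ev-2": True},
    "assessor_b": {"ev-1": True, "ev-2": False},
    "assessor_c": {"ev-1": True, "ev-2": True},
}
agreement = pairwise_agreement(verdicts)
```

The same function could be run against pre-annotated sample exams by treating the ground-truth annotations as one more "assessor", which connects it to the calibration mode raised in open question 1.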

Needs input from: Assessment design team, platform team, psychometricians.


Q14: Assessment Validity Framework

Context: Akimov & Malin (2020) evaluate their oral exam against a formal validity matrix (face validity, content validity, construct validity, concurrent validity, plus four reliability measures). The specification has no equivalent framework for declaring or measuring validity.

Joughin (1998) identifies validity as a fundamental concern: the six dimensions of oral assessment are not just descriptive — they are design parameters that affect whether the assessment measures what it claims to measure.

Open questions:

  1. Should the specification include a validity metadata section where authors declare their validity evidence?
  2. Should the runtime automatically compute reliability metrics (e.g., inter-rater agreement across sessions using the same package)?
  3. Should the specification support validity versioning — tracking how validity evidence accumulates over multiple exam administrations?
  4. Should the specification mandate a content coverage map (learning outcome → evidence target → node) for summative exams?

Default position: The specification SHOULD include optional validity metadata. The runtime SHOULD compute and report reliability metrics where data permits. This is a v1 feature, not a deferral. The assessmentProfile field (see §09) provides a natural location for validity declarations.
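The content coverage map mentioned in open question 4 (learning outcome → evidence target → node) can be checked mechanically at authoring time. The following sketch assumes two simple mappings and reports which evidence targets no node elicits; the structure and the `coverage_gaps` name are illustrative only.

```python
def coverage_gaps(outcome_targets: dict, node_targets: dict) -> dict:
    """Return, per learning outcome, the evidence targets that no node elicits.

    outcome_targets maps learning outcome ID -> evidence target IDs.
    node_targets maps node ID -> evidence target IDs that node can elicit.
    An outcome with no targets at all is reported with an empty list.
    """
    elicited = {t for targets in node_targets.values() for t in targets}
    gaps = {}
    for outcome, targets in outcome_targets.items():
        missing = [t for t in targets if t not in elicited]
        if missing or not targets:
            gaps[outcome] = missing
    return gaps


outcome_targets = {"LO1": ["ev-a", "ev-b"], "LO2": ["ev-c"], "LO3": []}
node_targets = {"q1": ["ev-a"], "q2": ["ev-c"]}
gaps = coverage_gaps(outcome_targets, node_targets)
```

An empty result would be one piece of content validity evidence an author could declare in the validity metadata; it says nothing about construct or concurrent validity, which need empirical data.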

Needs input from: Assessment design team, psychometricians, quality assurance.


Q15: Candidate Identity Verification

Context: Akimov & Malin (2020) describe identity verification as standard practice: “at the start of the examination each student had to show the examiner a current student ID card or a government-issued document.” Fenton (2025) similarly recommends “ensure the student presents their identification card at the start of the assessment and ensure it matches with the student who is present.” The specification has no identity verification mechanism.

Open questions:

  1. Should the specification support an identity verification node (e.g., type: "identity_check") before the assessment begins?
  2. Should identity verification be a platform concern or a specification concern?
  3. How should the system handle identity verification in remote/online contexts where visual ID check may not be feasible?
  4. Should identity verification be mandatory for all exams or only for high-stakes summative assessments?

Default position: Identity verification SHOULD be a standard node type in the specification. The platform MAY implement it via photo ID check, biometric verification, or proctor confirmation. This is a v1 feature for any high-stakes assessment.

Needs input from: Academic integrity team, platform team, legal team.


Q16: Assessment Portfolio / Weighting

Context: Akimov & Malin (2020) argue that “oral assessments should be applied intelligently to courses and more holistically in degree programmes.” Their implementation used oral exam as 40% of the course grade, combined with an applied project (50%) and a quiz (10%). The specification treats each exam in isolation.

Open questions:

  1. Should the specification support a weighting field indicating the exam’s contribution to the overall course grade?
  2. Should the specification support a portfolio concept linking multiple assessments (oral + written + project)?
  3. Should the marking pipeline be aware of the exam’s weight when determining grade boundaries?
  4. How should the specification handle prerequisite assessments (e.g., written submission must be completed before viva voce)?

Default position: Weighting is a course-level concern, not a specification concern. But the specification SHOULD include a courseAssessmentContext field where authors can declare how the oral exam fits within the broader assessment strategy. This helps markers contextualize their decisions.

Needs input from: Assessment design team, curriculum committee.


Q17: LLM Bias in Evidence Detection

Context: The LLM that generates evidence signals may have systematic biases. Research on LLM assessment shows biases related to verbosity, cultural communication styles, and language proficiency. Joughin (1998) warns that “the social interaction entailed in oral assessment may distort communication and affect both a candidate’s performance and how that performance is perceived by the examiner.” Fenton (2025) notes “there may be a potential for bias around gender, ethnicity, language skills, speed of answering, and subjective grading.”

Open questions:

  1. Should the specification include bias auditing requirements (e.g., “run bias analysis on evidence signals across demographic groups before deployment”)?
  2. Should the runtime track and report evidence signal distributions across candidate demographics?
  3. Should the specification support “bias-aware” evidence detection that adjusts for known LLM biases?
  4. How should the system handle candidates who answer in a non-native language? Should evidence targets be language-agnostic?
  5. Should the specification mandate the communicationStyleIsLearningOutcome flag to prevent penalising accent or fluency when communication is not a learning outcome?

Default position: Bias auditing SHOULD be a mandatory pre-deployment step for any high-stakes assessment. The runtime SHOULD support demographic-stratified reporting. The specification SHOULD NOT embed bias corrections in the schema (this is a runtime concern) but SHOULD include a biasAudit metadata field where authors declare audit results. The communicationStyleIsLearningOutcome flag SHOULD be mandatory in all packages.
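Demographic-stratified reporting could begin with a per-group coverage rate and the spread between groups. The sketch below is descriptive only: a gap between groups is a prompt for investigation, not proof of bias. The signal layout (`group` and `covered` keys) and both function names are assumptions.

```python
from collections import defaultdict


def coverage_rate_by_group(signals: list) -> dict:
    """Fraction of evidence signals marked covered, per demographic group."""
    totals = defaultdict(int)
    covered = defaultdict(int)
    for signal in signals:
        totals[signal["group"]] += 1
        covered[signal["group"]] += int(signal["covered"])
    return {group: covered[group] / totals[group] for group in totals}


def largest_gap(rates: dict) -> float:
    """Difference between the highest and lowest group rate (0.0 if there are no groups)."""
    return max(rates.values()) - min(rates.values()) if rates else 0.0


signals = [
    {"group": "A", "covered": True},
    {"group": "A", "covered": False},
    {"group": "B", "covered": True},
    {"group": "B", "covered": True},
]
rates = coverage_rate_by_group(signals)
gap = largest_gap(rates)
```

A pre-deployment audit would run this over calibration transcripts with known ground truth, so that differences in detection accuracy can be separated from genuine differences in candidate performance.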

Needs input from: DEI team, psychometricians, assessment design team, accessibility team.


Q18: Post-Exam Candidate Feedback

Context: Fenton (2025) notes that “the oral assessment also allows teaching staff to uncover gaps in student knowledge or common misconceptions and allow for the adjustment and improvement of unit materials.” Iannone & Simpson (2012, cited in Akimov & Malin, 2020) found that “students valued the extensive feedback from assessors in the oral form of assessment as compared to just receiving a grade allocation on a written exam.”

Open questions:

  1. Should the specification specify what feedback candidates receive after the exam (e.g., transcript access, evidence summary, learning recommendations)?
  2. Should the feedback be automated (generated from the evidence ledger) or human-authored?
  3. Should the specification support a feedback delay policy (e.g., “feedback released after all candidates in the cohort have completed”)?
  4. Should candidates have access to their full transcript, or only a summary?
  5. Should the specification support a feedback rubric that maps evidence signals to formative comments?

Default position: The specification SHOULD include a candidateFeedback policy specifying what information is released, when, and in what format. This is a v1 feature — candidate feedback is a core pedagogical benefit of oral assessment that the specification should not leave to chance.

Needs input from: Assessment design team, student wellbeing team, academic integrity team.


Q19: ConVOE / Recorded Response

Context: Bayley et al. (2024) describe Concurrent Video-Based Oral Exams (ConVOEs) where 600+ students simultaneously record video responses via an LMS. This is a fundamentally different interaction model from the specification’s real-time dialogue paradigm. Questions are independent (no follow-ups), responses are time-limited recordings, and grading happens after submission.

Open questions:

  1. Should the specification support a “recorded response” mode where the candidate records answers without real-time dialogue?
  2. If yes, how do follow-ups work in a recorded format? (Bayley et al. note this as a limitation: “each question posed is independent from one another which does not permit instructors to delve into a student’s answer.”)
  3. Should the specification support hybrid modes (e.g., recorded initial response + live follow-up)?
  4. How should the specification handle question pools and randomisation for large-cohort concurrent administration?
  5. Should the specification support the parallel evaluation grading strategy (grade all candidates on Q1 before Q2)?

Default position: Recorded response mode is a legitimate assessment format that the specification SHOULD support. The key difference is that the interactionMode (Joughin, 1998) shifts from dialogue toward presentation. The specification should add an interactionMode field: "real_time_dialogue" | "recorded_response" | "hybrid". This is a v2 feature but the specification should not preclude it in v1.
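The proposed interactionMode field maps naturally onto an enumeration that a v1 runtime can parse even if it only executes one of the modes. In this sketch the three values come from the proposal above; treating a missing field as "real_time_dialogue" and the `parse_interaction_mode` helper are assumptions.

```python
from enum import Enum


class InteractionMode(str, Enum):
    REAL_TIME_DIALOGUE = "real_time_dialogue"
    RECORDED_RESPONSE = "recorded_response"
    HYBRID = "hybrid"


def parse_interaction_mode(package: dict) -> InteractionMode:
    """Read interactionMode from a package, defaulting to real-time dialogue if absent.

    Raises ValueError for values outside the three proposed modes.
    """
    return InteractionMode(package.get("interactionMode", "real_time_dialogue"))


def supported_in_v1(mode: InteractionMode) -> bool:
    """Assumed v1 behaviour: only real-time dialogue is executed; other modes are deferred."""
    return mode is InteractionMode.REAL_TIME_DIALOGUE
```

Parsing all three values in v1 while executing only one is a way to "not preclude" recorded responses: existing packages stay valid, and a v2 runtime can enable the other modes without a schema change.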

Needs input from: Assessment design team, platform team, LMS integration team.


Q20: Group Oral Exams

Context: Joughin (1998) references group oral exams (Mandeville & Menchaca, 1994; Dressel, 1991) where multiple candidates are assessed simultaneously. The specification currently assumes single-candidate sessions exclusively. Group oral exams are common in professional education (e.g., business case discussions, clinical team assessments) and assess interpersonal competence in ways that individual exams cannot.

Open questions:

  1. Should the specification support multi-candidate sessions where the examiner interacts with 2–4 candidates simultaneously?
  2. How should evidence signals be attributed to individual candidates in a group setting?
  3. Should the specification support peer assessment within group oral exams (Joughin’s “peer assessment” examiner dimension)?
  4. How should turn-taking be managed when multiple candidates compete for speaking time?
  5. Should the specification support panel assessment (multiple human examiners + AI examiner) as described by Joughin?

Default position: Group oral exams require a multi-candidate session model that the current specification does not support. This is a v2 feature. The specification SHOULD be designed to not preclude group sessions (e.g., the session model should allow for multiple candidate IDs). Panel assessment (multiple examiners) is a separate concern that the ExaminerConfiguration construct (see §01 domain model review proposals) could support.

Needs input from: Assessment design team, professional accreditation bodies.


Q21: Communication Skills as First-Class Evidence


Context: Joughin (1998) identifies “interpersonal competence” as a distinct content category: “communication or interview skills exhibited in relation to a clinical situation or problem solving exercise” (p. 370). Both Akimov & Malin (2020) and Fenton (2025) emphasize communication skills as a key benefit of oral assessment. Fenton notes that oral assessments develop “professional identity, communication skills, and employability.”

The current evidence model treats communication skills identically to content knowledge — same signalKind taxonomy, same confidence model, same gap detection. But communication competence is not binary (present/absent); it is multidimensional and contextual.

Open questions:

  1. Should the specification support a separate evidence category for communication skills (fluency, clarity, professionalism, engagement)?
  2. Should communication skills be assessed per-node (like content evidence) or as session-wide transversal signals?
  3. Should the specification support paralinguistic metrics (speaking rate, hesitation count, filler words) derived from STT data?
  4. Should communication skills evidence use a different confidence model than content evidence (e.g., trajectory-based rather than per-signal)?
  5. How should the specification handle the communicationStyleIsLearningOutcome flag — should it be mandatory?

Default position: Communication skills SHOULD be assessable as transversal evidence targets (session-wide, not per-node). The specification SHOULD support a communicationSkills section on evidence targets with skill-specific rubrics (fluency, clarity, professionalism). Paralinguistic metrics are a v2 feature. The communicationStyleIsLearningOutcome flag SHOULD be mandatory in all packages.

Needs input from: Assessment design team, linguistics/pedagogy team, professional accreditation bodies.


| # | Question | Default |
|---|---|---|
| Q1 | Evidence sufficiency | Deterministic policy + LLM rationale |
| Q2 | Clarification boundary | LLM-generated with guardrails; maxPerNode limit |
| Q3 | Branching | Defer to v2; design specification to not preclude |
| Q4 | Examiner persona | Simple tone + style in v1 |
| Q5 | Per-node time budget | Hard limit with configurable overrunPolicy |
| Q6 | Multilingual | Out of scope for v1; specification language-agnostic |
| Q7 | EvidenceSignal authorship | Auto + marking runtime confirmed; human review for high-stakes |
| Q8 | Transient nodes | Allowed with constraints |
| Q9 | Anxiety handling | Policy + prompt; runtime detects, LLM responds |
| Q10 | Privacy/redaction | Platform concern; specification provides hooks |
| Q11 | Transition rationale in marking | Full audit available; usage is marking policy |
| Q12 | Event protocol versioning | Always emit latest; consumers degrade gracefully |
| Q13 | Examiner calibration | Calibration mode + multi-assessor support |
| Q14 | Assessment validity framework | Optional validity metadata; v1 feature |
| Q15 | Candidate identity verification | Standard node type; platform implements |
| Q16 | Assessment portfolio/weighting | Course-level concern; specification provides courseAssessmentContext |
| Q17 | LLM bias in evidence detection | Mandatory pre-deployment bias audit; runtime supports stratified reporting |
| Q18 | Post-exam candidate feedback | Specification specifies feedback policy; v1 feature |
| Q19 | ConVOE / recorded response | Recorded response mode; defer full support to v2 but do not preclude |
| Q20 | Group oral exams | Defer to v2; requires multi-candidate session model |
| Q21 | Communication skills evidence | Separate signal channel for communication quality; v1 partial support |

| Version | Date | Changes |
|---|---|---|
| v0.2.0 | 2026-06-30 | Resolved several questions from v0.1.0. Updated remaining questions for IOA-ORM context. |
| v0.1.0 | 2026-05-06 | Initial release. |