Open Questions
Status
Draft · v0.2.0 · 2026-06-30
This chapter collects unresolved questions that require product, engineering, or stakeholder input before they can be incorporated into the specification. Each question includes context, trade-offs, and a recommended default position.
These questions are intentionally left open — they represent design decisions that should not be made unilaterally by the spec author.
Q1: Evidence Sufficiency — LLM Judgment vs Deterministic Policy + LLM Rationale
Context: When the runtime evaluates whether sufficient evidence has been collected for a question, there are two approaches:
- (A) LLM judgment: The LLM itself decides “I have enough evidence to move on.” This is flexible but non-deterministic — the same exam could produce different transitions for different candidates.
- (B) Deterministic policy + LLM rationale: The runtime evaluates a boolean expression over evidence coverage (e.g., evidence_covered(['ev-q1-required']) AND count(covered) >= 2). The LLM provides the evidence signals, but the transition decision is deterministic.
Trade-offs:
| | LLM Judgment | Deterministic Policy |
|---|---|---|
| Flexibility | High — handles nuance | Low — may miss edge cases |
| Fairness | Low — two candidates may get different treatment | High — same rules for everyone |
| Auditability | Low — “the LLM decided” | High — “policy X was applied” |
| Complexity | Low — just prompt engineering | Higher — policy authoring + expression evaluation |
| Failure mode | LLM may never transition (or transition too early) | Policy may be too strict (blocks valid transitions) or too loose |
Default position: (B) Deterministic policy + LLM rationale. Fairness and auditability are non-negotiable for assessment. The LLM’s role is to provide high-quality evidence signals; the runtime’s role is to enforce transition policy. However, the deterministic policy SHOULD be configurable per question, and a “policy: none” mode MAY be offered for low-stakes practice exams.
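A minimal Python sketch of option (B), assuming a hypothetical policy shape (a set of required evidence target IDs plus a minimum covered count) that mirrors the example expression above. The names TransitionPolicy and may_transition are illustrative and are not part of the specification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TransitionPolicy:
    """Hypothetical policy: all required targets covered AND at least min_covered in total."""
    required: frozenset = field(default_factory=frozenset)
    min_covered: int = 0


def may_transition(policy: TransitionPolicy, covered: set) -> bool:
    """Deterministic check over evidence coverage; the LLM only supplies `covered`."""
    return policy.required <= covered and len(covered) >= policy.min_covered


# Mirrors: evidence_covered(['ev-q1-required']) AND count(covered) >= 2
q1_policy = TransitionPolicy(required=frozenset({"ev-q1-required"}), min_covered=2)
```

Because the function depends only on the policy and the covered set, two candidates with the same coverage always receive the same transition decision, which is the fairness property the default position relies on.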
Open question: Should the deterministic policy be authorable in the Assessment Studio UI, or only as a code-level expression? The UI approach is more accessible to lecturers but limits expressiveness. The code approach is more powerful but requires engineering support.
Q2: Candidate Clarification Boundary
Context: When a candidate asks for clarification (e.g., “What do you mean by starvation?”), the LLM must explain without revealing the answer or rubric. But the line is blurry:
- “Starvation means a process waits indefinitely” — safe clarification.
- “Starvation is when short jobs keep arriving and long jobs never get scheduled, which is a key problem with SJF” — this is partly the answer.
- “You’re on the right track — starvation is what I’m looking for here” — this reveals the rubric.
Open questions:
- Should clarification responses be pre-authored per question (deterministic) or generated by the LLM at runtime (flexible)?
- Should there be a hard list of “safe clarification templates” for common terms, with the LLM only falling back to generation for unanticipated questions?
- How do we handle a candidate who repeatedly asks for clarification as a strategy to extract hints? Should the runtime intervene after N clarifications?
Default position: Clarification is LLM-generated with guardrails. A bank
of pre-authored safe clarifications is optional and may improve consistency.
The maxPerNode limit on clarification commands (currently 2) provides a
basic safeguard against strategic abuse.
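The maxPerNode safeguard could be enforced by the runtime along these lines. This is only a sketch: the class name ClarificationLimiter and its method are hypothetical, and the default of 2 is taken from the current limit mentioned above.

```python
from collections import defaultdict


class ClarificationLimiter:
    """Counts clarification commands per node and refuses once the cap is reached."""

    def __init__(self, max_per_node: int = 2):
        self.max_per_node = max_per_node
        self._used = defaultdict(int)  # node id -> clarifications granted so far

    def request(self, node_id: str) -> bool:
        """Return True if the clarification may proceed, False if the cap is hit."""
        if self._used[node_id] >= self.max_per_node:
            return False
        self._used[node_id] += 1
        return True
```

The counter is kept per node, so exhausting the allowance on one question does not remove it for the next.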
Needs input from: Assessment design team, academic integrity team.
Q3: Branching Exam Support
Context: Some oral assessments may want adaptive flows: for example, if a candidate struggles with Q1, Q2 could be simplified; or if they excel, Q2 could be harder. This is “branching.”
The specification currently supports a linear graph (opening → q1 → q2 → closing). Branching would require:
- Conditional transitions based on runtime state (e.g., evidence coverage, follow-up count, time used).
- Multiple candidate nodes for the same logical question position.
- Authoring complexity: the lecturer must design multiple paths.
Trade-offs:
| | No Branching | Branching |
|---|---|---|
| Authoring simplicity | High | Low |
| Fairness | High — all candidates get same questions | Lower — different candidates face different difficulty |
| Assessment validity | Easier to validate | Harder to validate (are branches equivalent?) |
| Adaptive assessment | Not possible | Possible |
Default position: Defer to v2. Branching is valuable but significantly
increases authoring complexity and validation burden. The current specification should be
designed to not preclude branching (e.g., allowedTargets is already an
array, not a singleton), but the Assessment Studio and runtime should not
implement it in v1.
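To illustrate how an array-valued allowedTargets keeps the door open for branching while v1 stays linear, here is a small sketch. The graph literal and both helper functions are assumptions made for illustration; only the node names and the allowedTargets field come from the text above.

```python
# Hypothetical linear graph: every node lists at most one allowed target in v1.
GRAPH = {
    "opening": {"allowedTargets": ["q1"]},
    "q1": {"allowedTargets": ["q2"]},
    "q2": {"allowedTargets": ["closing"]},
    "closing": {"allowedTargets": []},
}


def is_allowed(graph: dict, source: str, target: str) -> bool:
    """A transition is valid only if the target is listed on the source node."""
    return target in graph[source]["allowedTargets"]


def is_linear(graph: dict) -> bool:
    """v1 check: no node offers more than one target, so no branching exists."""
    return all(len(node["allowedTargets"]) <= 1 for node in graph.values())
```

A later version could relax is_linear without changing the shape of the graph data.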
Needs input from: Assessment design team, psychometricians.
Q4: Examiner Persona
Context: The examinerPersona field in the specification controls the LLM’s tone and
style. The current schema is:
```json
{
  "tone": "supportive_encouraging",
  "style": "asks_for_clarification_when_vague"
}
```
Open questions:
- How many persona dimensions should be configurable? Tone, style, formality, verbosity, cultural sensitivity, language complexity?
- Should personas be reusable presets (e.g., “supportive_undergrad”, “formal_postgrad”) or fully custom?
- Should the persona affect evidence detection (e.g., a supportive persona might be more generous in interpreting ambiguous answers)?
- Should the persona be allowed to vary during the exam (e.g., more formal during questions, warmer during closing)?
Default position: Keep it simple in v1 — tone (enum) and style
(string). Presets can be added later. The persona MUST NOT affect evidence
detection — that’s a separate, deterministic concern.
Needs input from: UX research, linguistics/pedagogy team.
Q5: Per-Node Time Budget
Context: The specification defines timeBudgetSeconds per question node. The runtime
enforces this with warnings and hard stops.
Open questions:
- Should the time budget be a hard limit (candidate is cut off) or a soft limit (warning only, no forced transition)?
- Should the candidate see a visible timer?
- Should the time budget account for candidate commands (repeat, clarification, raise_hand)? Currently, raise_hand pauses the timer, but repeat and clarification do not. Should they?
- Should the total exam time budget be enforced independently of per-node budgets? (e.g., “10 minutes total, however distributed”)
- Should the time budget be adjustable by a proctor during the exam?
Default position:
- Hard limit with overrunPolicy (configurable).
- Candidate MAY see a timer; this is a UI decision, not a specification decision.
- repeat and clarification SHOULD count toward the time budget (they use real time). raise_hand pauses because it’s explicitly an “I need a moment” signal.
- Total time budget SHOULD be enforced as a fallback (if all per-node budgets are exhausted but total isn’t, the exam can continue; if total is exhausted, the exam MUST end).
- Proctor adjustment is out of scope for v1.
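The default position above can be sketched as a small clock object. Several details are assumptions rather than specification: the class name BudgetClock, the "resume" command, the status strings, and the choice that raise_hand pauses both the node clock and the total clock.

```python
from dataclasses import dataclass


@dataclass
class BudgetClock:
    """Tracks elapsed seconds for one node and for the whole exam."""
    node_budget: float
    total_budget: float
    node_elapsed: float = 0.0
    total_elapsed: float = 0.0
    paused: bool = False

    def handle_command(self, command: str) -> None:
        # Only raise_hand pauses; repeat and clarification keep the clock running.
        if command == "raise_hand":
            self.paused = True
        elif command == "resume":  # hypothetical command name
            self.paused = False

    def tick(self, seconds: float) -> None:
        """Advance both clocks unless the candidate has paused the exam."""
        if not self.paused:
            self.node_elapsed += seconds
            self.total_elapsed += seconds

    def status(self) -> str:
        if self.total_elapsed >= self.total_budget:
            return "exam_must_end"  # total budget is the final backstop
        if self.node_elapsed >= self.node_budget:
            return "node_overrun"  # handled by the configured overrunPolicy
        return "ok"
```

Checking the total budget first means an exhausted exam always ends, regardless of what the per-node overrunPolicy would otherwise permit.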
Needs input from: Assessment design team, accessibility team.
Q6: Multilingual Oral Exam
Context: The specification has a metadata.language field, but the current design
assumes a single language per exam.
Open questions:
- Should the specification support multilingual exams (e.g., candidate answers in Mandarin, examiner speaks English)?
- Should the specification support code-switching (candidate mixes languages)?
- How does STT handle multilingual input? Is this a runtime concern or a specification concern?
- Should evidence detection be language-aware? (A Chinese answer and an English answer about the same concept should be evaluated equivalently.)
Default position: The specification’s language field sets the primary language.
Multilingual support is a runtime/STT concern, not a specification concern. The specification
should not encode language-specific rules — evidence targets are
language-agnostic (they describe concepts, not words). Multilingual exams
are out of scope for v1 but the specification should not preclude them.
Needs input from: Internationalisation team, STT vendor.
Q7: EvidenceSignal — Auto-Generated, Human-Reviewed, or Marking Runtime Confirmed?
Context: The evidence_signal event is the bridge between the exam runtime
and the marking pipeline. Who produces the final signal?
Options:
- (A) Auto-generated only: The exam runtime’s LLM generates evidence signals during the exam. These go directly to marking. Fast, cheap, but potentially inaccurate.
- (B) Auto-generated + human-reviewed: The LLM generates signals, but a human marker reviews and confirms/overrides them before final marking. Slow, expensive, but accurate.
- (C) Auto-generated + marking runtime confirmed: The LLM generates signals, and the marking runtime (which may itself be an AI) independently evaluates the transcript and confirms/overrides. Middle ground.
Trade-offs:
| | Auto-only | Human-reviewed | Marking runtime confirmed |
|---|---|---|---|
| Speed | Instant | Days | Minutes |
| Cost | Low | High | Medium |
| Accuracy | Medium | High | Medium-High |
| Scalability | High | Low | High |
Default position: (C) as default, with (B) as an option for high-stakes exams. The exam runtime generates signals with confidence scores. The marking runtime independently evaluates the transcript and can override signals where confidence is low or where it disagrees. For high-stakes exams (e.g., professional certification), human review is mandatory.
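One possible reading of option (C), sketched in Python. The field names (evidenceTargetId, covered, confidence, overridden, lowConfidence, needsHumanReview) and the 0.7 confidence threshold are assumptions for illustration; the specification does not define them here.

```python
def reconcile(exam_signal: dict, marking_covered: bool, *,
              threshold: float = 0.7, high_stakes: bool = False) -> dict:
    """Combine the exam runtime's signal with the marking runtime's independent verdict."""
    disagrees = exam_signal["covered"] != marking_covered
    return {
        "evidenceTargetId": exam_signal["evidenceTargetId"],
        # The marking runtime has the final automated say on coverage.
        "covered": marking_covered,
        "overridden": disagrees,
        "lowConfidence": exam_signal["confidence"] < threshold,
        # Human review is mandatory for high-stakes exams.
        "needsHumanReview": high_stakes,
    }
```

Keeping overridden and lowConfidence as separate flags lets a later audit distinguish genuine disagreement from mere uncertainty in the exam runtime.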
Needs input from: Assessment design team, academic integrity team, marking pipeline team.
Q8: Runtime-Generated Transient Nodes
Context: In some scenarios, the runtime might want to insert a node that wasn’t in the original specification:
- A “repair” node when the candidate is confused and the LLM needs to re-explain the question from scratch.
- A “buffer” node when transitioning between topics (e.g., “Great, now let’s move on to a different topic”).
- An “anxiety intervention” node when the candidate shows signs of distress.
Open questions:
- Should the runtime be allowed to insert nodes not in the specification?
- If yes, what constraints apply? (e.g., transient nodes cannot have evidence targets; they cannot extend the exam duration; they must be logged.)
- Should transient nodes be visible in the evidence ledger and transcript?
- Should the candidate know they’re in a transient node?
Default position: Transient nodes are allowed but heavily constrained.
- They MUST NOT have evidence targets (they don’t assess anything).
- They MUST NOT extend the total exam duration (they borrow time from the current node’s budget).
- They MUST be logged as transient_node_entered / transient_node_exited events.
- They MUST appear in the transcript with a transient: true flag.
- The candidate does not need to know; from their perspective, the examiner is just being helpful.
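The constraints above could be enforced by a helper such as the following sketch. The function name, its parameters, and the kind and secondsUsed fields are hypothetical; the event names and the transient flag come from the list above.

```python
def run_transient_node(kind: str, duration_s: float, node_remaining_s: float,
                       evidence_targets=()):
    """Return (events, remaining node budget) for a hypothetical transient node."""
    if evidence_targets:
        raise ValueError("transient nodes MUST NOT have evidence targets")
    # Borrow from the current node's budget; never extend the exam.
    spent = min(duration_s, node_remaining_s)
    events = [
        {"type": "transient_node_entered", "kind": kind, "transient": True},
        {"type": "transient_node_exited", "kind": kind, "transient": True,
         "secondsUsed": spent},
    ]
    return events, node_remaining_s - spent
```

Capping the time spent at the remaining node budget is what keeps the total exam duration unchanged.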
Needs input from: Assessment design team, UX team.
Q9: Candidate Anxiety / Accessibility — Policy or Prompt?
Context: The runtime should handle candidates who show signs of anxiety (e.g., long silence, repeated “I don’t know”, stuttering, asking to stop).
Options:
- (A) Prompt-only: The LLM’s system prompt includes instructions to be supportive and handle anxiety gracefully. No runtime enforcement.
- (B) Policy + prompt: The runtime detects anxiety signals (e.g., silence threshold, negative sentiment) and triggers specific actions (e.g., pause timer, offer to skip, emit anxiety_detected event). The LLM is also prompted to be supportive.
Default position: (B) for accessibility and duty of care.
- The runtime SHOULD detect basic anxiety signals (extended silence, repeated “I don’t know”, explicit requests to stop).
- On detection: pause timer, emit anxiety_detected event, prompt LLM to offer support (“Would you like to take a moment? We can also move on to the next question if you’d prefer.”).
- If the candidate explicitly requests to stop, the exam MUST end gracefully (exam_completed with reason candidate_requested_stop).
- The LLM prompt is a secondary layer: it handles the nuance, but the runtime handles the structural response.
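A rough sketch of the structural layer described in (B). The thresholds, the stop phrases, and the pause_timer and prompt_llm_support action names are placeholders invented for illustration; anxiety_detected, exam_completed and candidate_requested_stop are the names used above.

```python
STOP_PHRASES = ("i want to stop", "please stop", "end the exam")  # hypothetical phrases


def anxiety_actions(silence_s: float, dont_know_count: int, utterance: str,
                    silence_threshold_s: float = 20.0, dont_know_limit: int = 3) -> list:
    """Map basic anxiety signals to the runtime actions listed in the default position."""
    text = utterance.lower()
    # An explicit stop request always wins: the exam must end gracefully.
    if any(phrase in text for phrase in STOP_PHRASES):
        return [{"type": "exam_completed", "reason": "candidate_requested_stop"}]
    if silence_s >= silence_threshold_s or dont_know_count >= dont_know_limit:
        return [
            {"type": "pause_timer"},
            {"type": "anxiety_detected"},
            {"type": "prompt_llm_support"},
        ]
    return []
```

Real detection would need far more care than substring matching; the point is only that the response is decided by the runtime, not left to the prompt.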
Needs input from: Accessibility team, student wellbeing team, legal team.
Q10: Transcript Redaction / Privacy Policy in the Specification
Context: Transcripts may contain sensitive information: candidate names, student IDs, or personal details inadvertently spoken. Some jurisdictions require data minimisation.
Open questions:
- Should the specification define a redaction policy (e.g., “redact candidate name from transcript before storage”)?
- Should redaction happen at the STT layer, the event store layer, or the marking pipeline layer?
- Should the candidate have the right to request transcript deletion after the exam?
- Should the specification encode data retention periods (e.g., “delete transcript after 90 days”)?
Default position: Privacy is a platform-level concern, not a specification concern. The specification should not encode redaction or retention policies — these belong in the platform’s data governance framework. However, the specification SHOULD provide hooks:
- The event store SHOULD support redaction as a post-processing step.
- The marking input SHOULD include a redacted: true flag if redaction was applied.
- Retention policies are enforced at the storage layer, not the specification layer.
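As an illustration of redaction as a post-processing step, the sketch below rewrites stored transcript events and sets the redacted flag. The student ID pattern, the event shape with a text field, and the placeholder strings are all assumptions.

```python
import re

STUDENT_ID = re.compile(r"\bs\d{7}\b", re.IGNORECASE)  # hypothetical ID format


def redact_events(events: list, names=()) -> dict:
    """Return redacted copies of the events plus a flag for the marking input."""
    redacted_any = False
    out = []
    for event in events:
        text = STUDENT_ID.sub("[REDACTED_ID]", event["text"])
        for name in names:
            text = re.sub(re.escape(name), "[REDACTED_NAME]", text, flags=re.IGNORECASE)
        redacted_any = redacted_any or text != event["text"]
        out.append({**event, "text": text})  # original events are left untouched
    return {"events": out, "redacted": redacted_any}
```

The flag is set only when some text actually changed, so the marking pipeline can tell whether it is reading an altered transcript.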
Needs input from: Legal team, data protection officer, platform team.
Q11: Post-Exam Marking Referencing Runtime Transition Rationale
Context: The transition_decision event records why each transition
happened (e.g., “all evidence covered,” “time budget exceeded”). Should the
marking pipeline use this information?
Open questions:
- Should a “time budget exceeded” transition affect the mark? (e.g., penalise for running out of time, or just note it as context?)
- Should a “blocked unauthorised transition” event (guardrail violation) affect the mark?
- Should the marking pipeline be aware of how many follow-ups were used? (e.g., “candidate needed 2 follow-ups to cover evidence” vs “candidate covered evidence on the stem question”)
- Should the marking pipeline see the raw transition_decision events, or a summary?
Default position: The marking pipeline SHOULD receive the full runtime audit, including transition decisions. How it uses this information is a marking policy decision:
- Time budget exceeded: MAY affect marks (e.g., partial credit for not-covered evidence). SHOULD be flagged for human marker review.
- Guardrail violations: SHOULD be flagged for academic integrity review. SHOULD NOT automatically reduce marks (the violation is the LLM’s, not the candidate’s).
- Follow-up count: MAY inform marking (e.g., “required prompting” is weaker evidence than “volunteered”). SHOULD be available as metadata.
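The flagging rules above might be derived from the runtime audit as follows. This is a sketch under assumptions: apart from transition_decision, the event type, the reason string, the nodeId field, and the flag names are invented for the example.

```python
def marking_flags(audit: list) -> list:
    """Turn audit events into review flags without changing any mark."""
    flags = []
    for event in audit:
        if (event["type"] == "transition_decision"
                and event.get("reason") == "time_budget_exceeded"):
            flags.append(("human_marker_review", event["nodeId"]))
        elif event["type"] == "blocked_unauthorised_transition":
            # The violation is the LLM's, so it is routed to review, not to the mark.
            flags.append(("academic_integrity_review", event["nodeId"]))
    return flags
```

The function only annotates; how a flag influences the final mark remains a marking policy decision.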
Needs input from: Marking pipeline team, academic integrity team, assessment design team.
Q12: Published Package Allowing Patch Event Protocol Adapter
Context: When a published package is loaded by a newer runtime version, the runtime may emit new event types that the package wasn’t designed for. Should the runtime “downgrade” its event output to match the package’s expected event protocol version?
Options:
- (A) Always emit latest events: The runtime emits all events it supports. Older packages ignore unknown events. The UI and marking pipeline handle missing events gracefully.
- (B) Emit package-compatible events: The runtime checks the package’s specification version and only emits events that version supports. Newer events are suppressed.
- (C) Emit both: The runtime emits latest events AND a compatibility layer that maps them to the older event format.
Default position: (A) with graceful degradation.
- The runtime ALWAYS emits the latest event protocol.
- Consumers (UI, marking pipeline, event store) MUST handle unknown event types by ignoring them gracefully.
- The specification’s irVersion tells the runtime which features the package expects, but it does NOT limit which events the runtime emits.
- A compatibility adapter (C) MAY be provided as an optional utility for consumers that cannot be updated, but it is not part of the core runtime.
This is the simplest approach and avoids the combinatorial explosion of maintaining multiple event protocol versions.
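On the consumer side, graceful degradation under option (A) can be as simple as skipping event types that have no handler. In this sketch the handler table, the state shape, and the ignored counter are hypothetical.

```python
# Handlers for the event types this (older) consumer understands.
HANDLERS = {
    "evidence_signal": lambda event, state: state["signals"].append(event),
    "transition_decision": lambda event, state: state["transitions"].append(event),
}


def consume(events: list) -> dict:
    """Process known events and skip unknown ones without raising."""
    state = {"signals": [], "transitions": [], "ignored": 0}
    for event in events:
        handler = HANDLERS.get(event.get("type"))
        if handler is None:
            state["ignored"] += 1  # unknown type from a newer runtime: ignore
            continue
        handler(event, state)
    return state
```

Counting ignored events, rather than dropping them silently, gives operators a cheap signal that a consumer is falling behind the event protocol.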
Needs input from: Platform team, frontend team, marking pipeline team.
Q13: Examiner Calibration and Training
Context: Fenton (2025) recommends “a training or shadowing program with experienced instructors leading novices” and “all examiners should receive training in oral assessment procedures, including the need to focus only on professional attributes, language issues for candidates taking oral assessments in a second language, and the nature and source of bias.” Akimov & Malin (2020) found that “none of the 30 examiners surveyed had any training on how to conduct oral examinations” and that “a large number of examiners acknowledged the presence of various biases.”
The specification’s LLM-as-examiner model inherits this concern — the LLM needs calibration just as human examiners do.
Open questions:
- Should the specification include a calibration mode where the LLM assesses pre-annotated sample exams and its performance is measured against ground truth?
- Should calibration results be stored as metadata on the published package?
- How should the system handle drift — when the LLM’s evidence detection accuracy degrades over time due to model updates?
- Should the specification support multiple LLM assessors (analogous to Joughin’s panel of examiners) with inter-rater reliability tracking?
- How should the system handle examiner persona consistency across candidates? (Joughin, 1998: “reliability is threatened when examiners are poorly prepared.”)
Default position: Calibration is a runtime/platform concern, but the specification SHOULD provide hooks:
- A calibrationProfile field in metadata referencing calibration exam IDs and accuracy metrics.
- A multiAssessor option that runs evidence detection through multiple LLM instances and reports agreement.
- A driftDetection mechanism that monitors evidence signal accuracy over time.
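One simple way a multiAssessor option could report agreement is mean pairwise percent agreement, sketched below. This is an illustrative choice of metric, not one the specification names; the function name and input shape are assumptions, and each assessor is assumed to return one boolean verdict per evidence target, in the same order.

```python
from itertools import combinations


def pairwise_agreement(verdicts: dict) -> float:
    """verdicts maps assessor name -> list of booleans (one per evidence target).

    Returns the mean fraction of targets on which each pair of assessors agrees.
    """
    pairs = list(combinations(verdicts.values(), 2))
    if not pairs:
        raise ValueError("need at least two assessors")
    scores = [sum(a == b for a, b in zip(x, y)) / len(x) for x, y in pairs]
    return sum(scores) / len(scores)
```

Percent agreement does not correct for chance; a chance-corrected statistic would be a natural refinement once real calibration data exists.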
Needs input from: Assessment design team, platform team, psychometricians.
Q14: Assessment Validity Framework
Context: Akimov & Malin (2020) evaluate their oral exam against a formal validity matrix (face validity, content validity, construct validity, concurrent validity, plus four reliability measures). The specification has no equivalent framework for declaring or measuring validity.
Joughin (1998) identifies validity as a fundamental concern: the six dimensions of oral assessment are not just descriptive — they are design parameters that affect whether the assessment measures what it claims to measure.
Open questions:
- Should the specification include a validity metadata section where authors declare their validity evidence?
- Should the runtime automatically compute reliability metrics (e.g., inter-rater agreement across sessions using the same package)?
- Should the specification support validity versioning — tracking how validity evidence accumulates over multiple exam administrations?
- Should the specification mandate a content coverage map (learning outcome → evidence target → node) for summative exams?
Default position: The specification SHOULD include optional validity metadata. The
runtime SHOULD compute and report reliability metrics where data permits. This
is a v1 feature, not a deferral. The assessmentProfile field (see §09)
provides a natural location for validity declarations.
Needs input from: Assessment design team, psychometricians, quality assurance.
Q15: Candidate Identity Verification
Context: Akimov & Malin (2020) describe identity verification as standard practice: “at the start of the examination each student had to show the examiner a current student ID card or a government-issued document.” Fenton (2025) similarly recommends “ensure the student presents their identification card at the start of the assessment and ensure it matches with the student who is present.” The specification has no identity verification mechanism.
Open questions:
- Should the specification support an identity verification node (e.g., type: "identity_check") before the assessment begins?
- Should identity verification be a platform concern or a specification concern?
- How should the system handle identity verification in remote/online contexts where visual ID check may not be feasible?
- Should identity verification be mandatory for all exams or only for high-stakes summative assessments?
Default position: Identity verification SHOULD be a standard node type in the specification. The platform MAY implement it via photo ID check, biometric verification, or proctor confirmation. This is a v1 feature for any high-stakes assessment.
Needs input from: Academic integrity team, platform team, legal team.
Q16: Assessment Portfolio and Weighting
Context: Akimov & Malin (2020) argue that “oral assessments should be applied intelligently to courses and more holistically in degree programmes.” Their implementation used the oral exam as 40% of the course grade, combined with an applied project (50%) and a quiz (10%). The specification treats each exam in isolation.
Open questions:
- Should the specification support a weighting field indicating the exam’s contribution to the overall course grade?
- Should the specification support a portfolio concept linking multiple assessments (oral + written + project)?
- Should the marking pipeline be aware of the exam’s weight when determining grade boundaries?
- How should the specification handle prerequisite assessments (e.g., written submission must be completed before viva voce)?
Default position: Weighting is a course-level concern, not a specification concern.
But the specification SHOULD include a courseAssessmentContext field where authors can
declare how the oral exam fits within the broader assessment strategy. This
helps markers contextualize their decisions.
Needs input from: Assessment design team, curriculum committee.
Q17: LLM Bias in Evidence Detection
Context: The LLM that generates evidence signals may have systematic biases. Research on LLM assessment shows biases related to verbosity, cultural communication styles, and language proficiency. Joughin (1998) warns that “the social interaction entailed in oral assessment may distort communication and affect both a candidate’s performance and how that performance is perceived by the examiner.” Fenton (2025) notes “there may be a potential for bias around gender, ethnicity, language skills, speed of answering, and subjective grading.”
Open questions:
- Should the specification include bias auditing requirements (e.g., “run bias analysis on evidence signals across demographic groups before deployment”)?
- Should the runtime track and report evidence signal distributions across candidate demographics?
- Should the specification support “bias-aware” evidence detection that adjusts for known LLM biases?
- How should the system handle candidates who answer in a non-native language? Should evidence targets be language-agnostic?
- Should the specification mandate the communicationStyleIsLearningOutcome flag to prevent penalising accent or fluency when communication is not a learning outcome?
Default position: Bias auditing SHOULD be a mandatory pre-deployment step
for any high-stakes assessment. The runtime SHOULD support demographic-stratified
reporting. The specification SHOULD NOT embed bias corrections in the schema (this is a
runtime concern) but SHOULD include a biasAudit metadata field where authors
declare audit results. The communicationStyleIsLearningOutcome flag SHOULD be
mandatory in all packages.
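Demographic-stratified reporting could start from something as small as a per-group coverage rate and the largest gap between groups. The sketch assumes each signal carries a group label and a covered boolean, both hypothetical field names, and says nothing about what size of gap should trigger action.

```python
from collections import defaultdict


def coverage_by_group(signals: list) -> dict:
    """Return the fraction of signals marked covered, per demographic group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [covered count, total count]
    for signal in signals:
        totals[signal["group"]][0] += int(signal["covered"])
        totals[signal["group"]][1] += 1
    return {group: covered / n for group, (covered, n) in totals.items()}


def max_gap(rates: dict) -> float:
    """Largest difference in coverage rate between any two groups."""
    return max(rates.values()) - min(rates.values())
```

A gap on its own does not prove bias, since groups may differ for other reasons; it only identifies where an audit should look more closely.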
Needs input from: DEI team, psychometricians, assessment design team, accessibility team.
Q18: Post-Exam Candidate Feedback
Context: Fenton (2025) notes that “the oral assessment also allows teaching staff to uncover gaps in student knowledge or common misconceptions and allow for the adjustment and improvement of unit materials.” Iannone & Simpson (2012, cited in Akimov & Malin, 2020) found that “students valued the extensive feedback from assessors in the oral form of assessment as compared to just receiving a grade allocation on a written exam.”
Open questions:
- Should the specification define what feedback candidates receive after the exam (e.g., transcript access, evidence summary, learning recommendations)?
- Should the feedback be automated (generated from the evidence ledger) or human-authored?
- Should the specification support a feedback delay policy (e.g., “feedback released after all candidates in the cohort have completed”)?
- Should candidates have access to their full transcript, or only a summary?
- Should the specification support a feedback rubric that maps evidence signals to formative comments?
Default position: The specification SHOULD include a candidateFeedback policy
specifying what information is released, when, and in what format. This is a
v1 feature — candidate feedback is a core pedagogical benefit of oral
assessment that the specification should not leave to chance.
Needs input from: Assessment design team, student wellbeing team, academic integrity team.
Q19: ConVOE / Recorded Response Format
Context: Bayley et al. (2024) describe Concurrent Video-Based Oral Exams (ConVOEs) where 600+ students simultaneously record video responses via an LMS. This is a fundamentally different interaction model from the specification’s real-time dialogue paradigm. Questions are independent (no follow-ups), responses are time-limited recordings, and grading happens after submission.
Open questions:
- Should the specification support a “recorded response” mode where the candidate records answers without real-time dialogue?
- If yes, how do follow-ups work in a recorded format? (Bayley et al. note this as a limitation: “each question posed is independent from one another which does not permit instructors to delve into a student’s answer.”)
- Should the specification support hybrid modes (e.g., recorded initial response + live follow-up)?
- How should the specification handle question pools and randomisation for large-cohort concurrent administration?
- Should the specification support the parallel evaluation grading strategy (grade all candidates on Q1 before Q2)?
Default position: Recorded response mode is a legitimate assessment format
that the specification SHOULD support. The key difference is that the interactionMode
(Joughin, 1998) shifts from dialogue toward presentation. The specification should add an
interactionMode field: "real_time_dialogue" | "recorded_response" |
"hybrid". This is a v2 feature but the specification should not preclude it in v1.
Needs input from: Assessment design team, platform team, LMS integration team.
Q20: Group Oral Exams
Context: Joughin (1998) references group oral exams (Mandeville & Menchaca, 1994; Dressel, 1991) where multiple candidates are assessed simultaneously. The specification currently assumes single-candidate sessions exclusively. Group oral exams are common in professional education (e.g., business case discussions, clinical team assessments) and assess interpersonal competence in ways that individual exams cannot.
Open questions:
- Should the specification support multi-candidate sessions where the examiner interacts with 2–4 candidates simultaneously?
- How should evidence signals be attributed to individual candidates in a group setting?
- Should the specification support peer assessment within group oral exams (Joughin’s “peer assessment” examiner dimension)?
- How should turn-taking be managed when multiple candidates compete for speaking time?
- Should the specification support panel assessment (multiple human examiners + AI examiner) as described by Joughin?
Default position: Group oral exams require a multi-candidate session model
that the current specification does not support. This is a v2 feature. The specification SHOULD be
designed to not preclude group sessions (e.g., the session model should allow
for multiple candidate IDs). Panel assessment (multiple examiners) is a separate
concern that the ExaminerConfiguration construct (see §01 domain model
review proposals) could support.
Needs input from: Assessment design team, professional accreditation bodies.
Q21: Communication Skills as First-Class Evidence
Context: Joughin (1998) identifies “interpersonal competence” as a distinct content category: “communication or interview skills exhibited in relation to a clinical situation or problem solving exercise” (p. 370). Both Akimov & Malin (2020) and Fenton (2025) emphasize communication skills as a key benefit of oral assessment. Fenton notes that oral assessments develop “professional identity, communication skills, and employability.”
The current evidence model treats communication skills identically to content
knowledge — same signalKind taxonomy, same confidence model, same gap
detection. But communication competence is not binary (present/absent); it is
multidimensional and contextual.
Open questions:
- Should the specification support a separate evidence category for communication skills (fluency, clarity, professionalism, engagement)?
- Should communication skills be assessed per-node (like content evidence) or as session-wide transversal signals?
- Should the specification support paralinguistic metrics (speaking rate, hesitation count, filler words) derived from STT data?
- Should communication skills evidence use a different confidence model than content evidence (e.g., trajectory-based rather than per-signal)?
- How should the specification handle the communicationStyleIsLearningOutcome flag: should it be mandatory?
Default position: Communication skills SHOULD be assessable as transversal
evidence targets (session-wide, not per-node). The specification SHOULD support a
communicationSkills section on evidence targets with skill-specific rubrics
(fluency, clarity, professionalism). Paralinguistic metrics are a v2 feature.
The communicationStyleIsLearningOutcome flag SHOULD be mandatory in all
packages.
Needs input from: Assessment design team, linguistics/pedagogy team, professional accreditation bodies.
Summary of Default Positions
| # | Question | Default |
|---|---|---|
| Q1 | Evidence sufficiency | Deterministic policy + LLM rationale |
| Q2 | Clarification boundary | LLM-generated with guardrails; maxPerNode limit |
| Q3 | Branching | Defer to v2; design specification to not preclude |
| Q4 | Examiner persona | Simple tone + style in v1 |
| Q5 | Per-node time budget | Hard limit with configurable overrunPolicy |
| Q6 | Multilingual | Out of scope for v1; specification language-agnostic |
| Q7 | EvidenceSignal authorship | Auto + marking runtime confirmed; human review for high-stakes |
| Q8 | Transient nodes | Allowed with constraints |
| Q9 | Anxiety handling | Policy + prompt; runtime detects, LLM responds |
| Q10 | Privacy/redaction | Platform concern; specification provides hooks |
| Q11 | Transition rationale in marking | Full audit available; usage is marking policy |
| Q12 | Event protocol versioning | Always emit latest; consumers degrade gracefully |
| Q13 | Examiner calibration | Calibration mode + multi-assessor support |
| Q14 | Assessment validity framework | Optional validity metadata; v1 feature |
| Q15 | Candidate identity verification | Standard node type; platform implements |
| Q16 | Assessment portfolio/weighting | Course-level concern; specification provides courseAssessmentContext |
| Q17 | LLM bias in evidence detection | Mandatory pre-deployment bias audit; runtime supports stratified reporting |
| Q18 | Post-exam candidate feedback | Specification defines feedback policy; v1 feature |
| Q19 | ConVOE / recorded response | Recorded response mode; defer full support to v2 but do not preclude |
| Q20 | Group oral exams | Defer to v2; requires multi-candidate session model |
| Q21 | Communication skills evidence | Separate signal channel for communication quality; v1 partial support |
Revision History
| Version | Date | Changes |
|---|---|---|
| v0.2.0 | 2026-06-30 | Resolved several questions from v0.1.0. Updated remaining questions for IOA-ORM context. |
| v0.1.0 | 2026-05-06 | Initial release. |