Evidence Ledger
Status
Draft · v0.2.0 · 2026-06-30
Canonical type reference: The authoritative TypeScript type definitions for EvidenceSignal, EvidenceTarget, EvidenceLedger, TranscriptTurn, and EvidenceGap are in 02-schema.md §6–§8, §15. This document provides design rationale, lifecycle documentation, and implementation guidance for the evidence model. Where this document’s inline type definitions differ from 02-schema.md, the schema takes precedence.
1. Why Transcript Alone Is Not Enough
A raw transcript of an oral exam captures what was said but not what it means for assessment. Consider:
| What transcript gives you | What marking actually needs |
|---|---|
| “The candidate said the algorithm is O(V²).” | “The candidate demonstrated understanding of time complexity for this specific algorithm (rubric item RC-3.1).” |
| A sequence of turns with timestamps. | Which turns contributed evidence for which rubric dimensions. |
| Examiner asked a follow-up question. | The follow-up was triggered because evidence for RC-3.2 was absent, not because the LLM felt chatty. |
| Silence for 12 seconds. | Whether the silence triggered a recovery, whether the candidate eventually answered, and whether that answer provided evidence. |
The transcript is necessary but insufficient. Marking requires:
- Structured mapping from utterances to rubric items (learning outcomes, competency dimensions).
- Signal classification — was the evidence positive, partial, absent, or a misconception?
- Confidence and provenance — who proposed the signal (LLM vs. runtime), and how confident is the classification?
- Gap tracking — which rubric items were never addressed, so markers know where to probe in moderation.
- Contextual metadata — how many follow-ups were needed, whether guardrails fired, what recovery path was taken.
The Evidence Ledger provides all of these on top of the transcript.
2. Core Concepts
2.1 EvidenceTarget
An EvidenceTarget is the thing a rubric item or learning outcome asks the candidate to demonstrate. It is the bridge between the assessment authoring model and the runtime evidence collection.
/**
* An evidence target derived from the exam's rubric.
* Created at compile time from the assessment authoring model.
* One rubric item may map to multiple EvidenceTargets across different nodes.
*/
interface EvidenceTarget {
/** Unique identifier, stable across exam versions. */
targetId: string;
/** The rubric item or learning outcome this target serves. */
rubricItemId: string;
/** Human-readable label for markers and auditors. */
label: string;
/**
* The dimension of oral assessment this target addresses.
* Joughin (1998) identifies four primary content types:
* (i) knowledge and understanding, (ii) applied problem solving,
* (iii) interpersonal competence, (iv) intrapersonal qualities.
* A fifth dimension, "metacognitive", captures self-correction and reasoning process
* (Fenton, 2025: students can "reflect on their choices and have the chance to self-correct").
*
* This field is REQUIRED because different dimensions have different assessment semantics:
* a "positive" signal for knowledge_understanding means something fundamentally different
* from a "positive" signal for interpersonal_competence.
*/
evidenceDimension:
| "knowledge_understanding"
| "applied_problem_solving"
| "interpersonal_competence"
| "intrapersonal_quality"
| "metacognitive";
/**
* Whether this target is transversal (session-wide) or scoped to specific nodes.
* Transversal targets (e.g., communication quality, critical thinking) are assessed
* across ALL nodes, not scoped to specific ones. Fenton (2025): oral assessments
* develop "professional identity, communication skills, and employability" — these
* are emergent properties of the entire interaction, not per-node rubric items.
*
* Joughin (1998): interpersonal competence is "not skills per se but rather skills
* exhibited in relation to a clinical situation or problem solving exercise."
*/
transversal: boolean;
/** The node(s) where this target is expected to be evidenced. Empty for transversal targets. */
expectedNodeIds: string[];
/**
* Aggregation method for transversal targets.
* - "holistic": marker judges overall quality from the full session
* - "best_of": highest signal quality across nodes
* - "trajectory": assess whether quality improved over the session
* Ignored for non-transversal targets.
*/
aggregationMethod?: "holistic" | "best_of" | "trajectory";
/** Minimum number of positive signals needed to consider this target "covered". */
minPositiveSignals: number;
/** Whether this target is mandatory (failure to cover = automatic gap). */
mandatory: boolean;
/** Weighting factor for this target in the overall rubric (0.0–1.0). */
weight: number;
}
Relationship to rubric / learning outcomes:
- The assessment authoring studio defines rubric items (e.g., “Explain algorithmic complexity”) and maps them to learning outcomes (e.g., “LO-3: Analyse algorithm efficiency”).
- At compile time, each rubric item becomes one or more EvidenceTarget entries, scoped to specific nodes in the exam flow graph.
- The runtime collects evidence against these targets. The marking runtime evaluates whether targets were met.
- This separation means: authoring defines what to assess, runtime collects evidence, marking judges it — no single layer has full authority.
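The field comments on EvidenceTarget imply a few structural invariants that a compile step could check before an exam package is published. The following is a minimal sketch under that assumption; the helper name checkTarget, the trimmed TargetShape type, and the two rules marked "Assumption" are illustrative and not taken from 02-schema.md.

```typescript
// Trimmed-down view of EvidenceTarget: only the fields these checks read.
interface TargetShape {
  targetId: string;
  transversal: boolean;
  expectedNodeIds: string[];
  minPositiveSignals: number;
  weight: number;
}

// Hypothetical helper (not part of the schema): returns a list of problems,
// or an empty array when the target is structurally sound.
function checkTarget(t: TargetShape): string[] {
  const problems: string[] = [];
  // From the expectedNodeIds comment: "Empty for transversal targets."
  if (t.transversal && t.expectedNodeIds.length > 0) {
    problems.push(`${t.targetId}: transversal target lists expectedNodeIds`);
  }
  // Assumption: a node-scoped target must name at least one node.
  if (!t.transversal && t.expectedNodeIds.length === 0) {
    problems.push(`${t.targetId}: node-scoped target has no expectedNodeIds`);
  }
  // From the weight comment: the factor lies in 0.0–1.0.
  if (!(t.weight >= 0 && t.weight <= 1)) {
    problems.push(`${t.targetId}: weight ${t.weight} is outside 0.0–1.0`);
  }
  // Assumption: the threshold is a non-negative whole number.
  if (!Number.isInteger(t.minPositiveSignals) || t.minPositiveSignals < 0) {
    problems.push(`${t.targetId}: minPositiveSignals must be a non-negative integer`);
  }
  return problems;
}
```

Returning a list rather than throwing lets an authoring tool report every problem with a target in one pass.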
2.2 TranscriptTurn
A TranscriptTurn is the persisted version of a transcript_final event, enriched with metadata for evidence linking.
/**
* A single speaker turn in the exam transcript.
* Persisted from transcript_final events with additional linkage metadata.
*/
interface TranscriptTurn {
/** Unique turn identifier (matches transcript_final.turnId). */
turnId: string;
/** The session this turn belongs to. */
sessionId: string;
/** Who spoke. */
speaker: "candidate" | "examiner";
/** The transcribed text. */
text: string;
/** Wall-clock start time (ms from session start). */
startTimeMs: number;
/** Wall-clock end time (ms from session start). */
endTimeMs: number;
/** The node active when this turn occurred. */
nodeId: string;
/** STT confidence score (0.0–1.0). */
sttConfidence: number;
/** Language detected. */
language: string;
/** IDs of EvidenceSignals derived from this turn. Populated after analysis. */
evidenceSignalIds: string[];
/** Whether this turn was part of a recovery sequence. */
recoveryContext?: string; // recoveryId if applicable
}
Relationship to EvidenceSignal:
- A single TranscriptTurn MAY produce zero, one, or many EvidenceSignals.
- Multiple TranscriptTurns MAY contribute to the same EvidenceSignal (e.g., a multi-sentence explanation that collectively demonstrates a rubric item).
- The evidenceSignalIds back-reference on TranscriptTurn enables bi-directional traversal: “show me the signals for this utterance” and “show me the utterances behind this signal.”
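One way to keep the two directions consistent is to treat the signals' turnIds as the source of truth and derive each turn's evidenceSignalIds from them. The sketch below assumes that approach; TurnLink, SignalLink, backfillSignalIds, and turnsBehindSignal are illustrative names, not schema types.

```typescript
// Minimal shapes: only the linkage fields from TranscriptTurn and EvidenceSignal.
interface TurnLink { turnId: string; evidenceSignalIds: string[]; }
interface SignalLink { signalId: string; turnIds: string[]; }

// Derive every turn's evidenceSignalIds from the signals' turnIds,
// so the back-reference cannot drift from the forward reference.
function backfillSignalIds(turns: TurnLink[], signals: SignalLink[]): TurnLink[] {
  const idsByTurn = new Map<string, string[]>();
  for (const signal of signals) {
    for (const turnId of signal.turnIds) {
      const ids = idsByTurn.get(turnId) ?? [];
      ids.push(signal.signalId);
      idsByTurn.set(turnId, ids);
    }
  }
  return turns.map((turn) => ({
    ...turn,
    evidenceSignalIds: idsByTurn.get(turn.turnId) ?? [],
  }));
}

// "Show me the utterances behind this signal."
function turnsBehindSignal(signal: SignalLink, turns: TurnLink[]): TurnLink[] {
  return turns.filter((turn) => signal.turnIds.includes(turn.turnId));
}
```

The opposite query ("show me the signals for this utterance") is then a direct read of evidenceSignalIds on the turn.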
2.3 EvidenceSignal
An EvidenceSignal is a proposed or confirmed observation that a candidate’s utterance demonstrates (or fails to demonstrate) a rubric item. It is the atomic unit of evidence in the ledger.
Critical design constraint:
Interview runtime may generate evidence signals, but MUST NOT assign final marks.
EvidenceSignals are observations, not judgements. They say “the candidate said X, which appears to relate to rubric item Y” — they do NOT say “this merits 3 out of 5.” Scoring is the exclusive domain of the marking runtime.
/**
* A single piece of evidence linking transcript content to a rubric target.
*/
interface EvidenceSignal {
/** Unique signal identifier. */
signalId: string;
/** The session this signal belongs to. */
sessionId: string;
/** The node where this evidence was collected. */
nodeId: string;
/** Transcript turns that support this signal. */
turnIds: string[];
/** The evidence target(s) this signal addresses. */
targetIds: string[];
/**
* The dimension of oral assessment this signal addresses.
* Joughin (1998) identifies four primary content types. The "metacognitive"
* dimension captures self-correction and reasoning process quality
* (Fenton, 2025).
*/
evidenceDimension:
| "knowledge_understanding"
| "applied_problem_solving"
| "interpersonal_competence"
| "intrapersonal_quality"
| "metacognitive";
/**
* Classification of the evidence.
*
* The taxonomy extends beyond knowledge-correctness to capture process quality.
* Fenton (2025): oral assessments reveal "the process of learning rather than
* the output" and allow students to "reflect on their choices and have the
* chance to self-correct."
*
* - positive: Correct and complete evidence
* - partial: Partially correct or incomplete
* - absent: No evidence for this target
* - misconception: Demonstrates a misunderstanding
* - flawed_reasoning: Right answer with incorrect justification
* - process_positive: Good reasoning process, regardless of final answer
* - process_negative: Poor reasoning process
* - self_correction: Candidate identified and corrected their own error
*/
signalKind:
| "positive"
| "partial"
| "absent"
| "misconception"
| "flawed_reasoning"
| "process_positive"
| "process_negative"
| "self_correction";
/** Free-text description for human reviewers. */
description: string;
/** Confidence in this signal classification (0.0–1.0). */
confidence: number;
/**
* STT confidence summary for the underlying transcript turns.
* Signal confidence is epistemically dependent on transcript quality.
* A 0.85-confidence signal from 0.6-confidence transcripts is weaker
* than one from 0.95-confidence transcripts.
*/
sttConfidenceSummary: {
min: number;
max: number;
mean: number;
turnCount: number;
};
/** Who proposed this signal. */
proposedBy: "llm_analysis" | "runtime_heuristic" | "manual_marker";
/** Whether the runtime controller has validated this signal. */
approved: boolean;
/** ISO-8601 timestamp of signal creation. */
createdAt: string;
/** ISO-8601 timestamp of approval (null if not yet approved). */
approvedAt: string | null;
/** Schema version. */
schemaVersion: "1";
}
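The sttConfidenceSummary field can be derived mechanically from the sttConfidence values of the turns listed in turnIds. A minimal sketch, assuming the summary is a plain min/max/arithmetic-mean over those turns; the function name summariseSttConfidence and the decision to reject an empty input are assumptions, not schema requirements.

```typescript
interface SttConfidenceSummary {
  min: number;
  max: number;
  mean: number;
  turnCount: number;
}

// Takes the sttConfidence of every turn referenced by a signal.
function summariseSttConfidence(sttConfidences: number[]): SttConfidenceSummary {
  // Assumption: a summary over zero turns is undefined, so it is rejected here.
  if (sttConfidences.length === 0) {
    throw new Error("cannot summarise STT confidence for zero turns");
  }
  const total = sttConfidences.reduce((sum, c) => sum + c, 0);
  return {
    min: Math.min(...sttConfidences),
    max: Math.max(...sttConfidences),
    mean: total / sttConfidences.length,
    turnCount: sttConfidences.length,
  };
}
```

For a signal backed by a single turn, min, max, and mean all equal that turn's sttConfidence.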
3. Confidence Expression
Confidence is expressed as a floating-point number in the range [0.0, 1.0] and represents the system’s certainty that the signal classification is correct.
| Range | Interpretation | Typical source |
|---|---|---|
| 0.9–1.0 | Very high confidence. Clear, unambiguous evidence. | LLM high-confidence extraction from a fluent answer. |
| 0.7–0.89 | High confidence. Minor ambiguity. | LLM extraction with some hedging or partial coverage. |
| 0.5–0.69 | Moderate confidence. Ambiguous or indirect evidence. | Short answer, unclear phrasing, or indirect demonstration. |
| 0.3–0.49 | Low confidence. Weak or contradictory signals. | Conflicting evidence across turns, or very brief response. |
| 0.0–0.29 | Very low confidence. Likely noise or misclassification. | STT errors, off-topic speech, or near-silence. |
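For display in a marker-facing tool, the bands in the table can be mapped with a small lookup. This is a sketch only: the band names are invented for illustration, and because the table lists two-decimal ranges, the code treats each lower bound as inclusive so that values such as 0.895 still land in a band.

```typescript
// Hypothetical band labels; the table itself only gives prose interpretations.
type ConfidenceBand = "very_high" | "high" | "moderate" | "low" | "very_low";

function confidenceBand(confidence: number): ConfidenceBand {
  if (!(confidence >= 0 && confidence <= 1)) {
    throw new RangeError(`confidence ${confidence} is outside [0.0, 1.0]`);
  }
  if (confidence >= 0.9) return "very_high";
  if (confidence >= 0.7) return "high";
  if (confidence >= 0.5) return "moderate";
  if (confidence >= 0.3) return "low";
  return "very_low";
}
```

The band is a reading aid for humans; it does not replace the numeric confidence carried on the signal.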
Rules:
- The LLM MUST report its raw confidence. The runtime controller MUST NOT artificially inflate or deflate confidence.
- Signals with confidence below 0.3 SHOULD be flagged for manual review rather than auto-approved.
- The marking runtime MUST treat confidence as a weighting factor, not a hard threshold. A 0.4 confidence positive signal is weaker evidence than a 0.9 one, but it is still evidence.
- Confidence is per-signal, not per-rubric-item. A rubric item backed by three 0.6 signals has more aggregate evidence than one backed by a single 0.9 signal.
3.1 STT Confidence Provenance
Signal confidence is epistemically dependent on transcript quality. The sttConfidenceSummary field on EvidenceSignal provides the provenance chain from STT → transcript → signal. The marking runtime SHOULD apply appropriate skepticism when signal confidence and STT confidence diverge:
- A signal with confidence: 0.85 derived from transcripts with sttConfidenceSummary.mean: 0.6 is epistemically weaker than one from mean: 0.95 transcripts.
- When sttConfidenceSummary.mean < 0.5, the signal SHOULD be flagged for manual review regardless of signal confidence.
- The marking runtime MAY apply a provenance discount: effectiveConfidence = signalConfidence × sttConfidenceSummary.mean.
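The discount and the two review thresholds (signal confidence below 0.3, mean STT confidence below 0.5) can be sketched as follows. The function names are illustrative; only the formula and the thresholds come from the rules in this section.

```typescript
// effectiveConfidence = signalConfidence × sttConfidenceSummary.mean
function effectiveConfidence(signalConfidence: number, sttMean: number): number {
  return signalConfidence * sttMean;
}

// Combines both SHOULD-flag rules: a weak signal or a weak transcript is
// enough to route the signal to manual review.
function needsManualReview(signalConfidence: number, sttMean: number): boolean {
  return signalConfidence < 0.3 || sttMean < 0.5;
}
```

With the figures from the first bullet, a 0.85 signal over 0.6-mean transcripts has an effective confidence of 0.85 × 0.6 = 0.51, while the same signal over 0.95-mean transcripts keeps 0.85 × 0.95 = 0.8075.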
3.2 Confidence Calibration Protocol
The confidence model requires empirical validation to be meaningful. Without calibration, a confidence value in [0.0, 1.0] is an arbitrary number, not a probability. Akimov & Malin (2020) identify intra-rater reliability as a concern (“the examiner’s judgement is the same at various points in time during the assessment process”), and this applies equally to LLM confidence.
Calibration requirements:
- Baseline calibration. Before deployment, the LLM signal proposer MUST be evaluated against a corpus of pre-scored oral exam transcripts. Calibration is measured by asking whether, for example, 0.8-confidence signals are correct approximately 80% of the time. The calibration curve (predicted confidence vs. observed accuracy) MUST be documented.
- Session-level drift detection. The runtime SHOULD monitor whether the LLM’s average confidence systematically increases or decreases over a session. Systematic drift (analogous to examiner fatigue effects) indicates calibration degradation. If the mean confidence of the last quartile of signals deviates from the first quartile by more than 0.15, the session SHOULD be flagged for review.
- Inter-rater reliability tracking. When manual markers override LLM proposals during moderation, the disagreement rate MUST be tracked. Persistent disagreement (Cohen’s κ < 0.7) between LLM and human markers indicates the confidence model needs recalibration.
- Cross-session consistency. For exams using the same specification package, the distribution of signal confidence SHOULD be compared across sessions. Significant variance (standard deviation > 0.15) across sessions for equivalent targets indicates inconsistent assessment.
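Two of these checks reduce to short calculations. The sketch below shows the quartile drift test (threshold 0.15) and Cohen's κ for LLM vs. human signalKind labels (recalibration threshold 0.7). It assumes signals arrive in chronological order and that sessions with fewer than four signals are too short to assess; function names are illustrative.

```typescript
function meanOf(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Compares mean confidence of the first and last quartile of a session.
function hasConfidenceDrift(confidencesInOrder: number[], threshold = 0.15): boolean {
  const quartileSize = Math.floor(confidencesInOrder.length / 4);
  if (quartileSize === 0) return false; // assumption: too few signals to judge
  const first = confidencesInOrder.slice(0, quartileSize);
  const last = confidencesInOrder.slice(-quartileSize);
  return Math.abs(meanOf(last) - meanOf(first)) > threshold;
}

// Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
function cohensKappa(llmLabels: string[], humanLabels: string[]): number {
  const n = llmLabels.length;
  if (n === 0 || n !== humanLabels.length) {
    throw new Error("label lists must be non-empty and of equal length");
  }
  const llmCounts = new Map<string, number>();
  const humanCounts = new Map<string, number>();
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    const a = llmLabels[i];
    const b = humanLabels[i];
    if (a === b) agreements++;
    llmCounts.set(a, (llmCounts.get(a) ?? 0) + 1);
    humanCounts.set(b, (humanCounts.get(b) ?? 0) + 1);
  }
  let chance = 0;
  llmCounts.forEach((count, label) => {
    chance += (count / n) * ((humanCounts.get(label) ?? 0) / n);
  });
  const observed = agreements / n;
  // If both raters always use the same single label, agreement is trivially perfect.
  return chance === 1 ? 1 : (observed - chance) / (1 - chance);
}
```

A κ value computed this way would be compared against the 0.7 threshold over a window of moderated sessions; how large that window should be is not specified here.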
4. Evidence Gaps
An evidence gap exists when a mandatory EvidenceTarget has fewer positive signals than its minPositiveSignals threshold at the time the node is exited.
/**
* Records an evidence gap: a mandatory target that was not sufficiently evidenced.
*/
interface EvidenceGap {
/** The target that was underserved. */
targetId: string;
/** The node where evidence was expected. */
nodeId: string;
/** Number of positive signals collected. */
positiveSignalsCollected: number;
/** Minimum required. */
minPositiveSignalsRequired: number;
/** How the gap was detected. */
detectedBy: "runtime_check" | "marking_pipeline" | "manual_review";
/** Whether the gap was addressed via follow-up. */
addressedByFollowUp: boolean;
/** Whether a recovery was attempted. */
addressedByRecovery: boolean;
}
Rules:
- The runtime controller MUST check for evidence gaps when a node exits (see node_exited event).
- If a gap exists and follow-ups remain, the runtime SHOULD trigger a follow-up targeting the missing evidence.
- If a gap exists and follow-ups are exhausted, the gap is recorded and passed to marking. The marking runtime decides whether to penalise, flag for moderation, or accept partial evidence.
- Gaps MUST be persisted as part of the EvidenceLedger. They are first-class data, not just absence of signals.
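The node-exit check can be sketched as a pure function over the targets and the signals collected so far. Two assumptions are made here that the rules do not state explicitly: only approved signals count, and only signals recorded in the exiting node count towards that node's gap. The type and function names are illustrative.

```typescript
interface GapTarget {
  targetId: string;
  mandatory: boolean;
  minPositiveSignals: number;
  expectedNodeIds: string[];
}
interface GapSignal {
  nodeId: string;
  targetIds: string[];
  signalKind: string;
  approved: boolean;
}
interface GapRecord {
  targetId: string;
  nodeId: string;
  positiveSignalsCollected: number;
  minPositiveSignalsRequired: number;
  detectedBy: "runtime_check";
  addressedByFollowUp: boolean;
  addressedByRecovery: boolean;
}

function detectGapsOnNodeExit(nodeId: string, targets: GapTarget[], signals: GapSignal[]): GapRecord[] {
  const gaps: GapRecord[] = [];
  for (const target of targets) {
    // Only mandatory targets expected in this node can produce a gap here.
    if (!target.mandatory || !target.expectedNodeIds.includes(nodeId)) continue;
    const positives = signals.filter(
      (s) =>
        s.approved &&
        s.nodeId === nodeId &&
        s.signalKind === "positive" &&
        s.targetIds.includes(target.targetId),
    ).length;
    if (positives < target.minPositiveSignals) {
      gaps.push({
        targetId: target.targetId,
        nodeId,
        positiveSignalsCollected: positives,
        minPositiveSignalsRequired: target.minPositiveSignals,
        detectedBy: "runtime_check",
        // Both flags start false; the runtime sets them if it acts on the gap.
        addressedByFollowUp: false,
        addressedByRecovery: false,
      });
    }
  }
  return gaps;
}
```

A partial signal does not reduce the gap in this sketch: the definition above counts positive signals only.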
5. EvidenceLedger
The EvidenceLedger is the complete, persisted record of evidence collection for a single exam session. It is the primary input to the marking runtime.
/**
* The complete evidence ledger for an exam session.
* This is the canonical input to the marking runtime.
*/
interface EvidenceLedger {
/** The session this ledger belongs to. */
sessionId: string;
/** The exam ID. */
examId: string;
/** All evidence targets defined for this exam. */
targets: EvidenceTarget[];
/** All transcript turns, in chronological order. */
turns: TranscriptTurn[];
/** All evidence signals (approved and pending). */
signals: EvidenceSignal[];
/** All detected evidence gaps. */
gaps: EvidenceGap[];
/** Summary statistics. */
summary: {
totalTurns: number;
totalSignals: number;
signalsByKind: {
positive: number;
partial: number;
absent: number;
misconception: number;
flawed_reasoning: number;
process_positive: number;
process_negative: number;
self_correction: number;
};
signalsByDimension: {
knowledge_understanding: number;
applied_problem_solving: number;
interpersonal_competence: number;
intrapersonal_quality: number;
metacognitive: number;
};
targetsFullyCovered: number;
targetsPartiallyCovered: number;
targetsWithGaps: number;
mandatoryGaps: number;
averageConfidence: number;
averageSttConfidence: number;
};
/**
* Optional reference to the session recording.
* Akimov & Malin (2020): "to reduce the potential problem of intra-rater
* reliability, all online oral examinations were recorded and moderated."
* The recording enables post-hoc human review and moderation.
*/
recordingRef?: {
audioUrl?: string;
videoUrl?: string;
availableForModeration: boolean;
candidateConsented: boolean;
retentionPolicy: {
retainUntil: string; // ISO-8601
deleteAfterReview: boolean;
};
};
/**
* Optional moderation record.
* Akimov & Malin (2020): all oral exams were "moderated by another
* finance academic" for intra-rater reliability. This field supports
* the human-in-the-loop workflow.
*/
moderationRecord?: {
moderatorId: string;
reviewedAt: string; // ISO-8601
/** Whether the moderator agreed with the LLM's evidence signals. */
agreementRate: number; // 0.0–1.0
/** Signals the moderator overrode. */
overriddenSignalIds: string[];
/** Signals the moderator added. */
addedSignals: EvidenceSignal[];
/** Moderator notes. */
notes?: string;
};
/** ISO-8601 timestamp of ledger finalisation. */
finalisedAt: string;
/** Schema version. */
schemaVersion: "1";
}
JSON Example: EvidenceLedger Snapshot
{
"sessionId": "sess-2026-05-06-001",
"examId": "exam-midterm-orals-cs201",
"targets": [
{
"targetId": "tgt-algo-explain",
"rubricItemId": "rubric-algo-explain",
"label": "Explain the core mechanism of Dijkstra's algorithm",
"evidenceDimension": "knowledge_understanding",
"transversal": false,
"expectedNodeIds": ["q-explain-dijkstra"],
"minPositiveSignals": 1,
"mandatory": true,
"weight": 0.3
},
{
"targetId": "tgt-complexity-analysis",
"rubricItemId": "rubric-complexity-analysis",
"label": "Analyse time and space complexity of Dijkstra's algorithm",
"evidenceDimension": "knowledge_understanding",
"transversal": false,
"expectedNodeIds": ["q-explain-dijkstra"],
"minPositiveSignals": 1,
"mandatory": true,
"weight": 0.2
},
{
"targetId": "tgt-graph-apply",
"rubricItemId": "rubric-graph-apply",
"label": "Apply graph algorithms to a real-world scenario",
"evidenceDimension": "applied_problem_solving",
"transversal": false,
"expectedNodeIds": ["q-graph-scenario"],
"minPositiveSignals": 2,
"mandatory": true,
"weight": 0.3
},
{
"targetId": "tgt-communication",
"rubricItemId": "rubric-communication",
"label": "Communicate technical concepts clearly throughout the session",
"evidenceDimension": "interpersonal_competence",
"transversal": true,
"expectedNodeIds": [],
"aggregationMethod": "holistic",
"minPositiveSignals": 2,
"mandatory": false,
"weight": 0.2
}
],
"turns": [
{
"turnId": "turn-001",
"sessionId": "sess-2026-05-06-001",
"speaker": "candidate",
"text": "Dijkstra's algorithm works by greedily selecting the unvisited node with the smallest known distance, then relaxing all its outgoing edges.",
"startTimeMs": 18200,
"endTimeMs": 24500,
"nodeId": "q-explain-dijkstra",
"sttConfidence": 0.91,
"language": "en",
"evidenceSignalIds": ["sig-001", "sig-003"]
},
{
"turnId": "turn-002",
"sessionId": "sess-2026-05-06-001",
"speaker": "examiner",
"text": "Can you explain what happens when there are negative edge weights?",
"startTimeMs": 25000,
"endTimeMs": 31500,
"nodeId": "q-explain-dijkstra",
"sttConfidence": 0.95,
"language": "en",
"evidenceSignalIds": []
},
{
"turnId": "turn-003",
"sessionId": "sess-2026-05-06-001",
"speaker": "candidate",
"text": "Dijkstra's doesn't handle negative weights. You'd need Bellman-Ford for that because the greedy assumption breaks down.",
"startTimeMs": 33000,
"endTimeMs": 39200,
"nodeId": "q-explain-dijkstra",
"sttConfidence": 0.88,
"language": "en",
"evidenceSignalIds": ["sig-002", "sig-004", "sig-005"]
}
],
"signals": [
{
"signalId": "sig-001",
"sessionId": "sess-2026-05-06-001",
"nodeId": "q-explain-dijkstra",
"turnIds": ["turn-001"],
"targetIds": ["tgt-algo-explain"],
"evidenceDimension": "knowledge_understanding",
"signalKind": "positive",
"description": "Candidate correctly described the greedy selection strategy and edge relaxation process.",
"confidence": 0.88,
"sttConfidenceSummary": { "min": 0.91, "max": 0.91, "mean": 0.91, "turnCount": 1 },
"proposedBy": "llm_analysis",
"approved": true,
"createdAt": "2026-05-06T02:00:50.000Z",
"approvedAt": "2026-05-06T02:00:50.500Z",
"schemaVersion": "1"
},
{
"signalId": "sig-002",
"sessionId": "sess-2026-05-06-001",
"nodeId": "q-explain-dijkstra",
"turnIds": ["turn-003"],
"targetIds": ["tgt-algo-explain"],
"evidenceDimension": "knowledge_understanding",
"signalKind": "positive",
"description": "Candidate identified the limitation with negative weights and named an alternative algorithm.",
"confidence": 0.85,
"sttConfidenceSummary": { "min": 0.88, "max": 0.88, "mean": 0.88, "turnCount": 1 },
"proposedBy": "llm_analysis",
"approved": true,
"createdAt": "2026-05-06T02:00:52.000Z",
"approvedAt": "2026-05-06T02:00:52.300Z",
"schemaVersion": "1"
},
{
"signalId": "sig-003",
"sessionId": "sess-2026-05-06-001",
"nodeId": "q-explain-dijkstra",
"turnIds": ["turn-001"],
"targetIds": ["tgt-complexity-analysis"],
"evidenceDimension": "knowledge_understanding",
"signalKind": "partial",
"description": "Candidate described the algorithm mechanism but did not explicitly state time complexity.",
"confidence": 0.72,
"sttConfidenceSummary": { "min": 0.91, "max": 0.91, "mean": 0.91, "turnCount": 1 },
"proposedBy": "llm_analysis",
"approved": true,
"createdAt": "2026-05-06T02:00:50.000Z",
"approvedAt": "2026-05-06T02:00:50.800Z",
"schemaVersion": "1"
},
{
"signalId": "sig-004",
"sessionId": "sess-2026-05-06-001",
"nodeId": "q-explain-dijkstra",
"turnIds": ["turn-003"],
"targetIds": ["tgt-communication"],
"evidenceDimension": "interpersonal_competence",
"signalKind": "positive",
"description": "Candidate communicated the concept concisely using correct terminology.",
"confidence": 0.80,
"sttConfidenceSummary": { "min": 0.88, "max": 0.88, "mean": 0.88, "turnCount": 1 },
"proposedBy": "llm_analysis",
"approved": true,
"createdAt": "2026-05-06T02:00:52.000Z",
"approvedAt": "2026-05-06T02:00:52.500Z",
"schemaVersion": "1"
},
{
"signalId": "sig-005",
"sessionId": "sess-2026-05-06-001",
"nodeId": "q-explain-dijkstra",
"turnIds": ["turn-003"],
"targetIds": ["tgt-algo-explain"],
"evidenceDimension": "metacognitive",
"signalKind": "self_correction",
"description": "Candidate corrected their earlier implicit assumption about negative weights by explicitly naming the limitation.",
"confidence": 0.82,
"sttConfidenceSummary": { "min": 0.88, "max": 0.88, "mean": 0.88, "turnCount": 1 },
"proposedBy": "llm_analysis",
"approved": true,
"createdAt": "2026-05-06T02:00:52.000Z",
"approvedAt": "2026-05-06T02:00:52.600Z",
"schemaVersion": "1"
}
],
"gaps": [
{
"targetId": "tgt-complexity-analysis",
"nodeId": "q-explain-dijkstra",
"positiveSignalsCollected": 0,
"minPositiveSignalsRequired": 1,
"detectedBy": "runtime_check",
"addressedByFollowUp": true,
"addressedByRecovery": false
}
],
"summary": {
"totalTurns": 3,
"totalSignals": 5,
"signalsByKind": {
"positive": 3,
"partial": 1,
"absent": 0,
"misconception": 0,
"flawed_reasoning": 0,
"process_positive": 0,
"process_negative": 0,
"self_correction": 1
},
"signalsByDimension": {
"knowledge_understanding": 3,
"applied_problem_solving": 0,
"interpersonal_competence": 1,
"intrapersonal_quality": 0,
"metacognitive": 1
},
"targetsFullyCovered": 2,
"targetsPartiallyCovered": 1,
"targetsWithGaps": 1,
"mandatoryGaps": 1,
"averageConfidence": 0.81,
"averageSttConfidence": 0.89
},
"recordingRef": {
"audioUrl": "https://recordings.example.com/sess-2026-05-06-001.opus",
"availableForModeration": true,
"candidateConsented": true,
"retentionPolicy": {
"retainUntil": "2026-12-06T00:00:00Z",
"deleteAfterReview": true
}
},
"finalisedAt": "2026-05-06T02:15:01.000Z",
"schemaVersion": "1"
}
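Several of the summary figures in the snapshot can be re-derived from the signals array, which is a useful consistency check at finalisation. A minimal sketch, assuming averageConfidence is the arithmetic mean rounded to two decimals and that an empty ledger reports 0; the function names are illustrative.

```typescript
type SignalKind =
  | "positive" | "partial" | "absent" | "misconception"
  | "flawed_reasoning" | "process_positive" | "process_negative" | "self_correction";

const ALL_KINDS: SignalKind[] = [
  "positive", "partial", "absent", "misconception",
  "flawed_reasoning", "process_positive", "process_negative", "self_correction",
];

interface SummarySignal { signalKind: SignalKind; confidence: number; }

// Every kind gets an entry, including kinds with zero signals.
function countSignalsByKind(signals: SummarySignal[]): Record<SignalKind, number> {
  const counts = {} as Record<SignalKind, number>;
  for (const kind of ALL_KINDS) counts[kind] = 0;
  for (const signal of signals) counts[signal.signalKind] += 1;
  return counts;
}

// Assumption: mean over all signals, rounded to two decimals; 0 when empty.
function averageSignalConfidence(signals: SummarySignal[]): number {
  if (signals.length === 0) return 0;
  const total = signals.reduce((sum, s) => sum + s.confidence, 0);
  return Math.round((total / signals.length) * 100) / 100;
}
```

For the five signals above, the confidences 0.88, 0.85, 0.72, 0.80, and 0.82 sum to 4.07, giving a mean of 0.814 and the 0.81 shown in the summary.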
6. EvidenceSignal: LLM Proposal vs. Runtime-Approved Record
The interview bot’s LLM analysis layer proposes EvidenceSignals in real time as the conversation unfolds. However, these proposals are NOT automatically canonical.
Lifecycle
LLM proposes signal
       │
       ▼
┌──────────────┐     ┌────────────────────┐
│ approved:    │     │ approved:          │
│ false        │────▶│ true               │
│ (pending)    │     │ (canonical)        │
└──────────────┘     └────────────────────┘
       │                       │
       │ rejected              │ feeds into
       ▼                       ▼
(discarded or           EvidenceLedger
 flagged for            → markingRuntime
 manual review)
- LLM proposals are advisory. The LLM emits signals with approved: false. The runtime controller validates them against structural constraints before promoting to approved: true.
- Validation checks before approval:
  - The referenced nodeId MUST be the currently active node.
  - The referenced turnIds MUST exist in the persisted transcript.
  - The referenced targetIds MUST be valid for the current node.
  - The signal MUST NOT duplicate an existing approved signal for the same (targetId, turnIds) combination.
  - Confidence MUST be in [0.0, 1.0].
- Only approved signals enter the canonical EvidenceLedger. Pending proposals are kept in a staging area for audit but are not visible to marking.
- Manual override: A human marker MAY approve, reject, or amend signals during moderation. Manual signals use proposedBy: "manual_marker" and are always approved: true upon creation.
- The LLM MUST NOT self-approve. Even if the LLM is highly confident, the runtime controller is the sole approval authority. This prevents the LLM from fabricating a confident-sounding but incorrect signal that bypasses review.
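The five validation checks can be sketched as one function that returns every failed check, so the audit log can record why a proposal was rejected. The ApprovalContext shape, the function names, and the failure strings are assumptions for illustration; the caller is assumed to include transversal targets in targetIdsForActiveNode.

```typescript
interface SignalProposal {
  nodeId: string;
  turnIds: string[];
  targetIds: string[];
  confidence: number;
}
interface ApprovalContext {
  activeNodeId: string;
  persistedTurnIds: Set<string>;
  targetIdsForActiveNode: Set<string>;
  approvedSignals: { targetIds: string[]; turnIds: string[] }[];
}

// Order-insensitive key for a (targetId, turnIds) combination.
function pairKey(targetId: string, turnIds: string[]): string {
  return `${targetId}|${[...turnIds].sort().join(",")}`;
}

// Returns the failed checks; an empty array means the proposal may be approved.
function validateProposal(p: SignalProposal, ctx: ApprovalContext): string[] {
  const failures: string[] = [];
  if (p.nodeId !== ctx.activeNodeId) failures.push("nodeId is not the active node");
  if (!p.turnIds.every((id) => ctx.persistedTurnIds.has(id))) {
    failures.push("turnIds reference a turn that is not persisted");
  }
  if (!p.targetIds.every((id) => ctx.targetIdsForActiveNode.has(id))) {
    failures.push("targetIds are not valid for the current node");
  }
  const taken = new Set<string>();
  for (const s of ctx.approvedSignals) {
    for (const targetId of s.targetIds) taken.add(pairKey(targetId, s.turnIds));
  }
  if (p.targetIds.some((targetId) => taken.has(pairKey(targetId, p.turnIds)))) {
    failures.push("duplicates an approved signal for the same (targetId, turnIds)");
  }
  if (!(p.confidence >= 0 && p.confidence <= 1)) failures.push("confidence is outside [0.0, 1.0]");
  return failures;
}
```

The duplicate key here follows the rule literally. A deployment that wants to allow, say, both a positive and a self_correction signal on the same turn and target would need to widen the key, for example by including signalKind.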
7. EvidenceLedger as markingRuntime Input
The EvidenceLedger is the canonical, structured input to the marking runtime. It replaces raw transcript processing.
markingRuntime Input Excerpt
{
"sessionId": "sess-2026-05-06-001",
"examId": "exam-midterm-orals-cs201",
"candidateId": "student-2024-0456",
"submittedAt": "2026-05-06T02:15:01.000Z",
"targets": [
{
"targetId": "tgt-algo-explain",
"rubricItemId": "rubric-algo-explain",
"label": "Explain the core mechanism of Dijkstra's algorithm",
"evidenceDimension": "knowledge_understanding",
"transversal": false,
"mandatory": true,
"weight": 0.3,
"signals": [
{
"signalId": "sig-001",
"signalKind": "positive",
"evidenceDimension": "knowledge_understanding",
"confidence": 0.88,
"sttConfidenceSummary": { "min": 0.91, "max": 0.91, "mean": 0.91, "turnCount": 1 },
"description": "Candidate correctly described the greedy selection strategy and edge relaxation process.",
"turnText": "Dijkstra's algorithm works by greedily selecting the unvisited node with the smallest known distance, then relaxing all its outgoing edges."
},
{
"signalId": "sig-002",
"signalKind": "positive",
"evidenceDimension": "knowledge_understanding",
"confidence": 0.85,
"sttConfidenceSummary": { "min": 0.88, "max": 0.88, "mean": 0.88, "turnCount": 1 },
"description": "Candidate identified the limitation with negative weights and named an alternative algorithm.",
"turnText": "Dijkstra's doesn't handle negative weights. You'd need Bellman-Ford for that because the greedy assumption breaks down."
}
],
"gap": null
},
{
"targetId": "tgt-complexity-analysis",
"rubricItemId": "rubric-complexity-analysis",
"label": "Analyse time and space complexity of Dijkstra's algorithm",
"evidenceDimension": "knowledge_understanding",
"transversal": false,
"mandatory": true,
"weight": 0.2,
"signals": [
{
"signalId": "sig-003",
"signalKind": "partial",
"evidenceDimension": "knowledge_understanding",
"confidence": 0.72,
"sttConfidenceSummary": { "min": 0.91, "max": 0.91, "mean": 0.91, "turnCount": 1 },
"description": "Candidate described the algorithm mechanism but did not explicitly state time complexity.",
"turnText": "Dijkstra's algorithm works by greedily selecting the unvisited node with the smallest known distance, then relaxing all its outgoing edges."
}
],
"gap": {
"positiveSignalsCollected": 0,
"minPositiveSignalsRequired": 1,
"addressedByFollowUp": true
}
},
{
"targetId": "tgt-communication",
"rubricItemId": "rubric-communication",
"label": "Communicate technical concepts clearly throughout the session",
"evidenceDimension": "interpersonal_competence",
"transversal": true,
"aggregationMethod": "holistic",
"mandatory": false,
"weight": 0.2,
"signals": [
{
"signalId": "sig-004",
"signalKind": "positive",
"evidenceDimension": "interpersonal_competence",
"confidence": 0.80,
"sttConfidenceSummary": { "min": 0.88, "max": 0.91, "mean": 0.90, "turnCount": 2 },
"description": "Candidate communicated concepts concisely using correct terminology across multiple nodes.",
"turnText": "[aggregated across turns]"
}
],
"gap": null
}
],
"examinerTurns": [
{
"turnId": "turn-002",
"text": "Can you explain what happens when there are negative edge weights?",
"nodeId": "q-explain-dijkstra",
"purpose": "follow_up"
}
],
"guardrailEvents": [],
"metadata": {
"totalDurationSec": 900,
"followUpsUsed": 1,
"recoveryCount": 0,
"guardrailTriggerCount": 0
},
"recordingRef": {
"audioUrl": "https://recordings.example.com/sess-2026-05-06-001.opus",
"availableForModeration": true
}
}
Key design decisions for marking input:
- Signals are grouped by target, not by turn. This matches how markers think: “did the candidate demonstrate X?”
- Each signal includes the original turnText so markers can verify context without querying the transcript.
- Gaps are surfaced explicitly. A gap field on a target tells the marker: “the runtime noticed this wasn’t covered.”
- Examiner turns are included so markers can assess question quality and follow-up appropriateness.
- Metadata provides session-level context (duration, follow-ups, guardrails) that may inform marking decisions.
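The target-grouped shape can be produced from a finalised ledger with a straightforward regrouping. The sketch below is illustrative only: it keeps a subset of the fields in the excerpt, includes approved signals only, and assumes that the turnText of a multi-turn signal is the referenced turn texts joined with a space. All type and function names are hypothetical.

```typescript
interface MTarget { targetId: string; label: string; }
interface MTurn { turnId: string; text: string; }
interface MSignal {
  signalId: string; targetIds: string[]; turnIds: string[];
  signalKind: string; confidence: number; approved: boolean;
}
interface MGap {
  targetId: string; positiveSignalsCollected: number;
  minPositiveSignalsRequired: number; addressedByFollowUp: boolean;
}
interface MarkingTarget {
  targetId: string;
  label: string;
  signals: { signalId: string; signalKind: string; confidence: number; turnText: string }[];
  gap: Omit<MGap, "targetId"> | null;
}

function buildMarkingTargets(
  targets: MTarget[], turns: MTurn[], signals: MSignal[], gaps: MGap[],
): MarkingTarget[] {
  const textByTurn = new Map<string, string>();
  for (const turn of turns) textByTurn.set(turn.turnId, turn.text);

  return targets.map((target) => {
    const gap = gaps.find((g) => g.targetId === target.targetId);
    return {
      targetId: target.targetId,
      label: target.label,
      signals: signals
        .filter((s) => s.approved && s.targetIds.includes(target.targetId))
        .map((s) => ({
          signalId: s.signalId,
          signalKind: s.signalKind,
          confidence: s.confidence,
          // Assumption: multi-turn signals get their turn texts joined with a space.
          turnText: s.turnIds.map((id) => textByTurn.get(id) ?? "").join(" "),
        })),
      gap: gap
        ? {
            positiveSignalsCollected: gap.positiveSignalsCollected,
            minPositiveSignalsRequired: gap.minPositiveSignalsRequired,
            addressedByFollowUp: gap.addressedByFollowUp,
          }
        : null,
    };
  });
}
```

A signal that addresses several targets appears once under each of them, which matches the grouping-by-target decision above.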
8. Preventing Interview Runtime from Direct Scoring
The architecture enforces a strict separation of concerns:
| Concern | Owner | Allowed | Forbidden |
|---|---|---|---|
| Evidence collection | Interview runtime (bot + runtime controller) | Propose signals, track gaps, record transcript | Assign scores, compute grades, determine pass/fail |
| Evidence validation | Runtime controller | Approve/reject LLM proposals, enforce structural constraints | Modify signal content or confidence |
| Scoring | Marking runtime | Consume approved signals, apply rubric weights, compute scores | Access raw LLM proposals or pending signals |
| Moderation | Human markers | Override signals, add manual evidence, adjust scores | Bypass the evidence ledger |
Enforcement Mechanisms
- API boundary: The marking runtime consumes the EvidenceLedger via a read-only interface. It has no write access to the interview runtime’s state.
- No score fields in interview events: No event or signal type contains a score, grade, mark, points, or pass/fail field. These concepts do not exist in the interview layer.
- LLM prompt constraints: The bot’s system prompt MUST NOT include rubric scoring criteria, grade boundaries, or point allocations. The LLM may be instructed to identify evidence but is explicitly forbidden from scoring it.
- Runtime controller audit: The runtime controller logs all LLM proposals and their approval/rejection status. Any signal that bypasses the approval pipeline is a protocol violation and MUST trigger a guardrail_triggered event.
- Temporal separation: The EvidenceLedger is finalised (finalisedAt timestamp) only after the exam session ends. The marking runtime is invoked only after finalisation. There is no real-time scoring during the interview.
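The "no score fields" rule lends itself to an automated check, for example in a test suite that walks every emitted event or signal payload. The sketch below is one possible guard; the key normalisation (case-insensitive, punctuation stripped, so that passFail, pass_fail, and "pass/fail" all match) is an assumption, as is the function name.

```typescript
// Normalised names of fields that must never appear in the interview layer.
// "passfail" is assumed to cover spellings such as passFail and pass_fail.
const FORBIDDEN_FIELDS = new Set(["score", "grade", "mark", "points", "passfail"]);

// Recursively collects the JSON paths of any forbidden keys in a payload.
function findScoreFields(value: unknown, path: string = "$"): string[] {
  const hits: string[] = [];
  if (Array.isArray(value)) {
    value.forEach((item, index) => {
      hits.push(...findScoreFields(item, `${path}[${index}]`));
    });
  } else if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      const childPath = `${path}.${key}`;
      const normalised = key.toLowerCase().replace(/[^a-z0-9]/g, "");
      if (FORBIDDEN_FIELDS.has(normalised)) hits.push(childPath);
      hits.push(...findScoreFields(record[key], childPath));
    }
  }
  return hits;
}
```

Only key names are inspected, so a value such as proposedBy: "manual_marker" is not a violation. Keys are matched exactly after normalisation, so a field named remarks would also pass.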
Interview runtime may generate evidence signals, but MUST NOT assign final marks.
9. Human-in-the-Loop: Moderation and Review
Akimov & Malin (2020) describe their moderation process: “to reduce the potential problem of intra-rater reliability, all online oral examinations were recorded and moderated by another finance academic.” The Evidence Ledger supports this workflow through first-class moderation fields.
9.1 Moderation Workflow
Exam session ends
│
▼
EvidenceLedger finalised
│
▼
┌──────────────────────┐
│ Marking runtime │
│ produces initial │
│ assessment │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Moderation policy │
│ selects sessions │
│ for human review │
│ (random/stratified/ │
│ all/flagged) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Human moderator │
│ reviews signals, │
│ may override/add │
│ records in │
│ moderationRecord │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Final marks │
│ computed with │
│ moderated signals │
└──────────────────────┘
9.2 Moderation Rules
- Recording is a prerequisite. If recordingRef is absent, the session MUST NOT bypass moderation: all sessions without recordings require mandatory human review.
- Moderation sampling. The exam package SHOULD define a moderation sampling strategy:
  - "all": every session reviewed (required for high-stakes exams)
  - "random": random sample (default rate: 10%)
  - "stratified": stratified by confidence quartile, with low-confidence sessions over-sampled
  - "flagged": only sessions with mandatory gaps, guardrail triggers, or low average confidence
- Moderator authority. The moderator MAY:
  - Override any EvidenceSignal (change signalKind, confidence, or description)
  - Add new EvidenceSignal entries (with proposedBy: "manual_marker")
  - Remove erroneous signals
  - Add notes explaining their decisions
- Audit trail. All moderation actions are recorded in the moderationRecord field. The original LLM-proposed signals are preserved for inter-rater reliability analysis.
- Calibration feedback. Moderation disagreement rates MUST be tracked and fed back into the confidence calibration protocol (§3.2). Persistent disagreement triggers recalibration.
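The sampling strategies and the recording rule can be combined into a single selection function. This is a sketch under several assumptions that the rules leave open: the per-quartile rates for "stratified", the 0.5 cut-off for "low average confidence", and the SessionStats shape are all invented for illustration. The random source is injectable so that selection can be reproduced in tests.

```typescript
type SamplingStrategy = "all" | "random" | "stratified" | "flagged";

interface SessionStats {
  sessionId: string;
  hasRecording: boolean;
  mandatoryGaps: number;
  guardrailTriggerCount: number;
  averageConfidence: number;
}

const RANDOM_RATE = 0.1;                       // "default rate: 10%"
const STRATIFIED_RATES = [0.4, 0.2, 0.1, 0.1]; // assumed; lowest-confidence quartile first
const LOW_CONFIDENCE = 0.5;                    // assumed cut-off for "low average confidence"

function selectForModeration(
  sessions: SessionStats[],
  strategy: SamplingStrategy,
  rng: () => number = Math.random,
): string[] {
  // Rank sessions by average confidence to find each one's quartile (0 = lowest).
  const ranked = [...sessions].sort((a, b) => a.averageConfidence - b.averageConfidence);
  const quartile = new Map<string, number>();
  ranked.forEach((s, i) => quartile.set(s.sessionId, Math.floor((4 * i) / ranked.length)));

  const selected = sessions.filter((s) => {
    // Sessions without a recording always go to human review.
    if (!s.hasRecording) return true;
    switch (strategy) {
      case "all":
        return true;
      case "random":
        return rng() < RANDOM_RATE;
      case "stratified":
        return rng() < STRATIFIED_RATES[quartile.get(s.sessionId) ?? 3];
      case "flagged":
        return (
          s.mandatoryGaps > 0 ||
          s.guardrailTriggerCount > 0 ||
          s.averageConfidence < LOW_CONFIDENCE
        );
    }
  });
  return selected.map((s) => s.sessionId);
}
```

Checking hasRecording before the strategy means that even a "flagged" or "random" policy cannot skip an unrecorded session.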
9.3 Session Recording
The recordingRef field on EvidenceLedger references the audio/video recording of the session. This is essential for:
- Moderation review: Moderators can re-listen to specific moments flagged by low-confidence signals.
- Appeal resolution: Candidates who contest their assessment can have their recording reviewed by an independent panel.
- Calibration: Recordings provide ground truth for evaluating LLM signal accuracy.
- Fairness auditing: Recordings enable demographic-stratified analysis of assessment quality.
Retention policies MUST comply with institutional data governance requirements. The retentionPolicy field specifies when recordings will be deleted.
Revision History
| Version | Date | Changes |
|---|---|---|
| v0.2.0 | 2026-06-30 | Added integrated_practice evidence signal. Updated canonical type references to align with schema v0.2.0 additions. |
| v0.1.0 | 2026-05-06 | Initial release. |