
AI Performance Reviews

Performance reviews drafted by AI, edited by humans, calibrated by statistics, fact-checked against signals.

Most "AI review" tools generate generic prose from a feedback form. Bynarize generates a structured narrative grounded in 12 months of signals (Goals, Feedback, Moments, Snapshots, Behavioural Indicators), runs it through guardrails (PII / Toxicity / Bias / Schema), fact-checks every claim against SignalAggregate, scores hallucination risk, calibrates confidence — and only THEN puts it in front of a manager. Section-by-section regenerate. Mandatory human edit. Full per-call trace. Manager edits feed back into the learning loop.

Real-world differentiator

Why most AI review tools fail — and how Bynarize fixes it.

Most "AI for performance reviews" wraps a single LLM call around a feedback form. Bynarize wraps every AI call in a governed orchestrator with guardrails, RAG, fact-check, hallucination detection, calibrated confidence and a learning loop — so the output is fast, fair, accurate, calibrated, defensible AND the manager is always in control.

What everyone else does: Single LLM call → "draft my review"
Why it actually hurts you: No grounding, no fact-check, no calibration, no audit — the manager either trusts blindly or rewrites everything
How Bynarize solves it: Drafts grounded in 12 months of signals + Goals + Feedback + Moments, fact-checked vs SignalAggregate, hallucination-detected, confidence-calibrated, fully traced

What everyone else does: "AI wrote nice things" — about a project the employee never touched
Why it actually hurts you: Reviews lose all credibility the first time a manager spots a hallucinated achievement
How Bynarize solves it: Hallucination Detector with severity-driven gating; fact-checker validates every claim vs source-of-truth before the manager sees it

What everyone else does: Confidence shown as a flat "70%" with no calibration
Why it actually hurts you: Managers either over-trust or ignore the score — AI loses the room either way
How Bynarize solves it: AIConfidence aggregates model self-report + Platt/Isotonic calibration + coverage + recency + agreement; thresholds are tenant-configurable

What everyone else does: Manager edits vanish into the void
Why it actually hurts you: The same prompt mistakes repeat next cycle; no learning happens
How Bynarize solves it: Manager edits feed AILearningEvent + AIFeedbackSignal; A/B prompt experiments with hourly lift/p-value pick the winners automatically

What everyone else does: Generic feedback analysis: "this comment seems negative"
Why it actually hurts you: Themes, biases and recognition patterns stay invisible — managers still operate on gut feel
How Bynarize solves it: Per-feedback sentiment + theme + toxicity + bias-language analysis, employee theme aggregates, RecognitionGapAlert by severity with recommended action

What everyone else does: No audit trail when an AI-written line decides a promotion
Why it actually hurts you: Legal, DEI and the employee all have valid grounds to challenge — and the company has no defence
How Bynarize solves it: Per-call AIExecutionRun + steps + RAG retrieval log + guardrail decisions + fact-check + confidence + cost — every AI decision defensible end-to-end

Eight things AI reviews quietly break — fixed

Generative AI without governance is a lawsuit waiting to happen. Ours isn't.

Managers stare at a blank review form three weekends in a row.

AI Review Narrative drafts strengths, improvements, behavioural insights, coaching and promotion language in seconds — manager edits with track-changes, original AI version always preserved.

"AI wrote nice things — but did the employee actually do any of them?"

Fact-checker extracts every claim and verifies it against SignalAggregate (PRs merged, tickets closed, kudos received, courses completed). Unverifiable claims are flagged BEFORE the manager sees them.
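
To make the mechanism concrete, here is a minimal sketch of the fact-check step. It assumes claims arrive as structured extractions and that SignalAggregate exposes simple counters; the field names and the matching rule are illustrative, not the shipped schema.

```python
# Minimal fact-check sketch: match each extracted claim against aggregated
# source-of-truth signals. Field names and matching rule are assumptions.
signal_aggregate = {"prs_merged": 14, "releases_shipped": 3, "kudos_received": 6}

claims = [
    {"text": "shipped 3 releases this quarter", "metric": "releases_shipped", "value": 3},
    {"text": "led the checkout redesign",       "metric": None,               "value": None},
]

def fact_check(claim: dict) -> str:
    metric, value = claim["metric"], claim["value"]
    if metric is None or metric not in signal_aggregate:
        return "NeedsReview"          # unverifiable -> flagged before the manager sees it
    return "Verified" if signal_aggregate[metric] >= value else "Contradicted"

for claim in claims:
    print(fact_check(claim), "-", claim["text"])
```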

AI hallucinates a project the employee never worked on.

Hallucination Detector runs on every output with severity-driven Block / RequiresHumanReview / Allow. High-severity hallucinations never reach the manager — they go to a review queue with the original prompt and context.
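
A minimal sketch of severity-driven gating, assuming the detector returns a severity score in [0, 1]; the two cut-offs are invented for illustration, and only the three verdicts come from the page.

```python
# Severity-driven gating sketch. The detector interface and both thresholds
# are illustrative assumptions.
BLOCK_AT, REVIEW_AT = 0.8, 0.4   # hypothetical severity cut-offs

def gate(severity: float) -> str:
    if severity >= BLOCK_AT:
        return "Block"                 # high severity never reaches the manager
    if severity >= REVIEW_AT:
        return "RequiresHumanReview"   # routed to the review queue with context
    return "Allow"

print(gate(0.91), gate(0.55), gate(0.10))
```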

"70% confidence" means nothing without calibration.

AIConfidence aggregates model self-report + Platt/Isotonic calibration + coverage + recency + agreement — so when we say 70% it really means 70%. Auto-publish vs human-review thresholds are configurable per task.
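
As a rough sketch of how such an aggregate could be assembled: the weights, the blending formula and the use of scikit-learn's IsotonicRegression are assumptions; only the five input names come from the page.

```python
# Sketch of a multi-input confidence aggregate with isotonic calibration.
from sklearn.isotonic import IsotonicRegression

# Calibrator fitted on historical (raw self-report -> observed accuracy) pairs.
raw_scores = [0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
observed   = [0.40, 0.50, 0.55, 0.75, 0.85, 0.90]
calibrator = IsotonicRegression(out_of_bounds="clip").fit(raw_scores, observed)

def ai_confidence(self_report: float, coverage: float,
                  recency: float, agreement: float) -> float:
    """All inputs in [0, 1]; the weights below are hypothetical."""
    calibrated = float(calibrator.predict([self_report])[0])
    return 0.4 * calibrated + 0.2 * coverage + 0.2 * recency + 0.2 * agreement

print(ai_confidence(self_report=0.82, coverage=0.9, recency=0.7, agreement=0.8))
```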

Feedback piles up but nobody knows what it actually says.

Per-feedback AI analysis on every entry — sentiment, toxicity, bias, themes — surfaced as chips on the feedback card and rolled up into employee theme aggregates.

Recognition goes to the loud people, never to the quiet ones.

EmployeeRecognitionAnalytics rolls 30/90/YTD recognition stats; RecognitionGapAlert flags systematically under-recognised employees by severity with a recommended next action.
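
One plausible way to severity-score such a gap; the z-score rule and cut-offs are illustrative, not the product's actual scoring.

```python
# Illustrative severity scoring: compare an employee's rolling 90-day kudos
# count against the team distribution.
from statistics import mean, stdev

team_90d_counts = [9, 7, 8, 11, 6, 10, 1]   # kudos received per team member
employee_count = 1                           # the quiet one

z = (employee_count - mean(team_90d_counts)) / stdev(team_90d_counts)
severity = "High" if z < -2 else "Medium" if z < -1 else "None"
print(f"z = {z:.2f} -> severity {severity}")
```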

Goals are created from gut feel — and three months later nobody can defend them.

AI Goal Recommender scans the employee's 12-month signal history, role skill requirements and team OKRs to pre-suggest goals before the form is opened. One click to accept; dismiss with reason feeds the learning loop.

"How did the AI come to that conclusion?" — silence.

Every AI output ships with a per-step trace: prompt version (signed hash), model + region, inputs, RAG documents retrieved, guardrail verdicts, fact-check results, confidence breakdown, cost and latency. Defensible to legal, DEI, audit and the employee.

Inside this capability

Six layers — every one shipping on the same governed AI platform.

AI Review Narrative — drafted, fact-checked, calibrated

  • 1:1 with PerformanceReview — original AI draft preserved forever
  • Sections: Strengths / Improvements / Behavioural / Leadership / Coaching / Promotion language
  • Section-by-section regenerate (no need to redo the whole thing)
  • Track-changes manager edits with IsManagerEdited flag
  • Fact-checker validates every claim vs SignalAggregate
  • Hallucination detector with severity-driven gating
  • Calibrated confidence per section — auto-publish thresholds configurable per tenant

Per-Feedback AI Analysis — sentiment, themes, bias, toxicity

  • 1:1 with every Feedback entry — produced asynchronously, never blocks the giver
  • Sentiment (positive / neutral / negative + intensity)
  • Theme extraction with embeddings (clusters across cycles + employees; sketched after this list)
  • Toxicity + professionalism + bias-language flagging
  • EmployeeThemes aggregate — managers see what their team actually talks about
  • Anonymous feedback stays anonymous to the receiver — analysis still computed and audited
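
As referenced in the list above, a minimal sketch of theme clustering; the real pipeline clusters pgvector embeddings, so the TF-IDF vectors here are only a self-contained stand-in.

```python
# Stand-in sketch for theme clustering over feedback text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

feedback = [
    "great collaboration on the release",
    "really helpful pairing and collaboration",
    "missed the sprint deadline again",
    "the deadline slipped two sprints in a row",
]

vectors = TfidfVectorizer().fit_transform(feedback)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for text, label in zip(feedback, labels):
    print(f"theme {label}: {text}")
```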

AI Goal Recommender — suggestions before the form is opened

  • Scans 12-month signal history + role skill requirements + team OKRs + cycle theme
  • Suggests 3–7 candidate goals per employee with rationale + suggested KRs
  • Accept = creates a Goal in one click with the suggestion as parent evidence
  • Dismiss with reason → feeds learning loop (AILearningEvent)
  • Refreshed weekly + on cycle launch + on demand
  • Confidence scored — low-confidence suggestions hidden by default per tenant policy

EmployeeRecognitionAnalytics & Recognition Gap Alert

  • Rolling 30 / 90 / YTD recognition stats per employee — given AND received
  • Per-pillar breakdown — recognition for Goal delivery vs Collaboration vs Innovation etc.
  • Manager team-recognition heatmap — see who gets noticed and who doesn't
  • RecognitionGapAlert: severity-scored alerts when an employee is systematically ignored
  • Each alert ships with recommended action (1:1 prompt, kudos suggestion, manager nudge)
  • Resolution captured — feeds ManagerQualityScore and bias detection

Governed AI Platform — orchestrator, guardrails, RAG, quality loop

  • Single AI Orchestrator: budget pre-check → routing → guardrails → model call → post-guardrails → trace (sketched after this list)
  • Pre + post-call guardrails (PII / Toxicity / Bias / Topic / Schema) with Allow / Warn / Block / Redact
  • RAG grounding via pgvector embeddings (Employee / Role / Goal / Feedback / Document)
  • Fact-check every claim vs SignalAggregate; hallucination detector with severity gating
  • A/B prompt experiments + AILearningEvent harvester continuously improve quality
  • AICostLedger enforces hourly / daily / monthly budgets per tenant — no surprise bills
  • Per-call AIExecutionRun + AIExecutionStep trace — defensible to auditors and the employee
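
A minimal sketch of the orchestrator call path from the first bullet; every step is a trivial stub, so the flow (budget check before any model call, guardrails on both sides, trace at the end) is the point, not the implementations.

```python
# Orchestrator flow sketch; all step bodies are illustrative stubs.
def budget_precheck(tenant, task): return True          # AICostLedger cap check
def route(task): return "model-a/eu-west"               # model + region routing
def apply_guardrails(text, stage): return text          # PII/toxicity/bias/topic/schema
def retrieve(context): return ["doc-1", "doc-2"]        # stand-in for pgvector RAG
def call_model(model, prompt, docs): return f"[{model}] draft grounded in {docs}"
def fact_check(output): return {"claims_checked": 2}    # vs SignalAggregate
def score_confidence(output, docs): return 0.8
def write_trace(**kw): print("trace:", kw)              # stand-in for AIExecutionRun

def run_ai_task(tenant, task, prompt, context):
    if not budget_precheck(tenant, task):               # BEFORE any external call
        raise RuntimeError("budget exceeded for tenant")
    model = route(task)
    prompt = apply_guardrails(prompt, stage="pre")
    docs = retrieve(context)
    output = apply_guardrails(call_model(model, prompt, docs), stage="post")
    verdicts = fact_check(output)
    confidence = score_confidence(output, docs)
    write_trace(tenant=tenant, task=task, model=model, docs=docs,
                verdicts=verdicts, confidence=confidence)
    return output, confidence

print(run_ai_task("acme", "ReviewNarrative", "Draft strengths...", {"employee": 42}))
```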

Universal Feedback Widget & Learning Loop

  • Thumbs up/down + edit-tracking on every AI-generated card across PMS
  • Feedback signals roll into AIFeedbackSignal — used by experiment evaluator
  • Manager edits feed AILearningEvent (auto-promoted high-quality runs become curated training pairs)
  • A/B prompt experiments: 50/50 split per task, lift + p-value computed hourly (sketched after this list)
  • Confidence calibrator (Platt / Isotonic) refits daily per Model × Task
  • Calibration drift surfaced on the AI observability dashboard
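
A minimal sketch of the hourly evaluation referenced above, using a standard two-proportion z-test; the success metric (thumbs-up rate per variant) and the counts are illustrative.

```python
# Relative lift and two-proportion z-test p-value between prompt variants.
from math import sqrt
from statistics import NormalDist

def lift_and_p(success_a: int, n_a: int, success_b: int, n_b: int):
    pa, pb = success_a / n_a, success_b / n_b
    lift = (pb - pa) / pa                       # relative lift of B over A
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (pb - pa) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return lift, p_value

# e.g. thumbs-up rate on AI cards under each prompt variant (invented counts)
print(lift_and_p(success_a=180, n_a=400, success_b=220, n_b=400))
```
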
Why this is enterprise-defensible

AI you can put in front of a board, an auditor and the employee — same week.

1. AI drafts. Humans decide. Both are tracked.

Original AI version preserved forever. Manager edits captured with IsManagerEdited. Section-by-section regenerate. Acknowledgement workflow on the employee side.

2. Every claim fact-checked vs source-of-truth signals

Claims like "shipped 3 releases" are verified against SignalAggregate before the manager sees them. Unverifiable claims are flagged with a "needs review" chip.

3. Confidence that actually means something

AIConfidence aggregates model self-report + Platt/Isotonic calibration + coverage + recency + agreement. Auto-publish vs human-review thresholds are configurable per task.

4. Hallucinations caught before they reach a career

Hallucination Detector runs on every output. High-severity hallucinations are blocked or routed to human review with full context — they never silently make it into a review.

5. A/B prompt experiments — quality is measured, not assumed

50/50 traffic splits per task with hourly lift + p-value evaluation. Winning prompts auto-promoted; losers archived with full result history.

6. Defensible to legal, DEI and the employee

Per-call trace: prompt hash, model, region, inputs, RAG docs, guardrail verdicts, fact-check results, confidence breakdown, cost, latency — every AI decision is auditable end-to-end.

Frequently asked

AI Performance Reviews — questions buyers actually ask.

How is the AI review narrative actually generated?

When the manager submits the manager review, an Azure Function (Fn_AI_GenerateReviewNarrative) is triggered. It calls the orchestrator with the ReviewNarrative prompt, which runs guardrails, retrieves relevant context (Goals, Moments, Feedback themes, Snapshots, Behavioural Indicators) via RAG, calls the routed model, runs post-guardrails, fact-checks each claim against SignalAggregate, computes calibrated confidence and writes the AIReviewNarrative row 1:1 with the PerformanceReview. The manager reviews, edits with track-changes (IsManagerEdited captured), and finalises. The original AI version is preserved forever.

How do you prevent hallucinations?

Three layers. (1) RAG — the AI is grounded in approved documents and structured PMS data via pgvector embeddings. (2) Fact-checker — extracts every claim ("shipped 3 releases this quarter") and verifies against SignalAggregate; unverifiable claims are flagged. (3) Hallucination Detector — runs on every output with severity-driven gating; high-severity hallucinations are blocked or routed to a human review queue with full context. Together they make hallucination both rare and visible.
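
For the RAG layer, a minimal sketch of a pgvector similarity lookup; the cosine-distance operator <=> is standard pgvector syntax, while the table and column names are assumptions.

```python
# Hypothetical retrieval helper: nearest chunks by cosine distance in pgvector.
# "pms_documents", "chunk_text" and "embedding" are assumed names.
import psycopg  # psycopg 3

def retrieve_context(conn, query_embedding: list, k: int = 5):
    sql = """
        SELECT id, chunk_text
        FROM pms_documents
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (str(query_embedding), k))
        return cur.fetchall()
```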

What does the confidence score actually mean?

AIConfidence aggregates five inputs: model self-report, Platt/Isotonic calibration fitted nightly on labelled outcomes, coverage (how much relevant evidence the AI saw), recency (how fresh the evidence is) and agreement (cross-model or cross-prompt). The result drives auto-publish vs human-review per task. Calibration drift over time is shown on the AI observability dashboard.

Can a manager regenerate just one section?

Yes — POST /api/pms/reviews/{id}/ai-narrative/regenerate-section?section=Strengths|Improvements|Behavioural|Leadership|Coaching|Promotion. Only that section is recomputed and the rest of the manager's edits are preserved.
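
A hypothetical client call against that endpoint; the base URL and bearer token are placeholders.

```python
# Illustrative call to the regenerate-section endpoint quoted above.
import requests

resp = requests.post(
    "https://app.example.com/api/pms/reviews/123/ai-narrative/regenerate-section",
    params={"section": "Strengths"},
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # regenerated Strengths section; manager edits elsewhere untouched
```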

How does anonymous feedback stay anonymous?

When IsAnonymous = true on the Feedback row, the receiver UI hides author identity. The analysis (sentiment, themes, bias) is still computed and stored, with full audit on the backend for moderation, but is never surfaced to the receiver in a way that re-identifies the giver. First-time givers go through a soft moderation queue.

How does the AI Goal Recommender work?

A scheduled function (Fn_AI_GoalRecommender) runs weekly per employee and on cycle launch. It scans the employee's 12-month signal history, role skill requirements, team OKRs and the active cycle theme to produce 3–7 candidate goals with rationale and suggested KRs. Accept creates a Goal in one click; dismiss with reason writes an AILearningEvent — so the recommender keeps getting smarter for your tenant.
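
A tiny sketch of the dismiss path described above; AILearningEvent is the page's name, while the payload shape and helper are assumptions.

```python
# Dismissal-with-reason becomes a learning event; payload shape is invented.
def dismiss_suggestion(suggestion_id: str, reason: str, event_log: list) -> None:
    event_log.append({                 # stand-in for persisting an AILearningEvent
        "type": "AILearningEvent",
        "event": "GoalSuggestionDismissed",
        "suggestion_id": suggestion_id,
        "reason": reason,
    })

events = []
dismiss_suggestion("sugg-7", "duplicate of an existing OKR", events)
print(events)
```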

How do you detect bias?

Two lines of defence. (1) Pre-call: bias guardrail screens prompts. (2) Post-call: bias guardrail screens outputs; the calibration session for the cycle runs KL-divergence + Chi-square distribution analysis and the bias-detection model scans for patterns by manager × cohort. All findings land in BiasDetectionResult with severity and statistical evidence — HR acknowledges or dismisses with mandatory note.
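The two named statistics are standard; here is a minimal SciPy sketch over an invented manager-vs-cohort rating distribution.

```python
# KL divergence and chi-square test of one manager's rating distribution
# against the cohort baseline (ratings 1..5; counts are invented).
from scipy.stats import chisquare, entropy

cohort_probs   = [0.10, 0.20, 0.40, 0.20, 0.10]   # cohort baseline
manager_counts = [1, 2, 6, 14, 7]                 # one manager's 30 reviews

n = sum(manager_counts)
manager_probs = [c / n for c in manager_counts]

kl = entropy(manager_probs, cohort_probs)         # KL(manager || cohort)
chi2, p_value = chisquare(manager_counts, f_exp=[p * n for p in cohort_probs])
print(f"KL = {kl:.3f}, chi2 = {chi2:.1f}, p = {p_value:.4f}")
```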

How do you keep AI costs under control?

AICostLedger enforces hourly / daily / monthly budgets per tenant per bucket. Budget pre-check runs BEFORE every external model call; Fn_AI_BudgetGuard throttles when caps are reached. AIExecutionRun records actual cost per call; Fn_AI_CostRollupHourly aggregates into the ledger; the AI Cost dashboard shows current spend vs cap with alert thresholds.
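
A minimal sketch of the pre-call check; the caps and current spend are invented, and the point mirrors the page: the check runs before any external model call.

```python
# Pre-call budget guard sketch over invented per-tenant buckets.
caps  = {"hourly": 5.00, "daily": 60.00, "monthly": 1200.00}   # USD per tenant
spend = {"hourly": 4.90, "daily": 31.20, "monthly": 410.00}    # from AICostLedger

def budget_precheck(estimated_cost: float) -> bool:
    return all(spend[w] + estimated_cost <= caps[w] for w in caps)

print(budget_precheck(0.15))   # False -> throttle instead of calling the model
```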

Does the AI ever replace the manager's judgement?

The manager is always the decision-maker. AI drafts; the manager edits with track-changes; the original AI version is preserved; the manager finalises; the employee acknowledges (or disputes — disputes route to HR). At no point is a rating published to an employee without a human signing off — the AI accelerates the work, it does not replace the judgement.

AI you can defend. Decisions humans still own.

Bynarize AI Reviews accelerate the work without taking the judgement away — drafted, fact-checked, calibrated, audited and always edited by a human before they touch a career.