The silent watcher: grading interviews without interrupting the candidate

Most AI interviewing tools put one model in the room: a chatbot that asks questions and scores the answers. That design has a problem baked in. The thing talking to the candidate is also the thing judging them — so the score gets contaminated by the conversation, and the candidate is performing for their evaluator the entire time.

Probe runs two AIs, and only one of them ever talks to the candidate.

Two roles, deliberately split

The assistant sits with the candidate. It's a real coding assistant — they can ask it anything, hand it work, argue with it. Its only job is to be the tool the candidate would actually use on the job.

The watcher sits behind one-way glass. It sees everything in the session: the diff, the full prompt history, every test run, every focus change. It never speaks to the candidate. It never nudges, hints, or reacts. Its only job is to build an evidence-cited score against the rubric you defined.

That separation is the whole point. The candidate works naturally because nothing they say to the assistant is "the answer being graded." And the score stays clean because the grader was never part of the performance.
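The article doesn't describe how Probe enforces this, so the following is only a sketch of one way to make the one-way glass structural. It assumes an append-only session log that the watcher can read but has no way to answer. Every name here (SessionEvent, SessionLog, Watcher) is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical event kinds, mirroring what the watcher is said to see.
EventKind = Literal["prompt", "diff", "test_run", "focus_change"]


@dataclass(frozen=True)
class SessionEvent:
    at_seconds: int   # seconds since the session started
    kind: EventKind
    detail: str


@dataclass
class SessionLog:
    """Append-only record of everything that happens in the session."""
    _events: list[SessionEvent] = field(default_factory=list)

    def append(self, event: SessionEvent) -> None:
        self._events.append(event)

    def snapshot(self) -> tuple[SessionEvent, ...]:
        # Return an immutable copy so a reader cannot alter the record.
        return tuple(self._events)


class Watcher:
    """Read-only observer: it exposes no method that reaches the candidate."""

    def __init__(self, log: SessionLog) -> None:
        self._log = log

    def observe(self) -> tuple[SessionEvent, ...]:
        return self._log.snapshot()


log = SessionLog()
log.append(SessionEvent(95, "prompt", "why does this test fail?"))
log.append(SessionEvent(140, "test_run", "3 passed, 1 failed"))
watcher = Watcher(log)
print(len(watcher.observe()))
```

The design choice worth noticing is the absence of any outbound method on the watcher: in this sketch the separation is a property of the interface, not a policy the model is asked to follow.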

What the watcher actually looks for

It's not counting lines or matching against a reference solution. It grades like a senior engineer reviewing a session, against your dimensions. For example:

Decomposition — did they break the problem down sensibly, or thrash?
Verification — did they test the thing they claimed to fix, or trust a green check?
Judgment with the assistant — did they catch the assistant's mistakes, or inherit them?
Code quality — does the result hold up, or did they paper over a tradeoff?

And critically, it flags the moments, not just a number: "wrote a test that passes whether or not the bug is fixed (11:48)." A score you can't trace is a score you can't defend.
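To make "moments, not just a number" concrete, here is a minimal sketch of a per-dimension score that carries its flagged moments with it. The dimension names come from the list above. The 1-to-4 scale, the class names, and the example moment are assumptions for illustration, not Probe's actual schema.

```python
from dataclasses import dataclass

# Dimension names taken from the rubric examples in the text.
RUBRIC = (
    "Decomposition",
    "Verification",
    "Judgment with the assistant",
    "Code quality",
)


@dataclass(frozen=True)
class Moment:
    at_seconds: int   # offset into the session
    note: str

    def label(self) -> str:
        # Render the offset as m:ss next to the note.
        minutes, seconds = divmod(self.at_seconds, 60)
        return f"{self.note} ({minutes}:{seconds:02d})"


@dataclass(frozen=True)
class DimensionScore:
    dimension: str
    score: int                    # hypothetical 1-4 scale
    moments: tuple[Moment, ...]   # the evidence behind the score


def untraceable(scores: list[DimensionScore]) -> list[str]:
    """Return the dimensions whose score cites no moment at all."""
    return [s.dimension for s in scores if not s.moments]


flagged = Moment(754, "accepted the assistant's patch without running the tests")
scores = [
    DimensionScore("Judgment with the assistant", 2, (flagged,)),
    DimensionScore("Code quality", 3, ()),
]
print(flagged.label())
print(untraceable(scores))
```

A scorecard built this way can refuse to publish any dimension that shows up in untraceable, which is one way to keep a bare number from reaching a hiring decision.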

Why "evidence-cited" matters more than the number

A hiring recommendation you can't explain is a liability — to the candidate you reject, to the teammate who asks "why," and increasingly to your legal team. Every dimension in a Probe scorecard points back to a verbatim moment in the transcript. "No hire" stops being a vibe and becomes a sentence you can stand behind: here is what they did, here is where, here is why it matters.
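"Verbatim" is a property that can be checked mechanically. The sketch below assumes the transcript is available as plain text and that each citation stores the exact quoted span. Both are assumptions, as are the names Citation and unsupported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    dimension: str
    quote: str   # expected to appear word for word in the transcript


def unsupported(citations: list[Citation], transcript: str) -> list[Citation]:
    """Return citations whose quote is not found verbatim in the transcript."""
    return [c for c in citations if c.quote not in transcript]


transcript = (
    "candidate: just ship it, the check is green\n"
    "assistant: the test never calls the fixed function"
)
citations = [
    Citation("Verification", "just ship it, the check is green"),
    Citation("Decomposition", "planned the refactor in three steps"),
]
missing = unsupported(citations, transcript)
print([c.dimension for c in missing])
```

An exact substring match is deliberately strict: a paraphrased quote fails the check, which is the behavior you want if the scorecard has to hold up when someone asks "why."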

And because it's evidence and not authority, you stay in charge. Disagree with the watcher? The transcript is right there. Override the recommendation with one click. The AI's job is to surface the evidence fast and consistently, not to make the call for you.
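One way to keep the human in charge without losing the audit trail is to store the reviewer's override next to the watcher's recommendation instead of overwriting it. This is a sketch under that assumption. The Decision type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    watcher_recommendation: str
    reviewer_override: Optional[str] = None   # set only when a human disagrees

    @property
    def final(self) -> str:
        # The human call wins whenever one was made.
        return self.reviewer_override or self.watcher_recommendation


as_scored = Decision("no hire")
overridden = Decision("no hire", reviewer_override="hire")
print(as_scored.final, "|", overridden.final)
```

Keeping both fields means the record shows what the watcher recommended and what the reviewer decided, so an override stays visible later.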

The result

You get the throughput of an async, automated interview with the rigor of a senior engineer's review — minus the senior engineer's afternoon, and minus the bias that creeps in when the grader is also the host.

One AI helps the candidate do their best work. The other tells you, with receipts, whether it was good enough.

See it in context: how to interview engineers who use AI · the AI-native interview overview.