For engineering hiring teams

Engineers work with AI now. Your interview should too.

Stop pretending candidates aren't using AI. Probe gives them realistic engineering tasks with the assistant turned on, while a second AI watches quietly in the background. It scores the work against your rubric and tells you who can actually think.

Get started
Live, not pre-recorded
No installs for candidates
interview · 32:14 · live
watcher observing
You → Assistant · 12:04
The retry decorator is timing out under load. Before I touch it, what's the safest way to add a per-call timeout without breaking the existing tests?
Assistant · 12:05
Wrap the inner call in asyncio.wait_for with a bounded budget, then surface the timeout as a typed exception so the retry loop can decide whether to back off.
You → Assistant · 12:31
Sketch the smallest test that would catch unbounded concurrency. I want to see it fail before I change anything.
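The assistant's suggestion in the exchange above can be sketched roughly as follows. This is a minimal illustration, not code from a Probe task: `CallTimeout`, `call_with_budget`, and the placeholder backoff are hypothetical names introduced here, and the budget values are arbitrary.

```python
import asyncio


class CallTimeout(Exception):
    """Hypothetical typed exception: one attempt exceeded its time budget."""

    def __init__(self, attempt: int, budget: float):
        super().__init__(f"attempt {attempt} exceeded {budget}s")
        self.attempt = attempt
        self.budget = budget


async def call_with_budget(make_call, *, attempt: int, budget: float):
    """Run a single attempt under a bounded budget and translate the timeout."""
    try:
        return await asyncio.wait_for(make_call(), timeout=budget)
    except asyncio.TimeoutError as exc:
        # Surface a typed error so the caller can tell timeouts from other failures.
        raise CallTimeout(attempt, budget) from exc


async def fetch_with_retry(make_call, *, retries: int = 3, budget: float = 2.0):
    """Retry loop that decides for itself whether to back off or give up."""
    for attempt in range(retries):
        try:
            return await call_with_budget(make_call, attempt=attempt, budget=budget)
        except CallTimeout:
            if attempt == retries - 1:
                raise  # out of attempts: let the typed timeout propagate
            await asyncio.sleep(0.01 * 2 ** attempt)  # placeholder backoff (assumption)
```

Keeping the timeout translation in one small function means the retry loop only ever reasons about `CallTimeout`, which is one way to add the per-call budget without touching how other exceptions flow through existing tests.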

AI broke the technical interview. Banning it doesn't fix it.

Every candidate has Claude open in another tab. Whiteboarding tests recall. Take-homes test patience. Neither tells you whether someone can decompose a real problem, prompt well, or notice when their assistant is wrong.

Probe assumes the AI is there. The interview becomes a test of judgment, which is what you wanted to measure in the first place.

38% use AI anyway
38% of technical candidates triggered AI-cheating flags in 19,368 interviews studied by Fabric (Jul 2025 – Jan 2026).
Minutes to scorecard
Automated grading kicks off the moment a candidate submits. No review queue, no waiting on a reviewer's calendar.
0 whiteboards
No leetcode, no tricks, no pre-recorded video reviews.
110m per session · you decide
Behavioral round · Coding rounds: 2 · Debrief / round

Live total from the same engine that schedules the real interview.

A session you shape. Then a scorecard you can defend.

You write the rubric. We run the interview. Your candidates work in a real editor with a real AI assistant they can ask anything, while a second AI quietly watches and grades against the dimensions you care about.

01
Pick a task
Choose from our library of production-realistic tasks in Python, Java, or C++: productionize, build-a-feature, refactor, review-an-AI-PR, and an open-ended build. Custom tasks from your own repo are on the roadmap.
configure · 5 min
02
Candidate runs the session
They get a link. No install. They can ask the AI assistant anything. A second AI silently watches every edit, prompt, and test run. That's the evidence the scorecard will cite later.
live · 30–60 min
03
The AI debriefs the candidate
Switch it on per round: after the candidate submits, a voice AI asks them to walk through what they built and why. It confirms they understand the code they shipped, not just what the assistant generated, and flags any gaps for the scorecard.
optional · ~10 min per round
04
You read the scorecard
Recommendation, dimension scores, and a complete timeline of every prompt, edit, and test run. Every score is backed by a transcript citation. Hire, pass, or escalate to a human round, with the evidence to back the call.
delivered automatically · no grading queue

Five coding tasks plus a behavioral round, each measuring a different kind of judgment.

You pick the format that matches what you need to see. The five coding types run in Python, Java, and C++, and each one is built to surface a specific signal rather than a generic "code quality" score. An optional behavioral round runs first.

Balanced screen
Productionize this
Working-but-flawed code to make production-ready. The issues are layered, including a subtle one the assistant will happily walk past. Exercises every dimension, so it's the safe default screen.
Comprehension
Build a feature
Drop into an unfamiliar codebase and add a specific feature. Do they navigate the code with their assistant, or dump the whole repo into the prompt and pray?
Verification
Refactor, preserve behavior
Turn messy working code into something cleaner that behaves identically. Candidates who don't test first get burned by the assistant's "almost equivalent" rewrites.
Judgment
Review an AI PR
No code written. They review a diff, leave comments, and decide whether to approve or request changes. Legit code is mixed with plausible-but-wrong code, plus a bug that only context catches. Probably our most revealing format.
Decomposition
Build (open-ended)
A broad, open-ended brief, a blank multi-file workspace, 45 minutes, and no tests to chase. The candidate sets the scope and their own bar. It's the clearest place to spot the mega-prompt that returns 400 lines nobody can defend.
Communication
Behavioral round
A 30-minute, résumé-led voice conversation that runs before the technical rounds. It probes ownership and depth on past work. Answers can be spoken or typed, so a microphone is never required.

Watch a full session, from first prompt to scorecard.

It's the same interview your candidates run: a live coding session with an AI assistant, a voice debrief, and the evidence-cited scorecard that lands minutes after they submit.

retry.py · interview · live
watcher observing
retry.py
import asyncio

async def fetch_with_retry(url, *, retries=3):
    for attempt in range(retries):
        try:
            return await asyncio.wait_for(
                _get(url), timeout=2.0
            )
        except asyncio.TimeoutError:
            if attempt == retries - 1:
                raise
            await backoff(attempt)
You → Assistant · 12:04
The retry decorator times out under load. Safest way to add a per-call timeout without breaking the existing tests?
Assistant · 12:05
Wrap the inner call in asyncio.wait_for with a bounded budget, then surface the timeout as a typed exception so the retry loop can decide whether to back off.
Watcher observing
3 signals
Verification · 08:42
Accepted the assistant's regex and shipped without an edge-case test. Flagged for the scorecard.
Decomposition · 17:15
test_concurrent passes, but the mock returns instantly, so it isn't a real concurrency signal. Citation captured.
Prompt quality · 31:02
Asked the assistant to sketch the failing test first, before touching the code. Strong signal.

The candidate works in a real editor with an AI assistant, while a silent watcher captures evidence against your rubric.
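To make the 17:15 signal concrete: a test whose mock returns instantly can never observe two calls overlapping, so it passes whether or not concurrency is bounded. One way to get a real signal is a fake that holds each call open and records the peak number in flight. The sketch below is illustrative only; `InFlightCounter`, `fetch_all_unbounded`, `fetch_all_bounded`, and `peak_concurrency` are hypothetical names, not part of any Probe task.

```python
import asyncio


class InFlightCounter:
    """Fake dependency that records how many calls overlap."""

    def __init__(self, hold: float = 0.01):
        self.hold = hold
        self.current = 0
        self.peak = 0

    async def __call__(self, item):
        self.current += 1
        self.peak = max(self.peak, self.current)
        try:
            # Hold the slot briefly so concurrent calls genuinely overlap.
            await asyncio.sleep(self.hold)
        finally:
            self.current -= 1
        return item


async def fetch_all_unbounded(items, call):
    """Hypothetical code under test: launches every call at once."""
    return await asyncio.gather(*(call(i) for i in items))


async def fetch_all_bounded(items, call, limit):
    """Hypothetical fix: a semaphore caps the number of calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(i):
        async with sem:
            return await call(i)

    return await asyncio.gather(*(guarded(i) for i in items))


def peak_concurrency(fetch, n=20, **kwargs):
    """Run `fetch` over n items and report the highest overlap observed."""
    counter = InFlightCounter()
    asyncio.run(fetch(list(range(n)), counter, **kwargs))
    return counter.peak
```

A test asserting `peak_concurrency(...) <= 5` fails against the unbounded version and passes against the bounded one with `limit=5`, which is the "see it fail first" order the candidate asks for in the transcript.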

A silent watcher that grades like a senior engineer.

The watcher agent sees everything: the diff, the prompt history, the test runs. It never speaks to the candidate. Instead it builds an evidence-cited score against your rubric and flags the moments that actually matter, like shortcuts that hide tradeoffs, or tests that don't really exercise the thing they claim to.

  • Tuned by senior engineers. Every task is graded against a rubric written and validated by the engineers who would normally run the interview, not a generic "code quality" score.
  • Doesn't interrupt thinking. The watcher sends no probes, no pop-ups, and no chat messages. The candidate works the way they actually work, with their assistant, start to finish.
  • Every score is auditable. You see the verbatim transcript moment behind each dimension score, and your team can overrule the recommendation on that evidence.
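As a rough picture of what "evidence-cited" could mean in data terms, the sketch below models a dimension score that carries its transcript citations and a visible rubric weight. The field names, the 1 to 5 scale, and the weighted average are all assumptions made for illustration; they are not Probe's actual schema or scoring formula.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    timestamp: str  # position in the session, e.g. "08:42"
    excerpt: str    # verbatim transcript moment


@dataclass
class DimensionScore:
    name: str
    score: int      # assumed 1-5 scale
    weight: float   # assumed rubric weight, visible to the hiring team
    citations: list[Citation] = field(default_factory=list)


def uncited(dimensions):
    """Names of dimensions whose score has no transcript evidence attached."""
    return [d.name for d in dimensions if not d.citations]


def weighted_score(dimensions):
    """Weight-normalized average of the dimension scores."""
    total_weight = sum(d.weight for d in dimensions)
    if total_weight == 0:
        raise ValueError("rubric weights sum to zero")
    return sum(d.score * d.weight for d in dimensions) / total_weight
```

With a shape like this, an audit is a matter of reading each dimension's `citations`, and a check such as `uncited(...)` returning an empty list is one way to enforce that no score ships without evidence.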

What hiring teams ask before getting started.

Honest answers, including what we don't claim and what we're still building.

01
Validity & trust
The biggest reason teams hesitate. Worth answering first.
How do you prevent candidates from cheating?
There's almost nothing to cheat on. The AI is already built into the session. We watch how candidates use it, not whether. We log when the candidate's tab loses focus, but the real defense is the transcript itself: candidates who outsource thinking leave a trail of vague prompts, unverified pastes, and code they can't justify.
How accurate is the AI's scoring?
Every score is backed by a transcript citation, so you can audit the exact moment that produced the rating. We don't claim to replace your senior engineers' judgment. We just surface the right ten minutes for them to review, rather than asking them to sit through the whole thing.
Will engineers on my team trust the recommendation?
They will if they can see the work. Every scorecard links to the full session: every prompt, every edit, every test run, every flagged moment. We've found engineers go from skeptical to convinced after reviewing two or three real transcripts. We recommend running Probe in shadow mode alongside your existing loop for the first month.
Won't candidates just have the AI do everything for them?
Some try, and they fail. That's the point of the format. The watcher tracks vague prompts, unverified pastes, and decisions the candidate can't justify. "Use the AI" and "have the AI do it" look very different in the transcript.
How do you verify it's actually the candidate doing the interview?
We don't do ID checks or webcam proctoring. That's a deliberate choice, not a gap. Each invite link is tied to a single candidate record, and the real defense is the same one we use against AI over-reliance: the transcript. Someone who hands the whole session to a friend leaves the same trail as someone who hands it to the assistant: prompts and code they can't account for, which the optional implementation debrief is designed to expose. If hard identity assurance is a requirement for your process, run Probe as a screen ahead of an onsite where you confirm the person.
02
Candidate experience
What it feels like on the other side of the link.
How do candidates feel about being interviewed by an AI?
Better than they expect. The session is realistic work, the timer is fair, and unlike take-homes, there's no "we'll get back to you in three weeks." Most candidates finish the interview and immediately see their own transcript. We're still pre-launch, so we'll have real survey data later this year.
What if a candidate has never used AI tools before?
The welcome page walks them through the assistant (chat, file editing, and test runs) before the timer starts. AI usage isn't required, just permitted. Strong candidates who prefer not to use AI still pass; the rubric rewards good judgment, and that includes knowing when to skip the assistant.
Is it accessible for candidates with disabilities?
We use Monaco, the same editor as VS Code, which supports keyboard navigation and screen readers. The behavioral round accepts either typed or spoken answers, so a microphone is never required.
How long does the interview take, and how many rounds are there?
You decide. Each technical round is a single sitting of 30 to 60 minutes, with 45 the default, and you can run anywhere from one round up to five. An optional behavioral round runs first: a 30-minute, résumé-led conversation. Most teams start with a single technical round to replace a phone screen, not a marathon loop. Turning on the per-round implementation debrief adds roughly ten minutes to each round it runs.
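The arithmetic in that answer can be written down as a small helper. This is only a sketch of the stated numbers (30 to 60 minutes per round, one to five rounds, a 30-minute behavioral round, roughly ten minutes per debrief); `session_minutes` is a hypothetical function, not part of Probe.

```python
BEHAVIORAL_MINUTES = 30      # optional behavioral round
DEBRIEF_MINUTES = 10         # "roughly ten minutes" per round it runs on
DEFAULT_ROUND_MINUTES = 45   # default length of a technical round


def session_minutes(round_minutes, *, behavioral=False, debrief=False):
    """Total candidate time for the given technical rounds and options."""
    if not 1 <= len(round_minutes) <= 5:
        raise ValueError("expected between one and five technical rounds")
    for minutes in round_minutes:
        if not 30 <= minutes <= 60:
            raise ValueError("each technical round runs 30 to 60 minutes")
    total = sum(round_minutes)
    if behavioral:
        total += BEHAVIORAL_MINUTES
    if debrief:
        # Assumes the debrief is switched on for every round.
        total += DEBRIEF_MINUTES * len(round_minutes)
    return total
```

For example, `session_minutes([45, 45], debrief=True)` comes to 110 minutes, and a single default round with nothing else enabled is 45.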
Which AI assistant do candidates use, and can they bring their own?
The session has a built-in assistant powered by Anthropic's Claude, available in the editor the whole time. Candidates can chat with it, and it can read files, make edits, and run tests. There's nothing to install or sign into, and candidates don't bring their own. Nothing stops someone from opening another AI in a separate tab, but that's exactly why we score the work in the transcript and log when the tab loses focus rather than trying to lock the browser down.
What happens if a candidate's connection drops mid-session?
Their work isn't lost. Every edit, prompt, and test run is written to the session log on our servers as it happens, so nothing important lives only in the browser. A brief network blip reconnects on its own, showing "Reconnecting…" instead of a blank screen. If a candidate hits something that genuinely interrupts the session, they can email us and we'll sort out a re-invite.
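One common way to get this property is an append-only, server-side event log with sequence numbers, so a reconnecting client can ask for whatever it missed. The sketch below is an in-memory stand-in built on that assumption; `SessionLog`, its event shape, and the `since` method are hypothetical and do not describe Probe's actual storage.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SessionLog:
    """In-memory stand-in for a server-side, append-only session log."""

    events: list = field(default_factory=list)

    def append(self, kind, payload):
        """Record one event as it happens and return its sequence number."""
        seq = len(self.events) + 1
        self.events.append(json.dumps({"seq": seq, "kind": kind, "payload": payload}))
        return seq

    def since(self, last_seen):
        """Events with a sequence number greater than `last_seen`."""
        return [json.loads(e) for e in self.events[last_seen:]]
```

A client that last saw sequence number 1 would call `since(1)` after reconnecting and replay only the later events, so nothing depends on state held in the browser.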
03
Integration & control
How Probe fits into how you already hire.
Can we use our own coding tasks?
Not yet. The task catalog is currently authored and calibrated entirely by Probe. Each task is written and tested by senior engineers, with hidden test suites and a graded rubric, before it ships. Custom tasks from your codebase are on the roadmap. In the meantime, you can configure which preset task type runs in each round.
Does it integrate with Greenhouse, Ashby, or Lever?
Not at launch. We're starting with the simplest possible workflow: send a link, get a scorecard. We'll add ATS connectors based on what design partners actually use most. If you want a specific integration, tell us and we'll prioritize.
Can we override the AI's recommendation?
Always. The scorecard is a recommendation, not a decision. Every dimension shows the transcript moment behind it, so your team can disagree and re-decide on the evidence rather than the summary. An in-product override-and-log workflow is on the roadmap; today the decision lives in your ATS or notes, not in Probe.
What programming languages do you support?
Probe's task catalog runs in Python, Java, and C++ today, and you pick the language that fits the role. JavaScript/TypeScript, Go, Rust, and front-end framework work aren't supported yet; they're on the roadmap, along with custom tasks authored from your own codebase, which will widen the set further. If your stack isn't covered, tell us. That's how we decide what to add next.
What roles and seniority levels is Probe built for?
Probe fits software-engineering roles where the work is building, debugging, refactoring, and reviewing real code. Because the task catalog is Python, Java, and C++, it's strongest for backend and systems roles right now. Broader front-end and language coverage is on the roadmap. On seniority it spans new-grad through staff: the rubric rewards judgment, decomposition, and catching the assistant when it's wrong, so the same task naturally separates a junior from a senior in the transcript.
Is the interview live and proctored, or asynchronous?
Asynchronous. You send a link and the candidate runs the session whenever they like: no scheduling, no install, and no one watching live on a call. "Live" on our site means the candidate writes and runs real code in real time, not a recorded screen-share you review later. It doesn't mean a person is observing. The only observer is the watcher AI, and it never interrupts.
04
Legal & data
Compliance and what happens to candidate information.
Is this EEOC-compliant? Can the AI introduce bias?
We don't claim to be unbiased; no AI system is. We do claim to be auditable: every score has a transcript citation and every rubric weight is visible, so your team can review, override, and reweight on the cases that matter. If your jurisdiction requires NYC Local Law 144-style audits, talk to us before signing. We're still pre-launch, and we want compliance scope nailed down before we onboard.
What happens to candidate data?
Candidate data is stored on Google Cloud (Firestore, US-based) and encrypted at rest by default. Candidate and session data are not used to train any AI models. The hiring team can delete a candidate's record at any time from the dashboard, and candidates can email us to request deletion. We're working on formal DPA and GDPR/CCPA processes ahead of paid customers. If your compliance team needs specifics, ask us where we are before signing. See our privacy page (linked in the footer) for the full inventory of what we collect.

Run your first AI-native interview this week.

Set up in under five minutes.

Get started

Questions, or want to talk through your hiring loop?

Send us a note and we'll get back to you within one business day.