AI sourcing is broken by design. Here's how we're fixing it.

A response to Paul Karrmann's article on why AI sourcing keeps failing, and how Jack & Jill is fixing the three structural problems he identified.

Matthew Wilson, Co-founder & CEO, Jack & Jill | 22 May 2026 | 8 min read

You may have read Paul Karrmann's interesting article on why AI sourcing is structurally broken. It's been doing the rounds for good reason: he claims that most AI sourcing tools fail as a result of stale data, thinly disguised keyword search, and AI reasoning over unbounded sets. We agree with his diagnosis.

But we disagree that nobody's fixing all three. Here's what we're building and why it works.

The contention at the heart of our business

The single most important choice we made was a company-level one. We're building a two-sided marketplace where candidates actively sign up.

Paul is describing one-sided sourcing tools. A recruiter pays, the tool scrapes LinkedIn, and candidates are objects in a database. They are unaware, uninvolved, and unreachable except by cold outreach. The entire industry is built on this paradigm, so the entire industry has the same data freshness problem. More importantly, candidates and recruiters alike are having a hard time.

We made a different bet. We built Jack: an AI career agent that helps candidates navigate the fraught process of finding a new job. He delivers real value every week:

Job matching across 15 million roles, searched daily
Mock interviews: HM screens, behavioural, PM strategy, consulting PEIs
Career coaching: planning moves, bouncing back from rejections
Salary benchmarking and negotiation prep

Jack is completely free for candidates. The result is that people actually talk to him regularly, and tell their friends. We've grown to ~200,000 candidates and counting, with strong growth driven by word of mouth and organic referrals.

Problem 1: "The data is stale before you query it"

Things look very different with a two-sided marketplace. Here's our freshness model, in brief:

At signup: LinkedIn via OAuth (not scraping), CV, phone. Enriched immediately from verified sources.
Ongoing: Most importantly, Jack maintains each candidate's structured profile from every conversation: chat, voice, WhatsApp, email. If the candidate's preferences change, we're the first to know.
- This inverts the typical model. Candidates don't have to update their profiles across a dozen platforms: Jack checks in with them directly, and does it for them.
Change detection: When a candidate changes jobs, Jack notices and updates their profile in context.
Candidate-controlled: They set their own visibility (open / selective / hidden). Both sides consent before any introduction happens.
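
To make the visibility and consent rule concrete, here is a minimal illustrative sketch in Python. It is not Jack & Jill's actual implementation: the names `Visibility`, `Candidate`, `is_visible_to` and `can_introduce` are hypothetical, and treating "selective" as an allow-list of employers is an assumption made only for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Visibility(Enum):
    OPEN = "open"
    SELECTIVE = "selective"
    HIDDEN = "hidden"


@dataclass
class Candidate:
    name: str
    visibility: Visibility
    # Assumption: "selective" means visible only to employers the candidate approved.
    approved_employers: FrozenSet[str] = frozenset()


def is_visible_to(candidate: Candidate, employer: str) -> bool:
    """Apply the candidate's own visibility setting."""
    if candidate.visibility is Visibility.OPEN:
        return True
    if candidate.visibility is Visibility.SELECTIVE:
        return employer in candidate.approved_employers
    return False  # HIDDEN


def can_introduce(candidate: Candidate, employer: str,
                  candidate_consents: bool, employer_consents: bool) -> bool:
    """An introduction needs visibility plus explicit consent from both sides."""
    return (is_visible_to(candidate, employer)
            and candidate_consents
            and employer_consents)
```

The point of the sketch is that consent is a hard precondition checked on both sides, not a preference that ranking can override.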

The result is dramatically higher response rates, for two reasons:

We're showing candidates roles we already know they're interested in, because they told us what they want.
We meet candidates where they are. They share their email and WhatsApp so they can be alerted to roles they're suited for, as soon as they hit the market.

Paul's phone-book analogy is good: most tools have a phone book where half the numbers changed last week. We have a contact list where people pick up when you call, because they gave you their number and told you when to ring.

Problem 2: "AI matching is mostly fancier keyword search"

At no point do we do any keyword search (or anything like it). Jill constructs a bespoke search pipeline for every role: a sequence of gates, each one a genuine reasoning task with a structured rubric.

Every gate has four components that Jill defines: which candidate context fields to inject, a prompt with explicit criteria for each evaluation tier, a structured output space, and a composite filter condition combining multiple dimensions of judgment. A gate assessing design craft doesn't look for the word "Figma"; it reads the candidate's actual history and makes a judgment call against criteria like "led design at a celebrated consumer product" versus "primarily B2B/enterprise with some consumer work." Early gates make focused cuts on narrow signals (location, visa, function); later gates assess subtler dimensions like consumer product pedigree, founding mentality, or go-to-market archetype. The sophistication of the reasoning increases as the funnel narrows.
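
As a rough illustration of those four components, a gate could be modelled as a small data structure. This is a sketch under stated assumptions, not Jill's real schema: the `Gate` class, the field names, the tier labels ("strong", "mixed", "weak") and the example filter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Gate:
    name: str
    context_fields: Tuple[str, ...]            # 1. which candidate fields to inject
    prompt: str                                # 2. explicit criteria per evaluation tier
    output_space: Dict[str, Tuple[str, ...]]   # 3. dimension -> allowed tier labels
    passes: Callable[[Dict[str, str]], bool]   # 4. composite filter over dimensions

    def validate(self, verdict: Dict[str, str]) -> Dict[str, str]:
        """Reject any model output that falls outside the structured output space."""
        for dimension, allowed in self.output_space.items():
            if verdict.get(dimension) not in allowed:
                raise ValueError(f"{self.name}: bad value for {dimension!r}")
        return verdict


# Hypothetical gate for the design-craft example in the text.
design_craft = Gate(
    name="design_craft",
    context_fields=("work_history", "portfolio_notes"),
    prompt=(
        "strong: led design at a celebrated consumer product. "
        "mixed: primarily B2B/enterprise with some consumer work. "
        "weak: no evidence of consumer design work."
    ),
    output_space={
        "consumer_pedigree": ("strong", "mixed", "weak"),
        "seniority": ("lead", "senior", "other"),
    },
    # Composite condition: strong pedigree passes outright; mixed passes only for leads.
    passes=lambda v: v["consumer_pedigree"] == "strong"
    or (v["consumer_pedigree"] == "mixed" and v["seniority"] == "lead"),
)
```

Keeping the output space explicit means a malformed or off-rubric answer is caught by `validate` before the filter condition ever runs.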

Before any gate runs, all candidate context is stripped of protected characteristics, removing demographic signals that shouldn't influence hiring decisions. (This is something we care deeply about getting right. We run regular audits on our debiasing and publish our methodology publicly: jackandjill.ai/jill/bias.) Each gate also only receives the fields relevant to its specific decision, which keeps inference fast and targeted.
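
The two steps described above (strip protected characteristics, then pass each gate only the fields it needs) can be sketched as follows. This is a deliberately simplified illustration: the contents of `PROTECTED_FIELDS` and the function names are assumptions, and dropping whole fields does not by itself remove demographic signals embedded in free text, which a real debiasing step would also have to handle.

```python
from typing import Any, Dict, Iterable

# Hypothetical list of fields treated as protected for this example.
PROTECTED_FIELDS = frozenset(
    {"name", "age", "gender", "date_of_birth", "photo_url", "marital_status"}
)


def debias(profile: Dict[str, Any]) -> Dict[str, Any]:
    """Drop protected fields before any gate sees the profile."""
    return {k: v for k, v in profile.items() if k not in PROTECTED_FIELDS}


def context_for_gate(profile: Dict[str, Any],
                     gate_fields: Iterable[str]) -> Dict[str, Any]:
    """Return only the debiased fields that this gate asked for."""
    clean = debias(profile)
    return {k: clean[k] for k in gate_fields if k in clean}
```

Because minimisation runs on the already debiased profile, a gate that mistakenly requests a protected field simply never receives it.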

An example search might look something like this:

Jill's candidate evaluation funnel: candidate context is debiased and minimised, then passed through five reasoning gates (location & eligibility, function fit, design craft, founding mentality, and go-to-market archetype fit) before a shortlist is handed off to the hiring manager, with regular fairness audits running alongside.

Why buzzwords don't matter

Since every gate reasons against evidence rather than vocabulary, surface signals don't help candidates game the pipeline. A candidate who lists "founding mentality" without the history to support it gets assessed on what they actually built, where they worked, and what they owned. And because our candidates described their experience to Jack in conversation, their profiles include real context a CV would never capture.

Problem 3: "AI can't reason over millions of records at once"

Our pipeline doesn't demand this. Each gate runs one LLM call per candidate. 1,000 candidates entering a gate means 1,000 parallel LLM calls, each one reasoning independently. This is what makes per-candidate depth viable at funnel scale: parallelism handles the throughput, while the funnel's earlier gates have already done the work of cutting the population before more nuanced reasoning kicks in.
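
A minimal sketch of this fan-out pattern, using Python's `asyncio`, might look like the following. The `evaluate` coroutine is a stand-in for a single LLM call (it only checks a precomputed set of signals), and the `max_concurrency` cap is an assumption added for the example rather than something described above.

```python
import asyncio
from typing import Dict, List


async def evaluate(candidate: Dict, gate_name: str) -> bool:
    """Placeholder for one LLM call per candidate; a real gate would call a model."""
    await asyncio.sleep(0)  # yield control, as a network request would
    return gate_name in candidate["signals"]


async def run_gate(candidates: List[Dict], gate_name: str,
                   max_concurrency: int = 100) -> List[Dict]:
    """Evaluate every candidate independently and concurrently; keep those that pass."""
    limit = asyncio.Semaphore(max_concurrency)

    async def guarded(candidate: Dict) -> bool:
        async with limit:
            return await evaluate(candidate, gate_name)

    verdicts = await asyncio.gather(*(guarded(c) for c in candidates))
    return [c for c, ok in zip(candidates, verdicts) if ok]


async def run_funnel(candidates: List[Dict], gate_names: List[str]) -> List[Dict]:
    """Run gates in order, so later gates only see the survivors of earlier ones."""
    for gate_name in gate_names:
        candidates = await run_gate(candidates, gate_name)
    return candidates
```

Calling `asyncio.run(run_funnel(pool, ["location", "function"]))` would run the cheap location gate over the whole pool and the function gate only over whoever survived it, which is the cost structure the paragraph above describes.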

Iteration and observability

Since each gate is independent, Jill can test any gate in isolation, running it on a sample population and inspecting exactly who passed and who failed. We call this the dye test: we track known candidates through the pipeline, check whether each gate evaluated them correctly, and iterate until the pipeline is right. Every prompt is readable directly from the brief page UI. The pipeline is debuggable in a way a monolithic ranking model never is.
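
The idea of a dye test can be illustrated with a small harness that runs reference candidates through each gate and reports where the verdict differs from what a human expected. This is a sketch, not the actual tooling: `dye_test` and its arguments are hypothetical, and plain Python functions stand in for LLM-backed gates.

```python
from typing import Callable, Dict, List, Tuple

GateFn = Callable[[Dict], bool]


def dye_test(gates: Dict[str, GateFn],
             known: List[Tuple[Dict, Dict[str, bool]]]) -> List[str]:
    """Return one message per gate verdict that disagrees with the expected outcome.

    `known` pairs each reference candidate with the verdicts a human expects,
    keyed by gate name. An empty result means every gate behaved as expected.
    """
    mismatches = []
    for candidate, expected in known:
        for gate_name, want in expected.items():
            got = gates[gate_name](candidate)
            if got != want:
                want_label = "pass" if want else "fail"
                got_label = "pass" if got else "fail"
                mismatches.append(
                    f"{candidate['id']}: {gate_name} expected {want_label}, got {got_label}"
                )
    return mismatches
```

Because each message names both the candidate and the gate, a failure points at the single prompt that needs revising rather than at the pipeline as a whole.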

What this means in practice

A hiring manager briefs Jill. Within minutes:

She's researched the company, team, and role
She's built a structured evaluation rubric with the hiring manager
She's tested it against reference candidates
She's run the full search pipeline
She's presenting a shortlist of deeply evaluated candidates with per-dimension reasoning

The candidates she shows aren't people who happened to have the right keywords four years ago. They're people who are actively looking, have explicitly relevant experience (validated by multi-step AI evaluation), and have told their agent they're open to exactly this kind of role.

If you're hiring and want to see what this looks like for your role, talk to Jill.

A final word

For most roles, our network is where our value really lies. But for highly specialist or niche positions (where any sourcing tool would struggle with liquidity) we've extended the same search methodology to public candidate profiles via data partners. You get the same structured, rubric-grounded evaluation, just applied to a broader pool. And it's completely free.

We're Jack & Jill. Jill is your AI recruiting agent. Jack is the candidate's AI career agent. Between them, they make introductions that actually work.
