May 27, 2026 · 21 min read

How to Evaluate a B2B Outbound Agency Before You Sign (2026 Founder's Checklist)

B2B outbound agencies fail B2B SaaS founders the same way every time. Here are the 8 questions, 6 red flags, and the ownership test that separates a real partner from another template farm.

Samuel Roa

Founder, TrueAdvertize

If your B2B sales motion is active but the pipeline is flat, you may already have had the same polished agency conversation more than once. An agency partner gets a warm intro from someone in your network. Their first call sounds smart. Their second call sounds tailored. By the third call you're being asked to sign a multi-month retainer, with the promise that pipeline will start moving by week 4.

You've been burned before, so you ask the questions you remember being burned by last time. They have good answers. You sign.

Ninety days later, the sequences look like sequences. The pipeline is up by a few meetings. The reply rate is below 2%. Your team can't tell you what's working without the agency present. The agency wants to extend.

The recurring failure is signing before the scope, ownership, evidence, and handover terms are clear. And the reason it keeps happening is that the questions you remember to ask are the ones the agency has rehearsed answers to. The questions you need to ask are different.

I run TrueAdvertize. We build owned revenue engines for B2B companies on a fixed-scope Build with a defined day-90 handover. I sit on the other side of these evaluations almost every week. Some founders sign with us. Some sign with someone else. Some decide to build in-house. The framework below works regardless of who you pick. If we lose a deal because a founder used it on us and decided we weren't the fit, that's a result we can live with. The point is that you stop signing with template farms.

This is what I would want any B2B buyer to read before evaluating TrueAdvertize, an agency, a consultancy, or an in-house build.

Why most B2B outbound agencies fail (three failure modes)

Before the questions, use three delivery-model patterns to inspect the proposal: reusable execution, indefinite dependency, and strategy without implementation. Treat them as diagnostic patterns, not universal labels. Knowing which pattern you're looking at on the first call is more useful than any specific question.

The template farm. Sends similar sequences across every client. The "ICP research" is a 30-minute call where they ask you who your customer is, then they write copy that sounds like every other agency's copy because it is every other agency's copy with names swapped. The giveaway: their pitch deck has 12 logos of companies in different industries, and the case studies all show the same three metrics. You're paying for distribution of generic copy at scale, and the reply rate stays low because the constraint is the list, not the volume.

The retainer-dependency model. Charges a monthly retainer with no compounding asset on the client side. The agency owns the Clay workflows. The agency owns the sequences inside their Smartlead or Instantly account. The agency owns the data enrichment subscriptions. The day you cancel, you have nothing. The "system" lived in their tooling under their logins, and they're not handing you the keys because the keys are the only thing keeping you on the retainer. This is the model the industry runs on because it's the most profitable, not because it's the most honest.

The strategy deck. Charges a large upfront fee for a 40-page Notion doc that describes the GTM motion you should build. Then they walk. The deck is rigorous, the slides are pretty, and at no point did anyone build anything. You now have an excellent description of the system that you still have to build yourself, with a third of your runway gone.

Each of these models has its place somewhere in the market. The right model depends on the job: ongoing execution, strategy, a system Build, or an internal-team intervention. Spot the pattern on call one, and you've already filtered out most of the noise.

The eight questions to ask before signing

These are the questions that separate the agency that's actually building a system from the three failure modes above. Each one has a clean answer if you're talking to the right firm. If the answer is hedged, vague, or "let me get back to you on that," treat it as a red flag.

1. "Who owns the Clay workflows, sequences, and data on day 90?"

The single highest-leverage question on this list. If the answer is anything other than "you do, 100%, in your own accounts, with documented SOPs and admin access," you're signing for a dependency, not a system. Ask to see the standard handoff document. Real partners have one. Template farms don't.

2. "What's your reply rate floor, and what list do you measure it against?"

Any agency can quote a reply rate because the number is decoupled from list quality. The honest version of this answer specifies two things: the reply rate floor they design toward (an 8% engineered target on a tight, signal-based cold list is reasonable; a number that only holds on a tiny test list is not), and the list construction (Apollo scrape with no filters versus a hand-curated TAM with multi-provider enrichment). If they can't articulate list discipline, the reply rate number is theater.

3. "What does week 1 of the build actually look like?"

You want to hear specifics. "Day 1 we kick off and do ICP discovery. Day 2 we map your current funnel and identify gaps. Day 3 we draft the system architecture in a shared doc. Day 5 you get the blueprint." If the answer is "we'll send you a project plan after kickoff," they don't have a repeatable process, which means they're inventing it on your engagement and they'll miss timelines.

4. "Can I review a recent, permissioned reference or handover example?"

Ask for evidence recent enough to reflect the current offer: a permissioned reference, signed case study, redacted handover, or dated system artifact. Confirm what the client owned after the engagement. Two flags on this one. First, agencies that only show you 18-month-old case studies are showing you the only ones that worked. Second, the question filters for whether the agency lets clients leave with a working system in the first place. Retainer-dependency shops can't produce this reference because their clients either churned (bad) or are still on retainer (which means they don't really own anything).

5. "What does the contract say happens if the Build misses an agreed obligation?"

Ask for the exact eligibility rule, refund or credit amount, trigger, exclusions, deadline, and artifact treatment in writing. For TrueAdvertize, the Build includes a 30-day money-back guarantee; the signed engagement letter controls the details. If the answer is "we don't offer refunds because every engagement is different," the firm doesn't believe in its own process enough to put real money on it. The risk reversal isn't about the money. It's about whether the firm has skin in the game.

6. "What tools are you using on this engagement, and whose account do they live in?"

If the agency runs the campaign through their Clay seat, their Instantly campaign, their Inboxkit setup, you're renting infrastructure. The handoff at engagement end requires migrating all of that into your accounts. The honest agency builds inside your tools from day 1, with their team as collaborators on your seats, so the migration on day 90 is a transfer of admin rights, not a rebuild.

7. "Who will I actually be working with day to day?"

The pitch usually involves a senior partner. The work usually gets done by an account manager and an offshore junior. The disconnect is where most engagements quietly die. Ask for the name and LinkedIn of the person who will run your weekly call in week 3. If they can't tell you, the org chart isn't built around accountability.

8. "What happens at the end of the engagement? What's the post-handoff structure?"

The retainer-dependency agency will pitch you on "continued optimization" at month 4. The build agency will hand you a 30-page SOP library, recorded training videos, a documented playbook, and offer optional monthly check-ins that are scoped and paid hourly, not on retainer. The post-handoff structure is where the agency's actual model reveals itself.

Six red flags that should kill the deal on the spot

Some answers are not just hedged. They're disqualifying. If you hear any of these on the first or second call, end the evaluation.

Red flag 1: They can't define your ICP back to you by the second call. If they're still asking who your customer is after a 90-minute discovery call, they're going to spend your first three weeks doing what should have been done before they pitched you. The agencies worth hiring have either done in-niche work before or built their process around fast, rigorous ICP definition. Either way, by call two, they should be reflecting your ICP back to you in language sharper than the way you usually describe it.

Red flag 2: The case studies are all logos with no numbers. A logo wall says "this company once paid us." A real case study says "this founder at this stage ran this play for this period with this reply rate against this list size." If the deck is logos, ask for the case studies in narrative form. If they can't produce them, the engagement didn't go well enough to write down.

Red flag 3: The pricing model is a fixed monthly retainer with no scope cap. "We charge $8K per month for outbound services" means the scope is whatever they choose to ship each month, and what you pay is decoupled from outcomes. Pricing tied to the build (one number for the build phase, one number for the optimize phase, then optional check-ins) creates the right alignment.

Red flag 4: First-send speed is promised before dependencies are known. Ask what must be true before launch: list definition, data quality, deliverability, messaging approval, CRM tracking, and ownership. TrueAdvertize defines first-send criteria after the Blueprint. Speed-to-send is not a feature.

Red flag 5: They can't tell you which CRM, sending tool, or enrichment provider they prefer and why. "We work with whatever you have" sounds flexible but usually means they don't have opinions because they haven't built enough engagements to develop them. The agencies that work have a default stack they recommend (Clay, Instantly or Smartlead, Apollo, HubSpot or Salesforce) and a defensible reason for each tool's role.

Red flag 6: Your team's required time is not in the scope. Ask for the expected client time, decision owners, approval deadlines, training plan, and dependencies in writing. TrueAdvertize defines those commitments in the Build scope produced after the Blueprint.

The ownership question (and why it's the only one that matters in 12 months)

If I had to compress this entire framework into one question, it would be this:

"On day 91, what do I own that I didn't own on day 0?"

The answer separates every real partner from every retainer. Here's what good looks like in concrete artifacts:

The full GTM Blueprint document, with ICP definition, vertical splits, and rebuild plan
Working Clay tables, enrichment waterfall, and scoring logic in your Clay seat
Sequences live in your Instantly or Smartlead account, under your admin
A 30+ page SOP library (Notion, Confluence, whatever your team uses)
Recorded training videos walking your team through how to run the system
Documented attribution: how the CRM tracks pipeline back to source
Direct admin access to every credential the system uses

Notice what's not on this list: a deliverable that lives inside the agency's tooling, requires their account to run, or comes with an "ongoing support contract" without which it stops working. The ownership test is binary. Either you can run the system without the agency on day 91, or you can't. There's no middle path. Most retainer-dependency engagements fail this test, which is why so many clients churn at the 12-month mark feeling like they paid $96K (twelve months of an $8K retainer) for nothing.
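One way to make the binary test concrete is to write the day-91 inventory down and check every item against the same three conditions. The sketch below is illustrative only: the Artifact fields, the function names, and the sample inventory are hypothetical, not a TrueAdvertize tool or a standard format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    name: str
    in_client_account: bool    # lives in your workspace, not the agency's
    client_has_admin: bool     # you hold admin rights and credentials
    runs_without_agency: bool  # no agency login or support contract needed

    @property
    def owned(self) -> bool:
        return (
            self.in_client_account
            and self.client_has_admin
            and self.runs_without_agency
        )


def passes_ownership_test(artifacts: list[Artifact]) -> bool:
    # Binary by design: one rented artifact fails the whole system.
    # An empty handover owns nothing, so it fails too.
    return bool(artifacts) and all(a.owned for a in artifacts)


def still_rented(artifacts: list[Artifact]) -> list[str]:
    return [a.name for a in artifacts if not a.owned]


# Hypothetical day-91 inventory, for illustration only.
handover = [
    Artifact("Clay tables and scoring logic", True, True, True),
    Artifact("Sending sequences", False, False, False),  # still in the agency's account
    Artifact("SOP library", True, True, True),
]

print(passes_ownership_test(handover))  # False
print(still_rented(handover))           # ['Sending sequences']
```

The useful output is the second one: a named list of what you are still renting, which is what you take back to the contract conversation.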

The pricing models you'll encounter (and what each one signals)

Five pricing models dominate the B2B outbound agency market in 2026. Each one tells you something about how the firm runs.

Monthly retainer. Ongoing execution for a recurring fee. Confirm scope, term, staffing, cancellation, tool costs, and what transfers at exit.

Scoped monthly retainer. Ongoing execution with named deliverables. Confirm change-control rules, performance definitions, and ownership.

Performance-linked pricing. Some fee depends on a defined event. Confirm the event definition, attribution, exclusions, dispute process, and data access.

Fixed-scope Build. A defined implementation with a handover. TrueAdvertize uses this structure for Revenue Engine Builds: custom scope after a $5,000 Blueprint; most scoped Builds fall between $15,000 and $80,000.

Revenue-share model. Compensation depends on attributed pipeline or revenue. Confirm control of the funnel, attribution, term, data access, and post-contract rights.

There is no universally correct model. Choose the one that matches the job, internal owner, desired handover, and evidence.

How to evaluate case studies (the questions behind the numbers)

Every agency shows you the same case study format: company X, problem Y, our team did Z, results were W. The format hides almost everything that matters. Here are the questions that turn a case study into a real signal.

What was the list size and how was it built? A 12% reply rate on a 400-lead hand-curated list of in-ICP accounts is a real result. A 12% reply rate on a 40-lead pilot list from the founder's personal network is a vanity number. The list construction is most of the story.

How long did the engagement run before the reported metric? Reply rates in week 2 are not reply rates in week 12. Pipeline volume in month 3 is not pipeline volume in month 9. Ask for the time-series, not the headline.

What was the client's baseline before the engagement? A company that came in at 0.8% reply rate and ended at 9% is a real story. A company that came in at 6% and ended at 9% is a smaller intervention. Without baseline, the lift is unreadable.
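The two examples above end at the same 9% but describe very different interventions. A minimal sketch of the arithmetic, assuming you have the baseline and final reply rates as percentages (the lift function name is hypothetical):

```python
def lift(baseline_pct: float, final_pct: float) -> tuple[float, float]:
    """Return (percentage-point gain, multiple of baseline)."""
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive to compute a multiple")
    return final_pct - baseline_pct, final_pct / baseline_pct


# The two cases from the paragraph above.
for baseline, final in [(0.8, 9.0), (6.0, 9.0)]:
    points, multiple = lift(baseline, final)
    print(f"{baseline}% -> {final}%: +{points:.1f} points, {multiple:.2f}x baseline")
```

The first case is a gain of 8.2 points, roughly 11 times the baseline. The second is a gain of 3 points, 1.5 times the baseline. Neither figure can be computed from the headline 9% alone.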

What did the client retain after the engagement ended? If the answer is "we still run their outbound for them," the case study is a retention story, not a build story. The interesting version is: client X took over the system on day 91, has run it independently for the last 14 months, and is now at Y reply rate without us in the room.

Can I talk to the named contact at company X? If yes, schedule it. If no, ask why. The agencies worth hiring have clients who will take a 20-minute reference call. The agencies that aren't worth hiring have clients who won't.

The reply rate question (and how to spot the lies)

Reply rate is the most reported and most misreported metric in B2B outbound. A few quick filters.

A reply-rate claim needs its unit. Ask for the list, unique-contact denominator, total-versus-positive-reply definition, cold-versus-warm status, volume, timeframe, and source.

For planning, use 8% total replies as an engineered target on a tight, signal-based cold list, not a promised result. Track positive replies and meetings booked per 1,000 sends beside it.

Numbers above 20% reply rate should make you skeptical. Either it's a tiny test list (20% of 30 is 6 replies), or the reply definition is loose (counting auto-replies, out-of-office, unsubscribe-with-message as a reply), or the engagement was a single hand-crafted A/B test, not a sustained motion. Ask for the cohort definition.
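To see why a tiny test list deserves skepticism, it can help to put an uncertainty range around the headline number. The sketch below uses the Wilson score interval, one common way to estimate a range for a proportion. It is an illustration added here, not a method the agencies in question use, and the 3,000-contact comparison list is hypothetical.

```python
import math


def wilson_interval(replies: int, contacts: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% range (as fractions) for the underlying reply rate."""
    if contacts <= 0 or not 0 <= replies <= contacts:
        raise ValueError("need 0 <= replies <= contacts and contacts > 0")
    p = replies / contacts
    denom = 1 + z**2 / contacts
    center = (p + z**2 / (2 * contacts)) / denom
    half = z * math.sqrt(p * (1 - p) / contacts + z**2 / (4 * contacts**2)) / denom
    return center - half, center + half


# 6 replies from 30 contacts versus the same 20% on a hypothetical 3,000-contact list.
for replies, contacts in [(6, 30), (600, 3000)]:
    low, high = wilson_interval(replies, contacts)
    print(f"{replies}/{contacts}: {100 * low:.1f}% to {100 * high:.1f}%")
```

On the 30-contact list, the plausible range runs from about 10% to about 37%, so the "20%" says very little. On the larger list the range is only a few points wide.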

The most useful question: "what's your reply-to-meeting conversion rate?" Reply rates can be juiced. Meeting conversion is harder to fake. Ask for the reply-to-meeting conversion definition and underlying counts. Do not accept a percentage without the numerator and denominator.
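The filters above reduce to asking for raw counts and computing the rates yourself. A minimal sketch, under stated assumptions: the field names are hypothetical, the sample counts are invented for illustration, and reply-to-meeting is defined here as meetings booked divided by human replies, which is only one possible definition. Ask the agency which one it uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CampaignCounts:
    unique_contacts: int   # distinct people emailed, not emails sent
    total_sends: int       # every email sent, follow-ups included
    human_replies: int     # excludes auto-replies and out-of-office
    positive_replies: int
    meetings_booked: int


def pct(numerator: int, denominator: int) -> float:
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * numerator / denominator


def summarize(c: CampaignCounts) -> dict[str, float]:
    return {
        "total_reply_rate_pct": pct(c.human_replies, c.unique_contacts),
        "positive_reply_rate_pct": pct(c.positive_replies, c.unique_contacts),
        # Assumed definition: meetings booked per human reply.
        "reply_to_meeting_pct": pct(c.meetings_booked, c.human_replies),
        "meetings_per_1000_sends": 10.0 * pct(c.meetings_booked, c.total_sends),
    }


# Invented counts, for illustration only.
example = CampaignCounts(
    unique_contacts=500,
    total_sends=1500,
    human_replies=40,
    positive_replies=15,
    meetings_booked=10,
)

for metric, value in summarize(example).items():
    print(f"{metric}: {value:.1f}")
```

Keeping the counts beside the percentages is the point: every rate here can be recomputed by someone outside the agency.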

What money-back guarantees actually mean

Most agencies don't offer them. Among the ones that do, the guarantee usually has more fine print than a credit card agreement. A few questions cut through this.

What triggers the refund? "Build doesn't ship on time" is a clean trigger. "We didn't hit the agreed outcome" is more subjective and usually litigated. The trigger should be measurable from the outside.

How much is refundable? "100% of the build fee" is meaningful. "Up to 50% pro-rated against unbilled work" is mostly a marketing line.

What do you keep if you trigger it? "Keep all artifacts shipped to that point" is the answer that matters. If the answer is "you keep nothing," the guarantee is theater because you'll have to start the rebuild from scratch anyway.

When does the window close? A 30-day window from kickoff is honest. A 7-day window is performative. A window that runs until first send is built to expire before anything can go wrong.

The reason this matters isn't that you're planning to invoke the refund. It's that an agency willing to put real money behind a clean trigger has confidence in its own delivery, which is a structural signal about whether the firm will treat your engagement seriously.

When NOT to hire a B2B outbound agency

This article is from someone who runs one. So the risk of this whole framework is the implication that the answer is always "hire a good agency." It isn't.

There are at least three situations where hiring an agency is the wrong move, and you should know which one you're in before you start evaluating.

You do not yet know which market or problem you are validating. A system Build is premature when the core buyer, problem, and message are still changing weekly. Keep learning directly before engineering scale.

Your in-house team already owns the system and needs a focused intervention. If the data, workflows, and operators exist, the right scope may be a Recalibration or a targeted enablement project rather than a full Build.

Your problem is closing, not generating. If qualified pipeline is healthy but deals stall, fix discovery, qualification, deal process, or enablement before adding more top-of-funnel.

A Revenue Engine Build fits when the market is defined, the sales motion is research-heavy, the pipeline is flat or founder-led, observable buyer signals exist, and an in-house operator can inherit the system.

The reference call playbook

When you get a reference call with a former client, use the limited time to extract the real story. Many buyers waste it on small talk and surface questions. Three categories of questions actually surface signal:

Operational. "Walk me through what your team was doing in week 4." "Who on their team was on your weekly calls, and were they the same people from week 1 to week 12?" "What's something you had to push back on the agency about, and how did they respond?"

Outcome. "What was your reply rate at week 4 vs. week 12?" "How many meetings were on the calendar at handoff?" "What's the reply rate now, six months after the engagement ended?"

Structural. "On day 91, what did you own that you didn't own on day 0?" "Has the system kept running without them in the room?" "If you had to do it again, would you sign with them?"

The last question is the one that breaks through. If the answer is yes, you've got real social proof. If the answer is "I think so, but..." you've got a signal worth probing. If the answer is "honestly, no," you just saved yourself a year.

FAQ

How long should the evaluation process take?

Three to four weeks is reasonable. Call one is intro and qualification. Call two is process and team. Call three (after they've sent a draft scope of work) is structure, pricing, and reference calls. Anyone pushing you to sign in under two weeks is selling, not evaluating.

How many agencies should I evaluate in parallel?

Three is the sweet spot. One forces you to compare against your own assumptions. Two creates a coin flip. Three lets you triangulate. More than four diffuses the energy across too many discovery calls and you end up with surface knowledge of seven firms instead of deep knowledge of three.

What does TrueAdvertize cost?

The Revenue Engine Blueprint is $5,000 flat and creditable toward a Build within 30 days. Revenue Engine Builds are custom-scoped after the Blueprint; most scoped Builds fall between $15,000 and $80,000, depending on data, targeting, messaging, CRM, measurement, training, and handover scope. Enterprise engagements are scoped custom above the band and uncapped. Revenue Engine Recalibration starts at $5,000/quarter. Price alone does not establish quality; compare the written scope, evidence, ownership, and terms.

The Blueprint is the paid diagnosis and architecture phase. You keep it whether or not you continue to a Build.

Should I ask for a paid pilot before the full engagement?

Yes, if the provider offers one. A paid discovery and architecture phase lets you see how they work without committing the full build budget. For TrueAdvertize, that phase is the $5,000 Revenue Engine Blueprint, creditable toward a Build within 30 days. Providers that refuse any paid pilot usually do so because their process front-loads sales and back-loads delivery, which is the inverse of what you want.

What's the single biggest mistake buyers make when hiring outbound agencies?

Optimizing for speed-to-pipeline instead of ownership-at-handoff. The agencies that promise pipeline by week 4 are usually running generic templates, and the reply rate stays low because the list, not the calendar, is the constraint. The agencies that build the system before sending are building something you'll still own at month 12.

How do I know if an agency has actually run their own GTM before, or if they're just selling consulting?

Ask them what their own outbound stack looks like, how they source their own leads, and what their inbound-to-outbound mix is. Agencies that have actually run a GTM motion can answer with specifics in 30 seconds. Agencies that have only consulted on GTM hedge or pivot.

Key takeaways

If you skim everything above and only remember six things, remember these:

Inspect the delivery model, not the label. Ask what is custom, what ships, who owns it, how evidence is defined, and what remains after the contract ends.
The ownership question is the single highest-leverage filter. On day 91, do you own the system or are you still renting it? Binary test, no middle path.
Fixed-fee builds align incentives better than retainers when the goal is standing up a system you own. Retainers make sense for ongoing management once a system exists. Builds make sense for getting to that system in the first place.
Case studies without baselines, list sizes, and time series are vanity metrics. Push for the questions behind the numbers.
Real money-back guarantees signal real confidence in delivery. The version that means something defines its trigger, refund, and artifact treatment in the signed engagement letter, and the provider shows you that language up front.
You don't always need an agency. Pre-PMF, in-house with playbooks, and broken-close-rate scenarios all require different interventions. Make sure you're in the right bucket before you start evaluating firms.

The framework above doesn't tell you which agency to hire. It tells you how to filter the ones that aren't worth hiring at all. The agencies that survive this framework are the ones whose model is built around your outcomes, not theirs. That's a small fraction of the market. Once you have your three finalists, the rest is reference calls and gut.

Use this framework on TrueAdvertize. Send the questions in advance. We will answer them in writing and identify the current evidence we can share. Apply the same process to every provider you are considering. Whichever firm passes the framework most cleanly is the one you should sign with. We're happy to compete on that basis.

If you want to map the system your company would need, book a Revenue Engine Diagnostic. Thirty minutes, founder-led, no pitch. We'll show you the framework above applied to your stage, and if we're not the right fit, we'll tell you who is.