B2B Services · Lead Operations
Two afternoons of cleanup, collapsed into an overnight chain.
A B2B services company was scraping prospect lists every week — and paying for it with two afternoons of manual cleanup before anything was usable in HubSpot. We rebuilt the pipeline around a single decision: one normalized source table between the scrapers and the CRM. Five services now hydrate, validate, and dedupe records autonomously. Owners wake up Monday to clean contacts.
In one sentence
A source-table-driven enrichment pipeline — Apify scrapers feed a normalized contact table, Apollo and Seamless.AI hydrate, ZeroBounce validates, HubSpot upserts on canonical email — so adding a scraper or swapping a service costs hours, not a pipeline rewrite.
The Problem
Scraper output isn't lead data. It's lead-shaped strings.
The team ran a handful of Apify scrapers against industry directories every week. The output was technically lead data — names, companies, sometimes a job title — but it wasn't clean, validated, or usable. Email columns were a mix of guessed addresses, generic info@ aliases, and outright typos. Phone fields were empty more often than not. Two prospects from two different scrapers would land as two different contacts because their company names disagreed by a comma.
So every Tuesday scrape kicked off a familiar ritual: two team members would spend Wednesday and Thursday afternoons running rows through Apollo, copying numbers from LinkedIn, validating emails by hand, normalizing company names. By Friday a CSV was ready to upload into HubSpot — at which point roughly 30% of the records duplicated against existing contacts, and the owners would spend the following week chasing ghost leads that turned out to be bad numbers.
The cleanup tax was so high that scrapes ran less often than they should have. The team wanted weekly. They were running fortnightly, sometimes monthly. The pipeline went hungry — not because there weren't leads in the directories, but because no one had two afternoons free.
I want to add a scraper without rewriting the pipeline. And I want to swap Apollo for the next-shiny-thing without rewriting the pipeline.
— The ops lead, during scoping
Two previous attempts to automate it had ended the same way: a tightly-coupled Zap stitched a scraper directly into HubSpot. Change one input — a different directory, a different field, a different enrichment provider — and the whole pipeline broke. Nobody wanted to touch it. Nobody could swap anything out without rewriting the entire chain.
When we asked about validation strategy in the kickoff, the team's first answer was “we just clean it manually.” Those two afternoons of cleanup were the cost we were there to eliminate. The brief came back with three non-negotiables:
- Scrapers must not talk to HubSpot directly — clean separation between layers
- Every enrichment service must be swappable without rewriting the pipeline
- Every record must carry its own audit trail: which service ran, when, and what it returned
The Approach
Most pipelines couple scrapers directly to the CRM. That's the trap.
The instinct — and the reason every prior attempt failed — is to route the scraper straight into HubSpot, with a few enrichment steps in between. It looks economical. It removes a layer. It feels like the right answer because there's less to maintain.
It's the trap. The moment scraper output is the same shape as a HubSpot contact, every change is a pipeline rewrite. New scraper? Rewrite. New enrichment service? Rewrite. HubSpot adds a custom property? Rewrite. The architecture has no neutral ground — the scraper and the CRM are speaking each other's dialect, and any third party that wants in has to translate both ways.
We argued for one normalized source table between them. A source-table-driven shape — neutral to any scraper, neutral to any CRM — where every record lands first and every enrichment happens against the same row. That one decision is what made every other piece swappable. The ops lead said it best: add a scraper without rewriting the pipeline; swap Apollo for the next-shiny-thing without rewriting the pipeline.
From that one decision, the rest of the design fell out naturally.
Key insight
The source table isn't overhead. It's the architectural commitment that lets you add a scraper without touching the CRM, swap an enrichment service without rewriting the pipeline, and survive every change you don't see coming.
One normalized source table between layers
Apify scrapers write into a single normalized contact table — neutral to scraper origin and neutral to CRM destination. No scraper talks to HubSpot directly. Adding a scraper is a one-table change. Migrating off HubSpot one day would be a one-table change too.
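To make the idea concrete, here is a minimal sketch of what "one shape for every scraper" can look like. The field names, the two scraper output formats, and the adapter functions are all hypothetical; the engagement's real schema is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceRow:
    """One prospect in the neutral source table (field names are illustrative)."""
    full_name: str
    company: str
    email: Optional[str] = None
    phone: Optional[str] = None
    job_title: Optional[str] = None
    scraper: str = ""  # which scraper produced the row


def from_directory_a(item: dict) -> SourceRow:
    # Hypothetical scraper whose output uses "name" / "org" / "mail" keys.
    return SourceRow(
        full_name=item["name"].strip(),
        company=item["org"].strip(),
        email=item.get("mail") or None,
        scraper="directory_a",
    )


def from_directory_b(item: dict) -> SourceRow:
    # Hypothetical scraper that splits the name and calls the company "employer".
    return SourceRow(
        full_name=f'{item["first"]} {item["last"]}'.strip(),
        company=item["employer"].strip(),
        job_title=item.get("title"),
        scraper="directory_b",
    )


rows = [
    from_directory_a({"name": " Dana Reyes ", "org": "Acme Inc", "mail": "dana@acme.example"}),
    from_directory_b({"first": "Dana", "last": "Reyes", "employer": "Acme, Inc.", "title": "COO"}),
]
```

Both rows end up as the same type with the same fields, so everything downstream reads one shape. Adding a third scraper means writing one more adapter and nothing else.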
Each enrichment service is one swappable step
Apollo for phone + email hydration. Seamless.AI as a cross-check. ZeroBounce for deliverability validation. Each runs as a discrete Zapier step against the source-table row. If Apollo deprecates an endpoint tomorrow, we replace one step. The chain keeps running.
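The swap-one-step property can be sketched in a few lines. The version below is an assumption-heavy illustration, not the Zapier configuration: the enrichers are fake stand-ins that make no API calls, and `run_chain` is a hypothetical helper.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Enricher = Callable[[Row], Row]


def fake_phone_lookup(row: Row) -> Row:
    # Stand-in for a hydration provider; a real step would call its API here.
    return {"phone": "+1-555-0100"}


def fake_email_check(row: Row) -> Row:
    # Stand-in for a deliverability check.
    return {"email_status": "valid" if "@" in str(row.get("email", "")) else "invalid"}


def run_chain(row: Row, steps: List[Enricher]) -> Row:
    """Apply each step in order; a step reads the row and returns only the fields it adds."""
    result = dict(row)
    for step in steps:
        result.update(step(result))
    return result


chain: List[Enricher] = [fake_phone_lookup, fake_email_check]
enriched = run_chain({"email": "dana@acme.example"}, chain)


def other_phone_lookup(row: Row) -> Row:
    # A replacement provider with the same call shape.
    return {"phone": "+1-555-0199"}


# Swapping a provider replaces one list entry; the rest of the chain is untouched.
chain[0] = other_phone_lookup
swapped = run_chain({"email": "dana@acme.example"}, chain)
```

Because every step has the same call shape and only touches the row, replacing a provider is a one-entry change.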
Validation gates before HubSpot writes
Records below a confidence threshold — no valid email, no phone, ambiguous company match — are flagged in the source table and held back. HubSpot only ever sees records that cleared validation. The CRM is the operational copy. The source table is the truth.
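A gate like this can be expressed as a small pure function. The sketch below assumes that any single failing check holds the row back, and the field names (`email_status`, `company_match_score`) and the 0.8 threshold are illustrative, not the values used in the engagement.

```python
from typing import Any, Dict, List, Tuple


def validation_gate(row: Dict[str, Any], min_company_score: float = 0.8) -> Tuple[bool, List[str]]:
    """Return (cleared, reasons). Any failing check holds the row back."""
    reasons: List[str] = []
    if row.get("email_status") != "valid":
        reasons.append("no valid email")
    if not row.get("phone"):
        reasons.append("no phone")
    if float(row.get("company_match_score", 0.0)) < min_company_score:
        reasons.append("ambiguous company match")
    return (not reasons, reasons)


good = {"email_status": "valid", "phone": "+1-555-0100", "company_match_score": 0.93}
held = {"email_status": "invalid", "phone": "", "company_match_score": 0.4}

cleared_good, reasons_good = validation_gate(good)
cleared_held, reasons_held = validation_gate(held)
```

Returning the reasons alongside the flag means the held row can carry an explanation in the source table, which is what makes the hold reviewable later.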
Deduplication on canonical email
Email becomes the canonical dedup key after ZeroBounce confirms deliverability. HubSpot upserts on canonical email, not on the scraper-provided string. Two scrapers surfacing the same prospect with different company-name spellings collapse into one record, not two.
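The collapse can be illustrated with an in-memory stand-in for the CRM. This is a sketch under stated assumptions: "canonical" here means only trimmed and lowercased, the fill-empty-fields merge policy is our illustration, and the dictionary is not HubSpot's API.

```python
from typing import Dict


def canonical_email(email: str) -> str:
    # Assumed canonical form: trimmed and lowercased. Stricter rules
    # (for example provider-specific alias handling) are left out on purpose.
    return email.strip().lower()


def upsert(crm: Dict[str, dict], row: dict) -> str:
    """Insert or merge a validated row keyed by canonical email; returns 'created' or 'updated'."""
    key = canonical_email(row["email"])
    if key in crm:
        # Keep existing values and fill only the fields that are still empty.
        for field, value in row.items():
            if field != "email" and value and not crm[key].get(field):
                crm[key][field] = value
        return "updated"
    crm[key] = {**row, "email": key}
    return "created"


crm: Dict[str, dict] = {}
first = upsert(crm, {"email": "Dana@Acme.example ", "company": "Acme Inc", "phone": ""})
second = upsert(crm, {"email": "dana@acme.example", "company": "Acme, Inc.", "phone": "+1-555-0100"})
```

The two inputs disagree on company spelling and email casing, yet they land on one key. The second call fills in the missing phone instead of creating a second contact.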
Audit trail per record per service
Every enrichment call writes its timestamp, result, and confidence back into the source-table row. If a HubSpot contact looks wrong, the source-table row shows exactly which service touched which field and when. No more guessing whether Apollo or Seamless.AI gave us that phone number.
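As a rough illustration of the write-back, the sketch below appends one audit entry per enrichment call. It simplifies the real layout (the table uses columns per service, while this uses a list on the row), and the function names and confidence values are hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def record_enrichment(row: Dict[str, Any], service: str, field: str,
                      value: Any, confidence: float) -> None:
    """Write the result onto the row and append an audit entry for it."""
    row[field] = value
    row.setdefault("audit", []).append({
        "service": service,
        "field": field,
        "value": value,
        "confidence": confidence,
        "at": datetime.now(timezone.utc).isoformat(),
    })


def who_set(row: Dict[str, Any], field: str) -> List[str]:
    """List the services that wrote a field, oldest first."""
    return [entry["service"] for entry in row.get("audit", []) if entry["field"] == field]


row: Dict[str, Any] = {"email": "dana@acme.example"}
record_enrichment(row, "apollo", "phone", "+1-555-0100", 0.7)
record_enrichment(row, "seamless", "phone", "+1-555-0100", 0.9)
phone_sources = who_set(row, "phone")
```

Answering "where did this phone number come from?" then becomes a lookup on the row rather than a guess.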
Implementation
How the pieces fit.
The build looks deceptively simple from the outside — a few Zaps, a table, three APIs. The decisions that made it work were where each service drew its line.
Apify — normalized at the Actor level
One or more Actors run against the target directories on a schedule. The output schema is normalized inside each Actor — not downstream — so a new scraper added next quarter writes the same shape into the source table. The cost of normalization is paid once, at the edge, where the data is freshest. We resisted the urge to do it in Zapier; that's how the last pipeline got brittle.
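One example of normalization that is cheap at the edge is company naming, where two scrapers can disagree by a comma. The sketch below is an illustration only: the suffix list is an assumption, and the function is not taken from the Actors built for this engagement.

```python
import re

# Assumed list of legal suffixes to drop; a real list would be longer.
_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}


def normalize_company(name: str) -> str:
    """Lowercase, strip punctuation, and drop a trailing legal suffix."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    if len(words) > 1 and words[-1] in _SUFFIXES:
        words = words[:-1]
    return " ".join(words)


a = normalize_company("Acme, Inc.")
b = normalize_company("Acme Inc")
```

Running this inside each scraper, before the row is written, means the source table never sees the two spellings as different companies.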
The source table — Airtable as the spine
One row per prospect, one column per provenance field, one timestamp per enrichment service. The source table is the per-record audit trail, the dedup arbiter, and the staging area in one place. Nothing flows from a scraper to HubSpot without passing through it. No record moves from one enrichment service to the next without the result being written back. The source table is what the team looks at when something disagrees with reality.
Zapier — two Zaps, not one
Splitting the orchestration into two Zaps was a deliberate call. Zap 1 watches for new source-table rows and runs the enrichment chain: Apollo → Seamless.AI → ZeroBounce, with each step writing back to its own column. Zap 2 watches for rows whose validation flag turned green and upserts them into HubSpot. The split means a scraper never talks to HubSpot directly, a failure in enrichment never partially writes to the CRM, and re-running enrichment doesn't re-trigger CRM writes. Two Zaps, two responsibilities, no cross-contamination.
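The two-stage split can be modeled as two functions that share nothing but the table. This is a sketch of the pattern in plain Python, not the Zaps themselves; the `status` and `synced` fields and the fake enricher are assumptions made for the example.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def enrichment_stage(table: List[Row], enrich: Callable[[Row], Row]) -> int:
    """Stage 1: enrich rows marked 'new'. Never touches the CRM."""
    processed = 0
    for row in table:
        if row.get("status") == "new":
            row.update(enrich(row))
            row["status"] = "validated" if row.get("email_status") == "valid" else "held"
            processed += 1
    return processed


def crm_stage(table: List[Row], crm: Dict[str, Row]) -> int:
    """Stage 2: push validated rows that were not synced yet. Never calls enrichment."""
    written = 0
    for row in table:
        if row.get("status") == "validated" and not row.get("synced"):
            crm[row["email"]] = {"email": row["email"], "phone": row.get("phone")}
            row["synced"] = True
            written += 1
    return written


def fake_enrich(row: Row) -> Row:
    # Stand-in for the real enrichment chain.
    return {"phone": "+1-555-0100",
            "email_status": "valid" if "@" in row["email"] else "invalid"}


table: List[Row] = [
    {"email": "dana@acme.example", "status": "new"},
    {"email": "not-an-email", "status": "new"},
]
crm: Dict[str, Row] = {}

enriched_count = enrichment_stage(table, fake_enrich)
first_sync = crm_stage(table, crm)

# Re-running enrichment on an already-synced row does not cause a second CRM write.
table[0]["status"] = "new"
rerun_count = enrichment_stage(table, fake_enrich)
second_sync = crm_stage(table, crm)
```

The held row never reaches the CRM, and the re-enriched row is skipped by the second stage because it is already marked as synced.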
HubSpot — custom properties carry provenance
Each enriched field in HubSpot has a sibling property recording which service contributed it and when. Upserts key on the canonical email, not on the scraper-provided string. Owner assignment and lifecycle stage are set by the enrichment outcome (confidence band, geography, role match), not by which scraper found the prospect. Sales sees clean records. Marketing sees the same clean records. The result: zero-touch enrichment for the team. Nobody has to know there's a source table behind it. Which is exactly the point.
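The sibling-property idea amounts to flattening each enriched value together with its provenance. In the sketch below, the `_source` and `_enriched_at` suffixes are hypothetical property names chosen for illustration, and the output is a plain dictionary rather than a HubSpot API payload.

```python
from typing import Any, Dict


def with_provenance(enriched: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Flatten {field: {value, service, at}} into a property map with sibling provenance fields."""
    props: Dict[str, Any] = {}
    for field, info in enriched.items():
        props[field] = info["value"]
        props[f"{field}_source"] = info["service"]   # hypothetical property name
        props[f"{field}_enriched_at"] = info["at"]   # hypothetical property name
    return props


properties = with_provenance({
    "phone": {"value": "+1-555-0100", "service": "apollo", "at": "2024-01-08T02:14:00Z"},
})
```

Anyone looking at the contact can then see the value and its origin side by side without opening the source table.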
Results
From two afternoons to overnight.
Before the rebuild, a Tuesday scrape locked up two team members for Wednesday and Thursday afternoons, surfaced in HubSpot on Friday with roughly a 30% duplicate rate, and cost the owners the following week chasing contacts that turned out to be bad. Net usable yield per scrape: low, and the team had stopped trusting it.
After the rebuild, the scrape runs on schedule, the enrichment chain runs autonomously overnight, validation flags catch the records that aren't safe to upload, and HubSpot sees only contacts that cleared every gate. Owners walk into Monday with clean, deduplicated records and confidence in the email addresses. The two afternoons disappeared. The duplicate rate collapsed. Scrapes now run on the cadence the team always wanted.
Manual review batches eliminated
Two afternoons per scrape, gone. Enrichment runs autonomously between the scrape and the CRM write.
Duplicate rate dropped to near-zero
Canonical-email dedup collapses cross-scraper overlap into single records. The 30% duplicate problem stopped existing.
Five services, one source table, zero rewrites
Add a scraper, swap an enrichment provider, change a HubSpot property — the pipeline absorbs the change without a rewrite.
Audit trail per record
Every HubSpot contact has a source-table row showing the exact enrichment chain. No more guessing where a field came from.
Scrape cadence finally matches intent
The team can run weekly because nothing has to be cleaned by hand. The pipeline went from hungry to fed.
The Takeaway
If you take one thing from this: the source table isn't overhead. It's the only thing that lets your pipeline survive the next change you don't see coming — a new scraper, a deprecated enrichment service, a CRM migration, a compliance rule that demands provenance. Couple scrapers directly to HubSpot and you rebuild every time. Put one normalized table between them and you stop rebuilding.
Engagement artifact
The methodology, shared on request.
Source-table schema + Zapier blueprint, on request
The implementation specifics — table schema, Zap step-by-step, HubSpot property setup — are confidential to the engagement. The methodology is not. We share a redacted blueprint with teams running scrapers into a CRM who want to understand the source-table separation before they commit to it.
Running scrapers into a CRM with afternoons of cleanup in between?
Let's design the source-table separation before the cleanup eats your week.
30 minutes. Share your scraper output, your CRM shape, your validation rules. We'll tell you whether one normalized source table can collapse two afternoons of cleanup into a Zapier chain that runs overnight.
Book a 30-min call