Taming AI Rate Limits at RefundIQ

A fintech SaaS platform processing 80,000+ emails per user was hitting silent API failures, unpredictable sync times of 4+ hours, and inflated AI costs. This is how the entire Gemini API usage layer was redesigned — from an unbounded parallel fire-and-forget model to a controlled, resilient, cost-optimized pipeline.
NestJS
Gemini 2.5 Flash
Vertex AI
BullMQ / Redis
Node.js
TypeScript
MongoDB

10-20 mins

Sync Time After
↓ 95% reduction (was 4.5 hrs)

-75%

AI Token Cost Reduction
via prompt caching + unified calls

Zero

Silent AI Failures Post-Fix
real retry with exponential backoff

THE PROBLEM

Three Interconnected Failure Modes

The platform's email sync pipeline used Gemini to classify and extract data from every email. At scale, this architecture broke down across three interconnected failure modes.

01

Unbounded parallel AI bursts → silent data loss

100 emails were fired at Gemini simultaneously via Promise.all() — producing bursts of 200 concurrent AI requests. When Google's rate limiter returned 429 errors, the handler silently returned undefined. Emails were marked as processed with empty data. Users lost refunds that should have been found — with no alerting, no retries, no visibility.

02

Two API calls per email — double token spend

Classification and parsing were separate Gemini calls per email. Each call also re-sent a 3,000+ token static system prompt every time, despite Gemini's implicit prompt caching being available. Token spend was roughly 3× what it needed to be. At 80k emails per user, this added up fast.

03

No durable queue — no global rate limit control

There was no Redis, no BullMQ, no global concurrency cap. Each user's sync ran independently, with no awareness of other concurrent syncs. Multiple users syncing simultaneously could collectively breach Gmail and Gemini org-level rate limits with no circuit breaker, no retry coordination, and no observability into queue depth.

ARCHITECTURE: BEFORE

Unbounded, Brittle, Silent

Every email entered the pipeline and immediately triggered two sequential Gemini calls — no queuing, no rate control, no error handling beyond a silent null return.

THE SOLUTION

Three-Phase Optimization

Fixes were grouped into three phases ordered by dependency and immediate impact. Phase 1 alone dropped sync time from 4.5 hours to under 60 minutes.

Phase 1: Quick wins · ~1 day
Changes: Fix phantom Gemini model IDs. Add real 429 exponential backoff with jitter. Remove hard-coded 1s pause. Introduce bounded concurrency limiter (15–30 concurrent calls by billing tier). Replace .distinct() dedupe with duplicate-safe bulk insert.
Target: 4.5 hrs → 45–60 min

Phase 2: Structural · 3–5 days
Changes: Merge classify + parse into a single Gemini call with typed JSON schema output. Enable implicit prompt caching on static 3,000+ token system instructions. Switch Gmail to batch endpoint (100 sub-requests per HTTPS round trip).
Target: 45 min → 20–30 min

Phase 3: Infrastructure · 1–2 wks
Changes: BullMQ + Redis for global concurrency cap across all concurrent users. Parallel date sharding: 4–6 workers per user on non-overlapping date slices. Org-level rate limit enforcement prevents aggregate breaches.
Target: 20–30 min → 10–20 min

ARCHITECTURE: AFTER

Controlled, Observable, Resilient

All AI work is routed through a Redis-backed BullMQ queue. A global concurrency cap ensures the org never exceeds Gemini's RPM limit. Classify and parse are a single call with a cached system prompt prefix. Results stream progressively to the dashboard as emails are found.
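As a rough illustration of the global cap, the sketch below derives worker settings from an org-level RPM limit. The field names `concurrency` and `limiter: { max, duration }` mirror BullMQ's Worker options; the function name, the headroom factor, the queue name, and the numbers in the usage comment are assumptions for illustration, not the platform's actual configuration.

```typescript
// Shape mirrors BullMQ's Worker options `concurrency` and `limiter`.
interface AiWorkerOptions {
  concurrency: number; // jobs processed in parallel by one worker
  limiter: { max: number; duration: number }; // at most `max` jobs per `duration` ms
}

// Hypothetical helper: keeps the queue-wide rate a fixed fraction below the org RPM limit.
function buildAiWorkerOptions(
  orgRpmLimit: number,
  concurrency: number,
  headroom = 0.9, // assumed safety margin, not a value from the source
): AiWorkerOptions {
  if (orgRpmLimit < 1 || concurrency < 1) {
    throw new Error("orgRpmLimit and concurrency must be at least 1");
  }
  return {
    concurrency,
    limiter: {
      max: Math.max(1, Math.floor(orgRpmLimit * headroom)),
      duration: 60000, // one minute, so `max` reads as requests per minute
    },
  };
}

// Usage with BullMQ (not executed here; queue name, processor, and connection are placeholders):
//   import { Worker } from "bullmq";
//   new Worker("ai-emails", processEmail, { connection, ...buildAiWorkerOptions(1000, 30) });
```

BullMQ documents the limiter as applying across all workers attached to a queue, which is what makes it suitable for an org-level cap, whereas `concurrency` applies per worker instance.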

KEY ENGINEERING DECISIONS

The Design Choices That Mattered

Four targeted decisions drove the bulk of the performance and cost improvements.

Unified prompt design

A single Gemini call now returns both the classification label and all extracted fields as a typed JSON schema response. This eliminated a full round-trip per email and reduced token volume by ~50% on the AI layer alone, while also halving latency per email.
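A minimal sketch of what a unified response contract could look like follows. The label set, field names, and schema object are hypothetical (the source does not list the real ones), and the exact schema format the Gemini SDK expects may differ from this JSON-Schema-style object. The parser throws on malformed output so that a bad response is never stored as empty data.

```typescript
// Hypothetical label set and extracted fields, for illustration only.
type EmailLabel = "refund" | "order" | "other";

interface UnifiedResult {
  label: EmailLabel;
  merchant: string | null;
  amount: number | null;
  currency: string | null;
}

// JSON-Schema-style description of the single combined response.
const unifiedResponseSchema = {
  type: "object",
  properties: {
    label: { type: "string", enum: ["refund", "order", "other"] },
    merchant: { type: "string", nullable: true },
    amount: { type: "number", nullable: true },
    currency: { type: "string", nullable: true },
  },
  required: ["label"],
} as const;

const LABELS: readonly string[] = unifiedResponseSchema.properties.label.enum;

// Validates the model's JSON text and returns a typed result, or throws.
function parseUnifiedResult(raw: string): UnifiedResult {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("Expected a JSON object");
  }
  const obj = data as Record<string, unknown>;
  const label = obj.label;
  if (typeof label !== "string" || LABELS.indexOf(label) === -1) {
    throw new Error("Unexpected label: " + String(label));
  }
  const str = (v: unknown): string | null => (typeof v === "string" ? v : null);
  const num = (v: unknown): number | null =>
    typeof v === "number" && isFinite(v) ? v : null;
  return {
    label: label as EmailLabel,
    merchant: str(obj.merchant),
    amount: num(obj.amount),
    currency: str(obj.currency),
  };
}
```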

Implicit prompt caching

Static system instructions (3,000+ tokens of classification rules and extraction examples) now sit as an identical prefix at the start of every request, with the per-email content placed last, so Gemini's implicit caching can match them. Requests that hit the cache are billed for that prefix at a fraction of the normal input token cost, delivering ~60–75% savings on billed input tokens per sync.
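To see how a figure in the 60–75% range can arise, the arithmetic can be sketched as below. The 3,000-token static prompt comes from the text above; the 500-token email body, the 75% cached-token discount, and the cache hit rate are illustrative assumptions, and actual discounts should be checked against current Gemini pricing.

```typescript
interface InputTokenEstimate {
  uncached: number; // input tokens billed per call with no caching
  billed: number;   // effective billed input tokens per call with caching
  savings: number;  // fraction saved, between 0 and 1
}

// Hypothetical estimator: only the static prefix can benefit from caching.
function estimateBilledInputTokens(
  staticTokens: number,
  dynamicTokens: number,
  cachedDiscount: number, // assumed discount on cached tokens, e.g. 0.75
  hitRate = 1,            // assumed fraction of calls that hit the cache
): InputTokenEstimate {
  const uncached = staticTokens + dynamicTokens;
  const billedStatic = staticTokens * (1 - cachedDiscount * hitRate);
  const billed = billedStatic + dynamicTokens;
  return { uncached, billed, savings: 1 - billed / uncached };
}

// Example under the stated assumptions: 3,000 static + 500 dynamic tokens.
const estimate = estimateBilledInputTokens(3000, 500, 0.75);
console.log(estimate.billed, estimate.savings.toFixed(2));
```

Under these assumptions each call is billed 1,250 input tokens instead of 3,500, about 64% less. Shorter emails or a higher hit rate push the saving toward the upper end of the range.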

Bounded concurrency limiter

A configurable semaphore replaced raw Promise.all(). Max concurrent AI calls are set just below the org's published RPM ceiling (15 on tier 1, 30 on tier 2). Exponential backoff with random jitter on retries prevents thundering-herd re-storms after a 429 batch event.
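The two mechanisms described above can be sketched as follows. This is not the platform's code: the error shape with a `status` field is an assumption about how the AI client surfaces HTTP errors, all names are hypothetical, and the limiter uses a fixed pool of worker lanes, which bounds concurrency in the same way a counting semaphore would.

```typescript
// Hypothetical error shape: assumes the client exposes the HTTP status code.
interface HttpError extends Error {
  status?: number;
}

// Full-jitter exponential backoff: random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  capMs = 30000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(rand() * ceiling);
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries only on 429. Any other error, or a 429 after the last retry,
// is rethrown so a failure is never silently swallowed.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as HttpError).status;
      if (status !== 429 || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt, baseMs));
    }
  }
}

// Runs `worker` over `items` with at most `limit` calls in flight,
// preserving input order in the result array.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (limit < 1) throw new Error("limit must be at least 1");
  const results = new Array<R>(items.length);
  let next = 0;
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so no two lanes share an index
      results[i] = await worker(items[i], i);
    }
  }
  const lanes: Promise<void>[] = [];
  for (let k = 0; k < Math.min(limit, items.length); k++) lanes.push(lane());
  await Promise.all(lanes); // a rejected call rejects the whole batch
  return results;
}
```

A call site might then look like `mapWithConcurrency(emails, 15, (e) => withRetry(() => classifyAndParse(e)))`, where `classifyAndParse` stands in for the unified Gemini call.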

Parallel date sharding

The 2-year inbox window is split into 4–6 non-overlapping date slices, each processed by a separate worker. Deduplication is enforced by a unique {messageId, userId} database index — not application code. Gmail per-user quota is respected by design at every concurrency level.
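One way to produce non-overlapping slices is to cut the window into half-open intervals that share boundaries exactly, as in the sketch below. The function and type names are hypothetical, and equal-width slices are an assumption; the source does not say how slice boundaries are chosen.

```typescript
// Half-open interval [start, end): `end` belongs to the next slice.
interface DateSlice {
  start: Date;
  end: Date;
}

// Splits [start, end) into `shards` contiguous, non-overlapping slices.
function splitDateRange(start: Date, end: Date, shards: number): DateSlice[] {
  if (shards < 1 || Math.floor(shards) !== shards) {
    throw new Error("shards must be a positive integer");
  }
  const startMs = start.getTime();
  const spanMs = end.getTime() - startMs;
  if (!(spanMs > 0)) throw new Error("end must be after start");

  // Each boundary is computed once from its index, so slice i's end
  // is exactly slice i+1's start.
  const boundary = (i: number) => startMs + Math.floor((spanMs * i) / shards);

  const slices: DateSlice[] = [];
  for (let i = 0; i < shards; i++) {
    slices.push({ start: new Date(boundary(i)), end: new Date(boundary(i + 1)) });
  }
  return slices;
}

// Example: a 2-year window cut into 4 slices, one per worker.
const slices = splitDateRange(
  new Date("2023-01-01T00:00:00Z"),
  new Date("2025-01-01T00:00:00Z"),
  4,
);
console.log(slices.map((s) => s.start.toISOString()));
```

Because an email could still be fetched twice (for example after a retried job), the database-level uniqueness described above remains the actual guarantee. In MongoDB that would typically be a compound unique index such as `createIndex({ messageId: 1, userId: 1 }, { unique: true })`.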

THE OUTCOME

A Brittle Pipeline Transformed

Cumulative optimization across all three phases produced a reliable, cost-efficient, observable system that scales safely with user growth — without ever compromising the progressive refund-discovery experience that users depend on.

-95%

reduction in sync wall-clock time

-75%

lower Gemini token spend per sync

Zero

silent refund detection failures
