What does webhook retry logic actually cover?

It covers provider redelivery when your endpoint does not acknowledge an event successfully. It does not guarantee that your downstream business processing completed correctly.

Why can working webhook code still fail in production?

Production adds duplicate delivery, timeouts, network jitter, slower commits, and partial failures between receipt and side effects. If the dedupe boundary is weak or the endpoint does too much synchronous work, retries can create duplicate outcomes and manual cleanup.

What should happen inside the webhook endpoint?

The endpoint should verify authenticity and basic schema, persist the receipt, and return a quick acknowledgment. Fulfillment, retries, and reconciliation should move to workers behind that durable receipt boundary.

How is a webhook event ID different from an idempotency key?

The event ID helps detect transport-level replays of the same delivery. The idempotency key protects the business side effect so the same action is not applied twice even if events arrive out of order or are retried.

How should internal worker retries differ from provider redelivery?

Provider retries are about delivering the event to your edge. Worker retries are your internal recovery mechanism after durable receipt, and they should use failure classes, backoff, jitter, and a dead-letter path for non-retriable cases.

Implementing Webhook Retry Logic for Payment Notifications

What must the missed-event runbook confirm before closing an incident?

It should confirm that delivery health is restored, the backlog or gap has been backfilled, and idempotency controls prevented duplicate application during catch-up. That proves system state, not just endpoint availability, was restored.

Why webhook retries fail in production even when the code looks fine

Webhook retry failures in production usually come from unclear acknowledgment boundaries and weak replay handling, not from a syntax bug in the handler. The same event can arrive again, arrive late, or arrive while downstream work is still in progress. If processing is not replay-safe, you get duplicate side effects and manual cleanup.

Teams usually learn this after launch because the test path looks clean. The signature check passes, the payload parses, the logic runs, and the response returns. Production adds provider retries, network jitter, slower commits, redirects, and partial failures between receipt and side effect. The real issue is usually contract design: what counts as received, what can be replayed safely, and how you prove what happened.

Why "working code" still breaks

At-least-once delivery means duplicates are normal. If your handler performs business side effects before a dedupe boundary, a valid retry can apply those effects again.

Signal	Meaning
Response over 10 seconds	One cited webhook example treats it as failed delivery
Endpoint work past 5 seconds	Engineers often report trouble once work stretches past this point
HTTP `300-399`	Points to a redirect from your server
HTTP `400-499`	The request reached your server but was not processed successfully
HTTP `500-599`	The request reached your server but was not processed successfully

Another common failure mode is doing too much synchronous work in the webhook endpoint. As endpoint time increases, timeout risk rises. One cited webhook example treats responses over 10 seconds as failed delivery, and engineers often report trouble once endpoint work stretches past 5 seconds. The provider can mark delivery as failed and retry while downstream systems may already have processed part of the request.

Operator signals matter here. In Stripe, start with the endpoint's Failed events, inspect individual Webhook attempts, and check the HTTP status code and response details. A 300-399 response points to a redirect from your server. A 400-499 or 500-599 response means the request reached your server but was not processed successfully.
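As a rough illustration of that triage, the status ranges above can be folded into a small helper that labels each failed attempt. The function name and the label strings are hypothetical; only the ranges come from the guidance above.

```python
def classify_delivery_status(status_code: int) -> str:
    """Give a first-pass diagnosis for the HTTP status of a webhook attempt.

    The labels are illustrative, not provider terminology.
    """
    if 200 <= status_code <= 299:
        return "acknowledged"
    if 300 <= status_code <= 399:
        # The provider was redirected by your server.
        return "redirect"
    if 400 <= status_code <= 599:
        # The request reached your server but was not processed successfully.
        return "not_processed"
    return "unknown"
```

A label like this is only a starting point for an operator; the response details on the individual attempt still decide the fix.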

What keeps failures recoverable

Recoverability depends on three things working together: a provider-aware contract, replay-safe architecture, and operator-grade diagnostics. Keep the webhook endpoint thin. Verify authenticity, persist receipt, and return quickly. Run business processing behind that boundary, where your own retries, exponential backoff with jitter, and dead letter queue can handle unresolved failures.

Use one decision rule everywhere: if receipt cannot be persisted safely, do not acknowledge success. If receipt is persisted, acknowledge and recover internally if later processing fails. That separates delivery reliability from downstream recovery and makes replay predictable instead of guesswork.

Store enough evidence to debug incidents quickly: raw event payload, provider event identifier, response outcome, and processing state. That record lets you tell the difference between an event that was never received, one that was received multiple times, and one that was processed multiple times.

What this article is and is not about

This article covers payment notification delivery and processing reliability: webhook endpoint behavior, retries, duplicates, timeout handling, replay safety, and recovery workflows. It does not cover card decline recovery or dunning strategy.

From here, the path is straightforward: verify the provider delivery contract, define the webhook boundary, design idempotency, route failures intentionally, and verify the flow before go-live. If you want a deeper dive, read How to Handle Failed Payments Across Multiple Payment Methods and Regions.

Webhook retry logic in payments and what it does not cover

In payments, webhook retry logic means the provider redelivers the same event when your endpoint does not acknowledge it successfully. That is about delivery reliability, not completion of your downstream business work.

Separate provider redelivery from your own retry path

Treat these as separate responsibilities. The provider manages delivery attempts to your endpoint. Your system manages retries for business steps after receipt is stored durably.

This boundary matters in production. If your endpoint spends more than a few seconds on business processing, timeout-driven redelivery becomes more likely. Then you can end up with partial internal work and another delivery attempt at the same time. Before you return success, make sure you have durably stored the raw payload, webhook event ID, and received timestamp.

What duplicate protection actually depends on

Retries, replays, and out-of-order delivery are normal, so duplicate protection needs explicit checkpoints. Use the webhook event ID to detect transport-level replays, and use an idempotency key to prevent duplicate business side effects.

Keep a processed-webhook log keyed by provider identifiers, and enforce database uniqueness constraints so duplicate inserts are rejected if a replay slips through. If an event ID is new but the business idempotency key was already applied, acknowledge receipt, record the replay, and skip reapplying the change.
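A minimal sketch of such a log, using SQLite from the Python standard library as a stand-in for your real database. The table name, column names, and outcome labels are assumptions made for illustration, and the sample identifiers in the usage note are invented.

```python
import sqlite3


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Create the processed-webhook log, keyed by provider identifiers."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS processed_webhooks (
               provider        TEXT NOT NULL,
               event_id        TEXT NOT NULL,
               idempotency_key TEXT NOT NULL,
               outcome         TEXT NOT NULL,
               PRIMARY KEY (provider, event_id)
           )"""
    )
    return conn


def record_delivery(conn: sqlite3.Connection, provider: str,
                    event_id: str, idempotency_key: str) -> str:
    """Return 'apply', 'replay_skipped', or 'duplicate_delivery'."""
    # Has the business effect behind this key already been applied?
    already_applied = conn.execute(
        "SELECT 1 FROM processed_webhooks "
        "WHERE idempotency_key = ? AND outcome = 'apply' LIMIT 1",
        (idempotency_key,),
    ).fetchone() is not None
    outcome = "replay_skipped" if already_applied else "apply"
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_webhooks VALUES (?, ?, ?, ?)",
                (provider, event_id, idempotency_key, outcome),
            )
    except sqlite3.IntegrityError:
        # The primary key rejected a replay of the same provider event ID.
        return "duplicate_delivery"
    return outcome
```

With this sketch, a first delivery of `evt_1` returns `apply`, a redelivery of `evt_1` returns `duplicate_delivery`, and a new `evt_2` carrying the same business key returns `replay_skipped`. The read-then-insert check on the business key is not race-free on its own, so the record that represents the business effect still needs its own uniqueness constraint.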

What a success response should mean

An HTTP 2xx response should mean "event received and recoverable from here," not "all downstream processing is complete." If your provider accepts or expects a JSON acknowledgment, treat it as receipt confirmation only.

Automatic retries help with transient delivery failures. They do not fix persistent signature, payload validation, or authentication errors. For card decline recovery and sequencing, see Smart Dunning Strategies: How to Sequence Retry Logic for Maximum Recovery.

Compare provider delivery contracts before writing application code

Build and approve a provider delivery contract matrix before you write handlers, and block launch if key fields are still unknown. Once you separate provider redelivery from your internal retry path, you need each provider's actual contract. Retry behavior is not uniform across services, and retries do not fix persistent payload or authentication errors.

Build the contract table first

Treat the contract table as the first deliverable. Include the fields that affect incident response, recovery design, and alerting, even when the current answer is "not confirmed."

Provider	Ack format	Retry pattern	Disablement behavior	Replay support	Re-enable process	Known unknowns
PayMongo	Not confirmed in current docs reviewed	Not confirmed in current docs reviewed	Not confirmed in current docs reviewed	Not confirmed in current docs reviewed	Not confirmed in current docs reviewed	Confirm exact success-response expectations, retry stop conditions, whether endpoint disablement exists, whether missed events can be replayed, and whether re-enable is manual or automatic
Stripe	Not confirmed in current docs reviewed	Not confirmed in current docs reviewed	Not confirmed in current docs reviewed	Not confirmed in current docs reviewed	Not confirmed in current docs reviewed	Confirm response contract, retry window or stop conditions, resend/replay options, and operator steps for restoring delivery
GitHub reference example	Not confirmed in sources reviewed	Delivery is considered failed if response takes more than 10 seconds; broader retry behavior not confirmed here	Not confirmed in sources reviewed	Not confirmed in sources reviewed	Not confirmed in sources reviewed	Timeout thresholds are part of the delivery contract and should be documented per provider

The point is to expose gaps early. If disablement or replay behavior is unknown, your recovery design is still incomplete.

Use PayMongo as the incident-readiness baseline

Use PayMongo as the forcing function for operational readiness, but keep claims conditional until they are verified in current docs and agreements. If your confirmed PayMongo contract allows delivery to stop and requires manual re-enable, that should shape on-call alerts, incident checklists, and post-restore backfill steps.

A practical rule is simple: if PayMongo can stop delivery without self-recovery, alert on that state directly, not only on 5xx trends. Document who can restore delivery, where to check status, and how you verify missed events.

Treat unknowns as launch blockers, not doc debt

Keep a visible "known unknowns" column and resolve it before production. At minimum, confirm for each provider in scope:

What counts as acknowledgment, whether HTTP 2xx only, a body requirement, or structured JSON
What retry behavior is documented, and what is explicitly not guaranteed
Whether endpoints can be disabled or paused, and how that is detected and reversed
Whether missed deliveries can be replayed or resubmitted, or must be reconciled from authoritative provider objects

If any of those answers are missing for PayMongo or Stripe, the integration is not production-ready.
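One way to make that gate mechanical is to hold the contract matrix as data and fail a pre-launch check while any field is unconfirmed. The sketch below assumes a hypothetical `DeliveryContract` record whose fields mirror the table columns, with `None` meaning "not confirmed"; none of it is provider API.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DeliveryContract:
    """One row of the contract matrix; None means 'not confirmed'."""
    provider: str
    ack_format: Optional[str] = None
    retry_pattern: Optional[str] = None
    disablement_behavior: Optional[str] = None
    replay_support: Optional[str] = None
    reenable_process: Optional[str] = None


def launch_blockers(contract: DeliveryContract) -> List[str]:
    """List the contract fields that are still unknown."""
    return [
        f.name
        for f in fields(contract)
        if getattr(contract, f.name) is None
    ]


def is_production_ready(contracts: List[DeliveryContract]) -> bool:
    """Ready only when every provider in scope has no unknown fields."""
    return all(not launch_blockers(c) for c in contracts)
```

Running `launch_blockers(DeliveryContract("PayMongo"))` on an empty row returns all five field names, which matches the state of the table above.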

Capture verification evidence

Once you resolve an unknown, keep the proof. Store the provider doc URL, check date, and exact excerpt or support response. During sandbox tests, keep delivery artifacts such as headers, payloads, timestamps, statuses, and error details so observed behavior can be compared with documented behavior.

For Gruv implementations, also confirm market-specific and program-specific terms in current provider docs and integration agreements, since behavior can vary by program. This pairs well with our guide on Xero Integration for Payout Platforms: How to Sync Contractor Payments with Your Accounting System.

Decide what happens in the webhook endpoint and what moves to workers

Set a strict boundary: the webhook endpoint should acknowledge receipt quickly, and workers should handle fulfillment. In practice, the endpoint should validate authenticity and basic schema, persist the receipt, return a fast acknowledgment, and hand off processing asynchronously.

This separation keeps delivery acknowledgment distinct from business execution. The provider needs confirmation that you received the event. Your system owns retries, fulfillment, and reconciliation after that point. Mixing those concerns in one path is what turns timeouts into duplicate deliveries and harder incident recovery.

Keep the endpoint narrow

A practical endpoint does only the four receipt tasks below; everything else belongs to workers:

Task	Where it belongs
Verify the request is authentic and matches expected schema	Webhook endpoint
Persist an immutable receipt record before side effects	Webhook endpoint
Enqueue processing tied to the provider event identifier	Webhook endpoint
Return success quickly after receipt is safely stored	Webhook endpoint
`ledger journal` writes	Workers
`payout batch` changes	Workers
Notifications	Workers
Dependency-heavy lookups	Workers

When endpoint work gets slow, timeout risk rises, and retry behavior is provider-specific enough that you should not depend on it to clean up design mistakes.

Keep financial mutations out of the acknowledgment path

Do not perform money-state or payout-state mutations before acknowledgment. If a timeout hits around partial processing, you can end up unsure what committed, whether a retry will come, and whether a replay will apply the same side effect again.

Workers do not remove the need for idempotency, but they make the boundary clearer: receipt first, fulfillment second. That gives operators a cleaner recovery path when duplicates or failures happen.

Make failure behavior explicit

Make the branches explicit:

if receipt persistence fails, do not return success
if receipt is stored but downstream worker processing fails, keep provider acknowledgment and recover internally

Before any 2xx, confirm you have a receipt record keyed by provider event ID with timestamp, payload fingerprint, and the headers you need for verification. Treat provider redelivery as uncertain, not guaranteed. Design the endpoint as a reliable receipt boundary, then handle retries and reconciliation inside your system.
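The two branches can be sketched as a framework-free handler that returns the HTTP status the endpoint should send. Everything here is an assumption for illustration: the signature scheme (hex HMAC-SHA256 over the raw body), the `id` field in the payload, the table layout, and the in-memory queue standing in for a real job queue. Real providers define their own signing and payload formats.

```python
import hashlib
import hmac
import json
import queue
import sqlite3
import time


def make_receipt_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the receipt table, keyed by provider event ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS receipts (
               event_id         TEXT PRIMARY KEY,
               received_at      REAL NOT NULL,
               payload_sha256   TEXT NOT NULL,
               signature_header TEXT NOT NULL,
               raw_body         BLOB NOT NULL
           )"""
    )
    return conn


def handle_webhook(conn: sqlite3.Connection, jobs: queue.Queue,
                   secret: bytes, raw_body: bytes, signature: str) -> int:
    """Verify, persist the receipt, enqueue, and return an HTTP status."""
    # Assumed scheme: hex HMAC-SHA256 over the raw body.
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401
    try:
        event_id = json.loads(raw_body)["id"]
    except (ValueError, KeyError, TypeError):
        return 400
    if not isinstance(event_id, str):
        return 400
    try:
        with conn:  # commit before acknowledging
            conn.execute(
                "INSERT INTO receipts VALUES (?, ?, ?, ?, ?)",
                (event_id, time.time(),
                 hashlib.sha256(raw_body).hexdigest(), signature, raw_body),
            )
    except sqlite3.IntegrityError:
        return 200  # receipt already stored: acknowledge the redelivery
    except sqlite3.Error:
        return 500  # receipt not persisted: do not acknowledge success
    jobs.put(event_id)  # business processing happens in workers
    return 200
```

No ledger or payout write appears in this path. If the process dies between the insert and the enqueue, the receipt still exists, so a periodic sweep over stored receipts that never reached a worker would be one way to close that gap.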

Design idempotency that survives retries and partial failures

Design idempotency at the data boundary so retries and partial failures resolve as safe no-ops, not duplicate side effects. Use two dedupe controls for two different risks, and enforce both where writes happen.

Use two keys for two different failure modes

A single webhook event ID is necessary, but not enough. Providers can retry an event even after your system processed it, and events can arrive out of order.

Use:

webhook event ID for transport dedupe: catching repeat deliveries of the same event
business idempotency key for operation dedupe: catching the same business effect across retries or paths

The event ID tells you whether this delivery is new. The business key tells you whether the protected effect already happened.

Put the stop at the persistence boundary

Do duplicate protection before side effects, at the write boundary. Code-level checks alone can race under concurrency.

Apply uniqueness or conflict checks to the record that represents the protected action, including ledger journal posting and Payment Intent status transitions. If the key already exists, load the existing record, confirm it matches the intended action, and treat the replay as already applied instead of mutating again.
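A minimal sketch of that check for a journal posting, with SQLite standing in for the real store. The table layout, the `IdempotencyMismatch` exception, and the function names are hypothetical. The point is that the database constraint, not an in-process check, decides who wins under concurrency.

```python
import sqlite3


class IdempotencyMismatch(Exception):
    """The key exists but was used for a different action."""


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ledger_journal (
               idempotency_key TEXT PRIMARY KEY,
               account         TEXT NOT NULL,
               amount_minor    INTEGER NOT NULL
           )"""
    )
    return conn


def post_journal(conn: sqlite3.Connection, idempotency_key: str,
                 account: str, amount_minor: int) -> str:
    """Post once per key; return 'applied' or 'already_applied'."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO ledger_journal VALUES (?, ?, ?)",
                (idempotency_key, account, amount_minor),
            )
        return "applied"
    except sqlite3.IntegrityError:
        # Key already exists: load the record and confirm it matches.
        row = conn.execute(
            "SELECT account, amount_minor FROM ledger_journal "
            "WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        if row != (account, amount_minor):
            raise IdempotencyMismatch(idempotency_key)
        return "already_applied"
```

Raising on a mismatch is a design choice in this sketch: a reused key with different contents is treated as a bug to investigate, not as a replay to skip silently.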

Track progress so retries can resume safely

Idempotency also needs recovery checkpoints for partial failure. Persist processing state so a retry continues from the next unfinished step instead of restarting blindly.

A practical internal state machine is:

received
validated
applied
notified

These are internal control points, not provider requirements. The goal is durable progress markers so a later retry can skip completed financial mutations and run only unfinished work.
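A small sketch of resuming from the last durable marker. The `record` dict stands in for a persisted processing row, and the step callables are hypothetical placeholders for your own validation, posting, and notification logic.

```python
from typing import Callable, Dict, List

STATES = ["received", "validated", "applied", "notified"]


def resume(record: dict, steps: Dict[str, Callable[[dict], None]]) -> List[str]:
    """Run only the steps after record['state'] and return what ran.

    If a step raises, the record stays at the last completed state,
    so the next retry starts from the same unfinished step.
    """
    ran = []
    start = STATES.index(record["state"]) + 1
    for state in STATES[start:]:
        steps[state](record)      # may raise; the marker is not advanced
        record["state"] = state   # stand-in for a durable state update
        ran.append(state)
    return ran
```

A record already at `applied` runs only the `notified` step on retry. A step can still run twice if it succeeds and the marker update then fails, so each step still needs its own idempotency protection.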

Use a clear decision rule for replays

If the transport ID is new but the business idempotency key is already applied, acknowledge and log it as a replay. Do not reapply side effects. That rule turns duplicate deliveries into operational noise instead of financial incidents, even when retries arrive later or out of order.

Build the payment event pipeline in the right order

Once idempotency is in place at the write boundary, sequencing becomes the next reliability decision. Separate event receipt from business effects so retries stay operational, not financial. A practical internal order is receive, verify, persist raw input, enqueue, process, then audit. Use that as your design rule, not as a provider standard.

Keep receipt separate from business effects

Delivery and processing fail for different reasons, and providers can retry when acknowledgment is not confirmed even if your server received the event. If acknowledgment is tied to ledger updates, fulfillment, or notifications, transport issues can trigger duplicate side effects.

Keep the intake layer narrow: authenticity checks, request validation, schema checks, and durable storage. After that, workers can normalize into your canonical payment object and apply business logic asynchronously.

A useful checkpoint is simple: every accepted event should have an immutable raw record plus internal processing state, so recovery does not depend on guesswork from logs.

Decide compliance gating explicitly

If your flow includes KYC, KYB, or AML checks, treat gate placement as an explicit architecture choice and document it. The provider acknowledgment decision and the compliance decision do different jobs, so avoid blending them by accident.

A common pattern is to acknowledge after trusted durable receipt, then run downstream checks that can move the payment or account into hold or review states when required. That keeps intake resilient when a dependency in the compliance path is slow or unavailable.

Make ownership and replay explicit

In complex flows, missed events often become an ownership problem, not just a delivery problem. Before launch, define an ownership map per event type with:

source event type
owning service
allowed state or money mutations
downstream subscribers
replay entry point

Replay points should match your internal states so recovery is controlled:

Replay point	Recovery action
`received`	Safe to re-enqueue
`validated`	Safe to rebuild canonical mapping
`applied`	Skip financial writes, continue downstream
`notified`	Nothing to replay; outbound communication already sent

This gives you a clean way to recover missed notifications without reapplying financial effects.

Set retry classes and backoff rules for internal workers

Do not use one retry policy for every failure. Classify failures first: retry transient worker failures with exponential backoff plus jitter, and route non-retriable failures to a dead letter queue or equivalent review lane.

This keeps your internal recovery logic clear when provider redeliveries are also happening. If you mix those two retry clocks, duplicate inbound deliveries can look like internal progress, and stuck jobs can hide behind fresh traffic.

Failure class	Typical signal	Internal handling	Stop or escalate rule
Transient processing or dependency failure	timeout, temporary unavailability, rate limiting	Retry with exponential backoff and jitter	Escalate if the same error repeats with no state change for the same event
Duplicate, replay, or idempotency conflict	same event reappears, idempotency conflict	Re-check state before retrying; stop if already applied	Hard stop when the same `idempotency key` is already marked applied
Non-retriable input or state failure	invalid payload or invalid state for your processor	Remove from the normal retry lane and send to `dead letter queue` for review	Escalate with payload reference and failure reason

Make retry decisions visible on the job record, not only in logs. Per attempt, persist at least webhook event ID, idempotency key, failure class, retry count, next attempt time, and last processing state. That lets operators quickly tell whether the issue is provider delivery, worker processing, or a suppressed replay.
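The three failure classes and the backoff rule can be sketched as a small decision function. The base delay, cap, and attempt limit are illustrative assumptions, not recommendations, and the jitter shown is the "full jitter" variant, where the delay is drawn uniformly between zero and the exponential ceiling.

```python
import random
from enum import Enum
from typing import Callable, Tuple


class FailureClass(Enum):
    TRANSIENT = "transient"          # timeout, unavailability, rate limit
    DUPLICATE = "duplicate"          # replay or idempotency conflict
    NON_RETRIABLE = "non_retriable"  # invalid payload or invalid state


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def next_action(failure: FailureClass, attempt: int,
                max_attempts: int = 5) -> Tuple[str, float]:
    """Return (action, delay_seconds) for a failed worker attempt."""
    if failure is FailureClass.DUPLICATE:
        return ("stop", 0.0)         # already applied: hard stop
    if failure is FailureClass.NON_RETRIABLE:
        return ("dead_letter", 0.0)  # leave the normal retry lane
    if attempt >= max_attempts:
        return ("escalate", 0.0)     # class limit reached
    return ("retry", backoff_delay(attempt))
```

The returned action and delay are what you would persist on the job record as failure class, retry count, and next attempt time.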

Keep provider redelivery and worker retries separate

Provider retry policy is a different control surface from your internal worker retry policy. You may receive repeat deliveries at the edge while an accepted copy is already processing internally.

Track and alert on these windows separately: edge delivery behavior versus internal queue or job aging. If you only monitor end-to-end completion, one failure mode can hide the other.

Stop looping on the same identifiers

Repeated failures on the same webhook event ID or idempotency key need a hard stop. If an event keeps failing without advancing state, stop auto-retrying at your class limit and escalate to incident or manual review.

Replay checks belong here too. A valid request can be resent multiple times, so validate signature timestamp metadata, for example X-Signature: t=1690000000,v1=..., and reject stale signed requests outside your allowed window. A five-minute cutoff is one example pattern, not a universal rule.
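A sketch of the freshness check for a header shaped like the example above. The header layout (`t=` and `v1=` parts), the function names, and the 300-second default are assumptions. The sketch only checks the timestamp, so it is meaningful only after the signature itself has been verified over content that includes that timestamp.

```python
import time
from typing import Dict, Optional


def parse_signature_header(header: str) -> Dict[str, str]:
    """Split 't=...,v1=...' into a dict, keeping any '=' inside values."""
    parts = {}
    for item in header.split(","):
        key, sep, value = item.strip().partition("=")
        if sep:
            parts[key] = value
    return parts


def is_fresh(header: str, tolerance_seconds: int = 300,
             now: Optional[float] = None) -> bool:
    """Reject requests whose signed timestamp is outside the window."""
    parts = parse_signature_header(header)
    try:
        signed_at = int(parts["t"])
    except (KeyError, ValueError):
        return False  # missing or malformed timestamp
    current = time.time() if now is None else now
    return abs(current - signed_at) <= tolerance_seconds
```

Passing `now` explicitly keeps the check testable; in production the default clock is used.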

Throttle bursty event classes before they stampede your stores

High-volume bursts need rate limiting and buffering so retries do not become a correlated storm. Cap concurrency by event class and let queues absorb spikes to reduce overload, duplicate pressure, lock contention, and avoidable dead letter queue floods.
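One way to cap concurrency by event class is a semaphore per class, sketched here with `asyncio`. The `ClassLimiter` name, the per-class limits, and the event class string in the usage note are hypothetical. The pending tasks waiting on the semaphore play the role of the buffer; a real deployment would usually put a durable queue in front.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict


class ClassLimiter:
    """Limit how many jobs of each event class run at the same time."""

    def __init__(self, limits: Dict[str, int], default: int = 1):
        self._limits = limits
        self._default = default
        self._semaphores: Dict[str, asyncio.Semaphore] = {}
        self.active = defaultdict(int)  # jobs currently running per class
        self.peak = defaultdict(int)    # highest concurrency seen per class

    def _semaphore(self, event_class: str) -> asyncio.Semaphore:
        # Created lazily so it is built inside the running event loop.
        if event_class not in self._semaphores:
            limit = self._limits.get(event_class, self._default)
            self._semaphores[event_class] = asyncio.Semaphore(limit)
        return self._semaphores[event_class]

    async def run(self, event_class: str,
                  job: Callable[[], Awaitable[object]]) -> object:
        async with self._semaphore(event_class):
            self.active[event_class] += 1
            self.peak[event_class] = max(
                self.peak[event_class], self.active[event_class]
            )
            try:
                return await job()
            finally:
                self.active[event_class] -= 1
```

With a limit of 2 for a class such as `payment.updated`, ten jobs submitted together through `asyncio.gather` run at most two at a time while the rest wait their turn.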


Prepare for missed events and disabled webhook states

Plan for missed events, not just late retries. Stripe's webhook guidance separates undelivered events from irrecoverable webhook events, so your runbook should have two branches: normal catch-up and explicit gap closure when normal delivery cannot recover the state.

Adyen frames the risk clearly as well: webhooks keep your system synchronized, and poor endpoint handling can leave you with missed events and stale internal state. Treat a disabled or unhealthy endpoint as an incident, not a background warning. You need a way to detect the gap, identify the missing range, and reconcile from authoritative provider objects when replay is unavailable.
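Gap detection can be sketched as a set comparison between what the provider says happened and what your system recorded. The sketch assumes you can obtain a list of identifiers for the incident window from the provider, whether through replay, an events listing, or authoritative objects; how you fetch them is provider-specific and not shown. The function and key names are hypothetical.

```python
from typing import Dict, Iterable, List


def find_gaps(provider_ids: Iterable[str],
              received_ids: Iterable[str],
              applied_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Compare provider-side identifiers with local receipt and apply state."""
    provider = set(provider_ids)
    received = set(received_ids)
    applied = set(applied_ids)
    return {
        # The provider has it, but no receipt was ever stored locally.
        "never_received": sorted(provider - received),
        # A receipt exists, but the business effect was never applied.
        "received_not_applied": sorted((provider & received) - applied),
    }
```

An incident is a candidate for closure only when both lists are empty for the affected window.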

Before you close the incident, verify three things: delivery is healthy again, the backlog or gap has been backfilled, and your idempotency controls prevented duplicate application during catch-up. That is the difference between restoring the endpoint and restoring system state.