
Webhook retry failures in production usually come from unclear acknowledgment boundaries and weak replay handling, not from a syntax bug in the handler. The same event can arrive again, arrive late, or arrive while downstream work is still in progress. If processing is not replay-safe, you get duplicate side effects and manual cleanup.
Teams usually learn this after launch because the test path looks clean. The signature check passes, the payload parses, the logic runs, and the response returns. Production adds provider retries, network jitter, slower commits, redirects, and partial failures between receipt and side effect. The real issue is usually contract design: what counts as received, what can be replayed safely, and how you prove what happened.
At-least-once delivery means duplicates are normal. If your handler performs business side effects before a dedupe boundary, a valid retry can apply those effects again.
| Signal | Meaning |
|---|---|
| Response over 10 seconds | One cited webhook example treats it as failed delivery |
| Endpoint work past 5 seconds | Engineers often report trouble once work stretches past this point |
| HTTP 300-399 | Points to a redirect from your server |
| HTTP 400-499 | The request reached your server but was not processed successfully |
| HTTP 500-599 | The request reached your server but was not processed successfully |
Another common failure mode is doing too much synchronous work in the webhook endpoint. As endpoint time increases, timeout risk rises. The GitHub reference example treats responses over 10 seconds as failed delivery, and engineers often report trouble once endpoint work stretches past 5 seconds. The provider can then mark the delivery as failed and retry, even though downstream systems may already have processed part of the request.
Operator signals matter here. In Stripe, start with the endpoint's Failed events, inspect individual Webhook attempts, and check the HTTP status code and response details. A 300-399 response points to a redirect from your server. A 400-499 or 500-599 response means the request reached your server but was not processed successfully.
Recoverability depends on three things working together: a provider-aware contract, replay-safe architecture, and operator-grade diagnostics. Keep the webhook endpoint thin. Verify authenticity, persist receipt, and return quickly. Run business processing behind that boundary, where your own retries, exponential backoff with jitter, and dead letter queue can handle unresolved failures.
Use one decision rule everywhere: if receipt cannot be persisted safely, do not acknowledge success. If receipt is persisted, acknowledge and recover internally if later processing fails. That separates delivery reliability from downstream recovery and makes replay predictable instead of guesswork.
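
The decision rule above can be sketched as a thin handler that returns only a status code. This is a minimal illustration, not any provider's contract: `SECRET`, the `receipts` table, `open_store`, and `handle_webhook` are hypothetical names, and a plain hex HMAC-SHA256 of the raw body stands in for whatever signature scheme your provider documents.

```python
import hashlib
import hmac
import json
import sqlite3

# Hypothetical shared secret, for illustration only.
SECRET = b"whsec_example"


def open_store() -> sqlite3.Connection:
    """Create an in-memory receipt store (a real system would use a durable database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE receipts ("
        " event_id TEXT PRIMARY KEY,"
        " raw_payload BLOB NOT NULL,"
        " received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
        " state TEXT NOT NULL DEFAULT 'received')"
    )
    return conn


def handle_webhook(conn: sqlite3.Connection, body: bytes, signature: str) -> int:
    """Return the HTTP status the endpoint should send back."""
    # Assumed scheme: hex HMAC-SHA256 over the raw request body.
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 400  # not authentic: never acknowledge
    try:
        event_id = json.loads(body)["id"]
    except (ValueError, KeyError, TypeError):
        return 400  # fails the basic schema check
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT OR IGNORE INTO receipts (event_id, raw_payload) VALUES (?, ?)",
                (event_id, body),
            )
    except sqlite3.Error:
        return 500  # receipt is not durable, so do not acknowledge success
    return 200  # receipt persisted: acknowledge and recover internally
```

No business side effect runs inside the handler. A redelivery of the same event hits `INSERT OR IGNORE`, leaves the original receipt untouched, and is still acknowledged.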
Store enough evidence to debug incidents quickly: raw event payload, provider event identifier, response outcome, and processing state. That record lets you tell the difference between an event that was never received, one that was received multiple times, and one that was processed multiple times.
This article covers payment notification delivery and processing reliability: webhook endpoint behavior, retries, duplicates, timeout handling, replay safety, and recovery workflows. It does not cover card decline recovery or dunning strategy.
From here, the path is straightforward: verify the provider delivery contract, define the webhook boundary, design idempotency, route failures intentionally, and verify the flow before go-live. If you want a deeper dive, read How to Handle Failed Payments Across Multiple Payment Methods and Regions.
In payments, webhook retry logic means the provider redelivers the same event when your endpoint does not acknowledge it successfully. That is about delivery reliability, not completion of your downstream business work.
Treat these as separate responsibilities. The provider manages delivery attempts to your endpoint. Your system manages retries for business steps after receipt is stored durably.
This boundary matters in production. If your endpoint spends more than a few seconds on business processing, timeout-driven redelivery becomes more likely. Then you can end up with partial internal work and another delivery attempt at the same time. Before you return success, make sure you have durably stored the raw payload, webhook event ID, and received timestamp.
Retries, replays, and out-of-order delivery are normal, so duplicate protection needs explicit checkpoints. Use the webhook event ID to detect transport-level replays, and use an idempotency key to prevent duplicate business side effects.
Keep a processed-webhook log keyed by provider identifiers, and enforce database uniqueness constraints so duplicate inserts are rejected if a replay slips through. If an event ID is new but the business idempotency key was already applied, acknowledge receipt, record the replay, and skip reapplying the change.
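
One way to picture the two checkpoints is a pair of tables with uniqueness constraints, one keyed by the event ID and one by the business idempotency key. The table names, `open_db`, and `process` are assumptions made for this sketch, and SQLite stands in for your production database.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE processed_webhooks (event_id TEXT PRIMARY KEY);
        CREATE TABLE applied_effects (
            idempotency_key TEXT PRIMARY KEY,
            event_id TEXT NOT NULL
        );
        """
    )
    return conn


def process(conn: sqlite3.Connection, event_id: str, idempotency_key: str, apply_effect) -> str:
    """Apply a business effect at most once per idempotency key."""
    with conn:  # single transaction: an exception in apply_effect rolls back both inserts
        try:
            conn.execute("INSERT INTO processed_webhooks VALUES (?)", (event_id,))
        except sqlite3.IntegrityError:
            return "duplicate_delivery"  # same event ID seen before
        try:
            conn.execute(
                "INSERT INTO applied_effects VALUES (?, ?)",
                (idempotency_key, event_id),
            )
        except sqlite3.IntegrityError:
            # New event ID, but the effect was already applied:
            # the event is recorded as seen and the mutation is skipped.
            return "replay_skipped"
        apply_effect()
        return "applied"
```

Because the database rejects the duplicate insert, two workers racing on the same key cannot both reach `apply_effect`, which an in-code "have I seen this?" check alone would not guarantee.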
An HTTP 2xx response should mean "event received and recoverable from here," not "all downstream processing is complete." If your provider accepts or expects a JSON acknowledgment, treat it as receipt confirmation only.
Automatic retries help with transient delivery failures. They do not fix persistent signature, payload validation, or authentication errors. For card decline recovery and sequencing, see Smart Dunning Strategies: How to Sequence Retry Logic for Maximum Recovery.
Build and approve a provider delivery contract matrix before you write handlers, and block launch if key fields are still unknown. Once you separate provider redelivery from your internal retry path, you need each provider's actual contract. Retry behavior is not uniform across services, and retries do not fix persistent payload or authentication errors.
Treat the contract table as the first deliverable. Include the fields that affect incident response, recovery design, and alerting, even when the current answer is "not confirmed."
| Provider | Ack format | Retry pattern | Disablement behavior | Replay support | Re-enable process | Known unknowns |
|---|---|---|---|---|---|---|
| PayMongo | Not confirmed in current docs reviewed | Not confirmed in current docs reviewed | Not confirmed in current docs reviewed | Not confirmed in current docs reviewed | Not confirmed in current docs reviewed | Confirm exact success-response expectations, retry stop conditions, whether endpoint disablement exists, whether missed events can be replayed, and whether re-enable is manual or automatic |
| Stripe | Not confirmed in current docs reviewed | Not confirmed in current docs reviewed | Not confirmed in current docs reviewed | Not confirmed in current docs reviewed | Not confirmed in current docs reviewed | Confirm response contract, retry window or stop conditions, resend/replay options, and operator steps for restoring delivery |
| GitHub reference example | Not confirmed in sources reviewed | Delivery is considered failed if response takes more than 10 seconds; broader retry behavior not confirmed here | Not confirmed in sources reviewed | Not confirmed in sources reviewed | Not confirmed in sources reviewed | Timeout thresholds are part of the delivery contract and should be documented per provider |
The point is to expose gaps early. If disablement or replay behavior is unknown, your recovery design is still incomplete.
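
If the matrix lives in code or config, the launch gate can be checked mechanically. The sketch below is only an illustration: `DeliveryContract`, `launch_blockers`, and the field names are hypothetical and simply mirror the table columns.

```python
from dataclasses import dataclass, fields

UNKNOWN = "not confirmed"


@dataclass
class DeliveryContract:
    provider: str
    ack_format: str = UNKNOWN
    retry_pattern: str = UNKNOWN
    disablement_behavior: str = UNKNOWN
    replay_support: str = UNKNOWN
    re_enable_process: str = UNKNOWN


def launch_blockers(contract: DeliveryContract) -> list:
    """Return the contract fields that are still marked as not confirmed."""
    return [
        f.name
        for f in fields(contract)
        if getattr(contract, f.name).strip().lower().startswith(UNKNOWN)
    ]


# Example with a made-up provider: only the ack format has been verified.
contract = DeliveryContract(provider="ExampleProvider", ack_format="2xx only")
print(launch_blockers(contract))
```

A non-empty result means the recovery design still has open questions for that provider.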
Use PayMongo as the forcing function for operational readiness, but keep claims conditional until they are verified in current docs and agreements. If your confirmed PayMongo contract allows delivery to stop and requires manual re-enable, that should shape on-call alerts, incident checklists, and post-restore backfill steps.
A practical rule is simple: if PayMongo can stop delivery without self-recovery, alert on that state directly, not only on 5xx trends. Document who can restore delivery, where to check status, and how you verify missed events.
Keep a visible "known unknowns" column and resolve it before production. At minimum, confirm for each provider in scope:
- Acknowledgment format: 2xx only, a body requirement, or structured JSON

If any of those answers are missing for PayMongo or Stripe, the integration is not production-ready.
Once you resolve an unknown, keep the proof. Store the provider doc URL, check date, and exact excerpt or support response. During sandbox tests, keep delivery artifacts such as headers, payloads, timestamps, statuses, and error details so observed behavior can be compared with documented behavior.
For Gruv implementations, also confirm market-specific and program-specific terms in current provider docs and integration agreements, since behavior can vary by program. This pairs well with our guide on Xero Integration for Payout Platforms: How to Sync Contractor Payments with Your Accounting System.
Set a strict boundary: the webhook endpoint should acknowledge receipt quickly, and workers should handle fulfillment. In practice, the endpoint should validate authenticity and basic schema, persist the receipt, return a fast acknowledgment, and hand off processing asynchronously.
This separation keeps delivery acknowledgment distinct from business execution. The provider needs confirmation that you received the event. Your system owns retries, fulfillment, and reconciliation after that point. Mixing those concerns in one path is what turns timeouts into duplicate deliveries and harder incident recovery.
A practical endpoint should do only this:
| Task | Where it belongs |
|---|---|
| Verify the request is authentic and matches expected schema | Webhook endpoint |
| Persist an immutable receipt record before side effects | Webhook endpoint |
| Enqueue processing tied to the provider event identifier | Webhook endpoint |
| Return success quickly after receipt is safely stored | Webhook endpoint |
| Ledger journal writes | Workers |
| Payout batch changes | Workers |
| Notifications | Workers |
| Dependency-heavy lookups | Workers |
When endpoint work gets slow, timeout risk rises, and retry behavior is provider-specific enough that you should not depend on it to clean up design mistakes.
Do not perform money-state or payout-state mutations before acknowledgment. If a timeout hits around partial processing, you can end up unsure what committed, whether a retry will come, and whether a replay will apply the same side effect again.
Workers do not remove the need for idempotency, but they make the boundary clearer: receipt first, fulfillment second. That gives operators a cleaner recovery path when duplicates or failures happen.
Make the branches explicit:
Before any 2xx, confirm you have a receipt record keyed by provider event ID with timestamp, payload fingerprint, and the headers you need for verification. Treat provider redelivery as uncertain, not guaranteed. Design the endpoint as a reliable receipt boundary, then handle retries and reconciliation inside your system.
Design idempotency at the data boundary so retries and partial failures resolve as safe no-ops, not duplicate side effects. Use two dedupe controls for two different risks, and enforce both where writes happen.
A single webhook event ID is necessary, but not enough. Providers can retry an event even after your system processed it, and events can arrive out of order.
Use:

- A webhook event ID for transport dedupe: the same delivery of the same event
- An idempotency key for operation dedupe: the same business effect across retries or paths

The event ID tells you whether this delivery is new. The business key tells you whether the protected effect already happened.
Do duplicate protection before side effects, at the write boundary. Code-level checks alone can race under concurrency.
Apply uniqueness or conflict checks to the record that represents the protected action, including ledger journal posting and Payment Intent status transitions. If the key already exists, load the existing record, confirm it matches the intended action, and treat the replay as already applied instead of mutating again.
Idempotency also needs recovery checkpoints for partial failure. Persist processing state so a retry continues from the next unfinished step instead of restarting blindly.
A practical internal state machine is:

1. received
2. validated
3. applied
4. notified

These are internal control points, not provider requirements. The goal is durable progress markers so a later retry can skip completed financial mutations and run only unfinished work.
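
The four checkpoints can be sketched as a resume loop that runs only the steps after the last recorded state. `STEPS`, `resume`, and the `handlers` mapping are hypothetical names, and the job is a plain dict here. A real worker would commit the state marker durably after each step.

```python
STEPS = ["received", "validated", "applied", "notified"]


def resume(job: dict, handlers: dict) -> dict:
    """Run the steps that come after job['state'], recording progress after each one."""
    start = STEPS.index(job["state"]) + 1
    for step in STEPS[start:]:
        handlers[step](job)  # may raise; the state then stays at the last completed step
        job["state"] = step  # assumption: a real system persists this marker durably
    return job


# A job that already reached "applied" only has the notification step left.
log = []
handlers = {s: (lambda job, s=s: log.append(s)) for s in STEPS[1:]}
resume({"event_id": "evt_1", "state": "applied"}, handlers)
print(log)  # ['notified']
```

If the `applied` handler raises, the job stays at `validated`, so the next retry starts at the financial write instead of repeating validation or skipping ahead.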
If the transport ID is new but the business idempotency key is already applied, acknowledge and log it as a replay. Do not reapply side effects. That rule turns duplicate deliveries into operational noise instead of financial incidents, even when retries arrive later or out of order.
Once idempotency is in place at the write boundary, sequencing becomes the next reliability decision. Separate event receipt from business effects so retries stay operational, not financial. A practical internal order is receive, verify, persist raw input, enqueue, process, then audit. Use that as your design rule, not as a provider standard.
Delivery and processing fail for different reasons, and providers can retry when acknowledgment is not confirmed even if your server received the event. If acknowledgment is tied to ledger updates, fulfillment, or notifications, transport issues can trigger duplicate side effects.
Keep the intake layer narrow: authenticity checks, request validation, schema checks, and durable storage. After that, workers can normalize into your canonical payment object and apply business logic asynchronously.
A useful checkpoint is simple: every accepted event should have an immutable raw record plus internal processing state, so recovery does not depend on guesswork from logs.
If your flow includes KYC, KYB, or AML checks, treat gate placement as an explicit architecture choice and document it. The provider acknowledgment decision and the compliance decision do different jobs, so avoid blending them by accident.
A common pattern is to acknowledge after trusted durable receipt, then run downstream checks that can move the payment or account into hold or review states when required. That keeps intake resilient when a dependency in the compliance path is slow or unavailable.
In complex flows, missed events often become an ownership problem, not just a delivery problem. Before launch, define an ownership map per event type with:
Replay points should match your internal states so recovery is controlled:
| Replay point | Recovery action |
|---|---|
| received | Safe to re-enqueue |
| validated | Safe to rebuild canonical mapping |
| applied | Skip financial writes, continue downstream |
| notified | Outbound communication already sent |
This gives you a clean way to recover missed notifications without reapplying financial effects.
Do not use one retry policy for every failure. Classify failures first: retry transient worker failures with exponential backoff plus jitter, and route non-retriable failures to a dead letter queue or equivalent review lane.
This keeps your internal recovery logic clear when provider redeliveries are also happening. If you mix those two retry clocks, duplicate inbound deliveries can look like internal progress, and stuck jobs can hide behind fresh traffic.
| Failure class | Typical signal | Internal handling | Stop or escalate rule |
|---|---|---|---|
| Transient processing or dependency failure | timeout, temporary unavailability, rate limiting | Retry with exponential backoff and jitter | Escalate if the same error repeats with no state change for the same event |
| Duplicate, replay, or idempotency conflict | same event reappears, idempotency conflict | Re-check state before retrying; stop if already applied | Hard stop when the same idempotency key is already marked applied |
| Non-retriable input or state failure | invalid payload or invalid state for your processor | Remove from the normal retry lane and send to dead letter queue for review | Escalate with payload reference and failure reason |
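
The routing in the table can be sketched as a small decision function plus a backoff helper. The failure-class labels, `next_action`, `backoff_delay`, and the default limits are assumptions for illustration. The jitter variant shown is "full jitter", where the delay is drawn uniformly between zero and the capped exponential value.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def next_action(failure_class: str, attempt: int, max_attempts: int = 5):
    """Return (action, delay_seconds) for a failed job attempt."""
    if failure_class == "duplicate":
        return ("stop", None)  # already applied: hard stop, nothing to retry
    if failure_class == "non_retriable":
        return ("dead_letter", None)  # invalid payload or state: route to review
    if attempt >= max_attempts:
        return ("escalate", None)  # class limit reached without success
    return ("retry", backoff_delay(attempt))  # transient: retry later
```

Storing the returned action and delay on the job record, rather than only logging them, keeps the retry decision visible to operators.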
Make retry decisions visible on the job record, not only in logs. Per attempt, persist at least webhook event ID, idempotency key, failure class, retry count, next attempt time, and last processing state. That lets operators quickly tell whether the issue is provider delivery, worker processing, or a suppressed replay.
Provider retry policy is a different control surface from your internal worker retry policy. You may receive repeat deliveries at the edge while an accepted copy is already processing internally.
Track and alert on these windows separately: edge delivery behavior versus internal queue or job aging. If you only monitor end-to-end completion, one failure mode can hide the other.
Repeated failures on the same webhook event ID or idempotency key need a hard stop. If an event keeps failing without advancing state, stop auto-retrying at your class limit and escalate to incident or manual review.
Replay checks belong here too. A valid request can be resent multiple times, so validate signature timestamp metadata, for example X-Signature: t=1690000000,v1=..., and reject stale signed requests outside your allowed window. A five-minute cutoff is one example pattern, not a universal rule.
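
A staleness check on a header shaped like the example above might look like the following. The signed-content layout (`<timestamp>.<body>` with HMAC-SHA256) and the function name `verify_signed_request` are assumptions for this sketch. Use the exact scheme your provider documents, and treat the 300-second tolerance as the example value it is.

```python
import hashlib
import hmac
import time


def verify_signed_request(header: str, body: bytes, secret: bytes,
                          tolerance_s: int = 300, now: float = None) -> bool:
    """Check signature and freshness for a header like 't=1690000000,v1=<hex>'."""
    parts = dict(p.strip().split("=", 1) for p in header.split(",") if "=" in p)
    try:
        ts_raw = parts["t"]
        ts = int(ts_raw)
        given = parts["v1"]
    except (KeyError, ValueError):
        return False  # malformed header
    now = time.time() if now is None else now
    if abs(now - ts) > tolerance_s:
        return False  # stale (or future-dated) signed request
    # Assumed signed content: "<timestamp>.<raw body>"
    signed = ts_raw.encode() + b"." + body
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), given.encode())
```

Because the timestamp is part of the signed content, an attacker cannot refresh an old request by editing `t` without invalidating the signature.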
High-volume bursts need rate limiting and buffering so retries do not become a correlated storm. Cap concurrency by event class and let queues absorb spikes to reduce overload, duplicate pressure, lock contention, and avoidable dead letter queue floods.
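
As a rough sketch of capping concurrency by event class, the example below uses one `asyncio.Semaphore` per class. `run_with_caps`, the `caps` mapping, and the event-class strings are hypothetical. Tasks waiting on a semaphore act as an in-memory buffer here, whereas a production system would more likely buffer in a durable queue.

```python
import asyncio


async def run_with_caps(events, handler, caps: dict, default_cap: int = 1):
    """Process (event_class, payload) pairs with a per-class concurrency limit."""
    sems = {}

    async def guarded(event_class, payload):
        if event_class not in sems:
            sems[event_class] = asyncio.Semaphore(caps.get(event_class, default_cap))
        async with sems[event_class]:  # wait here when the class is at its cap
            return await handler(event_class, payload)

    # gather returns results in the same order as the input events
    return await asyncio.gather(*(guarded(c, p) for c, p in events))
```

A burst of one event class then cannot consume every worker slot, which limits lock contention on shared records and keeps other classes moving.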
Plan for missed events, not just late retries. Stripe's webhook guidance separates undelivered events from irrecoverable webhook events, so your runbook should have two branches: normal catch-up and explicit gap closure when normal delivery cannot recover the state.
Adyen frames the risk clearly as well: webhooks keep your system synchronized, and poor endpoint handling can leave you with missed events and stale internal state. Treat a disabled or unhealthy endpoint as an incident, not a background warning. You need a way to detect the gap, identify the missing range, and reconcile from authoritative provider objects when replay is unavailable.
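
Gap detection can be as simple as comparing the provider's authoritative list of event IDs for the incident window with your stored receipts. In this sketch, `plan_gap_closure` and both action labels are hypothetical, and how you obtain the provider-side list depends on the provider's API.

```python
def plan_gap_closure(provider_event_ids, received_event_ids, replay_available: bool):
    """Return (event_id, action) pairs for events with no local receipt, in provider order."""
    received = set(received_event_ids)
    missing = [eid for eid in provider_event_ids if eid not in received]
    action = "request_replay" if replay_available else "reconcile_from_provider_objects"
    return [(eid, action) for eid in missing]


# Example: two events were never received and the provider offers no replay.
plan = plan_gap_closure(
    ["evt_1", "evt_2", "evt_3", "evt_4"],
    ["evt_1", "evt_3"],
    replay_available=False,
)
print(plan)
# [('evt_2', 'reconcile_from_provider_objects'), ('evt_4', 'reconcile_from_provider_objects')]
```

Whichever branch applies, the recovered events should pass through the same idempotency checks as live traffic so catch-up cannot apply an effect twice.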
Before you close the incident, verify three things: delivery is healthy again, the backlog or gap has been backfilled, and your idempotency controls prevented duplicate application during catch-up. That is the difference between restoring the endpoint and restoring system state.
Webhook retry logic covers provider redelivery when your endpoint does not acknowledge an event successfully. It does not guarantee that your downstream business processing completed correctly.
Production adds duplicate delivery, timeouts, network jitter, slower commits, and partial failures between receipt and side effects. If the dedupe boundary is weak or the endpoint does too much synchronous work, retries can create duplicate outcomes and manual cleanup.
The endpoint should verify authenticity and basic schema, persist the receipt, and return a quick acknowledgment. Fulfillment, retries, and reconciliation should move to workers behind that durable receipt boundary.
The event ID helps detect transport-level replays of the same delivery. The idempotency key protects the business side effect so the same action is not applied twice even if events arrive out of order or are retried.
Provider retries are about delivering the event to your edge. Worker retries are your internal recovery mechanism after durable receipt, and they should use failure classes, backoff, jitter, and a dead-letter path for non-retriable cases.
A post-incident check should confirm that delivery health is restored, the backlog or gap has been backfilled, and idempotency controls prevented duplicate application during catch-up. That proves system state, not just endpoint availability, was restored.
A former product manager at a major fintech company, Samuel has deep expertise in the global payments landscape. He analyzes financial tools and strategies to help freelancers maximize their earnings and minimize fees.