Why retail infrastructure fails under load

E-commerce platforms are typically built and tested under conditions that look nothing like a peak event. A site that handles 500 concurrent users without breaking a sweat can fall over completely when 8,000 arrive at once. The failure modes are predictable – and they're almost always the result of something that was known but not addressed before the event.

The most common causes: the database wasn't designed to handle concurrent writes at volume, auto-scaling was configured but never tested so the rules fired too slowly, a third-party script timed out and blocked page rendering for every user, or checkout broke because the payment integration wasn't tested under realistic load. None of these are inevitable. They're the result of not doing the preparation work.

The commercial cost of a peak-day outage isn't just the revenue lost in the window the site is down. It's the abandoned carts that never recover, the paid media spend that drove traffic to a broken experience, and the reputational damage that takes months to repair.

Load testing: what it is and how to do it properly

Load testing means simulating realistic concurrent user traffic against your platform before a peak event, to identify where it breaks and at what volume.

Tools commonly used for this include k6, Apache JMeter, Gatling and Locust. Each has different strengths – k6 is particularly well suited to developer-led testing with scriptable scenarios, while JMeter offers broad protocol support and a large plugin ecosystem. The choice of tool matters less than using one at all.

The critical detail that most teams get wrong: test the whole transaction flow, not just the homepage. Simulating 10,000 concurrent users hitting your homepage tells you almost nothing useful. What matters is whether your platform holds up when those users are browsing product pages, adding items to cart, reaching checkout and completing payment – all at the same time. That's the flow that generates revenue, and it's the flow that exposes the real bottlenecks.

Set a target based on realistic projections. If your best Black Friday drove 3,000 concurrent sessions and you're expecting growth this year, test to 5,000 or 6,000. Find the point at which things degrade, then fix it and test again. The goal is to discover the ceiling before your customers do.
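As a sketch of the principle, the snippet below drives simulated users through the full browse-to-payment flow concurrently, using only the standard library. The `request` function is a stand-in – in a real test it would be an HTTP call against your staging environment, or the same flow expressed as a k6 or Locust scenario. The paths are hypothetical.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real HTTP call against a staging environment.
# Here it just sleeps briefly to simulate network latency.
def request(path):
    time.sleep(random.uniform(0.001, 0.005))
    return 200

# One simulated shopper walking the full revenue path, not just the homepage.
def transaction_flow(user_id):
    steps = ["/product/123", "/cart/add", "/checkout", "/payment"]
    latencies = {}
    for path in steps:
        start = time.perf_counter()
        status = request(path)
        latencies[path] = time.perf_counter() - start
        if status != 200:  # abandon the flow on the first failure
            return {"user": user_id, "completed": False, "latencies": latencies}
    return {"user": user_id, "completed": True, "latencies": latencies}

def run_load_test(concurrent_users):
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(transaction_flow, range(concurrent_users)))
    completed = sum(r["completed"] for r in results)
    return completed, len(results)

completed, total = run_load_test(50)
print(f"{completed}/{total} flows completed")
```

The useful output of a real version of this is the completion rate and the per-step latency distribution as you ramp the user count – that's where the checkout bottleneck shows itself long before the homepage slows down.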

Scaling for peak traffic

The major cloud providers – AWS, GCP and Azure – all support auto-scaling: the ability to spin up additional compute capacity automatically in response to demand. It's a powerful capability, but it requires configuration and testing to work reliably. Auto-scaling rules that fire after a 90-second delay are not useful when your traffic spike arrives in under a minute.

The configuration of these rules – what triggers them, how quickly they respond, how many instances they provision – needs to be tested before the event, not during it. Run a load test that simulates the expected traffic ramp and confirm that your scaling behaviour matches expectations. If it doesn't, adjust and retest.
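The logic behind a scaling rule is simple enough to reason about on paper. The toy function below mirrors the shape of a CPU-based scaling policy – all the thresholds, step sizes and limits are illustrative, and in practice they live in your cloud provider's configuration rather than your own code.

```python
def desired_instances(current, cpu_samples, scale_out_at=70, scale_in_at=30,
                      step=2, max_instances=20, min_instances=2):
    """Toy scaling rule: average CPU over the evaluation window decides
    whether to add or remove capacity. All thresholds are illustrative."""
    avg = sum(cpu_samples) / len(cpu_samples)
    if avg >= scale_out_at:
        # Scale out aggressively (in steps), capped at a hard maximum.
        return min(current + step, max_instances)
    if avg <= scale_in_at:
        # Scale in conservatively (one at a time), never below the floor.
        return max(current - 1, min_instances)
    return current  # within the comfortable band: hold steady

# Sustained high CPU across the window triggers a scale-out:
print(desired_instances(4, [85, 90, 80]))
```

The parameters worth interrogating during your load test are the evaluation window (how many samples before the rule fires) and the step size – a rule that adds one instance at a time will not keep up with a spike that doubles your traffic in a minute.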

For merchants on managed platforms, this is largely handled for you. Shopify Plus and BigCommerce absorb traffic spikes at the platform level – you're sharing infrastructure with thousands of other merchants and the platform is sized accordingly. Custom platforms on VPS or dedicated servers don't have this luxury. If your infrastructure is fixed-capacity, you need to provision for your peak, not your average.

CDN and edge caching strategy

A CDN (Content Delivery Network) sits between your users and your origin server, caching copies of your content at edge locations distributed geographically. When a user requests a page or asset, the CDN serves it from the nearest edge location rather than your origin server, reducing latency and, critically, reducing the load on your infrastructure.

Providers like Cloudflare, Fastly and AWS CloudFront are widely used for this. The immediate win is static assets – images, CSS, JavaScript – which should always be served from the CDN rather than your origin. But the bigger gain for peak events comes from caching page HTML at the edge for pages that don't require personalisation: category pages, product pages, blog content.

The principle is simple: cache everything cacheable. Every request that the CDN can serve without touching your origin is a request that doesn't add to database load or application server load during your peak window. Even short cache TTLs – 60 seconds on a product page – can dramatically reduce origin hits when you're taking thousands of requests per minute.

Be intentional about what you exclude from caching: basket pages, account pages and checkout must pass through to your origin to maintain session state. Everything else is a candidate for edge caching.
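In CDN terms, that exclusion list is usually expressed as cache rules keyed on the URL path. A minimal sketch of the decision, returning a `Cache-Control` header value – the path prefixes are illustrative and should match your own URL structure:

```python
# Illustrative prefixes -- adjust to your platform's actual URL structure.
CACHE_BYPASS_PREFIXES = ("/cart", "/basket", "/account", "/checkout")

def cache_decision(path, ttl_seconds=60):
    """Return a Cache-Control value for a request path: session-dependent
    pages bypass the cache entirely, everything else gets a short TTL."""
    if path.startswith(CACHE_BYPASS_PREFIXES):
        return "private, no-store"
    return f"public, max-age={ttl_seconds}"

print(cache_decision("/product/blue-widget"))  # cacheable at the edge
print(cache_decision("/checkout"))             # always passes to origin
```

Most CDNs let you express exactly this as page rules or edge configuration; the point is that the bypass list should be short, explicit and reviewed before the event.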

Database performance under load

The database is where most e-commerce outages under load actually originate. Application servers are comparatively easy to scale horizontally. Databases are harder.

The common failure patterns: too many concurrent connections exhausting the connection pool, read queries running against the primary database when they should use read replicas, missing indexes on queries that run fast at low volume and catastrophically slow at high volume, and N+1 query problems that don't show up in routine testing but compound badly under concurrent load.
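The N+1 pattern is the most insidious of these because it's invisible at low volume. A self-contained demonstration using SQLite (the schema is a hypothetical simplification of an orders table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 'ann'), (2, 'ben'), (3, 'cam');
    INSERT INTO order_items VALUES (1, 1, 'SKU-X'), (2, 1, 'SKU-Y'), (3, 2, 'SKU-Z');
""")

# The N+1 pattern: one query for the orders, then one more query per order.
# Harmless at three orders; at thousands of concurrent sessions the extra
# round trips compound into serious database load.
def items_n_plus_one():
    result = {}
    for (order_id,) in conn.execute("SELECT id FROM orders"):
        result[order_id] = [sku for (sku,) in conn.execute(
            "SELECT sku FROM order_items WHERE order_id = ?", (order_id,))]
    return result

# The fix: a single JOIN returns the same data in one round trip.
def items_joined():
    result = {}
    rows = conn.execute(
        "SELECT o.id, i.sku FROM orders o "
        "LEFT JOIN order_items i ON i.order_id = o.id")
    for order_id, sku in rows:
        result.setdefault(order_id, [])
        if sku is not None:
            result[order_id].append(sku)
    return result
```

Both functions return identical data; the first issues 1+N queries, the second issues one. ORMs generate the first shape by default unless you ask for eager loading, which is why it doesn't show up in routine testing.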

Database performance testing should happen alongside application load testing, not as an afterthought. Watch query execution times, connection pool utilisation and CPU load on the database server during your load tests. If the database CPU is hitting 80% under simulated peak load, you have a problem to solve before the real event.

Read replicas are particularly valuable for read-heavy e-commerce workloads. Directing product page and search queries to replicas keeps the primary database free for writes – the transactions, stock updates and order records that actually matter.
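The routing itself is usually a thin layer in the application or a proxy in front of the database. A minimal sketch of the idea – the connection objects here are string stand-ins, and a real implementation would hold actual database connections:

```python
import itertools

class QueryRouter:
    """Sketch of read/write splitting: writes go to the primary,
    reads round-robin across replicas. Connections are stand-ins."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def connection_for(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":       # read-only: safe to send to a replica
            return next(self._replicas)
        return self.primary        # INSERT/UPDATE/DELETE/DDL: primary only

router = QueryRouter("primary", ["replica-1", "replica-2"])
print(router.connection_for("SELECT * FROM products WHERE category = 'shoes'"))
print(router.connection_for("UPDATE stock SET qty = qty - 1 WHERE sku = 'X'"))
```

One caveat this sketch glosses over: replication lag. A read issued immediately after a write (showing the user their own new order, say) may not yet be visible on the replica, so reads-after-writes are usually pinned to the primary for a short window.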

Payment gateway resilience

Stripe, Adyen and Braintree all run robust infrastructure. Payment gateway outages are rare. Integration errors are not.

The failure modes in payment processing under load are almost always in the integration layer: timeout handling that isn't configured correctly, webhook endpoints that can't keep pace with the volume of events, retry logic that isn't implemented or implemented incorrectly, or error states that aren't caught and surfaced to the user in a useful way.

Test the checkout flow specifically under load – not just the happy path, but the failure cases. What happens when a payment request times out? Does the user see a clear error, or does the page hang? Is the order created before the payment is confirmed, leaving you with fulfilment problems? Does your retry logic create duplicate charges?
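The standard defence against duplicate charges is an idempotency key – one key per logical payment, reused across retries, which gateways such as Stripe and Adyen support. The sketch below uses a fake in-memory gateway to show the mechanism; the class and its methods are stand-ins, not any real gateway's API:

```python
import uuid

class FakeGateway:
    """In-memory stand-in for a payment API that honours idempotency keys:
    a retried request with the same key replays the original charge
    instead of creating a second one."""
    def __init__(self):
        self.charges = {}

    def charge(self, amount, idempotency_key):
        if idempotency_key in self.charges:
            return self.charges[idempotency_key]   # replay, not a new charge
        charge = {"id": f"ch_{len(self.charges) + 1}", "amount": amount}
        self.charges[idempotency_key] = charge
        return charge

def pay_with_retry(gateway, amount, attempts=3):
    key = str(uuid.uuid4())  # ONE key for the whole logical payment
    last_error = None
    for _ in range(attempts):
        try:
            return gateway.charge(amount, idempotency_key=key)
        except TimeoutError as exc:
            # A timeout does NOT mean the charge failed -- the gateway may
            # have processed it. Retrying with the same key is safe.
            last_error = exc
    raise last_error
```

The critical detail is generating the key once, outside the retry loop. Generate a fresh key per attempt and you've reinvented the duplicate-charge bug the key was meant to prevent.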

These edge cases are manageable, but only if you've found them in testing rather than in production during your peak hour.

Monitoring and incident response during peak events

Real-time visibility during a peak event isn't optional. You need dashboards that show response times, error rates and server resource usage – and you need people watching them.

Tools like Datadog, New Relic and Grafana all provide this capability. The specific tooling matters less than having it configured, tested and understood before the event. A monitoring dashboard that nobody knows how to read is not useful at 9pm on Black Friday.

Prepare a runbook in advance. A runbook is a documented set of responses to specific failure conditions: what does the on-call engineer do if database CPU exceeds 90%? What's the response if the checkout error rate climbs above 1%? What's the escalation path if the primary database becomes unavailable?
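A runbook is usually a document, but its thresholds can also be encoded directly into your monitoring so that each alert carries its documented response. A small sketch of that idea – the conditions and actions are illustrative examples, not recommendations:

```python
# Hypothetical runbook entries: each maps an observable condition to a
# documented response, so the on-call engineer follows a script under
# pressure rather than improvising.
RUNBOOK = [
    {"condition": lambda m: m["db_cpu"] > 90,
     "action": "Database CPU over 90%: shed load via static fallback pages; page the DBA."},
    {"condition": lambda m: m["checkout_error_rate"] > 0.01,
     "action": "Checkout errors over 1%: check the payment gateway status page; escalate to the payments owner."},
    {"condition": lambda m: not m["primary_db_up"],
     "action": "Primary database down: begin the documented replica failover; escalate to the engineering lead."},
]

def triggered_actions(metrics):
    """Return the documented responses for every condition currently met."""
    return [entry["action"] for entry in RUNBOOK if entry["condition"](metrics)]
```

Whether it lives in code or in a shared document, the test is the same: could someone who didn't write it follow it at 9pm on Black Friday?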

These questions have answers. Write them down before the event so that the person on call doesn't have to improvise under pressure.

Also consider your third-party dependencies during peak windows. Marketing pixels, live chat widgets, product recommendation engines and review platforms each add external requests to every page load. Any of them can become a single point of failure if their infrastructure has a problem. Consider disabling non-essential third-party scripts on peak days – a slow chat widget that degrades page performance is costing you more than the widget is worth during your most valuable trading hours.

Route B helps e-commerce businesses prepare their infrastructure for peak events. Get in touch to discuss your platform.


The post-peak review

What happens after a peak event is as important as the preparation before it. Whether the event went smoothly or not, there's information to capture.

Pull the actual traffic data and compare it against your load test scenarios. Did you hit the volumes you planned for? Where did your infrastructure show signs of strain even if it didn't break? What were the response time curves as load increased? Which components had headroom to spare and which were close to their limits?

If there were incidents – errors, slowdowns, anything that degraded the user experience – document them properly. What was the root cause? When was it first visible in your monitoring? How long did it take to identify and resolve? What would have caught it earlier?

The output of this review feeds directly into preparation for the next peak event. Infrastructure decisions made in response to real load data are far more valuable than decisions made against assumptions. Each peak event should leave your platform better prepared for the next one.