When OpenAI goes down at 3am, your users find out from Twitter

It is 3:14 AM in San Francisco. A weight push at OpenAI has gone sideways, completions are returning a stream of stop tokens, and the /v1/chat/completions endpoint is still cheerfully responding with HTTP 200. Your monitoring sees green. Your customers are about to see something very different.

By the time you wake up, a thread on Hacker News titled something like "Is ChatGPT broken for anyone else?" is on the front page. Your support inbox has been filling for hours, and the messages all start the same way: "I just tried to use [your product] and it's giving me nonsense — are you down?"

You are not down. The model is.

This is the gap that traditional monitoring leaves wide open, and it is the gap that an AI status page is built to close. We want to walk through what the cost of a provider degradation actually looks like, why a generic uptime tool will not help you, and what the math looks like when you give your users a public surface to check before they file a ticket.

Why a traditional status page misses an AI outage

Status pages were designed for an era when "is the service working?" had a binary answer. A server returned 200 or it didn't. A database accepted connections or it didn't. That world made sense for fifteen years.

LLM-powered apps broke the binary. Now your stack can be operational by every classical metric — endpoints responding, latency normal, error rates flat — while the actual user experience is silently broken. The provider quietly swapped the model weights behind a stable name. A safety filter started rejecting prompts it accepted yesterday. The 99th percentile latency on a long-context request slid from 8 seconds to 45 seconds. None of this shows up on a Statuspage-style green pill.

We compared this in more detail on our Apselog vs Statuspage page, but the short version is: a generic uptime tool monitors the wrong layer. It tells your customers your API is healthy at the exact moment your product is failing them.

What an OpenAI outage costs you, hour by hour

Pull up the public OpenAI status history and you can count multiple high-profile incidents in the last two years. The November 2023 outage that took both ChatGPT and the API down is the one most people remember, but there have been quieter ones too: latency spikes that stretched into multi-hour windows, and model-update regressions where the API stayed nominally healthy while output quality shifted under everyone's feet.

Let's walk through what an incident costs the average AI app from the moment a provider starts misbehaving:

Hour 0: The provider's edge starts returning bad outputs. Your endpoint is healthy. Your error rate is fine. Nothing fires.

Hour 0 to 1: A handful of paying users see the issue. They don't know it's upstream. They assume it's you. The most engaged 5% of them email support. The other 95% say nothing to you; some of them quietly churn or rage-tweet.

Hour 1 to 2: Support tickets cross some internal threshold. Someone on your team starts investigating. You check your dashboards. Everything looks fine. You start questioning your own deploy from yesterday. You roll something back. The problem persists. You waste 40 minutes.

Hour 2 to 4: Someone finally checks the provider's status page. Yes, there's an incident. Now you know. But your users still don't. You write a tweet. You update a Notion doc nobody reads. Your support team copy-pastes "we're investigating an upstream issue with our model provider" 120 times.

Hour 4 to incident resolved: Tickets keep coming because there is no canonical place for users to check. Each ticket takes 4–8 minutes of staff time to respond to. If you have 300 affected users and a 20% complaint rate, that is 60 tickets at roughly 6 minutes each — six hours of support labor for a problem you did not cause and cannot fix.

Day +1: Some percentage of affected users have already silently churned. You will see it in next month's retention numbers and you will not be able to attribute it to that one Tuesday morning.

The real cost is not the incident itself. It is the asymmetry: your users know something is broken, and they have nowhere to go to confirm it except your inbox.
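
The Hour 4 arithmetic is easy to re-run with your own numbers. This is a back-of-envelope sketch: the function name and parameters are ours, chosen for illustration, and the inputs are the example figures from the timeline above.

```python
def support_labor_hours(affected_users: int,
                        complaint_rate: float,
                        minutes_per_ticket: float) -> float:
    """Estimate support hours spent answering tickets for one incident."""
    tickets = affected_users * complaint_rate
    return tickets * minutes_per_ticket / 60.0


# Example figures from the text: 300 affected users, 20% complaint rate.
# Tickets take 4 to 8 minutes each, so we bracket the midpoint estimate.
for minutes in (4, 6, 8):
    hours = support_labor_hours(300, 0.20, minutes)
    print(f"{minutes} min/ticket: {hours:.1f} hours of support labor")
```

Swap in your own user count and complaint rate; the result scales linearly with every input, which is why cutting the number of tickets filed matters more than answering each one faster.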

What a real AI status page changes

The whole logic of a public status page is to flip the support load math. Instead of every confused user sending you a ticket, they hit a URL — status.yourapp.com — and get an honest answer in three seconds. If the answer is "OpenAI has been degraded since 03:42 UTC, this is upstream of us, here is the link to OpenAI's status page" — they bookmark, they close the tab, they wait. They do not file the ticket.

The teams we have talked to consistently say the same thing: a well-positioned status page meaningfully cuts incident-related support load. The exact reduction depends on how visible the page is from inside your app and how quickly you publish updates. But the direction is never in doubt — every ticket a user resolves by reading a status page is a ticket your team does not have to answer.

The reason an AI status page is different from a generic one comes down to what it actually monitors. Apselog probes the ten LLM providers most indie AI apps depend on — OpenAI, Anthropic, Google Gemini, Mistral, Groq, xAI, Replicate, Fireworks, Cohere, and Together AI — every two minutes. When a provider degrades, your status page reflects it within a few minutes. Your support team links to it in the auto-reply. Your users self-serve the answer. The math changes.

We are honest about the detection window: Apselog catches provider issues in 2–5 minutes, not 30 seconds. That is the price of running a probe-based architecture on a Vercel cron. Better Stack and the dedicated uptime players will detect a pure HTTP failure faster. The tradeoff is that we are looking at the right thing — the LLM provider's actual model surface — not just a TCP handshake.
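
To make "the right thing" concrete, here is a minimal sketch of how a single probe result could be classified. It is not Apselog's implementation: the ProbeResult type, the classify_probe function, and the 20-second latency threshold are all hypothetical. The point it illustrates is that an HTTP 200 with an empty completion counts as degraded, not healthy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    status_code: Optional[int]  # None when the request timed out or never connected
    latency_s: float            # wall-clock time for the probe request
    output_text: str            # completion text returned by the model


def classify_probe(result: ProbeResult, max_latency_s: float = 20.0) -> str:
    """Map one probe result to a status-page state (hypothetical rules)."""
    if result.status_code is None or result.status_code >= 500:
        return "down"
    if result.status_code != 200:
        return "degraded"  # e.g. unexpected rate limiting or rejected prompts
    if not result.output_text.strip():
        return "degraded"  # HTTP 200, but the model returned nothing usable
    if result.latency_s > max_latency_s:
        return "degraded"  # responding, but far outside the expected band
    return "operational"


# The 3:14 AM scenario: the endpoint answers 200 with an empty completion.
print(classify_probe(ProbeResult(status_code=200, latency_s=1.2, output_text="")))
```

A plain uptime check would stop at the status code and report green for that last case.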

The signals you actually need to surface

A status page that just shows "OpenAI: operational" is barely better than no status page. The signals that matter to an AI app are the ones that turn into user-visible failures, and there are four:

Provider uptime. The obvious one. Is the API responding, and is response time within the expected band? This is what every uptime tool does. We do it for the ten LLM providers out of the box.

Eval drift. The provider's API can be 100% available and the model can still be silently broken. A weights update, a safety filter change, a quantization swap — all of these can degrade quality without touching the uptime metric. Apselog catches this by replaying a small set of customer-defined golden prompts against the provider every night and scoring the results with an LLM-as-judge. When the score drops more than 5% versus the trailing seven-day baseline, an anomaly is created and the customer is alerted. This is the single most differentiated signal an AI status page can publish, and we wrote up the architecture in more detail on our Apselog vs Helicone comparison.
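
The drift rule can be sketched in a few lines. This is an illustration under assumptions, not the production logic: we assume the baseline is the mean of the last seven nightly judge scores and that "5%" means a relative drop. The function name has_drifted is hypothetical.

```python
from statistics import mean
from typing import Sequence


def has_drifted(baseline_scores: Sequence[float],
                tonight_score: float,
                max_relative_drop: float = 0.05) -> bool:
    """Return True when tonight's eval score fell more than the allowed
    fraction below the trailing baseline (assumed: mean of nightly scores)."""
    baseline = mean(baseline_scores)
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    relative_drop = (baseline - tonight_score) / baseline
    return relative_drop > max_relative_drop


# Seven nights at 0.90, then a night at 0.84: a drop of about 6.7%.
print(has_drifted([0.90] * 7, 0.84))
# A night at 0.88 is a drop of about 2.2%, inside the tolerance.
print(has_drifted([0.90] * 7, 0.88))
```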

Token spend anomaly. A runaway agent, a prompt injection that triggers a retry loop, a customer accidentally setting max_tokens=10000 in production — these show up as 5–10× cost spikes before they show up as user complaints. Apselog ingests token usage events from the customer's runtime and runs hourly anomaly detection against the trailing 7-day baseline. A spike of 2.5× triggers a high-severity incident; 5× triggers critical.
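
The two thresholds translate directly into a severity function. A minimal sketch, assuming the baseline is an average hourly spend over the trailing seven days and that the thresholds are inclusive; spend_severity and its parameters are hypothetical names.

```python
from typing import Optional


def spend_severity(hourly_spend: float,
                   baseline_hourly_spend: float,
                   high_multiple: float = 2.5,
                   critical_multiple: float = 5.0) -> Optional[str]:
    """Classify one hour of token spend against a trailing baseline.

    Returns "critical", "high", or None when spend is within normal range.
    """
    if baseline_hourly_spend <= 0:
        return None  # no baseline yet, so nothing to compare against
    ratio = hourly_spend / baseline_hourly_spend
    if ratio >= critical_multiple:
        return "critical"
    if ratio >= high_multiple:
        return "high"
    return None


# A runaway agent pushing spend from a $4/hour baseline to $22/hour (5.5x).
print(spend_severity(22.0, 4.0))
```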

Plain-English summaries. When something is wrong, the customer's users do not want a stack trace. They want a sentence. Apselog drafts a calm, factual incident summary using Claude Haiku via Vercel AI Gateway. The customer reviews and approves before it publishes; we never auto-post. The faster a status update goes out, the fewer tickets arrive in the meantime, and removing the "what do I say?" friction matters more than people expect.

Why we are not building "AI observability"

There is a category of products called LLM observability — Helicone, Langfuse, Braintrust, LangSmith, Datadog LLM Observability. They are aimed at engineers. They show traces, spans, prompt diffs, latency histograms. They are excellent at what they do. We are not trying to compete with them.

The asymmetry that defines our wedge is that every one of those tools lives behind a login that your customers will never see. They give your engineering team more visibility — and they leave your end users in the dark.

Apselog flips the camera. We do not give your engineers a dashboard with more knobs. We give your customers a URL they can bookmark. We compared the buyer math in detail on our Apselog vs Datadog page — short version: Datadog LLM Observability starts around $160/month and is sold to enterprise SRE teams. Apselog Pro is $29/month and is sold to the indie founder who is shipping an AI feature with a team of three and lives in fear of a 3 AM OpenAI page.

This is not a knock on the observability tools. Most of our target customers should run one of them in parallel. But "show my engineers a flame graph" and "show my customers a green check" are different jobs, and assembling them yourself out of two tools is more work and more money than just picking a product that does the second job directly.

Why public-by-default matters more than the feature list

The most under-discussed lever in incident response is psychological. When something breaks and your users have a place to check, they trust you more even when the news is bad. Transparency is cheaper than tickets and it compounds across incidents.

This is why we put the status page in front of the dashboard, not behind it. The free tier of Apselog publishes a fully public page at apselog.com/status/<your-slug> with all ten providers monitored, 30 days of history, and an Apselog badge in the footer. The Pro tier removes the badge and lets you point a custom domain at it. Every incident, every drift event, every spend anomaly that the customer approves is published to that page within seconds of approval.

The pricing is on the pricing page — Free, Pro at $29/month, Team at $99/month. The free tier is intentionally generous because we want every AI-powered product to have an honest public status page. The bet is that some percentage of teams using the free tier will want a custom domain or multiple golden eval sets, and they will upgrade. The Apselog badge on every free page is the deal.

We are also serious about the privacy and compliance posture. The full sub-processor list, retention policy, and security questionnaire pre-answers live on the trust page. For B2B buyers who paste a security questionnaire into every vendor inbox the moment they hear "AI," this is the page that closes the deal.

What to do in the next ten minutes

If you ship an AI-powered product and you do not have a public status page that understands LLMs, the cost is not theoretical. The next provider degradation is coming. It might be next Tuesday. It might be tonight at 3 AM. You will not get to choose when.

The setup is short. Sign up, paste in your product name, generate a slug, and you have a public status page that monitors all ten major LLM providers every two minutes. If you want a custom domain, golden-set eval drift detection, or token-spend anomaly alerts, that is on the Pro tier. If you run an agency and need multiple branded pages for clients, that is the Team tier.

Try Apselog free — no credit card, full provider monitoring, public page live in under five minutes. If you want to talk through whether the fit makes sense before signing up, email [email protected] and we will answer honestly. The worst case is that you set up a free status page and never need to send a single user to it. The best case is that the next time OpenAI has a bad morning, your users learn it from your domain, not from Twitter.