SSL Certificate Monitoring for DevOps Teams
Prevent certificate-related outages before they hit production. Monitoring for teams managing certificates across servers, load balancers, and cloud services.
3 AM PagerDuty: It's an Expired Certificate
The alert fires at 3:12 AM. Production is down. Users are seeing connection errors. Your team scrambles into the incident channel, checks the load balancer, checks the application -- everything looks fine. Logs show nothing unusual. Health checks pass on the instances themselves.
Then someone checks the certificate on the load balancer. Expired. Six hours ago. The auto-renewal cron job that was set up eighteen months ago stopped running after a server migration three months back. Nobody noticed because nobody was watching.
Forty-five minutes and one emergency certificate issuance later, production is back. The postmortem lands on the same conclusion every certificate incident reaches: "We need to monitor our certificates."
The DevOps Certificate Challenge
DevOps teams don't manage one certificate. They manage an ecosystem of them, spread across layers of infrastructure that all need to stay in sync.
Where certificates live in a typical stack:
- Load balancers (ALB, NLB, HAProxy, Nginx)
- CDN edge nodes (CloudFront, Fastly, Cloudflare)
- API gateways (Kong, AWS API Gateway)
- Kubernetes ingress controllers
- Internal service-to-service mTLS
- CI/CD webhook endpoints
- Monitoring and observability tool endpoints
- VPN and bastion host certificates
Each of these has its own renewal mechanism, its own timeline, and its own failure mode. Some are managed by cloud providers. Some are automated with Certbot or cert-manager. Some were manually installed by someone who's no longer on the team.
The certificate inventory problem
Most DevOps teams can't answer a simple question: "How many SSL certificates do we have, and when does each one expire?" If you can't answer that, you can't prevent expiry-related outages.
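A minimal sketch of how you might start answering that question with nothing but Python's standard library: connect to each endpoint, read the certificate actually being served, and sort by days remaining. The domain names and helper names here are illustrative, not part of any product API.

```python
import socket
import ssl
from datetime import datetime, timezone

def fetch_not_after(hostname, port=443, timeout=5.0):
    """Fetch the certificate actually served at hostname:port and
    return its notAfter timestamp as an aware datetime."""
    ctx = ssl.create_default_context()  # default context also verifies the chain
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            raw = tls.getpeercert()["notAfter"]  # e.g. 'Jun  1 12:00:00 2026 GMT'
    return datetime.strptime(raw, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)

def build_inventory(domains, fetch=fetch_not_after):
    """Return (domain, days_until_expiry) pairs, soonest expiry first."""
    now = datetime.now(timezone.utc)
    return sorted(
        ((domain, (fetch(domain) - now).days) for domain in domains),
        key=lambda row: row[1],
    )
```

Feed `build_inventory` your real endpoint list (for example, `build_inventory(["example.com", "api.example.com"])`) and you have the beginnings of an inventory; a dedicated monitoring service adds the scheduling, alerting, and chain checks around it.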
When Automation Isn't Enough
Let's Encrypt and ACME-based automation changed the game. Certificates that used to require a manual renewal every year are now valid for 90 days and renew themselves automatically, typically around day 60. But automation creates its own kind of risk: the assumption that it's working.
Auto-renewal fails in predictable ways:
- Server migrations break cron jobs and systemd timers. The Certbot timer that was running on the old server doesn't exist on the new one.
- DNS changes invalidate validation. You move to a new DNS provider, and the API credentials for DNS-01 challenges no longer work.
- Containerized environments lose state. The container rebuilds from a clean image and Certbot isn't installed.
- Permission changes after security hardening prevent Certbot from writing to the certificate directory or reloading the web server.
- Rate limits lock you out. A misconfigured renewal script hammers Let's Encrypt's API, and you hit the rate limit right when you need a renewal most.
The worst part is that these failures are silent. Certbot doesn't page you when it fails to renew. It logs an error to a file that nobody reads, and the certificate quietly marches toward expiry.
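One mitigation is to stop letting the renewal job's failures die in a log file. A sketch of a thin wrapper you could run from cron or a systemd timer, assuming you push failures to a channel a human actually watches; the `certbot renew` invocation in the comment is illustrative of whatever renewal command you run:

```python
import subprocess
import sys

def renew_and_report(cmd, alert=print):
    """Run a renewal command and surface a nonzero exit status
    instead of letting it vanish into an unread log."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Swap print for a Slack webhook or PagerDuty event in a real setup.
        alert(f"renewal failed ({cmd[0]}): {result.stderr.strip()}")
        return False
    return True

# Typical invocation from a timer:
#   renew_and_report(["certbot", "renew", "--quiet"])
```

This still only covers the failures the wrapper sees; external monitoring of the served certificate catches the cases where the timer itself never fires.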
The Blast Radius of an Expired Production Certificate
When a certificate expires on a blog or a docs site, it's embarrassing. When a certificate expires on a production load balancer, the blast radius is enormous:
- Every user connecting through that endpoint gets a certificate error
- API consumers -- mobile apps, webhooks, third-party integrations -- all fail simultaneously
- Health checks from upstream services start failing, potentially triggering cascading failures
- HSTS-enabled domains become completely inaccessible -- no click-through option for users
- Service mesh mTLS failures can take down internal service-to-service communication
A single expired certificate on the wrong endpoint can take down an entire platform. The mean time to detection (MTTD) is the critical variable -- and without monitoring, MTTD is "whenever a user complains."
Add a safety net to your certificate automation
Monitor the certificates your servers actually serve. Get alerts when auto-renewal silently fails.
Why Monitoring Is a Layer on Top of Automation
Automation handles the renewal. Monitoring verifies the result. These are complementary, not redundant.
Think of it like deployment pipelines. You automate deployments, but you still have health checks, smoke tests, and monitoring dashboards that verify the deployment succeeded. You don't skip Datadog because you trust your CI/CD pipeline. The same logic applies to certificates.
SSL Certificate Expiry checks the actual certificate being served by your endpoints -- the same certificate your users and API consumers see. It doesn't care whether that certificate was issued by Let's Encrypt, DigiCert, or your internal CA. It doesn't care whether renewal is automated or manual. It checks the live state and alerts you when expiry approaches.
- External validation
- Full chain validation
- Escalating alert cadence
- Co-recipient routing
- Bulk monitoring
How DevOps Teams Use SSL Certificate Expiry
Certificate Inventory
Start by building a complete picture. Add every externally reachable endpoint: production domains, staging environments, API endpoints, CDN origins, webhook URLs. This becomes your certificate inventory -- a single place to see every certificate, its issuer, its expiry date, and its chain status.
Alert Routing by Blast Radius
Not every certificate is equally critical. Configure alerts based on impact:
- Platform-critical certificates (main domain, API, load balancers): Route to the on-call rotation. These need immediate action.
- Secondary services (docs, blog, staging environments): Route to the team channel or a renewal task queue. Important, but not a 3 AM page.
- Internal services: Route to the owning team's backlog. Tracked, but handled during business hours.
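The three tiers above amount to a small routing table. A sketch -- the tier names and destinations are placeholders for whatever channels your team actually uses:

```python
# Hypothetical routing table: criticality tier -> where the alert lands.
ROUTES = {
    "platform-critical": "pagerduty:on-call",
    "secondary": "slack:#infra-team",
    "internal": "jira:owning-team-backlog",
}

def route_alert(tier):
    """Return the destination for a certificate alert; unknown tiers
    fall back to the team channel rather than being dropped."""
    return ROUTES.get(tier, ROUTES["secondary"])
```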
Integration with Incident Workflow
When an alert fires at 30 days, it's a ticket. When it fires at 7 days, it's a priority ticket. When it fires at 3 days, it's an incident. Map the escalating alerts to your existing incident management process -- whether that's Jira, Linear, PagerDuty, or a Slack channel.
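That escalation maps directly onto a threshold function. A sketch using the 30/7/3-day thresholds from this section; the tier labels are illustrative:

```python
def escalation(days_left):
    """Map days-to-expiry onto escalating severity tiers."""
    if days_left <= 3:
        return "incident"         # page on-call now
    if days_left <= 7:
        return "priority-ticket"
    if days_left <= 30:
        return "ticket"
    return "ok"                   # nothing to do yet
```

Wiring each return value to the matching Jira, Linear, PagerDuty, or Slack action keeps certificate alerts inside the incident process your team already runs.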
The Post-Incident Improvement
After a certificate incident, the first action item in the postmortem is always "add monitoring." With SSL Certificate Expiry, you can close that action item in 2 minutes. Add the domain, set up the alert routing, and move on to the structural fixes.
The Cost Argument
One certificate-related production outage costs your organization more than years of monitoring fees. Calculate it yourself:
- Engineer time during the incident (multiple engineers, after-hours rates)
- Revenue loss during downtime
- Customer trust impact
- Postmortem and remediation time
- The opportunity cost of what those engineers would have been doing instead
$9/month for unlimited certificate monitoring is less than the cost of one on-call engineer's first 10 minutes during an incident. It's a rounding error in your infrastructure budget, and it prevents your team's worst on-call experience.
Free
$0
- Up to 3 items
- Email alerts
- Basic support
Pro
$9/month
- Unlimited items
- Email + Slack alerts
- Priority support
- API access
Get Started
Add your production endpoints
Start with the certificates that would cause the most damage if they expired: load balancers, API gateways, main application domains.
Add secondary and internal endpoints
Work outward from production: staging environments, documentation sites, webhook endpoints, CDN origins.
Configure alert routing
Set up co-recipients so alerts reach the right team. Platform-critical certs to on-call, secondary certs to the team channel.
Add to your infrastructure runbook
Include "add SSL monitoring" in your new service deployment checklist, right next to "configure health checks."
Part of Boring Tools -- boring tools for boring jobs.
Never miss an SSL certificate expiry
Monitor your certificates and get alerts before they expire. Free for up to 3 certificates.