SSL Certificate Monitoring for DevOps Teams
Prevent certificate-related outages before they hit production. Monitoring for teams managing certificates across servers, load balancers, and cloud services.
3 AM PagerDuty: It's an Expired Certificate
The alert fires at 3:12 AM. Production is down. Users are seeing connection errors. Your team scrambles into the incident channel, checks the load balancer, checks the application -- everything looks fine. Logs show nothing unusual. Health checks pass on the instances themselves.
Then someone checks the certificate on the load balancer. Expired. Six hours ago. The auto-renewal cron job that was set up eighteen months ago stopped running after a server migration three months back. Nobody noticed because nobody was watching.
Forty-five minutes and one emergency certificate issuance later, production is back. The postmortem lands on the same conclusion every certificate incident reaches: "We need to monitor our certificates."
The DevOps Certificate Challenge
DevOps teams don't manage one certificate. They manage an ecosystem of them, spread across layers of infrastructure that all need to stay in sync.
Where certificates live in a typical stack:
- Load balancers (ALB, NLB, HAProxy, Nginx)
- CDN edge nodes (CloudFront, Fastly, Cloudflare)
- API gateways (Kong, AWS API Gateway)
- Kubernetes ingress controllers
- Internal service-to-service mTLS
- CI/CD webhook endpoints
- Monitoring and observability tool endpoints
- VPN and bastion host certificates
Each of these has its own renewal mechanism, its own timeline, and its own failure mode. Some are managed by cloud providers. Some are automated with Certbot or cert-manager. Some were manually installed by someone who's no longer on the team.
The certificate inventory problem
Most DevOps teams can't answer a simple question: "How many SSL certificates do we have, and when does each one expire?" If you can't answer that, you can't prevent expiry-related outages.
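A minimal sketch of how you might start answering that question with nothing but Python's standard library: connect to each endpoint, read the certificate actually being served, and sort by days remaining. The domain names and helper names here are illustrative, not part of any product API.

```python
import socket
import ssl
from datetime import datetime, timezone

def fetch_not_after(hostname, port=443, timeout=5.0):
    """Fetch the certificate actually served at hostname:port and
    return its notAfter timestamp as an aware datetime."""
    ctx = ssl.create_default_context()  # default context also verifies the chain
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            raw = tls.getpeercert()["notAfter"]  # e.g. 'Jun  1 12:00:00 2026 GMT'
    return datetime.strptime(raw, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)

def build_inventory(domains, fetch=fetch_not_after):
    """Return (domain, days_until_expiry) pairs, soonest expiry first."""
    now = datetime.now(timezone.utc)
    return sorted(
        ((domain, (fetch(domain) - now).days) for domain in domains),
        key=lambda row: row[1],
    )
```

Feed `build_inventory` your real endpoint list (for example, `build_inventory(["example.com", "api.example.com"])`) and you have the beginnings of an inventory; a dedicated monitoring service adds the scheduling, alerting, and chain checks around it.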
When Automation Isn't Enough
Let's Encrypt and ACME-based automation changed the game. Certificates that used to require a manual renewal every year are now valid for 90 days and renew themselves automatically, typically around day 60. But automation creates its own kind of risk: the assumption that it's working.
Auto-renewal fails in predictable ways:
- Server migrations break cron jobs and systemd timers. The Certbot timer that was running on the old server doesn't exist on the new one.
- DNS changes invalidate validation. You move to a new DNS provider, and the API credentials for DNS-01 challenges no longer work.
- Containerized environments lose state. The container rebuilds from a clean image and Certbot isn't installed.
- Permission changes after security hardening prevent Certbot from writing to the certificate directory or reloading the web server.
- Rate limits lock you out. A misconfigured renewal script hammers Let's Encrypt's API, and you hit the rate limit right when you need a renewal most.
The worst part is that these failures are silent. Certbot doesn't page you when it fails to renew. It logs an error to a file that nobody reads, and the certificate quietly marches toward expiry.
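One mitigation is to stop letting the renewal job's failures die in a log file. A sketch of a thin wrapper you could run from cron or a systemd timer, assuming you push failures to a channel a human actually watches; the `certbot renew` invocation in the comment is illustrative of whatever renewal command you run:

```python
import subprocess
import sys

def renew_and_report(cmd, alert=print):
    """Run a renewal command and surface a nonzero exit status
    instead of letting it vanish into an unread log."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Swap print for a Slack webhook or PagerDuty event in a real setup.
        alert(f"renewal failed ({cmd[0]}): {result.stderr.strip()}")
        return False
    return True

# Typical invocation from a timer:
#   renew_and_report(["certbot", "renew", "--quiet"])
```

This still only covers the failures the wrapper sees; external monitoring of the served certificate catches the cases where the timer itself never fires.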
The Blast Radius of an Expired Production Certificate
When a certificate expires on a blog or a docs site, it's embarrassing. When a certificate expires on a production load balancer, the blast radius is enormous:
- Every user connecting through that endpoint gets a certificate error
- API consumers -- mobile apps, webhooks, third-party integrations -- all fail simultaneously
- Health checks from upstream services start failing, potentially triggering cascading failures
- HSTS-enabled domains become completely inaccessible -- no click-through option for users
- Service mesh mTLS failures can take down internal service-to-service communication
A single expired certificate on the wrong endpoint can take down an entire platform. The mean time to detection (MTTD) is the critical variable -- and without monitoring, MTTD is "whenever a user complains."
Add a safety net to your certificate automation
Monitor the certificates your servers actually serve. Get alerts when auto-renewal silently fails.
Why Monitoring Is a Layer on Top of Automation
Automation handles the renewal. Monitoring verifies the result. These are complementary, not redundant.
Think of it like deployment pipelines. You automate deployments, but you still have health checks, smoke tests, and monitoring dashboards that verify the deployment succeeded. You don't skip Datadog because you trust your CI/CD pipeline. The same logic applies to certificates.
SSL Certificate Expiry checks the actual certificate being served by your endpoints -- the same certificate your users and API consumers see. It doesn't care whether that certificate was issued by Let's Encrypt, DigiCert, or your internal CA. It doesn't care whether renewal is automated or manual. It checks the live state and alerts you when expiry approaches.
- External validation
- Full chain validation
- Escalating alert cadence
- Co-recipient routing
- Bulk monitoring
How DevOps Teams Use SSL Certificate Expiry
Certificate Inventory
Start by building a complete picture. Add every externally reachable endpoint: production domains, staging environments, API endpoints, CDN origins, webhook URLs. This becomes your certificate inventory -- a single place to see every certificate, its issuer, its expiry date, and its chain status.
Alert Routing by Blast Radius
Not every certificate is equally critical. Configure alerts based on impact:
- Platform-critical certificates (main domain, API, load balancers): Route to the on-call rotation. These need immediate action.
- Secondary services (docs, blog, staging environments): Route to the team channel or a renewal task queue. Important, but not a 3 AM page.
- Internal services: Route to the owning team's backlog. Tracked, but handled during business hours.
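The three tiers above amount to a small routing table. A sketch -- the tier names and destinations are placeholders for whatever channels your team actually uses:

```python
# Hypothetical routing table: criticality tier -> where the alert lands.
ROUTES = {
    "platform-critical": "pagerduty:on-call",
    "secondary": "slack:#infra-team",
    "internal": "jira:owning-team-backlog",
}

def route_alert(tier):
    """Return the destination for a certificate alert; unknown tiers
    fall back to the team channel rather than being dropped."""
    return ROUTES.get(tier, ROUTES["secondary"])
```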
Integration with Incident Workflow
When an alert fires at 30 days, it's a ticket. When it fires at 7 days, it's a priority ticket. When it fires at 3 days, it's an incident. Map the escalating alerts to your existing incident management process -- whether that's Jira, Linear, PagerDuty, or a Slack channel.
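That escalation maps directly onto a threshold function. A sketch using the 30/7/3-day thresholds from this section; the tier labels are illustrative:

```python
def escalation(days_left):
    """Map days-to-expiry onto escalating severity tiers."""
    if days_left <= 3:
        return "incident"         # page on-call now
    if days_left <= 7:
        return "priority-ticket"
    if days_left <= 30:
        return "ticket"
    return "ok"                   # nothing to do yet
```

Wiring each return value to the matching Jira, Linear, PagerDuty, or Slack action keeps certificate alerts inside the incident process your team already runs.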
The Post-Incident Improvement
After a certificate incident, the first action item in the postmortem is always "add monitoring." With SSL Certificate Expiry, you can close that action item in 2 minutes. Add the domain, set up the alert routing, and move on to the structural fixes.
The Cost Argument
One certificate-related production outage costs your organization more than years of monitoring fees. Calculate it yourself:
- Engineer time during the incident (multiple engineers, after-hours rates)
- Revenue loss during downtime
- Customer trust impact
- Postmortem and remediation time
- The opportunity cost of what those engineers would have been doing instead
$9/month for unlimited certificate monitoring is less than the cost of one on-call engineer's first 10 minutes during an incident. It's a rounding error in your infrastructure budget, and it prevents your team's worst on-call experience.
Free
$0
- Up to 3 items
- Email alerts
- Basic support
Pro
$9/month
- Unlimited items
- Email + Slack alerts
- Priority support
- API access
Get Started
Add your production endpoints
Start with the certificates that would cause the most damage if they expired: load balancers, API gateways, main application domains.
Add secondary and internal endpoints
Work outward from production: staging environments, documentation sites, webhook endpoints, CDN origins.
Configure alert routing
Set up co-recipients so alerts reach the right team. Platform-critical certs to on-call, secondary certs to the team channel.
Add to your infrastructure runbook
Include "add SSL monitoring" in your new service deployment checklist, right next to "configure health checks."
Part of Boring Tools -- boring tools for boring jobs.
Never miss an SSL certificate expiry
Monitor your certificates and get alerts before they expire. Free for up to 3 certificates.