How to Debug Production Issues Without Losing Your Mind
Production issues happen. When they do, panic is your worst enemy. Here's a practical guide to debugging production problems systematically, from logging strategies to incident response, so you can fix issues faster and keep your sanity intact.
Aymen Chikeb
12/15/2025
The Moment of Truth: When Production Breaks
It's 2 AM. Your phone buzzes. Slack is exploding. Users are reporting errors. The dashboard is red. Your heart rate spikes (and sometimes you're the only person who can fix it).
Welcome to production debugging.
We've all been there. That moment when something breaks in production and you need to figure out what, why, and how to fix it fast. The pressure is real: users are affected (and the emails are already coming in), revenue might be impacted, and everyone is looking at you.
But here's the thing: panic is your worst enemy. The faster you panic, the slower you debug. The slower you debug, the longer users suffer.
And don't forget that in that moment you're human. Whether the problem is yours, someone else's, or an external service's, stay positive and don't get nervous; being nervous gains you nothing.
This isn't about never having issues (that's impossible). It's about having a systematic approach that helps you stay calm, think clearly, and solve problems faster.
(We all know the gif about fixing a bug in production.)
Before the Fire: Building Your Debugging Arsenal
The best time to prepare for production issues is before they happen. You can't debug what you can't see. I've worked on a project that kept only about 30 days of logs and had no monitoring, which means no alerting: real chaos. We still solved problems by reading the code and reasoning through every situation and use case the bug could touch, but it was painful. Work on a project like that and you understand deeply why anything with more than 1K customers needs monitoring, alerting, and some DevOps tooling.
What You Need in Place
1. Comprehensive Logging
- Structured logs (JSON format)
- Log levels (DEBUG, INFO, WARN, ERROR, FATAL)
- Request IDs for tracing (especially with microservices, so you can follow a user's logs across components)
- Contextual information (user ID, session, request path)
2. Monitoring & Alerting
- Application metrics (response times, error rates, throughput)
- Infrastructure metrics (CPU, memory, disk, network)
- Business metrics (conversion rates, active users)
- Smart alerts (not just "something is wrong"—tell me what)
3. Distributed Tracing
- Trace IDs across services
- Service maps
- Latency breakdowns
4. Error Tracking
- Centralized error collection (Sentry, Rollbar, etc.)
- Stack traces with context
- Error grouping and trends
5. Runbooks & Documentation
- Common issues and solutions
- Service dependencies
- Rollback procedures
- On-call contacts
If you don't have these yet, start small. Add structured logging first. Add some health checks and subscribe to your external services' status pages (like the AWS status page). Then add monitoring. Build your observability stack incrementally.
The Debugging Mindset: Stay Calm, Stay Systematic
When production breaks, your first instinct might be to:
- Start changing random things
- Blame the last deployment
- Panic and ask "WHAT DID WE BREAK?!"
Stop. Breathe. Think. And never lose your calm, because that's what people will remember about how you handled the situation.
The Systematic Approach
1. Confirm the Issue
- Is it really broken? Or is it a false alarm?
- How many users are affected?
- What's the actual error message?
- Can you reproduce it?
2. Gather Information
- Check your monitoring dashboards
- Look at recent deployments
- Review error logs
- Check system metrics
3. Form a Hypothesis
- Based on the evidence, what do you think is wrong?
- What changed recently?
- What's the most likely cause?
4. Test Your Hypothesis
- Can you reproduce it in staging? (We do this a lot when we don't understand the cause, as long as the environments are aligned.)
- Does the hypothesis explain all the symptoms?
- What would fix it if your hypothesis is correct?
5. Fix and Verify
- Apply the fix
- Monitor to confirm it worked
- Document what happened
This sounds simple, but under pressure, it's easy to skip steps. Write it down. Create a checklist. Follow it every time.
Logging: Your Best Friend (When Done Right)
Good logs are like a time machine. They let you see exactly what happened, when it happened, and in what context.
What Makes Logs Good
Structured Logging
Instead of:
console.log("User logged in");
Do this:
logger.info({
  event: "user_login",
  userId: user.id,
  timestamp: new Date().toISOString(),
  requestId: req.id,
  ip: req.ip,
  userAgent: req.headers['user-agent']
});
Why? Structured logs are:
- Searchable (find all logins for user X)
- Filterable (show only errors from the last hour)
- Parseable (tools can extract metrics automatically)
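As a concrete sketch of what that setup can look like: assuming Node.js and the pino library (one common choice, not the only one), a JSON logger with level filtering is only a few lines. The service name and example fields here are illustrative.
// Minimal structured logger (assuming pino; any one-JSON-object-per-line logger works)
const pino = require('pino');

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',  // keep production at INFO unless you need more
  base: { service: 'api' }                 // attached to every log line
});

// Usage: merge object first, optional human-readable message second
logger.info({ event: 'user_login', userId: 42 }, 'User logged in');
Any logger that emits one JSON object per line will play nicely with aggregators like Loki or Elasticsearch.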
Log Levels: Use Them Wisely
- DEBUG: Detailed information for diagnosing problems (only in development/staging)
- INFO: General informational messages (user actions, system events)
- WARN: Something unexpected happened, but the system is still working
- ERROR: An error occurred, but the system can continue
- FATAL: Critical error that might cause the system to crash
Pro tip: In production, set your log level to INFO or WARN. DEBUG logs are too noisy and expensive.
Request IDs: The Thread That Connects Everything
Every request should have a unique ID. Pass it through:
- HTTP headers
- Log messages
- Database queries
- External API calls
// At the start of a request
const requestId = crypto.randomUUID();
req.id = requestId;
// In your logs
logger.info({ requestId, message: "Processing payment" });
// In error logs
logger.error({ requestId, error: err.message, stack: err.stack });
Now you can trace a single request through your entire system. This is gold when debugging.
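A minimal sketch of how that can look as Express middleware (assuming Express and Node 18+ for the built-in fetch; the x-request-id header name and the payments URL are placeholders):
const crypto = require('crypto');
const express = require('express');
const app = express();

// Attach a request ID to every incoming request (reuse one from upstream if present)
app.use((req, res, next) => {
  req.id = req.headers['x-request-id'] || crypto.randomUUID();
  res.setHeader('x-request-id', req.id);  // echo it back so users can report it
  next();
});

// Forward the same ID on outgoing calls so downstream services log it too
app.post('/checkout', async (req, res) => {
  const payment = await fetch('https://payments.example.com/charge', {
    method: 'POST',
    headers: { 'x-request-id': req.id }
  });
  res.status(payment.status).end();
});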
What to Log (And What Not to Log)
DO log:
- User actions (with user ID)
- System events (deployments, config changes)
- Errors (with full context)
- Performance metrics (slow queries, API calls)
- Business events (purchases, signups)
DON'T log:
- Passwords or secrets (obviously; see the redaction sketch after this list)
- Full credit card numbers
- Personally identifiable information (unless necessary and compliant)
- Too much data (logs are expensive)
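You can enforce some of these DON'Ts at the logger level instead of relying on every developer to remember. A sketch assuming pino, whose redact option censors matching paths (the field names here are examples):
const pino = require('pino');

// Sensitive fields are censored before they ever reach log storage
const logger = pino({
  redact: {
    paths: ['password', 'creditCard', 'user.email', 'req.headers.authorization'],
    censor: '[REDACTED]'
  }
});

logger.info({ user: { email: 'jane@example.com' }, event: 'signup' });
// => {"user":{"email":"[REDACTED]"},"event":"signup",...}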
Observability: Seeing What You Can't See
Logging tells you what happened. Observability tells you why it happened and how your system behaves.
The Three Pillars of Observability
1. Metrics
- Application metrics: Response times, error rates, request counts
- Infrastructure metrics: CPU, memory, disk I/O, network
- Business metrics: Revenue, conversions, active users
2. Logs
- Structured, searchable, contextual
- (We covered this above)
3. Traces
- Distributed tracing across services
- See the full path of a request
- Identify bottlenecks
Building Your Observability Stack
You don't need expensive tools to start. Here are some options:
Free/Open Source:
- Prometheus + Grafana: Metrics and dashboards
- Loki: Log aggregation
- Jaeger: Distributed tracing
- ELK Stack: Elasticsearch, Logstash, Kibana
Managed Services:
- Datadog: All-in-one (expensive but powerful)
- New Relic: APM and monitoring
- Sentry: Error tracking
- CloudWatch: AWS-native monitoring
Start simple: Add Prometheus metrics to your app. Export key metrics (request count, error rate, response time). Build one dashboard. Then expand.
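Here's a minimal sketch of that first step, assuming Node.js with the prom-client library and Express (swap in the equivalent for your stack; the metric and label names are just examples):
const client = require('prom-client');
const express = require('express');
const app = express();

// Default process metrics (CPU, memory, event loop lag) for free
client.collectDefaultMetrics();

// One histogram gives you rate, errors and duration (the RED method below)
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status']
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () =>
    end({ method: req.method, route: req.path, status: res.statusCode }));
  next();
});

// Prometheus scrapes this endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
Point Prometheus at /metrics and build one Grafana dashboard on top of that histogram; the golden signals below fall out of it almost for free.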
Key Metrics to Monitor
Golden Signals (from Google's SRE book):
- Latency: How long requests take
- Traffic: How many requests you're handling
- Errors: Error rate
- Saturation: How "full" your system is (CPU, memory, etc.)
RED Method (for microservices):
- Rate: Requests per second
- Errors: Error rate
- Duration: Response time
Monitor these. Alert on them. They'll tell you when something is wrong.
The Incident Response Playbook
When production breaks, you need a plan. Not a plan you make up on the spot, but a plan you've practiced.
Step 1: Acknowledge the Incident
- Create an incident channel (Slack, PagerDuty, etc.)
- Assign an incident commander
- Set severity level (P0 = critical, P1 = high, P2 = medium, P3 = low)
Step 2: Assess Impact
- How many users are affected?
- What functionality is broken?
- Is data at risk?
- What's the business impact?
Step 3: Communicate
- Update stakeholders (don't let them find out from Twitter)
- Set expectations ("We're investigating, update in 15 minutes")
- Be honest about what you know and don't know
Step 4: Debug Systematically
Follow your debugging process:
- Confirm the issue
- Gather information
- Form a hypothesis
- Test it
- Fix it
Document everything as you go. You'll need this for the post-mortem.
Step 5: Fix and Verify
- Apply the fix
- Monitor metrics to confirm
- Test in production (carefully)
- Verify the issue is resolved
Step 6: Post-Mortem
Always do a post-mortem. Even for small incidents. Ask:
- What happened?
- Why did it happen?
- What did we do well?
- What could we improve?
- What action items do we have?
Blame-free culture: Focus on the system, not the person. The goal is to learn and improve.
Common Production Issues and How to Debug Them
Issue: "The API is Slow"
Symptoms: High latency, timeouts, users complaining
Debug steps:
- Check response time metrics (p50, p95, p99)
- Look for slow database queries
- Check external API calls (are third-party services slow?)
- Review recent deployments (did something change?)
- Check resource utilization (CPU, memory, network)
Common causes:
- N+1 query problem (see the sketch after this list)
- Missing database indexes
- External API degradation
- Memory leaks
- Resource exhaustion
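The N+1 problem from that list deserves a quick illustration. A sketch assuming a db.query helper that returns rows (a placeholder for your actual database client): the first version issues one query per order, the second fetches everything in two queries total.
// N+1: one query for the orders, then one more query per order
async function getOrdersWithItems(db, userId) {
  const orders = await db.query('SELECT * FROM orders WHERE user_id = $1', [userId]);
  for (const order of orders) {
    order.items = await db.query(
      'SELECT * FROM order_items WHERE order_id = $1', [order.id]
    );
  }
  return orders;
}

// Better: two queries total, no matter how many orders there are
async function getOrdersWithItemsBatched(db, userId) {
  const orders = await db.query('SELECT * FROM orders WHERE user_id = $1', [userId]);
  const ids = orders.map(o => o.id);
  const items = await db.query(
    'SELECT * FROM order_items WHERE order_id = ANY($1)', [ids]
  );
  for (const order of orders) {
    order.items = items.filter(i => i.order_id === order.id);
  }
  return orders;
}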
Issue: "Users Are Getting 500 Errors"
Symptoms: High error rate, error logs filling up
Debug steps:
- Check error logs (what's the actual error?)
- Look for patterns (same endpoint? same user? same time?)
- Check recent deployments
- Review error tracking (Sentry, etc.)
- Check dependencies (database, cache, external APIs)
Common causes:
- Unhandled exceptions (see the sketch after this list)
- Database connection issues
- Missing environment variables
- Dependency failures
- Memory issues (OOM kills)
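For the unhandled-exceptions case, a catch-all error handler at least turns a silent crash into a logged, traceable 500. A sketch assuming Express, plus the logger and request ID middleware from earlier:
// Catch-all error handler: register it after all routes
app.use((err, req, res, next) => {
  logger.error({
    requestId: req.id,
    path: req.path,
    error: err.message,
    stack: err.stack
  });
  // Never leak stack traces to users; give them the request ID to report instead
  res.status(500).json({ error: 'Internal server error', requestId: req.id });
});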
Issue: "The Database is Slow"
Symptoms: Slow queries, connection pool exhaustion, timeouts
Debug steps:
- Check slow query log
- Look for missing indexes
- Check connection pool usage
- Review query patterns (N+1, full table scans)
- Check database metrics (CPU, I/O, connections)
Common causes:
- Missing indexes
- Inefficient queries
- Connection pool too small (see the sketch after this list)
- Database server resource issues
- Lock contention
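On the connection pool point: pool limits are often left at their defaults and never revisited. A sketch assuming node-postgres; the numbers are illustrative, not recommendations.
const { Pool } = require('pg');

const pool = new Pool({
  max: 20,                        // hard cap on connections from this process
  idleTimeoutMillis: 30000,       // release idle connections back to the database
  connectionTimeoutMillis: 5000   // fail fast instead of hanging when the pool is exhausted
});

// Pool stats worth exporting to your dashboards
setInterval(() => {
  logger.info({
    event: 'db_pool',
    total: pool.totalCount,
    idle: pool.idleCount,
    waiting: pool.waitingCount
  });
}, 60000);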
Issue: "Memory Usage is Growing"
Symptoms: Memory increasing over time, OOM kills, slow performance
Debug steps:
- Check memory metrics over time
- Look for memory leaks in code
- Check for unclosed resources (connections, file handles)
- Review caching strategies (are caches growing unbounded?)
- Use memory profilers (or start with the cheap logging sketch below)
Common causes:
- Memory leaks
- Unbounded caches
- Unclosed resources
- Large objects in memory
- Garbage collection issues
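Before reaching for a full profiler, a cheap first step is to log memory usage on an interval so slow growth shows up in your dashboards. A minimal Node.js sketch, reusing the logger from earlier:
// Log heap and RSS once a minute; a slow, steady climb usually points to a leak
setInterval(() => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  logger.info({
    event: 'memory_usage',
    rssMb: Math.round(rss / 1024 / 1024),
    heapUsedMb: Math.round(heapUsed / 1024 / 1024),
    heapTotalMb: Math.round(heapTotal / 1024 / 1024)
  });
}, 60000);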
Issue: "Everything Was Working, Then It Broke"
Symptoms: Sudden failure after a period of stability
Debug steps:
- What changed? (deployment, config, traffic, data)
- Check deployment logs
- Review recent changes (git history)
- Check if it's time-based (cron job, scheduled task?)
- Look for external factors (third-party service, infrastructure)
Common causes:
- Recent deployment
- Configuration change
- Traffic spike
- Data corruption
- External dependency failure
Pro Tips from the Trenches
1. Keep a Debugging Journal
- Write down what you tried
- What worked, what didn't
- Patterns you notice
- You'll thank yourself later
2. Use Time-Travel Debugging
- If you have good logs, you can "replay" what happened
- Request IDs make this possible
- Distributed tracing shows the full picture
3. Reproduce in Staging
- If you can reproduce it, you can debug it safely
- Use production data (anonymized) if needed
- Test your fix before deploying
4. Know Your Rollback Procedure
- Practice it
- Document it
- Make it one command if possible
- Test it regularly
5. Monitor After Deployments
- Watch metrics for 30-60 minutes after deploy
- Set up deployment alerts
- Have someone on standby
6. Use Feature Flags
- Deploy code without enabling it
- Enable gradually
- Easy rollback (just flip the flag)
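You don't need a dedicated platform to get started with flags; even a config-backed helper gives you the "just flip it off" escape hatch. A minimal sketch (the flag name, rollout logic, and helper are hypothetical):
const crypto = require('crypto');

// Flags loaded from config or environment; a managed service can replace this later
const flags = {
  newCheckoutFlow: { enabled: true, rolloutPercent: 10 }
};

function isEnabled(flagName, userId) {
  const flag = flags[flagName];
  if (!flag || !flag.enabled) return false;
  // Deterministic bucketing: the same user always lands in the same bucket
  const hash = crypto.createHash('md5').update(`${flagName}:${userId}`).digest();
  return hash.readUInt32BE(0) % 100 < flag.rolloutPercent;
}

// Usage inside a request handler:
//   if (isEnabled('newCheckoutFlow', req.user.id)) { /* new path */ } else { /* old path */ }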
7. Log Everything (But Store Selectively)
- Log at DEBUG level in development
- Store only INFO+ in production
- But make sure you CAN log more if needed
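Most structured loggers let you change the level at runtime, so you can turn DEBUG on for a few minutes during an incident without redeploying. A sketch assuming pino and a hypothetical admin route (protect it properly in real life):
const allowedLevels = ['debug', 'info', 'warn', 'error'];

// Bump verbosity temporarily during an incident, then put it back
app.post('/admin/log-level', (req, res) => {
  const level = req.query.level;
  if (!allowedLevels.includes(level)) {
    return res.status(400).json({ error: 'unknown level' });
  }
  logger.level = level;  // pino applies the new level immediately
  res.json({ logLevel: logger.level });
});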
8. Build Runbooks
- Document common issues
- Step-by-step solutions
- Update them as you learn
Remember: Production issues are inevitable. The goal isn't to never have them—it's to handle them better each time. Build your observability stack. Practice your debugging process. Learn from every incident.
And most importantly: stay calm. Panic makes everything worse. Take a breath, follow your process, and you'll figure it out.
Good luck out there. May your logs be structured, your metrics be clear, and your incidents be rare.


