Microservices Debugging: End to End

Here is a full live example of debugging a microservices failure, exactly the way it happens in real companies like Netflix, Uber, and Amazon.

This example will make the entire debugging process crystal clear. 👇


🎯 Scenario: Order Placement Failure in a Microservices System

A user tries to place an order:

/placeOrder

But the user gets:

500 Internal Server Error

We have these microservices:

  • API Gateway

  • Order Service

  • Payment Service

  • Inventory Service

  • Notification Service


🔥 Step 1: Check Logs in Centralized Logging (ELK)

We search for Correlation ID:

X-Correlation-ID: 8ab12c

We find the logs:

Order Service → Calling Payment Service...
Payment Service → Timeout when calling Bank API
Order Service → Payment Failed

πŸ” Insight

Payment Service timed out while calling external Bank API.

But we need more clarity → go to distributed tracing.
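
Before moving on, note why this search worked at all: every service tags its logs with the same X-Correlation-ID, typically via a small request filter. Here is a minimal sketch of such a filter, assuming a Spring Boot 3 service logging through SLF4J; the class name CorrelationIdFilter and the MDC key are illustrative:

import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

import java.io.IOException;
import java.util.UUID;

@Component
public class CorrelationIdFilter extends OncePerRequestFilter {

    private static final String HEADER = "X-Correlation-ID";

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        // Reuse the caller's ID if present, otherwise start a new one.
        String id = request.getHeader(HEADER);
        if (id == null || id.isBlank()) {
            id = UUID.randomUUID().toString();
        }
        MDC.put("correlationId", id);     // log pattern can print it via %X{correlationId}
        response.setHeader(HEADER, id);   // echo it back so callers can report it
        try {
            chain.doFilter(request, response);
        } finally {
            MDC.remove("correlationId");  // do not leak the ID to the next pooled thread
        }
    }
}

With something like this in every service, one ELK search on the ID stitches the whole request back together.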


🔥 Step 2: Open Distributed Tracing (Jaeger / Zipkin)

Trace ID: 8ab12c


The timeline diagram shows:

API Gateway → Order Service → Payment Service → Bank API (timeout)

👉 The slowest span is Payment → Bank API
👉 Bank API took 12 seconds
👉 But our timeout is 3 seconds

πŸ” Root Cause

Bank API is responding too slowly → causing Payment Service to fail → causing Order failure.
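
How did the Bank API call show up as its own span? Somewhere in the Payment Service, the outbound call is wrapped in a span. A sketch using the OpenTelemetry Java API; the tracer name, span name, and the callBankApi() helper are illustrative:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class BankApiClient {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("payment-service");

    public void authorize() {
        // A dedicated child span makes the external call visible in Jaeger
        // as its own bar on the timeline, separate from the parent request.
        Span span = tracer.spanBuilder("bank-api/authorize").startSpan();
        try (Scope scope = span.makeCurrent()) {
            callBankApi();                                   // hypothetical HTTP helper
        } catch (RuntimeException e) {
            span.setStatus(StatusCode.ERROR, "Bank API call failed");
            throw e;
        } finally {
            span.end();                                      // span duration = the 12s seen above
        }
    }

    private void callBankApi() { /* HTTP call to the bank, omitted */ }
}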


🔥 Step 3: Check Metrics (Prometheus + Grafana)

Graphs show:

  • Payment Service latency (p99) increased to 10 seconds

  • Error rate for Payment Service spiked to 65%

  • CPU & memory are normal (so it is not a resource issue)

This confirms:

👉 The bottleneck is the external Bank API, not our services.
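
Where does a p99 latency series like that come from? In a Spring Boot service it is usually a Micrometer timer scraped by Prometheus. A minimal sketch, assuming Micrometer; the metric name bank.api.latency is made up (Spring Boot's built-in http.client.requests timer often covers this out of the box):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class BankApiMetrics {

    private final Timer bankApiTimer;

    public BankApiMetrics(MeterRegistry registry) {
        // publishPercentiles(0.99) precomputes a p99 gauge that Prometheus
        // scrapes and Grafana graphs, which is the chart used in this step.
        this.bankApiTimer = Timer.builder("bank.api.latency")   // hypothetical metric name
                .publishPercentiles(0.99)
                .register(registry);
    }

    public void recordCall(Runnable bankApiCall) {
        bankApiTimer.record(bankApiCall);   // times the wrapped Bank API call
    }
}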


🔥 Step 4: Check Kubernetes Pod Health

Run commands:

kubectl get pods -n prod
kubectl describe pod payment-service-xyz -n prod
kubectl logs payment-service-xyz -n prod

Logs show:

java.net.SocketTimeoutException: Read timed out
Retrying...
CircuitBreaker: OPEN

πŸ” Meaning

The circuit breaker opened, so Payment Service started failing immediately to avoid cascading failures.
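
That is standard circuit-breaker behavior. As a sketch of how such a breaker might be configured, here is a Resilience4j example; the thresholds are illustrative, not the values from this incident:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;
import java.util.function.Supplier;

public class BankApiBreaker {

    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // open when 50% of calls fail
                .slidingWindowSize(20)                           // measured over the last 20 calls
                .waitDurationInOpenState(Duration.ofSeconds(30)) // stay open 30s, then probe again
                .build();

        CircuitBreaker breaker = CircuitBreaker.of("bankApi", config);

        // While OPEN, decorated calls fail instantly with CallNotPermittedException
        // instead of hanging on the slow Bank API, which is what the pod logs show.
        Supplier<String> guarded = CircuitBreaker.decorateSupplier(breaker, () -> "authorized");
        System.out.println(guarded.get());
    }
}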


🔥 Step 5: Reproduce the Issue Manually

We call the Bank API directly, timing the response:

curl -s -o /dev/null -w "%{time_total}s\n" "https://bankapi.example.com/authorize?amount=500"

Result:

11.230s (over 11 seconds for a single authorization call)

πŸ” Conclusion

Bank API is definitely slow.


🔥 Step 6: Check Recent Deployments / Changes

Use Git or CI/CD logs:

Payment Service v2.3 deployed 2 hours ago
Changes: Added encryption & validation → increased payload size by 40%

Possible reasons:

  • The 40% larger payload is creating a bottleneck at the Bank API

  • The new encryption increased processing time


🔥 Step 7: Apply a Temporary Fix

Quick mitigation:

  • Increase the Payment Service → Bank API timeout from 3s to 8s

  • Cut retries to a single attempt so failed calls do not pile up

  • Reduce the payload size (remove unneeded fields)

Apply config change:

timeout = 8000ms
retry = 1
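
In a Spring Boot Payment Service, that timeout change might translate into code roughly like this (a sketch assuming the 2.x-style RestTemplateBuilder API; the bean and class names are made up):

import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

import java.time.Duration;

@Configuration
public class BankApiClientConfig {

    // Raise the read timeout from 3s to 8s so slow-but-successful Bank API
    // responses are no longer cut off; keep the connect timeout tight.
    @Bean
    public RestTemplate bankApiRestTemplate(RestTemplateBuilder builder) {
        return builder
                .setConnectTimeout(Duration.ofSeconds(2))
                .setReadTimeout(Duration.ofSeconds(8))
                .build();
    }
}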

Redeploy:

kubectl rollout restart deployment payment-service

🔥 Step 8: Observe the System After the Fix

Metrics after fix:

  • Latency drops from 10s → 2s

  • Error rate drops from 65% → 1%

  • Circuit breaker closes

  • Order flow is successful again

Users can place orders again!


🧠 End-to-End Debugging Summary (LIVE EXAMPLE)

Step | What We Found                       | Tool
-----|-------------------------------------|--------------
1    | Payment timeout in logs             | ELK
2    | Bank API span slow                  | Jaeger
3    | Payment latency spike               | Grafana
4    | Timeout exception in pods           | kubectl logs
5    | Bank API slow when called manually  | curl
6    | Change in deployment                | Git / CI-CD
7    | Applied fix                         | Config update
8    | System stable                       | Grafana

πŸ† Final Real-World Root Cause

The Bank API became slow because the latest Payment Service deployment increased the request payload size by 40%.



