The Mushroom Cloud Effect or What Happens When Containers Fail?
Alois Mayr
Technology Lead Cloud & Containers
MesosCon Europe
August, 2016 @mayralois
about:me
• Helps bring monitoring to modern technologies
• Works with customers, partners and R&D
• Technology Lead for Cloud & Containers
about:dynatrace
• APM market leader that helps companies with digital transformation
• Founded in Austria back in 2005
• > 8000 customers across all industries
• Has seen many performance and stability problems and patterns out there
about:you
• Who of you runs or manages containers in production?
• Has running containers made your life easier?
• Thanks!
…there’s been the mushroom cloud effect
oh yeah, everything screwed up
Source: http://www.schoonoart.de/
TL;DR
About Cloud-Scale Systems
Important Aspects…
• Lots of (micro-)services
• Lots of communication between services
• Service dependencies
• Versioning and API compatibilities
• Zero downtime
Develop
Big monolithic application.
Small interconnected purpose-built services.
Pizza Box Teams
Small teams can deliver features into production
New Rules in the Game
"You build it, you run it." Werner Vogels, CTO Amazon
Ship
Deploy
Big-bang releases of single, specially built applications.
Small, continuous service delivery of standardized delivery blocks.
Compute
Hardwired datacenters.
Datacenter as an API.
New platforms to help run apps
• Most often container-based
• Clustered for scalability
• Ephemeral containers
• Resilient architecture
• Cross-AZ failovers
• SDN for communication
Deployments Are No Longer Static
7:00 a.m. Low load, service running with minimum redundancy
12:00 p.m. Service scaled up during peak load, with failover of a problematic node
7:00 p.m. Scaled back down for lower load, moved to a different geolocation
Anatomy of dynamic environments
https://www.dynatrace.com/en/ruxit/
All About (Service) Dependencies
Failing containers…
…may or may not have an (immediate) impact on service performance
Cascading Failures Lead to a Mushroom Cloud Effect
The Hungry Container Breakdown
What was the problem?
• Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management used for Docker environment
• Shared /logs partition ran out of space
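A shared partition filling up is something a host-level check can catch early. The following is a minimal sketch, assuming a Python-based check; the function names and the 90% threshold are hypothetical and not part of the talk. It relies only on the standard library's shutil.disk_usage.

```python
import shutil


def used_fraction(total_bytes, used_bytes):
    """Return the used share of a partition as a value between 0 and 1."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def partition_is_nearly_full(path, threshold=0.9):
    """Report whether the partition holding `path` is at or above `threshold`.

    The 0.9 default is an illustrative assumption, not a recommendation
    from the talk.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return used_fraction(usage.total, usage.used) >= threshold


if __name__ == "__main__":
    # "/logs" is the shared partition named on the slide; it may not exist
    # on your machine, so fall back to the current directory.
    for candidate in ("/logs", "."):
        try:
            print(candidate, partition_is_nearly_full(candidate))
            break
        except FileNotFoundError:
            continue
```

Run periodically on each cluster node, a check like this would flag the /logs partition before container health checks start failing.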
The Hungry Container Breakdown
How did the problem evolve over time?
• Container health checks failed
• Orchestration killed container and rescheduled new one
• Still no free space on /logs
• Termination and rescheduling
• /var/lib/docker ran out of space
• Cluster nodes were no longer able to run any containers
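The kill-and-reschedule loop above can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of any real orchestrator: it assumes every terminated container leaves a fixed amount of data behind in /var/lib/docker and that nothing cleans it up. All names and numbers are hypothetical.

```python
def simulate_reschedule_loop(docker_free_mb, leftover_per_container_mb, logs_free_mb=0):
    """Count reschedules until /var/lib/docker cannot hold another container.

    Toy model: as long as /logs has no free space, every new container
    fails its health check, gets killed, and is replaced, while the dead
    container's data stays on disk.
    """
    if leftover_per_container_mb <= 0:
        raise ValueError("leftover_per_container_mb must be positive")
    reschedules = 0
    while logs_free_mb <= 0 and docker_free_mb >= leftover_per_container_mb:
        # Health check fails because the app cannot write to /logs;
        # the orchestrator terminates the container and schedules a new one.
        docker_free_mb -= leftover_per_container_mb  # dead container is never removed
        reschedules += 1
    return reschedules, docker_free_mb


if __name__ == "__main__":
    # Hypothetical figures: 1000 MB free, 300 MB left behind per container.
    print(simulate_reschedule_loop(1000, 300))  # (3, 100)
```

The point of the model is that the original fault (no space on /logs) is never fixed by rescheduling, so each retry only consumes a second resource until the node can no longer run containers at all.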
The Hungry Container Breakdown
How did the problem affect services?
• Services at the top of the graph
• Increased failure rates
• Many dependent Tomcat and DB services affected
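How a single failing service spreads through a dependency graph can be sketched as a breadth-first walk over "who calls whom". The graph and service names below are hypothetical examples, not the environment from the talk.

```python
from collections import deque


def affected_services(calls, failed):
    """Return every service that directly or transitively calls `failed`.

    `calls` maps a caller to the list of services it depends on.
    """
    # Invert the graph: for each service, who depends on it?
    callers_of = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers_of.setdefault(callee, set()).add(caller)

    affected = set()
    queue = deque([failed])
    while queue:
        service = queue.popleft()
        for caller in callers_of.get(service, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    affected.discard(failed)  # only relevant if the graph contains a cycle
    return affected


if __name__ == "__main__":
    # Hypothetical topology for illustration only.
    calls = {
        "frontend": ["tomcat-orders", "tomcat-search"],
        "tomcat-orders": ["db-orders"],
        "tomcat-search": ["db-search"],
    }
    print(sorted(affected_services(calls, "db-orders")))
```

In this example a failure of db-orders reaches tomcat-orders and then frontend, while tomcat-search and db-search stay untouched.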
The Hungry Container Breakdown
How could the problem have been avoided?
• Log management tools for app logs
• --log-driver=none|syslog
• Remove container / clean-up jobs
• /var/lib/docker deserves its own partition
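On the application side, missing log rotation can be addressed with a size-capped log file. The sketch below assumes a Python application and uses the standard library's RotatingFileHandler; the logger name, size limit, and backup count are hypothetical values chosen for illustration.

```python
import logging
import logging.handlers
import os
import tempfile


def make_bounded_logger(log_path, max_bytes=1024, backup_count=2):
    """Create a logger whose files never grow beyond a fixed budget.

    At most `backup_count + 1` files exist, each smaller than `max_bytes`
    as long as single messages are shorter than `max_bytes`.
    """
    logger = logging.getLogger("bounded-app-log")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backup_count
    )
    logger.addHandler(handler)
    return logger


def total_log_bytes(directory):
    """Sum the sizes of all files in `directory`."""
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
    )


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    log = make_bounded_logger(os.path.join(log_dir, "app.log"))
    for i in range(1000):
        log.info("request %d handled", i)
    print(sorted(os.listdir(log_dir)), total_log_bytes(log_dir))
```

With a cap like this, a chatty container can use at most a known amount of space on a shared partition, which complements the host-level measures listed above.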
The Hungry Container Breakdown
Buggy Containers May Kill Your Nodes
Try to Break Your Clusters Early (and Be Prepared for Black Friday)