The Mushroom Cloud Effect or What Happens When Containers Fail?

Author: Merryl Anthony

Alois Mayr

Technology Lead Cloud & Containers

MesosCon Europe

August, 2016 @mayralois

about:me

• Helps bring monitoring to modern technologies
• Works with customers, partners and R&D
• Technology Lead for Cloud & Containers

about:dynatrace

• APM market leader that helps companies with digital transformation
• Founded in Austria in 2005
• More than 8,000 customers across all industries
• Has seen many performance and stability problems and patterns in the field

about:you

• Who of you runs or manages containers in production?
• Has running containers made your life easier?
• Thanks!

…there’s been the mushroom cloud effect

oh yeah, everything screwed up

Source: http://www.schoonoart.de/


TL;DR


About Cloud-Scale Systems


Important Aspects…

• Lots of (micro-)services
• Lots of communication between services
• Service dependencies
• Versioning and API compatibility
• Zero downtime

Develop

From a big monolithic application to small, interconnected, purpose-built services.

Pizza Box Teams

Small teams can deliver features into production


New Rules in the Game

"You build it, you run it." Werner Vogels, CTO Amazon

Ship

Deploy

From big-bang releases of single, special-built applications to small, continuous service delivery of standardized delivery blocks.

Compute

From hardwired datacenters to the datacenter as an API.

New platforms to help out in running apps

• Most often container-based
• Clustered for scalability
• Ephemeral containers
• Resilient architecture
• Cross-AZ failovers
• SDN for communication

Deployments are no Longer Static

7:00 a.m. Low load, service running with minimum redundancy

12:00 p.m. Scaled up service during peak load with failover of problematic node

7:00 p.m. Scaled back down to lower load, move to different geolocation


Anatomy of dynamic environments

https://www.dynatrace.com/en/ruxit/

All About (Service) Dependencies


Failing containers may or may not have an (immediate) impact on service performance

Cascading Failures Lead to a Mushroom Cloud Effect


The Hungry Container Breakdown

What was the problem?

• Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management used for Docker environment
• Shared /logs partition ran out of space
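
The root of the problem, a shared partition silently filling up, is something a simple free-space check can surface early. The following is a minimal sketch, assuming Python 3 on a Linux host; the function names and the 10% threshold are illustrative assumptions, not something shown in the talk.

```python
import shutil


def free_fraction(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / usage.total


def low_on_space(path: str, min_free_fraction: float = 0.10) -> bool:
    """True when less than `min_free_fraction` of the partition is free.

    The 10% default is an arbitrary, illustrative threshold.
    """
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    # /logs and /var/lib/docker are the paths named on the slides; "/" is
    # included so the sketch also runs on hosts without those mount points.
    for mount in ("/logs", "/var/lib/docker", "/"):
        try:
            print(f"{mount}: {free_fraction(mount):.1%} free, "
                  f"low={low_on_space(mount)}")
        except OSError:
            print(f"{mount}: not available on this host")
```

Run periodically (for example from a cron job or a monitoring agent), a check like this would raise a flag well before the orchestrator starts killing containers.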


The Hungry Container Breakdown

How did the problem evolve over time?

• Container health checks failed
• Orchestration killed the container and scheduled a new one
• Still no free space on /logs
• More terminations and rescheduling
• /var/lib/docker ran out of space
• Cluster nodes were no longer able to run any containers


The Hungry Container Breakdown

How did the problem affect services?

• Services at the top of the graph
• Increased failure rates
• Lots of dependent Tomcat and DB services affected
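
To see why services at the top of the graph suffer, it helps to walk the dependency graph upward from the failed component. The sketch below assumes a hypothetical call graph (all service names are made up for illustration) and uses a breadth-first search over the inverted edges; it is not the tooling used in the talk.

```python
from collections import deque

# Hypothetical call graph: each service maps to the services it calls.
CALLS = {
    "web-frontend": ["tomcat-orders", "tomcat-catalog"],
    "tomcat-orders": ["orders-db", "log-writer"],
    "tomcat-catalog": ["catalog-db"],
    "orders-db": [],
    "catalog-db": [],
    "log-writer": [],
}


def affected_by(failed: str, calls: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively calls `failed`."""
    # Invert the edges so we can walk from a callee up to its callers.
    callers: dict[str, list[str]] = {name: [] for name in calls}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    seen: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen
```

In this made-up graph, `affected_by("log-writer", CALLS)` returns `tomcat-orders` and `web-frontend`: a failure deep in the graph reaches the user-facing service at the top.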


The Hungry Container Breakdown

How could the problem have been avoided?

• Log management tools for app logs
• --log-driver=none|syslog
• Container removal / clean-up jobs
• /var/lib/docker deserves its own partition
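
On the application side, the missing log rotation can be addressed in a few lines. This is a minimal sketch using `RotatingFileHandler` from Python's standard library; the logger name, file path, size cap and backup count are illustrative assumptions, and the applications in the talk may well have used a different stack.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_app_logger(log_path: str, max_bytes: int = 10 * 1024 * 1024,
                     backups: int = 3, name: str = "app") -> logging.Logger:
    """Create a logger whose file rolls over at `max_bytes`.

    At most `backups` rotated files (log_path.1, log_path.2, ...) are kept,
    so total disk usage stays bounded at roughly (backups + 1) * max_bytes.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(log_path, maxBytes=max_bytes,
                                  backupCount=backups)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With a bounded log footprint per application, a shared partition can still fill up if enough applications write to it, which is why the slide also recommends proper log management and a dedicated partition for /var/lib/docker.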


The Hungry Container Breakdown

Buggy Containers May Kill Your Nodes


Try to Break Your Clusters Early (And be Prepared for Black Friday)


Break Your Clusters Early

Massive load testing!

Include everything: services, containers, orchestration, EC2 instances

Survive three days of pain

Testing everything

13.3k containers (+nodes), 3,451 services

Automation Needed to Pinpoint the Root Cause of Cascading Failures!
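
One simple way to automate this is a heuristic over the dependency graph: among all services currently reporting failures, the likely root causes are those whose own dependencies are healthy. The sketch below assumes a hypothetical call graph and failure set; it illustrates the idea only and is not a description of any particular monitoring product.

```python
# Hypothetical call graph: each service maps to the services it calls.
CALLS = {
    "web-frontend": ["tomcat-orders", "tomcat-catalog"],
    "tomcat-orders": ["orders-db", "log-writer"],
    "tomcat-catalog": ["catalog-db"],
    "orders-db": [],
    "catalog-db": [],
    "log-writer": [],
}


def root_causes(failing: set[str], calls: dict[str, list[str]]) -> set[str]:
    """Return the failing services that do not call any other failing service.

    Services that fail only because something they call is failing are
    filtered out, which leaves the bottom of each failure cascade.
    """
    return {
        service
        for service in failing
        if not any(dep in failing for dep in calls.get(service, []))
    }
```

For the made-up failure set `{"web-frontend", "tomcat-orders", "log-writer"}`, the function returns only `log-writer`, the bottom of the cascade. Real systems need more than this (timing, partial failures, shared infrastructure such as a full disk), but the graph walk is the core of the idea.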


Thank you!

How do you know if a failing container impacts your apps?

See you tomorrow at Mesosphere’s booth at 12:50 pm?