Chaos is the New Normal

Testing is at the same time a simple and a complex subject. Testing is something basic: if you don't test, how do you know it works? How do you know your software won't stop working after some refactoring?

Unit tests are the most basic level of testing you can do. In the past, unit testing was the norm: when we delivered software, we needed to deliver unit tests with some level of coverage. When we talk about coverage, things can get tricky, since you might end up writing tests for parts of your application where they add little or no value whatsoever.

Coverage can be questionable, as your language might force you to write fewer or more tests. For instance, if you have a strongly typed language where you enforce as much as you can via the compiler, you might need to do less testing, since the compiler works in your favor. As you work with more dynamic and weakly typed languages, you might need to write more tests, since the compiler catches fewer things.

As much as the language affects how many tests you write, I can say the same, or even more, about software architecture. How we run our software today greatly affects what kind of strategies we should consider to test our software.

From Monolith to Microservices

Today microservices are the de facto standard architecture style. Microservices have effects on your software. There are lots of benefits when we do microservices, like:

Independence: Different things can happen at the same time.
Different teams can use different technologies and versions
Easier to scale
Easier to maintain and evolve since there are different and independent code bases
Isolation: Each microservice has its own:

Database
Operating System / VM
Configuration
Release Process

However, microservices are not a free lunch. There are several drawbacks or issues you need to address with microservices which were not present before, like:

How do we do joins? There is no central DB. The need for Streaming.
Infrastructure Complexity: Microservices require DevOps Engineering
The need for Observability: Health Checks, Centralized Logs, and Distributed Tracing.

There is one big effect of microservices: we moved from a CENTRALIZED solution to a DISTRIBUTED one. As we have distribution, we will have way more FAILURE. Failure will happen and we need to design for failure. That's something very hard to do later. As we need to design for failure, we also need to test for failure, right? Yes. We need to test our microservices in a different way. That's why we need chaos engineering.

Chaos: Different Strategy

When we are doing unit testing, integration testing, and contract testing, we are testing our application (microservice) but we are not testing the infrastructure. Cloud-native microservices imply that we run code in the cloud. How do we know that our infrastructure can survive failures? There are many kinds of failures, for instance:

A machine can have no more CPU available
A machine can have no more MEMORY available
A machine can have no more DISK available
A machine can have no more NETWORK available
An instance can be terminated at any time
An AZ can go down
A Region can go down

Is our software ready to deal with all these kinds of failures? These failures will eventually happen, and you might learn about them at the worst time and in the worst way possible. So it's better to test before they happen. This makes chaos the new normal. If chaos is the new normal, could this be the new Definition of Done? So now when we finish a story, we can say we need to complement our testing strategy with chaos testing. If you are not doing this at the story level, you should be doing it at least at the delivery level. So you need to have some kind of Production Ready Checklist where applying chaos is one critical item.

Netflix's Simian Army (https://github.com/Netflix/SimianArmy) and Chaos Monkey (https://github.com/Netflix/chaosmonkey) are chaos tools that can help you simulate some of these scenarios. However, you need very few things to create this chaos yourself. You can use most Linux APIs and AWS APIs to generate it.
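
To make the "very few things" point concrete, here is a minimal sketch of a home-made chaos monkey in Python. It is not how Chaos Monkey itself is implemented. The names `pick_victim` and `unleash_chaos` and the in-memory set of instance ids are assumptions made up for this illustration; in a real setup the `terminate` callback could wrap a cloud API call such as boto3's `terminate_instances`.

```python
import random
from typing import Callable, Optional, Sequence


def pick_victim(instance_ids: Sequence[str], rng: random.Random) -> str:
    """Choose one instance at random to terminate."""
    if not instance_ids:
        raise ValueError("no instances to choose from")
    return rng.choice(list(instance_ids))


def unleash_chaos(
    instance_ids: Sequence[str],
    terminate: Callable[[str], None],
    probability: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Terminate one random instance with the given probability.

    Returns the terminated instance id, or None if nothing was terminated.
    """
    rng = rng or random.Random()
    if rng.random() >= probability:
        return None  # this round, the monkey stays quiet
    victim = pick_victim(instance_ids, rng)
    terminate(victim)
    return victim


# Hypothetical in-memory "cloud" so the sketch runs without any credentials.
# Assumption: with boto3, terminate could call
# ec2.terminate_instances(InstanceIds=[victim]) instead of set.discard.
running = {"i-aaa", "i-bbb", "i-ccc"}
killed = unleash_chaos(sorted(running), running.discard, rng=random.Random(42))
print(f"terminated {killed}, still running: {sorted(running)}")
```

Keeping `terminate` as a plain callback means the same function can be exercised in a unit test first and pointed at real infrastructure later.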

Don't forget Network testing

Chaos testing is not everything. You need to do more. When you have several microservices calling each other over a network, several things can happen. For instance:

The network might completely fail
You may get 20% extra lag/latency
What happens if your call never returns? Will the code hang?
What happens if the response is completely messed up (corruption)?
What if the response is too big (a 10MB string, for instance)?

Would your code be ready to deal with these scenarios? Well, there is only one way to know: doing some kind of network failure testing. For some of these network failures you can use tools like Toxy (https://github.com/h2non/toxy).
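
As a rough illustration of the "call never returns" case, the sketch below uses only the Python standard library rather than a proxy tool. A local server completes the TCP connection but never answers, and the client is expected to give up after a timeout instead of hanging. The function names `start_silent_server` and `call_service` and the fallback value are assumptions invented for this example.

```python
import socket
from typing import Tuple


def start_silent_server() -> Tuple[socket.socket, int]:
    """Open a local TCP port that completes connections but never replies."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)  # we never call accept(), so no response is ever sent
    return server, server.getsockname()[1]


def call_service(port: int, timeout_s: float, fallback: str) -> str:
    """Call the service and return a fallback if it does not answer in time."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as sock:
            sock.sendall(b"ping\n")
            return sock.recv(1024).decode()
    except OSError:  # socket timeouts are a subclass of OSError
        return fallback


server, port = start_silent_server()
result = call_service(port, timeout_s=0.2, fallback="fallback-value")
print(result)
server.close()
```

Without the `timeout` argument the `recv` call would block indefinitely, which is exactly the hang this kind of test is meant to catch.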

What about Databases?

Is your database infrastructure ready for failures? I'm really into open source. However, I know that when I use NoSQL databases like Cassandra, Redis, or ElasticSearch, for instance, I need to have great automation in place, not only for deployments but for operations as well. What if a Cassandra node dies? Would the cluster recover? Who would spin up new instances? Do you have everything under an ASG? Well, we can only know if we test it. The same kind of chaos testing we do for microservices we need to apply to all databases that are managed by us. When I say managed by us, I mean any hosted service that is not cloud-vendor managed.

You might go even further and test the database itself. Do you know if your database delivers what it promises you? Is your database strongly consistent? Is the DB losing data or not? Well, there is chaos testing for databases. Aphyr has been doing this for a while with Jepsen (https://aphyr.com/tags/jepsen), and it's open source, so if your DB is not there you can add it: https://github.com/jepsen-io/jepsen.

Assertions on Chaos

JUnit has an assertion class with several methods to help you check that certain properties are as you expect. How can we do that with chaos, network, and provisioning testing? For provisioning, you can use ServerSpec (http://serverspec.org/) and check if your infrastructure is in place. For network failures, as you use Toxy, you can code with JUnit or any other testing framework, since you will do a remote call and you expect your code to survive. In order to survive, it's important to code basic concepts like:

Circuit Breaker (provided by Netflix OSS Hystrix https://github.com/Netflix/Hystrix)
Timeouts - Hystrix or any good request lib has them.
Retries - at least 3 times? Or use a Retry Budget if you are latency sensitive.
Fallback to other AZs and regions (Hystrix + Ribbon (https://github.com/Netflix/ribbon))
Error Observability: If you fail, log it and send a metric somewhere.
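
The concepts above can be sketched in a few lines of Python. This is a hand-rolled illustration under simplifying assumptions, not Hystrix's API or behavior: the `CircuitBreaker` class, the `with_retries` helper, the thresholds, and the fake remote calls are all made up for the example.

```python
import time
from typing import Callable, Optional


def with_retries(fn: Callable[[], str], attempts: int = 3) -> str:
    """Call fn up to `attempts` times and re-raise the last error if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as err:
            last_error = err
    assert last_error is not None
    raise last_error


class CircuitBreaker:
    """Opens after N consecutive failures and fails fast until a cool-down passes."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the remote call entirely
            self.opened_at = None  # cool-down passed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()  # error observability would log or count here
        self.failures = 0
        return result


calls = {"count": 0}


def broken_remote_call() -> str:
    """Hypothetical remote call that always fails."""
    calls["count"] += 1
    raise ConnectionError("service unavailable")


breaker = CircuitBreaker(failure_threshold=3, reset_after_s=30.0)
results = [breaker.call(broken_remote_call, lambda: "cached-default") for _ in range(5)]
print(results, "remote calls made:", calls["count"])

attempts_seen = {"count": 0}


def flaky_remote_call() -> str:
    """Hypothetical remote call that succeeds on the third attempt."""
    attempts_seen["count"] += 1
    if attempts_seen["count"] < 3:
        raise TimeoutError("too slow")
    return "ok"


retry_result = with_retries(flaky_remote_call, attempts=3)
print(retry_result)
```

In the first run the breaker stops calling the broken service once the threshold is reached, so the remaining callers get the fallback immediately instead of waiting on a dependency that is known to be down.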

For Simian Army cases, you actually should not care if a machine dies or a microservice stops working. You need to worry about whether you have disrupted the service for the final users, or whether the failure is perceived by the final user. A great way to test this is using a stress test tool like Gatling (http://gatling.io/), because then you can simulate a number of users, let's say 10k concurrent users per minute, and then you can run the Simian Army and see if a machine, AZ, or region dying will affect the users or not.
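
The idea of asserting on what the user sees, rather than on which machine died, can be sketched with a small in-process simulation. This is not Gatling or the Simian Army: the `Replica` class, the failover function, and the request counts are assumptions chosen only to show the shape of the assertion.

```python
import random
from typing import List


class Replica:
    """Hypothetical service replica that fails every request once killed."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True

    def handle(self, request_id: int) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}:ok:{request_id}"


def send_with_failover(replicas: List[Replica], request_id: int) -> str:
    """Try each replica in turn and fail only if all of them are down."""
    for replica in replicas:
        try:
            return replica.handle(request_id)
        except ConnectionError:
            continue
    raise ConnectionError("all replicas are down")


def run_load(replicas: List[Replica], total_requests: int, kill_at: int,
             rng: random.Random) -> float:
    """Send requests, kill one random replica part-way, return the error rate."""
    errors = 0
    for request_id in range(total_requests):
        if request_id == kill_at:
            rng.choice(replicas).alive = False  # the chaos event
        try:
            send_with_failover(replicas, request_id)
        except ConnectionError:
            errors += 1  # this is what the final user would perceive
    return errors / total_requests


replicas = [Replica("az-a"), Replica("az-b"), Replica("az-c")]
error_rate = run_load(replicas, total_requests=1000, kill_at=500, rng=random.Random(7))
print(f"user-visible error rate: {error_rate:.2%}")
```

With three replicas and failover, losing one replica mid-run leaves the error rate at zero; running the same load against a single replica would show the failure reaching the user.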

Software has changed a lot in the last 5 years. Software architecture changed, runtimes changed. So your tests need to change as well, otherwise you will not be ready for what you are building. Chaos testing requires discipline, but keeping the user experience great and not increasing latency much pays off at the end of the day.

Cheers,

Diego Pacheco
