Chaos Engineering: Not so Chaotic

by | 20.06.2021 | Engineering

Cloud computing and DevOps can feel complex when we talk about them at length. Yet many things that look complicated turn out to be simple once we understand the underlying concepts. Today, we will discuss one such topic that sounds complex but is simple: chaos engineering.

When we think of cloud computing, we think of servers, data, zones, regions and so on. These things also bring inevitable headaches such as network outages and power failures. Chaos engineering comes to the rescue by helping systems withstand those failures. So, if you are new to chaos engineering, you have come to the right place. Without further ado, let’s roll.

What is Chaos Engineering?

In short, chaos engineering is the discipline of experimenting on a system to find out whether it keeps working under disturbance. The experiments build confidence that the system can withstand events such as network or power failures.

With chaos engineering, we intentionally break parts of a system to see how it performs when an individual component fails. Specific stresses are applied to uncover potential outages, locate weaknesses and improve the system’s resiliency. We can also simulate traffic spikes and other unpredictable situations.

How Does Chaos Engineering Work?

Now that we have seen what chaos engineering is, let us find out how it works. Several steps are involved, and we are going to look at each of them.

Defining Steady-State Hypothesis

The first thing we have to do is define the system’s steady state, that is, a measurable picture of its normal behaviour, and form a hypothesis about where complications can occur. Then, we inject that failure into the system and wait for the outcome.
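To make the idea concrete, here is a minimal sketch of a steady-state hypothesis written as code. The class name, the metric name "error_rate" and the 1% threshold are assumptions chosen for illustration, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyStateHypothesis:
    """A measurable statement of what 'normal' looks like for one metric."""

    metric: str    # hypothetical metric name, e.g. "error_rate"
    lower: float   # lowest acceptable value
    upper: float   # highest acceptable value

    def holds(self, observed: float) -> bool:
        """Return True when the observed value is inside the tolerated range."""
        return self.lower <= observed <= self.upper


# Assumed hypothesis: under normal load, at most 1% of requests fail.
hypothesis = SteadyStateHypothesis(metric="error_rate", lower=0.0, upper=0.01)

print(hypothesis.holds(0.004))  # True: within tolerance
print(hypothesis.holds(0.050))  # False: the steady state is violated
```

Writing the hypothesis down as a range, rather than a vague expectation, is what lets us later say whether an experiment passed or failed.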

Simulating Real-World Events

Simulating real-world events means, in simple terms, testing the system with realistic scenarios, such as a server crash or a network outage, and monitoring how it performs under those stressful circumstances.
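A small sketch can show what injecting a failure looks like in practice. Everything here is hypothetical: fetch_profile stands in for a call to a downstream service, and the 30% failure rate is an arbitrary choice. Real tools inject faults at the infrastructure level, but the principle is the same.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real outage, so it can be told apart from real bugs."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that a fraction of calls fail with InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"simulated outage in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_profile(user_id):
    """Hypothetical stand-in for a call to a downstream service."""
    return {"id": user_id, "name": "example"}


def fetch_profile_with_fallback(fetch, user_id):
    """The behaviour under test: degrade gracefully when the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return {"id": user_id, "name": None, "degraded": True}


rng = random.Random(42)  # seeded so the experiment is repeatable
flaky_fetch = with_fault_injection(fetch_profile, failure_rate=0.3, rng=rng)

results = [fetch_profile_with_fallback(flaky_fetch, i) for i in range(1000)]
degraded = sum(1 for r in results if r.get("degraded"))
print(f"{degraded} of {len(results)} calls hit the injected fault")
```

The point of the experiment is not the fault itself but the fallback: every call still returns an answer, even when the dependency is down.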

Confirming the Steady-State

This step is straightforward: we note the changes that occurred during the experiment and compare them with the steady state, which gives us insight into the system’s behaviour.
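As a rough sketch, confirming the steady state can be as simple as comparing the error rate before and during the fault. The observations below are made up for illustration, and the allowed increase of one percentage point is an assumption.

```python
def error_rate(outcomes):
    """Fraction of failed requests; outcomes is a list of booleans (True = success)."""
    if not outcomes:
        raise ValueError("no observations collected")
    return outcomes.count(False) / len(outcomes)


def steady_state_confirmed(baseline, during_fault, max_increase):
    """True when the error rate rose by no more than max_increase during the fault."""
    return error_rate(during_fault) - error_rate(baseline) <= max_increase


# Illustrative, made-up observations: True is a successful request.
baseline = [True] * 995 + [False] * 5        # 0.5% errors before the fault
during_fault = [True] * 960 + [False] * 40   # 4.0% errors while the fault is active

print(steady_state_confirmed(baseline, during_fault, max_increase=0.01))  # False
```

Here the error rate rose by 3.5 percentage points, so the hypothesis is rejected and we have found a weakness worth fixing.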

Collecting Metrics and Observing Dashboard

In this step, we collect metrics (basically, the system’s performance) by observing the dashboard. These metrics let us measure the failure against our hypothesis, and improving them over time is what ultimately determines customer success.
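The sketch below shows one way raw request samples could be reduced to the handful of numbers a dashboard displays and then checked against limits. The sample latencies, the error count and the limits are all invented for illustration.

```python
import statistics


def summarise(latencies_ms, errors):
    """Reduce raw request samples to the few numbers a dashboard would show."""
    total = len(latencies_ms)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=100)[94]
    return {
        "requests": total,
        "error_rate": errors / total,
        "mean_ms": statistics.fmean(latencies_ms),
        "p95_ms": p95,
    }


# Made-up samples: most requests are fast, a tenth are slow during the fault.
latencies = [120.0] * 90 + [450.0] * 10
metrics = summarise(latencies, errors=3)

# Assumed limits taken from the steady-state hypothesis.
limits = {"error_rate": 0.01, "p95_ms": 300.0}
breaches = {name: metrics[name] for name in limits if metrics[name] > limits[name]}

print(metrics)
print("breached limits:", sorted(breaches))
```

Looking at the 95th percentile as well as the mean matters here: the mean latency looks healthy, while the slowest requests clearly are not.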

Making Changes and Fixing Issues

After running the experiment, we will have an idea of what needs to change, and we can improve the system. In addition, we can now identify what would lead to an outage or break the system.

Process of Chaos Engineering (Source: Medium)

Advantages of Chaos Engineering

Reducing Failures

Chaos engineering helps reduce system failures by using experiments to test whether the system keeps running during network outages and other failures.

Improving Durability

As chaos engineering reduces failures, it improves the system’s overall durability, because the system is hardened to tolerate unpredictable conditions.

Improving Service Availability

Chaos engineering gives us new insights into an application. These help us improve the service, keep it available during network chaos and leave room for future improvements.

Preventing Revenue Loss

Chaos engineering helps prevent revenue loss because it identifies the likely causes of an outage or failure in advance, before they cost the business money.

Lower Maintenance Costs

Chaos engineering helps lower maintenance costs because it lets us test the system before running it in production, so that it behaves better under unpredictable conditions.

Disadvantages of Chaos Engineering

Takes an Extended Amount of Time

One of the significant weaknesses of chaos engineering is that it is a long process. That can cost a company a lot of time, because it eventually holds back the deployment of an application.

Trial and Error Process

To gain proper insight into an application through chaos engineering, we have to run many fault-injection experiments, which lengthens the process. Moreover, we have to repeat them again and again to check all the possibilities.

Antifragility

Not all applications can withstand having parts of their system intentionally broken, and as a result, we can’t apply chaos engineering to every application.

Tools of Chaos Engineering

Now, let us look at some of the tools we can use to practise chaos engineering.

Chaos Monkey

Chaos Monkey is the original chaos engineering tool, created at Netflix in 2010. It is still a go-to chaos testing tool.

Gremlin

Gremlin is one of the most popular chaos testing tools. Its free version helps in simulating high CPU load and shutting down machines.

Chaos Toolkit

Chaos Toolkit is an open-source initiative that makes chaos testing more accessible. It has an open API and a standard JSON format for describing experiments.
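To give a feel for that JSON format, the sketch below builds a small experiment in Python and prints it as JSON. The field names follow the Chaos Toolkit experiment format as we understand it, so check them against the current documentation before relying on them. The health URL and the container name are hypothetical.

```python
import json

# Hypothetical service details, chosen only for illustration.
HEALTH_URL = "http://localhost:8080/health"
CONTAINER_NAME = "my-service"

experiment = {
    "version": "1.0.0",
    "title": "Service stays healthy when one container stops",
    "description": "Stop one container and check the health endpoint still answers.",
    # What 'normal' looks like: the health endpoint answers with HTTP 200.
    "steady-state-hypothesis": {
        "title": "Health endpoint returns 200",
        "probes": [
            {
                "type": "probe",
                "name": "health-endpoint-responds",
                "tolerance": 200,
                "provider": {"type": "http", "url": HEALTH_URL},
            }
        ],
    },
    # The fault we inject.
    "method": [
        {
            "type": "action",
            "name": "stop-one-container",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": ["stop", CONTAINER_NAME],
            },
        }
    ],
    # How we put things back afterwards.
    "rollbacks": [
        {
            "type": "action",
            "name": "start-container-again",
            "provider": {
                "type": "process",
                "path": "docker",
                "arguments": ["start", CONTAINER_NAME],
            },
        }
    ],
}

experiment_json = json.dumps(experiment, indent=2)
print(experiment_json)
```

Saved to a file such as experiment.json, a document like this is what the chaos run command takes as input.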

Pumba

Pumba is a chaos testing and network emulation tool for Docker.

Litmus

Litmus is a chaos engineering tool for stateful workloads on Kubernetes.

Value Proposition of Chaos Engineering

From the discussion above, we can now understand how chaos engineering works. Businesses should consider implementing chaos engineering before deploying their applications.

As we have seen, chaos engineering requires failure injection, so it suits applications built to withstand that kind of experimentation. Large businesses build applications that need more machines and data, and those applications can benefit greatly from chaos testing. So, chaos engineering is highly advisable before deploying big applications, which stand to lose the most during an outage or failure.

Small and mid-cap businesses can try out chaos engineering with free tools, as their applications will be comparatively smaller and budget can be an issue.

Final Thoughts

We now know the fundamentals of chaos engineering, and we have also discussed how to use and implement it. If you ask me, I would chaos test my application with open-source tools, because that pays off in the long run and gives my application stability during an outage. Hopefully, we have learned a lot about chaos engineering and will apply it too.

Happy Learning!
