Managing Chaos and Improving Reliability with Gremlin

Dana Kohut
Published in theprimeview
6 min read · May 18, 2021

The Prime View had the privilege of sitting down with Matthew Fornaciari, CTO and Co-Founder of Gremlin Inc., a company that built a robust enterprise Chaos Engineering platform.

In the interview, we talked about common misconceptions about Chaos Engineering, how JP Morgan Chase benefited from Gremlin’s solutions, and a major update to Gremlin’s platform that treats services as first-class citizens.

Watch & Listen to the Full Interview

Matthew, how did your experience as an engineer at Amazon impact your decision to start a Chaos Engineering company?

Many tech experts work at Amazon and learn about how systems operate at scale, and then go out into the world and do exciting things with it. My co-founder, Kolton Andrus, and I learned many interesting things about systems and operating systems at scale. We got to see and dig into some interesting ways that things fail, and we learned that engineers’ assumptions about infrastructure and system reliability don’t always hold true.

We decided to take that knowledge and build it into not only a tool and a product, but also a company and a practice. Chaos Engineering is the practice of proactively seeking out failures before they bite you in production and before they affect your customers. Because at the end of the day, the last place you want to find out about bugs and downtime is at the customer level.

Gremlin is pioneering the practice of Chaos Engineering. Most of our listeners are familiar with the term, but can you explain what it’s really all about from your perspective?

The practice of Chaos Engineering is not new at all. It goes way back to the times when people were in caves trying to figure out what works and what doesn’t. It’s all about proactively testing things out. But in terms of computing, the concept goes back to Jesse Robbins, whose nickname was Master of Disaster. He was at Amazon long before me. Jesse created the GameDay project to increase the reliability of Amazon’s website by purposefully creating failures on a regular basis. His approach was later adopted by many large companies. And eventually, we realized that this was a practice that we wanted to democratize, to give to the masses.

Chaos Engineering is dedicated, specific testing where you inject chaos and inject failures, but it’s not chaotic testing. It’s very scientific. You approach it with a hypothesis, with acceptance criteria and rejection criteria, and you try to validate or reject the assumptions you have about your system.
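To make the hypothesis-driven approach described above concrete, here is a minimal Python sketch of how such an experiment could be structured. It is an illustration only and does not use Gremlin’s API: the `Experiment` class, the `run_experiment` function, the toy in-memory `system`, and the error-rate thresholds are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]


@dataclass
class Experiment:
    """A hypothetical container for one hypothesis-driven chaos experiment."""
    hypothesis: str
    inject_failure: Callable[[], None]   # introduces the failure
    rollback: Callable[[], None]         # always restores the system
    measure: Callable[[], Metrics]       # observes the system
    accept: Callable[[Metrics], bool]    # acceptance criteria
    reject: Callable[[Metrics], bool]    # rejection criteria


def run_experiment(exp: Experiment) -> str:
    # Do not inject failure into a system that is already unhealthy.
    if not exp.accept(exp.measure()):
        return "skipped"
    exp.inject_failure()
    try:
        observed = exp.measure()
    finally:
        exp.rollback()  # restore the system even if measuring fails
    if exp.reject(observed):
        return "rejected"
    if exp.accept(observed):
        return "validated"
    return "inconclusive"


# Toy in-memory "system" standing in for real infrastructure (assumption).
system = {"replicas_up": 3}


def measure() -> Metrics:
    # Pretend the error rate rises as replicas go down (made-up numbers).
    rates = {3: 0.00, 2: 0.01, 1: 0.10}
    return {"error_rate": rates.get(system["replicas_up"], 1.0)}


experiment = Experiment(
    hypothesis="Losing one of three replicas keeps the error rate under 5%",
    inject_failure=lambda: system.update(replicas_up=system["replicas_up"] - 1),
    rollback=lambda: system.update(replicas_up=3),
    measure=measure,
    accept=lambda m: m["error_rate"] < 0.05,
    reject=lambda m: m["error_rate"] >= 0.20,
)

result = run_experiment(experiment)
print(result)
```

The point of the sketch is the shape of the experiment rather than the numbers: the hypothesis is written down first, the failure is injected deliberately, the rollback always runs, and the outcome is judged against criteria that were fixed before the experiment started.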

Gremlin offers a way to build out that practice and build out the muscle around Chaos Engineering in terms of a scientific method. Gremlin gives you a safe, secure, and straightforward way to run experiments in the real world, in production, staging, and development, wherever you want to validate your assumptions about your system.

How do you convince new clients who aren’t aware of Chaos Engineering concepts that they actually need it?

We’re working to help customers understand that it’s not chaos for the sake of chaos. It’s chaos in the name of improving your reliability, your uptime, creating a better experience for your customers, and making sure that they’re using chaos as a tool.

Technology drives that paradigm shift from the reactive culture that we’ve had in operations to a proactive one: going out and finding things that can cause failure before they cause downtime and before they cost your company money. Also, we’ve seen a general change in public sentiment, where downtime is not acceptable anymore. The expectation is that you are always up, always available, and there’s always a good customer experience. So, different things have come to prominence that have driven this move from the reactive to the proactive approach to enterprise systems’ reliability.

Matt, I hear Gremlin is making a significant update to its platform, and now you’re going to treat services as first-class citizens. Tell me a bit about the decision. How did the rise of cloud and DevOps impact it?

We are making a big move, and the primary motivating factor for making services first-class citizens is to meet our customers where they are. Our customers think about their architecture and their systems in terms of services. They don’t necessarily think about the underlying infrastructure or how it’s running. Our customers can go straight to Gremlin, look at services A, B, and C, and run attacks directly without the additional cognitive load of having to understand whether a service is running in Kubernetes, on EC2 in AWS, or in GCP.

We wanted to make sure that we were replicating our customers’ mental model in Gremlin and give them a home base for each of these services from a reliability standpoint.

At Gremlin, we wanted to give customers the ability to tie in different mechanisms to see all of the attacks run on a particular service, regardless of where it’s hosted or how it’s operating.

So, it’s a kind of one-stop-shop for enterprise systems reliability?

The plan is to ensure that customers have a home base to come in and look at the runbooks, with everything they need to make their services more reliable. And ultimately, as you increase the reliability of all those individual pieces, you increase the system’s reliability as a whole. At Gremlin, we’re well-positioned to help companies get up to speed on reliability, regardless of where they’re hosting or what they’re doing. We are cloud-agnostic. We ultimately want to help you make your services more reliable.

I see Gremlin helped JP Morgan Chase to automate compliance. Could you please elaborate more on this case and the benefits delivered to JP Morgan Chase by Gremlin?

We’ve seen quite a bit of adoption from financial firms. Many banks are leaning into being more reliable; they want to be able to withstand different failure modes, especially as we’ve seen this increasing trend towards online and cloud-based solutions.

With JP Morgan Chase, we’ve been building out a library of failure modes that they’re resilient to. And as their services move from on-prem to the cloud, they’re passing this bar; they’re being put through this test to make sure that they are resilient to X, Y, and Z before they’re able to run in production in the cloud. It’s been an extraordinary experience and a remarkable partnership to work with them.
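The idea of a bar that services must pass before running in production can be sketched as a simple gate. The Python below is a hypothetical illustration, not a description of how JP Morgan Chase or Gremlin implement it: the failure-mode names stand in for the unnamed "X, Y, and Z", and the service names and test results are invented for the example.

```python
from typing import Dict, List, Set

# Placeholder failure modes standing in for "X, Y, and Z" (assumption).
REQUIRED_FAILURE_MODES: Set[str] = {
    "instance_loss",
    "network_latency",
    "dependency_outage",
}


def missing_failure_modes(passed: Set[str],
                          required: Set[str] = REQUIRED_FAILURE_MODES) -> List[str]:
    """Return the required failure modes a service has not yet passed."""
    return sorted(required - passed)


def can_promote(passed: Set[str]) -> bool:
    """A service clears the bar only when nothing required is missing."""
    return not missing_failure_modes(passed)


# Invented results: which failure modes each service has been shown to survive.
results: Dict[str, Set[str]] = {
    "payments": {"instance_loss", "network_latency", "dependency_outage"},
    "statements": {"instance_loss"},
}

for service, passed in results.items():
    if can_promote(passed):
        print(f"{service}: cleared for production")
    else:
        print(f"{service}: blocked, still missing {missing_failure_modes(passed)}")
```

Keeping the required set in one place is what makes this a library of failure modes: adding a new failure mode to the set raises the bar for every service at once.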

What do you see in the future for Gremlin? Can you give us a sense of where you hope to take the company over the next 5 years?

We started off the conversation talking about Chaos Engineering, but towards the end, we talked a lot more about reliability. So, I think reliability is the long-term goal of Gremlin. Chaos Engineering is a fantastic tool, but it is only one tool in the reliability tool belt. We can integrate with different solutions that provide a much more robust reliability experience. Ultimately, the reason customers hire Gremlin is to help them become more reliable, and there are many different vectors along which we can improve our product and our offering and provide more guidance to our customers.

Three Quick Facts about Chaos Engineering

Chaos Engineering is not chaotic. It’s a scientific method to verify resilience hypotheses about software systems;

Chaos Engineering saves time. Without Chaos Engineering, you would spend more time troubleshooting and fixing production incidents;

Chaos Engineering saves costs. By applying Chaos Engineering tools, you can optimize resource allocation and test for non-essential infrastructure.

Matthew, thank you for the grand vision of transforming enterprise systems and infrastructure reliability. We do believe that you and the whole Gremlin team are poised to do big things!

Stay tuned for more great interviews coming your way!

Originally published at https://theprimeview.com.
