Event Driven Operations - An Example in AWS

A lot has been written about Event Driven Architecture and it is hard to point to a single great resource. Therefore, to save space on my blog (and preventing it from becoming a book), I will just share the following resources for those who would like to familiarize themselves with the concept and theory:

As you will see, virtually all the resources focus on this pattern from an application perspective. RedHat and AWS hints toward an operational or support context, but overall I found very little solid references that focus on operational contexts for those of us who have to manage infrastructure and cloud infrastructure.

As we move toward Infrastructure as Code I think it is worthwhile for Support Engineers, DevOps and SRE engineers to also better understand the concept of event driven architecture and patterns, as we also are starting to produce more and more code (and logic). By the way, I specifically link to the Red Hat resource here, as I believe they have a really short but accurate description of the two dominant types of IaC at the moment: declarative and imperative.

What I want to explore in this post is an example of how anyone that has to look after the operations of a platform can leverage the event driven pattern in their operational support model. I will focus on AWS, as this is where my experience currently lies - but the principles can very well be applied in almost any environment - and more on that at the end of this post.

Example Scenario

For this specific blog post, I am not too concerned about IaC - I assume it is something that you already do - at least to some extend. If you are not yet on the IaC boat, I would suggest you learn about it as it will change your life! However, your environment, choice of tools etc. will largely dictate if you lean toward a declarative or an imperative IaC strategy. It could even be both! However, since I will be focusing on AWS and use SAM for the examples, you have to keep in mind that this approach is, in my opinion, a little bit of a mix between the two. AWS themselves will define CloudFormation as declarative while tools like SAM leans more toward the definition of imperative - even though the final output is CloudFormation templates.

So, without going to deep into these differences any further, I will look at AWS as an example and focus on how we can know that something we defined as IaC changes state - for whatever reason. The idea will be to send a properly formatted notification - standardized to our own specification, to something that can collate all this data. I am not going to propose a solution for how this is finally collected - there are many options and each organization have their own mix. However, the concept and examples up to that point should be fairly portable for AWS users.

In AWS we deploy virtual infrastructure as service resources. AWS has hundreds of services, like EC2 (virtual machines), ELB (Application and Network Load Balancers) etc. During a deployment, resources will be created. For example, in the EC2 service, resources like virtual machines, block storage devices, network interfaces etc. are created and all of these resources are required for the resource to work properly.

From a service perspective, each of the resources in the services can be in different states at different times and in general there are states for a resource being provisioned, then one for it being in a running and stable state and finally a state for when a resource is being de-provisioned (stopped, deleted or whatever other terminology is fitting).

In AWS, when resources changes state, many of such services that the resources belong to can emit an event. This event can be handled by another AWS Service known as the AWS EventBridge. You will see from the service description that the service is really pitched once again at application solutions. However, if you look at the conceptual diagram (copy below), you will clearly see AWS Services as an event source:

AWS EventBridge from DrawIO

The AWS EventBridge can get very involved, but from our perspective we are only interested when a resource in AWS changes state. We want to get informed about it and be able to react to it.

In my example, I will demonstrate how we can get the basic pieces put together to capture AWS service events, transform the even message to a standard form and finally deliver that standardized message in a way that can be consumed by virtually any other third party application.

At this point, it is also important to know which services and resources is capable of producing events. A list of supported services is available in the AWS documentation for EventBridge. As can be seen in the diagram above, AWS service events are directed to the default event bus.

In the example scenario we will look at the EC2 service and we see for this service, the following states will produce events for an EC2 instance:

Some of these states are transition states or temporary. In a transition state, the service may sometimes appear to be still operating normally. This depends a lot on the actual service and how it is configured. For example, an EC2 instance in the shutting-down state may still have a working SSH service and if you time it right, you may even still be able to log into the instance while in this transitioning state - however short lived your session may be!

Solution Design

Thinking about our design: goals, tools, patterns etc.

Coming up with an event driven solution can be hard. It takes some time to identify all the various components required and you also have to force yourself to think about potentially other scenarios in which each microservice could be used. The idea is to create services that can operate independently, each reacting to events.

I probably sneaked in the term microservice there, but at the end of each event, the consumer is typically a microservice. Each of these microservices has very specific tasks, and we will delve deeper into that later.

Another consideration is specifically around the environment the solution will be deployed in. In this case, the target is AWS and the example solution will therefore be specifically tailored for this environment.

Therefore, the AWS EventBridge, as already mentioned, is more or less our starting point. The Default Event Bus will receive events that we configure and then we have to do something with that event. The core AWS components at play will be:

A note about SNS Fan-Out: This is a very powerful feature and in our example, we will use it for a hypothetical scenario where the processed event notification is sent to two different SQS queues: one will be intended for a system based consumption like a monitoring and alerting system which requires the message in a computer readable format, like CSV, while the other queue is intended for message delivery to humans in a more readable format. In the latter example, we will transform the CSV data to a more human readable format.

Finally, for the service to be extended you will probably have to deal with a variety of source event message structures. Each piece of infrastructure have their own message body and statuses. It is therefore useful to consider how messages can be converted into a standard form in order for all downstream processing to deal with only one message format. This is of course again only an option and the example will illustrate a very basic concept of this, but taking an initial source message and condense it into a simple three column CSV message, where column one indicates the source Infrastructure class that generated the event (for example aws.ec2), column two holds the AWS ARN of the resource and finally column three holds a processed form of the state/event. In the example, the Lambda functions only process EC2 events and if you experiment with added more resource types to the event bridge, you will have to consider adding logic to handle those message formats properly as well.

Solution Design Diagram

In conclusion, a possible solution could take the following form - which is also the template solution used for our example:

example solution design

The diagram has a number of components, and we can now start to examine each of them. A typical flow, as should be obvious, is from top-to-bottom and left-to-right.

The Overall Process

The entire diagram represents an Infrastructure Event Notification Process. Once some event occurs in our infrastructure, a hypothetical monitoring system as well as humans will be interested in some or all of these events.

Initially, only EC2 events are configured and the various Lambda functions more-or-less expects only these types of events.

Certain components, like the EmailNotificationSender Lambda Function, is very generic and could potentially be re-used in many other processes. In terms of project organization, you might want to keep this specific microservice in a dedicated project where it can be maintained independently of other functions making up this Infrastructure Event Notification Process. This is why I like to start with diagrams like this: it allows us to identify these special scenarios and allow us to think what to do with them. In a simple example like I have here, and with no other processes, the specific service is kept in a single project.

Event Generation/Source

The AWS Infrastructure produces events as the infrastructure in question changes state. For example, an EC2 instance previously in a running state is terminated and is now in a terminated state.

In the example project, the EventBridge will be configured to handle only EC2 events for any EC2 state and these will be send to AwsServiceEvents SNS topic.

The First Layer: Process Initial Source Message and Produce a Standard Message

The AwsServiceEvents SNS topic will immediately forward all messages to the AwsServiceEvents SQS Queue. There is an AWS Lambda function called AwsServiceEventHandler which will consume SQS messages and attempt to produce a standardized message. The standardized message will be send onto the OrgResourceEvents SNS Topic.

I have chosen the specific name to make it more generic. From this point, we could handle messages from any other environment or cloud service. This design allows for a level of decoupling where we can now also re-use layer 2 and 3 by other non-AWS infrastructure resources that also produce state events that are of interest. For example, you can now consider channeling some of your on-premise infrastructure events (like the trusty old SNMP Traps) to AWS to all be handled now in a consistent way. This, by the way, can be a strategy forming part of a data center migration to the cloud exercise and allows you to implement an interim step to first consolidate the management of all infrastructure events across your entire estate.

The Second Layer: Consume and Process the Standard Message

In this layer we now have a standard message on the SNS topic and here is our first example of a fan-out strategy that we can employ with SNS. More specifically, we now want to split our message for consumption of two very different end-users: one is expected to be a "machine" and will process the standard message as-is which the second end-user is expected to be a human that needs a more human like message - perhaps with a simplified version of the state of a service.

The standard message (which is also the machine readable version) will be send to the OrgResourceEvents SQS Queue. Nothing is consuming this queue at the moment, so messages will just build up until the reach their expiry. This level of decoupling will also allow another team to independently work on the consuming part - and all they have to know is the message structure and the ARN of their queue. Therefore, our implementation also need to make the queue binding information available and this is typically achieved by CloudFOrmation Outputs.

There is also a second OrgResourceEventsMessaging SQS Queue which is consumed by the OrgResourceEventNotificationSender Lambda Function and who's purpose is to convert the standard message into a human readable version that can be send by downstream services. The resulting message will be published to the OrgResourceEventsNotification SNS Topic.

The Third Layer: Sending Human Readable Messages

The final layer could potentially be maintained by a completely independent team and also in a separate project, as mentioned before. In this example, the OrgResourceEventsNotification SNS Topic sends the message directly to a Lambda function that can send the e-mail message to SES. In a similar way you could add additional Lambda functions that can integrate with many other services, like Slack, MS Teams or Zoom (and many many others).

The decision to directly invoke an AWS Lambda function is driven be the fact that these message are not considered critical - if they are never send, it is not the end of the world. The enterprise monitoring solution, however, is the source of truth of the real time status of our Infrastructure and integration to such a system requires a more durable solution.

Other Considerations

What happens if the Standard Message needs to change?

The example is really simplified, but each of the topics and queues can be versioned in various ways. For this specific solution I would prefer completely separate topics and queues for each message version. For example, the OrgResourceEvents SNS Topic can have a second version later called OrgResourceEventsV2 SNS Topic. In turn the OrgResourceEventsV2 SNS Topic will now publish messages to the queues OrgResourceEventsV2 SQS Queue and OrgResourceEventsMessagingV2 SQS Queue and the teams implementing the Lambda function and/or other systems consuming these messages can over time integrate their solutions to consume from these queues.

This approach allows for decentralized and decoupled teams to work each at their own pace without any hard dependencies.

Also note that at this stage, the lambda function that starts to consume the OrgResourceEventsMessagingV2 SQS Queue can very well still produce version one messages for the OrgResourceEventsNotification SNS Topic - but now taking into account a new message structure from upstream events. In this scenario, the Lambda function sending e-mails is completely oblivious to the changes in the layers above it.

I need more data...

Sometimes the data contained within a message is very limited and you may want to enrich the data - especially when sending as an e-mail. For example, the message contains only the resource ARN and in this case an Instance ID. For a human that will read the message an instance ID may mean nothing. So perhaps in this case you may want to query the EC2 instance and add some information available from it's tags, like the instance name tag value (if set).

Talking about tags, tags can also be considered for other processing logic for example resources with a certain tag can be ignored.

Tagging can also be used in more advanced message routing options by tagging each resource with a department or business unit owner. Depending on the value of such a tag, different topics and queues can be targeted to meat different needs from these resource owners.

Whatever the case, the lambda function may have to be extended to query the specific resource to obtain the required data in order to add to the message. Keep in mind that such additional functionality may require you to consider the following:

Finally, more api calls will also incur more costs and however negligible this may be it all adds up in the end.

Implementation

AWS has a fairly large collection of example SAM templates in their serverless patterns repository. I used these to get started quickly with my own example repository for this exercise.

The specific examples I used included the following:

The deployment notes are in the example repository

There are some shortcomings of the example implementation and it is worthwhile discussing some of them briefly:

Testing

Post deployment, remember to set the sender and receiver e-mail addresses in the EmailNotificationSender Lambda Function Environment Variables. If the account still uses the SES sandbox, the e-mail address(es) needs to be verified.

Some example test scenarios:

Each time one of these events occur, you should receive an e-mail, similar to the one shown below:

email

Because there is no application consuming messages in the OrgResourceEvents SQS Queue, there should also be a message buildup, as can be seen below:

SQS queues

Further Important Points

Seeing the demonstration in action is great, but keep in mind that examples like these just scratch the surface and they are only there to describe a concept in a practical and simple way that anyone could test out and play with. There are a number of other points to consider when moving from this example to a full solution, including:

Wrapping up and conclusion

Event Driven Architecture can be really useful in IT operations where support engineers need to be made aware of certain events that occur in the environment. Cloud services like AWS provide all the tools required to use event based patterns to accomplish exactly this task. Other cloud services also have their own implementations and you can consult other sources, like this excellent blog post (from Steef-Jan Wiggers).

Having said that, I don't think it's a complete replacement for more traditional monitoring techniques that actively discover and probe infrastructure. I have delved into some of the reasons for this, and ultimately you will have to decide how you want to use the this information to best monitor your own environment. On the other side, I also briefly discussed how these event driven techniques can be great for automated security interventions and other scenarios where automation can really free up a lot of the support team's time.

Infrastructure is literally becoming just another object that can be manipulated in software. As infrastructure and support engineers deal with more cloud technologies and Infrastructure as Code, we also have to consider more patterns, designs and principles from the software development domain.

I hope you enjoyed this post as much as I had fun implementing the example and writing the post! I am always interested to hear your opinions/thoughts, so please feel free top engage with comments, pull requests or whatever other medium works for you.

Tags

infrastructure, iac, event driven, patterns, architecture, aws, azure, redhat, cloud, support, devops, sre