Monitoring Asynchronous Applications

Amy Boyle @amylouboyle

background from subtlepatterns.com

Async == Asynchronous


Roadmap

Software Engineer @

I am a software engineer at New Relic. New Relic provides multiple monitoring products, most notably APM.

Recently I was working on gathering Async performance metrics for Python. This turned out to be a non-trivial task, so I thought I'd write a talk to share with y'all some of the things I learned.

There is some confusion on the internet as to what the terms surrounding concurrency mean. So I'm going to define them for you, and provide an explanation to serve as a framework for understanding the rest of this talk.

The synchronous paradigm is the traditional way of doing things: executing only one task at a time, one after the other.

A concurrent program is one that is able to work on multiple tasks in the same period of time.

Parallel execution means running different tasks in the same instant of time. For this we must have multiple CPU cores.

Asynchronous is interleaved task execution. We can alternately run chunks of different tasks.

A talk about async wouldn't be complete without an analogy

Let's say you have a litter of puppies. Each puppy needs to be fed and cuddled.

https://flic.kr/p/NCCT1 spilled food

Meli Lewis said I could use her photo

Synchronous

Explain diagram.

If you were to do this in a totally synchronous fashion, you would give a puppy her food, stand there and wait for her to finish, then clean up. Then you'd do the same for the next puppy, and the next, and so on. Then you would pick up and cuddle each puppy in turn as well.

Parallel

Parallel is if you had friends that could feed and cuddle the puppies, one each, all at the same time.

Asynchronous

Asynchronous fashion would allow you to give a puppy her food, and while she is eating you can move on to another puppy and give her her food. Since cuddling is an active task, you can't do anything else while you cuddle a puppy.

Human = CPU

Puppy = Task

Feeding = I/O call

Cuddling = CPU intensive code

Examples will be in Pseudocode Python

Mostly because Python is my favorite language, but also because pseudocode looks like Python. My hope is that even if you don't know Python, you'll be able to follow the examples. I'll also be using the Tornado web framework because it has an elegant API that allows for concise examples, where we'll be able to stay at a high level.

I am taking an example-based approach.

If you are familiar with advanced python, you may notice that I have sacrificed best practices/safety for the sake of simplicity. Do not copy and use this code as is.

There are also some places where I'm going to gloss over the details. My goal here is to outline the fundamental concepts, not show python tricks.

Asynchronous code yields execution to other pieces of code

This is a coroutine
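A minimal sketch of what a coroutine looks like, using asyncio from the standard library rather than Tornado; the names here are stand-ins, not code from the slides:

```python
import asyncio

# `await` is the point where this coroutine yields execution back to the
# event loop, so other tasks can run while the "feeding" I/O completes.
async def feed_puppy():
    await asyncio.sleep(0.01)  # stand-in for a non-blocking I/O call
    return 'fed'

async def handler():
    # execution is yielded here until feed_puppy's I/O finishes
    result = await feed_puppy()
    return result

print(asyncio.run(handler()))  # fed
```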

class EventLoop:

    def __init__(self):
        self.queue = []

    def add_callback(self, callback):
        self.queue.append(callback)

    def run(self):
        while True:
            this_round = self.queue.copy()
            self.queue = []
            for callback in this_round:
                callback()

This is single threaded so we don't need locks

Asynchronous code yields execution to other pieces of code

What this demonstrates is the ability to wait on multiple I/O calls for the same task at the same time.

For each handler we have 10 I/O calls to make (in this case 10 HTTP fetches). In the synchronous case we have to wait for each fetch to return before we start the next one, and this is a blocking wait: nothing else is able to run while the I/O is taking place.

In the async handler we can fire off each fetch and then wait for them all to complete at the same time. While we are waiting, we can also execute any other code that needs to run.
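The overlapping-waits pattern can be sketched with asyncio; `fetch` here is a pretend HTTP call, not real network code:

```python
import asyncio

async def fetch(url):
    await asyncio.sleep(0.01)  # pretend this is a non-blocking HTTP fetch
    return url

async def async_handler():
    urls = ['http://example.com/%d' % i for i in range(10)]
    # fire off all ten fetches, then wait for them together;
    # the ten waits overlap instead of adding up
    responses = await asyncio.gather(*(fetch(u) for u in urls))
    return responses

responses = asyncio.run(async_handler())
print(len(responses))  # 10
```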

Winning!

This supports multiple long-lived connections to a user

Most common manifestation is event loops

Now that we know what an async app is, and why we should use it, I'm going to talk just a bit about monitoring it.

First of all, what do I mean when I say monitoring?

Collecting data on your app in production

Monitoring is collecting and processing data about your application as it is running.

Not Profiling

  • high overhead
  • critically, for async, doesn't give context

What data are we going to collect?

Some of the most common types of data we would want to monitor are timing data, throughput, and error rate. Since they're not specific to async, we're not going to look at throughput or error rate in this talk.

Monitoring is a VERY large topic, not covering most of it here

I'm going to focus on what is specific to asynchronous apps

Your users should not be your monitoring system

Why should you have monitoring in the first place? I'm hoping if you're at this talk, I don't need to sell you on the importance of monitoring in general.

If your app is a source of income (or pride), and it's broken, you're losing money/sleep. So we'd prefer to find problems sooner, ideally before they become a real problem for our users.

By broken I mean not just totally unavailable or throwing errors.

Performance matters. Slow websites erode the sanity of your users.

Visualize your data in a way that is consumable

Tailing a log file is not monitoring

Visualize your data in a way that is consumable

Why is my website slow? hint: it's the database

Our example app

import time

import tornado.ioloop
from tornado.web import Application, RequestHandler

def cuddle():
    # pretend to do processing
    time.sleep(0.01)

class ASyncRequestHandler(RequestHandler):

    async def get(self):
        future = feed_puppy()  # pseudocode: returns a Future for an I/O call
        await future
        self.finish('Pup is full!\n')
        cuddle()

app = Application([('/', ASyncRequestHandler)])

if __name__ == '__main__':
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()

What to Measure

We may not always want to, or be able to measure all of these.

Response Time

Duration

CPU time

External Time
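A hypothetical stopwatch distinguishing the measures above: duration (wall clock), CPU time, and external time (duration minus CPU time, i.e. time spent waiting on I/O). This is an illustration, not code from the talk:

```python
import time

class Stopwatch:
    def __init__(self):
        self.start_wall = time.time()          # wall-clock start
        self.start_cpu = time.process_time()   # CPU-time start

    def stop(self):
        duration = time.time() - self.start_wall
        cpu = time.process_time() - self.start_cpu
        return {'duration': duration,
                'cpu': cpu,
                'external': duration - cpu}  # time spent waiting, not computing

sw = Stopwatch()
time.sleep(0.05)  # pretend we were awaiting an external call
metrics = sw.stop()
print(metrics['duration'] >= metrics['external'])  # True
```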

Aggregate and collect data in monitor service

I've created a simple plotting service which I can post data to, and it will use Bokeh to chart the data.

This part is not async specific

I do get to use an async HTTP client library to send data to my monitoring service, which keeps overhead low.
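A sketch of reporting metrics without blocking the response. `send_metrics` is a stand-in for an async HTTP POST (the talk uses an async HTTP client for this); nothing here is real network code:

```python
import asyncio

async def send_metrics(payload):
    await asyncio.sleep(0.01)  # pretend to POST to the monitoring service
    return payload

async def handler():
    metrics = {'response_time': 0.05}
    # schedule the POST as a background task instead of awaiting it
    # inline, so reporting adds almost no latency to the response
    task = asyncio.create_task(send_metrics(metrics))
    response = 'Pup is full!'
    await task  # here only so the sketch exits cleanly
    return response

print(asyncio.run(handler()))  # Pup is full!
```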

Percentiles are better than the mean

times.sort()
index95 = int(len(times)*0.95)
data_point = times[index95]

If your web service has a mean latency of 100ms, your top 1% of requests may take 5 seconds. This is a bad user experience on its own if that is a stand-alone service. However, with today's trend toward microservice architectures, if several such services are needed to render a page, the 99th percentile of one backend may become the median response of what the user experiences.

To the demo!

CPU intensive tasks are bad news for async architecture

Now that we've seen an example of what gathering this data looks like, I want to talk more generally about how to collect async data: how we can apply async metric gathering broadly across our app.

Strategies for a general solution

Of course, for reduced effort, but also reduced insight into our app, we can monitor just response time using an outside service that watches requests and responses.

Like I showed in the example, we can embed stopwatches into our code directly.

If we have a lot of async code and this becomes tiresome, or we want to decouple our monitoring from our service, we can put the monitoring code in its own modules or functions and patch that into our app.
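One way to decouple the stopwatch code is a decorator that can be patched onto existing functions; `timed` is a hypothetical helper for illustration, not part of the talk's source:

```python
import functools
import time

def timed(metric_store, name):
    # wrap fn with a stopwatch and record elapsed time into metric_store
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                metric_store[name] = time.time() - start
        return wrapper
    return decorator

metrics = {}

@timed(metrics, 'cuddle')
def cuddle():
    time.sleep(0.01)  # pretend to do processing

cuddle()
print('cuddle' in metrics)  # True
```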

Challenge of a general solution:

Keeping track of callbacks

class ASyncRequestHandler2(RequestHandler):

    def get(self):
        feed_puppy2(callback=cuddle_pup)

The example I used before used Python coroutines, which allow us to yield execution in the middle of a function. This kept all our timer data neatly in one place.

This may not always be the case. In other languages, or when using 3rd party libraries, an async function may take a callback that it will execute when it finishes.

class ASyncRequestHandler2(RequestHandler):

    def get(self):
        feed_puppy(callback=self.cuddle_pup_wrap)

    def cuddle_pup_wrap(self, *args, **kwargs):
        start = time.time()
        cuddle_pup()
        self.cuddle_time = time.time() - start

In order to include execution time data for this function, we can wrap it in our own function that simply calls the original function surrounded by a stopwatch.

Multiple callbacks?

class ASyncRequestHandler2(RequestHandler):

    def get(self):
        self.cuddle_time = 0
        feed_puppy(callback=self.cuddle_pup_wrap)
        feed_puppy(callback=self.cuddle_pup_wrap)
        feed_puppy(callback=self.cuddle_pup_wrap)

    def cuddle_pup_wrap(self, *args, **kwargs):
        start = time.time()
        cuddle_pup()
        self.cuddle_time += time.time() - start

However, if we have several async calls, which one will be done first? How do we know when to stop collecting and process our data? In this case our "general solution" becomes necessary to get any data at all.

Keeping Track of the pieces

  1. Create an object to hold metrics
  2. Pass it around via wrapper code
  3. Have condition for when done

Critical Path

class ASyncRequestHandler2(RequestHandler):

    @tornado.web.asynchronous
    def get(self):
        self.things_done = 0
        feed_puppy(callback=self.thing_done)
        do_dishes(callback=self.thing_done)
        roomba_room(callback=self.thing_done)

    def thing_done(self):
        self.things_done += 1
        if self.things_done == 3:
            fight_crime()
            self.finish('All chores done')

Looking back at this example where we had multiple external calls: we saw before how to keep track of the CPU time, which we know is important for throughput.

I'm going to depart from our puppy analogy a bit and introduce some other IO calls.

Thinking about this task by itself, what if one of the 3 chores we are doing in this task is really slow? We want to know which particular call(s) is the limiting factor for us to finish. This is called the critical path.

Critical Path

from functools import partial

class ASyncRequestHandler2(RequestHandler):

    @tornado.web.asynchronous
    def get(self):
        self.things_done = 0
        self.metrics = Metrics('io1', 'io2', 'io3')
        feed_puppy(callback=partial(self.thing_done, 'io1'))
        do_dishes(callback=partial(self.thing_done, 'io2'))
        roomba_room(callback=partial(self.thing_done, 'io3'))
        self.get_done = time.time()

    def thing_done(self, name):
        self.things_done += 1
        self.metrics[name] = time.time() - self.get_done
        if self.things_done == 3:
            fight_crime()
            self.finish('All chores done')
            self.metrics.process()

Use tools to help you

This is starting to get really complex. There are many open source and commercial tools out there to help you do this, or just do it for you.

You will probably not be trying to get all the data that I have outlined here. I have outlined the building blocks of what kinds of metrics you want to look out for.

How To Monitor Async

Slides/Source on Github: github.com/boylea/monitoring_async_talk

@amylouboyle
