Friday, February 8, 2013

Statistical Methods for Performance Engineers

A common use case for performance engineers is to determine the effect of a system change relative to an established baseline. If the effect is large and obvious then our work here is done. More often than not, however, effects are not obvious and we are left wondering whether to endorse the change or not. Statistical methods were, in essence, developed to enable us to make statements regarding the behavior of a system (population) given a small glimpse (sample) of how it behaved at one time or another.
Consider 1,000 response time samples collected over an extended period from our web application. This constitutes our performance baseline. We make changes to our application and conduct a quick test, collecting 100 response time samples. The histograms below show how the response times are distributed in each case. The new configuration has some outliers at the higher end of the scale but a narrower distribution around the peak than the baseline. To compare the distributions more equitably our strategy is to repeatedly re-sample the baseline data, calculating the mean of each sample. By the Central Limit Theorem (CLT), this so-called sampling distribution of the sample mean will approach a Gaussian distribution as more and more samples are taken from the baseline data.

As per the CLT, the resulting Gaussian will have the same mean as the original distribution (our baseline data) and variance equal to the original divided by the sample size. In this case our sample size is limited by the size of the test sample i.e. 100. The sequence of graphs below show how our Gaussian distribution forms as the number of bootstrap samples increases, converging on a mean ($\mu$) of 0.767 and standard deviation ($\sigma$) 0.0146.

Gaussian in hand, we can leverage the properties of this well-known distribution to make statements regarding our system. The curve can be used to determine the likelihood of our baseline system of generating samples that fall within a specific range. In this case the test mean is 0.793, higher than the baseline mean, so we want to determine the probability of baseline samples being greater than 0.793. In the figure below, this accounts for the shaded area under the right tail which can be calculated using Z-tables or widgets like this. In our case, this area amounts to 0.0367 (3.7%), which is interpreted as the probability that our baseline system would produce a sample like the test. This is very weak support for the notion that our changes have not regressed system performance.

Loosely speaking, we have conducted a one-sided hypothesis test and the principle applies equally for test samples lower than the baseline mean. In such a test we make the null-hypothesis that there has been no change to the system and then determine the strength of evidence that our test results would occur by chance alone i.e. no change to the system. If this support falls below a selected threshold (5% is common), we reject the null-hypothesis.

If one wanted to improve the accuracy of our estimates above, we would need to decrease the variance of our distribution of the sample mean. In the Bayesian sense, this variance represents the uncertainty in our estimate of the actual mean from the data we have available. For the most accurate estimate, we should maximize the sample size since in the sampling distribution of the sample mean, variance is proportional to the inverse of the sample size. Therefore, in order to improve the accuracy of our estimates, we need to collect more test data with the new configuration. Intuitively this should make sense.

Below we show how the variance of the sampling distribution of the sample mean decreases with sample size. These graphs illustrate location and shape only and probabilities (y-axis) should be ignored.

So there you have it: although we do not have a definitive answer regarding the effect of our changes to the system, we have made some steps towards quantifying our uncertainty. This might not seem like a lot to engineers accustomed to micrometers and 3-point precision but it's a fundamental step towards controlling a given process. Uncertainty is inevitable in any system we choose to measure and an important qualifier for any actionable metrics. Ask Nate Silver.

In this example we used web server response time as a metric but this analysis applies equally to any system metric. Data representing our two systems was collected by pointing curl at two different websites:

The python code for this example is available as an IPython notebook here

Friday, October 12, 2012

Where did my Tweet go ? Tracking retweets through the social graph.

D3 visualization of a tweet propagation for @OReillyMedia

Everybody tweets these days but not all tweets are created equal, and  neither are tweeters for that matter. One of the truest measures of a tweet and, by proxy, the tweeter, is how far into the social-graph the message propagates via retweeting. Retweeting is an explicit indication of engagement on the part of the retweeter and in most cases, an endorsement of the message and the original author. That all said, it is an interesting exercise to see where in the social-graph a particular tweet reached. The Twitter API makes it easy to tell how many times and by whom a message was retweeted but it takes a bit more legwork to determine the path taken to the recipients.

A simple method to follow the propagation of a tweet is to do a breadth-first traversal of followers links, starting at the message author, until all retweeters have been accounted for. Obviously there are some assumptions wrapped up in this methodology but for the most part evidence supports the results. The Python script below performs this walk through the social graph. For economy against the Twitter API, the script caches follower lists in a Redis server so that they may be re-used for subsequent runs. This scheme works best when examining tweets which are closely related and incorporate many of the same Twitter users.

For visualization purposes, the Python script outputs a JSON file for consumption by a D3 force-directed graph template. D3 expects nodes and links enumerated in separate lists, the link elements making reference to the node elements via node list indices. A sample graph is shown above, visualizing  the path of a tweet from @OReillyMedia. Twitter users are indicated by their avatars and a grey circle with radius proportional to the logarithm of the number of followers. The originator of the message is indicated with a red circle. The graph title gives the text of the tweet, the overall retweet count, and the number of users reached by the message (sum of everyone's followers).

While the ability to gather broad insight with this method is limited by Twitter API rate controls, it could be used to do a focused study on a specific Twitter user, looking for prominent social-graph pathways and individuals that warrant reciprocation. Failing that, the D3 transitions as the graph builds and stabilizes makes fascinating viewing.

Friday, September 28, 2012

Web Request Rate Analysis with R

During performance testing it's easy to find raw data relating to the response time of a web service or RPC. This information is typically recorded in log files like the server access log or in some form or another in the load-generating client logs. Response time statistics are an important operational parameter but need to be qualified with the request rate to give context to the metric. I use a simple R script generate request rate graphs from Apache JMeter output (.jtl files) but the concept can easily be extended to any format which includes timestamped entries, preferably labelled somehow to distinguish one service type from another.

The basis of the request rate calculation is calculating and inverting the inter-arrival time of requests. If the generation of requests is independent then the request generator follows a Poisson Process and inter-arrival times will follow an exponential distribution. Generally requests directed at a web site will not be entirely independent as a user will be guided by the site navigation constraints, but the approximation lends itself to modelling of the service using queueing theory. One of the simplest of such models is shown below, an M/M/1 queue, which models the queue length in a system having a single server, where arrivals are determined by a Poisson Process and job service times have an exponential distribution.

Moving on with less esoteric endevours, each request from the Jmeter client is recorded as a separate XML element an a jtl file. Request parameters are recorded as attributes on the element. The attributes we are interested in are ts (epoch timestamp) and lb (service label). The shell command below turns the jtl file into a format that R can read into a dataframe. This includes adding column headers to save us having to define these within the R script. These shell commands are the part you would change if importing data from another source.

Now we run the data file generated above through the R script below. The script begins by loading the data into a dataframe and sorting based on timestamp. We can't take for granted that log entries generated in a buffered multi-threaded environment will be chronologically interleaved. We then iterate over the dataframe, calculating the inter-arrival time between each request and its predecessor. This value is added as a new column v to the dataframe. We then invert and scale the inter-arrival times and plot over time to give us the overall request rate per minute. We then filter the data on service labels and perform the same exercise to get a service-specific request rate breakdown. We have used a locally weighted smoothing function, adjusting the span factor to suit our granularity requirements.

A sample graph is shown below. For this test run we used a rate-controlled request generator with a small ramp, accounting for the initial rise and bounded oscillation for each service. Below that is a density plot of the inter-arrival time for this sample data, showing that it does indeed look exponential, as postulated by the Poisson process model.

Friday, September 21, 2012

Apache Web Stats from Access Logs

Apache access logs represent a wealth of information regarding who/what is hitting your web site/service and how you are handling the traffic. There are many tools available to digest this information and generate useful insight, but just in case you have an itch to do it yourself, here's a bit of scripting to get you on your way. I used venerable old sed(1) to pull out the fields we are interested in and R to generate bar graphs, showing the breakdown of hits by source IP address and agent headers.

The first step is to parse the access logs to make it suitable for  import into R. The exact parsing pattern will depend on the Apache LogFormat definitions and which one is active for the access log. In this case we are assuming the combined format ...

apache2.conf: LogFormat "%h ....." combined

%...h:          Remote host
%...l:          Remote logname
%...u:          Remote user 
%...t:          Time, in common log format time format 
%...r:          First line of request
%...s:          Status.  
%...0:          Bytes sent
%...D:          Response time
%{VARNAME}e:    The contents of the environment variable VARNAME.
%{VARNAME}i:    The contents of VARNAME in request header. 
%{VARNAME}n:    The contents of note VARNAME from another module.

Using sed we extract the remote host address, request path, and first word of
User-Agent header. Output format is space-separated columns: ip path agent

sed -n 's/^\(\S*\) - - \[[^]]*\] "\(GET\|HEAD\) \(\/\S*\) [^"]*" [^"]*"-" "\(\S*\) .*$/\1 \3 \4/p' access.log > access.dat

Next step is to parse and graph in R using the script below. The script imports the normalized log records into a dataframe and then aggregates by ip and agent to generate bar graphs of the hits, broken down by source address and agent-user header repectively.  The aggregates are ordered by count and, for presentation purposes only, truncated after the top 40 classes. Because we do not take timestamps into account, multiple log files can be concatenated in any order and processed together. Below are sample output graphs. Note the prominent representation of Mozilla/5.0 user agent is somewhat misleading. For simplicity sake, the sed expression only extracts the first word of the user agent header which has the effect of grouping together Googlebot, Yandexbot, TweetedTimes, Firefox, and Safari, among others.

Friday, June 15, 2012

Bootstrapping an EC2 Hadoop Instance with a User-Data script

Amazon AWS is great for MapReduce jobs. You can take advantage of the AWS Elastic MapReduce service and load and transform data directly from/to S3 buckets, or you can roll your own and create a Hadoop image in EC2 and spin up instances as required.

 There are two ways to create a Hadoop image: you can customize a raw instance and once you have it working the way you would like, take a snapshot of the volume. Further instances can be created by cloning the snapshot. Obviously we need a way to differentiate instances into NameNodes, DataNodes, JobTrackers, and TaskTrackers. The easiest way to do this is probably to set up an initialization script in your image which reads instance user data/tags and customizes the configuration accordingly. User data/tags can be set when the instance is created, either through the AWS console or ec2-api command line.

Alternatively, a startup script can be supplied to a raw instance that will be executed when the instance initializes and configure the instance from scratch. User scripts can be injected via the user-data field in the AWS console or ec2-api command line. How the script is handled exactly depends on the operating system image selected. Official Ubuntu images from Canonical are configured to check the first two characters of the user-data script for #! in which case it executes the user data as a shell script. As in the case above, the initialization script can use user tags and other user data fields to customize the instance into the specific type of Hadoop node required.

In addition to the facilities described above, Canonical Ubuntu images also include a package called cloud-init. Cloud-init is the Ubuntu package that handles early initialization of a cloud instance. It is installed in the UEC Images and also in the official Ubuntu images available on EC2. Cloud-init operates on user data received during initialization and accepts this data in multiple formats: gzipped compressed content, cloud-config scripts, shell scripts, and MIME multi part archive, among others. Shell scripts function pretty much the same as described above while cloud-config scripts use a custom syntax specifically for configuring a Ubuntu instance via the built-in administrative interfaces (mostly APT package manager). Gzip content is just a way to compress the user-data payload and MIME multi part archive is a method to combine cloud-init scripts and shell scripts into a single payload.

The cloud-init script below performs an APT update and upgrade as well as installing specific packages required for the Hadoop install, including OpenJDK. The Oracle/Sun JDK is a bit harder to install since the related packages are no longer distributed by Canonical and setup requires configuring a thirdparty repository and pre-accepting the Oracle license, all tasks which can be performed using Cloud-init fucntions albeit with a bit more effort. The script also sets the server timezone and configures log sinks, very useful for debugging.

In addition to the cloud-init script, we created a shell script to download the Hadoop source, compile, install and configure. In this case we configured a single node as a pseudo cluster and have not used user data/tags to customize settings as described above. The script also downloads and installs the LZO compression modules for HDFS, imports SSH keys to allow communication between the JobTracker and TaskManagers, although not required in this single server deployment. Finally, the script disables IPv6, formats the HDFS filesystem and starts the Hadoop daemons.

The two scripts above are packaged as a MIME multi part message to be delivered to cloud-init via the EC2 user data payload. The cloud-init script write-mime-multipart is used to perform this function as shown in the simple Makefile below.

Tuesday, March 27, 2012

Data-Intensive Text Processing with MapReduce

This book is compact and intense but is an insightful and powerful demonstration as to how a problem may be decomposed to fit the MapReduce paradigm. Equally important, it describes the types of problem that are not suited to decomposition as MapReduce jobs. It covers in detail the use of MapReduce in text indexing, graph algorithms, and expectation maximization, but the techniques described could easily be applied to a wide range of applications. I was able to turn the pseudo code snippets, together with Hadoop: The Definitive Guide, into working examples in a relatively short time.

For me, this book filled in the blanks with respect to how to apply MapReduce to my own algorithms and data.

Thursday, March 22, 2012

Text Indexing with Aho-Corasick

The Aho-Corasick string matching algorithm is a kind of dictionary matching search algorithm. It was originally proposed as an alternative to indexing as a means of speeding up bibliographic search. That was back in 1975 before the World Wide Web and ensuing information explosion demanding indexing in some form or other to make real-time information retrieval practical. The Aho-Corasick algorithm, however, has some interesting properties which make it attractive for use as an indexing scanner.

The algorithm constructs a state machine from a collection of dictionary words. The state machine is in-effect, a reduced-grammar regular expression parser and can be used to scan text for the dictionary words in a single pass. The machine state transitions (edges) trigger on encountering a specific letter in the input stream. Machine states (nodes) can emit one or more dictionary words if the path leading to the state encodes all of the letters of the dictionary word in order. Failure edges transition from a state for which no outgoing edge matches the next next letter in the input stream, to a state from which it it still may be possible to match a dictionary word given the letters already encountered in the stream.

The time taken to construct the state machine is proportional to the sum of the lengths of all dictionary words. This cost however can be amortized over the life of the state machine and a single state machine can be used to parse multiple texts concurrently if the implementation uses independent iterators to track state transitions through the machine. The number of state transitions required for an Aho-Corasick state machine to scan a document is independent of the size of the dictionary. This means that Aho-Corasick method scales very well to large dictionaries, the limiting factor being the space required to hold the state machine in memory.

As a proof-of-concept, we implemented the Aho-Corasick algorithm in Java and ran some benchmark tests. For debugging puposes we implemented a method to dump a state machine to Graphviz DOT format. The visualization of a state machine constructed with dictionary [he, she, his, hers] is shown in Figure 1. The background image for this blog title is the visualization of a state machine constructed with a 100 word dictionary - not very practical to follow but makes an interesting graphic.

Figure 1: Aho-Corasick state machine for dictionary [he, she, his, hers]

Figure 2 shows how time taken to construct the state machine varies with the number of dictionary words. Only 3 data points were taken but the relationship is clearly linear.

Figure 2: Aho-Corasick state machine construction time

Figure 3 shows how the performance of the Aho-Corasick implementation varies as the size of the corpus increases. The relationship appears linear and, for the most part, insensitive to the dictionary size. Deviations are likely attributable to poor sampling and high variance between test runs.

Source code for this implementation is available here.