For several months starting in August, the Maven Central Repository has seen a dramatic increase in traffic. This is a great indicator of the adoption of Maven and other tools that recognize the value of a shared binary repository. It also set in motion a series of upgrades and changes required to keep the system going strong for everyone. We've written about some of the steps we've taken in the last few months to preserve Maven Central as a public resource, and I wanted to provide more details for those of you who are interested in the numbers behind these changes.
The first change occurred in August, when we moved Central to a new 100 Mbps connection. This temporarily solved most of our availability problems, but the load continued to increase and Central started running out of httpd worker threads. For a while we played a cat-and-mouse game of bumping up the number of workers. Read on to find out how we ultimately solved this problem and made Central more stable and available for the world of developers it serves.
Moving to Nginx
As I previously reported, increasing the number of workers to the required level caused the CPU load on the machine to climb to well over 200 (on a 4-processor machine, this means it was 50x overloaded). On a tip, we investigated and migrated to Nginx, a remarkably fast and efficient HTTP server optimized for static content. This immediately solved our load issue, dropping the load average to 0.05 instead of 200 during high traffic. You can't get much better than that.
The next issue we faced was that Nginx was so efficient at serving the content that the connection itself was getting saturated, causing packet loss and connection timeouts. We were easily filling up our 100 Mbps pipe as traffic to Central continued to climb from August 2008 to November 2008. This next weekly graph hides the problem because it shows the "average" bandwidth used over an hour, but it conveys the trend.
Isolating the Hourly "Spikes"
If you zoom into this graph, you'll see that the 5-minute averages showed a saturated 100 Mbps pipe every hour on the hour. The problem needed an immediate solution: a saturated pipe for 10-15 minutes every hour translates to refused connections, slow transfers, and unhappy users.
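To see why an hourly average can hide this kind of spike, here is a small sketch in Python. The sample values are purely illustrative (they are not our measurements), and the function names and the 98% saturation threshold are my own assumptions for the example.

```python
# Illustrative numbers only: twelve 5-minute bandwidth samples for one hour, in Mbps.
# The first three samples model a pipe that is saturated at the top of the hour.
samples_mbps = [100, 100, 100, 60, 60, 60, 60, 60, 60, 60, 60, 60]


def hourly_average(samples):
    """Average of all samples in the hour, which is what an hourly graph plots."""
    return sum(samples) / len(samples)


def saturated_minutes(samples, capacity_mbps=100, threshold=0.98, minutes_per_sample=5):
    """Minutes during which a sample sits at or above the assumed saturation threshold."""
    limit = capacity_mbps * threshold
    return sum(minutes_per_sample for s in samples if s >= limit)


print(hourly_average(samples_mbps))     # 70.0
print(saturated_minutes(samples_mbps))  # 15
```

With these made-up samples the hourly graph would report a comfortable 70 Mbps, even though the pipe was full for a quarter of the hour.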
These hourly spikes took a long time to isolate, and they were not confined to the regular work week. They continued around the clock, 7 days a week. On Thanksgiving (between Turkey and Pumpkin Pie), I really dug into the logs. The traffic was not coming from any single, easily identifiable location; rather, it seemed to be a distributed, coordinated "attack" on the system. We looked closer at the files being accessed and determined that the Nexus Index zip file accounted for the lion's share of the traffic. Given that it is one of the larger files in the repo at 28 MB, that wasn't really abnormal, but was it the target of the hourly spikes? To find out, I moved the index.zip out of the way for 1 minute, two separate times during the spike.
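For anyone who wants to do the same kind of digging, this is roughly the per-file tally involved, sketched in Python. It assumes the access log is in the common/combined log format; the sample lines, IP addresses, paths, and user agent strings are invented for illustration and are not taken from Central's logs.

```python
import re
from collections import Counter

# Pulls the request path and response size out of a common/combined log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def bytes_by_path(lines):
    """Sum the bytes served per request path, skipping lines that do not parse."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None or match.group("size") == "-":
            continue
        totals[match.group("path")] += int(match.group("size"))
    return totals


def share_of_traffic(totals, path):
    """Fraction of all bytes served that went to a single path."""
    total = sum(totals.values())
    return totals[path] / total if total else 0.0


# Hypothetical log lines, for illustration only.
sample = [
    '192.0.2.10 - - [27/Nov/2008:14:00:01 +0000] "GET /repo/index.zip HTTP/1.1" 200 3000 "-" "Jakarta Commons-HttpClient/3.1"',
    '198.51.100.7 - - [27/Nov/2008:14:00:02 +0000] "GET /repo/a.jar HTTP/1.1" 200 1000 "-" "ExampleTool/1.0"',
    '192.0.2.10 - - [27/Nov/2008:14:02:01 +0000] "GET /repo/index.zip HTTP/1.1" 200 3000 "-" "Jakarta Commons-HttpClient/3.1"',
]

totals = bytes_by_path(sample)
for path, size in totals.most_common():
    print(path, size)
```

Sorting by total bytes rather than by request count is what makes a large, frequently fetched file stand out from the many small artifacts around it.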
A Culprit Identified...
Moving the index.zip decreased the duration of the spike. Finally, we had a target to analyze. Looking closer at this file, we uncovered many IPs downloading the zip multiple times a day (one extreme case was downloading this file every 2 minutes!). All the suspicious requests were using a generic Jakarta user agent, so we knew it wasn't Nexus. (Nexus defines a User-Agent; you should too.) We started looking into other tools that might use the index and found the culprit. One tool had a bug that caused it to ignore the properties file and the timestamp on the zip and download the full zip every time a request was made. Worse, the default configuration had every instance grabbing the file every 5 hours exactly on the hour, even though the file only changes once a week. Account for clock drift and it perfectly explains the hourly spikes from distributed sources.
Now that we had identified the source of the problem, we were able to block it from getting the index until the tool was fixed. Implementing the block had an immediate effect:
The traffic fell off to almost nothing, which is what you would expect on a holiday weekend. Notice that once an updated index was published, the traffic spike returned for a bit, but the overall ambient traffic was back to ~60-80 Mbps instead of the 98 Mbps that had been the norm over the preceding months. This was the impetus for Tim's earlier blog entries on Efficient Downloads of the Nexus Index and Installing a Repository Manager.
60-80 Mbps is certainly better than intermittent spikes to 100 Mbps and people suffering from slow downloads, but we kept chipping away at the problem to try to increase availability and reduce bandwidth even further. Tomorrow, I'll talk about how we have reduced the traffic on Central from 80 Mbps down to 14 Mbps. The solution might surprise you.
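If you are writing a tool that consumes the index, the three lessons here (identify yourself, only download when the timestamp has changed, and don't have every instance fire at the same instant) can be sketched in Python as follows. This is a minimal illustration, not how Nexus or any particular tool does it: the function names, the example URL, the tool name, and the timestamp format are all hypothetical.

```python
import random
import urllib.request


def build_request(url, tool_name, version):
    """Build a request that identifies the client instead of a generic default agent."""
    return urllib.request.Request(url, headers={"User-Agent": f"{tool_name}/{version}"})


def needs_download(remote_timestamp, local_timestamp):
    """Download only when there is no local copy or the published index is newer.

    Timestamps are assumed to be comparable values, such as sortable strings.
    """
    if local_timestamp is None:
        return True
    return remote_timestamp > local_timestamp


def next_check_delay_seconds(base_interval_s, jitter_fraction=0.2, rng=random):
    """Add random jitter so many instances do not all check at the same moment."""
    jitter = base_interval_s * jitter_fraction
    return base_interval_s + rng.uniform(-jitter, jitter)


# Hypothetical values, for illustration only.
request = build_request("https://repo.example.com/index.zip", "ExampleTool", "1.0")
print(request.get_header("User-agent"))
print(needs_download("20081127000000", "20081120000000"))
print(next_check_delay_seconds(5 * 3600))
```

The timestamp check costs a tiny request against a small file, while skipping it costs a full download of the zip on every poll.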