[aprssig] Slashdot After-action report...

Steve Dimse steve at dimse.com
Thu Aug 4 15:55:52 EDT 2005


What a rush...I always wanted to know how my server would hold up to  
a Slashdotting...the answer is better than I feared (and better than  
a lot of other sites do), but still worse than I'd hoped!

Once I knew I was on deck, I set up a grep on the access log, so I  
knew when it went live for Slashdot subscribers, who see it 15  
minutes sooner than the rest of the world. When it went live for
everyone, the traffic instantly spiked. It is really amazing how many
people follow Slashdot that closely. After the first hour the number
of hits began slowly decreasing.
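
A minimal sketch of that kind of access-log watch, written in Python rather than grep. It counts hits per minute whose referrer mentions slashdot.org. It assumes Apache's combined log format; the function name and the sample log lines are made up for illustration and are not the command actually used.

```python
import re
from collections import Counter

# Combined log format (assumed):
# host ident user [time] "request" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "[^"]*" \d{3} \S+ "(?P<ref>[^"]*)"'
)

def slashdot_hits_per_minute(lines):
    """Count hits per minute whose referrer contains 'slashdot.org'."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and "slashdot.org" in m.group("ref").lower():
            # Timestamp looks like 03/Aug/2005:21:00:05 -0400;
            # the first 17 characters identify the minute.
            counts[m.group("ts")[:17]] += 1
    return counts

# Hypothetical sample lines, not taken from any real log.
sample_log = [
    '10.0.0.1 - - [03/Aug/2005:21:00:05 -0400] "GET /a HTTP/1.1" 200 512 "http://slashdot.org/" "UA"',
    '10.0.0.2 - - [03/Aug/2005:21:00:40 -0400] "GET /a HTTP/1.1" 200 512 "http://slashdot.org/article" "UA"',
    '10.0.0.3 - - [03/Aug/2005:21:01:02 -0400] "GET /b HTTP/1.1" 200 256 "-" "UA"',
]
counts = slashdot_hits_per_minute(sample_log)
for minute, n in sorted(counts.items()):
    print(minute, n)
```

A sudden jump in the per-minute count is the signal that the story has gone live.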

It turns out I'd prepped in exactly the wrong way, by increasing the
number of allowed connections. At first I thought things were going
well. It was taking a few seconds to load the pacsat page, and most of
the 250 slots on the web server were usually full, but there were
almost no errors in the http log file; people were getting their pages
and maps. Then I checked my weather page, and realized no data had been
placed into the database for the 30 minutes the slashdotting had been
going on. The problem was too many reads swamping the writes, so I kept
lowering the max connections and restarting Apache.

The right answer was to limit the number of connections to 40. Any
more just flogged the database so badly that no data could be added
to it. Even at 40, I was occasionally getting a backup in plot.cgi
when I would restart the server. I was typing in the command to
disable plot.cgi when the server crashed. I'm still not sure why; the
number of processes was down to 300 from a high of over a thousand.
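
The effect of a connection cap can be sketched in a few lines. This is not how Apache enforces its limit; it is only a toy Python illustration, using a semaphore, of how a cap of 40 keeps the number of simultaneous readers bounded no matter how many requests arrive. All names and the simulated work are hypothetical.

```python
import threading
import time

MAX_CONNECTIONS = 40  # the cap that worked, per the text above

def run_requests(n_requests, limit=MAX_CONNECTIONS):
    """Run n_requests simulated reads, at most `limit` at once.

    Returns the peak number of reads that were active at the same time.
    """
    slots = threading.BoundedSemaphore(limit)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def handle_request():
        with slots:  # blocks here while all slots are busy
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(0.01)  # stand-in for a database read
            with lock:
                state["active"] -= 1

    threads = [threading.Thread(target=handle_request) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]

peak = run_requests(120)
print("peak concurrent reads:", peak)
```

Requests beyond the cap wait their turn instead of piling onto the database, which leaves room for the writes to get through.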

Luckily for me, the only tables corrupted in the crash were the 10  
day weather table, repairable in 2 minutes, and the position table,  
which is huge but not critical. The choice was to repair the position  
table (server offline for 6 hours) or empty it (2 seconds). The  
choice was easy, if painful. More than a few of you will probably
notice a lot fewer points in track.cgi, sorry. Had it been the weather
table, I would not have been able to blow it away, and things would  
have been much worse.

Once I disabled plot.cgi, things went very smoothly, with the server
easily keeping up with the gradually diminishing load. I got
overconfident though, and re-enabled plot.cgi around 2 AM. A couple of
hours later the server crashed again, but I wasn't watching it closely
at the time. Again, I got lucky: this crash corrupted only the raw data
file, so it got wiped clean. By morning the volume was down enough
that I could put the world map back on the pages, and enable plot for
all my regular users. Once my story dropped off the front page of
Slashdot I was able to put back all the maps on the PCSat and ARISS
pages, and things are pretty much back to normal.

In terms of the numbers, there were about 40,000 extra visits (normal
is about 25k/day) and over a million extra hits (normal is about
600k/day) in the last 18 hours...
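
To put those figures in proportion, here is a rough back-of-the-envelope calculation in Python. It assumes normal traffic is spread evenly over the day (so 18 hours is 18/24 of a normal day) and takes "over a million" as exactly 1,000,000, which makes the hit figure a lower bound.

```python
HOURS = 18

# Normal traffic expected in an 18-hour window, assuming an even spread.
normal_visits = 25_000 * HOURS / 24   # 18,750 visits
normal_hits = 600_000 * HOURS / 24    # 450,000 hits

extra_visits = 40_000
extra_hits = 1_000_000  # "over a million", so this is a lower bound

visit_multiplier = (normal_visits + extra_visits) / normal_visits
hit_multiplier = (normal_hits + extra_hits) / normal_hits

print(f"visits: {visit_multiplier:.1f}x normal, hits: {hit_multiplier:.1f}x normal")
# prints: visits: 3.1x normal, hits: 3.2x normal
```

Under those assumptions the server saw a bit more than three times its usual load over the 18 hours, with the peak in the first hour well above that average.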

So, now I know how to handle this sort of thing in the future...not  
that I'll be submitting myself to Slashdot again any time soon!  
Thanks for putting up with the disruption; I didn't get a single
complaint.

Steve



