[aprssig] Slashdot After-action report...
Steve Dimse
steve at dimse.com
Thu Aug 4 15:55:52 EDT 2005
What a rush...I always wanted to know how my server would hold up to
a Slashdotting...the answer is better than I feared (and better than
a lot of other sites do), but still worse than I'd hoped!
Once I knew I was on deck, I set up a grep on the access log, so I
knew when it went live for Slashdot subscribers, who see it 15
minutes sooner than the rest of the world. When it went live for
everyone, the traffic instantly spiked. It's really amazing how many
people follow Slashdot that closely. After the first hour the number
of hits began slowly decreasing.
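
For anyone wanting to watch for the same thing, here is a minimal sketch of that kind of log watch in Python. The post doesn't say what pattern the grep used; this version assumes the server writes Apache's "combined" log format and that matching "slashdot.org" in the referrer field is good enough. The function name is made up for illustration.

```python
import re

# Assumes Apache "combined" log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def slashdot_hits(lines):
    """Yield (time, request) for each log line whose referrer mentions slashdot.org."""
    for line in lines:
        m = COMBINED.match(line)
        # Lines that don't parse as combined format are skipped silently.
        if m and "slashdot.org" in m.group("referer").lower():
            yield m.group("time"), m.group("request")
```

Piping `tail -f` on the access log into a small script that loops over `slashdot_hits(sys.stdin)` and prints each pair would give roughly the same live view as the grep.
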
It turns out I'd prepped in exactly the wrong way, by increasing the
number of allowed connections. At first I thought things were going
well: it was taking a few seconds to load the pacsat page and most of
the 250 slots on the web server were usually full, but there were
almost no errors in the http log file, and people were getting their
pages and maps. Then I checked my weather page, and realized no data
had been placed into the database for the 30 minutes the slashdotting
had been going on. I realized the problem was too many reads swamping
the writes, and kept lowering the max connections and restarting
Apache.
The right answer was to limit the number of connections to 40. Any
more just flogged the database so badly that no new data could be
added to it. Even at 40, I was occasionally getting a backup in
plot.cgi when I would restart the server. I was typing in the
command to disable plot.cgi when the server crashed. I'm still not
sure why, since the number of processes was down to 300 from a high
of over a thousand.
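
The idea of capping concurrent reads so that writes can still get through can be illustrated with a toy model in Python. This is not how Apache enforces its connection limit (that is done with the server's own configuration), and the class and function names here are hypothetical. Only the figure of 40 comes from the post.

```python
import threading
import time

MAX_READERS = 40  # the cap from the post; everything else here is illustrative

class ReadLimiter:
    """Allow at most `limit` read requests to run at once; extras wait their turn."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def run(self, query):
        with self._slots:  # blocks while all slots are busy
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return query()
            finally:
                with self._lock:
                    self.active -= 1

def simulate(requests=200, limit=MAX_READERS):
    """Fire `requests` fake reads at once and return the peak concurrency seen."""
    limiter = ReadLimiter(limit)

    def fake_read():
        time.sleep(0.01)  # stand-in for a database read

    threads = [threading.Thread(target=limiter.run, args=(fake_read,))
               for _ in range(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return limiter.peak
```

Calling `simulate()` starts 200 simulated requests but never lets more than 40 touch the "database" at the same moment. The rest queue up, which is the trade the post describes: slower page loads in exchange for leaving the database room to accept new data.
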
Luckily for me, the only tables corrupted in the crash were the 10
day weather table, repairable in 2 minutes, and the position table,
which is huge but not critical. The choice was to repair the position
table (server offline for 6 hours) or empty it (2 seconds). The
choice was easy, if painful. More than a few of you will probably
notice a lot fewer points in track.cgi, sorry. Had it been the weather
table, I would not have been able to blow it away, and things would
have been much worse.
Once I disabled plot.cgi, things went very smoothly, with the server
easily keeping up with the gradually diminishing load. I got
overconfident though, and re-enabled plot.cgi around 2 AM. A couple
hours later the server crashed again, but I wasn't watching it closely
at the time. Again, I got lucky: this crash corrupted only the raw
data file, so it got wiped clean. By morning the volume was down
enough that I could put the world map back on the pages, and enable
plot.cgi for all my regular users. Once my story dropped off the front
page of Slashdot I was able to put back all the maps on the PCSat and
ARISS pages, and things are pretty much back to normal.
In terms of the numbers, there were about 40,000 extra visits (normal
is about 25k/day) and over a million extra hits (normal is about
600k/day) in the last 18 hours...
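
To put those figures in proportion, here is a rough back-of-the-envelope calculation in Python. It assumes normal traffic is spread evenly over the day (the post doesn't say) and treats "over a million" as exactly one million, so the hit numbers are lower bounds.

```python
WINDOW_HOURS = 18

# Figures quoted in the post
EXTRA_VISITS = 40_000
EXTRA_HITS = 1_000_000  # "over a million", so treat as a lower bound
NORMAL_VISITS_PER_DAY = 25_000
NORMAL_HITS_PER_DAY = 600_000

def load_multiple(extra, normal_per_day, hours=WINDOW_HOURS):
    """Return (baseline, total, ratio) for the window, assuming flat normal traffic."""
    baseline = normal_per_day * hours / 24
    total = baseline + extra
    return baseline, total, total / baseline

visits = load_multiple(EXTRA_VISITS, NORMAL_VISITS_PER_DAY)
hits = load_multiple(EXTRA_HITS, NORMAL_HITS_PER_DAY)

print("visits: baseline %.0f, total %.0f, %.1fx normal" % visits)
print("hits:   baseline %.0f, total %.0f, %.1fx normal" % hits)
print("average hits/sec over the window: %.1f" % (hits[1] / (WINDOW_HOURS * 3600)))
```

Under those assumptions the 18 hours saw a bit more than three times the usual load on both measures, averaging about 22 hits per second. Since the traffic spiked at the start and tapered off, the rate in the first hour would have been well above that average.
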
So, now I know how to handle this sort of thing in the future...not
that I'll be submitting myself to Slashdot again any time soon!
Thanks for putting up with the disruption. I didn't get a single
complaint.
Steve