When in doubt, always blame the application

When you have a misbehaving system and you are not sure what the problem is, always bet on a poorly written application.

Here’s a small example of how another poorly written web application caused system issues.

I was sitting at my office today when I got this Nagios alert for a host.

Date/Time: Tue Nov 6 19:15:11 EET 2012
Additional Info:
SWAP CRITICAL – 0% free (0 MB out of 509 MB)

Logging in showed that all the swap had indeed been used, and so had the RAM (0.95/1GB). Lots of apache2 server instances were running. I ran netstat and saw a lot of ESTABLISHED connections:

tcp6       0      0 2001:DB8:f00::1:35571 2001:DB8:bar::100:80 ESTABLISHED 9631/apache2    
tcp6       0      0 2001:DB8:f00::1:35777 2001:DB8:bar::100:80 ESTABLISHED 9656/apache2    
tcp6       0      0 2001:DB8:f00::1:36531 2001:DB8:bar::100:80 ESTABLISHED 11578/apache2   
tcp6       0      0 2001:DB8:f00::1:36481 2001:DB8:bar::100:80 ESTABLISHED 11158/apache2   
tcp6       0      0 2001:DB8:f00::1:36295 2001:DB8:bar::100:80 ESTABLISHED 11115/apache2   
tcp6       0      0 2001:DB8:f00::1:34831 2001:DB8:bar::100:80 ESTABLISHED 8312/apache2  

2001:DB8:f00::1 -> my server
2001:DB8:bar::100 -> dst server

As one can easily see, my server is connecting to port 80 of dst, possibly asking for something over HTTP.

# netstat -antpW | grep 2001:DB8:bar::100 | wc -l
111

# dig -x 2001:DB8:bar::100 +short                                 
crl.randomcertauthority.com.

Tailing the log files didn’t show anything weird happening. I ran a tcpdump for that dst server, but there was no traffic going on at that time.
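The capture itself was nothing fancy, roughly the following (the interface name here is just an example, not necessarily what the box actually uses):

# tcpdump -ni eth0 host 2001:DB8:bar::100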

So, I took a look at munin to see when this problem started developing.



As is obvious from the graphs above, the problem started around 14:00. So I took another look at the apache logs and saw a bot crawling a specific URL on my server. I visited that URL myself using curl and, watching tcpdump, saw traffic flowing from my server to the dst server. So visiting that URL was definitely causing the problem. But why?

I restarted apache; swap and memory were released, all the stale ESTABLISHED connections went away, and I saw hundreds of FIN/RST packets going back and forth in tcpdump.

I tried to open a few concurrent connections from my PC to my server’s URL using curl. After a couple of tries, netstat showed that I had managed to create stale ESTABLISHED connections towards the dst server. It was an HTTP connection asking for a CRL. So I was able to reproduce the problem, and I also knew the specific URL on the dst server that caused the hanging connections.
Next thing I did was to try to open direct HTTP connections from my server to the dst URL using curl. After a few concurrent connections I managed to make curl hang. So the problem was definitely not on my server, but at the dst server.
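For reference, the reproduction from my server boiled down to something like the following sketch; the path in the CRL URL below is a placeholder, not the real one:

# fire a few concurrent requests at the CRL URL and see which ones never return
CRL_URL="http://crl.randomcertauthority.com/example.crl"
for i in $(seq 1 10); do
    curl -s -o /dev/null "$CRL_URL" &
done
# in another terminal: netstat -antpW | grep 2001:DB8:bar::100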

Since it was already quite late, my first (re)action was to install mod_evasive to try and minimize the problem so I could take a better look the next day.

# aptitude install libapache2-mod-evasive
# a2enmod mod_evasive
### edit /etc/apache2/sites-enabled/site-name and add the following
        <IfModule mod_evasive20.c>
            DOSHashTableSize    3097
            DOSPageCount        1
            DOSSiteCount        50
            DOSPageInterval     1
            DOSSiteInterval     1
            DOSBlockingPeriod   10
            DOSEmailNotify      myemail@mydomain.gr
        </IfModule>
# /etc/init.d/apache2 reload

I tried to curl my server’s URL from my PC and got blocked after the second concurrent try. But after some repetitions I was still able to create one or two stale ESTABLISHED connections from my server to the dst server. Far fewer than before, but the problem was still somewhat reproducible.
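A quick way to check the blocking behaviour, again with a placeholder URL standing in for the real one:

for i in $(seq 1 5); do
    curl -s -o /dev/null -w "%{http_code}\n" "http://www.example.gr/the-crawled-page" &
done
wait

After the first couple of requests the rest should come back as 403, which is what mod_evasive returns by default when it blocks a client.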

Then I decided to take a look at the site’s PHP code. Finding the culprit was quite easy; I just had to find the code segment where PHP requested the dst server’s URL.
Here’s the code segment:

$ch = curl_init($this->crl_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$crl_content = curl_exec($ch);

The developer had never considered that the remote server might keep the connection open for whatever reason (rate limiting, anyone?).

Patching it was quite simple:

$ch = curl_init($this->crl_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 60);
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
$crl_content = curl_exec($ch);

After this everything worked fine again. Connections were getting ESTABLISHED but after 60 seconds they got torn down, automagically. No more stale ESTABLISHED connections. Hooray!

A letter to every developer:

Dear developer,

please test your code before shipping. Pretty please take corner cases into account. We know you’re competent enough, don’t be lazy.

Your kind sysadmin

Bind9 statistics-channels munin plugin

Bind9, since version 9.5, offers an experimental embedded web server which can provide statistics about Bind over HTTP. Once enabled, one can query this web server and get an XML response full of various statistics.

Enabling the feature is quite easy. One just needs to add some lines like the following inside Bind’s configuration file:

statistics-channels {
          inet 127.0.0.1 port 8053 allow {127.0.0.1;};
};

Restart Bind and try connecting to the statistics-channel, for example with curl:

$ curl http://127.0.0.1:8053
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/bind9.xsl"?>
<isc version="1.0">
  <bind>
    <statistics version="2.2">
      <views>
        <view>
          <name>_default</name>
          <zones>
            <zone>
              <name>0.in-addr.arpa/IN</name>
              <rdataclass>IN</rdataclass>
              <serial>1</serial>
            </zone>
            <zone>
              <name>10.in-addr.arpa/IN</name>
              <rdataclass>IN</rdataclass>
              <serial>1</serial>
            </zone>
...
...
...
...
         <TotalUse>810834110</TotalUse>
          <InUse>327646360</InUse>
          <BlockSize>740556800</BlockSize>
          <ContextSize>9954232</ContextSize>
          <Lost>0</Lost>
        </summary>
      </memory>
    </statistics>
  </bind>
</isc>

The output will be quite big; for example, the current output from a busy recursive resolver is around 2.5MB.

People usually use munin to monitor Bind in two different ways. The first, usually applied to servers that are not very busy (say, fewer than 100 queries/sec), is to parse the output of the query log: a munin plugin that comes with the default munin installation, called bind9, parses the query log and creates some graphs. The second way, for somewhat busier servers, is to leave the query log disabled and monitor Bind through the output of the rndc stats command; a plugin that also comes with the default munin package, called bind9_rndc, parses the resulting named.stats output file and creates some nice graphs. To configure either of the two ways one can take a look at this informational wiki: http://wiki.debuntu.org/wiki/Munin/Bind9_Plugin
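For completeness, the rndc-based approach boils down to something like this; the statistics file path is whatever the statistics-file option in named.conf points to (the path below is just a common Debian default):

# rndc stats
# tail -n 20 /var/cache/bind/named.stats

Each rndc stats run appends a fresh plain-text dump of Bind’s counters to that file, and this is the file the bind9_rndc plugin parses.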

A third way is to have a new munin plugin fetch the XML output from the statistics-channel described above and create some even fancier and richer graphs. Luckily, a guy called Andrew Duquette has created such a perl plugin for munin. You need to exercise your google skills to find it though (that is, if you don’t cheat by using his name as a search term!). I took that plugin, made some changes to it and uploaded it to github. The plugin is called bind9_statchannel and you can find it in my github munin-plugins repo.

The changes I made were the following:

  • Use STACK instead of just AREA for some graphs
  • Add warning and critical levels for type ANY queries (to warn for DDoS through such queries)
  • Raise timeout to 300 seconds
  • Add sorting to stacked graphs so they are a bit more readable/comparable

The plugin creates the following type of graphs:

  • Cache DB RRsets for various views (or the _default if you don’t have any others)
  • Memory Context “In Use”
  • Memory Usage Summary
  • Queries In
  • Queries Out
  • Resolver Statistics for various views (or the _default if you don’t have any others)
  • Server Statistics
  • Socket I/O Statistics
  • Tasks

As you can see, it can give many more details than the other bind9 munin plugins. It’s also the only bind9 munin plugin that can tell you how many incoming vs outgoing queries there are, as well as the only one that can tell you how many queries you have over IPv4 vs IPv6.

To use it on Debian you need to install at least the following two perl packages: libxml-simple-perl and liblwp-useragent-determined-perl.
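Installation follows the usual munin plugin routine; here’s a rough sketch, assuming the standard Debian paths:

# aptitude install libxml-simple-perl liblwp-useragent-determined-perl
# cp bind9_statchannel /usr/share/munin/plugins/
# chmod +x /usr/share/munin/plugins/bind9_statchannel
# ln -s /usr/share/munin/plugins/bind9_statchannel /etc/munin/plugins/bind9_statchannel
# /etc/init.d/munin-node restart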

Here are some sample graphs from the bind9_statchannel plugin:

If you have any bug fixes, code changes, etc., please don’t hesitate to fork and open a pull request on github.

Download: bind9_statchannel