{"id":1523,"date":"2012-11-06T22:36:53","date_gmt":"2012-11-06T20:36:53","guid":{"rendered":"http:\/\/www.void.gr\/kargig\/blog\/?p=1523"},"modified":"2012-11-06T22:40:43","modified_gmt":"2012-11-06T20:40:43","slug":"when-in-doubt-always-blame-the-application","status":"publish","type":"post","link":"https:\/\/www.void.gr\/kargig\/blog\/2012\/11\/06\/when-in-doubt-always-blame-the-application\/","title":{"rendered":"When in doubt, always blame the application"},"content":{"rendered":"<p>When you have a misbehaving system and you are not sure what the problem is, always bet on a poorly written application.<\/p>\n<p>Here&#8217;s a small example of how another poorly written web application caused system issues.<\/p>\n<p>I was sitting at my office today I when I got this nagios alert for a host.<\/p>\n<blockquote><p>Date\/Time: Tue Nov 6 19:15:11 EET 2012<br \/>\nAdditional Info:<br \/>\nSWAP CRITICAL &#8211; 0% free (0 MB out of 509 MB)<\/p><\/blockquote>\n<p>Logging in actually showed all the swap&#8217;s been used and so was RAM, 0.95\/1Gb. Lots of apache2 server instances were running. I did a netstat and I saw a lot of ESTABLISHED connections:<\/p>\n<p><pre><code2>tcp6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;0 2001:DB8:f00::1:35571 2001:DB8:bar::100:80 ESTABLISHED 9631\/apache2&nbsp;&nbsp;&nbsp;&nbsp;\ntcp6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;0 2001:DB8:f00::1:35777 2001:DB8:bar::100:80 ESTABLISHED 9656\/apache2&nbsp;&nbsp;&nbsp;&nbsp;\ntcp6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;0 2001:DB8:f00::1:36531 2001:DB8:bar::100:80 ESTABLISHED 11578\/apache2&nbsp;&nbsp; \ntcp6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;0 2001:DB8:f00::1:36481 2001:DB8:bar::100:80 ESTABLISHED 11158\/apache2&nbsp;&nbsp; \ntcp6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;0 2001:DB8:f00::1:36295 2001:DB8:bar::100:80 ESTABLISHED 11115\/apache2&nbsp;&nbsp; \ntcp6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;0 2001:DB8:f00::1:34831 2001:DB8:bar::100:80 ESTABLISHED 8312\/apache2&nbsp;&nbsp;<\/code2><\/pre>  <\/p>\n<p>2001:DB8:f00::1   -> my server<br \/>\n2001:DB8:bar::100 -> dst server<\/p>\n<p>As one can easily see my server is connecting to port 80 of dst, possibly asking for something over HTTP.<\/p>\n<p><pre><code2># netstat -antpW | grep 2001:DB8:bar::100 | wc -l\n111<\/code2><\/pre><\/p>\n<p><pre><code2># dig -x 2001:DB8:bar::100 +short&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \ncrl.randomcertauthority.com.<\/code2><\/pre><\/p>\n<p>tailing the log files didn&#8217;t show anything weird happening. I run a tcpdump for that dst server but there wasn&#8217;t at that time any traffic going on.<\/p>\n<p>So, I took a look at munin to see when this problem started developing.<\/p>\n<p><a href=\"https:\/\/www.void.gr\/kargig\/blog\/wp-content\/fw_conntrack-day.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.void.gr\/kargig\/blog\/wp-content\/fw_conntrack-day.png\" alt=\"\" title=\"fw_conntrack-day\" width=\"497\" height=\"352\" class=\"aligncenter size-full wp-image-1526\" srcset=\"https:\/\/www.void.gr\/kargig\/blog\/wp-content\/fw_conntrack-day.png 497w, https:\/\/www.void.gr\/kargig\/blog\/wp-content\/fw_conntrack-day-300x212.png 300w\" sizes=\"auto, (max-width: 497px) 100vw, 497px\" \/><\/a><br \/>\n<a href=\"https:\/\/www.void.gr\/kargig\/blog\/wp-content\/memory-day.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.void.gr\/kargig\/blog\/wp-content\/memory-day.png\" alt=\"\" title=\"memory-day\" width=\"497\" height=\"424\" class=\"aligncenter size-full wp-image-1527\" srcset=\"https:\/\/www.void.gr\/kargig\/blog\/wp-content\/memory-day.png 497w, https:\/\/www.void.gr\/kargig\/blog\/wp-content\/memory-day-300x255.png 300w\" sizes=\"auto, (max-width: 497px) 100vw, 497px\" \/><\/a><br \/>\n<a href=\"https:\/\/www.void.gr\/kargig\/blog\/wp-content\/fw_conntrack-day.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.void.gr\/kargig\/blog\/wp-content\/fw_conntrack-day.png\" alt=\"\" title=\"fw_conntrack-day\" width=\"497\" height=\"352\" class=\"aligncenter size-full wp-image-1526\" srcset=\"https:\/\/www.void.gr\/kargig\/blog\/wp-content\/fw_conntrack-day.png 497w, https:\/\/www.void.gr\/kargig\/blog\/wp-content\/fw_conntrack-day-300x212.png 300w\" sizes=\"auto, (max-width: 497px) 100vw, 497px\" \/><\/a><\/p>\n<p>As it&#8217;s obvious from the above graphs, the problem started around 14:00. So I took another look at the apache logs and I saw a bot crawling a specific url from my server. I visited that url on my server  using curl and I saw traffic flowing through tcpdump going from my server to dst server. So visiting that URL was definitely causing problems. But why?<\/p>\n<p>I restarted apache, swap and memory were released, all the stale ESTABLISHED connections went away and I saw hundreds of FIN\/RST packets going back and forth at tcpdump.<\/p>\n<p>I tried to open a few concurrent connections from my PC to my server&#8217;s url using curl. After a couple of tries netstat showed that I had managed to create stale ESTABLISHED connections towards dst server. It was an HTTP connection asking for a crl. So I was both able to reproduce the problem and I also knew the specific url of the dst server that caused the connection hanging issues.<br \/>\nNext thing I did was to try to open direct HTTP connections from my server to the dst url using curl. After a few concurrent connections I managed to make curl hang. So the problem was definitely not on my server, but at the dst server.<\/p>\n<p>Since it was already quite late, my first (re)action was to install mod_evasive to try and minimize the problem so I could take a better look the next day.<br \/>\n<pre><code2># aptitude install libapache2-mod-evasive\n# a2enmod mod_evasive\n### edit \/etc\/apache2\/sites-enabled\/site-name and add the following\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;IfModule mod_evasive20.c&gt;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DOSHashTableSize&nbsp;&nbsp;&nbsp;&nbsp;3097\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DOSPageCount&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1&nbsp;&nbsp; \n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DOSSiteCount&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;50&nbsp;&nbsp;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DOSPageInterval&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp; \n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DOSSiteInterval&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp; \n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DOSBlockingPeriod&nbsp;&nbsp; 10&nbsp;&nbsp;\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;DOSEmailNotify myemail@mydomain.gr\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;\/IfModule&gt;\n# \/etc\/init.d\/apache2 reload\n<\/code2><\/pre><\/p>\n<p>I tried to curl my server&#8217;s URL from my PC and I got blocked after the second concurrent try. But after some repetitions I was still able to create one or two stale ESTABLISHED connections from my server to the dst server. Far fewer than before but the problem was still somewhat reproducible.<\/p>\n<p>Then I decided to take a look at the site&#8217;s PHP code. Finding the culprit was quite easy, I just had to find the code segment where PHP requested the dst server&#8217;s url.<br \/>\nHere&#8217;s the code segment:<br \/>\n<pre><code2>$ch = curl_init($this-&gt;crl_url);\ncurl_setopt($ch, CURLOPT_RETURNTRANSFER,true);\n$crl_content = curl_exec($ch);\n<\/code2><\/pre><\/p>\n<p>The developer had never thought that the remove server might keep the connection open for whatever reason (rate limiting anyone?)<\/p>\n<p>Patching it was quite simple:<br \/>\n<pre><code2>$ch = curl_init($this-&gt;crl_url);\ncurl_setopt($ch, CURLOPT_RETURNTRANSFER,true);\ncurl_setopt($ch, CURLOPT_CONNECTTIMEOUT,&#039;60&#039;);\ncurl_setopt($ch, CURLOPT_TIMEOUT,&#039;60&#039;);\n$crl_content = curl_exec($ch);\n<\/code2><\/pre><\/p>\n<p>After this everything worked fine again. Connections were getting ESTABLISHED but after 60 seconds they got torn down, automagically. No more stale ESTABLISHED connections. Hooray!<\/p>\n<p>A letter to every developer:<\/p>\n<blockquote><p>Dear developer,<\/p>\n<p>please test your code before shipping. Pretty please take corner cases into account. We know you&#8217;re competent enough, don&#8217;t be lazy.<\/p>\n<p>Your kind sysadmin<\/p><\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p>When you have a misbehaving system and you are not sure what the problem is, always bet on a poorly written application. Here&#8217;s a small example of how another poorly written web application caused system issues. I was sitting at my office today I when I got this nagios alert for a host. Date\/Time: Tue [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ep_exclude_from_search":false,"footnotes":""},"categories":[5,414,3,8],"tags":[182,37,225,33,509,508,595,507,490,505,277,506],"class_list":["post-1523","post","type-post","status-publish","format-standard","hentry","category-internet","category-ipv6-networking","category-linux","category-networking","tag-apache2","tag-bug","tag-curl","tag-debian","tag-http","tag-libcurl","tag-linux","tag-mod_evasive","tag-munin","tag-nagios3","tag-php","tag-tcpdump"],"aioseo_notices":[],"views":9360,"_links":{"self":[{"href":"https:\/\/www.void.gr\/kargig\/blog\/wp-json\/wp\/v2\/posts\/1523","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.void.gr\/kargig\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.void.gr\/kargig\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.void.gr\/kargig\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.void.gr\/kargig\/blog\/wp-json\/wp\/v2\/comments?post=1523"}],"version-history":[{"count":6,"href":"https:\/\/www.void.gr\/kargig\/blog\/wp-json\/wp\/v2\/posts\/1523\/revisions"}],"predecessor-version":[{"id":1532,"href":"https:\/\/www.void.gr\/kargig\/blog\/wp-json\/wp\/v2\/posts\/1523\/revisions\/1532"}],"wp:attachment":[{"href":"https:\/\/www.void.gr\/kargig\/blog\/wp-json\/wp\/v2\/media?parent=1523"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.void.gr\/kargig\/blog\/wp-json\/wp\/v2\/categories?post=1523"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.void.gr\/kargig\/blog\/wp-json\/wp\/v2\/tags?post=1523"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}