World city map of Tor nodes

Some months ago I started playing with the idea of creating a world map that would have every Tor node on it. Obviously I wasn’t the first one… I soon discovered Moritz Bartl’s post on the same topic. Luckily he had his code posted on GitHub so I could fork it and add the features that I wanted. The original Python script parsed the consensus and the microdescriptors, put Tor nodes into some classes and created a KML file with some description on each node.

Some differences
I changed some parts of the Python script to better suit my needs.
a. Create a separate KML file for each Tor node class.
b. Add new classes: Bad, Authority and Named.
c. Pay more attention to requesting every external URL over HTTPS.
d. Generate HTML code that displays those KMLs on a Google Maps overlay.
e. Add some small randomization to each node’s coordinates so that nodes in the same city don’t overlap.
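
The coordinate randomization in (e) can be sketched roughly like this. It is only an illustration and not the code from the tormap repo: the function name, the 0.01 degree default offset and the clamping are all my own assumptions.

```python
import random


def jitter(lat, lon, max_offset=0.01, rng=random):
    """Return (lat, lon) shifted by a small random amount on each axis.

    max_offset is in degrees; 0.01 is an assumed value, not the one
    the real script uses.
    """
    new_lat = lat + rng.uniform(-max_offset, max_offset)
    new_lon = lon + rng.uniform(-max_offset, max_offset)
    # Clamp so the result is still a valid coordinate.
    new_lat = max(-90.0, min(90.0, new_lat))
    new_lon = max(-180.0, min(180.0, new_lon))
    return new_lat, new_lon
```

Calling jitter() once per node before writing its KML placemark would give two nodes that GeoIP places at the same city centre slightly different positions on the map.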

You can find a complete changelog in the kargig/tormap GitHub repo.

And here’s the outcome: World city map of Tor nodes at https://tormap.void.gr/
One of my main goals was to have selectable classes of nodes that will appear on the map.

To produce the map overlay, a cron script runs every hour, which is also the period it takes for Tor Authority nodes to produce a new consensus, and creates some static files which are then served by nginx.

I’m not a web developer/designer and I don’t really know any javascript. So please, feel free to fork my code and make it look better, run faster and add your own features. I’ll happily accept patches/pull requests!

Extras
In the kargig/tormap repo you will also find a handy script, ‘runme.sh’, that downloads all the files that the Python script needs to parse.

Misplaced nodes on the map
Well, blame MaxMind’s GeoIP City database for that. But I think it’s kinda funny to see Tor nodes in Siberia and in the middle of the sea (look off the west coast of Africa), heh. For those wondering, the nodes in the sea are gathered there because their GeoIP Lat,Long is set to 0,0.
Really though, what’s “Ben’s Cat Shaque” displayed there next to all those nodes off the west coast of Africa? Does anyone have a clue?

Conspiracy people
I’m sure that people who love conspiracy theories will start posting about those ‘Bad’ Tor nodes in Iran and Syria. Why do you think these are there? What does it mean? Let the flames begin!

Future TODO
a. OpenStreetMap
I have started working on an OpenStreetMap implementation of the above using OpenLayers. The biggest hurdle is that OSM does not provide a server that serves map tiles over HTTPS. Makes me wonder… is that actually so difficult?
b. More stats
I would like to add small graphs on how the number of nodes in each class evolves.

Other Tor mapping efforts
https://b.kentbackman.com/2010/10/04/view-tor-exit-nodes-in-google-earth/
http://freehaven.net/~ioerror/maps/v3-tormap.html

Don’t forget, you can always help Tor by running a node/bridge or sending some money to Tor or EFF!

Scaling a small streaming system from 50 to 4000+ users

This post was originally written on 09/11/2012 but for various reasons it couldn’t be published earlier. It won’t be very technical; it’s mostly a behind-the-scenes view of how some of us at NOC GRNET tried to cope with an extreme spike in demand for a specific service we support.

Intro
At GRNET we provide a live streaming service to the Hellenic Parliament, who also connect to the Internet through our network. We serve the official TV channel of the Hellenic Parliament (the WebTV program actually). We have a machine that encodes (transcodes actually) video/audio and a separate streamer that serves the content to the clients via RTSP, RTMP and HTTP. While this service has been running for quite some time, and the streaming server has been used on various other occasions, it typically serves no more than 100 concurrent clients. With a stream at 640Kbps on average that’s about 60Mbps of streaming traffic. The number of viewers usually goes up only when there’s something exciting going on. Our previous traffic spike was at 577Mbps on October 31st, when the Minister of Finance was presenting the budget for 2013 to the Parliament. What happened that day was that news reporters were on strike and it seems that one of the few news sources available at the time was the Parliament channel, either on TV or through our stream. The viewership that day surpassed all previous expectations but the streaming system actually performed quite well, considering we didn’t have any complaints.
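
The back-of-the-envelope arithmetic behind those numbers can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, and it assumes every client pulls exactly the 640Kbps average, which real clients won’t.

```python
def aggregate_mbps(clients, stream_kbps=640):
    """Total outgoing traffic in Mbps for a number of concurrent clients."""
    return clients * stream_kbps / 1000


def clients_for(traffic_mbps, stream_kbps=640):
    """Rough number of concurrent clients that a given traffic level implies."""
    return int(traffic_mbps * 1000 / stream_kbps)
```

aggregate_mbps(100) comes out at 64.0, which is the “about 60Mbps” above, and by the same estimate clients_for(577) puts the October 31st spike at roughly 900 concurrent clients.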

Remember, remember the 7th of November
On the 7th of November almost everyone was on strike in Greece. There was a huge march arranged at 17:00 outside of the Greek Parliament to protest against the new harsh measures of Memorandum III that the parliament was to vote for at midnight.

We started seeing some traffic arriving at, well leaving is more accurate, our streaming server at around 10:00 in the morning. Some news sites and blogs (tanea.gr, left.gr, protothema.gr) had already started posting links to our parliament’s streaming service for people to watch the discussion that was already taking place. At around 10:30 we were already serving more than 100Mbps while the traffic was steadily rising. A bit after 12:00 though things got really nasty. Websites like tovima.gr, zougla.gr and newsit.gr had posted our streaming links on their front pages. The discussions both inside the parliament and on the social networks had also started heating up. At 12:25 the traffic started ramping up extremely fast!

12:25 160Mbps
12:32 367Mbps
12:38 597Mbps
12:42 756Mbps
12:49 779Mbps

That’s when traffic stopped growing. We actually saw some drop in the traffic and then up to 780Mbps and down again. It was obvious that we had reached some limit. It wasn’t very obvious though what the bottleneck was, since the streaming system load was quite low, around 0.6-1.0 on a 4 vCPU VM, and it also had a virtio-net/gigabit[1] network card. Theoretically it should have been able to push at least 100-150 more Mbps. In order to cope with the increasing demand we changed the player web page to off-load some users to an experimental service that relies on IP multicast from the server and P2P between the clients for delivering streams. This bought us some time, about 2-3 hours. Also, maybe because it was noon time, or due to more people using the experimental service, or people not being satisfied with the performance of the streaming video and disconnecting, traffic later on dropped “down” to an average of 650-700Mbps.

We started discussing short-term actions for improving the service so as to be able to cope with more demand. The streaming server was running on GRNET’s virtualization platform, which is based on ganeti, and the first thought we had was to try and add more network cards to the VM and somehow bond the cards together. Could we push more than 1Gbps from the same VM? The problem was that we couldn’t do any 802.3ad bonding since the “virtual switch” inside the virtualization platform did not support such features. One other solution would be to add another virtual network card and use Linux bonding mode 5 (balance-tlb) or 6 (balance-alb). After a bit of reading these modes were rejected as well. Both of them expect the driver of each Ethernet card to report the card’s speed through ethtool. Our VMs use virtio-net, which doesn’t support this ethtool functionality. We could switch to another type of network card, e1000 for example, but this could have an unmeasured/untested performance penalty which could actually negate the addition of a second network card.

Hundreds of thousands of Greeks were planning to join this huge protest starting at 17:00 and so were some of us working at GRNET, at least I was. We had to make a decision though, drop our plans and improve the streaming service to help more people, especially Greeks living abroad, to access the stream or just join others at Syntagma square? Our shift was ending soon anyway. We decided to stay at work and try and improve the service as far as we could.

The path we chose in order to “scale” was to create a second instance of the streamer and try to “balance” client requests to both streamers using round robin DNS. That involved creating a checkpoint of the running VM at our storage system, copying the running VM’s image from that checkpoint as a new image and running a new VM with that image. So we would be cloning the current streaming service while the service was running and serving clients (take that Windows!).

So we provisioned a new Debian server through LDAP, added the copied image disks to the VM’s configuration, booted the VM, ran puppet to change the configuration files according to the new hostname and IP, and we were ready to serve more clients. After some testing we created a RR DNS entry, “streamer-frontend.domain.gr”, which pointed to both first-streamer.domain.gr and second-streamer.domain.gr and had a TTL of 60. Changing the landing page to serve streamer-frontend.domain.gr as the streamer URL wasn’t enough though. The first problem was that there were news sites and blogs that had directly copied our first streamer’s URL instead of the live.grnet.gr/parliament/ landing page, which actually runs on a different system and is served by an Apache2 server (more on that later). Having already more than 1000 streams on the first streamer and then starting to do RR DNS on both servers meant that there would still be a large number of clients served by the first server, and new people getting directed there would only make matters worse, while people directed to the second server would have a much better service experience. So what we actually did was point streamer-frontend.domain.gr not to both first-streamer.domain.gr and second-streamer.domain.gr but just to second-streamer.domain.gr. We also changed the IP address of first-streamer’s A,AAAA records to second-streamer’s IP and flushed the first-streamer RR at our caching resolvers. Since many people use our caching resolvers, especially students with DSL lines, we were able to direct even more people towards the second-streamer.

Before adding second-streamer.domain.gr:

* first-streamer.domain.gr -> 1.2.3.4 TTL 86400
* live.grnet.gr/parliament/ pointing the streamer URL at first-streamer.domain.gr

After adding second-streamer.domain.gr:

* streamer-frontend.domain.gr -> 1.2.3.5 TTL 60
* first-streamer.domain.gr -> 1.2.3.5 TTL 60
* second-streamer.domain.gr -> 1.2.3.5 TTL 60
* live.grnet.gr/parliament/ pointing the streamer URL at streamer-frontend.domain.gr
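
The reasoning behind pointing everything at second-streamer can be shown with a toy simulation. This is a sketch under loud assumptions: real round robin DNS is not a strict rotation (resolver caching and client behaviour get in the way), the 1000 existing streams is the rough figure mentioned above, and the 1000 new clients is a number I picked for illustration.

```python
from itertools import cycle


def distribute(existing, new_clients, targets):
    """Assign new_clients to targets in strict rotation; return per-server load."""
    load = dict(existing)
    rotation = cycle(targets)
    for _ in range(new_clients):
        load[next(rotation)] += 1
    return load


# Assumed starting point: first-streamer already holds about 1000 streams.
existing = {"first-streamer": 1000, "second-streamer": 0}

# Option 1: streamer-frontend rotates over both streamers.
both = distribute(existing, 1000, ["first-streamer", "second-streamer"])

# Option 2 (what we did): every new client goes to second-streamer.
only_second = distribute(existing, 1000, ["second-streamer"])
```

With these made-up numbers, option 1 leaves first-streamer with 1500 streams against 500 on second-streamer, while option 2 ends up with 1000 on each.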

That way people disconnecting from the stream and reconnecting some time later on were pointed to the new server, if they used our landing page. That meant better quality streaming for both “old” clients getting their stream from first-streamer, and for the new clients that were being served from second-streamer.

Within 30 minutes of booting up, the second streamer was serving more than 250Mbps and after another 30′ its traffic had climbed up to 450Mbps while first-streamer was steadily serving more than 550Mbps. That led us to an all-time record of 1.02Gbps at 17:45. At that point we were serving more than 2000 concurrent streams, a number we never expected for this humble streaming service.

While traffic was increasing through the addition of the second streamer, another problem came up. Some misconfigured clients and HTTP proxies were opening connections to the Apache2 server that serves the web page and never closing them. We hit Apache’s default ServerLimit of 256 around 17:00. We had to increase ServerLimit inside Apache’s config and restart it:

ServerLimit 500
<IfModule mpm_prefork_module>
    StartServers         10 
    MinSpareServers      15 
    MaxSpareServers      25 
    MaxClients          500
    MaxRequestsPerChild   0  
</IfModule>

After some minutes of happiness another “unexpected” issue came up. It started to rain in Athens, so people who were at the protest at Syntagma square would probably leave soon and go home. There would certainly be an increase in traffic. And what about midnight, which is when the MPs were supposed to vote on the new measures? People would be sitting at their computers, bashing the politicians on Twitter and Facebook while watching the Parliament’s stream. At the same time we started seeing a big increase in international traffic, since more Greeks living abroad were tuning in; this stream was probably the only way they could watch what was happening at the Parliament.

That only meant one thing. Hello third-streamer! 10′ later we had another streamer running. But we also wanted to make some changes to the first two VMs. We wanted to add more vCPUs, going from 4 to 8, add more RAM, going from 4096MB to 6144MB, and do some minor performance tuning on the software. That meant restarting the VMs in order to activate the additional resources… luckily the chairman of the Parliament announced a 10 minute recess some time around 19:30. That was our chance… we changed back first-streamer’s IP address, added it to the streamer-frontend RR DNS while also adding third-streamer to it. Then we rebooted the first two VMs. After 30″ we had 3 streamers running and waiting for clients to join them. third-streamer instantly got more than 200Mbps at the very moment we rebooted the other 2 VMs.

After adding third-streamer.domain.gr:

* streamer-frontend.domain.gr -> {1.2.3.4 | 1.2.3.5 | 1.2.3.6} TTL 60
* first-streamer.domain.gr -> 1.2.3.4 TTL 60
* second-streamer.domain.gr -> 1.2.3.5 TTL 60
* third-streamer.domain.gr -> 1.2.3.6 TTL 60
* live.grnet.gr/parliament/ pointing the streamer URL at streamer-frontend.domain.gr

At 20:10, some of us decided to leave work after 11 hours, having 3 streamers running that had already reached 1.2Gbps of traffic. There were more than 2800 concurrent users at the time…

Our assumptions for midnight came true. Around that time, more than 4000 people had tuned in to the parliament’s stream and we were pushing about 1.66 Gbps. That was probably the all-time record for a single service in GRNET.

A big problem we didn’t solve that night
Unfortunately, each client gets a unique session id from the streamer, even when requesting an HTTP stream. These are not shared among the streamers, so if you send a subsequent HTTP request to a different streamer, which does not track that session id, it will not serve the request as expected but rather prompt the client to start a new session. Since some browsers did not cache the RRs of streamer-frontend long enough (remember that low TTL?) they would spread their requests to all the streamers. We knew this was happening but there was no way to solve this without extensive changes to the setup, which would have to be tested and of course would mean significant downtime for the service. Therefore, depending on the browser people used, some got better or worse service than others. The good thing is that we know how to fix this in the future.
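
Here is a toy model of that behaviour, just to make the problem concrete. It is a sketch and has nothing to do with the real streamer software: the Streamer class, the session id format and the play() helper are all invented for this example.

```python
import itertools


class Streamer:
    """Hypothetical streamer that only knows the session ids it issued itself."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()
        self._ids = itertools.count(1)

    def request(self, session_id=None):
        """Serve a known session, or make the client start a new one."""
        if session_id in self.sessions:
            return session_id, "served"
        new_id = f"{self.name}-{next(self._ids)}"
        self.sessions.add(new_id)
        return new_id, "new session"


def play(streamers, picks):
    """Send one request per entry in picks (indexes into streamers)."""
    session_id = None
    outcomes = []
    for index in picks:
        session_id, outcome = streamers[index].request(session_id)
        outcomes.append(outcome)
    return outcomes
```

A client that keeps hitting the same streamer, play(streamers, [0, 0, 0, 0]), gets one “new session” and then “served” for every later request. A client whose requests rotate, play(streamers, [0, 1, 2, 0]), gets “new session” every single time, because each streamer it lands on has never seen the id issued by the previous one.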

Some Stats & Graphs
more than 31000 unique IPs connecting to the streaming service
more than 4000 concurrent users at peak time
more than 4.7 TB of data streamed

first-streamer ethernet traffic:

second-streamer ethernet traffic:

third-streamer ethernet traffic:

Streamer client types

Streamer LAN traffic aggregates:

Create your own graphs here: http://mon.grnet.gr/rg/157184/details/

Epilogue
Yeah, there were hiccups in the stream for many many people, we acknowledge that. But we certainly did the best we could to keep the service running. Could we have done better? Yes we could, but we would have had to re-design the service and make some drastic changes that were never tested before. We didn’t want to risk making drastic changes that might not work at all while we could somewhat “scale” our service using a “working” solution. I wonder what will happen on Sunday when the Parliament votes on the 2013 budget… Will we exceed our previous peak? We’ve already discussed alternatives to cope with the extra traffic and we’ll definitely be better prepared [2]!

I think that zmousm, alexandros and I did a fairly good job given the circumstances. We should buy each other a beer sometime. We’ll treat faidonl to another one though for his helpful consulting 🙂

Since no external CDNs or other services provided by companies abroad were used, this might be the event with the biggest ever demand in network resources served from within Greece.

Do you like solving such problems? Then there might be a lurking sysadmin inside you 😉

[1] After a talk with apoikos, he corrected me saying that virtio-net does not have a “gigabit” capacity. Virtio-net’s capacity is only limited by the node resources available, so it can theoretically perform better than gigabit. So our thoughts on using balance-alb/tlb to cope with the extra bandwidth were wrong. This suggests that the bottleneck on the streamer that couldn’t go over 800Mbps was the streamer software itself, since the system load on both the VM and the hardware node was low.

[2] On 11/11/2012 the Greek Parliament voted on the 2013 budget. What we did prior to that day to cope with the traffic was to set up a new version of the streamer software, which actually came out on 08/11/2012, just one day after the initial spike, and which had the option to disable session IDs. Then it was quite straightforward to add some varnish caches in front of the streamers to serve the HTTP streams to the clients. Unfortunately client demand was somewhat lower; it only reached 0.92Gbps.

Review of the first Athens CryptoParty

On Sunday the 11th of November we finally had our first CryptoParty in Athens, Greece. We hosted it at the Athens Hackerspace.

Organizing
We organized our first CryptoParty in a very ad-hoc way. A pad was set up and advertised on Twitter/Facebook. Almost immediately people started writing their thoughts, views and interests there. We soon had a list of topics that people were interested in and another list of people willing to give presentations/workshops. Later on we set up a doodle so people would choose the most convenient dates for them. From the group of 50 people that originally expressed their interest to attend the CryptoParty, at least 20 voted on the doodle. That’s how the final date of November the 11th was chosen.

It was surprising/refreshing that even though everything was organized through an anonymously editable pad, nobody tried to vandalize it.

The actual event
Through the pad, we chose 3 topics for the first meeting. “Using SSL/TLS for your Internet communications”, an “introduction to Tor” and another “introduction to I2P”.
The time for the event was set for 12:00 noon, probably a very bad choice. The next one should definitely be later in the afternoon or even at night. We learn from our mistakes though… People started showing up at around 11:30, but the event didn’t start until 12:30, when someone from hackerspace.gr gave a 5′ intro talk about what the hackerspace is to people who had never been there before. People kept coming even until 13:00 and the audience had grown to more than 30 people.
After the three workshops/presentations around 10-15 people stayed and we ordered pizza.

All in all I’d say it was fairly successful since more than 30 people came and actually did things to improve their security.

The presentations/workshops
“Using SSL/TLS for your Internet communications” (in English) was my effort to show people how cleartext data travels through the Internet and how any intermediate “bad guy”/LEA can easily read or manipulate your data. People were instructed to install wireshark so they could actually see for themselves what the actual problem is. It was very “nice” to see their surprise upon watching cleartext packets flowing through their network cards. It was even nicer to see their surprise when I used tcpdump on hackerspace’s router to redirect traffic to wireshark running on a Debian laptop to display their data, without having “direct” access to their computer. Then people were introduced to the idea of Transport Layer Security (SSL/TLS), and how HTTPS protects their web data from prying eyes. After this tiny “privacy apocalypse” it was very easy to convince users to install HTTPS-Everywhere. And so they did. Afterwards they got instructions on how they should change SSL/TLS settings for their email and IM clients.
My original intention was to “scare” people a bit. It was funny to see their faces when they logged in to yahoo mail and they could see their emails cleartext on wireshark. People don’t understand how data travels through the Internet unless they experience it for themselves. I’m glad that people who had absolutely no idea about HTTPS are now using HTTPS-Everywhere to protect themselves. Hopefully they’ll show that to their friends as well.

“Introduction to Tor” (in Greek) gave people an idea of what anonymity is, how it differs from security and how users should be combining both TLS and Tor usage for security and anonymity at the same time. A brief explanation of what hidden services are was given as well. Even though George asked people to download and install Tor Browser Bundle and use it, we’ll definitely need more “hands on” Tor workshops in the future. It will be interesting to convince more people to actually use it and, why not, even set up their own hidden services.

“Invisible Internet Project a.k.a. I2P” (in English) by @alafroiskiotos was probably the hardest of the three presentations to keep up with for people that had no previous idea about anonymity networks. Its unique architecture and some difficulties in its usage raised a lot of interesting questions from attendees.

Thoughts on future CryptoParties
After the end of the workshops/presentations we had a lengthy discussion with the attendees as to what they would like to see/experience in future CryptoParties. Unfortunately people were not very vocal. Very few participated and openly expressed their thoughts/opinions. A great part of the discussion was spent trying to figure out whom CryptoParty presentations/workshops should target: users? developers? geeks? It’s obviously very hard to target all groups of people at the same time.

So here are my thoughts on what future CryptoParties should be. CryptoParties should be about changing user habits, they should be closer to workshops than presentations. They should be focused mainly on users not developers nor computer science students. Just simple users. People don’t want theoretical talks about cryptography, they need advice they can use in their daily lives. It’s already very hard to talk about modern crypto to people who haven’t got a strong mathematical background, you have to oversimplify things. Oversimplifying things then makes geeks/nerds unhappy and still doesn’t “teach” people about proper crypto. Even a fairly “simple” HTTPS negotiation contains key crypto concepts that are very difficult for a “crypto-newbie” to grasp. So it’s a lose-lose situation.

We need to teach, or better convince, users to use good, secure, audited tools and not just tell them about technologies and concepts. We, weirdos, might like that, but most users don’t. People need our help to learn how to avoid “fancy” tools and false security prophets. We need to show them how security should be applied in a layered approach. Getting people to care about their own privacy is key to the success of CryptoParties in the way I see them. To achieve that, we, people that know a few things more than the average Joe, should all become volunteers in such efforts. We should be joining CryptoParties in order to help others and not in order to improve ourselves and our knowledge. (Actually when you study in order to make a good workshop/presentation you improve your own knowledge as well, but let’s leave that aside for now.) We can have our separate geeky/nerdy events to present fancy tech and cool crypto stuff, but let’s keep CryptoParties simple and practical. Oh and we’ll need to repeat things again and again and again. That’s the only way people might change their habits.

If you want to find out more about the next Athens CryptoParty keep an eye on Hackerspace’s events and the athens cryptoparty pad. Join us!

Good luck to all the CryptoParties worldwide!

When in doubt, always blame the application

When you have a misbehaving system and you are not sure what the problem is, always bet on a poorly written application.

Here’s a small example of how another poorly written web application caused system issues.

I was sitting at my office today when I got this nagios alert for a host.

Date/Time: Tue Nov 6 19:15:11 EET 2012
Additional Info:
SWAP CRITICAL – 0% free (0 MB out of 509 MB)

Logging in showed that all the swap had been used, and so had the RAM, 0.95/1GB. Lots of apache2 server instances were running. I did a netstat and I saw a lot of ESTABLISHED connections:

tcp6       0      0 2001:DB8:f00::1:35571 2001:DB8:bar::100:80 ESTABLISHED 9631/apache2    
tcp6       0      0 2001:DB8:f00::1:35777 2001:DB8:bar::100:80 ESTABLISHED 9656/apache2    
tcp6       0      0 2001:DB8:f00::1:36531 2001:DB8:bar::100:80 ESTABLISHED 11578/apache2   
tcp6       0      0 2001:DB8:f00::1:36481 2001:DB8:bar::100:80 ESTABLISHED 11158/apache2   
tcp6       0      0 2001:DB8:f00::1:36295 2001:DB8:bar::100:80 ESTABLISHED 11115/apache2   
tcp6       0      0 2001:DB8:f00::1:34831 2001:DB8:bar::100:80 ESTABLISHED 8312/apache2  

2001:DB8:f00::1 -> my server
2001:DB8:bar::100 -> dst server

As one can easily see my server is connecting to port 80 of dst, possibly asking for something over HTTP.

# netstat -antpW | grep 2001:DB8:bar::100 | wc -l
111

# dig -x 2001:DB8:bar::100 +short                                 
crl.randomcertauthority.com.
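
If you’d rather see all the destinations at once instead of grepping for one address, the same count can be done with a few lines of Python over the netstat output. This is a sketch that assumes the column layout shown above (proto, recv-q, send-q, local, remote, state, pid/program); the function name is made up.

```python
from collections import Counter


def established_by_remote(netstat_output):
    """Count ESTABLISHED connections per remote host in `netstat -antpW` output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not an ESTABLISHED connection.
        if len(fields) >= 6 and fields[5] == "ESTABLISHED":
            # The port is whatever follows the last colon of the remote address.
            remote_host = fields[4].rsplit(":", 1)[0]
            counts[remote_host] += 1
    return counts
```

Feeding it the output of netstat -antpW and calling most_common() on the result lists the remote hosts holding the most connections first.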

Tailing the log files didn’t show anything weird happening. I ran a tcpdump for that dst server but there wasn’t any traffic going on at that time.

So, I took a look at munin to see when this problem started developing.



As it’s obvious from the above graphs, the problem started around 14:00. So I took another look at the apache logs and I saw a bot crawling a specific url from my server. I visited that url on my server using curl and I saw traffic flowing through tcpdump going from my server to dst server. So visiting that URL was definitely causing problems. But why?

I restarted apache, swap and memory were released, all the stale ESTABLISHED connections went away and I saw hundreds of FIN/RST packets going back and forth in tcpdump.

I tried to open a few concurrent connections from my PC to my server’s url using curl. After a couple of tries netstat showed that I had managed to create stale ESTABLISHED connections towards dst server. It was an HTTP connection asking for a crl. So I was both able to reproduce the problem and I also knew the specific url of the dst server that caused the connection hanging issues.
Next thing I did was to try to open direct HTTP connections from my server to the dst url using curl. After a few concurrent connections I managed to make curl hang. So the problem was definitely not on my server, but at the dst server.

Since it was already quite late, my first (re)action was to install mod_evasive to try and minimize the problem so I could take a better look the next day.

# aptitude install libapache2-mod-evasive
# a2enmod mod_evasive
### edit /etc/apache2/sites-enabled/site-name and add the following
       <IfModule mod_evasive20.c>
            DOSHashTableSize    3097
            DOSPageCount        1   
            DOSSiteCount        50  
            DOSPageInterval     1   
            DOSSiteInterval     1   
            DOSBlockingPeriod   10  
            DOSEmailNotify myemail@mydomain.gr
        </IfModule>
# /etc/init.d/apache2 reload

I tried to curl my server’s URL from my PC and I got blocked after the second concurrent try. But after some repetitions I was still able to create one or two stale ESTABLISHED connections from my server to the dst server. Far fewer than before but the problem was still somewhat reproducible.

Then I decided to take a look at the site’s PHP code. Finding the culprit was quite easy, I just had to find the code segment where PHP requested the dst server’s url.
Here’s the code segment:

$ch = curl_init($this->crl_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
$crl_content = curl_exec($ch);

The developer had never thought that the remote server might keep the connection open for whatever reason (rate limiting anyone?)

Patching it was quite simple:

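$ch = curl_init($this->crl_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// give up if the connection can't be set up within 60 seconds
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 60);
// give up if the whole request takes longer than 60 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
$crl_content = curl_exec($ch);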

After this everything worked fine again. Connections were getting ESTABLISHED but after 60 seconds they got torn down, automagically. No more stale ESTABLISHED connections. Hooray!
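
The same rule applies in any language: never make an outgoing request without a timeout. As a rough Python analogue (this is not the site’s code, and fetch_crl is a made-up name), the patched call would look something like this. Note that urllib’s timeout applies to each blocking socket operation rather than to the whole transfer, so it is close in spirit to CURLOPT_TIMEOUT but not identical.

```python
import urllib.request


def fetch_crl(url, timeout=60):
    """Return the response body, or None if the request fails or times out.

    URLError, HTTPError and socket timeouts are all OSError subclasses,
    so any network-level failure ends up returning None.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        return None
```

With this in place a remote server that accepts the connection and then goes silent costs you at most the timeout per blocked operation, instead of an apache2 process stuck forever.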

A letter to every developer:

Dear developer,

please test your code before shipping. Pretty please take corner cases into account. We know you’re competent enough, don’t be lazy.

Your kind sysadmin