2013-06-08

Facebook through a PRISM

There's a tremendous amount of hot air being talked about the alleged US Government access to personal data from the major internet companies via the supposed PRISM system. I'm not entirely sure who to believe, though I'm defaulting to "no-one"; there's no reason to trust any Government denials, nor any better reason to put faith in the technical understanding of journalists. So let's look at how PRISM might actually work from the limited point of view of snooping on Facebook.

The size of the problem

Facebook has somewhere around 1bn users, but they're not all active - indeed, they vary greatly in levels of activity. So let's say there are 250M distinct FB users per day, and they spend an average of 10 minutes per day on it with 2 activities (read a post or instant message, view a photo, update status, make or delete a friend) per minute. That's 5bn activities per day, or roughly 60,000 per second, that you want to record. How do you find out what they are? Bear in mind that your key requirement is to be able to know who is talking to and associating with whom, and what they are saying.
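
As a sanity check, here's the arithmetic in a few lines of Python (all the inputs are my assumptions, not Facebook's figures):

# Back-of-envelope check of the assumed numbers above: 250M daily users,
# 10 minutes per day each, 2 activities per minute.
daily_users = 250_000_000
minutes_per_user = 10
activities_per_minute = 2

activities_per_day = daily_users * minutes_per_user * activities_per_minute
activities_per_second = activities_per_day / 86_400   # seconds in a day

print(f"{activities_per_day:,.0f} activities/day")     # 5,000,000,000
print(f"{activities_per_second:,.0f} activities/sec")  # ~57,870 - call it 60,000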

Snooping at the ISP

The easiest place to start surveillance is at the user's domestic Internet Service Provider (ISP). This is where most USA-based people will connect to the Net. The user will have a public IP (internet address) which is the point where their traffic enters and exits the Net proper, and the ISP will - or should - normally know which of their users, and hence which physical location and bank account, is tied to that IP. This knowledge will be looser for entities like public wi-fi networks, but they should still have physical location info, e.g. the Starbucks on the corner of 5th and Maple.

Regular (HTTP) internet traffic consists of packets - consider them as postcards - with "from" and "to" Internet addresses, plus some text content. The packets are very small, so you have to aggregate a lot of them in order to build up e.g. the entire contents of an email; however, they carry sequence numbers so you know what order they are supposed to be in. The "from" address will be the user's public IP, and for our purposes we know which "to" addresses belong to Facebook, so we can require the ISP to capture just those packets for our use. Assuming that we monitor all 250M people in this way, and that each "activity" is about 4KB in size (ignoring photos and voice chat), that's an average stream of 240MB/sec, nearly 2Gbits/sec, that the Government has to collect from the various ISPs and process in real time. In practice you need to double that bandwidth because usage isn't flat throughout the day - there will be a definite diurnal cycle and you need to have capacity for the daily peak.
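
Continuing the back-of-envelope sums in Python (the 4KB-per-activity figure is an assumption, as above):

# ~60,000 activities/sec at an assumed 4KB (4,000 bytes) per activity,
# ignoring photos and voice chat entirely.
activities_per_second = 60_000
bytes_per_activity = 4_000

bytes_per_second = activities_per_second * bytes_per_activity
print(f"{bytes_per_second / 1e6:.0f} MB/sec average")            # 240 MB/sec
print(f"{bytes_per_second * 8 / 1e9:.1f} Gbit/sec average")      # ~1.9 Gbit/sec
print(f"{bytes_per_second * 8 * 2 / 1e9:.1f} Gbit/sec at peak")  # ~3.8 Gbit/sec, assuming a 2x daily peak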

This is a substantial processing challenge but it's not impossible - the Government just has to write its own mini-Facebook back-end that records user activity, without the need to handle photos and videos, and allows them to associate Facebook IDs with real people IDs. Then they can run their own queries over that data store. They'll be writing around 20TB/day to that store, so they'll need quite a few hard drives (more when you consider redundancy) but hey, it's the government, they've got the spare $ somewhere.
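
For the storage side, the same stream works out as follows (again, a sketch based on the assumptions above):

# Daily and yearly write volume implied by a ~240 MB/sec average stream.
bytes_per_second = 240_000_000
bytes_per_day = bytes_per_second * 86_400

print(f"{bytes_per_day / 1e12:.1f} TB/day")         # ~20.7 TB/day
print(f"{bytes_per_day * 365 / 1e15:.1f} PB/year")  # ~7.6 PB/year, before any redundancy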

Of course, someone has to actually pay for the hardware and bandwidth to filter, store and forward the traffic from the ISPs - more government cash - and someone's going to have to monitor and maintain it. This has to happen in nearly all USA ISPs, without any word of it getting out. I'm sure this is completely realistic.

Problem! We're primarily interested in "bloody foreigners", and not just the ones based in the USA. If two people outside the USA are communicating, even if it's via a USA-based Facebook data center, we won't even know it's happening. How can we improve this situation?

Snooping at Facebook's edge

Here we take advantage of the fact that even foreign users end up talking to a FB data center, and many of those data centers are in the USA. (Presumably whatever we work out here could also be done by friendly governments like Eire or the UK for data centers abroad.) Instead of monitoring at the USA users' ingress points, you look at where their traffic egresses into Facebook's network. This gives you far fewer places to monitor, though obviously much more traffic per spot, so you need fewer pieces of hardware but of a much higher grade. You also have fewer places for news of the additional hardware installation and operation to leak from.

The IP packets still have source addresses, so you know where they came into the Internet (more or less). You'll need additional collection of data from US ISPs tying IPs to locations and people where feasible; for foreign users you won't have this quality of source information, but you can probably manage.

So far we've seen that just for Facebook you're looking at quite a substantial volume of traffic, and we've ignored all photos and videos, but you can probably infer quite a lot from this data and it doesn't seem to be an insurmountable volume. So far PRISM seems to be not technically infeasible. But there's a wrinkle...

Encryption

So far we've been blithely assuming that we can read the plain text of what the user is sending to and receiving from Facebook - the URLs, the posted text - without any problems. Indeed, HTTP - the system by which web browsers communicate with web servers - makes it easy to read this information. An HTTP conversation happens in plain text and looks something like this:

From the browser: asking for the page "index.html" on host "www.example.com":
GET /index.html HTTP/1.1
Host: www.example.com
From the server:
HTTP/1.1 200 OK
Date: Mon, 27 Feb 2012 20:31:00 GMT
Server: Apache/2.3.4.5 (Unix) (Red-Hat/Linux)
Last-Modified: Sun, 26 Feb 2012 01:10:25 GMT
Etag: "2e70e-7d6-5f1c883b"
Content-Type: text/html; charset=UTF-8
Content-Length: 100
Connection: close

<html>
<head>
  <title>An Example Page</title>
</head>
<body>
  Hello World
</body>
</html>
In the server's reply, the first block is the information about the server and what it's returning; the second block is the HTML page itself.

If you've got access to the stream of data between a user and a website, you can very easily work out what they're doing. You could even change the data, e.g. modifying every instance of the word "Guardian" to "Grauniad" in the stream back to the user, so that the user browsing the eponymous website gets very confused.
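
To make "very easily" concrete, here's a tiny illustrative Python sketch that pulls the interesting bits out of a captured plain-text request - who is asking which host for which page. The request text is the made-up example from above.

# Parse a captured plain-text HTTP request and extract what an eavesdropper
# cares about: the method, the host and the path.
captured_request = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "\r\n"
)

lines = captured_request.split("\r\n")
method, path, _version = lines[0].split(" ")
headers = dict(line.split(": ", 1) for line in lines[1:] if ": " in line)

print(method, headers["Host"] + path)   # -> GET www.example.com/index.html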

Luckily, some clever chaps were aware of this vulnerability of HTTP and came up with a modification: HTTP Secure (HTTPS). This is widely used, and is the default for new Facebook users. The difference it makes for our purposes is that all an external observer can see in plain text is a conversation between the browser and the Facebook server negotiating a "shared secret" - a string that both of them know but that no other observer can know. Once this is agreed, they encrypt the rest of their conversation using that shared secret. The observer can't see what URLs are being requested, or what data is returned. All they know is that IP 203.0.113.2 is communicating with Facebook, and that (judging by the encryption negotiation) they're using Internet Explorer 9. That's not a lot of use to an eavesdropper.

There are a number of approaches to compromising HTTPS sessions, but they're generally rather CPU intensive, target specific web applications, and are progressively being prevented by upgrades to the secure protocols. Here's a little light reading of some examples for the curious. Generally, the only approach that really scales is a man-in-the-middle attack. This is where an eavesdropper intercepts the user's packets to Facebook and pretends to be Facebook itself; in turn, the eavesdropper connects to Facebook pretending to be the user and relays the user's requests and Facebook's responses.
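
To get a feel for the relaying burden, here's a minimal Python sketch of the plumbing involved - purely illustrative, with an assumed listen port and the assumption that traffic has somehow been diverted to it. It just shuttles bytes blindly in both directions; a real man-in-the-middle would additionally have to terminate the TLS session on each side, which is exactly what the certificate machinery described next is designed to prevent.

# Minimal TCP relay: accept a diverted connection, open one to the real
# destination, and copy bytes both ways. The addresses are assumptions.
import socket
import threading

LISTEN_ADDR = ("0.0.0.0", 8443)            # where diverted traffic arrives (assumed)
UPSTREAM_ADDR = ("www.facebook.com", 443)  # the real destination

def pump(src, dst):
    # Copy bytes one way until the connection closes; a real intercept would
    # inspect and record the data here as well as forwarding it.
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        src.close()
        dst.close()

def handle(client):
    upstream = socket.create_connection(UPSTREAM_ADDR)
    # One thread per direction: client -> server and server -> client.
    threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pump, args=(upstream, client), daemon=True).start()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(LISTEN_ADDR)
server.listen(128)
while True:
    conn, _addr = server.accept()
    handle(conn)

Multiply that by millions of simultaneous connections, with carrier-grade reliability, and you start to see the scale of the undertaking.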

The way that HTTPS/SSL defeats this is via Certificate Authorities, a small number of trusted firms across the world who provide the data that lets your browser verify, when it connects to a server claiming to be Facebook, that the electronic signature it receives back really does belong to Facebook. The ins and outs of how this works are complex, but the net effect is that it's really rather hard for even a Government to pretend to be Facebook; it requires a substantial compromise of either Facebook's secret SSL keys (so it can sign the connection just like Facebook does) or a certificate authority (so it can claim that its fake signature really is Facebook's). Even these approaches are not foolproof, have to be repeated for each company, and must be refreshed whenever a company changes its keys. Worse for the eavesdropper, the impersonation can be detected by browsers; for instance, modern browsers know what the real certificates should be for major websites and can warn you if someone is trying to impersonate Facebook even if a compromised certificate authority claims that they're kosher.
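
As an illustration of the detection side, this is roughly how a certificate "pin" works: fetch the server's certificate and compare its fingerprint against a value you already trust. The Python below is a crude leaf-certificate pin for illustration only; real browser pinning works on keys and issuing authorities, but the idea is the same.

# Fetch a server's certificate and print its SHA-256 fingerprint. A pinning
# client would compare this against a fingerprint it already trusts and
# refuse to proceed on a mismatch.
import hashlib
import socket
import ssl

HOST = "www.facebook.com"

ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        der_cert = tls.getpeercert(binary_form=True)   # certificate in DER form

fingerprint = hashlib.sha256(der_cert).hexdigest()
print(f"{HOST} certificate SHA-256: {fingerprint}")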

There's also the not insignificant issue that such an interception approach has to be at least as reliable as the servers the user connects to, and must not introduce any detectable latency into the connection despite having to relay all the traffic both ways and filter out the text it's interested in.

The killer, though, is that you have to inspect all traffic to Facebook. Unlike plain text traffic, where you can easily see that packets pertain to photos or videos and ignore them, you can't tell this for HTTPS until you've intercepted the conversation and started to man-in-the-middle the connection. You've got to continue relaying the photo or video data, even though you're not interested in it, because if you drop the connection the browser will notice and so will the user. This massively magnifies the problem - you need as much processing capacity as Facebook itself has at its front ends.

Insider access

Google, Facebook et al have strenuously and specifically denied giving PRISM-like access to user data. Let's take them at their word. Assuming they're not co-operating, how would you get the access you'd like to user data without them knowing?

The most effective approach, as noted above, is to have an insider compromise their SSL secret keys. That lets you man-in-the-middle all HTTPS traffic. Unfortunately you have a very small set of insiders who have that access - and, by definition, those insiders will be as trustworthy and hard-to-compromise as possible.

The talk swanning around about "free access to data on Facebook's servers" is rubbish. There is no way any substantial routine access to user data is going unnoticed. Facebook will be monitoring read traffic, bandwidth usage, CPU and memory load for all its critical servers. If there's unexplained traffic in any volume, it's going to show up in dozens of monitoring consoles scattered all over the firm. So many people would have to be in on the snooping that word of it would inevitably leak.

Conclusion

It's just about feasible for a government to snoop on the plain-text non-photo non-video traffic for Facebook, and the best place to do it is probably where traffic exits the Internet going to Facebook's network. You're looking at a very serious amount of hardware to snoop and store the information, but it's tractable with the budget available from a major government. When it comes to routine snooping on encrypted (HTTPS) traffic though, forget it. It would require a major systematic compromise of closely-held secret keys, a very high performance software infrastructure operating at very high reliability, and - the killer - would have to be able to deal with as much traffic as the Facebook front ends themselves do. By extension, the same is true for Google, Yahoo, Microsoft etc. The Government is going to require inconveniently large amounts of hardware placed inconveniently close to the major Facebook, Google and Microsoft data centers.

I should add that the alleged $20M/year cost of PRISM would cover the capital costs of about 15,000 servers written off over 3 years (say, $4,000 per server once you include the associated network, power and cooling infrastructure). That's really not a lot. If you have 5 TB of storage per server, that's 75,000 TB over 3 years; the above requirements just for Facebook basics would be about 22,000 TB over that time, and you'd have to at least double that for redundancy. This doesn't even approach all the other personnel and software development costs.
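
For what it's worth, the sums behind that paragraph (all figures are the assumptions stated above):

# Reproduce the cost arithmetic: $20M/year for 3 years at an assumed
# $4,000 per server, with 5 TB of storage per server.
budget_per_year = 20_000_000
years = 3
cost_per_server = 4_000
tb_per_server = 5

servers = budget_per_year * years / cost_per_server
total_storage_tb = servers * tb_per_server

print(f"{servers:,.0f} servers")                  # 15,000
print(f"{total_storage_tb:,.0f} TB raw storage")  # 75,000 TB

# Compare with the ~20.7 TB/day worked out earlier for Facebook basics alone.
facebook_tb = 20.7 * 365 * years
print(f"{facebook_tb:,.0f} TB needed over {years} years, before redundancy")  # ~22,700 TB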

Conclusion: the scope of PRISM has almost certainly been massively exaggerated. Journalists have been taken for a ride.
