Authors: Bruce Schneier
Sometimes linking identities across data sets is easy; your cell phone is connected
to your name, and so is your credit card. Sometimes it’s harder; your e-mail address
might not be connected to your name, except for the times people refer to you by name
in e-mail. Companies like Initiate Systems sell software that correlates data across
multiple data sets; they sell to both governments and corporations. Companies are
also correlating your online behavior with your offline actions. Facebook, for example,
is partnering with the data brokers Acxiom and Epsilon to match your online profile
with in-store purchases.
Once you can correlate different data sets, there is a lot you can do with them. Imagine
building up a picture of someone’s health without ever looking at his patient records.
Credit card records and supermarket affinity cards reveal what food and alcohol he
buys, which restaurants he eats at, whether he has a gym membership, and what nonprescription
items he buys at a pharmacy.
His phone reveals how often he goes to that gym, and his activity tracker reveals
his activity level when he’s there. Data from websites reveal what medical terms he’s
searched on. This is how a company like ExactData can sell lists of people who date
online, people who gamble, and people who suffer from anxiety, incontinence, or erectile
dysfunction.
PIERCING OUR ANONYMITY
When a powerful organization is eavesdropping on significant portions of our electronic
infrastructure and can correlate the various surveillance streams, it can often identify
people who are trying to hide. Here are four stories to illustrate that.
1. Chinese military hackers who were implicated in a broad set of attacks against
the US government and corporations were identified because they accessed Facebook
from the same network infrastructure they used to carry out their attacks.
2. Hector Monsegur, one of the leaders of the LulzSec hacker movement under investigation
for breaking into numerous commercial networks, was identified and arrested in 2011
by the FBI. Although he usually practiced good computer security and used an anonymous
relay service to protect his identity, he slipped up once. An inadvertent disclosure
during a chat allowed an investigator to track down a video on YouTube of his car,
then to find his Facebook page.
3. Paula Broadwell, who had an affair with CIA director David Petraeus, similarly
took extensive precautions to hide her identity. She never logged in to her anonymous
e-mail service from her home network. Instead, she used hotel and other public networks
when she e-mailed him. The FBI correlated registration data from several different
hotels—and hers was the common name.
4. A member of the hacker group Anonymous called “w0rmer,” wanted for hacking US
law enforcement websites, used an anonymous Twitter account, but linked to a photo
of a woman’s breasts taken with an iPhone. The photo’s embedded GPS coordinates pointed
to a house in Australia.
Another website that referenced w0rmer also mentioned the name Higinio Ochoa. The
police got hold of Ochoa’s Facebook page, which included the information that he had
an Australian girlfriend. Photos of the girlfriend matched the original photo that
started all this, and police arrested w0rmer aka Ochoa.
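The hotel correlation in the third story amounts to a set intersection: collect the guest list of every hotel whose network was used, and see which names appear on all of them. What follows is a minimal sketch of that idea, not the FBI's actual procedure. The hotel names, guest names, and the function name common_guests are all invented for illustration.

```python
from functools import reduce


def common_guests(registrations):
    """Return the names that appear in every hotel's guest list."""
    guest_sets = [set(names) for names in registrations.values()]
    if not guest_sets:
        return set()
    return reduce(set.intersection, guest_sets)


# Hypothetical guest lists for the nights on which the anonymous
# e-mail account was accessed from each hotel's network.
registrations = {
    "Hotel A": ["J. Smith", "P. Jones", "R. Lee", "M. Garcia"],
    "Hotel B": ["P. Jones", "T. Nguyen", "K. Patel"],
    "Hotel C": ["A. Brown", "P. Jones", "R. Lee"],
}

suspects = common_guests(registrations)
print(suspects)
```

Each list on its own says nothing about who sent the e-mail. Only the overlap does, and every additional hotel shrinks it.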
Maintaining Internet anonymity against a ubiquitous surveillor is nearly impossible.
If you forget even once to enable your protections, or click on the wrong link, or
type the wrong thing, you’ve permanently attached your name to whatever anonymous
provider you’re using. The level of operational security required to maintain privacy
and anonymity in the face of a focused and determined investigation is beyond the
resources of even trained government agents. Even a team of highly trained Israeli
assassins was quickly identified in Dubai, based on surveillance camera footage around
the city.
The same is true for large sets of anonymous data. We might naïvely think that there
are so many of us that it’s easy to hide in the sea of data. Or that most of our data
is anonymous. That’s not true. Most techniques for anonymizing data don’t work, and
the data can be de-anonymized with surprisingly little information.
In 2006, AOL released three months of search data for 657,000 users: 20 million searches
in all. The idea was that it would be useful for researchers; to protect people’s
identity, they replaced names with numbers. So, for example, Bruce Schneier might
be 608429. They were surprised when researchers were able to attach names to numbers
by correlating different items in individuals’ search history.
In 2008, Netflix published 10 million movie rankings by 500,000 anonymized customers,
as part of a challenge for people to come up with better recommendation systems than
the one the company was using at that time. Researchers were able to de-anonymize
people by comparing rankings and time stamps with public rankings and time stamps
in the Internet Movie Database.
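The general shape of such a linkage attack can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm the researchers used. The data are invented, and the function name link_records and the thresholds max_days and min_matches are assumptions chosen for the example.

```python
from datetime import date


def link_records(anonymous, public, max_days=3, min_matches=3):
    """Match each anonymous ID to public profiles that share enough ratings.

    Two ratings match when they are for the same title, carry the same
    score, and were entered within max_days of each other.
    """
    links = {}
    for anon_id, anon_ratings in anonymous.items():
        for name, pub_ratings in public.items():
            matches = 0
            for title, (score, when) in anon_ratings.items():
                if title in pub_ratings:
                    pub_score, pub_when = pub_ratings[title]
                    close = abs((when - pub_when).days) <= max_days
                    if score == pub_score and close:
                        matches += 1
            if matches >= min_matches:
                links.setdefault(anon_id, []).append(name)
    return links


# Invented "anonymized" data set: ID -> {title: (score, date rated)}
anonymous = {
    1001: {
        "Film A": (5, date(2005, 3, 1)),
        "Film B": (2, date(2005, 3, 4)),
        "Film C": (4, date(2005, 4, 10)),
        "Film D": (1, date(2005, 5, 2)),
    },
    1002: {
        "Film A": (3, date(2005, 3, 2)),
        "Film E": (5, date(2005, 6, 1)),
    },
}

# Invented public reviews posted under real names or handles.
public = {
    "alice_reviews": {
        "Film A": (5, date(2005, 3, 2)),
        "Film B": (2, date(2005, 3, 4)),
        "Film C": (4, date(2005, 4, 11)),
    },
    "bob_reviews": {
        "Film A": (3, date(2005, 3, 2)),
        "Film B": (4, date(2005, 7, 1)),
    },
}

links = link_records(anonymous, public)
print(links)
```

Three matching ratings are enough to tie ID 1001 to a public profile, and with it the fourth rating that was never posted publicly. ID 1002 shares only one rating with anyone and stays unlinked in this toy data.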
These might seem like special cases, but correlation opportunities pop up more frequently
than you might think. Someone with access to an anonymous data set of telephone records,
for example, might partially de-anonymize it by correlating it with a catalog merchant’s
telephone order database. Or Amazon’s online book reviews could be the key to partially
de-anonymizing
a database of credit card purchase details.
Using public anonymous data from the 1990 census, computer scientist Latanya Sweeney
found that 87% of the population in the United States, 216 million of 248 million
people, could likely be uniquely identified by their five-digit ZIP code combined
with their gender and date of birth. For about half, a city, town, or municipality
name in place of the ZIP code was sufficient. Other researchers reported similar results
using 2000 census
data.
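The measurement behind a figure like 87% (216 million of 248 million is about 0.87) is easy to state: count how many records have a combination of attributes that no other record shares. The sketch below runs that count on six invented records. The function name unique_fraction and the field names are assumptions for illustration, and the toy numbers say nothing about the real census.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Invented records with names already stripped out.
records = [
    {"zip": "02138", "gender": "F", "dob": "1961-07-15"},
    {"zip": "02138", "gender": "F", "dob": "1948-03-02"},
    {"zip": "02138", "gender": "M", "dob": "1961-07-15"},
    {"zip": "02139", "gender": "F", "dob": "1975-01-02"},
    {"zip": "02139", "gender": "F", "dob": "1975-01-02"},
    {"zip": "02139", "gender": "M", "dob": "1980-11-30"},
]

for keys in (["zip"], ["zip", "gender"], ["zip", "gender", "dob"]):
    print(keys, round(unique_fraction(records, keys), 2))
```

No single field identifies anyone here, but each field added to the combination singles out more people. A record that is unique on these three fields can be named by anyone holding a second list, such as a voter roll, that carries the same fields alongside names.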
Google, with its database of users’ Internet searches, could de-anonymize a public
database of Internet purchases, or zero in on searches of medical terms to de-anonymize
a public health database. Merchants who maintain detailed customer and purchase information
could use their data to partially de-anonymize any large search engine’s search data.
A data broker holding databases of several companies might be able to de-anonymize
most of the records in those databases.
Researchers have been able to identify people from their anonymous DNA by comparing
the data with information from genealogy sites and other sources. Even something like
Alfred Kinsey’s sex research data from the 1930s and 1940s isn’t safe. Kinsey took
great pains to preserve the anonymity of his subjects, but in 2013, researcher Raquel
Hill was able to identify 97% of them.
It’s counterintuitive, but it takes less data to uniquely identify us than we think.
Even though we’re all pretty typical, we’re nonetheless distinctive. It turns out
that if you eliminate the top 100 movies everyone watches, our movie-watching habits
are all pretty individual. This is also true for our book-reading habits, our Internet-shopping
habits, our telephone habits, and our web-searching habits. We can be uniquely identified
by our relationships. It’s quite obvious that you can be uniquely identified by your
location data. With 24/7 location data from your cell phone, your name can be uncovered
without too much trouble. You don’t even need all that data; 95% of Americans can
be identified by name from just four time/date/location points.
The obvious countermeasures for this are, sadly, inadequate. Companies have anonymized
data sets by removing some of the data, changing the time stamps, or inserting deliberate
errors into the unique ID numbers they replaced names with. It turns out, though, that
these sorts of tweaks
only make de-anonymization slightly harder.
This is why regulation based on the concept of “personally identifying information”
doesn’t work. PII is usually defined as a name, unique account number, and so on,
and special rules apply to it. But PII is also about the amount of data; the more
information someone has about you, even anonymous information, the easier it is for
her to identify you.
For the most part, our protections come from the privacy policies of the companies
we use, not from any technology or mathematics. And being identified by a unique number
often doesn’t provide much protection. The data can still be collected and correlated
and used, and eventually we do something to attach our name to that “anonymous” data
record.
In the age of ubiquitous surveillance, where everyone collects data on us all the
time, anonymity is fragile. We either need to develop more robust techniques for preserving
anonymity, or give up on the idea entirely.
One of the most surprising things about today’s cell phones is how many other things
they also do. People don’t wear watches, because their phones have a clock. People
don’t carry cameras, because they come standard in most smartphones.
That camera flash can also be used as a flashlight. One of the flashlight apps available
for Android phones is Brightest Flashlight Free, by a company called GoldenShores
Technologies, LLC. It works great and has a bunch of cool features. Reviewers recommended
it to kids going trick-or-treating. One feature that wasn’t mentioned by reviewers
is that the app collected location information from its users and allegedly sold it
to advertisers.
It’s actually more complicated than that. The company’s privacy policy, never mind
that no one read it, actively misled consumers. It said that the company would use
any information collected, but left out that the information would be sold to third
parties. And although users had to click “accept” on the license agreement they also
didn’t read, the app started collecting and sending location information even before
people clicked.
This surprised pretty much all of the app’s 50 million users when researchers discovered
it in 2012. The US Federal Trade Commission got involved, forcing the company to clean
up its deceptive practices and delete the data it had collected. It didn’t fine the
company, though, because
the app was free.
Imagine that the US government passed a law requiring all citizens to carry a tracking
device. Such a law would immediately be found unconstitutional. Yet we carry our cell
phones everywhere. If the local police department required us to notify it whenever
we made a new friend, the nation would rebel. Yet we notify Facebook. If the country’s
spies demanded copies of all our conversations and correspondence, people would refuse.
Yet we provide copies to our e-mail service providers, our cell phone companies, our
social networking platforms, and our Internet service providers.
The overwhelming bulk of surveillance is corporate, and it occurs because we ostensibly
agree to it. I don’t mean that we make an informed decision agreeing to it; instead,
we accept it either because we get value from the service or because we are offered
a package deal that includes surveillance and don’t have any real choice in the matter.
This is the bargain I talked about in the Introduction.
This chapter is primarily about Internet surveillance, but remember that everything
is—or soon will be—connected to the Internet. Internet surveillance is really shorthand
for surveillance in an Internet-connected world.
INTERNET SURVEILLANCE
The primary goal of all this corporate Internet surveillance is advertising. There’s
a little market research and customer service in there, but those activities are secondary
to the goal of more effectively selling you things.
Internet surveillance is traditionally based on something called a cookie. The name
sounds benign, but the technical description “persistent identifier” is far more accurate.
Cookies weren’t intended to be surveillance devices; rather, they were designed to
make surfing the web easier. Websites don’t inherently remember you from visit to
visit or even from click to click. Cookies provide the solution to this problem. Each
cookie contains a unique number that allows the site to identify you. So now when
you click around on an Internet merchant’s site, you keep telling it, “I’m customer
#608431.” This allows the site to find your account, keep your shopping cart attached
to you, remember you the next time you visit, and so on.
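The mechanism described above can be sketched with Python's standard http.cookies module. This is a minimal illustration under stated assumptions, not how any particular merchant's site works. The cookie name customer_id, the handle_request function, the in-memory carts table, and the one-year lifetime are all hypothetical.

```python
import itertools
from http.cookies import SimpleCookie

_next_id = itertools.count(608431)  # hypothetical starting customer number
carts = {}  # customer ID -> items in that customer's shopping cart


def handle_request(cookie_header, add_item=None):
    """Identify the visitor by cookie, issuing a new ID on a first visit.

    Returns (customer_id, set_cookie), where set_cookie is the value of a
    Set-Cookie header for new visitors and None for returning ones.
    """
    cookie = SimpleCookie(cookie_header or "")
    set_cookie = None
    if "customer_id" in cookie:
        customer_id = cookie["customer_id"].value
    else:
        customer_id = str(next(_next_id))
        reply = SimpleCookie()
        reply["customer_id"] = customer_id
        reply["customer_id"]["max-age"] = 60 * 60 * 24 * 365  # persist a year
        set_cookie = reply["customer_id"].OutputString()
    if add_item is not None:
        carts.setdefault(customer_id, []).append(add_item)
    return customer_id, set_cookie


# First click: the browser has no cookie yet, so the site assigns a number.
first_id, header = handle_request(None, add_item="book")

# Every later click: the browser sends the number back.
second_id, header2 = handle_request("customer_id=" + first_id, add_item="lamp")

print(header)
print(carts[first_id])
```

The site never needs your name to keep the cart attached to you. The number alone is enough, which is why “persistent identifier” is the more accurate description.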
Companies quickly realized that they could set their own cookies on pages belonging
to other sites—with their permission and by paying for the privilege—and the third-party
cookie was born. Enterprises like DoubleClick (purchased by Google in 2007) started
tracking web users across many different sites. This is when ads started following
you around the web. Research a particular car or vacation destination or medical condition,
and for weeks you’ll see ads for that car or city or a related pharmaceutical on every
commercial Internet site you visit.
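What makes the third-party cookie different is that one company's identifier travels with you across unrelated sites. Whenever a page embeds that company's ad, your browser sends the company its own cookie along with the address of the page you are on. The sketch below models only that bookkeeping, with no real network traffic. The Tracker class, its method names, and the example.* page addresses are invented for illustration and do not describe any real ad network's system.

```python
import itertools
from collections import defaultdict


class Tracker:
    """A hypothetical third-party ad server embedded on many sites."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.history = defaultdict(list)  # tracker cookie ID -> pages seen

    def serve_ad(self, cookie_id, referring_page):
        """Log the visit and return the cookie ID the browser should keep."""
        if cookie_id is None:
            cookie_id = next(self._ids)  # first sighting: assign an ID
        self.history[cookie_id].append(referring_page)
        return cookie_id


tracker = Tracker()
browser_cookie = None  # the browser starts with no cookie for the tracker

# Three unrelated sites, each of which embeds an ad from the same tracker.
pages = [
    "cars.example/sedans",
    "travel.example/lisbon",
    "health.example/insomnia",
]
for page in pages:
    browser_cookie = tracker.serve_ad(browser_cookie, page)

print(tracker.history[browser_cookie])
```

None of the three sites knows about the other two, but the tracker holds the whole browsing trail under a single number.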
This has evolved into a shockingly extensive, robust, and profitable surveillance
architecture. You are being tracked pretty much everywhere you go on the Internet,
by many companies and data brokers: ten different companies on one site, a dozen on
another. Facebook tracks you on every site with a Facebook Like button (whether you’re
logged in to Facebook or not), and Google tracks you on every site that has a Google
Plus +1 button or that simply uses Google Analytics to monitor its own web traffic.
Most of the companies tracking you have names you’ve never heard of: Rubicon Project,
AdSonar, Quantcast, Pulse 360, Undertone, Traffic Marketplace. If you want to see
who’s tracking you, install one of the browser plugins that let you monitor cookies.
I guarantee you will be startled. One reporter discovered that 105 different companies
tracked his Internet use during one 36-hour period. In 2010, a seemingly innocuous
site like Dictionary.com installed over 200 tracking cookies on your browser when
you visited.
It’s no different on your smartphone. The apps there track you as well. They track
your location, and sometimes download your address book, calendar, bookmarks, and
search history. In 2013, the rapper Jay-Z and Samsung teamed up to offer people who
downloaded an app the ability to hear the new Jay-Z album before release. The app
required the ability to view all accounts on the phone, track the phone’s location,
and track who the user was talking to on the phone. And the Angry Birds game even
collects location data when you’re not playing.