Authors: Bruce Schneier
Right now, data mining is a hot technology, and there’s a lot of hype and opportunism
around it. It’s not yet entirely clear what kinds of research will be possible, or
what the true potential of the field is. But what is clear is that data-mining technology
is becoming increasingly powerful and is enabling observers to draw ever more startling
conclusions from big data sets.
SURVEILLING BACKWARDS IN TIME
One new thing you can do by applying data-mining technology to mass-surveillance data
is go backwards in time. Traditional surveillance can only learn about the present
and future: “Follow him and find out where he’s going next.” But if you have a database
of historical surveillance information on everyone, you can do something new: “Look
up that person’s location information, and find out where he’s been.” Or: “Listen
to his phone calls from last week.”
Some of this has always been possible. Historically, governments have collected all
sorts of data about the past. In the McCarthy era, for example, the government used
political party registrations, subscriptions to magazines, and testimonies from friends,
neighbors, family, and colleagues to gather data on people. The difference now is
that the capability is more like a Wayback Machine: the data is more complete and
far cheaper to get, and the technology has evolved to enable sophisticated historical
analysis.
For example, in recent years Credit Suisse, Standard Chartered Bank, and BNP
Paribas all admitted to violating laws prohibiting money transfer to sanctioned groups.
They deliberately altered transactions to evade algorithmic surveillance and detection
by “OFAC filters”—that’s the Office of Foreign Assets Control within the Department
of the Treasury. Untangling this sort of wrongdoing involved a massive historical
analysis of banking transactions and employee communications.
Similarly, someone could go through old data with new analytical tools. Think about
genetic data. There’s not yet a lot we can learn from someone’s genetic data, but
ten years from now—who knows? We saw something similar happen during the Tour de France
doping scandals; blood taken from riders years earlier was tested with new technologies,
and widespread doping was detected.
The NSA stores a lot of historical data, which I’ll talk about more in Chapter 5.
We know that in 2008 a database called XKEYSCORE routinely held voice and e-mail content
for just three days, but it held metadata for a month. One called MARINA holds a year’s
worth of people’s browsing history. Another NSA database, MYSTIC, was able to store
recordings of all the phone conversations for Bermuda. The NSA stores telephone
metadata for five years.
These storage limits pertain to the raw trove of all data gathered. If an NSA analyst
touches something in the database, the agency saves it for much longer. If your data
is the result of a query into these databases, your data is saved indefinitely. If
you use encryption, your data is saved indefinitely. If you use certain keywords,
your data is saved indefinitely.
How long the NSA stores data is more a matter of storage capacity than a respect for
privacy. We know the NSA needed to increase its storage capacity to hold all the cell
phone location data it was collecting. As data storage gets cheaper, assume that more
of this data will be stored longer. This is the point of the NSA’s Utah Data Center.
The FBI stores our data, too. During the course of a legitimate investigation in 2013,
the FBI obtained a copy of all the data on a site called Freedom Hosting, including
stored e-mails. Almost all the data was unrelated to the investigation, but the FBI
kept a copy of the entire site and has been accessing it for unrelated investigations
ever since. The state of New York retains license plate scanning data for at least
five years and possibly indefinitely.
Any data—Facebook history, tweets, license plate scanner data—can basically be retained
forever, or until the company or government agency decides to delete it. In 2010,
different cell phone companies held text messages for durations ranging from 90 days
to 18 months. AT&T beat them all, hanging on to the data for seven years.
MAPPING RELATIONSHIPS
Mass-surveillance data permits mapping of interpersonal relationships. In 2013, when
we first learned that the NSA was collecting telephone calling metadata on every American,
there was much ado about so-called hop searches and what they mean. They’re a new
type of search, theoretically possible before computers but only really practical
in a world of mass surveillance. Imagine that the NSA is interested in Alice. It will
collect data on her, and then data on everyone she communicates with, and then data
on everyone they communicate with, and then data on everyone
they communicate with. That’s three hops away from Alice, which is the maximum the NSA
worked with.
The intent of hop searches is to map relationships and find conspiracies. Making sense
of the data requires being able to cull out the overwhelming majority of innocent
people who are caught in this dragnet, and the phone numbers common to unrelated people:
voice mail services, pizza restaurants, taxi companies, and so on.
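A hop search is, in effect, a breadth-first walk over a graph of who contacted whom. Here is a minimal sketch of the idea, not the NSA's actual method, which isn't public. The function name, the `hub_threshold` parameter, and the sample contacts are all assumptions made up for illustration. Numbers with too many contacts are recorded but not expanded, which is one simple way to keep a pizza restaurant from linking unrelated people.

```python
from collections import deque


def hop_search(contacts, target, max_hops=3, hub_threshold=None):
    """Return {identifier: hop distance} for everyone within max_hops of target.

    contacts maps each identifier to the set of identifiers it communicated with.
    hub_threshold is a hypothetical cutoff: identifiers with more contacts than
    this are treated as shared services and are not expanded.
    """
    distance = {target: 0}
    queue = deque([target])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # already at the hop limit, do not expand further
        neighbors = contacts.get(person, set())
        is_hub = hub_threshold is not None and len(neighbors) > hub_threshold
        if is_hub and person != target:
            continue  # voice mail, taxi company, and so on
        for other in neighbors:
            if other not in distance:
                distance[other] = distance[person] + 1
                queue.append(other)
    return distance


# Invented sample data: "pizza" is a number that many unrelated people call.
contacts = {
    "alice": {"bob", "pizza"},
    "bob": {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave": {"carol", "erin"},
    "erin": {"dave"},
    "pizza": {"alice", "frank", "grace", "heidi"},
}

within_three = hop_search(contacts, "alice", max_hops=3, hub_threshold=3)
print(within_three)
```

With the hub cutoff in place, the pizza restaurant's other customers never enter the result. Without it, they would all appear two hops from Alice, which is how the dragnet grows so quickly.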
NSA documents note that the agency had 117,675 “active surveillance targets” on one
day in 2013. Even using conservative estimates of how many conversants each person
has and how much they overlap, the total number of people being surveilled by this
system easily exceeded 20 million. It’s the classic “six degrees of separation” problem;
most of us are only a few hops away from everyone else. In 2014, President Obama directed
the NSA to conduct two-hop analysis only on telephone metadata collected under one
particular program, but he didn’t place any restrictions on NSA hops for all the other
data it collects.
Metadata from various sources is great for mapping relationships. Most of us use the
Internet for social interaction, and our relationships show up in that. This is what
both the NSA and Facebook do, and it’s why the latter is so unnervingly accurate when
it suggests people you might know whom you’re not already Facebook friends with.
One of Facebook’s most successful
advertising programs involves showing ads not just to people who Like a particular
page or product, but to their friends and to friends of their friends.
FINDING US BY WHAT WE DO
Once you have collected data on everyone, you can search for individuals based on
their behavior. Maybe you want to find everyone who frequents a certain gay bar, or
reads about a particular topic, or has a particular political belief. Corporations
do this regularly, using mass-surveillance data to find potential customers with particular
characteristics, or looking for people to hire by searching for people who have published
on a particular topic.
One can search for things other than names and other personal identifiers like identification
numbers, phone numbers, and so on. Google, for example, searches all of your Gmail
and uses keywords it finds to more intimately understand you, for advertising purposes.
The NSA does something similar: what it calls “about” searches. Basically, it searches
the contents of everyone’s communications for a particular name or word—or maybe a
phrase. So in addition to examining Alice’s data and the data of everyone within two
or three hops of her, it can search everyone else—the entire database of communications—for
mentions of her name. Or if it doesn’t know a name, but knows the name of a particular
location or project, or a code name that someone has used, it can search on that.
For example, the NSA targets people who search for information on popular Internet
privacy and anonymity tools.
We don’t know the details, but the NSA chains together hops based on any connection,
not just phone connections. This could include being in the same location as a target,
having the same calling pattern, and so on. These types of searches are made possible
by having access to everyone’s data.
You can use mass surveillance to find individuals. If you know that a particular person
was at a specific restaurant one evening, a train station three days later in the
afternoon, and a hydroelectric plant the next morning, you can query a database of
everyone’s cell phone locations, and anyone who fits those characteristics will pop
up.
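The query described above amounts to intersecting sets: find the phones seen at each place during its time window, then keep only the phones that appear in every set. The sketch below assumes a hypothetical table of `(phone_id, place, timestamp)` sightings. The record layout, names, and sample data are invented for illustration and don't describe any real system.

```python
from datetime import datetime


def phones_matching(sightings, pattern):
    """Return the phone IDs seen at every place in pattern within its window.

    sightings: list of (phone_id, place, timestamp) tuples.
    pattern: list of (place, window_start, window_end) tuples.
    """
    candidates = None
    for place, start, end in pattern:
        seen = {
            phone for phone, where, when in sightings
            if where == place and start <= when <= end
        }
        # Keep only phones that also matched every earlier window.
        candidates = seen if candidates is None else candidates & seen
    return candidates or set()


# Invented sample data.
sightings = [
    ("phone-A", "restaurant", datetime(2014, 5, 1, 20, 0)),
    ("phone-A", "train station", datetime(2014, 5, 4, 15, 30)),
    ("phone-A", "hydroelectric plant", datetime(2014, 5, 5, 9, 0)),
    ("phone-B", "restaurant", datetime(2014, 5, 1, 19, 45)),
    ("phone-B", "train station", datetime(2014, 5, 4, 8, 0)),  # morning, not afternoon
    ("phone-C", "hydroelectric plant", datetime(2014, 5, 5, 9, 30)),
]
pattern = [
    ("restaurant", datetime(2014, 5, 1, 18), datetime(2014, 5, 1, 23)),
    ("train station", datetime(2014, 5, 4, 12), datetime(2014, 5, 4, 18)),
    ("hydroelectric plant", datetime(2014, 5, 5, 6), datetime(2014, 5, 5, 12)),
]

matches = phones_matching(sightings, pattern)
print(matches)
```

Each added window shrinks the candidate set, so three sightings are often enough to leave a single phone.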
You can also search for anomalous behavior. Here are four examples of how the NSA
uses cell phone data.
1. The NSA uses cell phone location information to track people whose movements intersect.
For example, assume that the NSA is interested in Alice. If Bob is at the same restaurant
as Alice one evening, and then at the same coffee shop as Alice a week later, and
at the same airport as Alice a month later, the system will flag Bob as a potential
associate of Alice’s, even if the two have never communicated electronically.
2. The NSA tracks the locations of phones that are carried around by US spies overseas.
Then it determines whether there are any other cell phones that follow the agents’
phones around. Basically, the NSA checks whether anyone is tailing those agents.
3. The NSA has a program where it trawls through cell phone metadata to spot phones
that are turned on, used for a while, and then turned off and never used again. And
it uses the phones’ usage patterns to chain them together. This technique is employed
to find “burner” phones used by people who wish to avoid detection.
4. The NSA collects data on people who turn their phones off, and for how long. It
then collects the locations of those people when they turned their phones off, and
looks for others nearby who also turned their phones off for a similar period of time.
In other words, it looks for secret meetings.
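The first of these examples, flagging Bob because he keeps turning up where Alice is, can be sketched as counting shared sightings. This is an illustrative guess at the logic, not a description of the NSA's software. The time buckets, the `min_meetings` threshold, and the sample data are all assumptions.

```python
from collections import defaultdict


def potential_associates(sightings, target, min_meetings=3):
    """Return phones that shared at least min_meetings (place, time) spots with target.

    sightings: list of (phone_id, place, time_bucket) tuples, where time_bucket
    is any coarse label such as "2014-03-01 evening".
    """
    target_spots = {
        (place, bucket) for phone, place, bucket in sightings if phone == target
    }
    shared = defaultdict(set)
    for phone, place, bucket in sightings:
        if phone != target and (place, bucket) in target_spots:
            shared[phone].add((place, bucket))
    return {phone for phone, spots in shared.items() if len(spots) >= min_meetings}


# Invented sample data.
sightings = [
    ("alice", "restaurant", "2014-03-01 evening"),
    ("bob", "restaurant", "2014-03-01 evening"),
    ("carol", "restaurant", "2014-03-01 evening"),
    ("alice", "coffee shop", "2014-03-08 morning"),
    ("bob", "coffee shop", "2014-03-08 morning"),
    ("alice", "airport", "2014-04-08 noon"),
    ("bob", "airport", "2014-04-08 noon"),
    ("carol", "airport", "2014-04-09 noon"),
]

flagged = potential_associates(sightings, "alice")
print(flagged)
```

Carol shared one restaurant with Alice and is ignored. Bob shared three places and is flagged, even though no message ever passed between them. The same counting approach, applied to agents' phones instead of Alice's, would cover the second example as well.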
I’ve already discussed the government of Ukraine using cell phone location data to
find everybody who attended an antigovernment demonstration, and the Michigan police
using it to find everyone who was near a planned labor union protest site. The FBI
has used this data to find phones that were used by a particular target but not otherwise
associated with him.
Corporations do some of this as well. There’s a technique called geofencing that marketers
use to identify people who are near a particular business so as to deliver an ad to
them. A single geofencing company, Placecast, delivers location-based ads to ten million
phones in the US and UK for chains like Starbucks, Kmart, and Subway. Microsoft does
the same thing to people passing within ten miles of some of its stores; it works
with the company NinthDecimal. Sense Networks uses location data to create individual
profiles.
CORRELATING DIFFERENT DATA SETS
Vigilant Solutions is one of the companies that collect license plate data from cameras.
It has plans to augment this system with other algorithms for automobile identification,
systems of facial recognition, and information from other databases. The result would
be a much more powerful surveillance platform than a simple database of license plate
scans, no matter how extensive, could ever be.
News stories about mass surveillance are generally framed in terms of data collection,
but miss the story about data correlation: the linking of identities across different
data sets to draw inferences from the combined data. It’s not just that inexpensive
drones with powerful cameras will become increasingly common. It’s the drones plus
facial recognition software that allows the system to identify people automatically,
plus the large databases of tagged photos—from driver’s licenses, Facebook, newspapers,
high school yearbooks—that will provide reference images for that software. It’s also
the ability to correlate that identification with numerous other databases, and the
ability to store all that data indefinitely. Ubiquitous surveillance is the result
of multiple streams of mass surveillance tied together.
I have an Oyster card that I use to pay for public transport while in London. I’ve
taken pains to keep it cash-only and anonymous. Even so, if you were to correlate
the usage of that card with a list of people who visit London and the dates—whether
that list is provided by the airlines, credit card companies, cell phone companies,
or ISPs—I’ll bet that I’m the only person for whom those dates correlate perfectly.
So my “anonymous” movement through the London Underground becomes nothing of the sort.
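To make the Oyster card example concrete, here is a small sketch of the correlation. It assumes two hypothetical data sets: the dates an anonymous card was used, and a list of visitors with their arrival and departure dates. All names and dates are invented. The check is simply whether someone was in town on every day the card was used.

```python
from datetime import date


def matching_visitors(card_dates, visits):
    """Return visitors whose stays cover every date the card was used.

    card_dates: set of dates on which the anonymous card was used.
    visits: {visitor name: list of (arrival, departure) date pairs}.
    """
    def in_town(stays, day):
        return any(arrive <= day <= depart for arrive, depart in stays)

    return {
        name for name, stays in visits.items()
        if all(in_town(stays, day) for day in card_dates)
    }


# Invented sample data.
card_dates = {date(2014, 3, 3), date(2014, 3, 4), date(2014, 6, 10), date(2014, 11, 21)}
visits = {
    "visitor_a": [
        (date(2014, 3, 2), date(2014, 3, 5)),
        (date(2014, 6, 9), date(2014, 6, 12)),
        (date(2014, 11, 20), date(2014, 11, 22)),
    ],
    "visitor_b": [
        (date(2014, 3, 1), date(2014, 3, 10)),
        (date(2014, 6, 9), date(2014, 6, 12)),
    ],
    "visitor_c": [(date(2014, 11, 19), date(2014, 11, 23))],
}

owners = matching_visitors(card_dates, visits)
print(owners)
```

Every additional trip makes the match more selective. After a handful of visits, it's unlikely that anyone else's travel dates line up with the card's usage.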
Snowden disclosed an interesting research project from the CSEC—that’s the Communications
Security Establishment Canada, the country’s NSA equivalent—that demonstrates the
value of correlating different streams of surveillance information to find people
who are deliberately trying to evade detection.
A CSEC researcher, with the cool-sounding job title of “tradecraft developer,” started
with two weeks’ worth of Internet identification data: basically, a list of user
IDs that logged on to various
websites. He also had a database of geographic locations for different wireless networks’
IP addresses. By putting the two databases together, he could tie user IDs logging
in from different wireless networks to the physical location of those networks. The
idea was to use this data to find people. If you know the user ID of some surveillance
target, you can set an alarm when that target uses an airport or hotel wireless network
and learn when he is traveling. You can also identify a particular person who you
know visited a particular geographical area on a series of dates and times. For example,
assume you’re looking for someone who called you anonymously from three different
pay phones. You know the dates and times of the calls, and the locations of those
pay phones. If that person has a smartphone in his pocket that automatically logs
into wireless networks, then you can correlate that log-in database with dates and
times you’re interested in and the locations of those networks. The odds are that
there will only be one match.
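The approach described above has two steps: join the log-in records to the network location table, then look for user IDs that were near every pay phone at the right time. The sketch below is only an illustration of that logic. The record layouts, function names, and 30-minute tolerance are assumptions, and the IP addresses come from ranges reserved for documentation.

```python
from datetime import datetime, timedelta


def locate_logins(logins, network_locations):
    """Join (user_id, ip, timestamp) log-ins to places via {ip: place}."""
    return [
        (user, network_locations[ip], when)
        for user, ip, when in logins
        if ip in network_locations  # drop log-ins from networks we cannot place
    ]


def users_near_events(located, events, tolerance):
    """Return user IDs seen at every event's place within tolerance of its time."""
    matches = None
    for place, event_time in events:
        seen = {
            user for user, where, when in located
            if where == place and abs(when - event_time) <= tolerance
        }
        matches = seen if matches is None else matches & seen
    return matches or set()


# Invented sample data.
network_locations = {
    "192.0.2.10": "mall",
    "198.51.100.7": "library",
    "203.0.113.5": "bus depot",
}
logins = [
    ("user-17", "192.0.2.10", datetime(2014, 2, 1, 10, 5)),
    ("user-17", "198.51.100.7", datetime(2014, 2, 3, 16, 50)),
    ("user-17", "203.0.113.5", datetime(2014, 2, 6, 8, 20)),
    ("user-42", "192.0.2.10", datetime(2014, 2, 1, 10, 10)),
    ("user-42", "198.51.100.7", datetime(2014, 2, 4, 16, 50)),  # wrong day
    ("user-99", "203.0.113.99", datetime(2014, 2, 6, 8, 15)),  # unknown network
]
pay_phone_calls = [
    ("mall", datetime(2014, 2, 1, 10, 0)),
    ("library", datetime(2014, 2, 3, 17, 0)),
    ("bus depot", datetime(2014, 2, 6, 8, 30)),
]

located = locate_logins(logins, network_locations)
callers = users_near_events(located, pay_phone_calls, timedelta(minutes=30))
print(callers)
```

Neither data set identifies the caller on its own. The identification comes from the join.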
Researchers at Carnegie Mellon University did something similar. They put a camera
in a public place, captured images of people walking past, identified them with facial
recognition software and Facebook’s public tagged photo database, and correlated the
names with other databases. The result was that they were able to display personal
information about a person in real time as he or she was walking by. This technology
could easily be available to anyone, using smartphone cameras or Google Glass.