What Stays in Vegas
Author: Adam Tanner
The contest excited Arvind Narayanan, but for a different reason than other researchers. A University of Texas PhD student at the time, Narayanan did not aspire to win the million-dollar prize. Rather, he invented his own contest to reidentify some of the people whose names Netflix had removed when releasing the recommendations. He rushed to see his faculty advisor, Vitaly Shmatikov, a computer scientist with special interest in computer security and privacy.
Narayanan was convinced that Netflix was wrong in saying it could maintain the anonymity of customers in the released data. He felt confident that there had to be a way to identify some of them. If Narayanan and Shmatikov could succeed, they would demonstrate a major flaw in how companies approached protecting privacy as crowd-sourcing became increasingly popular.
Both foreign-born, the student and professor had grown up in very different privacy environments. Narayanan hailed from Chennai,
earlier known as Madras, a city of more than four million in southern India. As in other large Indian cities, surges of people crowded the streets and public transport, leaving little possibility for personal space. “I like to joke that it is not even feasible in India because if you insisted that everybody stay three feet apart from each other, you'd run out of space,” he says. “When an Indian person applies for a job they put their date of birth and a bunch of other personal details on their CV, which is very jarring in terms of the kind of contextual boundaries that we have here.”
Shmatikov, who came to the United States in 1992, grew up in Moscow during Soviet Communism's final years. The KGB and other state organs could monitor citizens of interest, but most Muscovites shuttled about the gray city anonymously, minding their own business among the masses. Muscovites could escape notice riding in the crowded Metro system or walking in Gorky Park. In some sense it was easier to be anonymous then because there was no massive data collection, and people weren't leaving digital traces all over the place.
“Of course, if they really wanted to track someone, they had no lack of manpower, they could always assign a man to follow you,” Shmatikov says. “But you could do it for one person, for ten people, for a hundred people, you cannot do it for ten million people. So in that sense the vast majority of the population could be as anonymous as they wanted to be. Now it is all very different because now there really is technical capability to track anything anyone is doing anywhere.”
At first glance, it might appear unlikely that two researchers could identify people who posted anonymous movie reviews on Netflix. Many people watch the same popular movies. Yet some people watch and review those popular movies in combination with obscure ones, creating distinct profiles. An analogy from genetics: any two random humans share, on average, 99.9 percent of their DNA; all human variation, and all identifiability, lies in the remaining 0.1 percent. Such combinations provide clues that can help unmask a person's identity, much as a contestant on the television game show Wheel of Fortune pieces together an entire message from partially revealed letters. Some people in the Netflix prize dataset had watched and rated more than a thousand movies. Some had even rated more than ten thousand of the seventeen thousand movies then in the collection.
Narayanan came to learn that some cinephiles watch multiple movies a day and freely share their opinions on different sites, including imdb.com, where people often give their names when reviewing movies. Over the first eight days, Narayanan and Shmatikov worked feverishly into the night. By matching the named IMDb reviews against the same sets of movies in the Netflix prize dataset, they identified two Netflix subscribers by name, showing that they could solve the puzzle. They felt no need to go further: they had shown they could reidentify the movie lovers. “We were confident because there were no other matches that were even close,” Narayanan said. “Out of all the other records in the 500,000 dataset there was one good match” for each reidentification. As a further check, they reidentified two colleagues who had shared their Netflix viewing data with them; in those two cases they knew for sure that their method worked and that they had found the right people.
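The core of the matching idea can be sketched in a few lines of code. What follows is a hypothetical illustration, not the researchers' published algorithm (their method weighted rare movies more heavily and tolerated noise in ratings and dates): a public reviewer's known ratings are compared against every anonymized record, and a match is accepted only when the best candidate clearly stands out from the runner-up, echoing Narayanan's point that “there were no other matches that were even close.”

```python
# Hypothetical sketch of the reidentification idea: match a public
# reviewer's known (movie, rating) pairs against anonymized records
# and accept only a match that clearly beats the runner-up.

def overlap(public_ratings, anonymous_record):
    """Count how many (movie, rating) pairs the two profiles share."""
    return sum(1 for movie, rating in public_ratings.items()
               if anonymous_record.get(movie) == rating)

def best_match(public_ratings, dataset, margin=2):
    """Return the record ID whose overlap clearly exceeds all others."""
    scored = sorted(((overlap(public_ratings, rec), rid)
                     for rid, rec in dataset.items()), reverse=True)
    (top, top_id), (runner_up, _) = scored[0], scored[1]
    # A confident match must stand well apart from the second-best candidate.
    return top_id if top - runner_up >= margin else None

# Toy anonymized dataset: record ID -> {movie title: star rating}
dataset = {
    "user_001": {"Popular Hit": 5, "Obscure Gem": 4, "Cult Film": 2},
    "user_002": {"Popular Hit": 5, "Blockbuster": 3},
    "user_003": {"Popular Hit": 4, "Blockbuster": 3, "Cult Film": 5},
}

# Ratings the same person posted publicly under their real name.
public = {"Popular Hit": 5, "Obscure Gem": 4, "Cult Film": 2}

print(best_match(public, dataset))  # prints user_001
```

The popular movie alone matches several records; it is the obscure titles in combination that single out one record, which is exactly why sparse, distinctive viewing histories defeat simple name removal.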
The findings illustrated the privacy dangers that massive amounts of personal data pose, even if stripped of names. Yet academics initially shunned their findings. Narayanan and Shmatikov offered a paper on their research to an academic conference and received a thumbs down. “It is well known that logs can leak lots of private data,” one reviewer said in a rejection note. “It's not clear whether there is much real novelty/research in this paper.”23
A second conference also said no. Finally, a year later, the same conference that first rejected the paper accepted a revised version. This time, the study received wide public attention, although that did not lead to riches. The million-dollar prize went to a team of data scientists who had come up with a 10.06 percent improvement on the Netflix movie recommendation system, three years after the company first revealed the subscriber recommendations.24
The public attention to the privacy implications of the Netflix data eventually led to a class-action lawsuit. In that complaint, a lesbian said she did not want to reveal her sexual orientation or interest in gay-themed films. “On October 2, 2006, Netflix perpetrated the largest voluntary privacy breach to date, disclosing sensitive and personal identifying consumer information,” the lawsuit said. “The information was not compromised by malicious intruders. Rather, it was given away to the world freely, and with fanfare.”25
Netflix eventually settled the case out of court. In 2010 it canceled plans for a second contest. Today the company would rather forget the whole episode.26
* * *
A few months before Netflix released its movie recommendation data in 2006, email and Internet pioneer AOL published the search histories of 650,000 users over three months, a total of twenty million searches. The company removed the IP addresses of the computers making the searches and instead assigned a unique ID number to each user so that researchers could follow the search patterns. Since users often look for information related to where they live and give clues about their identity over a period of time, two New York Times reporters succeeded in puzzling out the names of some of them.27
After a public outcry, AOL fired the official who released the data; the company's chief technology officer resigned. AOL quickly tried to remove the data, yet the company suffered a damaging blow to its reputation, and a costly lawsuit. Only in 2013 did a federal judge approve the class-action settlement, which cost AOL up to $5 million, plus $930,000 to cover plaintiffs' attorney fees. People whose search data were released received $50 to $100.
Even today, one can still download the dataset on the Internet, again showing that once released, information can never be put back into the bottle. “It was a big reminder of the beginnings of what people now refer to as big data,” AOL cofounder Steve Case reflected. “Data that was supposed to be helpful some people were able to use in a way that was not helpful. So it was a wake-up call to our business.28

“These issues are not new issues. What is new is that far more people are online, they are online far more habitually, far more networked, far more places, so therefore there is more tracking of data and more ability to kind of analyze it in ways that can be helpful and also ways that may not be helpful.”
Unmasking sexual orientation or reidentifying people based on clues are obviously far from the business of running a casino, or luring guests into a department store, or any other business. The larger point is that many of the services we enjoy today in different areas of our lives collect data about us. Watching cable television, carrying a cell phone, using social media sites, or visiting a doctor all generate data that are shared widely, even if not with the person generating the data. Much of the information you generate is fairly innocuous. Your hobbies. Your favorite music. Your photos. Any one piece of data would not reveal very much. But continued advances in data mining have made small bits of personal data ever more revealing when combined, and ever more valuable to companies.
Sometimes, these clues lead all the way to the naked truth.
Scanty Clues
A Yelp page reviewing Instant Checkmate, in a section called “about the business,” showed the image of a smiling woman.1 It described her as Kristen B., manager of Instant Checkmate, followed by: “I'm Kristen, customer relations director at Instant Checkmate. When I am not responding to facebook [sic] messages, tweets, linked in requests and such, I can be found blogging on various sites. I love my job at Instant Checkmate and I am proud of help [sic] our customers!”
In Las Vegas–datelined press releases and company blogs, the company seemed to leave the talking to Kristen Bright, described as a PR manager, public relations specialist, social media consultant, or spokeswoman.2 Yet Kristen Bright did not respond to any attempts to contact her, either by phone or through email. Operators at the company's Las Vegas call center said they had never seen her. So I wondered: could a minuscule blurry photograph provide enough information to identify and locate the real woman behind the image?
I copied the photo and loaded it into Google's image search page. The results led to a photo of a woman on a boat with a bikini top stretched over significant cleavage. Someone had cropped the face from this image and put it on the Yelp page. Running a new search on the full photograph led to different pages with the same woman, photographed in a bikini or sexy underwear. On occasion she wore no top at all.
One view of the mystery woman I was trying to find. Source: Ann, surname withheld at her request.
The homemade, snapshot quality of the photos and a winning smile suggested a certain wholesomeness, even when she posed partially naked. Some tame family photos showed her with a boy, perhaps her son. A few bloggers had created pages in her honor, and admirers wrote in to compliment her beauty and curves. Some wondered where one might be able to find more images. Some blog comments referred to unseen explicit videos. The hunt continued.
The initial Google searches offered several names for the woman, with at least two surnames. Those names helped find other saucy images but no contact details, suggesting she used a stage moniker. The other images did not mention the name Kristen Bright. Searching through the new racy images did lead to a 2010 blog post showing her with a man who described himself as her husband, Tom. He said they were both thirty-eight years old and heading to a Jamaican resort for their twentieth wedding anniversary. For all these clues, the woman's real name and contact details remained elusive.
Then one day I conducted a search through a background data broker site used by lawyers, insurance companies, law enforcement agencies, and others. I found her stage surname embedded in a man's email address. That man, Tom, was then forty-one, about right for the husband if he had listed his true age in the Jamaica vacation posting a few years earlier. Tom had also filed a relatively recent Chapter 13 joint bankruptcy petition with his wife, whose name was Ann. Among the debts listed on the court documents: $131 owed to Victoria's Secret. Might the lingerie chain be the source of some of the skimpy garb modeled in the online photos?
Those documents led to an address and phone number in California. Still, additional proof was needed before calling. After all, lots of couples named Tom and Ann live in the United States. A call to the wrong couple asking about naked photographs might provoke a justifiably angry response. A search of Ann's real surname linked her to a Los Angeles–area high school where she had worked as a secretary. Deep within the school website lurked an old school newsletter with a photo of the support staff. Standing a bit shyly to the back of the group was a woman wearing a black vest over a white T-shirt. The face looked the same and the top-heavy body dimensions suggested she was indeed Ann.
I dialed a phone number I located for the couple and left a message, saying I was a fellow at Harvard University researching a book. An astonished Ann and Tom called back a few minutes later, wondering how they had been found. I told them about the Yelp listing for Instant Checkmate using her image. Had the website contacted her to gain permission to use her image as the face of the company? It had taken me a long time to find her. If she indeed worked for the data broker, the search would only have shown that she used a different name on the job. But she said not only did she not work for Instant Checkmate, she had never even heard of it or the name Kristen Bright. “Honestly, it's a little sickening,” she said. Then she joked: “Geez, if they would have asked, I could have sold a better photo!”3