Author: Tom Vanderbilt
The star system itself is filled with biases. People avoid the ends of scales (“contraction bias,” it is called), so you get many more two- or four-star reviews than one or five. Another statistical hiccup, Amatriain said, is that “we know the rating scale is not linear; you don't have the same distance from a one star to a two, as from a two to a three.” That middle ground, the landscape of meh, gets pretty muddy in terms of what is watchable. Then there is “integer bias,” or the idea that people seem predisposed to give whole-number ratings.
Assigning stars to a cultural product is itself a curious (and long contentious) enterprise. It seemed to kick off in books, actually, with Edward O'Brien's inaugural edited volume, The Best Short Stories of 1915. As he described in an introduction, the stories he selected “fell naturally into four groups” (my italics). These were denoted by asterisks, the more the better, all the way up to three (for stories that deserve “a position of some permanence in our literature”). With the vision of the disinterested critic, he declared, “I have permitted no personal preference or prejudice to influence my judgment consciously for or against a story” (later in the book we shall see just how difficult that is). O'Brien's star system, and indeed the very act of choosing the “best” stories of the year, itself came under some withering criticism.*1
Reviewing The Best Short Stories of 1925, a critic for The New York Times, chiding O'Brien's “dogmatic” valuation system, declared, “A great many people will believe almost anything that any one tells them positively enough.” Star history gets a bit murky but seems to finally pop up in film in a review by Irene Thirer in the July 31, 1928, edition of the New York Daily News. She writes, “Judging movies via the star system, as we're going to do henceforth as a permanent thing,” implying it was already under way. She then pans Port of Missing Girls with a single star.*2
People have been quibbling over stars ever since. One obvious problem is that because people's tastes are different, what one person thinks is a three-star movie may be for you a five-star flick. This is why Netflix distinguishes between the overall number of stars and the metric “Our best guess for you.” This lays taste right out on the table: You liked this movie 0.7 stars more than others did.
While we might take this to be some purer expression of “our” taste, one complication is that, as with all recommendation engines, that number is partially derived from what other people are doing. Another problem is that you may just rate differently, with a high or low bias, regardless of what you actually thought about the movie. “Some people I know are very selective on giving high ratings,” says Amatriain. “So two or three stars for them is not necessarily a bad rating.”
This points to something interesting about Netflix and its ratings. Perhaps as a holdover from the days when we received our opinions largely from reviewers, who had their own rating systems, we might think of a star rating as a kind of stable measure of quality, or at least of one's taste. At both the individual and the aggregate level, however, Netflix stars are far from fixed. Rather, they are like free markets: prone to corrections, bubbles, hedges, inflation, and other forms of statistical “noise.”
In early 2004, to take one case, there was a “sudden rise in the average movie rating” on Netflix. Did Hollywood films suddenly get better? Actually, the recommendation system did. “Users are increasingly rating movies that are more suitable for their own taste,” wrote Yehuda Koren, a researcher who participated in the Netflix Prize. In other words, the movies got better because they were chosen by more people who thought they were better. Depending on how you look at it, this could be thought of as a kind of selection bias (the people who were likely to like a movie were rating it more favorably) or as a kind of market equilibrium in taste: People were more accurately finding the movies (that is, the supply) they were more likely to like (that is, the demand).
Things are even messier at the individual level.
Ask someone to re-rate a movie he has already seen, and more likely than not he will rate it differently. Simply by altering a user's initial rating, experiments have shown, you can affect how that same person re-rates it later.
People seem to rate things differently when they rate a bunch of films en masse (training their algorithms) versus a single film. People rate television shows differently than films. “The average rating on a TV show tends to be much higher than on a movie,” Yellin said. Has television gotten better than film? “My intuition is that there's selection,” he said. “Who's likely to rate The Sopranos? Not someone who watched five minutes and didn't like it because it wasn't really part of their life. It's the person who committed to it and spent a hundred hours of their life watching it.” On the other hand, “who will rate Paul Blart: Mall Cop? It might not be a very good movie, but it's ninety minutes long. Your bar or criteria might be different.”
Similarly, the same movie seen on streaming versus DVD might have different ratings. “Especially if a movie is much more visceral,” Yellin said, like a “very emotional” Spielberg title. “It's going to have an impact on you, but that impact might be ephemeral. So if you rated it right at the credits, you might give it a higher rating. A week later, it might not have that effect on you.” Watching a movie alone might yield a lower rating than watching a movie with enthusiastic friends.
And so on. “I was deep into the ratings game for years,” Yellin said gravely, sounding like a jaded gangster reflecting on his unsavory past on the streets. I sensed he was striving for some purity in those ratings, a Platonic ideal of what we like. “You question how much hair I have? I tore my hair out trying to understand these kinds of things.” Ratings, in the end, were not as potent a signal of what people would watch as one might think. Neither are things like gender and geography. “If you know nothing else, it will help a tiny bit,” Yellin said. “But if they watch five things on Netflix, we will know magnitudes more about them than age, gender, where they live.” You are what you watch.
All this talk of how ratings have been deemphasized does not mean that recommendations are any less important. They are indeed more central than ever to Netflix's algorithmic work, driving some 75 percent of all viewing. Now, though, they are more implicit. Rather than tell you what you like, Netflix now in essence shows you what you like, in “personalized” rows whose architecture has essentially been created by your own behavior. “Everything is a recommendation,” as Amatriain liked to say of the new, “beyond the five stars” thinking. Even searching for things, a sign that “we are not able to show them what to watch,” feeds into the recommendation engine. Knowing what you are looking for betrays what you might like. Doing anything on Netflix is itself a kind of meta-recommendation: The site, like much of the Internet, is one big constant experiment in preferences, a series of “A/B tests” you probably participated in without being aware of it. Did moving the search box to the left or the right of the online shoe retailer's page lead you to buy more products? Did putting a row on your splash page titled “Foreign Dramas from the 1980s” get you to watch more foreign dramas from the 1980s?
The rows reflect a kind of middle ground between two extremes of signals that in and of themselves are not wholly useful: The first is your stated likes. These can lead into a kind of taste cul-de-sac, full of obscure, interesting films that you rarely get around to watching. “Overfitting” is the algorithmic word: The engine makes recommendations that are, in a sense, too perfect, and perfectly sterile.
The second is popularity. This is the antithesis of “personalization,” Amatriain told me; then again, if you are trying to optimize consumption, “a member is most likely to watch what most others are watching.” This can lead to the Shawshank Redemption Problem, or the rather superfluous recommendation of something the whole world has seen. The Shawshank Redemption is Netflix's highest-ever-rated film, a film so universally lauded on the site it has almost no predictive power beyond its own seemingly inherent likability. “People love that film all over the frickin' place,” Yellin marveled, shaking his head.
Perhaps as a concession to the inexorable noisiness of human taste, Netflix does not rely entirely on the behavior of users themselves to make recommendations. It also has a paid army of human “taggers” erecting a labyrinth of cinematic meta-data. Rather than trying to figure out what makes two people's taste similar, Netflix has found it is often easier to ascertain what makes two films similar. This can lead to curious discoveries. The presence of the director Pedro Almodóvar may forge a link between two films, no matter how different they may be, where nothing else would. But meta-data by themselves can mislead. Recommending Dogville (a film as polarizing as Napoleon Dynamite) to people who watched The Hours or Moulin Rouge, simply because Nicole Kidman was in both of them, could be disastrous.
But meta-data can also tease out things we might not have discovered ourselves. The often quirkily specific, human-generated genre rows remind us, as I have noted, of how categories can influence our preferences. We like things as something, even if, with a film like The Big Lebowski, it can take a while to figure out what “it” is. Netflix's quirky genres try to shape meaning from what might otherwise seem capricious suggestions. “Recommendations can be too out-there,” Yellin said. “You're like, 'Wow, why would it say that just because I rated Raise the Red Lantern five stars that I'm going to really like this Japanese kids' movie?'” Yellin pointed to his laptop. On his Netflix page was an array of recommendations: Gomorrah, Valhalla Rising, Enter the Void, and Un Chien Andalou. They were all contained in a genre dubbed Mind-Bending Foreign Dramas. “I got psyched looking at this,” he said, “but if you had shown it to me without any context, it might not be as compelling.” As the writer Alexis Madrigal described it, “It's not just that Netflix can show you things you might like, but that it can tell you what kinds of things those are.”
That these two things can influence each other is not only one of the curious forms of quantum entanglement found in the Big Data of recommendation systems but a fact of human taste.
My husband and I found this “off-the-beaten-path” place one night while driving on a dark desert highway. Our room was a bit dated (mirrors on the ceiling LOL!) but we were pleasantly surprised to find that we had been upgraded; our room even had champagne on ice! But the place has a serious noise problem: we were woken up in the middle of the night by voices coming from somewhere down the hallway. While I would agree with the previous reviewer that it is “Such a Lovely Place!” I have very mixed feelings. The worst thing, however, were the checkout policies, which I found to be completely unacceptable.
You may recognize the above as my mashing up of two familiar narratives: the lyrics to the Eagles' “Hotel California” and a review on the travel Web site TripAdvisor.com. You know “Hotel California” because you have heard it to death on FM radio. And if you have spent any time on TripAdvisor.com, you will, after reading the twenty-eighth review of a hotel, have begun to absorb its gentle cadences: the casual, confessional tone; the banter with other reviewers; the personality that seems to come across at once as both the relatable everyman being wronged and the aggrieved diva with a heightened sense of entitlement. Then there is the “but,” a hallmark of the “speech act” known as a complaint. As the linguist Harvey Sacks once noted, complaints tend to follow a standard pattern: “a piece of praise plus 'but' plus something else.” The praise typically comes first, as if to say, “This is not me being unreasonable.”
Reading these sorts of reviews, I cannot help but wonder: where did people, before the Internet and social media, channel this torrent of opinion? If the hotel shower's water pressure was not quite to one's liking, where, besides the captive audience at the front desk, could one direct this disquietude? Then, as now, a person having a poor experience might simply have vowed never to visit the place again. He could have told friends and family about this experience, and this casual griping might have rippled out to a few people. But how could he warn that stranger, down the road, heading toward the proverbial Hotel California, that it might not be worth her money?
It may already seem difficult to remember, but in the days before the Internet, and then smartphones, to do something like eat at an unknown restaurant meant relying on a clutch of quick-and-dirty heuristics. The presence of a lot of truck drivers or cops at a lonely diner was taken as a sign of its quality (though it might simply have been the only option around). For “ethnic” food, there was the classic “We were the only non-[insert ethnicity] people in there.” Or one spent anxious minutes on the sidewalk, under the watchful gaze of the host, reading curling, yellowed reviews from local weeklies, wondering if the opinion of a critic who passed by one afternoon in 1987 still held.
We lived in an information-poor environment. To choose a hotel in an unfamiliar city, we might have paged through a guidebook. But what if that guidebook only covered a few hotels and was not recently updated? We might have relied simply on brands: I stayed at this hotel in Akron, so I will stay at the one in Davenport. But what if the Akron one was much better run?