Author: Tom Vanderbilt
But for whatever reason it is done, how does one know a review is false? Consider these snippets of two reviews:
I have stayed at many hotels traveling for both business and pleasure and I can honestly stay that The James is tops. The service at the hotel is first class. The rooms are modern and very comfortable.
My husband and I stayed at the James Chicago Hotel for our anniversary. This place is fantastic! We knew as soon as we arrived we made the right choice! The rooms are BEAUTIFUL and the staff very attentive and wonderful!!
As it turns out, the second of these reviews is fake.
A group of Cornell University researchers created a machine-learning system that can tell, with accuracy near 90 percent, whether a review is authentic or not. This is far better than trained humans typically achieve; among other problems, we tend to suffer from "truth bias," a wish to assume people are not lying.
To create the algorithm, the Cornell team largely relied on decades of research into the way people talk when they are confabulating. In "invented accounts," people tend to be less accurate with contextual details, because they were not actually there. Fake hotel reviews, they found, had less detailed information about things like room size and location. Prevaricating reviewers used more superlatives (the best! The worst!). Because lying takes more mental work, false reviews are usually shorter.
When people lie, they also seem to use more verbs than nouns, because it is easier to go on about things you did than to describe how things were. Liars also tend to use personal pronouns less than truth tellers do, presumably to put more "space" between themselves and the act of deception.
But doesn't the fake example above have plenty of personal pronouns? Indeed, the Cornell team found that people actually referred to themselves more in fake reviews, in hopes of making the review sound more credible.
Curiously, the researchers noted that people used personal pronouns less in fake negative than in fake positive reviews, as if the distancing were more important when the lie was meant to sound nasty. Lying in general is arguably easier online, absent the interpersonal and time pressures of trying to make up something on the spot in front of someone. How easy?
When I ran my imagined “Hotel California” review through Review Skeptic, a Web site created by a member of the Cornell team, it was declared “truthful.”
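Under the hood, a detector like this is ordinary supervised text classification: gather reviews labeled truthful or deceptive, turn the text into word-frequency features that pick up cues such as superlatives, pronouns, and concrete detail, and fit a classifier. The sketch below is only an illustration of that general recipe, not the Cornell team's actual model; the tiny dataset and its labels are invented.

```python
# Illustrative sketch of a fake-review detector as plain supervised text
# classification. The reviews and labels below are invented stand-ins for
# a real labeled corpus like the one the Cornell researchers built.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "Room 1209 was small but quiet, and the walk to Michigan Avenue was short.",
    "Check-in took twenty minutes; the desk comped us breakfast for the wait.",
    "This hotel is the BEST ever!!! My husband and I loved every single minute!",
    "Absolutely amazing, the most wonderful staff, we will definitely be back!!",
]
labels = ["truthful", "truthful", "deceptive", "deceptive"]

# Word and word-pair frequencies stand in for the linguistic cues the
# research relies on: spatial detail, superlatives, pronoun and verb use.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(reviews, labels)

print(model.predict(["The rooms are BEAUTIFUL and the staff very attentive!!"]))
```

With hundreds of labeled reviews per class rather than four, a pipeline in this spirit is what gets a classifier anywhere near the reported accuracy; the point is simply that the "tells" are statistical patterns in word use, not anything a human reader can reliably spot.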
Fake reviews do exist and undoubtedly have economic consequences. But the enormous amount of attention they have received in the media, and all the energy dedicated to automatically sniffing out deceptive reviews, may leave one with the comfortable assumption that all the other reviews are, simply, “true.” While they may not be knowingly deceptive, there are any number of ways they are subject to distortion and biases, hidden or otherwise.
The first problem is that hardly anyone writes reviews.
At one online retailer, fewer than 5 percent of customers did so, which is hardly democratic.
And the first reviewers of a product are going to differ from people who chime in a year later; for one thing, there are existing reviews to influence the later ones. Merely buying something from a place may tilt you positive; people who rated but did not buy a book on Amazon, as Simester and Anderson discovered, were twice as likely not to like it. Finally, customers are often moved to write a review because of an inordinately positive or negative experience. So ratings tend to be "bimodal": not evenly distributed across a range of stars, but clustered at the top and the bottom. This is known as a "J-shaped distribution" or, more colorfully, the "brag and moan phenomenon."
The curve is J-shaped, not reverse-candy-cane-shaped, because of another phenomenon in online ratings: a "positivity bias." On Goodreads.com, the average is 3.8 stars out of 5. On Yelp, one analysis found, the reviews suffer from an "artificially high baseline." The average of all reviews on TripAdvisor is 3.7 stars; when a similar property is listed on Airbnb, it does even better, because owners can review guests.
Similarly, on eBay, hardly anyone leaves negative feedback, in part because, in a kind of variant of the famed "ultimatum game," both buyer and seller can rate each other. Positivity bias was so rampant that in 2009 eBay overhauled its system. Now vendors, rather than needing to reach a minimum threshold of stars to ensure they were meeting the site's "minimum service standard," needed to have a certain number of negative reviews. They had to be bad to be good.
A few years ago, YouTube had a problem: Everyone was leaving five-star reviews. "Seems like when it comes to ratings," the site's blog noted, "it's pretty much all or nothing." The ratings, the site's engineers reasoned, were primarily being used as a "seal of approval," a basic "like," not as some "editorial indicator" of overall quality (the next most popular rating was one star, for all the dislikers). Faced with this massively biased, nearly meaningless statistical regime, they switched to a "thumbs up/thumbs down" rating system. Yet the binary system is hardly without flaws. The kitten video that has a mildly cute kitten (let us be honest, a fairly low bar) is endowed with the same sentiment as the world's cutest kitten video. But in the heuristic, lightning-fast world of the Internet, where information is cheap and the cost of switching virtually nil, people may not want an evaluation system that takes as much time as the consumption experience. And so all likes are alike.
And then there is the act of reviewing the review, or the reviewer.
The most helpful reviews actually make people more likely to buy something, particularly when it comes to “long tail” products. But these reviews suffer from their own kinds of curious dynamics. Early reviews get more helpfulness votes, and the more votes a review has, the more votes it tends to attract.
On Amazon, reviews that themselves were judged more "helpful" helped drive more sales, regardless of how many stars were given to the product.
What makes a review helpful? A team of Cornell University and Google researchers, looking at reviewing behavior on Amazon.com, found that a review's "helpfulness" rating falls as the review's star rating deviates from the average number of stars. Defining "helpfulness" is itself tricky: Did a review help someone make a purchase, or was it being rewarded for conforming with what others were saying? To explore this, they identified reviews in which text had been plagiarized, a "rampant" practice on Amazon, they note, in which the very same review is used for different products. They found, with these pairs, that when one copy of a review sat closer to the average star rating of all reviews on its product, it was deemed more helpful than the other. In other words, regardless of its actual content, a review was better when it was more like what other people had said.
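In rough terms, the finding is that helpfulness behaves like a penalty on deviation: the further a review's star rating sits from the product's average, the smaller its share of "helpful" votes. A toy calculation, with invented numbers, makes the shape of the claim concrete.

```python
# Invented numbers illustrating the reported pattern: the helpfulness
# ratio (helpful votes / total votes) tends to fall as a review's star
# rating deviates from the product's average rating.
product_average = 4.2  # hypothetical average star rating for one product

# (stars given, helpful votes, total votes on the review) -- all made up
reviews = [(4, 44, 50), (5, 78, 100), (3, 12, 30), (1, 4, 40)]

for stars, helpful, total in reviews:
    deviation = abs(stars - product_average)
    ratio = helpful / total
    print(f"{stars} stars  deviation {deviation:.1f}  helpfulness {ratio:.2f}")
```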
Taste is social comparison. As Todd Yellin had said to me at Netflix, "How many times have you seen someone in an unfamiliar situation, like 'I'm at an opera and I've never been before'? They'll look right, they'll look left, they'll look around. 'Is this a good one?'"
When the performance is over, whether a person joins in a standing ovation may have as much to do with what the surrounding crowd is doing as with whether he actually liked it. By contrast, when we cannot see what someone has chosen, as studies have shown, odds are we will choose differently.
Small wonder, then, that on social media, where the opinion of many others is ubiquitous and rather inescapable, we should find what Sinan Aral, a professor of management at MIT, has called "social influence bias." Aral and his colleagues wanted to know if the widespread positivity bias in rating behavior was due to previous ratings. How much of that four-and-a-half-star restaurant rating is down to the restaurant itself, and how much to previous people voting it four and a half stars? Does a picture's first Instagram "like" lead it to attract more likes than it otherwise would?
So Aral and his colleagues devised a clever experiment, using a Digg-style "social news aggregation" site where users post articles, make comments on articles, and then "thumb up" or "thumb down" those comments. They divided some 100,000 comments into three groups. There was a "positive" group, in which comments had been artificially seeded with an "up" vote. Then there was a "negative" group, where comments were seeded "down." A control group of comments was left unseeded.
As on other sites, things kicked off with an initial positivity bias: people were already 4.6 times more likely to vote up than down. When the first vote was artificially made "up," however, it led to an even greater cascade of positivity. Not only was the next vote more likely to be positive, but the ones after that were too. When the first vote was negative, the next vote was more likely to be negative as well. But eventually, those negatives would be "neutralized" by a counterforce of positive reviewers, like some cavalry riding in to the rescue.
What was happening? The researchers argued that up or down votes per se were not bringing out the people who generally like to vote up or down. It was that the presence of a rating on a comment encouraged more people to rate, and to rate even more positively than might be expected. Even people who were negative on control comments (the ones that had not been seeded) tended to be more positive on comments seeded with a "down" vote. As Aral describes it, "We tend to herd on positive opinions and remain skeptical of negative ones."
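One way to see how a single early vote can tilt everything that follows is a toy path-dependence simulation. The sketch below is not Aral's experimental design; the numbers (a roughly 4.6-to-1 baseline preference for up-votes, the urn-style influence rule) are assumptions chosen only to show the mechanism, and it does not capture the asymmetry he observed, in which seeded down-votes tend to get neutralized.

```python
# A Polya-urn-style toy model of voting under social influence: each new
# voter votes "up" with probability equal to the current, lightly smoothed
# share of up-votes. All parameters are invented for illustration.
import random

def final_up_share(seed_up, seed_down, n_voters=500, prior_weight=8, baseline=0.82):
    """Seed the tally, then let voters herd on the running up-share.
    The 0.82 baseline mirrors the roughly 4.6-to-1 up/down ratio seen
    before any seeding."""
    up, down = seed_up, seed_down
    for _ in range(n_voters):
        p_up = (up + prior_weight * baseline) / (up + down + prior_weight)
        if random.random() < p_up:
            up += 1
        else:
            down += 1
    return up / (up + down)

random.seed(1)
for label, (u, d) in [("seeded up", (1, 0)), ("control", (0, 0)), ("seeded down", (0, 1))]:
    runs = [final_up_share(u, d) for _ in range(2000)]
    print(f"{label:11s} mean final up-share: {sum(runs) / len(runs):.3f}")
```

Even in this crude setup, one seeded vote shifts the long-run share of positive votes, which is the kind of path dependence described below.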
The stakes are higher than just a few clicks on Digg temporarily lifting an article above the tide. An early positive review, authentic or not, can send a subtle ripple through all later reviews. Aral's study found that seeding a review positively boosted overall opinion scores by 25 percent, a result that persisted. Early positive reviews can create path dependence. Even if one went through and removed false reviews, the damage would have been done; those reviews might have influenced “authentic” reviews. “These ratings systems are ostensibly designed to give you unbiased aggregate opinion of the crowd,” Aral told me. But, as with that standing ovation, can we find our own opinion amid the roar of the crowd?
All this does not mean that ratings, having been pushed in a certain positive direction, always rise. In fact, on a site like Amazon, while "sequential bias" patterns have been found, there is a general tendency for book ratings to grow more negative over time. "The more ratings amassed on a product," one study noted, "the lower the ratings will be." What distinguishes Amazon from the one-click liking or disliking mechanisms seen in Aral's experiment is the higher cost of expression: You cannot just say how much you like or dislike something; you have to give some explanation as to why.
This seems to change behavior.
As the HP Labs researchers Fang Wu and Bernardo Huberman found in a study of Amazon reviewers, in contrast to the “herding and polarization” effects seen at the Digg-style sites, Amazon reviewers seem to react to previous “extreme” raters. Someone rating on the heels of a one-star review may feel compelled to “balance it out” with a three-star, when in reality he was thinking of leaving a two-star review. This reaction to extremes can lead to an overall “softening” of opinion over time.
One reason, they suspect, is an inherent desire to stand out from the crowd, to actually affect the result or inflate one's sense of self-worth. “What is the point of leaving another 5-star review,” Wu and Huberman ask, “when one hundred people have already done so?”
Rationally, there is none, just as, in the “voter's paradox,” there is little rational sense in voting in elections where one's individual vote will not affect the outcome (although, unlike with voting, there is evidence that recent reviews do affect sales). So the people who leave opinions, after time, tend to be those who disagree with previous opinions.
It is easy to imagine several stages in the evolution of a book's ratings life on Amazon. The earliest reviews tend to come from people who are most interested in the book (not to mention an author's friends and relatives, if not the author herself) and who are most likely to like it.
Taste is self-selection writ large. But once an author's fans and other motivated customers have weighed in, over time a book might attract a wider audience with "weaker preferences," as the researchers David Godes and José Silva suggest. Whether they are more clear-eyed and objective critics, or they do not "get" a book the way early reviewers did, their opinion begins to diverge. With many books, a pronounced "undershooting dynamic" kicks in: a period in which reviews are even lower than the eventual lower average, as readers, perhaps swayed by the previous "positive review bias," make essentially mistaken purchases. Then they weigh in, in what might be called the "don't-believe-the-hype" effect. So begins a feedback loop. "The more reviews there are," Godes and Silva suggest, "the lower the quality of the information available, which leads to worse decisions which leads to lower ratings." It is not uncommon to find late, fairly flummoxed one-star reviews of only a sentence or two: "I just didn't like it."
As more reviews are posted, people spend less time talking about the contents (because so many other people already have) than about the content of the other reviews.
When a review mentions a previous review, it is more likely to be negative.
Context takes over.
Which leads us back to Aral. How can you actually tell if a review has been influenced by a previous review, or whether it is simply homophily: correlated group preference, or, more simply, the idea that "birds of a feather flock together"? He gives an example from the sociologist Max Weber: If you see a group of people in a field open umbrellas as it begins to drizzle, you would not say they are influencing one another. They are merely people reacting to exogenous conditions, with a correlated group preference not to get wet. If a person opening an umbrella when there was no rain could get others to do so, that would be social influence.