We'll call this simple trick, of indexing metawords in the same way as normal words, the “metaword trick.” It might seem ridiculously simple, but this metaword trick plays a crucial role in allowing search engines to perform accurate searches and high-quality rankings. Let's look at a simple example of this. Suppose for a moment that a search engine supports a special type of query using the IN keyword, so that a query like boat IN TITLE returns hits only for pages that have the word “boat” in the title of the web page, and giraffe IN BODY would find pages whose body contains “giraffe.” Note that most real search engines do not provide IN queries in exactly this way, but some of them let you achieve the same effect by clicking on an “advanced search” option where you can specify that your query words must be in the title, or some other specific part of a document. We are pretending that the IN keyword exists purely to make our explanations easier. In fact, at the time of writing, Google lets you do a title search using the keyword intitle:, so the Google query intitle:boat finds pages with “boat” in the title. Try it for yourself!

The index for the web pages shown in the previous figure, including metawords.
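
To make this concrete, here is a minimal sketch (in Python) of an index that stores metawords just like ordinary words: each word maps to a list of page-and-position hits. The entries for "dog" and for the title metawords match the three-page example worked through below; writing the metawords as <titleStart> and <titleEnd> is just one way to spell them, and all other entries are omitted.

    # A tiny inverted index in which metawords are indexed like any other word.
    # Each entry is a list of (page, position) hits.
    index = {
        "dog":          [(2, 3), (2, 7), (3, 11)],
        "<titleStart>": [(1, 1), (2, 1), (3, 1)],
        "<titleEnd>":   [(1, 4), (2, 4), (3, 4)],
        # ...entries for every other word on the three pages would go here.
    }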

How a search engine performs the search dog IN TITLE.

Let's see how a search engine could efficiently perform the query dog IN TITLE on the three-page example shown in the last two figures. First, it extracts the index entry for "dog," which is 2-3, 2-7, 3-11. Then (and this might be a little unexpected, but bear with me for a second) it extracts the index entries for both <titleStart> and <titleEnd>. That results in 1-1, 2-1, 3-1 for <titleStart> and 1-4, 2-4, 3-4 for <titleEnd>. The information extracted so far is summarized in the figure above—you can ignore the circles and boxes for now.

The search engine then starts scanning the index entry for "dog," examining each of its hits and checking whether or not it occurs inside a title. The first hit for "dog" is the circled entry 2-3, corresponding to the third word of page number 2. By scanning along the entries for <titleStart>, the search engine can find out where the title for page 2 begins—that should be the first number that starts with "2-." In this case it arrives at the circled entry 2-1, which means that the title for page 2 begins at word number 1. In the same way, the search engine can find out where the title for page 2 ends. It just scans along the entries for <titleEnd>, looking for a number that starts with "2-," and therefore stops at the circled entry 2-4. So page 2's title ends at word 4.

Everything we know so far is summarized by the circled entries in the figure, which tell us the title for page 2 starts at word 1 and ends at word 4, and the word “dog” occurs at word 3. The final step is easy: because 3 is greater than 1 and less than 4, we are certain that this hit for the word “dog” does indeed occur in a title, and therefore page 2 should be a hit for the query dog IN TITLE.

The search engine can now move to the next hit for "dog." This happens to be 2-7 (the seventh word of page 2), but because we already know that page 2 is a hit, we can skip over this entry and move on to the next one, 3-11, which is marked by a box. This tells us that "dog" occurs at word 11 on page 3. So we start scanning past the current circled locations in the rows for <titleStart> and <titleEnd>, looking for entries that start with "3-." (It's important to note that we do not have to go back to the start of each row—we can pick up wherever we left off scanning from the previous hit.) In this simple example, the entry starting with "3-" happens to be the very next number in both cases—3-1 for <titleStart> and 3-4 for <titleEnd>. These are both marked by boxes for easy reference. Once again, we have the task of determining whether the current hit for "dog" at 3-11 is located inside a title or not. Well, the information in boxes tells us that on page 3, "dog" is at word 11, whereas the title begins at word 1 and ends at word 4. Because 11 is greater than 4, we know that this occurrence of "dog" occurs after the end of the title and is therefore not in the title—so page 3 is not a hit for the query dog IN TITLE.

So, the metaword trick allows a search engine to answer queries about the structure of a document in an extremely efficient way. The example above was only for searching inside page titles, but very similar techniques allow you to search for words in hyperlinks, image descriptions, and various other useful parts of web pages. And all of these queries can be answered as efficiently as the example above. Just like the queries we discussed earlier, the search engine does not need to go back and look at the original web pages: it can answer the query by consulting just a small number of index entries. And, just as importantly, it only needs to scan through each index entry once. Remember what happened when we had finished processing the first hit on page 2 and moved to the possible hit on page 3: instead of going back to the start of the entries for <titleStart> and <titleEnd>, the search engine could continue scanning from where it had left off. This is a crucial element in making the IN query efficient.
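
As a rough sketch of how that single left-to-right scan might be coded, the function below checks each hit for the query word against the title boundaries of its page, advancing pointers into the <titleStart> and <titleEnd> entries that never move backward. It assumes the small index sketched earlier, that each page has exactly one title, and that every hit list is sorted by page and word position; the function name is purely illustrative.

    def pages_with_word_in_title(word, index):
        # Returns the set of pages where "word" occurs inside the title,
        # reading each index entry only once, from left to right.
        word_hits = index[word]             # e.g., [(2, 3), (2, 7), (3, 11)]
        starts = index["<titleStart>"]      # e.g., [(1, 1), (2, 1), (3, 1)]
        ends = index["<titleEnd>"]          # e.g., [(1, 4), (2, 4), (3, 4)]
        hits, s, e = set(), 0, 0
        for page, position in word_hits:
            if page in hits:
                continue                    # this page is already a hit; skip
            # Advance (never rewind) to the title entries for this page.
            while s < len(starts) and starts[s][0] < page:
                s += 1
            while e < len(ends) and ends[e][0] < page:
                e += 1
            if s < len(starts) and e < len(ends) and starts[s][0] == page == ends[e][0]:
                if starts[s][1] <= position <= ends[e][1]:
                    hits.add(page)
        return hits

On the three-page example, pages_with_word_in_title("dog", index) returns a hit for page 2 only, exactly as in the walk-through above.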

Title queries and other "structure queries" that depend on the structure of a web page are similar to the NEAR queries discussed earlier, in that humans rarely employ structure queries, but search engines use them internally all the time. The reason is the same as before: search engines live or die by their rankings, and rankings can be significantly improved by exploiting the structure of web pages. For example, pages that have "dog" in the title are much more likely to contain information about dogs than pages that mention "dog" only in the body of the page. So when a user enters the simple query dog, a search engine could internally perform a dog IN TITLE search (even though the user did not explicitly request that) to find pages that are most likely to be about dogs, rather than just happening to mention dogs.

INDEXING AND MATCHING TRICKS ARE NOT THE WHOLE STORY

Building a web search engine is no easy task. The final product is like an enormously complex machine with many different wheels, gears, and levers, which must all be set correctly for the system to be useful. Therefore, it is important to realize that the two tricks presented in this chapter do not by themselves solve the problem of building an effective search engine index. However, the word-location trick and the metaword trick certainly convey the flavor of how real search engines construct and use indexes.

The metaword trick did help AltaVista succeed—where others had failed—in finding efficient matches to the entire web. We know this because the metaword trick is described in a 1999 U.S. patent filing by AltaVista, entitled "Constrained Searching of an Index." However, AltaVista's superbly crafted matching algorithm was not enough to keep it afloat in the turbulent early days of the search industry. As we already know, efficient matching is only half the story for an effective search engine: the other grand challenge is to rank the matching pages. And as we will see in the next chapter, the emergence of a new type of ranking algorithm was enough to eclipse AltaVista, vaulting Google into the forefront of the world of web search.

3

PageRank: The Technology That Launched Google

The Star Trek computer doesn't seem that interesting. They ask it random questions, it thinks for a while. I think we can do better than that.

—Larry Page (Google cofounder)

Architecturally speaking, the garage is typically a humble entity. But in Silicon Valley, garages have a special entrepreneurial significance: many of the great Silicon Valley technology companies were born, or at least incubated, in a garage. This is not a trend that began in the dot-com boom of the 1990s. Over 50 years earlier—in 1939, with the world economy still reeling from the Great Depression—Hewlett-Packard got underway in Dave Packard's garage in Palo Alto, California. Several decades after that, in 1976, Steve Jobs and Steve Wozniak operated out of Jobs' garage in Los Altos, California, after founding their now-legendary Apple computer company. (Although popular lore has it that Apple was founded in the garage, Jobs and Wozniak actually worked out of a bedroom at first. They soon ran out of space and moved into the garage.) But perhaps even more remarkable than the HP and Apple success stories is the launch of a search engine called Google, which operated out of a garage in Menlo Park, California, when first incorporated as a company in September 1998.

By that time, Google had in fact already been running its web search service for well over a year—initially from servers at Stanford University, where both of the cofounders were Ph.D. students. It wasn't until the bandwidth requirements of the increasingly popular service became too much for Stanford that the two students, Larry Page and Sergey Brin, moved the operation into the now-famous Menlo Park garage. They must have been doing something right, because only three months after its legal incorporation as a company, Google was named by PC Magazine as one of the top 100 websites for 1998.

And here is where our story really begins: in the words of PC Magazine, Google's elite status was awarded for its "uncanny knack for returning extremely relevant results." You may recall from the last chapter that the first commercial search engines had been launched four years earlier, in 1994. How could the garage-bound Google overcome this phenomenal four-year deficit, leapfrogging the already-popular Lycos and AltaVista in terms of search quality? There is no simple answer to this question. But one of the most important factors, especially in those early days, was the innovative algorithm used by Google for ranking its search results: an algorithm known as PageRank.

The name “PageRank” is a pun: it's an algorithm that ranks web pages, but it's also the ranking algorithm of Larry Page, its chief inventor. Page and Brin published the algorithm in 1998, in an academic conference paper, “The Anatomy of a Large-scale Hypertextual Web Search Engine.” As its title suggests, this paper does much more than describe PageRank. It is, in fact, a complete description of the Google system as it existed in 1998. But buried in the technical details of the system is a description of what may well be the first algorithmic gem to emerge in the 21st century: the PageRank algorithm. In this chapter, we'll explore how and why this algorithm is able to find needles in haystacks, consistently delivering the most relevant results as the top hits to a search query.

THE HYPERLINK TRICK

You probably already know what a hyperlink is: it is a phrase on a web page that takes you to another web page when you click on it. Most web browsers display hyperlinks underlined in blue so that they stand out easily.

Hyperlinks are a surprisingly old idea. In 1945—around the same time that electronic computers themselves were first being developed—the American engineer Vannevar Bush published a visionary essay entitled "As We May Think." In this wide-ranging essay, Bush described a slew of potential new technologies, including a machine he called the memex. A memex would store documents and automatically index them, but it would also do much more. It would allow "associative indexing,…whereby any item may be caused at will to select immediately and automatically another"—in other words, a rudimentary form of hyperlink!

The basis of the hyperlink trick. Six web pages are shown, each represented by a box. Two of the pages are scrambled egg recipes, and the other four are pages that have hyperlinks to these recipes. The hyperlink trick ranks Bert's page above Ernie's, because Bert has three incoming links and Ernie only has one.

Hyperlinks have come a long way since 1945. They are one of the most important tools used by search engines to perform ranking, and they are fundamental to Google's PageRank technology, which we'll now begin to explore in earnest.

The first step in understanding PageRank is a simple idea we'll call the hyperlink trick. This trick is most easily explained by an example. Suppose you are interested in learning how to make scrambled eggs and you do a web search on that topic. Now any real web search on scrambled eggs turns up millions of hits, but to keep things really simple, let's imagine that only two pages come up—one called "Ernie's scrambled egg recipe" and the other called "Bert's scrambled egg recipe." These are shown in the figure above, together with some other web pages that have hyperlinks to either Bert's recipe or Ernie's. To keep things simple (again), let's imagine that the four pages shown are the only pages on the entire web that link to either of our two scrambled egg recipes. The hyperlinks are shown as underlined text, with arrows to show where the link goes to.

The question is, which of the two hits should be ranked higher, Bert or Ernie? As humans, it's not much trouble for us to read the pages that link to the two recipes and make a judgment call. It seems that both of the recipes are reasonable, but people are much more enthusiastic about Bert's recipe than Ernie's. So in the absence of any other information, it probably makes more sense to rank Bert above Ernie.

Unfortunately, computers are not good at understanding what a web page actually means, so it is not feasible for a search engine to examine the four pages linking to the hits and make an assessment of how strongly each recipe is recommended. On the other hand, computers are excellent at counting things. So one simple approach is to simply count the number of pages that link to each of the recipes—in this case, one for Ernie, and three for Bert—and rank the recipes according to how many incoming links they have. Of course, this approach is not nearly as accurate as having a human read all the pages and determine a ranking manually, but it is nevertheless a useful technique. It turns out that, if you have no other information, the number of incoming links that a web page has can be a helpful indicator of how useful, or "authoritative," the page is likely to be. In this case, the score is Bert 3, Ernie 1, so Bert's page gets ranked above Ernie's when the search engine's results are presented to the user.
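
Because the computer only has to count, this step is almost trivial to program. Here is a minimal Python sketch; the page names are made up purely for illustration, following the scrambled-egg figure.

    # The hyperlink trick: rank pages by the number of incoming links.
    # "links" maps each page to the list of pages it links to.
    links = {
        "fan_page_1": ["berts_recipe"],
        "fan_page_2": ["berts_recipe"],
        "fan_page_3": ["berts_recipe"],
        "fan_page_4": ["ernies_recipe"],
    }

    incoming = {}
    for source, targets in links.items():
        for target in targets:
            incoming[target] = incoming.get(target, 0) + 1

    ranking = sorted(incoming, key=incoming.get, reverse=True)
    print(ranking)   # ['berts_recipe', 'ernies_recipe'] (Bert 3, Ernie 1)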

You can probably already see some problems with this "hyperlink trick" for ranking. One obvious issue is that sometimes links are used to indicate bad pages rather than good ones. For example, imagine a web page that linked to Ernie's recipe by saying, "I tried Ernie's recipe, and it was awful." Links like this one, that criticize a page rather than recommend it, do indeed cause the hyperlink trick to rank pages more highly than they deserve. But it turns out that, in practice, hyperlinks are more often recommendations than criticisms, so the hyperlink trick remains useful despite this obvious flaw.

THE AUTHORITY TRICK

You may already be wondering why all the incoming links to a page should be treated equally. Surely a recommendation from an expert is worth more than one from a novice? To understand this in detail, we will stick with the scrambled eggs example from before, but with a different set of incoming links. The figure on the following page shows the new setup: Bert and Ernie each now have the same number of incoming links (just one), but Ernie's incoming link is from my own home page, whereas Bert's is from the famous chef Alice Waters.

If you had no other information, whose recipe would you prefer? Obviously, it's better to choose the one recommended by a famous chef, rather than the one recommended by the author of a book about computer science. This basic principle is what we'll call the “authority trick”: links from pages with high “authority” should result in a higher ranking than links from pages with low authority.

The basis for the authority trick. Four web pages are shown: two scrambled egg recipes and two pages that link to the recipes. One of the links is from the author of this book (who is not a famous chef) and one is from the home page of the famous chef Alice Waters. The authority trick ranks Bert's page above Ernie's, because Bert's incoming link has greater “authority” than Ernie's.

This principle is all well and good, but in its present form it is useless to search engines. How can a computer automatically determine that Alice Waters is a greater authority on scrambled eggs than me? Here is an idea that might help: let's combine the hyperlink trick with the authority trick. All pages start off with an authority score of 1, but if a page has some incoming links, its authority is calculated by adding up the authority of all the pages that point to it. In other words, if pages X and Y link to page Z, then the authority of Z is just the authority of X plus the authority of Y.

The figure on the next page gives a detailed example, calculating authority scores for the two scrambled egg recipes. The final scores are shown in circles. There are two pages that link to my home page; these pages have no incoming links themselves, so they get scores of 1. My home page gets the total score of all its incoming links, which adds up to 2. Alice Waters's home page has 100 incoming links that each have a score of 1, so she gets a score of 100. Ernie's recipe has only one incoming link, but it is from a page with a score of 2, so by adding up all the incoming scores (in this case there is only one number to add), Ernie gets a score of 2. Bert's recipe also has only one incoming link, valued at 100, so Bert's final score is 100. And because 100 is greater than 2, Bert's page gets ranked above Ernie's.

A simple calculation of “authority scores” for the two scrambled egg recipes. The authority scores are shown in circles.
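
A minimal Python sketch of this calculation is below. The incoming-link structure and the page names simply mirror the figure: two pages link to my home page, one hundred pages link to Alice Waters's home page, and each recipe has a single incoming link.

    # Authority scores for a link structure with no cycles.
    # "incoming_links" maps each page to the pages that link to it.
    incoming_links = {
        "my_home_page": ["friend_page_1", "friend_page_2"],
        "alice_waters_home_page": ["fan_page_%d" % i for i in range(100)],
        "ernies_recipe": ["my_home_page"],
        "berts_recipe": ["alice_waters_home_page"],
    }

    scores = {}

    def authority(page):
        # A page with no incoming links scores 1; otherwise its score is the
        # sum of the authorities of the pages that link to it.
        if page not in scores:
            sources = incoming_links.get(page, [])
            scores[page] = sum(authority(s) for s in sources) if sources else 1
        return scores[page]

    print(authority("ernies_recipe"))   # 2
    print(authority("berts_recipe"))    # 100

Notice that this simple recursion works only because the links here never loop back on themselves; that is precisely the problem taken up next.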

THE RANDOM SURFER TRICK

It seems like we have hit on a strategy for automatically calculating authority scores that really works, without any need for a computer to actually understand the content of a page. Unfortunately, there can be a major problem with the approach. It is quite possible for hyperlinks to form what computer scientists call a “cycle.” A cycle exists if you can get back to your starting point just by clicking on hyperlinks.

The figure on the following page gives an example. There are 5 web pages labeled A, B, C, D, and E. If we start at A, we can click through from A to B, and then from B to E—and from E we can click through to A, which is where we started. This means that A, B, and E form a cycle.

It turns out that our current definition of "authority score" (combining the hyperlink trick and the authority trick) gets into big trouble whenever there is a cycle. Let's see what happens on this particular example. Pages C and D have no incoming links, so they get a score of 1. C and D both link to A, so A gets the sum of C and D, which is 1 + 1 = 2. Then B gets the score 2 from A, and E gets 2 from B. (The situation so far is summarized by the left-hand panel of the figure above.) But now A is out of date: it still gets 1 each from C and D, but it also gets 2 from E, for a total of 4. But now B is out of date: it gets 4 from A. But then E needs updating, so it gets 4 from B. (Now we are at the right-hand panel of the figure above.) And so on: now A is 6, so B is 6, so E is 6, so A is 8,…. You get the idea, right? We have to go on forever with the scores always increasing as we go round the cycle.
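
A few lines of Python make the runaway behavior easy to see. The link structure below follows the five-page example, and starting A, B, and E at zero is just a convenient way of reproducing the sequence of updates described above.

    # The trouble with cycles: recomputing the scores never settles down.
    # "incoming" maps each page to the pages that link to it; A, B, and E
    # form a cycle.
    incoming = {"A": ["C", "D", "E"], "B": ["A"], "C": [], "D": [], "E": ["B"]}

    scores = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 0}   # C and D score 1
    for step in range(4):
        for page in ["A", "B", "E"]:        # update the pages on the cycle
            scores[page] = sum(scores[src] for src in incoming[page])
        print(scores)
    # Pass 1: A=2, B=2, E=2    Pass 2: A=4, B=4, E=4
    # Pass 3: A=6, B=6, E=6    Pass 4: A=8, ... and so on, forever.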
