Author: Douglas Edwards
Had the company bitten off more than it could chew? Yahoo's traffic dwarfed Google's, and the moment Inktomi stopped answering Yahoo queries, Google would need to respond to each and every one of them quickly and correctly. With the capacity we had in place at the time, it would be like hooking a garden hose to a fire hydrant.
"Operationally it was a large buildup," Urs said, "so we needed to get lots of servers." Five thousand, to be exact, each of which required hand assembly. The new data centers that would hold them had yet to be identified or prepped for occupancy. New networking systems would have to be developed to ensure queries went to the right place and results were identical regardless of where the responding machine resided.
The Google operations team worked with the contractor, Rackable Systems, building the machines from parts Gerald supplied. A "bat cave" was set up between Charlie's café and the marketing department for the ops staff to test and burn in servers before shipping them off to the East Coast. The loaded racks were large and heavy but fragile. Google tried renting a truck and having ops move the racks themselves, but one fell over in transit and almost crushed a technician. The decision was made to splurge on professionals. That didn't stop the ops guys from tossing hundreds of thousands of dollars' worth of RAM into their cars and driving it all over town.
Once all the racks were ready, they needed to be moved cross-country practically overnight. Christopher Bosch hired a truck to drive nonstop to the new data center in Virginia. It would leave the highway only to change drivers en route, and once it reached the data center, the racks would roll out the back and directly into the new server farm.
Serving capacity was just the first item on a very long checklist, and in some ways the easiest. The pressure to build quickly was enormous, but at least with hardware the task was clearly defined, the process known, and the progress clearly measurable. Adding hardware to add capacity would not solve Google's problems by itself, however, even without the Yahoo deal.
"Our traffic was increasing eight percent a week for a long time," said Jeff Dean. "Any time you have that rate of growth, you basically have to make software improvements continuously because you can't get the hardware deployed fast enough. They were working as hard as they could, but if you add four percent machines per week, and you've got eight percent growth, that's not good. So we continuously worked on improving our serving system and looked at alternative designs for pieces of it to get more throughput." That kind of software required creativity and design breakthroughs that could not be scheduled in the same way as a server assembly line.
"It was actually good for us," said Urs, who believed contractual obligations to continually refresh the index and meet latency guarantees made Google "a more grown-up company" by forcing it to closely monitor its own progress. "We always wanted to be fast, but the contract said, 'The maximum one-hour latency is this.'
*
You had to start measuring it. Once you can measure it, it's much easier to set goals and say, 'Can we make this ten percent better by next month?' So a lot of these things started then."
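What "start measuring it" looks like in practice can be sketched in a few lines; the text does not say exactly which latency figure the contract bounded, so the samples and the ceiling below are invented for illustration only:

```python
# Hypothetical sketch of the measure-then-improve loop Urs describes.
# The samples and the contractual ceiling are invented numbers.
latency_samples_ms = [120, 95, 430, 210, 88, 600, 150]

contract_ceiling_ms = 1000                      # whatever figure the contract fixed
worst_observed = max(latency_samples_ms)
next_month_goal = worst_observed * 0.9          # "ten percent better by next month"

assert worst_observed <= contract_ceiling_ms, "contract violated"
print(f"worst observed: {worst_observed} ms, next month's goal: {next_month_goal:.0f} ms")
```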
Along with speed and capacity, Google had promised Yahoo fresher results. A seemingly reasonable offer, except that in March 2000 Google's crawler was crippled and barely running. The Googlebot software would stumble around the web gathering URLs, lose its equilibrium, then crash to a halt. The engineers would restart it and the same thing would happen. Try again. Crash. Google hadn't built a new index in four months—approximately the time I had been at the company, though I never pointed out that coincidence to anyone.
Creating an index required weeks of collecting information about what websites existed and what they contained, data that then had to be compiled into a usable list of URLs that could be ranked and presented as search results when someone submitted a query. Most users assumed that when they typed in a search term the results reflected the exact state of the web at the time they hit Enter, so they were puzzled and sometimes angry when they didn't find the very latest news and information. When an index hadn't been updated for more than a month, it became noticeably stale and user dissatisfaction increased. Google's index wasn't just stale, it was covered in mold. Without a working crawler, Google would violate its contract with Yahoo, and more important, Google.com would become increasingly useless.
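As a toy illustration of what "compiled into a usable list of URLs" involves at its simplest, the sketch below builds an inverted index over two made-up pages and answers a one-word query; the pages, the naive term-count ranking, and every name in it are invented and bear no resemblance to Google's real pipeline:

```python
# A toy inverted index, only to make the crawl -> index -> query sequence concrete.
# The pages and the naive term-count ranking are invented for illustration.
from collections import defaultdict

crawled_pages = {                       # what a finished crawl hands the indexer
    "http://example.com/a": "cheap flights to boston",
    "http://example.com/b": "boston hotels and cheap boston restaurants",
}

index = defaultdict(list)               # term -> [(url, occurrences), ...]
for url, text in crawled_pages.items():
    words = text.split()
    for term in set(words):
        index[term].append((url, words.count(term)))

def search(term: str) -> list[str]:
    """Return URLs containing the term, most occurrences first."""
    return [url for url, _ in sorted(index.get(term, []), key=lambda hit: -hit[1])]

print(search("boston"))     # ['http://example.com/b', 'http://example.com/a']
```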
"It wasn't really very tolerant of machine failures," Jeff Dean recalls about the crawler—"a half-working mess of Python scripts"—written before he joined the company.
*
"So when a machine failed in the middle of a crawl, it just ended. Well, that crawl was useless. Let's just start it over again. So you'd crawl for ten days and then the machine would die. Oh well. Throw it away. Start over. That was kind of painful." As Jeff put it, the backup plan was "Oh, shit. Let's try that again."
Unfortunately, the more the index grew, the more machines it needed to run, and the more machines that ran, the more likely it was that one or more would fail, especially since Google hadn't ponied up for the kind of hardware that gave warnings when it had problems.
"Our machines didn't have parity," noted Jeff. Parity could tell you if a computer's memory was randomly flipping bits in ways it shouldn't. "parity memory was for high-end servers, and non-parity memory was what the cheap unwashed masses bought. The problem was, if you had a lot of data and you'd say, 'Sort it,' it ended up mostly sorted. And if you sorted it again, it ended up mostly sorted a different way." As a result, Google's engineers had to build data structures that were resilient in the face of what Ben Gomes called "adversarial memory."
"For a while I had all these bugs because the machines were crappy," Gomes told me. "I was writing this code—one of my first big projects out of school—and it was crashing all the time. I was sitting in this room with people I really respected, Jeff and Sanjay and Urs, and Jeff said, 'The pieces are mostly working, but the pageranker keeps crashing,' and I wanted to sink into the ground. I stayed up for nights on end trying to figure out
why
was it crashing? Finally I checked this one thing I had set to zero and I looked at it and it was at four. And I thought, 'Not my bug.' After that I felt a lot better."
Finally, in March 2000, a new crawl hobbled to the finish line—with, according to Gomes, "a lot of pain." "The March index was the last gasp of the old crawler, but it had so many bugs in it, it was impossible to push it out." That index came to be known internally as the "MarIndex," denoting both the month it was created and the quality of its content. Larry and Sergey declared a state of emergency and, as they would time and again when events threatened to overwhelm the company, convened a war room. Jeff Dean, Sanjay Ghemawat, Craig Silverstein, Bogdan Cocosel, and Georges Harik moved their computers into the yellow conference room to bang their brains against the problem. By early April they had patched the index to the point that it could be sent to the servers, but it limped and lurched every step of the way.
The old crawler would never be able to jump through Yahoo's hoops.
Urs recognized from the beginning that before Google could make a quantum leap to a higher state it would have to correct the mistakes of the past—especially the creaking codebase underlying the main systems. "We've fixed some of the problems," he noted after putting out the fires immediately threatening to consume Google the day he first walked in the door, "but we should really restart completely from scratch." That was a risky thing to do, because it required using resources that were in short supply to fix something that wasn't yet broken. Most companies would put off such a complex task until a time they were less overcommitted.
"This wasn't the burning problem of the day," Urs told me. "The site wasn't down because of it; it was just a productivity problem. If you stayed in the old, messy world too long, your effectiveness would continue to go down." He gave the green light in the fall of 1999 to create a new codebase called Google Two. New systems would run on Google Two, and the original codebase would be phased out. Jeff and Craig started working on it, but writing new infrastructure took time—and time refused to stand still, even for the engineers at Google.
In the months that followed, ballooning traffic increased the pressure at every point, and as Urs had predicted, cracks appeared in the lines of the original Stanford code.
Sitting in the hot tub at Squaw Valley Resort during the company's annual ski trip, two months before the March index meltdown, Craig suggested to Jeff that they write an entirely new crawler and indexer for Google Two. It would be cleaner than replacing the old ones bit by bit. Jeff saw the logic in that. The MarIndex suddenly gave that project urgency. Joined by Sanjay and Ben Gomes, they ripped out Google's aging guts and replaced them with a streamlined block of high-efficiency algorithms.
The team didn't know Yahoo was floating out there, nibbling on Omid's line, or that when he landed the contract in May, they would have only a month to complete their work. The new systems would have to be completely stable under triple the highest load Google had ever handled. They would need to distribute queries to thousands of servers in multiple data centers and automatically balance the flux of traffic on the basis of machine availability. The engineers couldn't shut off Google while they tested the system, and they couldn't drop a single Yahoo query once the deal was done. Google embraced risk, but sensible, talented engineers could infer that this indicated a company-wide death wish. So many ways to go so horribly wrong. Urs evaluated the situation and his team's capabilities and decided they needed more of a challenge.
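Stripped to its bare bones, the routing requirement reads something like the sketch below: send each query to a healthy replica, preferring the least-loaded one. The server names, health flags, and load figures are invented, and Google's real traffic balancing was far more involved:

```python
# Bare-bones availability-aware routing; every name and number is invented.
servers = [
    {"name": "dc-east-01", "healthy": True,  "load": 0.72},
    {"name": "dc-east-02", "healthy": False, "load": 0.10},  # down: never picked
    {"name": "dc-west-01", "healthy": True,  "load": 0.35},
]

def route(query: str) -> str:
    """Pick the least-loaded healthy replica for this query."""
    candidates = [s for s in servers if s["healthy"]]
    if not candidates:
        raise RuntimeError("no healthy replicas")   # the nightmare scenario
    target = min(candidates, key=lambda s: s["load"])
    return f"{query!r} -> {target['name']}"

print(route("palo alto restaurants"))   # 'palo alto restaurants' -> dc-west-01
```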
The great white whale of search in early 2000 was an index of a billion URLs. No one had come within sight of anything close, but Urs set out to harpoon just such a beast with Google's shiny new crawler. While only half the pages in the index would be fully indexed (meaning the crawler would examine their full content, not just identify the URL at which the content lived), it would still overshadow Inktomi, which claimed to have crawled 110 million full pages. A billion URLs were not required for the Yahoo deal, but to the crawl team, the "1B" index held epic significance. It would catapult Google into the undisputed lead as the builder of the best, most scalable search technology in the world.
"In search," Urs believed, "the discussion was really, How can we outdistance our current system and make it look laughable? That's the best definition of success: if a new system comes out and everyone says, 'Wow, I can't believe we put up with the old thing because it was so primitive and limited compared to this.'"
"Urs just said, 'We will have a 1B index,'" recalls Ben Gomes, "and it seemed like crazy talk." Gomes knew that increasing Google's size by that amount required more than slight improvements in current methods. "My advisor had a saying," he told me. "'An order of magnitude is qualitative, not quantitative.' When you go up by an order of magnitude, the problem is different enough that it demands different solutions. It's discontinuous."
Given the two equally impossible tasks—meeting Yahoo's requirements and creating the world's first billion-URL index—Larry and Sergey doubled down. Google would do both and do them at the same time. Google would begin answering Yahoo queries on Monday, July 3, leaving the Fourth of July holiday to repair any major disasters that might occur. It was a one-day margin of error in which to parse convoluted code, find bugs, squash them, and, if necessary, restart a system that had never dealt with the load it would now be called upon to carry twenty-four hours a day, seven days a week.
Late on an April evening in the Googleplex, a steady clicking sound filled the space between the fabric walls, echoing the spring rain tapping against the windows. In his office, Jeff Dean stood looking over the shoulder of Sanjay Ghemawat, suggesting code variations as Sanjay typed and lines of text scrolled off his screen like the stairs of an ascending escalator.
Craig Silverstein drifted in, twisting a purple and black toy spider in his hands. Craig looked over Sanjay's other shoulder and set the spider down to point to a command about which he had a question. Jeff answered and Craig, satisfied, left. The spider remained—forsaken, but not alone. A block puzzle held together by an elastic band and a grip strengthener lay nearby, visible testaments to Craig's recurring visits. Ben Gomes stood in the hall, tossing beanbags into the air and catching some of them. Juggling rejuvenated him after hours of screen time and broke up the crusty patches that formed over his creativity. Inside the "Ben Pen," the office he shared with Ben Polk and Ben Smith, the soundtrack of the 1986 film The Mission, a tale of pride and the struggle for redemption, swelled to a crescendo.
They were babysitting the crawl.
The engineers took turns monitoring the crawl's progress to make sure it didn't fail because of a single machine running amok. Urs had finally confided in his team why he was pushing so hard for so much new code. Everyone now knew the Yahoo deal was real and the deadline firm. Intensity set around the engineering group, a hardening cement of stress and pressure that grew firmer with each passing hour.
Even the implacable JeffnSanjay were not immune to its effects. "It was only a few months after I joined," recalls Sanjay, "and it was one of the most stressful times working at Google. We saw the deal and we knew when we had to get things done. We could do the math."