The general principles of first-principles thinking apply to software, hardware, anything really. Often, we were told something was impossible, but once we broke it down into its constituent elements, we could solve each one.344
For xAI in 2024, we were trying to build a training supercluster (a giant network of computers to train an AI model). We went to the various suppliers and said we needed one hundred thousand H100s (superpowerful computer chips) able to train coherently. Their estimates for completing that ranged from eighteen to twenty-four months. Well, we needed it done in six months or we wouldn’t be competitive.
So, we broke that down: we needed a building, we needed power, and we needed cooling.
We didn’t have enough time to build a building from scratch, so we found a factory in Memphis that was no longer in use. Its input power was 15 megawatts, and we needed 150 megawatts. So, we rented generators and put them along one side of the building.
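The 150-megawatt figure is easy to sanity-check from the chip count. A back-of-the-envelope sketch: the GPU power draw below is NVIDIA's published H100 SXM spec, while the host overhead and PUE values are illustrative assumptions, not xAI's numbers.

```python
# Back-of-the-envelope check on the 150-megawatt figure. GPU_TDP_W is the
# published H100 SXM spec; HOST_OVERHEAD and PUE are assumptions.
NUM_GPUS = 100_000
GPU_TDP_W = 700        # published H100 SXM thermal design power, in watts
HOST_OVERHEAD = 0.5    # assumed: CPUs, NICs, storage, fans per watt of GPU
PUE = 1.3              # assumed power usage effectiveness (cooling, losses)

it_load_mw = NUM_GPUS * GPU_TDP_W * (1 + HOST_OVERHEAD) / 1e6
facility_mw = it_load_mw * PUE
print(f"IT load ~ {it_load_mw:.0f} MW, facility total ~ {facility_mw:.0f} MW")
# prints: IT load ~ 105 MW, facility total ~ 136 MW
```

Under those assumptions the total lands in the same neighborhood as the 150 megawatts the site needed.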
Then, we needed cooling. So, we rented about a quarter of the mobile cooling capacity of the US and put the chillers on the other side of the building.
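To get a feel for why that took a meaningful share of the national fleet, convert the heat load into refrigeration tonnage. A rough sketch, assuming essentially all electrical input becomes heat to reject; the per-unit capacity of a large trailer-mounted chiller is an assumed figure.

```python
# Rough chiller-fleet sizing for a 150 MW heat load.
HEAT_LOAD_MW = 150
KW_PER_TON = 3.517          # 1 ton of refrigeration removes 3.517 kW of heat
TONS_PER_MOBILE_UNIT = 500  # assumed capacity of one large mobile chiller

tons = HEAT_LOAD_MW * 1_000 / KW_PER_TON
units = tons / TONS_PER_MOBILE_UNIT
print(f"~{tons:,.0f} tons of cooling, roughly {units:.0f} large mobile units")
# prints: ~42,650 tons of cooling, roughly 85 large mobile units
```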
That didn’t fully solve the problem, because the power variations during training are very big. Power could drop by 50 percent in 100 milliseconds, which the generators couldn’t keep up with. So then we added Tesla Megapacks and modified their software to smooth out the power variation during the training run.
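A 50 percent drop at 150 megawatts means shedding 75 megawatts in a tenth of a second, a ramp of 750 megawatts per second, far beyond what generators can follow. The actual Megapack firmware is not public, but the control idea can be sketched: let the generators track a slow-moving average of demand while the battery sources or sinks the fast residual they cannot follow.

```python
# Minimal sketch of battery-buffered load smoothing (illustrative only; not
# the Megapack firmware). Generators ramp slowly toward demand; the battery
# covers the instantaneous gap, discharging (+) or charging (-).
ALPHA = 0.02      # assumed generator tracking gain per 100 ms step
gen_mw = 150.0    # generator output starts matched to the load

for step in range(10):
    # Assumed workload: load swings between full power and a 50 percent dip
    # every 100 ms, as between phases of a training step.
    load_mw = 150.0 if step % 2 == 0 else 75.0
    gen_mw += ALPHA * (load_mw - gen_mw)   # generators ramp slowly toward load
    battery_mw = load_mw - gen_mw          # battery covers the fast residual
    print(f"t={step * 100:4d} ms  load={load_mw:6.1f}  gen={gen_mw:6.1f}  "
          f"battery={battery_mw:+6.1f} MW")
```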
Then there were a bunch of networking challenges. Cabling the network for one hundred thousand GPUs (graphics processing units) is very, very difficult. We ran the networking team in four shifts, 24/7, to do the cabling. I was sleeping in the data center and doing cabling myself.345
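One way to see why the cabling is so hard is to count the links. The real Colossus topology is not public, so the sketch below assumes a non-blocking three-tier Clos fabric, where every tier boundary carries the full bisection and therefore needs roughly one link per endpoint.

```python
# Cable-count arithmetic for an assumed non-blocking three-tier Clos fabric
# (assumption, not the actual Colossus topology).
NUM_GPUS = 100_000
SWITCH_PORTS = 64   # assumed switch radix
BOUNDARIES = 3      # GPU->leaf, leaf->spine, spine->core

cables = NUM_GPUS * BOUNDARIES
leaf_switches = -(-NUM_GPUS // (SWITCH_PORTS // 2))  # half the ports face GPUs
print(f"~{cables:,} cables, across ~{leaf_switches:,} leaf switches alone")
# prints: ~300,000 cables, across ~3,125 leaf switches alone
```

Hundreds of thousands of cables, each of which has to be routed, seated, and verified by hand, is what made round-the-clock shifts necessary.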
Now called Colossus, it is currently the largest AI training platform in the world. It was constructed in 122 days. Ninety-two days later, it had been doubled to two hundred thousand GPUs.346