Some of this blog’s kindly readers alerted me to an outage last week. The Internet service provider that hosts both blogs (The SSD Guy and The Memory Guy) plus the Objective Analysis website explained that activity on one of the blogs had suddenly jumped to over 100 times what was budgeted, so all three sites were shut down in response.
Time for a speedy call to my website guru for help. He and I guessed that the culprit was probably someone who was trying to cause some mischief.
After some digging he found that all the excess activity was linked to that one blog’s RSS feed, which was experiencing a phenomenal volume of data downloads. A couple of calls and web searches revealed that this has become a common occurrence recently, as new AI models hustle to absorb as much web content as possible to work their magic on.
My web guy disabled the blogs’ feeds and got everything back online to allow you to read this post and all of my others. A tip of the hat to him.
What Happened? (First Thoughts)
As I thought about this I guessed that these models must keep a local copy of as much of the Datasphere as they can for training, and that involves finding every RSS feed they can and downloading everything they can possibly work with. I mulled over what this means to SSDs, since, after all, that’s what I do. How much storage would be involved in such an undertaking, and would it all need to be on fast SSDs? Wouldn’t 20-terabyte HDDs be an adequately fast and considerably cheaper option?
I decided to make a rough estimate of how much storage it would take to grab a copy of everything on the web. How much data is actually stored on the web in the first place?
Although I expected this number to be readily available through a casual web search, I learned that there are lots of inconsistent numbers online. The ones I found spanned a 10:1 range of 8-75 zettabytes (ZB). Let’s use the smaller number of 8ZB, which would be equivalent to about 400 million 20TB HDDs, or a mere 80 million Nimbus 100TB SSDs (discussed in another post).
Those 20TB HDDs, or their smaller 15-18TB siblings, sell for roughly $100 each. I have no idea what a 100TB SSD goes for, but at 5 cents per gigabyte (which is probably low) they would sell for $5,000 each. So the media alone would cost roughly $40 billion if it were all on HDDs and around $400 billion if it were all on SSDs, and a complete storage system requires a whole lot more than just the storage media.
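For anyone who wants to play with the assumptions, here is a minimal back-of-envelope sketch in Python. Every figure in it is a rough estimate carried over from the discussion above (the 8ZB low-end web size, the $100 HDD, the 5-cents-per-gigabyte SSD), not precise market data.

```python
# Back-of-envelope math for storing one copy of the web.
# Every figure here is a rough assumption from the discussion above,
# not a precise market number.

WEB_SIZE_BYTES = 8e21            # ~8 ZB, the low end of the estimates found online
HDD_CAPACITY_BYTES = 20e12       # one 20 TB HDD
SSD_CAPACITY_BYTES = 100e12      # one 100 TB SSD (like the Nimbus drive)
HDD_PRICE_USD = 100              # rough price for a 20 TB HDD
SSD_PRICE_PER_GB_USD = 0.05      # 5 cents per gigabyte, which is probably low

hdd_count = WEB_SIZE_BYTES / HDD_CAPACITY_BYTES
ssd_count = WEB_SIZE_BYTES / SSD_CAPACITY_BYTES

hdd_media_cost = hdd_count * HDD_PRICE_USD
ssd_media_cost = ssd_count * (SSD_CAPACITY_BYTES / 1e9) * SSD_PRICE_PER_GB_USD

print(f"HDDs needed: {hdd_count:,.0f} -> media cost ~${hdd_media_cost / 1e9:,.0f} billion")
print(f"SSDs needed: {ssd_count:,.0f} -> media cost ~${ssd_media_cost / 1e9:,.0f} billion")
```

With these assumptions the script reports about 400 million HDDs (roughly $40 billion of media) versus about 80 million SSDs (roughly $400 billion), which is where the figures above come from.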
A More Likely Scenario
What’s more likely is that these systems take as large a bite as possible, cogitate on it a bit to distill it down to a manageable chunk, then do the same for another site. Any process that works through a larger data set in serially-processed subsets has to use speed to make up for the fact that it can’t hold everything at once. How do you speed-optimize this process? By using the feed option on the website, and through the liberal use of SSDs for data storage.
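To make that a little more concrete, here is a minimal sketch of what such a crawl-and-distill loop might look like. It is only an illustration under my own assumptions: the feed URLs are hypothetical, and a real AI crawler’s distillation step would be far more elaborate than the simple text extraction shown here.

```python
import urllib.request
from html.parser import HTMLParser

# Hypothetical feed URLs for illustration; a real crawler would discover millions of these.
FEEDS = [
    "https://example.com/blog/feed/",
    "https://example.org/news/rss.xml",
]

class TextExtractor(HTMLParser):
    """Crude 'distillation' step: strip the markup and keep only the text."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def distill(raw_bytes):
    parser = TextExtractor()
    parser.feed(raw_bytes.decode("utf-8", errors="ignore"))
    return " ".join(parser.chunks)

for url in FEEDS:
    try:
        # Take a large bite: download one site's entire feed.
        with urllib.request.urlopen(url, timeout=30) as response:
            raw = response.read()
    except OSError as err:
        print(f"{url}: skipped ({err})")
        continue

    # Cogitate on it a bit: distill the download to a much smaller chunk of text.
    text = distill(raw)
    print(f"{url}: downloaded {len(raw):,} bytes, kept {len(text):,} characters")

    # Then discard the raw data and move on to the next site.
    del raw
```

The point of the sketch is the shape of the loop: grab one site’s feed, boil it down, throw away the bulk, repeat. Run at scale, that pattern rewards fast storage and hits feed URLs very hard.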
To prevent future issues, the feed option has been removed from our three sites. I apologize to any RSS feed users who will no longer be able to follow these sites that way.
Meanwhile, we can all appreciate that the current AI craze is already having an unexpected impact on everyday activities. I expect to see more surprises as the technology matures.