Disclaimer: I run the infrastructure/ops team at the Internet Archive.
Unfortunately none of those numbers are really even close to correct (the discussion is always fun, but the folks in r/datahoarder are often not correctly informed. textfiles has more patience for it than I do). It would probably cost around $1.5M in drives, even at reasonable current enterprise volume pricing, to back up the 60+ PB of unique data in the Internet Archive (plus, as someone does note in that thread, the cost of running them -- even if it were a static backup to cold disks, you still need chassis to run them in for the backup process, space and infra for them, electricity, people, &c). I don't know offhand how much space the contents of the Wayback currently take up, but it's definitely an order of magnitude more than that number as well.
I'd love to know what the internal atmosphere was like when the IA did this. Who should I hold a grudge against for pulling possibly the stupidest move in the history of digital intellectual property?
This is the kind of radical demonstration I expect out of some fly-by-night startup, not a twenty-four year old nonprofit with less annual revenue than the lawyers who're suing them.
Does that 1.5M figure include redundancy (e.g. RAID)? What about if the data is compressed (either by trivially gzipping each "file" the archive has, or by using a compression window that spans multiple files)? I imagine the HTML/plaintext content of the Internet Archive would compress very well.
No, that's for a single raw copy (it could go lower based on implementation -- the sweet spot on $/bit pricing is around 8TB/disk right now, but that would actually be more expensive for us in total cost because of the increased infra/space/power necessary to run them). We could probably get a relatively trivial 20-30% savings on space for Wayback contents via compression, maybe more with work (various projects are underway to do this), but much of the rest of the contents are difficult to compress, or already compressed (music, imagery for books, video, software archives, etc). We have also historically been very reluctant to deduplicate heavily, though we are experimenting with it for certain types of content -- one principle of operation is that as an archive of last resort, we're unable to have a true deaccession plan as some other archives have. A compromise we make is that our hard drives are "landfill-ready" -- that is, the contents of a drive (assuming you can read the filesystem) are inherently meaningful, content is housed with its metadata, and so forth. This produces some unusual restrictions on how far we can take compression and certain types of bitwise redundancy.
By unique data, is that excluding generated data? Can you please estimate the space for just the Wayback Machine? It's the actual target.
What's the wayback machine with and without images? Is it possible that we could distribute the ASCII/Unicode content now?
It's the actual target.
It may be your target, but I would not be so dismissive of the other data in the Archive. The Software Library, the tens of millions of scanned books, the music, etc etc. On top of that, the raw scrape data driving the Wayback Machine is not currently made available to download from the archive. It's stored in WARC files, which would include both the images and text of all scrapes and would not be trivial to disentangle.
I'm not dismissing any other section of the archive. The Wayback Machine has the big red target on it. It's a snapshot of recent history; I have lost track of how many times I have needed it for things that would otherwise be memoryholed.
Please consider making a subscription service for the WARC files: let us pay to get access to a query interface. archive.org could raise significant defense funds.
Doesn't Jason Scott work for the Archive? Couldn't he just ask you for better numbers or something? :-)
Oh, Jason's numbers are fine. He's not the one asserting that you can buy storage for <1¢/GB (and he correctly notes that the Wayback takes up >20PB).
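For anyone who wants to check the <1¢/GB claim against the figures quoted earlier in the thread, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 PB = 1,000,000 GB), takes "60+ PB" as exactly 60 PB, and covers raw drive cost for a single copy only; the function and variable names are made up for illustration.

```python
# Rough sanity check, not an official figure: what $/GB does the
# ~$1.5M-for-60-PB drive estimate imply?
GB_PER_PB = 1_000_000  # decimal units (assumption)


def implied_cents_per_gb(total_dollars, petabytes):
    """Return the drive cost in cents per GB for a single raw copy."""
    return total_dollars / (petabytes * GB_PER_PB) * 100


drive_cost_dollars = 1_500_000  # ~$1.5M, from the estimate upthread
unique_data_pb = 60             # "60+ PB", treated as exactly 60 here

cents = implied_cents_per_gb(drive_cost_dollars, unique_data_pb)
print(f"Implied drive cost: {cents:.1f} cents/GB")
```

Under those assumptions the estimate works out to roughly 2.5¢/GB for drives alone, so it sits well above the sub-1¢/GB figure being thrown around.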
What would a representative file size histogram look like?
Here's some very detailed data from 2016 https://archive.org/details/ia_census_201604 and Jason has asked for a new census to be made now. https://www.reddit.com/r/DataHoarder/comments/h02jl4/lets_sa...
Google offers archival storage at $14.40/TB/year, so about $1M/year plus access costs.
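The arithmetic behind that estimate can be sketched as follows. This assumes 60 PB of data (the "60+ PB" figure from upthread, rounded down), decimal units (1 PB = 1000 TB), and a single stored copy; it deliberately leaves out access, retrieval, and egress charges, which are not quantified in this thread. The names are illustrative only.

```python
# Back-of-the-envelope annual cost for cold cloud storage.
# Storage fee only; access/egress costs are NOT included.
TB_PER_PB = 1000  # decimal units (assumption)


def annual_storage_cost(petabytes, dollars_per_tb_year):
    """Return the yearly storage fee in dollars for the given capacity."""
    return petabytes * TB_PER_PB * dollars_per_tb_year


archive_pb = 60            # "60+ PB" from upthread, treated as 60
price_per_tb_year = 14.40  # the quoted archival rate

cost = annual_storage_cost(archive_pb, price_per_tb_year)
print(f"Storage only: ${cost:,.0f} per year")
```

That comes to about $864k per year before access costs, which is where the "about $1M" rounding comes from. Unlike the one-time drive estimate, this is a recurring charge.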
Hi Jonah!
Can you ping me by email?