jonah-archive 6 years ago

Disclaimer: I run the infrastructure/ops team at the Internet Archive.

Unfortunately none of those numbers are really even close to correct (the discussion is always fun, but the folks in r/datahoarder are often not correctly informed. textfiles has more patience for it than I do). It would probably cost around 1.5M in drives, even at reasonable current enterprise volume pricing, to back up the 60+ PB of unique data in the Internet Archive (plus, as someone does note in that thread, the cost of running them -- even if it were a static backup to cold disks, you still need chassis to run them in for the backup process, space and infra for them, electricity, people, &c). I don't know offhand how much space the contents of the Wayback currently take up, but it's definitely an order of magnitude more than that number as well.
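For a rough sense of what that estimate implies per terabyte, here is a minimal back-of-the-envelope sketch. It uses only the two figures above (about 1.5M in drives, 60 PB of unique data, taking the "60+" at its floor) and assumes decimal units (1 PB = 1000 TB) and that the 1.5M is in US dollars; the function name is made up for illustration.

```python
# Figures taken from the comment above; the unit convention and
# the currency are assumptions, not something the comment states.
DRIVE_BUDGET_USD = 1_500_000   # "around 1.5M in drives"
UNIQUE_DATA_PB = 60            # "60+ PB of unique data" (lower bound)
TB_PER_PB = 1000               # assumption: decimal petabytes


def implied_usd_per_tb(budget_usd: float, data_pb: float) -> float:
    """Raw drive cost per TB implied by a total budget and a data size."""
    return budget_usd / (data_pb * TB_PER_PB)


print(implied_usd_per_tb(DRIVE_BUDGET_USD, UNIQUE_DATA_PB))  # 25.0
```

That works out to roughly $25 per raw TB for a single copy, before chassis, space, power, or people.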

Causality1 6 years ago

I'd love to know what the internal atmosphere was like when the IA did this. Who should I hold a grudge against for pulling possibly the stupidest move in the history of digital intellectual property?

This is the kind of radical demonstration I expect out of some fly-by-night startup, not a twenty-four-year-old nonprofit with less annual revenue than the lawyers who're suing them.

DaiPlusPlus 6 years ago

Does that 1.5m figure include redundancy (e.g. RAID)? What about if the data is compressed (either by trivially gzipping each "file" the archive has, or by using a compression window that spans multiple files)? I imagine the HTML/plaintext content of the Internet Archive would compress very well.

  • jonah-archive 6 years ago

    No, that's for a single raw copy (it could go lower based on implementation -- the sweet spot on $/bit pricing is around 8TB/disk right now, but that would actually be more expensive for us in total cost because of the increased infra/space/power necessary to run them). We could probably get a relatively trivial 20-30% savings on space for Wayback contents via compression, maybe more with work (various projects are underway to do this), but much of the rest of the contents are difficult to compress, or already compressed (music, imagery for books, video, software archives, etc). We have also historically been very reluctant to deduplicate heavily, though we are experimenting with it for certain types of content -- one principle of operation is that as an archive of last resort, we're unable to have a true deaccession plan as some other archives have. A compromise we make is that our hard drives are "landfill-ready" -- that is, the contents of a drive (assuming you can read the filesystem) are inherently meaningful, content is housed with its metadata, and so forth. This produces some unusual restrictions on how far we can take compression and certain types of bitwise redundancy.

    • jakeogh 6 years ago

      By unique data, do you mean excluding generated data? Can you please estimate the space for just the Wayback Machine? It's the actual target.

      How big is the Wayback Machine with and without images? Is it possible that we could distribute the ASCII/Unicode content now?

      • sp332 6 years ago

        > It's the actual target.

        It may be your target, but I would not be so dismissive of the other data in the Archive. The Software Library, the tens of millions of scanned books, the music, etc etc. On top of that, the raw scrape data driving the Wayback Machine is not currently made available to download from the archive. It's stored in WARC files, which would include both the images and text of all scrapes and would not be trivial to disentangle.

        • jakeogh 6 years ago

          I'm not dismissing any other section of the archive. The Wayback Machine has the big red target on it. It's a snapshot of recent history; I have lost track of how many times I have needed it for things that would otherwise be memoryholed.

          Please consider making a subscription service for the WARC files; let us pay to get access to a query interface. archive.org could raise significant defense funds.

schoen 6 years ago

Doesn't Jason Scott work for the Archive? Couldn't he just ask you for better numbers or something? :-)

andromeduck 6 years ago

Google offers archival storage at $14.4/TB/a, so about $1mm/a + access costs.
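A quick sanity check of that figure, as a sketch: it assumes the 60 PB of unique data mentioned upthread, decimal units (1 PB = 1000 TB), and storage charges only (no access, egress, or operation fees). The function name is hypothetical.

```python
def annual_storage_cost_usd(data_pb: float, usd_per_tb_year: float,
                            tb_per_pb: int = 1000) -> float:
    """Yearly storage-only cost for data_pb petabytes at a flat $/TB/year rate."""
    return data_pb * tb_per_pb * usd_per_tb_year


# 60 PB (assumed from upthread) at $14.4/TB/a
print(round(annual_storage_cost_usd(60, 14.4)))  # 864000
```

So a bit under $0.9mm/a at exactly 60 PB, which rounds to the ~$1mm/a above once the "+" in "60+ PB" is taken into account.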

bits_n_bytes 6 years ago

Hi Jonah!

Can you ping me by email?