Causality1 6 years ago

I'd love to know what the internal atmosphere was like when the IA did this. Who should I hold a grudge against for pulling possibly the stupidest move in the history of digital intellectual property?

This is the kind of radical demonstration I expect out of some fly-by-night startup, not a twenty-four-year-old nonprofit with less annual revenue than the lawyers who're suing them.

DaiPlusPlus 6 years ago

Does that 1.5m figure include redundancy (e.g. RAID)? What about if the data is compressed (either trivial gzipping each "file" the archive has, or using a compression window that spans multiple files)? I imagine the HTML/plaintext content of the Internet Archive would compress very well.

jonah-archive 6 years ago

No, that's for a single raw copy (it could go lower based on implementation -- the sweet spot on $/bit pricing is around 8TB/disk right now, but that would actually be more expensive for us in total cost because of the increased infra/space/power necessary to run them). We could probably get a relatively trivial 20-30% savings on space for Wayback contents via compression, maybe more with work (various projects are underway to do this), but much of the rest of the contents are difficult to compress, or already compressed (music, imagery for books, video, software archives, etc). We have also historically been very reluctant to deduplicate heavily, though we are experimenting with it for certain types of content -- one principle of operation is that as an archive of last resort, we're unable to have a true deaccession plan as some other archives have. A compromise we make is that our hard drives are "landfill-ready" -- that is, the contents of a drive (assuming you can read the filesystem) are inherently meaningful, content is housed with its metadata, and so forth. This produces some unusual restrictions on how far we can take compression and certain types of bitwise redundancy.
- jakeogh 6 years ago
  
  By unique data, is that excluding generated data? Can you please estimate the space for just the Wayback Machine? It's the actual target.
  What's the size of the Wayback Machine with and without images? Is it possible that we could distribute the ASCII/Unicode content now?
  
  sp332 6 years ago
  
  > It's the actual target.
  It may be your target, but I would not be so dismissive of the other data in the Archive. The Software Library, the tens of millions of scanned books, the music, etc etc. On top of that, the raw scrape data driving the Wayback Machine is not currently made available to download from the archive. It's stored in WARC files, which would include both the images and text of all scrapes and would not be trivial to disentangle.
  
  jakeogh 6 years ago
  
  I'm not dismissing any other section of the archive. The Wayback Machine has the big red target on it. It's a snapshot of recent history; I have lost track of how many times I have needed it for things that would otherwise be memoryholed.
  Please consider making a subscription service for the WARC files; let us pay to get access to a query interface. archive.org could raise significant defense funds.

schoen 6 years ago

Doesn't Jason Scott work for the Archive? Couldn't he just ask you for better numbers or something? :-)

jonah-archive 6 years ago

Oh, Jason's numbers are fine. He's not the one asserting that you can buy storage for <1¢/GB (and he correctly notes that the Wayback takes up >20PB).
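
To put the two figures from this thread side by side, here is a rough sketch in Python. It combines the ">20PB" Wayback size from this comment with the "20-30%" compression estimate given earlier in the thread. Treating 20 PB as the size is an assumption (the comment only gives it as a lower bound), and the function and constant names are made up for illustration.

```python
# Figures quoted in the thread; 20 PB is only a lower bound on the Wayback size.
WAYBACK_PB_LOWER_BOUND = 20
COMPRESSION_SAVINGS_RANGE = (0.20, 0.30)  # "relatively trivial 20-30% savings"


def space_saved_pb(size_pb, savings_fraction):
    """Return the space freed (in PB) if `savings_fraction` of `size_pb` is saved."""
    return size_pb * savings_fraction


if __name__ == "__main__":
    low, high = (
        space_saved_pb(WAYBACK_PB_LOWER_BOUND, fraction)
        for fraction in COMPRESSION_SAVINGS_RANGE
    )
    print(f"Estimated savings: at least {low:.0f} to {high:.0f} PB")
```

Since the real Wayback size is larger than 20 PB, the actual savings under the same percentages would be correspondingly larger.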
- jakeogh 6 years ago
  
  What would a representative file size histogram look like?
  
  sp332 6 years ago
  
  Here's some very detailed data from 2016 https://archive.org/details/ia_census_201604 and Jason has asked for a new census to be made now. https://www.reddit.com/r/DataHoarder/comments/h02jl4/lets_sa...

andromeduck 6 years ago

Google offers archival storage at $14.40/TB/year, so about $1M/year plus access costs.
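
As a rough check on that arithmetic, here is a minimal sketch in Python. The rate of roughly $14.40 per TB per year and the total of roughly $1 million per year come from the comment above. The decimal conversion (1 PB = 1000 TB) and the function names are assumptions for illustration, and access costs are ignored.

```python
RATE_USD_PER_TB_YEAR = 14.4  # archival rate quoted in the comment
TB_PER_PB = 1000             # assumption: decimal units


def annual_cost_usd(size_pb, rate=RATE_USD_PER_TB_YEAR):
    """Yearly storage cost in USD for `size_pb` petabytes, excluding access fees."""
    return size_pb * TB_PER_PB * rate


def implied_size_pb(annual_budget_usd, rate=RATE_USD_PER_TB_YEAR):
    """Petabytes that `annual_budget_usd` per year would pay for at `rate`."""
    return annual_budget_usd / rate / TB_PER_PB


if __name__ == "__main__":
    # Size implied by the "about $1mm/a" estimate.
    print(f"$1M/year covers about {implied_size_pb(1_000_000):.1f} PB")
    # Cost of the >20PB Wayback figure mentioned elsewhere in the thread.
    print(f"20 PB costs about ${annual_cost_usd(20):,.0f} per year")
```

At this rate, $1 million per year corresponds to roughly 69 PB, and the 20 PB lower bound quoted for the Wayback Machine alone would come to about $288,000 per year before access costs.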

bits_n_bytes 6 years ago

Hi Jonah!

Can you ping me by email?