πfs – A data-free filesystem

285 points by rnhmjoj 9 years ago

jboggan 9 years ago

I've been working on an alternative implementation of this but using tau instead, I'm not sure if my calculations are correct but I think it offers a 2x speedup over this method.

It's still in stealth mode but I'm hoping to unveil it sometime in late June.

abulman 9 years ago

Since the code already exists, within pi - or tau - surely all you have to do is to find the working code with it.
I'm sure that you can do that before noon of the first of next month.
- modalduality 9 years ago
  
  It was a joke about "Tau Day," on June 28th.
  
  hughes 9 years ago
  
  That was a joke about "April Fools", on April 1st.
  
  Nydhal 9 years ago
  
  I thought it was a dilbert style joke about the unreal expectations of people who are not software engineers.
  
  Nydhal 9 years ago
  
  I thought it was a joke about unreal expectations.
- jboggan 9 years ago
  
  Actually I was going to unveil another project that day. I've been excitedly working on this idea since I had the epiphany last night. So, I was checking our logs over the weekend and noticed that no jobs or processes ran between 2 and 3 AM early Sunday morning. I thought it was a fluke of our system, but I started poking around in unrelated logs and I found the same thing. I even looked at a couple of public systems and saw the same thing. Now I don't know why this is but apparently NO Unix systems (and I think maybe Windows too) ran any processes during that window. Is it some bug in the OS? Maybe, but I can exploit it.
  So I've designed a once-a-year data processing system. It's going to queue up all your requests throughout the year and save them for this "quiet window" and then distribute all the jobs across all network connected computers running this new code. I think I'll make a pull request to Linux for maximum coverage. The best thing is, there is absolutely no performance hit or impact on everyone's servers because they're ALREADY not being used.
  
  hinkley 9 years ago
  
  I heard that all the servers in Arizona still processed data during that time. Sounds like a conspiracy to me.
  
  hueving 9 years ago
  
  Arizona is able to compute during that time period due to the residual heat in the desert providing energy.
- amptorn 9 years ago
  
  Oh no, it turns out that pi also contains the worst, most draconian software licenses!
- gumby 9 years ago
  
  I appreciate your suggestion and will stop writing code immediately.
  Should work great for love letters too as long as I write them in UTF-8.

Retr0spectrum 9 years ago

This concept reminds me of https://libraryofbabel.info/

"At present it contains all possible pages of 3200 characters, about 104677 books." (https://libraryofbabel.info/About.html)

TremendousJudge 9 years ago

https://libraryofbabel.info/bookmark.cgi?bjrvq,mgobexe314
- eof 9 years ago
  
  Page 314.
  What? How? If this is something other than being able to write to libraryofbabel.info I am extremely confused.
  
  ikeboy 9 years ago
  
  https://libraryofbabel.info/bookmark.cgi?qsocvxgaxgav.vlg378
  
  r4pha 9 years ago
  
  I think it is page 314 of _a_ book.
  
  stephancoral 9 years ago
  
  https://libraryofbabel.info/bookmark.cgi?.lyjxgasnwrheuh,,jl...
  Your comment was contained in this text before you even wrote it
  
  dsp1234 9 years ago
  
  As others have mentioned, since that statement is less than 3200 characters (and only uses the alphabet supported by the library), it had to be present in some book.
  The trick to seeing why it's on page 314 is to note that the spacing on the sentence is really odd. Since spaces are significant to the algorithm, the same words with different whitespace will end up on different pages of different books.
  From the information on https://libraryofbabel.info/referencehex.html, we can see that each book has exactly 410 pages. Assuming that for any phrase X, the page that that phrase is on is independent and equally likely, then there is a 1/410 chance that the phrased being searched will be on page 314. So just write a script that hits libraryofbabel.info automatically with randomized spacing. You'll likely end up on a page that is 314 in about 410 or so tries.
jlebar 9 years ago

* "At present it contains all possible pages of 3200 characters, about 104677 books." *
Copy-paste error caused a dropped exponent. It's 10^4677 books.
- teddyh 9 years ago
  
  10⁴⁶⁷⁷
- Retr0spectrum 9 years ago
  
  Oops, it's too late to edit now.
david-given 9 years ago

Victor Borges' original story; if you haven't, read it, it's great (and very short): https://libraryofbabel.info/libraryofbabel.html
The library, by the way, is quite big. You couldn't fit it all in our universe --- in fact, you'd need 10^4594.
- azar1 9 years ago
  
  erm... that's Jorge Luis Borges
  
  david-given 9 years ago
  
  D'oh. Yes.
  My brain was apparently getting him confused with Victor Borge, who is also great, but differently. Stupid brain.

cyanbane 9 years ago

I think it is fair to say my gene sequence is somewhere in there. It is also fair to say that my gene sequence is somewhere with alterations where my stomach processes Chipotle burritos with less.. fanfare. So it sounds like I need to come up with some kinda of a CAS9-ish process of finding this better burrito processing version of my gene sequence inside pi. The better version of my 'life of pi' for burritos.

torvald 9 years ago

Of the more obscure file systems, pingfs is still my favorite. https://github.com/yarrick/pingfs

pimlottc 9 years ago

> pingfs is a filesystem where the data is stored only in the Internet itself, > as ICMP Echo packets (pings) travelling from you to remote servers and >back again.
So... if your Internet connection goes down, you lose all your data?
- jonknee 9 years ago
  
  > So... if your Internet connection goes down, you lose all your data?
  So it is like every cloud service...
zkms 9 years ago

this is what happens when you take the term "bandwidth-delay product" a little too literally.

ape4 9 years ago

I wonder if it might be easier to start with a immutable tree of typical office documents. Then filesystem just stores the diff. Eg you save a memo document and its pretty much the same as one of the standard memos. Think of the savings.

d4nt 9 years ago

Supposing I found the index location of the next Star Wars movie in pi, would it be copyright infringement for me to mention that on a forum?

nannal 9 years ago

We all have the same copy, so you can just tell me where it is in my hard drive, how is that theft?
BinaryIdiot 9 years ago

I'm not sure it could be but I'm not expecting this to happen during any of our lifetimes :)
- Terribledactyl 9 years ago
  
  Let's redefine copyright timelines in terms of computational effort to find its location in pi.
eliaskg 9 years ago

Relevant Numberphile: https://youtu.be/wo19Y4tw0l8
dandelion_lover 9 years ago

Probably yes
https://en.wikipedia.org/wiki/Illegal_number
paxcoder 9 years ago

I don't think any forum would allow you to post the location index simply because it would take too much space to store it.

univacky 9 years ago

Do we know that every possible finite sequence exists in pi?

lern_too_spel 9 years ago

We do not.
falcolas 9 years ago

IIRC, since there are an infinite number of digits in Pi, and there are no infinitely repeating finite sequences in Pi, there are therefore an infinite number of finite sequences in Pi.
Granted, the index might be larger to represent than the actual sequence you're indexing, but...
- ikeboy 9 years ago
  
  That doesn't imply that every finite sequence is in pi.
  
  singlow 9 years ago
  
  I agree. There are an infinite number of finite sequences that are not my social security number and do not contain my social security number. An infinite list is still infinite though it lack any specific finite sequence.
- skykooler 9 years ago
  
  "An infinite number of finite sequences" does not mean "all finite sequences". For example, the number 1.10100100010000100001... contains an infinite number of finite sequences, but does not contain, for example, any sequences with a 2, or even the sequence '10101'.
  
  aesthetics1 9 years ago
  
  You cannot bend the rules in this way and still equate it to the question at hand - you're introducing characters outside of base2 in your example. If you stuck within the realm of base2, '2' in its binary form certainly appears in the pattern.
  Not only that, but Pi is normal (all digits distributed evenly), and your sequence clearly is not.
  I understand what you are saying - indeed there are infinite finite sequences - but it just does not apply here.
  
  JamilD 9 years ago
  
  You can trivially modify their example to be 1.02003000400005... cycling through the digits and the argument still holds.
  As another commenter pointed out, pi is not proven to be normal, but is suspected to be.
  
  SomeStupidPoint 9 years ago
  
  They already showed 21 (in binary, 10101) doesn't appear anywhere, even as a binary pattern.
  
  evanb 9 years ago
  
  The example isn't base two. It's base 10. It just happens to be a number that only has 0s and 1s. Like eleven. Or 1000.
  
  TOMDM 9 years ago
  
  To phrase it another way, there are infinitely many sizes of Infinity. Just because you have one infinite sequence, does not mean it is the same size as another Infinity and as such may include or exclude elements of the other Infinity.
  
  monochromatic 9 years ago
  
  Pi is not known to be normal.
- Roujo 9 years ago
  
  > there are therefore an infinite number of finite sequences in Pi
  Sure, but this doesn't answer the question: Do we know that every possible finite sequence is present? You can have the former without the latter being true. For example, a sequence of the form "1011011101111011111..."[0], while infinite, never contains the finite sequence "1234".
  [0] https://oeis.org/A094946
- ronreiter 9 years ago
  
  Not sure that counts as proof.
wruza 9 years ago

https://en.wikipedia.org/wiki/Normal_number
Pi is not proven to be normal, but is suspected.
- inlineint 9 years ago
  
  Note that normality of a number in a given base is a sufficient condition for existence of all finite sequences in its expansion in this base, but not necessary.
  It might turn out that Pi is not normal say in base 2, but all possible finite sequences of 0s and 1s occur in it, just with some of them being slightly more frequent than others of the same length.
  
  wruza 9 years ago
  
  From that article, it seems that correct term is "disjunctive sequence".

drivingmenuts 9 years ago

Someone needs to alert the White House about this. Terrorists implementing piFS will have access to all the secrets of the US.

The movie industry should just shut down now. Any two people implementing piFS will have already shared all possible movie files, including future, unreleased films.

Your medical records are now available to anyone with time and patience.

The horror!

(Seriously, we should see if we can wind up Spicer about that first one and see if he explodes).

jlas 9 years ago

> Now, we all know that it can take a while to find a long sequence of digits in π, so for practical reasons, we should break the files up into smaller chunks that can be more readily found. In this implementation, to maximise performance, we consider each individual byte of the file separately, and look it up in π.

So basically like a regular filesystem except now lookup each byte every time you intend to use it?

notatoad 9 years ago

I think this project might not be entirely serious.
- thenewwazoo 9 years ago
  
  A joke, on the Internet? Surely not.
jerf 9 years ago
This is a reference to a common situation that arises anywhere in the internet that compression is discussed. There is a segment of people who seem to find it personally, mathematically, or even perhaps morally offensive that there can exist no algorithm that can take all possible inputs and be guaranteed to emit something compressed by at least one bit. This seems to be true despite the fact the proof of this statement is very, very simple; it can be sketched on a literal post-it note and you could easily convince an open-minded 10-year-old of its truth.
In those situations, you can bet money that some compression scheme will be proposed that fits into at least one of the two categories "did not check to see if it actually compresses data" or "works by hiding bits where you forgot to count, but they still count".
"I'll just give an offset into pi" is in the first category; it does, technically, work, in the sense that it can validly represent the original data, but it turns out that in order to represent the offsets into pi it takes more bits than the content had in the first place. (You can get lucky every once in a while, and store 14159265 as "0/8", but the vast, vast majority of sequences get expanded.) Everyone proposes this one. It's almost funny how often it gets proposed, given that it is actually a very slow method for making your data bigger. Don't hold your breath for your Fields Medal for that one.[1]
An example of the second case is "I'll just store the first few bytes in the filename of the file and then 'forget' that I have to count the filenames in the size of the compressed content". Vigorous, month-long debates about whether or not something "counts" as part of the content will follow, which can be resolved by pointing out that the challenge is basically to give a stream of bytes over the network to a computer that only has the algorithm on it that fully reconstructs all the content, but this rarely satisfies Our Hero who has discovered perfect compression.
I add the word "moral" not just as a jab, but as the only way I have to explain how these people react to things like pointing out "if that worked, I could compress all files to one bit/one byte" or "I can't actually reconstruct the original content with that information". They get offended.
[1]: Just for fun, let's implement the ideal "index into a sequence" compression algorithm. This is not written to proof standards, but "give the reader the flavor of what is going on" standards. (Technically it assumes the proof, in an attempt to illuminate it from another direction.) The problem with pi is that it tends to repeat a lot. Let's chunk up 8 bits into a byte, since that's pretty common, and let's build the idea sequence of bytes that we can index into easily. For completeness, let's just index the 8-bit bytes in a one sequence:
```
     00000000000000010000001000000011
```
and so on, 0, 1, 2, etc, up to 255. The most efficient way to index this list is to break it up along the 8-bit boundaries, because you'll find (if you try it) that any more clever attempts to exploit overlapping numbers will end up at best matching that performance and at worse taking more bits. Therefore, we can index into that list to get a 0 as... 0. And the index for 1 is... uhhh... 1. And 143 is... 143. And so on.
You'll find this approach, extended into as many consecutive bits in a row as you please, will wildly outperform pi as a sequence to index into to get values. Because this is, of course, just the identity transform, and pi in practice does way worse than that. This transform is also wildly more efficient than the pi indexing operation, having computational complexity O(zero), if you'll pardon me the abuse of notation.
- LoSboccacc 9 years ago
  
  wait! you're unto something there. every bit sequence is also a number in binary base right? so one can just calculate the equivalent number for any stream of bits and just transmit the number! /s
- TeMPOraL 9 years ago
  
  > the proof of this statement is very, very simple; it can be sketched on a literal post-it note and you could easily convince an open-minded 10-year-old of its truth
  Could you write that proof here, for the benefit of tired HNers who are just finishing their day of work?
  
  rmidthun 9 years ago
  
  Pigeonhole Principle. The number of messages of bit length N is 2^N. You are trying to map these into 2^M compressed strings, M < N. By the Pigeonhole (or Dirichlet's Box) Principle, at least one compressed string will have more than one message mapped to it. So it cannot be reversed.
  
  jerf 9 years ago
  
  Yup, that's the one.
  The "literally sketchable on a post-it note" is that you can show this for 1 bit, 2 bits, and 3 bits pretty easily, and 4 bits if you sketch small enough. At that point it's not hard to see this goes to infinity, maybe not rigorously enough to pass for a proof, but pretty strong. (It's a fairly small step to an inductive proof from there, though this verges on a trivial proof by the time you've laid all the bits out.)
  
  saghm 9 years ago
  
  I'm not sure this rigorous (or if this is the proof they were referring to), but they mentioned t a pretty compelling argument in passing; if you could compress all inputs by at least one bit, then you could use that algorithm again on all the outputs and reduce them by at least one bit, until eventually you've reduced every possible input to one bit, which is clearly not possible (at least, not if you want to ever be able to get back the original value). This means that every compression algorithm either a) can't compress some inputs to a smaller size than they already are or b) lose some of the data in the process. (Lossy compressions can still be useful, of course; that being said, I'd imagine assume that even most lossy compression algorithms to be idempotent).
  EDIT: The sibling comment from rmidthun is much more rigorous and concise, so if you're looking for a more proper explanation, ignore this and read that
falcolas 9 years ago

Err, forgive me, but it's effectively rotπ encoding?

pimlottc 9 years ago

This brings to mind a practical question - what's the largest valid file that's been located in the digits of pi? Has anyone yet found a valid jpg, say? Or gif? PDF? I suppose a text file would be easiest.

dannypgh 9 years ago

"Text files" are perhaps underspecified (what character set?) but a good fraction of bytes would, by themselves, count as valid text files.
- pimlottc 9 years ago
  
  We can be generous and say any common character set - ASCII, Latin-1, UTF-8, UTF-16, etc. It'd be interesting to see the longest intelligible phrase has found so far.

stdgy 9 years ago

Haha, I love it.

To anyone confused: Look today's date!

_joel 9 years ago

It's 14.3 over here in the UK, so I guess we need to increase the storage capacity 4.55 times.
Twirrim 9 years ago

They were talking about Pi day on the radio. "Today is the day the whole world celebrates Pi day". No, it really doesn't. It's just the US, the only country in the world that uses the MM/DD/YYYY order.
- LeoPanthera 9 years ago
  
  Other countries can celebrate Pi Approximation Day, on 22/7
- random_rr 9 years ago
  
  Well, since you're being pedantic... this _is_ the only day the whole world celebrates Pi Day, because it only occurs in the US date format. So... their statement is correct.
  
  iandioch 9 years ago
  
  We can celebrate pi day too, just to less accuracy. The third of January?
  
  random_rr 9 years ago
  
  :)
  
  oneeyedpigeon 9 years ago
  
  Or, you could celebrate pi time every single day, to as great a level of accuracy as you choose :)
- pepve 9 years ago
  
  Well, I'm in the EU, I celebrate π day. I'm not going to let the questionable origins of a perfectly good day ruin it for me. I even installed a new keyboard on my phone to be able to type π.
  I celebrate christmas too.
- scott_karana 9 years ago
  
  Many countries use YYYY-MM-DD, where today is 2017-03-14...

omginternets 9 years ago

The indexing kinda sucks though. I can never seem to find what I'm looking for.

narrator 9 years ago

If you could convert all algorithms to constant time operation you could simulate the universe at any point in the future or the past and just read the bits out of memory from simulated future RAM as long as any correct algorithm was used to read and write bits to disk.

kukx 9 years ago

I wonder, assuming the computation was instantaneous, how effective it would be for compression.

aratno 9 years ago

You mean, if there was an oracle for instant pi-index lookup? All finite-length strings can be expressed as natural numbers, so the pi-index is just a mapping from that number to another, pi: N -> N, with no guarantee of any more efficiency. Over the set of all finite-length strings, I do not think this would provide an advantage. But it might be fun.
- ehsankia 9 years ago
  
  What if we split of the data into chunks? Let's say I have enough digits of PI stored that guarantees any chunk of x characters is found in there. Now, I encrypt my file that is k * x in size with k indices into pi. May possible improve if we use variable size chunks too. Would this not provide pretty good compression ratios?
  Compression may be slow, though it'd be sped up if you can the precomputed digits of pi (and maybe even some prebuilt prefix datastructure of some sort?).
  Decrypting, you can use the BBP formula so you wouldn't need all the digits of pi (though having the database would speed it up quite a bit).
  Anyone have any sort of rough estimate on how this would perform in speed and compression ratio?
greggyb 9 years ago

If computation is instantaneous, then you could recurse down to a single pointer, which points to a pair of pointers, which each point to another pair, and so on until you have an arbitrary amount of your data. Since computation is instantaneous, it is instantaneous to compute this first pointer for the contents of any hard disk. And since computation is instantaneous, rebuilding the files on the disk from that pointer is instantaneous.
Your compression ratio would approach infinity.
- greggyb 9 years ago
  
  Nevermind, I neglected to take into account the length of that pointer:
  https://news.ycombinator.com/item?id=13870098
rnhmjoj 9 years ago

I think the offset is bigger than the actual information you want to store in most of the cases.
mikeash 9 years ago

The pigeonhole principle means that compression can never be effective on average. For every input that can be compressed, there must be others that are expanded, such that the average compression ratio is at best 1.
Useful compression algorithms take advantage of the fact that inputs aren't evenly distributed. Real-world data tends to have repetition, self-similarity, or predictability, which real-world algorithms use to produce smaller output.
It's unlikely that πfs would provide useful compression for any real data, since it would have to line up entirely by random chance.

gesman 9 years ago

The problem with that is that the length of offset pointing to your data is going to be geometrically(or worse?) bigger than the size of your data.

Thus - storing and maintaining the offset in PI becomes more expensive than storing data elsewhere.

remx 9 years ago

    As long as you know the index into p of your file and its length,
    its a simple task to extract the file using the Bailey–Borwein–Plouffe
    formula Similarly, you can use the formula to initially find the index of your file

    Now, we all know that it can take a while to find a long sequence of digits in p,
    so for practical reasons, we should break the files up into smaller chunks that
    can be more readily found.

    In this implementation, to maximise performance, we consider each individual
    byte of the file separately, and look it up in p.

I'm not entirely convinced my files would be present in those smaller chunks. You would need a copy of π (p) going way past the 5 trillion digit world record. Even then it's not guaranteed the bytes needed would be there.

daxelrod 9 years ago

Bailey–Borwein–Plouffe lets you calculate the nth binary digit of pi without calculating 0..n-1.
You would not need to store the entire number.
https://en.wikipedia.org/wiki/Bailey%E2%80%93Borwein%E2%80%9...
You

daemonk 9 years ago

I work in bioinformatics. I remember writing a rough implementation of this for the human genome during my phd as a way to compress sequence information. Turns out it doesn't perform better than gzip.

taejo 9 years ago

Did it perform better than cat?

725686 9 years ago

I bet that the index will be larger than the data for most cases... but then you could store the index of the index... of the index, and the number of times this goes on.

funkaster 9 years ago

lol, this reminds me of the time when I took the numerical calculus class at the uni and I came up with the best compression algorithm ever: compress a file by using the bytes as points, so the only thing you need to recreate the file is to calculate the newton poly for that set of numbers. Of course, that also requires the metadata for the newton poly, and since you don't want a lossy algorithm, you need to make sure all points are represented.

psadri 9 years ago

Is this inspired by "A Hard Boiled Wonderland and the End of the World" by Murakami?

A really great literary read btw.

ronreiter 9 years ago

Seriously for a second - can you mathematically prove that a certain sequence does not exist in pi?

saint_fiasco 9 years ago

It's an open question. It is believed, but not proven, that pi is a normal number. Normal numbers have every possible sequence somewhere, in every base.
Someone 9 years ago

I don't think we know whether it is decidable whether you can, but we certainly cannot (?yet?)
As other posts already said, if, as many think, π is normal (https://en.m.wikipedia.org/wiki/Normal_number), every sequence appears in it. If it isn't, that still is open.

meehow 9 years ago

This README file is more hilarious than this source code.

calebm 9 years ago

Hey, if CPU is abundant and memory is limited, why not?

robotresearcher 9 years ago

Because it's likely that the size of the index into pi is larger than the data itself.
- calebm 9 years ago
  
  Good point. Actually, that brings up an interesting question: what is the mathematical relationship between the size of the input and expected offset?
  
  ronreiter 9 years ago
  
  Probably not a good one the more you increase your data size

mdekkers 9 years ago

πfs for Hadoop!