whoopdedo 3 years ago

This reminds me of the early binary DOC file format that was essentially a dump of Word's working memory. You could find all sorts of leaked data in the slack space between text chunks. Sometimes from other processes since `malloc` doesn't zero memory. I think there were a few instances of people being caught doing bad things because of identifying information in a Word DOC.

But why is this even happening? The standard procedure for overwriting a file is to save to a temporary file first, so that an error during the write can't damage the existing file. Are they really doing an in-place write, and if so, how does that affect the shadow copy if you want to restore the previous version of the file?
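For reference, the temp-file-then-rename dance looks roughly like this (a Python sketch of the general technique, not the code of any particular app; `os.replace` is the atomic-overwrite primitive on both POSIX and Windows):

```python
import os
import tempfile

def atomic_save(path, data: bytes):
    """Write data to a temp file in the same directory, then atomically
    rename it over the target, so a crash mid-write can't corrupt path."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)  # same filesystem as the target
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk first
        os.replace(tmp, path)  # atomic: readers see old or new, never a mix
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on any failure
        raise
```

The key details are writing the temp file in the same directory (a rename across filesystems isn't atomic) and fsync-ing before the rename (otherwise a crash can leave you with a renamed but empty file).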

int_19h 3 years ago

The old .doc format was never "a dump of Word's working memory" in the sense of a raw copy of bytes. It's rather Word's internal object graph serialized into COM Structured Storage (https://en.wikipedia.org/wiki/COM_Structured_Storage), which is basically a FAT-like filesystem inside a single file. This is convenient for the app because it gets an FS-like API and can serialize data into as many "files" as is convenient, and it can update data inside dynamically without rewriting the entire container file every time (which, back when this was all designed in the late '80s and early '90s, would have been slow).

Thus the reason you could end up with old bits of a Word document sticking around inside the .doc is the same as why your filesystem has bits of deleted files: the space has been marked as free, and nothing else has overwritten it yet.
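A toy illustration of that free-list behaviour (purely illustrative Python, not the real structured-storage layout): "deleting" a stream just returns its sectors to the free list, so the old bytes linger until a later allocation happens to reuse them.

```python
class ToyContainer:
    """A crude stand-in for a FAT-like container: fixed-size sectors,
    a free list, and no zeroing on delete (just like a real filesystem)."""
    SECTOR = 16

    def __init__(self, sectors=8):
        self.data = bytearray(self.SECTOR * sectors)
        self.free = list(range(sectors))
        self.streams = {}  # stream name -> list of sector numbers

    def write(self, name, payload: bytes):
        n = -(-len(payload) // self.SECTOR)  # sectors needed, rounded up
        sectors = [self.free.pop(0) for _ in range(n)]
        for i, s in enumerate(sectors):
            chunk = payload[i * self.SECTOR:(i + 1) * self.SECTOR]
            self.data[s * self.SECTOR:s * self.SECTOR + len(chunk)] = chunk
        self.streams[name] = sectors

    def delete(self, name):
        # Sectors go back on the free list, but their bytes are NOT erased.
        self.free.extend(self.streams.pop(name))

c = ToyContainer()
c.write("WordDocument", b"TOP SECRET draft paragraph here!")
c.delete("WordDocument")
# The "deleted" text still sits in the container's slack space:
assert b"TOP SECRET" in bytes(c.data)
```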

But none of this applies to images, so the explanation here ought to be different.

  • nneonneo 3 years ago

    This is also why "Save As..." with old Word versions would often produce files much smaller than "Save" would - it was writing a brand new, compact file which was effectively "garbage collected".

    • kevin_thibedeau 3 years ago

      This stemmed from the fast save feature, which was present from Word 97 through 2003. Earlier versions didn't work that way.

  • lloeki 3 years ago

    > The old .doc format was never "a dump of Word's working memory", implying copy of raw bytes. It's rather Word's internal object graph serialized into COM Structured Storage

    Probably the "dump of Word's working memory" part stems from Word for DOS, which predates COM by about a decade.

  • julian_t 3 years ago

    Wow, COM Structured Storage. That brings back memories, but not necessarily good ones. And as for working with OLE... ouch.

    • int_19h 3 years ago

      It was hard on the developer, but some of the features it enabled were very impressive, like the ability to arbitrarily embed documents into other documents in a way that allows composite rendering as a single piece, without the app managing either part being aware of the nature of the other. In fact, I'm not aware of any modern equivalent of this tech, not even on Windows (since Office stopped supporting embedding of its documents via OLE by third-party apps).

Retr0id 3 years ago

It feels like a lot of "things everyone knows" are slowly getting lost over time, as developers work at higher and higher levels of abstraction without deep knowledge of the layers beneath them (which is of course the whole point of abstractions, but they're never perfect).

Being a true "full stack" engineer is a superpower when it comes to performance optimisation, or vulnerability research.

  • doubled112 3 years ago

    I wanted to be a programmer as a kid, but eventually found it boring and switched to administering systems.

    Knowing how code works on systems IS a super power.

  • saagarjha 3 years ago

    Eh, that’s not really true. Adding abstraction allows for providing APIs that can handle cases like these correctly. For example, Apple provides a very capable versioning system for files that does “the right thing”, which in this case would create a new file for reliability.

    • Retr0id 3 years ago

      Sure, abstractions aren't inherently evil, but bad ones are. The abstraction you described sounds like a sensible one, which couldn't have been designed without a deep understanding of the system as a whole (or at the very least, the adjacent layers).

      • saagarjha 3 years ago

        The people writing an abstraction need to understand the system, but if done correctly the people using it don’t.

        • namaria 3 years ago

          All abstractions leak. There are some physical facts about software we keep denying for some reason. There is no silver bullet. Every enterprise system turns into a big ball of mud over time. Team structures get imprinted in the design of the systems built by those teams.

          And every abstraction leaks. Living at a given level without at least an accurate mental model of everything below it limits your ability as a developer. Sure, you can just do scripting for a web dev team your whole career. If that's what you want...

          • saagarjha 3 years ago

            Water dissolves pretty much everything too, and yet we build structures that are useful even when it rains. Likewise, all software engineers get through their careers without learning most things. It's totally fine, as long as they understand how to poke at a leaky abstraction when necessary. If atomic file renames are a performance problem for them, or are breaking hardlinks, or leak in one of the several other ways they can, then they can go learn what's going on and update their understanding as necessary. Good abstractions aren't watertight but rather don't leak in unexpected and dangerous ways.

            • namaria 3 years ago

              Hey some people spend a working life just painting bridges without understanding anything about structural engineering.

              Your choices matter to you, not to society.

iggldiggl 3 years ago

> The standard procedure to overwrite a file is to save to a temporary file first

… which on the other hand has the annoying side effect of wrecking hard links.
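A quick way to see the wrecking in action (Python sketch; the filenames are made up for illustration): after a rename-style save, the hard link still points at the old inode.

```python
import os
import tempfile

d = tempfile.mkdtemp()
orig = os.path.join(d, "notes.txt")
link = os.path.join(d, "notes-link.txt")

with open(orig, "w") as f:
    f.write("v1")
os.link(orig, link)        # hard link: same inode, two names

# A "safe save": write a temp file, then rename it over the original.
tmp = orig + ".tmp"
with open(tmp, "w") as f:
    f.write("v2")
os.replace(tmp, orig)      # orig now points at a brand-new inode

assert open(orig).read() == "v2"
assert open(link).read() == "v1"   # the hard link kept the old content
assert os.stat(orig).st_ino != os.stat(link).st_ino
```

An in-place write (open the existing file and overwrite it) would have kept both names on the same inode, at the cost of the crash-safety discussed above.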

  • bombolo 3 years ago

    I think that's a positive. You had two identical files, but one has now changed, and they automatically got decoupled!

    Unless of course you wanted them to remain the same…

    • eviks 3 years ago

      Yes I did, that's why I created them! For decoupled files you'd use copy

      • Dylan16807 3 years ago

        But I don't want to waste disk space by storing the same data twice. And avoiding that with copies (via copy-on-write reflinks) only works on a handful of filesystems, most notably Btrfs and XFS.

    • geocar 3 years ago

      If you want them to remain the same, use symbolic links.

      • iggldiggl 3 years ago

        Symbolic links don't give me an easily obtainable list of all "copies" of that file, and while they might survive atomic writes, they're also vulnerable to the "main" file being renamed/moved/etc.

        (Of course the thing that'd solve my actual root problem would be proper OS and file system level support for tagging, but until then it seems that there are only imperfect solutions, each with its own set of drawbacks.

        I.e. third party software is not well integrated with the file explorer and the file open/save dialogues etc., and now I'm dependent on that software lest I lose all my carefully tagged data, whereas hard or symbolic link-based solutions are clunky to use and vulnerable to either atomic saves or file renaming etc.)

thrashh 3 years ago

It sounds like someone wrote `fopen(name, "r+")` instead of `fopen(name, "w")`

I don't think writing to a separate file is standard procedure at all. Anecdotally, way more apps don't do that than do.
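That mode mix-up would indeed explain it: "r+" repositions the file pointer but never truncates, so writing a shorter payload leaves the old tail in place. A Python sketch (same open-mode semantics as C's fopen):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "img.dat")

with open(path, "wb") as f:      # like fopen(path, "w"): truncates on open
    f.write(b"OLD-OLD-OLD-OLD-OLD-OLD")

with open(path, "r+b") as f:     # like fopen(path, "r+"): no truncation
    f.write(b"NEW")              # overwrite from the start...

data = open(path, "rb").read()
assert data == b"NEW-OLD-OLD-OLD-OLD-OLD"  # ...but the old tail survives

# The fix: either open with mode "w"/"wb", or call f.truncate() after writing.
```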

  • alkonaut 3 years ago

    I guess it depends on how long users are expected to work on the documents/files, and whether there is some other safety mechanism. Basically: a power outage, full disk, etc. corrupting the file at save costs X hours of lost work, and for some X you'd better make sure the app can revert to a known state via a transactional save and/or an autosave feature.

    Personally I'd probably always save to a new file, even if the amount of work potentially lost is negligible (as in a snipping tool). The development cost is extremely small, so if it ever rescues even one customer's file, it has probably saved more time than it cost to implement transactional file writing. It's a few extra lines, if you opt out of the more complex scenarios (e.g. you can't rely on an atomic rename when the target is a network share).

david2ndaccount 3 years ago

rename is not atomic on Win32. That's a POSIX-ism.

  • chenxiaolong 3 years ago

    Windows 10 1607 and newer support it now if you call SetFileInformationByHandle() with FileRenameInfoEx and specify the FILE_RENAME_FLAG_POSIX_SEMANTICS flag. Not sure how commonly it's used in standard libraries, but I've seen it popping up more and more.