This reminds me of the early binary DOC file format that was essentially a dump of Word's working memory. You could find all sorts of leaked data in the slack space between text chunks. Sometimes from other processes since `malloc` doesn't zero memory. I think there were a few instances of people being caught doing bad things because of identifying information in a Word DOC.
But why is this even happening? The standard procedure to overwrite a file is to save to a temporary file first so an error that occurs during the write won't damage the existing file. Are they really doing an in-place write and how does that affect the shadow copy if you wanted to restore the previous version of the file?
The old .doc format was never "a dump of Word's working memory", implying copy of raw bytes. It's rather Word's internal object graph serialized into COM Structured Storage (https://en.wikipedia.org/wiki/COM_Structured_Storage), which is basically a FAT-like filesystem inside a single file. This is convenient for the app because it gets an FS-like API and can serialize data into as many "files" as is convenient, and dynamically update data inside without overwriting the entire container file every time (which, back when this all was designed in late 80s - early 90s, would be slow).
Thus the reason why you could end up with old bits of a Word document sticking around inside the .doc is the same as to why your FS has bits of deleted files: the space has been marked as free, and nothing else overwrote it yet.
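To make the "marked as free, not yet overwritten" point concrete, here is a toy sketch in Python. It is not the real Structured Storage layout: `ToyContainer`, its 8-byte sectors, and its free list are all hypothetical, made up purely to show how freeing space without zeroing it leaves old bytes in the container file.

```python
class ToyContainer:
    """Hypothetical mini 'filesystem in a file'; NOT the real .doc format."""

    SECTOR = 8  # deliberately tiny sector size, chosen for illustration

    def __init__(self, sectors):
        self.raw = bytearray(sectors * self.SECTOR)  # the bytes that hit disk
        self.free = list(range(sectors))             # free sector numbers
        self.streams = {}                            # name -> (sectors, length)

    def delete(self, name):
        sectors, _ = self.streams.pop(name, ([], 0))
        # Sectors are only marked as free; their old bytes are left in place.
        self.free.extend(sectors)

    def write(self, name, payload):
        self.delete(name)
        count = -(-len(payload) // self.SECTOR)  # ceiling division
        if count > len(self.free):
            raise ValueError("container full")
        sectors = [self.free.pop(0) for _ in range(count)]
        for i, sector in enumerate(sectors):
            chunk = payload[i * self.SECTOR:(i + 1) * self.SECTOR]
            start = sector * self.SECTOR
            self.raw[start:start + len(chunk)] = chunk
        self.streams[name] = (sectors, len(payload))

    def read(self, name):
        sectors, length = self.streams[name]
        data = b"".join(
            bytes(self.raw[s * self.SECTOR:(s + 1) * self.SECTOR])
            for s in sectors
        )
        return data[:length]
```

Writing a long stream and then replacing it with a shorter one leaves the logical content clean, while the old text can still be found by scanning `raw`, much like deleted files on a disk.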
But none of this applies to images, so the explanation here ought to be different.
This is also why "Save As..." with old Word versions would often produce files much smaller than "Save" would - it was writing a brand new, compact file which was effectively "garbage collected".
This stemmed from the fast save feature which was present from Word97 through 2003. Earlier versions didn't work that way.
> The old .doc format was never "a dump of Word's working memory", implying copy of raw bytes. It's rather Word's internal object graph serialized into COM Structured Storage
Probably the "dump of Word's working memory" part comes from Word for DOS, which predates COM by about a decade.
It's something people have been parroting ever since this Spolsky post from 2008 https://www.joelonsoftware.com/2008/02/19/why-are-the-micros...
MacWord 3.0+ (and then WinWord 1.1+) had fast-save which leaned on the in-memory piece-table data structure to write to disk only the changes to the Word document.
see https://web.archive.org/web/20160308183811/http://1017.songt...
The COM structured storage Office file format came from OLE2 (object linking and embedding - one of the mid-90s must-have features)
Wow, COM Structured Storage. That brings back memories, but not necessarily good ones. And as for working with OLE... ouch.
It was hard on the developer, but some of the features it enabled were very impressive, like the ability to arbitrarily embed documents into other documents in a way that allows composite rendering as a single piece, without the app managing either part aware of the nature of the other. In fact, I'm not aware of any modern equivalent of this tech, not even on Windows (since Office stopped supporting embedding of its documents via OLE by third-party apps).
It feels like a lot of "things everyone knows" are slowly getting lost over time, as developers work at higher and higher levels of abstraction without deep knowledge of the layers beneath them (which is of course the whole point of abstractions, but they're never perfect)
Being a true "full stack" engineer is a superpower when it comes to performance optimisation, or vulnerability research.
I wanted to be a programmer as a kid, but eventually found it boring and switched to administering systems.
Knowing how code works on systems IS a super power.
Eh, that’s not really true. Adding abstraction allows for providing APIs that can handle cases like these correctly. For example, Apple provides a very capable versioning system for files that does “the right thing”, which in this case would create a new file for reliability.
Sure, abstractions aren't inherently evil, but bad ones are. The abstraction you described sounds like a sensible one, which couldn't have been designed without a deep understanding of the system as a whole (or at the very least, the adjacent layers).
The people writing an abstraction need to understand the system, but if done correctly the people using it don’t.
All abstractions leak. There are some physical facts about software we keep denying for some reason. There is no silver bullet. Every enterprise system turns into a big ball of mud over time. Team structures get imprinted in the design of systems built by these teams.
And every abstraction leaks. Living on a given level without at least an accurate mental model of everything below it limits your ability as a developer. Sure you can just do scripting for a web dev team your whole career. If that's what you want...
Water dissolves pretty much everything too and yet we build structures that are useful even when it rains. Likewise, all software engineers get through their careers without learning most things. It’s totally fine, as long as they understand how to poke at a leaky abstraction when necessary. If atomic file renames are a performance problem for them, or they’re breaking hardlinks, or one of the several other ways that this would leak on them, then they can go learn what’s going on and update their understanding as necessary. Good abstractions aren’t watertight but rather don’t leak in unexpected and dangerous ways.
Hey some people spend a working life just painting bridges without understanding anything about structural engineering.
Your choices matter to you, not to society.
> The standard procedure to overwrite a file is to save to a temporary file first
… which on the other hand has the annoying side effect of wrecking hard links.
And various metadata!
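A small sketch of both side effects, in Python and assuming a POSIX-style filesystem that supports hard links. `save_via_rename` and `demo` are hypothetical helper names; the temp file comes from `tempfile.mkstemp`, which creates it with owner-only permissions, so the original mode is lost unless the app copies it over explicitly.

```python
import os
import tempfile


def save_via_rename(path, data):
    """Write data to a temp file next to path, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)  # path now points at a brand new inode


def demo():
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "notes.txt")
        alias = os.path.join(d, "alias.txt")
        with open(original, "wb") as f:
            f.write(b"old")
        os.chmod(original, 0o644)
        os.link(original, alias)  # second name for the same inode

        linked_before = os.path.samefile(original, alias)
        save_via_rename(original, b"new")
        linked_after = os.path.samefile(original, alias)

        with open(alias, "rb") as f:
            alias_data = f.read()
        mode_after = os.stat(original).st_mode & 0o777
        return linked_before, linked_after, alias_data, mode_after
```

After the save, the two names no longer refer to the same file, the alias still holds the old bytes, and the saved file carries the temp file's permissions rather than the original 0644.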
I think that's a positive. So you had 2 identical files but one is now changed and they automatically got decoupled!
Unless of course you wanted them to remain the same…
Yes I did, that's why I created them! For decoupled files you'd use copy
But I don't want to waste disk space by storing the same data twice. And doing that with copies only works on a handful of filesystems, most notably BTRFS and XFS.
If you want them to remain the same, use symbolic links.
Symbolic links don't give me an easily obtainable list of all "copies" of that file, and while they might survive atomic writes, they're also vulnerable to the "main" file being renamed/moved/etc.
(Of course the thing that'd solve my actual root problem would be proper OS and file system level support for tagging, but until then it seems that there are only imperfect solutions, each with its own set of drawbacks.
I.e. third party software is not well integrated with the file explorer and the file open/save dialogues etc., and now I'm dependent on that software lest I lose all my carefully tagged data, whereas hard or symbolic link-based solutions are clunky to use and vulnerable to either atomic saves or file renaming etc.)
It sounds like someone wrote `fopen(name, "r+")` instead of `fopen(name, "w")`.
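For anyone who hasn't been bitten by this: a rough illustration using Python's `open`, whose `"r+b"` and `"wb"` modes behave like the corresponding C `fopen` modes with respect to truncation. The `overwrite` and `demo` helpers and the file name are made up for the example, and this is only a guess at the class of bug, not a claim about what the actual app does.

```python
import os
import tempfile


def overwrite(path, data, mode):
    """Write data to an existing file using the given open mode."""
    with open(path, mode) as f:
        f.write(data)


def demo():
    results = {}
    with tempfile.TemporaryDirectory() as d:
        for mode in ("r+b", "wb"):
            path = os.path.join(d, "image.bin")
            with open(path, "wb") as f:
                f.write(b"ORIGINAL-LONG-CONTENT")
            overwrite(path, b"short", mode)
            with open(path, "rb") as f:
                results[mode] = f.read()
    return results
```

With `"r+b"` the new bytes land on top of the start of the old file and the tail of the old content survives; with `"wb"` the file is truncated first, so only the new bytes remain. An in-place writer that wants the `"r+"` behaviour would need an explicit truncate after writing.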
I don’t think writing to a separate file is standard procedure at all. From my anecdotal memory, far more apps skip it than do it.
I guess it depends on how long users are expected to work on the documents/files, and whether there is some other safety mechanism. Basically: a power outage, full disk etc. corrupting the file at save costs X hours of lost work and for some X you better make sure the app can revert to a known state via a transactional save and/or an autosave feature.
Personally I'd probably always save to a new file even if the amount of work potentially lost is negligible (as in a snipping tool). The cost of doing so is extremely small in development, so if it ever saves one customer's file, you have probably saved more time than it cost you to implement transactional file writing. It's a few extra lines, if you opt out of the more complex scenarios (e.g. you can't count on the rename being atomic if the target is a network share).
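Roughly what those "few extra lines" could look like, sketched in Python. `atomic_save` and its `write_body` callback are hypothetical names, and this deliberately ignores the harder cases (network shares, preserving permissions and hard links, syncing the parent directory).

```python
import os
import tempfile


def atomic_save(path, write_body):
    """Call write_body(f) on a temp file, then swap it into place.

    If anything fails before the rename, the existing file is untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            write_body(f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp, path)     # replaces an existing target on POSIX and Windows
    except BaseException:
        try:
            os.unlink(tmp)        # clean up the partial temp file
        except FileNotFoundError:
            pass
        raise
```

Usage would be something like `atomic_save("shot.png", lambda f: f.write(png_bytes))`. If the writer raises halfway through (full disk, encoder error), the previous version of the file is still there and the temp file is removed.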
rename is not atomic on win32. That’s a posix-ism.
Windows 10 1607 and newer support it now if you call SetFileInformationByHandle() with FileRenameInfoEx and specify the FILE_RENAME_FLAG_POSIX_SEMANTICS flag. Not sure how commonly it's used in standard libraries, but I've seen it popping up more and more.