znpy 5 years ago

For everybody complaining about having to pay actual money for goods and services: if you're not okay with this you can run a self hosted registry.

The out of the box registry does very little and has a very poor user experience.

But nowadays there are Harbor from VMware and Quay from RedHat that are open source and easily self-hostable.

We run our own Harbor instance at work and I can tell you: Docker images are NOT light. You think they are; they are not. It's easy for images to proliferate and burn a lot of disk space. Under some conditions, when layers are shared among many images (I can't recall the exact details), deleting one image can end up deleting a lot of other images too (and this is not the correct/expected/wanted behaviour), which means that under some circumstances you have to retain far more images or layers than you think you should.

The thing is, I can only wonder how much the bandwidth and disk space (and disk space must be replicated for fault tolerance) must cost when running a public registry for everybody.

It hurts the open source ecosystem a bit, I understand... Maybe some middle ground will be reached, dunno.

Edit: I also run Harbor at home; it's not hard to set up and operate, you should really check it out.

  • gramakri 5 years ago

    Which IaaS do you use to selfhost for the one at work? How much does the network transfer cost you? Or are they docker pulls internal network?

    • znpy 5 years ago

      We run on openstack, managed by a local yet fairly large openstack provider (Irideos, their devops/consulting team is top notch and has helped us adopt many cloud-native technologies while still staying fairly vendor-agnostic).

      I can't see the bills, but we never worry about bandwidth usage and I am fairly sure that bandwidth is free, basically.

      Keep in mind that since we run our own harbor instance, most of the image pulls happen within our openstack network, so that does/would not count against bandwidth usage (but image storage does). In terms of bandwidth thus, we can happily set "always" as imagePullPolicy in our kubernetes clusters.

      Edit: openstack works remarkably well. The horizon web interface is slow as molasses but thanks to terraform we rarely have to use it.

  • bovermyer 5 years ago

    Harbor is excellent, especially now that you can set up automatic image pruning rules that make sense.

  • sschueller 5 years ago

    Yep, the docker registry is absolute garbage and lacks garbage collection.

    FYI, GitLab (free version) has a built-in registry as well, and it lets you define retention rules.

    • znpy 5 years ago

      I know about that, and we used to use it, but we had to move away because it created a lot of scalability problems (mind you, most of them due to disk usage).

      With Harbor we can save Docker image layers to Swift, OpenStack's flavor of object storage (its S3 equivalent). That solves a lot of scalability problems.

      AFAIK GitLab ships the Docker registry underneath, so the problems mostly remain. I think Harbor does the same: I skimmed the Harbor source and it seems to forward HTTP requests to the Docker registry when you hit the registry API endpoints.

      Haven't looked at Quay but as far as I know wherever there's the docker registry you'll have garbage collection problems.

      One side note: I think Quay missed its chance to become the go-to Docker registry. Red Hat basically open sourced it only after Harbor had been incubated in the CNCF (unsurprisingly, Harbor development has skyrocketed since that event).

      • jschorr 5 years ago

        (Co-founder of Quay here)

        Scalability and garbage collection are actually two of the main areas of focus Quay has had since its inception. As you mentioned, most modern Docker registries such as Quay and Harbor will automatically redirect to blob storage for downloading layers to help with scale; Quay actually goes one step further and (for blobs recently pulled) will skip the database entirely if the information has been cached. Further, being itself a horizontally scalable containerized app, Quay can easily be scaled out to handle thousands of requests per second (which is a very rough estimate of the scale of Quay.io).

        On the garbage collection side, Quay has had fully asynchronous background collection of unreferenced image data since one of its early versions. That, plus the ability to label tags with future expiration, means you can (reasonably) control the growth problem around images. Going forward, there are plans to add additional capabilities around image retention to help mitigate further.

        In reference to your note: We are always looking for contributors to Project Quay, and we are starting a bug bash with t-shirts as prizes for those who contribute! [1]

        [1] https://github.com/quay/quay#quay-bug-bash

        Edit: I saw the edit below and realized the bug bash is listed as ending tomorrow; we're extending it another month as we speak!

        • znpy 5 years ago

          Hey, thank you so much for your reply.

          My message wasn't meant to dismiss quay, I hope it didn't come across like that.

          I'll take a look at what's in the repo during my vacation... I haven't loaded that many images into my private Harbor registry, so it might still make sense to switch :)

          Edit: I just realized that the bug bash ends tomorrow... :/

          • jschorr 5 years ago

            Not at all!

            I thought I'd just give some interesting background info and not so subtly ask people to contribute to the project :D

  • benatkin 5 years ago

    > The out of the box registry does very little and has a very poor user experience.

    I think this is indirectly what people are complaining about. Having a free registry mitigates that. So they aren't far off track.

    It's true we shouldn't be bitter about Docker. They did a lot to improve the development ecosystem. We should try to avoid picking technologies in the future that aren't scalable in both directions though.

    For example, PostgreSQL works well for a 1GB VPS containing it and the web server for dozens of users, and it also works well for big sites. With MongoDB the VPS doesn't work so well.

  • m463 5 years ago

    I thought they didn't make private registries available because it would "fragment the ecosystem"

    Here's the conversation I saw:

    https://stackoverflow.com/questions/33054369/how-to-change-t...

    pointing to this:

    https://github.com/moby/moby/issues/7203

    and also there was this comment:

    "It turns out this is actually possible, but not using the genuine Docker CE or EE version.

    You can either use Red Hat's fork of docker with the '--add-registry' flag or you can build docker from source yourself with registry/config.go modified to use your own hard-coded default registry namespace/index."

    • justincormack 5 years ago

      No you have misunderstood the issue. You can use any registry, just write out the domain for it, this has always worked and is very widely used. Red Hat changed the default if you don't specify a FQDN, before they decided not to ship Docker at all.

      • m463 5 years ago

        I understand that point, but it makes it harder to not "accidentally" pull from a public registry with intertwined docker images (which most people use)

  • ryanmccullagh 5 years ago

    It always blows my mind when people complain that free services and products are no longer going to be free. When I built Amezmo, I learned that free-tier customers are the worst to support. Just like with the Mailgun pricing changes, we'll have people complaining about how it's no longer free.

alexellisuk 5 years ago

Some thoughts / scenarios:

"Fine we will just pay" - I have a personal account then 4 orgs, that's ~ 500 USD / year to keep older OSS online for users of openfaas/inlets/etc.

"We'll just ping the image every 6 mos" - you have to iterate and discover every image and tag in the accounts, then pull them, retrying if it fails. Oh, and bandwidth isn't free.

"Doesn't affect me" - doesn't it? If you run a Kubernetes cluster, you'll do 100 pulls in no time from free / OSS components. The Hub will rate-limit you at 100 per 6 hours (resets every 24?). That means you need to add an image pull secret and a paid unmanned user to every Kubernetes cluster you run to prevent an outage.

"You should rebuild images every 6 mo anyway!" - have you ever worked with an enterprise company? They do not upgrade like we do.

"It's fair, about time they charged" - I agree with this, the costs must have been insane, but why is there no provision for OSS projects? We'll see images disappear because people can't afford to pay or to justify the costs.

A thread with community responses - https://twitter.com/alexellisuk/status/1293937111956099073?s...

  • trey-jones 5 years ago

    I feel like the main response should be "OK, we'll just host our own Docker Registry."

    This has been available as a docker image since the very beginning, which might not be good enough for everyone, but I think it will work for me and mine.

    • geerlingguy 5 years ago

      Note that for OSS images, that's a non-trivial thing to do—you have to have somewhere to run the image, and somewhere to store your images (e.g. S3), both of which are non-free, and would also require more documentation and less discoverability than Docker Hub offers.

      • laurencerowe 5 years ago

        GitHub Packages is free for public repositories so seems like a good option for OSS which likely have a GitHub presence already. https://github.com/features/packages

        • GordonS 5 years ago

          Huh, I knew GitHub Packages supported npm, NuGet and Maven, but had no idea it supported Docker images too.

          My guess is a lot of people don't know this. If it becomes better known, I can imagine something of an exodus from Docker Hub to GitHub Packages for OSS projects.

          • laurencerowe 5 years ago

            Further down someone mentioned that it still requires logging in to download public packages, though they're apparently working on it.

          • ithkuil 5 years ago

            Currently github docker registry is missing some important features, such as arbitrary media types (useful for hosting helm charts and similar things)

      • rumanator 5 years ago

        Last time I checked, GitLab offered a free Docker container registry for all projects.

    • Legogris 5 years ago

      Agreed that self-hosting registries should be way more common than it is today and maybe even standard practice.

      It's crazy easy to do; just start the registry container with a mapped volume and you're done.
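
      For reference, the minimal setup being described is roughly this (the port, host path, and image tags are arbitrary example choices):

      ```shell
      # Start the stock registry image, persisting layer data to a host directory.
      docker run -d \
        --name registry \
        -p 5000:5000 \
        -v /srv/registry-data:/var/lib/registry \
        registry:2

      # Tag an existing local image against the new registry and push it.
      docker tag alpine:latest localhost:5000/alpine:latest
      docker push localhost:5000/alpine:latest
      ```

      As-is this serves plain HTTP with no authentication, so it's only reasonable on localhost or a trusted network.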

      Securing it, adding authentication/authorization, and properly configuring your registry for exposure to the public internet, though... the configuration for that is very poorly documented IMO.

      EDIT: Glancing through the docs, they do seem to have improved on making this more approachable relatively recently. https://docs.docker.com/registry/deploying/

  • geerlingguy 5 years ago

    It looks like if anyone pulls an image within 6 months, then the counter is reset. It seems like it's not too onerous to me—for any of the OSS images I've maintained, they are typically pulled hundreds if not thousands of times a day.

    Sometimes I don't push a new image version (if it's not critical to keep up with upstream security releases) for many months to a year or longer, but those images are still pulled frequently (certainly more than once every 6 months).

    I didn't see any notes about rate limiting in that FAQ, did I miss something?

    • Operyl 5 years ago

      The FAQ is a bit incomplete or trying to hide it. Section 2.5 of the TOS also introduced a pull rate provision. You can see it on the pricing page, https://www.docker.com/pricing

      • dawnerd 5 years ago

        That's a bit confusing: is it max pulls per image, per 6-hour period, per org, or per user (which is weird since it's authenticated vs anonymous)?

        Honestly though 5 dollars a month isn't bad if you don't want to deal with hosting yourself.

        • zeeZ 5 years ago

          From the TOS[0], 2.5:

          > These limitations include but are not limited to [...] pull rate (defined as the number of requests per hour to download data from an account on Docker Hub) [...]

          I read that as per account that owns the repository.

          [0]: https://www.docker.com/legal/docker-terms-service

  • TeMPOraL 5 years ago

    > Oh and bandwidth isn't free.

    I'm not sure what protocol is used for pulling Docker images, but perhaps it could be enough to just initiate the connection, get Docker Hub to start sending data, and immediately terminate the connection. This should save bandwidth on both ends.

  • jstanley 5 years ago

    A friend of mine offers Docker (and more) repository hosting for $5/mo. He is competent and friendly and I would recommend his product: https://imagecart.cloud/

  • pydry 5 years ago

    >"You should rebuild images every 6 mo anyway!" - have you ever worked with an enterprise company? They do not upgrade like we do.

    No, but they've got cash and are not price sensitive. Wringing money out of them helps keep it cheap and/or free for everyone else.

    Enterprise customers might as well fork over cash to docker rather than shudder Oracle.

    • dpedu 5 years ago

      Enterprises upgrade on a slower schedule, yes, but they still patch as quickly as everybody else.

      Can you patch a docker image? Sort of, but it's easier to rebuild. And that's what they do.

      • ornornor 5 years ago

        > Enterprises upgrade on a slower schedule, yes, but they still patch as quickly as everybody else.

        Hahahahahahaaaa!! No. Not in my experience.

        • ygjb 5 years ago

          Depends on the situation. Web facing banking app that has ongoing PCI, SOX, and other scanning and monitoring by third party partners and customers? Patched quickly.

          Internally facing app that is AJAX glue over a legacy green screen app that is "only reachable from the internal network"? Probably not going to get patched until something breaks.

        • dpedu 5 years ago

          > No. Not in my experience.

          Then your experience comes from somewhere with little concern for security.

    • sebazzz 5 years ago

      Companies might base their image on another image in the docker registry. That image might be good now, might be good in two years, but what if I want to pull, say, a .NET Core 1.1 docker image in four years?

      Now, .NET Core 1.1 might not be the best example, but I'm sure you can think of some example.

      • krferriter 5 years ago

        If you anticipate needing that image around in 4 years for a critical business case, you can either pull it once every 6 months from here on out, download the image and store it somewhere yourself, or make a fully reproducible Dockerfile for it so the image can be re-created later if it disappears from the registry.
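
        The "download and store it yourself" option is just a `docker save`/`docker load` round trip (the image name and file paths here are illustrative):

        ```shell
        # Pull once, then serialize the image (all layers plus manifest) to a tarball.
        docker pull myorg/mytool:1.2.3
        docker save myorg/mytool:1.2.3 | gzip > mytool-1.2.3.tar.gz

        # Years later: restore it with no registry involved at all.
        gunzip -c mytool-1.2.3.tar.gz | docker load
        ```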

  • a2800276 5 years ago

    > Oh and bandwidth isn't free.

    But neither is storage

    > have you ever worked with an enterprise company? They do not upgrade like we do.

    I'm sure someone somewhere is going to shed a tear for the enterprise organisations with shady development practices using the free tier who may be slightly inconvenienced.

  • robertlagrant 5 years ago

    > "We'll just ping the image every 6 mos" - you have to iterate and discover every image and tag in the accounts then pull them, retry if it fails. Oh and bandwidth isn't free.

    Set up CircleCI or similar to pull all your images once a month :)

  • Gelob 5 years ago

    Everyone should actually read the docker FAQ instead of assuming. This only applies to inactive images.

    https://www.docker.com/pricing/retentionfaq

    What is an “inactive” image? An inactive image is a container image that has not been either pushed or pulled from the image repository in 6 or more months.

    • Bnshsysjab 5 years ago

      That may well be true, but now I have to pull images every 6 months, some I very much doubt I’ll ever upgrade but will pull anytime I format the relevant host.

      It sucks that this isn’t for new images only. Now I have to go and retrospectively move my old images to a self-hosted registry, update all my scripts to the new URIs, debug any changes, etc.

  • lima 5 years ago

    > "You should rebuild images every 6 mo anyway!" - have you ever worked with an enterprise company? They do not upgrade like we do.

    Good opportunity to sell a support contract. The point still stands - a six month old image is most likely stale.

    Docker is doing the ecosystem a favor.

  • matsemann 5 years ago

    > Oh and bandwidth isn't free.

    Neither is it for Docker...

  • masonhensley 5 years ago

    > "You should rebuild images every 6 mo anyway!" - have you ever worked with an enterprise company? They do not upgrade like we do.

    A bunch of enterprises are going to get burned when say ubuntu:trusty-20150630 disappears.

    It's not that they even have to rebuild their images... they might be pulling from one that will go stale.

  • tuananh 5 years ago

    > "Doesn't affect me" - doesn't it? If you run a Kubernetes cluster, you'll do 100 pulls in no time from free / OSS components. The Hub will rate-limit you at 100 per 6 hours (resets every 24?). That means you need to add an image pull secret and a paid unmanned user to every Kubernetes cluster you run to prevent an outage.

    I can't find this. It's not in the original link, is it?

brutos 5 years ago

This will be quite bad for reproducible science. Publishing bioinformatics tools as containers was becoming quite popular. Many of these tools have a tiny niche audience and when a scientist wants to try to reproduce some results from a paper published years ago with a specific version of a tool they might be out of luck.

  • hvs 5 years ago

    Maybe they should switch to Github. https://github.com/features/packages

    • toomuchtodo 5 years ago

      Or store the containers in the Internet Archive alongside the paper. They’re just tarballs. Lots of options as long as you're comfortable with object storage.

      • captn3m0 5 years ago

        quay is another alternative.

      • brutos 5 years ago

        This still means that tools published over the last few years might just be gone soon. The people who uploaded the images might have graduated or moved on, and no one will be there to save the work.

        • icebraining 5 years ago

          Sounds like a job for the Archive Team, as long as there's some way to identify the images worth saving.

          • maxfan8 5 years ago

            Yep, just mentioned it to the Archive Team IRC. We're probably going to selectively archive particular Docker images, although that's a lot of manual labor.

            If you have any ideas wrt selecting important images, that'd be great.

            • thebouv 5 years ago

              Rough idea: maintain an Awesome List of images worth saving, take submissions from public, use that list to automate what to pull?

              • maxfan8 5 years ago

                Yeah, good idea — I’m not in these fields so it’s difficult for me to judge. Also, it sounds like we should be prioritizing niche images that only a handful of papers use rather than images that people rely upon regularly.

                • cosmie 5 years ago

                  Couldn't you bootstrap a list by searching/parsing the Archive dataset itself? Searching for:

                  A) "docker pull" commands, parsing the text that follows them based on the command's syntax[1] to extract instructional references to images, such as "docker pull ubuntu:latest", and

                  B) links/text beginning with "https://hub.docker.com/_/" to identify informational references to image base pages, such as https://hub.docker.com/_/ubuntu

                  [1] https://docs.docker.com/engine/reference/commandline/pull/
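
                  A rough first pass at (A) could be a single grep over whatever text corpus you have. This is a loose sketch: the pattern approximates the NAME[:TAG] syntax and ignores registry hostnames with ports, digests, and other corner cases, and the sample corpus line is made up for illustration:

                  ```shell
                  # Build a tiny sample corpus (stand-in for extracted paper text).
                  mkdir -p corpus
                  printf 'To reproduce, run docker pull ubuntu:latest first.\nAlso: docker pull biotools/samtools:1.9\n' > corpus/paper.txt

                  # Extract "docker pull NAME[:TAG]" references and deduplicate them.
                  grep -rhoE 'docker pull [a-z0-9._/-]+(:[A-Za-z0-9._-]+)?' corpus/ \
                    | sed 's/^docker pull //' \
                    | sort -u > images-to-archive.txt

                  cat images-to-archive.txt
                  # prints biotools/samtools:1.9 and ubuntu:latest
                  ```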

                  • maxfan8 5 years ago

                    Good idea! The base images are probably not in danger of being deleted though.

                    The other issue is that (to my knowledge) the number of papers on IA isn't terribly impressive. I think maybe indexing and going through Sci-Hub will be better, since some of these fields slap paywalls in front of their papers.

                    However, that's a pretty large task as well. The other thing is that papers rarely say "to reproduce my work do . . .". Usually the best we've got is a link to a GitHub repo (if that). I'm not sure how effective that strategy will be since it's guaranteed to be an under-count of the docker images we'd need to archive. Perhaps in conjunction with archiving all images that fall under particular search queries, we'd get the best of both worlds.

                    If you've got ideas, feel free to hop onto EFnet (#archiveteam and #archiveteam-bs) (also on hackint) to share your thoughts.

            • contravariant 5 years ago

              Since images tend to be based on each other I wonder if someone's analyzed the corresponding dependency graph yet. In theory you should get quite far if you isolate the most commonly used base images.

              • CameronNemo 5 years ago

                Are those not the images that are basically guaranteed to stay in Dockerhub?

                • toomuchtodo 5 years ago

                  “Guaranteed” is a strong word.

    • vegannet 5 years ago

      GitHub storage for docker images is very expensive relative to free; I don't think it's a viable solution in this case.

    • lstamour 5 years ago

      Publishing containers to GitHub might be free but you have to login to GitHub to download the containers from free accounts, significantly hampering end-user usability compared to Docker Hub, particularly if 2FA authentication is enabled on a GitHub account. As mentioned elsewhere Quay.io might be another alternative.

      • qppo 5 years ago

        You don't need to register an SSH key to download a public repo I thought

        • timdorr 5 years ago

          Not an SSH key, but you do need an access token:

          > You need an access token to publish, install, and delete packages in GitHub Packages.

          https://docs.github.com/en/packages/using-github-packages-wi...

          • qppo 5 years ago

            ...but not to download. You can clone a repo and download release artifacts without a PAT. That's only necessary for interacting with the API for actions that need authentication, which would be anything involving mutating a repository.

          • laurencerowe 5 years ago

            GitHub access tokens are a bit of a nightmare since you can't limit the permissions for a token. Only workaround I've found is to create another GitHub user for an access token and restrict that user's access.

  • chrisandchris 5 years ago

    It seems you simply have to pull it every 5.99 months to keep it from being removed. So add all your images to a bash script and pull them every couple of weeks via crontab and you're fine.

    On the other side, I see the need for making money, and storage/services cannot be free (someone always pays for it somewhere), but 6 months is not that much for certain usages.
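
    That keep-alive job can be a few lines of shell. A sketch (the image names are illustrative; the PULL_CMD override exists only so you can dry-run it without a Docker daemon):

    ```shell
    # keep_alive: pull every image passed to it so the 6-month
    # inactivity clock resets; warn (but continue) on failures.
    keep_alive() {
      for img in "$@"; do
        ${PULL_CMD:-docker pull} "$img" || echo "WARN: failed to pull $img" >&2
      done
    }

    # Dry-run: just print what would be pulled.
    PULL_CMD=echo keep_alive myorg/tool-a:1.0 myorg/tool-b:2.3
    ```

    Dropped into a script, a crontab line like `0 3 1 */3 * /usr/local/bin/keep-alive.sh` would run it quarterly, comfortably inside the 6-month window.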

    • crazysim 5 years ago

      I'm sure you've cited research older than 5.99 months right?

      I wish they would grandfather images uploaded before this ToS change so they don't get wiped; future images could then go to more stable and accepting platforms, and pre-ToS research images on Docker Hub wouldn't be lost.

      • riffic 5 years ago

        Well, it sounds like someone's gotta pony up the bucks for their own image repo, rather than freeload off someone else's storage infra.

        • edoceo 5 years ago

          Science Docker Repo as a Service, backed by Amazon Glacier and an index, with a one-time fee for access?

        • salawat 5 years ago

          Full circle achieved.

          START: Run your own stuff on stuff you own. -> Run your stuff on other people's stuff you rent. -> "This is too expensive to maintain at your rent. Pay us more." -> Back to running your own stuff on stuff you own. ...and so on, and so forth.

          And this, ladies and gentlemen, is why anything worth doing is worth actually doing yourself. Nothing is worse than building something whose feasibility is conditional on someone else, only to have the rug pulled out from under you by a sudden business pivot.

          But that's the nature of the beast I suppose. I've certainly not found a great way to do it any other way.

      • chrisandchris 5 years ago

        Oh yes I did, some probably as old as myself. Things just don't change that much in certain areas.

    • jasonhansel 5 years ago

      "Pulling docker images every 5 months as a service"

      • imglorp 5 years ago

        Hey, you could distribute that as a container on Docker hub...

        • AsyncAwait 5 years ago

          Which would mean people will regularly pull it and thus prevent it from being deleted. I call that a self-sustaining business model.

          • swinges 5 years ago

            This is both amusing and actually feasible

      • baq 5 years ago

        Finally a good use for that raspberry pi idling in the corner

        • Rebelgecko 5 years ago

          How hard is it to pull non-ARM images from a pi?

          • skissane 5 years ago

            You can pull another platform's image. If the image is only available for one platform, you can just pull it directly. If the image is available for multiple platforms, a plain pull will fetch your platform's variant, so you need to explicitly specify the digest to get another platform's.

            For example:

                docker pull python@sha256:2d29705d82489bf999b57f9773282eecbcd54873d308a7e0da3adb8a2a6976af

            pulls the latest Python image for Linux Alpine running on IBM System/z.

            Not what you were asking, but if you have qemu-user-static installed, you can even run Docker images for other platforms. For example:

                docker run -it python@sha256:2d29705d82489bf999b57f9773282eecbcd54873d308a7e0da3adb8a2a6976af

            That's Python for Linux on IBM System/z. Check the CPU architecture:

                import os; os.uname().machine

            Running other platform's images under QEMU is going to be quite slow on a Raspberry Pi, I imagine (I haven't tried it). But of course for this case you don't have to run the image, you just want to pull it, so that doesn't matter.
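
            To find such a digest in the first place, `docker manifest inspect` on a multi-arch tag lists one entry per platform (on Docker versions from around this time the manifest subcommand may need experimental CLI features enabled):

            ```shell
            # Show the manifest list for a multi-arch tag. Each entry in the
            # "manifests" array carries a "digest" plus a "platform" object
            # ("architecture"/"os"), e.g. s390x for IBM System/z.
            docker manifest inspect python:alpine
            ```

            Pick the digest whose platform matches what you want and pull it as python@sha256:... as above.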

      • sebazzz 5 years ago

        Just ensure you've busted the cache, otherwise you're only pulling a joke.

  • quotemstr 5 years ago

    Why? It'll force a shift to a more elegant and general model of specifying software environments. We shouldn't be relying on specific images but specific reproducible instructions for building images. Relying on a specific Docker image for reproducible science is like relying on hunk of platinum and iridium to know how big a kilogram is: artifact-based science is totally obsolete.

    • dguest 5 years ago

      I couldn't agree more. The defense of images over instructions to build them has often been "scientists don't work this way", but to me that's either overly cynical or an indication that something is rotting in academic incentive structures.

      • CameronNemo 5 years ago

        > rotting

        I would not say rotting. From my perspective, the academic community has always lagged behind engineering best practices (except in their specific fields).

      • eat_veggies 5 years ago

        You could say the same about distributing docker images for deploying code for non-scientific software as well (and honestly, it may very well be true).

        But that doesn't change the fact that it's just way easier to skim a paper and pull a docker image than follow every paper's custom build instructions and software stack.

        • quotemstr 5 years ago

          Why would build instructions have to be custom? Making a reproducible image should be as easy as getting a docker image

    • dijksterhuis 5 years ago

      These reproducible instructions you speak of are already present in Dockerfiles.

      It seems like you're arguing against using docker images, when docker builds solve the very issue you speak of.

      Correct me if I'm wrong...?

      • euank 5 years ago

        A Dockerfile is not a reproducible set of build instructions in most cases. I'd guess that the vast majority of Dockerfiles are not reproducible.

        Let's look at an example dockerfile for redis (based on [0])

            FROM debian:buster-slim
            RUN apt-get update; apt-get install -y --no-install-recommends gcc
            RUN wget http://download.redis.io/releases/redis-6.0.6.tar.gz && tar xvf redis* && cd redis-6.0.6 && make install
        

        (Note, modified from upstream for this example; won't actually build)

        The unreproducible bits are the following:

        1. FROM debian:buster-slim -- unreproducible, the base image may change

        2. apt-get update && apt-get install -- unreproducible, will give a different version of gcc and other apt packages at different times

        Those two bits of unreproducibility are so core to the image that they result in every other step not being reproducible either.

        As a result, when you 'docker build' that over time, it's very unlikely you'll get a bit-for-bit identical redis binary at the other end. Even a minor gcc version change will likely result in a different binary.

        As a contrast to this, let's look at a reproducible build of redis using nix. In nixpkgs, it looks like so [1].

        If I want a reproducible shell environment, I simply have to pin down its dependencies, which can be done by the following:

            let
              pkgs = import (builtins.fetchTarball {
                url = "https://github.com/NixOS/nixpkgs/archive/48dfc9fa97d762bce28cc8372a2dd3805d14c633.tar.gz";
                sha256 = "0mqq9hchd8mi1qpd23lwnwa88s67ac257k60hsv795446y7dlld2";
              }) {};
            in pkgs.mkShell {
              buildInputs = [ pkgs.redis];
            }
        

        If I distribute that nix expression, and say "I ran it with nix version 2.3", that is sufficient for anyone to get a bit-for-bit identical redis binary. Even if the binary cache (which lets me not compile it) were to go away, that nixpkgs revision expresses the build instructions, including the exact version of gcc. Sure, if the binary cache were deleted, it would take multiple hours for everything to compile, but I'd still end up with a bit-for-bit identical copy of redis.

        This is true of the majority of nix packages. All commands are run in a sandbox with no access to most of the filesystem or network, encouraging reproducibility. Network access is mediated by special functions (like fetchTarball and fetchGit) which require including a sha256.

        All network access going through those specially denoted means of network IO means it's very easy to back up all dependencies (i.e. the redis source code referenced in [1]), and the sha256 means it's easy to use mirrors without having to trust them to be unmodified.

        It's possible to make an unreproducible nix package, but it requires going out of your way to do so, and rarely happens in practice. Conversely, it's possible to make a reproducible dockerfile, but it requires going out of your way to do so, and rarely happens in practice.

        Oh, and for bonus points, you can build reproducible docker images using nix. This post has a good intro to how to play with that [2].

        [0]: https://github.com/docker-library/redis/blob/bfd904a808cf68d...

        [1]: https://github.com/NixOS/nixpkgs/blob/a7832c42da266857e98516...

        [2]: https://christine.website/blog/i-was-wrong-about-nix-2020-02...

        • ahmedtd 5 years ago

          Unless something changed in the months since I last used Nix, this will not get you bit-for-bit reproducible builds. Nix builds its hash tree from the source files of your package and the hashes of its dependencies. The build output is not considered at any step of the process.

          I was under the impression that Nix also wants to provide bit-for-bit reproducible builds, but that that is a much longer term goal. The immediate value proposition of Nix is ensuring that your source and your dependencies' source are the same.

          • beefee 5 years ago

            This is true, but the Nix sandbox does make it a little easier. If you're going for bit-for-bit reproducibility, it has some nice features that help, like normalizing the date, hostname, and so on. And optionally you can use a fixed output derivation where you lock the output to a specific hash.

          • euank 5 years ago

            You're right that NixOS and Nix packages as a whole aren't perfectly reproducible.

            In practice, most of the packages in the nixos base system seem to be reproducible, as tested here: https://r13y.com/

            Naturally, that doesn't prove they are perfectly reproducible, merely that we don't observe unreproducibility.

            Nix has tooling, like `nix-build --check`, the sandbox, etc which make it much easier to make things likely to be reproducible.

            I'm actually fairly confident that the redis package is reproducible (having run `nix-build --check` on it, and seen it have identical outputs across machines), which is part of why I picked it as my example above.
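            For reference, the check looks something like this (assumes Nix is installed and a <nixpkgs> channel is configured):

            ```shell
            # Build redis, then rebuild it and compare the two outputs;
            # --check fails if the results differ bit-for-bit.
            nix-build '<nixpkgs>' -A redis
            nix-build '<nixpkgs>' -A redis --check
            ```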

            However, I think my point stands. Dockerfiles make no real attempt to enforce reproducibility, and rarely are reproducible.

            Nix packages push you in the right direction, and from practical observation, usually are reproducible.

          • juliosueiras 5 years ago

            The focus of Nix in the build process is the ideal that if you have the same three build inputs (bash 4, gcc 4.8.<patch>, libc <whatever version>) and the same source for the package (hash-wise), the output is very much (for most cases) going to be the same. Nix itself (even on non-NixOS systems) uses very little of the host system: it won't use the system libc, gcc, bash, ncurses, etc., but its own copies, locked to a version down to the hash. It follows a target (with an exact spec) -> output model, whereas a Dockerfile is more output-first, with builds not run very often. This is why Nix has its own CI/CD system, Hydra, to ensure daily or even hourly that builds stay reproducible.

        • quotemstr 5 years ago

          Exactly. Basically, if your product needs network access during build, you don't have a reproducible build, and if you don't have a reproducible build, it's only a matter of time before something goes horribly wrong.

    • siscia 5 years ago

      Hummmm, what if the instructions say to get a binary that was deprecated 5 years ago?

      What if it uses a patched version of a weird library?

      Software preservation is a huge topic, and it is not done based on instructions.

      • 838hhh 5 years ago

        Include the patch in the build instructions

      • Legogris 5 years ago

        There will always be these cases. The issue is that in many fields it is the norm rather than an exception.

      • haroldp 5 years ago

        The FreeBSD Ports tree specifies package building via reproducible instructions, and handles things like running extra patches for compatibility and security on source distributions. FreeBSD binary packages are simply packaged ports.

  • dguest 5 years ago

    My field is doing something similar.

    Reproducible science is definitely a good goal, but reproducible doesn't mean maintainable. Really scientists should be getting in the habit of versioning their code and datasets. Of course a docker container is better than nothing, but I would much rather have a tagged repository and a pointer to an operating system where it compiles.

    It's true that many scientists tend to build their results on an ill-defined dumpster fire of a software stack, but the fact that docker lets us preserve these workflows doesn't solve the underlying problem.

    • MengerSponge 5 years ago

      FYI, and for anyone else still learning how to version and cite code: Zenodo + GitHub is the most feature rich and user-friendly combination I've found.

      https://guides.github.com/activities/citable-code/

      • dguest 5 years ago

        Zenodo is great! In theory you could also upload a docker image to Zenodo and give it a DOI, but it doesn't seem to have an especially elegant way to pull this image after the fact.

      • wadkar 5 years ago

        Thank you for mentioning Zenodo. I really liked how EU funding agencies push for reproducibility/citability of data and code when you submit proposals to them.

        I haven’t filed any NSF stuff (yet), but I didn’t come across any such hard requirement to commit to something like Zenodo to archive the results of your research work for archiving/citation purposes.

        • MengerSponge 5 years ago

          I <3 Zenodo. My societies don't require open data, but that's a generational shift.

          Also, if you do bio-type research, you can use Data Dryad too!

  • dijksterhuis 5 years ago

    Simplest answer is to release the code with a Dockerfile. Anyone can then inspect build steps, build the resulting image and run the experiments for themselves.

    Two major issues I can see are old dependencies (pin your versions!) and out of support/no longer available binaries etc.

    In which case, welcome to the world of long term support. It's a PITA.

    • OnlyOneCannolo 5 years ago
      • atomi 5 years ago

        I would recommend running a registry mirror as it's fairly straightforward.

        https://docs.docker.com/registry/recipes/mirror/
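        The recipe boils down to running the stock registry image with one environment variable set (requires a running docker daemon; the mirror then caches whatever the hosts behind it pull):

        ```shell
        # Start a local pull-through cache for Docker Hub on port 5000.
        docker run -d --restart=always --name registry-mirror \
          -p 5000:5000 \
          -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
          registry:2
        ```

        Each daemon that should use it then gets `"registry-mirrors": ["http://localhost:5000"]` in its daemon.json.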

        • OnlyOneCannolo 5 years ago

          That's still more effort than pushing a tar file to a free public gh repo.

          • atomi 5 years ago

            There is a bit of upfront work, but backups are thereafter automated.

            • OnlyOneCannolo 5 years ago

              True. I was thinking more for archiving science. Most people in that category would probably rather push to gh or upload to Dropbox than set up a docker registry.

    • Nullabillity 5 years ago

      That doesn't help against expiring base images though.. :/

      • donmcronald 5 years ago

        Yeah. That’ll be a mess. The way I try to do it is to build an image for a project’s build environment and then use that to build the project. The build env image never changes and stays around forever or as long as is needed. So when you have to patch something that hasn’t been touched for 5 years you can build with the old image instead of doing a big update to the build config of the project.

        Many Docker-based builds are not reproducible. Even something as simple as `apt-get update` failing with a zero exit code (it does this) adds complexity, and most people don't bother doing a deep dive.
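        The usual mitigation is to chain update and install in a single RUN layer and pin exact package versions (the version string here is illustrative, not a real pin):

        ```shell
        # Inside a Dockerfile RUN step: update, install a pinned version,
        # and clean the apt cache so the layer stays small.
        apt-get update \
          && apt-get install -y --no-install-recommends curl=7.64.0-4 \
          && rm -rf /var/lib/apt/lists/*
        ```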

        Personally I use Sonatype Nexus and keep everything important in my own registry. I don’t trust any free offerings unless they’re self hosted.

    • AlphaSite 5 years ago

      Tern is designed to help with this sort of thing: https://github.com/tern-tools/tern#dockerfile-lock

      It can take a Dockerfile and generate a 'locked' version, with dependencies frozen, so you at least get some reproducibility.

      Disclaimer: I work for VMware, but on a different team.

    • globular-toast 5 years ago

      The Dockerfile should always be published, but it does not enable reproducible builds unless the author is very careful, and even then there's no support built in. It would be cool if you could embed hashes into each layer of the Dockerfile, but in practice it's very hard to achieve.

  • Legogris 5 years ago

    As long as the Dockerfile is released alongside, this should not be an issue.

    I don't see any valid reason why anyone would upload and share a public docker image but not its Dockerfile and therefore do not pull anything from Dockerhub that doesn't also have the Dockerfile on the Dockerhub page.

    • Bedon292 5 years ago

      What about when the image that it is based on goes out of date and is pruned too?

      • Legogris 5 years ago

        This is part of why I tend to only use images that only build from a small set of well-established base images like scratch, alpine, debian and occasionally ubuntu. Those base images can also be handled in the same way. For any exception, you can always do the same.

        A bonus to this is that you no longer have the risks of systems breaking because of Dockerhub or quay.io (which I haven't seen mentioned here yet, btw) being offline.

  • diffeomorphism 5 years ago

    Couldn't journals host the images? Or some university affiliated service, let us call it "dockXiv"?

    Having the images on dockerhub is more convenient, but as long as the paper says where to find the image this does not seem that bad.

  • casept 5 years ago

    They should be using Nix or similar then. The typical Dockerfile is not reproducible.

stevebmark 5 years ago

Makes sense. I don't get how Docker could offer so much free hosting in the first place. I know storage is cheap, but not this cheap. Eventually they're going to need to make these rules more stringent.

  • DJHenk 5 years ago

    I also don't get why people put all their stuff on free services like this and expect it to work for eternity.

    Come on, if you stop for just half a second and think about it, you know it is a stupid idea and you know that one day you will have a problem. You really don't have to be a genius for that. The same goes for all these other kinds of "services" that are bundled together with things that used to be a one-time purchase, like cars, etc.

    Oh, I now have a TV that can play Netflix and YouTube, but is otherwise not extensible. But what happens in ten years? The TV still works fine, but Netflix has gone bust and this new video service won't work. Too bad, gonna buy a new TV then. I can get really mad about this stupid short-sightedness everybody has these days.

    Spoiler alert: one day Github will be gone too.

    • judge2020 5 years ago

      Enough billion dollar companies put their weight behind Docker that you'd think someone big is already running Docker hub pushing people to use their paid offerings, but that's not the case. Google created Kubernetes which is almost always used along with Docker, but they don't directly invest in Docker, Inc (at least, based on Crunchbase) and run their own container registry at gcr.io. The same goes for Amazon and Azure where their customers are increasingly moving to Docker instead of VMs, yet none of them directly back the company.

    • globular-toast 5 years ago

      Do you save a copy of every web page you think might be useful later? I have a small archive of things I consider to be "at risk", but there are many things I enjoy that exist only on other people's servers now. I can't keep it all on my own machines forever, so the difficulty is guessing what will disappear and what won't.

      • DJHenk 5 years ago

        No, I don't save every page that might be useful. But I do save some content if I notice that I keep referring to it multiple times.

        However, that is just information, and that is not what I am talking about. I am talking about tools and things that stop functioning because they need some free service on the internet to work. Yes, all my projects and tools can work without internet access. Sure, they might not get updated anymore, but they will keep functioning, and I could continue living my life no matter what shuts down.

        This even extends to non-free services. For instance, I don't use Spotify, even though it is a nice product and I love exploring new music. But if there were a change of service, or Trump decided to block my country economically or something like that, and I were kicked off the platform, I would suddenly have no music anymore, even though I had paid for it for years. So I buy CDs and vinyl instead and rip them to FLAC.

    • m463 5 years ago

      Because docker wired in their registry as the basis for everything by design. I think people had to fork docker to get one where you could create your own registry.

      I think it's a good idea to NOT be pulling from someone else's image on the internet.

nickjj 5 years ago

The FAQ[0] says pulling an image once every 6 months will prevent it from being purged by resetting the timer.

It doesn't seem like a big deal really. It just means old public images from years ago that haven't been pulled or pushed to will get removed.

[0]: https://www.docker.com/pricing/retentionfaq

  • klysm 5 years ago

    looks like there will be some bots that pull images on a periodic basis cropping up

    • dgellow 5 years ago

      Yep, everybody will have a small scheduled Github Action pulling their image once per month or similar ¯\_(ツ)_/¯
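      The cron equivalent is a one-liner (the image name is hypothetical; assumes docker is available to the cron user):

      ```shell
      # crontab entry: pull once a month, well inside the 6-month window.
      # m h dom mon dow  command
      0 4 1 * *  docker pull myuser/myimage:latest
      ```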

    • jacekm 5 years ago

      Why not just pay the $60/year? I mean, if it's something important then it's worth paying for. If not, there is cheaper storage available where one can archive their containers.

      • klysm 5 years ago

        I agree it’s probably better to just pay, but certainly if it’s that cheap to circumvent then people will do it to save $60

      • res0nat0r 5 years ago

        Everyone should just start using Google's Cloud Build service IMO; it will cost you pennies. You can literally just run `gcloud builds submit --tag gcr.io/project/image:0.0.1` and it will automatically tar up and send your build directly to their build service and create the container. It's about the cheapest and easiest build service I've seen.

        https://cloud.google.com/cloud-build

        • mamurphy 5 years ago

          Is there any special evidence to suggest this service will exist a year from now?

          • dividedbyzero 5 years ago

            I believe the GCP stuff is relatively stable, at least nothing we're using has disappeared so far over about 3 years (several have been renamed and consolidated, though, but without breaking anything already existing)

          • res0nat0r 5 years ago

            Lol Google spent over a billion dollars building out their cloud offerings, just last year. The silly cynical view on HN that everything Google runs goes away doesn't apply to their multi-continent multi-billion dollar cloud business.

      • dividedbyzero 5 years ago

        Not a lot of money for a viable company, but it might be prohibitive if you're the main contributor to a smaller open source project, and that money will have to come out of your own pocket.

  • spiffytech 5 years ago

    This may not be a big deal for small-time projects. But does this mean e.g., the official Node images for older runtime versions could disappear? I recently needed to revive an app that specifically required node:8.9.4-wheezy, pushed 2 years ago. An image that specific and old will quite possibly hit 0 downloads per 6 months in short order, if it hasn't already.

    • nickjj 5 years ago

      That is a really good point. I wonder if official images will be treated differently.

      • manishyt 5 years ago

        (I work for Docker). Inactive official images will not be deleted. We are updating the FAQ shortly.

jacques_chester 5 years ago

Docker is partly to blame for its own predicament by conflating URIs with URNs. When you give an image reference as `foo/bar`, the implicit actual name is `index.docker.io/foo/bar`.

That means that "which image" is mixed with "where the image is". You can't deal with them separately. Because everyone uses the shorthand, Docker gets absolutely pummeled. Meanwhile in Java-land, private Maven repos are as ordinary as dirt and a well-smoothed path.

It's time for a v3 of the Registry API, to break this accidental nexus and allow purely content-addressed references.
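Worth noting that the current client already supports partially content-addressed pulls via manifest digests, even though the registry location stays baked into the reference (the digest below is a placeholder, not a real one):

```shell
# Pin a pull to exact content rather than a mutable tag:
docker pull redis@sha256:<digest>

# Look up the digests of images you already have:
docker images --digests redis
```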

  • pcthrowaway 5 years ago

    > the implicit actual name is `index.docker.io/foo/bar`

    `index.docker.io/foo/bar:latest` to be more exact, which is a URL, but not really a URN if we're being pedantic.

    Docker doesn't really provide an interface to address images by URN (which would be more like the SHA), though in practice, tags other than latest should function closer to a URN.

  • dariusj18 5 years ago

    The second issue is that they purposefully do not allow you to change the default domain to point to. The only thing you can do is use a pull-through proxy.

mrweasel 5 years ago

That's fantastic, my main issue with Docker Hub is that there's a ton of unmaintained and out of date images.

Some just pollute my search results; I don't care that "yes, technically there's an image that does the thing I want, but it's Ubuntu 14.04 and 4 years old".

Even better, it prevents people from using these unmaintained images as a base for new projects, which they will do, because many developers don't look at the Dockerfile or actually review the images they use in shipping products.

As a bonus, perhaps this will mean that some of the many images of extremely low quality will go away.

I think it's fair, now you can either pay or maintain your images.

  • swozey 5 years ago

    You really shouldn't be pulling someone's random images off of Docker Hub. If I made a POC 4 years ago for some random Kubernetes configuration/tutorial I was testing, and I decided to use Docker Hub to host its images (as one typically does, and it used to not have private repos), I'm not posting that for you to come consume 4 years later, out of the blue, in production, because you found it randomly via search.

    You also tend to have no idea what's in those images and what context people are creating them under. Sure, a lot of us know to check the Dockerfile, the GitHub repo, etc., and I have images with 10k+ downloads from OSS contributions, but as you've said, a whole lot of developers just grab whatever looks fitting on there. My biggest Docker Hub pull has no Dockerfile and no GitHub repo, and it is a core network configuration component I put up randomly just for my own testing, because no docker image for it existed years ago.

    • mrweasel 5 years ago

      You’re right, but people tend to see Docker Hub as some master registry of quality, official images, even if it never claimed to be such a thing. Reading and understanding the Dockerfile is vital before deciding to use an image in any sort of production environment. The new policy will help clean up Docker Hub.

      • swozey 5 years ago

        Totally! I'm really worried about what'll happen when these docker images I made get deleted. Docker Hub doesn't give you ANY details beyond the download number (which stops at 10k+), so I can't tell if they're still getting used or not.

        I'm hopeful they'll add statistics/referrers when this goes live.

freedomben 5 years ago

If I'm reading this correctly, a single pull every <6 months would avoid this. This seems like NBD to me.

Still, I keep my images mirrored on quay.io and I would recommend that to others (disclaimer: I work for Red Hat which acquired quay.io)

LockAndLol 5 years ago

If people really think this is a problem, they should contribute a non-abusive solution. Writing cron jobs that pull periodically in order to artificially reset the timer is abusive.

Non-abusive solutions include:

- extending docker to introduce reproducible image builds

- extending docker push and pull to allow discovery from different sources that use different protocols like IPFS, TahoeLAFS, or filesharing hosts

I'm sure you can come up with more solutions that don't abuse the goodwill of people.

  • dgellow 5 years ago

    If their business model isn't working for them, it's their job to fix it in a way that does. I don't see how you can put the responsibility on users. If you say to your users their data will be deleted if they don't pull it at least once per N months, well that's exactly what they will do, and they are perfectly in their right to automate that process.

  • gruez 5 years ago

    >- extending docker to introduce reproducible image builds

    It's already reproducible... sort of. All you need is to eliminate any outside variables that can affect the build process, which mainly means network access (e.g. running npm install).
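    One concrete knob for this: builds can be run with networking disabled entirely, so any hidden download fails loudly instead of silently varying (the image tag here is hypothetical):

    ```shell
    # Fail the build if any step tries to reach the network.
    docker build --network=none -t myimage:pinned .
    ```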

  • Legogris 5 years ago

    docker pull and push integrated with IPFS is a great idea!

    • zoobab 5 years ago

      IPFS is only a partial solution: if you are the only one with a copy and you pull the plug, the content is gone. You would need a bot that takes care of keeping at least 3 or 5 copies always available across the whole filesystem.

morpheuskafka 5 years ago

This seems like a non-issue, if you need to keep a rarely-used image alive for some reason just write a cron job to pull it once every six months. If the goal is long term archival it should be entrusted to something like Internet Archive.

bithavoc 5 years ago

This is fine and completely fair; I bet it's not cheap paying for storage for docker images no one cares about.

gramakri 5 years ago

Does anyone know what is docker's business model these days?

  • gtirloni 5 years ago

    After they sold Docker Enterprise to Mirantis, I don't know anymore.

    Probably hold on long enough to get acquired?

    • CameronNemo 5 years ago

      Why would anyone want to acquire them after they sold off the majority of their client base?

      Did they retain a significant amount of talent?

  • dawnerd 5 years ago

    “Enterprise”

  • thebouv 5 years ago

    Continuously throw stuff at the wall to see what sticks?

SEJeff 5 years ago

https://quay.io/plans/

""" Can I use Quay for free? Yes! We offer unlimited storage and serving of public repositories. We strongly believe in the open source community and will do what we can to help! """

francislavoie 5 years ago

That includes public images? That'll hurt OSS. That's a bummer.

It wouldn't surprise me if people move to Github's registry for open source projects. https://github.com/features/packages

  • gramakri 5 years ago

    Only if they never got pulled for 6 months

  • jimktrains2 5 years ago

    If it's oss there should be a docker file available to build yourself?

  • gruez 5 years ago

    >It wouldn't surprise me if people move to Github's registry for open source projects. https://github.com/features/packages

    The egress pricing is going to be a dealbreaker. The free plan only includes 1GB out.

  • NathanKP 5 years ago

    Probably not at 50 cents per GB of data transfer outside of Github Actions. Unfortunately the only place you can viably use Github registry right now is inside Github actions

    • laurencerowe 5 years ago

      That pricing is for private repos. It's free for public repos.

nhumrich 5 years ago

I would delete my own images to clear up room on Docker Hub, but they don't have an API to remove images; the only way is to manually click the X in the UI. So, in a lot of ways, Docker forced us to "abuse" their service and store thousands of images on a free/open source account. I get this change, and it was inevitable. But it's still ironic that you can't delete your own images. The best way to delete your image is to just stop using it and let Docker delete it for you in 6 months.

  • lightswitch05 5 years ago

    I agree. I have a little tool called `php-version-audit` that literally becomes useless after a few weeks without an update (you can't audit your PHP version without knowledge of the latest CVEs). I have manually cleaned up old images as you describe, by clicking through them all, but having a way to define retention limits is a feature to me.

ncrmro 5 years ago

Been wondering when image storage would start to be a concern.

Actually just set up my own private registry and pull-through registry.

Pretty easy stuff, although there's no real GUI to browse with as of yet.

This is all sitting on my NAS running RancherOS.

  • chrisandchris 5 years ago

    Take a look at Portus, a project maintained by SUSE, which has a pretty nice GUI for a private docker registry.

    https://github.com/SUSE/Portus

    • GordonS 5 years ago

      This looks fantastic, thanks for posting it!

    • ncrmro 5 years ago

      Actually spent some time looking at this today. It's a bit more complex than I was hoping, as right now I'm the only user, and the only way to connect to my registry atm is through WireGuard.

      It is cool seeing openSUSE here. Same with Rancher.

      • chrisandchris 5 years ago

        It's a bit of an overkill if you're alone, but even then, I'm using it for my own private registry too, because it's the nicest/easiest way IMHO to add auth to a docker registry.

djsumdog 5 years ago

Seems like companies are relearning what they should have in the 2001 dotcom bust.

Keep free stuff free and add paid stuff. If your free stuff isn't sustainable, you really should have thought that through early on.

This limit seems reasonable, because storage is expensive. But it should have been implemented from day one so people would have reasonable expectations about retention. Others have mentioned open source projects and artifacts for scientific publication being two niche use cases where people still might want this data years later, but it'd be rare for it to be pulled every six months.

I only have a few things on docker hub, but I'll probably move them to a self-hosted repo pretty soon. At least if it's self hosted, I know it will stay up until I die and my credit cards stop working.

  • Macha 5 years ago

    I think in Docker's case, in their original plan this free unlimited hosting was probably sustainable in a freemium model where businesses paid for Docker Enterprise and Docker.com was about marketing and user acquisition, similar to open source on GitHub.com being marketing and user acquisition for paid accounts/Github Enterprise.

    Its not an unreasonable strategy to provide generous free hosting if you derive some other business benefit from it (YouTube being another example).

    But Docker Inc. found their moat was not that deep and other projects from the big cloud providers killed the market they saw for Docker Enterprise and they sold it off.

    So now they just have docker.com and Docker CE - which even that has alternatives now with other runtimes existing. So they need to make docker.com a profitable business on its own or find something else to do which changes the equation significantly.

anderspitman 5 years ago

If you've never used Singularity containers[0], I highly recommend checking them out. They make a very different set of tradeoffs compared to Docker, which can be a nice fit for some uses. Here's a few of my favorites:

* Images are just files. You can copy them around (or archive them) like any other file. Docker's layer system is cool but brings a lot of complexity with it.

* You can build them from Docker images (it'll even pull them directly from Dockerhub).

* Containers are immutable by default.

* No daemon. The runtime is just an executable.

* No elevated permissions needed for running.

* Easy to pipe stdin/stdout through a container like any other executable.

[0]: https://github.com/hpcng/singularity
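A minimal session, assuming Singularity is installed, might look like this (it converts a Docker Hub image into a single .sif file and runs a command in it):

```shell
# Fetch and convert a Docker image; the output is one ordinary file.
singularity pull docker://alpine:3.12

# Run the container like an executable.
singularity exec alpine_3.12.sif cat /etc/alpine-release
```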

  • GordonS 5 years ago

    > * Images are just files. You can copy them around (or archive them) like any other file

    Never heard of Singularity before, and it does look interesting. Wanted to point out though that you can create tarballs of Docker images, copy them around, and load them into a Docker instance. This is really common for air-gapped deployments.
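    A sketch of that workflow (requires a docker daemon on both ends):

    ```shell
    # On the connected host: archive the image as a plain tarball.
    docker save -o redis-6.tar redis:6

    # Move the file however you like, then on the target host:
    docker load -i redis-6.tar
    ```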

Bedon292 5 years ago

If they are doing this, they should add stats on when the last time an image was pulled, so you can see what is at risk of being removed. Would be curious about a graph, like NPM has a weekly downloads one so you can see how active something is.

  • manishyt 5 years ago

    (I work for Docker). We will be updating the UI to show status of each image (active or inactive). We will be updating the FAQ shortly to clarify this.

fgribreau 5 years ago

Docker is the new Heroku. Cronjobs will pull images to simulate image activity

sebazzz 5 years ago

> What is an “inactive” image?

> An inactive image is a container image that has not been either pushed or pulled from the image repository in 6 or more months.

>

> How can I view the status of my images

> All images in your Docker Hub repository have a “Last pushed” date and can easily be accessed in the Repositories view when logged into your account. A new dashboard will also be available in Docker Hub that offers the ability to view the status of all of your container images.

That still does not tell the whole story, does it? I still don't know if my images have been pulled in the last six months, only when I pushed them.

rmoriz 5 years ago

So this means that open source projects need to pay to keep older images alive?

  • dewey 5 years ago

    If no one pulled an image in 6 months it's probably not such a big deal for a project? And if it's open source you could still push a container yourself if you want.

  • moondev 5 years ago

    They say image not tag, so it wouldn't appear this should impact active projects

laksjd 5 years ago

I just got an email notification and while I can understand that they're doing this (all those GB must add up to a significant cost), the relatively short notice seems unnecessary.

  • ownagefool 5 years ago

    6 month notice doesn't seem terrible for a free service imo.

    • amenod 5 years ago

      3 month notice - they will start on Nov 1st. Still not bad.

      • ownagefool 5 years ago

        Image retention is 6 months though, so it seems slightly unclear if the timer starts counting from today or from Nov 1st.

        Best to assume the worst but still plenty of time to write a cron that pulls all your images, assuming for some reason you need images you don't pull for > 6 months.

GordonS 5 years ago

Hmm, the very nature of layered images presumably means big storage savings; I wonder if block-level deduplication at the repository backend would be feasible too?

  • brown9-2 5 years ago

    Registries already do this

    • GordonS 5 years ago

      Do you mean at the filesystem level, or higher up? Have you got any sources for this?

      • binman-docker 5 years ago

        Hi, I work at Docker. Registry sees each layer as a SHA and does not store multiple copies of the same SHA for obvious reasons. This is not unique to Hub, it's part of the registry design spec.

        Registry is open source (https://github.com/docker/distribution) and implements the OCI Distribution Specification (https://github.com/opencontainers/distribution-spec/blob/mas...) if you want to dig into it.
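        You can observe this content addressing locally too; the layer digests of an image are what the registry deduplicates on (assumes the image has already been pulled):

        ```shell
        # List the content-addressed layers of a local image.
        docker image inspect redis --format '{{json .RootFS.Layers}}'
        ```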

        • GordonS 5 years ago

          Yes, that's what I meant when I mentioned layers; clearly copies of the same layer are not kept :) My question was about block-level, or other forms of deduplication.

          • binman-docker 5 years ago

            Deduplication at the block level would be dependent on the choice of storage driver (https://docs.docker.com/registry/storage-drivers/). In the case of Hub, S3 is the storage medium and that's an object store rather than a block store.

            In theory you could modify the spec/application to try to break layers down into smaller pieces but I have a feeling you would reach the point of diminishing returns for normal use cases pretty quickly.

            • jacques_chester 5 years ago

              I found this recent paper interesting: https://www.usenix.org/conference/atc20/presentation/zhao

              > Containers are increasingly used in a broad spectrum of applications from cloud services to storage to supporting emerging edge computing paradigm. This has led to an explosive proliferation of container images. The associated storage performance and capacity requirements place high pressure on the infrastructure of registries, which store and serve images. Exploiting the high file redundancy in real-world images is a promising approach to drastically reduce the severe storage requirements of the growing registries. However, existing deduplication techniques largely degrade the performance of registry because of layer restore overhead. In this paper, we propose DupHunter, a new Docker registry architecture, which not only natively deduplicates layers for space savings but also reduces layer restore overhead. DupHunter supports several configurable deduplication modes, which provide different levels of storage efficiency, durability, and performance, to support a range of uses. To mitigate the negative impact of deduplication on the image download times, DupHunter introduces a two-tier storage hierarchy with a novel layer prefetch/preconstruct cache algorithm based on user access patterns. Under real workloads, in the highest data reduction mode, DupHunter reduces storage space by up to 6.9x compared to the current implementations. In the highest performance mode, DupHunter can reduce the GET layer latency up to 2.8x compared to the state-of-the-art.

              • GordonS 5 years ago

                This is really interesting, thanks for posting it! It's exactly the kind of thing I was thinking of, even if I expected a comment like yours to come from someone at Docker ;)

Bnshsysjab 5 years ago

Remember that time you were looking for an answer to some obscure question: you find the perfect Google result, the description, page title and URL all indicate it's going to answer your question, so you click it, and... nothing. The page cannot be found.

You now have that, with docker.

voltagex_ 5 years ago

A 2019 paper says there's 47TB of Docker images on the Hub. Get scraping.

voltagex_ 5 years ago

I wonder what kind of account the Home Assistant images are using. This could break a whole lot of stuff - and I've seen projects that don't publish a Dockerfile anywhere.

maztaim 5 years ago

Relying on goodwill works until that goodwill stops. Store your images locally, at least as a backup; local storage has other advantages too.

avian 5 years ago

Is there a way to see when an image was last pulled? I can see the last push date, but not pull.

wildpeaks 5 years ago

This will begin November 1, 2020

thrownaway954 5 years ago

given their track record with developers over the years, i wouldn't be surprised if microsoft scrambles to build a competitor to docker's repo service and integrates it with github.

pjmlp 5 years ago

The complaints, as expected, are the usual ones from the free-of-charge generation; apparently Mozilla is not enough of a lesson.

  • zoobab 5 years ago

    Let's mirror dockerhub on a distributed fault tolerant file system. And IPFS sucks.

PaywallBuster 5 years ago

tl;dr: images hosted on free accounts that go 6 months without a download will be scheduled for removal.

oauea 5 years ago

Time for someone to create a new service that will pull your images into /dev/null once a month.

  • wildpeaks 5 years ago

    No need for a new service, a simple Github Action with a cron trigger can do it.
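    A workflow along these lines would do it (untested sketch; the schedule and image names are placeholders):

    ```yaml
    # .github/workflows/keep-images-warm.yml
    name: keep-images-warm
    on:
      schedule:
        - cron: '0 3 1 * *'  # 03:00 UTC on the 1st of every month
    jobs:
      pull:
        runs-on: ubuntu-latest
        steps:
          - run: |
              docker pull myorg/api:latest
              docker pull myorg/worker:latest
    ```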

  • mr__y 5 years ago

    This could also be solved by one person running a service that crawls all public docker images and automatically pulls those close to expiration every 6 months. At this moment I'm just curious how many resources that would take.

    • remram 5 years ago

      If you just need to send the request, not read the content in full, that can be done by one free-tier cloud VM.

      • ptspts 5 years ago

        Which hosting provider and product provides such free-tier cloud VMs?

      • csunbird 5 years ago

        Reading the content still should not be a problem, since ingress is free for almost all cloud providers.

        • mr__y 5 years ago

          > ingress is free for almost all cloud providers.

          That would still pose a problem, not cost-wise, but because you'd need to download image after image. Would a single instance be capable of "downloading" all existing images every 6 months?
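          As a back-of-envelope check, using the ~47 TB figure from the 2019 paper mentioned elsewhere in the thread, and assuming the pulls are spread evenly over roughly 182 days:

          ```shell
          # Sustained throughput needed to re-download ~47 TB every 6 months.
          awk 'BEGIN {
              tb   = 47               # total image data in terabytes (assumed figure)
              secs = 182 * 24 * 3600  # seconds in roughly 6 months
              mbps = tb * 1e12 * 8 / secs / 1e6
              printf "~%.1f Mbit/s sustained\n", mbps
          }'
          # prints: ~23.9 Mbit/s sustained
          ```

          So bandwidth-wise a single modest VM would cope; the harder part is enumerating every public repository, since the bytes themselves can go to /dev/null.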

    • SkyBelow 5 years ago

      If Docker cared enough to implement this policy, then why wouldn't they just modify it enough to delete images being protected in such a fashion?

  • zoobab 5 years ago

    A community of bots.

dataminded 5 years ago

I'm selling a SaaS service that will pull each of your images once every 6 months... thank you, Docker.

jmondi 5 years ago

Add one more coin to the "always self-host" bucket. Just another example of a service that starts out free, then pulls the rug out from under you and holds you for ransom.