Merge branch 'pdt'

This commit is contained in:
John Ericson 2025-04-17 13:52:17 -04:00
commit b72c594f0e


Wrapping up the core of this long-experimental feature is the first step.
(One of the reasons for that is that using dynamic derivations requires content-addressing derivations, because derivations themselves are always content-addressed.)
Completely moving the whole ecosystem over to content-addressing derivations is the ultimate goal, but this doesn't need to coincide with wrapping up the core of the experiment.
For example, as others have written out, "`sed`-ing" binaries to rewrite self-references is unlikely to work in general.
That's fine for me
--- we'll simply keep input-addressing in the cases where it doesn't work.
(Not only is this expedient, this also incentivizes trying to modify packages to stop needing self-references, which I think is a good thing to do regardless.)
So what does "wrapping up the core of the experiment" entail?
For me, the big test is "don't put junk in the cache".
I am OK with the "client side" missing various conveniences, like tooling to understand trust map conflicts, or fancier garbage collection.
So long as there is still an input-addressed Nixpkgs, no one will be "forced" to use it (by network effects), and so client UX issues can just be dodged by "just opting out".
On the "server side", however, I don't want anything sketchy to be going on, because I don't want people to accidentally opt into issues, especially highly nuanced "cache semantics" issues, that they didn't sign up for.
Cached build artifacts, even local ones but especially shared internet-accessible ones, are potentially very long-lived.
If we get the roll-out wrong, we open ourselves up to "cache poisoning" issues, which because of the distributed nature of Nix stores and copying, may be hard to completely eradicate.
I don't want content-addressing derivations to be responsible for any of those.
#### Medium level
Drilling deeper, what does "ensuring the binary cache is sound" entail?
I think the essential issue is [Nix#11896].
"deep realisations" --- build trace key-value pairs where the key includes derivations that depend on other derivations' outputs --- are fundamentally ambiguous.
This ambiguity makes them hard to verify/challenge, and hard to know when they conflict --- two deep realisations may implicitly make incompatible assumptions about the outputs of those dependency derivations.
We currently have a notion of "dependent realisations" that seeks to address this issue, but I do not think this mechanism is sound, and it is certainly not consistently implemented.
The simplest thing to do is... just rip out deep realisations.
Build trace keys should always be derivations that just depend on "opaque" store objects.
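To make the contrast concrete, here is a hedged sketch in Python. It is a toy model --- the `Drv` type, the `drv:<key>!<output>` input syntax, and the hashing are illustrative inventions, not Nix's actual data types --- but it shows why a deep key hides conflicting assumptions while a shallow key over resolved inputs exposes them:

```python
from dataclasses import dataclass
from hashlib import sha256

def digest(s: str) -> str:
    # Stand-in for Nix's real hashing scheme.
    return sha256(s.encode()).hexdigest()[:12]

@dataclass(frozen=True)
class Drv:
    name: str
    # Inputs are opaque store paths ("/nix/store/...") or references
    # to another derivation's output ("drv:<key>!<output>").
    inputs: tuple

    def key(self) -> str:
        return digest(self.name + "|" + ",".join(self.inputs))

dep = Drv("libfoo", ("/nix/store/aaa-src",))

# A *deep* build trace key refers to dep's output only symbolically:
deep = Drv("app", (f"drv:{dep.key()}!out",))

# Two caches may have built libfoo differently (non-determinism), yet
# both publish entries under the *same* deep key --- the conflicting
# assumption about libfoo's output is invisible in the key itself:
trace_a = {deep.key(): "/nix/store/bbb-app"}  # assumed libfoo -> xxx
trace_b = {deep.key(): "/nix/store/ccc-app"}  # assumed libfoo -> yyy

# A *shallow* key is over a resolved derivation, whose inputs are
# concrete store paths, so the disagreement surfaces as distinct keys:
shallow_a = Drv("app", ("/nix/store/xxx-libfoo",))
shallow_b = Drv("app", ("/nix/store/yyy-libfoo",))
assert shallow_a.key() != shallow_b.key()
```

Under this toy model, two shallow realisations can only conflict by mapping the identical key to different outputs, which is directly detectable; deep realisations can silently disagree.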
There are two downsides to "just do shallow addressing only", which are:
2. [Nix#11928] We regress with the current scheduling logic, causing build-time inputs to be built/downloaded unnecessarily when the downstream thing we actually need could just be substituted (it exists in the cache, but was built slightly differently).
Re (1): once again, I am quite willing to defer polishing something that is client-side, and thus has problems that the user is free to side-step entirely by opting out.
We can always delete *all* realisations locally
(there are no hard references between shallow realisations -- no "closure property"),
so that sledgehammer can always be presented as a fail-safe last resort to unbreak anyone's machine that ran out of disk space.
Again, the current way we GC realisations (leveraging those "dependent realisations") is not necessarily a good or the only way to do things
--- in fact, because the relationships between realisations are "soft" and not "hard", I view this as a situation where there are many possible "policies", and choosing between them is a matter of opinion.
Multiple policy/opinion territory is a clear place to cut scope for the first version.
Downside two, however, I consider more serious
--- it would be really annoying to always download GCC whenever you just want some cached binary built with GCC.
Yes, you can GC that GCC right away, so there is no wasted disk space, but there is still the wasted time waiting for the download, and wasted network usage.
Downloading to then delete is not a solution, but just exposes how artificial and silly the status quo is.
[Nix#11928] is thus something I consider required to fix if we're going to get rid of deep realisations (as I propose).
The good thing is that we can simply change the scheduling logic so it's no longer a problem.
The fix is conceptually simple enough: we can resolve derivations (normalize their inputs) without actually downloading those inputs.
We just look up build trace key-value pairs and substitute within the derivation accordingly.
The less good news is that it is a bit harder than it sounds to implement, because the scheduling code was such a confusing mess.
#### Low level
This in turn leads me to [Nix#12663].
To make progress on the schedule code (and actually a bunch of other issues, which I'll hopefully get to), we need to untangle scheduling and building.
Only then will we have a "clean workbench" upon which we can address reworking the scheduling logic for [Nix#11928] (and the other issues too).
This might sound hard, but it actually isn't so bad --- it's just long overdue.
(*Not* doing this and attempting to fix the issues anyways is much harder.)
After Planet Nix, @L-as and I started on a "bottom up" approach to this, which is the one outlined in [Nix#12663].
\[You should now just read that issue, it attempts to lay out a roadmap also --- if I said more here I would be just inlining the ticket.\]
So far, we got [Nix#12630], [Nix#12662], and [Nix#12658] done, and [Nix#12668] "on deck".
This will get local building pretty well "off to the side".
Then we do something similar for remote building (maybe just moving the hook code, or maybe indulging a little scope creep and getting rid of it altogether per [Nix#5025]).
At that point, the building logic (local and remote cases) will be completely "out of the way", and we should be able to solve [Nix#11928].
And at *that* point, we can (with some stop-gap for local GC) fix [Nix#11896], just ripping out deep realisations.
Along with / right after doing [Nix#11896], we can also do [Nix#11897].
This is a good simple cleanup --- the scheduling changes and lack of deep realisations mean that there is absolutely no use in hashing derivations "modulo fixed-output derivations", because resolved derivations never depend on fixed-output derivations (because they never depend on any derivation's output at all).
We can go back to just using derivation paths.
#### Hydra
With the Nix changes done, the next task is getting Hydra to work with the revamped system.
This is especially important given my "server first" approach --- I want to see us building at scale to find and eradicate problems before I worry about regular users actually using this stuff.
This should be a very simple fix --- Hydra already computes deep and shallow realisations and uploads both. It just needs to stop doing the former.
One interesting thing to note is we should also upload the resolved derivations that the shallow realisation refers to.
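A hedged sketch of that upload step (the helper names and cache layout here are hypothetical --- Hydra's actual code and the binary cache format differ):

```python
from hashlib import sha256

def digest(text: str) -> str:
    # Stand-in for hashing a resolved derivation; not Nix's real scheme.
    return sha256(text.encode()).hexdigest()[:12]

def upload_build(cache: dict, resolved_drv: str, outputs: dict) -> str:
    """Upload the shallow realisations *and* the resolved derivation
    they refer to, so a client can fetch that derivation, recompute the
    build-trace key, and verify or challenge the realisation."""
    key = digest(resolved_drv)
    cache[f"drv/{key}"] = resolved_drv  # the resolved derivation itself
    for output_name, store_path in outputs.items():
        cache[f"realisation/{key}!{output_name}"] = store_path
    return key
```

Without the resolved derivation in the cache, the key would be an opaque hash that clients must take on faith; with it, the key is independently checkable.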
#### Rollout, Nixpkgs, RFC
Whereas the above is mostly "technical stuff I can just *do* without having to ask anyone for permission", this part squarely depends on community buy-in.
I think what follows is a good process to follow, but, of course, no one knows for sure how the community will react until they do.
This is the roadmap I have in mind; the "...." indicates perhaps more intermediate steps to gain confidence in the new way things work before a major "flip the switch" milestone.
1. Implement and document, per the above
2. Do a lot of builds of Nixpkgs, publicly, with a public cache.
[Nix#12630]: https://github.com/NixOS/nix/pull/12630
[Nix#12658]: https://github.com/NixOS/nix/pull/12658
[Nix#12668]: https://github.com/NixOS/nix/pull/12668
[Nix#12662]: https://github.com/NixOS/nix/pull/12662
[Nix#12591]: https://github.com/NixOS/nix/pull/12591