Managing virtualenvs with dynamic libraries on NixOS (Part 1)
Intro: NixOS vs. VirtualEnvs
At the Flying Circus we run numerous business-critical customer applications on our platform, some of which are written in Python.
A problem many NixOS users probably know: installing Python dependencies as wheels won't work out of the box if the wheels depend on, e.g., C libraries. This is because NixOS doesn't have paths such as /lib where these libraries are normally installed. This article describes a few of our attempts to tackle this issue and the benefits and drawbacks of each.
All the Nix hackers reading this may wonder why we don't package the applications we host for our customers with Nix. While I agree that this would be a desirable choice, we came to the conclusion that packaging software with arbitrarily old or new interpreters & Python dependencies always resulted in far too much effort. This is especially true for guided projects where customers are operating their applications independently and our operations team supports them in a facilitating role.
The third and last article of this mini-series will cover these issues in more detail and what I think is missing to make this route viable.
LD_LIBRARY_PATH and its pitfalls
A common solution to this challenge involves the environment variable LD_LIBRARY_PATH, which accepts a colon-separated list of directories containing dynamic libraries. If a Python application depends on psycopg2, for example, the PostgreSQL C library (libpq.so.5) is needed. We solved this by creating a Nix environment with all the needed libraries in it, e.g.:
# pyenv.nix
with import <nixpkgs> {};

buildEnv {
  name = "python-c-env";
  paths = [
    postgresql # for libpq.so.5
    glibc      # for libc.so.6
  ];
}
This expression can be built with nix-build and you'll get a symlink called result/ that points to a store path that'd roughly look like this:
result/lib
├── libc.so.6 -> /nix/store/hash-glibc-2.x/lib/libc.so.6
└── libpq.so.5 -> /nix/store/hash-postgresql/lib/libpq.so.5
Then, the Python application was started with LD_LIBRARY_PATH pointing to the Nix environment, i.e.:
source .venv/bin/activate
export LD_LIBRARY_PATH=$(nix-build pyenv.nix)/lib
python -m application
As soon as a program requests to load libc.so.6 or libpq.so.5, the shared objects from the store-path will be used.
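You can sanity-check that the override is in effect with ldd, which honors LD_LIBRARY_PATH. The exact file name of psycopg2's extension module varies per Python version and platform; the path below is illustrative:

# with LD_LIBRARY_PATH set as above
ldd .venv/lib/python*/site-packages/psycopg2/_psycopg.cpython-*.so | grep libpq
# expected output, roughly: libpq.so.5 => /nix/store/hash-python-c-env/lib/libpq.so.5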
This worked fine until we discovered the following error messages in the logs of an application packaged like this:
sh: /.../.nix-profile/lib/libc.so.6: version `GLIBC_2.38' not found (required by sh)
After a bit of debugging we realized that the error was caused by a call to os.system(), which in this case invoked another command. os.system() uses a shell for that.
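A minimal sketch of how to reproduce the effect from a shell, assuming the Python interpreter itself comes from the older, pinned environment (so it is happy with the old glibc from pyenv.nix), while the system's sh comes from NixOS 23.11; the exact version in the error depends on your setup:

export LD_LIBRARY_PATH=$(nix-build pyenv.nix)/lib
# os.system() spawns /bin/sh, which inherits LD_LIBRARY_PATH and therefore the old glibc
python -c 'import os; os.system("ls")'
# sh: .../libc.so.6: version `GLIBC_2.38' not found (required by sh)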
So what has happened here? Let's go through it:
- The VM's OS was NixOS 23.11
- The user environment was pinned to an older nixpkgs with glibc 2.37.
- Due to LD_LIBRARY_PATH, the Python process was started with this old glibc from the environment's nixpkgs.
- The environment, including LD_LIBRARY_PATH, was then passed down to the shell started by os.system().
LD_LIBRARY_PATH has a higher priority than DT_RUNPATH, the field in the header of ELF binaries into which nixpkgs writes the paths of a binary's dependencies; see also ld.so(8).
This means that sh now gets a glibc 2.37 even though it is linked against glibc 2.38, the libc from NixOS 23.11.
As a reminder, glibc is backwards-compatible, i.e. you can run old programs with a new glibc. However, the inverse, which would be needed here, isn't supported.
The GLIBC_2.XX not found error from above is usually a sign that this has happened.
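Both sides of the mismatch can be inspected with standard binutils tools. A rough sketch; the exact output differs per system:

# which glibc versions sh requires (it was linked against the NixOS 23.11 glibc)
objdump -T /bin/sh | grep -o 'GLIBC_[0-9.]*' | sort -Vu | tail
# where nixpkgs told sh to look for its libraries (DT_RUNPATH)
readelf -d /bin/sh | grep -E 'RPATH|RUNPATH'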
Phew.
The quick fix is relatively simple: upgrade the user environment so that its programs are built against the same glibc as the host system.
However, we have a few projects where this is not trivial. There's also a deeper issue: LD_LIBRARY_PATH contains the dependencies of the Python environment, and those leak recursively into every other process started by the Python program. In fact, this abolishes Nix's core ability to isolate dependencies by statically declaring every program's dependencies. This is a general deficiency of how LD_LIBRARY_PATH, and environment variables in general, work.
Now, let's move on to more effective ways of dealing with this.
Fixing the underlying problem
It was clear that a thorough fix required an alternative approach and stepping back a bit. Let's take a closer look at where dynamic libraries are loaded from, from highest to lowest priority:
- The DT_RPATH section in an ELF binary, unless DT_RUNPATH exists.
- The LD_LIBRARY_PATH environment variable, as described above.
- The DT_RUNPATH section in an ELF binary. This is how binaries from nixpkgs resolve their libraries. In contrast to DT_RPATH, the paths from this section only apply to the binary itself, not to its dependencies: see figure 1.
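glibc's loader can show which of these sources actually wins at runtime; its LD_DEBUG=libs tracing is handy for that. A sketch, reusing the pyenv.nix environment from above:

# trace the library search while importing psycopg2 (the trace goes to stderr)
LD_DEBUG=libs LD_LIBRARY_PATH=$(nix-build pyenv.nix)/lib \
  python -c 'import psycopg2' 2>&1 | grep -A2 'find library=libpq.so.5'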
Potential solutions must also cover another requirement: some Python wheels bring their own prebuilt dynamic libraries that are required by other dynamic libraries. These must be loaded from a path relative to the loading library. For that, the special identifier $ORIGIN may be used in the DT_RUNPATH/DT_RPATH sections: see figure 2.
The illustration shows the behavior of $ORIGIN: bar.so is looked up in a path relative to the location of foo.so.
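To make this concrete, here is a sketch of how such an rpath could be written and inspected; the paths mirror the illustration and are made up:

# make foo.so look for its own libraries in a "foo" directory next to itself
patchelf --set-rpath '$ORIGIN/foo' /lib/foo.so
readelf -d /lib/foo.so | grep -E 'RPATH|RUNPATH'
# (RUNPATH)  Library runpath: [$ORIGIN/foo]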
As a side-note, there are a few more special tokens that can be used in dynamic library search paths. All of those are documented in the "Dynamic string tokens" section of ld.so(8).
Now, the general idea is to utilize one of those ELF sections and patch it to use the correct search paths at runtime. It turns out that the only choice we've got is DT_RPATH, since it allows all transitively loaded libraries to re-use the search path from their parents. In the illustration above, this would mean that /lib/foo/bar.so could load further dynamic libraries from /lib/foo, since the search path from its parent ($ORIGIN/foo) would be inherited. This is necessary to make sure that, for example, all of numpy's libraries load correctly.
The downside of this is that DT_RPATH is officially deprecated, but more on that later. On the other hand, prebuilt Python wheels also use that feature, and given its popularity I don't see it going away any time soon. Long-term I'd love to see a better, purely Nix-based approach to packaging these applications, but that's still a long way off. More on that later.
For now, we need a way to patch the binaries correctly. In fact, NixOS has a tool for this purpose called patchelf. This tool is widely used in nixpkgs to make proprietary software work that is only distributed in binary form and compiled for other distributions that e.g. have a /usr/lib.
After a few iterations I came up with the following incantation:
patchelf \
  --force-rpath --add-rpath $HOME/.nix-profile/lib \
  library-to-patch.so

patchelf \
  --force-rpath --shrink-rpath \
  --allowed-rpath-prefixes $HOME/.nix-profile/lib:'$ORIGIN' \
  library-to-patch.so
Effectively, the following things happen here:
- The path containing shared libraries we provide is added to DT_RPATH.
- To avoid unintended side effects, all unused search-path entries are removed from DT_RPATH again. Only entries pointing to $HOME/.nix-profile/lib or to paths relative to library-to-patch.so (via $ORIGIN) are allowed to remain.
That way you end up with a shared library that's correctly patched to use Nix-provided libraries.
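A quick way to check the result is to look at the patched section and let the loader resolve everything; a sketch, with the same hypothetical file name as above:

readelf -d library-to-patch.so | grep -E 'RPATH|RUNPATH'
# should now show an (RPATH) entry, e.g.: Library rpath: [/home/user/.nix-profile/lib:$ORIGIN]
ldd library-to-patch.so   # every dependency should resolve via the profile or relative to the library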
This worked fine until we noticed something odd: when performing the --add-rpath and --force-rpath operation multiple times in a row, you end up with dynamic libraries that segfault when being opened. And to our surprise, we saw that the libraries in question had doubled in size.
It turned out that each time --add-rpath was executed, the ELF header was enlarged. --shrink-rpath made the rpath itself smaller again, but didn't release the extra space that had been allocated, so it kept accumulating until a critical size was reached.
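The growth is easy to observe by repeating the two patchelf calls in a loop and watching the file size; a sketch, using the same hypothetical library name:

for i in $(seq 1 10); do
  patchelf --force-rpath --add-rpath $HOME/.nix-profile/lib library-to-patch.so
  patchelf --force-rpath --shrink-rpath \
    --allowed-rpath-prefixes $HOME/.nix-profile/lib:'$ORIGIN' library-to-patch.so
  stat -c %s library-to-patch.so   # the size keeps growing, even though the rpath content doesn't
done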
For that, I wrote a patch introducing the option --add-rpath-and-shrink, which performs both operations at once so that no extra space is allocated if it would be freed again by --shrink-rpath. The patch is in the process of being upstreamed in NixOS/patchelf#570.
Now, this is still missing a few important pieces: how do we integrate it into, e.g., a virtualenv? And how do we automate these tasks during deployments?
This is what we'll cover in the next part of this short series.
