
How reducing parallelism can make your ARIMA model faster

I recently discovered a strange performance issue in a popular statistics library that makes it possible to speed up fitting an ARIMA model by a factor of four, just by adding one line of code, and without changing how the model is fit.

Here's how I discovered it.

Motivation

Recently, I was working on a project that required me to fit ARIMA models to over ten thousand time series using the Python package pmdarima. It's a handy package which automates the selection of ARIMA models: it varies the p, d, and q parameters, measures model fit for each combination, and penalizes more complex models.

Unfortunately, fitting these models is very slow, so I started looking for a way to parallelize the work. As soon as I started, I ran into something surprising: it was already using multiple cores!

[Figure: htop screenshot showing multiple cores in use]

This contradicts the pmdarima documentation, which says that parallelism is only used when doing a grid search. However, I'm using the stepwise algorithm to fit the model, which is supposedly not parallelized.

As I'll show later in this blog post, although it's running in parallel, it's not doing anything useful with those extra cores.

Dataset

In order to demonstrate this issue, I'm using a time-series dataset for grocery stores in Ecuador. The way I'm going to frame this problem is to consider each store and category of items separately, and fit an ARIMA model to each one.
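The code below works with an array indexed by store and category. Here's a minimal sketch of how you might build it, assuming a long-format CSV with hypothetical date, store_nbr, family, and sales columns (adjust the names to match your copy of the data):

import pandas as pd

df = pd.read_csv("train.csv", parse_dates=["date"])

# One row per (store, category) pair, one column per date
pivot = df.pivot_table(index=["store_nbr", "family"],
                       columns="date", values="sales", fill_value=0)

n_stores = df["store_nbr"].nunique()
n_categories = df["family"].nunique()

# array[store, category] is a single time series, as used in the code below
array = pivot.to_numpy().reshape(n_stores, n_categories, -1)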

Measuring performance

Most Python benchmarking tools only measure elapsed time, but to debug this issue I needed a way to measure both wall-clock time and CPU time. (Wall-clock time is another name for elapsed time. CPU time is wall-clock time multiplied by the average number of cores that were used.)

I wrote a context manager which measures both kinds of time; the full details are in the notebook at the end of the post. The sections wrapped in with MyTimer() as timer: are timed this way.
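The full implementation is in the notebook, but a minimal sketch looks something like this (the notebook's version may differ in details):

import time

class MyTimer:
    """Record wall-clock and CPU time for a block of code."""
    def __enter__(self):
        self._wall = time.perf_counter()
        # process_time() sums CPU time across every thread in the process,
        # so cpu_time exceeds wall_time when BLAS fans out to multiple cores
        self._cpu = time.process_time()
        return self

    def __exit__(self, *exc_info):
        self.wall_time = time.perf_counter() - self._wall
        self.cpu_time = time.process_time() - self._cpu
        return False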

How parallel is auto_arima?

I wrote a test which uses a package called threadpoolctl to restrict parallelism. I tested fitting the model with and without restricting parallelism.

import pmdarima as pm
from threadpoolctl import ThreadpoolController

for limit_cores in [True, False]:
    # One (store, category) series from the dataset described above
    series = array[0, 0]
    with MyTimer() as timer:
        if limit_cores:
            controller = ThreadpoolController()
            # Restrict every BLAS backend in the process to a single thread
            with controller.limit(limits=1, user_api='blas'):
                fit = pm.auto_arima(series)
        else:
            fit = pm.auto_arima(series)
    print(f"lim: {limit_cores} wall: {timer.wall_time:.3f} cpu: {timer.cpu_time:.3f}")

Here are the results of this test:

lim: True wall: 8.006 cpu: 8.539 
lim: False wall: 8.312 cpu: 33.715 

There are several things we can learn from the output here:

  • When limiting the parallelism, the wall-clock time is roughly equal to the CPU time.
  • When not limiting the parallelism, the CPU time is about four times the wall-clock time, so something is using every core on this 4-core machine, even though the pmdarima documentation says otherwise.
  • The multi-core version takes longer to run than the single-core version. That's pretty surprising - I would expect the multi-core version to be faster.
  • The multi-core version falls even further behind in CPU time: although it is only slightly slower in elapsed time, it is 3.9x slower in CPU time.

Slowdown cause

Why does pmdarima slow down when you give it additional cores?

I'm not exactly sure what causes the slowdown here. My rough theory is that pmdarima uses statsmodels internally, which uses scipy internally, which uses numpy internally, which uses OpenBLAS internally.

OpenBLAS is a linear algebra library which provides various matrix and vector operations, and can use multiple threads to process a large matrix operation. However, for some small operations, the overhead associated with giving work to a different thread will be larger than the gain from parallelism.

I'm guessing that the threshold where OpenBLAS switches from a single-core code path to a multi-core one is set too low, and that's why restricting the parallelism makes it faster.
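If that theory is right, anything that caps OpenBLAS at one thread should give the same speedup. threadpoolctl is the most flexible option, but OpenBLAS also honors its own environment variable, as long as it's set before numpy first loads the library:

import os

# Must run before the first `import numpy` anywhere in the process,
# because OpenBLAS reads this variable once, when the library is loaded
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np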

As evidence for this, note that with controller.limit(limits=1, user_api='blas'): restricts parallelism only for libraries that implement the BLAS API.
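You can confirm which BLAS implementation is actually loaded using threadpoolctl itself; on a stock numpy wheel this typically reports OpenBLAS, though the exact output varies by installation:

from pprint import pprint
from threadpoolctl import threadpool_info

# Each entry describes one threadpool-backed library in the process;
# look for user_api='blas' and internal_api='openblas'
pprint(threadpool_info())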

Adding parallelism back in

My original purpose for looking into this was that I needed to fit over ten thousand ARIMA models to different time series. Rather than trying to parallelize within a single ARIMA model, I can run multiple different ARIMA models in parallel. This is more efficient, because the units of work are larger.

I set up a very similar problem, except that instead of fitting a single model, this test fits one model for each category in a single store. It uses multiprocessing to distribute each time series to a different process. (You cannot use ThreadPool here because auto_arima() holds the GIL most of the time.)

For the first test, I put no limit on parallelism, so within each ARIMA model, the calculation is also parallelized. For the second test, each process is limited to a single thread.

import functools
import multiprocessing

import pmdarima as pm
from threadpoolctl import ThreadpoolController

controller = ThreadpoolController()

def attach_limit(func, limit, *args, **kwargs):
    """Call func(), restricting BLAS to a single thread if limit is truthy."""
    if limit:
        with controller.limit(limits=1, user_api='blas'):
            return func(*args, **kwargs)
    else:
        return func(*args, **kwargs)

def predict(x):
    return pm.auto_arima(x, error_action="ignore")

for limit in [True, False]:
    with multiprocessing.Pool() as p:
        # Get one store
        store_array = array[1]
        with MyTimer() as timer:
            predict_restrict = functools.partial(attach_limit, predict, limit)
            p.map(predict_restrict, store_array)
        print(f"lim: {str(limit).ljust(4)} time: {timer.wall_time:.3f}")

Result:

lim: True time: 144.185
lim: False time: 534.951

The result of this is a 3.7x speedup from disabling BLAS parallelism, with no changes to how the model is fit. (This is on a 4-core computer - you may get different numbers on a machine with more or fewer cores.)

Notebook

A Jupyter notebook demonstrating this technique can be downloaded here.

Summary

  • OpenBLAS runs some operations in parallel even if you don't ask it to.
  • You can turn this behavior off with the threadpoolctl library.
  • Turning it off results in a 1.8x speedup, or a 3.7x speedup if you're also fitting multiple ARIMA models.

Sandboxing nginx with systemd


nginx uses a master process and several worker processes. Normally, the master process runs as root. If you look online, the common wisdom is that there’s no way around this, and that nginx needs root access to bind to low-numbered ports:

The process you noticed is the master process, the process that starts all other nginx processes. This process is started by the init script that starts nginx. The reason this process is running as root is simply because you started it as root! […]
 Most importantly; Only root processes can listen to ports below 1024. A webserver typically runs at port 80 and/or 443. That means it needs to be started as root.
 In conclusion, the master process being run by root is completely normal and in most cases necessary for normal operation.

However, Linux has a feature called capabilities, which allows a process to perform one specific privileged operation without being granted every privileged operation. If you look through that manual page, you’ll find one capability which is exactly what we need: CAP_NET_BIND_SERVICE. It allows a process to bind to a low-numbered port, despite not being root. Perfect!

Editing the systemd service file

Now we need a way to start nginx as an unprivileged user, with this one additional capability. You can do this with systemd. We just need to change a few configuration files.

First, stop the nginx process.

sudo systemctl stop nginx

Now, copy the system-provided nginx service file into the local configuration area.

sudo cp /lib/systemd/system/nginx.service \
/etc/systemd/system/nginx.service

Now, use your favorite text editor to edit /etc/systemd/system/nginx.service. When we make edits to this file, it will override the system-provided service file.
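As an aside, on recent systemd versions you can combine the copy and the edit into a single step; this is equivalent to the manual copy above:

sudo systemctl edit --full nginx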

Go down to the [Service] section, and add these two lines:

User=www-data
Group=www-data

This will start nginx as an unprivileged user. However, to make this work, we need to give nginx the CAP_NET_BIND_SERVICE capability. Add this line:

AmbientCapabilities=CAP_NET_BIND_SERVICE

Next, we need to create a place for nginx to write its PID file. Currently, it writes to /run/nginx.pid, but /run is owned by root, so our unprivileged user can’t create files there. Instead, we’ll have nginx write into a directory called /run/nginx which is owned by www-data. To do this, add this line:

RuntimeDirectory=nginx

systemd will automatically create this directory with the correct ownership.

Now, we need to move the PID file. Edit the line starting with PIDFile to read:

PIDFile=/run/nginx/nginx.pid
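Putting these edits together, the [Service] section of /etc/systemd/system/nginx.service should now contain lines like these (the other directives from the stock unit file stay as they were):

[Service]
User=www-data
Group=www-data
AmbientCapabilities=CAP_NET_BIND_SERVICE
RuntimeDirectory=nginx
PIDFile=/run/nginx/nginx.pid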

We’ll also need to tell nginx about this new PID file.

Edit the file /etc/nginx/nginx.conf. Change the line starting with pid to read:

pid /run/nginx/nginx.pid;

Now restart nginx. Run

sudo systemctl daemon-reload
sudo systemctl restart nginx

If you get an error, run this command to see a detailed error message.

sudo journalctl -u nginx
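If the restart succeeded, the master process should now be running as www-data rather than root, which you can confirm from the process list:

ps -C nginx -o user,pid,args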

Additional sandboxing

Note: the following section assumes you have a systemd version newer than 235. To see your systemd version, run systemctl --version.

Running nginx as a non-root user is a good first step, but what else can we do to make this more secure? Linux has many built-in sandboxing features which systemd can make use of.

I added the following to my systemd configuration for nginx.service:

# Process may not gain any capabilities besides the one we just gave it
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# Process is not allowed to gain new privileges using SUID binaries such as sudo
NoNewPrivileges=true
# Disables use of the personality(2) system call, which may have security bugs
LockPersonality=true
# Allows only common service-related system calls
SystemCallFilter=@system-service
# When a system call is disallowed, return an error code instead of killing the process
SystemCallErrorNumber=EPERM

You can download my full systemd service file and my nginx configuration here.

Using systemd-analyze

systemd ships with a tool to analyze how much each of the services on your system makes use of systemd-related security features. (Note: this report doesn’t consider non-systemd methods of sandboxing, such as a service dropping privileges using setuid.) Run this command to see the report:

SYSTEMD_EMOJI=0 systemd-analyze security

You can also get detailed information about a single unit by running

systemd-analyze security nginx

By following this guide, you can reduce systemd’s risk score for nginx from 9.5 (UNSAFE) to 5.0 (MEDIUM).

Further work

There are several other things you could do to improve this sandbox:

  • Make the syscall filter more restrictive. The @system-service filter is very broad and over-inclusive. Using perf, you can record exactly which syscalls a service makes, and allow only those syscalls; the commands after this list show one place to start. However, keep in mind that loading new plugins into nginx, or changing its configuration, may cause your syscall list to become out-of-date. For example, an nginx configuration which serves static files will use different syscalls than one which proxies traffic to another service. Here’s a writeup on how to do this: https://prefetch.net/blog/2017/11/27/securing-systemd-services-with-seccomp-profiles/
  • Disallow nginx from changing kernel tunables and modules.
  • Disallow nginx from connecting to unix domain sockets, netlink sockets, or opening raw sockets.
  • Whitelist which devices in /dev nginx is allowed to read/write.
  • Blacklist namespace-altering syscalls.
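As a starting point for tightening the filter, you can list everything @system-service currently allows and compare it with the syscalls nginx actually makes while you exercise your configuration (both commands assume reasonably recent systemd and perf builds):

systemd-analyze syscall-filter @system-service
sudo perf trace --summary -p "$(cat /run/nginx/nginx.pid)"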

However, I chose to not include these things. First, many of them would require an attacker to have root privilege anyway, so once the service is no longer running as root, they have little value. Second, they have some possibility of breaking someone’s configuration. The sandbox settings I show are intended to be general-purpose and work in a variety of contexts.

Testing notes

I have tested this configuration on recent versions of Debian, Fedora, and Ubuntu. Here’s what I’ve found:

  • Works on Debian Buster
  • Partially works on Debian Stretch (Note: You must comment out LockPersonality and SystemCallFilter.)
  • Doesn’t work on Fedora 32. The use of NoNewPrivileges interferes with SELinux somehow. If you skip the “Additional sandboxing” step, and substitute ‘nginx’ for ‘www-data’, it will work. This is possibly fixable, but I don’t have much knowledge of SELinux.
  • Works on Ubuntu 20.04
  • Partially works on Ubuntu 18.04. (Note: You must comment out SystemCallFilter.)