Worker Tool for Computational Studies

programming
Published

February 12, 2023

Large statistical simulations and other computational studies often consist of a collection of independent tasks to be processed. For example, to study properties of an estimator, we may wish to vary the parameters of an assumed data-generating mechanism along with the sample size. The estimator requires substantial amount of computation because it makes use of a Markov chain Monte Carlo (MCMC) procedure. We generate 500 datasets in each setting, evaluate the estimator, and save the results to be summarized and later. From a parallel computing perspective, such tasks are described as embarrassingly parallel because they do not need to interact while they run.

There are usually many more tasks than available processors. Suppose we are limited to the use of \(k\) processors; we would like to keep these \(k\) processors busy with our workload without using additional processors. Ideally, we would like to do this without the need for manual intervention. Worker is a lightweight tool to help automate this process.

Suppose our study consists of a one parameter with 1000 levels. Suppose that we have set up 1000 corresponding directories, named as follows.

$ ls -1
run000
run001
...
run999

In each folder, suppose there is a script launch.R that runs the corresponding level of the simulation. An output file output.csv is placed in the directory upon successful completion of launch.R.

$ ls run000
launch.R    output.csv

We would like to run \(k\) of the launch.R scripts at a time until the collection of 1000 tasks is complete. The worker script attempts to automate this in a way that is agnostic of the underlying simulation. It loops through tasks and runs them sequentially, but multiple workers can be run simultaneously on the same study to achieve “embarrassingly parallel” parallel computing. Workers make use of lock files to avoid stepping on each other’s toes: each task is taken up by only one worker. This paradigm can be useful in shared computing environments where users may be asked to limit the number of processors requested at any given time. We can start, say, ten workers to ensure ten simultaneous tasks rather than having to continually monitor a scheduler’s queue and submit more tasks.

Here is an example of the syntax used to invoke each worker in this example.

$ worker.sh -p 'run???' -c 'Rscript launch.R'

The worker looks for folders whose names match pattern run???. When such a folder is found, it enters the folder and first checks to see if the task has previously been claimed by a worker (including itself in a previous iteration). If not, it claims the task and launches it via the specified command Rscript launch.R. The task is run in the worker’s foreground until completion. The worker terminates when it is no longer actively engaged and there are no unclaimed jobs remaining.

After all of the runs are completed, we will likely want to load each of the output.csv files and construct tables and plots to summarize the results. This post may be helpful if you are looking to construct Latex tables from a computational study.