Framework for parallel/distributed log file processing There are two small programs written in C: split-print and cat-range-exec These C programs were only necessary because we were processing a single large log file rather than already split-up log files (which are far more common on any sane webhost). cat-range-exec allows languages that lack facilities to do random file I/O work effectively on partitioned data. The key component is one very short GNU Makefile to describe parallelism and task dependencies: runner.mk And finally the frontend, a POSIX shell script, used to launch runner.mk: wf2-mk The backend code is selected by wf2-mk, and there have been complete implementations in Ruby, Perl (multiple), and POSIX shell + awk. However, the API has since been modified for performance reasons and the shell + awk implementation is the only one still standing. Each backend has two processes: 1) worker - works on a partition provided by cat-range-exec 2) reducer - joins the output of the worker-provided data. Some of them can even be mixed and matched as long as the reducer can understand the workers output. The original sh+awk implementation was prone to integer overflow in the reducer phase, so I quickly wrote a Perl reducer that could handle limitations in mawk. Because of the partitioning done by split-print and cat-range-exec, neither the worker nor reducer need to deal with the complexity of byte/file-offsets in the data and can indeed be run independently of runner.mk GNU Make? What? =============== It's true. The heart of parallel processing uses GNU Make to coordinate dependencies for parallelization. I personally find the GNU Makefile syntax much better than any C-based syntax I've seen to describe parallelization GNU Make is well-tested, fast and readily available on all modern UNIXes. The process call stack is like this: wf2-mk INPUT_FILE `- runner.mk (calls split-print to calculate byte ranges) === REDUCERS (runs and tails files in background while workers work) === `- reducer[uhits] `- reducer[ubytes] `- reducer[s404s] `- reducer[clients] `- reducer[refs] === WORKERS (atomic appends with write(2) ==== `- cat-range-exec worker >> {uhits,ubytes,s404s,clients,refs}.tail `- cat-range-exec worker >> {uhits,ubytes,s404s,clients,refs}.tail `- cat-range-exec worker >> {uhits,ubytes,s404s,clients,refs}.tail ... `- cat-range-exec worker >> {uhits,ubytes,s404s,clients,refs}.tail `- cat-range-exec worker >> {uhits,ubytes,s404s,clients,refs}.tail `- cat-range-exec worker >> {uhits,ubytes,s404s,clients,refs}.tail === FINALIZATION PHASE (runs when all workers are done) === 1. creates fifo 2. appends blank line to files tailed by reducers `- reducers read blank line, sort and format output reducers then write to fifo created at 1) when complete `- wait for fifos to be written to, signaling reducer completion 3. cat temporary output files created by reducers tail -f is used by the reducers to follow files that are being appended to. On POSIX-compliant filesystems, write(2) calls to the same file opened with O_APPEND will be atomic. Since awk uses buffered I/O by default, fflush() is called after every print statement to force the atomic write to the kernel.