Thursday, August 10
catenating files in order of modification time, a bad solution
Setting: Linux - a recent Ubuntu. GNU coreutils.
I wanted to join a large set of small logfiles together into a single file, in
the order they were originally written. The combined length of the filenames
exceeded ARG_MAX, the kernel's limit on argument size, so cat * > foo
would fail with:
bash: /bin/cat: Argument list too long
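You can inspect the limit itself with getconf(1), and GNU xargs will report the limits it plans to respect (the exact numbers vary by system and by how big your environment is):

# the kernel's limit on the combined size of argv plus the environment:
getconf ARG_MAX
# GNU xargs can also report the limits it will actually use:
xargs --show-limits < /dev/null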
I knew I’d probably use some sort of find | xargs
combo, with NULs instead of newlines, because I couldn’t be entirely sure
the logfiles would never have spaces or other weirdness in their names.
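The basic NUL-safe shape, ignoring ordering for a moment, is the standard pattern below - it concatenates files in whatever order find(1) emits them, which is exactly what I didn't want:

# -print0 terminates each path with a NUL, and xargs -0 splits on NULs
# instead of on whitespace and quotes:
find . -name '*.log' -print0 | xargs -0 cat > all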
As usual, there’s a set of StackExchange answers for this. I wound up writing this ridiculous variant:
find . -name '*.log' -printf '%T@ %p\0' |
sort -nz |
sed -Ez 's/^[^ ]+ (.*)$/\1/' |
xargs -0 cat > all
A script with a test and some explanatory comments:
#!/bin/sh
# create some test files:
echo "a" > "a a.log"
echo "b" > "b b.log"
echo "c" > "c c.log"
# for each file, print mtime (seconds since the epoch), a space, and the
# path, with each record terminated by a NUL:
find . -name '*.log' -printf '%T@ %p\0' |
# the -z option to GNU sort(1) and sed(1) treats NUL as line delimiter
# sort lines numerically:
sort -nz |
# strip the leading timestamp - cut(1) would be the natural tool, but
# only newer coreutils give it a -z option (see the variant below):
sed -Ez 's/^[^ ]+ (.*)$/\1/' |
# feed filenames, separated by NULs, to cat(1), and
# redirect output to a file called "all":
xargs -0 cat > all
cat all
When run, this outputs:
a
b
c
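As an aside, newer GNU coreutils (8.25 and later, if I'm reading the release notes right) did add a -z option to cut(1), which would replace the sed step with something less regex-shaped:

# same pipeline with cut(1) standing in for sed(1);
# assumes a coreutils recent enough for cut -z:
find . -name '*.log' -printf '%T@ %p\0' |
sort -nz |
cut -z -d' ' -f2- |
xargs -0 cat > all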
I think this works. It is, no doubt, several kinds of wrong. It does function as a useful illustration of how silly things can get when everything is a string and quoting problems swamp an otherwise simple task. find(1) and xargs(1) really live in the space where the classical Unix approach to shells and filesystems exposes its sharp edges quickly.