Thursday, August 10

catenating files in order of modification time, a bad solution

Setting: Linux - a recent Ubuntu. GNU coreutils.

I wanted to join a large set of small logfiles into a single file, in the order they were originally written. The expanded list of filenames was too long for ARG_MAX, so cat * > foo failed with:

bash: /bin/cat: Argument list too long
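
The limit itself is easy to check with getconf(1). The exact number depends on the kernel and stack configuration, but on a typical Linux system it comes out to a couple of megabytes:

# maximum combined length, in bytes, of arguments plus environment:
getconf ARG_MAX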

I knew I’d probably use some sort of find | xargs combo, with NULs instead of newlines because I couldn’t be entirely sure that logfiles would never have spaces or other weirdness in the names.
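
Without the ordering requirement, the basic shape is find's -print0 feeding xargs -0, which batches the filenames into as many cat invocations as necessary while keeping odd names intact:

# NUL-delimited filenames, in whatever order find walks the tree:
find . -name '*.log' -print0 | xargs -0 cat > all

The whole problem is getting a sort by mtime into the middle of that pipeline.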

As usual, there’s a set of StackExchange answers for this. I wound up writing this ridiculous variant:

find . -name '*.log' -printf '%T@ %p\0' |
  sort -nz |
  sed -Ez 's/^[^ ]+ (.*)$/\1/' |
  xargs -0 cat > all

A script with a test and some explanatory comments:

#!/bin/sh

# create some test files with spaces in their names:
echo "a" > "a a.log"
echo "b" > "b b.log"
echo "c" > "c c.log"

# print mtime (seconds since the epoch), a space, and the path,
# each record terminated by a NUL:
find . -name '*.log' -printf '%T@ %p\0' |

  # the -z option tells GNU sort(1) and sed(1) to treat NUL as the line delimiter

  # sort records numerically, i.e. by the leading timestamp:
  sort -nz |

  # strip the leading timestamp - I'd use cut(1) here, but only newer
  # versions have a -z option:
  sed -Ez 's/^[^ ]+ (.*)$/\1/' |

  # feed the NUL-separated filenames to cat(1); the redirect applies to
  # xargs itself, so output accumulates in "all" even when xargs has to
  # run cat more than once:
  xargs -0 cat > all

cat all

When run, this outputs:

a
b
c
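
An aside on cut(1): newer GNU coreutils do ship a -z option for it (added somewhere around 8.25, I believe), so on a sufficiently recent system the sed stage could be a plain field extraction instead. Treat this as a sketch:

find . -name '*.log' -printf '%T@ %p\0' |
  sort -nz |
  # keep field 2 onward, so paths containing spaces survive:
  cut -z -d' ' -f2- |
  xargs -0 cat > all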

I think this works. It is, no doubt, several kinds of wrong. It does function as a useful illustration of how silly things can get when everything is a string and quoting problems overwhelm an otherwise simple task. find(1) and xargs(1) really seem to live in the space where classical Unix shell and filesystem approaches expose their sharp edges quickly.
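
For contrast, the naive newline-delimited version is exactly where the quoting trouble starts. With the test files above, xargs splits "a a.log" at the space, and cat goes looking for files that don't exist - something along these lines:

ls -rt *.log | xargs cat
# cat: a: No such file or directory
# cat: a.log: No such file or directory
# ...and so on for the rest.

And at the original scale, the *.log glob would run into ARG_MAX before xargs even entered the picture.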