Sunday, August 4

Yesterday I spent an hour sitting on the floor sorting painstakingly through my change jar. (Big score: Two wheathead pennies, one from 1954 and one from 1929, minted in Denver. Lesser finds include four Sacagawea dollars, some Hungarian forints, and a 20p piece from 2004.) I also oiled the hinges on my front door, played pinball for 45 minutes, read about programming languages, drank most of a bottle of red wine, and lay on my floor staring at the ceiling past a rack of drying towels and boxer briefs for a minor ocean of time.

I'm prepared by now to acknowledge that living alone for very long is dangerous. I'm a weird enough human being in any setting — weirder than most, anyway — but being around other people tends to regulate this. Enter into some isolated stretch with no responsibilities more ramified than doing my own laundry and eating enough to stay alive, and I become a complete space alien by about hour 15. Every now and then I think about abandoning my current life and disappearing alone to some distant cabin-in-the-mountains situation, and then I remember that I'm right on the edge of permanent outsider weirdo status as it stands.

I used to wonder where all the weird old guys came from and how they got to be what they are. I'm pretty sure now that if there's a track, I've already been on it for years. It's like C.S. Lewis's conception of the afterlife in "The Great Divorce": You get to heaven/hell and you realize you've been there all along.

This is how writing becomes a sort of lifeline. Organizing thoughts coherently enough to transmit them to others isn't all that distinguishable from having coherent thoughts in the first place. You can humanize yourself just by trying to communicate something to other humans, at least temporarily.

I get curious if I’m repeating myself too much, writing this, so I write something like the following Perl program:


#!/usr/bin/env perl

# A quick way to check myself on reusing a word too much in a given
# scrap of writing.

use strict;
use warnings;
use 5.10.0;

my $shortest_word_cared_about   = 2;
my $least_instances_cared_about = 2;

my $everything;
{
  # get everything at once, more or less, by locally unsetting the record
  # separator and slurping the file into $everything in one invocation of
  # the diamond operator.

  local $/ = undef;
  $everything = <>;
}

# strip some punctuation:
$everything =~ s{
  [
    ,.';"()<>&+/ \\ \$ \} \{
  ]
}{ }gx;

# chop into individual "words" by any amount of whitespace:
my (@all_words) = split /\s+/, $everything;

my %counts;
foreach my $word (@all_words) {
  next if length($word) < $shortest_word_cared_about;
  ++$counts{ $word };
}

foreach my $word (keys %counts) {
  my $count = $counts{ $word };
  next if $count < $least_instances_cared_about;
  say "$count\t$word";
}

I suddenly realize that my muscles are incredibly rusty here. For one thing, I probably could just have looked at the output of ptx(1) to get the same idea. (ptx is cool as shit, and if you haven’t played with it, you owe yourself a bit of a detour here.) Failing that, I suspect this could be a much shorter shell script, probably even a one-liner of a form I write constantly:

words [filename] | sort | uniq -c | sort -nr | head -20

…where words is whatever command will split a file into words, one per line.
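
One way to fill in that blank, as a rough sketch: the old tr trick of squeezing every run of non-alphanumeric characters into a single newline. The function name split_words is made up for illustration, and this assumes a tr (GNU or BSD) that pads the second set to match the first. It splits contractions like "haven't" at the apostrophe, so it is a crude notion of "word" at best.

```shell
# split_words: hypothetical stand-in for "words" in the pipeline above.
# Reads standard input and prints one word per line.
split_words() {
  # -c: complement the set (everything that is NOT a letter or digit)
  # -s: squeeze each run of those characters into a single newline
  tr -cs '[:alnum:]' '\n'
}

# Most frequent words in a scrap of text, most frequent first:
printf 'the cat and the hat and the bat\n' |
  split_words | sort | uniq -c | sort -nr | head -20
```

With a file instead of a literal string, the same thing would read `split_words < filename | sort | uniq -c | sort -nr | head -20`.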

What’s weird is, while I know I could do it with sed or a Perl one-liner or something, I can’t think of the command specifically designed to do this. tsort seems relevant although I’m not exactly sure what it does. ptx piped to the right invocation of cut then sort and uniq would get you there. But these seem cheesy. It’s odd that there isn’t some standard encapsulation of this operation.

I would Google this, and probably be immediately embarrassed by whatever I’m forgetting, but the network is basically nonexistent here, and I know I can sketch out the command I want with the resources at hand.

I eventually wind up with words, which documents itself thus:

Usage: words [-ucaih] [-s n] [-b n] [-d pattern] [file]
Split input into individual words, crudely understood.

    -u:  print each unique word only once
    -c:  print a count of words and exit
    -uc: print a count for each unique word
    -a:  strip non-alphanumeric characters for current locale
    -i:  coerce all to lowercase, ignore case when considering duplicates
    -h:  print this help and exit

    -s n, -b n: (s)hortest and (b)iggest words to pass through
    -d pattern: word delimiter (a Perl regexp)

If no file is given, standard input will be read instead.

Examples:

   # list all unique words, ignoring case, in foo:
   words -ui ./foo

   # find ten most-used words longer than 6 letters in foo:
   words -uci -s6 foo | sort -nr | head -10

With this feature set, it tries to hit the same sweet spot as the versions of classic Unix utilities I use most often: constrained, on the one hand, to a very specific task or concept, but on the other not without some incidental conveniences. Its basic task is to split up a file into individual words by whitespace, but it was easy to add uniq(1)-like options, and these seem likely to shorten a lot of one-liners, so why not? Similarly, if you’re going to look at frequencies, you may not care much about case, so we provide an -i flag by analogy to its “ignore case” counterparts on commands like grep(1) and many regular expression engines.
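
For comparison, here is roughly what the second usage example looks like when spelled out with standard tools only. This is a sketch under assumptions, not how words itself is implemented: the function name top_words_sketch is hypothetical, -s n is read as "at least n characters", and the count-then-word output is whatever uniq -c happens to print rather than the format words uses.

```shell
# top_words_sketch MIN [FILE...]: hypothetical approximation of
#   words -uci -sMIN FILE | sort -nr
# built from cat, tr, awk, sort, and uniq. Reads stdin if no FILE is given.
top_words_sketch() {
  min=$1
  shift
  cat -- "$@" |
    tr '[:upper:]' '[:lower:]' |              # like -i: fold everything to lowercase
    tr -s '[:space:]' '\n' |                  # one word per line, split on whitespace
    awk -v min="$min" 'length($0) >= min' |   # like -s: drop words shorter than MIN
    sort | uniq -c | sort -nr                 # like -uc: count each unique word
}

# Ten most-used words of at least 6 characters in a scrap of text:
printf 'Banana banana apple Cherry cherry CHERRY kiwi\n' |
  top_words_sketch 6 | head -10
```

Nothing here strips punctuation the way -a is described as doing, so "foo," and "foo" would be counted separately. That is five commands and a handful of flags to remember, which is more or less the case for giving the operation a name.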

What I honestly can’t decide is what side of the “you’ve gotta be kidding me” vs. “mildly useful abstraction” line this falls on. Is this so obviously a trivial function of existing tools as to be meaningless? Or does naming the operation, no matter how simple it is at heart, make it easier to use for subsequent problems?

Update now that I have network: Script now lives in bpb-kit.

tags: bpb-kit, cli, perl, technical

p1k3 / 2013 / 8 / 4