Slice N Dice

Jun 22, 2019 16:57 · 770 words · 4 minute read shell data

Today we will have a closer look as to what data manipulation, transformation, cleaning, and querying can be done, out-of-the-box, on the command line. Most of the tools discussed here, such as tr or awk have been around for, like, ever.

Others, for example jq, might be younger but given the omnipresence of JSON-based payloads I’d argue it is indispensible. Oftentimes it makes much more sense to use jq to pull out a deeply-nested value rather than grep-ing or awk-ing it—less brittle and easier to understand, that is, to maintain. But this is the topic of another post, this time we focus on the basics.

So, let’s rock. In the following exercises, assume we have a file input.csv with this content:

$ cat input.csv
colA,colB,colC
row1A,row1B,row1C
row2A,row2B,row2C
row3A,row3B,row3C

In the following, please note that I’m executing the command on macOS, using Bash. YMMV!

Extracting rows with head and tail

Want to see the column header colX only?

# piping the output of cat to head (reads it from stdin):
$ cat input.csv | head -n 1
colA,colB,colC

# directly invoking head on the file:
$ head -n 1 input.csv
colA,colB,colC
~

Want to see row 2 only?

# first, tail gets us the last two rows, head filters it down to the one we want:
$ tail -n 2 input.csv | head -n 1
row2A,row2B,row2C

Extracting vertical ranges with cut

Now we move on from slicing horizontally to pulling out stuff vertically.

First up: want only the 2nd column?

$ cut -d ',' -f2 input.csv
colB
row1B
row2B
row3B

As usual, one can obviously combine that with above around selecting rows, so if you want only the value in the 2nd row of the 2nd column, for example:

$ cut -d ',' -f2 input.csv | tail -n 2 | head -n 1
row2B

You can literally implement SQL in Bash. Dope, right?

Converting and fixing stuff with tr

Have a CSV and want to convert it to a TSV file?

$ cat input.csv | tr ',' '\t'
colA    colB    colC
row1A   row1B   row1C
row2A   row2B   row2C
row3A   row3B   row3C 

What if you want to uppercase the rows, only?

$ export INPUTFILE=input.csv

$ cat $INPUTFILE | \
  head -n 1 && \
  tail -n +2 $INPUTFILE | \
  tr "[:lower:]" "[:upper:]"
colA,colB,colC
ROW1A,ROW1B,ROW1C
ROW2A,ROW2B,ROW2C
ROW3A,ROW3B,ROW3C

OK, in above command a lot is going on, so let’s break it down:

  1. The cat command acts as the source, feeding the pipeline.
  2. The head command slices off the header row and prints it out.
  3. the tail command then operates on the rest of the file, that is, from the 2nd row on.
  4. The tr command, finally, converts the rows to upper case.

Some more fun: sed and awk

When tr is not enough … what about adding a http:// in front of each entry, other than the header?

$ cat $INPUTFILE | sed s.row.http://row.g
colA,colB,colC
http://row1A,http://row1B,http://row1C
http://row2A,http://row2B,http://row2C
http://row3A,http://row3B,http://row3C

Hmmm, isnt’ it usually s/XXX/YYY/g? Yep. But here we have the / in the value we want to replace. Turns out you can use other characters as the delimiter in sed, and this is a case where we’re forced to do this.

One more, this time with awk. My favorite of the three is the k aka Brian Kernighan and if you want to spend an hour well, sit back and relax while watching his presentation on successful language design.

Imaging the following: you want to process certain parts of each row (microparsing) and do something fancy with it. Let’s say, we want to extract the part after the row, so for row2C that would be 2C and then we want to have all those resulting values nicely lined up in one row. How about that?

$ cat $INPUTFILE | awk -F',' \
                   'NR>1 { for(i = 1; i <= NF; i++)  { out = substr($i, 4) ; printf "%-10s", out  } }'
1A        1B        1C        2A        2B        2C        3A        3B        3C

The NR>1 instructs awk to skip the first row (headers). Then we iterate over all the columns, with NF you might have guessed it the number of fields. For each record we then use the substr function to extract the characters from the 4th to the end of the record and then print it out in an aligned fashion.

OK, and with this we wrap it. Have a very awesome day and remember: the best way to learn any language (including computer-related languages) is to actively read and write it, on a daily basis ;)

Got a better way to slice and dice data on the command line, with out-of-the-box tooling? Let us know!