Slice N Dice
Today we will have a closer look as to what data manipulation, transformation,
cleaning, and querying can be done, out-of-the-box, on the command line.
Most of the tools discussed here, such as tr
or awk
have been around for,
like, ever.
Others, for example jq
,
might be younger but given the omnipresence of JSON-based payloads I’d argue
it is indispensible. Oftentimes it makes much more sense to use jq
to pull
out a deeply-nested value rather than grep-ing or awk-ing it—less brittle and
easier to understand, that is, to maintain. But this is the topic of another
post, this time we focus on the basics.
So, let’s rock. In the following exercises, assume we have a file input.csv
with this content:
$ cat input.csv
colA,colB,colC
row1A,row1B,row1C
row2A,row2B,row2C
row3A,row3B,row3C
In the following, please note that I’m executing the command on macOS, using Bash. YMMV!
Extracting rows with head
and tail
Want to see the column header colX
only?
# piping the output of cat to head (reads it from stdin):
$ cat input.csv | head -n 1
colA,colB,colC
# directly invoking head on the file:
$ head -n 1 input.csv
colA,colB,colC
~
Want to see row 2 only?
# first, tail gets us the last two rows, head filters it down to the one we want:
$ tail -n 2 input.csv | head -n 1
row2A,row2B,row2C
Extracting vertical ranges with cut
Now we move on from slicing horizontally to pulling out stuff vertically.
First up: want only the 2nd column?
$ cut -d ',' -f2 input.csv
colB
row1B
row2B
row3B
As usual, one can obviously combine that with above around selecting rows, so if you want only the value in the 2nd row of the 2nd column, for example:
$ cut -d ',' -f2 input.csv | tail -n 2 | head -n 1
row2B
You can literally implement SQL in Bash. Dope, right?
Converting and fixing stuff with tr
Have a CSV and want to convert it to a TSV file?
$ cat input.csv | tr ',' '\t'
colA colB colC
row1A row1B row1C
row2A row2B row2C
row3A row3B row3C
What if you want to uppercase the rows, only?
$ export INPUTFILE=input.csv
$ cat $INPUTFILE | \
head -n 1 && \
tail -n +2 $INPUTFILE | \
tr "[:lower:]" "[:upper:]"
colA,colB,colC
ROW1A,ROW1B,ROW1C
ROW2A,ROW2B,ROW2C
ROW3A,ROW3B,ROW3C
OK, in above command a lot is going on, so let’s break it down:
- The
cat
command acts as the source, feeding the pipeline. - The
head
command slices off the header row and prints it out. - the
tail
command then operates on the rest of the file, that is, from the 2nd row on. - The
tr
command, finally, converts the rows to upper case.
Some more fun: sed
and awk
When tr
is not enough … what about adding a http://
in front of each entry,
other than the header?
$ cat $INPUTFILE | sed s.row.http://row.g
colA,colB,colC
http://row1A,http://row1B,http://row1C
http://row2A,http://row2B,http://row2C
http://row3A,http://row3B,http://row3C
Hmmm, isnt' it usually s/XXX/YYY/g
? Yep. But here we have the /
in the value
we want to replace. Turns out you can use other characters as the delimiter
in sed
, and this is a case where we’re forced to do this.
One more, this time with awk
. My favorite of the three is the k
aka
Brian Kernighan and if you want to spend an hour well, sit back and relax while
watching his presentation on successful language design.
Imaging the following: you want to process certain parts of each row
(microparsing) and do something fancy with it. Let’s say, we want to extract the
part after the row
, so for row2C
that would be 2C
and then we want to have
all those resulting values nicely lined up in one row. How about that?
$ cat $INPUTFILE | awk -F',' \
'NR>1 { for(i = 1; i <= NF; i++) { out = substr($i, 4) ; printf "%-10s", out } }'
1A 1B 1C 2A 2B 2C 3A 3B 3C
The NR>1
instructs awk
to skip the first row (headers). Then we iterate over
all the columns, with NF
you might have guessed it the number of fields. For each
record we then use the substr
function to extract the characters from the 4th
to the end of the record and then print it out in an aligned fashion.
OK, and with this we wrap it. Have a very awesome day and remember: the best way to learn any language (including computer-related languages) is to actively read and write it, on a daily basis ;)
Got a better way to slice and dice data on the command line, with out-of-the-box tooling? Let us know!