Command-line tricks for basic data science operations

Akilesh Anandharaj
3 min readMay 29, 2020

Some simple data science operations can be done using the command line interface without having to spin up an iPython notebook or an R shell.

I’ll try to explain a few commands using a sample dataset. Our sample dataset is written into a file named fruits.csv and it looks like this,

fruit, colour, qty
apple, green, 1
banana, yellow, 2
carrot, orange, 3
apple, red, 4

Total number of records

wc -l also counts the header. So be sure to subtract the total by 1.

wc -l fruits.csv
5 fruits.csv

View the first few lines

Use the head command.

head fruits.csv
fruit, colour, qty
apple, green, 1
banana, yellow, 2
carrot, orange, 3
apple, red, 4

View the first two lines

You can use the head command with the -n argument.

head -n 2 fruits.csv
fruit, colour, qty
apple, green, 1

View the last two lines

Use the tail command with the -n argument.

tail -n 2 fruits.csv
carrot, orange, 3
apple, red, 4

View specific columns

cat fruits.csv | awk '{split($0, a, ","); print a[1] a[3]}'
fruit qty
apple 1
banana 2
carrot 3
apple 4

Let’s try to understand what’s happening here,

“AWK is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool.”

The syntax for awk is

awk '{<operation>}'

In the command above, we perform a split operation followed by a print operation.

The split operation,

split($0, a, “,”);

$0 : Take the whole string

a : Initialize an array to store the values that are split

“,” : Use comma as the delimiter

The print operation,

print a[1] a[3]

Print the values from the array at index 1 and 3.

Get specific line numbers from a file

To get the values from the 3rd row through the 5th row, we can use sed with the -n argument.

sed -n ‘3,5p’ fruits.csv
banana, yellow, 2
carrot, orange, 3
apple, red, 4

Find the unique counts for each value in a given column

Let us find the count of unique values in the fruit column. Note that the index of the fruit column is 1.

cat fruits.csv | awk '{split($0, a, ","); print a[1]}' | sort | uniq | wc -l
4

What’s happening?

split the columns using the comma delimiter and get them into an array. Sort the values in the array at index number 1 and then get the unique values. Finally, count the number of unique values.

This can be an expensive operation in larger datasets because we sort the values before finding the unique values.

Get the unique values in a column

Let’s get the unique colours from the colour column.

What happening here is similar to the explanation above, but instead of counting we display the output.

cat fruits.csv | awk '{split($0, a, ","); print a[2]}' | sort | uniq
colour
green
orange
red
yellow

Get the count of each kind of fruit (a group by operation)

Here, we are doing a split and sort as before, but instead of just finding the unique values, we find the unique count using the uniq -c command.

cat fruits.csv | awk '{split($0, a, ","); print a[1]}' | sort | uniq -c
2 apple
1 banana
1 carrot
1 fruit

Conditional expressions after group by

As a follow up to the above command, let us show the rows where the count is greater than 1.

cat fruits.csv | awk '{split($0, a, ","); print a[1]}' | sort | uniq -c | awk '{ if ($1 > 1) print $2":"$1}'
apple:2

Hope that was useful.

I’ll update this page as and when I come across more useful commands.

--

--