Command-line tricks for basic data science operations
Some simple data science operations can be done using the command line interface without having to spin up an iPython notebook or an R shell.
I’ll try to explain a few commands using a sample dataset. Our sample dataset is written into a file named fruits.csv
and it looks like this,
fruit, colour, qty
apple, green, 1
banana, yellow, 2
carrot, orange, 3
apple, red, 4
Total number of records
wc -l
also counts the header. So be sure to subtract the total by 1.
wc -l fruits.csv
5 fruits.csv
View the first few lines
Use the head
command.
head fruits.csv
fruit, colour, qty
apple, green, 1
banana, yellow, 2
carrot, orange, 3
apple, red, 4
View the first two lines
You can use the head
command with the -n
argument.
head -n 2 fruits.csv
fruit, colour, qty
apple, green, 1
View the last two lines
Use the tail
command with the -n
argument.
tail -n 2 fruits.csv
carrot, orange, 3
apple, red, 4
View specific columns
cat fruits.csv | awk '{split($0, a, ","); print a[1] a[3]}'
fruit qty
apple 1
banana 2
carrot 3
apple 4
Let’s try to understand what’s happening here,
“AWK is a domain-specific language designed for text processing and typically used as a data extraction and reporting tool.”
The syntax for awk is
awk '{<operation>}'
In the command above, we perform a split operation followed by a print operation.
The split operation,
split($0, a, “,”);
$0 : Take the whole string
a : Initialize an array to store the values that are split
“,” : Use comma as the delimiter
The print operation,
print a[1] a[3]
Print the values from the array at index 1 and 3.
Get specific line numbers from a file
To get the values from the 3rd row through the 5th row, we can use sed
with the -n
argument.
sed -n ‘3,5p’ fruits.csv
banana, yellow, 2
carrot, orange, 3
apple, red, 4
Find the unique counts for each value in a given column
Let us find the count of unique values in the fruit column. Note that the index of the fruit column is 1.
cat fruits.csv | awk '{split($0, a, ","); print a[1]}' | sort | uniq | wc -l
4
What’s happening?
split the columns using the comma delimiter and get them into an array. Sort the values in the array at index number 1 and then get the unique values. Finally, count the number of unique values.
This can be an expensive operation in larger datasets because we sort the values before finding the unique values.
Get the unique values in a column
Let’s get the unique colours from the colour column.
What happening here is similar to the explanation above, but instead of counting we display the output.
cat fruits.csv | awk '{split($0, a, ","); print a[2]}' | sort | uniq
colour
green
orange
red
yellow
Get the count of each kind of fruit (a group by operation)
Here, we are doing a split and sort as before, but instead of just finding the unique values, we find the unique count using the uniq -c
command.
cat fruits.csv | awk '{split($0, a, ","); print a[1]}' | sort | uniq -c
2 apple
1 banana
1 carrot
1 fruit
Conditional expressions after group by
As a follow up to the above command, let us show the rows where the count is greater than 1.
cat fruits.csv | awk '{split($0, a, ","); print a[1]}' | sort | uniq -c | awk '{ if ($1 > 1) print $2":"$1}'
apple:2
Hope that was useful.
I’ll update this page as and when I come across more useful commands.