Quick introduction to data science and how to get started with it.

Akilesh Anandharaj
13 min read · May 19, 2020

Six years ago, when I wanted to learn data science, I spent hours trying out every tutorial I came across and realized that one of the key challenges of self-learning is knowing where to start and how to proceed. Six years later, with a decent understanding of the process, I wanted to share my thoughts on the subject.

I will be addressing these 5 key questions here.

  1. What on earth is data science?
  2. What is the general process of approaching a data science problem?
  3. What are the types of roles in data science?
  4. How do I get started?
  5. How to set up your portfolio?

Let’s go through the key questions one at a time.

1. What is data science?


In simple English, data science is the process of making sense of data and using it to solve a problem.

Some examples,

  • Using pollution data to visualize states with high pollution rates and the causes behind them.
  • Using weather data to predict if it will rain tomorrow.
  • Using COVID data to analyze the spread of the virus across the country.

2. What is the general process of approaching a data science problem?

I am going to add a boring flow chart here.

(Flowchart: Problem definition → Data acquisition → Data cleaning → Data processing → Data analysis and visualization → Modeling → Process automation)

Bear with me, all of these seemingly confusing terms will soon begin to make sense.

Let’s approach the flowchart above, one tile at a time, and with an example.

Our example story goes like this,

Picture of overfishing somewhere in the ocean

There is news of overfishing somewhere in one of the oceans, and the local authority has asked us for help with it. He doesn’t leave us in the middle of the ocean, though; he provides us with some information.

  • The fishing trawlers come in various lengths, widths, and capacities.
  • Big trawlers have big fishing nets, smaller ones have smaller nets. (That is so not obvious!)
  • The big trawlers are all parked very far from each other and there is no way for one of them to know if the other one is at sea.
  • Overfishing happens because the big trawlers go to the ocean all at the same time.
  • Every trawler reports the fish they caught after coming to shore, and the harbour authority maintains a system in which he and his team have recorded every activity for the past 10 years. (that’s a lot of data!)
  • The fishermen have been told about this problem and they are willing to save the oceans, but they need some guidance on how to do it.
  • The boats have been equipped with a GPS tracking device, and every boat’s movements are captured on a computer onshore. He also mentions that the co-ordinates are in the form of a table.

Now, we have a challenge that needs to be addressed and we’ll go step by step in addressing it.

Problem definition:


This is simply defining the issue you want to address. In our case, our problem definition could be twofold:

  1. Prevent overfishing.
  2. Have a system in place where the trawler owners can know which types of trawlers (big or small) are at sea.

Data acquisition:


We know that we have some data about the fishes being caught each time, but we have no idea about the sizes of the trawlers or the size of the fishing nets they use. To get these details, we go down to the harbour and start talking to every trawler owner to obtain details about their boats. We take note of this data in an Excel sheet to be used later. Since there are about 200 trawlers, we end up doing our data entry rather quickly.

At this point, we have acquired three datasets. We obtained two from the data owners and we collected one ourselves. The datasets we now have in hand are,

  1. Trawler dataset: Information about the trawlers (we collected this)
  2. Fishes caught dataset: Information on the fish being caught
  3. GPS dataset: GPS information of every boat movement for the past ten years.

This sums up our data acquisition process.
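
In code, this step often boils down to loading the raw files into dataframes. Here is a minimal pandas sketch; the file names and formats are assumptions, since we don’t know how the harbour authority’s system exports its data.

```python
import pandas as pd

# Hypothetical file names -- the real sources will differ.
trawlers = pd.read_excel("trawler_survey.xlsx")  # the sheet we filled in at the harbour
catches = pd.read_csv("fishes_caught.csv")       # the harbour authority's records
gps = pd.read_csv("gps_tracks.csv")              # ten years of boat movements

# A quick sanity check on what we loaded.
print(trawlers.shape, catches.shape, gps.shape)
```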

Data cleaning:


Let us take a look at our datasets one after the other starting with the Trawler dataset.

The first 5 rows of our data look like this,

Trawler data

Just by glancing at the data, we observe that row 5 in our dataset lists the trawler length as 2000m (that’s 2km). Also, the width of trawler 3 is 300m. These numbers seem a bit off in comparison to the rest of the data.

To confirm, we quickly call up the harbour authority and ask him if there are any trawlers that are 2km in length. His response is that the average length of a trawler is about 20 to 30m, with some reaching up to 40m, but there are no trawlers that are 2km long. We must have made an error while doing the data entry quickly.

We also ask him for the phone number of the owner of trawler 5, ring him up and ask for his trawler length and find out that it is 20m. We then do the same for trawler 3 and find out that the width was 30m and not 300m. We also gather information from the harbour authority that the width does not exceed 30m.

Armed with this new information, we go through each row in our dataset and verify that the length does not exceed 40m and the width does not exceed 30m. In this process, we identify trawler 4 as having a width of 50m. We call up the owner of trawler 4, but he stands firm and insists his trawler is 50m wide. Since there is no point in arguing further, and since we have neither the time nor the know-how to measure a trawler’s width, we decide to fix this value ourselves.

The easiest possible solution seems to be to take the average width of trawlers with a length of 20m and replace the 50m with the value obtained. Let us say we got 25m as the average, and we fill that in.
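
Here is a minimal pandas sketch of these checks and fixes; the file name and the column names (trawler_id, length_m, width_m) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical file and column names -- the real sheet may differ.
trawlers = pd.read_excel("trawler_survey.xlsx")

# Flag rows that break the limits the harbour authority gave us:
# length <= 40m, width <= 30m.
suspect = trawlers[(trawlers["length_m"] > 40) | (trawlers["width_m"] > 30)]
print(suspect)  # trawlers 3, 4 and 5 from the story would show up here

# Fix the typos confirmed over the phone.
trawlers.loc[trawlers["trawler_id"] == 5, "length_m"] = 20
trawlers.loc[trawlers["trawler_id"] == 3, "width_m"] = 30

# Trawler 4's owner insists on 50m, so we impute instead: use the
# average width of the other trawlers that are 20m long.
mask = (trawlers["length_m"] == 20) & (trawlers["trawler_id"] != 4)
trawlers.loc[trawlers["trawler_id"] == 4, "width_m"] = trawlers.loc[mask, "width_m"].mean()
```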

We then take a look at the fishes caught and GPS datasets and do some cleaning on them. We notice that both datasets have a date column; what exactly we cleaned in the dates and how we cleaned it is up to your imagination. Here is an example of how unclean dates can be.
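
As an illustration, here is a small pandas sketch that normalizes a few made-up date formats into proper dates (it assumes pandas 2.0+ for format="mixed"):

```python
import pandas as pd

# Made-up examples of the inconsistent date formats you find in the wild.
raw_dates = pd.Series(["2020-05-19", "19/05/2020", "May 19, 2020", "19.05.2020"])

# pandas 2.0+ can parse each element on its own with format="mixed".
# Ambiguous dates like "05/06/2020" would need dayfirst or an explicit format.
cleaned = pd.to_datetime(raw_dates, format="mixed")
print(cleaned)
```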

This sums up the data cleaning process.

Data processing:


Now that we have cleaned our datasets, we take a look at the fishes caught dataset. It looks like this,

Fishes caught dataset

We observe that there are different species of crabs, turtles, fishes, and sharks being caught. Categorizing them would help us answer questions like “How many sharks were caught?”.

We can categorize Coconut crab and Spider crab as Crabs; Tuna and Herring as Fishes; Turtle as Turtles; and Basking shark as Sharks.

Our fishes caught dataset looks like this now,

Fishes caught dataset

This process of taking raw data and deriving new and useful columns from it can be generalized as data processing. Data processing does not only mean adding new columns; sometimes we also drop a few irrelevant columns that do not add any value to our analysis.
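
Here is a minimal pandas sketch of deriving such a category column; the column names and counts are made up for illustration.

```python
import pandas as pd

# A hypothetical slice of the fishes caught dataset.
catches = pd.DataFrame({
    "species": ["Coconut crab", "Tuna", "Turtle", "Basking shark", "Herring"],
    "count": [12, 3000, 2, 1, 1800],
})

# Map each species to its broader category.
category_map = {
    "Coconut crab": "Crabs",
    "Spider crab": "Crabs",
    "Tuna": "Fishes",
    "Herring": "Fishes",
    "Turtle": "Turtles",
    "Basking shark": "Sharks",
}
catches["category"] = catches["species"].map(category_map)
print(catches)

# Dropping a column that adds no value is data processing too, e.g.:
# catches = catches.drop(columns=["irrelevant_column"])
```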

Data analysis and visualization:


Data analysis is the cool part: asking questions about your dataset, lots of questions. This is where you let your curiosity run wild.

Here are some questions you can ask with the fishes caught dataset.

  • How many fishes, turtles, and crabs were caught each day?
  • How many sharks were caught in total?
  • How many crabs were caught?
  • How many trawlers operated every day?
  • What was the average number of fishes caught every day?
  • What was the average size of the trawlers that caught more than 3000 fishes?
  • Which trawlers caught more turtles and sharks?
  • and more…

Let us just take the first question: “How many fishes, turtles, and crabs were caught each day?”

The answer to that question looks like this,

Sum of sea animals caught each day

Tabular formats like this are not very useful visually. Although they give us a lot of information, the visualization below immediately tells us that the number of sea animals caught has reduced from day one to day two. That is why visualization plays such an important role in data science.

Sea animals count each day
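
Here is a minimal sketch of how both the table and the chart could be produced with pandas and matplotlib; the column names and counts are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical processed catches, with made-up counts.
catches = pd.DataFrame({
    "day": ["day 1", "day 1", "day 1", "day 2", "day 2", "day 2"],
    "category": ["Fishes", "Crabs", "Turtles", "Fishes", "Crabs", "Turtles"],
    "count": [3000, 120, 4, 2200, 90, 2],
})

# The tabular answer: totals per day and category.
daily = catches.pivot_table(index="day", columns="category", values="count", aggfunc="sum")
print(daily)

# The visual answer: a bar chart makes the drop from day 1 to day 2 obvious.
daily.plot(kind="bar")
plt.ylabel("Sea animals caught")
plt.show()
```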

Modeling:


Modeling is where we take the data and try to predict an outcome. For example, we could build a model to answer the following questions.

  1. At what time of the year are big trawlers likely to go to sea?
  2. What types of sea animals could we end up catching at a specific location?

What machine learning is and the types of machine learning are beyond the scope of this article; a simple Google search will turn up tons of really good posts on that.
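
Still, as a small taste of the second question, here is a minimal scikit-learn sketch; the features and training data are entirely made up.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Entirely made-up training data: where and when a trawler fished,
# and the main category it caught there.
data = pd.DataFrame({
    "latitude": [10.1, 10.3, 12.5, 12.7, 10.2, 12.6],
    "longitude": [75.0, 75.2, 74.8, 74.9, 75.1, 74.7],
    "month": [1, 1, 6, 6, 2, 7],
    "category": ["Fishes", "Fishes", "Crabs", "Crabs", "Fishes", "Crabs"],
})

X = data[["latitude", "longitude", "month"]]
y = data["category"]

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# What might we end up catching at a new location in March?
new_spot = pd.DataFrame({"latitude": [10.4], "longitude": [75.3], "month": [3]})
print(model.predict(new_spot))
```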

Process automation:


Now that we have acquired the data, cleaned, processed, analyzed/visualized, and trained a model with it, it is time to automate this process. Why automate? Simply because it is tedious to do each of these steps manually over and over again. Here we will need to decide where to store the data we have acquired, which tools we will use to process it, which tools we will use to build the model, and finally where and how we will deploy the model in a production system.
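
Here is a minimal sketch of what gluing the steps together into one script might look like; the function bodies are placeholders, and in practice you would run something like this on a schedule (cron, Airflow, and similar tools exist for exactly this).

```python
import pandas as pd

def acquire():
    """Load the raw datasets from wherever we decided to store them."""
    return {
        "trawlers": pd.read_csv("trawlers.csv"),
        "catches": pd.read_csv("fishes_caught.csv"),
        "gps": pd.read_csv("gps_tracks.csv"),
    }

def clean(datasets):
    """Apply the checks and fixes from the data cleaning step."""
    # ... validation, typo fixes, imputation ...
    return datasets

def process(datasets):
    """Derive new columns, e.g. the species category."""
    # ... mapping species to categories ...
    return datasets

def train(datasets):
    """Fit the model from the modeling step and save it somewhere."""
    # ... model fitting and persistence ...

def run_pipeline():
    datasets = acquire()
    datasets = clean(datasets)
    datasets = process(datasets)
    train(datasets)

if __name__ == "__main__":
    run_pipeline()  # schedule this with cron, Airflow, or similar
```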

BONUS: A possible solution:

  • We build an app (Android and iOS) to show all the trawlers and their current locations.
  • We also write a model that predicts the path of a trawler and informs them if they are heading into a dense zone (one where there are too many trawlers) so that they can divert to a different location.
  • We design a dashboard using one of the BI (business intelligence) tools like Tableau, Power BI, Data Studio, etc., to report the fishes caught, with a warning system for when we catch other sea animals.
  • We write an image recognition model that scans the animals caught in the net and immediately reports if an endangered species has fallen into the net.

3. Types of roles in data science

To my knowledge, there are three main career roles in data science.

  1. Data analysts: They usually clean the data, process it and come up with the analysis and visualization to help answer key business questions.
  2. Data engineers: Data engineers are the ones who build the pipelines (remember process automation?). Data engineers do a fair bit of data analysis and data cleaning as well.
  3. Data scientists: Data scientists are usually involved in building a model or improving an existing one. They also explain the model and are responsible for conveying to the business why their model works and how it addresses the existing challenge. Today, the data scientist’s role has come to include most of the steps in the general process.

4. Getting started

Before we dive into this section, I wanted to answer a few FAQs I came across,

  • Do I need to set up a dev environment or need some license to get started?

The simple answer is No. All the content and the environment are available online. It’s magic.

  • What do I need to get started?

An interest in data science and some consistent effort over time. (Maybe you heard it somewhere else too, but trust me, that is all you need)

  • Should I know programming?

You don’t need programming knowledge to get started but you will need to learn a fair bit of programming to become an expert in data science.

  • How much does it cost to learn?

Absolutely free, but if you want to pay for it there are options too.

  • Is learning to code difficult?

I come from a non-coding background and I speak for myself when I answer this question. Learning to code was not very difficult with Python. I have spent some time learning R as well and did not find it very difficult either.

  • Should I be very good at statistics before getting started?

Not Ph.D.-level stats, but yes, basic stats. Something along these lines should be good for beginners.

  • I am a product design expert (a mechanical engineer), can I learn data science?

Yes. (P.S. I was a mechanical engineer before all this.)

  • Which programming language should I learn?

One of the most popular languages for data science is Python. SQL and R are quite useful as well. I use Python and SQL extensively.

https://businessoverbroadway.com/2019/01/13/programming-languages-most-used-and-recommended-by-data-scientists/

Now that we have addressed some common FAQs, let’s dive into the most important question of all,

How do I get started?

Free options:

Kaggle: This, in my opinion, is a very good place to get started. The link takes you to the courses page, where you can find a list of courses you can immediately start learning from and get hands-on experience with. Kaggle only teaches Python, so it might not be suitable for someone wanting to learn other languages.

Datacamp: Datacamp offers great courses as well, and they have an option to choose a career track. You can choose the data analyst track, the data engineer track, or the data scientist track, just to mention a few. Something I like about Datacamp is the fact that it also has courses for data engineers. Datacamp has a free tier, after which you will have to go for either a monthly or a yearly subscription. You can also share your completion certificates on LinkedIn, which is a good way to tell recruiters about your interest and progress.

Paid options:

Datacamp: I did not opt for the paid version of Datacamp. I took the free version and found their content really good, so I am assuming that their paid content is equally good.

Udemy: Udemy is another platform with a plethora of courses available for data science. I remember taking courses by Frank Kane and Jose Portilla which I found to be very useful. They guide you step by step and are very easy to follow.

Is there an order in which I should learn?

As a beginner, I always wished someone would tell me “Just start from here” and, as time went by, let me figure out what the best path is. With that in mind, my suggestion would be to start with Python and Kaggle.

Kaggle, apart from being a platform to learn on, also allows you to showcase your work. This helps you get feedback on your work. You can also look at others’ work for inspiration. There are also competitions on Kaggle where you can participate and see how well you do.

This is purely my suggestion as a beginner’s path, feel free to choose a path of your own.

(From Kaggle courses: a suggested learning order)

If some of the buzzwords are completely new to you, don’t worry, the courses do a good job of explaining them.

As a bonus, I am adding a few places where you can get really cool datasets,

5. How to set up your portfolio?

Here is a really good article on how to build a data science portfolio.

I am still going to go ahead and suggest a few that I think are good.

Kaggle: As I mentioned before, Kaggle gives you the opportunity to showcase your work to the global community by sharing snippets of your data analysis or participating in competitions. Here is an example of using Kaggle to analyze air pollution in India.

Datacamp: Datacamp gives you the opportunity to share your content on LinkedIn, which is a good way to let recruiters know of your skills.

GitHub: If you are not familiar with what GitHub is, this is a good read. A good demo on how to use GitHub to showcase your portfolio is here.

Your own blog/website: If you know how to and have the time for it, a blog or your own website is a really good way to showcase your portfolio.

Participate in competitions: Kaggle is not the only platform where you can put the skills you’ve learned to good use. Here are some amazing platforms that host competitions.

Well, that’s all folks. Thanks for your time.

Happy learning.
