Data Science and the Data Science Process

Before we get into the fun part of working with data, let’s break down how data science involves more than just statistics, why it’s becoming more important, and what the data science process looks like.

Data Science vs. Statistics
In short, data science is extracting knowledge from data. But how is that different from statistics? Data science encompasses more than statistics. Statistics is useful for exploratory data analysis, for making sure insights are statistically significant, and for creating predictive models. Beyond statistics, a few other skills are good to have as a data scientist:

Creativity will help you think of ways to use the data to find associated variables that may not be trivial to find.
The data, the insights that you find, and visualizations can all come together to tell an intriguing story. Putting this together is a key way to present your findings and to deploy your models.
Knowledge of programming is a great thing to have as a data scientist. Several things you can do with it are:
- Create statistical or machine learning models programmatically.
- Use pandas or dplyr to do data wrangling and cleaning programmatically.
- Create web or mobile applications that use the created models.

Why Data Science is Becoming More Important
This article on Forbes about big data helps sum up the importance of data scientists and why they are becoming more sought after. Here are a couple of highlights on what it will be like in the future:

By then (2020), our accumulated digital universe of data will grow from 4.4 zettabytes today to around 44 zettabytes, or 44 trillion gigabytes.

Within five years there will be over 50 billion smart connected devices in the world, all developed to collect, analyze and share data.

And the one that is most relevant to data scientists:

At the moment less than 0.5% of all data is ever analyzed and used.

Imagine all the knowledge that can be gathered from looking at even 1% of that data! Also, imagine how much of that data can be used as training data for machine learning opportunities.

A good bit of that data, too, can be used for really positive outcomes. Can the data help us figure out who is more prone to certain diseases? Will doctors be able to get better diagnoses? Is safer travel possible?

Data Science Process
So what does a data scientist actually do? Data scientists seem to have a bit of a magical quality to them. They are perceived to get a data set, apply some magic to it, and instantly produce insights that will drive the business to higher profits. As much as it may seem that way, a lot more work goes into the process.

To get a better idea of this process, here’s a diagram of the Cross Industry Standard Process for Data Mining.

Data Science Process in Detail

Let’s look at each of these items in more detail to help give more definition to them.

Business Understanding – In this first step, we try to get a better idea of what business needs the data should address. What kinds of questions should we be asking to help further the business and to help it understand what actions to take from the trends that the data shows? This could be open ended, in that you, as the data scientist, ask questions about the data that you see and find. Or it could be a series of specific questions that your client wants answered.
Data Understanding – This is getting an idea of the data that you have from a business perspective and understanding what each part of the data means. This may involve figuring out what data is needed and the best ways to acquire it. It also means finding out what each of the data points signifies in terms of the business. For instance, if you’re given a data set from a client, you have to know what each column and row represents. Do rows represent a single customer? Does this one column, with a heading that looks like an acronym, have a strong relationship with the rest of the data? We can’t really know this without understanding what exactly it means.
Data Preparation – The data preparation part of the process is where most of your time will be spent. Cleaning the data can be more of an art form than a science, since you have to judge whether you have the right data to build a good model and know how to clean it correctly so it won’t corrupt your model. I would also consider having reliable data to be part of this step. There’s an old saying, “garbage in, garbage out”. Your model won’t be very effective if you’re giving it bad data.
Modeling – Here is where statistics and data analysis come in to create a model that best fits the data. You may have to try several models in order to find the one with the best fit. Doing that often means going back to how the data was prepared. For example, there is more than one way to handle missing data. Is it safe to just remove the rows? Is there an average we can put in for it? Depending on the business, there may even be a better value to use for the missing ones. All of these choices can help make the model much better.
Evaluation – This part is where you test to see if you have a good model or not before deploying or presenting. As the diagram indicates, this is also the part where you make sure the model answers the business questions you had at the beginning of this process. It may even uncover new questions that are more important.
Deployment – This is where you share your findings from the data. This isn’t limited to having an API to call that uses your model. It could simply be documenting your findings in an email or a shared document, or presenting them to a group of executives. While it’s easy to talk technical with your colleagues, relaying what you find in the data to a sales team or the executives so they can take action on it is the key to this step.
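
To make the missing-data questions from the Modeling step a bit more concrete, here is a minimal sketch using pandas. The data set, the column names, and the idea that a missing order total could mean "no purchase" are all made up for illustration, so treat it as one possible approach and not the only way to do it.

```python
import pandas as pd

# Hypothetical data set: the column names and values are invented for this example.
orders = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "order_total": [20.0, None, 35.0, 5.0],
})

# Option 1: is it safe to just remove the rows with missing values?
dropped = orders.dropna(subset=["order_total"])

# Option 2: is there an average we can put in for the missing values?
mean_filled = orders.fillna({"order_total": orders["order_total"].mean()})

# Option 3: a business-specific value. Here we ASSUME a missing total
# means the customer bought nothing, so we fill with 0.
zero_filled = orders.fillna({"order_total": 0.0})

print(dropped)
print(mean_filled)
print(zero_filled)
```

Which option is right depends on what a missing value actually means for the business, which is why the earlier steps matter so much.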

One interesting thing this diagram indicates is that it’s best to have an understanding of the business. Without that, it would be much harder to ask the right questions and extract the most information from the data. Also, some steps can loop back and be iterated on again. For example, if you’ve moved from data preparation to modeling but new data came in, you would have to go back to prepare the new data and merge it with the old data that you already had to help give you more accurate results.
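
As a rough sketch of going back to data preparation when new data arrives, the example below cleans a new batch and combines it with the data you already had. The data frames, column names, and the `prepare` helper are hypothetical and only stand in for whatever cleaning your own data needs.

```python
import pandas as pd

# Hypothetical data: the old set was already cleaned, the new batch just came in.
old_data = pd.DataFrame({
    "customer_id": [1, 2],
    "order_total": [20.0, 35.0],
})
new_data = pd.DataFrame({
    "customer_id": [2, 3, 4],
    "order_total": [35.0, 12.5, None],
})

def prepare(df):
    """Hypothetical cleaning step: drop rows with a missing order total."""
    return df.dropna(subset=["order_total"])

# Run the new batch through the same preparation, then combine it with the old data.
combined = (
    pd.concat([old_data, prepare(new_data)], ignore_index=True)
    .drop_duplicates()          # the same record may appear in both batches
    .reset_index(drop=True)
)

print(combined)
```

Reusing the same preparation function for every batch keeps the old and new data consistent before you go back to modeling.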

Once you have a model and are evaluating it, as the arrow indicates, it’s helpful to go back to make sure that the results of the model are in line with the business. Does it help the business take action? Can it give answers to the business questions we had at the beginning? Are there any new questions that were raised?

For a more in-depth look at this process with a practical example, this post from Springboard is a really good one.

Now that we’ve looked at what data science is and its process, I think it’s time to look at a data set to see if we can answer any questions from it.