by Jose Marcial Portilla
How to become a Data Scientist
Hi! I’m Jose Portilla and I’m an instructor on Udemy with over 250,000 students enrolled across various courses on Python for Data Science and Machine Learning, R Programming for Data Science, Python for Big Data, and many more.
Almost every day a student will ask me some form of this question:
“What should I do to become a data scientist?”
In this post, I’ll try my best to answer this question and point you to resources that can help guide you to an answer. Hopefully this post also serves as something I can quickly link to for my students :)
Before we get started, I’m now teaching Data Science for Python and R on Udemy. You can check out these courses below and get a discount for using these links:
Now on to the rest of this post. I’ve broken down the steps into some key topics and discussed helpful details for each.
“The secret of getting ahead is getting started.” — Mark Twain
If you are interested in becoming a data scientist the best advice is to begin preparing for your journey now. Taking the time to understand core concepts will not only be very useful once you are interviewing, but it will also help you decide whether you are truly interested in this field.
Before starting on the path to becoming a data scientist, it’s important that you are honest with yourself about why you want to do this. Here are some questions you should ask yourself:
- Do you enjoy statistics and programming? (Or at least what you’ve learned so far about them?)
- Do you enjoy working in a field where you need to constantly be learning about the latest techniques and technologies in this space?
- Are you interested in becoming a data scientist even if it paid only an average salary?
- Are you okay with other job titles (e.g. Data Analyst, Business Analyst, etc…)?
If you answered yes to these questions, then you are well on your way to becoming a data scientist.
The path to becoming a data scientist will most likely take you some time, depending on your previous experience and your network. Leveraging these two can help place you in a data scientist role faster, but be prepared to always be learning. Let’s now jump to discussions on some more tangible topics.
“Do not worry about your difficulties in Mathematics. I can assure you mine are still greater.” — Albert Einstein
The main topics concerning mathematics that you should familiarize yourself with if you want to go into data science are probability, statistics, and linear algebra. As you learn more about other topics such as statistical learning (machine learning) these core mathematical foundations will serve as a base for you to continue learning from. Let’s briefly describe each and give you a few resources to learn from!
Probability is the measure of the likelihood that an event will occur. A lot of data science is about measuring the likelihood of events: everything from the odds of an advertisement getting clicked, to the probability of failure for a part on an assembly line.
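To make this concrete, here is a minimal sketch (not from the original post) that estimates a hypothetical ad’s click-through rate by simulating impressions; the function name and numbers are made up for illustration:

```python
import random

def estimate_click_probability(true_rate, n_impressions, seed=42):
    """Estimate the probability of an ad click by simulating impressions.

    Each impression is a Bernoulli trial: a click happens with
    probability `true_rate`.
    """
    rng = random.Random(seed)
    clicks = sum(1 for _ in range(n_impressions) if rng.random() < true_rate)
    return clicks / n_impressions

# With enough impressions the estimate converges to the true rate,
# by the law of large numbers.
estimate = estimate_click_probability(true_rate=0.03, n_impressions=100_000)
print(f"Estimated click-through rate: {estimate:.4f}")
```

This is exactly the kind of reasoning that the probability textbooks above formalize: defining events, counting outcomes, and understanding how estimates behave as sample sizes grow.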
For this classic topic I recommend going with a book, such as A First Course in Probability by Sheldon Ross or Probability Theory: The Logic of Science by E.T. Jaynes. Since these are textbooks they can be quite expensive if you buy new from Amazon, so I suggest looking for used copies online or PDF versions to save yourself some money!
If you prefer learning through video, you can check out Khan Academy’s video series on probability, or MIT’s OpenCourseWare lectures on probability and statistics. Both can be found for free on YouTube with a simple search.
Once you have a firm grasp on probability theory, you can move on to statistics, the general branch of mathematics that deals with analyzing and interpreting data. Fully understanding the techniques used in statistics requires you to understand probability and probability notation!
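As a small illustration of basic descriptive statistics, here is a hedged sketch using Python’s built-in statistics module; the sample data and the normal-approximation confidence interval are illustrative assumptions (for small samples a t-based interval would be more appropriate):

```python
import statistics

# Hypothetical sample: daily click counts for an advertisement.
daily_clicks = [12, 15, 11, 14, 18, 13, 16, 12, 17, 14]

mean = statistics.mean(daily_clicks)
stdev = statistics.stdev(daily_clicks)  # sample standard deviation (n - 1)

# A rough 95% confidence interval for the mean, using the normal
# approximation (about 1.96 standard errors on either side).
stderr = stdev / len(daily_clicks) ** 0.5
ci = (mean - 1.96 * stderr, mean + 1.96 * stderr)
print(f"mean={mean:.1f}, 95% CI ~ ({ci[0]:.1f}, {ci[1]:.1f})")
```

Interpreting intervals like this one correctly is exactly the kind of skill the statistics resources below will teach you.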
Again, I’m more of a textbook person, and fortunately there are two great online textbooks that are completely free for you to reference:
If you prefer more old-school textbooks, I like Statistics by David Freedman. I would suggest using this book as your main base and then checking out the other resources listed here for deeper dives into other topics (like ANOVA).
For practice problems I really enjoyed using the Schaum’s Outline Series (you can find books in this series for both probability and statistics).
If you prefer video, check out Brandon Foltz’s great statistics series on YouTube!
This is the branch of math that covers the study of vector spaces and linear mappings between those spaces. It’s used heavily in machine learning, and if you really want to understand how these algorithms work, you will need to build a basic understanding of linear algebra.
I recommend checking out Linear Algebra and Its Applications by Gilbert Strang; it’s a great textbook that is also used in the MIT linear algebra course available via OpenCourseWare! With these two resources you should be able to build a solid foundation in linear algebra.
Depending on your position and workflow, you may not need to dive very deep into the more complex details of linear algebra; once you get more familiar with programming, you’ll see that many libraries handle the linear algebra tasks for you. But it is still important to understand how these algorithms work!
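For a taste of how this shows up in practice, here is a minimal NumPy sketch (the numbers are invented for illustration) of the matrix-vector product at the heart of a linear model, the kind of operation libraries quietly handle for you:

```python
import numpy as np

# A tiny linear model: predictions are the matrix-vector product X @ w.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # 3 samples, 2 features
w = np.array([0.5, -1.0])    # weight vector

predictions = X @ w          # matrix-vector multiplication
print(predictions)           # [-1.5 -2.5 -3.5]
```

Understanding what `X @ w` does row by row is a good litmus test for whether your linear algebra foundation is solid enough to read machine learning material comfortably.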
“Measuring programming progress by lines of code is like measuring aircraft building progress by weight.” — Bill Gates
The data science community has adopted R and Python as its main programming languages. Other languages such as Julia and MATLAB are used as well, but R and Python are by far the most popular in this space.
In this section I’m going to describe some of the main basic topics of programming and data science, and then point out the main libraries used for both R and Python!
This topic is extremely dependent on personal preference, so I’ll just briefly describe some of the more popular development environments (IDEs) for data science with R and Python.
Python — Since Python is a general-purpose programming language, lots of options are available! You could use a plain text editor such as Sublime Text or Atom and customize it to your own liking; I personally use this approach for larger projects. Another popular IDE for Python is PyCharm from JetBrains, which provides a free Community Edition with plenty of features for most users. My favorite environment for Python has to be the Jupyter Notebook (previously known as the IPython Notebook). This notebook environment uses cells to break up your code and provides instant output, so you can interact with code and visualizations easily! Jupyter supports many kernels, including Scala, R, Julia, and more; Python is by far the best supported of these, although the others improve all the time. Jupyter notebooks are extremely popular in data science and machine learning. I use them for all my Python courses, and most students have really enjoyed the experience. While probably not the best solution for larger projects that need to be deployed, they are fantastic as a learning environment.
As far as getting Python installed on your computer, you can always use the official source — python.org , but I usually suggest using the Anaconda distribution, which comes with many of the packages I’ll discuss in this section!
R — RStudio is probably the most popular development environment for R. It has a great community behind it, and its full-featured base version is completely free. It displays visualizations well, gives you lots of options for customizing your experience, and a lot more. It is pretty much my go-to for anything with R! Jupyter Notebooks also support an R kernel, and while I have used it, I found the experience lacking compared to Jupyter’s capabilities with Python.
Python — For data analysis, two libraries are the main workhorses of Python: NumPy and pandas. NumPy is a numerical scientific computing package that serves as the base for almost all other packages in the Python data science ecosystem. pandas is a data analysis library built directly on top of NumPy and designed to mimic many of the built-in features of R, such as DataFrames! You can think of it as a super-powered version of Excel that allows you to quickly clean and analyze data. If you become a data scientist who uses Python, pandas will quickly become one of your main tools; it is personally my favorite Python library! I would also recommend checking out the SciPy website for details and links to the libraries in the PyData ecosystem.
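As a tiny illustrative sketch (the data and column names are made up), here is the kind of clean-and-summarize workflow pandas makes easy, using NumPy underneath:

```python
import pandas as pd
import numpy as np

# Hypothetical sales data with a missing value, to show a typical
# clean-then-summarize workflow.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "sales":  [100.0, np.nan, 150.0, 200.0],
})

# Impute the missing value with the column mean, then summarize by group.
df["sales"] = df["sales"].fillna(df["sales"].mean())
summary = df.groupby("region")["sales"].mean()
print(summary)
```

A few lines like these replace what would be a lot of manual spreadsheet work, which is why pandas becomes such a central tool so quickly.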
R — For the most part, R comes with a lot of data analysis features built in, such as DataFrames! But the R community has also created a lot of useful packages for dealing with data even more efficiently. These packages are known as the “tidyverse”: a collection of packages for data science, all designed with a similar philosophy of working with data, which means they all work very well together. They include dplyr for data manipulation, tidyr for cleaning your data, readr for reading in data, and packages like purrr and tibble that improve on some of R’s built-in functionality. Learning the tidyverse packages is a must for a data scientist using R! ggplot2 is also part of the tidyverse, but it is for data visualization, so let’s jump to that topic next!
Python — The “grandfather” of visualization with Python is matplotlib. Matplotlib was created to provide a visualization API for Python reminiscent of the style used in MATLAB, so if you have used MATLAB for visualization before, the transition will feel very natural. However, because its library of capabilities is so huge, a lot of other visualization libraries have been built on top of matplotlib in an attempt to simplify things or provide more specific functionality!
Seaborn is a great statistical plotting library that works very well with pandas and is built on top of matplotlib. It creates beautiful plots with just a few lines of code.
Pandas also comes with built-in plotting capabilities built off of matplotlib!
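As a minimal sketch of the basic matplotlib workflow described above (the data and output filename are arbitrary, and the non-interactive backend is just so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# A minimal line plot; the same pattern scales up to most matplotlib charts.
x = [1, 2, 3, 4, 5]
y = [xi ** 2 for xi in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("squares.png")  # save the figure instead of displaying it
```

Seaborn and pandas’ `.plot()` methods produce the same kinds of figure and axes objects under the hood, which is why learning matplotlib first pays off.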
R — By far the most popular plotting library for R is ggplot2. Its design philosophy and layer-based API make it easy to use and allow you to make basically any major plot you can think of! What is also great is that it works easily with Plotly, allowing you to quickly convert ggplot2 graphs into interactive visualizations with ggplotly!
Python — Scikit-learn is the most popular machine learning library for Python, with built-in algorithms and models for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. If you are more interested in building statistical inference models (such as analyzing p-values after a linear regression), you should check out statsmodels; it is also a great choice for working with time series data! For deep learning, check out TensorFlow, PyTorch, or Keras. I recommend Keras for beginners due to its simplified API. For deep learning topics you should always reference the official documentation, as this is a field that changes very quickly!
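To illustrate scikit-learn’s standard fit/predict API, here is a small sketch on synthetic data (the dataset and parameters are illustrative, not from the post); the same pattern works across nearly all of its models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a small synthetic classification problem.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The fit/score pattern below is scikit-learn's uniform API across models.
model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Swapping `LogisticRegression` for a tree, an SVM, or a clustering model changes almost nothing else in the code, which is a big part of scikit-learn’s appeal.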
R — One of the issues with R for beginning data scientists is the huge variety of package options when it comes to machine learning: each major algorithm can have its own separate packages, each with a different focus. When you are starting out, I recommend first checking out the caret package, which provides a nice unified interface for classification and regression tasks. Once you’ve moved on to unsupervised learning techniques such as clustering, your best bet is a quick Google search to see which packages are most popular for the technique you plan to use. You’ll even discover that R already has some of the basic algorithms built in, such as k-means clustering.
Where to learn these libraries and skills?
I teach these topics in full, you can check out the courses for 95% off by using the links below.
My Python for Data Science and Machine Learning Bootcamp:
My course on R for Data Science, Visualization, and Machine Learning:
Now that we’ve gone over the general background of programming topics, let’s discuss the path to actually landing a data science job!
“Good company in a journey makes the way seem shorter.” — Izaak Walton
The job search for data scientist positions can take a while, so it’s best to begin building out your network early!
One of the best ways to begin building your network is to attend meetups that involve data science! You don’t need to limit yourself strictly to data science; attend meetups on any related topic, such as Python meetups, visualization meetups, etc.
Conferences are another great way to connect with data scientists. While many conferences can be prohibitively expensive, they often include a career fair as part of the event, and if you only intend to visit the career fair you can often get discounted or even free passes. Conferences also often host workshops where you can learn new skills!
You should also begin to check out online communities and resources; the O’Reilly data newsletter, Kaggle, and KDnuggets are great ways to plug yourself into what is happening in the data science community. Podcasts are another great way to start learning about the community; I recommend checking out Talking Machines, Partially Derivative, and the O’Reilly Data Show.
It is also worth exploring general technology communities, such as Quora and HackerNews!
The Job Search and the Interview
“If we have data, let’s look at data. If all we have are opinions, let’s go with mine.” — James L. Barksdale
So you’ve learned your skills, networked, and are now ready to begin working as a data scientist!
The Job Search
The first step is to begin your search for a new job. A lot of this will vary depending on your personal circumstances and goals, so I’ll try to keep the advice as general as possible.
One of the best ways to begin your search and practice your skills at the same time is to participate in Kaggle challenges and blog about your experience with them. Some Kaggle challenges even lead directly to interviews as part of the prize! And even if nothing comes of the prize, it’s still valuable experience on a real data set. Note that Kaggle also has its own job board for data scientists.
Freelancing through sites like Upwork, contributing to open-source projects, and answering questions on Stack Overflow are other great ways to make your presence known to recruiters.
You will also want to make sure that your CV, LinkedIn, and Github are all updated to reflect your new skills and projects.
Make use of sites like Indeed or DataJobs for a general job search, or try out sites like Triplebyte, which gives you a series of technical interviews so you can quickly get through the initial interview phase for many companies at once. You can also check out startup jobs on the AngelList and Hacker News job boards.
For better or worse, many companies still rely on classic interview questions involving data structures and algorithms. To prepare for these sorts of questions, you should review topics such as arrays, graphs, recursion, linked lists, stacks, etc., using a book or course, and go through lots of practice problems! I have courses on these topics, and you can view some of the material for free in my popular GitHub repository of Jupyter notebooks with practice questions and solutions!
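As one example of the kind of practice problem you’ll encounter, here is a sketch of a classic: reversing a singly linked list in Python (the class and function names are my own, not from any particular course):

```python
class Node:
    """A minimal singly linked list node."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def reverse_list(head):
    """Reverse a singly linked list iteratively and return the new head."""
    prev = None
    while head is not None:
        # Redirect the current node's pointer, then step forward.
        head.next, prev, head = prev, head, head.next
    return prev

# Build 1 -> 2 -> 3, reverse it, and read the values back out.
head = Node(1, Node(2, Node(3)))
node = reverse_list(head)
values = []
while node:
    values.append(node.value)
    node = node.next
print(values)  # [3, 2, 1]
```

In an interview you’d also be expected to discuss the time complexity (O(n)) and space complexity (O(1)) of this approach, so practice narrating your reasoning as you code.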
You can also check out the list of practice problems on LeetCode:
For more specific data science questions, you’ll need to familiarize yourself with a wide variety of topics: questions on probability, programming questions in R or Python, SQL queries, and possibly big data tooling (topics such as Spark). You should also familiarize yourself with modeling and the reasoning behind parameter choices, for example the differences between L1 and L2 regularization.
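To illustrate the L1 vs. L2 difference just mentioned, here is a small sketch using scikit-learn’s Lasso (L1 penalty) and Ridge (L2 penalty) on synthetic data where only two of ten features actually matter; the data-generating process and alpha values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of ten features influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

# L1 regularization drives irrelevant coefficients exactly to zero
# (sparse solutions), while L2 only shrinks them toward zero.
print("L1 zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("L2 zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```

Being able to explain why L1 produces sparse coefficients while L2 does not is exactly the sort of parameter-reasoning question that comes up in data science interviews.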
Many companies also give take-home tasks; these can be a great opportunity to get in some extra practice, even if the job offer itself doesn’t pan out.