This article is part 4 of the “Data Science with Python” series. You can consider this a general introduction to common modules used in Python for doing data science.
Note that such a list might not always stay relevant. New modules and frameworks keep coming and going. But I believe these modules have proven so effective to the maths and science community that they have made their way to even academic courses. And as such, they will stay relevant for a long time and it would serve you well if you learn them at the beginning of your data science journey.
What are Modules in Python?
Although learning mathematics, data structures and algorithms are a significant part of any data science course, real data science jobs hardly start with those. Maybe if you’re advanced enough they will matter, but not for a beginner. As a beginner data science programmer, you just have to assemble previously developed and tested components to achieve your goals.
So your first set lessons should be what modules are best for data science. Or in other words, what modules are most commonly used in data science. There are some modules that have become foundational to any data science work. These are the modules that you would expect any data scientist should be familiar with. These modules are also what you would learn in most data science, machine learning or artificial intelligence courses.
4 Modules I Recommend You Learn First
At it’s core, Numpy is just a better way to work with arrays that hold only one type of data. Python already has arrays by means of the ‘list’ type. That list object is fine for most regular programming tasks. But in data science and analytics, data structures must be optimized to hold much larger datasets than usual. Even homeworks given in data science courses involve datasets with thousands of records. The Numpy module provides an optimized array object to handle such work loads.
Using Numpy arrays vs. Python lists have 3 main advantages while doing data analytics –
- Speed: Numpy (and many other such modules), are faster mainly because internally, most of their functionality is implemented in lower level languages (like C).
- Space: Numpy takes advantage of data types – if the values you store in the array are all 8-bit numbers, then thats’ all the space they’ll take (with a little bit of fixed overhead). But in Python lists, each of those numbers will take space for a reference and an integer object – which is wasteful when you are working with big data sets.
- Functions: Python list functionality is limited to what you would need in a general programming task. But Numpy expands it considerably – including but not limited to, vector operations, algebra and matrix operations. Most of the time we can do operations on arrays without even writing loops – for example –
array * 2will multiply all values in the array by 2.
I like to think of Pandas as the programmatic version of spreadsheet software. It excels at working with tabular data – that is, data that’s arranged in rows and columns. As far as I have seen, Pandas is the first python module that is introduced to data science students – simply because most of them start with loading data from a csv file and manipulating / analysing it. Y0u can load data from a variety of sources (like CSV files, Microsoft Excel files, REST APIs etc.), do data-wrangling tasks (like cleaning, enriching, transforming etc.) on the loaded data, produce analytic output (like summaries, charts etc.) – all using just Pandas.
Even if you are going to do much more advanced tasks like Machine learning, the first steps of loading, analyzing and cleaning data will probably be done by Pandas. If you are doing exploratory analysis (like Business Intelligence reports), I can safely say that Pandas is all you need.
Just like how Numpy has array as the core datastructure, Pandas has two core datastructures – dataframe and series. A series is similar to a one dimensional array (or simply a list of numbers). In a lot of places, Pandas series and Numpy array can be swapped without much difference (but we will see the differences as we advance). A dataframe is like a spreadsheet. It’s used to store data as rows and columns, and provides powerful features to manipulate and analyze tabular data.
Data Science, Business Intelligence, or even simple analytics on data – none of it be complete without neat reports presenting the findings from the analysis. Matplotlib is the library that we use to make charts and other pictorial representations in our reports.
Matplotlib has functionality to render charts on the screen, output charts as image files and even display charts in IPython/Jupyter notebooks. I’ve come across comments that Matplotlib has a hard learning curve, but I don’t think so. You just need to have patience to learn it’s foundations rather than hurry up to produce charts – it’s not that difficult.
Even modern visualization modules like Seaborn actually use Matplotlib underneath. Seaborn is considered to make “prettier” charts than Matplotlib, and even if you would like to use it, I suggest you start by learning Matplotlib. Pandas also has chart producing capabilities – and yes, it uses Matplotlib internally. If you are working on a Python data science project – there’s a very high chance that your output is rendered using Matplotlib. That’s how common this module is for making charts.
Scikit-learn, also called sklearn, is the most used library for machine learning. It has functionalities central to machine learning, namely – clustering, regression and classification. It is a collection of a lot of complex algorithms – quite a lot, that I can say this is not one module – it is a collection of several machine learning modules. Scikit-learn also gives some data-wrangling functionality to preprocess your data where Pandas might come a bit short.
If you do a course related to data science, you will probably learn algorithms like k-means, random forests, nearest neighbors, logistic regression (to name a few). Although you learn how these work internally, it is never expected for a data scientist (or a data science programmer), to implement these algorithms themselves. They just use a module (probably Scikit-learn), which already has implementations of these algorithms in a generalized, re-usable way. We just need to pick the required components and implement our project using them.
Scikit-learn internally uses Numpy for it’s processing, and integrates naturally with Pandas and Matplotlib. Not only Scikit-learn, all the four modules introduced in this article, interoperate well with one another. It is an important reason they have become so popular and useful – they focus strongly on their own purpose, at the same time, working well in connection with one another.
Most data science projects have a pattern. We aquire data, do numerical and algebraic calculations, run our data science algorithms on it, finally present our results as visualizations. The four modules that I’ve recommended above, map directly to these four tasks. Pandas to aquire data, Numpy for crunching numbers, Sckit-learn for some algorithmic magic, Matplotlib to add charts to your reports.
Once you have learned those four modules, you can expand your skillset by knowing which direction you are going to go from there. By the time you’ve learned these, you will know what you want to learn next. Some notable mentions are Tensorflow, Scikit-image, Keras, and PyTorch. Four is a pretty small number, because there are countless libraries and modules in the world of Python and data science. But learning these four modules will give you the solid grounding you need to launch your data science journey.