Python has become one of the most important tools in the data science world. This article is for new students of data science who are just getting started and would like to know their options for installing a Python environment in which to work and learn. Note: This is not a tutorial for setting up Python; I only try to gather the different options available for creating a Python environment to do your work in.
My advice is to try out all the options and settle on the one you find most convenient. Even if you find one particular way of setting things up convenient, you will still need some familiarity with the other approaches, because you don’t know what kind of projects you might have to work on in the future.
The options available to you as a learner for setting up Python on your computer are:
- Plain Vanilla
- Python Distributions
- pyenv and venv
- Cloud Services
Plain Vanilla
This is the basic way to get started: install Python on your computer, then install whatever frameworks and libraries you need as you go about learning data science. First download Python from the official website and follow the instructions there (mostly just double-clicking the file you downloaded). Then use
pip install <package-name> to install the packages you need. As a beginning data science student, the packages you need would include (but are not limited to):
- Jupyter –
pip install jupyterlab
- NumPy –
pip install numpy
- Pandas –
pip install pandas
- Matplotlib –
pip install matplotlib
- Seaborn –
pip install seaborn
- Scikit-learn –
pip install scikit-learn
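Once the installs finish, it can help to confirm that each package is actually importable from the Python you are running. The following is a minimal sketch rather than an official tool: the helper name check_packages is made up for illustration, and note that the import name can differ from the pip name (scikit-learn is imported as sklearn).

```python
from importlib.util import find_spec

# Import names for the packages listed above. These are the names used in
# `import ...`, which are not always the same as the pip package names.
IMPORT_NAMES = ["jupyterlab", "numpy", "pandas", "matplotlib", "seaborn", "sklearn"]


def check_packages(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(IMPORT_NAMES).items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

Any package reported as missing can be installed with the matching pip command from the list above.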
The above option of having Python installed along with a few libraries through pip would be sufficient for most data science students when they are just beginning.
Python Distributions
Of late, the more convenient option seems to be to use one of the “Python Distributions”. A Python distribution is simply a version of Python bundled with a set of packages, so you don’t have to install or configure the packages yourself. These are actually quite nice. My favourite one is the Anaconda distribution, but there are several others. A nice list of popular distributions can be found here.
Although Python distributions exist to simplify Python installation, in my experience installing Python using pyenv and managing projects using venv is more convenient. My personal opinion is that, unless your course or project expects you to use a distribution like Anaconda, you are better off installing Python (using pyenv) and using virtual environments to keep your life simple.
pyenv and venv
The above two options, although quite simple and sufficient for most use cases, have one problem: what if you need multiple versions of Python? There are cases where you might need both Python 2 and 3, or different versions of Python 3 itself. Also, what if you need different versions of dependencies? This is where pyenv and venv help. Using pyenv, you can install multiple versions of Python on your computer, and with venv each project can keep its own set of dependencies.
If you are taking this option, you should first install pyenv and then install Python using pyenv. Check out the pyenv repository for instructions on how to install pyenv. pyenv is one of my favourite tools. It lets you install all the popular versions and distributions of Python and switch between them very easily.
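When several Python versions are installed side by side, it is easy to lose track of which one is actually running. A small sketch like the following reports the version and location of the current interpreter. The function name describe_interpreter is hypothetical, and nothing here is specific to pyenv; it only uses the standard library.

```python
import platform
import sys


def describe_interpreter():
    """Return the version string and file path of the running Python interpreter."""
    return {
        "version": platform.python_version(),  # version of the interpreter in use
        "executable": sys.executable,          # path to the python binary in use
    }


if __name__ == "__main__":
    info = describe_interpreter()
    print(f"Python {info['version']} at {info['executable']}")
```

Running this after switching versions is a quick way to check that the switch took effect. Where exactly the path points depends on how Python was installed, so treat the output as something to verify on your own machine.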
Even if you don’t need pyenv and are always going to use only the latest version of Python, I still recommend working with venv. Simply create a directory for your Python project, then run
python -m venv <env-name> to create a virtual environment. Then run
source <env-name>/bin/activate to activate the environment (on Windows, run <env-name>\Scripts\activate instead). Your command prompt will now change to show that you are inside a virtual environment. Check out the official documentation for more on venv.
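Besides looking at the prompt, you can ask Python itself whether it is running inside a virtual environment. This is a minimal sketch, assuming an environment created by venv; in_virtualenv is a name chosen for illustration, and environments created by other tools may need a different check.

```python
import sys


def in_virtualenv():
    """Return True if this interpreter is running inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the Python installation it was created from.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtualenv())
    print("Environment location:", sys.prefix)
```

If this reports False right after you thought you had activated an environment, the activation step most likely did not run in the current shell.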
venv is included with Python 3. If you are working with both Python 2 and Python 3, and you are using pyenv, I suggest you use the pyenv-virtualenv plugin to manage virtual environments rather than venv.
Cloud Services
Cloud services are also a popular way to learn data science using Python. They provide an environment like Jupyter Notebook, but hosted online on shared or dedicated machines in the cloud. This is a neat option in that you don’t have to install anything at all on your computer.
There are multiple options: CoCalc, Google Colab and Kaggle are just a few examples. You just have to register, log in and start doing your data science work. These even let you have ‘shared notebooks’ where you can collaborate with others on the same file. As a student, this is my favourite approach when I’m doing group assignments. I prefer starting a document on Google Colab and sharing it so my peers can contribute. Then, for submission, I export an .ipynb file from it and submit that for grading.
If you are from the world of R and have used RStudio, you’ll feel right at home with this IDE. This is also a no-hassle approach to getting started with data science. It has a neat toolset, an internal Python installation and pre-installed packages for doing data science work in Python. If your use cases are quite simple and you don’t need to install many third-party packages, using this IDE can be a very simple option for you. But I don’t prefer this either, because I try out a lot of Python packages and installing them is not straightforward. Even after I figured out how to install packages, I didn’t want to continue using this IDE. But that’s only my opinion; you should definitely try this one out if you’re still figuring out your favourite setup.
If you’re willing to spend some money, and you like IDEs, you can also take a look at PyCharm. I don’t prefer IDEs for data science anyway, and I definitely wouldn’t advise anyone to spend much money on learning, because there’s a ton of stuff available for free. Once you have learnt to a decent level and are working in a professional capacity, definitely go for paid tools. At that point, they add enough productivity to justify their cost. But not while learning.
So that’s pretty much all the common ways to set up Python. I repeat: try all of them out and then settle on one. The data science world requires that you have familiarity with a broad set of tools. When you search for solutions or samples, you don’t know what you will find, so it’s good to have some basic exposure to several tools.
As for me, my favourite approach is to install Python 3 using pyenv, use venv to manage environments, and then run a local Jupyter notebook for doing reports. I use Google Colab for assignments that require collaboration.