September 2021 – Page 2

This tutorial is part of the series “Data Science with Python”, a set of tutorials aimed at helping beginners get started with data science and Python.

Installing Python

Installing Python can be as simple as downloading the installer from the official website. But the more common way to get Python in the data science world is through a package manager like Conda.

Conda is a popular package manager as well as environment manager.

There are 3 main advantages to installing Conda over plain Python:

Your work will be reproducible. When you send your project to someone else, you just have to tell them to install Conda and recreate your environment from the file you share, instead of listing the versions of all your dependencies yourself. Using a dependency manager makes it easy to share your work.
You will avoid package installation and dependency problems. As a beginner you are unlikely to run into dependency conflicts, but it’s still an important advantage.
You can use it as an environment manager. If the different projects you are working on use different dependencies (or different versions of them), you can create an isolated environment for each project to avoid dependency problems, and easily switch between them.

Download the latest version of Miniconda as per your operating system (from the Conda download page) and install it. Installation for most people is just double-clicking on the downloaded file or running a command on the terminal.

For Windows, you just have to double-click on the exe file you downloaded.
For Mac, there are two options
- Download the pkg file and double-click to run it, or –
- Download the script and run it in the terminal using bash ~/Downloads/Miniconda3-latest-MacOSX-x86_64.sh, and answer the questions asked.
For Linux, you need to download the script and run it in the terminal using bash ~/Downloads/Miniconda3-latest-Linux-x86_64.sh, and answer the questions asked.

If you need them, the full installation instructions can be found here: https://conda.io/projects/conda/en/latest/user-guide/install/index.html#regular-installation.

I’ve deliberately kept the installation instructions minimal because there are plenty of references and troubleshooting guides if you search the internet for ‘miniconda installation on <<operating system>>’. After a successful installation, you should be able to run conda --version in the terminal to see the version of Conda you installed. Also run python --version to see the Python version installed.

(base) ➜  data-science python --version
Python 3.9.5
(base) ➜  data-science conda --version
conda 4.10.3
(base) ➜  data-science
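
The same check can also be made from inside Python. The snippet below is a minimal sketch and not part of the original tutorial: it reports the version of the interpreter that is running it and whether a conda executable can be found on the PATH. The helper name describe_setup is made up for illustration.

```python
import shutil
import sys


def describe_setup():
    """Return the running Python version and the path to conda, if any."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    # shutil.which returns None when no 'conda' executable is on the PATH.
    conda_path = shutil.which("conda")
    return version, conda_path


version, conda_path = describe_setup()
print("Python", version)
print("conda found at:", conda_path if conda_path else "not found on PATH")
```

If conda is reported as not found right after installation, opening a new terminal window is usually worth trying first, since the installer changes the shell configuration.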

Writing a Python Program

You need a text editor to write your code. I use Visual Studio Code. You can install helpful extensions to Visual Studio Code if you like, but I’m not using any for this tutorial. The basic way to run Python programs is to write a program in your text editor and then run it in the terminal using the Python interpreter.

Now that we have everything set up, let’s write and execute a simple Python program: one that prints ‘Hello World!’ on the screen.

1. Create an Environment

An ‘environment’ is the combination of the Python version and packages you need. For one project you might be using Python 3.9 and packages x, y and z. For another project you might be using Python 2.7 and packages a, b and c. Environments give us a way of keeping these dependencies separate from one another and reducing confusion when switching between projects.

You can also share the environment as an environment.yml file along with your code, so that the recipient can run your project without worrying much about the dependencies. Let’s create an environment named dstut (for data science tutorial) to use along with this set of tutorials.

Execute conda create --name dstut python=3.9 to create the environment. This will create an environment with Python version 3.9. Conda will show you the set of default packages it needs to install and ask you to confirm.

Once the environment is created you have to ‘activate’ it every time you need to use it. Execute conda activate dstut to activate the environment we created. You’ll see that the command prompt now includes (dstut) to denote that you are in your new Conda environment.
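
If you are ever unsure which environment a program is running in, you can ask Python itself. This is a small sketch under the assumption that the environment was activated with conda activate, which sets the CONDA_DEFAULT_ENV variable; the function name active_environment is only illustrative.

```python
import os
import sys


def active_environment():
    """Return the name conda reports for the active environment, or None."""
    # Conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return os.environ.get("CONDA_DEFAULT_ENV")


print("Active conda environment:", active_environment())
# sys.prefix is the directory of the Python installation currently in use.
print("Interpreter location:", sys.prefix)
```

Run inside the dstut environment, the first line should name dstut, and the interpreter location should point to a folder for that environment rather than to the base installation.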

2. Write your Code

Open your text editor and create a file with a single line of code –

print("Hello World!")

Save the file as hello.py

3. Execute the Program

Switch back to your terminal and execute python hello.py to run your code. If everything is in order, you should see the text Hello World! printed on the terminal.

(dstut) ➜  hello-world python hello.py 
Hello World!

Intro to Jupyter Notebook

A Jupyter notebook is a format that lets you create a single document containing your Python code, widgets, charts and documentation. It’s great for sharing your work. In fact, in academic circles it’s more or less the default to share your work through such ‘notebooks’. Almost every homework or assignment I have submitted used these notebooks.

So let’s also see how to run our ‘Hello World!’ program using a Jupyter notebook.

Installing JupyterLab

I hope you haven’t closed the terminal yet. If you have, open it again, change to your directory (cd <<directory-name>>) and execute conda activate dstut to activate our tutorial environment.

Execute conda install jupyterlab to install Jupyter in the environment. You will be presented with a list of packages and asked to confirm to install. Go ahead and finish the installation.

Note that you have installed JupyterLab only inside the environment. That is, once you are out of the environment (by closing the terminal or by running conda deactivate), it will be as if there is no JupyterLab on your computer. And whenever you run conda activate dstut, JupyterLab is available again! That’s one of the core functions of an environment manager like Conda.
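
You can see this isolation for yourself with a few lines of Python. The sketch below only checks whether a module can be imported by the interpreter that runs it; is_installed is a made-up helper name. Run it once with dstut activated and once after conda deactivate, and the answer should differ (assuming Jupyter is not also installed in your base environment).

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be imported by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


print("jupyterlab available:", is_installed("jupyterlab"))
```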

Okay, now that JupyterLab is installed, execute jupyter notebook to start the notebook server. This will also open the interface automatically in your default browser. If you need to open the page yourself, or in a different browser, the terminal will show a URL (like http://localhost:8888/?token=5410e03089b55baba71dubidabab57dudu85207ce07380a9). Copy that URL and paste it in your browser’s address bar to open the Jupyter notebook interface.

Creating our Document

Once you’re on the home page, use the ‘New’ menu to create a new notebook.

On the new document that’s created, there’s one ‘cell’ by default. A Jupyter notebook is a sequence of such ‘cells’, each with a type. For now, we are going to have two cells: one for giving a title to our document and one for our ‘Hello World!’ code.

Change the first cell’s type to ‘Markdown’ using the dropdown on the toolbar or by using the ‘Cell -> Cell Type’ menu. And then type # Program to Print 'Hello World!'. Then add a new cell using the ‘+’ button on the toolbar or by using the ‘Insert’ menu.

The new cell is of type ‘Code’ by default, which is what we want. If you remember, the code we wrote for hello.py is print("Hello World!"). Type this code in the new cell.

Running the Document

Now let’s get the output by using the ‘Run’ command. You can either execute the two cells one by one using the ‘Run’ button on the toolbar, or use the ‘Cell -> Run All’ menu item to run both cells in order.

We can also click on the ‘Untitled’ title on the top, and give it a meaningful name that suits our project.

Close the window and switch back to your terminal. You’ll find the server is still running there. Press Ctrl + C to shut down the server. If you type ls you’ll see that JupyterLab has saved your document as a file with an ipynb extension (ipynb stands for IPython Notebook). Then you can run conda deactivate to deactivate our dstut environment, or just close the terminal.

That’s it. Now you know how to set up a Python data science environment, write and execute Python code, and create Jupyter notebooks.
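
An ipynb file is plain JSON, so you can peek inside it with nothing but the standard library. The sketch below does not read your actual file. It builds a simplified, hand-written stand-in for the two-cell notebook from this tutorial (a real saved notebook carries more metadata, and this stand-in may not pass strict validation) and lists the type and content of each cell. The helper name summarize_cells is made up for illustration.

```python
import json

# Simplified stand-in for a saved notebook with the two cells from this tutorial.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Program to Print 'Hello World!'"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('Hello World!')"],
        },
    ],
}


def summarize_cells(nb):
    """Return a (cell type, source text) pair for every cell in the notebook."""
    return [(cell["cell_type"], "".join(cell["source"])) for cell in nb["cells"]]


# Round-trip through JSON text, which is how the notebook is stored on disk.
text = json.dumps(notebook, indent=1)
loaded = json.loads(text)

for cell_type, source in summarize_cells(loaded):
    print(cell_type, "->", source)
```

To try this on your own notebook, you could replace the round-trip with json.load on the opened file, for example a file named Untitled.ipynb if you did not rename it (the file name here is an assumption).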

Python has become one of the most important tools used in the data science world. This article is for new students of data science who are just getting started and would like to know their options when it comes to installing a Python environment to work on and learn data science. Note: This is not a tutorial for setting up Python – I only try to assemble the different options available to create a Python environment to do your work.

My advice is to try out all the options and settle on the one that you find most convenient. Even if you find one particular way of setting things up convenient, you will still need some familiarity with the other approaches, because you don’t know what kind of projects you might have to work on in the future.

The options available to you as a learner to set up Python on your computer are:

Plain Vanilla
Python Distributions
pyenv and venv
Cloud Services

Plain Vanilla

The basic way to get started. Install Python on your computer, then install whatever frameworks and libraries you need as you go about learning data science. First download Python from the official website and follow the instructions there (mostly just double-clicking the file you downloaded). Then use pip install <package-name> to install the packages you need. As a beginning data science student, the packages you need would include (but are not limited to):

Jupyter – pip install jupyterlab
NumPy – pip install numpy
Pandas – pip install pandas
Matplotlib – pip install matplotlib
Seaborn – pip install seaborn
Scikit-learn – pip install scikit-learn
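
After installing, it can be handy to confirm what actually ended up in your Python installation. The following is a minimal sketch using the standard library’s importlib.metadata (available from Python 3.8); the helper name installed_versions is made up, and the package list simply mirrors the one above.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["jupyterlab", "numpy", "pandas", "matplotlib", "seaborn", "scikit-learn"]


def installed_versions(names):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver if ver else 'not installed'}")
```

Anything reported as not installed can be added with the matching pip install command from the list.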

Python Distributions

The above option of having Python installed along with a few libraries through pip is sufficient for most data science students when they are just beginning. Of late, though, the more convenient option seems to be one of the “Python Distributions”. A Python distribution is simply a version of Python and a set of packages bundled together, so you don’t have to install or configure the packages yourself. These are actually quite nice. My favourite one is the Anaconda distribution, but there are several others. A nice list of popular distributions can be found here.

Although Python distributions exist to simplify Python installation, in my experience installing Python using pyenv and managing projects using venv is more convenient. My personal opinion is that, unless your course or project expects you to use a distribution like Anaconda, it is better to install Python (using pyenv) and use virtual environments to keep your life simple.

pyenv and venv

The above two options, although quite simple and sufficient for most use cases, have one problem. What if you need multiple versions of Python? There are cases where you might need both Python 2 and 3, or different versions of Python 3 itself. Also, what if you need different versions of dependencies? This is where pyenv and venv help. Using pyenv, you can install multiple versions of Python on your computer.

If you are taking this option, you should first install pyenv and then install Python using pyenv. Check out the pyenv repository for instructions on how to install pyenv. pyenv is one of my favourite tools. It allows us to install all the popular versions and distributions of Python and very easily switch between them.
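
Once several Python versions live side by side, it is easy to run a script with the wrong one by accident. A script can guard against that itself. This is a minimal sketch: the minimum version (3, 9) and the function name meets_requirement are assumptions chosen only for illustration.

```python
import sys

REQUIRED = (3, 9)  # assumed minimum version, chosen only for illustration


def meets_requirement(current, required=REQUIRED):
    """Return True if the (major, minor) part of current is at least required."""
    return tuple(current[:2]) >= tuple(required)


if meets_requirement(sys.version_info):
    print("Python version is fine:", sys.version.split()[0])
else:
    print("This script expects Python {}.{} or newer.".format(*REQUIRED))
```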

Even if you don’t need pyenv and are always going to use only the latest version of Python, I still recommend working with venv. Simply create a directory for your Python project, then run python -m venv <<env-name>> to create a virtual environment. Then run source <<env-name>>/bin/activate to activate the environment (on Windows, run <<env-name>>\Scripts\activate instead). Now your command prompt will change to show that you are inside a virtual environment. Check out the official page for documentation on venv.
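
The venv workflow above can also be explored from Python, since venv is an ordinary standard library module. The sketch below is illustrative only: it checks whether the running interpreter belongs to a virtual environment, then creates a throwaway environment in a temporary folder to show the pyvenv.cfg file that marks one. The names in_virtual_env, create_env and demo-env are made up for this example.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_env():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


def create_env(path):
    """Create a virtual environment at path, similar to 'python -m venv path'."""
    # with_pip=False skips installing pip, which keeps this sketch fast;
    # 'python -m venv' installs pip by default.
    venv.create(path, with_pip=False)
    return Path(path) / "pyvenv.cfg"


print("Inside a virtual environment:", in_virtual_env())

with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "demo-env")
    print("Created config file:", cfg.name, cfg.exists())
```

Note that this check is specific to venv-style environments. A Conda environment ships its own complete Python, so the check will typically report False there.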

venv is included with Python 3. If you are working with both Python 2 and Python 3, and you are using pyenv, I suggest you use the pyenv-virtualenv plugin to manage virtual environments rather than venv.

Cloud Services

Cloud services are also a popular option to learn data science using Python. The concept is that they provide an environment like Jupyter notebook but it’s hosted online on shared/dedicated machines on the cloud. This is a neat option in that you don’t have to install anything at all on your computer.

There are multiple options: CoCalc, Google Colab and Kaggle are just a few examples. You just have to register, log in and start doing your data science work. These even allow ‘shared notebooks’ where you can collaborate with others on the same file. As a student, this is my favourite approach when I’m doing group assignments. I prefer starting a document on Google Colab and sharing it so my peers can contribute. Then, for submission, I export an ipynb file from it and submit it for grading.

Spyder IDE

If you are from the world of R and have used RStudio, you’ll feel right at home with this IDE. This is also a no-hassle approach to getting started with data science. It has a neat toolset, an internal Python installation and pre-installed packages for doing data science work in Python. If your use cases are quite simple and you don’t need to install many third-party packages, this IDE can be a simple option for you. But I don’t prefer this either, because I try out a lot of Python packages and installing them is not straightforward. Even after I figured out how to install packages, I didn’t want to continue using this IDE. But that’s only my opinion; you should definitely try this one out if you’re still figuring out your favourite setup.

If you’re willing to spend some money, and you like IDEs, you can also take a look at PyCharm. I don’t prefer IDEs for data science anyway. And I definitely won’t advise anyone to spend much money on learning, because there’s a ton of stuff available for free. Once you have learnt to a decent level and are working in a professional capacity, definitely go for paid tools. At that point, they add enough productivity to justify their cost. But not while learning.

Conclusion

So that’s pretty much all the common ways to set up Python. I repeat: try all of them out and then settle on one. The data science world requires that you have familiarity with a broad set of tools. When you search for solutions or samples, you don’t know what you will find. So it’s good to have some basic exposure to several tools.

As for me, my favourite approach is installing Python 3 using pyenv, using venv to manage environments, and then running a local Jupyter notebook for doing reports. I use Google Colab for assignments that require collaboration.

Hello World: Data Science with Python Part 1