Code Creek

Hello World: Data Science with Python Part 1

This tutorial is part of the series “Data Science with Python”, a set of tutorials aimed at helping beginners get started with data science and Python.

Installing Python

Installing Python could be as simple as just downloading the Python executable from the official website. But the more common way to get Python in the Data Science world is to have a package manager like conda.

Conda is a popular package manager as well as environment manager.

There are three main advantages to installing Conda over plain Python –

Your work will be reproducible. When you send your project to someone else you just have to tell them to install Conda (and maybe the version), instead of telling them the versions of all your dependencies. Using a dependency manager makes it easy to share your work.
You will avoid package installation and dependency problems. As a beginner you are unlikely to run into many dependency conflicts, but it’s still an important advantage.
You can use it as an environment manager. If the different projects you are working on use different dependencies (or different versions of them), you can create an isolated environment for each project to avoid dependency problems, and easily switch between them.

Download the latest version of Miniconda as per your operating system (from the Conda download page) and install it. Installation for most people is just double-clicking on the downloaded file or running a command on the terminal.

For Windows, you just have to double-click on the exe file you downloaded.
For Mac, there are two options
- Download the pkg file and double-click to run it, or –
- Download the script and run it in the terminal using bash ~/Downloads/Miniconda3-latest-MacOSX-x86_64.sh, and answer the questions asked.
For Linux, you need to download the script and run it in the terminal using bash ~/Downloads/Miniconda3-latest-Linux-x86_64.sh, and answer the questions asked.

If you need, the full installation instructions can be found here – https://conda.io/projects/conda/en/latest/user-guide/install/index.html#regular-installation.

I’ve deliberately kept the installation instructions minimal because there’s a ton of references and troubleshooting information available if you search the internet for ‘miniconda installation on <<operating system>>’. After a successful installation, you should be able to run conda --version in the terminal to see the version of Conda you installed. Also run python --version to see the Python version installed.

(base) ➜  data-science python --version
Python 3.9.5
(base) ➜  data-science conda --version
conda 4.10.3
(base) ➜  data-science

Writing a Python Program

You need a text editor to write your code. I use Visual Studio Code. You can install helpful extensions to Visual Studio Code if you like, but I’m not using any for this tutorial. The basic way to run Python programs is to write a program in your text editor and then run it in the terminal using the Python interpreter.

Now that we have everything we need set up, let’s write and execute a simple Python program: one that prints ‘Hello World!’ on the screen.

1. Create an Environment

An ‘environment’ is the combination of the Python version and packages you need. For one project you might be using Python 3.9 and packages x, y and z. For another project you might be using Python 2.7 and packages a, b and c. Environments provide us a way of keeping these dependencies clear from one another and reduce confusion when switching between projects.

You can also share the environment as an environment.yml file along with your code, so that the recipient can run your project without worrying much about the dependencies. Let’s create an environment named dstut (for data science tutorial) to use along with this set of tutorials.

Execute conda create --name dstut python=3.9 to create the environment. This will create an environment with Python version 3.9. Conda will list the default packages it needs to install and ask you to confirm.

Once the environment is created you have to ‘activate’ it every time you need to use it. Execute conda activate dstut to activate the environment we created. You’ll see that the command prompt now includes (dstut) to denote that you are in your new Conda environment.
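If you want to double-check from inside Python which environment you are in, a small sketch like the one below can help. It assumes Conda sets the CONDA_DEFAULT_ENV environment variable when an environment is activated; the function name describe_environment is just something I made up for illustration.

```python
import os
import sys


def describe_environment():
    """Return a few facts about the interpreter that is currently running."""
    return {
        # Assumption: Conda sets CONDA_DEFAULT_ENV on activation.
        # The value is None when no Conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
        # Directory the running interpreter was installed into.
        "prefix": sys.prefix,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

With dstut activated, you would expect conda_env to show dstut and python_version to show 3.9.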

2. Write your Code

Open your text editor and create a file with a single line of code –

print("Hello World!")

Save the file as hello.py

3. Execute the Program

Switch back to your terminal and execute python hello.py to run your code. If everything is in order, you should see the text Hello World! printed on the terminal.

(dstut) ➜  hello-world python hello.py 
Hello World!
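Once the one-liner works, you may want to try a slightly more structured version of hello.py. This is only a sketch: the greeting function and its name parameter are my own additions for illustration, not something the rest of this series depends on.

```python
def greeting(name="World"):
    """Build the greeting text instead of printing it directly."""
    return f"Hello {name}!"


# This guard makes sure the print only happens when the file is run
# as a script, not when it is imported from another file.
if __name__ == "__main__":
    print(greeting())
```

Running python hello.py with this version still prints Hello World!, but the greeting can now be reused from other code.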

Intro to Jupyter Notebook

A Jupyter notebook is a format in which a single document can include your Python code, widgets, charts and documentation. It’s great for sharing your work. In fact, in academic circles it’s more or less the default way to share your work. Almost every homework or assignment I submitted used these notebooks.

So let’s also see how to run our ‘Hello World!’ program using a Jupyter notebook.

Installing Jupyterlab

I hope you still haven’t closed the terminal. If you have, then open it back again, change to your directory (cd <<directory-name>>) and execute conda activate dstut to activate our tutorial environment.

Execute conda install jupyterlab to install Jupyter in the environment. You will be presented with a list of packages and asked to confirm to install. Go ahead and finish the installation.

Note that you have installed Jupyterlab only inside the environment. That is, once you are out of the environment, (by closing the terminal or by doing conda deactivate), it will be as if there is no Jupyterlab on your computer. And whenever you do conda activate dstut, Jupyterlab is back on again! That’s one of the core functions of an environment manager like Conda.

Okay, now that Jupyterlab is installed, execute jupyter notebook to start the notebook server. This will start the notebook server and open the interface automatically in your default browser. If you need to open the page by yourself, or on a different browser – the terminal will show a URL (like http://localhost:8888/?token=5410e03089b55baba71dubidabab57dudu85207ce07380a9). Copy that URL and paste it in your browser’s address bar to open the Jupyter notebook interface.

Creating our Document

Once you’re in the homepage, use the ‘New’ menu to create a new notebook.

On the new document that’s created, there’s one ‘cell’ by default. A Jupyter notebook is a set of such ‘cells’ with types. For now, we are going to have two cells – one for giving a title to our document and one for our ‘Hello World!’ code.

Change the first cell’s type to ‘Markdown’ using the dropdown on the toolbar or by using the ‘Cell -> Cell Type’ menu. And then type # Program to Print 'Hello World!'. Then add a new cell using the ‘+’ button on the toolbar or by using the ‘Insert’ menu.

The new cell is by default of type ‘Code’ which is what we want too. Type your code in the new cell. If you remember, the code we wrote for hello.py is print('Hello World!'). Type this code in the new cell we created.

Running the Document

Now let’s get the output by using the ‘Run’ command. You can either execute the two cells one by one using the ‘Run’ button on the toolbar. Or, use the ‘Cell -> Run All’ menu item to run both cells one by one.

We can also click on the ‘Untitled’ title on the top, and give it a meaningful name that suits our project.

Close the window and switch back to your terminal. You’ll find the server is still running there. Press Ctrl + C to shut down the server. If you type ls you’ll see that Jupyterlab has saved your document as a file with an ipynb extension. (ipynb stands for IPython Notebook.) Then you can do conda deactivate to deactivate our dstut environment or just close the terminal.
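If you are curious about what is inside that ipynb file, it is plain JSON. The sketch below builds a hand-written stand-in for our two-cell notebook (assuming the version 4 notebook layout, with a top-level cells list) and reads it back with the json module. The summarize_cells helper is a hypothetical name, and a real file saved by Jupyter will contain more metadata than this.

```python
import json

# A hand-written stand-in for the saved notebook (nbformat 4 layout).
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Program to Print 'Hello World!'"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('Hello World!')"],
        },
    ],
})


def summarize_cells(text):
    """Return (cell type, source text) pairs for each cell in a notebook."""
    notebook = json.loads(text)
    # "source" may be a single string or a list of lines; join handles both.
    return [(cell["cell_type"], "".join(cell["source"])) for cell in notebook["cells"]]


if __name__ == "__main__":
    for cell_type, source in summarize_cells(notebook_text):
        print(f"{cell_type}: {source}")
```

To try it on your own notebook, read the file's contents with open(...).read() and pass the text to summarize_cells.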

That’s it. Now you know how to set up a Python Data Science environment, write and execute Python code, and create Jupyter notebooks.

Python options for Data Science

Python has become one of the most important tools used in the data science world. This article is for new students of data science who are just getting started and would like to know their options when it comes to installing a Python environment to work on and learn data science. Note: This is not a tutorial for setting up Python – I only try to assemble the different options available to create a Python environment to do your work.

My advice is to try out all the options and settle on the one that you find most convenient. Even if you find one particular way to set things up convenient, you will still need some familiarity with the other approaches, because you don’t know what kind of projects you might have to work on in the future.

The options available to you as a learner, to set up Python on your computer, are –

Plain Vanilla
Python Distributions
pyenv and venv
Cloud Services

Plain Vanilla

The basic way to get started. Install Python on your computer. Then install whatever frameworks and libraries you need as you go about learning data science. First download Python from the official website, and follow instructions from there (mostly just double click on the file you downloaded). Then use pip install <package-name> to install the packages you need. As a beginning data science student, the packages you need would include (but are not limited to) –

Jupyter – pip install jupyterlab
NumPy – pip install numpy
Pandas – pip install pandas
Matplotlib – pip install matplotlib
Seaborn – pip install seaborn
Scikit-learn – pip install scikit-learn
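After installing, a quick way to see which of these packages Python can actually find is sketched below. It uses importlib.util.find_spec from the standard library, which looks a module up without importing it. The PACKAGES mapping and the installed_packages helper are my own names, and note the assumption that Scikit-learn is imported under the module name sklearn.

```python
from importlib.util import find_spec

# Display name -> importable module name.
# Note that Scikit-learn is imported as "sklearn".
PACKAGES = {
    "JupyterLab": "jupyterlab",
    "NumPy": "numpy",
    "Pandas": "pandas",
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
    "Scikit-learn": "sklearn",
}


def installed_packages(packages=PACKAGES):
    """Map each display name to True if its module can be found."""
    return {label: find_spec(module) is not None for label, module in packages.items()}


if __name__ == "__main__":
    for label, found in installed_packages().items():
        print(f"{label}: {'installed' if found else 'missing'}")
```

Anything reported as missing can be installed with the matching pip install command from the list above.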

Python Distributions

The above option of having Python installed along with a few libraries through pip would be sufficient for most data science students when they are just beginning. Of late, the more convenient option seems to be to use one of the “Python Distributions”. Python distributions are nothing but a version of Python and a set of packages clubbed together. So you don’t have the task of installing or configuring the packages yourself. These are actually quite nice. My favourite one is the Anaconda distribution, but there are several others. A nice list of popular distributions can be found here.

Although Python distributions apparently exist to simplify Python installations and be a convenience, in my experience, installing Python using pyenv and managing projects using venv is more convenient for me. My personal opinion is that unless your course/project expects you to use a distribution like Anaconda, it is better to install Python (using pyenv) and use virtual environments to keep your life simple.

pyenv and venv

The above two options, although quite simple and sufficient for most use cases, have one problem. What if you need multiple versions of Python? There are cases where you might need both Python 2 and 3 or different versions of Python 3 itself. Also, what if you need different versions of dependencies? This is where pyenv and venv help. Using pyenv, you can install multiple versions of Python on your computer.

If you are taking this option, you should first install pyenv and then install Python using pyenv. Check out the pyenv repository for instructions on how to install pyenv. pyenv is one of my favourite tools. It allows us to install all the popular versions and distributions of Python and very easily switch between them.

Even if you don’t need pyenv and are going to use only the latest version of Python always, I still recommend working with venv. Simply create a directory for your Python project, then run python -m venv <<env-name>> to create a virtual environment. Then run source <<env-name>>/bin/activate to activate the environment (on Windows, run <<env-name>>\Scripts\activate instead). Now your command prompt will change to show that you are inside a virtual environment. Check out the official page for documentation on venv.
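Besides looking at the command prompt, you can also ask Python itself whether it is running inside a venv. The sketch below relies on the fact that inside a virtual environment created by venv, sys.prefix points at the environment while sys.base_prefix points at the Python installation it was created from. The in_virtualenv name is only for illustration.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # Outside a virtual environment these two values are identical.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtualenv())
    print("Environment location:", sys.prefix)
```

Keep in mind that this check is about venv-style environments; a Conda environment is a separate mechanism and is not detected this way.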

venv is included with Python 3. If you are working with both Python 2 and Python 3, and you are using pyenv, I suggest you use the pyenv-virtualenv plugin to manage virtual environments rather than venv.

Cloud Services

Cloud services are also a popular option to learn data science using Python. The concept is that they provide an environment like Jupyter notebook but it’s hosted online on shared/dedicated machines on the cloud. This is a neat option in that you don’t have to install anything at all on your computer.

There are multiple options – CoCalc, Google Colab, Kaggle are just a few examples. You just have to register, log in and start doing your data science work. These even allow ‘shared notebooks’ where you can collaborate with others on the same file. As a student, this is my favourite approach when I’m doing group assignments. I prefer starting a document on Google Colab and sharing it so my peers can contribute. Then, for submission, I export an ipynb file from it and submit it for grading.

Spyder IDE

If you are from the world of R and have used RStudio, you’ll feel right at home with this IDE. This is also a no-hassle approach to get started with data science. It has a neat toolset, an internal Python installation and pre-installed packages for doing data science work in Python. If your use cases are quite simple and you don’t need to install many third party packages, using this IDE can be a very simple option for you. But I don’t prefer this either, because I try out a lot of Python packages and installing them is not straightforward. Even after I figured out how to install packages, I didn’t like to continue using this IDE. But that’s only my opinion. You should definitely try this one out if you’re still figuring out your favourite setup.

If you’re willing to spend some money, and you like IDEs, you can also take a look at PyCharm. I don’t prefer IDEs for data science anyway. And I definitely won’t advise anyone to spend much money on learning, because there’s a ton of stuff available for free. Once you have learnt to a decent level, and are working in a professional capacity, definitely go for paid tools. At that point, they add enough productivity to justify their cost. But not while learning.

Conclusion

So that’s pretty much all the common ways to set up Python. I repeat, try all of them out and then settle on one. The data science world requires that you have familiarity with a broad set of tools. When you search for some solutions or samples, you don’t know what you will find. So it’s good to have some basic exposure to several tools.

As for me, my favourite approach is installing Python 3 using pyenv, using venv to manage environments, and then running a local Jupyter notebook for doing reports. I use Google Colab for doing assignments that require collaboration.

Functional Programming Vs. Object Oriented Programming

Object oriented programming has been the de facto programming methodology since the day I learnt that there is something called computer programming. Several of the most popular programming languages are primarily object oriented programming languages. The most commonly asked interview questions for programmers are about object oriented programming.

That is, until functional programming blew up a few years ago. Functional programming languages have been there since the 1960s, mind you, but only a few years ago did they gain traction among ‘commercial’ developers. I found a ton of people learning Scala and lambdas and observables and all the associated jargon. One group of us jumped ship and embraced functional programming as the way to go for all our new projects. Another group of us stuck to the familiar ground of object oriented programming.

So far I’ve done two projects in the functional programming style, both web applications. One in Scala and one in Java (Spring Reactor). Here’s what I’ve learnt so far as contrasts between Object Oriented Programming and Functional Programming.

Core Principle

The first difference that we need to appreciate is the core principle guiding these methodologies. Whenever I see coders struggling to adopt functional programming, it is because they don’t have a grasp of this.

Object Oriented

In object oriented programming, we think of everything as objects with state. The flow of the application is dictated by change in state of the objects involved.

Functional

In functional programming, we think of everything as operations. The flow of the application is dictated by the chain of operations.

Design Patterns

If you are like most other programmers, you would have learnt a bunch of design patterns to apply in your projects. And I bet many of them are not just ‘design patterns’, they are ‘object oriented programming design patterns’. You even need to unlearn many of these. If you try to apply these common design patterns when you do functional programming, it will be like hammering a square peg into a round hole.

For the last couple decades or so, object oriented programming has overshadowed all the other ways of programming and this has one bad side effect – so much of the learning material depends on real world analogies. We’ve all come across things like the Animal --> Dog --> Dalmatian kind of analogies right? Well you need to forget all that and realise that abstractions are only there to lighten the cognitive load.

Computers don’t do dogs and cats. Computers do sets of operations on very basic units of data. Data like bits and bytes. Functional programming relates more to this attitude in a programmer. You are not working with simulated models of objects. You are taking an input, doing some processing and giving an output. Try to apply that to any programming task that you do. For example, a web service takes a request as an input, does its processing stuff, and then sends a response as an output. This can be broken down into several levels to achieve cognitive ease, by breaking down the processing into multiple functions, and then chaining them by giving the output of one function as the input to the next function.
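The request-in, response-out idea above can be sketched in a few lines of Python. Everything here is made up for illustration: the pipeline helper, the three step functions and the shape of the request and response are assumptions, not a real web framework.

```python
from functools import reduce


def pipeline(*steps):
    """Chain functions so the output of one becomes the input of the next."""
    def run(value):
        return reduce(lambda result, step: step(result), steps, value)
    return run


def parse_request(raw):
    # Turn the raw input into structured data.
    return {"name": raw.strip()}


def build_greeting(request):
    # Return a new dictionary rather than modifying the one passed in.
    return {**request, "message": f"Hello {request['name']}!"}


def to_response(result):
    # Shape the final output.
    return {"status": 200, "body": result["message"]}


handle = pipeline(parse_request, build_greeting, to_response)

if __name__ == "__main__":
    print(handle("  World "))
```

Each function is a small black box, and handle is nothing more than those boxes arranged in order.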

Choosing between the Two

It is not simple to classify an application, because most applications fall into multiple categories for purposes like these. Think about the core purpose of your application and make your decision based on that.

Object Oriented

Object oriented programming is preferable when you are representing a world of objects. For example a simulator, or a video game.

Functional

Functional programming works well in scenarios where your application is a processing pipeline. For example an event stream processor or a data processing API.

When you are trying to choose between the two, do not think of peripheral activities like logging, IO (even database updates). Rather, think of the main purpose your application is solving for its user. Is it giving your user an object-state model that they can manipulate? Or is it providing an engine that transforms their input and gives them an output?

Combining the Two

In the real world, you are probably going to combine the two programming paradigms rather than strictly use only one. Functional programming seems like a good fit for a web service, but some of the components are still better represented by an object oriented model. We can think of the web service as a series of functions that, in the beginning, take in a request, and in the end, emit a response. But entities like service and repository can still stay as objects. In fact, a pure function will not have side effects, but this is hardly useful in the real world, right? We almost always need to have side effects – update a database, write to a log file, send out an email and so on.
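Here is a rough sketch of what such a mix might look like in Python. The repository is a plain object holding state, the transformation steps are pure functions, and the side effect happens only at the very end. The InMemoryUserRepository class and the register function are hypothetical names, and the in-memory dictionary stands in for a real database.

```python
class InMemoryUserRepository:
    """A stateful object; a stand-in for a real database-backed repository."""

    def __init__(self):
        self._users = {}

    def save(self, user):
        self._users[user["email"]] = user

    def count(self):
        return len(self._users)


def normalize_email(user):
    # Pure function: returns a new dictionary and leaves the input untouched.
    return {**user, "email": user["email"].strip().lower()}


def add_display_name(user):
    # Pure function: derives a display name from the email address.
    return {**user, "display_name": user["email"].split("@")[0]}


def register(raw_user, repository):
    """Run the pure steps first, then perform the side effect at the edge."""
    user = add_display_name(normalize_email(raw_user))
    repository.save(user)  # the only side effect in this flow
    return user


if __name__ == "__main__":
    repo = InMemoryUserRepository()
    print(register({"email": "  Alice@Example.com "}, repo))
    print("users stored:", repo.count())
```

The nice part of this split is that the pure steps can be tested without any repository at all.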

The Code

The most visible differences between the two programming methods show up when you look at the code. Some important differences are –

Object Oriented

Objects are the core entities. The program flow is usually instantiating objects and modifying their states. Any processing that is done, is as a means to change some state.

Control flow in object oriented programs is done through simple and traditional constructs like loops and if-else blocks.

There is a global scope, then there is a session scope, then there is a thread scope. Or simply put, there is always some global state from where you can get your environment variables. The ‘context’ (that holds things like current user, configuration parameters etc) is available from the respective scope.

Concepts like threads and concurrency are handled by the application code. Even if you use a multi-threading friendly framework, you still might have to declare what is thread-safe and so on.

Functional

Functions are things. You create functions and assemble them in chains. You can assign functions to variables and pass them as arguments to other functions.

Control flow in functional programming is done by chaining, filtering and recursion. Streams are preferred to collections.

There is only a local scope. In functional programming, the best practice is to provide everything your function needs as arguments. The ‘context’ (that holds things like current user, configuration parameters etc) gets passed in as an argument to all functions that need it.

Since a ‘global state’ is not even assumed, programs are by default thread safe. And most functional programming platforms manage concurrency by themselves, upon this same assumption.
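To make the contrast concrete, here is a small sketch of the same task written both ways in Python. It is deliberately simplified: the Cart class, the add_price and total functions, and the discount stored in the context are all invented for this example.

```python
# Object oriented style: an object holds state, and methods modify it.
class Cart:
    def __init__(self):
        self.prices = []

    def add(self, price):
        self.prices.append(price)  # changes the object's state

    def total(self):
        result = 0
        for price in self.prices:  # traditional loop and if block
            if price > 0:
                result += price
        return result


# Functional style: data goes in, new data comes out, nothing is modified.
def add_price(prices, price):
    return prices + (price,)  # returns a new tuple


def total(prices, context):
    # The context is passed in as an argument instead of being read
    # from a global or session scope.
    return sum(filter(lambda price: price > 0, prices)) - context["discount"]


if __name__ == "__main__":
    cart = Cart()
    cart.add(10)
    cart.add(20)
    print(cart.total())

    prices = add_price(add_price((), 10), 20)
    print(total(prices, {"discount": 5}))
```

Because add_price never modifies the tuple it receives, two threads working on the same prices cannot step on each other, which is the thread safety point made above.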

Conclusion

To put it simply, when you are doing functional programming, don’t think of objects and types – rather, think that you are making a lot of small black boxes. Each one takes an input and gives an output. And then you’re arranging all those black boxes to make a useful application. If you can grasp this, you’re probably going to have a very easy time settling into functional programming.

Please Stop Infinite Scrolling

I think this post is more like a follow up to my previous post on whether to use SPAs or not. Confession: I started writing that article, and this one, because of a website that completely irritated me. It was a matrimony website. If you didn’t know, matrimony websites are like dating websites, but you skip the dating and go straight to the wedding. The interfaces are made very similar to shopping sites – in fact it actually feels like a shopping site for brides and grooms.

Of course I have a problem with the ideology of those sites, but I’m here to talk about this particular one – and what irked me about it the most. The fact that it was an SPA was obviously off-putting because the concept of SPA was deliberately and unnecessarily thrust upon the poor unsuspecting wife-shopper. But

Where am I?

I have no idea how much I have already seen. A page with 16 or 20 items which I can check out and then click on a ‘Next’ button to see the next set of items would have been so much clearer and easier to use. Without knowing how much I have seen and how many more are left, I don’t even know whether I should keep scrolling or whether I should give up. If there are only another 10 items to check out, I’ll continue browsing to check out everything. But if there are another 2000 items, I’d probably give up.

There’s no way to know that, when the page does ‘infinite-scrolling’.

How to Get Back to Where I was?

After scrolling through a countless number of items, what if I scroll back to the top for some reason? What if I refreshed the page? What if I want to look at the item that I saw a while ago? These are all not even possible when you are infinite-scrolling. You have to scroll back up all the way and find that item again. You have to scroll all the way down if you accidentally refresh that page.

Think about the User

I think the infinite scroll is a classic example of over-engineering a user experience. A classic case where a designer forgot all about user experience and just wanted to be fancy for the sake of their own vanity. Things like that are the bane of user interface design. The designer is so proud of something they did that they completely forget to get feedback from UX testers. Or even from actual unhappy users. The worst thing is, they probably spend more money on doing this than on the simple, easy-to-use traditional way.

Sometimes I Need to Reach the Bottom

This is the problem that I actually faced, and the one that made me go on a rant on my blog. The menu I needed was at the bottom of the page. I had to scroll down to the bottom. Only, when I scrolled, the page just loaded more items and I had to scroll again. Then it loaded more items. After doing that a few more times, a brilliant idea struck me and I pressed the ‘End’ button – on a Mac, the end button takes you instantly to the bottom of the page. And surprise! Before I could move the mouse and click on my menu, the page loaded a bunch of items and the menu went back out of view. Not only could I not click my button, the page also loaded a ton of content that I wasn’t even interested in.

It’s Slow

Making one server request for a page displaying 25 items is often faster than making 25 separate requests, one for each item. Significantly faster. In most implementations of infinite scrolling, the page makes more and more requests as you scroll. Also, the experience of waiting for a second and seeing 25 items is better than the items loading one by one with a fraction of a second gap in between. So infinite scrolling is not only actually slower, it also amplifies its own slowness by reminding the user often that there’s something loading.

What’s Better Then?

What’s better is the plain old pagination. Don’t fix what’s not broken!

Do I know how far along I am, browsing the search results? Yes! Because a list of page numbers at the bottom always shows me which page I am on, and among the ‘finite’ number of things on each page, I can easily get back to where I was.
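The logic behind plain pagination is also tiny. Here is a minimal sketch in Python, assuming the full list of results is already in memory; the paginate function, its parameters and the shape of the returned dictionary are my own invention for illustration. A real site would more likely ask the database for just one page.

```python
import math


def paginate(items, page, per_page=20):
    """Return one page of items plus what the user needs to stay oriented."""
    total_pages = max(1, math.ceil(len(items) / per_page))
    # Clamp the requested page into the valid range.
    page = min(max(page, 1), total_pages)
    start = (page - 1) * per_page
    return {
        "items": items[start:start + per_page],
        "page": page,
        "total_pages": total_pages,
        "total_items": len(items),
    }


if __name__ == "__main__":
    results = list(range(45))
    current = paginate(results, page=3)
    print(f"Page {current['page']} of {current['total_pages']}: {current['items']}")
```

The page number and the total are exactly the two pieces of information that infinite scrolling takes away from the user.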

I can scroll down and see the footer, use the footer menu if there is one. My browser doesn’t have to load a ton of content that’s not useful at all to me. I get to have a calm peaceful life.

And the page doesn’t have to load and run a lot of JavaScript code if it’s avoiding fancy things like this. No matter what fancy techniques you use on your page, they will never beat speed. A snappy, fast-loading page with a familiar user experience is much better than fancy pages with things like infinite scrolling and animations.

Just do pagination if you’re showing me a catalog. Please. Thank you.

SPA or Not

The latter half of the last decade saw an explosion of SPAs. With the introduction of Angular 2, ReactJS, Vue and a ton of such frameworks, creating highly interactive web pages became very easy. So easy that competing technologies like Java applets and Flash are being pushed into extinction.

Of course whenever a new UI technology comes in, it’s going to look exciting, have a big bunch of people jumping on the bandwagon, some realise that it’s actually not relevant to them, others realise something newer has come, then finally, most of them move on. But this time, a new ‘concept’ was spun out. The concept of SPAs.

What are SPAs?

SPAs or Single Page Applications are a concept, where your whole website is just one HTML page. Pieces of content inside the page get dynamically modified and updated using content from the server. For example, a single page that has a menu at the top and an empty box at the bottom – when you click on a menu item, the page will fetch the respective data from the server and populate the empty box. When you click another menu item, it will fetch different content, and replace the content of the (initially) empty box.

In contrast, traditional web applications are made of several pages. In the traditional style of the above example, each menu item would be a link to the corresponding web page. Clicking on a menu item would just order the web browser to load a new page entirely. The disadvantage is that it’s a bit slower to load an entire webpage than to populate just the data into an existing container.

How to Decide?

SPAs are not an improvement on the web UI, and as such, it’s incorrect to assume that ‘modern’ websites are SPAs. SPAs are just a different way to make websites. So it’s important to choose whether or not to use them. This has become an important decision to make because there are significant differences in user experience between SPAs and traditional applications. So much so that this choice can make or break the success of your web application.

When to Prefer an SPA

When you’re making an interactive user interface, where there is communication between the different components of the page, it’s better to do an SPA. For example, a dashboard that shows data as tables and charts. You probably would like if the charts are all interactive and respond to different clicks on the page – like if you click on a geography, all the charts get redrawn to show data only for that geography. Another example is a drawing application – a large canvas in the centre and a set of tools like pencil, eraser, shapes etc in a toolbar. These kinds of applications are even possible only because of the advances in UI frameworks and SPAs.

When to Prefer a Traditional Website

When your audience is going to consume information rather than interact with it, then it’s better to do a traditional website. Think of blogs, news websites, video streaming sites, forums – the bulk of the internet. It is an unnecessary complication to do an SPA if the interactivity it brings is not utilised, because it’s way more complex to develop SPAs than normal web pages. There is more possibility of bugs and weird behaviour. More importantly, your page is going to be unnecessarily large and slow to download – SPA frameworks are usually heavy.

Also, if you are making a website where people come to consume information, then you probably depend on search engines to bring you traffic. Well search engines are not very good with SPAs. Chances are that your website won’t even be indexed by search engines, if it’s entirely an SPA.

How to Choose

By now, it should be obvious that there is more chance that you do not need an SPA, because most websites exist for consumption rather than interaction. Most people come to the internet to read, watch or listen. And a smaller portion of usage is interactive applications like posting blog entries, working with documents, editing images and so on. So the choice is simple. If your website is more for reading, watching or listening, then do a traditional site. If your website is more for interacting – filtering, sorting, drill-downs, slice-n-dice, drawing and so on, do an SPA.

How About Both?

The thing is, most of the time, your website might have to do both. Think of a shopping website. The shoppers all have to read the pages, look at the product details, read reviews – so it seems like a site where people primarily read information. But it also has to be interactive – filtering products, sorting search results and so on. What to pick in this case?

Such websites can benefit from both approaches. So I would use a combination of both. Start out with a normal traditional website. Then introduce SPA features into the pages where it’s necessary. So your website would be like a collection of pages, some of which are mini SPAs. For example, the search results page is a normal page without SPA functionalities. But to improve user experience, the product page might have features like commenting, reviewing, browse multiple product images, buttons to add/remove the product from the shopping cart etc. These can be done SPA-style, so that the user won’t be navigating away from the page to do these little actions.

Still Doubtful?

When in doubt, do a traditional website. It’s easy to get a normal website right. But getting SPAs right is hard work. Wait for circumstances to strongly push you towards SPAs – and then you can refactor your website to be an SPA. Because often when the developers are in doubt, it means there’s not much benefit in increasing complexity. Presenting an SPA when there is no need for one will just make the user experience worse. Where the situation doesn’t demand it, SPAs stick out like sores and sometimes even end up irritating the user. So again, if you’re confused, just do a plain old website and live peacefully ever after.