Basic Modules: Data Science with Python Part 4

This article is part 4 of the “Data Science with Python” series. You can consider this a general introduction to common modules used in Python for doing data science.

Note that such a list might not always stay relevant – new modules and frameworks keep coming and going. But I believe these modules have proven so effective for the maths and science community that they have made their way into even academic courses. As such, they will stay relevant for a long time, and it would serve you well to learn them at the beginning of your data science journey.

What are Modules in Python?

Although learning mathematics, data structures and algorithms is a significant part of any data science course, real data science jobs hardly start with those. Maybe once you’re advanced enough they will matter, but not for a beginner. As a beginner data science programmer, you mostly assemble previously developed and tested components to achieve your goals.

So your first set of lessons should be about which modules are best for data science – or in other words, which modules are most commonly used in data science. Some modules have become foundational to any data science work. These are the modules that you would expect any data scientist to be familiar with. They are also what you would learn in most data science, machine learning or artificial intelligence courses.

4 Modules I Recommend You Learn First


Numpy

At its core, Numpy is just a better way to work with arrays that hold only one type of data. Python already has arrays in the form of the ‘list’ type. The list object is fine for most regular programming tasks. But in data science and analytics, data structures must be optimized to hold much larger datasets than usual. Even homework assignments in data science courses involve datasets with thousands of records. The Numpy module provides an optimized array object to handle such workloads.

Numpy arrays have three main advantages over Python lists when doing data analytics –

  1. Speed: Numpy (and many other such modules), are faster mainly because internally, most of their functionality is implemented in lower level languages (like C).
  2. Space: Numpy takes advantage of data types – if the values you store in the array are all 8-bit numbers, then that’s all the space they’ll take (plus a little fixed overhead). But in a Python list, each of those numbers takes space for a reference and an integer object – which is wasteful when you are working with big datasets.
  3. Functions: Python list functionality is limited to what you would need in a general programming task. But Numpy expands it considerably – including but not limited to, vector operations, algebra and matrix operations. Most of the time we can do operations on arrays without even writing loops – for example – array * 2 will multiply all values in the array by 2.
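These points can be illustrated with a small sketch (the numbers here are made up):

```python
import numpy as np

# An array of 8-bit integers - each value takes exactly one byte
values = np.array([1, 2, 3, 4], dtype=np.int8)

print(values * 2)     # vectorized multiply, no loop needed: [2 4 6 8]
print(values.nbytes)  # total bytes of data stored: 4
```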


Pandas

I like to think of Pandas as the programmatic version of spreadsheet software. It excels at working with tabular data – that is, data arranged in rows and columns. As far as I have seen, Pandas is the first Python module introduced to data science students – simply because most of them start with loading data from a CSV file and manipulating / analysing it. You can load data from a variety of sources (like CSV files, Microsoft Excel files, REST APIs etc.), do data-wrangling tasks (like cleaning, enriching, transforming etc.) on the loaded data, and produce analytic output (like summaries, charts etc.) – all using just Pandas.

Even if you are going to do much more advanced tasks like Machine learning, the first steps of loading, analyzing and cleaning data will probably be done by Pandas. If you are doing exploratory analysis (like Business Intelligence reports), I can safely say that Pandas is all you need.

Just like Numpy has the array as its core data structure, Pandas has two core data structures – the dataframe and the series. A series is similar to a one-dimensional array (or simply a list of numbers). In a lot of places, a Pandas series and a Numpy array can be swapped without much difference (but we will see the differences as we advance). A dataframe is like a spreadsheet: it stores data as rows and columns, and provides powerful features to manipulate and analyze tabular data.
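Here is a small sketch of the two data structures, using a few made-up records:

```python
import pandas as pd

# A series is like a one-dimensional array
ages = pd.Series([35, 13, 16, 20])

# A dataframe holds rows and columns, like a spreadsheet
df = pd.DataFrame({
    "name": ["John Smith", "Lily Pina", "Julie Singh", "Rita Stuart"],
    "age": [35, 13, 16, 20],
})

print(ages.mean())         # the average age: 21.0
print(df[df["age"] < 18])  # only the rows where age is under 18
```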


Matplotlib

Data Science, Business Intelligence, or even simple analytics on data – none of it would be complete without neat reports presenting the findings from the analysis. Matplotlib is the library we use to make charts and other pictorial representations in our reports.

Matplotlib has functionality to render charts on the screen, output charts as image files and even display charts in IPython/Jupyter notebooks. I’ve come across comments that Matplotlib has a hard learning curve, but I don’t think so. You just need the patience to learn its foundations rather than hurrying to produce charts – it’s not that difficult.

Even modern visualization modules like Seaborn actually use Matplotlib underneath. Seaborn is considered to make “prettier” charts than Matplotlib, and even if you would like to use it, I suggest you start by learning Matplotlib. Pandas also has chart producing capabilities – and yes, it uses Matplotlib internally. If you are working on a Python data science project – there’s a very high chance that your output is rendered using Matplotlib. That’s how common this module is for making charts.
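A minimal sketch of producing a chart and saving it as an image file (the ages here are made up):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display; use plt.show() when working interactively
import matplotlib.pyplot as plt

ages = [35, 13, 16, 20, 32, 32, 20, 25, 33, 40]

plt.hist(ages, bins=5)          # a histogram of the age values
plt.xlabel("Age")
plt.ylabel("Number of people")
plt.title("Age Distribution")
plt.savefig("ages.png")         # write the chart to an image file
```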


Scikit-learn

Scikit-learn, also called sklearn, is the most used library for machine learning. It has functionality central to machine learning, namely clustering, regression and classification. It contains so many complex algorithms that it is less a single module than a collection of several machine learning modules. Scikit-learn also provides some data-wrangling functionality to preprocess your data where Pandas might come up a bit short.

If you do a course related to data science, you will probably learn algorithms like k-means, random forests, nearest neighbors and logistic regression (to name a few). Although you learn how these work internally, a data scientist (or a data science programmer) is never expected to implement these algorithms themselves. They just use a module (probably Scikit-learn) which already has implementations of these algorithms in a generalized, re-usable way. We just need to pick the required components and implement our project using them.

Scikit-learn internally uses Numpy for its processing, and integrates naturally with Pandas and Matplotlib. Not just Scikit-learn – all four modules introduced in this article interoperate well with one another. That is an important reason they have become so popular and useful – each focuses strongly on its own purpose while working well in combination with the others.
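As a tiny sketch of how these pieces fit together – a Numpy array goes in, Scikit-learn runs an algorithm (KMeans clustering, chosen here just as an example), and a Numpy array of results comes out:

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming two obvious groups
points = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points)  # a Numpy array of cluster labels

print(labels)  # the first three points share one label, the last three the other
```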


Conclusion

Most data science projects follow a pattern. We acquire data, do numerical and algebraic calculations, run our data science algorithms on it, and finally present our results as visualizations. The four modules recommended above map directly to these four tasks: Pandas to acquire data, Numpy for crunching numbers, Scikit-learn for some algorithmic magic, and Matplotlib to add charts to your reports.

Once you have learned these four modules, you will know which direction you want to go next and can expand your skillset from there. Some notable mentions are Tensorflow, Scikit-image, Keras, and PyTorch. Four is a pretty small number – there are countless libraries and modules in the world of Python and data science – but learning these four will give you the solid grounding you need to launch your data science journey.

Python Basics: Data Science with Python Part 3

This is the continuation of the Python Basics tutorial – the second part of Python Basics, and the third part of the Data Science with Python series. This tutorial can also be consumed as a Jupyter notebook available here. Let’s continue then.

Preparing Data for Analysis

Remember the list of strings that we created in the previous part –

data_list = [
    "John Smith,35,Male,Australia",
    "Lily Pina,13,Female,USA",
    "Julie Singh,16,Female,India",
    "Rita Stuart,20,Female,Singapore",
    "Trisha Patrick,32,Female,USA",
    "Adam Stork,32,Male,USA",
    "Mohamed Ashiq,20,Male,Malaysia",
    "Yogi Bear,25,Male,Singapore",
    "Ravi Kumar,33,Male,India",
    "Ali Baba,40,Male,China"
]

Let’s convert this list of bulky strings into a neat list of dictionaries with proper data type for age. What we are going to do is –

  • Create an empty list to store the processed lines
  • Loop over each string using the for line in data_list: syntax
  • Split each line into its components using the .split() method
  • Create a dictionary with proper field names
    • Use the int() function to convert age into a number; otherwise it will be stored as a string.
    • Append this dictionary to the processed lines list using the .append() method
processed_data = []
for line in data_list:
    fields = line.split(',')
    person = {
        'name': fields[0],
        'age': int(fields[1]),
        'sex': fields[2],
        'country': fields[3]
    }
    processed_data.append(person)


Using the processed list

Each element in a collection can be a collection itself. That’s what we have done here. We have created a collection of collections – or rather, a list of dictionaries.

  • processed_data is a list – it’s elements are accessed using a numerical index (starting with 0)
  • processed_data[0] is the first element of the list. processed_data[1] is the second element and so on. Note that each element is a dictionary (that we appended in the previous step)
  • Elements of a dictionary are accessed using their key names. So processed_data[0]['name'] means to fetch the first element (which is a dictionary) and then fetch the ‘name’ field from it.
print(processed_data[0]['name'], processed_data[0]['age'])
print(processed_data[1]['name'], processed_data[1]['age'])

Stepping into Data Science

Let’s print some statistics from our data. First, let’s calculate the average age of the people in our dataset –

  • Average is sum divided by count.
  • Count can be easily obtained using the len() function that returns the size of any string or collection passed as argument
  • Sum can be obtained by looping through the list and accumulating the age values into a sum variable.
  • Finally divide sum by count to get the average
sum_of_ages = 0
number_of_persons = len(processed_data) # len function gives size of collection
for person in processed_data:
    sum_of_ages = sum_of_ages + person['age']
print("Number of persons:", number_of_persons)
print("Average age:", sum_of_ages / number_of_persons)


A condition specified using the ‘if’, ‘elif’ and ‘else’ keywords helps us branch our program’s execution based on the given condition. The statement blocks to be executed are indented under each branch. Example –

if age < 18:
    person_type = 'kid'
    print('Person is just a kid')
elif age < 60: # elif means else-if
    person_type = 'adult'
    print('Person is an adult')
else: # neither condition above matched
    person_type = 'senior'
    print('Person is a senior')

The < in age < 18 is a ‘comparison operator’. The comparison operators are <, >, <=, >=, == and !=. == evaluates to True if both sides are equal; != evaluates to True if both sides are not equal.

Two more operators are in and not in. These are for conditions where you have to check whether a value is in a collection. Example if country in country_list: or if student not in class:.
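For example:

```python
country_list = ["India", "USA", "Singapore"]

print("India" in country_list)      # prints True
print("Japan" not in country_list)  # prints True
```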

Let’s use the conditions to report how many of our people are eligible to vote –

number_of_voters = 0
number_of_nonvoters = 0
for person in processed_data:
    # person['country'] gives a country name
    # which can be used to get the voting age
    # from the voting_ages dictionary
    # >= because a person exactly at the voting age can vote
    if person['age'] >= voting_ages[person['country']]:
        number_of_voters = number_of_voters + 1
    else:
        number_of_nonvoters = number_of_nonvoters + 1

print("Number of voters:", number_of_voters)
print("Number of non voters:", number_of_nonvoters)

Logical Operators

Conditions often need to be combined to be useful – for example, a person should be under 18 years old and should be male. To represent combinations of conditions like this, Python has logical operators – and, or and not.

if age < 18 and sex == "Male":
    print("Male Child")

Another example –

if country == "India" or country == "China":
    print("Asia")       # an example action for this branch
elif country == "Spain" or country == "Italy":
    print("Europe")

Slicing Operator

Slicing is an operation that can be used on lists and strings in Python. It is quite simple and very useful to quickly fetch a range of items from a list (or a range of characters from a string).

We access list items with their numerical indexes, but we can also give a range inside the square brackets to get multiple items at once – sort of like a sub-list. For example, my_list[2:6] will return the elements from index 2 to 5. Register this – [a:b] means from ‘a’, up to, but not including, ‘b’. Also remember the index starts with zero.

Using negative values with the slicing operator is also possible. It simply means elements are counted from the end – you subtract the values from the length. That is, if the length of the list is l, then [-a:-b] means [l-a:l-b].

Leaving out a value means ‘from the beginning’ or ‘till the end’. That is, [:b] means from the start, up to, but not including, ‘b’. And [a:] means from ‘a’ until the end.

my_list = ['apple', 'orange', 'grape', 'melon', 'lemon', 'cherry', 'banana', 'strawberry']
print(my_list[2:6])   # prints ['grape', 'melon', 'lemon', 'cherry']
print(my_list[2:])    # prints ['grape', 'melon', 'lemon', 'cherry', 'banana', 'strawberry']
print(my_list[:6])    # prints ['apple', 'orange', 'grape', 'melon', 'lemon', 'cherry']

print(my_list[-6:-2]) # prints ['grape', 'melon', 'lemon', 'cherry']
print(my_list[:-2])   # prints ['apple', 'orange', 'grape', 'melon', 'lemon', 'cherry']

This works exactly the same when used with strings. For example, if my_name is a string variable, my_name[-2:] means the last two characters of the string, and my_name[:2] means the first two characters.
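For example –

```python
my_name = "John Smith"

print(my_name[-2:])  # prints th - the last two characters
print(my_name[:2])   # prints Jo - the first two characters
```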

Experiment and learn the slicing operator until you are confident.


Functions

Functions are named, reusable bits of code, which you first define and then call whenever required. Defining our own functions helps modularize our code and reduce duplication. It also makes it convenient to introduce changes in the future. For example, let’s introduce a concept of short names for people in our dataset. For now, a short name is made by joining the first letter of the first name and the first 4 letters of the last name. So for “John Smith”, the short name will be “JSmit”.

name = "John Smith"
first_letter = name[:1]
last_name = name.split(" ")[1]
short_last_name = last_name[:4]
short_name = first_letter + short_last_name

The above code gives short name for “John Smith”. But when we need short name for “Jacob Nilson”, we have to write the same set of 5 lines of code again. Any time we change our formula for creating short names, we have to change all this code. This is where functions help.

Functions allow us to package processing like this and reuse it as and where it is required. Functions are defined using the def keyword. Then a function name (in this example ‘short_name’) and then in brackets, a list of parameters that the function can take as input. Similar to the for loop and if conditions, the set of statements forming the function block is indented below the declaration line.

def short_name(full_name):
    first_letter = full_name[:1]
    last_name = full_name.split(" ")[1]
    short_last_name = last_name[:4]
    short_name = first_letter + short_last_name
    return short_name

print(short_name("John Smith"))
print(short_name("Jacob Nilson"))
print(short_name("James Maroon"))
print(short_name("Jill Jack"))

Now wherever we need this functionality, we can just call this function by its name. We don’t have to write the same code again and again.

Another advantage is when introducing a change, we just make the change in the function definition and it reflects in all places where we have called the function. So whenever you’re implementing a formula or an algorithm, it’s better to define it as a function and then call it wherever required.


Modules

Functions defined like the above are usually grouped together into a ‘module’ that we import before using them. For example, there are a ton of functions in the math module in Python. Keeping all functions in the global scope is bad for manageability, so we arrange them into modules and ‘import’ them into our programs if and when required.

Say we need the square root function. It’s in the math module. So we can import the math module and call math.sqrt function from it.

import math
print(math.sqrt(25)) # prints 5.0

Or, we can import only the sqrt function and use it without specifying a module name.

from math import sqrt
print(sqrt(25)) # prints 5.0

As you advance, you will not only define your own functions, you will also organize your code into modules.


So that concludes my super fast intro to Python. It’s not really much when compared to the vastness of the Python ecosystem, but it’s a good start for the data science journey we’re taking up. There are more concepts, but I find it easier to introduce them when we’re about to actually use them for something. So although the Python Basics part is over, I will continue to introduce Python concepts and methods as we progress.

Python Basics: Data Science with Python Part 2

This tutorial is part of a series “Data Science with Python”. A set of tutorials aimed at helping beginners get started with data science and Python.

Consider this article a super fast tutorial for Python. I’m not taking the usual feature-by-feature tutorial route, because there’s an unbelievable amount of Python tutorials already available on the internet. If you are completely new to programming or you are interested in learning Python more in depth, I advise you to read the official Python tutorial.

This tutorial can also be consumed as a Jupyter notebook available here. Let’s get started then.

Assignment Statement

  • One thing that’s common in programs is to give names to values. This is called ‘assignment’.
  • person_name = 'John Smith' is an assignment statement. person_name on the left is a ‘variable’ (note that it has no quotes). 'John Smith' on the right is a ‘literal’
  • person_age = 25 is also an assignment statement. Only difference is now we have assigned a number (25) to a variable named person_age
  • total_value = 25 + 35 + 45 is also an assignment statement. First the value of 25 + 35 + 45 is calculated and the result is assigned to a variable named total_value.
  • Variable names are not put in quotes. Text values like ‘John Smith’ and “This is my chat message” are put in single or double quotes. Numbers and Boolean Values (True, False) are not put in quotes.

The print function

  • print("Hello World!") prints “Hello World!” to the screen. Text (like ‘Hello World!’) are called strings in Python and should be enclosed in single or double quotes. Numbers and Booleans (True, False) should not be enclosed in quotes.
  • print() is called a function in Python. print is the name of the function and the text you provide inside brackets is called an argument. As you work with Python you will use a lot more functions and even write your own functions.
  • print("Hello", "World!") prints the same thing as above. You can put any number of items in and the print() function will print them separated by spaces. Now you have provided two arguments to print.
  • print("John", "James", "Stuart", "Jacob", sep=", ") prints the four names separated by commas. Now you have provided five arguments to the print function. One of them – sep – is a ‘keyword argument’: an argument that has a name.
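Running these variations side by side –

```python
print("Hello World!")                                # Hello World!
print("Hello", "World!")                             # Hello World!
print("John", "James", "Stuart", "Jacob", sep=", ")  # John, James, Stuart, Jacob
```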


Dictionaries

In Python, a collection is a bunch of values grouped together. A dictionary is one type of collection: a list of values where each value has a key associated with it.

Say we want to store a list of voting ages in different countries. It would be cumbersome to create and work with lots of variables like usa_voting_age, india_voting_age, singapore_voting_age and so on.

Instead, we create a dictionary. We will name this dictionary voting_ages. The country names will be ‘keys’ and the voting ages will be ‘values’. When we need the voting age of India, we can simply fetch it by voting_ages['India'].

Dictionaries are created by specifying a list of key:value inside a set of curly brackets.

voting_ages = {
    "India": 18,
    "USA": 18,
    "China": 18,
    "Australia": 18,
    "Singapore": 21,
    "Malaysia": 21
}

print("Voting age in China is", voting_ages['China'])
Voting age in China is 18


Loops

How common are tasks like ‘add up all the values in this list’, ‘print all the names from this list’, or ‘check which of the items in this list weigh more than 20 kilograms’? Pretty common, right? A large part of your time as a data science programmer will be spent on loops. There are different types of loops in Python. Let’s learn one common loop – looping through a collection.

Syntax of a loop :

for variable_name in collection:
    # statements inside the loop (the indented block)
    print(variable_name)
# outside the loop now
print('Loop finished.')

for variable_name in collection: marks the start of a loop. This means ‘execute the following statements for every value in the collection’. Each value in the collection is assigned to the variable_name, and then the set of statements underneath it are executed. This is repeated for every element in the collection.

The statements after that line are indented to denote that they are part of the loop. The set of statements with the indent is called a ‘block’. The block ends when we stop indenting.

for country in voting_ages:
    # country is a variable which gets each key in the dictionary
    # in this case, each country name.
    print("Voting age in", country, "is", voting_ages[country])
Voting age in India is 18
Voting age in USA is 18
Voting age in China is 18
Voting age in Australia is 18
Voting age in Singapore is 21
Voting age in Malaysia is 21

Preparing Data from Text

One common task in data science is to read a bunch of text line-by-line and create better quality data from it. For example, each line of your data might look like "John Smith,35,Male,Australia" – name, age, sex and country separated by commas. It would be easier to work with if it were a dictionary with each of those values mapped to its corresponding name.

So that line of text gets converted into a dictionary – {"name": "John Smith", "age": 35, "sex": "Male", "country": "Australia"}.

Obviously you have several lines of such data. So we can create a list of this data for easier processing. A list is another type of collection in Python. Dictionaries have ‘keys’ to access the values, whereas lists don’t have keys – it’s just a collection of values. You can access list values using the looping syntax we saw above, or by using a numerical index. Lists are created by specifying a bunch of values inside square brackets.

list_names = ["John", "Jacob", "James", "Julie"]
print(list_names[0]) # will print "John"
print(list_names[1]) # will print "Jacob"
# Declaring a list of strings
data_list = [
    "John Smith,35,Male,Australia",
    "Lily Pina,13,Female,USA",
    "Julie Singh,16,Female,India",
    "Rita Stuart,20,Female,Singapore",
    "Trisha Patrick,32,Female,USA",
    "Adam Stork,32,Male,USA",
    "Mohamed Ashiq,20,Male,Malaysia",
    "Yogi Bear,25,Male,Singapore",
    "Ravi Kumar,33,Male,India",
    "Ali Baba,40,Male,China"
]

Dot operator

A function associated with a particular object is called a ‘method’. Methods do something with the object they are associated with. For example, strings have a method called ‘split’. It splits a string into multiple parts and returns the parts as a list. To call such methods, we use the dot operator.

names = 'John,Jacob,Jaden,Jill,Jack'
names_as_list = names.split(',')
# names is a string and 'split' is a string method.
# split(',') means split the string considering comma as separator

Now you can loop through the names using for-loop syntax like for name in names_as_list:.

Similarly, lists have an .append() method which can be used to add more elements to a list. It is common to declare an empty list using empty square brackets (like my_list = []) and then adding elements to it using the append method (like my_list.append(25)).
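For example –

```python
my_list = []        # start with an empty list
my_list.append(25)  # add elements one at a time
my_list.append(30)

print(my_list)      # prints [25, 30]
```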

Data Types

Every variable in Python has a ‘type’ based on the value assigned to it. Handling data types is quite common in data science tasks because data is usually provided as text, and it’s up to the programmer to convert it to the type they want. This is important because what Python can do with the data differs by what type the data is.

For example, 25 can be a number, and 25 also can be thought of as a string.

a = 25
b = '25'
# a is a number, and b is a string
print(a * 3) # will print 75 : 3 times 25
print(b * 3) # will print 252525 : the string '25' repeated 3 times

To be clear about these things, we have to check and convert data types wherever required. To convert a string value to an integer value, we use the int() function. For example, b = int('25') will make b an integer variable, even though we have given 25 in quotes. If we do b = float('25'), b will be a number with a decimal point (25.0). The other way around is also possible – b = str(25) will make b a string variable even though we specified 25 without quotes.
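Putting the conversions together –

```python
b = int('25')    # b is the integer 25
c = float('25')  # c is the number 25.0
d = str(25)      # d is the string '25'

print(b + 5)     # prints 30 - arithmetic works on the integer
print(d + '5')   # prints 255 - joining strings, not adding numbers
```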

[Continued in next part…]

Hello World: Data Science with Python Part 1

This tutorial is part of a series “Data Science with Python“. A set of tutorials aimed at helping beginners get started with data science and Python.

Installing Python

Installing Python can be as simple as downloading the Python executable from the official website. But the more common way to get Python in the data science world is through a package manager like Conda.

Conda is a popular package manager as well as environment manager.

There are 3 main advantages to installing Conda over plain Python –

  1. Your work will be reproducible. When you send your project to someone else you just have to tell them to install Conda (and maybe the version), instead of telling them the versions of all your dependencies. Using a dependency manager makes it easy to share your work.
  2. You will avoid package installation and dependency problems. As a beginner, you are unlikely to hit dependency conflicts, but it’s still an important advantage.
  3. You can use it as an environment manager – that is, if different projects you are working on use different dependencies (or different versions), you can create isolated environments for each project to avoid dependency problems, and easily switch between them.

Download the latest version of Miniconda as per your operating system (from the Conda download page) and install it. Installation for most people is just double-clicking on the downloaded file or running a command on the terminal.

  • For Windows, you just have to double-click on the exe file you downloaded.
  • For Mac, there are two options
    • Download the pkg file and double-click to run it, or –
    • Download the script and run it in the terminal using bash ~/Downloads/, and answer the questions asked.
  • For Linux, you need to download the script and run it in the terminal using bash ~/Downloads/, and answer the questions asked.

If you need them, the full installation instructions can be found here –

I’ve deliberately kept the installation instructions minimal because there’s a ton of reference and troubleshooting information if you just search the internet for ‘miniconda installation on <<operating system>>’. After you have successfully installed it, you should be able to run conda --version in the terminal to see the version of Conda you installed. Also run python --version to see the Python version installed.

(base) ➜  data-science python --version
Python 3.9.5
(base) ➜  data-science conda --version
conda 4.10.3
(base) ➜  data-science

Writing a Python Program

You need a text editor to write your code. I use Visual Studio Code. You can install helpful extensions to Visual Studio Code if you like, but I’m not using any for this tutorial. The basic way to run Python programs is to write a program in your text editor and then run it in the terminal using the Python interpreter.

Now that we have everything we need set up, let’s write and execute a simple Python program – a program that prints ‘Hello World!’ on the screen.

1. Create an Environment

An ‘environment’ is the combination of the Python version and packages you need. For one project you might be using Python 3.9 and packages x, y and z. For another project you might be using Python 2.7 and packages a, b and c. Environments provide us a way of keeping these dependencies clear from one another and reduce confusion when switching between projects.

You can also share the environment as an environment.yml file along with your code, so that the recipient can run your project without worrying much about the dependencies. Let’s create an environment named dstut (for data science tutorial) to use along with this set of tutorials.

Execute conda create --name dstut python=3.9 to create the environment. This will create an environment with Python version 3.9, and you will be asked to confirm a set of default packages required for this.

Once the environment is created you have to ‘activate’ it every time you need to use it. Execute conda activate dstut to activate the environment we created. You’ll see that the command prompt now includes (dstut) to denote that you are in your new Conda environment.

2. Write your Code

Open your text editor and create a file with a single line of code –

print("Hello World!")

Save the file as

3. Execute the Program

Switch back to your terminal and execute python to run your code. If everything is in order, you should see the text Hello World! printed on the terminal.

(dstut) ➜  hello-world python 
Hello World!

Intro to Jupyter Notebook

A Jupyter notebook is a format where you can create a single document which includes your Python code, widgets, charts and documentation. It’s great for sharing your work. In fact, in academic circles it’s kind of the default to share your work through such ‘notebooks’. Almost every homework or assignment I submitted was through these notebooks.

So let’s also see how to run our ‘Hello World!’ program using a Jupyter notebook.

Installing Jupyterlab

I hope you still haven’t closed the terminal. If you have, then open it back again, change to your directory (cd <<directory-name>>) and execute conda activate dstut to activate our tutorial environment.

Execute conda install jupyterlab to install Jupyter in the environment. You will be presented with a list of packages and asked to confirm to install. Go ahead and finish the installation.

Note that you have installed Jupyterlab only inside the environment. That is, once you are out of the environment, (by closing the terminal or by doing conda deactivate), it will be as if there is no Jupyterlab on your computer. And whenever you do conda activate dstut, Jupyterlab is back on again! That’s one of the core functions of an environment manager like Conda.

Okay, now that Jupyterlab is installed, execute jupyter notebook to start the notebook server. This will start the server and open the interface automatically in your default browser. If you need to open the page by yourself, or in a different browser – the terminal will show a URL (like http://localhost:8888/?token=5410e03089b55baba71dubidabab57dudu85207ce07380a9). Copy that URL and paste it into your browser’s address bar to open the Jupyter notebook interface.

Creating our Document

Once you’re in the homepage, use the ‘New’ menu to create a new notebook.

On the new document that’s created, there’s one ‘cell’ by default. A Jupyter notebook is a sequence of such ‘cells’, each with a type. For now, we are going to have two cells – one for giving a title to our document and one for our ‘Hello World!’ code.

Change the first cell’s type to ‘Markdown’ using the dropdown on the toolbar or by using the ‘Cell -> Cell Type’ menu. And then type # Program to Print 'Hello World!'. Then add a new cell using the ‘+’ button on the toolbar or by using the ‘Insert’ menu.

The new cell is by default of type ‘Code’, which is what we want. If you remember, the code we wrote earlier is print('Hello World!'). Type this code in the new cell we created.

Running the Document

Now let’s get the output by using the ‘Run’ command. You can either execute the two cells one by one using the ‘Run’ button on the toolbar, or use the ‘Cell -> Run All’ menu item to run all cells in order.

We can also click on the ‘Untitled’ title on the top, and give it a meaningful name that suits our project.

Close the window and switch back to your terminal. You’ll find the server is still running there. Press Ctrl + C to shut down the server. If you type ls you’ll see that Jupyterlab has saved your document as a file with an ipynb extension. (ipynb stands for IPython Notebook.) Then you can do conda deactivate to deactivate our dstut environment, or just close the terminal.

That’s it. Now you know how to set up a Python Data Science environment, write and execute Python code, and create Jupyter notebooks.

Benchmarking Java Methods using JMH with Gradle

JMH (Java Microbenchmark Harness) is a library for benchmarking Java code. Running a benchmark using a library such as JMH is better than simply measuring the execution time of methods using System.currentTimeMillis() or System.nanoTime().
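For contrast, a naive measurement – the kind JMH improves upon – looks roughly like this sketch. The class and method names here are made up for illustration; a single unwarmed run like this mixes JIT compilation and other JVM effects into the result:

```java
// Naive timing with System.nanoTime(): a single measurement with no
// warmup, so JIT compilation and other JVM effects skew the number.
public class NaiveTiming {
    static double work() {
        double s = 0;
        for (int i = 0; i < 1_000_000; i++) {
            s += Math.sqrt(i);
        }
        return s;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        double result = work();
        long elapsedNanos = System.nanoTime() - start;
        System.out.println("work() took " + elapsedNanos + " ns (result " + result + ")");
    }
}
```

JMH addresses exactly these problems by running warmup and measurement iterations for you, as we’ll see below.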

The easiest way to run a benchmark using JMH is to use Maven/Gradle plugins. This article shows how to write and run a simple benchmark using the Gradle plugin.

I assume you already have a Gradle project in which you want to use JMH, but if you wish to follow the walkthrough in this article, you can check out the code from here as a starting point – it was used in the other article ‘Measuring Execution Time of Java Methods‘. It is a simple project with a class called ‘RandomNumbers’ that creates a list of 100 random numbers and stores it in memory when an object is instantiated.
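The RandomNumbers class itself isn’t reproduced in this article, but based on the description above, a minimal sketch might look like this (the field and method names are my assumptions, not necessarily the original code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch: builds and stores a list of 100 random numbers
// as soon as an object is instantiated.
public class RandomNumbers {
    private final List<Integer> numbers = new ArrayList<>();

    public RandomNumbers() {
        Random random = new Random();
        for (int i = 0; i < 100; i++) {
            numbers.add(random.nextInt());
        }
    }

    public List<Integer> getNumbers() {
        return numbers;
    }
}
```

The constructor does all the work, which is why instantiating this class is a sensible thing to benchmark.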

Add The Plugin

The latest version of this plugin is 0.6.5 at the time of writing this post. You should use whichever version is the latest when you read this.

plugins {
    id 'me.champeau.jmh' version '0.6.5'
}

Write Your Benchmarks

The plugin expects all our benchmarks to be placed under a src/jmh directory, with Java sources in src/jmh/java. I like this because it is similar to how we keep our unit tests – in a separate src/test folder. So create a jmh/java directory inside src, then create the folder structure for your package inside it. That is, if your package name is com.example.mybenchmark, create the com/example/mybenchmark directory inside src/jmh/java.

Create a Java file to put your benchmarks in. Here is a simple benchmark –

@State(Scope.Benchmark)
public class RandomNumbersBenchmark {
    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void initializeRandomNumbers(Blackhole bh) {
        bh.consume(new RandomNumbers());
    }
}

new RandomNumbers() is the piece of code we are benchmarking – the performance of creating new RandomNumbers objects. The Blackhole.consume() method ensures that JVM optimisations don’t get in the way of our benchmark. Without it, the JVM can see that we never actually use the new RandomNumbers object, and optimise our code accordingly. JVMs do many such things, and benchmarking frameworks like JMH give us ways to keep such optimisations from skewing our benchmark numbers.
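As an aside, JMH also treats a benchmark method’s return value as implicitly consumed, so a variant of the benchmark without a Blackhole parameter could look like this sketch:

```java
import org.openjdk.jmh.annotations.Benchmark;

public class RandomNumbersBenchmark {
    // Returning the object lets JMH consume it implicitly,
    // protecting against dead-code elimination just like
    // Blackhole.consume() does.
    @Benchmark
    public RandomNumbers initializeRandomNumbers() {
        return new RandomNumbers();
    }
}
```

Explicit Blackhole parameters are still useful when a benchmark produces more than one value to consume.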

Run the Benchmarks

Running the benchmarks is quite simple. Just execute ./gradlew jmh from the project directory (where the build.gradle file is). You should see the benchmark running and displaying its metrics as it runs, and the output ends with a summary like so –

Benchmark                                       Mode  Cnt     Score    Error  Units
RandomNumbersBenchmark.initializeRandomNumbers  avgt   25  2056.808 ± 84.686  ns/op

The full project for trying out can be downloaded from GitHub –

Screenshot of a JMH Benchmark Running


There are a ton of annotations in JMH which you can refer to in the documentation. But I’ll briefly write about the annotations used in the above sample.

@State is used to assign a “Scope” for the benchmark. Possible scopes are Benchmark, Group and Thread.

The @Benchmark annotation identifies a method as a benchmark, similar to how the @Test annotation marks unit tests.

The @OutputTimeUnit annotation specifies the time unit your benchmark output should be in.

The @BenchmarkMode annotation specifies which modes the benchmark runs in. Note that this annotation can take multiple modes. Possible modes are AverageTime, SampleTime, SingleShotTime, Throughput and All.
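To illustrate supplying multiple modes, here is a hypothetical benchmark (not from the project above) that measures both average time and throughput in a single run:

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

public class MultiModeBenchmark {
    // Runs in two modes: average time per call (us/op)
    // and throughput (ops/us).
    @Benchmark
    @BenchmarkMode({Mode.AverageTime, Mode.Throughput})
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public double sumOfSquareRoots() {
        double s = 0;
        for (int i = 0; i < 1_000; i++) {
            s += Math.sqrt(i);
        }
        return s;
    }
}
```

JMH will then print a separate result row per mode in the summary table.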

For reference, all annotations and their meanings can be found in the JavaDoc for JMH.

Why JMH?

Microbenchmarking is a big step above simple time capturing. JMH handles warmup, iteration and thread management for us. It also helps us avoid common benchmarking traps – for example, side-stepping JVM optimisations so that our benchmarks stay accurate.