Python Basics: Data Science with Python Part 3

This is the continuation of the Python Basics tutorial: the second part of Python Basics and the third part of the Data Science with Python series. This tutorial can also be consumed as a Jupyter notebook available here. Let’s continue then.

Preparing Data for Analysis

Remember the list of strings that we created in the previous part –

data_list = [
    "John Smith,35,Male,Australia",
    "Lily Pina,13,Female,USA",
    "Julie Singh,16,Female,India",
    "Rita Stuart,20,Female,Singapore",
    "Trisha Patrick,32,Female,USA",
    "Adam Stork,32,Male,USA",
    "Mohamed Ashiq,20,Male,Malaysia",
    "Yogi Bear,25,Male,Singapore",
    "Ravi Kumar,33,Male,India",
    "Ali Baba,40,Male,China"
]

Let’s convert this list of bulky strings into a neat list of dictionaries with the proper data type for age. What we are going to do is –

  • Create an empty list to store the processed lines
  • Loop over each string using the for line in data_list: syntax
  • Split each line into its components using the .split() method
  • Create a dictionary with proper field names
    • Use the int() function to convert age into a number. Otherwise it will be stored as a string.
  • Append this dictionary to the processed lines list using the .append() method
processed_data = []
for line in data_list:
    fields = line.split(',')
    processed_data.append({
        'name': fields[0],
        'age': int(fields[1]),
        'sex': fields[2],
        'country': fields[3]
    })

processed_data

Using the processed list

Each element in a collection can be a collection itself. That’s what we have done here. We have created a collection of collections – or rather, a list of dictionaries.

  • processed_data is a list – its elements are accessed using a numerical index (starting with 0)
  • processed_data[0] is the first element of the list. processed_data[1] is the second element and so on. Note that each element is a dictionary (that we appended in the previous step)
  • Elements of a dictionary are accessed using their key names. So processed_data[0]['name'] means to fetch the first element (which is a dictionary) and then fetch the ‘name’ field from it.
print(processed_data[0]['name'], processed_data[0]['age'])
print(processed_data[1]['name'], processed_data[1]['age'])

Stepping into Data Science

Let’s print some statistics from our data. First calculate average age of people in our dataset –

  • Average is sum divided by count.
  • Count can be easily obtained using the len() function that returns the size of any string or collection passed as argument
  • Sum can be obtained by looping through the list and accumulating the age values into a sum variable.
  • Finally divide sum by count to get the average
sum_of_ages = 0
number_of_persons = len(processed_data) # len function gives size of collection
for person in processed_data:
    sum_of_ages = sum_of_ages + person['age']
    
print("Number of persons:", number_of_persons)
print("Average age:", sum_of_ages / number_of_persons)
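
The same average can also be sketched with Python's built-in sum() function, which adds up the items of a collection. The small people list below is only a stand-in for processed_data (its first three records) so that the snippet runs on its own; treat it as an illustration rather than a replacement for the loop above.

```python
# Stand-in for processed_data: the first three records of the tutorial's data
people = [
    {'name': 'John Smith', 'age': 35},
    {'name': 'Lily Pina', 'age': 13},
    {'name': 'Julie Singh', 'age': 16},
]

# Collect the ages into a list, then let sum() and len() do the arithmetic
ages = []
for person in people:
    ages.append(person['age'])

average_age = sum(ages) / len(ages)
print("Number of persons:", len(ages))
print("Average age:", average_age)
```

Whether you accumulate the total yourself or call sum() is a matter of taste at this size; the built-in simply saves a few lines.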

Conditions

Conditions specified using the ‘if’, ‘elif’ and ‘else’ keywords help us branch our program’s execution based on the given condition. The statement block to be executed for each condition is indented below it. Example –

age = 25 # example value so that the snippet runs on its own
if age < 18:
    person_type = 'kid'
    print('Person is just a kid')
elif age < 60: # elif means else-if
    person_type = 'adult'
    print('Person is an adult')
else:
    person_type = 'senior'
    print('Person is a senior')

The < in age < 18 is a ‘comparison operator’. The comparison operators are <, >, <=, >=, == and !=. == means True if both sides are equal. != means True if both sides are not equal.

Two more operators are in and not in. These are for conditions where you have to check whether a value is in a collection. Example: if country in country_list: or if student not in class_list: (note that class by itself is a reserved word in Python and cannot be used as a variable name).
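
As a minimal sketch of in and not in, the snippet below checks two country names against a small list. The list contents and the variable names india_status and spain_status are made up for this illustration.

```python
# Hypothetical example values, not taken from the tutorial's data set
country_list = ["India", "USA", "China"]

country = "India"
if country in country_list:
    india_status = "listed"
else:
    india_status = "not listed"
print(country, "is", india_status)

country = "Spain"
if country not in country_list:
    spain_status = "not listed"
else:
    spain_status = "listed"
print(country, "is", spain_status)
```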

Let’s use the conditions to report how many of our people are eligible to vote –

number_of_voters = 0
number_of_nonvoters = 0
for person in processed_data:
    # person['country'] gives a country name
    # which can be used to get voting age
    # from the voting_ages dictionary
    # a person who is exactly at the voting age can vote, so use >=
    if person['age'] >= voting_ages[person['country']]:
        number_of_voters = number_of_voters + 1
    else:
        number_of_nonvoters = number_of_nonvoters + 1

print("Number of voters:", number_of_voters)
print("Number of non voters:", number_of_nonvoters)

Logical Operators

Conditions often need to be combined to be useful. For example, a person should be under 18 years and should be male. To represent combinations of conditions like this, Python has logical operators – and, or and not.

if age < 18 and sex == "Male":
    print("Male Child")

Another example –

if country == "India" or country == "China":
    print("Asia")
elif country == "Spain" or country == "Italy":
    print("Europe")
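
The not operator is not shown in the examples above. Here is a small sketch of it; the values of age and sex and the names is_minor and description are chosen only for illustration.

```python
# Example values chosen only for illustration
age = 25
sex = "Female"

# 'not' flips a condition: True becomes False and False becomes True
is_minor = age < 18
if not is_minor and sex == "Female":
    description = "Adult Female"
else:
    description = "Someone else"
print(description)
```

Here not is_minor is evaluated first and the result is then combined with the second condition using and.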

Slicing Operator

Slicing is an operation that can be used on lists and strings in Python. It is quite simple and very useful to quickly fetch a range of items from a list (or a range of characters from a string).

We access list items with their numerical indexes, but we can also give a range inside the square brackets to get multiple items at once – sort of like a sub-list. For example, my_list[2:6] will return the elements from index 2 to 5. Register this – [a:b] means from ‘a’ up to, but not including, ‘b’. Also remember that the index starts with zero.

Using negative values with the slicing operator is also possible. It simply means the positions are counted from the end, which is the same as subtracting the values from the length. That is, if the length of the list is l, then [-a:-b] means [l-a:l-b].

Leaving out a value means start from the beginning or continue till the end. That is, [:b] means from the start up to, but not including, ‘b’. And [a:] means from ‘a’ until the end.

my_list = ['apple', 'orange', 'grape', 'melon', 'lemon', 'cherry', 'banana', 'strawberry']
print(my_list[2:6])   # prints ['grape', 'melon', 'lemon', 'cherry']
print(my_list[2:])    # prints ['grape', 'melon', 'lemon', 'cherry', 'banana', 'strawberry']
print(my_list[:6])    # prints ['apple', 'orange', 'grape', 'melon', 'lemon', 'cherry']

print(my_list[-6:-2]) # prints ['grape', 'melon', 'lemon', 'cherry']
print(my_list[:-2])   # prints ['apple', 'orange', 'grape', 'melon', 'lemon', 'cherry']

This works exactly the same when used with strings. For example, if my_name is a string variable, my_name[-2:] means the last two characters of the given string. my_name[:2] means the first two characters.
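
A short sketch of slicing a string, assuming my_name holds "John Smith" (the variable names first_two, last_two and middle are just for this example):

```python
my_name = "John Smith"

first_two = my_name[:2]    # first two characters
last_two = my_name[-2:]    # last two characters
middle = my_name[2:6]      # characters at index 2, 3, 4 and 5

print(first_two)
print(last_two)
print(middle)
```

Note that the space between the two names counts as a character too, so it takes up an index of its own.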

Experiment and learn the slicing operator until you are confident.

Functions

Functions are named, reusable bits of code, which you first define and then call whenever required. Defining our own functions helps modularize our code and reduce duplication. It also makes it convenient to introduce changes in the future. For example, let’s introduce a concept of short names for people in our data set. For now, a short name is made by joining the first letter of the first name and the first 4 letters of the last name. So for “John Smith”, the short name will be “JSmit”.

name = "John Smith"
first_letter = name[:1]
last_name = name.split(" ")[1]
short_last_name = last_name[:4]
short_name = first_letter + short_last_name

The above code gives short name for “John Smith”. But when we need short name for “Jacob Nilson”, we have to write the same set of 5 lines of code again. Any time we change our formula for creating short names, we have to change all this code. This is where functions help.

Functions allow us to package processing like this and reuse it as and where it is required. Functions are defined using the def keyword. Then a function name (in this example ‘short_name’) and then in brackets, a list of parameters that the function can take as input. Similar to the for loop and if conditions, the set of statements forming the function block is indented below the declaration line.

def short_name(full_name):
    first_letter = full_name[:1]
    last_name = full_name.split(" ")[1]
    short_last_name = last_name[:4]
    short_name = first_letter + short_last_name
    return short_name

print(short_name("John Smith"))
print(short_name("Jacob Nilson"))
print(short_name("James Maroon"))
print(short_name("Jill Jack"))

Now wherever we need this functionality, we can just call this function by its name. We don’t have to write the same code again and again.

Another advantage is when introducing a change, we just make the change in the function definition and it reflects in all places where we have called the function. So whenever you’re implementing a formula or an algorithm, it’s better to define it as a function and then call it wherever required.
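
To sketch how a change stays in one place, suppose the formula is later changed to use the first 3 letters of the last name instead of 4. This change is hypothetical and only for illustration; the list names and short_names are likewise made up for the example.

```python
def short_name(full_name):
    first_letter = full_name[:1]
    last_name = full_name.split(" ")[1]
    # Hypothetical change: use 3 letters of the last name instead of 4
    return first_letter + last_name[:3]

names = ["John Smith", "Jacob Nilson", "James Maroon", "Jill Jack"]
short_names = []
for name in names:
    short_names.append(short_name(name))
print(short_names)
```

Only one line inside the function changed, yet every call picks up the new formula.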

Modules

Functions defined like the above are usually grouped together into a ‘module’ that we import before using the function. For example there are a ton of functions in the math module in Python. Keeping all functions in the global scope is bad for manageability. So we arrange them into modules and ‘import’ them into our programs if and when required.

Say we need the square root function. It’s in the math module. So we can import the math module and call math.sqrt function from it.

import math
print(math.sqrt(25))

Or, we can import only the sqrt function and use it without specifying a module name.

from math import sqrt
print(sqrt(25))

As you advance, you will not only define your own functions, you will also organize your code into modules.

Conclusion

So that concludes my super fast intro to Python course. It’s not really much when compared to the vastness of the Python ecosystem, but it’s a good start to the data science journey we’re taking up. There are more concepts but I find it easier to introduce concepts when we’re about to actually use them for something. So although the Python Basics part is over, I will continue to introduce Python concepts and methods as we progress.

Python Basics: Data Science with Python Part 2

This tutorial is part of the series “Data Science with Python”, a set of tutorials aimed at helping beginners get started with data science and Python.

Consider this article a super fast introduction to Python. I’m not taking the usual feature-by-feature tutorial route, because there’s an unbelievable amount of Python tutorials already available on the internet. If you are completely new to programming or you are interested in learning Python in more depth, I advise you to read the official Python tutorial.

This tutorial can also be consumed as a Jupyter notebook available here. Let’s get started then.

Assignment Statement

  • One thing that’s common in programs is to give names to values. This is called ‘assignment’.
  • person_name = 'John Smith' is an assignment statement. person_name on the left is a ‘variable’ (note that it has no quotes). 'John Smith' on the right is a ‘literal’
  • person_age = 25 is also an assignment statement. Only difference is now we have assigned a number (25) to a variable named person_age
  • total_value = 25 + 35 + 45 is also an assignment statement. First the value of 25 + 35 + 45 is calculated and the result is assigned to a variable named total_value.
  • Variable names are not put in quotes. Text values like ‘John Smith’ and “This is my chat message” are put in single or double quotes. Numbers and Boolean Values (True, False) are not put in quotes.
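
The assignment statements described above can be tried out together as a small sketch. The variable is_member is a hypothetical addition to show a Boolean value, and print is covered in the next section.

```python
person_name = 'John Smith'    # a string literal, in quotes
person_age = 25               # a number, no quotes
total_value = 25 + 35 + 45    # the right side is calculated first
is_member = True              # a Boolean value, no quotes (hypothetical variable)

print(person_name, person_age, total_value, is_member)
```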

The print function

  • print("Hello World!") prints “Hello World!” to the screen. Text (like ‘Hello World!’) is called a string in Python and should be enclosed in single or double quotes. Numbers and Booleans (True, False) should not be enclosed in quotes.
  • print() is called a function in Python. print is the name of the function and the text you provide inside brackets is called an argument. As you work with Python you will use a lot more functions and even write your own functions.
  • print("Hello", "World!") prints the same thing as above. You can put any number of items in and the print() function will print them separated by spaces. Now you have provided two arguments to print.
  • print("John", "James", "Stuart", "Jacob", sep=", ") prints the four names separated by a comma and a space instead of just a space. Now you have provided five arguments to the print function. One of them – sep – is a ‘keyword argument’, an argument that has a name.

Dictionaries

In Python, a collection is a bunch of values grouped together. A dictionary is one type of collection. It is a group of values where each value has a key associated with it.

Say we want to store a list of voting ages in different countries. It would be cumbersome to create and work with lots of variables like usa_voting_age, india_voting_age, singapore_voting_age and so on.

Instead, we create a dictionary. We will name this dictionary voting_ages. The country names will be ‘keys’ and the voting ages will be ‘values’. When we need the voting age of India, we can simply fetch it by voting_ages['India'].

Dictionaries are created by specifying a list of key: value pairs, separated by commas, inside a set of curly brackets.

voting_ages = {
    "India": 18,
    "USA": 18,
    "China": 18,
    "Australia": 18,
    "Singapore": 21,
    "Malaysia": 21
}

print("Voting age in China is", voting_ages['China'])
Voting age in China is 18

Looping

How common are tasks like, ‘add up all the values in this list’, ‘print all the names from this list’, ‘check which of the items in this list weigh more than 20 kilograms’? Pretty common, right? A lot of your time as a data science programmer will be spent writing loops. There are different types of loops in Python. Let’s learn one common loop – looping through a collection.

Syntax of a loop :

for variable_name in collection:
    inside_the_loop()
    print(variable_name)
    do_some_more_things()
# outside the loop now
print('Loop finished.')

for variable_name in collection: marks the start of a loop. This means ‘execute the following statements for every value in the collection’. Each value in the collection is assigned to the variable_name, and then the set of statements underneath it are executed. This is repeated for every element in the collection.

The following statements after that are indented to denote that they are part of the loop. The set of statements with the indent is called a ‘block’. The block ends when we stop indenting.

for country in voting_ages:
    # country is a variable which gets each key in the dictionary
    # in this case, each country name.
    print("Voting age in", country, "is", voting_ages[country])
Voting age in India is 18
Voting age in USA is 18
Voting age in China is 18
Voting age in Australia is 18
Voting age in Singapore is 21
Voting age in Malaysia is 21

Preparing Data from Text

One common task done in data science is to read a bunch of text line-by-line and create better quality data from it. For example, say each line of your data is like "John Smith,35,Male,Australia" – name, age, sex and country separated by commas. It would be easier to work with if it were a dictionary with each of those values mapped to its corresponding name.

So that line of text gets converted into a dictionary – {"name": "John Smith", "age": 35, "sex": "Male", "country": "Australia"}.

Obviously you have several lines of such data. So we can create a list of this data for easier processing. A list is another type of collection in Python. Dictionaries have ‘keys’ to access the values, whereas lists don’t have keys – it’s just a collection of values. You can access list values using the looping syntax we saw above, or by using a numerical index. Lists are created by specifying a bunch of values inside square brackets.

list_names = ["John", "Jacob", "James", "Julie"]
print(list_names[0]) # will print "John"
print(list_names[1]) # will print "Jacob"
# Declaring a list of strings
data_list = [
    "John Smith,35,Male,Australia",
    "Lily Pina,13,Female,USA",
    "Julie Singh,16,Female,India",
    "Rita Stuart,20,Female,Singapore",
    "Trisha Patrick,32,Female,USA",
    "Adam Stork,32,Male,USA",
    "Mohamed Ashiq,20,Male,Malaysia",
    "Yogi Bear,25,Male,Singapore",
    "Ravi Kumar,33,Male,India",
    "Ali Baba,40,Male,China"
]

Dot operator

A function associated with a particular object is called a ‘method’. Methods do something with the object they are associated with. For example, strings have a method called ‘split’. It splits a string into multiple parts and returns the parts as a list. To call such methods, we use the dot operator.

names = 'John,Jacob,Jaden,Jill,Jack'
names_as_list = names.split(',')
# names is a string and 'split' is a string method.
# split(',') means split the string considering comma as separator

Now you can loop through the names using for-loop syntax like for name in names_as_list:.

Similarly, lists have an .append() method which can be used to add more elements to a list. It is common to declare an empty list using empty square brackets (like my_list = []) and then add elements to it using the append method (like my_list.append(25)).
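
Putting split, a loop and append together, a minimal sketch could look like this. The greetings list and the "Hello " prefix are made up for the example.

```python
names = 'John,Jacob,Jaden,Jill,Jack'
names_as_list = names.split(',')

# Start with an empty list and append one greeting per name
greetings = []
for name in names_as_list:
    greetings.append("Hello " + name)

print(names_as_list)
print(greetings)
```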

Data Types

Every variable in Python has a ‘type’ based on the value assigned to it. Handling data types is quite common when doing data science tasks because data is usually provided as text and it’s up to the programmer to convert it to the type that they want. This is important because what Python can do with the data depends on what type the data is.

For example, 25 can be a number, and 25 also can be thought of as a string.

a = 25
b = '25'
# a is a number, and b is a string
print(a * 3) # will print 75 : 3 times 25
print(b * 3) # will print 252525 : the string '25' repeated 3 times

To be clear about these things, we will have to check and convert data types wherever required. To convert a string value to an integer value, we use the int() function. For example, b = int('25') will make ‘b’ a variable of type integer, even though we have given 25 in quotes. If we do b = float('25'), b will be a number with a decimal point (like 25.0). The other way is also possible, where you convert a number into a string – b = str(25) will make b a string variable even though you have specified 25 without quotes.
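
The conversions described above can be sketched as follows. The variable names c and d are introduced only for this example, and type() is a built-in function, not covered in the text, that reports the type of a value.

```python
b = int('25')      # string to integer
c = float('25')    # string to a number with a decimal point
d = str(25)        # integer to string

print(b + 5)       # arithmetic works on the integer
print(c)
print(d + '5')     # + joins two strings instead of adding numbers
print(type(b), type(c), type(d))
```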

[Continued in next part…]