This is the second part of Python Basics, and the third part of the Data Science with Python series. This tutorial is also available as a Jupyter notebook here. Let’s continue.
Preparing Data for Analysis
Remember the list of strings that we created in the previous part:
data_list = [
    "John Smith,35,Male,Australia",
    "Lily Pina,13,Female,USA",
    "Julie Singh,16,Female,India",
    "Rita Stuart,20,Female,Singapore",
    "Trisha Patrick,32,Female,USA",
    "Adam Stork,32,Male,USA",
    "Mohamed Ashiq,20,Male,Malaysia",
    "Yogi Bear,25,Male,Singapore",
    "Ravi Kumar,33,Male,India",
    "Ali Baba,40,Male,China"
]
Let’s convert this list of bulky strings into a neat list of dictionaries with the proper data type for age. What we are going to do is:
- Create an empty list to store the processed lines
- Loop over each string using the for line in data_list: syntax
- Split each line into its components using the .split() method
- Create a dictionary with proper field names
- Use the int() function to convert age into a number. Otherwise it will be stored as a string.
- Append this dictionary to the processed lines list using the .append() method
processed_data = []
for line in data_list:
    fields = line.split(',')
    processed_data.append({
        'name': fields[0],
        'age': int(fields[1]),
        'sex': fields[2],
        'country': fields[3]
    })
processed_data
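To see why the int() step matters, it can help to look at what .split() returns for a single line. This is a small sketch using one line from the data above; the variable names line, fields and age are just for illustration.

```python
# One line in the same "name,age,sex,country" format as data_list
line = "John Smith,35,Male,Australia"

# split(',') breaks the string at every comma and returns a list of strings
fields = line.split(',')
print(fields)            # ['John Smith', '35', 'Male', 'Australia']

# Every piece is still a string, including the age
print(type(fields[1]))   # <class 'str'>

# int() turns the text '35' into the number 35, so arithmetic works
age = int(fields[1])
print(age + 1)           # 36
```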
Using the processed list
Each element in a collection can be a collection itself. That’s what we have done here. We have created a collection of collections, or rather, a list of dictionaries.
- processed_data is a list. Its elements are accessed using a numerical index (starting with 0).
- processed_data[0] is the first element of the list.
- processed_data[1] is the second element, and so on. Note that each element is a dictionary (that we appended in the previous step).
- Elements of a dictionary are accessed using their key names. So processed_data[0]['name'] means to fetch the first element (which is a dictionary) and then fetch the ‘name’ field from it.
print(processed_data[0]['name'], processed_data[0]['age'])
print(processed_data[1]['name'], processed_data[1]['age'])
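Instead of writing one print per index, the same access pattern can be put inside a loop. Here is a minimal sketch; the short list called people is a stand-in for processed_data so that the example runs on its own.

```python
# A small stand-in for processed_data (two records only)
people = [
    {'name': 'John Smith', 'age': 35, 'sex': 'Male', 'country': 'Australia'},
    {'name': 'Lily Pina', 'age': 13, 'sex': 'Female', 'country': 'USA'},
]

# Each person is one dictionary from the list
for person in people:
    print(person['name'], person['age'])

# Index first (which dictionary), then key (which field)
second_country = people[1]['country']
print(second_country)   # USA
```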
Stepping into Data Science
Let’s print some statistics from our data. First, calculate the average age of the people in our dataset:
- Average is sum divided by count.
- Count can be easily obtained using the len() function, which returns the size of any string or collection passed as an argument.
- Sum can be obtained by looping through the list and accumulating the age values into a sum variable.
- Finally, divide sum by count to get the average.
sum_of_ages = 0
number_of_persons = len(processed_data)  # len function gives size of collection
for person in processed_data:
    sum_of_ages = sum_of_ages + person['age']
print("Number of persons:", number_of_persons)
print("Average age:", sum_of_ages / number_of_persons)
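As an aside, Python also has a built-in sum() function that can do the accumulation for us. The sketch below is an alternative to the loop above, not a replacement for understanding it; the list people is a three-record stand-in for processed_data.

```python
# A small stand-in for processed_data
people = [
    {'name': 'John Smith', 'age': 35},
    {'name': 'Lily Pina', 'age': 13},
    {'name': 'Julie Singh', 'age': 16},
]

# sum() adds up every value produced by the expression in the brackets
sum_of_ages = sum(person['age'] for person in people)
number_of_persons = len(people)

print("Sum of ages:", sum_of_ages)   # Sum of ages: 64
print("Average age:", sum_of_ages / number_of_persons)
```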
Conditions
Conditions specified using the ‘if’, ‘elif’ and ‘else’ keywords help us branch our program’s execution. The statement block to be executed for each branch is indented below it. Example:
if age < 18:
    person_type = 'kid'
    print('Person is just a kid')
elif age < 60:  # elif means else-if
    person_type = 'adult'
    print('Person is an adult')
else:
    person_type = 'senior'
    print('Person is a senior')
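To try the branches out, we can run the same condition over a few sample ages. The ages 13, 35 and 70 below are arbitrary values chosen so that each branch is taken once.

```python
person_types = []

for age in [13, 35, 70]:
    if age < 18:
        person_type = 'kid'
    elif age < 60:   # only checked when age < 18 was False
        person_type = 'adult'
    else:            # reached when none of the conditions above matched
        person_type = 'senior'
    person_types.append(person_type)
    print(age, person_type)
```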
The < in age < 18 is a ‘comparison operator’. The comparison operators are <, >, <=, >=, == and !=. == means True if both sides are equal. != means True if both sides are not equal.
Two more operators are in and not in. These are for conditions where you have to check whether a value is in a collection. Example: if country in country_list: or if student not in classroom: (note that class cannot be used as a variable name, because it is a reserved keyword in Python).
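Each of these operators produces either True or False, which we can print directly. A quick sketch, where country_list is a made-up list for illustration:

```python
# A made-up list, just for illustration
country_list = ['India', 'China', 'USA']

print(20 == 20)                      # True
print(20 != 20)                      # False
print(13 <= 18)                      # True
print('India' in country_list)       # True
print('Spain' not in country_list)   # True

# The result of a comparison can also be stored in a variable
is_listed = 'USA' in country_list
```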
Let’s use the conditions to report how many of our people are eligible to vote –
number_of_voters = 0
number_of_nonvoters = 0
for person in processed_data:
    # person['country'] gives a country name
    # which can be used to get voting age
    # from the voting_ages dictionary
    # >= because a person who has just reached the voting age can vote
    if person['age'] >= voting_ages[person['country']]:
        number_of_voters = number_of_voters + 1
    else:
        number_of_nonvoters = number_of_nonvoters + 1
print("Number of voters:", number_of_voters)
print("Number of non voters:", number_of_nonvoters)
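The code above relies on a voting_ages dictionary that maps a country name to its voting age. Here is a self-contained sketch of the same idea. The voting ages below are assumptions for illustration only, and people is a three-record stand-in for processed_data. A person who has reached the voting age is counted as a voter, hence >=.

```python
# Assumed values, for illustration only (not an authoritative list)
voting_ages = {'USA': 18, 'India': 18, 'Singapore': 21}

# A small stand-in for processed_data
people = [
    {'name': 'Lily Pina', 'age': 13, 'country': 'USA'},
    {'name': 'Rita Stuart', 'age': 20, 'country': 'Singapore'},
    {'name': 'Ravi Kumar', 'age': 33, 'country': 'India'},
]

number_of_voters = 0
number_of_nonvoters = 0
for person in people:
    # Look up the voting age for this person's country
    required_age = voting_ages[person['country']]
    if person['age'] >= required_age:
        number_of_voters = number_of_voters + 1
    else:
        number_of_nonvoters = number_of_nonvoters + 1

print("Number of voters:", number_of_voters)          # Number of voters: 1
print("Number of non voters:", number_of_nonvoters)   # Number of non voters: 2
```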
Logical Operators
Conditions often need to be combined to be useful. For example, a person should be over 18 years old and should be male. To represent combinations of conditions like this, Python has the logical operators and, or and not.
if age < 18 and sex == "Male":
    print("Male Child")
Another example:
if country == "India" or country == "China":
    print("Asia")
elif country == "Spain" or country == "Italy":
    print("Europe")
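The examples above use and and or. Here is a small sketch that also shows not, which flips True to False and the other way round. The values of age, sex and country are sample values picked for illustration.

```python
# Sample values, picked for illustration
age = 16
sex = "Male"
country = "Italy"

# and: True only when both sides are True
is_male_child = age < 18 and sex == "Male"

# or: True when at least one side is True
in_europe = country == "Spain" or country == "Italy"

# not: flips the result of the condition that follows it
is_adult = not age < 18

print(is_male_child)   # True
print(in_europe)       # True
print(is_adult)        # False
```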
Slicing Operator
Slicing is an operation that can be used on lists and strings in Python. It is quite simple and very useful to quickly fetch a range of items from a list (or a range of characters from a string).
We access list items with their numerical indexes, but we can also give a range inside the square brackets to get multiple items at once, sort of like a sub-list. For example, my_list[2:6] will return the elements from index 2 to 5. Remember this: [a:b] means from ‘a’ up to, but not including, ‘b’. Also remember that indexes start at zero.
Negative values can also be used with the slicing operator. They mean that the position is counted from the end or, equivalently, that the value is subtracted from the length. That is, if the length of the list is l, then [-a:-b] means [l-a:l-b].
Leaving out a value means starting from the beginning or going until the end. That is, [:b] means from the start up to, but not including, ‘b’. And [a:] means from ‘a’ until the end.
my_list = ['apple', 'orange', 'grape', 'melon', 'lemon', 'cherry', 'banana', 'strawberry']
print(my_list[2:6]) # prints ['grape', 'melon', 'lemon', 'cherry']
print(my_list[2:]) # prints ['grape', 'melon', 'lemon', 'cherry', 'banana', 'strawberry']
print(my_list[:6]) # prints ['apple', 'orange', 'grape', 'melon', 'lemon', 'cherry']
print(my_list[-6:-2]) # prints ['grape', 'melon', 'lemon', 'cherry']
print(my_list[:-2]) # prints ['apple', 'orange', 'grape', 'melon', 'lemon', 'cherry']
This works exactly the same when used with strings. For example, if my_name is a string variable, my_name[-2:] means the last two characters of the string, and my_name[:2] means the first two characters.
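A short sketch to try this out. The string "John Smith" and the four-item list are sample values; the last part checks the rule that a negative index is the same as the length minus that value.

```python
my_name = "John Smith"

last_two = my_name[-2:]    # last two characters
first_two = my_name[:2]    # first two characters
print(last_two)    # prints th
print(first_two)   # prints Jo

# Negative values are counted from the end: [-a:-b] is the same as [l-a:l-b]
my_list = ['apple', 'orange', 'grape', 'melon']
l = len(my_list)
print(my_list[-3:-1])                          # prints ['orange', 'grape']
print(my_list[-3:-1] == my_list[l-3:l-1])      # prints True
```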
Experiment and learn the slicing operator until you are confident.
Functions
Functions are named, reusable bits of code, which you first define and then call whenever required. Defining our own functions helps modularize our code and reduce duplication. It also makes it convenient to introduce changes in the future. For example, let’s introduce the concept of short names for people in our data set. For now, a short name is made by joining the first letter of the first name and the first 4 letters of the last name. So for “John Smith”, the short name will be “JSmit”.
name = "John Smith"
first_letter = name[:1]
last_name = name.split(" ")[1]
short_last_name = last_name[:4]
short_name = first_letter + short_last_name
The above code gives short name for “John Smith”. But when we need short name for “Jacob Nilson”, we have to write the same set of 5 lines of code again. Any time we change our formula for creating short names, we have to change all this code. This is where functions help.
Functions allow us to package processing like this and reuse it as and where it is required. Functions are defined using the def keyword, followed by a function name (in this example ‘short_name’) and then, in brackets, a list of parameters that the function can take as input. Similar to the for loop and if conditions, the set of statements forming the function block is indented below the declaration line.
def short_name(full_name):
    first_letter = full_name[:1]
    last_name = full_name.split(" ")[1]
    short_last_name = last_name[:4]
    short_name = first_letter + short_last_name
    return short_name

print(short_name("John Smith"))
print(short_name("Jacob Nilson"))
print(short_name("James Maroon"))
print(short_name("Jill Jack"))
Now wherever we need this functionality, we can just call this function by its name. We don’t have to write the same code again and again.
Another advantage is when introducing a change, we just make the change in the function definition and it reflects in all places where we have called the function. So whenever you’re implementing a formula or an algorithm, it’s better to define it as a function and then call it wherever required.
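For instance, the function can be called inside a loop to build short names for several people at once. This is a minimal sketch: the function is a slightly condensed version of the one above, and the names list is taken from our data set. Note that it assumes every name has exactly a first name and a last name separated by a space.

```python
def short_name(full_name):
    # First letter of the first name
    first_letter = full_name[:1]
    # Assumes the name has a space followed by a last name
    last_name = full_name.split(" ")[1]
    # First 4 letters of the last name
    return first_letter + last_name[:4]

names = ["John Smith", "Lily Pina", "Ali Baba"]

short_names = []
for name in names:
    short_names.append(short_name(name))

print(short_names)   # ['JSmit', 'LPina', 'ABaba']
```

If the formula changes later (say, 3 letters of the last name instead of 4), only the function body needs to change and the loop stays as it is.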
Modules
Functions defined like the above are usually grouped together into a ‘module’ that we import before using the function. For example there are a ton of functions in the math module in Python. Keeping all functions in the global scope is bad for manageability. So we arrange them into modules and ‘import’ them into our programs if and when required.
Say we need the square root function. It’s in the math module. So we can import the math module and call math.sqrt function from it.
import math
print(math.sqrt(25))
Or, we can import only the sqrt function and use it without specifying a module name.
from math import sqrt
print(sqrt(25))
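Both import styles give access to the same function, as this small sketch shows. It also calls math.floor, another function from the same standard math module, to show that importing the module makes all of its functions available.

```python
import math
from math import sqrt

# Style 1: call the function through the module name
root_a = math.sqrt(25)

# Style 2: call the imported function directly
root_b = sqrt(25)

print(root_a)             # 5.0
print(root_a == root_b)   # True

# Other functions in the module are reached the same way
print(math.floor(2.7))    # 2
```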
As you advance, you will not only define your own functions, you will also organize your code into modules.
Conclusion
So that concludes my super fast intro to Python course. It’s not really much when compared to the vastness of the Python ecosystem, but it’s a good start to the data science journey we’re taking up. There are more concepts but I find it easier to introduce concepts when we’re about to actually use them for something. So although the Python Basics part is over, I will continue to introduce Python concepts and methods as we progress.