This is in continuation of the series “Learn Python Programming in 3 hours”. In case you have not gone through the previous section, please do so before you proceed.
In the previous chapter (Learn Python Programming in 3 hours- Part 1 ) we discussed the following:-
Installing Anaconda
The basic Spyder interface.
Import data into Python
Check: no of rows, no of column and the structure of your data in Python
View the data that you imported in Python
Create computed columns and use basic numeric functions in Python
Sort the data in Python
In this chapter, we will continue from where we left and by end of this chapter you should be able to do the following in Python:-
• You should be able to derive the value of column based on condition in Python
• You should be able to apply basic character functions to your data in Python
• You should be able to convert data types in Python
• You should be able to deal with basic date columns in Python
• You should be able to do mathematical calculations with your data in Python
• You should be able to remove duplicates in your data in Python
• You should be able to replace and find missing values in your data in Python
• You should be able to Remove or Keep columns from a dataset in Python
So lets get started:-
Create New Column based on Condition
Now what we need to do is to create a new column called gender, we will create this column based on the values of the column sex. We will be using a method ‘where’ from the package numpy:-
names.head(2)
names[‘Gender’]=np.where(names[‘sex’]==’boy’,’Male’,’Female’)
names.head(2)
All I am telling Python is to create a column ‘Gender’ in table ‘names’…. assign it value ‘Male’ where sex==’boy’ , otherwise assign as ‘Female’. I checked the sample set of data using head function.
Following is how it (np.where) works:
np.where(dataframe[‘column’]==’check’,’ValuefrTrue’,’ValuefrFalse’)
Please remember to use, ‘==’ to check and not ‘=’
If we had to assign binary values instead of Male and Female we could do the same as follows:
#names[‘Gender’]=np.where(names[‘sex’]==’boy’,1,0)
However for this course, we will stick to Male and Female instead of 1 and 0.
Adding a # sign before a line in Python, comments it out.
Modify a column
Previously we named the values in Column Gender as Male or Female. Now we will try to recreate the column gender by applying the substring function to extract only the first alphabet i.e. ‘M’ or ‘F’.
names[‘Gender’]=names.Gender.str[:1]
names.head(2)
The method that we used in code above is called String Slicing
The “slice” syntax is a handy way to get a substring. The slice s[start:end] is the elements beginning at start and extending up to but not including end. Suppose we have a string s = “Hello”
the string ‘hello’ will have following indexes to denote the location of alphabets
H e l l o
0 1 2 3 4
s[1:4] will result in output 'ell' because we are asking Python to give us characters starting at index 1 and extending up to but not including index 4
In code above for our names dataset, we asked Python to give us a slice of string upto index 1. i.e only where index is 0. We used head function to view output.
One important thing to note is that the index in Python starts from 0 and in R starts from 1.
Now lets change the case to lowercase and view the output.
names[‘Gender’]=names[‘Gender’].str.lower()
names.head(2)
Similary we can chage the case to upper as follows:
names[‘Gender’]=names[‘Gender’].str.upper()
Adding a column based on date
from datetime import date
names[‘Today’]=date.today()
names.head(5)
First we imported date from module datetime. And then we used a class attribute today() to assign the value to column ‘Today; in dataset names.
Compute age
Now lets create another column in table names as age. We will try to calculate the persons age by finding out the number of days between today and the persons date of birth and then divide it by 365 to find the number of years.
names[‘Age’]=names[‘Today’]-names[‘year_of_birth’]
As you will see if we try to execute the code above,we get an error. To understand the error and get past it lets take a step back and talk briefly about data types in Python. There are many data types in Python but we will discuss only few that are relevant for this course.
Float: Decimal values are called Float in Python.
Integer: Non Decimal numbers are stored as integers in Python.
Boolean: These are variables whose value is stored as True or False value.
Strings: Strings are text values composed of a sequence of characters.
Date: A date in Python is not a data type of its own, but we can import a module named datetime to work with dates as date objects.
A table/dataset in Python is called a data frame.
You can try the following piece of code quickly (One line at a time) to help you understand the data types in Python better, type() function in Python will tell you the datatype of the object:
a=1
type(a)
b=2.0
type(b)
c=True
type(c)
d=”hello”
type(d)
Lets check the structure of data frame names, by using info function.
names.info()
We can see that id is stored as int and name,sex,year_of_birth and today as object. The object is actually a str or mixed numeric and non-numeric value.
If you notice even columns like year_of_birth and today are being stored as object, hence our error. To fix it we will have to convert it, and for that we use ‘to_datetime; function :
names[‘year_of_birth’]=pd.to_datetime(names[‘year_of_birth’])
names[‘Today’]=pd.to_datetime(names[‘Today’])
Now if we check again the structure of the table
names.info()
The issue seems to have been fixed. Lets try again to compute Age
names[‘Age’]=(names[‘Today’]-names[‘year_of_birth’])/365
Now lets try to view the output
names.head(1)
There are weird values in column Age, Its actually the age which is stored as datetime format. If you run names.info() you can see the data type for age is not what we want it to be. To fix that we will have to recalcuate Age as follows:
names[‘Age’]=(names[‘Today’]-names[‘year_of_birth’])/np.timedelta64(1,’Y’)
Here instead of dividing timedelta by 365, we divide timedelta by timedelta to give us float datatype
If we check now, we will see Age will still have decimal values. To fix that we change its datatype to int
names.head(1)
names[‘Age’]=names.Age.astype(int)
names.head(1)
Remove Duplicates
Let’s run a simple statement to select only rows where id=1015 and retrieve all columns.
salary[salary.id==1015]
I am telling Python to retrieve all rows from salary dataframe, where column id ==1015
As we can see there are duplicate values. Before we get rid of the same lets take a backup of our data frame. We can do so by using function copy()
sal_bckup=salary.copy()
Next we use the function drop_duplicates() to remove duplicates
salary.drop_duplicates(subset=’id’,inplace=True)
The syntax is dataframe.drop_duplicates(subset=’columname’,inplace=True)
inplace takes Boolean values and removes rows with duplicates if True.
Now if we check for column id=1015 again, we see there are no duplicates.
salary[salary.id==1015]
Replace missing values
Lets first view the rows in data frame names where value for column sex is missing.
names[names.sex.isnull()]
Here I am telling Python to return all rows from dataframe names, where column ‘sex’ is haing null values.
We see that the people where value for column sex as missing, have been incorrectly labelled as f (females) in column gender. Lets try to change that value to u (unknown).
names.loc[names[‘sex’].isnull(),’Gender’]=’u’
Here what I am telling Python is to select the rows with missing values of sex and update the column gender with values u.
.loc method is a method that takes only index labels or rows where condition is met and returns row or dataframe if the index label exists in the caller data frame.
If we check again, we see that incorrectly labelled gender has been corrected.
names[names.sex.isnull()]
Drop/Retain Variables
To Keep centain columns only (id and name in this case), we simply specify those columns as:
new_k=names[[‘id’,’name’]]
new_k.head(1)
Lets try to drop the columns/variables like sex,Gender.Same can be done as follows
new_d=names.drop([‘sex’,’Gender’],axis=1)
new_d.head(1)
See you soon with Part3….
Contributed by: Ubaid Darwaish