Learn Python Programming in 3 Hours - Part 1

For any data analyst, the tools used to explore, visualize and interpret data are as important as a toolbox is to a handyman. With the right set of tools you can get the job done in much less time and more efficiently. For many data ninjas the choice of tool is strictly personal and can vary; however, at times the client may have their own preference for a specific tool or technology. Given the market trends in the tech space, Python has become one of the most sought-after skills among data analysts and data scientists; so much so that a seasoned data analyst is considered incomplete without Python under their belt.

Many people make a mistake while learning Python for data science or data analysis: they tend to learn it the programmer’s route. Even though that is the most thorough way of learning any programming language, the learning curve is quite steep, and it often takes too long to reach the point where one can see its application to data. Python as a programming language has widespread applications, and data is just one of them. Since our focus here is strictly on data, I have designed this course so that we don’t go into details that are not pertinent to data. This is not a comprehensive Python tutorial that covers all the nuances of the language, but a crash course that will enable you to get your hands dirty right from the start. The intention is to get you working with Python as quickly as possible, with as little theory as possible. This course is for anyone who wants to be able to use Python for data analysis in the next 2-3 hours. So if that is what you are looking for, read on…

I have split this course into three parts so that it doesn’t become too lengthy or overwhelming. By the end of this Python course:-

• You should be familiar with Spyder, a Python integrated development environment (IDE) designed for engineers, data analysts and data scientists.
• You should be able to import data into Python
• You should be able to check the number of rows, the number of columns and the structure of your data in Python
• You should be able to view the data that you imported in Python
• You should be able to create computed columns and use basic numeric functions in Python
• You should be able to sort the data in Python
• You should be able to derive the value of column based on condition in Python
• You should be able to apply basic character functions to your data in Python
• You should be able to convert data types in Python
• You should be able to deal with basic date columns in Python
• You should be able to do mathematical calculations with your data in Python
• You should be able to remove duplicates in your data in Python
• You should be able to replace and find missing values in your data in Python
• You should be able to remove or keep columns from a dataset in Python
• You should be able to do joins in Python (Left, Right, Inner, Outer)
• You should be able to do the basic summary and aggregation of data in Python
• You should be able to subset the data in Python
• You should be able to create new datasets in Python
• You should be able to create basic plots with data in Python
• You should be able to export a dataset into a file using Python

So let’s begin:-

Intro:-

Python is an interpreted, high-level, general-purpose, open source programming language, created by Guido van Rossum and first released in 1991.
Python is used for many different applications; in recent years its user community has created numerous packages built specifically for data science.

Python for Data Analysis

Python, as mentioned before, is a very versatile language that has won many hearts for its readability and ease of use. In recent years Python has become quite popular for large-scale data processing, analytics, and computing. Many data-specific libraries have been added to Python, which makes it an ideal candidate for data wrangling, analytics and visualization.

What is Anaconda?

Anaconda is an open source Python distribution (a collection of specific software components) that provides you with Python and other essential data analysis tools. We will use Anaconda for our tutorial; you can download it from the following link:

https://www.anaconda.com/distribution/

When installed, Anaconda includes Spyder, the Scientific Python Development Environment. It’s a free integrated development environment (IDE), and we will use it for our tutorial.

Spyder Interface:-
Once you are done with the installation, click on the Anaconda icon on your computer and you should see something like this:

In there, you will find Spyder. You can click on Launch. As Spyder opens you will see a screen like:

Now, obviously I will not be talking about all the options and functionalities of Spyder, but will rather limit us to the three most important areas of the Spyder screen.
1) Editor: This is where we type our code.
2) Variable Explorer: This is where you find all the data elements that you create (tables, lists etc.).
3) Console: This is where we can see the output.

Now let’s get our hands dirty with the real stuff. I will be using two files during this tutorial and you can download them here:-

http://www.sharecsv.com/s/8cb788eef6935c98f03dceb3feb007f2/names.csv

http://www.sharecsv.com/s/ac99fae5179270819e6a1c6079b70a8f/salary.csv

Import the files

Once you have downloaded these files to a location on your computer, we will import two important packages that we will use in the course: pandas and numpy.
Pandas is a Python package for data analysis. NumPy is another Python package, used for scientific computing.
You can read more about them here:

https://pandas.pydata.org/
https://numpy.org/

import pandas as pd
import numpy as np

When we import these packages, we give them an alias using the keyword ‘as’. This is done so that when we reference them in the code we don’t have to type the full name of the package, i.e. we can access pandas as pd, and numpy as np.

We will use the function read_csv, which is part of the pandas package, to read csv files. Type this code in the Editor of Spyder, select it and press F9 to run the selection (or type it directly in the Console and press Enter); the output appears in the Console.

names = pd.read_csv("/home/dell/Python/names.csv")
salary = pd.read_csv("/home/dell/Python/salary.csv")

First I am importing the csv file called names.csv and loading the data into a data frame (tables/datasets in pandas are called data frames) called names.
Next I am loading the file salary.csv from my computer into a data frame called salary.
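If the two files are not at hand yet, the behaviour of read_csv can still be tried out. The sketch below is only an illustration: it feeds read_csv a small piece of in-memory text instead of a file, and the column names (ID, Name) and values are made up, so the real names.csv may look different.

```python
import io
import pandas as pd

# Hypothetical stand-in for names.csv; the real file's columns may differ.
csv_text = "ID,Name\n1,Asha\n2,Ben\n3,Chen\n"

# read_csv accepts a file path or a file-like object such as StringIO.
names_demo = pd.read_csv(io.StringIO(csv_text))

# The first line of the text becomes the column headers.
print(names_demo)
```

Once the real files are downloaded, the same call works with a path in place of the StringIO object.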

I might use the words data frame and dataset interchangeably in this course; they both mean the same thing.

You will obviously need to change the path in these commands to the location where you downloaded names.csv and salary.csv respectively.
On a Windows machine, you will have to use two backslashes instead of one in the path, e.g.:
instead of writing c:\temp you will have to write c:\\temp. This is because ‘\’ is an escape character in Python.
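As a small illustration of the escape-character point, here are three ways of writing the same Windows path that Python treats as equivalent. The folder c:\temp is just an example location, not a required one.

```python
# Three ways to write the same Windows path (example location only).
escaped = "c:\\temp\\names.csv"   # each backslash doubled
raw = r"c:\temp\names.csv"        # raw string: backslashes are taken literally
forward = "c:/temp/names.csv"     # forward slashes are also accepted on Windows

# Without doubling or the r prefix, "\t" would be read as a tab character.
broken = "c:\temp"

print(escaped == raw)    # True
print("\t" in broken)    # True: the path now contains a tab
```

Any of the first three forms can be passed to pd.read_csv.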

Structure of Dataset

Now let’s take a look at the structure of the data frames that we just created. This can be accomplished by

names.info()

It gives us information about the number of observations, the number of variables and also their data types. You don’t have to stress about the data types of different variables as we are trying to be as light on theory as possible. We will briefly discuss data types later.

Now if we want to just get names of variables in a data frame without data types, we can use the following:

names.columns
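To see what these calls return without the downloaded files, here is a minimal sketch on a tiny made-up data frame. The column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data frame; column names are made up for illustration.
demo = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Asha", "Ben", "Chen"]})

demo.info()                 # prints row count, column names, non-null counts and data types
print(list(demo.columns))   # column names only
print(demo.shape)           # (number of rows, number of columns)
```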

Find out the number of observations in your dataset:

len(names)
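A short sketch of how row counts behave, using made-up data with one missing value. It shows that len and shape report the number of rows, whereas count reports the number of non-missing values in each column, which is a different thing.

```python
import pandas as pd

# Hypothetical data with one missing Name.
demo = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Asha", None, "Chen"]})

n_rows = len(demo)           # number of rows
n_rows_alt = demo.shape[0]   # same number, taken from (rows, columns)
non_null = demo.count()      # per-column count of non-missing values

print(n_rows, n_rows_alt)
print(non_null)
```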

View a few observations:-

names.head(5)
names.tail(5)

You can limit the records/rows from the data frame that you want to view by using the head or tail function. As the names imply, head returns rows from the top and tail returns rows from the bottom. In this case I am asking Python to return 5 rows.
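Here is a quick sketch of head and tail on a made-up ten-row data frame, asking for 3 rows instead of 5. If no number is passed, both functions return 5 rows.

```python
import pandas as pd

# Hypothetical data frame with 10 rows.
demo = pd.DataFrame({"ID": range(1, 11)})

top = demo.head(3)       # first 3 rows
bottom = demo.tail(3)    # last 3 rows

print(top)
print(bottom)
```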

Create a new column

Now let’s try to create a new column named Income, as the sum of column Salary and column Bonus.
In Python we can reference the column of a table by using the following pattern :- table_name['column_name']

salary['Income'] = salary['Salary'] + salary['Bonus']

Here I am telling Python to create column Income in table salary as sum of column Salary from table salary and column Bonus from table salary. Please don’t get confused between Salary column and salary table. Also remember that Python is case sensitive, which means, Variable and variable are not the same.
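The same pattern can be tried on a small made-up table. The Salary and Bonus values below are invented for illustration; only the column names come from the tutorial.

```python
import pandas as pd

# Hypothetical stand-in for the salary table.
salary_demo = pd.DataFrame({
    "Salary": [50000.0, 62000.5],
    "Bonus": [2500.75, 3100.25],
})

# The addition is done row by row across the two columns.
salary_demo["Income"] = salary_demo["Salary"] + salary_demo["Bonus"]

print(salary_demo)
```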

After creating this column, let’s preview the data using the head function.

salary.head(5)

Round off that column

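As you might have noticed, the values in column Income contain decimals; let’s convert them to whole numbers using the function astype, which is used to change data type. Note that converting to int drops the decimal part (it truncates) rather than rounding to the nearest whole number.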

salary['Income'] = salary.Income.astype(int)

Here I am telling Python to change the data type of column Income from ‘float’ to ‘int’.
Integers and floating point numbers are distinguished by the absence or presence of a decimal point: 5 is an integer whereas 5.0 is a floating point number.
There, we already learned about two major data types in Python 🙂
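One thing worth knowing about astype(int) is that it cuts off the decimal part instead of rounding to the nearest whole number. The sketch below, on made-up values, compares the two; calling round() before astype(int) is one possible way to get true rounding, offered here as a suggestion rather than as what the tutorial does.

```python
import pandas as pd

# Hypothetical income values with decimals.
income = pd.Series([52500.75, 65100.25, 48000.9])

truncated = income.astype(int)           # drops the decimal part
rounded = income.round().astype(int)     # rounds first, then converts

print(truncated.tolist())
print(rounded.tolist())
```

Also note that astype(int) raises an error if the column contains missing values, so those need to be handled first.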

Let’s preview the data using head function.

salary.head(5)

Sort by ascending Order of Column

Now let’s try to sort by ascending order of Income and preview the results using the head function.

salary = salary.sort_values(by=['Income'], ascending=True)
salary.head(5)

Here I am telling Python to overwrite table salary with table salary sorted by column Income, in ascending order. Please note how Python interprets the above statement: Python being an object-oriented programming language, the data frame is an object, and on it we call a method named ‘sort_values’.
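The reason for assigning the result back is that sort_values returns a new, sorted data frame and leaves the original as it was. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical data frame with unsorted incomes.
demo = pd.DataFrame({"Income": [300, 100, 200]})

# sort_values returns a sorted copy; demo itself is not changed.
sorted_demo = demo.sort_values(by=["Income"], ascending=True)

print(demo["Income"].tolist())          # original order is kept
print(sorted_demo["Income"].tolist())   # sorted copy
```

The rows keep their original index labels after sorting, which is why the leftmost numbers in the output may appear out of order.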

Sort in Descending Order

Alternatively, we can sort in descending order by just changing ‘True’ to ‘False’ in the above piece of code.
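For completeness, a descending sort might look like the sketch below. The names and incomes are invented for illustration.

```python
import pandas as pd

# Hypothetical data frame.
demo = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Income": [300, 100, 200],
})

# ascending=False puts the largest Income first.
desc = demo.sort_values(by=["Income"], ascending=False)

print(desc.head(2))   # the two highest incomes
```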

See you soon with Part 2 of this tutorial….

Contributed by: Ubaid Darwaish