With the advent of data science, many languages and tools have emerged in the market. While many of them could not survive the test of time and were eventually forgotten, R has held its ground, weathered every criticism that came its way, and become one of the most preferred languages in the market.
Now, to be a good programmer you need to study the language in depth and breadth and understand its nuances, but that is not the focus here. This is not a comprehensive course on R; rather, it is a crash course that intends to get you working with R as quickly as possible, with as little theory as possible. This course is for anyone who wants to be able to use R within the next 2-3 hours. So if that is what you are looking for, read on…
A couple of things before we start: I have split this course into three parts so that it doesn’t become too lengthy or overwhelming. I have purposefully used images instead of text for the code examples, because if all you had to do was copy and paste, there would be no learning. One important aspect of programming is knowing the syntax, and you only learn it by doing (typing) it yourself.
Also, by the end of this course:-
• You should be familiar with the basic RStudio interface
• You should be able to import data into R
• You should be able to check the number of rows, the number of columns and the structure of your data
• You should be able to view the data that you imported
• You should be able to create computed columns and use basic numeric functions
• You should be able to sort the data in R
• You should be able to derive the value of column based on condition
• You should be able to apply basic character functions to your data
• You should be able to convert data types in R
• You should be able to deal with basic date columns
• You should be able to do mathematical calculations with your data
• You should be able to remove duplicates in your data
• You should be able to replace and find missing values in your data
• You should be able to remove or keep columns from a dataset
• You should be able to do joins in R (Left, Right, Inner, Outer)
• You should be able to do the basic summary and aggregation of data
• You should be able to subset the data
• You should be able to create new datasets in R
• You should be able to create basic plots with data
• You should be able to export a dataset into a file using R
So let’s begin:-
Intro:-
R is an open-source programming language and software environment for statistical analysis, graphical representation and reporting. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems such as Linux, Windows and Mac.
The language was named R after the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka).
Downloading & Installing R:-
R can be downloaded from the following website:- https://www.r-project.org/
Click on the link “Download R” and then follow the instructions.
Once you have downloaded and installed R, you need to download and install RStudio as well.
RStudio is a free and open-source integrated development environment (IDE) for R. Technically you don’t need RStudio to be able to use R, but it makes R more user-friendly and easier to use.
You can download RStudio here:- https://www.rstudio.com/
R Studio Interface:-
Once you are done with the installations and click on the RStudio icon on your computer, you should see something like this:
Now, obviously I will not be talking about all the options and functionalities of RStudio, but will rather limit myself to the four most important areas of the RStudio screen. (Please note that the positioning of these four items might vary from installation to installation.)
1) History: As the name implies, the history of your commands/code is available here for reuse.
2) Console: This is where you write the code and hit “Enter” to execute it.
3) Environment: This is where all the datasets and items created will reside.
4) Plots: This is where your graphs and charts will be displayed.
Now let’s get our hands dirty with the real stuff. I will be using two files during this tutorial and you can download them here:-
Import the files
Once you have downloaded these files to a particular location on your computer, please use the following to import the files into your environment. This code needs to be typed in the Console of RStudio (described in the R Studio Interface section), one line at a time, followed by Enter to execute the code/command.
Firstly I am importing the csv file called names.csv and loading the data into a data frame (tables/datasets in R are called data frames) called names.
Next I am loading the file salary.csv from my computer into a data frame called salary.
You will obviously need to change the path in these commands to the location where you downloaded names.csv and salary.csv respectively.
You might have noticed that I am using two backslashes instead of one in the path, e.g.:
Instead of writing c:\temp I am writing c:\\temp. This is because ‘\’ is an escape character inside R strings, so a literal backslash has to be written twice.
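A minimal sketch of the import step. The actual contents of names.csv and salary.csv are not shown here, so this example first writes two small stand-in files to a temporary folder; the column names and values below are hypothetical, and only the read.csv() calls are the part you would type yourself.

```r
# Hypothetical stand-ins for names.csv and salary.csv, written to a temporary
# folder so the example runs anywhere. The real files may have other columns.
names_path  <- file.path(tempdir(), "names.csv")
salary_path <- file.path(tempdir(), "salary.csv")
writeLines(c("id,name", "1,Asha", "2,Ben", "3,Carl"), names_path)
writeLines(c("id,salary,bonus", "1,52000.50,2500.30", "2,61000.25,3100.10"),
           salary_path)

# read.csv() reads a csv file and returns a data frame
names  <- read.csv(names_path)
salary <- read.csv(salary_path)

# With a literal Windows path, double the backslashes or use forward slashes:
# names <- read.csv("c:\\temp\\names.csv")
# names <- read.csv("c:/temp/names.csv")
```

After running these lines, both data frames should appear in the Environment area of RStudio.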
Structure of Dataset
Now let’s take a look at the structure of the data frames that we just created. This can be accomplished with str(name_of_data_frame)
It gives us information about the number of observations, the number of variables and also their data types. You don’t have to stress about the data types of the different variables, as we are trying to be as light on theory as possible. We will briefly discuss data types later.
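A minimal sketch of str(), using a small made-up data frame in place of the imported salary file (the column names and values are assumptions for illustration only):

```r
# Hypothetical sample data; the real salary.csv may have other columns
salary <- data.frame(
  id     = c(1, 2, 3),
  salary = c(52000.50, 61000.25, 47500.75),
  bonus  = c(2500.30, 3100.10, 1800.45)
)

# Prints the number of observations (rows), the number of variables (columns)
# and the data type of each variable
str(salary)
```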
Now if we just want the names of the variables in a data frame, without the data types, we can use the function names():-
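A minimal sketch of names(), again on a made-up data frame whose columns are assumed for illustration:

```r
# Hypothetical sample data; the real salary.csv may have other columns
salary <- data.frame(
  id     = c(1, 2, 3),
  salary = c(52000.50, 61000.25, 47500.75),
  bonus  = c(2500.30, 3100.10, 1800.45)
)

# names() returns the column names as a character vector
column_names <- names(salary)
print(column_names)   # "id" "salary" "bonus"
```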
Find out Number of observations in your dataset
In R we can reference a column of a table by using the following pattern:- table_name$column_name
Here I am asking R to find out how many values there are in column ‘id’ of table ‘names’.
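A minimal sketch of counting the values in a column, assuming length() is applied to names$id; the sample rows are hypothetical. nrow() is shown as an alternative that counts the rows of the whole data frame.

```r
# Hypothetical sample data standing in for names.csv
names <- data.frame(
  id   = c(1, 2, 3, 4),
  name = c("Asha", "Ben", "Carl", "Dina")
)

# Number of values in column id of table names
id_count <- length(names$id)
print(id_count)       # 4

# Alternative: number of rows in the whole data frame
print(nrow(names))    # 4
```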
View a few observations:-
You can limit the records/rows that you want to view from the data frame by using the head or tail function. As the names imply, head will return rows from the top and tail will return rows from the bottom. In this case I am asking R to return 5 rows.
Alternatively you can view your entire data frame by using print(table_name), e.g. print(salary)
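A minimal sketch of head(), tail() and print() on a made-up eight-row data frame (the columns and values are assumptions for illustration):

```r
# Hypothetical sample data with eight rows
salary <- data.frame(
  id     = 1:8,
  salary = seq(40000, 75000, by = 5000)
)

# First 5 rows (ids 1 to 5)
top_rows <- head(salary, 5)
print(top_rows)

# Last 5 rows (ids 4 to 8)
bottom_rows <- tail(salary, 5)
print(bottom_rows)

# The entire data frame
print(salary)
```

If the second argument is left out, head() and tail() return 6 rows by default.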
Create a new column
Now let’s try to create a new column named Income, as the sum of column Salary and column Bonus.
Here I am telling R to create column income in table salary as the sum of column salary from table salary and column bonus from table salary. Please don’t confuse the Salary column with the Salary table. Also remember how we reference a column in R: table_name$column_name.
After creating this column, let’s preview the data using the tail function.
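A minimal sketch of the computed column. The sample values are hypothetical, and the lowercase column names are an assumption: R is case-sensitive, so use whatever spelling str(salary) shows for your file.

```r
# Hypothetical sample data; check your own column names with str(salary)
salary <- data.frame(
  id     = c(1, 2, 3),
  salary = c(52000.50, 61000.25, 47500.75),
  bonus  = c(2500.30, 3100.10, 1800.45)
)

# Create column income in table salary as salary + bonus, row by row
salary$income <- salary$salary + salary$bonus

# Preview the result
tail(salary)
```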
Round off that column
As you might have noticed, the values in column income contain decimals; let’s try to round them off using the function floor (which rounds down) or ceiling (which rounds up).
Here I am telling R to update column income in table salary, replacing its values with the rounded-up values of column income of table salary.
Let’s preview the data using the tail function.
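A minimal sketch of the rounding step, assuming ceiling() is used since the values are rounded up; the sample values are hypothetical.

```r
# Hypothetical sample data with decimal incomes
salary <- data.frame(
  id     = c(1, 2, 3),
  income = c(54500.80, 64100.35, 49301.20)
)

# Overwrite income with its rounded-up values
salary$income <- ceiling(salary$income)

# floor(salary$income) would round down instead

# Preview the result
tail(salary)
```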
Sort by ascending Order of Column
Now let’s try to sort in ascending order of Income and preview the results using the tail function.
Here I am telling R to overwrite table salary with salary whose rows are sorted (ordered) in ascending order of the Income column, including all columns. Please note how R interprets the above statement: table_name[operation for rows, operation for columns]
In the example above we are sorting rows by the Income column [before the comma] and we have nothing/blank [after the comma], which means give me all columns.
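A minimal sketch of the ascending sort, assuming the order() function is used inside the square brackets; the sample values are hypothetical.

```r
# Hypothetical sample data
salary <- data.frame(
  id     = 1:4,
  income = c(54501, 64101, 49302, 58000)
)

# order() returns the row positions that put income in ascending order.
# Before the comma: which rows, and in what order. After the comma: blank,
# meaning all columns.
salary <- salary[order(salary$income), ]

# Preview the result; the largest income is now in the last row
tail(salary)
```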
Sort in Descending Order
Alternatively we can sort in descending order by just putting a minus (-) sign in front of the column we are sorting by.
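A minimal sketch of the descending sort, assuming the minus sign is placed inside order(); the sample values are hypothetical.

```r
# Hypothetical sample data
salary <- data.frame(
  id     = 1:4,
  income = c(54501, 64101, 49302, 58000)
)

# Negating a numeric column reverses the sort, giving descending order
salary <- salary[order(-salary$income), ]

# Preview the result; the largest income is now in the first row
head(salary)
```

The minus sign only works for numeric columns. For other column types, order(salary$income, decreasing = TRUE) gives the same descending result.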
See you soon with Part 2…
Contributed by: Ubaid Darwaish