Introduction to R

Katia Oleinik
June 20, 2017

Brief R History

R has a long history:

  • Introduced in 1996 as a free open-source version of the S language
  • Primarily designed for Data Analysis
  • Community-developed packages are published at CRAN (Comprehensive R Archive Network)
  • A new version of R is released each spring. The latest version as of June 2017 is 3.4.0

R language

  • Scripting language (not a compiled language like C, FORTRAN, etc.). It is designed for data analytics (not as a general purpose language).
  • Can run interactively (with or without GUI) and in a batch mode
  • More than ten thousand additional R packages have been developed by the R community (10,810 as of June 2017)

Links to the tutorial materials

The scripts for the tutorial can be found on the following webpage: http://rcs.bu.edu/examples/r/tutorials/

There are a number of useful links to various tutorials, books and examples.

For this tutorial we will need:
R1-Intro.R and Salaries.csv

Starting R

You can start R from the Start menu on Windows or by typing “R” in the terminal on Mac or Linux:

[koleinik ~] R

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

A few simple R commands

2+3
[1] 5
2^3
[1] 8
log(2)
[1] 0.6931472
pi 
[1] 3.141593

Statistical Exercises

# Calculate the cumulative probability P(Z <= 1) for the standard normal distribution
pnorm(1)
[1] 0.8413447
# Calculate the z-score for a given cumulative probability (quantile function, the inverse of the cumulative distribution function)
qnorm(0.84)
[1] 0.9944579
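
The two functions above are inverses of each other. A minimal sketch of that relationship, assuming the standard normal distribution (mean 0, sd 1, which are the defaults); the variable names p, z and z95 are chosen only for this illustration:

```r
# pnorm() is the cumulative distribution function of the normal distribution;
# qnorm() is its inverse (the quantile function).
p <- pnorm(1)      # probability that a standard normal value is <= 1
z <- qnorm(p)      # recovers the original z-score (1, up to rounding error)
print(z)

# Critical value that leaves 2.5% in the upper tail (used for 95% intervals)
z95 <- qnorm(0.975)
print(round(z95, 2))
```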

RStudio

RStudio is a popular graphical interface (IDE) for R.

It provides some additional features, such as debugging sessions, git integration, a session environment viewer and many more.

Open RStudio on your computer and then browse to open the R1-Intro.R file.

RStudio (cont.)

Within RStudio the console window by default appears in the lower left corner and the script is above it.

To execute a line from the script, select the line(s) you want to execute and then press “Run”.

The selected lines are copied to the console screen and then executed.

Assigning a value to a variable

# Classic way to assign a value to a variable
a <- 3

# The following works too (though be careful using it inside function calls, where = names an argument rather than assigning a variable):
b = -5

# R is case sensitive. This will create another variable that is different from a:
A <- 7
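
The caution about = applies mainly inside function calls. A small sketch of the difference, using median() only as a convenient example function; the names m1, m2 and y are made up for this illustration:

```r
# Inside a function call, = names an argument; it does not create a variable.
# Here the vector is passed as the argument x of median() and nothing else happens.
m1 <- median(x = c(4, 8, 15))
print(m1)

# With <- the vector is first assigned to y and then passed to the function,
# so y exists in the session after the call.
m2 <- median(y <- c(4, 8, 15))
print(y)
```

For this reason many R users keep <- for assignment and reserve = for naming function arguments.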

Variable names and types

Avoid using the names c, t, cat, F, T, D, as those are built-in functions/constants.

A variable name can contain letters, digits, underscores and dots, and must start with a letter or a dot (a leading dot cannot be followed by a digit). A variable name cannot contain a dollar sign.

To see the value of a variable, type its name in the console window.
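
A short sketch of these naming rules. The variable names are hypothetical; make.names() is a base R function that shows how R would repair a name that is not valid:

```r
# Syntactically valid names: letters, digits, dots and underscores
total_2017 <- 10
my.value   <- 20
.hidden    <- 30       # names starting with a dot are not listed by ls()

# make.names() shows how R would repair names that break the rules
fixed <- make.names(c("2nd_try", "cost$", "ok_name"))
print(fixed)

# T is only a variable that holds TRUE, so it can be overwritten by accident;
# TRUE itself is a reserved word and cannot be reassigned.
T <- 0
print(isTRUE(T))       # FALSE: T no longer means TRUE
rm(T)                  # remove our copy so the built-in T is visible again
```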

Variable names and types (cont.)

# Character variable
mystring <- "Hello, World"

# Numeric variable
myvalue <- 21/17

# Boolean Variable (TRUE or FALSE)
answer <- 5 < 3
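
To check which type R assigned to a variable, class() can be used. A minimal sketch that repeats the three assignments above so it can be run on its own:

```r
mystring <- "Hello, World"
myvalue  <- 21/17
answer   <- 5 < 3

class(mystring)   # "character"
class(myvalue)    # "numeric"
class(answer)     # "logical"
print(answer)     # FALSE, because 5 is not less than 3

# Whole numbers are stored as numeric (double) unless the L suffix is used
class(3)          # "numeric"
class(3L)         # "integer"
```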

Read Input Data

The file IO.R in the http://rcs.bu.edu/examples/r/tutorials/ folder contains examples of reading various file types: csv, tab, fixed format, stata, SAS, Excel, etc.

There are some packages (data.table, readr) that might be useful for reading very large input files.

Read Input Data (cont.)

# Read comma-separated file (the following line will read the file from the RCS website)
salaries <- read.csv("http://scv.bu.edu/examples/r/tutorials/Salaries.csv")

If you want to read a file from your local computer, either place the file in the current working directory or provide a path to it, e.g.:

# Read file from the local disk (in R strings, write Windows paths with forward slashes or doubled backslashes):
#salaries <- read.csv("C:/MyData/Salaries.csv")
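
A self-contained sketch of reading a local file. It writes a tiny CSV file with made-up values to a temporary folder, so the file name example.csv and its contents are assumptions for illustration only:

```r
# Write a tiny CSV file to a temporary folder (made-up values)
csv_path <- file.path(tempdir(), "example.csv")
writeLines(c("name,score", "a,1.5", "b,2.5"), csv_path)

# Show the current working directory; relative paths are resolved against it
getwd()

# Read the file back using its full path
example <- read.csv(csv_path)
print(example)

# On Windows, write paths with forward slashes or doubled backslashes, e.g.
# "C:/MyData/Salaries.csv" or "C:\\MyData\\Salaries.csv"
```

file.path() joins folder and file names with a separator that works on every platform, which avoids the backslash problem altogether.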

Explore the data

# View the first 6 records of the dataset:
head(salaries)
  rank discipline yrs.since.phd yrs.service  sex salary
1 Prof          B            56          49 Male 186960
2 Prof          A            12           6 Male  93000
3 Prof          A            23          20 Male 110515
4 Prof          A            40          31 Male 131205
5 Prof          B            20          18 Male 104800
6 Prof          A            20          20 Male 122400

Explore the data (cont.)

# Get the list of the columns:
names(salaries)
[1] "rank"          "discipline"    "yrs.since.phd" "yrs.service"  
[5] "sex"           "salary"       
# Get the number of columns
ncol(salaries)
[1] 6
# Get the number of observations
nrow(salaries)
[1] 78

Explore the data (cont.)

# Get the structure of the dataset
str(salaries)
'data.frame':   78 obs. of  6 variables:
 $ rank         : Factor w/ 3 levels "AssocProf","AsstProf",..: 3 3 3 3 3 3 1 3 3 3 ...
 $ discipline   : Factor w/ 2 levels "A","B": 2 1 1 1 2 1 1 1 1 1 ...
 $ yrs.since.phd: int  56 12 23 40 20 20 20 18 29 51 ...
 $ yrs.service  : int  49 6 20 31 18 20 17 18 19 51 ...
 $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
 $ salary       : int  186960 93000 110515 131205 104800 122400 81285 126300 94350 57800 ...

Explore the data (cont.)

The R object we created when we read the input dataset is called a dataframe.

The rows in a dataframe are observations and might contain various types of values (numerical, string, boolean).

The columns in a dataframe are vectors of values with the same type.
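
The row/column distinction can be checked on a small example. The sketch below builds a hypothetical dataframe df with made-up values; stringsAsFactors = FALSE is set explicitly so the result is the same in older and newer versions of R:

```r
# A small hypothetical dataframe (values are made up for illustration)
df <- data.frame(name   = c("Ann", "Bob", "Eve"),
                 age    = c(31, 45, 28),
                 tenure = c(TRUE, TRUE, FALSE),
                 stringsAsFactors = FALSE)

# Each column is a vector with a single type
col_types <- sapply(df, class)
print(col_types)

# A single row mixes the types of all columns, so it is returned as a dataframe
row2 <- df[2, ]
class(row2)         # "data.frame"

# A single column is an ordinary vector
is.vector(df$age)   # TRUE
```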

R vectors

The function c() concatenates a few values of the same type into a single vector:

# Create numeric vector
temp <- c( 50, 45, 65, 85, 90, 90, 89)

# Print the values of a vector
print(temp)
[1] 50 45 65 85 90 90 89

R vectors (cont.)

There are many other ways to create a numeric vector in R. Below are some examples:

# Create a sequence of numbers
vals <- 30:50

# Sequence of numbers with a fixed increment
vals <- seq(from=0, to=10, by=0.5)

# Repeat a value
vals <- rep( 1, times=10)

# Create a vector initialized with 20 zeros
vals <- numeric(20)

# Create a vector filled with normally distributed data
vals <- rnorm(50, mean=0, sd=1)

Vector slicing

# Define a numeric vector
x <- c(11,22,33,44,55,66,77)
# Return second element of the vector
x[2]
[1] 22
# Return second through fifth elements
x[2:5]
[1] 22 33 44 55
# Return all but second element
x[-2]
[1] 11 33 44 55 66 77
# Return all the elements that are less than 50
x[ x < 50 ]
[1] 11 22 33 44

Vector functions

There are many R functions that work with vectors.

The most helpful summary functions are:

  • summary
  • unique
  • table (for vectors with very few unique values - categorical variables)

There are many basic statistical functions:
mean, median, min, max, sd, var, range, sort, etc.
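
A brief sketch of these functions applied to the temp vector from the earlier slide (the vector is redefined here so the example runs on its own):

```r
temp <- c(50, 45, 65, 85, 90, 90, 89)   # same values as in the earlier slide

summary(temp)     # minimum, quartiles, median, mean and maximum
unique(temp)      # distinct values in order of first appearance
table(temp)       # how many times each value occurs

mean(temp)
median(temp)      # 85
range(temp)       # 45 90
sort(temp)        # values in increasing order
```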

R Help

To find a help file for any R function, type ?function_name or help(function_name) in the console window.

To search through R help topics: ??"key phrase" or help.search("key phrase")

Explore the data (practice)

The dataframe salaries that we read earlier contains 6 different columns/vectors. Let's explore each column. To access the column by name use a dollar sign:

# Find the range of the column salary
range( salaries$salary )
[1]  57800 186960

Use the summary(), min(), max() and mean() functions to explore the salary column.
Try to execute: hist( salaries$salary )
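
The dollar sign is not the only way to reach a column. The sketch below uses a hypothetical stand-in dataframe called demo with made-up values (not the real Salaries.csv data) to show that double brackets return the same vector:

```r
# Hypothetical stand-in for the salaries dataframe (made-up values)
demo <- data.frame(sex    = c("Male", "Female", "Female", "Male"),
                   salary = c(93000, 105000, 87000, 120000))

# $ and [[ ]] both return the column as a vector
s1 <- demo$salary
s2 <- demo[["salary"]]
identical(s1, s2)     # TRUE

# Summary functions work on the extracted vector
mean(demo$salary)     # 101250
max(demo$salary)      # 120000

# table() counts the values of a categorical column
sex_counts <- table(demo$sex)
print(sex_counts)
```

The double-bracket form is convenient when the column name is stored in another variable.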

Dataframe slicing

Dataframe slicing is very similar to vector slicing, but now we have to work with two-dimensional data.

# Specify row and column:
salaries[3, 5]
[1] Male
Levels: Female Male
# Specify the row number only (you still need to enter the comma, but leave the column value blank)
salaries[3, ]
  rank discipline yrs.since.phd yrs.service  sex salary
3 Prof          A            23          20 Male 110515
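
Columns can be selected in the same way by leaving the row position blank. A minimal sketch on a hypothetical dataframe demo with made-up values (not the real Salaries.csv data):

```r
# Hypothetical stand-in for the salaries dataframe (made-up values)
demo <- data.frame(rank   = c("Prof", "AsstProf", "Prof", "AssocProf"),
                   sex    = c("Male", "Female", "Female", "Male"),
                   salary = c(93000, 105000, 87000, 120000))

# Leave the row index blank to select whole columns
col_vec <- demo[, "salary"]               # a single column comes back as a vector
two_col <- demo[, c("rank", "salary")]    # several columns give a dataframe

# Ranges and negative indices work as they do for vectors
mid_rows  <- demo[2:3, ]                  # rows 2 and 3, all columns
no_first  <- demo[-1, c(1, 3)]            # all rows but the first, columns 1 and 3

print(two_col)
print(no_first)
```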

Dataframe slicing (cont.)

# Get only those observations for which the salary value is greater than 100,000:
salaries[ salaries$salary > 100000, ]
        rank discipline yrs.since.phd yrs.service    sex salary
1       Prof          B            56          49   Male 186960
3       Prof          A            23          20   Male 110515
4       Prof          A            40          31   Male 131205
5       Prof          B            20          18   Male 104800
6       Prof          A            20          20   Male 122400
8       Prof          A            18          18   Male 126300
11      Prof          B            39          33   Male 128250
12      Prof          B            23          23   Male 134778
14      Prof          B            35          33   Male 162200
15      Prof          B            25          19   Male 153750
16      Prof          B            17           3   Male 150480
19      Prof          A            19           7   Male 107300
20      Prof          A            29          27   Male 150500
22      Prof          A            33          30   Male 103106
27      Prof          A            38          19   Male 148750
28      Prof          A            45          43   Male 155865
30      Prof          B            21          20   Male 123683
31 AssocProf          B             9           7   Male 107008
32      Prof          B            22          21   Male 155750
33      Prof          A            27          19   Male 103275
34      Prof          B            18          18   Male 120000
35 AssocProf          B            12           8   Male 119800
36      Prof          B            28          23   Male 126933
37      Prof          B            45          45   Male 146856
38      Prof          A            20           8   Male 102000
40      Prof          B            18          18 Female 129000
41      Prof          A            39          36 Female 137000
45      Prof          B            23          19 Female 151768
46      Prof          B            25          25 Female 140096
48 AssocProf          B            11          11 Female 103613
49      Prof          B            17          17 Female 111512
50      Prof          B            17          18 Female 122960
52      Prof          B            20          14 Female 127512
53      Prof          A            12           0 Female 105000
59      Prof          B            36          26 Female 144651
60 AssocProf          B            12          10 Female 103994
62 AssocProf          B            13          10 Female 103750
63 AssocProf          B            14           7 Female 109650
66      Prof          A            36          19 Female 117555
70      Prof          A            28           7 Female 116450
73      Prof          B            24          15 Female 161101
74      Prof          B            18          10 Female 105450
75 AssocProf          B            19           6 Female 104542
76      Prof          B            17          17 Female 124312
77      Prof          A            28          14 Female 109954
78      Prof          A            23          15 Female 109646
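
Logical conditions can be combined with & (and) and | (or). The sketch below uses a hypothetical dataframe demo with made-up values (not the real Salaries.csv data); subset() is a base R function that expresses the same filter without repeating the dataframe name:

```r
# Hypothetical stand-in for the salaries dataframe (made-up values)
demo <- data.frame(rank   = c("Prof", "AsstProf", "Prof", "AssocProf"),
                   sex    = c("Male", "Female", "Female", "Male"),
                   salary = c(93000, 105000, 87000, 120000))

# Rows where both conditions hold
high_f <- demo[demo$salary > 100000 & demo$sex == "Female", ]
print(high_f)

# The same filter written with subset()
high_f2 <- subset(demo, salary > 100000 & sex == "Female")

# Count the matching rows instead of printing them
n_high <- sum(demo$salary > 100000)
print(n_high)
```

Counting with sum() is handy when the filtered output would be too long to read, as with the listing above.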