Katia Oleinik

June 20, 2017

R has a long history:

- Introduced in 1996 as a free open-source version of the S language
- Primarily designed for Data Analysis
- Community-developed packages are published at CRAN (Comprehensive R Archive Network)
- A new version of R is released each spring. The latest version at the time of writing (June 2017) is 3.4.0

- Scripting language: code is interpreted rather than compiled, unlike C, FORTRAN, etc. It is designed for data analytics, not as a general-purpose language.
- Can run interactively (with or without GUI) and in a batch mode
- More than ten thousand additional R packages have been developed by the R community (10,810 as of June 2017)

The scripts for the tutorial can be found on the following webpage: http://rcs.bu.edu/examples/r/tutorials/

There are a number of useful links to various tutorials, books and examples.

For this tutorial we will need:

*R1-Intro.R* and *Salaries.csv*

**You can start R from the Start menu on Windows or by typing “R” in the terminal on Mac or Linux:**

[koleinik ~] R

R version 3.2.3 (2015-12-10) – “Wooden Christmas-Tree”

Copyright © 2015 The R Foundation for Statistical Computing

Platform: x86_64-pc-linux-gnu (64-bit)

…

Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

```
2+3
```

```
[1] 5
```

```
2^3
```

```
[1] 8
```

```
log(2)
```

```
[1] 0.6931472
```

```
pi
```

```
[1] 3.141593
```

```
# Cumulative probability P(Z <= 1) for the standard normal distribution
pnorm(1)
```

```
[1] 0.8413447
```

```
# Calculate the z-score (quantile) for a given cumulative probability; qnorm() is the inverse of pnorm()
qnorm(0.84)
```

```
[1] 0.9944579
```
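As a quick check of how these two functions relate, the sketch below uses only base R to show that `qnorm()` undoes `pnorm()`. The variable names `p`, `z` and `p_two_sided` are chosen for this illustration and are not part of the tutorial scripts.

```r
# pnorm() maps a z-score to a cumulative probability P(Z <= z)
p <- pnorm(1)

# qnorm() maps a cumulative probability back to the z-score
z <- qnorm(p)

# Compare with a tolerance rather than ==, since these are floating-point values
all.equal(z, 1)

# Two-sided p-value for an observed z-score of 1.96 (close to 0.05)
p_two_sided <- 2 * (1 - pnorm(1.96))
print(p_two_sided)
```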

RStudio is a popular graphical interface (IDE) for R.

It provides additional features such as debugging sessions, git integration, a session environment viewer and many more.

Open RStudio on your computer and then browse to open the *R1-Intro.R* file.

Within RStudio, the console window appears by default in the lower left corner, and the script editor is above it.

To execute a line from the script, select the line(s) you want to execute and then press *“Run”*

The selected lines are copied to the console screen and then executed.

```
# Classic way to assign a value to a variable
a <- 3
# The following works too (but be careful when using it inside function calls):
b = -5
# R is case sensitive. This creates another variable that is different from a:
A <- 7
```
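The caution about `=` can be illustrated with a small sketch. Inside a function call, `=` only names an argument, while `<-` creates a variable as a side effect. The helper `check_assignment()` is hypothetical and exists only so the example does not depend on what is already in your workspace.

```r
# A small helper (hypothetical, for illustration) that reports whether
# a variable named x exists after each style of call
check_assignment <- function() {
  here <- environment()

  # "=" inside a call only names the argument; no variable x is created
  m1 <- mean(x = c(1, 2, 3))
  after_equals <- exists("x", envir = here, inherits = FALSE)

  # "<-" inside a call creates x first, then passes its value to mean()
  m2 <- mean(x <- c(1, 2, 3))
  after_arrow <- exists("x", envir = here, inherits = FALSE)

  c(after_equals = after_equals, after_arrow = after_arrow)
}

result <- check_assignment()
print(result)
```

Using `<-` for assignment and `=` for function arguments keeps the two roles visually distinct.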

Avoid using the names *c, t, cat, F, T, D*, as those are built-in functions/constants.

A variable name can contain letters, digits, underscores and dots, and must start with a letter or a dot (a leading dot cannot be followed by a digit). A variable name cannot contain a dollar sign.

To see the value of a variable, type its name in the console window.
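The naming rules can be tried out with a short sketch. The variable names below are made up for illustration; `make.names()` is a base R function that shows how R would repair names that break the rules.

```r
# Valid names: letters, digits, underscores and dots
my.value_2 <- 10
.hidden <- "starts with a dot"

# Typing a name shows its value in the console
my.value_2

# make.names() converts invalid names into valid ones
fixed <- make.names(c("2fast", "my$var", "ok_name"))
print(fixed)
```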

```
# Character variable
mystring <- "Hello, World"
# Numeric variable
myvalue <- 21/17
# Logical (Boolean) variable (TRUE or FALSE)
answer <- 5 < 3
```
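To confirm which type R assigned to each variable, one option is the base function `class()`. The sketch below repeats the three assignments so that it can be run on its own.

```r
mystring <- "Hello, World"
myvalue <- 21/17
answer <- 5 < 3

# class() reports the type of each object
class(mystring)  # "character"
class(myvalue)   # "numeric"
class(answer)    # "logical"

# Logical values behave like 1 (TRUE) and 0 (FALSE) in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE))
print(n_true)    # 2
```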

The file IO.R in the http://rcs.bu.edu/examples/r/tutorials/ folder contains examples of reading various file types: csv, tab, fixed format, stata, SAS, Excel, etc.

There are some packages (data.table, readr) that might be useful for reading very large input files.
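As a rough sketch of how these readers compare, the code below writes a tiny made-up CSV file to a temporary location and reads it back. The file contents and variable names are assumptions for illustration, and the `data.table` and `readr` calls run only if those packages happen to be installed.

```r
# Write a small example file to a temporary location (hypothetical data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)), tmp, row.names = FALSE)

# Base R reader
base_df <- read.csv(tmp)
print(base_df)

# data.table::fread(), used only if the package is installed
if (requireNamespace("data.table", quietly = TRUE)) {
  fast_df <- data.table::fread(tmp)
  print(dim(fast_df))
}

# readr::read_csv(), used only if the package is installed
if (requireNamespace("readr", quietly = TRUE)) {
  tidy_df <- readr::read_csv(tmp)
  print(dim(tidy_df))
}
```

On a file this small the three readers behave much the same; any speed difference shows up only with much larger inputs.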

```
# Read comma-separated file (the following line will read the file from the RCS website)
salaries <- read.csv("http://scv.bu.edu/examples/r/tutorials/Salaries.csv")
```

If you want to read a file from your local computer, you need to either place the file in the current working directory or provide a path, for example:

```
# Read file from the local disk (use forward slashes or doubled backslashes in Windows paths):
#salaries <- read.csv("C:/MyData/Salaries.csv")
```
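A minimal sketch of working with local paths follows. It assumes nothing about your own folders: the file name `example_salaries.csv` and its contents are made up, and the file is created in R's temporary directory so the example can run anywhere.

```r
# Show the current working directory; relative paths are resolved against it
getwd()

# Build a path in a portable way (hypothetical file in a temporary folder)
csv_path <- file.path(tempdir(), "example_salaries.csv")
write.csv(data.frame(rank = c("Prof", "AsstProf"), salary = c(100000, 80000)),
          csv_path, row.names = FALSE)

# Check that the file exists before reading it
file.exists(csv_path)
local_df <- read.csv(csv_path)
print(local_df)
```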

```
# View the first 6 records of the dataset:
head(salaries)
```

```
  rank discipline yrs.since.phd yrs.service  sex salary
1 Prof          B            56          49 Male 186960
2 Prof          A            12           6 Male  93000
3 Prof          A            23          20 Male 110515
4 Prof          A            40          31 Male 131205
5 Prof          B            20          18 Male 104800
6 Prof          A            20          20 Male 122400
```

```
# Get the list of the columns:
names(salaries)
```

```
[1] "rank" "discipline" "yrs.since.phd" "yrs.service"
[5] "sex" "salary"
```

```
# Get the number of columns
ncol(salaries)
```

```
[1] 6
```

```
# Get the number of observations
nrow(salaries)
```

```
[1] 78
```

```
# Get the structure of the dataset
str(salaries)
```

```
'data.frame': 78 obs. of 6 variables:
$ rank : Factor w/ 3 levels "AssocProf","AsstProf",..: 3 3 3 3 3 3 1 3 3 3 ...
$ discipline : Factor w/ 2 levels "A","B": 2 1 1 1 2 1 1 1 1 1 ...
$ yrs.since.phd: int 56 12 23 40 20 20 20 18 29 51 ...
$ yrs.service : int 49 6 20 31 18 20 17 18 19 51 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ salary : int 186960 93000 110515 131205 104800 122400 81285 126300 94350 57800 ...
```

The R object we created when we read the input dataset is called a dataframe.

The rows in a dataframe are observations, and a single row might contain values of various types (numerical, string, boolean).

The columns in a dataframe are vectors, each holding values of the same type.
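To make the row/column distinction concrete, here is a small sketch that builds a dataframe by hand. The data are invented for illustration and are not taken from *Salaries.csv*.

```r
# Hypothetical toy data
df <- data.frame(
  name   = c("Ann", "Bob", "Eve"),
  years  = c(5L, 12L, 3L),
  tenure = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Each column is a vector with a single type
col_types <- sapply(df, class)
print(col_types)

# One row (an observation) mixes the types of all columns
df[2, ]
```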

The function *c()* concatenates a few values of the same type into a single vector:

```
# Create numeric vector
temp <- c( 50, 45, 65, 85, 90, 90, 89)
# Print the values of a vector
print(temp)
```

```
[1] 50 45 65 85 90 90 89
```

There are many other ways to create a numeric vector in R. Below are some examples:

```
# Create a sequence of numbers
vals <- 30:50
# Sequence of numbers with a fixed increment
vals <- seq(from=0, to=10, by=0.5)
# Repeat a value
vals <- rep( 1, times=10)
# Create a vector initialized with 20 zeros
vals <- numeric(20)
# Create a vector filled with normally distributed data
vals <- rnorm(50, mean=0, sd=1)
```
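A short sketch, using only base R, of how one might check what these constructors produce. The names `grid`, `first` and `second` are illustrative. `set.seed()` is included because random draws from `rnorm()` otherwise differ on every run.

```r
# length() confirms how many elements each approach produced
length(30:50)                              # 21
length(seq(from = 0, to = 10, by = 0.5))   # 21

# length.out asks for a number of points instead of a step size
grid <- seq(from = 0, to = 1, length.out = 5)
print(grid)

# Setting the same seed makes random draws reproducible
set.seed(42)
first <- rnorm(3)
set.seed(42)
second <- rnorm(3)
identical(first, second)
```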

```
# Define a numeric vector
x <- c(11,22,33,44,55,66,77)
# Return second element of the vector
x[2]
```

```
[1] 22
```

```
# Return second through fifth elements
x[2:5]
```

```
[1] 22 33 44 55
```

```
# Return all but second element
x[-2]
```

```
[1] 11 33 44 55 66 77
```

```
# Return all the elements that are less than 50
x[ x < 50 ]
```

```
[1] 11 22 33 44
```
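The last example works because the comparison `x < 50` itself returns a logical vector that is then used as a mask. The sketch below, in base R with illustrative variable names, shows the mask and a few related forms of indexing.

```r
x <- c(11, 22, 33, 44, 55, 66, 77)

# Select several positions at once
x[c(1, 3, 7)]

# The comparison returns one TRUE/FALSE value per element
mask <- x < 50
print(mask)

# which() converts the mask into positions
idx <- which(mask)
print(idx)

# Combine conditions with & (and) or | (or)
mid <- x[x > 20 & x < 60]
print(mid)
```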

There are many R functions that work with vectors.

The most helpful summary functions are:

- summary
- unique
- table (for vectors with very few unique values - categorical variables)

There are many basic statistical functions:

*mean, median, min, max, sd, var, range, sort, etc.*
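As a brief illustration, the functions listed above can be applied to the temperature vector defined earlier. It is repeated here so the sketch runs on its own, and `counts` is a name chosen for this example.

```r
temp <- c(50, 45, 65, 85, 90, 90, 89)

# Five-number summary plus the mean
summary(temp)

# Distinct values, in order of first appearance
unique(temp)

# Frequency of each value (90 appears twice)
counts <- table(temp)
print(counts)

# Basic statistics
mean(temp)
median(temp)
sd(temp)
range(temp)

# Sort in decreasing order
sort(temp, decreasing = TRUE)
```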

To find a help file for any R function, type *?function_name* or *help(function_name)* in the console window.

To search through R help topics, use *??"key phrase"* or *help.search("key phrase")*

The dataframe salaries that we read earlier contains 6 different columns/vectors. Let's explore each column. To access the column by name use a dollar sign:

```
# Find the range of the column salary
range( salaries$salary )
```

```
[1] 57800 186960
```
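The dollar sign is one of several equivalent ways to pull a column out of a dataframe. The sketch below uses a small made-up dataframe named `staff` as a stand-in for `salaries`, so that it can be run without the input file.

```r
# Hypothetical toy dataframe standing in for salaries
staff <- data.frame(
  rank   = c("Prof", "AsstProf", "Prof"),
  salary = c(120000, 80000, 95000),
  stringsAsFactors = FALSE
)

# Three ways to get the same column as a vector
by_dollar  <- staff$salary
by_name    <- staff[["salary"]]
by_bracket <- staff[, "salary"]

identical(by_dollar, by_name)
identical(by_dollar, by_bracket)

# Count the observations in each category of a column
table(staff$rank)
```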

**Use summary(), min(), max(), mean() functions to explore the salary column**

**Try to execute: hist( salaries$salary )**

Dataframe slicing is very similar to vector slicing, but now we have to work with 2-dimensional data:

```
# Specify row and column:
salaries[3, 5]
```

```
[1] Male
Levels: Female Male
```

```
# Specify the row number only (you still need to enter the comma, but leave the column value blank):
salaries[3, ]
```

```
  rank discipline yrs.since.phd yrs.service  sex salary
3 Prof          A            23          20 Male 110515
```

```
# Get only those observations for which the salary value is greater than 100,000:
salaries[ salaries$salary > 100000, ]
```

```
        rank discipline yrs.since.phd yrs.service    sex salary
1       Prof          B            56          49   Male 186960
3       Prof          A            23          20   Male 110515
4       Prof          A            40          31   Male 131205
5       Prof          B            20          18   Male 104800
6       Prof          A            20          20   Male 122400
8       Prof          A            18          18   Male 126300
11      Prof          B            39          33   Male 128250
12      Prof          B            23          23   Male 134778
14      Prof          B            35          33   Male 162200
15      Prof          B            25          19   Male 153750
16      Prof          B            17           3   Male 150480
19      Prof          A            19           7   Male 107300
20      Prof          A            29          27   Male 150500
22      Prof          A            33          30   Male 103106
27      Prof          A            38          19   Male 148750
28      Prof          A            45          43   Male 155865
30      Prof          B            21          20   Male 123683
31 AssocProf          B             9           7   Male 107008
32      Prof          B            22          21   Male 155750
33      Prof          A            27          19   Male 103275
34      Prof          B            18          18   Male 120000
35 AssocProf          B            12           8   Male 119800
36      Prof          B            28          23   Male 126933
37      Prof          B            45          45   Male 146856
38      Prof          A            20           8   Male 102000
40      Prof          B            18          18 Female 129000
41      Prof          A            39          36 Female 137000
45      Prof          B            23          19 Female 151768
46      Prof          B            25          25 Female 140096
48 AssocProf          B            11          11 Female 103613
49      Prof          B            17          17 Female 111512
50      Prof          B            17          18 Female 122960
52      Prof          B            20          14 Female 127512
53      Prof          A            12           0 Female 105000
59      Prof          B            36          26 Female 144651
60 AssocProf          B            12          10 Female 103994
62 AssocProf          B            13          10 Female 103750
63 AssocProf          B            14           7 Female 109650
66      Prof          A            36          19 Female 117555
70      Prof          A            28           7 Female 116450
73      Prof          B            24          15 Female 161101
74      Prof          B            18          10 Female 105450
75 AssocProf          B            19           6 Female 104542
76      Prof          B            17          17 Female 124312
77      Prof          A            28          14 Female 109954
78      Prof          A            23          15 Female 109646
```
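Row filters like the one above can be combined with other conditions and with column selection. The sketch below uses a small made-up dataframe named `staff` rather than the real `salaries` data, so it runs without the input file. `subset()` is a base R alternative shown for comparison.

```r
# Hypothetical toy dataframe standing in for salaries
staff <- data.frame(
  rank   = c("Prof", "AsstProf", "Prof", "AssocProf"),
  sex    = c("Male", "Female", "Female", "Male"),
  salary = c(120000, 80000, 105000, 98000),
  stringsAsFactors = FALSE
)

# Combine two conditions with & (and)
high_female <- staff[staff$salary > 100000 & staff$sex == "Female", ]
print(high_female)

# Keep only some columns for the matching rows
staff[staff$salary > 100000, c("rank", "salary")]

# subset() expresses the same filter without repeating the dataframe name
same <- subset(staff, salary > 100000 & sex == "Female")

# Count how many rows satisfy a condition
n_high <- sum(staff$salary > 100000)
print(n_high)
```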