---
title: "Data Wrangling in R"
author: "Boston University"
date: "January 29, 2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Data Science Workflow
1. Import
2. Clean (tidy)
3. Understand:
+ Transform
+ Visualize
+ Model
4. Communicate
The tidyverse is an opinionated collection of R packages designed for data science.
All packages share an underlying design philosophy, grammar, and data structures.
### Core tidyverse packages
* **readr** - import data
* **tidyr** - for data "tidying"
* **dplyr** - data manipulation
* **stringr** - string manipulation
* **forcats** -factors handling
* **ggplot2** - data visualization
* **tibble** - modern representation of a data frame
* **purrr** - functional programming
### *** Other related packages ***
* **hms** - time manipulation
* **lubridate** - dates/times manimpulation
* **DBI** - databases
* **haven** - data imported from SPSS, SAS and Stata
* **readxl** - excel data import
* **rvest** - web scaping
Tidyverse packages can all be installed together:
```{r eval=FALSE}
install.packages("tidyverse")
library(tidyverse)
```
We will be using two packages - dplyr and readxl. If these packages have not been installed on your computer, execute:
```{r eval=FALSE}
install.packages("readxl")
install.packages("dplyr")
```
First we need to load packages into R environment
```{r load], message=FALSE, warning=FALSE}
library(readxl)
library(dplyr)
```
## Importing dataset from Excel worksheet
Reading data saved in Excel format often comes with some challenges:
```{r }
visit <- read_excel("MedData.xlsx", sheet = "Visits")
visit
```
We can see that the file contains some extra records at the top and at the bottom of the sheet.
Use options to read_excel() function to skip these extra lines:
```{r }
visit <- read_excel("MedData.xlsx",
sheet = "Visits",
skip = 3, # how many lines to skip at the top of the sheet
n_max = 13) # how many observations to read
visit
```
Let's take a look at the type of each record to make sure that each column was read correctly
This can be done with `str()` function.
However dplyr package has also handy glimpse() function that has a nicely formatted output
```{r }
str(visit)
glimpse(visit)
```
Notice that DBP column is marked as "character" column even though it contains numeric values.
This normally means that there are some missing values that were coded in non-standard way, so
R did not recognize them as missing values.
In this case you should examine the column and try to identify the missing values:
```{r }
min(visit$DBP)
```
Now we can go back and read file again adding `##N/A` as a Missing Data code:
```{r }
visit <- read_excel("MedData.xlsx",
sheet = "Visits",
skip = 3,
n_max = 13,
na=c("","NA","##N/A"))
glimpse(visit)
```
## Pipe symbol `%>%`
"Pipe" symbol in R was introduced in magrittr package and then it was also implemented in dplyr package - package we will use today
It allows to send the output of one function as an input object to another function:
```{r }
visit %>% glimpse()
visit %>% summary()
```
The following chain of functions:
```{r }
sort(unique(visit$`Patient ID`), decreasing=TRUE)
```
can be rewritten as
```{r }
visit$`Patient ID` %>%
unique() %>% # find unique values
sort(decreasing=TRUE) # sort in descending order
```
## Cleaning and analysing a dataset
A dataset is a collection of values, that are usually either:
* numbers (if quantitative),
* strings (if qualitative),
* logical (if binary).
Every value belongs to a variable and an observation.
A variable contains all values that measure the same underlying attribute (like phone, age) across units.
An observation contains all values measured on the same unit (like a person, or a day, or an event) across attributes
#### Clean or Tidy Data:
* Each variable forms a column
* Each observation forms a row
* Each type of observational unit forms a table
## Examples of messy datasets
* Column headers are values, not variable names
* Multiple variables are stored in one column
* Variables stored in both rows and columns
* Multiple types of observational units are stored in the same table
* A single observational unit is stored in multiple tables.
### dplyr package workflow
`dplyr` R package allows us to:
* Rename variables ( `rename()` )
* Filter observations by their values ( `filter()` )
* Reorder rows ( `arrange()` )
* Select specific columns/variables ( `select()` )
* Create new variables with functions of existing variables ( `mutate()` )
* Summarise ( `summarise()` )
### Renaming columns
# We can select columns in a Data Frame using "$" symbol, i.e:
```{r }
visit$DBP
```
# It is not as easy to do so when the column name contains spaces or other special characters:
```{r }
visit$`Patient ID`
```
In some cases in makes sense to rename some columns to make it easy to work :
```{r }
visit.clean <- visit %>%
rename(id="Patient ID",
admission="Admission date",
discharge="Discharge date",
pulse="Heart Rate")
names(visit.clean)
```
### Working with strings
```{r }
# Converting strings to upper and lower case
toupper(visit.clean$Allergies)
# or using pipe operator
visit.clean$Allergies %>% toupper()
#Search character vector for a specific pattern:
grepl ("pain", visit.clean$Symptoms, ignore.case=TRUE)
# Find all strings that start with "chest"
grepl ("^chest", visit.clean$Symptoms, ignore.case=TRUE)
# Find all strings that end with "pain"
grepl ("^chest", visit.clean$Symptoms, ignore.case=TRUE)
# Find all strings that contain either fever or pain (or both)
grepl ("fever|pain", visit.clean$Symptoms, ignore.case=TRUE)
```
`grep( value=FALSE)`: returns a vector of indecies of element where pattern is found
`grep( value=TRUE)`: returns a vector of indecies of element where pattern is found
```{r }
grep ("fever", visit.clean$Symptoms, ignore.case=TRUE, value=FALSE)
grep ("fever", visit.clean$Symptoms, ignore.case=TRUE, value=TRUE)
```
`sub()` function can be used to substitute the first occurance of a pattern with another string
```{r }
sub (",", ";", visit.clean$Symptoms)
```
`gsub()` function can be used to substitute all occurances of a pattern with another string
```{r }
gsub (",", ";", visit.clean$Symptoms)
```
### Working with dates
```{r }
class(visit.clean$admission)
curr.time <- Sys.time() # get current date and time
curr.date <- Sys.Date() # get current date
str(curr.time) # view the structure of an object
class(curr.time) # view type of an object
# some systems do not have timezone set up
Sys.timezone()
# convert character string to POSIXlt
t1=as.POSIXct("2019-01-29 11:30:00", "%Y-%m-%d %H:%M:%S", tz="EST")
str(t1)
# list of available Time Zones
#OlsonNames()
```
There are a number of handy packages that have various functions to work with dates:
* lubridate
* chron
Some more examples working with dates and times can be found in the following R script
[rcs.bu.edu/examples/r/timesSeries/dateTime.R](http://rcs.bu.edu/examples/r/timesSeries/dateTime.R)
Here we will calculate the length of stay of each patient in the hospital:
```{r }
visit.clean$discharge - visit.clean$admission
#The result variable has "difftime" class:
stay <- visit.clean$discharge - visit.clean$admission
class(stay)
#To convert it to a numeric value use as.numeric() function:
as.numeric(visit.clean$discharge - visit.clean$admission)
```
### Filtering rows
```{r }
visit.clean %>% filter( !is.na(DBP) )
```
R logical operators:
* `> >=`
* `== !=`
* `< <=`
* `is.na() !is.na()`
* `%in%`
#### Examples
```{r }
# Select only those records for which pulse columns have values 100 and greater
visit.clean %>% filter( pulse > 100 )
# Select only those records for which DBP is less than 60 or SBP is greater than 120
visit.clean %>% filter( DBP < 60 | SBP > 120)
# Select only those records for which Temperature is greater than 99 and Symptoms include "fever"
visit.clean %>% filter( Temperature > 99 & grepl("fever", Symptoms, ignore.case=T) )
```
### Selecting Columns
```{r }
visit.clean %>% select(id, Temperature:pulse)
```
dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:
* `starts_with("XY")`: every name that starts with "XY",
* `ends_with("XY")`: every name that ends with "XY",
* `contains("XY")`: every name that contains "XY",
* `matches("XY")`: every name that matches "XY", where "XY" can be a regular expression,
* `num_range("x", 1:15)`: the variables named x01, x02, x03,..., x15,
* `one_of(XY)`: every name that appears in x, where XY is a character vector.
```{r }
visit.clean %>% select( id, ends_with("BP") )
```
### Modifying existing and/or creating a new variables
```{r }
visit.clean %>% mutate( Temperature = (Temperature-32)*5/9, stay = as.numeric(discharge - admission) )
```
#### Examples
```{r }
# Create a new column MAP which is equal to SBP/3 + 2*DBP/3
visit.clean %>% mutate( MAP=SBP/3 + 2*DBP/3 )
```
Let's put it all together:
Use visit.clean dataframe as input and
* - select only those columns where Temperature is greater than 99F
* - select columns ID, DBP and SBP
* - calculate new variable MAP
```{r }
res <- visit.clean %>%
filter( Temperature > 99 ) %>%
select( id, DBP, SBP ) %>%
mutate( MAP = SBP/3 + 2*DBP/3 )
res
### Calculating summaries
```{r }
visit.clean %>% summarise( N = n(), max.t = max(Temperature), min.T = min(Temperature) )
```
**bilt-in functions often used within summarise()**:
* averages: `mean(), median()`
* spread: `sd(), IQR(), mad()`
* range: `min(), max()`
* count: `n(), n_distinct()`
#### Examples:
```{r }
# Calculate the number of distinct patients
visit.clean %>% summarise( N = n_distinct(id) )
```
### Group by one or more variables
```{r }
visit.clean %>% group_by(id) %>% summarise( ave.pulse = mean(pulse) )
```
Useful functions often used with group_by():
* `first()`
* `last()`
* `nth()`
```{r }
# For each patient select the first record and find the lenght of stay
visit.clean %>% group_by(id) %>% summarise( first(discharge - admission) )
```
### Sorting dataframe by one or more variables
```{r }
visit.clean %>% arrange(id, admission)
```
## Joining 2 dataframes
First let's read the data for each patient:
```{r }
pinfo <- read_excel("MedData.xlsx", sheet = "Patient Data")
#let's make sure the date is well formatted
pinfo %>%head()
pinfo %>%glimpse()
pinfo %>%summary()
```
Now we want to have a single dataframe that contains patient infomration and patient visit informtaion
There are a number of `join` functions in `dplyr` package:
* `inner_join()`
* `left_join()`
* `right_join()`
* `full_join()`
* `semi_join()` - return all rows from x where there are matching values in y, keeping just columns from x.
* `anti_join()` - return all rows from x where there are not matching values in y, keeping just columns from x.
```{r }
#Let's try full join of both dataframes we have
full_join(visit.clean, pinfo, by = c("id"="ID"))
# left join
result <- visit.clean %>% left_join(pinfo, by = c("id"="ID"))
result
# Calculate the mean stay of the hospital for Male and Female patients:
result %>%
filter( !is.na(Gender) ) %>%
mutate( stay = as.numeric(discharge-admission) ) %>%
group_by ( Gender ) %>%
summarize( ave.stay = mean(stay) )
```
## Converting Wide dataframes to long dataframes and back
```{r }
library(tidyr)
city.temps <- data.frame( time = as.Date('2018-09-03') + 0:4,
Boston = rnorm( 5, 75, 3),
Denver = rnorm( 5, 85, 5),
SanFrancisco = rnorm(5, 80, 5),
Austin = rnorm( 5, 90, 5) )
city.temps
# Use gather to rearrange a table into a tidy data.frame:
# gather(data, key = "key", value = "value", ..., na.rm = FALSE,
# convert = FALSE, factor_key = FALSE)
city.temps2 <- gather( city.temps,
key = "City",
value = "Temperature",
-time, # collection of the columns,
factor_key = TRUE) # if key variable needs to be converted to a factor
# or
city.temps2 <- gather( city.temps,
key = "City",
value = "Temperature",
Boston: Austin, # collection of the columns,
factor_key = TRUE) # if key variable needs to be converted to a factor
#or
city.temps2 <- gather( city.temps,
key = "City",
value = "Temperature",
c("Boston","Denver","SanFrancisco","Austin"), # collection of the columns,
factor_key = TRUE) # if key variable needs to be converted to a factor
city.temps2
glimpse(city.temps2)
```
Sometimes it is useful to be able to
perform the opposite operation: convert long format dataframe into wide representation
```{r }
city.temps3 <- spread( city.temps2, City, Temperature)
city.temps3
```