--- title: "Data Wrangling in R" author: "Boston University" date: "January 29, 2019" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` ## Data Science Workflow 1. Import 2. Clean (tidy) 3. Understand: + Transform + Visualize + Model 4. Communicate The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. ### Core tidyverse packages * **readr** - import data * **tidyr** - for data "tidying" * **dplyr** - data manipulation * **stringr** - string manipulation * **forcats** -factors handling * **ggplot2** - data visualization * **tibble** - modern representation of a data frame * **purrr** - functional programming ### *** Other related packages *** * **hms** - time manipulation * **lubridate** - dates/times manimpulation * **DBI** - databases * **haven** - data imported from SPSS, SAS and Stata * **readxl** - excel data import * **rvest** - web scaping Tidyverse packages can all be installed together: ```{r eval=FALSE} install.packages("tidyverse") library(tidyverse) ``` We will be using two packages - dplyr and readxl. If these packages have not been installed on your computer, execute: ```{r eval=FALSE} install.packages("readxl") install.packages("dplyr") ```

First we need to load packages into R environment ```{r load], message=FALSE, warning=FALSE} library(readxl) library(dplyr) ``` ## Importing dataset from Excel worksheet Reading data saved in Excel format often comes with some challenges: ```{r } visit <- read_excel("MedData.xlsx", sheet = "Visits") visit ```

We can see that the file contains some extra records at the top and at the bottom of the sheet. Use options to read_excel() function to skip these extra lines: ```{r } visit <- read_excel("MedData.xlsx", sheet = "Visits", skip = 3, # how many lines to skip at the top of the sheet n_max = 13) # how many observations to read visit ``` Let's take a look at the type of each record to make sure that each column was read correctly This can be done with `str()` function. However dplyr package has also handy glimpse() function that has a nicely formatted output ```{r } str(visit) glimpse(visit) ``` Notice that DBP column is marked as "character" column even though it contains numeric values. This normally means that there are some missing values that were coded in non-standard way, so R did not recognize them as missing values. In this case you should examine the column and try to identify the missing values: ```{r } min(visit$DBP) ```

Now we can go back and read file again adding `##N/A` as a Missing Data code: ```{r } visit <- read_excel("MedData.xlsx", sheet = "Visits", skip = 3, n_max = 13, na=c("","NA","##N/A")) glimpse(visit) ``` ## Pipe symbol `%>%` "Pipe" symbol in R was introduced in magrittr package and then it was also implemented in dplyr package - package we will use today It allows to send the output of one function as an input object to another function: ```{r } visit %>% glimpse() visit %>% summary() ```

The following chain of functions: ```{r } sort(unique(visit$`Patient ID`), decreasing=TRUE) ``` can be rewritten as ```{r } visit$`Patient ID` %>% unique() %>% # find unique values sort(decreasing=TRUE) # sort in descending order ``` ## Cleaning and analysing a dataset A dataset is a collection of values, that are usually either: * numbers (if quantitative), * strings (if qualitative), * logical (if binary). Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like phone, age) across units. An observation contains all values measured on the same unit (like a person, or a day, or an event) across attributes #### Clean or Tidy Data: * Each variable forms a column * Each observation forms a row * Each type of observational unit forms a table ## Examples of messy datasets * Column headers are values, not variable names * Multiple variables are stored in one column * Variables stored in both rows and columns * Multiple types of observational units are stored in the same table * A single observational unit is stored in multiple tables. ### dplyr package workflow `dplyr` R package allows us to: * Rename variables ( `rename()` ) * Filter observations by their values ( `filter()` ) * Reorder rows ( `arrange()` ) * Select specific columns/variables ( `select()` ) * Create new variables with functions of existing variables ( `mutate()` ) * Summarise ( `summarise()` ) ### Renaming columns # We can select columns in a Data Frame using "$" symbol, i.e: ```{r } visit$DBP ```

# It is not as easy to do so when the column name contains spaces or other special characters: ```{r } visit$`Patient ID` ```

In some cases in makes sense to rename some columns to make it easy to work : ```{r } visit.clean <- visit %>% rename(id="Patient ID", admission="Admission date", discharge="Discharge date", pulse="Heart Rate") names(visit.clean) ``` ### Working with strings ```{r } # Converting strings to upper and lower case toupper(visit.clean$Allergies) # or using pipe operator visit.clean$Allergies %>% toupper() #Search character vector for a specific pattern: grepl ("pain", visit.clean$Symptoms, ignore.case=TRUE) # Find all strings that start with "chest" grepl ("^chest", visit.clean$Symptoms, ignore.case=TRUE) # Find all strings that end with "pain" grepl ("^chest", visit.clean$Symptoms, ignore.case=TRUE) # Find all strings that contain either fever or pain (or both) grepl ("fever|pain", visit.clean$Symptoms, ignore.case=TRUE) ```

`grep( value=FALSE)`: returns a vector of indecies of element where pattern is found
`grep( value=TRUE)`: returns a vector of indecies of element where pattern is found ```{r } grep ("fever", visit.clean$Symptoms, ignore.case=TRUE, value=FALSE) grep ("fever", visit.clean$Symptoms, ignore.case=TRUE, value=TRUE) ```

`sub()` function can be used to substitute the first occurance of a pattern with another string ```{r } sub (",", ";", visit.clean$Symptoms) ```

`gsub()` function can be used to substitute all occurances of a pattern with another string ```{r } gsub (",", ";", visit.clean$Symptoms) ``` ### Working with dates ```{r } class(visit.clean$admission) curr.time <- Sys.time() # get current date and time curr.date <- Sys.Date() # get current date str(curr.time) # view the structure of an object class(curr.time) # view type of an object # some systems do not have timezone set up Sys.timezone() # convert character string to POSIXlt t1=as.POSIXct("2019-01-29 11:30:00", "%Y-%m-%d %H:%M:%S", tz="EST") str(t1) # list of available Time Zones #OlsonNames() ``` There are a number of handy packages that have various functions to work with dates: * lubridate * chron Some more examples working with dates and times can be found in the following R script [rcs.bu.edu/examples/r/timesSeries/dateTime.R](http://rcs.bu.edu/examples/r/timesSeries/dateTime.R)

Here we will calculate the length of stay of each patient in the hospital: ```{r } visit.clean$discharge - visit.clean$admission #The result variable has "difftime" class: stay <- visit.clean$discharge - visit.clean$admission class(stay) #To convert it to a numeric value use as.numeric() function: as.numeric(visit.clean$discharge - visit.clean$admission) ``` ### Filtering rows ```{r } visit.clean %>% filter( !is.na(DBP) ) ``` R logical operators: * `> >=` * `== !=` * `< <=` * `is.na() !is.na()` * `%in%` #### Examples ```{r } # Select only those records for which pulse columns have values 100 and greater visit.clean %>% filter( pulse > 100 ) # Select only those records for which DBP is less than 60 or SBP is greater than 120 visit.clean %>% filter( DBP < 60 | SBP > 120) # Select only those records for which Temperature is greater than 99 and Symptoms include "fever" visit.clean %>% filter( Temperature > 99 & grepl("fever", Symptoms, ignore.case=T) ) ``` ### Selecting Columns ```{r } visit.clean %>% select(id, Temperature:pulse) ```

dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call: * `starts_with("XY")`: every name that starts with "XY", * `ends_with("XY")`: every name that ends with "XY", * `contains("XY")`: every name that contains "XY", * `matches("XY")`: every name that matches "XY", where "XY" can be a regular expression, * `num_range("x", 1:15)`: the variables named x01, x02, x03,..., x15, * `one_of(XY)`: every name that appears in x, where XY is a character vector. ```{r } visit.clean %>% select( id, ends_with("BP") ) ``` ### Modifying existing and/or creating a new variables ```{r } visit.clean %>% mutate( Temperature = (Temperature-32)*5/9, stay = as.numeric(discharge - admission) ) ``` #### Examples ```{r } # Create a new column MAP which is equal to SBP/3 + 2*DBP/3 visit.clean %>% mutate( MAP=SBP/3 + 2*DBP/3 ) ```

Let's put it all together:
Use visit.clean dataframe as input and * - select only those columns where Temperature is greater than 99F * - select columns ID, DBP and SBP * - calculate new variable MAP ```{r } res <- visit.clean %>% filter( Temperature > 99 ) %>% select( id, DBP, SBP ) %>% mutate( MAP = SBP/3 + 2*DBP/3 ) res ### Calculating summaries ```{r } visit.clean %>% summarise( N = n(), max.t = max(Temperature), min.T = min(Temperature) ) ```

**bilt-in functions often used within summarise()**: * averages: `mean(), median()` * spread: `sd(), IQR(), mad()` * range: `min(), max()` * count: `n(), n_distinct()` #### Examples: ```{r } # Calculate the number of distinct patients visit.clean %>% summarise( N = n_distinct(id) ) ``` ### Group by one or more variables ```{r } visit.clean %>% group_by(id) %>% summarise( ave.pulse = mean(pulse) ) ```

Useful functions often used with group_by(): * `first()` * `last()` * `nth()` ```{r } # For each patient select the first record and find the lenght of stay visit.clean %>% group_by(id) %>% summarise( first(discharge - admission) ) ``` ### Sorting dataframe by one or more variables ```{r } visit.clean %>% arrange(id, admission) ``` ## Joining 2 dataframes First let's read the data for each patient: ```{r } pinfo <- read_excel("MedData.xlsx", sheet = "Patient Data") #let's make sure the date is well formatted pinfo %>%head() pinfo %>%glimpse() pinfo %>%summary() ``` Now we want to have a single dataframe that contains patient infomration and patient visit informtaion There are a number of `join` functions in `dplyr` package: * `inner_join()` * `left_join()` * `right_join()` * `full_join()` * `semi_join()` - return all rows from x where there are matching values in y, keeping just columns from x. * `anti_join()` - return all rows from x where there are not matching values in y, keeping just columns from x. ```{r } #Let's try full join of both dataframes we have full_join(visit.clean, pinfo, by = c("id"="ID")) # left join result <- visit.clean %>% left_join(pinfo, by = c("id"="ID")) result # Calculate the mean stay of the hospital for Male and Female patients: result %>% filter( !is.na(Gender) ) %>% mutate( stay = as.numeric(discharge-admission) ) %>% group_by ( Gender ) %>% summarize( ave.stay = mean(stay) ) ``` ## Converting Wide dataframes to long dataframes and back ```{r } library(tidyr) city.temps <- data.frame( time = as.Date('2018-09-03') + 0:4, Boston = rnorm( 5, 75, 3), Denver = rnorm( 5, 85, 5), SanFrancisco = rnorm(5, 80, 5), Austin = rnorm( 5, 90, 5) ) city.temps # Use gather to rearrange a table into a tidy data.frame: # gather(data, key = "key", value = "value", ..., na.rm = FALSE, # convert = FALSE, factor_key = FALSE) city.temps2 <- gather( city.temps, key = "City", value = "Temperature", -time, # collection of the columns, factor_key = TRUE) # if key variable needs to be converted to a factor # or city.temps2 <- gather( city.temps, key = "City", value = "Temperature", Boston: Austin, # collection of the columns, factor_key = TRUE) # if key variable needs to be converted to a factor #or city.temps2 <- gather( city.temps, key = "City", value = "Temperature", c("Boston","Denver","SanFrancisco","Austin"), # collection of the columns, factor_key = TRUE) # if key variable needs to be converted to a factor city.temps2 glimpse(city.temps2) ```

Sometimes it is useful to be able to perform the opposite operation: convert long format dataframe into wide representation ```{r } city.temps3 <- spread( city.temps2, City, Temperature) city.temps3 ```