Data Science Workflow

Import
Clean (tidy)
Understand:

Transform
Visualize
Model

Communicate

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Core tidyverse packages

readr - import data
tidyr - for data “tidying”
dplyr - data manipulation
stringr - string manipulation
forcats -factors handling
ggplot2 - data visualization
tibble - modern representation of a data frame
purrr - functional programming

* Other related packages hms - time manipulation

lubridate - dates/times manimpulation
DBI - databases
haven - data imported from SPSS, SAS and Stata
readxl - excel data import
rvest - web scaping

Tidyverse packages can all be installed together:

install.packages("tidyverse")
library(tidyverse)

We will be using two packages - dplyr and readxl. If these packages have not been installed on your computer, execute:

install.packages("readxl")
install.packages("dplyr")

First we need to load packages into R environment

library(readxl)
library(dplyr)

Importing dataset from Excel worksheet

Reading data saved in Excel format often comes with some challenges:

visit <- read_excel("MedData.xlsx", sheet = "Visits")
visit

## # A tibble: 19 x 9
##    `# This file conta~ X__1   X__2   X__3  X__4  X__5  X__6  X__7  X__8   
##    <chr>               <chr>  <chr>  <chr> <chr> <chr> <chr> <chr> <chr>  
##  1 # Period of observ~ <NA>   <NA>   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   
##  2 <NA>                <NA>   <NA>   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   
##  3 Patient ID          Admis~ Disch~ Temp~ DBP   SBP   Alle~ Hear~ Sympto~
##  4 11425               42786  42788  101.5 55    95    No    110   Fever,~
##  5 11425               43184  43185  98.3  60    90    No    80    Abdomi~
##  6 10873               42954  42958  98.6  100   150   yes   105   Chest ~
##  7 14562               43086  43088  98.8  70    100   NO    70    Broken~
##  8 19112               42808  42813  98.2  56    80    yes   65    Fatigu~
##  9 19112               43284  43289  98.4  60    85    yes   70    Chest ~
## 10 19112               43123  43127  103.3 60    90    YES   80    Cough,~
## 11 18567               43048  43050  102.7 80    120   no    75    Nausea~
## 12 14475               43330  43331  98.8  75    110   no    80    Proble~
## 13 15940               43136  43138  102.5 65    115   NO    77    Fever,~
## 14 15940               42805  42807  103.4 70    120   no    84    Chest ~
## 15 15940               43029  43032  101.2 ##N/A 110   no    80    Fever,~
## 16 15940               43407  43408  98.6  70    110   no    75    Schedu~
## 17 <NA>                <NA>   <NA>   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   
## 18 <NA>                <NA>   <NA>   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   
## 19 Total               13     <NA>   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>

We can see that the file contains some extra records at the top and at the bottom of the sheet. Use options to read_excel() function to skip these extra lines:

visit <- read_excel("MedData.xlsx", 
                    sheet = "Visits", 
                    skip = 3,  # how many lines to skip at the top of the sheet
                    n_max = 13)  # how many observations to read
visit

## # A tibble: 13 x 9
##    `Patient ID` `Admission date`    `Discharge date`    Temperature DBP  
##           <dbl> <dttm>              <dttm>                    <dbl> <chr>
##  1        11425 2017-02-20 00:00:00 2017-02-22 00:00:00       102.  55   
##  2        11425 2018-03-25 00:00:00 2018-03-26 00:00:00        98.3 60   
##  3        10873 2017-08-07 00:00:00 2017-08-11 00:00:00        98.6 100  
##  4        14562 2017-12-17 00:00:00 2017-12-19 00:00:00        98.8 70   
##  5        19112 2017-03-14 00:00:00 2017-03-19 00:00:00        98.2 56   
##  6        19112 2018-07-03 00:00:00 2018-07-08 00:00:00        98.4 60   
##  7        19112 2018-01-23 00:00:00 2018-01-27 00:00:00       103.  60   
##  8        18567 2017-11-09 00:00:00 2017-11-11 00:00:00       103.  80   
##  9        14475 2018-08-18 00:00:00 2018-08-19 00:00:00        98.8 75   
## 10        15940 2018-02-05 00:00:00 2018-02-07 00:00:00       102.  65   
## 11        15940 2017-03-11 00:00:00 2017-03-13 00:00:00       103.  70   
## 12        15940 2017-10-21 00:00:00 2017-10-24 00:00:00       101.  ##N/A
## 13        15940 2018-11-03 00:00:00 2018-11-04 00:00:00        98.6 70   
## # ... with 4 more variables: SBP <dbl>, Allergies <chr>, `Heart
## #   Rate` <dbl>, Symptoms <chr>

Let’s take a look at the type of each record to make sure that each column was read correctly This can be done with str() function. However dplyr package has also handy glimpse() function that has a nicely formatted output

str(visit)

## Classes 'tbl_df', 'tbl' and 'data.frame':    13 obs. of  9 variables:
##  $ Patient ID    : num  11425 11425 10873 14562 19112 ...
##  $ Admission date: POSIXct, format: "2017-02-20" "2018-03-25" ...
##  $ Discharge date: POSIXct, format: "2017-02-22" "2018-03-26" ...
##  $ Temperature   : num  101.5 98.3 98.6 98.8 98.2 ...
##  $ DBP           : chr  "55" "60" "100" "70" ...
##  $ SBP           : num  95 90 150 100 80 85 90 120 110 115 ...
##  $ Allergies     : chr  "No" "No" "yes" "NO" ...
##  $ Heart Rate    : num  110 80 105 70 65 70 80 75 80 77 ...
##  $ Symptoms      : chr  "Fever, cough, sore throat, headache" "Abdominal pain" "Chest Pain" "Broken arm" ...

glimpse(visit)

## Observations: 13
## Variables: 9
## $ `Patient ID`     <dbl> 11425, 11425, 10873, 14562, 19112, 19112, 191...
## $ `Admission date` <dttm> 2017-02-20, 2018-03-25, 2017-08-07, 2017-12-...
## $ `Discharge date` <dttm> 2017-02-22, 2018-03-26, 2017-08-11, 2017-12-...
## $ Temperature      <dbl> 101.5, 98.3, 98.6, 98.8, 98.2, 98.4, 103.3, 1...
## $ DBP              <chr> "55", "60", "100", "70", "56", "60", "60", "8...
## $ SBP              <dbl> 95, 90, 150, 100, 80, 85, 90, 120, 110, 115, ...
## $ Allergies        <chr> "No", "No", "yes", "NO", "yes", "yes", "YES",...
## $ `Heart Rate`     <dbl> 110, 80, 105, 70, 65, 70, 80, 75, 80, 77, 84,...
## $ Symptoms         <chr> "Fever, cough, sore throat, headache", "Abdom...

Notice that DBP column is marked as “character” column even though it contains numeric values. This normally means that there are some missing values that were coded in non-standard way, so R did not recognize them as missing values. In this case you should examine the column and try to identify the missing values:

min(visit$DBP)

## [1] "##N/A"

Now we can go back and read file again adding ##N/A as a Missing Data code:

visit <- read_excel("MedData.xlsx", 
                    sheet = "Visits", 
                    skip = 3,
                    n_max = 13,
                    na=c("","NA","##N/A"))
glimpse(visit)

## Observations: 13
## Variables: 9
## $ `Patient ID`     <dbl> 11425, 11425, 10873, 14562, 19112, 19112, 191...
## $ `Admission date` <dttm> 2017-02-20, 2018-03-25, 2017-08-07, 2017-12-...
## $ `Discharge date` <dttm> 2017-02-22, 2018-03-26, 2017-08-11, 2017-12-...
## $ Temperature      <dbl> 101.5, 98.3, 98.6, 98.8, 98.2, 98.4, 103.3, 1...
## $ DBP              <dbl> 55, 60, 100, 70, 56, 60, 60, 80, 75, 65, 70, ...
## $ SBP              <dbl> 95, 90, 150, 100, 80, 85, 90, 120, 110, 115, ...
## $ Allergies        <chr> "No", "No", "yes", "NO", "yes", "yes", "YES",...
## $ `Heart Rate`     <dbl> 110, 80, 105, 70, 65, 70, 80, 75, 80, 77, 84,...
## $ Symptoms         <chr> "Fever, cough, sore throat, headache", "Abdom...

Pipe symbol `%>%`

“Pipe” symbol in R was introduced in magrittr package and then it was also implemented in dplyr package - package we will use today It allows to send the output of one function as an input object to another function:

visit %>% glimpse()

## Observations: 13
## Variables: 9
## $ `Patient ID`     <dbl> 11425, 11425, 10873, 14562, 19112, 19112, 191...
## $ `Admission date` <dttm> 2017-02-20, 2018-03-25, 2017-08-07, 2017-12-...
## $ `Discharge date` <dttm> 2017-02-22, 2018-03-26, 2017-08-11, 2017-12-...
## $ Temperature      <dbl> 101.5, 98.3, 98.6, 98.8, 98.2, 98.4, 103.3, 1...
## $ DBP              <dbl> 55, 60, 100, 70, 56, 60, 60, 80, 75, 65, 70, ...
## $ SBP              <dbl> 95, 90, 150, 100, 80, 85, 90, 120, 110, 115, ...
## $ Allergies        <chr> "No", "No", "yes", "NO", "yes", "yes", "YES",...
## $ `Heart Rate`     <dbl> 110, 80, 105, 70, 65, 70, 80, 75, 80, 77, 84,...
## $ Symptoms         <chr> "Fever, cough, sore throat, headache", "Abdom...

visit %>% summary()

##    Patient ID    Admission date                Discharge date      
##  Min.   :10873   Min.   :2017-02-20 00:00:00   Min.   :2017-02-22  
##  1st Qu.:14475   1st Qu.:2017-08-07 00:00:00   1st Qu.:2017-08-11  
##  Median :15940   Median :2017-12-17 00:00:00   Median :2017-12-19  
##  Mean   :15571   Mean   :2017-12-06 09:13:50   Mean   :2017-12-09  
##  3rd Qu.:18567   3rd Qu.:2018-03-25 00:00:00   3rd Qu.:2018-03-26  
##  Max.   :19112   Max.   :2018-11-03 00:00:00   Max.   :2018-11-04  
##                                                                    
##   Temperature         DBP              SBP         Allergies        
##  Min.   : 98.2   Min.   : 55.00   Min.   : 80.0   Length:13         
##  1st Qu.: 98.6   1st Qu.: 60.00   1st Qu.: 90.0   Class :character  
##  Median : 98.8   Median : 67.50   Median :110.0   Mode  :character  
##  Mean   :100.3   Mean   : 68.42   Mean   :105.8                     
##  3rd Qu.:102.5   3rd Qu.: 71.25   3rd Qu.:115.0                     
##  Max.   :103.4   Max.   :100.00   Max.   :150.0                     
##                  NA's   :1                                          
##    Heart Rate       Symptoms        
##  Min.   : 65.00   Length:13         
##  1st Qu.: 75.00   Class :character  
##  Median : 80.00   Mode  :character  
##  Mean   : 80.85                     
##  3rd Qu.: 80.00                     
##  Max.   :110.00                     
##

The following chain of functions:

sort(unique(visit$`Patient ID`), decreasing=TRUE)

## [1] 19112 18567 15940 14562 14475 11425 10873

can be rewritten as

visit$`Patient ID` %>% 
  unique()  %>%            # find unique values
  sort(decreasing=TRUE)    # sort  in descending order

## [1] 19112 18567 15940 14562 14475 11425 10873

Cleaning and analysing a dataset

A dataset is a collection of values, that are usually either:

numbers (if quantitative),
strings (if qualitative),
logical (if binary).

Every value belongs to a variable and an observation.

A variable contains all values that measure the same underlying attribute (like phone, age) across units.

An observation contains all values measured on the same unit (like a person, or a day, or an event) across attributes

Clean or Tidy Data:

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

Examples of messy datasets

Column headers are values, not variable names
Multiple variables are stored in one column
Variables stored in both rows and columns
Multiple types of observational units are stored in the same table
A single observational unit is stored in multiple tables.

dplyr package workflow

dplyr R package allows us to:

Rename variables ( rename() )
Filter observations by their values ( filter() )
Reorder rows ( arrange() )
Select specific columns/variables ( select() )
Create new variables with functions of existing variables ( mutate() )
Summarise ( summarise() )

Renaming columns

We can select columns in a Data Frame using “$” symbol, i.e:

visit$DBP

##  [1]  55  60 100  70  56  60  60  80  75  65  70  NA  70

# It is not as easy to do so when the column name contains spaces or other special characters:

visit$`Patient ID`

##  [1] 11425 11425 10873 14562 19112 19112 19112 18567 14475 15940 15940
## [12] 15940 15940

In some cases in makes sense to rename some columns to make it easy to work :

visit.clean <- visit %>%
  rename(id="Patient ID", 
         admission="Admission date",
         discharge="Discharge date",
         pulse="Heart Rate")
names(visit.clean)

## [1] "id"          "admission"   "discharge"   "Temperature" "DBP"        
## [6] "SBP"         "Allergies"   "pulse"       "Symptoms"

Working with strings

# Converting strings to upper and lower case
toupper(visit.clean$Allergies)

##  [1] "NO"  "NO"  "YES" "NO"  "YES" "YES" "YES" "NO"  "NO"  "NO"  "NO" 
## [12] "NO"  "NO"

# or using pipe operator
visit.clean$Allergies %>% toupper()

##  [1] "NO"  "NO"  "YES" "NO"  "YES" "YES" "YES" "NO"  "NO"  "NO"  "NO" 
## [12] "NO"  "NO"

#Search character vector for a specific pattern:
grepl ("pain", visit.clean$Symptoms,  ignore.case=TRUE)

##  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [12] FALSE FALSE

# Find all strings that start with "chest"
grepl ("^chest", visit.clean$Symptoms,  ignore.case=TRUE)

##  [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [12] FALSE FALSE

# Find all strings that end with "pain"
grepl ("^chest", visit.clean$Symptoms,  ignore.case=TRUE)

##  [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [12] FALSE FALSE

# Find all strings that contain either fever or pain (or both)
grepl ("fever|pain", visit.clean$Symptoms,  ignore.case=TRUE)

##  [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [12]  TRUE FALSE

grep( value=FALSE): returns a vector of indecies of element where pattern is found
grep( value=TRUE): returns a vector of indecies of element where pattern is found

grep ("fever", visit.clean$Symptoms,  ignore.case=TRUE, value=FALSE)

## [1]  1  7 10 12

grep ("fever", visit.clean$Symptoms,  ignore.case=TRUE, value=TRUE)

## [1] "Fever, cough, sore throat, headache"
## [2] "Cough, fever, sore throat"          
## [3] "Fever, cough, sore throat, headache"
## [4] "Fever, cough, sore throat, headache"

sub() function can be used to substitute the first occurance of a pattern with another string

sub (",", ";", visit.clean$Symptoms)

##  [1] "Fever; cough, sore throat, headache"    
##  [2] "Abdominal pain"                         
##  [3] "Chest Pain"                             
##  [4] "Broken arm"                             
##  [5] "Fatigue; weakness, chest pain, headache"
##  [6] "Chest pain; headache"                   
##  [7] "Cough; fever, sore throat"              
##  [8] "Nausea; vomiting"                       
##  [9] "Problems breathing"                     
## [10] "Fever; cough, sore throat, headache"    
## [11] "Chest Pain"                             
## [12] "Fever; cough, sore throat, headache"    
## [13] "Scheduled procedure"

gsub() function can be used to substitute all occurances of a pattern with another string

gsub (",", ";", visit.clean$Symptoms)

##  [1] "Fever; cough; sore throat; headache"    
##  [2] "Abdominal pain"                         
##  [3] "Chest Pain"                             
##  [4] "Broken arm"                             
##  [5] "Fatigue; weakness; chest pain; headache"
##  [6] "Chest pain; headache"                   
##  [7] "Cough; fever; sore throat"              
##  [8] "Nausea; vomiting"                       
##  [9] "Problems breathing"                     
## [10] "Fever; cough; sore throat; headache"    
## [11] "Chest Pain"                             
## [12] "Fever; cough; sore throat; headache"    
## [13] "Scheduled procedure"

Working with dates

class(visit.clean$admission)

## [1] "POSIXct" "POSIXt"

curr.time <- Sys.time()  # get current date and time
curr.date <- Sys.Date()  # get current date 
str(curr.time)           # view the structure of an object

##  POSIXct[1:1], format: "2019-01-29 22:53:46"

class(curr.time)         # view type of an object

## [1] "POSIXct" "POSIXt"

# some systems do not have timezone set up
Sys.timezone()

## [1] "America/New_York"

# convert character string to POSIXlt
t1=as.POSIXct("2019-01-29 11:30:00", "%Y-%m-%d %H:%M:%S", tz="EST")
str(t1)

##  POSIXct[1:1], format: "2019-01-29 11:30:00"

# list of available Time Zones
#OlsonNames()

There are a number of handy packages that have various functions to work with dates: * lubridate * chron

Some more examples working with dates and times can be found in the following R script rcs.bu.edu/examples/r/timesSeries/dateTime.R

Here we will calculate the length of stay of each patient in the hospital:

visit.clean$discharge - visit.clean$admission

## Time differences in days
##  [1] 2 1 4 2 5 5 4 2 1 2 2 3 1

#The result variable has "difftime" class:
stay <- visit.clean$discharge - visit.clean$admission
class(stay)

## [1] "difftime"

#To convert it to a numeric value use as.numeric() function:
as.numeric(visit.clean$discharge - visit.clean$admission)

##  [1] 2 1 4 2 5 5 4 2 1 2 2 3 1

Filtering rows

visit.clean %>% filter( !is.na(DBP) )

## # A tibble: 12 x 9
##       id admission           discharge           Temperature   DBP   SBP
##    <dbl> <dttm>              <dttm>                    <dbl> <dbl> <dbl>
##  1 11425 2017-02-20 00:00:00 2017-02-22 00:00:00       102.     55    95
##  2 11425 2018-03-25 00:00:00 2018-03-26 00:00:00        98.3    60    90
##  3 10873 2017-08-07 00:00:00 2017-08-11 00:00:00        98.6   100   150
##  4 14562 2017-12-17 00:00:00 2017-12-19 00:00:00        98.8    70   100
##  5 19112 2017-03-14 00:00:00 2017-03-19 00:00:00        98.2    56    80
##  6 19112 2018-07-03 00:00:00 2018-07-08 00:00:00        98.4    60    85
##  7 19112 2018-01-23 00:00:00 2018-01-27 00:00:00       103.     60    90
##  8 18567 2017-11-09 00:00:00 2017-11-11 00:00:00       103.     80   120
##  9 14475 2018-08-18 00:00:00 2018-08-19 00:00:00        98.8    75   110
## 10 15940 2018-02-05 00:00:00 2018-02-07 00:00:00       102.     65   115
## 11 15940 2017-03-11 00:00:00 2017-03-13 00:00:00       103.     70   120
## 12 15940 2018-11-03 00:00:00 2018-11-04 00:00:00        98.6    70   110
## # ... with 3 more variables: Allergies <chr>, pulse <dbl>, Symptoms <chr>

R logical operators:

> >=
== !=
< <=
is.na() !is.na()
%in%

Examples

# Select only those records for which pulse columns have values 100 and greater
visit.clean %>% filter( pulse > 100 )

## # A tibble: 2 x 9
##      id admission           discharge           Temperature   DBP   SBP
##   <dbl> <dttm>              <dttm>                    <dbl> <dbl> <dbl>
## 1 11425 2017-02-20 00:00:00 2017-02-22 00:00:00       102.     55    95
## 2 10873 2017-08-07 00:00:00 2017-08-11 00:00:00        98.6   100   150
## # ... with 3 more variables: Allergies <chr>, pulse <dbl>, Symptoms <chr>

# Select only those records for which DBP is less than 60 or SBP is greater than 120
visit.clean %>% filter( DBP < 60 | SBP > 120)

## # A tibble: 3 x 9
##      id admission           discharge           Temperature   DBP   SBP
##   <dbl> <dttm>              <dttm>                    <dbl> <dbl> <dbl>
## 1 11425 2017-02-20 00:00:00 2017-02-22 00:00:00       102.     55    95
## 2 10873 2017-08-07 00:00:00 2017-08-11 00:00:00        98.6   100   150
## 3 19112 2017-03-14 00:00:00 2017-03-19 00:00:00        98.2    56    80
## # ... with 3 more variables: Allergies <chr>, pulse <dbl>, Symptoms <chr>

# Select only those records for which Temperature is greater than 99 and Symptoms include "fever"
visit.clean %>% filter( Temperature > 99 &  grepl("fever", Symptoms, ignore.case=T)   )

## # A tibble: 4 x 9
##      id admission           discharge           Temperature   DBP   SBP
##   <dbl> <dttm>              <dttm>                    <dbl> <dbl> <dbl>
## 1 11425 2017-02-20 00:00:00 2017-02-22 00:00:00        102.    55    95
## 2 19112 2018-01-23 00:00:00 2018-01-27 00:00:00        103.    60    90
## 3 15940 2018-02-05 00:00:00 2018-02-07 00:00:00        102.    65   115
## 4 15940 2017-10-21 00:00:00 2017-10-24 00:00:00        101.    NA   110
## # ... with 3 more variables: Allergies <chr>, pulse <dbl>, Symptoms <chr>

Selecting Columns

visit.clean %>% select(id, Temperature:pulse)

## # A tibble: 13 x 6
##       id Temperature   DBP   SBP Allergies pulse
##    <dbl>       <dbl> <dbl> <dbl> <chr>     <dbl>
##  1 11425       102.     55    95 No          110
##  2 11425        98.3    60    90 No           80
##  3 10873        98.6   100   150 yes         105
##  4 14562        98.8    70   100 NO           70
##  5 19112        98.2    56    80 yes          65
##  6 19112        98.4    60    85 yes          70
##  7 19112       103.     60    90 YES          80
##  8 18567       103.     80   120 no           75
##  9 14475        98.8    75   110 no           80
## 10 15940       102.     65   115 NO           77
## 11 15940       103.     70   120 no           84
## 12 15940       101.     NA   110 no           80
## 13 15940        98.6    70   110 no           75

dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:

starts_with("XY"): every name that starts with “XY”,
ends_with("XY"): every name that ends with “XY”,
contains("XY"): every name that contains “XY”,
matches("XY"): every name that matches “XY”, where “XY” can be a regular expression,
num_range("x", 1:15): the variables named x01, x02, x03,…, x15,
one_of(XY): every name that appears in x, where XY is a character vector.

visit.clean %>% select( id, ends_with("BP") )

## # A tibble: 13 x 3
##       id   DBP   SBP
##    <dbl> <dbl> <dbl>
##  1 11425    55    95
##  2 11425    60    90
##  3 10873   100   150
##  4 14562    70   100
##  5 19112    56    80
##  6 19112    60    85
##  7 19112    60    90
##  8 18567    80   120
##  9 14475    75   110
## 10 15940    65   115
## 11 15940    70   120
## 12 15940    NA   110
## 13 15940    70   110

Modifying existing and/or creating a new variables

visit.clean %>% mutate( Temperature = (Temperature-32)*5/9, stay = as.numeric(discharge - admission) )

## # A tibble: 13 x 10
##       id admission           discharge           Temperature   DBP   SBP
##    <dbl> <dttm>              <dttm>                    <dbl> <dbl> <dbl>
##  1 11425 2017-02-20 00:00:00 2017-02-22 00:00:00        38.6    55    95
##  2 11425 2018-03-25 00:00:00 2018-03-26 00:00:00        36.8    60    90
##  3 10873 2017-08-07 00:00:00 2017-08-11 00:00:00        37     100   150
##  4 14562 2017-12-17 00:00:00 2017-12-19 00:00:00        37.1    70   100
##  5 19112 2017-03-14 00:00:00 2017-03-19 00:00:00        36.8    56    80
##  6 19112 2018-07-03 00:00:00 2018-07-08 00:00:00        36.9    60    85
##  7 19112 2018-01-23 00:00:00 2018-01-27 00:00:00        39.6    60    90
##  8 18567 2017-11-09 00:00:00 2017-11-11 00:00:00        39.3    80   120
##  9 14475 2018-08-18 00:00:00 2018-08-19 00:00:00        37.1    75   110
## 10 15940 2018-02-05 00:00:00 2018-02-07 00:00:00        39.2    65   115
## 11 15940 2017-03-11 00:00:00 2017-03-13 00:00:00        39.7    70   120
## 12 15940 2017-10-21 00:00:00 2017-10-24 00:00:00        38.4    NA   110
## 13 15940 2018-11-03 00:00:00 2018-11-04 00:00:00        37      70   110
## # ... with 4 more variables: Allergies <chr>, pulse <dbl>, Symptoms <chr>,
## #   stay <dbl>

Examples

# Create a new column MAP which is equal to SBP/3 + 2*DBP/3
visit.clean %>% mutate( MAP=SBP/3 + 2*DBP/3 )

## # A tibble: 13 x 10
##       id admission           discharge           Temperature   DBP   SBP
##    <dbl> <dttm>              <dttm>                    <dbl> <dbl> <dbl>
##  1 11425 2017-02-20 00:00:00 2017-02-22 00:00:00       102.     55    95
##  2 11425 2018-03-25 00:00:00 2018-03-26 00:00:00        98.3    60    90
##  3 10873 2017-08-07 00:00:00 2017-08-11 00:00:00        98.6   100   150
##  4 14562 2017-12-17 00:00:00 2017-12-19 00:00:00        98.8    70   100
##  5 19112 2017-03-14 00:00:00 2017-03-19 00:00:00        98.2    56    80
##  6 19112 2018-07-03 00:00:00 2018-07-08 00:00:00        98.4    60    85
##  7 19112 2018-01-23 00:00:00 2018-01-27 00:00:00       103.     60    90
##  8 18567 2017-11-09 00:00:00 2017-11-11 00:00:00       103.     80   120
##  9 14475 2018-08-18 00:00:00 2018-08-19 00:00:00        98.8    75   110
## 10 15940 2018-02-05 00:00:00 2018-02-07 00:00:00       102.     65   115
## 11 15940 2017-03-11 00:00:00 2017-03-13 00:00:00       103.     70   120
## 12 15940 2017-10-21 00:00:00 2017-10-24 00:00:00       101.     NA   110
## 13 15940 2018-11-03 00:00:00 2018-11-04 00:00:00        98.6    70   110
## # ... with 4 more variables: Allergies <chr>, pulse <dbl>, Symptoms <chr>,
## #   MAP <dbl>

Let’s put it all together:
Use visit.clean dataframe as input and

- select only those columns where Temperature is greater than 99F
- select columns ID, DBP and SBP
- calculate new variable MAP

res <- visit.clean %>%
            filter( Temperature > 99 ) %>%
            select( id, DBP, SBP ) %>%
            mutate( MAP = SBP/3 + 2*DBP/3 ) 
 res

## # A tibble: 6 x 4
##      id   DBP   SBP   MAP
##   <dbl> <dbl> <dbl> <dbl>
## 1 11425    55    95  68.3
## 2 19112    60    90  70  
## 3 18567    80   120  93.3
## 4 15940    65   115  81.7
## 5 15940    70   120  86.7
## 6 15940    NA   110  NA

###  Calculating summaries

r visit.clean %>% summarise( N = n(), max.t = max(Temperature), min.T = min(Temperature) )

## # A tibble: 1 x 3 ## N max.t min.T ## <int> <dbl> <dbl> ## 1 13 103. 98.2

bilt-in functions often used within summarise():

averages: mean(), median()
spread: sd(), IQR(), mad()
range: min(), max()
count: n(), n_distinct()

Examples:

# Calculate the number of distinct patients
visit.clean %>% summarise( N = n_distinct(id) )

## # A tibble: 1 x 1
##       N
##   <int>
## 1     7

Group by one or more variables

visit.clean %>% group_by(id) %>% summarise( ave.pulse = mean(pulse) )

## # A tibble: 7 x 2
##      id ave.pulse
##   <dbl>     <dbl>
## 1 10873     105  
## 2 11425      95  
## 3 14475      80  
## 4 14562      70  
## 5 15940      79  
## 6 18567      75  
## 7 19112      71.7

Useful functions often used with group_by():

first()
last()
nth()

# For each patient select the first record and find the lenght of stay
visit.clean %>% group_by(id) %>% summarise( first(discharge - admission) )

## # A tibble: 7 x 2
##      id `first(discharge - admission)`
##   <dbl>                          <dbl>
## 1 10873                              4
## 2 11425                              2
## 3 14475                              1
## 4 14562                              2
## 5 15940                              2
## 6 18567                              2
## 7 19112                              5

Sorting dataframe by one or more variables

visit.clean %>% arrange(id, admission)

## # A tibble: 13 x 9
##       id admission           discharge           Temperature   DBP   SBP
##    <dbl> <dttm>              <dttm>                    <dbl> <dbl> <dbl>
##  1 10873 2017-08-07 00:00:00 2017-08-11 00:00:00        98.6   100   150
##  2 11425 2017-02-20 00:00:00 2017-02-22 00:00:00       102.     55    95
##  3 11425 2018-03-25 00:00:00 2018-03-26 00:00:00        98.3    60    90
##  4 14475 2018-08-18 00:00:00 2018-08-19 00:00:00        98.8    75   110
##  5 14562 2017-12-17 00:00:00 2017-12-19 00:00:00        98.8    70   100
##  6 15940 2017-03-11 00:00:00 2017-03-13 00:00:00       103.     70   120
##  7 15940 2017-10-21 00:00:00 2017-10-24 00:00:00       101.     NA   110
##  8 15940 2018-02-05 00:00:00 2018-02-07 00:00:00       102.     65   115
##  9 15940 2018-11-03 00:00:00 2018-11-04 00:00:00        98.6    70   110
## 10 18567 2017-11-09 00:00:00 2017-11-11 00:00:00       103.     80   120
## 11 19112 2017-03-14 00:00:00 2017-03-19 00:00:00        98.2    56    80
## 12 19112 2018-01-23 00:00:00 2018-01-27 00:00:00       103.     60    90
## 13 19112 2018-07-03 00:00:00 2018-07-08 00:00:00        98.4    60    85
## # ... with 3 more variables: Allergies <chr>, pulse <dbl>, Symptoms <chr>

Joining 2 dataframes

First let’s read the data for each patient:

pinfo <- read_excel("MedData.xlsx", sheet = "Patient Data")

#let's make sure the date is well formatted
pinfo %>%head()

## # A tibble: 6 x 4
##      ID DOB                 Gender Language
##   <dbl> <dttm>              <chr>  <chr>   
## 1 11425 1996-02-17 00:00:00 Male   English 
## 2 10873 1949-06-03 00:00:00 Female English 
## 3 14567 2002-09-16 00:00:00 Male   Spanish 
## 4 19112 1946-06-03 00:00:00 Female English 
## 5 18567 1957-04-09 00:00:00 Female French  
## 6 14475 1971-06-19 00:00:00 Male   English

pinfo %>%glimpse()

## Observations: 8
## Variables: 4
## $ ID       <dbl> 11425, 10873, 14567, 19112, 18567, 14475, 15940, 15786
## $ DOB      <dttm> 1996-02-17, 1949-06-03, 2002-09-16, 1946-06-03, 1957...
## $ Gender   <chr> "Male", "Female", "Male", "Female", "Female", "Male",...
## $ Language <chr> "English", "English", "Spanish", "English", "French",...

pinfo %>%summary()

##        ID             DOB                         Gender         
##  Min.   :10873   Min.   :1935-10-17 00:00:00   Length:8          
##  1st Qu.:13712   1st Qu.:1948-09-02 00:00:00   Class :character  
##  Median :15176   Median :1963-08-15 12:00:00   Mode  :character  
##  Mean   :15093   Mean   :1966-02-23 03:00:00                     
##  3rd Qu.:16597   3rd Qu.:1977-08-18 06:00:00                     
##  Max.   :19112   Max.   :2002-09-16 00:00:00                     
##    Language        
##  Length:8          
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Now we want to have a single dataframe that contains patient infomration and patient visit informtaion

There are a number of join functions in dplyr package:

inner_join()
left_join()
right_join()
full_join()
semi_join() - return all rows from x where there are matching values in y, keeping just columns from x.
anti_join() - return all rows from x where there are not matching values in y, keeping just columns from x.

#Let's try full join of both dataframes we have
full_join(visit.clean, pinfo, by = c("id"="ID"))

## # A tibble: 15 x 12
##       id admission           discharge           Temperature   DBP   SBP
##    <dbl> <dttm>              <dttm>                    <dbl> <dbl> <dbl>
##  1 11425 2017-02-20 00:00:00 2017-02-22 00:00:00       102.     55    95
##  2 11425 2018-03-25 00:00:00 2018-03-26 00:00:00        98.3    60    90
##  3 10873 2017-08-07 00:00:00 2017-08-11 00:00:00        98.6   100   150
##  4 14562 2017-12-17 00:00:00 2017-12-19 00:00:00        98.8    70   100
##  5 19112 2017-03-14 00:00:00 2017-03-19 00:00:00        98.2    56    80
##  6 19112 2018-07-03 00:00:00 2018-07-08 00:00:00        98.4    60    85
##  7 19112 2018-01-23 00:00:00 2018-01-27 00:00:00       103.     60    90
##  8 18567 2017-11-09 00:00:00 2017-11-11 00:00:00       103.     80   120
##  9 14475 2018-08-18 00:00:00 2018-08-19 00:00:00        98.8    75   110
## 10 15940 2018-02-05 00:00:00 2018-02-07 00:00:00       102.     65   115
## 11 15940 2017-03-11 00:00:00 2017-03-13 00:00:00       103.     70   120
## 12 15940 2017-10-21 00:00:00 2017-10-24 00:00:00       101.     NA   110
## 13 15940 2018-11-03 00:00:00 2018-11-04 00:00:00        98.6    70   110
## 14 14567 NA                  NA                         NA      NA    NA
## 15 15786 NA                  NA                         NA      NA    NA
## # ... with 6 more variables: Allergies <chr>, pulse <dbl>, Symptoms <chr>,
## #   DOB <dttm>, Gender <chr>, Language <chr>

# left join
result <- visit.clean  %>% left_join(pinfo, by = c("id"="ID"))
result

## # A tibble: 13 x 12
##       id admission           discharge           Temperature   DBP   SBP
##    <dbl> <dttm>              <dttm>                    <dbl> <dbl> <dbl>
##  1 11425 2017-02-20 00:00:00 2017-02-22 00:00:00       102.     55    95
##  2 11425 2018-03-25 00:00:00 2018-03-26 00:00:00        98.3    60    90
##  3 10873 2017-08-07 00:00:00 2017-08-11 00:00:00        98.6   100   150
##  4 14562 2017-12-17 00:00:00 2017-12-19 00:00:00        98.8    70   100
##  5 19112 2017-03-14 00:00:00 2017-03-19 00:00:00        98.2    56    80
##  6 19112 2018-07-03 00:00:00 2018-07-08 00:00:00        98.4    60    85
##  7 19112 2018-01-23 00:00:00 2018-01-27 00:00:00       103.     60    90
##  8 18567 2017-11-09 00:00:00 2017-11-11 00:00:00       103.     80   120
##  9 14475 2018-08-18 00:00:00 2018-08-19 00:00:00        98.8    75   110
## 10 15940 2018-02-05 00:00:00 2018-02-07 00:00:00       102.     65   115
## 11 15940 2017-03-11 00:00:00 2017-03-13 00:00:00       103.     70   120
## 12 15940 2017-10-21 00:00:00 2017-10-24 00:00:00       101.     NA   110
## 13 15940 2018-11-03 00:00:00 2018-11-04 00:00:00        98.6    70   110
## # ... with 6 more variables: Allergies <chr>, pulse <dbl>, Symptoms <chr>,
## #   DOB <dttm>, Gender <chr>, Language <chr>

# Calculate the mean stay of the hospital for Male and Female patients:
result %>%
  filter( !is.na(Gender) ) %>%            
  mutate( stay = as.numeric(discharge-admission) ) %>%     
  group_by ( Gender ) %>%         
  summarize( ave.stay = mean(stay) )

## # A tibble: 2 x 2
##   Gender ave.stay
##   <chr>     <dbl>
## 1 Female     4   
## 2 Male       1.71

Converting Wide dataframes to long dataframes and back

library(tidyr)
city.temps <- data.frame( time = as.Date('2018-09-03') + 0:4,
                          Boston = rnorm( 5, 75, 3),
                          Denver = rnorm( 5, 85, 5),
                          SanFrancisco = rnorm(5, 80, 5),
                          Austin = rnorm( 5, 90, 5) )
city.temps

##         time   Boston   Denver SanFrancisco   Austin
## 1 2018-09-03 73.19155 87.96200     78.22401 76.76680
## 2 2018-09-04 77.49030 86.29670     82.17868 91.40316
## 3 2018-09-05 73.30671 80.91177     79.60690 90.05191
## 4 2018-09-06 70.86246 86.70602     77.99101 83.83112
## 5 2018-09-07 76.45631 86.77545     82.94073 98.15083

# Use gather to rearrange a table into a tidy data.frame:
# gather(data, key = "key", value = "value", ..., na.rm = FALSE,
# convert = FALSE, factor_key = FALSE)
city.temps2 <- gather( city.temps, 
                       key = "City",
                       value = "Temperature",
                       -time,  # collection of the columns,
                       factor_key = TRUE) # if key variable needs to be converted to a factor
# or
city.temps2 <- gather( city.temps, 
                       key = "City",
                       value = "Temperature",
                       Boston: Austin,  # collection of the columns,
                       factor_key = TRUE) # if key variable needs to be converted to a factor
#or
city.temps2 <- gather( city.temps, 
                       key = "City",
                       value = "Temperature",
                       c("Boston","Denver","SanFrancisco","Austin"),  # collection of the columns,
                       factor_key = TRUE) # if key variable needs to be converted to a factor


city.temps2

##          time         City Temperature
## 1  2018-09-03       Boston    73.19155
## 2  2018-09-04       Boston    77.49030
## 3  2018-09-05       Boston    73.30671
## 4  2018-09-06       Boston    70.86246
## 5  2018-09-07       Boston    76.45631
## 6  2018-09-03       Denver    87.96200
## 7  2018-09-04       Denver    86.29670
## 8  2018-09-05       Denver    80.91177
## 9  2018-09-06       Denver    86.70602
## 10 2018-09-07       Denver    86.77545
## 11 2018-09-03 SanFrancisco    78.22401
## 12 2018-09-04 SanFrancisco    82.17868
## 13 2018-09-05 SanFrancisco    79.60690
## 14 2018-09-06 SanFrancisco    77.99101
## 15 2018-09-07 SanFrancisco    82.94073
## 16 2018-09-03       Austin    76.76680
## 17 2018-09-04       Austin    91.40316
## 18 2018-09-05       Austin    90.05191
## 19 2018-09-06       Austin    83.83112
## 20 2018-09-07       Austin    98.15083

glimpse(city.temps2)

## Observations: 20
## Variables: 3
## $ time        <date> 2018-09-03, 2018-09-04, 2018-09-05, 2018-09-06, 2...
## $ City        <fct> Boston, Boston, Boston, Boston, Boston, Denver, De...
## $ Temperature <dbl> 73.19155, 77.49030, 73.30671, 70.86246, 76.45631, ...

Sometimes it is useful to be able to perform the opposite operation: convert long format dataframe into wide representation

city.temps3 <- spread( city.temps2, City, Temperature)
city.temps3

##         time   Boston   Denver SanFrancisco   Austin
## 1 2018-09-03 73.19155 87.96200     78.22401 76.76680
## 2 2018-09-04 77.49030 86.29670     82.17868 91.40316
## 3 2018-09-05 73.30671 80.91177     79.60690 90.05191
## 4 2018-09-06 70.86246 86.70602     77.99101 83.83112
## 5 2018-09-07 76.45631 86.77545     82.94073 98.15083

Data Wrangling in R

Boston University

January 29, 2019

Data Science Workflow

Core tidyverse packages

Importing dataset from Excel worksheet

Pipe symbol `%>%`

Cleaning and analysing a dataset

Clean or Tidy Data:

Examples of messy datasets

dplyr package workflow

Renaming columns

We can select columns in a Data Frame using “$” symbol, i.e:

Working with strings

Working with dates

Filtering rows

Examples

Selecting Columns

Modifying existing and/or creating a new variables

Examples

Examples:

Group by one or more variables

Sorting dataframe by one or more variables

Joining 2 dataframes

Converting Wide dataframes to long dataframes and back

Data Wrangling in R

Boston University

January 29, 2019

Data Science Workflow

Core tidyverse packages

*** Other related packages hms** - time manipulation

Importing dataset from Excel worksheet

Pipe symbol %>%

Cleaning and analysing a dataset

Clean or Tidy Data:

Examples of messy datasets

dplyr package workflow

Renaming columns

We can select columns in a Data Frame using “$” symbol, i.e:

Working with strings

Working with dates

Filtering rows

Examples

Selecting Columns

Modifying existing and/or creating a new variables

Examples

Examples:

Group by one or more variables

Sorting dataframe by one or more variables

Joining 2 dataframes

Converting Wide dataframes to long dataframes and back

* Other related packages hms - time manipulation

Pipe symbol `%>%`