Python for Data Analysis

Biostatistics BS803

Research Computing Website: rcs.bu.edu
Class materials: http://rcs.bu.edu/examples/classes/bs803

Start interactive job and change to project directory

qrsh -P bs803
cd /projectnb/bs803/<your_user_name>

Copy this notebook to the current directory:

cp /project/scv/examples/classes/bs803/

To start a notebook run the following commands at the prompt:

module load python/3.6.2
jupyter notebook

To execute a cell: Shift+Enter

To clear all output cells in the notebook: Kernel-> Restart & Clear Output

To exit a notebook: File-> Close and Hault Then press Ctrl+C in terminal and answer "y" .

In [ ]:
#Import Python Libraries 
#    similar to library() function in R
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Python can read files with various formats:

pd.read_sas('myfile.sas7bdat')
pd.read_stata('myfile.dta')
pd.read_excel('myfile.xlsx',sheet_name='Sheet1', index_col=None, na_values=['NA'])
pd.read_hdf('myfile.h5','df')
In [ ]:
#Read csv file
df = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/Salaries.csv")
In [ ]:
#Display a few first records
df.head()
In [ ]:
#Display first 10 records
df.head(10)

Excersize

In [ ]:
#Read bs803.sas7bdat file into the object with the name fhs
In [ ]:
#Display first 20 records of bdf dataset
In [ ]:
#Display the last 10 records of bdf dataset

In [ ]:
# Identify the type of df object
type(df)
In [ ]:
# Information about the dataframe (similar to str() function in R)
df.info()
In [ ]:
# Check the type of a column "salary"
df['salary'].dtype
In [ ]:
# List the types of all columns
df.dtypes
In [ ]:
# List the column names (similar to names() function in R)
df.columns
In [ ]:
# Number of dimensions 
df.ndim
In [ ]:
# Total number of elements in the Data Frame
df.size
In [ ]:
# Number of rows and columns (similar to dim() function in R)
df.shape
In [ ]:
# Output basic statistics for the numeric columns 
# This is similar to summary() function in R and describe() function in R package psych 
df.describe()
In [ ]:
#Calculate mean for all numeric columns
df.mean()

Excersize

In [ ]:
#Use describe method to calculate the basic statistics for fhs dataset

Data slicing and grouping

In [ ]:
#Extract a column by name (method 1)
df['sex'].head()
In [ ]:
#Extract a column name (method 2)
df.sex.head()

Excersize

In [ ]:
#Calculate the basic statistics for the AGE column (used describe() method)
In [ ]:
#Calculate how many non-missing values in the HDLC column (use count() method)

In [ ]:
#Group data using rank
df_rank = df.groupby('rank')
In [ ]:
#Calculate mean of all numeric columns for the grouped object
df_rank.mean()
In [ ]:
df.groupby('rank').mean()
In [ ]:
#Calculate the mean salary for men and women. The following produce Pandas Series (single brackets around salary)
df.groupby('sex')['salary'].mean()
In [ ]:
# If we use double brackets Pandas will produce a DataFrame
df.groupby('sex')[['salary']].mean()
In [ ]:
# Group using 2 variables - sex and rank:
df.groupby(['sex','rank'], sort=False)[['salary']].mean()

Excersize

In [ ]:
# Group fhs data by the gender (use SEX column) and find the average AGE for each group
fhs.groupby('SEX')[['AGE']].mean()

Filtering

In [ ]:
#Select observation with the value in the salary column > 120K
df_sub = df[ df['salary'] > 120000]
df_sub.head()
In [ ]:
df['salary'] > 120000
In [ ]:
#Select data for female professors
df_w = df[ df['sex'] == 'Female']
df_w.head()

Excersize

In [ ]:
# Using filtering, calculate mean systolic blood pressure for patients older than 50 years old

More on slicing the dataset

In [ ]:
#Select a subset of rows (based on their position):
# Note 1: The location of the first row is 0
# Note 2: The last value in the range is not included
df[0:10]
In [ ]:
#If we want to select both rows and columns we can use method .loc
df.loc[10:20,['rank', 'sex','salary']]
In [ ]:
#Let's see what we get for our df_sub data frame
# Method .loc subset the data frame based on the labels:
df_sub.loc[10:20,['rank','sex','salary']]
In [ ]:
#  Unlike method .loc, method iloc selects rows (and columns) by poistion:
df_sub.iloc[10:20, [0,3,4,5]]

Sorting the Data

In [ ]:
#Sort the data frame by yrs.service and create a new data frame
df_sorted = df.sort_values(by = 'service')
df_sorted.head()
In [ ]:
#Sort the data frame by yrs.service and overwrite the original dataset
df.sort_values(by = 'service', ascending = False, inplace = True)
df.head()
In [ ]:
# Restore the original order (by sorting using index)
df.sort_index(axis=0, ascending = True, inplace = True)
df.head()

Excersize

In [ ]:
# Sort fhs data frame by the AGE (in descending order) and display the first 10 records of the output (head)

In [ ]:
#Sort the data frame using 2 or more columns:
df_sorted = df.sort_values(by = ['service', 'salary'], ascending = [True,False])
df_sorted.head(10)

Missing Values

In [ ]:
# Read a dataset with missing values
flights = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/flights.csv")
flights.head()
In [ ]:
flights.describe()
In [ ]:
# Select the rows that have at least one missing value
flights[flights.isnull().any(axis=1)].head()
In [ ]:
# Filter all the rows where arr_delay value is missing:
flights1 = flights[ flights['arr_delay'].notnull( )]
flights1.head()
In [ ]:
# Remove all the observations with missing values
flights2 = flights.dropna()
In [ ]:
# Fill missing values with zeros
nomiss =flights['dep_delay'].fillna(0)
nomiss.isnull().any()

Basic descriptive statistics

Function Description
min minimum
max maximum
mean arithmetic mean of values
median median
mad mean absolute deviation
mode mode
std standard deviation
var unbiased variance
sem standard error of the mean
skew sample skewness
kurt kurtosis
quantile value at %
In [ ]:
# Convinient describe() function computes a veriety of statistics
flights.dep_delay.describe()
In [ ]:
# find the index of the maximum or minimum value
# if there are multiple values matching idxmin() and idxmax() will return the first match
flights['dep_delay'].idxmin()  #minimum value
In [ ]:
# Count the number of records for each different value in a vector
flights['carrier'].value_counts()

Explore data using graphics

In [ ]:
#Show graphs withint Python notebook
%matplotlib inline
In [ ]:
#Use seaborn package to draw a histogram
sns.distplot(df['salary']);
In [ ]:
# Use seaborn package to display a barplot
sns.set_style("whitegrid")

ax = sns.barplot(x='rank',y ='salary', data=df, estimator=len)
In [ ]:
# Split into 2 groups:
ax = sns.barplot(x='rank',y ='salary', hue='sex', data=df, estimator=len)
In [ ]:
#Violinplot
sns.violinplot(x = "salary", data=df)
In [ ]:
#Scatterplot in seaborn
sns.jointplot(x='service', y='salary', data=df)
In [ ]:
#If we are interested in linear regression plot for 2 numeric variables we can use regplot
sns.regplot(x='service', y='salary', data=df)
In [ ]:
# box plot
sns.boxplot(x='rank',y='salary', data=df)
In [ ]:
# side-by-side box plot
sns.boxplot(x='rank',y='salary', data=df, hue='sex')
In [ ]:
# swarm plot
sns.swarmplot(x='rank',y='salary', data=df)
In [ ]:
#factorplot
sns.factorplot(x='carrier',y='dep_delay', data=flights, kind='bar')
In [ ]:
# Pairplot 
sns.pairplot(df)

Excersize

In [ ]:
#Using seaborn package plot a scatterplot of dependency  Systolic Blood Pressure and Age

Basic statistical Analysis

Linear Regression

In [ ]:
# Import Statsmodel functions:
import statsmodels.formula.api as smf
In [ ]:
# create a fitted model
lm = smf.ols(formula='salary ~ service', data=df).fit()

#print model summary
print(lm.summary())
In [ ]:
# print the coefficients
lm.params

Excersize

In [ ]:
# Build a linear model for SYSBP vs AGE and SEX


#print model summary

Student T-test

In [ ]:
# Using scipy package:
from scipy import stats
df_w = df[ df['sex'] == 'Female']['salary']
df_m = df[ df['sex'] == 'Male']['salary']
stats.ttest_ind(df_w, df_m)   

Excersize

In [ ]:
# compare mean of Systolic Blood Pressure for Female and Male patients in fhs dataset

Programming structures in Python

If then statements

In [ ]:
if( sum(flights['carrier'] == "DL")  > sum(flights['carrier'] == "UA") ):
    print("There are more flights by Delta Airlines")
else:
    print("There are more flights by United Airlines")

Excersize

Using if statement calculate the average value of Systolic Blood Pressure and if it is greater than 140 print "Hypertensive sample". Otherwise print "Non-hypertensive sample"

In [ ]:
# Solution:

Loops

In [ ]:
for i in range (0, 5):
    print (flights.loc[i,'carrier'])

Excersize

In [ ]:
# Using Python loop print first 5 values of column Age in fhs dataset
In [ ]: