{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "vqv_wAtk0ibM"
},
"source": [
"# Python for Data Analysis\n",
"\n",
"### Research Computing Services\n",
"\n",
"Instructor: Scott Ladenheim\n",
"\n",
"Website: [rcs.bu.edu](http://www.bu.edu/tech/support/research/)\n",
"\n",
"Tutorial materials: [http://rcs.bu.edu/examples/python/DataAnalysis](http://rcs.bu.edu/examples/python/DataAnalysis)\n",
"\n",
"Contact us: help@scc.bu.edu"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "M3FbA7um0ibR"
},
"source": [
"## Course Content\n",
"1. Python packages for data scientists\n",
"2. Data manipulation with Pandas\n",
"3. Basic data plotting\n",
"4. Descriptive statistics\n",
"5. Inferential statistics\n",
"\n",
"\n",
"## Tutorial Evaluation\n",
"After the tutorial please submit an evaluation by clicking on this link [Tutorial Evaluation](http://scv.bu.edu/survey/tutorial_evaluation.html)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Python packages for data scientists\n",
"* [NumPy](https://numpy.org)\n",
" - Introduces objects for handling n-dimensional arrays such as vectors (1-D) and matrices (2-D)\n",
" - Introduces functions to perform advanced mathematical and statistical operations on these objects\n",
" - Provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance\n",
" - Many other Python libraries are built on NumPy\n",
"* [SciPy](https://scipy.org)\n",
" - Collection of algorithms for statistics, linear algebra, optimization, differential equations, numerical integration, and more\n",
" - Part of the SciPy stack\n",
" - Built on NumPy\n",
"* [Pandas](https://pandas.pydata.org)\n",
" - Adds data structures and tools designed to work with table-like data (similar to Vectors and Data Frames in R)\n",
" - Provides tools for data manipulation: *reshaping*, *merging*, *sorting*, *slicing*, *aggregation*, etc.\n",
" - Makes it easy to handle missing data\n",
"* [SciKit-Learn](https://scikit-learn.org/stable/)\n",
" - Provides machine learning algorithms: classification, regression, clustering, model validation, etc.\n",
" - Built on NumPy, SciPy, and matplotlib\n",
"\n",
"* Many others\n",
" - [Keras](https://keras.io/)\n",
" - [Pytorch](https://pytorch.org/)\n",
" - [Tensorflow](https://www.tensorflow.org/)\n",
" \n",
"### Visualization\n",
"More in-depth look at visualization in the `Data Visualization in Python` course.\n",
"* [matplotlib](https://matplotlib.org/)\n",
" - Python 2-D plotting library that produces publication-quality figures in a variety of hardcopy formats\n",
" - Functionality similar to MATLAB\n",
" - Line plots, scatter plots, bar charts, histograms, pie charts, etc.\n",
" - Effort needed to create advanced visualizations\n",
"* [seaborn](https://seaborn.pydata.org/)\n",
" - Based on matplotlib\n",
" - Provides a high-level interface for drawing attractive statistical graphs\n",
" - Similar to the ggplot2 library in R\n",
"* [plotly](https://plotly.com/python/)\n",
" - Over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use cases\n",
" - Built on top of the Plotly JavaScript library\n",
" - Can create beautiful interactive web-based visualizations "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading Python libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Ea4C8rUR0ibU"
},
"outputs": [],
"source": [
"# Press shift-enter to execute a Jupyter notebook cell\n",
"# Import the Python Libraries used in the tutorial\n",
"import numpy as np\n",
"import scipy as sp\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Enable inline plotting\n",
"%matplotlib inline"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "ca3M2mv_0ibX"
},
"source": [
"## Pandas\n",
"The main focus of this tutorial is using the Pandas library to manipulate and analyze data.\n",
"\n",
"Pandas is a Python package that provides the following data structures:\n",
"- **Series** (1-D homogeneous array)\n",
"- **DataFrame** (2-D labeled heterogeneous array) \n",
"- **MultiIndex** (for hierarchical data)\n",
"- **Xarray** (a separate package, inspired by Pandas, for labeled n-D arrays)\n",
"\n",
"The Pandas content of this tutorial will cover:\n",
"* Creating and understanding Series and DataFrames\n",
"* Importing/Reading data\n",
"* Data selection and filtering\n",
"* Data manipulation via sorting, grouping, and rearranging\n",
"* Handling missing data\n",
"\n",
"\n",
"In addition, we will also cover:\n",
"* Basic data plotting\n",
"* Descriptive statistics (time permitting)\n",
"* Inferential statistics (time permitting)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "vYNDrGEL0ibY"
},
"source": [
"### Pandas Series"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "hQpjwDv_0ibZ"
},
"source": [
"A Pandas *Series* is a 1-dimensional labeled array containing data of the same type (integers, strings, floating point numbers, Python objects, etc.). It is a generalized NumPy array with an explicit axis called the *index*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iPJvEWlG0ibZ"
},
"outputs": [],
"source": [
"# Create a Pandas Series from a list of integers\n",
"s1 = pd.Series([-3, -1, 1, 3, 5])\n",
"print(s1)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "YPctLCcw0ibb"
},
"source": [
"We did not pass an index, so by default Pandas assigned integer indices ranging from 0 to len(data)-1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "v4vAbdqv0ibb"
},
"outputs": [],
"source": [
"# View index values\n",
"print(s1.index)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "s3I3__b90ibi"
},
"outputs": [],
"source": [
"s1[:2] # First 2 elements"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4jsIOlUR0ibi"
},
"outputs": [],
"source": [
"print(s1[[2,1,0]]) # Elements out of order"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qGn-aAvx2t9V"
},
"outputs": [],
"source": [
"type(s1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "WIft0v6r0ibl"
},
"outputs": [],
"source": [
"# Can place filtering conditions on series\n",
"s1[s1 > 0]"
]
},
{
"cell_type": "code",
"execution_count": 160,
"metadata": {
"id": "ZmlDR_tj0ibd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a 0.745068\n",
"b -2.273686\n",
"c 0.595940\n",
"d 1.315555\n",
"e -1.060865\n",
"dtype: float64\n"
]
}
],
"source": [
"# Creating Pandas series with index:\n",
"rng = np.random.default_rng()\n",
"s2 = pd.Series(rng.normal(size=5), index=['a', 'b', 'c', 'd', 'e'])\n",
"print(s2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "srVyZ9ik0ibe"
},
"outputs": [],
"source": [
"# View index values\n",
"print(s2.index)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vX-IUZb30ibf"
},
"outputs": [],
"source": [
"# Create a Series from dictionary\n",
"data = {'pi': 3.14159, 'e': 2.71828} # dictionary\n",
"print(data)\n",
"s3 = pd.Series(data)\n",
"print(s3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sOHbkXC10ibg"
},
"outputs": [],
"source": [
"# Create a new series from a dictionary and reorder the elements\n",
"s4 = pd.Series(data, index=['e', 'pi', 'tau'])\n",
"print(s4)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "DeofA8nZ0ibi"
},
"source": [
"NaN (Not a Number) is used to represent a missing value in Pandas."
]
},
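{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (using the series s4 created above), the isna() method flags the missing entries and fillna() replaces them with a chosen value:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Detect the missing value introduced by the 'tau' index label\n",
"print(s4.isna())\n",
"\n",
"# Replace missing values with a chosen default (here 0)\n",
"print(s4.fillna(0))"
]
},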
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "skyr2siA0ibj"
},
"outputs": [],
"source": [
"# Series can be treated as a 1-D array and you can apply functions to them:\n",
"print(\"Median:\", s4.median())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "42_G7Ks-0ibl"
},
"outputs": [],
"source": [
"# Methods can be used to filter series:\n",
"s4[s4 > s4.median()]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "3f4HD5E90ibn"
},
"source": [
"### Attributes and Methods:\n",
"An attribute is a variable stored in the object, e.g., index or size with Series.\n",
"A method is a function stored in the object, e.g., head() or median() with Series.\n",
"\n",
"| Attribute/Method | Description |\n",
"|-----|-----|\n",
"| dtype | data type of values in series |\n",
"| empty | True if series is empty |\n",
"| size | number of elements |\n",
"| values | Returns values as ndarray |\n",
"| head() | First n elements |\n",
"| tail() | Last n elements |\n",
"\n",
"Execute *dir(s1)* to see all attributes and methods. \n",
"\n",
"I also recommend the online documentation, which is easier to read and includes examples.\n",
"\n"
]
},
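{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch, a few of the attributes and methods from the table applied to the series s1 created earlier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A few attributes and methods from the table above, applied to s1\n",
"print(s1.dtype) # data type of the elements\n",
"print(s1.size) # number of elements\n",
"print(s1.values) # values as a NumPy ndarray\n",
"print(s1.head(3)) # first 3 elements"
]
},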
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# For more information on a particular method or attribute use the help() function\n",
"help(s4.head)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"help(s4.index)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You can also add a question mark to get help information\n",
"s4.head?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s4.index?"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"One final way to get help is to press shift-tab when you are in the parentheses of a method or after an attribute. Try this in the exercise below."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "FXjjnBOj0ibn"
},
"source": [
"### Exercise - Create your own Series"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RUmWlc790ibn"
},
"outputs": [],
"source": [
"# Create a series with 10 elements containing both positive and negative integers\n",
"# Examine the series with the head() method\n",
"# Create a new series from the originally created series with only negative numbers\n",
"# \n",
"# mys = pd.Series( ... )"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "PQprwXx40ibo"
},
"source": [
"### Pandas DataFrames"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "ZYRivqyM0ibo"
},
"source": [
"A Pandas *DataFrame* is a 2-dimensional, size-mutable, heterogeneous tabular data structure with labeled rows and columns. You can think of it as a dictionary-like container to store Python Series objects."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vWXR_GKo0ibo"
},
"outputs": [],
"source": [
"d = pd.DataFrame({'Name': pd.Series(['Alice', 'Bob', 'Chris']), \n",
" 'Age': pd.Series([21, 25, 23])})\n",
"print(d)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xt6NqHhS0ibo"
},
"outputs": [],
"source": [
"# The array is transposed (.T) so the names and ages form the columns;\n",
"# note that NumPy coerces the mixed types, so both columns contain strings\n",
"d2 = pd.DataFrame(np.array([['Alice','Bob','Chris'], [21, 25, 23]]).T, columns=['Name','Age'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cZXlWYmn0ibq"
},
"outputs": [],
"source": [
"# Use the head() method to print the first 5 records in the dataframe (same as with series)\n",
"d2.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NHte5SMZ0ibq"
},
"outputs": [],
"source": [
"# Add a new column to d2:\n",
"d2['Height'] = pd.Series([5.2, 6.0, 5.6])\n",
"d2.head()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading data using Pandas\n",
"You can read CSV (comma separated values) files using Pandas. The command shown below reads a CSV file into the Pandas dataframe df."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Hfwv2wzd0ibq"
},
"outputs": [],
"source": [
"# Read a csv file into Pandas Dataframe\n",
"df = pd.read_csv(\"http://rcs.bu.edu/examples/python/DataAnalysis/Salaries.csv\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "2JaiSOAXnKup"
},
"source": [
"The above command has many optional arguments that you can find in the Pandas documentation online.\n",
"\n",
"You can also read many other formats, for instance:\n",
"* Excel - `pd.read_excel('myfile.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])`\n",
"* Stata - `pd.read_stata('myfile.dta')`\n",
"* SAS - `pd.read_sas('myfile.sas7bdat')`\n",
"* HDF - `pd.read_hdf('myfile.h5', 'df')`"
]
},
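{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration, read_csv accepts optional arguments such as sep, header, and na_values. The values shown below are simply the defaults for this file, so the result is the same as before:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative optional arguments to read_csv (these are the defaults for this file)\n",
"df_check = pd.read_csv(\"http://rcs.bu.edu/examples/python/DataAnalysis/Salaries.csv\",\n",
"                       sep=',',          # column separator\n",
"                       header=0,         # row to use for the column names\n",
"                       na_values=['NA']) # strings to treat as missing values\n",
"df_check.head()"
]
},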
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "2JaiSOAXnKup"
},
"source": [
"Before we can perform any analysis on the data we need to\n",
"\n",
"\n",
"* Check if the data is correctly imported \n",
"* Check the types of each column\n",
"* Determine how many missing values each column has\n",
"\n",
"We can then carefully prepare the data:\n",
"\n",
"* Remove columns that are not needed in our analysis\n",
"* Rename the columns (if necessary)\n",
"* Possibly rearrange the columns to make it easier to work with them\n",
"* Create new or modify existing columns (e.g., convert into different units) if necessary"
]
},
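{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, the missing-value check from the list above can be done by chaining the isnull() and sum() methods:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count the missing values in each column\n",
"df.isnull().sum()"
]
},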
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "86icwTdZ0ibr"
},
"outputs": [],
"source": [
"# Display the first 10 records\n",
"df.head(10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "z-lqfY0MiwGP"
},
"outputs": [],
"source": [
"# Display structure of the data frame\n",
"df.info()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### More details on DataFrame data types\n",
"\n",
"|Pandas Type | Native Python Type | Description |\n",
"|------------|--------------------|-------------|\n",
"| object | string | The most general dtype. Will be assigned to your column if column has mixed types (numbers and strings).|\n",
"| int64 | int | Numeric characters. 64 refers to the memory allocated to hold this character. |\n",
"| float64 | float | Numeric characters with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal. |\n",
"| datetime64, timedelta\\[ns\\]| N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments. |\n",
"\n",
"\n",
"### DataFrame attributes\n",
"|df.attribute | Description |\n",
"|-------------|-------------|\n",
"| dtypes | list the types of the columns |\n",
"| columns | list the column names |\n",
"| axes | list the row labels and column names |\n",
"| ndim | number of dimensions |\n",
"| size | number of elements |\n",
"| shape | return a tuple representing the dimensionality |\n",
"| values | numpy representation of the data |\n",
"\n",
"### Dataframe methods\n",
"|df.method() | Description |\n",
"|-------------|-------------|\n",
"| head(\\[n\\]), tail(\\[n\\]) | first/last n rows |\n",
"| describe() | generate descriptive statistics (for numeric columns only) |\n",
"| max(), min() | return max/min values for all numeric columns |\n",
"| mean(), median() | return mean/median values for all numeric columns |\n",
"| std() | standard deviation |\n",
"| sample(\\[n\\]) | returns a random sample of n elements from the data frame |\n",
"| dropna() | drop all the records with missing values |"
]
},
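{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of one method from the table, sample() draws a random set of rows, which is a handy check that the data looks sensible throughout the file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Draw a random sample of 5 rows from the data frame\n",
"df.sample(5)"
]
},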
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "v_Gm0ubUmu95"
},
"source": [
"Sometimes the column names in the input file are too long or contain special characters. In such cases we rename them to make it easier to work with these columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5PTMYO9ok8te"
},
"outputs": [],
"source": [
"# Let's create a copy of this data frame with new column names\n",
"# To rename the columns in place instead, add the inplace=True argument\n",
"df_new = df.rename(columns={'sex': 'gender', 'phd': 'yearsAfterPhD', 'service': 'yearsOfService'})\n",
"df_new.head()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "tj85w0PMttNG"
},
"source": [
"### DataFrame Exploration"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pbDGiE0J0ibt"
},
"outputs": [],
"source": [
"# Identify the type of df_new object\n",
"type(df_new)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pybC4toq0ibt"
},
"outputs": [],
"source": [
"# Check the data type of the column \"salary\"\n",
"# We access columns using the brackets, e.g., df['column_name']\n",
"df_new['salary'].dtype"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If the column name has no spaces or special symbols and does not clash with an existing attribute/method name,\n",
"# you can use the syntax df.column_name\n",
"df_new.salary.dtype"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1T1wR6qq0ibu"
},
"outputs": [],
"source": [
"# List the types of all columns\n",
"df_new.dtypes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YFXUSewc0ibu"
},
"outputs": [],
"source": [
"# List the column names\n",
"df_new.columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_Cu0DFOj0ibv"
},
"outputs": [],
"source": [
"# List the row labels and the column names\n",
"df_new.axes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "99jPywmg0ibw"
},
"outputs": [],
"source": [
"# Number of rows and columns\n",
"df_new.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "utYUKYjN0ibv"
},
"outputs": [],
"source": [
"# Total number of elements in the Data Frame (78 x 6)\n",
"df_new.size"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "jsjjlva70ibw"
},
"outputs": [],
"source": [
"# Output some descriptive statistics for the numeric columns\n",
"df_new.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZFxc7P8w9mJc"
},
"outputs": [],
"source": [
"# Remember we can use the ? to get help about the method\n",
"df_new.describe?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XHWfkQX8roey"
},
"outputs": [],
"source": [
"# Create a new column using the assign method\n",
"df_new = df_new.assign(salary_k=lambda x: x.salary/1000.0)\n",
"df_new.head(10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vsyWDUkww3m3"
},
"outputs": [],
"source": [
"# Check how many unique values are in a column\n",
"# DataFrame has a rank() method, so we access this column with brackets, df_new['rank'], rather than dot notation\n",
"df_new['rank'].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pESwRxLCt5oO"
},
"outputs": [],
"source": [
"# Get the frequency table for a categorical or binary column\n",
"df_new['rank'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pBF1lOcY0FLH"
},
"outputs": [],
"source": [
"# Get a proportion table\n",
"df_new['rank'].value_counts()/sum(df_new['rank'].value_counts())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ITSJWChquiYI"
},
"outputs": [],
"source": [
"# Alternatively we can use the pandas function crosstab() to calculate a frequency table\n",
"pd.crosstab(index=df_new['rank'], columns=\"count\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "faCHhLX-vMtw"
},
"outputs": [],
"source": [
"# Two-way tables\n",
"pd.crosstab(index=df_new['rank'], columns=df_new['discipline'], margins=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"id": "KzvZFygO0ibz"
},
"source": [
"### Data slicing and grouping"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LE0RknOW0ib0"
},
"outputs": [],
"source": [
"#Extract a column by name \n",
"df_new['gender'].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RdMYTbNlCrzE"
},
"outputs": [],
"source": [
"# If the column name does not contain spaces or other special characters and does not collide with data frame methods, we can use dot notation\n",
"df_new.gender.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6JrIilpE08Be"
},
"outputs": [],
"source": [
"# Calculate median number of service years\n",
"df_new.yearsOfService.median()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "3gcURpQh0ib1"
},
"source": [
"### Exercise - Working with a single column"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LXwRAzfQ0ib2"
},
"outputs": [],
"source": [
"# Calculate the descriptive statistics for only the salary column in df_new\n",
"# "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wk7Er3ma0ib2"
},
"outputs": [],
"source": [
"# Get a count for each of the values in the salary_k column in df_new\n",
"# "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "08H11QC80ib2"
},
"outputs": [],
"source": [
"# Calculate the average (mean) salary_k in df_new\n",
"# "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "OR761ERr0ib2"
},
"source": [
"### Grouping data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HxIkM-uT0ib3"
},
"outputs": [],
"source": [
"# Group data using rank\n",
"df_rank = df_new.groupby('rank')\n",
"df_rank.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dX45khRv0ib3"
},
"outputs": [],
"source": [
"# Calculate the mean of the numeric columns for the grouped object\n",
"df_rank.mean(numeric_only=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wrLlxmmP0ib3"
},
"outputs": [],
"source": [
"# Most of the time the \"grouping\" object is not stored; it is used as a step in computing a summary:\n",
"df_new.groupby('gender').mean(numeric_only=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "txqWe6sM0ib4"
},
"outputs": [],
"source": [
"# Calculate the mean salary for men and women. Single brackets around salary produce a Pandas Series\n",
"df_new.groupby('gender')['salary'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "TDRTnWXo0ib4"
},
"outputs": [],
"source": [
"# If we use double brackets Pandas will produce a DataFrame\n",
"df_new.groupby('gender')[['salary']].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "nUMv2CM50ib5"
},
"outputs": [],
"source": [
"# Group using 2 variables - gender and rank:\n",
"df_new.groupby(['rank','gender'], sort=True)[['salary']].mean()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "n_QMiWbN0ib5"
},
"source": [
"### Exercise - Grouping data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bK_eeYF70ib5"
},
"outputs": [],
"source": [
"# Group data by the rank and discipline and find the average yearsOfService and salary_k for each group. \n",
"# "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "5M2fqDoL0ib6"
},
"source": [
"### Filtering"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bNCJvz590ib6"
},
"outputs": [],
"source": [
"# Select observation with the value in the salary column > 120K\n",
"df_filter = df_new[df_new.salary > 120000]\n",
"df_filter.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Oj51sat40ib6"
},
"outputs": [],
"source": [
"df_filter.axes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xGZSM5EH0ib7"
},
"outputs": [],
"source": [
"# Select data for female professors\n",
"df_w = df_new[df_new.gender == 'Female']\n",
"df_w.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qNkmHdkP6hsN"
},
"outputs": [],
"source": [
"# To subset one column using a condition on another column, use the where method\n",
"df_new.salary.where(df_new.gender=='Female').dropna().head(6)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "iUOMOcM60ib8"
},
"source": [
"### Exercise - Filtering data "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0fCg9pRv0ib8"
},
"outputs": [],
"source": [
"# Using filtering, find the mean value of the salary for the discipline A\n",
"# "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DUHR3SlV0ib8"
},
"outputs": [],
"source": [
"# Challenge:\n",
"# Determine how many female and male professors earned more than 100K\n",
"# "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "6Fir_77Y0ib8"
},
"source": [
"### Slicing a dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W4v86rQi0ib8",
"scrolled": true
},
"outputs": [],
"source": [
"# Select column salary\n",
"salary = df_new['salary']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sugc9L480ib-"
},
"outputs": [],
"source": [
"# Check data type of the result\n",
"type(salary)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XuISZ50f0ib-"
},
"outputs": [],
"source": [
"# Look at the first few elements of the output\n",
"salary.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CKQS3knF0ib_"
},
"outputs": [],
"source": [
"# Select column salary and make the output to be a data frame\n",
"df_salary = df_new[['salary']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gMPqOKc20ib_"
},
"outputs": [],
"source": [
"# Check the type\n",
"type(df_salary)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4HBqgWTi0ib_"
},
"outputs": [],
"source": [
"# Select a subset of rows (based on their position):\n",
"# Note 1: The location of the first row is 0\n",
"# Note 2: The last value in the range is not included\n",
"df_new[0:10]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "4BEvyTpr0ib_"
},
"outputs": [],
"source": [
"# If we want to select both rows and columns we can use method .loc\n",
"df_new.loc[10:20, ['rank', 'gender','salary']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fpFEVLMJ0icA"
},
"outputs": [],
"source": [
"# Recall our filtered dataset with salaries over 120K\n",
"df_filter.head(25)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "k28_O2L70icA"
},
"outputs": [],
"source": [
"# Let's see what we get for our df_filter data frame\n",
"# Method .loc subsets the data frame based on the index values:\n",
"# loc = location\n",
"df_filter.loc[10:20,['rank','gender','salary']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Xg9Tuorn0icA"
},
"outputs": [],
"source": [
"# Unlike method .loc, method iloc selects rows (and columns) by absolute position:\n",
"# iloc = integer location\n",
"df_filter.iloc[10:20, [0,3,4,5]]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise - Slicing a dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a new dataframe where you filter out salaries below 100K from df_new\n",
"# Call this data frame df_sub100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract rows 5:10 and columns ['yearsOfService', 'salary_k'] of df_sub100 using the .loc method\n",
"# How many rows are in the output?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract rows 5:10 and columns ['yearsOfService', 'salary_k'] from df_sub100 using the iloc method\n",
"# What are the values of the indices in the output?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract rows with index values [6, 12, 20, 22] and columns ['yearsOfService','salary_k'] from df_sub100\n",
"# Hint: Use the loc method"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "vzxWbXKF0icA"
},
"source": [
"### Sorting the Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OXybOgop0icB"
},
"outputs": [],
"source": [
"# Sort the data frame df_new by service and create a new data frame\n",
"df_sorted = df_new.sort_values(by = 'yearsOfService')\n",
"df_sorted.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "85UG8-JZ0icC"
},
"outputs": [],
"source": [
"# Sort the data frame df_new by yearsOfService and overwrite the original dataset\n",
"df_new.sort_values(by = 'yearsOfService', ascending = False, inplace = True)\n",
"df_new.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mcrnORWR0icC"
},
"outputs": [],
"source": [
"# Restore the original order using the sort_index method\n",
"df_new.sort_index(axis=0, ascending = True, inplace = True)\n",
"df_new.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "52sGV0z_0icD"
},
"outputs": [],
"source": [
"# Sort the data frame using 2 or more columns:\n",
"df_sorted2 = df_new.sort_values(by = ['yearsOfService', 'salary'], ascending = [True,False])\n",
"df_sorted2.head(15)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "aBgfkBkK0icC"
},
"source": [
"### Exercise - Sorting "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Kh0qVebH0icD"
},
"outputs": [],
"source": [
"# Sort the data frame df_new by the salary (in descending order)\n",
"# Store the output of this sorting in a dataframe called df_desc\n",
"# Display the first 10 records of the output\n",
"# "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "XZjjbRia0icD"
},
"source": [
"### Missing Values\n",
"To discuss how to handle missing values we will import the flights data set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bZCCJvtD0icD"
},
"outputs": [],
"source": [
"# Read a dataset with missing values\n",
"flights = pd.read_csv(\"http://rcs.bu.edu/examples/python/DataAnalysis/flights.csv\")\n",
"flights.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "PcLvDYfOjNZv"
},
"outputs": [],
"source": [
"flights.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IPfS620V0icE",
"scrolled": true
},
"outputs": [],
"source": [
"# Select the rows that have at least one missing value\n",
"flights[flights.isnull().any(axis=1)].head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "AUI-q-wr0icE"
},
"outputs": [],
"source": [
"# Filter all the rows where arr_delay value is missing:\n",
"flights1 = flights[flights['arr_delay'].notnull( )]\n",
"flights1.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "oem6cDo90icF"
},
"outputs": [],
"source": [
"# Remove all the observations with missing values\n",
"flights2 = flights.dropna()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NoCw8dmx0icH"
},
"outputs": [],
"source": [
"# Fill missing values with zeros\n",
"nomiss =flights['dep_delay'].fillna(0)\n",
"nomiss.isnull().any()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "d-Q3P5SO0icJ"
},
"source": [
"### Exercise Count missing data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "uAfRQCPs0icL"
},
"outputs": [],
"source": [
"# Count how many missing pieces of data there are in the dep_delay and arr_delay columns\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "JGOFyTVm0icN"
},
"source": [
"### Common Aggregation Functions:\n",
"\n",
"The following functions are commonly used functions to aggregate data.\n",
"\n",
"|Function|Description\n",
"|-------|--------\n",
"|min | minimum\n",
"|max | maximum\n",
"|count | number of non-null observations\n",
"|sum | sum of values\n",
"|mean | arithmetic mean of values\n",
"|median | median\n",
"|mad | mean absolute deviation\n",
"|mode | mode\n",
"|prod | product of values\n",
"|std | standard deviation\n",
"|var | unbiased variance\n",
"\n"
]
},
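{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustrative sketch (on a small, made-up DataFrame rather than the tutorial data), the functions above can be applied like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch on a tiny, made-up DataFrame (not the flights data)\n",
"demo = pd.DataFrame({'delay': [5, 10, None, 25]})\n",
"print(demo['delay'].count())   # 3 -- counts non-null observations only\n",
"print(demo['delay'].sum())     # 40.0\n",
"print(demo['delay'].mean())    # 13.33...\n",
"print(demo['delay'].median())  # 10.0"
]
},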
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "q5SgaHvG0icO"
},
"outputs": [],
"source": [
"# Find the number of non-missing values in each column\n",
"flights.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gmZ5Dekd9N6g"
},
"outputs": [],
"source": [
"flights.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "90ZAiJWb0icP"
},
"outputs": [],
"source": [
"# Find mean value for all the columns in the dataset\n",
"flights.min()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZaZ1Uaks0icP"
},
"outputs": [],
"source": [
"# Let's compute summary statistic per a group':\n",
"flights.groupby('carrier')['dep_delay'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HrhCXQBi0icP"
},
"outputs": [],
"source": [
"# We can use agg() methods for aggregation:\n",
"flights[['dep_delay','arr_delay']].agg(['min','mean','max'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "AVsIj0La0icQ"
},
"outputs": [],
"source": [
"# An example of computing different statistics for different columns\n",
"flights.agg({'dep_delay':['min','mean',max], 'carrier':['nunique']})"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "sPGu6w9j0icR"
},
"source": [
"## Exploring data using graphics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "b5gSUWkR0icR"
},
"outputs": [],
"source": [
"# Show graphs within the Python notebook\n",
"%matplotlib inline"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Graphics with the Salaries dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QJBJNupL0icR"
},
"outputs": [],
"source": [
"# Use matplotlib to draw a histogram of a salary data\n",
"plt.hist(df_new['salary'],bins=8, density=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_x-c3hL20icS"
},
"outputs": [],
"source": [
"# Use seaborn package to draw a histogram\n",
"sns.displot(df_new['salary']);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CAwwtPMR0icS"
},
"outputs": [],
"source": [
"# Use regular matplotlib function to display a barplot\n",
"df_new.groupby(['rank'])['salary'].count().plot(kind='bar')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Q4rEH0cx0icS",
"scrolled": false
},
"outputs": [],
"source": [
"# Use seaborn package to display a barplot\n",
"sns.set_style(\"whitegrid\")\n",
"ax = sns.barplot(x='rank',y ='salary', data=df_new, estimator=len)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9P4SKlYW0icS"
},
"outputs": [],
"source": [
"# Split into 2 groups:\n",
"ax = sns.barplot(x='rank',y ='salary', hue='gender', data=df_new, estimator=len)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "AzBt0Vih0icU"
},
"outputs": [],
"source": [
"# Violinplot\n",
"sns.violinplot(x = \"salary\", data=df_new)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1ieYE6KF0icU"
},
"outputs": [],
"source": [
"# Scatterplot in seaborn\n",
"sns.jointplot(x='yearsOfService', y='salary', data=df_new)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0u2HEEOp0icV"
},
"outputs": [],
"source": [
"# If we are interested in linear regression plot for 2 numeric variables we can use regplot\n",
"sns.regplot(x='yearsOfService', y='salary', data=df_new)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0WD6YnSy0icV"
},
"outputs": [],
"source": [
"# Box plot\n",
"sns.boxplot(x='rank',y='salary', data=df_new)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "U6Pmpr5l0icV"
},
"outputs": [],
"source": [
"# Side-by-side box plot\n",
"sns.boxplot(x='rank', y='salary', data=df_new, hue='gender')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Dn-ibN7k0icW"
},
"outputs": [],
"source": [
"# Swarm plot\n",
"sns.swarmplot(x='rank', y='salary', data=df_new)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7aR8SY2r0ich"
},
"outputs": [],
"source": [
"# Factorplot\n",
"sns.catplot(x='rank', y='salary', data=df_new, kind='bar')\n",
"print(df_new.groupby('rank').mean())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UfRCtVU20ici"
},
"outputs": [],
"source": [
"# Pairplot \n",
"sns.pairplot(df_new)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "8ABnDMpa0ici"
},
"source": [
"### Exercise 7 Graphing data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "HP7rVYGp0ici"
},
"outputs": [],
"source": [
"# Use the seaborn package to explore the dependency of arr_delay on dep_delay \n",
"# in the flights dataset. You can use a scatterplot or regplot using flight.\n",
"# "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "iWKUzQNP0icQ"
},
"source": [
"## Descriptive statistics\n",
"Statistics that are used to describe data. We have seen methods that calculate descriptive statistics before with the DataFrame describe() method. \n",
"\n",
"Descriptive statistics summarize attributes of a sample, such as the min/max values, and the mean (average) of the data. Below is a summary of some additional methods that calculate descriptive statistics."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "7qiHcIAJ0icQ"
},
"source": [
"|Function|Description\n",
"|-------|--------\n",
"|min | minimum\n",
"|max | maximum\n",
"|mean | arithmetic mean of values\n",
"|median | median\n",
"|mad | mean absolute deviation\n",
"|mode | mode\n",
"|std | standard deviation\n",
"|var | unbiased variance\n",
"|sem | standard error of the mean\n",
"|skew| sample skewness\n",
"|kurt|kurtosis\n",
"|quantile| value at %\n"
]
},
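{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A few of the less familiar functions in the table (sem, skew, quantile) can be sketched on a small, made-up Series:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch on a made-up Series (not the tutorial data)\n",
"s = pd.Series([1, 2, 3, 4, 100])\n",
"print(s.sem())          # standard error of the mean\n",
"print(s.skew())         # sample skewness (large here because of the outlier 100)\n",
"print(s.quantile(0.5))  # 3.0 -- the 0.5 quantile is the median"
]
},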
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "y2izjRmU0icQ"
},
"outputs": [],
"source": [
"# Recall the describe() function which computes a subset of the above listed statistics\n",
"flights.dep_delay.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yh4S78a80icR"
},
"outputs": [],
"source": [
"# find the index of the maximum or minimum value\n",
"# if there are multiple values matching idxmin() and idxmax() will return the first match\n",
"flights['dep_delay'].idxmin() #minimum value"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "k9D_xUPK0icR"
},
"outputs": [],
"source": [
"# Count the number of records for each different value in a vector\n",
"flights['carrier'].value_counts()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "a2XW__P90icj"
},
"source": [
"## Inferential Statistics\n",
"Use data analysis on a sample of data to infer properties and make predictions that cannot be derived from descriptive statistics. Examples of this could be predicting a new unknown value based on previous data (linear regression, machine learning) or hypothesis testing (such as T-tests)."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "A5dv3AVB0icj"
},
"source": [
"### Linear Regression\n",
"A linear approach to model the relationship between a scalar output and one (or more) input variables. With one input and one output variable you are finding a line of *best fit*. You calculate the slope and y-intercept for a line that minimizes the distance between all the existing data points. You can then use this line to make predictions on unknown data."
]
},
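{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The slope and intercept described above have a closed-form least-squares solution. As a sketch on made-up points (assuming numpy is imported as np, as earlier in the notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Closed-form least-squares fit on made-up data points\n",
"x = np.array([1.0, 2.0, 3.0, 4.0])\n",
"y = np.array([3.1, 4.9, 7.2, 8.8])\n",
"slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()\n",
"intercept = y.mean() - slope * x.mean()\n",
"print(slope, intercept)  # the line y = slope*x + intercept minimizes squared error"
]
},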
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qSC1beWi0icj"
},
"outputs": [],
"source": [
"# Import Statsmodel functions:\n",
"import statsmodels.api as sm"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "DZhLEStD0icj"
},
"outputs": [],
"source": [
"# Create a fitted model\n",
"lm = sm.OLS(df_new.yearsOfService, df_new.salary).fit()\n",
"\n",
"# Print model summary\n",
"print(lm.summary())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "otaTnzG20icj"
},
"outputs": [],
"source": [
"# Print the coefficients\n",
"lm.params"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dC0PDTYx0ick"
},
"outputs": [],
"source": [
"# Using scikit-learn:\n",
"from sklearn import linear_model\n",
"est = linear_model.LinearRegression(fit_intercept = True) # create estimator object\n",
"# When you use dataframe columns this fits your model with feature names\n",
"est.fit(df_new[['yearsOfService']], df_new[['salary']])\n",
"\n",
"# If you pass the values, you don't have to use feature names\n",
"# est.fit(df_new[['yearsOfService']].values, df_new[['salary']].values)\n",
"\n",
"# Print result\n",
"print(\"Coef:\", est.coef_, \"\\nIntercept:\", est.intercept_)\n",
"\n",
"# Predict \n",
"# When you predict you have to use the features name otherwise you get a warning\n",
"pred = est.predict(pd.DataFrame(np.array([21]), columns=['yearsOfService']))\n",
"\n",
"# If you created a model based on values, then you predict with only a value, \n",
"# though you have to pass it with the same shape coef_ \n",
"# pred = est.predict([[21]])\n",
"print(\"Predicted salary: \", pred)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "f8LJHj4M0ick"
},
"source": [
"### Exercise 8 Build a linear model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "X4V5kbci0ick"
},
"outputs": [],
"source": [
"# Build a linear model for arr_delay ~ dep_delay\n",
"\n",
"# Print model summary\n",
"\n",
"# Predict a value\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"id": "N8ca0v6_0icl"
},
"source": [
"### Student T-test\n",
"Used to compare the means of two groups. In this case you have a null hypothesis that the two group means are equal. The T-test then tells you whether you have statistically significant evidence to reject the null hypothesis. \n",
"The T-test has two output results, the statistic, and the p-value. The statistic (or T-value) quantifies the difference between the two mean values. The p-value is the probability of obtaining test results at least as extreme as the result that is observed assuming that the null hypothesis is correct.\n",
"\n",
"More succintly, the p-value tells you how likely it is that your data could have occured under the null hypothesis. Small p-values indicate there is a small probability of observing such a difference in the mean assuming the null hypothesis is true. Small p-values indicate that there is evidence to reject the null hypothesis (i.e., the group means are different). \n",
"\n",
"Conversely, larger p-value scores indicate there is a large probabiliy of observing the calculated statistic under the null hypothesis. Large p-values indicate that you have evidence to accept the null hypothesis (i.e., the group means are equal).\n",
"\n",
"One area the T-test is important is in clinical trials. Consider an example where you are looking at whether a drug reduces your cholesterol level. You have 2 populations, one where the drug is administered and another that is not administered. You can use the T-test to determine whether there is evidence to suggest that the drug causes a statistically significant change to the cholesterol level between the two populations.\n",
"\n",
"Below we calculate whether there is a meaningful difference between male and female salaries. Generally a p-value below 0.05 is considered statistically significant."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W2yyiAe30icl"
},
"outputs": [],
"source": [
"# Using scipy package:\n",
"from scipy import stats\n",
"df_w = df[df['sex'] == 'Female']['salary']\n",
"df_m = df[df['sex'] == 'Male']['salary']\n",
"stats.ttest_ind(df_w, df_m) "
]
},
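{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how the p-value behaves, here is a sketch on synthetic data (made-up samples, not the salary data): two groups drawn from normal distributions with different means should usually yield a small p-value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: t-test on two synthetic samples\n",
"rng = np.random.default_rng(0)\n",
"a = rng.normal(loc=0.0, scale=1.0, size=100)  # group with true mean 0\n",
"b = rng.normal(loc=0.5, scale=1.0, size=100)  # group with true mean 0.5\n",
"res = stats.ttest_ind(a, b)\n",
"print(res.statistic, res.pvalue)  # we would usually expect a small p-value here"
]
},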
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "xM391jsk0icl"
},
"source": [
"## Tutorial Evaluation\n",
"After the tutorial please submit an evaluation by clicking on this link [Tutorial Evaluation](http://scv.bu.edu/survey/tutorial_evaluation.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"collapsed_sections": [
"M3FbA7um0ibR",
"vYNDrGEL0ibY",
"3f4HD5E90ibn",
"PQprwXx40ibo",
"tj85w0PMttNG",
"KzvZFygO0ibz",
"5M2fqDoL0ib6",
"6Fir_77Y0ib8",
"vzxWbXKF0icA",
"XZjjbRia0icD",
"JGOFyTVm0icN",
"iWKUzQNP0icQ",
"sPGu6w9j0icR",
"a2XW__P90icj"
],
"name": "dataScience.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 1
}