{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "vqv_wAtk0ibM" }, "source": [ "# Python for Data Analysis\n", "\n", "### Research Computing Services\n", "\n", "Instructor: Scott Ladenheim
\n", "Website: [rcs.bu.edu](http://www.bu.edu/tech/support/research/)
\n", "Tutorial materials: [http://rcs.bu.edu/examples/python/DataAnalysis](http://rcs.bu.edu/examples/python/DataAnalysis)
\n", "Contact us: help@scc.bu.edu" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "M3FbA7um0ibR" }, "source": [ "## Course Content\n", "1. Python packages for data scientists\n", "2. Data manipulation with Pandas\n", "3. Basic data plotting\n", "4. Descriptive statistics\n", "5. Inferential statistics\n", "\n", "\n", "## Tutorial Evaluation\n", "After the tutorial, please submit an evaluation by clicking on this link: [Tutorial Evaluation](http://scv.bu.edu/survey/tutorial_evaluation.html)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Python packages for data scientists\n", "* [NumPy](https://numpy.org)\n", " - Introduces objects for handling n-dimensional arrays such as vectors (1-D) and matrices (2-D)\n", " - Introduces functions to perform advanced mathematical and statistical operations on these objects\n", " - Provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance\n", " - Many other Python libraries are built on NumPy\n", "* [SciPy](https://scipy.org)\n", " - Collection of algorithms for statistics, linear algebra, optimization, differential equations, numerical integration, and more\n", " - Part of the SciPy stack\n", " - Built on NumPy\n", "* [Pandas](https://pandas.pydata.org)\n", " - Adds data structures and tools designed to work with table-like data (similar to Vectors and Data Frames in R)\n", " - Provides tools for data manipulation: *reshaping*, *merging*, *sorting*, *slicing*, *aggregation*, etc.\n", " - Makes it easy to handle missing data\n", "* [SciKit-Learn](https://scikit-learn.org/stable/)\n", " - Provides machine learning algorithms: classification, regression, clustering, model validation, etc.\n", " - Built on NumPy, SciPy, and matplotlib\n", "\n", "* Many others\n", " - [Keras](https://keras.io/)\n", " - [Pytorch](https://pytorch.org/)\n", " - [Tensorflow](https://www.tensorflow.org/)\n", " \n", "### Visualization\n", 
"A more in-depth look at visualization is given in the `Data Visualization in Python` course.\n", "* [matplotlib](https://matplotlib.org/)\n", " - Python 2-D plotting library for publication-quality figures in a variety of hardcopy formats\n", " - Functionality similar to MATLAB\n", " - Line plots, scatter plots, bar charts, histograms, pie charts, etc.\n", " - Effort needed to create advanced visualizations\n", "* [seaborn](https://seaborn.pydata.org/)\n", " - Based on matplotlib\n", " - Provides a high-level interface for drawing attractive statistical graphs\n", " - Similar to the ggplot2 library in R\n", "* [plotly](https://plotly.com/python/)\n", " - Over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use cases\n", " - Built on top of the Plotly JavaScript library\n", " - Can create beautiful interactive web-based visualizations" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Python libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ea4C8rUR0ibU" }, "outputs": [], "source": [ "# Press shift-enter to execute a Jupyter notebook cell\n", "# Import the Python libraries used in the tutorial\n", "import numpy as np\n", "import scipy as sp\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "# Enable inline plotting\n", "%matplotlib inline" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "ca3M2mv_0ibX" }, "source": [ "## Pandas\n", "The main focus of this tutorial is using the Pandas library to manipulate and analyze data.\n", "\n", "Pandas is a Python package that deals mostly with:\n", "- **Series** (1-D homogeneous array)\n", "- **DataFrame** (2-D labeled heterogeneous array)\n", "- **MultiIndex** (for hierarchical data)\n", "- **Xarray** (built on top of Pandas for n-D arrays)\n", "\n", "The Pandas content of this tutorial will cover:\n", "* Creating and 
understanding Series and DataFrames\n", "* Importing/Reading data\n", "* Data selection and filtering\n", "* Data manipulation via sorting, grouping, and rearranging\n", "* Handling missing data\n", "\n", "\n", "In addition, we will cover the following topics:\n", "* Basic data plotting\n", "* Descriptive statistics (time permitting)\n", "* Inferential statistics (time permitting)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "vYNDrGEL0ibY" }, "source": [ "### Pandas Series" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "hQpjwDv_0ibZ" }, "source": [ "A Pandas *Series* is a 1-dimensional labeled array containing data of the same type (integers, strings, floating-point numbers, Python objects, etc.). It is a generalized NumPy array with an explicit axis called the *index*." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iPJvEWlG0ibZ" }, "outputs": [], "source": [ "# Example of creating a Pandas Series:\n", "s1 = pd.Series([-3, -1, 1, 3, 5])\n", "print(s1)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "YPctLCcw0ibb" }, "source": [ "We did not pass any index, so by 
default, it assigns an integer index ranging from 0 to len(data)-1." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "v4vAbdqv0ibb" }, "outputs": [], "source": [ "# View index values\n", "print(s1.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s3I3__b90ibi" }, "outputs": [], "source": [ "s1[:2] # First 2 elements" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4jsIOlUR0ibi" }, "outputs": [], "source": [ "print(s1[[2,1,0]]) # Elements out of order" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qGn-aAvx2t9V" }, "outputs": [], "source": [ "type(s1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WIft0v6r0ibl" }, "outputs": [], "source": [ "# We can place filtering conditions on a Series\n", "s1[s1 > 0]" ] }, { "cell_type": "code", "execution_count": 160, "metadata": { "id": "ZmlDR_tj0ibd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a 0.745068\n", "b -2.273686\n", "c 0.595940\n", "d 1.315555\n", "e -1.060865\n", "dtype: float64\n" ] } ], "source": [ "# Creating a Pandas Series with an index:\n", "rng = np.random.default_rng()\n", "s2 = pd.Series(rng.normal(size=5), index=['a', 'b', 'c', 'd', 'e'])\n", "print(s2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "srVyZ9ik0ibe" }, "outputs": [], "source": [ "# View index values\n", "print(s2.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vX-IUZb30ibf" }, "outputs": [], "source": [ "# Create a Series from a dictionary\n", "data = {'pi': 3.14159, 'e': 2.71828} # dictionary\n", "print(data)\n", "s3 = pd.Series(data)\n", "print(s3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sOHbkXC10ibg" }, "outputs": [], "source": [ "# Create a new Series from a dictionary and reorder the elements\n", "s4 = pd.Series(data, index=['e', 'pi', 'tau'])\n", "print(s4)" ] }, { "attachments": {}, 
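"cell_type": "markdown", "metadata": {}, "source": [ "When two Series are combined, Pandas aligns elements by index label rather than by position; a label present in only one of the Series yields NaN. A quick sketch using the s3 and s4 Series defined above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Index alignment: 'e' and 'pi' match up; 'tau' exists only in s4, so the result is NaN\n", "print(s3 + s4)" ] }, { "attachments": {}, 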
"cell_type": "markdown", "metadata": { "id": "DeofA8nZ0ibi" }, "source": [ "NaN (Not a Number) is used to represent a missing value in Pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "skyr2siA0ibj" }, "outputs": [], "source": [ "# A Series can be treated as a 1-D array and you can apply functions to it:\n", "print(\"Median:\", s4.median())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "42_G7Ks-0ibl" }, "outputs": [], "source": [ "# Methods can be used to filter a Series:\n", "s4[s4 > s4.median()]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "3f4HD5E90ibn" }, "source": [ "### Attributes and Methods:\n", "An attribute is a variable stored in the object, e.g., index or size for a Series.\n", "A method is a function stored in the object, e.g., head() or median() for a Series.\n", "\n", "| Attribute/Method | Description |\n", "|-----|-----|\n", "| dtype | data type of values in the series |\n", "| empty | True if the series is empty |\n", "| size | number of elements |\n", "| values | returns values as an ndarray |\n", "| head() | first n elements |\n", "| tail() | last n elements |\n", "\n", "Execute *dir(s1)* to see all attributes and methods.\n", "\n", "I also recommend using the online documentation, which is in a much easier format to read and comes with examples.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# For more information on a particular method or attribute use the help() function\n", "# Note: pass the method itself (s4.head), not its result (s4.head())\n", "help(s4.head)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "help(s4.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can also add a question mark to get help information\n", "s4.head?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s4.index?" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "One final way to get help is to press Shift-Tab when the cursor is inside the parentheses of a method or after an attribute. Try this in the exercise below." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "FXjjnBOj0ibn" }, "source": [ "### Exercise - Create your own Series" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RUmWlc790ibn" }, "outputs": [], "source": [ "# Create a series with 10 elements containing both positive and negative integers\n", "# Examine the series with the head() method\n", "# Create a new series from the original series with only the negative numbers\n", "# \n", "# mys = pd.Series( ... )" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "PQprwXx40ibo" }, "source": [ "### Pandas DataFrames" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "ZYRivqyM0ibo" }, "source": [ "A Pandas *DataFrame* is a 2-dimensional, size-mutable, heterogeneous tabular data structure with labeled rows and columns. You can think of it as a dictionary-like container that stores Pandas Series objects." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vWXR_GKo0ibo" }, "outputs": [], "source": [ "d = pd.DataFrame({'Name': pd.Series(['Alice', 'Bob', 'Chris']), \n", " 'Age': pd.Series([21, 25, 23])})\n", "print(d)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xt6NqHhS0ibo" }, "outputs": [], "source": [ "# Note: a NumPy array holds a single data type, so mixing names and ages\n", "# coerces every value (including the ages) to a string\n", "d2 = pd.DataFrame(np.array([['Alice','Bob','Chris'], [21, 25, 23]]).T, columns=['Name','Age'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cZXlWYmn0ibq" }, "outputs": [], "source": [ "# Use the head() method to print the first 5 records in the dataframe (same as with Series)\n", "d2.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NHte5SMZ0ibq" }, "outputs": [], "source": [ "# Add a new column to d2:\n", "d2['Height'] = pd.Series([5.2, 6.0, 5.6])\n", "d2.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Reading data using Pandas\n", "You can read CSV (comma-separated values) files using Pandas. The command shown below reads a CSV file into the Pandas dataframe df." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Hfwv2wzd0ibq" }, "outputs": [], "source": [ "# Read a csv file into Pandas Dataframe\n", "df = pd.read_csv(\"http://rcs.bu.edu/examples/python/DataAnalysis/Salaries.csv\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "2JaiSOAXnKup" }, "source": [ "The above command has many optional arguments that you can find in the Pandas documentation online.\n", "\n", "You can also read many other formats, for instance:\n", "* Excel - pd.read_excel('myfile.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])\n", "* Stata - pd.read_stata('myfile.dta')\n", "* SAS - pd.read_sas('myfile.sas7bdat')\n", "* HDF - pd.read_hdf('myfile.h5', 'df')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "2JaiSOAXnKup" }, "source": [ "Before we can perform any analysis on the data we need to\n", "\n", "\n", "* Check if the data is correctly imported \n", "* Check the types of each column\n", "* Determine how many missing values each column has\n", "\n", "We can then carefully prepare the data:\n", "\n", "* Remove columns that are not needed in our analysis\n", "* Rename the columns (if necessary)\n", "* Possibly rearrange the columns to make it easier to work with them\n", "* Create new or modify existing columns (e.g., convert into different units) if necessary" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "86icwTdZ0ibr" }, "outputs": [], "source": [ "# Display the first 10 records\n", "df.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "z-lqfY0MiwGP" }, "outputs": [], "source": [ "# Display structure of the data frame\n", "df.info()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### More details on DataFrame data types\n", "\n", "|Pandas Type | Native Python Type | Description |\n", "|------------|--------------------|-------------|\n", "| object | string | The most 
general dtype. Will be assigned to your column if the column has mixed types (numbers and strings). |\n", "| int64 | int | Integer values. 64 refers to the number of bits allocated to hold each value. |\n", "| float64 | float | Numeric values with decimals. If a column contains both numbers and NaNs (see below), Pandas will default to float64, in case your missing values have a decimal. |\n", "| datetime64, timedelta\[ns\] | N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments. |\n", "\n", "\n", "### DataFrame attributes\n", "|df.attribute | Description |\n", "|-------------|-------------|\n", "| dtypes | list the types of the columns |\n", "| columns | list the column names |\n", "| axes | list the row labels and column names |\n", "| ndim | number of dimensions |\n", "| size | number of elements |\n", "| shape | return a tuple representing the dimensionality |\n", "| values | numpy representation of the data |\n", "\n", "### Dataframe methods\n", "|df.method() | Description |\n", "|-------------|-------------|\n", "| head(\[n\]), tail(\[n\]) | first/last n rows |\n", "| describe() | generate descriptive statistics (for numeric columns only) |\n", "| max(), min() | return max/min values for all numeric columns |\n", "| mean(), median() | return mean/median values for all numeric columns |\n", "| std() | standard deviation |\n", "| sample(\[n\]) | returns a random sample of n elements from the data frame |\n", "| dropna() | drop all the records with missing values |" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "v_Gm0ubUmu95" }, "source": [ "Sometimes the column names in the input file are too long or contain special characters. In such cases we rename them to make it easier to work with these columns." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5PTMYO9ok8te" }, "outputs": [], "source": [ "# Let's create a copy of this dataframe with new column names\n", "# If we do not want to create a new data frame, we can add the inplace=True argument\n", "df_new = df.rename(columns={'sex': 'gender', 'phd': 'yearsAfterPhD', 'service': 'yearsOfService'})\n", "df_new.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "tj85w0PMttNG" }, "source": [ "### DataFrame Exploration" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pbDGiE0J0ibt" }, "outputs": [], "source": [ "# Identify the type of the df_new object\n", "type(df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pybC4toq0ibt" }, "outputs": [], "source": [ "# Check the data type of the column \"salary\"\n", "# We access columns using brackets, e.g., df['column_name']\n", "df_new['salary'].dtype" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# If the column name has no spaces or special symbols and is not the name of an attribute/method,\n", "# you can use the syntax df.column_name\n", "df_new.salary.dtype" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1T1wR6qq0ibu" }, "outputs": [], "source": [ "# List the types of all columns\n", "df_new.dtypes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YFXUSewc0ibu" }, "outputs": [], "source": [ "# List the column names\n", "df_new.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_Cu0DFOj0ibv" }, "outputs": [], "source": [ "# List the row labels and the column names\n", "df_new.axes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "99jPywmg0ibw" }, "outputs": [], "source": [ "# Number of rows and columns\n", "df_new.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "utYUKYjN0ibv" }, 
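"outputs": [], "source": [ "# The attributes table above also lists ndim; for a DataFrame it is always 2\n", "df_new.ndim" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 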
"outputs": [], "source": [ "# Total number of elements in the data frame (78 x 6)\n", "df_new.size" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jsjjlva70ibw" }, "outputs": [], "source": [ "# Output some descriptive statistics for the numeric columns\n", "df_new.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZFxc7P8w9mJc" }, "outputs": [], "source": [ "# Remember we can use the ? to get help about the function\n", "df_new.describe?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XHWfkQX8roey" }, "outputs": [], "source": [ "# Create a new column using the assign method\n", "df_new = df_new.assign(salary_k=lambda x: x.salary/1000.0)\n", "df_new.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vsyWDUkww3m3" }, "outputs": [], "source": [ "# Check how many unique values are in a column\n", "# DataFrame has a rank() method, so we access the column using df_new['rank']\n", "df_new['rank'].unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pESwRxLCt5oO" }, "outputs": [], "source": [ "# Get the frequency table for a categorical or binary column\n", "df_new['rank'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pBF1lOcY0FLH" }, "outputs": [], "source": [ "# Get a proportion table\n", "df_new['rank'].value_counts(normalize=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ITSJWChquiYI" }, "outputs": [], "source": [ "# Alternatively we can use the pandas function crosstab() to calculate a frequency table\n", "pd.crosstab(index=df_new['rank'], columns=\"count\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "faCHhLX-vMtw" }, "outputs": [], "source": [ "# Two-way tables\n", "pd.crosstab(index=df_new['rank'], columns=df_new['discipline'], margins=True)" ] }, { "attachments": {}, "cell_type": "markdown", 
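"metadata": {}, "source": [ "crosstab() can also report proportions instead of raw counts via its normalize argument (a quick sketch; normalize='index' makes each row sum to 1):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Row-normalized two-way table: proportions within each rank\n", "pd.crosstab(index=df_new['rank'], columns=df_new['discipline'], normalize='index')" ] }, { "attachments": {}, "cell_type": "markdown", 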
"metadata": { "collapsed": true, "id": "KzvZFygO0ibz" }, "source": [ "### Data slicing and grouping" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LE0RknOW0ib0" }, "outputs": [], "source": [ "#Extract a column by name \n", "df_new['gender'].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RdMYTbNlCrzE" }, "outputs": [], "source": [ "# If the column name does not contain spaces or other special characters and does not collide with data frame methods, we can use a dot notation\n", "df_new.gender.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6JrIilpE08Be" }, "outputs": [], "source": [ "# Calculate median number of service years\n", "df_new.yearsOfService.median()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "3gcURpQh0ib1" }, "source": [ "### Exercise - Working with a single column" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LXwRAzfQ0ib2" }, "outputs": [], "source": [ "# Calculate the descriptive statistics for only the salary column in df_new\n", "# " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wk7Er3ma0ib2" }, "outputs": [], "source": [ "# Get a count for each of the values in the salary_k column in df_new\n", "# " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "08H11QC80ib2" }, "outputs": [], "source": [ "# Calculate the average (mean) salary_k in df_new\n", "# " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "OR761ERr0ib2" }, "source": [ "### Grouping data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HxIkM-uT0ib3" }, "outputs": [], "source": [ "# Group data using rank\n", "df_rank = df_new.groupby('rank')\n", "df_rank.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dX45khRv0ib3" }, "outputs": [], "source": [ "# Calculate mean of all numeric columns for the grouped 
object\n", "# (numeric_only=True restricts the calculation to numeric columns)\n", "df_rank.mean(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wrLlxmmP0ib3" }, "outputs": [], "source": [ "# Most of the time, the \"grouping\" object is not stored, but is used as a step in getting a summary:\n", "df_new.groupby('gender').mean(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "txqWe6sM0ib4" }, "outputs": [], "source": [ "# Calculate the mean salary for men and women. The following produces a Pandas Series (single brackets around salary)\n", "df_new.groupby('gender')['salary'].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TDRTnWXo0ib4" }, "outputs": [], "source": [ "# If we use double brackets Pandas will produce a DataFrame\n", "df_new.groupby('gender')[['salary']].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nUMv2CM50ib5" }, "outputs": [], "source": [ "# Group using 2 variables - gender and rank:\n", "df_new.groupby(['rank','gender'], sort=True)[['salary']].mean()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "n_QMiWbN0ib5" }, "source": [ "### Exercise - Grouping data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bK_eeYF70ib5" }, "outputs": [], "source": [ "# Group data by the rank and discipline and find the average yearsOfService and salary_k for each group. 
\n", "# " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "5M2fqDoL0ib6" }, "source": [ "### Filtering" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bNCJvz590ib6" }, "outputs": [], "source": [ "# Select observations where the value in the salary column is > 120K\n", "df_filter = df_new[df_new.salary > 120000]\n", "df_filter.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Oj51sat40ib6" }, "outputs": [], "source": [ "df_filter.axes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xGZSM5EH0ib7" }, "outputs": [], "source": [ "# Select data for female professors\n", "df_w = df_new[df_new.gender == 'Female']\n", "df_w.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qNkmHdkP6hsN" }, "outputs": [], "source": [ "# To subset one column using a condition on another column, use the \"where\" method\n", "df_new.salary.where(df_new.gender=='Female').dropna().head(6)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "iUOMOcM60ib8" }, "source": [ "### Exercise - Filtering data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0fCg9pRv0ib8" }, "outputs": [], "source": [ "# Using filtering, find the mean value of the salary for discipline A\n", "# " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DUHR3SlV0ib8" }, "outputs": [], "source": [ "# Challenge:\n", "# Determine how many female and male professors earned more than 100K\n", "# " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "6Fir_77Y0ib8" }, "source": [ "### Slicing a dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "W4v86rQi0ib8", "scrolled": true }, "outputs": [], "source": [ "# Select column salary\n", "salary = df_new['salary']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sugc9L480ib-" }, "outputs": [], "source": [ "# 
Check the data type of the result\n", "type(salary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XuISZ50f0ib-" }, "outputs": [], "source": [ "# Look at the first few elements of the output\n", "salary.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CKQS3knF0ib_" }, "outputs": [], "source": [ "# Select column salary and make the output a data frame\n", "df_salary = df_new[['salary']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gMPqOKc20ib_" }, "outputs": [], "source": [ "# Check the type\n", "type(df_salary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4HBqgWTi0ib_" }, "outputs": [], "source": [ "# Select a subset of rows (based on their position):\n", "# Note 1: The location of the first row is 0\n", "# Note 2: The last value in the range is not included\n", "df_new[0:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4BEvyTpr0ib_" }, "outputs": [], "source": [ "# If we want to select both rows and columns we can use the method .loc\n", "df_new.loc[10:20, ['rank', 'gender','salary']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fpFEVLMJ0icA" }, "outputs": [], "source": [ "# Recall our filtered dataset with salaries over 120K\n", "df_filter.head(25)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "k28_O2L70icA" }, "outputs": [], "source": [ "# Let's see what we get for our df_filter data frame\n", "# The .loc method subsets the data frame based on the index values:\n", "# loc = location\n", "df_filter.loc[10:20,['rank','gender','salary']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Xg9Tuorn0icA" }, "outputs": [], "source": [ "# Unlike the .loc method, .iloc selects rows (and columns) by absolute position:\n", "# iloc = integer location\n", "df_filter.iloc[10:20, [0,3,4,5]]" ] }, { "attachments": {}, "cell_type": "markdown", 
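"metadata": {}, "source": [ "A quick way to remember the difference: .loc slices by index label and *includes* the end label, while .iloc slices by position and *excludes* the end position. A sketch using df_new, whose labels happen to equal the positions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(len(df_new.loc[0:2, 'salary']))   # labels 0, 1, 2 -> 3 rows\n", "print(len(df_new.iloc[0:2]['salary']))  # positions 0, 1 -> 2 rows" ] }, { "attachments": {}, "cell_type": "markdown", 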
"metadata": {}, "source": [ "### Exercise - Slicing a dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a new dataframe where you filter out salaries below 100K from df_new\n", "# Call this data frame df_sub100" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract rows 5:10 and columns ['yearsOfService', 'salary_k'] of df_sub100 using the .loc method\n", "# How many rows are in the output?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract rows 5:10 and columns ['yearsOfService', 'salary_k'] from df_sub100 using the iloc method\n", "# What are the values of the indices in the output?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract rows with index values [6, 12, 20, 22] and columns ['yearsOfService','salary_k'] from df_sub100\n", "# Hint: Use the loc method" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "vzxWbXKF0icA" }, "source": [ "### Sorting the Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OXybOgop0icB" }, "outputs": [], "source": [ "# Sort the data frame df_new by service and create a new data frame\n", "df_sorted = df_new.sort_values(by = 'yearsOfService')\n", "df_sorted.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "85UG8-JZ0icC" }, "outputs": [], "source": [ "# Sort the data frame df_new by yearsOfService and overwrite the original dataset\n", "df_new.sort_values(by = 'yearsOfService', ascending = False, inplace = True)\n", "df_new.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mcrnORWR0icC" }, "outputs": [], "source": [ "# Restore the original order using the sort_index method\n", "df_new.sort_index(axis=0, ascending = True, inplace = True)\n", "df_new.head()" ] }, { "cell_type": "code", 
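"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A related shortcut: nlargest(n, column) returns the n rows with the largest\n", "# values in that column (same result as sorting in descending order and taking head(n))\n", "df_new.nlargest(5, 'salary')" ] }, { "cell_type": "code", 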
"execution_count": null, "metadata": { "id": "52sGV0z_0icD" }, "outputs": [], "source": [ "# Sort the data frame using 2 or more columns:\n", "df_sorted2 = df_new.sort_values(by = ['yearsOfService', 'salary'], ascending = [True,False])\n", "df_sorted2.head(15)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "aBgfkBkK0icC" }, "source": [ "### Exercise - Sorting" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Kh0qVebH0icD" }, "outputs": [], "source": [ "# Sort the data frame df_new by the salary (in descending order)\n", "# Store the output of this sorting in a dataframe called df_desc\n", "# Display the first 10 records of the output\n", "# " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "XZjjbRia0icD" }, "source": [ "### Missing Values\n", "To discuss how to handle missing values we will import the flights dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bZCCJvtD0icD" }, "outputs": [], "source": [ "# Read a dataset with missing values\n", "flights = pd.read_csv(\"http://rcs.bu.edu/examples/python/DataAnalysis/flights.csv\")\n", "flights.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PcLvDYfOjNZv" }, "outputs": [], "source": [ "flights.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IPfS620V0icE", "scrolled": true }, "outputs": [], "source": [ "# Select the rows that have at least one missing value\n", "flights[flights.isnull().any(axis=1)].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AUI-q-wr0icE" }, "outputs": [], "source": [ "# Keep only the rows where the arr_delay value is not missing:\n", "flights1 = flights[flights['arr_delay'].notnull()]\n", "flights1.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oem6cDo90icF" }, "outputs": [], "source": [ "# Remove all the observations with missing values\n", "flights2 = 
flights.dropna()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NoCw8dmx0icH" }, "outputs": [], "source": [ "# Fill missing values with zeros\n", "nomiss = flights['dep_delay'].fillna(0)\n", "nomiss.isnull().any()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "d-Q3P5SO0icJ" }, "source": [ "### Exercise - Count missing data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uAfRQCPs0icL" }, "outputs": [], "source": [ "# Count how many missing pieces of data there are in the dep_delay and arr_delay columns\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "JGOFyTVm0icN" }, "source": [ "### Common Aggregation Functions:\n", "\n", "The following functions are commonly used to aggregate data.\n", "\n", "|Function|Description\n", "|-------|--------\n", "|min | minimum\n", "|max | maximum\n", "|count | number of non-null observations\n", "|sum | sum of values\n", "|mean | arithmetic mean of values\n", "|median | median\n", "|mad | mean absolute deviation\n", "|mode | mode\n", "|prod | product of values\n", "|std | standard deviation\n", "|var | unbiased variance\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "q5SgaHvG0icO" }, "outputs": [], "source": [ "# describe() reports the number of non-missing values in each column, among other statistics\n", "flights.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gmZ5Dekd9N6g" }, "outputs": [], "source": [ "flights.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "90ZAiJWb0icP" }, "outputs": [], "source": [ "# Find the minimum value for each numeric column in the dataset\n", "flights.min(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZaZ1Uaks0icP" }, "outputs": [], "source": [ "# Let's compute a summary statistic per group:\n", "flights.groupby('carrier')['dep_delay'].mean()" ] }, { "cell_type": "code", "execution_count": null, 
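"metadata": {}, "outputs": [], "source": [ "# Grouped summaries are themselves Series, so they can be sorted:\n", "# rank the carriers from largest to smallest mean departure delay\n", "flights.groupby('carrier')['dep_delay'].mean().sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": null, 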
"metadata": { "id": "HrhCXQBi0icP" }, "outputs": [], "source": [ "# We can use the agg() method for aggregation:\n", "flights[['dep_delay','arr_delay']].agg(['min','mean','max'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AVsIj0La0icQ" }, "outputs": [], "source": [ "# An example of computing different statistics for different columns\n", "flights.agg({'dep_delay':['min','mean','max'], 'carrier':['nunique']})" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "sPGu6w9j0icR" }, "source": [ "## Exploring data using graphics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b5gSUWkR0icR" }, "outputs": [], "source": [ "# Show graphs within the Python notebook\n", "%matplotlib inline" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Graphics with the Salaries dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QJBJNupL0icR" }, "outputs": [], "source": [ "# Use matplotlib to draw a histogram of the salary data\n", "plt.hist(df_new['salary'], bins=8, density=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_x-c3hL20icS" }, "outputs": [], "source": [ "# Use the seaborn package to draw a histogram\n", "sns.displot(df_new['salary']);" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CAwwtPMR0icS" }, "outputs": [], "source": [ "# Use the pandas plot() method (built on matplotlib) to display a barplot\n", "df_new.groupby(['rank'])['salary'].count().plot(kind='bar')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Q4rEH0cx0icS", "scrolled": false }, "outputs": [], "source": [ "# Use the seaborn package to display a barplot\n", "sns.set_style(\"whitegrid\")\n", "ax = sns.barplot(x='rank', y='salary', data=df_new, estimator=len)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9P4SKlYW0icS" }, "outputs": [], "source": [ "# Split into 2 groups:\n", "ax = 
sns.barplot(x='rank', y='salary', hue='gender', data=df_new, estimator=len)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AzBt0Vih0icU" }, "outputs": [], "source": [ "# Violinplot\n", "sns.violinplot(x=\"salary\", data=df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1ieYE6KF0icU" }, "outputs": [], "source": [ "# Scatterplot in seaborn\n", "sns.jointplot(x='yearsOfService', y='salary', data=df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0u2HEEOp0icV" }, "outputs": [], "source": [ "# For a linear regression plot of 2 numeric variables we can use regplot\n", "sns.regplot(x='yearsOfService', y='salary', data=df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0WD6YnSy0icV" }, "outputs": [], "source": [ "# Box plot\n", "sns.boxplot(x='rank', y='salary', data=df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U6Pmpr5l0icV" }, "outputs": [], "source": [ "# Side-by-side box plot\n", "sns.boxplot(x='rank', y='salary', data=df_new, hue='gender')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Dn-ibN7k0icW" }, "outputs": [], "source": [ "# Swarm plot\n", "sns.swarmplot(x='rank', y='salary', data=df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7aR8SY2r0ich" }, "outputs": [], "source": [ "# Categorical plot (catplot replaced the older factorplot)\n", "sns.catplot(x='rank', y='salary', data=df_new, kind='bar')\n", "print(df_new.groupby('rank').mean(numeric_only=True))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UfRCtVU20ici" }, "outputs": [], "source": [ "# Pairplot\n", "sns.pairplot(df_new)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "8ABnDMpa0ici" }, "source": [ "### Exercise 7 - Graphing data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HP7rVYGp0ici" }, "outputs": [], "source": [ "# Use the seaborn package to 
explore the dependency of arr_delay on dep_delay \n", "# in the flights dataset. You can use a scatterplot or regplot using the flights dataframe.\n", "# " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "iWKUzQNP0icQ" }, "source": [ "## Descriptive statistics\n", "Statistics that are used to describe data. We have already seen the DataFrame describe() method, which calculates descriptive statistics.\n", "\n", "Descriptive statistics summarize attributes of a sample, such as the min/max values and the mean (average) of the data. Below is a summary of some additional methods that calculate descriptive statistics." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "7qiHcIAJ0icQ" }, "source": [ "|Function|Description\n", "|-------|--------\n", "|min | minimum\n", "|max | maximum\n", "|mean | arithmetic mean of values\n", "|median | median\n", "|mad | mean absolute deviation\n", "|mode | mode\n", "|std | standard deviation\n", "|var | unbiased variance\n", "|sem | standard error of the mean\n", "|skew | sample skewness\n", "|kurt | kurtosis\n", "|quantile | value at a given percentile\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "y2izjRmU0icQ" }, "outputs": [], "source": [ "# Recall the describe() function, which computes a subset of the statistics listed above\n", "flights.dep_delay.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yh4S78a80icR" }, "outputs": [], "source": [ "# Find the index of the maximum or minimum value\n", "# If there are multiple matching values, idxmin() and idxmax() return the first match\n", "flights['dep_delay'].idxmin() # minimum value" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "k9D_xUPK0icR" }, "outputs": [], "source": [ "# Count the number of records for each distinct value in a column\n", "flights['carrier'].value_counts()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "a2XW__P90icj" }, 
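"source": [ "The quantile() method from the table above has no example yet; as a minimal sketch (assuming the flights dataframe is loaded), the quartiles of a column can be computed by passing a list of probabilities:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute the 25th, 50th, and 75th percentiles of dep_delay\n", "flights['dep_delay'].quantile([0.25, 0.5, 0.75])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {},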
"source": [ "## Inferential Statistics\n", "Use data analysis on a sample of data to infer properties and make predictions that cannot be derived from descriptive statistics. Examples include predicting a new unknown value based on previous data (linear regression, machine learning) and hypothesis testing (such as T-tests)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "A5dv3AVB0icj" }, "source": [ "### Linear Regression\n", "A linear approach to model the relationship between a scalar output and one (or more) input variables. With one input and one output variable you are finding a line of *best fit*. You calculate the slope and y-intercept of a line that minimizes the distance between the line and the existing data points. You can then use this line to make predictions on unknown data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qSC1beWi0icj" }, "outputs": [], "source": [ "# Import the statsmodels API:\n", "import statsmodels.api as sm" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DZhLEStD0icj" }, "outputs": [], "source": [ "# Create a fitted model for salary ~ yearsOfService\n", "# Note: OLS takes the response variable first and the predictors second;\n", "# add_constant() adds an intercept term to the model\n", "lm = sm.OLS(df_new.salary, sm.add_constant(df_new.yearsOfService)).fit()\n", "\n", "# Print model summary\n", "print(lm.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "otaTnzG20icj" }, "outputs": [], "source": [ "# Print the coefficients\n", "lm.params" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dC0PDTYx0ick" }, "outputs": [], "source": [ "# Using scikit-learn:\n", "from sklearn import linear_model\n", "est = linear_model.LinearRegression(fit_intercept = True) # create estimator object\n", "# When you use dataframe columns, the model is fit with feature names\n", "est.fit(df_new[['yearsOfService']], df_new[['salary']])\n", "\n", "# If you pass the values, you don't have to use feature names\n", "# est.fit(df_new[['yearsOfService']].values, df_new[['salary']].values)\n", 
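"\n", "# As a sketch, the R^2 of the fit can be checked with the score() method:\n", "# print('R^2:', est.score(df_new[['yearsOfService']], df_new[['salary']]))\n",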
"\n", "# Print result\n", "print(\"Coef:\", est.coef_, \"\\nIntercept:\", est.intercept_)\n", "\n", "# Predict\n", "# When you predict, you have to use the feature names, otherwise you get a warning\n", "pred = est.predict(pd.DataFrame(np.array([21]), columns=['yearsOfService']))\n", "\n", "# If you created a model based on values, then you predict with only a value,\n", "# though you have to pass it with the same shape as coef_\n", "# pred = est.predict([[21]])\n", "print(\"Predicted salary: \", pred)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "f8LJHj4M0ick" }, "source": [ "### Exercise 8 - Build a linear model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "X4V5kbci0ick" }, "outputs": [], "source": [ "# Build a linear model for arr_delay ~ dep_delay\n", "\n", "# Print model summary\n", "\n", "# Predict a value\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "collapsed": true, "id": "N8ca0v6_0icl" }, "source": [ "### Student T-test\n", "Used to compare the means of two groups. In this case you have a null hypothesis that the two group means are equal. The T-test then tells you whether you have statistically significant evidence to reject the null hypothesis.\n", "The T-test has two outputs: the statistic and the p-value. The statistic (or T-value) quantifies the difference between the two mean values. The p-value is the probability of obtaining test results at least as extreme as the result observed, assuming that the null hypothesis is correct.\n", "\n", "More succinctly, the p-value tells you how likely it is that your data could have occurred under the null hypothesis. A small p-value indicates there is a small probability of observing such a difference in the means if the null hypothesis is true, and therefore provides evidence to reject the null hypothesis (i.e., the group means are different). 
\n", "\n", "Conversely, a larger p-value indicates there is a large probability of observing the calculated statistic under the null hypothesis. A large p-value indicates that you fail to reject the null hypothesis (i.e., there is no evidence that the group means differ).\n", "\n", "One area where the T-test is important is clinical trials. Consider an example where you are looking at whether a drug reduces cholesterol levels. You have 2 populations, one where the drug is administered and another where it is not. You can use the T-test to determine whether there is evidence to suggest that the drug leads to a statistically significant difference in cholesterol level between the two populations.\n", "\n", "Below we calculate whether there is a meaningful difference between male and female salaries. Generally a p-value below 0.05 is considered statistically significant." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "W2yyiAe30icl" }, "outputs": [], "source": [ "# Using the scipy package:\n", "from scipy import stats\n", "df_w = df[df['sex'] == 'Female']['salary']\n", "df_m = df[df['sex'] == 'Male']['salary']\n", "stats.ttest_ind(df_w, df_m)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "xM391jsk0icl" }, "source": [ "## Tutorial Evaluation\n", "After the tutorial please submit an evaluation by clicking on this link [Tutorial Evaluation](http://scv.bu.edu/survey/tutorial_evaluation.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [ "M3FbA7um0ibR", "vYNDrGEL0ibY", "3f4HD5E90ibn", "PQprwXx40ibo", "tj85w0PMttNG", "KzvZFygO0ibz", "5M2fqDoL0ib6", "6Fir_77Y0ib8", "vzxWbXKF0icA", "XZjjbRia0icD", "JGOFyTVm0icN", "iWKUzQNP0icQ", "sPGu6w9j0icR", "a2XW__P90icj" ], "name": "dataScience.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { 
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 1 }