{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vqv_wAtk0ibM"
   },
   "source": [
    "# Python for Data Analysis\n",
    "\n",
    "### Research Computing Services\n",
    "\n",
    "Instructor: Brian Gregor\n",
    "Website: [rcs.bu.edu](http://www.bu.edu/tech/support/research/)  \n",
    "Tutorial materials: [http://rcs.bu.edu/examples/python/DataAnalysis](http://rcs.bu.edu/examples/python/DataAnalysis)  \n",
    "Contact us: help@scc.bu.edu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "M3FbA7um0ibR"
   },
   "source": [
    "## Course Content\n",
    "1. Python packages for data scientists\n",
    "2. Data manipulation with Pandas\n",
    "3. Basic data plotting\n",
    "4. Descriptive statistics\n",
    "5. Inferential statistics\n",
    "\n",
    "\n",
    "## Tutorial Evaluation\n",
    "After the tutorial please submit an evaluation by clicking on this link [Tutorial Evaluation](http://scv.bu.edu/survey/tutorial_evaluation.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python packages for data scientists\n",
    "- [NumPy](https://numpy.org)\n",
    "    - Introduces objects for handling n-dimensional arrays such as vectors (1-D) and matrices (2-D).\n",
    "    - Introduces functions to perform advanced mathematical and statistical operations on these objects.\n",
    "    - Provides vectorization of mathematical operations on arrays and matrices which significantly improves performance.\n",
    "    - Many other Python libraries are built on NumPy\n",
    "- [SciPy](https://scipy.org)\n",
    "    - An enormous collection of algorithms for statistics, linear algebra, optimization, differential equations, numerical integration, and more.\n",
    "    - Developed and released with Numpy. \n",
    "- [Pandas](https://pandas.pydata.org)\n",
    "    - Adds data structures and tools designed to work with table-like data (similar to Vectors and Data Frames in R)\n",
    "    - Provides tools for data maniuplation: *reshaping*, *merging*, *sorting*, *slicing*, *aggregation*, etc.\n",
    "    - Easily allows to handle missing data\n",
    "      \n",
    "- [SciKit-Learn](https://scikit-learn.org/stable/)\n",
    "    - Provides machine learning algorithms: classification, regression, clustering, model validation, etc.\n",
    "    - Built on NumPy, Scipy, and matplotlib.\n",
    "\n",
    "- Machine Learning libraries\n",
    "    - [Pytorch](https://pytorch.org/)\n",
    "    - [Tensorflow](https://www.tensorflow.org/)\n",
    "    - [Jax](https://github.com/jax-ml/jax)\n",
    "    - For more info on using these on the SCC see [this page](https://www.bu.edu/tech/support/research/software-and-programming/common-languages/python/python-ml/).\n",
    "\n",
    "- Pandas alternatives\n",
    "  - Pandas is very popular and it has some alternatives\n",
    "  - [Dask](https://www.dask.org/) - process large scale data in parallel, built on Pandas.\n",
    "  - [Modin](https://github.com/modin-project/modin) - another library for scaling up Pandas to large datasets.\n",
    "  - [Polars](https://pola.rs/) - Similar functionality (but not built on Pandas), fast, parallel processing, gaining in popularity.\n",
    "    \n",
    "### Visualization\n",
    "More in-depth look at visualization in the `Data Visualization in Python` course.\n",
    "- [matplotlib](https://matplotlib.org/)\n",
    "    - Python 2-D plotting library for pulibcation quality figures in a variety of hardcopy formats\n",
    "    - Functionalities similar to MATLAB\n",
    "    - Line plots, scatter plots, bar charts, histograms, pie charts, etc.\n",
    "    - Effort needed to create advanced visualizations\n",
    "- [seaborn](https://seaborn.pydata.org/)\n",
    "    - Based on matplotlib\n",
    "    - Provides a high-level interface for drawing attractive statistical graphs\n",
    "    - Similar to the ggplot2 library in R\n",
    "- [plotly](https://plotly.com/python/)\n",
    "    - over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases.\n",
    "    - Built on top of the Plotly JavaScript library\n",
    "    - Can create beautiful interactive web-based visualizations\n",
    "- [Datashader](https://datashader.org/)\n",
    "    - Used to create visualizations and plots from very large datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Python libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Ea4C8rUR0ibU"
   },
   "outputs": [],
   "source": [
    "# Press shift-enter to execute a Jupyter notebook cell\n",
    "# Import the Python Libraries used in the tutorial\n",
    "import numpy as np\n",
    "import scipy as sp\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ca3M2mv_0ibX"
   },
   "source": [
    "## Pandas\n",
    "The main focus of this tutorial is using the Pandas library to manipulate and analyze data.\n",
    "\n",
    "Pandas is a python package that deals mostly with :\n",
    "- **Series**  (1-D homogeneous array)\n",
    "    - the array has 1 data type (int, floating point, etc)\n",
    "- **DataFrame** (2-D labeled heterogeneous array)\n",
    "    - each column has a specific data type\n",
    "- **MultiIndex** (for hierarchical data)\n",
    "- **Xarray** (built on top of Pandas for n-D arrays)\n",
    "\n",
    "The Pandas content of this tutorial will cover:\n",
    "* Creating and understanding Series and DataFrames\n",
    "* Importing/Reading data\n",
    "* Data selection and filtering\n",
    "* Data maniuplation via sorting, grouping, and rearranging\n",
    "* Handling missing data\n",
    "\n",
    "\n",
    "In addition we will also provide information on the following.\n",
    "* Basic data plotting\n",
    "* Descriptive statistics (time permitting)\n",
    "* Inferential statistics (time permitting)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vYNDrGEL0ibY"
   },
   "source": [
    "### Pandas Series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hQpjwDv_0ibZ"
   },
   "source": [
    "A Pandas *Series* is a 1-dimensional labeled array containing data of the same type (integers, strings, floating point numbers, Python objects, etc. ). It is a generalized numpy array with an explicit axis called the *index*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "iPJvEWlG0ibZ"
   },
   "outputs": [],
   "source": [
    "# Example of creating Pandas series :\n",
    "# Order all S1 together\n",
    "s1 = pd.Series([-3, -1, 1, 3, 5])\n",
    "print(s1)"
   ]
  },
  {
   "attachments": {
    "e141912c-cf3c-4599-a21b-4c3c3f2f6785.png": {
     "image/png": "iVBORw0KGgoAAAANSUhEUgAAAMUAAACUCAYAAAAj6u2iAAAgAElEQVR4Xu1dBVxUWRf/M3QJgigmqIhrt58tiq6ioKyoWKAotZSgYid2gIEtdq2FuoC5roFrx6qIgoktGCA99b77BkeRRZ1RBmZ49/1+X+Dc9+65/3P+95xzU41hxAzoQxGgCHxCQI2SgloDReBLBCgpqEVQBAogQElBTYIiQElBbYAi8G0EqKegFkIRoJ6C2gBFgHoKagMUAbkQUHj4xDD3ML9DbaROYDCvB50SkUs7xVSY6ugHh2QFiTHwrD8fbW/Hwr2W7MZNAS8my/6JauTRUXzMIWhZ94SVHDbwE6KVyKvUU5QI7MpVqaykkLWccrVOfmlkJoWYiYG7YRRcM9agAxMt+f+Nd4mxZ95dnD17Ft09Y3B4jZ1EguxoD9g6rMdNrXZoM8wJVeKDUP5j+CT9Tb1dO/DU6sJl/RqJ52H/vXVUH/xLviGpq8N8tNkgn1eSv/ncfeNbOrod4wkP+wiwOrp+CfDbH4u5dolY6DUc49b9A/2Pel2zJhBZMR7/KavqYfIPk2I4zwHloxOxoKeVxIhHaOeFVq5MGGwa7MfwuDyD5icuhk3t0egQw2COXR6ZWGLZIO8376YJn/7e59keF3tvxP8OumFH3Y3YF2jFXatVYMslOpFDR93cedhwJhCWTOIX+WH+jlKqT2nZ6kS/qvr8MCnyG3eeWx2BcqRnH5zgjnoL6uMEAZEFJr/LnSZ2hx7xIO1IDyR92J5oyUcCsSCzZPuXeB3WY9BHMQiwXuJrOmJ7eVZnhw7dQ/zBgyD/jcgd9bCXdGQdC5CCle5rZVmSqOpT7KTIr4yCoLE9mO2I/ZKwasaZPG9Cn6JH4FukCLHK8/TtIjfA3Zoh3iEanYk3n1UIKaQep7Cyqqy7IidFYeGTJwmfzD+GT9Iwiw2t8oddI6xY15znbVpGt8G4u5M/5ShFbxbc/mJh4ZNUR6w3z99x3Vzsic5TeBJPwRr6Pk8eLjvmDa8XJFfBsqqKcpGTgjV2aaL2kIRJDes6oTNJtKXzFJI8YsRoHEI7ZJDQyYf0SGxewoK9o26iJI+QhmOpE2Lp3IaCLOtrOpIm1CGb1dCkJdBifB+kOt/9lPdJE3QDz8VYvboH9pLk+2tlFSS6wj8rMykULgmtgCKgJAhQUiiJIqgYyoOAUpDi5I0HqGdRAeWNDZQHGQ5Kcu5OEtrUseBgy79scomT4tr9Z/BcsBV6ujpY4t8PTa2qcF4pJQHA4shT2BQVi2EO7RHUx6YkRFCaOkucFMmpGfAN343Eh8+hqamBKW690Lt1PaUBiAuC7D7zL2ZvjJI0dUngIHRqVJMLzf5qG0ucFKxkuXwhgiMO4vTleImgTl1aYurgbpxWTHE1ng1dg1fshkBAdDDEDkNsmxdX1Upbj1KQQopOxJELCN91XPJn68bWmO/eG0b6OkoLnqoLFpf0Ch4kdM3KysEgOzI31N9W1ZtUJPIrFSnYFh25koAZG/+UKKqSuSnWjhqEqmbGRdJY+pHPCDxNScWw+Vvx5m0qOraoi2U+ThSejwgoHSlYue69eAPvsJ0Shenp6SDUrz8dFSlCk03LzIHr/C14/PQ16te2wAbS8WhraRRhDar9KaUkBQspqzgfkoDHJSRJEKbxbtEYGpu/eYfvwrW4hxJP/Mfk4TRELQCt0pJCKuekTTGIPn1N8qd9x6aYPaxn0VgHR78SvO4gjp27CaMyBtg+2Y2GpoXYgdKTgpV524krWEIScHaEhHX3K/37097tB0gtnYtgh743TXJDfQvzH/hK6X9FJUjBqoGdbR29fLckAS9naozVowaiVqVypV9DRdRCOhchO5AqQwq2SeyIiR/JM9gEkU3Ap5GJvu7Na8veWo6WpHMR8ilepUghTcDHkYm+8/8mSlrq79wV7t1bfWo1m6Anp2Vw2ouweUNwP1vJWjI6FyEfIdjSKkcKaRNDth/Fvr/IhgzysOPsC8lEH/sMD9uB2lUrcHZGnO0UOvgtlISYwQN+xcI/jtG5CDl5obKkYNt58PxtzCQTfWwCbl2jMqwJGdiRKjaRPLTAn5OrbllMpq6N/MIM6FyEfKxQaVKwTWVX2QaG70Hah4wvWu7VpxN8HD4fkCAfLKpbWjrkmr8FBUNM1W1d8Uiu8qSQeoyCvSM7Dn98oT/nZmptx4RLwqWCD13bJDuhVJ4U7IhU/+nrJEO1BZ8Qzz6cWobOYmE/Nvw/OLDhZCeSd0mTb9nNg5slVZ4U7Iz3UTJDy+YVBR9LkmMcDPHkjGbzz0WwjWaXcQwmy/AdWtWnk51yWIHKk0LaVja3YP9zk2xWuv3wxacQYtVYV84sJmTziZNkTwrrFQZ2akZ3McpBhPxFSw0pCrafDSWu3X+OMmSSr7CdZIwoFzk5QogYNahr6kBXm/eDEBb9a2Ihn2y8EkAkzvu2upYOdLTUofadqthl963JHmu6B+XndFJqSfEtWMSiHMSuHoqwqGd4m6UD/Rq/YfocN7SupP9zaBbR289jlyJs1Tace2YMLdEHmDYbDL8AP3Sy4n2XGEUkAqc/wzlSMGIBzq5ywYJ/LODq74qGpuk4HDodp0xcsWC0M6zLqZe4QfAz3iDlzVtk5LIkeIPT65bgZm5HuMzwRUsTepSoohXEOVIIci4itHcYeJ6z4OtkDX1yFGTa9eXwCbqMtpNmYWjXapJ/U6bn0b4ghJ4BOnktgVNd5ZJNmXAqKlk4RQoGImScXwrnMckYsWUCHC30kPv8COaOX4Hoi4/QJeRPTBjwC0w0St7wxCKS74hEEIrisX3iHiRVaAuvIHtU0S552YrK+JT1O5wjxYeTC+G8URcB3mY4v3IPLj7SQkf/geAdXwy+TTi8+jeCuYINjxx2D7FARBJpYuBs9swm+xrqUFf/nEq/OBuOJau34fS9XBjWdILvmLFwbKpNc4piYBLHSMEgI3Y+7CdEIUdcBb8FBKBbgyqoXlsbR4IHIbFpGDz7N1E4KcT8m9gaEo6dJxLAN1CHIL0OnCcGwK1XnU+hG//DK7x8lYzUbB4yE3Zjy3UBWvUeh2Gt6CEOiuYFp0jBgpmdsAFewxZD034lZozrgCokVBImR5E94FtRxW8u3B2sYKSm2BCFEWfi9ZPnSH6XBTHrHUT6MLOoDHNTXRSW5gue7sboObHQbeuD+UPqKNomOP99zpFCnPsWB0J+w4Y3gzBjljuamKrj0po+mH6jE2aM90ILS22U9IwFQ8Kqz7RkEL9vAibueo+OXiEY3ZVuIVU0azlHCom3eHceK8dMxdHHfAjUdKBZpwcmBriiXU0TaKgr1kt8T6HsYMCtyAVYuSkad7MMoS5Ih75VZzi5eMG+ZWWY6n1vCu97NdDfv4cAJ0nBgpL+IgGPXn4guQUPOua/oGFV5Zi4Y31EZvJjPHmWggwRmcVmRNAxq4Fa1c2go2RDxd8zLlX9nbOkUFWFUbkVjwAlheIxpjWoGAKUFCqmMCqu4hGgpFA8xrQGFUOAkkLFFEbFVTwClBSKx5jWoGIIUFKomMKouIpHgJJC8RjTGlQMAUoKFVMYFVfxCHCWFAzzHsdCJ2L3tRoYPN8bNlUNS3zNU0F1P/hrCZYtv4OarkEY3qcODOiMtuIZQWrgJCn48TsxKvQijPVfIvpvc4zeNRmD6pkVukK1WLRQoBJR2iWsXPQn7r9+jDu3RKju6I1542xQlpKiWNTBOVKI+C9weNoq3GvQGh1N4zF+1lsMXT0KA5SEFAyTgysbNuBksjbadTHAvkWxyGreH3ODO8KYkoKSQhEIiEVZeBqXDLNG1aF+ORS9Al/BZW0wBioLKcgq2eR7z6FmWh6mOucw2WUH3rVyxXxKCkWYQ6Hf5JynyI9C1tmF6D0uBUOViBT55RO+jMYE/4NI/58L5lFSUFIoGgF2iXbW2UVwLAFSiPnx2BMWgcgzDyDU54GfaQ3HkR4Y2K0W9PKFSMJXMZjgd4CSQtHGUOD71FOwpFg3FoPqliu20SdG9BZ3L91AwpN3EGmpQcQ3Qe2WjVG3ugk085PiNSGFbx4p2PDJiOYUxUIPTpMi99pS9ApIhse2KehrqVMsgMtTiSjzb0xxI3dvtB2BpSObg17/Lg96P16Wc6RgxCJc3zUKq488w5sXz3HjPh+V61ZHZaMO8Jznhg7VypT40OyrSxsRsTUSV5Jy8CThNQRly8PSrDHs3Nzg0qc+DKnH+HGLl+FN7pECYry4cQgXE9KQQw6UMSIxfXZGNoTkyJuWds1haVzyBxekP7uKy9fi8DzDAGUMtaAmzEZamgFqtWiOJvUqQJuSQgbT/vEinCPFj0NF3+QKApQUXNE0bafMCFBSyAwVLcgVBCgpuKJp2k6ZEaCkkBkqWpArCFBScEXTtJ0yI0BJITNUtCBXEKCk4IqmaTtlRoCTpGAYsrlo5lIce5SCbB0DNO07Bm421aBbwocrF9Taqys7sHnLfVR1dIVj5+pfLBaUWcO0oNwIcI4UjEiAWwd/x+ZTlqhQ1QAaWddw5l8reIQEoGuDstBSgtlicU48/lh7DHF3z+Of2GxYDx6NBePpzju5rfsHX+AeKcjap8cXdiG1xgA0MeeBYe5i6UA/POwwHePd2qOSbgkfxc/k4tYf67H32gc0sjHF6a1xEDbvgzljbOjOux80cnlf4xwpCgLEMEnYMLAvrpCrvSb6dUBVvZImhQCPL96BuLIlqptfx5QBW/GG7ryT165/qjznSfH0zFgEzniGntPCMLB9Regq+GovebRFd97Jg1bRleU0KbLjyV1yIxcives8zA7ojGo6ynVLEN15V3SGLs+XOEuKrPg9CJmyFo8sBiN4vAualy/sCkZ5oCz6spQURY+pLF/kJCmEr89i7qQVeFK1H8ZPc0JNJRhxKkxZYv5pTHXZhQ/tPbHMr7Es+qRligABzpFCJErEft/fMXL3czTo0xvNKqiBnyuGmlpD9B3tiObmBsW2V/tr+ntz6wD2Rf2FuKRkXI19gNyKVmjxS0vY/OaIXl2s6HxFERj+tz7BOVKIRY/w17qjuJeVhSx+DnIEIjBigMerC3uv7mhYXr/ESfHu7mHEHItFYqoxTIy0AUEG3r8zRTO77ujank7iKZgT3Dw2U9Gg0u+rNgKc8xSqrS4qfXEgQElRHCjTOlQKAUoKlVIXFbY4EKCkKA6UaR0qhQAlhUqpiwpbHAhQUhQHyipeB8Pcw/wOtZE6gcG8HiW7YLI4oKSkKA6UVbwOeUgRH3MIWtY9YVVLfvL8zLtFCTElRVGiWUq/JSspZC1XGEw/825Rw845UjBiMR7ERuDwlTdI56tDnacF6+5u6N3IuMRnsqXKfZ94Akf/Pov77/SgQTYdmTbsgZ6/NkMlbfl73x81mOxoD9g6rMdNrXZoM8wJVeKDUP5j+HQ7xhMe9hFQb9cO1y8BfvtjMdcuEQu9hmPcun+g//GdNWsCkRXj8Z+yBUMwlhD53201tA6q3Vn3qT62Daw8raP64NpqMdwNo9B4lxh75t3F2bNn0d0zBofX2EmaKpWblY2nVhcu69fAXU6vxUFSiHA7Zg52nH6JtBwNsrzjBZLfNcCwoADYNDVRiv0UKTf2Ytf+o4hL0YeG+A1S3uugZTdfDB7UFObFQAx+4mLYNNiP4XGxEoOS/F17NDrEMJhjFy0xSteMNbAhCynZ37q587DhTCAsmcQvcg8xE/PVstULLMIs6CmkJPj3o7Hv8+ThsmNe/cN5DkhfnIh9gVZk5+TnfKcw2bybJnySVdYOgnOkKAgMw3zAbo/2+Nt6KSb5dkI1/eLrjWVV0u2IoQiLLQuHqUvhWFPx8rEGWW9BfZwghs4ab0GDZf8+dOge4g8eBPlvRO6oh72EJB0LkIJt39fKsoTK/xSsgyXUCO35aHs7Fq5M2CfiWTBfkpL9xs3FbeF6dzLOO0RCj3i3dsRLSB/Wky35SG5Z8aakYJKx1T8AD9oFwsehNSooGSkYRoira6Zg55Oy6PX7OHSsWrKkCLEKk3iRdpEb4G7NEO8Qjc6kN55VCCmkHqewst8jhdTYZ6htxjRmKNj/ZT1DQe9TkBT5ySwrCQqW4yQp2Pvuks5txrFrb/D+5XVcfuCAmaucUaes8uy8S3twmvTUF5D4Ogm3rpZDZ9cAjHAo96N6luu9wsInTxI+mZPwaZrY/QsvcnOxJzpP4Uk8BWvo0jCHzRsKepyCZQsKlf9d9jdWjq7uCajJxKPNhrxQjiUFGz6Vj07Egp7/DZ+k3kVaNv/fsoLAUVKIEXdgOraceIb3gjIwq1kNDVvZo0dzaxjpygqdYsul/LsLO/fG4OoTLZhWq4TqzW3R+X9tUa9i8VzyJU2mH5JQpGFdJ3QmiTY7TyFNqEM2q6FJS6DF+D5Idb77KW6XJroGnouxenUP7CXJ99fKFkQw/7tsks6GbixRZqrFQJpbSD1F/ZkM9kfe+U+izRLJe8RoHEI7ZJDQyYd4NJY88jycJEVBgJ7tcsGgo00xd5oX2lroyYNfsZTlx6+EV+glVOkxHTOdLIulTmWopLBh2sLCp6KWlZKCIPp4uw9cDllh5mxP2FgaFDXGP/29nFtr4DXvKio5BGPugFo//T1V+IDUaxjlG25l5aakUID2GHE6bp+IwuVHH5DL8KAGcn3vtYfQbumKwP7tYG5YsnkFm++8unUK5y7fxmuBNtRFOXj16Bae6zSDs+tQ2NZSvltcFaCmEv0k5zwFI07BiZXzEHktBZlidp5CHQ2cpmJkz6olfisqawkklUTi8XXYvvckHuYaQF2YAbOWg+E5rBesyyh+5KlErVFJKuccKZQEdyqGEiNASaHEylFF0c7dSUKbOhaqKPonmSkpVFp9yiX84shT2BQVi2EO7RHUx0a5hJNDGkoKOcCiRb+OwO4z/2L2xihJgSWBg9CpUU2VhYuSQmVVpzyCn7zxAMErdkMgECJ4iB2G2DZXHuF+QBJKih8Ajb7yGYG4pFfwWLAVWVk5GGTXBuP626o8PJQUKq/CkmvA05RUDJtP7s94m4qOLepimY9TyQlThDVTUhQhmFz6VFpmDlznb8Hjp69Rv7YFNowaBG2t4lmXpWicOU0KhnmHG7E38F69Cho1qwETHeU4jl/84RnibsfjwcsMQI0HsZAPXfO6aNSoASorwQReLl8I7/BduBb3EJXMTfHH5OEw0i89M+2cJQW7nCL51HSMm3wIF7UcsXJzADpVNVR0JyTT97Mur8CY0F24nGIGSxN1iHIzYNK0H0YMH47W1Up+Vjt43UEcO3cTRmUMsH2yG6qaGcvULlUpxFlSCN6exKzABdCqZoUTLxph6rR+ZDGgkVLoLf2fxVh8SQdNHXxgb1XyJMgPinQuQlNTA5smuaG+hblSYFaUQnCSFIzwDY7MDkJEWjeM6vwMi6IN4Tt2CLrUUB5SLLmsh+aO3rCzVB5SlKa5iG+RiJOkeH8xBP4z0/DbzMnoItwG11UM/CYPRVclIUXGuTBM23sf6pY9YEPmwMS6FiTnaYiqRiVHkNI2F0FJ8REBNo8QpyVhY9hEvGs9C2O614DgVjj6hWth3CxPtC1fssvGpYri3z+CVRHbcCQuHdo8IdJyjdHC3hm/D/gV1c2KJ6Fl84bgfrYob2xAblQqfXMRlBRSUpBDAJL2+KLfek0MGtYD9U01kXv3IBae4MHeqTO6d+qC2lUMoaVE1wazomdcX0o82XEY95qFBV5NoKXgO/rY4dYOfgtRztQYwYSIC/84VurmIigppKQQC3F5uz/WHH6OlAx1speCgSgjBY/fMChXwRS9/VdgeHcLlNEouTClMGUJHm6Hb8BhqHfxx4LAVjBUMCkOnr+NqWsjvxCltM1FUFJ8A4Gc8/Nhv0IbE+Z6w7Zq8YQm31IIG+KlPY1Hwv1nSBNpgifm49HFAziZXA1OHiPh1FC/KAdaCv2WdMg1/4/+zl3h3r2VwutWhgo4mWjnBz731k5MjNLAQLeeaF6x5A8tYHfe3T4YirVbDiMh2xA8svOufCtX+PkMQwvz4vFgtmPCJeFSwae0rG36HvE4T4rvAcS139n1TPZjw//TbHZeohNZ3yRNvkszLpQUpVm7P9C2/HMR7OvsMo7BXVrCoVX9UrWUg+YUP2AcXH2FzSdOXo6XeIWBnZqhqVUVzkFBPQXnVP7tBh+5koDWZI91aVrgJ6+KKSnkRYyWL/UIUFKUehXTBsqLACWFvIjR8qUeAUqKUq9i2kB5EaCkIIiJs97iyXs1lCtnDANtnrwYKm353PQ3eJ2SAQ1jc1Q00SHn5tJHFgQ4Twp2WUU2WardP0ITAZOG4teaeXsqst49wwcYw6QMWSCoZGuhZFEsWyYhahpGzTiGqsOWY7lfM8izg5rd7ZealgpGrzLKGRROJ4bJwev7iXidxUBdSx+mlSxR0ejLWsS5qXiS9Aa8MhXInIehXDLI2s6iLsd5UrCA5l5ZAqcVavCbNAzdrfJIcWVlT2wW+iDY3R7V9IpneUVRK/f1xfWYF34GpvbjMWFAHbkOkM65H4PV69eC6b4fQR3/6z35GeS09gsHsXPtFpx7KZTsIXfwXQhfm88btRhxKu7snwH/kGvQ6RiAiGV9UVHBixmLAkPOkkKQ+hiPX2WCL1ID/9Y2jD9UAcEz3NDBNBNPXqbiwmZvHBb3g0vvLqhWRgyUMUd1Mrurw0tH8otMaJU1gZEBWbBHtMCQa33fP38LkX5ZmJTVRe7blxDoloEoNQ3v3r1DjrohzKuyPe7nXlSc/RYvnr9Eao4aeLpmqFqlAgwL3Hya8yEFb9KIwZUtB1NSl6wPk0tkfP0SKR+E0CAymZmZkfc/152b8R7ZQiE5FEGLLD58inQ1bRiZkJ68YhnwBNl4l/wMiVeO4kD0fohbLYRbK22INfVgVK4SqppogxG8xvmd07BkvwaGLlqBnh8vp2QITOScBckjuVLg9j6snxmGF+U74DFaYt1SJ1RWsmX5hWHKSVLkfEhC1Ap/bDmZig8CLajz05Bu5owlK3xQ8/5yjFv6F+48eg6+hhExfD3Sw4qg28obi0b1h7X+aUx1XglNp3HwdWkOM3UGwjfRGOe1H6aOgfBzqY9Ls/rhGK86BAmZSLp3HS+yDdBqxASMdLdBdR0ehFkpuHRwLtbsjMW9tzoQ69VBP38/eHZvBEOtPK/ELgy8EuGNMWufodPYZZjeV/YrqgRPziBieRg2n3iEHPP/wdl/BiZ0r/hJ/4lHF2FHzD/IJMaadng3bvK1ULWJA/xDg9BWcAeRq2ZicdQT8PnkNBHdCjDWzAW/7C+wcZmIGX0skXr3OLasWoz3rRfBuQm5Q4MwQdfQlJDKSOKNWEIIU25j59JFOMZvh0GNkrDqcmOsDu2LygQvZX84RwpGlIFDs/tivWAIZvv2RXVjNWRfXAKXTboIIFtSu1jqk+MfRfhnRS/sEo6Av0tvWJkQNfI0SG6hDjW19zg0zRU7Bf0wKtAFTSuo4cXBUQiIrQ4Pdw90rQ0cmzME83ddR+0R6zHWvRVMbyyH3/JnaOkyEj52FkjY4oeZV03g7DUJjnV18fT4RIzbm4UuQ6ZgWHvTPO9DSHFr9xTM2fESrb2mYSR5T9aHEYshFAqQfScSs3eeBdN8HBb0rfbp9QcnwzBj9Gpktp+EqfOcUeP5Saza8ieSagYi3KUWOVJHiDdxB7Bp+2aIO2/FKFs9Ig8PGhqa0FDPQcKpbdgw5QR4dpa4dOICBDk8mFRqhSGzJsDxlzLk7u/3iItcgDlb9eC9zAOG55dj6vlGWBPWj5JCViUWZzl+5inM7bUV5hNnwMW2Klh1515agr5rePCf+DnRPh9uh21Cb4xyc0RN4y97N378Wgydl0JiaA8MbKmHyAkuuGnxO4YN7QZL3XTETB+Is3qDMdhrIBqQdxnmATa4zcezuoPwe/AvOO41EZf02+A3vy6wIAwQ5ZzDylGnoNPNFWMC28PkY9wtFokgEhNzJBfLqKvLP3YkuHcAMzf/DX6TMZjn9JkUdw7NxrYDD9BkzHr0tVYjG61uYMcKsv2V3xMRUzpBl9Sffucg1m2MgKjbXgTban9SEZtcPzi9HBMD/ob5+IWY/1sNiAUvcGHDSiw8WQOh+3xg8fQCNkWEgm+7BUGdRLi9ZyHGn2+GTYsdUY7mFMVp7t+vSzLSdH87gtzuw35DAOxqmUhGQ3LOL0K/dZrwJ4m2dPQpdsmv+IPxReBQR9QyKXgR+h2sHDAP6T2C4N4zGUvn3EHz/s7o3cqcECBDQorzZd3h6u6I2gYsKR5i47BQPG84AD4jy+EPr3HYei4Z2lUMSK9KBFATIeu9IWy9xsLXs32RJaP8u/sQsvU0hE3/S4od0UloHrwOvauT3Ydp17F99U4S6vTAuo+keH8rEuu3rAfDkqLL5ytjGSYLd49HYPX8JPTYRS59L8e2T4jH5zdh+ZQHsIvwhcaR2Qi5UAezp/ZCeWEy4g5vRfiNXzBrSi/UrlgNZXWVO4TiVPjEkiLn0R6McbuIduETMLAB6bfIHXjn1wVi8klrTJzrgy7V8w5EO7e0OyIy3THetx+sCzlF48amQVj7tBs6GiUgXq8Nevd0QJOKrIHkkeKCqSeGe/dCDZIjCNOiMcntb5Tv6wH/gabY7zMBl0y6wm1iP9RhO2E1NTKHwEpHkm519h6+vIdNiNMyRdA2NIaRnjwDqnnvCx79iVkbT0LYfDxm96rwqddgPcX2qMdoNiYCv5Ek+RMpBIQUk/M8RertA1i7YR1ybXdjSo/Pu/0Yhk88xTKETT2Nxov/hGczIjn5t8STq7AiNA3dQgchbfNQrDirDR0tHvmSCPzMD3iTLoblLw3hMmUzBjf+fgdWkiU4RQoWaDH/MXZN9MZxgxH43bkFjJ7sxbhFfyBBrTdWRgTBplre7ahPD02C/zYxfhsxEB1qlIGangmZADOEtmaeyQpTj2Kmxw6ce6GL4asnoHcDS0koxpLi0J+rL/UAAAODSURBVIxB+IvpgZ797VBTT4wHx2ZhzV1LDHUfA/t6unhwYCymHM+BjYMv7OobgCgBjLYhTI2NoKeVZw5sTnF1ox/Gr3uOjmPCMKWP7Pc98DPf491bQqjEowjffwHCeh4IdqgF47KmklGo+OgQbDnwiNyBvQlO5LA1UdpVbFm+HUcE9tg03VZCCuHrK9i8fBmOZv2KaX5toauuDX1jM1Qoo4mspH+wesEEnNN0x9SRHaCX9gh/7duG8xrOWDS5C4yFuRCKyHdFYoL3K1zftxIhlxthVWg/WOrpK/28D+dIwRpczuM9mBy0EZeTs2BQpx8Ce2Rj2wVTjCCJdweLPE/BiF7jwKwgbDj1CO9FWijT5ncsCuqLuhXyemyJ8U9ywqIXnTF9xih0tMgbMmWYTByfNxyhB+4gVbs89JEFcZ3+mBg0HLbWZHSGHJbAiDLxz+ZpWLXjDB6Ty+21xZmkTF+MDfRCj7p5pGRJcWPneIRseYG2vrMw2t5S5s7z4anVWB62ERdT80bOGHIWLVOlDZy9x8O/c3ncO7EUkcefoaH3Islha6IPcdi/lewDF3RGaGBb6LDkJsn6y5uRWD5jIc4kkzGlcnXQaegkzOxTQ+LTUu5EYeXUufj7tQZEmuVRz9YTM4O7ooLml3MajPgdbv8ZgXlXG2DJzB40p5BZi8VckDW43IxM5AjEUNPURRk9HnIFDBld0SCjK58TWnFuFjJz+BCy4++knKGuDkl4Pw6Zil9ix9jhuF1jFDyGkvOY9D/+Oxs+zRiIM9r94TSExNAkp1Ajs71l9DW/WGbBCHLJnQ7Z4DMkdCKeAhra0CPf19L4XL+IlJHIpakFrQLG9i3IRPwcZGfnQMCo57VHLIJQTQM6OrrQJSGNiJBEKCSz0Jo65Pc8AgjJiJOIlNfR/nzINIsTPyMD2QQnhiT7Wjp60P/4O0sMQWY6svjkNzXym64h+e2/UknO2hIIyHwQj9SvoRJLTTjpKX6Gg8JsMiGXlonnJ5ZhzgldDAjwQZ/GZp+UzXqQqMlOOGfijeG/94G1is6G/wxGqv4uJYWcGkw5E4bxy/7Ev/frwCd8PAa2I7lEvllaSfi0wAOXDAfB2cUBtQyVe6RFzuZzojglhZxqFudm4gM5QU8g1oGxiQE0SY6Q/5HM5mZnkXCFhDxaZLa8wO9yVkeLlwAClBQlADqtUrkR+D95EMGw4MIZngAAAABJRU5ErkJggg=="
    }
   },
   "cell_type": "markdown",
   "metadata": {
    "id": "YPctLCcw0ibb"
   },
   "source": [
    "![image.png](attachment:e141912c-cf3c-4599-a21b-4c3c3f2f6785.png)\n",
    "\n",
    "We did not pass any index, so by default, it assigned the indexes ranging from `0`to `len(data)-1`. Contrast this with a Python list, which always has an implicit index that counts from 0:\n",
    "```\n",
    "x = [10,20,30]\n",
    "y = x[1]  # y --> 20\n",
    "```\n",
    "and also with a Python dictionary, where the keys act as an index:\n",
    "```\n",
    "x = {'a':10, 'b':20, 'c':30}\n",
    "y = x['b']  # y --> 20\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "v4vAbdqv0ibb"
   },
   "outputs": [],
   "source": [
    "# View index values\n",
    "print(s1.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "s3I3__b90ibi"
   },
   "outputs": [],
   "source": [
    "s1[:2] # First 2 elements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4jsIOlUR0ibi"
   },
   "outputs": [],
   "source": [
    "print(s1[[2,1,0]])  # Elements out of order"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qGn-aAvx2t9V"
   },
   "outputs": [],
   "source": [
    "type(s1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "WIft0v6r0ibl"
   },
   "outputs": [],
   "source": [
    "# Can place filtering conditions on series\n",
    "s1[s1 > 0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ZmlDR_tj0ibd"
   },
   "outputs": [],
   "source": [
    "# Creating Pandas series with index:\n",
    "# fetch a random number generator object.\n",
    "rng = np.random.default_rng()\n",
    "# select 5 points from a normal (Gaussian) distribution.\n",
    "s2 = pd.Series(rng.normal(size=5), index=['a', 'b', 'c', 'd', 'e'])\n",
    "print(s2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "srVyZ9ik0ibe"
   },
   "outputs": [],
   "source": [
    "# View index values\n",
    "print(s2.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vX-IUZb30ibf"
   },
   "outputs": [],
   "source": [
    "# Create a Series from dictionary\n",
    "data = {'pi': 3.14159, 'e': 2.71828}  # dictionary\n",
    "print(data)\n",
    "s3 = pd.Series(data)\n",
    "print(s3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "sOHbkXC10ibg"
   },
   "outputs": [],
   "source": [
    "# Create a new series from a dictionary and reorder the elements\n",
    "s4 = pd.Series(data, index=['e', 'pi', 'tau'])\n",
    "print(s4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DeofA8nZ0ibi"
   },
   "source": [
    "NaN (Not a Number) - is used to specify a missing value in Pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "skyr2siA0ibj"
   },
   "outputs": [],
   "source": [
    "# Series can be treated as a 1-D array and you can apply functions to them:\n",
    "print(\"Median:\", s4.median())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "42_G7Ks-0ibl"
   },
   "outputs": [],
   "source": [
    "# Methods can be used to filter series:\n",
    "s4[s4 > s4.median()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3f4HD5E90ibn"
   },
   "source": [
    "### Attributes and Methods:\n",
    "An attribute is a variable stored in the object, e.g., index or size with Series.\n",
    "A method is a function stored in the object, e.g., head() or median() with Series.\n",
    "\n",
    "|  Attribute/Method | Description |\n",
    "|-----|-----|\n",
    "| dtype | data type of values in series |\n",
    "| empty | True if series is empty |\n",
    "| size | number of elements |\n",
    "| values | Returns values as ndarray |\n",
    "| head() | First n elements |\n",
    "| tail() | Last n elements |\n",
    "\n",
    "Execute *dir(s1)* to see all attributes and methods. \n",
    "\n",
    "I recommend using online documentation as well. This will be in a much easier format to read and come with examples.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# For more information on a particular method or attribute use the help() function\n",
    "help(s4.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "help(s4.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You can also add a question mark to get help information\n",
    "s4.head?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s4.index?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas Series can also [plot](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html) themselves:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Going back to s2\n",
    "s2.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One final way to get help is to press shift-tab when you are in the parentheses of a method or after an attribute. Try this in the exercise below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "FXjjnBOj0ibn"
   },
   "source": [
    "### Exercise - Create your own Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RUmWlc790ibn"
   },
   "outputs": [],
   "source": [
    "# Create a series with 10 elements containing both positive and negative integers\n",
    "# Examine the series with the head() method\n",
    "# mys = pd.Series(  ...  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Series from Numpy or a Series. See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a Series from a numpy array\n",
    "a=np.array(range(6))  # numbers 0 thru 5\n",
    "print(f'numpy: {a}')\n",
    "ser = pd.Series(a)\n",
    "print(f'series:\\n{ser}')\n",
    "# Change an element of the numpy array\n",
    "a[0] = -100\n",
    "# Print the series again\n",
    "print(f'numpy: {a}')\n",
    "print(f'series:\\n{ser}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tell pandas to make a copy of the numpy array.\n",
    "ser2 = pd.Series(a, copy=True)\n",
    "print(f'series2:\\n{ser2}')\n",
    "# change the array again\n",
    "a[-1]=1000\n",
    "# ser2 was built with a copy of \"a\", so no changes.\n",
    "print(f'series2:\\n{ser2}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PQprwXx40ibo"
   },
   "source": [
    "### Pandas DataFrames"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ZYRivqyM0ibo"
   },
   "source": [
    "A Pandas *DataFrame* is a 2-dimensional, size-mutable, heterogeneous tabular data structure with labeled rows and columns. You can think of it as a dictionary-like container to store Python Series objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vWXR_GKo0ibo"
   },
   "outputs": [],
   "source": [
    "d = pd.DataFrame({'Name': pd.Series(['Alice', 'Bob', 'Chris']), \n",
    "                  'Age': pd.Series([21, 25, 23])})\n",
    "print(d)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "xt6NqHhS0ibo"
   },
   "outputs": [],
   "source": [
    "d2 = pd.DataFrame(np.array([['Alice','Bob','Chris'], [21, 25, 23]]).T, columns=['Name','Age'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "cZXlWYmn0ibq"
   },
   "outputs": [],
   "source": [
    "# Use the head() method to print the first 5 records in the dataframe (same as with series)\n",
    "d2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "NHte5SMZ0ibq"
   },
   "outputs": [],
   "source": [
    "# Add a new column to d2:\n",
    "d2['Height'] = pd.Series([5.2, 6.0, 5.6])\n",
    "d2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add your own index:\n",
    "d3 = d2.copy()\n",
    "d3['my_index'] = ['person1','person2','person3']\n",
    "# assign the values in my_index as the new index\n",
    "# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html\n",
    "d3 = d3.set_index('my_index', drop=True) # remove my_index afterwards\n",
    "d3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# If you don't like an index, you can remove it and reset it to the usual one 0...len(df)-1\n",
    "# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html#pandas.DataFrame.reset_index\n",
    "d4 = d3.reset_index(drop=True)  # What do you get if drop=False?\n",
    "d4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Combine dataframes. There's a bunch of ways. Here let's stack d2 onto d3:\n",
    "#https://pandas.pydata.org/docs/reference/api/pandas.concat.html\n",
    "d5 = pd.concat([d2,d4], axis=0)\n",
    "d5\n",
    "# See here for a discussion of a bunch of ways: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The index has gotten weird...fix it.\n",
    "d5 = d5.reset_index(drop=True)\n",
    "d5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading data using Pandas\n",
    "You can read CSV (comma separated values) files using Pandas. The command shown below reads a CSV file into the Pandas dataframe df."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Hfwv2wzd0ibq"
   },
   "outputs": [],
   "source": [
    "# Read a csv file into Pandas Dataframe\n",
    "df = pd.read_csv(\"http://rcs.bu.edu/examples/python/DataAnalysis/Salaries.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2JaiSOAXnKup"
   },
   "source": [
    "The above command has many optional arguments that you can find in the Pandas documentation online.\n",
    "\n",
    "You can also read many other formats, for instance:\n",
    "* Excel - pd.read_excel('myfile.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])\n",
    "* Stata - pd.read_stata('myfile.dta')\n",
    "* SAS - pd.read_sas('myfile.sas7bdat')\n",
    "* HDF - pd.read_hdf('myfile.h5', 'df')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2JaiSOAXnKup"
   },
   "source": [
    "Before we can perform any analysis on the data we need to\n",
    "\n",
    "\n",
    "*   Check if the data is correctly imported \n",
    "*   Check the types of each column\n",
    "*   Determine how many missing values each column has\n",
    "\n",
    "We can then carefully prepare the data:\n",
    "\n",
    "*   Remove columns that are not needed in our analysis\n",
    "*   Rename the columns (if necessary)\n",
    "*   Possibly rearrange the columns to make it easier to work with them\n",
    "*   Create new or modify existing columns (e.g., convert into different units) if necessary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "86icwTdZ0ibr"
   },
   "outputs": [],
   "source": [
    "# Display the first 10 records\n",
    "df.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "z-lqfY0MiwGP"
   },
   "outputs": [],
   "source": [
    "# Display structure of the data frame\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### More details on DataFrame data types\n",
    "\n",
    "|Pandas Type | Native Python Type | Description |\n",
    "|------------|--------------------|-------------|\n",
    "| object | string | The most general dtype. Will be assigned to your column if column has mixed types (numbers and strings).|\n",
    "| int64  | int | Numeric characters. 64 refers to the memory allocated to hold this character. |\n",
    "| float64 | float | Numeric characters with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal. |\n",
    "| datetime64, timedelta\\[ns\\]| N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments. |\n",
    "\n",
    "\n",
    "### DataFrame attributes\n",
    "|df.attribute | Description |\n",
    "|-------------|-------------|\n",
    "| dtypes | list the types of the columns |\n",
    "| columns | list the column names |\n",
    "| axes | list the row labels and column names |\n",
    "| ndim | number of dimensions |\n",
    "| size | number of elements |\n",
    "| shape | return a tuple representung the dimensionality |\n",
    "| values | numpy representation of the data |\n",
    "\n",
    "### Dataframe methods\n",
    "|df.method() | Description |\n",
    "|-------------|-------------|\n",
    "| head(\\[n\\]), tail(\\[n\\]) | first/last n rows |\n",
    "| describe() | generate descriptive statistics (for numeric columns only) |\n",
    "| max(), min() | return max/min values for all numeric columns |\n",
    "| mean(), median() | return mean/median values for all numeric columns |\n",
    "| std() | standard deviation |\n",
    "| sample(\\[n\\]) | returns a random sample of n elements from the data frame |\n",
    "| dropna() | drop all the records with missing values |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "v_Gm0ubUmu95"
   },
   "source": [
    "Sometimes the column names in the input file are too long or contain special characters. In such cases we rename them to make it easier to work with these columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5PTMYO9ok8te"
   },
   "outputs": [],
   "source": [
    "# Let's create a copy of this dataframe with a new column names\n",
    "# If we do not want to create a new data frame, we can add inplace=True argument\n",
    "df_new =df.rename(columns={'sex': 'gender', 'phd': 'yearsAfterPhD', 'service': 'yearsOfService'})\n",
    "df_new.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tj85w0PMttNG"
   },
   "source": [
    "### DataFrame Exploration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pbDGiE0J0ibt"
   },
   "outputs": [],
   "source": [
    "# Identify the type of df_new object\n",
    "type(df_new)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pybC4toq0ibt"
   },
   "outputs": [],
   "source": [
    "# Check the data type of the column \"salary\"\n",
    "# We access columns using the brackets, e.g., df['column_name']\n",
    "df_new['salary'].dtype"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# If the column name has no spaces, complex symbols, and is not the name of an attribute/method\n",
    "# you can use the syntax df.column_name\n",
    "df_new.salary.dtype"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1T1wR6qq0ibu"
   },
   "outputs": [],
   "source": [
    "# List the types of all columns\n",
    "df_new.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "YFXUSewc0ibu"
   },
   "outputs": [],
   "source": [
    "# List the column names\n",
    "df_new.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_Cu0DFOj0ibv"
   },
   "outputs": [],
   "source": [
    "# List the row labels and the column names\n",
    "df_new.axes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "99jPywmg0ibw"
   },
   "outputs": [],
   "source": [
    "# Number of rows and columns\n",
    "df_new.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "utYUKYjN0ibv"
   },
   "outputs": [],
   "source": [
    "# Total number of elements in the Data Frame (78 x 6)\n",
    "df_new.size"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "jsjjlva70ibw"
   },
   "outputs": [],
   "source": [
    "# Output some descriptive statistics for the numeric columns\n",
    "# On a large dataframe this can take a long time to calculate\n",
    "df_new.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ZFxc7P8w9mJc"
   },
   "outputs": [],
   "source": [
    "# Remember we can use the ? to get help about the function\n",
    "df_new.describe?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Adding columns\n",
    "\n",
    "Here's two ways to add a column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a new column using the assign method\n",
    "# temporarily make a copy for demonstration purposes.\n",
    "df_copy = df_new.copy()\n",
    "\n",
    "# vectorized computation. This syntax is used to MODIFY a dataframe to\n",
    "# contain a new column.\n",
    "df_copy['salary_k'] = df_copy['salary'] / 1000.0\n",
    "df_copy.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "XHWfkQX8roey"
   },
   "outputs": [],
   "source": [
    "# Create a new column using the assign method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html\n",
    "# Assign returns a copy of df_new with a new column attached. df_copy2 is a brand new dataframe.\n",
    "df_copy2 = df_new.assign(salary_k=df_new['salary']/1000.0)\n",
    "df_copy2.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "vsyWDUkww3m3"
   },
   "outputs": [],
   "source": [
    "# Check how many unique values are in a column\n",
    "# There is a rank attribute in DataFrame object so we access using df['rank']\n",
    "df_new['rank'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pESwRxLCt5oO"
   },
   "outputs": [],
   "source": [
    "# Get the frequency table for a categorical or binary column\n",
    "df_new['rank'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pBF1lOcY0FLH"
   },
   "outputs": [],
   "source": [
    "# Get a proportion table\n",
    "df_new['rank'].value_counts()/sum(df['rank'].value_counts())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ITSJWChquiYI"
   },
   "outputs": [],
   "source": [
    "# Alternatively we can use the pandas function crosstab() to calculate a frequency table\n",
    "# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html\n",
    "pd.crosstab(index=df_new['rank'], columns=\"count\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "faCHhLX-vMtw"
   },
   "outputs": [],
   "source": [
    "# Two-way tables\n",
    "pd.crosstab(index=df_new['rank'], columns=df_new['discipline'], margins=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "id": "KzvZFygO0ibz",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "source": [
    "### Data slicing and grouping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LE0RknOW0ib0"
   },
   "outputs": [],
   "source": [
    "#Extract a column by name \n",
    "df_new['gender'].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "RdMYTbNlCrzE"
   },
   "outputs": [],
   "source": [
    "# If the column name does not contain spaces or other special characters and does not collide with data frame methods, we can use a dot notation\n",
    "df_new.gender.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "6JrIilpE08Be"
   },
   "outputs": [],
   "source": [
    "# Calculate median number of service years\n",
    "df_new.yearsOfService.median()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3gcURpQh0ib1"
   },
   "source": [
    "### Exercise - Working with a single column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LXwRAzfQ0ib2"
   },
   "outputs": [],
   "source": [
    "# Calculate the descriptive statistics for only the salary column in df_new\n",
    "# <your code goes here>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "wk7Er3ma0ib2"
   },
   "outputs": [],
   "source": [
    "# Get a count for each of the values in the salary column in df_new\n",
    "# <your code goes here>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "08H11QC80ib2"
   },
   "outputs": [],
   "source": [
    "# Calculate the average (mean) salary in df_new\n",
    "# <your code goes here>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OR761ERr0ib2"
   },
   "source": [
    "### Grouping data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "HxIkM-uT0ib3"
   },
   "outputs": [],
   "source": [
    "# Group data using rank\n",
    "df_rank = df_new.groupby('rank')\n",
    "df_rank.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "dX45khRv0ib3"
   },
   "outputs": [],
   "source": [
    "# Calculate mean of all numeric columns for the grouped object\n",
    "df_rank.mean(numeric_only=True)\n",
    "# What happens with df_rank.mean() ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "txqWe6sM0ib4"
   },
   "outputs": [],
   "source": [
    "# Most of the time, the \"grouping\" object is not stored, but is used as a step in getting a summary:\n",
    "#   df_new.groupby('gender')\n",
    "# Calculate the mean salary for men and women. The following produce Pandas Series (single brackets around salary)\n",
    "df_new.groupby('gender')['salary'].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "TDRTnWXo0ib4"
   },
   "outputs": [],
   "source": [
    "# If we use double brackets Pandas will produce a DataFrame\n",
    "df_new.groupby('gender')[['salary']].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "nUMv2CM50ib5"
   },
   "outputs": [],
   "source": [
    "# Group using 2 variables - gender and rank:\n",
    "df_new.groupby(['rank','gender'], sort=True)[['salary']].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "n_QMiWbN0ib5"
   },
   "source": [
    "### Exercise - Grouping data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bK_eeYF70ib5"
   },
   "outputs": [],
   "source": [
    "# Group data by the rank and discipline and find the average yearsOfService and salary_k for each group. \n",
    "# <your code goes here>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "5M2fqDoL0ib6"
   },
   "source": [
    "### Filtering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bNCJvz590ib6"
   },
   "outputs": [],
   "source": [
    "# Select observation with the value in the salary column > 120K\n",
    "df_filter = df_new[df_new.salary > 120000]\n",
    "df_filter.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Oj51sat40ib6"
   },
   "outputs": [],
   "source": [
    "df_filter.axes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "xGZSM5EH0ib7"
   },
   "outputs": [],
   "source": [
    "# Select data for female professors\n",
    "df_w = df_new[df_new.gender == 'Female']\n",
    "df_w.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qNkmHdkP6hsN"
   },
   "outputs": [],
   "source": [
    "# To subset one column using a condition in another columns use method \"where\"\n",
    "df_new.salary.where(df_new.gender=='Female').dropna().head(6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iUOMOcM60ib8"
   },
   "source": [
    "### Exercise - Filtering data "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0fCg9pRv0ib8"
   },
   "outputs": [],
   "source": [
    "# Using filtering, find the mean value of the salary for the discipline A\n",
    "# <your code goes here>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "DUHR3SlV0ib8"
   },
   "outputs": [],
   "source": [
    "# Challenge:\n",
    "# Determine how many female and male professors earned more than 100K\n",
    "# <your code goes here>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6Fir_77Y0ib8"
   },
   "source": [
    "### Slicing a dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "W4v86rQi0ib8",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Select column salary\n",
    "salary = df_new['salary']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "sugc9L480ib-"
   },
   "outputs": [],
   "source": [
    "# Check data type of the result\n",
    "type(salary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "XuISZ50f0ib-"
   },
   "outputs": [],
   "source": [
    "# Look at the first few elements of the output\n",
    "salary.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "CKQS3knF0ib_"
   },
   "outputs": [],
   "source": [
    "# Select column salary and make the output to be a data frame\n",
    "df_salary = df_new[['salary']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gMPqOKc20ib_"
   },
   "outputs": [],
   "source": [
    "# Check the type\n",
    "type(df_salary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4HBqgWTi0ib_"
   },
   "outputs": [],
   "source": [
    "# Select a subset of rows (based on their position):\n",
    "# Note 1: The location of the first row is 0\n",
    "# Note 2: The last value in the range is not included\n",
    "df_new[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4BEvyTpr0ib_"
   },
   "outputs": [],
   "source": [
    "# If we want to select both rows and columns we can use method .loc\n",
    "df_new.loc[10:20, ['rank', 'gender','salary']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fpFEVLMJ0icA"
   },
   "outputs": [],
   "source": [
    "# Recall our filtered dataset with salaries over 120K\n",
    "df_filter.head(25)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "k28_O2L70icA"
   },
   "outputs": [],
   "source": [
    "# Let's see what we get for our df_filter data frame\n",
    "# Method .loc subsets the data frame based on the index values:\n",
    "# loc = location\n",
    "df_filter.loc[10:20,['rank','gender','salary']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Xg9Tuorn0icA"
   },
   "outputs": [],
   "source": [
    "# Unlike method .loc, method iloc selects rows (and columns) by absolute position:\n",
    "# iloc = integer location\n",
    "df_filter.iloc[10:20, [0,3,4,5]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise - Slicing a dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a new dataframe where you filter out salaries below 100K from df_new\n",
    "# Call this data frame df_sub100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract rows 5:10 and columns ['yearsOfService', 'salary_k'] of df_sub100 using the .loc method\n",
    "# How many rows are in the output?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract rows 5:10 and columns ['yearsOfService', 'salary_k'] from df_sub100 using the iloc method\n",
    "# What are the values of the indices in the output?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract rows with index values [6, 12, 20, 22] and columns ['yearsOfService','salary_k'] from df_sub100\n",
    "# Hint: Use the loc method"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "vzxWbXKF0icA"
   },
   "source": [
    "### Sorting the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "OXybOgop0icB"
   },
   "outputs": [],
   "source": [
    "# Sort the data frame df_new by service and create a new data frame\n",
    "df_sorted = df_new.sort_values(by = 'yearsOfService')\n",
    "df_sorted.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "85UG8-JZ0icC"
   },
   "outputs": [],
   "source": [
    "# Sort the data frame df_new by yearsOfService and overwrite the original dataset\n",
    "df_new.sort_values(by = 'yearsOfService', ascending = False, inplace = True)\n",
    "df_new.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "mcrnORWR0icC"
   },
   "outputs": [],
   "source": [
    "# Restore the original order using the sort_index method\n",
    "df_new.sort_index(axis=0, ascending = True, inplace = True)\n",
    "df_new.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "52sGV0z_0icD"
   },
   "outputs": [],
   "source": [
    "# Sort the data frame using 2 or more columns:\n",
    "df_sorted2 = df_new.sort_values(by = ['yearsOfService', 'salary'], ascending = [True,False])\n",
    "df_sorted2.head(15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "aBgfkBkK0icC"
   },
   "source": [
    "### Exercise - Sorting "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Kh0qVebH0icD"
   },
   "outputs": [],
   "source": [
    "# Sort the data frame df_new by the salary (in descending order)\n",
    "# Store the output of this sorting in a dataframe called df_desc\n",
    "# Display the first 10 records of the output\n",
    "# <your code goes here>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Looping and DataFrames\n",
    "\n",
    "You can iterate over rows in a loop:\n",
    "```\n",
    "# use the iterrows() method\n",
    "sum_sal = 0\n",
    "for index, row in df_sorted2.iterrows():\n",
    "   sum_sal += row['salary']\n",
    "```\n",
    "or using .loc():\n",
    "```\n",
    "sum_sal = 0\n",
    "for i in range(df_sorted2.shape[0]):\n",
    "    sum_sal += df_sorted2.loc[i,'salary']\n",
    "```\n",
    "However, performance is generally VERY POOR, so this is to be *avoided* where a better alternative exists. The [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply) function performs much better if you must do an operation on every element in a column."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XZjjbRia0icD"
   },
   "source": [
    "### Missing Values\n",
    "To discuss how to handle missing values we will import the flights data set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "bZCCJvtD0icD"
   },
   "outputs": [],
   "source": [
    "# Read a dataset with missing values\n",
    "flights = pd.read_csv(\"http://rcs.bu.edu/examples/python/DataAnalysis/flights.csv\")\n",
    "flights.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "PcLvDYfOjNZv"
   },
   "outputs": [],
   "source": [
    "flights.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "IPfS620V0icE",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Select the rows that have at least one missing value\n",
    "flights[flights.isnull().any(axis=1)].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "AUI-q-wr0icE"
   },
   "outputs": [],
   "source": [
    "# Filter all the rows where arr_delay value is missing:\n",
    "flights1 = flights[flights['arr_delay'].notnull( )]\n",
    "flights1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "oem6cDo90icF"
   },
   "outputs": [],
   "source": [
    "# Remove all the observations with missing values\n",
    "flights2 = flights.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "NoCw8dmx0icH"
   },
   "outputs": [],
   "source": [
    "# Fill missing values with zeros\n",
    "nomiss =flights['dep_delay'].fillna(0)\n",
    "nomiss.isnull().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "d-Q3P5SO0icJ"
   },
   "source": [
    "### Exercise Count missing data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "uAfRQCPs0icL"
   },
   "outputs": [],
   "source": [
    "# Count how many missing pieces of data there are in the dep_delay and arr_delay columns\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JGOFyTVm0icN"
   },
   "source": [
    "### Common Aggregation Functions:\n",
    "\n",
    "The following functions are commonly used functions to aggregate data.\n",
    "\n",
    "|Function|Description\n",
    "|-------|--------\n",
    "|min   | minimum\n",
    "|max   | maximum\n",
    "|count   | number of non-null observations\n",
    "|sum   | sum of values\n",
    "|mean  | arithmetic mean of values\n",
    "|median | median\n",
    "|mad | mean absolute deviation\n",
    "|mode | mode\n",
    "|prod   | product of values\n",
    "|std  | standard deviation\n",
    "|var | unbiased variance\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "q5SgaHvG0icO"
   },
   "outputs": [],
   "source": [
    "# Find the number of non-missing values in each column\n",
    "flights.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "gmZ5Dekd9N6g"
   },
   "outputs": [],
   "source": [
    "flights.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get mean values\n",
    "flights.mean(numeric_only=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ZaZ1Uaks0icP"
   },
   "outputs": [],
   "source": [
    "# Let's compute summary statistic per group:\n",
    "flights.groupby('carrier')['dep_delay'].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "HrhCXQBi0icP"
   },
   "outputs": [],
   "source": [
    "# We can use agg() methods for aggregation:\n",
    "flights[['dep_delay','arr_delay']].agg(['min','mean','max'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The value returned is a dataframe:\n",
    "agg_vals = flights[['dep_delay','arr_delay']].agg(['min','mean','max'])\n",
    "agg_vals.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "AVsIj0La0icQ"
   },
   "outputs": [],
   "source": [
    "# An example of computing different statistics for different columns\n",
    "flights.agg({'dep_delay':['min','mean','max'], 'carrier':['nunique']})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sPGu6w9j0icR"
   },
   "source": [
    "## Exploring data using graphics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Graphics with the Salaries dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "QJBJNupL0icR"
   },
   "outputs": [],
   "source": [
    "# Use matplotlib to draw a histogram of a salary data\n",
    "plt.hist(df_new['salary'],bins=8, density=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "_x-c3hL20icS"
   },
   "outputs": [],
   "source": [
    "# Use seaborn package to draw a histogram\n",
    "sns.displot(df_new['salary']);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "CAwwtPMR0icS"
   },
   "outputs": [],
   "source": [
    "# Use regular matplotlib function to display a barplot\n",
    "df_new.groupby(['rank'])['salary'].count().plot(kind='bar')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Q4rEH0cx0icS"
   },
   "outputs": [],
   "source": [
    "# Use seaborn package to display a barplot\n",
    "sns.set_style(\"whitegrid\")\n",
    "ax = sns.barplot(x='rank',y ='salary', data=df_new, estimator=len)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9P4SKlYW0icS"
   },
   "outputs": [],
   "source": [
    "# Split into 2 groups:\n",
    "ax = sns.barplot(x='rank',y ='salary', hue='gender', data=df_new, estimator=len)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "AzBt0Vih0icU"
   },
   "outputs": [],
   "source": [
    "# Violinplot\n",
    "sns.violinplot(x = \"salary\", data=df_new)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "1ieYE6KF0icU"
   },
   "outputs": [],
   "source": [
    "# Scatterplot in seaborn\n",
    "sns.jointplot(x='yearsOfService', y='salary', data=df_new)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0u2HEEOp0icV"
   },
   "outputs": [],
   "source": [
    "# If we are interested in linear regression plot for 2 numeric variables we can use regplot\n",
    "sns.regplot(x='yearsOfService', y='salary', data=df_new)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "0WD6YnSy0icV"
   },
   "outputs": [],
   "source": [
    "# Box plot\n",
    "sns.boxplot(x='rank',y='salary', data=df_new)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "U6Pmpr5l0icV"
   },
   "outputs": [],
   "source": [
    "# Side-by-side box plot\n",
    "sns.boxplot(x='rank', y='salary', data=df_new, hue='gender')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Dn-ibN7k0icW"
   },
   "outputs": [],
   "source": [
    "# Swarm plot\n",
    "sns.swarmplot(x='rank', y='salary', data=df_new)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7aR8SY2r0ich"
   },
   "outputs": [],
   "source": [
    "# Factorplot\n",
    "sns.catplot(x='rank', y='salary', data=df_new, kind='bar')\n",
    "print(df_new.groupby('rank').mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "UfRCtVU20ici"
   },
   "outputs": [],
   "source": [
    "# Pairplot \n",
    "sns.pairplot(df_new)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8ABnDMpa0ici"
   },
   "source": [
    "### Exercise 7 Graphing data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "HP7rVYGp0ici"
   },
   "outputs": [],
   "source": [
    "# Use the seaborn package to explore the dependency of arr_delay on dep_delay \n",
    "# in the flights dataset. You can use a scatterplot or regplot using flight.\n",
    "# <your code goes here>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iWKUzQNP0icQ"
   },
   "source": [
    "## Descriptive statistics\n",
    "Statistics that are used to describe data. We have seen methods that calculate descriptive statistics before with the DataFrame describe() method. \n",
    "\n",
    "Descriptive statistics summarize attributes of a sample, such as the min/max values, and the mean (average) of the data. Below is a summary of some additional methods that calculate descriptive statistics."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7qiHcIAJ0icQ"
   },
   "source": [
    "|Function|Description\n",
    "|-------|--------\n",
    "|min   | minimum\n",
    "|max   | maximum\n",
    "|mean  | arithmetic mean of values\n",
    "|median | median\n",
    "|mad | mean absolute deviation\n",
    "|mode | mode\n",
    "|std  | standard deviation\n",
    "|var | unbiased variance\n",
    "|sem | standard error of the mean\n",
    "|skew| sample skewness\n",
    "|kurt|kurtosis\n",
    "|quantile| value at %\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "y2izjRmU0icQ"
   },
   "outputs": [],
   "source": [
    "# Recall the describe() function which computes a subset of the above listed statistics\n",
    "flights.dep_delay.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yh4S78a80icR"
   },
   "outputs": [],
   "source": [
    "# find the index of the maximum or minimum value\n",
    "# if there are multiple values matching idxmin() and idxmax() will return the first match\n",
    "flights['dep_delay'].idxmin()  #minimum value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "k9D_xUPK0icR"
   },
   "outputs": [],
   "source": [
    "# Count the number of records for each different value in a vector\n",
    "flights['carrier'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "a2XW__P90icj"
   },
   "source": [
    "## Inferential Statistics\n",
    "Use data analysis on a sample of data to infer properties and make predictions that cannot be derived from descriptive statistics. Examples of this could be predicting a new unknown value based on previous data (linear regression, machine learning) or hypothesis testing (such as T-tests)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "A5dv3AVB0icj"
   },
   "source": [
    "### Linear Regression\n",
    "A linear approach to model the relationship between a scalar output and one (or more) input variables. With one input and one output variable you are finding a line of *best fit*. You calculate the slope and y-intercept for a line that minimizes the distance between all the existing data points. You can then use this line to make predictions on unknown data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qSC1beWi0icj"
   },
   "outputs": [],
   "source": [
    "# Import Statsmodel functions:\n",
    "import statsmodels.api as sm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "DZhLEStD0icj"
   },
   "outputs": [],
   "source": [
    "# Create a fitted model\n",
    "lm = sm.OLS(df_new.yearsOfService, df_new.salary).fit()\n",
    "\n",
    "# Print model summary\n",
    "print(lm.summary())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "otaTnzG20icj"
   },
   "outputs": [],
   "source": [
    "# Print the coefficients\n",
    "lm.params"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "dC0PDTYx0ick"
   },
   "outputs": [],
   "source": [
    "# Using scikit-learn:\n",
    "from sklearn import linear_model\n",
    "est = linear_model.LinearRegression(fit_intercept = True)   # create estimator object\n",
    "# When you use dataframe columns this fits your model with feature names\n",
    "est.fit(df_new[['yearsOfService']], df_new[['salary']])\n",
    "\n",
    "# If you pass the values, you don't have to use feature names\n",
    "# est.fit(df_new[['yearsOfService']].values, df_new[['salary']].values)\n",
    "\n",
    "# Print result\n",
    "print(\"Coef:\", est.coef_, \"\\nIntercept:\", est.intercept_)\n",
    "\n",
    "# Predict \n",
    "# When you predict you have to use the features name otherwise you get a warning\n",
    "pred = est.predict(pd.DataFrame(np.array([21]), columns=['yearsOfService']))\n",
    "\n",
    "# If you created a model based on values, then you predict with only a value, \n",
    "# though you have to pass it with the same shape coef_ \n",
    "# pred = est.predict([[21]])\n",
    "print(\"Predicted salary: \", pred)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "f8LJHj4M0ick"
   },
   "source": [
    "### Exercise 8 Build a linear model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "X4V5kbci0ick"
   },
   "outputs": [],
   "source": [
    "# Build a linear model for arr_delay ~ dep_delay\n",
    "\n",
    "# Print model summary\n",
    "\n",
    "# Predict a value\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "id": "N8ca0v6_0icl",
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "source": [
    "### Student T-test\n",
    "Used to compare the means of two groups. In this case you have a null hypothesis that the two group means are equal. The T-test then tells you whether you have statistically significant evidence to reject the null hypothesis. \n",
    "The T-test has two output results, the statistic, and the p-value. The statistic (or T-value) quantifies the difference between the two mean values. The p-value is the probability of obtaining test results at least as extreme as the result that is observed assuming that the null hypothesis is correct.\n",
    "\n",
    "More succintly, the p-value  tells you how likely it is that your data could have occured under the null hypothesis.  Small p-values indicate there is a small probability of observing such a difference in the mean assuming the null hypothesis is true. Small p-values indicate that there is evidence to reject the null hypothesis (i.e., the group means are different). \n",
    "\n",
    "Conversely, larger p-value scores indicate there is a large probabiliy of observing the calculated statistic under the null hypothesis. Large p-values indicate that you have evidence to accept the null hypothesis (i.e., the group means are equal).\n",
    "\n",
    "One area the T-test is important is in clinical trials. Consider an example where you are looking at whether a drug reduces your cholesterol level. You have 2 populations, one where the drug is administered and another that is not administered. You can use the T-test to determine whether there is evidence to suggest that the drug causes a statistically significant change to the cholesterol level between the two populations.\n",
    "\n",
    "Below we calculate whether there is a meaningful difference between male and female salaries. Generally a p-value below 0.05 is considered statistically significant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "W2yyiAe30icl"
   },
   "outputs": [],
   "source": [
    "# Using scipy package:\n",
    "from scipy import stats\n",
    "df_w = df[df['sex'] == 'Female']['salary']\n",
    "df_m = df[df['sex'] == 'Male']['salary']\n",
    "stats.ttest_ind(df_w, df_m)   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xM391jsk0icl"
   },
   "source": [
    "## Tutorial Evaluation\n",
    "After the tutorial please submit an evaluation by clicking on this link [Tutorial Evaluation](http://scv.bu.edu/survey/tutorial_evaluation.html)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [
    "M3FbA7um0ibR",
    "vYNDrGEL0ibY",
    "3f4HD5E90ibn",
    "PQprwXx40ibo",
    "tj85w0PMttNG",
    "KzvZFygO0ibz",
    "5M2fqDoL0ib6",
    "6Fir_77Y0ib8",
    "vzxWbXKF0icA",
    "XZjjbRia0icD",
    "JGOFyTVm0icN",
    "iWKUzQNP0icQ",
    "sPGu6w9j0icR",
    "a2XW__P90icj"
   ],
   "name": "dataScience.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}