{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "vqv_wAtk0ibM" }, "source": [ "# Python for Data Analysis\n", "\n", "### Research Computing Services\n", "\n", "Instructor: Brian Gregor\n", "Website: [rcs.bu.edu](http://www.bu.edu/tech/support/research/) \n", "Tutorial materials: [http://rcs.bu.edu/examples/python/DataAnalysis](http://rcs.bu.edu/examples/python/DataAnalysis) \n", "Contact us: help@scc.bu.edu" ] }, { "cell_type": "markdown", "metadata": { "id": "M3FbA7um0ibR" }, "source": [ "## Course Content\n", "1. Python packages for data scientists\n", "2. Data manipulation with Pandas\n", "3. Basic data plotting\n", "4. Descriptive statistics\n", "5. Inferential statistics\n", "\n", "\n", "## Tutorial Evaluation\n", "After the tutorial please submit an evaluation by clicking on this link: [Tutorial Evaluation](http://scv.bu.edu/survey/tutorial_evaluation.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Python packages for data scientists\n", "- [NumPy](https://numpy.org)\n", " - Introduces objects for handling n-dimensional arrays such as vectors (1-D) and matrices (2-D).\n", " - Introduces functions to perform advanced mathematical and statistical operations on these objects.\n", " - Provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance.\n", " - Many other Python libraries are built on NumPy.\n", "- [SciPy](https://scipy.org)\n", " - An enormous collection of algorithms for statistics, linear algebra, optimization, differential equations, numerical integration, and more.\n", " - Built on top of NumPy.
\n", "- [Pandas](https://pandas.pydata.org)\n", " - Adds data structures and tools designed to work with table-like data (similar to vectors and data frames in R)\n", " - Provides tools for data manipulation: *reshaping*, *merging*, *sorting*, *slicing*, *aggregation*, etc.\n", " - Makes it easy to handle missing data\n", " \n", "- [SciKit-Learn](https://scikit-learn.org/stable/)\n", " - Provides machine learning algorithms: classification, regression, clustering, model validation, etc.\n", " - Built on NumPy, SciPy, and matplotlib.\n", "\n", "- Machine Learning libraries\n", " - [Pytorch](https://pytorch.org/)\n", " - [Tensorflow](https://www.tensorflow.org/)\n", " - [Jax](https://github.com/jax-ml/jax)\n", " - For more info on using these on the SCC see [this page](https://www.bu.edu/tech/support/research/software-and-programming/common-languages/python/python-ml/).\n", "\n", "- Pandas alternatives\n", " - Pandas is very popular, but several alternatives exist:\n", " - [Dask](https://www.dask.org/) - processes large-scale data in parallel, built on Pandas.\n", " - [Modin](https://github.com/modin-project/modin) - another library for scaling up Pandas to large datasets.\n", " - [Polars](https://pola.rs/) - similar functionality (but not built on Pandas), fast, parallel processing, gaining in popularity.\n", " \n", "### Visualization\n", "For a more in-depth look at visualization, see the `Data Visualization in Python` course.\n", "- [matplotlib](https://matplotlib.org/)\n", " - Python 2-D plotting library for publication-quality figures in a variety of hardcopy formats\n", " - Functionality similar to MATLAB\n", " - Line plots, scatter plots, bar charts, histograms, pie charts, etc.\n", " - Effort needed to create advanced visualizations\n", "- [seaborn](https://seaborn.pydata.org/)\n", " - Based on matplotlib\n", " - Provides a high-level interface for drawing attractive statistical graphics\n", " - Similar to the ggplot2 library in R\n", "- 
[plotly](https://plotly.com/python/)\n", " - Over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases.\n", " - Built on top of the Plotly JavaScript library\n", " - Can create beautiful interactive web-based visualizations\n", "- [Datashader](https://datashader.org/)\n", " - Used to create visualizations and plots from very large datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading Python libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ea4C8rUR0ibU" }, "outputs": [], "source": [ "# Press shift-enter to execute a Jupyter notebook cell\n", "# Import the Python libraries used in the tutorial\n", "import numpy as np\n", "import scipy as sp\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": { "id": "ca3M2mv_0ibX" }, "source": [ "## Pandas\n", "The main focus of this tutorial is using the Pandas library to manipulate and analyze data.\n", "\n", "Pandas is a Python package that deals mostly with:\n", "- **Series** (1-D homogeneous array)\n", " - the array has 1 data type (int, floating point, etc.)\n", "- **DataFrame** (2-D labeled heterogeneous array)\n", " - each column has a specific data type\n", "- **MultiIndex** (for hierarchical data)\n", "- **Xarray** (built on top of Pandas for n-D arrays)\n", "\n", "The Pandas content of this tutorial will cover:\n", "* Creating and understanding Series and DataFrames\n", "* Importing/Reading data\n", "* Data selection and filtering\n", "* Data manipulation via sorting, grouping, and rearranging\n", "* Handling missing data\n", "\n", "\n", "In addition, we will also cover the following:\n", "* Basic data plotting\n", "* Descriptive statistics (time permitting)\n", "* Inferential statistics (time permitting)" ] }, { "cell_type": "markdown", "metadata": { "id": "vYNDrGEL0ibY" }, "source": [ "### 
Pandas Series" ] }, { "cell_type": "markdown", "metadata": { "id": "hQpjwDv_0ibZ" }, "source": [ "A Pandas *Series* is a 1-dimensional labeled array containing data of the same type (integers, strings, floating point numbers, Python objects, etc. ). It is a generalized numpy array with an explicit axis called the *index*." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iPJvEWlG0ibZ" }, "outputs": [], "source": [ "# Example of creating Pandas series :\n", "# Order all S1 together\n", "s1 = pd.Series([-3, -1, 1, 3, 5])\n", "print(s1)" ] }, { "attachments": { "e141912c-cf3c-4599-a21b-4c3c3f2f6785.png": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAMUAAACUCAYAAAAj6u2iAAAgAElEQVR4Xu1dBVxUWRf/M3QJgigmqIhrt58tiq6ioKyoWKAotZSgYid2gIEtdq2FuoC5roFrx6qIgoktGCA99b77BkeRRZ1RBmZ49/1+X+Dc9+65/3P+95xzU41hxAzoQxGgCHxCQI2SgloDReBLBCgpqEVQBAogQElBTYIiQElBbYAi8G0EqKegFkIRoJ6C2gBFgHoKagMUAbkQUHj4xDD3ML9DbaROYDCvB50SkUs7xVSY6ugHh2QFiTHwrD8fbW/Hwr2W7MZNAS8my/6JauTRUXzMIWhZ94SVHDbwE6KVyKvUU5QI7MpVqaykkLWccrVOfmlkJoWYiYG7YRRcM9agAxMt+f+Nd4mxZ95dnD17Ft09Y3B4jZ1EguxoD9g6rMdNrXZoM8wJVeKDUP5j+CT9Tb1dO/DU6sJl/RqJ52H/vXVUH/xLviGpq8N8tNkgn1eSv/ncfeNbOrod4wkP+wiwOrp+CfDbH4u5dolY6DUc49b9A/2Pel2zJhBZMR7/KavqYfIPk2I4zwHloxOxoKeVxIhHaOeFVq5MGGwa7MfwuDyD5icuhk3t0egQw2COXR6ZWGLZIO8376YJn/7e59keF3tvxP8OumFH3Y3YF2jFXatVYMslOpFDR93cedhwJhCWTOIX+WH+jlKqT2nZ6kS/qvr8MCnyG3eeWx2BcqRnH5zgjnoL6uMEAZEFJr/LnSZ2hx7xIO1IDyR92J5oyUcCsSCzZPuXeB3WY9BHMQiwXuJrOmJ7eVZnhw7dQ/zBgyD/jcgd9bCXdGQdC5CCle5rZVmSqOpT7KTIr4yCoLE9mO2I/ZKwasaZPG9Cn6JH4FukCLHK8/TtIjfA3Zoh3iEanYk3n1UIKaQep7Cyqqy7IidFYeGTJwmfzD+GT9Iwiw2t8oddI6xY15znbVpGt8G4u5M/5ShFbxbc/mJh4ZNUR6w3z99x3Vzsic5TeBJPwRr6Pk8eLjvmDa8XJFfBsqqKcpGTgjV2aaL2kIRJDes6oTNJtKXzFJI8YsRoHEI7ZJDQyYf0SGxewoK9o26iJI+QhmOpE2Lp3IaCLOtrOpIm1CGb1dCkJdBifB+kOt/9lPdJE3QDz8VYvboH9pLk+2tlFSS6wj8rMykULgmtgCKgJAhQUiiJIqgYyoOAUpDi5I0HqGdRAeWNDZQHGQ5Kcu5OEtrUseBgy79scomT4tr9Z/BcsBV6ujpY4t8PTa2qcF4pJQHA4shT2BQVi2EO7RHUx6YkRFCaOkucFMmpGfAN343Eh8+hqamBKW690Lt1PaUBiAuC7D7zL2ZvjJI0dU
ngIHRqVJMLzf5qG0ucFKxkuXwhgiMO4vTleImgTl1aYurgbpxWTHE1ng1dg1fshkBAdDDEDkNsmxdX1Upbj1KQQopOxJELCN91XPJn68bWmO/eG0b6OkoLnqoLFpf0Ch4kdM3KysEgOzI31N9W1ZtUJPIrFSnYFh25koAZG/+UKKqSuSnWjhqEqmbGRdJY+pHPCDxNScWw+Vvx5m0qOraoi2U+ThSejwgoHSlYue69eAPvsJ0Shenp6SDUrz8dFSlCk03LzIHr/C14/PQ16te2wAbS8WhraRRhDar9KaUkBQspqzgfkoDHJSRJEKbxbtEYGpu/eYfvwrW4hxJP/Mfk4TRELQCt0pJCKuekTTGIPn1N8qd9x6aYPaxn0VgHR78SvO4gjp27CaMyBtg+2Y2GpoXYgdKTgpV524krWEIScHaEhHX3K/37097tB0gtnYtgh743TXJDfQvzH/hK6X9FJUjBqoGdbR29fLckAS9naozVowaiVqVypV9DRdRCOhchO5AqQwq2SeyIiR/JM9gEkU3Ap5GJvu7Na8veWo6WpHMR8ilepUghTcDHkYm+8/8mSlrq79wV7t1bfWo1m6Anp2Vw2ouweUNwP1vJWjI6FyEfIdjSKkcKaRNDth/Fvr/IhgzysOPsC8lEH/sMD9uB2lUrcHZGnO0UOvgtlISYwQN+xcI/jtG5CDl5obKkYNt58PxtzCQTfWwCbl2jMqwJGdiRKjaRPLTAn5OrbllMpq6N/MIM6FyEfKxQaVKwTWVX2QaG70Hah4wvWu7VpxN8HD4fkCAfLKpbWjrkmr8FBUNM1W1d8Uiu8qSQeoyCvSM7Dn98oT/nZmptx4RLwqWCD13bJDuhVJ4U7IhU/+nrJEO1BZ8Qzz6cWobOYmE/Nvw/OLDhZCeSd0mTb9nNg5slVZ4U7Iz3UTJDy+YVBR9LkmMcDPHkjGbzz0WwjWaXcQwmy/AdWtWnk51yWIHKk0LaVja3YP9zk2xWuv3wxacQYtVYV84sJmTziZNkTwrrFQZ2akZ3McpBhPxFSw0pCrafDSWu3X+OMmSSr7CdZIwoFzk5QogYNahr6kBXm/eDEBb9a2Ihn2y8EkAkzvu2upYOdLTUofadqthl963JHmu6B+XndFJqSfEtWMSiHMSuHoqwqGd4m6UD/Rq/YfocN7SupP9zaBbR289jlyJs1Tace2YMLdEHmDYbDL8AP3Sy4n2XGEUkAqc/wzlSMGIBzq5ywYJ/LODq74qGpuk4HDodp0xcsWC0M6zLqZe4QfAz3iDlzVtk5LIkeIPT65bgZm5HuMzwRUsTepSoohXEOVIIci4itHcYeJ6z4OtkDX1yFGTa9eXwCbqMtpNmYWjXapJ/U6bn0b4ghJ4BOnktgVNd5ZJNmXAqKlk4RQoGImScXwrnMckYsWUCHC30kPv8COaOX4Hoi4/QJeRPTBjwC0w0St7wxCKS74hEEIrisX3iHiRVaAuvIHtU0S552YrK+JT1O5wjxYeTC+G8URcB3mY4v3IPLj7SQkf/geAdXwy+TTi8+jeCuYINjxx2D7FARBJpYuBs9swm+xrqUFf/nEq/OBuOJau34fS9XBjWdILvmLFwbKpNc4piYBLHSMEgI3Y+7CdEIUdcBb8FBKBbgyqoXlsbR4IHIbFpGDz7N1E4KcT8m9gaEo6dJxLAN1CHIL0OnCcGwK1XnU+hG//DK7x8lYzUbB4yE3Zjy3UBWvUeh2Gt6CEOiuYFp0jBgpmdsAFewxZD034lZozrgCokVBImR5E94FtRxW8u3B2sYKSm2BCFEWfi9ZPnSH6XBTHrHUT6MLOoDHNTXRSW5gue7sboObHQbeuD+UPqKNomOP99zpFCnPsWB0J+w4Y3gzBjljuamKrj0po+mH6jE2aM90ILS22U9IwFQ8Kqz7RkEL9vAibueo+OXiEY3ZVuIVU0azlHCom3eHceK8dMxdHHfAjUdKBZpwcmBriiXU0TaKgr1kt8T6HsYMCtyAVYuS
kad7MMoS5Ih75VZzi5eMG+ZWWY6n1vCu97NdDfv4cAJ0nBgpL+IgGPXn4guQUPOua/oGFV5Zi4Y31EZvJjPHmWggwRmcVmRNAxq4Fa1c2go2RDxd8zLlX9nbOkUFWFUbkVjwAlheIxpjWoGAKUFCqmMCqu4hGgpFA8xrQGFUOAkkLFFEbFVTwClBSKx5jWoGIIUFKomMKouIpHgJJC8RjTGlQMAUoKFVMYFVfxCHCWFAzzHsdCJ2L3tRoYPN8bNlUNS3zNU0F1P/hrCZYtv4OarkEY3qcODOiMtuIZQWrgJCn48TsxKvQijPVfIvpvc4zeNRmD6pkVukK1WLRQoBJR2iWsXPQn7r9+jDu3RKju6I1542xQlpKiWNTBOVKI+C9weNoq3GvQGh1N4zF+1lsMXT0KA5SEFAyTgysbNuBksjbadTHAvkWxyGreH3ODO8KYkoKSQhEIiEVZeBqXDLNG1aF+ORS9Al/BZW0wBioLKcgq2eR7z6FmWh6mOucw2WUH3rVyxXxKCkWYQ6Hf5JynyI9C1tmF6D0uBUOViBT55RO+jMYE/4NI/58L5lFSUFIoGgF2iXbW2UVwLAFSiPnx2BMWgcgzDyDU54GfaQ3HkR4Y2K0W9PKFSMJXMZjgd4CSQtHGUOD71FOwpFg3FoPqliu20SdG9BZ3L91AwpN3EGmpQcQ3Qe2WjVG3ugk085PiNSGFbx4p2PDJiOYUxUIPTpMi99pS9ApIhse2KehrqVMsgMtTiSjzb0xxI3dvtB2BpSObg17/Lg96P16Wc6RgxCJc3zUKq488w5sXz3HjPh+V61ZHZaMO8Jznhg7VypT40OyrSxsRsTUSV5Jy8CThNQRly8PSrDHs3Nzg0qc+DKnH+HGLl+FN7pECYry4cQgXE9KQQw6UMSIxfXZGNoTkyJuWds1haVzyBxekP7uKy9fi8DzDAGUMtaAmzEZamgFqtWiOJvUqQJuSQgbT/vEinCPFj0NF3+QKApQUXNE0bafMCFBSyAwVLcgVBCgpuKJp2k6ZEaCkkBkqWpArCFBScEXTtJ0yI0BJITNUtCBXEKCk4IqmaTtlRoCTpGAYsrlo5lIce5SCbB0DNO07Bm421aBbwocrF9Taqys7sHnLfVR1dIVj5+pfLBaUWcO0oNwIcI4UjEiAWwd/x+ZTlqhQ1QAaWddw5l8reIQEoGuDstBSgtlicU48/lh7DHF3z+Of2GxYDx6NBePpzju5rfsHX+AeKcjap8cXdiG1xgA0MeeBYe5i6UA/POwwHePd2qOSbgkfxc/k4tYf67H32gc0sjHF6a1xEDbvgzljbOjOux80cnlf4xwpCgLEMEnYMLAvrpCrvSb6dUBVvZImhQCPL96BuLIlqptfx5QBW/GG7ryT165/qjznSfH0zFgEzniGntPCMLB9Regq+GovebRFd97Jg1bRleU0KbLjyV1yIxcives8zA7ojGo6ynVLEN15V3SGLs+XOEuKrPg9CJmyFo8sBiN4vAualy/sCkZ5oCz6spQURY+pLF/kJCmEr89i7qQVeFK1H8ZPc0JNJRhxKkxZYv5pTHXZhQ/tPbHMr7Es+qRligABzpFCJErEft/fMXL3czTo0xvNKqiBnyuGmlpD9B3tiObmBsW2V/tr+ntz6wD2Rf2FuKRkXI19gNyKVmjxS0vY/OaIXl2s6HxFERj+tz7BOVKIRY/w17qjuJeVhSx+DnIEIjBigMerC3uv7mhYXr/ESfHu7mHEHItFYqoxTIy0AUEG3r8zRTO77ujank7iKZgT3Dw2U9Gg0u+rNgKc8xSqrS4qfXEgQElRHCjTOlQKAUoKlVIXFbY4EKCkKA6UaR0qhQAlhUqpiwpbHAhQUhQHyipeB8Pcw/wOtZE6gcG8HiW7YLI4oKSkKA6UVbwOeUgRH3MIWtY9YVVLfvL8zLtFCTElRVGiWUq/JSspZC1XGEw/825Rw845UjBiMR7ERuDwlTdI56tDnacF6+5u6N3IuMRnsq
XKfZ94Akf/Pov77/SgQTYdmTbsgZ6/NkMlbfl73x81mOxoD9g6rMdNrXZoM8wJVeKDUP5j+HQ7xhMe9hFQb9cO1y8BfvtjMdcuEQu9hmPcun+g//GdNWsCkRXj8Z+yBUMwlhD53201tA6q3Vn3qT62Daw8raP64NpqMdwNo9B4lxh75t3F2bNn0d0zBofX2EmaKpWblY2nVhcu69fAXU6vxUFSiHA7Zg52nH6JtBwNsrzjBZLfNcCwoADYNDVRiv0UKTf2Ytf+o4hL0YeG+A1S3uugZTdfDB7UFObFQAx+4mLYNNiP4XGxEoOS/F17NDrEMJhjFy0xSteMNbAhCynZ37q587DhTCAsmcQvcg8xE/PVstULLMIs6CmkJPj3o7Hv8+ThsmNe/cN5DkhfnIh9gVZk5+TnfKcw2bybJnySVdYOgnOkKAgMw3zAbo/2+Nt6KSb5dkI1/eLrjWVV0u2IoQiLLQuHqUvhWFPx8rEGWW9BfZwghs4ab0GDZf8+dOge4g8eBPlvRO6oh72EJB0LkIJt39fKsoTK/xSsgyXUCO35aHs7Fq5M2CfiWTBfkpL9xs3FbeF6dzLOO0RCj3i3dsRLSB/Wky35SG5Z8aakYJKx1T8AD9oFwsehNSooGSkYRoira6Zg55Oy6PX7OHSsWrKkCLEKk3iRdpEb4G7NEO8Qjc6kN55VCCmkHqewst8jhdTYZ6htxjRmKNj/ZT1DQe9TkBT5ySwrCQqW4yQp2Pvuks5txrFrb/D+5XVcfuCAmaucUaes8uy8S3twmvTUF5D4Ogm3rpZDZ9cAjHAo96N6luu9wsInTxI+mZPwaZrY/QsvcnOxJzpP4Uk8BWvo0jCHzRsKepyCZQsKlf9d9jdWjq7uCajJxKPNhrxQjiUFGz6Vj07Egp7/DZ+k3kVaNv/fsoLAUVKIEXdgOraceIb3gjIwq1kNDVvZo0dzaxjpygqdYsul/LsLO/fG4OoTLZhWq4TqzW3R+X9tUa9i8VzyJU2mH5JQpGFdJ3QmiTY7TyFNqEM2q6FJS6DF+D5Idb77KW6XJroGnouxenUP7CXJ99fKFkQw/7tsks6GbixRZqrFQJpbSD1F/ZkM9kfe+U+izRLJe8RoHEI7ZJDQyYd4NJY88jycJEVBgJ7tcsGgo00xd5oX2lroyYNfsZTlx6+EV+glVOkxHTOdLIulTmWopLBh2sLCp6KWlZKCIPp4uw9cDllh5mxP2FgaFDXGP/29nFtr4DXvKio5BGPugFo//T1V+IDUaxjlG25l5aakUID2GHE6bp+IwuVHH5DL8KAGcn3vtYfQbumKwP7tYG5YsnkFm++8unUK5y7fxmuBNtRFOXj16Bae6zSDs+tQ2NZSvltcFaCmEv0k5zwFI07BiZXzEHktBZlidp5CHQ2cpmJkz6olfisqawkklUTi8XXYvvckHuYaQF2YAbOWg+E5rBesyyh+5KlErVFJKuccKZQEdyqGEiNASaHEylFF0c7dSUKbOhaqKPonmSkpVFp9yiX84shT2BQVi2EO7RHUx0a5hJNDGkoKOcCiRb+OwO4z/2L2xihJgSWBg9CpUU2VhYuSQmVVpzyCn7zxAMErdkMgECJ4iB2G2DZXHuF+QBJKih8Ajb7yGYG4pFfwWLAVWVk5GGTXBuP626o8PJQUKq/CkmvA05RUDJtP7s94m4qOLepimY9TyQlThDVTUhQhmFz6VFpmDlznb8Hjp69Rv7YFNowaBG2t4lmXpWicOU0KhnmHG7E38F69Cho1qwETHeU4jl/84RnibsfjwcsMQI0HsZAPXfO6aNSoASorwQReLl8I7/BduBb3EJXMTfHH5OEw0i89M+2cJQW7nCL51HSMm3wIF7UcsXJzADpVNVR0JyTT97Mur8CY0F24nGIGSxN1iHIzYNK0H0YMH47W1Up+Vjt43UEcO3cTRmUMsH2yG6qaGcvULlUpxFlSCN6exKzABdCqZoUTLxph6rR+ZDGgkVLoLf2fxV
h8SQdNHXxgb1XyJMgPinQuQlNTA5smuaG+hblSYFaUQnCSFIzwDY7MDkJEWjeM6vwMi6IN4Tt2CLrUUB5SLLmsh+aO3rCzVB5SlKa5iG+RiJOkeH8xBP4z0/DbzMnoItwG11UM/CYPRVclIUXGuTBM23sf6pY9YEPmwMS6FiTnaYiqRiVHkNI2F0FJ8REBNo8QpyVhY9hEvGs9C2O614DgVjj6hWth3CxPtC1fssvGpYri3z+CVRHbcCQuHdo8IdJyjdHC3hm/D/gV1c2KJ6Fl84bgfrYob2xAblQqfXMRlBRSUpBDAJL2+KLfek0MGtYD9U01kXv3IBae4MHeqTO6d+qC2lUMoaVE1wazomdcX0o82XEY95qFBV5NoKXgO/rY4dYOfgtRztQYwYSIC/84VurmIigppKQQC3F5uz/WHH6OlAx1speCgSgjBY/fMChXwRS9/VdgeHcLlNEouTClMGUJHm6Hb8BhqHfxx4LAVjBUMCkOnr+NqWsjvxCltM1FUFJ8A4Gc8/Nhv0IbE+Z6w7Zq8YQm31IIG+KlPY1Hwv1nSBNpgifm49HFAziZXA1OHiPh1FC/KAdaCv2WdMg1/4/+zl3h3r2VwutWhgo4mWjnBz731k5MjNLAQLeeaF6x5A8tYHfe3T4YirVbDiMh2xA8svOufCtX+PkMQwvz4vFgtmPCJeFSwae0rG36HvE4T4rvAcS139n1TPZjw//TbHZeohNZ3yRNvkszLpQUpVm7P9C2/HMR7OvsMo7BXVrCoVX9UrWUg+YUP2AcXH2FzSdOXo6XeIWBnZqhqVUVzkFBPQXnVP7tBh+5koDWZI91aVrgJ6+KKSnkRYyWL/UIUFKUehXTBsqLACWFvIjR8qUeAUqKUq9i2kB5EaCkIIiJs97iyXs1lCtnDANtnrwYKm353PQ3eJ2SAQ1jc1Q00SHn5tJHFgQ4Twp2WUU2WardP0ITAZOG4teaeXsqst49wwcYw6QMWSCoZGuhZFEsWyYhahpGzTiGqsOWY7lfM8izg5rd7ZealgpGrzLKGRROJ4bJwev7iXidxUBdSx+mlSxR0ejLWsS5qXiS9Aa8MhXInIehXDLI2s6iLsd5UrCA5l5ZAqcVavCbNAzdrfJIcWVlT2wW+iDY3R7V9IpneUVRK/f1xfWYF34GpvbjMWFAHbkOkM65H4PV69eC6b4fQR3/6z35GeS09gsHsXPtFpx7KZTsIXfwXQhfm88btRhxKu7snwH/kGvQ6RiAiGV9UVHBixmLAkPOkkKQ+hiPX2WCL1ID/9Y2jD9UAcEz3NDBNBNPXqbiwmZvHBb3g0vvLqhWRgyUMUd1Mrurw0tH8otMaJU1gZEBWbBHtMCQa33fP38LkX5ZmJTVRe7blxDoloEoNQ3v3r1DjrohzKuyPe7nXlSc/RYvnr9Eao4aeLpmqFqlAgwL3Hya8yEFb9KIwZUtB1NSl6wPk0tkfP0SKR+E0CAymZmZkfc/152b8R7ZQiE5FEGLLD58inQ1bRiZkJ68YhnwBNl4l/wMiVeO4kD0fohbLYRbK22INfVgVK4SqppogxG8xvmd07BkvwaGLlqBnh8vp2QITOScBckjuVLg9j6snxmGF+U74DFaYt1SJ1RWsmX5hWHKSVLkfEhC1Ap/bDmZig8CLajz05Bu5owlK3xQ8/5yjFv6F+48eg6+hhExfD3Sw4qg28obi0b1h7X+aUx1XglNp3HwdWkOM3UGwjfRGOe1H6aOgfBzqY9Ls/rhGK86BAmZSLp3HS+yDdBqxASMdLdBdR0ehFkpuHRwLtbsjMW9tzoQ69VBP38/eHZvBEOtPK/ELgy8EuGNMWufodPYZZjeV/YrqgRPziBieRg2n3iEHPP/wdl/BiZ0r/hJ/4lHF2FHzD/IJMaadng3bvK1ULWJA/xDg9BWcAeRq2ZicdQT8PnkNBHdCjDWzAW/7C+wcZmIGX0skXr3OLasWoz3rRfBuQm5Q4MwQdfQlJDKSOKNWEIIU2
5j59JFOMZvh0GNkrDqcmOsDu2LygQvZX84RwpGlIFDs/tivWAIZvv2RXVjNWRfXAKXTboIIFtSu1jqk+MfRfhnRS/sEo6Av0tvWJkQNfI0SG6hDjW19zg0zRU7Bf0wKtAFTSuo4cXBUQiIrQ4Pdw90rQ0cmzME83ddR+0R6zHWvRVMbyyH3/JnaOkyEj52FkjY4oeZV03g7DUJjnV18fT4RIzbm4UuQ6ZgWHvTPO9DSHFr9xTM2fESrb2mYSR5T9aHEYshFAqQfScSs3eeBdN8HBb0rfbp9QcnwzBj9Gpktp+EqfOcUeP5Saza8ieSagYi3KUWOVJHiDdxB7Bp+2aIO2/FKFs9Ig8PGhqa0FDPQcKpbdgw5QR4dpa4dOICBDk8mFRqhSGzJsDxlzLk7u/3iItcgDlb9eC9zAOG55dj6vlGWBPWj5JCViUWZzl+5inM7bUV5hNnwMW2Klh1515agr5rePCf+DnRPh9uh21Cb4xyc0RN4y97N378Wgydl0JiaA8MbKmHyAkuuGnxO4YN7QZL3XTETB+Is3qDMdhrIBqQdxnmATa4zcezuoPwe/AvOO41EZf02+A3vy6wIAwQ5ZzDylGnoNPNFWMC28PkY9wtFokgEhNzJBfLqKvLP3YkuHcAMzf/DX6TMZjn9JkUdw7NxrYDD9BkzHr0tVYjG61uYMcKsv2V3xMRUzpBl9Sffucg1m2MgKjbXgTban9SEZtcPzi9HBMD/ob5+IWY/1sNiAUvcGHDSiw8WQOh+3xg8fQCNkWEgm+7BUGdRLi9ZyHGn2+GTYsdUY7mFMVp7t+vSzLSdH87gtzuw35DAOxqmUhGQ3LOL0K/dZrwJ4m2dPQpdsmv+IPxReBQR9QyKXgR+h2sHDAP6T2C4N4zGUvn3EHz/s7o3cqcECBDQorzZd3h6u6I2gYsKR5i47BQPG84AD4jy+EPr3HYei4Z2lUMSK9KBFATIeu9IWy9xsLXs32RJaP8u/sQsvU0hE3/S4od0UloHrwOvauT3Ydp17F99U4S6vTAuo+keH8rEuu3rAfDkqLL5ytjGSYLd49HYPX8JPTYRS59L8e2T4jH5zdh+ZQHsIvwhcaR2Qi5UAezp/ZCeWEy4g5vRfiNXzBrSi/UrlgNZXWVO4TiVPjEkiLn0R6McbuIduETMLAB6bfIHXjn1wVi8klrTJzrgy7V8w5EO7e0OyIy3THetx+sCzlF48amQVj7tBs6GiUgXq8Nevd0QJOKrIHkkeKCqSeGe/dCDZIjCNOiMcntb5Tv6wH/gabY7zMBl0y6wm1iP9RhO2E1NTKHwEpHkm519h6+vIdNiNMyRdA2NIaRnjwDqnnvCx79iVkbT0LYfDxm96rwqddgPcX2qMdoNiYCv5Ek+RMpBIQUk/M8RertA1i7YR1ybXdjSo/Pu/0Yhk88xTKETT2Nxov/hGczIjn5t8STq7AiNA3dQgchbfNQrDirDR0tHvmSCPzMD3iTLoblLw3hMmUzBjf+fgdWkiU4RQoWaDH/MXZN9MZxgxH43bkFjJ7sxbhFfyBBrTdWRgTBplre7ahPD02C/zYxfhsxEB1qlIGangmZADOEtmaeyQpTj2Kmxw6ce6GL4asnoHcDS0koxpLi0J+rL/UAAAODSURBVIxB+IvpgZ797VBTT4wHx2ZhzV1LDHUfA/t6unhwYCymHM+BjYMv7OobgCgBjLYhTI2NoKeVZw5sTnF1ox/Gr3uOjmPCMKWP7Pc98DPf491bQqjEowjffwHCeh4IdqgF47KmklGo+OgQbDnwiNyBvQlO5LA1UdpVbFm+HUcE9tg03VZCCuHrK9i8fBmOZv2KaX5toauuDX1jM1Qoo4mspH+wesEEnNN0x9SRHaCX9gh/7duG8xrOWDS5C4yFuRCKyHdFYoL3K1zftxIhlxthVWg/WOrpK/28D+dIwRpczuM9mBy0EZeTs2BQpx8Ce2Rj2wVTjCCJdweLPE/BiF7jwKwgbDj1CO9FWijT5n
csCuqLuhXyemyJ8U9ywqIXnTF9xih0tMgbMmWYTByfNxyhB+4gVbs89JEFcZ3+mBg0HLbWZHSGHJbAiDLxz+ZpWLXjDB6Ty+21xZmkTF+MDfRCj7p5pGRJcWPneIRseYG2vrMw2t5S5s7z4anVWB62ERdT80bOGHIWLVOlDZy9x8O/c3ncO7EUkcefoaH3Islha6IPcdi/lewDF3RGaGBb6LDkJsn6y5uRWD5jIc4kkzGlcnXQaegkzOxTQ+LTUu5EYeXUufj7tQZEmuVRz9YTM4O7ooLml3MajPgdbv8ZgXlXG2DJzB40p5BZi8VckDW43IxM5AjEUNPURRk9HnIFDBld0SCjK58TWnFuFjJz+BCy4++knKGuDkl4Pw6Zil9ix9jhuF1jFDyGkvOY9D/+Oxs+zRiIM9r94TSExNAkp1Ajs71l9DW/WGbBCHLJnQ7Z4DMkdCKeAhra0CPf19L4XL+IlJHIpakFrQLG9i3IRPwcZGfnQMCo57VHLIJQTQM6OrrQJSGNiJBEKCSz0Jo65Pc8AgjJiJOIlNfR/nzINIsTPyMD2QQnhiT7Wjp60P/4O0sMQWY6svjkNzXym64h+e2/UknO2hIIyHwQj9SvoRJLTTjpKX6Gg8JsMiGXlonnJ5ZhzgldDAjwQZ/GZp+UzXqQqMlOOGfijeG/94G1is6G/wxGqv4uJYWcGkw5E4bxy/7Ev/frwCd8PAa2I7lEvllaSfi0wAOXDAfB2cUBtQyVe6RFzuZzojglhZxqFudm4gM5QU8g1oGxiQE0SY6Q/5HM5mZnkXCFhDxaZLa8wO9yVkeLlwAClBQlADqtUrkR+D95EMGw4MIZngAAAABJRU5ErkJggg==" } }, "cell_type": "markdown", "metadata": { "id": "YPctLCcw0ibb" }, "source": [ "![image.png](attachment:e141912c-cf3c-4599-a21b-4c3c3f2f6785.png)\n", "\n", "We did not pass any index, so by default, it assigned the indexes ranging from `0`to `len(data)-1`. 
Contrast this with a Python list, which always has an implicit index that counts from 0:\n", "```\n", "x = [10,20,30]\n", "y = x[1] # y --> 20\n", "```\n", "and also with a Python dictionary, where the keys act as an index:\n", "```\n", "x = {'a':10, 'b':20, 'c':30}\n", "y = x['b'] # y --> 20\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "v4vAbdqv0ibb" }, "outputs": [], "source": [ "# View index values\n", "print(s1.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s3I3__b90ibi" }, "outputs": [], "source": [ "s1[:2] # First 2 elements" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4jsIOlUR0ibi" }, "outputs": [], "source": [ "print(s1[[2,1,0]]) # Elements out of order" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qGn-aAvx2t9V" }, "outputs": [], "source": [ "type(s1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WIft0v6r0ibl" }, "outputs": [], "source": [ "# Can place filtering conditions on series\n", "s1[s1 > 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZmlDR_tj0ibd" }, "outputs": [], "source": [ "# Creating Pandas series with index:\n", "# fetch a random number generator object.\n", "rng = np.random.default_rng()\n", "# select 5 points from a normal (Gaussian) distribution.\n", "s2 = pd.Series(rng.normal(size=5), index=['a', 'b', 'c', 'd', 'e'])\n", "print(s2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "srVyZ9ik0ibe" }, "outputs": [], "source": [ "# View index values\n", "print(s2.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vX-IUZb30ibf" }, "outputs": [], "source": [ "# Create a Series from dictionary\n", "data = {'pi': 3.14159, 'e': 2.71828} # dictionary\n", "print(data)\n", "s3 = pd.Series(data)\n", "print(s3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sOHbkXC10ibg" }, 
"outputs": [], "source": [ "# Create a new series from a dictionary and reorder the elements\n", "s4 = pd.Series(data, index=['e', 'pi', 'tau'])\n", "print(s4)" ] }, { "cell_type": "markdown", "metadata": { "id": "DeofA8nZ0ibi" }, "source": [ "NaN (Not a Number) is used to represent a missing value in Pandas. The 'tau' entry above is NaN because the dictionary has no 'tau' key." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "skyr2siA0ibj" }, "outputs": [], "source": [ "# Series can be treated as a 1-D array and you can apply functions to them:\n", "print(\"Median:\", s4.median())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "42_G7Ks-0ibl" }, "outputs": [], "source": [ "# Methods can be used to filter series:\n", "s4[s4 > s4.median()]" ] }, { "cell_type": "markdown", "metadata": { "id": "3f4HD5E90ibn" }, "source": [ "### Attributes and Methods:\n", "An attribute is a variable stored in the object, e.g., index or size with Series.\n", "A method is a function stored in the object, e.g., head() or median() with Series.\n", "\n", "| Attribute/Method | Description |\n", "|-----|-----|\n", "| dtype | data type of values in series |\n", "| empty | True if series is empty |\n", "| size | number of elements |\n", "| values | Returns values as ndarray |\n", "| head() | First n elements |\n", "| tail() | Last n elements |\n", "\n", "Execute *dir(s1)* to see all attributes and methods. \n", "\n", "I recommend using the online documentation as well. 
It is in a much easier format to read and comes with examples.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# For more information on a particular method or attribute, pass it to the help() function\n", "help(s4.head)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "help(s4.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can also add a question mark to get help information\n", "s4.head?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s4.index?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas Series can also [plot](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.html) themselves:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Going back to s2\n", "s2.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One final way to get help is to press shift-tab when you are in the parentheses of a method or after an attribute. Try this in the exercise below." ] }, { "cell_type": "markdown", "metadata": { "id": "FXjjnBOj0ibn" }, "source": [ "### Exercise - Create your own Series" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RUmWlc790ibn" }, "outputs": [], "source": [ "# Create a series with 10 elements containing both positive and negative integers\n", "# Examine the series with the head() method\n", "# mys = pd.Series( ... )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also create a Series from a NumPy array or another Series. 
See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a Series from a numpy array\n", "a=np.array(range(6)) # numbers 0 thru 5\n", "print(f'numpy: {a}')\n", "ser = pd.Series(a)\n", "print(f'series:\\n{ser}')\n", "# Change an element of the numpy array\n", "a[0] = -100\n", "# Print the series again\n", "print(f'numpy: {a}')\n", "print(f'series:\\n{ser}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Tell pandas to make a copy of the numpy array.\n", "ser2 = pd.Series(a, copy=True)\n", "print(f'series2:\\n{ser2}')\n", "# change the array again\n", "a[-1]=1000\n", "# ser2 was built with a copy of \"a\", so no changes.\n", "print(f'series2:\\n{ser2}')" ] }, { "cell_type": "markdown", "metadata": { "id": "PQprwXx40ibo" }, "source": [ "### Pandas DataFrames" ] }, { "cell_type": "markdown", "metadata": { "id": "ZYRivqyM0ibo" }, "source": [ "A Pandas *DataFrame* is a 2-dimensional, size-mutable, heterogeneous tabular data structure with labeled rows and columns. You can think of it as a dictionary-like container to store Python Series objects." 
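] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of the dictionary-like behavior (the column names 'x' and 'y' below are made up for illustration), selecting a column of a DataFrame returns the underlying Series:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A DataFrame behaves like a dict of Series: selecting one column returns a Series.\n", "import pandas as pd # already imported above; repeated so this cell stands alone\n", "sketch = pd.DataFrame({'x': [1, 2, 3], 'y': [0.5, 1.5, 2.5]})\n", "col = sketch['x'] # the column 'x' as a pandas Series\n", "print(type(col))\n", "print(col.name)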
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vWXR_GKo0ibo" }, "outputs": [], "source": [ "d = pd.DataFrame({'Name': pd.Series(['Alice', 'Bob', 'Chris']), \n", " 'Age': pd.Series([21, 25, 23])})\n", "print(d)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xt6NqHhS0ibo" }, "outputs": [], "source": [ "d2 = pd.DataFrame(np.array([['Alice','Bob','Chris'], [21, 25, 23]]).T, columns=['Name','Age'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cZXlWYmn0ibq" }, "outputs": [], "source": [ "# Use the head() method to print the first 5 records in the dataframe (same as with series)\n", "d2.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NHte5SMZ0ibq" }, "outputs": [], "source": [ "# Add a new column to d2:\n", "d2['Height'] = pd.Series([5.2, 6.0, 5.6])\n", "d2.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Add your own index:\n", "d3 = d2.copy()\n", "d3['my_index'] = ['person1','person2','person3']\n", "# assign the values in my_index as the new index\n", "# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html\n", "d3 = d3.set_index('my_index', drop=True) # remove my_index afterwards\n", "d3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# If you don't like an index, you can remove it and reset it to the usual one 0...len(df)-1\n", "# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html#pandas.DataFrame.reset_index\n", "d4 = d3.reset_index(drop=True) # What do you get if drop=False?\n", "d4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Combine dataframes. There's a bunch of ways. 
Here let's stack d4 below d2:\n", "# https://pandas.pydata.org/docs/reference/api/pandas.concat.html\n", "d5 = pd.concat([d2,d4], axis=0)\n", "d5\n", "# See here for a discussion of a bunch of ways: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The index now contains duplicate labels...fix it.\n", "d5 = d5.reset_index(drop=True)\n", "d5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading data using Pandas\n", "You can read CSV (comma-separated values) files using Pandas. The command shown below reads a CSV file into the Pandas dataframe df." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Hfwv2wzd0ibq" }, "outputs": [], "source": [ "# Read a CSV file into a Pandas DataFrame\n", "df = pd.read_csv(\"http://rcs.bu.edu/examples/python/DataAnalysis/Salaries.csv\")" ] }, { "cell_type": "markdown", "metadata": { "id": "2JaiSOAXnKup" }, "source": [ "The above command has many optional arguments that you can find in the Pandas documentation online.\n", "\n", "You can also read many other formats, for instance:\n", "* Excel - pd.read_excel('myfile.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])\n", "* Stata - pd.read_stata('myfile.dta')\n", "* SAS - pd.read_sas('myfile.sas7bdat')\n", "* HDF - pd.read_hdf('myfile.h5', 'df')" ] }, { "cell_type": "markdown", "metadata": { "id": "2JaiSOAXnKup" }, "source": [ "Before we can perform any analysis on the data we need to:\n", "\n", "\n", "* Check if the data is correctly imported \n", "* Check the types of each column\n", "* Determine how many missing values each column has\n", "\n", "We can then carefully prepare the data:\n", "\n", "* Remove columns that are not needed in our analysis\n", "* Rename the columns (if necessary)\n", "* Possibly rearrange the columns to make it easier to work with them\n", "* Create new or modify existing columns (e.g., convert into 
different units) if necessary" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "86icwTdZ0ibr" }, "outputs": [], "source": [ "# Display the first 10 records\n", "df.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "z-lqfY0MiwGP" }, "outputs": [], "source": [ "# Display the structure of the data frame\n", "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### More details on DataFrame data types\n", "\n", "|Pandas Type | Native Python Type | Description |\n", "|------------|--------------------|-------------|\n", "| object | string | The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings).|\n", "| int64 | int | Integer values. 64 refers to the number of bits used to store each value. |\n", "| float64 | float | Floating point values. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal. |\n", "| datetime64, timedelta\\[ns\\]| N/A (but see the datetime module in Python’s standard library) | Values meant to hold time data. Look into these for time series experiments. 
|\n", "\n", "\n", "### DataFrame attributes\n", "|df.attribute | Description |\n", "|-------------|-------------|\n", "| dtypes | list the types of the columns |\n", "| columns | list the column names |\n", "| axes | list the row labels and column names |\n", "| ndim | number of dimensions |\n", "| size | number of elements |\n", "| shape | return a tuple representing the dimensionality |\n", "| values | numpy representation of the data |\n", "\n", "### DataFrame methods\n", "|df.method() | Description |\n", "|-------------|-------------|\n", "| head(\\[n\\]), tail(\\[n\\]) | first/last n rows |\n", "| describe() | generate descriptive statistics (for numeric columns only) |\n", "| max(), min() | return max/min values for all numeric columns |\n", "| mean(), median() | return mean/median values for all numeric columns |\n", "| std() | standard deviation |\n", "| sample(\\[n\\]) | returns a random sample of n elements from the data frame |\n", "| dropna() | drop all the records with missing values |" ] }, { "cell_type": "markdown", "metadata": { "id": "v_Gm0ubUmu95" }, "source": [ "Sometimes the column names in the input file are too long or contain special characters. In such cases we rename them to make it easier to work with these columns." 
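] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch (the messy column names below are made up, not from the Salaries data), rename() maps old names to new ones and returns a copy:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: clean up made-up messy column names with rename()\n", "import pandas as pd # already imported above; repeated so this cell stands alone\n", "messy = pd.DataFrame({'Years of Service (yrs)': [2, 10], 'Salary ($)': [70000, 90000]})\n", "tidy = messy.rename(columns={'Years of Service (yrs)': 'yearsOfService', 'Salary ($)': 'salary'})\n", "print(tidy.columns.tolist())\n", "print(messy.columns.tolist()) # the original is unchanged without inplace=True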
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5PTMYO9ok8te" }, "outputs": [], "source": [ "# Let's create a copy of this dataframe with new column names\n", "# If we do not want to create a new data frame, we can add the inplace=True argument\n", "df_new = df.rename(columns={'sex': 'gender', 'phd': 'yearsAfterPhD', 'service': 'yearsOfService'})\n", "df_new.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "tj85w0PMttNG" }, "source": [ "### DataFrame Exploration" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pbDGiE0J0ibt" }, "outputs": [], "source": [ "# Identify the type of the df_new object\n", "type(df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pybC4toq0ibt" }, "outputs": [], "source": [ "# Check the data type of the column \"salary\"\n", "# We access columns using the brackets, e.g., df['column_name']\n", "df_new['salary'].dtype" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# If the column name has no spaces, complex symbols, and is not the name of an attribute/method\n", "# you can use the syntax df.column_name\n", "df_new.salary.dtype" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1T1wR6qq0ibu" }, "outputs": [], "source": [ "# List the types of all columns\n", "df_new.dtypes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YFXUSewc0ibu" }, "outputs": [], "source": [ "# List the column names\n", "df_new.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_Cu0DFOj0ibv" }, "outputs": [], "source": [ "# List the row labels and the column names\n", "df_new.axes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "99jPywmg0ibw" }, "outputs": [], "source": [ "# Number of rows and columns\n", "df_new.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "utYUKYjN0ibv" }, "outputs": [], 
"source": [ "# Total number of elements in the Data Frame (78 x 6)\n", "df_new.size" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jsjjlva70ibw" }, "outputs": [], "source": [ "# Output some descriptive statistics for the numeric columns\n", "# On a large dataframe this can take a long time to calculate\n", "df_new.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZFxc7P8w9mJc" }, "outputs": [], "source": [ "# Remember we can use the ? to get help about the function\n", "df_new.describe?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Adding columns\n", "\n", "Here's two ways to add a column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a new column using the assign method\n", "# temporarily make a copy for demonstration purposes.\n", "df_copy = df_new.copy()\n", "\n", "# vectorized computation. This syntax is used to MODIFY a dataframe to\n", "# contain a new column.\n", "df_copy['salary_k'] = df_copy['salary'] / 1000.0\n", "df_copy.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XHWfkQX8roey" }, "outputs": [], "source": [ "# Create a new column using the assign method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html\n", "# Assign returns a copy of df_new with a new column attached. 
df_copy2 is a brand new dataframe.\n", "df_copy2 = df_new.assign(salary_k=df_new['salary']/1000.0)\n", "df_copy2.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vsyWDUkww3m3" }, "outputs": [], "source": [ "# Check how many unique values are in a column\n", "# There is a rank attribute in DataFrame object so we access using df['rank']\n", "df_new['rank'].unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pESwRxLCt5oO" }, "outputs": [], "source": [ "# Get the frequency table for a categorical or binary column\n", "df_new['rank'].value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pBF1lOcY0FLH" }, "outputs": [], "source": [ "# Get a proportion table\n", "df_new['rank'].value_counts()/sum(df['rank'].value_counts())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ITSJWChquiYI" }, "outputs": [], "source": [ "# Alternatively we can use the pandas function crosstab() to calculate a frequency table\n", "# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html\n", "pd.crosstab(index=df_new['rank'], columns=\"count\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "faCHhLX-vMtw" }, "outputs": [], "source": [ "# Two-way tables\n", "pd.crosstab(index=df_new['rank'], columns=df_new['discipline'], margins=True)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "id": "KzvZFygO0ibz", "jupyter": { "outputs_hidden": true } }, "source": [ "### Data slicing and grouping" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LE0RknOW0ib0" }, "outputs": [], "source": [ "#Extract a column by name \n", "df_new['gender'].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RdMYTbNlCrzE" }, "outputs": [], "source": [ "# If the column name does not contain spaces or other special characters and does not collide with data frame methods, we can use 
a dot notation\n", "df_new.gender.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6JrIilpE08Be" }, "outputs": [], "source": [ "# Calculate median number of service years\n", "df_new.yearsOfService.median()" ] }, { "cell_type": "markdown", "metadata": { "id": "3gcURpQh0ib1" }, "source": [ "### Exercise - Working with a single column" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LXwRAzfQ0ib2" }, "outputs": [], "source": [ "# Calculate the descriptive statistics for only the salary column in df_new\n", "# " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wk7Er3ma0ib2" }, "outputs": [], "source": [ "# Get a count for each of the values in the salary column in df_new\n", "# " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "08H11QC80ib2" }, "outputs": [], "source": [ "# Calculate the average (mean) salary in df_new\n", "# " ] }, { "cell_type": "markdown", "metadata": { "id": "OR761ERr0ib2" }, "source": [ "### Grouping data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HxIkM-uT0ib3" }, "outputs": [], "source": [ "# Group data using rank\n", "df_rank = df_new.groupby('rank')\n", "df_rank.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dX45khRv0ib3" }, "outputs": [], "source": [ "# Calculate mean of all numeric columns for the grouped object\n", "df_rank.mean(numeric_only=True)\n", "# What happens with df_rank.mean() ?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "txqWe6sM0ib4" }, "outputs": [], "source": [ "# Most of the time, the \"grouping\" object is not stored, but is used as a step in getting a summary:\n", "# df_new.groupby('gender')\n", "# Calculate the mean salary for men and women. 
The following produce Pandas Series (single brackets around salary)\n", "df_new.groupby('gender')['salary'].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TDRTnWXo0ib4" }, "outputs": [], "source": [ "# If we use double brackets Pandas will produce a DataFrame\n", "df_new.groupby('gender')[['salary']].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nUMv2CM50ib5" }, "outputs": [], "source": [ "# Group using 2 variables - gender and rank:\n", "df_new.groupby(['rank','gender'], sort=True)[['salary']].mean()" ] }, { "cell_type": "markdown", "metadata": { "id": "n_QMiWbN0ib5" }, "source": [ "### Exercise - Grouping data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bK_eeYF70ib5" }, "outputs": [], "source": [ "# Group data by the rank and discipline and find the average yearsOfService and salary_k for each group. \n", "# " ] }, { "cell_type": "markdown", "metadata": { "id": "5M2fqDoL0ib6" }, "source": [ "### Filtering" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bNCJvz590ib6" }, "outputs": [], "source": [ "# Select observation with the value in the salary column > 120K\n", "df_filter = df_new[df_new.salary > 120000]\n", "df_filter.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Oj51sat40ib6" }, "outputs": [], "source": [ "df_filter.axes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xGZSM5EH0ib7" }, "outputs": [], "source": [ "# Select data for female professors\n", "df_w = df_new[df_new.gender == 'Female']\n", "df_w.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qNkmHdkP6hsN" }, "outputs": [], "source": [ "# To subset one column using a condition in another columns use method \"where\"\n", "df_new.salary.where(df_new.gender=='Female').dropna().head(6)" ] }, { "cell_type": "markdown", "metadata": { "id": "iUOMOcM60ib8" }, "source": [ "### Exercise - 
Filtering data " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0fCg9pRv0ib8" }, "outputs": [], "source": [ "# Using filtering, find the mean value of the salary for the discipline A\n", "# " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DUHR3SlV0ib8" }, "outputs": [], "source": [ "# Challenge:\n", "# Determine how many female and male professors earned more than 100K\n", "# " ] }, { "cell_type": "markdown", "metadata": { "id": "6Fir_77Y0ib8" }, "source": [ "### Slicing a dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "W4v86rQi0ib8", "scrolled": true }, "outputs": [], "source": [ "# Select column salary\n", "salary = df_new['salary']" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sugc9L480ib-" }, "outputs": [], "source": [ "# Check data type of the result\n", "type(salary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XuISZ50f0ib-" }, "outputs": [], "source": [ "# Look at the first few elements of the output\n", "salary.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CKQS3knF0ib_" }, "outputs": [], "source": [ "# Select column salary and make the output to be a data frame\n", "df_salary = df_new[['salary']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gMPqOKc20ib_" }, "outputs": [], "source": [ "# Check the type\n", "type(df_salary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4HBqgWTi0ib_" }, "outputs": [], "source": [ "# Select a subset of rows (based on their position):\n", "# Note 1: The location of the first row is 0\n", "# Note 2: The last value in the range is not included\n", "df_new[0:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4BEvyTpr0ib_" }, "outputs": [], "source": [ "# If we want to select both rows and columns we can use method .loc\n", "df_new.loc[10:20, ['rank', 
'gender','salary']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fpFEVLMJ0icA" }, "outputs": [], "source": [ "# Recall our filtered dataset with salaries over 120K\n", "df_filter.head(25)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "k28_O2L70icA" }, "outputs": [], "source": [ "# Let's see what we get for our df_filter data frame\n", "# Method .loc subsets the data frame based on the index values:\n", "# loc = location\n", "df_filter.loc[10:20,['rank','gender','salary']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Xg9Tuorn0icA" }, "outputs": [], "source": [ "# Unlike method .loc, method iloc selects rows (and columns) by absolute position:\n", "# iloc = integer location\n", "df_filter.iloc[10:20, [0,3,4,5]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise - Slicing a dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a new dataframe where you filter out salaries below 100K from df_new\n", "# Call this data frame df_sub100" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract rows 5:10 and columns ['yearsOfService', 'salary_k'] of df_sub100 using the .loc method\n", "# How many rows are in the output?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract rows 5:10 and columns ['yearsOfService', 'salary_k'] from df_sub100 using the iloc method\n", "# What are the values of the indices in the output?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract rows with index values [6, 12, 20, 22] and columns ['yearsOfService','salary_k'] from df_sub100\n", "# Hint: Use the loc method" ] }, { "cell_type": "markdown", "metadata": { "id": "vzxWbXKF0icA" }, "source": [ "### Sorting the Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OXybOgop0icB" }, "outputs": [], "source": [ "# Sort the data frame df_new by service and create a new data frame\n", "df_sorted = df_new.sort_values(by = 'yearsOfService')\n", "df_sorted.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "85UG8-JZ0icC" }, "outputs": [], "source": [ "# Sort the data frame df_new by yearsOfService and overwrite the original dataset\n", "df_new.sort_values(by = 'yearsOfService', ascending = False, inplace = True)\n", "df_new.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mcrnORWR0icC" }, "outputs": [], "source": [ "# Restore the original order using the sort_index method\n", "df_new.sort_index(axis=0, ascending = True, inplace = True)\n", "df_new.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "52sGV0z_0icD" }, "outputs": [], "source": [ "# Sort the data frame using 2 or more columns:\n", "df_sorted2 = df_new.sort_values(by = ['yearsOfService', 'salary'], ascending = [True,False])\n", "df_sorted2.head(15)" ] }, { "cell_type": "markdown", "metadata": { "id": "aBgfkBkK0icC" }, "source": [ "### Exercise - Sorting " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Kh0qVebH0icD" }, "outputs": [], "source": [ "# Sort the data frame df_new by the salary (in descending order)\n", "# Store the output of this sorting in a dataframe called df_desc\n", "# Display the first 10 records of the output\n", "# " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Looping and DataFrames\n", "\n", "You can iterate 
over rows in a loop:\n", "```\n", "# use the iterrows() method\n", "sum_sal = 0\n", "for index, row in df_sorted2.iterrows():\n", " sum_sal += row['salary']\n", "```\n", "or using .loc():\n", "```\n", "sum_sal = 0\n", "for i in range(df_sorted2.shape[0]):\n", " sum_sal += df_sorted2.loc[i,'salary']\n", "```\n", "However, performance is generally VERY POOR, so this is to be *avoided* where a better alternative exists. The [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply) function performs much better if you must do an operation on every element in a column." ] }, { "cell_type": "markdown", "metadata": { "id": "XZjjbRia0icD" }, "source": [ "### Missing Values\n", "To discuss how to handle missing values we will import the flights data set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bZCCJvtD0icD" }, "outputs": [], "source": [ "# Read a dataset with missing values\n", "flights = pd.read_csv(\"http://rcs.bu.edu/examples/python/DataAnalysis/flights.csv\")\n", "flights.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PcLvDYfOjNZv" }, "outputs": [], "source": [ "flights.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IPfS620V0icE", "scrolled": true }, "outputs": [], "source": [ "# Select the rows that have at least one missing value\n", "flights[flights.isnull().any(axis=1)].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AUI-q-wr0icE" }, "outputs": [], "source": [ "# Filter all the rows where arr_delay value is missing:\n", "flights1 = flights[flights['arr_delay'].notnull( )]\n", "flights1.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "oem6cDo90icF" }, "outputs": [], "source": [ "# Remove all the observations with missing values\n", "flights2 = flights.dropna()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NoCw8dmx0icH" }, 
"outputs": [], "source": [ "# Fill missing values with zeros\n", "nomiss =flights['dep_delay'].fillna(0)\n", "nomiss.isnull().any()" ] }, { "cell_type": "markdown", "metadata": { "id": "d-Q3P5SO0icJ" }, "source": [ "### Exercise Count missing data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "uAfRQCPs0icL" }, "outputs": [], "source": [ "# Count how many missing pieces of data there are in the dep_delay and arr_delay columns\n" ] }, { "cell_type": "markdown", "metadata": { "id": "JGOFyTVm0icN" }, "source": [ "### Common Aggregation Functions:\n", "\n", "The following functions are commonly used functions to aggregate data.\n", "\n", "|Function|Description\n", "|-------|--------\n", "|min | minimum\n", "|max | maximum\n", "|count | number of non-null observations\n", "|sum | sum of values\n", "|mean | arithmetic mean of values\n", "|median | median\n", "|mad | mean absolute deviation\n", "|mode | mode\n", "|prod | product of values\n", "|std | standard deviation\n", "|var | unbiased variance\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "q5SgaHvG0icO" }, "outputs": [], "source": [ "# Find the number of non-missing values in each column\n", "flights.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gmZ5Dekd9N6g" }, "outputs": [], "source": [ "flights.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get mean values\n", "flights.mean(numeric_only=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZaZ1Uaks0icP" }, "outputs": [], "source": [ "# Let's compute summary statistic per group:\n", "flights.groupby('carrier')['dep_delay'].mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HrhCXQBi0icP" }, "outputs": [], "source": [ "# We can use agg() methods for aggregation:\n", "flights[['dep_delay','arr_delay']].agg(['min','mean','max'])" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The value returned is a dataframe:\n", "agg_vals = flights[['dep_delay','arr_delay']].agg(['min','mean','max'])\n", "agg_vals.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AVsIj0La0icQ" }, "outputs": [], "source": [ "# An example of computing different statistics for different columns\n", "flights.agg({'dep_delay':['min','mean','max'], 'carrier':['nunique']})" ] }, { "cell_type": "markdown", "metadata": { "id": "sPGu6w9j0icR" }, "source": [ "## Exploring data using graphics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Graphics with the Salaries dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QJBJNupL0icR" }, "outputs": [], "source": [ "# Use matplotlib to draw a histogram of a salary data\n", "plt.hist(df_new['salary'],bins=8, density=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_x-c3hL20icS" }, "outputs": [], "source": [ "# Use seaborn package to draw a histogram\n", "sns.displot(df_new['salary']);" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CAwwtPMR0icS" }, "outputs": [], "source": [ "# Use regular matplotlib function to display a barplot\n", "df_new.groupby(['rank'])['salary'].count().plot(kind='bar')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Q4rEH0cx0icS" }, "outputs": [], "source": [ "# Use seaborn package to display a barplot\n", "sns.set_style(\"whitegrid\")\n", "ax = sns.barplot(x='rank',y ='salary', data=df_new, estimator=len)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9P4SKlYW0icS" }, "outputs": [], "source": [ "# Split into 2 groups:\n", "ax = sns.barplot(x='rank',y ='salary', hue='gender', data=df_new, estimator=len)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AzBt0Vih0icU" }, "outputs": [], "source": [ "# Violinplot\n", 
"sns.violinplot(x = \"salary\", data=df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1ieYE6KF0icU" }, "outputs": [], "source": [ "# Scatterplot in seaborn\n", "sns.jointplot(x='yearsOfService', y='salary', data=df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0u2HEEOp0icV" }, "outputs": [], "source": [ "# If we are interested in linear regression plot for 2 numeric variables we can use regplot\n", "sns.regplot(x='yearsOfService', y='salary', data=df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0WD6YnSy0icV" }, "outputs": [], "source": [ "# Box plot\n", "sns.boxplot(x='rank',y='salary', data=df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U6Pmpr5l0icV" }, "outputs": [], "source": [ "# Side-by-side box plot\n", "sns.boxplot(x='rank', y='salary', data=df_new, hue='gender')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Dn-ibN7k0icW" }, "outputs": [], "source": [ "# Swarm plot\n", "sns.swarmplot(x='rank', y='salary', data=df_new)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7aR8SY2r0ich" }, "outputs": [], "source": [ "# Factorplot\n", "sns.catplot(x='rank', y='salary', data=df_new, kind='bar')\n", "print(df_new.groupby('rank').mean())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UfRCtVU20ici" }, "outputs": [], "source": [ "# Pairplot \n", "sns.pairplot(df_new)" ] }, { "cell_type": "markdown", "metadata": { "id": "8ABnDMpa0ici" }, "source": [ "### Exercise 7 Graphing data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HP7rVYGp0ici" }, "outputs": [], "source": [ "# Use the seaborn package to explore the dependency of arr_delay on dep_delay \n", "# in the flights dataset. 
You can use a scatterplot or regplot using flight.\n", "# " ] }, { "cell_type": "markdown", "metadata": { "id": "iWKUzQNP0icQ" }, "source": [ "## Descriptive statistics\n", "Statistics that are used to describe data. We have seen methods that calculate descriptive statistics before with the DataFrame describe() method. \n", "\n", "Descriptive statistics summarize attributes of a sample, such as the min/max values, and the mean (average) of the data. Below is a summary of some additional methods that calculate descriptive statistics." ] }, { "cell_type": "markdown", "metadata": { "id": "7qiHcIAJ0icQ" }, "source": [ "|Function|Description\n", "|-------|--------\n", "|min | minimum\n", "|max | maximum\n", "|mean | arithmetic mean of values\n", "|median | median\n", "|mad | mean absolute deviation\n", "|mode | mode\n", "|std | standard deviation\n", "|var | unbiased variance\n", "|sem | standard error of the mean\n", "|skew| sample skewness\n", "|kurt|kurtosis\n", "|quantile| value at %\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "y2izjRmU0icQ" }, "outputs": [], "source": [ "# Recall the describe() function which computes a subset of the above listed statistics\n", "flights.dep_delay.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yh4S78a80icR" }, "outputs": [], "source": [ "# find the index of the maximum or minimum value\n", "# if there are multiple values matching idxmin() and idxmax() will return the first match\n", "flights['dep_delay'].idxmin() #minimum value" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "k9D_xUPK0icR" }, "outputs": [], "source": [ "# Count the number of records for each different value in a vector\n", "flights['carrier'].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "id": "a2XW__P90icj" }, "source": [ "## Inferential Statistics\n", "Use data analysis on a sample of data to infer properties and make predictions that cannot be derived 
from descriptive statistics. Examples of this could be predicting a new unknown value based on previous data (linear regression, machine learning) or hypothesis testing (such as T-tests)." ] }, { "cell_type": "markdown", "metadata": { "id": "A5dv3AVB0icj" }, "source": [ "### Linear Regression\n", "A linear approach to model the relationship between a scalar output and one (or more) input variables. With one input and one output variable you are finding a line of *best fit*. You calculate the slope and y-intercept for a line that minimizes the distance between all the existing data points. You can then use this line to make predictions on unknown data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qSC1beWi0icj" }, "outputs": [], "source": [ "# Import Statsmodel functions:\n", "import statsmodels.api as sm" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DZhLEStD0icj" }, "outputs": [], "source": [ "# Create a fitted model\n", "lm = sm.OLS(df_new.yearsOfService, df_new.salary).fit()\n", "\n", "# Print model summary\n", "print(lm.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "otaTnzG20icj" }, "outputs": [], "source": [ "# Print the coefficients\n", "lm.params" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dC0PDTYx0ick" }, "outputs": [], "source": [ "# Using scikit-learn:\n", "from sklearn import linear_model\n", "est = linear_model.LinearRegression(fit_intercept = True) # create estimator object\n", "# When you use dataframe columns this fits your model with feature names\n", "est.fit(df_new[['yearsOfService']], df_new[['salary']])\n", "\n", "# If you pass the values, you don't have to use feature names\n", "# est.fit(df_new[['yearsOfService']].values, df_new[['salary']].values)\n", "\n", "# Print result\n", "print(\"Coef:\", est.coef_, \"\\nIntercept:\", est.intercept_)\n", "\n", "# Predict \n", "# When you predict you have to use the features name 
otherwise you get a warning\n", "pred = est.predict(pd.DataFrame(np.array([21]), columns=['yearsOfService']))\n", "\n", "# If you created a model based on values, then you predict with only a value, \n", "# though you have to pass it with the same shape coef_ \n", "# pred = est.predict([[21]])\n", "print(\"Predicted salary: \", pred)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "f8LJHj4M0ick" }, "source": [ "### Exercise 8 Build a linear model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "X4V5kbci0ick" }, "outputs": [], "source": [ "# Build a linear model for arr_delay ~ dep_delay\n", "\n", "# Print model summary\n", "\n", "# Predict a value\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "id": "N8ca0v6_0icl", "jupyter": { "outputs_hidden": true } }, "source": [ "### Student T-test\n", "Used to compare the means of two groups. In this case you have a null hypothesis that the two group means are equal. The T-test then tells you whether you have statistically significant evidence to reject the null hypothesis. \n", "The T-test has two output results, the statistic, and the p-value. The statistic (or T-value) quantifies the difference between the two mean values. The p-value is the probability of obtaining test results at least as extreme as the result that is observed assuming that the null hypothesis is correct.\n", "\n", "More succintly, the p-value tells you how likely it is that your data could have occured under the null hypothesis. Small p-values indicate there is a small probability of observing such a difference in the mean assuming the null hypothesis is true. Small p-values indicate that there is evidence to reject the null hypothesis (i.e., the group means are different). \n", "\n", "Conversely, larger p-value scores indicate there is a large probabiliy of observing the calculated statistic under the null hypothesis. 
Large p-values indicate that you have evidence to accept the null hypothesis (i.e., the group means are equal).\n", "\n", "One area the T-test is important is in clinical trials. Consider an example where you are looking at whether a drug reduces your cholesterol level. You have 2 populations, one where the drug is administered and another that is not administered. You can use the T-test to determine whether there is evidence to suggest that the drug causes a statistically significant change to the cholesterol level between the two populations.\n", "\n", "Below we calculate whether there is a meaningful difference between male and female salaries. Generally a p-value below 0.05 is considered statistically significant." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "W2yyiAe30icl" }, "outputs": [], "source": [ "# Using scipy package:\n", "from scipy import stats\n", "df_w = df[df['sex'] == 'Female']['salary']\n", "df_m = df[df['sex'] == 'Male']['salary']\n", "stats.ttest_ind(df_w, df_m) " ] }, { "cell_type": "markdown", "metadata": { "id": "xM391jsk0icl" }, "source": [ "## Tutorial Evaluation\n", "After the tutorial please submit an evaluation by clicking on this link [Tutorial Evaluation](http://scv.bu.edu/survey/tutorial_evaluation.html)" ] } ], "metadata": { "colab": { "collapsed_sections": [ "M3FbA7um0ibR", "vYNDrGEL0ibY", "3f4HD5E90ibn", "PQprwXx40ibo", "tj85w0PMttNG", "KzvZFygO0ibz", "5M2fqDoL0ib6", "6Fir_77Y0ib8", "vzxWbXKF0icA", "XZjjbRia0icD", "JGOFyTVm0icN", "iWKUzQNP0icQ", "sPGu6w9j0icR", "a2XW__P90icj" ], "name": "dataScience.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 4 }