{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Visualization in Python\n",
"\n",
"## Research Computing Services\n",
"\n",
"Instructor: Scott Ladenheim, PhD
\n",
"Website: [rcs.bu.edu](http://www.bu.edu/tech/support/research/)
\n",
"Tutorial materials: [http://rcs.bu.edu/examples/python/DataVisualization](http://rcs.bu.edu/examples/python/DataVisualization)
\n",
"Contact us: help@scc.bu.edu\n",
"\n",
"## Data visualization software\n",
"\n",
"Data visualization software is a tool you use to you relay information about data. There are many data visualization software packages available. Today I will present how to use a few of these pacakges. The focus is on how to use them on the SCC. However, what I demonstrate in the tutorial should be applicable on other systems. By the end of the tutorial you will understand:\n",
"\n",
"- how to create plots in Matplotlib, Seaborn, Pandas plot, and Plotly,\n",
"- how to adjust plot properties, e.g., color schemes, titles, axis labels, marker sizes, ect.,\n",
"- where to find documentation when you don't know what to do or where to start.\n",
"\n",
"### Matplotlib\n",
"\n",
"Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.\n",
"\n",
"https://matplotlib.org/\n",
"\n",
"### Seaborn\n",
"\n",
"Seaborn is a Python data visualization library built on top of matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Closely integrated to work with dataframes.\n",
"\n",
"https://seaborn.pydata.org/\n",
"\n",
"### Pandas plot\n",
"\n",
"Data visualization routines within Pandas that allow you to easily plot the data within a dataframe.\n",
"\n",
"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html\n",
"\n",
"\n",
"### Plotly\n",
"\n",
"Plotly's Python graphing library makes interactive, publication-quality graphs. Plotly is open-source and free to use. Plotly works with R, Julia, Javascript and Matlab. \n",
"\n",
"https://plotly.com/python/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Matplotlib\n",
"\n",
"API reference page for the latest stable version of matploblib: \n",
"\n",
"https://matplotlib.org/stable/api/index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Matplotlib backends"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import matplotlib with name mpl\n",
"import matplotlib as mpl"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Interactive plots in Jupyter notebook"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The %matplotlib magic function is a feature of the IPython interpreter that Jupyter Notebooks use.\n",
"# This is NOT A PYTHON COMMAND. Don't use this in a plain Python file.\n",
"# This tells matplotlib to use an interactive plot window inside the Notebook.\n",
"\n",
"# If you are running JupyterLab DON'T run this line. Run the cell with the $matplotlib widget call.\n",
"%matplotlib notebook"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# You could alternatively set the interactive plot window using regular Python:\n",
"# mpl.use('nbAgg')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mpl.get_backend()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Here's a list of all of the available backends\n",
"mpl.rcsetup.all_backends"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Interactive plots in JupyterLab\n",
"If you are using JupyterLab you need to install the ipympl library https://github.com/matplotlib/ipympl and can then use the command\n",
"\n",
"`%matploblib widget`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment the line below and run this cell if you use Jupyterlab \n",
"# %matplotlib widget"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using Backends\n",
"\n",
"In general, you don't have to worry about setting of the backend. You can always let matplotlib choose the default for your computer. In a Jupyter notebook (or JupyterLab) the command `%matplotlib notebook` (`%matplotlib widget`) can be used, as we do here.\n",
"\n",
"If you are running a Python program that will create plots using matplotlib it will try to open a graphics window by default. If you don't want that to happen, for instance you're running a batch job on the SCC where graphics aren't available, use the 'agg' backend. This allows for plots to be created in memory and then saved to disk. The backend must be set before calling any matplotlib functions.\n",
"\n",
"For example, in a Python program running as a batch job on the SCC:\n",
"\n",
"```\n",
"import matplotlib as mpl\n",
"mpl.use('agg')\n",
"import matplotlib.pyplot as plt\n",
"#...do stuff and make plots...\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Turning off interactive\n",
"\n",
"You can alternatively turn off the interactive mode using `%matplotlib inline`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment the below line to set interactive mode off\n",
"#%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scripting and object oriented interface layers\n",
"\n",
"There are two main ways to interact with matplotlib to create your graphs. In this tutorial I try work with them seperately as much as possible. However they can, and often are (especially on sites like stackoverflow), used in tandem. It is important to understand the difference between the two layers and how they are used. The two interface layers are:\n",
"\n",
"- scripting,\n",
"- object oriented.\n",
"\n",
"Let's see how they work with some example plots below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import the matplotlib submodule pyplot\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Scripting interface\n",
"\n",
"We plot the function f(x) = sin(x). The scripting interface makes function calls to create and modify the plot. When the scripting layer is used, pyplot has a notion of a current figure and a current axes which all the functions you call act on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"x = np.linspace(0, 2 * np.pi, 200)\n",
"y = np.sin(x)\n",
"\n",
"### Using the scripting API via plt\n",
"plt.figure()\n",
"plt.plot(x, y)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Object oriented layer\n",
"\n",
"Plotting the same function using the object oriented layer. The function call to subplots() returns two objects:\n",
"\n",
"- a figure object, `fig`,\n",
"- an axis object, `ax`.\n",
"\n",
"We then use the properties and functions of these objects to create the plot. The object oriented layer is useful for fine grain control of the plot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"### Using the object oriented API\n",
"fig, ax = plt.subplots()\n",
"ax.plot(x, y)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The Figure and Axes objects\n",
"\n",
"Figure is the top-level object that holds all plot elements (see https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure for more details). With this object you can set properties such as \n",
"\n",
"- figsize\n",
"- dpi (dots per inch)\n",
"- layout\n",
"\n",
"Axes is an object that encapsulates all the elements of an individual (sub-) plot in a figure (see https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.html#matplotlib.axes.Axes for more details). With this object you can set properties such as:\n",
"\n",
"- xlim, ylim\n",
"- xlabel, ylabel\n",
"- legend\n",
"- title\n",
"- linewidth\n",
"- fontsizes\n"
]
},
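{
"cell_type": "markdown",
"metadata": {},
"source": [
"The properties listed above can be set as in the following minimal sketch. The names `x_demo`, `fig_demo`, and `ax_demo` and the specific values are illustrative assumptions, not part of the tutorial, and `layout='constrained'` assumes matplotlib 3.5 or newer.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"x_demo = np.linspace(0, 2 * np.pi, 200)\n",
"\n",
"# Figure-level properties are passed when the figure is created\n",
"fig_demo, ax_demo = plt.subplots(figsize=(6, 4), dpi=100, layout='constrained')\n",
"\n",
"# linewidth is a property of the plotted line; label is used by the legend\n",
"ax_demo.plot(x_demo, np.sin(x_demo), linewidth=2, label='sin(x)')\n",
"\n",
"# Axes-level properties are set with the set_* methods of the Axes object\n",
"ax_demo.set_xlim(0, 2 * np.pi)\n",
"ax_demo.set_ylim(-1.5, 1.5)\n",
"ax_demo.set_xlabel('x', fontsize=12)\n",
"ax_demo.set_ylabel('sin(x)', fontsize=12)\n",
"ax_demo.set_title('Figure and Axes properties', fontsize=14)\n",
"ax_demo.legend()\n",
"plt.show()\n",
"```\n",
"\n",
"Figure-level settings are usually passed when the figure is created, while most Axes-level settings have a matching `set_*` method."
]
},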
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Changing the plotting style\n",
"\n",
"Online visual guide here:\n",
"\n",
"https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html\n",
"\n",
"More informantion on color palettes here:\n",
"\n",
"https://seaborn.pydata.org/tutorial/color_palettes.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.style.available"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.style.use('seaborn-v0_8-colorblind')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### Using the scripting API via plt\n",
"y2 = np.cos(x)\n",
"\n",
"plt.figure()\n",
"plt.plot(x, y, x, y2)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Ask for help on the plot function. This is a notebook-only syntax:\n",
"plt.plot?\n",
"# in a plain Python console we'd use (it will also work here):\n",
"# help(plt.plot)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Saving a figure"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Scripting layer you would call. Observe that this saves the last open interactive window.\n",
"plt.savefig('sine_cosine-sl.pdf')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Object oriented layer call. Observe that this saves the object stored in the fig object.\n",
"fig.savefig('sine-ool.pdf');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can save a figure in the following formats:\n",
"\n",
"- '.png'\n",
"- '.jpg'\n",
"- '.svg'\n",
"\n",
"There are many more parameters you can pass to savefig, for example dpi, format, etc. Documenationa for savefig is here: \n",
"\n",
"https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html."
]
},
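{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the `dpi` and `format` parameters mentioned above: the file names `sine-demo.png` and `sine-demo.svg` and the variable names are hypothetical, chosen only for this illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"x_demo = np.linspace(0, 2 * np.pi, 200)\n",
"fig_save, ax_save = plt.subplots()\n",
"ax_save.plot(x_demo, np.sin(x_demo))\n",
"\n",
"# dpi sets the resolution of raster output such as PNG\n",
"fig_save.savefig('sine-demo.png', dpi=300)\n",
"\n",
"# format sets the file type explicitly (by default it is inferred from the\n",
"# file extension); bbox_inches='tight' trims surrounding whitespace\n",
"fig_save.savefig('sine-demo.svg', format='svg', bbox_inches='tight')\n",
"```\n",
"\n",
"Vector formats such as SVG and PDF scale without loss, so `dpi` matters mainly for raster formats such as PNG."
]
},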
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Modify a plot"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Scripting level api\n",
"\n",
"plt.figure(figsize=(8, 5))\n",
"plt.plot(x, y, 'r+', x, y2, 'g--')\n",
"plt.xlabel('x-axis', fontsize=16)\n",
"plt.ylabel('y-axis', fontsize=16)\n",
"plt.title(\"Sine and Cosine\", fontsize=18)\n",
"plt.legend(['Sine', 'Cosine'])\n",
"plt.tick_params(axis='x', labelsize=12)\n",
"plt.tick_params(axis='y', labelsize=12)\n",
"axes = plt.gca()\n",
"axes.spines[['top', 'bottom', 'left', 'right']].set_linewidth(2)\n",
"plt.grid()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# Plot the same line with OO layer\n",
"\n",
"fig1, ax1 = plt.subplots()\n",
"fig1.set_size_inches(8, 5)\n",
"ax1.plot(x, y, 'r+', x, y2, 'g--')\n",
"ax1.set_xlabel('x-axis', fontsize=16)\n",
"ax1.set_ylabel('y-axis', fontsize=16)\n",
"ax1.set_title(\"Sine and Cosine\", fontsize=18)\n",
"ax1.legend(['Sine', 'Cosine'])\n",
"ax1.spines[['top', 'bottom', 'left', 'right']].set_linewidth(2)\n",
"ax1.tick_params(axis='x', labelsize=12)\n",
"ax1.tick_params(axis='y', labelsize=12)\n",
"ax1.grid()\n",
"plt.show()\n",
"\n",
"## Other plot options can be found here https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# Make a subplot with 1 row and 2 columns\n",
"fig2, (ax11, ax12) = plt.subplots(1, 2)\n",
"fig2.set_size_inches(8, 5)\n",
"fig2.subplots_adjust(wspace=0.4)\n",
"ax11.plot(x, y, 'r+')\n",
"ax11.set_title('Sine')\n",
"ax11.set_xlabel('x')\n",
"ax11.set_ylabel('sin(x)')\n",
"\n",
"ax12.plot(x, y2, 'g--')\n",
"ax12.set_title('Cosine')\n",
"ax12.set_xlabel('x')\n",
"ax12.set_ylabel('cos(x)')\n",
"ax12.grid()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other plots\n",
"\n",
"There are many other kinds of plots that you can make in matplotlib. Here is a list of them https://matplotlib.org/stable/plot_types/index.html. The syntax to create these plots is similar using either the scripting or object oriented API.\n",
"\n",
"Are there any plots you have an interest in knowing more about?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise\n",
"In the below cell we have generated 2 randomly generated vectors, v1 and v2. Create the following plot:\n",
"- A subplot with 2 rows and 1 column\n",
"- A vertical bar chart in row 1 with vector v1. Use the [bar](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html) function\n",
"- A horizontal bar chart in row 2 with vector v2. Use the [barh](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.barh.html#matplotlib.axes.Axes.barh) function\n",
"- Add the title 'Vertical Bar Chart' and 'Horizontal Bar Chart' to their respective subplots\n",
"\n",
"Use an object-oriented approach."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"rng = np.random.default_rng(12345)\n",
"x = 0.5 + np.arange(8)\n",
"v1 = rng.uniform(2, 7, size=len(x))\n",
"v2 = rng.uniform(1, 3, size=len(x))\n",
"\n",
"### Your code goes here\n",
"\n",
"fig, (ax11, ax21) = plt.subplots(2, 1)\n",
"fig.subplots_adjust(hspace=0.4)\n",
"ax11.bar(x, v1)\n",
"ax11.set_title('Vertical Bar Chart')\n",
"ax21.barh(x, v2)\n",
"ax21.set_title('Horizontal Bar Chart')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise\n",
"In the below cell we have generated a randomly generated vector v3. Create the following plot:\n",
"- A histogram. Use the hist() function\n",
"- Add the title \"Normal Distribution\"\n",
"- Add the labels \"X-axis\" to the X-axis and \"Y-axis\" to the Y-axis\n",
"- Set the color of the histogram to \"skyblue\"\n",
"\n",
"Use the scripting API"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"v3 = rng.normal(0.0, 2.0, 1000)\n",
"\n",
"fig, ax = plt.subplots()\n",
"ax.hist(v3, color = \"skyblue\")\n",
"### Your code goes here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Seaborn\n",
"\n",
"Seaborn is a visualization package built on top of matplotlib. This gives the Seaborn API some similarities to matplotlib. The advantages of using seaborn are:\n",
"\n",
"- easy integration with Pandas dataframes,\n",
"- build more complex graphs.\n",
"\n",
"We will demonstrate exampels of this below. The documentation for the lastest stable verion of Seaborn is here:\n",
"\n",
"https://seaborn.pydata.org/api.html\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"pgn = sns.load_dataset(\"penguins\")\n",
"pgn.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"sns.pairplot(pgn, hue=\"species\", diag_kind=\"hist\", height=1.5);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.scatterplot(data=pgn, x=\"bill_length_mm\", y=\"bill_depth_mm\", hue=\"species\");"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"sns.jointplot(x=pgn[\"bill_length_mm\"], y=pgn[\"bill_depth_mm\"], alpha=0.4);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"v1 = pgn[\"bill_length_mm\"]\n",
"v2 = pgn[\"bill_depth_mm\"]\n",
"sns.jointplot(x=v1, y=v2, kind='hex');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=(7,6))\n",
"plt.subplots_adjust(bottom=0.2, left=0.2)\n",
"ax = sns.heatmap(pgn.corr(numeric_only=True), annot=True);\n",
"ax.set_xticklabels(ax.get_xticklabels(), fontsize = 8);\n",
"ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontsize = 8);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise\n",
"Using the pgn dataframe create a Seaborn [violin plot](https://seaborn.pydata.org/generated/seaborn.violinplot.html) of the flipper length in mm. Give the plot the title \"Flipper Length for 3 Penquin Species.\" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### Your code goes here\n",
"ax = sns.violinplot(data=pgn, x=\"species\", y=\"flipper_length_mm\")\n",
"ax.set_title(\"Flipper Length for 3 Penguin Species\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas plot\n",
"\n",
"Pandas is a library for manipulating data using Series and Dataframes, see https://pandas.pydata.org/docs/index.html for more details. This is the main topic of the *Python for Data Analysis* tutorial. We previously saw a Dataframe object when we loaded the penquin dataset. \n",
"\n",
"Pandas has useful plotting tools for exploratory data analysis when you are working with a Dataframe object. To explore the pandas plotting functionality, we use the iris flower data set (https://en.wikipedia.org/wiki/Iris_flower_data_set). This is a dataset of observations on 3 species of iris flowers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"columns=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'SpeciesName']\n",
"iris = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/iris-data.csv')\n",
"iris.columns=columns\n",
"# Print out a preview of the dataframe.\n",
"iris.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### DataFrame.plot()\n",
"\n",
"kind : str\n",
" - 'line' : line plot (default)\n",
" - 'bar' : vertical bar plot\n",
" - 'barh' : horizontal bar plot\n",
" - 'hist' : histogram\n",
" - 'box' : boxplot\n",
" - 'kde' : Kernel Density Estimation plot\n",
" - 'density' : same as 'kde'\n",
" - 'area' : area plot\n",
" - 'pie' : pie plot\n",
" - 'scatter' : scatter plot\n",
" - 'hexbin' : hexbin plot"
]
},
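{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of the `kind` argument listed above, the sketch below uses a made-up dataframe. The column names `length` and `width`, the values, and the variable names are assumptions for this example only and are not taken from the iris data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Small made-up dataframe, for illustration only\n",
"df_demo = pd.DataFrame({'length': [1.0, 2.0, 3.0, 4.0],\n",
"                        'width': [0.5, 0.9, 1.6, 2.1]})\n",
"\n",
"# A scatter plot needs both the x and y column names\n",
"ax_scatter = df_demo.plot(kind='scatter', x='length', y='width')\n",
"\n",
"# A histogram of every numeric column; alpha is passed on to matplotlib\n",
"ax_hist = df_demo.plot(kind='hist', alpha=0.5)\n",
"plt.show()\n",
"```\n",
"\n",
"Each call returns a matplotlib Axes object, so the result can be customized further with the object-oriented API."
]
},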
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pandas dataframes have the matplotlib plot function built-in. \n",
"iris.plot?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Auto-plot the numeric data. The Names column is ignored as it's all strings\n",
"# and the default is a numeric line plot.\n",
"iris.plot();"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Select a column with the y argument. \n",
"#The x-values are automatically numbered by the number of rows.\n",
"iris.plot(y='SepalLength');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's try some other plots...\n",
"fig=iris.plot(kind='box')\n",
"fig.set_xticklabels(['sl','sw','pl','pw']);\n",
"fig.set_xlabel('features')\n",
"fig.set_ylabel('inches')\n",
"plt.tight_layout()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"iris.plot(kind='hist');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"iris.plot(y=['SepalWidth','SepalLength'], kind='kde');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plot results from data manipulation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"iris.groupby('SpeciesName').mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"iris.groupby('SpeciesName').mean().plot(kind='bar', ax=ax, rot=0);\n",
"ax.set_ylabel('cm')\n",
"ax.set_xlabel('Species')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"pd.plotting.scatter_matrix(iris);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plotly\n",
"\n",
"Plotly is a great data visualization software package, see https://plotly.com/python/ for more details. A major advantage of using plotly is you can easily create interactive exportable figures easily. There are two main ways to use plotly\n",
"\n",
"1. plotly.express \n",
" - Higher level interface that allows you to quickly create plots with only a few lines of code. We will demonstrate this by creating scatter plots and a chloropleth map plot. \n",
" \n",
"2. plotly.graph_objs \n",
" - This interface requires more programming but allows for fine-tuned control of plotly plots. Functions in plotly.express wrap around graph_objs. We use graph_objs to create a 3-D visualization and to make an interactive bubble chart of colors used by painters over time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotly scatter plots"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"plotlyServerURL": "https://plot.ly"
},
"data": [
{
"hovertemplate": "x=%{x}
y=%{y}