# Introduction to Scikit-Learn

Scikit-Learn (sklearn) is a powerful Python package for machine learning. The goals of this tutorial are:

1. To learn how to use Scikit-Learn to implement machine learning models.
2. To understand the general structure of using the Scikit-Learn API.

The main framework for implementing machine learning models in sklearn is:

1. Import the sklearn objects you need for the code.
2. Prepare a set of preprocessed (namely cleaned and scaled) data to give your model.
3. Create the model object in your code.
4. Use the model object to train your model using the appropriate training method (usually `fit()`).
5. Apply the model to data that the model has not seen (test data) using the appropriate prediction/transformation method (usually `predict()`).

Once you understand this structure and the methods of the sklearn objects that carry out each step, you have the core of what you need to work with sklearn.
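
As a minimal sketch of the five steps listed above, the following cell runs the whole workflow on a small generated dataset. The dataset, the choice of `LogisticRegression`, and all variable names are illustrative assumptions rather than part of the examples later in this tutorial.

```python
# 1. Import the sklearn objects you need
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 2. Prepare the data: generate, split, then scale
X_demo, y_demo = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
demo_scaler = StandardScaler().fit(X_tr)  # learn the scaling from the training data only
X_tr_s = demo_scaler.transform(X_tr)
X_te_s = demo_scaler.transform(X_te)

# 3. Create the model object
demo_model = LogisticRegression()

# 4. Train the model
demo_model.fit(X_tr_s, y_tr)

# 5. Apply the model to data it has not seen
y_hat = demo_model.predict(X_te_s)
demo_accuracy = demo_model.score(X_te_s, y_te)
print("Predictions shape:", y_hat.shape)
print("Test accuracy:", demo_accuracy)
```

Every model in the rest of the tutorial follows this same create, `fit()`, `predict()` pattern.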

In this tutorial we will cover features of sklearn that allow you to:

- Load and preprocess data
- Implement supervised learning models
- Implement unsupervised learning models

This notebook introduces these concepts with example code cells. Attendees are expected to follow along and execute the code cells themselves. I will explain what each of the commands in the code blocks does.

I have included a few practice examples at the end of the tutorial.

In [None]:
# Import sklearn and print the version
import sklearn
print(sklearn.__version__)

## Data preprocessing

Data preprocessing is an essential step before applying any machine learning algorithm. In general, you are not handed a ready-to-use dataset. Datasets often contain incorrect data, missing data, and data with different scales and types.

Before you can extract useful information from the data through a machine learning algorithm, you will need to preprocess the data. In this section we will demonstrate the following topics:

- Loading datasets
  - Toy datasets
  - External datasets
  - Generated datasets
  - Real world datasets
- Exploratory data analysis
  - Pandas tools
  - Basic visualization
- Test train splits
- Scaling datasets
  - Scaler object
  - Min-max scaling
  - Standardization

### Toy datasets

Scikit-learn provides some built-in toy datasets, each of which can be loaded with a single API call. These toy datasets make it easy to test out many kinds of machine learning algorithms. The list of datasets is at this [link](https://scikit-learn.org/stable/datasets/toy_dataset.html#toy-datasets). The following code cell shows how to import a built-in toy dataset.

In [None]:
from sklearn import datasets

X = datasets.load_iris()

print(X)

What kind of object is X? You can find this out by using the built-in `type()` function.

In [None]:
print(type(X))

This means we are working with an object of type `Bunch`. The `Bunch` object X has the following attributes:

- `data`: the data matrix
- `target`: the classification target
- `feature_names`: the names of the dataset columns
- `target_names`: the names of the target classes

To access the two NumPy arrays that contain the data matrix and the target values, use these commands:

1. `X_data = X["data"]` or `X.data`
2. `X_target = X["target"]` or `X.target`

The same syntax works for `feature_names` or `target_names`.

You can read more about this data type at this [link](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html). Let's see this in action.

In [None]:
print(X)

In [None]:
X_data = X["data"]
print(X_data)

In [None]:
X_target = X["target"]
print(X_target)

In [None]:
print(type(X_data))
print(type(X_target))
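
The cells above use key access. As a small sketch, the cell below checks that attribute access returns the same objects and prints the `feature_names` and `target_names` attributes as well. The variable name `iris_bunch` is only illustrative.

```python
from sklearn.datasets import load_iris

iris_bunch = load_iris()

# Key access and attribute access return the same underlying object
print(iris_bunch["data"] is iris_bunch.data)

# Shapes of the data matrix and the target vector
print(iris_bunch.data.shape, iris_bunch.target.shape)

# Column names and class names
print(iris_bunch.feature_names)
print(list(iris_bunch.target_names))

# A Bunch behaves like a dictionary, so you can list its keys
print(list(iris_bunch.keys()))
```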

It is also possible to load the data as a Pandas dataframe. Pandas is a Python package for storing and manipulating data. In Pandas, data is stored in a `DataFrame` object, which holds the data in a table (rows and columns) and has methods to manipulate and analyze the data it contains.

In [None]:
Z = datasets.load_iris(as_frame=True)
print(Z)

In [None]:
Z_data = Z["data"]
print(type(Z_data))
print(Z_data.head())

In [None]:
Z_target = Z["target"]
print(type(Z_target))
print(Z_target.head())

In [None]:
Z_names = Z["target_names"]
print(Z_names)

## Loading other datasets

In general you do not develop machine learning applications with a toy dataset. Instead, your dataset comes from a database or a file (e.g., CSV, Excel). Scikit-learn offers limited tools to import files, so you have to use other tools, like Pandas, to import your file into dataframes or arrays. To load a `.csv` file you can use the Pandas function `read_csv()`; see this [link](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for the full documentation.

In [None]:
import pandas as pd
print(pd.__version__)

In [None]:
df = pd.read_csv("datasets/Salaries.csv")
print(df.head())
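
If you do not have a CSV file at hand, `read_csv()` also accepts a file-like object. The sketch below reads a few rows of made-up CSV text from memory; the column names and values are hypothetical and are not taken from `Salaries.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name,department,salary
Ada,Engineering,120000
Grace,Research,115000
Linus,Engineering,
"""

# read_csv accepts a path or any file-like object
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.head())
print(df_demo.dtypes)        # salary is parsed as float because one value is missing
print(df_demo.isna().sum())  # number of missing values in each column
```

Checking `dtypes` and missing values right after loading is a cheap way to catch parsing problems before any modeling.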

## Generating a dataset

Scikit-learn has built-in functions that allow you to create a random dataset. These randomly generated datasets can then be used to explore various machine learning algorithms.

In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)

In [None]:
print(X.shape)

In [None]:
print(y)
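
To make the structure of the generated data more concrete, here is a short sketch that inspects what `make_blobs` returns, using the same arguments as above. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs

X_blobs, y_blobs = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)

print(X_blobs.shape)         # one row per sample, one column per feature
print(y_blobs.shape)         # one integer label per sample
print(np.unique(y_blobs))    # the distinct blob labels
print(np.bincount(y_blobs))  # number of samples in each blob

# The same random_state reproduces exactly the same data
X_again, y_again = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)
print(np.array_equal(X_blobs, X_again))
```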

We can plot the data that we just created using [matplotlib](https://matplotlib.org/).

In [None]:
import matplotlib.pyplot as plt
colors = {0:'red', 1:'blue'};
c_arr=[colors[k] for k in y]
plt.scatter(X[:, 0], X[:, 1], marker="o", c=c_arr, s=50, edgecolor="k");

## Real world data sets

Using scikit-learn you can also import real world datasets. The real world datasets are larger than the toy datasets, so they are downloaded on first use and cached locally.

In [None]:
from sklearn.datasets import fetch_california_housing
RW = fetch_california_housing(as_frame=True)

In [None]:
RW_data = RW["data"]
RW_target = RW["target"]

In [None]:
RW_data.head()

In [None]:
RW_target.head()

## Exploratory data analysis

It is important to explore and understand your dataset prior to applying machine learning algorithms to it. There are a few functions in Pandas that are helpful for this. 

We will explore these functions using the iris dataset first. We will perform a few transformations on this dataset prior to the analysis.

In [None]:
## Data manipulation
Z_df = Z.frame

Z_df["target_names"] = Z_df["target"].replace(
    to_replace={
        0: Z.target_names[0],
        1: Z.target_names[1],
        2: Z.target_names[2],
    }
)
print(Z_df.head())

In [None]:
## Pandas info
Z_df.info()

In [None]:
## Pandas describe
Z_df.describe()
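
Beyond `info()` and `describe()`, a few other Pandas calls are often useful at this stage. The sketch below reloads the iris data into a fresh dataframe (named `iris_df` here only for illustration) and checks for missing values, class balance, and per-class feature means.

```python
from sklearn.datasets import load_iris

iris_df = load_iris(as_frame=True).frame

# Number of missing values in each column
print(iris_df.isna().sum())

# Number of samples in each class (class balance)
print(iris_df["target"].value_counts())

# Mean of each feature within each class
print(iris_df.groupby("target").mean())
```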

It is also very helpful to visualize variables of interest. We will use a visualization package called [Seaborn](https://seaborn.pydata.org/). Seaborn is intended for visualizing statistical data. In particular, most functions require as input a dataframe.

In [None]:
## Seaborn to visualize data
import seaborn as sns
print(sns.__version__)

In [None]:
## Boxplot
sns.boxplot(data=Z_df, x="target_names", y="sepal length (cm)");

In [None]:
## Pairplot
sns.pairplot(Z_df.drop("target", axis=1), hue="target_names");

## Exercises
Let's do some exploratory data analysis on the RW_data set.

In [None]:
### Determine if there are any null values in the RW_data dataframe ###
RW_data.info()

In [None]:
### Compute the descriptive statistics of all the input features ###
RW_data.describe()

In [None]:
### Create a horizontal box plot of RW_data
### What is an important observation from this plot?
### How is it a useful visualization? How is it not a useful visualization?
sns.boxplot(data=RW_data, orient="h");

## Dataset summary

We have 4 datasets stored in our notebook, which we summarize below:

- Iris dataset (toy dataset)
- Generated dataset
- California housing dataset (real world dataset)
- Salaries dataset (external dataset loaded from a CSV file)


We will use these datasets in subsequent cells of the notebook when we explore more techniques in Scikit-learn.

## Test train split

In order to train and test your model you need to split your dataset into two sets:

- training set
- test set

This is easy to accomplish with `sklearn.model_selection.train_test_split`.

We will see how this function works using our generated dataset. This is because it is small and easy to confirm the expected behavior.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
print("Shape of X_train: ", X_train.shape, ", Shape of X_test: ", X_test.shape)
print("Shape of y_train: ", y_train.shape, ", Shape of y_test: ", y_test.shape)
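
For classification problems it is often useful to keep the class proportions the same in both subsets. As a sketch, the cell below regenerates the blob data under new names and passes the optional `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X_s, y_s = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)

# stratify=y_s asks for the same class proportions in both subsets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_s, y_s, test_size=0.20, random_state=42, stratify=y_s
)

print(Xs_train.shape, Xs_test.shape)
print("Train class counts:", np.bincount(ys_train))
print("Test class counts: ", np.bincount(ys_test))
```

Fixing `random_state` makes the split reproducible, which helps when comparing models on the same data.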

## Scaling data

We have now observed and explored a few different datasets. One important observation we made was that our datasets can have very different scales.

Why is it not a good idea to run a machine learning algorithm on a dataset where the input features have scales that differ by orders of magnitude?

What can we do to fix this? As the title of this section suggests, we will scale our data. This is a way to ensure that the numerical features of our datasets have scales that are of the same order of magnitude.

Here are a few common ways to scale data:

 - Standard scaling
 - Min-Max scaling
 
These (and others) are all implemented in scikit-learn, and the procedure is the same for each of them: you start by creating a scaler object and then use this object to scale your data.

### Standard Scaling
Standard scaling scales data so that all the numerical features have zero mean and unit variance.

Since the scales of the RW dataset were very different, we will apply the scaler to this dataset and then replot the box and whisker plot.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
RW_standard = scaler.fit_transform(RW_data)
RW_standard = pd.DataFrame(RW_standard, columns=RW_data.columns)
RW_standard.head()

In [None]:
sns.boxplot(data=RW_standard, orient="h");
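
To see what the scaler computes, the sketch below compares `StandardScaler` with a manual calculation in NumPy. It uses the iris data instead of the California housing data so that it runs without a download; that substitution is an assumption made for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_iris = load_iris().data

std_scaler = StandardScaler()
X_scaled = std_scaler.fit_transform(X_iris)

# Manual standardization: subtract the column mean, divide by the column standard deviation
X_manual = (X_iris - X_iris.mean(axis=0)) / X_iris.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print("Column means after scaling:", X_scaled.mean(axis=0).round(6))
print("Column stds after scaling: ", X_scaled.std(axis=0).round(6))

# The fitted scaler stores the statistics it learned
print("Learned means:", std_scaler.mean_)
```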

### Min-max scaling

We can also do min-max scaling, which rescales each feature linearly so that its minimum maps to 0 and its maximum maps to 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
RW_minmax = scaler.fit_transform(RW_data)
RW_minmax = pd.DataFrame(RW_minmax, columns=RW_data.columns)
RW_minmax.head()

In [None]:
sns.boxplot(data=RW_minmax, orient="h");
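
The same kind of check works for min-max scaling. This sketch again uses the iris data as a stand-in (an assumption made to avoid a download) and also shows that data the scaler has not seen can land outside the range from 0 to 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X_iris = load_iris().data

mm_scaler = MinMaxScaler()
X_mm = mm_scaler.fit_transform(X_iris)

# Manual min-max scaling: (x - min) / (max - min), column by column
col_min = X_iris.min(axis=0)
col_max = X_iris.max(axis=0)
X_mm_manual = (X_iris - col_min) / (col_max - col_min)

print(np.allclose(X_mm, X_mm_manual))
print("Column minima after scaling:", X_mm.min(axis=0))
print("Column maxima after scaling:", X_mm.max(axis=0))

# A sample larger than anything seen during fit is mapped above 1
new_sample = col_max.reshape(1, -1) + 1.0
X_new_scaled = mm_scaler.transform(new_sample)
print(X_new_scaled)
```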

## Supervised learning

In this section we will cover supervised learning algorithms in Scikit-Learn. Supervised learning is a machine learning technique where we train a model using *labeled* data. This trained model can then be used to predict values on new data.

There are two broad categories of supervised learning:

- Regression, when the model predicts continuous variables
- Classification, when the model segments data into classes

Scikit-learn has a standardized API which makes it easy to train different models with very similar pieces of code. Generally you will create an object for the model that you want, e.g., `LinearRegression` or `LogisticRegression`. These objects have all the methods you need to train your model and then predict values with it.

We will cover the following supervised learning methods:

- Linear regression (regression)
- Logistic regression (classification)

### Notation
We introduce notation and the general ideas behind supervised learning:

- A pair $(x^{(i)}, y^{(i)})$ is called a training example
- A set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ is called a training set
- The goal is to find a function $h(x)$ that is good at predicting targets $y$
- Assume $\hat{y} = h_{w}(x)$ depends on a parameter $w$ (or parameters $w_{i}$ if $x$ is a vector)
- Use the labeled training set to *learn* the parameter(s) $w$ for the function $h_{w}(x)$
- The fully trained $h_{w}(x)$ is referred to as a *model*
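
The notation above can be made concrete with a small NumPy sketch. It assumes a linear hypothesis $h_{w}(x) = w \cdot x + b$ and learns the parameters by least squares from noise-free labels; the "true" parameter values and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training set: m examples, each x has 3 features
m = 50
X_train_demo = rng.normal(size=(m, 3))

# Hypothetical "true" parameters, used only to generate the labels
w_true = np.array([2.0, -1.0, 0.5])
b_true = 4.0
y_train_demo = X_train_demo @ w_true + b_true


def h(x, w, b):
    """Linear hypothesis h_w(x) = w . x + b."""
    return x @ w + b


# Learn the parameters by least squares; the column of ones accounts for b
A = np.column_stack([X_train_demo, np.ones(m)])
params, *_ = np.linalg.lstsq(A, y_train_demo, rcond=None)
w_hat, b_hat = params[:3], params[3]

print("Learned w:", w_hat)
print("Learned b:", b_hat)
print("Prediction for a new x:", h(np.array([1.0, 1.0, 1.0]), w_hat, b_hat))
```

Because the labels contain no noise, the learned parameters match the ones used to generate the data up to floating-point error. With real data they would only be estimates.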

## Linear regression

To create a linear regression model in scikit-learn you will instantiate the [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) object.

We use the scaled California housing dataset to demonstrate how to create this object, train the model, and then predict on the test set.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Fetch data
RW = fetch_california_housing()
X = RW["data"]
y = RW["target"]

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Standardize: fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create regression model
reg = LinearRegression()
# Train
reg.fit(X_train, y_train)

# Predict on test set
y_pred = reg.predict(X_test)

# R^2 value
r2 = reg.score(X_test, y_test)
print("The R^2 score is : ", r2)

# Report Mean Square Error (mse)
mse = mean_squared_error(y_test, y_pred)
print("Mean squared error: ", mse)
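
One way to keep the scaling and the model together is scikit-learn's `make_pipeline`, which fits the scaler on the training data and reuses the learned statistics at prediction time. The sketch below uses the bundled diabetes dataset as a stand-in for the California housing data so that it runs without a download; that substitution and the variable names are assumptions for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in regression dataset that ships with scikit-learn
X_d, y_d = load_diabetes(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_d, y_d, random_state=0)

# The pipeline fits the scaler on the training data only,
# then reuses those statistics whenever it predicts
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(Xd_train, yd_train)
yd_pred = pipe.predict(Xd_test)

r2_from_score = pipe.score(Xd_test, yd_test)
r2_from_metric = r2_score(yd_test, yd_pred)
mse_demo = mean_squared_error(yd_test, yd_pred)
print("R^2 from score():  ", r2_from_score)
print("R^2 from r2_score:", r2_from_metric)
print("Mean squared error:", mse_demo)

# The fitted regression step exposes the learned parameters
lin = pipe.named_steps["linearregression"]
print("Number of coefficients:", lin.coef_.shape[0])
```

The two R^2 values agree because `score()` on a regressor computes the same quantity as `r2_score`.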

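## Logistic regression

To create a logistic regression model in scikit-learn you will instantiate the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) object. We use the iris dataset to demonstrate how to train this classifier and evaluate it on the test set.

In [None]: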

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Fetch data
Iris = load_iris()
X = Iris["data"]
y = Iris["target"]

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Standardize: fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Create classification model
clf = LogisticRegression()
# Train
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)

# Classification report
print(classification_report(y_test, y_pred, target_names=Iris.target_names))
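
Besides hard class predictions, a fitted `LogisticRegression` can return class probabilities through `predict_proba()`. The following sketch repeats the iris workflow under separate variable names (chosen here only for illustration) and also prints the accuracy and the confusion matrix.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_i, y_i = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_i, y_i, random_state=0, stratify=y_i
)

# Fit the scaler on the training data and reuse it for the test data
iris_scaler = StandardScaler().fit(Xi_train)
clf_demo = LogisticRegression().fit(iris_scaler.transform(Xi_train), yi_train)

Xi_test_s = iris_scaler.transform(Xi_test)
yi_pred = clf_demo.predict(Xi_test_s)
yi_proba = clf_demo.predict_proba(Xi_test_s)  # one column per class

print("Probabilities for the first test sample:", yi_proba[0].round(3))
print("Accuracy:", accuracy_score(yi_test, yi_pred))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(yi_test, yi_pred)
print(cm)
```

The predicted class for each sample is the one with the highest probability in its row.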

## Unsupervised learning

In this section we will cover unsupervised learning algorithms in Scikit-Learn. Unsupervised learning is a machine learning technique where we train a model using *un-labeled* data. With unsupervised learning algorithms you are extracting information from the data itself without any labels.

Some examples of unsupervised learning techniques that we cover are:

- Clustering
- Principal Component Analysis (PCA)

## Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

Here we will consider K-means clustering, which groups observations into k clusters. The algorithm determines a centroid for each cluster, and each observation is assigned to the cluster whose centroid is closest to it.

For this problem we will work with a generated dataset.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate data with 2 clusters
X, y = make_blobs(n_samples=500, centers=2, n_features=2, random_state=10)

# Create cluster object
cluster = KMeans(n_clusters=2, n_init="auto");

# Train cluster model
cluster.fit(X);

print("Cluster centers: ", cluster.cluster_centers_)

In [None]:
import matplotlib.pyplot as plt
colors = {0:'red', 1:'blue'};
c_arr=[colors[k] for k in y]
plt.scatter(X[:, 0], X[:, 1], marker="o", c=c_arr, s=25, edgecolor="k");
plt.scatter(cluster.cluster_centers_[:, 0], cluster.cluster_centers_[:, 1], marker='*', s=100, c='y');
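
A fitted `KMeans` object also exposes the cluster assigned to each training point (`labels_`) and can assign new points with `predict()`. The sketch below uses two hand-picked, well-separated blob centers, which are hypothetical values chosen so that the result is easy to check.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Two well-separated blobs with hypothetical, hand-picked centers
true_centers = np.array([[-5.0, -5.0], [5.0, 5.0]])
X_k, y_k = make_blobs(n_samples=500, centers=true_centers, cluster_std=1.0, random_state=10)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_k)

# labels_ holds the cluster index assigned to each training point
print("First ten labels:", km.labels_[:10])

# Cluster indices are arbitrary, so compare with a permutation-invariant score
ari = adjusted_rand_score(y_k, km.labels_)
print("Adjusted Rand index:", ari)

# predict() assigns new points to the nearest learned centroid
new_points = np.array([[-4.0, -6.0], [6.0, 4.0]])
new_labels = km.predict(new_points)
print("Clusters for new points:", new_labels)
```

Note that cluster index 0 in the model does not have to correspond to blob 0 in the generated labels, which is why the comparison uses the adjusted Rand index rather than plain accuracy.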

## Principal component analysis (PCA)

PCA is an unsupervised machine learning algorithm that helps to reduce the dimension of your data. The dimension of your data is the number of input features. The algorithm constructs a smaller set of new features, called principal components, each of which is a linear combination of the original input features, chosen so that together they account for most of the variance in the data. This means that you can work with fewer features (smaller data) while retaining most of the information in the full set of input features.

In [None]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
print("Feature names: ", iris.feature_names)

In [None]:
X = iris.data
y = iris.target
target_names = iris.target_names

pca = PCA(n_components=2)
# Fit PCA and project the data onto the first two principal components
X_r = pca.fit_transform(X)

In [None]:
plt.figure()
colors = ["navy", "turquoise", "darkorange"]
lw = 2

for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(
        X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=0.8, lw=lw, label=target_name
    )
plt.legend(loc="best", shadow=False, scatterpoints=1)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("PCA of IRIS dataset");
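
To check how much of the variance the two components actually capture, you can inspect `explained_variance_ratio_` on the fitted object. This sketch refits PCA on the iris data under separate, illustrative variable names and also measures how much detail is lost when mapping the projection back to the original four features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_p = load_iris().data

pca_demo = PCA(n_components=2)
X_proj = pca_demo.fit_transform(X_p)

# Fraction of the total variance captured by each principal component
ratios = pca_demo.explained_variance_ratio_
print("Explained variance ratio:", ratios)
print("Total variance retained:", ratios.sum())

# Each component is a weighted combination of the four original features
print("Component weights shape:", pca_demo.components_.shape)

# Map the 2-D projection back to 4-D to see how much detail is lost
X_back = pca_demo.inverse_transform(X_proj)
print("Mean squared reconstruction error:", np.mean((X_p - X_back) ** 2))
```

For the iris data the first two components retain well over 90 percent of the variance, which is why the two-dimensional plot separates the classes so clearly.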

## Summary

We have covered many topics in this tutorial. We have seen how to preprocess data and train supervised and unsupervised machine learning models. We can also compute simple metrics to evaluate the performance of these models. Hopefully this has given you a better idea of how to use sklearn.


Remember that the main framework for working with sklearn has the following structure:

1. Import the sklearn objects you need for the code.
2. Prepare a set of preprocessed (namely cleaned and scaled) data to give your model (usually in the form of numpy arrays).
3. Create the model object in your code.
4. Use the model object to train your model using the appropriate training method (usually `fit()`).
5. Apply the model to data that the model has not seen (test data) using the appropriate prediction/transformation method (usually `predict()`).

Knowing this structure, together with the methods of the various data and model classes that carry out each step, gives you the core of what you need to work with sklearn.

## Exercises

In [None]:
## Exercise 1 ##
# Using the California housing dataset:
# Train a regression model only on these features
# Evaluate the performance of this model using MSE
# Does this reduced set of features give better performance than the full set of input features?

In [None]:
## Exercise 2 ##
# Load the breast cancer dataset
# Create and train a random forest model on this dataset, call this object model1
# Create and train a logistic regression model on this dataset, call this object model
# Evaluate the performance of both models
# Which model is more accurate?
# Which model is a better choice for this application and why?

In [None]:
## Exercise 3 ##
# Generate a dataset with 4 blobs using these parameters
# Perform K-means clustering to cluster the 4 blobs using these parameters
# Evaluate the model using this function
# Try to get more than 90% accuracy on the model