# Logistic Regression for Breast Cancer Classification

In this notebook we will see how to solve a classification problem using logistic regression.

We will:
- Use the Python library `scikit-learn`
- Import the breast cancer dataset from the `datasets` submodule
- Split the dataset into training and testing subsets with the `train_test_split` function from the `model_selection` submodule
- Standardize the input features with `StandardScaler` from the `preprocessing` submodule
- Create a `LogisticRegression` object from the `linear_model` submodule
- Train this model on the training set
- Predict and evaluate the results of our `LogisticRegression` model on the test set using `metrics`
- Use `matplotlib` and `seaborn` to plot relevant model metrics

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

The breast cancer dataset is available in `scikit-learn`. Many machine learning libraries come with built-in datasets or expose an API with which you can download datasets to train and test your model. 

We set the input parameter `as_frame=True` in the `load_breast_cancer()` function to return the data as a Pandas dataframe. The other dataset loaders in `sklearn.datasets` behave in a similar fashion.

In [None]:
# Load the breast cancer dataset as a dataframe
bc_dataset = load_breast_cancer(as_frame=True)

`bc_dataset` is a dictionary-like `Bunch` object.

To obtain the input features we index it with `bc_dataset["data"]`.

To obtain the output target we index it with `bc_dataset["target"]`.
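
To see what the loader returns, the following minimal sketch inspects the shape of the feature table, the class names, and the class balance. The variable names `bunch` and `class_counts` are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset as pandas objects (illustrative variable name)
bunch = load_breast_cancer(as_frame=True)

# Number of samples and number of features
print(bunch.data.shape)

# Class names, ordered by label: index 0 is label 0, index 1 is label 1
print(bunch.target_names.tolist())

# How many samples carry each label
class_counts = bunch.target.value_counts()
print(class_counts)
```

Checking `target_names` is worthwhile because it tells you which diagnosis each numeric label stands for before you interpret any metric.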

In [None]:
# X is a Pandas dataframe
# The columns are the features 
X = bc_dataset["data"]

# y is a Pandas series with the target class labels (0 = malignant, 1 = benign)
y = bc_dataset["target"]

# Explore these objects with the .head() method

In [None]:
X.head()

In [None]:
X.describe()

In [None]:
y.head(20)

In [None]:
# Using the train_test_split function we put 80% of the data into the X_train, y_train NumPy arrays
# The remaining 20% is our X_test and y_test
X_train, X_test, y_train, y_test = train_test_split(X.to_numpy(), y.to_numpy(), test_size=0.20, random_state=10)

In [None]:
# Create a StandardScaler object
sc = StandardScaler()

# The StandardScaler standardizes features by removing the mean and scaling to unit variance
# This prevents features with larger variances from dominating
# We only apply this to the input features, not to the targets
# Fit the scaler on the training data only, then reuse the same mean and variance for the test data
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
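
A common convention is to learn the scaling statistics from the training data alone and reuse them for the test data, so that no information from the test set leaks into preprocessing. The sketch below illustrates this on synthetic data; the arrays `train` and `test` and the random seed are assumptions made for the example, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real features (assumption for illustration)
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train, then scales it
test_scaled = scaler.transform(test)        # reuses the training mean and std

# Training columns now have mean ~0 and standard deviation ~1
print(train_scaled.mean(axis=0))
print(train_scaled.std(axis=0))

# Test columns are close to, but not exactly, mean 0: they were scaled with training statistics
print(test_scaled.mean(axis=0))
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics, so the two sets would no longer share a common scale.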

In [None]:
# Create our logistic regression object
logistic_regr = LogisticRegression()

In [None]:
logistic_regr.fit(X_train, y_train)

In [None]:
predictions = logistic_regr.predict(X_test)
print(predictions)

In [None]:
score = logistic_regr.score(X_test, y_test)

In [None]:
print("The accuracy of the model is: ", score)
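
For a classifier, `score` reports the mean accuracy, which is the fraction of predictions that match the true labels. The sketch below reproduces that calculation on small made-up label arrays (`y_true` and `y_pred` are illustrative only).

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Accuracy is the share of positions where prediction and truth agree
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = metrics.accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```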

In [None]:
# Using the metrics submodule we can compute the confusion matrix
cm = metrics.confusion_matrix(y_test, predictions)
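
In scikit-learn's confusion matrix, rows correspond to the actual labels and columns to the predicted labels, so for a binary problem the four cells can be unpacked with `ravel()`. The sketch below uses made-up labels (`y_true` and `y_pred` are illustrative) to show the layout.

```python
from sklearn import metrics

# Made-up labels for illustration
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]

toy_cm = metrics.confusion_matrix(y_true, y_pred)
print(toy_cm)

# Row-major order for binary labels {0, 1}:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = toy_cm.ravel()
print(tn, fp, fn, tp)
```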

In [None]:
# Using matplotlib and seaborn we can display a heatmap of the confusion matrix
plt.figure(figsize=(9, 9))
# The cells hold integer counts, so the annotations are formatted as integers
sns.heatmap(cm, annot=True, fmt="d", linewidths=.5, square=True, cmap='Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
all_sample_title = 'Accuracy Score: {0}'.format(score)
plt.title(all_sample_title, size=15);

In [None]:
# True positives / (True positives + False positives)
# Quality of a positive prediction
# Answers what proportion of positive identifications were actually correct.
# High precision means few false positives
precision = metrics.precision_score(y_test, predictions)
print("Precision score: ", precision)

In [None]:
# True positives / (True positives + False negatives)
# Answers what proportion of the actual positives were identified correctly.
# High recall means few false negatives
recall = metrics.recall_score(y_test, predictions)
print("Recall score: ", recall)
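
The two formulas in the comments above can be checked by hand against the confusion-matrix counts. The sketch below does so on made-up labels (`y_true` and `y_pred` are illustrative), treating label 1 as the positive class, which is the default for `precision_score` and `recall_score`.

```python
from sklearn import metrics

# Made-up labels for illustration: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

# Precision: of everything predicted positive, how much really was positive
manual_precision = tp / (tp + fp)
# Recall: of everything that really was positive, how much was found
manual_recall = tp / (tp + fn)

print(manual_precision, metrics.precision_score(y_true, y_pred))
print(manual_recall, metrics.recall_score(y_true, y_pred))
```

Here the model finds 2 of the 4 actual positives and raises 1 false alarm, so precision is 2/3 while recall is 1/2.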

In [None]:
# ROC curve
# fpr = FP / (FP + TN)
# tpr = TP / (TP + FN)
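# Column 1 of predict_proba holds the predicted probability of the positive class (label 1)
lr_probs = logistic_regr.predict_proba(X_test)
lr_preds = lr_probs[:, 1]
# roc_curve returns one (fpr, tpr) pair per decision threshold
fpr, tpr, thresholds = metrics.roc_curve(y_test, lr_preds)
print(thresholds)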

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.01, 1])
plt.ylim([0, 1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
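
A ROC curve is often summarized by the area under it (AUC): 1.0 corresponds to a perfect ranking of positives above negatives, and 0.5 to the dashed diagonal of a random classifier. The sketch below computes it two equivalent ways on made-up scores (`y_true` and `scores` are illustrative only).

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predicted positive-class scores for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Directly from labels and scores
auc_direct = metrics.roc_auc_score(y_true, scores)

# From the curve itself, by integrating tpr over fpr
toy_fpr, toy_tpr, _ = metrics.roc_curve(y_true, scores)
auc_from_curve = metrics.auc(toy_fpr, toy_tpr)

print(auc_direct, auc_from_curve)
```

In the notebook, the same calls could be applied to `y_test` and `lr_preds` to put a single number on the plotted curve.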