Index of /examples/machine_learning/classic_ML

README

Classic Machine Learning Example

Here we demonstrate some classic machine learning (ML) code and how it is implemented on the SCC.

Load Python3 module

The python3 module comes with classical machine learning libraries like NumPy, SciPy, Pandas and many others already installed. Before installing anything yourself, check whether the package you need is already provided by the python3 module.
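
One way to run that check from inside Python, once the module is loaded, is sketched below. It relies only on the standard library's importlib.util.find_spec. The helper name is_installed and the list of package names are illustrative assumptions, not part of the SCC tooling.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be imported in the current environment."""
    # find_spec returns None when no top-level module of that name is found.
    return importlib.util.find_spec(package_name) is not None


# Illustrative list; replace it with the packages your project needs.
for name in ["numpy", "scipy", "pandas", "sklearn"]:
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Note that the import name can differ from the project name: scikit-learn, for example, is imported as sklearn.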

1 - Input Data

It is important to understand how to package and parse your data. Do we package the data as NumPy arrays or Torch tensors, as a Dataset, or as a generator? Do we "pre-load" the data into memory or not?

In addition, it's important to decide how to present that data to the model. Do we give the model a normalized/scaled version of the data or just the raw data? Do we batch our data or feed the entire dataset as a single batch? And so on.
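
As a small illustration of the "normalized/scaled" option, the sketch below standardizes one feature column to zero mean and unit standard deviation using only the Python standard library. The function name standardize and the sample values are made up for this example. In practice a library routine such as scikit-learn's StandardScaler would normally be used instead.

```python
import statistics


def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant feature has no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up feature values
scaled = standardize(raw)
print(scaled)
```

The scaling statistics (mean and standard deviation) should be computed on the training data only and then reused for any held-out data, so that no information leaks from the held-out set into training.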

We won't be able to answer all of those questions today, but they are useful to keep in mind in any case.

1.A - Datasets

In many machine learning tasks, we train our model on a standard dataset. Standard datasets are helpful because researchers can compare their results on the same data (i.e. benchmarking). One of the most popular is MNIST, a dataset for classifying hand-written digits. Thankfully, we don't need to compile those standard datasets manually. Many are readily available through Python libraries like scikit-learn. Let's explore the datasets available in that library.

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets


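Let's look at the Boston House Pricing Dataset. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so it is only available in older releases.
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html#sklearn.datasets.load_boston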

Word of Caution - ML models carry over biases in the data!!



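The shape of the data is 20,640 examples; each example has 8 features.
The shape of the targets is 20,640 targets; each target is a single float number.
(These dimensions match scikit-learn's California Housing dataset, fetch_california_housing. The Boston dataset has only 506 examples with 13 features.)

The goal is to use those features to predict the single float number, i.e. the target price.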


1.C - Splitting & Batching

A common technique in ML is to split the data into training and validation sets. The training set is the portion of the data that the model will "see", i.e. be trained on. The validation set is not used in training. It is used only to validate the performance of the model, i.e. to assess how it behaves on data it hasn't seen before.
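
The split described above can be sketched in plain Python as follows. The function name train_val_split, the 80/20 ratio, the seed, and the toy data are assumptions made for illustration. scikit-learn provides train_test_split in sklearn.model_selection for the same purpose.

```python
import random


def train_val_split(features, targets, val_fraction=0.2, seed=0):
    """Shuffle the example indices and split them into training and validation parts."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    indices = list(range(len(features)))
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(indices) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    X_train = [features[i] for i in train_idx]
    y_train = [targets[i] for i in train_idx]
    X_val = [features[i] for i in val_idx]
    y_val = [targets[i] for i in val_idx]
    return X_train, X_val, y_train, y_val


# Toy data: 10 examples with 2 features each, and one float target per example.
X = [[float(i), float(i) * 2.0] for i in range(10)]
y = [float(i) * 3.0 for i in range(10)]

X_train, X_val, y_train, y_val = train_val_split(X, y, val_fraction=0.2, seed=42)
print(len(X_train), len(X_val))
```

Shuffling indices rather than the data itself keeps each example paired with its own target.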

The Train/Val split is very important. ML models suffer from two classic problems: over-fitting and under-fitting. Without going into much detail, we ultimately want the model to achieve as high an accuracy (or whatever other performance metric we choose) as possible on the validation set. This gives some assurance that the model will work on future data not present in the current dataset.

It is also common to batch the training data. Often the training set is too large to fit into CPU/GPU memory, so the model is trained on one small batch at a time. Some training algorithms, like Stochastic Gradient Descent (SGD), handle this "mini-batching" of the training data well. Other training algorithms, like Local Search, may not be able to train on batches.
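
For reference, mini-batching amounts to walking over the training set in fixed-size chunks. The generator below is a minimal sketch with a hypothetical name (iterate_minibatches) and toy data. It only shows the slicing and does not train anything.

```python
def iterate_minibatches(features, targets, batch_size):
    """Yield consecutive (features, targets) chunks of at most `batch_size` examples."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        # Slicing past the end is safe in Python; the last batch is simply shorter.
        yield features[start:end], targets[start:end]


# Toy data: 10 examples with a single feature each.
X = [[float(i)] for i in range(10)]
y = [float(i) for i in range(10)]

batch_sizes = [len(xb) for xb, _ in iterate_minibatches(X, y, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

An algorithm like SGD would update the model once per yielded batch, usually after reshuffling the data at the start of every pass.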

For our purposes in this session, we will put batching aside since none of the models we use here will require it.


2 - Machine Learning Model

In this section we present a couple of popular models. We will train those models and see how they compare against each other. We could go into the theory of each model and how it works, but that would require a semester-long course, perhaps even more. I recommend you satisfy your curiosity by searching for the different models and reading about them.

Since we are working on a regression problem, the default scoring metric in scikit-learn is the R2 score (the coefficient of determination). Its best possible value is 1.0, and it can be negative when the model does worse than always predicting the mean of the targets. The closer the score is to 1.0, the better the model is performing. For more information on that metric, see: https://en.wikipedia.org/wiki/Coefficient_of_determination.
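
To make the metric concrete, the sketch below computes R2 by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true targets from their mean. The function name r_squared and the sample numbers are chosen for illustration only. With scikit-learn you would normally call a regressor's score method or sklearn.metrics.r2_score.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true targets are identical")
    return 1.0 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]   # made-up targets
y_pred = [2.5, 0.0, 2.0, 8.0]    # made-up predictions

print(round(r_squared(y_true, y_pred), 4))     # 0.9486
print(r_squared(y_true, y_true))               # 1.0, a perfect fit
print(r_squared(y_true, [10.0] * 4) < 0)       # True, worse than predicting the mean
```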

Contact Information

Help: help@scv.bu.edu

Note: RCS example programs are provided "as is" without any warranty of any kind. The user assumes the entire risk of quality, performance, and repair of any defect. You are welcome to copy and modify any of the given examples for your own use.