Application of Random Forest and K-nearest Neighbours to student alcohol consumption and test performance (Matlab)

Introduction

Description of the problem

  • Two classification models are used to estimate students' second-year grades (referred to as G2) from the previous year's grades (G1), to decide whether including additional variables improves our predictions.
  • We compare the predictions to the actual results obtained by each student, which are known to us.

Description of the data set (UCI data source)

  • Two data sets imported from the UCI Machine Learning Repository and merged in Excel.
  • 34 variables (numeric, nominal, binary), 992 observations.
  • Scores G1 and G2 are approximately normally distributed (see the histograms below).
  • Students with a zero value for either grade have been removed; we assume they have not yet sat the tests and therefore provide no usable information for our hypothesis.

Summary of G1, G2 and age

[Figure: summary statistics of G1, G2 and age]

Histograms of G1 and G2 scores

[Figure: histograms of G1 (20 bins) and G2 (15 bins)]

 

Methods

Random Forests

  • An ensemble classifier consisting of many decision trees; the output is the mode of the individual trees' predictions (1), as sketched after this list
  • Is known for its accuracy, even on large data sets (1)
  • Can handle many input variables and can show which variables are important
  • Hyperparameters are: 1) the number of trees and 2) the minimum leaf size
  • Known to overfit when too many parameters are used relative to the number of observations (1)
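As a minimal illustration of the bagged-trees idea, the sketch below trains an ensemble with Matlab's TreeBagger and reads off the majority vote; X (a predictor matrix) and Y (a vector of grade labels) are placeholder names, not variables from the original code:

    % Bag 100 classification trees; predict() returns the majority vote
    % (mode) across the ensemble as a cell array of character labels.
    mdl  = TreeBagger(100, X, Y, 'Method', 'classification');
    yHat = str2double(predict(mdl, X));   % back to numeric grade labels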

K-nearest Neighbours

  • A method for classifying objects based on closest training examples in the feature space (2)
  • Simplest classification technique when there is little or no prior knowledge about the distribution of the data (2)
  • Performance of a KNN classifier is primarily determined by the choice of K as well as the distance metric applied (2)
  • Robust to noisy training data
  • There is a non-existent or minimal training phase but a costly testing phase, in terms of both time and memory (3); a minimal sketch of the method follows this list
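A minimal sketch of the nearest-neighbour idea, assuming a training matrix Xtrain with numeric labels yTrain and query points Xquery (all placeholder names); knnsearch uses the Euclidean distance by default:

    k    = 5;                                  % illustrative choice of K
    idx  = knnsearch(Xtrain, Xquery, 'K', k);  % indices of the k nearest training points
    yHat = mode(yTrain(idx), 2);               % majority vote among the k neighbour labels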

 

Hypothesis

  • We expect the alcohol-consumption variables (Dalc, x27 and Walc, x28) to have a noticeable impact on students' success
  • Evaluation criteria: a prediction is accurate when the estimate is correct or out by one (a one-line check is sketched after this list)
  • Because of the above-mentioned features of each model, we expect Random Forest to be more accurate on our problem
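In Matlab this criterion reduces to a one-line check; the grade vectors below are made-up values purely for illustration:

    % A prediction counts as accurate when it is exact or out by one mark.
    g2True   = [10 12 15  8 18];               % hypothetical actual grades
    g2Hat    = [10 11 13  9 18];               % hypothetical predictions
    accuracy = mean(abs(g2Hat - g2True) <= 1)  % displays 0.80 here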

Selection of Predictor Variables

  • Following the results of the importance analysis (figure below; a sketch of the computation follows it), the input variables Age (x3), Mother's education (x7), Father's education (x8), Mother's job (x9), Reason to choose school (x11), Travel time (x13), Study time (x14), Failures (x15), Family relations (x24), Free time (x25), Go out (x26), Weekday alcohol consumption (x27), Weekend alcohol consumption (x28), Health (x29), Absences (x30) and G1 (x31) were selected and used to predict G2.

 

[Figure: predictor importance estimates]
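The importance analysis can be reproduced along the lines of the sketch below, where X, Y and varNames are placeholders for our predictor matrix, grade labels and variable names; OOBPermutedPredictorDeltaError (named OOBPermutedVarDeltaError in older Matlab releases) measures how much the out-of-bag error grows when each predictor is randomly permuted:

    mdl = TreeBagger(500, X, Y, 'Method', 'classification', ...
                     'OOBPredictorImportance', 'on');
    imp = mdl.OOBPermutedPredictorDeltaError;  % one value per predictor
    bar(imp);
    set(gca, 'XTick', 1:numel(imp), 'XTickLabel', varNames);
    ylabel('Out-of-bag permutation importance');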

 

 

Training and test sets

  • Due to Random Forests' tendency to overfit, we first analysed the best method for separating our data into training and test sets
  • The three methods compared are: 1) an independent test set, 2) k-fold cross-validation and 3) the out-of-bag method
  • k-fold cross-validation performs best; increasing the number of trees improves stability, and iterative analysis shows that 5-fold works best on our data (a cross-validation sketch follows the figure below)

[Figure: comparison of the three training/test separation methods]
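A minimal 5-fold cross-validation sketch with cvpartition, again with placeholder names X and Y:

    % Estimate the misclassification rate with 5-fold cross-validation.
    c   = cvpartition(size(X, 1), 'KFold', 5);
    err = zeros(c.NumTestSets, 1);
    for i = 1:c.NumTestSets
        tr  = training(c, i);                  % logical index of the training fold
        te  = test(c, i);                      % logical index of the held-out fold
        mdl = TreeBagger(200, X(tr, :), Y(tr), 'Method', 'classification');
        err(i) = mean(str2double(predict(mdl, X(te, :))) ~= Y(te));
    end
    cvError = mean(err);                       % average error over the five folds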

 

Application

Random Forest applied to our data

  • The TreeBagger algorithm is used to train the model
  • When increasing the number of trees, training time increases linearly and error decreases exponentially; with so few observations, training time is less of a factor
  • When increasing the minimum leaf size, training time decreases exponentially and results are optimal between 20 and 30
  • Iterative analysis used to establish our hyper parameters:

1) Number of trees = 1000

2) Minimum leaf size = 25

  • The chosen number of trees also agrees with our previous analysis of the training and test sets: a large number of trees improves stability (a sketch of the final fit follows the figures below)

[Figures: error and training time against the number of trees and the minimum leaf size]
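Putting the chosen hyperparameters together, the final fit looks roughly like this sketch (Xtrain, yTrain and Xtest are placeholder names; the full raw code is linked at the end of the post):

    rng(1);                                    % fix the random seed for reproducibility
    mdl = TreeBagger(1000, Xtrain, yTrain, ...
                     'Method', 'classification', ...
                     'MinLeafSize', 25, ...
                     'OOBPrediction', 'on');
    plot(oobError(mdl));                       % out-of-bag error as trees are added
    xlabel('Number of grown trees'); ylabel('Out-of-bag classification error');
    g2Hat = str2double(predict(mdl, Xtest));   % predicted G2 grades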

 

K-nearest Neighbours applied to our data

  • For each new data point (student), K-nearest Neighbours evaluates it based only on the students with the most similar backgrounds, then dynamically generates prediction rules (3)
  • Bayesian Optimisation was used as the algorithm to tune the hyperparameter K (a sketch follows the figures below)
  • The model performs best with 57 neighbours when estimates are made using just G1, and with 139 neighbours when the additional features are included; in both cases the optimal distance metric is Euclidean
  • The model works well with a larger number of classes (15 in our case)
  • K-nearest Neighbours is a very simple model to fit when the number of features and observations is relatively small, as in our dataset

[Figures: Bayesian optimisation of the number of neighbours for the G1-only and all-features models]
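A sketch of the fit using fitcknn's built-in Bayesian optimisation over the number of neighbours and the distance metric (placeholder variable names as before; the evaluation budget is an illustrative choice):

    rng(1);
    mdl = fitcknn(Xtrain, yTrain, ...
        'OptimizeHyperparameters', {'NumNeighbors', 'Distance'}, ...
        'HyperparameterOptimizationOptions', ...
        struct('MaxObjectiveEvaluations', 30, 'ShowPlots', false));
    g2Hat = predict(mdl, Xtest);               % predicted G2 grades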

Predictions compared to actual results for both cases (Count/%)

[Figure: predictions vs. actual results, counts and percentages]

 

Conclusion

  • Prediction of student performance using machine learning algorithms is of considerable interest to researchers. According to Weston (4), models such as linear or logistic regression have mostly been employed for this purpose. We conclude that both of our methods can be fitted to make a prediction.
  • The results we obtained are not ideal; the figure below shows the number of correct classifications for both models at each score, and none of these exceeds 50%. However, following our evaluation criteria, we consider a prediction accurate even when it misses by one mark (equivalent to a 5% deviation).
  • Results show that Random Forests perform better with more features, reaching 70% accuracy. K-nearest Neighbours performs best with only the single feature (G1), at 75% accuracy; when more features are introduced, its accuracy drops to 40%.
  • Due to the Bayesian optimisation in the K-nearest Neighbours model, it takes ten times as long to fit as our Random Forest model
  • Additional variables slightly improve the results, but based on the importance analysis, the alcohol-consumption variables do not have as big an impact as we expected
  • The results are affected by how few observations we have and could be improved with more variables (especially additional test results) and/or more observations
  • The Random Forest method could be improved by using adaptive reweighting of the training set instead of bagging (4)
  • Both models struggle to predict accurate results for extreme scores when more features are used. It makes sense to combine methods: predict scores close to the mean (8-15) using Random Forests with more features, then predict the extremes (1-8 and 16-19) using either fewer features or another classification model, such as Naïve Bayes; a rough sketch of this combination follows
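As a rough sketch of that combination, with rfHat and knnHat standing in for hypothetical prediction vectors from the two fitted models:

    % Trust the multi-feature forest near the mean; defer to the
    % single-feature (G1-only) KNN model for predicted extremes.
    g2Hat          = rfHat;
    extreme        = rfHat < 8 | rfHat > 15;   % outside the 8-15 band
    g2Hat(extreme) = knnHat(extreme);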

[Figure: number of correct classifications per score for both models]

 

Relative percentage of correct classifications for each score

[Figure: relative percentage of correct classifications for each score]

References

  1. Breiman, L. and Cutler, A. 2004. Random Forests. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
  2. Imandoust, S.B. and Bolandraftar, M. 2013. Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background. International Journal of Engineering Research and Applications, Vol. 3, Issue 5, Sep-Oct 2013, pp. 605-61
  3. Thirumuruganathan, S. 2010. A Detailed Introduction to K-Nearest Neighbor (KNN) Algorithm. https://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/
  4. Weston, Clarence Y. 2015. On the K-Nearest Neighbor Approach to the Generation of Fuzzy Rules for College Student Performance Prediction. Morgan State University, ProQuest Dissertations Publishing, 10076254
  5. Breiman, L. 2001. Random Forests. Machine Learning, 45(1), 5-32. Kluwer Academic Publishers, The Netherlands

Matlab Raw Code:

studentsmatlabcode
