\subsection{Training, validation and test sets}
Before doing any sort of training or analysis on the data, we split it into training, validation and test sets. We did this by first splitting off a random 20\% of the data as test data. This data is reserved for the final testing of the model and will not be touched until the model is finished. We then made a further split of the remaining data, where 25\% was designated as validation data. This data will be used for calibration of the model and hyperparameter tuning. The rest of the data, which is 60\% of the total or around 18000 data points, will be used to train the model.
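
As a rough sketch, such a split could be done with scikit-learn's \texttt{train\_test\_split}; here \texttt{df} and \texttt{income} are placeholder names for our preprocessed data and its label column:

\begin{verbatim}
from sklearn.model_selection import train_test_split

X = df.drop(columns=["income"])   # placeholder names
y = df["income"]

# First reserve a random 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Then designate 25% of the remainder as validation data,
# leaving 0.80 * 0.75 = 60% of the total for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)
\end{verbatim}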
\section{Model selection}
When selecting the model to use for this project we have to limit ourselves to models that are appropriate for the type of problem we are trying to solve. The problem is a classification task, so all models used for regression are immediately ruled out. There are still plenty of different types of classification models to choose from. Many of them, however, are suited to data with non-discrete features. This includes models such as logistic regression, KNN and other similar types of classifiers. Since so many of our features are non-numerical and converted into arbitrary numbers, these types of models would not be optimal. At first glance, due to the many discrete features, Naïve Bayes could be a possible contender. However, the dataset also includes some continuous features, which complicates things: the different versions of Naïve Bayes are not really suitable for a mix of discrete and continuous features. We are therefore left with the tree based models, such as the decision tree and the random forest. We decided to implement two different types of models. We first train a decision tree and see how well we can get that model to perform. We then train a random forest, which may not be the absolute best model, but since it is a continuation of the decision tree it is interesting to see whether it performs better. Finally, we analyze both methods to see if these models are good enough and if there is any meaningful difference between the two.
\section{Model Training and Hyperparameter Tuning}
\subsection{Models and methods used}
When performing the hyperparameter tuning, we started out with a rough grid to get a decent estimate of the optimal configuration. From those results we then ran a finer grid centered around the optimal configuration. This way we were able to inspect both a wide range and a more precise range without severely increasing the computational load.
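
A minimal sketch of this two-stage search, scored by accuracy on the validation split (the parameter names and grid values below are only illustrative, not the exact grids we used):

\begin{verbatim}
from itertools import product
from sklearn.tree import DecisionTreeClassifier

def best_config(grid):
    # Try every combination in the grid and keep the one with
    # the highest accuracy on the validation data.
    best = None
    for depth, leaf in product(grid["max_depth"],
                               grid["min_samples_leaf"]):
        model = DecisionTreeClassifier(max_depth=depth,
                                       min_samples_leaf=leaf,
                                       random_state=42)
        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)
        if best is None or score > best[0]:
            best = (score, depth, leaf)
    return best

# Rough grid over a wide range of values...
print(best_config({"max_depth": [5, 10, 20, 40],
                   "min_samples_leaf": [1, 5, 20, 50]}))
# ...then a finer grid around the best rough configuration
# (assuming the rough search favored depth 10 and leaf 20).
print(best_config({"max_depth": [8, 9, 10, 11, 12],
                   "min_samples_leaf": [10, 15, 20, 25, 30]}))
\end{verbatim}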
\subsection{Caveats and restrictions}
Although the validation results produced by the script are quite promising, there are a couple of important notes to make, not only to better understand the final models but also to avoid pitfalls in potential future projects. Firstly, in our script we decided not to use any standardization, as this is a somewhat special case where the models used do not require it. However, it is important to understand that if we were to introduce another model, we would need to standardize the data to ensure that the features contribute equally. Secondly, there are more hyperparameters that one might want to consider, as we only used a few of them. The problem with expanding the number of hyperparameters in the grid is that it exponentially increases the computational load, so we picked the few that we thought were most important. Finally, the scoring metric used is not always the best choice. We used accuracy, meaning the model tries to correctly label as many data points as possible and does not care about keeping a similar precision for both labels. The goal of this project is somewhat arbitrary, as we mainly want to train and compare models. However, if such a model were to be used in a real-world application, one might want to change the scoring metric to better adapt the model to the problem at hand.
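
If one wanted a metric that treats both labels more evenly, the change is small; for instance (assuming the positive class is encoded as 1, and with \texttt{model} standing for either tuned model):

\begin{verbatim}
from sklearn.metrics import balanced_accuracy_score, f1_score

pred = model.predict(X_val)
print(balanced_accuracy_score(y_val, pred))  # weights both labels equally
print(f1_score(y_val, pred, pos_label=1))    # focuses on the positive class
\end{verbatim}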
\section{Model Evaluations}
Looking at the values we see that the difference between our models is not that large. The Random Forest model is on average about 1 percentage point better than the Decision Tree. We can also see that all metrics are at about 0.85. This means that our models are not very accurate and that the differences between them are small. Which model is better depends a lot on the priorities. While it is clear that the Random Forest has the better performance, even if just by a little, it is also significantly slower on the validation data. So, for this dataset, was it really worth 30 times the computational time to get a slightly better result? We are not really sure. The extra computational time is a definite negative, but at the size of this dataset we are only talking about a couple of minutes, which is not too bad. For another dataset the results may be different, and it might be clearer which model is really preferred.
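
The comparison itself is straightforward to reproduce; a sketch, with \texttt{tree\_model} and \texttt{forest\_model} as placeholder names for the two tuned models:

\begin{verbatim}
from time import perf_counter
from sklearn.metrics import classification_report

for name, model in [("Decision Tree", tree_model),
                    ("Random Forest", forest_model)]:
    start = perf_counter()
    pred = model.predict(X_val)
    print(f"{name}: {perf_counter() - start:.2f} s")
    print(classification_report(y_val, pred))
\end{verbatim}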
Another thing to consider is the interpretability of the models. Here there is quite a big difference that could possibly tip the scales from one model to the other. Starting with the Decision Tree: because the model's prediction process is quite simple, it is also highly interpretable. We can even plot the decision tree to see how the model handles every feature for a data point. This can be beneficial if we want to better understand the model. In contrast, the Random Forest uses a more complicated method for prediction, as it aggregates the outputs of numerous decision trees built on random subsets of the features. This means that the model is more or less a black box. The importance of model interpretability is difficult to define, as it will vary between applications and there is even a subjective element to it.
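
For instance, scikit-learn can render a fitted tree directly; a sketch, again with \texttt{tree\_model} as a placeholder for our tuned Decision Tree and with assumed class labels:

\begin{verbatim}
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(tree_model,
          max_depth=3,                        # top levels only, for readability
          feature_names=list(X_train.columns),
          class_names=["<=50K", ">50K"],      # assumed label names
          filled=True, ax=ax)
fig.savefig("decision_tree.pdf")
\end{verbatim}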
\subsection{Analyzing the Performance}
At first glance at both the confusion matrices and the performance metrics, the models do not look to be that good. But what has to be considered is the data we are analyzing. We are looking at what possible indicators there are for a person earning more than a certain amount of money. This is real-world data, and in the real world there are many unique ways of earning money. While there certainly are some indicators that clearly tell that somebody is earning a lot of money, other factors are not as telling. This means that some features are less important than others, which can be seen in the feature importance graphs for our models in figures \ref{fig:featureImportanceDT} and \ref{fig:featureImportanceRF}. It also means that there will be plenty of outliers in the data. No matter how good the model is, it cannot possibly catch all of these outliers; if it did, it would be overfitted. We simply cannot expect a model to have very high accuracy on this type of dataset.
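
Such graphs come directly from the fitted models' importances; a minimal sketch for the Random Forest, with \texttt{forest\_model} again as a placeholder name:

\begin{verbatim}
import matplotlib.pyplot as plt
import pandas as pd

importances = pd.Series(forest_model.feature_importances_,
                        index=X_train.columns).sort_values()
importances.plot.barh(title="Feature importance (Random Forest)")
plt.tight_layout()
plt.savefig("feature_importance_rf.pdf")
\end{verbatim}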