Cleaned the analysis part, started on abstract, introduction and summary parts
@@ -48,7 +48,7 @@
%----------------------------
% ABSTRACT
%----------------------------
\Abstract{We found a dataset that could be used for classification tasks. To make it usable we performed feature engineering, handled missing values and did other data cleaning such as label encoding. We chose two applicable models, the Decision Tree and the Random Forest. The dataset was divided into training, validation and test sets, and we tuned hyperparameters to get the best possible validation results and to avoid overfitting. Once satisfied with our models, we found that both performed about the same, with the Random Forest scoring about one percentage point higher but requiring much longer training times. We argue that the weighted accuracies of about 85\%, which at a glance might seem poor, are actually reasonable given the nature of our dataset and the choices we made.}
%----------------------------
\begin{document}
@@ -67,6 +67,8 @@
%----------------------------
\section{Introduction}
Machine learning techniques have plenty of practical use cases. In this report we take a real-world dataset and train two machine learning models on it, trying to get the best results possible.
\section{Data analysis}
@@ -89,20 +91,21 @@ Another very important part of the model training is finding the optimal hyperpa
\section{Model Evaluations}
There are two interesting parts to look at after our analysis. One is how well the models actually performed; the other is the difference between the two models we chose to study. We fine-tuned our models using the validation part of the data, so running them on the test data shows how well they actually perform. A great way to get a quick overview of how well a model classifies is to look at the confusion matrix.
\subsection{Analyzing the Confusion Matrices}
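A rough sketch of how such matrices can be produced, assuming a scikit-learn workflow; the names \texttt{dt}, \texttt{rf}, \texttt{X\_test} and \texttt{y\_test} are our placeholders, not the report's actual code:
\begin{verbatim}
# Minimal sketch, assuming scikit-learn and two already-fitted
# classifiers; dt, rf, X_test, y_test are placeholder names.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

for fname, model in [("CM_dt.png", dt), ("CM_rf.png", rf)]:
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
    plt.savefig(fname)
\end{verbatim}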
\begin{figure}[!hptb]
\centering
\begin{subfigure}[b]{\columnwidth}
\centering
\includegraphics[width=\textwidth]{CM_dt.png}
\caption{Decision Tree}
\label{fig:CMdt}
\end{subfigure}
\hfill
\begin{subfigure}[b]{\columnwidth}
\centering
\includegraphics[width=\textwidth]{CM_rf.png}
\caption{Random Forest}
\label{fig:CMrf}
\end{subfigure}
@@ -110,8 +113,9 @@ There are two interesting parts to look at after our analysis. One part is to an
\label{fig:}
\end{figure}
As we can see in the confusion matrices, there is not that big a difference between the models. Both did an overall good job at identifying the two classes, but they differ in how well they identified each class: both are quite good at classifying the poor class and worse at the rich class, with the Random Forest slightly ahead of the Decision Tree. This is an interesting result, and not as strange as it first seems. There were far more poor people than rich people in our training data, which naturally trains the models to be better at classifying the poor. Besides the confusion matrices, it is also interesting to look at the actual performance metrics that can be calculated from them.
\subsection{Analyzing Weighted Performance Metrics}
We want to analyze two sets of metrics. First we have the validation metrics, which can be seen in table(\ref{perfmetric}). Then we have the actual test metrics, which are the results from our models; these can be seen in table(\ref{perfmetrictest}). Of note is that all of these metrics are calculated as weighted metrics, which means that they account for the class imbalances seen in the confusion matrices.
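Concretely, a weighted metric averages the per-class scores weighted by class support. A minimal sketch of how such values are obtained, again assuming scikit-learn with our placeholder names:
\begin{verbatim}
# Sketch: weighted precision/recall/F1, assuming scikit-learn.
# average="weighted" weights each class's score by its support,
# which is how the tables below account for class imbalance.
from sklearn.metrics import precision_recall_fscore_support

p, r, f1, _ = precision_recall_fscore_support(
    y_test, rf.predict(X_test), average="weighted")
print(f"Precision={p:.2f} Recall={r:.2f} F1={f1:.2f}")
\end{verbatim}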
\begin{table}[!htbp]
\centering
\caption{The performance metrics of the models on the validation data.}
@@ -128,24 +132,29 @@ This is a very interesting result and maybe not so weird as it first seems. Ther
\begin{table}[!htbp]
\centering
\caption{The performance metrics of the models on the test data.}
\label{perfmetrictest}
\resizebox{0.6\columnwidth}{!}{
\begin{tabular}{c|c|c|c}
Model&Precision&Recall&F1 Score\\
\hline
RF&0.86&0.86&0.86\\
\hline
DT&0.84&0.85&0.84
\end{tabular}}
\end{table}
Looking at the values, we see that the difference between our models is not that large. The Random Forest is on average about one percentage point better than the Decision Tree, and all metrics sit at about 0.85, so neither model is very accurate and the gap between them is small. Which model is better depends a lot on the priorities. While the Random Forest clearly has the better performance, if only by a little, it is also significantly slower on the validation data. So for this dataset, was it really worth 30x the computational time to get a slightly better result? We are not sure. The extra computational time is a definite negative, but at the size of this dataset we are only talking about a couple of minutes, which is not too bad. For another dataset the results may differ, and it might be clearer which model is really preferred.
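For reference, a sketch of how such training times could be measured, with the same placeholder names as before:
\begin{verbatim}
# Sketch: wall-clock training time per model; dt, rf, X_train,
# y_train are placeholder names, not the report's actual code.
import time

for name, model in [("DT", dt), ("RF", rf)]:
    start = time.perf_counter()
    model.fit(X_train, y_train)
    print(f"{name}: {time.perf_counter() - start:.1f} s")
\end{verbatim}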
\subsection{Analyzing the Performance}
At first glance at both the confusion matrices and the performance metrics, the models do not look that good. But we have to consider the data we are analyzing: possible indicators of whether a person earns more than a certain amount of money. This is real-world data, and in the real world there are many unique ways of earning money. While some indicators clearly signal that somebody earns a lot, other factors are far less telling, which means some features are less important than others. This can be seen in the feature importance graphs in figure(\ref{fig:featureImportanceDT}) and figure(\ref{fig:featureImportanceRF}). It also means there will be plenty of outliers in the data; no matter how good the model is, it cannot possibly catch all of them, and if it did, it would be overfitted. We simply cannot expect a model to reach very high accuracy on this type of dataset.
An important thing to touch on is our models' poor fit on rich people: only 60--70\% were correctly identified, which is quite bad. As discussed above, the data itself may explain much of this. Note also that we optimized for the best accuracy over all data points, i.e. classifying as many total points correctly as possible, rather than for the best average over the classes separately. Since there are more poor people in our dataset, it is very reasonable for the model to favor that class, as it yields the best weighted accuracy.
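One hedged alternative we did not pursue, sketched here with scikit-learn, would be to reweight the classes so that errors on the rich class cost more, trading some weighted accuracy for better minority recall:
\begin{verbatim}
# Hedged alternative (not what the report optimized for):
# class_weight="balanced" upweights the minority (rich) class.
from sklearn.ensemble import RandomForestClassifier

rf_balanced = RandomForestClassifier(class_weight="balanced",
                                     random_state=0)
rf_balanced.fit(X_train, y_train)  # placeholder training data
\end{verbatim}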
\subsection{Overfitting and Underfitting}
We spent some time tuning the hyperparameters to ensure that we did not overfit. If we compare the validation results with the test results, we see that the performance metrics barely change at all. This is what we want to see, as it means we have avoided overfitting, and our model could be used on other similar datasets and hopefully give similar performance. We also do not want our model to be underfit. This is a bit harder to validate, since we want the errors to be as small as possible for both training and testing, and as stated before, we believe this is a difficult dataset to fit well. We therefore believe we have found a model with a decent enough balance between bias and variance.
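The check itself is simple; a sketch, again with placeholder names:
\begin{verbatim}
# Sketch: compare validation and test accuracy; a large gap
# between the two would indicate overfitting to the validation set.
from sklearn.metrics import accuracy_score

for split, X, y in [("val", X_val, y_val), ("test", X_test, y_test)]:
    print(split, round(accuracy_score(y, rf.predict(X)), 4))
\end{verbatim}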
\subsection{Feature Importance}
Taking a closer look at the feature importance graphs of the two models, we notice an interesting difference. The Decision Tree, being a single tree, relies on only a few main features, one of which dominates; the rest are used little or almost not at all. The Random Forest uses a far wider range of features. The models also rank the features somewhat differently, and the best feature for one is not the best for the other. We considered removing the worst-performing features to see if it would change the performance, but since the models have different worst-performing features, we reasoned that leaving the features as they are keeps the comparison as fair as possible.
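Both model classes expose these importances directly after fitting; a sketch, where \texttt{feature\_names} is a placeholder for the encoded column names:
\begin{verbatim}
# Sketch: Gini-based feature importances from both fitted models;
# feature_names, dt, rf are placeholder names.
import pandas as pd

imp = pd.DataFrame({"feature": feature_names,
                    "DT": dt.feature_importances_,
                    "RF": rf.feature_importances_})
print(imp.sort_values("RF", ascending=False).head(10))
\end{verbatim}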
\begin{figure}[!hptb]
\centering
@@ -166,8 +175,9 @@ We spent some time tuning the hyperparameters to ensure that we did not overfit.
\label{fig:}
\end{figure}
%----------------------------
\section{Summary}
We have successfully trained two different but similar machine learning models to classify the monetary status of people based on a number of different features. While some trade-offs were made regarding which features were kept and what we optimized the models for, we still managed to get a respectable result, especially considering the difficult type of data we had to work with.
%---------
% REFERENCE LIST
%----------------------------
\bibliographystyle{model1-num-names}