2025-10-31 16:03:54 +01:00
parent 52a78af447
commit abbd45298a
9 changed files with 62 additions and 62 deletions

File diff suppressed because one or more lines are too long

View File: MLPproject.aux

@@ -57,7 +57,6 @@
\newlabel{rf_metrics@cref}{{[table][4][]4}{[1][4][]4}}
\@writefile{toc}{\contentsline {subsection}{\numberline {5.4}Overfitting and Underfitting}{4}{subsection.5.4}\protected@file@percent }
\@writefile{toc}{\contentsline {subsection}{\numberline {5.5}Feature Importance}{4}{subsection.5.5}\protected@file@percent }
\@writefile{toc}{\contentsline {section}{\numberline {6}Summary}{4}{section.6}\protected@file@percent }
\newlabel{fig:featureImportanceDT}{{2(a)}{5}{\relax }{figure.caption.6}{}}
\newlabel{fig:featureImportanceDT@cref}{{[subfigure][1][2]2(a)}{[1][4][]5}}
\newlabel{sub@fig:featureImportanceDT}{{(a)}{5}{\relax }{figure.caption.6}{}}
@@ -69,6 +68,7 @@
\@writefile{lof}{\contentsline {figure}{\numberline {2}{\ignorespaces The feature importance graphs for the Decision Tree model and the Random Forest model based on the validation data.\relax }}{5}{figure.caption.6}\protected@file@percent }
\newlabel{fig:}{{2}{5}{The feature importance graphs for the Decision Tree model and the Random Forest model based on the validation data.\relax }{figure.caption.6}{}}
\newlabel{fig:@cref}{{[figure][2][]2}{[1][4][]5}}
\@writefile{toc}{\contentsline {section}{\numberline {6}Summary}{5}{section.6}\protected@file@percent }
\ttl@finishall
\newlabel{LastPage}{{}{5}{}{page.5}{}}
\xdef\lastpage@lastpage{5}

View File: MLPproject.fdb_latexmk

@@ -1,6 +1,6 @@
# Fdb version 4
["pdflatex"] 1761920867.71702 "/home/petrus/Documents/MLP/Projects/MLPproject/Report/MLPproject.tex" "MLPproject.pdf" "MLPproject" 1761920868.83079 0
"/home/petrus/Documents/MLP/Projects/MLPproject/Report/MLPproject.tex" 1761920867.50658 24533 58032bad0234d994ba6556d7acc5212e ""
["pdflatex"] 1761921952.93435 "/home/petrus/Documents/MLP/Projects/MLPproject/Report/MLPproject.tex" "MLPproject.pdf" "MLPproject" 1761921953.99247 0
"/home/petrus/Documents/MLP/Projects/MLPproject/Report/MLPproject.tex" 1761921952.70534 24812 36992a9467feb6ff9f9f97a89afe5aee ""
"/usr/share/texlive/texmf-dist/fonts/enc/dvips/base/8r.enc" 1737590400 4850 80dc9bab7f31fb78a000ccfed0e27cab ""
"/usr/share/texlive/texmf-dist/fonts/map/fontname/texfonts.map" 1577235249 3524 cb3e574dea2d1052e39280babc910dc8 ""
"/usr/share/texlive/texmf-dist/fonts/tfm/adobe/helvetic/phvb7t.tfm" 1136768653 2240 eb56c13537f4d8a0bd3fafc25572b1bd ""
@@ -131,10 +131,10 @@
"/var/lib/texmf/web2c/pdftex/pdflatex.fmt" 1761127067 7753793 c9f4d2c19ab997188c605d7179b0cdc0 ""
"CM_dt.png" 1761920482.34887 97023 ce9f07bdb4551ffd7f80782b99a54328 ""
"CM_rf.png" 1761920484.96582 98726 a24b8d53317f0e7e65e41ed83ef8fae5 ""
"MLPproject.aux" 1761920868.72356 6698 e699ab45a2056e84f281588212bdf2ec "pdflatex"
"MLPproject.out" 1761920868.72456 3113 d57c5f2b0e6699323b0a2645b9706cce "pdflatex"
"MLPproject.tex" 1761920867.50658 24533 58032bad0234d994ba6556d7acc5212e ""
"MLPproject.toc" 1761920868.72456 1587 d275c5e85ba45c005c3baf7931c510a7 "pdflatex"
"MLPproject.aux" 1761921953.8967 6698 d2e044226fe88697053e22eec695a818 "pdflatex"
"MLPproject.out" 1761921953.8997 3113 d57c5f2b0e6699323b0a2645b9706cce "pdflatex"
"MLPproject.tex" 1761921952.70534 24812 36992a9467feb6ff9f9f97a89afe5aee ""
"MLPproject.toc" 1761921953.8997 1587 6a7d8c5cbfca28921bcd78f124e2ec7a "pdflatex"
"SelfArx.cls" 1761125830.98333 7316 506603b27aab6da8087bc0f1ee693041 ""
"featureImportanceDT.png" 1761403205.10917 60078 4a2e56e2a45ae2ae5e41b9830c1bbcea ""
"featureImportanceRF.png" 1761403205.11075 61794 6b3eefc625dd3da8a3dbf302174c614c ""

View File: MLPproject.log

@@ -1,4 +1,4 @@
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex 2025.10.22) 31 OCT 2025 15:27
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) (preloaded format=pdflatex 2025.10.22) 31 OCT 2025 15:45
entering extended mode
restricted \write18 enabled.
file:line:error style messages enabled.
@@ -726,9 +726,9 @@ Here is how much of TeX's memory you used:
38909 multiletter control sequences out of 15000+600000
569401 words of font info for 297 fonts, out of 8000000 for 9000
1137 hyphenation exceptions out of 8191
75i,12n,77p,1656b,605s stack positions out of 10000i,1000n,20000p,200000b,200000s
75i,12n,77p,1812b,605s stack positions out of 10000i,1000n,20000p,200000b,200000s
</usr/share/texlive/texmf-dist/fonts/type1/urw/helvetic/uhvb8a.pfb></usr/share/texlive/texmf-dist/fonts/type1/urw/helvetic/uhvr8a.pfb></usr/share/texlive/texmf-dist/fonts/type1/urw/helvetic/uhvro8a.pfb></usr/share/texlive/texmf-dist/fonts/type1/urw/times/utmr8a.pfb>
Output written on MLPproject.pdf (5 pages, 305630 bytes).
Output written on MLPproject.pdf (5 pages, 305913 bytes).
PDF statistics:
191 PDF objects out of 1000 (max. 8388607)
148 compressed objects within 2 object streams

Binary file not shown.

Binary file not shown.

View File: MLPproject.tex

@@ -41,7 +41,7 @@
\Authors{Petrus Einarsson\textsuperscript{1}*, Jakob Nyström\textsuperscript{1}*} % Authors
\affiliation{\textsuperscript{1}\textit{Department of Physics, Umeå University, Umeå, Sweden}} % Author affiliation
\affiliation{*\textbf{Corresponding authors}: peei0011@student.umu.se, jany0047@student.umu.se} % Corresponding author
\affiliation{*\textbf{Corresponding authors}: peei0011@student.umu.se, jany0047@student.umu.se } % Corresponding author
\affiliation{*\textbf{Supervisor}: shahab.fatemi@umu.se}
\Keywords{} % Keywords - if you don't want any simply remove all the text between the curly brackets
\newcommand{\keywordname}{Keywords} % Defines the keywords heading name
@@ -197,7 +197,7 @@ Tables (\ref{dt_metrics}) and (\ref{rf_metrics}) show the class-wise metrics of
At first glance at both the confusion matrices and the performance metrics, the models do not look particularly good. But we have to consider the data we are analyzing: we are looking for possible indicators that a person earns more than a certain amount of money. This is real-world data, and in the real world there are many different ways of earning money. While some indicators clearly signal that somebody earns a lot, other factors are not as telling, which means that some features are less important than others. This can be seen in the feature importance graphs in figures (\ref{fig:featureImportanceDT}) and (\ref{fig:featureImportanceRF}). It also means that there will be plenty of outliers in the data; no matter how good the model is, it cannot possibly catch all of them, and if it did, it would be overfitted. We simply cannot expect a model to achieve very high accuracy on this type of dataset.
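For reference, a minimal sketch of how feature-importance graphs like those in the figures could be produced, assuming the models were built with scikit-learn; the analysis code is not part of this commit, so the data, variable names, and settings below are illustrative stand-ins.

# Minimal feature-importance sketch; assumes scikit-learn (hypothetical,
# the report's analysis code is not included in this commit).
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the income data used in the report.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ holds the mean impurity decrease per feature
# (DecisionTreeClassifier exposes the same attribute); sorting it
# descending gives the ranking shown in such graphs.
order = rf.feature_importances_.argsort()[::-1]
plt.bar([names[i] for i in order], rf.feature_importances_[order])
plt.ylabel("importance")
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig("featureImportanceRF.png")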
An important point to address is our models' poor fit on the higher-earning individuals. Both models produce a precision of 77\% on the higher-earning class, which is poor compared to the precisions of 87\% and 89\% on the lower-earning class. This means that out of all individuals predicted as higher-earning, only 77\% are predicted correctly. Even more notable is the large discrepancy in recall between the two classes: recalls of 56\% and 63\% for the higher-earning class, compared to 95\% and 94\% for the lower-earning class, show that the models are poor at detecting higher-earning individuals. As discussed above, there may be many reasons for this poor fit. Of note is that we optimized the models for the best accuracy over all data points; we therefore strive to classify as many total data points correctly as possible, rather than to get the best average for each class separately. Since there are more lower-earning people in our dataset, it is reasonable for the models to favor that class, since doing so gives the best weighted accuracy. As previously stated, the scoring metric used for training should be adapted to the problem at hand: if the problem requires similar metrics across the classes, one should instead consider a scoring metric such as balanced accuracy, which is designed to produce such results.
An important point to address is our models' poor fit on the higher-earning individuals. Both models produce a precision of 77\% on the higher-earning class, which is poor compared to the precisions of 87\% and 89\% on the lower-earning class. This means that out of all individuals predicted as higher-earning, only 77\% are predicted correctly. Even more notable is the large discrepancy in recall between the two classes: recalls of 56\% and 63\% for the higher-earning class, compared to 95\% and 94\% for the lower-earning class, show that the models are poor at detecting higher-earning individuals. Additionally, the F1-scores of the two classes demonstrate the discrepancy in overall performance: the harmonic mean of precision and recall is significantly lower for the higher-earning individuals than for the lower-earning ones. As discussed above, there may be many reasons for this poor fit. Of note is that we optimized the models for the best accuracy over all data points; we therefore strive to classify as many total data points correctly as possible, rather than to get the best average for each class separately. Since there are more lower-earning people in our dataset, it is reasonable for the models to favor that class, since doing so gives the best weighted accuracy. As previously stated, the scoring metric used for training should be adapted to the problem at hand: if the problem requires similar metrics across the classes, one should instead consider a scoring metric such as balanced accuracy, which is designed to produce such results.
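A minimal sketch of the class-wise metrics discussed above, assuming scikit-learn; the label arrays are toy stand-ins, not the report's data.

# Class-wise precision/recall/F1 and balanced accuracy; assumes scikit-learn.
from sklearn.metrics import balanced_accuracy_score, classification_report

# Toy stand-ins for validation labels and model predictions
# (0 = lower-earning, 1 = higher-earning), for illustration only.
y_val  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Per class: precision = correct / predicted, recall = correct / actual,
# F1 = harmonic mean of precision and recall.
print(classification_report(y_val, y_pred,
                            target_names=["lower-earning", "higher-earning"]))

# Balanced accuracy is the unweighted mean of the per-class recalls, so it
# is not dominated by the majority class the way plain accuracy is.
print("balanced accuracy:", balanced_accuracy_score(y_val, y_pred))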
\subsection{Overfitting and Underfitting}
We spent some time tuning the hyperparameters to ensure that we did not overfit. If we compare the validation results with the test results, we see that the performance metrics barely change at all. This is what we want to see, as it means that we have avoided overfitting the model, so it could be applied to other, similar datasets and hopefully give similar performance. We also do not want our model to be underfit. This is harder to validate, as we want the errors to be as small as possible for both training and testing, and, as stated before, we believe this is a difficult dataset to fit well. We therefore believe that we have found a model with a decent balance between bias and variance.
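A minimal sketch of the overfitting check described above, assuming scikit-learn: fit on a training split, then verify that train, validation, and test scores stay close. The synthetic data, split sizes, and hyperparameter values are illustrative, not the report's.

# Overfitting check: similar scores across splits suggest the model
# generalizes; assumes scikit-learn, all settings illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Imbalanced toy data, roughly mirroring the majority of lower earners.
X, y = make_classification(n_samples=2000, weights=[0.75], random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4,
                                                    random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest,
                                                test_size=0.5, random_state=0)

rf = RandomForestClassifier(n_estimators=200, max_depth=10,
                            random_state=0).fit(X_train, y_train)

for name, Xs, ys in [("train", X_train, y_train),
                     ("validation", X_val, y_val),
                     ("test", X_test, y_test)]:
    # A large train/validation gap would indicate overfitting; a large
    # validation/test gap would suggest tuning leaked into validation.
    print(f"{name:>10} accuracy: {rf.score(Xs, ys):.3f}")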

View File: MLPproject.toc

@@ -16,5 +16,5 @@
\contentsline {subsection}{\numberline {5.3}Analyzing the Performance}{4}{subsection.5.3}%
\contentsline {subsection}{\numberline {5.4}Overfitting and Underfitting}{4}{subsection.5.4}%
\contentsline {subsection}{\numberline {5.5}Feature Importance}{4}{subsection.5.5}%
\contentsline {section}{\numberline {6}Summary}{4}{section.6}%
\contentsline {section}{\numberline {6}Summary}{5}{section.6}%
\contentsfinish

Binary file not shown.