Compare commits

...

2 Commits

Author SHA1 Message Date
29b5d87b57 Stop tracking .venv 2025-10-23 22:20:59 +02:00
37fa321655 Added fine tuning 2025-10-23 15:44:32 +02:00
3 changed files with 292 additions and 117 deletions

File diff suppressed because one or more lines are too long


@@ -81,7 +81,7 @@ With our numerical version of the dataset we found with the info function in pan
\subsection{Training, validation and test sets}
Before doing any sort of training or analysis on the data, we split it into training, validation and test sets. We did this by first splitting off a random 20\% of the data as test data. This data is reserved for the final evaluation of the model and will not be touched until the model is finished. We then made a further split of the remaining data, where 25\% was designated as validation data. This data will be used for calibration of the model and hyperparameter tuning. The rest of the data, which is 60\% of the total or around 18000 data points, will be used to train the model.
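The two-stage split described above can be sketched with scikit-learn's `train_test_split`. The dataset here is a synthetic stand-in (the real feature matrix and its exact size are not shown in this excerpt); the fractions are taken from the text: 20\% test first, then 25\% of the remainder as validation, which works out to a 60/20/20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (30000 rows chosen so that 60%
# is "around 18000 data points", as stated in the text).
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(30000, 5))
y = rng.integers(0, 2, size=30000)

# Stage 1: reserve a random 20% as the held-out test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Stage 2: take 25% of the remaining 80% as validation data
# (0.25 * 0.80 = 20% of the total), leaving 60% for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

Fixing `random_state` makes both splits reproducible, which matters here since the test set must stay untouched across the whole project.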
\section{Model selection}
When selecting the model to use for this project we have to limit ourselves to models that are appropriate for the type of problem we are trying to solve. The problem is a classification task, so all models used for regression are immediately ruled out. There are still plenty of types of classification models to choose from. Many of them, however, are suited to data with non-discrete features. This includes models such as logistic regression, KNN and other similar classifiers. Since so many of our features are non-numerical and have been converted into arbitrary numbers, these types of models would not be optimal. What is left is Gaussian Naïve Bayes and the different tree-based models. Naïve Bayes can be a bit troublesome for this dataset since we have found that some parameters are slightly correlated. However, this does not necessarily make it an inappropriate method, as Naïve Bayes has been found to perform well even when its strict independence assumption is violated. We nevertheless focus on the tree-based models, such as the decision tree and the random forest. We decided to implement two different models. We first build a decision tree and see how well we can get that model to work. We then build a random forest, which may not be the absolute best model, but since it is an extension of the decision tree it might be interesting to see if it performs better. We then analyse both methods to see if these models are good enough and if there is any meaningful difference between the two.
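A minimal sketch of the two chosen models, assuming scikit-learn is used (the excerpt does not name the library, and the synthetic data here only stands in for the real label-encoded features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; the real features are categoricals mapped to
# arbitrary integer codes, which tree splits handle without assuming
# any distribution over the values.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Model 1: a single decision tree.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Model 2: a random forest, an ensemble extension of the decision tree.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(tree.score(X, y), forest.score(X, y))
```

Both estimators share the same fit/score interface, which makes the later side-by-side comparison of the two methods straightforward.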
\section{Model Training and Hyperparameter Tuning}
During model training there are some important changes we can make to improve the accuracy of our model. One thing we implement is cross-validation. Since there is a great spread in our data, we choose to use randomized search. %Add more here and change type of x-val if needed. How many folds?
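Randomized search with cross-validation can be sketched with scikit-learn's `RandomizedSearchCV`. The parameter ranges and the fold count below are hypothetical placeholders (the excerpt leaves both open), shown here for the decision tree:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for the labeled training set.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hypothetical search space; the ranges actually used are not given
# in the text.
param_dist = {
    "max_depth": randint(2, 20),
    "min_samples_leaf": randint(1, 20),
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=20,      # number of random parameter draws
    cv=5,           # 5-fold cross-validation (fold count is an assumption)
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Unlike an exhaustive grid search, randomized search samples a fixed number of parameter combinations, which keeps the cost predictable when the search space is large.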

Binary file not shown.