Linear methods in classification
- Load hurricane data from the article “Hurricane-induced selection on the morphology of an island lizard”. (You may use ELTE wifi to access materials placed behind Nature’s paywall.) https://www.nature.com/articles/s41586-018-0352-3
- A, Drop the lizard with the most missing values
- B, Drop the ID column, and the ‘SumFingers’,’SumToes’,’MaxFingerForce’ columns,
which were only measured after the hurricane
- C, Encode, the Sex, Origin ans Hurricane values into binary columns,
and drop the original text columns.
- D, Make sure all your columns are encoded as floating point values,
not unsigned integers!
—-
- Use logistic regression from the statsmodels package to predict whether
the lizard was measured after of before the hurricane,
using the whole dataset
- A, Investigate the Toe and Finger area coefficients, whats going on?
Fix this problem by only keeping the mean measurements.
- B, Which measured quality had the most significant positive effect on survival?
- C, Which measured quality had the most significant negative effect on survival?
- D, Try explain the results in your words. Check the abstract of the paper.
- E, Repeat the fit after scaling each input column to 0 mean and 1 variance.
Have the coefficients changed? Have the predictions changed?
—-
- Repeat the fit with scikit-learn on the unnormalized dataset.
- A, Compare the coefficients with the ones you got from statsmodels.
Are they the same? If not try to answer why?
- B, Try to tweak the parameters of the scikit-learn method to reproduce the
the coefficients produced by statsmodels.
- C, Plot the ROC curve for the full dataset, and calculate the AUC.
- D, Repeat the fit after scaling each input column to 0 mean and 1 variance.
Have the coefficients changed? Have the predictions changed?
—-
- Split the dataset into 5 folds and predict each fold by training on the other 4.
- A, Make sure to fix the seed of the splitting to 0 to make it reproducible.
- B, Plot the ROC for the 5 folds separately as curves on the same plot.
- C, Calculate the AUC values for the 5 folds separately.