Lab exercise 10
This week the homework is much less guided. You may need to improvise and use the knowledge that you gained during the semester.
1, Loading data (1 point)
- A, Download the data from the following URL: https://archive.ics.uci.edu/ml/machine-learning-databases/00477/Real%20estate%20valuation%20data%20set.xlsx
- B, For more information, read https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set#
- C, Check the data in Excel and then load it into Python with pandas.
- D, Check for missing data and handle it reasonably, if you find any.
- E, Split the data into train and test parts (2012 is the train set, 2013 is the test set).
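One possible way to approach steps D and E is sketched below. It assumes the sheet has a column named `X1 transaction date` holding fractional years (e.g. 2012.917); check this against the actual file. The helper `split_by_year` and the tiny `demo` frame are made up for illustration and are not real data.

```python
import numpy as np
import pandas as pd


def split_by_year(df, date_col="X1 transaction date"):
    """Split rows into 2012 (train) and 2013 (test) by the integer part of the date.

    Assumes `date_col` stores fractional years; the column name is an assumption.
    """
    year = np.floor(df[date_col]).astype(int)
    train = df[year == 2012].copy()
    test = df[year == 2013].copy()
    return train, test


# Tiny made-up frame that mimics the assumed column layout (not real data).
demo = pd.DataFrame({
    "X1 transaction date": [2012.667, 2012.917, 2013.083, 2013.500],
    "X2 house age": [32.0, 19.5, np.nan, 5.0],
    "Y house price of unit area": [37.9, 42.2, 47.3, 43.1],
})

# Number of missing values in each column.
missing_per_column = demo.isna().sum()

train, test = split_by_year(demo)

# With the downloaded file, one would presumably start from something like:
# df = pd.read_excel("Real estate valuation data set.xlsx")
```

How to handle any missing values (dropping rows, imputing, etc.) depends on what you actually find, so the sketch only counts them.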
2, Visualize the data (2.5 points)
- Create meaningful visualization for the data to understand what it contains.
- Do not use any machine learning models here, only creative, meaningful charts.
- Create publication-quality plots that are fully understandable without looking at the code that generated them.
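As a rough illustration of what "understandable without the code" can mean, the sketch below labels both axes with units, states the scale, and gives the figure a descriptive title. The data are synthetic, and the variable names, the output file name, and the unit (metres) are assumptions to be checked against the dataset description.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in values (not the real data): distance in metres and a price
# that decreases with distance plus noise.
distance_m = rng.uniform(50, 5000, 100)
price = 50 - 0.006 * distance_m + rng.normal(0, 4, 100)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(distance_m, price, s=15, alpha=0.7)
ax.set_xscale("log")  # distances span two orders of magnitude
ax.set_xlabel("Distance to nearest MRT station (m, log scale)")
ax.set_ylabel("House price per unit area")
ax.set_title("Synthetic example: price vs. distance to transit")
fig.tight_layout()
fig.savefig("price_vs_distance.png", dpi=300)  # hypothetical output file name
```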
3, Unsupervised modeling (2.5 points)
- A, Try at least two different unsupervised machine learning techniques to reveal the inner structure of the data.
- B, Are there any clusters among the houses? What are those clusters (e.g. smaller in-the-city flats vs. larger houses on the outskirts)?
- Do not forget, sometimes scaling the data helps!
- Use both 2012 and 2013 data for this task.
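A minimal sketch of the scaling remark, assuming scikit-learn is available: features on very different scales are standardized first, then passed to two unsupervised techniques (k-means and PCA are used here only as examples; any two methods would do). The two synthetic blobs are made up and stand in for the real feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two groups whose features live on very different scales
# (thousands vs. single digits). Not the real data.
group_a = np.column_stack([rng.normal(300, 50, 50), rng.normal(8, 1, 50)])
group_b = np.column_stack([rng.normal(3000, 50, 50), rng.normal(1, 1, 50)])
X = np.vstack([group_a, group_b])

# Standardize so each feature has zero mean and unit variance;
# otherwise the large-valued feature dominates distance-based methods.
X_scaled = StandardScaler().fit_transform(X)

# Technique 1: k-means clustering on the scaled features.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# Technique 2: PCA projection to two components, e.g. for plotting.
components = PCA(n_components=2).fit_transform(X_scaled)
```

Colouring a scatter plot of `components` by `labels` is one way to check whether the clusters correspond to anything interpretable.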
4, Supervised modeling (4 points)
- A, Try different models to predict the ‘Y house price of unit area’.
- B, Select a proper metric to measure the goodness of your predictions.
- C, Tune the hyperparameters for at least 2 selected models.
- Use data from 2012 to train the models and also for hyperparameter tuning!
- Use data from 2013 only for evaluating your models!
- D, What is the best MAE (mean absolute error) for the price of unit area that you could achieve?
- E, Check the errors of the predictions. Is there a specific kind of house for which your model works much better or worse?
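The train/tune/evaluate protocol above could look roughly like the following sketch, assuming scikit-learn. Cross-validated grid search runs on the training data only, and the held-out data is touched once at the end to compute the MAE. The arrays are synthetic stand-ins for the 2012 and 2013 sets, and the random forest and its grid are arbitrary example choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic stand-ins for the 2012 (train) and 2013 (test) data; not real values.
X_train = rng.uniform(0, 10, size=(200, 3))
y_train = 5 * X_train[:, 0] - 2 * X_train[:, 1] + rng.normal(0, 1, 200)
X_test = rng.uniform(0, 10, size=(80, 3))
y_test = 5 * X_test[:, 0] - 2 * X_test[:, 1] + rng.normal(0, 1, 80)

# Hyperparameter tuning with 5-fold cross-validation on the training set only.
# scikit-learn maximizes scores, hence the negated MAE.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)

# Final evaluation: the test set is used only here.
test_pred = search.predict(X_test)
test_mae = mean_absolute_error(y_test, test_pred)

# Per-sample absolute errors, useful for looking at where the model fails.
abs_errors = np.abs(y_test - test_pred)
```

Grouping `abs_errors` by a feature (for example binned house age or distance) is one way to see whether some kinds of houses are predicted systematically worse.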