Assignment 4: Machine Learning

This assignment is designed to help you get comfortable using the skills covered in our machine learning tutorial in class. For this assignment you will use the Auto MPG data set from CMU, or a data set approved by the instructor. Your goal is to predict the miles per gallon of an automobile based on one or more of the following features. This is an INDIVIDUAL assignment. 

1. cylinders: multi-valued discrete 
2. displacement: continuous 
3. horsepower: continuous 
4. weight: continuous 
5. acceleration: continuous 
6. model year: multi-valued discrete 
7. origin: multi-valued discrete 
8. car name: string (unique for each instance)

In order to complete this assignment, you will need to:

  1. Download Weka and get it installed on your machine 
  2. Download the Auto MPG data set 
  3. Prepare the data (in the Auto MPG case, this means converting the data into a form Weka will read):
    • I used Excel to open the data (I had to specify "any file type") and then saved it in CSV format. 
    • For this assignment, you will want to delete the car descriptions (the car name attribute) and/or replace them with a class (such as the maker of the car). 
    • I renamed the file to auto-mpg.arff and added a header specifying the type and name of each variable (following the guidelines in the ARFF documentation).
  4. Load the data into Weka's "Explorer" 
  5. Use two types of classifiers:
    1. Linear Regression:
      • You cannot have any string variables (but you can have a class). This is why you needed to change the description as mentioned above.
      • I recommend using "more options" and telling Weka to output the model, predictions (using CSV) and all evaluation metrics
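      • Your summary report should include the weights of the attributes and the accuracy of the model (for a numeric prediction, Weka reports the correlation coefficient and error measures such as mean absolute error rather than a percentage correct)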
      • Your summary report should explore which attributes were most predictive (based on their weights in the linear regression model). 
    2. Decision Trees: 
      • Bin the MPG into "low" and "high" (or whatever classes you decide on). One option is to use the unsupervised Discretize filter on the mpg attribute, with 2 or 3 as the number of bins. If you look at the values for mpg, you can see they range from 9 to about 47. You may also decide on bins that make sense to you, and discretize by hand. 
      • Your summary report should include a picture of the decision tree generated, and your accuracy
      • Your summary report should explore the meaning of the tree structure
  6. Report on the results (see comments on the summary report for each classifier)
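The data preparation in step 3 can also be sketched as a short script. This is a minimal illustration and not a required method: it assumes the data has already been saved as CSV with a header row, and that the car name has been replaced by a one-word maker class in the last column. The column names, the helper name to_arff, and the sample rows are assumptions made for the example; treat the values as illustrative.

```python
import csv
import io

# A few sample rows in the assumed CSV layout (header row, maker as last column).
SAMPLE_CSV = """mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,make
18.0,8,307.0,130.0,3504,12.0,70,1,chevrolet
24.0,4,113.0,95.0,2372,15.0,70,3,toyota
25.0,4,98.0,?,2046,19.0,71,1,ford
"""


def to_arff(relation, header, rows, nominal=("make",)):
    """Return ARFF text: columns listed in `nominal` become nominal, the rest numeric."""
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        if name in nominal:
            # A nominal attribute lists every value it can take.
            values = sorted({row[i] for row in rows})
            lines.append("@attribute %s {%s}" % (name, ",".join(values)))
        else:
            lines.append("@attribute %s numeric" % name)
    lines += ["", "@data"]
    # ARFF marks a missing value with "?", so such cells pass through unchanged.
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


table = list(csv.reader(io.StringIO(SAMPLE_CSV)))
arff_text = to_arff("auto-mpg", table[0], table[1:])
print(arff_text)
```

Because the Explorer treats the last attribute as the class by default, mpg would need to be selected as the class in the Classify tab when a file laid out like this is used for linear regression.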
This is an individual assignment. You should hand in a zip file containing the ARFF file you created for the linear regression, the ARFF file you created for decision trees, and a Word document reporting on your results. The report should briefly explain what you are trying to predict, what attributes you are using to help with the prediction, how large the data set is, and where it came from (this is required whether you use the assigned data set or one of your own choosing). It should also include a screen capture of the bar charts illustrating the data (taken from the Preprocess screen of Weka's Explorer) and a summary of the machine learning results (as described above), including your interpretation of what is going on with your results.
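For the decision tree step, the idea behind binning mpg into classes such as "low" and "high" can be sketched in a few lines of Python. This only illustrates the technique, assuming bins of equal width are wanted (Weka's unsupervised Discretize filter uses equal-width bins by default). The function name, the labels, and the sample values are hypothetical.

```python
def equal_width_bins(values, labels):
    """Assign each value to one of len(labels) equal-width bins between min and max."""
    n_bins = len(labels)
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    binned = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything falls in the first bin
        else:
            # The maximum value would land one past the end, so clamp it into the last bin.
            index = min(int((v - lo) / width), n_bins - 1)
        binned.append(labels[index])
    return binned


# Illustrative mpg values only; the cut point here is halfway between 9.0 and 46.6.
sample_mpg = [9.0, 15.0, 18.0, 24.0, 31.5, 46.6]
classes = equal_width_bins(sample_mpg, ["low", "high"])
print(list(zip(sample_mpg, classes)))
```

With equal-width bins the cut points depend only on the minimum and maximum, so a few extreme values can leave one class much smaller than the other. That is one reason you may prefer to choose the bins by hand.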