Personal Project: Cars Inc Predicting Prices

JMP, Data Analysis, Prediction

A used car dealership would like to earn a reasonable profit on sales and be able to predict prices for used trade-in cars. Higher profits would make for a more efficient and successful business, and the extra profit could be reinvested to grow it, for example through employee bonuses. To find the best way to predict price, three different models will be built and compared. The response variable in each model is Price. There are 10 possible predictors: Age, Mileage, Fuel Type, Horse Power, Metallic, Automatic, CC, Doors, QuartTax, and Weight. The numerical variables are Price, Age, Mileage, Horse Power, CC, Doors, QuartTax, and Weight. The categorical variables are Fuel Type, Automatic, and Metallic.

Process

Data

Looking at the distributions, the number of cars increases with age. This is understandable, since people are more likely to trade in an old car than a new one. Newer car models are likely more profitable, but that is not to say that older used cars cannot earn a profit. Mileage is skewed to the left with some outliers at its tail, but these are not out of the ordinary. The Bivariate plot of Price by Mileage shows a moderately strong negative correlation with an r-square of .35, so about 35% of the variability in price can be explained by mileage alone. This is good for the dealership, since the most profitable cars are the low-mileage ones. Most of these cars use petrol (86.7%) and are not automatic (95.2%). There also appears to be a slight positive correlation between price and weight: the heavier a car is, the more expensive it tends to be, which could be due to a variety of factors.

In the column viewer we can see that little data is missing from any of the columns. The few columns with missing values are Fuel Type, Metallic, and Automatic. The value may have been unknown at the time, or the car may not have fit any of the categories. None of the summary statistics are out of the ordinary. It would be advisable to check the cars with the lowest mileage and lowest age, as there might be recalls for these types of cars.

Multivariate analysis

Looking at the multivariate analysis, we can see that Price is correlated with age, mileage, and weight. Price and age have a strong negative correlation at -.86. Price and mileage have a moderate negative correlation at -.59. Price and weight have a moderate positive correlation at .61. Age and weight are also strongly correlated; this could be because cars got heavier as manufacturers got better at making them. Mileage and age have a moderate positive correlation at .55, which is expected, since a car generally accumulates mileage as it ages. QuartTax and weight have a moderately strong correlation at .6339.
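The correlation for Price and mileage ties back to the r-square of .35 reported for the Price by Mileage fit: for a straight-line fit on a single predictor, r-square is the square of the correlation. The following is a minimal sketch of that relationship in plain Python. The `pearson_r` helper and the toy `age` and `price` values are illustrative assumptions, not the dealership data or JMP output.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values (NOT the dealership data): car age and price.
age = [12, 24, 36, 48, 60, 72]
price = [21000, 18500, 15000, 12500, 11000, 8000]

r = pearson_r(age, price)
r_squared = r ** 2  # share of price variability explained by a straight-line fit on age

# The reported Price-mileage correlation of -.59 implies an r-square of about .35.
mileage_r_squared = round((-0.59) ** 2, 2)
```

Squaring removes the sign, so r-square alone does not say whether the relationship is positive or negative; the correlation itself is needed for that.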

Evaluate the Model

Leaf Report & Multicollinearity

There is no missing data to be concerned with. One problem that could possibly occur is multi-collinearity between age and mileage, since the two carry much of the same information: older cars will usually have more mileage and newer cars less. The same could apply to age and weight, since there seems to be a trend of newer cars weighing more. If multi-collinearity were to occur, the affected independent variables would not appear as significant as predictors.

Model I: Partition Tree

Analyze: Modeling

Three Model Comparison

In our partition tree, the Informative Missing box is checked, although we do not have much information missing. The rest of the settings are left at their defaults. In this partition model a total of 14 splits were made. The training set has an r-square of .875 and the validation set has an r-square of .847. R-square indicates the share of variability explained by the model; since 1 would be a perfect fit, .847 is a strong result. A slightly lower validation r-square is to be expected. The RMSE is 1280.91 for training and 1528 for validation, so the validation RMSE is fairly high. In our column contributions the greatest contributor is Age. Some predictors are not used at all: Fuel Type, Metallic, Automatic, and Doors. This is good for our model, since it makes it simpler. Our leaf report shows that the highest priced used car would be one with an age of less than 32 and a weight greater than or equal to 1265. Not many used cars fit into this group: only 5 out of 1000. These would be priced at a mean of 2610. The cheapest used car would be one that is older, has lower horse power, and has high mileage.
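To make the idea of a split concrete, here is a simplified sketch of how a single split on one numeric predictor can be chosen: try each candidate threshold and keep the one that leaves the least squared error around the two group means. This is an illustration under stated assumptions, not JMP's exact splitting criterion, and the `sse` and `best_split` helpers and the toy values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total_sse) for the split 'x < threshold' with the lowest SSE."""
    best = None
    # Skip the smallest x so that both sides of every split are non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical toy values (NOT the dealership data): car age and price.
age = [10, 20, 30, 40, 50, 60]
price = [20000, 19000, 18500, 11000, 10500, 9000]

threshold, total_sse = best_split(age, price)
```

A full tree repeats this search across all predictors and within each resulting group, and each leaf then predicts the mean price of the cars that fall into it.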


Model II: Multilinear Regression Model

In the Multilinear Regression Model we see that the r-square is .840. This is really good for the model, although the decision tree had a slightly higher r-square. The root mean square error (RMSE) reflects the variance our model leaves unexplained; at 1454.085 it is somewhat high, which is not good for the model. This model uses only 4 predictors, which in order of importance in the model are age, weight, horse power, and mileage. Looking at the VIF, we can see whether any of the predictors used in the model show multi-collinearity. None of the VIF values are too high, so we do not have multi-collinearity in this model. All of our predictors are important to the model, as shown by the Prob>|t| values.
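As a rough illustration of what a VIF measures: the VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below uses the simplifying assumption of a model with only two predictors, in which case R² is just the squared pairwise correlation; it plugs in the age and mileage correlation of .55 from the multivariate analysis. The helper name is hypothetical, and the result is not the VIF of the four-predictor model, which depends on all the other predictors at once.

```python
def vif_from_r_squared(r_squared_j):
    """VIF for predictor j, given the R^2 from regressing it on the other predictors."""
    return 1.0 / (1.0 - r_squared_j)

# Assumption: only two predictors, so R_j^2 equals the squared pairwise correlation.
age_mileage_r = 0.55
vif_two_predictor = vif_from_r_squared(age_mileage_r ** 2)  # roughly 1.43
```

Common rules of thumb treat a VIF above 5 or 10 as a sign of problematic multi-collinearity, so a value near 1.4 would not be a concern.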

Model III: Neural Network

Lastly, in the neural network we check the Informative Missing box once again. The r-square for this model is the highest of all the models we used. This is not surprising, since neural networks are flexible models that can capture nonlinear relationships. The validation r-square is .906, which is very good and shows that our model explains the variability in price very well. The RMSE is lower than that of both the decision tree and the MLR, which is good: the lower the RMSE, the better. The model minimizes unexplained variance. Overall, the neural network is a very good model.

Model Comparison

Looking at the models, they are all very close together in terms of r-square and RASE. However, the neural network is the best model to use to predict what price the dealership will get for a used car. It has the lowest error, the lowest unexplained variability, and the highest explained variability. Comparing RASE, we see that the Neural Network has the lowest value, showing less error in its predictions. The Neural Network is better in both training and validation. Training results are always slightly better than validation results, which is to be expected: training is how the model learns, while validation measures how well it actually predicts. For that reason the results on held-out data are usually the more valuable measure. The second best model is the Fit Least Squares model, because it has a lower error and a higher r-square in its validation than the partition tree. It also uses fewer predictors. However, it is very close between these last two models.
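The two comparison measures can be sketched in a few lines. The version below assumes RASE is the square root of the average squared prediction error (with no degrees-of-freedom adjustment) and that r-square is one minus the ratio of residual to total sum of squares. The prices, predictions, and the names `model_a` and `model_b` are hypothetical and are not the results reported here.

```python
import math

def r_squared(actual, predicted):
    """1 - SS_residual / SS_total for a set of predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rase(actual, predicted):
    """Root average squared error: sqrt of the mean squared prediction error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Hypothetical validation prices and predictions from two made-up models.
actual = [10000, 12000, 15000, 20000]
predictions = {
    "model_a": [10500, 11500, 15500, 19000],
    "model_b": [11500, 10500, 16500, 18000],
}

scores = {
    name: (r_squared(actual, preds), rase(actual, preds))
    for name, preds in predictions.items()
}

# Pick the model with the lowest validation RASE.
best_model = min(scores, key=lambda name: scores[name][1])
```

Computing both measures on the validation rows rather than the training rows is what makes the comparison a fair check of predictive ability.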

Conclusion

Overall, the best model for the dealership to use to predict used car prices is the neural network, because of its greater predictive ability compared to the other models.

Alondra Salazar

© 2022 Alondra Salazar