Welcome back. In my previous post I described how we can perform linear regression using the normal equation in Dynamo, and I left you with a question: "What if the input and output are not linearly dependent?" Let's say our hypothesis is that the housing price doesn't depend linearly on floor area and number of rooms, but instead follows the relationship y = a_{0} * x_{0} + a_{1} * x_{1} + a_{2} * x_{2} + a_{3} * x_{1} * x_{2} + a_{4} * x_{1}^2 + a_{5} * x_{2}^2, where x_{0} = 1, x_{1} is the floor area and x_{2} is the number of rooms. The model is still linear in the coefficients a_{0} ... a_{5}, so we can introduce a few new features, x_{3} = x_{1} * x_{2}, x_{4} = x_{1}^2 and x_{5} = x_{2}^2, and then perform linear regression on the extended feature vector to find the coefficients.
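For readers who prefer text to graphs, the feature expansion above can be sketched in a few lines of Python. This is only an illustration of the hypothesis, not the DynamoAI implementation; the function names `polynomial_features` and `build_feature_matrix` are made up for this sketch.

```python
def polynomial_features(floor_area, rooms):
    """Return [x0, x1, x2, x3, x4, x5] for the quadratic hypothesis.

    x0 = 1 (intercept), x1 = floor area, x2 = number of rooms,
    x3 = x1 * x2, x4 = x1^2, x5 = x2^2.
    """
    x1 = float(floor_area)
    x2 = float(rooms)
    return [1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2]


def build_feature_matrix(samples):
    """Expand a list of (floor_area, rooms) pairs into one feature row each."""
    return [polynomial_features(area, rooms) for area, rooms in samples]
```

The same function has to be applied to any house we later want a price for, so that the prediction uses the same six features as the fit.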
The Dynamo graph to set up the feature vector looks as follows. Once the feature vector is set up, everything else is the same as in the previous linear regression example.
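As a reminder of what "the same as before" means, here is a minimal sketch of the normal equation, (X^T X) a = X^T y, solved with plain Gaussian elimination. It is an assumption-laden illustration in Python, not the code behind the Dynamo nodes; `solve_normal_equation` is a hypothetical helper name, and a real implementation would normally use a numerical library and may need feature scaling for large values such as squared floor areas.

```python
def solve_normal_equation(X, y):
    """Solve (X^T X) a = X^T y for the coefficient list a.

    X is a list of feature rows, y is the list of target values.
    Uses Gaussian elimination with partial pivoting.
    """
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y], built entry by entry.
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * target for row, target in zip(X, y))]
        for i in range(n)
    ]
    # Forward elimination.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; features are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (A[i][n] - known) / A[i][i]
    return coeffs
```

Because the polynomial terms are just extra columns of X, the solver does not need to know whether it is fitting the linear or the polynomial hypothesis.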
Note that to predict the price for a given floor area and number of rooms, we need to construct the same feature vector again. Using the complete dataset and this new hypothesis, the predicted price for a 3-room house of area 1650 sq. ft. is 307,176, whereas the price predicted by the previous linear regression model is 293,081. Sellers can argue that the polynomial model is better because it fetches them a higher price, and buyers can argue that the linear regression model is better because it gives a lower price for their desired house. To identify the better model, we need to build a test set and use it to find out which model gives less error. We can use 80% of the dataset to fit (train) the regression model and the remaining 20% to test both models. Before splitting the dataset into a training set and a test set, we must shuffle the data to avoid any bias from the order in which the records appear in the dataset.
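The shuffle-then-split step can be sketched in Python as follows. This is a minimal illustration under the 80/20 assumption stated above; `shuffle_split` and its `seed` parameter are hypothetical names introduced here, not part of the DynamoAI package.

```python
import random


def shuffle_split(rows, train_fraction=0.8, seed=None):
    """Shuffle the rows, then split them into (training, test) lists.

    Each row is expected to hold both the inputs and the price, so that
    inputs and outputs stay together while shuffling.
    """
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Passing a fixed `seed` makes the split repeatable, which helps when comparing two models on exactly the same test rows.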
The Dynamo graph to split the dataset into training and test sets is shown below. We shuffle the data row-wise and then extract a sub-matrix based on the percentage of the dataset to be used for training.
Now we can solve the linear regression model using the Matrix.SolveLinearEquation node, which uses the same technique to fit the regression model as discussed in the previous blog post on linear regression. Once we have computed the two models (linear and polynomial) on the training set, we run each model on the test set to get its error value. The total error for a model is the sum of the absolute errors of its predictions compared with the actual prices. Comparing the two errors shows that the linear regression model performs better on the test set, so we should choose linear regression over the polynomial regression hypothesis discussed here. The error computation graph in Dynamo is shown below.
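The error measure described above is simple enough to write out. The following Python sketch assumes each model is represented by a list of coefficients and each test row by the matching feature list; `predict` and `total_absolute_error` are illustrative names, not nodes from the package.

```python
def predict(coeffs, features):
    """Predicted price: the dot product of coefficients and features."""
    return sum(a * x for a, x in zip(coeffs, features))


def total_absolute_error(coeffs, X_test, y_test):
    """Sum of |predicted - actual| over every row of the test set."""
    return sum(
        abs(predict(coeffs, row) - actual)
        for row, actual in zip(X_test, y_test)
    )
```

To compare the two hypotheses, each model is evaluated on the same test rows, with the linear model given the three-element feature lists and the polynomial model the six-element ones; the model with the smaller total wins.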
The final graph, which runs both models and compares their errors, is shown below and is shared on GitHub.
Conclusion: In this article, I have demonstrated how we can model a hypothesis based on a polynomial equation, solve for the prediction model using linear regression, and compare the two hypotheses based on their error on a test set that is separate from the training set. As always, please share your feedback and suggestions on this article, as well as on the DynamoAI package, on this page or on GitHub.