FIT2086 Studio 9 Supervised Machine Learning Methods Assessment Answer
Number of leaves in best tree = 7
Variables used in best tree are: THAL, AGE, CP, EXANG, CA, CHOL
Plot of best tree:
According to this decision tree, the most important features for predicting heart disease are THAL, AGE, CP, EXANG, CA and CHOL.
Following are the rules that decide whether a patient has heart disease or not:
- (THAL = Normal) and (AGE < 54.5) → No
- (THAL = Normal) and (AGE >= 54.5) and (EXANG = Y) → Yes
- (THAL = Normal) and (AGE >= 54.5) and (EXANG = N) and (CHOL < 304) → No
- (THAL = Normal) and (AGE >= 54.5) and (EXANG = N) and (CHOL >= 304) → Yes
- (THAL != Normal) and (CP = Asymptomatic) → Yes
- (THAL != Normal) and (CP != Asymptomatic) and (CA < 0.5) → No
- (THAL != Normal) and (CP != Asymptomatic) and (CA >= 0.5) → Yes
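The tree above was presumably grown and pruned in R with rpart; as a hedged illustration of the same idea, a small tree with a fixed number of leaves can be grown and its decision rules printed with scikit-learn. Everything below (the data, the feature names, the leaf limit) is a synthetic stand-in, not the actual heart-disease dataset:

```python
# Sketch: growing a small classification tree and printing its rules.
# Synthetic stand-in data; feature names are placeholders, not the real
# heart-disease predictors.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))                      # stand-ins for e.g. AGE, CHOL, CA
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # synthetic disease label

# Capping the number of leaves mirrors the pruned best tree with 7 leaves.
tree = DecisionTreeClassifier(max_leaf_nodes=7, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["AGE", "CHOL", "CA"]))
print("leaves:", tree.get_n_leaves())
```

Each root-to-leaf path in the printed output corresponds to one if-then rule of the kind listed above.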
Tree plot with probability of each class at leaf node:
According to this tree, a patient with (THAL != Normal) and (CP = Asymptomatic) has probability 0.8787 of having heart disease.
Variables included by logistic regression model
THAL, CP, CA, EXANG, OLDPEAK.
The tree model, by contrast, uses the features THAL, CP, EXANG, CA, CHOL and AGE. CP has the largest coefficient (in magnitude) in the logistic regression equation, so CP is one of the most important predictors in the logistic regression model.
Logistic regression equation
z = (-0.8075472) + … (remaining coefficient terms omitted)
p = 1 / (1 + e^(-z))
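The logistic model maps the linear predictor z to a probability through the sigmoid function. A minimal Python sketch; the intercept -0.8075472 comes from the model output above, and since the remaining coefficients are omitted in this report, only the intercept term is evaluated here:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps the linear predictor z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# With only the intercept term (the other coefficients are omitted in the
# report), the baseline probability of disease would be:
print(sigmoid(-0.8075472))  # roughly 0.308
```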
Prediction statistics of tree
Prediction statistics of logistic regression
Accuracy of the d-tree is 84.84%, and accuracy of the logistic regression is 86.36%. The sensitivity (recall) of the decision tree is better than that of the logistic regression, while the specificity of the logistic regression is better than that of the decision tree. So, for a diagnostic test, one would prefer the d-tree over logistic regression: its higher recall means it catches more of the positive cases.
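These statistics come from the confusion matrix on the test set. A small sketch of how accuracy, sensitivity and specificity are computed from confusion-matrix counts; the counts below are illustrative, not the actual studio confusion matrices:

```python
# Sketch: classification statistics from confusion-matrix counts.
# tp/fp/tn/fn values here are made up for illustration.
def classification_stats(tp: int, fp: int, tn: int, fn: int):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall: fraction of true positives caught
    specificity = tn / (tn + fp)   # fraction of negatives correctly rejected
    return accuracy, sensitivity, specificity

acc, sens, spec = classification_stats(tp=50, fp=8, tn=62, fn=12)
print(acc, sens, spec)
```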
According to the logistic regression model:
The probability that the 45th test example is positive for heart disease is 0.71.
According to the d-tree model:
The probability that the 45th test example is positive for heart disease is 0.3.
The 95% confidence interval for the probability of heart disease for the patient in the 45th row of the test data is (0.39, 0.8983).
The probability of heart disease for this patient according to the logistic regression model in part 1.8 is 0.71, which lies inside the 95% confidence interval (0.39, 0.8983). The probability according to the d-tree model, 0.3, does not lie inside this interval, which was computed for the logistic regression model. Accordingly, the prediction for the patient in the 45th row is YES according to logistic regression but NO according to the d-tree.
95% Confidence interval for the classification accuracy of the logistic regression model using the predictors selected by BIC is (0.8182,0.8788).
The actual classification accuracy obtained on the testing data using the model learned in Q1.6 is 86.36%, which lies inside the 95% confidence interval (0.8182, 0.8788).
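An interval like this can be obtained by bootstrapping over the test set. A sketch of the idea; the test-set size (132) and accuracy (114/132 ≈ 86.36%) are taken from the numbers above, but the per-example correctness indicators are fabricated here, so the resulting interval only approximates the one in the report:

```python
# Sketch: bootstrap confidence interval for classification accuracy.
import numpy as np

rng = np.random.default_rng(42)
n = 132                          # test-set size implied by the report
correct = np.zeros(n, dtype=int)
correct[:114] = 1                # 114/132 correct -> 86.36% accuracy

# Resample the per-example correctness indicators with replacement and
# recompute the accuracy each time.
boot = np.array([rng.choice(correct, size=n, replace=True).mean()
                 for _ in range(5000)])
lo, hi = np.quantile(boot, [0.025, 0.975])
print(round(lo, 4), round(hi, 4))
```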
Plot for k = 2
Plot for k = 5
Plot for k = 10
Plot for k = 25
Visually, the estimated curve is closest to the actual test curve for k = 5. For k = 2 and k = 10 the estimated curve is clearly distinguishable from the test curve, and the gap increases further for k = 25.
So the RMSE for k = 5 should be the smallest of the four cases.
The optimal value of k chosen by cross-validation is 3. In question 2.1a we saw that the mean-squared error on the test data is minimized at k = 4. This discrepancy is possible because cross-validation uses only the training data; even so, the optimal values of k found by the two methods do not differ much.
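The cross-validated choice of k was presumably made with an R routine such as kknn's train.kknn; a rough scikit-learn equivalent is sketched below. The data, noise level and candidate range of k are all made up for illustration:

```python
# Sketch: choosing k for kNN regression by cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)     # synthetic MZ values
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200) # synthetic intensities

# 10-fold CV over candidate k values, scored by mean-squared error.
search = GridSearchCV(KNeighborsRegressor(),
                      param_grid={"n_neighbors": list(range(1, 26))},
                      scoring="neg_mean_squared_error", cv=10)
search.fit(x, y)
print("best k:", search.best_params_["n_neighbors"])
```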
The sensor measurement noise for the test data equals the difference between the given intensity values and the predicted values (using k = 5). The variance of this difference is 0.6859631.
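The noise-variance estimate is just the sample variance of the residuals. A tiny sketch with made-up numbers (the report obtained 0.6859631 on the real data):

```python
# Sketch: estimate sensor noise variance as the sample variance of the
# residuals between observed intensities and kNN predictions.
# Illustrative values only.
import numpy as np

observed = np.array([2.1, 3.4, 5.0, 4.2, 3.3])
predicted = np.array([2.0, 3.1, 4.6, 4.5, 3.0])
residuals = observed - predicted
print(np.var(residuals, ddof=1))   # unbiased sample variance of the noise
```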
Yes, for k = 5 the kNN method achieves our aim of providing a smooth, low-noise estimate of the background level as well as accurate estimation of the peaks.
In my view, kNN estimates the peaks well because it learns from the behaviour of neighbouring points: if the neighbouring points in the training data have peak values, the algorithm assigns higher values to the corresponding test points as well.
Plot using d-tree
We can see that the d-tree estimates the test data much more poorly than the kNN method. The MSE of kNN for k = 5 was 0.717139, while the MSE of the d-tree is 4.723485.
Similarity between d-tree and kNN: broadly, both algorithms find a neighbourhood of a point and then use the mean over that neighbourhood to make a prediction.
Difference between d-tree and kNN: the major difference is how the neighbours are found. A d-tree finds neighbours by making splits of the form MZ > R or MZ < R (where R is a real number), while kNN finds neighbours by directly computing Euclidean distances.
The kNN way of finding nearest neighbours suits this particular one-dimensional example best, but for other problems in higher dimensions a d-tree generally works well.
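The contrast can be sketched on a smooth one-dimensional signal: a shallow regression tree produces a coarse piecewise-constant fit, whereas kNN averages locally and tracks the curve more closely. All data and model settings below are illustrative stand-ins, not the actual spectrum data:

```python
# Sketch: kNN regression vs a shallow regression tree on a smooth 1-D signal.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
x_train = np.sort(rng.uniform(0, 10, 300)).reshape(-1, 1)
y_train = np.sin(x_train).ravel() + rng.normal(scale=0.2, size=300)
x_test = np.linspace(0, 10, 200).reshape(-1, 1)
y_test = np.sin(x_test).ravel()               # noiseless truth for scoring

knn = KNeighborsRegressor(n_neighbors=5).fit(x_train, y_train)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x_train, y_train)

mse_knn = mean_squared_error(y_test, knn.predict(x_test))
mse_tree = mean_squared_error(y_test, tree.predict(x_test))
print("kNN MSE:", mse_knn, "tree MSE:", mse_tree)
```

With only 8 leaves, the tree's piecewise-constant segments cannot follow the sine curve, so its MSE comes out larger than kNN's, mirroring the gap observed on the spectrum data.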