9/28/2019

Optimal classification trees (最佳分類樹)

D. Bertsimas and J. Dunn, Optimal classification trees, Machine Learning, July 2017, Volume 106, Issue 7, pp 1039–1082.
State-of-the-art decision tree methods apply heuristics recursively to create each split in isolation, which may not capture well the underlying characteristics of the dataset. The optimal decision tree problem attempts to resolve this by creating the entire decision tree at once to achieve global optimality. In the last 25 years, algorithmic advances in integer optimization coupled with hardware improvements have resulted in an astonishing 800 billion factor speedup in mixed-integer optimization (MIO). Motivated by this speedup, we present optimal classification trees (1), a novel formulation of the decision tree problem using modern MIO techniques that yields the optimal decision tree for axes-aligned splits. We also show the richness of this MIO formulation by adapting it to give optimal classification trees with hyperplanes (2) that generates optimal decision trees with multivariate splits. Synthetic tests demonstrate that these methods recover the true decision tree more closely than heuristics, refuting the notion that optimal methods overfit the training data. We comprehensively benchmark these methods on a sample of 53 datasets from the UCI machine learning repository. We establish that these MIO methods are practically solvable on real-world datasets with sizes in the 1000s, and give average absolute improvements in out-of-sample accuracy over CART of 1–2 and 3–5% for the univariate and multivariate cases, respectively. Furthermore, we identify that optimal classification trees are likely to outperform CART by 1.2–1.3% in situations where the CART accuracy is high and we have sufficient training data, while the multivariate version outperforms CART by 4–7% when the CART accuracy or dimension of the dataset is low.
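To make the formulation idea concrete for myself, here is a minimal toy sketch of the MIO approach for a single depth-1, axis-aligned split: one binary variable per feature selects the split, big-M constraints route each training point to a leaf, and binary leaf-label variables determine the misclassification count that is minimised. It uses the open-source PuLP/CBC stack rather than the commercial solver used in the paper, and it omits the complexity penalty, minimum-leaf-size constraints, and normalisation of the actual OCT model (equation (24)), so it only illustrates the style of the formulation, not the authors' exact model.

import numpy as np
import pulp

def oct_depth1(X, y, eps=1e-4, M=1.0 + 1e-4):
    """Toy depth-1 'optimal tree': one axis-aligned split chosen by a MIP.
    Assumes features are scaled to [0, 1]; not the paper's formulation."""
    n, p = X.shape
    classes = sorted(set(y))
    prob = pulp.LpProblem("oct_depth1", pulp.LpMinimize)

    # a[j] = 1 if feature j is used in the split; b is the split threshold.
    a = [pulp.LpVariable(f"a_{j}", cat="Binary") for j in range(p)]
    b = pulp.LpVariable("b", lowBound=0, upBound=1)
    # z[i][t] = 1 if point i is routed to leaf t (0 = left, 1 = right).
    z = [[pulp.LpVariable(f"z_{i}_{t}", cat="Binary") for t in range(2)] for i in range(n)]
    # c[t][k] = 1 if leaf t predicts class classes[k].
    c = [[pulp.LpVariable(f"c_{t}_{k}", cat="Binary") for k in range(len(classes))] for t in range(2)]
    # m[i] = 1 if point i is misclassified.
    m = [pulp.LpVariable(f"m_{i}", cat="Binary") for i in range(n)]

    prob += pulp.lpSum(m)                          # objective: training misclassifications
    prob += pulp.lpSum(a) == 1                     # exactly one split feature
    for t in range(2):
        prob += pulp.lpSum(c[t]) == 1              # one predicted class per leaf
    for i in range(n):
        ax = pulp.lpSum(float(X[i, j]) * a[j] for j in range(p))
        k = classes.index(y[i])
        prob += z[i][0] + z[i][1] == 1             # each point lands in exactly one leaf
        prob += ax <= b + M * (1 - z[i][0])        # left leaf:  a.x <= b
        prob += ax >= b + eps - M * (1 - z[i][1])  # right leaf: a.x >  b
        for t in range(2):
            # if i sits in leaf t and that leaf does not predict y[i], force m[i] = 1
            prob += m[i] >= z[i][t] - c[t][k]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    j_star = int(np.argmax([pulp.value(v) for v in a]))
    errors = sum(int(round(pulp.value(v))) for v in m)
    return j_star, pulp.value(b), errors

# Tiny sanity check: either feature separates the two classes, so zero training errors.
X = np.array([[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1]])
y = np.array([0, 0, 1, 1])
print(oct_depth1(X, y))

Even this toy version makes the scalability issue visible: the binary routing variables grow with the number of points times the number of leaves, which is why the MIO speedup and the warm starts and time limits mentioned in note (3) below matter for trees of depth 2-4 on datasets with sizes in the 1000s.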
About Random Forests
The average accuracy of Random Forests across all 53 datasets was 85.8%, which is about 6, 5 and 3% higher than CART, OCT and OCT-H at depth 4, respectively. …
We can therefore achieve state-of-the-art accuracies in this regime with just a single tree learner, allowing us to maintain some interpretability (3) in the model, albeit less than an axis-aligned tree, but still much more than a forest.
(1) Optimal classification trees (OCT): equation (24) in the paper.
(2) Optimal classification trees with hyperplanes (OCT-H): equation (28) in the paper.
(3) At the expense of computational cost (for offline training). The authors do not report a running-time comparison, but they do provide the following information in the paper:
A time limit was applied to each optimal tree problem. On most datasets for OCT this was 30 min, but for some of the larger datasets we increased the time limit up to 2 h to ensure that the solver could make progress. The OCT-H problem is easier to solve and so only required time limits of 5–15 min. When generating heuristic warm starts for the multivariate problem by recursively solving the OCT-H problem with a single split, we used a time limit of 60 s. We found this was more than enough to find a good greedy solution. 
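The paper gives no code for this, but as a rough note on how such a per-problem time limit would look with the open-source toolchain from the sketch above (the authors used a commercial MIO solver; timeLimit here is in seconds):

# 'prob' is an LpProblem like the one built in the sketch above
solver = pulp.PULP_CBC_CMD(msg=False, timeLimit=1800)  # 30-minute cap per optimal-tree solve
prob.solve(solver)  # keep whatever incumbent tree the solver has found when the limit is hit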
