Adapted from [275], where a similar taxonomy was suggested in the general framework of searching for structure in data.

Lubinsky [222] considered trees that can have internal nodes with just one child. At these nodes, the data are not split, but residuals are taken from a single variable regression.

While converting decision tables to trees, it is common to have leaf nodes that have a ``no decision'' label.

More precisely, a decision tree is said to perform classification if the class labels are discrete values, and regression if the class labels are continuous. We restrict almost entirely to classification trees in this chapter.

Oblivious decision trees [186] from the machine learning literature are nearly identical in structure to OBDDs.

Chou [56] suggested a pruning method, based on [29], for optimally pruning a balanced TSVQ. The TSVQ growing procedure suggested by Riskin and Gray [305] can be viewed as an inverse to Chou's pruning procedure.

The desirable properties of a measure of entropy include symmetry, expandability, decisivity, additivity and recursivity. Shannon's entropy [326] possesses all of these properties [1]. For an insightful treatment of entropy reduction as a common theme underlying several pattern recognition problems, see [370].

Goodman and Smyth [126] report that the idea of using the mutual information between features and classes to select the best feature was originally put forward by Lewis [209].

Quinlan's C4.5 [297] uses a naive version of the confidence intervals for doing pessimistic pruning.

Schaffer [318] stated and proved a conservation theorem that states, essentially, that positive performance in some learning situations must be offset by an equal degree of negative performance in others. To clarify the, sometimes non-intuitive, consequences of the conservation theorem, Schaffer [319] gave an example of a concept for which information loss gives better generalization accuracy than information gain.

For a general discussion about the relationship between complexity and predictive accuracy of classifiers, see [284].

Techniques that start with a sufficient partitioning and then optimize the structure (e.g., [237]) can be thought of as being a converse to this approach.

In bootstrapping, B independent learning samples, each of size N are created by random sampling with replacement from the original learning sample L. In cross validation, L is divided randomly into B mutually exclusive, equal sized partitions. Efron [83] showed that, although cross validation closely approximates the true result, bootstrap has much less variance, especially for small samples. However, there exist arguments that cross validation is clearly preferable to bootstrap in practice [185].

Van Campenhout [44] argues that increasing the amount of information in a measurement subset through enlarging its size or complexity never worsens the error probability of a truly Bayesian classifier. Even after this guarantee, the cost and complexity due to additional measurements may not be worth the slight (if any) improvement in accuracy. Moreover, most real world classifiers are not truly Bayesian.

An exception is the optimal feature subset selection method using zero-one integer programming, suggested by Ichino and Sklansky [157].

Smoothing is the process of adjusting probabilities at a node in the tree based on the probabilities at other nodes on the same path. Averaging improves probability estimates by considering multiple trees.

A lot of work exists in the neural networks literature on using committees or ensembles of networks to improve classification performance. See [139] for example. An alternative to multiple trees is a hybrid classifier that uses several small classifiers as parts of a larger classifier. Brodley [31] describes a system that automatically selects the most suitable among a univariate decision tree, a linear discriminant and an instance based classifier at each node of a hierarchical, recursive classifier.

Hierarchical unsupervised clustering can construct, using bottom-up or top-down methods, tree-structured classifiers. As mentioned in Section 2.1, these methods are beyond the scope of the current chapter.

Thanks to Kevin Van Horn for pointing this out.

It is argued empirically [76] that the variance in decision tree methods is more a reason than bias for their poor performance on some domains.

The Statlog project is initiated by the European Commission, and its full title is ``The Comparative Testing of Statistical and Logical Learning Algorithms on Large-Scale Applications to Classification, Prediction and Control''.

For a general description of modern classification problems in astronomy, which prompt the use of pattern recognition and machine learning techniques, see [198].

Sreerama Murthy
Thu Oct 19 17:40:24 EDT 1995