Q1. What is CART?
A1. CART is an acronym for Classification and Regression Trees, a decision-tree procedure introduced in 1984 by world-renowned UC Berkeley and Stanford statisticians, Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Their landmark work created the modern field of sophisticated, mathematically- and theoretically-founded decision trees. The CART methodology solves a number of performance, accuracy, and operational problems that still plague many current decision-tree methods. CART's innovations include:
- solving the "how big to grow the tree" problem;
- using strictly two-way (binary) splitting;
- incorporating automatic testing and tree validation; and
- providing a completely new method for handling missing values.
Q2. What makes Salford Systems' CART the only "true" CART?
A2. Salford Systems' CART is the only decision tree based on the original code of Breiman, Friedman, Olshen, and Stone. Because the code is proprietary, CART is the only true implementation of this classification-and-regression-tree methodology. In addition, the procedure has been substantially enhanced with new features and capabilities in exclusive collaboration with CART's creators. While some other decision-tree products claim to implement selected features of this technology, they are unable to reproduce genuine CART trees and lack key performance and accuracy components. Further, CART's creators continue to collaborate with Salford Systems to refine CART and to develop the next generation of data-mining tools.
Q3. What is a decision tree?
A3. A decision tree is a flow chart or diagram representing a classification system or predictive model. The tree is structured as a sequence of simple questions, and the answers to these questions trace a path down the tree. The end point reached determines the classification or prediction made by the model, which can be a qualitative judgment (e.g., these are responders) or a numerical forecast (e.g., sales will increase 15 percent).
Q4. What makes CART so easy to interpret?
A4. As illustrated above, the results of a decision-tree data-mining project are displayed as a tree-shaped visual diagram. Discovered relationships and patterns in the data - even in massively complex datasets with hundreds of variables - are presented as a flow chart. Compare this to complex parameter coefficients in a logistic regression output or a stream of numbers in a neural-net output, and the appeal of decision trees is readily apparent.
The visual display enables users to see the hierarchical interaction of the variables, in addition,it often confirms previous knowledge about important data relationships, which adds confidence in the reliability and utility of the CART model. Further, because simple if-then rules can be read right off the tree, models are easy to grasp and easy to apply to new data.
Q5. How are decision trees grown?
A5. There are a number of different ways to grow decision trees, but CART uses strictly binary, or two-way, splits that divide each parent node into exactly two child nodes by posing questions with yes/no answers at each decision node. CART searches for questions that split nodes into relatively homogenous child nodes, such as a group consisting largely of responders, or high credit risks, or people who bought sport-utility vehicles. As the tree evolves, the nodes become increasingly more homogenous, identifying important segments. Other methods, such as CHAID, favor multi-way splits that can paint visually appealing trees but that can bog models down with less accurate splits.
Q6. Why is CART unique among decision-tree tools?
A6. CART is based on a decade of research, assuring stable performance and reliable results. CART's proven methodology is characterized by:
- Reliable pruning strategy - CART's developers determined definitively that no stopping rule could be relied on to discover the optimal tree, so they introduced the notion of over-growing trees and then pruning back; this idea, fundamental to CART, ensures that important structure is not overlooked by stopping too soon. Other decision-tree techniques use problematic stopping rules.
- Powerful binary-split search approach - CART's binary decision trees are more sparing with data and detect more structure before too little data are left for learning. Other decision-tree approaches use multi-way splits that fragment the data rapidly, making it difficult to detect rules that require broad ranges of data to discover.
- Automatic self-validation procedures - in the search for patterns in databases it is essential to avoid the trap of "overfitting," or finding patterns that apply only to the training data. CART's embedded test disciplines ensure that the patterns found will hold up when applied to new data. Further, the testing and selection of the optimal tree are an integral part of the CART algorithm. Testing in other decision-tree techniques is conducted after the fact and tree selection is left up to the user.
- In addition, CART accommodates many different types of real-world modeling problems by providing a unique combination of automated solutions:
- surrogate splitters intelligently handle missing values;
- adjustable misclassification penalties help avoid the most costly errors;
- multiple-tree, committee-of-expert methods increase the precision of results; and
- alternative splitting criteria make progress when other criteria fail.
Q7. What tree-growing, or "splitting," criteria can CART provide?
A7. CART includes seven single-variable splitting criteria - Gini, Symgini, twoing, ordered twoing and class probability for classification trees, and least squares and least absolute deviation for regression trees - and one multi-variable splitting criteria, the linear combinations method. The default Gini method typically performs best, but, given specific circumstances, other methods can generate more accurate models. CART's unique "twoing" procedure, for example, is tuned for classification problems with many classes, such as modeling which of 170 products would be chosen by a given consumer.
Other splitting criteria are available for inherently difficult problems in which even the best models are expected to have a relatively low accuracy. Demographics, for example, are often weak predictors of attitude- and preference-based segments. Special CART tree-growing options can dramatically increase the predictive accuracy of such demographic-based models. Additional unique tree-growing criteria are available for problems involving unequal misclassification costs, ordered target variables, and continuous dependent variables.To deal more effectively with select data patterns, CART also offers splits on linear combination of continuous predictor variables. For this option, CART looks for weighted averages of predictor variables to use as splitters; these weighted averages can reveal important database structure and can uncover new critical measures.
Q8. What are "adjustable misclassification penalties"?
A8. Unlike many data-mining tools, CART can accommodate situations in which some misclassifications, or cases that have been incorrectly classified, are more serious than others. CART users can specify a higher penalty for misclassifying certain data, and the software will steer the tree away from that type of error. Further, when CART cannot guarantee a correct classification, it will try to ensure that the error it does make is less costly. If credit risk is classified as low, moderate, or high, for example, it would be much more costly to classify a high-risk person as low-risk than as moderate-risk. Traditional data mining tools cannot distinguish between these errors.
Q9. What are "intelligent surrogates for missing values"?
A9. CART handles missing values in the database by substituting "surrogate splitters," which are back-up rules that closely mimic the action of primary splitting rules. Suppose that, in a given model, CART splits data according to household income. If a value for income is not available, CART might substitute education level as a good surrogate.
The surrogate splitter contains information that is typically similar to what would be found in the primary splitter. Other products' approaches treat all records with missing values as if the records all had the same unknown value; with that approach all such 'missings" are assigned to the same bin. In CART, each record is processed using data specific to that record. This allows records with different data patterns to be handled differently, which results in a better characterization of the data.
By using surrogates to stand in for missing values, CART generates robust and reliable predictive models, even when applied to very large databases with hundreds of variables and many missing values. CART's identification of surrogate predictor variables also provides an effective way to discover low-cost predictive mechanisms. If the best splitting criterion in a tree involves an expensive or difficult-to-obtain measure, a less-expensive surrogate can be considered instead.
Q10. What are CART's "automatic self-validation procedures"?
A10. CART uses two test procedures to select the "optimal" tree, which is the tree with the lowest overall misclassification cost, thus the highest accuracy. Both test disciplines, one for small datasets and one for large, are entirely automated, ensuring that the optimal tree model will accurately classify existing data and predict results.
For smaller datasets and cases when an analyst does not wish to set aside a portion of the data for test purposes, CART automatically employs cross validation. This frequently occurs in medical research, but a shortage of training data can occur in the study of any rare event, such as specific types of fraud. In cross validation, ten different trees are typically grown, each built from a different ten percent of the total sample. When the results of the ten trees are put together, a highly reliable determination of the optimal tree size is obtained. For large datasets, CART automatically selects test data or uses pre-defined test records or test files to self-validate results.
Q11. What is a "multiple-tree, committee-of-expert method," or "bootstrap aggregation"?
A11. The use of multiple trees in a committee of experts is a relatively new technique, and one of CART's creators has developed a dramatically effective way of combining trees in CART. Prediction errors can be reduced as much as 50 percent by directing CART to draw 50 or more different random samples from the training data, grow a different tree on each sample, and then allow the different trees to "vote" on the best classification. When appropriate, combining trees can yield a substantial performance edge over any other data mining procedure. For more information, see Committee of Experts.
Q12. When can CART be used to advantage as a standalone package?
A12. Most data-mining projects involve classification for gaining insight into existing data and turning that knowledge into a predictive model. Typical classification projects include sifting profitable from unprofitable; detecting fraudulent claims; identifying repeat buyers; profiling high-value customers who are likely to churn; and flagging high credit-risk applications. CART is a state-of-the-art classification tool that, as a standalone package, can investigate any classification task and provide a robust, accurate predictive model. The software tackles the core data-mining challenges by accommodating classification - for categorical variables, such as responder and non-responder - and regression - for continuous variables, such as sales revenue.
In addition to delivering accuracy, CART offers three distinct advantages over other data-mining tools. First, CART is easily accessible to beginning users, and it does not require a high level of technical expertise to operate. CART?s new, user-friendly GUI and reference manual guide users through a quick process, and the default settings perform so well that many highly experienced experts do not change them. Second, CART results are extremely easy to interpret; the tree-shaped flow chart easily identifies the most important predictors. Lastly, CART costs thousands of dollars less than a data-mining suite, and it comparably handles classification projects.
Q13. How can CART complement other data mining packages and/or suites?
A13. CART is an excellent pre-processing complement to data mining packages, such as SAS®. In the first stage of a data mining project, CART can extract the most important variables from a very large list of potential predictors. Focusing on the top variables from the CART model can significantly speed up neural networks and other data mining techniques. For neural nets in particular, CART bypasses "noise" and irrelevant variables, quickly and effectively selecting the best variables for input. The result is significant reductions in neural-net training speeds and more accurate and robust neural networks. In addition, the CART outputs, or "predicted values," can be used as inputs to the neural net.
CART can also be used to:
- establish performance benchmarks;
- detect important interactions that should be included in statistical models; and
- impute values for variables with missing values.
Q14. How quickly can CART generate results?
A14. CART's efficient algorithm generates results much faster than other methods, such as neural nets. On industry-standard servers, CART models based on 300,000 records and 1,000 variables can be generated in less than an hour. More typical problems involving 100,000 records and 450 variables run in approximately 10 minutes, while 100 variables and one million records can be run in less than 30 minutes. Exploratory analyses based on extracts from a large database can be conducted even faster; for example, a sample of 30,000 records with 100 selected input variables can be explored in less than five minutes.
Q15. COMBINE (Bagging, ARCing) Command
A15. The COMBINE command allows one to choose from several ways of combining separate CART trees into a single predictive engine. The trees are combined by either averaging their outputs for regression or by using an unweighted plurality voting scheme for classification. The current version of CART offers two combination methods: Bootstrap aggregation and ARCing. Each generates a set of trees by resampling (with replacement) from the original training data.
When training data is resampled with replacement the effect is to create a new version of the data that is a slightly "perturbed" version of the original. Some original training cases are excluded from the new training sample, and other cases are included multiple times. Typically, 37% of the original cases are not included at all in the resample; the sample is brought up to full size by including other cases more than once. A handful of cases will be replicated 4, 5, 6, or even 7 times, although the most common replication counts are 1 and 2. The effect of this resampling is to randomly alter the weights that cases will have in any analysis, thus shifting slightly the results obtained from tree growing or any other type of statistical analysis.
Bootstrap resampling was originally developed to help analysts determine how much their results might have changed if another random sample had been used instead, or how different results might be when a model is applied to new data. The theory of the bootstrap was developed by Stanford's Brad Efron in 1979 and has been studied extensively since then. For data mining applications, Leo Breiman applied the bootstrap in a novel way: the bootstrap is used to generate many versions of the data set or replications, a separate analysis is conducted for each replication, and then the results are averaged. If the separate analyses differ considerably from each other (suggesting tree instability), the averaging will stabilize the results and yield much more accurate predictions. If the separate analyses are very similar to each other, the trees exhibit stability and the averaging will not harm or improve the predictions. Thus, the more unstable the trees, the greater the benefits of averaging.
Note that COMBINE generates a committee of experts" rather than a single optimal tree. Because a single tree is thus not displayed, there is no simple way to explain the underlying rationale driving the COMBINE predictions. In this sense, combined trees are somewhat akin to the black box of a neural net, although the trees are built much faster.
Two different resampling technologies are currently available within CART:
- ootstrap Aggregation (or bagging) in which each new resample is drawn in the identical way, and
- ARCing (Adaptive Resampling and Combining) or ADAPTIVE resampling in which the way a sample is drawn for the next tree depends on the performance of prior trees.
Freund and Schapire (1996) first introduced ARCing (a.k.a. boosting); Breiman (1996) introduced bagging and demonstrated that it performs as well or better than boosting. In general, we recommend bagging rather than ARCing as bagging is more robust with dependent variable errors and is also much faster. Nevertheless, ARCing is capable of yielding some remarkable reduction in predictive error.
One final caution on combining via bagging or ARCing: the increase in accuracy is often accomplished for the class one have least interest in. For example, in a binary response model in which response is relatively rare, bagging and ARCing may improve the non-response classification accuracy while slightly reducing the response classification accuracy relative to a standard CART tree. You will probably need to adjust priors to induce the most useful improvements.
Using COMBINE
To invoke bagging, one simply issue the combine command as in
COMBINE
and to invoke ARCing enter
COMBINE ARC=YES, POWER=4
POWER sets the weight the resampling puts on selecting cases that have been previously misclassified. The higher the power, the greater the bias against selecting cases that were previously classified correctly. Breiman has found that POWER=4 works quite well; POWER=1 or 2 gives results virtually identical to bagging.
Setting POWER to numbers higher than 4 could make it difficult to locate a sample large enough to fill the training sample if only a small fraction of the data are misclassified. Also, as Dietterich (1998) has reported, if the dependent variable is recorded in error then using ARCing will progressively focus new trees on the bad data and yield poor predictive models.
After selecting bagging or arching, the next step is to select the number of trees to be grown. Bagging typically shows good results once 25 trees have been grown, but ARCing may require 100 to 250 trees. The number of trees is set with the CYCLES option, as in
COMBINE ARC=no, CYCLES=25
Experiment with a modest number to see how the procedure is working, and if it looks promising launch a CART run with a full complement of 50 or more trees.
When growing a single tree pruning is not just optional, it is vital to obtaining a reliable tree. By definition, a CART tree is first overgrown (i.e., overfit) and then the overfit portions are pruned away with the help of a test set. When combining trees, Breiman has shown that the trees need not be pruned! Whatever overfitting there might be is averaged away when the combining takes place. You might wish to verify this for yourself.
The options controlling pruning are
[EXPLORE | TEST=SAMPLE | LEARN ]
where EXPLORE means do not prune, TEST=SAMPLE means draw a new sample to use for testing and pruning, and LEARN means use the entire training data sample (i.e., the sample from which the resamples are drawn) for testing.
Note that when a bootstrap or ARCed resample is drawn via sampling with replacement, only about 63% of the data will be selected into the sample and 37% will be excluded. The sampling with replacement will result in some cases being drawn into the sample multiple times and it this perturbation of the data that allows the combining method to work. The portion of the data not drawn into the sample in a replication is known as the "out-of-bag" data and can legitimately be used for a variety of testing purposes. When the entire LEARN data is used to prune a tree, a mixture of "in-bag" and "out-of-bag" data is used, and the mixture is in fact optimal for pruning.
Although you may prune the trees, the recommended settings are
COMBINE EXPLORE
coupled with a judicious setting of the complexity parameter so that trees are not grown out too large on huge databases.
Finally, it is highly advisable to specify a SETASIDE data set. This data is a genuine holdout sample and is used to compare the performance of a standard CART tree, the combined set of trees, as well as any other classification or regression scheme you care to consider. The SETASIDE data can be selected randomly from the input file, be identified via an indicator variable, or reside in a separate file. The options are
- SETASIDE=PROP=p; where p is a number between 0 and 1,
- SETASIDE=SEPVAR=variable; where variable is equal to 1 for setaside status and 0 otherwise, or
- SETASIDE=FILE=filename; where filename is a separate dataset accessible by CART.
A sample command for bagging might then be:
COMBINE EXPLORE, CYCLES=50, SETASIDE=FILE="HOLDOUT.SYS"
while for ARCing a command might be:
COMBINE EXPLORE CYCLES=50, ARC=YES, POWER=4, SETASIDE=FILE="HOLDOUT.SYS"
There are other options for the COMBINE command which can be used to save individual tree files, save individual samples, produce detailed output on every tree grown, and to control details of the ARC resampling process (see Appendix below).
ARCing Fine Points Combining methods require the growth of multiple trees all grown under slightly different conditions. Differences can occur in the control settings for the tree, such as when priors are systematically varied, or in the data being used for tree training. In bagging, the data is perturbed by resampling with replacement. The replacement induces a different set of weights on the training data. Bootstrap resampling always leaves some of the data out, reducing its weight to zero, and incorporating other data multiple times, increasing the weight of data included in the resample. Bootstrap resampling can always be conducted without fail. The same cannot be said for an ARC resample, however.
In ARCing, the probability with which a case is selected for the next training set is not constant and is not equal for all cases in the original learn data set; instead, the probability of selection increases with the frequency with which a case has been misclassified in previous trees. Cases that are very difficult to classify receive an increasing probability of selection while other cases that are classified correctly receive declining weights from trial to trial. As the probability of selection becomes more skewed in favor of the difficult-to-classify cases, the probability for selection for the typical case quickly declines to zero and the process of sample building takes an increasingly long time.
One of the ARCing options is a setting controlling how hard the ARCer should try to build a sample. In many runs the ARC process of resampling will simply bog down and the ARCer will automatically reset the probabilities to their equal starting values and continue generating additional trees.
Q16. No Tree
A16. The most frustrating outcome is to learn that CART has apparently not grown a tree. In this circumstance you will obtain a Tree Sequence Summary, but no node detail and no analytical results beyond a variable importance ranking.
SOLUTION 1: PICK a Tree
While you are still in BUILD, issue the PICK command to generate a tree from the tree sequence. Try something like:
PICK NODES = 4
which produces an exploratory tree including all the standard output on node detail, learning sample cost statistics, and so on. Be aware, however, that CART has determined that no tree should be generated, and therefore that the predictive performance of your tree is likely to be poor.
If you have already left BUILD and are back at the CART phase of analysis, grow an exploratory tree with:
ERROR EXPLORE
setting the complexity to a value that limits the tree to a size you wish to examine. Find the complexity value you need from the previous tree sequence output.
SOLUTION 2: Change Priors
CART can experience problems when some levels of a target variable have very few cases and you have specified PRIORS DATA (or PRIORS LEARN or PRIORS TEST). When the costs of misclassification are strictly proportional to the relative frequency of a class in the learning sample, CART may find that it is more accurate not to grow a tree. For example, in a binary problem with 95% class 1 and 5% class 2 cases, classifying all cases as class 1 from the start may yield the lowest costs on test data. To deal with this situation, try PRIORS EQUAL or PRIORS MIX, even if the classes have not been over- or under-sampled. The new priors will alter the penalty for misclassifying the rarer cases and may yield a tree.
SOLUTION 3: Change Costs
This solution is similar to changing priors. By changing costs you penalize CART for failing to split the root node, which might be enough to generate a test-validated tree.
SOLUTION 4: Change the Splitting Rule
For classification trees, Gini is the default node-splitting method; for regression trees, it is LS. You can sometimes induce a reluctant CART to generate test-validated trees by trying another method, for example, twoing or ordered twoing. One method that often proves useful is to adjust CART towards equal sized child nodes with the POWER option. Thus,
METHOD TWOING, POWER = 1
sets the center-cutting exponent to one, decreasing the chance of extremely unbalanced child nodes (end cuts).
Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.
Q17. Insufficient Memory
A17. CART is an extremely memory-intensive program, storing all data used for tree growing in RAM. This means certain large problems may not fit into available memory.
SOLUTION 1: Use A Larger Version of CART
Switch to a larger-memory version of Salford Systems CART. See the platform and operating specific documentation for the limits of the version you are using now, or call Salford Systems for further information.
SOLUTION 2: Adjust Parameters
CART memory requirements depend on several factors, including the number of levels of your dependent variable, the number of independent variables in the search list, the number of categorical predictors, how deep the tree is allowed to grow, and the size of the learning sample. You can experiment with altering these problem dimensions by using the LIMIT, MEMORY and ADJUST commands. MEMORY and ADJUST display your current memory requirements and show you parameter values (if any) that allow your problem to be accommodated.
One extremely useful parameter to adjust is BOPTIONS SUBSAMPLE, which directs CART to search for optimal splits near the top of the tree on a subsample of data, but uses the full data set to populate child nodes. This device saves work space in the nodes with the largest number of cases without reducing the size of the learning sample, while keeping the number of cases in the smaller nodes at their original number.
Other parameters that can be adjusted to save memory include the linear combinations option and the number of categorical predictors. Small amounts of memory can be saved by reducing the number of surrogates tracked and the number of competitors printed.
SOLUTION 3: Limit The Search List
CART will search over all variables in the learning sample looking for the best splits. The search list can be limited by listing explicit split candidates on the MODEL or KEEP statements or by eliminating variables on the EXCLUDE command. If you have an idea as to which variables are unlikely to be useful predictors, you can eliminate them from your first runs.
Alternatively, you can run several models, each with a different set of model variables, eliminating the most unimportant variables from each. The elimination process can continue in this way until you can pool all the remaining variables in a single model. Note, however, that this runs the risk of missing some important interactions.
SOLUTION 4: Subdivide Data
Break the problem into smaller pieces. Start by generating an exploratory tree of DEPTH = 1 to find the optimal split at the root node. Such a small tree should allow you to accommodate a rather large data set. Then, using the optimal or near optimal split, divide the data set into two subsets and run separate CART runs on each subsample. If one split still leaves a problem that is too large, repeat the process a second time to yield four subsets.
Once you have determined the truly important variables in this way, go back to the original data, limit the search list to just those variables, and run the model again for all cases.
Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.
Q18. High Costs or All Relative Costs Exceed One
A18. This is a variant of the NO TREE problem. When using a test sample or running a cross validation, all trees in the tree sequence have a relative error that exceeds one. Thus, no split exists that can improve the predictive performance of the tree.
SOLUTION: Vary Priors and Methods
For this problem the solutions provided for the NO TREE situation still hold. Experimentation with PRIORS and alternative splitting rules is your only recourse.
Q19. Cross Validation Breakdown
A19. This is a variant of the NO TREE problem. When using a test sample or running a cross validation, all trees in the tree sequence have a relative error that exceeds one. Thus, no split exists that can improve the predictive performance of the tree.
V-fold cross validation works by partitioning your data into equal-sized segments and holding out one segment at a time for test purposes. If certain classes of the target variable have very small sample sizes it may not be possible to subdivide each class into v subsets. Since CART requires at least one case in each class to run, the cross-validation trees cannot be constructed.
SOLUTION 1: Fewer Cross Validations
By reducing the number of cross validations you may be able to generate the test runs. However, Breiman, Friedman, Olshen and Stone warn that using fewer than 10 cross validations can seriously overestimate the error rate of any tree.
SOLUTION 2: Dedicated Test Samples
Define your own test subsamples with a random number generator in a data step. The best way is to define a new variable, say TEST1, set to 1 for test cases, and 0 otherwise. By adjusting the random number seed you can repeat the process until the desired balance of cases within each class is obtained. The process can be repeated with separate random test samples identified with variables TEST2, TEST3, etc. Running half a dozen CART trees, each with a different test set, will let you know if your results are stable.
Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.
Q20. Costs are Too High
A20. Even the best tree incurs unacceptably high error rates.
SOLUTION 1: Specify Costs
If some misclassifications are worse than others, you might be able to improve CART performance by specifying variable misclassification costs. While this method cannot improve the overall unweighted error rate of the tree, it can improve performance for the most important classes.
SOLUTION 2: Obtain More Data
The tools within CART are designed to give you the most reliable assessment of the predictive accuracy of your tree. If the error rate of the optimal tree is too high, CART is telling you that you probably cannot honestly get what you want. An objective assessment of your data reveals poor predictive power for virtually any model. The solution is to obtain more data on new variables.
SOLUTION 3: Create More Variables
Creating new variables such as sums, differences and ratios of your raw variables could help. You can use Salford Systems BASIC to create these new variables from within CART.
SOLUTION 4: Try Linear Combinations
Search for linear combinations with a high deletion threshold, as in:
LINEAR N=number, DELETE=.4
Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.
Q21. Cannot Apply Tree to New Data
A21.You wish to apply your results to new data, but CASE will not accept the data.
SOLUTION 1: Check For TR1 Tree File
Make sure the .TR1 tree file, in which CART stores the tree structure, variable list and category specifications, exists.
SOLUTION 2: Check Variable List on USE File
Every variable on the .TR1 tree file, except the target variable, must also exist on the USE file. The variables need not all have non-missing values but they must exist. If the tree was originally grown with many extraneous variables of low or zero importance, you may wish to re-build the tree with a smaller MODEL list.
Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.
Q22. Too Many Levels in a Categorical Predictor
A22. CART will only search over all possible subsets of a categorical predictor for a limited number of levels. Beyond a threshold set by computational feasibility, CART will simply reject the problem. You can control this limit with the BOPTION NCLASSES = m command, but be aware that for m larger than 15, computation times increase dramatically.
SOLUTION: Convert The Variable Into Dummies
The ideal solution is to work with a supercomputer implementation of Salford Systems CART, since this will provide the optimal tree. Other alternatives are compromises that might not yield satisfactory results. One such compromise is to break the categorical variable into a vector of dummies. For example, a 50-level occupation variable could be coded into 50 separate indicators.
Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.
Q23. Tree is Too Large or Too Complex
A23. CART determines that the optimal tree has a very large number of nodes and therefore is too complex for practical use or easy intelligibility.
SOLUTION 1: Pick a Smaller Tree
CART is designed to identify a set of candidate predictive trees complete with honest estimates of costs and standard errors. There is no reason not to decide to accept a higher error rate in exchange for a simpler tree, so long as you remain aware of the costs of the simplification.
SOLUTION 2: Parametric Model
Consider fitting a parametric model using CART to select variables and possible interactions. We have often used CART regression trees to partition a sample into subsets on which separate linear models are fit. CART can assist in specifying a switching regression.
Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.
Q24. Limitations on the Learn Sample Size When Using Cross Validation
A24. By default CART will not allow Cross Validation (CV) for any dataset that has more than 3000 observations. The n-fold cross-validation technique is designed to get the most out of datasets that are too small to accommodate a hold-out or test sample. Once you have 3,000 records or more, we recommend that a separate test set be used.
For large datasets, it is recommended that a separate error set be used, either by manually splitting the dataset into learn and test samples (ERROR TEST or ERROR SEPVAR) or by using a randomly-selected test set (ERROR PROPORTION).
However, you can persist in using CV with the command:
BOPTIONS CVLEARN = n
The default value for n is 3000 but it can be reset to a larger value. For example, Iif you have 50,000 observations and want to use the entire dataset in a cross-validation run, issue the command:
BOPTIONS CVLEARN = 50000
Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.
Q25. Interactive Splitting
A25. CART's tree-building process can be interactively controlled by using the INTERACTIVE option on the ERROR command:
ERROR EXPLORATORY / INTERACTIVE
When building is started CART will produce and display information on the first node, including lists of Competitor and Surrogate splits. You will be prompted with:
>
The commands that can be used during interactive splitting are:
- REPLACE COMPETITOR=n: Choose to split on competitor n from the list.
- REPLACE SURROGATE=n: Choose to split on surrogate n from the list.
- REPLACE LINEAR=n: Choose to split on linear combination n from the list.
- REPLACE SPLIT var=n: Choose to split on a specific variable var at split value n.
- REPLACE SPLIT var=n1,n2,n3,... : Choose to split on a specific categorical variable var with split levels n1, n2, n3,...
- NEXT: move on to the next node.
- CONTINUE: stop interactive splitting and let the rest of the tree grow automatically.
- QUIT: Totally stop altogether.
- FRESH: Redo the current node unconstrained, as it was when you first saw it.
- ABOVE DEPTH = n: Allows interactive splitting only above a given depth.
- RESAMPLE: If the node is subsampled, this will choose a new subsample and generate new splits.
Several important caveats about interactive splitting:
- You must be in command mode to use this feature. Future versions of CART will enable this feature via the GUI.
- Because the interactively split tree is an exploratory tree, it will not be pruned back. To avoid growing a tree too large, be sure to limit the size of the tree by setting the complexity, depth, or number of nodes prior to building the tree.
- If you only want to interactively split the top three nodes, use ABOVE DEPTH=2 to avoid having to interactively split other nodes on the left side of the tree before returning to the second node on the right side of the tree.
Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.
Q26. Internal Error Messages
A26. Rarely, computational instability or a precision problem will result in an message stating that an "internal error" has been encountered. To help Salford Systems resolve the problem, please attempt the following:
- Rebuild as an exploratory tree to see if CART completes the building process.
- Rebuild the tree with a randomly selected subset for testing (ERROR p=.2, for example).
- Rebuild a cross-validation tree with an explicit SEED to perturb the selection of cross-validation subsets.
Please send Salford Systems via email (or diskette):
- command files or commands used prior to the build;
- output generated during the run;
- frequency table of your target variable;
- frequency table of your categorical predictor variables; and
- if possible, your data set, or any subset that can reproduce the problem.
Q27. CART and Large Datasets
A27. CART is capable of determining the number of records in your data sets, and uses this information to predict the memory and workspace requirements for trees that you build. Also, CART will read your entire data set each time a tree is built. At times these actions may be problematic, especially if you have enormous data sets.
If you only wish to use the first N records, perhaps due to memory limitations or because you wish for faster turnaround during early exploratory analysis, you can direct CART to treat your data sets as if they have fewer records than actually exist in the data. (Another option is to contact Salford Systems regarding a memory compile upgrade so that CART can accomodate all of your data; CART can be compiled to utilize up to 32 gigabytes of RAM. For further info on problems sizes and scalability, see CART Technical Overview - Scalability).
There are two options on the LIMIT command to consider:
LIMIT DATASET = N, ERRORSET = N
These options tell CART to act as if your main data set (and error/test data set if you have one) have fewer observations than they actually do. For instance, if your data set has 500,000 observations but you wish to only use the first 25,000, issue the command:
LIMIT DATASET = 25000
Similarly, if you have an enormous separate test set and wish to only use 75,000 records of it, issue the command:
LIMIT ERRORSET = 75000
CART will now treat these data sets as if they were only 25000 and 75000 records in length. Any other records will be totally ignored.
Steinberg, Dan and Phillip Colla. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.
Q28. Why does the tree change when non-splitting variables are dropped?
A28. If a variable does not enter the tree as a primary node splitter, it may still play a important role in the tree as a surrogate splitter. If you have turned the displaying of surrogate splitters off, you will not see how these variables affect the tree but they will still be used internally by CART when applying the tree to data. The Variable Importance Table produced by CART ranks the variables in the tree by their importance, a statistic measuring how strongly a variable acts as a primary or surrogate splitter.
Suppose a variable enters the tree as the top surrogate splitter in many nodes, but never as the primary splitter. If this variable is removed from the list of potential predictor variables and the tree is rebuilt, it will probably be a very different tree - certainly if there are missing values in the data for the primary node-splitting variables.
Steinberg, Dan and Colla, Phillip. CART--Classification and Regression Trees. San Diego, CA: Salford Systems, 1997.
Another possibility is due to the way CART grows trees. Normally, CART first grows a maximal tree and then tests it either through cross validation or a separate test sample. If a split does not hold up to testing, it is removed from the model. Thus, if a model splits one or more times on a particular variable, but none these splits hold up to testing, the variable will not appear as a primary splitter in the final model. However, if the variable is dropped, the splits involving that variable in the maximal tree might be replaced by others, which might appear in the final tree.
Q29. Improvement Penalties
A29.
Q: I wish to define penalties which CART should use to make it harder for a predictor to become the primary splitter in a node. How do I do this? A: CART supports three "improvement penalties". The "natural" improvement for a splitter is alway computed according to the CART methodology. A penalty may be imposed, however, that causes the improvement to be lessened depending, affecting the penalized splitter´s relative ranking among competitor splits. If the penalty is enough to cause the top competitor to be replaced by a competitor, the tree is changed. Variable-Specific Penalty This penalizes a given predictor (perhaps because it is expensive to collect and we do not want it serving as a splitter unless it is a really powerful predictor). If the user-defined variable-specific penalty is in the range [0,1] inclusive, then the natural improvement is adjusted as: improv-adj = improve * (1 - variable_specific_penalty) If the user-specified penalty falls outside of [0,1] then no penalty is imposed. Missing-value Penalty This penalizes the improvement of a competitor based on the proportion of missing values for the competitor in the node in question. This makes it difficult, but not impossible, for a competitor with many missing values in a node to rise to the top of the competitor list and assume the role of primary splitter. If there are missing values, the improvement is adjusted as: improve-adj = improve * SW1 * [ (Ngood/N} ^ SW2 ] in which SW1 and SW2 are controlled in the PENALTY command. N is the size of the node, Ngood is the number of records with nonmissing values for the variable in question. If there are no missing values (NGOOD=N), no adjustment is made. High level Categorical Penalty This penalizes a categorical variable that has many levels relative to the size (unweighted N) of the node in question. For a categorical variable: ratio = log_base_2(N) / (Nlevels - 1) in which NLevels is the number of levels for the categorical predictor and N is the number of learn sample records in the node. improve-adj = improve * [ 1 - SW3 + SW3 * (ratio ^ SW4) ] in which SW3 and SW4 are controlled on the PENALTY command. Note that all three penalties can be in effect, in which case they all serve to decrease the "freely computed" improvement, resulting in a "adjusted" improvement which is what appears in the competitor table and which is used to rank the competitors. These penalties are first used in adjusting the improvements evaluated for the competitors in a node. When generating surrogates, the penalties will affect the improvements computed for the surrogates in the same way -- unless PENALTY SURROGATES=NO is specified in which case improvements are not adjusted for surrogates even if missing values or high level categoricals are involved. Note that the associations for surrogates are not penalized, so these penalties will not change the ordering of surrogates for a given primary splitter. They will only affect the improvement listed for a surrogate.
Q30. Variable importance
A30. CART automatically produces a predictor ranking (variable importance) based on the contribution predictors make to the construction of the tree. Predictor rankings are strictly relative to a specific tree; change the tree and you might get very difffernt rankings. Importance is determined by playing a role in the tree, either as a main splitter or as a surrogate. CART users have the option of fine tuning the variable importance algorithm.
Variable importance, for a particular predictor, is the sum across all nodes in the tree of the improvement scores that the predictor has when it acts as a primary or surrogate (but not competitor) splitter. Specifically, for node i, if the predictor appears as the primary splitter, then it has a contribution toward the importance as:
importance_contribution_node_i = improvement
If instead, the predictor appears as the n'th surrogate instead of the primary predictor, the expression is:
importance_contribution_node_i = (p ^ n) * improvement
in which p is the "surrogate improvement weight": a user controlled parameter which is equal to 1.0 by default and can be set anywhere between 0 and 1. Thus, you are able to specify that surrogate splits contribute less towards a predictor's improvement than do primary splits. This parameter is controlled with the BOPTIONS IMPORTANCE option.
Linear combination splits do not contribute in any way to variable improvement.
If, in the absence of linear combinations, the improvement weight is greater than 0, and the variable has importance = 0.0, it does not appear in the tree as a primary or surrogate splitter, although it may appear as a competitor.
Q31. Nominal (Ordered) Predictors and Rank
A31. One of the strengths of CART is that, for ordered predictors, the only information CART uses are the rank orders of the data - not the actual value of the data. In other words, if you replace a predictor with its rank order, the CART tree will be unchanged.
This means that CART splits cannot be affected by any transform of the data which preserves order (monotone transform). For instance, AGE, log(AGE), AGE^2, etc., all would yield the same split. If you have a nominal variable with values 1,2,3,4,5: so long as the value "5" properly represents the highest value in the data and "1" represents the lowest value, and so forth, then monotone transforms of the data -- transforms that preserve rank ordering -- will not alter how the predictor acts in the tree.
Of course if a variable is categorical (discrete, unordered, nominal) then the values are just arbitrary labels. Simply indicating to CART that a predictor is variable is sufficient for proper handling.
Q32. Cross Validation
A32. Cross-validation is a method for estimating what the error rate of a sub-tree (of the maximal tree) would be if you had test data. Regardless of what V you set for V-fold cross validation, CART grows the same maximal tree. The monograph provides evidence that using a V of 10-20 gives better results than using a smaller number, but each number could result in a slightly different error estimate. The optimal tree -- which is derived from the maximal tree by pruning -- could differ from one V to another because each cross-validation run will come up with slightly different estimates of the error rates of sub-trees and thus might differ in which tree was actually best. Normally a test sample is used to prune the maximal tree down to an "optimal" tree. This is especially recommended for large data sets, from which a test sample can be withdrawn. However, there are times when the size of the data set makes withdrawing a test sample difficult. In the absence of a test sample and without using cross validation, no pruning is done -- this is called EXPLORATORY -- and the maximal tree is the result. Note that the maximal tree in an exploratory run is identical to the maximal tree when using a test sample, provided that the learn sample is the same for each run. When you are unwilling to use a test sample but still desire estimates of the error rates of each tree in the sequence, cross validation may be used. In a nutshell, cross validation establishes how much to prune the maximal tree by building a series of "ancillary cross validation trees" from which error rates of the maximal tree and its subtrees can be estimated. Cross validation does not affect the growth of the maximal tree at all -- it is conducted after the maximal tree is grown. The V ancillary cross validation trees may are similar to the maximal tree, but not necessarily. Here is how it works:
- The maximal tree is grown and saved. Note that we do not have any "independent" estimate of the error rates for each node in the maximal tree, since we do not have a test sample. Based on node complexities of the maximal tree a pruning sequence is defined, although the error rate for each tree in the sequence is not yet known. In other words, we know which nodes to prune off the tree and in what order, and we have a series of subtrees defined by the pruning sequence, but we do not know how far to prune.
- V ancillary cross validation trees are grown, each on a partition of the learn sample. For instance, if 10 cross validation trees are grown, each uses 90% of the learn sample for tree growth and the remaining 10% as a pseudo test sample with which to estimate error rates for the nodes in the cross validation tree.
- Error rates from each of the V cross validation trees are combined and mapped to the nodes in the original maximal tree. The V cross validation trees are then discarded.
Now that estimates of the error/cost for each node in the maximal tree are known, we are in a position to prune the maximal tree and declare an optimal tree. Q: We typically use the default of 10-fold cross validation in CART. However, when we change to, say, 20-fold cross validation, CART indicates a different optimal tree. Why? A: In both cases the maximal tree is the same. 20-fold cross validation will partition the learning sample into 20 subsets and will generate 20 ancillary cross validation trees. These trees, each with their own error rates, will be combined to yield estimated error rates for the maximal tree. Since we are combining 20 trees rather than 10, it is almost certain that the 20-fold combined error rates estimated for the maximal tree will differ from those estimated by combining 10-fold cross validation trees. Although the pruning sequence is the same in both runs, a different tree may be chosen as optimal between the two runs due to the differing error rate estimates. In other words, the maximal tree and pruning sequence is the same, but the 10- and 20-fold cross validation procedures will result in a different amount of pruning. Q: In the tree sequence and on the "select tree" dialog we see "cross-validated relative cost" (with confidence intervals) and "resubstitution relative cost", for each tree in the tree sequence, e.g.:
|
Terminal Tree Nodes |
Cross Validation Relative Cost |
Resubstitution Relative Cost |
Complexity Parameter |
| 1 |
15 |
0.7457930 +/- 0.0142744 |
0.6738151 |
0.0019035 |
| 2 |
10 |
0.7506419 +/- 0.0135887 |
0.6981514 |
0.0024436 |
| 3 |
9 |
0.7533725 +/- 0.0136544 |
0.7033467 |
0.0026077 |
| 4 |
7 |
0.7476655 +/- 0.0137743 |
0.7145392 |
0.0028081 |
| 5** |
6 |
0.7439012 +/- 0.0135847 |
0.7221265 |
0.0038037 |
| 6 |
3 |
0.7605784 +/- 0.0142045 |
0.7499018 |
0.0046392 |
| 7 |
1 |
1.0000000 +/- 0.0000896 |
1.0000000 |
0.0625345 |
A: Cross validated relative cost is the error rate of the tree, relative to the root node, using the cross validation method. If you had used a test sample instead of cross validation, you would have been presented with "test sample relative cost". The resubstitution relative cost depicts the error rate that would be estimated had you used a copy of the learn sample as your test sample. Note that this rate always decreases as the tree gets larger -- this is a property of using the same data to estimate errors that was used to build the tree in the first place. The +/- number is a measure of the uncertainty around the actual (cross validation or test sample) error rate of the tree in question when confronted with new data. The cross validation error rate is derived from one cross-validation procedure, whereas a test sample error rate is derived from a one test sample. Either way, if you ran another cross validation procedure or used a different test sample you would likely see another (slightly) different error rate. the +/- gives an idea of how uncertain the error rate estimate is.
Q33. The Systat Dataset Format
A33. CART and MARS continue to read data stored in the legacy SYSTAT format, a binary (i.e., not human-readable) format widely used by statisticians and researchers using the SYSTAT statistical programs. Relative to comma-separated-text and some other binary formats, the legacy SYSTAT format is quite restrictive (limited variable name lengths, limited lengths of character data). We do not recommend that you use it. However... For our clients that do need to work with this format, we provide the following C and Fortran programs that illustrate how legacy SYSTAT datasets are structured. Originally, legacy SYSTAT format was written and read with Fortran code. Thus, the format must accomodate the record segmentation and padding typical of Fortran I/O -- so the C version handles these issues explicitly. Click here to obtain LFR.F -- the Fortran 77 legacy SYSTAT file reader source. LFR is written entirely in Fortran 77. Record padding is handled implicitly by the Fortran I/O system. The user is prompted for an input legacy SYSTAT set. The data are echoed to the console in comma-delimited as confirmation. Click here to obtain CLFR.C -- the C legacy SYSTAT file reader source. CLFR is written entirely in C. Record padding is handled explicitly to match the Fortran conventions. You must give the input file name as the first command line argument. If you give an output file name as the second command line argument, data will be written to that file in comma-delimited for as confirmation, otherwise data are written to stdout. Both are portable in the sense they accomodate both Unix and Windows compilation platforms.
|