Salford Company Blog
R-Squared for CART Regression Trees

CART users often ask where they can find the value of the R-Squared for their regression trees. The answer is simple. In conventional statistics, 

R-Squared  = 1 - SSE/SST,                        (1)

where SSE is the sum of squared errors of the actual data around the model predictions, and SST, the total sum of squares, is the sum of squared deviations of the dependent variable around its mean. In traditional statistics, R-Squared is always calculated using the training data (LEARN SET). CART users can read the R-Squared directly from the output:

R-Squared  =  1 - CART_Relative_Error    (2)

because                           
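As a rough illustration of (1) and (2), the Python sketch below computes R-Squared from SSE and SST for a handful of made-up values. The function names and the data are hypothetical. The sketch assumes that the CART relative error equals the ratio SSE/SST, which is what equations (1) and (2) together imply. It does not reproduce CART output.

```python
def relative_error(actual, predicted):
    """SSE / SST: squared error of the model relative to predicting the mean."""
    mean_y = sum(actual) / len(actual)
    # SSE: squared deviations of the actual values around the model predictions
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # SST: squared deviations of the actual values around their own mean
    sst = sum((y - mean_y) ** 2 for y in actual)
    return sse / sst


def r_squared(actual, predicted):
    """Equation (1): R-Squared = 1 - SSE/SST."""
    return 1.0 - relative_error(actual, predicted)


# Made-up LEARN data and the predictions of a hypothetical two-node tree
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

print("relative error:", round(relative_error(actual, predicted), 3))
print("R-Squared:     ", round(r_squared(actual, predicted), 3))
```

Here SSE is 1.0 and SST is 5.0, so the relative error is 0.2 and R-Squared is 0.8.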

Regression Tree Ensembles

Many have asked if RandomForests (RF) supports regression analysis.

The short answer is: not with the current implementation. Salford Systems plans to support RF regression in our next release.

That said, if you have been thinking about RF regression, we urge you to consider using TreeNet regression instead. Some reasons follow:

Using Your Own Cross-Validation Bins

Users of cross validation (CV) in CART, MARS, and TreeNet have become accustomed to simply requesting this testing method when setting up a predictive model and allowing the software to take care of the details. The Salford software prepares the data automatically and uses stratified sampling to randomly assign each record to a CV bin. In this default mode, the user has no influence or control over how the bins are managed.

There will be times, however, when it is advantageous to construct these CV bins yourself. For example, you may want to compare results across different software tools. By using the same CV bins in every modeling run, you can be sure that any differences in performance are due only to differences in modeling methods and not to the cross-validation process itself. Analysts with repeated observations on subjects will want to assign subjects rather than individual data records to CV bins, keeping all data belonging to a given subject together at all times. In data with a temporal dimension, it may be desirable to break the data into bins along the time dimension (for example, assigning records from each calendar month to a distinct bin).

Although assigning data records to CV bins needs to be done with care to ensure that the right kind of balancing of the data is maintained, the process is quite simple and mechanical. In preparing the data for analysis, you need to create a new column in which the bin assignment for every record in the training data is recorded. If you prefer to work with a numeric bin variable, we recommend that you use the integers from 1 through K, where K is the number of bins you want. If you prefer to work with a text or character variable to record bin assignments, then you are free to come up with any unique labels you like for the bins.
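The bin column described above can be built in any data preparation tool. As a minimal sketch in Python, the two hypothetical helpers below show one way to do it: `bins_by_subject` assigns integer bins 1 through K while keeping every record of a subject in the same bin, and `bins_by_month` produces a text label per calendar month. The function names, the sample data, and the round-robin assignment are assumptions made for illustration and are not part of the Salford software.

```python
import random
from datetime import date


def bins_by_subject(subject_ids, k, seed=0):
    """Return one bin number (1..k) per record; a subject never spans two bins."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)  # reproducible random subject order
    # Deal the shuffled subjects into bins 1..k in round-robin fashion
    subject_to_bin = {s: (i % k) + 1 for i, s in enumerate(subjects)}
    return [subject_to_bin[s] for s in subject_ids]


def bins_by_month(dates):
    """Return a text bin label such as '2020-01' for each record's date."""
    return [d.strftime("%Y-%m") for d in dates]


# Made-up data: six records belonging to four subjects
subject_ids = ["s1", "s1", "s2", "s3", "s3", "s4"]
cv_bin = bins_by_subject(subject_ids, k=2)

# Made-up data: three records dated in two calendar months
record_dates = [date(2020, 1, 5), date(2020, 1, 20), date(2020, 2, 3)]
month_bin = bins_by_month(record_dates)
```

The round-robin step balances only the number of subjects per bin. It does not stratify on the target variable, so any further balancing the analysis needs would have to be added separately.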

Once the bin variable is created, you can use the TEST tab of the MODEL setup dialog in the GUI to tell CART, MARS, or TreeNet that you intend to use your own CV bin variable. In the screen capture below you can see that we have selected “Variable determines cross-validation bins” as our test method.

[Screen capture: yourcvbins]

Alternatively, you can issue a command of the form


Dan Steinberg, CEO of Salford Systems
