Algorithm Overview - CART/CHAID

Algorithm Description 

CART and CHAID are both decision tree machine learning algorithms. Their objective is to find splits (segments) of the dataset that differentiate the records well with respect to the target variable. These segments are created by iteratively splitting the dataset on key values of the most important predictor variables. Decision tree algorithms differ mainly in how they determine the most important predictors and the key values on which to split the dataset.

CART – Classification and Regression Trees 

CART is a binary decision tree algorithm that can be used for classification or regression modeling problems. Building the tree involves cycling through the input variables at each node to be split (the ‘root’ node first, then each ‘parent’ node) and choosing a split point on one of them. Each parent node is split into two child nodes, and this continues until the tree is complete. The ‘leaf nodes’ of the tree hold a value of the dependent variable (y), which serves as the prediction. To score a data point, it is passed down the tree from the root until it reaches a leaf. The split points are chosen by a greedy algorithm that minimizes a cost function at each node, and tree construction ends when a predefined stopping criterion is met.
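The greedy split search described above can be illustrated with a small sketch. It assumes a classification problem, a single numeric predictor, and Gini impurity as the cost function. The function names and the sample data are hypothetical and are not taken from Lityx IQ.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, cost) for the best binary split 'value <= threshold'.

    Candidate thresholds are midpoints between consecutive distinct values.
    The cost is the size-weighted Gini impurity of the two child nodes.
    Returns (None, parent impurity) if no split improves on the parent.
    """
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    best_threshold, best_cost = None, gini(labels)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        cost = (len(left) * gini(left) + len(right) * gini(right)) / n
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# Hypothetical data: customer age and whether the customer purchased.
ages = [22, 25, 31, 38, 44, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, cost = best_split(ages, bought)
print(threshold, cost)
```

A full tree builder would repeat this search over every predictor at a node, keep the lowest-cost split, and then recurse into the two child nodes until a stopping criterion is met.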


CHAID – Chi-square Automated Interaction Detection 

CHAID is a decision tree algorithm that chooses its splits using statistical tests. As with any decision tree, the algorithm cycles through the predictors to determine the appropriate category splits. It uses either a Chi-square test (categorical response) or an F-test (continuous response) to find the splits that best “explain” the response variable. Each test is compared against a pre-specified significance level: if the p-value indicates no significant association between the split variable and the response (that is, they appear independent), the algorithm stops growing the tree at that node. Otherwise, the split is created and the search for the next best split continues.
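As a rough illustration of the test behind a single candidate split, the sketch below computes a Pearson Chi-square statistic for a 2x2 table of split group by categorical response and compares its p-value with a significance level. The counts, the 0.05 level, and the function names are assumptions for illustration only. The p-value formula used here is valid only for one degree of freedom, and the sketch omits any p-value adjustment that a real CHAID implementation may apply.

```python
import math

def chi_square_statistic(table):
    """Pearson Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the split groups and the response were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_1df(stat):
    """Upper-tail p-value of a Chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts. Rows: the two groups of a candidate split.
# Columns: response = "no", response = "yes".
table = [[30, 10],
         [15, 25]]
stat = chi_square_statistic(table)
p = p_value_1df(stat)

significance_level = 0.05  # assumed setting
make_split = p < significance_level
print(round(stat, 3), make_split)
```

A small p-value means the response distribution differs between the two groups more than chance alone would explain, so the split is kept. A large p-value is treated as independence and the node is left unsplit.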


Lityx IQ Parameters 



CART

Minimum Observations Needed to Split - The minimum number of observations a node must contain for it to be split further into sub-nodes.

Minimum Observations in a Child Node - The minimum number of observations allowed in a resulting child node of the potential split.

Maximum Tree Depth - The maximum number of levels allowed in the tree.

Splitting Criterion - The cost function used to determine variable split points. Gini is intended for  

Surrogate Splits - How missing values are handled by the tree. With surrogates, the tree saves information about secondary splits, which are used when data for the primary split is missing at a node.

Maximum No. of Model terms - The maximum number of terms used during the variable selection process. Larger values may have a longer processing time, but smaller values may miss important variables. 
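The three size and depth settings above act as stopping rules on every candidate split. A minimal sketch of that check follows. The function name, the default values, and the convention that the root node sits at depth 0 are assumptions for illustration, not Lityx IQ defaults.

```python
def split_allowed(n_node, n_left, n_right, depth,
                  min_obs_to_split=20, min_obs_in_child=7, max_depth=5):
    """Return True if a candidate split passes all three limits.

    n_node:  observations in the node being split
    n_left:  observations that would fall in the left child
    n_right: observations that would fall in the right child
    depth:   depth of the node being split (root assumed to be depth 0)
    """
    if n_node < min_obs_to_split:
        return False  # node is too small to be considered for splitting
    if min(n_left, n_right) < min_obs_in_child:
        return False  # split would create an undersized child node
    if depth >= max_depth:
        return False  # children would exceed the maximum tree depth
    return True

print(split_allowed(100, 60, 40, depth=2))  # passes all three checks
print(split_allowed(100, 95, 5, depth=1))   # fails the child-size check
```

Raising the two minimum-observation settings or lowering the maximum depth generally produces a smaller tree.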




CHAID

Minimum Observations Needed to Split - Same as for CART

Minimum Observations in a Child Node - Same as for CART 

Maximum Tree Depth - Same as for CART 

Maximum P-Value Allowed to Make Split - The largest p-value allowed to make a split at a node. If no predictors have a p-value smaller than this setting, no split is made at the node. The larger you set the value, the larger the tree may get.  

Maximum No. of Model terms - Same as for CART
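The p-value setting can be pictured as a filter on the candidate predictors at a node. The sketch below assumes a p-value has already been computed for each predictor. The predictor names, the p-values, and the 0.05 default are hypothetical.

```python
def choose_split(p_values, max_p_value=0.05):
    """Return the predictor with the smallest p-value, or None if no
    predictor has a p-value smaller than max_p_value."""
    if not p_values:
        return None
    best = min(p_values, key=p_values.get)
    return best if p_values[best] < max_p_value else None

# Hypothetical p-values for three predictors at one node.
node_p_values = {"age": 0.003, "region": 0.04, "income": 0.20}
print(choose_split(node_p_values))                   # smallest p-value is below the limit
print(choose_split({"income": 0.20}))                # no predictor qualifies
print(choose_split({"income": 0.20}, max_p_value=0.5))  # a looser limit allows the split
```

The last call shows why a larger setting can lead to a larger tree: splits that would have been rejected under a strict limit are allowed.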