Algorithm Overview - Random Forest

Algorithm Description 

The Random Forest algorithm is a supervised machine learning technique that combines many individual decision trees into an ensemble, or "forest". Random Forests are trained using a method called bagging, in which each decision tree is trained on a randomly sampled subset of the data; this helps reduce the variance of the model. Randomness is also applied to the feature space: only a random subset of features is considered at each split in each decision tree. Once the trees are trained, each data point is run through all of the trees and their predictions are aggregated. The aggregation method depends on the type of problem, classification or regression. For classification, each tree produces a class prediction, and the class with the most votes becomes the prediction for that data point; for regression, the predictions of the individual trees are averaged.
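
As a rough illustration of the ideas above, the sketch below uses scikit-learn's RandomForestClassifier (an assumption for illustration only; Lityx IQ's internal implementation is not documented here) to train a small forest and then reproduces the voting step by hand. Note that scikit-learn aggregates by averaging the trees' predicted class probabilities rather than taking a hard vote, so the two can differ on borderline points.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy binary classification data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each tree is fit on a bootstrap sample of the rows (bagging), and only a
# random subset of features is considered at each split (max_features).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X, y)

# Library-level prediction: the forest aggregates across all trees.
print(forest.predict(X[:5]))

# The same idea by hand: run the points through every tree and take a
# majority vote (for 0/1 labels, a mean of at least 0.5 means class 1).
# For regression, the per-tree predictions would simply be averaged.
votes = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
print((votes.mean(axis=0) >= 0.5).astype(int))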

Additional Links  

https://builtin.com/data-science/random-forest-algorithm 

https://dataaspirant.com/2017/05/22/random-forest-algorithm-machine-learing/ 

https://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics


Lityx IQ Parameters  


Maximum Number of Trees to Build - The number of trees to be built in the Random Forest. A larger number of trees generally leads to higher accuracy and helps prevent overfitting, at the cost of longer training time.

Minimum Sample Size in Node - The smallest sample size allowed at a single node in any tree in the forest. This parameter implicitly controls the depth of the trees: a larger value produces smaller trees, whereas a smaller value may lead to larger trees that capture more noise.

Maximum Number of Nodes - The largest number of nodes any tree in the forest is allowed to have. This prevents trees from growing too large and overfitting the dataset.

Maximum Number of Model Terms - The maximum number of terms used during the variable selection process. Larger values may increase processing time, while too small a value may cause important variables to be missed.
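
For orientation only, the sketch below maps the Lityx IQ parameters above onto approximate scikit-learn analogues (an assumption; the actual Lityx IQ implementation and parameter semantics may differ, and "Maximum Number of Model Terms" has no direct forest argument, so a separate variable-selection step stands in for it here).

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    # "Maximum Number of Model Terms": approximated by a separate
    # variable-selection step that keeps the top-scoring features.
    SelectKBest(score_func=f_classif, k=20),
    RandomForestClassifier(
        n_estimators=500,      # "Maximum Number of Trees to Build"
        min_samples_leaf=50,   # "Minimum Sample Size in Node" (implicitly limits depth)
        max_leaf_nodes=64,     # "Maximum Number of Nodes" (sklearn counts leaves, not all nodes)
        random_state=0,
    ),
)
# model.fit(X_train, y_train) would train the selection step and the forest together.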