Algorithm Overview - Linear and Logistic Regression

Algorithm Description 

Linear and Logistic Regression are supervised machine learning techniques that model the relationship between a dependent variable (target) and one or more independent variables (predictors). A Linear Regression model predicts a continuous target, while a Logistic Regression model predicts a binary target (e.g. 1/0, True/False, Yes/No). Both techniques can use continuous or discrete predictors.

 

Linear Regression 

Linear Regression establishes the relationship between the target and predictors using a best fit straight line (also known as a ‘regression line’), represented by the equation Y = mx + b + ε, where m is the slope, x is the predictor, b is the intercept, and ε is the error term. This equation can be used to predict the value of the target variable Y based on given predictors. The goal of the model fitting process is to make the errors as small as possible overall, typically by minimizing the sum of squared errors.
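As an illustration of fitting the regression line Y = mx + b, the following is a minimal Python sketch of ordinary least squares for a single predictor. It is not LityxIQ's implementation; the function name fit_line and the sample data are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (m, b) for Y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Hypothetical sample data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

m, b = fit_line(xs, ys)

# Residuals are the per-record errors; least squares minimizes their squared sum.
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
sse = sum(r ** 2 for r in residuals)

print("slope:", round(m, 4), "intercept:", round(b, 4), "SSE:", round(sse, 4))
```

With an intercept in the model, the residuals of a least squares fit sum to zero, which is a quick sanity check on any fitted line.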

 

Logistic Regression 

Logistic Regression is used to estimate the probability of an event. The event is represented by the binary target variable, where 1/True/Yes = ‘Success’ and 0/False/No = ‘Failure’, and the predicted value of Y is a probability ranging from 0 to 1. This technique is widely used for classification problems. It does not require a linear relationship between the target itself and the predictors; instead, the model is linear in the log-odds of the target. Since the predictions are in the form of a probability, LityxIQ determines the ‘cutoff’ for what is considered a ‘Success’ based on the percent of the training dataset that is a ‘Success’.
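The following Python sketch shows how a predicted probability can be turned into a ‘Success’/‘Failure’ label. The exact cutoff rule LityxIQ applies is not spelled out here, so this example assumes the cutoff is simply the fraction of training records that are a ‘Success’. The coefficient values and data are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, intercept, coefficient):
    """Probability of 'Success' for a single predictor value x."""
    return sigmoid(intercept + coefficient * x)


# Hypothetical training targets: 3 of 10 records are a 'Success'.
train_targets = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# ASSUMPTION: the cutoff equals the training success rate.
cutoff = sum(train_targets) / len(train_targets)

# Hypothetical fitted coefficients, for illustration only.
intercept, coefficient = -2.0, 0.5

new_values = [0.0, 2.0, 6.0]
scores = [predict_probability(x, intercept, coefficient) for x in new_values]
labels = [1 if p >= cutoff else 0 for p in scores]

for x, p, label in zip(new_values, scores, labels):
    print(x, round(p, 4), "Success" if label == 1 else "Failure")
```

Tying the cutoff to the training success rate, rather than using a fixed 0.5, keeps rare events from always being classified as a ‘Failure’.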

 

Additional Links 

https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/ 

https://www.statisticssolutions.com/what-is-linear-regression/ 

https://www.statisticssolutions.com/what-is-logistic-regression/ 

 

LityxIQ Parameters

 

Linear Regression 

 

 

Maximum Number of Model Terms - The maximum number of terms/variables used during the variable selection process. Larger values may make processing time longer, but too small a value may miss important variables.  

Model Complexity Penalty - Set this value higher to keep models from being overly complex or having too many fields. Set it lower to allow a larger model with more variables. Typical values range from 2 to 2.5. Larger values will also decrease run times, while smaller values may significantly increase run times and produce large models. A higher penalty does not necessarily improve performance on the training dataset; its main benefit is better performance on unseen data, because it helps keep the model from fitting ‘noise’ in the training dataset. For more, see https://www.kdnuggets.com/2016/06/regularization-logistic-regression.html.
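To illustrate why a higher penalty leads to a smaller model, the Python sketch below scores each candidate model as its fit error plus the penalty times its number of terms, then picks the lowest score. This scoring formula is an assumption made for illustration, not LityxIQ's documented calculation, and the candidate models and error values are hypothetical.

```python
def penalized_score(fit_error, n_terms, penalty):
    """ASSUMED scoring rule: lower is better; each extra term costs `penalty`."""
    return fit_error + penalty * n_terms


def best_model(candidates, penalty):
    """Return the (n_terms, fit_error) candidate with the lowest penalized score."""
    return min(candidates, key=lambda c: penalized_score(c[1], c[0], penalty))


# Hypothetical candidates: (number of terms, fit error on the training data).
# Fit error always falls as terms are added, but by smaller and smaller amounts.
candidates = [(1, 120.0), (2, 110.0), (3, 107.0), (4, 105.5), (5, 105.0)]

for penalty in (0.25, 2.0, 4.0):
    n_terms, fit_error = best_model(candidates, penalty)
    print("penalty", penalty, "-> model with", n_terms, "terms")
```

A term is kept only when it reduces the fit error by more than the penalty, so small improvements that may just reflect noise are rejected as the penalty grows.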

 

Logistic Regression 

 

 

Maximum Number of Convergence Iterations - Logistic regression uses an iterative maximum likelihood algorithm to fit the data, and this parameter sets the maximum number of iterations it will run. As the iterative fitting process runs, it typically reaches a point where further iterations provide no significant gain in the performance metric, and the process stops at that point, even if the maximum number of iterations has not been reached.

Convergence Tolerance - If the performance metric changes by less than this value from one iteration to the next, model training stops. If the maximum number of iterations is reached first, model training stops before this convergence tolerance is achieved.
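The interaction between the iteration limit and the convergence tolerance can be sketched in Python as follows. LityxIQ's actual fitting algorithm is not described here, so this example substitutes plain gradient ascent on the log-likelihood as a stand-in, with the log-likelihood playing the role of the performance metric. The function names, step size, and data are hypothetical.

```python
import math


def log_likelihood(xs, ys, b0, b1):
    """Log-likelihood of a one-predictor logistic model (higher is better)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, max_iterations=1000, tolerance=1e-6, step=0.05):
    """Gradient ascent stand-in; returns (b0, b1, iterations_run, converged)."""
    b0, b1 = 0.0, 0.0
    previous = log_likelihood(xs, ys, b0, b1)
    for iteration in range(1, max_iterations + 1):
        # Gradient of the log-likelihood with respect to b0 and b1.
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += step * g0
        b1 += step * g1
        current = log_likelihood(xs, ys, b0, b1)
        # Convergence tolerance: stop once the metric barely changes.
        if abs(current - previous) < tolerance:
            return b0, b1, iteration, True
        previous = current
    # Iteration limit reached before the tolerance was met.
    return b0, b1, max_iterations, False


# Hypothetical data with overlapping classes, for illustration only.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

print(fit_logistic(xs, ys, max_iterations=5000))  # stops on the tolerance
print(fit_logistic(xs, ys, max_iterations=3))     # stops on the iteration limit
```

The first call stops because the change in the metric falls below the tolerance; the second stops at the iteration limit, so its coefficients are not yet fully fitted.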

Maximum Number of Model Terms - Same as for Linear Regression.  

Model Complexity Penalty - Same as for Linear Regression.