Model Settings: Validation

All predictive models should be validated using specialized statistical techniques.  In LityxIQ, this validation step is always performed by default.  Alternative validation methods can also be set up.  These model validation options, available on the Validation tab, are described below.


 

Summary

Holdout
  Definition: Modeling is performed on the Training Set Pct portion of the dataset; the remaining portion is used for validation.
  Common options: Number of Reps
  Unique options: Training Set Pct - the percentage of the modeling dataset that will be used for training (the rest is held out for validation and is not a part of the training process).

Cross Validation
  Definition: Sometimes called k-fold validation, and typically used for small datasets.  Uses equally sized subsets of the data for validation and the remainder for training.
  Common options: Number of Reps
  Unique options: Number of Partitions - the number of equal-sized partitions used for the cross-validation.  The larger the number, the longer the run time.

Backtest
  Definition: Defines the training and validation datasets by use of a date range.  This is sometimes referred to as “Out-of-Time Validation” and can be a stronger form of validation than Holdout, which is an “In-Time Validation” because its validation dataset is randomly selected from all of the data.
  Unique options: Date Variable - the variable that controls selection of records for training or validation based on the date value set.  Date Cutoff - the cutoff date for the backtest data; all records with a date after this cutoff become part of the validation set and are not used for training.

Filter
  Definition: Validation dataset defined by user criteria.
  Unique options: The Validation Filter tab becomes available.

Resubstitution
  Definition: Re-uses the training dataset for validation.  This often gives a biased-high view of performance metrics, but can be a way to gauge how overfit a model is (when the model is also built using other validation techniques).

None
  Definition: Validation is not performed and no model performance is output.

Validation Method = Holdout

The Holdout validation method is the default validation method in LityxIQ and the most commonly used.  It works by randomly splitting the modeling dataset into two groups: the Training Set and the Validation Set (sometimes called the Test Set).  Model building proceeds by optimizing the variables and coefficients of the model using ONLY the training set.  The model is then run against the validation set to see how well it performed.  Because the model is not built using any data from the validation set, the validation set serves as an independent gauge of how well the model would fare if used against brand new data.  This provides an unbiased estimate of its performance.

  • Training Set Pct - This setting determines what percentage of the modeling dataset will be used for the Training Set; the remainder goes to the Validation Set.  70 is the default and a commonly chosen percentage.  If too much data is used for training, the validation set may be too small to provide a precise measurement of performance.  But if the training set is too small, the model built for measuring performance may not be a good reflection of the final model that would be built upon the entirety of the data.
  • Number of Holdout Reps - This determines how many times the holdout process will be repeated.  The default and most common setting is 1, meaning that the modeling dataset is split once, the model is built against the training set, and it is validated against the validation set.  However, LityxIQ allows the random splitting, training, and validation process to be repeated multiple times, in which case the performance results from each repetition are averaged together.  This is a useful technique for smaller datasets, where a single split is more likely to be unduly influenced by the randomness of the split.
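The holdout split described above can be sketched in a few lines of Python.  This is illustrative only, not LityxIQ code; the function name and seed handling are hypothetical, chosen to mirror the Training Set Pct setting.

```python
# Illustrative sketch (not LityxIQ code): one repetition of a holdout split.
import random

def holdout_split(records, training_pct=70, seed=0):
    """Randomly split records into a training set and a validation set."""
    rng = random.Random(seed)         # fixed seed so the split is repeatable
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * training_pct / 100)
    return shuffled[:cut], shuffled[cut:]

train, valid = holdout_split(list(range(100)), training_pct=70)
print(len(train), len(valid))  # 70 30
```

Repeating this with different seeds and averaging the resulting performance metrics corresponds to setting Number of Holdout Reps above 1.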

 

Validation Method = Cross-Validation

The Cross-Validation method is another well-known and common technique for measuring model performance.  On larger datasets it will generally provide results similar to the holdout method.  Because it is much more computationally intensive than the holdout method, it is typically recommended only for smaller datasets, where it can provide more benefit than the holdout method.

The Cross-Validation method works by first randomly splitting the modeling dataset into a given number of equal-sized partitions.  Then, one at a time, each partition serves as a validation set while all the other partitions merge together as the training set (similar to a single repetition of the holdout method).  The final performance metrics are computed by averaging all of the results.

  • Number of Partitions - This determines how many partitions the modeling dataset is split into.  More partitions require longer run times, while fewer partitions make this method behave more like the holdout method.  5 (the default) or 10 are common choices.
  • Number of Cross-Validation Reps - This determines how many times the partitioning process will occur.  The default and most common setting is 1.  But if the dataset is very small and a single random partitioning may not produce statistically reliable results, this option repeats the entire process multiple times.  Final performance results are averaged over all repetitions and partitions.
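The partition-and-rotate procedure can be sketched as follows.  Again this is illustrative, not LityxIQ code; the function names are hypothetical.

```python
# Illustrative sketch (not LityxIQ code): k-fold cross-validation splits.
import random

def kfold_partitions(records, k=5, seed=0):
    """Shuffle records and deal them into k (nearly) equal partitions."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cross_validation_splits(records, k=5, seed=0):
    """Yield (training, validation) pairs: each partition takes one turn
    as the validation set while the others merge into the training set."""
    parts = kfold_partitions(records, k, seed)
    for i, validation in enumerate(parts):
        training = [r for j, p in enumerate(parts) if j != i for r in p]
        yield training, validation

for train, valid in cross_validation_splits(list(range(20)), k=5):
    print(len(train), len(valid))  # 16 4, five times
```

Performance metrics would be computed on each of the k validation sets and then averaged, as described above.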

 

Validation Method = Back-test

A Back-test allows for a date-based determination of the training and validation sets.  This is often useful if the modeling dataset contains data that has been collected over time; the earlier data may be used to build the model, while the later data may serve as validation of that model.  Note that there is no randomization to determine the training and validation sets in a back-test.  At least one date variable is needed for the back-test method.

  • Date Variable - Select the variable to be used as the basis for determining which observations are part of the training set and which are part of the validation set.
  • Date Cutoff - Select a date to serve as the cutoff point between records going into the training versus the validation sets.  Any record in the modeling dataset with a date value greater than the date selected here will be put into the validation set.
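The date-cutoff rule amounts to a simple comparison against the chosen Date Variable.  A minimal Python sketch (not LityxIQ code; the field name "order_date" is a made-up example):

```python
# Illustrative sketch (not LityxIQ code): a date-cutoff back-test split.
from datetime import date

def backtest_split(records, date_variable, cutoff):
    """Records dated on or before the cutoff train the model;
    records dated after it form the validation set."""
    training = [r for r in records if r[date_variable] <= cutoff]
    validation = [r for r in records if r[date_variable] > cutoff]
    return training, validation

rows = [
    {"order_date": date(2023, 1, 15), "amount": 12.0},
    {"order_date": date(2023, 6, 1),  "amount": 30.0},
    {"order_date": date(2024, 2, 9),  "amount": 18.5},
]
train, valid = backtest_split(rows, "order_date", date(2023, 12, 31))
print(len(train), len(valid))  # 2 1
```

Note that, unlike the holdout sketch, there is no randomness here: the split is fully determined by the date variable and the cutoff.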

Validation Method = Filter

The Filter validation method allows users to create their own validation set, using criteria from the dataset itself.  A user can control which records should be used for training and which for validation.

To set a filter for validation, select ‘Filter’ from the drop-down menu, which will then activate the validation filter tab.  For help with using the filter dialog, please see the following article: https://support.lityxiq.com/806706-Using-the-Filter-Dialog
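Conceptually, a filter split applies a user-defined criterion to each record.  A minimal Python sketch of the idea (not LityxIQ code; the "region" field and predicate are made-up examples):

```python
# Illustrative sketch (not LityxIQ code): a user-defined filter split.
def filter_split(records, is_validation):
    """Records matching the user's criterion go to the validation set;
    everything else goes to the training set."""
    validation = [r for r in records if is_validation(r)]
    training = [r for r in records if not is_validation(r)]
    return training, validation

rows = [{"region": "East"}, {"region": "West"}, {"region": "East"}]
train, valid = filter_split(rows, lambda r: r["region"] == "West")
print(len(train), len(valid))  # 2 1
```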

 

Validation Method = Re-substitution

The Re-substitution method is generally not recommended.  It works by reusing the same data that was used to build the model to also measure model performance.  This can hide the fact that a model has been overfit (in other words, a model that appears to perform well at the time it is built, but performs poorly when used against new data).  However, for very large datasets, re-substitution is more efficient than the holdout method and has a smaller chance of creating performance measurement issues.
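A tiny example shows why re-substitution metrics run biased high.  This is an illustrative toy (not LityxIQ code): a 1-nearest-neighbour "model" scores perfectly when measured on its own training data, because every training point is its own nearest neighbour, yet it can misclassify held-out points.

```python
# Illustrative sketch (not LityxIQ code): re-substitution vs holdout accuracy.
def nearest_label(x, training):
    """Predict the label of the training point closest to x."""
    return min(training, key=lambda point: abs(point[0] - x))[1]

training = [(0.0, "a"), (1.0, "a"), (4.0, "b"), (5.0, "b")]
heldout  = [(0.5, "a"), (2.6, "a"), (4.5, "b")]

# Re-substitution: score the model on its own training data (biased high).
resub_acc = sum(nearest_label(x, training) == y for x, y in training) / len(training)
# Holdout: score the same model on data it never saw.
holdout_acc = sum(nearest_label(x, training) == y for x, y in heldout) / len(heldout)
print(resub_acc, holdout_acc)  # re-substitution is perfect; holdout is not
```

The gap between the two numbers is exactly the kind of overfitting signal mentioned in the summary table.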


Validation Method = None

This selects no validation method and LityxIQ will skip the validation step.  Model building will be faster, but there will be no measurements of how the model is expected to perform.

--------------------------------------------------------

Worth Noting:

Different validation methods will generally give different results.  In any model-building situation, the model builder needs to pick the validation method that makes the most sense for that situation: cross-validation is often most appropriate for smaller datasets, holdout is generally optimal for larger ones, and so on.  There is little reason to compare results from different validation methods, as they simply go about computing metrics in different ways.  Re-substitution will generally give higher performance metrics than the other methods, but those metrics are typically not believable unless the dataset is very large.  Use the validation method that makes the most sense for your dataset.  The performance metrics that come out of the chosen validation method are a good indication of the results you would get from implementing the model built on the full file; in fact, the purpose of validation is to estimate just that.