The Sampling tab allows you to specify how much data will be used to build a model, as well as how that data is to be sampled. Dataset sampling is a good strategy when building a model; it can make the building process faster and more efficient and often has only a small effect on the final model performance.
- Maximum Rows for Modeling - Specify the maximum number of rows that will be used to build and validate the model. If the dataset has fewer rows than this value, all of them will be used.
- Sampling Procedure - If the Maximum Rows for Modeling value is smaller than the number of rows in the dataset, this selection determines how rows are chosen for the modeling process.
  - Random Sampling - The rows are selected on a completely random basis. This is the most common selection and the default.
  - Stratified - The rows are selected randomly, but in such a way that a chosen value makes up a specified percentage of the sampled data. Selecting Stratified makes additional choices available; see Stratification Options.
  - None - Select this option to skip sampling on the modeling dataset. Even in this situation, LityxIQ may perform some backend optimizations that include sampling procedures to ensure both accuracy of the model and efficiency of the model building process.
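To make the interaction between Maximum Rows for Modeling and Random Sampling concrete, here is a minimal Python sketch. It is not LityxIQ's implementation; the function name `sample_rows` and its parameters are hypothetical and chosen only for illustration.

```python
import random

def sample_rows(rows, max_rows, seed=0):
    """Hypothetical helper: cap a dataset at max_rows using simple random sampling."""
    # If the dataset has no more rows than the cap, every row is used.
    if len(rows) <= max_rows:
        return list(rows)
    # Otherwise draw max_rows rows at random, without replacement.
    rng = random.Random(seed)
    return rng.sample(rows, max_rows)

dataset = list(range(1000))
print(len(sample_rows(dataset, 200)))   # 200
print(len(sample_rows(dataset, 5000)))  # 1000
```

The seed is fixed here only so that the sketch gives repeatable results.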
Stratification Options
- Stratification Variable - Select the variable to be used as the basis of the stratified sample.
- Value to Stratify - Select the specific value of the stratification variable that will be the basis of stratification. For example, if you wish to over-sample responders when building a response model, select the value representing response (likely a "1" or "Y").
- Stratification Pct - Specify the percentage of rows in the sample that should have the specified value of the stratification variable. For example, to build the model using a file with 50% responders, set this to 50. The requested level is not always achievable. For example, if the dataset contains only 500 responders and you request 10,000 rows for modeling, a 50% level would require 5,000 responders, so LityxIQ will not be able to create a modeling dataset that contains 50% responders.
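The stratification options can be illustrated with a short Python sketch. It is an assumption-laden illustration, not LityxIQ's actual procedure: the function `stratified_sample`, its parameter names, sampling without replacement, and the choice to raise an error when the requested level cannot be met are all hypothetical.

```python
import random

def stratified_sample(rows, variable, value, pct, max_rows, seed=0):
    """Hypothetical helper: sample max_rows rows so that pct percent have variable == value."""
    rng = random.Random(seed)
    target = [r for r in rows if r[variable] == value]
    other = [r for r in rows if r[variable] != value]
    # Number of rows needed from each group to hit the requested percentage.
    n_target = round(max_rows * pct / 100)
    n_other = max_rows - n_target
    # Assumption: sampling is without replacement, so each group must be large enough.
    if n_target > len(target) or n_other > len(other):
        raise ValueError(
            f"need {n_target} target rows and {n_other} other rows, "
            f"but only {len(target)} and {len(other)} are available"
        )
    sample = rng.sample(target, n_target) + rng.sample(other, n_other)
    rng.shuffle(sample)
    return sample

# 500 responders (response == 1) and 20,000 non-responders.
data = [{"id": i, "response": 1 if i < 500 else 0} for i in range(20500)]
sample = stratified_sample(data, "response", 1, pct=50, max_rows=1000)
print(len(sample), sum(r["response"] for r in sample))  # 1000 500
```

With this data, requesting 10,000 rows at 50% would need 5,000 responders and so fails in the sketch. Under the same without-replacement assumption, 1,000 rows is the largest sample that can reach 50% responders.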