Defining QC (Quality Control) Rules for a Derived Dataset

QC, or Quality Control, rules provide a way to perform error checking on a dataset after it finishes executing.  When no QC rules are set, the dataset execution is considered a success when it completes (though other errors may still have occurred along the way, such as an invalid field definition).  If QC rules are defined, each rule is checked once the dataset is otherwise finished executing.  If any of the QC rules evaluates to logical "true" (meaning a violation was detected), the result of the execution is considered an "error" instead of a "success".  The dataset itself will reflect the results of that execution, but its run status will be an error.  In particular, this means that other downstream datasets that are set to execute Upon Data Refresh based on this dataset will not be executed.

An example of the QC tab is below.

Add QC Rule - click this button to add a new QC rule to the list.  It will open the Define QC Rule dialog, which is explained below.

Quality Control Rule List - this list shows all of the currently defined QC rules for the dataset.  The rules can be re-ordered using drag and drop.  Re-ordering affects the order in which the rules are evaluated, which is top to bottom according to how they appear in this list.

Edit and Delete Buttons - Click the appropriate button to edit or delete the rule.  When editing a rule, the Define QC Rule dialog is shown, which is explained further below.


Define QC Rule Dialog

There are two text settings within the Define QC Rule dialog.

Rule Definition - Enter code that will be checked for a violation of quality control rules. The code is evaluated as a logical value, with a logical "true" result triggering an error message.  The code you enter will generally include special values that allow you to dynamically refer to summary metadata (summary statistics) about the dataset or about variables in the dataset.  The options available to you are described in more detail below.  The code can also include numeric constants, arithmetic operators, and comparison symbols, such as the equals (=) sign.

Error Message - Enter an error message to provide as output if this QC rule is triggered.  This will show in the user console.
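For example, a rule that flags an empty result might pair the two settings as follows (the rule uses the metadata references described in the next section; the scenario and message are illustrative only, not taken from the screenshot above):

Rule Definition: {@this:numrows} = 0
Error Message: Dataset produced no rows.

If the just-finished run returns a dataset with zero rows, the rule evaluates to "true" and the run is marked as an error.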


Coding a QC Rule

Dynamic code you can use in a QC rule comes in one of two flavors:

  • Referencing metadata about a dataset - in this case the code you use will look like {@dataset name:dataset metadata component}
  • Referencing metadata about a variable in a dataset - in this case, the code you use will look like {@dataset name:variable name:variable metadata component}

The {@ ... } wrapper is essential: it identifies the enclosed code as a QC rule reference.  The other components are defined below.  Note that the colon (:) symbol is used to separate the components of the code.
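For instance, with a hypothetical dataset named Claims containing a variable named Amount, the two forms would look like:

{@Claims:numrows}
{@Claims:Amount:mean}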

  • dataset name - replace with either the name of the dataset from which you want to extract metadata, or with the special reference "this" to refer to the just-finished dataset execution.  See below for more regarding the "this" terminology.
  • dataset metadata component - replace with one of the following literal values to refer to that property of the named dataset.
    • numrows - refers to the number of rows in the dataset
    • numcols - refers to the number of columns, or variables, in the dataset
  • variable name - replace with the name of a variable in the named dataset
  • variable metadata component - replace with one of the following literal values to refer to that property of the named variable in the named dataset.
    • min - refers to the minimum value of the variable in the dataset.
    • max - refers to the maximum value of the variable in the dataset.
    • q1 - refers to the 1st quartile of the variable in the dataset.
    • q3 - refers to the 3rd quartile of the variable in the dataset.
    • mean - refers to the mean or average of the variable in the dataset.
    • median - refers to the median of the variable in the dataset.
    • stdev - refers to the standard deviation of the variable in the dataset.
    • sum - refers to the sum of the variable in the dataset.
    • notnullcount - refers to the count of non-null values of the variable in the dataset.
    • nullcount - refers to the count of null values of the variable in the dataset.
    • zeroct - refers to the count of zero values of the variable in the dataset.
    • numvalues - refers to the count of distinct (unique) values of the variable in the dataset.
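A few example rules, again using the hypothetical Claims dataset and its Amount variable (the metadata components are as defined above; the names and thresholds are illustrative only):

{@this:Amount:nullcount} > 0
{@this:Amount:min} < 0
{@this:Amount:numvalues} < {@this:numrows}

The first rule triggers an error if Amount contains any null values, the second if it contains a negative value, and the third if it contains duplicate values (the count of distinct values is less than the number of rows).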


Special note on the use of "this" as a dataset name

When you use the special dataset name reference "this", you are indicating that the referenced value should be evaluated with respect to the just-completed dataset execution.  The important distinction arises when you also want to refer to metadata of the same dataset as it existed prior to the current run; in that case, refer to the dataset by its actual name.  The rule in the screenshot above provides a good example of this.  The dataset being executed is called "Perform Suppressions".  The rule entered is:

{@this:CurrentPatientSuppress:sum} < 0.9 * {@Perform Suppressions:CurrentPatientSuppress:sum}

Here is an explanation of the parts of this rule:

  • {@this:CurrentPatientSuppress:sum} - refers to the sum of the variable CurrentPatientSuppress in the dataset, based on the run that just completed.
  • {@Perform Suppressions:CurrentPatientSuppress:sum} - refers to the sum of the variable CurrentPatientSuppress in the dataset as it was computed prior to this run.
  • 0.9 * - a numeric constant combined with the multiplication operator.

Note that the variable CurrentPatientSuppress is defined as a 0 or 1 in the dataset, so its sum equals the number of patients marked for suppression.

Since, as mentioned just above, the dataset being executed is named Perform Suppressions, this rule in effect compares the prior run of the Perform Suppressions dataset to the current (just-finished) run.  The interpretation of the rule is that an error will be triggered if the total number of patients marked for suppression in this run is fewer than 90% of the total marked in the prior run.  Such a rule could catch a situation in which, for example, a new suppression file supplied to the process contained errors or unexpected values that kept it from working as intended.
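A companion rule in the opposite direction could catch an unexpected spike rather than a drop.  This is a hypothetical extension of the screenshot example, written in the same syntax:

{@this:CurrentPatientSuppress:sum} > 1.1 * {@Perform Suppressions:CurrentPatientSuppress:sum}

This rule would trigger an error if the number of patients marked for suppression in the current run exceeded the prior run's total by more than 10%.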