Defining an Aggregation for a Derived Dataset

After opening the settings dialog for a derived dataset (see, follow these steps to define an aggregation step.  It is not required that a derived dataset have an aggregation defined.  If an aggregation is defined, the resulting dataset will typically have fewer (usually many fewer) records than the original datasets.  The number of records will be the total number of different combinations of the selected aggregation variables (see below).  There is an option to instead (or also) aggregate individual variable levels.

1) If the Aggregate panel is not already opened, click on that panel header to open it.  Then click the Edit button.  Note that if this view had previously been defined, you may see information on the currently defined aggregation displayed in this panel, and you can click the Edit button to proceed with changing the aggregation.  The Delete button removes the defined aggregation.

2) The Define Aggregation dialog will appear. The two areas of this dialog are explained below.


Aggregation Variables Tab

Choose Aggregation Partitioning Variables - In this box, select the variables that will be used for aggregation of the dataset.  Specifically, all combinations of the values of the selected variables that are in the dataset will lead to a row in the result set. 

Aggregation Type - This dropdown box has two choices:

  • Summarize Over all Level Combinations - this is the standard aggregation option.   For example, if you select State and Gender (50 values in a dataset; 2 values in the dataset), the result of the aggregation will be 100 rows (or fewer if all state/gender combinations do not exist in the data).  This count would change if Individual Level variables are selected.
  • Summarize Separately for Individual Variable Levels - this option allows you to additionally have the results aggregated separately for each unique level of the variables selected in the Choose Individual Variables box.  A full explanation of aggregating over Individual Variable Levels is provided here:

Choose Individual Variables - this box is available if you select Summarize Separately for Individual Variable Levels as the Aggregation Type.


Summaries Tab

In the Summaries tab, you will define which data summaries will be computed for each combination of the values of the aggregation variables and/or individual level variables.  Below is an overview of using this tab to define summaries.

Aggregation Summaries - this area lists the summary calculations you have already defined.  You can drag and drop them to re-order them, or use the Edit and Delete buttons to edit or delete the existing definitions.

Save These Changes button - click this button when you are finished defining an aggregation summary.  This will add the definition to the Aggregation Summaries area.

Reset button - this will clear out any selections you have made to start with a fresh aggregation summary definition.

Use Pivot Variables - checking this box will allow you to select variables to pivot (data in rows become columns).  More information on using pivot functionality with the aggregation step can be found here:

Select Variables on Which to Compute Summaries - this dropdown list will show all variables currently in the dataset entering this aggregation step.  Select one or more variables on which you want to compute summaries.  If you select multiple variables, the selected summary statistics will be applied to all selected variables.  Note that if you select variables that do not match up well with the selected summary statistics, they will simply be ignored.  For example, if you select a variable Address (a string variable) and also the Average summary statistic, that operation will be ignored and not throw an error.  This makes it easier to defined a large set of operations in one definition without having to worry about matching up variable types with operations.

Select Summary Statistics to Compute - this dropdown list will show all of the available summary statistics or operations that are available for computation.  The options available are:

  • Average - the mean or average of the variable.
  • Weighted Average - the weighted average of the variable.  This requires also selecting a weighting variable.
  • Correlation Coefficient - the correlation of the variable with any variables also selected in the weighting/correlation variables box.
  • Sum - the sum of the variable.
  • Count - the count of all rows (regardless of null or non-null values). 
  • Count Non-NULL - the count of all non-null values of the variable.
  • Count NULLs - the count of null values of the variable.
  • Minimum - the minimum value of the variable.
  • Maximum - the maximum value of the variable.
  • Zero Count - the number of values that are exactly zero.
  • Variance - the sample variance of the variable.
  • Standard Deviation - the sample deviation of the variable.
  • Concatenate Values - this is a special aggregation function that concatenates the unique values of the value together into a single string, using comma as the separator.  This option can create very long values.  Note that the maximum string length allowed is 65,000 characters.  If you define a concatenation that results in a string longer than this limit, an error will occur.
  • Percent of Records - this selection will compute the percentage of records in the dataset that fall into each combination of aggregation variables. 
    • Note that the calculation it makes does not depend on any particular variable (it is just a row count), but it does require at least one variable to be selected.
    • Selecting this summary function will result in only a single variable in the resulting aggregated dataset, even if many variables were selected.
    • If you are aggregating over Individual Variable Levels, the percentages are computed as adding to 100% over all the levels of each variable.
  • Percent of Sum Total - this will compute the percentage of the total that falls into each combination of aggregation variables, for the sum of the selected variables.
    • Only selected variables that are numeric will be used for this summary function.
    • If you are aggregating over Individual Variable Levels, the percentages are computed as adding to 100% over all the levels of each variable.
  • Count Unique Values - the number of unique values of the variable.

Note: you cannot use both the Concatenate Values and Count Unique Values functions in the same aggregation step.

Weight/Correlation Variables - if either the Weighted Average or Correlation Coefficient summary statistics were selected, this box will become available.  Only numeric variables (integer or decimal) are shown.  You can select one or multiple variables in this box.  All combinations of these variables with the selected variables in the Compute Summaries list will be used to compute weighted averages and/or correlation coefficients.

New Variable Name Template - Enter the name for the new variable(s) that will be created from this aggregation.  Since many new variables may result from a single aggregation definition (for example, if you have multiple variables and/or statistics selected), you would often enter a template for how those variable names would be created. If you have selected multiple variables to aggregate, the template [@var] must be used somewhere in this name in order to differentiate between them. If you have selected multiple summary statistics, the template [@stat] must be used somewhere in this name. If you have selected multiple weight/correlation variables, the template [@wtvar] must be used somewhere in this name.  In the case of using a pivot variable, you also can use the templates [@pivotvar] and [@pivotvalue].

Once you have modified your settings on the right side, be sure to click the Save These Changes button at the top of the panel.  The settings for this set of variables and statistics will be saved and appear in the Aggregation Summaries panel, along with any others you have created.

3) When you are completely finished defining aggregations, click Save.  This will return you to the main Derived Dataset window. 

After defining the aggregation as above, the Aggregate area will look like the following, providing a summary of the aggregation being performed.