The concept of creating "New Field Aggregations" in LityxIQ is similar to partition and windowing functions that you may be familiar with in some SQL languages. The functionality provides a way to add new variables to a dataset that have a value for every row, and are computed using summary functions over partitions and ordered windows of the dataset. It is a very powerful set of functionality that typically requires complex SQL coding in a SQL language, but LityxIQ makes such functions easily available through a point and click interface.
There are six essential components to defining a New Field Aggregation. Three components determine how the rows of the dataset are used to compute new values.
- Partition - the computations will be performed separately for each unique combination of the levels of one or more Partition Variables. Each of these unique combinations can be considered a "partition" of the dataset. It is possible to not select any Partition Variables, in which case the computations are performed considering the entire dataset as a single partition. All partitions are processed completely independently of each other.
- Ordering - for many computations (but not all), the ordering of the rows that are within a partition is important. For example, when computing a lag, it is necessary to specify a way that rows are ordered. The Ordering Variables can be set to determine how that ordering is performed. Most computations require that one or more ordering variables be selected.
- Window - each requested computation is carried out on a certain dynamic subset (window) of records within the partition. The Window can be "dynamic" because it can be defined differently for each row in the partition as that row relates to other rows in the partition. A simple and common window, however, is often the window containing "all records in the partition". This choice of window is "static" because the rows contained in the window is always the same for every row in the partition. An example of a dynamic window would be: "for each row in the partition, define the window under consideration to be all rows starting two before this row, and up to two following this row". This can be thought of as a dynamic moving window that always contains five rows, and those five rows in the window change as each new row in the partition is considered.
The other three components to define a New Field Aggregation relate to the computation that is made for each row, based on the Partition, Ordering, and Window selected above:
- Variable to Summarize - this is the variable on which a calculation will be made. For example, to compute an average Age, Age would be selected.
- Summary Function - this is the method for making a calculation on the selected variable. There are many options, including standard aggregation functions like average, min, max, and counts. Other options include lag, percentiles, ranks, z-values, and many more.
- Other Settings - some functions chosen require other settings. For example, when requesting that a lag be computed, you will need to specify which lagged value (such as 1 row before, 3 rows before, or even something like 2 rows after). Another example is the "Nth Value" function for which you will specify the value of N (e.g., specifying N=4 requests the 4th ordered value in the window as the result). "Ntile" and "Percentile" are two other examples of functions requiring more information to work properly.
There are similarities and differences with creating standard new fields and with doing a standard dataset aggregation (in the Aggregation area when defining a view). Here are some comparison and contrasts:
Standard New Field
- Creates a new variable that has a value for every row in the dataset. The number of rows in the dataset does not change.
- The value for a row depends on other data in the same row, using a formula (such as a mathematical operation, a string or date-based calculation, etc).
- These operations are performed in the New Fields area when creating a View.
Aggregation New Field
- Creates a new variable that has a value for every row in the dataset. The number of rows in the dataset does not change.
- The value for a row can depend on summaries or other operations that involve data from other rows. Examples include lags, averages, and percentiles among many others.
- These operations are performed in the New Fields area when creating a View.
Standard Aggregation
- Summarizes a dataset to a different level based on the defined aggregation variables, with the result typically having fewer rows than the starting point.
- The values for a row in the summarized dataset are based on simple summary operations such as averages, minimums, and counts. More complex operations such as lags and percentiles are not supported.
- These operations are performed in the Aggregation area when creating a View.