The Associations Analysis module

Content

Purpose and short description
Input data formats
Definitions and notations
Basic parameters for an associations analysis
Pattern content constraints ('item filters')
Pattern statistics constraints ('numeric filters')
Result display options
Pattern verification and significance assurance
Applying association models to new data ('Scoring')


Purpose and short description

An associations analysis finds items in your data that are associated with each other in a meaningful way. Within Synop Analyzer, an associations analysis is started by pressing the button associations detection in the left screen column.

An item is an 'atomic' piece of information contained in the input data, i.e. a combination of a data field name and a field value, for example PROFESSION=farmer. A prerequisite for finding associations between these atomic items is that several items can be grouped into one comprising group of data fields or data records. This group of fields or records is often called a transaction (TA).

An association is a combination of several items, a so-called item set, for example the combination PROFESSION=farmer & GENDER=male. An association rule is a combination of two item sets in the form of a rule itemset1 → itemset2. The left hand side of the rule is called the rule body or antecedent, the right hand side the rule head or consequent.
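These notions, together with the basic statistical measures used throughout this chapter, can be illustrated with a small sketch (illustrative Python, not part of Synop Analyzer), which computes support, confidence and lift of a rule over a list of transactions:

```python
# Illustrative sketch (not Synop Analyzer code): each transaction is a set
# of items of the form "FIELD=value".

def rule_stats(transactions, body, head):
    n = len(transactions)
    body, head = set(body), set(head)
    n_body = sum(1 for t in transactions if body <= t)          # TAs containing the body
    n_head = sum(1 for t in transactions if head <= t)          # TAs containing the head
    n_both = sum(1 for t in transactions if (body | head) <= t) # TAs containing both
    support = n_both / n                 # relative support of the rule
    confidence = n_both / n_body         # P(head | body)
    lift = confidence / (n_head / n)     # confidence vs. a-priori head probability
    return support, confidence, lift

tas = [
    {"PROFESSION=farmer", "GENDER=male", "REGION=north"},
    {"PROFESSION=farmer", "GENDER=male"},
    {"PROFESSION=teacher", "GENDER=female"},
    {"PROFESSION=farmer", "GENDER=female"},
]
# PROFESSION=farmer -> GENDER=male: support 0.5, confidence 2/3, lift 4/3
stats = rule_stats(tas, ["PROFESSION=farmer"], ["GENDER=male"])
```

A lift above 1 indicates that body and head appear together more often than their individual frequencies would suggest.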

The table below lists typical use cases for associations analysis [Ballard, Rollins, Dorneich et al., Dynamic Warehousing: Data Mining made easy]:

industry      | use case                 | grouping criterion        | typical body item               | typical head item
retail        | market basket analysis   | bill ID or purchase ID    | a purchased article             | another purchased article
manufacturing | quality assurance        | product (e.g. vehicle ID) | component, production condition | problem, error ID
medicine      | medical study evaluation | patient or test person    | single treatment info           | medical impact

Input data formats

Synop Analyzer's association detection module supports three different input data formats:

A general rule which holds for all data formats: the items which form the detected associations can only come from active data fields which have not been marked as 'group', 'entity', 'order' or 'weight'. 'entity' fields are ignored in associations mining (they are only relevant for sequential patterns analysis); 'group' field values serve to define data groups covering more than one data row; information from 'order' fields is used to calculate trend coefficients for the detected associations; and information from 'weight' fields is used to calculate pattern weight coefficients.


Definitions and notations

An association pattern or rule can be characterized by the following properties: [Ballard, Rollins, Dorneich et al., Dynamic Warehousing: Data Mining made easy]


Basic parameters for an associations analysis

In Synop Analyzer, an associations analysis is started by loading a data source - the so-called training data - into memory and by clicking the button associations detection in the input data panel on the left side of the Synop Analyzer GUI. The button opens a panel named Associations Detection. In the lower part of this panel, you can specify the settings for an associations analysis and start the search. The detection process itself can be a long-running task, therefore it is executed asynchronously in several parallel background threads. In the upper part of the panel, the detected association rules - the so-called association model - are displayed.

The following paragraphs and screenshots demonstrate the handling of the various sub-panels and buttons, using the sample data doc/sample_data/customers.txt.

The first visible tab in the toolbar at the lower end of the screen contains the most important parameters for associations analysis.

[Screenshot: assoc_toolbar_train_875.png]

In the screenshot, the following settings were specified:


Pattern content constraints ('item filters')

Filter criteria defining the desired content of the patterns to be detected can be specified in the second tab, named Item filters, of the bottom part of the associations analysis screen. The tab itself displays how many content filter criteria of the various types have been set; new content filter criteria are specified in pop-up dialogs which open when one presses one of the buttons in the tab.

[Screenshot: assoc_toolbar_itemfilter_774.png]


Advanced pattern statistics constraints

The third tab at the lower end of the screen, Advanced Parameters, provides 12 parameters which serve for fine-tuning the detected pattern set based on certain statistical measures.

[Screenshot: assoc_toolbar_advanced_875.png]


Result display options

The fourth tab within the tool bar at the lower border of the associations analysis window offers some capabilities to modify the display mode of the detected associations and to inspect and export them. Some of the buttons only become enabled once you have selected one or more patterns by mouse click in the result table above the tool bar.

The screenshot shown below results if one performs the parameter settings described in the previous sections, presses the button Start training in the first tab and finally selects one of the resulting patterns by left mouse click.

The tabular view of detected patterns contains the statistical measures of each pattern and its content, the items which form the pattern. The most important statistical measures are, from left to right: the number of items in the pattern, the pattern's absolute and relative support, the absolute supports of the involved items, the lift, purity and core item purity, and finally the list of the items which form the pattern.

[Screenshot: assoc_resultview_1021.png]

The items describing numeric data field values contain, in addition to the value range limits, an extra piece of information within curly braces: the position of the value range within the overall value distribution of the numeric data field. For example, the item Age=[20..30[ {=3(10)} means that the age range from 20 (incl.) to 30 (excl.) is the third smallest out of 10 value ranges, hence the age value is below average but not strongly below average.
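This annotation can be decoded mechanically. The following hedged helper (illustrative only; it assumes the exact notation shown above and plain integer range limits) parses such an item string:

```python
# Hedged helper (illustrative only): decodes numeric range items such as
# "Age=[20..30[ {=3(10)}" - range [20, 30), third smallest of 10 bins.
# Assumes the exact notation shown above and plain integer range limits.
import re

ITEM_RE = re.compile(
    r"^(?P<field>\w+)=\[(?P<lo>[^.]+)\.\.(?P<hi>[^\[\]]+)\["
    r"\s*\{=(?P<pos>\d+)\((?P<bins>\d+)\)\}$")

def parse_numeric_item(text):
    m = ITEM_RE.match(text)
    if not m:
        raise ValueError("not a numeric range item: " + text)
    return {"field": m.group("field"),
            "low": float(m.group("lo")),
            "high": float(m.group("hi")),
            "position": int(m.group("pos")),
            "bins": int(m.group("bins"))}
```

For the example above, parse_numeric_item("Age=[20..30[ {=3(10)}") yields the field Age, the range [20, 30) and position 3 of 10 bins.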

The numbers in the table column item frequencies are the absolute supports of the different items of the pattern, in the same order in which the item names appear in the columns at the right end of the result table. If a number is marked with a star (*), the corresponding item belongs to the core of the pattern. That means that each partial pattern from which this item has been removed has a larger support than the original pattern.
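Under one natural reading of this criterion - an item is a core item if removing just this item yields a partial pattern with strictly larger support - the core can be sketched as follows (illustrative Python, not the tool's internals):

```python
# Illustrative sketch (one reading of the core criterion described above):
# an item is a core item if removing just this item yields a partial
# pattern with strictly larger support.

def support(transactions, itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

def core_items(transactions, pattern):
    full = support(transactions, pattern)
    return [it for it in pattern
            if support(transactions, [x for x in pattern if x != it]) > full]

tas = [{"A", "B"}, {"A", "B"}, {"B"}, {"C"}]
# removing "A" from ["A", "B"] raises the support from 2 to 3; removing "B"
# leaves it at 2 - so only "A" is a core item of this pattern
```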

The tabular result view also contains some more advanced information on the detected patterns. In the figure shown below these columns have been enlarged and thus highlighted:

[Screenshot: assoc_resultview2_1014.png]

Clicking on a table row with the right mouse button opens a detail view of the association in a separate pop-up window. The detail view displays the n different ways of interpreting the association as an association rule with exactly one item as the rule head. For each rule, the detail view contains the absolute support of the rule body and the rule head, the rule's confidence, the lift of the rule body pattern and the rule lift.

[Screenshot: assoc_resultview_detail_1024.png]
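The content of such a detail view can be reproduced by hand. The sketch below (illustrative Python, not Synop Analyzer code) enumerates the n single-head rules of an n-item pattern together with body support, head support, confidence and rule lift:

```python
# Illustrative sketch (not Synop Analyzer code): the n single-head rules of
# an n-item pattern, each with body support, head support, confidence and
# rule lift as shown in the detail view.

def single_head_rules(transactions, pattern):
    n = len(transactions)
    def supp(itemset):
        return sum(1 for t in transactions if set(itemset) <= t)
    s_full = supp(pattern)
    rules = []
    for head in pattern:
        body = [it for it in pattern if it != head]
        s_body, s_head = supp(body), supp([head])
        confidence = s_full / s_body
        rule_lift = confidence / (s_head / n)
        rules.append({"body": body, "head": head, "body_support": s_body,
                      "head_support": s_head, "confidence": confidence,
                      "rule_lift": rule_lift})
    return rules

tas = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
rules = single_head_rules(tas, ["A", "B"])  # the 2 single-head rules of {A, B}
```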

In the tool bar tab Result introspection the following options are available:


Pattern verification and significance assurance

At the end of the chapter on associations analysis we want to discuss how one can make sure that a detected pattern is statistically significant and not just a random statistical fluctuation, so-called white noise, in the data. This issue is often completely left aside in traditional books on data mining and in many existing software packages.

Synop Analyzer provides two means of addressing this issue: one can calculate a so-called χ2 confidence level for each pattern, and one can perform one or more verification runs on artificially permuted versions of the original data, which serve to define the so-called noise level and the associated 'Monte Carlo confidence' that the given pattern's statistical key measures exceed that noise level, making it a significant pattern. In this section, the two confidence measures and their interpretation are discussed in detail.

As an example, let us look at one concrete association pattern which we have taken as an example several times in this chapter:

[Screenshot: assoc_resultview2_1014.png]

The highlighted sample pattern has length 4, absolute support (frequency) 163, relative support of 1.6%, a lift value of 6.64, the χ2 confidence of 1.000 and the Monte Carlo confidence of 0.58. What does that mean for the significance of the pattern, and why is the χ2 confidence of this pattern (and of most other patterns) much larger than the Monte Carlo confidence?

To answer these questions, we start by recalling the definition of the χ2 confidence. A pattern of n items with absolute support S has a χ2 confidence of x% if the following holds for each of the n items: the appearance probability of the item in the presence of the n-1 other items of the pattern differs so strongly from the a-priori appearance probability of the item on the overall data that this difference is, in x out of 100 cases, greater than the difference in appearance probabilities which results from comparing a randomly selected subset of S data groups to the entire data. Put less formally, this roughly means: x out of 100 association patterns which do not represent a statistically significant relation on the data and which have the same pattern length and support as the given pattern would have a lift value closer to 1 than the given pattern. Conversely, this also means: even if a pattern has a χ2 confidence value of 0.9999, 1 out of 10000 randomly chosen noise patterns of the same length and support would have a lift value as strong as the given pattern's. A typical associations analysis - unless almost all items appearing in the detected patterns have been specified as 'required items' by the user - examines billions or even trillions of candidate patterns. Therefore, it is highly probable that a few random noise patterns with a χ2 confidence of 0.999 or even 1.000 make it into the displayed result.

In summary we can conclude: that a pattern has a χ2 confidence of 0.95 or higher is a necessary but not a sufficient condition for the pattern's statistical significance. The condition is only sufficient if the search space of candidate patterns during the analysis was very small, that means if only a few patterns were evaluated. In all other cases, one needs other significance measures for finally classifying a pattern as significant or not.

In these latter cases, the Monte Carlo confidence level, which is based on verification runs and permutation tests, gives a more reliable significance estimate. The method first calculates a 'maximum noise level' for each pair (pattern length, support) based on all available verification runs. The maximum noise level takes into account all recorded lift, purity and core item purity values of the patterns detected on the verification data. From each triple (lift, purity, core item purity), a number NL(length,support) is calculated, and the maximum noise level MNL is the maximum of all recorded NL(length,support) values. For pairs (length,support) for which not enough patterns have been found within the verification runs, the maximum noise level is interpolated from neighboring MNL values. Once the MNLs have been established, the corresponding quality number Q is calculated as a function of lift, purity and core item purity for each pattern detected on the real data and compared to the MNL for the same length and support. The Monte Carlo confidence is a function of Q minus MNL which is calibrated such that the result is 0.45 if Q equals MNL and 0.95 if Q equals 1.5 MNL.
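The actual quality function combining lift, purity and core item purity is not reproduced here; the following strongly simplified Python sketch only illustrates the underlying permutation-test idea: shuffling each data column independently destroys all cross-field associations, so the largest lift observed on the shuffled data defines a noise level against which a real pattern's lift can be compared.

```python
# Strongly simplified permutation-test sketch of the Monte Carlo idea
# (the actual NL/MNL quality function of Synop Analyzer is not reproduced).
import random

def support(rows, items):
    # rows: list of dicts, items: list of (field, value) pairs
    return sum(1 for r in rows if all(r[f] == v for f, v in items))

def lift(rows, items):
    n = len(rows)
    joint = support(rows, items) / n
    expected = 1.0
    for f, v in items:
        expected *= support(rows, [(f, v)]) / n
    return joint / expected if expected > 0 else 0.0

def permuted(rows, rng):
    # shuffling every column independently destroys all cross-field
    # associations while keeping each single field's value distribution
    fields = list(rows[0])
    cols = {f: [r[f] for r in rows] for f in fields}
    for f in fields:
        rng.shuffle(cols[f])
    return [{f: cols[f][i] for f in fields} for i in range(len(rows))]

def noise_level(rows, items, runs=20, seed=0):
    rng = random.Random(seed)
    return max(lift(permuted(rows, rng), items) for _ in range(runs))

# strongly associated toy data: fields A and B always agree
rows = [{"A": i % 2, "B": i % 2} for i in range(100)]
pattern = [("A", 1), ("B", 1)]
# lift(rows, pattern) is 2.0 and clearly exceeds the permutation noise level
```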

Informally, the Monte Carlo confidence can be interpreted as follows: a value of about 0.5 means that in all verification runs not a single fluctuation pattern has been found with the same combined significance of pattern length, support, lift, purity and core item purity as the current pattern. This is good evidence that the current pattern is statistically significant. The evidence becomes even stronger as the MC confidence approaches 1.0. That means our sample pattern, which has MC conf=0.58, is with high probability statistically significant, whereas the pattern below our example pattern in the result table could be random noise, even though its χ2 confidence is 1.000.


Applying association models to new data ('Scoring')

Association models can be applied to new data in order to create predictions on these data. For example, an associations model could use the click history of a web shop user to decide which product offers are to be shown to this user. Another associations model could serve as an early warning system in a production process, predicting upcoming problems and faulty products. A third associations model could classify credit demands into a high risk and a low risk group. This application of associations models to new data for predictive purposes is called 'scoring'.

In the current version of Synop Analyzer, associations models must satisfy a certain precondition for being usable for scoring: all association rules in the model must have rule heads ('then' sides) containing values of one single data field. This data field is called the target field of the model. In the three sample applications cited above (web shop, production monitoring, credit risk), the target fields could be ARTICLE, ERROR and RISK_CLASS.

If all rules of the model only contain information ('items') from one single data field, the precondition for scoring is trivially satisfied. If not, you can enforce the precondition by defining one or more required items of type Rule head when training the model. In this case, you must make sure all required head items are values or value ranges of one single data field.
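This precondition can also be checked programmatically. The small helper below (illustrative Python; it assumes items are encoded as "FIELD=value" strings) determines the target field of a rule set and rejects rule sets whose heads span more than one data field:

```python
# Illustrative helper (assumes items are encoded as "FIELD=value" strings):
# determines the single target field of an association rule set, raising an
# error if the rule heads span more than one data field.

def scoring_target_field(rules):
    # rules: iterable of (body_items, head_items) pairs
    head_fields = {item.split("=", 1)[0] for _, head in rules for item in head}
    if len(head_fields) != 1:
        raise ValueError("not usable for scoring, head fields: %s"
                         % sorted(head_fields))
    return head_fields.pop()

model = [(["Age=[20..40["], ["LifeInsurance=yes"]),
         (["Income=low"], ["LifeInsurance=yes"])]
# all heads belong to the field LifeInsurance -> usable for scoring
```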

You load and apply an associations model by first opening and reading the new data, then pressing the button assoc to start the associations analysis module, and then clicking the button Load model in the tab Scoring Settings of the tool bar at the lower end of the panel's GUI window.

[Screenshot: assoc_toolbar_scoring_818.png]

In the following sections we will demonstrate the process of associations scoring with the help of a concrete example use case: using an associations model we want to predict the propensity of newly acquired bank customers to sign a life insurance contract.

For this purpose, we load the sample data doc/sample_data/customers.txt. We keep the default data import settings with one exception: the number of bins for numeric fields (Bins:) is reduced from 10 to 5. Then we start the associations analysis module and train a model called assoc_li.mdl, using the following parameter settings:

The model trained with these settings contains 17 rules. The strongest rule predicts a probability of 45% that a customer with the properties given on the left side of the rule will sign a life insurance contract.

[Screenshot: assoc_scoring_train2_806.png]

Now we want to use the generated model for predicting the propensity of 159 new customers for signing life insurance contracts. The new customers' data reside in the file newcustomers_159.txt. We load these data as a new Synop Analyzer data source. The value range discretizations of the numeric fields of the new data must be identical to the range discretizations that were in place when the model was created. In our case, we use the pop-up window Settings → Field discretizations to make sure the field Age has the range boundaries 20, 40, 60 and 80. For the field ClientID we specify the usage type group in the dialog Active fields.

On this in-memory data source we start the associations analysis module and move to the tab Scoring Parameters in the tool bar at the lower end of the screen. Here, we enter the name of the file in which the scoring results are to be stored (scored_newcust_LI.txt), we define the scoring result data fields to be contained in that file, and we specify that the new file should be a copy of the existing file newcustomers_159.txt plus the newly computed data fields (Create new data, original plus computed fields). Since all association rules in our model predict the same value (LifeInsurance=yes), we do not need a new data field Predicted field. Instead, we are interested in the predicted probability of that value, therefore we define a Confidence field and call it LI_CONF. To be able to identify the individual customers in the new data, we make sure the key field ClientID is contained in the new data and serves as Record ID field.

[Screenshot: assoc_scoring_params_834.png]

By means of the button Start scoring we create the scoring results, write the desired result file to disk and open the resulting data as a new in-memory data source in Synop Analyzer, that means as a new tab in the left column of the Synop Analyzer workbench.

We inspect the scoring result data with the module 'multivariate exploration'. We see that the model has created a non-empty propensity probability for 39 of the 159 new customers. But some of these 39 customers should be filtered out because they already have a life insurance, are 60 years or older, or are children or retired persons. There remain 19 new customers who are interesting for selling life insurance contracts:

[Screenshot: assoc_scoring_multivar_854.png]

Via the button show data we submit the selected 19 data records to a final visual examination. Then we can use the button Export to save the resulting list to a flat file or Excel spreadsheet, or we can use the main menu button Report to create an HTML or PDF report.

[Screenshot: assoc_scoring_result_825.png]