The Sequential Patterns Analysis module

Content

Purpose and short description
Input data formats
Definitions and notations
Basic parameters
Pattern content constraints ('item filters')
Pattern statistics constraints ('numeric filters')
Pattern count constraints
Pattern verification and significance assurance
Storing and reusing Sequences training tasks
The Sequences result panel
Applying sequences models to new data ('Scoring')


Introduction to Sequential Patterns Analysis

Sequential patterns analysis is a variant of associations analysis which is suitable for data containing a time stamp or a more general data field with ordering information.

Within Synop Analyzer, the sequential patterns analysis module is started using the button sequential patterns in the left screen column. The button is only active on input data on which an 'entity' field, an 'order' field and a 'group' field have been defined. The Group field and the Order field can be identical. In this case, duplicate the data field in the active fields dialog and specify the original data field as the group field and the duplicated field as the order field.

The result of a sequential patterns analysis is a sequences model, that means a collection of sequential patterns which have been detected during the sequences training run on the training data set. The model can be applied to a new data source in a so-called sequences scoring step. In Synop Analyzer's sequential patterns analysis panel, you can visualize and introspect the sequences model in tabular form, sort, filter and export the filtered results to flat files or into the inter-vendor standard XML format PMML. Furthermore, you can explore and export the support of selected sequential patterns, that means the data sets on which the selected patterns occur.

In the following sections, we will refer to many notations and concepts which have been introduced and explained in the documentation chapter on associations analysis, in particular in the section Definitions and notations of that chapter. Therefore, we recommend reading that chapter and becoming familiar with the concepts of associations analysis before starting to use the sequential patterns analysis module.

Unlike an association pattern, a sequential pattern or sequence is a time-ordered combination of several sets of items, a so-called sequence of item sets, in which the items within each item set occur at the same time and consecutive item sets are separated by time steps larger than zero.

An example for a sequence is the following one, based on supermarket purchase data:
(diapers size 1 (new born) & baby cleansing tissues) →[4±1 months]→ baby food 4th-6th month
The sequence consists of two item sets. It states that a certain group of supermarket customers starts buying diapers size 1 and soft baby cleansing tissues at a certain point in time, and that the same customers often start buying baby food for 4 to 6 month old babies 4 months, plus or minus one month, after buying their first diapers and baby tissues.

A sequence rule is a sequence in which the last time step is interpreted as the separation between the rule body (left hand side) and the rule head (right hand side).
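The definitions above can be illustrated with a minimal Python sketch. The class `Sequence` and the function `supports` are hypothetical names chosen for this illustration and are not part of Synop Analyzer. The sketch ignores time step limits and only checks that the item sets occur in the right order at distinct points in time.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Sequence:
    """A time-ordered tuple of item sets (illustrative representation only)."""
    item_sets: Tuple[FrozenSet[str], ...]

    def as_rule(self):
        """Split at the last time step: body = all but the last item set, head = last."""
        if len(self.item_sets) < 2:
            raise ValueError("a sequence rule needs at least two item sets")
        return self.item_sets[:-1], self.item_sets[-1]


def supports(history, sequence):
    """Return True if the history contains the item sets of the sequence in order.

    history: list of (timestamp, set_of_items), sorted by timestamp, one entry
    per distinct timestamp. Each entry can match at most one item set, so
    consecutive item sets are always separated by a time step larger than zero.
    """
    position = 0
    for _, items in history:
        if position < len(sequence.item_sets) and sequence.item_sets[position] <= items:
            position += 1
    return position == len(sequence.item_sets)


# The supermarket example from the text, with invented time stamps (in months).
baby_sequence = Sequence((
    frozenset({"diapers size 1", "baby cleansing tissues"}),
    frozenset({"baby food 4th-6th month"}),
))
customer_history = [
    (0, {"diapers size 1", "baby cleansing tissues", "milk"}),
    (4, {"baby food 4th-6th month"}),
]
body, head = baby_sequence.as_rule()
```

Here `supports(customer_history, baby_sequence)` is true, `body` holds the first item set and `head` the last one.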

The table below lists typical use cases for sequential patterns analysis [Ballard, Rollins, Dorneich et al., Dynamic Warehousing: Data Mining made easy]:

industry | use case | entity field | group field | typical body item | typical head item
retail | upselling analysis | customer ID | bill ID or purchase ID | a purchased article | another purchased article
manufacturing | quality assurance | product (e.g. vehicle ID) | process step or timestamp | component, production condition | problem, error ID
medicine | medical study evaluation | patient or test person | treatment step or date | single treatment info | medical impact


Input data formats

As mentioned in the first section of this chapter, each data source on which a sequential patterns analysis is to be performed must contain a so-called entity field and an order or timestamp field. These fields must have been declared in the active fields dialog of the input data panel. The entity field contains the subjects (entities) on which time-ordered patterns have been observed, e.g. customers, vehicles, or patients.

Another required property of the data is that they are sorted by entity field values and, if available, by group field values. If the data are read from a database, Synop Analyzer automatically ensures that property by issuing a SELECT statement with an appropriate ORDER BY clause. If the data are read from a flat file or from a spreadsheet, the user is responsible for bringing the data into the correct order. Synop Analyzer will issue a warning message if the data are not correctly ordered.
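For flat files, the ordering can be checked and repaired before the import. The following Python sketch is one possible way to do this; the helper names are hypothetical, and the sketch assumes that plain Python comparison of the field values matches the order expected by Synop Analyzer, which may not hold for every data type.

```python
def first_unsorted_row(rows, entity_field, group_field=None):
    """Return the index of the first row that breaks the sort order, or None."""
    previous = None
    for index, row in enumerate(rows):
        if group_field is None:
            key = (row[entity_field],)
        else:
            key = (row[entity_field], row[group_field])
        if previous is not None and key < previous:
            return index
        previous = key
    return None


def sort_rows(rows, entity_field, group_field=None):
    """Return a copy of rows sorted by entity and, if given, by group field."""
    fields = [entity_field] + ([group_field] if group_field else [])
    return sorted(rows, key=lambda row: tuple(row[name] for name in fields))


# Invented rows using the field names of the retail sample data.
rows = [
    {"CUSTOMER_ID": "C2", "PURCHASE_ID": "P3"},
    {"CUSTOMER_ID": "C1", "PURCHASE_ID": "P1"},
    {"CUSTOMER_ID": "C1", "PURCHASE_ID": "P2"},
]
problem_at = first_unsorted_row(rows, "CUSTOMER_ID", "PURCHASE_ID")
ordered = sort_rows(rows, "CUSTOMER_ID", "PURCHASE_ID")
```

Because `sorted` is stable, rows that share the same entity and group keep their original relative order.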

If these prerequisites are fulfilled, Synop Analyzer's sequential patterns analysis module is prepared for working with three different data formats:

A general rule, which is valid on all data formats, is: the items which form the detected sequences can only come from active data fields which have not been marked as 'group', 'entity', 'order' or 'weight'. 'entity' and 'group' field values serve to define data groups covering more than one data row, information from 'order' fields is used to attach a time stamp to each item, and information from 'weight' fields is used to calculate pattern weight coefficients.
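The rule can be pictured as follows. This Python sketch groups data rows into one time-ordered history per entity and takes items only from the fields passed as `item_fields`. The function name, the 'FIELD=value' item notation and the sample rows are assumptions made for this illustration; they do not describe the internal data structures of Synop Analyzer.

```python
from collections import defaultdict


def build_histories(rows, entity_field, order_field, item_fields):
    """Group rows into per-entity lists of (time stamp, item set).

    Items are written as 'FIELD=value' and come only from item_fields, that
    means from fields which are not entity, group, order or weight fields.
    """
    per_entity = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for name in item_fields:
            per_entity[row[entity_field]][row[order_field]].add(f"{name}={row[name]}")
    # Time stamps are dictionary keys and therefore unique, so sorting the
    # (time stamp, item set) pairs never has to compare two item sets.
    return {entity: sorted(steps.items()) for entity, steps in per_entity.items()}


# Invented rows; PURCHASE_ID (group) and PRICE (weight) contribute no items.
rows = [
    {"CUSTOMER_ID": "C1", "PURCHASE_ID": "P1", "DATE": "2024-01-02", "PRICE": "3.50", "ARTICLE": "beer 3"},
    {"CUSTOMER_ID": "C1", "PURCHASE_ID": "P1", "DATE": "2024-01-02", "PRICE": "1.00", "ARTICLE": "chips"},
    {"CUSTOMER_ID": "C1", "PURCHASE_ID": "P2", "DATE": "2024-01-09", "PRICE": "20.00", "ARTICLE": "champagne"},
]
histories = build_histories(rows, "CUSTOMER_ID", "DATE", ["ARTICLE"])
```

The ISO date strings used here sort correctly as plain text; other date formats would have to be parsed first.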


Definitions and notations

A sequence or sequence rule can be characterized by the following properties: [Ballard, Rollins, Dorneich et al., Dynamic Warehousing: Data Mining made easy]


Basic parameters for a sequential patterns analysis

In Synop Analyzer, a sequence analysis is started by loading a data source - the so-called training data - into memory and by clicking on the button sequence analysis in the input data panel on the left side of the Synop Analyzer GUI. The button opens a panel named Sequences Detection. In the lower part of this panel, you can specify the settings for a sequential patterns analysis and start the search. The detection process itself can be a long-running task; therefore, it is executed asynchronously in several parallel background threads. In the upper part of the panel, the detected sequences - the so-called sequence model - are displayed.

The following paragraphs and screenshots demonstrate the handling of the various sub-panels and buttons using the sample data doc/sample_data/RETAIL_PURCHASES.txt. We assume that these data have been imported into Synop Analyzer as described in Name mappings and Taxonomies, that means with PURCHASE_ID as group field, CUSTOMER_ID as entity field, DATE as order field, PRICE as weight field and with doc/sample_data/RETAIL_NAMES_DE_EN.txt as article names and doc/sample_data/RETAIL_ARTICLEGROUPS.txt as article hierarchies.

The first visible tab in the toolbar at the lower end of the screen contains the most important parameters for sequential patterns analysis.

[Screenshot: assoc_toolbar_train_866.png]

In the screenshot, the following settings were specified:


Pattern content constraints ('item filters')

Filter criteria defining the desired content of the patterns to be detected can be specified using the second tab, named Item filters, in the bottom part of the sequential patterns analysis screen. The tab itself displays how many content filter criteria of the various types have been set. New content filter criteria are specified in pop-up dialogs which open when you press one of the buttons in the tab.

[Screenshot: assoc_toolbar_itemfilter_746.png]


Advanced pattern statistics constraints

The third tab at the lower end of the screen, Advanced Parameters, provides 9 parameters which serve for fine-tuning the detected pattern set based on certain statistical measures.

[Screenshot: assoc_toolbar_advanced_866.png]


Result display options

The fourth tab within the tool bar at the lower border of the sequences analysis window offers some capabilities to introspect and export the generated patterns and the entities on which they appear. Some of the buttons only become enabled if you have selected one or more patterns by mouse clicks in the result table above the tool bar.

The screenshot shown below results if one performs the parameter settings described in the previous sections, presses the button Start training in the first tab and finally selects the first resulting pattern by left mouse click.

The tabular view of detected patterns contains the statistical measures of each pattern and its content, the itemsets which form the pattern. The most important statistical measures are, from left to right: the number of items in the pattern, the sequence length, that means the number of itemsets in the pattern, the pattern's absolute and relative support, the absolute supports of the involved itemsets, the lift, purity and weight, and finally the list of itemsets which form the pattern.

If the user has specified a time step limit in the third tab of the bottom tool bar (in our example, that has been the case), then the result table also contains time step information. Each time step entry contains the mean and the standard deviation of the time step measured on the training data.
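To make the support and time step columns more concrete, the following Python sketch computes them for a pattern with two item sets. It is an illustration under several assumptions which are not confirmed by this chapter: time stamps are plain numbers (e.g. days), support is counted per entity, the time step of an entity is measured from the first occurrence of the first item set to the first later occurrence of the second item set, and the standard deviation is the population standard deviation. Synop Analyzer may use different conventions.

```python
from statistics import mean, pstdev


def two_step_statistics(histories, first_set, second_set):
    """Support and time step statistics of the pattern first_set -> second_set.

    histories: {entity: [(time, item_set), ...]}, each list sorted by time.
    """
    steps = []
    for history in histories.values():
        start = next((t for t, items in history if first_set <= items), None)
        if start is None:
            continue
        end = next((t for t, items in history if t > start and second_set <= items), None)
        if end is not None:
            steps.append(end - start)
    absolute = len(steps)
    return {
        "absolute_support": absolute,
        "relative_support": absolute / len(histories) if histories else 0.0,
        "mean_step": mean(steps) if steps else None,
        "std_step": pstdev(steps) if steps else None,
    }


# Invented histories of four entities.
histories = {
    "C1": [(0, {"a"}), (4, {"b"})],
    "C2": [(0, {"a"}), (6, {"b"})],
    "C3": [(0, {"b"}), (1, {"a"})],  # wrong order, does not support the pattern
    "C4": [(0, {"c"})],
}
stats = two_step_statistics(histories, {"a"}, {"b"})
```

With these invented data, two of the four entities support the pattern, and the time steps 4 and 6 yield a mean of 5 and a standard deviation of 1.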

[Screenshot: seqpat_resultview_978.png]

The itemsets describing numeric data field values contain, in addition to the value range limits, extra information within curly braces: the position of the value range within the overall value distribution of the numeric data field. For example, the text Age=[20..30[ {=3(10)} means that the age range from 20 (incl.) to 30 (excl.) is the third smallest out of 10 value ranges, hence the age value is below average but not strongly below average.
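If exported patterns are processed further outside of Synop Analyzer, such labels can be split into their parts. The following Python sketch parses the notation shown above. Only the form [20..30[ is documented here; the reading of the other bracket combinations (an outward-facing bracket marks an excluded limit) is an assumption, as are the function name and the numeric format of the limits.

```python
import re

_RANGE_LABEL = re.compile(
    r"^(?P<field>[^=]+)=(?P<open>[\[\]])(?P<low>.+?)\.\.(?P<high>.+?)(?P<close>[\[\]])"
    r"\s*\{=(?P<rank>\d+)\((?P<total>\d+)\)\}$"
)


def parse_range_label(label):
    """Split a label such as 'Age=[20..30[ {=3(10)}' into its components."""
    match = _RANGE_LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized range label: {label!r}")
    return {
        "field": match["field"],
        "low": float(match["low"]),
        "high": float(match["high"]),
        "low_inclusive": match["open"] == "[",    # '[20' includes 20
        "high_inclusive": match["close"] == "]",  # '30[' excludes 30
        "rank": int(match["rank"]),               # position of the range, 1 = smallest
        "total_ranges": int(match["total"]),
    }


age_range = parse_range_label("Age=[20..30[ {=3(10)}")
```

For the example label, the result describes the field Age, the limits 20 (included) and 30 (excluded), and rank 3 of 10 value ranges.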

The numbers in the table column set frequencies contain the absolute supports of the different itemsets of the pattern, in the same order in which the itemset names appear in the columns at the right end of the result table.

In the tool bar tab Result introspection the following options are available:


Applying sequence models to new data ('Scoring')

Sequence models can be applied to new data in order to create predictions on these data. For example, a sequence model could use the click history of a web shop user to decide which product offers or banners are to be shown to this user next. Another sequence model could serve as an early warning system in a production process, predicting upcoming problems and faulty products. This application of sequence models to new data for predictive purposes is called 'scoring'.

In the current version of Synop Analyzer, sequence models must satisfy a certain precondition for being usable for scoring: all sequences in the model must have rule heads (final parts of the sequence) containing values of one single data field. This data field is called the target field of the model. In the sample applications cited above (web shop, production monitoring), the target fields could be ARTICLE or ERROR.

If all rules of the model only contain information ('items') from one single data field, the precondition for scoring is trivially satisfied. If not, you can enforce the precondition by defining one or more required items of type Sequence end when training the model. In this case, you must make sure all required head items are values or value ranges of one single data field.
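The precondition can be expressed as a simple check on the rule heads of a model. In the following Python sketch, items are written as 'FIELD=value' strings; this notation and the function name are assumptions made for the illustration and do not reflect the model format used by Synop Analyzer.

```python
def target_field(rule_heads):
    """Return the single data field used by all rule heads, or None.

    rule_heads: iterable of item collections, one per sequence rule,
    with items written as 'FIELD=value'.
    """
    fields = {item.split("=", 1)[0] for head in rule_heads for item in head}
    if len(fields) != 1:
        return None
    return next(iter(fields))


# Both heads use the field ARTICLE, so the model would be usable for scoring.
usable = target_field([{"ARTICLE=champagne"}, {"ARTICLE=beer 3"}])
# Heads from two different fields violate the precondition.
not_usable = target_field([{"ARTICLE=champagne"}, {"ERROR=E17"}])
```

A return value of None signals that the heads mix several data fields, or that the model contains no rule heads at all.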

To load and apply a sequence model, first open and read the new data, then press the button seqpat to start the sequential patterns analysis module, and finally click the button Load model in the tab Scoring Settings of the tool bar at the lower end of the panel.

In the following sections we will demonstrate the process of sequence rule scoring with the help of a concrete example use case: using a sequence model we want to identify suitable customers for a marketing campaign for a certain premium product: champagne.

For this purpose, we load the sample data doc/sample_data/RETAIL_PURCHASES.txt. We assume that these data have been imported into Synop Analyzer as described in Name mappings and Taxonomies, that means with PURCHASE_ID as group field, CUSTOMER_ID as entity field, DATE as order field, PRICE as weight field and with doc/sample_data/RETAIL_NAMES_DE_EN.txt as article names and doc/sample_data/RETAIL_ARTICLEGROUPS.txt as article hierarchies.

Then we start the sequential patterns analysis module. We first want to train a sequence model and then apply it. We specify the following settings for the model to be created:

The sequence model trained with these settings contains one single sequence. The sequence states that customers who have purchased a specific beer ('beer 3') have a probability of 80% for purchasing champagne in the 1 to 14 days after buying the beer.
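The probability quoted for such a rule can be read as a confidence: the share of entities supporting the rule body which also show the rule head within the allowed time window. The following Python sketch computes this value under the assumption that any body purchase may start the time window; the function name and all purchase dates are invented for the illustration and are not taken from RETAIL_PURCHASES.txt.

```python
from datetime import date


def rule_confidence(histories, body_item, head_item, min_days, max_days):
    """Share of entities with body_item which show head_item min_days to max_days later."""
    body_entities = 0
    rule_entities = 0
    for history in histories.values():
        body_dates = [day for day, items in history if body_item in items]
        if not body_dates:
            continue
        body_entities += 1
        head_dates = [day for day, items in history if head_item in items]
        if any(min_days <= (h - b).days <= max_days for b in body_dates for h in head_dates):
            rule_entities += 1
    return rule_entities / body_entities if body_entities else None


# Invented purchase histories: {customer: [(date, items), ...]}.
histories = {
    "C1": [(date(2024, 1, 1), {"beer 3"}), (date(2024, 1, 5), {"champagne"})],
    "C2": [(date(2024, 1, 1), {"beer 3"}), (date(2024, 1, 15), {"champagne"})],
    "C3": [(date(2024, 1, 2), {"beer 3"}), (date(2024, 1, 3), {"champagne"})],
    "C4": [(date(2024, 1, 1), {"beer 3"}), (date(2024, 1, 20), {"beer 3"}),
           (date(2024, 1, 25), {"champagne"})],
    "C5": [(date(2024, 1, 1), {"beer 3"}), (date(2024, 1, 30), {"champagne"})],
    "C6": [(date(2024, 1, 1), {"champagne"})],
}
confidence = rule_confidence(histories, "beer 3", "champagne", 1, 14)
```

In this invented data set, five customers bought 'beer 3' and four of them bought champagne within 1 to 14 days afterwards, so the confidence is 4/5.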

[Screenshot: seqpat_scoring_train_826.png]

Now we want to use the generated model for identifying the most susceptible customers for an advertising campaign for champagne within our small sample database RETAIL_PURCHASES.txt of 24 customers.

We move to the tab Scoring Parameters in the tool bar of the sequential patterns analysis module. Here, we enter the name of the file in which the scoring results are to be stored (scored_PURCHASES.txt), we define the scoring result data fields to be contained in that file, and we specify that the new file should be a copy of the existing in-memory data source plus the newly computed data fields (Create new data, original plus computed fields). Since all sequence rules in our model predict the same value (champagne), we do not need a new data field Predicted field. Instead, we are interested in the predicted probability of that value; therefore, we define a Confidence field and call it CHAMPAGNE_CONF. To be able to identify the individual customers in the new data, we make sure the group field PURCHASE_ID (and automatically also the attached entity field CUSTOMER_ID) is contained in the new data.
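The layout of such a result, original fields plus one computed confidence field, can be sketched in Python as follows. This is a strongly simplified illustration and not the scoring procedure of Synop Analyzer: it assumes that the articles are stored in a field named ARTICLE, it ignores the time window of the rule, and it leaves the confidence empty for customers who do not match the rule body. The function names are hypothetical.

```python
import csv
import io


def score_rows(rows, entity_field, item_field, body_item, confidence, conf_field):
    """Copy every row and append conf_field for entities matching the rule body."""
    matching = {row[entity_field] for row in rows if row[item_field] == body_item}
    return [
        {**row, conf_field: confidence if row[entity_field] in matching else ""}
        for row in rows
    ]


def to_delimited_text(rows, delimiter="\t"):
    """Render the scored rows as delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented rows; the confidence 0.8 is the value of the single rule in the model.
rows = [
    {"CUSTOMER_ID": "C1", "PURCHASE_ID": "P1", "ARTICLE": "beer 3"},
    {"CUSTOMER_ID": "C1", "PURCHASE_ID": "P2", "ARTICLE": "milk"},
    {"CUSTOMER_ID": "C2", "PURCHASE_ID": "P3", "ARTICLE": "milk"},
]
scored = score_rows(rows, "CUSTOMER_ID", "ARTICLE", "beer 3", 0.8, "CHAMPAGNE_CONF")
text = to_delimited_text(scored)
```

The text in `text` could then be written to a file such as scored_PURCHASES.txt.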

[Screenshot: seqpat_scoring_params_836.png]

By means of the button Start scoring we create the scoring results, write the desired result file to disk and open the resulting data as a new in-memory data source in Synop Analyzer, that means as a new tab in the left column of the Synop Analyzer workbench.

We introspect the scoring result data with the module 'multivariate exploration'. We see that the model has identified 10 of the 24 customers as susceptible to champagne:

[Screenshot: seqpat_scoring_multivar_821.png]

Via the button show data we submit the selected 10 customer IDs to a last visual examination. Then we can use the button Export to save the resulting list to a flat file or Excel spreadsheet, or we can use the main menu button Report to create an HTML or PDF report.

[Screenshot: seqpat_scoring_result_388.png]