The Sequential Patterns Analysis module
Content
Purpose and short description
Input data formats
Definitions and notations
Basic parameters
Pattern content constraints ('item filters')
Pattern statistics constraints ('numeric filters')
Pattern count constraints
Pattern verification and significance assurance
Storing and reusing Sequences training tasks
The Sequences result panel
Applying sequences models to new data ('Scoring')
Introduction to Sequential Patterns Analysis
Sequential patterns analysis is a variant of associations analysis which is suitable for data containing a time stamp or a more general data field with ordering information.
Within Synop Analyzer, the sequential patterns analysis module is started using the button in the left screen column. The button is only active on input data on which an 'entity' field, an 'order' field and a 'group' field have been defined. The Group field and the Order field can be identical. In this case, duplicate the data field in the active fields dialog and specify the original data field as the group field and the duplicated field as the order field.
The result of a sequential patterns analysis is a sequences model, that means a collection of sequential patterns which have been detected during the sequences training run on the training data set. The model can be applied to a new data source in a so-called sequences scoring step. In Synop Analyzer's sequential patterns analysis panel, you can visualize and introspect the sequences model in tabular form, sort, filter and export the filtered results to flat files or into the inter-vendor standard XML format PMML. Furthermore, you can explore and export the support of selected sequential patterns, that means the data sets on which the selected patterns occur.
In the following sections, we will refer to many notations and concepts which have been introduced and explained in the documentation chapter on associations analysis, in particular in the section Definitions and notations of that chapter. Therefore, we recommend reading that chapter and becoming familiar with the concepts of associations analysis before starting to use the sequential patterns analysis module.
Unlike an association pattern, a sequential pattern or sequence is a time-ordered combination of several sets of items, a so-called sequence of item sets, in which the items within each item set occur at the same time and consecutive item sets are separated by time steps larger than zero.
An example for a sequence is the following one, based on supermarket purchase data:
(diapers size 1 (new born) & baby cleansing tissues) →[4±1 months]→ baby food 4th-6th month
The sequence consists of two item sets and expresses the fact that a certain group of supermarket customers starts buying diapers size 1 and soft baby cleansing tissues at a certain point of time, and that the same customers often start buying baby food for 4 to 6 months old babies about 4 months, plus or minus one month, after buying their first diapers and baby tissues.
A sequence rule is a sequence in which the last time step is interpreted as the separation between the rule body (left hand side) and the rule head (right hand side).
The table below lists typical use cases for sequential patterns analysis [Ballard, Rollins, Dorneich et al., Dynamic Warehousing: Data Mining made easy]:
industry | use case | entity field | group field | typical body item | typical head item |
retail | upselling analysis | customer ID | bill ID or purchase ID | a purchased article | another purchased article |
manufacturing | quality assurance | product (e.g. vehicle ID) | process step or timestamp | component, production condition | problem, error ID |
medicine | medical study evaluation | patient or test person | treatment step or date | single treatment info | medical impact |
Input data formats
As mentioned in the first section of this chapter, each data source on which a sequential patterns analysis is to be performed must contain a so-called entity field and an order or timestamp field. These fields must have been declared in the active fields dialog of the input data panel. The entity field contains the subjects (entities) on which time-ordered patterns have been observed, e.g. customers, vehicles, or patients.
Another required property of the data is that they are sorted by entity field values and, if available, by group field values. If the data are read from a database, Synop Analyzer automatically assures that property by issuing a SELECT statement with an appropriate ORDER BY clause. If the data are read from a flat file or from a spreadsheet, the user is responsible for bringing the data into the correct order. Synop Analyzer will issue a warning message if the data are not correctly ordered.
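For flat files, the required ordering can be produced beforehand. The following Python sketch sorts tab-separated rows by entity and then group field; the column names come from the sample file RETAIL_PURCHASES.txt, while the miniature in-memory data is invented for illustration:

```python
import csv
import io

# Stand-in for a tab-separated flat file (column names as in the sample data).
raw = io.StringIO(
    "CUSTOMER_ID\tPURCHASE_ID\tDATE\tARTICLE\n"
    "c2\tp5\t2021-03-01\tmilk\n"
    "c1\tp2\t2021-01-09\tbeer 3\n"
    "c1\tp1\t2021-01-02\tdiapers\n"
)

# Sort by entity field first, then by group field, as the module requires.
rows = sorted(csv.DictReader(raw, delimiter="\t"),
              key=lambda r: (r["CUSTOMER_ID"], r["PURCHASE_ID"]))

[(r["CUSTOMER_ID"], r["PURCHASE_ID"]) for r in rows]
# → [('c1', 'p1'), ('c1', 'p2'), ('c2', 'p5')]
```

Writing the sorted rows back to a file (e.g. with csv.DictWriter) yields input that passes Synop Analyzer's ordering check.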
If these prerequisites are fulfilled, Synop Analyzer's sequential patterns analysis module is prepared for working with three different data formats:
-
The transactional or pivoted data format:
Often, the input data for a sequences analysis are available in a format in which one column, the so-called group field, contains transaction IDs, and one or more additional fields, the so-called item fields, contain the items, i.e. the information on which associations are to be detected.
The file doc/sample_data/RETAIL_PURCHASES.txt is an example for such a data format: the field PURCHASE_ID is the group field, the field ARTICLE contains the real information, namely the IDs of the purchased articles; the field CUSTOMER_ID is the entity field and DATE the order field.
In the transactional data format, the items appearing in the detected sequential patterns are a combination of field name and field value if there is more than one item field; the name of the item field is omitted if all items come from one single field (such as the field ARTICLE).
-
The data format with Boolean fields:
You can also detect sequential patterns on input data which do not have a group field (that means each data row represents a separate transaction) and in which each single 'item', i.e. each single event or fact, has its own two-valued (Boolean) data field which indicates whether or not the item occurs in the transaction.
If the field PURCHASE_ID were missing in the sample data doc/sample_data/RETAIL_PURCHASES.txt and if there were a separate data field for each existing article ID which contained either 0 or 1, depending on whether or not the corresponding article was purchased in the transaction represented by the current data row, then the data would have the data format with Boolean fields.
If Synop Analyzer detects a data format with Boolean fields, it interprets all Boolean field values starting with '0', '-', 'F' or 'f' (such as 'false'), 'N' or 'n' (such as 'no' or 'n/a') as indicators for 'item does not occur in the transaction'; all other values are interpreted as 'item occurs in the transaction'.
In the data format with Boolean fields the items appearing in the detected patterns contain only the names of the Boolean fields, but not the field values such as 'YES' or '1'.
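The interpretation rule for Boolean field values can be sketched as follows (an illustrative Python function, not Synop Analyzer's actual code):

```python
def is_false_indicator(value):
    # The rule stated above: values starting with '0', '-', 'F'/'f'
    # (e.g. 'false') or 'N'/'n' (e.g. 'no', 'n/a') mean the item does
    # NOT occur in the transaction; every other value means it occurs.
    return value[:1] in ("0", "-", "F", "f", "N", "n")

[v for v in ("0", "false", "n/a", "1", "YES") if not is_false_indicator(v)]
# → ['1', 'YES']
```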
-
The 'normal' or broad data format:
Of course, Synop Analyzer can also detect sequential patterns on 'normal' data in which each single data row is considered one data group and in which there are different data fields of various types which contain the items.
On these data, the items appearing in the detected patterns always have the form 'field_name=field_value'.
A general rule, which is valid on all data formats, is: the items which form the detected sequences can only come from active data fields which have not been marked as 'group', 'entity', 'order' or 'weight'. 'entity' and 'group' field values serve to define data groups covering more than one data row, information from 'order' fields is used to attach a time stamp to each item, and information from 'weight' fields is used to calculate pattern weight coefficients.
Definitions and notations
A sequence or sequence rule can be characterized by the following properties: [Ballard, Rollins, Dorneich et al., Dynamic Warehousing: Data Mining made easy]
-
The items
which are contained in the rule body, in the rule head, or in the entire rule.
-
Categories of the contained items.
Often, an additional hierarchy or taxonomy for the items is known. For example, the items 'milk' and 'baby food' might belong to the category 'food', 'diapers' might belong to the category 'non-food'. 'axles=17-B' and 'engine=AX-Turbo 2.3' might be members of the category 'component', 'production_chain=3' of category 'production condition', and 'delay > 2 hours' of category 'error state'. Hence, the second sample rule can be characterized by the fact that its body contains components or production conditions, and its head an error state.
-
The support of the pattern:
Absolute support S is defined as the total number of entities for which the rule holds. Support or relative support s is the fraction of all entities for which the rule holds. Note that this is different from the definition of support of an association pattern, which is defined in terms of data groups (transactions), not in terms of entities.
-
The confidence of the pattern when interpreted as a rule:
Confidence C is defined as C := s(body →[dt]→ head) / s(body).
-
The lift of the pattern:
The lift L of a pattern Itemset1 →[dt1]→ ... →[dtn-1]→ Itemsetn is defined as L := s(Itemset1 →[dt1]→ ... →[dtn-1]→ Itemsetn) / (s(Itemset1) * ... * s(Itemsetn)).
When interpreting the lift value of a sequence, we cannot simply state, in analogy to what we have done for association patterns, that lift > 1 (< 1) means the pattern appears more (less) frequently than expected assuming that all involved items are statistically independent. The problem is that in the numerator of the lift formula given above, we do not count all common occurrences of all involved items but only the occurrences in the correct time order.
Therefore, an interpretation of lift values is difficult. One can, however, say that a lift value greater than 0.5 always stands for a positive correlation of the involved items in the given time ordering. Apart from that, lift values should only be used for comparisons ('this sequence is more positively correlated than that sequence'), and these comparisons should only be drawn between sequences with the same number of items and the same number of time steps.
-
The purity of the sequence pattern:
The purity P of a sequence Itemset1 →[dt1]→ ... →[dtn-1]→ Itemsetn is defined as P := s(Itemset1 →[dt1]→ ... →[dtn-1]→ Itemsetn) / max_{i=1..n}( s(Itemset_i) ).
P = 1 means that the pattern describes a 'perfect sequence': none of the parts Itemset_i ever occurs on any entity without all the other parts in the time ordering defined by the sequence.
-
The weight (cost, price) of the pattern:
If a weight field has been defined on the input data, we can calculate the weight of a sequence as the average of the summed weights of the entities which support the sequence.
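The support, confidence, lift and purity definitions above can be illustrated on a toy data set. The following Python sketch is a simplified model for illustration only: the entity data and the greedy subsequence matching are assumptions, and the real tool's itemset and time step semantics are richer.

```python
# Toy data: each entity's history as time-ordered (time, itemset) pairs.
entities = {
    "c1": [(1, {"diapers"}), (5, {"baby food"})],
    "c2": [(1, {"diapers"}), (4, {"baby food"})],
    "c3": [(2, {"baby food"})],
    "c4": [(1, {"diapers"})],
}

def occurs(history, pattern):
    # Greedy check: the pattern's itemsets must appear in strictly
    # increasing time order, each contained in some transaction.
    t_prev = float("-inf")
    for wanted in pattern:
        for t, items in history:
            if t > t_prev and wanted <= items:
                t_prev = t
                break
        else:
            return False
    return True

def s(pattern):
    # Relative support: fraction of entities on which the pattern occurs.
    return sum(occurs(h, pattern) for h in entities.values()) / len(entities)

seq = [{"diapers"}, {"baby food"}]
support    = s(seq)                                  # entities c1 and c2 → 0.5
confidence = s(seq) / s(seq[:-1])                    # rule head given rule body
lift       = s(seq) / (s([seq[0]]) * s([seq[1]]))
purity     = s(seq) / max(s([p]) for p in seq)
```

With these four entities, the sequence holds on c1 and c2, each single itemset on three of the four entities, so confidence and purity both come out as 2/3.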
Basic parameters for a sequential patterns analysis
In Synop Analyzer, a sequences analysis is started by loading a data source - the so-called training data - into memory and by clicking on the button in the input data panel on the left side of the Synop Analyzer GUI. The button opens a panel named Sequences Detection. In the lower part of this panel, you can specify the settings for a sequential patterns analysis and start the search. The detection process itself can be a long-running task, therefore it is executed asynchronously in several parallel background threads. In the upper part of the panel, the detected sequences - the so-called sequence model - are displayed.
The following paragraphs and screenshots demonstrate the handling of the various sub-panels and buttons using the sample data doc/sample_data/RETAIL_PURCHASES.txt. We assume that these data have been imported into Synop Analyzer as described in Name mappings and Taxonomies, that means with PURCHASE_ID as group field, CUSTOMER_ID as entity field, DATE as order field, PRICE as weight field and with doc/sample_data/RETAIL_NAMES_DE_EN.txt as article names and doc/sample_data/RETAIL_ARTICLEGROUPS.txt as article hierarchies.
The first visible tab in the toolbar at the lower end of the screen contains the most important parameters for sequential patterns analysis.
In the screenshot, the following settings were specified:
-
The detected sequential patterns will be saved under the name assoc_PURCHASES.mdl in the current working directory. By default, the created file will be in a proprietary binary format. But you can also save the file as a <TAB> separated flat text file, which can be opened in any text editor or spreadsheet processor such as MS Excel. Using the main menu item Preferences→Sequences Preferences you can switch the output format, for example to the inter-vendor XML standard for data mining models, PMML.
-
The currently specified settings will automatically be saved to an XML parameter file named assoc_params_PURCHASES.xml every time the button Start training is pressed. The resulting XML file can be reloaded in a later Synop Analyzer session via the main menu item Analysis→Run sequences analysis. This reproduces exactly the currently active parameter settings and data import settings.
-
The patterns to be detected should consist of up to 3 parts (itemsets) involving up to 2 time steps. When specifying the parameters for a sequential patterns analysis, you should always specify an upper boundary for the desired sequence lengths, otherwise the analysis can take an extremely long time.
-
The patterns to be detected should contain up to 3 items. When specifying the parameters for a sequential patterns analysis, you should always specify an upper boundary for the number of items, otherwise the analysis can take an extremely long time.
-
The single parts (itemsets) within the patterns to be detected should consist of up to 2 items. This setting is redundant here, since we have already specified that the sequences to be detected should contain 1 or 2 time steps and that the total number of items should not exceed 3. Therefore, itemsets of more than 2 items are not possible anyway.
-
The patterns to be detected should occur in at least 5 entities. When specifying the parameters for a sequences analysis, you should always specify a lower boundary for the absolute or relative support, otherwise the training can take an extremely long time.
-
The upper limit for the number of patterns to be detected and displayed is set to 1000. If more patterns are found, the 1000 patterns with the highest values of the measure currently specified in the selector box Sorting criterion will be selected. In our example, the 1000 patterns with highest lift will be selected.
Pattern content constraints ('item filters')
Filter criteria defining the desired content of the patterns to be detected can be specified using the second tab named Item filters of the bottom part of the sequential patterns analysis screen. The tab itself displays how many content filter criteria of the various types have been set; the specification of new content filter criteria is performed within pop-up dialogs which open when one presses one of the buttons in the tab.
-
The three buttons named Required items (group n) define items which must occur in each detected pattern. If several item patterns are specified within one 'required group', at least one of them must appear in each detected sequence. In the sequential patterns analysis module, up to 3 different groups of required items can be specified. The detected patterns must contain at least one item out of every specified group.
Each item specification can contain wildcards (*) at the beginning, in the middle and/or at the end. A wildcard stands for an arbitrary number of arbitrary characters, including none. The spelling of the items with upper case and lower case letters and empty spaces must exactly match the spelling of the field names and value names as it is displayed in the module. You can either type the desired values into the input field, or you can select one or more values from a drop-down list of all available items in the data by pressing the arrow symbol at the right edge of the input field.
As the first required item group in our example we specify '*car tire*' and '*windscreen wiper*'. That means we look for patterns which involve customers who have bought car equipment such as tires or windscreen wipers. We enter each text into the editor field of the pop-up dialog and then press Add. After closing the pop-up dialog we set the desired position of the required items to 'at the end of the sequence'. Hence, we want to find sequences of product purchases which lead to the purchase of car equipment at the end.
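The wildcard semantics described above can be sketched with Python's fnmatch module (a hypothetical re-implementation for illustration; Synop Analyzer's actual matching code is not public):

```python
import fnmatch

# Our first group of required items from the example above.
required_group = ["*car tire*", "*windscreen wiper*"]

def matches_any(item, patterns):
    # fnmatchcase keeps the matching case-sensitive, as the manual
    # requires exact spelling of field and value names.
    return any(fnmatch.fnmatchcase(item, p) for p in patterns)

matches_any("summer car tires 205/55", required_group)   # True
matches_any("milk", required_group)                      # False
```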
-
We could specify two more groups of required items, but in our example we do not make use of this possibility.
-
Suppressed Items are items which are to be ignored during the pattern search. In our example, we do not use this feature.
-
If a pair of items or item groups has been specified as incompatible (by pairs), then none of the detected sequences will contain more than one item out of this set. In the text field of the pop-up dialog, you can enter several patterns, separated by comma (,) without adjacent spaces. If a pattern contains a comma as part of the pattern name, escape it by a backslash (\). Each pattern can contain one or more wildcards (*) at the beginning, in the middle and/or at the end.
In general it is reasonable to specify items from highly correlated data fields as 'incompatible'. Otherwise one would obtain many patterns with very high lift values in which one item from each of the two highly correlated fields appears. These trivial patterns might overshadow the truly interesting, non-trivial patterns. In our example, we do not use this feature.
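Parsing such a comma-separated pattern list with escaped commas can be sketched as follows (an illustrative snippet; the tool's actual parser is an assumption):

```python
import re

# A pattern list where '\,' escapes a comma that is part of an item name.
text = r"beer\, alcohol-free,champagne,*wine*"

# Split on commas that are NOT preceded by a backslash, then unescape.
patterns = [p.replace("\\,", ",") for p in re.split(r"(?<!\\),", text)]
# → ['beer, alcohol-free', 'champagne', '*wine*']
```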
-
The item pair purity of two items i1 and i2 is the number of entities on which both items occur, divided by the maximum of the absolute supports of the two items. Item pairs with a purity of 1 are 'perfect pairs': whenever i1 occurs on an entity, i2 also occurs on it, and vice versa.
Defining an upper limit for the permitted item pair purity is therefore an alternative means for specifying many single incompatible item pairs. It serves to suppress all trivially highly correlated item pairs from the sequential patterns analysis.
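The item pair purity can be computed as sketched below (the toy entity data is invented for illustration):

```python
# Toy data: the set of items observed on each entity.
entity_items = {
    "c1": {"beer 3", "champagne"},
    "c2": {"beer 3", "champagne"},
    "c3": {"beer 3"},
    "c4": {"milk"},
}

def pair_purity(i1, i2):
    # Absolute supports of the two items and of their co-occurrence.
    s1 = sum(i1 in items for items in entity_items.values())
    s2 = sum(i2 in items for items in entity_items.values())
    both = sum(i1 in items and i2 in items for items in entity_items.values())
    return both / max(s1, s2)

pair_purity("beer 3", "champagne")   # 2 common entities / support 3 → 2/3
```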
Advanced pattern statistics constraints
The third tab at the lower end of the screen, Advanced Parameters, provides 9 parameters which serve for fine-tuning the detected pattern set based on certain statistical measures.
-
The relative support of the patterns to be detected in our example must be at least 0.1, i.e. 10% of all entities. When specifying the parameters for a sequential patterns training, you should always specify a lower boundary for the absolute or relative support, otherwise the training can take an extremely long time. In our example, however, setting the minimum relative support to 0.1 has no real effect and is redundant, since we have already specified a minimum absolute support of 5, which is more than 10% of all 24 entities (customers) contained in the data.
-
The relative support of an item is the item's absolute support divided by the total number of entities. In other words, the relative support is the a-priori probability that the item occurs for a randomly selected entity. Items which appear with (almost) every entity often represent trivial information which one does not want to find in the detected patterns. In our example, we have specified an upper boundary of 0.8 in order to suppress items which occur on at least 80% of all entities.
-
The confidence of a sequence rule is the ratio between the rule's support and the rule body's support. A sequence rule is a sequence of n itemsets separated by n-1 time steps in which the first n-1 of the n itemsets are considered the 'rule body' and the last itemset is considered the 'rule head'. A rule's confidence is the probability that the rule head is true if one knows for sure that the entire rule body is true. In our example, we have specified that we only want to search for sequences with a confidence value of at least 0.2.
-
Next, in our example we want to find only patterns whose lift is at least 0.9. Hence, we are interested only in 'frequent' patterns with a positive correlation of the involved itemsets in the time order defined by the sequence.
-
The patterns consisting of more than two parts (itemsets) must have lift increase factors of at least 0.9. That means, a longer sequence should only be formed if the prolongation increases the positive correlation of all involved item sets.
The specification of an upper or lower limit for the lift increase factor often is a very effective means for preventing the set of detected patterns from growing too big and for suppressing the appearance of 'redundant', trivial extensions of relevant patterns by just appending arbitrary itemsets to them.
-
The weight of a sequence is the mean weight of all entities on which the sequence occurs. A minimum or maximum threshold for the sequences' weights can only be specified if a weight field has been defined on the input data. We specify a minimum weight of 100, that means we only want to find sequences which apply to customers with a purchase history of at least 100 EUR in our supermarket.
-
The parameter minimum child support ratio defines a boundary for the acceptable 'support shrinking rate' when creating expanded sequences out of existing sequences by adding an additional item. An expanded sequence of n items will be rejected if at least one of its possible parent sequences has a support which, multiplied with the minimum shrinking rate, is larger than the actual support of the expanded sequence. In our example, we have specified the value 0.25. That means we suppress the formation of patterns whose support is less than 25% of the support of a possible parent pattern.
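The pruning rule can be sketched as follows (an illustrative check; the supports are made-up absolute counts):

```python
def keep_expansion(child_support, parent_supports, min_ratio=0.25):
    # Reject the expanded pattern if its support shrank below
    # min_ratio times the support of any possible parent pattern.
    return all(child_support >= min_ratio * p for p in parent_supports)

keep_expansion(8, [30, 20])   # True:  8 >= 0.25 * 30
keep_expansion(6, [30, 20])   # False: 6 <  0.25 * 30
```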
-
The two parameters named Time step limits permit specifying a lower and an upper boundary for the duration of the single time steps which form the sequences to be detected. You should enter a pure number without a time unit. The suitable time unit is chosen automatically by the software: days if the order field contains dates, seconds if it contains time stamps, and years if it contains year numbers. In our example, we specify that the time differences between the purchases in our patterns should be between 1 and 10 days.
-
The parameter maximum number of threads specifies an upper limit for the number of parallel threads used for reading and compressing the data. If no number or a number smaller than 1 is given here, the maximum available number of CPU cores will be used in parallel.
Result display options
The fourth tab within the tool bar at the lower border of the sequences analysis window offers some capabilities to introspect and export the generated patterns and the entities on which they appear. Some of the buttons only become enabled if you have selected one or more patterns by mouse clicks in the result table above the tool bar.
The screenshot shown below results if one applies the parameter settings described in the previous sections, presses the button Start training in the first tab and finally selects the first resulting pattern by a left mouse click.
The tabular view of detected patterns contains the statistical measures of each pattern and its content, the itemsets which form the pattern. The most important statistical measures are, from left to right: the number of items in the pattern, the sequence length, that means the number of itemsets in the pattern, the pattern's absolute and relative support, the absolute supports of the involved itemsets, the lift, purity and weight, and finally the list of itemsets which form the pattern.
If the user has specified a time step limit in the third tab of the bottom tool bar (in our example, that has been the case), then the result table also contains time step information. Each time step information contains the mean and the standard deviation of the time step duration measured on the training data.
The itemsets describing numeric data field values contain, in addition to the value range limits, an extra piece of information within curly braces: the position of the value range within the overall value distribution of the numeric data field. For example, the text Age=[20..30[ {=3(10)} means that the age range from 20 (incl.) to 30 (excl.) is the third smallest out of 10 value ranges; hence the age value is below average but not strongly below average.
The numbers in the table column set frequencies contain the absolute supports of the different itemsets of the pattern, in the same order in which the itemset names appear in the columns at the right end of the result table.
In the tool bar tab Result introspection the following options are available:
-
The information displayed at the left end of the tab contains the name of the data source and the number of patterns which are currently selected.
-
The next vertical pair of radio buttons determines what happens if several sequences have been selected and then the button is pressed. The button's purpose is to display those entities which support the selected sequences. The question is: does this mean the intersection or the union of the supports of the single selected patterns? This question is answered by the choice made in these radio buttons.
-
The second vertical pair of radio buttons has a similar function to the pair next to it: it specifies whether pressing the button displays entire data sets or only the entity IDs of the entities which support the selected patterns.
-
The button opens an additional window which shows the data groups on which the currently selected sequential patterns occur. Whether the new window contains full-width data records or only entity IDs, and whether it contains the intersection or the union of the data sets supporting the single selected patterns, is defined by the radio buttons described above.
-
The button opens an additional window in which the data groups on which the currently selected patterns occur can be visually explored. Whether the new window contains the intersection or the union of the data groups supporting the single selected patterns, is defined by the radio buttons described above. The new window provides the entire functionality of the module multivariate analysis.
-
Using the button you can export the currently selected patterns, or all patterns if none has been selected, into a <TAB> separated flat text file, into a PMML SequenceModel or into a series of SQL SELECT statements.
-
Using the button you can export the data groups supporting the currently selected patterns into a <TAB> separated flat text file or into a spreadsheet in .xlsx format.
Applying sequence models to new data ('Scoring')
Sequence models can be applied to new data in order to create predictions on these data. For example, a sequence model could use the click history of a web shop user to decide which product offers or banners are to be shown to this user next. Another sequence model could serve as an early warning system in a production process, predicting upcoming problems and faulty products. This application of sequence models to new data for predictive purposes is called 'scoring'.
In the current version of Synop Analyzer, sequence models must satisfy a certain precondition for being usable for scoring: all sequences in the model must have rule heads (final parts of the sequence) containing values of one single data field. This data field is called the target field of the model. In the sample applications cited above (web shop, production monitoring), the target fields could be ARTICLE or ERROR.
If all rules of the model only contain information ('items') from one single data field, the precondition for scoring is trivially satisfied. If not, you can enforce the precondition by defining one or more required items of type Sequence end when training the model. In this case, you must make sure all required head items are values or value ranges of one single data field.
You load and apply a sequence model by first opening and reading the new data, then pressing the button in order to start the sequential patterns analysis module, and finally clicking the button Load model in the tab Scoring Settings of the tool bar at the lower end of the panel's GUI window.
In the following sections we will demonstrate the process of sequence rule scoring with the help of a concrete example use case: using a sequence model we want to identify suitable customers for a marketing campaign for a certain premium product: champagne.
For this purpose, we load the sample data doc/sample_data/RETAIL_PURCHASES.txt. We assume that these data have been imported into Synop Analyzer as described in Name mappings and Taxonomies, that means with PURCHASE_ID as group field, CUSTOMER_ID as entity field, DATE as order field, PRICE as weight field and with doc/sample_data/RETAIL_NAMES_DE_EN.txt as article names and doc/sample_data/RETAIL_ARTICLEGROUPS.txt as article hierarchies.
Then we start the sequential patterns analysis module. We first want to train a sequence model and then apply it. We specify the following settings for the model to be created:
-
Required item champagne of type Sequence end,
-
In the toolbar tabs Analysis settings and Advanced Parameters, we specify a minimum absolute support of 7, a minimum lift of 1.2, minimum lift increase factor of 1.0 and a permitted time step size between 1 and 14 days.
The sequence model trained with these settings contains one single sequence. The sequence states that customers who have purchased a specific beer ('beer 3') have a probability of 80% of purchasing champagne within 1 to 14 days after buying the beer.
Now we want to use the generated model for identifying the most susceptible customers for an advertising campaign for champagne within our small sample database RETAIL_PURCHASES.txt of 24 customers.
We move to the tab Scoring Parameters in the tool bar of the sequential patterns analysis module. Here, we enter the name of the file in which the scoring results are to be stored (scored_PURCHASES.txt), we define the scoring result data fields to be contained in that file, and we specify that the new file should be a copy of the existing in-memory data source plus the new computed data fields (Create new data, original plus computed fields).
Since all sequence rules in our model predict the same value (champagne), we do not need a new data field Predicted field. Instead, we are interested in the predicted probability of that value; therefore we define a Confidence field and call it CHAMPAGNE_CONF. To be able to identify the single customers in the new data, we make sure the group field PURCHASE_ID (and automatically also the attached entity field CUSTOMER_ID) is contained in the new data.
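What the scoring step computes per customer can be sketched as follows (a heavily simplified illustration of our one-rule model; time step constraints are ignored and the rule representation is an assumption, not Synop Analyzer's internal format):

```python
# One-rule model from the training run above: body items, head value,
# and the rule's confidence.
model = [({"beer 3"}, "champagne", 0.8)]

def score(purchased_items):
    # CHAMPAGNE_CONF: the highest confidence of any rule whose body
    # is contained in the customer's purchase history.
    fired = [conf for body, head, conf in model if body <= purchased_items]
    return max(fired, default=None)

score({"beer 3", "milk"})   # 0.8
score({"milk"})             # None
```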
By means of the button Start scoring we create the scoring results, write the desired result file to disk and open the resulting data as a new in-memory data source in Synop Analyzer, that means as a new tab in the left column of the Synop Analyzer workbench.
We introspect the scoring result data with the module 'multivariate exploration'. We see that the model has identified 10 of the 24 customers as susceptible to champagne:
Via the button we submit the selected 10 customer IDs to a last visual examination. Then we can use the button Export to save the resulting list to a flat file or Excel spreadsheet, or we can use the main menu button Report to create an HTML or PDF report.