Detecting Deviations and Inconsistencies

Content

Purpose and short description
The result view
Obtaining correction hints
The bottom tool bar
Specification of the desired content of the deviation patterns
Modification of the statistical limits and settings
Detailed introspection of selected deviation patterns
Saving and exporting results
Interpretation of deviations: untypical data set or data error?


Purpose and short description

In the Deviation Detection panel, outliers, deviations and presumable data inconsistencies can be detected. The specific approach of this module is that it does not examine the values and value distribution characteristics of each data field separately for outliers as traditional data quality checker tools do. Rather, it finds cross-field inconsistencies.
For example, in a customer master data table neither the value Age=35 nor the value FamilyStatus=child is an outlier or deviation, but the combination of both is one. This type of data errors are often overlooked by other data quality tools.


The result view

The deviations and inconsistency detection module was designed for usability for non-statisticians. It aims at delivering interesting results and findings without obliging the user to define hypotheses, busines rules or filter criteria and without too many 'expert' parameters and options. In the following, we are going to demonstrate a typical usage scenario of the module by means of an example analysis of the sample data doc/sample_data/customers.txt. We assume that these data have been read in as described in another chapter of this documentation, that means with ClientID as group field. If we start the module Deviations and Inconsistencies on these data and just press the Start button in the module's tool bar, we obtain the folloging result:

image file img/deviations_resulttable_753.png

(Note: in the picture shown above, we have mouse-clicked the table column header of the column item 1 in order to sort the detected deviations thematically, that means lexically). The different columns of the result table have the following meaning:

Hence, the deviation pattern which has been highlighted in blue in the above picture can be interpreted as follows: the combination of the two items Age=[70 to 79 years] (which is the 8th out of 10 value ranges of the date field Age) and Profession=Worker appears in one single data record. As the range Age=[70 to 79 years] appears in 958 out of 10000 data records and the value Profession=Worker in 1320 data records, we expected a much higher occurrence frequency, namely with about 958/10000 * 1320 = 126.5. That means, the lift value of the pattern is 1/126.5. The difference between the observed frequency of 1 and the expected frequency of 126.5 is highly significant (χ² confidence=1.000). The combination of the two, the confidence divided by the lift, is the deviation strength of the pattern (126.5).


Obtaining correction hints

Some detected deviation patterns might be evident such as the pattern Age=[30 to 49] andFamilyStatus=Child. In many other detected patterns, however, it might be unclear which part of the pattern contains the part which does not fit with the rest of the pattern, and what replacement would most probably 'heal' the inconsistency or remove the data error. Synop Analyzer helps answering these questions by providing a 'correction hint' feature: richt-clicking on one of the patterns listed in the patterns table opens a pop-up window in which the software indicates which parts of the pattern are most probably the deviating parts and what are the statistically most probably corrections. The following picture shows the correction hint for the pattern which has been highlighted in the preceding screenshot:

image file img/deviations_hint_???.png not found

The single correction hints displayed in the pop-up window are ordered by descending statistical plausibility. In the example shown above, the correction hint says that it would be normal if a person at an age of more than 70 years was not a worker but a pensioner. The second most plausible correction would be that the age of a person who has the profession 'worker' was between 30 and 50 years.

Often, it is advisable to check the displayed correction hints by looking at the involved data records. Then it often becomes obvious which one of the suggested corrections is the best matching one - or whether no corrections should be applied because the inflicted data sets are somehow untypical but not erroneous.Our example of the worker above 70 years occurs in one single data record:

image file img/deviations_dataset_992.png not found

The closer inspection of this data set shows that most probably the value of the field Profession is outdated. The duration of the client relationship, the lack of adoption of 'modern' bank services (online banking, credit card, bank card) combined with an above-average account balance are more typical for a 71-year-old pensioner than for a younger worker.


The bottom tool bar

The tool bar at the lower edge of the panel provides features for

image file img/deviations_toolbar_???.png not found

In the following we will describe these three groups of features in more detail.

Specification of the desired content of the deviation patterns

The three buttons at the left end of the tool bar help to focus the deviation detection to patterns with a user-specified content. Three kinds of specifications can be performed.

Using the button Suppressed items one can define groups of items which are to be completely ignored during the following deviation detection. Clicking on the button opens a pop-up window in which one can enter item names or parts of item names plus wildcard symbols (*) and activate each input by pressing the button Add. Each wildcard stands for zero or more arbitrary characters. You can either type in the desired values into the input field, or you can select from a drop-down list of all available items in the data by pressing the arrow symbol at the right edge of the input field.

In the screenshot below we have alredy specified that nothing involving the term Saving (as part of a field name or field value) should appear in the detected patterns. Then we have specified that we also want to suppress all patterns in which the term OnlineBanking occurs. This second limitation has not yet been activated by pressing the Add button.

image file img/deviations_suppressed_558.png not found

Using the button Required items one can define groups of terms or items and enforce that each detected pattern contains at least one item from that group. Clicking on the button opens a pop-up window in which one can enter item names or parts of item names plus wildcard symbols (*) and activate each input by pressing the button Add. Each wildcard stands for zero or more arbitrary characters.

image file img/deviations_required_558.png not found

Using the button Incompatible items one can define pairs or tupels of items which must not occur together in the detected deviations. Clicking on the button opens a pop-up window in which one can enter combinations of item names or parts of item names plus wildcard symbols (*), separated by commas. Each entered combination must be activated by pressing the button Add. Each wildcard stands for zero or more arbitrary characters.

In the screenshot below we have specified that we do not want to find deviations which simulataneously contain values from the data fields NumberCredits and NumberDebits

image file img/deviations_incompatible_558.png not found

The blue number fields at the right of the three aforementioned tool bar buttons indicate how many restrictions of the respective type have been defined and activated.


Modification of the statistical limits and settings

The five numeric input fields in the middle of the tool bar serve to specify desired value ranges for five statistical measures of the patterns to be detected:

image file img/deviations_toolbar_???.png not found

The preceding picture shows an eample in which the predefined value of all five input fields has been modified. Additionally, the item content restrictions described in the preceding section have been maintained. Using these settings, Synop Analyzer finds the following deviation patterns:

image file img/deviations_resulttable_ECcard_782.png not found

Compared to the patterns created using the default settings, we notice that we now find some patterns of length 3. For these longer patterns, the 'correction hint' feature is often particularly helpful: the longer a pattern is, the more difficult is it to understand which part of the pattern does not match with the rest. In the screenshot below we show the correction hint for the pattern which has been highlighted in the picture above.

image file img/deviations_hint_ECcard_???.png not found

From the correction hints we understand what makes this pattern a deviation: long-term inactive (nominal) clients should not have an active bank card. The item BankCard=yes is the one which does not fit into the rest of the pattern.

Detailed introspection of selected deviation patterns

Another possibility of deviation pattern introspection is to compare the properties of the data records which are affected by the selected pattern(s) to the entire data. This can be done using the multivariate exploration technique known from the module Multivariate Exploration, but with the affected data records as fixed preselection. By deactivating some of the checkboxes below the histogram charts you can further reduce the data selection.

This function is provided via the button Explore selected on the right side of the toolbar. If you press that button after selecting the pattern of length 3 which has been discussed in the previous section, you obtain the following result:

image file img/deviations_multivar_ECcard_766.png not found

We notice that the affected customers are mainly married male employees between 40 and 60 years which are long-term customers and have a joint account and a bank card. This is a very normal, unremarkable combination. More noticeable is the fact that their accounting activity (NumberDebits and NumberCredits is close to zero, which is quite untypical for this customer group.

Our preliminary result is that the examined pattern does probably not indicate data errors: apart from the accounting activity, all demographic data properties of the affected records are consistent. The question now is which of the involved customers are purely nominal clients which should be removed from the customer master data because they generate negative margins, and which customers could and should be 'reactivated'.

In order to answer this question, we use another tool bar function: the button Show. This button opens a new window in which those data records are shown which are affected by at least one of the currently selected deviation patterns.

Note: A table of the affected data records can also be opened from the multivariate exploration pop-up window by clicking on the button Show in the tool bar of that window. This second option has the advantage that one can hide and suppress a part of the data fields by means of the button Visible fields before opening the tabular data view. In contrast, pressing the button Show on the main toolbar of the deviation detection panel always shows all data fields of the displayed data records.

If only the one pattern of length 3 has been selected which has been discussed above, the tabular data records view looks as follows:

image file img/deviations_multivar_ECcard_766.png not found

From this introspection we understand that the second customer, P0034770 probably belongs to the category 'nominal client': during one year, the customer had no credit transaction and only one debit transaction, probably an account-keeping fee so that the account balance has slipped into the slightly negative range. This customer generates most probably more cost than profit, and a 'reactivation' is highly improbable.

The first customer, P0031522, on the contrary, shows some financial activity on his accounts. Here, trying to reactivate the customer might be more promising.

Saving and exporting results

At the end of a data analysis one often wants to permanently save the analysis settings, or to export the analysis results so that they can be used outside of Synop Analyzer. The tool bar of the module Deviations and Inconsistencies offers four functions for achieving this:

  1. save settings :
    By means of this button one can save the currently active settings for this module and for importing the data to an XML parameter file. The structure of this file conforms to the XML schema http://http://www.synop-systems.com?/xml/InteractiveAnalyzerTask.xsd. This file can later be reloaded via the main menu button Analysisrun Deviations and Inconsistencies. This starts a process which reads the most current version of the data specified in the XML file and then starts the module Deviations and Inconsistencies with the settings stored in the XML file.
  2. save result :
    This button saves the deviation patterns which are currently displayed in the main part of the panel into a TAB separated flat text file which can be opened in any text editoror with MS Excel. The results shown in the section Modifying the statistical settings look like this when exported and re-opened in MS Excel:
    image file img/deviations_export_1028.png not found
    The exported version of the patterns differs in three points from the version shown on screen. First, non-localized english column names are used. Second, instead of the column 'deviation strength' the two values are exported from which the deviation strength is calculated: the patterns' lift and chiSqrConfidence. Third, an additional column is added which contains a slightly shortened version of the correction hints which appear in a separate pop-up window in the on-screen version of the patterns.
  3. save data :
    The pop-up window 'show data records' has its own export button with which all data records on which at least one of the selected deviation patterns appears are exported into a TAB separated flat text file.
  4. explorationExport :
    Similarly, the pop-up window 'multivariate exploration' contains its own export button. This button exports all data records on which at least one of the selected deviation patterns appears, plus all multivariate histogram chart plots into a spreadsheet file in the .xlsx format.


Interpretation of deviations: untypical data set or data error?

The module 'Deviations and Inconsistencies' shows data records and patterns which are significantly untypical. If the module is used for data quality monitoring purposes, two questions have to be answered for each detected pattern and each affected data record:

  1. Does the affected data record contain an data fault which should be removed, or are the data correct and they just describe something untypical?
  2. If the data record contains a data fault, which data field contains the faulty value, and what is the correct value?

The module contains a couple of tools for answering these questions - the correction hints, the multivariate exploration and the data record introspection which have been described in earlier sections of this chapter. However, it should be noted that the correction hint can be misleading in some situations, and an automatic data correction process based on these correction hints and without further human controll is not advisable.

After this initial remark we want to revisit two of the examples discussed above. In these examples, a human introspector quickly understands that they must be data faults:

image file img/deviations_resulttable_children_782.png not found

Since children of more than 21 years do not exist in any country on earth, we are faced with the question which values are faulty in the affected data sets. In order to answer this question we first look at the multivariate exploration of the four data records in which children of more than 21 years appear. From this graphical data exploration one often gets hints on where the data fault is located, for example could all affected data records carry one identical data import time stamp or they could stem from one identical source system or one filiation etc.

image file img/deviations_multivar_ECcard_782.png not found

In our example, we get the impression that the four customers show rather a typical adult behaviour than a typical child behaviour (see the fields DurationClient, AccountBalance or NumberCredits. Now we introspect the data records themselves:

image file img/deviations_datasets_children_785.png not found

The first impression is consolidated: in each single data record we find three to four indications for the person being an adult (see Profession, AccountBalance, BankCard, NumberCredits). As a human processor we could now delete the four values FamilyStatus=child and either replace them by unknown or send the data records to a colleague who gathers the correct family status data.