The Module 'Statistics and Distributions'

Content

Purpose and short description
The tabular views
The histogram charts view
The bottom tool bar
Detecting and removing perfect tuples


Purpose and short description

The data exploration module 'Statistics and Distributions' is the simplest and most fundamental data visualization module of Synop Analyzer. The screen is vertically divided into two areas. The upper part contains basic statistical measures and figures for each data field in tabular form. The lower part shows the value distribution of each data field in the form of histogram charts. In summary, the purpose of the module is to give a quick overview of a data source which has been read into Synop Analyzer.


The tabular views

In the upper part of the module 'Statistics and Distributions', two tabular views display important statistical measures of the numeric and the non-numeric data fields. The screenshot below shows these tabular views for the data doc/sample_data/RETAIL_PURCHASES.txt, which has been imported into Synop Analyzer as described in 'Importing data with name mappings':

[Screenshot: univariate statistics, tabular part]

The screenshot shows that the textual data field ARTICLE has no missing values, that it has 79 different values, and that the value lemonade, which occurs in 50 purchases, is the most frequently purchased article, followed by the article cream, which is contained in 47 purchases.

For the numeric field PRICE we see that it has no missing or invalid values either, that the cheapest purchase was 1.18 $ and the most expensive one 744.75 $, and that the average purchase value was 41.70 $ although 50% of all purchases were below 7.50 $. In other words, there are many small purchases and a few very large ones. Accordingly, the distribution of purchase prices has a positive skewness (long tail towards high prices). Precise definitions of the three measures 'Standard deviation', 'Skewness' and 'Excess' can be found on the following Wikipedia pages: Sample standard deviation, Skewness and Excess Kurtosis.
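The measures named above can be reproduced outside the tool for a quick cross-check. The following Python sketch is not Synop Analyzer's implementation: the function name describe and the price list are invented for illustration, and the choice of population moments for skewness and excess is an assumption, since the manual does not state which estimator the software uses.

```python
import statistics


def describe(values):
    """Return basic univariate measures for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments of the data (dividing by n); an assumption, the
    # software may use bias-corrected estimators instead.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "std": statistics.stdev(values),   # sample standard deviation (n - 1)
        "skewness": m3 / m2 ** 1.5,        # > 0 means a long tail to the right
        "excess": m4 / m2 ** 2 - 3.0,      # kurtosis relative to a normal distribution
    }


# Invented prices: many small purchases and one very large one.
prices = [1.18, 2.50, 3.99, 7.50, 7.50, 12.00, 744.75]
summary = describe(prices)
print(summary)
```

As in the PRICE example, a mean far above the median goes together with a positive skewness.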

For group fields (in our example the field PURCHASE_ID), Synop Analyzer does not show statistics on the field values. The values of group fields are normally of little interest since they are only used to define groups of data records. Instead, the statistics and distribution of group lengths are shown, in other words, statistics on how many data records are in the various data groups defined by identical group field values.
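The idea of group-length statistics can be illustrated with a few lines of Python. This is only a sketch: the records are invented, and the variable names are not taken from the software.

```python
from collections import Counter
import statistics

# Invented (PURCHASE_ID, ARTICLE) records; the group field is the first column.
records = [
    ("P1", "lemonade"), ("P1", "cream"),
    ("P2", "lemonade"),
    ("P3", "bread"), ("P3", "milk"), ("P3", "cream"),
]

# Number of data records per group (the "group length").
group_lengths = Counter(purchase_id for purchase_id, _ in records)

# Distribution of group lengths: how many groups have length 1, 2, 3, ...
length_distribution = Counter(group_lengths.values())

mean_length = statistics.fmean(group_lengths.values())
print(dict(group_lengths), dict(length_distribution), mean_length)
```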


The histogram charts view

The lower part of the screen shows value distribution histograms for all data fields. Histograms with more than 40 bars cover the entire screen width. Histograms with not more than 20 bars are arranged in rows of N charts each, where N is the number entered into the tool bar input field named Charts/row. If this input field contains the value 0, the software decides autonomously how many charts to put into one screen row. Charts with 21 to 40 bars occupy twice as much horizontal space as charts with not more than 20 bars. The figure below shows value distribution histograms which have been generated on the sample data doc/sample_data/RETAIL_PURCHASES.txt after importing them as described in 'Importing data with name mappings'.

[Screenshot: univariate statistics, histogram part]
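The sizing rule for histogram charts described in this section can be written down as a small Python function. The function name chart_width and the returned labels are hypothetical; the sketch only restates the thresholds from the text and does not model the automatic mode (Charts/row set to 0).

```python
def chart_width(num_bars):
    """Classify the horizontal space a histogram chart occupies."""
    if num_bars > 40:
        return "full row"   # more than 40 bars: entire screen width
    if num_bars > 20:
        return "double"     # 21 to 40 bars: twice the space of a small chart
    return "single"         # up to 20 bars: one of N slots per screen row


for bars in (5, 20, 21, 40, 79):
    print(bars, chart_width(bars))
```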

In the histogram charts for non-numeric data fields, the values are arranged by descending occurrence frequency from left to right. Each value has a different bar color. If a data field has more than N values, where N is the number in the input field #values (text fields) in the Input Data panel, then only the N most frequent values have been separately recorded when the data were imported. All other values have been summarized into the 'rest' value 'others'. This rest value is represented in the chart by one single bar with the label 'others'. If there is no such 'rest' value in the data, it can still be the case that there are so many different values that it is impossible to draw a histogram bar for each of them. In this case, the histogram chart is truncated after 80 bars (you can change that value of 80 in the pop-up dialog Preferences → Univariate Preferences). The fact that some bars could not be displayed is indicated by an additional label saying "... ?? others", where ?? is the number of suppressed bars. The chart for the field ARTICLE in the figure above has such a label saying "... 39 others".
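The two mechanisms described above, summarizing rare values into 'others' at import time and truncating the chart when there are too many bars, can be sketched in Python as follows. The function names summarize_top_n and truncate_bars are hypothetical, and placing the 'others' bar at the end of the list is an assumption; the manual does not say where the software draws it.

```python
from collections import Counter


def summarize_top_n(values, n):
    """Keep the n most frequent values; sum all remaining ones into 'others'."""
    ranked = Counter(values).most_common()      # descending frequency
    bars = ranked[:n]
    rest = sum(count for _, count in ranked[n:])
    if rest > 0:
        bars.append(("others", rest))           # assumption: appended last
    return bars


def truncate_bars(bars, max_bars=80):
    """Cut the bar list after max_bars and build the '... ?? others' label."""
    shown = bars[:max_bars]
    suppressed = len(bars) - len(shown)
    label = f"... {suppressed} others" if suppressed else None
    return shown, label


values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
print(summarize_top_n(values, 2))
print(truncate_bars(Counter(values).most_common(), max_bars=2))
```

Note that the first function changes the data that are stored, whereas the second only affects what is drawn; the detail view opened by a left click still lists every recorded value.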

In the histogram charts for numeric data fields, all bars have the same color, and the values or value ranges are ordered by increasing value from left to right. Unless a manual field-specific discretization has been defined, a histogram for a numeric data field never has more than n bars, where n is the number entered into the input field #bins (numeric fields) in the Input Data panel.
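To illustrate how an upper limit on the number of bins bounds the number of bars, here is a minimal Python sketch. It uses equal-width bins, which is an assumption made for simplicity; the manual does not describe how Synop Analyzer actually chooses its bin boundaries. The function name equal_width_histogram is hypothetical.

```python
def equal_width_histogram(values, max_bins):
    """Count values in max_bins equal-width ranges between min and max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single bar is enough.
        return [(lo, hi, len(values))]
    width = (hi - lo) / max_bins
    counts = [0] * max_bins
    for x in values:
        # The maximum value falls into the last bin rather than a new one.
        index = min(int((x - lo) / width), max_bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(max_bins)]


bins = equal_width_histogram([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], max_bins=5)
for lower, upper, count in bins:
    print(f"{lower:5.1f} to {upper:5.1f}: {count}")
```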

Left-clicking on a histogram chart opens a tabular detail view containing all different values of the field and their absolute and relative occurrence frequencies. This detail view also contains those values for which no separate bar could be drawn in the histogram chart due to lack of space. The figure below shows such a pop-up table view for the data field ARTICLE.

[Screenshot: chart detail pop-up]

By dragging the mouse over a histogram chart (keep the left mouse button pressed while moving), you mark a rectangular region into which you want to zoom.

Right-clicking on a histogram chart opens the pop-up dialog shown below. In this dialog, you can modify the appearance of the histogram chart (text fonts and sizes, axis styles, labels, etc.) via the menu item Properties. You can also save the chart as a PNG image, print it, or copy it as a PNG image to the system clipboard.

[Screenshot: chart modification pop-up]

Using the button Visible fields in the bottom tool bar, you can hide certain fields from the charts panel in order to keep the display clearly arranged when the data contain many fields.


The bottom tool bar

The tool bar at the lower screen border provides the following buttons and functions:

[Screenshot: univariate statistics tool bar]


Detecting and removing perfect tuples

The detection of 'perfect tuples' is started by clicking on the tool bar button Perfect tupels. The button is only usable if a group field has been specified on the input data and if at least one of the textual input data fields is set-valued with respect to the group field, that is, if it contains more than one different value in at least some of the data groups. The button opens a pop-up dialog in which you can choose one of the set-valued fields and then search this field for 'perfect tuples'. A 'perfect tuple' is a set of two or more field values which always or almost always occur together in the same data groups.

In the following, we demonstrate this using the sample data doc/sample_data/CAR_REPAIR.txt. We assume that these data have been imported into Synop Analyzer as described in 'Transactional data and stream data', that is, with REPAIR_ID as group field. If we start the module 'Statistics and Distributions' on these data and click on the Perfect tupels button, the following dialog window opens:

[Screenshot: perfect tuple detection dialog]

We choose the field ERROR_LOG and accept all other default settings in the window: search for value tuples whose single values appear in at least 10 data groups and for which at least 95% of the data groups which contain any single value of the tuple also contain the entire tuple. Then we press the Start button in order to start the tuple detection.
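The two default criteria can be expressed as a simple check on a candidate set of values. The Python sketch below only tests whether a given candidate satisfies the criteria as worded above; it does not show how Synop Analyzer searches for candidates, which the manual does not describe. The names tuple_purity, is_perfect_tuple, min_support and min_purity are hypothetical, and the data are invented.

```python
def tuple_purity(groups, candidate):
    """Share of groups containing any candidate value that contain all of them."""
    members = set(candidate)
    with_any = [values for values in groups.values() if members & values]
    with_all = [values for values in with_any if members <= values]
    return len(with_all) / len(with_any) if with_any else 0.0


def is_perfect_tuple(groups, candidate, min_support=10, min_purity=0.95):
    """Check the two criteria: per-value support and joint occurrence."""
    for value in candidate:
        support = sum(1 for values in groups.values() if value in values)
        if support < min_support:
            return False
    return tuple_purity(groups, candidate) >= min_purity


# Invented data: group ID -> set of field values in that group.
groups = {f"R{i}": {"A", "B"} for i in range(39)}
groups["R39"] = {"A"}            # one group with A but without B
for i in range(5):
    groups[f"R{i}"].add("C")     # C occurs in only 5 groups

print(is_perfect_tuple(groups, ("A", "B")))   # A and B nearly always co-occur
print(is_perfect_tuple(groups, ("A", "C")))   # C is too rare
```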

The screenshot above already shows the appearance of the window after the start command has been executed. 10 perfect tuples were found. The values forming these tuples were eliminated from the data, and whenever all values forming a tuple were found in a data group, the tuple was inserted as a new single value into that data group. After replacing the single values by the tuples, 950 different values remain in the field ERROR_LOG.
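The replacement step can be sketched in Python as follows. This reflects one reading of the description above (member values are removed from every group, and the combined value is added only where all members were present) and is not the software's actual code. The function name replace_tuples is hypothetical, and the separator '|' is the concatenation character that appears in the resulting ERROR_LOG values.

```python
def replace_tuples(groups, tuples, sep="|"):
    """Replace the member values of each tuple by one concatenated value."""
    result = {}
    for group_id, values in groups.items():
        new_values = set(values)
        for tup in tuples:
            members = set(tup)
            complete = members <= values     # all members present in this group?
            new_values -= members            # single values are eliminated
            if complete:
                new_values.add(sep.join(tup))
        result[group_id] = new_values
    return result


# Invented data: group ID -> set of field values.
groups = {"R1": {"A", "B", "X"}, "R2": {"A", "Y"}}
replaced = replace_tuples(groups, [("A", "B")])
print(replaced)
```

Because several single values collapse into one combined value, the number of different values in the field decreases.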

One can examine which values have been replaced by closing the 'perfect tuples' dialog and left-clicking on the histogram chart for the field ERROR_LOG. When scrolling through the value list, we find new, longer error log codes which contain the concatenation character '|', for example the tuple of length 4 composed of the values KWX34759496, KWX34759494, KWX34759493, and KWX34759495. This tuple occurs in 90 repair cases.

[Screenshot: perfect tuple detection result]