The Module 'Multivariate Exploration'

Content

Purpose and short description
Understanding the main panel
Working with the range selector buttons
Working with detail pop-up dialogs für single fields
The bottom toolbar
Rearranging and suppressing fields
Working with the detail field selector box
Working with set-valued data fields
Creating forecasts and what-if scenarios


Purpose and short description

The data exploration module 'Multivariate Exploration' serves to study the dependencies and interrelations between the different values of several data fields in detail. To this purpose, the module displays histogram charts of all (or a user-defined subset of all) data fields on a single screen panel. By mouse-clicking the user can interactively select an deselect values and value ranges in an arbitrary combination of the histograms, thereby defining a multivariate data selection. The modifications in the field value distributions of all data fields which result from the selection are displayed in real time on screen even on very large data.


Understanding the main panel

The main part of the Multivariate Exploration panel consists of one histogram chart per active data field. Each histogram chart compares a field's value distribution on the currently selected subset (blue bars) to the field's value distribution on the entire data (light green bars).

Histograms with more than 36 bars cover the entire screen width, histograms with not more than 18 bars are grouped into tupels of N charts per screen row, where N is the number entered into the tool bar input field named Charts/row. If this input field contains the value 0, the software decides autonomously how many charts to put into one screen row. Charts with 19 to 36 bars occupy twice as much horizontal space as the charts with not more than 20 bars. In order to avoid ugly gaps in the arrangements of the charts on screen, the 'large' charts (those with more than 18 bars) are placed before the 'small' charts, that means those with less than 19 bars.

In the histogram charts for non-numeric data fields, the values are arranged by descending occurrence frequency from left to right.If a data field has more then N different values, where N is the number in the input field #values (text fields) in the Input Data panel, then only the N most frequent values have been separately recorded when the data were imported. All other values have been summarized into the 'rest' value 'others'. This rest value will be represented in the chart by one single bar with label 'others'. If there is no such 'rest' value in the data, it can still be the case that there are so many different values that it is impossible to draw a histogram bar for each of them. In this case, the histogram chart will be truncated after 80 bars (you can change that value of 80 in the pop-up dialog PreferencesMultivariate Preferences). The fact that some bars could not be displayed is indicated by an additional label saying "... ?? others", where ?? is the number of suppressed bars.

Numeric data fields - such as the field Age in the picture below - often have so many different values that a binning into a small number of value ranges or intervals is reasonable. The number of bins and the bin boundaries have been defined and can be modified in the Input Data Panel.

By clicking on one of the checkboxes which are situated below each chart, a value selection (restriction) can be defined for the corresponding data field. In the following screenshot, the sample data doc/sample_data/customers.txt have been imported into Synop Analyzer. Then, the Multivariate Exploration module has been started and the left checkbox below the chart for the field Gender has been deselected. That means, we have removed the male customers from the blue selected data. Hence, the latter represent the female customers, and the differences between the light green and the blue bars represent the differences between the female customers and all customers.

image file img/multivar_customers_all_910.png not found

We derive from the picture that the professions of the female customers strongly differ from those of the male customers - more women are employees or inactive whereas much more men are workers - while there is almost no difference between both groups as to the possession rate of savings books, life insurances or credit cards.

The user can now interactively select an deselect values and value ranges in one or more arbitrary other data fields, thereby defining a multivariate data selection. The calculation of the overall selection is performed on an in-memory representation of the data which is optimized for those multivariate 'slicing' operations over several fields. Therefore, the results can be calculated and displayed within fractions of a second even on multi-gigabyte data.

By drawing with the mouse (keep the left mouse button pressed while moving) on a histogram chart you mark a rectangular region in which you want to zoom in.

By right-clicking on a histogram chart you open the pop-up dialog shown below. In this dialog, you can modify the appearance of the histogram chart (text fonts and sizes, axis styles, labels, etc.) via the menu item Properties. You can also save the chart as PNG graphics, print it or copy it as png graphics object to the system clipboard.

chart modification popup

Using the button Visible fields in the bottom toolbar, you can hide and remove certain fields from the charts panel in order to get a clearly arranged picture on data with many data fields.


Working with the range selector buttons

Now we want to study the possibilites of selecting and deselecting value ranges by means of the button bars below the histogram charts in more detail. To that purpose we focus on a part of the screenshot shown above, namely the histograms and button bars for the three data fields Age, Gender and FamilyStatus.

In addition to the existing range limitation on the field Gender we want to restrict the values of the field Age, namely we want to focus on the customers below 40 years. To that purpose we could deselect the six rightmost checkboxes under the histogram for field Age. A bit faster is the alternative approach of deselecting the four leftmost checkboxes and then clicking on the invert button. The invert button inverts the existing range selection on a data field. The button allremoves all ranges restrictions from the field.

image file img/multivar_customers_3_865.png not found

The new selection defines 4143 customers in the selected Age region. As the intersection with the existing preselection of 4981 female customers we get 1972 or about 20% young female customers (these numbers are displayed in and next to the progress bar in the bottom tool bar).

The range restriction in the field Age instantaneously changes the heights of the blue bars in all other data fields. As expected, the percentage of children and singles in the field FamilyStatus have grown significantly. The difference between the the selected subset and the light green background distribution on the entire data has grown strongly on most data fields. The displayed 'diff value is calculated as the total length of all parts of the blue bares which exceed the light green bars divided by the total length of all blue bars (the latter is always 100% if the respective field is not set-valued).

The chart titles of the fields in which we have specified a range restriction (selection) are displayed in blue; the titles of the 'response' fields in which the observed differences between blue and light green bars are a reaction of range selections in other fields are displayed in black.


Working with detail pop-up dialogs für single fields

A left mouse-click on one of the histogram charts opens a tabular detail statistics which shows the field's values or value ranges and their actual and expected occurrence frequencies on the selected data. #expected is the expected number of selected data records under the assumption that the value's relative frequency on the selected data is identical to the value's relative frequency on the entire data. The columns difference and rel. difference contain the absolute and relative difference between the actual and the exected occurrence frequency. Finally, the column significance displays the result of a χ2 significance test which indicates whether the observed difference between actual and expected occurrence frequencies on the selected data are statistically significant (significance values close to 1) or not (significance values below 0.95...0.9).

image file img/multivar_detail_popup_556.png not found

If a non-numeric data field has many different values, for example far more than 100, then the available space in the histogram is not sufficient for displaying a separate bar and checkbox for each of them. In this case, the pop-up detail view is the only possibility for seeing all different values and for selecting or deselecting single values which do not figure among the 80 most frequent values. This selection or deselection can be performed by mouse-clicks on certain table rows in the detail view. If you keep the <CTRL> key pressed while clicking, you can select more than one row; by keeping the <SHIFT> key pressed you can select an entire value range. After selecting the desired table rows you activate your selection and close the pop-up view by pressing the button Apply selection.

In the details pop-up view you can also reorder the values by pressing on one of the column heads. This sorts the values ascendingly or descendingly by the values of the clicked column. Repeated clicks invert the sorting order. In the screenshot shown below, we have sorted by descending relative difference. This brings the value cohabitant to the top position. Then we have deselected the value on which the actual frequency does not significantly differ from the expected frequence, namely the value separated.

image file img/multivar_detail_popup2_556.png not found

If we now leave the pop-up window by pressing the button Apply selection and value order, both the new value ordering and the value selection is applied to the histogram chart:

image file img/multivar_customers_3sorted_951.png not found

The details pop-up view offers yet another feature: if you right-click on one of the table cells, the following options dialog pops up:

image file img/multivar_detail_popup3_???.png not found

This dialog permits selecting or deselecting all table rows whose values in the column in which the click was performed are in a certain value range, and this selection can be performed by one single click. This is an enormous reduction of effort especially if the field contains hundreds or thousands of different values.

The following picture results from right-clicking on the value 321 in the column #selected and by choosing the option deactivate < in the options dialog. This choice deselects all table rows which have a value of less than 321 in the column #selected.

image file img/multivar_detail_popup4_556.png not found


The bottom toolbar

The tool bar at the lower screen border provides the following buttons and functions:

image file img/multivar_toolbar_907.png not found


Rearranging and suppressing fields

Clicking on the button Visible fields opens a pop-up dialog in which the following actions can be performed:

In the following we want to demonstrate some of the options and functions with the help of a concrete example. We again start with the sample data doc/sample_data/customers.txt and we select the 1950 customers with an account balance of at least 20,000. Now we open the pop-up window Visible fields and choose Sort byrel. difference.

image file img/multivar_rich_968.png not found

The displayed field order on the main panel has changed. The selection field AccountBalance has been placed first, followed by Age and Profession, on which the difference between the selected data and the overall data is largest (26.6% and 23.0%).

It is not a big surprise thal pensioners and people elderly people often have a large account balance. More surprising is the fact that farmers are strongly over-represented in the group of people with an accoutn balance of 20000 or more. We want to introspect this group a bit closer, hence we select Profession=farmer. Then we again sort the fields by decreasing relative difference. The following picture arises. It shows that farmers with large bank accounts typically have the following characteristics:

  1. A medium age (30 to 70 years)
  2. They own a savings book
  3. High customer loyalty and long-term customer relationship of more than 20 years
  4. They are mainly male.

image file img/multivar_customers_richfarmers_968.png not found


Working with detail structure fields

Using the selector box Detail field one can add graphical detail information to the histogram bars which represent the selected data subset.

In the screenshot below, for example, the profession inactive has been selected, and the gender distribution (male/female) of this selected data subset has been shown on 12 data fields. This was achieved by choosing the field Gender as the detail structure field. As it can be seen from the histogram for the data field Gender, the red parts of the bars represent the females, the blue ones the males.

image file img/multivar_customers_detailFld1_944.png not found

We want to mention a particular application scenario of working with a detail field: if all data are selected and if the display mode 'all histogram bars have the same length (100%)' has been selected, specifying a detail field has the effect of creating a collection of many bivariate field-field matrix charts, the y-axis field of all bivariate charts being the detail structure field.

In the following picture we show an example for this application scenario. In the example, the field Age has been selected as the detail structure field.

image file img/multivar_customers_detailFld2_944.png not found


Working with set-valued data fields

If the examined data contain set-valued textual fields, multivariate exploration requires particular care and attention when interpreting the displayed results. Set-valued fields can emerge when a group field has been defined on the data. 'Set-valued' means that within one single data group the field can assume more than one different value. For example, the field PURCHASED_ARTICLE could comprise several different purchased articles on the data group TICKET_ID=3126.

In the following we want to demonstrate the arising subtleties using the sample data doc/sample_data/RETAIL_PURCHASES.txt. We assume that these data have been imported into Synop Analyzer as described in Name mappings, that means with PURCHASE_ID as group field and with doc/sample_data/RETAIL_NAMES_DE_EN.txt as article names. In these data, the field ARTICLE is set-valued with respect to the group field PURCHASE_ID: normally, a purchase comprises several different articles.

The screenshot below was obtained by deactivating the three most frequent values in the field ARTICLE and by pressing the button invert afterwards. We expect to obtain blue bars only for the first three bars in the histogram, but we find blue bars for almost all other values, too. Why?

image file img/multivar_articles_nonexcl_8526png not found

In order to answer this question, we must remember that for the set-valued data field ARTICLE, selecting the three articles can have two different meanings:

  1. Select all purchases (ticket IDs) which exclusively consist of the three selected articles and which do not contain any other article. We call this the exclusive selection mode.
  2. Select all purchases which contain at least one of the selected articles. We call this selection mode the non-exclusive mode.

Obviously, the histogram shown above interprets the selection task as 'non-exclusive' selection. Therefore, the selected purchases also contain many other articles in addition to the three selected articles. Can one switch to the exclusive selection mode in Synop Analyzer? To this purpose there is an additional button ex next to the invert button. Each click on this button toggles the exclusivity mode of the current selection.

If one clicks on the ex button in the histogram shown above, one obtains the following histogram:

image file img/multivar_articles_826.png not found

As desired, this histogram displays only those 20 purchases in which no other articles than the selected three articles were purchased. The fact that we now are in 'exclusive' mode is indicated by the text (excl.) in the title of the histogram chart.

In addition, the software applies the following principles when dealing with set-valued fields:

  1. If one starts with a data field without range limitations (all checkboxes marked) and begins deactivating single checkboxes, then Synop Analyzer automatically switches into the exclusive selection mode. This corresponds to the intuitively expected behaviour: by deactivating a checkbox I tell the software that I do not want to see those purchases which contain the deselected article.
  2. Pressing the invert button does not only invert the selection but also switches from exclusive to non-exclusive mode and vice versa. This is also the intuitively expected behaviour: if I deactivate a value, I want to see the data groups which do not contain the value. If I invert this task then I want to see the data groups which definitely contain the value, but which can contain other values, too). The two selections between which the invert switches back and forth are disjunct, and their combination is the entire data.
  3. All other actions which can be performed using the checkboxes or in the details pop-up view do not change the selection mode.

Creating forecasts and what-if scenarios

For numeric data fields with a date or timestamp data format, Synop Analyzer is able to start a time series analysis and forecast by clicking on a special button below the field's histogram chart. This special button is situated next to the button all and displays a time series plot as button icon.

In the following we want to give an example based on the sample data doc/sample_data/RETAIL_PURCHASES.txt. We assume that these data have been imported into Synop Analyzer as described in Name mappings, that means with PURCHASE_ID as group field and with DATE as order (timestamp) field.

The field DATE has the additional time series forecast button. We select all purchase prices of 10 EUR or more, then we press the forecast button.

image file img/multivar_forecast.png not found

A pop-up dialog opens up. Its content corresponds to the content of the time series forecast panel which can be started from the analysis module button list in the lower part of the left screen column of Synop Analyzer. We refer to section Time Series Analysis and Forecast for a more detailed description of the available buttons and functions. Here, we only show and example. It shows the forecast created when assuming a period (cyclic pattern) of 7 days and a forecast period (in days) of 5:

image file img/multivar_forecast2.png not found