The Module 'Bivariate Exploration'

Content

Purpose and short description
The left hand panel: select fields and value ranges
The bivariate matrix
The circle plot
The bottom tool bar
Selecting and exploring matrix cells


Purpose and short description

The data exploration module 'Bivariate Exploration' serves to study in detail the dependencies and interrelations between the values of two data fields. This is done by creating a value combination matrix in which the values of one field (the 'x-axis field') define the matrix columns and the values of the other field (the 'y-axis field') define the matrix rows. A bivariate exploration can answer the following questions:


The left hand panel: select fields and value ranges

In the left part of the module's screen window you can select the two data fields whose interrelations are to be examined. To do so, click the 'arrow down' symbol at the right border of the white selection boxes below the headings 'x-axis' and 'y-axis'. In the following screenshot, the sample data doc/sample_data/customers.txt has been imported into Synop Analyzer, and the two data fields FamilyStatus and Age have been selected for the bivariate exploration.

bivariate value selection

In the same part of the screen in which you select the data fields, you also specify how fine-grained the values of the two data fields are to be treated in the bivariate analysis. This is done by selecting or deselecting the checkboxes below the histogram charts of the two data fields. Each checkbox stands for one possible value range split between two neighboring values or value ranges, each of which is represented by one histogram bar in the chart above the checkbox. Therefore, the number of checkboxes is always the number of histogram bars minus one. A range split is active only if its checkbox is selected (marked). Each color change between a red bar and a blue bar in the histogram above the checkboxes represents one value range split. Neighboring values or value ranges whose histogram bars show the same color are treated as one single value range within the bivariate analysis.

The left side of the figure above shows a rather 'coarse-grained' value range specification. On the x-axis, only the value married is separated from the other values; all remaining values are treated as one single value range. On the y-axis, we have set one single range split at the age of 50. That means that two value ranges will be created: Age<50 and Age≥50.

The right side of the figure above shows a more fine-grained value range specification. Almost all possible range splits have been set. Only some low-frequency values have been combined: on the x-axis the values separated and cohabitant, on the y-axis the value ranges Age<10 and Age=10..20 as well as Age=80..90 and Age≥90.

Accordingly, the bivariate matrix resulting from the range specification on the left side is very small:

bivariate matrix

whereas the bivariate matrix resulting from the range specification on the right side is much more detailed:

bivariate matrix 2
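The range-split mechanism described in this section can be illustrated with a small sketch. It is not Synop Analyzer's implementation: the function name merge_ranges and the ten-year age bars are assumptions made for illustration only. The sketch takes the ordered histogram bars and one flag per checkbox and returns the resulting value ranges.

```python
def merge_ranges(bar_labels, active_splits):
    """Group neighboring histogram bars into value ranges.

    bar_labels:    ordered list of labels, one per histogram bar.
    active_splits: one boolean per gap between two neighboring bars
                   (number of bars minus one); True means the checkbox
                   for that range split is selected.
    """
    if len(active_splits) != len(bar_labels) - 1:
        raise ValueError("need exactly one split flag per gap between bars")
    groups = [[bar_labels[0]]]
    for label, split in zip(bar_labels[1:], active_splits):
        if split:
            groups.append([label])    # active split: start a new value range
        else:
            groups[-1].append(label)  # no split: extend the current range
    return groups


# Hypothetical age bars in steps of ten years (an assumption for this sketch).
age_bars = ["<10", "10..20", "20..30", "30..40", "40..50",
            "50..60", "60..70", "70..80", "80..90", ">=90"]

# Coarse-grained: a single split at the age of 50.
coarse = [False, False, False, False, True, False, False, False, False]
coarse_ranges = merge_ranges(age_bars, coarse)

# Fine-grained: all splits set except the first and the last one.
fine = [False, True, True, True, True, True, True, True, False]
fine_ranges = merge_ranges(age_bars, fine)

print(len(coarse_ranges), "ranges:", coarse_ranges)
print(len(fine_ranges), "ranges:", fine_ranges)
```

With the coarse setting, the ten bars collapse into the two ranges Age<50 and Age≥50; with the fine setting, only the two lowest and the two highest bars are combined.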


The bivariate matrix

The preceding section has described how a bivariate matrix such as the one in the figure above is generated. This section discusses which information can be derived from it.
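As a rough illustration of what such a value combination matrix contains, the following sketch counts the actual frequency of each value combination and compares it with an expected frequency. The expected frequency is computed here under the assumption that the two fields are independent (column total times row total, divided by the number of records); whether Synop Analyzer uses exactly this definition is not stated in this document. The function name and the sample records are hypothetical.

```python
from collections import Counter


def bivariate_matrix(records, x_field, y_field):
    """Return {(x_value, y_value): (actual, expected)} for all combinations.

    'expected' assumes independence of the two fields; this is an
    assumption of the sketch, not a documented Synop Analyzer formula.
    """
    cells = Counter((r[x_field], r[y_field]) for r in records)
    columns = Counter(r[x_field] for r in records)
    rows = Counter(r[y_field] for r in records)
    n = len(records)
    matrix = {}
    for x, col_total in columns.items():
        for y, row_total in rows.items():
            expected = col_total * row_total / n
            matrix[(x, y)] = (cells[(x, y)], expected)
    return matrix


# Hypothetical records, already mapped to coarse value ranges.
records = [
    {"FamilyStatus": "married", "Age": ">=50"},
    {"FamilyStatus": "married", "Age": ">=50"},
    {"FamilyStatus": "married", "Age": "<50"},
    {"FamilyStatus": "other", "Age": "<50"},
    {"FamilyStatus": "other", "Age": "<50"},
    {"FamilyStatus": "other", "Age": "<50"},
    {"FamilyStatus": "other", "Age": "<50"},
    {"FamilyStatus": "other", "Age": ">=50"},
]

matrix = bivariate_matrix(records, "FamilyStatus", "Age")
for (x, y), (actual, expected) in sorted(matrix.items()):
    print(f"{x:8s} {y:5s} actual={actual} expected={expected:.3f}")
```

Cells whose actual count lies clearly above or below the expected count are the ones a relative color scheme would highlight.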


The circle plot

The bivariate matrix and the color scheme of its cells focus on visualizing relative differences between the actual and expected frequencies of the different value combinations of the two fields involved. A second graphical visualization of the interrelations between the two fields is given in the chart with the blue circles below the matrix. It displays the absolute size (measured in the number of data records or data groups) of the different possible combinations of field values. Each circle stands for one combination of field values, and the area of the circle is proportional to the occurrence frequency.

From this plot one can easily see which combinations occur most frequently. The most untypical combinations can also be detected quite easily: they appear as small blue dots far away from any large circle in the same row or column of the plot. For example, the plot shown below contains two small blue dots in the column for the value child which are far above the typical age range of 0 to 20 years: these are children with ages between 30 and 50 years:

bivariate statistics circle plot
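Because the circle area, not the radius, is proportional to the occurrence frequency, the radius grows with the square root of the frequency. The following minimal sketch shows this scaling; the function name circle_radius and the parameter max_radius are assumptions for illustration and do not describe how Synop Analyzer draws the plot.

```python
import math


def circle_radius(count, max_count, max_radius=20.0):
    """Radius of a circle whose area is proportional to 'count'.

    The most frequent combination (count == max_count) gets max_radius.
    """
    if max_count <= 0:
        return 0.0
    return max_radius * math.sqrt(count / max_count)


# A combination with a quarter of the maximum frequency gets half the radius,
# so its circle covers a quarter of the area.
print(circle_radius(100, 100))  # 20.0
print(circle_radius(25, 100))   # 10.0
```

Scaling the radius linearly with the frequency instead would visually exaggerate large combinations and make rare ones almost invisible.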


The bottom tool bar

The tool bar at the lower border of the screen provides the following functions:

bivariate statistics toolbar


Selecting and exploring matrix cells

By clicking with the left mouse button you can select a cell of the bivariate matrix. If you keep the <CTRL> key pressed while clicking, you can select several matrix cells. Once one or more cells have been selected, the bottom tool bar of the bivariate analysis panel shows the total number of data records (or data groups if a group field has been specified) in the selected cells. By clicking the button multivariate exploration you can open a new multivariate exploration panel in which the value distributions of the selected data records (or data groups) are compared to the value distributions of the entire data.
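The idea behind this selection step can be sketched as follows: collect the records that fall into the selected cells, count them, and compare the value distribution of another field within the selection with its distribution over all records. The helper names, the field AccountActivity, and the sample records are hypothetical and serve only to illustrate the comparison.

```python
from collections import Counter


def select_records(records, x_field, y_field, selected_cells):
    """Return the records whose (x, y) value combination is a selected cell."""
    return [r for r in records if (r[x_field], r[y_field]) in selected_cells]


def value_shares(records, field):
    """Relative frequency of each value of 'field' within 'records'."""
    total = len(records)
    if total == 0:
        return {}
    counts = Counter(r[field] for r in records)
    return {value: count / total for value, count in counts.items()}


# Hypothetical records with a third field used for the comparison.
records = [
    {"FamilyStatus": "child", "Age": "30..40", "AccountActivity": "high"},
    {"FamilyStatus": "child", "Age": "40..50", "AccountActivity": "high"},
    {"FamilyStatus": "child", "Age": "10..20", "AccountActivity": "low"},
    {"FamilyStatus": "married", "Age": "30..40", "AccountActivity": "high"},
    {"FamilyStatus": "married", "Age": "40..50", "AccountActivity": "low"},
    {"FamilyStatus": "single", "Age": "20..30", "AccountActivity": "low"},
]

selected_cells = {("child", "30..40"), ("child", "40..50")}
selection = select_records(records, "FamilyStatus", "Age", selected_cells)

print("records in selected cells:", len(selection))
print("selection:", value_shares(selection, "AccountActivity"))
print("all data: ", value_shares(records, "AccountActivity"))
```

A clear difference between the two distributions indicates that the selected records differ systematically from the rest of the data.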

We want to demonstrate this with the example shown above: the bivariate matrix showing the interrelations between the data fields Age and FamilyStatus from the sample data doc/sample_data/customers.txt. In this matrix we have selected two noticeable cells which are presumably data errors: children at ages between 30 and 50 years.

bivariate matrix 2

The multivariate exploration of the four data records from these two cells shows that most probably the age is correct but the family status is outdated, since the other properties of these data records, for example the account balance or the elevated account activity, are more typical for adults than for children.

bivariate statistics multivariate exploration