The Module 'Split Analysis'

Content

Understanding the main panel
The bottom toolbar
Rearranging and suppressing fields
Working with set-valued data fields
Optimizing the control data
Automatized series of split analyses


Purpose and short description

Split Analysis is a data analysis approach in which two data subsets are selected: a 'test' data set and a 'control' data set. In many use cases, the test data set comprises data records that share a certain property, for example all men, all customers below the age of 30, or all vehicles produced after an improvement measure has been implemented. The first goal of the analysis is to select a suitable control group which is representative of the test group in all attributes except the ones used for defining the test group. The second goal is to find and quantify significant differences between the test data subset and the control data subset.


Understanding the main panel

The main part of the Split Analysis panel consists of one histogram chart per active data field. Each histogram chart compares a field's value distribution on the currently selected test data (blue bars) to the field's value distribution on the currently selected control data (red bars) and on the entire data (light green bars).

Histograms with more than 36 bars cover the entire screen width. Histograms with at most 18 bars are arranged N charts per screen row, where N is the number entered into the tool bar input field named Charts/row. If this input field contains the value 0, the software decides autonomously how many charts to put into one screen row. Charts with 19 to 36 bars occupy twice as much horizontal space as the charts with at most 18 bars. In order to avoid ugly gaps in the arrangement of the charts on screen, the 'large' charts (those with more than 18 bars) are placed before the 'small' charts (those with fewer than 19 bars).

In the histogram charts for non-numeric data fields, the values are arranged by descending occurrence frequency from left to right. If a data field has more than N different values, where N is the number in the input field #values (text fields) in the Input Data panel, then only the N most frequent values have been separately recorded when the data were imported. All other values have been summarized into the 'rest' value 'others'. This rest value is represented in the chart by one single bar with the label 'others'. If there is no such 'rest' value in the data, there can still be so many different values that it is impossible to draw a histogram bar for each of them. In this case, the histogram chart is truncated after 80 bars (you can change that value of 80 in the pop-up dialog Preferences → Multivariate Preferences). The fact that some bars could not be displayed is indicated by an additional label saying "... ?? others", where ?? is the number of suppressed bars.
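
The import-time folding of rare values into 'others' can be illustrated with a minimal sketch. This is not Synop Analyzer's implementation; the function name `top_n_with_others` and the sample values are hypothetical and only show the idea of keeping the N most frequent values and summing the rest into one bar.

```python
from collections import Counter


def top_n_with_others(values, n, rest_label="others"):
    """Keep the n most frequent values; fold all remaining ones into one rest bar."""
    # most_common() orders by descending frequency, as in the histogram charts.
    ranked = Counter(values).most_common()
    bars = ranked[:n]
    rest = sum(count for _, count in ranked[n:])
    if rest:
        bars.append((rest_label, rest))
    return bars


# Hypothetical sample values, not taken from customers.txt.
professions = ["worker"] * 5 + ["employee"] * 3 + ["manager"] * 2 + ["farmer"]
bars = top_n_with_others(professions, n=2)
print(bars)
```

With n=2, the values manager and farmer are no longer shown separately; their three records appear in the single bar 'others'.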

Numeric data fields - such as the field Age in the picture below - often have so many different values that a binning into a small number of value ranges or intervals is reasonable. The number of bins and the bin boundaries have been defined and can be modified in the Input Data Panel.

By clicking on one of the checkboxes below each chart, a value selection (restriction) can be defined for the corresponding data field. The upper row of checkboxes specifies the selection defining the test data subset; the lower row specifies the selection defining the control data subset.

In the following screenshot, the sample data doc/sample_data/customers.txt have been imported into Synop Analyzer. Then, the Split Analysis module has been started and the left checkbox below the chart for the field Gender has been deselected for the test data, the right one for the control data. That means we have defined the female customers as the test data subset and the male customers as the control data subset.

[Screenshot: img/multivar_customers_all_933.png]

We derive from the picture that the professions of the female customers strongly differ from those of the male customers - more women are employees or inactive whereas many more men are workers - while there is almost no difference between both groups in the possession rate of savings books or credit cards.

The user can now interactively select and deselect values and value ranges in one or more arbitrary other data fields, independently for the test data and the control data, thereby defining two multivariate data selections. The calculation of the two overall selections is performed on an in-memory representation of the data which is optimized for those multivariate 'slicing' operations over several fields. Therefore, the results can be calculated and displayed within fractions of a second even on multi-gigabyte data.

By dragging with the mouse (keep the left mouse button pressed while moving) on a histogram chart, you mark a rectangular region into which you want to zoom.

By right-clicking on a histogram chart you open the pop-up dialog shown below. In this dialog, you can modify the appearance of the histogram chart (text fonts and sizes, axis styles, labels, etc.) via the menu item Properties. You can also save the chart as a PNG graphic, print it or copy it as a PNG graphics object to the system clipboard.

[Screenshot: chart modification popup]

Using the button Visible fields in the bottom toolbar, you can hide and remove certain fields from the charts panel in order to get a clearly arranged picture on data with many data fields. In the picture shown at the beginning of this section we have hidden the two fields NumberCredits and NumberDebits.


Working with the range selector buttons

Now we want to study in more detail the possibilities of selecting and deselecting value ranges by means of the button bars below the histogram charts. To that purpose we focus on a part of the screenshot shown above, namely the histograms and button bars for the four data fields Age, Gender, FamilyStatus and Profession.

In addition to the existing range limitation on the field Gender we want to restrict the values of the field Age: we want to focus on the customers below 40 years. To that purpose we could deselect the six rightmost checkboxes under the histogram for the field Age. A slightly faster alternative is to deselect the four leftmost checkboxes and then click on the invert button. The invert button inverts the existing range selection on a data field. The button all removes all range restrictions from the field. We perform the value range selection twice: once for the upper, blue data, once for the lower, red data.

[Screenshot: img/tcsplit_customers_malefemale_young_941.png]

The new selection defines 4143 customers in the selected Age region. Intersecting it with the existing preselection of 4981 female and 5019 male customers yields 1972 (about 20%) young female and 2171 (about 22%) young male customers. These numbers are displayed in and next to the progress bars in the bottom tool bar; the blue bar represents the test data, the red one the control data.

The range restriction in the field Age instantaneously changes the heights of the blue and red bars in all other data fields. As expected, the percentages of children and singles in the field FamilyStatus have grown significantly. The difference between the two selected subsets and the light green background distribution on the entire data has grown strongly on most data fields. In contrast, the differences between the two selected groups on the fields FamilyStatus and Profession, which are displayed in the respective chart titles, have declined. The displayed 'diff' value is calculated as the total length of all parts of the blue bars which exceed the red bars, divided by the total length of all blue bars (the latter is always 100% if the respective field is not set-valued).
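
The 'diff' value described above can be written down in a few lines. The following is a sketch based only on the verbal definition in this section; the function name `diff_value` and the bar heights in the second example are hypothetical.

```python
def diff_value(blue, red):
    """Share of the blue bar length that exceeds the red bars.

    blue, red: dicts mapping a field value to its bar height
    (relative frequency in percent) on the test and control data.
    """
    total_blue = sum(blue.values())
    if total_blue == 0:
        return 0.0
    # Only the part of each blue bar that sticks out above the red bar counts.
    excess = sum(max(height - red.get(value, 0.0), 0.0)
                 for value, height in blue.items())
    return excess / total_blue


# Gender with female as test group and male as control group: fully disjoint.
print(diff_value({"female": 100, "male": 0}, {"female": 0, "male": 100}))

# Hypothetical field in which 10 percentage points of blue exceed red.
print(diff_value({"a": 50, "b": 30, "c": 20}, {"a": 40, "b": 40, "c": 20}))
```

For two disjoint selections such as female versus male on the field Gender, the value is 1 (100%); for identical distributions it is 0.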

The chart titles of the fields in which we have specified a range restriction (selection) are displayed in blue. The titles of the 'response' fields, in which the observed differences between blue, red and light green bars are a reaction to range selections in other fields, are displayed in black.


Working with detail pop-up dialogs for single fields

A left mouse-click on one of the histogram charts opens a tabular detail view which shows the field's values or value ranges and their actual and expected occurrence frequencies on the test data (#test) and the control data (#control). #expected(test) is the expected number of test data records under the assumption that the value's relative frequency on the test data is identical to its relative frequency on the control data. The columns difference and rel. difference contain the absolute and relative difference between the actual and the expected occurrence frequency on the test data. Finally, the column significance displays the result of a χ² significance test which indicates whether the observed difference between actual and expected occurrence frequencies on the test data is statistically significant (significance values close to 1) or not (significance values below about 0.9 to 0.95).

[Screenshot: img/tcsplit_detail_popup_556.png]
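
The columns of the detail view can be reproduced approximately with the following sketch. The expected count follows directly from the definition given above. The significance part is an assumption: the documentation only states that a χ² test is used, so the sketch uses a Pearson χ² test on a 2x2 table (value present or absent, test or control) with one degree of freedom and reports 1 minus the p-value; Synop Analyzer may compute it differently. All function names and the sample counts are hypothetical.

```python
import math


def expected_test_count(n_test, control_hits, n_control):
    """#expected(test): test size times the value's relative frequency on control."""
    return n_test * control_hits / n_control


def chi2_significance(test_hits, n_test, control_hits, n_control):
    """1 - p of a Pearson chi-square test on a 2x2 table (assumed test variant)."""
    a, b = test_hits, n_test - test_hits
    c, d = control_hits, n_control - control_hits
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    # For one degree of freedom the chi-square CDF is erf(sqrt(x / 2)).
    return math.erf(math.sqrt(chi2 / 2.0))


# Hypothetical counts: 24 of 400 test records and 100 of 2000 control records.
expected = expected_test_count(400, 100, 2000)
difference = 24 - expected
rel_difference = difference / expected
significance = chi2_significance(24, 400, 100, 2000)
print(expected, difference, rel_difference, round(significance, 3))
```

In this hypothetical example the value occurs 20% more often in the test data than expected, but the resulting significance stays well below 0.9, so the difference would not be considered statistically significant.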

If a non-numeric data field has many different values, for example far more than 100, then the available space in the histogram is not sufficient for displaying a separate bar and checkbox for each of them. In this case, the pop-up detail view is the only way to see all different values and to select or deselect single values which are not among the 80 most frequent values. This selection or deselection can be performed by mouse-clicks on table rows in the detail view. If you keep the <CTRL> key pressed while clicking, you can select more than one row; by keeping the <SHIFT> key pressed you can select an entire value range. After selecting the desired table rows you activate your selection and close the pop-up view by pressing the button Apply selection. Selections in the pop-up view are always applied to both the test and the control data.

In the details pop-up view you can also reorder the values by clicking on one of the column heads. This sorts the values in ascending or descending order by the values of the clicked column. Repeated clicks invert the sorting order. In the screenshot shown below, we have sorted by descending relative difference. This brings the value widowed to the top position. Then we have deselected all values on which the difference in relative frequency between the test and the control data is not significant at a confidence level of at least 90%.

[Screenshot: img/tcsplit_detail_popup2_556.png]

If we now leave the pop-up window by pressing the button Apply selection and value order, both the new value ordering and the value selection are applied to the histogram chart:

[Screenshot: img/tcsplit_customers_4sorted_951.png]

The details pop-up view offers yet another feature: if you right-click on one of the table cells, the following options dialog pops up:

[Screenshot: img/multivar_detail_popup3_???.png]

This dialog permits selecting or deselecting, with one single click, all table rows whose values in the clicked column lie in a certain value range. This saves considerable effort, especially if the field contains hundreds or thousands of different values.

The following picture results from right-clicking on the value 99 in the column #test and choosing the option deactivate < in the options dialog. This choice deselects all table rows which have a value of less than 99 in the column #test.

[Screenshot: img/tcsplit_detail_popup4_556.png]


The bottom toolbar

The tool bar at the lower screen border provides the following buttons and functions:

[Screenshot: img/tcsplit_toolbar_984.png]


Rearranging and suppressing fields

Clicking on the button Visible fields opens a pop-up dialog in which the following actions can be performed:

In the following we demonstrate some of the options and functions with the help of concrete examples. We again start with the sample data doc/sample_data/customers.txt and with the selection discussed in the previous section: female customers below 40 years as test group, male customers below 40 as control group. Now we open the pop-up dialog Visible fields and hide the two fields NumberCredits and NumberDebits by left-clicking the two field names while keeping the <CTRL> key pressed. Then we choose Sort by → rel. difference.

[Screenshot: img/tcsplit_customers_malefemale_sorted_934.png]

The field order and the number of displayed fields in the main panel change: the field Gender, in which the two selected groups have a relative difference of 100%, is placed at the top position, followed by the fields Profession and FamilyStatus, on which the difference between young males and females is strongest (27.8% and 10.1%, respectively).


Working with set-valued data fields

If the examined data contain set-valued textual fields, the split analysis requires particular care and attention when interpreting the displayed results. Set-valued fields can emerge when a group field has been defined on the data. 'Set-valued' means that within one single data group the field can assume more than one different value. For example, the field PURCHASED_ARTICLE could comprise several different purchased articles on the data group TICKET_ID=3126.

The difficulty when dealing with set-valued fields is that it is no longer unambiguous what activating or deactivating a check box representing a histogram bar means. There are two possible interpretations:

  1. Select those data groups which only have the selected values but no other values. We call this mode the exclusive mode.
  2. Select those data groups on which the selected values are present among others. We call this mode the non-exclusive mode.
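
The two modes can be made concrete with a small sketch. It is not taken from Synop Analyzer; the function `select_groups` and the ticket data are hypothetical, and the non-exclusive mode is interpreted here as "at least one selected value is present", which is an assumption.

```python
def select_groups(groups, selected, exclusive):
    """Return the ids of the data groups matching the selection.

    groups: dict mapping a group id to the set of values of the set-valued field.
    """
    selected = set(selected)
    result = []
    for group_id, values in groups.items():
        values = set(values)
        if exclusive:
            # Exclusive mode: the group has selected values and nothing else.
            keep = bool(values) and values <= selected
        else:
            # Non-exclusive mode (assumed): at least one selected value is present.
            keep = bool(values & selected)
        if keep:
            result.append(group_id)
    return result


# Hypothetical purchases per TICKET_ID.
tickets = {
    3126: {"bread", "milk", "wine"},
    3127: {"bread"},
    3128: {"wine"},
}
exclusive_ids = select_groups(tickets, {"bread", "milk"}, exclusive=True)
non_exclusive_ids = select_groups(tickets, {"bread", "milk"}, exclusive=False)
print(exclusive_ids, non_exclusive_ids)
```

Ticket 3126 contains wine in addition to the selected articles, so it is selected only in the non-exclusive mode.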

In the reference documentation of the module Multivariate Exploration we show in detail how Synop Analyzer can switch between these two different selection modes. That explanation applies one to one also to the split analysis module, therefore we refer to that part of the documentation and do not repeat the explanations here.


Optimizing the control data

A split analysis is performed with the aim of finding significant differences in the value distributions of one or more 'target' data fields between two data subsets: the 'test' subset, whose records have certain values in one or more 'selector' data fields, and the 'control' subset, whose records do not have those values in the selector data fields. Unfortunately, in most real-world situations, there are inevitably many other differences between the two data subsets in addition to the desired ones. Therefore, one cannot be sure whether the observed differences in the 'target' fields are caused by the controllable differences in the 'selector' fields or whether they are due to uncontrollable differences in some other data fields.

In order to make this more concrete, let us consider an example from applied social studies based on the sample data doc/sample_data/customers.txt. Using these data, we want to quantitatively verify or falsify the following hypothesis:

"Managers are more frequently divorced than people with other professions but a similar socio-economic background."

The available data contain six data fields which define the profession, the marital status and the socio-economic background: Gender, FamilyStatus, Profession, Age, and the 'wealth indicators' LifeInsurance and AccountBalance. We want to verify the hypothesis stated above by selecting a suitable group of managers as the test group and a group of non-managers as the control group.

[Screenshot: img/tcsplit_customers_manager_828.png]

We import the data and start the module Split Analysis. In this module, we use the pop-up dialog Visible fields for hiding all data fields but the six fields listed above. In the histogram of the field FamilyStatus we deselect (for both the test and the control group) the values which do not match with professionally active persons: widowed and child. In the field Profession, we select the value Manager as test group and all other professions except the values inactive, Pensioner and unknown as the control group.

When we open the details view for the field FamilyStatus by left-clicking on the histogram chart, our hypothesis seems to be confirmed, at least as a trend.

[Screenshot: img/tcsplit_customers_manager_562.png]

The table row highlighted in blue contains the result we are interested in. The row reads as follows: in the test data (managers) there are 22 divorced persons. If the percentage of divorced persons were identical to the percentage of divorced persons in the control group, we would expect only about 19 divorced managers. This is an absolute difference of about 3 and a relative difference of +14.9%. Unfortunately, the data sample (the number of cases) is not large enough for the result to be really significant (the confidence level is well below 90%).

However, the preliminary result stated above is not really valid. The control group differs significantly from the test group in the value distributions of the data fields Age, Gender, LifeInsurance and AccountBalance. Therefore, it is unclear whether the observed differences in divorce rates are caused by the differing professions or by the differences in the other fields.

Here, we can use Synop Analyzer's control data optimization feature, which aims at making the control data 'representative' for the test data in a couple of user-defined data fields. First, we have to tell Synop Analyzer which is the target field of our hypothesis. To that purpose, we open the Visible fields dialog and right-click on the field name FamilyStatus. A new pop-up dialog appears in which we select the option Target field (distribution will not be optimized). After closing the window Visible fields the histogram of the data field FamilyStatus carries an additional (T) (for 'target') in its chart title.

[Screenshot: img/tcsplit_customers_targetfield_???.png]

Now we optimize the control data, making them representative for the test data in all data fields but the target field and the selector field Profession. We use the tool bar fields min: and max: to tell the software how large the new control data should be. The size of the test data is 440 records. We think that a control data size of about twice the size of the test data should be enough, therefore we enter 880 as the minimum and 900 as the maximum value. Then we press the button Optimize the control data.

[Screenshot: img/tcsplit_customers_manager_opt_827.png]

A moment later, the control data size has dropped to 882 data records, and the control data's value distributions on the four data fields to be optimized are perfectly identical with the respective value distributions of the test data. If we now open the details view of the field FamilyStatus, we get a result which differs strongly from our preliminary result:

[Screenshot: img/tcsplit_customers_manager_opt_562.png]

We see that when working with 'representative' control data, the profession Manager does not increase the divorce rate. On the contrary, there are fewer divorced managers than expected from the other profession groups (even though this tendency is not really statistically significant: the confidence level is only 75%). This shows how important it is to optimize the control data before drawing conclusions from a split analysis.
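
The idea of making the control data representative can be sketched as a stratified subsampling. This is only an illustration of the principle and not the algorithm used by Synop Analyzer, which is not described here. The sketch assumes that the control data are matched on the joint combinations of the optimized fields; the function `optimize_control`, the field names and the records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def optimize_control(test, control, fields, min_size, max_size, seed=0):
    """Subsample control so that its distribution over fields matches test."""
    def cell(record):
        return tuple(record[f] for f in fields)

    test_cells = Counter(cell(r) for r in test)
    pool = defaultdict(list)
    for r in control:
        pool[cell(r)].append(r)

    # Largest number of control records per test record that every
    # value combination of the test data can still be served with.
    factor = min(max_size / len(test),
                 min(len(pool[c]) / k for c, k in test_cells.items()))
    quotas = {c: int(factor * k) for c, k in test_cells.items()}
    if sum(quotas.values()) < min_size:
        raise ValueError("not enough matching control records")

    rng = random.Random(seed)
    result = []
    for c, quota in quotas.items():
        result.extend(rng.sample(pool[c], quota))
    return result


# Hypothetical records; the target and selector fields are left out on purpose.
test = [{"Gender": "f", "Age": "young"}] * 3 + [{"Gender": "m", "Age": "old"}]
control = ([{"Gender": "f", "Age": "young"}] * 10
           + [{"Gender": "m", "Age": "old"}] * 4
           + [{"Gender": "m", "Age": "young"}] * 20)
optimized = optimize_control(test, control, ["Gender", "Age"], 8, 12)
print(Counter((r["Gender"], r["Age"]) for r in optimized))
```

In this toy example the optimized control data contain the two value combinations in the same 3:1 ratio as the test data, and the combination that does not occur in the test data is dropped entirely.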


Automatized series of split analyses

Often, it is desirable to perform large series of similar split analyses. For example, we could repeat the split analysis performed in the previous section for all other professions, not only for managers. And maybe we would like to repeat the entire series of split analyses every 3 months in order to monitor socio-demographic trends.

For both goals, an automatized scheduling of many similar split analysis tests is required. Synop Analyzer provides the button Automatize for that purpose. The button creates an executable batch file in which the command line processor sacl is called with a suitable command line argument in order to perform the entire series of tests without any user interaction. Pressing the button Automatize first opens a file selection dialog in which one can define the file name of the batch file to be created. Then the following dialog opens up:

[Screenshot: img/tcsplit_customers_serie_???.png]

In the first row of this view we define the data field over which the series of split analysis tasks is supposed to iterate. The selection box offers all data fields in which exactly one field value is currently activated on the test data and some other values are activated on the control data. In our example, only the field Profession satisfies these requirements.

The second row defines the maximum number of iterations over the field specified in the first row, after which the series is terminated. The default value is 100. Since we only have 6 different professions in the data, we can leave that value unchanged because it has no effect; we could also enter 6 here.

In the third and fourth row, one can define a second data field to iterate over. In our example, there is no suitable second field for iterating over.

Then, we specify the name of the summary result file - a <TAB> separated text file which can be opened in MS Excel and which contains one line of summary information for each single split analysis performed during the series. Finally, there are three parameters with which you can modify the graphical representation of the single tests' results, and a parameter which defines the maximum amount of computer memory to be available when running the automatized analysis series.

In addition to the summary result file, the automatized series of tests will create one separate spreadsheet file per single test (iteration) which contains the same results that one would obtain if one manually executed the single split analysis and then pressed the Export button in the bottom tool bar of the split analysis panel.

As soon as one presses OK in the pop-up window, the batch file is generated and can be started at any time.
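
Conceptually, such a series iterates over the values of one field and repeats the same comparison for each of them. The following sketch illustrates this loop and the tab-separated summary file in plain Python. It does not call or imitate the sacl command line processor; the functions `run_series` and `write_summary`, the summary column names and the records are hypothetical, and it assumes that in each iteration all other values of the iteration field form the control group.

```python
import csv
import io
from collections import Counter


def run_series(records, iterate_field, target_field, max_iterations=100):
    """One split analysis per value of iterate_field; returns summary rows."""
    values = sorted({r[iterate_field] for r in records})[:max_iterations]
    rows = []
    for value in values:
        test = [r for r in records if r[iterate_field] == value]
        control = [r for r in records if r[iterate_field] != value]
        if not test or not control:
            continue
        t = Counter(r[target_field] for r in test)
        c = Counter(r[target_field] for r in control)
        # Same 'diff' definition as in the chart titles (field not set-valued).
        diff = sum(max(t[v] / len(test) - c[v] / len(control), 0.0) for v in t)
        rows.append((value, len(test), len(control), round(diff, 3)))
    return rows


def write_summary(rows, out):
    """Write one tab-separated line of summary information per iteration."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["test_value", "n_test", "n_control", "diff"])
    writer.writerows(rows)


# Hypothetical records, not taken from customers.txt.
records = [
    {"Profession": "Manager", "FamilyStatus": "divorced"},
    {"Profession": "Manager", "FamilyStatus": "married"},
    {"Profession": "Worker", "FamilyStatus": "married"},
    {"Profession": "Worker", "FamilyStatus": "married"},
]
rows = run_series(records, "Profession", "FamilyStatus")
summary = io.StringIO()
write_summary(rows, summary)
print(summary.getvalue())
```

The resulting text has one header line and one line per profession and can be opened in a spreadsheet program in the same way as the summary result file described above.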