The Data Source Specification Panel

Content

Supported data formats and data sources
The 'Input Data' panel
The 'active fields' pop-up dialog
The 'advanced options' pop-up dialog
User-specified binnings and discretizations
Value groupings and variant elimination
Name mappings
Taxonomies (hierarchies)
Joining with auxiliary tables
Computed data fields
Transactional and streaming data


Supported data formats and data sources

Synop Analyzer is able to read data from the following data sources:

A new data source can be opened by clicking on the item File in Synop Analyzer's main menu:

Data formats supported by Synop Analyzer

Once you have opened a data source and specified some additional data import settings such as data field usage types, joins with auxiliary data, field discretizations, or computed fields, you can save these settings by clicking on File → Save data load task in Synop Analyzer's main menu. This creates a parameter file in XML format. You can later re-open this XML file via File → Open data load task.


The 'Input Data' panel

When a data source has been selected using the menu item File, an input data specification panel opens up in the left part of the Synop Analyzer screen. By pressing the Start button in the middle of that panel, the data are read into your computer's memory. Once this process is finished, the buttons in the lower part of the panel become enabled. Using these buttons, you can start Synop Analyzer's different data analysis and exploration modules.

The input data panel

When reading a data source, Synop Analyzer uses certain predefined settings and makes some assumptions as to the desired usage of the individual data fields. The settings can be introspected and modified using the menu item Preferences → Data Import Preferences. Further assumptions and parameter settings are directly shown in the input data panel. Some basic parameters are visible in the upper part of the panel:

Hint: You can duplicate a data import panel by right-clicking on the tab header of the panel. This creates a new input data panel in addition to the existing one. The new panel inherits all settings and specifications for importing the data that were specified on the original panel. You can now modify some of these settings in order to generate a second, slightly modified view on the data.


The 'active fields' pop-up dialog

Clicking on the button Select active fields in the input data panel opens a pop-up dialog in which all data fields of the current data source are displayed. The picture below shows the pop-up dialog for the sample data source doc/sample_data/RETAIL_PURCHASES.txt.

The active fields popup dialog

In the dialog, you can define several properties for each data field:

Duplicating data fields

Sometimes it is desirable to use one single data field from the original data in two different ways in Synop Analyzer. For example, you could use the time stamp of a transaction within a transactions data collection both as grouping criterion (usage 'group') and as time order field (usage 'order'); or you could use and display one single date field with the two different aggregation modes 'minimum' and 'maximum' in order to show the date of the first and the last transaction of a customer.

In Synop Analyzer, you can duplicate a data field by right-clicking on the table row representing the data field in the pop-up dialog Select active fields. A second pop-up dialog then asks you to specify the name of the new copy of the original data field. Make sure that the new display name is unique:

The field name dialog for a duplicated field

After closing the field name dialog, the duplicated data field appears as a new row in the table of all available data fields. You can then specify the desired usage type, aggregation mode and other properties of the duplicated field.

The screenshot below shows a practical example in which duplicated fields are very helpful: on the sample data sample_data/RETAIL_PURCHASES.txt we have defined the field CLIENT_ID as the grouping criterion. The field DATE has been duplicated, the copies renamed to FIRST_PURCHASE_DATE and LAST_PURCHASE_DATE, and furnished with the aggregation modes 'minimum' and 'maximum', respectively. Similarly, the field ARTICLE has been duplicated, the copies renamed to CHEAPEST_ARTICLE and MOST_EXPENSIVE_ARTICLE, and furnished with the aggregation modes 'value at which PRICE is minimum' and 'value at which PRICE is maximum', respectively.

Duplicated fields in the 'Select active fields' dialog

When you import and display the data in Synop Analyzer in this way, the displayed data contain one single row per customer. The data row contains the ID of the customer, his or her total number of purchases, the total amount of money spent so far, the cheapest and the most expensive article purchased so far, and the date of the first and the last purchase.
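The per-customer aggregation described above can be sketched in plain Python. This is an illustrative sketch only, not part of Synop Analyzer; the field names are taken from the example, the toy data are invented:

```python
from collections import defaultdict

# Toy purchase rows: (client_id, date, article, price)
rows = [
    ("C1", "2023-01-05", "apple",  1.20),
    ("C1", "2023-03-17", "cheese", 4.50),
    ("C2", "2023-02-01", "bread",  2.00),
]

def aggregate_per_client(rows):
    """One output row per client: first/last purchase date,
    cheapest/most expensive article purchased so far."""
    groups = defaultdict(list)
    for client, date, article, price in rows:
        groups[client].append((date, article, price))
    result = {}
    for client, items in groups.items():
        dates = [d for d, _, _ in items]
        by_price = sorted(items, key=lambda t: t[2])
        result[client] = {
            "FIRST_PURCHASE_DATE": min(dates),          # aggregation mode 'minimum'
            "LAST_PURCHASE_DATE": max(dates),           # aggregation mode 'maximum'
            "CHEAPEST_ARTICLE": by_price[0][1],         # 'value at which PRICE is minimum'
            "MOST_EXPENSIVE_ARTICLE": by_price[-1][1],  # 'value at which PRICE is maximum'
        }
    return result
```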

Hints for minimizing memory requirements and for maximizing processing speed


The 'Settings' pop-up dialog

Many more advanced options for customizing the data preparation and data importing process are accessible via the pop-up panel Settings. The panel is organized into seven 'tabs' or pages, which will be described one by one in the following sections of this document.

The 'Reading Options' tab

In the upper part of the first tab, named Reading options, one can specify whether the binary data object which has been composed in the computer's main memory is also stored permanently in the form of an .iad file. The .iad data format is a Synop Analyzer-proprietary data format; it contains a compressed binary representation of the input data as well as all data preprocessing and data joining steps defined in the input data panel. An .iad file can be loaded from disk very quickly, much faster than reading the data from the original data sources.

If only the data preparation and data import settings, but not the imported data themselves, are to be stored, one can activate the check box Store the load task as XML file. This option has the advantage that one can repeatedly re-load the most current data snapshot from the original data source without having to re-enter the data preparation and data import settings.

The advanced options popup dialog

In the lower part of the tab you can modify various predefined settings for the process of reading the data into the computer's memory:


User-specified binnings and discretizations

This tab provides the means for a field-specific modification of the default settings of

In the following, we want to demonstrate this using the sample data RETAIL_PURCHASES.txt. If these data are imported into Synop Analyzer as described in section The 'active fields' pop-up dialog, with PRICE as weight field and PURCHASE_ID as group field, then the values of the field PRICE are partitioned into the following ranges:

The PRICE chart with default value ranges

This range partition shall now be replaced by the 11 ranges defined by the boundaries 5, 10, 15, 20, 30, 40, 50, 70, 100 and 150. We open the tab Field discretizations and enter the following string into the field Interval bounds (numeric fields only): 5 10 15 20 30 40 50 70 100 150. Then we finish the specification by pressing <TAB> or <Enter> and press Add. The tab now looks like this:

The field discretization dialog

Each series of interval boundaries, such as 5 10 15 20 30 40 50 70 100 150, which has been specified for a field discretization is stored in an 'interval boundaries history' in the preference settings of Synop Analyzer. You can access the 50 most recently used interval boundary sets from the pull-down selection box at the right side of the input field for interval boundaries.

If we close the pop-up dialog now using OK and re-open an analysis view which shows value distribution histograms, the histogram for the field PRICE shows the desired value ranges:

The PRICE chart with manual value ranges

You can delete manually defined value range definitions by means of Delete and modify them using Edit.
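The binning rule produced by such a set of interval boundaries can be sketched in Python. This is an illustrative sketch only, not part of Synop Analyzer; the boundaries are those from the example above:

```python
import bisect

# The manually specified interval boundaries from the example above
BOUNDS = [5, 10, 15, 20, 30, 40, 50, 70, 100, 150]

def price_bin(value, bounds=BOUNDS):
    """Return a label for the half-open range the value falls into.
    10 boundaries yield 11 ranges: (-inf, 5), [5, 10), ..., [150, inf)."""
    i = bisect.bisect_right(bounds, value)
    lo = "-inf" if i == 0 else str(bounds[i - 1])
    hi = "inf" if i == len(bounds) else str(bounds[i])
    return f"[{lo}, {hi})"
```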


Value groupings and variant elimination

This tab provides the means for reducing the set of field values of a textual data field. The most important application areas are:

In Synop Analyzer, you can define several value groupings or variant eliminations within the tab Variant elimination, and each of them can be activated for one or more data fields. By default, all variant eliminations defined when importing a certain input data source are stored as part of the 'data load task', that is, the XML file which stores all settings and user-defined specifications made for the input data source. Using the button Save selected as file, however, you can also save a variant elimination as a data-independent persistent XML file. This file can later be loaded and activated for a new data source using the button Load from file. This mechanism enables the creation of a data-independent 'knowledge base' of spelling variants for specific application areas, for example a knowledge base of Toyota car names.

The variant elimination tab

Each variant elimination consists of the following parts:

Once defined, the variant elimination settings will be applied to all subsequent data reading processes on the current input data. In our example, the two values manager,freelancer and technician,engineer of the data field Profession are replaced by the group value Leading Positions:

The variant elimination result
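The replacement of field values by a group value can be sketched in Python. This is an illustrative sketch only, not part of Synop Analyzer; the group name and member values are loosely modeled on the Profession example:

```python
# A variant elimination maps several spelling variants or related values
# of a textual field onto one group value (hypothetical grouping).
VARIANT_GROUPS = {
    "Leading Positions": {"manager", "freelancer", "technician", "engineer"},
}

def eliminate_variants(value, groups=VARIANT_GROUPS):
    """Replace a field value by its group value if it belongs to a group,
    otherwise keep the original value."""
    for group_value, members in groups.items():
        if value in members:
            return group_value
    return value
```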


Name mappings

This tab provides the means for assigning more clearly understandable alias names to the values of a textual data field. These names can be read from an auxiliary file or database table which must contain at least two columns: one column must contain values which exactly correspond to the existing values of the textual field; the second column must contain the desired alias names for these values.

In the following, we want to demonstrate this using the sample data RETAIL_PURCHASES.txt. If these data are imported into Synop Analyzer as described in section The 'active fields' pop-up dialog, the field ARTICLE contains hard-to-understand three-digit ID numbers. We would like to replace these numbers by textual article names.

article IDs

A list of English and German article names is available in the form of the file doc/sample_data/RETAIL_NAMES_DE_EN.txt. This file contains three columns: ARTICLE_ID, ARTICLE_NAME and LANG. ARTICLE_ID contains the same article identifier numbers which occur in the main data, and LANG contains the two language identifiers DE and EN. We open the tab Name mappings within the Advanced options pop-up window and insert the entries shown in the picture below in the lower gray part of the tab. Then we press the Add button. The tab should now look like this:

name mapping specification tab

If we close the pop-up dialog now using OK and re-open an analysis view which shows value distribution histograms, the histogram for the field ARTICLE shows the desired textual values:

article names
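The name mapping mechanism can be sketched in Python. This is an illustrative sketch only, not part of Synop Analyzer; the miniature mapping table imitates the three-column layout of RETAIL_NAMES_DE_EN.txt, with invented IDs and names:

```python
import csv
import io

# A miniature stand-in for RETAIL_NAMES_DE_EN.txt: ARTICLE_ID, ARTICLE_NAME, LANG
MAPPING_FILE = """ARTICLE_ID\tARTICLE_NAME\tLANG
101\tMilk\tEN
101\tMilch\tDE
102\tBread\tEN
"""

def load_name_mapping(text, lang="EN"):
    """Build {id: alias} from the key and alias columns, for one language."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {r["ARTICLE_ID"]: r["ARTICLE_NAME"] for r in reader if r["LANG"] == lang}

def rename(values, mapping):
    """Replace each field value by its alias; unmapped values stay unchanged."""
    return [mapping.get(v, v) for v in values]
```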

You can delete manually defined name mapping definitions by means of Delete and modify them using Edit.


Taxonomies (hierarchies)

This tab provides the means for adding hierarchical grouping information to the values of a textual data field. Hierarchies, also called 'taxonomies', can be read from an auxiliary file or database table which must contain at least two columns: one column must contain the lower-level ('child') part of a hierarchy relation, the other column the higher-level ('parent') part.

In the following, we want to demonstrate this using the sample data RETAIL_PURCHASES.txt. We assume that these data have been imported into Synop Analyzer and enriched with name mapping information as described in section Name mappings. We would like to add article group and article department information to the article names.

article IDs

A list of article group and department information is available in the form of the file doc/sample_data/RETAIL_ARTICLEGROUPS_DE_EN.txt. This file contains the columns PARENT and SUBGROUP, which are easily identified as the parent and child columns. We open the tab Taxonomies within the Advanced options pop-up window and insert the entries shown in the picture below in the lower gray part of the tab. Then we press the Add button. The tab should now look like this:

taxonomy specification tab

If we close the pop-up dialog now using OK and re-open an analysis view which shows value distribution histograms, the histogram for the field ARTICLE shows new article group values and department values in addition to the existing article names. (By default, Synop Analyzer histograms show only up to 80 histogram bars. You can increase this value to 100 in the pop-up dialogs Preferences → Univariate Preferences and Preferences → Multivariate Preferences in order to obtain the result shown below.)

article names
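The hierarchical rollup that a taxonomy adds to a field can be sketched in Python. This is an illustrative sketch only, not part of Synop Analyzer; the child-parent relations are invented examples of an article hierarchy:

```python
# Child -> parent relations, as read from a two-column taxonomy table
# (hypothetical article group / department hierarchy)
PARENT_OF = {
    "Milk": "Dairy",
    "Cheese": "Dairy",
    "Dairy": "Food Department",
}

def ancestors(value, parent_of=PARENT_OF):
    """Walk up the hierarchy, returning the value plus all higher levels.
    These higher-level values can then appear as extra histogram bars."""
    chain = [value]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain
```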

You can delete manually defined taxonomy definitions by means of Delete and modify them using Edit.


Joining with auxiliary tables

This tab provides the means for appending new data fields (columns) to an existing data source which has been opened in Synop Analyzer. The values in the new fields are obtained from a second data source; they are merged into the main data source via a foreign key - primary key relation between a field in the main data source and a field in the second data source. That is, the main data source must contain a data field ('foreign key field') whose values are values of a primary key field in the second data source. It is not necessary that the primary key field has been explicitly declared as a primary key within the second data source; it is sufficient that the field is a 'de facto' key field in the sense that no value of the field occurs in more than one data row.

In the following, we want to demonstrate this using the sample data RETAIL_PURCHASES.txt. We assume that these data have been imported into Synop Analyzer and enriched with name mapping information as described in section Name mappings. We would like to add customer master data to these data. A master data file is available in doc/sample_data/RETAIL_CUSTOMERS.txt. It contains the data fields AGE, GENDER and START_DATE (the latter being the date at which the customer loyalty card was handed out). The connection to the main data source is established via the foreign key - primary key pair CUSTOMER_ID (foreign key field in RETAIL_PURCHASES.txt) and CUSTOMER_ID (primary key field in RETAIL_CUSTOMERS.txt).

We open the tab Joined tables within the Advanced options pop-up window and insert the entries shown in the picture below in the lower gray part of the tab. Then we press the Add button. The tab should now look like this:

joined table specification tab

Note: the input field Row filter criterion can be used to make a field which is not a primary key field in the auxiliary data source behave like one. Imagine that we want to add a customer address field to the main data, but the address master data contain customers with two or more addresses, labeled by an address counter field ADDRESS_NBR which contains a running number 1, 2, 3, etc. In this form, we cannot join in the address information, because for some customers we would not know which address to take. However, if we enter WHERE ADDRESS_NBR=1 into the field Row filter criterion, the address becomes unique and the join can be performed.

If we close the pop-up dialog now using OK, re-read the data and open an analysis view which shows value distribution histograms, three new histograms for the fields AGE, GENDER and START_DATE appear. These new fields can now be used just as if they had been present in the main data source from the beginning. If you persistently save the data as an .iad file, the saved data also contain the three new fields.

histograms of the joined fields
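The join via a foreign key - primary key pair, including the row filter trick from the note above, can be sketched in Python. This is an illustrative sketch only, not part of Synop Analyzer; the rows and the CITY field are invented for the address example:

```python
# Main data rows carry a foreign key CUSTOMER_ID; the auxiliary table is
# keyed by the same field. A row filter (here: ADDRESS_NBR == 1) can make
# a non-unique auxiliary key unique before joining (hypothetical data).
main_rows = [
    {"CUSTOMER_ID": "C1", "ARTICLE": "Milk"},
    {"CUSTOMER_ID": "C2", "ARTICLE": "Bread"},
]
aux_rows = [
    {"CUSTOMER_ID": "C1", "ADDRESS_NBR": 1, "CITY": "Berlin"},
    {"CUSTOMER_ID": "C1", "ADDRESS_NBR": 2, "CITY": "Hamburg"},
    {"CUSTOMER_ID": "C2", "ADDRESS_NBR": 1, "CITY": "Munich"},
]

def left_join(main, aux, key, row_filter=lambda r: True, fields=("CITY",)):
    """Append selected auxiliary fields to each main row via the key field."""
    lookup = {r[key]: r for r in aux if row_filter(r)}  # filter first, then index
    out = []
    for row in main:
        match = lookup.get(row[key], {})
        out.append({**row, **{f: match.get(f) for f in fields}})
    return out

joined = left_join(main_rows, aux_rows, "CUSTOMER_ID",
                   row_filter=lambda r: r["ADDRESS_NBR"] == 1)
```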

You can delete manually defined joined data definitions by means of Delete and modify them using Edit.


Computed data fields

This tab provides the means for appending new fields (columns) to an existing data source whose values are the result of applying a computation formula to the values of one or more existing data fields.

In the following, we want to demonstrate this using the sample data RETAIL_PURCHASES.txt. We assume that these data have been imported into Synop Analyzer and enriched with name mapping information as described in section Name mappings. We would like to add a new data field which contains the number of elapsed days between the date of the purchase and the current day at which the data analysis takes place. This information can be calculated from the value of the data field DATE and the current date.

We open the tab Computed fields within the Advanced options pop-up window and insert the entries shown in the picture below in the lower gray part of the tab. Then we press the Add button. The tab should now look like this:

computed field specification tab

Note: In order to tell the software that the first part of the formula does not depend on the value of a data field, we have clicked on the button Select next to Existing field 1 and selected the last, empty entry in the pop-up menu. Without this step, the Add button remains grayed-out and unusable.

Note: By activating the check box Replace existing fields you can tell the software that the existing fields involved in the formula should be removed from the data after the computation. However, this is not possible if one of the fields to be removed has been assigned one of the four special roles 'group', 'entity', 'weight' or 'order'. Therefore, the field DATE cannot be replaced; one would get an error message if one tried to do so.

Note: The special constant [NOW] is a placeholder for inserting the current date and time of the time the data are read into memory.
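The computed field from this example can be sketched in Python. This is an illustrative sketch only, not part of Synop Analyzer; the function name mirrors the new field DAYS_SINCE_PURCHASE, and the `now` parameter stands in for the [NOW] placeholder:

```python
from datetime import date

def days_since(purchase_date, now=None):
    """DAYS_SINCE_PURCHASE: [NOW] minus DATE, in whole days.
    'now' defaults to today, i.e. the moment the data are read into memory."""
    now = now or date.today()
    return (now - purchase_date).days
```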

If we close the pop-up dialog now using OK, read the data and open an analysis view which shows value distribution histograms, we see the new data field DAYS_SINCE_PURCHASE. In the screenshot below, we have deactivated the existing field DATE using the Visible fields button. Furthermore, we have increased the default number of histogram bars for numeric fields from 10 to 12 (using input field #bins (numeric fields)) before reading the data.

the computed field DAYS_SINCE_PURCHASE

You can delete manually defined computed field definitions by means of Delete and modify them using Edit.


Transactional and streaming data

Automatically recorded mass data from logging systems - for example supermarket cash desk data, web stream data or server log data - often have only two columns: a counter or time stamp column and a second column which contains all the information recorded at a certain counter state or time. The sample data file doc/sample_data/CAR_REPAIR.txt is an example of this type of data. The file contains car read-out data which were recorded when cars were connected to a testing device at a car repair shop.

car repair data

As can be seen from the data extract shown above, the second column contains the read-out information in the form attribute name = attribute value. The first column is the ID column. Its values indicate which data rows belong to one single car read-out.

When the data CAR_REPAIR.txt are read into Synop Analyzer, the field REPAIR_ID should be specified as group field. See section The 'active fields' pop-up dialog for more details.

transactional data 1

Whenever Synop Analyzer is reading data which contain - apart from a possibly specified group, entity, order and/or weight field - only one single, textual data field, the software checks whether it is able to detect internal structures and groups of information within the single textual field. In particular, Synop Analyzer searches for prefixes of the kind attribute name=, with the aim of identifying several different such prefixes and using them for information grouping.
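The prefix detection and information grouping described above can be sketched in Python. This is an illustrative sketch only, not Synop Analyzer's actual algorithm; the rows and attribute names are invented in the style of the CAR_REPAIR.txt example:

```python
from collections import defaultdict

# Two-column transactional rows: (group id, "attribute name=attribute value")
rows = [
    ("R1", "CAR_TYPE=Sedan"),
    ("R1", "ERROR_LOG=E42"),
    ("R1", "ERROR_LOG=E17"),
    ("R2", "CAR_TYPE=Coupe"),
]

def split_by_prefix(rows):
    """Detect 'name=' prefixes and pivot them into per-group artificial fields.
    Set-valued attributes keep all their values per group id."""
    result = defaultdict(lambda: defaultdict(list))
    for group, item in rows:
        name, _, value = item.partition("=")
        result[group][name].append(value)
    return {g: dict(fields) for g, fields in result.items()}
```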

Once the data have been read in by pressing the button Start in the input data panel, one can introspect the different 'information groups' or prefixes by clicking on the Select active fields button: each prefix group is shown as a separate data field. In the example of the CAR_REPAIR.txt data, 9 such artificial data attributes are detected. The original data field ITEM has been marked as 'replaced':

transactional data 2

In subsequent data analysis steps, one can work with the artificial data fields as if they were full-blown data fields from the data source. As an example, we show a univariate statistics view in which the value distributions of the artificial data fields can be studied:

univariate exploration car repair

To close this section, let us discuss what the advantage of the two-column 'transactional' data format is. This 'slim' format offers a very flexible way to store both set-valued and scalar-valued data attributes in one single flat data structure without any redundancy. In our example, some of the attributes typically contain many different values per repair ID, for example ERROR_LOG, EXTRA_EQUIPMENT or FINDING. Other attributes, such as CAR_TYPE or KM_CLASS, contain only one value per repair ID. The first group is set-valued (with respect to one repair ID), the second group is scalar-valued. Both groups can be stored together without introducing redundancy such as repeated identical values of the scalar-valued attributes.