Glossary

χ² conf
(module: Bivariate Exploration and Correlations)

χ² confidence indicates whether the field value distribution of one field changes significantly when the other field has a specific value or a value in a specific range. χ² confidence values are numbers between 0 and 1. The closer to 1, the stronger the statistical evidence that a significant impact of one field on the value distribution of the other field has been detected. In general, statisticians consider an impact as 'significant' if the χ² confidence exceeds a value of 0.95 ('95% confidence level') or 0.99 ('99% confidence level').

A χ² confidence number appearing as the rightmost number of a normal matrix row indicates whether the value distribution of the x-axis field systematically differs from its general behavior if the y-axis field assumes the value or value range which is indicated in the leftmost entry of that row.

A χ² confidence number appearing as the last number of a normal matrix column indicates whether the value distribution of the y-axis field systematically differs from its general behavior if the x-axis field assumes the value or value range which is indicated in the first entry of that column.

The χ² confidence number in the bottom-right matrix corner indicates whether there is a significant dependence of the x-axis field's value distribution on the y-axis field's value and vice versa.
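
The following is a minimal sketch (not the product's internal code) of how such a χ² confidence can be obtained from a contingency table of two fields, using scipy; the confidence is 1 minus the p-value of the χ² test, and the table values are made up:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical cross-tabulation: rows = y-axis field values,
# columns = x-axis field values.
table = np.array([[30, 10,  5],
                  [20, 25, 10],
                  [ 5, 15, 30]])

chi2, p_value, dof, expected = chi2_contingency(table)
confidence = 1.0 - p_value
print(f"chi2 confidence: {confidence:.4f}")  # > 0.95 would count as 'significant'
```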


χ² conf
(module: Multivariate Exploration and Split Analysis)

The confidence that the value distribution of the selected data subset differs in a statistically significant way from the overall data's value distribution on the currently selected data field.

The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.


χ² conf
(module: Multivariate Exploration and Split Analysis)

The confidence that the value distributions of the test and the control data differ in a statistically significant way in at least one of the data fields in which the control data are not selected manually but chosen automatically to be as similar to the test data distribution as possible.

The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.


χ² conf
(module: Multivariate Exploration and Split Analysis)

The confidence that the value distributions of the test data and the control data differ in a statistically significant way on the currently selected data field.

The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.


χ² conf.
(module: Multivariate Exploration and Split Analysis)

The confidence that the overall value distribution of the selected subset differs in a statistically significant way from the overall value distribution on the entire data.

The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.


χ² conf.
(module: Multivariate Exploration and Split Analysis)

The confidence that the deviation of the overall selection's lift from 1 is statistically significant.

The confidence is calculated based on a χ² significance test with one degree of freedom.


χ² confidence
(module: Associations Analysis)

The χ² confidence level of an association indicates to what extent each single item is relevant for the association, i.e. whether its occurrence probability together with the other items of the association significantly differs from its overall occurrence probability.

More formally, the χ² confidence level is the result of performing n χ² tests, one for each item of the association. The null hypothesis for each test is: the occurrence frequency of the item is independent of the occurrence of the item set formed by the other n-1 items.

Each of the n tests returns a confidence level (probability) with which the null hypothesis is rejected, and the χ² confidence level of the association is set to the minimum of these n rejection confidences.
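
The following hedged sketch illustrates the described procedure on a Boolean transactions-by-items matrix; the data layout and the function name are assumptions for illustration. For each of the n items, a 2x2 χ² test checks independence between the item and the itemset formed by the other n-1 items, and the minimum of the n rejection confidences is returned:

```python
import numpy as np
from scipy.stats import chi2_contingency

def association_chi2_confidence(transactions: np.ndarray, item_cols: list) -> float:
    confidences = []
    for i in item_cols:
        rest = [j for j in item_cols if j != i]
        has_item = transactions[:, i].astype(bool)
        has_rest = transactions[:, rest].astype(bool).all(axis=1)
        # 2x2 contingency table: item presence vs. presence of the other n-1 items
        table = np.array([
            [np.sum( has_item &  has_rest), np.sum( has_item & ~has_rest)],
            [np.sum(~has_item &  has_rest), np.sum(~has_item & ~has_rest)],
        ])
        _, p, _, _ = chi2_contingency(table)  # assumes no empty row/column in the table
        confidences.append(1.0 - p)
    return min(confidences)  # the association's χ² confidence level
```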


Abs. support
(module: Associations Analysis)

The absolute support of an association is the number of groups (transactions) in which the association occurs.

When specifying the parameters for an associations training, you should always specify a lower boundary for the absolute or relative support, otherwise the training can take an extremely long time.


Abs.diff.
(module: SOM Models)

Maximum absolute difference to the field's overall value distribution: the SOM card shows the nominal value for which the difference between its actual frequency within the records mapped to the given neuron and its expected frequency is maximum.


Absolute support
(module: Sequential Patterns)

The absolute support of a sequence is the number of entities in which the sequence occurs.


Additive Season
(module: Time Series Analysis)

Additive season means that the seasonal pattern is modeled as an added term to the long-term trend ('total = trend + season'). As a result, the amplitude of the seasonal fluctuations is constant and does not grow when the trend line increases.

Multiplicative season means that the seasonal pattern is modeled as a correction factor to the long-term trend ('total = trend * season'). As a result, the amplitude of the seasonal fluctuation increases when the trend line increases and decreases when the trend line decreases.
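
A small numeric illustration of the two season models; the trend and seasonal values are made up for demonstration:

```python
import numpy as np

t = np.arange(24)                            # two years of monthly points
trend = 100 + 5 * t                          # rising long-term trend
season = 20 * np.sin(2 * np.pi * t / 12)     # seasonal pattern with period 12

additive = trend + season                    # constant fluctuation amplitude
multiplicative = trend * (1 + season / 100)  # amplitude grows with the trend
```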


Allow irreversible binning
(module: Data Import)

If this check box is marked, numeric data fields can be discretized into a small number of intervals, and the original field values are irreversibly replaced by interval indices.

For example, the value AGE=37 might be replaced by AGE=[30..40[, and in the compressed data, the precise value 37 will be irreversibly lost.


Assoc Model
(modules: Workbench, Data Import, Associations Analysis)

An associations model is a collection of association rules which have been detected during an associations training run on the training data set. In the associations model panel, you can visualize and introspect the results of an associations training run. You can display the results in tabular form, sort, filter and export the filtered results to flat files or into a table in an RDBMS.

Furthermore, you can calculate additional statistics for the support of single associations in the introspected result.


Associations Detection
(modules: Workbench, Data Import, Associations Analysis)

In this module, you specify the parameters and settings which are to be used for the next associations training run.

Furthermore, you can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop an associations training run and monitor its progress and its predicted run time.


Associations Scoring
(modules: Workbench, Data Import, Associations Analysis)

An associations scoring matches a collection of association rules (an associations model) with a new data table and indicates which associations are fulfilled (supported) by which data sets.

In the associations scoring task panel, you specify the parameters and settings which are to be used for applying detected associations to new data or for gathering additional statistics on the supporting transactions of certain associations.

You can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop associations application runs and monitor their progress and predicted run time.


Automized data field
(module: Multivariate Exploration and Split Analysis)

Data field over whose values an automatically executed series of split analyses is to be performed. Automizable data fields are all fields on which one single value has been selected on the test data and several other values have been selected on the control data.

During each step of the automized series analysis, a different single value out of the initially selected test and control data values is considered the test data, and all remaining initially selected values are considered the control data.


Boolean field
(module: Data Import)

A data field which is to be treated as a Boolean field. If it contains more than 2 different values, all but the first two different values will be ignored, i.e. treated as missing values.


Browser call
(module: Workbench)

For accessing online help, the software must start an external web browser. This parameter contains the calling command for this browser. There are default settings for several operating systems. Therefore, you should only modify this parameter if you are unable to use the online help with the default settings.


Buffer page size
(module: Workbench)

The data page size (in bytes) which is used in the preliminary representation of data field objects. Allowed values are 10000 to 10000000. Larger values can speed up the data reading, but they can also raise memory requirements, in particular on data with many fields.


Cancel the training
(modules: Associations Analysis, Sequential Patterns, SOM Models, Regressions Analysis, Decision Trees)

Aborts the currently running training task without creating a result.


Chart start
(module: Time Series Analysis)

First time point shown in the time series charts


Chart width (pixels)
(modules: Statistics and Distributions, Multivariate Exploration and Split Analysis)

The resolution (number of pixels in x direction) of the single histogram charts. The number refers to 'normal' charts. Extra-wide charts with many histogram bars have a resolution which is a multiple of this number.


Charts/row
(modules: Statistics and Distributions, Multivariate Exploration and Split Analysis)

The number of histogram charts per row. If this value is 0, the software automatically selects a suitable number of charts per row, depending on the total number of charts to be shown.


Child support ratio
(modules: Associations Analysis, Sequential Patterns)

Specify a lower boundary for the acceptable 'support shrinking rate' when creating expanded associations out of existing associations.

An expanded association of n items will be rejected if at least one of the n parent associations has a support which is so large that when multiplied with the minimum shrinking rate, the result is larger than the actual support of the expanded association.


Chi² conf
(module: Multivariate Exploration and Split Analysis)

The confidence that the value distributions of the test and the control data differ in a statistically significant way in at least one of the data fields in which the control data are not selected manually but chosen automatically to be as similar to the test data distribution as possible.

The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.


Chi² conf
(module: Multivariate Exploration and Split Analysis)

The confidence that the value distributions of the test data and the control data differ in a statistically significant way on the currently selected data field.

The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.


Chi² confidence (in the toolbar at the bottom edge of the panel)
(module: Multivariate Exploration and Split Analysis)

The confidence that the overall value distribution of the selected subset differs in a statistically significant way from the overall value distribution on the entire data.

The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.


Chi² confidence (in the toolbar at the bottom edge of the panel)
(module: Multivariate Exploration and Split Analysis)

The confidence that the deviation of the overall selection's lift from 1 is statistically significant.

The confidence is calculated based on a χ² significance test with one degree of freedom.


Chi² confidence (of an association pattern)
(module: Associations Analysis)

The χ² confidence level of an association indicates to what extent each single item is relevant for the association, i.e. whether its occurrence probability together with the other items of the association significantly differs from its overall occurrence probability.

More formally, the χ² confidence level is the result of performing n χ² tests, one for each item of the association. The null hypothesis for each test is: the occurrence frequency of the item is independent of the occurrence of the item set formed by the other n-1 items.

Each of the n tests returns a confidence level (probability) with which the null hypothesis is rejected, and the χ² confidence level of the association is set to the minimum of these n rejection confidences.


Chi² confidence (within a histogram chart title)
(module: Multivariate Exploration and Split Analysis)

The confidence that the value distribution of the selected data subset differs in a statistically significant way from the overall data's value distribution on the currently selected data field.

The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.


Computed fields
(module: Data Import)

Define additional data fields whose values are to be computed from the values of one or more existing data fields.


Confidence
(modules: Associations Analysis, Sequential Patterns)

The confidence of an association rule or sequence rule is the ratio between the rule's support and the rule body's support.

An association rule is an association of n items in which n-1 of the n items are considered the 'rule body' and the remaining item is considered the 'rule head'. Hence, n different association rules can be constructed from one association of length n.

Similarly, a sequence rule is a sequence of n sets of items - separated by n-1 time steps - in which the first n-1 item sets are considered the rule body and the item set after the last time step is considered the rule head.

A rule's confidence is the probability that the rule head is true if one knows for sure that the rule body is true.
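
A minimal sketch of the confidence computation for an association rule, confidence = support(body and head) / support(body); the set-valued transaction layout is assumed purely for illustration:

```python
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
]

def support(itemset, transactions):
    # number of transactions containing all items of the itemset
    return sum(1 for t in transactions if itemset <= t)

body, head = {"bread", "butter"}, {"milk"}
confidence = support(body | head, transactions) / support(body, transactions)
print(confidence)  # 1 of the 2 {bread, butter} transactions also contains milk -> 0.5
```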


Confidence range
(module: Pivot Tables)

This value determines whether an error bar (confidence range) is to be drawn for each point in the diagram, and it determines the confidence range represented by the error bar.

If the confidence value C is selected here, a positive or negative deviation from the actual value in the y direction is, with a confidence of C, due to a significant change in the probability distribution and cannot be explained by a mere statistical fluctuation within the current probability distribution.


Confidences
(modules: Associations Analysis, Sequential Patterns)

The confidences C of the n different ways of interpreting the association as a rule of the form 'if (itemX and itemY and ... are present in a transaction) then also itemZ is present in the transaction with a probability (confidence) of C', in short notation: (itemX,itemY,...) =(C)=> itemZ.

The first number in the list corresponds to the rule (item2,item3,...) =(C)=> item1, the second to the rule (item1,item3,...) =(C)=> item2, and so on.


Confidences
(module: Sequential Patterns)

The confidences C of the n consecutive steps of the sequence. The first number in the list is the probability that an arbitrary entity contains the first item set of the sequence. The second number is the probability that an entity containing the first set also contains the sequence's second item set, and so on.


Contingency
(module: Bivariate Exploration and Correlations)

Cramer's contingency coefficient V as described in http://en.wikipedia.org/wiki/Contingency_table


Control data
(module: Multivariate Exploration and Split Analysis)

The currently selected control data subset in a test-control data analysis. The goal of the analysis is to detect and quantify systematic deviations in the field value distribution properties between the test data subset and the control data subset


Core item purity
(module: Associations Analysis)

The core item purity of an association is the ratio between the association's support and the support of the least frequent item within the association.

A core item purity of 1 indicates a 'mononuclear' group in which the support of the group is determined by the support of its least frequent item.

Note: the core item purity is always larger than or equal to the association's purity.
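
A small sketch of the two purity measures, given illustrative supports; the note above holds because the least frequent item's support never exceeds the most frequent item's support:

```python
item_supports = {"A": 500, "B": 300, "C": 120}  # absolute supports of the single items
association_support = 100                       # support of the association {A, B, C}

purity = association_support / max(item_supports.values())            # 100/500 = 0.2
core_item_purity = association_support / min(item_supports.values())  # 100/120 ≈ 0.83
assert core_item_purity >= purity
```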


Correction Hints
(modules: Deviation Detection, Associations Analysis)

A set of possible corrections which would help remove an inconsistency from some data records. The hints are created based on a statistical analysis of the involved items.


Correlations Analysis
(modules: Workbench, Data Import, Bivariate Exploration and Correlations)

The correlation between two data fields indicates whether or not there is a significant statistical dependency between the values of the two data fields. The correlations module computes and visualizes these field-field correlations


Create a new residual field in the data
(module: Regressions Analysis)

Create a new field in the input data which contains the residuals 'actual target value - predicted target value'. The name of the new field is [targetFieldName]_RESIDUAL.


Create persistent data file
(module: Data Import)

If this check box is marked, a persistent version of the compressed data object will be written to a file and can be refetched later. This speeds up the data reading process in future mining sessions on this data object.


Data block size
(module: Workbench)

Data block size (in bytes) in block-wise data reading from flat text files. Allowed values are 100000 to 1000000000.


Data groups
(module: Statistics and Distributions)

Number of different data groups (group field values) in the input data


Data Subset
(module: All)

In this panel, you can explore the data selections created by a multivariate data exploration or another data analysis module.


Data to be joined-in
(module: Data Import)

Name of the data source from which certain fields are to be added to the currently active main data source


Default result directory
(module: Workbench)

Default directory path in which analysis results are stored.


Detail field
(module: Multivariate Exploration and Split Analysis)

Name of the data field whose value distribution defines the colors of the histogram bars representing the selected data set. When no detail field is selected, the histogram bars are displayed without detail structure and in uniformly blue color.


Detail field
(module: Time Series Analysis)

For each value of this field, a separate time series chart will be drawn


Deviation strength
(module: Deviation Detection)

The strength of a deviation pattern describes how strongly and significantly the number of occurrences of the pattern is below the expected number of occurrences.

The value is calculated as '10*(chi²-conf - 0.9) / lift', where 'lift' is the pattern's lift and 'chi²-conf' is the confidence level that the pattern is statistically significant.

For example, if a combination (A,B) of two data field values A and B occurs in 0.02% of all records and has a chi² confidence level of 0.99, and if A and B alone occur in 20% respectively 10% of the data records, then the deviation strength of the pattern (A,B) is 90 since lift is 0.02%/(20%*10%) = 1/100 and 10*(chi²-conf - 0.9) = 0.9.
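
The quoted example can be verified directly; the following is a sketch of the arithmetic, not the product's code:

```python
p_ab, p_a, p_b = 0.0002, 0.20, 0.10   # 0.02%, 20% and 10% as relative frequencies
chi2_conf = 0.99

lift = p_ab / (p_a * p_b)             # ≈ 0.01, i.e. 1/100
strength = 10 * (chi2_conf - 0.9) / lift
print(round(strength, 2))             # 90.0
```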


Deviations, Inconsistencies
(modules: Workbench, Data Import, Deviation Detection)

In the Deviation Detection panel, outliers, deviations and presumable data inconsistencies can be detected.


Diff. values
(module: Statistics and Distributions)

Number of different valid values of the data field. Note: for binned numeric fields, only those different values are counted which were encountered while collecting statistics for determining the bin boundaries.


Difference
(module: Multivariate Exploration and Split Analysis)

Difference: #selected - #expected.


Discrete field
(module: Data Import)

A data field which is to be treated as a discrete numeric field. If it contains textual values, these values will be ignored, i.e. considered as missing values.


Empty field threshold
(module: Workbench)

Data fields in which (almost) no data row has a valid value are normally of little interest within a data analysis. Therefore the software drops these fields when reading data from a data source. The parameter 'Empty field threshold' specifies the minimum filling rate below which a field will be dropped. The minimum filling rate is a number between 0.0 and 1.0; it describes the fraction of all data records in which the field has a valid value.


Entities
(module: Statistics and Distributions)

Number of different entities (entity field values) in the input data. If no entity field has been specified, the number of entities is equal to the number of groups, or, if no group field has been specified, equal to the total number of data records.


Entity field
(module: Data Import)

Specify a data field which marks several adjacent data records as referring to one single entity (such as a customer, a car, a product, or a patient). The entity data field contains the entity identifier (such as a customer or vehicle or product or patient ID).


ES alpha
(module: Time Series Analysis)

Exponential Smoothing coefficient alpha; defines a damping factor (1-alpha) per time step.
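
A minimal sketch of simple exponential smoothing with coefficient alpha (the function name and data are illustrative, not the product's own routine):

```python
def exponential_smoothing(series, alpha):
    # Each step keeps the fraction (1 - alpha) of the previous smoothed level,
    # which is the damping factor per time step mentioned above.
    level = series[0]
    smoothed = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        smoothed.append(level)
    return smoothed

print(exponential_smoothing([10, 12, 11, 15, 14], alpha=0.3))
```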


ES weight
(module: Time Series Analysis)

Weight prefactor to the Exponential Smoothing part of the forecast; weight=0 switches off the Exponential Smoothing.


Excess
(module: Statistics and Distributions)

The sample excess of the value distribution. Note: the sample excess differs slightly from the population excess (e.g. MS Excel's 'excess kurtosis').


Expected
(module: Multivariate Exploration and Split Analysis)

Expected number of data records or data groups in the selected data subset, assuming that the field value distribution on the selected data is identical to the field value distribution on the entire data.


Expected number of selected data records
(module: Multivariate Exploration and Split Analysis)

Expected number of data records or data groups in the selected data subset, assuming that the field value distribution on the selected data is identical to the field value distribution on the entire data.


Explained fraction of target variance (R²)
(module: Regressions Analysis)

R² is a measure for the predictive power of the regression model. R² near 1 means that the model is able to predict the target values almost perfectly, R² near 0 means that the model is almost useless.
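
A minimal sketch of the standard R² computation, assuming the usual definition R² = 1 - SS_res/SS_tot:

```python
import numpy as np

def r_squared(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))  # 0.98, close to 1: good model
```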


Export the compressed data object
(module: Workbench)

Save the in-memory data object as a persistent .iad file.


Export the data into a text file
(module: Workbench)

Export the data to a data table or flat file, preserving all settings such as active field definitions, field types, discretizations, name mappings or joined tables. For data with set-valued fields or with a group field, you can choose among several output data formats:

The 'set-valued' format: one data row per group; all values of set-valued attributes are written into one single textual string within curly braces {} and separated by comma.

The 'pivoted' format: several data rows per group; all attributes are put into one single 'item' column, which contains values of the form [ATTRIBUTE_NAME]=[VALUE].

The 'boolean fields' format: one data row per group; for each textual value of each non-numeric attribute, the exported data contains one separate Boolean attribute containing '1' if the corresponding attribute value occurs in the current group, and '0' if it does not.

The 'only group IDs' format: creates a one-column output in which only the group IDs of the current data set are contained. This format is helpful if the exported data is only meant to serve as a list of unique keys describing a subset of data records from a larger table.


Field containing the mapped values
(module: Data Import)

The data field in the auxiliary table which contains mapped names for the original values of the affected data field in the main table


Field containing the original values
(module: Data Import)

The data field in the auxiliary file or table which contains the different original values which also appear in the main table field for which the name mapping is being defined. Often, this field is a primary key field of the auxiliary table.


Field containing the taxonomy parents
(module: Data Import)

The data field in the auxiliary file or table which contains the group or category values


Field discretizations
(module: Data Import)

A discretization defines a 'binning' or 'grouping' of fine-grained information from a numeric or textual data field into a small number of classes.

For textual fields, this means that only the N most frequently appearing textual values will be treated as separate values. All other values are represented by the group 'others'.

For numeric fields, this defines a binning into N value ranges (intervals). The interval boundaries are chosen automatically. If the automatically determined interval boundaries for a numeric field are not satisfactory, user-defined interval boundaries can be specified manually by entering a list of N-1 numbers, time or date values in ascending order.
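
For illustration, a sketch of interval binning with user-defined boundaries, following the 'N-1 boundaries for N intervals' convention described above; numpy.digitize is used here purely as an example, not as the product's implementation:

```python
import numpy as np

ages = np.array([12, 37, 45, 63, 81])
bounds = [20, 40, 60, 80]        # N-1 = 4 boundaries -> N = 5 intervals
bin_indices = np.digitize(ages, bounds)
print(bin_indices)               # [0 1 2 3 4]; e.g. 37 falls into the interval [20..40[
```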


Fields to be added
(module: Data Import)

Data fields from the added data source which are to be joined into the currently active main data.


File or table containing the name mappings
(module: Data Import)

A flat file or database table containing at least two data fields (columns). One column contains the different values which currently appear in the main table's data field for which the name mapping is to be defined. The second column contains a mapped value for each of the original different values.


File or table containing the taxonomy relations
(module: Data Import)

A flat file (i.e. a column-separated text file) or database table which contains at least two data fields (columns): a 'parent' column and a 'child' column. The 'parent' and 'child' values in each data row describe one single hierarchy relation between a group or category (parent) and a member of the group or category (child).


Forecast start
(module: Time Series Analysis)

Starting time point for calculating the aggregated forecast values which are shown below the title line of each chart in the time series forecast screen.


Forecasts
(module: Time Series Analysis)

Number of future time series data points to be forecasted


Foreign key field
(module: Data Import)

A data field which is the primary key of another data file or table. The data field can be used to join that other data table into the current data source.


Freq.
(module: SOM Models)

Maximum frequency: the SOM card shows the nominal value which is the most frequent value on the data records mapped to the given neuron.


Frequency
(module: Data Import)

This parameter defines a lower boundary for the number of data records or data groups on which a value of a non-numeric data field must occur for being tracked as a separate field value and a separate bar in histogram charts. Less frequent values will be grouped into the category 'others'.


Frequency threshold for perfect tupels
(module: Workbench)

Default setting for the minimum required frequency above which a tupel of several items can be considered as a perfect tupel. Must be an integer larger than 1.


Graphs per row
(module: Time Series Analysis)

Number of time series graphs per row


Group field
(module: Data Import)

The input data for data mining can be pivoted or unpivoted. In the unpivoted data format, each 'object to be analyzed' (for example a customer, a process or a production tranche) is represented by exactly one data record (data row). In this case, no group column has to be specified.

In the pivoted data format, each 'object to be analyzed' can span multiple adjacent rows of data: there is one 'item' column containing one single property of the object per data row; and there is a 'group' column which contains an unambiguous identifier for the object to which the current data row belongs. In this case, the name of that group column must be specified here. One such 'object to be analyzed' is often called a 'transaction'.


Height of the neural net
(modules: SOM Models, Reporting)

The number of neurons in the y direction. Should be a number between 2 and 100.


Height-width ratio
(module: Time Series Analysis)

Height to width ratio of the time series charts to be created.


Icon (large)
(module: Workbench)

The icon to appear in the 'Help'-->'About' info screen. When working without a license key (free test version), you can freely change that icon. When working with a license key, the software checks that the name of the icon corresponds to the information stored in the license key.


Icon (small)
(module: Workbench)

The icon to appear in the upper left corner of the graphical workbench window. When working without a license key (free test version), you can freely change that icon. When working with a license key, the software checks that the name of the icon corresponds to the information stored in the license key.


Ignore invalid/missing values
(module: Bivariate Exploration and Correlations)

Ignore all missing and invalid values in the bivariate analysis.


Include constant offset term
(module: Regressions Analysis)

If this check box is marked, a linear model with constant term (y = b0 + b1*x1 + ... + bn*xn) will be created. Otherwise, a model without the term b0 will be created.


Incompatible items
(modules: Deviation Detection, Associations Analysis, Sequential Patterns)

If a set of items has been specified as incompatible (by pairs), then none of the detected deviations, associations or sequences will contain more than one item out of this set.

Enter several patterns, separated by comma (,). If a pattern contains a comma as part of the pattern name, escape it by a backslash (\). Each pattern can contain one or more wildcards (*) at the beginning, in the middle and/or at the end.


Index
(module: Statistics and Distributions)

Value index, i.e. the value's position on the list of all values. For numeric fields, value indices are assigned in the natural order of the values: the smallest value has index 1. For textual fields, value indices are assigned by decreasing frequency: the most frequent value of a data field has the index 1, the second most frequent one the index 2, and so on.


Initial learning rate
(module: SOM Models)

A number between 0 and 1 which indicates how much the input weights of the best matching neuron are moved towards the field values of a data record when that record is presented to the SOM net during training.


Input Data
(module: Workbench)

In this panel, you can define, describe, preprocess and manage a data source that you want to use for the subsequent data analysis steps.


Intersection
(module: Associations Analysis)

If 'superset' is checked, the 'Show', 'Explore' and 'Export' buttons will handle each data record or group which supports at least one of the selected associations.

If 'intersection' is checked, the 'Show', 'Explore' and 'Export' buttons will only handle those data groups which support all selected associations.


Interval bounds (numeric fields only)
(module: Data Import)

Specify the desired interval boundaries. Specify n-1 numeric values in ascending order, separated by ';', '|' or ' ', to obtain n intervals.


Invalid or NULL
(module: Statistics and Distributions)

Number of data records (resp. data groups in the pivoted data format) in which the data field has no valid value.


Invert
(module: Multivariate Exploration and Split Analysis)

Invert the field value selection on the current data field in a multivariate exploration or a test-control data analysis: deactivate the previously selected value ranges and activate those ranges which were filtered out.


Item
(modules: Deviation Detection, Associations Analysis, Sequential Patterns)

An item is an atomic part of an association or sequential pattern, i.e. a single piece of information, typically of the form [field name]=[field value] or [field name]=[field value range from ... to ...].


Item frequencies
(modules: Associations Analysis, Reporting)

The absolute supports of the single items within the association (the first number corresponds to item1, the second to item2, etc.)

A star (*) after the number indicates that the item belongs to the core of the association. The core of an association is the smallest possible subset of items of the association which has the same support as the entire association.


Item pair purity
(modules: Associations Analysis, Sequential Patterns)

The item pair purity of two items i1 and i2, is the number of transactions in which both items occur divided by the maximum of the absolute supports of the two items. Item pairs with a purity of 1 are 'perfect pairs': whenever i1 occurs in a transaction, also i2 occurs in it, and vice versa.
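
A small sketch of the item pair purity on set-valued transactions; the data layout and function names are assumptions for illustration:

```python
transactions = [{"i1", "i2"}, {"i1", "i2"}, {"i1"}, {"i2", "i3"}]

def abs_support(item, transactions):
    return sum(1 for t in transactions if item in t)

def pair_purity(i1, i2, transactions):
    both = sum(1 for t in transactions if i1 in t and i2 in t)
    return both / max(abs_support(i1, transactions), abs_support(i2, transactions))

print(pair_purity("i1", "i2", transactions))  # 2 / max(3, 3) ≈ 0.67, not a perfect pair
```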


Item set length
(module: Sequential Patterns)

The desired item set lengths in the sequences to be detected. Each 'equal-time' part of a sequence is an item set. In the sequence [A] >>> [B],[C],[D], for example, the minimum item set length is 1, the maximum item set length is 3.


Item supports
(modules: Deviation Detection, Associations Analysis, Sequential Patterns)

Number of data records or groups on which the different items which form the pattern appear.


JDBC connection string
(module: Data Import)

The string which is sent to a database management system (DBMS) for getting access to a data table via the JDBC protocol. The string contains the DBMS name, hostname and database name.

A default version of this string is automatically created from the user's input for DBMS type, host name and database name in the database connect panel. If this default string does not work properly, the manual specification of ':[4-digit port number]' after the host name might be necessary.


Joined tables
(module: Data Import)

Define tables and fields within them which are to be joined into the main table - for example master data tables containing additional properties of certain field values of the main table.


Key field in joined file
(module: Data Import)

Key field in the added data source, must contain the same values as the foreign key column in the main data.


Key-like field threshold
(module: Workbench)

Textual fields which contain a very large number of different values are interpreted as 'key-like' fields; the software assumes that their content is not suitable for being incorporated into subsequent analysis or data mining steps, and they are dropped when reading the data source. This parameter defines the number of different field values above which a field is classified as 'key-like'. Allowed values are 100 to 1000000.


Language
(module: Workbench)

Language in which all textual elements of the graphical workbench will appear


Last point completion
(module: Time Series Analysis)

Completion rate of the last time point, compared to the earlier time points.

If, for example, each time point describes the sales figures of one month and for the current month, the current number only covers the accumulated sales figures of 5 out of 25 sales days, then the completion rate of the last time point should be set to 0.2.


License key file
(module: Workbench)

File containing the license key for the software. The file name starts with IA_license_key. There is no license key file if you are working with a free test or trial version of the software.


Lift
(module: Multivariate Exploration and Split Analysis)

This measure compares the actual number of data groups passing the selection criteria to the expected number which would arise if all data fields used as selection criteria were statistically independent.

A lift value larger than 1 indicates that the field values used as selection criteria 'attract' each other, a value smaller than 1 indicates that the field values 'repulse' each other.


Lift
(module: Associations Analysis)

The lift of an association is the actual relative support of the association divided by the product of the relative supports of the items which form the association.

Associations with lift>1 are 'frequent patterns': the items within the association occur more frequently together than expected if these items were statistically independent.

Associations with lift<1 are 'exceptions' or 'deviations': the items within the association occur less frequently together than expected if these items were statistically independent.
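
A minimal arithmetic sketch of the lift computation; the support values are illustrative:

```python
rel_supports = {"A": 0.20, "B": 0.10}  # relative supports of the single items
rel_support_ab = 0.04                  # relative support of the association {A, B}

lift = rel_support_ab / (rel_supports["A"] * rel_supports["B"])
print(round(lift, 2))                  # 2.0 > 1: a 'frequent pattern'
```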


Lift
(module: Sequential Patterns)

The lift of a sequence is a measure for the positive correlation of the item sets (events) which form the sequence.

Sequences with lift>0.5 are 'frequent patterns': the item sets within the sequence occur more frequently in that order than expected if the items were statistically independent.

Sequences with lift values close to zero are 'exceptions' or 'deviations': the items within the sequence occur less frequently in that order than expected if the items were statistically independent.


Lift increase factor
(module: Associations Analysis)

An association of n items has n lift increase factors, namely the n ratios of this association's lift divided by the lifts of its n different 'parent' associations. A parent association is an association which results when one of the n items is dropped.

Specifying limits for the lift increase factor helps keep the result size manageable by suppressing the generation of redundant child patterns for significant parent patterns. When searching for frequent patterns, lift increase factors greater than 1 should be applied, e.g. 1.5. When searching for deviations, lift increase factors smaller than 1 should be applied, e.g. 0.5.

As an example, let us consider the association ('AGE<18' and 'FAMILY_STATUS=child'). On real-life demographic data, this association is a typical frequent pattern with a lift largely above 1, e.g. 3.62. Therefore, when searching for frequent patterns with lift>3, this pattern will be detected. However, most likely also the following patterns will be detected: ('AGE<18' and 'FAMILY_STATUS=child' and 'GENDER=male'), ('AGE<18' and 'FAMILY_STATUS=child' and 'GENDER=female' and 'STATE=CA'), and many more. All these extended patterns most probably have a lift very close to 3.62 since the pattern extensions just add uncorrelated information to the significant 'core' pattern ('AGE<18' and 'FAMILY_STATUS=child'). Setting a minimum lift increase factor of 1.5 helps suppress all these useless extensions, as none of them has a lift greater than 5.43 = 1.5*3.62.


Lift increase factor
(module: Sequential Patterns)

The lift increase factor relates the lift of a sequence to the lift of its parent sequences which results from removing one single item from one of the n equal-time item sets of the sequence.

Specifying limits on the lift increase factor helps suppress the generation of redundant, uninteresting sequences for interesting 'core' sequences. For more details, refer to the explanation of the lift increase factor in the associations training module.


Linear
(module: Regressions Analysis)

In linear regression, the value of a numeric target field t is expressed as a linear formula of the values of several other data fields x, the so-called predictor fields or regressors: t = b0 + b1*x1 + ... + bn*xn.
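
A minimal sketch of fitting such a linear formula by ordinary least squares with numpy (not the product's own solver); the data values are made up so that t = 1 + 3*x1 + 2*x2:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])             # regressor values x1, x2
t = np.array([8.0, 9.0, 18.0, 19.0])   # target values

design = np.column_stack([np.ones(len(X)), X])  # leading column of 1s yields b0
coeffs, *_ = np.linalg.lstsq(design, t, rcond=None)
print(coeffs)                          # ≈ [1. 3. 2.], i.e. b0, b1, b2
```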


Logistic
(module: Regressions Analysis)

In logistic regression, the probability of the '1'-value of a two-valued target field t is expressed as a formula of the values of several other data fields x, the so-called predictor fields or regressors. The formula has the form: proba(t=1) = 1/(1+e^(b0+b1*x1+...+bn*xn)).
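
Evaluating the quoted logistic formula for given coefficients; a sketch, with made-up coefficient values:

```python
import math

def logistic_proba(x, b):  # b = [b0, b1, ..., bn], x = [x1, ..., xn]
    z = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(z))   # matches the form 1/(1+e^(b0+b1*x1+...))

print(logistic_proba([0.5, 1.2], b=[-1.0, 0.8, -0.3]))  # ≈ 0.72
```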


Look and feel
(module: Workbench)

You can adapt the workbench design and style (look and feel) to your preferences and to your operating system. You can choose between an 'MS-Windows' style, a 'Unix-Motif' style and a system-independent 'Java native' ('metal') style. Do not select 'windows' if you are running on Mac OS, Unix or Linux.


Mapped Name
(module: Statistics and Distributions)

Mapped field value names as they have been read from an auxiliary name mapping table.


Max. #deviations
(module: Deviation Detection)

Keep the result size manageable by limiting the maximum number of deviation patterns to be detected. If more deviation patterns can be found, only the strongest ones of them are kept.


Max. number of active fields
(module: Data Import)

The maximum desired number of active data fields. If the number of currently active fields exceeds this value, some of them will be deactivated. The software decides autonomously which fields are deactivated, based on the number of missing values, the number of different values and field-field correlations.


Max. number of iterations
(module: SOM Models)

Limit the possible number of SOM iterations. Within one SOM iteration, the SOM training algorithm performs one scan over all training data records and uses each record for adapting the neuron weights of the best matching neuron and its neighbors.


Max. number of selected data rows
(module: Workbench)

From various analysis modules of the software, the user can select a data subset, display it in tabular form in a separate screen window and export it to a flat file or database table. In this parameter, you can specify the maximum allowed number of data rows in such data subsets. Larger subsets will be truncated. Allowed values are 100 to 100000000.


Max. pattern length
(module: Deviation Detection)

The maximum length of the deviation patterns to be detected.


Max. tupel length
(module: Statistics and Distributions)

Upper limit for the length of the tupels to be identified, i.e. the maximum number of items per tupel.


Maximum neighbor distance
(module: SOM Models)

The maximum Euclidean distance between neighboring neurons in the SOM net over which adaptations to one neuron influence the neighboring neuron.


Maximum number of different textual values per field
(module: Data Import)

Define a maximum number N of different textual values (categories) per data field. Whenever a textual field has more than N different values, only the N most frequent of them will be kept, all other ones will be grouped into the category 'others'.


Maximum textual value length
(module: Data Import)

Specify the maximum number of characters in textual values. Longer textual values will be truncated in the compressed data.


MC conf
(module: Associations Analysis)

MC conf stands for 'Monte Carlo significance verification confidence'. This measure indicates how sure one can be that the given association contains a statistically significant rule within the data and is not a product of chance, that means random noise in the data.

The measure is calculated by trying to find associations with similar support, lift and purity values in simulated artificial data which contain the same items with the same item frequencies as the original data, but no correlations between the items.


Median
(module: Statistics and Distributions)

The median of the value distribution, that means the smallest value such that 50% of the data records or groups have a value which is smaller or equal.

For irreversibly binned fields, the exact median cannot be determined; instead, the midpoint of the interval containing the median is returned.


Memory usage limit (MB)
(module: Multivariate Exploration and Split Analysis)

Upper limit (in MB) on the RAM to be used by the automized series of split analysis tasks.


Min. #affected records
(module: Deviation Detection)

A minimum threshold for the number of data records in which a deviation pattern occurs. Deviation patterns which occur less frequently in the data will not be shown.


Min. deviation increase
(module: Deviation Detection)

A minimum threshold for the increase in deviation strength when expanding patterns by adding another part (item). If this threshold is X, then only those patterns will be shown whose deviation strength is at least X times the deviation strength of each 'parent' pattern which can be obtained from the initial pattern by removing one part (item).


Min. deviation strength
(module: Deviation Detection)

A minimum threshold for the strength of the deviation patterns to be detected. The strength of a deviation is the inverse of the deviation's lift value. For example, if a combination (A,B) of two data field values A and B occurs in 0.02% of all records, and if A and B alone occur in 20% respectively 10% of the data records, then the deviation strength of the pattern (A,B) is 100 since 0.02% is 100 times less than the expected occurrence frequency of 20% * 10% = 2%.


Min. tupel support
(module: Statistics and Distributions)

Minimum tupel support. The support of a tupel is the number of data groups in which all items of the tupel occur.


Minimum textual value frequency
(module: Data Import)

This parameter defines a lower boundary for the number of data records or data groups on which a value of a non-numeric data field must occur for being tracked as a separate field value and a separate bar in histogram charts. Less frequent values will be grouped into the category 'others'.


Minimum tupel purity
(module: Statistics and Distributions)

Minimum purity of the tupels to be detected. The purity of a tupel is the tupel's occurrence frequency divided by the occurrence frequency of the tupel's most frequent item.


Model name
(modules: Associations Analysis, Sequential Patterns, SOM Models, Regressions Analysis, Decision Trees)

File name under which the generated data mining model or analysis result will be stored on disk. The file name suffix determines the file format: .xml and .pmml produce a PMML model, .sql creates an SQL SELECT statement, .txt and .mdl create a flat text file.


Mouse-over help text dismiss delay
(module: Workbench)

Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies for how many seconds the help text is shown.


Mouse-over help text initial delay
(module: Workbench)

Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies how many seconds after placing the mouse pointer the help text pops up.


Mouse-over help text reshow delay
(module: Workbench)

Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies for how many seconds the help text cannot be reshown after it has been shown once.


Name mappings
(module: Data Import)

A name mapping defines more readable textual values (e.g. product names) for the original values (e.g. product IDs) of a data field.

A name mapping definition must contain the file or table name (optionally preceded by the directory path or JDBC connection), the names of the fields (columns) containing the original and the mapped value, and the field name of the main data source to which the name mapping applies.


Negated items
(modules: Associations Analysis, Sequential Patterns)

Negative items are items for which the complement, i.e. the fact that the item does NOT occur, should be treated as a separate item. For example, if the item 'OCCUPATION=Manager' is added to the list of negative items, then the item 'OCCUPATION!=Manager' is created, and its support is the complement of the support of 'OCCUPATION=Manager'.


No Negative Values
(module: Time Series Analysis)

Restrict the allowed range for the predicted time series values to values greater than or equal to zero.


Nominal value selection mode
(module: SOM Models)

Method for selecting the 'best' nominal value which is shown in the SOM cards for nominal data fields.


Null-value string
(module: Workbench)

If a non-empty string is specified for this parameter, then this string will be interpreted as 'n/a' ('invalid or missing value') whenever it occurs as the value of a data field.


Number of active fields
(module: Data Import)

The number of currently activated data fields (not counting the entity field).


Number of items
(module: Sequential Patterns)

The total number of items in the sequences to be detected. An item is one elementary piece of information, that means an atomic part within the sequential pattern.


Number of patterns
(module: Associations Analysis)

Keep the result size manageable by limiting the maximum number of associations to be detected. If more associations can be found, only the 'best' ones of them are kept. The criterion for selecting the 'best' associations can be defined using the radio button 'Sorting criterion'.


Number of regressors
(module: Regressions Analysis)

The total number of data fields which appear on the right-hand side of the regression equation which predicts target field values.


Number of sequences
(module: Sequential Patterns)

Keep the result size manageable by limiting the maximum number of sequences to be detected. If more sequences can be found, only the 'best' ones of them are kept. The criterion for selecting the 'best' sequences can be defined using the radio button 'Ranking criterion'.


Number of threads
(module: All)

Specify an upper limit for the number of parallel threads used for reading and compressing the data. If no number or a number smaller than 1 is given here, the maximum available number of CPU cores will be used in parallel.


Number of values or intervals
(module: Data Import)

Determine the number of separately treated values or value ranges. Allowed values are 2...100 for numeric fields and 0...100 for textual fields.


Numeric field
(module: Data Import)

A data field which is to be treated as a numeric field. If it contains textual values, these values will be ignored, i.e. considered as missing values.


Numeric field weight
(module: SOM Models)

By default, each numeric data field contributes with the same weight factor (of 1) to the distance calculations between neurons and data records as the Boolean and textual fields. Using this parameter, you can define a higher or lower weight factor for the numeric fields compared to Boolean and textual fields. Note that weight settings for specific fields override this general setting; the weight factors are not multiplied.


Numeric precision (digits)
(module: Data Import)

Specify the maximum numeric precision, i.e. the maximum number of digits that will be regarded when reading numeric values.

With the precision of 3, for example, the number 55555 will be stored as 55600 and -1.23456e-17 as -1.23e-17.
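
A sketch of rounding to a fixed number of significant digits, reproducing the two examples above; this illustrates the concept and is not the product's exact routine:

```python
from math import floor, log10

def round_sig(value, digits=3):
    if value == 0:
        return 0.0
    # Shift the rounding position according to the value's order of magnitude.
    return round(value, -int(floor(log10(abs(value)))) + digits - 1)

print(round_sig(55555))         # 55600
print(round_sig(-1.23456e-17))  # ≈ -1.23e-17
```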


Only entity IDs
(module: Sequential Patterns)

If 'only entity IDs' is checked, the 'Show' and 'Export' buttons will show or export only the entity IDs of the supporting entities.

If 'entire records' is checked, the 'Show' and 'Export' buttons will show or export the supporting entities with all their available data fields.


Operator
(module: Data Import)

The operator which will be applied to the existing input field(s) and/or the existing value(s) in order to create the value of the computed field.


Optimize the control data
(module: Multivariate Exploration and Split Analysis)

Create a subset of the current control data set. The subset is aimed to be as representative as possible for the current test data set on all data fields which are not marked 'Target' (T) and for which the user has not manually selected different value ranges for the test and the control data.


Other values
(module: Statistics and Distributions)

Total frequency of all textual values which were not counted as a separate category but summarized under 'others'.


Overall RMSE
(modules: SOM Models, Regressions Analysis)

Root mean squared mapping error of the SOM net on the entire data


Parameter file
(modules: Associations Analysis, Sequential Patterns, SOM Models, Regressions Analysis, Decision Trees)

File name under which the current parameter settings will be stored on disk.


Parent support ratio
(module: Associations Analysis)

The acceptable support growth when comparing a given association to its parent associations.

A parent association (of n-1 items) will be rejected if its support is less than the support of the current association (of n items) multiplied by the minimum parent support ratio.

The effect of this filter criterion is that it reduces the number of detected associations by removing all sub-patterns of long associations whenever the sub-patterns have a support which is not substantially larger than the support of the long association.


Pattern length
(module: Associations Analysis)

The length of an association is the number of items which form the association.

When specifying the parameters for an associations training, you should always specify an upper boundary for the desired association lengths, otherwise the training can take an extremely long time.


Perfect tupel frequency threshold
(module: Workbench)

Default setting for the minimum required frequency above which a tupel of several items can be considered as a perfect tupel. Must be an integer larger than 1.


Perfect tupel purity threshold
(module: Workbench)

Default setting for the minimum purity at which a tupel of several items is considered a perfect tupel. Must be a number between 0.5 and 1.0. For the definition of purity, see the Associations Analysis module.


Perfect Tupels
(module: Statistics and Distributions)

Detect (almost) perfect item tupels in the data, i.e. value combinations of textual set-valued data fields which appear (almost) always together.


Period
(module: Time Series Analysis)

Presumed cycle length of the seasonal (periodic) part of the time series in units of the time step between adjacent data points.


PMML version
(module: Workbench)

The software can create and export data mining models in the vendor-independent PMML format (see http://www.dmg.org/pmml). This parameter defines which version of PMML should be created.


Positions of required items
(module: Sequential Patterns)

The required item type indicates at which position within a sequence the item can occur.

If the type is 'Sequence start', the item must occur in the sequence's first item set.

If the type is 'Sequence end', the item must occur in the sequence's last item set.

If the type is 'Anywhere', the item can occur anywhere within the sequence.


Prediction error (RMSE)
(modules: Regressions Analysis, SOM Models)

Root mean squared prediction error of the regression model on the training data


Primary sorting criterion
(module: Workbench)

The selection box 'Primary sorting criterion' is an option that can be activated when exporting in-memory data objects into a text file on disk. When activated, the option sorts the exported data rows by ascending or descending values of the data field selected in the box.


Purity
(module: Associations Analysis)

The purity of an association is the ratio between the association's support and the support of the most frequent item within the association.

A purity of 1 indicates a 'perfect' group: each single item of the association occurs in a transaction if and only if all the other items of the association also occur in that transaction.


Purity threshold for perfect tupels
(module: Workbench)

Default setting for the minimum purity at which a tupel of several items is considered a perfect tupel. Must be a number between 0.5 and 1.0. For the definition of purity, see the Associations Analysis module.


Quotation mark (default)
(module: Workbench)

If this parameter in the data import settings is set to 'double quote' (or 'single quote'), then double (or single) quotes around field values are removed per default for all input data fields. If this parameter is set to 'none', then double or single quotes around field values are only removed if ALL values of the field are surrounded by the same quotes; in addition, numeric values surrounded by quotes are interpreted as textual values in this case.


Read data
(module: Data Import)

This button starts reading the original data source and transforming the data into a compressed binary data object which resides in memory.


Records for guessing field types
(module: Data Import)

When reading input data from flat files or spreadsheets, the data source does not provide meta data on the types of data (integer, Boolean, floating point, textual) to be expected in the available data columns. Therefore, a presumable data type has to be derived from the data fields' actual content.

The parameter 'Number of records for guessing field types' determines how many leading data rows are read from the data source for guessing data field types.
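For illustration, a plausible type-guessing rule could look like the following Python sketch; the software's actual precedence rules are not documented here:

    def guess_type(values):
        # Guess a field type from the leading string values of a column.
        def is_int(v):
            try:
                int(v)
                return True
            except ValueError:
                return False
        def is_float(v):
            try:
                float(v)
                return True
            except ValueError:
                return False
        if all(v.lower() in ("true", "false") for v in values):
            return "Boolean"
        if all(is_int(v) for v in values):
            return "integer"
        if all(is_float(v) for v in values):
            return "floating point"
        return "textual"

    print(guess_type(["1.5", "2", "3.7"]))  # floating point
    print(guess_type(["A12", "B7"]))        # textual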


Refresh
(module: Statistics and Distributions)

Refresh the screen, for example in order to adapt to a changed screen size.


Regression coefficient
(module: Regressions Analysis)

Regression coefficients are the weight prefactors with which the different regressors enter into the regression equation.


Regression method
(module: Regressions Analysis)

The software supports two regression methods: linear regression and logistic regression. In linear regression, the value of a numeric target field t is expressed as a linear formula of the values of several other data fields x, the so-called predictor fields or regressors: t = b0 + b1*x1 + ... + bn*xn.

In logistic regression, the probability of the '1'-value of a two-valued target field t is expressed as a formula of the kind: proba(t=1) = 1/(1 + e^(b0 + b1*x1 + ... + bn*xn)).
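Written as a minimal Python sketch (b denotes the coefficient vector b0..bn, x the regressor values x1..xn; illustrative only, not the software's API):

    import math

    def linear_predict(b, x):
        # t = b0 + b1*x1 + ... + bn*xn
        return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))

    def logistic_predict(b, x):
        # proba(t=1) = 1/(1 + e^(b0 + b1*x1 + ... + bn*xn))
        z = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
        return 1.0 / (1.0 + math.exp(z))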


Regression Model
(modules: Workbench, Data Import, Regressions Analysis)

In this panel, you can visualize and introspect the results of a regression training run, that is, the regression coefficients and model quality measures such as RMSE or R-squared values.


Regression Scoring
(modules: Workbench, Data Import, Regressions Analysis)

In this module, you specify the parameters and settings which are to be used for applying a regression model to new data.


Regression Training
(modules: Workbench, Data Import, Regressions Analysis)

A regression training establishes a formula which predicts the value of one single data field from the values of some other fields within the training data.

In the regression training panel, you specify the parameters and settings which are to be used for the next regression training run. Furthermore, you can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop a regression training run and monitor its progress and its predicted run time.


Regressor
(module: Regressions Analysis)

A regressor is a data field which appears on the right-hand side of the regression equation and whose values serve to predict the target field value.


Regressor fields
(module: Regressions Analysis)

Upper limit for the number of regressors which can enter into the regression model.


Rel. difference
(module: Multivariate Exploration and Split Analysis)

Relative difference: |#selected - #expected| / #expected.


Rel.diff
(module: SOM Models)

Maximum relative difference to the field's overall value distribution: the SOM card shows the nominal value for which the ratio between its actual frequency within the records mapped to the given neuron and its expected frequency is maximum.


Relative difference
(module: Multivariate Exploration and Split Analysis)

Relative difference: |#selected - #expected| / #expected.


Relative difference
(module: SOM Models)

Maximum relative difference to the field's overall value distribution: the SOM card shows the nominal value for which the ratio between its actual frequency within the records mapped to the given neuron and its expected frequency is maximum.


Relative Frequency
(module: Statistics and Distributions)

Fraction of all data records or data groups which contain the value


Relative item support
(modules: Associations Analysis, Sequential Patterns)

The relative support of an item is the item's absolute support divided by the total number of transactions (groups). In other words, the relative support is the a-priori probability that the item occurs in a randomly selected transaction.


Relative support
(module: Associations Analysis)

The relative support of an association is the absolute support divided by the total number of groups (transactions), that is, the a-priori probability that an arbitrary group supports the association.

When specifying the parameters for an associations training, you should always specify a lower boundary for the absolute or relative support; otherwise, the training can take an extremely long time.


Relative support
(module: Sequential Patterns)

The relative support of the sequence, that is, the fraction of all entities (transaction groups) in which the sequence occurs.


Reporting Preferences
(module: Workbench)

Preference settings for the visual report designer and for creating HTML and PDF reports.


Required items
(modules: Deviation Detection, Associations Analysis, Sequential Patterns)

Required items are items which must occur in each detected pattern. If several item patterns are specified within one 'required group', at least one of them must appear in each detected deviation, association or sequence.

In the Associations and Sequences training modules, up to 3 different groups of required items can be specified. In this case, the detected patterns will contain at least one item out of every specified group. Each item specification can contain wildcards (*) at the beginning, in the middle and/or at the end.


Required items - permitted position
(module: Sequential Patterns)

The required item type indicates at which position within a sequence the item can occur.

If the type is 'Sequence start', the item must occur in the sequence's first item set.

If the type is 'Sequence end', the item must occur in the sequence's last item set.

If the type is 'Anywhere', the item can occur anywhere within the sequence.


Result file
(modules: Associations Analysis, Sequential Patterns, SOM Models, Regressions Analysis, Decision Trees)

File name under which the generated data mining model or analysis result will be stored on disk. The file name suffix determines the file format: .xml and .pmml produce a PMML model, .sql creates an SQL SELECT statement, .txt and .mdl create a flat text file.


RMSE
(modules: Regressions Analysis, SOM Models)

Root mean squared prediction error of the regression model on the training data


Row filter criterion
(module: Data Import)

A sampling criterion or SQL WHERE clause. For example, the criterion '10%' creates a random sample of about 10% of all data rows. The criterion '!10%' creates the complementary subset containing all records which the criterion '10%' would have blocked. The criterion WHERE GENDER='M' selects all data rows whose 'GENDER' value is 'M'.
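For '!10%' to be the exact complement of '10%', a row's sample membership must be reproducible. The following Python sketch illustrates this property with a deterministic hash-based split; the software's actual sampling mechanism is not documented here:

    import zlib

    def in_sample(row_key, percent=10):
        # A row falls into the sample if its hash lands in the lowest
        # `percent` of 100 buckets; the decision is reproducible per row.
        return zlib.crc32(row_key.encode()) % 100 < percent

    rows = ["row-%d" % i for i in range(1000)]
    sample     = [r for r in rows if in_sample(r)]      # like '10%'
    complement = [r for r in rows if not in_sample(r)]  # like '!10%'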



R²
(module: Regressions Analysis)

R² is a measure for the predictive power of the regression model. An R² near 1 means that the model can predict the target values almost perfectly; an R² near 0 means that the model is almost useless.
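R² is conventionally computed as 1 minus the ratio of the residual sum of squares to the total sum of squares; a minimal Python sketch:

    def r_squared(actual, predicted):
        # R^2 = 1 - SS_res / SS_tot
        mean = sum(actual) / len(actual)
        ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
        ss_tot = sum((a - mean) ** 2 for a in actual)
        return 1.0 - ss_res / ss_tot

    print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))  # 0.98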


Screen height
(module: Workbench)

Default height of the main workbench window (in pixels). Allowed values are 480 to 1500


Screen width
(module: Workbench)

Default width of the main workbench window (in pixels). Allowed values are 640 to 2000


Secondary sorting criterion
(module: Workbench)

The selection box 'Secondary sorting criterion' defines an additional sorting criterion which applies for sorting data rows with identical values in the primary sorting criterion.


Selected data rows
(module: Workbench)

From various analysis modules of the software, the user can select a data subset, display it in tabular form in a separate screen window and export it to a flat file or database table. In this parameter, you can specify the maximum allowed number of data rows in such data subsets. Larger subsets will be truncated. Allowed values are 100 to 100000000


Selected records
(module: SOM Models)

The number of data records mapped to the currently selected neurons.


Selected RMSE
(module: SOM Models)

Root mean squared mapping error of the SOM net on the data records mapped to the currently selected neurons.


Sequence length
(module: Sequential Patterns)

The desired sequence lengths of the sequences to be detected. The sequence length is the number of parts (events) separated by time steps.


Sequences Detection
(modules: Workbench, Data Import, Sequential Patterns)

In this panel, you specify the parameters and settings which are to be used for the next Sequential Patterns training run.

Furthermore, you can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop a Sequential Patterns training run and monitor its progress and its predicted run time.

Sequential Patterns Analysis is only possible on data for which an 'Entity' field, a 'Group' field and an 'Order' field have been defined in the 'Active fields' dialog. The Group field and the Order field can be identical; in this case, specify the field as 'Order and Group' field.


Sequences Model
(modules: Workbench, Data Import, Sequential Patterns)

A sequences model is a collection of sequential patterns which have been detected during a sequences training run on a training data set. The model can be applied to a new data source in a sequences scoring step.

In the sequences model panel, you can visualize and introspect the results of a Sequential Patterns training run. You can display the results in tabular form, sort, filter and export the filtered results to flat files or into a table in an RDBMS.

Furthermore, you can calculate additional statistics for the support of selected sequential patterns.


Sequences Scoring
(modules: Workbench, Data Import, Sequential Patterns)

A Sequences Scoring presents new data records to a previously trained Sequential Patterns model. A Sequential Patterns model is a collection of sequences of events which were observed in the data on which the model was trained.

The scoring relates sequences from the model with data records from the new data. This can be done in two ways. The first way examines one or more selected data records (e.g. all purchases of one single customer) and returns all sequences which are partially or fully supported by the selected records. The second way examines one or more selected sequences and returns all records (e.g. all customers) that partially or fully support the selected sequences.

You can store and retrieve both the parameter settings for Sequences Scoring and the scoring results in the form of XML or flat text files.


Set frequencies
(module: Sequential Patterns)

The absolute supports of the item sets which form the sequence (the first number corresponds to set1, the second to set2, etc.).

A star (*) after the number indicates that the set belongs to the core of the sequence. The core of a sequence is the smallest possible sub-sequence of item sets of the sequence which has the same support as the entire sequence.


Significance
(module: Multivariate Exploration and Split Analysis)


Skewness
(module: Statistics and Distributions)

The sample skewness of the value distribution. Note: the sample skewness slightly differs from population skewness (e.g. MS Excel's 'Skewness').


Smoothing
(module: Time Series Analysis)

Number of time points used for calculating the moving average trend line.
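A minimal Python sketch of a moving average over k time points (the software's exact windowing and centering conventions are not documented here):

    def moving_average(series, k):
        # One averaged value per full window of k consecutive points.
        return [sum(series[i:i + k]) / k
                for i in range(len(series) - k + 1)]

    print(moving_average([3, 5, 4, 6, 8, 7], 3))  # [4.0, 5.0, 6.0, 7.0]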


SOM cards per row
(module: SOM Models)

The number of SOM cards placed in one row. Reduce this number for obtaining larger graphs.


SOM Model
(modules: Workbench, Data Import, SOM Models)

A SOM model is a neural network which has been trained in a preceding SOM training run on some training data and which has 'learned' the training data during that training.

You can visualize and introspect the SOM model with its SOM cards. You can explore different regions of the SOM map, explore the statistics of these regions and export data records mapped to these regions to flat files or into a table in an RDBMS.

The model can be applied to a new data source in a SOM scoring step, for example in order to predict one or more data fields' values which are unknown in the new data.


SOM Scoring
(modules: Workbench, Data Import, SOM Models)

A SOM Scoring presents new data records to a previously trained Self Organizing Map (SOM) model. A SOM model is a neural network which represents the data by means of a square grid of neurons.

The scoring can be used to predict missing values in the new data, to classify the new data records as deviations, or to assign them to clusters (segments).

You can store and retrieve both the parameter settings for a SOM scoring and the scoring results in the form of XML or flat text files.


SOM Training
(modules: Workbench, Data Import, SOM Models)

A SOM training task specifies the parameters and settings which are to be used for the next SOM training run.

In the SOM Training Task panel, you can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop a SOM training run and monitor its progress and its predicted run time.


Sorting criterion
(modules: Associations Analysis, Sequential Patterns)

The ranking criterion which is used to discard detected patterns (associations or sequences) when the total number of detected patterns exceeds the user-defined maximum desired number.

Possible values are Support, Lift, Purity, Core item purity, Weight or Trend. Weight is only allowed if a weight field has been defined on the input data. Trend is only allowed if an order field has been defined on the input data.


Split Analysis
(modules: Workbench, Data Import, Multivariate Exploration and Split Analysis)

Split Analysis is a data analysis approach in which two data subsets are selected: a 'test' data set and a 'control' data set. In many use cases, the test data set comprises data records which have a certain property in common, for example all men, all customers below the age of 30, all vehicles produced after an improvement measure has been put into effect, etc.

The first goal of the analysis is to select a suitable control group which is representative of the test group in all attributes except the ones used for defining the test group. The second goal is to find and quantify significant differences between the test data subset and the control data subset.


Standard codepage
(module: Workbench)

Whenever a data source contains non-standard-English characters (such as î, ä, é, € etc.) you must specify in which encoding scheme (codepage) the data have been encoded, otherwise these characters will not be displayed correctly. If you do not know the encoding scheme, you have to try out various choices.


Standard deviation of relative difference
(module: Multivariate Exploration and Split Analysis)

Standard deviation of the relative difference. This value indicates how precisely the relative difference can be determined.


Std. deviation
(module: Statistics and Distributions)

The sample standard deviation of the value distribution (i.e. the 'n' and not the 'n-1' standard deviation!)
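The difference between the two normalizations, as a minimal Python sketch:

    import math

    def std_n(xs):
        # 'n' normalization, as used by this statistic
        m = sum(xs) / len(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

    def std_n_minus_1(xs):
        # 'n-1' (Bessel-corrected) normalization, shown for comparison
        m = sum(xs) / len(xs)
        return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))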


Std.dev.(rel.diff.)
(module: Multivariate Exploration and Split Analysis)

Standard deviation of the relative difference. This value indicates how precisely the relative difference can be determined.


Store the load task as XML file
(module: Data Import)

If this check box is marked, the current data load settings are written into a persistent XML file. The settings in this XML file can later be applied to any new data source of the same structure as the original data source.


Summary result file
(module: Multivariate Exploration and Split Analysis)

File name of a TAB-separated tabular text file in which the summary result of the series of split analysis tasks will be written. The file will contain one row per single split analysis.

If no value is given here, no summary result file will be created.


Superset
(module: Associations Analysis)

If 'superset' is checked, the 'Show', 'Explore' and 'Export' buttons will handle each data record or group which supports at least one of the selected associations.

If 'intersection' is checked, the 'Show', 'Explore' and 'Export' buttons will only handle those data groups which support all selected associations.


Superset
(module: Sequential Patterns)

If 'superset' is checked, the 'Show', 'Explore' and 'Export' buttons will cover each entity which supports at least one of the selected sequences. If 'intersection' is checked, the 'Show', 'Explore' and 'Export' buttons will only cover those entities which support all selected sequences.


Suppressed field
(module: Data Import)

A data field which will be completely ignored.


Suppressed items
(modules: Deviation Detection, Associations Analysis, Sequential Patterns)

Suppressed items are items which are completely ignored during the patterns analysis and which should never occur in the detected patterns. Each item specification can contain wildcards (*) at the beginning, in the middle and/or at the end.


Target (not to be optimized)
(module: Multivariate Exploration and Split Analysis)

Target fields are those visible fields whose field value differences between test and control data will be ignored during the control data optimization.

These fields are the 'target' fields of the hypothesis test. The aim of the test is to find out whether there are significant value distribution differences between the test and control data on these fields.


Target field
(modules: SOM Models, Reporting)

Specify the name of the target field if you want to use the SOM method for predicting the values of one single data field.


Target field
(modules: Regressions Analysis, Decision Trees)

The name of the target field, that is, the field whose values are to be predicted from the values of the other data fields.


Target field weight
(module: SOM Models)

Per default, each data field contributes with the same weight factor (of 1) to the distance calculations between neurons and data records. You can assign a higher weight factor to the target field.


Taxonomies (hierarchies)
(module: Data Import)

A taxonomy is the definition of a category hierarchy. For example, such a hierarchy could define the two products 'butter' and 'cheese' as members of the category 'milk products', and 'milk products' as a sub-category of 'food'.

Taxonomy definitions can be read from flat files or database tables. A taxonomy definition must contain the file or table name (optionally preceded by the directory path or JDBC connection), the names of the fields (columns) containing the parent and the child categories, and the field name of the main data source to which the taxonomy applies.
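For illustration, a taxonomy flat file encoding the example above could look as follows; the column names and layout are an assumption, since they are configured by the user:

    PARENT          CHILD
    food            milk products
    milk products   butter
    milk products   cheese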


Temporary file directory
(module: Workbench)

In this directory, temporary dump files will be stored. Dump files are created when reading data from very large data sources.


Test data
(module: Multivariate Exploration and Split Analysis)

The currently selected test data subset in a test-control data analysis. The goal of the analysis is to detect and quantify systematic deviations in the field value distribution properties between the test data subset and the control data subset.


Textual field
(module: Data Import)

A data field whose values are to be treated as textual (categorical) values even if they are numeric values.


Textual resource file
(module: Workbench)

File in which all textual resources needed by the workbench are stored: labels of menus, input fields and buttons, context sensitive help texts, glossary entries etc. If you want to customize the software, you can work with personalized versions of the default file IA_texts.xml.


Time Series Analysis and Forecast
(modules: Workbench, Data Import, Time Series Analysis)

In the Time Series panel, time series can be explored and forecasts can be calculated using various forecasting algorithms.

This module can only be started on data which fulfill the following requirements:

i) An order field has been defined in the 'Active fields' dialog. This field will be the x-axis field in the time series charts.

ii) A weight/price field has been defined in the 'Active fields' dialog. This field will be the y-axis field in the time series charts.

iii) Not more than two further active fields exist (plus optionally a group field). All other fields have been deactivated in the 'Active fields' dialog.


Time step limits
(module: Sequential Patterns)

Time step limits define which time step size is permissible between adjacent parts (item sets) of a sequence.


Time/order field
(module: Data Import)

A data field should be marked as 'time/order field' if it does not contain a property of the entity to be analyzed but rather the time stamp or step identifier at which the entity's properties in the other data fields of the current data row have been recorded.

For some data mining functions, the specification of a time/order field is required (e.g. sequence analysis, time series prediction), other data mining functions will ignore any time/order information (e.g. associations analysis).


Tooltip dismiss delay
(module: Workbench)

Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies for how many seconds the help text is shown.


Tooltip initial delay
(module: Workbench)

Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies how many seconds after placing the mouse pointer the help text pops up.


Tooltip reshow delay
(module: Workbench)

Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies for how many seconds the help text cannot be reshown after it has been shown once.


Total time window
(module: Sequential Patterns)

The desired time gap between the first and the last part (event) of the sequences to be detected.


Trace file
(module: Workbench)

Name of the trace file to which the software writes success, progress, warning and error messages. Choose a qualified file name such as 'C:\IA\IA_trace.log', or the string 'stdOut' if you want to trace to the black console window.


Trace level
(module: Workbench)

The frequency (intensity) of protocol output. The higher, the more protocol output is produced. Allowed levels are 0 to 4. In level 0, no protocol output is produced. In level 4, the protocol output might become very large if you are working on large data.


Tracked items
(module: Associations Analysis)

Tracked items are items whose occurrence rate is tracked and shown for every detected association. The tracked rate indicates the probability that the tracked item occurs in a data record or group which supports the current association.


Training data
(modules: Associations Analysis, Sequential Patterns, SOM Models, Regressions Analysis, Decision Trees)

Training data are a data collection on which a data mining model is being trained. During the training, the model 'learns' certain rules, interrelations and dependencies between the different data fields of the training data. After the training, the model can be applied to new data, for example in order to predict missing field values or in order to classify or cluster new data records. This is called 'scoring'.


Tree Preferences
(module: Workbench)

Preference settings for Decision and Regression Tree (model training and application)


Tree Training
(modules: Workbench, Data Import, Decision Trees)

A decision tree training establishes a hierarchical, tree-like set of Boolean predicates which describe the typical behavior of one single 'target' attribute in the training data. In the tree training panel, you specify the parameters and settings which are to be used for the next decision tree training run.

Furthermore, you can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop a decision tree training run and monitor its progress and its predicted run time.


Trend damping
(module: Time Series Analysis)

Damping factor applied when projecting current trend into the future.

If, for example, the trend damping factor is 0.9, the time series data are recorded monthly, the current trend is a seasonally corrected month-to-month increase dx, and the current month's seasonally corrected value is x, then the seasonally corrected projected values for the next 3 months will be x+0.9*dx, x+(0.9+0.81)*dx, and x+(0.9+0.81+0.729)*dx.
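The same projection rule, as a minimal Python sketch (x, dx and phi as in the example above):

    def damped_forecast(x, dx, phi, horizon):
        # Step h ahead adds (phi + phi^2 + ... + phi^h) * dx to x.
        return [x + sum(phi ** i for i in range(1, h + 1)) * dx
                for h in range(1, horizon + 1)]

    print(damped_forecast(100.0, 10.0, 0.9, 3))
    # approximately [109.0, 117.1, 124.39]
    # = x+0.9*dx, x+(0.9+0.81)*dx, x+(0.9+0.81+0.729)*dx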


Undo
(module: Multivariate Exploration and Split Analysis)

Undo the previous control data optimization; that is, reactivate all available control data records.


Values
(module: Data Import)

Define a maximum number N of different textual values (categories) per data field. Whenever a textual field has more than N different values, only the N most frequent of them will be kept, all other ones will be grouped into the category 'others'.


Variant elimination
(module: Data Import)

A variant elimination replaces several spelling variants or misspellings, several case variants and/or several synonyms for identical things or concepts by one single 'canonical' form. Variant eliminations can be specified for all textual data fields. Variants can be defined either by listing the variants one by one or by using regular expressions (pattern matching).
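A sketch of regex-based variant elimination in Python; the patterns and canonical forms below are illustrative examples, not rules shipped with the software:

    import re

    # Each rule maps spelling/case variants to one canonical form.
    variant_rules = [
        (re.compile(r"(?i)^colou?r$"), "color"),
        (re.compile(r"(?i)^e[- ]?mail$"), "email"),
    ]

    def canonicalize(value):
        for pattern, canonical in variant_rules:
            if pattern.match(value):
                return canonical
        return value

    print([canonicalize(v) for v in ["Colour", "E-Mail", "phone"]])
    # ['color', 'email', 'phone']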


Verif. confidence
(module: Associations Analysis)

Verification runs serve to assess whether the detected association or sequential patterns are statistically significant patterns or just random fluctuations (white noise).

For each verification run, a separate data base is used. Each data base is generated from the original data by randomly assigning each data field's values to another data row index within the same data field. This approach is called a permutation test. The effect is that correlations and interrelations between different data fields are completely removed from the data. If one finds association or sequential patterns on a permuted data base, one can be sure that one has detected nothing but noise.

One can record and trace the measure triples (pattern length, support, lift) of all detected noise patterns. The edge of the resulting point cloud defines the intrinsic 'noise level' of the original data. Patterns detected on the original data can only be considered significant if their corresponding measure triples are well above the noise level. These patterns have a verification confidence close to 1.
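The permutation step itself can be sketched in a few lines of Python ('table' here is a simple column-oriented dict, not the software's internal data object):

    import random

    def permute_columns(table, seed=None):
        # Shuffle each data field (column) independently: cross-field
        # correlations are destroyed, but each field's own value
        # distribution is preserved.
        rng = random.Random(seed)
        return {field: rng.sample(values, len(values))
                for field, values in table.items()}

    data = {"GENDER": ["M", "F", "M", "F"], "AGE": [23, 54, 31, 47]}
    noise_base = permute_columns(data, seed=1)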


Verification confidence (of an association pattern)
(module: Associations Analysis)

Verification runs serve to assess whether the detected association or sequential patterns are statistically significant patterns or just random fluctuations (white noise).

For each verification run, a separate data base is used. Each data base is generated from the original data by randomly assigning each data field's values to another data row index within the same data field. This approach is called a permutation test. The effect is that correlations and interrelations between different data fields are completely removed from the data. If one finds association or sequential patterns on a permuted data base, one can be sure that one has detected nothing but noise.

One can record and trace the measure triples (pattern length, support, lift) of all detected noise patterns. The edge of the resulting point cloud defines the intrinsic 'noise level' of the original data. Patterns detected on the original data can only be considered significant if their corresponding measure triples are well above the noise level. These patterns have a verification confidence close to 1.


Verification run
(modules: SOM Models, Decision Trees)

In addition to the main training run, you can start 0 to 9 verification runs. Each verification run is a separate training run with the same parameters as the main training run but a different seed value for the random number generator.

The purpose of verification runs is to generate stability and reliability information for the model created by the main training run.


Verification run
(modules: Associations Analysis, Sequential Patterns)

Verification runs serve to assess whether the detected association or sequential patterns are statistically significant patterns or just random fluctuations (white noise).

For each verification run, a separate data base is used. Each data base is generated from the original data by randomly assigning each data field's values to another data row index within the same data field. This approach is called a permutation test. The effect is that correlations and interrelations between different data fields are completely removed from the data. If one finds association or sequential patterns on a permuted data base, one can be sure that one has detected nothing but noise.

One can record and trace the measure triples (pattern length, support, lift) of all detected noise patterns. The edge of the resulting point cloud defines the intrinsic 'noise level' of the original data. Patterns detected on the original data can only be considered significant if their corresponding measure triples are well above the noise level.


Verification runs
(modules: SOM Models, Decision Trees)

In addition to the main training run, you can start 0 to 9 verification runs. Each verification run is a separate training run with the same parameters as the main training run but a different seed value for the random number generator.

The purpose of verification runs is to generate stability and reliability information for the model created by the main training run.


Verification runs
(modules: Associations Analysis, Sequential Patterns)

Verification runs serve to assess whether the detected association or sequential patterns are statistically significant patterns or just random fluctuations (white noise).

For each verification run, a separate data base is used. Each data base is generated from the original data by randomly assigning each data field's values to another data row index within the same data field. This approach is called a permutation test. The effect is that correlations and interrelations between different data fields are completely removed from the data. If one finds association or sequential patterns on a permuted data base, one can be sure that one has detected nothing but noise.

One can record and trace the measure triples (pattern length, support, lift) of all detected noise patterns. The edge of the resulting point cloud defines the intrinsic 'noise level' of the original data. Patterns detected on the original data can only be considered significant if their corresponding measure triples are well above the noise level.


Visible SOM cards
(module: SOM Models)

Select the data fields for which you want to see SOM cards in the main panel above. Per default, the SOM cards for the 20 data fields with highest field importance numbers are shown.


Web browser call command
(module: Workbench)

For accessing online help, the software must start an external web browser. This parameter contains the calling command for this browser. There are default settings for several operating systems. Therefore, you should only modify this parameter if you are unable to use the online help with the default settings.


Weight
(module: Associations Analysis)

The weight of an association is the mean weight of all data records (or data groups) which support the association.

The weight of a data group is either the sum, the average, the minimum, or the maximum of the weight field values, or the number of records, of all input data records which form the group. The actual computation variant depends on the aggregation mode that has been set for the weight field in the input data panel (sum, mean, max, min, or count).


Weight/price field
(module: Data Import)

A data field should be marked as 'weight/price field' if it contains the price, cost, weight, or another numeric quantity which characterizes the 'importance' of the properties given in the other data fields of the current data row.


Width of the neural net
(modules: SOM Models, Reporting)

The number of neurons in direction x. Should be a number between 4 and 100