χ² confidence indicates whether the field value distribution of one field changes significantly when the other field has a specific value or a value in a specific range. χ² confidence values lie between 0 and 1. The closer to 1, the stronger the statistical evidence that a significant impact of one field on the value distribution of the other field has been detected. In general, statisticians consider an impact 'significant' if the χ² confidence exceeds 0.95 ('95% confidence level') or 0.99 ('99% confidence level').
A χ² confidence number appearing as the rightmost number of a normal matrix row indicates whether the value distribution of the x-axis field systematically differs from its general behavior if the y-axis field assumes the value or value range which is indicated in the leftmost entry of that row.
A χ² confidence number appearing as the last number of a normal matrix column indicates whether the value distribution of the y-axis field systematically differs from its general behavior if the x-axis field assumes the value or value range which is indicated in the first entry of that column.
The χ² confidence number in the bottom-right matrix corner indicates whether there is a significant dependence of the x-axis field's value distribution from the y-axis field's value and vice versa.
The confidence that the value distribution of the selected data subset differs in a statistically significant way from the overall data's value distribution on the currently selected data field.
The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.
The confidence that the value distributions of the test and the control data differ in a statistically significant way in at least one of the data fields in which the control data are not selected manually but chosen automatically to be as similar to the test data distribution as possible.
The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.
The confidence that the value distributions of the test data and the control data differ in a statistically significant way on the currently selected data field.
The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.
The confidence that the overall value distribution of the selected subset differs in a statistically significant way from the overall value distribution on the entire data.
The confidence is calculated based on the confidence level with which the null hypothesis 'the two value distributions are identical' is rejected by a χ² test.
The confidence that the deviation of the overall selection's lift from 1 is statistically significant.
The confidence is calculated based on a χ² significance test with one degree of freedom.
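Such a one-degree-of-freedom χ² test can be sketched as follows (a minimal illustration using only the standard library, not the software's actual implementation; the 2x2 table values are invented):

```python
import math

def chi2_confidence_2x2(a, b, c, d):
    """Chi-square test with one degree of freedom on a 2x2 contingency
    table [[a, b], [c, d]]; returns the confidence 1 - p with which the
    independence null hypothesis is rejected."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = 0.0
    for observed, er, ec in ((a, row1, col1), (b, row1, col2),
                             (c, row2, col1), (d, row2, col2)):
        expected = er * ec / n
        chi2 += (observed - expected) ** 2 / expected
    # For one degree of freedom, the p-value is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return 1.0 - p

# Strong dependence between two binary fields -> confidence close to 1
print(chi2_confidence_2x2(90, 10, 10, 90) > 0.99)  # True
```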
The χ² confidence level of an association indicates up to which extent each single item is relevant for the association because its occurrence probability together with the other items of the association significantly differs from its overall occurrence probability.
More formally, the χ² confidence level is the result of performing n χ² tests, one for each item of the association. The null hypothesis for each test is: the occurrence frequency of the item is independent of the occurrence of the item set formed by the other n-1 items.
Each of the n tests returns a confidence level (probability) with which the null hypothesis is rejected, and the χ² confidence level of the association is set to the minimum of these n rejection confidences.
The absolute support of an association is the number of groups (transactions) in which the association occurs.
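Counting absolute support can be sketched like this (a toy illustration; the basket data and item names are invented):

```python
def absolute_support(transactions, itemset):
    """Count the transactions (groups) that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))

baskets = [{"bread", "butter"}, {"bread"}, {"bread", "butter", "milk"}]
print(absolute_support(baskets, {"bread", "butter"}))  # 2
```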
When specifying the parameters for an associations training, you should always specify a lower boundary for the absolute or relative support; otherwise the training can take an extremely long time.
Maximum absolute difference to the field's overall value distribution: the SOM card shows the nominal value for which the difference between its actual frequency within the records mapped to the given neuron and its expected frequency is maximum.
The absolute support of a sequence is the number of entities in which the sequence occurs.
Additive season means that the seasonal pattern is modeled as an added term to the long-term trend ('total = trend + season'). As a result, the amplitude of the seasonal fluctuations is constant and does not grow when the trend line increases.
Multiplicative season means that the seasonal pattern is modeled as a correction factor to the long-term trend ('total = trend * season'). As a result, the amplitude of the seasonal fluctuation increases when the trend line increases and decreases when the trend line decreases.
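The two season models can be illustrated in code (a toy sketch; the trend values, seasonal factors and offsets are invented):

```python
def additive(trend, season):
    # total = trend + season: constant seasonal amplitude
    return [t + s for t, s in zip(trend, season)]

def multiplicative(trend, season):
    # total = trend * season: amplitude grows with the trend
    return [t * s for t, s in zip(trend, season)]

trend = [100, 200, 300, 400]
factors = [1.2, 0.8, 1.2, 0.8]   # multiplicative correction factors
offsets = [20, -20, 20, -20]     # additive seasonal terms

print(additive(trend, offsets))       # [120, 180, 320, 380]
print(multiplicative(trend, factors)) # seasonal swing grows with the trend
```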
If this check box is marked, numeric data fields can be discretized into a small number of intervals, and the original field values are irreversibly replaced by interval indices.
For example, the value AGE=37 might be replaced by AGE=[30..40[, and in the compressed data, the precise value 37 will be irreversibly lost.
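Such a binning step can be sketched as follows (a minimal illustration; the boundary list is invented and the interval notation mimics the example above):

```python
import bisect

def discretize(value, boundaries):
    """Map a numeric value to a half-open interval label such as '[30..40['."""
    edges = [float("-inf")] + list(boundaries) + [float("inf")]
    i = bisect.bisect_right(boundaries, value)
    lo, hi = edges[i], edges[i + 1]
    return f"[{lo}..{hi}["

print(discretize(37, [10, 20, 30, 40, 50]))  # '[30..40['
```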
An associations model is a collection of association rules which have been detected during an associations training run on the training data set. In the associations model panel, you can visualize and introspect the results of an associations training run. You can display the results in tabular form, sort, filter and export the filtered results to flat files or into a table in a RDBMS.
Furthermore, you can calculate additional statistics for the support of single associations in the introspected result.
In this module, you specify the parameters and settings which are to be used for the next associations training run.
Furthermore, you can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop an associations training run and monitor its progress and its predicted run time.
An associations scoring matches a collection of association rules (an associations model) with a new data table and indicates which associations are fulfilled (supported) by which data sets.
In the associations scoring task panel, you specify the parameters and settings which are to be used for applying detected associations to new data or for gathering additional statistics on the supporting transactions of certain associations.
You can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop associations application runs and monitor their progress and predicted run time.
Data field over whose values an automatically executed series of split analyses is to be performed. Automatable data fields are all fields on which one single value has been selected on the test data and several other values have been selected on the control data.
During each step of the automated series analysis, a different single value out of the initially selected test and control data values is considered the test data, and all remaining initially selected values the control data.
A data field which is to be treated as Boolean field. If it contains more than 2 different values, all but the first two different values will be ignored, i.e. treated as missing values.
For accessing online help, the software must start an external web browser. This parameter contains the calling command for this browser. There are default settings for several operating systems. Therefore, you should only modify this parameter if you are unable to use the online help with the default settings.
The data page size (in bytes) which is used in the preliminary representation of data field objects. Allowed values are 10000 to 10000000. Larger values can speed up the data reading, but they can also raise memory requirements, in particular on data with many fields.
Aborts the currently running training task without creating a result.
First time point shown in the time series charts
The resolution (number of pixels in x direction) of the single histogram charts. The number refers to 'normal' charts. Extra-wide charts with many histogram bars have a resolution which is a multiple of this number.
The number of histogram charts per row. If this value is 0, the software automatically selects a suitable number of charts per row, depending on the total number of charts to be shown.
Specify a lower boundary for the acceptable 'support shrinking rate' when creating expanded associations out of existing associations.
An expanded association of n items will be rejected if at least one of its n parent associations has a support so large that, multiplied with the minimum shrinking rate, the result exceeds the actual support of the expanded association.
Define additional data fields whose values are to be computed from the values of one or more existing data fields.
The confidence of an association rule or sequence rule is the ratio between the rule's support and the rule body's support.
An association rule is an association of n items in which n-1 of the n items are considered the 'rule body' and the remaining item is considered the 'rule head'. Hence, n different association rules can be constructed from one association of length n.
Similarly, a sequence rule is a sequence of n sets of items - separated by n-1 time steps - in which the first n-1 item sets are considered the rule body and the item set after the last time step is considered the rule head.
A rule's confidence is the probability that the rule head is true if one knows for sure that the rule body is true.
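The confidence definition above can be sketched in code (a toy illustration; the basket data and item names are invented):

```python
def support(transactions, itemset):
    """Number of transactions containing every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))

def rule_confidence(transactions, body, head):
    """confidence = support(body + head) / support(body)"""
    n_body = support(transactions, body)
    n_rule = support(transactions, set(body) | {head})
    return n_rule / n_body if n_body else 0.0

baskets = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread"}, {"milk"}]
print(rule_confidence(baskets, {"bread"}, "butter"))  # 2/3
```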
This value determines whether an error bar (confidence range) is to be drawn for each point in the diagram, and it determines the confidence range represented by the error bar.
If the confidence value C is selected here, a positive or negative deviation from the actual value in y-direction beyond the error bar is, with a confidence of C, due to a significant change in the probability distribution and cannot be explained by a mere statistical fluctuation within the current probability distribution.
The confidences C of the n different ways of interpreting the association as a rule of the form 'if (itemX and itemY and ... are present in a transaction) then itemZ is also present in the transaction with a probability (confidence) of C', in short notation: (itemX,itemY,...) =(C)=> itemZ.
The first number in the list corresponds to the rule (item2,item3,...) =(C)=> item1, the second to the rule (item1,item3,...) =(C)=> item2, and so on.
The confidences C of the n consecutive steps of the sequence. The first number in the list is the probability that an arbitrary entity contains the first item set of the sequence. The second number is the probability that an entity containing the first set also contains the sequence's second item set, and so on.
Cramer's contingency coefficient V as described in http://en.wikipedia.org/wiki/Contingency_table
The currently selected control data subset in a test-control data analysis. The goal of the analysis is to detect and quantify systematic deviations in the field value distribution properties between the test data subset and the control data subset
The core item purity of an association is the ratio between the association's support and the support of the least frequent item within the association.
A core item purity of 1 indicates a 'mononuclear' group in which the support of the group is determined by the support of its least frequent item.
Note: the core item purity is always larger than or equal to the association's purity.
A set of possible corrections which would help remove an inconsistency from some data records. The hints are created based on a statistical analysis of the involved items.
The correlation between two data fields indicates whether there is a significant statistical dependency between the values of the two fields. The correlations module computes and visualizes these field-field correlations.
Create a new field in the input data which contains the residuals 'actual target value - predicted target value'. The name of the new field is [targetFieldName]_RESIDUAL.
If this check box is marked, a persistent version of the compressed data object will be written to a file and can be refetched later. This speeds up the data reading process in future mining sessions on this data object.
Data block size (in bytes) in block-wise data reading from flat text files. Allowed values are 100000 to 1000000000.
Number of different data groups (group field values) in the input data
In this panel, you can explore the data selections created by a multivariate data exploration or another data analysis module.
Name of the data source from which certain fields are to be added to the currently active main data source
Default directory path in which analysis results are stored.
Name of the data field whose value distribution defines the colors of the histogram bars representing the selected data set. When no detail field is selected, the histogram bars are displayed without detail structure and in uniformly blue color.
For each value of this field, a separate time series chart will be drawn
The strength of a deviation pattern describes how strongly and significantly the number of occurrences of the pattern is below the expected number of occurrences.
The value is calculated as '10*(χ²-conf - 0.9) / lift', where 'lift' is the pattern's lift and 'χ²-conf' is the confidence level that the pattern is statistically significant.
For example, if a combination (A,B) of two data field values A and B occurs in 0.02% of all records and has a χ² confidence level of 0.99, and if A and B alone occur in 20% and 10% of the data records respectively, then the deviation strength of the pattern (A,B) is 90, since the lift is 0.02%/(20%*10%) = 1/100 and 10*(χ²-conf - 0.9) = 0.9.
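The worked example can be checked numerically (a sketch that simply evaluates the formula from the text with the example's numbers):

```python
def deviation_strength(chi2_conf, lift):
    """Deviation strength as defined above: 10*(chi2_conf - 0.9) / lift."""
    return 10 * (chi2_conf - 0.9) / lift

# Numbers from the example: P(A,B) = 0.02%, P(A) = 20%, P(B) = 10%
p_ab, p_a, p_b = 0.0002, 0.20, 0.10
lift = p_ab / (p_a * p_b)                 # 1/100
print(deviation_strength(0.99, lift))     # ~ 90
```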
In the Deviation Detection panel, you can detect outliers, deviations and presumable data inconsistencies.
Number of different valid values of the data field. Note: for binned numeric fields, only those different values are counted which were encountered while collecting statistics for determining the bin boundaries.
Difference: #selected - #expected.
A data field which is to be treated as discrete numeric field. If it contains textual values, these values will be ignored, i.e. considered as missing values.
Data fields in which (almost) no data row has a valid value are normally of little interest within a data analysis. Therefore the software drops these fields when reading data from a data source. The parameter 'Empty field threshold' specifies the minimum filling rate below which a field will be dropped. The minimum filling rate is a number between 0.0 and 1.0; it describes the fraction of all data records in which the field has a valid value.
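The filling-rate check can be sketched like this (a minimal illustration; None stands for a missing value and the column data are invented):

```python
def filling_rate(values):
    """Fraction of records in which the field has a valid (non-missing) value."""
    return sum(1 for v in values if v is not None) / len(values)

def keep_field(values, empty_field_threshold):
    """Keep a field only if its filling rate meets the threshold."""
    return filling_rate(values) >= empty_field_threshold

column = [3, None, 7, None, None, None, None, None, None, None]  # 20% filled
print(keep_field(column, 0.5))  # False: filling rate 0.2 is below 0.5
```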
Number of different entities (entity field values) in the input data. If no entity field has been specified, the number of entities is equal to the number of groups, or, if no group field has been specified, equal to the total number of data records.
Specify a data field which marks several adjacent data records as referring to one single entity (such as a customer, a car, a product, or a patient). The entity data field contains the entity identifier (such as a customer or vehicle or product or patient ID).
Exponential smoothing coefficient alpha; it defines a damping factor (1-alpha) per time step.
Weight prefactor to the Exponential Smoothing part of the forecast; weight=0 switches off the Exponential Smoothing.
The sample excess of the value distribution. Note: the sample excess slightly differs from the population excess (e.g. MS Excel's 'Excess Kurtosis').
Expected number of data records or data groups in the selected data subset, assuming that the field value distribution on the selected data is identical to the field value distribution on the entire data.
R² is a measure for the predictive power of the regression model. R² near 1 means that the model is able to predict the target values almost perfectly, R² near 0 means that the model is almost useless.
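R² can be computed from actual and predicted target values as sketched below (the standard coefficient-of-determination formula; the sample values are invented):

```python
def r_squared(actual, predicted):
    """R² = 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))           # 1.0 (perfect prediction)
print(r_squared([1, 2, 3, 4], [1.1, 2.1, 2.9, 3.9]))   # close to 1
```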
Save the in-memory data object as persistent iad file.
Export the data to a data table or flat file, preserving all settings such as active field definitions, field types, discretizations, name mappings or joined tables. For data with set-valued fields or with a group field, you can choose among several output data formats:
The 'set-valued' format: one data row per group; all values of set-valued attributes are written into one single textual string within curly braces {} and separated by comma.
The 'pivoted' format: several data rows per group; all attributes are put into one single 'item' column, which contains values of the form [ATTRIBUTE_NAME]=[VALUE].
The 'boolean fields' format: one data row per group; for each textual value of each non-numeric attribute, the exported data contains one separate Boolean attribute containing '1' if the corresponding attribute value occurs in the current group, and '0' if it does not.
The 'only group IDs' format: creates a one-column output in which only the group IDs of the current data set are contained. This format is helpful if the exported data is intended only to serve as a list of unique keys describing a subset of data records from a larger table.
The data field in the auxiliary table which contains mapped names for the original values of the affected data field in the main table
The data field in the auxiliary file or table which contains the different original values which also appear in the main table field for which the name mapping is being defined. Often, this field is a primary key field of the auxiliary table.
The data field in the auxiliary file or table which contains the group or category values
A discretization defines a 'binning' or 'grouping' of fine grained information from a numeric or textual data field into a small number or classes.
For textual fields, this means that only the N most frequently appearing textual values will be treated as separate values. All other values are represented by the group 'others'.
For numeric fields, this defines a binning into N value ranges (intervals). The interval boundaries are chosen automatically. If the automatically determined interval boundaries for a numeric field are not satisfactory, user-defined interval boundaries can be specified manually by entering a list of N-1 numbers, time or date values in ascending order.
Data fields from the added data source which are to be joined into the currently active main data.
A flat file or database table containing at least two data fields (columns). One column contains the different values which currently appear in the main table's data field for which the name mapping is to be defined. The second column contains a mapped value for each of the original different values.
A flat file (i.e. column separated text file) or database table which contains at least two data fields (columns): a 'parent' column and a 'child' column. The 'parent' and 'child' values in each data row describe one single hierarchy relation between a group or category (parent) and a member of the group or category (child).
Starting time point for calculating the aggregated forecast values which are shown below the title line of each chart in the time series forecast screen.
Number of future time series data points to be forecasted
A data field which is the primary key of another data file or table. The data field can be used to join that other data table into the current data source.
Maximum frequency: the SOM card shows the nominal value which is the most frequent value on the data records mapped to the given neuron.
This parameter defines a lower boundary for the number of data records or data groups on which a value of a non-numeric data field must occur for being tracked as a separate field value and a separate bar in histogram charts. Less frequent values will be grouped into the category 'others'.
Default setting for the minimum required frequency above which a tuple of several items can be considered as a perfect tuple. Must be an integer larger than 1.
Number of time series graphs per row
The input data for data mining can be pivoted or unpivoted. In the unpivoted data format, each 'object to be analyzed' (for example a customer, a process or a production tranche) is represented by exactly one data record (data row). In this case, no group column has to be specified.
In the pivoted data format, each 'object to be analyzed' can span multiple adjacent rows of data: there is one 'item' column containing one single property of the object per data row; and there is a 'group' column which contains an unambiguous identifier for the object to which the current data row belongs. In this case, the name of that group column must be specified here. One such 'object to be analyzed' is often called a 'transaction'.
The number of neurons in direction y. Should be a number between 2 and 100
Height to width ratio of the time series charts to be created.
The icon to appear in the 'Help'-->'About' info screen. When working without a license key (free test version), you can freely change that icon. When working with a license key, the license key checks that the name of the icon corresponds to the information stored in the license key.
The icon to appear in the upper left corner of the graphical workbench window. When working without a license key (free test version), you can freely change that icon. When working with a license key, the license key checks that the name of the icon corresponds to the information stored in the license key.
Ignore all missing and invalid values in the bivariate analysis.
If this check box is marked, a linear model with constant term (y = b0 + b1*x1 + ... + bn*xn) will be created. Otherwise, a model without the term b0 will be created.
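The difference between the two model forms can be illustrated for a single predictor (a toy ordinary-least-squares sketch, not the software's solver; the sample data are invented):

```python
def fit_line(xs, ys, with_intercept=True):
    """Ordinary least squares for y = b0 + b1*x; with_intercept=False drops b0."""
    if with_intercept:
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
              / sum((x - mx) ** 2 for x in xs))
        return my - b1 * mx, b1
    # Model without the constant term b0: y = b1*x
    b1 = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return 0.0, b1

print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))  # (1.0, 2.0), i.e. y = 1 + 2x
print(fit_line([1, 2, 3, 4], [3, 5, 7, 9], with_intercept=False))
```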
If a set of items has been specified as incompatible (by pairs), then none of the detected deviations, associations or sequences will contain more than one item out of this set.
Enter several patterns, separated by comma (,). If a pattern contains a comma as part of the pattern name, escape it by a backslash (\). Each pattern can contain one or more wildcards (*) at the beginning, in the middle and/or at the end.
Value index, i.e. the value's position in the list of all values. For numeric fields, value indices are assigned in the natural order of the values: the smallest value has index 1. For textual fields, value indices are assigned by decreasing frequency: the most frequent value of a data field has the index 1, the second most frequent one the index 2, and so on.
A number between 0 and 1 which indicates how much the input weights of the best matching neuron are moved towards the field values of a data record when that record is presented to the SOM net during training.
In this panel, you can define, describe, preprocess and manage a data source that you want to use for the subsequent data analysis steps.
If 'superset' is checked, the 'Show', 'Explore' and 'Export' buttons will handle each data record or group which supports at least one of the selected associations.
If 'intersection' is checked, the 'Show', 'Explore' and 'Export' buttons will only handle those data groups which support all selected associations.
Specify the desired interval boundaries. Specify n-1 numeric values in ascending order, separated by ';', '|' or ' ' for obtaining n intervals.
Number of data records (resp. data groups in the pivoted data format) in which the data field has no valid value.
Invert the field value selection on the current data field in a multivariate exploration or a test-control data analysis: deactivate the previously selected value ranges and activate those ranges which were filtered out.
An item is an atomic part of an association or sequential pattern, i.e. a single piece of information, typically of the form [field name]=[field value] or [field name]=[field value range from ... to ...].
The absolute supports of the single items within the association (the first number corresponds to item1, the second to item2, etc.)
A star (*) after the number indicates that the item belongs to the core of the association. The core of an association is the smallest possible subset of items of the association which has the same support as the entire association.
The item pair purity of two items i1 and i2 is the number of transactions in which both items occur divided by the maximum of the absolute supports of the two items. Item pairs with a purity of 1 are 'perfect pairs': whenever i1 occurs in a transaction, i2 also occurs in it, and vice versa.
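Item pair purity can be sketched in code (a toy illustration; the transactions are invented):

```python
def item_pair_purity(transactions, i1, i2):
    """Both-items support divided by the maximum of the single-item supports."""
    s1 = sum(1 for t in transactions if i1 in t)
    s2 = sum(1 for t in transactions if i2 in t)
    both = sum(1 for t in transactions if i1 in t and i2 in t)
    return both / max(s1, s2) if max(s1, s2) else 0.0

# 'a' and 'b' always occur together -> perfect pair with purity 1.0
baskets = [{"a", "b"}, {"a", "b"}, {"c"}]
print(item_pair_purity(baskets, "a", "b"))  # 1.0
```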
The desired item set lengths in the sequences to be detected. Each 'equal-time' part of a sequence is an item set. In the sequence [A] >>> [B],[C],[D], for example, the minimum item set length is 1, the maximum item set length is 3.
Number of data records or groups on which the different items which form the pattern appear.
The string which is sent to a database management system (DBMS) for getting access to a data table via the JDBC protocol. The string contains the DBMS name, hostname and database name.
A default version of this string is automatically created from the user's input for DBMS type, host name and database name in the database connect panel. If this default string does not work properly, the manual specification of ':[4-digit port number]' after the host name might be necessary.
Define tables and fields within them which are to be joined into the main table - for example master data tables containing additional properties of certain field values of the main table.
Key field in the added data source, must contain the same values as the foreign key column in the main data.
Textual fields which contain a very large number of different values are interpreted as 'key-like' fields; the software assumes that their content is not suitable for being incorporated into subsequent analysis or data mining steps, and they are dropped when reading the data source. This parameter defines the number of different field values above which a field is classified as 'key-like'. Allowed values are 100 to 1000000.
Language in which all textual elements of the graphical workbench will appear
Completion rate of the last time point, compared to the earlier time points.
If, for example, each time point describes the sales figures of one month and for the current month, the current number only covers the accumulated sales figures of 5 out of 25 sales days, then the completion rate of the last time point should be set to 0.2.
File containing the license key for the software. The file name starts with IA_license_key. There is no license key file if you are working with a free test or trial version of the software.
This measure compares the actual number of data groups passing the selection criteria to the expected number which would arise if all data fields used as selection criteria were statistically independent.
A lift value larger than 1 indicates that the field values used as selection criteria 'attract' each other, a value smaller than 1 indicates that the field values 'repulse' each other.
The lift of an association is the actual relative support of the association divided by the product of the relative supports of the items which form the association.
Associations with lift>1 are 'frequent patterns': the items within the association occur more frequently together than expected if these items were statistically independent.
Associations with lift<1 are 'exceptions' or 'deviations': the items within the association occur less frequently together than expected if these items were statistically independent.
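The lift definition can be sketched in code (a toy illustration; the basket data are invented):

```python
def association_lift(transactions, items):
    """lift = relative support of the itemset / product of the items' relative supports."""
    n = len(transactions)
    def rel_support(itemset):
        return sum(1 for t in transactions if itemset <= set(t)) / n
    expected = 1.0
    for item in items:
        expected *= rel_support({item})
    return rel_support(set(items)) / expected

# 'a' and 'b' always occur together -> lift > 1 (frequent pattern)
baskets = [{"a", "b"}, {"a", "b"}, {"c"}, {"d"}]
print(association_lift(baskets, {"a", "b"}))  # 2.0
```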
The lift of a sequence is a measure for the positive correlation of the item sets (events) which form the sequence.
Sequences with lift>0.5 are 'frequent patterns': the item sets within the sequence occur more frequently in that order than expected if the items were statistically independent.
Sequences with lift values close to zero are 'exceptions' or 'deviations': the items within the sequence occur less frequently in that order than expected if the items were statistically independent.
An association of n items has n lift increase factors, namely the n ratios of this association's lift divided by the lifts of its n different 'parent' associations. A parent association is an association which results when one of the n items is dropped.
Specifying limits for the lift increase factor helps keep the result size manageable by suppressing the generation of redundant child patterns for significant parent patterns. When searching for frequent patterns, lift increase factors greater than 1 should be applied, e.g. 1.5. When searching for deviations, lift increase factors smaller than 1 should be applied, e.g. 0.5.
As an example, let us consider the association ('AGE<18' and 'FAMILY_STATUS=child'). On real-life demographic data, this association is a typical frequent pattern with a lift largely above 1, e.g. 3.62. Therefore, when searching for frequent patterns with lift>3, this pattern will be detected. However, most likely the following patterns will also be detected: ('AGE<18' and 'FAMILY_STATUS=child' and 'GENDER=male'), ('AGE<18' and 'FAMILY_STATUS=child' and 'GENDER=female' and 'STATE=CA'), and many more. All these extended patterns most probably have a lift very close to 3.62, since the pattern extensions just add uncorrelated information to the significant 'core' pattern ('AGE<18' and 'FAMILY_STATUS=child'). Setting a minimum lift increase factor of 1.5 helps suppress all these useless extensions, as none of them has a lift greater than 5.43 = 1.5*3.62.
The lift increase factor relates the lift of a sequence to the lifts of its parent sequences, which result from removing one single item from one of the n equal-time item sets of the sequence.
Specifying limits on the lift increase factor helps suppress the generation of redundant, uninteresting sequences for interesting 'core' sequences. For more detail, refer to the explanation of the lift increase factor in the associations training module.
In linear regression, the value of a numeric target field t is expressed as a linear formula of the values of several other data fields x, the so-called predictor fields or regressors: t = b0 + b1*x1 + ... + bn*xn.
In logistic regression, the probability of the '1'-value of a two-valued target field t is expressed as a formula of the values of several other data fields x, the so-called predictor fields or regressors. The formula has the form: proba(t=1) = 1/(1+e^(b0+b1*x1+...+bn*xn)).
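The formula above can be sketched in a few lines. Note that it writes the exponent as e^(+z) rather than the textbook e^(-z); the sign is simply absorbed into the coefficients b. This sketch reproduces the formula as given and is not the software's implementation.

```python
import math

def logistic_proba(x, b):
    """proba(t=1) = 1/(1 + e^(b0 + b1*x1 + ... + bn*xn)).
    b[0] is the intercept b0; b[1:] are the coefficients b1..bn."""
    z = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(z))

# With all coefficients zero the predicted probability is exactly 0.5:
logistic_proba([1.0, 2.0], [0.0, 0.0, 0.0])  # 0.5
```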
You can adapt the workbench design and style (look and feel) to your preferences and to your operating system. You can change between a 'MS-Windows' style, a 'Unix-Motif' style and a system-independent 'Java native' ('metal') style. Do not select 'windows' if you are running on Mac OS, Unix or Linux.
Mapped field value names as they have been read from an auxiliary name mapping table.
Keep the result size manageable by limiting the maximum number of deviation patterns to be detected. If more deviation patterns can be found, only the strongest ones of them are kept.
The maximum desired number of active data fields. If the number of currently active fields exceeds this value, some of them will be deactivated. The software decides autonomously which fields are deactivated, based on the number of missing values, the number of different values and field-field correlations.
Limit the possible number of SOM iterations. Within one SOM iteration, the SOM training algorithm performs one scan over all training data records and uses each record for adapting the neuron weights of the best matching neuron and its neighbors.
From various analysis modules of the software, the user can select a data subset, display it in tabular form in a separate screen window and export it to a flat file or database table. In this parameter, you can specify the maximum allowed number of data rows in such data subsets. Larger subsets will be truncated. Allowed values are 100 to 100000000.
The maximum length of the deviation patterns to be detected.
Upper limit for the length of the tuples to be identified, i.e. the maximum number of items per tuple.
The maximum Euclidean distance between neighboring neurons in the SOM net over which adaptations to one neuron influence the neighboring neuron.
Define a maximum number N of different textual values (categories) per data field. Whenever a textual field has more than N different values, only the N most frequent of them will be kept, all other ones will be grouped into the category 'others'.
Specify the maximum number of characters in textual values. Longer textual values will be truncated in the compressed data.
MC conf stands for 'Monte Carlo significance verification confidence'. This measure indicates how sure one can be that the given association contains a statistically significant rule within the data and is not a product of chance, that means random noise in the data.
The measure is calculated by trying to find associations with similar support, lift and purity values in simulated artificial data which contain the same items with the same item frequencies as the original data, but no correlations between the items.
The median of the value distribution, that means the smallest value such that 50% of the data records or groups have a value which is smaller than or equal to it.
For irreversibly binned fields, the exact median cannot be determined; instead, the mid point of the interval containing the median is returned.
Upper limit (in MB) for the RAM to be used by the automated series of split analysis tasks to be deployed.
A minimum threshold for the number of data records in which a deviation pattern occurs. Deviation patterns which occur less frequently in the data will not be shown.
A minimum threshold for the increase in deviation strength when expanding patterns by adding another part (item). If this threshold is X, then only those patterns will be shown whose deviation strength is at least X times the deviation strength of each 'parent' pattern which can be obtained from the initial pattern by removing one part (item).
A minimum threshold for the strength of the deviation patterns to be detected. The strength of a deviation is the inverse of the deviation's lift value. For example, if a combination (A,B) of two data field values A and B occurs in 0.02% of all records, and if A and B alone occur in 20% and 10% of the data records, respectively, then the deviation strength of the pattern (A,B) is 100, since 0.02% is 100 times smaller than the expected occurrence frequency of 20% * 10% = 2%.
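The deviation strength calculation from the example above can be sketched as follows (the frequencies are the ones given in the text):

```python
def deviation_strength(p_joint, p_items):
    """Expected joint frequency under independence divided by the observed
    joint frequency -- i.e. the inverse of the lift."""
    expected = 1.0
    for p in p_items:
        expected *= p
    return expected / p_joint

# A occurs in 20% of records, B in 10%, the combination (A,B) in 0.02%.
strength = deviation_strength(0.0002, [0.20, 0.10])  # 0.02 / 0.0002 = ~100
```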
Minimum tuple support. The support of a tuple is the number of data groups in which all items of the tuple occur.
This parameter defines a lower boundary for the number of data records or data groups on which a value of a non-numeric data field must occur for being tracked as a separate field value and a separate bar in histogram charts. Less frequent values will be grouped into the category 'others'.
Minimum purity of the tuples to be detected. The purity of a tuple is the tuple's occurrence frequency divided by the occurrence frequency of the tuple's most frequent item.
File name under which the generated data mining model or analysis result will be stored on disk. The file name suffix determines the file format: .xml and .pmml produce a PMML model, .sql creates an SQL SELECT statement, .txt and .mdl create a flat text file.
Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies for how many seconds the help text is shown.
Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies how many seconds after placing the mouse pointer the help text pops up.
Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies for how many seconds the help text cannot be reshown after it has been shown once.
A name mapping defines more readable textual values (e.g. product names) for the original values (e.g. product IDs) of a data field.
A name mapping definition must contain the file or table name (optionally preceded by the directory path or jdbc connection), the names of the fields (columns) containing the original and the mapped value, and the field name of the main data source to which the name mapping applies.
Negative items are items for which the complement, i.e. the fact that the item does NOT occur, should be treated as a separate item. For example, if the item 'OCCUPATION=Manager' is added to the list of negative items, then the item 'OCCUPATION!=Manager' is created, and its support is the complement of the support of 'OCCUPATION=Manager'.
Restrict the allowed range for the predicted time series values to values equal or greater than zero.
Method for selecting the 'best' nominal value which is shown in the SOM cards for nominal data fields.
If a non-empty string is specified for this parameter, then this string will be interpreted as 'n/a' ('invalid or missing value') whenever it occurs as the value of a data field.
The number of currently activated data fields (not counting the entity field).
The total number of items in the sequences to be detected. An item is one elementary piece of information, that means an atomic part within the sequential pattern.
Keep the result size manageable by limiting the maximum number of associations to be detected. If more associations can be found, only the 'best' ones of them are kept. The criterion for selecting the 'best' associations can be defined using the radio button 'Sorting criterion'.
The total number of data fields which appear on the right-hand side of the regression equation which predicts target field values.
Keep the result size manageable by limiting the maximum number of sequences to be detected. If more sequences can be found, only the 'best' ones of them are kept. The criterion for selecting the 'best' sequences can be defined using the radio button 'Ranking criterion'.
Specify an upper limit for the number of parallel threads used for reading and compressing the data. If no number or a number smaller than 1 is given here, the maximum available number of CPU cores will be used in parallel.
Determine the number of separately treated values or value ranges. Allowed values are 2...100 for numeric fields and 0...100 for textual fields.
A data field which is to be treated as numeric field. If it contains textual values, these values will be ignored, i.e. considered as missing values.
By default, each numeric data field contributes with the same weight factor (of 1) to the distance calculations between neurons and data records as the Boolean and textual fields. You can define a higher or lower weight factor for the numeric fields compared to Boolean and textual fields using this parameter. Note that weight settings for specific fields override this general setting; the weight factors are not multiplied.
Specify the maximum numeric precision, i.e. the maximum number of digits that will be regarded when reading numeric values.
With a precision of 3, for example, the number 55555 will be stored as 55600 and -1.23456e-17 as -1.23e-17.
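Rounding to a number of significant digits as in the example above can be sketched like this; the software's exact internal rounding rule is not documented, so this is only an approximation of the described behavior.

```python
def round_to_precision(x, digits=3):
    """Keep only `digits` significant digits, using Python's general
    ('g') float formatting to do the rounding."""
    return float(f"{x:.{digits}g}")

round_to_precision(55555)         # 55600.0
round_to_precision(-1.23456e-17)  # -1.23e-17
```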
If 'only entity IDs' is checked, the 'Show' and 'Export' buttons will show or export, respectively, only the entity IDs of the supported entities.
If 'entire records' is checked, the 'Show' and 'Export' buttons will show or export, respectively, the supported entities with all their available data fields.
The operator which will be applied on the existing input field(s) and/or the existing value(s) in order to create the value of the computing field.
Create a subset of the current control data set. The subset is intended to be as representative as possible for the current test data set on all data fields which are not marked 'Target' (T) and for which the user has not manually selected different value ranges for the test and the control data.
Total frequency of all textual values which were not counted as a separate category but summarized under 'others'.
Root mean squared mapping error of the SOM net on the entire data
File name under which the current parameter settings will be stored on disk.
The acceptable support growth when comparing a given association to its parent associations.
A parent association (of n-1 items) will be rejected if its support is less than the support of the current association (of n items) multiplied by the minimum parent support ratio.
The effect of this filter criterion is that it reduces the number of detected associations by removing all sub-patterns of long associations whenever the sub-patterns have a support which is not substantially larger than the support of the long association.
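The filter rule described above can be expressed as a small predicate. The numbers in the usage example are illustrative, not taken from the software.

```python
def parent_rejected(parent_support, child_support, min_ratio):
    """True if the parent association (one item shorter) is filtered out
    because its support is below child_support * min_ratio."""
    return parent_support < child_support * min_ratio

# With a minimum parent support ratio of 1.1, a parent supported by 105 groups
# is rejected relative to a child supported by 100 groups (105 < 110):
parent_rejected(105, 100, 1.1)  # True
# A parent supported by 120 groups is kept (120 >= 110):
parent_rejected(120, 100, 1.1)  # False
```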
The length of an association is the number of items which form the association.
When specifying the parameters for an associations training, you should always specify an upper boundary for the desired association lengths; otherwise the training can take an extremely long time.
Default setting for the minimum required frequency above which a tuple of several items can be considered as a perfect tuple. Must be an integer larger than 1.
Default setting for the minimum purity at which a tuple of several items is considered as a perfect tuple. Must be a number between 0.5 and 1.0. For the definition of purity, see the associations analysis module.
Detect (almost) perfect item tuples in the data, i.e. value combinations of textual set-valued data fields which appear (almost) always together.
Presumed cycle length of the seasonal (periodic) part of the time series in units of the time step between adjacent data points.
The software can create and export data mining models in the vendor-independent PMML format (see http://www.dmg.org/pmml). This parameter defines which version of PMML should be created.
The required item type indicates at which position within a sequence the item can occur.
If the type is 'Sequence start', the item must occur in the sequence's first item set.
If the type is 'Sequence end', the item must occur in the sequence's last item set.
If the type is 'Anywhere', the item can occur anywhere within the sequence.
Root mean squared prediction error of the regression model on the training data
The selection box 'Primary sorting criterion' is an option that can be activated when exporting in-memory data objects into a text file on disk. When activated, the option sorts the exported data rows by ascending or descending values of the data field selected in the box.
The purity of an association is the ratio between the association's support and the support of the most frequent item within the association.
A purity of 1 indicates a 'perfect' group: each single item of the association occurs in a transaction if and only if all the other items of the association also occur in that transaction.
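The purity definition above amounts to a single division. The support numbers in the example are illustrative:

```python
def purity(assoc_support, item_supports):
    """Purity = support of the association divided by the support of its
    most frequent item."""
    return assoc_support / max(item_supports)

# An association supported by 40 transactions, whose individual items occur
# in 50, 80 and 40 transactions; the most frequent item occurs in 80:
purity(40, [50, 80, 40])  # 0.5
```

A purity of 1 results exactly when the association's support equals the support of its most frequent item.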
Default setting for the minimum purity at which a tuple of several items is considered as a perfect tuple. Must be a number between 0.5 and 1.0. For the definition of purity, see the associations analysis module.
If this parameter in the data import settings is set to 'double quote' (or 'single quote'), then double (or single) quotes around field values are removed by default for all input data fields. If this parameter is set to 'none', then double or single quotes around field values are only removed if ALL values of the field are surrounded by the same quotes; in addition, numeric values surrounded by quotes are interpreted as textual values in this case.
This button starts reading the original data source and transforming the data into a compressed binary data object which resides in memory.
When reading input data from flat files or spreadsheets, the data source does not provide meta data information on the types of data (integer, Boolean, floating point, textual) to be expected in the available data columns. Therefore, a presumed data type has to be derived by looking at the data fields' actual content.
The parameter 'Number of records for guessing field types' determines how many leading data rows are read from the data source for guessing data field types.
Refresh the screen, for example in order to adapt to a changed screen size.
Regression coefficients are the weight prefactors with which the different regressors enter into the regression equation.
The software supports two regression methods: linear regression and logistic regression. In linear regression, the value of a numeric target field t is expressed as a linear formula of the values of several other data fields x, the so-called predictor fields or regressors: t = b0 + b1*x1 + ... + bn*xn.
In logistic regression, the probability of the '1'-value of a two-valued target field t is expressed as a formula of the kind: proba(t=1) = 1/(1+e^(b0+b1*x1+...+bn*xn)).
In this panel, you can visualize and introspect the results of a regression training run, that means the regression coefficients and model quality measures such as RMSE or R-squared values.
In this module, you specify the parameters and settings which are to be used for applying a regression model to new data.
A regression training establishes a formula which predicts the value of one single data field from the values of some other fields within the training data.
In the regression training panel, you specify the parameters and settings which are to be used for the next regression training run. Furthermore, you can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop a regression training run and monitor its progress and its predicted run time.
A regressor is a data field which appears on the right-hand side of the regression equation and whose values serve to predict the target field value.
Upper limit for the number of regressors which can enter into the regression model.
relative difference: |#selected - #expected| / #expected.
Maximum relative difference to the field's overall value distribution: the SOM card shows the nominal value for which the ratio between its actual frequency within the records mapped to the given neuron and its expected frequency is maximum.
relative difference: |#selected - #expected| / #expected.
Maximum relative difference to the field's overall value distribution: the SOM card shows the nominal value for which the ratio between its actual frequency within the records mapped to the given neuron and its expected frequency is maximum.
Fraction of all data records or data groups which contain the value
The relative support of an item is the item's absolute support divided by the total number of transactions (groups). In other words, the relative support is the a-priori probability that the item occurs in a randomly selected transaction.
The relative support of an association is the absolute support divided by the total number of groups (transactions), that means the a-priori probability that an arbitrary group supports the association.
When specifying the parameters for an associations training, you should always specify a lower boundary for the absolute or relative support; otherwise the training can take an extremely long time.
The relative support of the sequence, that means the fraction of all entities (transaction groups) in which the sequence occurs
Preference settings for the visual report designer and for creating HTML and PDF reports.
Required items are items which must occur in each detected pattern. If several item patterns are specified within one 'required group', at least one of them must appear in each detected deviation, association or sequence.
In the Associations and Sequences training modules, up to 3 different groups of required items can be specified. In this case, the detected patterns will contain at least one item out of every specified group. Each item specification can contain wildcards (*) at the beginning, in the middle and/or at the end.
The required item type indicates at which position within a sequence the item can occur.
If the type is 'Sequence start', the item must occur in the sequence's first item set.
If the type is 'Sequence end', the item must occur in the sequence's last item set.
If the type is 'Anywhere', the item can occur anywhere within the sequence.
File name under which the generated data mining model or analysis result will be stored on disk. The file name suffix determines the file format: .xml and .pmml produce a PMML model, .sql creates an SQL SELECT statement, .txt and .mdl create a flat text file.
Root mean squared prediction error of the regression model on the training data
A sampling criterion or SQL WHERE clause. For example, the criterion '10%' creates a random sample of about 10% of all data rows. The criterion '!10%' creates the complementary subset containing all records which the criterion '10%' would have blocked. The criterion WHERE GENDER='M' selects all data rows whose 'GENDER' value is 'M'.
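The relationship between a '10%' sample and its '!10%' complement can be illustrated as follows. This is a hypothetical sketch: the software's actual sampling mechanism and any seed handling are not documented here.

```python
import random

def split_sample(rows, fraction=0.10, seed=0):
    """One per-row random decision puts each row into exactly one of the
    two subsets, so the '10%' sample and the '!10%' complement together
    cover all rows with no overlap."""
    rng = random.Random(seed)
    sample, complement = [], []
    for row in rows:
        (sample if rng.random() < fraction else complement).append(row)
    return sample, complement

sample, complement = split_sample(list(range(1000)))
# len(sample) is roughly 100; sample and complement partition the 1000 rows
```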
R² is a measure of the predictive power of the regression model. R² near 1 means that the model is able to predict the target values almost perfectly; R² near 0 means that the model is almost useless.
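The standard coefficient of determination is R² = 1 - SS_res/SS_tot; it is an assumption that the software uses this common formula. A minimal sketch:

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

r_squared([1, 2, 3], [1, 2, 3])  # 1.0 (perfect prediction)
r_squared([1, 2, 3], [2, 2, 2])  # 0.0 (no better than predicting the mean)
```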
Default height of the main workbench window (in pixels). Allowed values are 480 to 1500
Default width of the main workbench window (in pixels). Allowed values are 640 to 2000
The selection box 'Secondary sorting criterion' defines an additional sorting criterion which applies for sorting data rows with identical values in the primary sorting criterion.
From various analysis modules of the software, the user can select a data subset, display it in tabular form in a separate screen window and export it to a flat file or database table. In this parameter, you can specify the maximum allowed number of data rows in such data subsets. Larger subsets will be truncated. Allowed values are 100 to 100000000.
The number of data records mapped to the currently selected neurons.
Root mean squared mapping error of the SOM net on the data records mapped to the currently selected neurons.
The desired sequence lengths of the sequences to be detected. The sequence length is the number of parts (events) separated by time steps.
In this panel, you specify the parameters and settings which are to be used for the next Sequential Patterns training run.
Furthermore, you can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop a Sequential Patterns training run and monitor its progress and its predicted run time.
Sequential Patterns Analysis is only possible on data on which an 'Entity' field, a 'Group' field and an 'Order' field have been defined in the 'Active fields' dialog. The Group field and the Order field can be identical; in this case, specify the field as 'Order and Group' field.
A sequences model is a collection of sequential patterns which have been detected during a sequences training run on a training data set. The model can be applied to a new data source in a sequences scoring step.
In the sequences model panel, you can visualize and introspect the results of a Sequential Patterns training run. You can display the results in tabular form, sort, filter and export the filtered results to flat files or into a table in a RDBMS.
Furthermore, you can calculate additional statistics for the support of selected sequential patterns.
A Sequences Scoring presents new data records to a previously trained Sequential Patterns model. A Sequential Patterns model is a collection of sequences of events which were observed in the data on which the model was trained.
The scoring relates sequences from the model with data records from the new data. This can be done in two ways. The first way examines one or more selected data records (e.g. all purchases of one single customer) and returns all sequences which are partially or fully supported by the selected records. The second way examines one or more selected sequences and returns all records (e.g. all customers) that partially or fully support the selected sequences.
You can store and retrieve both the parameter settings for Sequences Scoring and the scoring results in the form of XML or flat text files.
The absolute supports of the item sets which form the sequence. (the first number corresponds to set1, the second to set2, etc.)
A star (*) after the number indicates that the set belongs to the core of the sequence. The core of a sequence is the smallest possible sub-sequence of item sets of the sequence which has the same support as the entire sequence.
The sample skewness of the value distribution. Note: the sample skewness slightly differs from population skewness (e.g. MS Excel's 'Skewness').
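The two skewness conventions mentioned above can be sketched as follows. Which exact convention the software uses is an assumption based on the note above; the sample correction shown is the one applied e.g. by MS Excel's SKEW() function.

```python
import math

def skewness(xs, sample=False):
    """Population skewness g1 = m3 / m2**1.5 from the biased central moments;
    with sample=True the sqrt(n(n-1))/(n-2) correction (as in Excel's SKEW)
    is applied, which makes the two values differ slightly."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    g1 = m3 / m2 ** 1.5
    if sample:
        g1 *= math.sqrt(n * (n - 1)) / (n - 2)
    return g1

skewness([1, 2, 3])  # 0.0 (symmetric data has zero skewness in both conventions)
```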
Number of time points used for calculating the moving average trend line.
The number of SOM cards placed in one row. Reduce this number for obtaining larger graphs.
A SOM model is a neural network which has been trained in a preceding SOM training run on some training data and which has 'learned' the training data during that training.
You can visualize and introspect the SOM model with its SOM cards. You can explore different regions of the SOM map, explore the statistics of these regions and export data records mapped to these regions to flat files or into a table in a RDBMS.
The model can be applied to a new data source in a SOM scoring step, for example in order to predict one or more data fields' values which are unknown in the new data.
A SOM Scoring presents new data records to a previously trained Self Organizing Map (SOM) model. A SOM model is a neural network which represents the data by means of a square grid of neurons.
The scoring can be used to predict missing values in the new data, to classify the new data records as deviations, or to assign them to clusters (segments).
You can store and retrieve both the parameter settings for a SOM scoring and the scoring results in the form of XML or flat text files.
A SOM training task specifies the parameters and settings which are to be used for the next SOM training run.
In the SOM Training Task panel, you can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop a SOM training run and monitor its progress and its predicted run time.
The ranking criterion which is used to sort out certain detected patterns (associations or sequences) when the total number of detected patterns becomes larger than the user-defined maximum desired number.
Possible values are Support, Lift, Purity, Core item purity, Weight or Trend. Weight is only allowed if a weight field has been defined on the input data. Trend is only allowed if an order field has been defined on the input data.
Split Analysis is a data analysis approach in which two data subsets are selected: a 'test' data set and a 'control' data set. In many use cases, the test data set comprises data records which have a certain property in common, for example all men, all customers below the age of 30, all vehicles produced after an improvement measure has been implemented, etc.
The first goal of the analysis is to select a suitable control group which is representative for the test group in all attributes except the ones used for defining the test group. The second goal is to find and quantify significant differences between the test data subset and the control data subset.
Whenever a data source contains non-standard-English characters (such as î, ä, é, € etc.) you must specify in which encoding scheme (codepage) the data have been encoded, otherwise these characters will not be displayed correctly. If you do not know the encoding scheme, you have to try out various choices.
Standard deviation of relative difference. This value indicates how exactly the relative difference can be calculated.
The sample standard deviation of the value distribution (i.e. the 'n' and not the 'n-1' standard deviation!)
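The divisor-n standard deviation described above can be written out explicitly (the data values in the example are illustrative):

```python
def n_standard_deviation(xs):
    """Standard deviation with divisor n (not n-1), as stated above."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum((x - mean) ** 2 for x in xs) / n) ** 0.5

n_standard_deviation([2, 4, 4, 4, 5, 5, 7, 9])  # 2.0
```

With divisor n-1 instead of n, the same data would give a slightly larger value (about 2.14).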
Standard deviation of relative difference. This value indicates how exactly the relative difference can be calculated.
If this check box is marked, the current data load settings are written into a persistent XML file. The settings in this XML file can later be applied to any new data source of the same structure as the original data source.
File name of a TAB-separated tabular text file in which the summary result of the series of split analysis tasks will be written. The file will contain one row per single split analysis.
If no value is given here, no summary result file will be created.
If 'superset' is checked, the 'Show', 'Explore' and 'Export' buttons will handle each data record or group which supports at least one of the selected associations.
If 'intersection' is checked, the 'Show', 'Explore' and 'Export' buttons will only handle those data groups which support all selected associations.
If 'superset' is checked, the 'Show', 'Explore' and 'Export' buttons will cover each entity which supports at least one of the selected sequences. If 'intersection' is checked, the 'Show', 'Explore' and 'Export' buttons will only cover those entities which support all selected sequences.
A data field which will be completely ignored.
Suppressed items are items which are completely ignored during the patterns analysis and which should never occur in the detected patterns. Each item specification can contain wildcards (*) at the beginning, in the middle and/or at the end.
Target fields are those visible fields whose field value differences between test and control data will be ignored during the control data optimization.
These fields are the 'target' fields of the hypothesis test. The aim of the test is to find out whether there are significant value distribution differences between the test and control data on these fields.
Specify the name of the target field if you want to use the SOM method for predicting the values of one single data field.
The name of the target field, that means the name of the field whose values are to be predicted from the values of the other data fields.
Per default, each data field contributes with the same weight factor (of 1) to the distance calculations between neurons and data records. You can assign a higher weight factor to the target field.
A taxonomy is the definition of a category hierarchy. For example, such a hierarchy could define the two products 'butter' and 'cheese' as members of the category 'milk products', and 'milk products' as a sub-category of 'food'.
Taxonomy definitions can be read from flat files or database tables. A taxonomy definition must contain the file or table name (optionally preceded by the directory path or jdbc connection), the names of the fields (columns) containing the parent and the child categories, and the field name of the main data source to which the taxonomy applies.
In this directory, temporary dump files will be stored. Dump files are created when reading data from very large data sources.
The currently selected test data subset in a test-control data analysis. The goal of the analysis is to detect and quantify systematic deviations in the field value distribution properties between the test data subset and the control data subset
A data field whose values are to be treated as textual (categorical) values even if they are numeric values.
File in which all textual resources needed by the workbench are stored: labels of menus, input fields and buttons, context sensitive help texts, glossary entries etc. If you want to customize the software, you can work with personalized versions of the default file IA_texts.xml.
In the Time Series panel, time series can be explored and forecasts can be calculated using various forecasting algorithms.
This module can only be started on data which fulfill the following requirements:
i) An order field has been defined in the 'Active fields' dialog. This field will be the x-axis field in the time series charts.
ii) A weight/price field has been defined in the 'Active fields' dialog. This field will be the y-axis field in the time series charts.
iii) Not more than two further active fields exist (plus optionally a group field). All other fields have been deactivated in the 'Active fields' dialog.
Time step limits define which time step size is permissible between adjacent parts (item sets) of a sequence.
A data field should be marked as 'time/order field' if it does not contain a property of the entity to be analyzed but the time stamp or step identifier at which the entity's properties in the other data fields of the current data row have been recorded.
For some data mining functions, the specification of a time/order field is required (e.g. sequence analysis, time series prediction), other data mining functions will ignore any time/order information (e.g. associations analysis).
Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies for how many seconds the help text is shown.
Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies how many seconds after placing the mouse pointer the help text pops up.
Most labels, menu items, buttons, input fields and table column headers in the graphical workbench have a 'mouse-over' function showing a context-sensitive pop-up help text. This parameter specifies for how many seconds the help text cannot be reshown after it has been shown once.
The desired time gap between the first and the last part (event) of the sequences to be detected.
Name of the trace file to which the software writes success, progress, warning and error messages. Choose a fully qualified file name such as 'C:\IA\IA_trace.log', or the string 'stdOut' if you want to trace to the console window.
The frequency (intensity) of protocol output. The higher, the more protocol output is produced. Allowed levels are 0 to 4. In level 0, no protocol output is produced. In level 4, the protocol output might become very large if you are working on large data.
Tracked items are items whose occurrence rate is tracked and shown for every detected association. The tracked rate indicates the probability that the tracked item occurs in a data record or group which supports the current association.
Training data are a data collection on which a data mining model is being trained. During the training, the model 'learns' certain rules, interrelations and dependencies between the different data fields of the training data. After the training, the model can be applied to new data, for example in order to predict missing field values or in order to classify or cluster new data records. This is called 'scoring'.
Preference settings for Decision and Regression Tree (model training and application)
A decision tree training establishes a hierarchical, tree-like set of Boolean predicates which describe the typical behavior of one single 'target' attribute in the training data. In the tree training panel, you specify the parameters and settings which are to be used for the next decision tree training run.
Furthermore, you can store your parameter settings, manage them in a repository and later retrieve and reuse them. In the lower part of the panel, you can start and stop a decision tree training run and monitor its progress and its predicted run time.
Damping factor applied when projecting current trend into the future.
If, for example, the trend damping factor is 0.9, the time series data are recorded monthly, the current trend is a seasonally corrected month-to-month increase dx, and the current month's seasonally corrected value is x, then the seasonally corrected projected values for the next 3 months are x+0.9*dx, x+(0.9+0.81)*dx, and x+(0.9+0.81+0.729)*dx.
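The damped projection described above can be sketched as follows. This is a minimal illustrative snippet, not the product's implementation; the function name and signature are hypothetical.

```python
def damped_projection(x, dx, phi, horizon):
    """Project a seasonally corrected value x with current trend dx,
    damping each successive trend step by the factor phi.
    Step k adds (phi + phi^2 + ... + phi^k) * dx to x."""
    projections = []
    cumulative = 0.0  # running sum phi + phi^2 + ... + phi^k
    factor = phi
    for _ in range(horizon):
        cumulative += factor
        projections.append(x + cumulative * dx)
        factor *= phi
    return projections

# Example matching the text: phi = 0.9, x = 100, dx = 10
# month 1: 100 + 0.9*10              = 109.0
# month 2: 100 + (0.9+0.81)*10       = 117.1
# month 3: 100 + (0.9+0.81+0.729)*10 = 124.39
```

With a damping factor below 1, the projected increments shrink geometrically, so the forecast levels off instead of extrapolating the current trend indefinitely.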
Undo the previous control data optimization, i.e. reactivate all available control data records.
Define a maximum number N of different textual values (categories) per data field. Whenever a textual field has more than N different values, only the N most frequent of them are kept; all others are grouped into the category 'others'.
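The category capping just described can be sketched as follows; this is an illustrative snippet under the stated rule (keep the N most frequent values, map the rest to 'others'), not the workbench's actual code.

```python
from collections import Counter

def cap_categories(values, n):
    """Keep the n most frequent textual values of a field;
    map every other value to the catch-all category 'others'."""
    top = {v for v, _ in Counter(values).most_common(n)}
    return [v if v in top else "others" for v in values]

# cap_categories(["a", "a", "b", "b", "b", "c", "d"], 2)
# keeps "b" (3x) and "a" (2x); "c" and "d" become "others"
```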
A variant elimination replaces several spelling variants or misspellings, several case variants and/or several synonyms for identical things or concepts by one single 'canonical' form. Variant eliminations can be specified for all textual data fields. Variants can be defined either by listing the variants one by one or by using regular expressions (pattern matching).
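The regular-expression flavor of variant elimination can be sketched as below. The rule format (pattern, canonical form) and the function name are assumptions for illustration, not the product's configuration syntax.

```python
import re

def eliminate_variants(value, rules):
    """Replace spelling variants, case variants or synonyms by one
    canonical form. Each rule is a (regex pattern, canonical) pair;
    the first pattern that matches the whole value wins, and
    unmatched values pass through unchanged."""
    for pattern, canonical in rules:
        if re.fullmatch(pattern, value, flags=re.IGNORECASE):
            return canonical
    return value

# Hypothetical rules: a spelling variant and an abbreviation/synonym
rules = [
    (r"colou?r", "color"),
    (r"n\.?\s*y\.?|new york", "New York"),
]
```

Because matching is case-insensitive and anchored to the full value, 'Colour' and 'colour' both map to 'color', while values matching no rule are left as they are.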
Verification runs serve to assess whether the detected association or sequential patterns are statistically significant patterns or just random fluctuations (white noise). For each verification run, a separate data base is used. Each data base is generated from the original data by randomly assigning each data field's values to another data row index within the same data field. This approach is called a permutation test. The effect is that correlations and interrelations between different data fields are completely removed from the data. If one finds association or sequential patterns on a permuted data base, one can be sure that one has detected nothing but noise. One can record and trace the measure triples (pattern length, support, lift) of all detected noise patterns. The edge of the resulting point cloud defines the intrinsic 'noise level' of the original data. Patterns detected on the original data can only be considered significant if their corresponding measure triples are well above the noise level. These patterns have a verification confidence close to 1.
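The permutation step described above, i.e. shuffling each data field's values independently across row indices, can be sketched as follows. This is a minimal illustration of the technique, not the software's implementation.

```python
import random

def permute_fields(rows, seed=0):
    """Build a permuted copy of a data table: each column (data field)
    is shuffled independently, so every field keeps its own value
    distribution while all cross-field correlations are destroyed."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)  # permute this field's values over the rows
    return [list(row) for row in zip(*columns)]
```

Any association found on such a permuted table is by construction noise; the strongest patterns found there define the noise level against which patterns from the original data are judged.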
In addition to the main training run, you can start 0 to 9 verification runs. Each verification run is a separate training run with the same parameters as the main training run but a different seed value for the random number generator.
The purpose of verification runs is to generate stability and reliability information for the model created by the main training run.
Select the data fields for which you want to see SOM cards in the main panel above. Per default, the SOM cards for the 20 data fields with highest field importance numbers are shown.
For accessing online help, the software must start an external web browser. This parameter contains the calling command for this browser. There are default settings for several operating systems. Therefore, you should only modify this parameter if you are unable to use the online help with the default settings.
The weight of an association is the mean weight of all data records (or data groups) which support the association.
The weight of a data group is either the sum, the average, the minimum, or the maximum of the weight field values, or the number of records, of all input data records which form the group. The actual computation variant depends on the aggregation mode that has been set for the weight field in the input data panel (sum, mean, max, min, or count).
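The five aggregation modes listed above can be sketched in one small dispatch function; this is an illustrative snippet, and the function name is hypothetical.

```python
def group_weight(weights, mode):
    """Aggregate the weight-field values of the records forming one
    data group, according to the configured aggregation mode."""
    aggregators = {
        "sum": sum,
        "mean": lambda w: sum(w) / len(w),
        "min": min,
        "max": max,
        "count": len,  # ignores the values, counts the records
    }
    return aggregators[mode](weights)
```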
A data field should be marked as 'weight/price field' if it contains the price, cost, weight, or another numeric quantity which characterizes the 'importance' of the properties given in the other data fields of the current data row.
The number of neurons in direction x. Should be a number between 4 and 100.