The XML Application Programming Interface
Content
Command line parameters and the command line processor sacl
General structure and a simple example of an XML task
Reference description of the <InputData> part
attributes of <InputData>
<DataLocator> subelements
<FieldUsage> subelements
<JoinedTable> subelements
<Taxonomy> subelements
<NameMapping> subelements
<Discretization> subelements
<PerfectTupelDetection> subelements
Reference description of the analysis task part
<UnivariateExplorationTask>
<BivariateExplorationTask>
<MultivariateExplorationTask>
<TestControlAnalysisTask>
<CorrelationsTask>
<TimeSeriesTask>
<AssociationsTrainTask>
<SequencesTrainTask>
<RegressionTrainTask>
<SOMTrainTask>
Command line parameters and the command line processor sacl
Based on an XML interface, Synop Analyzer can be used as an analysis kernel within automated workflows or batch processes, or as a plugin component embedded into third-party software. Synop Analyzer can be called in two ways:
- as a 'workbench' with graphical user interface (GUI) for working interactively (SynopAnalyzer.bat),
- as a command line processor which processes a given analysis task without user interaction (sacl.bat).
The first calling variant may take, the second one must take, one or two command line parameters.
In the following sections of this document, syntax and usage of Synop Analyzer XML tasks will be described in more detail.
General structure and a simple example of an XML task
An XML task according to the XML schema http://www.synop-systems.com/xml/InteractiveAnalyzerTask.xsd
consists of two parts:
- a description of the data to be analyzed, in the form of an <InputData> tag,
- a description of the analysis task to be performed.
A simple task, which reads the flat file kunden.txt from the subdirectory doc/sample_data of the Synop Analyzer installation directory and opens it in the Synop Analyzer graphical workbench, is given below:
<?xml version="1.0" ?>
<InteractiveAnalyzerTask>
<InputData>
<InputDataLocator usage="DATA_SOURCE" type="FLAT_FILE"
name="doc/sample_data/kunden.txt"/>
</InputData>
<StartInteractiveAnalyzerGUITask/>
</InteractiveAnalyzerTask>
If you store this task as kunden_task1.xml and start SynopAnalyzer or sacl with this file name as command line argument,
c:> SynopAnalyzer.bat kunden_task1.xml
then the Synop Analyzer workbench opens up, the data are automatically read into memory, and after a few seconds, you can start analyzing them. In other words, the data from kunden.txt have been read, interpreted, compressed, enriched with additional statistics, and are now available in the computer's RAM for arbitrary analysis or data exploration tasks.
You can also submit the XML task directly as a textual string when calling SynopAnalyzer or sacl. In this case, however, you have to 'quote' the task by enclosing it in double quotes, and the double quotes within the string have to be escaped with backslashes (\). The call would then look like this:
c:> SynopAnalyzer.bat "<?xml version=\"1.0\" ?><InteractiveAnalyzerTask>
<InputData><InputDataLocator usage=\"DATA_SOURCE\" type=\"FLAT_FILE\"
name=\"doc/sample_data/kunden.txt\"/></InputData>
<StartInteractiveAnalyzerGUITask/></InteractiveAnalyzerTask>"
Reference description of the <InputData> part
The element <InputData>
describes a data source which can be opened in Synop Analyzer.
Optional attributes of <InputData>
-
nbThreads:
maximum number of parallel threads used while reading and compressing the input data. If this value is missing or smaller than 1, all available CPU cores will be used, with one separate thread per core.
-
nbDigits:
precision (number of digits) with which floating point numbers are stored in the compressed data format. For statistical analysis and Data Mining, rarely more than 4 digit precision is needed, hence 4 is the predefined value. This value can be increased up to a maximum of 8.
-
nbRecordsForDataDescription:
number of data rows which are read for detecting the most probable field types of data fields when reading flat file data. The default value is 1000.
-
maxNbCharacters:
long textual field values are truncated after a certain number of characters while reading and compressing the input data. The default value is 40.
-
maxNbNumericHistogramBins:
defines the level of detail in the histogram charts that are created for numeric data fields. The predefined value is 10, that means the histograms for numeric data fields have up to 10 histogram bars.
-
maxDiffTextualValues:
determines how many different textual values are stored in the compressed data representation of textual data fields. The most frequent values are stored separately, the remaining values are grouped into the category 'others'. Default value is 2000, that means the 2000 most frequent values of each textual data field are treated as separate values.
-
maxNbActiveFields:
if this value is smaller than the number of available data fields in the input data, Synop Analyzer automatically deactivates data fields until no more than
maxNbActiveFields
active data fields remain. During this removal process, each data field is ranked with respect to several criteria: number of missing values, number of different values, predominance of the most frequent value, and existence of high correlations with other fields. The joint score of these criteria yields a 'field importance' score, and the fields with the smallest scores are deactivated. By default, this mechanism is switched off and all active fields are kept.
-
allowIrreversibleBinning:
if this attribute is set to
"true"
, numeric data fields are irreversibly 'binned' into maxNbNumericHistogramBins
different value ranges (bins) if they initially contain more than maxNbNumericHistogramBins
different values. This irreversible binning reduces the size of the compressed data. By default, irreversible binning is switched off.
-
anonymizationLevel:
defines whether and how strongly data field names and data field values are (irreversibly) anonymized when reading input data.
0 (default): no anonymization,
1: anonymize the field names, keep the original field values,
2: anonymize the textual field values and transform all numeric field values such that the resulting value distribution for each numeric data field has a mean of 0 and a standard deviation of 1. Maintain the original data field name,
3: anonymize both the data field name and the field values.
-
exportMode:
defines whether and how the imported and preprocessed input data are to be stored persistently on disk. The following export formats are available:
"COMPRESSED_IAD"
:
the data are stored in the proprietary Synop Analyzer Data Format (.iad), a compressed binary data format which consumes 5% to 10% of the original data size.
"PIVOTED"
:
the data are stored in a two-column, pivoted form. One column contains the record ID (or, if a 'group' column has been specified, the group ID). The other column contains, in several adjacent data rows, all combinations 'data field=value' which appear in the original data for the given record or group ID.
"SET_VALUED"
:
writes uncompressed text data with one column per original data field. If no 'group' field has been defined, the exported data exactly correspond to the input data if these were read from a flat text file. If a 'group' field has been specified and one group ID can span several data rows in the original data, the exported format differs: it always contains exactly one data row for each group ID. If certain data fields have several values per group ID, the entire set of values is stored as one single textual string, enclosed in curly braces {}.
"BOOLEAN_FIELDS"
:
transforms pivoted input data in a data format with a large number of two-valued (yes/no) data fields and exactly one data row per group ID. Each of the new data fields stands for one combination 'data field=value' from the original data, and this field contains the value "1" if the current groupID contains the combination 'data field=value', and "0" otherwise.
"GROUP_ID"
:
writes a file which contains one single column. This column contains a record ID, or, if a 'group' field has been specified, the group ID. This data format is not useful for storing the entire data, but very helpful for storing previously selected data subsets, for example the customer IDs of previously selected customers etc.
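Taken together, several of these optional attributes might be combined as in the following sketch (the attribute values are purely illustrative; the file name reuses the sample data from the introductory example):

```xml
<InputData nbThreads="4" nbDigits="4" maxDiffTextualValues="500"
           anonymizationLevel="1" exportMode="COMPRESSED_IAD">
  <InputDataLocator usage="DATA_SOURCE" type="FLAT_FILE"
                    name="doc/sample_data/kunden.txt"/>
</InputData>
```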
<DataLocator> subelements
<InputData>
must contain one, and can contain up to two additional, subelements of type <DataLocator>
:
-
<InputDataLocator>:
contains the URL (access path and data name) and the data format of a data source which contains input data to be opened with Synop Analyzer.
-
<TaskDataLocator>:
if the user wants to permanently store the manual adjustments and data import settings performed on the current data source,
<TaskDataLocator>
specifies the URL to which these settings are written in the form of an <InteractiveAnalyzerTask>
.
-
<OutputDataLocator>:
if the user wants to permanently store the imported and preprocessed data source, the URL for this persistent data file must be given.
Each <DataLocator>
must contain the following three attributes:
-
type: describes the data type (format). Must be one of the following constants:
"FLAT_FILE"
, "OOXML_SPREADSHEET"
, "COMPRESSED_IAD"
,"XML_FILE"
, "PMML_FILE"
, "JDBC_TABLE"
, or "MDB_TABLE"
-
usage: describes the data usage. Must be one of the following constants:
"DATA_SOURCE"
, "DATA_TARGET"
, "IA_PARAMETERS"
, or "IA_MODEL"
-
name:
datafile name or schema and table name
Optionally, <DataLocator>
can contain one or more of the following attributes
-
accesspath:
directory path or JDBC connection string (containing DBMS, server, port and database name)
-
encoding:
encoding scheme of the data source. Allowed values are
"US-ASCII"
:
suitable if the data only contains the first 127 characters of the ASCII table.
"ISO-8859-1"
:
for 'western European' data in which each character is represented by one single byte and in which the 127 ASCII characters plus some 'standard' western European characters (such as French accents or German Umlauts) occur.
"ISO-8859-15"
:
codepage specialized for German language information. Each character is represented by one single byte and the 127 first characters are the ASCII characters, the other 128 characters represent characters and other symbols which are frequently used in the German speaking countries (Germany, Austria, Switzerland).
"UTF-8"
:
The UTF coding standard can represent about 65000 regionally used characters from all over the world. In the variant UTF-8, the first 127 ASCII characters are represented by 1 byte, all other characters are represented by two or more bytes.
"UTF-16"
:
in the UTF-16 variant of the UTF standard, all characters are represented by two or more bytes. The first two bytes of an UTF-16 file contain information on the byte order (is the first byte the high byte or the low byte?)
"UTF-16LE"
, "UTF-16BE"
:
in the variants UTF-16LE ('little endian') and UTF-16BE ('big endian'), the first two bytes of the UTF-16 format which define the byte order are missing. Therefore, the user must know beforehand which byte order the creator of the document used. Normally, Intel/Windows systems work with the 'little endian' convention, Unix systems and mainframes with the 'big endian' convention.
"ISO-8859-2"
, "ISO-8859-4"
, "ISO-8859-5"
, "ISO-8859-7"
, "ISO-8859-9"
, "ISO-8859-13"
, "KOI8-R"
, "windows-1250"
, "windows-1251"
, "windows-1252"
, "windows-1253"
, "windows-1254"
, "windows-1257"
:
other possible codepages which will not be described in detail here.
-
jdbcUser: database user name
-
dbms: name of the database management system in which the data reside. Possible values are
"ORACLE"
, "SQLSERVER"
, "ACCESS"
, "DB2"
, "MYSQL"
, "POSTGRES"
, "SYBASE"
, "TERADATA"
, "PROGRESS"
, "CACHE"
, "SUN_ODBC_JDBC"
, "USERDEFINED"
or "NONE"
, the latter being the default value.
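As an illustration, a locator for a database table might look like the following sketch. The server, database, table and user names are hypothetical; the accesspath follows the usual JDBC connection string convention:

```xml
<InputDataLocator usage="DATA_SOURCE" type="JDBC_TABLE"
                  name="sales.customers"
                  accesspath="jdbc:mysql://dbserver:3306/sales"
                  dbms="MYSQL" jdbcUser="analyst" encoding="UTF-8"/>
```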
Optional <FieldUsage> subelements
<FieldUsage>
defines a usage specification for one single data field. The tag contains the following attributes:
-
field:
name of the data field (required).
-
alias:
mapped name of the data field (optional). This mapped name is used instead of the field name in captions and titles of histogram charts for the field.
-
dataType: defines the data type class of the data field:
"DEFAULT"
, "TEXTUAL"
, "BOOLEAN"
, "INTEGER"
, or "NUMERIC"
.
If this attribute is not set, "DEFAULT"
is assumed; Synop Analyzer then autonomously detects the best matching data type class for that data field.
-
usage:
usage mode of the field in all data exploration and analysis steps to be performed on this data.
"SUPPRESSED"
:
the field will be ignored.
"SUPPLEMENTARY"
(this usage type is not used in Synop Analyzer v1.x).
"ACTIVE"
:
the default usage type.
"GROUP"
:
the field is the 'group' field: it contains group IDs which mark a group of adjacent data rows as members of one group.
"ENTITY"
:
the field is the 'entity' field: it contains a second grouping level on top of the 'group' field. The entity field contains entity IDs which mark a set of adjacent data row groups as members of one entity.
"WEIGHT"
:
the field contains the weight, price or cost value which is associated with the situation, event or good described by the other data field values of the data record.
"ORDER"
:
the field contains a time stamp or a date.
If the usage
attribute is not set, "ACTIVE"
is assumed.
-
aggregationType
defines the value aggregation type. This attribute is only of interest for numeric data fields and for the case that a group field has been defined. The attribute determines how the field's values in different data records within one data group are aggregated in order to form the data group's value for that field.
"SUM"
:
The field value of the data group (transaction) is the sum of the field values of all data records which form the group.
"MEAN"
:
The field value of the data group is the average of the field values of all data records which form the group.
"MAX"
:
The field value of the data group is the maximum of the field values of all data records which form the group.
"MIN"
:
The field value of the data group is the minimum of the field values of all data records which form the group.
"SPREAD"
:
The field value of the data group is the difference between the greatest and the smallest value of the field on all data records which form the group.
"RELATIVESPREAD"
:
The field value of the data group is the difference between the greatest and the smallest value of the field on all data records which form the group divided by the mean field value on the data group.
"MINDIFF"
:
The field value of the data group is the minimum of all field value differences between two adjacent data records within the group.
"MAXDIFF"
:
The field value of the data group is the maximum of all field value differences between two adjacent data records within the group.
"COUNT"
:
The field value of the data group is the number of records which form the group.
The default aggregation type is
"SUM"
.
-
anonymizationLevel:
overrides the general anonymization level (defined as attribute
anonymizationLevel
of the <InputData>
tag) for a single field:
0 (default): no anonymization,
1: anonymize the field names, keep the original field values,
2: anonymize the textual field values and transform all numeric field values such that the resulting value distribution for each numeric data field has a mean of 0 and a standard deviation of 1. Maintain the original data field name,
3: anonymize both the data field name and the field values.
-
dateFormat:
specifies the current field as a date/time field and indicates the date/time format of the field, e.g.
"MM/dd/yyyy hh:mm:ss"
.
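A few <FieldUsage> sketches combining the attributes described above (all field names are hypothetical):

```xml
<FieldUsage field="TRANSACTION_ID" usage="GROUP"/>
<FieldUsage field="PRICE" dataType="NUMERIC" usage="WEIGHT" aggregationType="SUM"/>
<FieldUsage field="PURCHASE_DATE" usage="ORDER" dateFormat="MM/dd/yyyy hh:mm:ss"/>
<FieldUsage field="INTERNAL_CODE" usage="SUPPRESSED"/>
```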
Optional <JoinedTable> subelements
<JoinedTable>
specifies an auxiliary table which is to be combined with the main input data table using a primary key - foreign key relation between certain data fields of the two tables. <JoinedTable>
has the following required sub-elements:
-
<DataLocator
/>:
URL and data format of the auxiliary table. The internal structure of this element is described in the section on <DataLocator> subelements.
-
<KeyFieldPair mainTableField="
" joinedTableField="
"/>:
a pair of data fields, one from the main table, one from the auxiliary table, which serve as foreign key - primary key pair and thereby establish the relation between the two tables.
-
<AddedField field="
"/>:
the name of a data field from the auxiliary table which is to be added to the main table.
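A <JoinedTable> sketch which adds a product group column from a hypothetical auxiliary table (all file and field names are illustrative):

```xml
<JoinedTable>
  <DataLocator usage="DATA_SOURCE" type="FLAT_FILE"
               name="doc/sample_data/products.txt"/>
  <KeyFieldPair mainTableField="PRODUCT_ID" joinedTableField="ID"/>
  <AddedField field="PRODUCT_GROUP"/>
</JoinedTable>
```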
Optional <Taxonomy> subelements
<Taxonomy>
defines an auxiliary table which contains taxonomy (hierarchy) information for one or more data fields of the main input data table. <Taxonomy>
has the following required attributes:
-
parentField:
name of the data field in the auxiliary table which contains the 'parent', i.e. the higher order hierarchy level, of a taxonomy relation (parent-child relation).
-
childField:
name of the data field in the auxiliary table which contains the 'child', i.e. the lower order hierarchy level, of a taxonomy relation (parent-child relation).
<Taxonomy>
must contain at least one of each of the following sub-tags:
-
<DataLocator
/>:
URL and data format of the auxiliary taxonomy table. The internal structure of this element is described in the section on <DataLocator> subelements.
-
<AffectedField field="
"/>:
a data field in the main data table for which the taxonomy relations apply.
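A <Taxonomy> sketch (field and file names are hypothetical):

```xml
<Taxonomy parentField="CATEGORY" childField="PRODUCT">
  <DataLocator usage="DATA_SOURCE" type="FLAT_FILE"
               name="doc/sample_data/product_taxonomy.txt"/>
  <AffectedField field="PRODUCT"/>
</Taxonomy>
```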
Optional <NameMapping> subelements
<NameMapping>
defines an auxiliary table which contains clear names for the values of one or more data fields of the main input data table. <NameMapping>
has the following required attributes:
-
origNameField:
name of the data field in the auxiliary table which contains the original field values for which clear names are to be defined.
-
mappedNameField:
name of the data field in the auxiliary table which contains the mapped values (clear names).
<NameMapping>
must contain at least one of each of the following sub-tags:
-
<DataLocator
/>:
URL and data format of the auxiliary name mapping table. The internal structure of this element is described in the section on <DataLocator> subelements.
-
<AffectedField field="
"/>:
a data field in the main data table for which the name mappings apply.
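A <NameMapping> sketch (all field and file names are hypothetical):

```xml
<NameMapping origNameField="CODE" mappedNameField="CLEARNAME">
  <DataLocator usage="DATA_SOURCE" type="FLAT_FILE"
               name="doc/sample_data/branch_names.txt"/>
  <AffectedField field="BRANCH_CODE"/>
</NameMapping>
```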
Optional <Discretization> subelements
<Discretization>
describes a manually defined discretization (binning) for one or more data fields in the main input data table. <Discretization>
can contain one integer valued numeric attribute:
-
nbBins:
number of intervals (bins), not counting a possibly needed extra bin for invalid or missing values.
<Discretization>
has the following sub-tags:
-
<BinBounds>
(StringList)
</BinBounds>:
the interval boundaries. This sub-tag is optional and only allowed if the discretization is defined for a numeric data field.
-
<AffectedField field="
"/>:
one data field in the main data table for which the discretization applies.
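A <Discretization> sketch for a hypothetical numeric field AGE. We assume here that the <BinBounds> StringList gives the inner interval boundaries, so four bins (below 18, 18-40, 40-65, above 65) result:

```xml
<Discretization nbBins="4">
  <BinBounds>18 40 65</BinBounds>
  <AffectedField field="AGE"/>
</Discretization>
```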
Optional <PerfectTupelDetection> subelements
<PerfectTupelDetection>
defines a data analysis and data simplification step for data with set-valued data fields or data with a 'group' field. A perfect tupel detection identifies, in a first step, all combinations of values of one data field which (almost) always occur together, that is, in the same data records or data groups. In a second step, all values figuring in such a combination are removed from the data and replaced by a textual string representing the entire combination.
<PerfectTupelDetection>
can contain the following attributes:
-
minFrequency:
minimum frequency threshold for the perfect tupels to be detected. Default value is 10.
-
minPurity:
minimum purity threshold for the perfect tupels to be detected. Default value is 1.0, which means that only those tupels are detected and removed whose values never occur without all the other values from the tupel.
-
collationString:
text fragment or character which is used as link when composing the name of the combined tupel. Default value is '_'.
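A minimal sketch which relaxes the default thresholds (the attribute values are purely illustrative):

```xml
<PerfectTupelDetection minFrequency="20" minPurity="0.95" collationString="+"/>
```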
Reference description of the analysis task part
After the <InputData>
part, a Synop Analyzer task can contain one or more of the following elements, which define various analysis tasks that can be performed on the data using Synop Analyzer's different analysis modules.
The element <StartInteractiveAnalyzerGUITask/> is a 'dummy' element. It does not define an analysis step but just starts the Synop Analyzer workbench and reads the input data which have been specified in the preceding <InputData>
part of the task. This element cannot be processed by the command line processor sacl.
<UnivariateExplorationTask>
<UnivariateExplorationTask>
generates a statistical overview of the currently active input data and creates visualizations of the value distributions for all data fields.
<UnivariateExplorationTask>
can contain the following attributes:
-
nbChartsPerRow:
number of field value distribution histograms shown in one row on screen. The higher the value, the smaller the size of each single histogram chart.
-
yAxisLabel:
a label text to appear next to the y axis of the histogram charts in the Univariate Exploration panel.
-
barColors:
a series of RGB color byte triples, such as 0:0:255 for the color blue, separated by blanks. The first triple defines the color of the first bar in each histogram, the second triple the second bar, and so on.
<UnivariateExplorationTask>
can contain the following sub-elements:
- <HiddenField field="
"/>
specifies a data field which is to be ignored in the statistical and visual data overview screen. Note that data fields which have been marked with the
<FieldUsage usage="SUPPRESSED"/>
tag in the <InputData>
element are ignored by default.
-
<ResultDataLocator
/>
defines name, access path and data format of the file or database table into which the result of the univariate exploration is to be exported. The internal structure of this element has been described in subsection
<DataLocator>
.
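A <UnivariateExplorationTask> sketch (the field name, colors and output file name are illustrative):

```xml
<UnivariateExplorationTask nbChartsPerRow="4" yAxisLabel="Frequency"
                           barColors="0:0:255 0:255:0">
  <HiddenField field="CUSTOMER_ID"/>
  <ResultDataLocator usage="DATA_TARGET" type="FLAT_FILE"
                     name="results/univariate.txt"/>
</UnivariateExplorationTask>
```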
<CorrelationsTask>
<CorrelationsTask>
analyses and displays correlations between the data fields.
<CorrelationsTask>
can contain the following attributes:
- minCorrelation
defines a lower limit for the correlation coefficients to be shown in the panel. The value must be in the range from 0.0 to 1.0.
-
field1:
if this attribute is set and contains a valid field name, only correlation coefficients involving that field are shown on screen.
-
field2:
if this attribute is set in addition to the attribute
field1
, then only the correlation coefficient between field1
and field2
is shown on screen.
<CorrelationsTask>
can contain the following sub-element:
-
<ResultDataLocator
/>
defines name, access path and data format of the file or database table into which the result of the correlations analysis is to be exported. The internal structure of this element has been described in subsection
<DataLocator>
.
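A <CorrelationsTask> sketch restricting the output to correlations involving one field (names and values illustrative):

```xml
<CorrelationsTask minCorrelation="0.5" field1="INCOME">
  <ResultDataLocator usage="DATA_TARGET" type="FLAT_FILE"
                     name="results/correlations.txt"/>
</CorrelationsTask>
```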
<BivariateExplorationTask>
<BivariateExplorationTask>
creates a bivariate analysis of the interdependencies of two data fields. The values or value ranges of one field are traced along the x axis, the values of the second field along the y axis. The resulting matrix contains in each matrix cell (m,n) the number of data records - or, if a 'group' field has been specified, the number of groups - in which the x-field has the m-th value and the y-field the n-th value. A color code signals whether this combination occurs more (green) or less (red) frequently than expected. This method visualizes systematic interdependencies between certain values of the two fields.
<BivariateExplorationTask>
can contain the following attributes:
- ignoreMissingValues:
if this value is set to 'true', all data records (or data groups, respectively) in which one of the two involved data fields has no valid value are ignored in the counts shown in the matrix cells. The default setting for
ignoreMissingValues
is 'false'.
-
showCirclePlot:
indicates whether or not an absolute frequency plot is shown. If this attribute is missing, the plot is shown.
<BivariateExplorationTask>
must contain the following required sub-elements:
-
<XField field="
" nbRanges="
">
<RangeBounds>
</RangeBounds>
</XField>
defines the x-axis field and its binning into discrete ranges. Each discrete range corresponds to one column in the resulting bivariate counts matrix.
nbRanges
is the number of ranges (columns); the sub-element RangeBounds
contains a series of digits 1 and 0, separated by blanks. The series must contain the digit 1 exactly nbRanges-1 times and n-1 digits in total, where n is the number of different field values (or discretized ranges as defined in the InputData
part of the XML task).
Example: we assume that the data field AGE
is a numeric data field with more than 10 different values, and no <Discretization>
has been specified for this field in the <InputData>
part of the task. Then AGE
will be discretized into 10 value ranges (bins), plus an additional 11-th range 'missing/invalid' if the field contains missing or invalid values. Hence, <RangeBounds>
must contain 9 or 10 digits, respectively. It might look like this: <RangeBounds>0 1 0 0 0 1 0 0 0</RangeBounds>
. In this case, the bivariate matrix has 3 columns. The first column represents the first two discrete bins of field AGE
, the second column the next four bins and the last one the remaining four bins.
-
<YField field="
" nbRanges="
">
<RangeBounds>
</RangeBounds>
</YField>
defines the y-axis field and its binning into discrete ranges. Each discrete range corresponds to one row in the resulting bivariate counts matrix.
-
<ResultDataLocator
/>
defines name, access path and data format of the file or database table into which the result of the bivariate exploration is to be exported. The internal structure of this element has been described in subsection
<DataLocator>
.
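Putting the AGE example above together with a hypothetical two-valued y-field, a complete <BivariateExplorationTask> might be sketched as follows (field names and output file are illustrative):

```xml
<BivariateExplorationTask ignoreMissingValues="true">
  <XField field="AGE" nbRanges="3">
    <RangeBounds>0 1 0 0 0 1 0 0 0</RangeBounds>
  </XField>
  <YField field="GENDER" nbRanges="2">
    <RangeBounds>1</RangeBounds>
  </YField>
  <ResultDataLocator usage="DATA_TARGET" type="FLAT_FILE"
                     name="results/bivariate.txt"/>
</BivariateExplorationTask>
```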
<MultivariateExplorationTask>
<MultivariateExplorationTask>
generates and visualizes a multivariate data selection, that means the equivalent of a SQL SELECT
statement with a WHERE
clause in which one or more data fields appear as filter criteria. As a result, the multivariate selection shows how the value distributions of all data fields - the ones serving as selection criteria and the other ones - on the selected data subset differ from the corresponding value distributions on the entire data.
<MultivariateExplorationTask>
can contain the following attribute:
-
nbChartsPerRow:
number of field value distribution histograms shown in one row on screen. The higher the value, the smaller the size of each single histogram chart.
<MultivariateExplorationTask>
can contain the following sub-elements:
-
<FieldHistogram field="
" nbBins="
">
<SelectedBins>
</SelectedBins>
</FieldHistogram>
defines a selection criterion for the data field field
.
nbBins
is the number of different values (or value ranges as defined in <InputData>
). <SelectedBins>
contains a series of digits 0 or 1, separated by blanks. The series must contain exactly nbBins
digits. 1 signifies that the corresponding field value or value range is selected, 0 means that it is deselected.
-
<ResultDataLocator
/>
defines name, access path and data format of the file or database table into which the result of the multivariate exploration is to be exported. The internal structure of this element has been described in subsection
<DataLocator>
.
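A <MultivariateExplorationTask> sketch selecting the upper four of ten AGE bins (the field name, bin counts and output file are illustrative):

```xml
<MultivariateExplorationTask nbChartsPerRow="4">
  <FieldHistogram field="AGE" nbBins="10">
    <SelectedBins>0 0 0 0 0 0 1 1 1 1</SelectedBins>
  </FieldHistogram>
  <ResultDataLocator usage="DATA_TARGET" type="FLAT_FILE"
                     name="results/selection.txt"/>
</MultivariateExplorationTask>
```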
<TestControlAnalysisTask>
<TestControlAnalysisTask>
creates and compares two different (and normally disjoint) multivariate data selections on one single data set: the 'test' data and the 'control' data. The two subsets can then be analyzed for significant value distribution differences. Furthermore, the test/control analysis module can sample a subset of the original control data which is 'representative' for the test data on some specified data fields.
<TestControlAnalysisTask>
can contain the following optional attributes:
-
nbChartsPerRow:
number of field value distribution histograms shown in one row on screen. The higher the value, the smaller the size of each single histogram chart.
-
minNbControl:
minimum number of control data records (or groups) after sampling.
-
maxNbControl:
maximum number of control data records (or groups) after sampling.
-
iterateOverValuesOf:
name of a data field. This field will be used to define a series of test/control analysis tasks. Each task within the series defines one single value of the field as test data selector criterion and a set of other values of the field as the control data selector criterion.
-
maxNbIterations:
sets an upper limit for the number of different test/control analysis tasks generated by the attribute
iterateOverValuesOf
.
-
minChiSquareConfidence:
if this attribute is set (to a value between 0 and 1), test/control analysis results will only be generated and exported for those test/control data splits with a certain minimum difference in the value distributions of at least one of the specified target fields. The value distribution difference is measured by a χ² test with the null hypothesis 'the value distributions of the test data and the control data for the target field are identical'.
-
summaryResultFile:
name of a 'summary' file which contains one line of data for each single test/control analysis within a series of automatically executed test/control analysis steps. If this attribute is missing, no summary file will be written. If the name ends with 'xlsx', an Excel spreadsheet will be written, otherwise a tab-separated flat text file will be created.
<TestControlAnalysisTask>
can contain the following sub-elements:
-
<FieldHistogramTC field="
" nbBins="
" optimizable="
"/>
specifies how the data field field
is used within the test/control data analysis.
nbBins
is the number of different values (or value ranges as defined in <InputData>
).
optimizable
indicates whether this field's value distribution on the control data should be made representative for the field's value distribution on the test data when the control data is being 'optimized'. Default value is 'true'.
Furthermore, <FieldHistogramTC>
can contain sub-elements which describe how the field is used as a splitting criterion for test and control data. Either
<SelectedBinsTest>
0 1
</SelectedBinsTest>
<SelectedBinsControl>
0 1
</SelectedBinsControl>
,
(if the selection criteria for the test and the control data are intended to differ on this field), or
<SelectedBins>
0 1
</SelectedBins>
(if identical selection criteria for both data sets are to be defined).
Each <SelectedBins>, <SelectedBinsTest> or <SelectedBinsControl>
tag contains a series of digits 0 or 1, separated by blanks. The series must contain exactly nbBins
digits. 1 signifies that the corresponding field value or value range is selected, 0 means that it is deselected.
-
<ResultDataLocator
/>
defines name, access path and data format of the file or database table into which the result of the split analysis is to be exported. The internal structure of this element has been described in subsection
<DataLocator>
.
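A <TestControlAnalysisTask> sketch which splits the data on a hypothetical two-valued MAILING field and keeps a ten-bin AGE field representative when the control data is sampled (all names and values illustrative):

```xml
<TestControlAnalysisTask minNbControl="1000" maxNbControl="5000">
  <FieldHistogramTC field="MAILING" nbBins="2" optimizable="false">
    <SelectedBinsTest>1 0</SelectedBinsTest>
    <SelectedBinsControl>0 1</SelectedBinsControl>
  </FieldHistogramTC>
  <FieldHistogramTC field="AGE" nbBins="10" optimizable="true"/>
  <ResultDataLocator usage="DATA_TARGET" type="FLAT_FILE"
                     name="results/test_control.txt"/>
</TestControlAnalysisTask>
```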
<TimeSeriesTask>
<TimeSeriesTask>
describes the analysis of a time series: detection of trends and cyclic components ('seasons'), modeling the impacts of singular events ('strokes') and calculation of forecasts.
<TimeSeriesTask>
can contain the following optional attributes:
-
nbChartsPerRow:
number of detail time series charts per row in the graphical overview. The larger the value, the smaller each single chart.
-
heightWidthRatio:
height-to-width ratio of the time series charts to be created.
-
groupingField:
name of the data field whose values are used to define the different detail time series and detail charts. For each value or value range of this field, a separate time series will be created and analyzed.
-
nbForecasts:
number of time steps to be predicted.
-
forecastStart:
time stamp at which the aggregation of forecast values (such as the forecasted total sales from January 1 till year end) should start.
-
chartStart:
time stamp at which the generated charts should start.
-
exponentialSmoothingWeight:
weight factor (between 0.0 and 1.0) with which singular effects - strokes, deviations from the expected (trend+season) pattern - influence the prediction of future values.
-
exponentialSmoothingAlpha:
damping factor (0.0 < α < 1.0; 0.0 means no damping) for the influence of deviations from the long-term (trend+season) pattern which happened in the recent past. A damping factor of α means that the influence of a deviation which happened n time steps ago will be damped by a factor of (1-α)^n.
-
trendDamping:
trend damping factor d > 0.0 which models the expected behavior of the seasonally corrected trend line in the future. A value d < 1.0 (d > 1.0) assumes that the seasonally corrected trend which was detected in the recent past will be reduced (increased) by a factor of d with each time step into the future.
-
period:
presumed cycle length of the longest significant cyclic pattern (season) in the time series data. For example 12 if we assume a yearly pattern on monthly recorded data.
-
smoothing:
sliding average width. For calculating the seasonally corrected trend line, we use a symmetric sliding average over
smoothing
time steps. Default value is the value of period
.
-
season:
defines the way in which the cyclic (seasonal) component is modeled into the data. Possible values are
ADDITIVE
and MULTIPLICATIVE
. The first variant models the seasonal components as an additive contribution (added value), the second variant models it as a multiplicative factor (multiplication coefficient).
-
allowNegativeValues:
defines whether the forecast can contain negative values. The default value of this parameter is true.
<TimeSeriesTask>
can contain the following subelement:
-
<ResultDataLocator
/>:
defines name, access path and data format of the file or database table into which the result of the time series analysis is to be exported. The internal structure of this element has been described in subsection
<DataLocator>
.
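A minimal sketch of a time series task might look as follows. The field name PRODUCT_GROUP and the file name forecast.txt are illustrative, and the attribute names on <ResultDataLocator> are assumptions here; their actual form is defined in the <DataLocator> subsection:

```xml
<TimeSeriesTask nbForecasts="12" period="12" smoothing="12"
                season="MULTIPLICATIVE" groupingField="PRODUCT_GROUP"
                trendDamping="0.9" allowNegativeValues="false">
  <ResultDataLocator type="FLAT_FILE" name="forecast.txt"/>
</TimeSeriesTask>
```

With period="12" on monthly data, the task assumes a yearly seasonal pattern and predicts the next 12 months.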
<AssociationsTrainTask>
<AssociationsTrainTask>
defines the task to perform an associations analysis and to generate a collection of association rules on the data described in the <InputData>
section. The result can be returned in the form of a PMML <AssociationModel>
or in tabular form as a flat file.
<AssociationsTrainTask>
can contain the following optional attributes:
-
nbVerificationRuns:
number of control models, which are calculated with the same rule filter settings as the main model but on artificially shuffled (permuted) data in which each data field's values are randomly moved to new data rows. By analyzing the significance numbers (support, lift, confidence) of the best 'artificial' rules detected on the control models, Synop Analyzer derives a reliability criterion for the associations and rules of the main model. This helps in differentiating true and robust patterns from artificial noise.
-
maxNbPatterns:
maximum number of associations to be detected. If more associations matching all specified filter criteria can be found, they are sorted with respect to
sortingCriterion
, and only the best maxNbPatterns
associations are kept.
-
sortingCriterion:
sorting criterion used for selecting the 'best'
maxNbPatterns
associations. Possible values are "SUPPORT"
, "LIFT"
, "CONFIDENCE"
, "PURITY"
, "COREITEMPURITY"
and "WEIGHT"
. Default value is "SUPPORT"
.
-
minChildSupportRatio:
number between 0.0 and 1.0, with default value 0.0;
- minParentSupportRatio:
number equal to or greater than 1.0, default value is 1.0; filter criterion which regulates the acceptance of short associations: an association of length
n-1
will only be accepted if its support (frequency of occurrence) is not smaller than minParentSupportRatio
times the minimum of the supports of all 'child associations' of length n
in which exactly 1 item is appended to the existing association. Setting minParentSupportRatio
to a value greater than 1.0, for example to 1.2, helps to suppress the appearance of masses of redundant partial patterns of one single interesting long pattern.
-
minChiSqrConf:
number between 0.0 and 1.0 with default value 0.0; if a value greater than 0.0 is set, for example 0.95, each detected association is submitted to a χ² significance test with the null hypothesis 'the appearance probability on the training data of at least one item within the association is independent of whether or not also the other
n-1
items of the association appear in the same data groups'. The association will only be accepted if this null hypothesis is rejected at a significance level of at least minChiSqrConf
. In other words: all associations are rejected in which at least one item seems not to be 'significant' for the entire pattern because its appearance probability is independent of the rest of the pattern.
<AssociationsTrainTask>
can contain the following optional subelements:
-
<PatternLength min="
" max="
"/>:
lower and upper limit for the number of parts ('items') in the associations to be detected.
-
<AbsoluteSupport min="
" max="
"/>:
lower and upper limit for the absolute support (that is the absolute occurrence frequency on the training data) of the associations to be detected. Both limits must be integers greater than 0.
-
<RelativeSupport min="
" max="
"/>:
lower and upper limit for the relative support (that is the occurrence frequency on the training data divided by the total number of data groups in the training data) of the associations to be detected. Both numbers must be probability numbers between 0.0 (exclusive) and 1.0 (inclusive).
-
<RelativeItemSupport min="
" max="
"/>:
lower and upper limit for the relative supports of the single items which can occur in the associations to be detected. Both limits must be probability numbers between 0.0 (inclusive) and 1.0 (inclusive).
-
<Lift min="
" max="
"/>:
lower and upper limit for the lift of the associations to be detected. The lift of an association (A,B) is the relative support of (A,B) divided by the product of the relative supports of A and of B. Lift values greater (smaller) than 1 indicate a positive (negative) correlation between the items A and B. Both limits must be floating point numbers greater than 0.0.
-
<LiftIncreaseFactor min="
" max="
"/>:
lower and upper limit for the permitted lift ratios which result from comparing the lift of a 'child' association of length
n
to the lifts of all its 'parent' associations of length n-1
. <LiftIncreaseFactor>
greater than 1 enforces that only those items can be appended to existing parent patterns which have a positive correlation with the existing pattern. Both limits must be positive numbers.
-
<Purity min="
" max="
"/>:
lower and upper limits for the purity (on the training data) of the associations to be detected. Purity is a number between 0.0 and 1.0. In associations with purity 1.0, each single item within the association appears only in those data records or data groups in which also all other items of the association occur. More generally, purity is defined as the support of the association divided by the maximum of the supports of its items. Both limits must be numbers between 0.0 and 1.0.
-
<CoreItemPurity min="
" max="
"/>:
lower and upper limits for the core item purity (on the training data) of the associations to be detected. Core item purity is a number between 0.0 and 1.0; it is defined as the support of an association divided by the minimum of the supports of the association's items. In associations with a core item purity of 1.0, there is at least one item which occurs (on the training data) only together with all other items of the association. Both limits must be numbers between 0.0 and 1.0.
-
<ItemPairPurity min="
" max="
"/>:
lower and upper limit for the pairwise purity of the items which are allowed to occur in the detected associations. Both limits must be numbers between 0.0 and 1.0. Setting a maximum item pair purity below 1.0 can be a means for suppressing the occurrence of well-known and trivial item-item correlations in the detected associations, for example combinations such as 'AGE<18' and 'MARITAL_STATUS=child'.
-
<Confidence min="
" max="
"/>:
lower and upper limits for the confidences of the 'if-then' rules which can be formed from the detected associations by taking one item as the 'then' part and all other items as the 'if' part of the rule. If this filter has been set, only those associations will be contained in the resulting model for which the confidence of at least one 'if-then' rule is in the specified range. Both limits must be probability numbers between 0.0 (exclusive) and 1.0 (inclusive).
-
<Weight min="
" max="
"/>:
lower and upper limit for the mean weights (prices or costs on the training data) of the associations to be detected. This filter criterion will be ignored unless a WEIGHT data field has been specified in the
<InputData>
section of the task. Both limits can be arbitrary numbers.
-
<RequiredItemGroups>
<ItemGroup><item>
</item>
</ItemGroup>
<ItemGroup><item>
</item>
</ItemGroup>
</RequiredItemGroups>
defines bits of information ('items') which must occur in each association to be detected: from each <ItemGroup>
, at least one item must occur in the patterns to be detected.
-
<IncompatibleItemGroups>
<ItemGroup><item>
</item>
</ItemGroup>
<ItemGroup><item>
</item>
</ItemGroup>
</IncompatibleItemGroups>
defines bits of information ('items') which must not occur together in the patterns to be detected: from each <ItemGroup>
, not more than one <item>
may occur. Defining incompatible item groups is a means for eliminating the appearance of well known and trivial correlations from the detected associations.
-
<NegativeItems><item>
</item>
</NegativeItems>:
defines those bits of information ('items') for which not only the appearance but also the non-appearance within a data record or data group can become part of a detected pattern.
-
<SuppressedItems><item>
</item>
</SuppressedItems>:
defines those bits of information ('items') which are to be completely ignored during the associations analysis.
-
<TrackedItems><item>
</item>
</TrackedItems>:
defines certain bits of information ('items') for which the relative occurrence frequency (relative support) on the support of each detected pattern is to be tracked. For example, if you specify the item 'PRICE>100EUR' as a TrackedItem
, you will be shown for every detected association how many of the data records or data groups in which the association occurs have a price of more than 100 EUR.
-
<AssociationsResultSpec
/>:
defines various settings for exporting association models. The element has the following (optional) attributes:
-
format:
output format of the model ("FLAT_FILE", "FLAT_FILE_NO_HEADER", "PMML" or "JDBC_TABLE")
-
colSeparator:
column separator character to be used in the output model (only required in the output formats "FLAT_FILE" and "FLAT_FILE_NO_HEADER"). Default value is <TAB>.
-
writeToStdOut:
if this parameter is set to 'true', the model will be written both to the standard output console (
stdOut
) and to the specified output file.
-
description:
textual description of the association model.
-
writeChiSqrConf:
'true' or 'false'. Indicates whether the χ² confidence of each association is to be written into the model. Per default, chi-square confidences are written if and only if a
minChiSqrConf
filter greater than 0.0 has been set.
-
writePurities:
'true' or 'false'. Indicates whether the purity of each association is to be written into the model output. Default is 'true'.
-
writeWeight:
'true' or 'false'. Indicates whether the weight (price, cost) of each association is to be written into the model output. Per default, weight is written if and only if a weight/price field has been specified on the training data.
-
writeConfidences:
'true' or 'false'. Indicates whether the model output should contain the confidences of all possible if-then rules which can be formed from a given association within the model by taking one of the association's items as 'then' side and all other items as the 'if' side of the rule.
-
writeItemSupports:
'true' or 'false'. Indicates whether the occurrence frequencies (absolute supports) of each single item within each association are to be written into the model output. Default is 'true'.
-
writeSupportGroups:
'true' or 'false'. Indicates whether up to 3 sample data records or data groups out of the support of each association are to be written into the model output. Default is 'false'.
-
itemMode:
"SINGLE" or "COMBINED". Indicates whether the names of the single items which form an association are to be written into separate columns of the model output or into one single column containing all item names. This setting is irrelevant for the output format 'PMML'. Default value is "SINGLE".
-
<ResultDataLocator
/>:
defines name, access path and data format of the file or database table into which the result of the associations analysis is to be exported. The internal structure of this element has been described in subsection
<DataLocator>
.
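Putting some of these attributes and filter subelements together, an associations task could be sketched as follows. The item names, the description text and the attribute names on <ResultDataLocator> are illustrative assumptions; note that characters such as '<' inside item names must be escaped in XML:

```xml
<AssociationsTrainTask maxNbPatterns="100" sortingCriterion="LIFT"
                       minChiSqrConf="0.95" nbVerificationRuns="2">
  <PatternLength min="2" max="4"/>
  <RelativeSupport min="0.01" max="1.0"/>
  <Lift min="1.2" max="1000.0"/>
  <IncompatibleItemGroups>
    <ItemGroup>
      <item>AGE&lt;18</item>
      <item>MARITAL_STATUS=child</item>
    </ItemGroup>
  </IncompatibleItemGroups>
  <AssociationsResultSpec format="PMML" description="basket analysis"/>
  <ResultDataLocator type="FLAT_FILE" name="assocModel.xml"/>
</AssociationsTrainTask>
```

The <IncompatibleItemGroups> element here suppresses the trivial correlation between 'AGE&lt;18' and 'MARITAL_STATUS=child' mentioned above.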
<SequencesTrainTask>
<SequencesTrainTask>
defines the task to perform a sequential patterns analysis and to generate a collection of sequential patterns on the data described in the <InputData>
section. The result can be returned in the form of a PMML <SequenceModel>
or in tabular form as a flat file.
<SequencesTrainTask>
can contain the same attributes and subelements as <AssociationsTrainTask>
. However, between formally identical attributes and subelements in <AssociationsTrainTask>
and <SequencesTrainTask>
there is the semantic difference that 'support' means something different in sequences compared to associations. In associations, support refers to the number of data records or data groups (transactions) in which a pattern occurs. In sequences, support refers to the number of 'entities' for which transaction data have been collected. For example, in market basket analysis, the support of an association is the number of sales slips (transactions) in which a combination of articles occurs, whereas the support of a sequence is the number of customers (entities) for whom a certain time-ordered purchasing pattern applies.
A sequential patterns analysis can only be performed if the <InputData>
section of the task specification defines an 'ENTITY' data field and an 'ORDER' data field.
<SequencesTrainTask>
can contain the following additional subelements which can not occur in <AssociationsTrainTask>
or which have a different meaning there:
-
<NbItems min="
" max="
"/>:
lower and upper limit for the number of single bits of information ('items') in the sequential patterns to be detected.
-
<PatternLength min="
" max="
"/>:
lower and upper limits for the number of 'item sets' (events) in the sequences to be detected. Each 'item set' consists of one or more atomic bits of information ('items') which occur at the same time. Hence,
<PatternLength>
is the number of time steps in the sequence plus 1.
-
<ItemsetLength min="
" max="
"/>:
lower and upper limits for the number of atomic bits of information ('items') which can be contained in a single event ('item set') which can appear in the sequences to be detected.
-
<SequencesResultSpec
/>:
has the same function as the element
<AssociationsResultSpec>
in <AssociationsTrainTask>
and contains exactly the same attributes.
-
<ResultDataLocator
/>:
defines name, access path and data format of the file or database table into which the result of the sequences analysis is to be exported. The internal structure of this element has been described in subsection
<DataLocator>
.
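A sequences task can thus be sketched as follows, assuming the <InputData> section defines the required 'ENTITY' and 'ORDER' data fields (the file name and the attribute names on <ResultDataLocator> are illustrative assumptions):

```xml
<SequencesTrainTask maxNbPatterns="50" sortingCriterion="SUPPORT">
  <PatternLength min="2" max="3"/>
  <ItemsetLength min="1" max="2"/>
  <NbItems min="2" max="5"/>
  <SequencesResultSpec format="FLAT_FILE" colSeparator=";"/>
  <ResultDataLocator type="FLAT_FILE" name="sequences.txt"/>
</SequencesTrainTask>
```

This sketch asks for sequences of 2 to 3 events, each event consisting of 1 or 2 items, with 2 to 5 items in total per sequence.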
<RegressionTrainTask>
<RegressionTrainTask>
defines the task to perform a regression analysis and to generate a regression model on the data described in the <InputData>
section. The result can be returned in the form of a PMML <RegressionModel>
or in tabular form as a flat file.
<RegressionTrainTask>
can contain the following optional attributes:
- maxNbRegressors:
maximum number of regressor variables, that means data fields which appear on the right-hand side of the regression equation to be created. If there are more active data fields, a selection will be performed based on the fields' importance (regression coefficient strength) and on field-field correlations.
-
missingValueReplace:
specifies how missing values in regressor fields are to be handled. Possible values are:
"ZERO"
(default): replaces missing values by 0.
"MEAN"
: replaces missing values by the field's mean value.
"SKIP_RECORD"
: ignores every data record in which at least one active regressor field has no valid value.
-
withConstantOffset:
specifies whether the regression equation can contain a constant term (offset). Default is 'true'.
-
createResidualField:
if this parameter is 'true', a new data field named 'RESIDUAL' will be created in the training data. The new data field contains the model's prediction error for each data record, that is the residual (actual target field value minus predicted target field value). Default value is 'false'.
<RegressionTrainTask>
can contain the following subelements:
-
<RegressionResultSpec
/>:
defines various settings for exporting regression models. The element has the following (optional) attributes:
-
format:
output format of the model ("FLAT_FILE", "FLAT_FILE_NO_HEADER", "PMML" or "JDBC_TABLE"). Default value is
"FLAT_FILE".
-
colSeparator:
column separator character to be used in the output model (only required in the output formats "FLAT_FILE" and "FLAT_FILE_NO_HEADER"). Default value is <TAB>.
-
writeToStdOut:
if this parameter is set to 'true', the model will be written both to the standard output console (
stdOut
) and to the specified output file.
-
description:
textual description of the regression model.
-
writePredictedError:
'true' or 'false'. Specifies whether the mean prediction accuracy (root mean squared error) on the training data is to be written into the model. Default is 'true'.
-
<ResultDataLocator
/>:
defines name, access path and data format of the file or database table into which the result of the regression analysis is to be exported. The internal structure of this element has been described in subsection
<DataLocator>
.
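A regression task could be sketched as follows (the description text and the attribute names on <ResultDataLocator> are illustrative assumptions):

```xml
<RegressionTrainTask maxNbRegressors="10" missingValueReplace="MEAN"
                     withConstantOffset="true" createResidualField="true">
  <RegressionResultSpec format="PMML" description="price regression"/>
  <ResultDataLocator type="FLAT_FILE" name="regModel.xml"/>
</RegressionTrainTask>
```

With createResidualField="true", the prediction errors become available as a new 'RESIDUAL' field, which can be inspected in follow-up analyses.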
<SOMTrainTask>
<SOMTrainTask>
defines the training of a self-organizing map (SOM model) - that means a two-dimensional grid of neurons - on the data described in the <InputData>
section.
SOM models can be used for cluster analysis and for prediction of unknown data field values. The resulting SOM model can be returned in the form of a PMML <ClusteringModel>
or in a proprietary binary format.
<SOMTrainTask>
can contain the following optional attributes:
- nbVerificationRuns:
number of control models, which are built with the same parameter settings as the main model but with different random initializations of the neuron weights. The comparison between the main model and the control model(s) indicates whether the main model is well converged.
-
maxNbIterations:
maximum number of training iterations of the SOM net.
-
targetWeight:
multiplication factor for the relative weight of the 'target' data field compared to the other active data fields. Default value is 1.0. Setting the parameter to values greater than 1 results in SOM models in which the SOM card for the target field shows a clearer distinction between low-target-value regions and high-target-value regions.
-
nbNeuronsX:
number of SOM neurons in x direction.
-
nbNeuronsY:
number of SOM neurons in y direction.
-
createResidualField:
if this parameter is 'true', a new data field named 'RESIDUAL' will be created in the training data. The new data field contains the model's prediction error for each data record, that is the residual (actual target field value minus predicted target field value). Default value is 'false'.
<SOMTrainTask>
can contain the following subelements:
-
<SOMResultSpec
/>:
defines various settings for exporting SOM models. The element has the following (optional) attributes:
- format:
output format of the model ("BINARY" or "PMML")
-
writeToStdOut:
if this parameter is set to 'true', the model will be written both to the standard output console (
stdOut
) and to the specified output file.
-
description:
textual description of the SOM model.
-
writePredictedError:
'true' or 'false'. Specifies whether the mean prediction accuracy (the root mean squared error of the SOM model when predicting the target field values) on the training data is to be written into the exported model. Default is 'true'.
-
<ResultDataLocator
/>:
defines name, access path and data format of the file or database table into which the result of the SOM training is to be exported. The internal structure of this element has been described in subsection
<DataLocator>
.
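A SOM training task could be sketched as follows (the description text and the attribute names on <ResultDataLocator> are illustrative assumptions):

```xml
<SOMTrainTask nbNeuronsX="10" nbNeuronsY="8" maxNbIterations="50"
              targetWeight="2.0" nbVerificationRuns="1">
  <SOMResultSpec format="PMML" description="customer segmentation"/>
  <ResultDataLocator type="FLAT_FILE" name="somModel.xml"/>
</SOMTrainTask>
```

This sketch trains a 10x8 neuron grid, doubles the relative weight of the target field, and builds one control model with a different random initialization to check convergence.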