The XML Application Programming Interface

Content

Command line parameters and the command line processor sacl
General structure and a simple example of an XML task
Reference description of the <InputData> part
attributes of <InputData>
<DataLocator> subelements
<FieldUsage> subelements
<JoinedTable> subelements
<Taxonomy> subelements
<NameMapping> subelements
<Discretization> subelements
<PerfectTupelDetection> subelements
Reference description of the analysis task part
<UnivariateExplorationTask>
<BivariateExplorationTask>
<MultivariateExplorationTask>
<TestControlAnalysisTask>
<CorrelationsTask>
<TimeSeriesTask>
<AssociationsTrainTask>
<SequencesTrainTask>
<RegressionTrainTask>
<SOMTrainTask>


Command line parameters and the command line processor sacl

Through its XML interface, Synop Analyzer can be used as an analysis kernel within automated workflows or batch processes, or as a plug-in component embedded in third-party software. Synop Analyzer can be called in two ways:

For the first calling variant, the one or two command line parameters are optional; for the second variant, they are required:


In the following sections of this document, syntax and usage of Synop Analyzer XML tasks will be described in more detail.


General structure and a simple example of an XML task

An XML task according to the XML schema http://www.synop-systems.com?/xml/InteractiveAnalyzerTask.xsd consists of two parts:

A simple task, which reads the flat file kunden.txt from the subdirectory doc/sample_data of the Synop Analyzer installation directory and opens it in the Synop Analyzer graphical workbench, is given below:

  <?xml version="1.0" ?>
  <InteractiveAnalyzerTask>
     <InputData>
        <InputDataLocator usage="DATA_SOURCE" type="FLAT_FILE"
           name="doc/sample_data/kunden.txt"/>
     </InputData>
     <StartInteractiveAnalyzerGUITask/>
  </InteractiveAnalyzerTask>

If you store this task as kunden_task1.xml and start SynopAnalyzer or sacl with this file name as command line argument,

 c:> SynopAnalyzer.bat kunden_task1.xml,

then the Synop Analyzer workbench opens, the data are automatically read into memory, and after a few seconds you can start analyzing them. At that point, the data from kunden.txt have been read, interpreted, compressed, and enriched with additional statistics, and they are available in the computer's RAM for arbitrary analysis or data exploration tasks.
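
A task file like the one shown above can also be generated by a script. The following Python sketch is illustrative only: it uses just the element and attribute names that appear in the example task, and the helper name build_gui_task is hypothetical and not part of Synop Analyzer.

```python
import xml.etree.ElementTree as ET


def build_gui_task(data_file: str) -> str:
    """Return the XML text of a task that loads data_file and opens the workbench.

    Hypothetical helper; element and attribute names are copied from the
    example task in this document, nothing else is assumed about the schema.
    """
    root = ET.Element("InteractiveAnalyzerTask")
    input_data = ET.SubElement(root, "InputData")
    ET.SubElement(
        input_data,
        "InputDataLocator",
        {"usage": "DATA_SOURCE", "type": "FLAT_FILE", "name": data_file},
    )
    ET.SubElement(root, "StartInteractiveAnalyzerGUITask")
    # Prepend the same XML declaration that the example task uses.
    return '<?xml version="1.0" ?>\n' + ET.tostring(root, encoding="unicode")


task_xml = build_gui_task("doc/sample_data/kunden.txt")
print(task_xml)
```

The returned string can be written to a file such as kunden_task1.xml and passed to SynopAnalyzer or sacl as described above.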

You can also submit the XML task directly as a text string when calling SynopAnalyzer or sacl. In this case, you have to quote the task by enclosing it in double quotes, and every double quote inside the task has to be escaped with a backslash (\). The call then looks like this:

  c:> SynopAnalyzer.bat "<?xml version=\"1.0\" ?><InteractiveAnalyzerTask>
  <InputData><InputDataLocator usage=\"DATA_SOURCE\" type=\"FLAT_FILE\"
  name=\"doc/sample_data/kunden.txt\"/></InputData>
  <StartInteractiveAnalyzerGUITask/></InteractiveAnalyzerTask>"
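
The quoting rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name quote_task_for_command_line is hypothetical, joining the lines of the task with single spaces is an assumption, and the exact quoting behaviour still depends on the shell that receives the string.

```python
def quote_task_for_command_line(task_xml: str) -> str:
    """Turn a multi-line XML task into one double-quoted command line argument.

    Hypothetical helper: joins the non-empty lines with spaces, escapes each
    inner double quote with a backslash, and wraps the result in double quotes.
    """
    one_line = " ".join(line.strip() for line in task_xml.splitlines() if line.strip())
    escaped = one_line.replace('"', '\\"')
    return '"' + escaped + '"'


example_task = """<?xml version="1.0" ?>
<InteractiveAnalyzerTask>
   <InputData>
      <InputDataLocator usage="DATA_SOURCE" type="FLAT_FILE"
         name="doc/sample_data/kunden.txt"/>
   </InputData>
   <StartInteractiveAnalyzerGUITask/>
</InteractiveAnalyzerTask>"""

print(quote_task_for_command_line(example_task))
```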


Reference description of the <InputData> part

The element <InputData> describes a data source which can be opened in Synop Analyzer.

Optional attributes of <InputData>

<DataLocator> subelements

<InputData> must contain one subelement of type <DataLocator> and can contain up to two more, each of a different kind:

Each <DataLocator> must contain the following three attributes:

Optionally, <DataLocator> can contain one or more of the following attributes:

Optional <FieldUsage> subelements

<FieldUsage> defines a usage specification for one single data field. The tag contains the following attributes:

Optional <JoinedTable> subelements

<JoinedTable> specifies an auxiliary table which is to be combined with the main input data table using a primary key - foreign key relation between certain data fields of the two tables. <JoinedTable> has the following required attributes:

Optional <Taxonomy> subelements

<Taxonomy> defines an auxiliary table which contains taxonomy (hierarchy) information for one or more data fields of the main input data table. <Taxonomy> has the following required attributes:

<Taxonomy> must contain at least one of each of the following sub-tags:

Optional <NameMapping> subelements

<NameMapping> defines an auxiliary table which contains clear names for the values of one or more data fields of the main input data table. <NameMapping> has the following required attributes:

<NameMapping> must contain at least one of each of the following sub-tags:

Optional <Discretization> subelements

<Discretization> describes a manually defined discretization (binning) for one or more data fields in the main input data table. <Discretization> can contain one integer valued numeric attribute:

<Discretization> has the following sub-tags:

Optional <PerfectTupelDetection> subelements

<PerfectTupelDetection> defines a data analysis and data simplification step for data with set-valued data fields or data with a 'group' field. In a first step, a perfect tuple detection identifies all combinations of values of one data field which (almost) always occur together, that is, in the same data records or data groups. In a second step, all values that appear in such a combination are removed from the data and replaced by a textual string representing the entire combination. <PerfectTupelDetection> can contain the following attributes:
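
The two steps of a perfect tuple detection can be illustrated with a small Python sketch. It is not Synop Analyzer's implementation: it only handles values that occur in exactly the same groups (the 'almost always' case is left out), and the function names, the sample data, and the '+' separator in the replacement string are assumptions made for the illustration.

```python
from collections import defaultdict


def detect_perfect_tuples(groups):
    """Step 1: find values that occur in exactly the same groups.

    groups is a list of sets of values (one set per data record or data group).
    Returns a sorted list of tuples, each holding values that always co-occur.
    """
    occurrences = defaultdict(set)
    for group_id, values in enumerate(groups):
        for value in values:
            occurrences[value].add(group_id)

    # Values with an identical set of group ids always appear together.
    by_signature = defaultdict(list)
    for value, group_ids in occurrences.items():
        by_signature[frozenset(group_ids)].append(value)
    return sorted(tuple(sorted(vals)) for vals in by_signature.values() if len(vals) > 1)


def replace_perfect_tuples(groups, tuples):
    """Step 2: replace every member of a tuple by one string naming the whole tuple."""
    replacement = {value: "+".join(t) for t in tuples for value in t}
    return [{replacement.get(value, value) for value in values} for values in groups]


groups = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "butter", "jam"},
]
tuples = detect_perfect_tuples(groups)
simplified = replace_perfect_tuples(groups, tuples)
```

In this sample data, 'bread' and 'butter' occur in the same three groups, so they form one tuple and are replaced by a single combined value in each group.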


Reference description of the analysis task part

After the <InputData> part, a Synop Analyzer task can contain one or more of the following elements, which define the analysis tasks that can be performed on the data using Synop Analyzer's analysis modules.

The element marked with (*) is a 'dummy' element. It does not define an analysis step; it just starts the Synop Analyzer workbench and reads the input data which have been specified in the preceding <InputData> part of the task. This element cannot be processed by the command line processor sacl.

<UnivariateExplorationTask>

<UnivariateExplorationTask> generates a statistical overview of the currently active input data and creates visualizations of the value distributions for all data fields.

<UnivariateExplorationTask> can contain the following attributes:

<UnivariateExplorationTask> can contain the following sub-elements:

<CorrelationsTask>

<CorrelationsTask> analyses and displays correlations between the data fields.

<CorrelationsTask> can contain the following attributes:

<CorrelationsTask> can contain the following sub-element:

<BivariateExplorationTask>

<BivariateExplorationTask> creates a bivariate analysis of the interdependencies of two data fields. The values or value ranges of one field are plotted along the x axis, the values of the second field along the y axis. Each cell (m,n) of the resulting matrix contains the number of data records - or, if a 'group' field has been specified, the number of groups - in which the x-field has the m-th value and the y-field the n-th value. A color code signals whether this combination occurs more (green) or less (red) frequently than expected. This method visualizes systematic interdependencies between certain values of the two fields.
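
The matrix described above can be sketched in a few lines of Python. This is an illustration, not the module's implementation: the text does not say how the 'expected' frequency is computed, so the sketch assumes the usual expectation under independence of the two fields (row total times column total, divided by the number of records). The function name and sample data are hypothetical.

```python
from collections import Counter


def bivariate_counts(records):
    """Map each (x, y) cell to a pair (observed count, expected count).

    records is a list of (x_value, y_value) pairs. The expected count assumes
    that the two fields are independent; this is an assumption of the sketch.
    """
    n = len(records)
    observed = Counter(records)
    x_totals = Counter(x for x, _ in records)
    y_totals = Counter(y for _, y in records)
    return {
        (x, y): (observed[(x, y)], x_totals[x] * y_totals[y] / n)
        for x in x_totals
        for y in y_totals
    }


records = [("A", "yes")] * 3 + [("A", "no")] + [("B", "yes")] + [("B", "no")] * 3
table = bivariate_counts(records)
for cell, (obs, exp) in sorted(table.items()):
    marker = "more than expected" if obs > exp else "less than expected" if obs < exp else "as expected"
    print(cell, obs, exp, marker)
```

A cell whose observed count exceeds the expected count would correspond to a green cell in the visualization, and a cell below the expected count to a red one.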

<BivariateExplorationTask> can contain the following attributes:

<BivariateExplorationTask> must contain the following required sub-elements:

<MultivariateExplorationTask>

<MultivariateExplorationTask> generates and visualizes a multivariate data selection, that is, the equivalent of an SQL SELECT statement with a WHERE clause in which one or more data fields appear as filter criteria. The result shows how the value distributions of all data fields, both those serving as selection criteria and the others, differ on the selected data subset from the corresponding value distributions on the entire data.
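
The idea of comparing a selected subset with the entire data can be illustrated by the following Python sketch. It is not the module's implementation; the function names, the field names, and the sample records are hypothetical, and the filter function plays the role of the WHERE clause.

```python
from collections import Counter


def relative_frequencies(records, field):
    """Return the share of each value of one field among the given records."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def compare_selection(records, where, field):
    """Return (distribution on the selection, distribution on all records)."""
    selected = [record for record in records if where(record)]
    return relative_frequencies(selected, field), relative_frequencies(records, field)


records = [
    {"age": "young", "buys": "yes"},
    {"age": "young", "buys": "yes"},
    {"age": "old", "buys": "no"},
    {"age": "old", "buys": "yes"},
]
on_selection, on_all = compare_selection(records, lambda r: r["age"] == "young", "buys")
print(on_selection, on_all)
```

Here the share of 'yes' is higher on the selection than on the entire data, which is the kind of difference the multivariate selection is meant to show.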

<MultivariateExplorationTask> can contain the following attribute:

<MultivariateExplorationTask> can contain the following sub-elements:

<TestControlAnalysisTask>

<TestControlAnalysisTask> creates and compares two different (and normally disjoint) multivariate data selections on one single data set: the 'test' data and the 'control' data. The two subsets can then be analyzed for significant differences in their value distributions. Furthermore, the test/control analysis module can sample a subset of the original control data which is 'representative' of the test data on some specified data fields.

<TestControlAnalysisTask> can contain the following optional attributes:

<TestControlAnalysisTask> can contain the following sub-elements:

<TimeSeriesTask>

<TimeSeriesTask> describes the analysis of a time series: detection of trends and cyclic components ('seasons'), modeling the impacts of singular events ('strokes') and calculation of forecasts.

<TimeSeriesTask> can contain the following optional attributes:

<TimeSeriesTask> can contain the following subelement:

<AssociationsTrainTask>

<AssociationsTrainTask> defines the task to perform an associations analysis and to generate a collection of association rules on the data described in the <InputData> section. The result can be returned in the form of a PMML <AssociationModel> or in tabular form as a flat file.

<AssociationsTrainTask> can contain the following optional attributes:

<AssociationsTrainTask> can contain the following optional subelements:

<SequencesTrainTask>

<SequencesTrainTask> defines the task to perform a sequential patterns analysis and to generate a collection of sequential patterns on the data described in the <InputData> section. The result can be returned in the form of a PMML <SequenceModel> or in tabular form as a flat file.

<SequencesTrainTask> can contain the same attributes and subelements as <AssociationsTrainTask>. However, between formally identical attributes and subelements in <AssociationsTrainTask> and <SequencesTrainTask> there is the semantic difference that 'support' means something different in sequences compared to associations. In associations, support refers to the number of data records or data groups (transactions) in which a pattern occurs. In sequences, support refers to the number of 'entities' for which transaction data have been collected. For example, in market basket analysis, the support of an association is the number of sales slips (transactions) in which a combination of articles occurs, whereas the support of a sequence is the number of customers (entities) for whom a certain time-ordered purchasing pattern applies.

A sequential patterns analysis can only be performed if the <InputData> section of the task specification defines an 'ENTITY' data field and an 'ORDER' data field.
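
The difference between the two notions of support can be made concrete with a small Python sketch. It is an illustration only, not Synop Analyzer's algorithm: each transaction is modelled as a triple of entity, order value, and item set, and the function names and sample data are hypothetical.

```python
def association_support(transactions, itemset):
    """Number of transactions (e.g. sales slips) that contain all items of itemset."""
    return sum(1 for _, _, items in transactions if itemset <= items)


def sequence_support(transactions, first, later):
    """Number of entities (e.g. customers) who have 'first' and, in a later
    transaction, 'later'. Transactions are ordered by their order value."""
    histories = {}
    for entity, order, items in transactions:
        histories.setdefault(entity, []).append((order, items))

    count = 0
    for history in histories.values():
        first_orders = [order for order, items in history if first in items]
        if first_orders and any(
            later in items and order > min(first_orders) for order, items in history
        ):
            count += 1
    return count


# (entity, order, items): customer id, purchase number, purchased articles
transactions = [
    ("c1", 1, {"printer"}),
    ("c1", 2, {"ink", "paper"}),
    ("c1", 3, {"ink", "paper"}),
    ("c2", 1, {"ink", "paper"}),
    ("c2", 2, {"printer"}),
]
print(association_support(transactions, {"ink", "paper"}))
print(sequence_support(transactions, "printer", "ink"))
```

The association 'ink and paper' is counted once per sales slip, whereas the sequence 'printer, later ink' is counted once per customer, and only for customers whose purchases occur in that order.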

<SequencesTrainTask> can contain the following additional subelements, which either cannot occur in <AssociationsTrainTask> or have a different meaning there:

<RegressionTrainTask>

<RegressionTrainTask> defines the task to perform a regression analysis and to generate a regression model on the data described in the <InputData> section. The result can be returned in the form of a PMML <RegressionModel> or in tabular form as a flat file.

<RegressionTrainTask> can contain the following optional attributes:

<RegressionTrainTask> can contain the following subelements:

<SOMTrainTask>

<SOMTrainTask> defines the training of a self-organizing map (SOM model), that is, a two-dimensional grid of neurons, on the data described in the <InputData> section. SOM models can be used for cluster analysis and for the prediction of unknown data field values. The resulting SOM model can be returned in the form of a PMML <ClusteringModel> or in a proprietary binary format.

<SOMTrainTask> can contain the following optional attributes:

<SOMTrainTask> can contain the following subelements: