The Self-Organizing Maps (SOM) module

Contents

Purpose and short description
Basic parameters for SOM trainings
Expert parameters for SOM trainings
Interpreting the result visualizations
Apply SOM models to new data
Creating scoring results


Purpose and short description

Self-organizing maps (SOMs) are neural networks in which the neurons form a two-dimensional square or hexagonal grid, and each neuron is connected by artificial synapses to its nearest neighbors. A SOM is trained in an unsupervised learning process on a so-called training data set. Each neuron has a set of properties, the so-called weights, which correspond to the data attributes available in the training data, so that each neuron represents a unique combination of values of these attributes.

The purpose of the SOM is to define a mapping from the high-dimensional training data space, with its many attribute dimensions, to a two-dimensional representation that is easy to visualize and interpret but preserves as much as possible of the structural (topological) information of the original data space.
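
The training idea described above can be sketched in a few lines of Python. This is a generic textbook-style SOM on a square grid, not Synop Analyzer's implementation; the function names, the linear decay of learning rate and radius, and the Gaussian neighborhood are assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, record):
    """Return the grid position of the neuron whose weights are closest to the record."""
    return min(
        weights,
        key=lambda pos: sum((w - x) ** 2 for w, x in zip(weights[pos], record)),
    )


def train_som(records, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small square-grid SOM on equal-length numeric records scaled to [0, 1]."""
    rng = random.Random(seed)
    dim = len(records[0])
    radius0 = max(rows, cols) / 2.0
    # One weight vector per neuron, initialised randomly in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(records)
    step = 0
    for _ in range(epochs):
        # Present the records in a fresh random order in every epoch.
        for record in rng.sample(records, len(records)):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                      # assumed linear decay
            radius = max(radius0 * (1.0 - frac), 0.5)    # assumed linear decay
            bmu = best_matching_unit(weights, record)
            for pos, w in weights.items():
                # Neurons close to the winner on the grid are pulled more strongly.
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (record[i] - w[i])
            step += 1
    return weights
```

Calling `train_som(records, rows=3, cols=3)` returns a dictionary that maps each grid position to a weight vector; plotting one component of these vectors per grid cell gives one map per data attribute.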

SOM models have two major application areas: data visualization and data clustering on the one hand, and scoring (the prediction of unknown attribute values) on the other. In the latter case, the trained SOM model is applied to a new data collection, the so-called scoring data, in which some of the attributes or attribute values of the original training data are missing.

You can find more details on the theoretical approach and links for further reading at http://en.wikipedia.org/wiki/Self-organizing_map.


Basic parameters for SOM trainings

In Synop Analyzer, a SOM training is started by loading a data source, the so-called training data, into memory and clicking the button SOM in the input data panel on the left side of the Synop Analyzer GUI. The button opens a panel named SOM Training. In the lower part of this panel, you can specify parameters for the next SOM training and start the training process. Because training can be a long-running task, it is executed asynchronously in one or more parallel background threads. When the training has finished, the resulting SOM model is displayed in the upper part of the panel.

The following paragraphs and screenshots demonstrate the handling of the various sub-panels and buttons using the sample data doc/sample_data/customers.txt. We assume that these data have been read into memory without changing any default settings in the data import panel on the left side of the screen.

The first visible tab in the toolbar at the lower end of the SOM panel contains the most important parameters for SOM trainings.

[Image: som_toolbar_train_939.png]

In the screenshot, the following settings were specified:

You should consider specifying a target weight larger than 1 if you want to train a SOM for predicting a target field and if, with the default training settings, the resulting SOM card of the target field shows no clear structures but rather an amorphous green and grey pattern.

On the other hand, you can easily generate an 'over-trained' model by pushing the target weight too high. Over-training means that the resulting SOM maps the target field values of all records almost perfectly but performs poorly both on the other data fields of the training data and when predicting the target field values of new scoring data.
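
How a target weight can influence the mapping is illustrated by the following sketch. It assumes that the target weight scales the target field's contribution to the distance between a record and a neuron; the text does not state how Synop Analyzer applies the weight internally, and the function names and numbers are hypothetical.

```python
def weighted_sq_distance(record, neuron, field_weights):
    """Squared distance in which each field's contribution is scaled by its weight."""
    return sum(fw * (x - w) ** 2 for x, w, fw in zip(record, neuron, field_weights))


def best_match(record, neurons, field_weights):
    """Index of the neuron with the smallest weighted distance to the record."""
    return min(
        range(len(neurons)),
        key=lambda i: weighted_sq_distance(record, neurons[i], field_weights),
    )


# Hypothetical example: the last field is the target field.
record = [0.2, 0.9]
neurons = [[0.2, 0.5], [0.7, 0.9]]

default_choice = best_match(record, neurons, [1.0, 1.0])   # target weight 1
weighted_choice = best_match(record, neurons, [1.0, 3.0])  # target weight 3
```

With equal weights the record is mapped to the first neuron, which matches the non-target field; with a target weight of 3 it is mapped to the second neuron, which matches the target value. A very large target weight makes the other fields almost irrelevant for the mapping, which is the over-training risk described above.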

It is always a good idea to put aside a small part of the available training data before starting the SOM training. These data can then be used to validate the SOM: you let the model predict the target field values and compare the predictions to the actual target field values. This approach helps to find the training parameter settings that produce the model with the smallest mean squared difference between the actual and the predicted target field values.
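
The validation approach can be sketched as follows. The helper names and the 20 percent holdout fraction are assumptions for illustration; in Synop Analyzer you would prepare the two data sets before loading them.

```python
import random


def split_holdout(records, holdout_fraction=0.2, seed=0):
    """Shuffle the records and split them into (training, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def mean_squared_difference(actual, predicted):
    """Mean squared difference between actual and predicted target values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

Training several models with different parameter settings on the training part and comparing their `mean_squared_difference` on the validation part shows which settings generalize best.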


Expert parameters for SOM trainings

The second tab at the lower end of the screen, Advanced Parameters, provides four parameters for fine-tuning the training process. You should only modify them if you are familiar with the SOM approach and with algorithm parameters such as 'learning rate' or 'neighborhood radius'.
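
For readers unfamiliar with these terms, the following sketch shows the role that a learning rate and a neighborhood radius typically play in a generic SOM algorithm. The exponential decay schedule, the Gaussian neighborhood and all names are assumptions; they are not the parameters or formulas of the Advanced Parameters tab.

```python
import math


def decayed(initial, final, step, total_steps):
    """Exponentially interpolate a parameter from its initial to its final value."""
    frac = step / max(1, total_steps - 1)
    return initial * (final / initial) ** frac


def neighborhood_factor(grid_distance, radius):
    """Gaussian factor in (0, 1]: how strongly a neuron follows the winning neuron."""
    return math.exp(-(grid_distance ** 2) / (2.0 * radius ** 2))


# Early in training: large steps that also move distant grid neighbors.
early = decayed(0.5, 0.01, 0, 100) * neighborhood_factor(2, decayed(5.0, 0.5, 0, 100))
# Late in training: small steps that affect almost only the winning neuron.
late = decayed(0.5, 0.01, 99, 100) * neighborhood_factor(2, decayed(5.0, 0.5, 99, 100))
```

A high learning rate and a large radius let the map organize itself quickly but coarsely; low values refine individual neurons without disturbing their neighbors.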

[Image: som_advanced_563.png]


Interpreting the result visualizations

The third tab within the tool bar at the lower border of the SOM training window offers some capabilities to modify the display mode of the created SOM model and to inspect and export the model itself or certain data clusters marked on it. Some of the buttons only become enabled once you have selected one or more neurons by mouse clicks within the SOM cards.

The screenshot shown below results if you apply the parameter settings described in the previous sections and then press the button Start training.

[Image: som_resultview_939.png]

The main part of the screen displays one separate map, a so-called SOM card, for each data field. The SOM cards can be interpreted as follows:

Synop Analyzer's SOM cards provide a wide variety of mouse-based interactivity and selection features:

[Image: som_selected_939.png]

In the tool bar tab Result introspection, the following options are available:

[Image: som_multivar_977.png]


Apply SOM models to new data

SOM models that have been trained and stored earlier can later be reloaded and applied to a new data source. Synop Analyzer then compares the data fields available in the new data with the data fields used in the SOM model. Applying the model to the new data is only possible if at least half of the data fields used in the model are available in the new data. To load and apply a SOM model, first open and read the new data, then press the button SOM to start the SOM module, and finally click the button Load model in the fourth tab of the tool bar at the lower end of the SOM panel.
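
The field comparison rule stated above can be expressed as a small check. The function name and the field names in the example are hypothetical; only the "at least half" threshold comes from the text.

```python
def model_is_applicable(model_fields, data_fields, min_fraction=0.5):
    """True if at least min_fraction of the model's fields occur in the new data."""
    model_fields = set(model_fields)
    if not model_fields:
        return False
    shared = model_fields & set(data_fields)
    return len(shared) / len(model_fields) >= min_fraction


# Hypothetical field names, for illustration only.
model_fields = ["Age", "Gender", "Profession", "AccountBalance"]
ok = model_is_applicable(model_fields, ["Age", "Gender", "City"])   # 2 of 4 fields
too_few = model_is_applicable(model_fields, ["Age", "City"])        # 1 of 4 fields
```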

[Image: som_toolbar_scoring_888.png]

Once the SOM model has been loaded and applied successfully, the same SOM cards appear that you saw at the end of the training process on the training data. However, the black dots and quadrangles within the cards now represent the mapping of the new data records to the neural net. Correspondingly, the mapping quality measures Overall RMSE and Selection RMSE, as well as the relative and absolute numbers of selected data records shown in the panel's bottom tool bar, now refer to the new data.

[Image: som_applyModel_939.png]

When applying a SOM model to a new data source, you should always have a look at the measure Overall RMSE. If this value is much larger on the new data than it was on the training data, the new data do not match the model very well. This indicates that a major shift has occurred between the training data and the application data in the rules and relations that connect the different data fields and their values. In that case, using this SOM model for scoring the new data, that is, for predicting missing field values, can yield misleading results.
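
The comparison can be illustrated with the following sketch. The exact definition of Overall RMSE is not given in the text; the sketch assumes it is the root mean squared difference between each record's field values and the weights of its best matching neuron, which is a common mapping quality measure for SOMs.

```python
import math


def overall_rmse(records, neurons):
    """Root mean squared difference between records and their best matching neurons."""
    total = 0.0
    count = 0
    for record in records:
        # Squared distance to the closest neuron (the best matching unit).
        total += min(
            sum((x - w) ** 2 for x, w in zip(record, neuron)) for neuron in neurons
        )
        count += len(record)
    return math.sqrt(total / count)


# Toy example: the training data sit exactly on the neurons, the new data do not.
neurons = [[0.0, 0.0], [1.0, 1.0]]
rmse_training = overall_rmse([[0.0, 0.0], [1.0, 1.0]], neurons)
rmse_new = overall_rmse([[0.5, 0.5]], neurons)
```

A value of `rmse_new` that is much larger than `rmse_training` corresponds to the warning sign described above.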

In our example, the distribution of the black quadrangles shows that the average demographic properties of the new customers do not coincide with those of the existing customer base: new customers are mostly children or young adults. Nonetheless, the model seems well applicable to the new data because the Overall RMSE value is only slightly larger than it was on the training data and is still close to 0.


Creating scoring results

Now we want to use the loaded SOM model for scoring, more precisely for predicting the average account balance that we can expect from each new customer a few months after starting to do business with him or her. This information can be important for customer relationship management and for optimizing the bank's internal refinancing strategy. The tab Scoring Settings within the SOM panel's bottom tool bar offers the following customization parameters for the SOM scoring:

[Image: som_toolbar_scoring_888.png]

Once all settings and customizations have been made, pressing the button Start scoring executes the scoring process. When the process has terminated without an error, the scoring result data are automatically opened in a new input data tab within the left screen column of the Synop Analyzer workbench. You can now apply all analysis modules provided by your Synop Analyzer license to these new data. In the screenshot shown below, we have opened the scoring result file of our example in the module Multivariate Exploration.

[Image: som_scoring_1141.png]

We have selected the new data field BalanceStdDev as the detail structure field of our visualization. This field contains the SOM model's own estimate of the accuracy of each of its predictions. Blue or violet values correspond to low uncertainty ranges, orange and red values to very high uncertainty ranges. For some data records, the SOM considered its prediction very accurate (to within about 100 EUR); for other data records, the model gave an uncertainty range of up to 30,000 EUR.

In our example, we see that the average balance of children can be predicted quite precisely and that, surprisingly, the self-estimated prediction quality for men is more often very low or very high than for women, for whom medium uncertainty ranges dominate.
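
One plausible way in which a SOM can produce both a prediction and an uncertainty estimate is sketched below. This is an assumption for illustration, not a description of how Synop Analyzer computes BalanceStdDev: the record is mapped to its best matching neuron using only the known fields, and the mean and standard deviation of the training target values mapped to that neuron serve as prediction and uncertainty. All names and numbers are hypothetical.

```python
import statistics


def score_record(record, neurons, targets_per_neuron):
    """Return (predicted value, standard deviation) for one record.

    record: input field values, None for a missing value.
    neurons: weight vectors over the same input fields.
    targets_per_neuron: non-empty list of training target values for each neuron.
    """
    def distance(neuron):
        # Missing values are skipped, so only known fields decide the mapping.
        return sum((x - w) ** 2 for x, w in zip(record, neuron) if x is not None)

    best = min(range(len(neurons)), key=lambda i: distance(neurons[i]))
    values = targets_per_neuron[best]
    return statistics.mean(values), statistics.pstdev(values)


# Hypothetical model with two neurons and the balances of the training records
# that were mapped to each of them.
neurons = [[0.1, 0.2], [0.9, 0.8]]
balances = [[100.0, 120.0, 110.0], [5000.0, 25000.0]]

narrow = score_record([0.15, None], neurons, balances)  # homogeneous neuron
wide = score_record([0.95, 0.7], neurons, balances)     # heterogeneous neuron
```

In this toy model the first record receives a prediction with a small standard deviation, the second one a prediction with a large standard deviation, analogous to the low and high uncertainty ranges discussed above.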

By pressing the button show data, you can inspect the entire scoring results in tabular form, sort and filter them, and export some or all of them to persistent target formats such as flat text files or spreadsheets. In the picture shown below, we have sorted the scoring results by decreasing predicted value. We see that the model predicts the highest account balances for 40- to 55-year-old engineers, freelancers, craftsmen and farmers, and for pensioners.

[Image: som_scoringresults_874.png]