The Regression Analysis panel

Content

Purpose and short description
Parameters for regression analsis
The Regressions result panel
Applying regression models to new data ('Scoring')


Purpose and short description

A regression analysis finds a formula which predicts the value of one single data field, the so-called target field, as a function of other data fields, the so-called predictor fields. The formula is detected during a so-called training process on data, on which both the target field and the predictor fields are filled with values. The resulting formula is also called a regression model. The regression model can later be applied to new data in which the target field values are missing in order to predict the target field values. This step is called scoring.

Synop Analyzer provides several methods for creating more general regression models, for example the neural SOM method. This chapter, however, shall be focussed on linear regression and logistic regression.

A linear regression model is a linear formula which predicts the target field value y of a numeric target field from n predictor field values x1 bis xn:

y = c0 + c1 x1 + ... + cn xn.

In logistic regression, the probability of one of the two values of a two-valued target field t (the so-called '1'-value) is expressed as a formula of the kind:

proba(t=1) = 1/(1+eb0+b1*x1+...+bn*xn).

Does every predictor field contribute exactly one regressor xi? This is only the case for numeric and Boolean data fields. More precisely, the following holds:

Training a linear or logistic regression model means detecting the best coefficient values ci such that the resulting formula minimizes the mean squared difference between the actual and the predicted target field values on the training data.

Within Synop Analyzer, a linear or logistic regression analysis is started by pressing the button regression in the left screen column.


Parameters for regression analsis

The first visible tab in the toolbar at the lower end of Synop Analyzer's linear regression panel contains the available parameters for linear regression analysis.

image file reg_toolbar_train_873.png not found

In the following, we want to explain the process of training and interpreting a regression model at the hand of a concrete example operating on the sample doc/sample_data/customers.txt using the following settings:

When working with the module Linear Regression Analysis, you sometimes get the following error message:

image file reg_collinear.png not found

The message says that some of the predictor fields are collinear, that means perfectly correlated. In this case, no unambiguous linear regression model can be built. In the example mentioned above, the message appears due to the fact that the data field Profession contains a the value unknown on couple of data records, which all happen to have ages below 12 and the family status child. All other data records with these data subset have Profession=inactive. Therefore, the two profession values are collinear.

The problem can be resolved by defining the value unknown as a 'null value' for the field Profession within the pop-up dialog Active data fields before reading the data into memory. The effect of this is that the value unknown does not any more represent a valid field value, and no regressor is created for it.

image file reg_collinear_793.png not found


The Regression result panel

After successfully terminating the training process, the main part of the window 'Regression Analysis' displays the regressors of the resulting model and their coefficients ci in the first two columns of the tabular result view.

The right column of the table ranks the corresponding predictor fields by their importance within the regression model, that means by their average impact on the predicted target field values. For calculating this impact measure, the software memorizes for each data record the contribution to the target value which comes from all regressors deriving from the one single examined predictor field. The displayed number is then the standard deviation of this list of contribution numbers.

image file reg_resultview.png not found

The tab Result introspection within the bottom tool bar displays the total number of regressors within the model, and it contains two quality numbers which help to judge the quality of the generated model:

From the low quality values of our sample model we can see that linear regression models often deliver poor prediction quality. This is due to the fact that the linear regression approach is mathematically simple but completely neglects many important possible types of relations between the target field value and the predictor values. In particular, non-linear relations such as quadratic, exponential or cyclic relations can not be modeled, and the same holds for multi factor effects such as y = c * xi * xj.

Therefore, linear regression models should be used for actually predicting values ('scoring') with care. Rather, they are useful for studying the principal relations between different fields, and for serving as reference models for regression models created by more sophisticated algorithms such as SOM or regression trees.


Applying regression models to new data ('Scoring')

Regression models can be applied to new data in order to create predictions on these data. This application of regression models to new data for predictive purposes is called 'scoring'.

You load and apply a linear or logistic regression model by first opening and reading the new data, by then pressing the button regression in order to start the regression analysis module and by then clicking the button Load model in the tab Scoring Settings of the tool bar at the lower end of the panel's GUI window.

image file reg_scoring_train_878.png not found

In the following sections we will demonstrate the process of regression model scoring with the help of a concrete example use case: using an logistic regression model we want to predict the propensity of newly acquired bank customers to sign a life insurance contract.

For this purpose, we load the sample data doc/sample_data/customers.txt. We keep the default data import settings with one exception but mark the field CUSTOMER_ID as the group field in the pop-up window Active Fields. Then we start the regression analysis module and train a model called regr_li.mdl, using the following parameter settings:

For model evaluation purposes, we apply the generated model to the training data and compare the predicted life insurance propensity to the actual existence or non-existence of a life insurance contract. In the Scoring Settings tab of the toolbar, we specify that we want to create a predicted field called LI_PRED. Optionally, we can also specify a file name to which the scoring results will be written (scored_customers_LI.txt). Then we press Start scoring in order to create the desired scoring results.

image file reg_scoring_params_845.png not found

A new in-memory data source tab pops up in the left column of the Synop Analyzer workbench. In this new data source, we select the module Bivariate Exploration and select the data fields LifeInsurance as x-axis field and LI_PRED as y-axis field. The red and green colors in the bivariate matrix show us that the model's prediction generally coincides well with the actual values.

image file reg_scoring_eval_685.png not found

The predictions calculated on the training data could also be used for detecting interesting 'candidates' for sales actions concerning life insurance contracts. For example, one could select the customers under 50 years which do not (yet) have a life insurance but for which the model has predicted a high propensity for signing a life insurance contract:

image file reg_scoring_eval_result_968.png not found

Now we apply the logistic regression model to a new data collection: 159 new customers, among which we want to find the most interesting candidates for selling life insurance contracts. We load the data doc/sample_data/newcustomers_159.txt into Synop Analyzer, thereby marking the data field CUSTOMER_ID as the 'group' field in the pop-up dialog Active fields

On this new in-memory data source, we start the regression analysis module and move to the tab Scoring Parameters in the tool bar at the lower end of the screen. Here, first load the regression model to be applied to the data, the model regr_LI.mdl. Then, we enter the name of the file in which the scoring results are to be stored (newcustomers_LI.txt), we define the scoring result data fields to be contained in that file and we specify that the new file should be a copy of the existing file newcustomers_159.txt plus the new computed data fields. (Create new data, original plus computed fields).

image file reg_scoring_params2_838.png not found

By means of the button Start scoring we create the scoring results, write the desired result file to disk and open the resulting data as a new in-memory data source in Synop Analyzer, that means as a new tab in the left column of the Synop Analyzer workbench.

We introspect the scoring result data with the module 'multivariate exploration'. We see that the model has created a propensity probability of at least 20% for 23 of the 159 new customers whose age is below 50 years and who do not yet have a life insurance.

image file reg_scoring_multivar_849.png not found

Via the button show data we submit the selected 23 data records to a last visual examination. Then we can use the button Export to save the resulting list to a flat file or Excel spreadsheet, or we can use the main menu button Report to create a HTML or PDF report.

image file reg_scoring_result_849.png not found