FS Analysis of the Transformed Loyalty Card Data

Author: Francesca Torti - Marco Riani - Anthony C. Atkinson - Domenico Perrotta - Aldo Corbellini
Profession: European Commission, Joint Research Centre (JRC) - University of Parma, Italy - London School of Economics, UK - European Commission, Joint Research Centre (JRC) - University of Parma, Italy
Pages: 11-12
/* SAS working library and data matrix creation */
libname lib "C:\FSDA\data\regression";
use ("lib.loyalty");
read all var {'x1' 'x2' 'x3'} into x[colname=colnx];
read all var 'y' into y[colname=colny];
close ("lib.loyalty");
/* Add constant variable to the data for model with intercept */
x = x || j(nrow(x),1,1);
Figure 1: Example of SAS IML Studio code that loads the loyalty card data into SAS IML.
6. FS Analysis of the Transformed Loyalty Card Data
The data example we use in virtually all the calculations in this paper, taken from Atkinson and Riani (2006),
is of 509 observations on the behavior of customers with loyalty cards from a supermarket chain in Northern
Italy. The data are themselves a random sample from a larger database. The sample of 509 observations
is part of the FSDA toolbox for MATLAB. The response is the amount, in euros, spent at the shop over
six months and the explanatory variables are: x1, the number of visits to the supermarket in the six-month
period; x2, the age of the customer; and x3, the number of members of the customer’s family. The data are
loaded in SAS IML with the commands reported in Figure 1.
Atkinson and Riani (2006) show that the data need transformation to achieve constant variance, for
which purpose we use the Box-Cox power transformation. As we see in §10, a value of 0.4 is indicated for
the transformation parameter, and we work with this transformation for the rest of this section.
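
As a minimal illustration of the transformation step, the Python sketch below applies the Box-Cox power transformation with parameter 0.4. It uses the plain (unnormalized) form, which is an assumption here, and the function name `box_cox` and the spend values are hypothetical, not taken from the loyalty card data.

```python
import numpy as np

def box_cox(y, lam):
    """Plain (unnormalized) Box-Cox power transformation.

    Requires strictly positive y; lam = 0 gives the log transformation.
    """
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

# Hypothetical amounts spent, in euros (illustrative only).
spend = np.array([50.0, 200.0, 800.0, 3200.0])

# Transformed response for the value 0.4 used in this section.
z = box_cox(spend, 0.4)
print(z)
```

The transformation is monotone, so the ordering of customers by spend is preserved while large amounts are compressed.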
Figure 2 shows, in the top panel, a forward plot of absolute minimum deletion residuals for observations
not in the subset used in fitting. In addition to the residuals, the plot includes a series of pointwise percentage
levels for the residuals (at 1%, 50%, 99%, 99.9%, 99.99% and 99.999%) found by the order statistic
arguments of §4. Several large residuals occur towards the end of the search. These are identified by an
automatic procedure that includes the resuperimposition of envelopes described at the end of §5. In all, 18
outliers (plotted as red crosses in the .pdf version) are identified. These form the last observations to enter
the subset in the search. The figure shows that, at the very end of the search, the trajectory of residuals
returns inside the envelopes, the result of masking. As a consequence, the outliers would not be detected
by the deletion of single observations from the fit to all n observations.
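
The quantity monitored in the top panel can be sketched for a single step of the search. The Python function below is a minimal illustration, not the FSDA implementation. It assumes the usual definition of the deletion residual for a unit outside the subset: the least-squares residual from the fit to the subset, divided by s(1 + h)^0.5, where s is the residual standard error of that fit and h is the leverage of the outside unit relative to the subset. The name `min_deletion_residual` and its arguments are hypothetical.

```python
import numpy as np

def min_deletion_residual(X, y, subset):
    """Minimum absolute deletion residual among units outside `subset`.

    X is the n x p design matrix (including the constant column),
    y the response, subset the indices of the m units used in fitting.
    Returns (value, index of the unit attaining the minimum).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    inside = np.zeros(n, dtype=bool)
    inside[subset] = True
    Xm, ym = X[inside], y[inside]
    m = Xm.shape[0]
    if m <= p or m == n:
        raise ValueError("need p < m < n")

    # Least-squares fit to the subset and its residual variance.
    beta, *_ = np.linalg.lstsq(Xm, ym, rcond=None)
    res_in = ym - Xm @ beta
    s2 = res_in @ res_in / (m - p)

    # Leverage of each outside unit relative to the subset fit.
    XtX_inv = np.linalg.inv(Xm.T @ Xm)
    Xout = X[~inside]
    h = np.einsum("ij,jk,ik->i", Xout, XtX_inv, Xout)

    # Absolute deletion residuals for the outside units.
    r = np.abs(y[~inside] - Xout @ beta) / np.sqrt(s2 * (1.0 + h))
    k = int(np.argmin(r))
    return float(r[k]), int(np.flatnonzero(~inside)[k])
```

Calling this function for each subset size m and plotting the returned value against m gives a trajectory of the kind shown in the top panel; the envelopes themselves are not computed in this sketch.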
These results are very stable once a subset of non-outlying observations has been achieved. Figure 2
also shows, in the bottom panel, the monitoring of scaled residuals during the search with the 18 outliers
shown in red (in the on-line .pdf version). The outliers all have negative residuals, the values of which change
little during the search until the end, when the outliers start to enter the subset. Then the residuals for the
outliers decrease steadily in magnitude. At the same time, the residuals for some of the observations from
the main body of the data begin to increase. The plots in Figures 2-4 were produced by brushing, that is,
selecting the observations of interest from the top panel of Figure 2 and highlighting them in all others.
The observations we have found are outlying in an interesting way, especially for the values of x1.
Figure 3 shows the scatterplots of y against the three explanatory variables, with brushing used to highlight
the outlying observations in red (in the on-line .pdf version). The first panel is of y against x1. The FS has
identified a subset of individuals, most of whom are behaving in a strikingly different way from the majority of
the population. They appear to form a group who spend less than would be expected from the frequency of
their visits. The scatterplots for x2 and x3, on the other hand, do not show any distinct pattern of outliers.
The effect of the 18 outliers on inference can be seen in Figure 4, which gives forward plots of the
four parameter estimates, again with brushing used to plot the outlying observations in red (in the on-line
.pdf version). The upper left panel, that for β̂1, is the most dramatic. As the outliers are introduced, the
estimate decreases rapidly in a seemingly linear manner. This behaviour reflects the inclusion of the outliers
in Figure 3, all of which lie below the general linear structure: their inclusion causes β̂1 to decrease. The
outliers also have effects on the other three parameter estimates (the lower right panel is for the estimate
of the intercept β0). Although the effects for these three parameters are appreciable, they do not take any
of the estimates outside the range of values which was found before the inclusion of outliers. The group
of outliers, who are spending less than would be expected and who do not agree with the model for the
majority of the data, will be important in any further modelling.
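
The forward plots of parameter estimates can be mimicked with a simplified forward search. The Python sketch below is an illustration under stated assumptions, not the FSDA routine: the initial subset is supplied by the caller (the robust start used in practice is not reproduced), and at each step the next subset consists of the m + 1 units with the smallest squared residuals from the current fit. The name `forward_search_coefficients` is hypothetical.

```python
import numpy as np

def forward_search_coefficients(X, y, init_subset):
    """Record least-squares estimates along a simplified forward search.

    Returns two dicts keyed by subset size m: the coefficient vector
    fitted to the subset, and the indices of the units in that subset.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    subset = np.array(sorted(init_subset), dtype=int)
    path, subsets = {}, {}
    while True:
        m = subset.size
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        path[m] = beta
        subsets[m] = subset.copy()
        if m == n:
            break
        # Squared residuals of all n units from the fit to the subset.
        sq_res = (y - X @ beta) ** 2
        # Next subset: the m + 1 units closest to the current fit.
        subset = np.sort(np.argsort(sq_res, kind="stable")[: m + 1])
    return path, subsets

# Hypothetical data: one regressor plus a constant in the last column,
# with three units lying well below the line at large x.
x = np.arange(30, dtype=float)
X_demo = np.column_stack([x, np.ones(30)])
y_demo = 1.0 + 0.5 * x
y_demo[27:] -= 8.0

path, subsets = forward_search_coefficients(X_demo, y_demo, [0, 1, 2, 3])
slopes = [path[m][0] for m in sorted(path)]
```

Plotting `slopes` against the subset size gives a forward plot of the slope estimate. In this artificial example the low-lying units enter last, so the estimate stays stable for most of the search and drops once they are included.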