Data Analyses with S, LTS and LMS Routines

Author: Francesca Torti - Marco Riani - Anthony C. Atkinson - Domenico Perrotta - Aldo Corbellini
Profession: European Commission, Joint Research Centre (JRC) - University of Parma, Italy - London School of Economics, UK - European Commission, Joint Research Centre (JRC) - University of Parma, Italy
Pages: 24-25
[Scatterplot of the “Books” dataset: horizontal axis QUANTITY (0 to 8000), vertical axis VALUE (0 to 6, scale factor 10^4).]
Figure 15: A rather complex trade dataset. Following Perrotta and Torti (2018), we analyze the subset of retained
units: “Printed books, brochures, leaflets and similar printed matter” (723 units).
12. Data Analyses with S, LTS and LMS Routines
This section details a new SAS function that monitors a number of traditional robust multivariate and
regression estimators for various choices of breakdown or efficiency. Here we provide examples of the
analysis of regression data with LMS and LTS as well as using S-estimation. The full list of possibilities is
presented at the end of §7.
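For reference, the three criteria can be written in their standard textbook form. The notation below (residuals r_i(β), ordered squared residuals, trimming count h, function ρ and constant K) is assumed here rather than taken from this section, and the link between h and bdp is one common convention among several.

```latex
% Residuals of a candidate fit: r_i(\beta) = y_i - x_i^\top \beta, \quad i = 1, \dots, n.
% Ordered squared residuals: r^2_{(1)}(\beta) \le \dots \le r^2_{(n)}(\beta).
\begin{align*}
\hat\beta_{\mathrm{LMS}} &= \arg\min_{\beta} \; r^2_{(h)}(\beta), \\
\hat\beta_{\mathrm{LTS}} &= \arg\min_{\beta} \; \sum_{i=1}^{h} r^2_{(i)}(\beta), \\
\hat\beta_{\mathrm{S}}   &= \arg\min_{\beta} \; s(\beta),
  \quad \text{where } s(\beta) \text{ solves } \;
  \frac{1}{n} \sum_{i=1}^{n} \rho\!\left( \frac{r_i(\beta)}{s(\beta)} \right) = K .
\end{align*}
% Assumed convention: h \approx n (1 - \mathrm{bdp}) for LMS and LTS;
% for S-estimation, \mathrm{bdp} = K / \sup \rho when \mathrm{bdp} \le 0.5.
```

Under this convention, h approaches n as bdp decreases towards zero, so the LTS criterion becomes the ordinary least squares criterion.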
The patterns of scaled residuals in the forward plots of Figures 2, 8 and 12 (bottom panels) all show a
stable horizontal structure for the greater part of the search, until outliers start to enter the fit. In introducing
the idea of monitoring regression, Riani et al. (2014a) provided plots from monitoring S-estimation as a
function of the breakdown point (bdp) which had a related structure: changes in the estimate occurred at one
or two values of bdp, between which the pattern of residuals remained similar, but with decreasing magnitude
of the scaled residuals as bdp decreased. The structure for LTS showed the residuals decreasing more rapidly
as a function of bdp, with LMS similar but with appreciably more noise in the curves. Similar structures are
obtained by Perrotta and Torti (2018) in the analysis of simulated data. Changes in the fit from robust to
non-robust allow the determination of the empirical breakdown point and hence the provision of efficient estimates.
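The idea of monitoring a fit as a function of bdp can be sketched in a few lines of Python with NumPy. This is a minimal illustration and not the SAS or FSDA implementation: the function names `lts_fit` and `monitor_lts`, the random-start and concentration-step approximation to LTS, the rule h ≈ n(1 − bdp), the uncorrected scale estimate, and the simulated data (two lines, 10% outliers) are all assumptions made for the example.

```python
import numpy as np

def lts_fit(X, y, bdp, n_starts=50, n_csteps=10, seed=0):
    """Approximate LTS: random elemental starts refined by concentration steps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Assumed convention: keep roughly n * (1 - bdp) observations, at least p + 1.
    h = min(n, max(p + 1, int(round(n * (1.0 - bdp)))))
    best_obj, best_beta = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=p, replace=False)      # elemental subset
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        for _ in range(n_csteps):
            keep = np.argsort((y - X @ beta) ** 2)[:h]  # h smallest squared residuals
            beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()    # trimmed sum of squares
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    scale = np.sqrt(best_obj / h)  # raw scale; no consistency correction applied
    return best_beta, scale

def monitor_lts(X, y, bdps, **kwargs):
    """Refit for each bdp and collect coefficients and scaled residuals."""
    betas, resid = [], []
    for bdp in bdps:
        beta, scale = lts_fit(X, y, bdp, **kwargs)
        betas.append(beta)
        resid.append((y - X @ beta) / scale)
    return np.array(betas), np.array(resid)

# Simulated data (hypothetical, not the trade data): 180 units on one line,
# 20 units on a second, flatter line.
rng = np.random.default_rng(0)
n, n_out = 200, 20
x = rng.uniform(0.0, 10.0, n)
y = 2.0 * x + rng.normal(0.0, 0.5, n)
y[:n_out] = 0.5 * x[:n_out] + rng.normal(0.0, 0.5, n_out)
X = np.column_stack([np.ones(n), x])

bdps = np.linspace(0.5, 0.0, 11)  # from very robust down to least squares
betas, resid = monitor_lts(X, y, bdps)
for bdp, beta in zip(bdps, betas):
    print(f"bdp={bdp:.2f}  slope={beta[1]:.3f}")
```

Each row of `resid` holds the scaled residuals at one bdp, so plotting the columns against `bdps` gives a monitoring plot of the kind described above. With bdp = 0 every observation is retained and the fit coincides with ordinary least squares.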
To illustrate the use of the SAS routines for these regressions, we use a rather more complicated data
set introduced by Perrotta and Torti (2018) in the discussion of Cerioli et al. (2017). The dataset is an
example of trade data from EU customs returns. The response is value and the single explanatory variable is
quantity. If markets are working correctly, there should be a linear relationship between the two. The focus
of the analysis is on the detection of observations with a different relationship between the two variables,
which may be an indicator of fraudulent customs declarations. A seafood example is used by Atkinson et al.
(2018b) in which there are two distinct linear relationships between price and quantity. An extra problem
in monitoring these data sets is the presence of a large number of small transactions, which obscure any
structure in the larger observations, which is where financially important fraud, if any, will occur. Perrotta
and Torti (2018) analyse thinned data sets in which the number of observations is partially reduced, whilst
retaining the overall structure of the data. The data set is “Books”, defined as printed books, brochures,
leaflets and similar printed matter. The scatterplot of the data set is in Figure 15; it contains n = 723 units,
after thinning following the procedure of Cerioli and Perrotta (2014).
The panels of Figure 16 are monitoring plots of scaled residuals for the books data from S, LTS and LMS
estimators from our IML Studio implementation. The shape of the S curves, which is identical to that produced
with FSDA by Perrotta and Torti (2018), and the new LTS and LMS curves show the presence of
structure in the data. A sharp decrease in the residuals occurs below a breakdown value of approximately
8%, which corresponds to the percentage of outliers visible as two dispersed clusters in Figure 15.
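To put the 8% figure in terms of units, the short calculation below converts it to a count for n = 723. The variable names and the convention that a fit with breakdown point bdp leaves n minus that count of units to determine the fit are assumptions for illustration only.

```python
n = 723      # units in the thinned "Books" data (from the text)
bdp = 0.08   # approximate breakdown value at which the residuals change

n_outlying = round(bdp * n)  # units a fit at this bdp can treat as outlying
h = n - n_outlying           # units left to determine the fit (assumed convention)

print(n_outlying, h)  # 58 665
```

So the change in the residual pattern occurs at roughly the point where the fit can no longer set aside about 58 of the 723 units.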
The transition for S-estimation (top panel of Figure 16) from a robust analysis to non-robust
least squares is smooth. The transition for LTS (middle panel of Figure 16) happens at the same
value of bdp, but is appreciably sharper. That for LMS (bottom panel of Figure 16) is less sharp and, as an
indication of a change point, is closer to that for S-estimation.
The structure of these plots can be interpreted by inspection of Figure 15. For fits with a high bdp the