Three Classes of Estimator for Robust Regression

Francesca Torti (European Commission, Joint Research Centre, JRC), Marco Riani (University of Parma, Italy), Anthony C. Atkinson (London School of Economics, UK), Domenico Perrotta (European Commission, Joint Research Centre, JRC), Aldo Corbellini (University of Parma, Italy)
1. Introduction
Automated monitoring of external trade data is a tactical operational activity of the Anti-Fraud Office (OLAF) of the European Commission (EC) in support of its own investigation Directorate and of its partners in the customs administrations of the EU Member States. The Automated Monitoring Tool (AMT) is a sequence of projects financed by administrative agreements between OLAF and the EC Joint Research Centre (JRC) in support of that activity. A key objective of the AMT is the systematic estimation of trade prices and the statistical detection of patterns of anti-fraud relevance in large volumes of trade and other relevant data. The JRC and its academic partners, represented in this report, work together on the development of instruments for this purpose. This report focuses on the robust regression tools that are at the core of the AMT.
The forward search (FS) is a general method of robust data fitting that moves smoothly from very robust to maximum likelihood estimation. The FS procedures are included in the MATLAB toolbox FSDA. The work on a SAS version of the FS, presented in this paper, is in the framework of a European Union programme supporting the Customs Union and Anti-Fraud policies. It originates from the need for the analysis of large data sets expressed by law enforcement services operating in the European Union (the EU anti-fraud office in particular) that are already using our SAS software for detecting data anomalies that may point to fraudulent customs returns. For these services, the library is also accessible through a restricted web platform called Web-Ariadne: https://webariadne.jrc.ec.europa.eu.
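As a rough illustration of the forward-search idea (not the FSDA or SAS implementation, whose starting subset and refinements are more elaborate), the Python sketch below grows the fitting subset one observation at a time by ordinary least squares and records the minimum deletion residual that is later used for outlier detection; the function name and all settings are hypothetical.

```python
import numpy as np

def forward_search(X, y, m0=None, n_start_trials=500, seed=None):
    """Minimal forward-search sketch: grow the fitting subset one unit at a
    time, refitting by OLS, and record the minimum deletion residual among
    the units outside the subset at each step (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape                      # X is assumed to include the intercept column
    m0 = m0 or p + 2

    # Crude robust start: among random size-m0 subsets, keep the one whose
    # fit gives the smallest median squared residual (an LMS-like criterion).
    best_subset, best_crit = None, np.inf
    for _ in range(n_start_trials):
        idx = rng.choice(n, m0, replace=False)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        crit = np.median((y - X @ beta) ** 2)
        if crit < best_crit:
            best_subset, best_crit = idx, crit

    subset = np.array(best_subset)
    min_del_res = []                    # monitored statistic, one value per step
    for m in range(m0, n):
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        res = y - X @ beta
        s2 = np.sum(res[subset] ** 2) / max(m - p, 1)
        outside = np.setdiff1d(np.arange(n), subset)
        # leverage of outside units with respect to the current subset fit
        h = np.einsum('ij,jk,ik->i', X[outside],
                      np.linalg.inv(X[subset].T @ X[subset]), X[outside])
        d = np.abs(res[outside]) / np.sqrt(s2 * (1 + h))
        min_del_res.append(d.min())
        # next subset: the m+1 observations with the smallest squared residuals
        subset = np.argsort(res ** 2)[:m + 1]
    return np.array(min_del_res)
```

Plotting the returned statistic against the subset size gives the forward plot whose exceedances signal outliers, in the spirit of the procedures described in Sections 4 and 5.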
The series of fits provided by the FS is combined with an automatic procedure for outlier detection that leads to an adaptive, data-dependent choice of highly efficient robust estimates. It also allows monitoring of residuals and parameter estimates for fits of differing robustness. Linking plots of such quantities, combined with brushing, provides a set of powerful tools for understanding the properties of data, including anomalous structures and data points. Our SAS package extends these ideas of monitoring to several traditional robust estimators of regression over a range of values of their key parameters (maximum possible breakdown or nominal efficiency). We again obtain data-adaptive values for these parameters and provide a variety of plots linked through brushing. The examples in the paper are for S estimation and for Least Median of Squares (LMS) and Least Trimmed Squares (LTS) regression.
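As a loose Python analogue of this monitoring idea (not our SAS package: S, LMS and LTS fits are not available in standard Python libraries, so a Huber M-estimator from statsmodels stands in), one can refit over a grid of the tuning constant that governs efficiency and track how the scaled residuals evolve; the function name and grid below are assumptions.

```python
import numpy as np
import statsmodels.api as sm

def monitor_huber(X, y, tuning_constants=np.linspace(0.8, 3.0, 12)):
    """Refit a Huber M-estimator over a grid of tuning constants and collect
    scaled residuals and coefficients: small constants give very robust fits,
    large ones approach ordinary least squares."""
    Xc = sm.add_constant(X)
    resid_paths, coef_paths = [], []
    for c in tuning_constants:
        fit = sm.RLM(y, Xc, M=sm.robust.norms.HuberT(t=c)).fit()
        resid_paths.append(fit.resid / fit.scale)   # scaled residuals, length n
        coef_paths.append(fit.params)
    return np.column_stack(resid_paths), np.vstack(coef_paths)

# Plotting each row of the residual matrix against the tuning constant gives a
# monitoring plot: trajectories that stay large for every value of the constant
# flag potential outliers.
```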
In the next section we define three classes of robust estimators (downweighting, hard trimming and adaptive hard trimming), all of which occur in the numerical examples. Algebra for the FS is in Section 3. Sections 4 and 5 describe general procedures for outlier detection and the rule that controls the statistical size of the procedure to allow for testing at each step of the FS. These procedures are illustrated in §6 by the analysis of data on 509 bank customers. The next two sections are specific to our SAS implementation: in §7 we describe the properties of the language that make it suitable for handling large datasets and list the procedures that we have implemented; §8 describes the approximations used to provide fast analyses of large datasets. As Figure 7 shows, there is a considerable advantage in using SAS instead of MATLAB functions for analysing large datasets.
The data analysed in §6 have been transformed to approximate normality by the Box-Cox transformation (Box and Cox, 1964). In §10 we illustrate the use of our SAS routines, including the "fan plot", to monitor a series of robust fits and establish the value of the transformation parameter in the presence of data contamination. Section 11 provides background for soft-trimming estimation, in our case M- and S-estimators. Examples of the use of our SAS routines for S, LMS and LTS regression are in §12 for a trade dataset that has a more complicated structure than the loyalty card data. Section 13 concludes. The supplementary material reported in Annex B illustrates the use of our SAS software by European Union services for anti-fraud purposes.
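For readers unfamiliar with the transformation step, the sketch below illustrates the simple, non-robust choice of the Box-Cox parameter by profile log-likelihood on a synthetic positive sample (the data and grid are invented for illustration; the fan plot of §10 instead monitors a score statistic for the transformation parameter along a series of robust fits).

```python
import numpy as np
from scipy.stats import boxcox_llf

# Synthetic positive response; Box-Cox requires y > 0.
y = np.random.default_rng(0).lognormal(mean=1.0, sigma=0.4, size=200)

# Profile log-likelihood of the Box-Cox parameter on a grid of lambda values.
lambdas = np.linspace(-1.0, 2.0, 61)
llf = [boxcox_llf(lmb, y) for lmb in lambdas]
lambda_hat = lambdas[int(np.argmax(llf))]
print(f"Box-Cox lambda maximising the log-likelihood: {lambda_hat:.2f}")
```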
2. Three Classes of Estimator for Robust Regression
We work with the customary null regression model in which the $n$ univariate response variables $y_i$ are related to the values of a set of $p$ explanatory variables $x$ by the relationship
$$y_i = \beta^{T} x_i + \epsilon_i, \qquad i = 1, \dots, n, \qquad (1)$$
including an intercept term. The independent errors $\epsilon_i$ have constant variance $\sigma^2$. The purpose of robust estimation is to find good estimators of the parameters when there are departures from this model, typically caused by outliers, which should also be identified.
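To make the setting concrete, the following sketch (with invented parameter values) generates data from model (1), contaminates a small fraction of the responses and shows the resulting distortion of the ordinary least-squares estimate, which is the situation the robust estimators considered below are designed to handle.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
# Design matrix with an intercept column and two covariates.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([2.0, 1.0, -0.5])
sigma = 0.5

# Null model (1): y_i = beta' x_i + eps_i with constant error variance.
y = X @ beta + rng.normal(scale=sigma, size=n)
y[:10] += 8.0                          # contaminate 5% of the responses

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("OLS estimate pulled away from beta by the outliers:", beta_ols.round(2))
```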
It is helpful to divide methods of robust regression into three classes.
1. Hard (0,1) Trimming. In Least Trimmed Squares (LTS: Hampel, 1975; Rousseeuw, 1984) the amount of trimming is determined by the choice of the trimming parameter $h$, $[n/2] + [(p+1)/2] \le h \le n$,