Author | Francesca Torti, Marco Riani, Anthony C. Atkinson, Domenico Perrotta, Aldo Corbellini |
Affiliation | European Commission, Joint Research Centre (JRC); University of Parma, Italy; London School of Economics, UK; European Commission, Joint Research Centre (JRC); University of Parma, Italy |
Pages | 7 |
1. Introduction
Automated monitoring of external trade data is a tactical operational activity of the Anti-Fraud Office
(OLAF) of the European Commission (EC), in support of its own investigation Directorate and of its partners
in the customs administrations of the EU Member States. The Automated Monitoring Tool (AMT) is a sequence of projects
financed by administrative agreements between OLAF and the EC Joint Research Centre in support of that
activity. A key objective of AMT is the systematic estimation of trade prices and statistical detection of
patterns of anti-fraud relevance in large volumes of trade and other relevant data. The JRC and its academic
partners, represented in this report, work together on the development of instruments for this purpose. This
report focuses on robust regression tools that are at the core of the AMT.
The forward search (FS) is a general method of robust data fitting that moves smoothly from very
robust to maximum likelihood estimation. The FS procedures are included in the MATLAB toolbox FSDA.
The work on a SAS version of the FS, presented in this paper, is in the framework of a European Union
program supporting the Customs Union and Anti-Fraud policies. It originates from the need for the analysis
of large data sets expressed by law enforcement services operating in the European Union (the EU anti-fraud
office in particular) that are already using our SAS software for detecting data anomalies that may point to
fraudulent customs returns. For them, the library is also accessible through a restricted web platform called
Web-Ariadne: https://webariadne.jrc.ec.europa.eu.
The series of fits provided by the FS is combined with an automatic procedure for outlier detection that
leads to the adaptive data-dependent choice of highly efficient robust estimates. It also allows monitoring of
residuals and parameter estimates for fits of differing robustness. Linking plots of such quantities, combined
with brushing, provides a set of powerful tools for understanding the properties of data including anomalous
structures and data points. Our SAS package extends these ideas of monitoring to several traditional robust
estimators of regression for a range of values of their key parameters (maximum possible breakdown or
nominal efficiency). We again obtain data adaptive values for these parameters and provide a variety of
plots linked through brushing. Examples in the paper are for S estimation and for Least Median of Squares
(LMS) and Least Trimmed Squares (LTS) regression.
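To fix ideas, the forward search can be sketched in a few lines. The following is a deliberately simplified, hypothetical illustration of the general principle (start from a small clean subset, refit, and grow the subset by the best-fitting unit while monitoring residuals of the units left outside); it is not the FSDA or SAS implementation, and the initialization and monitored statistic are crude stand-ins for the ones those packages use.

```python
import numpy as np

def forward_search(X, y, m0=None):
    """Conceptual forward-search sketch (illustrative only, not the FSDA algorithm).

    At each step, fit least squares on the current subset, then enlarge the
    subset with the units having the smallest squared residuals.  The trace
    records the smallest absolute residual among the units still outside the
    subset; outliers entering late produce a visible jump in this trace.
    """
    n, p = X.shape
    if m0 is None:
        m0 = p + 1
    # crude initial subset: units whose response is closest to the median
    # (real implementations use a robust fit such as LMS to initialize)
    subset = np.argsort(np.abs(y - np.median(y)))[:m0]
    trace = []
    beta = None
    for m in range(m0, n):
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        resid2 = (y - X @ beta) ** 2
        order = np.argsort(resid2)
        subset = order[: m + 1]          # grow the subset by one unit
        outside = order[m + 1:]          # units not yet in the subset
        if outside.size:
            trace.append(np.sqrt(resid2[outside].min()))
    return beta, trace

# demo on simulated data following a linear model, with 5 shifted units
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)
y[:5] += 10.0                            # contaminated units enter last
beta_hat, trace = forward_search(X, y)
```

In a monitoring analysis, it is the whole trajectory of `trace` (and of the parameter estimates along the search) that is plotted and brushed; the sharp rise at the end of the search, when only the contaminated units remain outside the subset, is what signals the outliers.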
In the next section we define three classes of robust estimators (downweighting, hard trimming and
adaptive hard trimming), all of which occur in the numerical examples. Algebra for the FS is in Section 3.
Sections 4 and 5 describe general procedures for outlier detection and the rule to control the statistical
size of the procedure to allow for testing at each step of the FS. These procedures are illustrated in §6 by
analysis of data on 509 bank customers. The next two sections are specific to our SAS implementation: in
§7 we describe the properties of the language that make it suitable for handling large datasets and list the
procedures that we have implemented; §8 describes the approximations used to provide fast analyses of large
datasets. As Figure 7 shows, there is a considerable advantage in using SAS instead of MATLAB functions
for analysing large datasets.
The data analysed in §6 have been transformed to approximate normality by the Box-Cox transformation
(Box and Cox, 1964). In §10 we illustrate the use of our SAS routines, including the "fan plot", to monitor
a series of robust fits and establish the value of the transformation parameter in the presence of data
contamination. Section 11 provides background for soft-trimming estimation, in our case M- and S-
estimators. Examples of the use of our SAS routines for S, LMS and LTS regression are in §12 for a trade
dataset that has a more complicated structure than the loyalty card data. Section 13 concludes. The
supplementary material reported in Annex B illustrates the use of our SAS software by European Union
services, for anti-fraud purposes.
2. Three Classes of Estimator for Robust Regression
We work with the customary null regression model in which the n univariate response variables y_i are related
to the values of a set of p explanatory variables x_i by the relationship

    y_i = β⊤ x_i + ε_i,   i = 1, . . . , n,   (1)

including an intercept term. The independent errors ε_i have constant variance σ². The purpose of robust
estimation is to find good estimators of the parameters when there are departures from this model, typically
caused by outliers, which should also be identified.
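To make the motivation concrete, a small hypothetical simulation (values and variable names our own, not from the paper) shows how a handful of gross outliers distorts the non-robust least-squares estimate of β in model (1); this sensitivity is exactly what the robust estimators below are designed to resist.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# design matrix with an intercept column, as model (1) assumes
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([5.0, -1.5])
# y_i = beta' x_i + eps_i with constant error variance
y = X @ beta + rng.normal(scale=0.5, size=n)

# ordinary least squares recovers beta well under the null model
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# shifting just 10 of the 200 responses drags the OLS fit away from beta
y_cont = y.copy()
y_cont[:10] += 50.0
beta_cont, *_ = np.linalg.lstsq(X, y_cont, rcond=None)
```

Here 5% contamination moves the intercept estimate by roughly 50 × 10/200 = 2.5 units, two orders of magnitude more than the sampling error on clean data; robust estimators aim to both resist this pull and flag the ten aberrant units.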
It is helpful to divide methods of robust regression into three classes.
1. Hard (0,1) Trimming. In Least Trimmed Squares (LTS: Hampel, 1975; Rousseeuw, 1984) the amount
of trimming is determined by the choice of the trimming parameter h, [n/2] + [(p + 1)/2] ≤ h ≤ n,