The FS batch procedure

Francesca Torti (European Commission, Joint Research Centre (JRC)), Marco Riani (University of Parma, Italy), Anthony C. Atkinson (London School of Economics, UK), Domenico Perrotta (European Commission, Joint Research Centre (JRC)), Aldo Corbellini (University of Parma, Italy)
Figure 5: Two artificial datasets generated with MixSim for the assessment.
8. The FS batch procedure
Our SAS library contains a new Forward Search strategy that makes it more practical to treat large datasets. The idea is to reduce the size of the output tables and the amount of memory required through a batch updating procedure.
The standard FS algorithm in §3 produces a sequence of n − m0 subsets with corresponding model parameters and relevant test statistics, used typically to test the presence of outliers. The initial subset size m0 can be as small as p, the minimum number of observations necessary to provide a fit to the data. In the standard algorithm the subset size m0 ≤ m ≤ n is increased by one unit at a time and only the smallest value of the test statistic among the observations outside the subset is retained. The batch version of the algorithm fits instead only one subset every k > 1 steps. The value of k is set by the user through the input parameter fs_steps. The number of subsets to be evaluated therefore reduces to (n − k)/k. For each subset and set of estimated model parameters, the k smallest values of the test statistic are retained: they are assigned to the current step and to the preceding k − 1, in order to obtain the complete vector of minimum test statistics of Eq. (4) to compare with the envelopes. Of course this vector is an approximation to the real one, which would be found by evaluating each of the k steps individually; the approximation is the cost of reducing the number of fits to (n − k)/k while still applying the signal detection, signal validation and envelope superimposition phases described in §5 at each of the n − m0 FS steps.
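To make the batch updating scheme concrete, the following is a minimal MATLAB sketch of the idea; it is not the code of the SAS library or of FSDA. The test statistic is simplified to the absolute scaled residual of the units outside the subset, the initial subset comes from an ordinary least-squares fit, the function name batchFS and its arguments are illustrative, and the k retained statistics are attributed to the fitted step and the k − 1 steps that follow it, a bookkeeping detail that differs slightly from the description above.

    % Minimal sketch of the batch Forward Search updating scheme (illustrative
    % only; not the SAS library or FSDA code). The test statistic is simplified
    % to the absolute scaled residual of units outside the current subset.
    function mdr = batchFS(y, X, m0, k)
    % y   response vector (n x 1)
    % X   design matrix (n x p)
    % m0  initial subset size (>= p)
    % k   batch step: the model is refitted only once every k steps
    %     (k = 1 corresponds to the standard one-step-at-a-time search)
    n   = size(X, 1);
    p   = size(X, 2);
    mdr = NaN(n, 1);                        % minimum test statistic at each step

    % crude initial subset: the m0 units with smallest absolute LS residuals
    [~, ord] = sort(abs(y - X * (X \ y)));
    subset   = ord(1:m0);

    m = m0;
    while m < n
        kk = min(k, n - m);                 % size of the current batch of steps
        b  = X(subset, :) \ y(subset);      % one fit for the whole batch
        e  = abs(y - X * b);
        s  = sqrt(sum(e(subset).^2) / (m - p));
        outside = setdiff((1:n)', subset);
        r  = sort(e(outside) / s);          % scaled residuals outside the subset

        % retain the kk smallest statistics and spread them over the kk steps
        % covered by this single fit (the batch approximation)
        mdr(m : m + kk - 1) = r(1:kk);

        % advance the search by kk units: keep the m + kk units closest to the fit
        [~, ordAll] = sort(e);
        subset = ordAll(1 : m + kk);
        m = m + kk;
    end
    mdr = mdr(m0 : n - 1);                  % the n - m0 values compared with the envelopes
    end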
If the data are contaminated and k is too large, this approach may not be accurate enough to detect the outliers, giving rise to biased estimates. The problem can be appraised by monitoring the statistical properties of the batch algorithm for increasing k. We have conducted such an exploratory assessment using artificial data.
We generated the data using MixSim (Maitra and Melnykov, 2010) in the MATLAB implementation of
the FSDA toolbox (Torti et al., 2018, Section 3); the functions used are MixSimreg.m and simdataset.m.
MixSim allows generation of data from a mixture of linear models on the basis of an average overlap measure ω̄ pre-specified by the user. We generated a dominant linear component containing 95% of the data and a 5% “contaminating” one with small average overlap (ω̄ = 0.01). The generating regression model is without intercept, with random slopes from a Uniform distribution between tan(π/6) = √3/3 and tan(π/3) = √3, and independent variables from a Uniform distribution in the interval [0, 1]. Each slope is equally likely to be that of the dominant component. We took the error variances in the two components to be equal, so that specification of the value of ω̄, together with the values of the slopes, defines the error variance for each sample. We also added uniform contamination, amounting to 3% of the above data, over the rectangle defined by the two slopes and the range of the two independent variables. The plots in Figure 5 are examples of two datasets with 4750 + 250 + 150 units.
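For readers who want to reproduce a dataset with this structure without MixSim, the plain MATLAB sketch below generates data with the same layout. It is not the MixSimreg.m / simdataset.m call used in the paper: sigma is a placeholder for the common error standard deviation, which in our simulations is determined by MixSim from the pre-specified overlap ω̄ = 0.01, and the uniform contamination is spread over one possible reading of the rectangle described above.

    % Plain MATLAB sketch of the simulation design (not the MixSimreg.m /
    % simdataset.m calls used in the paper). sigma is a placeholder for the
    % common error standard deviation, which MixSim would fix from the
    % pre-specified average overlap omega_bar = 0.01.
    rng(1);                                    % reproducibility
    n1 = 4750; n2 = 250; n3 = 150;             % dominant, overlapping and uniform units

    % two random slopes, uniform between tan(pi/6) = sqrt(3)/3 and tan(pi/3) = sqrt(3);
    % which of the two belongs to the dominant component is equally likely
    slopes = tan(pi/6) + (tan(pi/3) - tan(pi/6)) * rand(1, 2);
    sigma  = 0.05;                             % placeholder: equal in the two components

    x1 = rand(n1, 1); y1 = slopes(1) * x1 + sigma * randn(n1, 1);  % dominant (95%)
    x2 = rand(n2, 1); y2 = slopes(2) * x2 + sigma * randn(n2, 1);  % contaminating (5%)

    % 3% additional contamination, uniform over the rectangle spanned by the
    % range [0, 1] of x and the y-range reached by the two regression lines
    x3 = rand(n3, 1);
    y3 = max(slopes) * rand(n3, 1);

    x = [x1; x2; x3]; y = [y1; y2; y3];        % 4750 + 250 + 150 = 5150 units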
The boxplots of Figure 6 show the bias for the slope and intercept obtained from 500 such datasets with 5,150 observations each, for k ∈ {1, 5, 10, 15, 20, 40, 60, 80, 100}. The bias here is simply the difference between the estimated and true parameter values, the latter referring to the dominant generating component.
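A possible outline of the assessment loop is sketched below. Both generateMixture and fitBatchFS are hypothetical helpers, not functions of the SAS library or of FSDA: the first stands for the MixSim-based data generation described above and also returns the true slope of the dominant component, the second runs the batch search with step k, removes the flagged outliers and returns the refitted intercept and slope.

    % Outline of the bias assessment. generateMixture and fitBatchFS are
    % hypothetical helpers: the first generates one contaminated dataset as
    % described above and returns the true dominant slope, the second runs the
    % batch search with step k, drops flagged outliers and refits the model.
    ks   = [1 5 10 15 20 40 60 80 100];
    nrep = 500;
    biasSlope     = NaN(nrep, numel(ks));
    biasIntercept = NaN(nrep, numel(ks));

    for j = 1:numel(ks)
        for r = 1:nrep
            [y, x, trueSlope] = generateMixture();            % hypothetical generator
            bhat = fitBatchFS(y, [ones(size(x)) x], ks(j));   % hypothetical: [intercept; slope]
            biasIntercept(r, j) = bhat(1);                    % true intercept is zero
            biasSlope(r, j)     = bhat(2) - trueSlope;
        end
    end

    % one box per value of k, as in the upper panel of Figure 6
    boxplot(biasSlope);
    set(gca, 'XTickLabel', ks);
    xlabel('batch step k'); ylabel('slope bias');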
The upper panel of the figure shows that the median bias for both the slopes and intercepts is virtually zero. The dispersion of the estimates of both slopes and intercepts remains stable and quite small even for values of k approaching 100 (note that the boxplot whiskers lie in [−0.01, 0.01]). However, the variability of the estimates outside the whiskers increases rapidly for k = 100. The fact that the bottom and top edges