Analyzing Censored Data
450b
011

Goals
- Examine several different methods for dealing with censored data:
    - replace values below the detection limit (DL) with a specific value
    - adjust estimators of parameters
    - impute replacement values
    - use nonparametric methods
- Consider when each method might be most appropriate

Dealing with Censored Data
- The way censored data are addressed depends on the objectives of the analysis:
    - hypothesis test or estimation?
    - parametric or nonparametric methods?
    - relationship between detection limits and concentration levels of concern
- The method for addressing censored data also depends on the conceptual site model:
    - chemicals expected at concentrations lower than the detection limits?
    - naturally occurring chemicals?

Start with Summarizing Data
- As with any statistical analysis, it is important to begin by looking closely at the data
- Summary statistics may help identify interesting aspects of the data
    - range, median, mean, variance, etc.
    - for both censored and uncensored data
- Graphical presentations often show other characteristics of the data
    - histograms, probability density plots, box plots, normal quantile plots, etc.

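One way to follow this advice is to summarize the detected values and the reported detection limits separately, before any substitution is made. The sketch below assumes each result is stored as a (value, is_nondetect) pair; the data and the function name `summarize` are hypothetical and only illustrate the bookkeeping.

```python
from statistics import mean, median

# Hypothetical lab results: (reported value in mg/kg, True if reported as "< value")
results = [(1.2, False), (0.5, True), (3.4, False),
           (1.0, True), (2.8, False), (0.5, True)]

def summarize(results):
    """Summarize detects and non-detects separately, without changing any value."""
    detects = sorted(v for v, nd in results if not nd)
    limits = sorted(v for v, nd in results if nd)  # detection limits of the non-detects
    detect_stats = None
    if detects:
        detect_stats = {"n": len(detects), "min": detects[0],
                        "median": median(detects), "mean": mean(detects),
                        "max": detects[-1]}
    nondetect_stats = None
    if limits:
        nondetect_stats = {"n": len(limits), "min_dl": limits[0], "max_dl": limits[-1]}
    return {"n": len(results), "detects": detect_stats, "nondetects": nondetect_stats}

summary = summarize(results)
print(summary)
```

Keeping the two groups apart preserves all of the reported information, including the range of detection limits.
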
Arsenic Example
- Future land use at a contaminated site in Nevada is residential
- Contamination levels are low across much of the several hundred acres, but some "hot spots" of concern (several acres) exist
- Arsenic levels are a concern, and background data were collected for arsenic
- Multiple detection limits are encountered when analyzing the data

Arsenic Example: Summary Statistics
- Detected concentrations exist that are less than detection limits

                  Non-detects            Detects
No. of samples    No.   Min.   Max.      No.   Min.   Med.   Ave.   Max.
102               24    1.25   7.9       78    1.7    6.5    7.3    31

[Figure: arsenic data, number of non-detects = 24. Panels: Histogram (frequency vs. conc., mg/kg); Boxplot w/ Data (x = non-detect, o = detect); Density Estimate (density vs. conc., mg/kg); Normal Probability Plot (conc., mg/kg vs. standard quantiles)]

Arsenic Example: Observations
- The bowing of the quantile plot indicates a fair degree of skewness
- The bumps in the curve may be random, or possibly symptomatic of something else
- The detection limits vary over a large range
- Working with the scientists on the team, it became apparent that the data could also have been summarized as follows:

[Figure: arsenic data plotted separately for Source 1, Source 2, and Source 3, each on an axis from 0 to 30; x = non-detect, o = detect]

Arsenic Example: Data Re-evaluated
- There are 3 different sources for the data
    - different geologies
    - different analytical methods
- Different detection limits were encountered for each method
- The data require closer scrutiny to explain some of the unresolved issues

                     Non-detects          Detects
Data source    n     n    Min.   Max.     n    Min.   Med.   Ave.   Max.
1              28    19   5.8    7.9      9    6.5    7.2    7.72   9.4
2              47    3    5      5        44   4.4    8.1    9.64   31
3              27    2    1.25   1.25     25   1.7    3.3    3.15   4.2

Arsenic Example: Lessons Learned
- Summary statistics and graphics can be broken out for detects and non-detects without losing any information from the data
- Substitution methods are not always needed
- Without needing to change the values of the non-detects, it is clear from the summary statistics and the plots that background is not adequately characterized
- A new background study has been ordered

Dealing with Censored Data
- If estimation or hypothesis testing is desired or required, then some manipulation of the non-detects is needed
- EPA QA/G-9 guidance provides some simple recommendations for handling censored data

Percentage of Nondetects    Statistical Analysis Method
< 15%                       Replace nondetects with DL/2, DL, or a very small number
15% - 50%                   Use trimmed mean, Cohen's adjustment, or Winsorized mean and standard deviation

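The table above amounts to a lookup on the percentage of nondetects. A minimal sketch follows; the function name `g9_recommendation` is hypothetical, and cases above 50% are simply reported as not covered because the table shown here gives no recommendation for them.

```python
def g9_recommendation(n_nondetects, n_total):
    """Return the recommendation from the table above for a given data set."""
    pct = 100.0 * n_nondetects / n_total
    if pct < 15:
        return "replace nondetects with DL/2, DL, or a very small number"
    if pct <= 50:
        return "trimmed mean, Cohen's adjustment, or Winsorized mean and standard deviation"
    return "not covered by the table shown here"

# Example: a data set with 24 nondetects among 102 samples (about 23.5%)
print(g9_recommendation(24, 102))
```

The lookup is only a starting point; the choice still depends on the objectives of the analysis and the conceptual site model.
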
Dealing with Censored Data
- There isn't a perfect, mathematically correct, logical way to always analyze censored data
- There are, however, many different options, from the simple to the complex, and from the tried and tested to the relatively unproven
- The professional judgement of all of the scientists working on the project needs to be utilized to determine which method seems most appropriate

DL Replacement Methods
- If parametric methods are going to be used for censored data, then rules need to be devised for how the censored data will be treated
- Let's look at the impact of three very simple methods that are commonly used for assigning values to the undetected data by considering an example

DL Replacement Methods: Sample Data Set #1
- Suppose we received the following data from our analytical laboratory

1.175     <0.500    33.144    <1.000    <1.000
6.266     <1.000    <0.250    16.019    4.696
3.361     2.971     <0.500    13.521    <0.250
<0.500    3.982     2.288     2.242     <0.250

[Figure: Sample Data Set #1, number of non-detects = 9. Panels: Histogram (frequency vs. conc., mg/kg); Boxplot w/ Data (x = non-detect, o = detect); Density Estimate (density vs. conc., mg/kg); Normal Probability Plot (conc., mg/kg vs. standard quantiles)]

DL Replacement Methods - DL
- The first commonly used method is to assign each datum below the detection limit the value of the detection limit (DL)

    X_ND = DL

- What are the ramifications of this in terms of the subsequent analysis?

DL Replacement Methods: Sample Data Set #1
- It will result in an overestimate of the mean
- It will (usually) result in an underestimate of the variability

            Complete Data    Censored Data, Assigned Detection Limits
Mean        4.643            4.746
Variance    64.090           63.180

DL Replacement Methods - 1/2 DL
- The second method frequently used for censored data is to assign each datum below the detection limit the value equal to half the detection limit

    X_ND = 1/2 DL

- What are the ramifications of this in terms of the subsequent analysis?

DL Replacement Methods: Sample Data Set #1
- The impact on statistical parameters such as the mean and the variance is not clear

            Complete Data    Censored Data, Assigned 1/2 Detection Limits
Mean        4.643            4.615
Variance    64.090           64.318

DL Replacement Methods - 0
- The third method is to assign each datum below the detection limit the value of 0

    X_ND = 0

- How will this substitution affect subsequent analysis?

DL Replacement Methods: Sample Data Set #1
- It will result in an underestimate of the mean
- The effect on variance in general is not clear
- Here we see that the variance has been overestimated

            Complete Data    Censored Data, Assigned Zeros
Mean        4.643            4.483
Variance    64.090           65.523

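The three substitutions can be compared directly on Sample Data Set #1. The sketch below assumes each result is stored as a (value, is_nondetect) pair, with the detection limit as the value for non-detects, and uses the sample variance (divisor n - 1); the helper name `substitute` is hypothetical. The means match the tabulated values, and the variances agree to within rounding of the reported data.

```python
from statistics import mean, variance

# Sample Data Set #1 as reported: (value, True if reported as "< value")
data_set_1 = [
    (1.175, False), (0.500, True), (33.144, False), (1.000, True), (1.000, True),
    (6.266, False), (1.000, True), (0.250, True), (16.019, False), (4.696, False),
    (3.361, False), (2.971, False), (0.500, True), (13.521, False), (0.250, True),
    (0.500, True), (3.982, False), (2.288, False), (2.242, False), (0.250, True),
]

def substitute(data, fraction):
    """Replace each non-detect with fraction * DL (1.0 = DL, 0.5 = 1/2 DL, 0.0 = zero)."""
    return [value * fraction if is_nd else value for value, is_nd in data]

for label, fraction in [("DL", 1.0), ("1/2 DL", 0.5), ("zero", 0.0)]:
    x = substitute(data_set_1, fraction)
    print(f"{label:7s} mean = {mean(x):.3f}  variance = {variance(x):.3f}")
```

Because the same fraction is applied to every non-detect, the approach handles multiple detection limits without any extra work.
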
Aitchison's Method
- Aitchison's method, as presented in EPA QA/G-9, also assigns each datum below the detection limit the value of 0
- Aitchison's method is presented with several complicated-looking formulas...

Aitchison's Formulas
- Let X_1, X_2, ..., X_m, ..., X_n represent the data
- The first m values are above the detection limit (DL) and the remaining (n - m) data points are below the DL

\[ \bar{X}_d = \frac{1}{m} \sum_{i=1}^{m} X_i \]

\[ s_d^2 = \frac{\sum_{i=1}^{m} X_i^2 - \frac{1}{m} \left( \sum_{i=1}^{m} X_i \right)^2}{m - 1} \]

\[ \bar{X} = \frac{m}{n} \bar{X}_d \]

\[ s^2 = \frac{m - 1}{n - 1} s_d^2 + \frac{m (n - m)}{n (n - 1)} \bar{X}_d^2 \]

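A quick numerical check suggests that these formulas give the same mean and variance as substituting zeros for the non-detects. The sketch below uses the detected values of Sample Data Set #1 (m = 11 of n = 20); the variable names are illustrative.

```python
from statistics import mean, variance

# Detected values from Sample Data Set #1; the other 9 of the 20 results were non-detects
detects = [1.175, 33.144, 6.266, 16.019, 4.696, 3.361,
           2.971, 13.521, 3.982, 2.288, 2.242]
n = 20
m = len(detects)

xbar_d = mean(detects)    # mean of the detects
s2_d = variance(detects)  # sample variance of the detects (divisor m - 1)

# Adjusted estimates from the formulas above
xbar = (m / n) * xbar_d
s2 = (m - 1) / (n - 1) * s2_d + m * (n - m) / (n * (n - 1)) * xbar_d ** 2

# Direct zero substitution for comparison
zeros = detects + [0.0] * (n - m)
print(xbar, mean(zeros))
print(s2, variance(zeros))
```
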
DL Replacement Methods
- The data we just looked at were skewed and the detected values were large relative to the detection limits
- What happens when the data are symmetric and the detected values are relatively close to the detection limits?

DL Replacement Methods: Sample Data Set #2
- Suppose we received these data from our analytical laboratory

1.752     <1.000    1.418     1.477     <1.000
<1.000    1.289     1.498     <1.000    <1.000
1.327     <1.000    1.060     <1.000    <1.000
<1.000    1.045     <1.000    1.563     1.148

DL Replacement Methods: Sample Data Set #2
- Here we see that the impacts of different substitution schemes are greater with smaller values and fewer outliers
- These data sets illustrate the varying impacts DL replacement methods have on parameter estimates

            Complete Data    DL       1/2 DL    Zero
Mean        1.055            1.179    0.929     0.679
Variance    0.138            0.058    0.218     0.510

DL Replacement Methods
- The main advantage of replacing each datum below the detection limit with the same value is simplicity
- The substitutions are quite simple to perform, can accommodate multiple detection limits, and are easy to deal with in terms of the subsequent statistical analyses
- Choice of substitution value (DL, 1/2 DL, or 0) should be based on the site conceptual model and the objective of the analysis

DL Replacement Methods
- The main disadvantages are that these methods are crude and misrepresent both the average and the variability of the sample results (sometimes in unexpected ways)
- Depending on the situation, one needs to decide whether these simple substitutions will provide an adequate representation of the data for the task at hand

Estimator Adjustment
- Cohen's method provides adjusted estimates of the sample mean and standard deviation
- Estimates are based on the statistical technique of maximum likelihood estimation that accounts for censored data
- The adjusted mean and standard deviation can be used in parametric tests
- If less than 50% of the data are detected, Cohen's method should not be used

Cohen's Method
- This method requires that the data without the nondetects be normally distributed
- This method also assumes that there is only one detection limit in the data
- These are both significant limitations for the use of this method
- Cohen's method is described in EPA QA/G-9

Cohen's Method
- Let X_1, X_2, ..., X_m, ..., X_n represent the n data points
- The first m values represent the data points above the detection limit (DL) and the remaining (n - m) data points are below the DL

Cohen's Method
- Compute the sample mean, \bar{X}_d, from the data above the detection limit

\[ \bar{X}_d = \frac{1}{m} \sum_{i=1}^{m} X_i \]

- Compute the sample variance, s_d^2, from the data above the detection limit

\[ s_d^2 = \frac{\sum_{i=1}^{m} X_i^2 - \frac{1}{m} \left( \sum_{i=1}^{m} X_i \right)^2}{m - 1} \]

Cohen's Method
- Compute the following:

\[ h = \frac{n - m}{n} \qquad \text{and} \qquad \gamma = \frac{s_d^2}{\left( \bar{X}_d - DL \right)^2} \]

- Use a table to find \hat{\lambda} from h and \gamma

Cohen's Method
- Estimate the corrected sample mean, \bar{X}, and sample variance, s^2, to account for the data below the detection limit, as follows

\[ \bar{X} = \bar{X}_d - \hat{\lambda} \left( \bar{X}_d - DL \right) \]

\[ s^2 = s_d^2 + \hat{\lambda} \left( \bar{X}_d - DL \right)^2 \]

- These adjusted statistics can be used in further statistical analyses of the data

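The steps of Cohen's method can be sketched in code for Sample Data Set #2 (single DL of 1.000, 10 detects out of 20). The value of the table factor (written `lam` here) still has to be looked up from h and gamma in a published table such as the one in EPA QA/G-9; it is not computed below, and the function name `cohen_adjust` is hypothetical.

```python
from statistics import mean, variance

# Detected values from Sample Data Set #2; the other 10 of the 20 results were "< 1.000"
detects = [1.752, 1.418, 1.477, 1.289, 1.498, 1.327, 1.060, 1.045, 1.563, 1.148]
n = 20
dl = 1.0
m = len(detects)

xbar_d = mean(detects)    # mean of the detects
s2_d = variance(detects)  # sample variance of the detects (divisor m - 1)

h = (n - m) / n                      # proportion of non-detects
gamma = s2_d / (xbar_d - dl) ** 2    # second table argument

def cohen_adjust(xbar_d, s2_d, dl, lam):
    """Apply the correction formulas, given the table factor lam for (h, gamma)."""
    xbar = xbar_d - lam * (xbar_d - dl)
    s2 = s2_d + lam * (xbar_d - dl) ** 2
    return xbar, s2

print(f"h = {h:.2f}, gamma = {gamma:.3f}")
```

With the looked-up factor in hand, `cohen_adjust(xbar_d, s2_d, dl, lam)` returns the corrected mean and variance.
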
Cohen's Method on Sample Data Set #2
- Cohen's method applied to Data Set #2 provides the following results
- These are more reasonable estimates than those provided by the simple substitution methods above, even though 50% were non-detects

            Complete Data    Cohen's method
Mean        1.055            1.015
Variance    0.138            0.17

Trimmed Mean
- Trimming removes data in both tails of a data set to estimate a mean concentration
- This method comes from the field of robust statistics, where the intent is to remove the effect of outliers; its applicability to censored data is questionable
- For censored data, remove the non-detects and an equivalent number of the greatest concentrations, then recalculate the mean

Winsorized Mean and Variance
- Winsorizing replaces data in the tails of a data set with the next most extreme data value
- This is performed in both tails of the data (e.g., if m non-detects are replaced by the next highest value, then the m highest values are replaced by the next lowest value)
- Winsorizing creates adjusted estimates of the sample mean and variance accounting for non-detects

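Trimming and Winsorizing for censored data can be sketched as follows, using Sample Data Set #1 (11 detects, 9 non-detects). The function names are hypothetical, and any further adjustment that guidance may apply to the Winsorized standard deviation is not reproduced here; the sketch only builds the trimmed and Winsorized samples and their means.

```python
from statistics import mean

# Detected values from Sample Data Set #1; the other 9 results were non-detects
detects = [1.175, 33.144, 6.266, 16.019, 4.696, 3.361,
           2.971, 13.521, 3.982, 2.288, 2.242]
n_nd = 9

def retained(detects, n_nd):
    """Detects left after dropping as many of the highest values as there are non-detects."""
    if n_nd >= len(detects):
        raise ValueError("no data remain after trimming both tails")
    ordered = sorted(detects)
    return ordered[:len(ordered) - n_nd]

def trimmed_mean(detects, n_nd):
    """Drop the non-detects and the same number of the highest values, then average."""
    return mean(retained(detects, n_nd))

def winsorized_values(detects, n_nd):
    """Replace non-detects by the lowest retained value and the n_nd highest values
    by the highest retained value."""
    kept = retained(detects, n_nd)
    return [kept[0]] * n_nd + kept + [kept[-1]] * n_nd

print(trimmed_mean(detects, n_nd))
print(mean(winsorized_values(detects, n_nd)))
```

With 9 of 20 results censored, only two of the original values survive in either approach, which illustrates why the applicability of these methods to heavily censored, skewed data is questionable.
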
Other Methods
- Another possible way to deal with the data below detection limits is through the use of imputation
    - jittering
    - distributional
    - regression/correlation
    - expert judgment
- There must be good justification for the imputation method, and documentation should always be provided

Imputing Censored Data
- Expert opinion and regression/correlation approaches can be used when:
    - There is relevant historic knowledge that can be used to build quantitative or qualitative relationships between variables
    - This could include background concentration data, correlations with other analytes, or physical system indicators (e.g., alkalinity impact on alkaline earth metals)

Maximum Likelihood Estimation
- Another approach is to use MLE to estimate parameters of the underlying distribution while accounting for non-detects in the estimation procedure. This method is often used in survival analysis
- Cohen's method is an MLE method that assumes normality and a single detection limit, and assumes that all reported observations are greater than that limit

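As an illustration of MLE with non-detects, the sketch below fits a normal distribution to left-censored data with a simple EM iteration, using only the Python standard library. This is one possible numerical route, not the table-based procedure of Cohen's method; the function name `censored_normal_mle`, the starting values, and the fixed iteration count are assumptions. It accepts a separate detection limit for each non-detect, and the variance it returns is the maximum likelihood estimate (divisor n), so it will be close to, but not identical with, table-based results.

```python
from statistics import NormalDist, mean, pvariance

def censored_normal_mle(detects, limits, n_iter=500):
    """EM estimate of the mean and variance of a normal sample with non-detects.

    detects: detected values; limits: detection limit of each non-detect.
    """
    std = NormalDist()
    n = len(detects) + len(limits)
    start = list(detects) + list(limits)      # start from DL substitution
    mu, var = mean(start), pvariance(start)
    sum_x = sum(detects)
    sum_x2 = sum(x * x for x in detects)
    for _ in range(n_iter):
        sigma = var ** 0.5
        e1 = e2 = 0.0
        for dl in limits:
            z = (dl - mu) / sigma
            r = std.pdf(z) / std.cdf(z)
            e1 += mu - sigma * r                          # E[X | X < DL]
            e2 += mu * mu + var - sigma * r * (dl + mu)   # E[X^2 | X < DL]
        mu = (sum_x + e1) / n
        var = (sum_x2 + e2) / n - mu * mu
    return mu, var

# Sample Data Set #2: 10 detects and 10 non-detects reported as "< 1.000"
detects_2 = [1.752, 1.418, 1.477, 1.289, 1.498, 1.327, 1.060, 1.045, 1.563, 1.148]
mu_hat, var_hat = censored_normal_mle(detects_2, [1.0] * 10)
print(f"mean = {mu_hat:.3f}, variance = {var_hat:.3f}")
```
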
Imputing Censored Data
- Another potential advantage of the MLE method is that it accommodates fitting distributions with a point mass at zero
- Effectively, this assumes that some of the samples might be analyte-free
- These are not used often, but perhaps should be for chemicals that are not present in background
