SUMMARY REPORT

PEER REVIEW OF "STATISTICAL ANALYSIS OF SELENIUM TOXICITY DATA"

Prepared for:

U.S. Environmental Protection Agency
Office of Water
Office of Science and Technology
Health and Ecological Criteria Division
1200 Pennsylvania Ave., NW
Washington, D.C. 20460

Prepared by:

Versar, Inc.
6850 Versar Center
Springfield, Virginia 22151

December 2000
TABLE OF CONTENTS

1.0 INTRODUCTION ............................................................ 1
    1.1 Peer Reviewers ...................................................... 1
    1.2 Peer Review Comment Format .......................................... 2
2.0 CHARGE TO THE PEER REVIEWERS ............................................ 3
    2.1 General Comments .................................................... 7
    2.2 Response to Charge .................................................. 9
    2.3 Specific Comments ................................................... 24
    2.4 Miscellaneous Comments .............................................. 25
    2.5 Additional References Recommended For Inclusion in The Document ..... 25
    2.6 References of Interest .............................................. 25

APPENDIX A - REVIEWER COMMENTS
1.0 INTRODUCTION

EPA's Office of Water/Office of Science and Technology (OST) Health and Ecological Criteria Division (HECD) develops health standards, health criteria, health advisories, and technical guidance documents for water and water-related media.

The purpose of this report is to present peer review comments for the statistical analysis of biological response data from the Monticello Ecological Research Station (MERS) Selenium (Se) Study. The MERS study design is complex, and the statistical analysis required the use of mixed-model ANOVA methods and repeated-measures analysis. As a result, there are statistical issues associated with the choice of statistical models used to account for both fixed and random effects, the methods used to address normality and homogeneity-of-variance issues (including the rank transformation), and the use of repeated-measures analysis to address age and treatment interactions, among other possible issues. Accordingly, peer reviewers with expertise in these areas, as well as in parametric and nonparametric statistical analysis of data from mixed-model and unbalanced experimental designs, were selected to review the document.

Statistical Analysis of Selenium Toxicity Data, dated August 31, 1999, was the subject of external peer review. Peer review is an important component of the scientific process. It provides a focused, objective evaluation of a research proposal, publication, risk assessment, health advisory, guidance, or other document submitted for review. The criticisms, suggestions, and new ideas provided by the peer reviewers ensure objectivity, stimulate creative thought, strengthen the reviewed document, and confer scientific credibility on the product. Comprehensive, objective peer review leads to good science and product acceptance within the scientific community.

1.1 Peer Reviewers

This draft document was reviewed by a panel of three expert peer reviewers: Dr. Dallas E. Johnson, Dr. Kinley Larntz, and Dr. Clifton Sutton. These panelists were selected because of their expertise in various areas relevant to the document, including statistical analysis, analysis of repeated measurements, messy data analysis, and logistic regression. A brief description of their educational backgrounds and experience is provided below.

Dr. Dallas Johnson has over 25 years of experience in different areas of statistics. His fields of research competence include messy data analysis, repeated measures studies, model validation methods, multiple and logistic regression, and crossover designs. He has experience in teaching, research, and consulting in the above areas and has authored two books on the analysis of messy data. Dr. Johnson also has numerous publications in professional statistical journals. He is currently professor and head of the Department of Statistics, Kansas State University.

Dr. Johnson received a Ph.D. in Statistics and an M.A.T. in Mathematics from Colorado State University. Dr. Johnson is a member of the Institute of Mathematical Statistics and the International Biometric Society and a distinguished member of the American Statistical Association (ASA); he was elected an ASA Fellow in 1983 and received the Don Owen Award in 1997. He has also served as chair of many ASA committees and as founding editor of the Journal of Agricultural, Biological, and Environmental Statistics.

Dr. Kinley Larntz has more than 30 years of experience in teaching, research, and consulting in applied statistics. He is currently a professor emeritus in the Department of Applied Statistics at the University of Minnesota. His activities and research interests are focused on three areas: understanding the small-sample properties of statistical procedures, developing optimal design strategies for nonlinear problems, and implementing large-scale social experiments. Dr. Larntz has been a consultant to numerous State and Federal government agencies, organizations, and private industry. This work has included support to and research for the National Institute of Justice, the Food and Drug Administration, and the National Science Foundation. Additionally, he has served as a member of the Scientific Advisory Board Subcommittee on Secondary Uses of Data and as a consultant to the Clean Air Scientific Advisory Committee, both for the U.S. EPA.

Dr. Larntz is a member of the American Society for Quality and the American Statistical Association. Dr. Larntz received a Ph.D. from the Department of Statistics at the University of Chicago and an A.B. (magna cum laude) in mathematics from Dartmouth College.

Dr. Clifton D. Sutton has approximately 15 years of experience in teaching, research, and consulting in statistics and probability. He is currently an associate professor and Graduate Program Coordinator in the Department of Applied and Engineering Statistics at George Mason University. His major areas of interest are robust statistics, computer-intensive methods of applied statistics, and geometric probability. Dr. Sutton recently completed an extensive study to assess the performance of numerous robust and computer-intensive methods for hypothesis testing.

Dr. Sutton is a member of the Institute of Mathematical Statistics, the International Association for Statistical Computing, and the American Statistical Association. He presently serves as an associate editor for the professional publication Computational Statistics and Data Analysis. He received his Ph.D. and M.S. in Statistics from Stanford University and a B.S. in Applied Mathematics from the University of Virginia.

1.2 Peer Review Comment Format

This document contains the peer review comments from Dallas E. Johnson, Kinley Larntz, and Clifton D. Sutton on the document Statistical Analysis of Selenium Toxicity Data. The comments and recommendations from all three reviewers have been combined and organized as follows:

- General comments;
- Charge to the reviewers;
- Comments on specific criteria presented as a charge to the reviewers;
- Specific comments by document page number, referenced by commenter;
- Miscellaneous comments; and
- Additional references recommended by the reviewers.

To assist cross-referencing, the name of each contributor precedes each comment.

¹ See Hermanutz, R.O., K.N. Allen, T.H. Rousch, and S. Hedtke. 1992. Effects of elevated selenium concentrations on bluegills (Lepomis macrochirus) in outdoor experimental streams. Environ. Toxicol. Chem. 11:217-224.

2.0 CHARGE TO THE PEER REVIEWERS

Background

Under section 304(a) of the Clean Water Act, the U.S. Environmental Protection Agency (EPA) establishes water quality criteria to protect aquatic organisms, among other types of criteria. Once incorporated into water quality standards by States and Tribes, aquatic life criteria serve as the basis of legally enforceable water quality standards, which are used to set limits on pollutant loads to U.S. water bodies.

Aquatic life criteria for one such pollutant, selenium, are currently being revised by EPA. The occurrence of selenium in surface waters is widespread, resulting from a variety of natural and anthropogenic sources. These sources include natural weathering and irrigation-induced leaching of selenium-containing rocks and soils, mobilization and discharge from mining and smelting activities, flue gas emissions from fuel oil and coal combustion, and fly-ash disposal practices (i.e., pond leachate and runoff from land disposal areas). Selenium contamination in aquatic ecosystems has been linked to adverse ecological effects in several field settings, including reproductive and developmental impairment of fish. Thus, water quality criteria for selenium affect a sizable array of the regulated community and may be applicable to numerous locations in the United States.

Since their last revision in 1987, additional data have become available on the effects of selenium on aquatic organisms which may impact the chronic criterion of 5 µg/L. One such study that may be highly influential in revising the chronic criterion is a series of selenium experiments involving bluegill sunfish (Lepomis macrochirus) conducted in large outdoor experimental streams at the Monticello Ecological Research Station (MERS) in Monticello, Minnesota. Assessing the effects of selenium on aquatic life in field settings, such as the outdoor streams used in the MERS study, is crucial because dietary uptake of selenium by fish can dominate low-level, chronic exposures in natural aquatic systems. Results from the first of the three MERS studies (Study I) were published in 1992¹ and are not the focus of this peer review. The second and third studies (Study II and Study III), which involve exposure to selenium above and below the current criterion of 5 µg/L, have yet to be published by EPA and are the focus of this peer review.

Purpose of This Peer Review

In this peer review, you are being requested to review and provide comments on the draft report entitled "Statistical Analysis of Selenium Toxicity Data." This report contains results from the statistical analysis of the Study II and III MERS experiments and was prepared by The Cadmus Group, Inc. under contract to EPA's Office of Water. It is EPA's intent to first obtain peer review of the statistical aspects of the MERS study (i.e., the draft Cadmus report), since the statistical analysis is considered important to interpretation of the results. Once this peer review is complete, EPA will make any appropriate revisions or additions to the statistical analysis contained in the draft Cadmus report and then produce a final study report which will include details of the biological and statistical aspects of the Study II and III MERS experiments. To provide context for the statistical results, a brief summation of the experimental design and methods is included in the draft Cadmus report. Should you have additional questions pertaining to other aspects of the MERS studies (e.g., study design, biological issues), you are requested to contact the peer review contractor (Versar, Inc.), who will arrange to have your questions addressed. To aid in summarizing the study components, a chart illustrating the different data sets, effect variables, and statistical analyses is attached to this charge.
Overriding Issues

There are several aspects of the Study II and III MERS experiments which require careful consideration in the statistical analysis of the MERS data. First, the MERS experiments involve a nested study design whereby measurements are made on subsamples taken from each stream. For example, in the "Egg Cup" portion of the study, observations of larval abnormalities are recorded for multiple spawning events within each bluegill nest within each stream among each of the selenium treatments. In the statistical treatment of the data, EPA considers "stream" to be the experimental unit, largely because within a stream, samples of individual fish are not considered independent of one another (i.e., they are subject to the same historical dosing regime within a given stream). Therefore, there is reason to suspect that a "stream effect" might exist due to past dosing history or spatial configuration (e.g., if the outer streams are more susceptible to fish predators).
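Taking "stream" as the experimental unit implies that fish-level observations must first be collapsed to a single value per stream before any treatment comparison, leaving only two replicates per selenium treatment. A minimal sketch of that aggregation step (the counts, treatment names, and stream labels here are hypothetical, not the MERS data):

```python
# Aggregate subsampled fish outcomes to one value per stream, since
# fish within a stream are not independent replicates of the treatment.
# All counts and labels below are hypothetical, for illustration only.

# (treatment, stream): (survivors, fish stocked)
counts = {
    ("control", "S1"): (40, 85), ("control", "S2"): (42, 85),
    ("low_se",  "S3"): (36, 85), ("low_se",  "S4"): (37, 85),
    ("high_se", "S5"): (28, 85), ("high_se", "S6"): (29, 85),
}

# One survival proportion per stream: the experimental-unit response.
stream_props = {k: alive / total for k, (alive, total) in counts.items()}

# Treatment means are averages of stream-level proportions, so each
# treatment contributes only two replicates (streams) to the ANOVA.
by_trt = {}
for (trt, _stream), p in stream_props.items():
    by_trt.setdefault(trt, []).append(p)
trt_means = {t: sum(ps) / len(ps) for t, ps in by_trt.items()}

for t in sorted(trt_means):
    print(t, round(trt_means[t], 3))
```

With only two stream-level values per treatment, the error degrees of freedom for between-treatment tests are few, which is the power concern raised below.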

A second complicating factor involves the analysis of repeated measurements, which are recorded for each subsample over time (e.g., days 1 through 5 in the egg cup portion of the study). In the Field Nest portion of the study, sampling was not done with replacement, so the exact same population was not being sampled over time. However, it is reasonable to expect that repeated measurements from a given nest would not be independent of one another, given the importance of maternal selenium transfer and possibly genetic variation in susceptibility to selenium impacts. With the Egg Cup portion of the study, sampling was done with replacement (except for dead individuals, which were removed), and again, it is reasonable to expect that repeated measurements would not be independent of one another.

Another factor to consider is that the number of streams (or experimental units) is limited for each selenium treatment (i.e., two streams are used in each selenium treatment), which raises some concern over the statistical power of the experiments. Finally, the effect variable of greatest concern to EPA (% larval abnormalities) exhibits significant departures from the homogeneity-of-variance and normality assumptions of the parametric statistical tests. For example, control fish naturally exhibit an extremely low incidence of larval abnormalities, while % larval abnormality in selenium-treated fish tends to be highly variable. This phenomenon has been observed in other studies and may in part be due to the differential delivery of selenium from the parental ovary to the eggs both within and across spawning events and across different fish.

Technical Charge Questions

Listed below are the specific charge questions for which your response is requested. Results from Study II (i.e., the active exposure portion of the MERS study) are of most interest to EPA, since there was a general lack of effects observed in Study III (the recovery portion of the study, in which bluegills were not actively exposed to selenium).

The technical charge to the peer reviewers was as follows:

Study II, Adult Survival and Growth Data

1. Given the constraints of the experimental design, do you consider the ANOVA models and statistical procedures used to analyze the effects of selenium on adult survival (p. 11) and adult growth (p. 13) in Study II to be most appropriate? If not, which alternative procedures and models would you recommend and why?

1b. An internal review of the draft Cadmus report suggested the use of logistic regression for analysis of the adult survival data, specifically the GLIMMIX procedure in SAS. Based on your experience, would logistic regression (including but not necessarily limited to the GLIMMIX procedure) likely be superior to the ANOVA approach used in the draft report and still be consistent (in terms of degrees of freedom) with the notion that "stream" is the experimental unit of study? Do you agree that "stream" is the appropriate experimental unit of study in the MERS study for adult growth and survival data?

2. Does the analysis of statistical power of the ANOVA procedure on page 13 adequately support the conclusion that six streams provide reasonably good statistical power for analysis of the adult survival data?

Study II, Field Nest Data

3. Considering the experimental design and limitations associated with the Study II field nest data set, do you consider the combined ANOVA model (p. 21-22) and PROC MIXED SAS procedure the most appropriate and scientifically defensible approach for analyzing the maximum % abnormality data (i.e., edema, lordosis, hemorrhaging)? If not, why not, and which alternative procedures and models would you recommend and why?

3b. An internal review of the draft Cadmus report suggested the use of logistic regression for analysis of the Field Nest % abnormality data, specifically the GLIMMIX procedure in SAS. Based on your experience, would logistic regression (including but not necessarily limited to the GLIMMIX procedure) likely be superior to the ANOVA approach used in the draft report and still be consistent (in terms of degrees of freedom) with the notion that "stream" is the experimental unit of study? Do you agree that "stream" is the appropriate experimental unit of study in the MERS study for the Field Nest abnormality data?

4. Do you consider the repeated-measures ANOVA procedure and associated statistical model (p. 24) to be the most appropriate choices for analyzing time-dependent effects of selenium on larval abnormalities? If not, why not, and which alternative procedures and models would you recommend and why?

5a. To address violations of the normality and homogeneity-of-variance assumptions that occurred with raw and arcsine square root-transformed % abnormality data, these data were rank-transformed and reanalyzed using the same ANOVA procedures that were applied to the raw and arcsine square root-transformed data (i.e., the ANOVA of maximum % abnormality responses and repeated-measures ANOVA). Is this analysis of ranked data (e.g., p. 23 and p. 26) scientifically defensible, given EPA's understanding that nonparametric methods do not exist for mixed-model ANOVA designs?

5b. If not, do you consider any other statistical methods or data transformations superior to the ranking procedure used in the draft report for addressing the violations of the normality and homogeneity-of-variance assumptions in a mixed-model design? If so, which ones do you recommend and why?
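The two remedies discussed in questions 5a and 5b, the arcsine square-root transformation and the rank transformation, can be sketched as follows (the proportions below are illustrative only, not the MERS data; midranks are assumed for ties, as is standard for rank-transformation ANOVA):

```python
import math

def arcsine_sqrt(p):
    """Variance-stabilizing transform for a proportion p in [0, 1]."""
    return math.asin(math.sqrt(p))

def rank_transform(xs):
    """Replace each value with its rank, using midranks for ties,
    before running the same ANOVA on the ranked values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in order[i:j + 1]:
            ranks[k] = midrank
        i = j + 1
    return ranks

# Hypothetical % abnormality values (as proportions); note the zeros
# typical of control streams, which produce heavy ties under ranking.
props = [0.0, 0.0, 0.02, 0.15, 0.40, 0.45]
print([round(arcsine_sqrt(p), 3) for p in props])
print(rank_transform(props))
```

The zeros illustrate why both remedies struggle here: the arcsine square-root transform maps every zero to zero (so control variance stays near zero), and ranking collapses all zeros to a single midrank.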

Study II, Egg Cup Data

6. Considering the experimental design and limitations associated with the Study II egg cup data set, do you consider the combined ANOVA model (p. 26-27) and PROC MIXED SAS procedure the most appropriate approach for analyzing the % hatch, % survival, and maximum % abnormality data? If not, why not, and which alternative procedures and models would you recommend and why?
6b. Again, internal reviewers recommended the use of logistic regression (e.g., GLIMMIX in SAS) for the analysis of the Egg Cup % abnormality data. Based on your experience, would logistic regression (including but not necessarily limited to the GLIMMIX procedure) likely be superior to the ANOVA approach used in the draft report and still be consistent (in terms of degrees of freedom) with the notion that "stream" is the experimental unit of study? Do you agree that "stream" is the appropriate experimental unit of study in the MERS study for the Egg Cup abnormality data?

7. Do you consider the repeated-measures ANOVA procedures (full and partial time series) and associated statistical models the most appropriate approaches for analyzing time-course effects of selenium on the egg cup % abnormality responses? If not, why not, and which alternative procedures and models would you recommend and why?

8. Similar to questions 5a and 5b above, is the analysis of rank-transformed % abnormality data from the egg cup study appropriate and scientifically defensible? If not, why not, and which alternative statistical procedures would you recommend and why?

9. To address potential confounding effects of larval starvation on the analysis of the % abnormality data, a partial time series was analyzed (i.e., data beyond day 3 were omitted from the analysis). Do you prefer the analysis of the partial time series data set to the analysis of the full time series data set? Do you consider any alternative statistical approaches superior to the partial time-series analysis used here? If so, which ones and why?

10. An earlier MERS study (i.e., the attached Study I; Hermanutz et al., 1992) involved the continuous exposure of bluegills to selenium treatments of 10 and 30 µg/L in addition to controls. EPA has considered combining the results from continuous selenium exposure in Studies I and II for statistical analysis, which would result in three concentrations (2.5, 10, 30 µg/L) plus controls. However, data (particularly % larval abnormality) are very limited at 30 µg/L in Study I due to severe adult mortality. With this limitation in mind, do you consider the combination of data (e.g., % larval abnormality) from Study I and Study II to be both desirable and feasible from a statistical perspective? If so, how would one account for the effect of "time" (i.e., different experiments in different years) on the results?

11. From a statistical perspective, how would you recommend addressing the issue of cumulative impacts or competing effects of selenium in the statistical analysis (i.e., the potential influence or bias that death of organisms might have on the incidence of sublethal effects such as larval abnormality)?

12. Do you agree with the summary and conclusions presented in Section 5 of the report? Why or why not?

Study III Data

13. Nearly all of the questions listed above for the analysis of Study II results are applicable to the analysis of Study III results. One exception is that the questions pertaining to use of the repeated-measures ANOVA are not applicable to Study III, since the preponderance of zero response observations negated the use of the repeated-measures ANOVA in Study III. With this exception in mind, do you advise anything different for the analysis of Study III data than what you recommend for Study II? If so, what would you recommend and why?
2.1 General Comments

Dallas E. Johnson

This paper involved statistical analyses of several studies. Two studies involved the effects of selenium on the survival and growth of adult bluegills, and two studies involved the effects of selenium on the spawning activity and progeny of adult bluegills, through two field nest data studies and two egg cup data studies. The report does an excellent job of describing the experimental designs and procedures used to collect data in all of the studies. The EPA is correct when it considers "stream" to be the experimental unit for the selenium treatments. A major concern of mine is the extreme messiness of the data in the field nest and egg cup studies due to the number of zero observations in these data sets. I will say more about this later. Clearly, the limited number of streams available for the study is a concern from a scientific point of view (i.e., a lack of power), but not from a legitimate statistical analysis point of view (i.e., valid statistical tests).

The FREQ Procedure
Table of Status by trt
(cell entries: Frequency / Percent / Row Pct / Col Pct)

Status   trt 1     trt 2     trt 3     trt 4     trt 5     trt 6     Total
dead     45        43        57        49        56        48        298
         8.82      8.43      11.18     9.61      10.98     9.41      58.43
         15.10     14.43     19.13     16.44     18.79     16.11
         52.94     50.59     67.06     57.65     65.88     56.47
live     40        42        28        36        29        37        212
         7.84      8.24      5.49      7.06      5.69      7.25      41.57
         18.87     19.81     13.21     16.98     13.68     17.45
         47.06     49.41     32.94     42.35     34.12     43.53
Total    85        85        85        85        85        85        510
         16.67     16.67     16.67     16.67     16.67     16.67     100.0

Statistics for Table of Status by trt

Statistic                      DF    Value     Prob
Chi-Square                      5    7.9112    0.1612
Likelihood Ratio Chi-Square     5    7.9806    0.1573
Mantel-Haenszel Chi-Square      1    1.4613    0.2267
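For readers wishing to check the table, the reported Pearson chi-square statistic can be reproduced directly from the cell counts alone:

```python
# Recompute the Pearson chi-square for the Status x trt table above
# from its dead/live counts by treatment.
dead = [45, 43, 57, 49, 56, 48]
live = [40, 42, 28, 36, 29, 37]

n = sum(dead) + sum(live)                         # 510 fish in total
col_totals = [d + l for d, l in zip(dead, live)]  # 85 per treatment

chi2 = 0.0
for row, row_total in ((dead, sum(dead)), (live, sum(live))):
    for obs, col_total in zip(row, col_totals):
        expected = row_total * col_total / n
        chi2 += (obs - expected) ** 2 / expected

df = (2 - 1) * (len(dead) - 1)  # (rows - 1)(cols - 1) = 5
print(round(chi2, 4), df)       # matches the reported 7.9112 on 5 df
```

Note that this test treats the 510 fish as independent, which is exactly the assumption the reviewers question when "stream" is taken as the experimental unit.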
Kinley Larntz

Overall, the statistical methods used in this report are not appropriate given the nature of the data and the imbalance found in the design. Use of a method that reflects the binomial structure and unequal sample sizes is a minimum for statistical validity here. Use of ANOVA and repeated-measures analysis on raw percentages, maximum % abnormality, and rank data is not appropriate. The conclusions depend to a great extent on the choice of transformation scale. Interactions are reported that are clearly due only to inappropriate scaling.

The report emphasizes statistical testing, reporting many tables of p-values. There are tables of summary statistics, but these do not always reflect the data, since many, many counts are zero. A better approach would emphasize estimation of effects. The data clearly indicate strong effects of selenium on the rate of abnormalities. The size of these effects is not quantified in this report. To generate appropriate estimates, I suggest construction of Bayesian hierarchical models. Fitting such models would provide reasonable credible limits for the effects studied.

The experiment is quite small from a statistical perspective, particularly given the obviously large stream-to-stream variation. I do believe that there is much to be gained from a careful statistical analysis, but one must understand that for basic responses such as survival, the study is greatly underpowered. I understand that larger experiments would have been expensive or impractical, but survival differences of 10% or greater cannot be expected to be found from such a small study.
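The power point can be illustrated with a simple two-proportion calculation that, optimistically, treats all fish in a treatment as independent (i.e., it ignores the stream-to-stream variation noted above, so the real power is lower still). The survival rates below are hypothetical:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_prop_power(p1, p2, n_per_group, z_alpha=1.959964):
    """Approximate power of a two-sided two-proportion z-test at the
    5% level. Treats fish as independent, ignoring stream clustering,
    so this is an upper bound on the power actually available."""
    se = math.sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z = abs(p2 - p1) / se
    return normal_cdf(z - z_alpha) + normal_cdf(-z - z_alpha)

# Two streams of 85 fish per treatment and a 10-point survival difference:
power = two_prop_power(0.50, 0.40, n_per_group=170)
print(round(power, 2))  # well under the conventional 0.80 target
```

Even this best-case calculation falls well short of conventional power targets, supporting the conclusion that 10% survival differences cannot reliably be detected in this design.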

Clifton D. Sutton

For this project, it's hard for me to be more definite in my remarks without having had a good opportunity to perform a thorough analysis, trying several different approaches, since the data appear to be somewhat resistant to a simple, straightforward analysis. Instead, since the data are a bit messy, several different models should be investigated, and each one assessed based on the collective results and a careful consideration of diagnostic checks. For out-of-the-ordinary statistical methods applied to messy data, such as using rank-based ANOVA in more complicated designs, it seems prudent to partially check their accuracy using Monte Carlo studies.

ANOVA methods, if the assumptions are reasonably met, would seem like a good possibility, and they are relatively easy to interpret and explain to others. An advantage that they have over logistic regression modeling is that it is relatively easy to perform multiple comparison procedures, which can supply additional knowledge about the situation beyond just knowing that some treatments and/or interactions are statistically significant. However, such multiple comparison procedures are based on an assumption of equal variances and can be quite misleading when this assumption is violated. Since it appears that we shouldn't assume constant variance with these data, this potential advantage over logistic regression cannot be realized in this analysis.

Even if ANOVA methods resulted in statistically significant results, one should be cautious due to violations of the assumptions. But since the ANOVA methods don't provide many strong and confident conclusions, I'd recommend also trying logistic regression models. A logistic regression model would allow for a general test of treatment effects (a test of the null hypothesis that there are no differences in the response due to differing concentrations of selenium, against the alternative that there are some differences), and given that there are just three levels of selenium concentration, it shouldn't be too complicated to investigate the nature of the differences if differences are found to be statistically significant.

On the whole, the draft being reviewed is a bit terse in places. Perhaps because there are so many investigations of interest, not many details are given for each one. Furthermore, while there are a lot of graphics supplied, some of them aren't very useful, and some useful graphics that could have been included were not. So without having adequate time to do a thorough analysis, it's impossible for me to guess whether the authors tried all of the reasonable approaches and failed to report on them because they didn't seem to supply anything much different from what they reported on, or whether they may have been better off trying a few more approaches.
In summary, since the ANOVA methods reported on in some parts of the draft didn't seem to lead to a conclusion that the differences in selenium concentration matter, and in other parts of the draft the reliability of the ANOVA methods is quite suspect, before dismissing the studies as failing to supply reasonably conclusive evidence for differences, I'd certainly recommend investigating logistic regression models. If logistic regression models lead to the same conclusions, then it may be the case that no standard approach will identify statistically significant differences for some of the analyses. This would leave the possibility of trying more complex approaches (like bootstrap methods), but since such approaches should be carefully studied (perhaps through extensive Monte Carlo studies) in order to gauge how accurate they might be in situations similar to the analyses at hand, it may be reasonable to conclude that if a logistic regression model doesn't suggest that there are significant differences, then perhaps the small sample sizes, combined with messy data, will prevent significant differences from being confidently identified.

2.2 Response to Charge

STUDY II, ADULT SURVIVAL AND GROWTH DATA

Dallas E. Johnson

The statistical procedures used to analyze the effects of selenium on adult survival are acceptable. I doubt that it was necessary to use the arcsine and square root transformations on these data, but I don't feel strongly about it. The report provides some residual plots in Figures 3.1 and 3.2 and suggests that there are unequal variances in the three treatment groups. I don't agree with this assessment. In fact, one cannot say much at all about variances based on samples of size 2. One variance would need to be about 60 times larger than another to be anywhere close to significant at the 10% significance level. I think the authors of the report are misreading their residual plots.

Kinley Larntz

First, the appropriate experimental unit is certainly stream. As such, there are six observations in each study.

Second, it is probably not greatly important which statistical procedure is used, as long as the model allows for stream-to-stream variability. The ANOVA models do that by using the within-stream variation as the error term. The GLIMMIX procedure should do the same. If implemented correctly, the degrees of freedom for tests should be the same in the GLIMMIX and ANOVA procedures. The main difference is that the GLIMMIX procedure automatically understands the binomial nature of the data within a stream. That, however, does not gain any degrees of freedom for the main comparison. Overall, I prefer the GLIMMIX approach to the ANOVA approach. It would be easy to run the GLIMMIX procedure to compare with the ANOVA results, but I doubt there will be a great deal of difference.

Third,
the
analysis
of
the
long­
term
data
require
special
care
since
only
a
sample
of
fish
were
used
in
the
second
part
of
each
study.
The
ANOVA
approach
is
statistically
valid,
except
for
inequality
of
variance
caused
by
differences
in
survival
proportions
and
unequal
sample
sizes.
GLIMMIX
modeling
would
have
to
be
done
on
each
part
separately
and
then
combined.
It
could
be
done,
but
it
is
not
easy.
1. Given the constraints of the experimental design, do you consider the ANOVA models and statistical procedures used to analyze the effects of selenium on adult survival (p. 11) and adult growth (p. 13) in Study II to be most appropriate? If not, which alternative procedures and models would you recommend and why?
Fourth, I do not think the growth analysis is valid in any real sense. The fish cannot be matched, and gender is not even determined for small fish when the transfer is made. So, while ANOVA would likely be OK if the population of fish remained the same and gender was known, that is not the case here.

Finally, I would probably model these data using a Bayesian hierarchical model. The results from such a small study are greatly influenced by the observed stream-to-stream variation, which is estimated with only three degrees of freedom. Thus, it is easy for the frequentist approach to "jump to conclusions" if the stream-to-stream variation is estimated to be too small, which could easily happen. My experience is that the Bayes approach smooths out the extremely small and extremely large variance component estimates to provide a more robust analysis for small datasets. Also, the Bayes approach would provide posterior distributions for the long-term survival rates as products of separately estimated survival rates. This is difficult to do in the frequentist approach taken here.

Clifton D. Sutton

Given that one has samples of size two with which to do the one-way ANOVA, and thus possibly low power to reject the null hypothesis of no differences, I'd recommend trying a test of the null hypothesis of no differences against a one-directional monotone alternative in situations such as the one at hand, in order to have increased power with which to detect differences of interest. That is, instead of the usual one-way ANOVA alternative hypothesis that the means are not all equal, I'd use the monotone alternative that the mean for 0 microgram/L is greater than or equal to the mean for 2.5 microgram/L, which is greater than or equal to the mean for 10 microgram/L, with at least one of the inequalities being strict. Such a test increases the power for the alternative situation of interest: that an increased level of selenium can reduce the survival (or growth) rate.

Under an assumption of approximate normality and homoscedasticity (equal variances), a test due to Abelson and Tukey (see pages 78-80 and 91-92 of Miller's Beyond ANOVA: Basics of Applied Statistics) is one possibility. My guess is that because the sample sizes are equal, a violation of the equal variance assumption shouldn't cause a huge problem, and this is good since there is no good way to check the equal variance assumption with such small sample sizes. But because the sample sizes are so small, the accuracy of the test may be destroyed if the non-normality is too drastic (and it may be that we had better not stray too far at all from normality, but of course this is also the case for the usual ANOVA F test when the sample sizes are so small), and so the result from this test needs to be weighed appropriately, since we have no way to get a good check of normality with so few observations.

A nonparametric alternative to the Abelson-Tukey test is the Jonckheere-Terpstra test. Exact p-values for this test (as opposed to p-values based on an asymptotic approximation) can be obtained using StatXact, but as is the case for other nonparametric tests, small sample sizes hurt the test's ability to give small p-values. In order to show that it does make a difference whether one does a test against the general alternative or a test against a monotone alternative, I computed p-values for a variety of tests using two sets of survival data: the percent survival values at Day 221, and the cumulative percent survival values. The F test and the Abelson-Tukey test were done on both the raw data and the transformed data. The p-values are shown below. One can see that transformation makes little difference, and one can also see that the p-values from the Abelson-Tukey test are only about 1/3 as large as the p-values from the ANOVA F test.

percent survival at Day 221
0.26 --- F test (raw data)
0.26 --- F test (transformed data)
0.09 --- Abelson-Tukey test (raw data)
0.10 --- Abelson-Tukey test (transformed data)
0.17 --- Jonckheere-Terpstra test
cumulative percent survival to Day 320
0.61 --- F test (raw data)
0.60 --- F test (transformed data)
0.19 --- Abelson-Tukey test (raw data)
0.18 --- Abelson-Tukey test (transformed data)
0.29 --- Jonckheere-Terpstra test
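The monotone-alternative comparison above can be sketched numerically with a small exact-permutation version of the Jonckheere-Terpstra statistic for the three-dose, two-streams-per-dose layout. The survival values below are hypothetical stand-ins, since the per-stream MERS values are not reproduced in this review; the enumeration logic, not the numbers, is the point.

```python
# Exact permutation Jonckheere-Terpstra trend test, three groups of two.
# The survival values are HYPOTHETICAL, not the actual MERS stream data.
from itertools import permutations

groups = {0.0: [48.0, 49.0], 2.5: [40.0, 36.0], 10.0: [41.0, 37.0]}

def jt_statistic(obs, labels):
    """Sum over ordered dose pairs (low, high) of #{x in low group > y in
    high group}; LARGE values favor the monotone-decreasing alternative."""
    doses = sorted(set(labels))
    stat = 0
    for i in range(len(doses)):
        for j in range(i + 1, len(doses)):
            lo = [x for x, l in zip(obs, labels) if l == doses[i]]
            hi = [x for x, l in zip(obs, labels) if l == doses[j]]
            stat += sum(x > y for x in lo for y in hi)
    return stat

obs = [x for d in sorted(groups) for x in groups[d]]
labels = [d for d in sorted(groups) for _ in groups[d]]

observed = jt_statistic(obs, labels)
# Enumerate every distinct reassignment of the six values to dose groups.
perms = set(permutations(labels))
count = sum(jt_statistic(obs, p) >= observed for p in perms)
p_value = count / len(perms)
print(len(perms), observed, p_value)
```

With only six observations the reference set has just 90 distinct assignments, which illustrates the reviewer's point that small samples limit how small an exact p-value can be.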
Alternatively, one could try a one-directional Dunnett's test against the control, even if the ANOVA F test is not significant. (From page 6 of the draft, it seems like Dunnett's test was only to be considered if the F test resulted in a p-value of 0.05 or less.) The F test does not concentrate its power the way Dunnett's test does: although the F test can have good power against a wide variety of specific alternatives (different ways that the 3 distribution means aren't all equal) that are rather extreme deviations from all of the means being equal, Dunnett's test has better power against the subset of alternatives that are of interest to the study (alternatives for which increased selenium concentration corresponds to reduced survival rate). I simply don't agree with the notion that one needs to obtain a statistically significant F test result before it's proper to use Dunnett's test (or Tukey's studentized range test, either). (Note: On page 6 of the draft, the studentized range test is referred to as the standardized range test.) While Dunnett's test does indeed give us smaller p-values than the ANOVA F test does with the survival data, it is still the case that we don't have statistical significance at the 0.05 level (or even the 0.1 level).

With regard to concerns about (approximate) normality and homoscedasticity (equal variances), I'll first address the normality issue. If we thought that the source of variation for the two observations under each treatment was due to the fish more so than the streams, then we could feel good about an assumption of approximate normality, since each observation is a proportion, which is a sample mean, and therefore should be approximately normally distributed due to the central limit theorem. Also, it can be noted that the arc-sine square-root transformation seems entirely sensible in such a setting, since it is the variance-stabilizing transformation for the sample mean of Bernoulli random variables. However, in this study, in which the streams are viewed as the sampling units, the error term variance is also, and perhaps largely, due to differences between the streams, and we really don't know what sort of a distribution to assume for this variation. With only 3 samples of 2 observations each, one cannot get a reliable indication of what the error term distribution is like, although it can be noted that the pooled residuals suggest that it may be in the general ballpark of normality since there are no extreme outliers; but on the other hand, with only two observations per sample, outliers would tend to be masked.
With such small samples, we cannot depend on a central limit theorem type of effect to make the sample means used in the numerators of the ANOVA test statistics approximately normal, which would of course help with regard to robustness against non-normality. Also, we cannot think that the denominators of the statistics have closely converged to the constant values that they would converge to if the sample sizes were larger, which would also help with regard to robustness. So we cannot rely on robustness the way we could if the sample sizes were larger, and we must instead hope that the error term distribution is close to normal; but of course with only two observations per sample it's impossible to get a good check on the reasonableness of this assumption. So if we use ANOVA methods, we pretty much have to hope that violations of the assumptions don't lead to misleading test results.
In my experience, I have come to believe that violations of the normality assumption have a greater effect in the direction of resulting in a conservative test with low power than they do in the direction of resulting in a test having an inflated type I error rate. Since the ANOVA F tests in Study II tend to have produced large p-values, we don't really have to be concerned about having an inflated type I error rate, but we may be concerned that non-normality led to low power. Of course, it may very well be that the very small sample sizes are the main reason the power would be low, and of course it may be that the null hypothesis is true and we shouldn't expect a rejection; but when taken in total, the data from Study II seem to suggest that selenium concentration may have an effect (since the data typically "point in that direction" even if they don't result in small p-values).
With regard to the assumption of equal variances, we can take some comfort in the fact that the effects of heteroscedasticity are diminished if the sample sizes are equal, which is the case here. With only 2 observations per sample, it's a bit silly to test for heterogeneity of variance, since the normal theory procedures are not very robust to violations of the normality assumption, and more robust procedures cannot be trusted to perform accurately with such small sample sizes. Nevertheless, I did compute Bartlett's statistic and consulted tables to check for significance (since the asymptotic sampling distribution could be inaccurate), and the assumption of a constant variance cannot be rejected (for the transformed survival data, after both 221 and 320 days).
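The Bartlett check described above is easy to reproduce in software, with one caveat echoing the reviewer's own: scipy's implementation uses the asymptotic chi-square distribution, which is exactly the approximation he avoided by consulting tables, so with only two observations per group the p-value is indicative at best. The group values below are hypothetical placeholders with the same 3-groups-of-2 layout.

```python
# Bartlett's test for equal variances across three treatment groups of
# size two. Values are HYPOTHETICAL arcsine-sqrt-transformed survival
# proportions; scipy reports an asymptotic chi-square p-value.
from scipy import stats

g_control = [0.76, 0.80]
g_low     = [0.66, 0.61]
g_high    = [0.67, 0.63]

stat, p = stats.bartlett(g_control, g_low, g_high)
print(stat, p)
```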

In paragraph 1 on page 13, the draft notes that the residuals show some violations of the error assumptions, but really we can't learn a lot from the residuals. When the sample size is two, the residuals will be equal in magnitude (no matter what the shape of the distribution is), and so they cannot be used to check for skewness. Furthermore, since they are guaranteed to be equal in magnitude, in a sense we have only one observation per sample to tell us something about the variances, and one observation per sample does not tell us very much in this case. What we can see is consistent with equal variances, and it's also consistent with unequal variances; we simply have no way of knowing whether or not the variances differ appreciably. Of course, in paragraph 1 on page 13 of the draft it is properly noted that the small sample size probably has the largest impact on the ANOVA results, but I think it's worth noting that while we may have some robustness to counter unequal variances, the ANOVA tests can be quite sensitive to non-normality when the sample sizes are so small, and we don't have a good way to check for non-normality.

Dallas E. Johnson

I don't think that logistic regression using the SAS GLIMMIX procedure is likely to be superior to the analyses that have been performed.

You did not ask for this, but another analysis that can be performed on these data is a 2-way chi-square contingency table analysis, with rows identified by live/dead and columns identified by each of the 6 streams in the study. I performed this analysis on the Day 1-221 data from Study II. The chi-square test statistic with 5 degrees of freedom is 7.9112 (p-value = 0.1612), indicating there are no differences in death rates in any of the six streams. This is consistent with the conclusions shown in Table 3-5 (p. 11). See below for a portion of the results from the SAS-FREQ procedure.
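The same 2 x 6 live/dead-by-stream analysis can be sketched outside SAS with scipy. The counts below are hypothetical, since the raw per-stream counts are not reproduced in the review; the structure (and the df = 5 that follows from it) is what matters.

```python
# 2 x 6 contingency-table analysis (live/dead by stream), mirroring the
# reviewer's SAS-FREQ run on HYPOTHETICAL counts.
from scipy.stats import chi2_contingency

#             stream:  1    2    3    4    5    6
table = [
    [41, 39, 32, 30, 33, 33],   # alive at Day 221 (hypothetical)
    [44, 46, 53, 55, 52, 52],   # dead  at Day 221 (hypothetical)
]

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, dof, p)   # df = (2 - 1) * (6 - 1) = 5
```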

Clifton D. Sutton

The streams play the same role in a logistic regression model as they do in an ANOVA model, and so if they are viewed as the sampling units in the ANOVA model they can be viewed in this way for the logistic regression model. Since there is the possibility that stream differences contribute a lot of the observed differences, one has to account for stream differences when assessing the treatment effects. So viewing the streams as the experimental units for the ANOVA models seems appropriate, and adjusting for stream differences by including stream effect terms in the logistic regression model is also appropriate and allows the streams to play the same role in both models.

1b. An internal review of the draft Cadmus report suggested the use of logistic regression for analysis of the adult survival data, specifically the GLIMMIX procedure in SAS. Based on your experience, would logistic regression (including but not necessarily limited to the GLIMMIX procedure) likely be superior to the ANOVA approach used in the draft report and still be consistent (in terms of degrees of freedom) with the notion that "stream" is the experimental unit of study? Do you agree that "stream" is the appropriate experimental unit of study in the MERS study for adult growth and survival data?

Both ANOVA and logistic regression models will be hampered by the fact that there are only two streams per treatment. It may be the case that both models yield similar conclusions, but I would certainly be in favor of investigating logistic regression models, since one should be concerned about the effect of non-normality on the ANOVA models.

The ANOVA models are simple; they allow for multiple comparisons to investigate any differences (treatment effects and/or interaction effects) that are found, and they also allow for tests against monotone alternatives. But for the survival and growth data in Study II, the ANOVA approach does not uncover statistically significant differences, and so one may want to check on the reasonableness of this with a logistic regression model, especially since we have to be concerned about non-normality hurting the power of the ANOVA methods. With the possibility that stream differences serve to mask treatment differences in this study, in which there are only two streams per treatment and thus not a lot of information with which to counter the effects of experimental noise, one should not assume that any reasonable method will detect treatment effects, and so it seems prudent to try more than one approach. Based on my experience, I would favor a logistic regression approach over the ANOVA approach with data such as we have here.

With only three treatment levels, and anticipating a monotone alternative if the null hypothesis isn't true, I think it should be easy to compare the treatment effects if the null hypothesis of no effects is rejected. But again, it could be that the experimental noise due to such a small number of streams per treatment presents an obstacle that is too large to overcome.

Dallas E. Johnson

There is not enough information in the report to allow me to assess the accuracy of the power curves in Figures 3.3-3.6. In fact, I don't think they are done accurately, but I don't have the facts to support my thoughts. Perhaps some of your other reviewers will be able to assess this for you. The statement in the report in the next-to-last line of the second paragraph of page 13 is wrong. The authors say "It is desirable to have a high chance of classifying the treatment means as equal, when they are (i.e., high power, ...)." This statement does not describe power correctly. Consider Ho: there is no difference between streams due to selenium level. Power then is the probability of rejecting Ho when there are differences between streams of a given magnitude. That is, power is usually the probability of rejecting Ho when Ho is false. We could also say that when we have good power, we have a high chance of saying that the treatment means are unequal when they are, in fact, unequal.

Kinley Larntz

I do not believe the power curves presented on pages 14-15. First, I would only be interested in the curves for level 0.05. Second, there are differences in the data as great as 10% that are not statistically significant. (The percentages for survival to Day 221 in Study II are 48.24, 37.65, and 38.83, and the p-value reported is 0.263.) So, it is clear that this study does not have power great enough to detect a 10% difference. The power calculations must have used incorrect estimates of stream-to-stream variation, or perhaps the sample sizes used were per-stream counts rather than the total number of streams.
2. Does the analysis of statistical power of the ANOVA procedure on page 13 adequately support the conclusion that six streams provide reasonably good statistical power for analysis of the adult survival data?
Clifton D. Sutton

I do not agree with the power analysis presented in the draft. Although I may be addressing a slightly different situation than the authors did (since the details are rather sketchy), I feel that the situations I considered should be in the general ballpark of what is appropriate for the survival data sets in Study II. Based on the work I did, I think that anywhere from 4 to 15 streams per treatment may be necessary to have decent power against alternatives of interest if an ANOVA F test is used. If one wants to take a chance, it may be that for one of the studies, 3 streams per treatment (concentration level) may be adequate, but I would recommend 4 as a minimum. (I used Pearson and Hartley's Biometrika Tables for Statisticians to perform a power analysis for two survival rate data sets from Study II.) Without doing the necessary calculations, 2 streams per treatment seemed unreasonably low, and my calculations served to confirm this.

Although I used the usual estimate of the error term standard deviation in my calculations, as I guess the authors did as well, we should keep in mind that the determination of the power requires the true error term standard deviation, which is unknown. If the estimate used happens to underestimate the true value, then even larger sample sizes may be needed. With an MSE having only 3 degrees of freedom, the estimated standard deviation can be rather inaccurate, and thus it can easily help to produce a misleading estimate of the power. Also, the power calculations that I did assume normally distributed error terms, and we don't know how close we are to normality. Certain types of non-normality hurt the power, and so once again we need to realize that the usual estimates of power may be overly optimistic.
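The dependence of F-test power on the number of streams per treatment can be sketched with statsmodels. The effect size below is a hypothetical Cohen's f, not a value estimated from the MERS data; the point is only the shape of the curve as streams are added.

```python
# Power of a one-way ANOVA F test versus streams per treatment, three
# treatment groups. The effect size f is a HYPOTHETICAL choice.
from statsmodels.stats.power import FTestAnovaPower

solver = FTestAnovaPower()
f = 0.5                                  # hypothetical Cohen's f
for streams_per_trt in (2, 4, 8, 15):
    nobs = 3 * streams_per_trt           # total streams across 3 treatments
    pwr = solver.power(effect_size=f, nobs=nobs, alpha=0.05, k_groups=3)
    print(streams_per_trt, round(pwr, 3))
```

Note that this, like the reviewer's table-based calculation, plugs in an assumed error standard deviation through f, so the same caveat about underestimating the true value applies.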

It's unfortunate that many biological studies are done using such small sample sizes, which, when combined with noisy data, give us little power to make claims of statistically significant differences. Going from sample sizes of 2 to sample sizes of even just 3 or 4 can help a lot with regard to power, as well as with regard to robustness and to giving us better ability to check the assumptions.

STUDY II, FIELD NEST DATA

Dallas E. Johnson

Given the experimental design and limitations associated with the Study II field nest data set, I do believe that the SAS MIXED procedure is the most appropriate and scientifically defensible approach for analyzing the maximum % abnormality data. However, because there is ample reason from a scientific point of view to believe that the variances are not constant across selenium levels, it might have been wise to use an option in SAS MIXED that would incorporate this into the analysis. This could be done by using a GROUP= option on the RANDOM statement in the MIXED analysis. I do have some concerns about performing this kind of analysis on ranked data. Does the paper by Iman that is referenced in the report recommend using ranks when the data are nested, such as they are nested in this experiment? I don't think so, but it has been a while since I have looked at this paper. I expect that some statisticians might recommend using bootstrapping to estimate p-values and standard errors for this kind of an experiment, but bootstrapping is beyond my level of expertise. Still, it might be something that one should consider.
3. Considering the experimental design and limitations associated with the Study II field nest data set, do you consider the combined ANOVA model (p. 21-22) and the PROC MIXED SAS procedure the most appropriate and scientifically defensible approach for analyzing the maximum % abnormality data (i.e., edema, lordosis, hemorrhaging)? If not, why not, and which alternative procedures and models would you recommend and why?
Kinley Larntz

There are major statistical problems with this section. First, the measurement of maximum % abnormality is invalid because the number of samples differs for each nest and the size of each subsample differs from sample to sample. The subsample sizes range from 6 to 140, hardly a fair basis for computing a maximum. Converting to ranks does nothing to get rid of this problem. (Also see comments of Larntz under Question 3b.)

Clifton D. Sutton

For the reasons noted in the draft, using PROC MIXED is certainly an approach worthy of consideration and is a reasonable starting point. However, in light of some of the problems that were encountered and that were noted in the draft, and that are addressed below, it may be that a logistic regression model should be considered to be more trustworthy.

Page 22 of the draft indicates that tests for non-normality and heteroscedasticity (which are referred to as tests of normality and homogeneity of variance in the draft) are significant, even after transformation. Even though significant results from such tests do not necessarily mean that ANOVA results cannot be reasonably accurate (for example, it can depend on the type of non-normality), with unbalanced designs such as we have, I'd be rather cautious in trusting the ANOVA results.

Perhaps some transformations other than the arc-sine square-root transformation should have been tried. The arc-sine square-root transformation often works well when one has proportions as the observations, but not all of the time. If some other reasonable transformation does a better job of stabilizing the variance and attaining approximate normality, then it should be considered for use.
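A quick simulation shows both why the arc-sine square-root transform is the natural first try for proportions and why it is not guaranteed to work here: it stabilizes the variance of a binomial proportion at roughly 1/(4n), but only when each proportion is based on the same number of trials n, which the field nest data (subsamples of 6 to 140) are not. The trial count and proportions below are illustrative choices.

```python
# Variance stabilization by the arcsine-sqrt transform for binomial
# proportions with a COMMON n (an illustrative choice, n = 50).
import numpy as np

rng = np.random.default_rng(1)
n = 50
raw_vars, trans_vars = [], []
for true_p in (0.1, 0.3, 0.5):
    phat = rng.binomial(n, true_p, size=20000) / n
    raw_vars.append(phat.var())                       # ~ p(1-p)/n, varies with p
    trans_vars.append(np.arcsin(np.sqrt(phat)).var()) # ~ 1/(4n), roughly constant
print(raw_vars)
print(trans_vars)
```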

Dallas E. Johnson

Unless you still have the data from individual fish, I don't think you can use the GLIMMIX procedure to perform a logistic regression, but I could be wrong since I have never used the GLIMMIX procedure myself. Even then, I doubt that it would be superior to the approach used in this report, given that the report uses the GROUP= option to allow for different variances for each of the selenium levels.

Kinley Larntz

I agree that GLIMMIX, or some procedure that takes account of the binomial nature of the measurements, would be a better choice. The situation here is considerably different from the survival data, which had approximately equal sample sizes. The unequal sample sizes and extreme imbalance in sampling make it imperative to construct a more appropriate statistical model. Again, I would construct an appropriate Bayesian hierarchical model to handle these data, but other methods exist.

One very simple nonparametric method would be to use a permutation test with stream as the unit of analysis. An objective function might be the rank correlation of overall % abnormality in a stream with dose (0, 2.5, 10).
3b. An internal review of the draft Cadmus report suggested the use of logistic regression for analysis of the Field Nest % abnormality data, specifically the GLIMMIX procedure in SAS. Based on your experience, would logistic regression (including but not necessarily limited to the GLIMMIX procedure) likely be superior to the ANOVA approach used in the draft report and still be consistent (in terms of degrees of freedom) with the notion that "stream" is the experimental unit of study? Do you agree that "stream" is the appropriate experimental unit of study in the MERS study for the Field Nest abnormality data?
For each of the three responses, the lowest % abnormalities occur in the control streams; the next two lowest occur in the 2.5 dose streams, and the highest two occur in the 10 dose streams. This results in the highest possible rank correlation under the permutation null hypothesis that randomly assigns streams to dosages. If I've done my counting right, there are 90 distinct possible random assignments, so the p-value associated with the observed outcome is 1/90, which would be considered statistically significant. This is true for all three responses.
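The counting above can be verified by brute force: there are 6!/(2! 2! 2!) = 90 distinct ways to assign six streams to three dosages with two streams each, and for a perfectly monotone outcome only the observed assignment attains the maximum correlation with dose, giving an exact p-value of 1/90. The sketch below uses a plain product-moment correlation between dose labels and response ranks (a simplification that is order-equivalent to the rank correlation for this enumeration).

```python
# Enumerate the 90 distinct stream-to-dose assignments and compute the
# exact permutation p-value for a perfectly monotone observed outcome.
from itertools import permutations

def corr(x, y):
    """Product-moment correlation, used here as the ordering statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

response_ranks = [1, 2, 3, 4, 5, 6]          # streams ordered by % abnormality
observed_doses = [0, 0, 2.5, 2.5, 10, 10]    # lowest two in controls, etc.

assignments = set(permutations(observed_doses))   # 6!/(2!2!2!) = 90 distinct
obs_corr = corr(observed_doses, response_ranks)
as_extreme = sum(corr(a, response_ranks) >= obs_corr - 1e-12
                 for a in assignments)
print(len(assignments), as_extreme / len(assignments))   # 90, 1/90
```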

In conclusion, because of the unequal sampling and imbalance of nest occurrences, the analyses presented using maximum % abnormality or rank-transformed values do not present an appropriate picture of what is going on here. I'm not sure if GLIMMIX would be able to model the data well enough since there are many zero counts, but I believe an appropriate Bayesian hierarchical model would be better.

Clifton D. Sutton

I don't think that using maximum percent variables as response variables is appropriate for logistic regression models with the field nest data sets, but it may be that the phenomena can be successfully modeled in some way using logistic regression. For example, each nest can be labeled a "success" or a "failure" according to whether it's below or above a certain appropriate threshold, and then this dichotomized data can be modeled using logistic regression. It seems to me that this would be a way of addressing whether or not selenium concentration affects progeny while avoiding some of the difficulties of the ANOVA approach.

Streams play the same role in the logistic regression model as they do in the ANOVA model, and the test of variation due to stream differences would have three degrees of freedom in the logistic regression model, just like it does in the ANOVA model.

Dallas E. Johnson

I believe the repeated-measures analysis performed in the report and its associated statistical model are the best choice for analyzing these data. I know of no alternative procedures to suggest.

Kinley Larntz

My criticisms listed in response to the previous question apply to the repeated-measures ANOVA in the scales presented. I think GLIMMIX would be much, much better, but again a Bayesian hierarchical model would be even better at capturing what is going on.

The interaction between age and treatment is likely just a result of using an inappropriate scale. As I noted above, all three responses follow the same monotonic pattern (lowest abnormality rates in the control streams, next two lowest in the 2.5 dosage streams, and highest values in the 10 dosage streams). Thus, there is no substantive interaction, in the sense that there exists a monotonic transformation that removes the interaction. A model that appropriately accounts for cumulative exposed dosage would certainly not show an interaction.
4. Do you consider the repeated-measures ANOVA procedure and associated statistical model (p. 24) to be the most appropriate choices for analyzing time-dependent effects of selenium on larval abnormalities? If not, why not, and which alternative procedures and models would you recommend and why?
Clifton D. Sutton

The apparent extreme non-normality and heteroscedasticity make me leery about trusting the repeated-measures ANOVA procedure. In addition to the problems noted in the draft (non-normality, nonconstant variance, and estimates of variances equal to zero), it can be noted (see Table 4-14 on page 26 of the draft) that a strong Age/Treatment interaction is observed in the analysis of the lordosis data, which seems a bit odd, especially considering that neither Treatment nor Age is statistically significant.

Unfortunately, I don't know of any good alternative to the repeated-measures ANOVA. Perhaps a better transformation can be found that will reduce concerns. Given that the 2.5 microgram/L concentration seems to have a mild effect compared to the effect of the 10 microgram/L concentration (assuming that there are some real effects), one might see if problems are reduced if only two treatments (10 microgram/L and the control) are used instead of all three concentrations.

The fact that some of the estimated variances are zero leads me to wonder whether the maximum percent incidence is a good choice for a response variable in this case. Could some other choice be made for the response variable that would still allow one to investigate the issues? If so, then perhaps such an alternative response variable would result in fewer problems with the analysis.

Dallas E. Johnson

As mentioned above in [comment for Question] 3, I don't think it is appropriate to use ranked data in a mixed model framework, and I don't think the paper by Iman suggests that ranking data works in a mixed model framework, but I could be wrong.

Kinley Larntz

Again, as explained above [comments for Question 4], I do not think analysis of these data in the raw or rank-transformed scales is appropriate, because of the extreme imbalance in the sample sizes and unequal numbers of nests. These transformations do not correct this major problem.

Clifton D. Sutton

While the ANOVA with ranks seems somewhat reasonable, given that there are inadequate research results indicating that it will be accurate in a situation with an unbalanced design, heteroscedasticity, and perhaps severe non-normality, I would feel much better about trusting such results if a Monte Carlo study were performed that could partially confirm that the technique might be appropriate for cases such as those under consideration. The fact that commonly used nonparametric methods are not available for situations such as we have here doesn't make the use of the utilized rank transformation scientifically defensible (and in fact, one might wonder why the rank-transformation scheme isn't used more if it's without worries). Nevertheless, my experience leads me to believe that it may be okay, although it may be that other scores would do better than integer ranks. Still, it troubles me a bit that not many have adopted Iman's proposed tactic. For example, in Milliken and Johnson's Analysis of Messy Data, this rank-transformation technique is not included.
5a. To address violations of normality and homogeneity of variance assumptions that occurred with raw and arcsine square root-transformed % abnormality data, these data were rank-transformed and re-analyzed using the same ANOVA procedures that were applied to the raw and arcsine square root-transformed data (i.e., the ANOVA of maximum % abnormality responses and repeated-measures ANOVA). Is this analysis of ranked data (e.g., p. 23 & p. 26) scientifically defensible given EPA's understanding that nonparametric methods do not exist for mixed model ANOVA designs?
With
the
commonly
used
nonparametric
methods
that
use
ranks,
one
can
think
that
under
the
null
hypothesis,
the
ranks
reflect
the
natural
variation
of
the
response
variable,
and
the
sampling
distribution
arises
by
a
consideration
of
all
possible
ways
in
which
the
experimental
units
could
have
been
randomly
assigned
to
the
various
treatments,
preserving
blocks
and
any
other
features
of
the
design.
But
as
opposed
to
a
randomized
block
design,
where
Friedman's
test
is
performed
by
ranking
each
block
separately,
it
seems
the
rank
method
utilized
in
this
study
ranked
all
of
the
observations
with
a
single
ranking.
I've
seen
this
done
and
justified
in
stratified
permutation
tests,
but
it
is
a
bit
different
from
the
usual
rank-based
approaches.
My
greatest
worry
is
that
the
technique
may
work
okay
as
long
as
the
variances
aren't
too
unequal,
but
may
be
unreliable
if
the
variances
differ
too
much.
Again,
one
way
to
investigate
this
issue
would
be
to
do
a
Monte
Carlo
study.
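The distinction Sutton draws here, a single global ranking versus Friedman-style ranking within each block, can be illustrated with a toy example (hypothetical numbers; `avg_ranks` is a small helper written for this sketch, with ties given their average rank):

```python
def avg_ranks(values):
    """Ranks 1..n, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Hypothetical responses: rows are blocks (e.g., streams), columns are treatments.
data = [[4.0, 7.0, 9.0],
        [1.0, 2.0, 8.0]]

# Single global ranking (as in the report): rank all six values together.
global_ranks = avg_ranks([x for row in data for x in row])

# Friedman-style ranking: rank within each block separately.
within_block = [avg_ranks(row) for row in data]

print(global_ranks)   # [3.0, 4.0, 6.0, 1.0, 2.0, 5.0]
print(within_block)   # [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
```

With within-block ranking, block-to-block location differences are removed before ranking; with a single global ranking, blocks with higher overall levels or greater spread can dominate the rank scale.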

Dallas
E.
Johnson
My main suggestion here is to use a GROUP= option on the RANDOM statement to deal with unequal variances. I
don't
have
any
other
transformations
to
suggest
other
than
the
ones
that
were
used
in
the
report.

Kinley
Larntz
Nonparametric
mixed
model
analyses
will
not
work
for
these
data.
The
correct
way
to
proceed
is
to
model
the
binomial
responses
using
a
hierarchical
model.
I'm
not
sure
GLIMMIX
can
capture
what
is
going
on
because
of
the
number
of
zero
counts.
Exact
methods
would
work
better
and,
of
course,
Bayesian
methods
would,
in
my
opinion,
be
even
better.

Clifton
D.
Sutton
There
are
no
obvious
alternatives.
Perhaps
some
sort
of
bootstrap
method
could
be
investigated,
but
I
think
such
a
method
would
have
to
be
rather
thoroughly
tested
in
a
Monte
Carlo
study
before
it
could
be
deemed
to
be
trustworthy.
A
nonconstant
variance
structure
could
make
bootstrapping
residuals
problematic.
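One way around the residual-bootstrap problem Sutton notes is to resample cases within each group, so that each group keeps its own variance rather than drawing from pooled residuals. The sketch below, on entirely hypothetical data, computes a percentile bootstrap confidence interval for a treatment-versus-control difference in means; as Sutton says, any such method would still need Monte Carlo validation before being trusted for these designs.

```python
import random

random.seed(7)

# Hypothetical % abnormality values for a control and a treatment group.
control = [2.0, 3.0, 1.5, 2.5]
treated = [9.0, 14.0, 11.0, 17.0, 8.0]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(treated) - mean(control)

# Case resampling within each group preserves each group's own spread,
# sidestepping the pooled-residual problem under unequal variances.
boots = []
for _ in range(5000):
    c = [random.choice(control) for _ in control]
    t = [random.choice(treated) for _ in treated]
    boots.append(mean(t) - mean(c))
boots.sort()

lo, hi = boots[int(0.025 * 5000)], boots[int(0.975 * 5000) - 1]
print(observed, (lo, hi))  # percentile 95% CI for the mean difference
```

With the very small sample sizes in these studies, such an interval would itself be unstable, which is exactly the concern raised above.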

Another
possibility,
although
it
should
perhaps
be
viewed
as
a
remote
possibility,
is
to
develop
some
sort
of
robust
procedure
similar
to
the
one
used
for
a
split-plot
design
in
Wilcox's
Introduction
to
Robust
Estimation
and
Hypothesis
Testing.
But
I
fear
that
small
sample
sizes,
along
with
an
unbalanced
design,
would
make
such
an
approach
extremely
difficult
to
tame.
Once
again,
the
appropriateness
of
any
such
method
for
a
situation
such
as
the
one
at
hand
should
be
investigated
with
a
Monte
Carlo
study.

The
only
simple
thing
I
can
suggest
is
to
explore
other
transformations,
but
I'm
not
highly
hopeful
that
a
good
transformation
can
be
found.
Once
again,
I'll
suggest
that
some
other
response
variables
be
considered: variables
that
address
the
issues,
but
are
perhaps
better
suited
for
ANOVA
methods
with
the
given
data.
5b.
If
not,
do
you
consider
any
other
statistical
methods
or
data
transformations
superior
to
the
ranking
procedure
used
in
the
draft
report
for
addressing
the
violations
of
normality
and
homogeneity
of
variance
assumptions
using
a
mixed-model
design?
If
so,
which
ones
do
you
recommend
and
why?
STUDY
II,
EGG
CUP
DATA
Dallas
E.
Johnson
I
believe
that
the
SAS-MIXED
procedure
is
the
most
appropriate
approach
for
analyzing
%
hatch,
%
survival,
and
maximum
%
abnormality
data.
To
deal
with
unequal
variances,
one
might
again
use
the
GROUP=
option.
I
don't
believe
that
analyzing
ranks
is
appropriate
for
mixed
model
data.

Kinley
Larntz
These
data
have
many
of
the
same
characteristics
as
the
field
nest
data.
The
percentages
are
based
on
different
denominators
and
the
number
of
spawns
available
varies
from
stream
to
stream.
Thus,
analyses
that
use
standard
ANOVA
assumptions
are
doomed
to
failure.
Rank
transformed
data
are
not
valid
since
the
data
have
different
precision,
being
based
on
different
sample
sizes.
This
criticism
applies
to
all
the
analyses
reported
in
Section
4.2.2.

Clifton
D.
Sutton
As
was
the
case
for
the
field
nest
data,
using
ANOVA
and
PROC
Mixed
was
a
reasonable
thing
to
try,
but
I
have
concerns
about
accuracy
due
to
the
violations
of
the
assumptions
of
normality
and
homoscedasticity.
Perhaps
an
ANOVA
analysis
would
cause
fewer
concerns
if
the
data
could
be
viewed
as
being
more
appropriate,
given
the
assumptions.
Maybe
using
another
transformation,
not
using
the
2.5
microgram/
L
concentration
data
(
and
thus
using
just
two
treatments
instead
of
three),
or
using
an
alternative
response
variable
would
result
in
a
more
trustworthy
analysis.

Dallas
E.
Johnson
Unless
you
have
data
on
individual
eggs,
I
doubt
that
logistic
regression
would
be
applicable
to
the
data.
I
doubt
that
logistic
regression
would
be
a
superior
approach
to
the
MIXED
approach.
Stream
is
the
appropriate
experimental
unit
for
the
egg
cup
abnormality
data.

Kinley
Larntz
A
model
that
accounts
for
the
binomial
nature
of
the
responses
and
hierarchical
nature
of
the
design
must
be
used
for
these
analyses.
GLIMMIX is a reasonable start, but because there are many zero counts, the methods used by GLIMMIX may also give misleading results. As I've mentioned already, a Bayesian hierarchical model would be able to reflect the structure of the data and the design and, thus, seems appropriate.

Clifton D. Sutton
Due to the problems with the ANOVA methods, I'll advise giving strong consideration to a logistic regression approach similar to the one described previously for the field nest data.

6. Considering the experimental design and limitations associated with the Study II egg cup data set, do you consider the combined ANOVA model (p. 26-27) and Proc Mixed SAS procedure the most appropriate approach for analyzing the % hatch, % survival, and maximum % abnormality data? If not, why not and which alternative procedures and models would you recommend and why?

6b. Again, internal reviewers recommended the use of logistic regression (e.g., GLIMMIX in SAS) for the analysis of Egg Cup % abnormality data. Based on your experience, would logistic regression (including but not necessarily limited to the GLIMMIX procedure) likely be superior to the ANOVA approach used in the draft report and still be consistent (in terms of degrees of freedom) with the notion that "stream" is the experimental unit of study? Do you agree that "stream" is the appropriate experimental unit of study in the MERS study for the Egg Cup abnormality data?

Dallas
E.
Johnson
I
agree
that
the
repeated
measures
procedures
used
in
the
report
and
their
associated
statistical
models
are
the
most
appropriate
approaches
for
analyzing
time
and
concentration
effects
on
the
egg
cup
%
abnormality
responses.

Kinley
Larntz
I
do
not
believe
ANOVA
procedures
can
capture
what
is
going
on
with
these
data
for
reasons
given
in
the
previous
answer.

Clearly,
a
procedure
that
reflects
the
binomial
nature
of
the
responses
and
accounts
for
the
imbalance
in
the
design
is
required.
The
ANOVA
procedures
do
not
do
that.
Again,
a
Bayesian
hierarchical
model
could
do
that.
Also,
carefully
thought
through
exact
procedures
may
work,
as
many
of
the
counts
are
small.

Clifton
D.
Sutton
Once
again,
the
non-normality,
heteroscedasticity,
estimated
variances
of
zero,
and
the
unbalanced
design
give
me
reasons
to
doubt
the
accuracy
of
the
ANOVA
methods,
but
also,
once
again,
there
are
no
simple
alternatives
that
can
be
trusted
without
a
rather
thorough
investigation
(
using
a
Monte
Carlo
study).
Perhaps
an
ANOVA
analysis
would
cause
fewer
concerns
if
the
data
could
be
viewed
as
being
more
appropriate,
given
the
assumptions.
Maybe
using
another
transformation,
not
using
the
2.5
microgram/
L
concentration
data
(
and
thus
using
just
two
treatments
instead
of
three),
or
using
an
alternative
response
variable
would
result
in
a
more
trustworthy
analysis.

Dallas
E.
Johnson
I
don't
believe
that
analyzing
ranks
is
appropriate
for
experimental
designs
that
involve
nesting.
Analyzing
the
arcsin
transformed
data
is
probably
the
most
appropriate
thing
to
do.
7.
Do
you
consider
the
repeated-measures
ANOVA
procedures
(
full
and
partial
time
series)
and
associated
statistical
models
the
most
appropriate
approaches
for
analyzing
time­
course
effects
of
selenium
on
the
egg
cup
%
abnormality
responses?
If
not,
why
not
and
which
alternative
procedures
and
models
would
you
recommend
and
why?

8.
Similar
to
question
5a
and
5b
above,
is
the
analysis
of
ranked­
transformed
%
abnormality
data
from
the
egg
cup
study
appropriate
and
scientifically
defensible?
If
not,
why
not
and
which
alternative
statistical
procedures
would
you
recommend
and
why?
Kinley
Larntz
The
rank-transformed
%
abnormality
data
are
based
on
different
sample
sizes
for
each
observation.
As
such,
the
resulting
transformed
values
do
not
reflect
the
structure
of
the
data.

If
I
wished
to
carry
out
an
appropriate
nonparametric
test,
I
would
consider
a
permutation
test
based
on
the
basic
randomization.
In
this
case,
there
are
five
streams,
two
control,
one
2.5
dosage,
and
two
10
dosage.
For
several
of
the
responses,
it
is
clear
that
the
lowest
two
percent
abnormality
streams
are
found
in
the
control,
the
2.5
stream
is
next,
and
the
two
10
dosage
streams
exhibit
the
highest
abnormality
percentages.
In
these
cases,
the
rank
correlation
of
percent
abnormality
versus
dosage
is
as
large
as
it
can
be
for
the
observed
data
configuration.
There
are
30
distinct
possible
random
assignments
of
streams
to
dosages.
So
a
permutation
test
would
yield
a
statistically
significant
p-value of 1/30
for
response
measures
that
follow
this
pattern.
For
any
other
pattern,
the
results
would
not
attain
significance
at
the
usual
0.05
level.
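Larntz's counting argument can be checked directly: with two control streams, one 2.5 µg/L stream, and two 10 µg/L streams, there are 5!/(2!·1!·2!) = 30 distinct assignments, and a fully monotone pattern attains the most extreme value of a dose-association statistic in exactly one of them. A minimal sketch, using hypothetical response values:

```python
from itertools import permutations

# Hypothetical % abnormality values for the five streams (illustration only).
resp = [1.0, 2.0, 5.0, 9.0, 10.0]
# Dosages: two control streams, one 2.5 ug/L stream, two 10 ug/L streams,
# aligned here with the monotone response pattern Larntz describes.
doses = (0.0, 0.0, 2.5, 10.0, 10.0)

# Distinct assignments of streams to dosages; duplicates collapse because
# the two control streams and the two 10 ug/L streams are interchangeable.
assignments = set(permutations(doses))
print(len(assignments))  # 30 = 5! / (2! * 1! * 2!)

# A monotone-association statistic: large when high doses sit on high responses.
def stat(d):
    return sum(di * ri for di, ri in zip(d, resp))

observed = stat(doses)
count = sum(1 for a in assignments if stat(a) >= observed)
p_value = count / len(assignments)
print(p_value)  # 1/30, attained only by the observed monotone pattern
```

Any non-monotone configuration of the responses would yield a larger count of equally extreme assignments and thus a p-value above 0.05, exactly as stated above.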

Clifton
D.
Sutton
The large differences (nearly a factor of 100 in some places) between the p-values from the raw data and the rank-transformed data in some parts of Table 4-24 on page 33 of the draft are bothersome, since if two methods yield such different results it suggests that perhaps one of them, or perhaps both of them, are rather inaccurate. But which method is more trustworthy? On the one hand, due to the nature of the data, we can suspect that the ANOVA on the raw data is subject to inaccuracy, but on the other hand, how much is known about the accuracy of the ANOVA procedures with the rank-transformed data? If I had to choose, I'd rely on the use of the rank-transformed data, but it's not clear that I could strongly defend this tactic.

Other
alternatives
are
bootstrap
methods
and
the
creation
of
a
robust
technique,
but
such
possibilities
are
problematic
for
the
reasons
given
previously.
Still,
they
could
be
developed
and
investigated,
but
this
would
be
a
lot
of
work
that
may
lead
to
no
good
solution.

Dallas
E.
Johnson
I
believe
the
arguments
in
favor
of
omitting
data
beyond
day
3
are
sound
and
I
don't
have
any
alternative
statistical
approaches
to
recommend.

Kinley
Larntz
I
have
no
strong
opinion
on
which
analysis
is
better
from
a
statistical
point
of
view.
I
think
both
are
flawed
statistically.
Based
on
a
layman's
understanding,
if
it
is
believed
that
the
effects
of
day
4
and
5
are
due
to
starvation,
then
these
data
should
not
be
used.

Clifton
D.
Sutton
Due
to
the
possibility
that
a
starvation
effect
could
affect
the
results
(
as
noted
on
page
33
of
the
draft),
along
with
the
fact
that
the
proportion
of
empty
cells
is
reduced
by
doing
so,
I
favor
the
elimination
of
Days
4
and
5,
and
using
the
partial
time-series
data
set.
9.
To
address
potential
confounding
effects
of
larval
starvation
on
the
analysis
of
the
%
abnormality
data,
a
partial
time
series
was
analyzed
(
i.
e.,
data
beyond
day
3
were
omitted
from
the
analysis).
Do
you
prefer
the
analysis
of
the
partial time-series
data
set
to
the
analysis
of
the
full
time
series
data
set?
Do
you
consider
any
alternative
statistical
approaches
superior
to
the
partial
time-series
analysis
used
here?
If
so,
which
ones
and
why?
Dallas
E.
Johnson
It
might
be
desirable
to
combine
data
from
both
Study
I
and
Study
II.
You
might
be
able
to
do
this
in
a
couple
of
different
ways.
One
approach
might
be
to
eliminate
the
30
µg/L
data
(
since
there
is
little
doubt
that
this
concentration
level
has
significant
effects
on
bluegill),
and
then
to
treat
the
data
from
both
studies
as
though
it
came
from
12
different
streams.
A
second
approach
might
be
to
estimate
differences
between
concentration
levels
and
their
standard
errors
from
each
study
and
then
combine
the
information
into
a
single
estimate
and
standard
error
using
meta
analysis
techniques.
A
disadvantage
of
combining
the
two
studies
is
that
the
first
one
has
already
been
published
so
it
might
be
best
to
write
a
paper
on
the
second
study
and
then
write
a
second
paper
that
combines
the
results
of
the
two
studies.

Kinley
Larntz
I
think
it
is
an
excellent
idea
to
combine
the
results
of
the
two
studies.
Both
studies
included
control
and
10
dosages,
so
it
should
be
relatively
easy
to
"
calibrate"
them
and
pool
the
results.
Of
course,
I
believe
the
modeling
required
is
considerably
more
complex
than
the
ANOVA
models
used
in
this
report.
A
full
model
would
account
for
the
binary
nature
of
the
responses
and
imbalance
in
the
design.
It
also
must
be
kept
in
mind
that
data
that
are
missing
because
of
death
are
not
"
missing
at
random,"
but
are
informative
and
must
be
included
in
the
modeling.

As
said
again
and
again,
Bayesian
hierarchical
modeling
would
provide
a
basis
for
appropriate
inference.
Also,
carefully
constructed
exact
methods
may
also
work,
but
it
would
be
a
much
larger
job
to
develop
appropriate
methods.

Clifton
D.
Sutton
Given
that
the
data
from
Study
II
is
rather
messy
and
presents
problems
when
one
tries
to
analyze
it,
I
don't
think
it
is
wise
to
try
to
combine
it
with
the
data
from
Study
I,
especially
since
there
is
no
good
way
to
account
for
the
effect
of
time.

Dallas
E.
Johnson
I
don't
have
any
recommendations
to
make
concerning
the
issue
of
cumulative
impacts
or
competing
effects
of
selenium
in
the
statistical
analysis.
10.
An
earlier
MERS
study
(
i.
e.,
the
attached
Study
I;
Hermanutz
et
al.,
1992)
involved
the
continuous
exposure
of
bluegills
to
selenium
treatments
of
10
and
30
µg/L
in
addition
to
controls.
EPA
has
considered
combining
the
results
from
continuous
selenium
exposure
in
Study
I
and
II
for
statistical
analysis,
which
would
result
in
three
concentrations
(
2.5,
10,
30
µg/L)
plus
controls.
However,
data
(
particularly
%
larval
abnormality)
are
very
limited
at
30
µg/L
in
Study
I
due
to
severe
adult
mortality.
With
this
limitation
in
mind,
do
you
consider
the
combination
of
data
(
e.
g.,
%
larval
abnormality)
from
Study
I
and
Study
II
to
be
both
desirable
and
feasible
from
a
statistical
perspective?
If
so,
how
would
one
account
for
the
effect
of
"
time"
(
i.
e.,
different
experiments
in
different
years)
on
the
results?

11.
From
a
statistical
perspective,
how
would
you
recommend
addressing
the
issue
of
cumulative
impacts
or
competing
effects
of
selenium
in
the
statistical
analysis
(
i.
e.,
the
potential
influence
or
bias
that
death
of
organisms
might
have
on
the
incidence
of
sublethal
effects
such
as
larval
abnormality?
Kinley
Larntz
The
analyses
reported
here
are
conditional
in
the
sense
that
what
is
analyzed
is
"
conditional
on
survival."
If
you
wish
an
analysis
that
combines
death
and
the
responses
of
alive
cases,
you
would
have
to
create
a
model
that
includes
both.
This
would
be
complex,
but
could
be
done.
It
must
be
noted
that
some
modeling
would
have
to
be
done
to
impute
values
for
the
"
dead"
cases.
I
doubt
that
creation
of
an
unequivocal
model
is
possible.

Clifton
D.
Sutton
One
possibility
would
be
to
assign
a
cumulative
score,
awarding
points
for
various
abnormalities
and
for
death.
In
a
way,
this
turns
a
multivariate
response
into
a
univariate
response.
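Sutton's scoring idea can be made concrete with a small sketch. The point values below are entirely hypothetical; in practice the weights would need to be chosen, and defended, by the investigators.

```python
# Hypothetical severity weights for each outcome observed on a larva.
WEIGHTS = {"edema": 1, "lordosis": 2, "hemorrhage": 2, "death": 5}

def cumulative_score(outcomes):
    """Collapse a multivariate outcome list into a single univariate score."""
    return sum(WEIGHTS[o] for o in outcomes)

larvae = [["edema"], ["edema", "lordosis"], ["death"], []]
scores = [cumulative_score(o) for o in larvae]
print(scores)  # [1, 3, 5, 0]
```

Because death receives a score rather than being dropped, dead organisms contribute information instead of biasing the sublethal-effect comparisons, though conclusions will depend on the chosen weights.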

Dallas
E.
Johnson
I
tend
to
agree
with
the
summary
and
conclusions
presented
in
Section
5
of
the
report
except
for
the
ones
that
involve
ranked
data.
I
believe
that
the
mixed
model
analysis
is
fairly
robust
and
will
work
reasonably
well
even
though
the
data
might
not
be
normal, particularly
if
one
makes
adjustments
for
the
unequal
variances.
I
would
like
to
see
the
analyses
after
making
adjustments
for
unequal
variances
using
the
GROUP=
option
in
SAS PROC MIXED.

Kinley
Larntz
I
do
believe
that
there is
an
effect
of
selenium
on
the
%
abnormality
results.
I
do
not
believe
that
the
statistical
methods
used
here
present
a
compelling
case
for
the
specific
results
summarized
in
Section
5.
Modeling
that
accounts
for
the
binomial
nature
of
the
responses
and
the
imbalance
of
the
design
is
required
to
draw
appropriate
conclusions.

Clifton
D.
Sutton
Given
the
available
information,
I'll
guess
that
most
of
the
conclusions
are
generally
good,
while
strongly
noting
that
many
of
them
need
to
be
viewed
cautiously.
Overall,
I'd
feel
much
better
if
the
conclusions,
other
than
those
based
on
repeated
measures,
could
be
supported
with
the
analysis
of
logistic
regression
models,
since
we
don't
want
to
rely
so
heavily
on
guesswork.
If
the
analysis
of
the
repeated
measures
data
seems
on
the
whole
inconsistent
with
the
rest
of
the
data,
then
I'd
be
rather
suspicious
of
the
repeated
measures
analysis.

It
should
be
well
noted
that
I
cannot
give
a
strong
opinion
about
the
summary
and
conclusions
without
spending
much
more
time
analyzing
the
data
myself.
It
needs
to
be
kept
in
mind
that
some
results
of
the
report
need
to
be
viewed
as
inconclusive
due
to
problems
associated
with
the
messy
data.
But
my
hunch
is
that
the
10
microgram/
L
concentration
does
make
a
difference,
while
the
2.5
microgram/
L
concentration
makes
very
little
difference
(
relative
to
the
control).
12.
Do
you
agree
with
the
summary
and
conclusions
presented
in
Section
5
of
the
report?
Why
or
why
not?
STUDY
III
DATA
Dallas
E.
Johnson
I
don't
have
any
additional
advice
for
Study
III
data
other
than
incorporating
many
of
my
comments
above
towards
the
Study
III
analyses.

Kinley
Larntz
I
have
no
additional
comments
on
statistical
analyses.
Appropriate
analyses
would
not
eliminate
the
zero
response
observations.
This
is
a
limitation
of
using
"
asymptotic"
methods
for
small
sample
responses.

13. Nearly all of the questions listed above for the analysis of Study II results are applicable to the analysis of Study III results. One exception is that those questions pertaining to use of the repeated-measures ANOVA are not applicable to Study III since the preponderance of zero response observations negated the use of the repeated-measures ANOVA in Study III. With this exception in mind, do you advise anything different for the analysis of Study III data than what you recommend for Study II? If so, what would you recommend and why?

2.3 Specific Comments
Dallas E. Johnson
I wish that I could have seen printouts of the data that were actually analyzed. I had to follow all of the SAS code to see how the data sets that were printed in Appendix E were actually utilized in the statistical analyses. Nevertheless, I still don't have much of an idea of what the end-point data sets looked like before being analyzed by the MIXED model procedures. I don't think my conclusions would change, but I can't say that for sure. Should you wish to send me copies of these data sets, I would be happy to look them over.

Clifton D. Sutton
Page 3
In the parts of the table that give comments about the egg cup data, it is noted that percent hatch values exceeding 100% were truncated to 100% only for arc-sine square-root transformation. It seems odd to me not to just truncate them for all of the analyses.

Clifton D. Sutton
Page 3, Paragraph 6
The studentized range test is referred to as the standardized range test.

Clifton D. Sutton
Page 11, Paragraph 2
It is stated that "because the dataset contains only six observations spanning three treatments, the assumption of normal independent errors may not hold." To me, this reflects a lack of understanding. The sample size shouldn't affect how closely the assumption of normality is met. (The sample size affects the robustness of many methods, and affects one's ability to detect non-normality.)

Kinley Larntz
Page 11, Line 11
The sentence, "Because the dataset contains only six observations spanning three treatments, the assumption of normal independent errors may not hold," makes no sense statistically. The validity of the assumption of normal independent errors does not depend on sample size whatsoever.
Kinley
Larntz
Page
12
Residual
plots
on
page
12
are
meaningless
in
such
a
small
study.
Also,
it
is
not
appropriate
to
put
the
short-term and long-term
residuals
on
the
same
plot.
Thus,
if
they
are
to
be
done,
there
should
be
four
plots,
not
two.
But
because
of
the
structure
imposed
by
the
design
on
the
residuals,
these
plots
are
meaningless.
Using a transformation is sensible even without the plots.
The
plots
do
not
indicate
the
need
for
a
transformation!
The
plots
merely
indicate
the
stream-to-stream
variation
within
each
treatment.

Kinley
Larntz
Page
32,
Line
1
It
is
stated,
"
Due
to
the
relatively
high
incidence
of
zero
abnormalities
at
Day
1
for
%
edema
and
%
lordosis
(
see
Table
4-22),
these
observations
were
eliminated
from
the
repeated-measures
ANOVA."
Elimination
of
zero
counts
is
not
appropriate.
They
are
obviously
informative
data.
Statistical
methods
must
be
used
that
do
not
require
such
elimination.

Clifton
D.
Sutton
Page
33,
Table
4­
24
I
wonder
if
the
value
of
0.0041
in
the
Day
row
and
the
lordosis
column
of
the
rank-transformed
data
section
is
a
mistake.
It
can
be
noted
that
it's
exactly
equal
to
the
value
right
next
to
it
in
the
hemor.
column,
and
that
this
value
is
about
100
times
smaller
than
the
corresponding
value
in
the
raw
data
section
of
the
table,
whereas
in
most
cases,
going
from
the
raw
data
to
the
ranks
doesn't
produce
such
big
differences.

2.4
Miscellaneous
Comments
No
miscellaneous
comments
were
provided
by
the
reviewers.

2.5
Additional
References
Recommended
For
Inclusion
in
The
Document
No
additional
references
were
provided
by
the
reviewers.

2.6
References
of
Interest
Clifton
D.
Sutton
Note:
These
aren't
references
that
necessarily
need
to
be
included
in
the
final
document.
Rather,
they
are
sources
that
I
consulted,
and
may
be
of
interest
to
those
who
give
consideration
to
my
comments.

Miller,
R.
G.,
Jr.;
Beyond
ANOVA:
Basics
of
Applied
Statistics;
Chapman
and
Hall;
1997.

Milliken,
G.,
and
D.
Johnson;
Analysis
of
Messy
Data,
Volume
I:
Designed
Experiments;
Chapman
and
Hall;
1992.

Pearson,
E.
S.,
and
H.
O.
Hartley;
Biometrika
Tables
for
Statisticians,
Volume
2;
Cambridge
University
Press;
1972.

Wilcox,
R.;
Introduction
to
Robust
Estimation
and
Hypothesis
Testing;
Academic
Press;
1997.
