Idaho Department of Environmental Quality

Concepts and Recommendations for Using the "Natural Conditions" Provisions of the Idaho Water Quality Standards

Christopher Mebane
Don Essig

Idaho Department of Environmental Quality
Boise, Idaho

April 2003
Contents

Introduction ........................................................ 2
Concepts to consider when evaluating whether natural background conditions exceed numeric criteria ........................................................ 4
Natural Variability ........................................................ 8
Measurable Changes ........................................................ 14
    Detection Limits ........................................................ 14
    Statistical Considerations ........................................................ 17
Practical Approaches to Estimating Natural Conditions ........................................................ 19
    Natural Watersheds ........................................................ 19
        Forest Watersheds ........................................................ 20
        Rangeland Watersheds ........................................................ 22
    Comparison to Reference Streams ........................................................ 26
    Stream Temperature Models ........................................................ 26
    Large River Basins ........................................................ 28
    Biological Assessment ........................................................ 29
        Balanced Indigenous Populations in Large Rivers ........................................................ 29
        "Healthy" Streams and Rivers ........................................................ 29
    Lakes ........................................................ 30
    Metals ........................................................ 31
Acknowledgements ........................................................ 32
References ........................................................ 33
Appendix: Excerpts from Idaho Water Quality Standards relevant to Natural Background Conditions ........................................................ 39

(Blue text indicates hyperlinks to regulatory definitions or other internal links)
Measurable Changes

Factors affecting whether a change in a water's characteristic, such as temperature or a pollutant concentration, is measurable include detection limits and statistical considerations.

Detection Limits

Any measuring method has some inherent limit of detection that is based on the instrument and the parameter of interest. A limit of detection is the lowest amount of a substance or other parameter that can be reliably detected, based on the variability of either the blank response or that of a low-level standard. A related term is the quantitation limit, which is the lowest level at which a substance may be accurately measured and reported without qualification as an estimated value. In chemical analyses, this is often estimated to be 5 times the detection limit (EPA 1991).
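These definitions can be illustrated numerically. One common convention (among several in use) estimates the detection limit from roughly 3 standard deviations of replicate blank responses, with the quantitation limit at 5 times the detection limit per the rule of thumb cited above. The blank readings in this Python sketch are hypothetical.

```python
# Minimal sketch of detection and quantitation limits, assuming the
# common 3-standard-deviations-of-the-blank convention; the EPA (1991)
# rule of thumb cited above puts the quantitation limit at 5x the
# detection limit. The blank readings below are invented.
from statistics import stdev

blanks = [0.12, 0.08, 0.15, 0.10, 0.09, 0.13, 0.11]  # hypothetical blank responses, ug/L

detection_limit = 3 * stdev(blanks)        # lowest reliably detected level
quantitation_limit = 5 * detection_limit   # lowest level reportable without qualification

print(f"detection limit ~ {detection_limit:.2f} ug/L, "
      f"quantitation limit ~ {quantitation_limit:.2f} ug/L")
```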

Field measurements of water temperature are routinely made by deploying data-logging thermistors. Typically these devices display and record values, without rounding or truncating, to 1/100th of a degree (0.01 °C). However, this display of apparent precision can be misleading. These devices are manufactured to be accurate to ±0.2 °C, based on comparisons to NIST standards (e.g., Onset Computer Corporation, www.onsetcomp.com). Temperatures recorded by data loggers are also digitized in discrete steps; the size of the step is determined by the amount of memory allocated to each measurement. For example, data loggers commonly use 8 bits to record a single temperature. Over a device's range of −4 to +38 °C, there are 2^8, or 256, steps, which works out to an average quantization error of 0.16 °C.
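The quantization arithmetic above is easy to reproduce. The following minimal Python sketch computes the step size for the 8-bit logger described in the text; the range endpoints and bit depth are the values quoted above.

```python
# Digitization step size of a data logger: the device range divided
# into 2**bits discrete steps, per the example in the text.

def quantization_step(t_min: float, t_max: float, bits: int) -> float:
    """Size of one digitization step (degrees C) over the device's range."""
    return (t_max - t_min) / (2 ** bits)

step = quantization_step(-4.0, 38.0, 8)
print(f"{step:.2f} degC per step")  # prints "0.16 degC per step"
```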

The repeatability of temperature measurements was further evaluated through a calibration test of the variability of the responses of several data loggers tested together in a bucket at constant temperatures. In this test, 18 units were set in a bucket at room temperature and allowed to equilibrate, and the temperature measured by each unit was then recorded at 1-minute intervals for 10 minutes. The units were then moved to an ice bath, allowed to equilibrate, and the temperatures recorded again. Results were consistent, whether calculated as the range of average temperatures, the differences of maximum temperatures, or the average of the range of differences recorded for each unit. At room temperature, the units were accurate to ±0.3 °C, and at freezing (ice bath), accurate to ±0.2 °C (Table 1). Since temperature criteria and concerns are usually focused on higher temperatures, not freezing, the room-temperature test is the more relevant for estimating the limits of repeatable temperature measurements.

Table 1. Bench tests of temperature measurement error at constant temperatures (°C)

Condition         Grand average     Range of      Differences in  Average range    Number of  Measurement
                  temperature       average       maximum         of temperature   sensors    interval
                  (± SD)            temperatures  temperatures    differences
Room temperature  21.64 (0.08)      0.28          0.32            0.32             18         1/minute
Ice bath          −0.01 (0.05)      0.15          0.17            0.18             18         1/minute
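The three repeatability summaries reported in Table 1 (range of average temperatures, differences of maximum temperatures, and average range of differences per unit) can be sketched in Python. The readings below are invented for illustration; they are not the actual bench-test data.

```python
# Repeatability summaries as described for the bench test.
# Each inner list holds one logger's readings at 1-minute intervals;
# all numbers are hypothetical.
readings = [
    [21.60, 21.62, 21.61],
    [21.70, 21.68, 21.71],
    [21.55, 21.58, 21.56],
]

# Range of average temperatures across units
avgs = [sum(r) / len(r) for r in readings]
range_of_averages = max(avgs) - min(avgs)

# Difference between the highest and lowest maximum recorded
maxima = [max(r) for r in readings]
diff_of_maxima = max(maxima) - min(maxima)

# Average, across units, of each unit's own range of readings
ranges = [max(r) - min(r) for r in readings]
avg_range = sum(ranges) / len(ranges)

print(range_of_averages, diff_of_maxima, avg_range)
```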
The bench test of the accuracy of temperature measurements in a static vessel at constant temperatures represents an environment of minimum variability. In practice, temperature comparisons are made in natural streams with fluctuating temperatures. These field conditions introduce additional variability into temperature measurements and comparisons. Temperature loggers are often placed to evaluate nonpoint or point source temperature effects (Zaroban 2000). For example, to assess temperatures related to nonpoint source watershed disturbances, loggers should be placed at the downstream end of a reach with relatively uniform morphology, land use, and cover. Once in the channel, the logger should be placed in a shaded spot where the water is well mixed and not influenced by warm or cool water sources such as ground water, tributary confluences, or direct sunlight. In flowing waters, well-mixed water normally occurs in the center of the thalweg. To show that the water at the site is well mixed and representative of reach conditions, horizontal and vertical mixing is verified with handheld temperature measurements (Zaroban 2000). These protocols for site selection and placement of temperature data loggers in streams minimize confounding field measurement error and improve the comparability of data between places and times. However, some added field variability is unavoidable.

Table 2 presents selected results of differences between temperature measurements from two reference sites with six replicate sensors, and 28 sites with duplicate sensors, deployed for the same 62-day periods. Each sensor was mounted in a flow-through shading canister and placed in a well-mixed portion of the stream following Zaroban (2000). Sensors were placed at the top and bottom of habitat and biological sampling reaches, about 40 stream channel widths apart, which worked out to 100 to 200 meters apart. Two sites were replicated to compare the variability of physical and biological measurements within what appeared to be representative reaches. Each temperature sensor was considered representative of the reach, so differences among these individually representative sensors can be considered measurement error. When reduced to conventional regulatory temperature metrics, the maximum daily average temperatures (MDATs) from these site replicates never varied by more than ±0.34 °C. The maximum daily maximum temperature (MDMT) differences were never greater than ±0.65 °C (Table 2).
Table 2. Differences in commonly used temperature metrics among sites with duplicate or multiple sensors. Each replicate reach (RR) had duplicate sensors: A, upstream end of reach; B, downstream end. (Source: Ott and Maret 2002)

(a) Replicated sites (3 replicates with 2 sensors each)

                     MDMT (°C)      MWMT (°C)      MDAT (°C)      MWAT (°C)
Stream               A      B       A      B       A      B       A      B
Big Creek, RR 1      19.54  19.54   18.44  18.40   13.79  13.79   12.95  12.94
Big Creek, RR 2      19.51  19.64   18.45  18.51   13.87  13.84   13.04  13.00
Big Creek, RR 3      18.99  19.05   17.96  18.08   13.53  13.66   12.70  12.85
Range                0.65           0.55           0.34           0.25
Valley Creek, RR 1   21.99          20.61          16.03          15.18
Valley Creek, RR 2   22.18  22.06   20.72  20.67   16.05  16.14   15.21  15.29
Valley Creek, RR 3   22.53  22.19   21.04  20.75   16.29  16.18   15.43  15.33
Range                0.54           0.43           0.26           0.25

(b) Non-replicated sites (28 sites with 2 sensors each)

                     MDMT (°C)      MWMT (°C)      MDAT (°C)      MWAT (°C)
Average difference   0.12           0.09           0.07           0.07
Maximum difference   0.34           0.29           0.23           0.22
Sites with multiple sensors tended to record slightly higher variability than the sites with just two sensors. Also, averaging reduces variability: the metric with no averaging (MDMT) was the most variable, and the metric with the most averaging (MWAT) had the least variability (differences between MDAT and MWAT were nearly identical).
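The four metrics can be computed from daily summary series. The sketch below assumes the conventional definitions: MDMT and MDAT are the maxima of the daily-maximum and daily-average series, and MWMT and MWAT are the maxima of 7-day moving averages of those series. The data are hypothetical.

```python
# Temperature metrics from daily summaries, assuming the conventional
# definitions (MWMT/MWAT as maxima of 7-day moving averages).

def moving_avg(values, window=7):
    """Moving averages over consecutive windows of the given length."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def metrics(daily_max, daily_avg):
    return {
        "MDMT": max(daily_max),              # maximum daily maximum temperature
        "MWMT": max(moving_avg(daily_max)),  # maximum weekly maximum temperature
        "MDAT": max(daily_avg),              # maximum daily average temperature
        "MWAT": max(moving_avg(daily_avg)),  # maximum weekly average temperature
    }

# Hypothetical 10-day record (degrees C)
daily_max = [18.0, 19.0, 19.5, 19.54, 19.2, 18.8, 18.5, 18.0, 17.5, 17.0]
daily_avg = [13.0, 13.5, 13.79, 13.6, 13.4, 13.2, 13.0, 12.8, 12.6, 12.4]
m = metrics(daily_max, daily_avg)
```

Because the weekly metrics average over 7 days, MWMT can never exceed MDMT and MWAT can never exceed MDAT, which matches the pattern in Table 2 of averaging reducing variability.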

While some measurements of temperature metrics at field replicate sites varied by up to 0.6 °C, most differences were 0.3 °C or less. From this we conclude that when appropriately deployed (e.g., following Zaroban 2000 or similar protocols), temperature data loggers are nearly as precise over long deployments in fluctuating environments as in bench tests. From these analyses, we conclude that potentially measurable changes in temperature are differences greater than 0.3 °C. Both the bench tests and the field tests show the remarkable stability and repeatability of properly deployed modern temperature data loggers.

For chemical analyses at concentrations above the quantitation limit, quality control guidelines for laboratory duplicate analyses are customarily set at a relative percent difference (RPD) of ±20% (EPA 1991; Beltman et al. 1993). However, as concentrations approach the limits of the instrument's capability to "see" the analytes, differences increase. Beltman et al. (1993) reported RPDs in field replicates of up to ±50% for dissolved copper at concentrations <10× the detection limit, but at higher concentrations RPDs were usually less than ±10%. Since the issue of a measurable change arises in the context of the requirement not to measurably exceed background conditions when background exceeds numeric standards, ambient concentrations will likely be sufficiently high that differences of >10% are at least potentially detectable in laboratory analyses.
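Relative percent difference, as used in the QC guidelines above, is the absolute difference between duplicate results expressed as a percentage of their mean. A minimal Python sketch with hypothetical duplicate results:

```python
# Relative percent difference (RPD) between duplicate analyses:
# |x1 - x2| as a percentage of the pair's mean.

def rpd(x1: float, x2: float) -> float:
    mean = (x1 + x2) / 2.0
    return abs(x1 - x2) / mean * 100.0

# Hypothetical duplicate dissolved-copper results, ug/L
print(f"{rpd(10.0, 12.0):.1f}%")  # about 18%, within the +/-20% guideline
```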
In contrast, in analyses of waters with concentrations approaching detection limits (and well below numeric standards), variability may be much higher. These types of low-level analyses may be needed for evaluations of special resource waters or nondegradation of high-quality waters. Mebane (2000) concluded that in the upper Salmon River, where ambient concentrations were very low, minimum detectable differences for copper and zinc were 2 µg/L and 13 µg/L, respectively. In cases like these with low concentrations, a large amount of the data may be below the detection limit, which requires special consideration in statistical analyses (Helsel 1990).
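As a minimal illustration of why censored results need special handling, the sketch below flags nondetects and applies the simplest common substitution (half the detection limit); Helsel (1990) discusses better distribution-based alternatives. All values are hypothetical.

```python
# Handling below-detection-limit ("censored") results. Substitution at
# DL/2 is shown only as the simplest common practice, not a
# recommendation; see Helsel (1990) for better methods.
DL = 2.0  # hypothetical detection limit, ug/L
raw = [5.1, None, 3.2, None, 2.4]  # None marks a nondetect

substituted = [v if v is not None else DL / 2 for v in raw]
n_censored = raw.count(None)

print(substituted, f"({n_censored} of {len(raw)} results censored)")
```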

Statistical Considerations

Statistics are an inherent part of evaluating changes from background conditions, as well as of nearly all water quality monitoring programs. A common question in monitoring programs is whether a particular management action is causing an adverse change in water quality. To quantitatively answer this question, it is necessary to acquire data and make statistical comparisons to other site(s), or to data from the site before and after the activity (MacDonald et al. 1991). In the present case of using natural background conditions to manage water quality where "pollutant levels shall not exceed the natural background conditions," it is necessary both to determine natural conditions and to determine whether they are exceeded. Ideally, once natural conditions are determined, monitoring and assessment to determine whether conditions are exceeded due to an action should be likely to detect differences in ambient water quality if in fact they exist (referred to as having a low Type II error in statistical jargon). Further, the assessment should be unlikely to falsely indicate there is a difference when in fact there is none; that is, when observed differences are just due to chance (low Type I error). The detection limit becomes an intrinsic part of statistical comparisons, which require selection of a minimum detectable effect.

Statistics are not "black box" calculations, nor should they be rote. Before statistically examining existing data for changes from background conditions or designing a monitoring program, the investigator must answer certain questions:

• Which is the greater concern: falsely concluding that an effect has happened, which could cause unnecessary expense or restrictions to dischargers, land managers, etc. (a Type I error), or failing to detect actual effects, which could allow environmental degradation (a Type II error)?

• How much increase in the parameter being evaluated (e.g., temperature, metals concentration, % fine sediments) is acceptable before concluding that values exceed natural conditions? Although the regulatory answer may be "no increase is acceptable," this is not a statistically workable answer, because no monitoring program or statistical test can detect an infinitesimal increase. A minimum detectable effect must be selected (MacDonald et al. 1991).

When working with the requirement for activities not to exceed background conditions when background exceeds numeric standards, a fundamental question when evaluating monitoring data is whether a significant change has occurred. The ability to statistically analyze this depends upon compromises among five interacting factors: sample size, variability, level of significance, power, and minimum detectable effect (MacDonald et al. 1991).

1. Sample size: A larger sample size increases the ability to detect a difference between two groups of samples.

2. Variability: The more variable a measure, the less the ability to detect significant change.

3. Level of significance: This refers to the probability that an apparently significant difference is not real but simply due to chance. This is referred to as α, or a Type I error. The α value is often arbitrarily set at 0.05 for confirmatory statistical tests and 0.10 in exploratory tests. An α of 0.10 means there is a 1 in 10 chance that an observed difference is due to chance, or that a test is 90% "confident." The lower the significance level is set, the more likely it is that an observed difference is real. However, lower significance levels also mean that a test has reduced power to detect real differences if they exist.

Significance testing requires choosing between a "one-tailed" and a "two-tailed" test. The one-tailed probability is exactly half the value of the two-tailed probability, so for a given test a one-tailed test is more likely to be significant. A two-tailed test is appropriate when the investigator cannot predict the direction of response based on theory; a one-tailed test is appropriate when the investigator can predict the direction of potential response, if any. For example, removal of riparian vegetation would be predicted to result in an increase in summertime stream temperatures, so a one-tailed test would be appropriate; however, removal of riparian shade could result in either an increase or a decrease in trout populations, due to increases in both primary productivity and temperature, so a two-tailed test would be appropriate.
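The halving relationship between one-tailed and two-tailed probabilities can be shown directly. This Python sketch uses a normal (z) approximation; the test statistic is hypothetical.

```python
# One-tailed vs. two-tailed p-values under a normal approximation.
# The two-tailed p counts both tails, so it is exactly twice the
# one-tailed p for the same test statistic.
from statistics import NormalDist

def z_test_p(z: float, one_tailed: bool) -> float:
    tail = 1.0 - NormalDist().cdf(abs(z))
    return tail if one_tailed else 2.0 * tail

z = 1.8  # hypothetical test statistic (e.g., upstream vs. downstream means)
p_one = z_test_p(z, one_tailed=True)
p_two = z_test_p(z, one_tailed=False)
print(p_one, p_two)  # the one-tailed p is exactly half the two-tailed p
```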

4. Power: The probability of detecting a difference when in fact one exists, designated (1 − β). β, or a "Type II" error, is the probability of incorrectly concluding that two groups of samples are the same when in fact they are different. In environmental sampling, β is commonly set at 0.25 to 0.1; that is, a test has a 75% to 90% probability of detecting a change if there is one. While higher probabilities would be desirable, because power function curves are logarithmic, as sample sizes increase, further increases in sample size make little improvement in a test's power. Tests with 90 to 95% statistical power and α of 0.05 or less would require huge sample sizes. Increasing the statistical power of a sampling plan reduces the likelihood of making a Type II error (failing to detect an actual difference), but at the same time increases the likelihood of making a Type I error (concluding there is a difference when none exists). As a starting point for evaluating whether activities result in an exceedence of natural background conditions, we suggest significance and power values of α < 0.1 and β < 0.2.
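The flattening of power with increasing sample size can be illustrated with a normal-approximation power calculation for a one-tailed two-sample comparison. The effect size, standard deviation, and α below are hypothetical, chosen only to show the shape of the tradeoff.

```python
# Sketch of the sample-size / power tradeoff: approximate power of a
# one-tailed two-sample z-test. All inputs are hypothetical.
from statistics import NormalDist

def power(n: int, effect: float, sd: float, alpha: float = 0.1) -> float:
    """Approximate power (1 - beta) to detect a mean difference `effect`
    with n samples per group and common standard deviation `sd`."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha)
    se = sd * (2.0 / n) ** 0.5  # standard error of the difference in means
    return 1.0 - nd.cdf(z_crit - effect / se)

# e.g., detecting a 0.3 degC temperature difference with sd = 0.3 degC
for n in (5, 10, 20, 40, 80):
    print(n, round(power(n, effect=0.3, sd=0.3), 2))
# power climbs quickly at small n, then flattens: doubling an already
# large sample buys very little additional power
```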

5. Minimum detectable difference (MDD): Determining how much change is acceptable, and thus needs to be detected in the ambient concentrations, is a key factor in