Transcript of Using Data Mining to Investigate Interaction between Channel Characteristics and Hydraulic Geometry...
- Slide 1
- Using Data Mining to Investigate Interaction between Channel
Characteristics and Hydraulic Geometry Channel Types Leong Lee,
Ph.D. Associate Professor, Dept. of Computer Science Austin Peay
State University, Tennessee, USA Gregory S. Ridenour, Ph.D.
Professor, Dept. of Geosciences Austin Peay State University,
Tennessee, USA
- Slide 2
- Introduction Hydraulic geometry is the study of variations in
channel characteristics with respect to variations in channel
discharge. The purpose of this paper is to illustrate the use of
data mining in hydraulic geometry to establish a large database for
five objectives: 1. empirical estimation of parameters in
theoretical equations (power functions), 2. classification (based on
a ternary diagram), 3. production of maps with a GIS, 4. assessment
of data quality by computing Euclidean distance to multivariate
means from previous studies, generating scatterplot matrices for
identifying outliers, and comparison utilizing Spearman's rank-order
correlation coefficient, and 5. pattern discovery via chi-square
analysis for goodness of fit with expected distributions and tests
of interaction of variables from pivot tables.
- Slide 3
- Hydraulic Geometry Leopold and Maddock coined the term
hydraulic geometry to refer to the set of equations that describe
the functional relationships between a stream's width (w), mean
depth (d), and mean velocity (v) and its discharge (Q):
w = aQ^b (1)
d = cQ^f (2)
v = kQ^m (3)
Q = ackQ^(b+f+m) (4)
Because discharge is the product of width, mean depth, and mean
velocity, (4) = (1) × (2) × (3), and it is evident that
ack = 1 and b + f + m = 1.
- Slide 4
- Ternary Diagram (b-f-m diagram) and Channel Types For graphical
comparison, Park and Rhodes simultaneously and independently
recognized that the appropriate diagram for the unit-sum
constrained exponents is a triangular plot, or ternary diagram [11,
12, 13]: a barycentric plot in which the 3 variables sum to 1.
Fig. 1. Channel types within the b-f-m diagram [12].
- Slide 5
- Ternary Diagram (b-f-m diagram) and Channel Types The hydraulic
geometry exponents, or b, f, m values, obtained at-a-station (each
station along a river) can be plotted on a ternary diagram. Five
hydrologically significant lines differentiate the ternary diagram
into 10 fields (channel types).
Channel Type   Criteria
I              b + f < m AND b > f
II             b + f < m AND b < f
III            b + f > m AND b > f AND m > f
IV             b + f > m AND b < f AND m > f
V              f > m AND b > f AND m/f > 2/3
VI             f > m AND b < f AND m/f > 2/3
VII            m > f/2 AND b > f AND m/f < 2/3
VIII           m > f/2 AND b < f AND m/f < 2/3
IX             m < f/2 AND b > f
X              m < f/2 AND b < f
TABLE I. Rhodes Classification of Hydraulic Geometry [12]
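The criteria in Table I can be sketched as a small function. This is an illustrative sketch, not the authors' program: the function name `classify_channel` is hypothetical, it assumes f > 0 (so that m/f > 2/3 is the same as m > 2f/3), and points lying exactly on a dividing line, which Table I does not assign, fall into the lower-numbered band's neighbor by the order of the tests.

```python
def classify_channel(b, f, m):
    """Return the Rhodes channel type (I to X) for exponents b, f, m.

    Assumes f > 0. Points exactly on a dividing line are not covered by
    Table I; here they drop through to the next test (an assumption).
    """
    left = b > f  # left of the line f = b gives the odd-numbered types
    if m > b + f:
        return "I" if left else "II"
    if m > f:
        return "III" if left else "IV"
    if m > (2 / 3) * f:
        return "V" if left else "VI"
    if m > f / 2:
        return "VII" if left else "VIII"
    return "IX" if left else "X"


# Example: the five-state mean exponents reported later in the talk
print(classify_channel(0.219, 0.377, 0.404))
```

The tests run from the top of the diagram downward, so each band only needs one comparison against m.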
- Slide 6
- Ternary Diagram (b-f-m diagram) and Channel Types The first
line is f = b. If a point plots to the left of the line (b > f),
the width-depth ratio (w/d) increases with increasing discharge; to
the right of the line (b < f) the ratio decreases. The
second line is m = f. If a point plots above the line (m > f),
competence (the largest particle size a stream can transport)
should increase with increasing discharge; below the line
competence should decrease. The third line is m = f/2. If a point
plots above the line (m > f/2), the Froude number (which
differentiates supercritical and subcritical flow) increases with
increasing discharge; below the line the Froude number decreases.
The fourth line is m = b + f. If a point plots above the line (m >
b + f), velocity increases more rapidly than cross-sectional area
with increasing discharge; below the line it increases less rapidly.
The fifth line is m/f = 2/3, which is related to the Manning
equation. If a point plots above the line (m > (2/3)f), the ratio
of the square root of slope to the roughness coefficient (s^(1/2)/n)
increases with increasing discharge; below the line it
decreases.
- Slide 7
- Hydraulic Geometry & Applications Hydraulic geometry is
applicable to prediction of channel deformation, layout of river
training works, design of stable canals and intakes, river flow
control works, irrigation schemes, and river improvement works [1].
Hydraulic geometry can discriminate between different types of
river sections [3], which could be used by planners for resource
and impact assessment [4].
- Slide 8
- Hydraulic Geometry & Applications Additional applications
include drought management, climate change vulnerability [5], and
assessment of instream aquatic habitat [6]. The U.S. EPA's current
definition of a TMDL (Total Maximum Daily Load) incorporates both
nonpoint sources and point sources of pollution, including all
sources subject to regulation under the National Pollutant
Discharge Elimination System (NPDES) program [7]. Point source
pollutant models such as QUAL2E utilize some of the parameters a,
c, k, and b, f, m [8].
- Slide 9
- United States (Rivers Runs Through) Software engineer Nelson
Minar used data provided by the Environmental Protection Agency
(Business Insider)
http://www.businessinsider.com/map-of-americas-rivers-2013-6#ixzz2kNKSuqPM
- Slide 10
- Data Mining Data mining is the process of discovering
interesting patterns and knowledge from large amounts of data. The
data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system
dynamically [10]. Here, data mining refers to the entire
knowledge discovery process, which may include data cleaning, data
integration, data selection, data transformation, data mining,
pattern evaluation, and knowledge presentation [10].
- Slide 11
- Summary of Data Mining Approach Fig. 2. Flowchart of
multi-stage data mining method
- Slide 12
- Module 1 (Data Extraction and Cleaning) The USGS started
collecting stream information in 1889 by using a stream gage on the
Rio Grande River in New Mexico [14]. A stream gage primarily
measures a stream's water level, but often also collects information
about the water quality and amounts of sediment. Measurements are
usually recorded every 15 minutes and uploaded to the USGS on
average every four hours.
- Slide 13
- Module 1 (Data Extraction and Cleaning) The USGS stores data
from stream gages and manual measurements in a large database that
currently has information from about 1.5 million sites in the US
and territories [16]. Historical measurements and nearly real time
data are available to the public through online filtered searches
[16]. Module 1 acquires preliminary data in the form of HTML files
from the USGS site.
- Slide 14
- U.S. Geological Survey Water Resources
- Slide 15
- U.S. Geological Survey Water Resources
- Slide 16
- Module 1 (Data Extraction and Cleaning) We performed a filtered
search constrained by a selected state and a date range of January
1, 2009, to December 31, 2011, for five southeastern states (KY, TN,
MS, AL, GA). A web scraping program extracts the following data:
gaging station identification number (id), latitude (la) and
longitude (lo), and channel location distance (dist); field
measurements: stream width (w), stream cross-sectional area (A)
from which mean depth (d, or A/w) is computed, mean velocity (v),
and discharge (Q); and channel characteristics: stability (sta),
material (mat), and evenness (eve).
- Slide 17
- Algorithm 1: Data Extraction and Cleaning
begin
  use USGS website to perform search, acquire input file
  for each station in input file
    retrieve id, la, lo, dist, and add them to output file
    for all records in this station
      retrieve w, A, v, Q, sta, mat, eve
      check for missing data for all fields above
      if no missing data
      begin
        calculate d
        add w, A, v, Q, sta, mat, eve, d to output file
        calculate logarithms for Q, w, d, v
        add logarithms of Q, w, d, v to output file
      end
    end-for
  end-for
end-algorithm
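The per-record part of Algorithm 1 might look like the sketch below. The record layout (a dict keyed by the slide's short field names), the function name `clean_record`, the base-10 logarithm, and the rule of dropping non-positive measurements are all assumptions made for illustration; the slides do not specify them.

```python
import math

# Field names follow the abbreviations on the slide; the dict layout is hypothetical.
REQUIRED = ("w", "A", "v", "Q", "sta", "mat", "eve")


def clean_record(rec):
    """Return an augmented copy of rec, or None if the record is unusable."""
    # Skip records with any missing field, as in Algorithm 1.
    if any(rec.get(key) in (None, "") for key in REQUIRED):
        return None
    w, A, v, Q = (float(rec[key]) for key in ("w", "A", "v", "Q"))
    # Assumption: non-positive values are dropped because their logarithm is undefined.
    if min(w, A, v, Q) <= 0:
        return None
    d = A / w  # mean depth from cross-sectional area and width
    out = dict(rec, w=w, A=A, v=v, Q=Q, d=d)
    for name, value in (("Q", Q), ("w", w), ("d", d), ("v", v)):
        out["log_" + name] = math.log10(value)  # base 10 is an assumption
    return out
```

The choice of logarithm base does not affect the fitted exponents b, f, and m, only the intercepts.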
- Slide 18
- Module 2 (Data Selection) It selects only stations with a
minimum of twelve valid measurement records.
Algorithm 2: Data Selection
begin
  declare cutoff variable = 12
  for each station in input file
    read station records into an array
    if the number of station records >= cutoff variable (12)
    begin
      add all records to output file
    end
  end-for
end-algorithm
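Algorithm 2 amounts to grouping records by station and keeping groups that reach the cutoff. A minimal sketch, assuming each record is a dict with an "id" key (a hypothetical layout) and using the hypothetical name `select_stations`:

```python
from collections import defaultdict

CUTOFF = 12  # minimum number of valid measurement records per station


def select_stations(records, cutoff=CUTOFF):
    """Group records by station id and keep stations with at least cutoff records."""
    by_station = defaultdict(list)
    for rec in records:
        by_station[rec["id"]].append(rec)
    return {sid: recs for sid, recs in by_station.items() if len(recs) >= cutoff}
```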
- Slide 19
- Module 3 (Data Transformation and Preliminary Data Pattern
Extraction) It produces the final data set for data pattern
extraction and post-processing analysis. It uses equations 5, 6, 7,
8, 9 (later slides). It transforms latitude (la) and longitude (lo)
to values suitable for the equations. It finds the minimum value of
channel location distance (dist) and uses it to select the suitable
stability (sta), material (mat), and evenness (eve) values, and
classifies each channel.
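The coordinate transformation in this module is a degree/minute/second to decimal conversion. A minimal sketch, assuming the three components have already been parsed into numbers; the function name `dms_to_decimal` and the `negative` flag for western longitudes and southern latitudes are hypothetical conventions:

```python
def dms_to_decimal(degrees, minutes, seconds, negative=False):
    """Convert degree/minute/second components to decimal degrees."""
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # Assumption: western longitudes and southern latitudes are flagged by the caller.
    return -value if negative else value


print(dms_to_decimal(36, 30, 0))  # 36.5
```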
- Slide 20
- Algorithm 3: Data Transformation and Preliminary Data Pattern
Extraction
begin
  for each station in input file
    retrieve id, la, lo, dist
    add id to output file
    convert la, lo from degree/minute/second to decimal
    add la, lo to output file
  end-for
  for each station in input file
    for each record in this station
      retrieve log(w), log(d), log(v), log(Q)
      retrieve sta, mat, eve, dist
      use linear regression (equations 8, 9) to obtain b, f, m and a, c, k for this station
      normalize b, f, m
      increase values of a, c, k proportionally
      normalize a, c, k
      find minimum dist for this station
      select sta, mat, eve at the minimum dist
      classify channel type (I to X) based on Table I
    end-for
    add b, f, m and a, c, k to output file for this station
    add minimum dist to output file for this station
    add sta, mat, eve to output file for this station
    add channel type to output file for this station
  end-for
end-algorithm
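The regression step can be sketched as an ordinary least-squares fit in log-log space. Equations 8 and 9 are not reproduced in this transcript, so the formulas below are the standard simple-regression ones rather than a copy of the authors' equations. The function names are hypothetical, and reading "normalize b, f, m" as rescaling the exponents to sum to 1 is an assumption; the adjustment of a, c, k is left out because the slides do not describe it.

```python
import math


def fit_power_law(Q, y):
    """Fit y = coef * Q**exponent by least squares on log10 values."""
    lx = [math.log10(q) for q in Q]
    ly = [math.log10(val) for val in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)  # needs at least two distinct discharges
    sxy = sum((x - mx) * (t - my) for x, t in zip(lx, ly))
    exponent = sxy / sxx
    coef = 10 ** (my - exponent * mx)
    return coef, exponent


def station_geometry(Q, w, d, v):
    """Return (a, c, k) and unit-sum exponents (b, f, m) for one station."""
    a, b = fit_power_law(Q, w)
    c, f = fit_power_law(Q, d)
    k, m = fit_power_law(Q, v)
    total = b + f + m
    # Assumption: normalization means dividing each exponent by their sum.
    return (a, c, k), (b / total, f / total, m / total)
```

With exact power-law inputs the fit recovers the generating exponents, which makes a convenient sanity check before running it on field data.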
- Slide 21
- Module 4 (Data Pattern Extraction and Post-Processing) Uses
MS-Excel, SigmaPlot, and ArcGIS to achieve pattern evaluation and
knowledge presentation: empirical estimation of parameters;
graphical displays; production of maps; assessment of data quality;
statistical analysis for goodness of fit and interaction of
variables.
- Slide 22
- Estimation of Parameters and Classification There are 3,467
sites in Tennessee with stream flow measurements. Only sites with
channel information (stream flow, channel geometry, stability,
evenness, and material) were selected. The data set was then cleaned
and filtered.
- Slide 23
- Estimation of Parameters and Classification
- Slide 24
- Calculate b, f, m (one site)
- Slide 25
- Calculate b, f, m (one site)
- Slide 26
- Calculate b, f, m (one site)
- Slide 27
- Ternary Diagram (b-f-m diagram) Five hydrologically significant
lines differentiate the ternary diagram into 10 fields.
- Slide 28
- Channel Characteristics and Production of Maps Three non-metric
channel characteristics were also downloaded for each station:
Stability is a nominal variable which categorizes the bed as either
firm or soft. Material is an ordinal variable which ranks the
composition of bed material from finest to coarsest in the sequence
silt (silt and/or mud), sand, gravel, cobbles, boulders (cobbles
and/or boulders), or ledge (bedrock or an artificial material like
concrete or metal). Evenness is a nominal variable which
categorizes the bed as either even or uneven (channel has
significant variation in its cross-section).
- Slide 29
- Assessment of Data Quality
- Slide 30
- Assessment of Data Quality
- Slide 31
- Goodness of Fit and Interaction
- Slide 32
- Goodness of Fit and Interaction
- Slide 33
- Goodness of Fit and Interaction
- Slide 34
- Results - Empirical Estimation of Parameters, Production of
Maps 225 gaging stations (sites) in Kentucky and Tennessee
remained after applying our preliminary filtering criteria. The
hydraulic geometry of the 218 gaging stations (sites) that had
positive b, f, and m values was plotted on the b-f-m diagram. These
218 gaging stations were mapped by channel stability (Figure 3) and
a scatter plot matrix of b, f, and m was generated in ArcGIS
(Figure 4).
- Slide 35
- Results - Empirical Estimation of Parameters, Production of Maps
Fig. 3. Stability of stream channel cross sections at or
near gaging stations in Kentucky and Tennessee.
- Slide 36
- Fig. 4. Scatter plot matrix of hydraulic geometry exponents
b, f, and m in Kentucky and Tennessee. The large graph at the upper
right is an enlargement that is displayed by clicking one of the
figures in the matrix of graphs.
- Slide 37
- Results - Empirical Estimation of Parameters, Production of
Maps The histograms are similar to those produced by Rhodes, and no
obvious outliers were detected from the scatter plots. The mean
hydraulic geometry (0.219, 0.377, 0.404) for our data from five
states (KY, TN, MS, AL, GA) was most similar to the mean
at-a-station hydraulic geometry from a study by Stall and Yang
[19], whose averages of b, f, and m (0.23, 0.41, 0.36), with a
Euclidean distance of 0.0565 from our data, were from the Big Sandy
River basin in Kentucky. Stall and Yang [19] report that there were
19 gages in this basin, described as a mature plateau of fine
texture with moderate to strong relief.
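The distance comparison can be reproduced from the rounded means quoted on this slide. The sketch below uses only those published three-decimal values, so it gives roughly 0.056 rather than exactly the reported 0.0565; the small gap is presumably due to rounding of the inputs, which is an assumption.

```python
import math

# Mean (b, f, m) values as quoted on the slide
ours = (0.219, 0.377, 0.404)
stall_yang = (0.23, 0.41, 0.36)

# Euclidean distance between the two mean hydraulic geometries (Python 3.8+)
distance = math.dist(ours, stall_yang)
print(round(distance, 4))
```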
- Slide 38
- Results - Classification The expected frequency count (EFC)
within each channel type polygon in the ternary diagram was
computed by multiplying its area percentage by the total count of
observations in the usable data set (Table II, next slide). Cursory
inspection of the table reveals that hydraulic exponents are
concentrated in even-numbered channel types, which lie to the right
of the vertical line b = f in the b-f-m diagram. Note that all EFCs
exceeded 5, meeting the requirements for a chi-square test. The
chi-square statistic for a contingency table that included only
even-numbered channel types was about 186.3 with four degrees of
freedom and a p-value of 3.25 × 10^-39, indicating that the points
are not randomly distributed among even-numbered channel types on
the b-f-m diagram.
- Slide 39
- Results - Classification - Table II. Chi-square from observed
and expected counts in each channel type.
Channel type   Observed count (O)   Ternary area [%]   Expected count (E)   (O-E)^2/E
I              8                    12.50              27.3                 13.6
II             54                   12.50              27.3                 26.3
III            23                   20.83              45.4                 11.1
IV             38                   4.17               9.1                  91.9
V              3                    4.17               9.1                  4.1
VI             34                   5.83               12.7                 35.7
VII            3                    2.50               5.5                  1.1
VIII           23                   4.17               9.1                  21.3
IX             5                    10.00              21.8                 12.9
X              27                   23.33              50.9                 11.2
Total          218                  100.00             218.0                229.1
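The expected counts and chi-square terms in Table II can be recomputed from the observed counts and area percentages alone. The sketch below does that, and also sums the even-numbered terms, which comes out near the 186.3 quoted on the previous slide; whether the authors obtained their figure this way is an assumption. The p-value line uses the closed form of the chi-square survival function that holds for exactly four degrees of freedom.

```python
import math

# (channel type, observed count, ternary area in percent), transcribed from Table II
TABLE_II = [
    ("I", 8, 12.50), ("II", 54, 12.50), ("III", 23, 20.83), ("IV", 38, 4.17),
    ("V", 3, 4.17), ("VI", 34, 5.83), ("VII", 3, 2.50), ("VIII", 23, 4.17),
    ("IX", 5, 10.00), ("X", 27, 23.33),
]

total = sum(obs for _, obs, _ in TABLE_II)
terms = {}
for name, obs, area in TABLE_II:
    expected = area / 100 * total  # expected count = area share * total count
    terms[name] = (obs - expected) ** 2 / expected

chi2_all = sum(terms.values())
chi2_even = sum(terms[t] for t in ("II", "IV", "VI", "VIII", "X"))

# For 4 degrees of freedom only: P(X > x) = exp(-x/2) * (1 + x/2)
p_even = math.exp(-chi2_even / 2) * (1 + chi2_even / 2)
print(round(chi2_all, 1), round(chi2_even, 1), p_even)
```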
- Slide 40
- Ternary Diagram: b-f-m diagram Five hydrologically significant
lines differentiate the ternary diagram into 10 fields.
- Slide 41
- Ternary Diagram: Channel Type I b + f < m AND b > f
- Slide 42
- Ternary Diagram: Channel Type II b + f < m AND b < f
- Slide 43
- Ternary Diagram: Channel Type III b + f > m AND b > f AND
m > f
- Slide 44
- Ternary Diagram: Channel Type IV b + f > m AND b < f AND
m > f
- Slide 45
- Ternary Diagram: Channel Type V f > m AND b > f AND m/f
> 2/3
- Slide 46
- Ternary Diagram: Channel Type VI f > m AND b < f AND m/f
> 2/3
- Slide 47
- Ternary Diagram: Channel Type VII m > f/2 AND b > f AND
m/f < 2/3
- Slide 48
- Ternary Diagram: Channel Type VIII m > f/2 AND b < f AND
m/f < 2/3
- Slide 49
- Ternary Diagram: Channel Type IX m < f/2 AND b > f
- Slide 50
- Ternary Diagram: Channel Type X m < f/2 AND b < f
- Slide 51
- Results - Assessment of Data Quality The comparison of our
frequency distribution of channel types with that of Rhodes is
shown in Table III. The value of Spearman's rank-order correlation
coefficient for the comparison was 0.82. The critical values of
Spearman's rank-order correlation coefficient for one- and
two-tailed tests at the 0.05 significance level for 10 pairs of
observations are, respectively, 0.564 and 0.648, so the observed
value suggests a high degree of similarity in the compared
frequency distributions of channel types. In both distributions,
even-numbered channel types, which lie to the right of the vertical
line b = f on the diagram, all have ranks from 1 to 5.5.
- Slide 52
- Results - Assessment of Data Quality - Table III. Frequency
distributions for channel types.
Channel Type   Rhodes [12] rank   Our rank
I              9                  7
II             1                  1
III            6                  5.5
IV             5                  2
V              7                  9.5
VI             2                  3
VII            8                  9.5
VIII           4                  5.5
IX             10                 8
X              3                  4
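The rank correlation can be checked directly from the two rank columns of Table III. Because the second column contains tied ranks (5.5 and 9.5), the sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, which is one standard way to handle ties; the slides do not say which formula the authors used, so this choice is an assumption. The helper name `pearson` is illustrative.

```python
import math

# Ranks for channel types I through X, transcribed from Table III
RHODES = [9, 1, 6, 5, 7, 2, 8, 4, 10, 3]
OURS = [7, 1, 5.5, 2, 9.5, 3, 9.5, 5.5, 8, 4]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Spearman's coefficient is the Pearson correlation applied to ranks
rho = pearson(RHODES, OURS)
print(round(rho, 2))
```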
- Slide 53
- Results - Stability and Material (significant interaction) The
cross tabulation for each combination of stability and material is
shown in Table IV. The p-value of the chi-square statistic was
7.28 × 10^-4, indicating a significant interaction between channel
stability and material. This corroborates the findings of Ridenour
[20], who found an 86% success rate in predicting bank stability
from hydraulic geometry exponents using compositional discriminant
function analysis.
- Slide 54
- Results - Stability and Material - Table IV. Cross tabulation of
channel stability and material (KY-TN).
          firm   soft   Total
gravel    40     2      42
sand      13     7      20
silt      44     3      47
Total     97     12     109
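The test of interaction on Table IV is a chi-square test of independence on a 3 × 2 cross tabulation. The sketch below recomputes the statistic from the cell counts; the function name `chi_square_independence` is hypothetical, and the p-value line relies on the closed form exp(-x/2), which is valid only for two degrees of freedom.

```python
import math

# (firm, soft) counts per bed material, transcribed from Table IV
TABLE_IV = {"gravel": (40, 2), "sand": (13, 7), "silt": (44, 3)}


def chi_square_independence(rows):
    """Return the Pearson chi-square statistic and degrees of freedom."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof


stat, dof = chi_square_independence(list(TABLE_IV.values()))
p_value = math.exp(-stat / 2)  # survival function for exactly 2 degrees of freedom
print(round(stat, 2), dof, p_value)
```

The same function applies unchanged to the other cross tabulations in the talk, but the p-value formula must match each table's degrees of freedom.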
- Slide 55
- Results - Evenness and Material (NO significant interaction) A
similar cross tabulation of channel evenness and material
(eliminating stations with either or both characteristics
unspecified) is shown in Table V. The p-value of the chi-square
statistic was 0.56, indicating no significant difference between
the observed frequencies and those expected by chance; thus, there
is no significant interaction between channel evenness and
material.
- Slide 56
- Results - Evenness and Material - Table V. Cross tabulation of
channel evenness and material (KY-TN).
                     even   uneven   Total
ledge                30     8        38
boulders + cobbles   29     8        37
gravel               37     6        43
sand                 15     5        20
silt                 35     14       49
Total                146    41       187
- Slide 57
- Results - Stability and Evenness (Interaction, NOT strong)
Additional data were mined from the three states that share
Tennessee's southern border: Mississippi, Alabama, and Georgia,
which more than tripled the size of the database (to 599 stations).
The cross tabulation is shown in Table VI. The p-value of the
chi-square statistic was 0.011, indicating interaction between
channel stability and evenness at the 5% and 10% significance
levels but not at the level of 1%.
- Slide 58
- Results - Stability and Evenness - Table VI. Cross tabulation of
channel stability and evenness (KY-TN-MS-AL-GA).
          firm   soft   Total
even      317    103    420
uneven    117    62     179
Total     434    165    599
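For a 2 × 2 table such as Table VI, the chi-square statistic has a shortcut formula and one degree of freedom. The sketch below recomputes it from the four cell counts without a continuity correction (an assumption; the slides do not say whether one was applied). The result is a p-value of roughly 0.011, which is significant at the 5% and 10% levels but not at 1%.

```python
import math

# Cell counts transcribed from Table VI
even_firm, even_soft = 317, 103
uneven_firm, uneven_soft = 117, 62

n = even_firm + even_soft + uneven_firm + uneven_soft
cross = even_firm * uneven_soft - even_soft * uneven_firm
row_product = (even_firm + even_soft) * (uneven_firm + uneven_soft)
col_product = (even_firm + uneven_firm) * (even_soft + uneven_soft)

# Shortcut form of the Pearson chi-square statistic for a 2 x 2 table
stat = n * cross ** 2 / (row_product * col_product)

# Survival function for exactly 1 degree of freedom
p_value = math.erfc(math.sqrt(stat / 2))
print(round(stat, 2), round(p_value, 3))
```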
- Slide 59
- Results - Channel Characteristics and Graphical Classification
The data for Kentucky and Tennessee were plotted on b-f-m diagrams
and differentiated by stability, evenness, and material in Figures
5, 6, 7, and 8. To determine the efficacy of the dividing lines of
the diagram with regard to separating channels by stability,
evenness, or material, two merged channel types, one on each side
of each line, were created by consolidating all channel types on
the same side. The p-values are summarized in Table VII.
- Slide 60
- Results - Table VII. Chi-square p-values for tests of interaction
of merged channel types with stability, evenness, and material
(KY-TN-MS-AL-GA). (Small number means nonrandom.)
Line        Bounding polygon 1   Bounding polygon 2   stability   evenness   material
m = b + f   I thru II            III thru X           .024        .386