- 19 -
http://www.sjie.org
Scientific Journal of Information Engineering February 2016, Volume 6, Issue 1, PP.19-28
The Use of Cluster Analysis and Classifiers for Determination and Prediction of Oil Family Type Based on Geochemical Data
Qian ZHANG†, Guangren SHI
Research Institute of Petroleum Exploration and Development, PetroChina, P. O. Box 910, Beijing 100083, China
†Email: [email protected]
Abstract
Q-mode cluster analysis (QCA), multiple regression analysis (MRA) and three classifiers have been applied to an oil family type (OFT) problem using geochemical data, which has practical value when experimental data are limited. The three classifiers are support vector classification (SVC), naïve Bayesian (NBAY), and Bayesian successive discrimination (BAYSD). QCA is used to determine OFT at first, and then SVC, NBAY and BAYSD are used to predict OFT. Among the four regression/classification algorithms, only MRA is a linear algorithm, whereas SVC, NBAY and BAYSD are nonlinear algorithms. In general, when all four algorithms are used to solve a real-world problem, they often produce different solution accuracies. To address this issue, the solution accuracy is expressed with the sum of the mean absolute relative residuals (MARR) of all samples, R(%), and three criteria have been proposed: 1) the nonlinearity degree of a studied problem, based on the R(%) of MRA (weak if R(%)<10, strong if R(%)≥10); 2) the solution accuracy of a given algorithm application, based on its R(%) (high if R(%)<10, low if R(%)≥10); and 3) the results availability of a given algorithm application, based on its R(%) (applicable if R(%)<10, inapplicable if R(%)≥10). A case study (the OFT problem), which is a classification problem, has been used to validate the proposed approach. The calculation results indicate that a) when a case study is a weakly nonlinear problem, SVC, NBAY and BAYSD are all applicable, and BAYSD is better than SVC and NBAY; and b) BAYSD and SVC can be applied to dimensionality reduction.
Keywords: Cluster Analysis; Multiple Regression Analysis; Support Vector Machine; Naïve Bayesian; Bayesian Successive Discrimination; Problem Nonlinearity; Solution Accuracy; Results Availability; Dimensionality Reduction
1 INTRODUCTION
In recent years, regression/classification algorithms have seen preliminary success in petroleum geochemistry[1, 2], but the application of these algorithms, together with a cluster analysis algorithm, to oil family type (OFT) prediction had not yet been attempted. Peters et al. (2007) employed decision trees to predict OFT using sample data in which each sample contains 21 geochemical factors[3]. Using all the samples presented in [3], we employ the Q-mode cluster analysis (QCA) algorithm to determine OFT at first, and then employ the four regression/classification algorithms below to predict OFT. This work is successful and has not been reported in the literature before.
Besides the QCA algorithm for determining OFT, one regression algorithm and three classification algorithms have been applied to predict OFT. The regression algorithm is multiple regression analysis (MRA), while the three classification algorithms are support vector classification (SVC), naïve Bayesian (NBAY), and Bayesian successive discrimination (BAYSD). Among these four regression/classification algorithms, only MRA is a linear algorithm, whereas the other three are nonlinear algorithms. In general, when all four algorithms are used to solve a real-world problem, they often produce different solution accuracies. To address this issue, the solution accuracy is expressed with the sum of the mean absolute relative residuals (MARR) of all samples, R(%), and three criteria have been proposed: 1) the nonlinearity degree of a studied problem, based on the R(%) of MRA (weak if R(%)<10, strong if R(%)≥10); 2) the solution accuracy of a given algorithm application, based on its R(%) (high if R(%)<10, low if R(%)≥10); and 3) the results availability of a given algorithm application, based on its R(%) (applicable if R(%)<10, inapplicable if R(%)≥10) (Table 1). A case study (the OFT problem) has been used to validate the proposed approach.
TABLE 1. PROPOSED THREE CRITERIA

Range of R(%) | Criterion 1: nonlinearity degree of a studied problem based on R(%) of MRA | Criterion 2: solution accuracy of a given algorithm application based on its R(%) | Criterion 3: results availability of a given algorithm application based on its R(%)
R(%)<10 | Weak | High | Applicable
R(%)≥10 | Strong | Low | Inapplicable
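Since the three criteria in Table 1 share a single threshold on R(%), their mapping can be made explicit in a short Python helper. This is a sketch of ours, not code from the paper; the function and argument names are assumptions.

```python
def apply_criteria(r_mra, r_algo):
    """Map R(%) values to the three proposed criteria (Table 1).

    r_mra  -- R(%) of MRA, which measures problem nonlinearity (criterion 1)
    r_algo -- R(%) of a given algorithm application (criteria 2 and 3)
    """
    nonlinearity = "weak" if r_mra < 10 else "strong"
    accuracy = "high" if r_algo < 10 else "low"
    availability = "applicable" if r_algo < 10 else "inapplicable"
    return nonlinearity, accuracy, availability

# Case-study values from Table 4: R(%) of MRA is 6.98, R(%) of BAYSD is 2.25.
print(apply_criteria(6.98, 2.25))   # -> ('weak', 'high', 'applicable')
```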
2 METHODOLOGY
The methodology consists of the following three major parts: definitions commonly used by regression/classification
algorithms; instruction of five algorithms; dimensionality reduction.
2.1 Definitions Commonly Used by Regression/classification Algorithms
The aforementioned regression/classification algorithms share the same sample data. The essential difference between the two types of algorithms is that the output of a regression algorithm is a real-type value, which in general differs from the real number given in the corresponding learning sample, whereas the output of a classification algorithm is an integer-type value that must be one of the integers defined in the learning samples. From the viewpoint of data science, the integer-type value is called a discrete attribute, while the real-type value is called a continuous attribute.
The four algorithms (MRA, SVC, NBAY, BAYSD) use the same known parameters and also share the same unknown to be predicted. The only differences between them are the approaches and the calculation results.
Assume that there are n learning samples, each associated with m+1 numbers (x1, x2, …, xm, y*) and a set of observed values (xi1, xi2, …, xim, yi*), with i=1, 2, …, n. In principle n>m, but in actual practice n>>m. The n samples associated with m+1 numbers are defined as n vectors:

xi = (xi1, xi2, …, xim, yi*)  (i=1, 2, …, n)  (1)

where n is the number of learning samples; m is the number of independent variables in samples; xi is the ith learning sample vector; xij is the value of the jth independent variable in the ith learning sample, j=1, 2, …, m; and yi* is the observed value of the ith learning sample. Eq. 1 is the expression of learning samples.
Let x0 be the general form of a vector of (xi1, xi2, …, xim). The principles of MRA, NBAY and BAYSD are the same,
i.e., try to construct an expression, y=y(x0), such that Eq. 2 is minimized. Certainly, these three different algorithms
use different approaches and obtain calculation results of differing accuracy.
∑i=1…n [y(x0i) − yi*]^2    (2)

where y=y(x0i) is the calculation result of the dependent variable in the ith learning sample; and the other symbols have been defined in Eq. 1.
However, the principle of the SVC algorithm is to construct an expression, y=y(x0), that maximizes the margin based on the support vector points, so as to obtain the optimal separating line.
This y=y(x0) is called the fitting formula obtained in the learning process. The fitting formulas of different algorithms
are different. In this paper, y is defined as a single variable.
The flowchart is as follows: the 1st step is the learning process, using the n learning samples to obtain a fitting formula; the 2nd step is the learning validation, substituting the n learning samples (xi1, xi2, …, xim) into the fitting formula to get prediction values (y1, y2, …, yn), respectively, so as to verify the fitness of an algorithm; and the 3rd step is the prediction process, substituting the k prediction samples expressed with Eq. 3 into the fitting formula to get prediction values (yn+1, yn+2, …, yn+k), respectively.

xi = (xi1, xi2, …, xim)  (i=n+1, n+2, …, n+k)  (3)

where k is the number of prediction samples; xi is the ith prediction sample vector; and the other symbols have been defined in Eq. 1. Eq. 3 is the expression of prediction samples.
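The three-step flow above can be sketched in Python. For brevity, a one-variable least-squares line stands in for the fitting formula (MRA, SVC, NBAY and BAYSD each construct their own); the data are invented for illustration only.

```python
def fit(xs, ys):
    """Step 1 (learning process): build a fitting formula y = a*x + b
    by ordinary least squares with a single independent variable (m = 1)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

# n = 4 learning samples (xi, yi*), made up so that y = 2x + 1 exactly
learn_x = [1.0, 2.0, 3.0, 4.0]
learn_y = [3.0, 5.0, 7.0, 9.0]

formula = fit(learn_x, learn_y)           # step 1: learning process

fitted = [formula(x) for x in learn_x]    # step 2: learning validation

pred_x = [5.0, 6.0]                       # k = 2 prediction samples (Eq. 3)
pred_y = [formula(x) for x in pred_x]     # step 3: prediction process
print(pred_y)   # -> [11.0, 13.0]
```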
Among the four algorithms, only MRA is a linear algorithm whereas the other three are nonlinear algorithms; this is because MRA constructs a linear function whereas the other three construct nonlinear functions, respectively. To express the calculation accuracies of the prediction variable y for learning and prediction samples when the four algorithms are used, the following four types of residuals are defined.
The absolute relative residual for each sample, R(%)i (i=1, 2, …, n, n+1, n+2, …, n+k), is defined as

R(%)i = |(yi − yi*) / yi*| × 100    (4)

where yi is the calculation result of the dependent variable in the ith sample; and the other symbols have been defined in Eqs. 1 and 3. R(%)i is the fitting residual expressing the fitness for a sample in the learning or prediction process.
It is noted that zero must not be taken as a value of yi*, to avoid floating-point overflow. Therefore, for a regression algorithm, a sample is deleted if its yi*=0; for a classification algorithm, positive integers are taken as the values of yi*.
The MARR for all learning samples, R1(%), is defined as

R1(%) = ∑i=1…n R(%)i / n    (5)

where all symbols have been defined in Eqs. 1 and 4. R1(%) is the fitting residual expressing the fitness of the learning process.
The MARR for all prediction samples, R2(%), is defined as

R2(%) = ∑i=n+1…n+k R(%)i / k    (6)

where all symbols have been defined in Eqs. 3 and 4. R2(%) is the fitting residual expressing the fitness of the prediction process.
The sum of MARR of all samples, R(%), is defined as

R(%) = ∑i=1…n+k R(%)i / (n+k)    (7)

where all symbols have been defined in Eqs. 1, 3 and 4. If there are no prediction samples (k=0), then R(%)=R1(%). R(%) is the fitting residual expressing the fitness of the learning and prediction processes.
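Eqs. 4–7 can be checked directly in Python. As a worked example we reuse the NBAY residuals of the case study below (Table 3): all 30 learning samples fit exactly, and one of the 7 prediction samples (No. 33, class 3 predicted as 4) is misclassified. The function names are ours, not the paper's.

```python
def residual(y, y_star):
    """Absolute relative residual R(%)_i for one sample (Eq. 4); y* != 0."""
    return abs((y - y_star) / y_star) * 100

def marr(residuals):
    """Mean absolute relative residual over a list of R(%)_i values."""
    return sum(residuals) / len(residuals)

learning = [0.0] * 30                                   # n = 30, all exact
prediction = [0.0, 0.0, residual(4, 3), 0.0, 0.0, 0.0, 0.0]   # k = 7

r1 = marr(learning)                # Eq. 5
r2 = marr(prediction)              # Eq. 6
r = marr(learning + prediction)    # Eq. 7
print(round(r1, 2), round(r2, 2), round(r, 2))   # -> 0.0 4.76 0.9
```

These values reproduce the NBAY row of Table 4 (R1(%)=0, R2(%)=4.76, R(%)=0.9).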
When the four algorithms (SVC, NBAY, BAYSD, MRA) are used to solve a real-world problem, they often produce different solution accuracies. To address this issue, the three criteria have been proposed (Table 1).
2.2 Instruction of Five Algorithms
The five algorithms (QCA, MRA, SVC, NBAY, BAYSD) are described in detail by Shi (2013), and the latter four algorithms have been concisely introduced elsewhere[1, 4–11]. Hence, we only concisely introduce QCA here.
Cluster analysis has been widely applied since the 1970s[1, 12]. Q-mode and R-mode cluster analysis are among the most popular methods in cluster analysis, and are still very useful tools in some fields. Cluster analysis is a classification-analysis approach for a data set, and it is a multivariate linear statistical analysis.
As mentioned above, regression/classification algorithms need n learning samples expressed with Eq. 1 and k prediction samples expressed with Eq. 3. From Eqs. 1 and 3, we know each learning sample must have the data of both the independent variables (xi) and the dependent variable (yi*), while each prediction sample only has the data of the independent variables (xi). But cluster analysis has neither a learning process nor a prediction process.
Assume that

xi = (xi1, xi2, …, xim)  (i=1, 2, …, n*)  (8)

where n* is the number of samples; and the other symbols have been defined in Eq. 1.
Equation 8 expresses n* sample vectors. When cluster analysis is conducted on the n* sample vectors, this kind of cluster analysis is called Q-mode cluster analysis (QCA).
Transposing the matrix defined by the right-hand side of Eq. 8, we obtain

xj = (x1j, x2j, …, xn*j)  (j=1, 2, …, m)  (9)

Equation 9 expresses m independent-variable vectors. When cluster analysis is conducted on the m independent-variable vectors, this kind of cluster analysis is called R-mode cluster analysis (RCA).
Obviously, the matrix defined by the right-hand side of Eq. 8 and that defined by the right-hand side of Eq. 9 are mutually transposed.
This cluster analysis has two functions, QCA and RCA, conducted on samples and on independent variables, respectively[1]. By adopting QCA, a cluster pedigree chart of the n* samples (xi) is obtained, indicating the order of dependence between samples; therefore, QCA can serve as a promising sample-reduction tool. By adopting RCA, a cluster pedigree chart of the m independent variables (xj) is obtained, indicating the order of dependence between independent variables; therefore, RCA can serve as a promising independent-variable-reduction (i.e., dimension-reduction) tool.
The cluster analysis introduced in this paper refers only to QCA. As for RCA, the procedure is essentially the same as that of QCA; the only difference is that the input data for QCA are expressed with Eq. 8, while the input data for RCA are expressed with Eq. 9.
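The relation between the two input matrices (Eqs. 8 and 9) is a plain transpose, which a few lines of Python make concrete. The toy matrix (3 samples × 2 variables) is invented for illustration.

```python
# Q-mode clusters rows (samples); R-mode clusters columns (independent
# variables). The two input matrices are transposes of each other.
data = [
    [1.0, 10.0],   # sample 1: (x11, x12)
    [2.0, 20.0],   # sample 2: (x21, x22)
    [3.0, 30.0],   # sample 3: (x31, x32)
]

q_mode_input = data                               # Eq. 8: n* sample vectors
r_mode_input = [list(col) for col in zip(*data)]  # Eq. 9: m variable vectors

print(r_mode_input)   # -> [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
```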
The procedure commonly used for cluster analysis at present is the aggregation procedure. Just as the name implies, the aggregation procedure merges many classes into fewer classes until only one class remains. For example, QCA merges n* classes into one class, while RCA merges m classes into one class.
The procedure of QCA is performed in the following four steps[1, 12]:
1) Determine the number of cluster objects, i.e., the total number of classes is n* at the beginning of the cluster analysis.
2) Select one method, such as standard difference standardization, maximum difference standardization, or maximum difference normalization, for conducting data normalization on the original data expressed by Eq. 8.
3) Select an approach consisting of a cluster statistic (e.g., distance coefficient, analog coefficient, or correlation coefficient) and a class-distance method (e.g., the shortest distance, the longest distance, or the weighted mean). Apply it to every two classes among the n* classes to obtain the order of dependence between classes, and merge the closest two classes (for the distance coefficient, those at minimum class-distance; for the analog coefficient or correlation coefficient, those at maximum class-distance) into a new class composed of two sample vectors; thus the total number of classes becomes (n*−1).
4) Use the selected approach on every two classes among the (n*−1) classes to obtain the order of dependence between classes, and merge the closest two classes into a new class; thus the total number of classes becomes (n*−2). Repeat this operation until the total number of classes becomes 1, at which point the n* classes have been merged into one class; then stop the procedure.
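The merge loop of steps 1, 3 and 4 can be sketched in pure Python. For brevity this sketch uses the Euclidean distance coefficient with the shortest-distance (single-linkage) method and omits the data normalization of step 2; a full QCA also offers the other statistics and class-distance methods listed above.

```python
def euclid(a, b):
    """Euclidean distance coefficient between two sample vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def shortest_distance(c1, c2, samples):
    """Shortest distance between two classes (lists of sample indices)."""
    return min(euclid(samples[i], samples[j]) for i in c1 for j in c2)

def agglomerate(samples):
    """Merge n* singleton classes two-in-one until one class remains.

    Returns the merge history, i.e. the information a pedigree chart shows:
    a list of (class_a, class_b) pairs in the order they were fused.
    """
    classes = [[i] for i in range(len(samples))]   # step 1: n* classes
    history = []
    while len(classes) > 1:                        # steps 3-4
        best = None
        for a in range(len(classes)):
            for b in range(a + 1, len(classes)):
                d = shortest_distance(classes[a], classes[b], samples)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        history.append((classes[a], classes[b]))
        merged = classes[a] + classes[b]           # two-in-one aggregation
        classes = [c for i, c in enumerate(classes) if i not in (a, b)]
        classes.append(merged)
    return history

samples = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]     # toy data
print(agglomerate(samples))  # the first merge fuses the two nearby samples
```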
Up to now, nine clustering methods have been introduced: the three class-distance methods (the shortest-distance method, the longest-distance method, and the weighted mean method) under each of the three cluster statistics (distance coefficient, analog coefficient, and correlation coefficient). Even if the same kind of data normalization (one of standard difference standardization, maximum difference standardization, and maximum difference normalization) is applied to the original data, the calculation results are found to differ among these nine methods. Which method, then, is the best? A suitable one should be chosen depending on the practical application and the specific data.
It must be pointed out that since cluster analysis is performed step by step by "two-in-one" aggregation, the number of new classes is one at the last aggregation, two at the previous one, four at the one before that, and so on. Obviously, except for the last aggregation, the number of new classes is an even number. So the number of classes obtained in a sample classification by QCA is an even number.
2.3 Dimensionality Reduction
Dimensionality reduction means reducing the number of dimensions of a data space to as few as possible while the results of the studied problem remain unchanged. The benefits of dimensionality reduction are to reduce the amount of data so as to enhance calculation speed, to reduce the number of independent variables so as to extend the applicable range, and to reduce the misclassification ratio of prediction samples so as to enhance processing quality.
Both MRA and BAYSD can serve as promising dimension-reduction tools, because both algorithms can give the dependence of the predicted value (y) on the independent variables (x1, x2, …, xm) in decreasing order. However, because MRA performs data analysis under linear correlation whereas BAYSD works under nonlinear correlation, in applications the preferable tool is BAYSD, and MRA is applicable only when the studied problem is linear. Whether such a "promising tool" succeeds or not, it still needs a high-accuracy nonlinear tool (e.g., SVC) for validation, so as to determine how many independent variables can be reduced. For instance, the case study below can be reduced from a 22-D problem to a 21-D problem by using BAYSD and SVC.
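The validation loop just described can be sketched as follows: walk the dependence order given by BAYSD (or MRA), tentatively delete one variable at a time, re-run a nonlinear classifier (SVC in the paper), and accept the deletion only if R(%) does not worsen. Here `toy_r_percent` is a made-up stand-in for a real classifier run, not the paper's SVC; it mimics the case study, where deleting the first variable left R(%) unchanged but deleting the second did not.

```python
def toy_r_percent(kept):
    """Placeholder classifier run returning R(%) for a variable subset.
    Illustrative only: pretend x21 is indispensable, the rest redundant."""
    return 0.0 if "x21" in kept else 0.9

def reduce_dimensions(variables, dependence_order, run):
    """Backward elimination guided by a dependence order."""
    kept = list(variables)
    baseline = run(kept)
    for var in dependence_order:
        trial = [v for v in kept if v != var]
        if run(trial) <= baseline:   # results unchanged: accept deletion
            kept = trial             # the problem loses one dimension
        # otherwise keep the variable and try the next one
    return kept

vars_all = ["x1", "x2", "x21"]
print(reduce_dimensions(vars_all, ["x2", "x21"], toy_r_percent))
# -> ['x1', 'x21']
```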
3 CASE STUDY: THE DETERMINATION AND PREDICTION OF OIL FAMILY TYPE
The objective of this case study is to determine and predict oil family type (OFT) using geochemical data, which has practical value when experimental data are limited.
The data of 37 samples from overseas oilfields presented in [3] are used; each sample contains 21 independent variables (x1 = C19/C23, x2 = C22/C21, x3 = C24/C23, x4 = C26/C25, x5 = Tet/C23, x6 = C27T/C27, x7 = C28/H, x8 = C29/H, x9 = X/H, x10 = OL/H, x11 = C31R/H, x12 = GA/31R, x13 = C35S/C34S, x14 = S/H, x15 = %C27, x16 = %C28, x17 = %C29, x18 = δ13Csaturated, x19 = δ13Caromatic, x20 = C26T/Ts, x21 = C28/C29) and one variable [y* = oil family type (OFT)]. Peters et al. (2007) employed decision trees to predict the OFT[3]. In this case study, the first 30 samples are taken as learning samples and the last 7 samples as prediction samples (Table 2), for OFT determination by QCA at first and then for OFT prediction by SVC, NBAY and BAYSD.
TABLE 2. INPUT DATA FOR OIL FAMILY TYPE (OFT) OF OVERSEAS OILFIELDS (MODIFIED FROM [3])

No. | 21 geochemical factors a related to y* (x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21) | y* b
1  4.53 0.43 0.74 1.28 4.16 0.11 0.04 0.53 0.29 0.02 0.29 0.08 0.5 0.35 13.7 26.7 59.6 -28.15 -25.89 0.18 0.46 | 4
2  2.68 0.51 0.73 1.13 6.82 0.03 0.02 0.69 0.11 0.04 0.27 0.12 1.06 0.31 27.2 26.9 45.9 -30.18 -28.28 0.16 0.59 | 4
3  0.23 0.5 0.74 0.98 0.67 0.02 0.03 0.68 0.12 0.02 0.29 0.17 0.65 0.52 27.2 21.8 51 -30.54 -29.19 0.61 0.43 | 4
4  0.12 0.38 0.87 0.61 0.45 0.46 0.07 0.54 0.05 0.01 0.36 0.09 0.96 0.8 36.5 33 30.5 -32.01 -31.23 0.6 1.08 | 3
5  0.27 0.44 0.75 0.79 0.55 0.08 0.02 0.45 0.1 0.01 0.31 0.09 0.81 0.67 36.4 33.1 30.5 -31.5 -30.4 0.43 1.09 | 3
6  0.31 0.52 0.88 1.01 0.5 0.09 0.03 0.4 0.18 0.03 0.25 0.17 0.7 0.91 33.9 34.7 31.4 -30.94 -29.65 0.51 1.11 | 3
7  0.78 0.49 0.8 1.02 0.88 0.2 0.03 0.46 0.17 0.02 0.24 0.14 0.61 0.55 27.7 32.5 39.8 -29.12 -27.36 0.42 0.84 | 3
8  4.76 0.38 0.59 1.5 5.3 0.21 0.12 0.59 0.32 0.02 0.26 0.18 0.49 1.03 9.2 27.6 63.2 -26.95 -24.92 0.29 0.44 | 4
9  0.09 0.31 0.85 0.97 0.19 0.13 0.12 0.6 0.11 0.03 0.29 0.48 1.06 0.6 14.8 22 63.2 -33.57 -33.29 1.84 0.35 | 2
10 0.04 0.3 0.81 0.96 0.18 0.2 0.18 0.61 0.06 0.03 0.34 0.33 1.22 0.82 13.2 19.5 67.3 -34.78 -34.75 2.43 0.29 | 2
11 0.05 0.31 0.83 0.99 0.13 0.32 0.38 0.66 0.2 0.07 0.31 0.4 1.08 2.13 14.7 22.44 62.9 -34.63 -34.02 3.57 0.36 | 2
12 2.91 0.28 0.76 0.98 1.38 0.15 0.1 0.52 0.07 0.21 0.23 0.13 0.4 0.39 16.9 34.4 48.6 -28.46 -27.13 0.3 0.74 | 4
13 0.05 0.46 0.75 0.77 0.22 0.06 0.04 0.55 0.08 0.01 0.34 0.14 1.09 0.51 30 33.7 36.3 -30.38 -29.6 1.37 0.94 | 3
14 0.06 0.48 0.75 0.74 0.25 0.14 0.04 0.51 0.08 0.03 0.39 0.07 1.24 0.57 28.8 38.8 32.4 -29.99 -29.21 1.11 1.2 | 3
15 0.03 0.3 0.92 0.76 0.14 0.04 0.03 0.4 0.15 0.01 0.34 0.1 1.18 0.63 28.7 27.6 43.7 -31.41 -30.22 2.13 0.63 | 2
16 0.04 0.66 0.58 0.78 0.21 0.06 0.07 0.85 0.05 0.01 0.36 0.28 1.18 0.3 30.4 29.3 40.3 -30.31 -29.89 1.91 0.73 | 1
17 0.13 0.41 0.75 0.98 0.46 0.03 0.03 0.38 0.17 0.01 0.31 0.09 0.59 0.57 30.7 28.5 40.8 -31.23 -30.24 0.65 0.7 | 3
18 0.36 0.39 0.89 0.84 0.64 0.18 0.11 0.43 0.1 0.08 0.3 0.09 0.87 0.58 27.2 40 32.9 -28.93 -27.94 0.48 1.24 | 3
19 0.89 0.41 0.85 1.02 0.76 0.05 0.06 0.51 0.12 0.02 0.28 0.06 0.5 0.62 27.2 33.4 39.4 -28.87 -27.22 0.56 0.87 | 3
20 0.05 0.66 0.6 0.83 0.43 0 0.02 1.11 0.03 0.01 0.39 0.23 1.03 0.2 30 23.7 46.3 -29.3 -28.91 1.19 0.51 | 1
21 0.02 0.68 0.56 0.79 0.2 0.03 0.13 0.75 0.02 0 0.37 0.37 1.2 0.25 32.3 29.2 38.5 -29.98 -29.95 1.72 0.77 | 1
22 0.07 0.53 0.7 1.11 0.36 0.02 0.03 0.48 0.16 0.01 0.28 0.23 0.54 0.6 33.2 20.4 46.5 -29.45 -27.71 0.71 0.44 | 4
23 0.14 0.81 0.32 0.85 1.53 0.04 0.23 2.51 0.11 0 0.33 0.42 1.25 0.46 34 17.1 49 -30.09 -28.04 0.28 0.35 | 1
24 0.09 0.4 0.78 0.84 0.31 0.02 0.03 0.34 0.12 0.02 0.26 0.11 0.67 0.8 28.8 40.8 30.4 -28.24 -26.93 0.55 1.35 | 3
25 0.14 0.81 0.32 0.85 1.53 0.04 0.23 2.51 0.11 0 0.33 0.42 1.25 0.46 34 17.1 49 -30.09 -28.04 0.28 0.35 | 1
26 0.03 0.87 0.54 0.78 0.22 0.01 0.02 0.87 0.01 0 0.54 0.33 0.98 0.28 36.2 17.3 46.5 -29.92 -28.03 2.01 0.37 | 1
27 0.07 0.53 0.7 1.11 0.36 0.02 0.03 0.48 0.16 0.01 0.28 0.23 0.54 0.6 33.2 20.4 46.5 -29.45 -27.71 0.71 0.44 | 4
28 0.17 0.43 0.84 0.67 0.66 0.28 0.08 0.49 0.1 0 0.35 0.07 0.98 0.6 36.3 32.6 31.1 -30.36 -29.51 0.41 1.05 | 3
29 0.22 0.25 0.64 1.44 0.48 0.02 0.02 0.6 0.19 0.01 0.23 0.35 0.49 0.2 21 40.3 38.7 -32.18 -30.43 0.57 1.04 | 4
30 0.04 0.31 0.75 0.94 0.09 0.02 0.05 0.56 0.1 0.02 0.25 0.84 0.66 0.94 25.9 21.6 52.5 -30.57 -30.11 3 0.53 | 2
31 0.02 1.18 0.44 0.74 0.17 0 0.01 1.13 0.01 0.01 0.54 0.33 0.98 0.28 36.2 17.3 46.5 -29.92 -28.77 2.46 0.38 | (1)
32 0.04 0.3 0.7 1.06 0.07 0.02 0.07 0.62 0.1 0.01 0.26 0.66 0.62 1.08 20.6 19.3 60.1 -30.43 -29.91 3.16 0.32 | (2)
33 0.4 0.41 0.88 0.86 0.54 0.16 0.21 0.52 0.78 0.03 0.29 0.21 0.71 2.32 32.8 35.1 32 -29.29 -27.84 0.47 1.1 | (3)
34 0.25 0.46 0.87 0.8 0.78 0.23 0.2 0.45 0.13 0.01 0.31 0.08 0.79 0.79 33.6 32.8 33.6 -29.69 -28.78 0.32 0.98 | (3)
35 0.43 0.48 0.79 0.91 0.94 0.1 0.13 0.44 0.21 0.01 0.31 0.1 0.69 0.78 32.2 31.3 36.5 -28.63 -27.48 0.24 0.87 | (3)
36 0.1 0.33 0.7 0.95 0.35 0.04 0.02 0.67 0.07 0 0.27 0.22 0.63 0.39 30.9 20.5 48.6 -31.48 -29.27 1 0.42 | (4)
37 0.22 0.25 0.64 1.44 0.48 0.02 0.02 0.6 0.19 0.01 0.23 0.35 0.49 0.2 21 40.3 38.7 -32.18 -30.43 0.57 1.04 | (4)

a x1 = C19/C23, x2 = C22/C21, x3 = C24/C23, x4 = C26/C25, x5 = Tet/C23, x6 = C27T/C27, x7 = C28/H, x8 = C29/H, x9 = X/H, x10 = OL/H, x11 = C31R/H, x12 = GA/31R, x13 = C35S/C34S, x14 = S/H, x15 = %C27, x16 = %C28, x17 = %C29, x18 = δ13Csaturated, x19 = δ13Caromatic, x20 = C26T/Ts, x21 = C28/C29.
b y* = OFT = oil family type (1–4) determined by QCA at first, then predicted by SVC, NBAY and BAYSD; a number in parentheses is not input data, but is used for calculating R(%)i.
3.1 The Calculation and Results of QCA
Using the 37 samples without y* (Table 2), QCA[1, 12] was run taking standard difference standardization for data normalization, the correlation coefficient as the cluster statistic, and the weighted mean as the class-distance method; a cluster pedigree chart of the 37 samples was thereby obtained (Fig. 1).
FIG. 1 Q-MODE CLUSTER PEDIGREE CHART FOR OIL FAMILY TYPE (OFT)
Figure 1 illustrates the cluster pedigree chart in which the 37 classes (i.e., the 37 samples) are merged step by step into one class, showing the order of dependence between samples. Four classes can be distinguished in Fig. 1 (an even number of classes), based on which the class value (1–4) of each sample is filled in the y* column of Table 2.
3.2 The Calculation and Results of SVC, NBAY, BAYSD and MRA
Using the 30 learning samples with y* (Table 2), four functions of OFT (y) with respect to the 21 independent variables (x1, x2, …, x21) have been constructed by SVC[1, 13], NBAY[1, 14, 15], BAYSD[1] and MRA[1, 16, 17]. Substituting the values of the 21 independent variables given by the 30 learning samples and the 7 prediction samples (Table 2) into the four functions, respectively, the OFT (y) of each sample is obtained (Table 3).
TABLE 3. PREDICTION RESULTS FOR OIL FAMILY TYPE (OFT) OF OVERSEAS OILFIELDS

Sample type | Sample No. | OFT a y* | SVC: y, R(%) | NBAY: y, R(%) | BAYSD: y, R(%) | MRA b: y, R(%)
Learning samples
1 4 4 0 4 0 4 0 4 0
2 4 4 0 4 0 4 0 4 0
3 4 4 0 4 0 4 0 3 25
4 3 3 0 3 0 3 0 3 0
5 3 3 0 3 0 3 0 3 0
6 3 3 0 3 0 3 0 3 0
7 3 3 0 3 0 3 0 3 0
8 4 4 0 4 0 4 0 4 0
9 2 2 0 2 0 2 0 2 0
10 2 2 0 2 0 2 0 2 0
11 2 2 0 2 0 2 0 2 0
12 4 4 0 4 0 4 0 4 0
13 3 3 0 3 0 3 0 2 33.3
14 3 3 0 3 0 3 0 2 33.3
15 2 2 0 2 0 2 0 2 0
16 1 1 0 1 0 1 0 1 0
17 3 3 0 3 0 3 0 3 0
18 3 3 0 3 0 3 0 3 0
19 3 3 0 3 0 3 0 3 0
20 1 1 0 1 0 1 0 1 0
21 1 1 0 1 0 1 0 1 0
22 4 4 0 4 0 4 0 4 0
23 1 1 0 1 0 1 0 1 0
24 3 3 0 3 0 3 0 3 0
25 1 1 0 1 0 1 0 1 0
26 1 1 0 1 0 1 0 1 0
27 4 4 0 4 0 4 0 4 0
28 3 3 0 3 0 3 0 3 0
29 4 4 0 4 0 4 0 4 0
30 2 2 0 2 0 2 0 2 0
Prediction samples
31 1 1 0 1 0 1 0 0 100
32 2 2 0 2 0 2 0 2 0
33 3 3 0 4 33.3 4 33.3 2 33.3
34 3 3 0 3 0 3 0 2 33.3
35 3 3 0 3 0 3 0 3 0
36 4 4 0 4 0 2 50 4 0
37 4 4 0 4 0 4 0 4 0
a y* = OFT = oil family type (1–4) determined by cluster analysis.
b The results y of MRA are converted from real numbers to integers by rounding.
From Table 4 and based on Table 1, it is shown that a) the nonlinearity degree of this studied problem is weak, since the R(%) of MRA is 6.98; and b) the solution accuracies of SVC, NBAY and BAYSD are high, and their results availability is applicable, since their R(%) values are 0, 0.9 and 2.25, respectively.
TABLE 4. COMPARISON AMONG THE APPLICATIONS OF THE CLASSIFIERS (SVC, NBAY AND BAYSD) AND MRA TO OIL FAMILY TYPE (OFT) OF OVERSEAS OILFIELDS

Algorithm | Fitting formula | R1(%) | R2(%) | R(%) | Dependence of the predicted value (y) on the independent variables (x1, x2, …, x21), in decreasing order | Time consumed on PC (Intel Core 2) | Problem nonlinearity | Solution accuracy | Results availability
SVC | Nonlinear, explicit | 0 | 0 | 0 | N/A | 5 s | N/A | High | Applicable
NBAY | Nonlinear, explicit | 0 | 4.76 | 0.9 | N/A | <1 s | N/A | High | Applicable
BAYSD | Nonlinear, explicit | 0 | 11.9 | 2.25 | x2, x21, x20, x8, x3, x14, x16, x9, x15, x6, x13, x10, x4, x19, x18, x12, x5, x17, x11, x7, x1 | 1 s | N/A | High | Applicable
MRA | Linear, explicit | 3.06 | 23.8 | 6.98 | x13, x12, x11, x4, x8, x20, x2, x16, x15, x10, x14, x3, x19, x18, x1, x9, x17, x7, x21, x6, x5 | <1 s | Weak | N/A | N/A
3.3 Dimension-reduction Succeeded from 22-D to 21-D Problem by Using BAYSD and SVC
BAYSD gives the dependence of the predicted value (y) on the 21 independent variables in decreasing order: x2, x21, x20, x8, x3, x14, x16, x9, x15, x6, x13, x10, x4, x19, x18, x12, x5, x17, x11, x7, x1 (Table 4). Following this dependence order, x2 is deleted first and SVC is rerun; the results of SVC are found to be the same as before, i.e., R(%)=0, so the 22-D problem (x1, x2, x3, …, x21, y) can become a 21-D problem (x1, x3, x4, …, x21, y). In the same way, it is found that this 21-D problem cannot become a 20-D problem by deleting x21, because the results of SVC change to R(%)=0.9, which is greater than the previous R(%)=0 (Table 4). Therefore, the 22-D problem can at last only become a 21-D problem.
Compared with SVC, the major advantages of BAYSD are (Table 4): a) BAYSD runs much faster than SVC; b) the BAYSD program is easy to code, whereas the SVC program is very complicated to code; and c) BAYSD can serve as a promising dimension-reduction tool. So BAYSD is better than SVC when the nonlinearity of the studied problem is weak.
4 CONCLUSIONS
Through the aforementioned case study, five major conclusions can be drawn as follows:
1) For a classification problem, QCA can be applied to determine the dependent variable (y), while SVC, NBAY and BAYSD can be applied to predict y;
2) The sum of MARR, R(%), of MRA can be used to measure the nonlinearity degree of a studied problem, and thus MRA should be run first;
3) The proposed three criteria [nonlinearity degree of a studied problem based on R(%) of MRA, solution accuracy
of a given algorithm application based on its R(%), results availability of a given algorithm application based on
its R(%)] are practical;
4) If a classification problem has weak nonlinearity, in general, SVC, NBAY and BAYSD are applicable, and
BAYSD is better than SVC and NBAY;
5) BAYSD and SVC can be applied to dimensionality reduction.
ACKNOWLEDGMENT
This work was supported by the Research Institute of Petroleum Exploration and Development (RIPED) and
PetroChina.
REFERENCES
[1] Shi G. "Data Mining and Knowledge Discovery for Geoscientists." Elsevier Inc, USA, 2013
[2] Shi G. "Prediction of methane inclusion types using support vector machine." Sci J Earth Sci 5(2): 18–27, 2015
[3] Peters KE, Ramos LS, Zumberge JE, Valin ZC, Scotese CR, Gautier DL. "Circum-Arctic petroleum systems identified using decision-tree chemometrics." AAPG Bulletin 91(6): 877–913, 2007
[4] Shi G, Yang X. "Optimization and data mining for fracture prediction in geosciences." Procedia Comput Sci 1(1): 1353–1360, 2010
[5] Zhu Y, Shi G. "Identification of lithologic characteristics of volcanic rocks by support vector machine." Acta Petrolei Sinica 34(2): 312–322, 2013
[6] Shi G, Zhu Y, Mi S, Ma J, Wan J. "A big data mining in petroleum exploration and development." Adv Petrol Expl Devel 7(2): 1–8, 2014
[7] Zhang Q, Shi G. "Economic evaluation of waterflood using regression and classification algorithms." Adv Petrol Expl Devel 9(1): 1–8, 2015
[8] Shi G. "Optimal prediction in petroleum geology by regression and classification methods." Sci J Inf Eng 5(2): 14–32, 2015
[9] Li D, Shi G. "Data mining in petroleum upstream: the use of regression and classification algorithms." Sci J Earth Sci 5(2): 33–40, 2015
[10] Shi G, Ma J, Ba D. "Optimal selection of classification algorithms for well log interpretation." Sci J Con Eng 5(3): 37–50, 2015
[11] Mi S, Shi G. "The use of regression and classification algorithms for layer productivity prediction in naturally fractured reservoirs." J Petrol Sci Res 4(2): 65–78, 2015
[12] Everitt BS, Landau S, Leese M, Stahl D. "Cluster Analysis, 5th edition." John Wiley & Sons, Chichester, England, UK, 2011
[13] Chang C, Lin C. "LIBSVM: a library for support vector machines, Version 3.1." Retrieved from www.csie.ntu.edu.tw/~cjlin/libsvm, 2011
[14] Tan P, Steinbach M, Kumar V. "Introduction to Data Mining." Pearson Education, Boston, MA, USA, 2005
[15] Han J, Kamber M. "Data Mining: Concepts and Techniques, 2nd Ed." Morgan Kaufmann, San Francisco, CA, USA, 2006
[16] Sharma MSR, O'Regan M, Baxter CDP, Moran K, Vaziri H, Narayanasamy R. "Empirical relationship between strength and geophysical properties for weakly cemented formations." J Petro Sci Eng 72(1-2): 134–142, 2010
[17] Singh J, Shaik B, Singh S, Agrawal VK, Khadikar PV, Deeb O, Supuran CT. "Comparative QSAR study on para-substituted aromatic sulphonamides as CAII inhibitors: information versus topological (distance-based and connectivity) indices." Chem Biol Drug Design 71: 244–259, 2008
AUTHORS
1 Qian Zhang is a senior engineer of PetroChina. She was born in Heilongjiang Province, China, in September 1983, and graduated from the Beijing Institute of Technology in 2010 with a Ph.D. in applied mathematics. She joined the Research Institute of Petroleum Exploration and Development (RIPED) of PetroChina in 2010 and works as a petroleum evaluation engineer on national and world energy assessment projects. Her recent accomplishments include petroleum resources evaluation software development and the application of data mining algorithms in petroleum evaluation. She has published 6 articles, of which 3 are SCI-indexed and 3 are on data mining. She received the 1st class award of science-technology achievement from RIPED (2015).
2 Guangren Shi is a professor of PetroChina. He was born in Shanghai, China, in February 1940. His research covers two major fields: basin modeling (petroleum systems) and data mining for geosciences. He has published 8 books, of which 3 are in English and 3 are on data mining, and 75 articles, of which 4 are SCI-indexed and 15 are on data mining.