
Scientific Journal of Information Engineering, February 2016, Volume 6, Issue 1, PP. 19-28

The Use of Cluster Analysis and Classifiers for Determination and Prediction of Oil Family Type Based on Geochemical Data

Qian ZHANG†, Guangren Shi

Research Institute of Petroleum Exploration and Development, PetroChina, P. O. Box 910, Beijing 100083, China

†Email: [email protected]

Abstract

The Q-mode cluster analysis (QCA), the multiple regression analysis (MRA) and three classifiers have been applied to an oil family type (OFT) problem using geochemical data, which has practical value when experimental data are limited. The three classifiers are the support vector classification (SVC), the naïve Bayesian (NBAY), and the Bayesian successive discrimination (BAYSD). QCA is used first to determine OFT, and then SVC, NBAY and BAYSD are used to predict OFT. Of the four regression/classification algorithms, only MRA is a linear algorithm, whereas SVC, NBAY and BAYSD are nonlinear algorithms. In general, when all four algorithms are applied to a real-world problem, they produce different solution accuracies. To address this issue, the solution accuracy is expressed by the mean absolute relative residual (MARR) over all samples, R(%), and three criteria are proposed: 1) the nonlinearity degree of a studied problem, based on the R(%) of MRA (weak if R(%) < 10, strong if R(%) ≥ 10); 2) the solution accuracy of a given algorithm application, based on its R(%) (high if R(%) < 10, low if R(%) ≥ 10); and 3) the results availability of a given algorithm application, based on its R(%) (applicable if R(%) < 10, inapplicable if R(%) ≥ 10). A case study (the OFT problem), which is a classification problem, has been used to validate the proposed approach. The calculation results indicate that a) when a case study is a weakly nonlinear problem, SVC, NBAY and BAYSD are all applicable, and BAYSD is better than SVC and NBAY; and b) BAYSD and SVC can be applied to dimensionality reduction.

Keywords: Cluster Analysis; Multiple Regression Analysis; Support Vector Machine; Naïve Bayesian; Bayesian Successive Discrimination; Problem Nonlinearity; Solution Accuracy; Results Availability; Dimensionality Reduction

1 INTRODUCTION

In recent years, regression/classification algorithms have seen preliminary success in petroleum geochemistry [1, 2], but the application of these algorithms and of cluster analysis to oil family type (OFT) prediction had not been attempted. Peters et al. (2007) employed decision trees to predict OFT using samples that each contain 21 geochemical factors [3]. Using all the samples presented in [3], we first employ the Q-mode cluster analysis (QCA) algorithm to determine OFT, and then employ the four regression/classification algorithms below to predict OFT. This work is successful and has not previously been reported in the literature.

Besides the QCA algorithm for determining OFT, one regression algorithm and three classification algorithms have been applied to predict OFT. The regression algorithm is the multiple regression analysis (MRA), while the three classification algorithms are the support vector classification (SVC), the naïve Bayesian (NBAY), and the Bayesian successive discrimination (BAYSD). Of these four regression/classification algorithms, only MRA is a linear algorithm, whereas the other three are nonlinear algorithms. In general, when all four algorithms are applied to a real-world problem, they produce different solution accuracies. To address this issue, the solution accuracy is expressed by the mean absolute relative residual (MARR) over all samples, R(%), and three criteria have been proposed: 1) the nonlinearity degree of a studied problem, based on the R(%) of MRA (weak if R(%) < 10, strong if R(%) ≥ 10); 2) the solution accuracy of a given algorithm application, based on its R(%) (high if R(%) < 10, low if R(%) ≥ 10); and 3) the results availability of a given algorithm application, based on its R(%) (applicable if R(%) < 10, inapplicable if R(%) ≥ 10) (Table 1). A case study (the OFT problem) has been used to validate the proposed approach.

TABLE 1. PROPOSED THREE CRITERIA

Range of R(%)   Criterion 1: nonlinearity degree of a    Criterion 2: solution accuracy of a      Criterion 3: results availability of a
                studied problem, based on R(%) of MRA    given algorithm application, based on    given algorithm application, based on
                                                         its R(%)                                 its R(%)
R(%) < 10       Weak                                     High                                     Applicable
R(%) ≥ 10       Strong                                   Low                                      Inapplicable
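The three criteria amount to a simple threshold rule. The sketch below is a minimal Python illustration of Table 1; the function name and return structure are ours, not part of the original paper:

```python
def apply_criteria(r_mra: float, r_alg: float) -> dict:
    """Apply the three criteria of Table 1 (threshold: R(%) = 10)."""
    return {
        "nonlinearity": "weak" if r_mra < 10 else "strong",        # Criterion 1: uses R(%) of MRA
        "solution_accuracy": "high" if r_alg < 10 else "low",      # Criterion 2: uses the algorithm's R(%)
        "results_availability": "applicable" if r_alg < 10 else "inapplicable",  # Criterion 3
    }

# Example with the case-study values reported later (Table 4): MRA R(%) = 6.98, BAYSD R(%) = 2.25
print(apply_criteria(6.98, 2.25))
```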

2 METHODOLOGY

The methodology consists of three major parts: definitions commonly used by regression/classification algorithms; a brief introduction to the five algorithms; and dimensionality reduction.

2.1 Definitions Commonly Used by Regression/Classification Algorithms

The aforementioned regression/classification algorithms share the same sample data. The essential difference between the two types of algorithms is that the output of a regression algorithm is a real-type value, which in general differs from the real number given in the corresponding learning sample, whereas the output of a classification algorithm is an integer-type value that must be one of the integers defined in the learning samples. In data-science terms, the integer-type value is called a discrete attribute, while the real-type value is called a continuous attribute.

The four algorithms (MRA, SVC, NBAY, BAYSD) use the same known parameters and also share the same unknown to be predicted. The only differences between them are the approach and the calculation results.

Assume that there are n learning samples, each associated with m+1 numbers (x1, x2, …, xm, y*) and a set of observed values (xi1, xi2, …, xim, y*i), with i = 1, 2, …, n. In principle n > m, but in actual practice n >> m. The n samples associated with m+1 numbers are defined as n vectors:

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im}, y_i^*) \quad (i = 1, 2, \ldots, n) \qquad (1)$$

where n is the number of learning samples; m is the number of independent variables in the samples; xi is the i-th learning sample vector; xij is the value of the j-th independent variable in the i-th learning sample, j = 1, 2, …, m; and y*i is the observed value of the i-th learning sample. Eq. 1 is the expression of the learning samples.

Let x0 denote the general form of a vector (xi1, xi2, …, xim). The principles of MRA, NBAY and BAYSD are the same, i.e., to construct an expression y = y(x0) such that Eq. 2 is minimized. Certainly, these three algorithms use different approaches and obtain calculation results of differing accuracy.

$$\sum_{i=1}^{n} \left[ y(\mathbf{x}_{0i}) - y_i^* \right]^2 \qquad (2)$$

where y(x0i) is the calculated value of the dependent variable in the i-th learning sample; the other symbols have been defined in Eq. 1.

However, the principle of the SVC algorithm is to construct an expression y = y(x0) that maximizes the margin based on the support vectors, so as to obtain the optimal separating hyperplane.

This y = y(x0) is called the fitting formula obtained in the learning process. The fitting formulas of different algorithms differ. In this paper, y is defined as a single variable.

The workflow is as follows: the 1st step is the learning process, which uses the n learning samples to obtain a fitting formula; the 2nd step is learning validation, which substitutes the n learning samples (xi1, xi2, …, xim) into the fitting formula to get predicted values (y1, y2, …, yn) so as to verify the fitness of an algorithm; and the 3rd step is the prediction process, which substitutes the k prediction samples expressed by Eq. 3 into the fitting formula to get predicted values (yn+1, yn+2, …, yn+k).

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im}) \quad (i = n+1, n+2, \ldots, n+k) \qquad (3)$$

where k is the number of prediction samples; xi is the i-th prediction sample vector; and the other symbols have been defined in Eq. 1. Eq. 3 is the expression of the prediction samples.
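To make this three-step flow concrete, the sketch below runs it with scikit-learn's SVC as the fitting formula; the toy data, variable names and the use of scikit-learn are our illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.svm import SVC

# Step 1 (learning): fit the fitting formula y = y(x0) from n learning samples.
X_learn = np.array([[0.1, 0.2], [0.3, 0.1], [0.8, 0.9], [0.7, 0.8]])  # n = 4, m = 2 (toy data)
y_learn = np.array([1, 1, 2, 2])                                      # observed values y*_i
model = SVC(kernel="rbf").fit(X_learn, y_learn)

# Step 2 (learning validation): substitute the learning samples back in.
y_fit = model.predict(X_learn)     # (y_1, ..., y_n), compared against y*_i via R(%)

# Step 3 (prediction): substitute the k prediction samples of Eq. 3.
X_pred = np.array([[0.2, 0.15], [0.75, 0.85]])                        # k = 2
y_pred = model.predict(X_pred)     # (y_{n+1}, ..., y_{n+k})
```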

Of the four algorithms, only MRA is a linear algorithm whereas the other three are nonlinear algorithms; this is because MRA constructs a linear function whereas the other three construct nonlinear functions.

To express the calculation accuracies of the prediction variable y for learning and prediction samples when the four algorithms are used, the following four types of residuals are defined.

The absolute relative residual for each sample, R(%)i (i = 1, 2, …, n, n+1, n+2, …, n+k), is defined as

$$R(\%)_i = \left| \left( y_i - y_i^* \right) / y_i^* \right| \times 100 \qquad (4)$$

where yi is the calculated value of the dependent variable in the i-th sample; the other symbols have been defined in Eqs. 1 and 3. R(%)i is the fitting residual expressing the fitness for a sample in the learning or prediction process.

Note that zero must not be taken as a value of y*i, to avoid floating-point overflow. Therefore, for a regression algorithm, a sample is deleted if its y*i = 0; for a classification algorithm, positive integers are taken as the values of y*i.

The MARR over all learning samples, R1(%), is defined as

$$R_1(\%) = \frac{1}{n} \sum_{i=1}^{n} R(\%)_i \qquad (5)$$

where all symbols have been defined in Eqs. 1 and 4. R1(%) is the fitting residual expressing the fitness of the learning process.

The MARR over all prediction samples, R2(%), is defined as

$$R_2(\%) = \frac{1}{k} \sum_{i=n+1}^{n+k} R(\%)_i \qquad (6)$$

where all symbols have been defined in Eqs. 3 and 4. R2(%) is the fitting residual expressing the fitness of the prediction process.

The MARR over all samples (learning plus prediction), R(%), is defined as

$$R(\%) = \frac{1}{n+k} \sum_{i=1}^{n+k} R(\%)_i \qquad (7)$$

where all symbols have been defined in Eqs. 1, 3 and 4. If there are no prediction samples (k = 0), then R(%) = R1(%). R(%) is the fitting residual expressing the fitness of the learning and prediction processes together.
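A minimal Python implementation of Eqs. 4-7 might look as follows; the function names are ours, and the zero-value restriction on y*i noted above is assumed to hold:

```python
import numpy as np

def marr(y_true, y_calc):
    """Mean absolute relative residual, Eqs. 4-6: mean of |(y_i - y*_i)/y*_i| * 100.
    y_true must not contain zeros (see the note on y*_i above)."""
    y_true = np.asarray(y_true, dtype=float)
    y_calc = np.asarray(y_calc, dtype=float)
    return float(np.mean(np.abs((y_calc - y_true) / y_true)) * 100.0)

def total_r(y_true_learn, y_calc_learn, y_true_pred, y_calc_pred):
    """R(%) of Eq. 7: the MARR over all n + k samples."""
    return marr(np.concatenate([y_true_learn, y_true_pred]),
                np.concatenate([y_calc_learn, y_calc_pred]))

# Example: R1(%), R2(%) and R(%) for toy classification outputs
r1 = marr([4, 3, 2], [4, 3, 2])                       # 0.0
r2 = marr([3, 4], [4, 4])                             # 16.67
r  = total_r([4, 3, 2], [4, 3, 2], [3, 4], [4, 4])    # 6.67
```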

When the four algorithms (SVC, NBAY, BAYSD, MRA) are used to solve a real-world problem, they often produce different solution accuracies. To address this issue, the three criteria of Table 1 have been proposed.

2.2 Introduction of the Five Algorithms

The five algorithms (QCA, MRA, SVC, NBAY, BAYSD) are described in detail by Shi (2013), and the latter four have been concisely introduced elsewhere [1, 4–11]. Hence, only QCA is concisely introduced here.

Cluster analysis has been widely applied since the 1970s [1, 12]. Q-mode and R-mode cluster analyses are among the most popular methods of cluster analysis and remain very useful tools in some fields. Cluster analysis is a classification approach for a data set, and is a multivariate linear statistical analysis.


As mentioned above, regression/classification algorithms need n learning samples expressed by Eq. 1 and k prediction samples expressed by Eq. 3. From Eqs. 1 and 3, each learning sample must have the data of both the independent variables (xi) and the dependent variable (y*i), while each prediction sample has only the data of the independent variables (xi). Cluster analysis, by contrast, has neither a learning process nor a prediction process.

Assume that

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im}) \quad (i = 1, 2, \ldots, n^*) \qquad (8)$$

where n* is the number of samples; the other symbols have been defined in Eq. 1.

Equation 8 expresses n* sample vectors. Cluster analysis conducted on the n* sample vectors is called Q-mode cluster analysis (QCA).

Transposing the matrix defined by the right-hand side of Eq. 8, we obtain

$$\mathbf{x}_j = (x_{1j}, x_{2j}, \ldots, x_{n^*j}) \quad (j = 1, 2, \ldots, m) \qquad (9)$$

Equation 9 expresses m independent-variable vectors. Cluster analysis conducted on the m independent-variable vectors is called R-mode cluster analysis (RCA).

Obviously, the matrix defined by the right-hand side of Eq. 8 and that defined by the right-hand side of Eq. 9 are mutual transposes.

Cluster analysis thus has two functions, QCA and RCA, conducted on samples and on independent variables, respectively [1]. By adopting QCA, a cluster pedigree chart of the n* samples (xi) is obtained, indicating the order of dependence between samples; QCA therefore can serve as a promising sample-reduction tool. By adopting RCA, a cluster pedigree chart of the m independent variables (xj) is obtained, indicating the order of dependence between independent variables; RCA therefore can serve as a promising independent-variable reduction (i.e., dimension-reduction) tool.

The cluster analysis introduced in this paper refers only to QCA. The procedure of RCA is essentially the same as that of QCA; the only difference is that the input data for QCA are expressed by Eq. 8, while the input data for RCA are expressed by Eq. 9.

The procedure commonly used for cluster analysis at present is the aggregation procedure. As the name implies, the aggregation procedure merges many classes into fewer, ultimately into one class. For example, QCA merges n* classes into one class, while RCA merges m classes into one class.

The procedure of QCA is performed in the following four steps [1, 12]; a code sketch follows the list:

1) Determine the number of cluster objects, i.e., the total number of classes is n* at the beginning of the cluster analysis.

2) Select a method, such as standard difference standardization, maximum difference standardization or maximum difference normalization, for normalizing the original data expressed by Eq. 8.

3) Select an approach combining a cluster statistic (e.g., distance coefficient, analog coefficient, or correlation coefficient) with a class-distance method (e.g., the shortest distance, the longest distance, or the weighted mean). Apply it to every pair of the n* classes to obtain the order of dependence between classes, and merge the closest two classes (for the distance coefficient, the minimum class distance; for the analog or correlation coefficient, the maximum class distance) into a new class composed of two sample vectors; the total number of classes then becomes (n* − 1).

4) Apply the selected approach to every pair of the (n* − 1) classes, obtain the order of dependence between classes, and merge the closest two classes into a new class, so that the total number of classes becomes (n* − 2). Repeat this operation until the total number of classes becomes 1, at which point the n* classes have been merged into one class, and the procedure stops.
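A compact sketch of this aggregation procedure, using SciPy's hierarchical-clustering routines, is given below. Mapping the paper's choices onto SciPy is our assumption: z-scoring stands in for standard difference standardization, SciPy's "correlation" distance (1 minus the correlation coefficient) stands in for the correlation-coefficient statistic, and WPGMA ("weighted") linkage stands in for the weighted-mean class distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def qca(samples, n_classes):
    """Q-mode cluster analysis by 'two-in-one' aggregation.
    samples: (n*, m) array, one row per sample vector (Eq. 8).
    Returns a class label (1..n_classes) for each sample."""
    z = (samples - samples.mean(axis=0)) / samples.std(axis=0)  # data normalization
    d = pdist(z, metric="correlation")       # cluster statistic: 1 - correlation coefficient
    tree = linkage(d, method="weighted")     # class distance: weighted mean (WPGMA)
    return fcluster(tree, t=n_classes, criterion="maxclust")

# R-mode (RCA) would simply cluster the transpose: qca(samples.T, n_classes)
```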

Up to now, nine clustering methods have been introduced: the three class-distance methods (shortest distance, longest distance, and weighted mean) under each of the three cluster statistics (distance coefficient, analog coefficient, and correlation coefficient). Even when the same kind of data normalization (one of standard difference standardization, maximum difference standardization, and maximum difference normalization) is applied to the original data, the calculation results differ among these nine methods. Which method is best? A suitable one should be chosen according to the practical application and the specific data.

It must be pointed out that, since cluster analysis proceeds step by step through "two-in-one" aggregation, the number of new classes is one at the last aggregation, two at the previous one, four at the one before that, and so on. Obviously, except for the last aggregation, the number of new classes is an even number. Hence the number of classes obtained in sample classification by QCA is an even number.

2.3 Dimensionality Reduction

Dimensionality reduction means reducing the number of dimensions of a data space to as few as possible while leaving the results of the studied problem unchanged. Its benefits are to reduce the amount of data so as to increase calculation speed, to reduce the number of independent variables so as to widen the applicable range, and to reduce the misclassification ratio of prediction samples so as to improve processing quality.

MRA and BAYSD can each serve as a promising dimension-reduction tool, because both algorithms give the dependence of the predicted value (y) on the independent variables (x1, x2, …, xm) in decreasing order. However, because MRA performs data analysis under linear correlation whereas BAYSD works under nonlinear correlation, the preferable tool in applications is BAYSD; MRA is applicable only when the studied problem is linear. Whether such a "promising tool" succeeds or not requires validation by a high-class nonlinear tool (e.g., SVC), so as to determine how many independent variables can be deleted. For instance, the case study below can be reduced from a 22-D problem to a 21-D problem by using BAYSD and SVC; a sketch of this validation loop follows.
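The following sketch shows one way to implement this validation loop with scikit-learn's SVC. Since BAYSD is not available in common libraries, the dependence order is passed in as a precomputed ranking (e.g., the BAYSD order of Table 4); the function name and the greedy stop-at-first-failure strategy mirror the procedure of Section 3.3 but are otherwise our assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def reduce_dimensions(X, y, order):
    """Greedy dimension reduction validated by SVC.
    X: (n, m) learning samples; y: observed classes;
    order: 0-based column indices to try deleting (e.g., a BAYSD ranking).
    A deletion is kept only if the SVC fitting results are unchanged."""
    keep = list(range(X.shape[1]))
    base = SVC().fit(X, y).predict(X)                 # reference fitting results (R(%) baseline)
    for j in order:
        trial = [c for c in keep if c != j]
        y_fit = SVC().fit(X[:, trial], y).predict(X[:, trial])
        if np.array_equal(y_fit, base):               # results unchanged -> x_j is removable
            keep = trial
        else:                                         # results changed -> stop deleting
            break
    return keep
```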

3 CASE STUDY: THE DETERMINATION AND PREDICTION OF OIL FAMILY TYPE

The objective of this case study is to determine and predict oil family type (OFT) using geochemical data, which has practical value when experimental data are limited.

The data are 37 samples from the overseas oilfields presented in [3]; each sample contains 21 independent variables (x1 = C19/C23, x2 = C22/C21, x3 = C24/C23, x4 = C26/C25, x5 = Tet/C23, x6 = C27T/C27, x7 = C28/H, x8 = C29/H, x9 = X/H, x10 = OL/H, x11 = C31R/H, x12 = GA/31R, x13 = C35S/C34S, x14 = S/H, x15 = %C27, x16 = %C28, x17 = %C29, x18 = δ13Csaturated, x19 = δ13Caromatic, x20 = C26T/Ts, x21 = C28/C29) and one variable [y* = oil family type (OFT)]. Peters et al. (2007) employed decision trees to predict the OFT [3]. In the case study, the first 30 samples are taken as learning samples and the last 7 samples as prediction samples (Table 2), for OFT determination by QCA at first and then for OFT prediction by SVC, NBAY and BAYSD.

TABLE 2. INPUT DATA FOR OIL FAMILY TYPE (OFT) OF OVERSEAS OILFIELDS (MODIFIED FROM [3])

No.  x1  x2  x3  x4  x5  x6  x7  x8  x9  x10  x11  x12  x13  x14  x15  x16  x17  x18  x19  x20  x21  y* b
1  4.53  0.43  0.74  1.28  4.16  0.11  0.04  0.53  0.29  0.02  0.29  0.08  0.5  0.35  13.7  26.7  59.6  -28.15  -25.89  0.18  0.46  4
2  2.68  0.51  0.73  1.13  6.82  0.03  0.02  0.69  0.11  0.04  0.27  0.12  1.06  0.31  27.2  26.9  45.9  -30.18  -28.28  0.16  0.59  4
3  0.23  0.5  0.74  0.98  0.67  0.02  0.03  0.68  0.12  0.02  0.29  0.17  0.65  0.52  27.2  21.8  51  -30.54  -29.19  0.61  0.43  4
4  0.12  0.38  0.87  0.61  0.45  0.46  0.07  0.54  0.05  0.01  0.36  0.09  0.96  0.8  36.5  33  30.5  -32.01  -31.23  0.6  1.08  3
5  0.27  0.44  0.75  0.79  0.55  0.08  0.02  0.45  0.1  0.01  0.31  0.09  0.81  0.67  36.4  33.1  30.5  -31.5  -30.4  0.43  1.09  3
6  0.31  0.52  0.88  1.01  0.5  0.09  0.03  0.4  0.18  0.03  0.25  0.17  0.7  0.91  33.9  34.7  31.4  -30.94  -29.65  0.51  1.11  3
7  0.78  0.49  0.8  1.02  0.88  0.2  0.03  0.46  0.17  0.02  0.24  0.14  0.61  0.55  27.7  32.5  39.8  -29.12  -27.36  0.42  0.84  3
8  4.76  0.38  0.59  1.5  5.3  0.21  0.12  0.59  0.32  0.02  0.26  0.18  0.49  1.03  9.2  27.6  63.2  -26.95  -24.92  0.29  0.44  4
9  0.09  0.31  0.85  0.97  0.19  0.13  0.12  0.6  0.11  0.03  0.29  0.48  1.06  0.6  14.8  22  63.2  -33.57  -33.29  1.84  0.35  2
10  0.04  0.3  0.81  0.96  0.18  0.2  0.18  0.61  0.06  0.03  0.34  0.33  1.22  0.82  13.2  19.5  67.3  -34.78  -34.75  2.43  0.29  2
11  0.05  0.31  0.83  0.99  0.13  0.32  0.38  0.66  0.2  0.07  0.31  0.4  1.08  2.13  14.7  22.44  62.9  -34.63  -34.02  3.57  0.36  2
12  2.91  0.28  0.76  0.98  1.38  0.15  0.1  0.52  0.07  0.21  0.23  0.13  0.4  0.39  16.9  34.4  48.6  -28.46  -27.13  0.3  0.74  4
13  0.05  0.46  0.75  0.77  0.22  0.06  0.04  0.55  0.08  0.01  0.34  0.14  1.09  0.51  30  33.7  36.3  -30.38  -29.6  1.37  0.94  3
14  0.06  0.48  0.75  0.74  0.25  0.14  0.04  0.51  0.08  0.03  0.39  0.07  1.24  0.57  28.8  38.8  32.4  -29.99  -29.21  1.11  1.2  3
15  0.03  0.3  0.92  0.76  0.14  0.04  0.03  0.4  0.15  0.01  0.34  0.1  1.18  0.63  28.7  27.6  43.7  -31.41  -30.22  2.13  0.63  2
16  0.04  0.66  0.58  0.78  0.21  0.06  0.07  0.85  0.05  0.01  0.36  0.28  1.18  0.3  30.4  29.3  40.3  -30.31  -29.89  1.91  0.73  1
17  0.13  0.41  0.75  0.98  0.46  0.03  0.03  0.38  0.17  0.01  0.31  0.09  0.59  0.57  30.7  28.5  40.8  -31.23  -30.24  0.65  0.7  3
18  0.36  0.39  0.89  0.84  0.64  0.18  0.11  0.43  0.1  0.08  0.3  0.09  0.87  0.58  27.2  40  32.9  -28.93  -27.94  0.48  1.24  3
19  0.89  0.41  0.85  1.02  0.76  0.05  0.06  0.51  0.12  0.02  0.28  0.06  0.5  0.62  27.2  33.4  39.4  -28.87  -27.22  0.56  0.87  3
20  0.05  0.66  0.6  0.83  0.43  0  0.02  1.11  0.03  0.01  0.39  0.23  1.03  0.2  30  23.7  46.3  -29.3  -28.91  1.19  0.51  1
21  0.02  0.68  0.56  0.79  0.2  0.03  0.13  0.75  0.02  0  0.37  0.37  1.2  0.25  32.3  29.2  38.5  -29.98  -29.95  1.72  0.77  1
22  0.07  0.53  0.7  1.11  0.36  0.02  0.03  0.48  0.16  0.01  0.28  0.23  0.54  0.6  33.2  20.4  46.5  -29.45  -27.71  0.71  0.44  4
23  0.14  0.81  0.32  0.85  1.53  0.04  0.23  2.51  0.11  0  0.33  0.42  1.25  0.46  34  17.1  49  -30.09  -28.04  0.28  0.35  1
24  0.09  0.4  0.78  0.84  0.31  0.02  0.03  0.34  0.12  0.02  0.26  0.11  0.67  0.8  28.8  40.8  30.4  -28.24  -26.93  0.55  1.35  3
25  0.14  0.81  0.32  0.85  1.53  0.04  0.23  2.51  0.11  0  0.33  0.42  1.25  0.46  34  17.1  49  -30.09  -28.04  0.28  0.35  1
26  0.03  0.87  0.54  0.78  0.22  0.01  0.02  0.87  0.01  0  0.54  0.33  0.98  0.28  36.2  17.3  46.5  -29.92  -28.03  2.01  0.37  1
27  0.07  0.53  0.7  1.11  0.36  0.02  0.03  0.48  0.16  0.01  0.28  0.23  0.54  0.6  33.2  20.4  46.5  -29.45  -27.71  0.71  0.44  4
28  0.17  0.43  0.84  0.67  0.66  0.28  0.08  0.49  0.1  0  0.35  0.07  0.98  0.6  36.3  32.6  31.1  -30.36  -29.51  0.41  1.05  3
29  0.22  0.25  0.64  1.44  0.48  0.02  0.02  0.6  0.19  0.01  0.23  0.35  0.49  0.2  21  40.3  38.7  -32.18  -30.43  0.57  1.04  4
30  0.04  0.31  0.75  0.94  0.09  0.02  0.05  0.56  0.1  0.02  0.25  0.84  0.66  0.94  25.9  21.6  52.5  -30.57  -30.11  3  0.53  2
31  0.02  1.18  0.44  0.74  0.17  0  0.01  1.13  0.01  0.01  0.54  0.33  0.98  0.28  36.2  17.3  46.5  -29.92  -28.77  2.46  0.38  (1)
32  0.04  0.3  0.7  1.06  0.07  0.02  0.07  0.62  0.1  0.01  0.26  0.66  0.62  1.08  20.6  19.3  60.1  -30.43  -29.91  3.16  0.32  (2)
33  0.4  0.41  0.88  0.86  0.54  0.16  0.21  0.52  0.78  0.03  0.29  0.21  0.71  2.32  32.8  35.1  32  -29.29  -27.84  0.47  1.1  (3)
34  0.25  0.46  0.87  0.8  0.78  0.23  0.2  0.45  0.13  0.01  0.31  0.08  0.79  0.79  33.6  32.8  33.6  -29.69  -28.78  0.32  0.98  (3)
35  0.43  0.48  0.79  0.91  0.94  0.1  0.13  0.44  0.21  0.01  0.31  0.1  0.69  0.78  32.2  31.3  36.5  -28.63  -27.48  0.24  0.87  (3)
36  0.1  0.33  0.7  0.95  0.35  0.04  0.02  0.67  0.07  0  0.27  0.22  0.63  0.39  30.9  20.5  48.6  -31.48  -29.27  1  0.42  (4)
37  0.22  0.25  0.64  1.44  0.48  0.02  0.02  0.6  0.19  0.01  0.23  0.35  0.49  0.2  21  40.3  38.7  -32.18  -30.43  0.57  1.04  (4)

a x1 = C19/C23, x2 = C22/C21, x3 = C24/C23, x4 = C26/C25, x5 = Tet/C23, x6 = C27T/C27, x7 = C28/H, x8 = C29/H, x9 = X/H, x10 = OL/H, x11 = C31R/H, x12 = GA/31R, x13 = C35S/C34S, x14 = S/H, x15 = %C27, x16 = %C28, x17 = %C29, x18 = δ13Csaturated, x19 = δ13Caromatic, x20 = C26T/Ts, x21 = C28/C29.
b y* = OFT = oil family type (1–4), determined by QCA at first and then predicted by SVC, NBAY and BAYSD; a number in parentheses is not input data, but is used for calculating R(%)i.


3.1 The Calculation and Results of QCA

Using the 37 samples without y* (Table 2), and applying QCA [1, 12] with standard difference standardization for data normalization, the correlation coefficient as the cluster statistic, and the weighted mean as the class-distance method, a cluster pedigree chart of the 37 samples has been obtained (Fig. 1).

FIG. 1 Q-MODE CLUSTER PEDIGREE CHART FOR OIL FAMILY TYPE (OFT)

Figure 1 illustrates the cluster pedigree chart in which the 37 classes (i.e., the 37 samples) are merged step by step into one class, showing the order of dependence between samples. Four classes (an even number, as noted in Section 2.2) can be divided in Fig. 1, and on this basis the class value (1–4) of each sample is filled into the y* column of Table 2; a code sketch of this step follows.
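In terms of the qca() sketch of Section 2.2, this step corresponds to cutting the pedigree chart into four classes. The snippet below is illustrative only: data is a random placeholder for the 37 × 21 matrix of x1–x21 values in Table 2:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
data = rng.random((37, 21))   # placeholder for the x1..x21 columns of Table 2

z = (data - data.mean(axis=0)) / data.std(axis=0)                  # standard difference standardization
tree = linkage(pdist(z, metric="correlation"), method="weighted")  # correlation statistic, weighted mean
oft = fcluster(tree, t=4, criterion="maxclust")                    # four classes -> candidate y* values (1..4)
```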

3.2 The Calculation and Results of SVC, NBAY, BAYSD and MRA

Using the 30 learning samples with y* (Table 2), four functions of OFT (y) with respect to the 21 independent variables (x1, x2, …, x21) have been constructed by SVC [1, 13], NBAY [1, 14, 15], BAYSD [1] and MRA [1, 16, 17]. Substituting the values of the 21 independent variables given by the 30 learning samples and the 7 prediction samples (Table 2) into the four functions, respectively, the OFT (y) of each sample is obtained (Table 3); a sketch of this step follows.
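A sketch of this step with off-the-shelf components is given below. The data arrays are random placeholders for Table 2, scikit-learn's GaussianNB stands in for NBAY, and linear discriminant analysis is used as a rough stand-in for BAYSD (BAYSD itself is not available in common libraries); MRA is ordinary multiple regression with its output rounded to an integer, as in footnote b of Table 3:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X, y = rng.random((30, 21)), rng.integers(1, 5, 30)  # placeholders for the 30 learning samples
X_new = rng.random((7, 21))                          # placeholders for the 7 prediction samples

classifiers = {
    "SVC": SVC(),
    "NBAY": GaussianNB(),                            # naive Bayesian
    "BAYSD-like": LinearDiscriminantAnalysis(),      # stand-in only; not the paper's BAYSD
}
pred = {name: clf.fit(X, y).predict(X_new) for name, clf in classifiers.items()}

# MRA: multiple regression; real-valued output rounded to an integer class
mra = LinearRegression().fit(X, y)
pred["MRA"] = np.rint(mra.predict(X_new)).astype(int)
```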

TABLE 3. PREDICTION RESULTS FOR OIL FAMILY TYPE (OFT) OF OVERSEAS OILFIELDS

No.  OFT y* a  SVC y  R(%)  NBAY y  R(%)  BAYSD y  R(%)  MRA y b  R(%)

Learning samples:
1   4   4  0   4  0     4  0     4  0
2   4   4  0   4  0     4  0     4  0
3   4   4  0   4  0     4  0     3  25
4   3   3  0   3  0     3  0     3  0
5   3   3  0   3  0     3  0     3  0
6   3   3  0   3  0     3  0     3  0
7   3   3  0   3  0     3  0     3  0
8   4   4  0   4  0     4  0     4  0
9   2   2  0   2  0     2  0     2  0
10  2   2  0   2  0     2  0     2  0
11  2   2  0   2  0     2  0     2  0
12  4   4  0   4  0     4  0     4  0
13  3   3  0   3  0     3  0     2  33.3
14  3   3  0   3  0     3  0     2  33.3
15  2   2  0   2  0     2  0     2  0
16  1   1  0   1  0     1  0     1  0
17  3   3  0   3  0     3  0     3  0
18  3   3  0   3  0     3  0     3  0
19  3   3  0   3  0     3  0     3  0
20  1   1  0   1  0     1  0     1  0
21  1   1  0   1  0     1  0     1  0
22  4   4  0   4  0     4  0     4  0
23  1   1  0   1  0     1  0     1  0
24  3   3  0   3  0     3  0     3  0
25  1   1  0   1  0     1  0     1  0
26  1   1  0   1  0     1  0     1  0
27  4   4  0   4  0     4  0     4  0
28  3   3  0   3  0     3  0     3  0
29  4   4  0   4  0     4  0     4  0
30  2   2  0   2  0     2  0     2  0

Prediction samples:
31  1   1  0   1  0     1  0     0  100
32  2   2  0   2  0     2  0     2  0
33  3   3  0   4  33.3  4  33.3  2  33.3
34  3   3  0   3  0     3  0     2  33.3
35  3   3  0   3  0     3  0     3  0
36  4   4  0   4  0     2  50    4  0
37  4   4  0   4  0     4  0     4  0

a y* = OFT = oil family type (1–4) determined by cluster analysis.
b The MRA results y are converted from real numbers to integers by rounding.

From Table 4 and based on Table 1, it can be seen that a) the nonlinearity degree of the studied problem is weak, since the R(%) of MRA is 6.98; and b) the solution accuracies of SVC, NBAY and BAYSD are high, and their results are applicable, since their R(%) values are 0, 0.9 and 2.25, respectively.


TABLE 4. COMPARISON AMONG THE APPLICATIONS OF THE CLASSIFIERS (SVC, NBAY, BAYSD) AND MRA TO OIL FAMILY TYPE (OFT) OF OVERSEAS OILFIELDS

Algorithm  Fitting formula      R1(%)  R2(%)  R(%)  Dependence order c  Time on PC (Intel Core 2)  Problem nonlinearity  Solution accuracy  Results availability
SVC        Nonlinear, explicit  0      0      0     N/A                 5 s                        N/A                   High               Applicable
NBAY       Nonlinear, explicit  0      4.76   0.9   N/A                 <1 s                       N/A                   High               Applicable
BAYSD      Nonlinear, explicit  0      11.9   2.25  see c               1 s                        N/A                   High               Applicable
MRA        Linear, explicit     3.06   23.8   6.98  see c               <1 s                       Weak                  N/A                N/A

c Dependence of the predicted value (y) on the independent variables (x1, x2, …, x21), in decreasing order:
BAYSD: x2, x21, x20, x8, x3, x14, x16, x9, x15, x6, x13, x10, x4, x19, x18, x12, x5, x17, x11, x7, x1
MRA: x13, x12, x11, x4, x8, x20, x2, x16, x15, x10, x14, x3, x19, x18, x1, x9, x17, x7, x21, x6, x5

3.3 Dimension Reduction Succeeded from the 22-D to the 21-D Problem by Using BAYSD and SVC

BAYSD gives the dependence of the predicted value (y) on the 21 independent variables, in decreasing order: x2, x21, x20, x8, x3, x14, x16, x9, x15, x6, x13, x10, x4, x19, x18, x12, x5, x17, x11, x7, x1 (Table 4). According to this dependence order, x2 is deleted first and SVC is run; the results of SVC are found to be the same as before, i.e., R(%) = 0, so the 22-D problem (x1, x2, x3, …, x21, y) can become a 21-D problem (x1, x3, x4, …, x21, y). In the same way, it is found that this 21-D problem cannot become a 20-D problem by deleting x21, because the results of SVC change, with R(%) = 0.9, greater than the previous R(%) = 0 (Table 4). Therefore, the 22-D problem can ultimately become only a 21-D problem; a sketch of this test follows.
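A hypothetical replay of this test, in the style of the reduce_dimensions() sketch from Section 2.3, is shown below; the data arrays are random placeholders for the 30 learning samples of Table 2 (with the paper's actual data, the x2 deletion leaves R(%) = 0 and the subsequent x21 deletion raises it to 0.9):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X, y = rng.random((30, 21)), rng.integers(1, 5, 30)   # placeholder learning samples

base = SVC().fit(X, y).predict(X)                     # fitting results with all 21 variables
X_wo_x2 = np.delete(X, 1, axis=1)                     # delete x2 (0-based column index 1)
unchanged = np.array_equal(SVC().fit(X_wo_x2, y).predict(X_wo_x2), base)
print("x2 removable:", unchanged)                     # True on the paper's data
```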

Compared with SVC, the major advantages of BAYSD are (Table 4): a) BAYSD runs much faster than SVC; b) the BAYSD program is easy to code, whereas the SVC program is very complicated to code; and c) BAYSD can serve as a promising dimension-reduction tool. So BAYSD is better than SVC when the nonlinearity of the studied problem is weak.

4 CONCLUSIONS

Through the aforementioned case study, five major conclusions can be drawn:

1) For a classification problem, QCA can be applied to determine the dependent variable (y), while SVC, NBAY and BAYSD can be applied to predict y;

2) The R(%) of MRA can be used to measure the nonlinearity degree of a studied problem, and thus MRA should be run first;

3) The proposed three criteria [nonlinearity degree of a studied problem based on the R(%) of MRA, solution accuracy of a given algorithm application based on its R(%), and results availability of a given algorithm application based on its R(%)] are practical;

4) If a classification problem has weak nonlinearity, SVC, NBAY and BAYSD are in general all applicable, and BAYSD is better than SVC and NBAY;

5) BAYSD and SVC can be applied to dimensionality reduction.

ACKNOWLEDGMENT

This work was supported by the Research Institute of Petroleum Exploration and Development (RIPED) and

PetroChina.


REFERENCES

[1] Shi G. "Data Mining and Knowledge Discovery for Geoscientists." Elsevier Inc, USA, 2013

[2] Shi G. "Prediction of methane inclusion types using support vector machine." Sci J Earth Sci 5(2): 18–27, 2015

[3] Peters KE, Ramos LS, Zumberge JE, Valin ZC, Scotese CR, Gautier DL. "Circum-Arctic petroleum systems identified using decision-tree chemometrics." AAPG Bulletin 91(6): 877–913, 2007

[4] Shi G, Yang X. "Optimization and data mining for fracture prediction in geosciences." Procedia Comput Sci 1(1): 1353–1360, 2010

[5] Zhu Y, Shi G. "Identification of lithologic characteristics of volcanic rocks by support vector machine." Acta Petrolei Sinica 34(2): 312–322, 2013

[6] Shi G, Zhu Y, Mi S, Ma J, Wan J. "A big data mining in petroleum exploration and development." Adv Petrol Expl Devel 7(2): 1–8, 2014

[7] Zhang Q, Shi G. "Economic evaluation of waterflood using regression and classification algorithms." Adv Petrol Expl Devel 9(1): 1–8, 2015

[8] Shi G. "Optimal prediction in petroleum geology by regression and classification methods." Sci J Inf Eng 5(2): 14–32, 2015

[9] Li D, Shi G. "Data mining in petroleum upstream: the use of regression and classification algorithms." Sci J Earth Sci 5(2): 33–40, 2015

[10] Shi G, Ma J, Ba D. "Optimal selection of classification algorithms for well log interpretation." Sci J Con Eng 5(3): 37–50, 2015

[11] Mi S, Shi G. "The use of regression and classification algorithms for layer productivity prediction in naturally fractured reservoirs." J Petrol Sci Res 4(2): 65–78, 2015

[12] Everitt BS, Landau S, Leese M, Stahl D. "Cluster Analysis, 5th edition." John Wiley & Sons, Chichester, England, UK, 2011

[13] Chang C, Lin C. "LIBSVM: a library for support vector machines, Version 3.1." Retrieved from www.csie.ntu.edu.tw/~cjlin/libsvm, 2011

[14] Tan P, Steinbach M, Kumar V. "Introduction to Data Mining." Pearson Education, Boston, MA, USA, 2005

[15] Han J, Kamber M. "Data Mining: Concepts and Techniques, 2nd Ed." Morgan Kaufmann, San Francisco, CA, USA, 2006

[16] Sharma MSR, O'Regan M, Baxter CDP, Moran K, Vaziri H, Narayanasamy R. "Empirical relationship between strength and geophysical properties for weakly cemented formations." J Petro Sci Eng 72(1-2): 134–142, 2010

[17] Singh J, Shaik B, Singh S, Agrawal VK, Khadikar PV, Deeb O, Supuran CT. "Comparative QSAR study on para-substituted aromatic sulphonamides as CAII inhibitors: information versus topological (distance-based and connectivity) indices." Chem Biol Drug Design 71: 244–259, 2008

AUTHORS

Qian Zhang is a senior engineer of PetroChina. She was born in Heilongjiang Province, China, in September 1983. She graduated from the Beijing Institute of Technology in 2010 with a Ph.D. in applied mathematics. She joined the Research Institute of Petroleum Exploration and Development (RIPED) of PetroChina in 2010 and works as a petroleum evaluation engineer on national and world energy assessment projects. Her recent accomplishments include petroleum resources evaluation software development and the application of data mining algorithms in petroleum evaluation. She has published 6 articles, of which 3 are SCI-indexed and 3 are on data mining. She received the 1st-class award for science-technology achievement from RIPED (2015).

Guangren Shi is a professor of PetroChina. He was born in Shanghai, China, in February 1940. His research covers two major fields: basin modeling (petroleum systems) and data mining for geosciences. He has published 8 books, of which 3 are in English and 3 are on data mining, and 75 articles, of which 4 are SCI-indexed and 15 are on data mining.