
76 Int. J. Electronic Security and Digital Forensics, Vol. 7, No. 1, 2015

Copyright © 2015 Inderscience Enterprises Ltd.

A hybrid evolutionary algorithm for feature and ensemble selection in image tampering detection

Jonathan Goh* and Vrizlynn L.L. Thing Cyber Security and Intelligence Department, Institute for Infocomm Research, Singapore Email: [email protected] Email: vriz-i2r.a-star.edu.sg *Corresponding author

Abstract: The detection of tampered images is of significant importance in digital forensics. A key problem in image tampering detection is the vast number of features currently available in the literature, which makes it very challenging to determine the best features to correctly characterise these images. This paper proposes a hybrid evolutionary framework that performs a quantitative study of the features used in image tampering detection to find the best feature set. Upon feature evaluation and selection, the classification mechanism must be optimised for good performance. Therefore, in addition to determining an optimal set of features for a classifier, the hybrid framework is capable of determining the optimal multiple classifier ensembles while achieving the best classification performance in terms of low complexity and high accuracy for image tampering detection. Using only 5% of the dataset for training, we obtained accuracies of 90.18% on the CASIA 1 dataset with 1,457 test images, 96.21% on the CASIA 2 dataset with 10,200 test images and 94.64% on a combined CASIA 1 and 2 dataset with 11,657 test images. The experimental results show that our image tampering detection can support large-scale digital image evidence authenticity verification with consistently good accuracy.

Keywords: image forgery; evolutionary algorithms; optimal feature selection; multiple classifiers systems.

Reference to this paper should be made as follows: Goh, J. and Thing, V.L.L. (2015) ‘A hybrid evolutionary algorithm for feature and ensemble selection in image tampering detection’, Int. J. Electronic Security and Digital Forensics, Vol. 7, No. 1, pp.76–104.

Biographical notes: Jonathan Goh is currently a Research Scientist in the Department of Cyber Security and Intelligence at the Institute for Infocomm Research, A*STAR, Singapore. He received both his PhD and BSc (1st Class Honours) from the University of Surrey, UK, in 2011 and 2006 respectively. His research interests include multimedia forensics, steganography, steganalysis, biometrics liveness, applied machine learning and evolutionary computation.

Vrizlynn L.L. Thing leads the Cyber Security and Intelligence R&D Department at the Institute for Infocomm Research, A*STAR, Singapore. She has over 13 years of security and forensics R&D experience with in-depth expertise in cyber crime and attack evolvement detection and mitigation, cyber security, digital forensics, and security intelligence and analytics. Her research draws on her multidisciplinary background in computer science (PhD from Imperial College London, UK), and electrical, electronics, computer and communications engineering (Diploma from Singapore Polytechnic, BEng and MEng by Research from Nanyang Technological University, Singapore).


1 Introduction

With the exponential growth of social media platforms, large volumes of images are being uploaded onto the internet each day. As such, the distribution, manipulation and duplication of images can be done very easily (Farid, 2009). While we are mostly not concerned about the authenticity of these images, there may be situations where images are manipulated to intentionally mislead the viewer, which can have very serious consequences. These situations could be related to, but are not limited to, official government documents, legal matters, healthcare, journalism and forensics investigations. Depending on the skill of the forger, the manipulation can be executed with a very high degree of sophistication, and it is often very difficult to detect such tampered images. These advancements create an increasing need for an infrastructure that can accurately determine the authenticity of large volumes of digital images.

The detection of tampered images is a very challenging problem because the various types of tampering, such as copy-move and splicing, produce different statistical features. Some well-known methods include analysing the discrete cosine transform (DCT) coefficients (Fridrich et al., 2003; Popescu and Farid, 2004) or the discrete wavelet transform (DWT) (Li et al., 2007; Zhang et al., 2008) of the images to identify copy-move tampering. Some of the proposed work also studied the use of double compression effects in the spatial and DCT domains to identify tampered regions (Lin and Wu, 2011; Thing et al., 2012).

While there are many features, it is impossible for any single set of features to truly represent the spectrum of a large tampered image dataset. On the other hand, if all the different features are used, redundancy in the feature set results in a high-dimensional problem, causing poor classification performance. For example, the work by Sutthiwan et al. (2010) and Su et al. (2007) proposes a method using Markov process transition probability matrices (TPM) for detecting tampered images with high accuracy. This is achieved by capturing the changes in the distribution of the DCT coefficients caused by tampering. He et al. (2012) extended the original set of features (Shi et al., 2007; Sutthiwan et al., 2011) with 72 additional features in the form of DWT. The problem with the TPM features is their large dimensionality. Seventy-two TPMs result in a feature dimensionality between 2,250 and 7,290, depending on the threshold used to generate the TPM (He et al., 2012). We illustrate in Section 5.7 that such high dimensionality results in the infamous ‘curse of dimensionality’, which prevents the classifier from generalising (causing poor classification performance). Therefore, Shi et al. (2007), Sutthiwan et al. (2011) and He et al. (2012) all applied feature reduction to reduce the feature dimensions. However, reducing the TPM through feature reduction causes a loss in the correlation between the coefficients needed to detect tampering. In contrast, we believe that by preserving the integrity of the TPM, better performance can be obtained. The key to this problem is twofold. Firstly, how do we strike a balance between identifying key TPMs and preserving a low feature dimensionality? Secondly, how do we identify additional key features while, at the same time, obtaining a high classification performance?

In this paper, we propose a hybrid evolutionary algorithm that combines a population-based global heuristic search strategy to determine the feature sets and, at the same time, uses a local refinement to optimise a multiple classifier system (MCS) to ensure a good classification performance. To the best of our knowledge, this is the first work in image tampering detection that utilises a hybrid evolutionary algorithm. We propose a framework that consists of two genetic algorithms (GA) to ensure a balance between key features and an optimised MCS. In addition, we feature engineered multiple features that have been developed in the field of image forensics to be used in our framework. This way, we are able to use the hybrid algorithm to perform a fair evaluation of all features used in image tampering detection and determine the best features to characterise tampered images. Furthermore, as the framework uses the ground truth as a source for evaluating the solutions, this procedure will ensure that the data best fits the model. In our work, we illustrate through experiments that our hybrid algorithm performs well in identifying key features and classifiers over iterative generations. We also show that using an MCS consistently outperforms single classifiers, with good accuracy for detecting tampered images.

The rest of the paper is organised as follows. In Section 2, we discuss the related work in image forgery detection. In Section 3, we briefly describe the techniques used to extract the features for characterising the images, and in Section 4, we describe the hybrid evolutionary algorithm used to determine the optimal feature set and classifiers. We present and discuss the experimental results in Section 5 and conclude the paper in Section 6.

2 Related work

In this section, we discuss the existing techniques capable of performing various forms of tampered image detection. Also, we discuss the motivations for the use of multiple classifiers in our work.

2.1 Image tampering detection

2.1.1 Copy-move

In Fridrich et al. (2003), the authors suggested segmenting the image into overlapping blocks. DCT coefficients were then extracted from these blocks and utilised as features. The DCT coefficients were then lexicographically sorted and thresholding was applied to determine similar regions.

The technique of Popescu and Farid (2004) differs slightly in that it applies principal component analysis to small fixed-size image blocks to obtain a lower dimension DCT block representation. Each block was then vectorised and inserted into a matrix to obtain the covariance matrix. Subsequently, the eigenvectors of the covariance matrix were lexicographically sorted to detect the duplicated regions.

Li et al. (2007) decomposed the input image into four sub-bands by applying the DWT. Singular value decomposition (SVD) was then applied to each individual sub-band to obtain a low dimensional representation. The SVD vectors were then sorted lexicographically to detect similar regions using a fixed threshold. The main drawback of these methods is that they are not effective if the copied region has been scaled or rotated.


So, in order to overcome these obstacles, Zhang et al. (2008) first applied the DWT to reduce the dimensionality of the image. The authors then computed the phase correlation to estimate the spatial offset between the copied region and the pasted region. The pixel location with the corresponding peaks was then detected as the copied-regions.

Huang et al. (2008) used the scale invariant feature transform (SIFT) as a feature to describe every key point on an image. Subsequently, all the key points were compared with each other. Matching key points with the minimum Euclidean distance that satisfied a threshold were then detected as similar regions to determine forgery. Xu et al. (2010) proposed using speeded-up robust features (SURF) descriptors instead of SIFT because SURF is computationally more efficient owing to its use of integral images. Similar to SIFT, the key points were first obtained using the SURF descriptors. Subsequently, feature matching was carried out to identify similar regions using a predefined threshold.

In Lin and Wu (2011), as part of an integrated technique to detect copy-move and splicing, the authors used the SURF as the key-point feature for carrying out similarity measurements. To detect a copy and move forgery, the Euclidean distance was calculated between the key-points, and a threshold was used to decide on the similarity among the blocks.

2.1.2 Splicing

In order to detect image splicing, Ng and Chang (2004) employed a higher order moment spectrum inspired by speech recognition, where bi-coherence features were originally used for human-speech splicing detection. The authors computed 1-D bi-coherence features from the vertical and horizontal slices of an image to detect discontinuities in the tampered image. For their work, the authors obtained an accuracy of 71.4%.

In Chen et al. (2007), the authors stated that splicing would often produce sharp edges and corners. Hence, they proposed using statistical moments of characteristic functions obtained from the input image, the prediction-error image and their wavelet sub-bands. The prediction-error image was obtained by setting all the wavelet approximation bands to zero before applying the inverse DWT, thereby enhancing the high frequency components. Through this step, moments of wavelet characteristic functions could be obtained. All the features were then merged and used to train an SVM. Using their proposed method, the authors obtained an accuracy of 82.3%. On the other hand, Shi et al. (2007) used features from DWT sub-bands and Markov transition probabilities of difference 2D arrays obtained from different arrays of DCT. For their work, the authors obtained an accuracy of 92%.

Lin and Wu (2011) analysed the double compression effects in the spatial and DCT domains to observe the effectiveness of these characteristics in the detection of image forgery. Through their analysis, it was shown that double compression introduces specific artefacts (i.e., the DQ effect which exhibits periodic peaks and valleys) in the DCT coefficient histogram. So, in order to detect splicing, the probability of the image being forged by splicing was computed based on the analysis of the statistics of the histogram and the detection of these specific artefacts.


2.1.3 JPEG re-compression

Bianchi and Piva (2011) studied JPEG compression observations in tampered images. In their research, they analysed the occurrence of JPEG compression in two scenarios. The first scenario is where the image has been forged and saved again in the JPEG format. In this scenario, the DCT coefficients of the unmodified regions undergo a double compression and exhibit an effect called double quantisation (DQ). In the second scenario, the DCT coefficients of the forged areas in the tampered images are observed to have undergone only a single compression (i.e., without the DQ effect). This single compression of the forged region is the result of decompressing a JPEG image, carrying out image tampering manipulation on the forged region (e.g., image splicing) and subsequently saving the image back into the JPEG format. Therefore, the unmodified region will exhibit double compression artefacts but the forged regions will not.

Thing et al. (2012) introduced an improved double compression detection method for JPEG image forensics. The focus of their research was on a detection method that is more suited for large scale image authenticity verification. In their work, DCT blocks in a JPEG image are first processed to form the coefficient histograms. A robust period detection technique was proposed to detect the local maximum within a period in the DCT histogram. Probabilities of each DCT coefficient in each 8 by 8 block were then calculated to determine its authenticity. A probability bit map was then created by summing up the probabilities for each block in the image. Four features (and, the optimal threshold value used to detect normalities, the variance of the normalities in the authentic and tampered image class, and the squared difference between the mean normalities of the two classes) were extracted and used to train a SVM. In order to demonstrate the robustness and scalability of their method, they used a publicly available database of 9,501 images, where about 5% of the images were used for training and the remainder used for testing. For their work, they obtained an average of 90.81% for true negative (i.e., correctly detecting the authentic images) and 76.95% for true positives (i.e., correctly detecting the forged images).

2.2 Multiple classifiers

To the best of our knowledge, Hsu and Chang (2008), Lin and Wu (2011) and Fontani et al. (2011) are the only existing works that fuse information from multiple sources for detecting tampered images. Combining classifiers is known to improve performance when using a large number of classifiers, with each classifier giving overlapping performance on different parts of the problem space (Sharkey, 1999). The rationale is that different classifiers are expected to provide complementary information to other classifiers. The approaches of Hsu and Chang (2008), Fontani et al. (2011) and Lin and Wu (2011) are similar in nature; multiple image forensics tools are applied to the image before the information is fused together for a final decision. Fontani et al. (2011) applied multiple image forensics tools followed by a Dempster-Shafer framework to fuse this information together to infer the best decision. In their work, they used a custom-built dataset of 1,600 images for experiments. Hsu and Chang (2008) utilised cues such as DQ artefacts and camera response function inconsistency as input to their statistical fusion framework. For their work, only 90 images were utilised to verify their experiments. Lin and Wu (2011) applied the SURF descriptor for detecting copy-move tampering and extracted DCT coefficients for DQ effect processing to detect splicing. The outputs of these two processes were then fused together using a simple OR operation for a final decision. In their work, only 20 images were used for verification. Our work differs in that it automatically selects the key features based on the characteristics of the tampered images.

2.3 Our contribution

The main objective of our work is to identify key features to detect image tampering. Our contributions are as follows:

• The design of the hybrid evolutionary algorithm to obtain optimal features and the multiple classifier components that are capable of characterising authentic and tampered images.

• Manage feature dimensions. We show that by identifying key TPMs, we are able to manage the feature dimensions while obtaining better classification accuracy.

• Evaluation of features. We automatically evaluate the features given an initial large pool of 582 features. A large pool enables the evolutionary algorithm to search and dynamically combine a wider range of features. Transition probability matrix features obtained through DWT (Zhang et al., 2008; Shi et al., 2007) and DCT (Li et al., 2007; Wang et al., 2012) were considered as they utilise multiple block sizes ranging from 2 by 2 to 32 by 32. Therefore, these features have the potential to detect subtle changes caused by tampering on finer regions. Features utilised in Chen et al. (2007) and Shi et al. (2007) were also included because image splicing introduces unnatural characteristics (such as distinct edges, lighting inconsistencies, etc.) into the image; these characteristics could provide useful edge statistics within the spliced image.

• Evaluate on a large scale database. We show through experiments that our proposed framework is able to select key features and, at the same time, achieve high accuracies of at least 90%.

3 Features

In this section, we discuss the comprehensive list of features that we consider for characterising the different forms of tampering through evolutionary computation.

3.1 Discrete cosine transform

As mentioned earlier in Section 2, JPEG images that have splicing would inherently undergo a double JPEG compression thus exhibiting a DQ effect (Bianchi and Piva, 2011; Thing et al., 2012). So, by capturing the underlying statistics of the DCT coefficients, such tampering can be determined. Hence, we use DCT as a feature in our hybrid evolutionary algorithm.

Similar to Sutthiwan et al. (2010), we adopted the use of multi block DCT (MBDCT) ranging from 2 by 2 to 32 by 32 blocks as part of our features pool. One reason for using block DCT is that neighbouring DCT coefficients often exhibit mutual correlations. By applying the Markov random process (discussed in Section 3.2), we are able to capture these transitions. Also, multi block features have the additional capability of detecting subtle changes caused by tampering on finer regions. Since a comprehensive description of the technique is reported in Sutthiwan et al. (2010) and Shi et al. (2007), we only provide a brief introduction to the Markov random process (Section 3.2) and refer the reader to these references for more information.

3.2 Markov random process

In this work, we apply the Markov random process not only to the DCT but also to the wavelet transform (discussed in Section 3.3) to obtain the transitional characteristics throughout the multi level transition.

3.2.1 Difference 2-D arrays

After an image has gone through the series of n × n block MBDCT-based processing, the difference arrays are generated in various directions. Depending on the direction of the difference array, the difference is carried out by subtracting the pixel value from its neighbour in the horizontal, vertical and diagonal directions [Figure 1(a)]. All the difference operations start from the top-left of the image, with the exception of the diagonal backward direction [Figure 1(b)], which starts at the bottom-right of the image. As the differences are taken in various directions, each relevant block size will have multiple difference arrays.

Figure 1 Example of difference array operation: (a) diagonally forward, (b) diagonally backwards
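The difference operation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a small list-of-rows array, the sign convention (neighbour minus current value) follows the wording "subtracting the pixel value from its neighbour", and the neighbour offset used for the backward diagonal is a guess, since it is defined only by Figure 1(b). The names `DIRECTIONS` and `difference_array` are hypothetical.

```python
# Offsets (d_row, d_col) from the current element to the neighbour used in each
# direction. The backward-diagonal offset is an assumption made for illustration.
DIRECTIONS = {
    "horizontal": (0, 1),
    "vertical": (1, 0),
    "diagonal": (1, 1),
    "diagonal_backward": (-1, -1),
}


def difference_array(block, direction):
    """Return neighbour - current for every element that has a neighbour."""
    d_row, d_col = DIRECTIONS[direction]
    rows, cols = len(block), len(block[0])
    result = []
    for r in range(rows):
        if not 0 <= r + d_row < rows:
            continue  # no neighbour in this direction for the whole row
        result.append([
            block[r + d_row][c + d_col] - block[r][c]
            for c in range(cols)
            if 0 <= c + d_col < cols
        ])
    return result


block = [[1, 2, 4],
         [7, 11, 16],
         [22, 29, 37]]
horizontal = difference_array(block, "horizontal")
diagonal = difference_array(block, "diagonal")
```

Each direction yields a slightly smaller array than the input, because elements on the far border have no neighbour in that direction.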

3.2.2 Thresholding of difference array

In order to limit the range of the difference values for Markov processing, we use the thresholding technique of Shi et al. (2007) to set the difference values to be within the range [–T, T] (i.e., all values less than –T are set to –T and all values greater than T are set to T). The thresholding operation results in a (2T + 1) × (2T + 1) matrix, which we refer to as a full TPM. We chose to use two different thresholds, 2 and 4. The purpose of this was firstly to generate more features for the GA. Secondly, and more importantly, it ensures that we detect subtle changes at different scales. With a thresholding value of 4, the output of the difference array will be between –4 and 4 with intervals of 1 (resulting in coarser detection). With a thresholding value of 2, the output of the difference array will be between –2 and 2. However, the intervals are limited to 0.5, hence providing better capabilities in detecting subtle changes.
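As a minimal sketch of the clipping step only, the following Python function limits every difference value to [–T, T]. The function name is hypothetical, and the sketch does not model the 0.5 interval mentioned for the threshold of 2; it simply clips whatever values it is given.

```python
def threshold_differences(diff, t):
    """Clip every value of a 2-D difference array to the range [-t, t]."""
    return [[max(-t, min(t, value)) for value in row] for row in diff]


diff = [[-7, 3, 0],
        [5, -2, 4]]
clipped_t4 = threshold_differences(diff, 4)  # values outside [-4, 4] saturate
clipped_t2 = threshold_differences(diff, 2)  # values outside [-2, 2] saturate
```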

3.2.3 Transition probability matrix

The correlation between the elements in the array is modelled by using the TPM in both the horizontal and vertical directions to capture the correlation characteristics. The elements of the matrices associated with the horizontal and vertical difference 2D arrays are given by:

\[
p\{D_h(u+1, v; n) = i \mid D_h(u, v; n) = j\} = \frac{\sum_{u,v} \delta\big(D_h(u, v; n) = j,\; D_h(u+1, v; n) = i\big)}{\sum_{u,v} \delta\big(D_h(u, v; n) = j\big)} \tag{1}
\]

\[
p\{D_v(u, v+1; n) = i \mid D_v(u, v; n) = j\} = \frac{\sum_{u,v} \delta\big(D_v(u, v; n) = j,\; D_v(u, v+1; n) = i\big)}{\sum_{u,v} \delta\big(D_v(u, v; n) = j\big)} \tag{2}
\]

where i, j ∈ {t : t ∈ ℤ, −T ≤ t ≤ T}; δ(D_h(u, v; n) = j) and δ(D_v(u, v; n) = j) equal 1 if the statement is true, or 0 if false; and δ(x, y) equals 1 if both statements x and y are true, or 0 otherwise. In the above equations, the normalisation factor for the probability matrices is taken over the columns, i.e., over j. In addition to normalisation over the columns, we also normalised the TPM over the rows and over the whole image to introduce additional features and make the feature set more complete.
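The construction of a TPM and its three normalisations can be sketched as follows. This is an illustrative Python version under several assumptions: the difference array is already thresholded to integer values in [–T, T], the count matrix is laid out with the next state i as the row and the current state j as the column (so that dividing by the number of occurrences of j is a column normalisation), and only positions that have a neighbour are counted. The function names are hypothetical.

```python
def transition_counts(diff, t, d_row=0, d_col=1):
    """Count transitions j -> i between each element and its neighbour.

    counts[i + t][j + t] is the number of positions whose current value is j
    and whose neighbour at offset (d_row, d_col) has value i.
    """
    size = 2 * t + 1
    counts = [[0] * size for _ in range(size)]
    rows, cols = len(diff), len(diff[0])
    for r in range(rows - d_row):
        for c in range(cols - d_col):
            j = diff[r][c]
            i = diff[r + d_row][c + d_col]
            counts[i + t][j + t] += 1
    return counts


def normalise(counts, mode="column"):
    """Normalise a count matrix over its columns, its rows or all entries."""
    size = len(counts)
    if mode == "column":
        totals = [sum(counts[i][j] for i in range(size)) for j in range(size)]
        return [[counts[i][j] / totals[j] if totals[j] else 0.0
                 for j in range(size)] for i in range(size)]
    if mode == "row":
        return [[value / sum(row) if sum(row) else 0.0 for value in row]
                for row in counts]
    if mode == "whole":
        total = sum(sum(row) for row in counts)
        return [[value / total if total else 0.0 for value in row]
                for row in counts]
    raise ValueError("unknown normalisation mode: " + mode)


diff = [[0, 1, 0],
        [1, 1, 0]]
counts = transition_counts(diff, t=1)        # horizontal neighbour by default
tpm_column = normalise(counts, "column")     # corresponds to equations (1)-(2)
tpm_row = normalise(counts, "row")
tpm_whole = normalise(counts, "whole")
```

With T = 1 the matrix is 3 × 3, in line with the (2T + 1) × (2T + 1) size of a full TPM; passing `d_row=1, d_col=0` gives the vertical variant.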

3.3 Discrete wavelet transform

DWT is a multi-level decomposition technique localised in both the frequency and spatial domains. As it is multi-level, it has the ability to capture the transient changes between the different resolutions, giving it better abilities in representing tampered characteristics. In order to obtain more features, we used a Daubechies (DB) wavelet transform over two layers. At the first level of the wavelet transform, the image is decomposed into four sub-images labelled LL1, LH1, HL1 and HH1 respectively. LL1 corresponds to the approximation image, which is used for further decomposition at the second level. LH1, HL1 and HH1 correspond to the vertical, horizontal and diagonal components of the image. In the second layer, we obtain LH2, HL2 and HH2, which correspond to the vertical, horizontal and diagonal components in the second layer of the wavelet respectively. The Markov random process is then applied to capture the transition change in the wavelet and also to further reduce the dimensionality of the wavelet.
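To make the two-layer decomposition concrete, the sketch below uses the Haar wavelet, the simplest member of the Daubechies family; the order of the Daubechies wavelet actually used is not stated here, so Haar is a stand-in. The sub-band labels in the code follow one common convention (first letter for the filter applied along rows, second for the filter applied along columns), which is an assumption and may not match the labelling above. Image sides are assumed to be divisible by four.

```python
import math


def haar_1d(signal):
    """One level of the orthonormal Haar transform of an even-length sequence."""
    scale = 1 / math.sqrt(2)
    low = [(signal[k] + signal[k + 1]) * scale for k in range(0, len(signal) - 1, 2)]
    high = [(signal[k] - signal[k + 1]) * scale for k in range(0, len(signal) - 1, 2)]
    return low, high


def filter_columns(matrix):
    """Apply haar_1d to every column and return (low, high) as row-major arrays."""
    low_cols, high_cols = zip(*(haar_1d(col) for col in zip(*matrix)))
    return [list(row) for row in zip(*low_cols)], [list(row) for row in zip(*high_cols)]


def haar_dwt2(image):
    """One decomposition level: returns the LL, LH, HL and HH sub-bands."""
    row_low, row_high = zip(*(haar_1d(row) for row in image))
    ll, lh = filter_columns(row_low)    # low-pass along rows
    hl, hh = filter_columns(row_high)   # high-pass along rows
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}


def two_level_dwt(image):
    """Decompose the image, then decompose its approximation band again."""
    level1 = haar_dwt2(image)
    level2 = haar_dwt2(level1["LL"])
    return level1, level2


image = [[1, 3, 1, 3] for _ in range(4)]   # intensity varies only along each row
level1, level2 = two_level_dwt(image)
```

In this toy image all variation lies along the rows, so only the approximation band and one detail band are non-zero at the first level.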


In addition to the TPM of the wavelet, we also obtain statistics from each of the different sub-bands according to Wang et al. (2012). For each of the sub-images, we calculate the mean, norm1 (sum of all the absolute values of the wavelet coefficients), norm2 (Euclidean norm, which essentially measures the length of the vector), standard deviation and average residual (i.e., the deviation from the mean) using the following equations:

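\[
\mathrm{mean} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{3}
\]

\[
\mathrm{norm1} = \sum_{i=1}^{n} |x_i| \tag{4}
\]

\[
\mathrm{norm2} = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2} \tag{5}
\]

\[
\text{Standard deviation} = \left( \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mathrm{mean})^2 \right)^{1/2} \tag{6}
\]

\[
\text{Average residual} = \sum_{i=1}^{n} (x_i - \mathrm{mean}) \tag{7}
\]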

where x is the wavelet coefficient and n is the number of coefficients in the sub image.
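The five sub-band statistics can be computed directly from the definitions. The Python sketch below assumes the coefficients of one sub-image are supplied as a flat list with at least two entries; the function name and dictionary keys are hypothetical. The average residual is implemented literally as the sum of signed deviations from the mean, which is zero up to rounding error; if the intended quantity uses absolute or squared deviations, that detail is not recoverable from the text and is not assumed here.

```python
import math


def subband_statistics(x):
    """Return the statistics of equations (3)-(7) for a flat list of coefficients."""
    n = len(x)
    mean = sum(x) / n
    return {
        "mean": mean,
        "norm1": sum(abs(value) for value in x),
        "norm2": math.sqrt(sum(value * value for value in x)),
        "std": math.sqrt(sum((value - mean) ** 2 for value in x) / (n - 1)),
        "average_residual": sum(value - mean for value in x),
    }


stats = subband_statistics([1, -2, 3, -4, 7])
```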

3.4 Edge statistics

Image manipulation processes usually introduce unnatural transitions within the image in terms of edges. Typically, edges and lines in images correspond to high responses in both the frequency and spatial domains. As a result, these responses could inherit the forged artefacts and could assist in detecting tampered images. So, edge statistics are computed from different images obtained through various image processing techniques. We use the method detailed in Chen et al. (2007) to enhance these edge features by using a reconstructed image. The reconstructed image is generated by first applying the DWT to the input image. The approximation sub-band is then substituted with zeros. By inverting the wavelet transform, the reconstructed image is obtained.

In addition to the reconstructed image, we also compute a prediction-error image according to Chen et al. (2007). The purpose of the prediction-error image is to remove redundant low frequency information from the image while retaining the high frequency information pertaining to the edge features. Finally, 2D phase congruency (Kovesi, 1999) is applied to both the prediction-error image and the reconstructed image in order to enhance the edge information. In this research, the edge statistics computed are the mean, variance, skewness and kurtosis.
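The four edge statistics can be illustrated with a short Python function. The exact estimator definitions are not given above, so this sketch assumes population central moments (division by n), skewness as m3 / m2^1.5 and non-excess kurtosis as m4 / m2^2; the function name is hypothetical and the input is taken to be a flat list of pixel values from one colour component.

```python
def edge_statistics(values):
    """Mean, variance, skewness and kurtosis from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,  # flat input: define as 0
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,
    }


stats = edge_statistics([1, 2, 3, 4, 5])
```

Four statistics over three colour components would give 12 values per processed image, which is consistent with the counts listed for these features in Table 1.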


Table 1 Features list

| Feature | Component | Difference array direction | Normalisation method | Difference array threshold | Colour component | Number of features |
| --- | --- | --- | --- | --- | --- | --- |
| DCT-TPM | 2 by 2, 4 by 4, 16 by 16, 32 by 32 | Vertical, horizontal, diagonal, diagonally backwards | Normalisation over row, columns and whole image | 4, 2 | Y, Cr, Cb | 360 |
| JPEG-TPM | 8 by 8 | Vertical, horizontal, diagonal, diagonally backwards | Normalisation over row | 4 | - | 4 |
| Statistics | Reconstructed image | - | - | - | Y, Cr, Cb | 12 |
| Statistics | Phase congruency image through prediction-error image | - | - | - | Y, Cr, Cb | 12 |
| Statistics | Phase congruency image through reconstruction image | - | - | - | Y, Cr, Cb | 12 |
| TPM from wavelet | DB layer 1: LL1, HL1, LH1, HH1 | Vertical, horizontal, diagonal, diagonally backwards | Normalisation over row, columns and whole image | 4, 2 | - | Features for 1 block = 24; total features for 4 blocks = 96 |
| Wavelet statistics | LL2, HL2, LH2, HH2, HL1, LH1, HH1 | - | - | - | - | 7 |
| Phase correlation | LL2, HL2, LH2, HH2, HL1, LH1, HH1 | - | - | - | - | 7 |
| Total number of features | | | | | | 582 |


3.5 Correlation filters

Correlation filters have been widely used for detecting similar objects in the frequency domain (Jingying et al., 2002; Zhang et al., 2008). In this research, they are used primarily as features to represent copy-move forgeries. These filters work by systematically extracting segments across the whole image. Each of these segments is used as a template image. The template image is then converted into the frequency domain and correlated with every other segment in the image. The correlation ratio, R, is calculated between two image regions using the following formula:

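\[
R = \frac{F(\mathrm{image1}) \times \mathrm{conj}\big(F(\mathrm{image2})\big)}{\big| F(\mathrm{image1}) \times \mathrm{conj}\big(F(\mathrm{image2})\big) \big|} \tag{8}
\]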

where F is the Fourier transform, and conj is the complex conjugate. We applied this in the wavelet domain in order to find correlations within the DWT sub-bands. The outputs of the correlation ratio are then used as the features.
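The behaviour of a normalised correlation ratio of this kind can be illustrated on one-dimensional signals. The sketch below is a simplified stand-in, not the 2-D procedure applied to the DWT sub-bands: it uses a naive discrete Fourier transform, assumes the ratio is normalised by the magnitude of the cross-spectrum (standard phase correlation), and guards against division by zero with a small `eps`. All function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def phase_correlation(signal1, signal2, eps=1e-12):
    """Inverse transform of the normalised cross-spectrum of two signals."""
    f1, f2 = dft(signal1), dft(signal2)
    cross = [a * b.conjugate() for a, b in zip(f1, f2)]
    ratio = [c / max(abs(c), eps) for c in cross]   # keep the phase only
    return [value.real for value in dft(ratio, inverse=True)]


template = [3, 1, 4, 1, 5, 9, 2, 6]
shifted = template[-3:] + template[:-3]     # circular shift to the right by 3
surface = phase_correlation(shifted, template)
peak = max(range(len(surface)), key=surface.__getitem__)
```

When one region is a circularly shifted copy of the other, the correlation surface is close to a single peak whose position gives the offset, which is why such outputs are informative for copy-move detection.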

So, through the utilisation of all these features, we generate a feature pool of 582 features and the breakdown of the features is listed in Table 1.

4 Hybrid evolutionary algorithm: a Memetic algorithm

Memetic algorithms (MA) use a minimum of two evolutionary techniques in a combined approach and maintain a population of solutions. In our work, we use two evolutionary techniques: one to optimise the feature set and the other to optimise the MCS. GAs are used to perform both the global and the local search because they have been proven to be very good search algorithms (Yao, 1999; Kuncheva and Jain, 2000; Kwong et al., 2001; Kyoung et al., 2006). In addition, since they are stochastic processes, they have the ability to avoid being trapped in local minima, as opposed to traditional search methods that originate from a single point.

The GA process leads to the generation of a new population of individuals that is better suited to the problem than the individuals from which they were created, eventually reaching an optimal solution. For each solution, a subsequent GA is carried out to further optimise the solution. Ideally, after the termination criteria have been met, the final population will consist only of the best individuals, which will be decoded as the optimised set of solutions. Algorithm 1 presents the pseudo code of the Memetic algorithm for our feature and MCS evolution, where the variable pre_defined_accuracy is the predefined minimum accuracy required for a solution to be passed on to the next GA.


Algorithm 1 Pseudo code of Memetic algorithm

Initialise Population
while iteration < MaxGeneration do
    – Commence GA for Feature Selection –
    Selection of a set of Feature solutions using the Roulette Wheel Selection (RWS);
    Carry out Crossover of Chromosomes;
    Mutation of Features Chromosomes;
    Calculate Fitness of all solutions;
    – Commence GA for Classifier Selection –
    if solution > pre_defined_accuracy then
        Generate Random Pool of Solutions
        while inner_iteration < inner_MaxGeneration do
            Selection of MCS solutions using RWS
            Carry out Crossover of MCS Chromosomes
            Mutation of MCS Chromosomes;
            Re-calculate Fitness for solutions
        end while
    end if
    – End GA for Classifier Selection –
    Replacement
    – End GA for Feature Selection –
end while
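The control flow of Algorithm 1 can be sketched in Python as a pair of nested loops. This is a simplified, steady-state illustration (one offspring per generation) rather than the implementation used in this work. The fitness functions `toy_feature_fitness` and `toy_mcs_fitness` are placeholders for training and evaluating classifiers, and every name and default parameter value is an assumption. Fitness values are assumed to be non-negative so that they can be used as selection weights.

```python
import random


def select(population, scores, rng):
    """Fitness-proportionate pick; the small epsilon keeps all weights positive."""
    return rng.choices(population, weights=[s + 1e-9 for s in scores], k=1)[0]


def crossover(a, b, rng):
    point = rng.randrange(1, len(a))          # single-point crossover
    return a[:point] + b[point:]


def mutate(chromosome, n_genes, rng, rate=0.1):
    return [rng.randrange(n_genes) if rng.random() < rate else g for g in chromosome]


def replace_weakest(population, scores, child, child_score):
    weakest = min(range(len(population)), key=scores.__getitem__)
    if child_score > scores[weakest]:
        population[weakest], scores[weakest] = child, child_score


def memetic_search(feature_fitness, mcs_fitness, n_features, rng, pop_size=8,
                   chromosome_length=4, mcs_size=3, max_generation=30,
                   inner_max_generation=5, pre_defined_accuracy=0.5):
    population = [[rng.randrange(n_features) for _ in range(chromosome_length)]
                  for _ in range(pop_size)]
    scores = [feature_fitness(c) for c in population]
    best_mcs, best_mcs_score = None, float("-inf")
    for _ in range(max_generation):
        # Feature GA: selection, crossover, mutation, fitness.
        child = mutate(crossover(select(population, scores, rng),
                                 select(population, scores, rng), rng),
                       n_features, rng)
        child_score = feature_fitness(child)
        # MCS GA: run only for sufficiently accurate feature solutions.
        if child_score > pre_defined_accuracy:
            pool = population + [child]       # MCS genes index into this pool
            mcs_pop = [[rng.randrange(len(pool)) for _ in range(mcs_size)]
                       for _ in range(pop_size)]
            mcs_scores = [mcs_fitness(m, pool) for m in mcs_pop]
            for _ in range(inner_max_generation):
                mcs_child = mutate(crossover(select(mcs_pop, mcs_scores, rng),
                                             select(mcs_pop, mcs_scores, rng), rng),
                                   len(pool), rng)
                replace_weakest(mcs_pop, mcs_scores, mcs_child,
                                mcs_fitness(mcs_child, pool))
            top = max(range(pop_size), key=mcs_scores.__getitem__)
            if mcs_scores[top] > best_mcs_score:
                best_mcs_score = mcs_scores[top]
                best_mcs = [pool[g] for g in mcs_pop[top]]
        # Replacement in the feature population.
        replace_weakest(population, scores, child, child_score)
    return population, scores, best_mcs, best_mcs_score


def toy_feature_fitness(chromosome):
    """Placeholder: pretend features with index below 5 are the useful ones."""
    return sum(1 for g in chromosome if g < 5) / len(chromosome)


def toy_mcs_fitness(mcs, pool):
    """Placeholder: average the toy fitness of the member feature solutions."""
    return sum(toy_feature_fitness(pool[g]) for g in mcs) / len(mcs)
```

A call such as `memetic_search(toy_feature_fitness, toy_mcs_fitness, 20, random.Random(0))` returns the final feature population, its scores and the best ensemble found, if any feature solution exceeded the accuracy threshold.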

4.1 Generic genetic algorithm

Both GAs in the Memetic algorithm follow the same steps. The only differences between the two GAs are the encoding of the chromosomes and the fitness evaluation algorithm. At the beginning of each loop, there is an initial population of encoded chromosomes. Their fitness values are computed by a unique fitness function. Depending on the selection function used, certain chromosomes are selected as parents to generate offspring. The fitness values of these offspring are then obtained. Based on these fitness values, the weaker chromosomes are replaced by the stronger offspring. This loop is repeated until the termination criterion is met.


4.1.1 Feature encoding

The chromosome is the core of the GA as it holds the key factors pertaining to the strength of the individual solution. In our work, the two GAs use different encodings for their chromosomes. The first GA uses the features (as listed in Table 1) as the genes of the chromosome, where each feature is encoded as a corresponding numerical value, as illustrated in Figure 2. If the feature is a TPM, the full-size TPM is used as a single feature. From here on, this GA will be known as the Feature GA and its chromosomes will be referred to as the feature solutions.

Figure 2 Example of a chromosome with six encoded features

The second GA would use the feature solutions of the first GA as genes (also numerically encoded) in the chromosomes, i.e., solutions from the output of the first GA will be used as genes in the second GA. Similarly, from here on, this GA will be referred to as the MCS GA and its chromosomes will be referred to as MCS solutions.

4.1.2 Selection

Selection is the process used to determine which chromosomes are chosen for reproduction. The Roulette Wheel Selection (RWS) is used in both GAs in our work because it can still select weaker solutions. It is a misconception that weaker solutions are of no use, and they are often discarded as a result. Weak solutions can nevertheless contain useful information: if they are selected and combined with fitter solutions, the offspring may, through evolution, become stronger than the fitter solutions of the current generation (Bäck, 1996). RWS selects chromosomes by first assigning each solution in the GA a sector of the roulette wheel according to its fitness value, using the following formula.

$$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j} \qquad (9)$$

where $f_i$ is the fitness value of individual $i$, as computed by whichever GA is running at that moment, $p_i$ is the share of the wheel assigned to that individual, and $N$ is the total number of individuals in the population.

A random draw is then made on the wheel to select a solution. Hence, the larger the fitness value, the larger the sector and the greater the chance of being selected.
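The sector assignment of equation (9) and the subsequent random draw can be sketched as follows. This is a minimal illustration that assumes non-negative fitness values; the function names and the optional `rng` argument are our own choices for the example and are not taken from the original implementation.

```python
import random
from itertools import accumulate


def selection_probabilities(fitness):
    """Equation (9): share of the wheel for each individual."""
    total = sum(fitness)
    if total <= 0:
        raise ValueError("fitness values must sum to a positive number")
    return [f / total for f in fitness]


def roulette_wheel_select(population, fitness, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    # Cumulative sector edges, e.g. [0.5, 0.83, 1.0] for three individuals.
    edges = list(accumulate(selection_probabilities(fitness)))
    spin = rng.random()  # uniform in [0, 1)
    for individual, edge in zip(population, edges):
        if spin < edge:
            return individual
    return population[-1]  # guards against floating-point round-off


if __name__ == "__main__":
    solutions = ["strong", "medium", "weak"]
    print(roulette_wheel_select(solutions, [0.9, 0.6, 0.3]))
```

Because every individual with a non-zero fitness owns a sector, the weak solution in this example is still selected occasionally, which is the property that motivates the use of RWS here.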

4.1.3 Crossover

The crossover operation is the canonical force of the GA for carrying out solution optimisations. For this operation, a crossover point is generated and parts of the parents' chromosomes are swapped in order to produce the offspring (Holland, 1992). In both GAs in these experiments, we adopted the one-point crossover shown in Figure 3. If both parents have the same number of features, the creation of the offspring is straightforward. In addition, our crossover operator for both GAs uses a crossover rate of 0.7, which lies within the range of 0.65 to 0.85 suggested by Jong (1975). This means that 70% of the chromosomes in the population undergo crossover.

Figure 3 Crossover operation

The only complication arises when the parents have different numbers of features. In this case, the offspring takes the average number of features. Under no circumstances should a chromosome contain duplicated features or solutions; if duplicates exist, they are removed.
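A one-point crossover with duplicate removal might look like the sketch below. The text does not specify how the offspring is brought to the average length, so the choices here are assumptions made for illustration: the head is taken from the shorter parent, the tail from the longer one, duplicates are dropped while preserving order, and the result is truncated to the rounded-down average length. The function name is hypothetical.

```python
import random


def one_point_crossover(parent_a, parent_b, rng=random):
    """Return one offspring from two chromosomes of numerically encoded genes."""
    # sorted() is stable, so equal-length parents keep their original order.
    short, long_ = sorted((parent_a, parent_b), key=len)
    if len(short) < 2:
        raise ValueError("each parent needs at least two genes")

    # Cut inside the shorter parent so that both parents contribute genes.
    point = rng.randint(1, len(short) - 1)
    spliced = short[:point] + long_[point:]

    # Remove duplicated genes while keeping the first occurrence of each.
    child = list(dict.fromkeys(spliced))

    # Assumed rule: cap the offspring at the average parent length.
    # The child can end up shorter than this if duplicates were removed.
    target = (len(parent_a) + len(parent_b)) // 2
    return child[:target]


if __name__ == "__main__":
    print(one_point_crossover([1, 2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12]))
```

With parents of four and eight distinct genes, the offspring has six genes, the average of the two lengths.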

4.1.4 Mutation

The mutation operation allows for genetic diversity from one generation of the population to the next (Holland, 1992). Its purpose is to introduce diversity and avoid local minima by preventing the chromosomes from becoming too similar. For the feature GA, we perform mutation by changing the genes at random positions, with the proportion set by a mutation rate that we defined as 0.1 in our experiments. That is, 10% of the genes in the chromosome are randomly selected and have their feature encoding randomly changed. We chose a small rate of 0.1 so that better solutions (if they exist) that are not far from the current solutions can be identified.

Similarly, for the MCS GA, we used a mutation rate of 0.1. For a selected classifier gene, we randomly change one of the feature encodings in the classifier to denote a mutation.
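The feature GA mutation can be sketched as follows. The sketch assumes that features are encoded as integer identifiers drawn from a pool (the paper mentions a pool of 582 features) and that a mutated gene is replaced only by a feature not already in the chromosome, in line with the no-duplicates rule of Section 4.1.3. The function name and the rounding of the number of mutated genes are illustrative assumptions.

```python
import random


def mutate(chromosome, feature_pool, rate=0.1, rng=random):
    """Replace about `rate` of the genes with features not yet in the chromosome."""
    child = list(chromosome)  # leave the parent untouched
    # Number of genes to change; short chromosomes may be left unchanged.
    n_mutations = round(rate * len(child))
    for position in rng.sample(range(len(child)), n_mutations):
        unused = [f for f in feature_pool if f not in child]
        if not unused:
            break  # nothing left to swap in
        child[position] = rng.choice(unused)
    return child


if __name__ == "__main__":
    parent = [3, 17, 42, 99, 120, 256, 301, 410, 512, 580]
    print(mutate(parent, range(582)))
```

With ten genes and a rate of 0.1, exactly one gene is replaced per call.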

4.1.5 Feature GA fitness evaluation

For the feature GA, the encodings of the features are decoded, concatenated and used as a training dataset. To obtain a performance score for the classifier, known as the fitness evaluation, we based the score on the receiver operating characteristic (ROC) curve over the test set. The rationale behind using the ROC is to prevent the classifier from overfitting the data. Our initial analysis showed that if accuracy was used, the trained models did not provide consistent results after training: they tended to produce very high accuracy during training, but the accuracy was much lower when they were presented with a new test set.

ROC is a two-dimensional depiction of classifier performance that illustrates the classifier performance visually. Essentially, it only maps the classification outputs of a classifier onto a ROC space. So, in order to obtain a single scalar value that can represent the classifier performance, we calculate the area under the ROC curve (abbreviated as AUC) as proposed in Fawcett (2004). The AUC is computed by successively adding the areas of the trapezoids under the curve. We follow the algorithm proposed in Fawcett (2004), illustrated in the pseudo code in Algorithm 2.

Algorithm 2 Pseudo code for calculating area under an ROC

T: set of test instances
Initialise True_Positive (TP), False_Positive (FP) and Area to 0
Sort T in decreasing order of probability, giving Tsorted
for each instance i in Tsorted do
    if i is positive then
        TP = TP + 1;
    else
        FP = FP + 1;
    end if
    Trapezoid_Area = Calculate_Trapezoid_Area(TP, FP);
    Area = Area + Trapezoid_Area;
end for
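A runnable version of the trapezoid-based AUC computation is sketched below. It follows the structure of the algorithm in Fawcett (2004), including the handling of tied scores and the final normalisation by the product of the numbers of positive and negative instances, which the condensed pseudo code above leaves implicit. The function names and the 0/1 label convention are assumptions made for the example.

```python
def trapezoid_area(x1, x2, y1, y2):
    """Area of a trapezoid with base |x1 - x2| and heights y1 and y2."""
    return abs(x1 - x2) * (y1 + y2) / 2.0


def auc(scores, labels):
    """Area under the ROC curve; labels are 1 (positive) or 0 (negative)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative instance")

    # Rank instances from the most to the least confident positive prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    tp = fp = 0
    tp_prev = fp_prev = 0
    area = 0.0
    previous_score = None
    for score, label in ranked:
        # Close a trapezoid only when the score changes, so ties share a segment.
        if score != previous_score:
            area += trapezoid_area(fp, fp_prev, tp, tp_prev)
            previous_score = score
            fp_prev, tp_prev = fp, tp
        if label == 1:
            tp += 1
        else:
            fp += 1
    area += trapezoid_area(fp, fp_prev, tp, tp_prev)

    # Scale from count space to the unit square.
    return area / (positives * negatives)


if __name__ == "__main__":
    print(auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))  # 0.75
```

In the example, three of the four positive-negative pairs are ranked correctly, which gives an AUC of 0.75.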

4.1.6 MCS GA fitness evaluation

For the MCS GA, the chromosomes are encoded with the feature solutions from the Feature GA. Therefore, each feature solution in the chromosome is essentially a classifier on its own. In order to combine these classifiers (feature solutions) for a final decision for each test sample, the average rule is used.
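The average rule can be illustrated with the short sketch below, in which each classifier outputs a score per test sample and the ensemble score is their mean. The 0.5 decision threshold and the function name are assumptions made for the example; the paper states only that the average rule is used.

```python
def average_rule(scores_per_classifier, threshold=0.5):
    """Fuse classifiers by averaging their per-sample scores.

    scores_per_classifier: one list of scores per classifier, all of equal length.
    Returns the fused scores and the resulting 0/1 decisions.
    """
    n_classifiers = len(scores_per_classifier)
    n_samples = len(scores_per_classifier[0])
    fused = [
        sum(clf[i] for clf in scores_per_classifier) / n_classifiers
        for i in range(n_samples)
    ]
    # Assumed decision rule: label a sample positive when the mean reaches the threshold.
    decisions = [1 if score >= threshold else 0 for score in fused]
    return fused, decisions


if __name__ == "__main__":
    ensemble = [
        [0.9, 0.2, 0.60],  # classifier 1
        [0.7, 0.4, 0.30],  # classifier 2
        [0.8, 0.3, 0.45],  # classifier 3
    ]
    print(average_rule(ensemble))
```

In this example, the third sample is accepted by classifier 1 alone but rejected by the ensemble, since its mean score is 0.45.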

Similarly, for this segment of the Memetic algorithm, the fitness evaluation uses the AUC of the MCS solution over the test dataset as discussed in the previous section.

Upon passing back the MCS solutions for combination, the feature solutions that are selected from the optimised MCS are locked down, i.e., they cannot be removed from the population and their encodings (features) cannot be altered. This lock down safeguards the optimised MCS from losing its best solution. Subsequently, if the maximum generation has been reached, the Memetic algorithm terminates. Otherwise, it performs selection and continues with the evolution.

5 Experiment results

5.1 Image datasets

For our experiment, we use the publicly available datasets generated and maintained by the Institute of Automation at Chinese Academy of Sciences (Dong and Wang, 2011). The breakdown of each dataset is listed in Table 2.


Table 2 Breakdown of dataset

Dataset Spliced Authentic Total

CASIA 1 918 797 1,715

CASIA 2 7,888 4,112 12,000

For all the experiments, unless otherwise indicated, the whole dataset was split into 5% for training, 10% for fitness evaluation and the remaining 85% for testing. The parameters used for the experiments are listed in Table 3. We chose 15 as the maximum number of features allowed in a classifier purely to limit the complexity of the classifier, and 40 as the maximum number of classifiers in an ensemble to limit computational time during classification. For feature selection, we chose a population of 80 and 100 generations because feature selection is the more important factor for classification and therefore requires a larger initial population and a longer run time for better generalisation. For classifier selection, we chose a population size of 50 and 30 generations, since good features will already have been identified in the previous step.
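The 5%/10%/85% partition can be sketched as below. The paper does not describe how images were assigned to each part, so the seeded random shuffle and the function name are assumptions made for illustration.

```python
import random


def split_dataset(items, train_frac=0.05, fitness_frac=0.10, seed=0):
    """Partition items into training, fitness-evaluation and testing subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment
    n_train = round(train_frac * len(shuffled))
    n_fitness = round(fitness_frac * len(shuffled))
    train = shuffled[:n_train]
    fitness_eval = shuffled[n_train:n_train + n_fitness]
    test = shuffled[n_train + n_fitness:]  # everything that remains
    return train, fitness_eval, test


if __name__ == "__main__":
    # 12,000 placeholder image ids, the size of CASIA 2 in Table 2.
    parts = split_dataset(range(12000))
    print([len(p) for p in parts])  # [600, 1200, 10200]
```

For the 12,000 images of CASIA 2, this yields 600 training, 1,200 fitness-evaluation and 10,200 testing images.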

Table 3 Parameters used

Stage 1: feature selection Stage 2: multiple classifier selection

Populations: 80 50

Limit of features/classifiers: 15 max. features 40 max. classifiers

Generations: 100 30

Crossover rate: 0.7 0.7

Mutation rate: 0.1 0.1

5.2 Support vector machine classifier

The support vector machine (SVM) is used as the classifier of choice, with a second-degree polynomial kernel. The numbers of training and testing images vary with the size of the dataset and are listed in Table 4.

Table 4 Experiment dataset breakdown

Experiment  Dataset              Number of training images  Number of testing images  Validation images
1           CASIA 1              90                         168                       1,457
1           CASIA 2              600                        1,200                     10,200
2           Mixed CASIA 1 and 2  690                        1,368                     11,657
3           CASIA 1 and 2        684                        1,031                     11,657


5.3 Experiment 1: CASIA 1 and CASIA 2

In this experiment, we perform the experiment individually on CASIA 1 and CASIA 2. As the parameters are the same for both datasets, we analyse the results in detail for CASIA 1 only and illustrate the optimal solution for CASIA 2.

Figures 4 to 7 illustrate the results at different generations throughout the evolution for CASIA 1. The best solution at each generation is shown, based on the increment of the fitness value at the end of each generation of the MA. At this point, we emphasise that the fitness value of the best individual classifier (BIC) within the MCS solution is independent of the AUC obtained for the MCS (i.e., the AUC of the MCS is not an average taken over all the selected classifiers). Because the AUCs are independent, the fitness value of the best individual can be higher than the fitness value of the MA at some generations of the evolution. Also, the AUC is only an estimate of the performance of the classifier; the actual accuracy obtained for each solution can vary with the AUC.

Figure 4 Best individual and Memetic algorithm validation set accuracy (see online version for colours)

Figure 5 Fitness values for BICs and MCSs (see online version for colours)


Figure 6 Average number of features and classifiers (see online version for colours)

Figure 7 True/false negative/positive rates (see online version for colours)

The performance of the BIC and the MA on the validation set is illustrated in Figure 4. Within the first 20 generations, we see the accuracies fluctuating. Looking at the first five generations, the BIC has an accuracy of approximately 79% while the MA obtained an accuracy of 85.79%. This shows that multiple classifiers can obtain better performance than a single classifier. Between generations five and ten, the BIC and the MA have similar accuracies. However, from generation ten onwards, the MA clearly performs better than the BIC. Between generations 20 and 65, the MA's accuracy fluctuates while that of the BIC remains constant. This is the mutation operation causing changes in the MCS chromosome in search of better performance. In this case, the BIC has not changed; instead, what changes are the other classifiers used in the MCS. Between generations 30 and 40, we observe that the MA exhibits a higher accuracy than the final accuracy. This occurs because the MA is not driven by the accuracy of the MCS but by the fitness value based on an evaluation set. However, Figure 7 shows that the false positive rates (FPR) and false negative rates (FNR) are higher, while the true positive rates (TPR) and true negative rates (TNR) are lower, at those particular generations than for the final solution. As the MA converges to a solution, we obtain an accuracy of 90.18% as compared to the BIC at 88.67%, with balanced TPR, TNR, FPR and FNR as illustrated in Figure 7.

Figure 5 illustrates the fitness values of the BIC and the MA over 100 generations. As the evolution starts, the best MA solution gives a relatively weak fitness value of 0.78, which is slightly lower than that of the BIC. Throughout the generations, we observe the fitness values increasing steadily. Between generations 10 and 20, the BIC obtains better fitness values than the MA. As before, this happens because the fitness value of the BIC within the MCS solution is independent of the AUC obtained for the MCS. From generation 25 onwards, the MA steadily surpasses the BIC and finally reaches a fitness value of 0.9689.

In Figure 6, we analyse the number of features and classifiers over the 100 generations. We observe that the number of classifiers hovers between 8 and 10 even though the number of features is limited to 15. In the first 15 generations, the average number of features increases, with the number of classifiers rising to as high as 17. This observation indicates that the MA is trying to maintain a balance between the average number of features and the number of classifiers to obtain better performance. Between generations 35 and 40, the number of features decreases while the number of classifiers increases. Again, this observation illustrates the identification of key features and classifiers. From generation 40 to 70, the MA starts to converge as the features and classifiers do not vary. As we approach generation 70, the average number of features rises to 9 while the number of classifiers does not change. Again, this is possible because of the mutation factor that alters the chromosome. This effect illustrates that the solution is able to search the neighbourhood of its current position in the problem space to increase performance. By the 90th generation, the number of classifiers starts to drop, and the MA terminates at generation 100 with the average number of features at approximately 9 while the number of classifiers is kept low at 5.

Based on the observations of experiment 1 on CASIA 1, we performed the same experiment using the same parameters on the CASIA 2 database. The best solution is listed in Table 5. We understand from the authors of CASIA that these two databases differ in the methods used to tamper the images. So, for this experiment, it is not unexpected that a different number of features, and different types of features and classifiers, were obtained. At the end of the evolution, we obtained an accuracy of 96.21%, with the majority of the images classified correctly on a relatively large test dataset of 10,200 images. That is, the Memetic algorithm is capable of evaluating and selecting the best features for the dataset at hand in order to achieve maximum classification performance. Furthermore, both the false positive rate and the false negative rate are lower than for CASIA 1, at 0.029 and 0.062 respectively.

The results of this experiment prove that the feature GA is fully capable of identifying optimal features. For the experiment on CASIA 1, although the MCS accuracy at the end of the evolution for the MA is only slightly higher at 90.18% as compared to the individual classifier, we would like to emphasise that the main objective of this method is to prove the scalability and robustness of this algorithm. This will be more apparent in experiment 3 where the training dataset is totally different from the validation set.


Table 5 CASIA 2 optimal solution

Best individual classifier fitness value 0.9512

Best individual classifier weakest fitness value 0.9424

MA fitness value 0.9832

Average number of features 9

Total number of classifiers 6

Accuracy 96.21%

True positive 3,897 (0.93)

True negative 5,865 (0.97)

False positive 177 (0.029)

False negative 261 (0.062)
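The bracketed rates in Table 5 appear to be the raw counts normalised by the number of actual positive or actual negative test images. This reading is inferred from the numbers and is not stated explicitly in the text; the short sketch below, with a hypothetical helper name, shows the arithmetic under that assumption.

```python
def confusion_rates(tp, tn, fp, fn):
    """Per-class rates from confusion-matrix counts."""
    positives = tp + fn  # actual positive samples
    negatives = tn + fp  # actual negative samples
    return {
        "TPR": tp / positives,
        "TNR": tn / negatives,
        "FPR": fp / negatives,
        "FNR": fn / positives,
    }


if __name__ == "__main__":
    # Counts taken from Table 5 (CASIA 2).
    for name, value in confusion_rates(tp=3897, tn=5865, fp=177, fn=261).items():
        print(f"{name}: {value:.4f}")
```

The four rates evaluate to roughly 0.937, 0.971, 0.029 and 0.063, which agrees with the bracketed values in the table up to rounding or truncation of the last digit.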

5.4 Experiment 2: Evolution on combining datasets

In this experiment, we combined the CASIA 1 and CASIA 2 databases, giving 13,715 images in total. To obtain this mixed database, we alternate between images from CASIA 1 and CASIA 2. For this experiment, 5% of the mixed data is used to train the classifiers, 10% is used for fitness evaluation for the MA and the final 85% is used for validating the MCS. By using a small training set with a much larger test set, we are able to test the scalability of the MA. Due to space constraints, we present the results of selected generations in Table 6.

Table 6 Evolution for CASIA 1 and 2 databases

Generation                    0      1      2      3      4      8      10     12     14     18     41     63     72

Fitness values for each classifier
                              0.67   0.68   0.72   0.72   0.77   0.79   0.79   0.82   0.83   0.85   0.87   0.87   0.87
                              0.78   0.65   0.77   0.78   0.82   0.78   0.83   0.84   0.85   0.86   0.87   0.89   0.88
                              0.72   0.75   0.78   0.77   0.79   0.82   0.80   0.78   0.84   0.85   0.85   0.87   0.87
                              0.75   0.71   0.76   0.73   0.75   0.82   0.80   0.82   0.80   0.84   0.85   0.87   0.86
                              0.68   0.72   0.75   0.77   0.79   0.80   0.83   0.80   0.86   0.83   0.87   0.87   0.89
                              0.69   0.71   0.71   0.82   0.67   0.80   0.82   0.85   0.85   0.87   0.86   -      -
                              0.72   0.76   0.77   0.71   0.80   0.83   0.85   0.83   0.84   -      0.89   -      -
                              -      0.77   -      0.74   -      -      -      -      -      -      -      -      -
                              -      0.68   -      0.69   -      -      -      -      -      -      -      -      -

Average fitness value         0.72   0.72   0.75   0.75   0.77   0.81   0.81   0.82   0.84   0.85   0.87   0.87   0.87
No. of classifiers            7      9      7      9      7      7      7      7      7      6      7      5      5
Avg. features                 8.14   7.77   8.14   6.11   10.14  8.42   8      8.28   8.14   8.33   8.28   8.2    8.4
Combined MCS fitness values   0.78   0.79   0.80   0.81   0.83   0.85   0.86   0.86   0.87   0.88   0.89   0.89   0.90


At generation 0, the best solution obtained very low fitness values among its classifiers, with the lowest fitness value at 0.66 and the highest at 0.78, giving an average of 0.72. This example illustrates that if we were simply to take the average AUC over all the classifiers in the MCS solution and use it as the fitness value for the MCS GA, the performance of the MCS would be poor. However, by deriving the fitness evaluation for the MCS segment of the MA independently, through the AUC of each MCS GA solution, better fitness values and classification performance were obtained. These improvements are visible from generation 1 onwards, where the fitness value of the MA deviates from the average fitness value. Through the generations, however, this gap closes. This is a natural evolution: as the classifiers evolve into stronger classifiers, the gap between the MA fitness value and the averaged fitness value narrows.
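As a small worked check of this comparison, the sketch below averages the per-classifier fitness values listed for generation 0 in Table 6 and sets the result against the combined MCS fitness value reported for the same generation. The variable names are illustrative only.

```python
from statistics import mean

# Per-classifier fitness values for generation 0, read from Table 6.
generation_0 = [0.67, 0.78, 0.72, 0.75, 0.68, 0.69, 0.72]
# Combined MCS fitness value reported for generation 0 in Table 6.
combined_mcs_fitness = 0.78

average_fitness = mean(generation_0)
gap = combined_mcs_fitness - average_fitness

print(f"average of individual classifiers: {average_fitness:.4f}")
print(f"combined MCS fitness:              {combined_mcs_fitness:.4f}")
print(f"gain from combining:               {gap:.4f}")
```

The mean of the seven values is about 0.716, which rounds to the 0.72 given in the table, so the combined ensemble scores roughly 0.06 higher than the average of its members at this generation.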

In the first five generations of the algorithm, we observe that classifiers with low fitness values are included in the best solution of each generation. Weak classifiers with fitness values of less than 0.70 become part of an ensemble that, when combined, improves on their fitness values by at least 0.1. This observation was also present in experiment 1, further supporting our point that combining classifiers can offer better performance; the point also holds in this experiment with a larger dataset. Naturally, as the MA iterates through its generations, the classifiers also become stronger due to feature evolution, so fewer classifiers are needed. This is seen from generations 8 to 63, where the number of classifiers fell from 7 to 5.

The optimisation of the MCS is also illustrated clearly in Table 6. When optimising systems with multiple variables (i.e., number of features and number of classifiers), the key is to identify optimal values for these variables in order to obtain the best possible result. In our scenario, the MA strikes a balance between the total number of features and the number of classifiers while achieving the best fitness value. We observe that between generations 4 and 10, the average number of features fell from 10.14 to 8 while a similar number of classifiers was maintained. This reduction in features was accompanied by an increase in fitness value from 0.83 to 0.86, demonstrating that with the removal of certain redundant features, the problem space is reduced and better performance is observed. From generations 10 to 12, the average number of features increased to 8.28 with a very slight increase in fitness value. What is interesting is that from generation 14 onwards, the number of classifiers and the average number of features fluctuate. The fluctuation is due to the MA continually exploring the problem space to find a balance between the key features and the number of classifiers in the MCS. It also illustrates the relationship between the complexity of the features and the various classifiers, and how this combination can affect the performance of the MCS. Comparing experiments 1 and 2, experiment 2 converges to a solution much later, owing to the size and complexity of the dataset.

At the end of the evolution, the MA obtained an accuracy of 94.64%, with the statistics given in Table 7. The features evolved for each classifier within the MCS solution are listed in Table 8. Analysing the feature list, we observe that the evolved features are diverse. Firstly, each of the classifiers uses features from both the DWT and the DCT. Secondly, the types of TPM selected are not typical matrices that one would select manually. Instead, the classifiers selected different MBDCT block sizes, normalised using different directions, with different colour channels and different threshold values. The randomness of these features illustrates the power of evolution for optimisation. Lastly, the dimensions of the feature set are neither too big nor too small, i.e., not so small as to cause overfitting and inaccurate classification, and not so big that the classifier becomes too complex for accurate prediction and computational efficiency. Through the evaluation of our proposed method, we have demonstrated that we can reduce the complexity of the classifiers without sacrificing classification performance.

Table 7 CASIA 1 and 2 optimal solution

Accuracy 94.64%

True positive 4,786 (0.93)

True negative 6,058 (0.92)

False positive 471 (0.072)

False negative 342 (0.066)

Table 8 Breakdown of MCS classifiers and features

Feature type                                MB size    Normalised direction  Diff. array direction  Channel  Threshold  Dimension

Classifier 1
HH1 wavelet                                 -          Whole                 Diagonal               -        2          648
DCT-TPM                                     8 by 8     Row                   Vertical               Y        4
LL2 wavelet                                 -          Row                   Horizontal             -        4
DCT-TPM                                     4 by 4     Col                   Horizontal             Cr       4
DCT-TPM                                     2 by 2     Col                   Horizontal             Cb       2
DCT-TPM                                     8 by 8     Col                   Diagonal               Cb       4

Classifier 2
HH1 wavelet                                 -          Whole                 Diagonal               -        2          649
DCT-TPM                                     8 by 8     Row                   Diagonal back          Y        2
DCT-TPM                                     2 by 2     Col                   Vertical               Cr       4
                                                                                                    Cb       2
DCT-TPM                                     32 by 32   Row                   Vertical               Cb       4
Variance stat. from phase congruency image  -          -                     -                      -        -

Classifier 3
                                                                                                    Y        2
DCT-TPM                                     8 by 8     Col                   Diagonal back          Y        2
DCT-TPM                                     16 by 16   Row                   Diagonal back          Cb       2
DCT-TPM                                     32 by 32   Row                   Vertical               Cb       4

Classifier 4
                                                                                                    Y        2
DCT-TPM                                     8 by 8     Whole                 Diagonal               Y        4
DCT-TPM                                     8 by 8     Row                   Horizontal             Cr       2
DCT-TPM                                     32 by 32   Row                   Vertical               Cr       2

Classifier 5
                                                                                                    Y        2
DCT-TPM                                     16 by 16   Col                   Horizontal             Cb       4
LH1 wavelet                                 -          Whole                 Diagonal back          -        2
DCT-TPM                                     32 by 32   Whole                 Vertical               Cb       4


5.5 Experiment 3: Generalisation

For all our previous experiments, although the testing and validation data were not seen during training, all the data ultimately still belong to the same dataset.

So, we further test the robustness of our method. In this experiment, we train, test and validate on totally different datasets: the training and the testing for fitness values (for individual classifiers) are based on the CASIA 1 dataset, whereas the validation is based on the CASIA 2 dataset. The breakdown of the training, testing and validation datasets is listed in Table 10.


Table 9 shows the results at various generations of the evolution. We emphasise that the MA fitness value is based on the 1,031 testing images from CASIA 1. As such, the overall MA fitness values observed in Table 9 are lower than in the previous experiments using CASIA 1 simply because the testing set in this experiment is much larger than in the former. Another point to note is that the accuracy obtained is not tied to the corresponding fitness value, because the solution obtained is based solely on the data of CASIA 1. Hence, the accuracy varies according to the solution; we will cover this point later in this section.

Table 9 Scalability test

Gen.  Avg. no. of features  No. of classifiers  MA fitness value  TP            TN            FP            FN            Accuracy
0     8.22                  9                   0.8131            3,715 (0.76)  5,746 (0.84)  1,157 (0.16)  1,039 (0.21)  84.62%
1     8.3                   9                   0.8222            2,975         5,967         936           1,779         85.26%
2     8.2                   7                   0.831             3,715         5,940         963           1,039         85.31%
3     8                     7                   0.8434            2,723         6,376         527           2,031         85.27%
4     8.5                   11                  0.8443            3,102         6,338         565           1,652         84.42%
5     8.1                   13                  0.8462            2,979         6,358         545           1,775         84.06%
6     8.3                   15                  0.8484            2,637         6,296         607           2,117         84.01%
7     8.6                   11                  0.8508            2,802         6,240         663           1,952         83.86%
8     8.3                   11                  0.852             2,285         6,307         596           2,469         83.55%
9     8.7                   11                  0.8592            2,703         6,419         484           2,051         84%
10    8.7                   7                   0.8632            3,590         6,312         591           1,164         85.29%
14    8.8                   5                   0.8763            3,295         6,230         673           1,459         85.17%
19    8.8                   7                   0.8773            3,378         6,393         510           1,376         85.98%
23    9                     9                   0.8797            3,556         6,378         525           1,198         86.02%
25    9                     11                  0.8814            3,077         6,511         392           1,677         87.22%
64    8.8                   9                   0.8989            3,459 (0.88)  6,436 (0.83)  467 (0.06)    1,295 (0.21)  87.42%

Table 10 Data breakdown

Dataset  Training images  Testing images  Validation images
CASIA 1  684              1,031           -
CASIA 2  -                -               11,657

In the initial generations, the average number of features is relatively low. Since the validation set is an unseen, different and larger dataset, the initial features and classifiers selected are unable to perform well, hence the high numbers of false positives and false negatives at generation 0.

From generations 4 to 10, we observe an overall increase in the average number of features and the number of classifiers for the best solutions. However, at this point, the MA is still unable to converge to a good solution, as illustrated by the statistics in Table 9, where the number of true positives is still relatively low and the number of false negatives is still high.

It is only after the 10th generation that the MA finds its footing and steadily increases both its fitness value and its accuracy. From this generation onwards, the numbers of features and classifiers fluctuate with minimal changes as the solution stabilises.


For this experiment, we obtained an accuracy of 87.42% at the end of the evolution. This accuracy is based on the validation dataset, which is the CASIA 2 dataset. Since the MA fitness value is based upon the test set, which is the majority of the CASIA 1 dataset, the evolution of the solution has no dependence on the validation set. This means that the solution obtained through evolution is totally independent of the CASIA 2 data on which the accuracy is measured, demonstrating the scalability and robustness of our method.

5.6 Preserving the integrity of the TPM: a comparison

We investigated the use of full TPMs as compared to using features obtained through feature selection on the TPMs. As mentioned earlier, we believe that preserving the integrity of the TPM can provide better results. In this sub-section, we performed an experiment to illustrate that blindly concatenating full TPMs leads to the curse of dimensionality, which is why researchers need feature reduction techniques to reduce dimensionality and obtain better results. More importantly, through this comparison, we illustrate that our proposed method of carefully identifying key TPMs is able to obtain better accuracy than previous works.

For this experiment, we use various DWT and DCT TPMs. We start by concatenating the full TPM features of Sutthiwan et al. (2010). As we observe from Table 11, as more TPMs are concatenated and the feature size increases, the accuracy decreases, illustrating the effect of the curse of dimensionality. This is the main reason why many researchers who utilised TPMs performed feature reduction on the initial dataset.

Table 11 Concatenation of TPMs

Accuracy Feature dimensions

Combined DCT TPM horizontal diff. 64.96% 811

Combined DCT TPM horizontal and vertical diff. 64.87% 1,620

Combined DCT and wavelet TPM horizontal and vertical diff. 61.32% 2,511

Combined wavelet TPM horizontal and vertical diff. 59.67% 3,403

Combined DCT and wavelet TPM horizontal and vertical diff. 60.76% 4,213

So, we compare our method with existing state-of-the-art methods (Sutthiwan et al., 2010; He et al., 2012) that use feature reduction, on both the CASIA 1 and CASIA 2 databases. For both methods, we used the same SVM classifier parameters as stated in Section 5.2 to ensure credibility in the reported results. However, the numbers of training and testing images differ, as it would be unfair to use our training to testing ratio since our proposed method is a hybrid framework aimed at optimisation. Therefore, we follow the authors' original ratio of 5/6 of the authentic and spliced images for training and the remaining 1/6 of the images for testing.

As illustrated in Table 12, we obtained accuracies of 87.37% and 78.45% for He et al. (2012) and Sutthiwan et al. (2010) respectively. These results are similar to those reported by the authors, indicating that the results we obtained are credible.


Table 12 Comparison with existing works for CASIA 2

Authors                  Accuracy  Feature dimensions
Sutthiwan et al. (2011)  78.45%    50
He et al. (2012)         87.37%    100
Proposed method          96.21%    Avg. of 700

Similarly, we illustrate in Table 13 that we obtained an accuracy of 79.74% (50 dimensions) for Sutthiwan et al. (2010), and accuracies of 74.81% (100 dimensions) and 76.87% (200 dimensions) for He et al. (2012). The accuracy for Sutthiwan et al. (2010) is similar to the accuracies reported in their paper. As for He et al. (2012), their paper did not report any results on the CASIA 1 database. However, CASIA 1 and CASIA 2 differ greatly in terms of image size, format and compression (Dong and Wang, 2011). Therefore, it is not surprising that the features they used for CASIA 2 are less suitable for CASIA 1, hence giving poorer results. This is also consistent with our results, where our proposed method obtained better performance on CASIA 2 than on CASIA 1.

Table 13 Comparison with existing works for CASIA 1

Authors                  Accuracy  Feature dimensions
Sutthiwan et al. (2011)  79.54%    50
He et al. (2012)         74.81%    100
He et al. (2012)         76.87%    200
Proposed method          90.18%    Avg. of 660

Based on Figures 8 and 9, we can observe that our proposed method obtained higher accuracy than existing works, by at least 8% on the CASIA 2 database and 10% on the CASIA 1 database. Although the feature dimensionality per classifier ranges from 589 to 730, we emphasise our point that if the integrity of the TPMs is kept intact and the TPMs are properly identified, better performance can be obtained. In addition, our proposed method utilises only 15% of the data for training, which shows that it is highly scalable and robust.

Figure 8 The ROC curves obtained on CASIA 2 database (see online version for colours)


Figure 9 The ROC curves obtained on CASIA 1 database (see online version for colours)

6 Conclusions

This paper presents a fair evaluation of the current features used to characterise tampered images. Through the use of a hybrid evolutionary algorithm, we are able to automatically select useful features for each classifier while systematically combining them to obtain an optimised MCS for image tampering detection. Our experiments have illustrated that the evolved classifier system is able to achieve good classification performance in terms of detection accuracy and computational efficiency.

By using a large pool of 582 features, we also illustrated that the features selected for each classifier within the MCS were indeed unique. We showed that the selected features formed a comprehensive list capable of discriminating well between tampered and authentic images. After the optimisation process and the combination of the various classifiers, we evaluated our detection method using the publicly available CASIA datasets. Through our experiments, we showed that the Memetic algorithm achieved good accuracy for the CASIA 1, CASIA 2 and mixed CASIA 1 and 2 databases, at 90.18%, 96.21% and 94.64% respectively. Furthermore, with the exception of CASIA 1, these results were based on large-scale datasets of at least 10,000 test images. Compared with the current state of the art, our method not only obtained better accuracy but also achieved these results on a large-scale dataset.

The experimental results show that our image tampering detection method can be used as an automated large-scale authenticity verification system. While this method was developed for image tampering detection, the concept of using the hybrid evolutionary method is generic and applicable to other domains as well.


References

Bäck, T. (1996) Evolutionary Algorithms in Theory and Practice, p.120, Oxford Univ. Press, UK.

Bianchi, T. and Piva, A. (2011) ‘Analysis of non-aligned double JPEG artifacts for the localization of image forgeries’, WIFS, IEEE.

Chen, W., Shi, Y.Q. and Su, W. (2007) ‘Image splicing detection using 2-D phase congruency and statistical moments of characteristic function’, Proceedings of SPIE, p.6.

Dong, J. and Wang, W. (2011) CASIA Tampering Detection Dataset [online] http://forensics.idealtest.org/ (accessed 28 June 2014).

Farid, H. (2009) ‘A survey of image forgery detection’, IEEE Signal Processing Magazine, Vol. 26, No. 2, pp.16–25.

Fawcett, T. (2004) ‘ROC graphs: notes and practical considerations for researchers’, Machine Learning, Vol. 31, pp.1–38.

Fontani, M., Bianchi, T., Rosa, A.D. and Barni, M. (2011) ‘A Dempster-Shafer framework for decision fusion in image forensics’, Workshop on Information Forensics and Security.

Fridrich, A.J., Soukal, B.D. and Lukáš, A.J. (2003) ‘Detection of copy-move forgery in digital images’, Proceedings of Digital Forensic Research Workshop.

He, Z., Lu, W., Sun, W. and Huang, J. (2012) ‘Digital image splicing detection based on Markov features in DCT and DWT domain’, Pattern Recognition, Vol. 45, No. 12, pp.4292–4299.

Holland, J. (1992) Adaptation in Natural and Artificial Systems, MIT Press, Cambridge, MA.

Hsu, Y.F. and Chang, S.F. (2008) ‘Statistical fusion of multiple cues for image tampering’, Asilomar Conference on Signals, Systems, and Computers.

Huang, H., Guo, W. and Zhang, Y. (2008) ‘Detection of copy-move forgery in digital images using sift algorithm’, Pacific-Asia Workshop on Computational Intelligence and Industrial Application, PACCIA.

Jingying, J., Xiaodong, H., Kexin, X. and Qilian, Y. (2002) ‘Phase correlation-based matching method with sub-pixel accuracy for translated and rotated images’, Signal Processing, 2002 6th International Conference on, August, Vol. 1, pp.752–755.

Jong, K.A.D. (1975) An Analysis of the Behavior of a Class of Genetic Adaptive Systems, Doctoral dissertation.

Kovesi, P. (1999) ‘Image features from phase congruency’, Journal of Computer Vision Research, Vol. 1, No. 3, pp.1–26.

Kuncheva, L.I. and Jain, L.C. (2000) ‘Designing classifier fusion systems by genetic algorithms’, IEEE Transactions on Evolutionary Computation, Vol. 4, No. 4, pp.327–336.

Kwong, S., Chan, C., Man, K. and Tang, K. (2001) ‘Optimisation of HMM topology and its model parameters by genetic algorithms’, Pattern Recognition, Vol. 34, No. 2, pp.509–522.

Kyoung, K.W., Prugel-Bennet, A. and Krogh, A. (2006) ‘Evolving the structure of hidden Markov models’, IEEE Transaction on Evolutionary Computation, Vol. 10, No. 1, pp.39–49.

Li, G., Wu, Q., Tu, D. and Sun, S. (2007) ‘A sorted neighbourhood approach for detecting duplicated regions in image forgeries based on DWT and SVD’, Proceedings of ICME, pp.1750–1753.

Lin, S. and Wu, T. (2011) ‘An integrated technique for splicing and copy-move forgery image detection’, International Congress on Image and Signal Processing, Vol. 2, pp.1086–1090.

Ng, T-T. and Chang, S-F. (2004) ‘A model for image splicing’, IEEE International Conference on Image Processing (ICIP), October.

Popescu, A.C. and Farid, H. (2004) Exposing Digital Forgeries by Detecting Duplicated Image Regions, Tech. Rep.

Sharkey, A. (1999) Combining Artificial Neural Nets: Ensembles and Modular Multi Net Systems, Springer, USA.


Shi, Y.Q., Chen, C. and Chen, W. (2007) ‘A Markov process based approach to effective attacking JPEG steganography’, LNCS, Vol. 4437, pp.249–264, Springer, Heidelberg.

Su, Y., Shan, S., Chen, X.L. and Gao, W. (2007) ‘Hierarchical ensemble of global and local classifiers for face recognition’, IEEE Transactions on Image Processing, pp.1–8.

Sutthiwan, P., Shi, Y.Q., Su, W. and Ng, T.T. (2010) ‘Rake transform and edge statistics for image forgery detection’, ICME, IEEE.

Sutthiwan, P., Shi, Y.Q., Zhao, H., Ng, T.T. and Su, W. (2011) ‘Markovian rake transform for digital image tampering detection’, Transactions on Data Hiding and Multimedia Security VI, Lecture Notes in Computer Science, Vol. 6730.

Thing, V.L.L., Chen, Y. and Cheh, C. (2012) ‘An improved double compression detection method for jpeg image forensics’, IEEE International Symposium in Multimedia.

Wang, Y., Gurule, K., Wise, J. and Zheng, J. (2012) ‘Wavelet based region duplication forgery detection’, 9th International Conference on Information Technology: New Generations (ITNG), pp.30–35.

Xu, B., Wang, J., Liu, G. and Dai, Y. (2010) ‘Image copy-move forgery detection based on surf’, International Conference on Multimedia Information Network and Security, MINES, pp.889–892.

Yao, X. (1999) ‘Evolving artificial neural networks’, Proceedings of the IEEE, Vol. 87, No. 9, pp.1423–1447.

Zhang, J., Feng, Z. and Su, Y. (2008) ‘A new approach for detecting copy-move forgery in digital images’, Singapore International Conference on Communication Systems, IEEE ICCS, pp.362–366.
