Research Article

A Data Leakage Prevention Method Based on the Reduction of Confidential and Context Terms for Smart Mobile Devices

Xiang Yu,1 Zhihong Tian,2 Jing Qiu,2 and Feng Jiang3

1 School of Electronics and Information Engineering, Taizhou University, Taizhou 318000, China
2 Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China
3 College of Science and Technology, Harbin Institute of Technology, Harbin 150001, China

Correspondence should be addressed to Zhihong Tian; [email protected] and Jing Qiu; [email protected]

Received 27 April 2018; Revised 30 August 2018; Accepted 24 September 2018; Published 21 October 2018

Guest Editor: Ding Wang

Copyright © 2018 Xiang Yu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Early data leakage protection methods for smart mobile devices usually focus on confidential terms and their context, which does prevent some kinds of data leakage events. However, owing to the high dimensionality and redundancy of text data, it is difficult to accurately detect the documents that contain confidential contents. Our approach updates the cluster graph structure of the CBDLP (Data Leakage Protection Based on Context) model by computing the importance of confidential terms and of the terms within the range of their context. By applying CBDLP with a validated pruning procedure, we further remove redundant and noisy terms. In our experiments, not only are confidential terms detected accurately, but sophisticated rephrased confidential contents are detected as well.

1. Introduction

With the development of the Internet and information technology, smart mobile devices have entered our daily lives, and the accompanying problem of information leakage on smart mobile devices has become more and more serious [1, 2]. All kinds of private or sensitive information, such as intellectual property and financial data, might be distributed to unauthorized entities intentionally or accidentally, and it is almost impossible to prevent confidential information from spreading once it has leaked.

According to survey reports [3, 4], most threats to information security are caused by internal data leakage. These internal threats consist of approximately 29% accidental leakage of private or sensitive data, approximately 16% theft of intellectual property, and approximately 15% other thefts, including theft of customer information and financial data. Further, approximately 67% of organizations agree that the damage caused by internal threats is more serious than that from outside.

Although laws and regulations have been passed to punish various behaviors of intentional data leakage, it is still hard to prevent data leakage effectively. Confidential data can be easily disguised by rephrasing confidential contents or by embedding confidential contents in nonconfidential contents [5, 6]. To address the problems arising from data leakage, many software and hardware solutions have been developed; they are discussed in the following section.

In this paper, we present CBDLP, a data leakage prevention model based on confidential terms and their context terms, which can detect rephrased confidential contents effectively. In CBDLP, a graph structure involving confidential terms and their context is adopted to represent the documents of the same class, and the confidentiality score of a document under detection is then calculated to judge whether confidential contents are involved. Based on the attribute reduction method from rough set theory, we further propose a pruning method: according to the importance of the confidential terms and their context, the graph structure of each cluster is updated after pruning. The motivation of this paper is to develop a solution that can effectively prevent intentional or accidental data leakage by insiders. As mixed-confidential documents are very common, it is very important to accurately detect the documents containing confidential contents even when most of those contents have been rephrased.

Hindawi Wireless Communications and Mobile Computing, Volume 2018, Article ID 5823439, 11 pages. https://doi.org/10.1155/2018/5823439


The remainder of this paper is organized as follows. In Section 2 we review previous work related to data leakage prevention. In Section 3 we present the CBDLP model together with the corresponding clustering, decision, and calculation algorithms. The experiments conducted to evaluate CBDLP in all circumstances are discussed in Section 4. Finally, Section 5 concludes this paper and discusses the directions of our future research.

2. Related Work

In this section, we review the clustering of textual documents, attribute reduction methods, and the graph representation of textual documents, respectively.

2.1. Clustering of Textual Documents. The problem of clustering textual documents is similar to high-dimensional clustering. In general, each term of a textual document is considered an independent dimension, so each document is considered a vector consisting of thousands of terms. By calculating the cosine of the angle between document vectors, textual documents can be classified in terms of similarity, which is reflected by the cosine value [7–10].
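The angle cosine measure can be sketched as follows (a minimal illustration; documents are represented as term-to-weight dictionaries, and the toy terms and weights are invented for the example):

```python
import math

def cosine_similarity(a, b):
    # a, b: dicts mapping term -> weight; each document is a sparse term vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = {"data": 1.0, "leakage": 2.0, "prevention": 1.0}
d2 = {"data": 1.0, "leakage": 1.0, "privacy": 1.0}
print(round(cosine_similarity(d1, d2), 3))  # -> 0.707
```

A similarity close to 1 means the documents point in nearly the same direction in term space, regardless of their lengths.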

The vector space model (VSM) is one of the most widely used text representation models [11]. It was first presented by Salton in the 1960s and successfully applied in SMART, a system for the manipulation and retrieval of text. In the VSM, a textual document is represented as D = ((T_1, W_1), (T_2, W_2), ..., (T_n, W_n)), where T_i and W_i denote the i-th term and its weight in the document, respectively. The classification of a textual document is then determined by calculating the similarity between the document to be classified and the documents whose classifications are already known.

Term frequency-inverse document frequency (TF-IDF) is a frequently employed and effective statistical method used to evaluate the importance of a term for a document collection [12]. As is well known, the importance of a term is proportional to the frequency of its occurrence in a document and inversely proportional to the frequency of its occurrence in the whole corpus. TF-IDF has been widely used in various fields such as text mining, search engines, and information retrieval.
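A minimal TF-IDF computation might look like the following (a sketch only; TF-IDF has several common variants, and the toy corpus and exact normalization here are our own choices):

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists; returns one term -> TF-IDF weight dict per document
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency of each term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

docs = [["patent", "filing", "secret"], ["patent", "news"], ["news", "sports"]]
w = tf_idf(docs)
# "secret" appears in only one document, so it outweighs the corpus-wide term "patent"
assert w[0]["secret"] > w[0]["patent"]
```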

On the basis of the VSM and the TF-IDF method, existing textual document clustering algorithms can be divided into five main categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Partitioning methods, which are efficient and insensitive to the order of documents, divide n documents into k clusters in terms of clustering criteria; representative partitioning methods include k-means and k-medoids [13, 14]. Hierarchical methods split documents into different clusters or merge different documents into one cluster in terms of similarity, in a top-down or bottom-up hierarchical manner; representative hierarchical methods include BIRCH and CURE [15, 16]. Unlike partitioning methods, density-based methods focus on the density of a certain area: when the density of the documents within an area exceeds a predefined threshold, they are incorporated into the same cluster. Representative density-based methods include DBSCAN and OPTICS [17, 18]. Grid-based methods partition the data space into a limited number of cells in advance and integrate adjacent cells whose density exceeds the density threshold into the same cluster; representative grid-based methods include STING and CLIQUE [19, 20]. In model-based methods, a different model is bound to each cluster, and the objective is to find all data subsets that fit each model best. Statistical solutions such as SVM [21] and neural network solutions are adopted extensively in model-based methods [22]. The support vector clustering algorithm created by Hava Siegelmann and Vladimir Vapnik applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data and is one of the most widely used clustering algorithms in industrial applications.

In this paper, we calculate the cosine value, which reflects the similarity between documents, and cluster documents with DBSCAN. DBSCAN, proposed by Martin Ester in 1996, is a density-based clustering algorithm which is widely cited in the scientific literature [23]; it received the test-of-time award in 2014 [24]. Unlike k-means, DBSCAN does not need the number of clusters to be specified, and it can find clusters of arbitrary shape. In addition, DBSCAN is robust to outliers, as opposed to k-means.

2.2. Graph Representation of Textual Documents. Graphs have already been used in many text-related tasks, which employ a graph as the model for text representation instead of the existing methods [25]. As an alternative to the vector space model for representing textual information, graphs can be created from documents and further used in text-related tasks such as information retrieval [26], text mining [27], and topic detection [28].

In general, graph-based models are usually employed in the domain of information retrieval, as in PageRank [29] and HITS [30]. When determining similarity, graph matching, which is generally used to detect similar documents, is computationally hard [31], whereas methods based on the vector space model perform efficiently by calculating the Euclidean distance or cosine measure between document vectors [32]. The main advantage of graph-based models is that they can not only capture the contents and structure of a document but also represent the terms together with their context. To the best of our knowledge, graph-based models are seldom employed in text-related tasks. Schenker presents a graph-related algorithm with several variants [33], in which a graph is built and the terms connected by edges are considered nodes; the differences between the variants relate to term-based techniques. Gilad Katz presents CoBAn, a context-based model for data leakage prevention, which has enlightened us a lot [34]. However, CoBAn is partly limited by the k-means algorithm it employs; moreover, there might exist some redundant nodes in the graph generated by CoBAn. Xiaohong Huang et al. propose an Adaptive Weighted Graph Walk model (AGW) to solve the problem of transformed data leakage by mapping it to the dimension of weighted graphs [35].

In this paper, we employ a hybrid approach which combines graph and vector representations. When clustering documents, we employ DBSCAN with the cosine measure; when representing the confidential textual content and its context for each cluster, a graph that includes only confidential and contextual nodes is created for the cluster.

2.3. Redundancy Information Reduction. When dealing with text-related tasks, redundant information is generally useless; even worse, it might decrease the efficiency of task execution. There exist many representative methods for reducing redundant information, such as PCA [36], SVD [37], and LSI [38]. The principle of PCA is to transform many attributes into a few principal attributes which still reflect the information of the original data effectively; however, the complexity of PCA is generally high, and part of the original information may be lost. SVD has almost the same advantages and disadvantages as PCA. LSI represents textual data with latent topics that consist of specific terms, but in most cases the influence of specific terms is ignored. In this paper, the reduction method from rough set theory, as shown in Section 3, is employed and partly recomposed to meet our requirements.

2.4. Data Leakage Prevention. As the number of leakage incidents and the cost they inflict continue to increase, the threat that data leakage poses to companies and organizations has become more and more serious [39–41]. Considering the enormity of the problem, various models and approaches have been developed to address data leakage prevention. Tripwire is a recent prototype system proposed by Joe DeBlasio et al. in 2017; it registers honey accounts with individual third-party websites, so that access to an email account provides indirect evidence of credential theft at the corresponding website [42]. However, Tripwire is more suitable for forensics than for confidential data leakage prevention. In 2018, Wenjia Xu et al. proposed a promising image retrieval method in the ciphertext domain, based on block image encryption with the Paillier homomorphic cryptosystem, which can manage and retrieve ciphertext data effectively [43]; nevertheless, that method focuses on data encryption rather than data detection. Since smart devices based on ARM processors have become an attractive target of cyberattacks, Jinhua Cui et al. presented SecDisplay, a scheme for trusted display service, in 2018; it protects displayed sensitive data from being stolen or tampered with surreptitiously by a compromised OS [44], but it pays less attention to scenarios of intentional or accidental data leakage by insiders. According to the work of Ding Wang et al., many authentication schemes have been proposed to secure data access in industrial wireless sensor networks, but they do not work well [45]. In addition, Ding Wang et al. developed a series of practical experiments to evaluate the security of four main suggested honeyword-generation methods and proved that they all fail to provide the expected security [46].

Figure 1: The DBSCAN clustering step. Training documents undergo stemming and stop-word removal and are then clustered by DBSCAN into clusters consisting of confidential and nonconfidential documents.

3. CBDLP Model

CBDLP consists of a training phase and a detection phase. The training phase can be further divided into three steps: a clustering step, a graph building step, and a pruning step. During the training phase, the training documents are first classified into different clusters; then each cluster is represented by a graph; finally, the nodes of each graph are pruned in terms of their importance. During the detection phase, documents are matched against the graphs of the clusters, and their confidentiality scores are calculated. A document is considered confidential only if its confidentiality score exceeds a predefined threshold. The details of the CBDLP training phase are presented in Algorithm 1.

3.1. Clustering Documents with DBSCAN. In the first step, we apply stemming and stop-word removal to all documents in the training set and transform the processed documents into vectors of weighted terms. After applying DBSCAN with the cosine measure to the vectors representing the training documents, each resulting cluster represents an independent topic of the training documents and might contain both confidential and nonconfidential documents, as shown in Figure 1.

The procedure of DBSCAN is as follows.

Step 1. A data set is given with n documents, ε as the threshold of minimal similarity between documents of the same cluster, and MinPts as the threshold of the minimal number of documents in a cluster.


Algorithm 1: CBDLP (training phase).

Input: C, the set of confidential documents; N, the set of nonconfidential documents; TR_min, the minimum similarity threshold.
Output: CR, the set of clusters, each with its centroid and corresponding graph; CT, the set of confidential terms in the clusters; ContextT, the set of context terms.

(1) T ← C ∪ N
(2) CR ← DBSCAN(T)  // the result of clustering T is saved in CR
(3) Initialize CT[CR]  // the scores of confidential terms are saved in CT
(4) Initialize ContextT  // the context-term set of each confidential term is saved in ContextT
(5) for each cr in CR do
(6)   Calculate the similarity between cr and the other clusters
(7)   Create a language model for cr and calculate the score of each confidential term
(8)   TR ← initial threshold of cluster similarity
(9)   while TR > TR_min do
(10)    C_temp ← all clusters whose similarity to cr > TR
(11)    Create a language model for the documents of C_temp
(12)    CT[cr] ← update the scores of the confidential terms based on the new language model
(13)    for each confidential term ct in cr do
(14)      Detect the occurrences of ct in C ∪ N
(15)      P_confidential_doc(term, ct) ← for each context term of ct, the probability that both ct and the context term appear in confidential documents
(16)      P_non_confidential_doc(term, ct) ← for each context term of ct, the probability that both ct and the context term appear in nonconfidential documents
(17)      ContextT ← the value of P_confidential_doc(term, ct) − P_non_confidential_doc(term, ct) for each confidential term ct
(18)      Detect all clusters whose similarity is greater than TR and detect the occurrences of all terms in those clusters
(19)      ContextT ← update the probabilities of the context terms that appear in the scopes of different confidential terms
(20)    end for
(21)    Reduce the value of TR
(22)  end while
(23) end for

Step 2. Start with an arbitrary document that has not been visited and find all the documents in its ε-neighborhood. If the number of documents in the neighborhood exceeds MinPts, incorporate the documents into the same cluster and label them.

Step 3. If not all documents have been visited, start again from another arbitrary document which has not been visited.

Step 4. Mark the documents which are not labelled as noise.
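The four steps can be sketched in compact form (a sketch under our own simplifications: `sim` is any similarity function in [0, 1], and ε acts as a minimal-similarity threshold, as in the text):

```python
def dbscan(docs, sim, eps, min_pts):
    # docs: list of items; sim(a, b): similarity in [0, 1];
    # eps: minimal similarity for two documents to be neighbours;
    # min_pts: minimal neighbourhood size for a core document.
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(docs)
    cluster = 0
    neighbours = lambda i: [j for j in range(len(docs)) if sim(docs[i], docs[j]) >= eps]
    for i in range(len(docs)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # Step 4: unlabelled documents are noise
            continue
        labels[i] = cluster            # Step 2: grow a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # noise reachable from a core point joins the cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nj = neighbours(j)
            if len(nj) >= min_pts:     # j is itself a core document
                queue.extend(nj)
        cluster += 1
    return labels

# toy 1-D "documents": three near 0, two near 5, one isolated point
labels = dbscan([0.0, 0.05, 0.1, 5.0, 5.05, 9.0],
                sim=lambda a, b: 1.0 / (1.0 + abs(a - b)), eps=0.9, min_pts=2)
# the first three form one cluster, the pair near 5.0 another, and 9.0 is noise
```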

3.2. Representing Clusters with Graphs. In this step, the confidential contents of all clusters, including not only the confidential terms but also their context, need to be represented by graphs. The procedure for creating the graph representation of the clusters which include confidential contents is as follows:

(1) Detect the confidential terms, provided by domain experts or inferred from the key terms of the training documents.

(2) Analyze the context of each confidential term.

(3) Create the graph representation of the confidential terms and their contexts at the cluster level.

3.2.1. Detecting Confidential Terms. In general, a term which appears in confidential documents with high probability and in nonconfidential documents with low probability is considered a confidential term. We first build language models for the confidential and nonconfidential documents of the same cluster, denoted by cVM (confidential vector model) and ncVM (nonconfidential vector model). The confidentiality score of a term can then be represented by the ratio of its probability in the confidential documents to that in the nonconfidential documents, as shown in (1), where P_cVM(t) and P_ncVM(t) denote the probability of term t in the confidential and nonconfidential language models, respectively:

∀t ∈ cVM: score += P_cVM(t) / P_ncVM(t)  (1)
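Equation (1) can be read as the following computation (a sketch; the smoothing floor for zero probabilities and the toy probabilities are our own additions, not part of the model):

```python
def confidentiality_score(doc_terms, p_cvm, p_ncvm, floor=1e-6):
    # Sum P_cVM(t) / P_ncVM(t) over the document's terms that occur in the
    # confidential language model; `floor` guards against division by zero.
    score = 0.0
    for t in doc_terms:
        if t in p_cvm:
            score += p_cvm[t] / max(p_ncvm.get(t, 0.0), floor)
    return score

p_cvm = {"formula": 0.04, "merger": 0.02}   # term probabilities in cVM (toy values)
p_ncvm = {"formula": 0.01, "merger": 0.02}  # term probabilities in ncVM (toy values)
s = confidentiality_score(["formula", "merger", "lunch"], p_cvm, p_ncvm)
# "formula" contributes about 4.0 and "merger" about 1.0, so s is about 5.0
```

Terms whose ratio exceeds 1 pull the score up, matching the criterion in the text that a term scoring above 1 is more likely to appear in confidential documents.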

However, the following problem may arise: if a cluster includes only a few nonconfidential documents, or possibly none at all, its language model cannot fully represent the nonconfidential documents in it. The solution we propose follows an expanding manner: we first predefine the minimal similarity threshold TR_min and iteratively expand the ncVM to include more clusters. TR denotes the cosine similarity threshold between the cluster with few nonconfidential documents and its expanding clusters; note that not all clusters need to be expanded. After each iteration we lower the value of TR, and as long as TR remains greater than TR_min, the nonconfidential documents of the expanding clusters are included to recalculate the scores of the terms in the original cluster.

When the adjacent clusters have been included and the scores of the confidential terms recalculated, each term whose score is greater than 1 is considered a confidential term, which means the term is more likely to appear in confidential documents than in nonconfidential documents. After this phase, the set of confidential terms CT is obtained.

3.2.2. Analyzing the Context of Confidential Terms. After confidential term detection, we further analyze the context of the confidential terms. Apparently, a term is more likely to be considered confidential if it appears in similar contexts in other confidential documents. Conversely, if the context of a confidential term frequently appears in nonconfidential documents, the probability that the confidential term is part of confidential contents is lower.

As a predefined parameter, the context span η determines the number of terms that precede and follow a confidential term. A context span with a high value might increase the computational cost, while a context span with a low value cannot provide adequate context information for the confidential terms. Experimental results show that η = 10 tends to be the optimal value of the context span in our experiments, which means that the context of a confidential term consists of the five terms preceding it and the five terms following it. Apparently, only the context of the confidential terms in confidential documents needs to be taken into account.
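Extracting such a context window can be sketched as follows (an illustration with a made-up sentence; span = 4 is used here only to keep the example short, whereas the paper uses η = 10):

```python
def context_window(tokens, term, span=10):
    # Collect up to span/2 terms before and span/2 terms after each occurrence
    # of `term` (span = 10 gives the five-before / five-after window of the paper).
    half = span // 2
    ctx = set()
    for i, t in enumerate(tokens):
        if t == term:
            ctx.update(tokens[max(0, i - half):i])   # preceding terms
            ctx.update(tokens[i + 1:i + 1 + half])   # following terms
    ctx.discard(term)
    return ctx

tokens = "the secret formula was stored in the vault last night".split()
print(sorted(context_window(tokens, "formula", span=4)))
# -> ['secret', 'stored', 'the', 'was']
```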

The probabilities of a confidential term appearing together with its context in confidential documents and in nonconfidential documents, denoted by P_c(key, context_key) and P_nc(key, context_key), are calculated separately. If the former is higher than the latter, the corresponding confidential contents can be well represented by the confidential term with its context. P_c(key, context_key) is defined as the number of confidential documents in which the confidential term appears with its context, divided by the number of confidential documents in which only the confidential term appears. Similarly, P_nc(key, context_key) is defined as the number of nonconfidential documents in which the confidential term appears with its context, divided by the number of nonconfidential documents in which only the confidential term appears.

As mentioned above, we predefine the minimum cosine similarity threshold TR_min and iteratively expand to include more clusters, where TR is the cosine similarity threshold between the cluster with few nonconfidential documents and its expanding cluster. After each iteration we lower the value of TR at a predefined rate μ, namely TR = μ * TR. As long as TR remains greater than TR_min, the nonconfidential documents of the expanding cluster are included to recalculate the scores of the context terms in the original cluster. By including more adjacent clusters, we can accurately estimate which terms are most likely to indicate the confidentiality of a document.
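The expanding procedure might be sketched as follows (an illustration only; the cluster identifiers, the similarity table, and the values of μ and the thresholds are placeholders):

```python
def expand_clusters(target, others, similarity, tr_init, tr_min, mu=0.9):
    # Iteratively lower TR (TR = mu * TR); as long as TR stays above TR_min,
    # pull in every cluster whose similarity to the target exceeds TR.
    included, tr = [], tr_init
    while tr > tr_min:
        for other in others:
            if other not in included and similarity(target, other) > tr:
                included.append(other)
        tr *= mu
    return included

sims = {("A", "B"): 0.95, ("A", "C"): 0.5}
order = expand_clusters("A", ["B", "C"], lambda a, b: sims[(a, b)],
                        tr_init=0.9, tr_min=0.4, mu=0.5)
# the most similar cluster "B" is included first, then "C" once TR drops below 0.5
```

Lowering TR geometrically means nearby clusters contribute their nonconfidential documents early, while weakly related clusters join only in later iterations, if at all.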

By subtracting the probability of each context term appearing with the confidential term in nonconfidential documents from the probability of their appearance in confidential ones, the score of each context term is calculated, as shown in (2).

The reason for employing subtraction rather than division is to avoid large fluctuations in the values of the context terms. When employing division, even a single document can dramatically change the probabilities, as only the documents including confidential terms are taken into account.

score += P_c(key, context_key) − P_nc(key, context_key)  (2)

We iteratively expand to include more clusters. After each iteration we lower the value of TR until TR is less than TR_min, and the score of each context term is calculated as shown in (3), in which n_clu denotes the number of clusters involved.

score += (1 / n_clu) * (P_c(key, context_key) − P_nc(key, context_key))  (3)

After this phase, the set of context terms with their scores, ContextT, is obtained. For each confidential term, the context terms whose scores are positive are more likely to appear in confidential documents together with the confidential term.

3.2.3. Creating the Graph Representation. After the operations described in the previous sections, the confidential terms and their context can be easily represented as nodes and connected according to their interrelations. As shown in Figure 2, for each cluster a set of confidential terms and a set of context terms are obtained after the training phase; confidential terms and context terms are represented as confidential nodes and context nodes, respectively. Confidential nodes are connected as long as there exists at least one common context node between them.
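Building the cluster graph then amounts to connecting confidential nodes that share a context node (a sketch; the toy terms below are invented for the example):

```python
from itertools import combinations

def build_cluster_graph(context_of):
    # context_of: confidential term -> set of its context terms.
    # Two confidential nodes are connected whenever they share a context node.
    edges = set()
    for a, b in combinations(sorted(context_of), 2):
        if context_of[a] & context_of[b]:
            edges.add((a, b))
    return edges

ctx = {"formula": {"lab", "patent"}, "merger": {"patent", "board"}, "salary": {"hr"}}
print(sorted(build_cluster_graph(ctx)))  # -> [('formula', 'merger')]
```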

3.3. Pruning the Nodes of the Graph. Because the scores of confidential terms and their context terms are based on statistics, a nonconfidential term might occasionally receive a high score because of term abuse. In the pruning phase, we employ the term reduction method from rough set theory to remove the redundant nodes of the graph.

Using the information of the confidential and nonconfidential documents, we evaluate the importance of the nodes in the graph of each cluster. A node can be pruned only if the removal of the term it represents does not influence the results of identifying the confidential documents in the cluster. As shown in (4), Impt(t_i) denotes the importance measure of term t_i, represented as node n_i in the graph; Pos_G(C) denotes the portion of the confidential documents set C that can be identified correctly by graph G; similarly, Pos_{G−n_i}(C) denotes the portion of C that can be identified correctly by graph G−n_i, i.e., graph G with node n_i removed. The details of the pruning procedure are presented in Algorithm 2.

Impt(t_i) = (|Pos_G(C)| - |Pos_{G-n_i}(C)|) / |C|   (4)

6 Wireless Communications and Mobile Computing

Input: C - confidential documents set
       N - nonconfidential documents set
       CR - clusters set, each with its centroid and corresponding graph
       ContextT - the set of context terms
Output: CT' - the updated set of confidential terms in clusters
        ContextT' - the updated set of context terms

(1) irr_terms ← CT ∪ ContextT
(2) for (each t in irr_terms)
(3)   if (Impt(t) = (|Pos_T(C_cr ∧ N_cr)| - |Pos_{T-t}(C_cr ∧ N_cr)|) / |cr| < Impt_threshold)
(4)     irr_terms = irr_terms - {t}
(5) for (each t in irr_terms)
(6)   CT' ← update confidential terms in CT
(7)   ContextT' ← update context terms in ContextT

Algorithm 2: Pruning.
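A minimal sketch of the pruning idea of (4) and Algorithm 2 follows. It assumes a deliberately simplified notion of "identified by the graph" (a document counts as identified if it contains any surviving term); the paper's actual graph matching is richer, and `pos`, `prune`, and the threshold value are illustrative.

```python
# Hedged sketch of importance-based pruning: a term is dropped when removing
# its node barely changes how many confidential documents are still identified.

def pos(graph_terms, confidential_docs):
    """Indices of documents 'identified' as confidential: here, simply those
    containing at least one term that is still in the graph."""
    return {i for i, doc in enumerate(confidential_docs)
            if any(t in doc for t in graph_terms)}

def prune(graph_terms, confidential_docs, threshold=0.1):
    kept = set(graph_terms)
    # Iterate over a sorted copy; 'kept' shrinks as low-importance terms go.
    for t in sorted(graph_terms):
        base = len(pos(kept, confidential_docs))
        without = len(pos(kept - {t}, confidential_docs))
        impt = (base - without) / len(confidential_docs)  # Eq. (4)
        if impt < threshold:
            kept.discard(t)
    return kept

docs = [{"merger", "board"}, {"acquisition", "press"}, {"merger", "press"}]
print(prune({"merger", "acquisition", "noise"}, docs, threshold=0.1))
```

The term "noise" is pruned: removing it does not change which confidential documents the remaining terms identify, so its importance measure is zero.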

Figure 2: An example of confidential nodes connected through their context nodes. Two confidential nodes (scores 10 and 5) are linked to three context nodes by edges with scores 0.5, 0.4, -0.2, 0.6, and -0.3.

3.4. Detection Phase. Obviously, a confidential document without any modification is easy to detect according to its confidential terms. However, confidential documents which are rephrased, or which are partitioned into portions that are further concealed in different nonconfidential documents, as most plagiarizers often do, can hardly be detected. Once the detection of confidential contents fails, data leakage or copyright infringement becomes more likely.

In the detection phase, we employ the CBDLP model to deal with three scenarios that could possibly happen, described as follows:

(i) Each confidential document is detected as a whole.

(ii) Each confidential document is divided into portions and embedded in nonconfidential ones.

(iii) The confidential terms in confidential documents are rephrased completely.

The detection method we employ includes three steps, as shown in Figure 3:

(1) Classify the documents to be tested into the corresponding clusters.

(2) Identify the confidential terms and their context terms according to the graphs of the corresponding clusters.

(3) Calculate the confidentiality scores for the documents and conclude whether each document is confidential or not.
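The scoring in steps (2) and (3) can be sketched as follows. The scoring rule here is made up for illustration (context hits scale up a matched confidential term's weight); the paper's actual score computation over the cluster graph is more involved, and all names are hypothetical.

```python
# Illustrative sketch of detection scoring against one cluster's terms
# (not the paper's exact formula).

def confidentiality_score(doc_terms, cluster):
    score = 0.0
    for term, term_score in cluster["confidential"].items():
        if term in doc_terms:
            # Context terms found alongside a confidential term raise its weight.
            ctx_hits = sum(1 for c in cluster["context"].get(term, ())
                           if c in doc_terms)
            score += term_score * (1 + ctx_hits)
    return score

def is_confidential(doc_terms, cluster, threshold=1.0):
    return confidentiality_score(doc_terms, cluster) >= threshold

cluster = {"confidential": {"merger": 1.0},
           "context": {"merger": ["board", "quarter"]}}
print(is_confidential({"merger", "board"}, cluster))   # True (score 2.0)
print(is_confidential({"park", "picnic"}, cluster))    # False (score 0.0)
```

A document matching both a confidential term and its context scores above the threshold; a document matching neither does not.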

The security model, which combines the training phase and the detection phase, is shown in Figure 4.

4. Experiments

In this section, we evaluate the performance of CBDLP on the Reuters-21578 dataset. Reuters-21578 consists of 21,578 news items distributed by Reuters in 1987, saved in 22 files. The dataset is manually classified into five categories, each of which is subdivided into a different number of subcategories; for example, the economy news includes the inventories, gold, and money-supply subsets.

4.1. Performance Experiments. In the experiments, we present the data leakage prevention method based on the CBDLP model, together with a modified model without the pruning step, denoted CBDLP-Pr. Since SVM has been proved to be an excellent classifier with high accuracy, and CoBAn performs well in the scenario where confidential contents are embedded in nonconfidential documents or rephrased, we compare the performance of CBDLP, CBDLP-Pr, SVM, and CoBAn. We evaluate the performance of the methods with the true positive rate (TPR) and false positive rate (FPR), and our goal is to maximize TPR and minimize FPR concurrently.
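For clarity, the two evaluation measures can be computed as follows: TPR is the fraction of confidential documents that are flagged, and FPR is the fraction of nonconfidential documents that are flagged.

```python
# TPR/FPR as used in the evaluation, for binary labels (1 = confidential).

def tpr_fpr(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(tpr_fpr(y_true, y_pred))  # TPR = 2/3, FPR = 1/5
```

Varying the decision threshold of a detector and recording (FPR, TPR) pairs produces the ROC-style curves shown in Figures 5-7.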


Figure 3: The detection of test documents. The flowchart shows the first step (match the test documents to the most similar cluster), the second step (match the confidential terms and their context terms against the cluster's graph), and the third step (calculate scores for the test documents and determine the confidentiality of each one).

Figure 4: The security model. The training phase produces the clusters of confidential and nonconfidential documents, their graphs, and the confidential and context terms; the detection phase matches test documents to clusters, matches the confidential terms and their context terms, scores the test documents, and determines their confidentiality.

We conduct experiments on the three scenarios described above. For the first scenario, we select the news of "earn" as the carrier for confidential contents and mix them with news from the other economy subsets as the training dataset and the testing dataset, respectively. For the second scenario, we extract contents from the documents of the "earn" subset and embed them in documents from other subsets; the embedded portions are to be detected as confidential contents. For the third scenario, we manually rephrase the contents of the documents of the "earn" subset and embed them in documents from other subsets.

4.1.1. Confidential Documents as a Whole. The experimental result of the first scenario is presented in Figure 5. As shown in


Figure 5: Performance (TPR versus FPR) of detecting confidential documents as a whole, for CBDLP, CBDLP-Pr, CoBAn, and SVM.

Figure 6: Performance (TPR versus FPR) of detecting the confidential contents embedded in nonconfidential documents.

Figure 5, when dealing with the scenario where confidential documents are considered as a whole, the four detection algorithms perform very similarly. In spite of that, CBDLP and CBDLP-Pr still perform slightly better than CoBAn and SVM. This can be explained by the fact that CoBAn is partly influenced by the limitation of k-means, which cannot deal effectively with clusters of various shapes, and that SVM focuses only on confidential terms and ignores context terms. In this scenario, since the documents containing confidential terms are explicitly detected as confidential, the difference between the four methods is small.

4.1.2. Confidential Portions Embedded in Nonconfidential Documents. The result of the second scenario is presented in Figure 6. As shown in Figure 6, when dealing with the scenario where the confidential portions are embedded in nonconfidential documents, CBDLP, CBDLP-Pr, and CoBAn

Figure 7: Performance (TPR versus FPR) of detecting the rephrased confidential contents embedded in nonconfidential documents.

perform better than SVM, which can be explained by the fact that SVM is deceived by this scenario due to its statistical nature. As expected, the performance of CBDLP is slightly better than that of CBDLP-Pr and CoBAn due to its pruning step, which removes the redundant nodes in the graph that might deteriorate the detection results.

In this scenario, confidential portions are extracted from the documents defined as confidential and then embedded in nonconfidential documents whose lengths are at least ten times those of the extracted portions. Due to its statistical nature, SVM incorrectly detects most documents containing confidential portions as nonconfidential, which results in a dramatic decline in its accuracy. Unlike SVM, CBDLP, CBDLP-Pr, and CoBAn take the confidential terms together with their context into account, and most nonconfidential documents containing embedded confidential portions are detected as confidential.

4.1.3. Rephrased Confidential Contents in Nonconfidential Documents. The result of the third scenario is presented in Figure 7. As shown in Figure 7, when dealing with the scenario where the confidential contents are rephrased and embedded in nonconfidential documents, the performance of SVM deteriorates considerably due to its statistical nature. Since the rephrased contents do not deviate much from their original meaning, CBDLP, CBDLP-Pr, and CoBAn perform well. In addition, the performance of CBDLP is better than that of CBDLP-Pr and CoBAn due to its pruning step, which removes the redundant nodes in the graph.

In this scenario, the rephrased confidential terms are embedded in nonconfidential documents, which greatly confuses SVM, and most documents containing rephrased confidential contents are incorrectly detected as nonconfidential. Unlike SVM, CoBAn takes the context of confidential terms into account and detects most documents containing confidential contents; however, the accuracy of CoBAn is partially influenced by the cluster's terms graph, which depends on the quality of the clusters generated by k-means. In contrast, CBDLP clusters documents with DBSCAN, which


Figure 8: Running time (s) versus number of documents (0 to 5000) for SVM, CoBAn, CBDLP, and CBDLP-Pr.

improves the quality of the clusters and of the cluster's terms graph; meanwhile, the pruning method removes the redundant nodes in the graph and further improves the performance of CBDLP.

4.2. Running Time Comparisons. In this experiment, we mixed nonconfidential documents together with the three types of confidential documents: whole confidential documents, confidential contents embedded in nonconfidential documents, and rephrased confidential contents embedded in nonconfidential documents. The experiment is conducted using 10-fold cross validation. To compare the running times of CBDLP, CBDLP-Pr, CoBAn, and SVM, we conduct the experiment on datasets of different sizes. The result is shown in Figure 8, where the running times of the training and testing phases are exhibited as a line graph: the running times of CBDLP, CBDLP-Pr, CoBAn, and SVM all increase as more documents are added to the dataset. Although the additional steps of CBDLP, CBDLP-Pr, and CoBAn result in running times roughly an order of magnitude higher than SVM needs, CBDLP detects confidential contents much more accurately than SVM does.

5. Conclusion and Future Work

In this paper, we present a new method for data leakage prevention based on the CBDLP model, which has the following advantages:

(1) It clusters the documents with DBSCAN and the cosine measure, which have been verified to be effective.

(2) It represents confidential terms and their context terms in a graph.

(3) It presents a pruning method based on the attribute reduction method of rough set theory.

Up to now, some dedicated commercial DLP solutions can reduce the risk of most accidental leakage; however, they cannot provide sufficient protection against intentional leakage. Other DLP solutions, such as firewalls, IDS, antimalware software, and management policies, can assist in detecting intrusions or malicious software and can enforce policies to protect data, but they still do not prevent intentional leaks perfectly. To the best of our knowledge, there are two main future research topics in DLP: data leakage from mobile devices and accidental data leakage by insiders.

Since accidental data leakage may be part of a larger attack, in which its role is mainly to activate an advanced persistent threat inside the organization, it is expected to remain one of the most challenging research topics. Our future work will focus on accidental data leakage in two directions: first, improving the efficiency and effectiveness of CBDLP in detecting confidential contents; second, adjusting the model dynamically according to changes in the training dataset.

Data Availability

All data generated or analysed during this study are included in this published article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research is supported by the National Natural Science Foundation of China under Grants nos. 61871140 and 61572153.

References

[1] A. Goyal, F. Bonchi, and V. S. Lakshmanan, "On minimizing budget and time in influence propagation over social networks," Social Network Analysis and Mining, pp. 1-14, 2012.
[2] C. Aggarwal, Social Network Data Analytics, Springer, Berlin, Germany, 2011.
[3] "Information week global security survey," Information Week, 2004.
[4] D. Alassi and R. Alhajj, "Effectiveness of template detection on noise reduction and websites summarization," Information Sciences, vol. 219, pp. 41-72, 2013.
[5] D. Holmes, Using language models for information retrieval [Ph.D. thesis], Center for Telematics and Information Technology, University of Twente, 2001.
[6] R. Bohme, "Security metrics and security investment models," in Proceedings of the 5th International Conference on Advances in Information and Computer Security, 2010.
[7] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Boston, Mass, USA, 1990.
[8] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Transactions on Signal Processing, vol. 35, no. 3, pp. 400-401, 1987.
[9] R. Jin, L. Si, A. G. Hauptmann, and J. Callan, "Language model for IR using collection information," in Proceedings of the 25th Annual International ACM SIGIR Conference, pp. 419-420, 2002.
[10] W. W. Cohen, "Learning rules that classify e-mail," in Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, pp. 18-25, 1996.
[11] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613-620, 1975.
[12] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the Association for Information Science and Technology, vol. 41, no. 6, pp. 391-407, 1990.
[13] C. Ordonez, "Clustering binary data streams with K-means," in Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD '03, pp. 12-19, USA, June 2003.
[14] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," in Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2003, pp. 234-243, June 2003.
[15] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: a new data clustering algorithm and its applications," Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141-182, 1997.
[16] S. Guha, R. Rastogi, and K. Shim, "Cure: an efficient clustering algorithm for large databases," in Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 73-84, 1998.
[17] J. W. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.
[18] A. Hinneburg and D. A. Keim, "Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering," in Proceedings of the 25th VLDB Conference, pp. 506-517, 1999.
[19] W. Wang, J. Yang, and R. Muntz, "Sting: a statistical information grid approach to spatial data mining," in Proceedings of the 23rd VLDB Conference, pp. 186-195, 1997.
[20] R. Agrawal, J. Gehrke, and D. Gunopulos, "Automatic subspace clustering of high dimensional data for data mining applications," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 94-105, 1998.
[21] A. McCallum, K. Nigam, and L. H. Ungar, "Efficient clustering of high-dimensional data sets with application to reference matching," in Proceedings of KDD 2000, pp. 169-178, ACM, New York, NY, USA, 2000.
[22] R. Hyde, P. Angelov, and A. R. MacKenzie, "Fully online clustering of evolving data streams into arbitrarily shaped clusters," Information Sciences, vol. 382-383, pp. 96-114, 2017.
[23] F. Jiang, Y. Fu, B. B. Gupta et al., "Deep learning based multi-channel intelligent attack detection for data security," IEEE Transactions on Sustainable Computing, 2018.
[24] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the 2000 ACM SIGMOD International Conference, pp. 439-450, May 2000.
[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513-523, 1988.
[26] G. Salton, "Automatic text processing: the transformation, analysis and retrieval of information by computer," Tech. Rep., Addison-Wesley, Inc., 1989.
[27] M. N. Islam, M. Seera, and C. K. Loo, "A robust incremental clustering-based facial feature tracking," Applied Soft Computing, vol. 53, pp. 34-44, 2017.
[28] A. Shabtai and Y. Elovici, A Survey of Data Leakage Detection and Prevention Solutions, Springer, Berlin, Germany, 2012.
[29] S. R. Kalidindi, S. R. Niezgoda, G. Landi, S. Vachhani, and T. Fast, "A novel framework for building materials knowledge systems," Computers, Materials and Continua, vol. 17, no. 2, pp. 103-125, 2010.
[30] A. J. Wang, "Information security models and metrics," in Proceedings of the 43rd Annual Southeast Regional Conference, ACMSE 43, pp. 178-184, 2005.
[31] P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301-312, 2002.
[32] M. Dash, K. Choi, P. Scheuermann, and H. Liu, "Feature selection for clustering - a filter solution," in Proceedings of the 2nd IEEE International Conference on Data Mining, ICDM '02, pp. 115-122, December 2002.
[33] C.-S. Liu, "An analytical method for computing the one-dimensional backward wave problem," Computers, Materials and Continua, vol. 13, no. 3, pp. 219-234, 2010.
[34] G. Katz, Y. Elovici, and B. Shapira, "Coban: a context based model for data leakage prevention," Information Sciences, vol. 262, pp. 137-158, 2014.
[35] X. Huang, Y. Lu, D. Li, and M. Ma, "A novel mechanism for fast detection of transformed data leakage," IEEE Access, vol. 1, pp. 1-11, 2018.
[36] M. Porter, "The Porter stemming algorithm," 2006.
[37] F. Pacheco, M. Cerrada, R.-V. Sanchez, D. Cabrera, C. Li, and J. Valente de Oliveira, "Attribute clustering using rough set theory for feature selection in fault severity classification of rotating machinery," Expert Systems with Applications, vol. 71, pp. 69-86, 2017.
[38] "Information week global security survey," Information Week, 2004.
[39] C. M. Praba, "A technical review on data leakage detection and prevention approaches," Journal of Network Communications and Emerging Technologies (JNCET), 2017.
[40] F. Ullah, M. Edwards, R. Ramdhany, R. Chitchyan, M. A. Babar, and A. Rashid, "Data exfiltration: a review of external attack vectors and countermeasures," Journal of Network and Computer Applications, 2017.
[41] K. Thomas, F. Li, A. Zand et al., "Data breaches, phishing, or malware: understanding the risks of stolen credentials," in Proceedings of the 24th ACM SIGSAC Conference on Computer and Communications Security, CCS '17, pp. 1421-1434, November 2017.
[42] J. DeBlasio, S. Savage, G. M. Voelker, and A. C. Snoeren, "Tripwire: inferring internet site compromise," in Proceedings of the IMC '17, pp. 1-14, 2017.
[43] W. Xu, S. Xiang, and V. Sachnev, "A cryptograph domain image retrieval method based on paillier homomorphic block encryption," Computers, Materials and Continua, pp. 1-11, 2018.
[44] J. Cui, Y. Zhang, Z. Cai, A. Liu, and Y. Li, "Securing display path for security-sensitive applications on mobile devices," Computers, Materials and Continua, vol. 55, no. 1, pp. 17-35, 2018.
[45] D. Wang, W. Li, and P. Wang, "Measuring two-factor authentication schemes for real-time data access in industrial wireless sensor networks," IEEE Transactions on Industrial Informatics, pp. 1-12, 2018.
[46] D. Wang, H. Cheng, P. Wang, J. Yan, and X. Huang, "A security analysis of honeywords," in Proceedings of the Network and Distributed Systems Security (NDSS) Symposium, pp. 18-21, 2018.


The remainder of this paper is organized as follows. In Section 2, we introduce previous related work on data leakage prevention. In Section 3, we present the CBDLP model together with the corresponding clustering, decision, and calculation algorithms. The experiments conducted to evaluate CBDLP in all circumstances are discussed in Section 4. Finally, Section 5 concludes this paper and discusses the directions of our future research.

2. Related Work

In this section, we review the clustering of textual documents, attribute reduction methods, and the graph representation of textual documents, respectively.

2.1. Clustering of Textual Documents. The problem of clustering textual documents is similar to high-dimensional clustering. In general, each term of a textual document is considered an independent dimension, and each document is then considered a vector consisting of thousands of terms. By calculating the cosine of the angle between documents, textual documents can be classified in terms of similarity, which is reflected by the cosine value [7-10].

The vector space model (VSM) is one of the most widely used text representation models [11]. It was first presented by Salton in the 1960s and successfully applied in SMART, a system for the manipulation and retrieval of text. In the VSM model, a textual document is represented as D = ((T1, W1), (T2, W2), ..., (Tn, Wn)), where Ti and Wi denote the ith term and its weight in the document, respectively. The classification of a textual document is then determined by calculating the similarity between the document to be classified and the documents whose classifications are already known.

Term frequency-inverse document frequency (TF-IDF) is a frequently employed and effective statistical method used to evaluate the importance of a term in a document collection [12]. As is well known, the importance of a term is proportional to the frequency of its occurrence in a document and inversely proportional to the frequency of its occurrence in the whole corpus. TF-IDF has been widely used in various fields, such as text mining, search engines, and information retrieval.
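The weighting just described can be computed as follows. This is one common smoothing-free variant for illustration, not necessarily the exact formula used in the paper.

```python
# A small TF-IDF illustration: term frequency within the document times
# log inverse document frequency across the corpus.

import math

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["gold", "price", "rise"],
          ["gold", "mine", "output"],
          ["money", "supply", "rise"]]
doc = corpus[0]
# "gold" appears in 2 of 3 documents, "price" only in this one,
# so "price" gets the higher weight.
print(tf_idf("price", doc, corpus) > tf_idf("gold", doc, corpus))  # True
```

The more widely a term is spread across the corpus, the lower its IDF, and hence the lower its weight in any single document.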

On the basis of the VSM model and the TF-IDF method, existing textual document clustering algorithms can be divided into five main categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Partitioning methods, which are efficient and insensitive to the order of the documents, divide n documents into k clusters in terms of clustering criteria; representative partitioning methods include k-means and k-medoids [13, 14]. Hierarchical methods split documents into different clusters or merge different documents into one cluster in terms of similarity, in a top-down or bottom-up hierarchical manner; representative hierarchical methods include BIRCH and CURE [15, 16]. Unlike partitioning methods, density-based methods focus on the density of a certain area. When the density

of the documents within a certain area exceeds a predefined threshold, they are incorporated into the same cluster; representative density-based methods include DBSCAN and OPTICS [17, 18]. Grid-based methods partition the data space into a limited number of cells in advance and merge adjacent cells whose density exceeds the density threshold into the same cluster; representative grid-based methods include STING and CLIQUE [19, 20]. In model-based methods, a different model is bound to each cluster, and the objective is to find all data subsets that best fit each model. Statistical solutions such as SVM [21] and neural network solutions are adopted extensively in model-based methods [22]. The support vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.

In this paper, we calculate the cosine value, which reflects the similarity between documents, and cluster documents with DBSCAN. DBSCAN, proposed by Martin Ester in 1996, is a density-based clustering algorithm that is widely cited in the scientific literature [23] and was awarded the test-of-time award in 2014 [24]. Unlike k-means, DBSCAN does not require the number of clusters to be specified in advance and can find clusters of arbitrary shape. In addition, DBSCAN is robust to outliers, as opposed to k-means.
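The combination of DBSCAN with cosine similarity can be sketched compactly as below. This is a naive O(n²) teaching version with made-up vectors and parameter values; a real system would use an indexed implementation such as scikit-learn's.

```python
# A compact DBSCAN over cosine similarity (eps_sim is the minimal similarity
# to count as a neighbour; min_pts counts the point itself).

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def dbscan(vectors, eps_sim=0.9, min_pts=2):
    labels = [None] * len(vectors)   # None = unvisited, -1 = noise

    def neighbours(i):
        return [j for j in range(len(vectors))
                if cosine(vectors[i], vectors[j]) >= eps_sim]

    cluster = 0
    for i in range(len(vectors)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:                 # expand the cluster from core points
            j = queue.pop()
            if labels[j] in (None, -1):
                labels[j] = cluster
                more = neighbours(j)
                if len(more) >= min_pts:
                    queue.extend(k for k in more if labels[k] is None)
        cluster += 1
    return labels

docs = [(1, 0, 0), (0.9, 0.1, 0), (0, 1, 0), (0, 0.9, 0.1), (0.5, 0.5, 0.7)]
print(dbscan(docs, eps_sim=0.95, min_pts=2))  # [0, 0, 1, 1, -1]
```

The first two vectors form one cluster, the next two another, and the last vector is marked as noise (-1): exactly the behavior that makes DBSCAN attractive here, since the number of clusters is never specified and outliers are isolated.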

2.2. Graph Representation of Textual Documents. Graphs have already been used in many text-related tasks, which employ a graph as the model for text representation instead of the existing methods [25]. As an alternative to the vector space model for representing textual information, graphs can be created from documents and further used in text-related tasks such as information retrieval [26], text mining [27], and topic detection [28].

In general, graph-based models are usually employed in the domain of information retrieval, as in PageRank [29] and HITS [30]. When determining similarity, graph matching, which is generally used to detect similar documents, is NP in complexity [31], whereas methods based on the vector space model perform efficiently by calculating the Euclidean distance or cosine measure between document vectors [32]. The main advantage of a graph-based model is that it can capture not only the contents and structure of a document but also the terms together with their context. To the best of our knowledge, graph-based models are seldom employed in text-related tasks. Schenker presents a graph-related algorithm with several variants [33], in which a graph is built and the terms, connected with edges, are considered as nodes; the differences between the variants relate to term-based techniques. Gilad Katz presents CoBAn, a context based model for data leakage prevention, which enlightens us a lot [34]. However, CoBAn is partly influenced by the limitation of k-means, which it employs, and there might be redundant nodes in the graph it generates. Xiaohong Huang et al. propose an Adaptive Weighted Graph Walk model (AGW) to


solve the problem of transformed data leakage by mapping it to the dimension of weighted graphs [35].

In this paper, we employ a hybrid approach that combines graph and vector representations. When clustering documents, we employ DBSCAN with the cosine measure. When representing the confidential textual content and its context for each cluster, a graph that includes only confidential and context nodes is created.

2.3. Redundancy Information Reduction. When dealing with text-related tasks, redundant information is generally useless; even worse, it might decrease the efficiency of task execution. There are many representative redundancy reduction methods, such as PCA [36], SVD [37], and LSI [38]. The principle of PCA is to transform multiple attributes into a few principal attributes that reflect the information of the original data effectively; however, the complexity of PCA is generally high, and part of the original information may be lost. SVD has almost the same advantages and disadvantages as PCA. LSI represents textual data with latent topics that consist of specific terms, but in most cases the influence of specific terms is ignored. In this paper, the reduction method from rough set theory, as shown in Section 3, is employed and partly recomposed to meet our requirements.

2.4. Data Leakage Prevention. As the number of leakage incidents and the cost they inflict continue to increase, the threat that data leakage poses to companies and organizations has become more and more serious [39-41]. Considering the enormity of data leakage prevention, various models and approaches have been developed to address the problem. Tripwire is a recent prototype system proposed by Joe DeBlasio et al. in 2017; it registers honey accounts with individual third-party websites, so that access to an email account provides indirect evidence of credential theft at the corresponding website [42]. However, Tripwire is more suitable for forensics than for confidential data leakage prevention. In 2018, Wenjia Xu et al. proposed a promising image retrieval method in the ciphertext domain, based on block image encryption with the Paillier homomorphic cryptosystem, which can manage and retrieve ciphertext data effectively [43]; nevertheless, the method focuses on data encryption rather than data detection. Since smart devices based on ARM processors have become an attractive target of cyberattacks, Jinhua Cui et al. presented a scheme named SecDisplay for trusted display service in 2018; it protects sensitive displayed data from being stolen or tampered with surreptitiously by a compromised OS [44], but it pays less attention to scenarios of intentional or accidental data leakage from insiders. According to the work of Ding Wang et al., many authentication schemes have been proposed to secure data access in industrial wireless sensor networks; however, they do not work well [45]. In addition, Ding Wang et al. developed a series of practical experiments to evaluate the security of four main suggested honeyword-generation methods and further proved that they all fail to provide the expected security [46].

Figure 1: The DBSCAN method. Training documents undergo stemming and stop-word removal and are then clustered by DBSCAN into clusters consisting of confidential and nonconfidential documents.

3. CBDLP Model

CBDLP consists of a training phase and a detection phase. The training phase can be further divided into three steps: a clustering step, a graph building step, and a pruning step. During the training phase, the training documents are first classified into different clusters, then each cluster is represented by a graph, and finally the nodes of each graph are pruned in terms of their importance. During the detection phase, documents are matched to the graphs of the clusters, and their confidential scores are calculated; a document is considered confidential only if its confidential score exceeds a predefined threshold. The details of the CBDLP training phase are presented in Algorithm 1.

3.1. Clustering Documents with DBSCAN. In the first step, we apply stemming and stop-word removal to all documents in the training set and transform the processed documents into vectors of weighted terms. After applying DBSCAN with the cosine measure to the vectors representing the training documents, each resulting cluster represents an independent topic, and a cluster may contain both confidential and non-confidential documents, as shown in Figure 1.

The procedure of DBSCAN is described as follows

Step 1. A data set of n documents is given, together with ε, the threshold of minimal similarity between documents of the same cluster, and MinPts, the threshold of the minimal number of documents in a cluster.

4 Wireless Communications and Mobile Computing

Input: C - confidential documents set; N - non-confidential documents set; TR_min - the minimum similarity threshold.

Output: CR - the set of clusters, each with its centroid and corresponding graph; CT - the set of confidential terms in clusters; ContextT - the set of context terms.

(1) T ← C ∪ N
(2) CR ← DBSCAN(T)  // the result of clustering T is saved in CR
(3) Initialize CT[CR]  // the scores of confidential terms are saved in CT
(4) Initialize ContextT  // the context terms set of each confidential term is saved in ContextT
(5) for (each cr in CR) {
(6)   Calculate the similarity between cr and the other clusters
(7)   Create a language model for cr and calculate the score of each confidential term
(8)   TR ← initial threshold of cluster similarity
(9)   while (TR > TR_min) {
(10)    C_temp ← all clusters whose similarity to cr > TR
(11)    Create a language model for the documents of C_temp
(12)    CT[cr] ← update the scores of confidential terms based on the new language model
(13)    for (each confidential term ct in cr) {
(14)      Detect the occurrences of ct in C ∪ N
(15)      P_confidential_doc(term, ct) ← for each context term of ct, calculate the probability that both ct and the context term appear in confidential documents
(16)      P_non_confidential_doc(term, ct) ← for each context term of ct, calculate the probability that both ct and the context term appear in non-confidential documents
(17)      ContextT ← calculate P_confidential_doc(term, ct) − P_non_confidential_doc(term, ct) for each confidential term ct
(18)      Detect all clusters whose similarity is greater than TR and detect the occurrences of all terms in those clusters
(19)      ContextT ← update the probability of the context terms that appear in the scopes of different confidential terms
(20)    }
(21)    Reduce the value of TR
(22)  }
(23) }

Algorithm 1: CBDLP.

Step 2. Start with an arbitrary document that has not been visited and find all the documents in its ε-neighborhood. If the number of documents in the neighborhood exceeds MinPts, incorporate the documents into the same cluster and label them.

Step 3. If not all documents have been visited, start again from another arbitrary document which has not been visited.

Step 4. Mark the documents which are not labelled as noise.
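The four steps above can be sketched in pure Python. This is an illustrative minimal implementation, not the authors' code; `cosine_sim` and the queue-based expansion are assumptions consistent with standard DBSCAN (any off-the-shelf implementation with a cosine metric would serve equally well).

```python
import math
from collections import deque

def cosine_sim(a, b):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dbscan(vectors, eps, min_pts):
    """Cluster document vectors; a similarity >= eps joins a neighborhood.
    Returns one label per vector; -1 marks noise (Step 4)."""
    n = len(vectors)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(n) if cosine_sim(vectors[i], vectors[j]) >= eps]
        if len(neigh) < min_pts:
            labels[i] = -1  # provisionally noise; may be reclaimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neigh)
        while queue:  # expand the cluster (Steps 2-3)
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neigh = [k for k in range(n) if cosine_sim(vectors[j], vectors[k]) >= eps]
            if len(j_neigh) >= min_pts:  # only core points keep expanding
                queue.extend(j_neigh)
    return labels
```

Note that, unlike k-means, no cluster count is fixed in advance, which is what lets the resulting clusters take arbitrary shapes.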

3.2. Representing Clusters with Graphs. In this step, the confidential contents in all clusters, which include not only the confidential terms but also their context, need to be represented by graphs. The procedure of creating the graph representation for the clusters that include confidential contents is described as follows.

(1) Detect the confidential terms provided by domain experts or inferred from the key terms of the training documents.

(2) Analyze the context of each confidential term.

(3) Create the graph representation for the confidential terms and their contexts at the cluster level.

3.2.1. Detecting Confidential Terms. In general, a term that appears in confidential documents with high probability and in non-confidential documents with low probability is considered a confidential term. We first build language models for the confidential and non-confidential documents of the same cluster, denoted by cVM (confidential vector model) and ncVM (non-confidential vector model). Then the confidentiality score of a term can be represented by the ratio of its probability in the confidential documents to its probability in the non-confidential documents, as shown in (1), where P_cVM(t) and P_ncVM(t) denote the probability of term t in the confidential and non-confidential language models, respectively.

∀t ∈ cVM: score += P_cVM(t) / P_ncVM(t)    (1)
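A minimal sketch of the language models and the ratio score in (1), in pure Python. The maximum-likelihood unigram estimate and the `floor` guard against division by zero are assumptions; the paper does not specify a smoothing scheme.

```python
from collections import Counter

def language_model(docs):
    """Maximum-likelihood unigram model over tokenized documents
    (a plausible reading of cVM/ncVM; no smoothing is applied)."""
    counts = Counter(t for doc in docs for t in doc)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def confidentiality_score(term, cvm, ncvm, floor=1e-6):
    """Eq. (1): ratio of the term's probability under cVM to its
    probability under ncVM; `floor` is an assumed guard for terms
    unseen in one of the models."""
    return cvm.get(term, floor) / max(ncvm.get(term, floor), floor)
```

A term whose `confidentiality_score` exceeds 1 is kept as a confidential term, matching the rule stated below.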

However, there may exist the following problem: if a cluster includes only a few non-confidential documents, or possibly none at all, its language model cannot fully represent the non-confidential documents in it. The solution we propose follows an expanding manner: we first predefine the minimal similarity threshold TR_min and iteratively expand the ncVM to include more clusters. TR is referred to as the cosine-similarity threshold between the cluster with few non-confidential documents and its expanding clusters. Note that not all clusters need to be expanded. After each iteration we lower the value of TR; as long as TR is greater than TR_min, the non-confidential documents of the expanding clusters are included to recalculate the scores of the terms in the original cluster.

When the adjacent clusters are included and the scores of the confidential terms are recalculated, each term whose score is greater than 1 is considered a confidential term, which means the term is more likely to appear in confidential documents than in non-confidential ones. After this phase, the set of confidential terms CT is obtained.

3.2.2. Analyzing the Context of Confidential Terms. After confidential-term detection, we further analyze the context of the confidential terms. Apparently, a term is more likely to be considered confidential if it appears in similar contexts in other confidential documents. Conversely, if the context of a confidential term frequently appears in non-confidential documents, the probability of the confidential term being part of confidential contents is lower.

As a predefined parameter, the context span η determines the number of terms that precede and follow a confidential term. A context span with a high value increases the computational cost, while, conversely, a context span with a low value cannot provide adequate context information for the confidential terms. Experimental results show that η = 10 tends to be the optimal value of the context span in our experiments, which means that the context of a confidential term consists of the five terms preceding it and the five terms following it. Note that only the context of the confidential terms in confidential documents needs to be taken into account.

The probabilities of a confidential term together with its context appearing in confidential documents and in non-confidential documents, denoted by P_c(key, context_key) and P_nc(key, context_key), are calculated separately. If the former is higher than the latter, the corresponding confidential contents can be well represented by the confidential term with its context. P_c(key, context_key) is defined as the number of confidential documents in which the confidential term appears with its context, divided by the number of confidential documents in which the confidential term appears at all. And P_nc(key, context_key) is defined as the number of non-confidential documents in which the confidential term appears with its context, divided by the number of non-confidential documents in which the confidential term appears at all.
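The document-level probability just defined can be sketched as follows; the tokenized-document representation and the window arithmetic are illustrative assumptions, with `span=10` mirroring the optimal value reported above.

```python
def context_probability(docs, key, ctx, span=10):
    """Fraction of documents in which `ctx` occurs within span//2 terms
    of an occurrence of `key`, among the documents containing `key` at
    all. `docs` is a list of token lists; pass confidential documents to
    get P_c and non-confidential ones to get P_nc."""
    half = span // 2
    with_key = with_both = 0
    for tokens in docs:
        positions = [i for i, t in enumerate(tokens) if t == key]
        if not positions:
            continue  # documents without the confidential term are ignored
        with_key += 1
        if any(ctx in tokens[max(0, i - half):i + half + 1] for i in positions):
            with_both += 1
    return with_both / with_key if with_key else 0.0
```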

As mentioned above, we predefine the minimal cosine-similarity threshold TR_min and iteratively expand to include more clusters. TR is referred to as the cosine-similarity threshold between the cluster with few non-confidential documents and its expanding cluster. After each iteration we lower the value of TR at a certain predefined rate μ, namely TR = μ·TR. As long as TR is greater than TR_min, the non-confidential documents of the expanding cluster are included to recalculate the scores of the context terms in the original cluster. By including more adjacent clusters, we can accurately estimate which terms are most likely to indicate the confidentiality of a document.
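The expansion loop described above can be sketched as below; the `similarity` callback and the default decay rate are illustrative assumptions, since the paper fixes neither concrete values nor the loop shape beyond TR = μ·TR.

```python
def expand_clusters(cluster, others, similarity, tr_init, tr_min, mu=0.9):
    """While TR > TR_min, absorb every cluster whose similarity to
    `cluster` exceeds TR, then decay TR by the predefined rate mu.
    `similarity(a, b)` is a caller-supplied cosine measure between
    cluster centroids."""
    included = []
    tr = tr_init
    while tr > tr_min:
        for other in others:
            if other not in included and similarity(cluster, other) > tr:
                included.append(other)
        tr *= mu  # TR = mu * TR
    return included
```

Lowering TR gradually admits the nearest clusters first, so the non-confidential language model grows outward from the most topically similar material.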

The score of each context term is calculated by subtracting the probability of the context term appearing with the confidential term in non-confidential documents from the probability of them appearing together in confidential ones, as shown in (2).

The reason for employing subtraction rather than division is to avoid large fluctuations in the values of the context terms. When employing division, even a single document can dramatically change the probabilities, as only the documents including confidential terms are taken into account.

score += P_c(key, context_key) − P_nc(key, context_key)    (2)

We iteratively expand to include more clusters. After each iteration we lower the value of TR until TR is less than TR_min, and the score of each context term is calculated as shown in (3), in which n_clu denotes the number of clusters involved.

score += (1 / n_clu) · (P_c(key, context_key) − P_nc(key, context_key))    (3)

After this phase, the set of context terms with their scores, ContextT, is obtained. For each confidential term, the context terms whose scores are positive are more likely to appear with the confidential term in confidential documents.

3.2.3. Creating the Graph Representation. After the operations described in the previous sections, the confidential terms and their context can be easily represented as nodes and connected according to their interrelations. As shown in Figure 2, for each cluster a set of confidential terms and a set of context terms are obtained after the training phase, and the confidential terms and their context terms are represented as confidential nodes and context nodes, respectively. Confidential nodes are connected as long as there exists at least one common context node between them.

3.3. Pruning Nodes of the Graph. Because the scores of the confidential terms and their context terms are based on statistics, there may exist the occasional case of a non-confidential term with a high score because of term abuse. In the pruning phase, we employ the term reduction method of rough set theory to remove the redundant nodes in the graph.

With the information of the confidential and non-confidential documents, we evaluate the importance of the nodes in the graph of each cluster. A node in the graph can be pruned only if the removal of the term represented by the node does not influence the results of identifying the confidential documents in the cluster. As shown in (4), Impt(t_i) denotes the importance measure of term t_i, which is represented as node n_i in the graph. Pos_G(C) denotes the portion of the confidential documents set C that can be identified correctly by graph G. Similarly, Pos_{G−n_i}(C) denotes the portion of C that can be identified correctly by graph G−n_i, i.e., graph G with node n_i removed. The details of the pruning procedure are presented in Algorithm 2.

Impt(t_i) = (|Pos_G(C)| − |Pos_{G−n_i}(C)|) / |C|    (4)
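Equation (4) can be sketched directly. The `detect(terms, doc)` predicate is an assumed stand-in for matching a document against the cluster graph, since the full graph-matching routine is described only in the detection phase.

```python
def importance(term, graph_terms, confidential_docs, detect):
    """Impt(t_i) from Eq. (4): the drop in the fraction of confidential
    documents still identified when the term's node is removed from the
    graph. A score near zero marks a redundant node, a candidate for
    pruning."""
    pos_full = sum(detect(graph_terms, d) for d in confidential_docs)
    pos_pruned = sum(detect(graph_terms - {term}, d) for d in confidential_docs)
    return (pos_full - pos_pruned) / len(confidential_docs)
```

With a simple membership-based `detect`, a term whose documents are all covered by other graph terms gets importance 0 and would be pruned under any positive threshold.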


Input: C - confidential documents set; N - non-confidential documents set; CR - clusters set, each with its centroid and corresponding graph; ContextT - the set of context terms.

Output: CT′ - the updated set of confidential terms in clusters; ContextT′ - the updated set of context terms.

(1) irr_terms ← CT ∪ ContextT
(2) for (each t in irr_terms)
(3)   if (Impt(t) = (|Pos_T(C_cr ∪ N_cr)| − |Pos_{T−t}(C_cr ∪ N_cr)|) / |cr| < Impt_threshold)
(4)     irr_terms = irr_terms − t
(5) for (each t in irr_terms)
(6)   CT′ ← update the confidential terms in CT
(7)   ContextT′ ← update the context terms in ContextT

Algorithm 2: Pruning.

Figure 2: An example of confidential nodes connected through their context nodes. Two confidential terms (with scores 10 and 5) are linked through three shared context terms, whose edge scores range from −0.3 to 0.6.

3.4. Detection Phase. Obviously, a confidential document without any modification is easy to detect according to its confidential terms. However, the confidential documents that are rephrased, or partitioned into portions and further concealed in different non-confidential documents, as most plagiarists do, can hardly be detected. Once confidential-content detection fails, data leakage or copyright infringement is likely to follow.

In the detection phase, we employ the CBDLP model to deal with three scenarios that could possibly happen, described as follows.

(i) Each confidential document is detected as a whole.

(ii) Each confidential document is divided into portions and embedded in non-confidential ones.

(iii) The confidential terms in confidential documents are rephrased completely.

The detection method we employ includes three steps, as shown in Figure 3, which are described as follows.

(1) Classify the documents to be tested into the corresponding clusters.

(2) Identify the confidential terms and their contextterms according to the graphs of the correspondingclusters

(3) Calculate the confidentiality scores of the documents and conclude whether each document is confidential or not.
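The second and third steps can be sketched as below. The combination rule, adding each matched confidential term's score plus the scores of its context terms found within the span, is a simplified assumption; the paper specifies only that a per-document score is compared against a threshold.

```python
def document_score(tokens, term_scores, context_scores, span=10):
    """Score a tokenized test document against one cluster's graph.
    `term_scores` maps confidential terms to their scores; for each
    confidential term, `context_scores` maps its context terms to the
    (possibly negative) scores learned in training."""
    half = span // 2
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok not in term_scores:
            continue
        score += term_scores[tok]
        window = tokens[max(0, i - half):i + half + 1]
        score += sum(s for ctx, s in context_scores.get(tok, {}).items()
                     if ctx in window)
    return score

def is_confidential(tokens, term_scores, context_scores, threshold):
    """Step (3): the verdict is a simple threshold on the score."""
    return document_score(tokens, term_scores, context_scores) >= threshold
```

Because context terms can carry negative scores, a confidential term surrounded by non-confidential context drags the document score down, which is what lets the model tolerate confidential terms used innocently.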

The security model, which combines the training phase and the detection phase, is shown in Figure 4.

4. Experiments

In this section we evaluate the performance of CBDLP on the Reuters-21578 dataset. As the testing dataset, Reuters-21578 consists of 21,578 pieces of news distributed by Reuters in 1987, saved in 22 files. The Reuters-21578 dataset is manually classified into five categories, each of which can be subdivided into a different number of subcategories. For example, the economy news includes the inventories subset, the gold subset, and the money-supply subset.

4.1. Performance Experiments. In the experiments, we present the data leakage prevention method based on the CBDLP model and also a modified model without the pruning step, denoted CBDLP-Pr. Since SVM has been proved to be an excellent classifier with high accuracy, and CoBAn performs well in the scenarios where confidential contents are embedded in non-confidential documents or rephrased, we compare the performance of CBDLP, CBDLP-Pr, SVM, and CoBAn. We evaluate the performance of the methods with the true positive rate (TPR) and the false positive rate (FPR); our goal is to maximize the TPR and minimize the FPR concurrently.
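The two evaluation metrics are standard and can be computed as follows (a small helper for clarity, not code from the paper).

```python
def tpr_fpr(predictions, labels):
    """True-positive rate and false-positive rate for a batch of
    verdicts; `labels` is the ground truth (True = confidential).
    TPR = TP / P, FPR = FP / N."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    pos = sum(1 for l in labels if l)
    neg = len(labels) - pos
    return (tp / pos if pos else 0.0, fp / neg if neg else 0.0)
```

Sweeping the confidentiality threshold and plotting these two rates against each other yields the ROC-style curves shown in Figures 5-7.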


Figure 3: The detection of test documents. In the first step, each test document is matched to the most similar cluster; in the second step, the confidential terms and their context terms are matched against the cluster's graph; in the third step, scores are calculated for the test documents and the confidentiality of each document is determined.

Figure 4: The security model. The training phase clusters the confidential and non-confidential documents and builds a graph of confidential terms and context terms; the detection phase matches test documents to clusters, matches the confidential terms and their context terms, computes scores for the test documents, and determines their confidentiality.

We conduct experiments on the three scenarios described above. For the first scenario, we select the news of "earn" as the carrier for confidential contents and mix it with news from the other economy subsets as the training dataset and the testing dataset, respectively. For the second scenario, we extract contents from the documents of the "earn" subset and embed them in documents from other subsets; the embedded portions are to be detected as confidential contents. For the third scenario, we manually rephrase the contents of the documents of the "earn" subset and embed them in documents from other subsets.

4.1.1. Confidential Documents as a Whole. The experimental result of the first scenario is presented in Figure 5. As shown in


Figure 5: Performance of detecting confidential documents as a whole (TPR versus FPR for CBDLP, CBDLP-Pr, CoBAn, and SVM).

Figure 6: Performance of detecting the confidential contents embedded in non-confidential documents (TPR versus FPR for CBDLP, CBDLP-Pr, CoBAn, and SVM).

Figure 5, when dealing with the scenario where confidential documents are considered as a whole, the performance of the four detection algorithms does not differ much. In spite of that, CBDLP and CBDLP-Pr still perform slightly better than CoBAn and SVM, which can be explained by the fact that the performance of CoBAn is partly limited by k-means, which cannot deal with clusters of various shapes effectively, and that SVM only focuses on the confidential terms and ignores the context terms. In this scenario, since the documents containing confidential terms are explicitly detected as confidential, the performance of the four methods does not differ much.

4.1.2. Confidential Portions Embedded in Non-confidential Documents. The result of the second scenario is presented in Figure 6. As shown in Figure 6, when dealing with the scenario where the confidential portions are embedded in non-confidential documents, CBDLP, CBDLP-Pr, and CoBAn

Figure 7: Performance of detecting the rephrased confidential contents embedded in non-confidential documents (TPR versus FPR for CBDLP, CBDLP-Pr, CoBAn, and SVM).

perform better than SVM, which can be explained by SVM being deceived in this scenario due to its statistical nature. As expected, the performance of CBDLP is slightly better than that of CBDLP-Pr and CoBAn, thanks to its pruning step, which removes the redundant nodes in the graph that might deteriorate the detection results.

In this scenario, confidential portions are extracted from the documents defined as confidential and then embedded in non-confidential documents whose length is at least ten times that of the extracted portions. Due to its statistical nature, most documents containing confidential portions are incorrectly detected as non-confidential by SVM, which results in a dramatic decline in the accuracy of SVM. Unlike SVM, CBDLP, CBDLP-Pr, and CoBAn take the confidential terms together with their context into account, and most non-confidential documents containing embedded confidential portions are detected as confidential.

4.1.3. Rephrased Confidential Contents in Non-confidential Documents. The result of the third scenario is presented in Figure 7. As shown in Figure 7, when dealing with the scenario where the confidential contents are rephrased and embedded in non-confidential documents, the performance of SVM deteriorates considerably due to its statistical nature. Since the rephrased contents do not deviate much from their original meaning, CBDLP, CBDLP-Pr, and CoBAn perform well. In addition, the performance of CBDLP is better than that of CBDLP-Pr and CoBAn, thanks to its pruning step, which removes the redundant nodes in the graph.

In this scenario, the rephrased confidential terms are embedded in non-confidential documents, which confuses SVM greatly, and most documents containing rephrased confidential contents are incorrectly detected as non-confidential. Unlike SVM, with the context of the confidential terms taken into account, CoBAn detects most documents containing confidential contents; however, the accuracy of CoBAn is partially influenced by the cluster's terms graph, which depends on the quality of the clusters generated by k-means. In contrast, CBDLP clusters documents with DBSCAN, which


Figure 8: Running time (in seconds) of SVM, CoBAn, CBDLP, and CBDLP-Pr as the number of documents grows from 0 to 5,000.

improves the quality of the clusters and of the cluster's terms graph; meanwhile, the pruning method removes the redundant nodes in the graph and further improves the performance of CBDLP.

4.2. Running Time Comparisons. In this experiment, we mixed non-confidential documents with the three types of confidential documents: whole confidential documents, confidential contents embedded in non-confidential documents, and rephrased confidential contents embedded in non-confidential documents. The experiment is conducted using 10-fold cross-validation. To compare the running time of CBDLP, CBDLP-Pr, CoBAn, and SVM, we run the experiment on datasets of different sizes. The result is shown in Figure 8: the running times of the training and testing phases are exhibited as a line graph, in which the running times of CBDLP, CBDLP-Pr, CoBAn, and SVM all increase as more documents are added to the dataset. The additional steps of CBDLP, CBDLP-Pr, and CoBAn result in running times up to an order of magnitude longer than SVM needs; however, CBDLP performs much better than SVM in detection.

5. Conclusion and Future Work

In this paper, we present a new method for data leakage prevention based on the CBDLP model, which has the following advantages.

(1) It clusters the documents with DBSCAN and the cosine measure, which have been verified to be effective.

(2) It represents confidential terms and their context terms in a graph.

(3) It presents a pruning method based on the attribute reduction method of rough set theory.

Up to now, some designated commercial DLP solutions can reduce the risk of most accidental leakage; however, they cannot provide sufficient protection against intentional leakage. And the other DLP solutions, such as firewalls, IDS, antimalware software, and management policies, which can provide assistance in detecting intrusions or malicious software and enforce policies to protect data, still do not prevent intentional leaks perfectly. To the best of our knowledge, there are two main future research topics in DLP: data leakage from mobile devices and accidental data leakage by insiders.

Since accidental data leakage may be part of a larger attack, in which its role is mainly to activate an advanced persistent threat inside the organization, it is expected to remain one of the most challenging research topics. Our future work will focus on accidental data leakage in two directions: first, improving the efficiency and effectiveness of CBDLP in confidential-content detection; second, adjusting the model dynamically according to changes in the training dataset.

Data Availability

All data generated or analysed during this study are included in this published article.

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

The research is supported by the National Natural Science Foundation of China under Grants nos. 61871140 and 61572153.

References

[1] A Goyal F Bonchi and V S Lakshmanan ldquoOn minimizingbudget and time in influence propagation over social networksrdquoSocial Network Analysis and Mining pp 1ndash14 2012

[2] C Aggarwal Social Network Data Analytics Springer BerlinGermany 2011

[3] ldquoInformation week global security surveyrdquo Information Week2004

[4] D Alassi and R Alhajj ldquoEffectiveness of template detectionon noise reduction and websites summarizationrdquo InformationSciences vol 219 pp 41ndash72 2013

[5] D Holmes Using language models for information retrieval[PhD Thesis] Center for telematies and information technol-ogy University of Twente 2001

[6] R Bohme ldquoSecurity metrics and security investment modelsrdquoin Proceedings of the 5th international conference on advances ininformation and computer security 2010

[7] K R Rao and P Yip Discrete Cosine Transform AlgorithmsAdvantages Applications Academic Press Boston Mass USA1990

[8] S M Katz ldquoEstimation of probabilities from sparse data forthe language model component of a speech recognizerrdquo IEEETransactions on Signal Processing vol 35 no 3 pp 400-4011987

10 Wireless Communications and Mobile Computing

[9] R Jin L Si A G Hauptmann and J Callan ldquoLanguage modelfor IR using collection informationrdquo in Proceedings of the the25th annual international ACM SIGIR conference pp 419-4202002

[10] W W Cohen ldquoLearning rules that classify e-mailrdquo in Proceed-ings of the AAAI Spring Symposium on Machine Learning inInformation Access pp 18ndash25 1996

[11] G Salton A Wong and C S Yang ldquoA vector space model forautomatic indexingrdquo Communications of the ACM vol 18 no11 pp 613ndash620 1975

[12] S Deerwester S T Dumais G W Furnas T K Landauer andR Harshman ldquoIndexing by latent semantic analysisrdquo Journal ofthe Association for Information Science and Technology vol 41no 6 pp 391ndash407 1990

[13] C Ordonez ldquoClustering binary data streams with K-meansrdquoin Proceedings of the 8th ACM SIGMODWorkshop on ResearchIssues in Data Mining and Knowledge Discovery DMKD rsquo03 pp12ndash19 USA June 2003

[14] B Babcock M Datar R Motwani and L OrsquoCallaghan ldquoMain-taining variance and k-medians over data stream windowsrdquoin Proceedings of the Twenty second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems PODS2003 pp 234ndash243 June 2003

[15] T Zhang R Ramakrishnan and M Livny ldquoBIRCH a newdata clustering algorithm and its applicationsrdquoDataMining andKnowledge Discovery vol 1 no 2 pp 141ndash182 1997

[16] S Guha R Rastogi and K Shim ldquoCure an efficient clusteringalgorithm for large databasesrdquo in Proceedings of 1998 ACMSIGMOD International Conference Management of Data pp73ndash84 1998

[17] J W Han and M Kamber Data Mining Concepts and Tech-niques Morgan Kaufmann 2006

[18] A Hinneburg and D A Keim ldquoOptimal grid-clusteringtowards breaking the curse of dimensionality in high-dimen-sional clusteringrdquo in Proceedings of the 25th VLDB Conferencepp 506ndash517 1999

[19] WWang J Yang andRMuntz ldquoSting a statistical informationgrid approach to spatial data miningrdquo in Proceedings of the 23rdVLDB Conference pp 186ndash195 1997

[20] R Agrawal J Gehrke and D Gunopulos ldquoAutomatic sub-space clustering of high dimensional data for data miningapplicationsrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data pp 94ndash105 1998

[21] A McCallum K Nigam and L H Ungar ldquoEfficient clusteringof high-dimensional data sets with application to referencematchingrdquo in Proceedings of the KDD 2000 pp 169ndash178 ACMNew York NY USA 2000

[22] R Hyde P Angelov and A R MacKenzie ldquoFully onlineclustering of evolving data streams into arbitrarily shapedclustersrdquo Information Sciences vol 382-383 pp 96ndash114 2017

[23] F Jiang Y Fu B B Gupta et al ldquoDeep learning based multi-channel intelligent attack detection for data securityrdquo IEEETransactions on Sustainable Computing 2018

[24] R Agrawal and R Srikant ldquoPrivacy-preserving data miningrdquoin Proceedings of the the 2000 ACM SIGMOD internationalconference pp 439ndash450 May 2000

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.

[26] G. Salton, "Automatic text processing: the transformation, analysis, and retrieval of information by computer," Tech. Rep., Addison-Wesley Inc., 1989.

[27] M. N. Islam, M. Seera, and C. K. Loo, "A robust incremental clustering-based facial feature tracking," Applied Soft Computing, vol. 53, pp. 34–44, 2017.

[28] A. Shabtai and Y. Elovici, A Survey of Data Leakage Detection and Prevention Solutions, Springer, Berlin, Germany, 2012.

[29] S. R. Kalidindi, S. R. Niezgoda, G. Landi, S. Vachhani, and T. Fast, "A novel framework for building materials knowledge systems," Computers, Materials and Continua, vol. 17, no. 2, pp. 103–125, 2010.

[30] A. J. Wang, "Information security models and metrics," in Proceedings of the 43rd Annual Southeast Regional Conference, ACMSE 43, pp. 178–184, 2005.

[31] P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301–312, 2002.

[32] M. Dash, K. Choi, P. Scheuermann, and H. Liu, "Feature selection for clustering – a filter solution," in Proceedings of the 2nd IEEE International Conference on Data Mining, ICDM '02, pp. 115–122, December 2002.

[33] C.-S. Liu, "An analytical method for computing the one-dimensional backward wave problem," Computers, Materials and Continua, vol. 13, no. 3, pp. 219–234, 2010.

[34] G. Katz, Y. Elovici, and B. Shapira, "CoBAn: a context based model for data leakage prevention," Information Sciences, vol. 262, pp. 137–158, 2014.

[35] X. Huang, Y. Lu, D. Li, and M. Ma, "A novel mechanism for fast detection of transformed data leakage," IEEE Access, vol. 1, pp. 1–11, 2018.

[36] M. Porter, "The Porter stemming algorithm," 2006.

[37] F. Pacheco, M. Cerrada, R.-V. Sanchez, D. Cabrera, C. Li, and J. Valente de Oliveira, "Attribute clustering using rough set theory for feature selection in fault severity classification of rotating machinery," Expert Systems with Applications, vol. 71, pp. 69–86, 2017.

[38] "Information week global security survey," Information Week, 2004.

[39] C. M. Praba, "A technical review on data leakage detection and prevention approaches," Journal of Network Communications and Emerging Technologies (JNCET), 2017.

[40] F. Ullah, M. Edwards, R. Ramdhany, R. Chitchyan, M. A. Babar, and A. Rashid, "Data exfiltration: a review of external attack vectors and countermeasures," Journal of Network and Computer Applications, 2017.

[41] K. Thomas, F. Li, A. Zand et al., "Data breaches, phishing, or malware: understanding the risks of stolen credentials," in Proceedings of the 24th ACM SIGSAC Conference on Computer and Communications Security, CCS '17, pp. 1421–1434, November 2017.

[42] J. DeBlasio, S. Savage, G. M. Voelker, and A. C. Snoeren, "Tripwire: inferring internet site compromise," in Proceedings of the IMC '17, pp. 1–14, 2017.

[43] W. Xu, S. Xiang, and V. Sachnev, "A cryptograph domain image retrieval method based on Paillier homomorphic block encryption," Computers, Materials and Continua, pp. 1–11, 2018.

[44] J. Cui, Y. Zhang, Z. Cai, A. Liu, and Y. Li, "Securing display path for security-sensitive applications on mobile devices," Computers, Materials and Continua, vol. 55, no. 1, pp. 17–35, 2018.

Wireless Communications and Mobile Computing 11

[45] D. Wang, W. Li, and P. Wang, "Measuring two-factor authentication schemes for real-time data access in industrial wireless sensor networks," IEEE Transactions on Industrial Informatics, pp. 1–12, 2018.

[46] D. Wang, H. Cheng, P. Wang, J. Yan, and X. Huang, "A security analysis of honeywords," in Proceedings of the Network and Distributed Systems Security (NDSS) Symposium, pp. 18–21, 2018.



solve the problem of transformed data leakage by mapping it to the dimension of weighted graphs [35].

In this paper we employ a hybrid approach which combines graph and vector representations. When clustering documents, we employ DBSCAN with the cosine measure. When representing the confidential textual content and its context in each cluster, a graph which includes only confidential and context nodes is created for each cluster.

2.3. Redundancy Information Reduction. When dealing with text-related tasks, redundant information is generally useless; even worse, it might decrease the efficiency of task execution. There exist many representative redundancy reduction methods, such as PCA [36], SVD [37], LSI [38], etc. The principle of PCA is to transform multiple attributes into a few principal attributes which reflect the information of the original data effectively. However, the complexity of PCA is generally high, and part of the original information might be lost. SVD has almost the same advantages and disadvantages as PCA. LSI represents textual data with latent topics that consist of specific terms, but in most cases the influence of specific terms is ignored. In this paper, the reduction method from rough set theory, as shown in Section 3, is employed and partly recomposed to meet our requirements.
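To make the rough-set flavour of reduction concrete, a minimal sketch follows. It represents each document as a set of terms and drops an attribute whenever the remaining attributes still separate confidential from nonconfidential documents. The function names `discerns` and `reduce_attrs` are illustrative, not from the paper, and real reduct computation is more involved:

```python
def discerns(docs, labels, attrs):
    """True if the attribute subset still separates every pair of
    documents that carry different labels (rough-set consistency)."""
    seen = {}
    for d, y in zip(docs, labels):
        key = frozenset(d & attrs)
        if seen.setdefault(key, y) != y:
            return False  # two docs look identical but disagree on the label
    return True

def reduce_attrs(docs, labels, attrs):
    """Greedy rough-set-style reduct: drop each attribute whose removal
    keeps the confidential/nonconfidential distinction intact."""
    kept = set(attrs)
    for a in sorted(attrs):
        trial = kept - {a}
        if discerns(docs, labels, trial):
            kept = trial
    return kept
```

For instance, with documents that differ only in stop words, the reduct keeps the single discriminating term and discards the rest.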

2.4. Data Leakage Prevention. As the number of leakage incidents and the cost they inflict continue to increase, the threat that data leakage poses to companies and organizations has become more and more serious [39–41]. Considering the enormity of the problem, various models and approaches have been developed for data leakage prevention. Tripwire is a recent prototype system proposed by DeBlasio et al. in 2017; it registers honey accounts with individual third-party websites, so that access to an email account provides indirect evidence of credential theft at the corresponding website [42]. However, Tripwire is more suitable for forensics than for confidential data leakage prevention. In 2018, Xu et al. proposed a promising image retrieval method in the ciphertext domain, based on block image encryption with the Paillier homomorphic cryptosystem, which can manage and retrieve ciphertext data effectively [43]. Nevertheless, that method focuses on data encryption rather than data detection. Since smart devices based on ARM processors have become an attractive target of cyberattacks, Cui et al. presented SecDisplay, a scheme for trusted display service, in 2018; it protects displayed sensitive data from being stolen or tampered with surreptitiously by a compromised OS [44]. But it pays less attention to the scenarios of intentional or accidental data leakage by insiders. According to the work of Wang et al., many authentication schemes have been proposed to secure data access in industrial wireless sensor networks, yet they do not work well [45]. In addition, Wang et al. developed a series of practical experiments to evaluate the security of four main suggested honeyword-generation methods and further proved that they all fail to provide the expected security [46].

Figure 1: The DBSCAN method. Training documents undergo stemming and stop-words removal and then DBSCAN clustering, producing clusters consisting of confidential and nonconfidential documents.

3. CBDLP Model

CBDLP consists of a training phase and a detection phase. The training phase can be further divided into three steps: a clustering step, a graph building step, and a pruning step. During the training phase, the training documents are first classified into different clusters; then each cluster is represented by a graph; finally, the nodes of each graph are pruned in terms of their importance. During the detection phase, documents are matched to the graphs of the clusters respectively and their confidentiality scores are calculated. A document is considered confidential only if its confidentiality score exceeds a predefined threshold. The details of the CBDLP training phase are presented in Algorithm 1.

3.1. Clustering Documents with DBSCAN. In the first step, we apply stemming and stop-words removal to all documents in the training set and transform the processed documents into vectors of weighted terms. After applying DBSCAN with the cosine measure to the vectors representing the training documents, each resulting cluster represents an independent topic of the training documents and may contain both confidential and nonconfidential documents, as shown in Figure 1.

The procedure of DBSCAN is described as follows

Step 1. A data set is given with n documents, ε as the threshold of minimal similarity between documents of the same cluster, and MinPts as the threshold of the minimal number of documents in a cluster.


Input: C – confidential documents set; N – nonconfidential documents set; TR_min – the minimum similarity threshold

Output: CR – the set of clusters, each with its centroid and corresponding graph; CT – the set of confidential terms in clusters; ContextT – the set of context terms

(1) T ← C ∪ N
(2) CR ← DBSCAN(T)  // the result of clustering T is saved in CR
(3) Initialize CT[CR]  // the scores of confidential terms are saved in CT
(4) Initialize ContextT  // the context terms set of each confidential term is saved in ContextT
(5) for each cr in CR
(6)   calculate the similarity between cr and the other clusters
(7)   create a language model for cr and calculate the score of each confidential term
(8)   TR ← initial threshold of cluster similarity
(9)   while TR > TR_min
(10)    C_temp ← all clusters whose similarity to cr > TR
(11)    create a language model for the documents of C_temp
(12)    CT[cr] ← update the scores of the confidential terms based on the new language model
(13)    for each confidential term ct in cr
(14)      detect the occurrences of ct in C ∪ N
(15)      P_confidential_doc(term, ct) ← for each context term of ct, the probability that both ct and the context term appear in confidential documents
(16)      P_non_confidential_doc(term, ct) ← for each context term of ct, the probability that both ct and the context term appear in nonconfidential documents
(17)      ContextT ← calculate P_confidential_doc(term, ct) − P_non_confidential_doc(term, ct) for each confidential term ct
(18)      detect all clusters whose similarity is greater than TR and detect the occurrences of all terms in those clusters
(19)      ContextT ← update the probabilities of the context terms that appear in the scopes of different confidential terms
(20)    end for
(21)    reduce the value of TR
(22)  end while
(23) end for

Algorithm 1: CBDLP.

Step 2. Start with an arbitrary document that has not been visited and find all the documents in its ε-neighborhood. If the number of documents in the neighborhood exceeds MinPts, incorporate the documents into the same cluster and label them.

Step 3. If not all documents have been visited, start from another arbitrary document which has not been visited.

Step 4. Mark the documents which are not labelled as noise.
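The four steps above amount to plain DBSCAN over cosine similarity. A minimal pure-Python sketch follows; note that, as in the paper, ε is treated as a minimal similarity rather than a maximal distance, and documents are represented as sparse term-weight dictionaries:

```python
import math

def cosine(u, v):
    # Cosine similarity between two sparse term-weight dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def dbscan(docs, eps, min_pts):
    """docs: list of term->weight dicts; eps: minimal cosine similarity
    for two documents to be neighbours; min_pts: minimal cluster size.
    Returns one label per document; -1 marks noise."""
    labels = [None] * len(docs)
    cluster = -1
    for i in range(len(docs)):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(len(docs)) if cosine(docs[i], docs[j]) >= eps]
        if len(neigh) < min_pts:
            labels[i] = -1  # noise, unless later reached from a core point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neigh)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: adopt, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jneigh = [k for k in range(len(docs)) if cosine(docs[j], docs[k]) >= eps]
            if len(jneigh) >= min_pts:
                seeds.extend(jneigh)  # j is a core point: keep expanding
    return labels
```

Unlike k-means, no cluster count is fixed in advance, and outlier documents are explicitly marked as noise, which is the property the paper relies on later.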

3.2. Representing Clusters with Graphs. In this step, the confidential contents in all clusters, which include not only the confidential terms but also their context, need to be represented by graphs. The procedure of creating the graph representation for the clusters which include confidential contents is as follows:

(1) Detect the confidential terms, provided by domain experts or inferred from the key terms of the training documents.

(2) Analyze the context of each confidential term.

(3) Create the graph representation for the confidential terms and their contexts at the cluster level.

3.2.1. Detect Confidential Terms. In general, a term which appears in confidential documents with high probability and in nonconfidential documents with low probability is considered a confidential term. We first build language models for the confidential and nonconfidential documents of the same cluster, denoted by cVM (confidential vector model) and ncVM (nonconfidential vector model). Then the confidentiality score of a term can be represented by the ratio of its probability in the confidential documents to that in the nonconfidential documents, as shown in (1), where P_cVM(t) and P_ncVM(t) denote the probability of term t in the confidential and nonconfidential language models, respectively:

∀t ∈ cVM: score += P_cVM(t) / P_ncVM(t)  (1)
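A minimal sketch of this scoring, assuming simple unigram language models with add-one smoothing (the smoothing is our assumption, not specified by the paper, but some smoothing is needed to avoid division by zero for terms absent from the nonconfidential side):

```python
from collections import Counter

def term_scores(conf_docs, nonconf_docs, smoothing=1.0):
    """Score every term by P_cVM(t) / P_ncVM(t) under smoothed unigram
    language models; scores greater than 1 suggest confidential terms."""
    c_counts = Counter(t for d in conf_docs for t in d)
    n_counts = Counter(t for d in nonconf_docs for t in d)
    vocab = set(c_counts) | set(n_counts)
    c_total = sum(c_counts.values()) + smoothing * len(vocab)
    n_total = sum(n_counts.values()) + smoothing * len(vocab)
    return {t: ((c_counts[t] + smoothing) / c_total)
               / ((n_counts[t] + smoothing) / n_total)
            for t in vocab}
```

Terms scoring above 1 are the candidates kept as confidential terms in CT.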

However, there may exist the following problem: if a cluster includes only a few nonconfidential documents, or possibly none at all, its language model cannot fully represent the nonconfidential documents in it. The solution we propose follows an expanding manner: we first predefine the minimal similarity threshold TR_min and iteratively expand the ncVM to include more clusters. TR is referred to as the similarity threshold of the cosine measure of the cluster with few nonconfidential documents. Note that not all clusters need to be expanded. After each iteration, we lower the value of TR. As long as TR is greater than TR_min, the nonconfidential documents of the expanding clusters are included to recalculate the scores of the terms in the original cluster.

When the adjacent clusters are included and the scores of confidential terms are recalculated, each term whose score is greater than 1 is considered a confidential term, which means the term is more likely to appear in confidential documents than in nonconfidential documents. After this phase, the set of confidential terms CT is obtained.

3.2.2. Analyze the Context of Confidential Terms. After confidential term detection, we further analyze the context of the confidential terms. Apparently, a term is more likely to be considered confidential if it appears in similar contexts in other confidential documents. Conversely, if the context of a confidential term frequently appears in nonconfidential documents, the probability of the confidential term being part of confidential contents is lower.

As a predefined parameter, the context span η determines the number of terms that precede and follow the confidential term. A context span with a high value might increase the computational cost, while a context span with a low value cannot provide adequate context information for the confidential terms. Experimental results show that η = 10 tends to be the optimal value of the context span in our experiments, which means that the context of a confidential term consists of the five terms preceding it and the five terms following it. Apparently, only the context of the confidential terms in confidential documents needs to be taken into account.
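The context span can be sketched as a simple windowing helper; `context_window` is an illustrative name, and the paper does not prescribe an implementation:

```python
def context_window(tokens, term, span=10):
    """Collect the context terms of `term`: span/2 tokens before and
    span/2 tokens after each of its occurrences (the paper finds
    span = 10, i.e. five terms on each side, to work best)."""
    half = span // 2
    ctx = set()
    for i, tok in enumerate(tokens):
        if tok == term:
            ctx.update(tokens[max(0, i - half):i])  # terms preceding the occurrence
            ctx.update(tokens[i + 1:i + 1 + half])  # terms following it
    ctx.discard(term)
    return ctx
```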

The probabilities of a confidential term together with its context appearing in confidential documents and in nonconfidential documents, denoted by P_c(key|context_key) and P_nc(key|context_key), are calculated separately. If the former is higher than the latter, the corresponding confidential contents can be well represented by the confidential term with its context. P_c(key|context_key) is defined as the number of confidential documents in which the confidential term with its context appears divided by the number of confidential documents in which only the confidential term appears. Analogously, P_nc(key|context_key) is defined as the number of nonconfidential documents in which the confidential term with its context appears divided by the number of nonconfidential documents in which only the confidential term appears.

As mentioned above, we predefine the similarity threshold of the minimum cosine measure TR_min and iteratively expand to include more clusters. TR is referred to as the similarity threshold of the cosine measure between the cluster with few nonconfidential documents and its expanding cluster. After each iteration, we lower the value of TR at a predefined rate μ, namely TR = μ·TR. As long as TR is greater than TR_min, the nonconfidential documents of the expanding cluster are included to recalculate the scores of the context terms in the original cluster. By including more adjacent clusters, we can accurately estimate which terms are most likely to indicate the confidentiality of a document.

By subtracting the probability of the appearance of each context term with the confidential term in nonconfidential documents from the probability of their appearance in confidential ones, the score of each context term is calculated, as shown in (2).

The reason for employing subtraction rather than division is to avoid large fluctuations in the values of the context terms. When employing division, even a single document can dramatically change the probabilities, as only the documents including confidential terms are taken into account.

score += P_c(key|context_key) − P_nc(key|context_key)  (2)

We iteratively expand to include more clusters. After each iteration we lower the value of TR until TR is less than TR_min, and the score of each context term is calculated as shown in (3), in which n_clu denotes the number of clusters involved.

score += (1/n_clu) · (P_c(key|context_key) − P_nc(key|context_key))  (3)

After this phase, the set of context terms with their scores, ContextT, is obtained. For each confidential term, the context terms whose scores are positive are more likely to appear with the confidential term in confidential documents.
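The document-level probabilities behind (2) can be sketched as follows, representing each document as a set of terms for brevity (an assumption of this sketch, not a requirement of the model):

```python
def context_term_score(conf_docs, nonconf_docs, key, ctx_term):
    """score = P_c(key|context_key) - P_nc(key|context_key), where each
    probability is the number of documents containing both `key` and
    `ctx_term` divided by the number of documents containing `key`
    (computed per corpus). Returns 0 when `key` never occurs."""
    def prob(docs):
        with_key = [d for d in docs if key in d]
        if not with_key:
            return 0.0
        both = sum(1 for d in with_key if ctx_term in d)
        return both / len(with_key)
    return prob(conf_docs) - prob(nonconf_docs)
```

A positive score marks a context term that co-occurs with the confidential term mostly in confidential documents, as the text above requires.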

3.2.3. Create the Graph Representation. After the operations described in the previous sections, the confidential terms and their context can be represented as nodes and connected together according to their interrelations. As shown in Figure 2, for each cluster a set of confidential terms and a set of context terms are obtained after the training phase; confidential terms and context terms are represented as confidential nodes and context nodes, respectively. Confidential nodes are connected together as long as there exists at least one common context node between them.

3.3. Pruning Nodes of the Graph. Because the scores of confidential terms and their context terms are based on statistics, there might be occasional cases of a nonconfidential term with a high score because of term abuse. In the pruning phase, we employ the method of term reduction from rough set theory to remove the redundant nodes in the graph.

Using the information of the confidential and nonconfidential documents, we evaluate the importance of the nodes in the graph of each cluster. A node in the graph can be pruned only if the removal of the term it represents does not influence the results of identifying the confidential documents in this cluster. In (4), Impt(t_i) denotes the importance measure of term t_i, which is represented as node n_i in the graph; Pos_G(C) denotes the portion of the confidential documents set C that can be identified correctly by graph G; similarly, Pos_{G−n_i}(C) denotes the portion of C that can be identified correctly by graph G−n_i, i.e., the graph with node n_i removed. The details of the pruning procedure are presented in Algorithm 2.

Impt(t_i) = (|Pos_G(C)| − |Pos_{G−n_i}(C)|) / |C|  (4)


Input: C – confidential documents set; N – nonconfidential documents set; CR – clusters set, each with its centroid and corresponding graph; ContextT – the set of context terms

Output: CT′ – the updated set of confidential terms in clusters; ContextT′ – the updated set of context terms

(1) irr_terms ← CT ∪ ContextT
(2) for each t in irr_terms
(3)   if Impt(t) = (|Pos_T(C_cr ∧ N_cr)| − |Pos_{T−t}(C_cr ∧ N_cr)|)/|cr| < Impt_threshold
(4)     irr_terms ← irr_terms − {t}
(5) for each t in irr_terms
(6)   CT′ ← update confidential terms in CT
(7)   ContextT′ ← update context terms in ContextT

Algorithm 2: Pruning.
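Algorithm 2's importance-based pruning, following Eq. (4), can be sketched as below. The `classify` function here is a toy stand-in for matching a document against the cluster graph (the real model scores terms and contexts), and all names are illustrative:

```python
def importance(term, terms, conf_docs, classify):
    """Impt(t) from Eq. (4): the drop in correctly identified confidential
    documents when `term` is removed, normalised by |C|."""
    pos_full = sum(1 for d in conf_docs if classify(d, terms))
    pos_reduced = sum(1 for d in conf_docs if classify(d, terms - {term}))
    return (pos_full - pos_reduced) / len(conf_docs)

def prune(terms, conf_docs, classify, threshold=0.0):
    # Keep only the terms whose removal actually hurts detection.
    return {t for t in terms
            if importance(t, terms, conf_docs, classify) > threshold}

def classify(doc_terms, graph_terms):
    # Toy stand-in for graph matching: flag a document as confidential
    # when it contains at least one term that is still in the graph.
    return bool(doc_terms & graph_terms)
```

A spuriously high-scoring term whose removal changes nothing (such as an abused nonconfidential term) gets importance 0 and is pruned.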

Figure 2: An example of confidential nodes connected through their context nodes. Two confidential terms (scores 10 and 5) share context terms 2 and 3; the context edges carry the scores 0.5, 0.4, −0.2, 0.6, and −0.3.

3.4. Detection Phase. Obviously, a confidential document without any modification is easy to detect according to its confidential terms. However, confidential documents which are rephrased, or partitioned into portions and concealed in different nonconfidential documents, as most plagiarizers often do, can hardly be detected. Once the detection of confidential contents fails, data leakage or copyright infringement is likely to follow.

In the detection phase, we employ the CBDLP model to deal with three scenarios that could possibly happen, described as follows:

(i) Each confidential document is detected as a whole.

(ii) Each confidential document is divided into portions and embedded in nonconfidential ones.

(iii) The confidential terms in the confidential documents are rephrased completely.

The detection method we employ includes three steps, as shown in Figure 3:

(1) Classify the documents to be tested into the corresponding clusters.

(2) Identify the confidential terms and their context terms according to the graphs of the corresponding clusters.

(3) Calculate the confidentiality scores of the documents and conclude whether each document is confidential or not.
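For a single cluster graph, the scoring in steps (2)–(3) can be sketched as follows. The flat `{confidential term: {context term: score}}` representation and the unit weight per matched confidential term are simplifying assumptions of this sketch:

```python
def doc_score(tokens, graph, span=10):
    """graph: {confidential_term: {context_term: score}} for one cluster.
    Each matched confidential term contributes a unit weight plus the
    scores of the context terms found within the span around it."""
    half = span // 2
    total = 0.0
    for i, tok in enumerate(tokens):
        ctx_scores = graph.get(tok)
        if ctx_scores is None:
            continue  # not a confidential term for this cluster
        window = set(tokens[max(0, i - half):i]) | set(tokens[i + 1:i + 1 + half])
        total += 1.0 + sum(s for c, s in ctx_scores.items() if c in window)
    return total

def is_confidential(tokens, graph, threshold):
    # Step (3): a document is confidential if its score reaches the threshold.
    return doc_score(tokens, graph) >= threshold
```

Context terms with negative scores lower the total, which is how a confidential term used in a clearly nonconfidential context avoids a false alarm.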

The security model, which combines the training phase and the detection phase, is shown in Figure 4.

4. Experiments

In this section, we evaluate the performance of CBDLP on the Reuters-21578 dataset. As the testing dataset, Reuters-21578 consists of 21578 pieces of news distributed by Reuters in 1987, saved in 22 files. The Reuters-21578 dataset is manually classified into five categories, each of which can be subdivided into different numbers of subcategories. For example, the news of economy includes the inventories subset, the gold subset, and the money-supply subset.

4.1. Performance Experiments. In the experiments, we evaluate the data leakage prevention method based on the CBDLP model and also a modified model without the pruning step, denoted CBDLP-Pr. Since SVM has been proved to be an excellent classifier with high accuracy, and CoBAn performs well in the scenario where confidential contents are embedded in nonconfidential documents or rephrased, we compare the performance of CBDLP, CBDLP-Pr, SVM, and CoBAn. We evaluate the performance of the methods with the true positive rate (TPR) and false positive rate (FPR), and our goal is to maximize TPR and minimize FPR concurrently.
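The two evaluation metrics are computed in the usual way (1 marks a confidential document, 0 a nonconfidential one):

```python
def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```

Sweeping the confidentiality threshold and plotting the resulting (FPR, TPR) pairs yields the curves shown in Figures 5–7.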


Figure 3: The detection of test documents. The first step matches the test documents to the most similar cluster; the second step matches the confidential terms and their context terms against the cluster's graph; the third step calculates scores for the test documents and determines the confidentiality of each test document.

Figure 4: The security model. The training phase produces clusters of confidential and nonconfidential documents, a graph per cluster, confidential terms, and context terms; the detection phase matches test documents to clusters, matches the confidential terms and their context terms, scores the test documents, and determines their confidentiality.

We conduct experiments on the three scenarios described above. For the first scenario, we select the news of "earn" as the carrier for confidential contents and mix it with news from the other economy subsets as the training dataset and the testing dataset, separately. For the second scenario, we extract contents from the documents of the "earn" subset and embed them in documents from other subsets; the embedded portions are to be detected as confidential contents. For the third scenario, we manually rephrase the contents in the documents of the "earn" subset and embed them in documents from other subsets.

4.1.1. Confidential Documents as a Whole. The experimental result of the first scenario is presented in Figure 5. As shown in


Figure 5: Performance (TPR versus FPR) of detecting confidential documents as a whole, for CBDLP, CBDLP-Pr, CoBAn, and SVM.

Figure 6: Performance (TPR versus FPR) of detecting the confidential contents embedded in nonconfidential documents.

Figure 5, when dealing with the scenario where confidential documents are considered as a whole, the performance of the four detection algorithms differs little. In spite of that, CBDLP and CBDLP-Pr still perform slightly better than CoBAn and SVM, which can be explained as follows: the performance of CoBAn is partly influenced by the limitation of k-means, which cannot deal effectively with clusters of various shapes, and SVM focuses only on the confidential terms and ignores the context terms. In this scenario, since the documents containing confidential terms are explicitly detected as confidential documents, the four methods perform similarly.

4.1.2. Confidential Portions Embedded in Nonconfidential Documents. The result of the second scenario is presented in Figure 6. As shown in Figure 6, when dealing with the scenario where the confidential portions are embedded in nonconfidential documents, CBDLP, CBDLP-Pr, and CoBAn

Figure 7: Performance (TPR versus FPR) of detecting the rephrased confidential contents embedded in nonconfidential documents.

perform better than SVM, which can be explained by the fact that SVM is deceived by this scenario due to its statistical nature. As expected, the performance of CBDLP is slightly better than that of CBDLP-Pr and CoBAn, owing to its pruning step, which removes the redundant nodes in the graph that might deteriorate the detection results.

In this scenario, confidential portions are extracted from the documents defined as confidential and then embedded in nonconfidential documents whose length is at least ten times larger than the extracted portions. Due to its statistical nature, SVM incorrectly detects most documents containing confidential portions as nonconfidential, which results in a dramatic decline in the accuracy of SVM. Unlike SVM, CBDLP, CBDLP-Pr, and CoBAn take the confidential terms together with their context into account, and most nonconfidential documents containing embedded confidential portions are detected as confidential.

4.1.3. Rephrased Confidential Contents in Nonconfidential Documents. The result of the third scenario is presented in Figure 7. As shown in Figure 7, when dealing with the scenario where the confidential contents are rephrased and embedded in nonconfidential documents, the performance of SVM deteriorates considerably due to its statistical nature. Since the rephrased contents do not deviate much from their original meaning, CBDLP, CBDLP-Pr, and CoBAn perform well. In addition, the performance of CBDLP is better than that of CBDLP-Pr and CoBAn due to its pruning step, which removes the redundant nodes in the graph.

In this scenario, the rephrased confidential terms are embedded in nonconfidential documents, which greatly confuses SVM, and most documents containing rephrased confidential contents are incorrectly detected as nonconfidential. Unlike SVM, with the context of confidential terms taken into account, CoBAn detects most documents containing confidential contents; however, the accuracy of CoBAn is partially influenced by the cluster's terms graph, which depends on the quality of the clusters generated by k-means. In contrast, CBDLP clusters documents with DBSCAN, which


Figure 8: Running time (in seconds) against the number of documents for SVM, CoBAn, CBDLP, and CBDLP-Pr.

improves the quality of the clusters and of the cluster's terms graph; meanwhile, the pruning method removes the redundant nodes in the graph and further improves the performance of CBDLP.

4.2. Running Time Comparisons. In this experiment, we mixed nonconfidential documents together with the three types of confidential documents: whole confidential documents, confidential contents embedded in nonconfidential documents, and rephrased confidential contents embedded in nonconfidential documents. The experiment is conducted using 10-fold cross validation. To compare the running time of CBDLP, CBDLP-Pr, CoBAn, and SVM, we conduct the experiment on datasets of different sizes. The result is shown in Figure 8, where the running times of the training and testing phases are exhibited as a line graph; the running times of CBDLP, CBDLP-Pr, CoBAn, and SVM all increase as more documents are added to the dataset. Although the additional steps of CBDLP, CBDLP-Pr, and CoBAn result in longer running time than SVM needs, their running time remains within an order of magnitude of SVM's, and CBDLP performs much better than SVM does.

5. Conclusion and Future Work

In this paper, we present a new method for data leakage prevention based on the CBDLP model, which has the following advantages:

(1) It clusters the documents with DBSCAN and the cosine measure, which have been verified to be effective.

(2) It represents confidential terms and their context terms in a graph.

(3) It provides a pruning method based on the attribute reduction method of rough set theory.

Up to now some designated commercial DLP solutionscan reduce the risk of most accidental leakage howeverthey cannot provide sufficient protection against intentional

leakage And the other DLP solutions such as firewalls IDSantimalware software and management policies which canprovide assistance in detection intrusion or malicious soft-ware and enforce policies to protect data still do not preventintentional leaks perfectly To the best of our knowledgethere might be two main future research topics on DLP dataleakage from mobile devices and accidental data leakage byinsider

Since accidental data leakage may be part of a larger attack, in which its role will mainly be to activate an advanced persistent threat inside the organization, it is expected to continue to be one of the most challenging research topics. Our future work will focus on accidental data leakage in two directions. First, try to improve the efficiency and effectiveness of CBDLP in detecting confidential contents. Second, adjust the model dynamically according to changes in the training dataset.

Data Availability

All data generated or analysed during this study are included in this published article.

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

The research is supported by the National Natural Science Foundation of China under Grant nos. 61871140 and 61572153.

References

[1] A. Goyal, F. Bonchi, and V. S. Lakshmanan, "On minimizing budget and time in influence propagation over social networks," Social Network Analysis and Mining, pp. 1–14, 2012.

[2] C. Aggarwal, Social Network Data Analytics, Springer, Berlin, Germany, 2011.

[3] "Information week global security survey," Information Week, 2004.

[4] D. Alassi and R. Alhajj, "Effectiveness of template detection on noise reduction and websites summarization," Information Sciences, vol. 219, pp. 41–72, 2013.

[5] D. Holmes, Using Language Models for Information Retrieval [Ph.D. thesis], Center for Telematics and Information Technology, University of Twente, 2001.

[6] R. Bohme, "Security metrics and security investment models," in Proceedings of the 5th International Conference on Advances in Information and Computer Security, 2010.

[7] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Boston, Mass, USA, 1990.

[8] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Transactions on Signal Processing, vol. 35, no. 3, pp. 400–401, 1987.

10 Wireless Communications and Mobile Computing

[9] R. Jin, L. Si, A. G. Hauptmann, and J. Callan, "Language model for IR using collection information," in Proceedings of the 25th Annual International ACM SIGIR Conference, pp. 419–420, 2002.

[10] W. W. Cohen, "Learning rules that classify e-mail," in Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, pp. 18–25, 1996.

[11] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613–620, 1975.

[12] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the Association for Information Science and Technology, vol. 41, no. 6, pp. 391–407, 1990.

[13] C. Ordonez, "Clustering binary data streams with K-means," in Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD '03, pp. 12–19, USA, June 2003.

[14] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," in Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2003, pp. 234–243, June 2003.

[15] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: a new data clustering algorithm and its applications," Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141–182, 1997.

[16] S. Guha, R. Rastogi, and K. Shim, "CURE: an efficient clustering algorithm for large databases," in Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 73–84, 1998.

[17] J. W. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[18] A. Hinneburg and D. A. Keim, "Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering," in Proceedings of the 25th VLDB Conference, pp. 506–517, 1999.

[19] W. Wang, J. Yang, and R. Muntz, "STING: a statistical information grid approach to spatial data mining," in Proceedings of the 23rd VLDB Conference, pp. 186–195, 1997.

[20] R. Agrawal, J. Gehrke, and D. Gunopulos, "Automatic subspace clustering of high dimensional data for data mining applications," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 94–105, 1998.

[21] A. McCallum, K. Nigam, and L. H. Ungar, "Efficient clustering of high-dimensional data sets with application to reference matching," in Proceedings of KDD 2000, pp. 169–178, ACM, New York, NY, USA, 2000.

[22] R. Hyde, P. Angelov, and A. R. MacKenzie, "Fully online clustering of evolving data streams into arbitrarily shaped clusters," Information Sciences, vol. 382-383, pp. 96–114, 2017.

[23] F. Jiang, Y. Fu, B. B. Gupta et al., "Deep learning based multi-channel intelligent attack detection for data security," IEEE Transactions on Sustainable Computing, 2018.

[24] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the 2000 ACM SIGMOD International Conference, pp. 439–450, May 2000.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.

[26] Salton, "Automatic text processing: the transformation, analysis, and retrieval of information by computer," Tech. Rep., Addison-Wesley, Inc., 1989.

[27] M. N. Islam, M. Seera, and C. K. Loo, "A robust incremental clustering-based facial feature tracking," Applied Soft Computing, vol. 53, pp. 34–44, 2017.

[28] A. Shabtai and Y. Elovici, A Survey of Data Leakage Detection and Prevention Solutions, Springer, Berlin, Germany, 2012.

[29] S. R. Kalidindi, S. R. Niezgoda, G. Landi, S. Vachhani, and T. Fast, "A novel framework for building materials knowledge systems," Computers, Materials and Continua, vol. 17, no. 2, pp. 103–125, 2010.

[30] A. J. Wang, "Information security models and metrics," in Proceedings of the 43rd Annual Southeast Regional Conference, ACMSE 43, pp. 178–184, 2005.

[31] P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301–312, 2002.

[32] M. Dash, K. Choi, P. Scheuermann, and H. Liu, "Feature selection for clustering - a filter solution," in Proceedings of the 2nd IEEE International Conference on Data Mining, ICDM '02, pp. 115–122, December 2002.

[33] C.-S. Liu, "An analytical method for computing the one-dimensional backward wave problem," Computers, Materials and Continua, vol. 13, no. 3, pp. 219–234, 2010.

[34] G. Katz, Y. Elovici, and B. Shapira, "CoBAn: a context based model for data leakage prevention," Information Sciences, vol. 262, pp. 137–158, 2014.

[35] X. Huang, Y. Lu, D. Li, and M. Ma, "A novel mechanism for fast detection of transformed data leakage," IEEE Access, vol. 1, pp. 1–11, 2018.

[36] M. Porter, "The Porter stemming algorithm," 2006.

[37] F. Pacheco, M. Cerrada, R.-V. Sanchez, D. Cabrera, C. Li, and J. Valente de Oliveira, "Attribute clustering using rough set theory for feature selection in fault severity classification of rotating machinery," Expert Systems with Applications, vol. 71, pp. 69–86, 2017.

[38] "Information week global security survey," Information Week, 2004.

[39] C. M. Praba, "A technical review on data leakage detection and prevention approaches," Journal of Network Communications and Emerging Technologies (JNCET), 2017.

[40] F. Ullah, M. Edwards, R. Ramdhany, R. Chitchyan, M. A. Babar, and A. Rashid, "Data exfiltration: a review of external attack vectors and countermeasures," Journal of Network and Computer Applications, 2017.

[41] K. Thomas, F. Li, A. Zand et al., "Data breaches, phishing, or malware: understanding the risks of stolen credentials," in Proceedings of the 24th ACM SIGSAC Conference on Computer and Communications Security, CCS '17, pp. 1421–1434, November 2017.

[42] J. DeBlasio, S. Savage, G. M. Voelker, and A. C. Snoeren, "Tripwire: inferring internet site compromise," in Proceedings of the IMC '17, pp. 1–14, 2017.

[43] W. Xu, S. Xiang, and V. Sachnev, "A cryptograph domain image retrieval method based on Paillier homomorphic block encryption," Computers, Materials and Continua, pp. 1–11, 2018.

[44] J. Cui, Y. Zhang, Z. Cai, A. Liu, and Y. Li, "Securing display path for security-sensitive applications on mobile devices," Computers, Materials and Continua, vol. 55, no. 1, pp. 17–35, 2018.

[45] D. Wang, W. Li, and P. Wang, "Measuring two-factor authentication schemes for real-time data access in industrial wireless sensor networks," IEEE Transactions on Industrial Informatics, pp. 1–12, 2018.

[46] D. Wang, H. Cheng, P. Wang, J. Yan, and X. Huang, "A security analysis of honeywords," in Proceedings of the Network and Distributed Systems Security (NDSS) Symposium, pp. 18–21, 2018.




Input: C - confidential documents set; N - non-confidential documents set; TRmin - the minimum similarity threshold
Output: CR - the set of clusters, each with its centroid and corresponding graph; CT - the set of confidential terms in clusters; ContextT - the set of context terms

(1) T <- C ∪ N
(2) CR <- DBSCAN(T)  // the result of clustering T is saved in CR
(3) Initialize CT[CR]  // the scores of confidential terms are saved in CT
(4) Initialize ContextT  // the context term set of each confidential term is saved in ContextT
(5) for (each cr in CR) {
(6)   Calculate the similarity between cr and the other clusters
(7)   Create a language model for cr and calculate the score of each confidential term
(8)   TR <- initial threshold of cluster similarity
(9)   while (TR > TRmin) {
(10)    Ctemp <- all clusters whose similarity to cr > TR
(11)    Create a language model for the documents of Ctemp
(12)    CT[cr] <- update the scores of confidential terms based on the new language model
(13)    for (each confidential term ct in cr) {
(14)      Detect the occurrences of ct in C ∪ N
(15)      P_confidential_doc(term, ct) <- for each context term of ct, calculate the probability that both ct and the context term appear in confidential documents
(16)      P_non_confidential_doc(term, ct) <- for each context term of ct, calculate the probability that both ct and the context term appear in non-confidential documents
(17)      ContextT <- calculate the value of P_confidential_doc(term, ct) / P_non_confidential_doc(term, ct) for each confidential term ct
(18)      Detect all clusters whose similarity is greater than TR and detect the occurrences of all terms in those clusters
(19)      ContextT <- update the probabilities of the context terms that appear in the scopes of different confidential terms
(20)    }
(21)    Reduce the value of TR
(22)  }
(23) }

Algorithm 1: CBDLP

Step 2. Start with an arbitrary document that has not been visited and find all the documents in its ε-neighborhood. If the number of documents in the neighborhood exceeds MinPts, incorporate the documents into the same cluster and label them.

Step 3. If not all documents have been visited, start again from another arbitrary document which has not been visited.

Step 4. Mark the documents which are not labelled as noise.
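The steps above can be sketched as a minimal, self-contained DBSCAN over term-frequency vectors, with cosine distance (1 − cosine similarity) as the metric; `eps` plays the role of the ε-neighborhood radius and `min_pts` of MinPts. The toy documents are illustrative assumptions, not the paper's dataset.

```python
import math
from collections import Counter

def cosine_sim(a, b):
    # Cosine similarity between two term-frequency dicts.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dbscan(docs, eps, min_pts):
    """Cluster term-frequency vectors; returns one cluster id per
    document, or -1 for noise (Step 4)."""
    n = len(docs)
    labels = [None] * n
    cid = -1

    def neighbors(i):
        return [j for j in range(n) if 1 - cosine_sim(docs[i], docs[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1            # provisionally noise (Step 4)
            continue
        cid += 1
        labels[i] = cid               # start a new cluster (Step 2)
        seeds = [j for j in nbrs if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid       # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:    # j is a core point: expand the cluster
                seeds.extend(jn)
    return labels

docs = [Counter("data leak term".split()), Counter("data leak risk".split()),
        Counter("data leak policy".split()), Counter("stock gold price".split())]
print(dbscan(docs, eps=0.5, min_pts=2))  # -> [0, 0, 0, -1]
```

The fourth document shares no terms with the others, so its neighborhood contains only itself and it is marked as noise, mirroring Step 4.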

3.2. Representing Clusters with Graph. In this step, the confidential contents in all clusters, which include not only the confidential terms but also their context, need to be represented by graphs. The procedure of creating the graph representation for the clusters which include confidential contents is described as follows:

(1) Detect the confidential terms, which are provided by domain experts or inferred from the key terms of training documents.

(2) Analyze the context of each confidential term.

(3) Create the graph representation for confidential terms and their contexts at the cluster level.

3.2.1. Detect Confidential Terms. In general, a term which appears in confidential documents with high probability and appears in nonconfidential documents with low probability is considered a confidential term. We first build language models for the confidential and nonconfidential documents of the same cluster, which are denoted by cVM (confidential vector model) and ncVM (nonconfidential vector model). Then the confidentiality score can be represented by the ratio of a term's probability in confidential documents to that in nonconfidential documents, as shown in (1), where P_cVM(t) and P_ncVM(t) denote the probability of term t in the confidential and nonconfidential language models, respectively:

∀t ∈ cVM: score += P_cVM(t) / P_ncVM(t)    (1)
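Equation (1) can be sketched with simple unigram language models; the Laplace smoothing is our assumption (the paper does not specify how zero counts are handled), and the toy documents are illustrative.

```python
from collections import Counter

def unigram_lm(docs, vocab, alpha=1.0):
    # Laplace-smoothed unigram language model over a shared vocabulary.
    counts = Counter(t for d in docs for t in d)
    total = sum(counts.values())
    return {t: (counts[t] + alpha) / (total + alpha * len(vocab)) for t in vocab}

def term_scores(conf_docs, nonconf_docs):
    """score(t) = P_cVM(t) / P_ncVM(t), as in Eq. (1); terms with
    score > 1 are candidate confidential terms."""
    vocab = {t for d in conf_docs + nonconf_docs for t in d}
    c_lm = unigram_lm(conf_docs, vocab)     # cVM
    nc_lm = unigram_lm(nonconf_docs, vocab)  # ncVM
    return {t: c_lm[t] / nc_lm[t] for t in vocab}

conf = [["merger", "bid", "secret"], ["merger", "price", "secret"]]
nonconf = [["price", "market"], ["market", "news"]]
scores = term_scores(conf, nonconf)
confidential_terms = {t for t, s in scores.items() if s > 1}
```

Here "merger", "secret", and "bid" score above 1 and survive as confidential terms, while "price", which also appears in nonconfidential documents, does not.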

However, there may exist the following problem: if a cluster includes only a few nonconfidential documents, or possibly none at all, its language model cannot fully represent the nonconfidential documents in it. The solution we propose follows an expanding manner: we first predefine the minimal similarity threshold TRmin and iteratively expand the ncVM to include more clusters. TR is referred to as the cosine-measure similarity threshold of the cluster with few nonconfidential documents. Note that not all clusters need to be expanded. After each iteration, we lower the value of TR. As long as TR is greater than TRmin, the nonconfidential documents of the expanding clusters are included to recalculate the scores of terms in the original cluster.

When the adjacent clusters are included and the scores of confidential terms are recalculated, each term whose score is greater than 1 is considered a confidential term, which means the term is more likely to appear in confidential documents than in nonconfidential documents. After this phase, the set of confidential terms CT is obtained.

3.2.2. Analyze the Context of Confidential Terms. After confidential term detection, we further analyze the context of confidential terms. Apparently, a term is more likely to be considered confidential if it appears in similar contexts in other confidential documents. Inversely, if the context of a confidential term frequently appears in nonconfidential documents, the probability of the confidential term being part of confidential contents is lower.

As a predefined parameter, the context span η determines the number of terms that precede and follow the confidential term. A context span with a high value might increase the computational cost, while, inversely, a context span with a low value could not provide adequate context information for the confidential terms. Experimental results show that η = 10 tends to be the optimal value of the context span in our experiments, which means that the context of a confidential term consists of the five terms preceding it and the five terms following it. Apparently, only the context of the confidential terms in confidential documents needs to be taken into account.
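With η = 10, collecting the context of a confidential term amounts to taking the five tokens on either side of each of its occurrences; a minimal sketch (tokenization is assumed to have been done already):

```python
def context_of(tokens, term, span=10):
    """Collect the span/2 terms before and after each occurrence of `term`."""
    half = span // 2
    ctx = set()
    for i, t in enumerate(tokens):
        if t == term:
            ctx.update(tokens[max(0, i - half):i])   # preceding terms
            ctx.update(tokens[i + 1:i + 1 + half])   # following terms
    return ctx

tokens = "the board approved the secret merger bid at the meeting".split()
print(context_of(tokens, "merger"))
```

The window is truncated at document boundaries, so a term near the start of a document simply contributes fewer preceding context terms.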

The probabilities of a confidential term appearing together with its context in confidential documents and in nonconfidential documents, denoted by P_c(key_context|key) and P_nc(key_context|key), are calculated separately. If the former is higher than the latter, the corresponding confidential contents can be well represented by the confidential term with its context. P_c(key_context|key) is defined as the number of confidential documents in which the confidential term appears with its context divided by the number of confidential documents in which only the confidential term appears. And P_nc(key_context|key) is defined as the number of nonconfidential documents in which the confidential term appears with its context divided by the number of nonconfidential documents in which only the confidential term appears.

As mentioned above, we predefine the minimum cosine-measure similarity threshold TRmin and iteratively expand to include more clusters. TR is referred to as the cosine-measure similarity threshold between the cluster with few nonconfidential documents and its expanding cluster. After each iteration, we lower the value of TR at a certain predefined rate μ, namely, TR = μ * TR. As long as TR is greater than TRmin, the nonconfidential documents of the expanding cluster are included to recalculate the scores of context terms in the original cluster. By including more adjacent clusters, we can accurately estimate which terms are most likely to indicate the confidentiality of the document.
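The TR = μ·TR expansion loop can be sketched as a generator that yields, at each iteration, the neighboring clusters admitted under the current threshold; the cluster ids and the similarity table are illustrative assumptions.

```python
def expand_clusters(cr, clusters, sims, tr_init, tr_min, mu=0.9):
    """Iteratively lower TR by the rate mu (TR = mu * TR) and, as long as
    TR > tr_min, yield the clusters whose similarity to `cr` exceeds the
    current TR. `sims` maps a (cluster, cluster) pair to its cosine similarity."""
    tr = tr_init
    while tr > tr_min:
        yield [c for c in clusters if c != cr and sims[(cr, c)] > tr]
        tr *= mu

sims = {("a", "b"): 0.8, ("a", "c"): 0.5}
rounds = list(expand_clusters("a", ["a", "b", "c"], sims,
                              tr_init=0.9, tr_min=0.6, mu=0.8))
print(rounds)
```

In this toy run the first iteration (TR = 0.9) admits no neighbor, and the second (TR = 0.72) admits cluster "b"; cluster "c" never qualifies before TR drops below TRmin.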

By subtracting the probability of the appearance of each context term with the confidential term in nonconfidential documents from the probability of their appearance in confidential ones, the score of each context term is calculated, as shown in (2).

The reason for employing subtraction rather than division is to avoid large fluctuations in the values of the context terms. When employing division, even a single document can dramatically change the probabilities, as only the documents including confidential terms are taken into account.

score += P_c(key_context|key) − P_nc(key_context|key)    (2)
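Equation (2), with the co-occurrence probabilities computed from document counts as defined above, can be sketched as follows; the function and variable names are our own, and the toy documents are assumptions.

```python
def cooccur_prob(docs, key, ctx_term, span=10):
    """Fraction of the documents containing `key` in which `ctx_term`
    also falls inside the context window of some occurrence of `key`."""
    half = span // 2
    with_key = [d for d in docs if key in d]
    if not with_key:
        return 0.0
    hits = sum(
        1 for d in with_key
        if any(t == key and ctx_term in d[max(0, i - half):i + 1 + half]
               for i, t in enumerate(d))
    )
    return hits / len(with_key)

def context_term_score(conf_docs, nonconf_docs, key, ctx_term, span=10):
    """Eq. (2): P_c(key_context|key) - P_nc(key_context|key); a positive
    score means the context term co-occurs mainly in confidential text."""
    return (cooccur_prob(conf_docs, key, ctx_term, span)
            - cooccur_prob(nonconf_docs, key, ctx_term, span))

conf = [["secret", "merger", "bid"], ["merger", "news"]]
nonconf = [["merger", "report"]]
score = context_term_score(conf, nonconf, "merger", "bid")
```

Here "bid" appears in the window of "merger" in half of the confidential documents containing "merger" and in none of the nonconfidential ones, so its score is 0.5.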

We iteratively expand to include more clusters. After each iteration, we lower the value of TR until TR is less than TRmin, and the score of each context term is calculated as shown in (3), in which n_clu denotes the number of clusters involved:

score += (1 / n_clu) * (P_c(key_context|key) − P_nc(key_context|key))    (3)

After this phase, the set of context terms with their scores, ContextT, is obtained. For each confidential term, the context terms whose scores are positive are more likely to appear with the confidential term in confidential documents.

3.2.3. Create Graph Representation. After the operations described in the previous sections, confidential terms and their context can be easily represented as nodes and connected together according to their interrelations. As shown in Figure 2, for each cluster, a set of confidential terms and a set of context terms are obtained after the training phase, and confidential terms and their context terms are represented as confidential nodes and context nodes, respectively. Confidential nodes are connected together as long as there exists at least one common context node between them.
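The connection rule (link two confidential nodes whenever they share at least one context node) can be sketched with plain sets; the toy terms are assumptions for illustration.

```python
def build_graph(conf_terms, contexts):
    """contexts: dict mapping each confidential term to the set of its
    context terms. Returns the edge set linking confidential terms that
    share at least one context term."""
    edges = set()
    terms = sorted(conf_terms)
    for i, u in enumerate(terms):
        for v in terms[i + 1:]:
            if contexts.get(u, set()) & contexts.get(v, set()):
                edges.add((u, v))
    return edges

contexts = {"merger": {"bid", "price"},
            "acquisition": {"price", "deal"},
            "dividend": {"payout"}}
edges = build_graph({"merger", "acquisition", "dividend"}, contexts)
print(edges)
```

"merger" and "acquisition" are connected through their shared context node "price", while "dividend" remains an isolated confidential node.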

3.3. Pruning Nodes of Graph. Because the scores of confidential terms and their context terms are based on statistics, there might exist occasional cases of a nonconfidential term with a high score because of term abuse. In the pruning phase, we employ the method of term reduction in rough set theory to remove the redundant nodes in the graph.

With the information of confidential and nonconfidential documents, we evaluate the importance of the nodes in the graph for each cluster. A node in the graph can be pruned only if the removal of the term represented by the node does not influence the results of identifying the confidential documents in this cluster. As shown in (4), Impt(t_i) denotes the importance measure of term t_i, which is represented as node n_i in the graph. Pos_G(C) denotes the portion of the confidential documents set C that can be identified correctly by graph G. Similarly, Pos_{G−n_i}(C) denotes the portion of C that can be identified correctly by graph G−n_i, i.e., graph G with node n_i removed. The details of the pruning procedure are presented in Algorithm 2.

Impt(t_i) = (|Pos_G(C)| − |Pos_{G−n_i}(C)|) / |C|    (4)
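Equation (4) can be sketched given any graph-based decision procedure; here `classify` is a hypothetical stand-in for "the graph, minus the excluded node, identifies the document as confidential", and the toy documents are assumptions.

```python
def importance(term, conf_docs, classify):
    """Eq. (4): the drop in correctly identified confidential documents
    when `term`'s node is removed, normalised by |C|. `classify(doc,
    excluded)` returns True if the graph without `excluded` flags doc."""
    pos_full = sum(1 for d in conf_docs if classify(d, excluded=set()))
    pos_pruned = sum(1 for d in conf_docs if classify(d, excluded={term}))
    return (pos_full - pos_pruned) / len(conf_docs)

conf_docs = [["secret", "plan"], ["merger", "bid"], ["secret", "merger"]]

def classify(doc, excluded):
    # Toy rule: flag a document if any remaining confidential node appears.
    return any(t in doc for t in {"secret", "merger"} - excluded)

imp = importance("secret", conf_docs, classify)
```

Removing "secret" loses exactly one of the three confidential documents, so Impt("secret") = 1/3; a node whose importance falls below the threshold is pruned as redundant, as in Algorithm 2.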


Input: C - confidential documents set; N - non-confidential documents set; CR - clusters set, each with its centroid and corresponding graph; ContextT - the set of context terms
Output: CT′ - the updated set of confidential terms in clusters; ContextT′ - the updated set of context terms

(1) irr_terms <- CT ∪ ContextT
(2) for (each t in irr_terms)
(3)   if (Impt(t) = (|Pos_T(C_cr ∧ N_cr)| − |Pos_{T−t}(C_cr ∧ N_cr)|) / |cr| < Impt_threshold)
(4)     irr_terms = irr_terms − t
(5) for (each t in irr_terms)
(6)   CT′ <- update confidential terms in CT
(7)   ContextT′ <- update context terms in ContextT

Algorithm 2: Pruning

Figure 2: An example of confidential nodes connected through their context nodes (two confidential terms, with scores 10 and 5, linked via context terms with scores 0.5, 0.4, −0.2, 0.6, and −0.3).

3.4. Detection Phase. Obviously, a confidential document without any modification is easy to detect according to confidential terms. However, confidential documents which are rephrased, or partitioned into portions and further concealed in different nonconfidential documents, as most plagiarizers often do, can hardly be detected. Once the confidential contents detection fails, it is more likely to lead to data leakage or copyright infringement.

In the detection phase, we employ the CBDLP model to deal with three scenarios that could possibly happen. The three different scenarios are described as follows:

(i) Each confidential document is detected as a whole.

(ii) Each confidential document is divided into portions and embedded in nonconfidential ones.

(iii) The confidential terms in confidential documents are rephrased completely.

The detection method we employ includes three steps, as shown in Figure 3, which are described as follows:

(1) Classify the documents to be tested to the corresponding clusters.

(2) Identify the confidential terms and their context terms according to the graphs of the corresponding clusters.

(3) Calculate the confidentiality scores for the documents and conclude whether a document is confidential or not.
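The three steps can be sketched end to end as follows; the cluster dictionary layout (`centroid`, `term_scores`) and the decision threshold are our own illustrative assumptions, not the paper's data structures.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def detect(doc_tokens, clusters, threshold):
    """Step 1: match the document to the most similar cluster.
    Step 2: match the cluster's confidential terms against the document.
    Step 3: threshold the accumulated confidentiality score."""
    tf = Counter(doc_tokens)
    best = max(clusters, key=lambda c: cosine(tf, c["centroid"]))
    score = sum(s for t, s in best["term_scores"].items() if t in tf)
    return score >= threshold, score

clusters = [
    {"centroid": Counter(["merger", "bid", "secret"]),
     "term_scores": {"merger": 2.5, "secret": 2.5}},
    {"centroid": Counter(["market", "news", "price"]),
     "term_scores": {}},
]
flag, score = detect("the secret merger was announced".split(),
                     clusters, threshold=2.0)
```

The test document lands in the first (confidential-looking) cluster, matches two confidential terms, and its score of 5.0 exceeds the threshold, so it is flagged as confidential.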

Then the security model, which combines the training phase and the detection phase, is shown in Figure 4.

4. Experiments

In this section, we evaluate the performance of CBDLP on the Reuters-21578 dataset. As the testing dataset, Reuters-21578 consists of 21578 pieces of news distributed by Reuters in 1987, which are saved in 22 files. The Reuters-21578 dataset is manually classified into five categories, each of which can be subdivided into a different number of subcategories. For example, the news of economy includes the inventories subset, gold subset, and money-supply subset.

4.1. Performance Experiments. In the experiments, we present the data leakage prevention method based on the CBDLP model and also present a modified model without the pruning step, which is denoted CBDLP-Pr. Since SVM has been proved to be an excellent classifier with high accuracy, and CoBAn performs well in the scenario where confidential contents are embedded in nonconfidential documents or rephrased, we compare the performance of CBDLP, CBDLP-Pr, SVM, and CoBAn. We evaluate the performance of the methods in this paper with the true positive rate (TPR) and false positive rate (FPR), and our goal is to maximize TPR and minimize FPR concurrently.
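TPR and FPR follow directly from the confusion counts; a small sketch (label 1 marks confidential, and both classes are assumed present so the denominators are nonzero):

```python
def tpr_fpr(y_true, y_pred):
    """True positive rate TP/(TP+FN) and false positive rate FP/(FP+TN)
    for binary labels (1 = confidential)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

tpr, fpr = tpr_fpr([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Sweeping the detection threshold and plotting (FPR, TPR) pairs yields the ROC-style curves reported in Figures 5-7.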


Figure 3: The detection of test documents (the first step matches the test documents to the most similar cluster; the second step matches the confidential terms and their context terms against the cluster's graph; the third step calculates scores for the test documents and determines the confidentiality of each test document).

Figure 4: The security model (the training phase produces clusters of confidential and nonconfidential documents, confidential terms, context terms, and the graph; the detection phase matches test documents to clusters, matches the confidential terms and their context terms, scores the test documents, and determines confidentiality).

We conduct experiments on the three scenarios described above. For the first type of scenario, we select the news of "earn" as the carrier for confidential contents and mix them with the news from other economy subsets as the training dataset and testing dataset separately. For the second type of scenario, we extract the contents from the documents of the "earn" subset and embed them in the documents from other subsets. The embedded portions are detected as confidential contents. For the third type of scenario, we manually rephrase the contents in the documents of the "earn" subset and embed them in the documents from other subsets.

4.1.1. Confidential Documents as a Whole. The experimental result of the first scenario is presented in Figure 5. As shown in


Figure 5: Performance of detecting confidential documents as a whole (TPR vs. FPR for CBDLP, CBDLP-Pr, CoBAn, and SVM).

Figure 6: Performance of detecting the confidential contents embedded in nonconfidential documents (TPR vs. FPR for CBDLP, CBDLP-Pr, CoBAn, and SVM).

Figure 5, when dealing with the scenario where confidential documents are considered as a whole, the performance of the four detection algorithms shows little difference. In spite of that, CBDLP and CBDLP-Pr still perform slightly better than CoBAn and SVM, which can be explained by the fact that the performance of CoBAn is partly influenced by the limitation of k-means, which cannot deal with clusters of various shapes effectively, and that SVM focuses only on the confidential terms and ignores the context terms. In this scenario, since the documents containing confidential terms are explicitly detected as confidential documents, the performance of the four methods shows little difference.

4.1.2. Confidential Portions Embedded in Nonconfidential Documents. The result of the second scenario is presented in Figure 6. As shown in Figure 6, when dealing with the scenario where the confidential portions are embedded in nonconfidential documents, CBDLP, CBDLP-Pr, and CoBAn

Figure 7: Performance of detecting the rephrased confidential contents embedded in nonconfidential documents (TPR vs. FPR for CBDLP, CBDLP-Pr, CoBAn, and SVM).

perform better than SVM, which can be explained by the fact that SVM is deceived by this scenario due to its statistical nature. As expected, the performance of CBDLP is slightly better than that of CBDLP-Pr and CoBAn due to its pruning step, which removes the redundant nodes in the graph that might deteriorate the detection results.

In this scenario, confidential portions are extracted from the documents defined as confidential and then embedded in nonconfidential documents whose lengths are at least ten times greater than those of the extracted portions. Due to its statistical nature, most documents containing confidential portions are incorrectly detected as nonconfidential by SVM, which results in a dramatic decline in the accuracy of SVM. Unlike SVM, CBDLP, CBDLP-Pr, and CoBAn take the confidential terms together with their context into account, and most nonconfidential documents containing embedded confidential portions are detected as confidential.

4.1.3. Rephrased Confidential Contents in Nonconfidential Documents. The result of the third scenario is presented in Figure 7. As shown in Figure 7, when dealing with the scenario where the confidential contents are rephrased and embedded in nonconfidential documents, the performance of SVM deteriorates considerably due to its statistical nature. Since the rephrased contents do not deviate much from their original meaning, CBDLP, CBDLP-Pr, and CoBAn perform well. In addition, the performance of CBDLP is better than that of CBDLP-Pr and CoBAn due to its pruning step, which removes the redundant nodes in the graph.

In this scenario, the rephrased confidential terms are embedded in nonconfidential documents, which confuses SVM greatly, and most documents containing rephrased confidential contents are incorrectly detected as nonconfidential. Unlike SVM, with the context of confidential terms taken into account, CoBAn detects most documents containing confidential contents; however, the accuracy of CoBAn is partially influenced by the cluster's terms graph, which depends on the quality of the clusters generated by k-means. In contrast, CBDLP clusters documents with DBSCAN, which

Wireless Communications and Mobile Computing 9

SVM

Runn

ing

time (

s)

Number of documents

3000

2500

2000

1500

1000

500

0

0 1000 2000 3000 4000 5000

CoBAnCBDLPCBDLP-Pr

Figure 8The scores of confidential documents andnonconfidentialdocuments

improves the quality of clusters and the clusterrsquos terms graphmeanwhile the pruning method removes the redundancynodes in graph and further improves the performance ofCBDLP

42 Running Time Comparisons In this experiment wemixed non-confidential documents together with the threetype of confidential documents which are the whole con-fidential documents the confidential contents embedded innonconfidential documents and the rephrased confidentialcontents embedded in non-confidential documents Theexperiment is conducted by using 10 fold cross validation ToCompare the running time of CBDLP CBDLP-Pr CoBAnand SVM we conduct the experiment on the datasets ofdifferent size The result is as shown in Figure 8 the runningtime of training phase and testing phase are exhibited asline graph in which the running time of CBDLP CBDLP-PrCoBan and SVM increase as more documents are added tothe dataset Although the additional steps of CBDLPCBDLP-Pr and CoBan result in more running time than SVM needstheir running time is still an order of magnitude more thanthat CBDLP performs much better than SVM does

5 Conclusion and Future Work

In this paper we present a new method for data leakagePrevention based on CBDLPmodel which has the followingadvantages

(1) It clusters the documents with DBSCAN and cosinemeasure which have been verified to be effective

(2) It represents confidential terms and their contextterms in graph

(3) It presents a pruning method based on the attributereduction method of rough set theory

Up to now some designated commercial DLP solutionscan reduce the risk of most accidental leakage howeverthey cannot provide sufficient protection against intentional

leakage And the other DLP solutions such as firewalls IDSantimalware software and management policies which canprovide assistance in detection intrusion or malicious soft-ware and enforce policies to protect data still do not preventintentional leaks perfectly To the best of our knowledgethere might be two main future research topics on DLP dataleakage from mobile devices and accidental data leakage byinsider

Since accidental data leakagemay be part of a larger attackin which their role will be mainly to activate an advancedpersistent threat inside the organization it is expected tocontinue to be one of the most challenging research topicsAnd our future work will focus on accidental data leakagein two directions First try to improve the efficiency andeffectiveness of CBDLP on confidential contents detectionSecond adjust the model dynamically according to thechanges of training dataset

Data Availability

All data generated or analysed during this study are includedin this published article

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

The research is supported by the National Natural Sci-ence Foundation of China under Grant nos 61871140 and61572153

References

[1] A Goyal F Bonchi and V S Lakshmanan ldquoOn minimizingbudget and time in influence propagation over social networksrdquoSocial Network Analysis and Mining pp 1ndash14 2012

[2] C Aggarwal Social Network Data Analytics Springer BerlinGermany 2011

[3] ldquoInformation week global security surveyrdquo Information Week2004

[4] D Alassi and R Alhajj ldquoEffectiveness of template detectionon noise reduction and websites summarizationrdquo InformationSciences vol 219 pp 41ndash72 2013

[5] D Holmes Using language models for information retrieval[PhD Thesis] Center for telematies and information technol-ogy University of Twente 2001

[6] R Bohme ldquoSecurity metrics and security investment modelsrdquoin Proceedings of the 5th international conference on advances ininformation and computer security 2010

[7] K R Rao and P Yip Discrete Cosine Transform AlgorithmsAdvantages Applications Academic Press Boston Mass USA1990

[8] S M Katz ldquoEstimation of probabilities from sparse data forthe language model component of a speech recognizerrdquo IEEETransactions on Signal Processing vol 35 no 3 pp 400-4011987

10 Wireless Communications and Mobile Computing

[9] R Jin L Si A G Hauptmann and J Callan ldquoLanguage modelfor IR using collection informationrdquo in Proceedings of the the25th annual international ACM SIGIR conference pp 419-4202002

[10] W W Cohen ldquoLearning rules that classify e-mailrdquo in Proceed-ings of the AAAI Spring Symposium on Machine Learning inInformation Access pp 18ndash25 1996

[11] G Salton A Wong and C S Yang ldquoA vector space model forautomatic indexingrdquo Communications of the ACM vol 18 no11 pp 613ndash620 1975




Wireless Communications and Mobile Computing 5

nonconfidential documents. Note that not all clusters need to be expanded. After each iteration, we lower the value of TR. As long as TR is greater than TR_min, the nonconfidential documents of the expanding clusters are included to recalculate the scores of the terms in the original cluster.

When the adjacent clusters are included and the scores of the confidential terms are recalculated, each term whose score is greater than 1 is considered a confidential term, which means the term is more likely to appear in confidential documents than in nonconfidential documents. After this phase, the set of confidential terms CT is obtained.

3.2.2. Analyze the Context of Confidential Terms. After confidential term detection, we further analyze the context of the confidential terms. Apparently, a term is more likely to be considered confidential if it appears in similar contexts in other confidential documents. Inversely, if the context of a confidential term frequently appears in nonconfidential documents, the probability of the confidential term being part of confidential contents is lower.

As a predefined parameter, the context span η determines the number of terms that precede and follow the confidential term. A context span with a high value might increase the computational cost, while a context span with a low value cannot provide adequate context information about the confidential terms. Experimental results show that η = 10 tends to be the optimal value of the context span in our experiments, which means that the context of a confidential term consists of the five terms preceding it and the five terms following it. Apparently, only the context of the confidential terms in confidential documents needs to be taken into account.
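The context-window extraction described above can be sketched as follows (an illustrative Python sketch; the function and parameter names are ours, not the paper's):

```python
def context_windows(tokens, confidential_terms, span=10):
    """For each occurrence of a confidential term, collect the span/2
    terms preceding it and the span/2 terms following it (with the
    paper's span of 10, five terms on each side)."""
    half = span // 2
    contexts = {}
    for i, tok in enumerate(tokens):
        if tok in confidential_terms:
            # Slice before and after the occurrence, clipped at the
            # document boundaries.
            window = tokens[max(0, i - half):i] + tokens[i + 1:i + 1 + half]
            contexts.setdefault(tok, []).append(window)
    return contexts
```

Only documents labeled confidential would be passed through this step, since only the context of confidential terms in confidential documents is considered.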

The probabilities of a confidential term together with its context appearing in confidential documents and in nonconfidential documents, denoted by P_c(key|context_key) and P_nc(key|context_key), are calculated separately. If the former is higher than the latter, the corresponding confidential contents can be well represented by the confidential term with its context. P_c(key|context_key) is defined as the number of confidential documents in which the confidential term appears with its context, divided by the number of confidential documents in which only the confidential term appears. Similarly, P_nc(key|context_key) is defined as the number of nonconfidential documents in which the confidential term appears with its context, divided by the number of nonconfidential documents in which only the confidential term appears.

As mentioned above, we predefine the similarity threshold of minimum cosine measure TR_min and iteratively expand to include more clusters. TR is referred to as the similarity threshold of the cosine measure between the cluster with few nonconfidential documents and its expanding cluster. After each iteration, we lower the value of TR at a certain predefined rate μ, namely, TR = μ · TR. As long as TR is greater than TR_min, the nonconfidential documents of the expanding cluster are included to recalculate the scores of the context terms in the original cluster. By including more adjacent clusters, we can accurately estimate which terms are most likely to indicate the confidentiality of the document.
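The iterative expansion with the decaying threshold TR can be sketched as follows (illustrative; the similarity values are assumed to be precomputed cosine measures between the original cluster and each candidate cluster, and all names and defaults are ours):

```python
def expand_clusters(candidates, similarity, tr_init=0.9, mu=0.8, tr_min=0.5):
    """Iteratively expand a cluster: in each round, include every
    candidate cluster whose cosine similarity to the original cluster
    is at least the current threshold TR, then decay the threshold at
    the predefined rate mu (TR = mu * TR) until TR drops to TR_min."""
    included = []
    tr = tr_init
    while tr > tr_min:
        for cand in candidates:
            if cand not in included and similarity[cand] >= tr:
                included.append(cand)
        tr *= mu
    return included
```

Each newly included cluster contributes its nonconfidential documents to the recalculation of the term scores in the original cluster.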

By subtracting the probability of the appearance of each context term with the confidential term in nonconfidential documents from the probability of their appearance in confidential ones, the score of each context term is calculated as shown in (2).

The reason for employing subtraction rather than division is to avoid large fluctuations in the values of the context terms. When employing division, even a single document can dramatically change the probabilities, as only the documents including confidential terms are taken into account.

score += P_c(key|context_key) − P_nc(key|context_key)    (2)

We iteratively expand to include more clusters. After each iteration, we lower the value of TR until TR is less than TR_min, and the score of each context term is calculated as shown in (3), in which n_clu denotes the number of clusters involved.

score += (1/n_clu) · (P_c(key|context_key) − P_nc(key|context_key))    (3)

After this phase, the set of context terms with their scores, ContextT, is obtained. For each confidential term, the context terms whose scores are positive are more likely to appear in confidential documents together with the confidential term.
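A minimal sketch of the context-term scoring of equation (2), assuming P_c and P_nc are estimated by counting the documents containing the confidential term in which the context term also occurs within the context span (the names and window handling are ours):

```python
def context_term_score(key, ctx_term, conf_docs, nonconf_docs, window=5):
    """Score a context term of confidential term `key` as
    P_c(key|context_key) - P_nc(key|context_key): each probability is
    the fraction of documents containing `key` in which `ctx_term`
    occurs within `window` tokens of an occurrence of `key`.
    Documents are token lists."""
    def prob(docs):
        with_key = [d for d in docs if key in d]
        if not with_key:
            return 0.0
        hits = 0
        for d in with_key:
            for i, tok in enumerate(d):
                if tok == key and ctx_term in d[max(0, i - window):i + window + 1]:
                    hits += 1
                    break  # count each document at most once
        return hits / len(with_key)
    return prob(conf_docs) - prob(nonconf_docs)
```

For n_clu expanded clusters, equation (3) simply averages this difference over the clusters, i.e., each cluster contributes (1/n_clu) of its difference to the accumulated score.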

3.2.3. Create Graph Representation. After the operations described in the previous sections, confidential terms and their context can easily be represented as nodes and connected according to their interrelations. As shown in Figure 2, for each cluster, a set of confidential terms and a set of its context terms are obtained after the training phase, and the confidential terms and their context terms are represented as confidential nodes and context nodes, respectively. Confidential nodes are connected as long as there exists at least one common context node between them.
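The construction of the cluster graph can be sketched as follows (illustrative; `contexts` is assumed to map each confidential term to the set of its positively scored context terms):

```python
def build_graph(conf_terms, contexts):
    """Confidential terms are nodes; two confidential nodes are
    connected whenever they share at least one context node."""
    edges = set()
    terms = list(conf_terms)
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            # A shared context term yields an (undirected) edge.
            if contexts.get(a, set()) & contexts.get(b, set()):
                edges.add(frozenset((a, b)))
    return edges
```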

3.3. Pruning Nodes of the Graph. Because the calculation of confidential terms and their context terms is based on statistical scores, there might exist the occasional case of a nonconfidential term with a high score because of term abuse. In the pruning phase, we employ the method of term reduction from rough set theory to remove the redundant nodes in the graph.

With the information of the confidential and nonconfidential documents, we evaluate the importance of the nodes in the graph for each cluster. A node in the graph can be pruned only if the removal of the term represented by the node does not influence the results of identifying the confidential documents in the cluster. As shown in (4), Impt(t_i) denotes the importance measure of term t_i, which is represented as node n_i in the graph. Pos_G(C) denotes the portion of the confidential documents set C that can be identified correctly by graph G. Similarly, Pos_{G−n_i}(C) denotes the portion of C that can be identified correctly by graph G−n_i, in which node n_i has been removed from G. The details of the pruning procedure are presented in Algorithm 2.

Impt(t_i) = (|Pos_G(C)| − |Pos_{G−n_i}(C)|) / |C|    (4)


Input: C – confidential documents set; N – nonconfidential documents set; CR – clusters set, each with its centroid and corresponding graph; ContextT – the set of context terms
Output: CT′ – the updated set of confidential terms in clusters; ContextT′ – the updated set of context terms

(1) irr_terms ← CT ∪ ContextT
(2) for each t in irr_terms
(3)   if Impt(t) = (|Pos_T(C_cr ∧ N_cr)| − |Pos_{T−t}(C_cr ∧ N_cr)|) / |cr| < Impt_threshold
(4)     irr_terms ← irr_terms − {t}
(5) for each t in irr_terms
(6)   CT′ ← update the confidential terms in CT
(7)   ContextT′ ← update the context terms in ContextT

Algorithm 2: Pruning.
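A minimal sketch of the importance measure of equation (4) and the pruning of Algorithm 2, with a stand-in `classify` function in place of the paper's graph-based detection (all names are illustrative):

```python
def importance(term, graph_terms, conf_docs, classify):
    """Impt(t_i) of equation (4): the drop in the fraction of
    confidential documents C still identified correctly when the node
    of `term` is removed from the graph.  `classify(doc, terms)`
    returns True when `doc` is flagged as confidential."""
    full = sum(classify(d, graph_terms) for d in conf_docs)
    without = [t for t in graph_terms if t != term]
    reduced = sum(classify(d, without) for d in conf_docs)
    return (full - reduced) / len(conf_docs)

def prune(graph_terms, conf_docs, classify, threshold=0.0):
    """Algorithm 2: keep only the terms whose removal would hurt
    detection by more than the importance threshold."""
    return [t for t in graph_terms
            if importance(t, graph_terms, conf_docs, classify) > threshold]
```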

[Figure 2: two confidential nodes, "Confidential term 1" (score 10) and "Confidential term 2" (score 5), connected through shared context nodes "Context term 1", "Context term 2", and "Context term 3", with context scores 0.5, 0.4, −0.2, 0.6, and −0.3.]

Figure 2: An example of confidential nodes connected through their context nodes.

3.4. Detection Phase. Obviously, a confidential document without any modification is easy to detect according to its confidential terms. However, confidential documents that are rephrased, or partitioned into portions and further concealed in different nonconfidential documents, as most plagiarizers often do, can hardly be detected. Once the detection of confidential contents fails, data leakage or copyright infringement becomes much more likely.

In the detection phase, we employ the CBDLP model to deal with three scenarios that could possibly happen. The three different scenarios are described as follows:

(i) Each confidential document is detected as a whole.

(ii) Each confidential document is divided into portions and embedded in nonconfidential ones.

(iii) The confidential terms in confidential documents are rephrased completely.

The detection method we employ includes three steps, as shown in Figure 3, which are described as follows:

(1) Classify the documents to be tested into the corresponding clusters.

(2) Identify the confidential terms and their context terms according to the graphs of the corresponding clusters.

(3) Calculate the confidentiality scores for the documents and conclude whether each document is confidential or not.
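The three steps above can be sketched as follows (illustrative; summing matched term scores is a simplified stand-in for the paper's graph-based confidentiality score, and the data layout is ours):

```python
import math

def cosine(u, v):
    """Cosine measure between two sparse term-frequency dicts."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def detect(doc_tf, clusters, threshold):
    """Step 1: match the document to the most similar cluster by the
    cosine measure.  Step 2: match the document against that cluster's
    confidential and context terms.  Step 3: compare the resulting
    confidentiality score with a threshold."""
    best = max(clusters, key=lambda c: cosine(doc_tf, c["centroid"]))
    score = sum(w for t, w in best["term_scores"].items() if t in doc_tf)
    return score >= threshold, score
```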

The security model, which combines the training phase and the detection phase, is shown in Figure 4.

4. Experiments

In this section, we evaluate the performance of CBDLP on the Reuters-21578 dataset. As the testing dataset, Reuters-21578 consists of 21,578 pieces of news distributed by Reuters in 1987, which are saved in 22 files. The Reuters-21578 dataset is manually classified into five categories, each of which can be subdivided into a different number of subcategories. For example, the news of economy includes the inventories subset, the gold subset, and the money-supply subset.

4.1. Performance Experiments. In the experiments, we present the data leakage prevention method based on the CBDLP model and also present a modified model without the pruning step, which is denoted CBDLP-Pr. Since SVM has been proved to be an excellent classifier with high accuracy, and CoBAn performs well in the scenarios where confidential contents are embedded in nonconfidential documents or rephrased, we compare the performance of CBDLP, CBDLP-Pr, SVM, and CoBAn. We evaluate the performance of the methods with the true positive rate (TPR) and the false positive rate (FPR), and our goal is to maximize TPR and minimize FPR concurrently.
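The two evaluation measures can be computed as follows (a minimal sketch, where `True` marks a confidential document):

```python
def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate.  The goal is to
    maximize TPR while minimizing FPR."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), fp / (fp + tn)
```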


[Figure 3: the three detection steps — the first step matches the test documents to the most similar cluster; the second step matches the confidential terms and their context terms against the cluster's graph; the third step calculates scores for the test documents and determines the confidentiality of each test document.]

Figure 3: The detection of test documents.

[Figure 4: the training phase produces clusters of confidential and nonconfidential documents, confidential terms, context terms, and the graph; the detection phase matches test documents to clusters, matches the confidential terms and their context terms, scores the test documents, and determines confidentiality.]

Figure 4: The security model.

We conduct experiments on the three scenarios described above. For the first scenario, we select the news of "earn" as the carrier for confidential contents and mix them with the news from the other economy subsets as the training dataset and the testing dataset separately. For the second scenario, we extract contents from the documents of the "earn" subset and embed them in documents from other subsets; the embedded portions are to be detected as confidential contents. For the third scenario, we manually rephrase the contents in the documents of the "earn" subset and embed them in documents from other subsets.

4.1.1. Confidential Documents as a Whole. The experimental result of the first scenario is presented in Figure 5. As shown in


[Figure 5: TPR versus FPR curves for CBDLP, CBDLP-Pr, CoBAn, and SVM.]

Figure 5: Performance of detecting confidential documents as a whole.

[Figure 6: TPR versus FPR curves for CBDLP, CBDLP-Pr, CoBAn, and SVM.]

Figure 6: Performance of detecting the confidential contents embedded in nonconfidential documents.

Figure 5, when dealing with the scenario where confidential documents are considered as a whole, the performance of the four detection algorithms shows little difference. In spite of that, CBDLP and CBDLP-Pr still perform slightly better than CoBAn and SVM, which can be explained as follows: the performance of CoBAn is partly influenced by the limitation of k-means, which cannot deal with clusters of various shapes effectively, while SVM focuses only on the confidential terms and ignores the context terms. In this scenario, since the documents containing confidential terms are explicitly detected as confidential documents, the performance of the four methods shows little difference.

4.1.2. Confidential Portions Embedded in Nonconfidential Documents. The result of the second scenario is presented in Figure 6. As shown in Figure 6, when dealing with the scenario where the confidential portions are embedded in nonconfidential documents, CBDLP, CBDLP-Pr, and CoBAn

[Figure 7: TPR versus FPR curves for CBDLP, CBDLP-Pr, CoBAn, and SVM.]

Figure 7: Performance of detecting the rephrased confidential contents embedded in nonconfidential documents.

perform better than SVM, which can be explained by the fact that SVM is deceived by the scenario due to its statistical nature. As expected, the performance of CBDLP is slightly better than that of CBDLP-Pr and CoBAn, owing to its pruning step, which removes the redundant nodes in the graph that might deteriorate the detection results.

In this scenario, confidential portions are extracted from the documents defined as confidential and then embedded in nonconfidential documents whose lengths are at least ten times those of the extracted portions. Due to its statistical nature, SVM incorrectly detects most documents containing confidential portions as nonconfidential, which results in a dramatic decline in its accuracy. Unlike SVM, CBDLP, CBDLP-Pr, and CoBAn take the confidential terms together with their context into account, and most nonconfidential documents containing embedded confidential portions are detected as confidential.

4.1.3. Rephrased Confidential Contents in Nonconfidential Documents. The result of the third scenario is presented in Figure 7. As shown in Figure 7, when dealing with the scenario where the confidential contents are rephrased and embedded in nonconfidential documents, the performance of SVM deteriorates considerably due to its statistical nature. Since the rephrased contents do not deviate much from their original meaning, CBDLP, CBDLP-Pr, and CoBAn perform well. In addition, the performance of CBDLP is better than that of CBDLP-Pr and CoBAn, owing to its pruning step, which removes the redundant nodes in the graph.

In this scenario, the rephrased confidential terms are embedded in nonconfidential documents, which confuses SVM greatly, and most documents containing rephrased confidential contents are incorrectly detected as nonconfidential. Unlike SVM, with the context of confidential terms taken into account, CoBAn detects most documents containing confidential contents; however, the accuracy of CoBAn is partially influenced by the cluster's terms graph, which depends on the quality of the clusters generated by k-means. As a result, CBDLP clusters documents with DBSCAN, which


[Figure 8: line graph of running time (s) versus number of documents (0–5000) for SVM, CoBAn, CBDLP, and CBDLP-Pr.]

Figure 8: The scores of confidential documents and nonconfidential documents.

improves the quality of the clusters and the cluster's terms graph; meanwhile, the pruning method removes the redundant nodes in the graph and further improves the performance of CBDLP.

4.2. Running Time Comparisons. In this experiment, we mixed nonconfidential documents together with the three types of confidential documents: whole confidential documents, confidential contents embedded in nonconfidential documents, and rephrased confidential contents embedded in nonconfidential documents. The experiment is conducted using 10-fold cross-validation. To compare the running times of CBDLP, CBDLP-Pr, CoBAn, and SVM, we conduct the experiment on datasets of different sizes. The result is shown in Figure 8: the running times of the training phase and the testing phase are exhibited as a line graph, in which the running times of CBDLP, CBDLP-Pr, CoBAn, and SVM increase as more documents are added to the dataset. Although the additional steps of CBDLP, CBDLP-Pr, and CoBAn result in longer running times than SVM needs, their running times remain within an order of magnitude of SVM's, while CBDLP detects confidential contents much better than SVM does.
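The running-time measurement for a dataset of a given size can be sketched as follows (an illustrative harness; `detect_fn` stands for any of the four detectors, and the names are ours):

```python
import time

def time_detector(detect_fn, docs):
    """Wall-clock running time (seconds) of a detector over a dataset.
    Repeating this over datasets of increasing size yields the
    running-time curve of Figure 8."""
    start = time.perf_counter()
    for d in docs:
        detect_fn(d)
    return time.perf_counter() - start
```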

5. Conclusion and Future Work

In this paper, we present a new method for data leakage prevention based on the CBDLP model, which has the following advantages:

(1) It clusters the documents with DBSCAN and the cosine measure, which have been verified to be effective.

(2) It represents confidential terms and their context terms in a graph.

(3) It presents a pruning method based on the attribute reduction method of rough set theory.

Up to now, some designated commercial DLP solutions can reduce the risk of most accidental leakage; however, they cannot provide sufficient protection against intentional leakage. Other DLP solutions, such as firewalls, IDS, antimalware software, and management policies, can assist in detecting intrusions or malicious software and can enforce policies to protect data, but they still do not prevent intentional leaks perfectly. To the best of our knowledge, there might be two main future research topics on DLP: data leakage from mobile devices and accidental data leakage by insiders.

Since accidental data leakage may be part of a larger attack, in which its role is mainly to activate an advanced persistent threat inside the organization, it is expected to remain one of the most challenging research topics. Our future work will focus on accidental data leakage in two directions: first, improving the efficiency and effectiveness of CBDLP in detecting confidential contents; second, adjusting the model dynamically according to changes in the training dataset.

Data Availability

All data generated or analysed during this study are includedin this published article

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

The research is supported by the National Natural Sci-ence Foundation of China under Grant nos 61871140 and61572153

References

[1] A Goyal F Bonchi and V S Lakshmanan ldquoOn minimizingbudget and time in influence propagation over social networksrdquoSocial Network Analysis and Mining pp 1ndash14 2012

[2] C Aggarwal Social Network Data Analytics Springer BerlinGermany 2011

[3] ldquoInformation week global security surveyrdquo Information Week2004

[4] D Alassi and R Alhajj ldquoEffectiveness of template detectionon noise reduction and websites summarizationrdquo InformationSciences vol 219 pp 41ndash72 2013

[5] D Holmes Using language models for information retrieval[PhD Thesis] Center for telematies and information technol-ogy University of Twente 2001

[6] R Bohme ldquoSecurity metrics and security investment modelsrdquoin Proceedings of the 5th international conference on advances ininformation and computer security 2010

[7] K R Rao and P Yip Discrete Cosine Transform AlgorithmsAdvantages Applications Academic Press Boston Mass USA1990

[8] S M Katz ldquoEstimation of probabilities from sparse data forthe language model component of a speech recognizerrdquo IEEETransactions on Signal Processing vol 35 no 3 pp 400-4011987

10 Wireless Communications and Mobile Computing

[9] R Jin L Si A G Hauptmann and J Callan ldquoLanguage modelfor IR using collection informationrdquo in Proceedings of the the25th annual international ACM SIGIR conference pp 419-4202002

[10] W W Cohen ldquoLearning rules that classify e-mailrdquo in Proceed-ings of the AAAI Spring Symposium on Machine Learning inInformation Access pp 18ndash25 1996

[11] G Salton A Wong and C S Yang ldquoA vector space model forautomatic indexingrdquo Communications of the ACM vol 18 no11 pp 613ndash620 1975

[12] S Deerwester S T Dumais G W Furnas T K Landauer andR Harshman ldquoIndexing by latent semantic analysisrdquo Journal ofthe Association for Information Science and Technology vol 41no 6 pp 391ndash407 1990

[13] C Ordonez ldquoClustering binary data streams with K-meansrdquoin Proceedings of the 8th ACM SIGMODWorkshop on ResearchIssues in Data Mining and Knowledge Discovery DMKD rsquo03 pp12ndash19 USA June 2003

[14] B Babcock M Datar R Motwani and L OrsquoCallaghan ldquoMain-taining variance and k-medians over data stream windowsrdquoin Proceedings of the Twenty second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems PODS2003 pp 234ndash243 June 2003

[15] T Zhang R Ramakrishnan and M Livny ldquoBIRCH a newdata clustering algorithm and its applicationsrdquoDataMining andKnowledge Discovery vol 1 no 2 pp 141ndash182 1997

[16] S Guha R Rastogi and K Shim ldquoCure an efficient clusteringalgorithm for large databasesrdquo in Proceedings of 1998 ACMSIGMOD International Conference Management of Data pp73ndash84 1998

[17] J W Han and M Kamber Data Mining Concepts and Tech-niques Morgan Kaufmann 2006

[18] A Hinneburg and D A Keim ldquoOptimal grid-clusteringtowards breaking the curse of dimensionality in high-dimen-sional clusteringrdquo in Proceedings of the 25th VLDB Conferencepp 506ndash517 1999

[19] WWang J Yang andRMuntz ldquoSting a statistical informationgrid approach to spatial data miningrdquo in Proceedings of the 23rdVLDB Conference pp 186ndash195 1997

[20] R Agrawal J Gehrke and D Gunopulos ldquoAutomatic sub-space clustering of high dimensional data for data miningapplicationsrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data pp 94ndash105 1998

[21] A McCallum K Nigam and L H Ungar ldquoEfficient clusteringof high-dimensional data sets with application to referencematchingrdquo in Proceedings of the KDD 2000 pp 169ndash178 ACMNew York NY USA 2000

[22] R Hyde P Angelov and A R MacKenzie ldquoFully onlineclustering of evolving data streams into arbitrarily shapedclustersrdquo Information Sciences vol 382-383 pp 96ndash114 2017

[23] F Jiang Y Fu B B Gupta et al ldquoDeep learning based multi-channel intelligent attack detection for data securityrdquo IEEETransactions on Sustainable Computing 2018

[24] R Agrawal and R Srikant ldquoPrivacy-preserving data miningrdquoin Proceedings of the the 2000 ACM SIGMOD internationalconference pp 439ndash450 May 2000

[25] G Salton and C Buckley ldquoTerm-weighting approaches in auto-matic text retrievalrdquo Information Processing ampManagement vol24 no 5 pp 513ndash523 1988

[26] Salton ldquoAutomatic text processing the transformation analysisand retrieval of information by computerrdquo Tech Rep Addison-Wesley Inc 1989

[27] M N Islam M Seera and C K Loo ldquoA robust incrementalclustering-based facial feature trackingrdquo Applied Soft Comput-ing vol 53 pp 34ndash44 2017

[28] A Shabtai and Y Elovici A Survey of Data Leakage Detectionand Prevention Solutions Springer Berlin Germany 2012

[29] S R Kalidindi S R Niezgoda G Landi S Vachhani andT Fast ldquoA novel framework for building materials knowledgesystemsrdquo Computers Materials and Continua vol 17 no 2 pp103ndash125 2010

[30] A J Wang ldquoInformation security models and metricsrdquo inProceedings of the 43rd annual southeast regional conference onACMSE43 pp 178ndash184 2005

[31] P Mitra C A Murthy and S K Pal ldquoUnsupervised featureselection using feature similarityrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 24 no 3 pp 301ndash3122002

[32] M Dash K Choi P Scheuermann and H Liu ldquoFeatureselection for clustering - A filter solutionrdquo in Proceedings of the2nd IEEE International Conference on Data Mining ICDM rsquo02pp 115ndash122 December 2002

[33] C-S Liu ldquoAn analytical method for computing the one-dimensional backward wave problemrdquo Computers Materialsand Continua vol 13 no 3 pp 219ndash234 2010

[34] G Katz Y Elovici and B Shapira ldquoCoban a context basedmodel for data leakage preventionrdquo Information Sciences vol262 pp 137ndash158 2014

[35] X Huang Y Lu D Li and MMa ldquoA novel mechanism for fastdetection of transformed data leakagerdquo IEEE Access vol 1 pp1ndash11 2018

[36] M Porter ldquoThe porter stemming algorithmrdquo 2006[37] F Pacheco M Cerrada R-V Sanchez D Cabrera C Li and J

Valente de Oliveira ldquoAttribute clustering using rough set theoryfor feature selection in fault severity classification of rotatingmachineryrdquoExpert Systems with Applications vol 71 pp 69ndash862017

[38] ldquoInformation week global security surveyrdquo Information Week2004

[39] C M Praba ldquoA technical review on data leakage detection andprevention approachesrdquo Journal of Network Communicationsand Emerging Technologies (JNCET) 2017

[40] F Ullah M Edwards R Ramdhany R Chitchyan M ABabar and A Rashid ldquoData exfiltration A review of externalattack vectors and countermeasuresrdquo Journal of Network andComputer Applications 2017

[41] K Thomas F Li A Zand et al ldquoData Breaches phishing ormalware understanding the risks of stolen credentialsrdquo in Pro-ceedings of the 24th ACM SIGSAC Conference on Computer andCommunications Security CCS rsquo17 pp 1421ndash1434 November2017

[42] J DeBlasio S Savage G M Voelker and A C SnoerenldquoTripwire Inferring internet site compromiserdquo in Proceedingsof the IMC rsquo17 pp 1ndash14 2017

[43] W Xu S Xiang and V Sachnev ldquoA cryptograph domainimage retrieval method based on paillier homomorphic blockencryptionrdquo Computers Materials and Continua pp 1ndash11 2018

[44] J Cui Y Zhang Z Cai A Liu and Y Li ldquoSecuring displaypath for security-sensitive applications on mobile devicesrdquoComputers Materials and Continua vol 55 no 1 pp 17ndash352018

[45] D Wang W Li and P Wang ldquoMeasuring two-factor authenti-cation schemes for real-time data access in industrial wireless

Wireless Communications and Mobile Computing 11

sensor networksrdquo IEEE Transactions on Industrial Informaticspp 1ndash12 2018

[46] D Wang H Cheng P Wang J Yan and X Huang ldquoA securityanalysis of honeywordsrdquo in Proceedings of the Network andDistributed Systems Security (NDSS) Symposium pp 18ndash212018


6 Wireless Communications and Mobile Computing

Input: C - the confidential documents set; N - the nonconfidential documents set; CR - the clusters set, each with its centroid and corresponding graph; ContextT - the set of context terms.
Output: CT′ - the updated set of confidential terms in clusters; ContextT′ - the updated set of context terms.

(1) irr_terms ← CT ∪ ContextT
(2) for each t in irr_terms:
(3)   if Impt(t) = (|PosT(C_cr ∧ N_cr)| − |PosT−t(C_cr ∧ N_cr)|) / |cr| < Impt_threshold:
(4)     irr_terms ← irr_terms − {t}
(5) for each t in irr_terms:
(6)   CT′ ← update confidential terms in CT
(7)   ContextT′ ← update context terms in ContextT

Algorithm 2: Pruning.
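As a rough illustration of the pruning idea, the sketch below drops terms whose removal barely hurts (or even helps) the separation of confidential from nonconfidential documents. The separability rule, function names, and threshold here are simplified stand-ins for Impt(t) and the cluster statistics of Algorithm 2, not the authors' implementation.

```python
from typing import List, Set

def prune_terms(terms: Set[str],
                docs: List[Set[str]],    # term sets of documents in the cluster
                labels: List[bool],      # True = confidential, False = nonconfidential
                threshold: float) -> Set[str]:
    """Keep only terms whose removal would lower the cluster's ability to
    separate confidential from nonconfidential documents by at least
    `threshold` (a simplified stand-in for Impt(t))."""

    def separability(active: Set[str]) -> int:
        # Count documents whose label agrees with the naive rule:
        # "confidential iff the document contains any active term".
        return sum(bool(active & d) == y for d, y in zip(docs, labels))

    base = separability(terms)
    kept = set()
    for t in terms:
        impt = (base - separability(terms - {t})) / max(len(docs), 1)
        if impt >= threshold:  # low-importance (redundant or noise) terms are dropped
            kept.add(t)
    return kept
```

Terms such as "weather" below are pure noise (their removal improves separability), so their importance is negative and they are pruned.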

Figure 2: An example of confidential nodes connected through their context nodes (two confidential terms, with scores 10 and 5, linked to context terms by edges with scores between −0.3 and 0.6).

3.4. Detection Phase. Obviously, a confidential document without any modification is easy to detect from its confidential terms. However, confidential documents that are rephrased, or partitioned into portions and concealed in different nonconfidential documents, as plagiarizers often do, can hardly be detected. Once confidential content detection fails, data leakage or copyright infringement is likely to follow.

In the detection phase, we employ the CBDLP model to deal with three scenarios that could possibly occur. The three scenarios are described as follows:

(i) Each confidential document is detected as a whole.

(ii) Each confidential document is divided into portions and embedded in nonconfidential ones.

(iii) The confidential terms in confidential documents are completely rephrased.

The detection method we employ includes three steps, as shown in Figure 3:

(1) Classify the documents to be tested into the corresponding clusters.

(2) Identify the confidential terms and their context terms according to the graphs of the corresponding clusters.

(3) Calculate the confidentiality scores for the documents and conclude whether each document is confidential or not.
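The three steps above can be sketched as follows. The term weights, the graph encoding (confidential term → weighted context terms), and the threshold are illustrative assumptions, not the paper's exact scoring formula.

```python
import math
from typing import Dict, List

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def detect(doc_tf: Dict[str, float],
           centroids: List[Dict[str, float]],
           graphs: List[Dict[str, Dict[str, float]]],  # per cluster: conf. term -> {context term: weight}
           score_threshold: float) -> bool:
    # Step 1: match the test document to the most similar cluster.
    best = max(range(len(centroids)), key=lambda i: cosine(doc_tf, centroids[i]))
    # Steps 2-3: match confidential terms plus context terms in that cluster's
    # graph, accumulate a confidentiality score, and compare to the threshold.
    score = 0.0
    for conf_term, context in graphs[best].items():
        if conf_term in doc_tf:
            score += 1.0 + sum(w for ctx, w in context.items() if ctx in doc_tf)
    return score >= score_threshold
```

A document containing a confidential term together with its context terms scores higher than one containing the term in isolation, which is what lets the model flag embedded portions.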

The security model, which combines the training phase and the detection phase, is shown in Figure 4.

4 Experiments

In this section, we evaluate the performance of CBDLP on the Reuters-21578 dataset. Used as the testing dataset, Reuters-21578 consists of 21,578 news items distributed by Reuters in 1987, saved in 22 files. The Reuters-21578 dataset is manually classified into five categories, each of which is subdivided into a different number of subcategories. For example, the economy news includes the inventories, gold, and money-supply subsets.

4.1. Performance Experiments. In the experiments, we evaluate the data leakage prevention method based on the CBDLP model and also a modified model without the pruning step, denoted CBDLP-Pr. Since SVM has been proved to be an excellent classifier with high accuracy, and CoBAn performs well in the scenario where confidential contents are embedded in nonconfidential documents or rephrased, we compare the performance of CBDLP, CBDLP-Pr, SVM, and CoBAn. We evaluate the methods in this paper by true positive rate (TPR) and false positive rate (FPR), and our goal is to maximize TPR while minimizing FPR.
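TPR and FPR follow the standard definitions from the confusion counts; a small helper (nothing specific to CBDLP is assumed here):

```python
from typing import List, Tuple

def tpr_fpr(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """True-positive and false-positive rates for binary labels
    (1 = confidential, 0 = nonconfidential)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```

Sweeping the detection threshold and plotting (FPR, TPR) pairs yields the ROC-style curves shown in Figures 5-7.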


Figure 3: The detection of test documents (three steps: match the test documents to the most similar cluster; match the confidential terms and their context terms via the cluster's graph; calculate scores for the test documents and determine the confidentiality of each one).

Figure 4: The security model, combining the training phase (clusters of confidential and nonconfidential documents, confidential terms, context terms, and the graph) with the detection phase (matching test documents to clusters, matching confidential and context terms, scoring, and determining confidentiality).

We conduct experiments on the three scenarios described above. For the first scenario, we select the news of "earn" as the carrier for confidential contents and mix them with news from the other economy subsets as the training dataset and the testing dataset, respectively. For the second scenario, we extract contents from documents of the "earn" subset and embed them in documents from other subsets; the embedded portions should be detected as confidential contents. For the third scenario, we manually rephrase contents in documents of the "earn" subset and embed them in documents from other subsets.
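The construction of the second scenario's test documents can be sketched as below. The sentence splitting and placement policy are illustrative assumptions; the paper does not specify them in this detail.

```python
import random

def embed_portion(confidential: str, carrier: str, seed: int = 0) -> str:
    """Scenario (ii) construction sketch: pick one sentence-sized portion of a
    confidential document and hide it inside a nonconfidential carrier
    document. The splitting on '. ' is deliberately naive."""
    rng = random.Random(seed)
    portion = rng.choice([s for s in confidential.split(". ") if s])
    sentences = [s for s in carrier.split(". ") if s]
    pos = rng.randrange(len(sentences) + 1)  # any position, including the ends
    sentences.insert(pos, portion)
    return ". ".join(sentences)
```

Labeling such mixed documents as confidential gives the ground truth used to score the detectors in Figure 6.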

4.1.1. Confidential Documents as a Whole. The experimental result of the first scenario is presented in Figure 5. As shown in Figure 5, when dealing with the scenario where confidential documents are considered as a whole, the performance of the four detection algorithms differs little. In spite of that, CBDLP and CBDLP-Pr still perform slightly better than CoBAn and SVM, which can be explained by the fact that the performance of CoBAn is partly limited by k-means, which cannot deal with clusters of various shapes effectively, and that SVM focuses only on the confidential terms and ignores the context terms. In this scenario, since the documents containing confidential terms are explicitly detected as confidential documents, the four methods perform almost equally.

Figure 5: Performance of detecting confidential documents as a whole (TPR versus FPR for CBDLP, CBDLP-Pr, CoBAn, and SVM).

Figure 6: Performance of detecting the confidential contents embedded in nonconfidential documents (TPR versus FPR for CBDLP, CBDLP-Pr, CoBAn, and SVM).

4.1.2. Confidential Portions Embedded in Nonconfidential Documents. The result of the second scenario is presented in Figure 6. As shown in Figure 6, when dealing with the scenario where the confidential portions are embedded in nonconfidential documents, CBDLP, CBDLP-Pr, and CoBAn perform better than SVM, which can be explained by SVM being deceived by this scenario due to its statistical nature. As expected, the performance of CBDLP is slightly better than that of CBDLP-Pr and CoBAn due to its pruning step, which removes the redundant nodes in the graph that might deteriorate the detection results.

Figure 7: Performance of detecting the rephrased confidential contents embedded in nonconfidential documents (TPR versus FPR for CBDLP, CBDLP-Pr, CoBAn, and SVM).

In this scenario, confidential portions are extracted from the documents defined as confidential and then embedded in nonconfidential documents whose length is at least ten times that of the extracted portions. Due to its statistical nature, SVM incorrectly detects most documents containing confidential portions as nonconfidential, which results in a dramatic decline in its accuracy. Unlike SVM, CBDLP, CBDLP-Pr, and CoBAn take the confidential terms together with their context into account, and most nonconfidential documents containing embedded confidential portions are detected as confidential.

4.1.3. Rephrased Confidential Contents in Nonconfidential Documents. The result of the third scenario is presented in Figure 7. As shown in Figure 7, when dealing with the scenario where the confidential contents are rephrased and embedded in nonconfidential documents, the performance of SVM deteriorates considerably due to its statistical nature. Since the rephrased contents do not deviate much from their original meaning, CBDLP, CBDLP-Pr, and CoBAn perform well. In addition, the performance of CBDLP is better than that of CBDLP-Pr and CoBAn due to its pruning step, which removes the redundant nodes in the graph.

In this scenario, the rephrased confidential terms are embedded in nonconfidential documents, which greatly confuses SVM, and most documents containing rephrased confidential contents are incorrectly detected as nonconfidential. Unlike SVM, with the context of confidential terms taken into account, CoBAn detects most documents containing confidential contents; however, the accuracy of CoBAn is partially influenced by the cluster's terms graph, which depends on the quality of the clusters generated by k-means. In contrast, CBDLP clusters documents with DBSCAN, which improves the quality of the clusters and of the cluster's terms graph; meanwhile, the pruning method removes the redundant nodes in the graph and further improves the performance of CBDLP.

Figure 8: Running time (in seconds) versus number of documents for SVM, CoBAn, CBDLP, and CBDLP-Pr.

4.2. Running Time Comparisons. In this experiment, we mixed nonconfidential documents together with the three types of confidential documents: whole confidential documents, confidential contents embedded in nonconfidential documents, and rephrased confidential contents embedded in nonconfidential documents. The experiment is conducted using 10-fold cross validation. To compare the running time of CBDLP, CBDLP-Pr, CoBAn, and SVM, we conduct the experiment on datasets of different sizes. The results are shown in Figure 8: the running times of the training phase and the testing phase are exhibited as line graphs, in which the running times of CBDLP, CBDLP-Pr, CoBAn, and SVM all increase as more documents are added to the dataset. Although the additional steps of CBDLP, CBDLP-Pr, and CoBAn cost roughly an order of magnitude more running time than SVM needs, CBDLP performs much better than SVM in detection.
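The 10-fold split and timing loop can be sketched as follows; plain sequential folds and a wall-clock timer are assumptions, since the paper does not describe the harness in this detail.

```python
import time
from typing import Callable, List, Tuple

def kfold_indices(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Plain k-fold split over n document indices (no shuffling):
    returns (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

def timed(fn: Callable[[], object]) -> float:
    """Wall-clock seconds for one call; used to build running-time curves."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start
```

Running `timed` around the training and testing calls of each model, for each dataset size, yields the points plotted in Figure 8.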

5 Conclusion and Future Work

In this paper, we present a new method for data leakage prevention based on the CBDLP model, which has the following advantages:

(1) It clusters the documents with DBSCAN and the cosine measure, which have been verified to be effective.

(2) It represents confidential terms and their context terms in a graph.

(3) It presents a pruning method based on the attribute reduction method of rough set theory.
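Advantage (1) can be illustrated with a textbook DBSCAN over cosine distance on sparse term-frequency vectors. This is a minimal pure-Python sketch under assumed eps/min_pts parameters, not the authors' implementation.

```python
import math
from typing import Dict, List

def cosine_dist(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine distance (1 - cosine similarity) of sparse term vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - (dot / (nu * nv) if nu and nv else 0.0)

def dbscan(docs: List[Dict[str, float]], eps: float, min_pts: int) -> List[int]:
    """Textbook DBSCAN; returns a cluster id per document, -1 for noise."""
    labels: List[object] = [None] * len(docs)
    cid = 0
    for i in range(len(docs)):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(len(docs)) if cosine_dist(docs[i], docs[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1            # provisionally noise
            continue
        labels[i] = cid               # i is a core point: start a cluster
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid       # noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = [k for k in range(len(docs)) if cosine_dist(docs[j], docs[k]) <= eps]
            if len(jn) >= min_pts:    # expand only from core points
                queue.extend(k for k in jn if labels[k] is None)
        cid += 1
    return labels
```

Because cosine distance ignores vector length, documents about the same topic cluster together regardless of their length, and arbitrarily shaped clusters are handled without fixing the number of clusters in advance.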

Up to now, some dedicated commercial DLP solutions can reduce the risk of most accidental leakage; however, they cannot provide sufficient protection against intentional leakage. Other DLP solutions, such as firewalls, IDS, antimalware software, and management policies, can assist in detecting intrusions or malicious software and enforce policies to protect data, but they still do not prevent intentional leaks perfectly. To the best of our knowledge, there are two main future research topics on DLP: data leakage from mobile devices and accidental data leakage by insiders.

Since accidental data leakage may be part of a larger attack, in which its role will mainly be to activate an advanced persistent threat inside the organization, it is expected to remain one of the most challenging research topics. Our future work will focus on accidental data leakage in two directions: first, improving the efficiency and effectiveness of CBDLP in confidential content detection; second, adjusting the model dynamically according to changes in the training dataset.

Data Availability

All data generated or analysed during this study are included in this published article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research is supported by the National Natural Science Foundation of China under Grants nos. 61871140 and 61572153.

References

[1] A. Goyal, F. Bonchi, and V. S. Lakshmanan, "On minimizing budget and time in influence propagation over social networks," Social Network Analysis and Mining, pp. 1–14, 2012.

[2] C. Aggarwal, Social Network Data Analytics, Springer, Berlin, Germany, 2011.

[3] "Information week global security survey," Information Week, 2004.

[4] D. Alassi and R. Alhajj, "Effectiveness of template detection on noise reduction and websites summarization," Information Sciences, vol. 219, pp. 41–72, 2013.

[5] D. Holmes, Using language models for information retrieval [Ph.D. thesis], Center for Telematics and Information Technology, University of Twente, 2001.

[6] R. Bohme, "Security metrics and security investment models," in Proceedings of the 5th International Conference on Advances in Information and Computer Security, 2010.

[7] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Boston, Mass, USA, 1990.

[8] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Transactions on Signal Processing, vol. 35, no. 3, pp. 400–401, 1987.

[9] R. Jin, L. Si, A. G. Hauptmann, and J. Callan, "Language model for IR using collection information," in Proceedings of the 25th Annual International ACM SIGIR Conference, pp. 419–420, 2002.

[10] W. W. Cohen, "Learning rules that classify e-mail," in Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, pp. 18–25, 1996.

[11] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613–620, 1975.

[12] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the Association for Information Science and Technology, vol. 41, no. 6, pp. 391–407, 1990.

[13] C. Ordonez, "Clustering binary data streams with K-means," in Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD '03, pp. 12–19, USA, June 2003.

[14] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," in Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2003, pp. 234–243, June 2003.

[15] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: a new data clustering algorithm and its applications," Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141–182, 1997.

[16] S. Guha, R. Rastogi, and K. Shim, "Cure: an efficient clustering algorithm for large databases," in Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 73–84, 1998.

[17] J. W. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[18] A. Hinneburg and D. A. Keim, "Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering," in Proceedings of the 25th VLDB Conference, pp. 506–517, 1999.

[19] W. Wang, J. Yang, and R. Muntz, "Sting: a statistical information grid approach to spatial data mining," in Proceedings of the 23rd VLDB Conference, pp. 186–195, 1997.

[20] R. Agrawal, J. Gehrke, and D. Gunopulos, "Automatic subspace clustering of high dimensional data for data mining applications," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 94–105, 1998.

[21] A. McCallum, K. Nigam, and L. H. Ungar, "Efficient clustering of high-dimensional data sets with application to reference matching," in Proceedings of KDD 2000, pp. 169–178, ACM, New York, NY, USA, 2000.

[22] R. Hyde, P. Angelov, and A. R. MacKenzie, "Fully online clustering of evolving data streams into arbitrarily shaped clusters," Information Sciences, vol. 382-383, pp. 96–114, 2017.

[23] F. Jiang, Y. Fu, B. B. Gupta et al., "Deep learning based multi-channel intelligent attack detection for data security," IEEE Transactions on Sustainable Computing, 2018.

[24] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the 2000 ACM SIGMOD International Conference, pp. 439–450, May 2000.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.

[26] G. Salton, "Automatic text processing: the transformation, analysis and retrieval of information by computer," Tech. Rep., Addison-Wesley Inc., 1989.

[27] M. N. Islam, M. Seera, and C. K. Loo, "A robust incremental clustering-based facial feature tracking," Applied Soft Computing, vol. 53, pp. 34–44, 2017.

[28] A. Shabtai and Y. Elovici, A Survey of Data Leakage Detection and Prevention Solutions, Springer, Berlin, Germany, 2012.

[29] S. R. Kalidindi, S. R. Niezgoda, G. Landi, S. Vachhani, and T. Fast, "A novel framework for building materials knowledge systems," Computers, Materials and Continua, vol. 17, no. 2, pp. 103–125, 2010.

[30] A. J. Wang, "Information security models and metrics," in Proceedings of the 43rd Annual Southeast Regional Conference, ACMSE 43, pp. 178–184, 2005.

[31] P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301–312, 2002.

[32] M. Dash, K. Choi, P. Scheuermann, and H. Liu, "Feature selection for clustering - a filter solution," in Proceedings of the 2nd IEEE International Conference on Data Mining, ICDM '02, pp. 115–122, December 2002.

[33] C.-S. Liu, "An analytical method for computing the one-dimensional backward wave problem," Computers, Materials and Continua, vol. 13, no. 3, pp. 219–234, 2010.

[34] G. Katz, Y. Elovici, and B. Shapira, "Coban: a context based model for data leakage prevention," Information Sciences, vol. 262, pp. 137–158, 2014.

[35] X. Huang, Y. Lu, D. Li, and M. Ma, "A novel mechanism for fast detection of transformed data leakage," IEEE Access, vol. 1, pp. 1–11, 2018.

[36] M. Porter, "The Porter stemming algorithm," 2006.

[37] F. Pacheco, M. Cerrada, R.-V. Sanchez, D. Cabrera, C. Li, and J. Valente de Oliveira, "Attribute clustering using rough set theory for feature selection in fault severity classification of rotating machinery," Expert Systems with Applications, vol. 71, pp. 69–86, 2017.

[38] "Information week global security survey," Information Week, 2004.

[39] C. M. Praba, "A technical review on data leakage detection and prevention approaches," Journal of Network Communications and Emerging Technologies (JNCET), 2017.

[40] F. Ullah, M. Edwards, R. Ramdhany, R. Chitchyan, M. A. Babar, and A. Rashid, "Data exfiltration: a review of external attack vectors and countermeasures," Journal of Network and Computer Applications, 2017.

[41] K. Thomas, F. Li, A. Zand et al., "Data breaches, phishing, or malware: understanding the risks of stolen credentials," in Proceedings of the 24th ACM SIGSAC Conference on Computer and Communications Security, CCS '17, pp. 1421–1434, November 2017.

[42] J. DeBlasio, S. Savage, G. M. Voelker, and A. C. Snoeren, "Tripwire: inferring internet site compromise," in Proceedings of IMC '17, pp. 1–14, 2017.

[43] W. Xu, S. Xiang, and V. Sachnev, "A cryptograph domain image retrieval method based on Paillier homomorphic block encryption," Computers, Materials and Continua, pp. 1–11, 2018.

[44] J. Cui, Y. Zhang, Z. Cai, A. Liu, and Y. Li, "Securing display path for security-sensitive applications on mobile devices," Computers, Materials and Continua, vol. 55, no. 1, pp. 17–35, 2018.

[45] D. Wang, W. Li, and P. Wang, "Measuring two-factor authentication schemes for real-time data access in industrial wireless sensor networks," IEEE Transactions on Industrial Informatics, pp. 1–12, 2018.

[46] D. Wang, H. Cheng, P. Wang, J. Yan, and X. Huang, "A security analysis of honeywords," in Proceedings of the Network and Distributed Systems Security (NDSS) Symposium, pp. 18–21, 2018.


Page 7: New A Data Leakage Prevention Method Based on the Reduction of …downloads.hindawi.com/journals/wcmc/2018/5823439.pdf · 2019. 7. 30. · A Data Leakage Prevention Method Based on

Wireless Communications and Mobile Computing 7

Figure 3: The detection of test documents. The first step matches the test documents to the most similar cluster; the second step matches the confidential terms and their context terms against the cluster's graph; the third step calculates scores for the test documents and determines the confidentiality of each test document.
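The three detection steps summarized in Figure 3 can be sketched as follows. This is a minimal illustration only: the cluster layout, the score formula, and the 0.5 context weight are assumptions for exposition, not the exact CBDLP computation.

```python
from math import sqrt

def cosine(a, b):
    # cosine similarity between two sparse term-frequency dicts
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def detect(doc_tf, clusters, threshold):
    """clusters: list of dicts with a 'centroid' (tf dict) and
    'confidential' / 'context' term sets (hypothetical layout)."""
    # Step 1: match the test document to the most similar cluster.
    best = max(clusters, key=lambda c: cosine(doc_tf, c["centroid"]))
    # Step 2: match the confidential terms and their context terms.
    conf_hits = [t for t in doc_tf if t in best["confidential"]]
    ctx_hits = [t for t in doc_tf if t in best["context"]]
    # Step 3: score the document and decide its confidentiality.
    score = len(conf_hits) + 0.5 * len(ctx_hits)
    return score, score >= threshold
```

A document is flagged once its score reaches the (hypothetical) threshold, so context terms alone cannot trigger a detection as strongly as confidential terms do.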

Figure 4: The security model. In the training phase, clusters of confidential and nonconfidential documents yield a graph of confidential terms and their context terms. In the detection phase, test documents are matched to clusters, the confidential terms and their context terms are matched, scores are computed for the test documents, and confidentiality is determined.

We conduct experiments on the three scenarios described above. For the first scenario, we select the news of the "earn" subset as the carrier for confidential contents and mix it with news from the other economy subsets to form the training dataset and the testing dataset separately. For the second scenario, we extract contents from the documents of the "earn" subset and embed them in documents from the other subsets; the embedded portions are the confidential contents to be detected. For the third scenario, we manually rephrase the contents of the "earn" documents and embed them in documents from the other subsets.
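The construction of the second-scenario test documents can be illustrated as below; `embed_excerpt` is a hypothetical helper written for this sketch, not the authors' actual preprocessing tool.

```python
import random

def embed_excerpt(confidential_doc, carrier_doc, excerpt_len=30, seed=0):
    """Build a second-scenario test document: take a contiguous excerpt
    from a confidential document and splice it into a nonconfidential
    carrier at a random position."""
    rng = random.Random(seed)
    conf_words = confidential_doc.split()
    carrier_words = carrier_doc.split()
    start = rng.randrange(max(1, len(conf_words) - excerpt_len + 1))
    excerpt = conf_words[start:start + excerpt_len]
    pos = rng.randrange(len(carrier_words) + 1)
    mixed = carrier_words[:pos] + excerpt + carrier_words[pos:]
    return " ".join(mixed), excerpt
```

Each resulting document is labeled confidential, since it contains a confidential excerpt, even though most of its text is nonconfidential carrier material.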

4.1.1. Confidential Documents as a Whole. The experimental result of the first scenario is presented in Figure 5.


Figure 5: Performance of detecting confidential documents as a whole (ROC curves, TPR versus FPR, for CBDLP, CBDLP-Pr, CoBAn, and SVM).

Figure 6: Performance of detecting the confidential contents embedded in nonconfidential documents (ROC curves, TPR versus FPR, for CBDLP, CBDLP-Pr, CoBAn, and SVM).

As shown in Figure 5, when dealing with the scenario where confidential documents are considered as a whole, the four detection algorithms perform almost equally well. In spite of that, CBDLP and CBDLP-Pr still perform slightly better than CoBAn and SVM: the performance of CoBAn is partly limited by k-means, which cannot handle clusters of various shapes effectively, and SVM focuses only on the confidential terms while ignoring the context terms. Since documents containing confidential terms are explicitly detected as confidential in this scenario, the differences among the four methods remain small.
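The ROC curves in Figures 5-7 plot TPR against FPR as the decision threshold sweeps over the confidentiality scores. A minimal sketch of that computation (the score values and labels below are purely illustrative):

```python
def roc_points(scores, labels):
    """Compute (FPR, TPR) pairs as the decision threshold sweeps over
    the observed confidentiality scores (labels: True = confidential)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and not y)
        points.append((fp / neg, tp / pos))
    return points
```

A detector whose curve hugs the top-left corner, such as CBDLP in these figures, achieves high TPR at low FPR.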

4.1.2. Confidential Portions Embedded in Nonconfidential Documents. The result of the second scenario is presented in Figure 6. As shown in Figure 6, when dealing with the scenario where the confidential portions are embedded in nonconfidential documents, CBDLP, CBDLP-Pr, and CoBAn

Figure 7: Performance of detecting the rephrased confidential contents embedded in nonconfidential documents (ROC curves, TPR versus FPR, for CBDLP, CBDLP-Pr, CoBAn, and SVM).

perform better than SVM, which can be explained by the fact that SVM is deceived in this scenario due to its statistical nature. As expected, the performance of CBDLP is slightly better than that of CBDLP-Pr and CoBAn due to its pruning step, which removes redundant nodes in the graph that might deteriorate the detection results.

In this scenario, confidential portions are extracted from the documents defined as confidential and then embedded in nonconfidential documents whose length is at least ten times that of the extracted portions. Due to its statistical nature, SVM incorrectly detects most documents containing confidential portions as nonconfidential, which results in a dramatic decline in its accuracy. Unlike SVM, CBDLP, CBDLP-Pr, and CoBAn take the confidential terms together with their context into account, so most nonconfidential documents containing embedded confidential portions are detected as confidential.

4.1.3. Rephrased Confidential Contents in Nonconfidential Documents. The result of the third scenario is presented in Figure 7. As shown in Figure 7, when dealing with the scenario where the confidential contents are rephrased and embedded in nonconfidential documents, the performance of SVM deteriorates considerably due to its statistical nature. Since the rephrased contents do not deviate much from their original meaning, CBDLP, CBDLP-Pr, and CoBAn perform well. In addition, the performance of CBDLP is better than that of CBDLP-Pr and CoBAn due to its pruning step, which removes redundant nodes in the graph.

In this scenario, the rephrased confidential terms embedded in nonconfidential documents greatly confuse SVM, and most documents containing rephrased confidential contents are incorrectly detected as nonconfidential. Unlike SVM, CoBAn takes the context of confidential terms into account and detects most documents containing confidential contents; however, its accuracy is partially limited by the cluster's terms graph, which depends on the quality of the clusters generated by k-means. In contrast, CBDLP clusters documents with DBSCAN, which


Figure 8: Running time (in seconds) of CBDLP, CBDLP-Pr, CoBAn, and SVM as the number of documents grows from 0 to 5000.

improves the quality of the clusters and of the cluster's terms graph; meanwhile, the pruning method removes redundant nodes in the graph and further improves the performance of CBDLP.
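The DBSCAN clustering credited above for improving cluster quality can be sketched in pure Python over a cosine distance; the `eps` and `min_pts` values are illustrative, and a real system would use an optimized implementation.

```python
from math import sqrt

def cosine_dist(a, b):
    # cosine distance between two dense vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return 1.0 - (dot / (na * nb) if na and nb else 0.0)

def dbscan(vectors, eps, min_pts):
    """Minimal DBSCAN; returns a cluster label per vector (-1 = noise).
    Unlike k-means, it finds arbitrarily shaped clusters and needs no
    preset cluster count."""
    n = len(vectors)
    labels = [None] * n
    neighbors = [
        [j for j in range(n) if cosine_dist(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point attached to this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: expand
    return labels
```

Because cluster membership grows by density rather than by distance to a centroid, elongated or irregular document clusters that break k-means are handled naturally.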

4.2. Running Time Comparisons. In this experiment, we mix nonconfidential documents with the three types of confidential documents: whole confidential documents, confidential contents embedded in nonconfidential documents, and rephrased confidential contents embedded in nonconfidential documents. The experiment is conducted using 10-fold cross validation. To compare the running time of CBDLP, CBDLP-Pr, CoBAn, and SVM, we run the experiment on datasets of different sizes. The result is shown in Figure 8: the running times of the training and testing phases are plotted as line graphs, in which the running time of CBDLP, CBDLP-Pr, CoBAn, and SVM increases as more documents are added to the dataset. Although the additional steps of CBDLP, CBDLP-Pr, and CoBAn make their running time an order of magnitude longer than that of SVM, CBDLP detects confidential contents much more accurately than SVM does.
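The experimental setup above (10-fold cross validation plus wall-clock timing over growing datasets) can be sketched as follows; `kfold_indices` and `timed` are illustrative helpers, not the authors' harness.

```python
import time

def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross validation."""
    fold = [i % k for i in range(n)]
    for f in range(k):
        test = [i for i in range(n) if fold[i] == f]
        train = [i for i in range(n) if fold[i] != f]
        yield train, test

def timed(fn, *args):
    """Wall-clock a single call, as in the running-time comparison."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0
```

Running `timed(train_model, train_docs)` and `timed(detect_all, test_docs)` for each fold and each dataset size would produce the kind of per-method timing curves plotted in Figure 8.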

5. Conclusion and Future Work

In this paper, we present a new method for data leakage prevention based on the CBDLP model, which has the following advantages:

(1) It clusters the documents with DBSCAN and the cosine measure, which have been verified to be effective.

(2) It represents confidential terms and their context terms in a graph.

(3) It presents a pruning method based on the attribute reduction method of rough set theory.
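The rough-set attribute reduction underlying the pruning method of advantage (3) can be sketched as a dependency-degree computation with greedy removal of redundant attributes; this is a simplified illustration of the idea, not the exact reduction used in CBDLP.

```python
from collections import defaultdict

def dependency(rows, attrs, decision):
    """Rough-set dependency degree: fraction of rows whose equivalence
    class under `attrs` is consistent with the decision attribute."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in attrs)].add(row[decision])
    consistent = sum(
        1 for row in rows
        if len(groups[tuple(row[a] for a in attrs)]) == 1
    )
    return consistent / len(rows)

def greedy_reduct(rows, attrs, decision):
    """Drop attributes whose removal keeps the dependency degree intact,
    mirroring the pruning idea of removing redundant terms."""
    full = dependency(rows, attrs, decision)
    reduct = list(attrs)
    for a in attrs:
        trial = [x for x in reduct if x != a]
        if trial and dependency(rows, trial, decision) >= full:
            reduct = trial
    return reduct
```

Applied to term/confidentiality tables, such a reduction discards terms that add no discriminative power, which is the role the pruning step plays in the cluster graph.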

Up to now, some designated commercial DLP solutions can reduce the risk of most accidental leakage; however, they cannot provide sufficient protection against intentional leakage. Other DLP solutions, such as firewalls, IDS, antimalware software, and management policies, can assist in detecting intrusions or malicious software and can enforce policies to protect data, but they still do not prevent intentional leaks perfectly. To the best of our knowledge, there are two main future research topics in DLP: data leakage from mobile devices and accidental data leakage by insiders.

Since accidental data leakage may be part of a larger attack, in which its role is mainly to activate an advanced persistent threat inside the organization, it is expected to remain one of the most challenging research topics. Our future work will focus on accidental data leakage in two directions: first, improving the efficiency and effectiveness of CBDLP in detecting confidential contents; second, adjusting the model dynamically according to changes in the training dataset.

Data Availability

All data generated or analysed during this study are included in this published article.

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

The research is supported by the National Natural Science Foundation of China under Grant nos. 61871140 and 61572153.

References

[1] A. Goyal, F. Bonchi, and V. S. Lakshmanan, "On minimizing budget and time in influence propagation over social networks," Social Network Analysis and Mining, pp. 1–14, 2012.

[2] C. Aggarwal, Social Network Data Analytics, Springer, Berlin, Germany, 2011.

[3] "Information week global security survey," Information Week, 2004.

[4] D. Alassi and R. Alhajj, "Effectiveness of template detection on noise reduction and websites summarization," Information Sciences, vol. 219, pp. 41–72, 2013.

[5] D. Holmes, Using Language Models for Information Retrieval [Ph.D. thesis], Center for Telematics and Information Technology, University of Twente, 2001.

[6] R. Bohme, "Security metrics and security investment models," in Proceedings of the 5th International Conference on Advances in Information and Computer Security, 2010.

[7] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Boston, Mass, USA, 1990.

[8] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Transactions on Signal Processing, vol. 35, no. 3, pp. 400–401, 1987.

[9] R. Jin, L. Si, A. G. Hauptmann, and J. Callan, "Language model for IR using collection information," in Proceedings of the 25th Annual International ACM SIGIR Conference, pp. 419–420, 2002.

[10] W. W. Cohen, "Learning rules that classify e-mail," in Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, pp. 18–25, 1996.

[11] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613–620, 1975.

[12] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the Association for Information Science and Technology, vol. 41, no. 6, pp. 391–407, 1990.

[13] C. Ordonez, "Clustering binary data streams with K-means," in Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD '03), pp. 12–19, USA, June 2003.

[14] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," in Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2003), pp. 234–243, June 2003.

[15] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: a new data clustering algorithm and its applications," Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141–182, 1997.

[16] S. Guha, R. Rastogi, and K. Shim, "CURE: an efficient clustering algorithm for large databases," in Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 73–84, 1998.

[17] J. W. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.

[18] A. Hinneburg and D. A. Keim, "Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering," in Proceedings of the 25th VLDB Conference, pp. 506–517, 1999.

[19] W. Wang, J. Yang, and R. Muntz, "STING: a statistical information grid approach to spatial data mining," in Proceedings of the 23rd VLDB Conference, pp. 186–195, 1997.

[20] R. Agrawal, J. Gehrke, and D. Gunopulos, "Automatic subspace clustering of high dimensional data for data mining applications," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 94–105, 1998.

[21] A. McCallum, K. Nigam, and L. H. Ungar, "Efficient clustering of high-dimensional data sets with application to reference matching," in Proceedings of KDD 2000, pp. 169–178, ACM, New York, NY, USA, 2000.

[22] R. Hyde, P. Angelov, and A. R. MacKenzie, "Fully online clustering of evolving data streams into arbitrarily shaped clusters," Information Sciences, vol. 382-383, pp. 96–114, 2017.

[23] F. Jiang, Y. Fu, B. B. Gupta et al., "Deep learning based multi-channel intelligent attack detection for data security," IEEE Transactions on Sustainable Computing, 2018.

[24] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the 2000 ACM SIGMOD International Conference, pp. 439–450, May 2000.

[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.

[26] G. Salton, "Automatic text processing: the transformation, analysis and retrieval of information by computer," Tech. Rep., Addison-Wesley, Inc., 1989.

[27] M. N. Islam, M. Seera, and C. K. Loo, "A robust incremental clustering-based facial feature tracking," Applied Soft Computing, vol. 53, pp. 34–44, 2017.

[28] A. Shabtai and Y. Elovici, A Survey of Data Leakage Detection and Prevention Solutions, Springer, Berlin, Germany, 2012.

[29] S. R. Kalidindi, S. R. Niezgoda, G. Landi, S. Vachhani, and T. Fast, "A novel framework for building materials knowledge systems," Computers, Materials and Continua, vol. 17, no. 2, pp. 103–125, 2010.

[30] A. J. Wang, "Information security models and metrics," in Proceedings of the 43rd Annual Southeast Regional Conference (ACMSE 43), pp. 178–184, 2005.

[31] P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301–312, 2002.

[32] M. Dash, K. Choi, P. Scheuermann, and H. Liu, "Feature selection for clustering: a filter solution," in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 115–122, December 2002.

[33] C.-S. Liu, "An analytical method for computing the one-dimensional backward wave problem," Computers, Materials and Continua, vol. 13, no. 3, pp. 219–234, 2010.

[34] G. Katz, Y. Elovici, and B. Shapira, "CoBAn: a context based model for data leakage prevention," Information Sciences, vol. 262, pp. 137–158, 2014.

[35] X. Huang, Y. Lu, D. Li, and M. Ma, "A novel mechanism for fast detection of transformed data leakage," IEEE Access, vol. 1, pp. 1–11, 2018.

[36] M. Porter, "The Porter stemming algorithm," 2006.

[37] F. Pacheco, M. Cerrada, R.-V. Sanchez, D. Cabrera, C. Li, and J. Valente de Oliveira, "Attribute clustering using rough set theory for feature selection in fault severity classification of rotating machinery," Expert Systems with Applications, vol. 71, pp. 69–86, 2017.

[38] "Information week global security survey," Information Week, 2004.

[39] C. M. Praba, "A technical review on data leakage detection and prevention approaches," Journal of Network Communications and Emerging Technologies (JNCET), 2017.

[40] F. Ullah, M. Edwards, R. Ramdhany, R. Chitchyan, M. A. Babar, and A. Rashid, "Data exfiltration: a review of external attack vectors and countermeasures," Journal of Network and Computer Applications, 2017.

[41] K. Thomas, F. Li, A. Zand et al., "Data breaches, phishing, or malware: understanding the risks of stolen credentials," in Proceedings of the 24th ACM SIGSAC Conference on Computer and Communications Security (CCS '17), pp. 1421–1434, November 2017.

[42] J. DeBlasio, S. Savage, G. M. Voelker, and A. C. Snoeren, "Tripwire: inferring internet site compromise," in Proceedings of the IMC '17, pp. 1–14, 2017.

[43] W. Xu, S. Xiang, and V. Sachnev, "A cryptograph domain image retrieval method based on Paillier homomorphic block encryption," Computers, Materials and Continua, pp. 1–11, 2018.

[44] J. Cui, Y. Zhang, Z. Cai, A. Liu, and Y. Li, "Securing display path for security-sensitive applications on mobile devices," Computers, Materials and Continua, vol. 55, no. 1, pp. 17–35, 2018.

[45] D. Wang, W. Li, and P. Wang, "Measuring two-factor authentication schemes for real-time data access in industrial wireless sensor networks," IEEE Transactions on Industrial Informatics, pp. 1–12, 2018.

[46] D. Wang, H. Cheng, P. Wang, J. Yan, and X. Huang, "A security analysis of honeywords," in Proceedings of the Network and Distributed Systems Security (NDSS) Symposium, pp. 18–21, 2018.

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 8: New A Data Leakage Prevention Method Based on the Reduction of …downloads.hindawi.com/journals/wcmc/2018/5823439.pdf · 2019. 7. 30. · A Data Leakage Prevention Method Based on

8 Wireless Communications and Mobile Computing

00 01 02 03 04 05 06 07 08 09 1000

01

02

03

04

05

06

07

08

09

10

TPR

FPRCBDLP-PrCoBAn

CBDLPSVM

Figure 5 Performance of detecting confidential documents as awhole

00 01 02 03 04 05 06 07 08 09 1000

01

02

03

04

05

06

07

08

09

10

TPR

FPR

CoBAnSVM

CBDLP-PrCBDLP

Figure 6 Performance of detecting the confidential contentsembedded in nonconfidential documents

Figure 5 when dealing with the scenario where confidentialdocuments are considered as a whole the performance ofthe four detection algorithms has no much difference Inspite of that CBDLP and CBDLP-Pr still perform slightlybetter than CoBAn and SVM which can be explained asthat the performance of CoBAn is partly influenced by thelimitation of 119896-means that it cannot deal with the clustersof various shapes effectively and SVM only focuses on theconfidential terms nevertheless ignores the context terms Inthis scenario since the documents containing confidentialterms are explicitly detected as confidential documents theperformance of the four methods has no much difference

412 Confidential Portions Embedded in NonconfidentialDocuments The result of the second scenario is presentedin Figure 6 As shown in Figure 6 when dealing with thescenario where the confidential portions are embedded innonconfidential documents CBDLP CBDLP-Pr and CoBAn

00 01 02 03 04 05 06 07 08 09 1000

01

02

03

04

05

06

07

08

09

10

TPR

FPR

CoBAnSVM

CBDLP-PrCBDLP

Figure 7 Performance of detecting the rephrased confidentialcontents embedded in nonconfidential documents

perform better than SVM which can be explained as thatSVM is deceived by the scenario due to its statistics nature Asexpected the performance of CBDLP is slightly better thanCBDLP-Pr andCoBAndue to its pruning stepwhich removesthe redundancy nodes in graph that might deteriorate theresults of detection

In this scenario confidential portions are extracted fromthe documents defined as confidential and then embeddedin the nonconfidential documents whose length are at leastten times larger than the extracted portions Due to thestatistical nature most documents containing confidentialportions are incorrectly detected as nonconfidential by SVMwhich result in dramatic decline in the accuracy of SVMOther than SVM CBDLP CBDLP-Pr and CoBAn take theconfidential terms together with their context into accountand most nonconfidential documents containing embeddedconfidential portions are detected as confidential

413 Rephrased Confidential Contents in NonconfidentialDocuments The result of the third scenario is presentedin Figure 7 As shown in Figure 7 when dealing with thescenario where the confidential contents are rephrased andembedded in nonconfidential documents the performanceof SVM deteriorates considerably due to its statistics natureSince the rephrased contents do not deviate much from itsoriginal meaning CBDLP CBDLP-Pr and CoBAn performwell In addition the performance of CBDLP is better thanCBDLP-Pr andCoBAndue to its pruning stepwhich removesthe redundancy nodes in graph

In this scenario the rephrased confidential terms areembedded in nonconfidential documents which confuseSVMgreatly andmost documents containing rephrased con-fidential contents are incorrectly detected as nonconfidentialOther than SVM with the context of confidential termstaken into account CoBAn detects most documents contain-ing confidential contents however the accuracy of CoBAnis partially influenced by the clusterrsquos terms graph whichdepends on the quality of clusters generated by 119896-means Asa result CBDLP clusters documents with DBSCAN which

Wireless Communications and Mobile Computing 9

SVM

Runn

ing

time (

s)

Number of documents

3000

2500

2000

1500

1000

500

0

0 1000 2000 3000 4000 5000

CoBAnCBDLPCBDLP-Pr

Figure 8The scores of confidential documents andnonconfidentialdocuments

improves the quality of clusters and the clusterrsquos terms graphmeanwhile the pruning method removes the redundancynodes in graph and further improves the performance ofCBDLP

42 Running Time Comparisons In this experiment wemixed non-confidential documents together with the threetype of confidential documents which are the whole con-fidential documents the confidential contents embedded innonconfidential documents and the rephrased confidentialcontents embedded in non-confidential documents Theexperiment is conducted by using 10 fold cross validation ToCompare the running time of CBDLP CBDLP-Pr CoBAnand SVM we conduct the experiment on the datasets ofdifferent size The result is as shown in Figure 8 the runningtime of training phase and testing phase are exhibited asline graph in which the running time of CBDLP CBDLP-PrCoBan and SVM increase as more documents are added tothe dataset Although the additional steps of CBDLPCBDLP-Pr and CoBan result in more running time than SVM needstheir running time is still an order of magnitude more thanthat CBDLP performs much better than SVM does

5 Conclusion and Future Work

In this paper we present a new method for data leakagePrevention based on CBDLPmodel which has the followingadvantages

(1) It clusters the documents with DBSCAN and cosinemeasure which have been verified to be effective

(2) It represents confidential terms and their contextterms in graph

(3) It presents a pruning method based on the attributereduction method of rough set theory

Up to now some designated commercial DLP solutionscan reduce the risk of most accidental leakage howeverthey cannot provide sufficient protection against intentional

leakage And the other DLP solutions such as firewalls IDSantimalware software and management policies which canprovide assistance in detection intrusion or malicious soft-ware and enforce policies to protect data still do not preventintentional leaks perfectly To the best of our knowledgethere might be two main future research topics on DLP dataleakage from mobile devices and accidental data leakage byinsider

Since accidental data leakagemay be part of a larger attackin which their role will be mainly to activate an advancedpersistent threat inside the organization it is expected tocontinue to be one of the most challenging research topicsAnd our future work will focus on accidental data leakagein two directions First try to improve the efficiency andeffectiveness of CBDLP on confidential contents detectionSecond adjust the model dynamically according to thechanges of training dataset

Data Availability

All data generated or analysed during this study are includedin this published article

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

The research is supported by the National Natural Sci-ence Foundation of China under Grant nos 61871140 and61572153

References

[1] A Goyal F Bonchi and V S Lakshmanan ldquoOn minimizingbudget and time in influence propagation over social networksrdquoSocial Network Analysis and Mining pp 1ndash14 2012

[2] C Aggarwal Social Network Data Analytics Springer BerlinGermany 2011

[3] ldquoInformation week global security surveyrdquo Information Week2004

[4] D Alassi and R Alhajj ldquoEffectiveness of template detectionon noise reduction and websites summarizationrdquo InformationSciences vol 219 pp 41ndash72 2013

[5] D Holmes Using language models for information retrieval[PhD Thesis] Center for telematies and information technol-ogy University of Twente 2001

[6] R Bohme ldquoSecurity metrics and security investment modelsrdquoin Proceedings of the 5th international conference on advances ininformation and computer security 2010

[7] K R Rao and P Yip Discrete Cosine Transform AlgorithmsAdvantages Applications Academic Press Boston Mass USA1990

[8] S M Katz ldquoEstimation of probabilities from sparse data forthe language model component of a speech recognizerrdquo IEEETransactions on Signal Processing vol 35 no 3 pp 400-4011987

10 Wireless Communications and Mobile Computing

[9] R Jin L Si A G Hauptmann and J Callan ldquoLanguage modelfor IR using collection informationrdquo in Proceedings of the the25th annual international ACM SIGIR conference pp 419-4202002

[10] W W Cohen ldquoLearning rules that classify e-mailrdquo in Proceed-ings of the AAAI Spring Symposium on Machine Learning inInformation Access pp 18ndash25 1996

[11] G Salton A Wong and C S Yang ldquoA vector space model forautomatic indexingrdquo Communications of the ACM vol 18 no11 pp 613ndash620 1975

[12] S Deerwester S T Dumais G W Furnas T K Landauer andR Harshman ldquoIndexing by latent semantic analysisrdquo Journal ofthe Association for Information Science and Technology vol 41no 6 pp 391ndash407 1990

[13] C Ordonez ldquoClustering binary data streams with K-meansrdquoin Proceedings of the 8th ACM SIGMODWorkshop on ResearchIssues in Data Mining and Knowledge Discovery DMKD rsquo03 pp12ndash19 USA June 2003

[14] B Babcock M Datar R Motwani and L OrsquoCallaghan ldquoMain-taining variance and k-medians over data stream windowsrdquoin Proceedings of the Twenty second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems PODS2003 pp 234ndash243 June 2003

[15] T Zhang R Ramakrishnan and M Livny ldquoBIRCH a newdata clustering algorithm and its applicationsrdquoDataMining andKnowledge Discovery vol 1 no 2 pp 141ndash182 1997

[16] S Guha R Rastogi and K Shim ldquoCure an efficient clusteringalgorithm for large databasesrdquo in Proceedings of 1998 ACMSIGMOD International Conference Management of Data pp73ndash84 1998

[17] J W Han and M Kamber Data Mining Concepts and Tech-niques Morgan Kaufmann 2006

[18] A Hinneburg and D A Keim ldquoOptimal grid-clusteringtowards breaking the curse of dimensionality in high-dimen-sional clusteringrdquo in Proceedings of the 25th VLDB Conferencepp 506ndash517 1999

[19] WWang J Yang andRMuntz ldquoSting a statistical informationgrid approach to spatial data miningrdquo in Proceedings of the 23rdVLDB Conference pp 186ndash195 1997

[20] R Agrawal J Gehrke and D Gunopulos ldquoAutomatic sub-space clustering of high dimensional data for data miningapplicationsrdquo in Proceedings of the ACM SIGMOD InternationalConference on Management of Data pp 94ndash105 1998

[21] A McCallum K Nigam and L H Ungar ldquoEfficient clusteringof high-dimensional data sets with application to referencematchingrdquo in Proceedings of the KDD 2000 pp 169ndash178 ACMNew York NY USA 2000

[22] R Hyde P Angelov and A R MacKenzie ldquoFully onlineclustering of evolving data streams into arbitrarily shapedclustersrdquo Information Sciences vol 382-383 pp 96ndash114 2017

[23] F Jiang Y Fu B B Gupta et al ldquoDeep learning based multi-channel intelligent attack detection for data securityrdquo IEEETransactions on Sustainable Computing 2018

[24] R Agrawal and R Srikant ldquoPrivacy-preserving data miningrdquoin Proceedings of the the 2000 ACM SIGMOD internationalconference pp 439ndash450 May 2000

[25] G Salton and C Buckley ldquoTerm-weighting approaches in auto-matic text retrievalrdquo Information Processing ampManagement vol24 no 5 pp 513ndash523 1988

[26] Salton ldquoAutomatic text processing the transformation analysisand retrieval of information by computerrdquo Tech Rep Addison-Wesley Inc 1989

[27] M N Islam M Seera and C K Loo ldquoA robust incrementalclustering-based facial feature trackingrdquo Applied Soft Comput-ing vol 53 pp 34ndash44 2017

[28] A Shabtai and Y Elovici A Survey of Data Leakage Detectionand Prevention Solutions Springer Berlin Germany 2012

[29] S R Kalidindi S R Niezgoda G Landi S Vachhani andT Fast ldquoA novel framework for building materials knowledgesystemsrdquo Computers Materials and Continua vol 17 no 2 pp103ndash125 2010

[30] A J Wang ldquoInformation security models and metricsrdquo inProceedings of the 43rd annual southeast regional conference onACMSE43 pp 178ndash184 2005

[31] P Mitra C A Murthy and S K Pal ldquoUnsupervised featureselection using feature similarityrdquo IEEE Transactions on PatternAnalysis and Machine Intelligence vol 24 no 3 pp 301ndash3122002

[32] M Dash K Choi P Scheuermann and H Liu ldquoFeatureselection for clustering - A filter solutionrdquo in Proceedings of the2nd IEEE International Conference on Data Mining ICDM rsquo02pp 115ndash122 December 2002



Wireless Communications and Mobile Computing

Figure 8: Running time (in seconds) of SVM, CoBAn, CBDLP, and CBDLP-Pr as the number of documents grows from 0 to 5000.

improves the quality of the clusters and of each cluster's term graph; meanwhile, the pruning method removes redundant nodes from the graph and further improves the performance of CBDLP.

4.2. Running Time Comparisons. In this experiment, we mixed nonconfidential documents together with the three types of confidential documents: whole confidential documents, confidential contents embedded in nonconfidential documents, and rephrased confidential contents embedded in nonconfidential documents. The experiment is conducted using 10-fold cross-validation. To compare the running times of CBDLP, CBDLP-Pr, CoBAn, and SVM, we run them on datasets of different sizes. As shown in Figure 8, the running times of the training and testing phases are plotted as line graphs; the running times of CBDLP, CBDLP-Pr, CoBAn, and SVM all increase as more documents are added to the dataset. Although the additional steps of CBDLP, CBDLP-Pr, and CoBAn make them slower than SVM, their running time stays within an order of magnitude of SVM's, while CBDLP detects confidential contents much more accurately than SVM does.
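The experimental setup above can be sketched in a few lines. The toy corpus, the LinearSVC baseline, and scikit-learn itself are illustrative stand-ins for the paper's actual datasets and SVM implementation, so the timing printed here is not meant to reproduce Figure 8:

```python
import time
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy corpus standing in for the mixed confidential/nonconfidential documents.
docs = (["quarterly revenue figures are confidential"] * 50
        + ["public press release about the new product"] * 50)
labels = np.array([1] * 50 + [0] * 50)  # 1 = confidential, 0 = nonconfidential

# Vectorize once, then time a 10-fold cross-validation run of the baseline.
X = TfidfVectorizer().fit_transform(docs)

start = time.perf_counter()
scores = cross_val_score(LinearSVC(), X, labels, cv=10)  # 10-fold CV
elapsed = time.perf_counter() - start

print(f"mean accuracy: {scores.mean():.2f}, running time: {elapsed:.3f}s")
```

Repeating the timing loop over datasets of growing size (e.g. 1000, 2000, ... documents) would yield the kind of running-time curves compared in Figure 8.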

5. Conclusion and Future Work

In this paper, we present a new method for data leakage prevention based on the CBDLP model, which has the following advantages:

(1) It clusters the documents with DBSCAN and the cosine measure, which have been verified to be effective.

(2) It represents confidential terms and their context terms in a graph.

(3) It presents a pruning method based on the attribute reduction method of rough set theory.
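A minimal sketch of the clustering step in point (1), assuming scikit-learn's DBSCAN with the cosine metric over TF-IDF vectors; the toy corpus and the `eps`/`min_samples` values are invented for illustration, not the authors' exact configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Toy documents from two topics; real input would be the training corpus.
docs = [
    "merger contract terms and acquisition price",
    "acquisition price and contract negotiation",
    "football match score and league table",
    "league table after the weekend football match",
]

X = TfidfVectorizer().fit_transform(docs)

# DBSCAN with the cosine measure: documents within eps (cosine distance)
# of at least min_samples neighbours form a cluster; label -1 marks noise.
clustering = DBSCAN(eps=0.8, min_samples=2, metric="cosine").fit(X)
print(clustering.labels_)
```

Each resulting cluster would then be summarized by a term graph of its confidential terms and their context terms, which the rough-set-based pruning step subsequently thins out.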

Up to now, some designated commercial DLP solutions can reduce the risk of most accidental leakage; however, they cannot provide sufficient protection against intentional leakage. Other DLP solutions, such as firewalls, IDS, antimalware software, and management policies, can assist in detecting intrusions or malicious software and can enforce policies to protect data, but they still do not prevent intentional leaks perfectly. To the best of our knowledge, there are two main future research topics on DLP: data leakage from mobile devices and accidental data leakage by insiders.

Since accidental data leakage may be part of a larger attack, in which its role is mainly to activate an advanced persistent threat inside the organization, it is expected to remain one of the most challenging research topics. Our future work will focus on accidental data leakage in two directions: first, improving the efficiency and effectiveness of CBDLP in detecting confidential contents; second, adjusting the model dynamically according to changes in the training dataset.

Data Availability

All data generated or analysed during this study are included in this published article.

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

The research is supported by the National Natural Science Foundation of China under Grants nos. 61871140 and 61572153.

References

[1] A. Goyal, F. Bonchi, and V. S. Lakshmanan, "On minimizing budget and time in influence propagation over social networks," Social Network Analysis and Mining, pp. 1–14, 2012.
[2] C. Aggarwal, Social Network Data Analytics, Springer, Berlin, Germany, 2011.
[3] "Information week global security survey," Information Week, 2004.
[4] D. Alassi and R. Alhajj, "Effectiveness of template detection on noise reduction and websites summarization," Information Sciences, vol. 219, pp. 41–72, 2013.
[5] D. Holmes, Using Language Models for Information Retrieval [Ph.D. thesis], Center for Telematics and Information Technology, University of Twente, 2001.
[6] R. Bohme, "Security metrics and security investment models," in Proceedings of the 5th International Conference on Advances in Information and Computer Security, 2010.
[7] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Boston, Mass, USA, 1990.
[8] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Transactions on Signal Processing, vol. 35, no. 3, pp. 400–401, 1987.
[9] R. Jin, L. Si, A. G. Hauptmann, and J. Callan, "Language model for IR using collection information," in Proceedings of the 25th Annual International ACM SIGIR Conference, pp. 419–420, 2002.
[10] W. W. Cohen, "Learning rules that classify e-mail," in Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, pp. 18–25, 1996.
[11] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Communications of the ACM, vol. 18, no. 11, pp. 613–620, 1975.
[12] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the Association for Information Science and Technology, vol. 41, no. 6, pp. 391–407, 1990.
[13] C. Ordonez, "Clustering binary data streams with K-means," in Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD '03, pp. 12–19, USA, June 2003.
[14] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," in Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2003, pp. 234–243, June 2003.
[15] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: a new data clustering algorithm and its applications," Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141–182, 1997.
[16] S. Guha, R. Rastogi, and K. Shim, "CURE: an efficient clustering algorithm for large databases," in Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 73–84, 1998.
[17] J. W. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.
[18] A. Hinneburg and D. A. Keim, "Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering," in Proceedings of the 25th VLDB Conference, pp. 506–517, 1999.
[19] W. Wang, J. Yang, and R. Muntz, "STING: a statistical information grid approach to spatial data mining," in Proceedings of the 23rd VLDB Conference, pp. 186–195, 1997.
[20] R. Agrawal, J. Gehrke, and D. Gunopulos, "Automatic subspace clustering of high dimensional data for data mining applications," in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 94–105, 1998.
[21] A. McCallum, K. Nigam, and L. H. Ungar, "Efficient clustering of high-dimensional data sets with application to reference matching," in Proceedings of KDD 2000, pp. 169–178, ACM, New York, NY, USA, 2000.
[22] R. Hyde, P. Angelov, and A. R. MacKenzie, "Fully online clustering of evolving data streams into arbitrarily shaped clusters," Information Sciences, vol. 382-383, pp. 96–114, 2017.
[23] F. Jiang, Y. Fu, B. B. Gupta et al., "Deep learning based multi-channel intelligent attack detection for data security," IEEE Transactions on Sustainable Computing, 2018.
[24] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the 2000 ACM SIGMOD International Conference, pp. 439–450, May 2000.
[25] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.
[26] G. Salton, "Automatic text processing: the transformation, analysis and retrieval of information by computer," Tech. Rep., Addison-Wesley Inc., 1989.
[27] M. N. Islam, M. Seera, and C. K. Loo, "A robust incremental clustering-based facial feature tracking," Applied Soft Computing, vol. 53, pp. 34–44, 2017.
[28] A. Shabtai and Y. Elovici, A Survey of Data Leakage Detection and Prevention Solutions, Springer, Berlin, Germany, 2012.
[29] S. R. Kalidindi, S. R. Niezgoda, G. Landi, S. Vachhani, and T. Fast, "A novel framework for building materials knowledge systems," Computers, Materials and Continua, vol. 17, no. 2, pp. 103–125, 2010.
[30] A. J. Wang, "Information security models and metrics," in Proceedings of the 43rd Annual Southeast Regional Conference, ACMSE 43, pp. 178–184, 2005.
[31] P. Mitra, C. A. Murthy, and S. K. Pal, "Unsupervised feature selection using feature similarity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 301–312, 2002.
[32] M. Dash, K. Choi, P. Scheuermann, and H. Liu, "Feature selection for clustering - a filter solution," in Proceedings of the 2nd IEEE International Conference on Data Mining, ICDM '02, pp. 115–122, December 2002.
[33] C.-S. Liu, "An analytical method for computing the one-dimensional backward wave problem," Computers, Materials and Continua, vol. 13, no. 3, pp. 219–234, 2010.
[34] G. Katz, Y. Elovici, and B. Shapira, "CoBAn: a context based model for data leakage prevention," Information Sciences, vol. 262, pp. 137–158, 2014.
[35] X. Huang, Y. Lu, D. Li, and M. Ma, "A novel mechanism for fast detection of transformed data leakage," IEEE Access, vol. 1, pp. 1–11, 2018.
[36] M. Porter, "The Porter stemming algorithm," 2006.
[37] F. Pacheco, M. Cerrada, R.-V. Sanchez, D. Cabrera, C. Li, and J. Valente de Oliveira, "Attribute clustering using rough set theory for feature selection in fault severity classification of rotating machinery," Expert Systems with Applications, vol. 71, pp. 69–86, 2017.
[38] "Information week global security survey," Information Week, 2004.
[39] C. M. Praba, "A technical review on data leakage detection and prevention approaches," Journal of Network Communications and Emerging Technologies (JNCET), 2017.
[40] F. Ullah, M. Edwards, R. Ramdhany, R. Chitchyan, M. A. Babar, and A. Rashid, "Data exfiltration: a review of external attack vectors and countermeasures," Journal of Network and Computer Applications, 2017.
[41] K. Thomas, F. Li, A. Zand et al., "Data breaches, phishing, or malware: understanding the risks of stolen credentials," in Proceedings of the 24th ACM SIGSAC Conference on Computer and Communications Security, CCS '17, pp. 1421–1434, November 2017.
[42] J. DeBlasio, S. Savage, G. M. Voelker, and A. C. Snoeren, "Tripwire: inferring internet site compromise," in Proceedings of IMC '17, pp. 1–14, 2017.
[43] W. Xu, S. Xiang, and V. Sachnev, "A cryptograph domain image retrieval method based on Paillier homomorphic block encryption," Computers, Materials and Continua, pp. 1–11, 2018.
[44] J. Cui, Y. Zhang, Z. Cai, A. Liu, and Y. Li, "Securing display path for security-sensitive applications on mobile devices," Computers, Materials and Continua, vol. 55, no. 1, pp. 17–35, 2018.
[45] D. Wang, W. Li, and P. Wang, "Measuring two-factor authentication schemes for real-time data access in industrial wireless sensor networks," IEEE Transactions on Industrial Informatics, pp. 1–12, 2018.
[46] D. Wang, H. Cheng, P. Wang, J. Yan, and X. Huang, "A security analysis of honeywords," in Proceedings of the Network and Distributed Systems Security (NDSS) Symposium, pp. 18–21, 2018.
