
Cluster Based Data Distribution Techniques for Leakage Detection

Ramji Shinde
Computer Engineering
Pune Institute of Computer Technology, Pune
Email: [email protected]

Amar Buchade
Computer Engineering
Pune Institute of Computer Technology, Pune
Email: [email protected]

Abstract—Data security has become very important in recent years. An organization often has to share its data with other organizations for business purposes. The owner of the data is called the data distributor, and the other business organizations are called data agents. Agents should not share the important business data with other agents or third parties. Detecting which of the agents is guilty of leaking the data is a challenging issue.

Keywords—Distributed computing, Data security, Data privacy

I. INTRODUCTION

Consider an example: a college has the placement records of 120 students together with their academic marks. The college, as the owner of the students' information, is the data distributor, and it will not give all the details to everyone. The college has to give this information to the NAAC committee or a higher authority; these persons or organizations are the data agents. These agents are not supposed to give the data to other colleges that are searching for students for higher education. This example illustrates data leakage detection: if students start receiving phone calls from other colleges asking them to take admission, then a data leak has happened, and we need to find out which of the agents is guilty.

In this paper, we estimate the chance that an agent is guilty by applying an algorithm. In addition, we consider adding fake objects to the original dataset; fake objects can improve the probability of correctly deciding whether an agent is guilty.

II. LITERATURE SURVEY

Conventionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later found in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but they involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious. For example, a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. Another enterprise may outsource its data processing, so data must be given to various other companies. We call the owner of the data the distributor and the supposedly trusted third parties the agents.

The algorithm in [4] gives a good solution, but it has limitations, such as being domain specific. The authors discuss that the probability of fixing responsibility is related to how the system accounts for errors: a system failure is the case where the object is guessed by the unauthorized party rather than leaked by any official agent.

The method discussed in [7] requires prior information about the way a data view is produced from the data sources.

The algorithm presented in [12] is valid for both sample and explicit requests, but the paper does not discuss the shared-data and distinct-data allocation strategies in depth. In this method a fake entity is added to every agent's data, which increases the time complexity of the algorithm. The paper also assumes a fixed set of agents whose requests are known in advance.

The algorithm discussed in [14] is mostly useful for common objects. The authors also present a heuristic algorithm to calculate the responsibility of an agent who has leaked data from the given dataset. The running time of the proposed algorithm is very high, and since it is heuristic in nature, it will not always produce the optimal result.

III. PROBLEM SETUP AND MATHEMATICAL MODEL

• Input: Data records and agents

• Output: Guilt probability of an agent

• Success: Guilty agent detected with the highest probability

• Fail: The guilt of an agent is not found

• System: S

S = {I, O, F, Si, Fi | φ}
I = Input = (T, U)
U = {u1, u2, u3, ..., un}
T = {t1, t2, t3, ..., tn}
O = Output = {p1, p2, p3, ..., pn}
F = Function = {f1, f2, f3, ..., fn}
φ = Constraint                                   (1)

Where,

• T is the valuable data set.

• ui is the i-th agent.

• ti is the data associated with agent ui.

Functions:

• Sample allocation function: f1 : ui → ti. It allocates the requested sample of data to agent ui.

• Explicit allocation function: f2 : (ui, ti) → T. It allocates data to an agent under the given condition.

Constraints:

• The agent must receive only the objects which are requested by him.

• Fake but realistic objects should not be guessable by the agent.
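The output O is the set of guilt probabilities p1, ..., pn, one per agent, but the paper does not state the formula used to compute them. For reference, a minimal sketch of the standard guilt model from [12], on which this line of work builds, is given below; the function name and the guessing probability p are illustrative assumptions, not values taken from the paper.

def guilt_probability(agent_objects, leaked_objects, holders, p=0.2):
    """Estimate Pr{agent ui is guilty | leaked set S}, following the model in [12].

    agent_objects  -- set of objects allocated to the agent (Ri)
    leaked_objects -- set of objects found at the leak target (S)
    holders        -- dict: object -> number of agents that received it (|Vt|)
    p              -- assumed probability that the target guessed an object on its own
    """
    prob_not_guilty = 1.0
    for t in leaked_objects & agent_objects:
        # Chance that this leaked object was NOT supplied by this particular agent.
        prob_not_guilty *= 1.0 - (1.0 - p) / holders[t]
    return 1.0 - prob_not_guilty

The fewer agents an object is shared with, the more a leaked copy of it raises the guilt of the agents that hold it, which is why the allocation strategies in the next section try to reduce overlap.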

IV. METHODS OF FINDING LEAKAGE

A. Random Distribution

We have to distribute data among the different agents. Distribution is done by generating random numbers: a new field holding a random number is added to the distributor's data, a sorting algorithm is applied on the modified data, and the data is then distributed to the individual agents, as sketched below.
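A minimal sketch of this step in Python follows; the round-robin split after sorting is our assumption, since the paper does not say how the sorted records are divided.

import random

def random_distribution(records, agents, seed=None):
    """Tag each record with a random-number field, sort on it, then deal the
    shuffled records out to the agents round-robin (assumed split)."""
    rng = random.Random(seed)
    tagged = [(rng.random(), rec) for rec in records]   # add the random-number field
    tagged.sort(key=lambda pair: pair[0])               # sort on the random field
    allocation = {agent: [] for agent in agents}
    for i, (_, rec) in enumerate(tagged):
        allocation[agents[i % len(agents)]].append(rec)
    return allocation

# Example: distribute 10 records among 3 agents.
# print(random_distribution(list(range(10)), ["u1", "u2", "u3"], seed=42))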

B. Minimum Overlap

We have to distribute data among the different agents such that the allocations have minimum overlap of data objects. The algorithm takes care of this and distributes the data so that the agents share the minimum number of common objects; a simple greedy version is sketched below.
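The paper does not give the allocation algorithm itself, so the following is only one plausible greedy sketch, assuming every agent receives the same number of objects; it is not the authors' exact method.

def minimum_overlap_distribution(records, agents, per_agent):
    """Greedy allocation: each agent gets `per_agent` objects, always taking the
    objects handed out the fewest times so far, to keep pairwise overlap small."""
    times_given = [0] * len(records)
    allocation = {}
    for agent in agents:
        # Rank record indices by how many agents already hold them.
        ranked = sorted(range(len(records)), key=lambda i: times_given[i])
        chosen = ranked[:per_agent]
        for i in chosen:
            times_given[i] += 1
        allocation[agent] = [records[i] for i in chosen]
    return allocation

With enough records, no object is shared at all; when records are scarce, each object is reused by as few agents as possible, which keeps the guilt estimates sharp.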

C. Fake Object Creation

In this method one fake object is created and assigned to each individual agent. If this fake object is found in the leaked table, we consider that agent to be a guilty agent; a sketch follows.
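A minimal sketch of this idea is shown below; the fake-record format (a dummy name plus a uuid-based identifier) and the field names are assumptions made for illustration, not the format used in the paper.

import uuid

def add_fake_objects(allocation):
    """Append one unique, realistic-looking fake record to each agent's data and
    remember which fake belongs to which agent."""
    fake_owner = {}
    for agent, records in allocation.items():
        fake = {"name": "John Doe", "id": uuid.uuid4().hex}  # assumed dummy format
        records.append(fake)
        fake_owner[fake["id"]] = agent
    return fake_owner

def guilty_agents(leaked_records, fake_owner):
    """Any agent whose fake object shows up in the leaked table is flagged guilty."""
    return {fake_owner[rec["id"]] for rec in leaked_records if rec.get("id") in fake_owner}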

V. DATABASE CLUSTERING

A. MongoDB Sharding

For time and space efficiency we have created database clustering. It consists of a large dataset divided across different shards, with the data distributed equally among the shards. The end user does not come to know about the database sharding; only the MongoDB server is visible to the end user.
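For reference, enabling sharding on a collection through pymongo might look roughly like the following; the host, database name, and shard-key field are placeholders, and the snippet assumes a running sharded cluster reached through a mongos router.

from pymongo import MongoClient

# Connect to the mongos router (host/port are placeholders).
client = MongoClient("mongodb://localhost:27017")

# Enable sharding on the database and shard the collection on a hashed key,
# so records are spread evenly across the shards.
client.admin.command("enableSharding", "leakdb")
client.admin.command("shardCollection", "leakdb.people", key={"_id": "hashed"})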

B. MongoDB Replica

Each shard internally has a three-member replica set: one primary, one secondary, and one arbiter. The arbiter takes part in elections. The primary has read and write access, while the secondary has only read access.
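As a small illustration of this read/write split, a pymongo client can be pointed at the replica set and told to serve reads from a secondary; the replica-set name and host below are placeholders.

from pymongo import MongoClient, ReadPreference

# Writes always go to the primary; this database handle routes reads to a secondary.
client = MongoClient("mongodb://localhost:27017", replicaSet="rs0")
people = client.get_database("leakdb", read_preference=ReadPreference.SECONDARY).people

people.insert_one({"name": "sample"})   # write served by the primary
print(people.count_documents({}))       # read served by a secondary member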

C. MongoDB Configuration

The configuration server is used to know the exact location of the data in the dataset. It stores all metadata information about the MongoDB cluster.

Fig. 1. Database Clustering

VI. RESULT

Sample data of Wikipedia people is considered for the experiment. The probability of finding the guilt of an agent depends on the data allocation.

Fig. 2. Number Of Leak Objects vs Average Leak

No. of Agents   Data Records   Method              Average Guilt
4               20000          Minimum Overlap     0.97
6               20000          Sample Allocation   0.96
10              50000          Minimum Overlap     0.74
10              50000          Random Agent        0.84
10              72000          Random Data         0.92

TABLE I. RESULT TABLE

VII. CONCLUSION

The novel method presented above finds the guilt probability of agents. It can be further improved, for example by taking data objects in an online fashion. Clustering of the data is useful for both time and space complexity.


REFERENCES

[1] Shinde Ramji, Amar Buchade. "Data Leakage Detection using Fake Objects for Suspected Users." International Journal of Computer Science and Communication Networks, vol. 4, no. 6, pp. 214-215, 2015.

[2] Ruanaidh, J. J. K., W. J. Dowling, and F. M. Boland. "Watermarking digital images for copyright protection." IEE Proceedings - Vision, Image and Signal Processing, vol. 143, no. 4, pp. 250-256, 1996.

[3] Hartung, Frank, and Bernd Girod. "Watermarking of uncompressed and compressed video." Signal Processing, vol. 66, no. 3, pp. 283-301, 1998.

[4] Buneman, Peter, Sanjeev Khanna, and Tan Wang-Chiew. "Why and where: A characterization of data provenance." Database Theory - ICDT, Springer Berlin Heidelberg, pp. 316-330, 2001.

[5] Agrawal, Rakesh, and Jerry Kiernan. "Watermarking relational databases." Proceedings of the 28th International Conference on Very Large Data Bases, VLDB Endowment, 2002.

[6] Sweeney, Latanya. "Achieving k-anonymity privacy protection using generalization and suppression." International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 571-588, 2002.

[7] Cui, Yingwei, and Jennifer Widom. "Lineage tracing for general data warehouse transformations." The VLDB Journal - The International Journal on Very Large Data Bases, vol. 12, no. 1, pp. 41-58, 2003.

[8] Sion, Radu, Mikhail J. Atallah, and Sunil Prabhakar. "Rights protection for relational data." IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 12, pp. 1509-1525, 2004.

[9] Guo, Fei, Jianmin Wang, and Deyi Li. "Fingerprinting relational databases." Proceedings of the 2006 ACM Symposium on Applied Computing, ACM, 2006.

[10] Czerwinski, Steve, Richard Fromm, and Todd Hodes. "Digital music distribution and audio watermarking." UCB IS 219, 2007.

[11] Buneman, Peter, and Wang-Chiew Tan. "Provenance in databases." Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, ACM, 2007.

[12] Papadimitriou, Panagiotis, and Hector Garcia-Molina. "Data leakage detection." IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 1, pp. 51-63, 2011.

[13] Hao, Fang, et al. "Protecting cloud data using dynamic inline fingerprint checks." INFOCOM, 2013 Proceedings IEEE, IEEE, 2013.

[14] Shabtai, Asaf, et al. "Optimizing Data Misuse Detection." ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 8, p. 3, 2014.
