SPAM DETECTION IN P2P SYSTEMS

Team Matrix: Abhishek Ghag, Darshan Kapadia, Pratik Singh

OVERVIEW

P2P Basics

Spam

The Spam Detection Problem

Approaches to the Spam Detection Problem

Proposal

References

P2P Basics

Used to connect nodes or machines via large ad hoc connections.

No concept of a client or server.

All nodes or peers are equal.

The equal peer nodes function as both client and server.

Classification of P2P:-

Centralized P2P network – Napster.

Decentralized P2P network – KaZaA.

Structured P2P network – CAN.

Unstructured P2P network – Gnutella.

Hybrid P2P network – JXTA.

Advantages of P2P:-

All peers provide resources like bandwidth, computing power, storage space, CPU cycles.

Replication of data over multiple peers eliminates a single point of failure.

Applications of P2P:-

File Sharing

Internet Telephony e.g. Skype.

Streaming media files.

From http://www.acm.org/crossroads/xrds9-4/gfx/GamestateFidelity1.jpg

Spam

Spam is any file that is misrepresented deliberately.

A well known problem in P2P file sharing systems.

Used to manipulate established retrieval and ranking techniques.

Anonymous, decentralized and dynamic in nature.

Viruses in P2P

(Figure taken from the research paper "Malware Prevalence in the KaZaA File-Sharing Network", ACM.)

Why is Spam Harmful?

Degrades user search experience.

Assists the propagation of viruses in the network.

More than 200 viruses use P2P as a propagation vector.

Increases the traffic load on the network.

Spam

Spam is hard to detect automatically because:-

Query results return insufficient and biased information.

P2P systems are anonymous, decentralized and dynamic in nature.

The naïve spam detection technique is to download the file and check it manually.

Approaches to Spam Detection Problem

Mainly two approaches to the spam detection problem.

Detection after downloading file

The user compares the file with known databases of genuine files.

The user filters the file so that other users don't get the spammed copy.

Detection before downloading file

Rigid trust

Web of trust

Reputation system

Blocking IP addresses

Object Reputation:-

Users vote for a file either positively or negatively.

Based on the voting evaluation and the voting protocol, the file is regarded as genuine or spam.
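The voting idea above can be sketched in a few lines. This is only an illustration under assumptions: the slides do not specify the voting protocol, so the simple majority rule, the `min_votes` cutoff and the function name `classify_by_votes` are all hypothetical.

```python
def classify_by_votes(votes, min_votes=3):
    """Classify a file from user votes (+1 = positive, -1 = negative).

    Assumption: a plain majority decides, and fewer than `min_votes`
    votes is treated as too little evidence to decide either way.
    """
    votes = list(votes)
    if len(votes) < min_votes:
        return "unknown"
    score = sum(votes)
    if score > 0:
        return "genuine"
    if score < 0:
        return "spam"
    return "unknown"  # tie


print(classify_by_votes([+1, +1, -1]))
print(classify_by_votes([-1, -1, +1, -1]))
```

Even in this toy form, every vote requires some user to have already downloaded and inspected the file, which is where the disadvantages listed for this approach come from.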

Disadvantages: -

Consumes time and labor.

Wastes bandwidth and computing resources.

Risk of opening malware.

Thus there arises a need to develop an effective automatic spam detection technique.

Goal

Automatic Detection of Spam files.

Query Processing

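1. The client issues a query.

2. Each server compares the query against its files and returns the matching results.

3. Each result carries a system identifier (key) and a descriptor.

4. The client groups the individual results by key.

5. The client ranks the groups.

6. After downloading a file, the client becomes a server for it.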
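The grouping and ranking steps can be sketched as follows. This is a minimal sketch, assuming each result is a `(key, descriptor, peer)` tuple and that groups are ranked by group size; the slides only say "Ranking", so the ranking criterion and the name `group_and_rank` are assumptions.

```python
from collections import defaultdict


def group_and_rank(results):
    """Group query results by key, then rank groups by size.

    results: list of (key, descriptor, peer) tuples (hypothetical layout).
    Returns a list of (key, [(descriptor, peer), ...]) with the largest
    group first; ties are broken by key so the order is deterministic.
    """
    groups = defaultdict(list)
    for key, descriptor, peer in results:
        groups[key].append((descriptor, peer))
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))


results = [
    ("K1", "song a.mp3", "peer1"),
    ("K2", "song b.mp3", "peer2"),
    ("K1", "song a live.mp3", "peer3"),
]
for key, replicas in group_and_rank(results):
    print(key, len(replicas))
```

Ranking by group size rewards files that many peers share, which is exactly what a spammer can exploit by inflating the number of replicas.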

Spamming

Steps 1, 3 and 5.

Object Reputation on step 1.

Feature based Spam Detection on steps 3 and 5.

Feature Based Spam Detection

Characterizing Spam.

Characterizing Spammers.

Then implement techniques that use this characterization to rank the query results.

Classification of Spam

Type 1:-

• Files whose replicas have semantically different descriptors.

• The spammer might name a file after a currently popular song or might give multiple names to the same file.

E.g., different song titles for the same key 26NZUBS655CC66COLKMWHUVJGUXRPVUF:

“12 days after christmas.mp3”

“i want you thalia.mp3”

“come on be my girl.mp3” …
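One way to flag Type 1 spam is to measure how little the descriptors of one key have in common. The sketch below uses the mean pairwise Jaccard similarity of descriptor terms; the choice of Jaccard similarity, the `threshold` value and the function names are assumptions for illustration, not the method of the referenced papers.

```python
import re
from itertools import combinations


def terms(descriptor):
    """Lower-cased alphanumeric terms, ignoring the 'mp3' extension."""
    return set(re.findall(r"[a-z0-9]+", descriptor.lower())) - {"mp3"}


def mean_jaccard(descriptors):
    """Average Jaccard similarity over all pairs of descriptors."""
    pairs = list(combinations(descriptors, 2))
    if not pairs:
        return 1.0  # a single descriptor cannot disagree with itself
    total = 0.0
    for a, b in pairs:
        ta, tb = terms(a), terms(b)
        union = ta | tb
        total += len(ta & tb) / len(union) if union else 1.0
    return total / len(pairs)


def looks_like_type1(descriptors, threshold=0.2):
    """Flag a key whose replica descriptors share almost no terms."""
    return mean_jaccard(descriptors) < threshold


suspect = [
    "12 days after christmas.mp3",
    "i want you thalia.mp3",
    "come on be my girl.mp3",
]
print(looks_like_type1(suspect))
```

Term overlap is only a rough stand-in for "semantically different": two honest descriptors in different languages would also score low.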

Classification of Spam

Type 2:-

• Files with long descriptors

• Here the spammer gives the file a single long descriptor.

• E.g., a single replica descriptor for key 1200473A4BB17724194C5B9C271F3DC4: “Aerosmith, Van Halen, Quiet Riot, Kiss, Poison, Acdc, Accept, Def Leappard, Boney M, Megadeth, Metallica, Offspring, Beastie Boys, Run Dmc, Buckcherry, Salty Dog Remix.mp3”
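A simple check for Type 2 spam is to count the terms in a descriptor. This is a sketch only: the cutoff `max_terms=12` is an arbitrary assumption and would need tuning against real descriptors.

```python
import re


def descriptor_length(descriptor):
    """Number of alphanumeric terms in a descriptor."""
    return len(re.findall(r"[A-Za-z0-9]+", descriptor))


def looks_like_type2(descriptor, max_terms=12):
    """Flag descriptors with more terms than the assumed cutoff."""
    return descriptor_length(descriptor) > max_terms


long_descriptor = (
    "Aerosmith, Van Halen, Quiet Riot, Kiss, Poison, Acdc, Accept, "
    "Def Leappard, Boney M, Megadeth, Metallica, Offspring, Beastie Boys, "
    "Run Dmc, Buckcherry, Salty Dog Remix.mp3"
)
print(descriptor_length(long_descriptor), looks_like_type2(long_descriptor))
```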

Classification of Spam

Type 3:-

Files with descriptors with no query terms.

Here, a server that wishes to share a file may return it regardless of whether it matches the query.

E.g., “Can you afford 0.09 www.BuyLegalMP3.com.mp3”
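Type 3 spam can be checked on the client by testing whether a returned descriptor shares any term with the query. A minimal sketch, assuming simple alphanumeric tokenization; the function names and the sample query are hypothetical.

```python
import re


def tokenize(text):
    """Lower-cased alphanumeric terms of a query or descriptor."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def looks_like_type3(query, descriptor):
    """Flag a result whose descriptor contains none of the query terms."""
    return not (tokenize(query) & tokenize(descriptor))


print(looks_like_type3("aerosmith dream on",
                       "Can you afford 0.09 www.BuyLegalMP3.com.mp3"))
```

Because common words such as "you" count as terms here, a query that happens to contain one would let the sample descriptor through; a stop-word list would be a natural refinement.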

Classification of Spam

Type 4:-

Files that are highly replicated on a single peer.

Normal users do not create multiple replicas of the same file on a single server. This is aimed at manipulating the group size. It also hampers query routing techniques used for locating hard-to-find data.

E.g., 177 replicas of the file DY2QXX3MYW75SRCWSSUG6GY3FS7N7YC shared on a single peer.
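Type 4 spam can be spotted by counting how many replicas of one key come from the same peer. The sketch below assumes results arrive as `(key, peer)` pairs and that more than one replica per peer is suspicious; both the data layout and the `max_replicas` cutoff are assumptions.

```python
from collections import Counter


def replicas_per_peer(results):
    """Count replicas for each (peer, key) combination.

    results: iterable of (key, peer) pairs (hypothetical layout).
    """
    return Counter((peer, key) for key, peer in results)


def flag_type4(results, max_replicas=1):
    """Return {(peer, key): count} for peers holding too many replicas."""
    counts = replicas_per_peer(results)
    return {pk: n for pk, n in counts.items() if n > max_replicas}


results = [("K1", "peerA")] * 177 + [("K1", "peerB"), ("K2", "peerB")]
print(flag_type4(results))
```

Collapsing each flagged peer to a single replica before computing the group size would remove the spammer's advantage in ranking.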

Proposal

We plan to implement the Feature based Spam Detection technique that characterizes the spam based on various features.

It includes a probing technique that gathers more descriptive information about result files and statistics about peers, together with ranking functions.

Our implementation requires little new functionality in the existing P2P file sharing systems, thus it can be combined easily with other existing techniques.

References

Papers

Author – Dongmei Jia

Title – Cost Effective Spam Detection Techniques in P2P File Sharing Systems.

Conference -- Proceedings of the 2008 ACM Workshop on Large-Scale Distributed Systems for Information Retrieval.

Date -- October 2008.

Publisher -- ACM.

URL -- http://portal.acm.org.ezproxy.rit.edu/results.cfm?coll=portal&dl=ACM&CFID=14901064&CFTOKEN=96029385

Author – Dongmei Jia, Wai Gen Yee, Ophir Frieder

Title – Spam Characterization and Detection in Peer to Peer File Sharing Systems.

Conference -- Proceedings of the 17th ACM Conference on Information and Knowledge Management.

Date -- October 2008.

Publisher -- ACM.

URL -- http://portal.acm.org.ezproxy.rit.edu/citation.cfm?id=1458082.1458128&coll=portal&dl=ACM&CFID=14901064&CFTOKEN=96029385

Author – Jian Liang, Rakesh Kumar, Yongjian Xi, Keith W. Ross

Title – Pollution in P2P File Sharing Systems.

Conference -- INFOCOM 2005. 24th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings IEEE.

Date -- March 2005.

Publisher -- IEEE.

URL -- http://ieeexplore.ieee.org.ezproxy.rit.edu/stamp/stamp.jsp?arnumber=1498344&isnumber=32100

SOURCES

http://en.wikipedia.org/wiki/Peer-to-peer

Questions???