Understanding Botnet-driven Blog Spam: Motivations and Methods
Brandon Bevans, brandonbevans@gmail.com, California Polytechnic State University, United States of America
Bruce DeBruhl, brandonbevans@gmail.com, California Polytechnic State University, United States of America
Foaad Khosmood, brandonbevans@gmail.com, California Polytechnic State University, United States of America
Introduction
Spam, or unsolicited commercial communication, has evolved from telemarketing schemes to a highly sophisticated and profitable black-market business. Although many users are aware that email spam is prominent, they are less aware of blog spam (Thomason, 2007). Blog spam, also known as forum spam, is spam that is posted to a public or outward-facing website. Blog spam can be used to accomplish many of the tasks that email spam is used for, such as posting links to a malicious executable.
Blog spam can also serve some unique purposes. First, blog spam can influence purchasing decisions by featuring illegitimate advertisements or reviews. Second, blog spam can include content with target keywords designed to change the way a search engine identifies pages (Geerthik, 2013). Lastly, blog spam can contain link spam, which spams a URL on a victim page to increase the inserted URL’s search engine ranking. Overall, blog spam weakens search engines’ model of the Internet popularity distribution. Much academic and industrial effort has been spent to detect, filter, and deter spam (Dinh et al., 2015; Spirin and Han, 2012).
Less effort has been placed in understanding the underlying distribution mechanisms of spambots and botnets. One foundational study in characterizing blog spam (Niu et al., 2007) provided a quantitative analysis of blog spam in 2007. This study showed that blogs in 2007 included incredible amounts of spam but did not try to identify linked behavior that would imply botnet behavior. A later study on blog spam (Stringhini, 2015) explores using IPs and usernames to detect botnets but does not characterize the behavior of these botnets. In 2011, a research team (Stone-Gross et al., 2011) infiltrated a botnet, which allowed for observations of the logistics around botnet spam campaigns. Overall, our understanding of blog spam generated by botnets is still limited.
Related Work
Various projects have attempted to identify the mechanics, characteristics, and behavior of botnets that control spam. In one important study (Shin et al., 2011), researchers fully evaluated how one of the most popular spam automation programs, XRumer, operates. Another study explored the behavior of botnets across multiple spam campaigns (Thonnard and Dacier, 2011). Others (Pitsillidis et al., 2012) examined the impact that spam datasets had on characterization results. Lumezanu and Feamster (2012) explored the similarities between email spam and blog spam on Twitter. They show that over 50% of spam links from emails also appeared on Twitter.
Figure 1: Browser rendering of the ggjx honeypot
The underground ecosystem built around the botnet community has been explored (Stone-Gross et al., 2011). In a surprising result, over 95% of pharmaceuticals advertised in spam were handled by a small group of banks (Levchenko et al., 2011). Our work is similar in that we are trying to characterize the botnet ecosystem, focusing on the distribution and classification of certain spam-producing botnets.
Experimental Design
In order to classify linguistic similarities and differences in botnets, we implement 3 honeypots to gather samples of blog spam. We configure our honeypots identically using the Drupal content management system (CMS) as shown in Figure 1. Our honeypots are identical except for the content of their first post and their domain name. Ggjx.org is fashion themed, npcagent.com is sports themed, and gjams.com is pharmaceutical themed. We combine the data collected from Drupal with the Apache server logs (Apache, 2016) to allow for content analysis of data collected over 42 days. To allow botnets time to discover the honeypots, we activate the honeypots at least six weeks before data collection.
We generate three tables of content for each honeypot (Bevans and Khosmood, 2016). In the user table, we record the information the spambot enters while registering and user login statistics that we summarize in Table 1. This includes the user id, username, password, date of registration, registration IP, and number of logins. In the content table, we record the content of spam posts and comments, which we summarize in Table 2. This includes the blog node id, the author’s unique id, the date posted, the number of hits, type of post, title of the post, text of the post, links in the post, language of the post, and a taxonomy of the post from IBM’s Alchemy API.
Table 1: User table characteristics for three honeypots
Table 2: Characteristics for the content tables
Table 3: Characteristics of entities
Lastly, in the access table, we include data and metadata from the Apache logs. This includes the user id, the access IP, the URL, the HTTP request type, the node ID, and an action keyword describing the type of access.
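As an illustration of how some access-table fields could be pulled from an Apache log, the following is a minimal sketch, not the tooling used for the corpus. It assumes the log lines begin with the Common Log Format and that content lives under Drupal-style `/node/<id>` paths; the names `LOG_PATTERN`, `NODE_PATTERN`, and `parse_access_line` are hypothetical. The user id and the action keyword are not recoverable from a log line alone and are left out.

```python
import re

# Assumed layout (Common Log Format prefix):
# host ident user [time] "METHOD URL PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Assumption: node pages are addressed as /node/<id>.
NODE_PATTERN = re.compile(r"/node/(\d+)")


def parse_access_line(line):
    """Return access IP, HTTP request type, URL and node id, or None if unparsable."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    node = NODE_PATTERN.search(match.group("url"))
    return {
        "ip": match.group("ip"),
        "method": match.group("method"),
        "url": match.group("url"),
        "node_id": int(node.group(1)) if node else None,
    }


sample = ('203.0.113.7 - - [10/Aug/2016:13:55:36 -0700] '
          '"POST /node/42/edit HTTP/1.1" 200 2326')
print(parse_access_line(sample))
```

Lines that do not match the assumed layout yield `None`, so a caller can count and inspect them rather than silently dropping requests.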
Our honeypots received a total of 1.1 million requests for ggjx, 481 thousand requests for gjams, and 591 thousand requests for npcagent.
Entity Reduction
It is widely accepted that spambot networks, or botnets, are responsible for most spam. Therefore, we algorithmically reduce spam instances into unique entities representing botnets. For each entity, we define 4 attributes: entity id, associated IPs, usernames, and associated user ids. To construct entities, we scan through the users and assign each one to an entity as follows.
1. For a user, if an entity exists which contains its username or IP, the user is added to the entity.
2. If more than one entity matches the above criteria, all matching entities are merged.
3. If no entity matches the above criteria, a new entity is created.
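The three rules above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used for the corpus: users are assumed to arrive as `(user_id, username, ip)` tuples, and the function name `build_entities` and the dictionary layout are hypothetical.

```python
def build_entities(users):
    """Group (user_id, username, ip) records into entities sharing a username or IP."""
    entities = []  # each entity is a dict holding three sets
    for user_id, username, ip in users:
        # Rule 1: find every existing entity that already holds this username or IP.
        matches = [e for e in entities
                   if username in e["usernames"] or ip in e["ips"]]
        if matches:
            target = matches[0]
            # Rule 2: fold all other matching entities into the first match.
            for other in matches[1:]:
                for key in ("ips", "usernames", "user_ids"):
                    target[key] |= other[key]
            merged_away = {id(e) for e in matches[1:]}
            entities = [e for e in entities if id(e) not in merged_away]
        else:
            # Rule 3: no match, so start a new entity.
            target = {"ips": set(), "usernames": set(), "user_ids": set()}
            entities.append(target)
        target["ips"].add(ip)
        target["usernames"].add(username)
        target["user_ids"].add(user_id)
    # Assign entity ids once the grouping is final.
    for entity_id, entity in enumerate(entities):
        entity["entity_id"] = entity_id
    return entities


users = [
    (1, "a", "1.1.1.1"),
    (2, "b", "2.2.2.2"),
    (3, "a", "2.2.2.2"),  # shares a username with user 1 and an IP with user 2
    (4, "c", "3.3.3.3"),
]
for entity in build_entities(users):
    print(entity["entity_id"], sorted(entity["user_ids"]))
```

The linear scan over entities keeps the sketch close to the wording of the rules; at the scale reported here, an index from username and IP to entity (or a union-find structure) would likely be needed.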
We summarize the entity characteristics in Table 3. The maximum number of users in one entity is almost 38 thousand for ggjx with over 100 unique IP addresses. These results confirm what is expected: the vast majority of bots interacting with our honeypots are part of large botnets. This also allows us to perform content analysis exploring what linguistic qualities differentiate botnets.
Table 4: NLP feature sets we consider for our content analysis and their effectiveness at differentiating botnets
Content Analysis
To better understand botnets, we use natural language processing (Collobert and Weston, 2008) for analyzing the linguistic content of entities. For our analysis, we consider various feature sets as proxies for linguistic characteristics as summarized in Table 4. We use a Maximum Entropy classifier (MegaM, 2016) to test which features differentiate botnets. In order to test a feature, we train the classifier with 70% of the posts, randomly selected, from the N largest entities and test it with the remaining 30% of the posts. Our final results are the average of three runs.
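The evaluation protocol described above (70/30 random split over the N largest entities, averaged over three runs) can be sketched as a small harness. This is an illustrative sketch only: it assumes posts are `(entity_id, features)` pairs and that "largest" means most posts, and the classifier is a pluggable function. The `majority_baseline` shown is a placeholder stand-in, not a Maximum Entropy classifier and not MegaM.

```python
import random
from collections import Counter


def largest_entities(posts, n):
    """Keep only posts from the n entities with the most posts (assumed meaning of 'largest')."""
    counts = Counter(entity for entity, _ in posts)
    keep = {entity for entity, _ in counts.most_common(n)}
    return [post for post in posts if post[0] in keep]


def evaluate(posts, n, train_fn, runs=3, train_fraction=0.7, seed=0):
    """Average accuracy over several random train/test splits."""
    rng = random.Random(seed)
    subset = largest_entities(posts, n)
    accuracies = []
    for _ in range(runs):
        shuffled = subset[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        predict = train_fn(train)  # train_fn returns a function: features -> entity
        correct = sum(predict(features) == entity for entity, features in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / runs


def majority_baseline(train):
    """Placeholder classifier: always predicts the most frequent training entity."""
    label = Counter(entity for entity, _ in train).most_common(1)[0][0]
    return lambda features: label


posts = [("A", {})] * 7 + [("B", {})] * 3 + [("C", {})]
print(evaluate(posts, n=2, train_fn=majority_baseline))
```

Swapping `majority_baseline` for a real classifier only requires a function that takes the training pairs and returns a prediction function.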
The first feature set we test is Bag of Words (BoW), which models the lexical content of posts. Put simply, each word in a document is put into a ‘bag’ and the syntactic structure is discarded. For implementation details, see our technical report (Bevans, 2016). In Figure 2, we show our analysis of the BoW feature set.
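A bag-of-words representation can be sketched in a few lines. The tokenization rule below (lower-casing and splitting on anything other than letters, digits, and apostrophes) is an assumption made for illustration; the technical report should be consulted for the tokenization actually used.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of lowercase letters, digits, and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def bag_of_words(text):
    """Count word occurrences, discarding order and syntactic structure."""
    return Counter(TOKEN.findall(text.lower()))


print(bag_of_words("Buy cheap shoes, buy now!"))
```

Word order is lost by construction: "buy cheap shoes" and "shoes buy cheap" map to the same bag.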
When considering the top 5 contributing entities, the classification accuracy is less than 95% which implies that the lexical content of botnets varies greatly.
The second feature we consider is the taxonomy provided by IBM Watson’s Alchemy API. Alchemy’s output is a list of taxonomy labels and associated confidences. For the purpose of our analysis, we discard any low or non-confident labels. In Figure 3, we show our analysis of the Alchemy Taxonomy feature set, which highlights the accuracy of Alchemy’s taxonomy. We note that the Alchemy Taxonomy feature set is dramatically smaller in size than the BoW feature set while still providing high performance. This indicates a full lexical analysis is not necessary but a taxonomic approach is sufficient.
Our third feature is based on the links in the posts. To create the feature, we parse each post for any HTTP links and strip the link to its core domain name.
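The link feature can be illustrated with the Python standard library. This sketch makes two assumptions that are not from the paper: links are found with a simple regular expression, and "core domain name" is approximated by the lower-cased hostname with a leading `www.` removed. Reducing a hostname to its registered domain would need a public-suffix list, which is omitted here.

```python
import re
from urllib.parse import urlparse

# Assumed link pattern: http or https URLs up to whitespace, quotes, or angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def link_domains(post_text):
    """Return the hostname of every HTTP(S) link in a post, without a leading 'www.'."""
    domains = []
    for url in URL_PATTERN.findall(post_text):
        host = urlparse(url).hostname  # lower-cased, with port and credentials removed
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        domains.append(host)
    return domains


post = ('Visit <a href="http://www.Example.com/shop?id=1">here</a> '
        'and https://spam.example.org:8080/x now')
print(link_domains(post))
```

Posts without links yield an empty list, so a classifier built on this feature has nothing to work with when links are scarce.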
The classifier with the link feature set had varied results, as shown in Table 5, where it was reliable in differentiating ggjx entities but less reliable for the other two honeypots. These results correlate with link scarcity from Table 2.
Figure 2
Figure 3
We test the normalized vocabulary size of a post as a feature. We derive this from the number of unique words divided by the total number of words in the post. As shown in Table 5, the vocabulary size does not differentiate botnets.
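The ratio described above can be computed directly. In this minimal sketch, whitespace splitting and lower-casing are assumptions about how words are delimited, and returning 0.0 for an empty post is a convention chosen here to avoid division by zero.

```python
def normalized_vocabulary_size(text):
    """Number of unique words divided by the total number of words in the post."""
    words = text.lower().split()
    if not words:
        return 0.0  # convention for empty posts (assumption)
    return len(set(words)) / len(words)


print(normalized_vocabulary_size("buy cheap buy now"))  # 3 unique words out of 4
```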
We also form a feature set based on the part-of-speech (PoS) makeup of a post using the Stanford PoS Tagger. The Stanford PoS Tagger returns a pair for each word in the text: the original word and the corresponding PoS. We create a BoW from this response that creates an abstract representation of the document’s syntax. As shown in Table 5, the PoS does not differentiate botnets.
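One plausible reading of the PoS feature is a bag built over the tags alone, which is what the sketch below does; whether the original feature counts tags or word/tag pairs is an assumption here. The input is taken to be already-tagged `(word, tag)` pairs, and the tags in the example are hand-written Penn Treebank labels for illustration, not tagger output.

```python
from collections import Counter


def pos_bag(tagged_tokens):
    """Count part-of-speech tags, discarding the words and their order."""
    return Counter(tag for _word, tag in tagged_tokens)


# Hand-written example pairs (illustrative, not produced by a tagger).
tagged = [("Buy", "VB"), ("cheap", "JJ"), ("shoes", "NNS"),
          ("and", "CC"), ("cheap", "JJ"), ("bags", "NNS")]
print(pos_bag(tagged))
```

Because the words themselves are dropped, two posts about different products can share an identical PoS bag.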
Table 5: Accuracies for various features when identifying 10 and 60 entities using the maximum entropy classifier
Conclusions
In this paper, we examine interesting characteristics of spam-generating botnets and release a novel corpus to the community. We find that hundreds of thousands of fake users are created by a small set of botnets and that far fewer of them actually post spam. The spam that is posted is highly correlated by subject language to the point where botnets labeled by their network behavior are to a large degree re-discoverable using content classification (Figure 3).
While link and vocabulary analysis can be good differentiators of these botnets, it is the content labeling (provided by Alchemy) that is the best indicator. Our experiment only spans 42 days, thus it’s possible the subject specialization is a feature of the campaign rather than the botnet itself.
Bibliography
Apache virtual host. (2016). http://httpd.apache.org/docs/current/vhosts Accessed: 2016-08-10.
Bevans, B., and Khosmood, F. (2016). Forum Spam Corpus. http://users.csc.calpoly.edu/~foaad/bfbevans Accessed: 2017-04-01.
Bevans, B. (2016). “Categorizing Forum Spam.” Master's Theses at Cal Poly Digital Commons. http://digitalcommons.calpoly.edu/theses/1623 Accessed: 2017-04-01.
Collobert, R., and Weston, J. (2008). “A unified architecture for natural language processing: Deep neural networks with multitask learning.” Proceedings of the 25th International Conference on Machine Learning, ACM: 160–67.
Dinh, S. et al. (2015). “Spam campaign detection, analysis, and investigation.” Digital Investigation, (12) S12–S21.
Geerthik, S. (2013). “Survey on internet spam: Classification and analysis.” International Journal of Computer Technology and Applications, 4(3): 384.
Levchenko, K. et al. (2011). “Click trajectories: End-to-end analysis of the spam value chain.” Symposium on Security and Privacy, IEEE. 431–446.
Lumezanu, C. and Feamster, N. (2012). “Observing common spam in twitter and email.” Proceedings of the 2012 ACM conference on Internet measurement, ACM. 461–466.
Mega M. (2016). “Mega model optimization package.” https://www.umiacs.umd.edu/~hal/megam/, Accessed: 2016-08-10.
Niu, Y. et al. (2007). “A quantitative study of forum spamming using context-based analysis.” NDSS.
Pitsillidis, A. et al. (2012). “Taster’s choice: A comparative analysis of spam feeds.” Proceedings of the 2012 ACM conference on Internet measurement, ACM. 427–440.
Shin, Y., Gupta, M., and Myers, S. A. (2011). “The nuts and bolts of a forum spam automator.” LEET.
Spirin, N., and Han, J. (2012). “Survey on web spam detection: Principles and algorithms.” ACM SIGKDD Explorations Newsletter, 13(2): 50–64.
Stone-Gross, B., et al. (2011). “The underground economy of spam: A botmaster’s perspective of coordinating large-scale spam campaigns.” LEET, 11: 4.
Stringhini, G. (2015). “Evilcohort: Detecting communities of malicious accounts on online services.” 24th USENIX Security Symposium (USENIX Security 15), 563–578.
Thomason, A. (2007). “Blog spam: A review.” CEAS, Citeseer.
Thonnard, O. and Dacier, M. (2011). “A strategic analysis of spam botnets operations.” Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, ACM, 162–171.