Understanding Botnet-driven Blog Spam: Motivations and Methods

Brandon Bevans
brandonbevans@gmail.com
California Polytechnic State University, United States of America

Bruce DeBruhl
brandonbevans@gmail.com
California Polytechnic State University, United States of America

Foaad Khosmood
brandonbevans@gmail.com
California Polytechnic State University, United States of America

Introduction

Spam, or unsolicited commercial communication, has evolved from telemarketing schemes to a highly sophisticated and profitable black-market business. Although many users are aware that email spam is prominent, they are less aware of blog spam (Thomason, 2007). Blog spam, also known as forum spam, is spam that is posted to a public or outward-facing website. Blog spam can be used to accomplish many of the same tasks as email spam, such as posting links to a malicious executable.

Blog spam can also serve some unique purposes. First, blog spam can influence purchasing decisions by featuring illegitimate advertisements or reviews. Second, blog spam can include content with target keywords designed to change the way a search engine identifies pages (Geerthik, 2013). Lastly, blog spam can contain link spam, which spams a URL on a victim page to increase the inserted URL’s search engine ranking. Overall, blog spam weakens search engines’ model of the Internet popularity distribution. Much academic and industrial effort has been spent to detect, filter, and deter spam (Dinh et al., 2015; Spirin and Han, 2012).

Less effort has been placed in understanding the underlying distribution mechanisms of spambots and botnets. One foundational study in characterizing blog spam (Niu et al., 2007) provided a quantitative analysis of blog spam in 2007. This study showed that blogs in 2007 included large amounts of spam, but it did not try to identify linked behavior that would imply botnet behavior. A later study on blog spam (Stringhini, 2015) explores using IPs and usernames to detect botnets but does not characterize the behavior of these botnets. In 2011, a research team (Stone-Gross et al., 2011) infiltrated a botnet, which allowed for observations of the logistics around botnet spam campaigns. Overall, our understanding of blog spam generated by botnets is still limited.

Related Work

Various projects have attempted to identify the mechanics, characteristics, and behavior of botnets that control spam. In one important study (Shin et al., 2011), researchers fully evaluated how one of the most popular spam automation programs, XRumer, operates. Another study explored the behavior of botnets across multiple spam campaigns (Thonnard and Dacier, 2011). Others (Pitsillidis et al., 2012) examined the impact that spam datasets had on characterization results. Lumezanu and Feamster (2012) explored the similarities between email spam and blog spam on Twitter. They show that over 50% of spam links from emails also appeared on Twitter.

Figure 1: Browser rendering of the ggjx honeypot

The underground ecosystem built around the botnet community has been explored (Stone-Gross et al., 2011). In a surprising result, over 95% of pharmaceuticals advertised in spam were handled by a small group of banks (Levchenko et al., 2011). Our work is similar in that we are trying to characterize the botnet ecosystem, focusing on the distribution and classification of certain spam-producing botnets.

Experimental Design

In order to classify linguistic similarities and differences in botnets, we implement three honeypots to gather samples of blog spam. We configure our honeypots identically using the Drupal content management system (CMS), as shown in Figure 1. Our honeypots are identical except for the content of their first post and their domain name. Ggjx.org is fashion themed, npcagent.com is sports themed, and gjams.com is pharmaceutical themed. We combine the data collected from Drupal with the Apache server logs (Apache, 2016) to allow for content analysis of data collected over 42 days. To allow botnets time to discover the honeypots, we activate the honeypots at least six weeks before data collection.

We generate three tables of content for each honeypot (Bevans and Khosmood, 2016). In the user table, we record the information the spambot enters while registering and user login statistics, which we summarize in Table 1. This includes the user id, username, password, date of registration, registration IP, and number of logins. In the content table, we record the content of spam posts and comments, which we summarize in Table 2. This includes the blog node id, the author’s unique id, the date posted, the number of hits, type of post, title of the post, text of the post, links in the post, language of the post, and a taxonomy of the post from IBM’s Alchemy API.

Table 1: User table characteristics for three honeypots

Table 2: Characteristics for the content tables

Table 3: Characteristics of entities

Lastly, in the access table, we include data and metadata from the Apache logs. This includes the user id, the access IP, the URL, the HTTP request type, the node ID, and an action keyword describing the type of access.
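As a rough illustration of how access-table fields might be pulled from a server log, the sketch below parses one line in Apache's Common Log Format. The regular expressions, the function name `parse_access_line`, and the assumption that node IDs appear in URLs as `/node/<id>` (Drupal's default path scheme) are ours and are not taken from the paper. The user id and action keyword are not derived here because the paper does not say how they were obtained.

```python
import re

# Common Log Format prefix (assumed):
# host ident authuser [date] "METHOD URL PROTOCOL" status size
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)
# Assumption: node pages use Drupal's default /node/<id> paths.
NODE_PATTERN = re.compile(r"/node/(\d+)")


def parse_access_line(line):
    """Return a dict of access fields, or None if the line does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    url = match.group("url")
    node = NODE_PATTERN.search(url)
    return {
        "ip": match.group("ip"),
        "method": match.group("method"),
        "url": url,
        "status": int(match.group("status")),
        "node_id": int(node.group(1)) if node else None,
    }


# Hypothetical log line, written for this example only.
sample = '203.0.113.7 - - [10/Aug/2016:13:55:36 -0700] "GET /node/42 HTTP/1.1" 200 2326'
record = parse_access_line(sample)
```

Lines that do not match the assumed format return `None` instead of raising, so a full log can be scanned without stopping on malformed entries.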

Our honeypots received a total of 1.1 million requests for ggjx, 481 thousand requests for gjams, and 591 thousand requests for npcagent.

Entity Reduction

It is widely accepted that spambot networks, or botnets, are responsible for most spam. Therefore, we algorithmically reduce spam instances into unique entities representing botnets. For each entity, we define four attributes: entity id, associated IPs, usernames, and associated user ids. To construct entities, we scan through the users and assign each one to an entity as follows.

1. For a user, if an entity exists which contains its username or IP, the user is added to the entity.

2. If more than one entity matches the above criteria, all matching entities are merged.

3. If no entity matches the above criteria, a new entity is created.
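The three steps above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the `Entity` class, the function name `reduce_to_entities`, and the `(user_id, username, ip)` record layout are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare entities by identity, not by field values
class Entity:
    entity_id: int
    ips: set = field(default_factory=set)
    usernames: set = field(default_factory=set)
    user_ids: set = field(default_factory=set)


def reduce_to_entities(users):
    """Group (user_id, username, ip) records into entities."""
    entities = []
    next_id = 0
    for user_id, username, ip in users:
        # Step 1: find every entity that already holds this username or IP.
        matches = [e for e in entities
                   if username in e.usernames or ip in e.ips]
        if matches:
            target = matches[0]
            # Step 2: merge all other matching entities into the first one.
            for other in matches[1:]:
                target.ips |= other.ips
                target.usernames |= other.usernames
                target.user_ids |= other.user_ids
                entities.remove(other)
        else:
            # Step 3: no match, so start a new entity.
            target = Entity(entity_id=next_id)
            next_id += 1
            entities.append(target)
        target.ips.add(ip)
        target.usernames.add(username)
        target.user_ids.add(user_id)
    return entities


# Hypothetical records; user 5 shares a username with one group and an IP
# with another, which forces the two groups to merge.
users = [
    (1, "alice", "10.0.0.1"),
    (2, "bob", "10.0.0.2"),
    (3, "alice", "10.0.0.3"),
    (4, "carol", "10.0.0.2"),
    (5, "bob", "10.0.0.1"),
    (6, "dave", "10.0.0.9"),
]
entities = reduce_to_entities(users)
```

The linear scan over entities keeps the sketch close to the written steps. For tables with tens of thousands of users, an index from username and IP to entity would likely be needed to keep the scan fast.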

We summarize the entity characteristics in Table 3. The maximum number of users in one entity is almost 38 thousand for ggjx, with over 100 unique IP addresses. These results confirm what is expected: the vast majority of bots interacting with our honeypots are part of large botnets. This also allows us to perform content analysis exploring what linguistic qualities differentiate botnets.

Table 4: NLP feature sets we consider for our content analysis and their effectiveness at differentiating botnets

Content Analysis

To better understand botnets, we use natural language processing (Collobert and Weston, 2008) to analyze the linguistic content of entities. For our analysis, we consider various feature sets as proxies for linguistic characteristics, as summarized in Table 4. We use a Maximum Entropy classifier (MegaM, 2016) to test which features differentiate botnets. In order to test a feature, we train the classifier with 70% of the posts, randomly selected, from the N largest entities and test it with the remaining 30% of the posts. Our final results are the average of three runs.
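The evaluation protocol described above can be sketched as follows. This is an illustration only: the function names, the `(entity_id, text)` post representation, and the `train_and_score` callback are our assumptions, and the majority-class baseline merely stands in for the Maximum Entropy classifier, which is not reimplemented here.

```python
import random
from collections import Counter


def top_entities(posts, n):
    """Return the ids of the n entities with the most posts."""
    counts = Counter(entity for entity, _text in posts)
    return {entity for entity, _count in counts.most_common(n)}


def split_posts(posts, train_fraction=0.7, rng=None):
    """Randomly split posts into a training part and a test part."""
    rng = rng or random.Random()
    shuffled = posts[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def average_accuracy(posts, n, train_and_score, runs=3, seed=0):
    """Average the score of train_and_score over several random splits."""
    keep = top_entities(posts, n)
    subset = [post for post in posts if post[0] in keep]
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, test = split_posts(subset, 0.7, rng)
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)


def majority_baseline(train, test):
    """Stand-in classifier: always predict the most frequent training entity."""
    most_common = Counter(e for e, _text in train).most_common(1)[0][0]
    return sum(1 for e, _text in test if e == most_common) / len(test)


# Hypothetical posts as (entity_id, text) pairs.
posts = [("a", "x")] * 10 + [("b", "y")] * 10 + [("c", "z")] * 2
accuracy = average_accuracy(posts, 2, majority_baseline)
```

Passing the classifier in as a callback keeps the split-and-average logic independent of any particular learning library.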

The first feature set we test is Bag of Words (BoW), which models the lexical content of posts. Put simply, each word in a document is put into a ‘bag’ and the syntactic structure is discarded. For implementation details, see our technical report (Bevans, 2016). In Figure 2, we show our analysis of the BoW feature set.
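A bag of words can be illustrated with a simple word counter. The tokenization rule below (lowercasing, then keeping runs of letters, digits, and apostrophes) and the function name `bag_of_words` are our assumptions. The paper defers its actual implementation details to the technical report.

```python
import re
from collections import Counter

# Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")


def bag_of_words(text):
    """Count word occurrences, discarding order and syntax."""
    return Counter(TOKEN.findall(text.lower()))


# Hypothetical spam text, written for this example only.
bag = bag_of_words("Buy cheap bags, buy NOW")
```

Two posts that use the same words in a different order produce the same bag, which is what it means for the syntactic structure to be discarded.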

When considering the top 5 contributing entities, the classification accuracy is less than 95%, which implies that the lexical content of botnets varies greatly.

The second feature we consider is the taxonomy provided by IBM Watson’s Alchemy API. Alchemy’s output is a list of taxonomy labels and associated confidences. For the purpose of our analysis, we discard any low or non-confident labels. In Figure 3, we show our analysis of the Alchemy Taxonomy feature set, which highlights the accuracy of Alchemy’s taxonomy. We note that the Alchemy Taxonomy feature set is dramatically smaller in size than the BoW feature set while still providing high performance. This indicates that a full lexical analysis is not necessary and that a taxonomic approach is sufficient.

Our third feature is based on the links in the posts. To create the feature, we parse each post for any HTTP links and strip the link to its core domain name.
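One way to read the link feature is sketched below. The paper does not define "core domain name" precisely, so this sketch assumes it means the URL's host name with a leading `www.` removed. The regular expression and the function name `link_domains` are likewise hypothetical. Reducing a host to its registered domain would require a public suffix list, which is not attempted here.

```python
import re
from urllib.parse import urlparse

# Assumed link pattern: http or https URLs up to whitespace, quotes, or angle brackets.
LINK = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def link_domains(text):
    """Return the host name of every HTTP(S) link found in a post."""
    domains = []
    for url in LINK.findall(text):
        try:
            host = urlparse(url).hostname  # lowercased by urlparse
        except ValueError:
            continue  # skip malformed URLs
        if not host:
            continue
        if host.startswith("www."):
            host = host[4:]
        domains.append(host)
    return domains


# Hypothetical post body, written for this example only.
post = 'Visit <a href="http://www.Example.com/shoes?id=1">shoes</a> and https://shop.example.org/x'
domains = link_domains(post)
```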

The classifier with the link feature set had varied results, as shown in Table 5: it was reliable in differentiating ggjx entities but less reliable for the other two honeypots. These results correlate with link scarcity from Table 2.

Figure 2

Figure 3

We test the normalized vocabulary size of a post as a feature. We derive this from the number of unique words divided by the total number of words in the post. As shown in Table 5, the vocabulary size does not differentiate botnets.
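The ratio described above can be computed in a few lines. The whitespace tokenization, the lowercasing, and the choice to return 0.0 for an empty post are assumptions of this sketch and are not stated in the paper.

```python
def normalized_vocabulary_size(text):
    """Unique words divided by total words, with assumed whitespace tokenization."""
    words = text.lower().split()
    if not words:
        return 0.0  # assumed convention for empty posts
    return len(set(words)) / len(words)


# Hypothetical post: 3 unique words out of 4 total.
ratio = normalized_vocabulary_size("buy cheap buy now")
```

A post that repeats one word many times scores near 0, while a post with no repeated words scores 1.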

We also form a feature set based on the part-of-speech (PoS) makeup of a post using the Stanford PoS Tagger. The Stanford PoS Tagger returns a pair for each word in the text: the original word and its corresponding PoS. We create a BoW from this response, which gives an abstract representation of the document’s syntax. As shown in Table 5, the PoS does not differentiate botnets.
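A possible shape for the PoS feature is sketched below. It assumes the tagger output is already available as a list of `(word, tag)` pairs and that the bag counts tags only. The paper does not say whether words are kept alongside tags, so this is one plausible reading. The tagged tokens are hand-written with Penn Treebank style tags and are not real tagger output.

```python
from collections import Counter


def pos_bag(tagged_tokens):
    """Count part-of-speech tags, ignoring the words themselves (assumed)."""
    return Counter(tag for _word, tag in tagged_tokens)


# Hand-written (word, tag) pairs for illustration only.
tagged = [
    ("Buy", "VB"),
    ("cheap", "JJ"),
    ("bags", "NNS"),
    ("now", "RB"),
    ("buy", "VB"),
]
tags = pos_bag(tagged)
```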

Table 5: Accuracies for various features when identifying 10 and 60 entities using the maximum entropy classifier

Conclusions

In this paper, we examine interesting characteristics of spam-generating botnets and release a novel corpus to the community. We find that hundreds of thousands of fake users are created by a small set of botnets and that far fewer of them actually post spam. The spam that is posted is highly correlated by subject language, to the point where botnets labeled by their network behavior are to a large degree re-discoverable using content classification (Figure 3).

While link and vocabulary analysis can be good differentiators of these botnets, it is the content labeling (provided by Alchemy) that is the best indicator. Our experiment only spans 42 days, so it is possible that the subject specialization is a feature of the campaign rather than of the botnet itself.

Bibliography

Apache virtual host. (2016). http://httpd.apache.org/docs/current/vhosts Accessed: 2016-08-10.

Bevans, B., and Khosmood, F. (2016). Forum Spam Corpus. http://users.csc.calpoly.edu/~foaad/bfbevans Accessed: 2017-04-01.

Bevans, B. (2016). “Categorizing Forum Spam.” Master's Theses at Cal Poly Digital Commons. http://digitalcommons.calpoly.edu/theses/1623 Accessed: 2017-04-01.

Collobert, R., and Weston, J. (2008). “A unified architecture for natural language processing: Deep neural networks with multitask learning.” Proceedings of the 25th International Conference on Machine Learning, ACM: 160–67.

Dinh, S. et al. (2015). “Spam campaign detection, analysis, and investigation.” Digital Investigation, (12) S12–S21.

Geerthik, S. (2013). “Survey on internet spam: Classification and analysis.” International Journal of Computer Technology and Applications, 4(3): 384.

Levchenko, K. et al. (2011). “Click trajectories: End-to-end analysis of the spam value chain.” Symposium on Security and Privacy, IEEE. 431–446.

Lumezanu, C. and Feamster, N. (2012). “Observing common spam in twitter and email.” Proceedings of the 2012 ACM conference on Internet measurement, ACM. 461–466.

MegaM. (2016). “Mega model optimization package.” https://www.umiacs.umd.edu/~hal/megam/, Accessed: 2016-08-10.

Niu, Y. et al. (2007). “A quantitative study of forum spamming using context-based analysis.” NDSS.

Pitsillidis, A. et al. (2012). “Taster’s choice: A comparative analysis of spam feeds.” Proceedings of the 2012 ACM conference on Internet measurement, ACM. 427–440.

Shin, Y., Gupta, M., and Myers, S. A. (2011). “The nuts and bolts of a forum spam automator.” LEET.

Spirin, N., and Han, J. (2012). “Survey on web spam detection: Principles and algorithms.” ACM SIGKDD Explorations Newsletter, 13(2): 50-64.

Stone-Gross, B., et al. (2011). “The underground economy of spam: A botmaster’s perspective of coordinating large-scale spam campaigns.” LEET, 11:4.

Stringhini, G. (2015). “Evilcohort: Detecting communities of malicious accounts on online services.” 24th USENIX Security Symposium (USENIX Security 15), 563–578.

Thomason, A. (2007). “Blog spam: A review.” CEAS, Citeseer.

Thonnard, O. and Dacier, M. (2011). “A strategic analysis of spam botnets operations.” Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference, ACM, 162–171.
