Webbots, Spiders, And Screen Scrapers - Michael Schrenk

Webbots, Spiders, and Screen Scrapers

Michael Schrenk
Editor: William Pollock
Copyright © 2009 No Starch Press

Dedication
In loving memory
Charlotte Schrenk

1897–1982


ACKNOWLEDGMENTS

I needed support and inspiration from family, friends, and colleagues to write this book. Unfortunately, I did not always acknowledge their contributions when they offered them. Here is a delayed thanks to all of those who helped me.

Thanks to Donna, my wife, who convinced me that I could actually do this, and to my kids, Ava and Gordon, who have always supported my crazy schemes, even though they know it means fewer coffees and chess matches together.

Andy King encouraged me to find a publisher for this project, and Daniel Stenberg, founder of the cURL project, helped me organize my thoughts when this book was barely an outline.

No Starch Press exhibited saint-like patience while I split my time between writing webbots and writing about webbots. Special thanks to Bill, who trusted the concept, Tyler, who edited most of the manuscript, and Christina, who kept me on task. Peter MacIntyre was instrumental in checking for technical errors, and Megan's copyediting improved the book throughout.

Anamika Mishra assisted with the book's website and consistently covered for me when I was busy writing or too tired to code.

Laurie Curtis helped me explore what it might be like to finish a book.

Finally, a tip of the hat goes to Mark, Randy, Megan, Karen, Terri, Susan, Dennis, Dan, and Matt, who were thoughtful enough to ask about my book's progress before inquiring about the status of their projects.

Page 6: Webbots, Spiders, And Screen Scrapers - Michael Schrenk

Introduction

My introduction to the World Wide Web was also the beginning of my relationship with the browser. The first browser I used was Mosaic, pioneered by Eric Bina and Marc Andreessen. Andreessen later co-founded Netscape.

Shortly after I discovered the World Wide Web, I began to associate the wonders of the Internet with the simplicity of the browser. By just clicking a hyperlink, I could enjoy the art treasures of the Louvre; if I followed another link, I could peruse a fan site for The Brady Bunch.[1] The browser was more than a software application that facilitated use of the World Wide Web: It was the World Wide Web. It was the new television. And just as television tamed distant video signals with simple channel and volume knobs, browsers demystified the complexities of the Internet with hyperlinks, bookmarks, and back buttons.

Old-School Client-Server Technology

My big moment of discovery came when I learned that I didn't need a browser to view web pages. I realized that Telnet, a program used since the early '80s to communicate with networked computers, could also download web pages, as shown in Figure 1.

Figure 1. Viewing a web page with Telnet

Suddenly, the World Wide Web was something I could understand without a browser. It was a familiar client-server architecture where simple clients worked on tasks found on remote servers. The difference here was that the clients were browsers and the servers dished up web pages. The only revolutionary thing was that, unlike previous client-server applications, browsers were easy for anyone to use and soon gained mass acceptance. The Internet's audience shifted from physicists and computer programmers to the public. Unfortunately, the general public didn't understand client-server technology, so the dependency on browsers spread further. They didn't understand that there were other ways to use the World Wide Web.

As a programmer, I realized that if I could use Telnet to download web pages, I could also write programs to do the same. I could write my own browser if I desired, or I could write automated agents (webbots, spiders, and screen scrapers) to solve problems that browsers couldn't.

[1] I stumbled across a fan site for The Brady Bunch during my first World Wide Web experience.


The Problem with Browsers

The basic problem with browsers is that they're manual tools. Your browser only downloads and renders websites: You still need to decide if the web page is relevant, if you've already seen the information it contains, or if you need to follow a link to another web page. What's worse, your browser can't think for itself. It can't notify you when something important happens online, and it certainly won't anticipate your actions, automatically complete forms, make purchases, or download files for you. To do these things, you'll need the automation and intelligence only available with a webbot, or a web robot.


What to Expect from This Book

This book identifies the limitations of typical web browsers and explores how you can use webbots to capitalize on these limitations. You'll learn how to design and write webbots through sample scripts and example projects. Moreover, you'll find answers to larger design questions like these:

Where do ideas for webbot projects come from?
How can I have fun with webbots and stay out of trouble?
Is it possible to write stealthy webbots that run without detection?
What is the trick to writing robust, fault-tolerant webbots that won't break as Internet content changes?

Learn from My Mistakes

I've written webbots, spiders, and screen scrapers for nearly 10 years, and in the process I've made most of the mistakes someone can make. Because webbots are capable of making unconventional demands on websites, system administrators can confuse webbots' requests with attempts to hack into their systems. Thankfully, none of my mistakes has ever led to a courtroom, but they have resulted in intimidating phone calls, scary emails, and very awkward moments. Happily, I can say that I've learned from these situations, and it's been a very long time since I've been across the desk from an angry system administrator. You can spare yourself a lot of grief by reading my stories and learning from my mistakes.

Master Webbot Techniques

You will learn about the technology needed to write a wide assortment of webbots. Some technical skills you'll master include these:

Programmatically downloading websites
Decoding encrypted websites
Unlocking authenticated web pages
Managing cookies
Parsing data
Writing spiders
Managing the large amounts of data that webbots generate

Leverage Existing Scripts

This book uses several code libraries that make it easy for you to write webbots, spiders, and screen scrapers. The functions and declarations in these libraries provide the basis for most of the example scripts used in this book. You'll save time by using these libraries because they do the underlying work, leaving the upper-level planning and development to you. All of these libraries are available for download at this book's website.


About the Website

This book's website (http://www.schrenk.com/nostarch/webbots) is an additional resource for you to use. To the extent that it's possible, all the example projects in this book use web pages on the companion site as targets, or resources for your webbots to download and take action on. These targets provide a consistent (unchanging) environment for you to hone your webbot-writing skills. A controlled learning environment is important because, regardless of our best efforts, webbots can fail when their target websites change. Knowing that your targets are unchanging makes the task of debugging a little easier.

The companion website also has links to other sites of interest, white papers, book updates, and an area where you can communicate with other webbot developers (see Figure 2). From the website, you will also be able to access all of the example code libraries used in this book.


Figure 2. The official website of Webbots, Spiders, and Screen Scrapers


About the Code

Most of the scripts in this book are straight PHP. However, sometimes PHP and HTML are intermixed in the same script—and in many cases, on the same line. In those situations, a bold typeface differentiates PHP scripts from HTML, as shown in Listing 1-1.

You may use any of the scripts in this book for your own personal use, as long as you agree not to redistribute them. If you use any script in this book, you also consent to bear full responsibility for its use and execution and agree not to sell or create derivative products, under any circumstances. However, if you do improve any of these scripts or develop entirely new (related) scripts, you are encouraged to share them with the webbot community via the book's website.

<h1>Coding Conventions for Embedded PHP</h1>
<table border="0" cellpadding="1" cellspacing="0">
    <tr>
        <th>Name</th>
        <th>Address</th>
    </tr>
    <? for ($x=0; $x<sizeof($person_array); $x++) { ?>
        <tr>
            <td><? echo $person_array[$x]['NAME']?></td>
            <td><? echo $person_array[$x]['ADDRESS']?></td>
        </tr>
    <? } ?>
</table>

Listing 1-1: Bold typeface differentiates PHP from HTML script

The other thing you should know about the example scripts is that they are teaching aids. The scripts may not reflect the most efficient programming method, because their primary goal is readability.

Note

The code libraries used by this book are governed by the W3C Software Notice and License (http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231) and are available for download from the book's website. The website is also where the software is maintained. If you make meaningful contributions to this code, please go to the website to see how your improvements may be part of the next distribution. The software examples depicted in this book are protected by this book's copyright.


Requirements

Knowing HTML and the basics of how the Internet works will be necessary for using this book. If you are a beginning programmer with even nominal computer network experience, you'll be fine. It is important to recognize, however, that this book will not teach you how to program or how TCP/IP, the protocol of the Internet, works.

Hardware

You don't need elaborate hardware to start writing webbots. If you have a secondhand 33 MHz Pentium computer, you have the minimum requirement to play with all the examples in this book. Any of the following hardware is appropriate for using the examples and information in this book:

A personal computer that uses a Windows 95, Windows XP, or Windows Vista operating system
Any reasonably modern Linux-, Unix-, or FreeBSD-based computer
A Macintosh running OS X (or later)


It will also prove useful to have ample storage. This is particularly true if your plan is to write spiders, self-directed webbots, which can consume all available resources (especially hard drives) if they are allowed to download too many files.

Software

In an effort to be as relevant as possible, the software examples in this book use PHP,[2] cURL,[3] and MySQL.[4] All of these software technologies are available as free downloads from their respective websites. In addition to being free, these software packages are wonderfully portable and function well on a variety of computers and operating systems.

Note

If you're going to follow the script examples in this book, you will need a basic knowledge of PHP. This book assumes you know how to program.

Internet Access

A connection to the Internet is very handy, but not entirely necessary. If you lack a network connection, you can create your own local intranet (one or more webservers on a private network) by loading Apache[5] onto your computer, and if that's not possible, you can design programs that use local files as targets. However, neither of these options is as fun as writing webbots that use a live Internet connection. In addition, if you lack an Internet connection, you will not have access to the online resources, which add a lot of value to your learning experience.

[2] See http://www.php.net.
[3] See http://curl.haxx.se.
[4] See http://www.mysql.com.
[5] See http://www.apache.org.


A Disclaimer (This Is Important)

As with anything you develop, you must take responsibility for your own actions. From a technology standpoint, there is little to distinguish a beneficial webbot from one that does destructive things. The main difference is the intent of the developer (and how well you debug your scripts). Therefore, it's up to you to do constructive things with the information in this book and not violate copyright law, disrupt networks, or do anything else that would be troublesome or illegal. And if you do, don't call me.

Please reference Chapter 28 for insight into how to write webbots ethically; it will help you do this, but it won't provide legal advice. If you have questions, talk to a lawyer before you experiment.


Part I. FUNDAMENTAL CONCEPTS AND TECHNIQUES

While most web development books explain how to create websites, this book teaches developers how to combine, adapt, and automate existing websites to fit their specific needs. Part I introduces the concept of web automation and explores elementary techniques to harness the resources of the Web.

Chapter 1
This chapter explores why it is fun to write webbots and why webbot development is a rewarding career with expanding possibilities.

Chapter 2
We've been led to believe that the only way to use a website is with a browser. If, however, you examine what you want to do, as opposed to what a browser allows you to do, you'll look at your favorite web resources in a whole new way. This chapter discusses existing as well as potential webbots.

Chapter 3
This chapter introduces PHP/CURL, the free library that makes it easy to download web pages—even when the targeted web pages use advanced techniques like forwarding, encryption, authentication, and cookies.

Chapter 4
Downloaded web pages aren't of any use until your webbot can separate the data you need from the data you don't need.

Chapter 5
To truly automate web agents, your application needs the ability to automatically upload data to online forms.

Chapter 6
Spiders in particular can generate huge amounts of data. That's why it's important for you to know how to effectively store and reduce the size of web pages, text, and images.

You may already have experience from other areas of computer science that you can apply to these activities. However, even if these concepts are familiar to you, developing webbots may force you to view these skills in a different context, so the following chapters are still worth reading. If you don't already have experience in these areas, the next six chapters will provide the basics for designing and developing webbots. You'll use this groundwork in the other projects and advanced considerations discussed later.


Chapter 1. WHAT'S IN IT FOR YOU?

Whether you're a software developer looking for new skills or a business leader looking for a competitive advantage, this chapter is where you will discover how webbots create opportunities.

Uncovering the Internet's True Potential

Webbots present a virtually untapped resource for software developers and business leaders. This is because the public has yet to realize that most of the Internet's potential lies outside the capability of the existing browser/website paradigm. For example, in today's world, people are satisfied with pointing a browser at a website and using whatever information or services they find there. With webbots, the focus of the Internet will shift from what's available on individual websites toward what people actually want to accomplish. To this end, webbots will use as many online resources as required to satisfy their individual needs.

To be successful with webbots, you need to stop thinking like other Internet users. Namely, you need to stop thinking about the Internet in terms of a browser viewing one website at a time. This will be difficult, because we've all become dependent on browsers. While you can do a wide variety of things with a browser, you also pay a price for that versatility—browsers need to be sufficiently generic to be useful in a wide variety of circumstances. As a result, browsers can do general things well, but they lack the ability to do specific things exceptionally well.[6] Webbots, on the other hand, can be programmed for specific tasks and can perform those tasks with perfection. Additionally, webbots have the ability to automate anything you do online or notify you when something needs to be done.

[6] For example, they can't act on your behalf, filter content for relevance, or perform tasks automatically.


What's in It for Developers?

Your ability to write a webbot can distinguish you from a pack of lesser developers. Web developers—who've gone from designing the new economy of the late 1990s to falling victim to it during the dot-com crash of 2001—know that today's job market is very competitive. Even today's most talented developers can have trouble finding meaningful work. Knowing how to write webbots will expand your ability as a developer and make you more valuable to your employer or potential employers.

A webbot writer differentiates his or her skill set from that of someone whose knowledge of Internet technology extends only to creating websites. By designing webbots, you demonstrate that you have a thorough understanding of network technology and a variety of network protocols, as well as the ability to use existing technology in new and creative ways.

Webbot Developers Are in Demand

There are many growth opportunities for webbot developers. You can demonstrate this for yourself by looking at your website's file access logs and recording all the non-browsers that have visited your website. If you compare current server logs to those from a year ago, you should notice a healthy increase in traffic from nontraditional web clients or webbots. Someone has to write these automated agents, and as the demand for webbots increases, so does the demand for webbot developers.

Hard statistics on the growth of webbot use are hard to come by, since many webbots defy detection and masquerade as traditional web browsers. In fact, the value that webbots bring to businesses forces most webbot projects underground. I can't talk about most of the webbots I've developed because they create competitive advantages for clients, and they'd rather keep those techniques secret. Regardless of the actual numbers, it's a fact that webbots and spiders comprise a large amount of today's Internet traffic and that many developers are required to both maintain existing webbots and develop new ones.

Webbots Are Fun to Write

In addition to solving serious business problems, webbots are also fun to write. This should be welcome news to seasoned developers who no longer experience the thrill of solving a problem or using a technology for the first time. Without a little fun, it's easy for developers to get bored and conclude that software is simply a sequence of instructions that do the same thing every time a program runs. While predictability makes software dependable, it also makes it tiresome to write. This is especially true for computer programmers who specialize in a specific industry and lack diversity in tasks. At some point in their careers, nearly all of the programmers I know have become very tired of what they do, in spite of the fact that they still like to write computer programs.

Webbots, however, are almost like games, in that they can pleasantly surprise their developers with their unpredictability. This is because webbots operate on data that changes frequently, and they respond slightly differently every time they run. As a result, webbots become impulsive and lifelike. Unlike other software, webbots feel organic! Once you write a webbot that does something wonderfully unexpected, you'll have a hard time describing the experience to those writing traditional software applications.

Webbots Facilitate "Constructive Hacking"

By its strict definition, hacking is the process of creatively using technology for a purpose other than the one originally intended. By using web pages, news groups, email, or other online technology in unintended ways, you join the ranks of innovators who combine and alter existing technology to create totally new and useful tools. You'll also broaden the possibilities for using the Internet.

Unfortunately, hacking also has a dark side, popularized by stories of people breaking into systems, stealing private data, and rendering online services unusable. While some people do write destructive webbots, I don't condone that type of behavior here. In fact, Chapter 28 is dedicated to this very subject.


What's in It for Business Leaders?

Few businesses gain a competitive advantage simply by using the Internet. Today, businesses need a unique online strategy to gain a competitive advantage. Unfortunately, most businesses limit their online strategy to a website—which, barring some visual design differences, essentially functions like all the other websites within the industry.

Customize the Internet for Your Business

Most of the webbot projects I've developed are for business leaders who've become frustrated with the Internet as it is. They want added automation and decision-making capability on the websites they use to run their businesses. Essentially, they want webbots that customize other people's websites (and the data those sites contain) for the specific way they do business. Progressive businesses use webbots to improve their online experience, optimizing how they buy things, how they gather facts, how they're notified when things change, and how they enforce business rules when making online purchases.

Businesses that use webbots aren't limited to envisioning the Internet as a set of websites that are accessed by browsers. Instead, they see the Internet as a stockpile of varied resources that they can customize (using webbots) to serve their specific needs.

There has always been a lag between when people figure out how to do something manually and when they figure out how to automate the process. Just as chainsaws replaced axes and as sewing machines superseded needles and thimbles, it is only natural to assume that new (automated) methods for interacting with the Internet will follow the methods we use today. The companies that develop these processes will be the first to enjoy the competitive advantage created by their vision.

Capitalize on the Public's Inexperience with Webbots

Most people have very little experience using the Internet with anything other than a browser, and even if people have used other Internet clients like email or news readers, they have never thought about how their online experience could be improved through automation. For most, it just hasn't been an issue.

For businesspeople, blind allegiance to browsers is a double-edged sword. In one respect, it's good that people aren't familiar with the benefits that webbots provide—this provides opportunities for you to develop webbot projects that offer competitive advantages. On the other hand, if your supervisors are used to the Internet as seen through a browser alone, you may have a hard time selling your webbot projects to management.

Accomplish a Lot with a Small Investment

Webbots can achieve amazing results without elaborate setups. I've used obsolete computers with slow, dial-up connections to run webbots that create completely new revenue channels for businesses. Webbots can even be designed to work with existing office equipment like phones, fax machines, and printers.


Final Thoughts

One of the nice things about webbots is that you can create a large effect without making something difficult for customers to use. In fact, customers don't even need to know that a webbot is involved. For example, your webbots can deliver services through traditional-looking websites. While you know that you're doing something radically innovative, the end users don't realize what's going on behind the scenes—and they don't really need to know about the hordes of hidden webbots and spiders combing the Internet for the data and services they need. All they know is that they are getting an improved Internet experience. And in the end, that's all that matters.


Chapter 2. IDEAS FOR WEBBOT PROJECTS

It's often more difficult to find applications for new technology than it is to learn the technology itself. Therefore, this chapter focuses on encouraging you to generate ideas for things that you can do with webbots. We'll explore how webbots capitalize on browser limitations, and we'll see a few examples of what people are currently doing with webbots. We'll wrap up by throwing out some wild ideas that might help you expand your expectations of what can be done online.

Inspiration from Browser Limitations

A useful method for generating ideas for webbot projects is to study what cannot be done by simply pointing a browser at a typical website. You know that browsers, used in traditional ways, cannot automate your Internet experience. For example, they have these limitations:

Browsers cannot aggregate and filter information for relevance
Browsers cannot interpret what they find online
Browsers cannot act on your behalf

However, a browser may leverage the power of a webbot to do many things that it could not do alone. Let's look at some real-life examples of how browser limitations were leveraged into actual webbot projects.

Webbots That Aggregate and Filter Information for Relevance

TrackRates.com (http://www.trackrates.com, shown in Figure 2-1) is a website that deploys an army of webbots to aggregate and filter hotel room prices from travel websites. By identifying room prices for specific hotels for specific dates, it determines the actual market value for rooms up to three months into the future. This information helps hotel managers intelligently price rooms by specifically knowing what the competition is charging for similar rooms. TrackRates.com also reveals market trends by performing statistical analysis on room prices, and it tries to determine periods of high demand by indicating dates on which hotels have booked all of their rooms.

Figure 2-1. TrackRates.com

I wrote TrackRates.com to help hotel managers analyze local markets and provide facts for setting room prices. Without the TrackRates.com webbot, hotel managers either need to guess what their rooms are worth, rely on less current information about their local hotel market, or go through the arduous task of manually collecting this data.


Webbots That Interpret What They Find Online

WebSiteOptimization.com (http://www.websiteoptimization.com) uses a webbot to help web developers create websites that use resources effectively. This webbot accepts a web page's URL (as shown in Figure 2-2) and analyzes how each graphic, CSS, and JavaScript file is used by the web page. In the interest of full disclosure, I should mention that I wrote the back end for this web page analyzer.


Figure 2-2. A website-analyzing webbot

The WebSiteOptimization.com webbot analyzes the data it collects and offers suggestions for optimizing website performance. Without this tool, developers would have to manually parse through their HTML code to determine which files are required by web pages, how much bandwidth they are using, and how the organization of the web page affects its performance.

Webbots That Act on Your Behalf

Pokerbots, webbots that play online poker, are a response to the recent growth in online gambling sites, particularly gaming sites with live poker rooms. While the action in these poker sites is live, not all the players are. Some online poker players are webbots, like Poker Robot, shown in Figure 2-3.

Webbots designed to play online poker not only know the rules of Texas hold 'em but use predetermined business rules to expertly read how others play. They use this information to hold, fold, or bet appropriately. Reportedly, these automated players can very effectively pick the pockets of new and inexperienced poker players.

Some collusion webbots even allow one virtual player to play multiple hands at the same table, while making it look like a separate person is playing each hand. Imagine playing against a group of people who not only know each other's cards, but hold, fold, and bet against you as a team!

Obviously, such webbots that play expert poker (and cheat) provide a tremendous advantage. Nobody knows exactly how prevalent pokerbots are, but they have created a market for anti-pokerbot software like Poker BodyGuard, distributed by StopPokerCheaters.com.


Figure 2-3. An example pokerbot


A Few Crazy Ideas to Get You Started

One of the goals of this book is to encourage you to write new and experimental webbots of your own design. A way to jumpstart this process is to brainstorm and generate some ideas for potential projects. I've taken this opportunity to list a few ideas to get you started. These ideas are not here necessarily because they have commercial value. Instead, they should act as inspiration for your own webbots and what you want to accomplish online.

When designing a webbot, remember that the more specifically you can define the task, the more useful your webbot will be. What can you do with a webbot? Let's look at a few scenarios.

Help Out a Busy Executive

Suppose you're a busy executive type and you like to start your day reading your online industry publication. Time is limited, however, and you only let yourself read industry news until you've finished your first cup of coffee. Therefore, you don't want to be bothered with stories that you've read before or that you know are not relevant to your business. You ask your developer to create a specialized webbot that consolidates articles from your favorite industry news sources and only displays links to stories that it has not shown you before.

The webbot could ignore articles that contain certain key phrases you previously entered in an exclusion list[7] and highlight articles that contain references to you or your competitors. With such an application, you could quickly scan what's happening in your industry and only spend time reading relevant articles. You might even have more time to enjoy your coffee.

Save Money by Automating Tasks

It's possible to design a webbot that automatically buys inventory for a store, given a predetermined set of buying criteria. For example, assume you own a store that sells used travel gear. Some of your sources for inventory are online auction websites.[8] Say you are interested in bidding on under-priced Tumi suitcases during the closing minute of their auctions. If you don't use a webbot of some sort, you will have to use a web browser to check each auction site periodically.

Without a webbot, it can be expensive to use the Internet in a business setting, because repetitive tasks (like procuring inventory) are time consuming without automation. Additionally, the more mundane the task, the greater the opportunity for human error. Checking online auctions for products to resell could easily consume one or two hours a day—up to 25 percent of a 40-hour work week. At that rate, someone with an annual salary of $80,000 would cost a company $20,000 a year to procure inventory (without a webbot). That cost does not include the cost of opportunities lost while the employee manually surfs auction sites. In scenarios like this, it's easy to see how product acquisition with a webbot saves a lot of money—even for a small business with small requirements. Additionally, a webbot may uncover bargains missed by someone manually searching the auction site.

Protect Intellectual Property

You can write a webbot to protect your online intellectual property. For example, suppose you spent many hours writing a JavaScript program. It has commercial value, and you license the script for others to use for a fee. You've been selling the program for a few months and have learned that some people are downloading and using your program without paying for it. You write a webbot to find websites that are using your JavaScript program without your permission. This webbot searches the Internet and makes a list of URLs that reference your JavaScript file. In a separate step, the webbot does a whois lookup on the domain to determine the owner from the domain registrar.[9] If the domain is not one of your registered users, the webbot compiles contact information from the domain registrar so you can contact the parties who are using unlicensed copies of your code.

Monitor Opportunities

You can also write webbots that alert you when particular opportunities arise. For example, let's say that you have an interest in acquiring a Jack Russell Terrier.[10] Instead of devoting part of each day to searching for your new dog, you decide to write a webbot to search for you and notify you when it finds a dog meeting your requirements. Your webbot performs a daily search of the websites of local animal shelters and dog rescue organizations. It parses the contents of the sites, looking for your dog. When the webbot finds a Jack Russell Terrier, it sends you an email notification describing the dog and its location. The webbot also records this specific dog in its database, so it doesn't send additional notifications for the same dog in the future. This is a fairly common webbot task, which could be modified to automatically discover job listings, sports scores, or any other timely information.

Verify Access Rights on a Website

Webbots may prevent the potentially nightmarish situation that exists for any web developer who mistakenly gives one user access to another user's data. To avoid this situation, you could commission a webbot to verify that all users receive the correct access to your site. This webbot logs in to the site with every viable username and password. While acting on each user's behalf, the webbot accesses every available page and compares those pages to a list of appropriate pages for each user. If the webbot finds a user is inadvertently able to access something he or she shouldn't, that account is temporarily suspended until the problem is fixed. Every morning before you arrive at your office, the webbot emails a report of any irregularities it found the night before.

Create an Online Clipping Service


Suppose you're very vain, and you'd like a webbot to send an email to your mother every time a major news service mentions your name. However, since you're not vain enough to check all the main news websites on a regular basis, you write a webbot that accomplishes the task for you. This webbot accesses a collection of websites, including CNN, Forbes, and Fortune. You design your webbot to look only for articles that mention your name, and you employ an exclusion list to ignore all articles that contain words or phrases like shakedown, corruption, or money laundering. When the webbot finds an appropriate article, it automatically sends your mother an email with a link to the article. Your webbot also blind copies you on all emails it sends so you know what she's talking about when she calls.

Plot Unauthorized Wi-Fi Networks

You could write a webbot that aids in maintaining network security on a large corporate campus. For example, suppose that you recently discovered that you have a problem with employees attaching unauthorized wireless access points to your network. Since these unauthorized access points occur inside your firewalls and proxies, you recognize that these unauthorized Wi-Fi networks pose a security risk that you need to control. Therefore, in addition to a new security policy, you decide to create a webbot that automatically finds and records the location of all wireless networks on your corporate campus.

You notice that your mail room uses a small metal cart to deliver mail. Because this cart reaches every corner of the corporate campus on a daily basis, you seek and obtain permission to attach a small laptop computer with a webbot and Global Positioning System (GPS) card to the cart. As your webbot hitches a ride through the campus, it looks for open wireless network connections. When it finds a wireless network, it uses the open network to send its GPS location to a special website. This website logs the GPS coordinates, IP address, and date of uplink in a database. If you did your homework correctly, in a few days your webbot should create a map of all open Wi-Fi networks, authorized and unauthorized, in your entire corporate campus.

Track Web Technologies

You could write webbots that use web page headers, the information that servers send to browsers so they may correctly render websites, to maintain a list of web technologies used by major corporations. Headers typically indicate the type of webserver (and often the operating system) that websites use, as shown in Figure 2-4.

Figure 2-4. A web page header showing server technology

Your webbot starts by accessing the headers of each website from a list that you keep in a database. It then parses web technology information from the header. Finally, the webbot stores that information in a database that is used by a graphing program to plot how server technology choices change over time.
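As a quick illustration (not one of the book's listings), PHP's built-in get_headers() function is one simple way to fetch a site's headers and pull out the Server line; the target URL below is a placeholder for the sites in your own list.

<?php
# Fetch a site's HTTP headers and report the Server line.
# The target URL is a placeholder; substitute sites from your own list.
$target = "http://www.example.com";
$headers = get_headers($target);                 // array of raw header lines

foreach ($headers as $header)
{
    if (stripos($header, "Server:") === 0)       // e.g., "Server: Apache/2.0.58 (Unix)"
        echo $header . "\n";
}
?>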

Allow Incompatible Systems to Communicate

In addition to creating human-readable output, you could design a webbot that only talks to other computers. For example, let's say that you want to synchronize two databases, one on a local private network and one that's behind a public website. In this case, synchronization (ensuring that both databases contain the same information) is difficult because the systems use different technologies with incompatible synchronization techniques. Given the circumstances, you could write a webbot that runs on your private network and, for example, analyzes the public database through a password-protected web service every morning. The webbot uses the Internet as a common protocol between these databases, analyzes data on both systems, and exchanges the appropriate data to synchronize the two databases.

[7] An exclusion list is a list of keywords or phrases that are ignored by a webbot.
[8] Some online auctions actually provide tools to help you write webbots that manage auctions. If you're interested in automating online auctions, check out eBay's Developers Program (http://developer.ebay.com).
[9] whois is a service that returns information about the owner of a website. You can do the equivalent of a whois from a shell script or from an online service.
[10] I actually met my dog online.


Final Thoughts

Studying browser limitations is one way to uncover ideas for new webbot designs. You've seen some real-world examples of webbots in use and read some descriptions of conceptual webbot designs. But, enough with theory—let's head to the lab!

The next four chapters describe the basics of webbot development: downloading pages, parsing data, emulating form submission, and managing large amounts of data. Once you master these concepts, you can move on to actual webbot projects.


Chapter 3. DOWNLOADING WEB PAGES

The most important thing a webbot does is move web pages from the Internet to your computer. Once the web page is on your computer, your webbot can parse and manipulate it.

This chapter will show you how to write simple PHP scripts that download web pages. More importantly, you'll learn PHP's limitations and how to overcome them with PHP/CURL, a special binding of the cURL library that facilitates many advanced network features. cURL is used widely by many computer languages as a means to access network files with a number of protocols and options.

Note

While web pages are the most common targets for webbots and spiders, the Web is not the only source of information for your webbots. Later chapters will explore methods for extracting data from newsgroups, email, and FTP servers, as well.


Prior to discovering PHP, I wrote webbots in a variety of languages, including Visual Basic, Java, and Tcl/Tk. But due to its simple syntax, in-depth string parsing capabilities, networking functions, and portability, PHP proved ideal for webbot development. However, PHP is primarily a server language, and its chief purpose is to help webservers interpret incoming requests and send the appropriate web pages in response. Since webbots don't serve pages (they request them), this book supplements PHP's built-in functions with PHP/CURL and a variety of libraries, developed specifically to help you learn to write webbots and spiders.

Think About Files, Not Web Pages

To most people, the Web appears as a collection of web pages. But in reality, the Web is a collection of files that form those web pages. These files may exist on servers anywhere in the world, and they only create web pages when they are viewed together. Because browsers simplify the process of downloading and rendering the individual files that make up web pages, you need to know the nuts and bolts of how web pages are put together before you write your first webbot.

When your browser requests a file, as shown in Figure 3-1, the webserver that fields the request sends your browser a default or index file, which maps the location of all the files that the web page needs and tells how to render the text and images that comprise that web page.

Figure 3-1. When a browser requests a web page, it first receives an index file.

As a rule, this index file also contains references to the other files required to render the complete web page,[11] as shown in Figure 3-2. These may include images, JavaScript, style sheets, or complex media files like Flash, QuickTime, or Windows Media files. The browser downloads each file separately, as it is referenced by the index file.


Figure 3-2. Downloading files, as they are referenced by the index file

For example, if you request a web page with references to eight items, your single web page actually executes nine separate file downloads (one for the web page and one for each file referenced by the web page). Usually, each file resides on the same server, but they could just as easily exist on separate domains, as shown in Figure 3-2.

[11] Some very simple websites consist of only one file.


Downloading Files with PHP's Built-in Functions

Before you can appreciate PHP/CURL, you'll need to familiarize yourself with PHP's built-in functions for downloading files from the Internet.

Downloading Files with fopen() and fgets()

PHP includes two simple built-in functions for downloading files from a network—fopen() and fgets(). The fopen() function does two things. First, it creates a network socket, which represents the link between your webbot and the network resource you want to retrieve. Second, it implements the HTTP protocol, which defines how data is transferred. With those tasks completed, fgets() leverages the networking ability of your computer's operating system to pull the file from the Internet.

Creating Your First Webbot Script

Let's use PHP's built-in functions to create your first webbot, which downloads a "Hello, world!" web page from this book's companion website. The short script is shown in Listing 3-1.

# Define the file you want to download
$target = "http://www.schrenk.com/nostarch/webbots/hello_world.html";
$file_handle = fopen($target, "r");

# Fetch the file
while (!feof($file_handle))
    echo fgets($file_handle, 4096);
fclose($file_handle);

Listing 3-1: Downloading a file from the Web with fopen() and fgets()

As shown in Listing 3-1, fopen() establishes a network connection to the target, or file you want to download. It references this connection with a file handle, or network link, called $file_handle. The script then uses fgets() to fetch and echo the file in 4,096-byte chunks until it has downloaded and displayed the entire file. Finally, the script executes an fclose() to tell PHP that it's finished with the network handle.

Before we can execute the example in Listing 3-1, we need to examine the two ways to execute a webbot: You can run a webbot either in a browser or in a command shell.[12]

Executing Webbots in Command Shells

If you have a choice, it is usually better to execute webbots from a shell or command line. Webbots generally don't care about web page formatting, so they will display exactly what is returned from a webserver. Browsers, in contrast, will interpret HTML tags as instructions for rendering the web page. For example, Figure 3-3 shows what Listing 3-1 looks like when executed in a shell.

Figure 3-3. Running a webbot script in a shell

Executing Webbots in Browsers

To run a webbot script in a browser, simply load the script on a webserver and execute it by loading its URL into the browser's location bar as you would any other web page. Contrast Figure 3-3 with Figure 3-4, where the same script is run within a browser. The HTML tags are gone, as well as all of the structure of the returned file; the only things displayed are two lines of text. Running a webbot in a browser only shows a partial picture and often hides important information that a webbot needs.

Note

To display HTML tags within a browser, surround the output with <xmp> and </xmp> tags.
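For example, a webbot script might wrap its output like this (a minimal sketch; $downloaded_page is an assumed variable holding the fetched file):

<?php
# Show raw HTML in a browser instead of rendering it by wrapping
# the output in <xmp> tags. $downloaded_page is an assumed variable.
echo "<xmp>";
echo $downloaded_page;
echo "</xmp>";
?>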


Figure 3-4. Browser "rendering" the output of a webbot

Browser buffering is another complication you might run into if you try to execute a webbot in a browser. Buffering is useful when you're viewing web pages because it allows a browser to wait until it has collected enough of a web page before it starts rendering or displaying the web page. However, browser buffering is troublesome for webbots because they frequently run for extended periods of time—much longer than it would take to download a typical web page. During prolonged webbot execution, status messages written by the webbot may not be displayed by the browser while it is buffering the display.

I have one webbot that runs continuously; in fact, it once ran for seven months before stopping during a power outage. This webbot could never run effectively in a browser because browsers are designed to render web pages with files of finite length. Browsers assume short download periods and may buffer an entire web page before displaying anything—therefore, never displaying the output of your webbot.

Note

Browsers can still be very useful for creating interfaces that set up or control the actions of a webbot. They can also be useful for displaying the results of a webbot's work.

Downloading Files with file()

An alternative to fopen() and fgets() is the function file(), which downloads formatted files and places them into an array. This function differs from fopen() in two important ways: One way is that, unlike fopen(), it does not require you to create a file handle, because it creates all the network preparations for you. The other difference is that it returns the downloaded file as an array, with each line of the downloaded file in a separate array element. The script in Listing 3-2 downloads the same web page used in Listing 3-1, but it uses the file() command.

<?
// Download the target file
$target = "http://www.schrenk.com/nostarch/webbots/hello_world.html";
$downloaded_page_array = file($target);

// Echo contents of file
for($xx=0; $xx<count($downloaded_page_array); $xx++)
    echo $downloaded_page_array[$xx];
?>

Listing 3-2: Downloading files with file()

The file() function is particularly useful for downloading comma-separated value (CSV) files, in which each line of text represents a row of data with columnar formatting (as in an Excel spreadsheet). Loading files line-by-line into an array is not particularly useful when downloading HTML files, because the data in a web page is not defined by rows or columns; in a CSV file, however, rows and columns have specific meaning.
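To illustrate that difference, here is a short sketch (not one of the book's listings; the CSV URL is a placeholder) that uses file() to load a CSV file and split each row into columns:

<?php
# Download a CSV file into an array, one line per element, and split
# each line into its columns. The target URL is a placeholder.
$target = "http://www.example.com/prices.csv";
$rows = file($target);

for($xx=0; $xx<count($rows); $xx++)
{
    $columns = explode(",", trim($rows[$xx]));   // one array element per column
    echo "Row $xx has " . count($columns) . " columns\n";
}
?>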

[12] See Chapter 23 for more information on executing webbots as scheduled events.


Introducing PHP/CURL

While PHP is capable when it comes to simple file downloads, most real-life applications require additional functionality to handle advanced issues such as form submission, authentication, redirection, and so on. These functions are difficult to facilitate with PHP's built-in functions alone. Therefore, most of this book's examples use PHP/CURL to download files.

The open source cURL project is the product of Swedish developer Daniel Stenberg and a team of developers. The cURL library is available for use with nearly any computer language you can think of. When cURL is used with PHP, it's known as PHP/CURL.

The name cURL is either a blend of the words client and URL or an acronym for the words client URL Request Library—you decide. cURL does everything that PHP's built-in networking functions do and a lot more. Appendix A expands on cURL's features, but here's a quick overview of the things PHP/CURL can do for you, a webbot developer.

Multiple Transfer Protocols


Unlike the built-in PHP network functions, cURL supports multiple transfer protocols, including FTP, FTPS, HTTP, HTTPS, Gopher, Telnet, and LDAP. Of these protocols, the most important is probably HTTPS, which allows webbots to download from encrypted websites that employ the Secure Sockets Layer (SSL) protocol.

Form Submission

cURL provides easy ways for a webbot to emulate browser form submission to a server. cURL supports all of the standard methods, or form submission protocols, as you'll learn in Chapter 5.
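As a preview, a minimal sketch of a POST submission with PHP/CURL might look like the following; the target URL and form field names are assumptions, and Chapter 5 covers form emulation properly.

<?php
# Emulate a browser submitting a form with the POST method.
# The target URL and form field names are placeholders.
$ch = curl_init("http://www.example.com/search.php");
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, "term=webbots&max_results=20");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the response as a string

$response = curl_exec($ch);
curl_close($ch);
echo $response;
?>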

Basic Authentication

cURL allows webbots to enter password-protected websites that use basic authentication. You've encountered authentication if you've seen this familiar gray box, shown in Figure 3-5, asking for your username and password. PHP/CURL makes it easy to write webbots that enter and use password-protected websites.


Figure 3-5. A basic authentication prompt
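For instance, a rough sketch of fetching a page behind basic authentication with PHP/CURL looks like this; the URL, username, and password are placeholders.

<?php
# Download a page protected by basic authentication.
# The URL, username, and password are placeholders.
$ch = curl_init("http://www.example.com/members/index.php");
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_USERPWD, "my_username:my_password");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the page as a string

$page = curl_exec($ch);
curl_close($ch);
?>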

Cookies

Without cURL, it is difficult for webbots to read and write cookies, those small bits of data that websites use to create session variables that track your movement. Websites also use cookies to manage shopping carts and authenticate users. cURL makes it easy for your webbot to interpret the cookies that webservers send it; it also simplifies the process of showing webservers all the cookies your webbot has written. Chapter 21 and Chapter 22 have much more to say on the subject of webbots and cookies.

Redirection

Redirection occurs when a web browser looks for a file in one place, but the server tells it that the file has moved and that it should download it from another location. For example, the website www.company.com may use redirection to force browsers to go to www.company.com/spring_sale when a seasonal promotion is in place. Browsers handle redirections automatically, and cURL allows webbots to have the same functionality.

Agent Name Spoofing

Every time a webserver receives a file request, it stores the requesting agent's name in a log file called an access log file. This log file stores the time of access, the IP address of the requester, and the agent name, which identifies the type of program that requested the file. Generally, agent names identify the browser that the web surfer was using to view the website.

Some agent names that a server log file may record are shown in Listing 3-3. The first four names are browsers; the last is the Google spider.

Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.6) Gecko/20050225 Firefox/1.0.1
Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.03 [en]
Mozilla/5.0 (compatible; Konqueror/3.1-rc3; i686 Linux; 20020515)
Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.1)
Googlebot/2.1 (+http://www.google.com/bot.html)

Listing 3-3: Agent names as seen in a file access log

A webbot using cURL can assume any appropriate (or inappropriate) agent name. For example, sometimes it is advantageous to identify your webbots, as Google does. Other times, it is better to make your webbot look like a browser. If you write webbots that use the LIB_http library (described later), your webbot's agent name will be Test Webbot. If you download a file from a webserver with PHP's fopen() or file() functions, your agent name will be the version of PHP installed on your computer.

Referer Management

cURL allows webbot developers to change the referer, which is the reference that servers use to detect which link the web surfer clicked. Sometimes webservers use the referer to verify that file requests are coming from the correct place. For example, a website might enforce a rule that prevents downloading of images unless the referring web page is also on the same webserver. This prohibits people from bandwidth stealing, or writing web pages using images on someone else's server. cURL allows a webbot to set the referer to an arbitrary value.

Socket Management

cURL also gives webbots the ability to recognize when a webserver isn't going to respond to a file request. This ability is vital because, without it, your webbot might hang (forever) waiting for a server response that will never happen. With cURL, you can specify how long a webbot will wait for a response from a server before it gives up and moves on.
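The underlying PHP/CURL options are CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT, as in this rough sketch; the URL is a placeholder and the timeout values are arbitrary examples.

$ch = curl_init("http://www.example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   # seconds to wait for a connection
curl_setopt($ch, CURLOPT_TIMEOUT, 25);          # seconds to wait for the entire transfer
$page = curl_exec($ch);
if($page === false)
    echo "Request failed: ".curl_error($ch)."\n";
curl_close($ch);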


Installing PHP/CURL

Since PHP/CURL is tightly integrated with PHP, installation should be unnecessary, or at worst, easy. You probably already have PHP/CURL on your computer; you just need to enable it in php.ini, the PHP configuration file. If you're using Linux, FreeBSD, OS X, or another Unix-based operating system, you may have to recompile your copy of Apache/PHP to enjoy the benefits of PHP/CURL. Installing PHP/CURL is similar to installing any other PHP library. If you need help, you should reference the PHP website (http://www.php.net) for the instructions for your particular operating system and PHP version.
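A quick way to confirm that PHP/CURL is enabled is to check for one of its functions from a short script, as in this sketch.

# Confirm that the cURL extension is enabled in this PHP installation
if(function_exists('curl_init'))
    echo "PHP/CURL is installed.\n";
else
    echo "PHP/CURL is not available; enable the cURL extension in php.ini.\n";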


LIB_http

Since PHP/CURL is very flexible and has many configurations, it is often handy to use it within a wrapper function, which simplifies the complexities of a code library into something easier to understand. For your convenience, this book uses a library called LIB_http, which provides wrapper functions to the PHP/CURL features you'll use most. The remainder of this chapter describes the basic functions of the LIB_http library.

LIB_http is a collection of PHP/CURL routines that simplify downloading files. It contains defaults and abstractions that facilitate downloading files, managing cookies, and completing online forms. The name of the library refers to the HTTP protocol used by the library. Some of the reasons for using this library will not be evident until we cover its more advanced features. Even simple file downloads, however, are made easier and more robust with LIB_http because of PHP/CURL. The most recent version of LIB_http is available at this book's website.

Familiarizing Yourself with the Default Values

To simplify its use, LIB_http sets a series of default conditions for you, as described below:


Your webbot's agent name is Test Webbot.
Your webbot will time out if a file transfer doesn't complete within 25 seconds.
Your webbot will store cookies in the file c:\cookie.txt.
Your webbot will automatically follow a maximum of four redirections, as directed by servers in HTTP headers.
Your webbot will, if asked, tell the remote server that you do not have a local authentication certificate. (This is only important if you access a website employing SSL encryption, which is used to protect confidential information on e-commerce websites.)

These defaults are set at the beginning of the file. Feel free to change any of these settings to meet your specific needs.

Using LIB_http

The LIB_http library provides a set of wrapper functions that simplify complicated PHP/CURL interfaces. Each of these interfaces calls a common routine, http(), which performs the specified task, using the values passed to it by the wrapper interfaces. All functions in LIB_http share a similar format: A target and referring URL are passed, and an array is returned, containing the contents of the requested file, transfer status, and error conditions. While LIB_http has many functions, we'll restrict our discussion to simply fetching files from the Internet using HTTP. The remaining features are described as needed throughout the book.

http_get()

The function http_get() downloads files with the GET method; it has many advantages over PHP's built-in functions for downloading files from the Internet. Not only is the interface simple, but this function offers all the previously described advantages of using PHP/CURL. The script in Listing 3-4 shows how files are downloaded with http_get().

# Usage: http_get()
array http_get (string target_url, string referring_url)

Listing 3-4: Using http_get()

These are the inputs for the script in Listing 3-4:

target_url is the fully formed URL of the desired file
referring_url is the fully formed URL of the referer

These are the outputs for the script in Listing 3-4:

$array['FILE'] contains the contents of the requested file
$array['STATUS'] contains status information regarding the file transfer
$array['ERROR'] contains a textual description of any errors
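For example, a minimal fetch with http_get() might look like the following sketch; it assumes LIB_http is in your include path, and the target and referer URLs are examples only.

include("LIB_http.php");

$target = "http://www.schrenk.com/publications.php";   # page to download
$ref    = "http://www.schrenk.com";                     # claimed referer

$page = http_get($target, $ref);

echo $page['STATUS']['http_code']."\n";                 # HTTP code reported by the server
echo $page['ERROR']."\n";                               # textual error description, if any
echo $page['FILE'];                                     # contents of the downloaded file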

http_get_withheader()

When a web agent requests a file from the Web, the server returns the file contents, as discussed in the previous section, along with the HTTP header, which describes various properties related to a web page. Browsers and webbots rely on the HTTP header to determine what to do with the contents of the downloaded file.

The data that is included in the HTTP header varies from application to application, but it may define cookies, the size of the downloaded file, redirections, encryption details, or authentication directives. Since the information in the HTTP header is critical to properly using a network file, LIB_http configures cURL to automatically handle the more common header directives. Listing 3-5 shows how this function is used.

# Usage: http_get_withheader()
array http_get_withheader (string target_url, string referring_url)

Listing 3-5: Using http_get_withheader()

These are the inputs for the script in Listing 3-5:

target_url is the fully formed URL of the desired file
referring_url is the fully formed URL of the referer

These are the outputs for the script in Listing 3-5:

$array['FILE'] contains the contents of the requested file, including the HTTP header
$array['STATUS'] contains status information about the file transfer
$array['ERROR'] contains a textual description of any errors

The example in Listing 3-6 uses the http_get_withheader() function to download a file and display the contents of the returned array.

# Include http library
include("LIB_http.php");

# Define the target and referer web pages
$target = "http://www.schrenk.com/publications.php";
$ref = "http://www.schrenk.com";

# Request the header
$return_array = http_get_withheader($target, $ref);

# Display the header
echo "FILE CONTENTS \n";
var_dump($return_array['FILE']);

echo "ERRORS \n";var_dump($return_array['ERROR']);

echo "STATUS \n";

var_dump($return_array['STATUS']);


Listing 3-6: Using http_get_withheader()

The script in Listing 3-6 downloads the page and displays the requested page, any errors, and a variety of status information related to the fetch and download.

Listing 3-7 shows what is produced when the script in Listing 3-6 is executed, with the array that includes the page header, error conditions, and status. Notice that the contents of the returned file are limited to only the HTTP header, because we requested only the header and not the entire page. Also, notice that the first line in an HTTP header is the HTTP code, which indicates the status of the request. An HTTP code of 200 tells us that the request was successful. The HTTP code also appears in the status array element.[13]

FILE CONTENTS
string(215) "HTTP/1.1 200 OK
Date: Sat, 08 Oct 2008 16:38:51 GMT
Server: Apache/2.0.53 (FreeBSD) mod_ssl/2.0.53 OpenSSL/0.9.7g PHP/4.4.0
X-Powered-By: PHP/4.4.0
Content-Type: text/html; charset=ISO-8859-1
"

ERRORS
string(0) ""

STATUS
array(20) {
  ["url"]=>
  string(39) "http://www.schrenk.com/publications.php"
  ["content_type"]=>
  string(29) "text/html; charset=ISO-8859-1"
  ["http_code"]=>
  int(200)
  ["header_size"]=>
  int(215)
  ["request_size"]=>
  int(200)
  ["filetime"]=>
  int(-1)
  ["ssl_verify_result"]=>
  int(0)
  ["redirect_count"]=>
  int(0)
  ["total_time"]=>
  float(0.683)
  ["namelookup_time"]=>
  float(0.005)
  ["connect_time"]=>
  float(0.101)
  ["pretransfer_time"]=>
  float(0.101)
  ["size_upload"]=>
  float(0)
  ["size_download"]=>
  float(0)
  ["speed_download"]=>
  float(0)
  ["speed_upload"]=>
  float(0)
  ["download_content_length"]=>
  float(0)
  ["upload_content_length"]=>
  float(0)
  ["starttransfer_time"]=>
  float(0.683)
  ["redirect_time"]=>
  float(0)
}

Listing 3-7: File contents, errors, and the download status array returned by LIB_http

The information returned in $array['STATUS'] is extraordinarily useful for learning how the fetch was conducted. Included in this array are values for download speed, access times, and file sizes—all valuable when writing diagnostic webbots that monitor the performance of a website.
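As a small illustration of that idea (not a project from this book), the sketch below checks a few STATUS values after a fetch; it assumes LIB_http is available, and the target URL and thresholds are arbitrary examples.

include("LIB_http.php");

$result = http_get($target="http://www.schrenk.com", $referer="");
$status = $result['STATUS'];

if($status['http_code'] != 200)
    echo "Unexpected HTTP code: ".$status['http_code']."\n";

if($status['total_time'] > 5)
    echo "Slow fetch: ".$status['total_time']." seconds\n";

echo "Downloaded ".$status['size_download']." bytes\n";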

Learning More About HTTP Headers

When a Content-Type line appears in an HTTP header, it defines the MIME, or the media type of file sent from the server. The MIME type tells the web agent what to do with the file. For example, the Content-Type in the previous example was text/html, which indicates that the file is a web page. Knowing if the file they just downloaded was an image or an HTML file helps browsers know if they should display the file as text or render an image. For example, the HTTP header information for a JPEG image is shown in Listing 3-8.

HTTP/1.1 200 OK
Date: Mon, 23 Mar 2009 00:06:13 GMT
Server: Apache/1.3.12 (Unix) mod_throttle/3.1.2 tomcat/1.0 PHP/4.0.3pl1
Last-Modified: Wed, 23 Jul 2008 18:03:29 GMT
ETag: "74db-9063-3d3eebf1"
Accept-Ranges: bytes
Content-Length: 36963
Content-Type: image/jpeg

Listing 3-8: An HTTP header for an image file request

Examining LIB_http's Source Code

Most webbots in this book will use the library LIB_http to download pages from the Internet. If you plan to explore any of the webbot examples that appear later in this book, you should obtain a copy of this library; the latest version is available for download at this book's website. We'll explore some of the defaults and functions of LIB_http here.

LIB_http Defaults

At the very beginning of the library is a set of defaults, as shown in Listing 3-9.

define("WEBBOT_NAME", "Test Webbot"); # How your webbot will appear in serverlogsdefine("CURL_TIMEOUT", 25); # Time (seconds) to wait for networkresponsedefine("COOKIE_FILE", "c:\cookie.txt"); #

Page 79: Webbots, Spiders, And Screen Scrapers - Michael Schrenk

define("COOKIE_FILE", "c:\cookie.txt"); # Location of cookie file

Listing 3-9: LIB_http defaults

LIB_http Functions

The functions shown in Listing 3-10 are available within LIB_http. All of these functions return the array defined earlier, containing downloaded files, error messages, and the status of the file transfer.

http_get($target, $ref)                               # Simple GET request (w/o header)
http_get_withheader($target, $ref)                    # Simple GET request (w/ header)
http_get_form($target, $ref, $data_array)             # Form (method="GET", w/o header)
http_get_form_withheader($target, $ref, $data_array)  # Form (method="GET", w/ header)
http_post_form($target, $ref, $data_array)            # Form (method="POST", w/o header)
http_post_form_withheader($target, $ref, $data_array) # Form (method="POST", w/ header)
http_header($target, $ref)                            # Only returns header

Listing 3-10: LIB_http functions

[13] A complete list of HTTP codes can be found in Appendix B.

Final Thoughts

Some of these functions use an additional input parameter, $data_array, when form data is passed from the webbot to the webserver. These functions are listed below:

http_get_form()

http_get_form_withheader()

http_post_form()

http_post_form_withheader()

If you don't understand what all these functions do now, don't worry. Their use will become familiar to you as you go through the examples that appear later in this book. Now might be a good time to thumb through Appendix A, which details the features of cURL that webbot developers are most apt to need.


Chapter 4. PARSING TECHNIQUES

Parsing is the process of segregating what's desired or useful from what is not. In the case of webbots, parsing involves detecting and separating image names and addresses, key phrases, hyper-references, and other information of interest to your webbot. For example, if you are writing a spider that follows links on web pages, you will have to separate these links from the rest of the HTML. Similarly, if you write a webbot to download all the images from a web page, you will have to write parsing routines that identify all the references to image files.

Parsing Poorly Written HTML

One of the problems you'll encounter when parsing web pages is poorly written HTML. A large amount of HTML is machine generated and shows little regard for human readability, and hand-written HTML often disregards standards by ignoring closing tags or misusing quotes around values. Browsers may correctly render web pages that have substandard HTML, but poorly written HTML interferes with your webbot's ability to parse web pages.

Fortunately, a software library known as HTMLTidy[14] will clean up poorly written web pages. PHP includes HTMLTidy in its standard distributions, so you should have no problem getting it running on your computer. Installing HTMLTidy (also known as just Tidy) should be similar to installing cURL. Complete installation instructions are available at the PHP website.[15]

The parse functions (described next) rely on Tidy to put unparsed source code into a known state, with known delimiters and known closing tags of known case.
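If you'd like to experiment with Tidy directly (outside of LIB_parse), PHP's tidy extension can be called as in the sketch below; the sample markup is contrived, and LIB_parse may configure Tidy differently.

# Clean up sloppy markup with PHP's tidy extension (illustration only)
$dirty_html = "<html><body><p>Unclosed paragraph<b>bold text</body>";

$tidy = new tidy();
$tidy->parseString($dirty_html, array('output-xhtml' => true), 'utf8');
$tidy->cleanRepair();

echo $tidy;   # well-formed markup with closing tags restored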

Note

If you do not have HTMLTidy installed on your computer, the parsing described in this book may not work correctly.

[14] See http://tidy.sourceforge.net.

[15] See http://www.php.net.


Standard Parse Routines

I have simplified parsing by identifying a few useful functions and placing them into a library called LIB_parse. These functions (or a combination of them) provide everything needed for 99 percent of your parsing tasks. Whether or not you use the functions in LIB_parse, I highly suggest that you standardize your parsing routines. Standardized parse functions make your scripts easier to read and faster to write—and perhaps just as importantly, when you limit your parsing options to a few simple solutions, you're forced to consider simpler approaches to parsing problems. The latest version of LIB_parse is available from this book's website.


Using LIB_parse

The parsing library used in this book, LIB_parse, provides easy-to-read parsing functions that should meet most parsing tasks your webbots will encounter. Primarily, LIB_parse contains wrapper functions that provide simple interfaces to otherwise complicated routines. To use the examples in this book, you should download the latest version of this library from the book's website.

One of the things you may notice about LIB_parse is the lack of regular expressions. Although regular expressions are the mainstay for parsing text, you won't find many of them here. Regular expressions can be difficult to read and understand, especially for beginners. The built-in PHP string manipulation functions are easier to understand and usually more efficient than regular expressions.

The following is a description of the functions in LIB_parse and the parsing problems they solve. These functions are also described completely within the comments of LIB_parse.

Splitting a String at a Delimiter: split_string()

The simplest parsing function returns a string that contains everything before or after a delimiter term. This simple function can also be used to return the text between two terms. The function provided for that task is split_string(), shown in Listing 4-1.

/*
string split_string (string unparsed, string delimiter, BEFORE/AFTER, INCL/EXCL)
Where
    unparsed  is the string to parse
    delimiter defines boundary between substring you want and substring you don't want
    BEFORE    indicates that you want what is before the delimiter
    AFTER     indicates that you want what is after the delimiter
    INCL      indicates that you want to include the delimiter in the parsed text
    EXCL      indicates that you don't want to include the delimiter in the parsed text

*/

Listing 4-1: Using split_string()

Simply pass split_string() the string you want to split, the delimiter where you want the split to occur, whether you want the portion of the string that is before or after the delimiter, and whether or not you want the delimiter to be included in the returned string. Examples using split_string() are shown in Listing 4-2.

include("LIB_parse.php");$string = "The quick brown fox";

# Parse what's before the delimiter, including the delimiter
$parsed_text = split_string($string, "quick", BEFORE, INCL);
// $parsed_text = "The quick"

# Parse what's after the delimiter, but don't include the delimiter
$parsed_text = split_string($string, "quick", AFTER, EXCL);
// $parsed_text = "brown fox"

Listing 4-2: Examples of split_string() usage
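To make the behavior concrete, here is one hypothetical way a function like split_string() could be written with PHP's built-in string functions. This is not the actual LIB_parse source (which defines its own constants and may handle edge cases differently); it is only a sketch of the idea.

define("BEFORE", true);     # hypothetical constants for this sketch only
define("AFTER",  false);
define("INCL",   true);
define("EXCL",   false);

function split_string($unparsed, $delimiter, $side, $mode)
    {
    $pos = stripos($unparsed, $delimiter);              # case-insensitive search
    if($pos === false)
        return $unparsed;                               # delimiter not found

    if($side == BEFORE)
        {
        $length = ($mode == INCL) ? $pos + strlen($delimiter) : $pos;
        return trim(substr($unparsed, 0, $length));     # text before the delimiter
        }
    else
        {
        $start = ($mode == INCL) ? $pos : $pos + strlen($delimiter);
        return trim(substr($unparsed, $start));         # text after the delimiter
        }
    }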

Parsing Text Between Delimiters: return_between()

Sometimes it is useful to parse text between two delimiters. For example, to parse a web page's title, you'd want to parse the text between the <title> and </title> tags. Your webbots can use the return_between() function in LIB_parse to do this.

The return_between() function uses a start delimiter and an end delimiter to define a particular part of a string your webbot needs to parse, as shown in Listing 4-3.

/*
string return_between (string unparsed, string start, string end, INCL/EXCL)
Where
    unparsed is the string to parse
    start    identifies the starting delimiter
    end      identifies the ending delimiter
    INCL     indicates that you want to include the delimiters in the parsed text
    EXCL     indicates that you don't want to include delimiters in the parsed text
*/

Listing 4-3: Using return_between()

The script in Listing 4-4 uses return_between() to parse the HTML title of a web page.

# Include libraries
include("LIB_parse.php");
include("LIB_http.php");

# Download a web page
$web_page = http_get($target="http://www.nostarch.com", $referer="");

# Parse the title of the web page, inclusive of the title tags
$title_incl = return_between($web_page['FILE'], "<title>", "</title>", INCL);

# Parse the title of the web page, exclusive of the title tags
$title_excl = return_between($web_page['FILE'], "<title>", "</title>", EXCL);

# Display the parsed text
echo "title_incl = ".$title_incl;
echo "\n";
echo "title_excl = ".$title_excl;

Listing 4-4: Using return_between() to find the title of a web page

When Listing 4-4 is run in a shell, the results should look like Figure 4-1.


Figure 4-1. Examples of using return_between(), with and without returned delimiters

Parsing a Data Set into an Array: parse_array()

Sometimes the things your webbot needs to parse, like links, appear more than once in a web page. In these cases, a single parsed result isn't as useful as an array of results. Such a parsed array could contain all the links, meta tags, or references to images in a web page. The parse_array() function does essentially the same thing as the return_between() function, but it returns an array of all items that match the parse description or all occurrences of data between two delimiting strings. This function, for example, makes it extremely easy to extract all the links and images from a web page.

The parse_array() function, shown in Listing 4-5, is most useful when your webbots need to parse the content of reoccurring tags. For example, returning an array of everything between every occurrence of <img and > returns information about all the images in a web page. Alternately, returning an array of everything between <script and </script> will parse all inline JavaScript. Notice that in each of these cases, the opening tag is not completely defined. This is because <img and <script are sufficient to describe the tag, and additional parameters (that we don't need to define in the parse) may be present in the downloaded page.

This simple parse is also useful for parsing tables, meta tags, formatted text, video, or any other parts of web pages defined between reoccurring HTML tags.

/*
array parse_array (string unparsed, string beg, string end)
Where
    unparsed is the string to parse
    beg      is a reoccurring beginning delimiter
    end      is a reoccurring ending delimiter
    array    contains every occurrence of what's found between beginning and end.

*/

Listing 4-5: Using parse_array()

The script in Listing 4-6 uses the parse_array() function to parse and display all the meta tags on the FBI website. Meta tags are primarily used to define a web page's content to a search engine. The following code, which uses parse_array() to gather the meta tags from a web page, could be incorporated with the project in Chapter 11 to determine how adjustments in your meta tags affect your ranking in search engines. To parse all the meta tags, the function must be told to return all instances that occur between <meta and >. Again, notice that the script only uses enough of each delimiter to uniquely identify where a meta tag starts and ends. Remember that the definitions you apply for start and stop variables must apply for each data set you want to parse.

include("LIB_parse.php"); # Include parse libraryinclude("LIB_http.php"); # Include cURL library

$web_page = http_get($target="http://www.fbi.gov", $referer="");
$meta_tag_array = parse_array($web_page['FILE'], "<meta", ">");

for($xx=0; $xx<count($meta_tag_array); $xx++)

echo $meta_tag_array[$xx]."\n";

Listing 4-6: Using parse_array() to parse all the meta tags from http://www.fbi.gov

When the script in Listing 4-6 runs, the result should look like Figure 4-2.

Figure 4-2. Using parse_array() to parse the meta tags from the FBI website

Parsing Attribute Values: get_attribute()

Once your webbot has parsed tags from a web page, it is often important to parse attribute values from those tags. For example, if you're writing a spider that harvests links from web pages, you will need to parse all the link tags, but you will also need to parse the specific href attribute of the link tag. For these reasons, LIB_parse includes the get_attribute() function.

The get_attribute() function provides an interface that allows webbot developers to parse specific attribute values from HTML tags. Its usage is shown in Listing 4-7.

/*
string get_attribute (string tag, string attribute)
Where
    tag       is the HTML tag that contains the attribute you want to parse
    attribute is the name of the specific attribute in the HTML tag

*/

Listing 4-7: Using get_attribute()

This parse is particularly useful when you need to get a specific attribute from a previously parsed array of tags. For example, Listing 4-8 shows how to parse all the images from http://www.schrenk.com, using get_attribute() to get the src attribute from an array of <img> tags.

include("LIB_parse.php"); # include parse libraryinclude("LIB_http.php"); # include curl library


// Download the web page
$web_page = http_get($target="http://www.schrenk.com", $referer="");

// Parse the image tags
$meta_tag_array = parse_array($web_page['FILE'], "<img", ">");

// Echo the image source attribute from each image tag
for($xx=0; $xx<count($meta_tag_array); $xx++)
    {
    $name = get_attribute($meta_tag_array[$xx], $attribute="src");
    echo $name ."\n";

}

Listing 4-8: Parsing the src attributes from image tags

Figure 4-3 shows the output of Listing 4-8.


Figure 4-3. Results of running Listing 4-8, showing parsed image names
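For the curious, an attribute parse of this kind can be built from return_between(); the hypothetical sketch below assumes LIB_parse is included and that the attribute value is quoted, which the real get_attribute() may handle more thoroughly.

include("LIB_parse.php");   # provides return_between() and the EXCL constant

function get_attribute_sketch($tag, $attribute)
    {
    $tag = str_replace("'", "\"", $tag);                        # normalize quoting
    return return_between($tag, $attribute."=\"", "\"", EXCL);  # text between attribute=" and "
    }

echo get_attribute_sketch('<img src="logo.gif" alt="Logo">', "src");  # logo.gif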

Removing Unwanted Text: remove()

Up to this point, parsing meant extracting desired text from a larger string. Sometimes, however, parsing means manipulating text. For example, since webbots usually lack JavaScript interpreters, it's often best to delete JavaScript from downloaded files. In other cases, your webbots may need to remove all images or email addresses from a web page. For these reasons, LIB_parse includes the remove() function. The remove() function is an easy-to-use interface for removing unwanted text from a web page. Its usage is shown in Listing 4-9.

/*
string remove (string web_page, string open_tag, string close_tag)
Where
    web_page  is the contents of the web page you want to affect
    open_tag  defines the beginning of the text that you want to remove
    close_tag defines the end of the text you want to remove

*/

Listing 4-9: Using remove()

By adjusting the input parameters, the remove() function can remove a variety of text from web pages, as shown in Listing 4-10.

$uncommented_page = remove($web_page, "<!--", "-->");
$links_removed = remove($web_page, "<a", "</a>");
$images_removed = remove($web_page, "<img", " >");
$javascript_removed = remove($web_page, "<script", "</script>");

Listing 4-10: Using remove()


Useful PHP Functions

In addition to the previously described parsing functions in LIB_parse, PHP also contains a multitude of built-in parsing functions. The following is a brief sample of the most valuable built-in PHP parsing functions, along with examples of how they are used.

Detecting Whether a String Is Within Another String

You can use the stristr() function to tell your webbot whether or not a string contains another string. The PHP community commonly uses the term haystack to refer to the entire unparsed text and the term needle to refer to the substring within the larger string. The function stristr() looks for an occurrence of needle in haystack. If found, stristr() returns a substring of haystack from the occurrence of needle to the end of the larger string. In normal use, you're not always concerned about the actual returned text. Generally, the fact that something was returned is used as an indication that you found the existence of needle in the haystack.

The stristr() function is handy if you want to detect whether or not a specific word is mentioned in a web page. For example, if you want to know if a web page mentions dogs, you can execute the script shown in Listing 4-11.

if(stristr($web_page, "dogs"))
    echo "This is a web page that mentions dogs.";
else
    echo "This web page does not mention dogs";

Listing 4-11: Using stristr() to see if a string contains another string

In this example, we're not specifically interested in what the stristr() function returns, but whether it returns anything at all. If something is returned, we know that the web page contained the word dogs.

The stristr() function is not case sensitive. If you need a case-sensitive version of stristr(), use strstr().

Replacing a Portion of a String with Another String

The PHP built-in function str_replace() puts a new string in place of all occurrences of a substring within a string, as shown in Listing 4-12.


$org_string = "I wish I had a Cat.";
$result_string = str_replace("Cat", "Dog", $org_string);
# $result_string contains "I wish I had a Dog."

Listing 4-12: Using str_replace() to replace all occurrences of Cat with Dog

The str_replace() function is also useful when a webbot needs to remove a character or set of characters from a string. You do this by instructing str_replace() to replace text with a null string, as shown in Listing 4-13.

$result = str_replace("$", "", "$100.00"); // Remove the dollar sign
# $result contains 100.00

Listing 4-13: Using str_replace() to remove leading dollar signs

Parsing Unformatted Text

The script in Listing 4-14 uses a variety of built-in functions, along with a few functions from LIB_http and LIB_parse, to create a string that contains unformatted text from a website. The result is the contents of the web page without any HTML formatting.

include("LIB_parse.php"); # Include parse library

Page 101: Webbots, Spiders, And Screen Scrapers - Michael Schrenk

libraryinclude("LIB_http.php"); # Include cURL library

// Download the page
$web_page = http_get($target="http://www.cnn.com", $referer="");

// Remove all JavaScript
$noformat = remove($web_page['FILE'], "<script", "</script>");

// Strip out all HTML formatting
$noformat = strip_tags($noformat);

// Remove unwanted white space
$noformat = str_replace("\t", "", $noformat);      // Remove tabs
$noformat = str_replace("&nbsp;", "", $noformat);  // Remove non-breaking spaces
$noformat = str_replace("\n", "", $noformat);      // Remove line feeds
echo $noformat;

Listing 4-14: Parsing the content from the HTML used on http://www.cnn.com

Measuring the Similarity of Strings

Sometimes it is convenient to calculate the similarity of two strings without necessarily parsing them. PHP's similar_text() function counts the characters two strings have in common and, through an optional third parameter passed by reference, reports the similarity of the strings as a percentage. The syntax for using similar_text() is shown in Listing 4-15.

similar_text($string1, $string2, $similarity_percentage);

Listing 4-15: Example of using PHP's similar_text() function

You may use similar_text() to determine if a new version of a web page is significantly different than a cached version.
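A minimal sketch of that idea follows; it assumes LIB_http is available and that a cached copy of the page was saved earlier to the (arbitrary) file name shown, and the 95 percent threshold is only an example.

include("LIB_http.php");

$new_page    = http_get($target="http://www.example.com", $referer="");
$cached_page = file_get_contents("cached_page.html");          # saved on an earlier run

similar_text($cached_page, $new_page['FILE'], $percent);        # $percent set by reference

if($percent < 95)
    echo "The page has changed significantly since it was cached.\n";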


Final Thoughts

As demonstrated, a wide variety of parsing tasks can be performed with the standardized parsing routines in LIB_parse, along with a few of PHP's built-in functions. Here are a few more suggestions that may help you in your parsing projects.

Note

You'll get plenty of parsing experience as you explore the projects in this book. The projects also introduce a few advanced parsing techniques. In Chapter 7, we'll cover advanced methods for parsing data in tables. In Chapter 11, you'll learn about the insertion parse, which makes it easier to parse and debug difficult-to-parse web pages.

Don't Trust a Poorly Coded Web Page

While the scripts in LIB_parse attempt to handle most situations, there is no guarantee that you will be able to parse poorly coded or nonsensical web pages. Even the use of Tidy will not always provide proper results. For example, code like this:

<img src="width='523'" alt >

may drive your parsing routines crazy. If you're having trouble debugging a parsing routine, check to see if the page has errors. If you don't check for errors, you may waste many hours trying to parse unparseable web pages.

Parse in Small Steps

When you are writing a script that depends on several levels of parsing, avoid the temptation to write your parsing script in one pass. Since succeeding sections of your code will depend on earlier parses, write and debug your scripts one parse at a time.

Don't Render Parsed Text While Debugging

If you are viewing the results of your parse in a browser, remember that the browser will attempt to render your output as a web page. If the results of your parse contain tags, display your parses within <xmp> and </xmp> tags. These tags will tell the browser not to render the results of your parse as HTML. Failure to analyze the unformatted results of your parse may cause you to miss things that are inside tags.[16]
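In practice, that can be as simple as this small sketch, where $parsed_text stands in for whatever your parse returned.

echo "<xmp>";
echo $parsed_text;   // raw parse results, shown without being rendered as HTML
echo "</xmp>";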

Use Regular Expressions Sparingly

Regular expressions are a parsing language in themselves, and most modern programming languages support aspects of regular expressions. In the right hands, regular expressions are also useful for parsing and substituting text; however, they are famous for their sharp learning curve and cryptic syntax. I avoid regular expressions whenever possible.

The regular expression engine used by PHP is not as efficient as engines used in other languages, and it is certainly less efficient than PHP's built-in functions for parsing HTML. For those reasons, my preference is to limit regular expression use to instances in which there are few alternatives; in those cases, I use wrapper functions to take advantage of the functionality of regular expressions while shielding the developer from their complexities.

[16] Chapter 3 describes additional methods for viewing text downloaded from websites.


Chapter 5. AUTOMATING FORM SUBMISSION

You learned how to download files from the Internet in Chapter 3. In this chapter, you'll learn how to fill out forms and upload information to websites. When your webbots have the ability to exchange information with target websites, as opposed to just asking for information, they become capable of acting on your behalf. Interactive webbots can do these kinds of things:

Transfer funds between your online bank accounts when an account balance drops below a predetermined limit
Buy items in online auctions when an item and its price meet preset criteria
Autonomously upload files to a photo sharing website
Advise a distributor to refill a vending machine when product inventory is low

Webbots send data to webservers by mimicking what people do as they fill out standard HTML forms on websites. This process is called form emulation. Form emulation is not an easy task, since there are many ways to submit form information. In addition, it's important to submit forms exactly as the webserver expects them to be filled out, or else the server will generate errors in its log files. People using browsers don't have to worry about the format of the data they submit in a form. Webbot designers, however, must reverse engineer the form interface to learn about the data format the server is expecting. When the form interface is properly debugged, the form data from a webbot appears exactly as if it were submitted by a person using a browser.

If done poorly, form emulation can get webbot designers into trouble. This is especially true if you are creating an application that delivers a competitive advantage for a client and you want to conceal the fact that you are using a webbot. A number of things could happen if your webbot gets into trouble, ranging from leaking (to your competitors) that you're gaining an advantage through the use of a webbot to having your website privileges revoked by the owner of the target website.

The first rule of form emulation is staying legal: Represent yourself truthfully, and don't violate a website's user agreement. The second rule is to send form data to the server exactly as the server expects to receive it. If your emulated form data deviates from the format that is expected, you may generate suspicious-looking errors in the server's log. In either case, the server's administrator will easily figure out that you are using a webbot. Even though your webbot is legitimate, the server log files your webbot creates may not resemble browser activity. They may indicate to the website's administrator that you are a hacker and lead to a blocked IP address or termination of your account. It is best to be both stealthy and legal. For these reasons, you may want to read Chapters 24 and 28 before you venture out on your own.

Reverse Engineering Form Interfaces

Webbot developers need to look at online forms differently than people using the same forms in a browser. Typically, when people use browsers to fill out online forms, performing some task like paying a bill or checking an account balance, they see various fields that need to be selected or otherwise completed.

Webbot designers, in contrast, need to view HTML forms as interfaces or specifications that tell a webbot how a server expects to see form data after it is submitted. A webbot designer needs to have the same perspective on forms as the server that receives the form. For example, a person filling out the form in Figure 5-1 would complete a variety of form elements—text boxes, text areas, select lists, radio controls, checkboxes, or hidden elements—that are identified by text labels.

Figure 5-1. A simple form with various form elements

While a human associates the text labels shown in Figure 5-1 with the form elements, a webbot designer knows that the text labels and types of form elements are immaterial. All the form needs to do is send the correct name/data pairs that represent these data fields to the correct server page, with the expected protocol. This isn't nearly as complicated as it sounds, but before we can go further, it's important that you understand the various parts of HTML forms.


Form Handlers, Data Fields, Methods, and Event Triggers

Web-based forms have four main parts, as shown in Figure 5-2:

A form handler
One or more data fields
A method
One or more event triggers

I'll examine each of these parts in detail and then show how a webbot emulates a form.

Figure 5-2. Parts of a form

Form Handlers


The action attribute in the <form> tag defines the web page that interprets the data entered into the form. We'll refer to this page as the form handler. If there is no defined action, the form handler is the same as the page that contains the form. The examples in Table 5-1 compare the location of form handlers in a variety of conditions.

Table 5-1. Variations in Form-Handler Descriptions

action Attribute Meaning

<form name="myForm" action="search.php">

The script called search.php will accept and interpret the form data. This script shares the same server and directory as the page that served the form.

<form name="myForm" action="../cgi/search.php">

A script called search.php handles this form and is in the cgi directory, which is parallel to the current directory.

<form name="myForm" action="/search.php">

The script called search.php, in the home directory of the server that served the page, handles this form.


<form name="myForm" action="www.schrenk.com/search.php">

The contents of this form are sent to the specified page at http://www.schrenk.com.

<form name="myForm">

There isn't an action (or form handler) specified in the <form> tag. In these cases, the same page that delivered the form is also the page that interprets the completed form.

Servers have no use for the form's name, which is the variable that identifies the form. This variable is only used by JavaScript, which associates the form name with its form elements. Since servers don't use the form's name, webbots (and their designers) have no use for it either.

Data Fields

Form input tags define data fields and the name, value, and user interface used to input the value. The user interface (or widget) can be a text box, text area, select list, radio control, checkbox, or hidden element. Remember that while there are many types of interfaces, they are completely meaningless to the webbot that emulates the form and the server that handles the form. From a webbot's perspective, there is no difference between data entered via a text box or a select list. The input tag's name and its value are the only things that matter.

Every data field must have a name.[17] These names become form data variables, or containers for their data values. In Listing 5-1, a variable called session_id is set to 0001, and the value for search is whatever was in the text box labeled Search when the user clicked the submit button. Again, from a webbot designer's perspective, it doesn't matter what type of data elements define the data fields (hidden, select, radio, text box, etc.). It is important that the data has the correct name and that the value is within a range expected by the form handler.

<form method="GET"> <input type="hidden" name="session_id" value="0001"> <input type="text" name="search" value=""> <input type="submit"></form>

Listing 5-1: Data fields in an HTML form

Methods

The form's method describes the protocol used to send the form data to the form handler. The most common methods for form data transfers are GET and POST.

The GET Method

You are already familiar with the GET method, because it is identical to the protocol you used to request web pages in previous chapters. With the GET protocol, the URL of a web page is combined with data from form elements. The address of the page and the data are separated by a ? character, and individual data variables are separated by & characters, as shown in Listing 5-2. The portion of the URL that follows the ? character is known as a query string.

URL http://www.schrenk.com/search.php?term=hello&sort=up

Listing 5-2: Data values passed in a URL (GET method)

Since GET form variables may be combined with the URL, the web page that accepts the form will not be able to tell the difference between the form submitted in Listing 5-3 and the form emulation techniques shown in Listings 5-4 and 5-5. In either case, the variables term and sort will be submitted to the web page http://www.schrenk.com/search with the GET protocol.[18]

<form name="frm1" action="http://www.schrenk.com/search.php"> <input type="text" name="term" value="hello"> <input type="text" name="sort" value="up"> <input type="submit"></form>

Listing 5-3: A GET method performed by a form submission

Alternatively, you could use LIB_http to emulate the form, as in Listing 5-4.

include("LIB_http.php");

$action = "http://www.schrenk.com/search.php"; // Address of form handler$method="GET"; // GET method$ref = ""; // Referer variable$data_array['term'] = "hello"; // Define term$data_array['sort'] = "up"; // Define sort$response = http($target=$action, $ref, $method, $data_array, EXCL_HEAD);

Listing 5-4: Using LIB_http to emulate the form in Listing 5-3 with data passed in an array

Conversely, since the GET method places form information in the URL's query string, you could also emulate the form with a script like Listing 5-5.

include("LIB_http.php");

$action = "http://www.schrenk.com/search.php?term=hello&sort=up";$method=""GET";$ref = "" ;$response = http($target=$action, $ref, $method, $data_array="", EXCL_HEAD);

Listing 5-5: Emulating the form in Listing 5-3 by combining the URL with the form data

The reason we might choose Listing 5-4 over Listing 5-5 is that the code is cleaner when form data is treated as array elements, especially when many form values are passed to the form handler. Passing form variables to the form's handler with an array is also more symmetrical, meaning that the procedure is nearly identical to the one required to pass values to a form handler expecting the POST method.

The POST Method

While the GET method tacks on form data at the end of the URL, the POST method sends data in a separate file. The POST method has these advantages over the GET method:

POST methods can send more data to servers than GET methods can. The maximum length of a GET method is typically around 250 characters. POST methods, in contrast, can easily upload several megabytes of information during a single form upload.

Since URL fetch requests are sent in HTTP headers, and since headers are never encrypted, sensitive data should always be transferred with POST methods. POST methods don't transfer form data in headers, and thus, they may be encrypted. Obviously, this is only important for web pages using encryption.

GET method requests are always visible on the location bar of the browser. POST requests only show the actual URL in the location bar.

Regardless of the advantages of POST over GET, you must match your method to the method of the form you are emulating. Keep in mind that methods may also be combined in the same form. For example, forms with POST methods may also use form handlers that contain query strings.

To submit a form using the POST method with LIB_http, simply specify the POST protocol, as shown in Listing 5-6.

include("LIB_http.php");

$action = "http://www.schrenk.com/search.php"; // Address of form handler$method="POST "; // POST method$ref = ""; // Referer variable$data_array['term'] = "hello"; // Define term$data_array['sort'] = "up"; // Define sort$response = http($target=$action, $ref, $method, $data_array, EXCL_HEAD);

Listing 5-6: Using LIB_http to emulate a form with the POST method

Regardless of the number of data elements, the process is the same. Some form handlers, however, access the form elements as an array, so it's always a good idea to match the order of the data elements that is defined in the HTML form.

Event Triggers

A submit button typically acts as the event trigger, which causes the form data to be sent to the form handler using the defined form method. While the submit button is the most common event trigger, it is not the only way to submit a form. It is very common for web developers to employ JavaScript to verify the contents of the form before it is submitted to the server. In fact, any JavaScript event like onClick or onMouseOut can submit a form, as can any other type of human-generated JavaScript event. Sometimes, JavaScript may also change the value of a form variable before the form is submitted. The use of JavaScript as an event trigger causes many difficulties for webbot designers, but these issues are remedied by the use of special tools, as you'll soon see.

[17] The HTML value of any form element is only its starting or default value. The user may change the final element with JavaScript or by editing the form before it is sent to the form handler.

[18] In forms where no form method is defined, like the form shown in Listing 5-3, the default form method is GET.


Unpredictable Forms

You may not be able to tell exactly what the form requires by looking at the source HTML. There are three primary reasons for this: the use of JavaScript, the readability of machine generated HTML, and the presence of cookies.

JavaScript Can Change a Form Just Before Submission

Forms often use JavaScript to manipulate data before sending it to the form handler. These manipulations are usually the result of checking the validity of data entered into the form data field. Since these manipulations happen dynamically, it is nearly impossible to predict what will happen unless you actually run the JavaScript and see what it does—or unless you have a JavaScript parser in your head.

Form HTML Is Often Unreadable by Humans

You cannot expect to look at the source HTML for a web page and determine, with any precision, what the form does. Regardless of the fact that all browsers have a View Source option, it is important to remember that HTML is rendered by machines and does not have to be readable by people—and it frequently isn't. It is also important to remember that much of the HTML served on web pages is dynamically generated by scripts. For these reasons, you should never expect HTML pages to be easy to read, and you should never count on being able to accurately analyze a form by looking at a script.

Cookies Aren't Included in the Form, but Can Affect Operation

While cookies are not evident in a form, they can often play an important role, since they may contain session variables or other important data that isn't readily visible but is required to process a form. You'll learn more about webbots that use cookies in Chapter 21 and Chapter 22.


Analyzing a Form

Since it is so hard to accurately analyze an HTML form by hand, and since submitting a form correctly is critical, you may prefer to use a tool to analyze the format of forms. This book's website has a form handler that provides this service. The form analyzer works by substituting the form's original form handler with the URL of the form analyzer. When the analyzer receives form data, it creates a web page that describes the method, data variables, and cookies sent by the form exactly as they are seen by the original form handler, even if the web page uses JavaScript.

To use the analyzer, you must first create a copy of the web page that contains the form you want to analyze, and place that copy on your hard drive. Then you must replace the form handler on the web page with a form handler that will analyze the form structure. For example, if the form you want to analyze has a <form> tag like the one in Listing 5-7, you must substitute the original form handler with the address of my form analyzer, as shown in Listing 5-8.

<form method="POST" action="https://panel.schrenk.com/keywords/search/"

>

Listing 5-7: Original form handler

<form method="POST" action="http://www.schrenk.com/nostarch/webbots/form_analyzer.php"


>

Listing 5-8: Substituting the original form handler with a handler that analyzes the form

To analyze the form, save your changes to your hard drive and load the modified web page into a browser. Once you fill out the form (by hand) and submit it, the form analyzer will provide an analysis similar to the one in Figure 5-3.

This simple diagnosis isn't perfect—use it at your own risk. However, it does allow a webbot developer to verify the form method, agent name, and GET and POST variables as they are presented to the actual form handler. For example, in this particular exercise, it is evident that the form handler expects a POST method with the variables sessionid, email, message, status, gender, and vol.

Forms with a session ID point out the importance of downloading and analyzing the form before emulating it. In this typical case, the session ID is assigned by the server and cannot be predicted. The webbot can accurately use session IDs only by first downloading and parsing the web page containing the form.


Figure 5-3. Using a form analyzer

If you were to write a script that emulates the form submitted and analyzed in Figure 5-3, it would look something like Listing 5-9.

include("LIB_http.php");

# Initiate addresses
$action="http://www.schrenk.com/nostarch/webbots/form_analyzer.php";

$ref = "" ;

# Set submission method
$method="POST";

# Set form data and values
$data_array['sessionid'] = "sdfg73453845";
$data_array['email'] = "[email protected]";
$data_array['message'] = "This is a test message";
$data_array['status'] = "in school";
$data_array['gender'] = "M";
$data_array['vol'] = "on";

$response = http($target=$action, $ref, $method, $data_array, EXCL_HEAD);

Listing 5-9: Using LIB_http to emulate the form analysis in Figure 5-3

After you write a form-emulation script, it's a good idea to use the analyzer to verify that the form method and variables match the original form you are attempting to emulate. If you're feeling ambitious, you could improve on this simple form analyzer by designing one that accepts both the submitted and emulated forms and compares them for you.

The script in Listing 5-10 is similar to the one running at http://www.schrenk.com/nostarch/webbots/form_analyzer.php. This script is for reference only. You can download the latest copy from this book's website. Note that the PHP sections of this script appear in bold.

<?setcookie("SET BY THIS PAGE", "This is a diagnostic cookie.");?><head> <title>HTTP Request Diagnostic Page</title> <style type="text/css"> p { color: black; font-weight: bold; font-size: 110%; font-family: arial} .title { color: black; font-weight: bold; font-size: 110%; font-family:arial} .text {font-weight: normal; font-size: 90%;} TD { color: black; font-size: 100%; font-family: courier; vertical-align:top;} .column_title { color: black; font-size: 100%; background-color: eeeeee; font-weight: bold; font-family: arial} </style></head><p class="title">Webbot Diagnostic Page</p><p class="text">This web page is a tool to diagnose webbot functionality byexamining what the webbot sends to webservers.<table border="1" cellspacing="0" cellpadding="3" width="800"> <tr class="column_title"> <th width="25%">Variable</th> <th width="75%">Value sent to server</th> </tr> <tr> <td>HTTP Request Method</td><td><?echo $_SERVER["REQUEST_METHOD"];?></td>

Page 128: Webbots, Spiders, And Screen Scrapers - Michael Schrenk

$_SERVER["REQUEST_METHOD"];?></td> </tr> <tr> <td>Your IP Address</td><td><?echo $_SERVER["REMOTE_ADDR"];?></td> </tr> <tr> <td>Server Port</td><td><?echo $_SERVER["SERVER_PORT"];?></td> </tr> <tr> <td>Referer</td> <td><? if(isset($_SERVER['HTTP_REFERER'])) echo $_SERVER['HTTP_REFERER']; else echo "Null<br>"; ?> </td> </tr> <tr> <td>Agent Name</td> <td><? if(isset($_SERVER['HTTP_USER_AGENT'])) echo $_SERVER['HTTP_USER_AGENT']; else echo "Null<br>"; ?> </td> </tr>

<tr> <td>Get Variables</td> <td><? if(count($_GET)>0) var_dump($_GET); else echo "Null"; ?> </td> </tr> <tr> <td>Post Variables</td>

Page 129: Webbots, Spiders, And Screen Scrapers - Michael Schrenk

<td>Post Variables</td> <td><? if(count($_POST)>0) var_dump($_POST); else echo "Null"; ?> </td> </tr> <tr> <td>Cookies</td> <td><? if(count($_COOKIE)>0) var_dump($_COOKIE); else echo "Null"; ?> </td> </tr></table><p class="text">This web page also sets a diagnostic cookie, which should bevisible the second time you access this page.

Listing 5-10: A simple form analyzer


Final Thoughts

Years of experience have taught me a few tricks for emulating forms. While it's not hard to write a webbot that submits a form, it is often difficult to do it right the first time. Moreover, as you read earlier, there are many reasons to submit a form correctly the first time. I highly suggest reading Chapters 24, 25, and 28 before creating webbots that emulate forms. These chapters provide additional insight into potential problems and perils that you're likely to encounter when writing webbots that submit data to webservers.

Don't Blow Your Cover

If you're using a webbot to create a competitive advantage for a client, you don't want that fact to be widely known—especially to the people that run the targeted site.

There are two ways a webbot can blow its cover while submitting a form:

It emulates the form but not the browser.

It generates an error, either because it poorly analyzed the form or poorly executed the emulation. Either error may create a condition that isn't possible when the form is submitted by a browser, creating a questionable entry in a server activity log.

Note

This topic is covered in more detail in Chapter 24.

Correctly Emulate Browsers

Emulating a browser is easy, but you should verify that you're doing it correctly. Your webbot can look like any browser you desire if you properly declare the name of your web agent. If you're using the LIB_http library, the constant WEBBOT_NAME defines how your webbot identifies itself and, furthermore, how servers log your web agent's name in their log files. In some cases, webservers verify that you are using a particular web browser (most commonly Internet Explorer) before allowing you to submit a form.

If you plan to emulate a browser as well as the form, you should verify that the name of your webbot is set to something that looks like a browser (as shown in Listing 5-11). Obviously, if you don't change the default value for your webbot's name in the LIB_http library, you'll tell everyone who looks at the server logs that you're using a test webbot.

# Define how your webbot will appear in server logs
define("WEBBOT_NAME", "Internet Explorer");

Listing 5-11: Setting the name of your webbot to Internet Explorer in LIB_http

Strange user agent names will often be noticed by webmasters, since they routinely analyze logs to see which browsers people use to access their sites to ensure that they don't run into browser compatibility problems.
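If you want the declared name to stand up to closer scrutiny, you can define WEBBOT_NAME as a full browser-style user-agent string rather than a bare product name. The sketch below only illustrates the format browsers send; the exact version numbers shown are not significant.

# A sketch: declaring a complete, browser-like user-agent string in LIB_http
# (the specific string shown is illustrative)
define("WEBBOT_NAME", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");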

Avoid Form Errors

Even more serious than using the wrong agent name is submitting a form that couldn't possibly be sent from the form the webserver provides on its website. These mistakes are logged in the server's error log and are subject to careful scrutiny. Situations that could cause server errors include the following:

Using the wrong form protocol
Submitting the form to the wrong action (form handler)
Submitting form variables in the wrong order
Ignoring an expected variable that the form handler needs
Adding an extra variable that the form handler doesn't expect
Emulating a form that is no longer available on the website

Using the wrong method can have several undesirable outcomes. If your webbot sends too much data with a GET method when the form specifies a POST method, you risk losing some of your data. (Most webservers restrict the length of a GET method.[19]) Another danger of using the wrong form method is that many form handlers expect variables to be members of either a $_GET or $_POST array, which is a keyed name/value array similar to the $data_array used in LIB_http. If you're sending the form a POST variable called 'name', and the server is expecting $_GET['name'], your webbot will generate an entry in the server's error log because it didn't send the variable the server was looking for.

Also, remember that protocols aren't limited to the form method. If the form handler expects an SSL-encrypted https protocol, and you deliver the emulated form to an unencrypted http address, the form handler won't understand you because you'll be sending data to the wrong server port. In addition, you're potentially sending sensitive data over an unencrypted connection.

The final thing to verify is that you are sending your emulated form to a web page that exists on the target server. Sometimes mistakes like this are the result of sloppy programming, but this can also occur when a webmaster updates the site (and form handler). For this reason, a proactive webbot designer verifies that the form handler hasn't changed since the webbot was written.
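One lightweight way to perform that verification is to download the page that hosts the form and confirm that the form handler your webbot emulates still appears in it. The sketch below assumes a hypothetical form page and action name; http_get() and the 'FILE' element it returns come from the LIB_http library used throughout this book.

include("LIB_http.php");

# Hypothetical form page and the form handler (action) the webbot expects
$form_page       = "http://www.example.com/login.php";
$expected_action = "process_login.php";

# Download the page that hosts the form
$page = http_get($form_page, "");

# Warn if the expected form handler no longer appears on the page
if(!stristr($page['FILE'], $expected_action))
    echo "WARNING: The form handler may have changed. Review it before submitting.\n";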

[19] Servers routinely restrict the length of a GET request to help protect the server from extremely long requests, which are commonly used by hackers attempting to compromise servers with buffer overflow exploits.


Chapter 6. MANAGING LARGE AMOUNTS OF DATA

You will soon find that your webbots are capable of collecting massive amounts of data. The amount of data a simple automated webbot or spider can collect, even if it runs only once a day for several months, is colossal. Since none of us have unlimited storage, managing the quality and volume of the data our programs collect and store becomes very important. In this chapter, I will describe methods to organize the data that your webbots collect and then investigate ways to reduce the size of what you save.

Organizing Data

Organizing the resources that your webbots download requires planning. Whether you employ a well-defined file structure or a relational database, the result should meet the needs of the particular problem your application attempts to solve. For example, if the data is primarily text, is accessed by many people, or is in need of sort or search capability, then you may prefer to store information in a relational database, which addresses these needs. If, on the other hand, you are storing many images, PDFs, or Word documents, you may favor storing files in a structured filesystem. You may even create a hybrid system where a database references media files stored in structured directories.

Naming Conventions

While there is no "correct" way to organize data, there are many bad ways to store the data webbots generate. Most mistakes arise from assigning non-descriptive or confusing names to the data your webbots collect. For this reason, your designs must incorporate naming conventions that uniquely identify files, directories, and database properties. Define names for things early, during your planning stages, as opposed to naming things as you go along. Always name in a way that allows your data structure to grow. For example, a real estate webbot that refers to properties as houses may be difficult to maintain if your application later expands to include raw land, offices, or businesses. Updating names for your data can become tedious, since your code and documentation will reference those names many times.

Your naming convention can enforce any rules you like, but you should consider the following guidelines:

You need to enforce any naming standards with an iron fist, or they will cease to be standards.

It's often better to assign names based on the type of thing an object is, rather than what it actually is. For example, in the previous real estate example, it may have been better to name the database table that describes houses properties, so when the scope of the project expands,[20] it can handle a variety of real estate. With this method, if your project grows, you could add another column to the table to describe the type of property. It is always easier to expand data tables than to rename columns.

Consider who (or what) will be using your data organization. For example, a directory called Saturday_January_23 might be easy for a person to read, but a directory called 0123 might be a better choice if a computer accesses its contents. Sequential numbers are easier for computer programs to interpret.

Define the format of your names. People will often use compound words and separate the words with underscores for readability, as in name_first. Other times, people separate compound words with case, as in nameFirst; this is commonly referred to as CamelCase. These format definitions should include things like case, language, and parts of speech. For example, if you decide to separate terms with underscores, you shouldn't use CamelCase to name other terms later. It's very common for developers to use different standards to help identify differences between functions, data variables, and objects.

If you give members of a certain group labels that are all the same part of speech, don't occasionally throw in a label with another grammatical form. For example, if you have a group of directories named with nouns, don't name another directory in the same group with a verb; if you do, chances are it probably doesn't belong in that group of things in the first place.

If you are naming files in a directory, you may want to give the files names that will later facilitate easy grouping or sorting. For example, if you are using a filename that defines a date, filenames with the format year_month_day will make more sense when sorted than filenames with the format month_day_year. This is because year, month, and day is a sequential progression from largest to smallest and will accurately reflect order when sorted.
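As a small illustration of the year_month_day guideline, PHP's date() function can generate filenames that sort chronologically; the prefix and extension shown here are only examples.

# Build a sortable, date-based filename (example prefix and extension)
$filename = "prices_".date("Y_m_d").".csv";    // for example, prices_2006_02_03.csv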

Storing Data in Structured Files

To successfully store files in a structured series of directories, you need to find out what the files have in common. In most cases, the problem you're trying to solve and the means for retrieving the data will dictate the common factors among your files. Figuratively, you need to look for the lowest common denominator for all your files. Figure 6-1 shows a file structure for storing data retrieved by a webbot that runs once a day. Its common theme is time.

Figure 6-1. Example of a structured filesystem primarily based on dates

With the structure defined in Figure 6-1, you could easily locate thumbnail images created by the webbot on February 3, 2006 because the folders comply with the following specification:

drive:\project\year\month\day\category\subcategory\files

Therefore, the path would look like this:

c:\Spider_files\2006\02\03\Graphics\Thumbnails\
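A script could assemble that path from the current date with PHP's date() function. The sketch below assumes the drive, project, and category names shown in Figure 6-1.

# Assemble a date-based storage path similar to the one in Figure 6-1
$path = "c:\\Spider_files\\".date("Y")."\\".date("m")."\\".date("d")."\\Graphics\\Thumbnails\\";
echo $path;    // for example, c:\Spider_files\2006\02\03\Graphics\Thumbnails\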


People may easily decipher this structure, and so will programs, which need to determine the correct file path programmatically. Figure 6-2 shows another file structure, primarily based on geography.

Figure 6-2. A geographically themed example of a structured filesystem

Ensure that all files have a unique path and that either a person or a computer can easily make sense of these paths.

File structures, like the ones shown in the previous figures, are commonly created by webbots. You'll see how to write webbots that create file structures in Chapter 8.

Storing Text in a Database

While many applications call for file structures similar to the ones shown in Figure 6-1 or Figure 6-2, the majority of projects you're likely to encounter will require that data is stored in a database. A database has many advantages over a file structure. The primary advantage is the ability to query or make requests from the database with a query language called Structured Query Language, or SQL (pronounced SEE-quill). SQL allows programs to sort, extract, update, combine, insert, and manipulate data in nearly any imaginable way.

It is not within the scope of this book to teach SQL, but this book does include the LIB_mysql library, which simplifies using SQL with the open source database called MySQL[21] (pronounced my-esk-kew-el).

LIB_mysql

LIB_mysql consists of a few server configurations and three functions, which should handle most of your database needs. These functions act as abstractions or simplifications of the actual interface to the program. Abstractions are important because they allow access to a wide variety of database functions with a common interface and error-reporting method. They also allow you to use a database other than MySQL by creating a similar library for a new database. For example, if you choose to use another database someday, you could write abstractions with the same function names used in LIB_mysql. In this way, you can make the code in this book work with Oracle, SQL Server, or any other database without modifying any scripts.

The source code for LIB_mysql is available from this book's website. There are other fine database abstractions available from projects like PEAR and PECL; however, the examples in this book use LIB_mysql.

Listing 6-1 shows the configuration area of LIB_mysql. You should configure this area for your specific MySQL installation before you use it.

MySQL Constants (scope = global)

define("MYSQL_ADDRESS", "localhost");   // Define IP address of your MySQL Server
define("MYSQL_USERNAME", "");           // Define your MySQL username
define("MYSQL_PASSWORD", "");           // Define your MySQL password
define("DATABASE", "");                 // Define your default database

Listing 6-1: LIB_mysql server configurations

As shown in Listing 6-1, the configuration section provides an opportunity to define where your MySQL server resides and the credentials needed to access it. The configuration section also defines a constant, "DATABASE", which you may use to define the default database for your project.

There are three functions in LIB_mysql that facilitate the following:

Inserting data into the database
Updating data already in the database
Executing a raw SQL query

Each function uses a similar interface, and each provides error reporting if you request an erroneous query.

The insert() Function

The insert() function in LIB_mysql simplifies the process of inserting a new entry into a database by passing the new data in a keyed array. For example, if you have a table like the one in Figure 6-3, you can insert another row of data with the script in Listing 6-2, making it look like the table in Figure 6-4.

Figure 6-3. Example table people before the insert()

$data_array['NAME'] = "Jill Monroe";
$data_array['CITY'] = "Irvine";
$data_array['STATE'] = "CA";
$data_array['ZIP'] = "55410";
insert(DATABASE, $table="people", $data_array);

Listing 6-2: Example of using insert()


Figure 6-4. Example table people after executing the insert() in Listing 6-2

The update() Function

Alternatively, you can use update() to update the record you just inserted with the script in Listing 6-3, which changes the ZIP code for the record.

$data_array['NAME'] = "Jill Monroe";
$data_array['CITY'] = "Irvine";
$data_array['STATE'] = "CA";
$data_array['ZIP'] = "92604";
update(DATABASE, $table="people", $data_array, $key_column="ID", $id="3");

Listing 6-3: Example script for updating data in a table

Running the script in Listing 6-3 changes values in the table, as shown in Figure 6-5.

Figure 6-5. Example table people after updating ZIP codes with the script in Listing 6-3

The exe_sql() Function


For database functions other than inserting or updating records, LIB_mysql provides the exe_sql() function, which executes a SQL query against the database. This function is particularly useful for extracting data with complex queries or for deleting records, altering tables, or anything else you can do with SQL. Table 6-1 shows various uses for this function.

Table 6-1. Example Usage Scenarios for the LIB_mysql exe_sql() Function

Instruction:
    $array = exe_sql(DATABASE, "select * from people");
Result:
    $array[1]['ID']="1";
    $array[1]['NAME']="Kelly Garrett";
    $array[1]['CITY']="Culver City";
    $array[1]['STATE']="CA";
    $array[1]['ZIP']="90232";
    $array[2]['ID']="2";
    $array[2]['NAME']="Sabrina Duncan";
    $array[2]['CITY']="Anaheim";
    $array[2]['STATE']="CA";
    $array[2]['ZIP']="92812";
    $array[3]['ID']="3";
    $array[3]['NAME']="Jill Monroe";
    $array[3]['CITY']="Irvine";
    $array[3]['STATE']="CA";
    $array[3]['ZIP']="92604";

Instruction:
    $array = exe_sql(DATABASE, "select * from people where ID='2'");
Result:
    $array['ID']="2";
    $array['NAME']="Sabrina Duncan";
    $array['CITY']="Anaheim";
    $array['STATE']="CA";
    $array['ZIP']="92812";

Instruction:
    list($name) = exe_sql(DATABASE, "select NAME from people where ID='2'");
Result:
    $name = "Sabrina Duncan";

Instruction:
    exe_sql(DATABASE, "delete from people where ID='2'");
Result:
    Deletes row 2 from the table


Please note that if exe_sql() is fetching data from the database, it will always return an array of data. If the query returns multiple rows of data, you'll get a multidimensional array. Otherwise, a single-dimensional array is returned.
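For example, a script could loop over a multidimensional result like the one returned by the first query in Table 6-1. This sketch assumes the people table shown in the earlier figures.

# Fetch every row from the people table and display one line per record
$rows = exe_sql(DATABASE, "select * from people");
foreach($rows as $row)
    echo $row['NAME']." lives in ".$row['CITY'].", ".$row['STATE']."\n";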

Storing Images in a Database

It is usually better to store images in a file structure and then refer to the paths of the images in the database, but occasionally you may find the need to store images as blobs, or large unstructured pieces of data, directly in a database. These occasions may arise when you don't have the requisite system permissions to create a file. For example, many web administrators do not allow their webservers to create files, as a security measure. To store an image in a database, set the typecasting or variable type for the image to blob or large blob and insert the data, as shown in Listing 6-4.

$data_array['IMAGE_ID'] = 6;
$data_array['IMAGE'] = base64_encode(file_get_contents($file_path));
insert(DATABASE, $table, $data_array);

Listing 6-4: Storing an image directly in a database record

When you store a binary file, like an image, in a database, you should base64-encode the data first. Since the database assumes text or numeric data, this precaution ensures that no bit combinations will cause internal errors in the database. If you don't do this, you take the risk that some odd bit combination in the image will be interpreted as an unintended database command or special character.

Since images are (or should be) base64 encoded, you need to decode the images before you can reuse them. The script in Listing 6-5 shows how to display an image stored in a database record.

<!-- Display an image stored in a database where the image ID is 6 -->
<img src="show_image.php?img_id=6">

Listing 6-5: HTML that displays an image stored in a database

Listing 6-6 shows the code to extract, decode, and present the image.

<?
# Get needed database library
include("LIB_mysql.php");

# Convert the variable on the URL to a new variable
$image_id = $_GET['img_id'];

# Get the base64-encoded image from the database
$sql = "select IMAGE from table where IMAGE_ID='".$image_id."'";
list($img) = exe_sql(DATABASE, $sql);

# Decode the image and send it as a file to the requester
header("Content-type: image/jpeg");
echo base64_decode($img);
exit;
?>

Listing 6-6: Script to query, decode, and create an image from an image record in a database

When an image tag is used in this fashion, the image src attribute is actually a function that pulls the image from the database before it sends it to the waiting web agent. This function knows which image to send because it is referenced in the query of the src attribute. In this case, that record is img_id, which corresponds with the table column IMAGE_ID. The program show_image.php actually creates a new image file each time it is executed.

Database or File?

Your decision to store information in a database or as files in a directory structure is largely dependent on your application, but because of the advantages that SQL brings to data storage, I often use databases. The one common exception to this rule is image files, which (as previously mentioned) are usually more efficiently stored as files in a directory. Nevertheless, when files are stored in local directories, it is often convenient to record the physical address of the file you saved in a database.
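A minimal sketch of that hybrid approach uses the insert() function described earlier to record where a downloaded file was saved; the images table and its columns are assumptions made for illustration.

# Record the local path (and source) of a saved image in a hypothetical images table
$data_array['FILE_PATH']  = "c:\\Spider_files\\2006\\02\\03\\Graphics\\north_beach.jpg";
$data_array['SOURCE_URL'] = "http://www.schrenk.com/north_beach.jpg";
insert(DATABASE, $table="images", $data_array);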


[20] Projects always expand in scope.

[21] More information about MySQL is available at http://www.mysql.com and http://www.php.net.


Making Data Smaller

Now that you know how to store data, you'll want to store it efficiently, in ways that reduce the amount of disk space required while facilitating easy retrieval and manipulation of that data. The following section explores methods for reducing the size of the data your webbots collect in these ways:

Storing references to data
Compressing data
Removing unneeded formatting
Thumbnailing, or creating smaller representations of larger graphic files

Storing References to Image Files

Since your webbot and the image files it discovers share the same network, it is possible to store a network reference to the image instead of making a physical copy of it. For example, instead of downloading and storing the image north_beach.jpg from www.schrenk.com, you might store a reference to its URL, http://www.schrenk.com/north_beach.jpg, in a database. Now, instead of retrieving the file from your data structure, you could retrieve the actual file from its original location. While you can apply this technique to images, it is not limited to image files; it also applies to HTML, JavaScript, style sheets, or any other networked file.

There are three main advantages to recording references to images instead of storing copies of the images. The most obvious advantage is that the reference to an image will usually consume much less space than a copy of the image file. Another advantage is that if the original image on the website changes, you will still have access to the most current version of that image, provided that the network address of the image hasn't also changed. A less obvious advantage to storing the network address of an image is that you may shield yourself from the potential copyright issues that arise when you make a copy of someone else's intellectual property.

The disadvantage of storing a reference to an image instead of the actual image is that there is no guarantee that it still references an image that's available online. When the remote image changes, your reference will be obsolete. Given the short-lived nature of online media, images can change or disappear without warning.

Compressing Data

From a webbot's perspective, compression can happen either when a webserver delivers pages or when your webbot compresses pages before it stores them for later use. Compression on inbound files will save bandwidth; the second method can save space on your hard drives. If you're ambitious, you can use both forms of compression.

Compressing Inbound Files

Many webservers automatically compress files before they serve pages to browsers. Managing your incoming data is just as important as managing the data on your hard drive.

Servers configured to serve compressed web pages will look for signals from the web client indicating that it can accept compressed pages. Like browsers, your webbots can also tell servers that they can accept compressed data by including the lines shown in Listing 6-7 in your PHP and cURL routines.

$header[] = "Accept-Encoding: compress, gzip";
curl_setopt($curl_session, CURLOPT_HTTPHEADER, $header);

Listing 6-7: Requesting compressed files from a webserver

Servers equipped to send compressed web pages won't send compressed files if they decide that the web agent cannot decompress the pages. Servers default to uncompressed pages if there's any doubt of the agent's ability to decompress compressed files. Over the years, I have found that some servers look for specific agent names, in addition to header directions, before deciding to compress outgoing data. For this reason, you won't always gain the advantage of inbound compression if your webbot's agent name is something nonstandard like Test Webbot. For that reason, when inbound file compression is important, it's best if your webbot emulates a common browser.[22]

Since the webserver is the final arbiter of an agent's ability to handle compressed data, and since it always defaults on the side of safety (no compression), you're never guaranteed to receive a compressed file, even if one is requested. If you are requesting compression from a server, it is incumbent on your webbot to detect whether or not a web page was compressed. To detect compression, look at the returned header to see if the web page is compressed and, if so, what form of compression was used (as shown in Listing 6-8).

if(stristr($http_header, "zip"))    // Assumes that header is in $http_header
    $compressed = TRUE;

Listing 6-8: Analyzing the HTTP header to detect inbound file compression

If the data was compressed by the server, you can decompress the files with the function gzuncompress() in PHP, as shown in Listing 6-9.

$uncompressed_file = gzuncompress($compressed_file);

Listing 6-9: Decompressing a compressed file

Compressing Files on Your Hard Drive

PHP provides a variety of built-in functions for compressing data. Listing 6-10 demonstrates these functions. This script downloads the default HTML file from http://www.schrenk.com, compresses the file, and shows the difference between the compressed and uncompressed files. The PHP sections of this script appear in bold.

<?
# Demonstration of PHP file compression

# Include cURL library
include("LIB_http.php");

# Get web page
$target = "http://www.schrenk.com";
$ref = "";
$method = "GET";
$data_array = "";
$web_page = http_get($target, $ref, $method, $data_array, EXCL_HEAD);

# Get sizes of compressed and uncompressed versions of web page
$uncompressed_size = strlen($web_page['FILE']);
$compressed_size = strlen(gzcompress($web_page['FILE'], $compression_value = 9));
$noformat_size = strlen(strip_tags($web_page['FILE']));

# Report the sizes of compressed and uncompressed versions of web page
?>
<table border="1">
    <tr>
        <th colspan="3">Compression report for <? echo $target?></th>
    </tr>
    <tr>
        <th>Uncompressed</th>
        <th>Compressed</th>
    </tr>
    <tr>
        <td align="right"><? echo $uncompressed_size?> bytes</td>
        <td align="right"><? echo $compressed_size?> bytes</td>
    </tr>
</table>

Listing 6-10: Compressing a downloaded file

Running the script from Listing 6-10 in a browser provides the results shown in Figure 6-6.

Before you start compressing everything your webbot finds, you should be aware of the disadvantages of file compression. In this example, compression resulted in files roughly 20 percent of the original size. While this is impressive, the biggest drawback to compression is that you can't do much with a compressed file. You can't perform searches, sorts, or comparisons on the contents of a compressed file. Nor can you modify the contents of a file while it's compressed. Furthermore, while text files (like HTML files) compress effectively, many media files like JPG, GIF, WMF, or MOV are already compressed and will not compress much further. If your webbot application needs to analyze or manipulate downloaded files, file compression may not be for you.


Figure 6-6. The script from Listing 6-10, showing the value of compressing files

Removing Formatting

Removing unneeded HTML formatting instructions is much more useful for reducing the size of a downloaded file than compressing it, since it still facilitates access to the useful information in the file. Formatting instructions like <div class="font_a"> are of little use to a webbot because they only control format and not content, and because they can be removed without harming your data. Removing formatting reduces the size of downloaded HTML files while still leaving the option of manipulating the data later. Fortunately, PHP provides strip_tags(), a built-in function that automatically strips HTML tags from a document. For example, if I add the lines shown in Listing 6-11 to the previous script, we can see the effect of stripping the HTML formatting.


$noformat = strip_tags($web_page['FILE']);   // Remove HTML tags
$noformat_size = strlen($noformat);          // Get size of new string

Listing 6-11: Removing HTML formatting using the strip_tags() function

If you run the program in Listing 6-10 again and modify the output to also show the size of the unformatted file, you will see that the unformatted web page is nearly as compact as the compressed version. The results are shown in Figure 6-7.

Figure 6-7. Comparison of uncompressed, compressed, and unformatted file sizes

Unlike the compressed data, the unformatted data can still be sorted, modified, and searched. You can make the file even smaller by removing excessive spaces, line feeds, and other white space with a simple PHP function called trim(), without reducing your ability to manipulate the data later. As an added benefit, unformatted pages may be easier to manipulate, since parsing routines won't confuse HTML for the content you're acting on.

Remember that removing the HTML tags removes all links, JavaScript, image references, and CSS information. If any of that is important, you should extract it before removing a page's formatting.
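As a sketch of that idea, building on the $web_page array from Listing 6-10, you might strip the tags first and then collapse the leftover white space; the preg_replace() pattern is one possible approach and is not part of LIB_http.

# Strip HTML tags, then collapse runs of white space into single spaces
$noformat  = strip_tags($web_page['FILE']);
$condensed = trim(preg_replace('/\s+/', ' ', $noformat));
echo strlen($condensed)." bytes after removing tags and extra white space\n";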

[22] For more information on agent name spoofing, please review Chapter 3.


Thumbnailing Images

The most effective method of decreasing the size of an image is to create smaller versions, or thumbnails, of the original. You may easily create thumbnails with the LIB_thumbnail library, which you can download from this book's website. To use this library, you will have to verify that your configuration uses the gd (revision 2.0 or higher) graphics module.[23] The script in Listing 6-12 demonstrates how to use LIB_thumbnail to create a miniature version of a larger image. The PHP sections of this script appear in bold.

<?
# Demonstration of LIB_thumbnail.php

# Include libraries
include("LIB_http.php");
include("LIB_thumbnail.php");

# Get image from the Internet
$target = "http://www.schrenk.com/north_beach.jpg";
$ref = "";
$method = "GET";
$data_array = "";
$image = http_get($target, $ref, $method, $data_array, EXCL_HEAD);

# Store captured image file to local hard drive
$handle = fopen("test.jpg", "w");
fputs($handle, $image['FILE']);
fclose($handle);

# Create thumbnail image from image that was just stored locally
$org_file = "test.jpg";
$new_file_name = "thumbnail.jpg";

# Set the maximum width and height of the thumbnailed image
$max_width = 90;
$max_height = 90;

# Create the thumbnailed image
create_thumbnail($org_file, $new_file_name, $max_width, $max_height);
?>
Full-size image<br>
<img src="test.jpg">
<p>
Thumbnail image<br>
<img src="thumbnail.jpg">

Listing 6-12: Demonstration of how LIB_thumbnail may create a thumbnailed image

The script in Listing 6-12 fetches an image from the Internet, writes a copy of the original to a local file, defines the maximum dimensions of the thumbnail, creates the thumbnail, and finally displays both the original image and the thumbnail image.

The product of running the script in Listing 6-12 is shown in Figure 6-8. The thumbnailed image shown in Figure 6-8 consumes roughly 30 percent as much space as the original file. If the original file was larger or the specification for the thumbnailed image was smaller, the savings would be even greater.

Figure 6-8. Output of Listing 6-12, making thumbnails with LIB_thumbnail

[23] If the gd module is not installed in your configuration, please reference http://www.php.net/manual/en/ref.image.php for further instructions.

Page 165: Webbots, Spiders, And Screen Scrapers - Michael Schrenk

Final Thoughts

When storing information, you need to consider what is being stored and how that information will be used later. Furthermore, if the data isn't going to be used later, you need to ask yourself why you need to save it.

Sometimes it is easier to parse text before the HTML tags are removed. This is especially true with regard to data in tables, where rows and columns are parsed.

While unformatted pages are stripped of presentation, colors, and images, the remaining text is enough to represent the original file. Without the HTML, it is actually easier to characterize, manipulate, or search for the presence of keywords.

Before you continue, this is a good time to download LIB_mysql, LIB_http, and LIB_thumbnail from this book's website. You will need all of these libraries to program later examples in this book.

Page 166: Webbots, Spiders, And Screen Scrapers - Michael Schrenk

Part II. PROJECTS

This section expands on the concepts you learned in the previous section with simple yet demonstrative projects. Any of these projects, with further development, could be transformed from a simple webbot concept into a potentially marketable product.

Chapter 7

The first project describes webbots that collect and analyze online prices from a mock store that exists on this book's website. The prices change periodically, creating an opportunity for your webbots to analyze and make purchase decisions based on the price of items. Since this example store is solely for your experimentation, you'll gain confidence in testing your webbot on web pages that serve no commercial purpose and haven't changed since this book's publication. This environment also affords the freedom to make mistakes without obsessing over the crumbs your webbots leave behind in an actual online store's server log file.

Chapter 8


The image-capturing webbot leverages your knowledge of downloading and parsing web pages to create an application that copies all the images (and their directory structure) to your local hard drive. In addition to creating a useful tool, you'll also learn how to convert relative addresses into fully resolved URLs, a technique that is vital for later spidering projects.

Chapter 9

Here you will have the opportunity to write a webbot that automatically verifies that all the links on a web page point to valid web pages. I'll conclude the chapter with ideas for expanding this concept into a variety of useful tools and products.

Chapter 10

In this chapter, I'll introduce the concept of using a webbot as a proxy, or intermediary agent that intercepts and modifies information flowing between a user and the Internet. The result of this project is a simple proxy webbot that allows users to surf the Internet anonymously.

Chapter 11

This project describes a simple webbot that determines how highly a search engine ranks a website, given a set of search criteria. You'll also find a host of ideas about how you can modify this concept to provide a variety of other services.

Chapter 12

Aggregation is a technique that gathers the contents of multiple web pages in a single location. This project introduces techniques that make it easy to exploit the availability of RSS news services.

Chapter 13

Webbots that use FTP are able to move the information they collect to an FTP server for storage or use by other applications. In this chapter, we'll explore methods for navigating on, uploading to, and downloading from FTP servers.

Chapter 14

While often overlooked in favor of newer, web-based sources, NNTP is still a viable protocol with an active user base. In this chapter, I'll describe methods by which you can interface your webbots to news servers, which use NNTP.


Chapter 15

Here you will learn how to write webbots that read and delete messages from any POP3 mail server. The ability to read email allows a webbot to interpret instructions sent by email or apply a variety of email filters.

Chapter 16

In this chapter, you'll learn various methods that allow your webbots to send email messages and notifications. You will also learn how to leverage what you learned in the previous chapter to create "smart email addresses" that can determine how to forward messages based on their content, without modifying anything on the mail server.

Chapter 17

This project describes how you can use form emulation and parsing techniques to transform any preexisting online application into a function you can call from any PHP program.


Chapter 7. PRICE-MONITORING WEBBOTS

In this chapter, we'll look at a strategic application of webbots: monitoring online prices. There are many reasons one would do this. For example, a webbot might monitor prices for these purposes:

Notifying someone (via email or text message[24]) when a price drops below a preset threshold
Predicting price trends by performing statistical analysis on price histories
Establishing your company's prices by studying what the competition charges for similar items

Regardless of your reasons to monitor prices, the one thing that all of these strategies have in common is that they all download web pages containing prices and then identify and parse the data.

In this chapter, I will describe methods for monitoring online prices on e-commerce websites. Additionally, I will explain how to parse data from tables and prepare you for the webbot strategies revealed in Chapter 19.

The Target

The practice store, available at this book's website,[25] will be the target for our price-monitoring webbot. A screenshot of the store is shown in Figure 7-1.


Figure 7-1. The e-commerce website that is monitored by the price-monitoring webbot

This practice store provides a controlled environment that is ideal for this exercise. For example, by targeting the example store you can do the following:

Experiment with price-monitoring webbots without the possibility of interfering with an actual business
Control the content of this target, so you don't run the risk of someone modifying the web page, which could break the example scripts[26]

The prices change on a daily basis, so you can also use it to practice writing webbots that track and graph prices over time.

[24] Chapter 16 describes how webbots send email and text messages.

[25] The URL for this store is found at http://www.schrenk.com/nostarch/webbots.

[26] The example scripts are resistant to most changes in the target store.


Designing the Parsing Script

Our webbot's objective is to download the target web page, parse the price variables, and place the data into an array for processing. The price-monitoring webbot is largely an exercise in parsing data that appears in tables, since useful online data usually appears as such. When tables aren't used, <div> tags are generally applied and can be parsed in a similar manner.

While we know that the test target for this example won't change, we don't know that about targets in the wild. Therefore, we don't want to be too specific when telling our parsing routines where to look for pricing information. In this example, the parsing script won't look for data in specific locations; instead, it will look for the desired data relative to easy-to-find text that tells us where the desired information is located. If the position of the pricing information on the target web page changes, our parsing script will still find it.

Let's look at a script that downloads the target web page, parses the prices, and displays the data it parsed. This script is available in its entirety from this book's website. The script is broken into sections here; however, iterative loops are simplified for clarity.


Initialization and Downloading the Target

The example script initializes by including the LIB_http and LIB_parse libraries you read about earlier. It also creates an array where the parsed data is stored, and it sets the product counter to zero, as shown in Listing 7-1.

# Initialization
include("LIB_http.php");
include("LIB_parse.php");
$product_array=array();
$product_count=0;

# Download the target (practice store) web page
$target = "http://www.schrenk.com/webbots/example_store";
$web_page = http_get($target, "");

Listing 7-1: Initializing the price-monitoring webbot

After initialization, the script proceeds to download the target web page with the http_get() function described in Chapter 3. After downloading the web page, the script parses all the page's tables into an array, as shown in Listing 7-2.

# Parse all the tables on the web page into an array
$table_array = parse_array($web_page['FILE'], "<table", "</table>");

Listing 7-2: Parsing the tables into an array

The script does this because the product pricing data is in a table. Once we neatly separate all the tables, we can look for the table with the product data. Notice that the script uses <table, not <table>, as the leading indicator for a table. It does this because <table will always be appropriate, no matter how many table formatting attributes are used.

Next, the script looks for the first landmark, or text that identifies the table where the product data exists. Since the landmark represents text that identifies the desired data, that text must be exclusive to our task. For example, by examining the page's source code we can see that we cannot use the word origin as a landmark because it appears in both the description of this week's auction and the list of products for sale. The example script uses the words Products for Sale, because that phrase only exists in the heading of the product table and is not likely to exist elsewhere if the web page is updated. The script looks at each table until it finds the one that contains the landmark text, Products for Sale, as shown in Listing 7-3.

# Look for the table that contains the product information
for($xx=0; $xx<count($table_array); $xx++)
    {
    $table_landmark = "Products For Sale";
    if(stristr($table_array[$xx], $table_landmark))    // Process this table
        {
        echo "FOUND: Product table\n";

Listing 7-3: Examining each table for the existence of the landmark text

Once the table containing the product pricing data is found, that table is parsed into an array of table rows, as shown in Listing 7-4.

# Parse table into an array of table rows
$product_row_array = parse_array($table_array[$xx], "<tr", "</tr>");

Listing 7-4: Parsing the table into an array of table rows

Then, once an array of table rows from the product data table is available, the script looks for the product table heading row. The heading row is useful for two reasons: It tells the webbot where the data begins within the table, and it provides the column positions for the desired data. This is important because in the future, the order of the data columns could change (as part of a web page update, for example). If the webbot uses column names to identify data, the webbot will still parse data correctly if the order changes, as long as the column names remain the same.

Here again, the script relies on a landmark to find the table heading row. This time, the landmark is the word Condition, as shown in Listing 7-5. Once the landmark identifies the table heading, the positions of the desired table columns are recorded for later use.

for($table_row=0; $table_row<count($product_row_array); $table_row++)
    {
    # Detect the beginning of the desired data (heading row)
    $heading_landmark = "Condition";
    if((stristr($product_row_array[$table_row], $heading_landmark)))
        {
        echo "FOUND: Table heading row\n";

        # Get the position of the desired headings
        $table_cell_array = parse_array($product_row_array[$table_row], "<td", "</td>");
        for($heading_cell=0; $heading_cell<count($table_cell_array); $heading_cell++)
            {
            if(stristr(strip_tags(trim($table_cell_array[$heading_cell])), "ID#"))
                $id_column=$heading_cell;
            if(stristr(strip_tags(trim($table_cell_array[$heading_cell])), "Product name"))
                $name_column=$heading_cell;
            if(stristr(strip_tags(trim($table_cell_array[$heading_cell])), "Price"))
                $price_column=$heading_cell;
            }
        echo "FOUND: id_column=$id_column\n";
        echo "FOUND: price_column=$price_column\n";
        echo "FOUND: name_column=$name_column\n";

        # Save the heading row for later use
        $heading_row = $table_row;
        }

Listing 7-5: Detecting the table heading and recording the positions of desired columns

As the script loops through the table containing the desired data, it must also identify where the pricing data ends. A landmark is used again to identify the end of the desired data. The script looks for the landmark Calculate, from the form's submit button, to identify when it has reached the end of the data. Once found, it breaks the loop, as shown in Listing 7-6.

# Detect the end of the desired data table
$ending_landmark = "Calculate";
if((stristr($product_row_array[$table_row], $ending_landmark)))
    {
    echo "PARSING COMPLETE!\n";
    break;
    }

Listing 7-6: Detecting the end of the table

If the script finds the headers but doesn't find the end of the table, it assumes that the rest of the table rows contain data. It parses these table rows, using the column position data gleaned earlier, as shown in Listing 7-7.

# Parse product and price data
if(isset($heading_row) && $heading_row<$table_row)
    {
    $table_cell_array = parse_array($product_row_array[$table_row], "<td", "</td>");
    $product_array[$product_count]['ID'] = strip_tags(trim($table_cell_array[$id_column]));
    $product_array[$product_count]['NAME'] = strip_tags(trim($table_cell_array[$name_column]));
    $product_array[$product_count]['PRICE'] = strip_tags(trim($table_cell_array[$price_column]));
    $product_count++;
    echo "PROCESSED: Item #$product_count\n";
    }

Listing 7-7: Assigning parsed data to an array

Once the prices are parsed into an array, the webbot script can do anything it wants with the data. In this case, it simply displays what it collected, as shown in Listing 7-8.

# Display the collected data
for($xx=0; $xx<count($product_array); $xx++)
    {
    echo "$xx. ";
    echo "ID: ".$product_array[$xx]['ID'].", ";
    echo "NAME: ".$product_array[$xx]['NAME'].", ";
    echo "PRICE: ".$product_array[$xx]['PRICE']."\n";
    }

Listing 7-8: Displaying the parsed product pricing data

As shown in Figure 7-2, the webbot indicates when it finds landmarks and prices. This not only tells the operator how the webbot is running, but also provides important diagnostic information, making both debugging and maintenance easier.

Since prices are almost always in HTML tables, you will usually parse price information in a manner that is similar to that shown here. Occasionally, pricing information may be contained in other tags (like <div> tags, for example), but this is less likely. When you encounter <div> tags, you can easily parse the data they contain into arrays using similar methods.
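For instance, if a target wrapped each price in <div> tags instead of table cells, the same parse_array() technique would apply. In this sketch, the landmark word price is an assumption about the target's markup.

# Parse all the <div> blocks on the page into an array, then scan for a landmark
$div_array = parse_array($web_page['FILE'], "<div", "</div>");
for($xx=0; $xx<count($div_array); $xx++)
    {
    if(stristr($div_array[$xx], "price"))    // "price" is an assumed landmark
        echo strip_tags(trim($div_array[$xx]))."\n";
    }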

Figure 7-2. The price-monitoring webbot, as run in a shell


Further Exploration

Now you know how to parse pricing information from a web page; what you do with this information is up to you. If you are so inclined, you can expand your experience with some of the following suggestions.

Since the prices in the example store change on a daily basis, monitor the daily price changes for a month and save your parsed results in a database (see the sketch after this list).
Develop scripts that graph price fluctuations.
Read about sending email with webbots in Chapter 16, and develop scripts that notify you when prices hit preset high or low thresholds.
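A minimal sketch of the first suggestion combines the $product_array built in this chapter with LIB_mysql's insert() function; the prices table and its columns are assumptions.

include("LIB_mysql.php");

# Record each parsed price, stamped with today's date, in a hypothetical prices table
for($xx=0; $xx<count($product_array); $xx++)
    {
    $data_array['PRODUCT_ID'] = $product_array[$xx]['ID'];
    $data_array['PRICE']      = $product_array[$xx]['PRICE'];
    $data_array['DATE']       = date("Y-m-d");
    insert(DATABASE, $table="prices", $data_array);
    }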

While this chapter covers monitoring prices online, you can use similar parsing techniques to monitor and parse other types of data found in HTML tables. Consider using the techniques you learned here to monitor things like baseball scores, stock prices, weather forecasts, census data, banner ad rotation statistics,[27] and more.


[27] You can use webbots to perform a variety of diagnostic functions. For example, a webbot may repeatedly download a web page to gather metrics on how banner ads are rotated.


Chapter 8. IMAGE-CAPTURING WEBBOTS

In this chapter, I'll describe a webbot that identifies and downloads all of the images on a web page. This webbot also stores images in a directory structure similar to the directory structure on the target website. This project will show how a seemingly simple webbot can be made more complex by addressing these common problems:

Finding the page base, or the address from which all relative addresses on the page are referenced
Dealing with changes to the page base, caused by page redirection
Converting relative addresses into fully resolved URLs
Replicating complex directory structures
Properly downloading image files with binary formats

In Chapter 18, you'll expand on these concepts to develop a spider that downloads images from an entire website, not just one page.

Example Image-Capturing Webbot

Our image-capturing webbot downloads a target web page (in this case, the Viking Mission web page on the NASA website) and parses all references to images on the page. The webbot downloads each image, echoes the image's name and size to the console, and stores the file on the local hard drive. Figure 8-1 shows what the webbot's output looks like when executed from a shell.


Figure 8-1. The image-capturing bot, when executed from a shell

On this website, like many others, several unique images share the same filename but have different file paths. For example, the image /templates/logo.gif may represent a different graphic than /templates/affiliate/logo.gif. To solve this problem, the webbot re-creates a local copy of the directory structure that exists on the target web page. Figure 8-2 shows the directory structure the webbot created when it saved the images it downloaded from the NASA example.


Creating the Image-Capturing Webbot

This example webbot relies on a library called LIB_download_images, which is available from this book's website. This library contains the following functions:

download_binary_file(), which safely downloads image files
mkpath(), which makes directory structures on your hard drive
download_images_for_page(), which downloads all the images on a page

Figure 8-2. Re-creating a file structure for stored images

For clarity, I will break down this library into highlights and accompanying explanations.

The first script (Listing 8-1) shows the main webbot used in Figure 8-1 and Figure 8-2.

include("LIB_download_images.php");
$target="http://www.nasa.gov/mission_pages/viking/index.html";

download_images_for_page($target);

Listing 8-1: Executing the image-capturing webbot

This short webbot script loads the LIB_download_images library, defines a target web page, and calls the download_images_for_page() function, which gets the images and stores them in a complementary directory structure on the local drive.

Note

Please be aware that the scripts in this chapter, which are available at http://www.schrenk.com/nostarch/webbots, are created for demonstration purposes only. Although they should work in most cases, they aren't production ready. You may find long or complicated directory structures, odd filenames, or unusual file formats that will cause these scripts to crash.


Binary-Safe Download Routine

Our image-grabbing webbot uses the function download_binary_file(), which is designed to download binary files, like images. Other binary files you may encounter could include executable files, compressed files, or system files. Up to this point, the only file downloads discussed have been ASCII (text) files, like web pages. The distinction between downloading binary and ASCII files is important because they have different formats and can cause confusion when downloaded. For example, random byte combinations in binary files may be misinterpreted as end-of-file markers in ASCII file downloads. If you download a binary file with a script designed for ASCII files, you stand a good chance of downloading a partial or corrupt file.

Even though PHP has its own, built-in binary-safe download functions, this webbot uses a custom download script that leverages PHP/cURL functionality to download images from SSL sites (when the protocol is HTTPS), follow HTTP file redirections, and send referer information to the server.

Sending proper referer information is crucial because many websites will stop other websites from "borrowing" images. Borrowing images from other websites (without hosting the images on your server) is bad etiquette and is commonly called hijacking. If your webbot doesn't include a proper referer value, its activity could be confused with a website that is hijacking images. Listing 8-2 shows the file download script used by this webbot.

function download_binary_file($file, $referer)
    {
    # Open a PHP/CURL session
    $s = curl_init();

    # Configure the cURL command
    curl_setopt($s, CURLOPT_URL, $file);              // Define target site
    curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);    // Return file contents in a string
    curl_setopt($s, CURLOPT_BINARYTRANSFER, TRUE);    // Indicate binary transfer
    curl_setopt($s, CURLOPT_REFERER, $referer);       // Referer value
    curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);   // No certificate
    curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE);    // Follow redirects
    curl_setopt($s, CURLOPT_MAXREDIRS, 4);            // Limit redirections to four

    # Execute the cURL command (send contents of target web page to string)
    $downloaded_page = curl_exec($s);

    # Close PHP/CURL session and return the file
    curl_close($s);
    return $downloaded_page;
    }

Listing 8-2: A binary-safe file download routine, optimized for webbot use

Directory Structure

The script that creates directories (shown in Figure 8-2) is derived from a user-contributed routine found on the PHP website (http://www.php.net). Users commonly submit scripts like this one when they find something they want to share with the PHP community. In this case, it's a function that expands on mkdir() by creating complete directory structures with multiple directories at once. I modified the function slightly for our purposes. This function, shown in Listing 8-3, creates any file path that doesn't already exist on the hard drive and, if needed, it will create multiple directories for a single file path. For example, if the image's file path is images/templates/November, this function will create all three directories (images, templates, and November) to satisfy the entire file path.

function mkpath($path)
    {
    # Make sure that the slashes are all single and lean the correct way
    $path=preg_replace('/(\/){2,}|(\\\){1,}/','/',$path);

    # Make an array of all the directories in path
    $dirs=explode("/",$path);

    # Verify that each directory in path exists and create if necessary
    $path="";
    foreach ($dirs as $element)
        {
        $path.=$element."/";
        if(!is_dir($path))    // Directory verified here
            mkdir($path);     // Created if it doesn't exist
        }
    }

Page 193: Webbots, Spiders, And Screen Scrapers - Michael Schrenk

Listing 8-3: Re-creating file paths for downloaded images

The script in Listing 8-3 places all the path directories into an array and attempts to re-create that array, one directory at a time, on the local filesystem. Only nonexistent directories are created.
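For example, a single call using the path mentioned earlier creates every missing directory in the chain:

// Creates images, images/templates, and images/templates/November,
// but only those directories that don't already exist
mkpath("images/templates/November");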

The Main Script

The main function for this webbot, download_images_for_page(), is broken down into highlights and explained below. As mentioned earlier, this function, along with the entire LIB_download_images library, is available at this book's website.

Initialization and Target Validation

Since $target is used later for resolving the web address of the images, the $target value must be validated after the web page is downloaded. This is important because the server may redirect the webbot to an updated web page. That updated URL is the actual URL for the target page and the one that all relative files are referenced from in the next step. The script in Listing 8-4 verifies that $target is the actual URL that was downloaded and not the product of a redirection.

function download_images_for_page($target)
    {
    echo "target = $target\n";

    # Download the web page
    $web_page = http_get($target, $referer="");

    # Update the target in case there was a redirection
    $target = $web_page['STATUS']['url'];

Listing 8-4: Downloading the target web page and responding to page redirection

Defining the Page Base

Much like the <base> HTML tag, the webbot uses $page_base to define the directory address of the target web page. This address becomes the reference for all images with relative addresses. For example, if $target is http://www.schrenk.com/april/index.php, then $page_base becomes http://www.schrenk.com/april. This task, which is shown in Listing 8-5, is performed by the function get_base_page_address(), which is actually in LIB_resolve_address and included by LIB_download_images.

# Strip filename off target for use as page base
$page_base = get_base_page_address($target);

Listing 8-5: Creating the page base for the target web page

As an example, if the webbot finds an image with the relative address 14/logo.gif, and the page base is http://www.schrenk.com/april, the webbot will use the page base to derive the fully resolved address for the image. In this case, the fully resolved address is http://www.schrenk.com/april/14/logo.gif. In contrast, if the image's file path is /march/14/logo.gif, the address will resolve to http://www.schrenk.com/march/14/logo.gif.
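For reference, here is how those two resolutions look as calls to resolve_address(), assuming the same calling convention used later in Listing 8-8; the expected results appear in the comments:

// A sketch of the resolutions described above
$page_base = "http://www.schrenk.com/april";
echo resolve_address("14/logo.gif", $page_base);          // http://www.schrenk.com/april/14/logo.gif
echo resolve_address("/march/14/logo.gif", $page_base);   // http://www.schrenk.com/march/14/logo.gif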

Creating a Root Directory for Imported File Structure

Since this webbot may download images from a number of domains, it creates a directory structure for each (see Listing 8-6). The root directory of each imported file structure is based on the page base.

# Identify the directory where images are to be saved
$save_image_directory = "saved_images_".str_replace("http://", "", $page_base);

Listing 8-6: Creating a root directory for the imported file structure

Parsing Image Tags from the Downloaded Web Page

The webbot uses techniques described in Chapter 4 to parse the image tags from the target web page and put them into an array for easy processing. This is shown in Listing 8-7. The webbot stops if the target web page contains no images.

# Parse the image tags
$img_tag_array = parse_array($web_page['FILE'], "<img", ">");

if(count($img_tag_array)==0)
    {
    echo "No images found at $target\n";
    exit;
    }

Listing 8-7: Parsing the image tags

The Image-Processing Loop

The webbot employs a loop, where each image tag is individually processed. For each image tag, the webbot parses the image file source and creates a fully resolved URL (see Listing 8-8). Creating a fully resolved URL is important because the webbot cannot download an image without its complete URL: the HTTP protocol identifier, the domain, the image's file path, and the image's filename.

$image_path = get_attribute($img_tag_array[$xx], $attribute="src");
echo " image: ".$image_path;
$image_url = resolve_address($image_path, $page_base);

Listing 8-8: Parsing the image source and creating a fully resolved URL

Creating the Local Directory Structure

The webbot verifies that the file path exists in the local file structure. If the directory doesn't exist, the webbot creates the directory path, as shown in Listing 8-9.

if(get_base_domain_address($page_base) == get_base_domain_address($image_url))
    {
    # Make image storage directory for image, if one doesn't exist
    $directory = substr($image_path, 0, strrpos($image_path, "/"));

    clearstatcache();    // Clear cache to get accurate directory status
    if(!is_dir($save_image_directory."/".$directory))     // See if dir exists
        mkpath($save_image_directory."/".$directory);      // Create if it doesn't

Listing 8-9: Creating the local directory structure for each image file

Downloading and Saving the File

Once the path is verified or created, the image is downloaded (using its fully resolved URL) and stored in the local file structure (see Listing 8-10).

# Download the image and report image size
$this_image_file = download_binary_file($image_url, $referer=$target);
echo " size: ".strlen($this_image_file);

# Save the image
$fp = fopen($save_image_directory."/".$image_path, "w");
fputs($fp, $this_image_file);
fclose($fp);
echo "\n";

Listing 8-10: Downloading and storing images

The webbot repeats this process for each image parsed from the target web page.
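If you prefer, the fopen()/fputs()/fclose() sequence in Listing 8-10 can be condensed into a single call; this is an alternative sketch, not the book's code:

# Equivalent one-call form of the save step shown above
file_put_contents($save_image_directory."/".$image_path, $this_image_file);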


Further Exploration

You can point this webbot at any web page, and it will generate a copy of each image that page uses, arranged in a directory structure that resembles the original. You can also develop other useful webbots based on this design. If you want to test your skills, consider the following challenges.

- Write a similar webbot that detects hijacked images.
- Improve the efficiency of the script by reworking it so that it doesn't download an image it has downloaded previously.
- Modify this webbot to create local backup copies of web pages.
- Adjust the webbot to cache movies or audio files instead of images.
- Modify the bot to monitor when images change on a web page.


Final Thoughts

If you attempt to run this webbot on a remote server, remember that your webbot must have write privileges on that server, or it won't be able to create file structures or download images.


Chapter 9. LINK-VERIFICATION WEBBOTS

This webbot project solves a problem shared by all web developers: detecting broken links on web pages. Verifying links on a web page isn't a difficult thing to do, and the associated script is short. Figure 9-1 shows the simplicity of this webbot.

Creating the Link-Verification Webbot

For clarity, I'll break down the creation of the link-verification webbot into manageable sections, which I'll explain along the way. The code and libraries used in this chapter are available for download at this book's website.

Initializing the Webbot and Downloading the Target

Before validating links on a web page, your webbot needs to load the required libraries and initialize a few key variables. In addition to LIB_http and LIB_parse, this webbot introduces two new libraries: LIB_resolve_addresses and LIB_http_codes. I'll explain these additions as I use them.


Figure 9-1. Link-verification bot flow chart

The webbot downloads the target web page with the http_get() function, which was described in Chapter 3.

# Include libraries
include("LIB_http.php");
include("LIB_parse.php");
include("LIB_resolve_addresses.php");
include("LIB_http_codes.php");

# Identify the target web page and the page base
$target = "http://www.schrenk.com/nostarch/webbots/page_with_broken_links.php";
$page_base = "http://www.schrenk.com/nostarch/webbots/";

# Download the web page
$downloaded_page = http_get($target, $ref="");

Listing 9-1: Initializing the bot and downloading the target web page

Setting the Page Base


In addition to defining the $target, which points to a diagnostic page on the book's website, Listing 9-1 also defines a variable called $page_base. A page base defines the domain and server directory of the target page, which tells the webbot where to find web pages referenced with relative links. Relative links are references to other files, relative to where the reference is made. For example, consider the relative links in Table 9-1.

Table 9-1. Examples of Relative Links

Link                                  References a File Located In . . .
<a href="linked_page.html">           Same directory as web page
<a href="../linked_page.html">        The page's parent directory (up one level)
<a href="../../linked_page.html">     The page's parent's parent directory (up 2 levels)
<a href="/linked_page.html">          The server's root directory

Your webbot would fail if it tried to download any of these links as is, since your webbot's reference point is the computer it runs on, and not the computer where the links were found. The page base, however, gives your webbot the same reference as the target page. You might think of it this way: The page base is to a webbot as the <base> tag is to a browser. The page base sets the reference for everything referred to on the target web page.

Parsing the Links

You can easily parse all the links and place them into an array with the script in Listing 9-2.


# Parse the links
$link_array = parse_array($downloaded_page['FILE'], $beg_tag="<a", $close_tag=">");

Listing 9-2: Parsing the links from the downloaded page

The code in Listing 9-2 uses parse_array() to put everything between every occurrence of <a and > into an array.[28] The function parse_array() is not case sensitive, so it doesn't matter if the target web page uses <a>, <A>, or a combination of both tags to define links.

Running a Verification Loop

You gain a great deal of convenience when the parsed links are available in an array. The array allows your script to verify the links iteratively through one set of verification instructions, as shown in Listing 9-3. The PHP sections of this script appear in bold.

Listing 9-3 also includes some HTML formatting to create a nice-looking report, which you'll see later. Notice that the contents of the verification loop have been removed for clarity. I'll explain what happens in this loop next.

<b>Status of links on <?echo $target?></b><br>
<table border="1" cellpadding="1" cellspacing="0">
    <tr bgcolor="#e0e0e0">
        <th>URL</th>
        <th>HTTP CODE</th>
        <th>MESSAGE</th>
        <th>DOWNLOAD TIME (seconds)</th>
    </tr>
<?
for($xx=0; $xx<count($link_array); $xx++)
    {
    // Verification and display go here
    }

Listing 9-3: The verification loop

Generating Fully Resolved URLs

Since the contents of the $link_array elements are actually complete anchor tags, we need to parse the value of the href attribute out of the tags before we can download and test the pages they reference. The value of the href attribute is extracted from the anchor tag with the function get_attribute(), as shown in Listing 9-4.

// Parse the href attribute from the link
$link = get_attribute($tag=$link_array[$xx], $attribute="href");

Listing 9-4: Parsing the referenced address from the anchor tag

Once you have the href address, you need to combine the previously defined $page_base with the relative address to create a fully resolved URL, which your webbot can use to download pages. A fully resolved URL is any URL that describes not only the file to download, but also the server and directory where that file is located and the protocol required to access it. Table 9-2 shows the fully resolved addresses for the links in Table 9-1, assuming the links are on a page in the directory http://www.schrenk.com/nostarch/webbots.

Table 9-2. Examples of Fully Resolved URLs (for links on a page in http://www.schrenk.com/nostarch/webbots)

Link                                  Fully Resolved URL
<a href="linked_page.html">           http://www.schrenk.com/nostarch/webbots/linked_page.html
<a href="../linked_page.html">        http://www.schrenk.com/nostarch/linked_page.html
<a href="../../linked_page.html">     http://www.schrenk.com/linked_page.html
<a href="/linked_page.html">          http://www.schrenk.com/linked_page.html

Fully resolved URLs are made with the resolve_address() function (see Listing 9-5), which is in the LIB_resolve_addresses library. This library is a set of routines that converts all possible methods of referencing web pages in HTML into fully resolved URLs.

// Create a fully resolved URL
$fully_resolved_link_address = resolve_address($link, $page_base);

Listing 9-5: Creating fully resolved addresses with resolve_address()

Downloading the Linked Page

The webbot verifies the status of each page referenced by the links on the target page by downloading each page and examining its status. It downloads the pages with http_get(), just as you downloaded the target web page earlier (see Listing 9-6).

// Download the page referenced by the link and evaluate
$downloaded_link = http_get($fully_resolved_link_address, $target);

Listing 9-6: Downloading a page referenced by a link

Notice that the second parameter in http_get() is set to the address of the target web page. This sets the page's referer variable to the target page. When executed, the effect is the same as telling the server that someone requested the page by clicking a link from the target web page.

Displaying the Page Status

Once the linked page is downloaded, the webbot relies on the STATUS element of the downloaded array to analyze the HTTP code, which is provided by PHP/CURL. (For your future projects, keep in mind that PHP/CURL also provides total download time and other diagnostics that we're not using in this project.)

HTTP status codes are standardized, three-digit numbers that indicate the status of a page fetch.[29] This webbot uses these codes to determine if a link is broken or valid. These codes are divided into ranges that define the type of errors or status, as shown in Table 9-3.

Table 9-3. HTTP Code Ranges and Related Categories

HTTP Code Range    Category         Meaning
100-199            Informational    Not generally used
200-299            Successful       Your page request was successful
300-399            Redirection      The page you're looking for has moved or has been removed
400-499            Client error     Your web client made an incorrect or illogical page request
500-599            Server error     A server error occurred, generally associated with a bad form submission
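As a rough illustration of how these ranges could be applied inside the verification loop, here is a sketch; it is an assumption for illustration, not part of the book's script:

# A sketch that applies the ranges in Table 9-3 (not part of the book's script)
$http_code = $downloaded_link['STATUS']['http_code'];
if($http_code >= 200 && $http_code < 300)
    echo "Link is valid\n";
elseif($http_code >= 300 && $http_code < 400)
    echo "Link redirects or has moved\n";
else
    echo "Link appears broken (HTTP code $http_code)\n";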

The $status_code_array was created when the LIB_http_codes library was imported. When you use the HTTP code as an index into $status_code_array, it returns a human-readable status message, as shown in Listing 9-7. (PHP script is in bold.)

<tr>
    <td align="left"><?echo $downloaded_link['STATUS']['url']?></td>
    <td align="right"><?echo $downloaded_link['STATUS']['http_code']?></td>
    <td align="left"><?echo $status_code_array[$downloaded_link['STATUS']['http_code']]?></td>
    <td align="right"><?echo $downloaded_link['STATUS']['total_time']?></td>
</tr>

Listing 9-7: Displaying the status of linked web pages

As an added feature, the webbot displays the amount of time (in seconds) required to download pages referenced by the links on the target web page. This period is automatically measured and recorded by PHP/CURL when the page is downloaded. The period required to download the page is available in the array element $downloaded_link['STATUS']['total_time'].

[28] Parsing functions are explained in Chapter 4.

[29] The official reference for HTTP codes is available on the World Wide Web Consortium's website (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html).


Running the Webbot

Since the output of this webbot contains formatted HTML, it is appropriate to run this webbot within a browser, as shown in Figure 9-2.

Figure 9-2. Running the link-verification webbot

This webbot counts and identifies all the links on the target website. It also indicates the HTTP code and diagnostic message describing the status of the fetch used to download the page and displays the actual amount of time it took the page to load. Let's take this time to look at some of the libraries used by this webbot.

LIB_http_codes

The following script creates an indexed array of HTTP error codes and their definitions. To use the array, simply include the library, insert your HTTP code value into the array, and echo, as shown in Listing 9-8.

include("LIB_http_codes.php");
echo $status_code_array[$YOUR_HTTP_CODE]['MSG'];

Listing 9-8: Decoding an HTTP code with LIB_http_codes

LIB_http_codes is essentially a group of array declarations, with the first element being the HTTP code and the second element, ['MSG'], being the status message text. Like the others, this library is also available for download from this book's website.
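A few representative entries, written as a guess at the declaration style described above; the actual library defines many more codes and may format them differently:

# Representative entries only; the real library covers the full set of codes
$status_code_array['200']['MSG'] = "OK";
$status_code_array['301']['MSG'] = "Moved Permanently";
$status_code_array['404']['MSG'] = "Not Found";
$status_code_array['500']['MSG'] = "Internal Server Error";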

LIB_resolve_addresses

The library that creates fully resolved addresses, LIB_resolve_addresses, is also available for download at the book's website.

Note

Before you download and examine this library, be warned that creating fully resolved URLs is a lot like making sausage: while you might enjoy how sausage tastes, you probably wouldn't like watching those lips and ears go into the grinder. Simply put, the act of converting relative links into fully resolved URLs involves awkward, asymmetrical code with numerous exceptions to rules and many special cases. This library is extraordinarily useful, but it isn't made up of pretty code.

If you don't need to see how this conversion is done, there's no reason to look. If, on the other hand, you're intrigued by this description, feel free to download the library from the book's website and see for yourself. More importantly, if you find a cleaner solution, please upload it to the book's website to share it with the community.


Further Exploration

You can expand this basic webbot to do a variety of very useful things. Here is a short list of ideas to get you started on advanced designs.

- Create a web page with a form that allows people to enter and test the links of any web page.
- Schedule a link-verification bot to run periodically to ensure that links on web pages remain current. (For information on scheduling webbots, read Chapter 23.)
- Modify the webbot to send email notifications when it finds dead links. (More information on webbots that send email is available in Chapter 16.)
- Encase the webbot in a spider to check the links on an entire website.
- Convert this webbot into a function that is called directly from PHP. (This idea is explored in Chapter 17.)


Chapter 10. ANONYMOUS BROWSING WEBBOTS

The Internet is a public place, and as in any other community, web surfers leave telltale clues of where they've been and what they've done. While many people feel anonymous online, the fact is that server logs, cookies, and browser caches leave little doubt to what happens on the Internet. While total online anonymity is nearly impossible, you can cloak your activity through a specialized webbot called a proxybot, or simply a proxy. This chapter investigates applications for proxies and later explores a webbot proxy project that provides anonymous web browsing.

Anonymity with Proxies

A proxy is a special type of webbot that serves as an intermediary between webservers and clients. Proxies have many uses, including banning people from browsing prohibited websites, blocking banner advertisements, and inhibiting suspect scripts from running on browsers.

One of the more popular proxies is Squid, a web proxy that, among other things, saves bandwidth on large networks by caching frequently downloaded images.[30] Squid, along with most other proxies, also converts private network IP addresses into a single public address through a process called Network Address Translation (NAT).

A side effect of proxy use is that proxies create a potentially anonymous browsing environment because individual network addresses are pooled into a single network address. Since only the proxied network address is visible to web servers, the identities of the individual surfers remain unknown. Anonymity is the focus of this chapter, but before we start that discussion, a quick review of the liabilities of browsing in a non-proxied environment is in order.

Non-proxied Environments

In non-proxied network environments, web clients are totally exposed to the servers they access. This is important in terms of privacy because servers maintain records of requesting IP addresses, the files accessed, and the times they were accessed, as depicted in Figure 10-1.


Figure 10-1. Browsing in a non-proxied network environment

Additionally, webservers may store small records of browsing activity on clients' hard drives in the form of cookies.[31] By reading cookies on a user's successive visits to the same Internet domain, webservers determine a variety of information, including previously defined browsing preferences, authentication criteria, and browsing history for that user within that domain.

Your Online Exposure

You may think that you only expose your identity online when you formally register a username and password with a website, or that your identity is only known at sites where you've registered. However, a variety of tricks are available to monitor Internet activity, even when you don't have administration rights to a website. For example, you can learn a lot about the users of community forums, news servers, or even MySpace by uploading a single-pixel image, usually a transparent GIF file, to one of those services. While the single-pixel image is essentially invisible, everybody who accesses a web page containing one also downloads this seemingly innocuous little image. If things are set up correctly, each web surfer who downloads a page containing one of these single-pixel images leaves a record in a server log file, unknowingly recording his or her IP address and file access time. Here are some of the things you can learn from these log files:

- IP addresses of the web surfers accessing the page
- Frequency that someone with a specific IP address (or domain of origin) visits the page
- Time of day that the web surfer visited the web page
- Total traffic the web page receives
- Indications of when traffic to the web page is heavy or light

Once you have a visitor's IP address, you could also identify his or her ISP by performing a reverse DNS lookup, which converts an IP address into its domain of origin. Many times, a reverse DNS lookup only reveals someone's ISP, like EarthLink or AOL. And since so many people (from all over the world) use these ISPs, that information isn't very useful. Other times, however, the domain name will give you the name of a specific corporation or organization that downloaded the web page.[32]
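PHP can perform the reverse DNS lookup described above with its built-in gethostbyaddr() function; the IP address and the echoed hostname below are placeholders for values pulled from a real server log:

// The IP address here is a placeholder for one found in a server log
$domain_of_origin = gethostbyaddr("203.0.113.15");
echo $domain_of_origin;   // e.g., something like dsl-pool.example-isp.net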

You can also configure the server that hosts the single-pixel image to write a cookie on the hard drive of the web surfer. With this cookie, you can determine when an individual user gains access to web pages. If you place your single-pixel image on many web pages that are visited by a specific Internet user, you can track much of that user's browsing activity.

If you think these threats to one's privacy are too theoretical, consider what happens on a larger scale with online advertising companies like MySpace, Google, DoubleClick, and SpecificClick.


Given the large number of web pages on which these companies' advertisements appear, they are capable of tracking a very large percentage of your online activity. Just consider how many of the websites you visit have advertisements. Then look at your browser's cookie records (usually available in the privacy settings of your browser, as shown in Figure 10-2) to see how many of these media companies have left cookies on your computer.

Figure 10-2. Viewing advertisers' cookies

Armed with what you know now, are you wondering why advertising companies write cookies to your hard drive? Are you questioning why the cookie in Figure 10-2 doesn't expire for nearly three years? I hope that this information freaks you out just a little and whets your appetite to learn more about writing anonymizing webbot proxies.

Proxied Environments


Typically, in corporate settings, proxies sit between a private network and the Internet, and all traffic that moves between the two is forced through the proxy. In the process, the proxy replaces each individual's identity with its own, and thereby "hides" the web surfer from the webserver's log files, as shown in Figure 10-3.

Figure 10-3. Hiding behind a proxy

Since the web surfer in Figure 10-3 is the only proxy user, no anonymity is achieved; the proxy is synonymous with the person using it. Ambiguity, and eventually anonymity, is achieved as more people use the same proxy, as in Figure 10-4.

Figure 10-4. Achieving anonymity through numbers

The log files recorded by the webservers become ambiguous as more people use the proxy because the proxy's identity no longer represents a single web surfer. As the number of people using the proxy increases, the ability to identify any individual user decreases. While anonymity is not generally an objective for proxies of this type, it is a side effect of operation, and the focus of this chapter's project.


[30] Information about Squid, a popular open source web proxy cache, is available at http://www.squid-cache.org. In addition to caching frequently downloaded images, Squid also caches DNS lookups, failed requests, and many other Internet objects.

[31] Chapter 21 and Chapter 22 describe cookies and their application to webbots in detail.

[32] In the late 1990s, Amazon.com used a similar technique, combined with purchase data, to determine the reading lists of large corporations. For a short while, Amazon.com actually published these lists on its website. For obvious reasons, this feature was short-lived.


The Anonymizer Project

In many respects, this anonymizer is like the previously described network proxies. However, this anonymizer is web-based, in contrast to most (corporate) proxies, which provide the only path from a local network to the Internet. Since all traffic between the private network and the Internet passes through these network proxies, it is simpler for them to modify traffic. Our web-based proxy, in contrast, runs on a web script and must contain the traffic within a browser. What this means is that every link passing through a web-based proxy must be modified to keep the web surfer on the anonymizer's web page, which is shown in Figure 10-5.


Figure 10-5. The anonymous browsing proxy

The user interface of the anonymous browsing proxy provides a place for web surfers to enter the URL of the website they wish to surf anonymously. After clicking Go, the page appears in the browser window, and the webserver, where the content originates, records the identity of the anonymizer. Because of the proxy, the webserver has no knowledge of the identity of the web surfer.

In order for the proxy to work, all web surfing activity must happen within the anonymizer script. If someone clicks a link, he or she must return to the anonymizer and not end up at the website referenced by the link. Therefore, before sending the web page to the browser, the anonymizer changes each link address to reference itself, while passing a Base64-encoded address of the link in a variable, as shown in the status bar at the bottom of Figure 10-5.

Note

This is a simple anonymizer, designed for study; it is not suitable for use in production environments. It will not work correctly on web pages that rely on forms, cookies, JavaScript, frames, or advanced web development techniques.

Writing the Anonymizer

The following scripts describe the anonymizer's design. The complete script for the anonymizer project is available on this book's website.[33] For clarity, only script highlights are described in detail here.

Downloading and Preparing the Target Web Page

After initializing libraries and variables, which is done in Listing 10-1, the anonymizer downloads and prepares the target web page for later processing. Note that the anonymizer makes use of the parsing and HTTP libraries described in Part I.

# Download the target web page
$page_array = http_get($target_webpage, $ref="", GET, $data_array="", EXCL_HEAD);

# Clean up the HTML formatting with Tidy
$web_page = tidy_html($page_array['FILE']);

# Get the base page address so we can create fully resolved addresses later
$page_base = get_base_page_address($page_array['STATUS']['url']);

# Remove JavaScript and HTML comments from the web page
$web_page = remove($web_page, "<script", "</script>");
$web_page = remove($web_page, "<!--", "-->");

Listing 10-1: Downloading and prepping the target web page

Modifying the <base> Tag

After prepping the target web page, the <base> tag is either inserted or modified so all relative page addresses will properly resolve from the anonymizer's URL. This is shown in Listing 10-2.

$new_base_value = "<base href=\"".$page_base."\">";
if(!stristr($web_page, "<base"))
    {
    # If there is a <head>, insert <base> at beginning of <head></head>
    if(stristr($web_page, "<head"))
        {
        $web_page = eregi_replace("<head>", "<head>\n".$new_base_value, $web_page);
        }
    # Else insert a <head><base></head> at beginning of web page
    else
        {
        $web_page = "<head>\n".$new_base_value."\n</head>" . $web_page;
        }
    }

Listing 10-2: Adjusting the target page's <base> tag

Parsing the Links

The next step is to create an array of all the links on the page, which is done with the script in Listing 10-3.

$a_tag_array = parse_array($web_page, "<a", ">");

Listing 10-3: Creating an array of all the links (anchor tags)

Substituting the Links


After parsing links into an array, the code loops through each link. This loop, shown in Listing 10-4, performs the following steps:

1. Parse the hyper-reference attribute for each link.

2. Convert the hyper-reference into a fully resolved URL.

3. Convert the hyper-reference into the following format:

   anonymizer_address?v=base64_encoded_hyper-reference

4. Substitute the original hyper-reference with the URL (representing the anonymizer_address and the original link passed as a variable) created in the previous step.

for($xx=0; $xx<count($a_tag_array); $xx++)
    {
    // Get the original href value
    $original_href = get_attribute($a_tag_array[$xx], "href");

    // Convert href to a fully resolved address
    $fully_resolved_href = get_fully_resolved_address($original_href, $page_base);

    // Substitute the original href with "this_page?v=fully resolved address"
    $substitution_tag = str_replace($original_href,
        trim($this_page."?v=".base64_encode($fully_resolved_href)),
        $a_tag_array[$xx]);

    // Substitute the original tag with the new one
    $web_page = str_replace($a_tag_array[$xx], $substitution_tag, $web_page);
    }

Listing 10-4: Substituting links with coded links that re-reference the anonymizer

Displaying the Proxied Web Page

Once all the links are processed, the anonymizer sends the newly processed web page to the requesting web surfer's browser, as shown in Listing 10-5.

# Display the processed target web page
echo $web_page;

Listing 10-5: Displaying the proxied web page

That's all there is to it. The important thing is to design the anonymizer so all links displayed in the anonymizer's window re-reference the anonymizer with a $_GET variable that identifies the actual page to download. This is really not that hard to do, but as mentioned earlier, this anonymizer does not handle forms, cookies, JavaScript, frames, or more advanced web design techniques. That being said, it's a good place to start, and you should use this script to further explore the concept of anonymizing. With a few modifications, you could write web proxies that modify web content in a variety of ways.
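On the receiving end, the anonymizer must reverse that encoding before it can fetch the requested page. Here is a minimal sketch of that step, assuming the $_GET variable is named v as in Listing 10-4; the 'url' form-field name for the initial request is hypothetical:

# A minimal sketch of decoding the link variable described above;
# the 'url' form-field name is hypothetical
if(isset($_GET['v']))
    $target_webpage = base64_decode($_GET['v']);
else
    $target_webpage = $_GET['url'];   // Address entered into the anonymizer's form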

[33] This book's website is available at http://www.schrenk.com/nostarch/webbots.


Final Thoughts

It is important to note that anonymizers do not always provide complete anonymity. Anonymous browsing techniques rely on many users to mask the actions of individuals, and they are not foolproof. However, even simple anonymizers hide a web surfer's ISP and country of origin. Moreover, barring a disclosure of the anonymizer's server logs, users should remain anonymous; even if those logs were examined, they would still have to be referenced with the logs of ISPs to identify web surfers. Advanced anonymizers complicate issues further by making page requests from a variety of domains, which adds more confusion to server logs and users' identities. An anonymizer's access log files gain further protection if you host anonymizers on encrypted servers in countries that don't honor your home country's subpoenas for server log records.[34] (You didn't hear me make that recommendation, however.)

People argue about whether or not anonymous browsing is a good thing. On one hand, it can hamper the tracking of cyber criminals. However, anonymizers also provide freedom to people living in countries that severely limit what they can view online. I have also found anonymizers to be helpful in cases where I needed to view a website from a remote domain in order to debug security certificates. I don't have a lot of personal experience with other people's anonymizers, so I won't make any recommendations, but if these types of programs interest you, a quick Google search will reveal that many are available.

[34] Perhaps the most famous of these countries is Sealand, a sovereign country built on an abandoned World War II anti-aircraft platform seven miles off the coast of England. More information about Sealand is available at its official website, http://www.sealandgov.org.


Chapter 11. SEARCH-RANKING WEBBOTS

Every day, millions of people find what they need online through search websites. If you own an online business, your search ranking may have far-reaching effects on that business. A higher-ranking search result should yield higher advertising revenue and more customers. Without knowing your search rankings, you have no way to measure how easy it is for people to find your web page, nor will you have a way to gauge the success of your attempts to optimize your web pages for search engines.

Manually finding your search ranking is not as easy as it sounds, especially if you are interested in the ranking of many pages with an assortment of search terms. If your web page appears on the first page of results, it's easy to find, but if your page is listed somewhere on the sixth or seventh page, you'll spend a lot of time figuring out how your website is ranked. Even searches for relatively obscure terms can return a large number of pages. (A recent Google search on the term tapered drills, for example, yielded over 44,000 results.) Since search engine spiders continually update their records, your search ranking may also change on a daily basis. Complicating the matter more, a web page will have a different search ranking for every search term. Manually checking web page search rankings with a browser does not make sense; webbots, however, make this task nearly trivial.

With all the search variations for each of your web pages, there is a need for an automated service to determine your web page's search ranking. A quick Internet search will reveal several such services, like the one shown in Figure 11-1.

Figure 11-1. A search-ranking service, GoogleRankings.com

This chapter demonstrates how to design a webbot that finds a search ranking for a domain and a search term. While this project's target is on the book's website, you can modify this webbot to work on a variety of available search services.[35]

This example project also shows how to perform an insertion parse, which injects parsing tags within a downloaded web page to make parsing easier.

Description of a Search Result Page

Most search engines return two sets of results for any given search term, as shown in Figure 11-2. The most prominent search results are paid placements, which are purchased advertisements made to look something like search results. The other set of search results is made up of organic placements (or just organics), which are non-sponsored search results.

This chapter's project focuses on organics because they're the links that people are most likely to follow. Organics are also the search results whose visibility is improved through Search Engine Optimization.

Figure 11-2. Parts of a search results page

The other part of the search result page we'll focus on is the Next link. This is important because it tells our webbot where to find the next page of search results.

For our purposes, the search ranking is determined by counting the number of pages in the search results until the subject web page is first found. The page number is then combined with the position of the subject web page within the organic placements on that page. For example, if a web page is the sixth organic on the first result page, it has a search ranking of 1.6. If a web page is the third organic on the second page, its search ranking is 2.3.
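Expressed in code, the ranking described above might be assembled like this; this is a sketch, and the variable names are assumptions rather than the book's:

# A sketch of composing the ranking described above; variable names are assumptions
$search_ranking = $page_index.".".$position_on_page;
echo $search_ranking;   // e.g., "1.6" for the sixth organic on the first result page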

[35] If you modify this webbot to work on other search services, make sure you are not violating their respective Terms of Service agreements.


What the Search-Ranking Webbot Does

This webbot (actually a specialized spider) submits a search term to a search web page and looks for the subject web page in the search results. If the webbot finds the subject web page within the organic search results, it reports the web page's ranking. If, however, the webbot doesn't find the subject in the organics on that page, it downloads the next page of search results and tries again. The webbot continues searching deeper into the pages of search results until it finds a link to the subject web page. If the webbot can't find the subject web page within a specified number of pages, it will stop looking and report that it could not find the web page within the number of result pages searched.


Running the Search-Ranking Webbot

Figure 11-3 shows the output of our search-ranking webbot. In each case, there must be both a test web page (the page we're looking for in the search results) and a search term. In our test case, the webbot is looking for the ranking of http://www.loremianam.com, with a search term of webbots.[36] Once the webbot is run, it only takes a few seconds to determine the search ranking for this combination of web page and search term.

Figure 11-3. Running the search-ranking webbot


[36] Unlike a real search service, the demonstration search pages on the book's website return the same page set regardless of the search term used.


How the Search-Ranking Webbot Works

Our search-ranking webbot uses the process detailed in Figure 11-4 to determine the ranking of a website using a specific search term. These are the steps:

1. Initialize variables for use, including the search criteria and the subject web page.

2. Fetch the subject web page from the search engine using the search term.

3. Parse the organic search results from the advertisement and navigation text.

4. Determine whether or not the desired website appears in this page's search results.

   a. If the desired website is not found, keep looking deeper into the search results until the desired web page is found or the maximum number of attempts has been used.

   b. If the desired website is found, record the ranking.

5. Report the results.

Figure 11-4. Search-ranking webbot at work


The Search-Ranking Webbot Script

The following section describes key aspects of the webbot's script. The latest version of this script is available for download at this book's website.

Note

If you want to experiment with the code, you should download the webbot's script. I have simplified the scripts shown here for demonstration purposes.

Initializing Variables

Initialization consists of including libraries and identifying the subject website and search criteria, as shown in Listing 11-1.

# Initialization
// Include libraries
include("LIB_http.php");
include("LIB_parse.php");

// Identify the search term and URL combination
$desired_site = "www.loremianam.com";
$search_term  = "webbots";

// Initialize other miscellaneous variables
$page_index      = 0;
$url_found       = false;
$previous_target = "";

// Define the target website and the query string for the search term
$target = "http://www.schrenk.com/nostarch/webbots/search";
$target = $target."?q=".urlencode(trim($search_term));
# End: Initialization

Listing 11-1: Initializing the search-ranking webbot

The target is the page we're going to download, which in this case is a demonstration search page on this book's website. That URL also includes the search term in the query string. The webbot URL-encodes the search term to guarantee that none of the characters in the search term conflict with reserved URL character combinations. For example, the PHP built-in function urlencode() changes Karen Susan Terri to Karen+Susan+Terri. If the search term contains characters that are illegal in a URL, for example, the comma or ampersand in Karen, Susan & Terri, it would be safely encoded to Karen%2C+Susan+%26+Terri.
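You can confirm those encodings directly with PHP:

// Confirming the encodings described above
echo urlencode(trim("Karen Susan Terri"));      // Karen+Susan+Terri
echo urlencode(trim("Karen, Susan & Terri"));   // Karen%2C+Susan+%26+Terri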

Starting the Loop

The webbot loops through the main section of the code, which requests pages of search results and searches within those pages for the desired site, as shown in Listing 11-2.

# Initialize loop
while($url_found==false)
    {
    $page_index++;
    echo "Searching for ranking on page #$page_index\n";

Listing 11-2: Starting the main loop

Within the loop, the script removes any HTML special characters from the target to ensure that the values passed to the target web page only include legal characters, as shown in Listing 11-3. In particular, this step replaces &amp; with the preferred & character.

// Verify that there are no illegal characters in the URLs
$target          = html_entity_decode($target);
$previous_target = html_entity_decode($previous_target);

Listing 11-3: Formatting characters to create properly formatted URLs

This particular step should not be confused with URL encoding, because while &amp; is legal to have in a URL, it will be interpreted as $_GET['amp'] and return invalid results.
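A quick way to see the substitution described above (the query string here is hypothetical):

// Quick illustration of the decoding described above; the query string is hypothetical
echo html_entity_decode("search?q=webbots&amp;page=2");   // prints search?q=webbots&page=2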

Fetching the Search Results

The webbot tries to simulate the action of a person who is manually looking for a website in a set of search results. The webbot uses two techniques to accomplish this trick. The first is the use of a random delay of three to six seconds between fetches, as shown in Listing 11-4.


sleep(rand(3, 6));

Listing 11-4: Implementing a random delay

Taking this precaution will make it less obvious that a webbot is parsing the search results. This is a good practice for all webbots you design.

The second technique simulates a person manually clicking the Next button at the bottom of the search result page to see the next page of search results. Our webbot "clicks" on the link by specifying a referer variable, which in our case is always the target used in the previous iteration of the loop, as shown in Listing 11-5. On the initial fetch, this value is an empty string.

$result = http_get($target, $ref=$previous_target, GET, "", EXCL_HEAD);
$page = $result['FILE'];

Listing 11-5: Downloading the next page of search results from the target and specifying a referer variable

The actual contents of the fetch are returned in the FILE element of the returned $result array.

Parsing the Search Results

This webbot uses a parsing technique referred to as an insertion parse because it inserts special parsing tags into the fetched web page to facilitate an easy parse (and easy debugging). Consider using the insertion parse technique when you need to parse multiple blocks of data that share common separators. The insertion parse is particularly useful when web pages change frequently or when the information you need is buried deep within a complicated HTML table structure. The insertion technique also makes your code much easier to debug, because by viewing where you insert your parsing tags, you can figure out where your parsing script thinks the desired data is.

Think of the text you want to parse as blocks of text surrounded by other blocks of text you don't need. Imagine that the web page you want to parse looks like Figure 11-5, where the desired text is depicted as the dark blocks. Find the beginning of the first block you want to parse. Strip away everything prior to this point and insert a <data> tag at the beginning of this block (Figure 11-6). Replace the text that separates the blocks of text you want to parse with </data> and <data> tags. Now every block of text you want to parse is sandwiched between <data> and </data> tags (see Figure 11-7). This way, the text can be easily parsed with the parse_array() function. The final <data> tag is an artifact and is ignored.


Figure 11-5. Desired data depicted in dark gray

Figure 11-6. Initiating an insertion parse

Figure 11-7. Delimiting desired text with <data> tags

The script that performs the insertion parse is straightforward, but it depends on accurately identifying the text that surrounds the blocks we want to parse. The first step is to locate the text that identifies the start of the first block. The only way to do this is to look at the HTML source code of the search results. A quick examination reveals that the first organic is immediately preceded by <!--@gap;-->.[37] The next step is to find some common text that separates each organic search result. In this case, the search results are also separated by <!--@gap;-->.

To place the <data> tag at the beginning of the first block, the webbot uses the strpos() function to determine where the first block of text begins. That position is then used in conjunction with substr() to strip away everything before the first block. Then a simple string concatenation places a <data> tag in front of the first block, as shown in Listing 11-6.

// We need to place the first <data> tag before the first piece
// of desired data, which we know starts with the first occurrence
// of $separator
$separator = "<!--@gap;-->";

// Find first occurrence of $separator
$beg_position = strpos($page, $separator);

// Get rid of everything before the first piece of desired data
// and insert a <data> tag before the data
$page = substr($page, $beg_position, strlen($page));
$page = "<data>".$page;

Listing 11-6: Inserting the initial insertion parse tag (as in Figure 11-6)

The insertion parse is completed with the insertion of the </data> and <data> tags. The webbot does this by simply replacing the block separator that we identified earlier with our insertion tags, as shown in Listing 11-7.

$page = str_replace($separator, "</data><data>", $page);

// Put all the desired content into an array
$desired_content_array = parse_array($page, "<data>", "</data>", EXCL);

Listing 11-7: Inserting the insertion delimiter tags (as in Figure 11-7)

Once the insertion is complete, each block of text is sandwiched between tags that allow the webbot to use the parse_array() function to create an array in which each array element is one of the blocks. Could you perform this parse without the insertion parse technique? Of course. However, the insertion parse is more flexible and easier to debug, because you have more control over where the delimiters are placed, and you can see where the file will be parsed before the parse occurs.

Once the search results are parsed and placed into an array, it's a simple process to compare them with the web page we're ranking, as in Listing 11-8.

for($page_rank=0; $page_rank<count($desired_content_array); $page_rank++)
    {
    // Look for the $subject_site to appear in one of the listings
    if(stristr($desired_content_array[$page_rank], $subject_site))
        {
        $url_found_rank_on_page = $page_rank;
        $url_found = true;
        }
    }

Listing 11-8: Determining if an organic matches the subject web page

If the web page we're looking for is found, the webbot records its ranking and sets a flag to tell the webbot to stop looking for additional occurrences of the web page in the search results. If the webbot doesn't find the website in this page, it finds the URL for the next page of search results. This URL is the link that contains the string Next. The webbot finds this URL by placing all the links into an array, as shown in Listing 11-9.

// Create an array of links on this page
$search_links = parse_array($result['FILE'], "<a", "</a>", EXCL);

Listing 11-9: Parsing the page's links into an array

The webbot then looks at each link until it finds the hyperlink containing the word Next. Once found, it sets the referer variable with the current target and uses the new link as the next target. It also inserts a random three-to-six second delay to simulate human interaction, as shown in Listing 11-10.

for($xx=0; $xx<count($search_links); $xx++)
    {
    if(strstr($search_links[$xx], "Next"))
        {
        $previous_target = $target;
        $target = get_attribute($search_links[$xx], "href");

        // Remember that this path is relative to the target page, so add
        // protocol and domain
        $target = "http://www.schrenk.com/nostarch/webbots/search/".$target;
        }
    }

Listing 11-10: Looking for the URL for the next page of search results

[37] Comments are common parsing landmarks, especially when web pages are created with an HTML generator like Adobe Dreamweaver.


Final Thoughts

Now that you know how to write a webbot that determines search rankings and how to perform an insertion parse, here are a few other things to think about.

Be Kind to Your Sources

Remember that search engines do not make money by displaying search results. The search-ranking webbot is a concept study and not a suggestion for a product that you should develop and place in a production environment, where the public uses it. Also, and this is important, you should not violate any search website's Terms of Service agreement when deploying a webbot like this one.

Search Sites May Treat Webbots Differently Than Browsers

Experience has taught me that some search sites serve pages differently if they think they're dealing with an automated web agent. If you leave the default setting for the agent name (in LIB_http) set to Test Webbot, your programs will definitely look like webbots instead of browsers.

Spidering Search Engines Is a Bad Idea

It is not a good idea to spider Google or any other search engine. I once heard (at a hacking conference) that Google limits individual IP addresses to 250 page requests a day, but I have not verified this. Others have told me that if you make the page requests too quickly, Google will stop replying after sending three result pages. Again, this is unverified, but it won't be an issue if you obey Google's Terms of Service agreement.

What I can verify is that I have, in other circumstances, written spiders for clients where websites did limit the number of daily page fetches from a particular IP address to 250. After the 251st fetch within a 24-hour period, the service ignored all subsequent requests coming from that IP address. For one such project, I put a spider on my laptop and ran it in every Wi-Fi-enabled coffee house I could find in South Minneapolis. This tactic involved drinking a lot of coffee, but it also produced a good number of unique IP addresses for my spider, and I was able to complete the job more quickly than if I had run the spider (in a limited capacity) over a period of many days in my office.

Despite Google's best attempts to thwart automated use of its search results, there are rumors indicating that MSN has been spidering Google to collect records for its own search engine.[38]

If you're interested in these issues, you should read Chapter 28, which describes how to respectfully treat your target websites.

Familiarize Yourself with the Google API

If you are interested in pursuing projects that use Google's data, you should investigate the Google developer API, a service (or Application Program Interface) which makes it easier for developers to use Google in noncommercial applications. At the time of this writing, Google provided information about its developer API at http://www.google.com/apis/index.html.


[38] Jason Dowdell, "Microsoft Crawling Google Results For New Search Engine?" November 11, 2004, WebProNews (http://www.webpronews.com/insiderreports/searchinsider/wpn-49-20041111MicrosoftCrawlingGoogleResultsForNewSearchEngine.html).


Further Exploration

Here are some other ways to leverage the techniques you learned in this chapter.

- Design another search-ranking webbot to examine the paid advertising listings instead of the organic listings.
- Write a similar webbot to run daily over a period of many days to measure how changing a web page's meta tags or content affects the page's search engine ranking.
- Design a webbot that examines web page rankings using a variety of search terms.
- Use the techniques explained in this chapter to examine how search rankings differ from search engine to search engine.


Chapter 12. AGGREGATION WEBBOTS

If you've ever researched topics online, you've no doubt found the need to open multiple web browsers, each loaded with a different resource. The practice of viewing more than one web page at once has become so common that all major browsers now support tabs that allow surfers to easily view multiple websites at once. Another approach to simultaneously viewing more than one website is to consolidate information with an aggregation webbot.

People are doing some pretty cool things with aggregation scripts these days. To whet your appetite for what's possible with an aggregation webbot, look at the web page found at http://www.housingmaps.com. This bot combines real estate listings from http://www.craigslist.org with Google Maps. The results are maps that plot the locations and descriptions of homes for sale, as shown in Figure 12-1.


Figure 12-1. craigslist real estate ads aggregated with Google Maps

Choosing Data Sources for Webbots

Aggregation webbots can use data from a variety of places; however, some data sources are better than others. For example, your webbots can parse information directly from web pages, as you did in Chapter 7, but this should never be your first choice. Since web page content is intermixed with page formatting and web pages are frequently updated, this method is prone to error. When available, a developer should always use a non-HTML version of the data, as the creators of HousingMaps did. The data shown in Figure 12-1 came from Google Maps' Application Program Interface (API)[39] and craigslist's Really Simple Syndication (RSS) feed.

Application Program Interfaces provide access to specific applications, like Google Maps, eBay, or Amazon.com. Since APIs are developed for specific applications, the features from one API will not work in another. Working with APIs tends to be complex and often has a steep learning curve. Their complexity, however, is mitigated by the vast array of services they provide. The details of using Google's API (or any other API for that matter) are outside of the scope of this book.

In contrast to APIs, RSS provides a standardized way to access data from a variety of sources, like craigslist. RSS feeds are simple to parse and are an ideal protocol for webbot developers because, unlike unparsed web pages or site-specific APIs, RSS feeds conform to a consistent protocol. This chapter's example project explores RSS in detail.

[39] See http://www.google.com/apis/maps.


Example Aggregation Webbot

The webbot described in this chapter combines news from multiple sources. While the scripts in this chapter only display the data, I'll conclude with suggestions for extending this project into a webbot that makes decisions and takes action based on the information it finds.

Familiarizing Yourself with RSS Feeds

While your webbot could aggregate information from any online source, this example will combine news feeds in the RSS format. RSS is a standard for making online content available for a variety of uses. Originally developed by Netscape in 1997, RSS quickly became a popular means to distribute news and other online content, including blogs. After AOL and Sun Microsystems divided up Netscape, the RSS Advisory Board took ownership of the RSS specification.[40]

Today, nearly every news service provides information in the form of RSS. RSS feeds are actually web pages that package online content in eXtensible Markup Language (XML) format. Unlike HTML, XML typically lacks formatting information and surrounds data with tags that make parsing very easy. Generally, RSS feeds provide links to web pages and just enough information to let you know whether a link is worth clicking, though feeds can also include complete articles.

The first part of an RSS feed contains a header that describes the RSS data to follow, as shown in Listing 12-1.

<title> RSS feed title</title>
<link> www.Link_to_web_page.com</link>
<description> Description of RSS feed</description>
<copyright> Copyright notice</copyright>
<lastBuildDate> Date of RSS publication</lastBuildDate>

Listing 12-1: The RSS feed header describes the content to follow

Not all RSS feeds start with the same set of tags, but Listing 12-1 is representative of the tags you're likely to find on most feeds. In addition to the tags shown, you may also find tags that specify the language used or define the locations of associated images.

Following the header is a collection of items that contains the content of the RSS feed, as shown in Listing 12-2.

<item>
    <title> Title of item </title>
    <link> URL of associated web page for item </link>
    <description> Description of item </description>
    <pubDate> Publication date of item </pubDate>
</item>
<item>
    Other items may follow, defined as above
</item>

Listing 12-2: Example of RSS item descriptions

Depending on the source, RSS feeds may also use industry-specific XML tags to describe item contents. The tags shown in Listing 12-2, however, are representative of what you should find in most RSS data.

Our project webbot takes three RSS feeds and consolidates them on a single web page, as shown in Figure 12-2.

Figure 12-2. The aggregation webbot

The webbot shown in Figure 12-2 summarizes news from three sources. It always shows current information because the webbot requests the current news from each source every time the web page is downloaded.

Writing the Aggregation Webbot

This webbot uses two scripts. The main script, shown in Listing 12-3, defines which RSS feeds to fetch and how to display them. Both scripts are available at this book's website. The PHP sections of this script appear in bold.

<?
# Include libraries
include("LIB_http.php");
include("LIB_parse.php");
include("LIB_rss.php");
?>
<head>
    <style>
        BODY {font-family:arial; color: black;}
    </style>
</head>
<table>
    <tr>
        <td valign="top" width="33%">
            <?
            $target = "http://www.nytimes.com/services/xml/rss/nyt/RealEstate.xml";
            $rss_array = download_parse_rss($target);
            display_rss_array($rss_array);
            ?>
        </td>
        <td valign="top" width="33%">
            <?
            $target = "http://www.startribune.com/rss/1557.xml";
            $rss_array = download_parse_rss($target);
            display_rss_array($rss_array);
            ?>
        </td>
        <td valign="top" width="33%">
            <?
            $target = "http://www.mercurynews.com/mld/mercurynews/news/breaking_news/rss.xml";
            $rss_array = download_parse_rss($target);
            display_rss_array($rss_array);
            ?>
        </td>
    </tr>
</table>

Listing 12-3: Main aggregation webbot script, describing RSS sources and display format

As you can tell from the script in Listing 12-3, most of the work is done in the LIB_rss library, which we will explore next.

Downloading and Parsing the Target

As the name implies, the function download_parse_rss() downloads the target RSS feed and parses the results into an array for later processing, as shown in Listing 12-4.

function download_parse_rss($target)
    {
    # Download the RSS page
    $news = http_get($target, "");

    # Parse title and copyright notice
    $rss_array['TITLE'] = return_between($news['FILE'], "<title>", "</title>", EXCL);
    $rss_array['COPYRIGHT'] = return_between($news['FILE'], "<copyright>", "</copyright>", EXCL);

    # Parse the items
    $item_array = parse_array($news['FILE'], "<item>", "</item>");
    for($xx=0; $xx<count($item_array); $xx++)
        {
        $rss_array['ITITLE'][$xx] = return_between($item_array[$xx], "<title>", "</title>", EXCL);
        $rss_array['ILINK'][$xx] = return_between($item_array[$xx], "<link>", "</link>", EXCL);
        $rss_array['IDESCRIPTION'][$xx] = return_between($item_array[$xx], "<description>", "</description>", EXCL);
        $rss_array['IPUBDATE'][$xx] = return_between($item_array[$xx], "<pubDate>", "</pubDate>", EXCL);
        }

    return $rss_array;
    }

Listing 12-4: Downloading the RSS feed and parsing data into an array

In addition to using the http_get() function in the LIB_http library, this script also employs the return_between() and parse_array() functions to ease the task of parsing the RSS data from the XML tags.

After downloading and parsing the RSS feed, the data is formatted and displayed with the function in Listing 12-5. (PHP script appears in bold.)

function display_rss_array($rss_array)
    {
    ?>
    <table border="0">
        <!-- Display the article title and copyright notice -->
        <tr>
            <td>
                <font size="+1">
                    <b><?echo strip_cdata_tags($rss_array['TITLE'])?></b>
                </font>
            </td>
        </tr>
        <tr><td><?echo strip_cdata_tags($rss_array['COPYRIGHT'])?></td></tr>

        <!-- Display the article descriptions and links -->
        <?for($xx=0; $xx<count($rss_array['ITITLE']); $xx++)
            {?>
            <tr>
                <td>
                    <a href="<?echo strip_cdata_tags($rss_array['ILINK'][$xx])?>">
                        <b><?echo strip_cdata_tags($rss_array['ITITLE'][$xx])?></b>
                    </a>
                </td>
            </tr>
            <tr>
                <td><?echo strip_cdata_tags($rss_array['IDESCRIPTION'][$xx])?></td>
            </tr>
            <tr>
                <td>
                    <font size="-1">
                        <?echo strip_cdata_tags($rss_array['IPUBDATE'][$xx])?>
                    </font>
                </td>
            </tr>
            <?}?>
    </table>
    <?}?>

Listing 12-5: Displaying the contents of $rss_array

Dealing with CDATA

It's worth noting that the function strip_cdata_tags() is used to remove CDATA tags from the RSS data feed. XML uses CDATA tags to identify text that may contain characters or combinations of characters that could confuse parsers. CDATA tells parsers that the data encased in CDATA tags should not be interpreted as XML tags. Listing 12-6 shows the format for using CDATA.

<![CDATA[ ...text goes here... ]]>

Listing 12-6: The CDATA format

Since XML parsers ignore everything enclosed in CDATA tags, the script needs to strip off the tags to make the data displayable in a browser.
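To make the stripping step concrete, here is a minimal sketch of a CDATA-stripping helper. The function name and implementation are illustrative only; the real strip_cdata_tags() ships with this book's LIB_rss library and may be written differently.

# Sketch only: remove the CDATA wrapper and return the enclosed text unchanged.
# The real strip_cdata_tags() in LIB_rss may differ from this version.
function strip_cdata_tags_sketch($text)
    {
    $text = str_replace("<![CDATA[", "", $text);
    $text = str_replace("]]>", "", $text);
    return trim($text);
    }

# Example: prints "Breaking news & updates"
echo strip_cdata_tags_sketch("<![CDATA[Breaking news & updates]]>");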

[40] See http://www.rssboard.org.


Adding Filtering to Your Aggregation Webbot

Your webbots can also modify or filter data received from RSS (or any other source). In this chapter's news aggregator, you could filter (i.e., not use) any stories that don't contain specific keywords or key phrases. For example, if you only want news stories that contain the words webbots, web spiders, and spiders, you could create a filter array like the one shown in Listing 12-7.

$filter_array[]="webbots";$filter_array[]="web spiders";$filter_array[]="spiders";

Listing 12-7: Creating a filter array

We can use $filter_array to select articles for viewing by modifying the download_parse_rss() function used in Listing 12-4. This modification is shown in Listing 12-8.

function download_parse_rss($target, $filter_array)
    {
    # Download the RSS page
    $news = http_get($target, "");

    # Parse title and copyright notice
    $rss_array['TITLE'] = return_between($news['FILE'], "<title>", "</title>", EXCL);
    $rss_array['COPYRIGHT'] = return_between($news['FILE'], "<copyright>", "</copyright>", EXCL);

    # Parse the items
    $item_array = parse_array($news['FILE'], "<item>", "</item>");
    for($xx=0; $xx<count($item_array); $xx++)
        {
        # Filter stories for relevance
        for($keyword=0; $keyword<count($filter_array); $keyword++)
            {
            if(stristr($item_array[$xx], $filter_array[$keyword]))
                {
                $rss_array['ITITLE'][$xx] = return_between($item_array[$xx], "<title>", "</title>", EXCL);
                $rss_array['ILINK'][$xx] = return_between($item_array[$xx], "<link>", "</link>", EXCL);
                $rss_array['IDESCRIPTION'][$xx] = return_between($item_array[$xx], "<description>", "</description>", EXCL);
                $rss_array['IPUBDATE'][$xx] = return_between($item_array[$xx], "<pubDate>", "</pubDate>", EXCL);
                }
            }
        }
    return $rss_array;
    }

Listing 12-8: Adding filtering to the download_parse_rss() function

Listing 12-8 is identical to Listing 12-4, with the following exceptions:

- The filter array is passed to download_parse_rss().
- Each news story is compared to every keyword.
- Only stories that contain a keyword are parsed and placed into $rss_array.

The end result of the script in Listing 12-8 is an aggregator that only lists stories that contain material with the keywords in $filter_array. As configured, the comparison of stories and keywords is not case sensitive. If case sensitivity is required, simply replace stristr() with strstr(). Remember, however, that the amount of data returned is directly tied to the number of keywords and the frequency with which they appear in stories.
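The difference is easy to demonstrate. In this illustrative snippet (the story text is made up), stristr() matches a keyword regardless of case, while strstr() does not:

$story = "New WEBBOTS conference announced";

var_dump((bool) stristr($story, "webbots"));   // bool(true)  -- case-insensitive match
var_dump((bool) strstr($story, "webbots"));    // bool(false) -- case-sensitive match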


Further Exploration

The true power of webbots is that they can make decisions and take action with the information they find online. Here are a few suggestions for extending what you've learned to do with RSS or other data you choose to aggregate with your webbots.

- Modify the script in Listing 12-8 to accept stories that don't contain a keyword.
- Write an aggregation webbot that doesn't display information unless it finds it on two or more sources.
- Design a webbot that looks for specific keywords in news stories and sends an email notification when those keywords appear.
- Search blogs for spelling errors.
- Find an RSS feed that posts scores from your favorite sports team. Parse and store the scores in a database for later statistical analysis.
- Write a webbot that uses news stories to help you decide whether to buy or sell commodities futures.


- Devise an online clipping service that archives information about your company.
- Create an RSS feed for the example store used in Chapter 7.


Chapter 13. FTP WEBBOTS

File transfer protocol (FTP) is among the oldest Internet protocols.[41] It dates from the Internet's predecessor ARPANET, which was originally funded by the Eisenhower administration.[42]

Research centers started using FTP to exchange large files in the early 1970s, and FTP became the de facto transport protocol for email, a status it maintained until the early 1990s. Today, system administrators most commonly use FTP to allow web developers to upload and maintain files on remote webservers. Though it's an older protocol, FTP still allows computers with dissimilar technologies to share files, independent of file structure and operating system.

Example FTP Webbot

To gain insight into the uses of an FTP-capable webbot, consider this scenario. A national retailer needs to move large sales reports from each of its stores to a centralized corporate webserver. This particular retail chain was built through acquisition, so it uses multiple protocols and proprietary computer systems. The one thing all of these systems have in common is access to an FTP server. The goal for this project is to use FTP protocols to download store sales reports and move them to the corporate server.

The script for this example project is available for study at this book's website. Just remember that the script satisfies a fictitious scenario and will not run unless you change the configuration. In this chapter, I have split it up and annotated the sections for clarity. Listing 13-1 shows the initialization for the FTP servers.

<?
// Define the source FTP server, file location, and authentication values
define("REMOTE_FTP_SERVER", "remote_FTP_address");  // Domain name or IP address
define("REMOTE_USERNAME", "yourUserName");
define("REMOTE_PASSWORD", "yourPassword");
define("REMOTE_DIRECTORY", "daily_sales");
define("REMOTE_FILE", "sales.txt");

// Define the corporate FTP server, file location, and authentication values
define("CORP_FTP_SERVER", "corp_FTP_address");
define("CORP_USERNAME", "yourUserName");
define("CORP_PASSWORD", "yourPassword");
define("CORP_DIRECTORY", "sales_reports");
define("CORP_FILE", "store_03_".date("Y-M-d"));


Listing 13-1: Initializing the FTP bot

This program also configures a routine to send a short email notification when commands fail. Automated email error notification allows the script to run autonomously without requiring that someone verify the operation manually.[43] Listing 13-2 shows the email configuration script.

include("LIB_MAIL.php");$mail_addr['to'] = "[email protected]";$mail_addr['from'] = "[email protected]";function report_error_and_quit($error_message, $server_connection) { global $mail_addr;

// Send error message echo "$error_message, $server_connection"; formatted_mail($error_message, $error_message, $mail_addr, "text/plain");

// Attempt to log off the server gracefully if possible ftp_close($server_connection);

// It is not traditional to end a function this way, but since there is // nothing to return or do, it is best to exit exit(); }


Listing 13-2: Email configuration

The next step is to make a connection to the remote FTP server. After making the connection, the script authenticates itself with its username and password, as shown in Listing 13-3.

// Negotiate a socket connection to the remote FTP server
$remote_connection_id = ftp_connect(REMOTE_FTP_SERVER);

// Log in (authenticate) to the source server
if(!ftp_login($remote_connection_id, REMOTE_USERNAME, REMOTE_PASSWORD))
    report_error_and_quit("Remote ftp_login failed", $remote_connection_id);

Listing 13-3: Connecting and authenticating with the remote server

Once authenticated by the server, the script moves to the target file's directory and downloads the file to the local filesystem. After downloading the file, the script closes the connection to the remote server, as shown in Listing 13-4.

// Move to the directory of the source file
if(!ftp_chdir($remote_connection_id, REMOTE_DIRECTORY))
    report_error_and_quit("Remote ftp_chdir failed", $remote_connection_id);


// Download the file
if(!ftp_get($remote_connection_id, "temp_file", REMOTE_FILE, FTP_ASCII))
    report_error_and_quit("Remote ftp_get failed", $remote_connection_id);

// Close connections to the remote FTP server
ftp_close($remote_connection_id);

Listing 13-4: Downloading the file and closing the connection

The final task, shown in Listing 13-5, uploads the file to the corporate server using techniques similar to the ones used to download the file.

// Negotiate a socket connection to the corporate FTP server
$corp_connection_id = ftp_connect(CORP_FTP_SERVER);

// Log in to the corporate server
if(!ftp_login($corp_connection_id, CORP_USERNAME, CORP_PASSWORD))
    report_error_and_quit("Corporate ftp_login failed", $corp_connection_id);

// Move to the destination directory
if(!ftp_chdir($corp_connection_id, CORP_DIRECTORY))
    report_error_and_quit("Corporate ftp_chdir failed", $corp_connection_id);

// Upload the file
if(!ftp_put($corp_connection_id, CORP_FILE, "temp_file", FTP_ASCII))
    report_error_and_quit("Corporate ftp_put failed", $corp_connection_id);

// Close connections to the corporate FTP server
ftp_close($corp_connection_id);

// Send notification that the webbot ran successfully
formatted_mail("ftpbot ran successfully at ".date("M d, Y h:s"), "", $mail_addr, "text/plain");
?>

Listing 13-5: Logging in and uploading the previously downloaded file to the corporate server

[41] The original document defining FTP can be viewed at http://www.w3.org/Protocols/rfc959.
[42] Katie Hafner and Matthew Lyon, Where Wizards Stay Up Late: The Origins of the Internet (New York: Simon & Schuster, 1996), 14.
[43] See Chapter 23 for information on how to make webbots run periodically.


PHP and FTP

PHP provides built-in functions that closely resemble standard FTP commands. In addition to transferring files, PHP allows your scripts to perform many administrative functions. Table 13-1 lists the most useful FTP commands supported by PHP.

Table 13-1. Common FTP Commands Supported by PHP

FTP Function (where $ftp is the FTP file stream)      Usage

ftp_cdup($ftp)                                        Makes the parent directory the current directory
ftp_chdir($ftp, "directory/path")                     Changes the current directory
ftp_delete($ftp, "file_name")                         Deletes a file
ftp_get($ftp, "local file", "remote file", MODE)      Copies the remote file to the local file, where MODE indicates if the remote file is FTP_ASCII or FTP_BINARY
ftp_mkdir($ftp, "directory name")                     Creates a new directory
ftp_rename($ftp, "old name", "new name")              Renames a file or a directory on the FTP server
ftp_put($ftp, "remote file", "local file", MODE)      Copies the local file to the remote file, where MODE indicates if the local file is FTP_ASCII or FTP_BINARY
ftp_rmdir($ftp, "directory/path")                     Removes a directory
ftp_rawlist($ftp, "directory/path")                   Returns an array with each array element containing directory information about a file

As shown in Table 13-1, the PHP FTP commands allow you to write webbots that create, delete, and rename directories and files. You may also use PHP/CURL to perform advanced FTP tasks requiring advanced authentication or encryption. Since FTP seldom uses these features, they are out of the scope of this book, but they're available for you to explore on the official PHP website available at http://www.php.net.
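As a rough illustration of how a few of these functions fit together, the following sketch connects to a placeholder server, lists a directory with ftp_rawlist(), and creates a dated archive directory with ftp_mkdir(). The server address, credentials, and directory names are assumptions, not part of the chapter's example project.

<?
// Sketch only: placeholder server, credentials, and directory names
$ftp = ftp_connect("ftp.example.com");
if(!$ftp || !@ftp_login($ftp, "yourUserName", "yourPassword"))
    die("FTP connection or login failed\n");

// Show what is in the reports directory
$listing = ftp_rawlist($ftp, "reports");
if(is_array($listing))
    foreach($listing as $line)
        echo $line."\n";

// Create a dated archive directory, e.g. "archive_2009-01-15"
ftp_mkdir($ftp, "archive_".date("Y-m-d"));

ftp_close($ftp);
?>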


Further Exploration

Since FTP is often the only application-level protocol that computer systems share, it is a convenient communication bridge between new and old computer systems. Moreover, in addition to using FTP as a common path between disparate (or obsolete) systems, FTP is still the most common method for uploading files to websites. With the information in this chapter, you should be able to write webbots that update websites with information found at a variety of sources. Here are some ideas to get you started.

- Write a webbot that updates your corporate server with information gathered from sales reports.
- Develop a security webbot that uses a webcam to take pictures of your warehouse or parking lot, timestamps the images, and uploads the pictures to an archival server.
- Design a webbot that creates archives of your company's internal forums on an FTP server.
- Create a webbot that photographically logs the progress of a construction site and uploads these pictures to an FTP server. Once construction is complete, compile the individual photos into an animation showing the construction process.

If you don't have access to an FTP server on the Internet, you can still experiment with FTP bots. An FTP server is probably already on your computer if your operating system is Unix, Linux, or Mac OS X. If you have a Windows computer, you can find free FTP servers on many shareware sites. Once you locate FTP server software, you can set up your own local server by following the instructions accompanying your FTP installation.
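Once a local server is running, a quick connectivity check like the following can confirm that your test environment works. The loopback address, port, and test account are assumptions; use whatever account you created during installation.

<?
// Sketch: verify that a locally installed FTP server accepts connections
$ftp = ftp_connect("127.0.0.1", 21, 10);            // Host, port, 10-second timeout
if($ftp && @ftp_login($ftp, "test_user", "test_password"))
    echo "Local FTP server is up; current directory: ".ftp_pwd($ftp)."\n";
else
    echo "Could not reach or log in to the local FTP server\n";
if($ftp)
    ftp_close($ftp);
?>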


Chapter 14. NNTP NEWS WEBBOTS

Another non-web protocol your webbots can use is the Network News Transfer Protocol (NNTP). Before modern applications like MySpace, Facebook, and topic-specific web forums, NNTP was used to build online communities where people with common interests exchanged information in newsgroups. Members of newsgroups contribute articles: announcements, questions, or answers relating to one of thousands of subject-specific topics. Collectively, these articles are referred to as news. While NNTP is an older Internet protocol, it is still in wide use today, and it provides a valuable source of information for certain webbot projects. I've recently found NNTP useful when working on projects for private investigators, the hospitality industry, and financial institutions.

NNTP Use and History

NNTP originated in 1986[44] and was designed for a network much different from the one we use today. When NNTP was conceived, broadband and always-on access to networks were virtually unheard of. To utilize the network as it existed, NNTP employed a non-centralized server configuration, similar to what email uses. Users logged in to one of the many news servers on the network, where they read articles, posted new articles, and replied to old ones. Behind the scenes, NNTP servers periodically synchronized to distribute updated news to all servers hosting specific newsgroups. Today, NNTP servers exchange news so frequently that newly submitted articles appear on news servers across the world almost immediately. In 1986, however, news servers often waited until the early morning hours to synchronize, when phone (modem) calls to the network were cheapest. If the newsgroup process seems odd by today's standards, remember that NNTP was optimized for use when networks were slower and more expensive.

While HTTP has superseded many older protocols (like Gopher[45]), newsgroups have survived and are still widely used today. Most modern communication applications like Microsoft Outlook and Mozilla Thunderbird include news clients in their basic configurations (see Figure 14-1).


Figure 14-1. A newsgroup as viewed in Mozilla Thunderbird, a typical news reader

While the number of active newsgroups is declining, there are still tens of thousands of newsgroups in use today. The news server I use (hosted by RoadRunner) subscribes to 26,365 newsgroups. Since the variety of topics covered by newsgroups is so diverse (ranging from alt.alien.visitors to alt.www.software.spiders.programming), you're apt to find one that interests you. Newsgroups are a fun source of homegrown information; however, like many sources on the Internet, you need to take what you read with a grain of salt. Newsgroups allow anyone to make anonymous contributions, and themes like conspiracy, spam, and self-promotion all thrive under those conditions.

[44] RFC 977 defines the original NNTP specification (http://www.ietf.org/rfc/rfc977.txt).
[45] Gopher was a predecessor to the World Wide Web, developed at the University of Minnesota (http://www.ietf.org/rfc/rfc1436.txt).


Webbots and Newsgroups

Newsgroups are a rich source of content for webbot developers. While less convenient than websites, news servers are not hard to access, especially when you have a set of functions that do most of the work for you. All of this chapter's example scripts use the LIB_nntp library. Functions in this library provide easy access to articles on news servers and create many opportunities for webbots. LIB_nntp contains functions that list newsgroups hosted by specific news servers, list available articles within newsgroups, and download particular articles. As with all libraries used in this book, the latest version of LIB_nntp is available for download at the book's website.

Identifying News Servers

Before you use NNTP, you'll need to find an accessible news server. A Google search for free news servers will provide links to some, but keep in mind that not all news servers are equal. Since few news servers host all newsgroups, not every news server will have the group you're looking for. Many free news servers also limit the number of requests you can make in a day or suffer from poor performance. For these reasons, many people prefer to pay for access to reliable news servers. You might already have access to a premium news server through your ISP. Be warned, however, that some ISPs' news servers (like those hosted by RoadRunner and EarthLink) will not allow access if you are not directly connected to a subnet in their network.

Identifying Newsgroups

Your news bots should always verify that the group you want to access is hosted by your news server. The script in Listing 14-1 uses get_nntp_groups() to create an array containing all the newsgroups on a particular news server. (Remember to put the name of your news server in place of your.news.server below.) Putting the newsgroups in an array is handy, since it allows a webbot to examine groups iteratively.

include("LIB_nntp.php");$server = "your.news.server";$group_array= get_nntp_groups($server);var_dump($group_array);

Listing 14-1: Requesting (and viewing) the newsgroups available on a news server


The result of executing Listing 14-1 is shown in Figure 14-2.

Figure 14-2. Newsgroups hosted on a news server

Notice that Figure 14-2 only shows the newsgroups that hadn't already scrolled off the screen. In this example, my news server returned 46,626 groups. (It also required 40 seconds to download them all, so expect a short delay when requesting large amounts of data.) For each group, the server responds with the name of the group, the identifier of the first article, the identifier of the last article, and a y if you can post articles to this group or an n if posting articles to this group (on this server) is prohibited.

News servers terminate messages by sending a line that contains just a period (.), which you can see in the last array element in Figure 14-2. That lone period is the only sign your webbot will receive to tell it to stop looking for data. If your webbot reads buffers incorrectly, it will either hang indefinitely or return with incomplete data. The small function shown in Listing 14-2 (found in LIB_nntp) correctly reads data from an open NNTP network socket and recognizes the end-of-message indicator.

function read_nntp_buffer($socket)
    {
    $this_line = "";
    $buffer = "";
    while($this_line != ".\r\n")             // Read until lone . found on line
        {
        $this_line = fgets($socket);         // Read line from socket
        $buffer = $buffer . $this_line;
        }
    return $buffer;
    }

Listing 14-2: Reading NNTP data and identifying the end of messages

The script in Listing 14-1 uses the function get_nntp_groups() to get an array of available groups hosted by your news server. The script for that function is shown below in Listing 14-3.

function get_nntp_groups($server)
    {
    # Open socket connection to the news server
    $fp = fsockopen($server, $port="119", $errno, $errstr, 30);
    if (!$fp)
        {
        # If socket error, issue error
        $return_array['ERROR'] = "ERROR: $errstr ($errno)";
        }
    else
        {
        # Else tell server to return a list of hosted newsgroups
        $out = "LIST\r\n";
        fputs($fp, $out);
        $groups = read_nntp_buffer($fp);
        $groups_array = explode("\r\n", $groups);  // Convert to an array
        }
    fputs($fp, "QUIT \r\n");  // Log out
    fclose($fp);              // Close socket
    return $groups_array;
    }

Listing 14-3: A function that finds available newsgroups on a news server

As you'll learn, all NNTP commands follow a structure similar to the one used in Listing 14-3. Most NNTP commands require that you do the following:

1. Connect to the server (on port 119)
2. Issue a command, like LIST (followed by a carriage return/line feed)
3. Read the results (until encountering a line with a lone period)
4. End the session with a QUIT command
5. Close the network socket

Other NNTP commands that identify groups hosted by news servers are listed in RFC 977. You can use the basic structure of get_nntp_groups() as a guide to creating other functions that execute NNTP commands found in RFC 977.
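Because every command in the list above follows the same connect, command, read, QUIT pattern, you can wrap that pattern in a small helper. The sketch below is not part of LIB_nntp (the function name is mine) and assumes read_nntp_buffer() from Listing 14-2 is available; it is shown with the HELP command, which RFC 977 defines as returning a multi-line response.

function nntp_command_sketch($server, $command)
    {
    # Sketch only: issue one multi-line NNTP command and return its response
    $socket = fsockopen($server, 119, $errno, $errstr, 30);
    if(!$socket)
        return "ERROR: $errstr ($errno)";

    fgets($socket);                          # Discard the server's greeting line
    fputs($socket, $command."\r\n");         # Issue the command
    $response = read_nntp_buffer($socket);   # Read until the lone period
    fputs($socket, "QUIT\r\n");              # End the session
    fclose($socket);                         # Close the network socket
    return $response;
    }

# Example: ask the server which commands it supports
echo nntp_command_sketch("your.news.server", "HELP");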

Finding Articles in Newsgroups

As you read earlier, newsgroup articles are distributed among each of the news servers hosting a particular newsgroup and are physically located at each server hosting the newsgroup. Each article has a sequential numeric identifier that identifies the article on a particular news server. You may request the range of numeric identifiers for articles (for a given newsgroup) with a script similar to the one in Listing 14-4.

include("LIB_nntp.php");# Request article IDs$server = "your.news.server";$newsgroup = "alt.vacation.las-vegas";$ids_array = get_nntp_article_ids($server, $newsgroup);

# Report Resultsecho "\nInfo about articles in $newsgroup on $server\n";echo "Code: ". $ids_array['RESPONSE_CODE']."\n";echo "Estimated # of articles: ". $ids_array['EST_QTY_ARTICLES']."\n";echo "First article ID: ". $ids_array['FIRST_ARTICLE']."\n";echo "Last article ID: ". $ids_array['LAST_ARTICLE']."\n";

Listing 14-4: Requesting article IDs from a news server

The result of running the script in Listing 14-4 is shown in Figure 14-3.


Figure 14-3. Executing get_nntp_article_ids() and displaying the results

This function returns data in an array, with elements containing a status code,[46] the estimated quantity of articles for that group on the server, the identifier of the first article in the newsgroup, and the identifier of the last article in the newsgroup. An estimate of the number of articles is provided because some articles are deleted after submission, so not every article within the given range is actually available. It's also worth noting that each server will have its own rules for when articles become obsolete, so each server will have a different number of articles for any one newsgroup. The code that actually reads the article identifiers from the server is shown in Listing 14-5.

function get_nntp_article_ids($server, $newsgroup)
    {
    # Open socket connection to the news server
    $socket = fsockopen($server, $port="119", $errno, $errstr, 30);
    if (!$socket)
        {
        # If socket error, issue error
        $return_array['ERROR'] = "ERROR: $errstr ($errno)";
        }
    else
        {
        # Else tell server which group to connect to
        fputs($socket, "GROUP ".$newsgroup." \r\n");
        $return_array['GROUP_MESSAGE'] = trim(fread($socket, 2000));

        # Get the range of available articles for this group
        fputs($socket, "NEXT \r\n");
        $res = fread($socket, 2000);
        $array = explode(" ", $res);

        $return_array['RESPONSE_CODE'] = $array[0];
        $return_array['EST_QTY_ARTICLES'] = $array[1];
        $return_array['FIRST_ARTICLE'] = $array[2];
        $return_array['LAST_ARTICLE'] = $array[3];
        }
    fputs($socket, "QUIT \r\n");
    fclose($socket);
    return $return_array;
    }

Listing 14-5: The function get_nntp_article_ids()

Reading an Article from a Newsgroup

Once you know the range of valid article identifiers for your newsgroup (on your news server), you can request an individual article. For example, the script in Listing 14-6 reads article number 562340 from the group alt.vacation.las-vegas.

include("LIB_nntp.php");$server = "your.news.server";$newsgroup = "alt.vacation.las-vegas";$article = read_nntp_article($server, $newsgroup, $article=562340);echo $article['HEAD'];echo $article['ARTICLE'];

Listing 14-6: Reading and displaying an article from a news server

When you execute the code in Listing 14-6, you'll see a screen similar to the one in Figure 14-4. On my news server, article 562340 is the same article displayed in the screenshot of the Thunderbird news reader, shown earlier in Figure 14-1.[47]

Figure 14-4. Reading a newsgroup article

The first part of Figure 14-4 shows the NNTP header, which, like a mail or HTTP header, returns status information about the article. Following the header is the article. Notice that in the header and at the beginning of the article, it is also referred to as <[email protected]>. Unlike the server-dependent identifier used in the previous function call, this longer identifier is universal and references this article on any news server that hosts this newsgroup.

The function called to read the news article is shown in Listing 14-7.

function read_nntp_article($server, $newsgroup, $article)
    {
    # Open socket connection to the news server
    $socket = fsockopen($server, $port="119", $errno, $errstr, 30);

    if (!$socket)
        {
        # If socket error, issue error
        $return_array['ERROR'] = "ERROR: $errstr ($errno)";
        }
    else
        {
        # Else tell server which group to connect to
        fputs($socket, "GROUP ".$newsgroup." \r\n");

        # Request this article's HEAD
        fputs($socket, "HEAD $article \r\n");
        $return_array['HEAD'] = read_nntp_buffer($socket);

        # Request the article
        fputs($socket, "BODY $article \r\n");
        $return_array['ARTICLE'] = read_nntp_buffer($socket);
        }
    fputs($socket, "QUIT \r\n");   // Sign out (newsgroup server)
    fclose($socket);               // Close socket
    return $return_array;          // Return data array
    }

Listing 14-7: A function that reads a newsgroup article

As mentioned earlier, NNTP was designed for use on older (slower) networks. For this reason, the article headers are available separately from the actual articles. This allowed news readers to download article headers first, to show users which articles were available on their news servers. If an article interested the viewer, that article alone was downloaded, consuming minimum bandwidth.
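A webbot can exploit this header/body split the same way early news readers did. The following sketch (the function name is mine, not part of LIB_nntp) requests only the HEAD of each article in a small ID range and pulls out the Subject lines, so you can decide which articles are worth downloading in full. It assumes read_nntp_buffer() from Listing 14-2 and return_between() from LIB_parse.

function list_subjects_sketch($server, $newsgroup, $first_id, $last_id)
    {
    # Sketch only: collect Subject lines for a range of article IDs
    $subjects = array();
    $socket = fsockopen($server, 119, $errno, $errstr, 30);
    if(!$socket)
        return $subjects;

    fgets($socket);                              # Discard the server greeting
    fputs($socket, "GROUP ".$newsgroup."\r\n");
    fgets($socket);                              # Discard the GROUP response

    for($id=$first_id; $id<=$last_id; $id++)
        {
        fputs($socket, "HEAD $id\r\n");
        $status = fgets($socket);                # "221 ..." means the head follows
        if(substr($status, 0, 3) == "221")
            {
            $head = read_nntp_buffer($socket);   # Read headers until the lone period
            if(stristr($head, "Subject:"))
                $subjects[$id] = trim(return_between($head, "Subject:", "\r\n", EXCL));
            }
        }

    fputs($socket, "QUIT\r\n");
    fclose($socket);
    return $subjects;
    }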


[46] There is a full list of NNTP status codes in Appendix B.
[47] Remember that article IDs are unique to newsgroups on each specific news server. Your article IDs are apt to be different.


Further Exploration

Now that you know how to use webbots to interface with newsgroups, here is a list of ideas you can use to develop news bots for your own purposes.

- Develop a newsgroup clipping service. This service could monitor numerous newsgroups for mention of specific keywords and either aggregate that information in a database or send email alerts when a keyword appears in a newsgroup.
- Build a web-based newsgroup portal, similar to http://groups.google.com.
- Create a webbot that gathers weather forecasts for Las Vegas from the National Weather Service website, and post this weather information for vacationers on alt.vacation.las-vegas.[48]
- Monitor newsgroups for unauthorized use of intellectual property.
- Create a database that archives a newsgroup.
- Write a web-based newsgroup client that allows users to read newsgroups anonymously.

[48] Due to the ridiculous amounts of spam on newsgroups, scripts for posting articles on newsgroups were deliberately omitted from this chapter. However, between the scripts used as examples in this chapter and the original NNTP RFC, you should be able to figure out how to post articles to newsgroups on your own.


Chapter 15. WEBBOTS THAT READ EMAIL

When a webbot can read email, it's easier for it to communicate with the outside world.[49] Webbots capable of reading email can take instruction via email commands, share data with handheld devices like BlackBerries and Palm PDAs, and filter messages for content.

For example, if package-tracking information is sent to an email account that a webbot can access, the webbot can parse incoming email from the carrier to track delivery status. Such a webbot could also send email warnings when shipments are late, communicate shipping charges to your corporate accounting software, or create reports that analyze a company's use of overnight shipping.

The POP3 Protocol

Of the many protocols for reading email from mail servers, I selected Post Office Protocol 3 (POP3) for this task because of its simplicity and near-universal support among mail servers. POP3 instructions are also easy to perform in any Telnet or standard TCP/IP terminal program.[50] The ability to use Telnet to execute POP3 commands will provide an understanding of POP3 commands, which we will later convert into PHP routines that any webbot may execute.

Logging into a POP3 Mail Server

Listing 15-1 shows how to connect to a POP3 mail server through a Telnet client. Simply enter telnet, followed by the mail server name and the port number (which is always 110 for POP3). The mail server should reply with a message similar to the one in Listing 15-1.

telnet mail.server.net 110
+OK <[email protected]>

Listing 15-1: Making a Telnet connection to a POP3 mail server

The reply shown in Listing 15-1 says that you've made a connection to the POP3 mail server and that it is waiting for its next command, which should be your attempt to log in. Listing 15-2 shows the process for logging in to a POP3 mail server.

user [email protected]
+OK
pass xxxxxxxx
+OK

Listing 15-2: Successful authentication to a POP3 mail server

When you try this, be sure to substitute your email account in place of [email protected] and the password associated with your account for xxxxxxxx.

If authentication fails, the mail server should return an authentication failure message, as shown in Listing 15-3.

-ERR authorization failed

Listing 15-3: POP3 authentication failure

Reading Mail from a POP3 Mail Server

Before you can download email messages from a POP3 mail server, you'll need to execute a LIST command. The mail server will then respond with the number of messages on the server.

The POP3 LIST Command

The LIST command will also reveal the size of the email messages and, more importantly, how to reference individual email messages on the server. The response to the LIST command contains a line for every available message for the specified account. Each line consists of a sequential mail ID number, followed by the size of the message in bytes. Listing 15-4 shows the results of a LIST command on an account with two pieces of email.

LIST
+OK
1 2398
2 2023
.

Listing 15-4: Results of a POP3 LIST command

The server's reply to the LIST command tells us that there are two messages on the server for the specified account. We can also tell that message 1 is the larger message, at 2,398 bytes, and that message 2 is 2,023 bytes in length. Beyond that, we don't know anything specific about any of these messages.

The last line in the response is the end of message indicator. Servers always terminate multi-line POP3 responses with a line containing only a period.
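If you want to read these multi-line replies over a raw socket yourself, the same read-until-lone-period technique used for NNTP in Chapter 14 applies. The sketch below is illustrative only; the function name is mine, and LIB_pop3's internal implementation may differ.

function read_pop3_multiline_sketch($socket)
    {
    # Collect lines until the lone-period terminator described above
    $buffer = "";
    while(($line = fgets($socket)) !== false)
        {
        if($line == ".\r\n")       # Lone period: end of the multi-line response
            break;
        $buffer .= $line;
        }
    return $buffer;
    }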

The POP3 RETR Command

To read a specific message, enter RETR followed by a space and the mail ID received from the LIST command. The command in Listing 15-5 requests message 1.

RETR 1

Listing 15-5: Requesting a message from the server

The mail server should respond to the RETR command with a string of characters resembling the contents of Listing 15-6.

+OK 2398 octets
Return-Path: <[email protected]>
Delivered-To: [email protected]
Received: (qmail 73301 invoked from network); 19 Feb 2006 20:55:31 -0000
Received: from mail2.server.net by mail1.server.net (qmail-ldap-1.03) with
  compressed QMQP; 19 Feb 2006 20:55:31 -0000
Delivered-To: CLUSTERHOST mail2.server.net [email protected]
Received: (qmail 50923 invoked from network); 19 Feb 2006 20:55:31 -0000
Received: by simscan 1.1.0 ppid: 50907, pid: 50912, t: 2.8647s
  scanners: attach: 1.1.0 clamav: 0.86.1/m:34/d:1107 spam: 3.0.4
Received: from web30515.mail.mud.server.com (envelope-sender <[email protected]>)
  by mail2.server.net (qmail-ldap-1.03) with SMTP for <[email protected]>;
  19 Feb 2006 20:55:28 -0000
Received: (qmail 7734 invoked by uid 60001); 19 Feb 2006 20:55:26 -0000
Message-ID: <[email protected]>
Date: Sun, 19 Feb 2006 12:55:26 -0800 (PST)
From: mike schrenk <[email protected]>
Subject: Hey, Can you read this email?
To: mike schrenk <[email protected]>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0-349883719-1140382526=:7581"
Content-Transfer-Encoding: 8bit
X-Spam-Checker-Version: SpamAssassin 3.0.4 (2005-06-05) on mail2.server.com
X-Spam-Level:
X-Spam-Status: No, score=0.9 required=17.0 tests=HTML_00_10,HTML_MESSAGE,
  HTML_SHORT_LENGTH autolearn=no version=3.0.4

--0-349883719-1140382526=:7581
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

This is an email sent from my Yahoo! email account.
--0-349883719-1140382526=:7581
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 8bit

This is an email sent from my Yahoo! email account.<br><BR><BR
--0-349883719-1140382526=:7581--
.

Listing 15-6: A raw email message read from the server using the RETR POP3 command

As you can see, even a short email message has a lot of overhead. Most of the returned information has little to do with the actual text of a message. For example, the email message retrieved in Listing 15-6 doesn't appear until over halfway down the listing. The rest of the text returned by the mail server consists of headers, which tell the mail client the path the message took, which services touched it (like SpamAssassin), how to display or handle the message, to whom to send replies, and so forth.

These headers include some familiar information such as the subject header, the to and from values, and the MIME version. You can easily parse this information with the return_between() function found in the LIB_parse library (see Chapter 4), as shown in Listing 15-7.

$ret_path   = return_between($raw_message, "Return-Path: ", "\n", EXCL);
$deliver_to = return_between($raw_message, "Delivered-To: ", "\n", EXCL);
$date       = return_between($raw_message, "Date: ", "\n", EXCL);
$from       = return_between($raw_message, "From: ", "\n", EXCL);
$subject    = return_between($raw_message, "Subject: ", "\n", EXCL);

Listing 15-7: Parsing header values

The header values in Listing 15-7 are separated by their names and a \n (newline) character. Note that the header name must be followed by a colon (:) and a space, as these words may appear elsewhere in the raw message returned from the mail server.

Parsing the actual message is more involved, as shown in Listing 15-8.

$content_type = return_between($raw_message, "Content-Type: ", "\n", EXCL);
$boundary = get_attribute($content_type, "boundary");
$raw_msg = return_between($raw_message, "--".$boundary, "--".$boundary, EXCL);
$msg_separator = chr(13).chr(10).chr(13).chr(10);
$clean_msg = return_between($raw_msg, $msg_separator, $msg_separator, EXCL);

Listing 15-8: Parsing the actual message from a raw POP3 response

When parsing the message, you must first identify the Content-Type, which holds the boundaries describing where the message is found. The Content-Type is further parsed with the get_attribute() function, to obtain the actual boundary value.[51] Finally, the text defined within the boundaries may contain additional information that tells the client how to display the content of the message. This information, if it exists, is removed by parsing only what's within the message separator, a combination of carriage returns and line feeds.

Other Useful POP3 Commands

The DELE command (followed by the mail ID) marks a message for deletion, and the QUIT command ends the session. Listing 15-9 shows demonstrations of both the DELE and QUIT commands.

DELE 8
+OK
QUIT
+OK

Listing 15-9: Using the POP3 DELE and QUIT commands

When you use DELE, the deleted message is only marked for deletion and not actually deleted. The deletion doesn't occur until you execute a QUIT command and your server session ends.

Note

If you've accidentally marked a message with the DELE function and wish to retain it when you quit, enter RSET before quitting. RSET removes the deletion marks from all messages, so nothing will be deleted when you issue the QUIT command (retention is the default condition).
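The following raw-socket sketch walks through the DELE and RSET behavior just described: it marks message 8 for deletion, reverses that decision with RSET, and then quits without deleting anything. The server name and credentials are placeholders.

<?
// Sketch only: placeholder mail server and credentials
$socket = fsockopen("mail.server.net", 110, $errno, $errstr, 30);
if(!$socket)
    die("Connection failed: $errstr ($errno)\n");

echo fgets($socket);                            // Server greeting
fputs($socket, "user [email protected]\r\n");   echo fgets($socket);
fputs($socket, "pass xxxxxxxx\r\n");            echo fgets($socket);
fputs($socket, "DELE 8\r\n");                   echo fgets($socket);   // Message 8 marked for deletion
fputs($socket, "RSET\r\n");                     echo fgets($socket);   // Deletion mark removed
fputs($socket, "QUIT\r\n");                     echo fgets($socket);   // Session ends; nothing deleted
fclose($socket);
?>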

[49] See Chapter 16 to learn how to send email with webbots and spiders.
[50] Telnet clients are standard on all Windows, Mac OS X, Linux, and Unix distributions.
[51] The actual boundary, which defines the message, is prefixed with -- characters to distinguish the actual boundary from where it is defined.


Executing POP3 Commands with a Webbot

POP3 commands can be performed with PHP's fsockopen(), fputs(), and fgets() functions. The LIB_pop3 library is available for you to download from this book's website. This library contains functions for connecting to the mail server, authenticating your account on the server, finding out what mail is available for the account, requesting messages from the server, and deleting messages.

The scripts in Listings 15-10 through 15-13 show how to use the LIB_pop3 library. The larger script is split up and annotated here for clarity, but it is available in its entirety on this book's website.

Note

Before you use the script in Listing 15-10, replace the values for SERVER, USER, and PASS with your email account information.

include("LIB_pop3.php"); // Include POP3 command library

Page 314: Webbots, Spiders, And Screen Scrapers - Michael Schrenk

define("SERVER", "your.mailserver.net"); // Your POP3 mailserverdefine("USER", "[email protected] "); // Your POP3 email addressdefine("PASS", "your_password"); // Your POP3 password

Listing 15-10: Including the LIB_pop3 library and initializing credentials

In Listing 15-11, the script makes the connection to the server and, after a successful login attempt, obtains a connection array containing the "handle" that is required for all subsequent communication with the server.

# Connect to POP3 server
$connection_array = POP3_connect(SERVER, USER, PASS);
$POP3_connection = $connection_array['handle'];
if($POP3_connection)
    {
    // Create an array, which is the result of a POP3 LIST command
    $list_array = POP3_list($POP3_connection);

Listing 15-11: Connecting to the server and making an array of available messages

The script in Listing 15-12 uses the $list_array obtained in the previous step to create requests for each email message. It displays each message along with its ID and size and then deletes the message, as shown here.

    # Request and display all messages in $list_array
    for($xx=0; $xx<count($list_array); $xx++)
        {
        // Parse the mail ID from the message size
        list($mail_id, $size) = explode(" ", $list_array[$xx]);

        // Request the message for the specific mail ID
        $message = POP3_retr($POP3_connection, $mail_id);

        // Display message and place mail ID, size, and message in an array
        echo "$mail_id, $size\n";
        $mail_array[$xx]['ID'] = $mail_id;
        $mail_array[$xx]['SIZE'] = $size;
        $mail_array[$xx]['MESSAGE'] = $message;

        // Display message in <xmp></xmp> tags to disable HTML
        // (in case script is run in a browser)
        echo "<xmp>$message</xmp>";

        // Delete the message from the server
        POP3_delete($POP3_connection, $mail_id);
        }

Listing 15-12: Reading, displaying, and deleting each message found on the server

Finally, after each message is read and deleted from the server, the session is closed, as shown in Listing 15-13.

    // End the server session
    echo POP3_quit($POP3_connection);
    }
else
    {
    echo "Login error";
    }

Listing 15-13: Closing the connection to the server, or noting the login error if necessary

Conversely, if the connection to the server was never made, the script displays an error message instead.


Further Exploration

With a little thought, you can devise many creative uses for webbots that can access email accounts. There are two general areas that may serve as inspiration.

- Use email as a means to control webbots. For example, you could use an email message to tell a spider which domain to use as a target, or you could send an email to a procurement bot (featured in Chapter 19) to indicate which items to purchase.
- Use an email-enabled webbot to interface incompatible systems. For example, you could upload a small file to an FTP server from a BlackBerry if the file (the contents of the email) were sent to a special webbot that, after reading the email, sent the file to the specified server. This could effectively connect a legacy system to remote users.

Email-Controlled Webbots

Here are a few ideas to get you started with email-controlled webbots.

- Design a webbot that forwards messages from a mailing list to your personal email address based upon references to a preset list of terms. (For example, the webbot could forward all messages that reference the words robot, web crawler, webbot, and spider.)
- Develop a procurement bot that automatically reconfigures your eBay bidding strategy when it receives an email from eBay indicating that someone has outbid you.
- Create a strategy that forwards an email message to a webbot that, in turn, displays the message on a 48-foot scrolling marquee that is outside your office building (assuming you have access to such a display!).

Email Interfaces

Here are a few ways you can capitalize on email-enabled webbots to interface different systems.

- Develop a webbot that automatically updates your financial records based on email you receive from PayPal.
- Create a webbot that automatically forwards all email with the word support in the subject line to the person working the help desk at that time.
- Write a webbot that notifies you when one of your mail servers has reached its email (size) quota.
- Write a service that interfaces shipping notification email messages from FedEx to your company's fulfillment system.
- Develop an email-to-fax service that faxes an email message to the phone number in the email's subject line. (This isn't hard to do if you have an old fax/modem from the last century lying around.)
- Write a webbot that maintains statistics about your email accounts, indicating who is sending the most email, when servers are busiest, the number of messages that are deleted without being read, when servers fail, and email addresses that are returned as undeliverable.


Chapter 16. WEBBOTS THAT SEND EMAIL

In Chapter 15 you learned how to create webbots that read email. In this chapter I'll show you how to write webbots that can create massive amounts of email. On that note, let's talk briefly about email ethics.

Email, Webbots, and Spam

Spam has negatively influenced all of our email experiences.[52] It was probably only a few years ago that every email in one's inbox had some value and deserved to be read. Today, however, my spam filter (a proxy service that examines email headers and content to determine if the email is legitimate or a potential scam) rejects roughly 80 percent of the email I receive, flagging it as unwanted solicitation at best and, at worst, a phishing attack: email that masquerades as legitimate and requests credit card or other personal information.

Nobody likes unsolicited email, and your webbot's effectiveness will be reduced if its messages are interpreted as spam by end readers or automated filters. When using your webbots to send volumes of mail, follow these guidelines:

Allow recipients to unsubscribe. If people can't remove themselves from a mailing list, they're subscribed involuntarily. Email that is part of a periodic mailing should include a link that allows the recipient to opt out of future mailings.[53]

Avoid multiple emails. Avoid sending multiple emails with similar content or intent to the same address.

Use a relevant subject line. Don't deceive email recipients (or try to avoid a spam filter) with misleading subject lines. If you're actually selling "herbal Via8r4," don't use a subject line like RE: Thanks!

Identify yourself. Don't spoof your email headers or the originator's actual email address in order to trick spam filters into delivering your email.

Obey the law. Depending on where you live, laws may prohibit sending specific types of email. For example, under the Children's Online Privacy Protection Act (COPPA), it is illegal in the United States to solicit personal information from children. (More information is available at the COPPA website, http://www.coppa.org.) Laws regarding email ethics change constantly. If you have questions, talk to a lawyer that specializes in online law.

Note

Do not use any of the following techniques to test the resolve of people's spam filters. I recommend reading Chapter 28 and having a personal consultation with an attorney before doing anything remotely questionable.

[52] I would like to extend my sincerest apologies to the Hormel Foods Corporation for perpetuating the use of the word spam to describe unwanted email. I'd rather refer to the phenomenon of junk email with a clever term like eJunk or NetClutter. But unfortunately, no other synonym has the worldwide acceptance of spam. Hormel Foods deserves better treatment of its brand—and for this reason I want to stress the difference between SPAM and spam. For additional information on Hormel's take on the use of the word spam, please refer to http://www.spam.com/ci/ci_in.htm.

[53] Unfortunately, many spammers rely on people opting out of mailing lists to verify that an email address is actively used. For many, opting out of a mail list ensures they will continue to receive unsolicited email.


Sending Mail with SMTP and PHP

Outgoing email is sent using the Simple Mail Transfer Protocol (SMTP). Fortunately, PHP's built-in mail() function handles all SMTP socket-level protocols and handshaking for you. The mail() function acts as your mail client, sending email messages just as Outlook or Thunderbird might.

Configuring PHP to Send Mail

Before you can use PHP as a mail client, you must edit PHP's configuration file, php.ini, to point PHP to the mail server's location. For example, the script in Listing 16-1 shows the section of php.ini that configures PHP to work with sendmail, the Unix mail server on many networks.

[mail function]
; For Win32 only.
SMTP = localhost

; For Win32 only.
;sendmail_from = [email protected]

; For Unix only. You may supply arguments as well (default: "sendmail -t -i").
sendmail_path = /usr/sbin/sendmail -t -i

Listing 16-1: Configuring PHP's mail() function

Note

Notice that the configuration differs slightly for Windows and Unix installations. For example, Windows servers use php.ini to describe the network location of the mail server you want to use. In contrast, Unix installations need the file path to your local mail server. In either case, you must have access to a mail server (preferably in the same network domain) that allows you to send email.

Only a few years ago, you could send email through almost any mail server on the Internet using relay host, which enables mail servers to relay messages from mail clients in one domain to a different domain. When using relay host, one can send nearly anonymous email, because these mail servers accept commands from any mail client without needing any form of authentication. The relay host process has been largely abandoned by system administrators because spammers can use it to send millions of anonymous commercial emails. Today, almost every mail server will ignore commands that come from a different domain or from users that are not registered as valid clients.

An "open" mail server—one that allows relaying—is obviously a dangerous thing. I once worked for a company with two corporate mail servers, one of which mistakenly allowed mail relaying. Eventually, a spammer discovered it and commandeered it as a platform for dispatching thousands of anonymous commercial emails.[54]

In addition to wasting our bandwidth, our domain was reported as one that belonged to a spammer and subsequently got placed on a watch list used by spam-detection companies. Once they identified our domain as a source of spam, many important corporate emails weren't received because spam filters had rejected them. It took quite an effort to get our domain off of that list. For this reason, you will need a valid email account to send email from a webbot.

Sending an Email with mail()

PHP provides a built-in function for sending email, as shown in Listing 16-2.

$email_address = "[email protected]";
$email_subject = "Webbot Notification Email";
$email_message = "Your webbot found something that needs your attention";
mail($email_address, $email_subject, $email_message);

Listing 16-2: Sending an email with PHP's built-in mail() function

In the simplest configuration, as shown in Listing 16-2, you only need to specify the destination email address, the subject, and the message. For the reasons mentioned in the relay host discussion, however, you will need a valid account on the same server as the one specified in your php.ini file.

There are, of course, more options than those shown in Listing 16-2. However, these options usually require that you build email headers, which tell a mail client how to format the email and how the email should be distributed. Since the syntax for email headers is very specific, it is easy to implement them incorrectly. Therefore, I've written a small email library called LIB_mail with a function formatted_mail(), which makes it easy to send emails that are more complex than what can easily be sent with the mail() function alone. The script for LIB_mail is shown in Listing 16-3.


function formatted_mail($subject, $message, $address, $content_type)
    {
    # Set defaults
    if(!isset($address['cc']))  $address['cc']  = "";
    if(!isset($address['bcc'])) $address['bcc'] = "";

    # Ensure that there's a Reply-to address
    if(!isset($address['replyto'])) $address['replyto'] = $address['from'];

    # Create mail headers
    $headers = "";
    $headers = $headers . "From: ".$address['from']."\r\n";
    $headers = $headers . "Return-Path: ".$address['from']."\r\n";
    $headers = $headers . "Reply-To: ".$address['replyto']."\r\n";

    # Add Cc to header if needed
    if (strlen($address['cc']) > 0 )
        $headers = $headers . "Cc: ".$address['cc']."\r\n";

    # Add Bcc to header if needed
    if (strlen($address['bcc']) > 0 )
        $headers = $headers . "Bcc: ".$address['bcc']."\r\n";

    # Add content type
    $headers = $headers . "Content-Type: ".$content_type."\r\n";

    # Send the email
    $result = mail($address['to'], $subject, $message, $headers);

    return $result;
    }

Listing 16-3: Sending formatted email with LIB_mail

The main thing to take away from the script above is that the mail header is a very syntax-sensitive string that is best built inside a library function rather than created repeatedly in your scripts. Also, up to six addresses are involved in sending email, and they are all passed to this routine in an array called $address. These addresses are defined in Table 16-1.

Table 16-1. Email Addresses Used by LIB_mail

To: Defines the address of the main recipient of the email. (Required)

Reply-to: Defines the address where replies to the email are sent. (Optional)

Return-path: Indicates where notifications are sent if the email could not be delivered. (Optional)

From: Defines the email address of the party sending the email. (Required)

Cc: Refers to an address of another party, who receives a carbon copy of the email, but is not the primary recipient of the message. (Optional)

Bcc: Is similar to Cc: and stands for blind carbon copy; this address is hidden from the other parties receiving the same email. (Optional)

Configuring the Return-path address is also important because this is the address where notices about undeliverable email messages are sent. If it is not defined, undeliverable email messages will bounce back to your system admin, and you won't know that an email wasn't delivered. For this reason, the function automatically sets the Return-path header to the From address, and it also uses the From address for Reply-to if one isn't specified.
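For reference, here is a minimal usage sketch of formatted_mail() (not from the book); the addresses are placeholders, and it assumes LIB_mail is in the include path.

include("LIB_mail.php");                     # The library shown in Listing 16-3

# Build the $address array described in Table 16-1
$address['from']    = "webbot@example.com";  # Placeholder sender (should match your mail server's domain)
$address['to']      = "admin@example.com";   # Main recipient
$address['replyto'] = "reports@example.com"; # Where replies should go
$address['cc']      = "manager@example.com"; # Carbon copy; 'bcc' is omitted and defaults to ""

# Send a plain-text message and check the result
$result = formatted_mail("Nightly webbot report",
                         "All targets responded normally.",
                         $address, "text/plain");
if(!$result)
    echo "Mail was not accepted for delivery\n";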

[54] Spammers write webbots to discover mail servers that allow mail relaying.


Writing a Webbot That Sends Email Notifications

Here's a simple webbot that, when run, sends an email notification if a web page has changed since the last time it was checked.[55] Such a webbot could have many practical uses. For example, it could monitor online auctions or pages on your fantasy football league's website. A modified version of this webbot could even notify you when the balance of your checking account changes.

The webbot simply downloads a web page and stores a page signature, a number that uniquely describes the content of the page, in a database. This is also known as a hash, a series of characters that represents a text message or a file. In this case, a small hash is used to create a signature that references a file without the need to reference the entire contents of the file. If the signature of the page differs from the one in the database, the webbot saves the new value and sends you an email indicating that the page has changed. Listing 16-4 shows the script for this webbot.[56]

# Get libraries
include("LIB_http.php");   # include cURL library
include("LIB_mysql.php");  # include MySQL library
include("LIB_mail.php");   # include mail library

# Define parameters
$webbot_email_address       = "[email protected]";
$notification_email_address = "[email protected]";
$target_web_site            = "www.trackrates.com";

# Download the website
$download_array = http_get($target_web_site, $ref="");
$web_page = $download_array['FILE'];

# Calculate a 40-character sha1 hash for use as a simple signature
$new_signature = sha1($web_page);

# Compare this signature to the previously stored value in a database
$sql = "select SIGNATURE from signatures where WEB_PAGE='".$target_web_site."'";
list($old_signature) = exe_sql(DATABASE, $sql);

# If the new signature is different than the old one, update the database and
# send an email notifying someone that the web page changed.
if($new_signature != $old_signature)
    {
    // Update database
    if(isset($data_array)) unset($data_array);
    $data_array['SIGNATURE'] = $new_signature;
    update(DATABASE, $table="signatures", $data_array, $key_column="WEB_PAGE", $id=$target_web_site);

    // Send email
    $subject = $target_web_site." has changed";
    $message = $subject . "\n";
    $message = $message . "Old signature = ".$old_signature."\n";
    $message = $message . "New signature = ".$new_signature."\n";
    $message = $message . "Webbot ran at: ".date("r")."\n";
    $address['from']    = $webbot_email_address;
    $address['replyto'] = $webbot_email_address;
    $address['to']      = $notification_email_address;
    formatted_mail($subject, $message, $address, $content_type="text/plain");
    }

Listing 16-4: A simple webbot that sends an email when a web page changes

When the webbot finds that the web page's signature has changed, it sends an email like the one in Listing 16-5.

www.trackrates.com has changed
Old signature = baf73f476aef13ae48bd7df5122d685b6d2be2dd
New signature = baf73f476aed685b6d2be2ddf13ae48bd7df5124
Webbot ran at: Mon, 20 Mar 2007 17:08:00 -0600

Listing 16-5: Email generated by the webbot in Listing 16-4

Keeping Legitimate Mail out of Spam Filters

Many spam filters automatically reject any email in which the domain of the sender doesn't match the domain of the mail server used to send the message. For this reason, it is wise to verify that the domains for the From and Reply-to addresses match the outgoing mail server's domain.

The idea here is not to fool spam filters into letting you send unwanted email, but rather to ensure that legitimate email makes it to the intended Inbox and not the Junk folder, where no one will read it.
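As a quick illustration of that advice (a sketch that is not from the book), the webbot's addresses can be derived from a single configured sending domain so they can never drift out of sync with the outgoing mail server; the domain and mailbox names below are assumptions.

include("LIB_mail.php");                      # Library from Listing 16-3

# The domain your outgoing mail server is authoritative for (assumed value)
$sending_domain = "example.com";

# Derive the webbot's own addresses from that one setting
$address['from']    = "webbot@"  . $sending_domain;
$address['replyto'] = "reports@" . $sending_domain;
$address['to']      = "you@example.com";      # Destination is unrelated to the sending domain

formatted_mail("Status", "Message body", $address, "text/plain");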

Sending HTML-Formatted Email

It's easy to send HTML-formatted email with images, hyperlinks, or any other media found in web pages. To send HTML-formatted emails with the formatted_mail() function, do the following:

Set the $content_type variable to text/html. This will tell the routine to use the proper MIME type in the email header.

Use fully formed URLs to refer to any images or hyperlinks. Relative address references will resolve to the mail client, not the online media you want to use.

Since you never know the capabilities of the client reading the email, use standard formatting techniques. Tables work well. Avoid CSS. Traditional font tags are more predictable in HTML email.

For debugging purposes, it's a good idea to build your message in a string, as shown in Listing 16-6.

# Get library
include("LIB_mail.php");     # Include mail library

# Define addresses
$address['from']    = "[email protected]";
$address['replyto'] = $address['from'];
$address['to']      = "[email protected]";

# Define subject line
$subject = "Example of an HTML-formatted email";

# Define message
$message = "";
$message = $message . "<table bgcolor='#e0e0e0' border='0' cellpadding='0' cellspacing='0'>";
$message = $message . "<tr>";
$message = $message . "<td><img src='http://www.schrenk.com/logo.gif'></td>";
$message = $message . "</tr>";
$message = $message . "<tr>";
$message = $message . "<td>";
$message = $message . "<font face='arial'>";
$message = $message . "Here is an example of a clean HTML-formatted email";
$message = $message . "</font>";
$message = $message . "</td>";
$message = $message . "</tr>";
$message = $message . "<tr>";
$message = $message . "<td>";
$message = $message . "<font face='arial'>";
$message = $message . "with an image and a <a href='http://www.schrenk.com'>hyperlink</a>.";
$message = $message . "</font>";
$message = $message . "</td>";
$message = $message . "</tr>";
$message = $message . "</table>";

echo $message;

// Send email
formatted_mail($subject, $message, $address, $content_type="text/html");
?>

Listing 16-6: Sending HTML-formatted email

The email sent by Listing 16-6 looks like Figure 16-1.

Figure 16-1. HTML-formatted email sent by the script in Listing 16-6

Be aware that not all mail clients can render HTML-formatted email. In those instances, you should send either text-only emails or a multi-formatted email that contains both HTML and unformatted messages.
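The book doesn't show a multi-formatted example, so here is a hedged sketch of one way to build such a message with the same mail() function used throughout this chapter; the boundary string and addresses are arbitrary placeholders.

# Build a multipart/alternative message: plain-text part first, HTML part second
$boundary = "webbot-boundary-".md5(uniqid());                 // Arbitrary unique separator

$headers  = "From: webbot@example.com\r\n";                   // Placeholder sender
$headers .= "MIME-Version: 1.0\r\n";
$headers .= "Content-Type: multipart/alternative; boundary=\"".$boundary."\"\r\n";

$body  = "--".$boundary."\r\n";
$body .= "Content-Type: text/plain; charset=\"iso-8859-1\"\r\n\r\n";
$body .= "Your webbot report (plain-text version).\r\n\r\n";
$body .= "--".$boundary."\r\n";
$body .= "Content-Type: text/html; charset=\"iso-8859-1\"\r\n\r\n";
$body .= "<html><body><b>Your webbot report</b> (HTML version).</body></html>\r\n\r\n";
$body .= "--".$boundary."--\r\n";                             // Closing boundary

mail("you@example.com", "Webbot report", $body, $headers);    // Placeholder recipient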


[55] For information on periodic and autonomous launching of webbots, read Chapter 23.

[56] This script makes use of LIB_mysql. If you haven't already done so, make sure you read Chapter 6 to learn how to use this library.


Further Exploration

If you think about all the ways you use email, you'll probably be able to come up with some very creative uses for your webbots. The following concepts should serve as starting points for your own webbot development.

Using Returned Emails to Prune Access Lists

You can design an email-wielding webbot to help you identify illegitimate members of a members-only website. If someone has access to a business-to-business website but is no longer employed by a company that uses the site, that person probably also lost access to his or her corporate email address; any email sent to that account will be returned as undeliverable. You could design a webbot that periodically sends some type of report to everyone who has access to the website. Any emails that return as undeliverable will alert you to a member's email address that is no longer valid. Your webbot can then track these undeliverable emails and deactivate former employees from your list of members.

Using Email as Notification That Your Webbot Ran

It's handy to have an indication that a webbot has actually run. A simple email at the end of the webbot's session can inform you that it ran and what it did. Often, the actual content of these email notifications is not as significant as the emails themselves, which indicate that a webbot ran successfully. Similarly, you can use email notifications to tell you exactly when and how a webbot has failed.
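As a small illustration (not from the book), a webbot might send such a notification as its last step; the log text and addresses below are assumed for the example.

include("LIB_mail.php");                       # Library from Listing 16-3

$log = "";                                     # Accumulate a short activity log as the webbot runs
$log .= "Fetched 14 pages\n";                  # (Hypothetical activity)
$log .= "2 pages changed since last run\n";

# Send the log as a run notification, whether or not anything interesting happened
$address['from']    = "webbot@example.com";    # Placeholder addresses
$address['replyto'] = $address['from'];
$address['to']      = "you@example.com";
formatted_mail("Webbot run completed at ".date("r"), $log, $address, "text/plain");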

Leveraging Wireless Technologies

Since wireless email clients like cell phones and BlackBerries allow people to use email away from their desks, your webbots can effectively use email in more situations than they could only a few years ago. Think about applications where webbots can exploit mobile email technology. For example, you could write a webbot that checks the status of your server and sends warnings to people when they're away from the office. You could also develop a webbot that sends an instant message when your company is mentioned on CNN.com.

Writing Webbots That Send Text Messages

Many wireless carriers support email interfaces for text messaging, or short message service (SMS). These messages appear as text on cell phones, and many people find them to be less intrusive than voice messages. To send a text message, you simply email the message to one of the email-to-text message addresses provided by wireless carriers—a task you could easily hand off to a webbot. Appendix C contains a list of email-to-text message addresses; if you can't find your carrier in this list, contact its customer service department to see if it provides this service.
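To make the idea concrete, here is a minimal sketch; the gateway domain sms.example-carrier.com is a placeholder rather than a real carrier address, so substitute the one listed for your carrier in Appendix C.

include("LIB_mail.php");                          # Library from Listing 16-3

$phone_number    = "5555551234";                  # Recipient's 10-digit number (placeholder)
$carrier_gateway = "sms.example-carrier.com";     # Placeholder email-to-SMS gateway domain

$address['from']    = "webbot@example.com";       # Placeholder sender
$address['replyto'] = $address['from'];
$address['to']      = $phone_number."@".$carrier_gateway;

# Keep the message short; SMS messages are typically limited to 160 characters
formatted_mail("", "Webbot alert: target page changed at ".date("H:i"), $address, "text/plain");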


Chapter 17. CONVERTING A WEBSITE INTO A FUNCTION

Webbots are easier to use when they're packaged as functions. These functions are simply interfaces to webbots that download and parse information and return the desired data in a predefined structure. For example, the National Oceanic and Atmospheric Administration (NOAA) provides weather forecasts on its website (http://www.noaa.gov). You could write a function to execute a webbot that downloads and parses a forecast. This interface could also return the forecast in an array, as shown in Listing 17-1.

# Get weather forecast
$forecast_array = get_noaa_forecast($zip=89109);

# Display forecast
echo $forecast_array['MONDAY']['TEMPERATURE']."<br>";
echo $forecast_array['MONDAY']['WIND_SPEED']."<br>";
echo $forecast_array['MONDAY']['WIND_DIRECTION']."<br>";

Listing 17-1: Simplifying webbot use by creating a function interface

While the example in Listing 17-1 is hypothetical, you can see that interfacing with a webbot in this manner conceals the dirty details of downloading or parsing web pages. Yet, the programmer has full ability to access online information and services that the webbots provide. From a programmer's perspective, it isn't even obvious that webbots are used.

When a programmer accesses a webbot from a function interface, he or she gains the ability to use the webbot both programmatically and in real time. This is a departure from the traditional method of launching webbots.[57] Customarily, you schedule a webbot to execute periodically, and if the webbot generates data, that information is stored in a database for later retrieval. With a function interface to a webbot, you don't have to wait for a webbot to run as a scheduled task. Instead, you can directly request the specific contents of a web page whenever you need them.

Writing a Function Interface

This project uses a web page that decodes ZIP codes and converts that operation into a function, which is available from a PHP program. This particular web page finds the city, county, state, and geo coordinates for the post office located in a specific ZIP code. Theoretically, you could use this function to validate ZIP codes or use the latitude and longitude information to plot locations on a map. Figure 17-1 shows the target website for this project.


Figure 17-1. Target website, which returns information about a ZIP code

The sole purpose of the web page in Figure 17-1 is to be a target for your webbots. (A link to this page is available at this book's website.) This target web page uses a standard form to capture a ZIP code. Once you submit that form, the web page returns a variety of information about the ZIP code you entered in a table below the form.

Defining the Interface

This example function uses the interface shown in Listing 17-2, where a function named decode_zipcode() accepts a five-digit ZIP code as an input parameter and returns an array, which describes the area serviced by the ZIP code.

array $zipcode_array = decode_zipcode(int $zipcode);

input:  $zipcode is a five-digit USPS ZIP code
output: $zipcode_array['CITY']
        $zipcode_array['COUNTY']
        $zipcode_array['STATE']
        $zipcode_array['LATITUDE']
        $zipcode_array['LONGITUDE']

Listing 17-2: decode_zipcode() interface

Analyzing the Target Web Page

Since this webbot needs to submit a ZIP code to a form, you will need to use the techniques you learned in Chapter 5 to emulate someone manually submitting the form. As you learned, you should always pass even simple forms through a form analyzer (similar to the one used in Chapter 5) to ensure that you will submit the form in the manner the server expects. This is important because web pages commonly insert dynamic fields or values into forms that can be hard to detect by just looking at a page.

To use the form analyzer, simply load the web page into a browser and view the source code, as shown in Figure 17-2.

Figure 17-2. Displaying the form's source code


Figure 17-3. Saving the form's source code

Once you have the target's source code, save the HTML to your hard drive, as done in Figure 17-3. Once the form's HTML is on your hard drive, you must edit it to make the form submit its content to the form analyzer instead of the target server. You do this by changing the form's action attribute to the location of the form analyzer, as shown in Figure 17-4.

Figure 17-4. Changing the form's action attribute to the form analyzer


Now you have a copy of the target form on your hard drive, with the form's original action attribute replaced with the web address of the form analyzer. The final step is to load this local copy of the form into a browser, manually fill in the form, and submit it to the analyzer. Once submitted, you should see the analysis performed by the form analyzer, as shown in Figure 17-5.

Figure 17-5. Analyzing the target form

The analysis tells us that the method is POST and that there are three required data fields. In addition to the zipcode field, there is also a hidden session field (which looks suspiciously like a Unix timestamp) and a Submit field, which is actually the name of the Submit button. To emulate the form submission, it is vitally important to correctly use all the field names (with appropriate values) as well as the same method used by the original form.


Once you write your webbot, it's a good idea to test it by using the form analyzer as a target to ensure that the webbot submits the form as the target webserver expects it to. This is also a good time to verify the agent name your webbot uses.

Using describe_zipcode()

The script that interfaces the target web page to a PHP function, called describe_zipcode(), is available in its entirety at this book's website. It is broken into smaller pieces and annotated here for clarity.

Getting the Session Value

It is not uncommon to find dynamically assigned values, like the session value employed by this target, in forms. Since the session is assigned dynamically, the webbot must first make a page request to get the session value before it can submit form values. This actually mimics normal browser use, as the browser first must download the form before submitting it. The webbot captures the session variable with the script described in Listing 17-3.

# Start interface describe_zipcode($zipcode)
function describe_zipcode($zipcode)
    {
    # Get required libraries and declare the target
    include("LIB_http.php");
    include("LIB_parse.php");
    $target = "http://www.schrenk.com/nostarch/webbots/zip_code_form.php";

    # Download the target
    $page = http_get($target, $ref="");

    # Parse the session hidden tag from the downloaded page
    # <input type="hidden" name="session" value="xxxxxxxxxx">
    $session_tag = return_between($string = $page['FILE'],
                                  $start  = "<input type=\"hidden\" name=\"session\"",
                                  $end    = ">",
                                  $type   = EXCL);

    # Remove the "'s and "value=" text to reveal the session value
    $session_value = str_replace("\"", "", $session_tag);
    $session_value = str_replace("value=", "", $session_value);

Listing 17-3: Downloading the target to get the session variable

The script in Listing 17-3 is a classic screen scraper. It downloads the page and parses the session value from the form <input> tag. The str_replace() function is later used to remove superfluous quotes and the tag's value attribute. Notice that the webbot uses LIB_parse and LIB_http, described in previous chapters, to download and parse the web page.[58]


Submitting the Form

Once you know the session value, the script in Listing 17-4 may be used to submit the form. Notice the use of http_post_form() to emulate the submission of a form with the POST method. The form fields are conveniently passed to the target webserver in $data_array[].

$data_array['session'] = $session_value;
$data_array['zipcode'] = $zipcode;
$data_array['Submit']  = "Submit";
$form_result = http_post_form($target, $ref=$target, $data_array);

Listing 17-4: Emulating the form

Parsing and Returning the Result

The remaining step is to parse the desired city, county, state, and geo coordinates from the web page obtained from the form submission in the previous listing. The script that does this is shown in Listing 17-5.

$landmark = "Information about ".$zipcode;
$table_array = parse_array($form_result['FILE'], "<table", "</table>");
for($xx=0; $xx<count($table_array); $xx++)
    {
    # Parse the table containing the parsing landmark
    if(stristr($table_array[$xx], $landmark))
        {
        $ret['CITY'] = return_between($table_array[$xx], "CITY", "</tr>", EXCL);
        $ret['CITY'] = strip_tags($ret['CITY']);

        $ret['STATE'] = return_between($table_array[$xx], "STATE", "</tr>", EXCL);
        $ret['STATE'] = strip_tags($ret['STATE']);

        $ret['COUNTY'] = return_between($table_array[$xx], "COUNTY", "</tr>", EXCL);
        $ret['COUNTY'] = strip_tags($ret['COUNTY']);

        $ret['LATITUDE'] = return_between($table_array[$xx], "LATITUDE", "</tr>", EXCL);
        $ret['LATITUDE'] = strip_tags($ret['LATITUDE']);

        $ret['LONGITUDE'] = return_between($table_array[$xx], "LONGITUDE", "</tr>", EXCL);
        $ret['LONGITUDE'] = strip_tags($ret['LONGITUDE']);
        }
    }

# Return the parsed data
return $ret;
} # End Interface describe_zipcode($zipcode)

Listing 17-5: Parsing and returning the data

This script first uses parse_array() to create an array containing all the tables in the downloaded web page, which is returned in $form_result['FILE']. The script then looks for the table that contains the parsing landmark Information about . . . . Once the webbot finds the table that holds the data we're looking for, it parses the data using unique strings that identify the beginning and end of the desired data. The parsed data is then cleaned up with strip_tags() and returned in the array we described earlier. Once the data is parsed and placed into an array, that array is returned to the calling program.

[57] Traditional methods for executing webbots are described in Chapter 23.

[58] LIB_http and LIB_parse are described in Chapters 3 and 4, respectively.


Final Thoughts

Now that you know how to write function interfaces to a web page (or in our case, a form), you can convert the data and functionality of any web page into something your programs can use easily in real time. Here are a few more things for you to consider.

Distributing Resources

A secondary benefit of creating a function interface to a webbot is that when a webbot uses a web page on another server as a resource, it allocates bandwidth and computational power across several computers. Since more resources are deployed, you can get more done in less time. You can use this technique to spread the burden of running complex webbots to more than one computer on your local or remote networks. This technique may also be used to make page requests from multiple IP addresses (for added stealth) or to spread bandwidth across multiple Internet nodes.

Using Standard Interfaces

The interface described in this example is specific to PHP. Although scripts for Perl, Java, or C++ environments would be very similar to this one, you could not use this script directly in an environment other than PHP. You can solve this problem by returning data in a language-independent format like XML or SOAP (Simple Object Access Protocol). To learn more about these protocols, read Chapter 26.
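To illustrate (this sketch is not from the book), the array returned by describe_zipcode() could be wrapped in a simple XML document before it crosses a language boundary; the element names are arbitrary.

# Convert the ZIP code array into language-independent XML
function zipcode_array_to_xml($zipcode_array)
    {
    $xml = "<?xml version=\"1.0\"?>\n<zipcode>\n";
    foreach($zipcode_array as $field => $value)
        {
        $tag  = strtolower($field);                       // e.g., CITY becomes <city>
        $xml .= "  <".$tag.">".htmlspecialchars($value)."</".$tag.">\n";
        }
    $xml .= "</zipcode>\n";
    return $xml;
    }

# Example use (assumes describe_zipcode() from this chapter is available)
echo zipcode_array_to_xml(describe_zipcode("89109"));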

Designing a Custom Lightweight "Web Service"

Our example assumed that the target was not under our control, so we had to live within the constraints presented by the target website. When you control the website your interface targets, however, you can design the web page in such a way that you don't have to parse the data from HTML. In these instances, the data is returned as variables that your program can use directly. These techniques are also described in detail in Chapter 26.

If you're interested in creating your own ZIP code server (with a lightweight interface), you'll need a ZIP code database. You should be able to find one by performing a Google search for ZIP code database.


Part III. ADVANCED TECHNICAL CONSIDERATIONS

The chapters in this section explore the finer technical aspects of webbot and spider development. In the first two chapters, I'll share some lessons I learned the hard way while writing very specialized webbots and spiders. I'll also describe methods for leveraging PHP/CURL to create webbots that manage authentication, encryption, and cookies.

Chapter 18
This discussion of spider design starts with an exploration of simple spiders that find and follow links on specific web pages. The conversation later expands to techniques for developing advanced spiders that autonomously roam the Internet, looking for specific information and dropping payloads—performing predefined functions as they find desired information.

Chapter 19
In this chapter, we'll explore the design theory of writing snipers, webbots that automatically purchase items. Snipers are primarily used on online auction sites, "attacking" when a specific list of criteria is met.

Chapter 20
Encrypted websites are not a problem for webbots using PHP/CURL. Here we'll explore how online encryption certificates work and how PHP/CURL makes encryption easy to handle.

Chapter 21
In this chapter on accessing authenticated (i.e., password-protected) sites, we'll explore the various methods used to protect a website from unauthorized users. You'll also learn how to write webbots that can automatically log in to these sites.

Chapter 22
Advanced cookie management involves managing cookie expiration dates and multiple sets of cookies for multiple users. We'll also explore PHP/CURL's ability (and inability) to meet these challenges.

Chapter 23
In the final installment in this section, we'll explore methods for periodically launching or executing a webbot. These techniques will allow your webbots to run unattended while simulating human activity.


Chapter 18. SPIDERS

Spiders, also known as web spiders, crawlers, and web walkers, are specialized webbots that—unlike traditional webbots with well-defined targets—download multiple web pages across multiple websites. As spiders make their way across the Internet, it's difficult to anticipate where they'll go or what they'll find, as they simply follow links they find on previously downloaded pages. Their unpredictability makes spiders fun to write because they act as if they almost have minds of their own.

The best known spiders are those used by the major search engine companies (Google, Yahoo!, and MSN) to identify online content. And while spiders are synonymous with search engines for many people, the potential utility of spiders is much greater. You can write a spider that does anything any other webbot does, with the advantage of targeting the entire Internet. This creates a niche for developers that design specialized spiders that do very specific work. Here are some potential ideas for spider projects:

Discover sales of original copies of 1963 Spider-Man comics. Design your spider to email you with links to new findings or price reductions.

Periodically create an archive of your competitors' websites.

Invite every MySpace member living in Cleveland, Ohio to be your friend.[59]

Send a text message when your spider finds jobs for Miami-based fashion photographers who speak Portuguese.

Maintain an updated version of your local newspaper on your PDA.

Validate that all the links on your website point to active web pages.

Perform a statistical analysis of noun usage across the Internet.

Search the Internet for musicians that recorded new versions of your favorite songs.

Purchase collectible Bibles when your spider detects one with a price substantially below the collectible price listed on Amazon.com.

This list could go on, but you get the idea. To a business, a well-purposed spider is like additional staff, easily justifying the one-time development cost.

How Spiders Work

Spiders begin harvesting links at the seed URL, the address of the initial target web page. The spider uses these links as references to the next set of pages to process, and as it downloads each of those web pages, the spider harvests more links. The first page the spider downloads is known as the first penetration level. In each successive level of penetration, additional web pages are downloaded as directed by the links harvested in the previous level. The spider repeats this process until it reaches the maximum penetration level. Figure 18-1 shows a typical spider process.


Figure 18-1. A simple spider

[59] This is only listed here to show the potential for what spiders can do. Please don't actually do this! Automated agents like this violate MySpace's terms of use. Develop webbots responsibly.


Example Spider

Our example spider will reuse the image harvester (described in Chapter 8) that downloads images for an entire website. The image harvester is this spider's payload—the task that it will perform on every web page it visits. While this spider performs a useful task, its primary purpose is to demonstrate how spiders work, so design compromises were made that affect the spider's scalability for use on larger tasks. After we explore this example spider, I'll conclude with recommendations for making a scalable spider suitable for larger projects.

Listings 18-1 and 18-2 are the main scripts for the example spider. Initially, the spider is limited to collecting links. Since the payload adds complexity, we'll include it after you've had an opportunity to understand how the basic spider works.

# Initialization
include("LIB_http.php");               // http library
include("LIB_parse.php");              // parse library
include("LIB_resolve_addresses.php");  // Address resolution library
include("LIB_exclusion_list.php");     // List of excluded keywords
include("LIB_simple_spider.php");      // Spider routines used by this app

set_time_limit(3600);                  // Don't let PHP time out

$SEED_URL = "http://www.YourSiteHere.com";
$MAX_PENETRATION = 1;                  // Set spider penetration depth
$FETCH_DELAY = 1;                      // Wait 1 second between page fetches
$ALLOW_OFFSITE = false;                // Don't let spider roam from seed domain
$spider_array = array();               // Initialize the array that holds links

Listing 18-1: Main spider script, initialization

The script in Listing 18-1 loads the required libraries and initializes settings that tell the spider how to operate. This project introduces two new libraries: an exclusion list (LIB_exclusion_list.php) and the spider library used for this exercise (LIB_simple_spider.php). We'll explain both of these new libraries as we use them.

In any PHP spider design, the default script time-out of 30 seconds needs to be set to a period more appropriate for spiders, since script execution may take minutes or even hours. Since spiders may have notoriously long execution times, the script in Listing 18-1 sets the PHP script time-out to one hour (3,600 seconds) with the set_time_limit(3600) command.

The example spider is configured to collect enough information to demonstrate how spiders work but not so much that the sheer volume of data distracts from the demonstration. You can set these settings differently once you understand the effects they have on the operation of your spider. For now, the maximum penetration level is set to 1. This means that the spider will harvest links from the seed URL and the pages that the links on the seed URL reference, but it will not download any pages that are more than one link away from the seed URL. Even when you tie the spider's hands—as we've done here—it still collects a ridiculously large amount of data. When limited to one penetration level, the spider still harvested 583 links when pointed at http://www.schrenk.com. This number excludes redundant links, which would otherwise raise the number of harvested links to 1,930. For demonstration purposes, the spider also rejects links that are not on the parent domain.

The main spider script, shown in Listing 18-2, is quite simple. Much of this simplicity, however, comes at the cost of storing links in an array, instead of a more scalable (and more complicated) database. As you can see, the functions in the libraries make it easy to download web pages, harvest links, exclude unwanted links, and fully resolve addresses.

# Get links from $SEED_URL
echo "Harvesting Seed URL \n";
$temp_link_array = harvest_links($SEED_URL);
$spider_array = archive_links($spider_array, 0, $temp_link_array);

# Spider links from remaining penetration levels
for($penetration_level=1; $penetration_level<=$MAX_PENETRATION; $penetration_level++)
    {
    $previous_level = $penetration_level - 1;
    for($xx=0; $xx<count($spider_array[$previous_level]); $xx++)
        {
        unset($temp_link_array);
        $temp_link_array = harvest_links($spider_array[$previous_level][$xx]);
        echo "Level=$penetration_level, xx=$xx of ".count($spider_array[$previous_level])." \n";
        $spider_array = archive_links($spider_array, $penetration_level, $temp_link_array);
        }
    }

Listing 18-2: Main spider script, harvesting links

When the spider uses www.schrenk.com as a seed URL, it harvests and rejects links, as shown in Figure 18-2.

Now that you've seen the main spider script, an exploration of the routines in LIB_simple_spider will provide insight into how it really works.


LIB_simple_spider

Special spider functions are found in the LIB_simple_spider library. This library provides functions that parse links from a web page when given a URL, archive harvested links in an array, identify the root domain for a URL, and identify links that should be excluded from the archive. This library, as well as the other scripts featured in this chapter, is available for download at this book's website.


Figure 18-2. Running the simple spider from Listings 18-1 and 18-2

harvest_links()

The harvest_links() function downloads the specified web page and returns all the links in an array. This function, shown in Listing 18-3, uses the $DELAY setting to keep the spider from sending too many requests to the server over too short a period.[60]

function harvest_links($url)
    {
    # Initialize
    global $DELAY;
    $link_array = array();

    # Get page base for $url (used to create fully resolved URLs for the links)
    $page_base = get_base_page_address($url);

    # $DELAY creates a random delay period between 1 second and full delay period
    $random_delay = rand(1, rand(1, $DELAY));

    # Download webpage
    sleep($random_delay);
    $downloaded_page = http_get($url, "");

    # Parse links
    $anchor_tags = parse_array($downloaded_page['FILE'], "<a", "</a>", EXCL);

    # Get http attributes for each tag into an array
    for($xx=0; $xx<count($anchor_tags); $xx++)
        {
        $href = get_attribute($anchor_tags[$xx], "href");
        $resolved_address = resolve_address($href, $page_base);
        $link_array[] = $resolved_address;
        echo "Harvested: ".$resolved_address." \n";
        }
    return $link_array;
    }

Listing 18-3: Harvesting links from a web page with the harvest_links() function

archive_links()

The script in Listing 18-4 uses the link array collected by the previous function to create an archival array. The first element of the archival array identifies the penetration level where the link was found, while the second contains the actual link.

function archive_links($spider_array, $penetration_level, $temp_link_array)
    {
    for($xx=0; $xx<count($temp_link_array); $xx++)
        {
        # Don't add excluded links to $spider_array
        if(!excluded_link($spider_array, $temp_link_array[$xx]))
            {
            $spider_array[$penetration_level][] = $temp_link_array[$xx];
            }
        }
    return $spider_array;
    }

Listing 18-4: Archiving links in $spider_array

get_domain()

The function get_domain() parses the root domain from the target URL. For example, given a target URL like https://www.schrenk.com/store/product_list.php, the root domain is schrenk.com.

The function get_domain() compares the root domains of the links to the root domain of the seed URL to determine if the link is for a URL that is not in the seed URL's domain, as shown in Listing 18-5.

function get_domain($url)
    {
    // Remove protocol from $url
    $url = str_replace("http://", "", $url);
    $url = str_replace("https://", "", $url);

    // Remove page and directory references
    if(stristr($url, "/"))
        $url = substr($url, 0, strpos($url, "/"));

    return $url;
    }

Listing 18-5: Parsing the root domain from a fully resolved URL

This function is only used when the configuration for $ALLOW_OFFSITE is set to false.

excluded_link()

This function examines each link and determines if it should be included in the archive of harvested links. Reasons for excluding a link may include the following:

The link is contained within JavaScript

The link already appears in the archive

The link contains excluded keywords that are listed in the exclusion array

The link is to a different domain

function excluded_link($spider_array, $link)
    {
    # Initialization
    global $exclusion_array, $ALLOW_OFFSITE, $SEED_URL;
    $exclude = false;

    // Exclude links that are JavaScript commands
    if(stristr($link, "javascript"))
        {
        echo "Ignored JavaScript function: $link\n";
        $exclude=true;
        }

    // Exclude redundant links
    for($xx=0; $xx<count($spider_array); $xx++)
        {
        if(isset($spider_array[$xx]) && in_array($link, $spider_array[$xx]))
            {
            echo "Ignored redundant link: $link\n";
            $exclude=true;
            break;
            }
        }

    // Exclude links found in $exclusion_array
    for($xx=0; $xx<count($exclusion_array); $xx++)
        {
        if(stristr($link, $exclusion_array[$xx]))
            {
            echo "Ignored excluded link: $link\n";
            $exclude=true;
            break;
            }
        }

    // Exclude offsite links if requested
    if($ALLOW_OFFSITE==false)
        {
        if(get_domain($link)!=get_domain($SEED_URL))
            {
            echo "Ignored offsite link: $link\n";
            $exclude=true;
            }
        }
    return $exclude;
    }

Listing 18-6: Excluding unwanted links

There are several reasons to exclude links. For example, it's best to ignore any links referenced within JavaScript because—without a proper JavaScript interpreter—those links may yield unpredictable results. Removing redundant links makes the spider run faster and reduces the amount of data the spider needs to manage. The exclusion list allows the spider to ignore undesirable links to places like Google AdSense, banner ads, or other places you don't want the spider to go.

[60] A stealthier spider would shuffle the order of web page requests.


Experimenting with the Spider

Now that you have a general idea how this spider works, go to the book's website and download the required scripts. Play with the initialization settings, use different seed URLs, and see what happens.

Consider these three warnings before you start:

Use a respectful $FETCH_DELAY of at least a second or two so you don't create a denial of service (DoS) attack by consuming so much bandwidth that others cannot use the web pages you target. Better yet, read Chapter 28 before you begin.

Keep the maximum penetration level set to a low value like 1 or 2. This spider is designed for simplicity, not scalability, and if you penetrate too deeply into your seed URL, your computer will run out of memory.

For best results, run spider scripts within a command shell, not through a browser.


Adding the Payload

The payload used by this spider is an extension of the library used in Chapter 8 to download all the images found on a web page. This time, however, we'll download all the images referenced by the entire website. The code that adds the payload to the spider is shown in Listing 18-7. You can tack this code directly onto the end of the script for the earlier spider.

# Add the payload to the simple spider
// Include download and directory creation lib
include("LIB_download_images.php");

// Download images from pages referenced in $spider_array
for($penetration_level=1; $penetration_level<=$MAX_PENETRATION; $penetration_level++)
    {
    $previous_level = $penetration_level - 1;
    for($xx=0; $xx<count($spider_array[$previous_level]); $xx++)
        {
        download_images_for_page($spider_array[$previous_level][$xx]);
        }
    }

Listing 18-7: Adding a payload to the simple spider


Functionally, the addition of the payload involves the inclusion of the image download library and a two-part loop that activates the image harvester for every web page referenced at every penetration level.


Further Exploration

As mentioned earlier, the example spider was optimized for simplicity, not scalability. Moreover, while it was suitable for learning about spiders, it is not suitable for use in a production environment where you want to spider many web pages. There are, however, opportunities for enhancements to improve performance and scalability.

Save Links in a Database

The single biggest limitation of the example spider is that all the links are stored in an array. Arrays can only get so big before the computer is forced to rely on disk swapping, a technique that expands the amount of data space by moving some of the storage task from RAM to a disk drive. Disk swapping adversely affects performance and often leads to system crashes. The other drawback to storing links in an array is that all the work your spider performed is lost as soon as the program terminates. A much better approach is to store the information your spiders harvest in a database.

Saving your spider's data in a database has many advantages. First of all, you can store more information. Not only does a database increase the number of links you can store, but it also makes it practical to cache images of the pages you download for later processing. As we'll see later, it also allows more than one spider to work on the same set of links and makes it possible for multiple computers to launch payloads on the data collected by the spider(s).
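As a starting point (not from the book), a links table for this purpose might look like the following sketch; the table and column names are assumptions, and the queries use the LIB_mysql-style exe_sql() helper introduced in Chapter 6.

include("LIB_mysql.php");                              # Database library from Chapter 6

# One-time table creation (assumed schema; adjust names and sizes to taste)
$sql = "CREATE TABLE IF NOT EXISTS links (
            ID                INT AUTO_INCREMENT PRIMARY KEY,
            URL               VARCHAR(255) NOT NULL UNIQUE,
            PENETRATION_LEVEL INT NOT NULL,
            PROCESSED         TINYINT DEFAULT 0,       -- 0 = payload not yet run
            WORKER            VARCHAR(64),             -- which process claimed the row (see next section)
            HARVESTED_AT      DATETIME
        )";
exe_sql(DATABASE, $sql);

# Inside the spider, store each harvested link instead of appending to $spider_array
function archive_link_in_db($url, $penetration_level)
    {
    $sql = "INSERT IGNORE INTO links (URL, PENETRATION_LEVEL, HARVESTED_AT)
            VALUES ('".addslashes($url)."', ".intval($penetration_level).", now())";
    exe_sql(DATABASE, $sql);                           # INSERT IGNORE silently skips redundant links
    }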

Separate the Harvest and Payload

The example spider performs the payload after harvesting all the links. Often, however, link harvesting and payload are two distinctly separate pieces of code, and they are often performed by two separate computers. While one script harvests links and stores them in a database, another process can query the same database to determine which web pages have not received the payload. You could, for example, use the same computer to schedule the spiders to run in the morning and the payload script to run in the evening. This assumes, of course, that you save your spidered results in a database, where the data has persistence and is available over an extended period.

Distribute Tasks Across Multiple Computers

Your spider can do more in less time if it teams with other spiders to download multiple pages simultaneously. Fortunately, spiders spend most of their time waiting for webservers to respond to requests for web pages, so there's a lot of unused computer power when a single spider process is running on a computer. You can run multiple copies of the same spider script if your spider software queries a database to identify the oldest unprocessed link. After it parses links from that web page, it can query the database again to determine whether links on the next level of penetration already exist in the database—and if not, it can save them for later processing. Once you've written one spider to operate in this manner, you can run multiple copies of the identical spider script on the same computer, each accessing the same database to complete a common task. Similarly, you can also run multiple copies of the payload script to process all the links harvested by the team of spiders.

If you run out of processing power on a single computer, you can use the same technique used to run parallel spiders on one machine to run multiple spiders on multiple computers. You can improve performance further by hosting the database on its own computer. As long as all the spiders and all the payload computers have network access to a common database, you should be able to expand this concept until the database runs out of processing power. Distributing the database, unfortunately, is more difficult than distributing spiders and payload tasks.
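One way to let several payload processes share the links table sketched above (again, an assumption layered on the LIB_mysql helpers, not the book's code) is to have each worker claim rows atomically before processing them:

include("LIB_mysql.php");                      # Database library from Chapter 6
include("LIB_download_images.php");            # Payload library from Chapter 8

$worker_id = "payload-".getmypid();            # Unique-enough ID for this process

while(true)
    {
    # Claim the oldest waiting link; the single UPDATE is atomic, so no two workers get the same row
    exe_sql(DATABASE, "UPDATE links SET PROCESSED=1, WORKER='".$worker_id."'
                       WHERE PROCESSED=0 ORDER BY HARVESTED_AT ASC LIMIT 1");

    # Fetch the row this worker just claimed; an empty result means there is no work left
    list($url) = exe_sql(DATABASE, "SELECT URL FROM links
                                    WHERE PROCESSED=1 AND WORKER='".$worker_id."' LIMIT 1");
    if(empty($url))
        break;

    download_images_for_page($url);            # Run the payload on this page

    # Mark the link finished so it isn't claimed again
    exe_sql(DATABASE, "UPDATE links SET PROCESSED=2 WHERE URL='".addslashes($url)."'");
    }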

Regulate Page Requests

Spiders (especially the distributed types) increase the potential of overwhelming target websites with page requests. It doesn't take much computer power to completely flood a network. In fact, a vintage 33 MHz Pentium has ample resources to consume a T1 network connection. Multiple modern computers, of course, can do much more damage. If you do build a distributed spider, you should consider writing a scheduler, perhaps on the computer that hosts your database, to regulate how often page requests are made to specific domains or even to specific subnets. The scheduler could also remove redundant links from the database and perform other routine maintenance tasks. If you haven't already done so, this is a good time to read (or reread) Chapter 28.
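A minimal sketch of that kind of throttle (an illustration, not the book's scheduler) might track the last request time per domain and enforce a minimum gap before the next fetch; the two-second figure is an arbitrary assumption.

# Simple in-memory, per-domain request throttle
$last_request = array();          // domain => Unix timestamp of last request
$min_gap = 2;                     // Minimum seconds between requests to the same domain (assumed)

function polite_http_get($url)
    {
    global $last_request, $min_gap;

    $domain = get_domain($url);   // Root-domain helper from Listing 18-5

    # If we hit this domain recently, sleep off the remaining time
    if(isset($last_request[$domain]))
        {
        $elapsed = time() - $last_request[$domain];
        if($elapsed < $min_gap)
            sleep($min_gap - $elapsed);
        }

    $last_request[$domain] = time();
    return http_get($url, "");    // Download with LIB_http, as elsewhere in this chapter
    }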


Chapter 19. PROCUREMENT WEBBOTS AND SNIPERS

A procurement bot is any intelligent web agent that automatically makes online purchases on a user's behalf. These webbots are improvements over manual online procurement because they not only automate the online purchasing process, but also autonomously detect events that indicate the best time to buy. Procurement bots commonly make automated purchases based on the availability of merchandise or price reductions. For other webbots, external events like low inventory levels trigger a purchase.

The advantage of using procurement bots in your business is that they identify opportunities that may only be available for a short period or that may only be discovered after many hours of browsing. Manually finding online deals can be tedious, time consuming, and prone to human error. The ability to shop automatically uncovers bargains that would otherwise go unnoticed. I've written automated procurement bots that—on a monthly basis—purchase hundreds of thousands of dollars of merchandise that would be unknown to less vigilant human buyers.


Procurement Webbot Theory

Before you begin, consider that procurement bots require both planning and in-depth investigation of target websites. These programs spend your (or your clients') money, and their success is dependent on how well you design, program, debug, and implement them. With this in mind, use the techniques described elsewhere in this book before embarking on your first procurement bot—in other words, your first webbot shouldn't be one that spends money. You can use the online test store (introduced in Chapter 7) as target practice before writing webbots that make autonomous purchases in the wild.

While procurement bots purchase a wide range of products in various circumstances, they typically follow the steps shown in Figure 19-1.


Figure 19-1. Structure of a procurement bot

While price and need govern this particular webbot in deciding when to make a purchase, you can design virtually any type of procurement bot by substituting different purchase trigger events.

Get Purchase Criteria

A procurement bot first needs to gather the purchase criteria, which is a description of the item or items to purchase. The purchase criteria may range from simple part numbers to item descriptions combined with complicated calculations that determine how much you want to pay for an item.

Authenticate Buyer

Once the webbot has identified the purchase criteria, it authenticates the buyer by automatically logging in to the online store as a registered user. In almost all cases, this means the webbot must know the username and password of the person it represents.[61] (For more on how webbots handle the authentication process, see Chapter 21.)

Verify Item

Prior to purchase, procurement bots should verify that requested items are still available for sale if they were selected in advance of the actual purchase. For example, if you instruct a procurement bot to buy something in an online auction, the bot should email you if the auction is canceled and the item is no longer for sale. (Chapter 16 describes how to send email from a webbot.) The procurement process should also stop at this point. This sounds obvious, but unless you program your webbot to stop when items are no longer for sale, it may attempt to purchase unavailable items.

Evaluate Purchase Triggers

Purchase triggers determine when available merchandise meets predefined purchase criteria. When those conditions are met, the purchase is made. Bear in mind that it may take days, weeks, or even months before a buying opportunity presents itself. But when it does, you'll be the first in line (unless someone who is also running a procurement bot beats you to it).[62]

Together, the purchase criteria and purchase triggers define what your procurement bot does. If you want to pick up cheap merchandise or capitalize on price reductions, you might use price as a trigger. More complicated webbots may weigh both price and inventory levels to make purchasing decisions. Other procurement bots may make purchases based on the scarcity of merchandise. Alternatively, as we'll explore later, you may write a sniper, which uses the time an auction ends as its trigger to bid.
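To make a price trigger concrete, here is a minimal sketch of the evaluation step. It reuses http_get() and return_between() from the book's LIB_http and LIB_parse libraries, but the product URL, the $criteria array, and the <span id="price"> parse markers are assumptions for illustration only.

<?php
include("LIB_http.php");
include("LIB_parse.php");

// Hypothetical purchase criteria for one item
$criteria = array(
    'url'       => "http://store.example.com/item/1234",  // product page (assumed)
    'max_price' => 19.95                                   // buy at or below this price
);

// Fetch the product page and parse the advertised price
$page  = http_get($criteria['url'], $ref = "");
$price = floatval(return_between($page['FILE'], '<span id="price">$', '</span>', EXCL));

// Evaluate the trigger: a price at or below the limit means "buy now"
if ($price > 0 && $price <= $criteria['max_price']) {
    echo "Trigger met: item available for $".$price."\n";
    // ...hand off to the purchase routine (form emulation) here...
} else {
    echo "No purchase: current price is $".$price."\n";
}
?>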

Make Purchase

Purchases are finalized by completing and submitting forms that collect information about the purchased product, shipping address, and payment method. Your webbot should submit these forms in the same manner as described earlier in this book. (See Chapter 5 for more on writing webbots that submit forms to websites.)

Evaluate Results

After making a purchase, the target server will display a web page that confirms your purchase. Your webbot should parse the page to determine that your acquisition was successful and then communicate the result of the purchase to you. Notifications of this type are usually done through email. If the procurement bot buys many items, however, it might be better to report the status of all purchases on a web page or to send an email with the consolidated results for the entire day's activity.

[61] The exceptions to this rule are instances like the eBay API, which allow third parties to act on someone's behalf without knowing that individual's username and password.

[62] Occasionally, you may find yourself in direct competition with other webbots. I've found that when this happens, it's usually best not to get overly competitive and do things like use excessive bandwidth or server connections that might identify your presence.


Sniper Theory

Of all procurement bots, snipers are the best known, largely because of their popularity on eBay. Snipers are procurement bots that use time as their trigger event. Snipers wait until the closing seconds of an online auction and bid just before the auction ends. The intent is to make the auction's last bid and avoid price escalation caused by bidding wars. While making the last bid is what characterizes snipers, a more important feature is that they enable people to participate in online auctions without having to dedicate their time to monitoring individual items or making bids at the most opportune moments.

While eBay is the most popular target, sniping programs can purchase products from any auction website, including Yahoo!, Overstock.com, uBid, or even official US government auction sites.

The sniping process is similar to that of the procurement bots described earlier. The main differences are that the clocks on the auction website and sniper must be synchronized, and the purchase trigger is determined by the auction's end time. Figure 19-2 shows a common sniper construction.

Figure 19-2. Anatomy of a sniper

Get Purchase Criteria

The purchase criteria for an auction are generally the auction identification number and the maximum price the user is willing to pay for the item. Advanced snipers, however, may periodically look for and target any auction that matches other predefined purchase criteria like the brand or age of an item.

Authenticate Buyer

Authentication of snipers is similar to other authentication practices discussed earlier. Occasionally, snipers can authenticate users without the need for a username and password, but these techniques vary depending on the auction site and the special programming interfaces it provides. The problem of disclosing login credentials to third-party sniping services is one of the reasons people often choose to write their own snipers.

Verify Item

Many auctions end prematurely due to early cancellation by the seller or to buy-it-now purchases, which allow a bidder to buy an item for a fixed price before the auction comes to its scheduled end. For both of these reasons, snipers must periodically verify that the auction they intend to snipe is still a valid auction. Not doing so may cause a sniper to mistakenly bid on nonexistent auctions. Typically, snipers validate the auction once after collecting the purchase criteria and again just before bidding.

Synchronize Clocks

Since a sniper uses the closing time of an auction as its event trigger, the sniper and auction website must synchronize their clocks. Synchronization involves requesting the timestamp from the online auction's server and subtracting that value from the auction's scheduled end. The result is the starting value for a countdown clock. When the countdown clock approaches zero, the sniper places its bid.

A countdown clock is a more accurate method of establishing a bid time than relying on your computer's internal clock to make a bid a few seconds before the scheduled end of an auction. This is particularly true if your sniper is running on a PC, where internal clocks are notoriously inaccurate.

To guarantee synchronization of the sniper and the online auction's clock, the sniper should synchronize periodically and with increased frequency as the end of the auction nears. Periodic synchronization reduces the sniper's reliance on the accuracy of your computer's clock. Chances are, neither the clock on the auction site's server nor the one on your PC is set to the correct time, but from a sniper's perspective, the server's clock is the only one that matters.

Obtaining a server's clock value is as easy as making a header request and parsing the server's timestamp from the header, as shown in Listing 19-1.

// Include libraries
include("LIB_http.php");
include("LIB_parse.php");

// Identify the server you want to get local time from
$target = "http://www.schrenk.com";

// Request the httpd head
$header_array = http_header($target, $ref="");

// Parse the local server time from the header
$local_server_time = return_between($header_array['FILE'], $start="Date:", $stop="\n", EXCL);

// Convert the local server time to a timestamp
$local_server_time_ts = strtotime($local_server_time);

// Display results
echo "\nReturned header:\n";
echo $header_array['FILE']."\n";
echo "Parsed server timestamp = ".$local_server_time_ts."\n";
echo "Formatted server time = ".date("r", $local_server_time_ts)."\n";

Listing 19-1: A script that fetches and parses a server's time settings

When the script in Listing 19-1 is run, it displays a screen similar to the one in Figure 19-3. Here you can see that the script requests an HTTP header from a target server. It then parses the timestamp (which is identified by the line starting with Date:) from the header.

Figure 19-3. Result of running the script in Listing 19-1

It is fairly safe to assume that the target webserver's clock is the same clock that is used to time the auctions. However, as a precaution, it is worthwhile to verify that the timestamp returned from the webserver correlates to the time displayed on the auction web pages.

Once the sniper parses the server's formatted timestamp, it converts it into a Unix timestamp, an integer that represents the number of seconds that have elapsed since January 1, 1970. The use of the Unix timestamp is important because in order to perform the countdown, the sniper needs to know how many seconds separate the current time from the scheduled end of the auction. If you have Unix timestamps for both events, it's simply a matter of subtracting the current server timestamp value from the end of auction timestamp. Failure to convert to Unix timestamps results in some difficult calendar math. For example, without Unix timestamps, you may need to subtract 10:20 PM, September 19 from 8:12 AM, September 20 to obtain the time remaining in an auction.
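With both values expressed as Unix timestamps, the countdown itself is a single subtraction. The snippet below builds on Listing 19-1; the auction's scheduled end time is a hypothetical value you would normally parse from the auction page, and the ten-second lead time is only an example.

<?php
// $local_server_time_ts comes from Listing 19-1 (the parsed Date: header)
$auction_end    = "Sat, 20 Sep 2008 20:15:00 GMT";   // hypothetical scheduled end
$auction_end_ts = strtotime($auction_end);

// Seconds remaining, as seen by the auction server
$seconds_remaining = $auction_end_ts - $local_server_time_ts;
echo "Auction ends in ".$seconds_remaining." seconds\n";

// Bid when roughly ten seconds remain (tune this lead time on live auctions)
if ($seconds_remaining <= 10) {
    // ...submit the bid form here...
}
?>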

Time to Bid?

A sniper needs to make one bid, close to the auction's scheduled end but just before other bidders have time to respond to it. Therefore, you will want to make your bid a few seconds before the auction ends, but not so close to the end that the auction is over before the server has time to process your bid.
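One simple way to implement that timing is a loop that re-synchronizes against the server and sleeps in progressively shorter intervals as the end approaches. This is only a sketch of the idea: get_server_time_ts() merely wraps the header fetch from Listing 19-1, $auction_end_ts is the end-of-auction timestamp computed earlier, and the eight-second lead time is an arbitrary example you would tune on live auctions.

<?php
// Assumes LIB_http and LIB_parse are included, as in Listing 19-1
function get_server_time_ts($target)
{
    $header_array = http_header($target, $ref="");
    $server_time  = return_between($header_array['FILE'], "Date:", "\n", EXCL);
    return strtotime($server_time);
}

$lead_time = 8;   // seconds before the scheduled end to submit the bid (example)

while (true) {
    $remaining = $auction_end_ts - get_server_time_ts($target);
    if ($remaining <= $lead_time) {
        break;                                // time to bid
    }
    sleep(max(1, floor($remaining / 2)));     // sleep half the remaining time, then re-check
}
// ...submit the bid form here...
?>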

Submit Bid

Your sniper will submit bids in a manner similar to the other procurement bots, but since your bid is time sensitive, your sniper will need to anticipate how long it will take to complete the forms and get responses from the target server. You should expect to fine-tune this process on live auctions.

Evaluate Results

Evaluating the results of a sniping attempt is also similar to evaluating the purchase results of other procurement bots. The only difference is that, unlike other procurement bots, there is a possibility that you were outbid or that the sniper bid too late to win the item. For these reasons, you may want to include additional diagnostic information in the results, including the final price, and whether you were outbid or the auction ended before your bid was completed. This way, you can learn what may have gone wrong and correct problems that may reappear in future sniping attempts.


Testing Your Own Webbots and Snipers

The online store you used in Chapter 7 may also be used to test your trial procurement bots and snipers. You should feel free to make your mistakes here before you commit errors with a real procurement bot that discloses a competitive advantage or causes suspension of your privileges on an actual target website. Aspects of the test store that you may find particularly useful for testing your skills include the following:

The store requires that buyers register and authenticate themselves before making any purchase or bidding in any auction.
The prices in the store periodically change. Use this feature to design procurement bots that capitalize on unannounced price dips.

The address of the online test store is listed on this book's website, which is available at http://www.schrenk.com/nostarch/webbots.


Further Exploration

As a developer with the skills to write procurement bots, you should ask yourself what other types of purchasing agents you can write and what other parameters you can use to make purchasing decisions. Consider mapping out your particular ideas in a flowchart as I did in Figure 19-1 and Figure 19-2.

After you've honed your skills at the book's test store, consider the following ideas as starting points for developing your own procurement bots and snipers.

Develop a sniper that makes counterbids as necessary.
Design a sniper that uses scarcity of an item as criteria for purchase.
Write a procurement bot that detects price reductions.
Write a procurement bot that monitors the availability of tickets for upcoming concerts and sporting events. When it appears that the tickets for a concert or game will sell out in advance of the event, create a procurement bot that automatically purchases tickets for resale later. (Make sure not to conflict with local laws, of course.)
Write a procurement bot that monitors weather forecasts and makes stock or commodity purchases based on industries that are affected by inclement weather.


Final Thoughts

Purchasing agents are easier to write than to test. This is especially true when sniping high-value items like cars, jewelry, and industrial equipment, where mistakes are expensive. Obviously, when you're writing sniping agents that buy big-ticket items, you want to get things right the first time, but this is also true of procurement bots that buy cheaper merchandise. Here is some general advice for debugging procurement bots and snipers.

Debug code in stages, only moving to the next step after validating that the prior stage works correctly.
Assume that there are limited opportunities to test your ability to make purchases with actual trigger events. Hours, days, or even weeks may pass between purchase opportunities. Schedule ample debugging time, since the speed at which you can validate your code is directly associated with the availability of specific products to purchase.
Assume that all transactional websites, sites where money is exchanged, are closely monitored. Even though your intentions are pure, the system administrator of your target webserver may confuse your coding and process errors with hackers exploiting vulnerabilities in the server. The consequences of such mistakes may lead to loss of privileges.
Keep a low profile. Test as much as you can before communicating with the website's server, and limit the number of times you communicate with that target server.
Make sure to read Chapters 25 and 28 before deploying any procurement bot.


Chapter 20. WEBBOTS AND CRYPTOGRAPHY

Cryptography uses mathematics to secure data by applying well-known algorithms (or ciphers) to render the data unreadable to people who don't have the key, the string of bits required to unlock the code. The beauty of cryptography is that it relies on standards to secure data transmission between web clients and servers. Without these standards, it would be impossible to have consistent security across the multitude of places that require secure data transmission.

Don't confuse cryptography with obfuscation. Obfuscation attempts to obscure or hide data without standardized protocols—as a result, it is about as reliable as hiding your house key under the doormat. And since it doesn't rely on standard methods for "un-obfuscation," it is not suitable for applications that need to work in a variety of circumstances.

Encryption—the use of cryptography—created a commercial environment on the Internet, mostly by making it safe to pay for online purchases with credit cards. The World Wide Web didn't widely support encryption until 1995, shortly after the Netscape Navigator browser (paired with its Commerce Server) began supporting a protocol called Secure Sockets Layer (SSL). SSL is a private way to transmit personal data through an encrypted data transport layer. While Transport Layer Security (TLS) has superseded SSL, the new protocol only changes SSL slightly, and SSL is still the popular term used to describe web encryption. Today, all popular webservers and web browsers support encryption. (You can identify when a website begins to use encryption, because the protocol changes from http to https.[63]) If you design webbots that handle sensitive information, you will need to know how to download encrypted websites and make encrypted requests.

In addition to privacy, SSL also ensures the identity of websites by confirming that a digital certificate (what I referred to earlier as a key) was assigned to the website using SSL. This means, for example, that when you check your bank balance, you know that the web page you access is actually coming from your bank's server and is not the product of a phishing attack. This is enforced by validating the bank's certificate with the agency that assigned it to the bank's IP address. Another feature of SSL is that it ensures that web clients and servers receive all the transmitted data, because the decryption methods won't work on partial data sets.

Designing Webbots That Use Encryption

As when downloading unencrypted web pages, PHP provides choices to the webbot designer who needs to access secure servers. The following sections explore methods for requesting and downloading web pages that use encryption.

SSL and PHP Built-in Functions

In PHP version 5 or higher, you can use the standard PHP built-in functions (discussed in Chapter 3) to request and download encrypted files if you change the protocol from http: to https:. However, I wouldn't recommend using the built-in functions because they lack many features that are important to webbot developers, like automatic forwarding, form submission, and cookie support, just to name a few.
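For completeness, this is what the built-in approach looks like; the URL is a placeholder, and PHP's OpenSSL support must be enabled for https: to work with file_get_contents(). It is only a sketch of the option described above, not a recommended design.

<?php
// Download an encrypted page with a PHP built-in function (PHP 5, OpenSSL enabled)
$page = file_get_contents("https://some.domain.com");   // placeholder URL
if ($page === false) {
    echo "Download failed\n";
} else {
    echo $page;
}
?>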

Encryption and PHP/CURL

To download an encrypted web page in PHP/CURL, simply set the protocol to https:, as shown in Listing 20-1.

http_get("https://some.domain.com", $referer);

Listing 20-1: Requesting an encrypted web page

It's important to note that in some PHP distributions, the protocol may be case sensitive, and a protocol defined as HTTPS: will not work. Therefore, it's a good practice to be consistent and always specify the protocol in lowercase.

[63] Additionally, when SSL is used, the network port changes from 80 to 443.


A Quick Overview of Web Encryption

The following is a hasty overview of how web encryption works. While incomplete, it's here to provide a greater appreciation for everything PHP/CURL does and to help you be semi-literate in SSL conversations with peers, vendors, and clients.

Once a web client recognizes it is talking to a secure server, it initiates a handshake process, where the web client and server agree on the type of encryption to use. This is important because web clients and servers are typically capable of using several ciphers or encryption algorithms. Two commonly used algorithms are the Data Encryption Standard (DES) and the Message Digest Algorithm (MD5).

The server replies to the web client with a variety of data, including its encryption certificate, a long string of numbers used to authenticate the domain and tell the web client how to decrypt the data it gets from the server. The web client also sends the server a random string of data that the server uses to decrypt information originating from the client.

The process of creating an SSL connection for secure data communication should happen transparently and generally shouldn't be a concern for developers. This is regardless of the fact that creating a secure connection to a webserver requires multiple (complicated) communications between the web client and server. In the end—when set up properly—all data flowing to and from a secure website is encrypted, including all GET and POST requests and cookies. Aside from local certificates, which are explained next, that's about all webbot developers need to know about encryption. If, however, you thirst for detailed information, or you see yourself as a future Hacker Jeopardy contestant,[64] you should read the SSL specification. The full details are available at http://wp.netscape.com/eng/ssl3/ssl-toc.html.

[64] Hacker Jeopardy is a contest where contestants answer detailed questions about various Internet protocols. This game is an annual event at the hacker conference DEFCON (http://www.defcon.org).


Local Certificates

Corporate networks sometimes use local certificates to authenticate both client and server. In the vast majority of cases, however, there is no need for a local certificate—in fact, I have never been in a situation that required one. However, PHP/CURL supports local encryption certificates, and it's important to configure them even if you don't use them. Versions 7.10 and later of cURL assume that a local certificate is used and will not download any web page if the local certificate isn't defined.[65] This is counterintuitive since local certificates are seldom used; therefore, LIB_http—the library this book uses to fetch web pages and submit forms—assumes that there is no local encryption certificate and configures PHP/CURL accordingly, as shown in Listing 20-2.

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // No certificate

Listing 20-2: Telling PHP/CURL not to look for a local certificate

Later releases of cURL require this option even when no local certificate is used. For that reason, you should define this option every time you design a PHP/CURL interface.

If your webbot needs to run in a very secure network, a local certificate may be required to authenticate your webbot as a valid user of the web page or service it accesses. If you need to use a local encryption certificate, you can define one with the PHP/CURL options described in Listing 20-3.

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);   // Certificate in use
curl_setopt($ch, CURLOPT_CAINFO, $file_name);     // Certificate file name

Listing 20-3: Telling PHP/CURL how to use a local encryption certificate

On even rarer occasions, you may have to support multiple local certificates. In those cases, you can define a directory path, instead of a filename, to tell cURL where to find the location of all your encryption certificates, as shown in Listing 20-4.

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, TRUE);   // Certificate in use
curl_setopt($ch, CURLOPT_CAPATH, $path);          // Directory where multiple
                                                  // certificates are stored

Listing 20-4: Telling PHP/CURL how to use multiple local encryption certificates


[65] I learned this lesson the hard way when a client flew me to Palo Alto for a week to work on a project. None of my PHP/CURL routines worked on the client's server because it used a later version of cURL than I was using. After a few embarrassing moments, I discovered that the problem involved defining local certificates, even when they aren't used.


Final Thoughts

Occasionally, you can force an encrypted website into transferring unencrypted data by simply changing the protocol from https to http in the request. While this may allow you to download the web page, this technique is a bad idea because, in addition to potentially revealing confidential data, your webbot's actions will look unusual in server log files, which will destroy all attempts at stealth.

Sometimes web developers use the wrong protocol when designing web forms. It's important to remember that the default protocol for form submission is http, and unless specifically defined as https by the form's action attribute, the form is submitted without encryption, even if the form exists on a secure web page! Using the wrong network protocol is a common mistake made by inexperienced web developers. For that reason, when your webbot submits a form, you need to be sure it uses the same form-submission protocol that is defined by the downloaded form. For example, if you download an encrypted form page and the form's action attribute isn't defined, the protocol is http, not https! As wrong as it sounds, you need to use the same protocol defined by the web form, even if it is not the proper protocol to use in that specific case. If your webbot uses a protocol that is different than the one browsers use when submitting the form, you may cause the system administrator to scratch his or her head and investigate why one web client isn't using the same protocol everyone else is using.


Chapter 21. AUTHENTICATION

If your webbots are going to access sensitive information or handle money, they'll need to authenticate, or sign in as registered users of websites. This chapter teaches you how to write webbots that access password-protected websites. As in previous chapters, you can practice what you learn with example scripts and special test pages on the book's website.

What Is Authentication?

Authentication is the process of proving that you are who you say you are. You authenticate yourself by presenting something that only you can produce. Table 21-1 describes the three categories of things used to prove a person's identity.

Table 21-1. Things That Prove a Person's Identity

You Authenticate Yourself With . . .   Examples
Something you know                     Usernames and passwords; Social Security numbers
Something you are (biometrics)         DNA samples; thumbprints; retina, voice, and facial scans
Something you have                     House keys, digital certificates, encoded magnetic cards, wireless key fobs, implanted canine microchips

Types of Online Authentication

Most websites that require authentication ask for usernames and passwords (something you know). The username and password—also known as login criteria—are compared to records in a database. The user is allowed access to the website if the login criteria match the records in the database. Based on the login criteria, the website may optionally restrict the user to specific parts of the website or grant specific functionality. Usernames and passwords are the most convenient way to authenticate people online because they can be authenticated with a browser and without the need for additional hardware or software.

Websites also authenticate through the use of digital certificates (something you have), which must be exchanged between client and server and validated before access to a website or service is granted. The intricacies of digital certificates are described in Chapter 20. If you skipped that chapter, this is a good time to read it. Otherwise, all you need to know is that digital certificates are files that reside on servers, or less frequently, on the hard drives of client computers. The contents of these certificate files are automatically exchanged to authenticate the computer that holds the certificate. You're most apt to encounter digital certificates when using the HTTPS protocol (also known as SSL) to access secure websites. Here, the certificate authenticates the website and facilitates the use of an encrypted data channel. Less frequently, a certificate is required on the client computer as well, to access virtual private networks (VPNs), which allow remote users to access private corporate networks. PHP/CURL manages certificates automatically if you specify the https: protocol in the URL. PHP/CURL also facilitates the use of local certificates; in the odd circumstance that you require a client-side certificate, PHP/CURL and client-side certificates are covered in Appendix A.

Biometrics (something you are) are generally not used in online authentication and are beyond the scope of this chapter. Personally, I have only seen biometrics used to authenticate users to online services when biometric information is readily available, as in telemedicine.

Strengthening Authentication by Combining Techniques

Your webbots may encounter websites that use multiple forms of authentication, since authentication is strengthened when two or more techniques are combined. For example, ATMs require both an ATM card (something you have) and a personal identification number (PIN) (something you know). Similarly, the retailer Target experimented with an ATM-style authentication scheme when it introduced USB credit card readers that worked in conjunction with Target.com.

Authentication and Webbots

You may very well encounter certificates—and even biometrics—as a webbot developer, so the more familiar you are with the various forms of authentication, the more potential targets your webbots will have. You'll find, however, that most webbots authenticate with simple usernames and passwords. The following sections describe the most common techniques for using usernames and passwords.


Example Scripts and Practice Pages

We'll explore three types of online authentication. For each case, you'll receive examples of authentication scripts designed specifically to work with password-protected sections of this book's website. You can experiment (and make mistakes) on these practice pages before writing authenticating webbots that work on real websites. The location of the practice pages is shown in Table 21-2.

Table 21-2. Location of Authentication Practice Pages on the Book's Website

Authentication Method   Location of Practice Pages
Basic authentication    http://www.schrenk.com/nostarch/webbots/basic_authentication
Cookie sessions         http://www.schrenk.com/nostarch/webbots/cookie_authentication
Query sessions          http://www.schrenk.com/nostarch/webbots/query_authentication

For simplicity, all of the authentication examples on the book's website use the login criteria shown in Table 21-3.

Table 21-3. Login Criteria Used for All Authentication Practice Pages


Username Password

webbot sp1der3


Basic Authentication

The most common form of online authentication is basic authentication. Basic authentication is a dialogue between the webserver and browsing agent in which the login credentials are requested and processed, as shown in Figure 21-1.

Web pages subject to basic authentication exist in what's called a realm. Generally, realms refer to all web pages in the current server directory as well as the web pages in sub-directories. Fortunately, browsers shield people from many of the details defined in Figure 21-1. Once you authenticate yourself with a browser, it appears that you don't re-authenticate yourself when accessing other pages within the realm. In reality, the dialogue from Figure 21-1 happens for each page downloaded within the realm. Your browser automatically resubmits your authentication credentials without asking you again for your username and password. When accessing a basic authenticated website with a webbot, you will need to send your login credentials every time the webbot requests a page within the authenticated realm, as shown later in the example script.


Figure 21-1. Basic authentication dialogue

Before you write an auto-authenticating webbot, you should first visit the target website and manually authenticate yourself into the site with a browser. This way you can validate your login credentials and learn about the target site before you design your webbot. When you request a web page from the book's basic authentication test area, your browser will initially present a login form for entering usernames and passwords, as shown in Figure 21-2.

Figure 21-2. Basic authentication login form

After entering your username and password, you will gain access to a simple set of practice pages (shown in Figure 21-3) for testing auto-authenticating webbots and basic authentication. You should familiarize yourself with these simple pages before reading further.


Figure 21-3. Basic authentication test pages

The commands required to download a web page with basic authentication are very similar to those required to download a page without authentication. The only change is that you need to configure the CURLOPT_USERPWD option to pass the login credentials to PHP/CURL. The format for login credentials is the username and password separated by a colon, as shown in Listing 21-1.

<?
# Define target page
$target = "http://www.schrenk.com/nostarch/webbots/basic_authentication/index.php";

# Define login credentials for this page
$credentials = "webbot:sp1der3";

# Create the cURL session
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $target);            // Define target site
curl_setopt($ch, CURLOPT_USERPWD, $credentials);   // Send credentials
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);    // Return page in string

# Echo page
$page = curl_exec($ch);   // Place web page into a string
echo $page;               // Echo downloaded page

# Close the cURL session
curl_close($ch);
?>

Listing 21-1: The minimal code required to access the basic authentication test pages

Once the favored form of authentication, basic authentication is losing out to other techniques because it is weaker. For example, with basic authentication, there is no way to log out without closing your browser. There is also no way to change the appearance of the authentication form because the browser creates it. Basic authentication is also not very secure, as the browser sends the login criteria to the server in cleartext. Digest authentication is an improvement over basic authentication. Unlike basic authentication, digest authentication sends the password to the server as a 128-bit MD5 digest. Unfortunately, support for digest authentication is spotty, especially with older browsers. If you're using PHP 5, you can use the curl_setopt() function to tell PHP/CURL which form of authentication to use. Since we're focusing on PHP 4, let's limit this discussion to basic authentication, though the process is otherwise identical with digest authentication.
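The chapter doesn't show that option, so the following is only a sketch of how it is commonly done with PHP/CURL. CURLOPT_HTTPAUTH and the CURLAUTH_* constants require PHP 5 with a reasonably recent cURL build, and the target URL and credentials here are placeholders.

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://some.domain.com/protected/");  // placeholder
curl_setopt($ch, CURLOPT_USERPWD, "webbot:sp1der3");
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_DIGEST);    // request digest authentication
// curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_ANY);    // or let cURL negotiate the method
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($ch);
curl_close($ch);
echo $page;
?>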


Session Authentication

Unlike basic authentication, in which login credentials are sent each time a page is downloaded, session authentication validates users once and creates a session value that represents that authentication. The session values (instead of the actual username and password) are passed to each subsequent page fetch to indicate that the user is authenticated. There are two basic methods for employing session authentication—with cookies and with query strings. These methods are nearly identical in execution and work equally well. You're apt to encounter both forms of sessions as you gain experience writing webbots.

Authentication with Cookie Sessions

Cookies are small pieces of information that servers store on your hard drive. Cookies are important because they allow servers to identify unique users. With cookies, websites can remember preferences and browsing habits (within the domain), and use sessions to facilitate authentication.

How Cookies Work

Servers send cookies in HTTP headers. It is up to the client software to parse the cookie from the header and save the cookie values for later use. On subsequent fetches within the same domain, it is the client's responsibility to send the cookies back to the server in the HTTP header of the page request. In our cookie authentication example, the cookie session can be viewed in the header returned by the server, as shown in Listing 21-2.

HTTP/1.1 302 Found
Date: Sat, 09 Sep 2006 16:09:03 GMT
Server: Apache/2.0.58 (FreeBSD) mod_ssl/2.0.58 OpenSSL/0.9.8a PHP/4.4.2
X-Powered-By: PHP/4.4.2
Set-Cookie: authenticate=1157818143
Location: index0.php
Content-Length: 1837
Content-Type: text/html; charset=ISO-8859-1

Listing 21-2: Cookies returned from the server in the HTTP header

The Set-Cookie line (shown in bold in the original listing) defines the name of the cookie and its value. In this case there is one cookie named authenticate with the value 1157818143.

Sometimes cookies have expiration dates, which is an indication that the server wants the client to write the cookie to a file on the hard drive. Other times, as in our example, no expiration date is specified. When no expiration date is specified, the server requests that the browser save the cookie in RAM and delete it when the browser closes. For security reasons, authentication cookies typically have no expiration date and are stored in RAM.

When authentication is done using a cookie, each successive page within the website examines the session cookie, and, based on internal rules, determines whether the web agent is authorized to download that web page. The actual value of the cookie session is of little importance to the webbot, as long as the value of the cookie session matches the value expected by the target webserver. In many cases, as in our example, the session also holds a time-out value that expires after a limited period. Figure 21-4 shows a typical cookie authentication session.


Figure 21-4. Authentication with cookie sessions
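If you ever need to pull a session cookie out of a raw header yourself, rather than letting PHP/CURL manage it, a parse like the following works. It reuses return_between() from the book's LIB_parse library; the hard-coded header string is just a stand-in for the response shown in Listing 21-2.

<?php
include("LIB_parse.php");

// Stand-in for a raw response header like the one in Listing 21-2
$http_header = "HTTP/1.1 302 Found\r\n"
             . "Set-Cookie: authenticate=1157818143\r\n"
             . "Location: index0.php\r\n";

$cookie_line = return_between($http_header, "Set-Cookie: ", "\n", EXCL);
list($cookie_name, $cookie_value) = explode("=", trim($cookie_line), 2);

echo "Cookie name  = ".$cookie_name."\n";   // authenticate
echo "Cookie value = ".$cookie_value."\n";  // 1157818143
?>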

Unlike basic authentication, where the login criteria are sent in a generic (browser-dependent) form, cookie authentication uses custom forms, as shown in Figure 21-5.

Figure 21-5. The login page for the cookie authentication example

Regardless of the authentication method used by your target web page, it's vitally important to explore your target screens with a browser before writing self-authenticating webbots. This is especially true in this example, because your webbot must emulate the login form. You should take this time to explore the cookie authentication pages on this book's website. View the source code for each page, and see how the code works. Use your browser to monitor the values of the cookies the web pages use. Now is also a good time to preview Chapter 22.

Figure 21-6 shows an example of the screens that lie beyond the login screen.

Figure 21-6. The example cookie session page from the book's website

Cookie Session Example

A webbot must do the following to authenticate itself to a website that uses cookie sessions:

Download the web page with the login form
Emulate the form that gathers the login credentials
Capture the cookie written by the server
Provide the session cookie to the server on each page request

The script in Listing 21-3 first downloads the login page as a normal user would with a browser. As it emulates the form that sends the login credentials, it uses the CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR options to tell cURL where the cookies should be written and where to find the cookies that are read by the server. To most people (myself included), it seems redundant to have one set of outbound cookies and another set of inbound cookies. In every case I've seen, webbots use the same file to write and read cookies. It's important to note that PHP/CURL will always save cookies to a file, even when the cookie has no expiration date. This presents some interesting problems, which are explained in Chapter 22.

<?
# Define target page
$target = "http://www.schrenk.com/nostarch/webbots/cookie_authentication/index.php";

# Define the login form data
$form_data = "enter=Enter&username=webbot&password=sp1der3";

# Create the cURL session
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $target);                // Define target site
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);        // Return page in string
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");    // Tell cURL where to write cookies
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");   // Tell cURL which cookies to send
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $form_data);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);        // Follow redirects

# Execute the PHP/CURL session and echo the downloaded page
$page = curl_exec($ch);
echo $page;

# Close the cURL session
curl_close($ch);
?>

Listing 21-3: Auto-authentication with cookie sessions

Once the session cookie is written, your webbot should be able to download any authenticated page, as long as the cookie is presented to the website by your cURL session. Just one word of caution: Depending on your version of cURL, you may need to use a complete path when defining your cookie file.

Authentication with Query Sessions

Query string sessions are nearly identical to cookie sessions, the difference being that instead of storing the session value in a cookie, the session value is stored in the query string. Other than this difference, the process is identical to the protocol describing cookie session authentication (outlined in Figure 21-4). Query sessions create additional work for website developers, as the session value must be tacked on to all links and included in all form submissions. Yet some web developers (myself included) prefer query sessions, as some browsers and proxies restrict the use of cookies and make cookie sessions difficult to implement.

This is a good time to manually explore the test pages for query authentication on the website. Once you enter your username and password, you'll notice that the authentication session value is visible in the URL as a GET value, as shown in Figure 21-7. However, this may not be the case in all situations, as the session value could also be in a POST value and invisible to the viewer.


Figure 21-7. Session variable visible in the query string (URL)

Like the cookie session example, the query session example first emulates the login form. Then it parses the session value from the authenticated result and includes the session value in the query string of each page it requests. A script capable of downloading pages from the practice pages for query session authentication is shown in Listing 21-4.

<?
# Include libraries
include("LIB_http.php");
include("LIB_parse.php");

# Request the login page
$domain = "http://www.schrenk.com/";
$target = $domain."nostarch/webbots/query_authentication";
$page_array = http_get($target, $ref="");

echo $page_array['FILE'];   // Display the login page
sleep(2);                   // Include small delay between page fetches
echo "<hr>";

# Send the query authentication form
$login = $domain."nostarch/webbots/query_authentication/index.php";

$data_array['enter']    = "Enter";
$data_array['username'] = "webbot";
$data_array['password'] = "sp1der3";
$page_array = http_post_form($login, $ref=$target, $data_array);
echo $page_array['FILE'];   // Display first page after login page
sleep(2);                   // Include small delay between page fetches
echo "<hr>";

# Parse session variable
$session = return_between($page_array['FILE'], "session=", "\"", EXCL);

# Request subsequent pages using the session variable
$page2 = $target . "/index2.php?session=".$session;
$page_array = http_get($page2, $ref="");
echo $page_array['FILE'];   // Display page two
?>

Listing 21-4: Authenticating a webbot on a page using query sessions

Figure 21-8. Output of Listing 21-4


Final Thoughts

Here are a few additional things to remember when writing webbots that access password-protected websites.

For clarity, the examples in this chapter use a minimal amount of code to perform a task. In actual use, you'll want to follow the comprehensive practices mentioned elsewhere in this book for downloading pages, parsing results, emulating forms, using cURL, and writing fault-tolerant webbots.
It's important to note that no form of online authentication is effective unless it is accompanied by encryption. After all, it does little good to authenticate users if sensitive information is sent across the network in cleartext, which can be read by anyone with a packet sniffer.[66] In most cases, authentication will be combined with encryption. For more information about webbots and encryption, revisit Chapter 20.
If your webbot communicates with more than one domain, you need to be careful not to broadcast your login criteria when writing webbots that use basic authentication. For example, if you hard-code your username and password into a PHP/CURL routine, make sure that you don't use the same function when fetching pages from other domains. This sounds silly, but I've seen it happen, resulting in cleartext login credentials in server log files.
Websites may use a combination of two or more authentication types. For example, an authenticated site might use both query and cookie sessions. Make sure that you account for all potential authentication schemes before releasing your webbots.
The latest versions of all the scripts used in this chapter are available for download at this book's website.

[66] A packet sniffer is a special type of agent that lets people read raw network traffic.


Chapter 22. ADVANCED COOKIE MANAGEMENT

In the previous chapter, you learned how to use cookies to authenticate webbots to access password-protected websites. This chapter further explores cookies and the challenges they present to webbot developers.

How Cookies Work

Cookies are small pieces of ASCII data that websites store on your computer. Without using cookies, websites cannot distinguish between new visitors and those that visit on a daily basis. Cookies add persistence, the ability to identify people who have previously visited the site, to an otherwise stateless environment. Through the magic of cookies, web designers can write scripts to recognize people's preferences, shipping address, login status, and other personal information.

There are two types of cookies. Temporary cookies are stored in RAM and expire when the client closes his or her browser; permanent cookies live on the client's hard drive and exist until they reach their expiration date (which may be so far into the future that they'll outlive the computer they're on). For example, consider the script in Listing 22-1, which writes one temporary cookie and one permanent cookie that expires in one hour.

# Set cookie that expires when browser closes
setcookie("TemporaryCookie", "66");

# Set cookie that expires in one hour
setcookie("PermanentCookie", "88", time() + 3600);

Listing 22-1: Setting permanent and temporary cookies with PHP

Listing 22-1 shows the cookies' names, values, and expiration dates, if required. Figure 22-1 and Figure 22-2 show how the cookies written by the script in Listing 22-1 appear in the privacy settings of a browser.

Figure 22-1. A temporary cookie written from http://www.schrenk.com, with a value of 66

Figure 22-2. A permanent cookie written from http://www.schrenk.com, with a value of 88

Browsers and webservers exchange cookies in HTTP headers. When a browser requests a web page from a webserver, it looks to see if it has any cookies previously stored by that web page's domain. If it finds any, it will send those cookies to the webserver in the HTTP header of the fetch request. When you execute the cURL command in Figure 22-3, you can see the cookies as they appear in the returned header.


Figure 22-3. Cookies as they appear in the HTTP header sent by the server

A browser will never modify a cookie unless it expires or unless the user erases it using the browser's privacy settings. Servers, however, may write new information to cookies every time they deliver a web page. These new cookie values are then passed to the web browser in the HTTP header, along with the requested web page. According to the specification, a browser will only expose cookies to the domain that wrote them. Webbots, however, are not bound by these rules and can manipulate cookies as needed.


PHP/CURL and Cookies

You can write webbots that support cookies without using PHP/CURL, but doing so adds to the complexity of your designs. Without PHP/CURL, you'll have to read each returned HTTP header, parse the cookies, and store them for later use. You will also have to decide which cookies to send to which domains, manage expiration dates, and return everything correctly in headers of page requests. PHP/CURL does all this for you, automatically. Even with PHP/CURL, however, cookies pose challenges to webbot designers. Fortunately, PHP/CURL does support cookies, and we can effectively use it to capture the cookies from the previous example, as shown in Listing 22-2.

include("LIB_http.php");$target="http://www.schrenk.com/nostarch/webbots/EXAMPLE_writing_cookies.php";

http_get($target, "");

Listing 22-2: Reading cookies with PHP/CURL and the LIB_http library

LIB_http defines the file where PHP/CURL stores cookies. This declaration is done near the beginning of the file, as shown in Listing 22-3.

# Location of your cookie file (must be a fully resolved address)
define("COOKIE_FILE", "c:\cookie.txt");

Listing 22-3: Cookie file declaration, as made in LIB_http

As noted in Listing 22-3, the address for a cookie file should be a fully resolved local one. Relative addresses sometimes work, but not for all PHP/CURL distributions. When you execute the scripts in Listing 22-1 (available at this book's website), PHP/CURL writes the cookies (in Netscape Cookie Format) in the file defined in the LIB_http configuration, as shown in Listing 22-4.

# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html

# This file was generated by libcurl! Edit at your own risk.

www.schrenk.com FALSE /nostarch/webbots/ FALSE 1159120749 PermanentCookie 88

www.schrenk.com FALSE /nostarch/webbots/ FALSE 0 TemporaryCookie 66

Listing 22-4: The cookie file, as written by PHP/CURL

Note

Each web client maintains its own cookies, and the cookie file written by PHP/CURL is not the same cookie file created by your browser.


How Cookies Challenge Webbot Design

Webservers will not think anything is wrong if your webbots don't use cookies, since many people configure their browsers not to accept cookies for privacy reasons. However, if your webbot doesn't support cookies, you will not be able to access sites that demand their use. Moreover, if your webbot doesn't support cookies correctly, you will lose your webbot's stealthy properties. You also risk revealing sensitive information if your webbot returns cookies to servers that didn't write them.

Cookies operate transparently—as such, we may forget that they even exist. Yet the data passed in cookies is just as important as the data transferred in GET or POST methods. While PHP/CURL handles cookies for webbot developers, some instances still cause problems—most notably when cookies are supposed to expire or when multiple users (with separate cookies) need to use the same webbot.

Purging Temporary Cookies

One of the problems with the way PHP/CURL manages cookies is that as PHP/CURL writes them to the cookie file, they all become permanent, just like a cookie written to your hard drive by a browser. My experience indicates that all cookies accepted by PHP/CURL become permanent, regardless of the webserver's intention. This in itself is usually not a problem, unless your webbot accesses a website that manages authentication with temporary cookies. If you fail to purge your webbot's temporary cookies, and it accesses the same website for a year, that essentially tells the website's system administrator that you haven't closed your browser (let alone rebooted your computer!) for the same period of time. Since this is not a likely scenario, your account may receive unwanted attention or your webbot may eventually violate the website's authentication process. There is no configuration within PHP/CURL for managing cookie expiration, so you need to manually delete your cookies every so often in order to avoid these problems.
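One way to perform that purge, sketched below, is to rewrite the Netscape-formatted cookie file and drop every entry whose expiration field is 0, which is how PHP/CURL records session cookies (see Listing 22-4). The file location comes from the COOKIE_FILE constant defined in LIB_http; the rest of the script is an illustrative assumption, not code from the book.

<?php
include("LIB_http.php");   // defines COOKIE_FILE

// Remove session (temporary) cookies from PHP/CURL's cookie file.
// In Netscape cookie format the fifth tab-separated field is the
// expiration timestamp; 0 marks a cookie meant to die with the session.
$lines = file(COOKIE_FILE);
$keep  = array();
foreach ($lines as $line) {
    $fields = explode("\t", trim($line));
    if (count($fields) == 7 && intval($fields[4]) == 0) {
        continue;                              // drop temporary cookie
    }
    $keep[] = rtrim($line, "\r\n");            // keep comments and permanent cookies
}
file_put_contents(COOKIE_FILE, implode("\n", $keep)."\n");
?>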

Managing Multiple Users' Cookies

In some applications, your webbots may need to manage cookies for multiple users. For example, suppose you write one of the procurement bots or snipers mentioned in Chapter 19. You may want to integrate the webbot into a website where several people may log in and specify purchases. If these people each have private accounts at the e-commerce website that the webbot targets, each user's cookies will require separate management.

Webbots can manage multiple users' cookies by employing a separate cookie file for each user. LIB_http, however, does not support multiple cookie files, so you will have to write a scheme that assigns the appropriate cookie file to each user. Instead of declaring the name of the cookie file once, as is done in LIB_http, you will need to define the cookie file each time a PHP/CURL session is used. For simplicity, it makes sense to use the person's username in the cookie file, as shown in Listing 22-5.

# Open a PHP/CURL session
$s = curl_init();

# Select the cookie file (based on username)
$cookie_file = "c:\\bots\\".$username."cookies.txt";
curl_setopt($s, CURLOPT_COOKIEFILE, $cookie_file);   // Read cookie file
curl_setopt($s, CURLOPT_COOKIEJAR, $cookie_file);    // Write cookie file

# Configure the cURL command
curl_setopt($s, CURLOPT_URL, $target);               // Define target site
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);       // Return in string

# Indicate that there is no local SSL certificate
curl_setopt($s, CURLOPT_SSL_VERIFYPEER, FALSE);      // No certificate

curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE);       // Follow redirections
curl_setopt($s, CURLOPT_MAXREDIRS, 4);               // Limit redirections to four

# Execute the cURL command (Send contents of target web page to string)
$downloaded_page = curl_exec($s);

# Close PHP/CURL session
curl_close($s);

Listing 22-5: A PHP/CURL script, capable of managing cookies for multiple users


Further Exploration

While PHP/CURL's cookie management is extremely useful to webbot developers, it has a few shortcomings. Here are some ideas for improving on what PHP/CURL already does.

- Design a script that reads cookies directly from the HTTP header and programmatically sends the correct cookies back to the server in the HTTP header of page requests. While you're at it, improve on PHP/CURL's ability to manage cookie expiration dates.
- For security reasons, sometimes administrators do not allow scripts running on hosted webservers to write local files. When this is the case, PHP/CURL is not able to maintain cookie files. Resolve this problem by writing a MySQL-based cookie management system.
- Write a webbot that pools cookies written by two or more webservers. Find a useful application for this exploit.
- Write a script that, on a daily basis, deletes temporary cookies from PHP/CURL's Netscape-formatted cookie file.


Chapter 23. SCHEDULING WEBBOTS AND SPIDERS

Up to this point, all of our webbots have run only when executed directly from a command line or through a browser. In real-world situations, however, you may want to schedule your webbots and spiders to run automatically. This chapter describes methods for scheduling webbots to run unattended in a Windows environment. Most readers should have access to the scheduling tool I'll be using here.

If you are using an operating system other than Windows, don't despair. Most operating systems support scheduling software of some type. In Unix, Linux, and Mac OS X environments, you can always use the cron command, a text-based scheduling tool. Regardless of the operating system you use, there should also be a graphical interface for a scheduling tool, similar to the one Windows uses.

The Windows Task Scheduler

The Windows Task Scheduler is an easy-to-use graphical user interface (GUI) designed for the somewhat complex duty of scheduling tasks. You can access the Task Scheduler through the Control Panel or in the Accessories directory, under System Tools.

To see the tasks currently scheduled on your computer, simply click Scheduled Tasks. In addition to showing the schedule and status of these tasks, this window is also the tool you'll use to create new scheduled tasks. It will look like the one in Figure 23-1.

Figure 23-1. The Windows Task Scheduler

Preparing Your Webbots to Run as Scheduled Tasks

Before you schedule your webbot to run automatically, you should create a batch file that executes the webbot. It is easier to schedule a batch file than to specify the PHP file directly, because the batch file adds flexibility in defining path names and allows multiple webbots, or events, to run from the same scheduled task. Listing 23-1 shows the format for executing a PHP webbot from a batch file.

drive:/php_path/php drive:/webbot_path/my_webbot.php

Listing 23-1: Executing a local webbot from a batch file

In the batch file shown in Listing 23-1, the operating system executes the PHP interpreter, which subsequently executes my_webbot.php. Similarly, Listing 23-2 shows how a batch file can use cURL to execute a webbot that resides on a remote server.

drive:/curl_path/curl http://www.somedomain.com/remote_webbot.php

Listing 23-2: Executing a remote webbot from a batch file

Scheduling a Webbot to Run Daily

To schedule a daily execution of your batch file, click Add Scheduled Task in the Task Scheduler window. This will initiate a wizard, which walks you through the process of creating a schedule of execution times for your application. The first step is to identify the application you want to schedule. To schedule your webbot, click the Browse button to locate the batch file that executes it, as shown in Figure 23-2.

Figure 23-2. Selecting an application to schedule

Once you select the webbot you want to schedule—in this example, test_webbot.bat—the wizard asks for the periodicity, or the frequency of execution. Windows allows you to schedule a task to run daily, weekly, monthly, just once, when the computer starts, or when you log on, as shown in Figure 23-3.


Figure 23-3. Configuring the periodicity of your webbot

After selecting a period, you will specify the time of day you want your webbot to execute. You can also specify whether the webbot will run every day or only on weekdays, as shown in Figure 23-4. You can even schedule a webbot to skip one day or more.

Additionally, you can set the entire schedule to begin sometime in the future. For example, the configuration shown in Figure 23-4 will cause the webbot to run Monday through Friday at 6:20 PM, commencing on January 16, 2008.


Figure 23-4. Configuring the time and days your webbot will run

The final step of the scheduling wizard is to enter your Windows username and password, as shown in Figure 23-5. This will allow your webbot to run without Windows prompting you for authentication.

Figure 23-5. Entering a username and password to authenticate your webbot

On completing the wizard, the scheduler displays your new scheduled task, as shown in Figure 23-6.

Figure 23-6. The Task Scheduler showing the status of test_webbot's schedule


Complex Schedules

There are several ways to satisfy the need for a complex schedule. The easiest solution may be to schedule additional tasks. For example, if you need to run a webbot once at 6:20 PM and again at 6:45 PM, the simplest solution is to create another task that runs the same webbot at the later time.

The Task Scheduler is also capable of managing very complex schedules. If you right-click your webbot in the Task Scheduler window, select the Schedule tab, and then click the Advanced button, you can create the schedule shown in Figure 23-7, which runs the webbot every 10 minutes from 6:20 PM to 9:10 PM, every weekday except Wednesdays, starting on January 16, 2008.


Figure 23-7. An advanced weekly schedule

If a monthly period is required, you can specify which month and days you want the webbot to run. The configuration in Figure 23-8 describes a schedule that launches a webbot on the first Wednesday of every month.


Figure 23-8. Scheduling webbots to launch monthly


Non-Calendar-Based Triggers

Calendar events, like those examined in this chapter, are not the only events that may trigger a webbot to run. However, other types of triggers usually require that a scheduled task run periodically to detect if the non-calendar event has occurred. For example, the script in the following listings uses techniques discussed in Chapter 15 to trigger a webbot to run after receiving an email with the words Run the webbot in the subject line.

First, the webbot initializes itself to read email and establishes the location of the webbot it will run when it receives the triggering email message, as shown in Listing 23-3.

// Include the POP3 command library
include("LIB_pop3.php");

define("SERVER", "your.mailserver.net");   // Your POP3 mail server
define("USER", "[email protected]");       // Your POP3 email address
define("PASS", "your_password");           // Your POP3 password

$webbot_path = "c:\\webbots\\view_competitor.bat";

Listing 23-3: Initializing the webbot that is triggered via email

Once the initialization is complete, this webbot attempts to make a connection to the mail server, as shown in Listing 23-4.

// Connect to POP3 server
$connection_array = POP3_connect(SERVER, USER, PASS);
$POP3_connection = $connection_array['handle'];

Listing 23-4: Making a mail server connection

As shown in Listing 23-5, once a successful connection to the mail server is made, this webbot looks at each pending message to determine if it contains the trigger phrase Run the webbot. When this phrase is found, the webbot identified during initialization is executed in a shell.

if($POP3_connection)
    {
    // Create an array of received messages
    $email_array = POP3_list($POP3_connection);

    // Examine each message in $email_array
    for($xx=0; $xx<count($email_array); $xx++)
        {
        // Get each email message
        list($mail_id, $size) = explode(" ", $email_array[$xx]);
        $message = POP3_retr($POP3_connection, $mail_id);

        // Run the webbot if email subject contains "Run the webbot"
        if(stristr($message, "Subject: Run the webbot"))
            {
            $output = shell_exec($webbot_path);
            echo "<pre>$output</pre>";

            // Delete message, so we don't trigger another event from this email
            POP3_delete($POP3_connection, $mail_id);
            }
        }
    }

Listing 23-5: Reading each message and executing a webbot when a specific email is received

Once the webbot runs, it deletes the triggering email message so it won't mistakenly be executed a second time on subsequent checks for email messages containing the trigger phrase.


Final Thoughts

Now that you know how to automate the task of launching webbots from both scheduled and non-scheduled events, it's time for a few words of caution.

Determine the Webbot's Best Periodicity

A common question when deploying webbots is how often to schedule a webbot to check if data has changed on a target server. The answer to this question depends on your need for stealth and how often the target data changes. If your webbot must run without detection, you should limit the number of file accesses you perform, since every file your webbot downloads leaves a clue to its existence in the server's log file. Your webbot becomes increasingly obvious as it creates more and more log entries.

The periodicity of your webbot's execution may also hinge on how often your target changes. Additionally, you may require notification as soon as a particularly important website changes. Timeliness may drive the need to run the webbot more frequently. In any case, you never want to run a webbot more often than necessary. You should read Chapter 28 before you deploy a webbot that runs frequently or consumes excessive bandwidth from a server.

I always contend that you shouldn't access a target more often than is necessary to perform a job. If that need for expedience requires that you connect to a target more than once every hour or so, you're probably hitting it too hard. Obviously, the rules change if you own the target server.

Avoid Single Points of Failure

Remember that hardware and software are both subject to unexpected crashes. If your webbot performs a mission-critical task, you should ensure that your scheduler doesn't create a single point of failure or execute a process step that may cause an entire webbot to fail if that one step crashes. Chapter 25 describes methods to ensure that your webbot does not stop working if a scheduled webbot fails to run.

Add Variety to Your Schedule

The other potential problem with scheduled tasks is that they run precisely and repeatedly, creating entries in the target's access log at the same hour, minute, and second. If you schedule your webbot to run once a month, this may not be a problem, but if a webbot runs daily at exactly the same time, it will become obvious to any competent system administrator that a webbot, and not a human, is accessing the server. If you want to schedule a webbot that emulates a human using a browser, you should continue on to Chapter 24 for more information.


Part IV. LARGER CONSIDERATIONS

As you develop webbots and spiders, you will soon learn (or wish you had learned) that there is more to webbot and spider development than mastering the underlying technologies. Beyond technology, your webbots need to coexist with society—and perhaps more importantly, they need to coexist with the system administrators of the sites you target. This section attempts to guide you through the larger considerations of webbot and spider development with the hope of keeping you out of trouble.

Chapter 24

Sometimes it is best if webbots are indistinguishable from normal Internet traffic. In this chapter, I'll explain when and how stealth is important to webbots and how to design and deploy webbots that look like normal browser traffic.

Chapter 25

Since the Internet is constantly changing, it is a good idea to design webbots that will be less likely to fail if your target websites change. In this chapter, we'll focus on methods to design fault tolerance into your webbots and spiders so they will more easily adapt (or at least gracefully fail) when websites change.

Chapter 26

Here I'll explain how and why to write web pages that are easy for webbots and spiders to download and analyze, with a special focus on the needs of search engine spiders. You will also learn how to write specialized interfaces, designed specifically to transfer data from websites to webbots.

Chapter 27

In this chapter, we'll explore techniques for writing web pages that protect sensitive information from webbots and spiders, while still accommodating normal browser users.

Chapter 28

Possibly the most important part of this book, this chapter discusses the possible legal issues you may encounter as a webbot developer and tells you how to avoid them.


Chapter 24. DESIGNING STEALTHY WEBBOTS AND SPIDERS

This chapter explores design and implementation considerations that make webbots hard to detect. However, the inclusion of a chapter on stealth shouldn't imply that there's a stigma associated with writing webbots; you shouldn't feel self-conscious about writing webbots, as long as your goals are to create legal and novel solutions to tedious tasks. Most of the reasons for maintaining stealth have more to do with maintaining competitive advantage than covering the tracks of a malicious web agent.

Why Design a Stealthy Webbot?

Webbots that create competitive advantages for their owners often lose their value shortly after they're discovered by the targeted website's administrator. I can tell you from personal experience that once your webbot is detected, you may be accused of creating an unfair advantage for your client. This type of accusation is common against early adopters of any technology. (It is also complete bunk.) Webbot technology is available to any business that takes the time to research and implement it. Once it is discovered, however, the owner of the target site may limit or block the webbot's access to the site's resources. The other thing that can happen is that the administrator will see the value that the webbot offers and will create a similar feature on the site for everyone to use.

Another reason to write stealthy webbots is that system administrators may misinterpret webbot activity as an attack from a hacker. A poorly designed webbot may leave strange records in the log files that servers use to track web traffic and detect hackers. Let's look at the errors you can make and how these errors appear in the log files of a system administrator.

Log Files

System administrators can detect webbots by looking for odd activity in their log files, which record access to servers. There are three types of log files for this purpose: access logs, error logs, and custom logs (Figure 24-1). Some servers also deploy special monitoring software to parse and detect anomalies from normal activity within log files.

Figure 24-1. Windows' log files recording file access and errors (Apache running on Windows)

Access Logs

As the name implies, access logs record information related to the access of files on a webserver. Typical access logs record the IP address of the requestor, the time the file was accessed, the fetch method (typically GET or POST), the file requested, the HTTP code, and the size of the file transfer, as shown in Listing 24-1.

221.2.21.16 - - [03/Feb/2008:14:57:45 −0600] "GET / HTTP/1.1" 200 1494
12.192.2.206 - - [03/Feb/2008:14:57:46 −0600] "GET /favicon.ico HTTP/1.1" 404 283
27.116.45.118 - - [03/Feb/2008:14:57:46 −0600] "GET /apache_pb.gif HTTP/1.1" 200 2326
214.241.24.35 - - [03/Feb/2008:14:57:50 −0600] "GET /test.php HTTP/1.1" 200 41

Listing 24-1: Typical access log entries

Access log files have many uses, like metering bandwidth and controlling access. Know that the webserver records every file download your webbot requests. If your webbot makes 50 requests a day from a server that gets 200 hits a day, it will become obvious to even a casual system administrator that a single party is making a disproportionate number of requests, which will raise questions about your activity.

Also, remember that using a website is a privilege, not a right. Always assume that your budget of accesses per day is limited, and if you go over that limit, it is likely that a system administrator will attempt to limit your activity when he or she realizes a webbot is accessing the website. You should strive to limit the number of times your webbot accesses any site. There are no definite rules about how often you can access a website, but remember that if an individual system administrator decides your IP is hitting a site too often, his or her opinion will always trump yours.[67] If you ever exceed your bandwidth budget, you may find yourself blocked from the site.
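One way to respect a self-imposed access budget is to have the webbot count its own requests. The sketch below is a minimal example, not code from this book's libraries: the counter-file path and the budget of 100 requests per day are assumptions, and the actual page fetch is elided.

# A minimal self-throttling sketch (hypothetical counter file and budget)
$counter_file = "c:\\bots\\request_counter_".date("Y-m-d").".txt";
$daily_budget = 100;

$requests_today = file_exists($counter_file) ? (int)file_get_contents($counter_file) : 0;

if($requests_today >= $daily_budget)
    exit("Daily access budget reached; try again tomorrow.\n");

# ... fetch one page here, for example with http_get() ...

# Record the request for the next run
file_put_contents($counter_file, $requests_today + 1);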

Error Logs

Like access logs, error logs record access to a website, but unlike access logs, error logs only record errors that occur. A sampling of an actual error log is shown in Listing 24-2.

[Tue Mar 08 14:57:12 2008] [warn] module mod_php4.c is already added, skipping
[Tue Mar 08 15:48:10 2008] [error] [client 127.0.0.1] File does not exist: c:/program files/apache group/apache/htdocs/favicon.ico
[Tue Mar 08 15:48:13 2008] [error] [client 127.0.0.1] File does not exist: c:/program files/apache group/apache/htdocs/favicon.ico
[Tue Mar 08 15:48:37 2008] [error] [client 127.0.0.1] File does not exist: c:/program files/apache group/apache/htdocs/t.gif

Listing 24-2: Typical error log entries

The errors your webbot is most likely to make involve requests for unsupported methods (often HEAD requests) or requesting files that aren't on the website. If your webbot repeatedly commits either of these errors, a system administrator will easily determine that a webbot is making the erroneous page requests, because it is almost impossible to cause these errors when manually surfing with a browser. Since error logs tend to be smaller than access logs, entries in error logs are very obvious to system administrators. However, not all entries in an error log indicate that something unusual is going on. For example, it's common for people to use expired bookmarks or to follow broken links, both of which could generate File not found errors.

At other times, errors are logged in access logs, not error logs. These errors include using a GET method to send a form instead of a POST (or vice versa), or emulating a form and sending the data to a page that is not a valid action address. These are perhaps the worst errors because they are impossible for someone using a browser to commit—therefore, they will make your webbot stand out like a sore thumb in the log files.

These are the best ways to avoid strange errors in log files:

- Debug your webbot's parsing software on web pages that are on your own server before releasing it into the wilderness
- Use a form analyzer, as described in Chapter 5, when emulating forms
- Program your webbot to stop if it is looking for something specific but cannot find it

Custom Logs

Many web administrators also keep detailed custom logs, which contain additional data not found in either error or access logs. Information that may appear in custom logs includes the following:

- The name of the web agent used to download a file
- A fully resolved domain name that resolves the requesting IP address
- A coherent list of pages a visitor viewed during any one session
- The referer used to get to the requested page

The first item on the list is very important and easy to address. If you call your webbot test webbot, which is the default setting in LIB_http, the web administrator will finger your webbot as soon as he or she views the log file. Sometimes this is by design; for example, if you want your webbot to be discovered, you may use an agent name like See www.myWebbot.com for more details. I have seen many webbots brand themselves similarly.

If the administrator does a reverse DNS lookup to convert IP addresses to domain names, that makes it very easy to trace the origin of traffic. You should always assume this is happening and restrict the number of times you access a target.

Some metrics programs also create reports that show which pages specific visitors downloaded on sequential visits. If your webbot always downloads the same pages in the same order, you're bound to look odd. For this reason, it's best to add some variety (or randomness, if applicable) to the sequence and number of pages your webbots access.
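Whether you change the agent name by editing your copy of LIB_http or by configuring PHP/CURL directly, the PHP/CURL option is CURLOPT_USERAGENT. The sketch below sets it in a raw session; the target URL and the browser-like agent string are only examples.

# Present a deliberate agent name instead of the default one.
# The URL and agent string below are only examples.
$s = curl_init();
curl_setopt($s, CURLOPT_URL, "http://www.example.com");
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($s, CURLOPT_USERAGENT,
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");
$page = curl_exec($s);
curl_close($s);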

Log-Monitoring Software

Many system administrators use monitoring software that automatically detects strange behavior in log files. Servers using monitoring software may automatically send a notification email, instant message, or even page to the system administrator upon detection of critical errors. Some systems may even automatically shut down or limit accessibility to the server.

Some monitoring systems can have unanticipated results. I once created a webbot for a client that made HEAD requests from various web pages. While the use of the HEAD request is part of the web specification, it is rarely used, and this particular monitoring software interpreted the use of the HEAD request as malicious activity. My client got a call from the system administrator, who demanded that we stop hacking his website. Fortunately, we all discussed what we were doing and left as friends, but that experience taught me that many administrators are inexperienced with webbots; if you approach situations like this with confidence and knowledge, you'll generally be respected. The other thing I learned from this experience is that when you want to analyze a header, you should request the entire page instead of only the header, and then parse the results on your hard drive.

[67] There may also be legal implications for hitting a website too many times. For more information on this subject, see Chapter 28.


Stealth Means Simulating Human Patterns

Webbots that don't draw attention to themselves are ones that behave like people and leave normal-looking records in log files. For this reason, you want your webbot to simulate normal human activity. In short, stealthy webbots don't act like machines.

Be Kind to Your Resources

Possibly the worst thing your webbot can do is consume too much bandwidth from an individual website. To limit the amount of bandwidth a webbot uses, you need to restrict the amount of activity it has at any one website. Whatever you do, don't write a webbot that frequently makes requests from the same source. Since your webbot doesn't read the downloaded web pages and click links as a person would, it is capable of downloading pages at a ridiculously fast rate. For this reason, your webbot needs to spend most of its time waiting instead of downloading pages.

The ease of writing a stealthy webbot is directly correlated with how often your target data changes. In the early stages of designing your webbot, you should decide what specific data you need to collect and how often that data changes. If updates of the target data happen only once a day, it would be silly to look for it more often than that.

System administrators also use various methods and traps to deter webbots and spiders. These concepts are discussed in detail in Chapter 27.

Run Your Webbot During Busy Hours

If you want your webbot to generate log records that look like normal browsing, you should design your webbot so that it makes page requests when everyone else makes them. If your webbot runs during busy times, your log records will be intermixed with normal traffic. There will also be more records separating your webbot's access records in the log file. This will not reduce the total percentage of requests coming from your webbot, but it will make your webbot slightly less noticeable.

Running webbots during high-traffic times is slightly counterintuitive, since many people believe that the best time to run a webbot is in the early morning hours—when the system administrator is at home sleeping and you're not interfering with normal web traffic. While the early morning may be the best time to go out in public without alerting the paparazzi, on the Internet, there is safety in numbers.

Don't Run Your Webbot at the Same Time Each Day

If you have a webbot that needs to run on a daily basis, it's best not to run it at exactly the same time every day, because doing so would leave suspicious-looking records in the server log file. For example, if a system administrator notices that someone with a certain IP address accesses the same file at 7:01 AM every day, he or she will soon assume that the requestor is either a highly compulsive human or a webbot. A simple countermeasure is to schedule the task slightly early and add a random start-up delay, as sketched below.
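The following sketch is one way to do that; the one-hour window is only an example, and the webbot's actual task is elided.

# Wait a random amount of time (0 to 60 minutes) before starting,
# so the webbot's daily log entries don't land at the same second.
set_time_limit(0);                       // Let the script outlive PHP's default time-out
$random_delay_seconds = rand(0, 60 * 60);
sleep($random_delay_seconds);

# ... proceed with the webbot's normal task here ...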

Don't Run Your Webbot on Holidays and Weekends

Obviously, your webbot shouldn't access a website over a holiday or weekend if it would be unusual for a person to do the same. For example, I've written procurement bots (see Chapter 19) that buy things from websites only used by businesses. It would have been odd if the webbot checked what was available for purchase at a time when businesses are typically closed. This is, unfortunately, an easy mistake to make, because few task-scheduling programs track local holidays. You should read Chapter 23 for more information on this issue.

Use Random, Intra-fetch Delays

One sure way to tell a system administrator that you've written a webbot is to request pages faster than humanly possible. This is an easy mistake to make, since computers can make page requests at lightning speed. For this reason, it's imperative to insert delays between repeated page fetches on the same domain. Ideally, the delay period should be a random value that mimics human browsing behavior.
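In practice, the delay sits inside the download loop. The sketch below is only an illustration: the URL list is hypothetical, the 20-to-45-second range is an example, and a working version of the delay itself appears later in Listing 24-3.

include("LIB_http.php");

# Random delays between successive fetches from the same domain
# (the URLs and delay range below are only examples)
$urls = array(
    "http://www.example.com/page1.html",
    "http://www.example.com/page2.html",
    "http://www.example.com/page3.html"
    );

foreach($urls as $url)
    {
    $page = http_get($url, $ref="");
    // ... parse and store $page here ...

    sleep(rand(20, 45));    // Pause 20 to 45 seconds, like a person reading the page
    }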


Final Thoughts

A long time ago—before I knew better—I needed to gather some information for a client from a government website (on a Saturday, no less). I determined that in order to collect all the data I needed by Monday morning, my spider would have to run at full speed for most of the weekend (another bad idea). I started in the morning, and everything was going well; the spider was downloading pages, parsing information, and storing the results in my database at a blazing rate.

While only casually monitoring the spider, I used the idle time to browse the website I was spidering. To my horror, I found that the welcome page explicitly stated that the website did not, under any circumstances, allow webbots to gather information from it. Furthermore, the welcome page stated that any violation of this policy was considered a felony, and violators would be prosecuted fully. Since this was a government website, I assumed it had the lawyers to follow through with a threat like this.

In retrospect, the phrase full extent of the law was probably more of a fear tactic than an indication of imminent legal action. Since all the data I collected was in the public domain, and the funding of the site's servers came from public money (some of it mine), I couldn't possibly have done anything wrong, could I?

My fear was that since I was hitting the server very hard, the department would file a trespass-to-chattels[68] case against me. Regardless, it had my attention, and I questioned the wisdom of what I was doing. An activity that seemed so innocent only moments earlier suddenly had the potential of becoming a criminal offense. I wasn't sure what the department's legal rights were, nor was I sure to what extent a judge would have agreed with its arguments, since there were no applicable warnings on the pages I was spidering. Nevertheless, it was obvious that the government would have more lawyers at its disposal than I would, if it came to that.

Just as I started to contemplate my future in jail, the spider suddenly stopped working. Fearing the worst, I pointed my browser at the page I had been spidering and felt the blood drain from my face as I read a web page similar to the one shown in Figure 24-2.


Figure 24-2. A government warning that my IP address had been blocked

I knew I had no choice but to call the number on the screen. This website obviously had monitoring software, and it detected that I was operating outside of stated policies. Moreover, it had my IP address, so someone could easily discover who I was by tracing my IP address back to my ISP.[69]

Once the department knew who my ISP was, it could subpoena billing and log files to use as evidence. I was busted—not by some guy with a server, but by the full force and assets (i.e., lawyers) of the State of Minnesota. My paranoia was magnified by the fact that it was only late Saturday morning, and I had all weekend to think about my situation before I could call the number on Monday morning.


When Monday finally came, I called the number and was very apologetic. Realizing that they already knew what I was doing, I gave them a full confession. Moreover, I noted that I had read the policy on the main page after I started spidering the site and that there were no warnings on the pages I was spidering.

Fortunately, the person who answered the phone was not the department's legal counsel (as I feared), but a friendly system administrator who was mostly concerned about maintaining a busy website on a limited budget. She told me that she'd unblock my IP address if I promised not to hit the server more than three times a minute. Problem solved. (Whew!)

The embarrassing part of this story is that I should have known better. It only takes a small amount of code between page requests to make a webbot's actions look more human. For example, the code snippet in Listing 24-3 will cause a random delay between 20 and 45 seconds.

$minimum_delay_seconds = 20;
$maximum_delay_seconds = 45;
sleep(rand($minimum_delay_seconds, $maximum_delay_seconds));   // Pause for a random interval

Listing 24-3: Creating a random delay

You can summarize the complete topic of stealthy webbots in a single concept: Don't do anything with a webbot that doesn't look like something one person using a browser would do. In that regard, think about how and when people use browsers, and try to write webbots that mimic that activity.

[68] See Chapter 28 for more information about trespass to chattels.

[69] You can find the owner of an IP address at http://www.arin.net.


Chapter 25. WRITING FAULT-TOLERANT WEBBOTS

The biggest complaint users have about webbots is their unreliability: Your webbots will suddenly and inexplicably fail if they are not fault tolerant, or able to adapt to the changing conditions of your target websites. This chapter is devoted to helping you write webbots that are tolerant to network outages and unexpected changes in the web pages you target.

Webbots that don't adapt to their changing environments are worse than nonfunctional ones because, when presented with the unexpected, they may perform in odd and unpredictable ways. For example, a non-fault-tolerant webbot may not notice that a form has changed and will continue to emulate the nonexistent form. When a webbot does something that is impossible to do with a browser (like submit an obsolete form), system administrators become aware of the webbot. Furthermore, it's usually easy for system administrators to identify the owner of a webbot by tracing an IP address or matching a user to a username and password. Depending on what your webbot does and which website it targets, the identification of a webbot can lead to possible banishment from the website and the loss of a competitive advantage for your business. It's better to avoid these issues by designing fault-tolerant webbots that anticipate changes in the websites they target.

Fault tolerance does not mean that everything will always work perfectly. Sometimes changes in a targeted website confuse even the most fault-tolerant webbot. In these cases, the proper thing for a webbot to do is to abort its task and report an error to its owner. Essentially, you want your webbot to fail in the same manner a person using a browser might fail. For example, if a webbot is buying an airline ticket, it should not proceed with a purchase if a seat is not available on a desired flight. This action sounds silly, but it is exactly what a poorly programmed webbot may do if it is expecting an available seat and has no provision to act otherwise.

Types of Webbot Fault Tolerance

For a webbot, fault tolerance involves adapting to changes in URLs, HTML content (which affects parsing), forms, cookie use, and network outages and congestion. We'll examine each of these aspects of fault tolerance in the following sections.

Adapting to Changes in URLs

Possibly the most important type of webbot fault tolerance is URL tolerance, or a webbot's ability to make valid requests for web pages under changing conditions. URL tolerance ensures that your webbot does the following:

- Downloads pages that are available on the target site
- Follows header redirections to updated pages
- Uses referer values to indicate that you followed a link from a page that is still on the website

Avoid Making Requests for Pages That Don't Exist

Before you determine that your webbot downloaded a valid web page, you should verify that you made a valid request. Your webbot can verify successful page requests by examining the HTTP code, a status code returned in the header of every web page. If the request was successful, the resulting HTTP code will be in the 200 series—meaning that the HTTP code will be a three-digit number beginning with a two. Any other value for the HTTP code may indicate an error. The most common HTTP code is 200, which says that the request was valid and that the requested page was sent to the web agent. The script in Listing 25-1 shows how to use the LIB_http library's http_get() function to validate the returned page by looking at the returned HTTP code. If the webbot doesn't detect the expected HTTP code, an error handler is used to manage the error and the webbot stops.

<?
include("LIB_http.php");

# Get the web page
$page = http_get($target="www.schrenk.com", $ref="");

# Vector to error handler if error code detected
if($page['STATUS']['http_code']!="200")
    error_handler("BAD RESULT", $page['STATUS']['http_code']);

?>

Listing 25-1: Detecting a bad page request

Before using the method described in Listing 25-1, review a list of HTTP codes and decide which codes apply to your situation.[70]

If the page no longer exists, the fetch will return a 404 Not Found error. When this happens, it's imperative that the webbot stop and not download any more pages until you find the cause of the error. Not proceeding after detecting an error is a far better strategy than continuing as if nothing is wrong.

Web developers don't always remove obsolete web pages from their websites—sometimes they just link to an updated page without removing the old one. Therefore, webbots should start at the web page's home page and verify the existence of each page between the home page and the actual targeted web page. This process does two things. It helps your webbot maintain stealth, as it simulates the browsing habits of a person using a browser. Moreover, by validating that there are links to subsequent pages, you verify that the pages you are targeting are still in use. In contrast, if your webbot targets a page within a site without verifying that other pages still link to it, you risk targeting an obsolete web page.

The fact that your webbot made a valid page request does not indicate that the page you've downloaded is the one you intended to download or that it contains the information you expected to receive. For that reason, it is useful to find a validation point, or text that serves as an indication that the newly downloaded web page contains the expected information. Every situation is different, but there should always be some text on every page that validates that the page contains the content you're expecting. For example, suppose your webbot submits a form to authenticate itself to a website. If the next web page contains a message that welcomes the member to the website, you may wish to use the member's name as a validation point to verify that your webbot successfully authenticated, as shown in Listing 25-2.


$username = "GClasemann";
$page = http_get($target, $ref="");

if(!stristr($page['FILE'], $username))
    {
    echo "authentication error";
    error_handler("BAD AUTHENTICATION for ".$username, $target);
    }

Listing 25-2: Using a username as a validation point to confirm the result of submitting a form

The script in Listing 25-2 verifies that a validation point, in this case a username, exists as anticipated on the fetched page. This strategy works because the only way the user's name would appear on the web page is if he or she had been successfully authenticated by the website. If the webbot doesn't find the validation point, it assumes there is a problem and it reports the situation with an error handler.

Follow Page Redirections

Page redirections are instructions sent by the server that tell a browser that it should download a page other than the one originally requested. Web developers use page redirection techniques to tell browsers that the page they're looking for has changed and that they should download another page in its place. This allows people to access correct pages even when obsolete addresses are bookmarked by browsers or listed by search engines. As you'll discover, there are several methods for redirecting browsers. The more web redirection techniques your webbots understand, the more fault tolerant your webbot becomes.

Header redirection is the oldest method of page redirection. It occurs when the server places a Location: URL line in the HTTP header, where URL represents the web page the browser should download (in place of the one requested). When a web agent sees a header redirection, it's supposed to download the page defined by the new location. Your webbot could look for redirections in the headers of downloaded pages, but it's easier to configure PHP/CURL to follow header redirections automatically.[71] Listing 25-3 shows the PHP/CURL options you need to make automatic redirection happen.

curl_setopt($curl_session, CURLOPT_FOLLOWLOCATION, TRUE);  // Follow redirects
curl_setopt($curl_session, CURLOPT_MAXREDIRS, 4);          // Only follow 4 redirects

Listing 25-3: Configuring PHP/CURL to follow up to four header redirections

The first option in Listing 25-3 tells PHP/CURL to follow all page redirections as they are defined by the target server. The second option limits the number of redirections your webbot will follow. Limiting the number of redirections defeats webbot traps where servers redirect agents to the page they just downloaded, causing an endless number of requests for the same page and an endless loop.

In addition to header redirections, you should also be prepared to identify and accommodate page redirections made between the <head> and </head> tags, as shown in Listing 25-4.

<html>
    <head>
        <meta http-equiv="refresh" content="0; URL=http://www.nostarch.com">
    </head>
</html>

Listing 25-4: Page redirection between the <head> and </head> tags

In Listing 25-4, the web page tells the browser to download http://www.nostarch.com instead of the intended page. Detecting these kinds of redirections is accomplished with a script like the one in Listing 25-5. This script looks for redirections between the <head> and </head> tags in a test page on the book's website.

<?
# Include http, parse, and address resolution libraries
include("LIB_http.php");
include("LIB_parse.php");
include("LIB_resolve_addresses.php");

# Identify the target web page and the page base
$target = "http://www.schrenk.com/nostarch/webbots/head_redirection_test.php";
$page_base = "http://www.schrenk.com/nostarch/webbots/";

# Download the web page
$page = http_get($target, $ref="");

# Parse the <head></head>
$head_section = return_between($string=$page['FILE'], $start="<head>", $end="</head>", $type=EXCL);

# Create an array of all the meta tags
$meta_tag_array = parse_array($head_section, $beg_tag="<meta", $close_tag=">");

# Examine each meta tag for a redirection command
for($xx=0; $xx<count($meta_tag_array); $xx++)
    {
    # Look for http-equiv attribute
    $meta_attribute = get_attribute($meta_tag_array[$xx], $attribute="http-equiv");
    if(strtolower($meta_attribute)=="refresh")
        {
        $new_page = return_between($meta_tag_array[$xx], $start="URL", $end=">", $type=EXCL);

        # Clean up URL
        $new_page = trim(str_replace("", "", $new_page));
        $new_page = str_replace("=", "", $new_page);
        $new_page = str_replace("\"", "", $new_page);
        $new_page = str_replace("'", "", $new_page);

        # Create fully resolved URL
        $new_page = resolve_address($new_page, $page_base);

        # Like a browser, stop at the first redirection found
        break;
        }
    }

# Echo results of script
echo "HTML Head redirection detected<br>";
echo "Redirect page = ".$new_page;
?>

Listing 25-5: Detecting redirection between the <head> and </head> tags

Listing 25-5 is also an example of the need for good coding practices as part of writing fault-tolerant webbots. For instance, in Listing 25-5 notice how these practices are followed:

- The script looks for the redirection between the <head> and </head> tags, and not just anywhere on the web page
- The script looks for the http-equiv attribute only within a meta tag
- The redirected URL is converted into a fully resolved address
- Like a browser, the script stops looking for redirections when it finds the first one

The last—and most troublesome—type of redirection is that done with JavaScript. These instances are troublesome because webbots typically lack JavaScript parsers, making it difficult for them to interpret JavaScript. The simplest redirection of this type is a single line of JavaScript, as shown in Listing 25-6.

<script>document.location = 'http://www.schrenk.com'; </script>

Listing 25-6: A simple JavaScript page redirection

Detecting JavaScript redirections is also tricky because JavaScript is a very flexible language, and page redirections can take many forms. For example, consider what it would take to detect a page redirection like the one in Listing 25-7.

<html>
    <head>
        <script>
            function goSomeWhereNew(URL)
                {
                location.href = URL;
                }
        </script>
    </head>
    <body onLoad="goSomeWhereNew('http://www.schrenk.com')">
    </body>
</html>

Listing 25-7: A complicated JavaScript page redirection

Fortunately, JavaScript page redirection is not a particularly effective way for a web developer to send a visitor to a new page. Some people turn off JavaScript in their browser configuration, so it doesn't work for everyone; therefore, JavaScript redirection is rarely used. Since it is difficult to write fault-tolerant routines to handle JavaScript, you may have to tough it out and rely on the error-detection techniques addressed later in this chapter.

Maintain the Accuracy of Referer Values

The last aspect of verifying that you're using correct URLs is ensuring that your referer values correctly simulate followed links. You should set the referer to the last target page you requested. This is important for several reasons. For example, some image servers use the referer value to verify that a request for an image is preceded by a request for the entire web page. This defeats bandwidth hijacking, the practice of sourcing images from other people's domains. In addition, websites may defeat deep linking, or linking to a website's inner pages, by examining the referer to verify that people followed a prescribed succession of links to get to a specific point within a website.
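With LIB_http, the referer is simply the second parameter of http_get(), so chaining fetches correctly is a matter of passing the previous URL forward. The sketch below illustrates the idea with example URLs.

include("LIB_http.php");

# Simulate following a link: the second page's referer is the first page.
# The URLs below are only examples.
$home_page  = "http://www.example.com/";
$inner_page = "http://www.example.com/catalog.html";

$page_one = http_get($home_page, $ref="");          // Entry page, no referer
$page_two = http_get($inner_page, $ref=$home_page); // Referer = last page requested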

Adapting to Changes in Page Content

Parse tolerance is your webbot's ability to parse web pages when your webbot downloads the correct page, but its contents have changed. The following paragraphs describe how to write parsing routines that are tolerant to minor changes in web pages. This may also be a good time to review Chapter 4, which covers general parsing techniques.

Avoid Position Parsing

To facilitate fault tolerance when parsing web pages, you should avoid all attempts at position parsing, or parsing information based on its position within a web page. For example, it's a bad idea to assume that the information you're looking for has these characteristics:

- Starts x characters from the beginning of the page and is y characters in length
- Is in the xth table in a web page
- Is at the very top or bottom of a web page

Any small change in a website can affect position parsing. There are much better ways of finding the information you need to parse.

Use Relative Parsing

Relative parsing is a technique that involves looking for desired information relative to other things on a web page. For example, since many web pages hold information in tables, you can place all the tables into an array, identifying which table contains a landmark term that identifies the correct table. Once a webbot finds the correct table, the data can be parsed from the correct cell by finding the cell relative to a specific column name within that table. For an example of how this works, look at the parsing techniques performed in Chapter 7, in which a webbot parses prices from an online store.

Table column headings may also be used as landmarks to identify data in tables. For example, assume you have a table like Table 25-1, which presents statistics for three baseball players.

Table 25-1. Use Table Headers to Identify Data Within Columns

Player    Team         Hits    Home Runs    Average
Zoe       Marsupials   78      15           .327
Cullen    Wombats      56      16           .331
Kade      Wombats      58      17           .324

In this example you could parse all the tables from the web page and isolate the table containing the landmark Player Statistics. In that table, your webbot could then use the column names as secondary landmarks to identify players and their statistics. A minimal sketch of the table-isolation step follows.
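The sketch below uses the parse_array() routine from LIB_parse, which appears throughout this book; the $web_page variable and the Player Statistics landmark are assumptions for illustration.

include("LIB_parse.php");

# Assume $web_page already holds a downloaded page containing Table 25-1.
# Put every table on the page into an array
$table_array = parse_array($web_page, "<table", "</table>");

# Find the table that contains the landmark text
$landmark = "Player Statistics";
for($xx=0; $xx<count($table_array); $xx++)
    {
    if(stristr($table_array[$xx], $landmark))
        {
        $target_table = $table_array[$xx];
        break;
        }
    }

# Within the landmark table, split out the rows for further relative parsing
$row_array = parse_array($target_table, "<tr", "</tr>");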

Look for Landmarks That Are Least Likely to Change

You achieve additional fault tolerance when you choose landmarks that are least likely to change. From my experience, the things in web pages that change with the lowest frequency are those that are related to server applications or back-end code. In most cases, names of form elements and values for hidden form fields seldom change. For example, in Listing 25-8 it's very easy to find the names and breeds of dogs because the form handler needs to see them in a well-defined manner. Webbot developers generally don't look for data values in forms because they aren't visible in rendered HTML. However, if you're lucky enough to find the data values you're looking for within a form definition, that's where you should get them, even if they appear in other visible places on the website.


<form method="POST" action="dog_form.php">
    <input type="hidden" name="Jackson" value="Jack Russell Terrier">
    <input type="hidden" name="Xing" value="Shepherd Mix">
    <input type="hidden" name="Buster" value="Maltese">
    <input type="hidden" name="Bare-bear" value="Pomeranian">
</form>

Listing 25-8: Finding data values in form variables

Similarly, you should avoid landmarks that are subject to frequent changes, like dynamically generated content, HTML comments (which Macromedia Dreamweaver and other page-generation software programs automatically insert into HTML pages), and information that is time or calendar derived.
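To illustrate how such form landmarks are harvested, the sketch below reads the hidden fields from a page like the one in Listing 25-8 with the same LIB_parse routines used elsewhere in this book; the $web_page variable is assumed to already hold the downloaded page.

include("LIB_parse.php");

# Assume $web_page holds the page containing the form in Listing 25-8.
# Collect every <input> tag and read its name and value attributes
$input_array = parse_array($web_page, "<input", ">");

for($xx=0; $xx<count($input_array); $xx++)
    {
    $dog_name  = get_attribute($input_array[$xx], "name");
    $dog_breed = get_attribute($input_array[$xx], "value");
    echo "$dog_name is a $dog_breed<br>";
    }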

Adapting to Changes in Forms

Form tolerance defines your webbot's ability to verify that it is sending the correct form information to the correct form handler. When your webbot detects that a form has changed, it is usually best to terminate your webbot, rather than trying to adapt to the changes on the fly. Form emulation is complicated, and it's too easy to make embarrassing mistakes—like submitting nonexistent forms. You should also use the form diagnostic page on the book's website (described in Chapter 5) to analyze forms before writing form emulation scripts.

Before emulating a form, a webbot should verify that the form variables it plans to submit are still in use in the submitted form. This check should verify the data pair names submitted to the form handler and the form's method and action. Listing 25-9 parses this information on a test page on the book's website. You can use similar scripts to isolate individual form elements, which can be compared to the variables in form emulation scripts.

<?
# Import libraries
include("LIB_http.php");
include("LIB_parse.php");
include("LIB_resolve_addresses.php");

# Identify location of form and page base address
$page_base = "http://www.schrenk.com/nostarch/webbots/";
$target = "http://www.schrenk.com/nostarch/webbots/easy_form.php";

$web_page = http_get($target, "");

# Find the forms in the web page
$form_array = parse_array($web_page['FILE'], $open_tag="<form", $close_tag="</form>");

# Parse each form in $form_array
for($xx=0; $xx<count($form_array); $xx++)
    {
    $form_beginning_tag = return_between($form_array[$xx], "<form", ">", INCL);
    $form_action = get_attribute($form_beginning_tag, "action");

    // If no action, use this page as action
    if(strlen(trim($form_action))==0)
        $form_action = $target;
    $fully_resolved_form_action = resolve_address($form_action, $page_base);

    // Default to GET method if no method specified
    if(strtolower(get_attribute($form_beginning_tag, "method"))=="post")
        $form_method="POST";
    else
        $form_method="GET";

    $form_element_array = parse_array($form_array[$xx], "<input", ">");

    echo "Form Method=$form_method<br>";
    echo "Form Action=$fully_resolved_form_action<br>";

    # Parse each element in this form
    for($yy=0; $yy<count($form_element_array); $yy++)
        {
        $element_name  = get_attribute($form_element_array[$yy], "name");
        $element_value = get_attribute($form_element_array[$yy], "value");
        echo "Element Name=$element_name, value=$element_value<br>";
        }
    }
?>

Listing 25-9: Parsing form values

Listing 25-9 finds and parses the values of all forms in a web page. When run, it also finds the form's method and creates a fully resolved URL for the form action, as shown in Figure 25-1.

Figure 25-1. Results of running the script in Listing 25-9

Adapting to Changes in Cookie Management

Cookie tolerance involves saving the cookies written by websites and making them available when fetching successive pages from the same website. Cookie management should happen automatically if you are using the LIB_http library and have the COOKIE_FILE pointing to a file your webbots can access.

One area of concern is that the LIB_http library (and PHP/CURL, for that matter) will not delete expired cookies or cookies without an expiration date, which are supposed to expire when the browser is closed. In these cases, it's important to manually delete cookies in order to simulate new browser sessions. If you don't delete expired cookies, it will eventually look like you're using a browser that has been open continuously for months or even years, which can look pretty suspicious.
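The crudest way to simulate a brand-new browser session is simply to delete the cookie file before the webbot's next run. The path below is only an example; use whatever file your LIB_http configuration points to.

# Start a fresh "browser session" by deleting the stored cookies.
# The path below is an example; substitute your own cookie-file location.
$cookie_file = "c:\\bots\\cookies.txt";

if(file_exists($cookie_file))
    unlink($cookie_file);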

Adapting to Network Outages and Network Congestion

Unless you plan accordingly, your webbots and spiders will hang, or become nonresponsive, when a targeted website suffers from a network outage or an unusually high volume of network traffic. Webbots become nonresponsive when they request and wait for a page that they never receive. While there's nothing you can do about getting data from nonresponsive target websites, there's also no reason your webbot needs to be hung up when it encounters one. You can avoid this problem by inserting the command shown in Listing 25-10 when configuring your PHP/CURL sessions.

curl_setopt($curl_session, CURLOPT_TIMEOUT, $timeout_value);

Listing 25-10: Setting time-out values in PHP/CURL

CURLOPT_TIMEOUT defines the number of seconds PHP/CURL waits for a targeted website to respond. This happens automatically if you use the LIB_http library featured in this book. By default, page requests made by LIB_http wait a maximum of 25 seconds for any target website to respond. If there's no response within the allotted time, the PHP/CURL session returns an empty result.

While on the subject of time-outs, it's important to recognize that PHP, by default, will time out if a script executes longer than 30 seconds. In normal use, PHP's time-out ensures that if a script takes too long to execute, the webserver will return a server error to the browser. The browser, in turn, informs the user that a process has timed out. The default time-out works great for serving web pages, but when you use PHP to build webbot or spider scripts, PHP must facilitate longer execution times. You can extend (or eliminate) the default PHP script-execution time with the commands shown in Listing 25-11.

You should exercise extreme caution when eliminating PHP's time-out, as shown in the second example in Listing 25-11. If you eliminate the time-out, your script may hang permanently if it encounters a problem.

set_time_limit(60);   // Set PHP time-out to 60 seconds
set_time_limit(0);    // Completely remove PHP script time-out

Listing 25-11: Adjusting the default PHP scripttime-outAlways try to avoid time-outs by designing

Page 501: Webbots, Spiders, And Screen Scrapers - Michael Schrenk

Always try to avoid time-outs by designing webbots that execute quickly, even if that means your webbot needs to run more than once to accomplish a task. For example, if a webbot needs to download and parse 50 web pages, it's usually best to write the bot in such a way that it can process pages one at a time and know where it left off; then you can schedule the webbot to execute every minute or so for an hour. Webbot scripts that execute quickly are easier to test, resemble normal network traffic more closely, and use fewer system resources.
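
The sketch below (not one of the book's listings) shows one way to implement that pick-up-where-you-left-off pattern; the file names are hypothetical, and http_get() comes from LIB_http.

include("LIB_http.php");

$progress_file = "last_page_processed.txt";
$pages = file("page_list.txt", FILE_IGNORE_NEW_LINES);    // One target URL per line

# Find out where the previous run stopped
$last = file_exists($progress_file) ? (int)file_get_contents($progress_file) : -1;
$next = $last + 1;

if($next < count($pages))
    {
    $download = http_get($pages[$next], "");
    # ...parse $download['FILE'] here...

    # Record progress so the next scheduled run continues with the following page
    file_put_contents($progress_file, $next);
    }

Each scheduled execution handles a single page and exits quickly, so the traffic it generates looks more like a person browsing than a bulk download.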

[70] A full list of HTTP codes is available in Appendix B.
[71] LIB_http does this for you.


Error Handlers

When a webbot cannot adjust to changes, the only safe thing to do is to stop it. Not stopping your webbot may otherwise result in odd performance and suspicious entries in the target server's access and error log files. It's a good idea to write a routine that handles all errors in a prescribed manner. Such an error handler should send you an email that indicates the following:

Which webbot failed
Why it failed
The date and time it failed

A simple script like the one in Listing 25-12 works well for this purpose.

function webbot_error_handler($failure_mode)
    {
    # Initialization
    $email_address = "[email protected]";
    $email_subject = "Webbot Failure Notification";

    # Build the failure message
    $email_message = "Webbot T-Rex encountered a fatal error <br>";
    $email_message = $email_message . $failure_mode . "<br>";
    $email_message = $email_message . " at " . date("r") . "<br>";

    # Send the failure message via email
    mail($email_address, $email_subject, $email_message);

    # Don't return, force the webbot script to stop
    exit;
    }

Listing 25-12: Simple error-reporting script

The trick to effectively using error handlers is to anticipate cases in which things may go wrong and then test for those conditions. For example, the script in Listing 25-13 checks the size of a downloaded web page and calls the function in the previous listing if the web page is smaller than expected.

# Download web page
$target = "http://www.somedomain.com/somepage.html";
$downloaded_page = http_get($target, $ref="");
$web_page_size = strlen($downloaded_page['FILE']);

if($web_page_size < 1500)
    webbot_error_handler($target." smaller than expected, actual size=".$web_page_size);

Listing 25-13: Anticipating and reporting errors

In addition to reporting the error, it's important to turn off the scheduler when an error is found if the webbot is scheduled to run again in the future. Otherwise, your webbot will keep bumping up against the same problem, which may leave odd records in server logs. The easiest way to disable a scheduler is to write error handlers that record the webbot's status in a database. Before a scheduled webbot runs, it can first query the database to determine if an unaddressed error occurred earlier. If the query reveals that an error has occurred, the webbot can ignore the requests of the scheduler and simply terminate its execution until the problem is addressed.
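
As a rough sketch of that idea, the fragment below queries a status table before doing any work; the table, columns, and connection details are hypothetical and use plain mysqli calls rather than any library from this book.

$db = new mysqli("localhost", "user", "password", "webbots");

# Has an earlier run logged an unresolved error for this webbot?
$result = $db->query("SELECT COUNT(*) AS errors FROM bot_status
                      WHERE bot_name='T-Rex' AND resolved=0");
$row = $result->fetch_assoc();

if($row['errors'] > 0)
    exit("Unresolved error on record: refusing to run until it is addressed.");

# ...normal webbot execution continues here...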


Chapter 26. DESIGNING WEBBOT-FRIENDLY WEBSITES

I'll start this chapter with suggestions that help make web pages accessible to the most widely used webbots—the spiders that download, analyze, and rank web pages for search engines, a process often called search engine optimization (SEO). Finally, I'll conclude the chapter by explaining the occasional importance of special-purpose web pages, formatted to send data directly to webbots instead of browsers.

Optimizing Web Pages for Search Engine Spiders

The most important thing to remember when designing a web page for SEO is that spiders rely on you, the developer, to provide context for the information they find. This is important because web pages using HTML mix content with display format commands.


To add complexity to the spider's task, a spider has to examine words in the web page's content to determine how relevant the words are to the web page's main topic. You can improve a spider's ability to index and rank your web pages, as well as improve your search ranking, by predictably using a few standard HTML tags. The topic of SEO is vast, and many books are entirely dedicated to it. This chapter only scratches the surface, but it should get you on your way.

Well-Defined Links

Search engines generally associate the number of links to a web page with the web page's popularity and importance. In fact, getting other websites to link to your web page is probably the best way to improve your web page's search ranking. Regardless of where the links originate, it's always important to use descriptive hyper-references when making links. Without descriptive links, search engine spiders will know the linked URL, but they won't know the importance of the link. For example, the first link in Listing 26-1 is much more useful to search spiders than the second link.

<!-- Example of a descriptive link -->
<a href="http://www.schrenk.com/js">JavaScript Animation Tutorial</a>


<!-- Example of a nondescriptive link -->
<a href="http://www.schrenk.com/js">Click here</a> for a JavaScript tutorial

Listing 26-1: Descriptive and nondescriptive links

Google Bombs and Spam Indexing

Google bombing is an example of how search rankings are affected by the terms used to describe links. Google bombing (also known as spam indexing) is a technique where people conspire to create many links, with identical link descriptions, to a specific web page. As Google (or any other search engine) indexes these web pages, the link descriptions become associated with the targeted web page. As a result, when people enter the link descriptions as search terms, the targeted pages are highly ranked in the results. Google bombing is occasionally used for political purposes to place a targeted politician's website as the highest ranked result for a derogatory search term. For example, depending on the search engine you use, a search for the phrase miserable failure may return the official biography of George W. Bush as the top result. Similarly, a search for the word waffles may produce the official web page of Senator John Kerry.


While Google has adapted its rankings to account for a few well-known instances of this gamesmanship, Google bombing is still possible, and it remains an unresolved challenge for all search engines.

Title Tags

The HTML title tag helps spiders identify the main topic of a web page. Each web page should have a unique title that describes the general purpose of the page, as shown in Listing 26-2.

<title>Official Website: Webbots, Spiders, and Screen Scrapers</title>

Listing 26-2: Describing a web page with a title tag

Meta Tags

You can think of meta tags as extensions of the title tag. Like title tags, meta tags explain the main topic of the web page. However, unlike title tags, they allow for detailed descriptions of the content on the web page and the search terms people may use to find the page. For example, Listing 26-3 shows meta tags that may accompany the title tag used in the previous example.

<!-- The meta:author defines the author of the web page -->
<meta name="Author" content="Michael Schrenk">

<!-- The meta:description is how search engines describe the page in search results -->
<meta name="Description" content="Official Website: Webbots, Spiders, and Screen Scrapers">

<!-- The meta:keywords are a list of search terms that may lead people to your web page -->
<meta name="Keywords" content="Webbot, Spider, Webbot Development, Spider Development">

Listing 26-3: Describing a web page in detail with meta tags

There are many misconceptions about meta tags. Many people insist on using every conceivable keyword that may apply to a web page, using the more, the better theory. In reality, you should limit your selection of keywords to the six or eight keywords that best describe the content of your web page. It's important to remember that the keywords represent potential search terms that people may use to find your web page. Moreover, for each additional keyword you use, your web page becomes less specific in the eyes of search engines.


As you increase the number of keywords, you also increase the competition for use of those keywords. When this happens, other pages containing the same keywords dilute your position within search rankings. There are also rumors that some search engines ignore web pages that have excessive numbers of keywords as a measure to avoid keyword spamming, or the overuse of keywords. Whether these rumors are true or not, it still makes sense to use fewer, but better quality, keywords. For this reason, there is usually no need to include regular plurals[72] in keywords.

Note

The more unique your keywords are, the higher your web page will rank in search results when people use those keywords in web searches. One thing to watch out for is when your keyword is part of another, longer word. For example, I once worked for a company called Entolo. We had difficulty getting decent rankings on search engines because the word Entolo is a subset of the word Scientology (sciENTOLOgy). Since there were many more heavily linked web pages dedicated to Scientology, our website seldom registered highly with any search services.


Header Tags

In addition to making web pages easier to read, header tags help search engines identify and locate important content on web pages. For example, consider the example in Listing 26-4.

<h1 class="main_header">North American Wire Packaging</h1>
In North America, large amounts of wire are commonly shipped on spools...

Listing 26-4: Using header tags to identify key content on a web page

In the past, web designers shied away from using header tags because they offered only a limited selection of font styles. But now, with the wide acceptance of style sheets, there is no reason not to use HTML header tags to describe important sections of your web pages.

Image alt Attributes

Long ago, before everyone had graphical browsers, web designers used the alt attribute of the HTML <img> tag to describe images to people with text-based browsers.


Today, with the increasing popularity of image search tools, the alt attribute helps search engines interpret the content of images, as shown below in Listing 26-5.

<img src="mydog.jpg" alt="Jackson the wonder dog">

Listing 26-5: Using the alt attribute to identify the content of an image

[72] A regular plural is the singular form of a word followed by the letter s.


Web Design Techniques That Hinder Search Engine Spiders

There are common web design techniques that inhibit search engine spiders from properly indexing web pages. You don't have to avoid using these techniques altogether, but you should avoid using them in situations where they obscure links and ASCII text from search engine spiders. There is no single set of standards or specifications for SEO. Search engine companies also capriciously change their techniques for compiling search results. The concepts mentioned here, however, are a good set of suggestions for you to consider as you develop your own best practice policies.

JavaScript

Since most webbots and spiders lack JavaScript interpreters, there is no guarantee that a spider will understand hyper-references made with JavaScript. For example, the second hyper-reference in Listing 26-6 stands a far better chance of being indexed by a spider than the first one.

<!-- Example of a non-optimized hyper-reference -->
<script>
    function linkToPage(url)
    {
        document.location=url;
    }
</script>

<!-- Example of an easy-to-index hyper-reference -->
<a href="http://www.MySpace.com/haxtor">My home page</a>

Listing 26-6: JavaScript links are hard for search spiders to interpret.

Non-ASCII Content

Search engine spiders depend on ASCII characters to identify what's on a web page. For that reason, you should avoid presenting text in images or Flash movies. It is particularly important not to design your website's navigation scheme in Flash, because it will not be visible outside of the Flash movie, and it will be completely hidden from search spiders. Not only will your Flash pages fail to show up in search results, but other pages will also not be able to deep link directly to the pages within Flash movies.


In short, websites done entirely in Flash kill any and all attempts at SEO and will receive less traffic than properly formatted HTML websites.


Designing Data-Only Interfaces

Often, the express purpose of a web page is to deliver data to a webbot, another website, or a stand-alone desktop application. These web pages aren't concerned about how people will read them in a browser. Rather, they are optimized for efficiency and ease of use by other computer programs. For example, you might need to design a web page that provides real-time sales information from an e-commerce site.

XML

Today, the eXtensible Markup Language (XML) is considered the de facto standard for transferring online data. XML describes data by wrapping it in HTML-like tags. For example, consider the sample sales data from an e-commerce site, shown in Table 26-1. When converted to XML, the data in Table 26-1 looks like Listing 26-7.

Table 26-1. Sample Sales Information

Brand        Style      Color   Size   Price
Gordon LLC   Cotton T   Red     XXL    19.95
Ava St       Girlie T   Blue    S      19.95

<ORDER>
    <SHIRT>
        <BRAND>Gordon LLC</BRAND>
        <STYLE>Cotton T</STYLE>
        <COLOR>Red</COLOR>
        <SIZE>XXL</SIZE>
        <PRICE>19.95</PRICE>
    </SHIRT>
    <SHIRT>
        <BRAND>Ava St</BRAND>
        <STYLE>Girlie T</STYLE>
        <COLOR>Blue</COLOR>
        <SIZE>S</SIZE>
        <PRICE>19.95</PRICE>
    </SHIRT>
</ORDER>

Listing 26-7: An XML version of the data in Table 26-1

XML presents data in a format that is not only easy to parse, but, in some applications, it may also tell the client computer what to do with the data. The actual tags used to describe the data are not terribly important, as long as the XML server and client agree to their meaning. The script in Listing 26-8 downloads and parses the XML represented in the previous listing.

# Include libraries
include("LIB_http.php");
include("LIB_parse.php");

# Download the order
$url = "http://www.schrenk.com/nostarch/webbots/26_1.php";
$download = http_get($url, "");

# Parse the orders
$order_array = return_between($download['FILE'], "<ORDER>", "</ORDER>", $type=EXCL);

# Parse shirts from order array
$shirts = parse_array($order_array, $open_tag="<SHIRT>", $close_tag="</SHIRT>");
for($xx=0; $xx<count($shirts); $xx++)
    {
    $brand[$xx] = return_between($shirts[$xx], "<BRAND>", "</BRAND>", $type=EXCL);
    $color[$xx] = return_between($shirts[$xx], "<COLOR>", "</COLOR>", $type=EXCL);
    $size[$xx]  = return_between($shirts[$xx], "<SIZE>", "</SIZE>", $type=EXCL);
    $price[$xx] = return_between($shirts[$xx], "<PRICE>", "</PRICE>", $type=EXCL);
    }

# Echo data to validate the download and parse
for($xx=0; $xx<count($color); $xx++)
    echo "BRAND=".$brand[$xx]."<br>
          COLOR=".$color[$xx]."<br>
          SIZE=".$size[$xx]."<br>
          PRICE=".$price[$xx]."<hr>";

Listing 26-8: A script that parses XML data

Lightweight Data Exchange

As useful as XML is, it suffers from overhead because it delivers much more protocol than data. While this isn't important with small amounts of XML, the problem of overhead grows along with the size of the XML file. For example, it may take a 30KB XML file to present 10KB of data. Excess overhead needlessly consumes bandwidth and CPU cycles, and it can become expensive on extremely popular websites. In order to reduce overhead, you may consider designing lightweight interfaces. Lightweight interfaces deliver data more efficiently by presenting data in variables or arrays that can be used directly by the webbot. Granted, this is only possible when you define both the web page delivering the data and the client interpreting the data.


How Not to Design a Lightweight Interface

Before we explore proper methods for passing data to webbots, let's explore what can happen if your design doesn't take the proper security measures. For example, consider the order data from Table 26-1, reformatted as variable/value pairs, as shown in Listing 26-9.

$brand[0]="Gordon LLC";
$style[0]="Cotton T";
$color[0]="red";
$size[0]="XXL";
$price[0]=19.95;
$brand[1]="Ava LLC";
$style[1]="Girlie T";
$color[1]="blue";
$size[1]="S";
$price[1]=19.95;

Listing 26-9: Data sample available at http://www.schrenk.com/nostarch/webbots/26_2.php

The webbot receiving this data could convert this string directly into variables with PHP's eval() function, as shown in Listing 26-10.

# Include libraries
include("LIB_http.php");

# Download the data
$url = "http://www.schrenk.com/nostarch/webbots/26_2.php";
$download = http_get($url, "");

# Convert string received into variables
eval($download['FILE']);

# Show imported variables and values
for($xx=0; $xx<count($color); $xx++)
    echo "BRAND=".$brand[$xx]."<br>
          COLOR=".$color[$xx]."<br>
          SIZE=".$size[$xx]."<br>
          PRICE=".$price[$xx]."<hr>";

Listing 26-10: Incorrectly interpreting variable/value pairs

While this seems very efficient, there is a severe security problem associated with this technique. The eval() function, which interprets the variable settings in Listing 26-10, is also capable of interpreting any PHP command. This opens the door for malicious code that can run directly on your webbot!

A Safer Method of Passing Variables to Webbots

An improvement on the previous example would verify that only data variables are interpreted by the webbot. We can accomplish this by slightly modifying the variable/value pairs sent to the webbot (shown in Listing 26-11) and adjusting how the webbot processes the data (shown in Listing 26-12). Listing 26-11 shows a new lightweight test interface that will deliver information directly in variables for use by a webbot.

brand[0]="Gordon LLC";
style[0]="Cotton T";
color[0]="red";
size[0]="XXL";
price[0]=19.95;
brand[1]="Ava LLC";
style[1]="Girlie T";
color[1]="blue";
size[1]="S";
price[1]=19.95;

Listing 26-11: Data sample used by the script in Listing 26-12

The script in Listing 26-12 shows how the lightweight interface in Listing 26-11 is interpreted.

# Get http library
include("LIB_http.php");

# Define and download lightweight test interface
$url = "http://www.schrenk.com/nostarch/webbots/26_3.php";
$download = http_get($url, "");

# Convert the received lines into array elements
$raw_vars_array = explode(";", $download['FILE']);

# Convert each of the array elements into a variable declaration
for($xx=0; $xx<count($raw_vars_array)-1; $xx++)
    {
    list($variable, $value) = explode("=", $raw_vars_array[$xx]);
    $eval_string = "$".trim($variable)."="."\"".trim($value)."\"".";";
    eval($eval_string);
    }

# Echo imported variables
for($xx=0; $xx<count($color); $xx++)
    {
    echo "BRAND=".$brand[$xx]."<br>
          COLOR=".$color[$xx]."<br>
          SIZE=".$size[$xx]."<br>
          PRICE=".$price[$xx]."<hr>";
    }

Listing 26-12: A safe method for directly transferring values from a website to a webbot

The technique shown in Listing 26-12 safely imports the variable/data pairs from Listing 26-11 because the eval() command is explicitly directed to only set a variable to a value and not to execute arbitrary code.

This lightweight interface actually has another advantage over XML, in that the data does not have to appear in any particular order. For example, if you rearranged the data in Listing 26-11, the webbot would still interpret it correctly. The same could not be said for the XML data. And while the protocol is slightly less platform independent than XML, most computer programs are still capable of interpreting the data, as done in the example PHP script in Listing 26-12.

SOAP

No discussion of machine-readable interfaces is complete without mentioning the Simple Object Access Protocol (SOAP). SOAP is designed to pass instructions and data between specific types of web pages (known as web services) and scripts run by webbots, webservers, or desktop applications. SOAP is the successor of earlier protocols that make remote application calls, like Remote Procedure Call (RPC), Distributed Component Object Model (DCOM), and Common Object Request Broker Architecture (CORBA). SOAP is a web protocol that uses HTTP and XML as the primary protocols for passing data between computers.


In addition, SOAP also provides a layer (or two) of abstraction between the functions that make the request and receive the data. In contrast to XML, where the client needs to make a fetch and parse the results, SOAP facilitates functions that (appear to) directly execute functions on remote services, which return data in easy-to-use variables. An example of a SOAP call is shown in Listing 26-13.

In typical SOAP calls, the SOAP interface and client are created and the parameters describing requested web services are passed in an array. With SOAP, using a web service is much like calling a local function.

If you'd like to experiment with SOAP, consider creating a free account at Amazon Web Services. Amazon provides SOAP interfaces that allow you to access large volumes of data at both Amazon and Alexa, a web-monitoring service (http://www.alexa.com). Along with Amazon Web Services, you should also review the PHP-specific Amazon SOAP tutorial at Dev Shed, a PHP developers' site (http://www.devshed.com).

PHP 5 has built-in support for SOAP. If you're using PHP 4, however, you will need to use the appropriate PHP Extension and Application Repository (PEAR, http://www.pear.php.net) libraries, included in most PHP distributions. The PHP 5 SOAP client is faster than the PEAR libraries, because SOAP support in PHP 5 is compiled into the language; otherwise both versions are identical.

include("inc/PEAR/SOAP"); // Import SOAP client

# Define the request
$params = array(
    'manufacturer' => "XYZ CORP",
    'mode'         => 'development',
    'sort'         => '+product',
    'type'         => 'heavy',
    'userkey'      => $ACCESS_KEY
    );

# Create the SOAP object
$WSDL = new SOAP_WSDL($ADDRESS_OF_SOAP_INTERFACE);

# Instantiate the SOAP client
$client = $WSDL->getProxy();

# Make the request
$result_array = $client->SomeGenericSOAPRequest($params);

Listing 26-13: A SOAP call

Advantages of SOAP

SOAP interfaces to web services provide a common protocol for requesting and receiving data. This means that web services running on one operating system can communicate with a variety of computers, PDAs, or cell phones using any operating system, as long as they have a SOAP client.

Disadvantages of SOAP

SOAP is a very heavy interface. Unlike the interfaces explored earlier, SOAP requires many layers of protocols. In traffic-heavy applications, all this overhead can result in sluggish performance.


SOAP applications can also suffer from a steep learning curve, especially for developers accustomed to lighter data interfaces. That being said, SOAP and web services are the standard for exchanging online data, and SOAP instructions are something all webbot developers should know how to use. The best way to learn SOAP is to use it. In that respect, if you'd like to explore SOAP further, you should read the previously mentioned Dev Shed tutorial on using PHP to access the Amazon SOAP interface. This will provide a gradual introduction that should make complex interfaces (like eBay's SOAP API) easier to understand.


Chapter 27. KILLING SPIDERS

Thus far, we have talked about how to create effective, stealthy, and smart webbots. However, there is also a market for developers who create countermeasures that defend websites from webbots and spiders. These opportunities exist because sometimes website owners want to shield their sites from webbots and spiders for these purposes:

Protect intellectual property
Shield email addresses from spammers
Regulate how often the website is used
Create a level playing field for all users

The first three items in this list are fairly obvious, but the fourth is more complicated. Believe it or not, creating a level playing field is one of the main reasons web developers cite for attempting to ban webbots from their sites. Online companies often try to be as impartial as possible when wholesaling items to resellers or awarding contracts to vendors. At other times, websites deny access to all webbots to create an assumption of fairness or parity, as is the case with MySpace.


This is where the conflict exists. Businesses that seek to use the Internet to gain competitive advantages are not interested in parity. They want a strategic advantage.

Successfully defending websites from webbots is more complex than simply blocking all webbot activity. Many webbots, like those used by search engines, are beneficial, and in most cases they should be able to roam sites at will. It's also worth pointing out that, while it's more expensive, people with browsers can gather corporate intelligence and make online purchases just as effectively as webbots can. Rather than barring webbots in general, it's usually preferable to just ban certain behavior.

Let's look at some of the things people do to attempt to block webbots and spiders. We'll start with the simplest (and least effective) methods and graduate to more sophisticated practices.

Asking Nicely

Your first approach to defending a website from webbots is to request nicely that webbots and spiders do not use your resources. This is your first line of defense, but if used alone, it is not very effective.


This method doesn't actually keep webbots from accessing data—it merely states your desire for such—and it may or may not express the actual rights of the website owner. Though this strategy is limited in its effectiveness, you should always ask first, using one of the methods described below.

Create a Terms of Service Agreement

The simplest way to ask webbots to avoid your website is to create a site policy or Terms of Service agreement, which is a list of limitations on how the website should be used by all parties. A website's Terms of Service agreement typically includes a description of what the website does with data it collects, a declaration of limits of liability, copyright notifications, and so forth. If you don't want webbots and spiders harvesting information or services from your website, your Terms of Service agreement should prohibit the use of automated web agents, spiders, crawlers, and screen scrapers. It is a good idea to provide a link to the usage policy on every page of your website. Though some webbots will honor your request, others surely won't, so you should never rely solely on a usage policy to protect a website from automated agents.

Although an official usage policy probably won't keep webbots and spiders away, it is your opportunity to state your case.


With a site policy that specifically forbids the use of webbots, it's easier to make a case if you later decide to play hardball and file legal action against a webbot or spider owner.

You should also recognize that a written usage policy is for humans to read, and it will not be understood by automated agents. There are, however, other methods that convey your desires in ways that are easy for webbots to detect.

Use the robots.txt File

The robots.txt file,[73] or robot exclusion file, was developed in 1994 after a group of webmasters discovered that search engine spiders indexed sensitive parts of their websites. In response, they developed the robots.txt file, which instructs web agents to access only certain parts of a site. According to the robots.txt specification, a webbot should first look for the presence of a file called robots.txt in the website's root directory before it downloads anything else from the website. This file defines how the webbot should access files in other directories.[74]

The robots.txt file borrows its Unix-type format from permissions files. A typical robots.txt file is shown in Figure 27-1.



Figure 27-1. A typical robots.txt file, disallowing all user agents from selected directories
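
The file itself does not survive in this transcript; a robots.txt matching the caption's description would look something like the following, with hypothetical directory names.

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /tmp/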

In addition to what you see in Figure 27-1, a robots.txt file may disallow different directories for specific web agents. Some robots.txt files even specify the amount of time that webbots must wait between fetches, though these parameters are not part of the actual specification. Make sure to read the specification[75] before implementing a robots.txt file.

There are many problems with robots.txt. The first problem is that no recognized body, such as the World Wide Web Consortium (W3C) or a corporation, governs the specification. The robots exclusion file is actually the result of a "consensus of opinion" of members of a now-defunct robots mailing list.


The lack of a recognized organizing body has left the specification woefully out of date. For example, the specification did not anticipate agent name spoofing, so unless a robots.txt file disallows all webbots, any webbot can comply with the imposed restrictions by changing its name. In fact, a robots.txt file may actually direct a webbot to sensitive areas of a website or otherwise hidden directories. A much better tactic is to secure your confidential information through authentication or even obfuscation. Perhaps the most serious problem with the robots.txt specification is that there is no enforcement mechanism. Compliance is strictly voluntary.

However futile the attempt, you should still use the robots.txt file if for no other reason than to mark your turf. If you are serious about securing your site from webbots and spiders, however, you should use the tactics described later in this chapter.

Use the Robots Meta Tag

Like the robots.txt file, the intent of the robots meta tag[76] is to warn spiders to stay clear of your website. Unfortunately, this tactic suffers from many of the same limitations as the robots.txt file, because it also lacks an enforcement mechanism.


A typical robots meta tag is shown in Listing 27-1.

<head>

<meta name="robots" content="noindex, nofollow">

</head>

Listing 27-1: The robots meta tag

There are two main commands for this meta tag: noindex and nofollow. The first command tells spiders not to index the web page in search results. The second command tells spiders not to follow links from this web page to other pages. Conversely, index and follow commands are also available, and they achieve the opposite effect. These commands may be used together or independently.

The problem with site usage policies, robots.txt files, and meta tags is that the webbots visiting your site must voluntarily honor your requests. On a good day, this might happen. On its own, a Terms of Service policy, a robots.txt file, or a robots meta tag is something short of a social contract, because a contract requires at least two willing parties.


There is no enforcing agency to contact when someone doesn't honor your requests. If you want to deter webbots and spiders, you should start by asking nicely and then move on to the tougher approaches described next.

[73] The filename robots.txt is case sensitive. It must always be lowercase.
[74] Each website should have only one robots.txt file.
[75] The robots.txt specification is available at http://www.robotstxt.org.
[76] The specification for the robots meta tag is available at http://www.robotstxt.org/wc/meta_user.html.


Building Speed Bumps

Better methods of deterring webbots are ones that make it difficult for a webbot to operate on a website. Just remember, however, that a determined webbot designer may overcome these obstacles.

Selectively Allow Access to Specific Web Agents

Some developers may be tempted to detect their visitors' web agent names and only serve pages to specific browsers like Internet Explorer or Firefox. This is largely ineffective because a webbot can pose as any web agent it chooses.[77]

However, if you insist on implementing this strategy, make sure you use a server-side method of detecting the agent, since you can't trust a webbot to interpret JavaScript.
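
A server-side sketch of that kind of agent filtering is shown below; the allowed agent names are arbitrary examples, and, as noted above, the User-Agent header is trivial to spoof.

$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : "";

# Serve the page only to agents that identify themselves as common browsers
if(stristr($agent, "MSIE") === FALSE && stristr($agent, "Firefox") === FALSE)
    {
    header("HTTP/1.1 403 Forbidden");
    exit("Browser not supported.");
    }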

Use Obfuscation

As you learned in Chapter 20, obfuscation is the practice of hiding something through confusion.


For example, you could use HTML special characters to obfuscate an email link, as shown in Listing 27-2.

Please email me at:
<a href="mailto:&#109;&#101;&#64;<s></s>&#97;&#100;&#100;&#114;&#46;&#99;&#111;&#109;">
    &#109;&#101;<b></b>&#64;&#97;&#100;&#100;&#114;<u></u>&#46;&#99;&#111;&#109;
</a>

Listing 27-2: Obfuscating the email address me@addr.com with HTML special characters

While the special characters are hard for a person to read, a browser has no problem rendering them, as you can see in Figure 27-2.

You shouldn't rely on obfuscation to protect data because once it is discovered, it is usually easily defeated. For example, in the previous illustration, the PHP function html_entity_decode() can be used to convert the codes back into characters. There is no effective way to protect HTML through obfuscation. Obfuscation will slow determined webbot developers, but it is not apt to stop them, because obfuscation is not the same as encryption. Sooner or later, a determined webbot designer is bound to decode any obfuscated text.[78]
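
As a quick sketch, decoding the address used in Listing 27-2 takes a single function call.

$obfuscated = "&#109;&#101;&#64;&#97;&#100;&#100;&#114;&#46;&#99;&#111;&#109;";
echo html_entity_decode($obfuscated);   // Prints: me@addr.com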


Figure 27-2. A browser rendering of the obfuscated script in Listing 27-2

Use Cookies, Encryption, JavaScript, and Redirection

Lesser webbots and spiders have trouble handling cookies, encryption, and page redirection, so attempts to deter webbots by employing these methods may be effective in some cases. While PHP/CURL resolves most of these issues, webbots still stumble when interpreting cookies and page redirections written in JavaScript, since most webbots lack JavaScript interpreters. Extensive use of JavaScript can often effectively deter webbots, especially if JavaScript creates links to other pages or if it is used to create HTML content.

Authenticate Users

Where possible, place all confidential information in password-protected areas. This is your best defense against webbots and spiders. However, authentication only affects people without login credentials; it does not prevent authorized users from developing webbots and spiders to harvest information and use services within password-protected areas of a website. You can learn about writing webbots that access password-protected websites in Chapter 21.

Update Your Site Often

Possibly the single most effective way to confuse a webbot is to change your site on a regular basis. A website that changes frequently is more difficult for a webbot to parse than a static site. The challenge is to change the things that foul up webbot behavior without making your site hard for people to use. For example, you may choose to randomly take one of the following actions (a sketch of the first appears after the list):

Change the order of form elements
Change form methods
Rename files in your website
Alter text that may serve as convenient parsing reference points, like form variables
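
Here is a sketch of the first idea, giving a form field a new, random name on each request; the field's purpose, file names, and session key are hypothetical.

<?php
session_start();

# Give the field a random name on every request and remember it
# so the processing script can find the submitted value later
$_SESSION['email_field'] = "fld_".substr(md5(uniqid(rand(), TRUE)), 0, 8);
?>
<form method="POST" action="process.php">
    <input type="text" name="<?php echo $_SESSION['email_field']; ?>">
    <input type="submit" value="Submit">
</form>

The processing script reads the value with $_POST[$_SESSION['email_field']], while a webbot that hard-coded yesterday's field name submits nothing useful.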



These techniques may be easy to implement if you're using a high-quality content management system (CMS). Without a CMS, though, it will take a more deliberate effort.

Embed Text in Other Media

Webbots and spiders rely on text represented by HTML codes, which are nothing more than numbers capable of being matched, compared, or manipulated with mathematical precision. However, if you place important text inside images or other non-textual media like Flash, movies, or Java applets, that text is hidden from automated agents. This is different from the obfuscation method discussed earlier, because embedding relies on the reasoning power of a human to react to his or her environment. For example, it is now common for authentication forms to display text embedded in an image and ask a user to type that text into a field before it allows access to a secure page. While it's possible for a webbot to process text within an image, it is quite difficult. This is especially true when the text is varied and on a busy background, as shown in Figure 27-3. This technique is called a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA).[79] You can find more information about CAPTCHA devices at this book's website.


Before embedding all your website's text in images, however, you need to recognize the downside. When you put text in images, beneficial spiders, like those used by search engines, will not be able to index your web pages. Placing text within images is also a very inefficient way to render text.

Figure 27-3. Text within an image is hard for a webbot to interpret

[77] Read Chapter 3 if you are interested in browser spoofing.
[78] To learn the difference between obfuscation and encryption, read Chapter 20.


[79] Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) is a registered trademark of Carnegie Mellon University.


Setting Traps

Your strongest defenses against webbots are techniques that detect webbot behavior. Webbots behave differently because they are machines and don't have the reasoning ability of people. Therefore, a webbot will do things that a person won't do, and a webbot lacks information that a person either knows or can figure out by examining his or her environment.

Create a Spider Trap

A spider trap is a technique that capitalizes on the behavior of a spider, forcing it to identify itself without interfering with normal human use. The spider trap in the following example exploits the spider behavior of indiscriminately following every hyperlink on a web page. If some links are either invisible or unavailable to people using browsers, you'll know that any agent that follows the link is a spider. For example, consider the hyperlinks in Listing 27-3.

<a href="spider_trap.php"></a>
<a href="spider_trap.php"><img src="spacer.gif" width="0" height="0"></a>

Listing 27-3: Two spider traps


There are many ways to trap a spider. Some other techniques include image maps with hot spots that don't exist and hyperlinks located in invisible frames without width or height attributes.
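
A sketch of what the spider_trap.php target from Listing 27-3 might do is shown below; the log file name is hypothetical, and the hit could just as easily be recorded in a database.

# spider_trap.php: record the address of any agent that follows the hidden link
$visitor = $_SERVER['REMOTE_ADDR'];
$record  = date("r")." Spider trap hit from ".$visitor."\n";
file_put_contents("spider_trap_log.txt", $record, FILE_APPEND);

# A separate script (or the webserver configuration) can later deny
# requests from the addresses collected in this log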

Fun Things to Do with Unwanted Spiders

Once unwanted guests are detected, you can treat them to a variety of services. Identifying a spider is the first step in dealing with it. Moreover, with browser-spoofing techniques, a spider trap becomes a necessity in determining which traffic is automated and which is human. What you do once you detect a spider is up to you, but Table 27-1 should give you some ideas. Just remember to act within commonsense legal guidelines and your own website policies.

Table 27-1. Strategies for Responding When You Identify a Spider

Strategy        Implementation

Banish          Record the IP addresses of spiders that reach the spider trap and
                configure the webserver to ignore future requests from these addresses.

Limit access    Record the IP addresses of the spiders in the spider trap and limit the
                pages they can access on their next visit.

Mislead         Depending on the situation, you could redirect known (unwanted) spiders
                to an alternate set of misleading web pages. As much as I love this
                tactic, you should consult with an attorney before implementing this idea.

Analyze         Analyze the IP address and find out where the spider comes from, who
                might own it, and what it is up to. A good resource for identifying IP
                addresses registered in the United States is http://www.arin.net. You
                could even create a special log that tracks all activity from known
                hostile spiders. You can also use this technique to learn whether or not
                a spider is part of a distributed attack.

Ignore          The default option is to just ignore any automated activity on your
                website.


Final Thoughts

Before website owners decide to expend their resources on deterring webbots, they should ask themselves a few questions.

What can a webbot do with your website that a person armed with a browser cannot do?
Are your deterrents keeping desirable spiders (like search engines) from accessing your web pages?
Does an automated agent (that you want to thwart) pose an actual threat to your website? Is it possible that it may even provide a benefit, as a procurement bot might?
If your website contains information that needs to be protected from webbots, should that information really be online in the first place?
If you put information in a public place, do you really have the right to bar certain methods of reading it?

If you still insist on banning webbots from your website, keep in mind that unless you deliberately develop measures like the ones near the end of this chapter, you will probably have little luck in defending your site from rogue webbots.


Chapter 28. KEEPING WEBBOTS OUT OF TROUBLE

By this point, you know how to access, download, parse, and process any of the 76 million websites on the Internet.[80] Knowing how to do something, however, does not give you the right to do it. While I have cast warnings throughout the book, I haven't, until now, focused on the consequences of designing webbots or spiders that act selfishly and without regard to the rights of website owners or related infrastructure.[81]

Since many businesses rely on the performance of their websites to conduct business, you should consider interfering with a corporate website equivalent to interfering with a physical store or factory. When deploying a webbot or spider, remember that someone else is paying for hosting, bandwidth, and development for the websites you target. Writing webbots and spiders that consume irresponsible amounts of bandwidth, guess passwords, or capriciously reuse intellectual property may well be a violation of someone's rights and will eventually land you in trouble.

Back in the day—that is, before the popularization of the Internet—programmers had to win their stripes before they earned the confidence of their peers and gained access to networks or sensitive information.


At that time, people who had access to data networks were less likely to abuse them because they had a stake in the security of data and the performance of networks. One of the outcomes of the Internet's free access to information, open infrastructure, and apparent anonymous browsing is that it is now easier than ever to act irresponsibly. A free dial-up account gives anyone and everyone access to (and the opportunity to compromise) servers all over the world. With worldwide access to data centers and the ability to download quick exploits, it's easy for people without a technical background (or a vested interest in the integrity of the Internet) to access confidential information or launch attacks that render services useless to others.

The last thing I want to do is pave a route for people to create havoc on the Internet. The purpose of this book is to help Internet developers think beyond the limitations of the browser and to develop webbots that do new and useful things. Webbot development is still virgin territory and there are still many new and creative things to do. You simply lack creativity if you can't develop webbots that do interesting things without violating someone's rights.

Webbots (and their developers) generally get into trouble when they make unauthorized use of copyrighted information or use an excessive amount of a website's infrastructure (bandwidth, servers, administration, etc.). This chapter addresses both of these areas.


We'll also explore the requests webmasters make to limit webbot use on their websites.

Note

This chapter introduces warnings that all webbot and spider writers should understand and consider before embarking on webbot projects. I'm not dispensing legal advice, so don't even think of blaming me if you misbehave and are sued or find the FBI knocking at your door. This is my attempt to identify a few (but not all) issues related to developing webbots and spiders. Perhaps with this information, you will be able to at least ask an attorney intelligent questions. To reiterate, I am not a lawyer, and this is not legal advice. My responsibility is to tell you that, if misused, automated web agents can get you into deep trouble. In turn, you're obligated to take responsibility for your own actions and to consult an attorney who is aware of local laws before doing anything that even remotely violates the rights of someone else. I urge you to think before you act.

It's All About Respect

Your career as a webbot developer will be short-lived if you don't respect the rights of those who own, maintain, and rely upon the web servers your webbots and spiders target.


Remember that websites are designed for people using browsers and that often a website's profit model is dependent on those traffic patterns. In a matter of seconds, a single webbot can create as much web traffic as a thousand web surfers, without the benefit of generating commerce or ad revenue, or extending a brand. It's helpful to think of webbots as "super browsers," as webbots have increased abilities. But in order to walk among mere browsers, webbots and spiders need to comply with the norms and customs of the rest of the web agents on the Internet.

In Chapter 27 you read about website policies, robots.txt files, robots meta tags, and other tools server administrators use to regulate webbots and spiders. It's important to remember, however, that obeying a webmaster's webbot restrictions does not absolve webbot developers from responsibility. For example, even if a webbot doesn't find any restrictions in the website's Terms of Service agreement, robots.txt file, or meta tags, the webbot developer still doesn't have permission to violate the website's intellectual property rights or use inordinate amounts of the webserver's bandwidth.

[80] This estimate of the number of websites on the Internet as of February 2006 comes from http://news.netcraft.com/archives/web_server_survey.html.


[81] If you interfere with the operation of one site, you may also affect other, non-targeted websites if they are hosted on the same (virtual) server.


Copyright

One way to keep your webbots out of trouble is to obey copyright, the set of laws that protects intellectual property owners. Copyright allows people and organizations to claim the exclusive right to use specific text, images, and media, and to control the manner in which they are published. All webbot developers need to have an awareness of copyright. Ignoring copyright can result in banishment from websites and even lawsuits.

Do Consult Resources

Before you venture off on your own (or assume that what you're reading here applies to your situation), you should check out a few other resources. For basic copyright information, start with the website of the United States Copyright Office, http://www.copyright.gov. Another resource, which you might find more readable, is http://www.bitlaw.com/copyright, maintained by Daniel A. Tysver of Beck & Tysver, a firm specializing in intellectual property law. Of course, these websites only apply to US laws. If you're outside the United States, you'll need to consult other resources.


Don't Be an Armchair Lawyer

Mitigating factors and varying interpretations affect copyright law enforcement. There seems to be an exception to every rule. If you have specific questions about copyright law, the smartest thing to do is to consult an attorney. Since the Internet is relatively new, intellectual property law—as it applies to the Internet—is somewhat fluid and open to interpretation. Ultimately, courts interpret the law. While it is not within the scope of this book to cover copyright in its entirety, the following sections identify common copyright issues that webbot developers may find interesting.

Copyrights Do Not Have to Be Registered

In the United States, you do not have to officially register a copyright with the Copyright Office to have the protection of copyright laws. The US Copyright Office states that copyrights are granted automatically, as soon as an original work is created. As the Copyright Office describes on its website:

Copyright is secured automatically when the work is created, and a work is "created" when it is fixed in a copy or phonorecord for the first time. "Copies" are material objects from which a work can be read or visually perceived either directly or with the aid of a machine or device, such as books, manuscripts, sheet music, film, videotape, or microfilm. "Phonorecords" are material objects embodying fixations of sounds (excluding, by statutory definition, motion picture soundtracks), such as cassette tapes, CDs, or LPs. Thus, for example, a song (the "work") can be fixed in sheet music ("copies") or in phonograph disks ("phonorecords"), or both. If a work is prepared over a period of time, the part of the work that is fixed on a particular date constitutes the created work as of that date.[82]

Notice that online content isn't specifically mentioned in the above paragraph, while there are specific references to original works "fixed in copy" through books, sheet music, videotape, CDs, and LPs. While there is no specific mention of websites, one may assume that references to works that may be "perceived either directly or through the aid of a machine or device" also cover content on webservers.


The important thing for webbot developers to remember is that it is dangerous to assume that something is free to use if it is not expressly copyrighted.

If you don't need to register a copyright, why do people still do it? People file for specific copyrights to strengthen their ability to defend their rights in court. If you are interested in registering a copyright for a website, the US Copyright Office has a special publication for you.[83]

Assume "All Rights Reserved"

If you hold (or claim to hold) a copyright, you don't need to explicitly add the phrase all rights reserved to the copyright notice. For example, if a movie script does not indicate that all rights are reserved, you are not free to assume that you can legally produce an online cartoon based on the movie. Similarly, if a web page doesn't explicitly state that the site owner reserves all rights, don't assume that a webbot can legally use the site's images in an unrelated project. The habit of stating all rights reserved in a copyright notice stems from old intellectual property treaties that required it. If a work is unmarked, assume that all rights are reserved.

You Cannot Copyright a Fact



The US Copyright Office website explains that copyright protects the way one expresses oneself and that no one has exclusive rights to facts, as stated below:

Copyright protects the particular way an author has expressed himself; it does not extend to any ideas, systems, or factual information conveyed in the work.[84]

How would you interpret this? You might conclude that someone cannot copy the manner or style in which someone else publishes facts, but that the facts themselves are not copyrightable. What happens if a business announces on its website that it has 83 employees? Does the head count for that company become a fact that is not protected by copyright laws? What if the website also lists prices, phone numbers, addresses, or historic dates?

You might be safe if you write a webbot that only collects pure facts.[85] But that doesn't prevent someone else from having a differing opinion and challenging you in court.

You Can Copyright a Collection of Facts if Presented Creatively


In the previous excerpt from the US Copyright Office website, we learned that copyright law protects the "particular way" in which someone expresses him or herself and that facts themselves are not protected by copyright. One way to think of this is that while you cannot copyright a fact, you might be able to copyright a collection of facts—if they are presented creatively. For example, a phone company cannot copyright a phone number, but it can copyright an entire phone directory website, if the phone numbers are presented in an original and creative way.

It appears that courts are serious when they say copyright only applies to collections of facts when they are presented in new and creative ways. For example, in one case a phone company republished the names and phone numbers (subscriber information) from another phone company's directory.[86] A dispute over intellectual property rights erupted between the two companies, and the case went to court. The fact that the original phone book contained phone numbers from a selected area and listed them in alphabetical order was not enough creativity to secure copyright protection. The judge ruled that the original phone directory lacked originality and was not protected by copyright law—even though the publication had a registered copyright.


the publication had a registered copyright. Ifnothing else, this indicates that intellectualproperty law is open to interpretation and thatindividuals' interpretations of the law are lessimportant than court decisions.

You Can Use Some Material Under Fair Use Laws

United States copyright law also allows for fair use, a set of exclusions from copyright for material used within certain limits. The scope of what falls into the fair use category is largely dependent on the following:

Nature of the copyrighted material
Amount of material used
Purpose for which the material is used
Market effect of the new work upon the original

Copyrighted material commonly falls under fair use if a limited amount of the material is used for scholastic or archival purposes. Fair use also protects the right to use selections of copyrighted material for parody, in short quotations, or in reviews. Generally speaking, you can quote a small amount of copyrighted material if you include a reference to the original source. However, you may become a target for a lawsuit if you profit from selling shirts featuring a catchphrase from a movie, even though you are only quoting a small part of a larger work, as it will likely interfere with the market for legitimate T-shirts.

The US Copyright Office says the following regarding fair use:

Under the fair use doctrine of the U.S. copyright statute, it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports. There are no legal rules permitting the use of a specific number of words, a certain number of musical notes, or percentage of a work. Whether a particular use qualifies as fair use depends on all the circumstances.[87]

As you may guess, fair use exclusions are often abused and frequently litigated. A famous case surrounding fair use was Kelly v. Arriba Soft.[88]

In this case, Leslie A. Kelly conducted an online business of licensing copyrighted images. The Arriba Soft Corporation, in contrast, created an image-management program that used webbots and spiders to search the Internet for new images to add to its library. Arriba Soft failed to identify the sources of the images it found and gave the general impression that the images it found were available under fair use statutes. While Kelly eventually won her case against Arriba Soft, it took five years of charges, countercharges, rulings, and appeals. Much of the confusion in settling the suit was caused by applying pre-Internet laws to determine what constituted fair use of intellectual property published online.

[82] US Copyright Office, "Copyright Office Basics," July 2006 (http://www.copyright.gov/circs/circ1.html).
[83] US Copyright Office, "Copyright Registration for Online Works (Circular 66)," July 2006 (http://www.copyright.gov/circs/circ66.html).
[84] US Copyright Office, "Fair Use," July 2006 (http://www.copyright.gov/fls/fl102.html).
[85] Consult your attorney for clarification on your legal rights to collect specific information.
[86] Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340, 1991.
[87] US Copyright Office, "Can I Use Someone Else's Work? Can Someone Else Use Mine? (FAQ)," July 12, 2006 (http://www.copyright.gov/help/faq/faq-fairuse.html#howmuch).
[88] If you Google Kelly v. Arriba, you'll find a wealth of commentary and court rulings for this saga.

Trespass to Chattels

In addition to copyright, the other main concept that you should be aware of is trespass to chattels. Unlike traditional trespass, which refers to unauthorized use of real property (land or real estate), trespass to chattels prevents or impairs an owner's use of or access to personal property. The trespass-to-chattels laws were written before the invention of the Internet, but in certain instances, they still protect access to personal property. Consider the following examples of trespass to chattels:

Blocking access to someone's boat with a floating swim platform
Preventing the use of a fax machine by continually spamming it with nuisance or junk faxes
Erecting a building that blocks someone's ocean view

From your perspective as a webbot or spider developer, violation of trespass to chattels may include:

Consuming so much bandwidth from a target server that you affect the website's performance or other people's use of the website
Increasing network traffic on a website to the point that the owner is forced to add infrastructure to meet traffic needs
Sending excessive quantities of email so as to diminish the utility of email or email servers

To better understand trespass to chattels, consider the spider developed by a company called Bidder's Edge, which cataloged auctions on eBay. This centralized spider collected information about auctions in an effort to aggregate the contents of several auction sites, including eBay, into one convenient website. In order to collect information on all eBay auctions, it downloaded as many as 100,000 pages a day.

To put the impact of the Bidder's Edge spider into context, assume that a typical eBay web page is about 250KB in size. If the spider requested 100,000 pages a day, the spider would consume 25GB of eBay's bandwidth every day, or 775GB each month. In response to the increased web traffic, eBay was forced to add servers and upgrade its network.

With this many requests coming from Bidder's Edge spiders, it was easy for eBay to identify the source of the increased server load. Initially, eBay claimed that Bidder's Edge illegally used its copyrighted auctions. When that argument proved unsuccessful, eBay pursued a trespass-to-chattels case.[89] In this case, eBay successfully argued that the Bidder's Edge spider increased the load on its servers to the point that it interfered with the use of the site. eBay also claimed a loss due to the need to upgrade its servers to facilitate the increased network traffic caused by the Bidder's Edge spider. Bidder's Edge eventually settled with eBay out of court, but only after it was forced offline and agreed to change its business plan.

How do you avoid claims of trespass to chattels? You can start by not placing an undue load on a target server. If the information is available from a number of sources, you might target multiple servers instead of relying on a single source. If the information is only available from a single source, it is best to limit downloads to the absolute minimum number of pages needed to do the job. If that doesn't work, you should evaluate whether the risk of a lawsuit outweighs the opportunities created by your webbot. You should also ensure that your webbot or spider does not cause damage to a business or individual.
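One simple courtesy, consistent with the advice above, is to pause between page requests and cap the number of downloads per run. The following is a minimal sketch only; the URL list, the two-second delay, and the 500-page cap are hypothetical values you would tune for your own project, and they are not taken from the book.

<?
# Sketch: limit total downloads and pause between requests to avoid undue server load
$target_urls   = array("http://www.example.com/page1", "http://www.example.com/page2");
$max_downloads = 500;   // Never fetch more than this many pages per run (example value)
$delay_seconds = 2;     // Pause between requests to spread out the load (example value)

$downloads = 0;
foreach ($target_urls as $url) {
    if ($downloads >= $max_downloads)
        break;                                      // Stop once the cap is reached

    $s = curl_init();
    curl_setopt($s, CURLOPT_URL, $url);             // Define target page
    curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);  // Return page in a string
    $page = curl_exec($s);
    curl_close($s);

    $downloads++;
    sleep($delay_seconds);                          // Wait before the next request
}
?>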

[89] You can find more information about this case at http://pub.bna.com/lw/21200.htm. Googling eBay, Inc. v. Bidder's Edge will also provide links to comments about the succession of rulings on this case.

Internet Law

While the laws protecting physical property are long established and reinforced by considerable numbers of court rulings, the laws governing virtual property and virtual behavior are less mature and constantly evolving. While one would think the same laws should protect both online and offline property, the reality is that most laws were written before the Internet and don't directly address those things that are unique to it, like email, frames, hyperlinks, or blogs. Since many existing laws do not specifically address the Internet, the application of the law (as it applies to the Internet) is open to much interpretation.

One example of a law written to deal specifically with Internet abuse is Virginia's so-called Anti-Spam Law.[90] This law is a response to the large amount of server resources consumed by servicing unwanted email. The law attacks spammers indirectly by declaring it a felony to falsify or forge email addresses in connection with unsolicited email. It also provides penalties of as much as $10.00 per unsolicited email or $25,000 per day. Laws like this one are required to address specific Internet-related concerns. Well-defined rules, like those imposed by Virginia's Anti-Spam Law, are frequently difficult to derive from existing statutes. And while it may be possible to prosecute a spammer with laws drafted before the popularity of the Internet, less is open to the court's interpretation when the law deals specifically with the offense.

When contemplating the laws that apply to you as a webbot developer, consider the following:

Webbots and spiders add a wrinkle to the way online information is used, as most web pages are intended to be used with manually operated browsers. For example, disputes may arise when webbots ignore paid advertising and disrupt the intended business model of a website. Webmasters, however, usually want some webbots (such as search engine crawlers) to visit their sites.
The Internet is still relatively young and there are few precedents for online law. Existing intellectual property law doesn't always apply well to the Internet. For example, in the Kelly v. Arriba Soft case, which we discussed earlier, there was serious contention over whether or not a website has the right to link to other web pages. The opportunity to challenge (and regulate) hyper-references to media belonging to someone else didn't exist before the Internet.
New laws governing online commerce and intellectual property rights are constantly introduced as the Internet evolves and people conduct themselves in different ways. For example, blogs have recently created a number of legal questions. Are bloggers publishers? Are bloggers responsible for posts made by visitors to their websites? The answer to both questions is no—at least for now.[91]

It is always wise for webbot developers to stay current with online laws, since old laws are constantly being tested and new laws are being written to address specific issues.
The strategies people use to violate as well as protect online intellectual property are constantly changing. For example, pay per click advertising, a process in which companies only pay for ads that people click, has spawned the arrival of so-called clickbots, which simulate people clicking ads to generate revenue for the owner of the website carrying the advertisements. People test the law again by writing webbots that stuff the ballot boxes of online polls and contests. In response to the threat mounted by new webbot designs, web developers counter with technologies like CAPTCHA devices,[92] which force people to type text from an image (or complete some other task that would be similarly difficult for webbots) before accessing a website. There may be as many opportunities for web developers to create methods to block webbots as there are opportunities to write webbots.
Laws vary from country to country. And since websites can be hosted by servers anywhere in the world, it can be difficult to identify—let alone prosecute—the violator of a law when the offender operates from a country that doesn't honor other countries' laws.

[90] "SB 881 Computer Crimes Act; electronicmail," Virginia Senate, approved March 29, 1999(http://leg1.state.va.us/cgi-bin/legp504.exe?991+sum+SB881).[91] In 2006 a Pennsylvania court ruled that

Page 568: Webbots, Spiders, And Screen Scrapers - Michael Schrenk

[91] In 2006 a Pennsylvania court ruled thatbloggers are not responsible for comments postedto the blog by their readers; to read a PDF of thejudge's opinion, visithttp://www.paed.uscourts.gov/documents/opinions/06D0657P.pdf[92] More information about CAPTCHA devices isavailable in Chapter 27.

Final Thoughts

The knowledge and techniques required to develop a useful webbot are identical to those required to develop a destructive one. Therefore, it is imperative to realize when your enthusiasm for what you're doing obscures your judgment and causes you to cross a line you didn't intend to cross. Be careful. Talk to a qualified attorney before you need one.

If Internet law is appealing to you or if you are interested in protecting your online rights, you should consider joining the Electronic Frontier Foundation (EFF). This group of lawyers, coders, and other volunteers is dedicated to protecting digital rights. You can find more information about the organization at its website, http://www.eff.org.

Appendix A. PHP/CURL REFERENCE

This appendix highlights the options and features of PHP/CURL that will be of greatest interest to webbot developers. In addition to the features described here, you should know that PHP/CURL is an extremely powerful interface with a dizzying array of options. A full specification of PHP/CURL is available at the PHP website.[93]

Creating a Minimal PHP/CURL Session

In some regards, a PHP/CURL session is similar to a PHP file I/O session. Both create a session (or file handle) to reference an external file. And in both cases, when the file transfer is complete, the session is closed. However, PHP/CURL differs from standard file I/O because it requires a series of options that define the nature of the file transfer to be set before the exchange takes place. These options are set individually, in any order. When many options are required, the list of settings can be long and confusing. For simplicity, Listing A-1 shows the minimal options required to create a PHP/CURL session that will put a downloaded file into a variable.

<?
# Open a PHP/CURL session
$s = curl_init();

# Configure the cURL command
curl_setopt($s, CURLOPT_URL, "http://www.schrenk.com");  // Define target site
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);           // Return in string

# Execute the cURL command (send contents of target web page to string)
$downloaded_page = curl_exec($s);

# Close PHP/CURL session
curl_close($s);
?>

Listing A-1: A minimal PHP/CURL session

The rest of this section details how to initiate sessions, set options, execute commands, and close sessions in PHP/CURL. We'll also look at how PHP/CURL provides transfer status and error messages.

[93] See http://us2.php.net/manual/en/ref.curl.php.

Initiating PHP/CURL Sessions

Before you use cURL, you must initiate a session with the curl_init() function. Initialization creates a session variable, which identifies configurations and data belonging to a specific session. Notice how the session variable $s, created in Listing A-1, is used to configure, execute, and close the entire PHP/CURL session. Once you create a session, you may use it as many times as you need to.

Setting PHP/CURL Options

The PHP/CURL session is configured with the curl_setopt() function. Each individual configuration option is set with a separate call to this function. The script in Listing A-1 is unusual in its brevity. In normal use, there are many calls to curl_setopt(). There are over 90 separate configuration options available within PHP/CURL, making the interface very versatile.[94] The average PHP/CURL user, however, uses only a small subset of the available options. The following sections describe the PHP/CURL options you are most apt to use. While these options are listed here in order of relative importance, you may declare them in any order. If the session is left open, the configuration may be reused many times within the same session.

CURLOPT_URL

Use the CURLOPT_URL option to define the target URL for your PHP/CURL session, as shown in Listing A-2.

curl_setopt($s, CURLOPT_URL, "http://www.schrenk.com/index.php");

Listing A-2: Defining the target URL

You should use a fully formed URL describing the protocol, domain, and file in every PHP/CURL file request.

CURLOPT_RETURNTRANSFER

The CURLOPT_RETURNTRANSFER option must be set to TRUE, as in Listing A-3, if you want the result to be returned in a string. If you don't set this option to TRUE, PHP/CURL echoes the result to the terminal.

curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE); // Return in string

Listing A-3: Telling PHP/CURL that you want the result to be returned in a string

CURLOPT_REFERER

The CURLOPT_REFERER option allows your webbot to spoof a hyper-reference that was clicked to initiate the request for the target file. The example in Listing A-4 tells the target server that someone clicked a link on http://www.a_domain.com/index.php to request the target web page.

curl_setopt($s, CURLOPT_REFERER, "http://www.a_domain.com/index.php");

Listing A-4: Spoofing a hyper-reference

CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS

The CURLOPT_FOLLOWLOCATION option tells cURL that you want it to follow every page redirection it finds. It's important to understand that PHP/CURL only honors header redirections and not redirections set with a refresh meta tag or with JavaScript, as shown in Listing A-5.

<?
# Example of redirection that cURL will follow
header("Location: http://www.schrenk.com");
?>

<!-- Examples of redirections that cURL will not follow -->
<meta http-equiv="Refresh" content="0;url=http://www.schrenk.com">
<script>document.location="http://www.schrenk.com"</script>

Listing A-5: Redirects that cURL can and cannot follow

Any time you use CURLOPT_FOLLOWLOCATION, set CURLOPT_MAXREDIRS to the maximum number of redirections you care to follow. Limiting the number of redirections keeps your webbot out of infinite loops, where redirections point repeatedly to the same URL. My introduction to CURLOPT_MAXREDIRS came while trying to solve a problem brought to my attention by a network administrator, who initially thought that someone (using a webbot I wrote) launched a DoS attack on his server. In reality, the server misinterpreted the webbot's header request as a hacking exploit and redirected the webbot to an error page. There was a bug on the error page that caused it to repeatedly redirect the webbot to the error page, causing an infinite loop (and near-infinite bandwidth usage). The addition of CURLOPT_MAXREDIRS solved the problem, as demonstrated in Listing A-6.

curl_setopt($s, CURLOPT_FOLLOWLOCATION, TRUE); // Follow header redirections
curl_setopt($s, CURLOPT_MAXREDIRS, 4);         // Limit redirections to 4

Listing A-6: Using the CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS options

CURLOPT_USERAGENT

Use this option to define the name of your user agent, as shown in Listing A-7. The user agent name is recorded in server access log files and is available to server-side scripts in the $_SERVER['HTTP_USER_AGENT'] variable.

$agent_name = "test_webbot";curl_setopt($s, CURLOPT_USERAGENT, $agent_name);

Listing A-7: Setting the user agent name

Keep in mind that many websites will not serve pages correctly if your user agent name is something other than a standard web browser.
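If a site refuses to serve pages to an unfamiliar agent, one option is to present a browser-style user agent string instead. The sketch below illustrates the general format only; the string shown is an example and is not taken from the book.

# Sketch: presenting a browser-style user agent name (example string only)
$browser_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)";
curl_setopt($s, CURLOPT_USERAGENT, $browser_agent);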

CURLOPT_NOBODY and CURLOPT_HEADER

These options tell PHP/CURL to return either the web page's header or body. By default, PHP/CURL will always return the body, but not the header. This explains why setting CURLOPT_NOBODY to TRUE excludes the body, and setting CURLOPT_HEADER to TRUE includes the header, as shown in Listing A-8.

curl_setopt($s, CURLOPT_HEADER, TRUE); // Include the header
curl_setopt($s, CURLOPT_NOBODY, TRUE); // Exclude the body

Listing A-8: Using the CURLOPT_HEADER and CURLOPT_NOBODY options
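A common use of these two options, sketched below under the same assumptions as Listing A-1, is to fetch only a page's header, for example to confirm that a URL responds without downloading its body.

<?
# Sketch: download only the HTTP header of a page
$s = curl_init();
curl_setopt($s, CURLOPT_URL, "http://www.schrenk.com");
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($s, CURLOPT_HEADER, TRUE);   // Include the header
curl_setopt($s, CURLOPT_NOBODY, TRUE);   // Exclude the body
$header_only = curl_exec($s);            // Contains just the header text
curl_close($s);
?>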

CURLOPT_TIMEOUT

If you don't limit how long PHP/CURL waits for a response from a server, it may wait forever—especially if the file you're fetching is on a busy server or you're trying to connect to a nonexistent or inactive IP address. (The latter happens frequently when a spider follows dead links on a website.) Setting a time-out value, as shown in Listing A-9, causes PHP/CURL to end the session if the download takes longer than the time-out value (in seconds).

curl_setopt($s, CURLOPT_TIMEOUT, 30); // Don't wait longer than 30 seconds

Listing A-9: Setting a socket time-out value
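PHP/CURL also provides a separate CURLOPT_CONNECTTIMEOUT option, which limits only the time spent establishing the connection. This option is not covered in the listing above, so treat the snippet below as a supplementary sketch that assumes an existing session $s created as in Listing A-1.

# Sketch: limit connection time and total transfer time separately
curl_setopt($s, CURLOPT_CONNECTTIMEOUT, 10); // Give up if no connection is made within 10 seconds
curl_setopt($s, CURLOPT_TIMEOUT, 30);        // Give up if the whole transfer takes longer than 30 seconds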

CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR

One of the slickest features of PHP/CURL is the ability to manage cookies sent to and received from a website. Use the CURLOPT_COOKIEFILE option to define the file where previously stored cookies exist. At the end of the session, PHP/CURL writes new cookies to the file indicated by CURLOPT_COOKIEJAR. An example is in Listing A-10; I have never seen an application where these two options don't reference the same file.

curl_setopt($s, CURLOPT_COOKIEFILE, "c:\bots\cookies.txt"); // Read cookie file
curl_setopt($s, CURLOPT_COOKIEJAR, "c:\bots\cookies.txt");  // Write cookie file

Listing A-10: Telling PHP/CURL where to read and write cookies

When specifying the location of a cookie file, always use the complete location of the file, and do not use relative addresses. More information about managing cookies is available in Chapter 22.
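Since relative addresses are discouraged, one way to build a complete cookie file location on any system is to anchor it to the script's own directory, as in this sketch; the cookies.txt filename is only an example.

# Sketch: build an absolute cookie file path from the script's own directory
$cookie_file = dirname(__FILE__) . "/cookies.txt";  // e.g., /home/bots/cookies.txt (example)
curl_setopt($s, CURLOPT_COOKIEFILE, $cookie_file);  // Read cookie file
curl_setopt($s, CURLOPT_COOKIEJAR, $cookie_file);   // Write cookie file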

CURLOPT_HTTPHEADER

The CURLOPT_HTTPHEADER configuration allows a cURL session to send an outgoing header message to the server. The script in Listing A-11 uses this option to tell the target server the MIME type it accepts, the content type it expects, and that the user agent is capable of decompressing compressed web responses. Note that CURLOPT_HTTPHEADER expects to receive data in an array.

$header_array[] = "Mime-Version: 1.0";$header_array[] = "Content-type: text/html; charset=iso-8859-1";$header_array[] = "Accept-Encoding: compress, gzip";curl_setopt($curl_session, CURLOPT_HTTPHEADER, $header_array);

Listing A-11: Configuring an outgoing header

CURLOPT_SSL_VERIFYPEER

You only need to use this option if the target website uses SSL encryption and the protocol in CURLOPT_URL is https:. An example is shown in Listing A-12.

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // No certificate

Listing A-12: Configuring PHP/CURL to skip verification of the server's SSL certificate

Depending on the version of PHP/CURL you use, this option may be required; if you don't use it, PHP/CURL will attempt to verify the target server's SSL certificate, and the request may fail unless a valid certificate authority bundle is available to your script.

CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH

As shown in Listing A-13, you may use the CURLOPT_USERPWD option with a valid username and password to access websites that use basic authentication. In contrast to using a browser, you will have to submit the username and password to every page accessed within the basic authentication realm.

curl_setopt($s, CURLOPT_USERPWD, "username:password");
curl_setopt($s, CURLOPT_UNRESTRICTED_AUTH, TRUE);

Listing A-13: Configuring PHP/CURL for basic authentication schemes

If you use this option in conjunction with CURLOPT_FOLLOWLOCATION, you should also use the CURLOPT_UNRESTRICTED_AUTH option, which will ensure that the username and password are sent to all pages you're redirected to, even if the host changes.

Exercise caution when using CURLOPT_USERPWD, as it is possible that you can inadvertently send username and password information to the wrong server, where it may appear in access log files.

CURLOPT_POST and CURLOPT_POSTFIELDS

The CURLOPT_POST and CURLOPT_POSTFIELDS options configure PHP/CURL to emulate forms with the POST method. Since the default method is GET, you must first tell PHP/CURL to use the POST method. Then you must specify the POST data that you want to be sent to the target webserver. An example is shown in Listing A-14.

curl_setopt($s, CURLOPT_POST, TRUE);             // Use POST method
$post_data = "var1=1&var2=2&var3=3";             // Define POST data values
curl_setopt($s, CURLOPT_POSTFIELDS, $post_data);

Listing A-14: Configuring POST method transfers

Notice that the POST data looks like a standard query string sent in a GET method. Incidentally, to send form information with the GET method, simply attach the query string to the target URL.
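To illustrate the GET alternative mentioned above, the sketch below builds a query string with PHP's http_build_query() function and appends it to the target URL; the variable names mirror Listing A-14, and the URL is only an example.

# Sketch: sending the same form data with the GET method
$get_data = http_build_query(array("var1" => 1, "var2" => 2, "var3" => 3));
curl_setopt($s, CURLOPT_URL, "http://www.schrenk.com/form.php?" . $get_data);  // Example URL
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);
$downloaded_page = curl_exec($s);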

CURLOPT_VERBOSE

The CURLOPT_VERBOSE option controls the quantity of status messages created during a file transfer. You may find this helpful during debugging, but it is best to turn off this option during the production phase, because it produces many entries in your server log file. A typical succession of log messages for a single file download looks like Listing A-15.

* About to connect() to www.schrenk.com port 80
* Connected to www.schrenk.com (66.179.150.101) port 80
* Connection #0 left intact
* Closing connection #0

Listing A-15: Typical messages from a verbose PHP/CURL session

If you're in verbose mode on a busy server, you'll create very large log files. Listing A-16 shows how to turn off verbose mode.

curl_setopt($s, CURLOPT_VERBOSE, FALSE); // Minimal logs

Listing A-16: Turning off verbose mode reduces the size of server log files.
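When you do need verbose output for debugging, PHP/CURL's CURLOPT_STDERR option can redirect those messages to a file of your choosing instead of the default error stream; this option is not covered in the listings above, and the filename below is only an example.

# Sketch: capture verbose output in a dedicated debug file rather than the server log
$debug_log = fopen("curl_debug.txt", "w");    // Example filename
curl_setopt($s, CURLOPT_VERBOSE, TRUE);       // Turn on verbose messages
curl_setopt($s, CURLOPT_STDERR, $debug_log);  // Send them to the debug file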

CURLOPT_PORT

By default, PHP/CURL uses port 80 for all HTTP sessions, unless you are connecting to an SSL-encrypted server, in which case port 443 is used.[95] These are the standard port numbers for the HTTP and HTTPS protocols, respectively. If you're connecting to a custom protocol or wish to connect to a non-web protocol, use CURLOPT_PORT to set the desired port number, as shown in Listing A-17.

curl_setopt($s, CURLOPT_PORT, 234); // Use port number 234

Listing A-17: Using nonstandard communication ports

Note

Configuration settings must be capitalized, as shown in the previous examples. This is because the option names are predefined PHP constants. Therefore, your code will fail if you specify an option as curlopt_port instead of CURLOPT_PORT.

[94] You can find a complete set of PHP/CURL options at http://www.php.net/manual/en/function.curl-setopt.php.
[95] Well-known and standard port numbers are defined at http://www.iana.org/assignments/port-numbers.

Executing the PHP/CURL Command

Executing the PHP/CURL command sets into action all the options defined with the curl_setopt() function. This command executes the previously configured session (referenced by $s in Listing A-18).

$downloaded_page = curl_exec($s);

Listing A-18: Executing a PHP/CURL command for session $s

You can execute the same command multiple times or use curl_setopt() to change configurations between calls of curl_exec(), as long as the session is defined and hasn't been closed. Typically, I create a new PHP/CURL session for every page I access.
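As a sketch of the reuse described above, the loop below keeps one session open, changes only CURLOPT_URL between executions, and closes the session when all pages have been fetched; the URL list is hypothetical.

<?
# Sketch: executing one PHP/CURL session several times with different target URLs
$urls = array("http://www.schrenk.com", "http://www.schrenk.com/index.php");

$s = curl_init();
curl_setopt($s, CURLOPT_RETURNTRANSFER, TRUE);   // Return pages in strings

$pages = array();
foreach ($urls as $url) {
    curl_setopt($s, CURLOPT_URL, $url);          // Reconfigure the same session
    $pages[$url] = curl_exec($s);                // Execute again with the new target
}

curl_close($s);
?>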

Retrieving PHP/CURL Session Information

Additional information about the current PHP/CURL session is available once a curl_exec() command is executed. Listing A-19 shows how to use this command.

$info_array = curl_getinfo($s);

Listing A-19: Getting additional information about the current PHP/CURL session

The curl_getinfo() command returns an array of information, including connect and transfer times, as shown in Listing A-20.

array(20) {
  ["url"]=> string(22) "http://www.schrenk.com"
  ["content_type"]=> string(29) "text/html; charset=ISO-8859-1"
  ["http_code"]=> int(200)
  ["header_size"]=> int(247)
  ["request_size"]=> int(125)
  ["filetime"]=> int(-1)
  ["ssl_verify_result"]=> int(0)
  ["redirect_count"]=> int(0)
  ["total_time"]=> float(0.884)
  ["namelookup_time"]=> float(0)
  ["connect_time"]=> float(0.079)
  ["pretransfer_time"]=> float(0.079)
  ["size_upload"]=> float(0)
  ["size_download"]=> float(19892)
  ["speed_download"]=> float(22502.2624434)
  ["speed_upload"]=> float(0)
  ["download_content_length"]=> float(0)
  ["upload_content_length"]=> float(0)
  ["starttransfer_time"]=> float(0.608)
  ["redirect_time"]=> float(0)
}

Listing A-20: Data made available by the curl_getinfo() command
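One practical use of this array, sketched below, is to confirm that the server returned a 200 OK status before trusting the downloaded page; the variable names follow Listing A-19.

# Sketch: verifying the HTTP status code after curl_exec()
$info_array = curl_getinfo($s);
if ($info_array['http_code'] != 200) {
    echo "Download problem: server returned HTTP " . $info_array['http_code'] . "\n";
}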

Viewing PHP/CURL Errors

The curl_error() function returns any errors that may have occurred during a PHP/CURL session. The usage for this function is shown in Listing A-21.

$errors = curl_error($s);

Listing A-21: Accessing PHP/CURL session errors

A typical error response is shown in Listing A-22.

Couldn't resolve host 'www.webbotworld.com'

Listing A-22: Typical PHP/CURL session error
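In practice, you might pair curl_error() with curl_errno(), which returns a numeric error code (0 means no error occurred), as in this sketch.

# Sketch: detecting and reporting a failed PHP/CURL transfer
$downloaded_page = curl_exec($s);
if (curl_errno($s) != 0) {
    echo "PHP/CURL error " . curl_errno($s) . ": " . curl_error($s) . "\n";
}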

Closing PHP/CURL Sessions

You should close a PHP/CURL session immediately after you are done using it, as shown in Listing A-23. Closing the PHP/CURL session frees up server resources, primarily memory.

curl_close($s);

Listing A-23: Closing a PHP/CURL session

In normal use, PHP performs garbage collection, freeing resources like variables, socket connections, and memory when the script completes. This works fine for scripts that control web pages and execute quickly. However, webbots and spiders may require that PHP scripts run for extended periods without garbage collection. (I've written webbot scripts that run for months without stopping.) Closing each PHP/CURL session is imperative if you're writing webbot and spider scripts that make many PHP/CURL connections and run for extended periods of time.

Appendix B. STATUS CODES

This appendix contains status codes returned by web (HTTP) and news (NNTP) servers. Your webbots and spiders should use these status codes to determine the success or failure of communication with servers. When debugging your scripts, status codes also provide hints as to what's wrong.

HTTP Codes

The following is a representative sample of HTTP codes. These codes reflect the status of an HTTP (web page) request. You'll see these codes returned in $returned_web_page['STATUS']['http_code'] if you're using the LIB_http library.

100 Continue
101 Switching Protocols
200 OK
201 Created
202 Accepted
203 Non-Authoritative Information
204 No Content
205 Reset Content
206 Partial Content
300 Multiple Choices
301 Moved Permanently
302 Found
303 See Other
304 Not Modified
305 Use Proxy
306 (Unused)
307 Temporary Redirect
400 Bad Request
401 Unauthorized
402 Payment Required
403 Forbidden
404 Not Found
405 Method Not Allowed
406 Not Acceptable
407 Proxy Authentication Required
408 Request Timeout
409 Conflict
410 Gone
411 Length Required
412 Precondition Failed
413 Request Entity Too Large
414 Request-URI Too Long
415 Unsupported Media Type
416 Requested Range Not Satisfiable
417 Expectation Failed
500 Internal Server Error
501 Not Implemented
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
505 HTTP Version Not Supported
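Assuming the LIB_http return format described above the list, a webbot might branch on that status value as in the following sketch; $returned_web_page is assumed to be the array produced by a LIB_http download function, as noted earlier in the book.

# Sketch: acting on the HTTP status code reported by LIB_http
# (assumes $returned_web_page came from a LIB_http download routine)
if ($returned_web_page['STATUS']['http_code'] != "200") {
    echo "Fetch failed with HTTP code: " . $returned_web_page['STATUS']['http_code'] . "\n";
}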

NNTP Codes

Listed below are the NNTP status codes. Your webbots should use these codes to verify the responses returned from news servers.

100 help text follows
199 debug output
200 server ready - posting allowed
201 server ready - no posting allowed
202 slave status noted
205 closing connection - goodbye!
211 group selected
215 list of newsgroups follows
220 article retrieved - head and body follow
221 article retrieved - head follows
222 article retrieved - body follows
223 article retrieved - request text separately
230 list of new articles by message-id follows
231 list of new newsgroups follows
235 article transferred ok
240 article posted ok
335 send article to be transferred. End with <CR-LF>.<CR-LF>
340 send article to be posted. End with <CR-LF>.<CR-LF>
400 service discontinued
411 no such news group
412 no newsgroup has been selected
420 no current article has been selected
421 no next article in this group
422 no previous article in this group
423 no such article number in this group
430 no such article found
435 article not wanted - do not send it
436 transfer failed - try again later
437 article rejected - do not try again
440 posting not allowed
441 posting failed
500 command not recognized
501 command syntax error
502 access restriction or permission denied
503 program fault - command not performed

Appendix C. SMS EMAIL ADDRESSES

Sometimes it is useful for webbots to send Short Message Service (SMS) or text message notifications. In most cases, you can send a text message to a subscriber by simply sending an email to the wireless subscriber's mail server, using the subscriber's phone number or username as the addressee. Below is a collection of email addresses that will send text messages. The email addresses in the table below have not been individually verified, but each entry was found in more than one source.

Note

Special charges may apply to the use of these services. Contact the individual service provider for more information regarding charges.

If you don't see the carrier you need listed below, contact the carrier to check—most wireless services support this service and the carrier's customer service department should be able to help if you have questions.
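As a sketch of the technique described above, the snippet below uses PHP's mail() function to send a short message to a carrier's email-to-SMS gateway; the phone number and gateway domain are placeholders you would replace with a subscriber's number and an address from the table that follows.

<?
# Sketch: sending a text message by emailing a carrier's SMS gateway
# The address below is a placeholder, not a real gateway.
$sms_address = "5555551234@example-carrier-gateway.com";
$subject     = "";                               // Many gateways ignore the subject
$message     = "Webbot alert: task completed.";  // Keep messages short (SMS length limits)

mail($sms_address, $subject, $message);
?>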

Wireless Carrier Text Message Email Address

Alltel [email protected]

Ameritech Paging [email protected]

BeeLine GSM [email protected]

Bell Mobility (Canada) [email protected]

Bell South [email protected]

Bell South Mobility [email protected]

Blue Sky Frog [email protected]

Boost [email protected]

Cellular One [email protected]

Cellular One West [email protected]

Cingular Wireless [email protected]

Dutchtone/Orange-NL [email protected]

Edge Wireless [email protected]

Fido [email protected]

Golden Telecom [email protected]

Idea Cellular [email protected]

Manitoba Telecom Systems [email protected]

MetroPCS [email protected]

MobileOne [email protected]

Mobilfone [email protected]

Mobility Bermuda [email protected]

Netcom [email protected]

Nextel [email protected]

NPI Wireless [email protected]

O2 [email protected]

Orange [email protected]

Oskar [email protected]

Personal Communication [email protected] (number in subject line)

PlusGSM [email protected]

Qualcomm [email protected]

Qwest [email protected]

Southern LINC [email protected]

Sprint PCS [email protected]

SunCom [email protected]

SureWest Communications [email protected]

T-Mobile [email protected]

T-Mobile Germany [email protected]

T-Mobile UK [email protected]

Tele2 Latvia [email protected]

Telefonica Movistar [email protected]

Telenor [email protected]

TIM [email protected]

UMC [email protected]

Unicel [email protected]

Verizon Pagers [email protected]

Verizon PCS [email protected]

Virgin Mobile [email protected]

Wyndtell [email protected]

About the Author

Michael Schrenk uses webbots and data-driven web applications to create competitive advantages for businesses. He has written for Computerworld and Web Techniques magazines and has taught courses on Web usability and Internet marketing. He has also given presentations on intelligent Web agents and online corporate intelligence at the DEFCON hacker's convention.

Colophon

Webbots, Spiders, and Screen Scrapers was laid out in Adobe FrameMaker. The font families used are New Baskerville for body text, Futura for headings and tables, and Dogma for titles.

The book was printed and bound at Malloy Incorporated in Ann Arbor, Michigan. The paper is Glatfelter Thor 60# Antique, which is made from 50 percent recycled materials, including 30 percent postconsumer content. The book uses a RepKover binding, which allows it to lay flat when open.