GDG İstanbul February Event - Presentation
![Page 1: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/1.jpg)
Web Crawling & Web Scraping
cuneytykaya
cuneyt.yesilkaya
![Page 2: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/2.jpg)
Cüneyt Yeşilkaya
2007
2048
......... 20102012
![Page 3: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/3.jpg)
Agenda
● Web Crawling
● Web Scraping
● Web Crawling Tools
● Demo (Crawler4j & Jsoup)
● Crawling - Where to Use
![Page 4: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/4.jpg)
Web Crawling
Browsing the World Wide Web in a methodical, automated manner or in an orderly fashion.
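To make "methodical" concrete, here is a minimal sketch of a breadth-first crawl. It is not taken from the talk: the class name `CrawlOrderSketch` and the in-memory link map are assumptions made for illustration, and the map stands in for real HTTP fetches.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration: a breadth-first crawl over an in-memory link graph.
class CrawlOrderSketch {

    // Starts at the seed and returns every reachable page in breadth-first visit order.
    // Each key of `links` is a page; its value is the list of pages it links to.
    static List<String> crawl(String seed, Map<String, List<String>> links) {
        Set<String> seen = new LinkedHashSet<>();   // keeps discovery order, blocks revisits
        Queue<String> frontier = new ArrayDeque<>(); // pages waiting to be visited
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String page = frontier.remove();
            for (String next : links.getOrDefault(page, List.of())) {
                if (seen.add(next)) {                // true only the first time a page is seen
                    frontier.add(next);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Made-up site structure, used only for this example.
        Map<String, List<String>> links = Map.of(
            "/", List.of("/events", "/about"),
            "/events", List.of("/", "/events/february"),
            "/about", List.of("/"));
        System.out.println(crawl("/", links));
    }
}
```

The queue gives the "orderly" part (pages closest to the seed come first) and the seen-set keeps the crawl from looping on pages that link back to each other.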
![Page 5: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/5.jpg)
Web Scraping
Computer software technique of extracting information from websites.
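As a rough sketch of what "extracting information" can mean, the example below pulls the title and the link targets out of an HTML string. It uses only `java.util.regex`, which is a shortcut chosen to keep the example dependency-free; regular expressions are not a robust way to read real-world HTML, and the Jsoup demo later in the deck shows a proper parser. The class name `ScrapeSketch` and the sample HTML are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration: scraping two pieces of information from an HTML string.
class ScrapeSketch {

    // Matches href="..." inside an <a ...> tag. Deliberately simple, not a full HTML parser.
    private static final Pattern LINK =
        Pattern.compile("<a\\s[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Matches the text between <title> and </title>.
    private static final Pattern TITLE =
        Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed page title, or an empty string if there is none.
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Returns the href value of every <a> tag, in document order.
    static List<String> links(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up page content, used only for this example.
        String html = "<html><head><title>Events</title></head>"
            + "<body><a href=\"/events/february\">February</a></body></html>";
        System.out.println(title(html));
        System.out.println(links(html));
    }
}
```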
![Page 6: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/6.jpg)
Web Crawling Tools
![Page 7: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/7.jpg)
Selecting a Crawler
● Multi-Threaded Structure
● Max Pages to Fetch
● Max Page Size
● Max Depth to Crawl
● Redundant Link Control
● Politeness Time
● Resumable
● Well-Documented
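Four of the criteria above (max pages, max depth, redundant link control, politeness time) can be sketched in a few lines. This is an illustration under stated assumptions, not how Crawler4j or any other tool implements them: the `CrawlLimitsSketch` and `Limits` names, the normalization rule, and the in-memory link map are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration of crawl limits; the link map stands in for real HTTP fetches.
class CrawlLimitsSketch {

    // Hypothetical settings mirroring three of the selection criteria.
    static final class Limits {
        final int maxPages;
        final int maxDepth;
        final long politenessMillis;

        Limits(int maxPages, int maxDepth, long politenessMillis) {
            this.maxPages = maxPages;
            this.maxDepth = maxDepth;
            this.politenessMillis = politenessMillis;
        }
    }

    // Redundant link control: URLs that differ only by a #fragment or a
    // trailing slash are treated as the same page (an assumed, simplified rule).
    static String normalize(String url) {
        int hash = url.indexOf('#');
        String u = hash >= 0 ? url.substring(0, hash) : url;
        return (u.length() > 1 && u.endsWith("/")) ? u.substring(0, u.length() - 1) : u;
    }

    // Returns the pages "fetched", in order, while respecting the limits.
    static List<String> crawl(String seed, Map<String, List<String>> links, Limits limits) {
        List<String> fetched = new ArrayList<>();
        Map<String, Integer> depthOf = new HashMap<>(); // doubles as the already-seen set
        Queue<String> frontier = new ArrayDeque<>();
        String start = normalize(seed);
        depthOf.put(start, 0);
        frontier.add(start);
        while (!frontier.isEmpty() && fetched.size() < limits.maxPages) {
            String page = frontier.remove();
            if (!fetched.isEmpty()) {
                try {
                    Thread.sleep(limits.politenessMillis); // pause between consecutive fetches
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            fetched.add(page); // a real crawler would download the page here
            int depth = depthOf.get(page);
            if (depth == limits.maxDepth) {
                continue; // do not follow links found at the depth limit
            }
            for (String raw : links.getOrDefault(page, List.of())) {
                String next = normalize(raw);
                if (depthOf.putIfAbsent(next, depth + 1) == null) {
                    frontier.add(next); // first time this normalized URL is seen
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        // Made-up site structure, used only for this example.
        Map<String, List<String>> links = Map.of(
            "/", List.of("/events/", "/about#team", "/events"),
            "/events", List.of("/events/february", "/"),
            "/events/february", List.of("/events/february/slides"));
        System.out.println(crawl("/", links, new Limits(10, 2, 50)));
    }
}
```

Multi-threading, page-size limits, and resumability are left out to keep the sketch short; they are the parts where an existing, well-documented crawler saves the most work.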
![Page 8: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/8.jpg)
Crawler4j
Yasser Ganjisaffar
Microsoft Bing & Microsoft Live Search
![Page 9: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/9.jpg)
Demo - Crawler4j (1/3)
myCrawler.java myController.java
![Page 10: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/10.jpg)
Demo - Crawler4j (2/3)
myCrawler.java
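import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class myCrawler extends WebCrawler {

    // Only follow links that stay on the GDG İstanbul site.
    // (Signature as shown on the slide; newer crawler4j releases add a Page parameter.)
    @Override
    public boolean shouldVisit(WebURL url) {
        return url.getURL().startsWith("http://www.gdgistanbul.com");
    }

    // Called once for every fetched page; process the page content here.
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
    }
}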
![Page 11: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/11.jpg)
Demo - Crawler4j (3/3)
myController.java
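int numberOfCrawlers = 4; // number of concurrent crawler threads

CrawlConfig config = new CrawlConfig();
// Not on the slide: crawler4j needs a storage folder for intermediate crawl data; this path is a placeholder.
config.setCrawlStorageFolder("/tmp/crawler4j");
config.setPolitenessDelay(250);   // milliseconds to wait between requests
config.setMaxPagesToFetch(100);

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://www.gdgistanbul.com");
controller.start(myCrawler.class, numberOfCrawlers); // blocks until the crawl finishes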
![Page 12: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/12.jpg)
Demo - Jsoup (1/2)
Jsoup: a convenient way to parse HTML in Java
● Scrape and parse HTML from a URL, file, or string
● Find and extract data, using DOM traversal or CSS selectors
● Manipulate the HTML elements, attributes, and text
![Page 13: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/13.jpg)
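Demo - Jsoup (2/2)
// Four independent snippets; doc and links are redeclared because each one stands alone.

// 1. Fetch a page from a URL and select elements with a CSS selector
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

// 2. Parse HTML from a string
String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

// 3. Traverse the DOM: read the target and text of every link inside the element with id "content"
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

// 4. Select by attribute: all links with an href, all elements with a src
Elements links = doc.select("a[href]");
Elements media = doc.select("[src]");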
![Page 14: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/14.jpg)
Where to Use
● Search Engines (GoogleBot)
● Aggregators
    ○ Data aggregator
    ○ News aggregator
    ○ Review aggregator
    ○ Search aggregator
    ○ Social network aggregator
    ○ Video aggregator
● Kaarun Product Collector
![Page 15: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/15.jpg)
www.kaarun.com
![Page 16: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/16.jpg)
All Friends
![Page 17: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/17.jpg)
Products for each Facebook Like
![Page 18: GDG İstanbul Şubat Etkinliği - Sunum](https://reader033.fdocuments.net/reader033/viewer/2022050904/54756251b4af9fa90a8b59ea/html5/thumbnails/18.jpg)
cyesilkaya.wordpress.com & @cuneytykaya & tr.linkedin/cuneyt.yesilkaya
Thank you...