Smart Crawler: Using Committee Machines for Web Pages Continuous Classification
![Page 1: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/1.jpg)
Smart Crawler: Using Committee Machines for Web Pages Continuous Classification
Luiz Henrique Zambom Santana, Prof. Dr. Ronaldo dos Santos Mello and Prof. Dr. Mauro Roisenberg
Federal University of Santa Catarina – Florianópolis/SC
WebMedia – Manaus, 2015
![Page 2: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/2.jpg)
Agenda
• Goals
• Motivation
• Model
• Architecture
• Implementation
• Experiments
• Conclusions
![Page 3: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/3.jpg)
Goals
• Idea:
  • If:
    • www.infomoney.com.br = Finance
    • www.lance.com.br = Football
    • www.4rodas.com.br = Cars
  • Then:
    • www.valor.com.br = Finance
    • placar.abril.com.br = Football
    • revistaautoesporte.globo.com = Cars
![Page 4: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/4.jpg)
Motivation
• If we know the category of a page, then:
  • We can parse it better
  • We can provide better search results
  • We can customize the user experience
• Classify web page contents to generate a dataset
![Page 5: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/5.jpg)
Motivation
• Using ML techniques seemed like a good idea, but:
  • We need to scale, so Matlab was not an option
  • We need to collect and classify pages continuously, so we need to index the pages
• After finding the right tools, we had the following question:
  • What is the best ML technique to use? We tried:
    • Naive Bayes, but the degree of class overlap is not small in our case
    • SVM, but it can only separate two classes
• We decided to create a committee machine of SVM models
  • Better generalization
  • Could be very slow
![Page 6: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/6.jpg)
Model
![Page 7: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/7.jpg)
Implementation
• Cloud-ready technologies
  • Apache Spark
  • Elasticsearch
• Java frameworks:
  • Crawler4J
  • Apache Lucene
  • Jsoup: parsing
![Page 8: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/8.jpg)
Support vector machine (SVM)
• Non-probabilistic binary linear classifier
• The number of iterations can be parametrized
• Slow!
• “One vs. All” approach with a committee [1, 2]
• The model with the most votes is the winner

[1] e Silva, Sergio Roberto de Lima, and Mauro Roisenberg. "Continuous authentication by keystroke dynamics using committee machines." Intelligence and Security Informatics. Springer Berlin Heidelberg, 2006. 686-687.
[2] Sun, Bing-Yu, et al. "Support vector machine committee for classification." Advances in Neural Networks–ISNN 2004. Springer Berlin Heidelberg, 2004. 648-653.

Finance vs. Sport    Finance vs. Movies    Finance vs. Cars
Sport vs. Movies    Sport vs. Cars
Movies vs. Cars
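The voting step can be sketched in plain Java. This is a minimal illustration, not the authors' code: it follows the pairwise table on the slide (one binary model per pair of classes, so k(k-1)/2 models for k classes), and the names `SvmCommittee`, `PairModel`, `pairs` and `classify` are hypothetical. The scoring functions are toy stand-ins for trained SVMs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

class SvmCommittee {
    /** One binary model: a non-negative score votes for `positive`, otherwise for `negative`. */
    record PairModel(String positive, String negative, ToDoubleFunction<double[]> score) {}

    /** Enumerates every unordered pair of classes, as in the slide's table. */
    static List<String[]> pairs(List<String> classes) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < classes.size(); i++) {
            for (int j = i + 1; j < classes.size(); j++) {
                out.add(new String[] {classes.get(i), classes.get(j)});
            }
        }
        return out;
    }

    /** Each model casts one vote; the class with the most votes wins (ties go to the first class voted for). */
    static String classify(List<PairModel> committee, double[] features) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (PairModel m : committee) {
            String winner = m.score().applyAsDouble(features) >= 0 ? m.positive() : m.negative();
            votes.merge(winner, 1, Integer::sum);
        }
        String best = null; // stays null for an empty committee
        int bestVotes = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                best = e.getKey();
                bestVotes = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> classes = List.of("Finance", "Sport", "Movies", "Cars");
        // Toy stand-in for trained SVMs (assumption): feature i is the "evidence" for class i.
        List<PairModel> committee = new ArrayList<>();
        for (String[] p : pairs(classes)) {
            int a = classes.indexOf(p[0]);
            int b = classes.indexOf(p[1]);
            committee.add(new PairModel(p[0], p[1], x -> x[a] - x[b]));
        }
        System.out.println(committee.size() + " models");                            // 6 models
        System.out.println(classify(committee, new double[] {0.1, 0.9, 0.3, 0.2})); // Sport
    }
}
```

With four classes the committee holds six models, which matches the six pairs listed on the slide; with eight classes the same scheme would need 28.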
![Page 9: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/9.jpg)
Architecture
![Page 10: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/10.jpg)
Implementation details - Training
1. Set of pages is used as input to the models

String[] pagesCars = {
    "http://g1.globo.com/carros/index.html",
    "http://quatrorodas.abril.com.br/"
};
String[] pagesFinance = {
    "http://www.valor.com.br/",
    "http://www.infomoney.com.br/",
    "http://exame.abril.com.br/"
};
String[] pagesSport = {
    "http://globoesporte.globo.com/",
    "http://oledobrasil.com.br/",
    "http://espn.uol.com.br"
};
String[] pagesMovies = {
    "http://www.imdb.com/list/ls002231878/",
    "http://www.adorocinema.com/",
    "http://www.filmeb.com.br/",
    "http://www.revistabula.com/3165-lista-dos-100-melhores-filmes-de-todos-os-tempos-segundo-hollywood/"
};
2. Set of pages is used as input to the models
![Page 11: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/11.jpg)
Implementation details - Training
3. Clean the page and calculate the feature vector using HashingTF
  • Get only the page text (i.e., exclude HTML tags)
  • Use Lucene to remove stopwords, symbols, numbers and other meaningless parts
  • Compute the term frequency and create a feature vector
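The cleaning and hashing steps above can be sketched with the Java standard library alone. This is a simplified stand-in under stated assumptions, not the deck's pipeline: the slides use Lucene for token filtering and Spark's HashingTF for the vector, whereas here a naive regex strips tags, a tiny hard-coded stopword list replaces the Lucene analyzer, and `String.hashCode` replaces Spark's hash function. `PageFeatures`, `tokens` and `hashingTf` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PageFeatures {
    // Tiny illustrative stopword list (assumption; the slides use Lucene for this step).
    static final Set<String> STOPWORDS = Set.of("a", "o", "os", "de", "e", "the", "and", "of");

    /** Strips tags with a naive regex, lowercases, and keeps letter-only tokens that are not stopwords. */
    static List<String> tokens(String html) {
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        // Splitting on non-letters drops numbers and symbols in the same pass.
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /** Hashing trick: each term increments the bucket given by its hash modulo the vector size. */
    static double[] hashingTf(List<String> terms, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String t : terms) {
            vector[Math.floorMod(t.hashCode(), numFeatures)] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Bolsa de valores</h1>"
                + "<p>A bolsa sobe 2% e os juros caem.</p></body></html>";
        List<String> terms = tokens(html);
        System.out.println(terms); // [bolsa, valores, bolsa, sobe, juros, caem]
        double[] vector = hashingTf(terms, 16);
        System.out.println(vector.length + " buckets");
    }
}
```

The hashing trick fixes the vector length up front, so no vocabulary has to be built or stored; the cost is that distinct terms can collide in the same bucket when `numFeatures` is small.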
![Page 12: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/12.jpg)
Implementation details - Training
4. Test the data against the models

16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.filmeb.com.br/ the model predicts movies
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.valor.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://www.infomoney.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://exame.abril.com.br/ the model predicts finance
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://oledobrasil.com.br/ the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://espn.uol.com.br the model predicts sport
16:33:42,784 INFO CallTrainerSVMPageTest:78 - For page http://sportv.globo.com/site/ the model predicts movies
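One way to summarize a test run like the log shown here is a simple accuracy count. The sketch below is illustrative only: `PredictionCheck` and `accuracy` are hypothetical names, the expected labels for the first six pages come from the seed arrays in step 1, and the label "sport" for sportv.globo.com is an assumption, since that page does not appear in the seed arrays.

```java
import java.util.Locale;
import java.util.Map;

class PredictionCheck {
    /** Fraction of pages whose predicted label equals the expected one (0.0 for an empty set). */
    static double accuracy(Map<String, String> expected, Map<String, String> predicted) {
        int correct = 0;
        for (Map.Entry<String, String> e : expected.entrySet()) {
            if (e.getValue().equals(predicted.get(e.getKey()))) {
                correct++;
            }
        }
        return expected.isEmpty() ? 0.0 : (double) correct / expected.size();
    }

    public static void main(String[] args) {
        // Expected labels: taken from the seed arrays, except sportv (assumed to be sport).
        Map<String, String> expected = Map.of(
                "http://www.filmeb.com.br/", "movies",
                "http://www.valor.com.br/", "finance",
                "http://www.infomoney.com.br/", "finance",
                "http://exame.abril.com.br/", "finance",
                "http://oledobrasil.com.br/", "sport",
                "http://espn.uol.com.br", "sport",
                "http://sportv.globo.com/site/", "sport");
        // Predicted labels: copied from the log lines.
        Map<String, String> predicted = Map.of(
                "http://www.filmeb.com.br/", "movies",
                "http://www.valor.com.br/", "finance",
                "http://www.infomoney.com.br/", "finance",
                "http://exame.abril.com.br/", "finance",
                "http://oledobrasil.com.br/", "sport",
                "http://espn.uol.com.br", "sport",
                "http://sportv.globo.com/site/", "movies");
        System.out.println(String.format(Locale.ROOT, "%.3f", accuracy(expected, predicted))); // 0.857
    }
}
```

Under that assumption, six of the seven logged predictions match, and the single miss is the sportv page predicted as movies.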
![Page 13: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/13.jpg)
Experiments
• First dataset
  • Classes: Finance (Infomoney), Sports (Lance), Movies (IMDB), and Cars (4 Rodas)
• Second dataset
  • Classes: Life Style, Soap opera, Technology
• Most of the documents are correctly classified, but there was also a lot of ambiguity:
![Page 14: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/14.jpg)
Other problems
• Templates in portals (headers and footers)
• Documents with too little information (e.g., "assine já", i.e., "subscribe now" pages)
• Documents with too much information (e.g., the main page)
![Page 15: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/15.jpg)
Focused crawler
• With 100 labeled pages of each kind, we ran the focused crawler on the sections Carreira (Career), Mercados (Markets), Onde Investir (Where to Invest) and Negócios (Business)
• The page structure is easier to test and provides much better results:
• The errors are due to texts that fall into more than one category, for instance:
![Page 16: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/16.jpg)
Performance experiments
• Three experiments:
  • 1: 200,000 classifications in 30 minutes, and 4 classes
  • 2: 180,000 classifications and 8 classes
  • 3: Focused crawler and 3 classes
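As a rough reading of the first experiment, the reported figures translate into the following average throughput. The second line is only an estimate under the assumption that each classification consults one pairwise model per pair of the 4 classes (6 models); the slides do not state this.

```latex
\[
\frac{200\,000\ \text{classifications}}{30 \times 60\ \text{s}} \approx 111\ \text{classifications/s}
\]
\[
111\ \text{classifications/s} \times \frac{4 \cdot 3}{2}\ \text{models} \approx 667\ \text{model evaluations/s}
\]
```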
![Page 17: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/17.jpg)
Current version
• eCrawler
• Disambiguation
• Pipeline of methods:
![Page 18: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/18.jpg)
Conclusions
• Cloud-ready technologies, such as Apache Spark and Elasticsearch, enable the Smart Crawler to scale according to the application's needs;
• The use of SVM, a traditional machine learning method, implemented as a committee machine, can improve the generalization power of the classification components;
• The architecture was designed to be general-purpose, so it can be used to crawl different domains and make their content available for transformation, search, and retrieval operations.
• The source code is available at:
  • https://github.com/lhzsantana/smart-crawler
![Page 19: Smart Crawler: Using Committee Machines for Web Pages Continuous Classification](https://reader036.fdocuments.net/reader036/viewer/2022062503/58a1605a1a28abc1708b4e77/html5/thumbnails/19.jpg)
Thank you!