[Week5]R_scraping

Transcript of [Week5]R_scraping

Page 1: [Week5]R_scraping

Data Designer Week 05

Data Scraping

Page 2: [Week5]R_scraping

꿈꾸는 데이터 디자이너 시즌 2 (Dreaming Data Designer, Season 2)

Scraping & Crawling

Page 3: [Week5]R_scraping

Web Scraping: processing a web document and extracting information from it

Page 4: [Week5]R_scraping

Web Scraping

[Diagram]

Page 5: [Week5]R_scraping

Web Crawling: iteratively finding and fetching web links, starting from a list of seed URLs
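The definition above can be sketched as a loop over a queue of URLs. This is only a minimal illustration, assuming a hypothetical in-memory `pages` list in place of real HTTP requests; the `crawl` function, the `max_pages` limit, and the example.com addresses are all invented for the sketch.

```r
# Hypothetical link graph: each name is a URL, each value holds the links found on that page.
pages <- list(
  "http://example.com/a" = c("http://example.com/b", "http://example.com/c"),
  "http://example.com/b" = c("http://example.com/c", "http://example.com/d"),
  "http://example.com/c" = character(0),
  "http://example.com/d" = c("http://example.com/a")
)

# Visit pages breadth-first, starting from the seed URLs.
crawl <- function(seeds, pages, max_pages = 100) {
  queue <- seeds
  visited <- character(0)
  while (length(queue) > 0 && length(visited) < max_pages) {
    url <- queue[1]
    queue <- queue[-1]
    if (url %in% visited) next          # skip pages already fetched
    visited <- c(visited, url)
    links <- pages[[url]]               # stand-in for "fetch the page and extract its links"
    queue <- c(queue, setdiff(links, visited))
  }
  visited
}

visited <- crawl("http://example.com/a", pages)
print(visited)
```

In a real crawler the lookup in `pages` would be replaced by a page download plus link extraction, which is where scraping comes in.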

Page 6: [Week5]R_scraping

Web Crawling

[Diagram: a set of URL nodes]

Page 7: [Week5]R_scraping

Web Crawling, Web Scraping

Data

Page 8: [Week5]R_scraping

HTML & XML

Page 9: [Week5]R_scraping

Markup Language: uses tags and similar markers to indicate the structure of a document or of data

Page 10: [Week5]R_scraping

HTML (HyperText Markup Language): a markup language for web pages

XML (eXtensible Markup Language): a general-purpose markup language used to create other markup languages; it lets different systems exchange data easily

Page 11: [Week5]R_scraping

HTML

<!DOCTYPE html>
<html>
  <head>
    <title>HTML Document</title>
  </head>
  <body>
    .
  </body>
</html>

Page 12: [Week5]R_scraping

XML

<dailyBoxOfficeList>
  <dailyBoxOffice>
    <rank>1</rank>
    <movieCd>20148048</movieCd>
    <movieNm> </movieNm>
    <openDt>2015-08-05</openDt>
    <audiCnt>144263</audiCnt>
    <audiAcc>10957701</audiAcc>
    <scrnCnt>828</scrnCnt>
    <showCnt>4262</showCnt>
  </dailyBoxOffice>
</dailyBoxOfficeList>

Page 13: [Week5]R_scraping

Hands-on: Fetching Data (HTML)

Page 14: [Week5]R_scraping

Practice data

https://ko.wikipedia.org/wiki/대한민국의_경제성장률

http://score.sports.media.daum.net/record/baseball/kbo/prnk.daum

Page 15: [Week5]R_scraping

Practice tool: Excel

Page 16: [Week5]R_scraping

Practice tool: Excel

Page 17: [Week5]R_scraping

Practice tool: Google Sheets

=IMPORTHTML("https://ko.wikipedia.org/wiki/대한민국의_경제성장률","table",1)

Page 18: [Week5]R_scraping

Practice tool: Google Sheets

Page 19: [Week5]R_scraping

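Practice tool: Google Sheets

=IMPORTHTML(URL, query, index)

URL: the address of the data you want (including the protocol, e.g. http://)

query: the form the data takes, either 'table' or 'list'

index: which occurrence of that element in the HTML source to import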
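For comparison, the same three arguments can be mimicked in R. The sketch below is not how Google Sheets implements IMPORTHTML; `import_html` is a hypothetical helper written for illustration on top of rvest, the sample page and its values are made up, and treating 'list' as "any `<ul>` or `<ol>`" is an assumption.

```r
library(rvest)

# Hypothetical helper mirroring IMPORTHTML(URL, query, index); not part of rvest.
import_html <- function(source, query = c("table", "list"), index = 1) {
  query <- match.arg(query)
  page <- read_html(source)   # accepts a URL, a file path, or literal HTML text
  if (query == "table") {
    nodes <- html_nodes(page, "table")
    html_table(nodes[[index]])                     # the index-th table as a data frame
  } else {
    nodes <- html_nodes(page, "ul, ol")
    html_text(html_nodes(nodes[[index]], "li"))    # the index-th list as a character vector
  }
}

# Made-up page used instead of a live URL.
doc <- "<html><body>
  <table><tr><th>name</th><th>value</th></tr>
         <tr><td>a</td><td>1</td></tr>
         <tr><td>b</td><td>2</td></tr></table>
  <ul><li>table</li><li>list</li></ul>
</body></html>"

first_table <- import_html(doc, "table", 1)
first_list <- import_html(doc, "list", 1)
print(first_table)
print(first_list)
```

As in the spreadsheet formula, the index counts from 1 and only counts elements of the requested kind.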

Page 20: [Week5]R_scraping

Fetching data

http://score.sports.media.daum.net/record/baseball/kbo/prnk.daum

Page 21: [Week5]R_scraping

Practice tool: R

https://en.wikipedia.org/wiki/List_of_South_Korean_regions_by_GDP

Page 22: [Week5]R_scraping

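Practice tool: R

install.packages('rvest')   # one-time install
library(rvest)

url_wiki = 'https://en.wikipedia.org/wiki/List_of_South_Korean_regions_by_GDP'

# The slides call html(); later rvest releases renamed it read_html()
wiki = read_html(url_wiki)
# Take the first <table> on the page and convert it to a data frame
html_table(html_node(wiki, 'table'))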

Page 23: [Week5]R_scraping

Practice tool: R

'https://en.wikipedia.org/wiki/List_of_South_Korean_regions_by_GDP'

html(): fetches the web page

html(): parses the fetched page into an HTML structure

html_node('table'): finds the <table> tag

html_table(): converts the contents of the <table> tag into a data.frame
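The steps above can be tried without a network connection. The sketch below is an illustration under assumptions: it parses a made-up HTML string instead of the Wikipedia page, uses `read_html()` (the name current rvest gives to the `html()` step), and adds `html_nodes()` only to contrast it with `html_node()`.

```r
library(rvest)

# Made-up page with two tables, standing in for the fetched web page.
page_source <- "<html><body>
  <table><tr><th>region</th><th>rank</th></tr>
         <tr><td>A</td><td>1</td></tr>
         <tr><td>B</td><td>2</td></tr></table>
  <table><tr><th>note</th></tr><tr><td>second table</td></tr></table>
</body></html>"

page <- read_html(page_source)            # parse the text into an HTML document
first_table <- html_node(page, "table")   # html_node() returns the first match only
all_tables <- html_nodes(page, "table")   # html_nodes() returns every match
regions <- html_table(first_table)        # <table> contents as a data frame
print(regions)
```

Because `html_node()` stops at the first match, a page with several tables needs `html_nodes()` plus an index to reach the later ones.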

Page 24: [Week5]R_scraping

Practice tool: R

Page 25: [Week5]R_scraping

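Practice tool: R (for Windows users)

library(XML)
library(RCurl)
library(rvest)   # repair_encoding() comes from rvest

daum_bball = 'http://score.sports.media.daum.net/record/baseball/kbo/prnk.daum'

# Download the page source as text
xml_daum = getURL(daum_bball)

# Parse every table on the page and keep the one named table1
bball_table = readHTMLTable(xml_daum)$table1
# Repair the garbled (Korean) column names
names(bball_table) = repair_encoding(names(bball_table))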

Page 26: [Week5]R_scraping

Hands-on: Fetching Data (XML)

Page 27: [Week5]R_scraping

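Fetching XML data

library(rvest)

# Read the XML file (later rvest releases replace xml() with read_xml())
boxoffice = xml('boxoffice0831.xml', encoding='UTF-8')

# Select the node that holds the daily list
daily = xml_node(boxoffice, 'dailyboxofficelist')

# Pull the text of each field from every entry
rank = xml_text(xml_nodes(daily, 'rank'))
movieNm = xml_text(xml_nodes(daily, 'movienm'))
audiCnt = xml_text(xml_nodes(daily, 'audicnt'))

# Combine the three vectors into one table
daily_boxoffice = data.frame(rank, movieNm, audiCnt)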

Page 28: [Week5]R_scraping

<boxOfficeResult>
  <boxofficeType> </boxofficeType>
  <showRange>20150831~20150831</showRange>
  <dailyBoxOfficeList>
    <dailyBoxOffice>
      <rank>1</rank>
      <movieNm> </movieNm>
      <audiCnt>144263</audiCnt>
    </dailyBoxOffice>
    <dailyBoxOffice>
      <rank>2</rank>
      <movieNm> </movieNm>
      <audiCnt>59994</audiCnt>
    </dailyBoxOffice>
    …………
  </dailyBoxOfficeList>
</boxOfficeResult>

boxoffice0831.xml

xml()

Page 29: [Week5]R_scraping

xml_node('dailyboxofficelist')

<boxOfficeResult>
  <boxofficeType> </boxofficeType>
  <showRange>20150831~20150831</showRange>
  <dailyBoxOfficeList>
    <dailyBoxOffice>
      <rank>1</rank>
      <movieNm> </movieNm>
      <audiCnt>144263</audiCnt>
    </dailyBoxOffice>
    <dailyBoxOffice>
      <rank>2</rank>
      <movieNm> </movieNm>
      <audiCnt>59994</audiCnt>
    </dailyBoxOffice>
    …………
  </dailyBoxOfficeList>
</boxOfficeResult>

Page 30: [Week5]R_scraping

<dailyBoxOffice>
  <rank>1</rank>
  <movieNm> </movieNm>
  <audiCnt>144263</audiCnt>
</dailyBoxOffice>
<dailyBoxOffice>
  <rank>2</rank>
  <movieNm> </movieNm>
  <audiCnt>59994</audiCnt>
</dailyBoxOffice>
…………

xml_nodes('rank')

<rank>1</rank>
<rank>2</rank>
<rank>3</rank>
<rank>4</rank>
......
<rank>10</rank>

Page 31: [Week5]R_scraping

xml_text()

<rank>1</rank>
<rank>2</rank>
<rank>3</rank>
<rank>4</rank>
......
<rank>10</rank>

c('1','2','3','4',...,'10') (xml_text() returns a character vector, so the ranks come back as text)
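The whole sequence, from reading the XML to building a table, can be reproduced on a small inline sample. This sketch swaps in the xml2 functions `read_xml()`, `xml_find_first()`, and `xml_find_all()` with XPath in place of the `xml()`, `xml_node()`, and `xml_nodes()` calls on the slides, so tag names keep their original capitalization. The movie titles are placeholders because the titles are blank in the transcript, and the `as.integer()` conversion is an added step.

```r
library(xml2)

# Inline sample shaped like boxoffice0831.xml; "Movie A" and "Movie B" are placeholders.
xml_source <- "<boxOfficeResult>
  <showRange>20150831~20150831</showRange>
  <dailyBoxOfficeList>
    <dailyBoxOffice><rank>1</rank><movieNm>Movie A</movieNm><audiCnt>144263</audiCnt></dailyBoxOffice>
    <dailyBoxOffice><rank>2</rank><movieNm>Movie B</movieNm><audiCnt>59994</audiCnt></dailyBoxOffice>
  </dailyBoxOfficeList>
</boxOfficeResult>"

boxoffice <- read_xml(xml_source)                             # parse the XML text
daily <- xml_find_first(boxoffice, ".//dailyBoxOfficeList")   # the list node
rank <- xml_text(xml_find_all(daily, ".//rank"))              # character vector of ranks
movieNm <- xml_text(xml_find_all(daily, ".//movieNm"))
audiCnt <- xml_text(xml_find_all(daily, ".//audiCnt"))

# xml_text() returns text, so numeric columns are converted explicitly.
daily_boxoffice <- data.frame(
  rank = as.integer(rank),
  movieNm = movieNm,
  audiCnt = as.integer(audiCnt),
  stringsAsFactors = FALSE
)
print(daily_boxoffice)
```

Building the columns separately works here because every entry has all three fields; if a field could be missing, extracting per entry would be the safer design.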

Page 32: [Week5]R_scraping

boxoffice0831_full.xml

Page 33: [Week5]R_scraping