Web Scraping the Ministry of Land, Infrastructure and Transport (MOLIT) Real Transaction Price Site with R
byeungchun-kwon
Overview
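• Definitions of OPEN-API and web scraping
• Introduction to the essential technologies for implementation
• Retrieving FRED and ECOS data through APIs in R
• Retrieving apartment real transaction prices by scraping in R
• Model analyses using the retrieved data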
OPEN-API, Web Scraping
• Traditional methods of obtaining data
OPEN-API, Web Scraping
Open API (often written OpenAPI) is a term used to
describe sets of technologies that enable websites to interact with each other by
using REST, SOAP, JavaScript, and other web technologies. Although its
uses are not limited to web-based applications, it is an increasingly
common approach in so-called Web 2.0 applications.
Web scraping (web harvesting or web data extraction) is a computer
software technique of extracting information from websites. Usually, such
software programs simulate human exploration of the World Wide Web by
either implementing low-level Hypertext Transfer Protocol (HTTP), or
embedding a fully-fledged web browser, such as Internet Explorer or Mozilla
Firefox.
OPEN-API
Web Scraping
http://rt.molit.go.kr/rtApt.do?cmd=getTradeAptLocal
&dongCode=1168010600
&danjiCode=ALL
&srhYear=2014
&srhPeriod=2
&gubunRadio2=1
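The request above is an ordinary URL whose query string carries the search conditions. As a minimal sketch, the helper below (parse_query is a hypothetical name, not part of the original slides) splits such a URL into named parameters using base R only. The meaning of each parameter is not stated in the slides; srhYear and srhPeriod appear to select the year and the period within it.

```r
# Hypothetical helper: split a URL's query string into a named character vector.
parse_query <- function(url) {
  query <- sub("^[^?]*\\?", "", url)  # drop everything up to the first "?"
  pairs <- strsplit(strsplit(query, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  values <- vapply(pairs, function(p) if (length(p) > 1) p[2] else "", character(1))
  names(values) <- vapply(pairs, function(p) p[1], character(1))
  values
}

# The URL from the slide, joined back into one string
molit_url <- paste0(
  "http://rt.molit.go.kr/rtApt.do?cmd=getTradeAptLocal",
  "&dongCode=1168010600&danjiCode=ALL",
  "&srhYear=2014&srhPeriod=2&gubunRadio2=1"
)

params <- parse_query(molit_url)
print(params)
```

Seeing the parameters as a named vector makes it clear which values need to vary when several periods are requested.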
Essential Technologies for Implementation - R
R Base
Packages
RStudio
Essential Technologies for Implementation - JSON
JSON(JavaScript Object Notation)
<root>
<ZFPMember><name>문**</name></ZFPMember>
<ZFPMember><name>박**</name></ZFPMember>
<ZFPMember><name>김**</name></ZFPMember>
<ZFPMember><name>최**</name></ZFPMember>
</root>
{ "ZFPMember": [
{ "name": "문**" }, { "name": "박**" }, { "name": "김**" }, { "name": "최**" }
] }
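To show how a JSON document like the one above arrives in R, here is a small sketch using the jsonlite package that the later slides rely on. The JSON text is typed inline, so nothing is downloaded. An array of objects with the same keys becomes a data frame.

```r
library(jsonlite)

# The member list from the slide, written as valid JSON
json_text <- '{"ZFPMember": [{"name": "문**"}, {"name": "박**"}, {"name": "김**"}, {"name": "최**"}]}'

# fromJSON() turns the array of objects into a data frame with one "name" column
members <- fromJSON(json_text)
str(members)
print(members$ZFPMember$name)

# toJSON() converts the R object back into JSON text
print(toJSON(members, auto_unbox = TRUE))
```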
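Retrieving ECOS and FRED Data through APIs in R
Five steps for retrieving data through an API
① Check whether you have an API key
② Install the required package (jsonlite)
③ Build the query
④ Retrieve the data
⑤ Parse the response
Checking for an API Key
Installing the Required Package (jsonlite)
> install.packages("jsonlite")
> library(jsonlite)
Building the Query (FRED)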
http://api.stlouisfed.org/fred/series/observations?
series_id=CPIAUCSL
&api_key=b55d00cc4e7ea4483038c2f6edad____
&file_type=json
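A query of this shape can also be assembled from a named list, which keeps the parameter names and values together. The sketch below is one possible way to do it; build_query_url is a hypothetical helper and "YOUR_API_KEY" is a placeholder, neither of which comes from the slides.

```r
# Hypothetical helper: join a base URL and a named list of parameters.
build_query_url <- function(base_url, params) {
  # Percent-encode each value so that special characters cannot break the URL
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

fred_url <- build_query_url(
  "http://api.stlouisfed.org/fred/series/observations",
  list(series_id = "CPIAUCSL", api_key = "YOUR_API_KEY", file_type = "json")
)
print(fred_url)
```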
Retrieving the Data (FRED)
library(jsonlite)
# Query parameters (replace the masked api_key with your own key)
series_id <- "CPIAUCSL"
api_key <- "b55d00cc4e7ea4483038c2f6edad____"
file_type <- "json"
# Assemble the request URL
url <- paste0("http://api.stlouisfed.org/fred/series/observations",
              "?series_id=", series_id,
              "&api_key=", api_key,
              "&file_type=", file_type)
# Download the response as text
raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")
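A download can fail (no network, wrong key, server error), and readLines() then stops the script. As a hedged sketch, the wrapper below returns the response as a single string or NULL on failure. fetch_text is a hypothetical name, and the example uses temporary local files in place of a real URL so that it runs offline.

```r
# Hypothetical wrapper: return the response as one string, or NULL on failure.
fetch_text <- function(url) {
  tryCatch(
    paste(readLines(url, warn = FALSE, encoding = "UTF-8"), collapse = "\n"),
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Offline demonstration: a local file stands in for the API response
tmp <- tempfile(fileext = ".json")
writeLines('{"observations": []}', tmp)

ok <- fetch_text(tmp)                                             # succeeds
missing <- fetch_text(file.path(tempdir(), "does-not-exist.json")) # fails, gives NULL
print(ok)
print(is.null(missing))
```

Returning NULL lets the calling code decide whether to retry, skip, or stop.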
Processing the Data (FRED)
> dat <- fromJSON(raw.data)
> str(dat)
List of 13
$ realtime_start : chr "2014-06-12"
$ realtime_end : chr "2014-06-12"
$ observation_start: chr "1776-07-04"
:
$ limit : num 1e+05
$ observations :'data.frame': 808 obs. of 4 variables:
..$ realtime_start: chr [1:808] "2014-06-12" "2014-06-12" ...
..$ realtime_end : chr [1:808] "2014-06-12" "2014-06-12" ...
..$ date : chr [1:808] "1947-01-01" "1947-02-01" ...
..$ value : chr [1:808] "21.48" "21.62" "22.0" "22.0" ...
> dat$observations$value
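The str() output shows that both date and value arrive as character strings, so they need converting before any analysis. The sketch below works on a tiny inline sample rather than a live response. The third row assumes a "." placeholder for a missing observation, which is an assumption added here for illustration.

```r
library(jsonlite)

# Inline sample shaped like the observations element (third row is illustrative)
sample_json <- '{"observations": [
  {"date": "1947-01-01", "value": "21.48"},
  {"date": "1947-02-01", "value": "21.62"},
  {"date": "1947-03-01", "value": "."}
]}'

obs <- fromJSON(sample_json)$observations
obs$date <- as.Date(obs$date)                         # character -> Date
obs$value <- suppressWarnings(as.numeric(obs$value))  # non-numeric text becomes NA
str(obs)
```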
Building the Query (ECOS)
http://ecos.bok.or.kr/api/StatisticTableList/SCES3Y78SI__/xml/kr/1/10
Building the Query (ECOS)
http://ecos.bok.or.kr/api/StatisticItemList/sample/xml/kr/1/10/021Y123/
http://ecos.bok.or.kr/api/StatisticSearch/SCES3Y78SI__/xml/kr/1/1000/021Y123/MM/196501/201405/0/
Retrieving the Data (ECOS)
library(jsonlite)
# Path segments of the StatisticSearch request, in the order they appear in the URL
api_key <- "SCES3Y78SI__"; file_type <- "json"; lang_type <- "kr"
start_no <- "1"; end_no <- "100"
stat_code <- "021Y123"; cycle_type <- "MM"
start_date <- "196501"; end_date <- "201405"
item_no <- "0"
# ECOS takes its parameters as "/"-separated path segments rather than a query string
url <- paste("http://ecos.bok.or.kr/api/StatisticSearch", api_key, file_type, lang_type,
             start_no, end_no, stat_code, cycle_type, start_date, end_date, item_no, sep = "/")
raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")
Processing the Data (ECOS)
> raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")
> dat <- fromJSON(raw.data)
> str(dat)
List of 1
$ StatisticSearch:List of 2
..$ list_total_count: num 25
..$ row :'data.frame': 10 obs. of 8 variables:
.. ..$ UNIT_NAME : chr [1:10] "십억원 " "십억원
.. ..$ STAT_NAME : chr [1:10] "1.1.주요 통화금융지
.. ..$ STAT_CODE : chr [1:10] "010Y002" "010Y002" "010Y002"
.. ..$ ITEM_NAME1: chr [1:10] "화폐발행잔액(말잔)" "화폐발
.. ..$ ITEM_NAME2: chr [1:10] " " " " " " "
.. ..$ DATA_VALUE: chr [1:10] "49777.5" "50528" "50226.
.. ..$ ITEM_NAME3: chr [1:10] " " " " " " " "
.. ..$ TIME : chr [1:10] "201204" "201205" ...
> dat$StatisticSearch$row$DATA_VALUE
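In this response TIME is a "YYYYMM" string and DATA_VALUE is text, so both need converting before use. A minimal sketch on an inline sample, using the two values visible in the output above, with the first day of the month assumed as the date:

```r
library(jsonlite)

# Inline sample shaped like the StatisticSearch response
sample_json <- '{"StatisticSearch": {"list_total_count": 2, "row": [
  {"TIME": "201204", "DATA_VALUE": "49777.5"},
  {"TIME": "201205", "DATA_VALUE": "50528"}
]}}'

rows <- fromJSON(sample_json)$StatisticSearch$row
# Append "01" so that the monthly period can be parsed as a full date
rows$DATE <- as.Date(paste0(rows$TIME, "01"), format = "%Y%m%d")
rows$DATA_VALUE <- as.numeric(rows$DATA_VALUE)
str(rows)
```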
Analyzing the Data (FRED)
library(zoo)
library(jsonlite)
lst_series <- c("CPIAUCSL", "UNRATE", "FEDFUNDS") # consumer price index, unemployment rate, policy rate
api_key <- "b55d00cc4e7ea4483038c2f6edad____"
file_type <- "json"
ts <- zoo()
for (i in seq_along(lst_series)) {
  url <- paste0("http://api.stlouisfed.org/fred/series/observations",
                "?series_id=", lst_series[i], "&api_key=", api_key, "&file_type=", file_type)
  raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")
  dat <- fromJSON(raw.data)
  temp <- zoo(as.numeric(dat$observations$value), as.Date(dat$observations$date))
  if (i == 1) {
    ts <- temp
  } else {
    # Merge on date and carry the last observation forward into gaps
    ts <- na.locf(merge(ts, temp))
    colnames(ts)[i] <- lst_series[i]
  }
}
colnames(ts)[1] <- lst_series[1] # name the first column
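The merge-and-fill step is easier to see on a toy example. The sketch below uses two invented series (the numbers are not FRED data) that are merged in one call, so the column names come from the list names. It is an alternative arrangement for illustration, not the loop used in the slides.

```r
library(zoo)

# Two toy series with different date coverage (values are invented)
series <- list(
  CPIAUCSL = zoo(c(100, 101, 102),
                 as.Date(c("2020-01-01", "2020-02-01", "2020-03-01"))),
  FEDFUNDS = zoo(c(1.5, 0.5),
                 as.Date(c("2020-01-01", "2020-03-01")))
)

# Merge on the date index; dates missing from a series become NA
merged <- do.call(merge, series)

# Carry the last observation forward into the gaps, keeping every row
filled <- na.locf(merged, na.rm = FALSE)
print(filled)
```

Here the February value of FEDFUNDS is filled with the January observation.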
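Analyzing the Data (FRED)
# Remove rows with NA values
ts <- ts[!is.na(ts[, 3]), ]
# First difference
ts.diff1 <- diff(ts, lag = 1)
# ACF (autocorrelation) plot
acf(as.numeric(ts.diff1[, 1]), main = colnames(ts)[1])
# Change relative to the previous period: lag(ts, -1) is the previous observation
ts.rate <- ts.diff1 / lag(ts, -1)
# Convert to a data frame
df <- data.frame(ts)
# Draw the plot
plot(x = as.Date(rownames(df)), y = df[, 1], type = "l", xlab = "date", ylab = colnames(df)[1])
# Regression analysis
summary(lm(CPIAUCSL ~ UNRATE + FEDFUNDS, data = df))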
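Arithmetic between two zoo series aligns them on their dates, so the choice of denominator decides whether a change is measured against the current or the previous period. A toy sketch with invented numbers showing both:

```r
library(zoo)

# Toy level series growing 10% per period (values are invented)
level <- zoo(c(100, 110, 121),
             as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")))

change <- diff(level, lag = 1)         # x_t - x_{t-1}

rate_prev <- change / lag(level, -1)   # divided by x_{t-1}: change over the previous period
rate_curr <- change / level            # divided by x_t: a slightly smaller number

print(rate_prev)
print(rate_curr)
```

Only the first version reproduces the 10% growth built into the toy data.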
Web Scraping (Ministry of Land, Infrastructure and Transport)
Web Scraping (Ministry of Land, Infrastructure and Transport)
library(jsonlite)
dongCode <- "1168010600"
danjiCode <- "ALL"
srhYear <- "2014"
srhPeriod <- "1"
gubunRadio2 <- "1"
url <- paste0("http://rt.molit.go.kr/rtApt.do?cmd=getTradeAptLocal&dongCode=",
              dongCode, "&danjiCode=", danjiCode, "&srhYear=", srhYear,
              "&srhPeriod=", srhPeriod, "&gubunRadio2=", gubunRadio2)
raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")
dat <- fromJSON(raw.data)
str(dat)
# Keep the four fields of interest as named columns
df <- data.frame(
  APT_CODE = dat$detailList$APT_CODE, AREA = dat$detailList$AREA,
  MONTH = dat$detailList$MONTH, SUM_AMT = dat$detailList$SUM_AMT)
write.csv(df, file = "aptTrans.csv")
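How the parsed fields are combined affects their types: cbind() builds a matrix first, which forces every column to a single type and drops the field names, while data.frame() keeps each column as it is. A small sketch with an invented stand-in for detailList (the values and their types are assumptions, not real MOLIT output):

```r
# Invented stand-in for dat$detailList
detail <- list(
  APT_CODE = c("A001", "A002"),
  AREA     = c(84.9, 59.7),
  MONTH    = c(1, 2),
  SUM_AMT  = c(85000, 62000)
)

# Through cbind(): one character matrix, columns named X1..X4
via_cbind <- data.frame(cbind(detail$APT_CODE, detail$AREA,
                              detail$MONTH, detail$SUM_AMT))

# Directly: numeric columns stay numeric and keep their names
direct <- data.frame(APT_CODE = detail$APT_CODE, AREA = detail$AREA,
                     MONTH = detail$MONTH, SUM_AMT = detail$SUM_AMT,
                     stringsAsFactors = FALSE)

str(via_cbind)
str(direct)
```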
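Web Scraping (Ministry of Land, Infrastructure and Transport) - Bulk Retrieval
library(jsonlite)
dongCode <- "1168010600"
danjiCode <- "ALL"
gubunRadio2 <- "1"
dft <- data.frame()
# Request every year from 2006 to 2014 and every period (1 to 4) within each year
for (i in 2006:2014) {
  for (j in 1:4) {
    url <- paste0("http://rt.molit.go.kr/rtApt.do?cmd=getTradeAptLocal&dongCode=",
                  dongCode, "&danjiCode=", danjiCode, "&srhYear=", i,
                  "&srhPeriod=", j, "&gubunRadio2=", gubunRadio2)
    raw.data <- readLines(url, warn = FALSE, encoding = "UTF-8")
    dat <- fromJSON(raw.data)
    df <- data.frame(
      APT_CODE = dat$detailList$APT_CODE, AREA = dat$detailList$AREA,
      MONTH = dat$detailList$MONTH, SUM_AMT = dat$detailList$SUM_AMT)
    # Append this period's rows to the accumulated result
    dft <- rbind(dft, df)
  }
}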
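With this many requests, one failed download would stop the whole run, and rapid-fire requests put load on the server. The sketch below shows one possible arrangement: collect the pieces in a list, skip failures, pause between requests, and bind everything once at the end. fetch_quarter is a hypothetical stand-in that returns invented rows and simulates one failure, so the example runs offline.

```r
# Hypothetical stand-in for the download-and-parse step; a real version would
# call readLines() and fromJSON() on the MOLIT URL for the given year and period.
fetch_quarter <- function(year, quarter) {
  if (year == 2007 && quarter == 3) stop("simulated network failure")
  data.frame(YEAR = year, QUARTER = quarter, SUM_AMT = 1000 * quarter)
}

# Every year/period combination to request (two years here to keep it short)
periods <- expand.grid(quarter = 1:4, year = 2006:2007)

pieces <- lapply(seq_len(nrow(periods)), function(k) {
  result <- tryCatch(
    fetch_quarter(periods$year[k], periods$quarter[k]),
    error = function(e) NULL   # skip a failed request instead of aborting the run
  )
  Sys.sleep(0.1)               # brief pause between requests
  result
})

# Drop the failed requests and bind the rest in a single step
pieces <- Filter(Negate(is.null), pieces)
all_trades <- do.call(rbind, pieces)
print(nrow(all_trades))
```

Binding once at the end also avoids copying the growing data frame on every iteration.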
But, Quantmod
Provides data from Yahoo! Finance, FRED, Google Finance, Oanda,
and The Currency Site through function calls
- http://www.quantmod.com/
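As a rough sketch of what "through function calls" means here, the wrapper below uses quantmod's getSymbols() with the FRED source to fetch a series in one call. fetch_fred_series is a hypothetical name. The call itself needs the quantmod package and network access, so it is left commented out.

```r
# Sketch: one function call returns a FRED series as a time-series object.
fetch_fred_series <- function(series_id) {
  # auto.assign = FALSE returns the data instead of creating a variable
  quantmod::getSymbols(series_id, src = "FRED", auto.assign = FALSE)
}

# Usage (requires quantmod and a network connection):
# cpi <- fetch_fred_series("CPIAUCSL")
# head(cpi)
```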
And, Quandl
Provides data from more than 9 million datasets through function calls
- http://www.quandl.com/