Transcript of "Building Telco Big Data with Open Source" (Tistory)
Building Telco Big Data with Open Source
2012-05-31
KT cloudware Big Data Division
Project Manager, Data Analysis Platform Team: 정구범
KT Big Data Problem & Mission
1
Across KT as a whole, tens of TB of service logs are generated every day.
Usage of the wireless domain (3G/LTE/WiBro/WiFi) in particular is growing explosively.
The wireless domain alone loads several TB of logs daily, kept for at least 3 months (with daily aggregation and analysis), and processing volume and procedures keep increasing.
Each division built its own silo around high-end hardware and commercial RDBMS products.
Cost to consolidate? Tens of billions of KRW or more, while revenue and profit are falling. So how do we build it?
Mission: internalize and productize Cloud & Big Data on an open-source foundation.
Challenge!!!
Legacy Platform Architecture
2
Network System
Service System
…
High Scale-up UNIX Machine
ODS → ETL → DW → ETL → Data Mart → OLAP
Business systems
Bottleneck
Only vertical scaling is possible; mostly single-node structures.
As data grows, performance degrades and costs rise; options for improving performance are limited.
Incremental expansion is impossible (growth means wholesale replacement and bulk migration), so build and operating costs balloon.
Meanwhile source data keeps growing, diverse data delivery is demanded, and processing procedures and results grow ever more complex.
Requirements
3
Cost efficiency
Rational costs: staged, well-timed investment matched to data and performance growth targets
Cost-efficient operation on commodity hardware
Identical processing results
Compatibility: reuse existing SQL as much as possible
Processing results identical to the existing business workloads
Scalability and real-time performance
Storage scalability to absorb continuously growing data
Linear scaling of processing performance as nodes are added
Ability to look up data matching arbitrary conditions in near real time
Capacity for new data and analysis techniques
Capacity for unstructured data: integrate legacy systems and absorb additional data formats
Capacity for analysis: easily add new techniques and algorithms
Requirement Analysis
4
Major requirement → detailed requirements → open-source reference
Cost efficiency → staged scale-out investment; runs on low-spec commodity HW → Hadoop
Identical processing results → SQL compatibility; identical results → Hive
Scalability & performance → storage scalability; linear performance scaling; near-real-time search → Hadoop, Distributed Search
New data & new analytics → capacity for unstructured data; capacity for new analysis functions → R
Conceptual Architecture
5
Data Storing Data Processing
Data Sources
Workflow Control Scheduling
Coordination
Data Integration Query / Script
Processing
Data Collection Indexing Searching
Data Mart
Access HDFS
Access HDFS
Map-Reduce Execution
Scheduled Querying
Integration Executing
Storing
Data Import/Export
Storing
Open Source based Realization Architecture
6
Query Tool
Apache Flume Apache Chukwa Facebook Scribe
Apache Hadoop
Apache Hive Apache Pig
Apache Lucene Apache Solr ElasticSearch
Storing
Scheduled Querying
Log / Data Collection
Searching
Querying
Apache Sqoop
Data Import/Export
Ad-hoc Querying
Map-Reduce Execution
Apache Oozie LinkedIn Azkaban
Cascading Hamake
Access HDFS
Access HDFS
Integration Executing
Log Repository
DBMS
Business systems
OLAP
Apache Zookeeper
Storing
NDAP Workflow
NDAP(Nexr Data Analysis Platform) based Architecture
7
NDAP Data Store (enhanced from Apache Hadoop)
NDAP Enterprise Hive
NDAP Search (based on ElasticSearch)
NDAP Collector (based on Apache Flume)
NDAP Hive (enhanced from Apache Hive)
Data Integrator (based on Apache Sqoop)
Log Collection
Storing Indexing
Access HDFS
Access HDFS
Workflow Engine (enhanced from Apache Oozie)
Workflow Manager
Integration Executing
Scheduled Querying
Hive Performance Analyzer
Hive Workbench
Map-Reduce Execution
Data Import/Export
Manager System
Searching
OLAP (MSTR) Data Mart Querying
NDAP Coordinator (enhanced from Apache ZooKeeper)
Log Repository
NDAP Workflow NDAP Admin Center
NDAP Provisioning (using Fabric)
NDAP Full Stack
8
NDAP Data Store (enhanced from Apache Hadoop)
NDAP Enterprise Hive
NDAP Search (based on ElasticSearch)
NDAP Dashboard
NDAP System Monitor (using Collectd/RabbitMQ)
NDAP Hadoop Manager
NDAP Search Manager
NDAP RHive (open source)
NDAP Collector (based on Apache Flume)
NDAP Hive (enhanced from Apache Hive)
NDAP Workbench
Hive Performance Analyzer
NDAP Coordinator (enhanced from Apache ZooKeeper)
Workflow Manager
Workflow Engine (enhanced from Apache Oozie)
Data Integrator (based on Apache Sqoop)
Chukwa: Chukwa Agent
Open Source Collector
Data Source Adaptor
Queue Sender
Check Point
Agent Driver
Chukwa Collector
Collector Servlet Pipeline
Stage Writer
Writer
Servlet Container (jetty)
Writer
Writer
HDFS
Storage
…
Scribe: Scribe Node
Thrift Listener
Buffer Store
Primary Store
Hash based Load Balancer
HDFS Store
Thrift File Store
Network Store
HDFS
Storage
Flume
Application
(Thrift Client) File Store
Scribe Node Secondary Store
Data
Flume Node
Driver
Event Source
Decorator
Event Sink
HDFS
Storage
Flume Node
Flume Master Meta Repository
Open Source Collector Comparison
10
Item | Flume | Chukwa | Scribe | Notes
Add log sources | ○ | ○ | ○ |
Add log relays | ○ | ○ | Ⅹ |
Ease of log transformation | ○ | △ | Ⅹ | Chukwa: batch-oriented, so real-time processing is lacking
Multiple transport protocols | ○ | Ⅹ | Ⅹ | Chukwa supports only HTTP; Scribe only RPC
Log agent failure recovery | Ⅹ | ○ | Ⅹ | Chukwa: supports checkpoints
Collector fail-over | ○ | ○ | ○ |
Data deduplication | Ⅹ | ○ | Ⅹ |
Log agent monitoring | △ | Ⅹ | Ⅹ | Flume: status monitoring only
Collector monitoring | △ | ○ | ○ | Flume: status monitoring only
Collector state notification | ○ | Ⅹ | Ⅹ | Flume notifies agents when a collector's state changes
Implementation language | Java | Java | C |
Documentation | △ | ○ | Ⅹ | Scribe has almost no documentation; source-level analysis required
Community activity (issues/improvements) | ○ | △ | Ⅹ | Flume is pushed by Cloudera; expected to become an Apache TLP
Flume-based Data Collector Development
11
Flume node
Driver
EventSource
Decorator
EventSink
Data
Flume node
HDFS
Storage
Flume Master
META Repository
RPC
Source
Log file data
Added an Event Source meeting the project's log specification
Added checkpoint support
• Agent/collector state management • richer status monitoring • Open API development • log merging
• Log deduplication
Developed project-specific Decorators
Added an Event Sink targeting the search engine
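The customizations above all slot into Flume's source → decorator → sink pipeline. The sketch below illustrates that shape with hypothetical interfaces (deliberately NOT the real Flume OG API), showing where a log-deduplication decorator like the one listed above would sit in front of the terminal sink:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CollectorPipeline {
    // Hypothetical stand-in for Flume's sink interface.
    interface EventSink { void append(String event); }

    // Terminal sink: stands in for the HDFS / search-engine sink.
    static class ListSink implements EventSink {
        final List<String> stored = new ArrayList<>();
        public void append(String event) { stored.add(event); }
    }

    // Decorator: drops events already seen (the log-deduplication feature).
    static class DedupDecorator implements EventSink {
        private final EventSink next;
        private final Set<String> seen = new HashSet<>();
        DedupDecorator(EventSink next) { this.next = next; }
        public void append(String event) {
            // Forward only the first occurrence of each event downstream.
            if (seen.add(event)) next.append(event);
        }
    }

    public static void main(String[] args) {
        ListSink sink = new ListSink();
        EventSink pipeline = new DedupDecorator(sink);
        for (String e : new String[]{"a", "b", "a", "c", "b"}) pipeline.append(e);
        System.out.println(sink.stored);  // prints [a, b, c]
    }
}
```

Because decorators wrap the sink interface itself, transformations like merging or deduplication can be chained without touching the source or the terminal sink, which is what makes adding project-specific decorators cheap.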
NDAP Collector
12
Based on Apache Flume: distributed, scalable data collection with rapid failure response.
NDAP Data Store (enhanced from Apache Hadoop)
NDAP Coordinator (enhanced from Apache ZooKeeper)
Node Membership Flume
Master
Handling the Hadoop NameNode Single Point of Failure
13
Image mirroring: the active node's edit log is duplicated, with one copy exported over NFS
The standby periodically loads the edit log
Hot standby: DataNodes send block reports to both active and standby
If the standby has not yet loaded the edit log, the report is retransmitted
HDFS client fail-over: automatic active/standby switchover driven by ZooKeeper events
Based on CDH3 + AvatarNode: active/standby HA, integrated with ZooKeeper
NameNode HA (R&D): a NameNode crash is a critical issue, so the NameNode SPOF must be addressed
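The client fail-over step can be pictured as follows. This is an illustrative sketch, not the CDH3/AvatarNode code: the coordinator class simulates the ZooKeeper znode holding the active NameNode address, and the client re-resolves its target whenever a change event fires:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class FailoverClient {
    // Simulated stand-in for the ZooKeeper znode tracking the active NameNode.
    static class Coordinator {
        private final AtomicReference<String> active = new AtomicReference<>();
        private final List<Runnable> watchers = new ArrayList<>();
        Coordinator(String initial) { active.set(initial); }
        String activeNameNode() { return active.get(); }
        void watch(Runnable cb) { watchers.add(cb); }
        // Called when the standby is promoted; fires watch events as ZooKeeper would.
        void failover(String newActive) {
            active.set(newActive);
            for (Runnable w : watchers) w.run();
        }
    }

    private volatile String target;
    FailoverClient(Coordinator c) {
        target = c.activeNameNode();
        c.watch(() -> target = c.activeNameNode());  // re-resolve on event
    }
    String currentTarget() { return target; }

    public static void main(String[] args) {
        Coordinator zk = new Coordinator("nn-active:8020");
        FailoverClient client = new FailoverClient(zk);
        System.out.println(client.currentTarget());  // nn-active:8020
        zk.failover("nn-standby:8020");              // active crashes, standby promoted
        System.out.println(client.currentTarget());  // nn-standby:8020
    }
}
```

The point of the event-driven design is that HDFS clients never hard-code a NameNode address: promotion of the standby is observed, not configured, so no client restart is needed.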
NDAP Data Store (enhanced from Apache Hadoop)
Architecture for Handling the Hadoop NameNode SPOF
14
NDAP Search
15
Based on ElasticSearch
NDAP Enterprise Hive
16
Provides easier, more convenient access to very large data sets, enabling faster business processing.
Developer DBA
NDAP Data Store (enhanced from Apache Hadoop)
NDAP Enterprise Hive
NDAP Workbench
NDAP Hive (enhanced from Apache Hive)
Hive Performance Analyzer
HDFS Access
Map-Reduce Execution
NDAP Enterprise Hive
17
NDAP Workbench – manage Hive metadata, run ad-hoc queries, and browse HDFS from a Web UI
Support Pure Web UI
- No client installation
- No ActiveX (all browsers supported)
Support Connection Management
- Connect to multiple Hive Servers
Support Hive Metadata Management
Support Multi-Querying
- View execution results independently per tab
Support Execution History
- Review previously executed queries and their results
Support Query Snippets
- Save/load frequently used queries
- Organize them into categories
Support saving results to Hadoop
Support result download (CSV)
Why Hive?
18
Reuse legacy queries → cut development effort and time → Apache Hive enhancements → NDAP Enterprise Hive
How do we move the thousands of existing Oracle queries onto Hadoop?
Rewrite as Map-Reduce
Apache Pig – script-based
Apache Hive – query-based
Chosen approach: manual Oracle-to-Hive query conversion
+ Oracle-compatible Hive UDF add-ons
+ enhanced Hive patches
+ a query conversion guide
Apache Hive
An Apache Top-Level Project led by Facebook
Used for Facebook's data warehousing
SQL is dynamically translated into Map-Reduce jobs and executed
Supports ANSI-SQL-style queries (≠ the dialect typically used with Oracle), so manual conversion is needed
Few built-in functions, but user-defined functions can be added
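The Oracle-compatibility add-ons mentioned above are UDFs of this general shape. As a sketch, here is the logic of Oracle's NVL(a, b) written as plain, self-contained Java; in an actual Hive UDF the class would extend org.apache.hadoop.hive.ql.exec.UDF and Hive would invoke evaluate() once per row (the class and method names here are illustrative, not taken from the deck):

```java
public class UDFNvl {
    // NVL semantics: return the first argument unless it is NULL, else the fallback.
    public String evaluate(String value, String fallback) {
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        UDFNvl nvl = new UDFNvl();
        System.out.println(nvl.evaluate(null, "N/A"));  // prints N/A
        System.out.println(nvl.evaluate("KT", "N/A"));  // prints KT
    }
}
```

Registered under the Oracle name (e.g. `CREATE TEMPORARY FUNCTION nvl AS ...`), such a UDF lets an existing Oracle query referencing NVL run after only mechanical conversion.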
A side-by-side comparison: Oracle SQL / HiveQL / Pig script / Map-Reduce
19
Hive QL
select E.empno, E.ename, D.dname, E.sal
from emp E join dept D on (E.deptno = D.deptno);
Oracle SQL
select E.empno, E.ename, D.dname, E.sal
from emp E, dept D
where E.deptno = D.deptno;
Pig Script
EMP = LOAD 'input/hive/tab/emp.txt' AS (empno, ename, sal, deptno);
DEPT = LOAD 'input/hive/tab/dept.txt' AS (deptno, dname);
RESULT = JOIN EMP BY deptno, DEPT BY deptno;
DUMP RESULT;
import java.io.IOException;
import java.util.Iterator;
import kr.devpub.nexr.common.DepartmentTuple;
import kr.devpub.nexr.common.EmploeeTuple;
import kr.devpub.nexr.common.JoinKey;
import kr.devpub.nexr.common.Tuple;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
@SuppressWarnings("deprecation")
public class JoinV1 extends Configured implements Tool {
private static final Log LOG = LogFactory.getLog(JoinV1.class);
public static class EmploeeMapper extends Mapper<LongWritable, Text, JoinKey, Tuple> {
protected void map(LongWritable key, Text value,
Mapper<LongWritable, Text, JoinKey, Tuple>.Context context) throws IOException,
InterruptedException {
Tuple tuple = EmploeeTuple.parse(JoinKey.RIGHT, value.toString());
JoinKey joinKey = new JoinKey(tuple.getField(LongWritable.class, EmploeeTuple.DEPTNO).get(), JoinKey.RIGHT);
context.write(joinKey, tuple);
}
}
public static class DepartmentMapper extends Mapper<LongWritable, Text, JoinKey, Tuple> {
protected void map(LongWritable key, Text value,
Mapper<LongWritable, Text, JoinKey, Tuple>.Context context) throws IOException,
InterruptedException {
Tuple tuple = DepartmentTuple.parse(JoinKey.LEFT, value.toString());
JoinKey joinKey = new JoinKey(tuple.getField(LongWritable.class, DepartmentTuple.DEPTNO).get(), JoinKey.LEFT);
context.write(joinKey, tuple);
}
}
public static class JoinReducer extends Reducer<JoinKey, Tuple, LongWritable, Tuple> {
protected void reduce(JoinKey key, Iterable<Tuple> values,
Reducer<JoinKey, Tuple, LongWritable, Tuple>.Context context) throws IOException,
InterruptedException {
Iterator<Tuple> i = values.iterator();
Tuple department = i.next();
Text departmentName = department.getField(Text.class, DepartmentTuple.NAME);
LOG.info("department: " + department);
if (department.getPlacing() == JoinKey.LEFT) {
while (i.hasNext()) {
Tuple emploee = i.next();
LOG.info("emploee: " + emploee);
context.write(emploee.getField(LongWritable.class, EmploeeTuple.EMPNO), new Tuple(0,
emploee.getField(Text.class, EmploeeTuple.NAME),
departmentName,
emploee.getField(LongWritable.class, EmploeeTuple.SAL)
));
}
}
}
}
@Override
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
job.setJarByClass(JoinV1.class);
job.setJobName("Join");
job.setMapOutputKeyClass(JoinKey.class);
job.setMapOutputValueClass(Tuple.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Tuple.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setGroupingComparatorClass(JoinKey.GroupComparator.class);
job.setSortComparatorClass(JoinKey.SortComparator.class);
job.setReducerClass(JoinReducer.class);
MultipleInputs.addInputPath(job, new Path("/nexr/input/emp"), TextInputFormat.class, EmploeeMapper.class);
MultipleInputs.addInputPath(job, new Path("/nexr/input/dept"), TextInputFormat.class, DepartmentMapper.class);
FileOutputFormat.setOutputPath(job, new Path("/nexr/output/join/v1"));
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new JoinV1(), args);
System.exit(exitCode);
}
}
• Employee Object • Department Object
• JoinKey
Mapper
Reducer
Driver
Map-Reduce
Contribution to Apache Hive
20
-------- Patched : 34 --------
Split thrift_client#execute to compile + run for early access to execution plan HIVE-2684
HiveServer should provide per session configuration HIVE-2503
Support per-session function registry HIVE-2573
Bug fix on merging join tree HIVE-2253
Not using map aggregation, fails to execute group-by after cluster-by with same key HIVE-2329
If all of the parameters of distinct functions are exists in group by columns, query fails in runtime HIVE-2332
optimize order-by followed by a group-by with same key HIVE-2340
Implement BETWEEN operator HIVE-2005
Hive server is SHUTTING DOWN when invalid queries beeing executed HIVE-2264
Implement lateral float/double type and make possible to compare HIVE-2586
Optimize Group By + Order By with the same keys for TPC-H 3.4.2(Q1) HIVE-1772
Fix test failure on ppr_pushdown.q HIVE-2686
Hive CLI should let you specify database on the command line HIVE-2172
Semantic Analysis failed for GroupBy query with aliase HIVE-2709
Does not throw Ambiguous column reference exception in sub query with select star/pattern HIVE-2723
Improve optimization of DPAL-592 and fix test failures HIVE-482
Use name of original expression for name of CAST output HIVE-2477
Implement head sampler for each blocks HIVE-2780
Fail on table sampling HIVE-2778
Select columns by index instead of name HIVE-494
Support auto completion for hive configs in CliDriver HIVE-2796
Add cleanup stages for UDFs HIVE-2261
Implement NULL-safe equality operator <=> HIVE-2810
SUBSTR(CAST(<string> AS BINARY)) produces unexpected results HIVE-2792
Reduce Sink deduplication fails if the child reduce sink is followed by a join HIVE-2732
Make timestamp accessible in the hbase KeyValue HIVE-2781
Implement NULL-safe equi-join HIVE-2827
Invalid tag is used for MapJoinProcessor HIVE-2820
Filters on outer join with mapjoin hint is not applied correctly HIVE-2839
Support between filter pushdown for key ranges in hbase HIVE-2854
Support eventual constant expression for filter pushdown for key ranges in hbase HIVE-2861
Add validation to HiveConf ConfVars HIVE-2848
Ambiguous table name or column reference message displays when table and column names are the same HIVE-2863
Remove redundant key comparing in SMBMapJoinOperator HIVE-2882
-------- Applied : 16 --------
Bug fix on merging join tree HIVE-2253
Not using map aggregation, fails to execute group-by after cluster-by with same key HIVE-2329
Implement BETWEEN operator HIVE-2005
SUBSTR(CAST(<string> AS BINARY)) produces unexpected results HIVE-2792
Implement NULL-safe equality operator HIVE-2810
Implement NULL-safe equi-join HIVE-2827
Fail on table sampling HIVE-2778
HiveServer should provide per session configuration HIVE-2503
Remove redundant key comparing in SMBMapJoinOperator HIVE-2881
Support eventual constant expression for filter pushdown for key ranges in hbase HIVE-2861
Ambiguous table name or column reference message displays when table and column names are the same HIVE-2863
Hive union with NULL constant and string in same column returns all null HIVE-2901
TestHiveServerSessions hangs when executed directly HIVE-2937
GROUP BY causing ClassCastException LazyDioInteger cannot be cast LazyInteger HIVE-2958
Provide error message when using UDAF in the place of UDF instead of throwing NPE HIVE-2956
Reduce Sink deduplication fails if the child reduce sink is followed by a join HIVE-2732
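Among the patches above, HIVE-2810 and HIVE-2827 add NULL-safe equality. Unlike plain `=` (which yields NULL when either side is NULL, so such rows never match a join), `a <=> b` is true when both sides are NULL. Its truth table can be sketched in plain Java:

```java
public class NullSafeEq {
    // Semantics of the <=> operator: NULL <=> NULL is true,
    // NULL <=> x is false, otherwise ordinary equality.
    static boolean nullSafeEquals(Object a, Object b) {
        if (a == null && b == null) return true;
        if (a == null || b == null) return false;
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(nullSafeEquals(null, null));  // prints true
        System.out.println(nullSafeEquals(null, 1));     // prints false
        System.out.println(nullSafeEquals(1, 1));        // prints true
    }
}
```

This matters for the Oracle-to-Hive conversion work described earlier: queries that join on possibly-NULL keys can keep their row-matching behavior instead of silently dropping NULL-keyed rows.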
NDAP Workflow
21
Design regular but complex batch jobs more conveniently, execute them systematically, and support a variety of processing tools.
Hive
RHive
Sqoop
Pig
Remote Shell
JDBC
HDFS
NDAP Workflow
Workflow Engine (enhanced from Apache Oozie)
Data Integrator (based on Apache Sqoop)
Workflow Manager
Developer DBA
NDAP Workflow
22
Workflow Manager – GUI-based job creation, execution, scheduling, and monitoring
Support Pure Web UI
- No client installation
- No ActiveX (all browsers supported)
- GUI-based action diagrams
Support Hive/RHive/Sqoop/Pig
/RemoteShell/JDBC/HDFS
Support dynamic decision
- Dynamic checks & flow control
- Action multi-fork & join
Support job dependency
Support Expression Language
Support Execution Log Viewer
Support repeat scheduling (cron type)
Support Integrated Dashboard
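Under the hood, the workflow engine (enhanced from Apache Oozie) expresses such jobs as XML action graphs. A minimal fork/join sketch, with illustrative names and the action bodies elided, might look like:

```xml
<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.2">
  <start to="fork-etl"/>
  <fork name="fork-etl">
    <path start="hive-agg"/>
    <path start="sqoop-export"/>
  </fork>
  <action name="hive-agg">
    <!-- Hive action configuration elided in this sketch -->
    <ok to="join-etl"/>
    <error to="fail"/>
  </action>
  <action name="sqoop-export">
    <!-- Sqoop action configuration elided in this sketch -->
    <ok to="join-etl"/>
    <error to="fail"/>
  </action>
  <join name="join-etl" to="end"/>
  <kill name="fail"><message>ETL failed</message></kill>
  <end name="end"/>
</workflow-app>
```

The Workflow Manager's GUI action diagrams are essentially a visual editor for this graph: the multi-fork & join and dynamic decision features listed above map onto Oozie's fork/join and decision nodes.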
NDAP Admin Center
23
Processing Performance Test
24
Measured and compared the execution time of each of seven ETL tasks.
AS-IS (RDBMS)
Total time on an HP Superdome (44 cores, 98 GB RAM, dedicated storage)
NDAP Hive
Total time on 8 data nodes (dual Xeon E5640, 48 GB RAM, internal SATA)
ETL Task ID | AS-IS (RDBMS) | NDAP (Hive)
104757931 | 0:34:30 | 0:05:12
100501188 | 0:14:30 | 0:02:21
104757932 | 0:13:30 | 0:03:13
100441247 | 0:14:32 | 0:02:06
104757884 | 0:19:29 | 0:04:22
107586407 | 0:25:31 | 0:03:05
100434825 | 0:15:30 | 0:01:01
Sum | 2:17:32 | 0:21:20
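Converting the two totals from the table to seconds shows the 8-node commodity cluster ran the whole batch roughly 6.4x faster than the Superdome RDBMS:

```java
public class Speedup {
    // Parse an h:mm:ss duration into seconds.
    static int toSeconds(String hms) {
        String[] p = hms.split(":");
        return Integer.parseInt(p[0]) * 3600
             + Integer.parseInt(p[1]) * 60
             + Integer.parseInt(p[2]);
    }

    public static void main(String[] args) {
        int asIs = toSeconds("2:17:32");  // 8252 s total on the RDBMS
        int ndap = toSeconds("0:21:20");  // 1280 s total on NDAP Hive
        System.out.printf("speedup: %.1fx%n", (double) asIs / ndap);  // prints speedup: 6.4x
    }
}
```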
Analysis example: call-quality anomalies
25
• The call-drop rate is highest in the 21:00 hour
• The call-drop rate is high in the area around Konkuk University (건대)
Deployment & Expansion Plan
26
olleh ucloud VPC (Virtual Private Cloud)
KT Platforms
3G Voice/data
LTE data
SMS / MMS
olleh Wibro
olleh Wifi
Wifi Call / VoIP
olleh TV
Wired Internet
Opening in 2013 at a scale of roughly 200 nodes / 1 PB of raw disk; data coverage and scale to be expanded in stages
KT Subscriber Analysis System
To be consolidated in 2013 and expanded:
the largest Hadoop-based analysis platform in Korea
M2M
27
The End.