What Is Big Data? What Is Hadoop? What Is MapReduce? Big Data.

Big Data

Zekeriya Beşiroğlu

http://zekeriyabesiroglu.com

http://bilginc.com

http://twitter.com/zbesiroglu

Zekeriya Beşiroğlu

• Bilginc IT Academy - Expert Consultant

• 16+ years in IT

• 14+ years with Oracle DB/DWH

• 7+ years with WebLogic

• 3+ years with Big Data

• TROUG

• Speaker

Bilginc IT Academy

DATA TRENDS

- Facebook has a data warehouse of around 60 PB, and it is constantly growing.

- Twitter messages are 140 bytes each, generating 8 TB of data per day.

- Data is more than doubling every year.

- Almost 80% of data will be unstructured.

- Amazon: 35% of product sales come from product recommendations.

New Types of Data

• Sentiment: Understand how your customers feel about your products / company

• Sensor/Machine: Discover patterns in data streaming automatically from sensors and machines

• Unstructured: Text, video, pictures

• Server Logs: Search logs to find patterns

• Geographic: Analyze location-based data

• Clickstream: Capture and analyze website visitor data

Big Data

https://www.youtube.com/watch?v=1GU4Imbo6R8

Capacity vs Cost

Year    Capacity (GB)    Cost per GB (USD)
1990    0.10             $4000
1997    2                $150
2002    80               $3.75
2007    750              $0.35
2012    3,000            $0.05
2015    10,000           $0.02

What Is Big Data?

• Big Data is when the volume, velocity, and variety of data reach the point where it is too difficult or expensive for traditional systems to work with.

3Vs of Big Data

Problems of Traditional Large-Scale Computing Systems

• Computation has been processor-bound

• Relatively small amount of data

• Complex processing

• Need bigger computers

• More memory, more and faster processors

Better Solution

• Distributed systems: multiple machines work on a single job

Problems of Distributed Systems

• Data is stored in a central location

• Data is copied to the processors at runtime

Today

• Total data size: petabytes

• Daily: terabytes

We Need a New Solution

HADOOP

HADOOP

• Distributes the data when it is stored

SPARK

• Data is distributed in memory

RDBMS vs HADOOP

Hadoop

• Hadoop consists of two components

• HDFS

• MapReduce

• Hadoop ecosystem

• Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.

Traditional ETL

[Diagram: Source Layer (structured data) -> ETL/ELT -> DWH -> ETL/ELT -> Data Mart]

Hadoop ETL

[Diagram components: Source Layer (structured data, unstructured data), HADOOP, DWH, Data Mart]

HDFS

• Hadoop Distributed File System: stores the data

• Data is split into blocks, e.g. 64 MB…

• Each block is replicated, e.g. 3 times; the replicas are stored on different nodes

• Based on the Google File System

• Sits on top of a native file system: ext3, ext4, xfs

• No random writes allowed; favors large streaming reads
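The block and replication figures above translate into simple arithmetic. The following is a minimal Python sketch, assuming the 64 MB block size and replication factor of 3 quoted on the slide (both are configurable on a real cluster). The function name `hdfs_footprint` is hypothetical and is not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide (assumed default here)
REPLICATION = 3      # replication factor from the slide's example


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file.

    Hypothetical helper for illustration only; not a Hadoop API.
    """
    # A file is cut into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times, on different nodes.
    replicas = blocks * replication
    # A partial last block only occupies its actual size on disk.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

Under these assumptions, a 1000 MB file is split into 16 blocks, stored as 48 block replicas, and consumes 3000 MB of raw disk space across the cluster.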

HDFS

• hadoop fs -ls (lists the user's home directory)

• hadoop fs -ls / (root directory)

• hadoop fs -cat /user/zekeriya/deneme.txt

• hadoop fs -mkdir

• hadoop fs -rm -r veri

MapReduce

• Processes data in the Hadoop cluster

• Two stages: Map and Reduce

MAPREDUCE

map(String input_key, String input_value)
    foreach word w in input_value:
        emit(w, 1)

reduce(String output_key, Iterator<int> intermediate_vals)
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)

Input:

(1000, 'Galatasaray sampiyon olur')
(2000, 'beşiktas sampiyon olur')
(2200, 'Galatasaray Türkiyedir')

MAPREDUCE

Mapper output:

('Galatasaray', 1), ('sampiyon', 1), ('olur', 1), ('beşiktas', 1), ('sampiyon', 1), ('olur', 1), ('Galatasaray', 1), ('Türkiyedir', 1)

Intermediate data sent to the Reducer:

('Galatasaray', [1, 1]) ('sampiyon', [1, 1]) ('olur', [1, 1]) ('beşiktas', [1]) ('Türkiyedir', [1])

Final output of the Reducer:

('Galatasaray', 2) ('sampiyon', 2) ('olur', 2) ('beşiktas', 1) ('Türkiyedir', 1)
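The word-count flow above can be tried out on a single machine. The following is a minimal Python sketch that simulates the three steps (map, shuffle, reduce) in one process, using the same three input records. It is not Hadoop code: the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, and a real cluster would run the mappers and reducers on different nodes and sort the keys before reducing.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    """Map: emit (word, 1) for every word in the line; the key is unused."""
    for word in input_value.split():
        yield word, 1


def reduce_fn(output_key, intermediate_vals):
    """Reduce: sum all the 1s collected for one word."""
    count = 0
    for v in intermediate_vals:
        count += v
    return output_key, count


def run_job(records):
    """Single-process stand-in for the framework: map, shuffle, reduce."""
    # Map phase: run the mapper over every (key, value) record.
    mapped = [pair for key, value in records for pair in map_fn(key, value)]
    # Shuffle phase: group all values that share the same key.
    grouped = defaultdict(list)
    for word, one in mapped:
        grouped[word].append(one)
    # Reduce phase: call the reducer once per distinct key.
    return [reduce_fn(word, vals) for word, vals in grouped.items()]


records = [
    (1000, "Galatasaray sampiyon olur"),
    (2000, "beşiktas sampiyon olur"),
    (2200, "Galatasaray Türkiyedir"),
]

for word, count in run_job(records):
    print(word, count)
```

Note that 'olur' occurs in two of the three input lines, so the reducer receives [1, 1] for it, just as it does for 'Galatasaray' and 'sampiyon'.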

Hadoop Ecosystem

• HIVE

• SQL-like query language

• Users query data in the Hadoop cluster without knowing Java or MapReduce

• PIG

• Uses a dataflow scripting language

• IMPALA

• Open source project created by Cloudera

• Query language very similar to HiveQL; returns results much faster

Hadoop Ecosystem

• FLUME

• Imports data into HDFS as it is generated

• E.g. log files from a web server

• SQOOP

• Imports data from tables in an OLTP database into HDFS

• Populates database tables from files in HDFS

• OOZIE

• Developers create a workflow of MapReduce jobs

Hadoop Ecosystem

• HBASE

• The Hadoop database

• NoSQL datastore

• Huge data store: GB, TB, PB

• Query language: get/put/scan

• Read/write throughput: millions of queries per second, versus thousands of queries per second for an RDBMS
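To make the get/put/scan access pattern concrete, here is a toy in-memory Python model. It is only a sketch and is not the HBase client API: the class `ToyTable`, its methods, and the sample row keys and column names are all hypothetical. It assumes what HBase itself does conceptually, namely that rows are kept sorted by row key and that a scan covers a key range with an inclusive start and an exclusive stop.

```python
from bisect import bisect_left, insort


class ToyTable:
    """In-memory stand-in for a key-ordered table; NOT the HBase API."""

    def __init__(self):
        self._rows = {}   # row key -> {column: value}
        self._keys = []   # row keys, kept sorted for range scans

    def put(self, row_key, column, value):
        """Write one cell, creating the row if it does not exist yet."""
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Read one row by its exact key; empty dict if the row is missing."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_row, stop_row):
        """Yield rows with start_row <= key < stop_row, in key order."""
        i = bisect_left(self._keys, start_row)
        while i < len(self._keys) and self._keys[i] < stop_row:
            key = self._keys[i]
            yield key, dict(self._rows[key])
            i += 1


table = ToyTable()
table.put("user2", "info:city", "Ankara")
table.put("user1", "info:city", "Istanbul")
table.put("user3", "info:city", "Izmir")

print(table.get("user1"))
print([key for key, _ in table.scan("user1", "user3")])
```

There is no join or SQL-style filter here: every access goes through the row key, either as a point lookup (get) or as a range (scan), which is the access pattern such a store is designed around.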

Big Data

• Finance: fraud detection, customer risk analysis

• Retail: product recommendations, purchases and discounts

• Advertising: more effective web ads

• Defense

• Telco

• Healthcare

Analyzing Twitter Data

• https://github.com/cloudera/cdh-twitter-example

Career Path

• Develop with Hadoop

• Hadoop Administration

• Hadoop for Data Scientists & Analysts

Zekeriya Beşiroğlu

http://zekeriyabesiroglu.com

http://twitter.com/zbesiroglu

http://bilginc.com

http://troug.org

