What Is Big Data? What Is Hadoop? What Is MapReduce?
Big Data
Zekeriya Beşiroğlu
http://zekeriyabesiroglu.com
http://bilginc.com
http://twitter.com/zbesiroglu
Zekeriya Beşiroğlu
• Bilginc IT Academy - Expert Consultant
• 16+ years in IT
• 14+ years with Oracle DB/DWH
• 7+ years with WebLogic
• 3+ years with Big Data
• TROUG
• Speaker
Bilginc IT Academy
DATA TRENDS
- Facebook has a data warehouse of around 60 PB, and it is constantly growing.
- Twitter messages are at most 140 characters each, yet they generate about 8 TB of data per day.
- Data is more than doubling every year.
- Almost 80% of data will be unstructured.
- Amazon: 35% of product sales come from product recommendations.
New Types of Data
• Sentiment: understand how your customers feel about your products and company
• Sensor/Machine: discover patterns in data streaming automatically from sensors and machines
• Unstructured: text, video, pictures
• Server logs: search logs to find patterns
• Geographic: analyze location-based data
• Clickstream: capture and analyze website visitor data
Big Data
https://www.youtube.com/watch?v=1GU4Imbo6R8
Capacity vs Cost
Year    Capacity (GB)    Cost per GB (USD)
1990    0.10             4,000
1997    2                150
2002    80               3.75
2007    750              0.35
2012    3,000            0.05
2015    10,000           0.02
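The figures above can be turned into simple arithmetic: capacity times cost per GB gives a rough price for a typical drive in each year, and the per-GB price alone shows what a fixed volume would have cost. The following is a minimal Python sketch using only the numbers from the table, reading the 2012 and 2015 capacities as 3,000 GB and 10,000 GB; the function names are illustrative, not from the slides.

```python
# (year, capacity in GB, cost per GB in USD), taken from the table above
DRIVES = [
    (1990, 0.10, 4000.00),
    (1997, 2, 150.00),
    (2002, 80, 3.75),
    (2007, 750, 0.35),
    (2012, 3000, 0.05),
    (2015, 10000, 0.02),
]


def drive_cost(capacity_gb, cost_per_gb):
    """Approximate price of one drive: capacity times cost per GB."""
    return capacity_gb * cost_per_gb


def cost_of_storing(gigabytes, year):
    """Cost of storing a given volume at the listed per-GB price for that year."""
    for listed_year, _, cost_per_gb in DRIVES:
        if listed_year == year:
            return gigabytes * cost_per_gb
    raise ValueError(f"no price listed for {year}")


if __name__ == "__main__":
    for year, capacity, cost in DRIVES:
        print(year, round(drive_cost(capacity, cost), 2))
    # The same 1,000 GB at 1990 prices and at 2015 prices
    print(cost_of_storing(1000, 1990), cost_of_storing(1000, 2015))
```

The drive price stays within the same order of magnitude while capacity grows by five orders of magnitude, which is the point the table makes about cheap storage.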
What is Big Data
• Big Data is when the volume, velocity, and variety of data reach the point where it is too difficult or too expensive for traditional systems to work with.
3Vs of Big Data
Traditional Large-Scale Computing System Problems
• Computation has been processor-bound
• Relatively small amounts of data
• Complex processing
• Need for bigger computers
• More memory, more and faster processors
Better Solution
• Distributed systems: multiple machines work on a single job
Problem of Distributed Systems
• Data is stored in a central location
• Data is copied to the processors at runtime
Today
• Total data size: petabytes
• Daily data: terabytes
We Need a New Solution
HADOOP
HADOOP
• Distributes the data when it is stored
SPARK
• Data is distributed in memory
RDBMS vs HADOOP
Hadoop
• Hadoop consists of two components
  • HDFS
  • MapReduce
• Hadoop ecosystem
  • Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
Traditional ETL
Source Layer (Structured Data) -> ETL/ELT -> DWH -> ETL/ELT -> Data Mart
Hadoop ETL
Source Layer (Structured Data, Unstructured Data) -> HADOOP -> DWH -> Data Mart
HDFS
• Hadoop Distributed File System: stores the data
• Data is split into blocks (64 MB…)
• Each block is replicated, e.g. 3 times; the replicas are stored on different nodes
• Based on the Google File System
• Sits on top of a native file system such as ext3, ext4, or xfs
• No random writes allowed; prefers large streaming reads
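The block and replica bullets above can be illustrated with a small calculation. This is a toy Python sketch, assuming the 64 MB block size and replication factor of 3 from the slide; the round-robin placement, the node names, and the 200 MB file are hypothetical and only show that replicas of one block land on different nodes. Real HDFS uses its own rack-aware placement policy.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # replication factor quoted on the slide


def num_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file needs; the last block may be only partly filled."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(n_blocks, nodes, replication=REPLICATION):
    """Toy round-robin placement: every replica of a block goes to a distinct node.

    This is an illustration only, not the placement policy HDFS really uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


if __name__ == "__main__":
    blocks = num_blocks(200)  # hypothetical 200 MB file
    placement = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
    for block, holders in placement.items():
        print(block, holders)
```

Losing any single node in this sketch still leaves at least two copies of every block, which is the reason for replicating onto different nodes.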
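HDFS
• hadoop fs -ls (lists the user's home directory)
• hadoop fs -ls / (lists the root directory)
• hadoop fs -cat /user/zekeriya/deneme.txt (prints the file's contents)
• hadoop fs -mkdir (creates a directory)
• hadoop fs -rm -r veri (recursively removes the directory veri)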
MapReduce
• Processes data in the Hadoop cluster
• Two stages: Map and Reduce
MAPREDUCE
map(String input_key, String input_value)
    foreach word w in input_value:
        emit(w, 1)

reduce(String output_key, Iterator<int> intermediate_vals)
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)
(1000, 'Galatasaray sampiyon olur')
(2000, 'beşiktas sampiyon olur')
(2200, 'Galatasaray Türkiyedir')
MAPREDUCE
Mapper output
('Galatasaray', 1), ('sampiyon', 1), ('olur', 1), ('beşiktas', 1), ('sampiyon', 1), ('olur', 1), ('Galatasaray', 1), ('Türkiyedir', 1)
Intermediate data sent to the Reducer
('Galatasaray', [1, 1])
('sampiyon', [1, 1])
('olur', [1, 1])
('beşiktas', [1])
('Türkiyedir', [1])
Final output of the Reducer
('Galatasaray', 2)
('sampiyon', 2)
('olur', 2)
('beşiktas', 1)
('Türkiyedir', 1)
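The word count walked through above can be reproduced in a single process. This is a minimal Python sketch, not Hadoop code: map_fn and reduce_fn mirror the pseudocode, while shuffle and word_count are illustrative helper names standing in for the grouping step that the framework performs between the two stages.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    """Map stage: emit (word, 1) for every word in the input line."""
    for word in input_value.split():
        yield (word, 1)


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(output_key, intermediate_vals):
    """Reduce stage: sum the counts collected for one word."""
    return (output_key, sum(intermediate_vals))


def word_count(records):
    """Run map, shuffle, and reduce over (key, line) records in one process."""
    mapped = [pair for key, value in records for pair in map_fn(key, value)]
    return dict(reduce_fn(key, vals) for key, vals in shuffle(mapped).items())


if __name__ == "__main__":
    records = [
        (1000, "Galatasaray sampiyon olur"),
        (2000, "beşiktas sampiyon olur"),
        (2200, "Galatasaray Türkiyedir"),
    ]
    for word, count in word_count(records).items():
        print(word, count)
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and each reducer receives only the keys assigned to it; the logic per record and per key is the same.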
Hadoop Ecosystem
• HIVE
  • SQL-like query language
  • Users query data in the Hadoop cluster without knowing Java or MapReduce
• PIG
  • Uses a dataflow scripting language
• IMPALA
  • Open source project created by Cloudera
  • Query language is very similar to HiveQL; returns results much faster
Hadoop Ecosystem
• FLUME
  • Imports data into HDFS as it is generated
  • Example: log files from a web server
• SQOOP
  • Imports data from tables in an OLTP database into HDFS
  • Populates database tables from files in HDFS
• OOZIE
  • Developers create a workflow of MapReduce jobs
Hadoop Ecosystem
• HBASE
  • The Hadoop database
  • NoSQL datastore
  • Stores huge amounts of data: GB, TB, PB
  • Queried with get/put/scan operations
  • Read/write throughput: millions of queries per second, versus thousands of queries per second for an RDBMS
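The get/put/scan access pattern can be sketched with a toy in-memory table. This is plain Python and not the HBase client API: ToyTable, the row keys, and the column names are hypothetical. It keeps row keys sorted, as HBase does, so that a scan over a key range returns rows in key order, with the start row included and the stop row excluded.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a table addressed by row key; NOT the HBase API."""

    def __init__(self):
        self._keys = []   # row keys, kept sorted
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        """Write one cell, creating the row if it does not exist yet."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Read one row by its key; an unknown key gives an empty row."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_row, stop_row):
        """Yield rows with start_row <= key < stop_row, in key order."""
        lo = bisect.bisect_left(self._keys, start_row)
        hi = bisect.bisect_left(self._keys, stop_row)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


if __name__ == "__main__":
    table = ToyTable()
    table.put("user#002", "info:name", "Ayse")
    table.put("user#001", "info:name", "Ali")
    table.put("user#003", "info:name", "Can")
    print(table.get("user#001"))
    print(list(table.scan("user#001", "user#003")))
```

There is no query planner or join here: every operation is addressed by row key, which is what lets a store of this shape spread rows across many machines.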
Big Data
• Finance: fraud detection, customer risk analysis
• Retail: product recommendation, purchases and discounts
• Advertising: more effective web ads
• Defense
• Telco
• Healthcare
Analyzing Twitter Data
• https://github.com/cloudera/cdh-twitter-example
Career Path
• Develop with Hadoop
• Hadoop Administration
• Hadoop for Data Scientists & Analysts
Zekeriya Beşiroğlu http://zekeriyabesiroglu.com http://twitter.com/zbesiroglu
http://bilginc.com http://troug.org