Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson


Presented at Lucene/Solr Revolution 2014

Transcript of Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Page 1: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson
Page 2: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Near Real Time Indexing Kafka Messages into Apache Blur. Dibyendu Bhattacharya, Big Data Architect, Pearson

Page 3: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Pearson: What We Do

We are building a scalable, reliable, cloud-based learning platform providing services to power the next generation of products for Higher Education.

With a common data platform, we build up student analytics across product and institution boundaries that deliver efficacy insights to learners and institutions not possible before.

Pearson is building:
●  The world's greatest collection of educational content
●  The world's most advanced data, analytics, adaptive, and personalization capabilities for education

Page 4: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Pearson Learning Platform: GRID

(Platform diagram showing Learning Services and Foundational Services: Course Structure, Grades, Recommendations, Assignments, Assessment, Catalog, Content, Adaptive, Learner Profile, Behavioral, Identity, Authorization, Entitlements, Media, eCommerce, Customer (CRM), Achievements, Analytics, Data Channels, Activity, Eventing, Integration, Mobile Enablers; API Management (Apigee); IaaS (AWS).)

Page 5: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Data, Adaptive and Analytics

Page 6: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Pearson Search Services: Why Blur?

We are presently evaluating Apache Blur, a distributed search engine built on top of Hadoop and Lucene. The primary reasons for using Blur are:
•  Distributed search platform that stores its indexes in HDFS.
•  Leverages all the goodness built into the Hadoop and Lucene stack.

“HDFS-based indexing is valuable when folks are also using Hadoop for other purposes (MapReduce, SQL queries, HBase, etc.). There are considerable operational efficiencies to a shared storage system. For example, disk space, users, etc. can be centrally managed.” - Doug Cutting

Scalable: store, index, and search massive amounts of data from HDFS
Fast: performance similar to a standard Lucene implementation
Durable: provided by a built-in WAL-like store
Fault Tolerant: auto-detects node failure and re-assigns indexes to surviving nodes
Query: supports all standard Lucene queries and join queries

Page 7: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Blur Features

•  Fast Data Ingestion: Blur's massively parallel processing (MPP) architecture uses MapReduce to bulk load data at incredible speeds.
•  Near Real-Time Updates: Blur's remote API allows new data to be indexed and searchable right away.
•  Powerful and Accurate Search: Blur allows you to get precise search results against terabytes of data at Google-like speed.
•  Actionable Data: Blur allows you to build rich data models and search them in a semi-relational manner, similar to joins while querying a relational database, allowing you to uncover new relationships in your data.
•  Secure: Blur provides record-level access control to your data. This ensures that users only see the information that they are authorized to view.
•  Scalable & Fault Tolerant: Blur provides linear scaling using commodity server hardware and provides automatic and immediate failover when a node in the cluster goes down.
•  Open Source: Blur is part of the Apache Incubator program.

Page 8: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Blur Architecture

Lucene: performs the actual search duties
HDFS: stores the Lucene indexes
MapReduce: Hadoop MR used for batch indexing
Thrift: inter-process communication
Zookeeper: manages system state and stores metadata

Blur uses two types of server processes:
•  Controller Server: orchestrates communication between all Shard Servers.
•  Shard Server: performs searches across all of its shards and returns results to the controller.

(Diagram: a Controller Server and a Shard Server, each with its own cache; the Shard Server hosts multiple shards.)

Page 9: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Blur Architecture

(Diagram: multiple Controller Servers route requests to multiple Shard Servers; every server has its own cache, and each Shard Server hosts several shards.)

Page 10: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Major Challenges Blur Solved: Random Access Latency with HDFS

Problem: HDFS is a great file system for streaming large amounts of data across large-scale clusters. However, its random access latency is typically the performance you would get reading from a local drive when the data you are trying to access is not in the operating system's file cache; in other words, every access to HDFS is similar to a local read with a cache miss. Lucene relies on the file system cache, or on MMAP of the index, for performance when executing queries on a single machine with a normal OS file system, because most of the time the Lucene index files are cached by the operating system's file cache.

Solution: Blur has a Lucene Directory-level block cache that stores the hot blocks of the files Lucene uses for searching. A concurrent LRU map tracks the location of the blocks in pre-allocated slabs of memory. The slabs are allocated at start-up and, in essence, are used in place of the OS file system cache.
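To make that cache design concrete, here is a minimal conceptual sketch in Java of an LRU map over a pre-allocated memory slab. It is illustrative only and is not Blur's actual BlockCache implementation; the class name, block size, and single-slab layout are invented for the example (Blur's real cache uses many slabs and a concurrent map).

import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual sketch only: an LRU map over one pre-allocated off-heap slab,
// standing in for the OS file system cache. NOT Blur's BlockCache.
// Assumes slabSizeBytes is at least one block.
public class SlabBlockCache {
    private static final int BLOCK_SIZE = 8 * 1024;                  // assumed block size
    private final ByteBuffer slab;                                    // allocated once at start-up
    private final Deque<Integer> freeSlots = new ArrayDeque<>();
    private final LinkedHashMap<String, Integer> lru =
        new LinkedHashMap<>(16, 0.75f, true);                         // block key -> slot, LRU order

    public SlabBlockCache(int slabSizeBytes) {
        this.slab = ByteBuffer.allocateDirect(slabSizeBytes);
        for (int slot = 0; slot < slabSizeBytes / BLOCK_SIZE; slot++) {
            freeSlots.add(slot);
        }
    }

    // Cache one block of a Lucene file; the key could be "fileName@blockOffset".
    public synchronized void put(String blockKey, byte[] block) {
        Integer slot = lru.remove(blockKey);                          // reuse slot if re-caching
        if (slot == null) {
            slot = freeSlots.poll();
        }
        if (slot == null) {                                           // cache full: evict the LRU block
            Map.Entry<String, Integer> eldest = lru.entrySet().iterator().next();
            slot = eldest.getValue();
            lru.remove(eldest.getKey());
        }
        slab.position(slot * BLOCK_SIZE);
        slab.put(block, 0, Math.min(block.length, BLOCK_SIZE));
        lru.put(blockKey, slot);
    }

    // Return the cached block, or null on a miss (the caller then reads from HDFS).
    public synchronized byte[] get(String blockKey) {
        Integer slot = lru.get(blockKey);
        if (slot == null) {
            return null;
        }
        byte[] copy = new byte[BLOCK_SIZE];
        slab.position(slot * BLOCK_SIZE);
        slab.get(copy);
        return copy;
    }
}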

Page 11: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Blur Index Directory Structure

----- CacheDirectory : write-through cache
  |
  ------ uses JoinDirectory (long-term and short-term storage)
      |
      ------ Long term:  HDFSDirectory (files written here after merge)
      ------ Short term: FastHdfsKeyValueDirectory (small or recently added files)
          |
          -- uses HdfsKeyValueStore, which acts like a WAL

Blur v1's BlockCache and HDFSDirectory have been committed to Lucene/Solr. The present Blur v2 cache is more advanced and tuned towards performance.

Page 12: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Blur Data Structure

Blur is a table-based query system, so within a single cluster there can be many different tables, each with a different schema, shard size, analyzers, etc. Each table contains Rows. A Row contains a row id (a Lucene StringField internally) and many Records. A Record has a record id (Lucene StringField internally), a family (Lucene StringField internally), and many Columns. A Column contains a name and a value; both are Strings in the API, but the value can be interpreted as different types. All base Lucene Field types are supported: Text, String, Long, Int, Double, and Float.

Row Query:
•  Executes queries across Records within the same Row.
•  Similar idea to an inner join.

Example: find all the Rows that contain a Record with the family "author" whose "name" Column contains the term "Jon", and another Record with the family "docs" whose "body" Column contains the term "Hadoop":

+<author.name:Jon> +<docs.body:Hadoop>
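The nesting described above (table, Row, Record, Column) can be sketched with plain Java value classes. This is a conceptual illustration of the data model only; these are not Blur's API classes, and the ids and values are made up.

import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of Blur's data model, not Blur's real classes.
// A table holds Rows; a Row holds Records; a Record belongs to a column
// family and holds name/value Columns (values are Strings in the API).
class Column {
    final String name;
    final String value;
    Column(String name, String value) { this.name = name; this.value = value; }
}

class Record {
    final String recordId;
    final String family;                       // e.g. "author" or "docs"
    final List<Column> columns = new ArrayList<>();
    Record(String recordId, String family) { this.recordId = recordId; this.family = family; }
}

class Row {
    final String rowId;
    final List<Record> records = new ArrayList<>();
    Row(String rowId) { this.rowId = rowId; }
}

public class DataModelExample {
    public static void main(String[] args) {
        // One Row whose Records would match the row query
        // +<author.name:Jon> +<docs.body:Hadoop>
        Row row = new Row("doc-1");

        Record author = new Record("rec-1", "author");
        author.columns.add(new Column("name", "Jon Smith"));

        Record doc = new Record("rec-2", "docs");
        doc.columns.add(new Column("body", "An introduction to Hadoop and HDFS"));

        row.records.add(author);
        row.records.add(doc);
        System.out.println("Row " + row.rowId + " has " + row.records.size() + " records");
    }
}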

Page 13: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Kafka Stream Indexing to Blur

Major Challenges:
•  No reliable Kafka consumer for Spark Streaming exists.
•  The present high-level Kafka consumer has possible data loss.
•  No Spark-to-Blur connector is available.

Page 14: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Spark

Page 15: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Spark

Page 16: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Spark RDD

An RDD in Spark is simply a distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.

There is a lot more to RDDs: different types of RDD (e.g. PairRDD), RDD persistence, RDD checkpointing, and so on.
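A minimal Java sketch of these ideas (an RDD, a PairRDD created by transformations, caching, checkpointing, and an action), written against the Spark 1.x Java API; the application name, master, and checkpoint path are placeholders.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddBasics {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-basics").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.setCheckpointDir("/tmp/spark-checkpoint");         // placeholder path

        // An RDD: a distributed collection of objects, split into partitions.
        JavaRDD<String> words = sc.parallelize(Arrays.asList("kafka", "blur", "spark", "blur"), 2);

        // A PairRDD produced by transformations (lazy, nothing runs yet).
        JavaPairRDD<String, Integer> counts = words
            .mapToPair(w -> new Tuple2<>(w, 1))
            .reduceByKey((a, b) -> a + b);

        counts.cache();        // optionally keep the RDD in memory for fast re-use
        counts.checkpoint();   // optionally persist it to the checkpoint directory

        // An action triggers the actual computation.
        counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));

        sc.stop();
    }
}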

Page 17: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Spark in a Slide

•  RDD (Resilient Distributed Dataset): a logically centralized entity that is physically partitioned across multiple machines in a cluster, based on some notion of key.

•  A driver program launches various parallel operations on a cluster. Driver programs access Spark through a SparkContext object, which helps to create RDDs.

•  An RDD can optionally be cached in memory, providing fast access. An RDD can also be checkpointed to disk.

•  Application logic is expressed as a sequence of Transformations and Actions: a "Transformation" specifies the processing dependency among RDDs, and an "Action" specifies what the output will be.

Page 18: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Spark Streaming

Page 19: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Spark Streaming and Kafka

Spark Streaming's Kafka high-level consumer has a data loss problem. Possible data loss scenarios:

1. Receiver failure (SPARK-4062)
2. Driver failure (SPARK-3129)

We have implemented a low-level Kafka-Spark consumer to solve the receiver failure problem: https://github.com/dibbhatt/kafka-spark-consumer


Page 20: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Low Level Kafka Consumer Challenges

The consumer is implemented as a custom Spark Receiver, which needs to handle the following:

•  The consumer needs to know the leader of a partition.
•  The consumer should be aware of leader changes.
•  The consumer should handle ZK timeouts.
•  The consumer needs to manage Kafka message offsets.
•  The consumer needs to handle fetches from the topic.

All of these Kafka-related challenges are already solved in the low-level Storm-Kafka spout, which has been modified to run as a Spark Receiver (sketched below).
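The low-level consumer runs as a custom Spark Receiver. The Java sketch below shows that shape for the Spark 1.x Receiver API; PartitionReader is a stub standing in for the real low-level Kafka logic (leader lookup, offset management, fetching) and is not an actual library class.

import java.util.Collections;
import java.util.List;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// Minimal sketch of a custom Spark Streaming receiver wrapping a low-level
// Kafka consumer (Spark 1.x style).
public class KafkaPartitionReceiver extends Receiver<byte[]> {

    // Stub for illustration only; a real implementation would use Kafka's
    // SimpleConsumer to find the partition leader and fetch message batches.
    static class PartitionReader {
        PartitionReader(String topic, int partition) {}
        List<byte[]> fetchNextBatch() { return Collections.emptyList(); }
        void commitOffsets() {}
        void close() {}
    }

    private final String topic;
    private final int partition;

    public KafkaPartitionReceiver(String topic, int partition) {
        super(StorageLevel.MEMORY_AND_DISK_SER());
        this.topic = topic;
        this.partition = partition;
    }

    @Override
    public void onStart() {
        new Thread(() -> {
            PartitionReader reader = new PartitionReader(topic, partition);
            while (!isStopped()) {
                for (byte[] message : reader.fetchNextBatch()) {
                    store(message);          // hand each Kafka message to Spark
                }
                reader.commitOffsets();      // commit offsets only after store() succeeds
            }
            reader.close();
        }, "kafka-receiver-" + topic + "-" + partition).start();
    }

    @Override
    public void onStop() {
        // The worker thread checks isStopped() and exits on its own.
    }
}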

 

Page 21: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Kafka to Blur Integration Using Spark

https://github.com/dibbhatt/spark-blur-connector

Create the Spark conf; specify the serializer.

Create the streaming context; set the checkpoint directory.

Create a receiver for every topic partition.

Create a union of all receiver streams (sketched below).
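The four steps above could look roughly like the following Java sketch against the Spark 1.x streaming API. The receiver class is the hypothetical one sketched earlier, and the topic name, partition count, batch duration, and checkpoint path are assumptions; the actual connector code lives in the repository linked above.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSetup {
    public static void main(String[] args) {
        // 1. Create the Spark conf and specify the serializer.
        SparkConf conf = new SparkConf()
            .setAppName("kafka-to-blur")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

        // 2. Create the streaming context and set the checkpoint directory.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(10000));
        jssc.checkpoint("hdfs:///tmp/kafka-blur-checkpoint");   // placeholder path

        // 3. Create one receiver per Kafka topic partition
        //    (KafkaPartitionReceiver is the hypothetical receiver sketched earlier).
        int partitions = 3;                                     // assumed partition count
        List<JavaDStream<byte[]>> streams = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            streams.add(jssc.receiverStream(new KafkaPartitionReceiver("events", p)));
        }

        // 4. Union all receiver streams into a single DStream.
        JavaDStream<byte[]> unioned =
            jssc.union(streams.get(0), streams.subList(1, streams.size()));

        // The BlurMutate transformation and output (next slides) are applied to
        // "unioned" before jssc.start() is called.
    }
}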

Page 22: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Kafka to Blur Integration Using Spark

https://github.com/dibbhatt/spark-blur-connector

First transformation: map the union stream to a pair stream and create the BlurMutate objects.
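A sketch of what that transformation might look like in Java. The (Text, BlurMutate) pair mirrors what Blur's MapReduce output format consumes, but the BlurMutate constructor and methods shown here are assumptions modeled on Blur's Row/Record/Column data model rather than a verified API listing.

import org.apache.blur.mapreduce.lib.BlurMutate;
import org.apache.hadoop.io.Text;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

public class ToBlurMutate {
    // Turn the unioned byte[] stream into a (Text rowId, BlurMutate) pair stream.
    // NOTE: the BlurMutate constructor and addColumn call below are assumptions,
    // not a verified API listing; check them against the Blur release in use.
    public static JavaPairDStream<Text, BlurMutate> toMutations(JavaDStream<byte[]> unioned) {
        return unioned.mapToPair(message -> {
            String body = new String(message, "UTF-8");
            String rowId = String.valueOf(Math.abs(body.hashCode()));    // assumed row-id scheme
            BlurMutate mutate = new BlurMutate(
                BlurMutate.MUTATE_TYPE.REPLACE, rowId, rowId, "event");  // assumed signature
            mutate.addColumn("body", body);                              // assumed method
            return new Tuple2<>(new Text(rowId), mutate);
        });
    }
}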

Page 23: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Kafka to Blur Integration Using Spark

https://github.com/dibbhatt/spark-blur-connector

For every RDD of this pair stream, perform some action.

Specify the Blur table details.

Make the number of partitions of this RDD the same as the Blur table shard count, using a custom Partitioner. Finally, checkpoint the RDD (see the sketch below).
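A sketch of that per-RDD step, again assuming the Spark 1.x Java API. The shard count is passed in as a parameter, and the hash-based partitioner is only a stand-in: the connector's real partitioner has to hash row ids exactly the way Blur assigns rows to shards.

import org.apache.blur.mapreduce.lib.BlurMutate;
import org.apache.hadoop.io.Text;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaPairDStream;

public class PerRddStep {

    // One Spark partition per Blur shard. The simple hash below is only a
    // stand-in for Blur's own row-id-to-shard hashing.
    static class ShardPartitioner extends Partitioner {
        private final int shardCount;
        ShardPartitioner(int shardCount) { this.shardCount = shardCount; }

        @Override
        public int numPartitions() { return shardCount; }

        @Override
        public int getPartition(Object key) {
            return (key.hashCode() & Integer.MAX_VALUE) % shardCount;
        }
    }

    public static void process(JavaPairDStream<Text, BlurMutate> mutations, final int blurShardCount) {
        // Spark 1.x-era foreachRDD signature: a Function that returns Void.
        mutations.foreachRDD(new Function<JavaPairRDD<Text, BlurMutate>, Void>() {
            @Override
            public Void call(JavaPairRDD<Text, BlurMutate> rdd) {
                // Repartition so the RDD has one partition per Blur table shard,
                // then checkpoint the RDD to truncate its lineage.
                JavaPairRDD<Text, BlurMutate> partitioned =
                    rdd.partitionBy(new ShardPartitioner(blurShardCount));
                partitioned.checkpoint();
                // ... write "partitioned" out with Blur's output format (next slides)
                return null;
            }
        });
    }
}

Matching the partition count to the shard count means each Spark task writes exactly one shard's worth of index data, which is why the custom partitioner is needed at all.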

Page 24: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Kafka to Blur Integration Using Spark

https://github.com/dibbhatt/spark-blur-connector

Blur Hadoop job-specific properties.
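Those job properties are set on a Hadoop Job configured with Blur's output format, roughly as below. The helper methods (setupJob, setIndexLocally, setOptimizeInFlight) follow Blur's documented MapReduce examples but should be verified against the Blur release in use; the controller address and table name are placeholders.

import org.apache.blur.mapreduce.lib.BlurMutate;
import org.apache.blur.mapreduce.lib.BlurOutputFormat;
import org.apache.blur.thrift.BlurClient;
import org.apache.blur.thrift.generated.Blur;
import org.apache.blur.thrift.generated.TableDescriptor;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class BlurJobConfig {
    // Build the Hadoop Job that carries the Blur-specific properties.
    // Method names follow Blur's documented MapReduce examples; verify them
    // against the Blur version in use. Host and table name are placeholders.
    public static Job blurJob() throws Exception {
        Blur.Iface client = BlurClient.getClient("blur-controller:40010");   // placeholder
        TableDescriptor tableDescriptor = client.describe("events_table");   // placeholder

        Job job = Job.getInstance(new Configuration(), "kafka-to-blur");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BlurMutate.class);
        job.setOutputFormatClass(BlurOutputFormat.class);

        BlurOutputFormat.setupJob(job, tableDescriptor);     // table location, shard count, analyzers
        BlurOutputFormat.setIndexLocally(job, true);         // build the index locally, then copy to HDFS
        BlurOutputFormat.setOptimizeInFlight(job, false);    // skip the in-flight merge/optimize step
        return job;
    }
}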

Page 25: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Kafka to Blur Integration Using Spark

https://github.com/dibbhatt/spark-blur-connector

Finally, save the RDD as a Hadoop file. This uses Blur's HDFSDirectory to write the indexes; it does not go through the CacheDirectory.

Start stream execution.
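Putting the last two steps together: a sketch of the final write and the stream start, continuing the earlier snippets. The table path is a placeholder, and the BlurOutputFormat usage follows the hedged job configuration above.

import org.apache.blur.mapreduce.lib.BlurMutate;
import org.apache.blur.mapreduce.lib.BlurOutputFormat;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class WriteAndStart {
    // Inside the foreachRDD block: write the partitioned RDD with Blur's
    // output format, which writes Lucene index files through HDFSDirectory.
    public static void write(JavaPairRDD<Text, BlurMutate> partitioned, Job blurJob) {
        partitioned.saveAsNewAPIHadoopFile(
            "hdfs:///blur/tables/events_table",     // placeholder table path
            Text.class,
            BlurMutate.class,
            BlurOutputFormat.class,
            blurJob.getConfiguration());
    }

    // Back in the driver: start the streaming computation.
    public static void start(JavaStreamingContext jssc) throws InterruptedException {
        jssc.start();
        jssc.awaitTermination();
    }
}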

Page 26: Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Thank You

Questions?

We are Hiring @ jobs.pearson.com