Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson
-
Upload
lucidworks -
Category
Software
-
view
1.908 -
download
3
description
Transcript of Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson
Near Real Time Indexing Kafka Messages into Apache Blur Dibyendu Bhattacharya Big Data Architect , Pearson
Pearson : What We Do ?
We are building a scalable, reliable cloud-‐based learning pla3orm providing services to power the next genera:on of products for Higher Educa:on.
With a common data pla3orm, we build up student analy:cs across product and ins:tu:on boundaries that deliver efficacy insights to learners and ins:tu:ons not possible before.
Pearson is building ● The worlds greatest collec:on of
educa:onal content ● The worlds most advanced data, analy:cs,
adap:ve, and personaliza:on capabili:es for educa:on
Pearson Learning Platform : GRID
Course Structure
Foundational Services
Grades Recommend-
ations Assignments
Identity
Data Channels
Analytics
Authorization Entitlements
Media
eCommerce
API Management (Apigee)
Activity Eventing Integration
Mobile Enablers
Assessment Catalog
Customer (CRM)
Achievements
Content
Adaptive
Learner Profile Behavioral
IaaS (AWS)
Learning Services
Data , Adaptive and Analytics
Pearson Search Services : Why Blur We are presently evaluating Apache Blur which is Distributed Search engine built on top of Hadoop and Lucene. Primary reason for using Blur is .. • Distributed Search Platform stores Indexes in HDFS. • Leverages all goodness built into the Hadoop and Lucene stack
“HDFS-based indexing is valuable when folks are also using Hadoop for other purposes (MapReduce, SQL queries, HBase, etc.). There are considerable operational efficiencies to a shared storage system. For example, disk space, users, etc. can be centrally managed.” - Doug Cutting
Benefit Descrip-on
Scalable Store , Index and Search massive amount of data from HDFS
Fast Performance similar to standard Lucene implementa:on
Durable Provided by built in WAL like store
Fault Tolerant Auto detect node failure and re-‐assigns indexes to surviving nodes
Query Support all standard Lucene queries and Join queries
Blur Features
Fast Data Ingestion: Blur’s massively parallel processing (MPP) architecture uses MapReduce to bulk load data at incredible speeds. Near Real-Time Updates: Blur’s remote API allows new data to be indexed and searchable right away. Powerful and Accurate Search: Blur allows you to get precise search results against terabytes of data at Google-like speed. Actionable Data: Blur allows you to build rich data models and search them in a semi-relational manner -- similar to joins while querying a relational database -- allowing you to uncover new relationships in your data. Secure: Blur provides record-level access control to your data. This ensures that users only see the information that they are authorized to view. Scalable & Fault Tolerant: Blur provides linear scaling using commodity server hardware and provides automatic and immediate failover when a node in the cluster goes down. Open Source: Blur is part of the Apache Incubator program.
Blur Architecture
Components Purpose
Lucene Perform actual search du:es
HDFS Store Lucene Index
Map Reduce Use Hadoop MR for batch indexing
ThriQ Inter Process Communica:on
Zookeeper Manage System State and stores Metadata
Blur uses two types of Server Processes • Controller Server • Shard Server
Orchestrate Communica:on between all Shard Servers for communica:on
Responsible for performing searches for all shard and returns results to controller
Controller Server
Cache
Shard Server
Cache
S H A R D
S H A R D
Blur Architecture
Controller Server
Cache
Shard Server
Cache
S H A R D
S H A R D
Controller Server
Cache
Shard Server
Cache
S H A R D
S H A R D
Shard Server
Cache
S H A R D
S H A R D
Major Challenges Blur Solved Random Access Latency w/HDFS Problem : HDFS is a great file system for streaming large amounts data across large scale clusters. However the random access latency is typically the same performance you would get in reading from a local drive if the data you are trying to access is not in the operating systems file cache. In other words every access to HDFS is similar to a local read with a cache miss. Lucene relies on file system caching or MMAP of index for performance when executing queries on a single machine with a normal OS file system. Most of time the Lucene index files are cached by the operating system's file system cache. Solution: Blur have a Lucene Directory level block cache to store the hot blocks from the files that Lucene uses for searching. a concurrent LRU map stores the location of the blocks in pre allocated slabs of memory. The slabs of memory are allocated at start-up and in essence are used in place of OS file system cache.
Blur Index Directory Structure -----CacheDirectory : Write Through Cache | ------Use JoinDirectory ( Long Term and Short Term Storage) | ------ Long Term : HDFSDirectory ( Files Written here After Merge) ------ Short Term : FastHdfsKeyValueDirectory (Small or Recently Added Files) | -- Uses HdfsKeyValueStore which act like a WAL
Blur V1 BlockCache, HDFSDirectory is committed to Lucene/Solr Present Blur V2 Cache is more advanced and tuned towards performance.
Blur Data Structure
Blur is a table based query system. So within a single cluster there can be many different tables, each with a different schema, shard size, analyzers, etc. Each table contains Rows. A Row contains a row id (Lucene StringField internally) and many Records. A record has a record id (Lucene StringField internally), a family (Lucene StringField internally), and many Columns. A column contains a name and value, both are Strings in the API but the value can be interpreted as different types. All base Lucene Field types are supported, Text, String, Long, Int, Double, and Float. Row Query :
Ø execute queries across Records within the same Row. Ø similar idea to an inner join.
find all the Rows that contain a Record with the family "author" and has a "name" Column that has that contains a term "Jon" and another Record with the family "docs" and has a "body" Column with a term of "Hadoop". +<author.name:Jon> +<docs.body:Hadoop>
Kafka Stream Indexing to Blur
Major Challenges : • No Reliable Kaba Consumer for Spark Streaming exists.
• Present High Level Kaba Consumer has possible data loss
• No Spark to Blur Connector available
Spark
Spark
Spark RDD
An RDD in Spark is simply a distributed collec:on of objects. Each RDD is split into mul:ple par$$ons, which may be computed on different nodes of the cluster.
There are lot more . Different Types of RDD . E.g. PairRDD RDD Persistence, RDD Check poin:ng ....
Spark in a Slide
RDD (Res i l i en t D i s t r i bu ted Dataset), which is a logically centralized en:ty but physically pa r::oned ac ross mu l:p le machines inside a cluster based on some no:on of key
driver program that launches various parallel opera:ons on a cluster . Driver programs access Spark through a SparkContext object which helps to create RDD.
RDD can op:onally be cached in memory and hence providing fast access. RDD can also be checkpointed to Disk .
applica:on logic are expressed in terms of a sequence of Transforma:on and Ac:on. "Transforma:on" specifies the processing dependency among RDDs and "Ac:on" specifies what the output will be
Spark Streaming
Spark Streaming and Kafka
Spark Streaming has Kaba High level Consumer which has data loss problem. Possible Data loss scenarios.. 1. Receiver Failure (SPARK-‐4062) 2. Driver Failure (SPARK-‐3129)
We have implemented a Low Level Kaba-‐Spark Consumer to solve the Receiver failure
Problem. hjps://github.com/dibbhaj/kaba-‐spark-‐consumer
Ka5a Topic
Low Level Kafka Consumer Challenges
Consumer implemented as Custom Spark Receiver which need to handle ..
• Consumer need to know Leader of a Par::on. • Consumer should aware of leader changes. • Consumer should handle ZK :meout. • Consumer need to manage Kaba message Offset. • Consumer need to handle fetch from topic. All of the Kaba related challenges are already solved in Low Level Storm-‐Kaba Spout. That has been modified to run as Spark Receiver.
Kafka to Blur Integration Using Spark
hjps://github.com/dibbhaj/spark-‐blur-‐connector
Created Spark Conf. Specify Serializer
Created Streaming Context. Set Checkpoint Directory
Create Receiver for Every Topic Par::on
Create Union of All Receiver Stream
Kafka to Blur Integration Using Spark
hjps://github.com/dibbhaj/spark-‐blur-‐connector
First Transforma:on. Union Stream to Pair Stream Create the BlurMutate object
Kafka to Blur Integration Using Spark
hjps://github.com/dibbhaj/spark-‐blur-‐connector
For every RDD of this Pair Stream Perform some ac:on.
Specify Blur Table Details
Make number of Par::on for this RDD same as Blur Table Shard Count using Custom Par::oner. Finally Checkpoint the RDD.
Kafka to Blur Integration Using Spark
hjps://github.com/dibbhaj/spark-‐blur-‐connector
Blur Hadoop Job specific proper:es.
Kafka to Blur Integration Using Spark
hjps://github.com/dibbhaj/spark-‐blur-‐connector
Finally , Save the RDD as Hadoop File. This uses Blur HDFSDirectory to write indexes. This does not go though CacheDirectory.
Start Stream Execu:on
Thank You
Questions ?
We are Hiring @ jobs.pearson.com