Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Near Real Time Indexing Kafka Messages into Apache Blur Dibyendu Bhattacharya Big Data Architect , Pearson

Pearson : What We Do ?

We are building a scalable, reliable cloud-‐based learning pla3orm providing services to power the next genera:on of products for Higher Educa:on.

With a common data pla3orm, we build up student analy:cs across product and ins:tu:on boundaries that deliver efficacy insights to learners and ins:tu:ons not possible before.

Pearson is building ●  The worlds greatest collec:on of

educa:onal content ●  The worlds most advanced data, analy:cs,

adap:ve, and personaliza:on capabili:es for educa:on

Pearson Learning Platform : GRID

Course Structure

Foundational Services

Grades Recommend-

ations Assignments

Identity

Data Channels

Analytics

Authorization Entitlements

Media

eCommerce

API Management (Apigee)

Activity Eventing Integration

Mobile Enablers

Assessment Catalog

Customer (CRM)

Achievements

Content

Adaptive

Learner Profile Behavioral

IaaS (AWS)

Learning Services

Data , Adaptive and Analytics

Pearson Search Services : Why Blur We are presently evaluating Apache Blur which is Distributed Search engine built on top of Hadoop and Lucene. Primary reason for using Blur is .. •  Distributed Search Platform stores Indexes in HDFS. •  Leverages all goodness built into the Hadoop and Lucene stack

“HDFS-based indexing is valuable when folks are also using Hadoop for other purposes (MapReduce, SQL queries, HBase, etc.). There are considerable operational efficiencies to a shared storage system. For example, disk space, users, etc. can be centrally managed.” - Doug Cutting

Benefit Descrip-on

Scalable Store , Index and Search massive amount of data from HDFS

Fast Performance similar to standard Lucene implementa:on

Durable Provided by built in WAL like store

Fault Tolerant Auto detect node failure and re-‐assigns indexes to surviving nodes

Query Support all standard Lucene queries and Join queries

Blur Features

Fast Data Ingestion: Blur’s massively parallel processing (MPP) architecture uses MapReduce to bulk load data at incredible speeds. Near Real-Time Updates: Blur’s remote API allows new data to be indexed and searchable right away. Powerful and Accurate Search: Blur allows you to get precise search results against terabytes of data at Google-like speed. Actionable Data: Blur allows you to build rich data models and search them in a semi-relational manner -- similar to joins while querying a relational database -- allowing you to uncover new relationships in your data. Secure: Blur provides record-level access control to your data. This ensures that users only see the information that they are authorized to view. Scalable & Fault Tolerant: Blur provides linear scaling using commodity server hardware and provides automatic and immediate failover when a node in the cluster goes down. Open Source: Blur is part of the Apache Incubator program.

Blur Architecture

Components Purpose

Lucene Perform actual search du:es

HDFS Store Lucene Index

Map Reduce Use Hadoop MR for batch indexing

ThriQ Inter Process Communica:on

Zookeeper Manage System State and stores Metadata

Blur uses two types of Server Processes •  Controller Server •  Shard Server

Orchestrate Communica:on between all Shard Servers for communica:on

Responsible for performing searches for all shard and returns results to controller

Controller Server

Cache

Shard Server

Cache

S H A R D

S H A R D

Blur Architecture

Controller Server

Cache

Shard Server

Cache

S H A R D

S H A R D

Controller Server

Cache

Shard Server

Cache

S H A R D

S H A R D

Shard Server

Cache

S H A R D

S H A R D

Major Challenges Blur Solved Random Access Latency w/HDFS Problem : HDFS is a great file system for streaming large amounts data across large scale clusters. However the random access latency is typically the same performance you would get in reading from a local drive if the data you are trying to access is not in the operating systems file cache. In other words every access to HDFS is similar to a local read with a cache miss. Lucene relies on file system caching or MMAP of index for performance when executing queries on a single machine with a normal OS file system. Most of time the Lucene index files are cached by the operating system's file system cache. Solution: Blur have a Lucene Directory level block cache to store the hot blocks from the files that Lucene uses for searching. a concurrent LRU map stores the location of the blocks in pre allocated slabs of memory. The slabs of memory are allocated at start-up and in essence are used in place of OS file system cache.

Blur Index Directory Structure -----CacheDirectory : Write Through Cache | ------Use JoinDirectory ( Long Term and Short Term Storage) | ------ Long Term : HDFSDirectory ( Files Written here After Merge) ------ Short Term : FastHdfsKeyValueDirectory (Small or Recently Added Files) | -- Uses HdfsKeyValueStore which act like a WAL

Blur V1 BlockCache, HDFSDirectory is committed to Lucene/Solr Present Blur V2 Cache is more advanced and tuned towards performance.

Blur Data Structure

Blur is a table based query system. So within a single cluster there can be many different tables, each with a different schema, shard size, analyzers, etc. Each table contains Rows. A Row contains a row id (Lucene StringField internally) and many Records. A record has a record id (Lucene StringField internally), a family (Lucene StringField internally), and many Columns. A column contains a name and value, both are Strings in the API but the value can be interpreted as different types. All base Lucene Field types are supported, Text, String, Long, Int, Double, and Float. Row Query :

Ø execute queries across Records within the same Row. Ø similar idea to an inner join.

find all the Rows that contain a Record with the family "author" and has a "name" Column that has that contains a term "Jon" and another Record with the family "docs" and has a "body" Column with a term of "Hadoop". +<author.name:Jon> +<docs.body:Hadoop>

Kafka Stream Indexing to Blur

Major Challenges : •  No Reliable Kaba Consumer for Spark Streaming exists.

• Present High Level Kaba Consumer has possible data loss

•  No Spark to Blur Connector available

Spark RDD

An RDD in Spark is simply a distributed collec:on of objects. Each RDD is split into mul:ple par$$ons, which may be computed on different nodes of the cluster.

There are lot more . Different Types of RDD . E.g. PairRDD RDD Persistence, RDD Check poin:ng ....

Spark in a Slide

RDD (Res i l i en t D i s t r i bu ted Dataset), which is a logically centralized en:ty but physically pa r::oned ac ross mu l:p le machines inside a cluster based on some no:on of key

driver program that launches various parallel opera:ons on a cluster . Driver programs access Spark through a SparkContext object which helps to create RDD.

RDD can op:onally be cached in memory and hence providing fast access. RDD can also be checkpointed to Disk .

applica:on logic are expressed in terms of a sequence of Transforma:on and Ac:on. "Transforma:on" specifies the processing dependency among RDDs and "Ac:on" specifies what the output will be

Spark Streaming

Spark Streaming and Kafka

Spark Streaming has Kaba High level Consumer which has data loss problem. Possible Data loss scenarios.. 1.  Receiver Failure (SPARK-‐4062) 2.  Driver Failure (SPARK-‐3129)

We have implemented a Low Level Kaba-‐Spark Consumer to solve the Receiver failure

Problem. hjps://github.com/dibbhaj/kaba-‐spark-‐consumer

Ka5a Topic

Low Level Kafka Consumer Challenges

Consumer implemented as Custom Spark Receiver which need to handle ..

•  Consumer need to know Leader of a Par::on. •  Consumer should aware of leader changes. •  Consumer should handle ZK :meout. •  Consumer need to manage Kaba message Offset. •  Consumer need to handle fetch from topic. All of the Kaba related challenges are already solved in Low Level Storm-‐Kaba Spout. That has been modified to run as Spark Receiver.

Kafka to Blur Integration Using Spark

hjps://github.com/dibbhaj/spark-‐blur-‐connector

Created Spark Conf. Specify Serializer

Created Streaming Context. Set Checkpoint Directory

Create Receiver for Every Topic Par::on

Create Union of All Receiver Stream



First Transforma:on. Union Stream to Pair Stream Create the BlurMutate object



For every RDD of this Pair Stream Perform some ac:on.

Specify Blur Table Details

Make number of Par::on for this RDD same as Blur Table Shard Count using Custom Par::oner. Finally Checkpoint the RDD.



Blur Hadoop Job specific proper:es.



Finally , Save the RDD as Hadoop File. This uses Blur HDFSDirectory to write indexes. This does not go though CacheDirectory.

Start Stream Execu:on

Thank You

Questions ?

We are Hiring @ jobs.pearson.com

Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Software

Transcript of Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson