Lucene solrrev documentlevelsecurity_rajanimaski_final

Rajani Maski - Senior Software Engineer DOCUMENT LEVEL SECURITY IN SEARCH BASED APPLICATIONS

Apache Solr Search Engine

Transcript of Lucene solrrev documentlevelsecurity_rajanimaski_final

Page 1: Lucene solrrev documentlevelsecurity_rajanimaski_final

Rajani Maski - Senior Software Engineer

DOCUMENT LEVEL SECURITY IN SEARCH BASED APPLICATIONS

Page 2: Lucene solrrev documentlevelsecurity_rajanimaski_final

Agenda

• Introduction to Search Based Applications
• Requirement Analysis of Document Level Security
• Access Control Lists
• Multiple Solutions
• Summary

Page 3: Lucene solrrev documentlevelsecurity_rajanimaski_final

Search Based Applications

Search based applications are software applications in which a search engine platform is the core infrastructure for accessing and reporting information.

E-commerce web applications and content management systems are typical examples of search based applications.

Page 4: Lucene solrrev documentlevelsecurity_rajanimaski_final

Overview of Search Based System

Authentication
• The user is authenticated before being given access to the application.

Application
• Presents a full-fledged user interface.
• Performs user operations such as uploading documents, sending emails, searching, etc.

Unified Data Layer
• Search server.
• Indexes content across the sources.
• Retrieves data at very high speed.

Data Storage
• Volumes of data from different repositories.

[Diagram: User Authentication System, Search Based Application Server, Unified Data Layer, and data sources (Archives, Documents, Emails, File Server)]

Page 5: Lucene solrrev documentlevelsecurity_rajanimaski_final

So Far, So Good!

What’s the problem?

Page 6: Lucene solrrev documentlevelsecurity_rajanimaski_final

Common Access to Unified Data Layer

[Diagram: User Authentication System, Search Based Application, Unified Data Layer, and data sources (Archives, Documents, Emails, File Servers)]

How is this a threat?

Page 7: Lucene solrrev documentlevelsecurity_rajanimaski_final

Consider a Sample Use Case

User A:
- Logs in to the application.
- Performs a search operation with keywords such as ‘Pay Slips’, ‘Personal’ or ‘appraisal’.

Sample results are shown for “appraisal”.

Page 8: Lucene solrrev documentlevelsecurity_rajanimaski_final

Unauthorized Results

Search Results

Page 9: Lucene solrrev documentlevelsecurity_rajanimaski_final

Observations

Relevant search results [correct]:
- User A received relevant search results for his query, such as exact matches, "more like this" keywords, synonym keywords, etc.

Unauthorized search results [wrong]:
- A few of the retrieved results were documents that he was not authorized to view.

Threats:
• Exposure of other users’ confidential documents
• Access to unauthorized information

How do we deal with this?

Page 10: Lucene solrrev documentlevelsecurity_rajanimaski_final

Problem Definition

• To develop a search platform where every user has access only to the documents he or she is authorized to view.

• To ensure that confidential data uploaded to the system is not globally searchable unless it is intended to be globally accessible.

How can we achieve this?

Page 11: Lucene solrrev documentlevelsecurity_rajanimaski_final

Solution

Maintain an Access Control List mapped to each document object.

Access Control List?

Page 12: Lucene solrrev documentlevelsecurity_rajanimaski_final

Access Control List

• Access controls are security features that control how users [subjects] and documents [objects] communicate and interact with one another.

• Subject: An active entity [user] that requests access to an object [document].

• Object: A passive entity [document] that contains information.

[Diagram: Subject (user) interacts with Object (document)]

Page 13: Lucene solrrev documentlevelsecurity_rajanimaski_final

Data Model

Let’s first understand the data model of a search engine.

How are documents stored in a search engine? With a document-oriented approach.

Document-oriented representation:

Alec_1167 {
  _id: "1167",
  Name: "Ale C",
  Agent: "Miller",
  Place: "NY, NJ, CA",
  Units: 570
}

The same data as relational rows:

_id   Name  Agent   Units
3424  Kiwi  reds    340
5612  Reh   Mo’s    664
1167  Alec  Miller  570

_id   #  Place
1167  1  NY
1167  2  NJ
1167  3  CA
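To make the document-oriented model concrete, here is a minimal sketch of folding relational rows into one flat document with a multi-valued field. The function name `to_document` and the tuple layout of the place rows are assumptions made for illustration; the values come from the slide.

```python
# Relational rows as on the slide: one parent row plus its place rows.
accounts = [{"_id": "1167", "Name": "Alec", "Agent": "Miller", "Units": 570}]
places = [("1167", 2, "NJ"), ("1167", 3, "CA"), ("1167", 1, "NY")]


def to_document(account, place_rows):
    """Merge a parent row and its child rows into a single flat document."""
    # Keep only this account's places, ordered by their sequence number.
    own_places = [place for (acc_id, _seq, place) in sorted(place_rows)
                  if acc_id == account["_id"]]
    doc = dict(account)
    doc["Place"] = own_places  # multi-valued field replaces the second table
    return doc


document = to_document(accounts[0], places)
```

The search engine then stores and indexes `document` as one unit, so no join is needed at query time.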

Page 14: Lucene solrrev documentlevelsecurity_rajanimaski_final

Indexing and Storing Document Object

• User A uploads a document into the system
• Text extraction
• Convert it to a flat structure
• Input it to the search engine

[Diagram: Document → Text Extract → Search Engine → Document Saved]

Page 15: Lucene solrrev documentlevelsecurity_rajanimaski_final

• We missed capturing something!

• What did we miss?
  – User information for each document:
    • Who uploaded the document?
    • With whom did the user share it?

• How do we maintain this information?
  – With an access control list attached to each document object.

[Diagram: Document → Text Extract → Search Engine → Document Saved]

Page 16: Lucene solrrev documentlevelsecurity_rajanimaski_final

Conventional Solution

• Access control lists for each user.

• At the time of search:
  – Retrieve the search results,
  – Check each document against the user’s authorization, and
  – Finally return the results.

[Diagram: Search Engine → Security Filter Each Document → Return Results to User]
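The conventional flow (search first, then check every hit) can be sketched as follows. This is an illustration only: `search` is a toy stand-in for the search engine, and the `index` and `acl` dictionaries are hypothetical sample data.

```python
# Hypothetical sample data: document texts and, per document, the users allowed to see it.
index = {
    "d1": "Appraisal letter for User A",
    "d2": "Appraisal letter for User B",
    "d3": "Cafeteria menu",
}
acl = {"d1": {"userA"}, "d2": {"userB"}, "d3": {"userA", "userB"}}


def search(query, index):
    """Toy search engine: ids of documents whose text contains the query term."""
    return [doc_id for doc_id, text in index.items() if query.lower() in text.lower()]


def secure_search(query, user, index, acl):
    """Retrieve all matches, then keep only the documents the user may view."""
    matches = search(query, index)
    # Documents with no ACL entry are treated as inaccessible (deny by default).
    return [doc_id for doc_id in matches if user in acl.get(doc_id, set())]


results_for_a = secure_search("appraisal", "userA", index, acl)
```

The cost of this approach grows with the number of matches, because every hit needs its own authorization check before anything is returned.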

Page 17: Lucene solrrev documentlevelsecurity_rajanimaski_final

Multiple Solutions.

Page 18: Lucene solrrev documentlevelsecurity_rajanimaski_final

Access Control Models

Solutions depend on the access control model we choose.

Two important types of access control models:

1. Non-Discretionary Access Control (Role Based)
2. Discretionary Access Control (DAC)

Page 19: Lucene solrrev documentlevelsecurity_rajanimaski_final

1. Non-Discretionary (Role Based)

Definition:

• Non-discretionary access control uses an administered set of rules to determine how users and documents interact.

• It is called non-discretionary because access follows from centrally administered roles rather than from the document owner's discretion: assigning each user to a role is unavoidable.

[Diagram: roles (Sales, Manager, Super User) mapped to Sales Documents, Marketing Documents, Engineering Documents, Admin Documents]

Page 20: Lucene solrrev documentlevelsecurity_rajanimaski_final

Solution For Role Based ACL - Type 1

For a system that has:
• Roles defined at design time and a static ACL set on each document,
• We choose “Early Binding with ACL bound to Document Objects”.

In such systems:
• Document objects include a multi-valued role-id field that contains the list of role-ids that have access to the document.

Documents with ACLs (index time):

Document 1
role-Ids: ["1", "2", "3"]

Document 2
role-Ids: ["2", "3"]
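As a rough sketch of the index-time step, the ACL can be attached to each document before it is sent to the search engine. The field name `role_ids`, the helper `with_acl`, and the document ids are assumptions for illustration; the role values mirror the slide.

```python
import json


def with_acl(doc, role_ids):
    """Return a copy of the document carrying its static ACL as a multi-valued field."""
    enriched = dict(doc)
    enriched["role_ids"] = [str(role_id) for role_id in role_ids]
    return enriched


docs = [
    with_acl({"id": "doc1", "title": "Document 1"}, [1, 2, 3]),
    with_acl({"id": "doc2", "title": "Document 2"}, [2, 3]),
]

# A JSON array of documents, the shape typically posted to a search server for indexing.
payload = json.dumps(docs)
```

Because the ACL travels with the document, any change to a document's roles means re-indexing that document, which is why this variant suits static ACLs.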

Page 21: Lucene solrrev documentlevelsecurity_rajanimaski_final

Early Binding (continued)

At the time of search:
• The user’s search query should be appended with the user’s role-id.
• Solr’s filter query (fq) feature and its filter caching give an efficient solution for such ACL techniques. This is called the ‘Early Binding’ approach.

[Diagram: Query Request with User Role-Id → SolrJ Client → Query Response]
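A minimal sketch of the search-time side, assuming the role ids were indexed in a field named `role_ids` (a hypothetical name). `q` and `fq` are standard Solr request parameters; the helper `build_params` is illustrative.

```python
from urllib.parse import urlencode


def build_params(user_query, role_ids):
    """Combine the user's query with a filter query restricting results to the user's roles."""
    if not role_ids:
        # Never fall back to an unfiltered search for a user without roles.
        raise ValueError("user has no roles; refusing to search unfiltered")
    # int() rejects anything that is not a numeric role id; sorting keeps the
    # filter string identical for identical role sets.
    clause = " OR ".join(str(int(role_id)) for role_id in sorted(role_ids))
    return {"q": user_query, "fq": f"role_ids:({clause})"}


params = build_params("appraisal", [3, 2])
query_string = urlencode(params)  # ready to append to a select URL
```

Keeping the ACL in `fq` rather than in `q` leaves relevance scoring untouched, and producing the same filter string for the same role set lets the filter cache be reused across users who share roles.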

Page 22: Lucene solrrev documentlevelsecurity_rajanimaski_final

Solution For Role Based ACL - Type 2

For systems that have:
• Roles that change often; data is normalized by segregating access control information into different tables.
• This approach is called ‘Early Binding with Externalized ACL’.

In such systems:
• Role-ids are not attached to the document object.
• Instead they are stored in different tables with a foreign key relation.
• Use pseudo joins at the time of search.

Document table: Document1 (D1)

ACL table:
Doc ID  Role-Ids
D1      1, 2, 3, N
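One way to read "pseudo joins" is Solr's join query parser, which is an assumption here since the slide does not name the mechanism. The sketch below builds such a filter; the core name `acl` and the field names `doc_id`, `id` and `role_id` are hypothetical.

```python
def build_join_filter(role_ids, acl_core="acl"):
    """Build a filter that keeps documents whose id appears in the ACL core for any given role."""
    clause = " OR ".join(str(int(role_id)) for role_id in sorted(role_ids))
    # {!join from=... to=... fromIndex=...} is the join query parser's local-params syntax:
    # match ACL records by role, then map their doc_id values onto the main index's id field.
    return f"{{!join from=doc_id to=id fromIndex={acl_core}}}role_id:({clause})"


fq = build_join_filter([3, 2])
```

With the ACL held in its own core, a role change only requires updating the small ACL records, not re-indexing the documents themselves.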

Page 23: Lucene solrrev documentlevelsecurity_rajanimaski_final

2. Discretionary Access Control

Definition:
• Discretionary: the document owner has the authority to control access to the document.
• A system that enables the document owner to specify the set of users with access to a set of documents.

[Diagram: Owner specifies users/groups who can access the Object]

Page 24: Lucene solrrev documentlevelsecurity_rajanimaski_final

Solution for Discretionary ACL - Type 1

For a system that has:
• Frequent changes in the ACL,
• An ACL defined for each user and document,
• We choose ‘Late Binding Approach with Externalized ACL’.

In such systems:
• The ACL is a 2D matrix with users along its rows and documents along its columns.

Users    Doc 1  Doc 2  ...  Doc M
User A   1      1      ...  1
User B   0      1      ...  1
...
User N

Encoded values: 0 = no access, 1 = access
N: number of users, M: number of documents

Page 25: Lucene solrrev documentlevelsecurity_rajanimaski_final

Continued…

For implementation, each row of the ACL matrix can be represented as an array of bits.

This compact representation improves search efficiency and reduces memory overhead.

Users   Doc 1  Doc 2  Doc M  Bit array
User A  1      1      1      [1] 111
User B  0      1      1      [2] 011

Page 26: Lucene solrrev documentlevelsecurity_rajanimaski_final

Example

Consider:
• The maximum number of documents in the search system is 5, with document ids {1, 2, 3, 4, 5}
• The maximum number of users is 2, with ids {1, 2}
• User 1 has access to documents {1, 2, 3}
• User 2 has access to documents {1, 2, 3, 4, 5}

• ACL matrix and array representation:

User  Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Bit array
1     1      1      1      0      0      [1] 11100
2     1      1      1      1      1      [2] 11111
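The example can be reproduced with a short sketch that packs each user's row into an integer used as a bit array. The helper names are illustrative, and mapping document id d to bit d - 1 is an assumed convention.

```python
def acl_row(allowed_doc_ids, num_docs):
    """Pack one user's ACL row into an int; bit (doc_id - 1) is set when access is granted."""
    bits = 0
    for doc_id in allowed_doc_ids:
        if not 1 <= doc_id <= num_docs:
            raise ValueError(f"document id {doc_id} is out of range")
        bits |= 1 << (doc_id - 1)
    return bits


def as_row_string(bits, num_docs):
    """Render the row with document 1 leftmost, matching the matrix layout."""
    return "".join("1" if (bits >> i) & 1 else "0" for i in range(num_docs))


NUM_DOCS = 5
acl = {
    1: acl_row({1, 2, 3}, NUM_DOCS),
    2: acl_row({1, 2, 3, 4, 5}, NUM_DOCS),
}
rows = {user: as_row_string(bits, NUM_DOCS) for user, bits in acl.items()}
```

Each user costs one bit per document, so a row for a million documents fits in roughly 122 KiB.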

Page 27: Lucene solrrev documentlevelsecurity_rajanimaski_final

Solr Implementation

Solution 1
• Solr has a PostFilter interface that can be implemented to develop a custom plugin.
• The filter supplies a collector with a method called ‘collect()’.
• collect() is called for each document matched by the user’s search query.
  – For each document, get the document-id from the field cache and apply the ACL using the bit array.

• Code snippets: https://gist.github.com/rajanim/7197154

[Figure: user ACL bit array 1 1 1 0 0]
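The control flow of such a post filter can be sketched outside Solr. This is plain Python that only mirrors the idea of a collector whose collect() sees one matched document at a time; it is not the Solr API, and the class name, the lookup dictionary standing in for the field cache, and the sample ids are all assumptions.

```python
class AclCollector:
    """Passes a matched document on only if the user's ACL bit for it is set."""

    def __init__(self, user_acl_bits, doc_id_by_internal_id, delegate):
        self.user_acl_bits = user_acl_bits                  # int used as a bit array
        self.doc_id_by_internal_id = doc_id_by_internal_id  # stands in for the field cache
        self.delegate = delegate                            # next stage, called for permitted docs

    def collect(self, internal_id):
        doc_id = self.doc_id_by_internal_id[internal_id]
        # Bit (doc_id - 1) says whether this user may see the document.
        if (self.user_acl_bits >> (doc_id - 1)) & 1:
            self.delegate(internal_id)


permitted = []
collector = AclCollector(
    user_acl_bits=0b00111,                      # access to documents 1, 2 and 3
    doc_id_by_internal_id={0: 1, 1: 4, 2: 3},   # internal id -> application document id
    delegate=permitted.append,
)
for internal_id in (0, 1, 2):                   # documents matched by the query
    collector.collect(internal_id)
```

Only the documents that matched the query are checked, so the per-request cost depends on the size of the result set, not on the size of the index.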

Page 28: Lucene solrrev documentlevelsecurity_rajanimaski_final

Other Implementation Solution

Solution 2
• Use BitSet utilities.
• Get the bitset of documents matched by the search query from the search engine.
• Get the user’s ACL bitset instance.
• Obtain the intersection of the two bitsets [intersect(bitset other)].

[Figure: matched-documents bitset 1 1 1 0 0 AND user ACL bitset 1 1 1 0 0 = 1 1 1 0 0]
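The intersection step reduces to a single bitwise AND. Below is a minimal sketch using Python integers as bitsets; the helper names and the sample document ids are illustrative.

```python
def to_bitset(doc_ids):
    """Set bit (doc_id - 1) for every document id in the collection."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << (doc_id - 1)
    return bits


def to_doc_ids(bits):
    """List the document ids whose bits are set, in ascending order."""
    return [i + 1 for i in range(bits.bit_length()) if (bits >> i) & 1]


matched = to_bitset([2, 3, 4])    # documents matched by the search query
user_acl = to_bitset([1, 2, 3])   # documents the user may access
visible = to_doc_ids(matched & user_acl)
```

Unlike a per-document check, the AND processes many documents per machine operation, which is what makes the bitset form attractive for large result sets.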

Page 29: Lucene solrrev documentlevelsecurity_rajanimaski_final

Solution for Discretionary ACL - Type 2

• For discretionary ACL systems that have a static ACL,
• We choose “Early Binding with ACL bound to Document Objects”.

In such systems:
• Document objects include a multi-valued user-id field that contains the list of user-ids with access to the document.
• The user-id field has to be indexed.

Page 30: Lucene solrrev documentlevelsecurity_rajanimaski_final

Continued…

• This solution requires the ACL and document data to be de-normalized into a flat structure.

Index time: Parse Document → Add list of users who have access
Search time: Query Request with User ID → SolrJ Client → Query Response
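The de-normalization step can be sketched as inverting a list of (user, document) grants into a per-document list of users, then filtering by user id at search time. The field name `user_ids`, both helper names, and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def denormalize(docs, shares):
    """Attach to each document the sorted list of user ids that may access it."""
    users_by_doc = defaultdict(set)
    for user_id, doc_id in shares:
        users_by_doc[doc_id].add(user_id)
    # Documents nobody was granted access to get an empty list.
    return [dict(doc, user_ids=sorted(users_by_doc[doc["id"]])) for doc in docs]


def user_filter(user_id):
    """Build a filter query matching documents shared with this user."""
    escaped = user_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'user_ids:"{escaped}"'


docs = [{"id": "d1"}, {"id": "d2"}, {"id": "d3"}]
shares = [("userA", "d1"), ("userB", "d1"), ("userB", "d2")]
flat_docs = denormalize(docs, shares)
fq = user_filter("userB")
```

The trade-off is the same as for role ids bound to documents: searching is a simple indexed-field filter, but every change to a document's sharing list requires re-indexing that document.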

Page 31: Lucene solrrev documentlevelsecurity_rajanimaski_final

Summary

Page 32: Lucene solrrev documentlevelsecurity_rajanimaski_final

Summary

• Discretionary ACL with a late binding solution is a complex model, and it requires extensive verification.

• Leverage Solr’s caching capabilities.

• Since ACL checks always add overhead, they have to be optimized to keep the added delay to a minimum.

Page 34: Lucene solrrev documentlevelsecurity_rajanimaski_final

Thank You