II-SDV 2014 The Road to Federated Text Mining: Are we there yet? (Guy Singh - Linguamatics, UK)

Page 1

The Road to Federated Text Mining: Are we there yet?

II-SDV 2014

Guy Singh

Page 2

What is federated search?

“Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources. A user makes a single query request which is distributed to the search engines participating in the federation”

- Wikipedia
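The definition above can be sketched in a few lines of Python. This is only an illustration of the fan-out idea, not any product's API: the two search functions, their toy document lists and the `ENGINES` registry are all hypothetical stand-ins for real search engines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the search engines in a federation.
# Each takes a query string and returns the matching documents.
def search_medline(query):
    docs = ["BRCA1 mutations and breast cancer risk", "Aspirin and stroke"]
    return [d for d in docs if query.lower() in d.lower()]

def search_patents(query):
    docs = ["Method for detecting breast cancer markers"]
    return [d for d in docs if query.lower() in d.lower()]

ENGINES = {"medline": search_medline, "patents": search_patents}

def federated_search(query, engines=ENGINES):
    """Send one query to every engine at once and collect hits per source."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        return {name: fut.result() for name, fut in futures.items()}

results = federated_search("breast cancer")
for source, hits in results.items():
    print(source, hits)
```

The user issues one query; the distribution to each engine happens behind the call.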

Page 3

Current Situation

• Volume of data ever increasing

• Proprietary content can reside within Enterprise

• No need for everyone to keep standard sources up-to-date

• Data from content providers can reside on their sites

[Diagram labels: Internal Content; External Content; MEDLINE; Clinical Trials; Publisher Content; FDA Drug Labels; Patents]

Linguamatics Customer Confidential

Page 4

Increasing Range of Data Sources

Data Sources: Scientific Literature, Social Media, News, Web Pages, Internal Documents, Patents, RSS, Clinical Trials

Page 5

Varying in Structure

Page 6

How does text mining differ from keyword search?

Example: What genes affect breast cancer?

Page 7

Why does document structure matter?

• Searching across documents using keywords is relatively trivial

– Do not need to be aware of where the words occur and in what context

• Text mining documents with varying structure requires a more sophisticated approach; need to:

– Know where words matching entities/concepts occur

– Disambiguate depending on context and location

– Find terms in particular regions/parts of document for targeted searches
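A small Python sketch of the difference between a plain keyword match and a search targeted at particular document regions. The documents, region names (`title`, `body`, `references`) and function names are invented for illustration only.

```python
# Toy documents split into named regions (all content is made up).
DOCS = [
    {"id": "doc1",
     "regions": {"title": "BRCA1 variants in breast cancer",
                 "body": "We studied BRCA1 carriers.",
                 "references": "Smith J. TP53 review."}},
    {"id": "doc2",
     "regions": {"title": "Statin therapy outcomes",
                 "body": "No genetic analysis was performed.",
                 "references": "Jones A. BRCA1 and breast cancer."}},
]

def keyword_search(term, docs):
    """Match the term anywhere in the document, ignoring structure."""
    return [d["id"] for d in docs
            if any(term.lower() in text.lower() for text in d["regions"].values())]

def region_search(term, docs, regions):
    """Match the term only inside the named regions of each document."""
    return [d["id"] for d in docs
            if any(term.lower() in d["regions"].get(r, "").lower() for r in regions)]

print(keyword_search("BRCA1", DOCS))                    # both documents
print(region_search("BRCA1", DOCS, ["title", "body"]))  # only doc1
```

The keyword search returns doc2 even though the gene is mentioned only in its reference list; restricting the search to the title and body regions excludes it.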

Page 8

Approaches to dealing with different data sources

• Integrate the data together into a data warehouse

– Extract, Transform and Load each data source into a new database

– Multiple copies of the data

– Data normalisation can be difficult and challenging

– Time consuming and expensive process

– Most database vendors take this approach

– Allows users to perform a single search across all the content

• Leave the data where it is: federated content

– Data remains in its original form and location

– Multiple data types

– Multiple network locations

– Single search across multiple different data sources

Page 9

How do we get to Federated Text Mining?

Data Normalisation

Link the Content Servers

Merge Results

Federated Text Mining

Page 10

Data Normalisation – Virtual Indexes

Pathology Reports Index

Journal Abstracts Index

Virtual Index
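One plausible reading of the diagram is that a virtual index presents several underlying indexes as a single searchable unit without copying their documents. The Python sketch below illustrates that reading only; the `Index` and `VirtualIndex` classes and the sample texts are hypothetical and are not taken from I2E.

```python
from dataclasses import dataclass, field

@dataclass
class Index:
    name: str
    docs: list  # each document is a dict of region name -> text

@dataclass
class VirtualIndex:
    """Presents several indexes as one, without copying their documents."""
    members: list = field(default_factory=list)

    def search(self, term):
        hits = []
        for index in self.members:
            for position, doc in enumerate(index.docs):
                if any(term.lower() in text.lower() for text in doc.values()):
                    hits.append((index.name, position))
        return hits

pathology = Index("Pathology Reports Index",
                  [{"diagnosis": "Invasive ductal carcinoma"}])
journals = Index("Journal Abstracts Index",
                 [{"abstract": "Ductal carcinoma outcomes"},
                  {"abstract": "Statin therapy"}])

virtual = VirtualIndex([pathology, journals])
print(virtual.search("ductal carcinoma"))
```

Each hit records which member index it came from, so provenance is kept even though the caller searches one object.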

Page 11

Data Normalisation – Document Structure

Pathology Reports

Journal Abstracts
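As a rough sketch of normalising document structure, source-specific section names can be mapped onto a shared set of region names so that one targeted query works on both sources. The section names and the `REGION_MAP` table below are assumptions made up for this example; the slide does not list the actual regions.

```python
# Hypothetical mapping from each source's own section names to shared names.
REGION_MAP = {
    "pathology_reports": {"Final Diagnosis": "conclusion",
                          "Clinical History": "background"},
    "journal_abstracts": {"CONCLUSIONS": "conclusion",
                          "BACKGROUND": "background"},
}

def normalise_regions(source, doc):
    """Rename a document's regions to the shared names; keep unknown ones as-is."""
    mapping = REGION_MAP[source]
    return {mapping.get(region, region): text for region, text in doc.items()}

report = {"Clinical History": "Palpable mass.", "Final Diagnosis": "Carcinoma."}
abstract = {"BACKGROUND": "Prior work is limited.", "CONCLUSIONS": "Risk is raised."}

print(normalise_regions("pathology_reports", report))
print(normalise_regions("journal_abstracts", abstract))
```

After normalisation, a search restricted to `conclusion` reaches the right section in both document types.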

Page 12

Data Normalisation – Entities

Journal Abstracts

Pathology Reports

Combined (Normalized)
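A minimal sketch of entity normalisation in Python: different surface forms found in the two sources are mapped to one preferred term so that their counts can be combined. The synonym table and the mention lists are invented for illustration and do not come from the slide.

```python
from collections import Counter

# Hypothetical synonym table: surface form (lower-cased) -> preferred term.
PREFERRED = {
    "breast cancer": "Breast Neoplasms",
    "breast carcinoma": "Breast Neoplasms",
    "carcinoma of breast": "Breast Neoplasms",
    "heart attack": "Myocardial Infarction",
    "mi": "Myocardial Infarction",
}

def normalise(mention):
    """Return the preferred term for a mention, or the mention itself if unknown."""
    cleaned = mention.strip()
    return PREFERRED.get(cleaned.lower(), cleaned)

journal_mentions = ["breast cancer", "Breast carcinoma", "heart attack"]
pathology_mentions = ["Carcinoma of breast", "MI"]

combined = Counter(normalise(m) for m in journal_mentions + pathology_mentions)
print(combined)
```

Without the shared preferred terms, the five mentions would appear as five unrelated rows instead of two.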

Page 13

Linking Content Servers

Page 14

Linked Servers

Development Status

• I2E 4.1 introduced a new feature – Linked Server

• One I2E server can be linked to another I2E server

• Provides access to remote and local indexes and queries through a single I2E interface (Linked Servers)

– Indexes and queries on remote servers on the network appear the same as local indexes
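The idea that remote indexes appear the same as local ones can be illustrated with a toy Python model. The `Server` class, its methods and the sample documents are hypothetical; this is not the I2E interface, only a sketch of location-transparent access.

```python
class Server:
    """Toy server holding named indexes and links to other servers."""

    def __init__(self, name, indexes):
        self.name = name
        self.indexes = dict(indexes)  # index name -> list of document strings
        self.linked = []

    def link(self, other):
        self.linked.append(other)

    def list_indexes(self):
        """Local and remote index names in one flat listing."""
        names = sorted(self.indexes)
        for server in self.linked:
            names.extend(sorted(server.indexes))
        return names

    def search(self, index_name, term):
        """Run the search wherever the index lives; the caller need not know."""
        for server in [self] + self.linked:
            if index_name in server.indexes:
                return [d for d in server.indexes[index_name]
                        if term.lower() in d.lower()]
        raise KeyError(index_name)

enterprise = Server("enterprise", {"Internal Documents": ["Project memo on BRCA1"]})
ondemand = Server("ondemand", {"MEDLINE": ["BRCA1 and breast cancer", "Statins"]})
enterprise.link(ondemand)

print(enterprise.list_indexes())
print(enterprise.search("MEDLINE", "brca1"))
```

The caller talks only to `enterprise`, yet can list and query the index that lives on `ondemand`.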

Page 15

I2E 4.1 Linked Servers

I2E Enterprise on Customer network

I2E OnDemand SaaS

Infrastructure

In-house Indexes

I2E OnDemand Standard Indexes

I2E Enterprise Access

Custom Indexes

Access via Linked Servers

Access via single UI

Page 16

Merging Results (Part I)

Single Server, Multiple Queries

Page 17

I2E 3.0 (2009) – Merging Results (part I) from one server

Profiling Individuals

• Example from news reports related to pharmaceutical industry

• Pick up properties from one document or many

© Linguamatics 2012 - Customer Confidential

Page 18

I2E 3.0 – Merging Results (part I) from one server

Document Identifier

Patient information

Disease history

Patient data

Medications and dosages

Hit displayed in context

© Linguamatics 2013 - Confidential

Page 19

Merging Results (Part II)

Multiple Servers, Multiple Queries

Page 20

Each Server supplying separate set of results

Content Server 1

Content Server 2

Content Server 3

Content Server 4

Merge into a single set of results
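The merge step on this slide could look roughly like the Python sketch below, under the assumption that each server returns rows of (concept, hit count). The row format, the `merge_results` function and all the numbers are made up for illustration.

```python
from collections import defaultdict

def merge_results(per_server):
    """Combine rows from several servers into one table keyed by concept."""
    merged = defaultdict(lambda: {"hits": 0, "servers": []})
    for server, rows in per_server.items():
        for concept, hits in rows:
            merged[concept]["hits"] += hits
            merged[concept]["servers"].append(server)
    # Highest total first; ties broken alphabetically by concept.
    return sorted(
        ((concept, info["hits"], info["servers"]) for concept, info in merged.items()),
        key=lambda row: (-row[1], row[0]),
    )

# Illustrative per-server results (invented numbers).
per_server = {
    "Content Server 1": [("BRCA1", 12), ("TP53", 4)],
    "Content Server 2": [("BRCA1", 3)],
    "Content Server 3": [("ESR1", 5)],
    "Content Server 4": [("TP53", 2), ("ESR1", 1)],
}

table = merge_results(per_server)
for row in table:
    print(row)
```

Each merged row keeps the list of contributing servers, so the single result set still shows where every hit came from.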

Page 21

The Road to Federated Text Mining

Page 22

Linking Content Servers

Page 23

I2E 4.0: Multiple Clients, Multiple Results

I2E Server 2: FDA Drug Labels (external network)

I2E Server 1: Internal Documents (internal network)

Page 24

I2E 4.1/4.2: Single Client, Multiple Results

I2E Server 2: FDA Drug Labels (external network)

I2E Server 1: Internal Documents (internal network)

Linked server

Page 25

Merging Results (Part II)

Page 26

Q4 2014: Single Client, Single Result, Multiple Servers

I2E Server 2: FDA Drug Labels (external network)

I2E Server 1: Internal Documents (internal network)

Linked server

Page 27

Q4 2014: Federated Text Mining Example

• Single Query

• Differently structured data sources on different servers

– Journal Articles (PubMed Central) on Enterprise Server

– MEDLINE on I2E OnDemand

• Single set of results


Page 28

The Road to Federated – Are we there yet?

I2E 4.0 (Dec 2012)

I2E 4.1 (October 2013)

Next release: in development (Q4 2014)

Milestones shown: Data Normalisation; Linking Content Servers; Merging the Results (part II)

Page 29

Demo

Page 30

Demo

Cambridge

VPN

Nice

Linked Server

Journal Abstracts

Pathology Reports

Page 31

Thank you