II-SDV 2014 The Road to Federated Text Mining: Are we there yet? (Guy Singh - Linguamatics, UK)

The Road to Federated Text Mining: Are we there yet?

II-SDV 2014

Guy Singh


What is federated search?

“Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources. A user makes a single query request which is distributed to the search engines participating in the federation”

- Wikipedia
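
As a rough illustration of the definition above, the sketch below sends one query to several search engines at the same time and gathers the hits. The engine names, their contents and the functions `search_engine` and `federated_search` are hypothetical stand-ins invented for this example; they are not part of any product discussed in the talk.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for independent search engines.
ENGINES = {
    "medline": ["BRCA1 mutations in breast cancer", "Aspirin and stroke risk"],
    "patents": ["Method for detecting BRCA1 variants", "Tablet coating process"],
    "internal": ["Project memo: BRCA1 assay results"],
}


def search_engine(name, query):
    """Search one engine: case-insensitive substring match over its titles."""
    q = query.lower()
    return [(name, title) for title in ENGINES[name] if q in title.lower()]


def federated_search(query):
    """Send the same query to every engine at once and gather the hits."""
    names = sorted(ENGINES)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # pool.map yields results in the same order as `names`.
        per_engine = pool.map(lambda n: search_engine(n, query), names)
        return [hit for hits in per_engine for hit in hits]


hits = federated_search("brca1")
for source, title in hits:
    print(f"{source}: {title}")
```

The user issues a single query; each participating engine answers independently, and the caller sees one combined list tagged with its source.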

Current Situation

• Volume of data ever increasing

• Proprietary content can reside within the enterprise

• No need for everyone to keep standard sources up-to-date

• Data from content providers can reside on their sites

Linguamatics Customer Confidential

[Diagram: Internal Content alongside External Content (MEDLINE, Clinical Trials, Publisher Content, FDA Drug Labels, Patents)]


Increasing Range of Data Sources

Data Sources:

• Scientific Literature

• Social Media

• News

• Web Pages

• Internal Documents

• Patents

• RSS

• Clinical Trials



Varying in Structure

How does text mining differ from keyword search?

Example: What genes affect breast cancer?


Why does document structure matter?

• Searching across documents using keywords is relatively trivial

– No need to know where the words occur or in what context

• Text mining documents with varying structure requires a more sophisticated approach. We need to:

– Know where words matching entities/concepts occur

– Disambiguate depending on context and location

– Find terms in particular regions/parts of a document for targeted searches
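
To make the contrast concrete, here is a minimal sketch that compares a plain keyword search with a search restricted to one region of a document. The documents, region names and both functions are hypothetical and chosen only for illustration.

```python
# Hypothetical documents, each split into named regions.
DOCS = [
    {"id": "abstract-1",
     "regions": {"title": "BRCA1 and breast cancer risk",
                 "body": "BRCA1 mutations increase breast cancer risk."}},
    {"id": "report-7",
     "regions": {"history": "Mother had breast cancer.",
                 "diagnosis": "Benign cyst; no malignancy found."}},
]


def keyword_search(docs, phrase):
    """Match the phrase anywhere in the document, ignoring structure."""
    p = phrase.lower()
    return [d["id"] for d in docs
            if any(p in text.lower() for text in d["regions"].values())]


def region_search(docs, phrase, region):
    """Match the phrase only inside one named region of each document."""
    p = phrase.lower()
    return [d["id"] for d in docs
            if p in d["regions"].get(region, "").lower()]


print(keyword_search(DOCS, "breast cancer"))
print(region_search(DOCS, "breast cancer", "history"))
print(region_search(DOCS, "breast cancer", "diagnosis"))
```

The keyword search returns the pathology report even though the phrase occurs only in the family history. The region-aware search can tell a family-history mention from a diagnosis, which is the kind of disambiguation by location the slide describes.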


Approaches to dealing with different data sources

• Integrate the data into a data warehouse

– Extract, Transform and Load each data source into a new database

– Multiple copies of the data

– Data normalisation can be difficult

– Time-consuming and expensive process

– Most database vendors take this approach

– Allows users to perform a single search across all the content

• Leave the data where it is: federated content

– Data remains in its original form and location

– Multiple data types

– Multiple network locations

– Single search across multiple different data sources


How do we get to Federated Text Mining?

Data Normalisation

Link the Content Servers

Merge Results

Federated Text Mining



Data Normalisation – Virtual Indexes

Pathology Reports Index

Journal Abstracts Index

Virtual Index



Data Normalisation – Document Structure

Pathology Reports

Journal Abstracts


Data Normalisation – Entities

Journal Abstracts

Pathology Reports

Combined (Normalized)
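
The normalisation slides can be sketched roughly as follows: source-specific region names are mapped onto shared names, and different surface forms of an entity are mapped onto one preferred term, so that a single query can run over both sources. All region names, the synonym table and the two functions are assumptions made for this illustration, not the actual I2E mechanism.

```python
# Hypothetical mapping from each source's region names to shared names.
REGION_MAP = {
    "journal_abstracts": {"AbstractText": "text", "ArticleTitle": "title"},
    "pathology_reports": {"Findings": "text", "ReportHeader": "title"},
}

# Hypothetical synonym table mapping surface forms to a preferred term.
PREFERRED_TERM = {
    "breast cancer": "Breast Neoplasms",
    "breast carcinoma": "Breast Neoplasms",
    "carcinoma of the breast": "Breast Neoplasms",
}


def normalise_record(source, record):
    """Rename source-specific regions to the shared region names."""
    mapping = REGION_MAP[source]
    return {mapping[name]: value for name, value in record.items()
            if name in mapping}


def normalise_entities(text):
    """Return the preferred terms whose synonyms occur in the text."""
    lowered = text.lower()
    return sorted({term for synonym, term in PREFERRED_TERM.items()
                   if synonym in lowered})


abstract = normalise_record(
    "journal_abstracts",
    {"ArticleTitle": "Screening study",
     "AbstractText": "Breast carcinoma rates fell."})
report = normalise_record(
    "pathology_reports",
    {"ReportHeader": "Biopsy",
     "Findings": "Carcinoma of the breast, grade 2."})

for doc in (abstract, report):
    print(doc["title"], normalise_entities(doc["text"]))
```

After normalisation both documents expose the same `title` and `text` regions and resolve to the same preferred term, even though the original wording and structure differ.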

Linking Content Servers



• I2E 4.1 introduced a new feature: Linked Servers

• One I2E server can be linked to another I2E server

• Provides access to remote and local indexes and queries through a single I2E interface

– Indexes and queries on remote servers on the network appear the same as local ones
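
The idea that remote indexes look the same as local ones can be pictured with a small sketch. The `Server` class, its methods and the sample indexes below are hypothetical and do not reflect the real I2E interface; they only show one way a single entry point could hide where an index lives.

```python
class Server:
    """A hypothetical text-mining server that holds named indexes."""

    def __init__(self, name, indexes):
        self.name = name
        self.indexes = dict(indexes)   # index name -> list of documents
        self.linked = []               # other Server objects linked to this one

    def link(self, other):
        """Link another server so its indexes become visible here."""
        self.linked.append(other)

    def list_indexes(self):
        """Return local and linked index names in one flat, sorted list."""
        names = list(self.indexes)
        for server in self.linked:
            names.extend(server.indexes)
        return sorted(names)

    def search(self, index_name, term):
        """Search an index by name, wherever it happens to live."""
        for server in [self, *self.linked]:
            if index_name in server.indexes:
                docs = server.indexes[index_name]
                return [d for d in docs if term.lower() in d.lower()]
        raise KeyError(index_name)


enterprise = Server("enterprise", {"Internal Documents": ["Assay memo on BRCA1"]})
on_demand = Server("on-demand", {"MEDLINE": ["BRCA1 in breast cancer", "Statin trial"]})
enterprise.link(on_demand)

print(enterprise.list_indexes())
print(enterprise.search("MEDLINE", "brca1"))
```

The caller talks only to the `enterprise` object and never needs to know that the MEDLINE index is held on the linked server.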

Linked Servers

Development Status



I2E 4.1 Linked Servers

I2E Enterprise on Customer network

I2E OnDemand SaaS

Infrastructure

In-house Indexes

I2E OnDemand Standard Indexes

I2E Enterprise Access

Custom Indexes

Access via Linked Servers

Access via single UI

Merging Results (Part I)

Single Server, Multiple Queries

I2E 3.0 (2009) – Merging Results (part I) from one server

Profiling Individuals

• Example from news reports related to pharmaceutical industry

• Pick up properties from one document or many

© Linguamatics 2012 - Customer Confidential



I2E 3.0 – Merging Results (part I) from one server

Document Identifier

Patient information

Disease history

Patient data

Medications and dosages

Hit displayed in context

Merging Results (Part II)


Multiple Servers, Multiple Queries



Each Server supplying separate set of results

Content Server 1

Content Server 2

Content Server 3

Content Server 4

Merge into a single set of results
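
A minimal sketch of this merge step is shown below. The result rows, the choice of (gene, disease) as the merge key and the decision to sum hit counts are all assumptions for illustration; a real system may merge on different keys or keep hits separate.

```python
from collections import defaultdict

# Hypothetical per-server results: (gene, disease, hit count) rows.
SERVER_RESULTS = {
    "Content Server 1": [("BRCA1", "breast cancer", 12),
                         ("TP53", "breast cancer", 5)],
    "Content Server 2": [("BRCA1", "breast cancer", 3)],
    "Content Server 3": [("BRCA2", "breast cancer", 7)],
    "Content Server 4": [],
}


def merge_results(results_by_server):
    """Combine rows with the same (gene, disease) key across servers."""
    totals = defaultdict(int)
    sources = defaultdict(list)
    for server, rows in results_by_server.items():
        for gene, disease, hits in rows:
            totals[(gene, disease)] += hits
            sources[(gene, disease)].append(server)
    merged = [(gene, disease, totals[(gene, disease)], sources[(gene, disease)])
              for gene, disease in totals]
    # Highest hit count first; ties broken alphabetically by gene.
    return sorted(merged, key=lambda row: (-row[2], row[0]))


merged = merge_results(SERVER_RESULTS)
for gene, disease, hits, servers in merged:
    print(gene, disease, hits, ", ".join(servers))
```

Each merged row keeps the list of servers that contributed to it, so the single result set still shows where every hit came from.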

The Road to Federated Text Mining

I2E 4.0: Multiple Clients, Multiple Results

I2E Server 2 FDA Drug Labels

I2E Server 1 Internal Documents

external network internal network


I2E 4.1/4.2: Single Client, Multiple Results





Linked server

Merging Results (Part II)

Q4 2014: Single Client, Single Result, Multiple Servers





Linked server

Q4 2014: Federated Text Mining Example

• Single Query

• Differently structured data sources on different servers

– Journal Articles (PubMed Central) on Enterprise Server

– MEDLINE on I2E OnDemand

• Single set of results


The Road to Federated – Are we there yet?

• I2E 4.0: Dec 2012

• I2E 4.1: October 2013

• Next release (in development): Q4 2014

Merging the Results (part II)

Data Normalisation


Demo




Demo

Cambridge

VPN

Nice

Linked Server

Journal Abstracts

Pathology Reports

Thank you

