II-SDV 2014 The Road to Federated Text Mining: Are we there yet? (Guy Singh - Linguamatics, UK)
Transcript of II-SDV 2014 The Road to Federated Text Mining: Are we there yet? (Guy Singh - Linguamatics, UK)
The Road to Federated Text Mining: Are we there yet?
II-SDV 2014
Guy Singh
What is federated search?
“Federated search is an information retrieval technology that allows the simultaneous search of multiple searchable resources. A user makes a single query request which is distributed to the search engines participating in the federation”
- Wikipedia
Current Situation
• Volume of data ever increasing
• Proprietary content can reside within Enterprise
• No need for everyone to keep standard sources up-to-date
• Data from content providers can reside on their sites
Linguamatics Customer Confidential
[Diagram labels: Internal Content; External Content; MEDLINE; Clinical Trials; Publisher Content; FDA Drug Labels; Patents]
Increasing Range of Data Sources
Data Sources:
• Scientific Literature
• Social Media
• News
• Web Pages
• Internal Documents
• Patents
• RSS
• Clinical Trials
Varying in Structure
How does text mining differ from keyword search?
Example: What genes affect breast cancer?
Why does document structure matter?
• Searching across documents using keywords is relatively trivial
– No need to know where the words occur or in what context
• Text mining documents with varying structure requires a more sophisticated approach. It needs to:
– Know where words matching entities/concepts occur
– Disambiguate depending on context and location
– Find terms in particular regions/parts of a document for targeted searches
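The difference between keyword search and region-aware search can be sketched in a few lines of Python. This is only an illustration: the `Document` class, the region names and the sample texts are made up for the example and are not part of I2E.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    regions: dict  # hypothetical layout: region name -> text of that region


def _term_pattern(term):
    # Whole-word, case-insensitive match for the search term.
    return re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)


def keyword_search(docs, term):
    """Return ids of documents containing the term anywhere, ignoring structure."""
    pattern = _term_pattern(term)
    return [d.doc_id for d in docs
            if any(pattern.search(text) for text in d.regions.values())]


def region_search(docs, term, region):
    """Return (doc_id, sentence) hits where the term occurs in the named region."""
    pattern = _term_pattern(term)
    hits = []
    for d in docs:
        text = d.regions.get(region, "")
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if pattern.search(sentence):
                hits.append((d.doc_id, sentence))
    return hits


# Made-up sample documents with different structures.
docs = [
    Document("abs-1", {"title": "BRCA1 in breast cancer",
                       "abstract": "BRCA1 mutations raise breast cancer risk."}),
    Document("rep-1", {"history": "Family history of breast cancer.",
                       "diagnosis": "Benign cyst."}),
]

print(keyword_search(docs, "breast cancer"))
print(region_search(docs, "breast cancer", "diagnosis"))
```

The keyword search returns both documents, while the search restricted to the `diagnosis` region returns nothing, because the pathology report only mentions breast cancer in the family history.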
Approaches to dealing with different data sources
• Integrate the data into a data warehouse
– Extract, Transform and Load each data source into a new database
– Multiple copies of the data
– Data normalisation can be difficult
– Time-consuming and expensive process
– Most database vendors take this approach
– Allows users to perform a single search across all the content
• Leave the data where it is: federated content
– Data remains in its original form and location
– Multiple data types
– Multiple network locations
– Single search across multiple different data sources
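A minimal sketch of the second approach, where one query is sent to each source in place and nothing is copied into a warehouse. The source names, the sample texts and the `search_source` helper are assumptions for illustration; a real source would be a remote server, not an in-memory list.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for sources that stay in their original location.
SOURCES = {
    "internal_documents": ["Compound X reduced tumour growth in mice."],
    "fda_drug_labels": ["Compound X label: take with food.", "Compound Y label."],
}


def search_source(name, query):
    """Search one source in place and tag each hit with where it came from."""
    return [{"source": name, "text": text}
            for text in SOURCES[name] if query.lower() in text.lower()]


def federated_search(query, source_names):
    """Send the same query to every source concurrently and collect the hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(source_names))) as pool:
        per_source = list(pool.map(lambda n: search_source(n, query), source_names))
    # pool.map preserves the order of source_names, so the output is stable.
    return [hit for hits in per_source for hit in hits]


hits = federated_search("compound x", ["internal_documents", "fda_drug_labels"])
for hit in hits:
    print(hit["source"], "|", hit["text"])
```

Each hit keeps the name of the source it came from, so the user sees a single result list without losing track of where the data lives.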
How do we get to Federated Text Mining?
• Data Normalisation
• Link the Content Servers
• Merge Results
• Federated Text Mining
Data Normalisation – Virtual Indexes
[Diagram labels: Pathology Reports Index; Journal Abstracts Index; Virtual Index]
Data Normalisation – Document Structure
[Diagram labels: Pathology Reports; Journal Abstracts]
Data Normalisation – Entities
[Diagram labels: Journal Abstracts; Pathology Reports; Combined (Normalized)]
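The slides show data normalisation only as diagrams, so the following is an assumption about what it could involve, not a description of how I2E does it: source-specific region names are mapped onto shared names, and entity synonyms onto one preferred term. All mappings and names below are hypothetical.

```python
# Hypothetical mapping from each source's own region names to shared names.
REGION_MAP = {
    "pathology_reports": {"Diagnosis": "findings", "ClinicalHistory": "background"},
    "journal_abstracts": {"Results": "findings", "Background": "background"},
}

# Hypothetical synonym table: surface form -> preferred term.
ENTITY_MAP = {
    "breast carcinoma": "breast cancer",
    "mammary cancer": "breast cancer",
}


def normalise_regions(source, doc):
    """Rename a document's regions to the shared names; unknown regions are kept."""
    mapping = REGION_MAP[source]
    return {mapping.get(name, name): text for name, text in doc.items()}


def normalise_entity(mention):
    """Map a mention to its preferred term, falling back to the lower-cased mention."""
    key = mention.strip().lower()
    return ENTITY_MAP.get(key, key)


report = normalise_regions("pathology_reports", {"Diagnosis": "Breast carcinoma, grade 2."})
abstract = normalise_regions("journal_abstracts", {"Results": "Mammary cancer risk rose."})
print(report, abstract)
print(normalise_entity("Breast Carcinoma"), normalise_entity("mammary cancer"))
```

After normalisation, a query for "breast cancer" in the `findings` region can be phrased once and applied to both sources.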
Linking Content Servers
Linked Servers
Development Status
• I2E 4.1 introduced a new feature: Linked Servers
• One I2E server can be linked to another I2E server
• Provides access to remote and local indexes and queries through a single I2E interface
– Indexes and queries on remote servers on the network appear the same as local indexes
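The idea that remote indexes appear the same as local ones can be sketched as follows. This is a toy model under stated assumptions: the `Server` class and its methods are invented for the example and do not reflect the actual I2E Linked Servers interface.

```python
class Server:
    """Toy content server holding named indexes (index name -> list of documents)."""

    def __init__(self, name, indexes):
        self.name = name
        self.indexes = dict(indexes)
        self.linked = []  # other Server objects reachable from this one

    def link(self, other):
        """Link another server so that its indexes become visible here."""
        self.linked.append(other)

    def list_indexes(self):
        """List local indexes first, then linked ones, in one flat listing."""
        names = sorted(self.indexes)
        for remote in self.linked:
            names.extend(sorted(remote.indexes))
        return names

    def search(self, index_name, term):
        """Search an index by name, whether it is local or on a linked server."""
        for server in [self, *self.linked]:
            if index_name in server.indexes:
                return [doc for doc in server.indexes[index_name]
                        if term.lower() in doc.lower()]
        raise KeyError(index_name)


# Made-up sample content.
enterprise = Server("enterprise", {"Internal Documents": ["Aspirin stability study."]})
ondemand = Server("ondemand", {"MEDLINE": ["Aspirin and stroke prevention.", "Statin trial."]})
enterprise.link(ondemand)

print(enterprise.list_indexes())
print(enterprise.search("MEDLINE", "aspirin"))
```

The caller only talks to `enterprise`; whether an index is local or remote is resolved behind the single interface.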
I2E 4.1 Linked Servers
[Diagram labels: I2E Enterprise on Customer network; I2E OnDemand SaaS Infrastructure; In-house Indexes; I2E OnDemand Standard Indexes; I2E Enterprise Access; Custom Indexes; Access via Linked Servers; Access via single UI]
Merging Results (Part I)
Single Server, Multiple Queries
I2E 3.0 (2009) – Merging Results (part I) from one server
Profiling Individuals
• Example from news reports related to pharmaceutical industry
• Pick up properties from one document or many
© Linguamatics 2012 - Customer Confidential
I2E 3.0 – Merging Results (part I) from one server
[Screenshot labels: Document Identifier; Patient information; Disease history; Patient data; Medications and dosages; Hit displayed in context]
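Merging the results of several queries run on one server can be pictured as grouping their hits by document identifier. The sketch below assumes each query returns `(doc_id, value)` pairs; the column names and values are made up, and this is not the I2E implementation.

```python
from collections import defaultdict


def merge_by_document(query_results):
    """Combine per-query hits into one record per document.

    query_results maps a column name (one per query) to a list of
    (doc_id, value) pairs returned by that query.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for column, rows in query_results.items():
        for doc_id, value in rows:
            merged[doc_id][column].append(value)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {doc_id: dict(columns) for doc_id, columns in merged.items()}


# Made-up output of two queries over the same index.
results = {
    "disease_history": [("doc-1", "type 2 diabetes")],
    "medications": [("doc-1", "metformin 500 mg"), ("doc-2", "aspirin 75 mg")],
}

profiles = merge_by_document(results)
for doc_id, profile in profiles.items():
    print(doc_id, profile)
```

A document that only matched one query still gets a row, with the other columns simply absent.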
Merging Results (Part II)
Multiple Servers, Multiple Queries
Each server supplies a separate set of results
Content Server 1
Content Server 2
Content Server 3
Content Server 4
Merge into a single set of results
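One way to picture the merge step: take the separate result sets, collapse identical hits, and record which servers returned each one. The row format `(doc_id, hit)` and the sample data are assumptions for this sketch.

```python
def merge_results(per_server):
    """Merge per-server result sets into one list of rows.

    per_server maps a server name to a list of (doc_id, hit) pairs.
    Identical (doc_id, hit) pairs from different servers become one row
    that lists every server it was found on.
    """
    merged = {}
    for server, rows in per_server.items():
        for doc_id, hit in rows:
            row = merged.setdefault(
                (doc_id, hit), {"doc_id": doc_id, "hit": hit, "servers": []})
            row["servers"].append(server)
    # Sort for a stable, reproducible order in the combined result set.
    return sorted(merged.values(), key=lambda r: (r["doc_id"], r["hit"]))


# Made-up result sets from two content servers.
per_server = {
    "Content Server 1": [("PMID-1", "BRCA1 affects breast cancer")],
    "Content Server 2": [("PMID-1", "BRCA1 affects breast cancer"),
                         ("PMC-7", "TP53 affects breast cancer")],
}

combined = merge_results(per_server)
for row in combined:
    print(row)
```

Keeping the list of servers on each row preserves provenance, which matters when the same document is indexed in more than one place.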
The Road to Federated Text Mining
Linking Content Servers
I2E 4.0: Multiple Clients, Multiple Results
[Diagram labels: I2E Server 1, Internal Documents; I2E Server 2, FDA Drug Labels; internal network; external network]
I2E 4.1/4.2: Single Client, Multiple Results
[Diagram labels: I2E Server 1, Internal Documents; I2E Server 2, FDA Drug Labels; internal network; external network; Linked server]
Merging Results (Part II)
Q4 2014: Single Client, Single Result, Multiple Servers
[Diagram labels: I2E Server 1, Internal Documents; I2E Server 2, FDA Drug Labels; internal network; external network; Linked server]
Q4 2014: Federated Text Mining Example
• Single Query
• Differently structured data sources on different servers
– Journal Articles (PubMed Central) on Enterprise Server
– MEDLINE on I2E OnDemand
• Single set of results
The Road to Federated – Are we there yet?
• I2E 4.0: Dec 2012
• I2E 4.1: October 2013
• Next release (in development): Q4 2014
[Diagram labels: Merging the Results (part II); Data Normalisation; Linking Content Servers]
Demo
[Diagram labels: Cambridge; VPN; Nice; Linked Server; Journal Abstracts; Pathology Reports]
Thank you