Ontology-Guided Search and Text Mining for Intelligence Gathering
Kurt Godden, Ph.D.
MSR Lab, R&D

Outline
• Definitions of terms
• Customers (Who cares?)
• Finding Text – ontology-guided search
• Text Processing
  – Content extraction
  – Text Mining
• Temporal Data Mining at GM
• Multi-Lingual Text Processing
• Summary

What is Text Mining?
• Data Mining:
  – The process of analyzing data to discover new patterns or relationships
  – The 1st International Conference was KDD-95
  – http://www-aig.jpl.nasa.gov/public/kdd95/
• Text Mining (TM) is a subfield of Data Mining
  – As such, ideally TM is the process of analyzing unstructured text to discover new patterns or relationships
  – In practice, TM often refers simply to the Content Extraction (CE) of structured data from unstructured text, usually by finite-state parsers.

Content Extraction: Structured Data from Unstructured Text
<XYZ-Corp, exports-through, Dubai>
“Company XYZ is known to ship products through the port of Dubai.”
From Text to Actionable Knowledge:
[Figure: diagram labeled with location names extracted from text, for example DubaiUAE, ShanghaiPort, Karachi, Hamburg, MisratahLibya, and SanaaYemen]
• Automatic multi-language scanning
• Entity and Relation extraction/distillation
• Filtering

Who Cares?
• Government
  – NSA, CIA, DIA, DHS, DARPA
• Industry
  – Automotive
  – Chemical
  – Pharmaceutical
  – Legal
  – Consumer goods
  – Aerospace

Why do they care?
• Intelligence and Security
  – Valdis E. Krebs was able to manually map much of the 9/11 terrorist cell from public documents.
    • http://vlado.fmf.uni-lj.si/pub/networks/doc/Seminar/Krebs.pdf
• Industrial
  – Urban Legend: (Is it true?) “80% of all corporate knowledge is in text.”
  – Market research
  – Fraud detection
  – Root cause analysis
  – Document clustering and categorization
  – Competitive intelligence
  – Patent analysis
  – etc.

Ontology-Guided Search (OGS)
• Oft-cited definition of ontology by T.R. Gruber:
  – An ontology is a formal specification of a shared conceptualization.
• www.vivisimo.com clusters search results according to semantic categories
• OGS: use an ontology to guide the search for documents, so that results include not only the keywords of interest but also terms that are semantically related to those keywords

What ontology to use?
• Public
  – WordNet: http://wordnet.princeton.edu/
    • Organizes content words (N, V, Adj, Adv) into sets of semantically related concepts connected by relations
    • Currently 207k word-sense pairs
      – <bank1, monetary institution>
      – <bank2, land adjacent to river>
• Custom
  – Parts
  – Products
  – Processes
• Tool: Protégé at http://protege.stanford.edu/

Ontology-Guided Search (OGS)
• Use ontology to search not only on keywords, but on semantically related keywords
• Example groups of related search terms:
  – avoids, avoiding, avoided
  – neighborhood, neighborhoods, suburb, suburbs
  – riot, riots, “civil unrest”
  – “driving through”, “drive through”, “drove through”
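
The expansion idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the `RELATED` table is a tiny hand-built stand-in for a real ontology, and `quote` and `expand_query` are hypothetical helper names, not part of any tool mentioned in the talk. Each query concept becomes an OR-group of the keyword and its related terms, and the groups are joined with AND.

```python
# Toy ontology: each keyword maps to semantically related terms.
# The entries are illustrative assumptions, not output from WordNet.
RELATED = {
    "avoid": ["avoids", "avoiding", "avoided"],
    "neighborhood": ["neighborhoods", "suburb", "suburbs"],
    "riot": ["riots", "civil unrest"],
    "drive through": ["driving through", "drove through"],
}


def quote(term):
    """Quote multi-word terms so a search engine treats them as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(keywords, related=RELATED):
    """Build a boolean query: AND across concepts, OR within each concept."""
    groups = []
    for kw in keywords:
        terms = [kw] + related.get(kw, [])
        groups.append("(" + " OR ".join(quote(t) for t in terms) + ")")
    return " AND ".join(groups)


print(expand_query(["avoid", "riot"]))
```

Keywords that are missing from the table pass through unchanged, so the expanded query is never narrower than the original one.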

Pitfalls of OGS
• Beware of semantically related terms
• Simulation of OGS using WordNet
  – Original query:
    • Which neighborhoods of Paris are safe?
  – One of several transformed queries was:
    • Which suburbs of Paris are condoms?
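
WordNet lists the noun "safe" as a synonym of "condom", which is presumably how the transformed query above arose. One possible guard, sketched here as an assumption rather than as the method used in the simulation, is to substitute only synonyms recorded for the same part of speech. The `SYNONYMS` table, the hand-tagged query, and the `expand` helper are all hypothetical.

```python
# Toy synonym table keyed by (word, part of speech); entries are illustrative.
SYNONYMS = {
    ("neighborhoods", "noun"): ["suburbs"],
    ("safe", "noun"): ["condoms"],   # the noun sense behind the bad query
    ("safe", "adj"): ["secure"],
}


def expand(tagged_query, synonyms=SYNONYMS, use_pos=True):
    """Return one transformed query per single-word substitution."""
    results = []
    for i, (word, pos) in enumerate(tagged_query):
        if use_pos:
            candidates = synonyms.get((word, pos), [])
        else:
            # Ignore part of speech: pool every synonym listed for the word.
            candidates = [s for (w, _), syns in synonyms.items()
                          if w == word for s in syns]
        for syn in candidates:
            words = [w for w, _ in tagged_query]
            words[i] = syn
            results.append(" ".join(words))
    return results


# Hand-tagged query (assumption: a tagger would normally supply these tags).
query = [("which", "det"), ("neighborhoods", "noun"), ("of", "prep"),
         ("paris", "noun"), ("are", "verb"), ("safe", "adj")]

naive = expand(query, use_pos=False)    # includes the "condoms" query
guarded = expand(query, use_pos=True)   # keeps only same-POS substitutions
```

The guard removes this particular error, but it does not address wrong senses within the same part of speech, such as the two senses of "bank".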

Content Extraction Technology
• Regular Expressions Mapped to Semantic Templates
• Regular Expression for Passives:
  NP1 BE TV [by NP2]
  “The lecture was presented by Kurt Godden”
• Mapping of Match Registers to Template
  <NP2:agent, TV:relation, NP1:object>
  <kg, presented, lecture>
• Post-Processing Rule:
  if NP2 is the empty string, then use ‘someone’:agent
Content Extraction Example“Some 40 vehicles were torched in the Val d'Oise area NW of Paris.”
http://www.breitbart.com/news/2005/11/04/D8DLFA780.html
For pattern: NP1 BE TV [by NP2]‘vehicles’ matches NP1
‘were’ matches BE‘torched’ matches TV
No match for NP2
• Canonicalize tokens via a domain ontology (e.g. vehicles→vehicle, torched→burn)<someone, burn, vehicle>
• Additional triples can be matched by other RegExp patterns, giving:<vehicle, count, 40><vehicle, located-in, val-d’oise><val-d’oise, near, paris>
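
The canonicalization step and the additional patterns might look roughly like the following. Everything here is an assumption made for illustration: the `CANONICAL` table stands in for a domain ontology, and the three regular expressions are written narrowly for the example sentence rather than for general text.

```python
import re

# Hypothetical domain ontology: surface form -> canonical concept.
CANONICAL = {"vehicles": "vehicle", "torched": "burn"}


def canon(token):
    """Lower-case, hyphenate, and look up the canonical concept."""
    key = token.lower().replace(" ", "-")
    return CANONICAL.get(key, key)


# A place name is approximated as a run of tokens that each contain a capital.
PLACE = r"[A-Z][\w']*(?:\s+[\w']*[A-Z][\w']*)*"
COUNT = re.compile(r"\b(?P<n>\d+)\s+(?P<noun>[a-z]+)\b")
LOCATED = re.compile(
    rf"\b(?P<noun>[a-z]+)\s+were\s+\w+\s+in\s+the\s+(?P<place>{PLACE})\s+area")
NEAR = re.compile(
    rf"(?P<place>{PLACE})\s+area\s+(?:[NS][EW]?|[EW])\s+of\s+(?P<ref>{PLACE})")


def extract_triples(sentence):
    """Apply each pattern in turn and emit canonicalized triples."""
    triples = []
    m = COUNT.search(sentence)
    if m:
        triples.append((canon(m.group("noun")), "count", int(m.group("n"))))
    m = LOCATED.search(sentence)
    if m:
        triples.append((canon(m.group("noun")), "located-in", canon(m.group("place"))))
    m = NEAR.search(sentence)
    if m:
        triples.append((canon(m.group("place")), "near", canon(m.group("ref"))))
    return triples


sentence = "Some 40 vehicles were torched in the Val d'Oise area NW of Paris."
for triple in extract_triples(sentence):
    print(triple)
```

Each pattern contributes at most one triple, which keeps the rules independent: a sentence that matches only the count pattern still yields a usable fact.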

Why Only Regular Expressions?
• Computational Efficiency
• Practical Adequacy
• Workaround for lack of recursion: Lots of REs!
  The recursive rule NP → NP and NP becomes
  NP → CN and CN
  NP → CN and CN and CN
  NP → NAME and NAME
  NP → NAME and NAME and NAME
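
The workaround above amounts to unrolling the recursive rule to a fixed depth. A sketch of how such patterns could be generated rather than written by hand, assuming toy definitions of CN and NAME and hypothetical helper names:

```python
import re

CN = r"[a-z]+"           # stand-in for a common noun (assumption)
NAME = r"[A-Z][a-z]+"    # stand-in for a proper name (assumption)


def coordination_patterns(unit, max_conjuncts=3):
    """Enumerate flat patterns: 'unit and unit', 'unit and unit and unit', ..."""
    return [r"\s+and\s+".join([f"({unit})"] * n)
            for n in range(2, max_conjuncts + 1)]


def match_np(text, unit, max_conjuncts=3):
    """Try the longest pattern first; return the conjuncts or None."""
    for pattern in reversed(coordination_patterns(unit, max_conjuncts)):
        m = re.fullmatch(pattern, text)
        if m:
            return list(m.groups())
    return None


print(match_np("Laxman and Sastry and Unnikrishnan", NAME))
print(match_np("cars and trucks", CN))
```

Coordinations longer than `max_conjuncts` are simply not matched, which is the practical price of giving up recursion.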

After Text Must Come Mining
• Temporal Data Mining research by K.P. Unnikrishnan (GM R&D) and P.S. Sastry (IISc, Bangalore)
• TDMiner
  – Proprietary tool
  – Discovers frequent sequences of events from symbolic data
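
TDMiner is proprietary and its algorithm is not described in these slides, so the following is not its implementation. It is a small sketch of one frequency notion from the frequent-episode literature: counting occurrences of an ordered event sequence (a serial episode) left to right, without letting occurrences overlap. The function names and the example event stream are assumptions.

```python
from itertools import permutations


def count_non_overlapped(events, episode):
    """Greedy left-to-right count of a serial episode; gaps are allowed."""
    count, pos = 0, 0
    for event in events:
        if event == episode[pos]:
            pos += 1
            if pos == len(episode):   # one full occurrence seen
                count += 1
                pos = 0               # start looking for the next occurrence
    return count


def frequent_episodes(events, length=2, min_count=2):
    """Return every ordered episode of distinct symbols seen at least min_count times."""
    symbols = sorted(set(events))
    result = {}
    for episode in permutations(symbols, length):
        c = count_non_overlapped(events, episode)
        if c >= min_count:
            result[episode] = c
    return result


events = list("ABCABDAB")   # hypothetical symbolic event stream
print(frequent_episodes(events))
```

Enumerating all permutations is only workable for a handful of symbols; practical miners grow candidate episodes level by level from the frequent shorter ones.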

For More Info:
• 4th Workshop on Temporal Data Mining: Network Reconstruction from Dynamic Data
  – http://www.kdd2006.com/workshops.html
• Laxman, Sastry and Unnikrishnan. “Discovering Frequent Episodes and Learning Hidden Markov Models: A Formal Connection.” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1505-1517, 2005.

Network Reconstruction
• How to determine directed, acyclic graphs from sequential event data
[Figure: example graph with nodes labeled x, z, a, n, p, g]

Machine Translation (MT)
• Free, web-based tools are not state-of-the-art
  – e.g. http://babelfish.altavista.com/
• LanguageWeaver uses statistical MT
  – Spin-off of USC Information Sciences Institute
  – www.languageweaver.com

Hypothesis
• Effective Content Extraction rules can be custom-developed for raw machine-translated text.