Visually Exploring Patent Collections for Events and Patterns

Post on 26-Jun-2015

My talk on patent visualization at the 3rd IEEE Workshop on Interactive Visual Text Analytics. The primary focus is to introduce the scalable visual analytics research that my team is working on. The workshop paper can be found at: http://vialab.science.uoit.ca/textvis2013/papers/Ankam-TextVis2013.pdf

Transcript of Visually Exploring Patent Collections for Events and Patterns

Visually Exploring Patent Collections for Events and Patterns

Derek X. Wang

Associate Director of the Charlotte Visualization Center

Together with: Wenwen Dou, Wlodek Zadrozny, Suraj Ankam, Debbie Strumsky, Terry Rabinowitz

Value

Businesses

• 800 patents: $1 billion worth of patents from AOL to Microsoft

• 1,100 patents from Kodak: $525 million for a group license

• 17,000 patents: $12.5 billion, Motorola Mobility to Google

Technology

[Figure: topics over time, 2006 to 2010; annotation: FODAVA]

Dataset: 123 publications from the VAST proceedings, 2006-2010.

Cyan topic: variable uncertainty trend correlation linear multivariate sensitivity

Blue topic: dimension quality cluster measure lda attribute reduction projection

**X. Wang et al., ParallelTopics: A probabilistic approach to exploring document collections, IEEE VAST 2011

Goal

• Can we spot an emerging new technology?

• Text mining and visualization

• Can we spot novelty within a patent?

• How much do claims differ from class descriptions?

• How much do claims differ from claims in other similar patents?

• Can we list “all” patents relevant for some technology? (And what does that mean?)

A Robust and Scalable Patent Analysis Infrastructure Is Needed

Visual Analytics Will Play a Key Role

Human + Computer = Balanced Analytics Technology

Challenge

Unstructured or semi-structured

Highly heterogeneous

Leading to highly heterogeneous models

Incomplete or with holes

With intrinsic uncertainty (and in some cases deception)

Inside and outside the enterprise

Containing detailed time and space information

Research

Structuring the Unstructured: Topic Modeling

• Latent Dirichlet Allocation (LDA)

• Reveals latent topics from a large textual corpus

• Coherent sets of most likely words describe topics

• Topics defined by keyword groups

• Topics in text collections can be inferred effectively
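To make the topic-modeling bullets concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. It is not the implementation used in the talk. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # total words per topic
    assign = []                                          # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        assign.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Take the token out of the counts before resampling its topic.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions per document and top keywords per topic.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    keywords = [
        [w for w, c in counts.most_common(3) if c > 0] for counts in topic_word
    ]
    return theta, keywords


# Hypothetical toy "abstracts", already tokenized.
docs = [
    "antenna signal receiver antenna signal".split(),
    "signal receiver antenna transmitter".split(),
    "software storage memory software".split(),
    "storage memory software application".split(),
]
theta, keywords = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(keywords):
    print(f"topic {k}: {words}")
```

Each row of `theta` is one document's mixture over topics, and each entry of `keywords` is the keyword group that describes a topic. A real corpus would need a tuned number of topics and far more sampling iterations.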

Structuring the Unstructured: Investigative Element Extraction

• Recognition of entities including people, locations, buildings, and organizations.

• Recognition of times and dates.

• Construction of a near-real-time analysis pipeline for entity association.

Structuring the Unstructured: Event Structuring

Events: Meaningful occurrences in space and time

Motivating Event

Particular Topic Stream

Narrative: a series of clustered (event-based) stories, temporally linked based on content similarity.
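One way to read the narrative definition is as a chain of events in which each event links back to the most similar earlier one. The sketch below assumes bag-of-words cosine similarity and a fixed threshold. The helper names, the threshold value, and the toy events are hypothetical, and the talk does not specify how similarity is actually measured.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def link_events(events, threshold=0.3):
    """Link each event to its most similar strictly earlier event.

    events: list of (event_id, time, text) tuples.
    Returns a list of (earlier_id, later_id, similarity) links.
    """
    ordered = sorted(events, key=lambda e: e[1])
    bags = [Counter(text.lower().split()) for _, _, text in ordered]
    links = []
    for j in range(1, len(ordered)):
        best_i, best_sim = None, threshold
        for i in range(j):
            if ordered[i][1] < ordered[j][1]:
                sim = cosine(bags[i], bags[j])
                # Ties go to the more recent candidate.
                if sim >= best_sim:
                    best_i, best_sim = i, sim
        if best_i is not None:
            links.append((ordered[best_i][0], ordered[j][0], round(best_sim, 3)))
    return links


# Hypothetical events: (id, year, descriptive text).
events = [
    ("e1", 2006, "touch screen phone display"),
    ("e2", 2007, "touch screen phone software storage"),
    ("e3", 2008, "phone software storage application"),
    ("e4", 2008, "bee hive queen colony"),
]
links = link_events(events)
print(links)
```

Following the links from the latest event backwards yields one narrative. Events with no sufficiently similar predecessor start a new one.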

Results

Can we spot an emerging new technology?

Data: 50,000 telecommunication patents from the past 10 years; abstract text and patent meta-information; 1.5 GB of raw patent documents

Methods: Topic modeling and visualization

Results: We can see a significant change in the topic of “software and storage” in communication around 2007 (corresponding to the Apple iPhone?)

Can we spot an emerging new technology?

Model:

• 100 topics

• Each topic a distribution on words

• Each abstract a combination of topics

Note: Width of the graph is proportional to the number of patents and the number of words from a particular topic (topic signal strength). The number of class 455 patents grew from 2234 in 2005 to 7647 in 2012.

**W. Dou et al., HierarchicalTopics: Visually Exploring Large Text Collections Using Topic Hierarchies, IEEE VAST 2013
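The note on graph width suggests a per-year aggregate of topic signal strength. A minimal sketch, assuming strength is the expected number of words a patent draws from a topic (word count times topic proportion), summed over the patents filed in a year. The function name and the sample records are hypothetical.

```python
from collections import defaultdict


def topic_strength_by_year(patents, num_topics):
    """Sum, per year and topic, the expected number of words from that topic.

    patents: iterable of (year, word_count, topic_proportions).
    """
    strength = defaultdict(lambda: [0.0] * num_topics)
    for year, word_count, proportions in patents:
        for k, p in enumerate(proportions):
            strength[year][k] += word_count * p
    return dict(sorted(strength.items()))


# Hypothetical records: (filing year, abstract length, topic mixture).
patents = [
    (2005, 100, [0.75, 0.25]),
    (2007, 120, [0.5, 0.5]),
    (2007, 80, [0.25, 0.75]),
]
streams = topic_strength_by_year(patents, num_topics=2)
for year, widths in streams.items():
    print(year, widths)
```

Under this reading, a stream widens either because more patents are filed in a year or because those patents devote more of their words to the topic.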

Typical Keyword: “transistor”

Emergent: “storage, software, …”

Can we spot novelty within an existing patent?

Data

Initially: A random sample of 40 patents in several classes with focus on 455 (telecom).

Recently: Confirmed through automated analysis of several subclasses of 455.

Method: Compare words in claims with words in class plus subclass definition

Results: Large symmetric differences

Words(Claims) ÷ Words(Definition)

Words(Abstracts) ÷ Words(Definition)
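Assuming the ÷ on this slide denotes the set symmetric difference, which would be consistent with the stated result, the comparison for claims can be written as follows. Here W(·) is our own shorthand for the set of distinct words in a text. The abstract case is analogous.

```latex
\[
W(\text{Claims}) \mathbin{\triangle} W(\text{Definition})
  = \bigl(W(\text{Claims}) \setminus W(\text{Definition})\bigr)
  \cup \bigl(W(\text{Definition}) \setminus W(\text{Claims})\bigr)
\]
```

The first part collects words that the claims use but the definitions do not, and the second part collects definition words that the claims never use.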

Example

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&p=1&u=%2Fnetahtml%2Fsearch-bool.html&r=2&f=G&l=50&d=pall&s1=449%2F8.CCLS.&OS=CCL/449/8&RS=CCL/449/8

Patent Title: Process for rearing bumblebee queens and process for rearing bumblebees

Main Classification: 449/1; 449/2; 449/8

Class 449 – Bee Culture / Subclass 1

Class 449 – Bee Culture / Subclass 8

We claim:

1. A process for rearing bumblebee queens (genus Bombus) comprising generating a colony with workers in the presence of fertilized eggs and/or larvae from at least one colony, in a room with a controlled climate provided with food, and allowing the colony to grow until bumblebee queens are produced, wherein subadult and/or adult workers that originate from at least one different colony are brought together with said fertilized eggs and/or larvae.

2. The process according to claim 1, wherein the workers that originate from said at least one different colony are brought together with a young colony in the eusocial phase, consisting of a fertilized queen, brood and the first born workers.

3. The process according to claim 1, wherein more than 100 workers are brought together.

4. The process according to claim 1, wherein rearing is carried out using a workers:fertilized eggs ratio of 0.5-4.

5. The process according to claim 1, wherein the workers originating from said at least one different colony are first kept in a room without any queen and without brood for one day.

6. The process according to claim 1, wherein brood and workers from different bumblebee species are brought together.

7. A process for rearing bumblebees (genus Bombus), comprising rearing bumblebee queens by generating a colony with workers in the presence of fertilized eggs and/or larvae from at least one colony, in a room with a controlled climate provided with food, and allowing the colony to grow, wherein subadult and/or adult workers that originate from at least one different colony are brought together with said fertilized eggs and/or larvae, and using said bumblebee queens for rearing bumblebees.

Subclass Nesting (Class 449)

1 -> Class Definition

8 -> 7 -> 3 -> Class Definition

Class Name: Bee Culture

Class Definition: This class includes the methods of and structures for propagating, raising and caring for bees; as well as certain ancillary methods and structures.

Class 449, Subclass 1

Subclass Name: Method

Subclass Definition: This subclass is indented under the class definition. Process.

Class 449, Subclass 8

Subclass Name: Queen Raising

Subclass Definition: This subclass is indented under subclass 7. Structure with provision to encourage and care for the production of a bee larvae into a queen bee.

Words in class / subclass definitions found in patent claim

Word         Count in claims
method       0
process      7
queen        6
colony       11
culture      0
propagate    0
raise        0
encourage    0
care         0
larvae       4
production   1
bee          7
multi        0
swarm        0
capture      0
house        0
hive         0
structure    0

Words in claim that were not in definitions

Word         Count in claims
rearing      5
worker       10
egg          5
fertilize    6
climate      2
food         2
different    5
control      2
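The two tables above can be approximated with a simple word count. The sketch below is a simplification under stated assumptions. It uses only the class and subclass definitions quoted in this transcript and the opening of claim 1. The stopword list and helper names are hypothetical. It also performs no stemming, whereas the counts on the slides suggest word forms were normalized (for example "fertilize" and "worker"), so its numbers will not match the slides.

```python
import re
from collections import Counter

# Hypothetical stopword list, just large enough for this example.
STOPWORDS = {"a", "an", "and", "as", "for", "in", "into", "is", "of", "or",
             "the", "this", "to", "under", "well", "with"}


def word_counts(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return Counter(w for w in re.findall(r"[a-z]+", text.lower())
                   if w not in STOPWORDS)


def compare(claims, definitions):
    """Split word counts by whether the definitions also use the word."""
    claim_counts = word_counts(claims)
    definition_words = set(word_counts(definitions))
    # Definition words with their count in the claims (0 when absent).
    shared = {w: claim_counts[w] for w in definition_words}
    # Claim words that never occur in the definitions.
    novel = {w: c for w, c in claim_counts.items() if w not in definition_words}
    return shared, novel


definitions = (
    "This class includes the methods of and structures for propagating, "
    "raising and caring for bees. Process. Structure with provision to "
    "encourage and care for the production of a bee larvae into a queen bee."
)
claims = (
    "A process for rearing bumblebee queens comprising generating a colony "
    "with workers in the presence of fertilized eggs and/or larvae"
)
shared, novel = compare(claims, definitions)
print("definition words used in claims:", {w: c for w, c in shared.items() if c})
print("claim words not in definitions:", sorted(novel))
```

Because subclasses 7 and 3 are not quoted in the transcript, words they contribute to the definition side are treated as novel here. Adding their text to `definitions` and applying a stemmer would bring the output closer to the tables.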

Can we spot novelty within an existing patent?

Observations

• Novelty is in words/relations that are not part of the definition (but appear in the patent claims or its abstract)

• Some things can be left unsaid. Is there a boundary?

• Happens in all patents (but the degree varies)

Can we spot novelty within an existing patent?

Next

• Opportunity to text mine these differences

– Are they random on a time scale?

– Would descriptions of emerging technologies emerge from these patterns?

– Do combination patents have more of these?

Can we list “all” patents relevant for some technology?

– Data: Patents, Wikipedia

– Potential Data: Cell phone manuals or other descriptions

– Method: Text mining of patents in certain classes, text mining of filings by certain market/technology players, text mining of other patents, using Wikipedia and manuals as guidance on what to look for.

Scale

Scalable Computing Architecture for Extracting Latent Topics and Events*

**X. Wang et al., I-SI: Scalable Visual Analytics Architecture for Analyzing Latent Topical-Level Information From Social Media Data, Journal of Computer Graphics Forum, 2012

Distributed Data Storage and Pre-Processing Environment

MapReduce procedures for data cleaning and pre-processing

Distributed storage solution (MongoDB) used for data storage, analysis, and retrieval

MapReduce-based social media crawlers for Twitter, blogs and news articles:

Unstructured Contents: Textual Information, Image, Comments

Structured Contents: User Graph, Geo-tags, HashTag
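The MapReduce pattern behind the cleaning and pre-processing step can be illustrated in a single process. This sketch does not use Hadoop or MongoDB, and the cleaning rules, function names, and sample documents are assumptions. It only shows how a map phase, a grouping step, and a reduce phase divide the work.

```python
import re
from collections import defaultdict
from functools import reduce


def map_phase(doc_id, text):
    """Clean one raw document and emit (term, 1) pairs."""
    cleaned = re.sub(r"<[^>]+>", " ", text).lower()  # strip markup remnants
    for term in re.findall(r"[a-z]{2,}", cleaned):   # keep words of 2+ letters
        yield term, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, values):
    """Combine all counts emitted for one term."""
    return term, reduce(lambda a, b: a + b, values)


# Hypothetical raw documents keyed by id.
raw_docs = {
    "doc1": "<p>Mobile device with storage</p>",
    "doc2": "Storage software for a mobile DEVICE",
}
pairs = (pair for doc_id, text in raw_docs.items()
         for pair in map_phase(doc_id, text))
counts = dict(reduce_phase(term, values)
              for term, values in shuffle(pairs).items())
print(counts)
```

Since each call to `map_phase` depends on one document only and each call to `reduce_phase` on one term only, both phases can be spread across machines, which is what makes the pattern suitable for collections of this size.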

Parallel Data Analytics Cluster

MPI-based Parallel-LDA implementation for topic modeling with memory-sharing optimization

OpenNLP-based parallel implementation for entity extraction

Customized PBS to schedule jobs for the parallel computing environment

News Briefing App

Resources we’d be happy to share

• Complete US patents and applications (until Q1 2013) with a search engine (Lucene) interface

• Patent classes

• Other text resources (Wikipedia, Wiktionary, etc.)

We’d be happy to prepare specialized extracts or combinations for those who need them.

Thank you!

Derek Xiaoyu Wang xiaoyu.wang@uncc.edu

News Briefing App @News_Briefing

Now FREE at App Store