
Inventor Disambiguation for Patents filed at USPTO, courses.cecs.anu.edu.au/courses/CSPROJECTS/13S1/Reports/... (2014-02-18)

School of Computer Science

College of Engineering and Computer Science

Inventor Disambiguation for Patents filed at USPTO

Swapnil Mishra - u5053816

COMP8740 - Artificial Intelligence Project

Supervisor: Dr Wray Buntine

May 31, 2013


Abstract

Patents filed at the USPTO carry no consistent, unique identifiers for inventors. Identifying inventors reliably is an important task that enables analysis of individual contributions and trends across fields of technology. Here we present a study of various existing approaches and a semi-supervised way of achieving disambiguation. We demonstrate that these purpose-built approaches can be replaced with simple machine learning algorithms such as Logistic Regression and Support Vector Machines.


Acknowledgments

First of all, I would like to thank my supervisor Dr Wray Buntine for providing his constant guidance, motivation and time for the completion of my project. A note of thanks is reserved for the project coordinator, Dr Weifa Liang, for organizing community meetings every week. I will also take this opportunity to extend my gratitude to Neil Bacon and Doug Ashton from NICTA for very insightful discussions. I cannot thank my family and friends enough for being a constant source of help and courage. I extend my gratitude to Anish and Jeff, my colleagues at ANU, for discussions in times of need. I take this opportunity to thank my friends Shivani and Rajveer for keeping my belief alive and guiding me through my times of struggle. Last but not least, I would like to dedicate this work to my colleagues Surabhi, Akshay, Prajakta and Tanmay at the MIT Robocon lab for teaching me to always keep chasing my passion.


Contents

1 Introduction
    1.1 Overview
    1.2 Contribution
    1.3 Structure

2 Background
    2.1 Existing Work on Name Disambiguation
        2.1.1 Work on other Data-Sets
        2.1.2 Work on Inventor Disambiguation

3 Overview
    3.1 Semi-Supervised Algorithm
    3.2 Dataset
    3.3 Goals
        3.3.1 Broader Objectives
        3.3.2 Project Goals
    3.4 Information Landscape

4 Methods
    4.1 Data Pre-processing
    4.2 Blocking
    4.3 Similarity Profile
    4.4 Training Set Creation
    4.5 Classification
    4.6 Existing Algorithm

5 Results and Evaluation
    5.1 Classifier Accuracy
    5.2 Illustrations
    5.3 Evaluation Metrics
    5.4 Data Analysis
    5.5 Splitting and Lumping Results

6 Conclusion and Future Work


Chapter 1

Introduction

1.1 Overview

The United States Patent and Trademark Office (USPTO) releases its patent data every month in the public domain. This has led to a surge of interest in using this data for trend analysis.

Inventor disambiguation is the problem of identifying whether two given inventor-patent instances belong to the same inventor. The need for disambiguation is accentuated by the fact that the USPTO does not have any unique or consistent identifiers for inventors [9]. A disambiguated USPTO database would help identify how researchers collaborate and what the possible upcoming areas for new invention are.

Some work has been done in this field, but it has been restricted to building complex, purpose-specific systems. As a result, these approaches are either not scalable or not viable to implement outside academic settings. Almost all of these methods rely on manually curated weighting schemes or decision rules [14].

Here we present an analysis of Lai et al. (2011), a semi-supervised algorithm for inventor disambiguation [9]. An approach using simple machine learning classification algorithms is also presented. We show that using unlabeled data to create pseudo-labeled records, combined with Logistic Regression and Support Vector Classification, is an effective technique. This technique has the distinct advantage of being scalable and general. Moreover, we revisit the problem from an information perspective to understand what


other databases could be used to support the task.

1.2 Contribution

The contributions of this project are:

• Use of Logistic Regression and Support Vector Machines in inventor disambiguation.

• New information that can be used to formulate how to analyze the patent-inventor database.

1.3 Structure

The report is structured as follows. Chapter 2 describes the background of the field of disambiguation and earlier work. Chapter 3 gives an overview of the framework considered and the goals of the project. Chapter 4 covers the methodology used to develop the system. Chapter 5 presents the results obtained and the evaluation performed. Chapter 6 concludes with pointers to possible future work.


Chapter 2

Background

Record linkage is the field of matching records in databases. In general, bipartite record linkage matches a unique record in one database to at most one record in a different database. In the scientific community, disambiguation is considered a sub-field of record linkage: it is the problem of linking records within a single database.

The differences between disambiguation and record linkage are subtle but intriguing. Disambiguation works with the same record fields, whereas in record linkage this does not hold; it is very common in record linkage that, for example, one database has fields for a person's name and date of birth while the other has fields for first name and address. At the same time, the basic assumption of matching one record with at most one record in the other database does not hold in disambiguation: it is very common to have tens of records matched against a single entity. This makes it difficult to extend various record linkage frameworks to disambiguation. In spite of all this, what lies at the heart of both processes is a pairwise comparison of similarity scores between two records, which are then labeled accordingly as 'match' or 'non-match'.

USPTO data serves as a good source for inventor name disambiguation because it contains data from such varied fields and the number of patents is itself huge and ever growing. Inventor name disambiguation for patents can lead to interesting results: analysis of how people collaborate among their peers, what the future trends are, which areas of research are active topics, what an individual's contribution to a particular field is, and how people collaborate across technologies [9].


There are four major challenges for name disambiguation [11]:

• Some people write under different names or variants due to name changes, spelling variations, marriage and the use of pen names.

• Many different people share the same name, especially very common names like John Smith.

• Metadata containing name information is incomplete. At times it shows the full name and at other times just initials.

• Generally, documents are multi-authored and interdisciplinary.

2.1 Existing Work on Name Disambiguation

This section describes some existing work done in name disambiguation. Special emphasis is placed on existing work on inventor disambiguation.

2.1.1 Work on other Data-Sets

Significant work has been done in the field of name disambiguation for various digital libraries and databases. The most notable are the following.

K-Way

K-way is an unsupervised learning approach to disambiguating authors in citations. It uses three citation attributes as vector components: co-author names, paper titles and publication venue titles. These attributes are then processed with the help of Laplacian matrices, and K-way spectral clustering is used for disambiguation [7].

A unified framework for name disambiguation

Tang et al. provide a unified framework for the name disambiguation problem. They consider five different features for resolving ambiguity: co-conference, co-author, citation, constraints and τ-CoAuthor. If two papers share a conference or journal they have a co-conference relationship; the same secondary author indicates a co-author relationship. If one paper cites the other, they have a citation relationship. The τ-CoAuthor relationship is explained as follows: if paper X has two authors A and B, paper Y has authors


A and C, and there exists a paper Z with authors B and C, then papers X and Y have a 2-CoAuthor relationship. Their framework is based on Hidden Markov Random Fields, which build dependencies among papers (observations). Disambiguation is performed with the help of the Bayesian Information Criterion to estimate the right number of distinct authors [12].

Author-ity Project

Author-ity is the work of Torvik et al. on disambiguating authors in the Medline database. They introduce a semi-supervised classification algorithm for disambiguation. Their main idea is to extract training sets for both match and non-match pairs automatically from the data. For record pair comparisons and subsequent clustering they introduced the concept of an 'r-value', which is the relative frequency of a similarity profile in the match set compared to the non-match set [13]. They then used logistic regression to classify the authors [11].

2.1.2 Work on Inventor Disambiguation

Inventor disambiguation for patents is a relatively new field. Here we discuss the three major works done to date.

Fleming et al 2007

Fleming et al. (2007) present early work in the field of inventor disambiguation. Their algorithm consists of exact string matching techniques, followed by a simple algorithm built on 'if-then-else' decision rules. Though this technique inherently suffers from systematic errors, it paved the way for the scientific community to look into this field and produced some interesting trends about the location-technology relationship. They subsequently developed an updated algorithm in 2009 [8], which could handle typographical errors and name variations and was based on similarity scores, field weights and matching thresholds [5].

Lai et al 2011

The Lai et al. (2011) algorithm is a Bayesian semi-supervised approach to disambiguating inventors. It is mainly an adaptation of the Torvik-Smalheiser (2009) author disambiguation algorithm [11]. Feature vectors are divided into two sets of independent features: name features and patent features. Each set is fixed such that it divides records into those that are almost certainly matches or


non-matches. A similarity profile is created for each pair to build a likelihood-ratio database for each profile. Whenever two records need to be compared, their prior match probability is calculated first, then multiplied by the likelihood ratio corresponding to the similarity profile obtained from a field-wise comparison of the records.

All records that score above a particular threshold are then combined to form a cluster with a representative. Clusters are compared to each other by first comparing their representatives; only if this comparison passes a certain threshold are the inner elements compared. If the comparison yields a score greater than the threshold, the clusters are merged. This step continues until no more clusters are left to merge. The whole procedure is then repeated with progressively more relaxed data partitioning schemes, using the output of the previous pass as the base data for the next [9].

Ventura et al 2012

The Ventura et al. (2012) algorithm uses supervised classification. It starts by identifying and labeling inventor records for 281 unique CV inventors. A classification model known as a Conditional Forest of Random Forests (Conditional FoRF) is then built to learn from the labeled data. The classifier is built by first splitting the pairwise comparison data into multiple groups based on conditions such as inventor country, name frequency and the number of patents an inventor has in the labeled data. A random forest is then trained within each group. Finally, all random forests are saved for predicting the unlabeled pairwise data [14].


Chapter 3

Overview

This project involved analyzing and formulating ideas and techniques, and establishing goals, for inventor disambiguation. This chapter gives an overview of the framework considered and the goals formulated.

3.1 Semi-Supervised Algorithm

A significant amount of related work was reviewed to understand the details of disambiguation. This analysis led to the conclusion that the best approach to disambiguating inventors in patents would be a pseudo-supervised classification system, because of the lack of labeled data relating patents to their inventors. Moreover, even if a labeled set were obtained, it would represent only a very small portion of the actual data and is best used for evaluation and validation purposes. We still need an algorithm that can use this large amount of unlabeled information to our advantage.

Lai et al. (2011) has these advantages and hence was chosen as the algorithm to be analyzed and studied at length [9]. Their algorithm can be summarized as follows:

• Collect data from USPTO XML and NBER, and preprocess it to extract fields like inventor first name, middle name, last name, assignee name, assignee type, inventor address, technology class and co-author names.

• Partition the data using blocking rules (see Section 4.2) based on inventor names.

• The feature vector consists of seven fields, divided into two independent feature sets: inventor name features (first name, middle initials, last name) and patent features (author address, assignee, technology class, co-authors).

• Sets of record pairs are created automatically for each feature set by conditioning them such that a pairwise comparison of two records almost certainly indicates a match or non-match.

• An r-value is then calculated for each similarity profile: the likelihood ratio of the similarity profile being found in the match set versus the non-match set [13].

• To compare two records, their prior match probability within the block is calculated and then multiplied by the r-value corresponding to the similarity profile obtained from a field-wise comparison. All string comparisons are done with the Jaro-Winkler method [15].

• All records that score above a particular threshold are combined to form a cluster with a representative. Clusters are compared to each other by first comparing their representatives; only if this passes a certain threshold are the inner elements compared. If the comparison yields a score greater than the threshold, the clusters are merged. This step continues until no more clusters are left to merge.

• They define a sequence of seven blocking rules, which means the above steps are performed seven times in total [9].
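The r-value step of this summary can be made concrete with a small sketch. The profile encoding, counts and add-one smoothing below are invented for illustration; the actual algorithm computes likelihood ratios from its automatically created match and non-match sets:

```python
from collections import Counter

def r_values(match_profiles, nonmatch_profiles, smoothing=1.0):
    """Estimate r(profile) = P(profile | match) / P(profile | non-match)
    from counted similarity profiles, with add-one smoothing so that
    unseen profiles still get a finite ratio (the smoothing is an
    illustrative assumption, not part of the published method)."""
    m_counts = Counter(match_profiles)
    n_counts = Counter(nonmatch_profiles)
    profiles = set(m_counts) | set(n_counts)
    m_total = len(match_profiles) + smoothing * len(profiles)
    n_total = len(nonmatch_profiles) + smoothing * len(profiles)
    return {
        p: ((m_counts[p] + smoothing) / m_total)
           / ((n_counts[p] + smoothing) / n_total)
        for p in profiles
    }

# Invented similarity profiles: (name similarity bin, shared co-authors)
matches = [(5, 2), (5, 2), (5, 1), (4, 2)]
nonmatches = [(1, 0), (2, 0), (1, 0), (2, 1)]
r = r_values(matches, nonmatches)
# Profiles common in the match set get r > 1, and vice versa.
print(r[(5, 2)] > 1.0, r[(1, 0)] < 1.0)
```

A record-pair score is then obtained by multiplying the block's prior match probability by the r-value of the pair's profile, as the bullet above describes.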

3.2 Dataset

The dataset used for the experiments is the invpat file from the public Patent Network Dataverse made available by Harvard Business School [10]. This dataset has 8 million records in the form of a CSV file, containing inventor-patent instances for utility patents granted by the USPTO up to the year 2011.

Each row in the file belongs to an inventor associated with a patent. If an original patent grant had three inventors, there will be three instances of that patent number, each associated with a different individual. All inventor and patent attributes, such as name, address, patent number, grant date and patent class, are present as columns in each row. This dataset was chosen because it eliminated the need to pre-process the XML that the USPTO releases as patent applications, and because it integrates some address and assignee data from secondary sources like NBER [6].

3.3 Goals

This section describes the broader objectives and project goals for disambiguation.

3.3.1 Broader Objectives

The main objectives for developing a framework for inventor disambiguation are summarized as follows:

• Use a semi-supervised algorithm that can handle both labeled and unlabeled data to optimize learning for the classifier.

• The system should be robust enough to handle the whole patent dataset, comprising more than 10 million records in the USPTO.

• It should be able to use other data sources, like patent families and citations, to gain extra information before making a decision.

• The algorithm should be general enough to incorporate other patent databases, like the European Union database.

These are the broader objectives for a commercial-quality inventor disambiguation system.

3.3.2 Project Goals

Before these objectives can be realized, a number of smaller goals have to be achieved. These goals focus on understanding what is currently available and how it can be improved. The goals of the project, as opposed to the broader objectives above, are summarized as:

• Get Lai et al. (2011) working as a base model [9].

• Investigate whether this can be replaced with simple machine learning classification algorithms.

• Build a naive sub system that works with a smaller data set.

• Evaluate the performance of the built system against Lai et al. [9].


• Come up with new features and information that can be used to build a future system that better models the information flow in the patent-inventor database.

Note: we are not currently using the entire database when creating clusters for disambiguation. The goal is to evaluate a new system, not to complete its development.

3.4 Information Landscape

As stated in Section 3.3.2, one of the objectives of the project is to understand how the flow of information takes place in an inventor-patent database. A series of discussions with people at Patent Lens and analysis of the currently available data led to the conclusion that the following features would be very helpful in modeling the disambiguation engine:

Co-Authorship Modeling: model the co-author relationship in such a way that indirect relationships between inventors are also part of the features. This will represent the full interaction network of an inventor.

Patent Families: the patent families data provided by the Patent Lens group should be used as part of the blocking key (see Section 4.2 for details). This will help restrict comparisons to better potential matches.

New Patent Classification System: the newer classification system for patents is more general and robust and will yield better information about the technology class of a patent. Better distance measures between classification codes are also needed.

Academic Citations: there are secondary databases for citations and articles, like Crossref1, which can be used to provide a stronger truth value for comparisons.

Time Span: use time span as a feature when making decisions; an address change over a long period of time can safely be ignored, but over a short period it should be very informative.

1http://www.crossref.org/


Chapter 4

Methods

As the only unique identifiers in the patent data are for the patents themselves, studying the patents of a single inventor is a challenge. The unavailability of unique identifiers for inventors creates the dual problem of identifying all work done by an inventor and, consequently, separating out work done by other inventors with the same name. This process of correctly attributing work to distinct inventors is known as disambiguation [9].

To achieve this goal, our framework is built as shown in Figure 4.1. The individual steps of pre-processing, blocking, training set creation and

Figure 4.1: Framework for disambiguation

classification are described in detail in the following sections. For the evaluation of Lai et al. (2011), we feed the output of our pre-processing step to


their algorithm too [9].

4.1 Data Pre-processing

The dataset lacked some important data fields that were identified as features to be used in disambiguation. Hence a round of pre-processing is required to extract those features.

The invpat data file does not contain a middle name field. Consequently, the first step is to generate a middle name field for each record. It is an important field required by the algorithm implemented by Lai et al. (2011), which forms our baseline research algorithm [9].

The next pre-processing step is to generate the co-authors for every inventor-patent instance, as this field is also not present in the dataset. The co-author field is one of the most important indicators when deciding match or non-match. This process involves iterating over all inventor-patent instances for a particular patent number and extracting the names from those instances. These names are then added to each instance, leaving aside the name present in that particular instance.

After this, only the columns that play a role in the disambiguation process are retained. The dataset created in this step is the dataset used by the disambiguation algorithms.
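The co-author generation step described above can be sketched as follows. The record layout and field names (`patent`, `name`, `coauthors`) are simplified placeholders; the real invpat file carries many more columns:

```python
from collections import defaultdict

def add_coauthors(records):
    """For each inventor-patent record, attach the names of all
    other inventors listed on the same patent."""
    by_patent = defaultdict(list)
    for rec in records:
        by_patent[rec["patent"]].append(rec["name"])
    for rec in records:
        # every name on the patent except this record's own inventor
        rec["coauthors"] = [n for n in by_patent[rec["patent"]]
                            if n != rec["name"]]
    return records

rows = [
    {"patent": "US123", "name": "J Smith"},
    {"patent": "US123", "name": "A Jones"},
    {"patent": "US123", "name": "B Lee"},
    {"patent": "US456", "name": "J Smith"},
]
add_coauthors(rows)
print(rows[0]["coauthors"])  # ['A Jones', 'B Lee']
print(rows[3]["coauthors"])  # []
```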

4.2 Blocking

Blocking is a technique used in record linkage to decrease the number of comparisons that have to be made between records, and hence the computational processing power required. The main idea behind blocking is to divide records into blocks whose members have a higher probability of matching [2].

We apply blocking to our dataset, where the blocking key is based on first name and last name: it comprises the first three characters of the first name and the first three characters of the last name. This ensures that whenever our algorithm performs a comparison, it is only among potential matches.
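A minimal sketch of such a blocking key. The field names and the lowercasing are illustrative assumptions, not the report's exact implementation:

```python
from collections import defaultdict

def blocking_key(record):
    """First three characters of the first name plus the first three
    of the last name (lowercased here for illustration)."""
    return (record["first"][:3].lower(), record["last"][:3].lower())

def block(records):
    """Group records so comparisons happen only within a block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    return blocks

recs = [
    {"first": "Jonathan", "last": "Smith"},
    {"first": "Jon", "last": "Smithson"},   # lands in the same block
    {"first": "Mary", "last": "Smith"},     # lands in a different block
]
blocks = block(recs)
print(len(blocks[("jon", "smi")]))  # 2 candidate records to compare
```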

15

Page 18: Inventor Disambiguation for Patents led at USPTOcourses.cecs.anu.edu.au/courses/CSPROJECTS/13S1/Reports/... · 2014-02-18 · Chapter 1 Introduction 1.1 Overview United States Patent

4.3 Similarity Profile

Not all fields are considered helpful for disambiguation. A selection of features yielded seven different fields that are used to create the feature vector for each inventor-patent instance: first name, last name, latitude and longitude (which jointly represent the address), technology class, assignee name, assignee class and number of co-authors. The selected fields fall into two sub-groups: inventor name attributes (first name and last name) and patent attributes (all the rest). These fields are used whenever we need to do a pairwise comparison of records. The similarity of the various fields is calculated as follows:

• Similarity for first name, last name and assignee name is calculated as a string edit distance, a number between 0 and 1, implemented using the Jaro-Winkler measure [15].

• Latitude and longitude are used to compute the similarity between the addresses given in the two records of a pair. The value is 0 if the countries differ; the remaining values are spaced from 1 to 5 depending on the calculated distance.

• Technology class and co-author similarities are defined as the number of exact matches of these features between the record pairs, capped at 4 for technology class and 6 for co-authors.
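The profile construction above might be sketched as follows. Since Jaro-Winkler is not in the Python standard library, difflib.SequenceMatcher is used here as a rough stand-in, and the geographic distance bins are invented; only the caps of 4 and 6 come from the text:

```python
import math
from difflib import SequenceMatcher

def string_sim(a, b):
    # Stand-in for Jaro-Winkler: a similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def geo_sim(a, b, thresholds=(1000, 500, 100, 10)):
    """0 if countries differ, otherwise 1 to 5 depending on distance
    (the km thresholds are invented for illustration)."""
    if a["country"] != b["country"]:
        return 0
    d = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
    return 1 + sum(1 for t in thresholds if d <= t)

def similarity_profile(a, b):
    return (
        string_sim(a["first"], b["first"]),
        string_sim(a["last"], b["last"]),
        string_sim(a["assignee"], b["assignee"]),
        geo_sim(a, b),
        min(4, len(a["classes"] & b["classes"])),      # capped at 4
        min(6, len(a["coauthors"] & b["coauthors"])),  # capped at 6
    )

a = {"first": "John", "last": "Ayres", "assignee": "Acme Corp",
     "country": "US", "lat": 40.7, "lon": -74.0,
     "classes": {"438", "257"}, "coauthors": {"A Jones"}}
b = {"first": "Jon", "last": "Ayres", "assignee": "Acme Corporation",
     "country": "US", "lat": 40.8, "lon": -73.9,
     "classes": {"438"}, "coauthors": {"A Jones", "B Lee"}}
print(similarity_profile(a, b))
```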

4.4 Training Set Creation

The training set is created from unlabeled data using a technique based on an idea of Torvik et al. (2005) [13]. The basic idea is to select record pairs that are almost certain to belong to the same inventor and, likewise, pairs that are almost certain to be from different inventors. While creating the training sets we have to keep the following aspects in mind:

• both match and non-match sets should be adequately represented;

• the training set should cover almost all possible similarity profiles that may occur in the test data;

• training set creation should avoid producing profiles that are either very common or too rare; and


• training pairs should have similarity profiles such that a match or non-match label is derived from both the inventor name and patent attribute sets.

All the above constraints are satisfied by using the method defined in Table 4.1. For the rare-name criterion we select only names that are not in the common names list produced by the United States Census Bureau and that occur at least 3 times in the whole patent database.

Patent attributes:
- Match set: pairs of records where the inventor name matches exactly and the name is rare (occurrence count ≥ 2).
- Non-match set: pairs of records where the inventor name is fully different and both names are rare (occurrence count ≥ 2).

Inventor name attributes:
- Match set: pairs of records in the same block sharing 2 or more co-authors and the technology class.
- Non-match set: pairs of records from the same patent.

Table 4.1: Training set creation [13]
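The selection rules of Table 4.1 can be sketched roughly as below. This is a simplified, combined version: the actual method builds separate training sets for the patent attribute and inventor name attribute families, and the rare-name test and record layout here are placeholders:

```python
from itertools import combinations

def pseudo_labeled_pairs(records, rare_names):
    """Select near-certain match and non-match record pairs,
    in the spirit of Table 4.1 (simplified: the real method
    keeps separate sets per attribute family)."""
    match, nonmatch = [], []
    for a, b in combinations(records, 2):
        same_name = a["name"] == b["name"]
        both_rare = a["name"] in rare_names and b["name"] in rare_names
        shared_co = len(a["coauthors"] & b["coauthors"])
        if (same_name and both_rare) or (shared_co >= 2
                                         and a["class"] == b["class"]):
            match.append((a, b))       # near-certain same inventor
        elif (not same_name and both_rare) or a["patent"] == b["patent"]:
            nonmatch.append((a, b))    # near-certain different inventors
    return match, nonmatch

recs = [
    {"patent": "US1", "name": "Zyla Qort", "class": "438",
     "coauthors": {"X", "Y"}},
    {"patent": "US2", "name": "Zyla Qort", "class": "438",
     "coauthors": {"X", "Y"}},
    {"patent": "US3", "name": "Vex Plomb", "class": "702",
     "coauthors": {"Z"}},
]
rare = {"Zyla Qort", "Vex Plomb"}
matches, nonmatches = pseudo_labeled_pairs(recs, rare)
print(len(matches), len(nonmatches))  # 1 match pair, 2 non-match pairs
```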

4.5 Classification

Once we have blocked our data, the main remaining task is to classify each pairwise comparison of records within a block as either a match or a non-match. The classifier is trained with the training data generated in Section 4.4.

As described in Section 3.3.2, the aim was to test simple machine learning algorithms for the task rather than a purpose-built system like Lai et al. (2011). We decided to use the following two classification algorithms:

• Support vector machine

• Logistic regression

For both classification techniques we used the implementations provided by the libsvm and liblinear libraries [3, 4]. The choice was restricted to these two algorithms because we wanted techniques that work well with large amounts of data and generalize well.
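As an illustration of the classification step, here is a minimal, self-contained stand-in for L2-regularized logistic regression trained by gradient descent on similarity profiles (the report itself uses the liblinear implementation; the toy profiles below are invented):

```python
import math

def train_logistic(profiles, labels, l2=0.01, lr=0.1, epochs=500):
    """Fit weights w and bias b by gradient descent on the
    L2-regularized logistic loss. Each profile is a similarity
    vector; each label is 1 (match) or 0 (non-match)."""
    n = len(profiles)
    w = [0.0] * len(profiles[0])
    b = 0.0
    for _ in range(epochs):
        gw = [l2 * wi for wi in w]   # gradient of the L2 penalty
        gb = 0.0
        for x, y in zip(profiles, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted match probability
            for i, xi in enumerate(x):
                gw[i] += (p - y) * xi
            gb += p - y
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_match(w, b, x):
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z)) >= 0.5

# Invented toy similarity profiles: [name similarity, shared co-authors]
X = [[0.95, 3], [0.99, 2], [0.90, 4],   # pseudo-labeled matches
     [0.30, 0], [0.45, 0], [0.20, 1]]   # pseudo-labeled non-matches
y = [1, 1, 1, 0, 0, 0]

w, b = train_logistic(X, y)
print(predict_match(w, b, [0.97, 3]))   # a very similar pair
print(predict_match(w, b, [0.25, 0]))   # a clearly different pair
```

In practice the liblinear and libsvm solvers handle this fitting at scale; the sketch only shows the shape of the inputs and outputs.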


4.6 Existing Algorithm

A significant amount of work was done to get the current code for Lai et al. (2011) [9] working, as it was required for evaluation and understanding. Their code is available at https://github.com/funginstitute/disambiguator. As their system is cumbersome and very specific, running it required a thorough understanding of their algorithm and implementation. This was compounded by the use of dependencies not easily available for free academic use. The earlier data cleaning and pre-processing work centered on getting the dataset ready to their specifications, which varied considerably from what was described in their paper.


Chapter 5

Results and Evaluation

5.1 Classifier Accuracy

A set of experiments was performed to investigate the use of SVM-style classifiers. To test the classifiers, training data was created using the approach for generating labeled data described in Section 4.4. Cross-validation was then used to verify the performance of each classifier [1]. The results are presented in Table 5.1.

Classifier                                 Accuracy (%)
C-SVC [3]                                  95.46
nu-SVC [3]                                 94.30
L2-regularized logistic regression [4]     87.13

Table 5.1: Initial classifier accuracy

As the results for C-SVC and nu-SVC are quite similar, we use C-SVC as the only support vector classification method in the rest of our analysis.
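The cross-validation check behind this comparison can be sketched as follows (again with synthetic features; the real vectors come from the pairwise record comparisons of Section 4.4):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Synthetic stand-in for pairwise similarity vectors and match labels.
X = rng.random((300, 5))
y = (X.sum(axis=1) > 2.5).astype(int)

for name, clf in [("C-SVC", SVC(C=1.0)),
                  ("L2 logistic regression", LogisticRegression())]:
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```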

5.2 Illustrations

Here we present some of the interesting individual results that have come out of the disambiguation process.

As seen in Figure 5.1, John Ayres has four records with two variants of his name. With the help of address data, the classifier predicts with very high confidence that they are the same person, as demonstrated by Figure 5.2.


Figure 5.1: John L Ayres data

Figure 5.2: John L Ayres clusters: numbers at the bottom are data points and numbers to the left indicate merge probabilities

In the second example, from Figure 5.3, Pie Zhong has four records with the same name but different addresses. Still, with the help of assignee and co-author information, the algorithm predicts with high confidence that they are the same person, as seen in Figure 5.4.

Figure 5.3: Pie Zhong Data

5.3 Evaluation Metrics

Once classifier accuracy was checked, the next step was to evaluate the disambiguation algorithm. The results of disambiguation are best described by the terms 'splitting' and 'lumping' [11]. Splitting happens when patents from the same inventor are classified as patents from different inventors.


Figure 5.4: Pie Zhong clusters: numbers at the bottom are data points and numbers to the left indicate merge probabilities

Lumping is described as assigning patents from different inventors to a single inventor.

The formulation of splitting and lumping used here differs slightly from that followed by Lai et al. (2011) [9]. They defined their statistics with respect to the assignment of records to the largest cluster representing an inventor. The approach followed here is to report these errors over all pairwise comparisons. The exact formulas defining splitting and lumping are those described by Ventura et al. [14]:

Splitting = (no. of comparisons incorrectly labeled as non-matches across all inventors) / (total no. of pairwise true matches)
          = no. of False Negatives / (no. of True Positives + no. of False Negatives)   (5.1)

Lumping = (no. of comparisons incorrectly labeled as matches across all inventors) / (total no. of pairwise true matches)
        = no. of False Positives / (no. of True Positives + no. of False Negatives)   (5.2)

Note that this means evaluation can proceed without performing the clustering step, making the test simpler.
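Since the metrics need only pairwise decisions, they can be computed directly from true and predicted match labels. A minimal sketch (with toy labels, not project data):

```python
def pairwise_errors(y_true, y_pred):
    """Count TP/FP/FN over pairwise same-inventor decisions.
    y_true, y_pred: sequences of 1 (match) / 0 (non-match), one per record pair."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn

def splitting(tp, fp, fn):
    # Eq. 5.1: true matches incorrectly labeled as non-matches.
    return fn / (tp + fn)

def lumping(tp, fp, fn):
    # Eq. 5.2: false matches, relative to the number of true matches.
    return fp / (tp + fn)

# Toy example: 10 pairwise decisions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
tp, fp, fn = pairwise_errors(y_true, y_pred)
print(f"splitting {splitting(tp, fp, fn):.2f}, lumping {lumping(tp, fp, fn):.2f}")
# → splitting 0.20, lumping 0.40
```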


5.4 Data Analysis

To evaluate the algorithm against these metrics, we use the set of hand-curated benchmark labeled data provided by Lai et al. (2011) [9]. This labeled data has records for 95 US-based researchers. The evaluation carried out is a manual process. The need for manual evaluation arises because the data in this hand-curated set is incomplete: it is missing many patent records for some inventors. For example, Stephen Smith has half of his records missing. As per the methodology presented in Section 4.2, we need to do all pairwise comparisons in a block, so we need ground truth for at least those pairs to compute the measures. If we simply ignored these records, the disambiguation evaluation would be biased.

An analysis of the evaluation data reveals that there are only 3 people in the evaluation set for whom some potential comparisons are not made because of our blocking rules: they have name variations in the first three characters of the first name or last name, namely Roy Curtiss III, Q Y Tong and Noel D Dey. This does not play a big role, as for all three names there is only one instance each.

5.5 Splitting and Lumping Results

Evaluation of the classifiers on the basis of the metrics defined in Equation 5.1 and Equation 5.2 is presented in Table 5.2.

Method                                        Splitting (%)   Lumping (%)
C-SVC                                         1.36            4.2
L2-regularized logistic regression (primal)   2.67            1.85
Lai et al. (2011)                             3.57            1.50

Table 5.2: Comparison of different methods

From the table we see that all our classifiers have a tendency to merge records of different inventors rather than to split the records of one inventor across multiple inventors. A detailed look at the results shows that, in the benchmark dataset, lumping most often happened for common names like Stephen, Eric, Donald and Mark. Note that the comparison with Lai et al. (2011) is somewhat indirect: their algorithm is run on the whole USPTO dataset for disambiguation, while we run our classification algorithms on the benchmark data only. Moreover, their system uses clusters as the basis of its evaluation metrics, while we use pairwise comparisons, and their system cannot produce explicit pairwise results. Still, these numbers are related in their nature of growth.


Chapter 6

Conclusion and Future Work

An insight into a semi-supervised algorithm for disambiguation has been provided. An evaluation was done of replacing the purpose-built classification engine of Lai et al. (2011) [9]. Two different classifiers were tested: Logistic Regression and C-SVC. To obtain training data for the classifiers, generation of pseudo-labeled data is suggested, so as to utilize the potential of the huge amount of unlabeled data.

Results for the Logistic Regression and C-SVC classifiers are good enough for them to be used in disambiguation tasks. Logistic Regression improves on Lai et al. (2011) on the lumping measure and is very close in terms of splitting [9]. C-SVC has a higher lumping rate, but its splitting measure is well below theirs. As seen in Section 5.4, our blocking methodology can at times leave out a small set of potentially matching data, introducing errors of around 1%. An alternative approach to handle these cases has to be developed.

There is still a lot of work to be done in the field of inventor disambiguation. A more sophisticated model has to be developed to capture the whole spectrum of information, using more of the features discussed in Section 3.4. The rare-name and string-comparison measures are modeled to work for American or European names, so a way to treat Asian names differently is required.


Bibliography

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[2] Tony Blakely and Clare Salmond. Probabilistic record linkage and a method to calculate the positive predictive value. International Journal of Epidemiology, 31(6):1246–1252, 2002.

[3] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

[4] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[5] Lee Fleming, Charles King, and Adam I Juda. Small worlds and regional innovation. Organization Science, 18(6):938–954, 2007.

[6] Bronwyn H Hall, Adam B Jaffe, and Manuel Trajtenberg. The NBER patent citation data file: Lessons, insights and methodological tools. Technical report, National Bureau of Economic Research, 2001.

[7] Hui Han, Hongyuan Zha, and C Lee Giles. Name disambiguation in author citations using a k-way spectral clustering method. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05), pages 334–343. IEEE, 2005.

[8] Ronald Lai, Alexander D'Amour, and Lee Fleming. The careers and co-authorship networks of U.S. patent-holders, since 1975. Technical report, Harvard Institute for Quantitative Social Science, 2009.

[9] Ronald Lai, Alexander D'Amour, Amy Yu, Ye Sun, Vetle Torvik, and Lee Fleming. Disambiguation and co-authorship networks of the US Patent Inventor Database. Harvard Institute for Quantitative Social Science, Cambridge, MA, 2138, 2011.

[10] Ronald Lai (Harvard Business School), Alexander D'Amour (Harvard Institute of Quantitative Social Science), Amy Yu (Harvard Business School), Ye Sun (Harvard Institute of Quantitative Social Science), and Lee Fleming (Harvard Business School). Disambiguation and Co-authorship Networks of the U.S. Patent Inventor Database (1975–2010), 2011.

[11] Neil R Smalheiser and Vetle I Torvik. Author name disambiguation. Annual Review of Information Science and Technology, 43(1):1–43, 2009.

[12] Jie Tang, Jing Zhang, Duo Zhang, and Juanzi Li. A unified framework for name disambiguation. In Proceedings of the 17th International Conference on World Wide Web, pages 1205–1206. ACM, 2008.

[13] Vetle I Torvik, Marc Weeber, Don R Swanson, and Neil R Smalheiser. A probabilistic similarity metric for Medline records: a model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2):140–158, 2005.

[14] Samuel Ventura, Rebecca Nugent, and Erica Fuchs. Methods Matter: Revamping Inventor Disambiguation Algorithms with Classification Models and Labeled Inventor Records. Available at SSRN 2079330, 2012.

[15] William E. Winkler. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research, pages 354–359, 1990.
