Name matching for PATSTAT data Gianluca Tarasconi KITeS Database Administrator 1 Website: rawpatentdata.blogspot.com

Name matching for PATSTAT data

Gianluca Tarasconi

KITeS Database Administrator

Website: rawpatentdata.blogspot.com

KITeS: Knowledge, Internationalization and Technology Studies

KITeS’s mission is to understand the relationship between innovation, technology management, firms’ competitiveness and economic growth in the global economy. KITeS’s research intends to be rigorous, relevant and inter-disciplinary. It focuses on three main areas: innovation, technology management and trade.

KITeS – The centre

KITeS was founded in 2008, building upon the experience of research centres such as CESPRI and CRITOM. It is hosted at Bocconi University.

KITeS is an inter-departmental research centre, integrating researchers from the Economics Dpt., the Management Dpt. and the Institutional Analysis Dpt. KITeS researchers hold doctoral degrees from Yale, Stanford, London School of Economics, Bocconi, Manchester, Leuven, Sussex, Maastricht, and others.

Patent statistics have been widely used at KITeS for many years now, dating back to CESPRI's early research in industrial dynamics.

This tradition has led to the cumulative creation and updating of a large database, known as EP-CESPRI. The inventor data used so far are organized in a sub-section of this database, known as EP-INV.

… who’s who: www.kites.unibocconi.it

The EP-CESPRI Database (i)

The EP-CESPRI database contains information on patents applied for at the European Patent Office (EPO), from 1978 to October 2009.

The EP-CESPRI database was first created from information downloaded regularly from EPO Bulletins. Since October 2007 it has been based on the applications that EPO publishes regularly in PATSTAT; at present it contains about 2,090,000 patent applications.

A beta version for USPTO data was released in 2009, and a version for SIPO (the Chinese patent office) is planned for 2010.

The EP-CESPRI Database (ii)

EP-CESPRI data fall into three broad categories:

1. Patent data, such as the patent’s publication number, its priority/application date, and main/secondary technological class (IPC, 12-digit).

2. Applicant data, such as a unique code assigned by KITeS to each applicant after cleaning the applicant’s data, plus the applicant’s name and address.

3. Inventor data, such as name, surname, address and a unique code (CODINV) assigned by KITeS to all inventors found to be the same person. This section of EP-CESPRI is also known as EP-INV, and it is the one of major interest for today’s seminar.

EP-INV: From raw data to structured data

Data coming from PATSTAT are cleaned, standardized and re-structured; the result is the CODINV2 code.

Finally, a similarity score is calculated for pairs of inventors who have the same name and surname but different addresses; the result is the CODINV code.

Standardization of inventors’ names and addresses

Original EPO data on inventors come from PATSTAT table TLS206_ASCII, where data are only partially parsed for names, address, city, zip codes.

Further steps are as follows:

1. Cleaning of address data

2. Cleaning of names

(steps 1 and 2 produce the CODINV2 codes)

3. Computation of similarity scores

(step 3 produces the CODINV codes)

Cleaning of address data

Parsed data are given a unique code (CODINV2) and (iteratively) cleaned by:

• moving information found in the wrong field (e.g. zip code, county…) to the right one;

• standardizing city names or parts of names (e.g. “Saint” is turned into “St.”);

• fixing mistakes in zip codes, according to national post office tables.

In the 10/2007 data there were 2,381,991 CODINV2 codes in the EP-INV database, against 3,278,486 PATSTAT person_id values (28% fewer).

Example of city cleaning

| Step | CITY | ZIP |
|---|---|---|
| ORIGINAL | DDR-4203 Bad Dürrenberg | |
| ZIP PARSED | Bad Dürrenberg | 4203 |
| CITY CLEANED | BAD DURRENBERG | 4203 |
| ZIP LOOKUP | BAD DURRENBERG | 06231 |
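The four steps of this example can be sketched in Python. This is only an illustration, not the MySQL code used for EP-INV: the regular expression, the function names and the one-entry `ZIP_LOOKUP` table are assumptions made for the sketch.

```python
import re
import unicodedata

# Hypothetical excerpt of a national post office table (city -> current zip).
# The single entry mirrors the example; a real table would be far larger.
ZIP_LOOKUP = {"BAD DURRENBERG": "06231"}

# Assumed pattern: optional country prefix ("DDR-", "IT-"), zip digits, then the city.
ZIP_IN_CITY = re.compile(r"^(?:[A-Z]{1,3}-)?(\d{4,5})\s+(.+)$")


def parse_zip(city_field):
    """Split a zip code typed into the city field; returns (city, zip or None)."""
    match = ZIP_IN_CITY.match(city_field.strip())
    if match is None:
        return city_field.strip(), None
    return match.group(2), match.group(1)


def clean_city(city):
    """Strip diacritics and upper-case the city name."""
    decomposed = unicodedata.normalize("NFKD", city)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.upper()


def lookup_zip(city, zip_code):
    """Prefer the zip from the lookup table; fall back to the parsed one."""
    return ZIP_LOOKUP.get(city, zip_code)


city, zip_code = parse_zip("DDR-4203 Bad Dürrenberg")  # ZIP PARSED
city = clean_city(city)                                 # CITY CLEANED
zip_code = lookup_zip(city, zip_code)                   # ZIP LOOKUP
print(city, zip_code)
```

Keeping each step as its own function mirrors the iterative cleaning described earlier: any step can be re-run after a lookup table is extended.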

Cleaning of names

The “name+surname” field was parsed into the following fields: first, second and third name, extension (e.g. Jr, Sr, III), surname, and academic title (e.g. Dr., Prof., Ing.…).

This operation was mainly based on two iterative steps:

• Pairs of inventors with the same address and equal first name, surname, extension and initial of second or third name are corrected for the third name (e.g. “Rossi Giovanni Paolo” is turned into “Rossi Giovanni P.”);

• Pairs of inventors’ records where 2 out of the 3 fields city, address and name are the same, and the remaining one has a low edit distance (Levenshtein/alfanum), are updated with the data of the inventor with the higher number of patents.
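The second step (two fields equal, the third one close in edit distance) might look as follows in Python. It is a sketch under assumptions: the field names, the `MAX_DISTANCE` cut-off of 2 and the `patents` counter are hypothetical, since the actual cut-off used in EP-INV is not stated.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


FIELDS = ("name", "address", "city")
MAX_DISTANCE = 2  # assumed cut-off for a "low" edit distance


def near_duplicates(rec_a, rec_b):
    """True when exactly two fields are equal and the third is within MAX_DISTANCE."""
    differing = [f for f in FIELDS if rec_a[f] != rec_b[f]]
    if len(differing) != 1:
        return False  # identical records, or too many differences
    field = differing[0]
    return levenshtein(rec_a[field], rec_b[field]) <= MAX_DISTANCE


def record_to_keep(rec_a, rec_b):
    """The record of the inventor with more patents provides the corrected data."""
    return rec_a if rec_a["patents"] >= rec_b["patents"] else rec_b


a = {"name": "TARASCONI GIANLUCA", "address": "VIA MASPERO 24", "city": "MILANO", "patents": 3}
b = {"name": "TARASCONI GIANLUCA", "address": "VIA MASPERO 24", "city": "MILAN", "patents": 1}
if near_duplicates(a, b):
    print(record_to_keep(a, b)["city"])
```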

An example

Before cleaning:

| Name | Address | City | Zip | codinv2 |
|---|---|---|---|---|
| Tarasconi, Gianluca | Via P. Maspero, 24 | Milan | | 1 |
| Tarasconi, Gianluca | Via Maspero, 24 | IT-20137 Milan | | 2 |
| Tarasconi, G. | c/o university bocconi | Milano | 20136 | 3 |
| Tarasconi, Gianluca | c/o university bocconi | Milano | 20136 | 4 |
| Tarasconi, Gianluca | 35, Via Tertulliano | Milan | | 5 |

After cleaning:

| Name | Address | City | Zip | codinv2 |
|---|---|---|---|---|
| Tarasconi, Gianluca | Via Maspero, 24 | Milano | 20137 | 1 |
| Tarasconi, Gianluca | c/o university bocconi | Milano | 20136 | 3 |
| Tarasconi, Gianluca | Via Tertulliano, 35 | Milano | 20135 | 5 |

Further info on cleaning names and addresses

Cleaning of names and addresses was implemented in MySQL;

The SQL code is based on 25 lookup tables and 950 recursive queries;

The aggregation algorithm was quite conservative (to allow ‘new entries’ to be quickly linked).

Computation of similarity score

• Inventor data are restructured into person (CODINV) versus person@location (CODINV2)

• All inventor records that share name and surname but differ in anything else are compared in pairs by the Massacrator SQL routine
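The pairing step can be illustrated with a short Python sketch. It is not the Massacrator routine itself (which is written in SQL); the record layout and the function name `candidate_pairs` are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """Yield pairs of CODINV2 codes whose records share name and surname."""
    by_name = defaultdict(list)
    for rec in records:
        by_name[(rec["surname"], rec["name"])].append(rec["codinv2"])
    for codes in by_name.values():
        # Every unordered pair inside a same-name group is a candidate.
        yield from combinations(sorted(codes), 2)


# Illustrative records only.
records = [
    {"codinv2": 1, "surname": "TARASCONI", "name": "GIANLUCA"},
    {"codinv2": 3, "surname": "TARASCONI", "name": "GIANLUCA"},
    {"codinv2": 5, "surname": "TARASCONI", "name": "GIANLUCA"},
    {"codinv2": 9, "surname": "ROSSI", "name": "GIOVANNI"},
]
pairs = list(candidate_pairs(records))
print(pairs)
```

Grouping by name first keeps the number of comparisons manageable: pairs are only formed inside each same-name group, never across the whole table.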

Introduction of CODINV

| Name | Address | City | Zip | codinv2 | Codinv |
|---|---|---|---|---|---|
| Tarasconi, Gianluca | Via Maspero, 24 | Milano | 20137 | 1 | 1 |
| Tarasconi, Gianluca | c/o university bocconi | Milano | 20136 | 3 | 2 |
| Tarasconi, Gianluca | Via Tertulliano, 35 | Milano | 20135 | 5 | 3 |

Computation of similarity score

The similarity score draws on:

• Workplace: same applicant / company / group

• Social networks: coinventors in common, 3 degrees of distance in coinventorship

• Toponymic permanence: same address, town, county…

• Citation linkages: (self-)citing or cited

• Time lag: how long since last patent?

• IPC: patenting in the same tech fields
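The “3 degrees of distance” criterion can be read as a shortest-path check on the co-inventorship network. The sketch below assumes that reading and a hypothetical graph stored as a dict of sets; it uses a plain breadth-first search and is not taken from the EP-INV code.

```python
from collections import deque


def degrees_of_separation(coinventors, start, target, max_depth=3):
    """Breadth-first search; returns the distance, or None if beyond max_depth."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not explore further than max_depth steps
        for neighbour in coinventors.get(node, ()):
            if neighbour == target:
                return depth + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None


# Hypothetical co-inventorship graph: inventor code -> set of co-inventors.
graph = {
    "A": {"X"},
    "X": {"A", "Y"},
    "Y": {"X", "B"},
    "B": {"Y"},
    "C": set(),
}
print(degrees_of_separation(graph, "A", "B"))
```

Under this reading, two records with one coinventor in common sit at distance 2, so the “same coinventor” criterion is a special case of the same search.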

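Scores by category

| Category | Criterion | Score |
|---|---|---|
| Workplace | Same applicant | 5 |
| Workplace | Same applicant (the applicant has <50 inventors) | 5 |
| Workplace | Same group (if available) | 5 |
| IPC | Same IPC code (4 digits) | 5 |
| IPC | Same IPC code (6 digits) | 5 |
| IPC | Same IPC code (12 digits) | 10 |
| Toponymic permanence | Same city | 5 |
| Toponymic permanence | Same province | 5 |
| Toponymic permanence | Same region | 5 |
| Toponymic permanence | Same state (US) | 5 |
| Toponymic permanence | Same address [in different cities; it may indicate misspellings in the city field] | 5 |
| Time lag | Priority dates differ by >20 years | -5 |
| Citation linkages | Inventor 1 cites inventor 2 | 5 |
| Citation linkages | Inventor 1 is cited by inventor 2 | 5 |
| Social networks | Same coinventor | 10 |
| Social networks | 3 degrees of separation | 10 |
| Other | Widespread surname | -5 |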
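As a rough illustration, the score of a pair can be computed by summing the points of every criterion the pair satisfies. The weights below are the ones listed for each category; the key names and the assumption that points simply add up (for instance, that matching IPC levels are counted separately) are not confirmed by the slides.

```python
# Points per criterion, as listed by category; the key names are made up for this sketch.
SCORES = {
    "same_applicant": 5,
    "same_small_applicant": 5,   # the applicant has < 50 inventors
    "same_group": 5,
    "same_ipc_4": 5,
    "same_ipc_6": 5,
    "same_ipc_12": 10,
    "same_city": 5,
    "same_province": 5,
    "same_region": 5,
    "same_state_us": 5,
    "same_address": 5,
    "priority_gap_over_20y": -5,
    "cites": 5,
    "cited_by": 5,
    "same_coinventor": 10,
    "three_degrees": 10,
    "widespread_surname": -5,
}


def similarity_score(criteria_met):
    """Sum the points of the criteria met by a pair (assumed to be additive)."""
    unknown = set(criteria_met) - set(SCORES)
    if unknown:
        raise KeyError(f"unknown criteria: {sorted(unknown)}")
    return sum(SCORES[name] for name in criteria_met)


print(similarity_score({"same_city", "same_ipc_4", "same_applicant"}))
```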

Update of CODINV using similarity score

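| Name | Address | City | Zip | codinv2 | Codinv | codinv (1st run) | codinv (2nd run) |
|---|---|---|---|---|---|---|---|
| Tarasconi, Gianluca | Via Maspero, 24 | Milano | 20137 | 1 | 1 | 1 | 1 |
| Tarasconi, Gianluca | c/o university bocconi | Milano | 20136 | 3 | 2 | 1 | 1 |
| Tarasconi, Gianluca | Via Tertulliano, 35 | Milano | 20135 | 5 | 3 | 3 | 1 |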

The algorithm should be run recursively.

Intuitively, a high similarity score can be taken as an indication of a high probability that the two inventors in a pair are the same person. Whenever two inventors in a pair are found to be the same person, the lowest CODINV code is assigned to both.
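The effect of running the update until nothing changes can be sketched in Python. Note the swapped-in technique: instead of repeated SQL passes, this sketch uses a union-find structure in which the lowest CODINV code is always kept as the representative. Function and variable names are assumptions.

```python
def merge_codinv(codinv, matched_pairs):
    """codinv: dict codinv2 -> CODINV. matched_pairs: (codinv2_a, codinv2_b) tuples
    judged to be the same person. Returns a new dict with merged CODINV codes."""
    parent = {code: code for code in codinv.values()}

    def find(code):
        # Follow parent links to the representative (with path halving).
        while parent[code] != code:
            parent[code] = parent[parent[code]]
            code = parent[code]
        return code

    for a, b in matched_pairs:
        root_a, root_b = find(codinv[a]), find(codinv[b])
        if root_a != root_b:
            low, high = sorted((root_a, root_b))
            parent[high] = low  # the lowest code always wins
    return {record: find(code) for record, code in codinv.items()}


# The example above: codinv2 1, 3, 5 start with CODINV 1, 2, 3.
start = {1: 1, 3: 2, 5: 3}
print(merge_codinv(start, [(3, 5), (1, 3)]))
```

Because the lowest code is always the representative, the result does not depend on the order in which matched pairs are processed, which is what the repeated runs achieve.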

Finding a threshold value (I)

Manual checking of EP-INV records suggests that a large share of paired inventors with a total score higher than 20 are indeed the same person.

Percentages vary across countries, largely because of the different distribution of frequent surnames. Therefore, no automatic re-assignment of CODINV codes has been performed so far.

In the KEINS research, data were extensively checked for IT, FR and SE; the threshold value of the similarity score was set at 15 (median value): inventors in pairs with score >= 15 are then presumed to be the same person, and assigned the same CODINV code.

Finding a threshold value (II)

Manual checking suggests that:

• no Type 2 errors (false positives) are introduced with this choice, i.e. no pair of inventors is erroneously assigned the same CODINV code;

• several Type 1 errors remain, i.e. pairs of inventors who are indeed the same person have scores <15 and are not given the same CODINV code.
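A check of this kind can be expressed as a small Python function that counts both kinds of error for a given threshold. The sample data are made up for illustration only and the function name is an assumption; the error labels follow the usage above (Type 2 = wrongly merged, Type 1 = missed match).

```python
def error_counts(checked_pairs, threshold=15):
    """checked_pairs: (score, same_person) tuples from manual checking.
    Returns (wrongly_merged, missed)."""
    wrongly_merged = 0  # "Type 2" above: different people, score >= threshold
    missed = 0          # "Type 1" above: same person, score < threshold
    for score, same_person in checked_pairs:
        merged = score >= threshold
        if merged and not same_person:
            wrongly_merged += 1
        elif same_person and not merged:
            missed += 1
    return wrongly_merged, missed


# Made-up sample, for illustration only.
sample = [(25, True), (15, True), (10, True), (5, False), (0, False)]
print(error_counts(sample, threshold=15))
print(error_counts(sample, threshold=5))
```

Lowering the threshold trades missed matches for wrong merges, which is why a conservative value suits a database where a wrong merge is the costlier error.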

Applying Massacrator to all EPO (I)

[Figure: distribution of score. x-axis: score (from -10 to 375); y-axis: number of couples (logarithmic scale, 1 to 1,000,000).]

As of 10/2007 we get 2,672,671 couples out of 2,363,501 inventors. The mode is 0 pts (764,946 couples), but 758,471 couples have >= 15 pts.

Applying Massacrator to all EPO (II)

16.78% of couples are >= 20 pts; 22.72% of couples are >= 15 pts.

[Figure: percentage of couples by score. x-axis: score (from -10 to 155); y-axis: 0% to 100%.]

Applying Massacrator to all EPO (III)

A rough version of the algorithm, giving a proxy of the possible reduction, may be:

same IPC (12 digits) OR

same applicant OR

same address OR

3 degrees of distance OR

1 coinventor in common OR

citation linkage OR

same IPC (6 digits) and same country

This compresses 571,970 CODINVs out of 2,363,501 (-24%).
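Written as a Python predicate, the rule is a plain chain of ORs, with the last condition requiring both the 6-digit IPC and the country to match. The indicator names are assumptions made for this sketch.

```python
def raw_match(pair):
    """pair: dict of boolean indicators for two same-name inventor records.
    Missing indicators count as False."""
    return (
        pair.get("same_ipc_12", False)
        or pair.get("same_applicant", False)
        or pair.get("same_address", False)
        or pair.get("three_degrees", False)
        or pair.get("common_coinventor", False)
        or pair.get("citation_link", False)
        or (pair.get("same_ipc_6", False) and pair.get("same_country", False))
    )


print(raw_match({"same_ipc_6": True, "same_country": True}))
print(raw_match({"same_ipc_6": True}))
```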

Some publications using the EP-INV data

Lissoni F., Llerena P., McKelvey M., Sanditov B. Academic Patenting in Europe: New Evidence from the KEINS Database. Research Evaluation, 17(2): 87-102.

Bacchiocchi E., Montobbio F. (2009). Knowledge Diffusion from University and Public Research. A Comparison between US, Japan and Europe using Patent Citations. Journal of Technology Transfer, vol. 34(2), pp. 169-181.

Breschi S., Lissoni F., Montobbio F. (2008). University patenting and scientific productivity. A quantitative study of Italian academic inventors. European Management Review. The Journal of the European Academy of Management, 5(2): 91-109.

Corrocher N., Malerba F., Montobbio F. (2007). Schumpeterian Patterns of Innovative Activity in the ICT Field. Research Policy, vol. 36, pp. 418-432.

Breschi S., Lissoni F., Montobbio F. (2007). The Scientific Productivity of Academic Inventors: New Evidence from Italian Data. Economics of Innovation and New Technology, vol. 16(2), pp. 101-118.

Della Malva A., Breschi S., Lissoni F., Montobbio F. (2007). L'attivita' brevettuale dei docenti universitari: L'Italia in un confronto internazionale. Economia e Politica Industriale, v. 2, pp. 43-70.

Montobbio F. (2008). Patenting Activity in Latin American and Caribbean Countries. In World Intellectual Property Organization (WIPO) - Economic Commission for Latin America and the Caribbean (ECLAC), "Study on Intellectual Property Management in Open Economies: A Strategic Vision for Latin America". Forthcoming.

Future uses of the algorithm (I)

Cross patent-office match:

Is J. Smith in EPO data the same person as J. Smith in USPTO data?

Decompression:

Where toponymic data are scarce (USPTO data, for instance), mere data cleaning would group inventors who are not the same person; the algorithm could help to avoid Type 2 errors.

Future uses of the algorithm (II)

Companies’ match:

Identify applicants with similar company names as the same entity;

NPL match:

Help to deduplicate authors / affiliations.
