Author disambiguation for enhanced science 2.0 services

Generating groups based on “thin” data

Jeffrey Demaine

Institute for Research Information and Quality Assurance (iFQ)

Bonn, Germany

www.research-information.de

Outline

• Describe author-disambiguation efforts at iFQ

– Good results based on limited metadata.

– Simple approach can achieve remarkable results.

• Adapt the findings to the context of an online service.

• Most author-disambiguation articles achieve remarkable

results by starting with clean (manually curated) data.

– What if you don't have clean, curated data?

– i.e.: the community-generated content of Science 2.0

Jeffrey Demaine

08/2011 Institute for Research Information and Quality Assurance

Challenge

• Web of Science database does not group author names

by identity: each name-instance gets a different ID number.

• Example: There are 100 articles by Smith, J

– 100 different authorID numbers

– 100 different researchers?

– 1 researcher published them all?

– Something in-between?

• How can iFQ measure a researcher's productivity if we

don't know who that person is?


Challenge

Goal:

• To reduce uncertainty about identity

• Identify and group homonyms

– same spelling, but not necessarily the same person

Constraints:

• WoS only uses the first initial.

– Few records have a first name.

• Affiliation data is suspect.

– Corresponding author only.


Ignoring synonyms

• Spelling mistakes, anglicisation of characters (changing

ö into oe or o), name changes due to marriage, etc. are

beyond the scope of this project.

• While J. Smith can be disambiguated, we will never

know if Jane Smith is a mis-spelling of June Smith.

(Maybe it’s really Janet Smith, or Jane Smyth, or…)

• Speculation about what could be cannot be resolved.


Matching on metadata

Thiel, C

Thiel, CM; Bentley, P; Dolan, RJ (2002) Effects of

cholinergic enhancement on conditioning-related

responses in human auditory cortex. EUR J NEUROSCI

Thiel, CM; Henson, R; Morris, JS; Friston, KJ; Dolan, RJ

(2001) Pharmacological modulation of behavioral and

neuronal correlates of repetition priming. J NEUROSCI

These articles have one co-author in common (Dolan, RJ), so we can be fairly

sure they were both written by the same Christiane Thiel.

Matching on metadata

[Figure: two author-instances of “Thiel, C” joined by a question mark: are they the same person?]

[Figure: author-instances (labelled “Thiel, C”, “Thiel, C” and “Smith, J”) are matched based on bibliographic coupling.]

Types of matching tested

4 types of matching were tested:

• > 1 co-author

• > 1 co-author and/or > 1 reference

• > 1 co-author and/or > 2 references

• > 1 co-author and/or > 1 reference [Jaccard Similarity Coeff.]
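The four matching rules above can be sketched as one pairwise test with adjustable thresholds. This is only an illustration, not the iFQ program: the `Record` structure, the function names and the parameter names are hypothetical, and each threshold is read here as "at least this many shared items", which may not be exactly how the slide's "> 1" was applied.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """One author-instance: a name on one article (hypothetical structure)."""
    record_id: str
    coauthors: frozenset = field(default_factory=frozenset)
    references: frozenset = field(default_factory=frozenset)


def jaccard(a, b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_match(r1, r2, min_coauthors=1, min_refs=None, min_jaccard=None):
    """Return True if two author-instances plausibly belong to the same person.

    min_coauthors: shared co-authors needed for a match.
    min_refs:      shared cited references needed (None disables this test).
    min_jaccard:   Jaccard threshold on the reference sets (None disables it).
    """
    if len(r1.coauthors & r2.coauthors) >= min_coauthors:
        return True
    if min_refs is not None and len(r1.references & r2.references) >= min_refs:
        return True
    if min_jaccard is not None and jaccard(r1.references, r2.references) >= min_jaccard:
        return True
    return False


# The two "Thiel, CM" records from the earlier slide share the co-author Dolan, RJ.
a = Record("thiel-2002", frozenset({"Bentley, P", "Dolan, RJ"}))
b = Record("thiel-2001", frozenset({"Henson, R", "Morris, JS", "Friston, KJ", "Dolan, RJ"}))
print(is_match(a, b))  # True
```

Calling `is_match(r1, r2, min_refs=2)` would correspond to the "co-author and/or 2 references" rule, and `min_jaccard` to the Jaccard variant; any particular Jaccard cut-off is an assumption, since the slides do not give one.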

Evaluation of matching:

• Fewer groups = simplification of identity (RATIO).

• How accurate are the groups? (PRECISION)

• How efficient are the groups? (RECALL)



• Compare each group against the Emmy Noether

data, which we know represents a single person.

• How accurate is the group with the most matches?

That is:

• We know that a J Smith wrote 15 articles.

• Which group contains the most of those?

• Which method of matching creates the best groups?


Goal #1: Reducing uncertainty

14 names from the Emmy Noether dataset (all records since 1992)

Name tested      WoS records
B Borasoy        59
R Huber          945
A Blaukat        44
G Behre          98
M Albrecht       652
L Ackermann      60
R Everaers       43
P Neumann        317
K Niebuhr        13
C Ochsenfeld     33
JW Pan           192
T Pietschmann    37
B Regenberg      30
K Wiegand        28

R Huber: manual matching

70,839,681 co-author comparisons at 1 per second:

1,180,661 minutes

19,678 hours

2,460 8-hour days

492 5-day work weeks

About 10 years (with 4 weeks of vacation per year)

Simplification of identity

Number of groups by matching method (reduction relative to R Huber's 945 WoS records):

Method                            Groups   Reduction
Co-author matching                95       90%
Co-authors & reference            65       93%
Co-authors & 2 references         76       92%
Co-authors & reference Jaccard    154      84%
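As a quick check of the percentages above, the sketch below assumes (an inference, not something the slide states) that each percentage is the reduction from R Huber's 945 WoS records to the number of groups. The function name `reduction` is hypothetical.

```python
# Assumption: percentage = 1 - groups / records, with 945 records for R Huber.
RECORDS = 945
groups_by_method = {
    "Co-author matching": 95,
    "Co-authors & reference": 65,
    "Co-authors & 2 references": 76,
    "Co-authors & reference Jaccard": 154,
}


def reduction(groups, records=RECORDS):
    """Share of name-instances that were merged away, as a fraction."""
    return 1 - groups / records


for method, groups in groups_by_method.items():
    print(f"{method}: {groups} groups, {reduction(groups):.0%} reduction")
```

Under that assumption the four methods come out at 90%, 93%, 92% and 84%, which agrees with the figures on the slide.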


Co-author matching took 5 min 27 s.

((945 × 3) − 1) × ((944 × 3) − 1) > 8 million


[Figure: chart with axis label “Precision (%)”.]


Which method of matching creates the best groups?

Matching by co-authors & 2 references: 75% precision

Matching by co-authors & 2 references & author starting year: 79% precision

Precision = The ratio of correct matches to all matches (% correct)

Goal #2: Split known names

• Use the co-author & 2-references matching technique.

• We know a priori the identities of the persons, so we

know how many groups there should be.

• This allows us to calculate Recall.

• Also allows us to calculate correlation coefficients.


Who is E. Cardellach?

Esteve Cardellach

Universitat Autònoma de Barcelona, Bellaterra, Spain

• Environmental Engineering

• Environmental Sciences

• Geochemistry & Geophysics

• Geology

• Mineralogy

• Civil Engineering

Estel Cardellach

Institut de Ciències de l’Espai, Bellaterra, Spain

• Computer Science, A.I.

• Electrical & Electronic Engineering

• Imaging Science & Photo. Tech.

• Multidisciplinary Geosciences

• Remote Sensing

• Telecommunications

Web of Science identifies both as simply “Cardellach, E”


Using the program

• There are 41 distinct records published since 2002 that

match the name “Cardellach, E”.

• The program grouped the names according to the co-authors

and/or references (minimum 2) they share.

• Involved the comparison of 9202 co-authors and of

19777 references, which took only 21 seconds (!).

• The RESULT was 4 groups containing all 41 records.

• Googled article titles to verify author name & address.
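One way to turn pairwise matches into groups is to treat every match as a link and take the connected components. The sketch below does this with a small union-find. It assumes, since the slides do not say, that matches are applied transitively and that the "minimum 2" applies to shared references only; the record layout and the `group_records` name are hypothetical.

```python
from itertools import combinations


def group_records(records, min_shared_refs=2):
    """Group author-instances that share a co-author or enough references.

    records: dict mapping record id -> (set of co-authors, set of references).
    Returns a list of groups (sets of record ids), largest first.
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        coauthors_a, refs_a = records[a]
        coauthors_b, refs_b = records[b]
        if coauthors_a & coauthors_b or len(refs_a & refs_b) >= min_shared_refs:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), set()).add(rid)
    return sorted(groups.values(), key=len, reverse=True)


# Toy data (invented for illustration): r1 and r2 share a co-author,
# r2 and r3 share two references, r4 shares nothing with the others.
toy = {
    "r1": ({"Author A", "Author B"}, {"ref1"}),
    "r2": ({"Author B"}, {"ref2", "ref3"}),
    "r3": ({"Author C"}, {"ref2", "ref3", "ref4"}),
    "r4": ({"Author D"}, {"ref9"}),
}
print(group_records(toy))
```

In the toy data r1 and r3 have nothing in common directly but end up in the same group through r2. That transitivity is what lets a chain of weak links form one identity, and it is also how a single wrong link can merge two people.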


Precision & Recall

Precision: % of results that are correct (accuracy).

Recall: % of all relevant matches that were made (efficiency).

Calculating recall requires knowing the total number of correct answers.


                           Condition (reality) TRUE   Condition (reality) FALSE
Test (prediction) TRUE     True positive (TP)         False positive (FP)         Precision = TP ÷ (TP + FP)
Test (prediction) FALSE    False negative (FN)        True negative (TN)
                           Recall = TP ÷ (TP + FN)

The Matthews correlation coefficient (MCC) gives a statistical measure of the results:

MCC = (TP × TN − FP × FN) ÷ √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

Disambiguation of Estel and Esteve

                     In reality: Esteve   In reality: Estel
Predicted: Esteve    25                   1                   Precision 96%
Predicted: Estel     3                    12
                     Recall 89%

Matthews c.c. 0.79

The one false positive occurred when Estel published a

paper in the field of Geochemistry & Geophysics.
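The reported figures can be reproduced from the four counts. The sketch assumes the table is laid out with predictions in rows and reality in columns, and that Esteve is treated as the positive class; the Matthews coefficient is computed with its standard formula.

```python
from math import sqrt

# Counts from the table above, treating "Esteve" as the positive class.
TP, FP = 25, 1   # predicted Esteve: really Esteve / really Estel
FN, TN = 3, 12   # predicted Estel:  really Esteve / really Estel

precision = TP / (TP + FP)
recall = TP / (TP + FN)
mcc = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print(f"Precision {precision:.0%}, recall {recall:.0%}, MCC {mcc:.2f}")
# Precision 96%, recall 89%, MCC 0.79
```

The four counts sum to 41, which matches the number of “Cardellach, E” records that were grouped.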

Disadvantages

• Not the most advanced approach.

– Does not use NLP on title, abstract.

• High variability of precision: groups can contain errors.

• Smaller (“orphan”) groups require manual matching.

• Very common names produce a large number of

groups (naturally). Requires some post-processing.


Advantages

• It is highly automated, requiring only the input of a

name and a year with which to limit the search.

• It runs very quickly, with hundreds of names being

disambiguated in a few minutes.

• Only have to verify one name-instance per group.

• Works with the real-world data we have (FIZ

databases), not some hand-curated test set.


What about science 2.0?

• Most author-disambiguation articles achieve remarkable

results by starting with a very clean data-set.

• In a science 2.0 context, content is user-generated and

nobody cleans the data.

• But that does not mean one cannot extract interesting

patterns from the metadata.

• Example: author-disambiguation based on co-authors,

co-citation, and shared keywords.



“Loren M. Frank” found 7 times in Karlmarx20’s library

• Emery N. Brown, David P. Nguyen, Loren M. Frank, Matthew A. Wilson, Victor Solo. “An analysis of neural receptive field plasticity by point process adaptive filtering”. Proceedings of the National Academy of Sciences of the United States of America, Vol. 98, No. 21. (9 October 2001), pp. 12261-12266.

• Emery N. Brown, Loren M. Frank, Dengda Tang, Michael C. Quirk, Matthew A. Wilson. “A Statistical Paradigm for Neural Spike Train Decoding Applied to Position Prediction from Ensemble Firing Patterns of Rat Hippocampal Place Cells”. J. Neurosci., Vol. 18, No. 18. (15 September 1998), pp. 7411-7425.

• Emery N. Brown, Riccardo Barbieri, Valérie Ventura, Robert E. Kass, Loren M. Frank. “The Time-Rescaling Theorem and Its Application to Neural Spike Train Data Analysis”. Neural Computation, Vol. 14, No. 2. (19 December 2010), pp. 325-346.

• Riccardo Barbieri, Loren M. Frank, Michael C. Quirk, Matthew A. Wilson, Emery N. Brown. “Diagnostic methods for statistical models of place cell spiking activity”. Neurocomputing, Vol. 38-40 (June 2001), pp. 1087-1093.

• Emery N. Brown, Riccardo Barbieri, Valerie Ventura, Robert E. Kass, Loren M. Frank. “The Time-Rescaling Theorem and Its Application to Neural Spike Train Data Analysis”. Neural Comp., Vol. 14, No. 2. (February 2002), pp. 325-346.

• Riccardo Barbieri, Michael C. Quirk, Loren M. Frank, Matthew A. Wilson, Emery N. Brown. “Construction and analysis of non-Poisson stimulus-response models of neural spiking activity”. Journal of Neuroscience Methods, Vol. 105, No. 1. (January 2001), pp. 25-37.

• Uri T. Eden, Loren M. Frank, Riccardo Barbieri, Victor Solo, Emery N. Brown. “Dynamic Analysis of Neural Encoding by Point Process Adaptive Filtering”. Neural Comp., Vol. 16, No. 5. (May 2004), pp. 971-998.


Disambiguation based on shared co-authors.

What about reference matches?

• CiteULike does not have citation data!

• The references collected in a user's library (“bookmarks”)

are a form of citation.

– Indicate common academic interest.

• So: leverage the personal collections generated by users

to disambiguate authors based on “co-collection” of

articles.
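The “co-collection” idea can be sketched by counting how often two articles sit in the same user library, in the same way that co-citation counts how often two articles are cited together. Everything below is illustrative: the function names, the library layout and the threshold of two libraries are assumptions, and the user names and article ids are placeholders rather than CiteULike data.

```python
from collections import Counter
from itertools import combinations


def co_collection_counts(libraries):
    """Count, for each pair of articles, how many user libraries hold both.

    libraries: dict mapping user name -> set of article ids.
    Returns a Counter keyed by sorted (article, article) tuples.
    """
    counts = Counter()
    for articles in libraries.values():
        for pair in combinations(sorted(articles), 2):
            counts[pair] += 1
    return counts


def co_collected(article_a, article_b, counts, min_libraries=2):
    """True if the two articles sit together in at least `min_libraries` libraries."""
    return counts[tuple(sorted((article_a, article_b)))] >= min_libraries


# Invented example libraries; user names and article ids are placeholders.
libraries = {
    "user1": {"art1", "art2", "art3"},
    "user2": {"art1", "art2"},
    "user3": {"art3", "art4"},
}
counts = co_collection_counts(libraries)
print(co_collected("art1", "art2", counts))  # True: held together by user1 and user2
print(co_collected("art3", "art4", counts))  # False: only user3 holds both
```

Two author-instances with the same name could then be matched when their articles are co-collected, alongside or instead of the shared co-author test.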

Matching on metadata

[Figure: two author-instances of “Thiel, C” are matched based on co-citation.]

“Wentai Liu” found 10 times in Karlmarx20’s library

• Zhi Yang, Qi Zhao, Wentai Liu. “Improving spike separation using waveform derivatives”. Journal of Neural Engineering, Vol. 6, No. 4. (2009)

• Zhi Yang, Qi Zhao, Wentai Liu. “Neural signal classification using a simplified feature set with nonparametric clustering”. Neurocomputing, Vol. 73, No. 1-3. (2009), pp. 412-422.

• Zhi Yang, Qi Zhao, Wentai Liu. “Improving spike separation using waveform derivatives.” Journal of Neural Engineering, Vol. 6, No. 4. (2009)

• Zhi Yang, Qi Zhao, Wentai Liu. “Energy based evolving mean shift algorithm for neural spike classification”. In 2009 31st Annual International Conference of the {IEEE} Engineering in Medicine and Biology Society. {EMBC} 2009, 3-6 Sept. 2009 (2009), pp. 966-9.

• Linh Hoang, Zhi Yang, Wentai Liu. “VLSI architecture of NEO spike detection with noise shaping filter and feature extraction using informative samples”. In 2009 31st Annual International Conference of the {IEEE} Engineering in Medicine and Biology Society. {EMBC} 2009, 3-6 Sept. 2009 (2009), pp. 978-81.

• […] using informative samples”. Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference, Vol. 1 (2009), pp. 978-981.

• Zhi Yang, Qi Zhao, Wentai Liu. “Energy based evolving mean shift algorithm for neural spike classification”. Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference, Vol. 1 (2009), pp. 966-969.

• […] using informative samples.” Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference, Vol. 1 (2009), pp. 978-981.

• Tung-Chien Chen, Wentai Liu, Liang-Gee Chen. “VLSI architecture of leading eigenvector generation for on-chip principal component analysis spike sorting system”. Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference, Vol. 1 (2008), pp. 3192-5.

• Zhi Yang, Qi Zhao, Wentai Liu. “Improving spike separation using waveform derivatives”. Journal of Neural Engineering, Vol. 6, No. 4. (July 2009).


Matching based on collection

• Some precedent for this in the Science 2.0 literature:

– Haustein, S. et al. treat bookmarking on CiteULike as a proxy

for citations in measuring journal impact.

– Usage ratio, diffusion, intensity based on counting user profiles.

• Matching based on membership in a collection: data that

is extrinsic to the article itself.

– Match on user-generated keywords

Haustein, S., Golov, E., Luckanus, K., Reher, S. & Terliesner, J. (2010). “Journal evaluation and science 2.0:

Using social bookmarks to analyze reader perception”. Proceedings of the 11th International Conference on

Science and Technology Indicators, Leiden, 117-119.


Practicalities

• The accuracy of the disambiguation is not mission-critical – it's just an online web service.

• Could easily be implemented to run on the fly using JavaServer Pages (JSP) technology.

– When a user loads a web-page, the server matches the

references based on their metadata and profile membership,

creating a web-page with groups.

• An example of an automated value-added service in

a Science 2.0 context.


Thank you very much

for your attention!

[email protected]
