Borner-Data analytics for science and innovation
Transcript of Borner-Data analytics for science and innovation
![Page 1: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/1.jpg)
Data analytics for science and innovation
Moderator: Katy Börner, Indiana University, United States
• High-Impact and Transformative Science (HITS) Metrics, Bruce Weinberg and Joseph Staudt, Ohio State U, Jerry Marschke and Huifeng Yu, SUNY Albany, Katy Börner and Robert P. Light, Indiana University
• Validation of a web mining technique to measure innovation in the Canadian nanotechnology-related community, Constant Rietsch, Catherine Beaudry, and Mikaël Héroux-Vaillancourt (Department of Mathematics and Industrial Engineering, Polytechnique Montréal, Canada)
• Text mining to identify similar patents: a reassessment of the localization of knowledge spillovers from R&D, Sam Arts, Bruno Cassiman, KU Leuven, Belgium
• Developing science culture indicators through text mining and online media monitoring, Martin W. Bauer (London School of Economics) & Ahmet Suerdem (Bilgi University, Istanbul)
![Page 2: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/2.jpg)
Olivier H. Beauchesne. 2011. Map of Scientific Collaborations from 2005-2009.
![Page 3: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/3.jpg)
Bollen, Johan, Herbert Van de Sompel, Aric Hagberg, Luis M.A. Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva. 2008. A Clickstream Map of Science.
![Page 4: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/4.jpg)
Eric Fischer. 2012. Language Communities of Twitter.
![Page 5: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/5.jpg)
Alan Mislove, Sune Lehmann, Yong-Yeol Ahn, Jukka-Pekka Onnela, and James Niels Rosenquist. 2010. Pulse of the Nation.
![Page 6: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/6.jpg)
Martin Vargic. 2014. Map of the Internet.
![Page 7: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/7.jpg)
Martin Vargic. 2014. Map of the Internet.
![Page 9: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/9.jpg)
![Page 10: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/10.jpg)
INFORMING SCIENCE AND INNOVATION POLICIES
#BlueSky3
TOWARDS THE NEXT GENERATION OF DATA AND INDICATORS
19-21 September 2016
Ghent, Belgium
![Page 11: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/11.jpg)
Identifying High-Impact and Transformative Science (HITS)
Jerry Marschke and Huifeng Yu (SUNY Albany); Katy Börner and Robert Preston Light (Indiana U); Bruce Weinberg and Joseph Staudt (Ohio State U)
![Page 12: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/12.jpg)
Transformative Work
The vast majority of scientific understanding advances incrementally…. Less frequently, scientific understanding advances dramatically, through the application of radically different approaches or interpretations that result in the creation of new paradigms or new scientific fields. This progress is revolutionary, for it transforms science by overthrowing entrenched paradigms and generating new ones. (National Science Board [2007])
![Page 13: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/13.jpg)
![Page 14: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/14.jpg)
Importance
• Interest in supporting transformative versus incremental work at NSF, NIH, universities, …
• Identify factors that generate transformation
– Networks, demographics, funding, …
• Predict what work will be transformative?
• Apply methods to broader corpora with data?
• Ultimately measure value for health
![Page 15: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/15.jpg)
Transformative Work
• We develop metrics to identify transformative work:
– We apply
  • Text analysis
  • Rich characterization of citation patterns from Web of Science (WOS)
– To 11M+ articles in MEDLINE from 1983-2012
– At the level of fields (applicable to articles)
• Validate and develop visualizations to communicate to subject matter experts
![Page 16: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/16.jpg)
Metrics
1. Radical - Generates New Paradigms and Scientific Fields: introduction and use of heavily-used new concepts
2. Radical - Destructive: age of backward citations
3. Risky: variance of forward citations
4. Multidisciplinary: breadth of articles (citations) and concepts used
5. Wide Impact: breadth of forward citations and concepts introduced
6. Growing Impact: time to forward citations
7. High Impact: forward citation counts
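As a rough illustration of how the citation-based metrics could be computed for one field, the sketch below covers only the three metrics with a direct arithmetic reading (3, 7, and the citation age in 2). The record layout (`year`, `forward_citations`, `cited_years`), the function name, and the use of plain field-level means and population variance are assumptions made for this example; the slides do not specify the exact definitions.

```python
from statistics import mean, pvariance

def field_metrics(articles):
    """Summarize one field from a list of article records.

    Each record is a dict with hypothetical keys:
      'year'              publication year of the article
      'forward_citations' number of citations the article received
      'cited_years'       publication years of the references it cites
    """
    forward = [a["forward_citations"] for a in articles]
    # Age of a backward citation = citing year minus cited year.
    ages = [a["year"] - y for a in articles for y in a["cited_years"]]
    return {
        "high_impact": mean(forward),          # metric 7: forward citation counts
        "risky": pvariance(forward),           # metric 3: variance of forward citations
        "backward_citation_age": mean(ages),   # metric 2: age of backward citations
    }

# Toy field with two articles (made-up numbers).
toy_field = [
    {"year": 2000, "forward_citations": 10, "cited_years": [1990, 1998]},
    {"year": 2001, "forward_citations": 0, "cited_years": [2000]},
]
toy_metrics = field_metrics(toy_field)
```

The concept-based metrics (1, 4, 5) would additionally need the output of the text analysis and are not sketched here.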
![Page 17: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/17.jpg)
Aspects of Transformativeness Related to Impact and Transformativeness
![Page 18: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/18.jpg)
Citations Related to Impact and Transformativeness
![Page 19: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/19.jpg)
Patterns
• Radical Generative and Destructive weakly related
– New ideas can be born without obsolescence
• Multi-disciplinarity and wide impact related
• Multi-disciplinarity weakly related to impact
• Transformative work has shorter forward citation ages
– Impactful over time, but even more so soon
![Page 20: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/20.jpg)
Impact v Transformativeness
![Page 21: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/21.jpg)
Findings
• Cohesive indicators for transformative work; correlated (.4) with, but distinct from, impact
– Extremely high impact is related to transformativeness
• People flow into transformative fields
• “Young fields” are more transformative; “Old fields” are higher impact
• Currently working on validation
![Page 22: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/22.jpg)
Collaborators
• Katy Börner, IU
• Robert Light, IU
• Gerald Marschke, Albany
• Joe Staudt, OSU / Census
• Huifeng Yu, Albany
• Bruce Weinberg, Ohio State, [email protected]
![Page 23: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/23.jpg)
Text Matching to Measure Patent Similarity
Sam Arts, Faculty of Business and Economics
Bruno Cassiman, IESE Business School, KU Leuven
Juan Carlos Gomez, University of Guanajuato
OECD Blue Sky Conference 2016
![Page 24: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/24.jpg)
The United States Patent Classification System (USPC)
• Prior and current research relies on patent classification (USPC)
– To identify similar patents (counterfactual control)
– e.g., Jaffe, Trajtenberg, and Henderson, 1993; Almeida, 1996; Agrawal, Cockburn, and Rosell, 2010
– To measure similarity between patents and patent portfolios
– e.g., Argyres, 1996; Ahuja, 2000; Rosenkopf and Almeida, 2003; Makri, Hitt, and Lane, 2010
• USPC
– Too broad
– Changes over time (patents are reclassified)
– Manually assigned
– e.g., Thompson and Fox-Kean, 2005; Belenzon and Schankerman, 2013; …
![Page 25: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/25.jpg)
The United States Patent Classification System (USPC)
• Unclear what the bias is
– Type I: false positive (dissimilar patents, same USPC)
– Type II: false negative (similar patents, different USPC)
• No alternatives
– Using subclasses instead of classes
– e.g., Thompson and Fox-Kean, 2005
– Using all classes instead of primary
– e.g., Benner and Waldfogel, 2008
• Unclear how alternatives affect Type I or Type II bias
![Page 26: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/26.jpg)
Text-based measure of similarity
• Titles and abstracts from all US utility patents granted between 1976 and 2013 (4.4 million)
• Concatenate title and abstract, lowercase, and eliminate stop words (SMART system, >600 words), words with fewer than 2 characters, numbers, and words which appear only once
• Each patent is represented as a collection of unique keywords
• 526,561 keywords; average of 37 per patent
• Drop patents with fewer than 10 keywords (0.3% of sample)
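The cleaning steps on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code (which they describe as Java): the tiny `STOP_WORDS` set stands in for the SMART list, the tokenization rule is a guess, and "appear only once" is read here as one occurrence in the whole corpus.

```python
import re
from collections import Counter

# Placeholder stop list for illustration; the slides use the SMART list (>600 words).
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "is", "by", "with"}

def tokenize(title, abstract):
    """Concatenate title and abstract, lowercase, and split into tokens."""
    text = f"{title} {abstract}".lower()
    # Assumption: a token is a run of letters/digits, optionally hyphenated.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)

def keep(token):
    """Drop stop words, tokens shorter than 2 characters, and pure numbers."""
    return len(token) >= 2 and not token.isdigit() and token not in STOP_WORDS

def build_keyword_sets(patents, min_keywords=10):
    """patents: dict patent_id -> (title, abstract).
    Returns dict patent_id -> set of unique keywords."""
    tokens_by_patent = {
        pid: [t for t in tokenize(title, abstract) if keep(t)]
        for pid, (title, abstract) in patents.items()
    }
    # Assumption: a word "appears only once" if it occurs once in the whole corpus.
    corpus_counts = Counter(t for toks in tokens_by_patent.values() for t in toks)
    keyword_sets = {}
    for pid, toks in tokens_by_patent.items():
        keywords = {t for t in toks if corpus_counts[t] > 1}
        if len(keywords) >= min_keywords:  # drop patents with too few keywords
            keyword_sets[pid] = keywords
    return keyword_sets

# Toy corpus (made-up text); min_keywords lowered so the tiny example survives.
toy_patents = {
    "p1": ("Laser diode", "A laser diode with 2 mirrors"),
    "p2": ("Diode array", "An array of laser diode elements x"),
}
toy_sets = build_keyword_sets(toy_patents, min_keywords=2)
```

With the default `min_keywords=10`, both toy patents would be dropped, mirroring the 0.3% of the sample excluded on the slide.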
![Page 27: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/27.jpg)
Text matching (instead of USPC)
• Simple Jaccard index
– Range 0-1
• For each of the 4.4 million patents, select the closest text-matched patent within the same year (cf. Jaffe, Trajtenberg, and Henderson, 1993)
– Minimum Jaccard of 0.05 (0.5% of patents dropped)
– More patents are dropped when matching on USPC!
• Average Jaccard 0.24
– 14 common keywords for 2 patents with 37 keywords each
• As a baseline, select a distant text-matched patent within the same year (Jaccard = 0, closest filing date)
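For reference, the Jaccard index of two keyword sets A and B is |A ∩ B| / |A ∪ B|. Two patents with 37 keywords each and 14 in common give 14 / (37 + 37 - 14) = 14 / 60 ≈ 0.23, roughly in line with the reported average. The sketch below shows the index and a brute-force same-year match; `closest_match`, `keyword_sets`, and `year_of` are hypothetical names, and a pairwise scan like this would not scale to 4.4 million patents without some form of indexing.

```python
def jaccard(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def closest_match(focal_id, keyword_sets, year_of, min_jaccard=0.05):
    """Return (patent_id, score) of the most similar same-year patent,
    or None if no candidate reaches min_jaccard."""
    best_id, best_score = None, 0.0
    for pid, keywords in keyword_sets.items():
        if pid == focal_id or year_of[pid] != year_of[focal_id]:
            continue  # only compare with other patents from the same year
        score = jaccard(keyword_sets[focal_id], keywords)
        if score > best_score:
            best_id, best_score = pid, score
    return (best_id, best_score) if best_score >= min_jaccard else None

# Worked example from the slide: 37 keywords each, 14 shared.
set_a = set(range(37))
set_b = set(range(23, 60))
slide_example = jaccard(set_a, set_b)  # 14 / 60

# Toy matching example (made-up data).
toy_keywords = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 5}, "c": {1, 9, 10, 11},
                "d": {1, 2, 3, 4}, "e": {20, 21}}
toy_years = {"a": 2000, "b": 2000, "c": 2000, "d": 2001, "e": 2000}
toy_match = closest_match("a", toy_keywords, toy_years)
```

Here patent "d" is textually identical to "a" but is skipped because it falls in a different year, and "e" has no match because it shares no keywords with any same-year patent.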
![Page 28: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/28.jpg)
Validation: closest text-matched patents in same year
Patent pairs with a larger Jaccard index are more likely to belong to the same patent family (DOCDB), to share inventor(s) and assignee(s), and to cite each other
![Page 29: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/29.jpg)
Validation: expert assessment
• 5 independent R&D scientists
– Semiconductor devices, chemical engineering, power plants, genetics, and optical inspection systems
• For each expert
– Randomly select 10 baseline patents
– For each baseline patent, one random patent for each Jaccard range
  • 0.00
  • 0.05-0.25
  • 0.25-0.50
  • 0.50-0.75
  • 0.75 onwards
– Randomize order and ask experts to rate similarity 1-7
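The sampling design above can be sketched as a stratified draw. The sketch assumes one patent is drawn per Jaccard range for each baseline patent, treats the ranges as half-open intervals (with 1.0 included in the last one), and uses hypothetical names throughout; scores between 0 and 0.05 fall in no range and are skipped.

```python
import random

# Jaccard ranges from the slide; the first is the exact value 0.00.
BINS = [(0.0, 0.0), (0.05, 0.25), (0.25, 0.50), (0.50, 0.75), (0.75, 1.0)]

def bin_of(score):
    """Index of the range containing score, or None if it falls in no range."""
    if score == 0.0:
        return 0
    for i, (lo, hi) in enumerate(BINS[1:], start=1):
        is_last = i == len(BINS) - 1
        if lo <= score < hi or (is_last and score == hi):
            return i
    return None

def sample_for_expert(scores, rng):
    """scores: dict candidate_id -> Jaccard with one baseline patent.
    Returns one randomly chosen candidate per non-empty range, in random order."""
    by_bin = {}
    for pid, score in scores.items():
        b = bin_of(score)
        if b is not None:
            by_bin.setdefault(b, []).append(pid)
    picks = [rng.choice(sorted(candidates)) for _, candidates in sorted(by_bin.items())]
    rng.shuffle(picks)  # hide the similarity ordering from the expert
    return picks

# Toy candidates for one baseline patent (made-up scores).
toy_scores = {"a": 0.0, "b": 0.1, "c": 0.3, "d": 0.6, "e": 0.9, "f": 0.02}
toy_picks = sample_for_expert(toy_scores, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which is convenient when the same patent list has to be regenerated for an expert.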
![Page 30: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/30.jpg)
Validation: expert assessment
![Page 31: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/31.jpg)
Estimate bias related to USPC
• For each of the 4.4 million patents, select three USPC-matched patents
• Three common ways of matching: approximate filing date and …
– Primary class
  • e.g., Jaffe et al. 1993
  • No match for 2% of patents
– Primary class and subclass (nested)
  • e.g., Almeida 1996
  • No match for 20% of patents
– All classes and subclasses
  • Jaccard overlap in subclasses
  • e.g., Agrawal et al. 2010
  • No match for 4% of patents
![Page 32: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/32.jpg)
Type I error – false positive matches
• Dissimilar patents, same USPC
• Low similarity
– Primary class: 0.054
– Primary class and subclass (nested): 0.092
– All classes and subclasses: 0.097
• Lower bound: % USPC matches with Jaccard = 0
– Primary class: 12%
– Primary class and subclass (nested): 4.3%
– All classes and subclasses: 4.0%
![Page 33: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/33.jpg)
Type II error – false negative matches
• Similar patents, different USPC
• Lower bound: % different USPC among patents with Jaccard index of 1
– Primary class: 22.4%
– Primary class and subclass (nested): 52.3%
– All classes and subclasses: 20.0%
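Both lower bounds are simple shares over patent pairs. The sketch below shows one way to compute them, assuming the pairs have already been formed (USPC-matched pairs for Type I, pairs with Jaccard = 1 for Type II); the function names and input structures are hypothetical, and the class labels in the toy data are made up.

```python
def type1_lower_bound(uspc_matched_pairs, keyword_sets):
    """Share of USPC-matched pairs with no keyword in common (Jaccard = 0)."""
    if not uspc_matched_pairs:
        raise ValueError("no pairs supplied")
    zero_overlap = sum(
        1 for a, b in uspc_matched_pairs if not (keyword_sets[a] & keyword_sets[b])
    )
    return zero_overlap / len(uspc_matched_pairs)

def type2_lower_bound(identical_text_pairs, uspc_of):
    """Share of pairs with identical keyword sets (Jaccard = 1)
    that carry different USPC labels."""
    if not identical_text_pairs:
        raise ValueError("no pairs supplied")
    different = sum(1 for a, b in identical_text_pairs if uspc_of[a] != uspc_of[b])
    return different / len(identical_text_pairs)

# Toy data (made-up keywords and class labels).
toy_keywords = {"a": {"x", "y"}, "b": {"z"}, "c": {"x"}, "d": {"x", "y"}}
toy_type1 = type1_lower_bound([("a", "b"), ("a", "c")], toy_keywords)
toy_type2 = type2_lower_bound([("a", "d")], {"a": "class-A", "d": "class-B"})
```

These are lower bounds because only the extreme cases are counted: pairs with a small but non-zero overlap may still be dissimilar, and pairs with a Jaccard index just below 1 may still be near-identical.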
![Page 34: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/34.jpg)
Validation: superiority of text matching over USPC
Text-matched patents are more likely to belong to the same patent family (DOCDB), to share inventor(s) and assignee(s), and to cite each other
![Page 35: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/35.jpg)
Validation: superiority of text matching over USPC
![Page 36: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/36.jpg)
Conclusions
• Text mining
– To measure patent similarity and select counterfactual control patents
– Outperforms USPC
  • Fine-grained
  • Does not rely on human classification
  • No changes over time
– Measure similarity between portfolios: aggregate keywords at portfolio level
• Bias related to USPC
– Matching on primary subclass instead of class reduces Type I but increases Type II
– Matching on all subclasses instead of primary reduces both Type I and Type II
– An unexpectedly large share of Type I and particularly Type II errors remains
• Code and data publicly available
– Java standard libraries, CSV files with cleaned words and 200 closest matches
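The portfolio-level measure mentioned in the conclusions could look like the sketch below. It assumes that "aggregate keywords" means taking the union of the keyword sets of all patents in a portfolio and then applying the same Jaccard index; the slides do not spell out the aggregation, and the function names are hypothetical.

```python
def portfolio_keywords(patent_ids, keyword_sets):
    """Union of the keyword sets of all patents in a portfolio (assumed aggregation)."""
    keywords = set()
    for pid in patent_ids:
        keywords |= keyword_sets[pid]
    return keywords

def portfolio_similarity(portfolio_a, portfolio_b, keyword_sets):
    """Jaccard index between the aggregated keyword sets of two portfolios."""
    a = portfolio_keywords(portfolio_a, keyword_sets)
    b = portfolio_keywords(portfolio_b, keyword_sets)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy portfolios (made-up keywords).
toy_keywords = {
    "p1": {"laser", "diode"},
    "p2": {"diode", "array"},
    "p3": {"laser", "mirror"},
}
toy_similarity = portfolio_similarity(["p1", "p2"], ["p3"], toy_keywords)
```

A union ignores how often a keyword recurs within a portfolio; a frequency-weighted variant would be a different design choice and is not implied by the slides.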
![Page 37: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/37.jpg)
Text mining
• Develop new measure of patent similarity based on text
• Validate new measure
– Same patent family, assignee, inventors, cite each other
– Expert assessments
• Estimate bias related to USPC
• Validate superiority over USPC
– Patent family, assignee, inventors, cite each other
– Expert assessments
![Page 38: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/38.jpg)
Text-based measure of similarity
![Page 39: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/39.jpg)
Text-based measure of similarity
• Title + abstract: Process for amplifying, detecting, and/or-cloning nucleic acid sequences, The present invention is directed to a process for amplifying and detecting any target nucleic acid sequence contained in a nucleic acid or mixture thereof. The process comprises treating separate complementary strands of the nucleic acid with a molar excess of two oligonucleotide primers, extending the primers to form complementary primer extension products which act as templates for synthesizing the desired nucleic acid sequence, and detecting the sequence so amplified. The steps of the reaction may be carried out stepwise or simultaneously and can be repeated as often as desired. In addition, a specific nucleic acid sequence may be cloned into a vector by using primers to amplify the sequence, which contain restriction sites on their non-complementary ends, and a nucleic acid fragment may be prepared from an existing shorter fragment using the amplification process
• 52 unique keywords: acid act addition amplification amplified amplify amplifying carried cloned complementary comprises contained desired detecting directed ends excess existing extending extension form fragment invention mixture molar non-complementary nucleic oligonucleotide prepared present primer primers process products reaction repeated restriction separate sequence sequencesthe shorter simultaneously sites specific steps stepwise strands synthesizing target templates treating vector
![Page 40: Borner-Data analytics for science and innovation](https://reader036.fdocuments.net/reader036/viewer/2022081521/586f9d311a28abcc238b5d45/html5/thumbnails/40.jpg)
Validation: superiority of text matching over USPC