The Research Intelligence Project
California Institute for Telecommunications and Information Technology (Calit2)
Jerry Sheehan, Chief of Staff
June 25th, 2010
SemTech 2010
Outline
• Our Problem
• The Research Intel Tools
• Semantic Data Evolution
• Future Directions
• Concluding Thoughts
My Bias
Image Courtesy of Matt Jones, Creative Commons License, Flickr (blackbeltjones)
PreferFoundElsewhere
Topic I
Our Problem
Who Are We?
What Do We Do?
The Standard “Completed” Faculty Profile
Dr. H’s
Dr H.
drh@edu
Different Way to Think About Our Problem
Image Courtesy of Scott Granneman, Creative Commons License, Flickr (rsgranne)
Topic II
Tools We Developed
How Research Universities Look
At Their Business Data
Image Courtesy of HA! Designers, Creative Commons License, Flickr (artbyheather)
How We Could Look At Our Data
Logo Design by Kyle Bowen, http://www.educause.edu/Community/MemDir/Profiles/KyleBowen/58744
Research Intelligence Platform Development
2005 2006 2007 2008 2009 2010
Idea, Proof of Concept, Alpha/Beta for Calit2, Beta for Others, Production for Campus, New Domains
# of Users: 460, 250, 300, 480, 900 Faculty, 71 Companies
Research Intelligence Development in Web History Timeline
2005: Topic Modeling of Researchers
Initial Site Developed by David Newman with Direction from Padhraic Smyth, University of California, Irvine
2005: The Topic Modeling Proof of Concept
http://datalab-1.ics.uci.edu/calit2/
Conceptual Challenges with 2005 Model
NLP Algorithm Human Intervention
Discipline Bias
The Folksonomy vs Taxonomy Debate
Felis Bengalensis
Bengal Cat
Folksonomy
• Cat
• Bengal Cat
• F6
• Leopard
• Hybrid
• Nikita
Taxonomy
• Kingdom: Animalia
• Phylum: Chordata
• Class: Mammalia
• Order: Carnivora
• Family: Felidae
• Genus: Felis
• Species: Bengalensis
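The contrast on this slide can be sketched as two data structures. This is only an illustration, not the project's actual model: the function names `ancestors` and `shares_tag` are hypothetical. A folksonomy is treated as a flat set of free-form tags, a taxonomy as an ordered path of ranks.

```python
# Folksonomy: a flat, unordered set of free-form tags with no relations between them.
folksonomy = {"Cat", "Bengal Cat", "F6", "Leopard", "Hybrid", "Nikita"}

# Taxonomy: an ordered path of (rank, value) pairs, most general first.
taxonomy = [
    ("Kingdom", "Animalia"),
    ("Phylum", "Chordata"),
    ("Class", "Mammalia"),
    ("Order", "Carnivora"),
    ("Family", "Felidae"),
    ("Genus", "Felis"),
    ("Species", "Bengalensis"),
]


def ancestors(path, rank):
    """Return every (rank, value) pair that sits above the given rank."""
    ranks = [r for r, _ in path]
    return path[: ranks.index(rank)]


def shares_tag(tags_a, tags_b):
    """Folksonomies can only be compared by case-insensitive tag overlap."""
    return bool({t.lower() for t in tags_a} & {t.lower() for t in tags_b})
```

A taxonomy can answer "what is this a kind of?" through `ancestors`, while a folksonomy can only report whether two items happen to share a label.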
Manual Tagging Experiment
• A three-person team examined one university-affiliated web page for affiliated faculty and associated at least three keywords with each person.
• No controlled vocabulary was used; instead, a narrative question focused the manual tagging:
• What type of research does this person primarily do?
• Created a SQL database of all UCSD-affiliated academic researchers.
Unfiltered Tags: Automated Extraction
1. ucsd (157)
2. email (117)
3. university of california san diego (112)
4. sdsc (55)
5. contact (50)
6. california san diego (47)
7. professor (44)
8. university of california (44)
9. computer science (36)
10. mail (36)
11. edu (34)
12. wireless (31)
13. telecommunications (31)
14. california institute (28)
15. photonics (27)
16. physics (26)
17. signal processing (23)
18. visualization (22)
19. computer engineering (22)
20. bioinformatics (21)
21. capsule bio (21)
22. nanotechnology (19)
23. uc san diego (19)
24. sensors (18)
25. scripps institution of oceanography (18)
26. information technology (17)
27. ucsd faculty (17)
28. structural engineering (16)
29. associate professor (16)
30. electrical engineering (16)
31. department of computer science (16)
32. cse (16)
33. responsphere (16)
34. computational biology (15)
35. adjunct professor (15)
36. algorithms (15)
37. nsf (14)
38. networking (14)
39. digital signal processing (14)
40. geophysics (14)
41. (14)
42. california institutes (14)
43. information technology staff (14)
44. cwc (13)
45. san diego supercomputer center (13)
46. biology (13)
47. cognitive science (13)
48. information theory (13)
49. optical networking (13)
50. mit (13)
Filtered Tags: Automated Extraction
1. wireless (31)
2. telecommunications (31)
3. photonics (27)
4. physics (26)
5. signal processing (23)
6. visualization (22)
7. computer engineering (22)
8. bioinformatics (21)
9. nanotechnology (19)
10. sensors (18)
11. information technology (17)
12. structural engineering (16)
13. electrical engineering (16)
14. responsphere (16)
15. computational biology (15)
16. algorithms (15)
17. nsf (14)
18. networking (14)
19. digital signal processing (14)
20. geophysics (14)
21. (14)
22. cwc (13)
23. san diego supercomputer center (13)
24. biology (13)
25. cognitive science (13)
26. information theory (13)
27. optical networking (13)
28. computer (13)
29. san diego supercomputer (13)
30. supercomputing (12)
31. communications (12)
32. embedded systems (12)
33. semiconductors (11)
34. networks (11)
35. biochemistry (11)
36. pharmacology (11)
37. systems biology (11)
38. chemistry (11)
39. neural networks (11)
40. computer vision (11)
41. http (11)
42. journal of geophysical research (11)
43. music (10)
44. integrated circuits (10)
45. vlsi (10)
46. information storage (10)
47. artificial intelligence (10)
48. engineering (10)
49. engineering university (10)
50. rescue (10)
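The slides do not say how the filtered list was produced. One plausible approach, sketched here purely as an assumption, is a stoplist of institutional names, titles, and contact boilerplate applied to the raw tag counts. `STOP_TERMS` and `filter_tags` are hypothetical names, and the stoplist contains only terms that appear in the unfiltered list but not in the filtered one.

```python
# Hypothetical stoplist: terms present in the unfiltered list but absent from
# the filtered one. The real filter used by the project is not documented here.
STOP_TERMS = {
    "ucsd", "email", "university of california san diego", "sdsc", "contact",
    "california san diego", "professor", "university of california",
    "computer science", "mail", "edu",
}


def filter_tags(counts, stop_terms=STOP_TERMS):
    """Drop stoplisted tags, then rank the rest by descending count.

    `counts` is a list of (tag, count) pairs. The sort is stable, so tags
    with equal counts keep their original relative order.
    """
    kept = [(tag, n) for tag, n in counts if tag.lower() not in stop_terms]
    return sorted(kept, key=lambda item: -item[1])


# The first thirteen entries of the unfiltered list.
unfiltered = [
    ("ucsd", 157), ("email", 117), ("university of california san diego", 112),
    ("sdsc", 55), ("contact", 50), ("california san diego", 47),
    ("professor", 44), ("university of california", 44),
    ("computer science", 36), ("mail", 36), ("edu", 34),
    ("wireless", 31), ("telecommunications", 31),
]
filtered = filter_tags(unfiltered)
```

With this sample, only "wireless" and "telecommunications" survive, matching the top of the filtered slide. A stoplist alone would not explain every difference between the two lists, so treat it as a starting point.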
The Archimedes Project 2006
Importance of Value Propositions
Image: Norman Rockwell for Tom Sawyer and Huck Finn, 1935
What Researchers are Interested In
TreeMap, Federal Funding, May 2010, Data and Visualization Calit2
Really Good Government
Federal Funding Opportunities 2006
Federal Funding Opportunities Production Workflow
Research Intelligence 2007: Faculty and Funding Keywords
606 Grants, 5700 Tags
Research Intelligence 2007 Workflow
Research Intelligence Campus 2009
http://ric.ucsd.edu
Campus RI: Integrated Researcher Metadata
http://ric.ucsd.edu
Research Intelligence: The 2009 Semantic Engine
900 Users
5400 Documents
70,000 Tags
[Diagram labels: Keywords, Relevancy; Keywords, Semantics, Linked Data; Keywords, Topics; Keywords, Semantics; Keywords, Semantics]
Community Research Intelligence: New Application Thrust 2010
Topic III
Semantic Data Evolution
Research Intelligence View of Semantic Data Evolution
Closed NLP Text Mining
Few Open APIs for NLP
Initial Open APIs Semantic Services
Initial Open Linked Data Repositories
[Chart axes: Complexity (vertical) vs. Time (horizontal): 2005, 2008, 2009, 2010]
Research Intelligence: The Data, Grant Abstract
Computation is accepted as the third pillar supporting innovation and discovery in science and engineering and is central to NSF's future vision of Cyberinfrastructure Framework for 21st Century Science and Engineering (CF21)[1]. Software is an integral part of the computation paradigm and a primary modality for realizing the CF21 vision. Scientific discovery and innovation are advancing fundamentally new pathways opened by development of increasingly sophisticated software. Software is also directly responsible for increased scientific productivity and significant enhancement of researchers' capabilities. In order to nurture, accelerate and sustain this critical mode of scientific progress, NSF is establishing a new program, Software Infrastructure for Sustained Innovation (SI2), with the overarching goal of transforming innovations in research and education into sustained software resources that are an integral part of the cyberinfrastructure. SI2 is a long-term investment focused on catalyzing new thinking, paradigms, and practices in using software to understand natural, human, and engineered systems. SI2's intent is to foster a pervasive cyberinfrastructure to help researchers address problems of unprecedented scale, complexity, resolution, and accuracy by integrating computation, data, networking and experiments in novel ways. It is NSF's expectation that SI2 investment will result in robust, reliable, usable and sustainable software infrastructure that is critical to the CF21 vision and will transform science and engineering. It is expected that SI2 will generate and nurture the multidisciplinary processes required to support the entire software lifecycle and will result in the development of sustainable software communities. 
SI2 envisions vibrant partnerships among academia, government laboratories and industry for the development and stewardship of a sustainable software infrastructure that can enhance productivity and accelerate innovation in science and engineering. The goal of the SI2 program is to create a software ecosystem that includes all levels of the software stack and scales from individual or small groups of software innovators to large hubs of software excellence. The program addresses all aspects of CI, from embedded sensor systems and instruments, to desktops and high-end data and computing systems, to major instruments and facilities.
The SI2 program envisions three classes of awards:
1. Scientific Software Elements (SSE): SSE awards target small groups that will create and deploy robust software elements for which there is a demonstrated need, encapsulating innovation in science and engineering. The effort targeted by a SSE award is up to a level roughly comparable to: summer support for two investigators with complementary expertise; two graduate students; and their collective research needs (e.g. materials, supplies, travel) for three years.
2. Scientific Software Integration (SSI): SSI awards target larger groups of PIs organized around common research problems as well as common software infrastructure, and will result in a sustainable community software framework. The effort targeted by a SSI award is up to a level roughly comparable to: summer support for three to four investigators with complementary expertise; three to four graduate students; one or two senior personnel (including post-doctoral researchers, software developers, and staff); and their collective research needs (e.g., materials, supplies, travel) for three to five years. The integrative contributions of the SSI team should clearly be greater than the sum of the contributions of each individual member of the team.
3. Scientific Software Innovation Institutes (S2I2): S2I2 awards will focus on the establishment of long-term community-wide hubs of software excellence. These hubs will provide expertise, processes, resources and implementation mechanism to transform computational science and engineering innovations and community software into robust and sustained tools for enabling science and engineering. S2I2 proposals will bring together multidisciplinary teams of domains scientists and engineers, computer scientists and software engineers, technologists and educators.
The FY 2010 SI2 competition will be limited to SSE and SSI awards. The solicitation in FY 2011, and in subsequent years, will outline funding opportunities for all three classes of awards (SSE, SSI and S2I2), subject to availability of funds.
[1] http://www.nsf.gov/pubs/2010/nsf10015/nsf10015.jsp
NSF Solicitation: Software Infrastructure for Sustained Innovation
Keyword Extraction Across Sources
Term Human Yahoo KEA Calais Alchemy OAmplify
Common Software Infrastructure
Community Software
Cyberinfrastructure
Embedded Sensor
Engineering
Hubs of Scientific Innovation
Innovation
NSF
Scientific
Scientific Discovery
Scientific Software
Scientific Software Integration
Scientific Software Innovation Institutes
SI2
Software
Software Developers
Software Ecosystem
Software Elements
Software Engineers
Software Infrastructure
Software Innovators
Software Lifecycle
Software Stack
SSI
Sustainable Software
Sustained Tool
Vision
Totals: Human 12, Yahoo 3, KEA 9, Calais 15, Alchemy 20, OAmplify 10
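A comparison like this one can be quantified by measuring how much each tool's keyword set overlaps the human-assigned set. The sketch below is a generic overlap calculation, not the method used in the talk. The two example sets are hypothetical, because the per-term marks from the slide's table did not survive, and `compare_keywords` is an illustrative name.

```python
def compare_keywords(reference, extracted):
    """Compare an extractor's keywords against a reference (e.g. human) set.

    Matching is case-insensitive and exact; no stemming or fuzzy matching.
    """
    ref = {t.lower() for t in reference}
    ext = {t.lower() for t in extracted}
    shared = ref & ext
    union = ref | ext
    return {
        "shared": sorted(shared),
        # Share of the extractor's terms that the reference also chose.
        "precision": len(shared) / len(ext) if ext else 0.0,
        # Share of the reference terms that the extractor recovered.
        "recall": len(shared) / len(ref) if ref else 0.0,
        # Overlap relative to everything either side produced.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical term sets drawn from the slide's term column.
human = {"Cyberinfrastructure", "Software Infrastructure", "Scientific Software", "NSF"}
tool = {"Software Infrastructure", "Software Stack", "NSF"}
report = compare_keywords(human, tool)
```

Raw counts per tool show how much each one returns; precision and recall against the human list show how much of that output is actually useful.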
Semantic Structure Returned by Open Calais
Industry Terms: Community Software • Software Lifecycle • Sustainable Software Communities • Usable and Sustainable Software Infrastructure • Software Infrastructure • Software Stack • Software Developers • Sustainable Community Software Framework • Sustained Software Resources • Software Ecosystem • Software Excellence • Embedded Sensor Systems • Software Elements • Sustainable Software Infrastructure
Organization: National Science Foundation
Social Tags: Cyberinfrastructure • E-Science • Computing • Computer Software • Innovation • Software Engineer • Technology • Science • Technology_Internet
URL: http://www.nsf.gov/pubs/2010/nsf10015/nsf10015.jsp
http://www.opencalais.com/
Semantic Structure Returned by Alchemy API
Tags: Scientific productivity • overarching goal • graduate students • scientific discovery • 21st century science • collective research • common research problems • common software infrastructure • community software • complementary expertise • computation paradigm • cyberinfrastructure framework • entire software lifecycle • envisions vibrant partnerships • innovation computation • innovations • long-term community-wide hubs • nsf's expectation • nsf's future vision • pervasive cyberinfrastructure • pillar supporting innovation • primary modality • program envisions • researchers address problems • robust software elements • scientific progress • scientific software elements • scientific software innovation • scientific software integration • si2's intent • small groups • software elements • software excellence • software infrastructure • software innovators • software resources • sophisticated software • sse award • ssi awards • ssi team • summer support • sustainable community software • sustainable software communities • sustainable software infrastructure
Company: Scientific Software
Field Terminology: Software • Software Stack • Software Developers • Software Ecosystems • Software Engineers
Organization: NSF • SSI
Category: Science and Technology
http://www.openamplify.com/
Semantic Modeling Challenge Even with XML/DTD
HTTP://vivoweb.org
Grants.Gov Technical Support Doesn’t Like Data Questions
HTTP://vivoweb.org
Open Calais Faculty Linked Data Results
Tag | Type | Linked Data | Relevancy
National Science Foundation | Organization | http://d.opencalais.com/genericHasher-1/f7d1451f-915f-31bc-8194-b9794401ea2d.html | 52%
Software Excellence | Industry Term | http://d.opencalais.com/genericHasher-1/3da6f84d-cff9-3eec-8fce-99ea792e370c.html | 34%
Sustained Software Resources | Industry Term | http://d.opencalais.com/genericHasher-1/61a1eb6d-196d-3493-ad6c-8ea0b85ce421.html | 32%
Usable and Sustainable Software Infrastructure | Industry Term | http://d.opencalais.com/genericHasher-1/9e6fe116-e562-3753-9b93-8f938095a715.html | 31%
Software Lifecycle | Industry Term | http://d.opencalais.com/genericHasher-1/9c7876e1-a85f-307c-8b38-163c129f19f7.html | 30%
Sustainable Software Communities | Industry Term | http://d.opencalais.com/genericHasher-1/5228ac30-2bf5-397e-bc1a-04275a3f5045.html | 29%
Sustainable Software Infrastructure | Industry Term | http://d.opencalais.com/genericHasher-1/4be05ead-30cd-3c3a-bd88-5dbb8427acc9.html | 27%
Software Stack | Industry Term | http://d.opencalais.com/genericHasher-1/c22ad2e5-bd08-3083-9dc5-14945fb77010.html | 24%
Software Innovators | Industry Term | http://d.opencalais.com/genericHasher-1/eba4d676-5aa8-3b1e-83dc-c4bd91b4d0f4.html | 21%
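Results like these are easier to work with once each row is a typed record and the relevancy percentage is a number. The sketch below uses four rows from the table above. `TagResult`, `parse_row`, and `above` are hypothetical names, and the 25% cut-off is an arbitrary illustration, not a threshold taken from the talk.

```python
from typing import NamedTuple


class TagResult(NamedTuple):
    tag: str
    tag_type: str
    uri: str
    relevancy: float  # fraction between 0 and 1


def parse_row(tag, tag_type, uri, relevancy_text):
    """Convert a table row, with relevancy given as text like '52%', to a record."""
    return TagResult(tag, tag_type, uri, float(relevancy_text.rstrip("%")) / 100)


def above(results, threshold):
    """Return the tags whose relevancy meets the threshold, in input order."""
    return [r.tag for r in results if r.relevancy >= threshold]


BASE = "http://d.opencalais.com/genericHasher-1/"
RESULTS = [
    parse_row("National Science Foundation", "Organization",
              BASE + "f7d1451f-915f-31bc-8194-b9794401ea2d.html", "52%"),
    parse_row("Software Excellence", "Industry Term",
              BASE + "3da6f84d-cff9-3eec-8fce-99ea792e370c.html", "34%"),
    parse_row("Software Stack", "Industry Term",
              BASE + "c22ad2e5-bd08-3083-9dc5-14945fb77010.html", "24%"),
    parse_row("Software Innovators", "Industry Term",
              BASE + "eba4d676-5aa8-3b1e-83dc-c4bd91b4d0f4.html", "21%"),
]
strong = above(RESULTS, 0.25)
```

Keeping the linked-data URI on each record means a tag that passes the cut-off can still be dereferenced later.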
Open Calais Linked Data Examples
National Science Foundation
Software Excellence
Zemanta Linked Data Examples from Grant Abstract
Linking to Freebase Via API from Grant Abstract
Are Faculty Data Objects Yet? Depends on Their Popularity
My Boss is 32 Triples
Faculty Web Page
Open Calais Faculty Linked Data Example
Tag | Type | Linked Data | Relevancy
Lo Research Group | Company | http://d.opencalais.com/comphash-1/2cf74602-005c-3d32-a184-4bc49ef2d5f2.html | 50%
California Institute | Facility | http://d.opencalais.com/genericHasher-1/37ab20cd-0681-3775-bf97-7583b4ec1434.html | 46%
X@ece.ucsd.edu | EmailAddress | http://d.opencalais.com/genericHasher-1/babf08c8-1f57-3b99-b020-7e0dd8eaf1fc.html | 31%
California Institute for Telecommunications | Organization | http://d.opencalais.com/genericHasher-1/6a1fba6f-cf57-300b-94fc-f36d027c8ff0.html | 31%
858-xxx-xxxx | PhoneNumber | http://d.opencalais.com/genericHasher-1/e8e3ad15-ace3-3616-be5a-ae9038bc0678.html | 31%
858-xxx-xxxx | PhoneNumber | http://d.opencalais.com/genericHasher-1/5228ac30-2bf5-397e-bc1a-04275a3f5045.html | 31%
Information Technology | Technology | http://d.opencalais.com/genericHasher-1/a0f02cf0-dc13-3b0f-a139-5509b026bd96.html | 31%
optoelectronic devices | Industry Term | http://d.opencalais.com/genericHasher-1/7f81f0c9-b94f-3959-b35b-67be2f703ab4.html | 29%
International Business Machines | Company | http://d.opencalais.com/er/company/ralg-tr1r/9e3f6c34-aa6b-3a3b-b221-a07aa7933633.html | 6%
Open Calais Linked Data Examples
Calit2
Open Calais Linked Data Examples
IBM
Zemanta Linked Data Results
Tag | Linked Data | Confidence
Integrated Circuits | wikipedia: Integrated circuit | 0.65
UC Berkeley | geolocation: University of California, Berkeley; homepage: University of California, Berkeley; wikipedia: University of California, Berkeley | 0.64
Information Technology | wikipedia: Information technology | 0.63
Calit2 | geolocation: California Institute for Telecommunications and Information Technology; wikipedia: California Institute for Telecommunications and Information Technology | 0.60
Almaden Research Center | geolocation: IBM Almaden Research Center; wikipedia: IBM Almaden Research Center | 0.59
Age related Macular Degeneration | wikipedia: Macular degeneration | 0.59
Minimally Invasive Surgery | wikipedia: Invasiveness of surgical procedures | 0.58
Cancer | http://en.wikipedia.org/wiki/Cancer | 0.57
Cornell | geolocation: Cornell University; homepage: Cornell University; wikipedia: Cornell University; youtube: Cornell University | 0.57
Fluorescence Activated Cell Sorter | wikipedia: Flow cytometry | 0.57
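Each Zemanta result pairs a tag with one or more "source: title" links. The sketch below splits such a link string into a dictionary and derives a Wikipedia URL from the title. It makes two assumptions: that individual links are separated by semicolons, and that a Wikipedia title maps to a URL by replacing spaces with underscores, as in the Cancer URL in the table. `parse_links` and `wikipedia_url` are hypothetical helpers, not part of any Zemanta API.

```python
WIKIPEDIA_BASE = "http://en.wikipedia.org/wiki/"


def parse_links(text):
    """Split 'source: title; source: title' into a {source: title} dict.

    Assumes a semicolon separates links. An entry that is already a full
    URL (such as the Cancer row) should be handled separately, because its
    'http' scheme would be mistaken for a source name.
    """
    links = {}
    for part in text.split(";"):
        source, _, title = part.partition(":")
        if title.strip():
            links[source.strip()] = title.strip()
    return links


def wikipedia_url(links):
    """Build a Wikipedia URL from the 'wikipedia' entry, or return None."""
    title = links.get("wikipedia")
    if title is None:
        return None
    return WIKIPEDIA_BASE + title.replace(" ", "_")


almaden = parse_links(
    "geolocation: IBM Almaden Research Center; wikipedia: IBM Almaden Research Center"
)
almaden_url = wikipedia_url(almaden)
```

Tags without a `wikipedia` entry return `None`, which makes it easy to count how many extracted tags actually resolve to a linked-data target.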
Zemanta Linked Data Examples
Integrated Circuit Calit2
The Linked Data Cloud
Linked Data and a Wikipedia Base
Source: Jeremy Hsu, “Wikipedia: How Accurate is it?” November 2009, Live Science, http://www.livescience.com/technology/091106-ttr-wikipedia.html#comments
Wikipedia: How Accurate?
Is It A Problem?
SOURCE: USA Today, November 29, 2005
John S., Is a Possible Assassin of, John K
Maybe Not?
SOURCE: PHARMANEWS.EU, January 23, 2009
How Important is Validity to Researchers?
Topic IV
Future Directions?
Life Sciences Example
http://www.collexis.com/
SciVal From Elsevier
http://www.scival.com/
SciVal Terms And Conditions
http://www.scival.com/terms-and-conditions
The Value of Your Data
http://www.turbulence.org/Works/swipe/calculator.html
Beginning to Have Data Portability Policies...for Sites
http://portabilitypolicy.org:80/sample-policies.html
Future Direction for Semantic Academic Communities?
HTTP://vivoweb.org
Emerging/Growing Semantic Catalog
http://www.data.gov/semantic/catalog
Example: DOE Awards Semantic Catalog
http://www.data.gov/semantic/catalog
Topic V
Conclusions
Words of Wisdom
“One of the Most Important Things I Learned is What Not to Pay Attention To”
John Wooley
Are the Semantic Web and Linked Data This?
Image Courtesy of Alan Vernon, Creative Commons License, Flickr (alanvernon)
Or Are the Semantic Web and Linked Data This?
Image Courtesy of Vince Huang, Creative Commons License, Flickr (vincehuang)