www.tugraz.at
Wissen • Technik • Leidenschaft
Science 2.0 VU: E-Science, E-Infrastructures, Content/Data Mining
WS 2014/15
Elisabeth Lex, KTI, TU Graz
Agenda
• Repetition from last time: altmetrics
• E-Science
• E-Infrastructures
• Content/Data Mining
Altmetrics (repetition)
"Altmetrics is the creation and study of new metrics based on the Social Web for analyzing and informing scholarship" – Altmetrics Manifesto, http://altmetrics.org/about
• Aggregated from many sources (e.g. Twitter, Mendeley, GitHub, SlideShare, ...)
• Article Level Metrics (ALM): a multidimensional suite of transparent and established metrics at the article level
Examples for Altmetrics sources (repetition)
• Usage: views, downloads, ...
• Captures: bookmarks, readers, ...
• Mentions: blog posts, news stories, Wikipedia articles, comments, reviews
• Social Media: tweets, Google+, Facebook likes, shares, ratings
• Citations: Web of Science, Scopus, Google Scholar, ...
Examples: Altmetric.com
Source: http://www.altmetric.com/details.php?domain=www.altmetric.com&citation_id=843656
Lessons learned (repetition)
• Alternative ways to assess the impact of various scientific outputs
• No common understanding of altmetrics yet
  • What do they really express?
  • Are they useful, and for which part of the research process?
• Not necessarily "better" metrics
  • E.g. gamification
• Can help to get an overview of a research field
  • Visualizations based on altmetrics
e-Science, e-Infrastructures, Content Mining
Modern Science: What has changed?
• 150 years later: searching for new particles like the Higgs boson with the Large Hadron Collider
• Built in collaboration with over 10,000 scientists and engineers from over 100 countries and hundreds of universities and laboratories; housed in a tunnel 27 km in circumference, 175 m deep, near Geneva
Motivation
• The Internet and science disciplines (e.g. physical sciences, biological sciences, medicine, and engineering) generate large and complex datasets (Big Data)
• These require more advanced database and architectural support
• A "new kind of research methodology" has emerged: the fourth paradigm of scientific exploration (Hey, 2007)
• Based on statistical exploration of large amounts of data
→ Led to e-Science
http://www.ksi.mff.cuni.cz/astropara/
e-Science
• Large-scale science (since 1999)
• Data-driven discovery
• Focus on computationally intensive science and how to tackle it using highly distributed environments
• Powerful computers: supercomputers, High Performance Computing (HPC), Grid, ...
• Distributed Computing
• Powerful research infrastructures – "e-infrastructures", grids, clouds
http://www.anandtech.com/show/6421/inside-the-titan-supercomputer-299k-amd-x86-cores-and-186k-nvidia-gpu-cores/3
Supercomputers
http://www.top500.org/lists/2014/06/
http://www.wikihow.com/Build-a-Supercomputer
• Large, expensive systems, usually housed in a single room, in which multiple processors are connected by a fast local network
• Suited for highly complex, real-time applications and simulations
Pros: data can move between processors rapidly → all processors can work together on the same task
Cons: expensive to build and maintain; do not scale well, e.g. adding more processors is challenging
Distributed Computing
• systems in which processors are not necessarily located in close proximity to one another—and can even be housed on different continents—but which are connected via the Internet or other networks
• Pros: much less expensive than supercomputers
• Cons: lower speed than supercomputers
Example: Hadoop
• Ecosystem of tools for processing big data
• Simple computational model: a two-stage method for processing large amounts of data
• Design an algorithm that operates on one chunk of the data in two stages (a Map and a Reduce stage); MapReduce automatically distributes that algorithm across the cluster → complexity is hidden in the framework
http://hadoop.apache.org
http://architects.dzone.com/articles/how-hadoop-mapreduce-works
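The Map/Reduce split above can be illustrated with the classic word-count example. A minimal single-machine sketch in Python (Hadoop would run the map stage on many chunks in parallel across the cluster and shuffle the emitted pairs by key before reducing):

```python
from collections import defaultdict

def map_phase(chunk):
    """Map stage: emit a (word, 1) pair for every word in one chunk of the data."""
    for word in chunk.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce stage: sum the counts emitted for each distinct key (word)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# On a real cluster, Hadoop runs map_phase on many chunks in parallel,
# shuffles the pairs by key, and reduces per key; here we simulate that
# sequentially on two toy chunks.
chunks = ["big data big science", "big data"]
pairs = [p for c in chunks for p in map_phase(c)]
counts = reduce_phase(pairs)
print(counts)  # {'big': 3, 'data': 2, 'science': 1}
```

The point of the framework is that only these two small functions are problem-specific; distribution, fault tolerance and the shuffle are handled by Hadoop.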
Hadoop in eScience: Example: Astronomical Image Processing
• Large telescopes survey sky over a prolonged period of time.
• Large Synoptic Survey Telescope (LSST): under construction; will capture half of the sky over 10 years; 30 TB of data every night; ~60 PB in 10 years
• Astronomers pick out faint objects for study by capturing multiple images of the same area and combining them – "coaddition"
• Challenge: how to organize and process all the resulting data.
http://www.lsst.org/lsst/
Using Hadoop to help with image coaddition
http://escience.washington.edu/get-help-now/astronomical-image-processing-hadoop
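At its core, coaddition stacks multiple exposures of the same sky area and averages them pixel by pixel, so random noise cancels while faint sources accumulate. A toy sketch in plain Python (a real pipeline would first align the images, and, as in the example above, distribute the work over image tiles with Hadoop):

```python
def coadd(images):
    """Coaddition: average multiple exposures of the same sky area pixel by
    pixel so that random noise cancels and faint objects emerge.
    `images` is a list of equally sized 2-D lists of pixel intensities."""
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / len(images)
             for c in range(cols)]
            for r in range(rows)]

# Three noisy 2x2 exposures of the same patch of sky (toy values)
exposures = [
    [[1.0, 0.0], [0.0, 3.0]],
    [[3.0, 0.0], [0.0, 3.0]],
    [[2.0, 0.0], [0.0, 3.0]],
]
stacked = coadd(exposures)
print(stacked)  # [[2.0, 0.0], [0.0, 3.0]]
```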
Example: Big Data in Science - European Exascale Projects
http://exascale-projects.eu
Exascale computing: computers capable of at least one exaflops (10^18 floating point operations per second) → not yet achieved; current systems are at ~10^15
Virtual Science Environments
• Not only HPC: sharing of knowledge and data is also becoming a requirement for scientific discovery
• Provide useful mechanisms to facilitate this sharing
• Preserve and organize research data
→ Virtual Science Environments: "virtual environments in which researchers work together through ubiquitous, trusted and easy access to services for scientific data, computing and networking, enabled by e-Infrastructures"
Defining e-Infrastructures
European e-Infrastructure Reflection Group (e-IRG):
'The term e-Infrastructure refers to this new research environment in which all researchers – whether working in the context of their home institutions or in national or multinational scientific initiatives – have shared access to unique or distributed scientific facilities (including data, instruments, computing and communications), regardless of their type and location in the world.'
http://www.e-irg.eu/about-e-irg.html
e-Infrastructures - Goals
• Opening access to knowledge through reliable, distributed and participatory data e-infrastructures
• Cost effective infrastructures for preservation and curation for re-use of data
• Persistent availability of information and linking people and data through flexible and robust digital identifiers
• Interoperability for consistency of approaches on global data exchange (e.g. standards)
• Enabling trust through authentication and authorisation mechanisms
http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/framework-for-action-in-h2020_en.pdf
Example: e-Infrastructure OpenAIRE
• The European Open Access Data Infrastructure for Scholarly and Scientific Communication
• Functionality:
  • Harvesting and storing of information about publications from various repositories (via OAI-PMH)
  • Enables searching for publications and related info (e.g. funding, ...)
  • Provides a list of OA repositories that can be used to store publications
  • Orphan repository
  • Shows statistics of the stored data
https://www.openaire.eu
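OAI-PMH, the protocol OpenAIRE harvests with, is plain HTTP: the harvester sends GET requests carrying a `verb` parameter and receives XML records back. A minimal sketch of building such a request (the endpoint URL is a placeholder, not a real repository):

```python
from urllib.parse import urlencode

def oai_request_url(base_url, verb, **kwargs):
    """Build an OAI-PMH request URL: a GET request with a 'verb'
    parameter plus verb-specific arguments; the response is XML."""
    params = {"verb": verb, **kwargs}
    return base_url + "?" + urlencode(params)

# List Dublin Core records from a hypothetical repository endpoint:
url = oai_request_url("http://example.org/oai", "ListRecords",
                      metadataPrefix="oai_dc")
print(url)  # http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

A harvester then pages through the XML responses (via the protocol's resumption tokens) and stores the metadata locally, which is what enables the search and statistics features above.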
OpenAIRE - Applications
Example: e-Infrastructures Austria 1/2
http://www.e-infrastructures.at
Example: e-Infrastructures Austria 2/2
e-Science, e-Infrastructures, Content Mining
Content Mining - Motivation
• Make science discoverable
• Extract facts for research
• Build reusable objects
• Aggregate
• Create new businesses
• Check for errors ⇒ better science
Content Mining
→ To extract, process and republish content manually or by machine
• Content: can be text, numerical data, static images, videos, audio, metadata, bibliographic data or any digital information, and/or a combination of them all → all types of information
• Mining: large-scale extraction of information from target content
http://access.okfn.org/2014/03/27/what-is-content-mining/
Data Mining
Content Mining vs. Data Mining → content mining is more generic
Content Mining Problems
• Secondary publishers create walled gardens, e.g. the ResearchGate portal
• Publishers' contracts ban content mining
• Publishers may cut off universities who mine
• Publishers lobby governments to require "licences for content mining"
• UK → "the right to read is the right to mine"
http://blogs.ch.cam.ac.uk/pmr/2013/10/02/text-and-data-mining-fighting-for-our-digital-future-peter-murray-rust-is-the-problem/
Example: ContentMine
http://contentmine.org
Idea:
• Facts cannot be copyrighted
• Billions of facts in copyright-protected research articles
→ Make them publicly accessible!
Content to mine in a scientific paper
[Figure: annotated article page highlighting extractable facts such as date, researcher, resource]
Machine Extraction of Scientific Facts
1. Crawl scientific literature
2. Scrape each scientific article
3. Extract facts
4. Index
5. Republish (WikiData)
https://github.com/ContentMine
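Step 3 can be approximated in a few lines: a toy fact extractor that pulls dates and simple "X was discovered by Y"-style statements out of text with regular expressions. The real ContentMine tools use proper scrapers and NLP, but the pipeline shape is the same:

```python
import re

def extract_facts(text):
    """Toy fact extractor: find years and simple 'X was discovered by Y'
    statements with regular expressions. Real systems use far more robust
    extraction, but produce the same kind of (type, value) facts."""
    facts = []
    for year in re.findall(r"\b(1[5-9]\d{2}|20\d{2})\b", text):
        facts.append(("year", year))
    for subj, obj in re.findall(r"(\w+) was discovered by ([\w\s.]+?)[.,]", text):
        facts.append(("discovered_by", subj, obj))
    return facts

text = "Penicillin was discovered by Alexander Fleming, in 1928."
facts = extract_facts(text)
print(facts)
# [('year', '1928'), ('discovered_by', 'Penicillin', 'Alexander Fleming')]
```

Steps 4 and 5 would then index these tuples and push them to an open database such as WikiData.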
Example: retrieve metadata for specific article
Example: Measuring quality of Wikipedia
Elisabeth Lex, Michael Voelske, Marcelo Errecalde, Edgardo Ferretti, Leticia Cagnina, Christopher Horn, Benno Stein, and Michael Granitzer. 2012. Measuring the quality of web content using factual information. In Proceedings of WebQuality '12 at WWW '12.
(a) Unbalanced (b) Balanced
Figure 1: Histograms of Wikipedia corpora for the unbalanced dataset and the balanced dataset.

is the word count of t, and t is a Wikipedia article. The same holds for "Factual-density/sentence-count".

The word count measure outperforms the factual density measure normalized to sentence count as well as to word count on the unbalanced corpus. Apparently, word count is a strong feature on the unbalanced corpus.

We then evaluated the factual density measure on the balanced corpus, where both featured/good and non-featured articles are more similar with respect to document length. The results for this experiment are shown in Figure 2(b) as precision-recall curves. On the balanced corpus, factual density normalized to sentence count as well as to word count performs much better than on the unbalanced corpus, while word count, as expected, performs worse. There is not much difference between the normalization to word or sentence count since here, the number of words per document has a smaller influence on the result.

We also analyzed the distributions of featured/good and non-featured articles if factual density is used as measure, as depicted in Figure 3. We found that the distribution of the featured/good articles is clearly separated from the distribution of the non-featured articles, with peaks at two different factual density values (0.06 and 0.03, respectively). This finding is in contrast to the fact that the distributions of featured/good articles and non-featured articles have a high degree of overlap if word count is used, as shown in Figure 1(b). Consequently, on the balanced corpus, factual density clearly outperforms our baseline word count.

Figure 3: Distribution of articles by factual density.

In a related experiment, we investigated the relational information contained in the binary relationships ReVerb extracts from sentences. We used the relations, i.e. only the predicates from the extracted triples, as a vocabulary to represent the documents. We then tested the discriminative power of these features by training a classifier to solve the binary classification problem of distinguishing featured/good from non-featured articles. The results reported in Table 1 were obtained using the WEKA (http://www.cs.waikato.ac.nz/~ml/weka/) implementation of a Naive Bayes classifier in combination with feature selection based on Information Gain (IG). From 40,000 relations, we selected the 10% best features in terms of IG. We achieved similar results for both corpora.

Table 1: Classification results using relational features on both corpora.

Measure     Unbalanced [%]   Balanced [%]
Accuracy    84.01            87.14
F-Measure   84               86.7
Precision   84               89.2
Recall      84               87.1

Apparently, relational features are more robust when the document length varies. However, we need to investigate this in more detail.
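The central measure of the excerpt reduces to a one-line formula: factual density is the number of extracted facts normalized by document length. A sketch (the toy numbers are chosen to match the distribution peaks of 0.06 and 0.03 reported above):

```python
def factual_density(num_facts, num_words):
    """Factual density: the number of extracted facts (e.g. ReVerb triples)
    normalized by document length, so that long articles are not favoured
    simply for containing more text."""
    return num_facts / num_words

# A featured-quality article tends to pack more facts per word than a
# non-featured one (toy counts, matching the peaks in Figure 3):
featured = factual_density(num_facts=60, num_words=1000)      # 0.06
non_featured = factual_density(num_facts=30, num_words=1000)  # 0.03
print(featured, non_featured)
```

Normalizing by sentence count instead simply swaps the denominator, which, per the excerpt, makes little difference on the balanced corpus.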
Possible questions for content mining
• Find references to papers by a given author. This is metadata and therefore factual. It is usually trivial to extract references and authors; more difficult, of course, to disambiguate them.
• Find papers about Science 2.0 in German. Highly tractable. A typical approach would be to find the 50 commonest words (e.g. "ein", "das", ...) in a paper and show that their frequency is very different from English ("one", "the", ...)
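The common-words approach from the second bullet can be sketched as follows, with tiny hand-picked stopword lists (a real implementation would use the ~50 commonest words of each language):

```python
from collections import Counter

def looks_german(text, n=5):
    """Crude language check: compare the text's most frequent words
    against small German and English stopword lists and see which
    list they overlap with more."""
    german = {"der", "die", "das", "und", "ein", "ist", "nicht", "mit"}
    english = {"the", "and", "one", "a", "is", "not", "of", "to"}
    common = [w for w, _ in Counter(text.lower().split()).most_common(n)]
    g = sum(w in german for w in common)
    e = sum(w in english for w in common)
    return g > e

print(looks_german("das ist ein Test und das ist ein weiterer Test"))  # True
print(looks_german("the quick brown fox and the lazy dog"))            # False
```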
Example: Facilitate exploratory search in social bookmarking sites
• By topic of interest
• Setting: social bookmarking dataset, URLs described by tags; dataset size: 61,665 posts (~430,000 triples)
• Research Questions:
  • What groups of interests exist?
  • Are they somehow related?
  • How do they evolve over time?
Approach
Take away message
• e-Science: data-driven, large-scale science
  • Supercomputers and distributed computing
  • Virtual research environments
  • e-Infrastructures
• Mining content/data in large repositories
  • E.g. fact extraction
  • E.g. exploratory analysis of large datasets
  • Find groups of interest expressed by user-generated tags, and their relations
Your Assignment!
Assignment 1/2
• Implementation (50%)
  1. Compute altmetrics (25 pts)
    • Use rOpenSci to first search arxiv.org for papers related to a topic of your choice and then retrieve their altmetrics with rAltmetric (http://ropensci.org)
      • Result: list of 10 DOIs from arxiv.org with the corresponding altmetrics from altmetric.com (10 pts)
    • Plot and interpret the results
      • Result: plot and textual interpretation (15 pts)
  2. Use the #altmetrics14 Twitter collection (25 pts): http://figshare.com/articles/An_altmetrics14_Twitter_Archive/1151577
    • Extract mentions of users (user A mentions user B in a tweet)
      • Result: table userid | userscreenname | mentions (10 pts)
    • Plot mentions in a matrix
      • Result: plot and textual interpretation (15 pts)
Assignment 2/2
• Report (25 pts)
  • Collect related work in the Mendeley group (tag it with your name) (5 pts)
  • Upload your paper and your source code to Mendeley (tag it with submission_yourname, e.g. submission_xyz) (5 pts)
  • Write a scientific paper (4 pages) (15 pts)
• Presentation (25 pts): present your paper in class
  • Motivate the work you have done (e.g. why altmetrics) in 1 slide
  • Present your results and how you got them
  • Bonus points: present your own ideas for the Twitter dataset and how you would tackle them → further bonus points if you implement them :)
Part 1: Compute altmetrics
Short intro into R
• The R Project for Statistical Computing: http://www.r-project.org
• Free software environment for statistical computing and graphics
• Classification, clustering, statistical tests, time-series analysis, ...
• Simple way to produce "publication-ready" plots
• Available for Windows, Unix and OS X via a CRAN mirror
The package rAltmetric
• Package that enables retrieving altmetric data from altmetric.com for publications
• Altmetric tracks what people are saying about papers online on behalf of publishers, authors, libraries and institutions
• http://cran.r-project.org/web/packages/rAltmetric/
• http://ropensci.github.io/rAltmetric/
• 2 major functions:
  • altmetrics() - download metrics
  • altmetric_data() - extract data
• Plus: functions to plot/print metrics
Example
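rAltmetric's altmetrics() function queries the public Altmetric API over HTTP. For orientation, a rough Python sketch of the same request; the endpoint and response keys follow the public v1 API as far as I know, and the assignment itself should of course use the R package:

```python
from urllib.request import urlopen
import json

# Free, rate-limited details endpoint of the public Altmetric API (v1)
ALTMETRIC_API = "https://api.altmetric.com/v1/doi/"

def altmetric_url(doi):
    """Build the Altmetric details URL for a DOI - roughly what
    rAltmetric's altmetrics() queries under the hood."""
    return ALTMETRIC_API + doi

def fetch_altmetrics(doi):
    """Download the metrics for one DOI and return them as a dict
    (response keys include e.g. 'score' - an assumption based on
    the public API docs)."""
    with urlopen(altmetric_url(doi)) as resp:
        return json.load(resp)

# e.g. fetch_altmetrics("10.1038/480426a")["score"]  (requires network)
print(altmetric_url("10.1038/480426a"))
```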
Part 2: Use #altmetrics14 Twitter collection
Analysis of Twitter dataset
• Determine impact of users at a scientific conference • Extract mentions of users (user A mentions user B
in tweet) • Plot mentions in matrix and interpret results
• https://www.miskatonic.org/2013/02/22/one-last-c4l13-tweet-thing-who-mentioned-whom/
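The mention extraction and matrix can be sketched as follows, on hypothetical toy tweets (the assignment uses the author and text fields of the #altmetrics14 archive):

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")

def mention_matrix(tweets):
    """Build the 'who mentioned whom' counts from (author, text) pairs:
    matrix[a][b] = how often user a mentioned user b."""
    matrix = defaultdict(lambda: defaultdict(int))
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            matrix[author][mentioned.lower()] += 1
    return matrix

# Toy data; screen names and texts are made up
tweets = [
    ("alice", "Great talk by @bob on #altmetrics14!"),
    ("alice", "Agree with @bob and @carol"),
    ("bob",   "Thanks @Alice!"),
]
m = mention_matrix(tweets)
print(m["alice"]["bob"], m["bob"]["alice"])  # 2 1
```

The nested dict is exactly the matrix to plot, e.g. as a heatmap with rows = mentioning users and columns = mentioned users.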
Write the scientific paper about your work
• 4 pages, Springer LNCS format: http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0
• Structure of your paper:
  • Abstract (= a short, complete summary of the paper with key findings)
  • Introduction and Related Work (describes the theoretical background, indicates why the work is important, states a research question)
  • Experiments and Results
  • Conclusion
  • References
Presentation
• Workshop style: presentation and discussion
• Present your work in max. 10 min + 5 min for questions from the audience
• No exam situation :)
• Mandatory attendance, though
No plagiarism allowed!! But: Open Science – so if you use the work of others, cite it properly; if you use the work of your colleagues, cite them and give them credit!