www.tugraz.at
Wissen • Technik • Leidenschaft
Science 2.0 VU: E-Science, E-Infrastructures, Content/Data Mining
WS 2014/15
Elisabeth Lex, KTI, TU Graz
Agenda
• Repetition from last time: altmetrics
• E-Science
• E-Infrastructures
• Content/Data Mining
Altmetrics (repetition)
"Altmetrics is the creation and study of new metrics based on the Social Web for analyzing and informing scholarship" – Altmetrics Manifesto, http://altmetrics.org/about
• Aggregated from many sources (e.g. Twitter, Mendeley, GitHub, SlideShare, ...)
• Article Level Metrics (ALM): a multidimensional suite of transparent and established metrics at the article level
Examples for Altmetrics sources (repetition)
• Usage: views, downloads, ...
• Captures: bookmarks, readers, ...
• Mentions: blog posts, news stories, Wikipedia articles, comments, reviews
• Social Media: tweets, Google+, Facebook likes, shares, ratings
• Citations: Web of Science, Scopus, Google Scholar, ...
Examples: Altmetric.com
Source: http://www.altmetric.com/details.php?domain=www.altmetric.com&citation_id=843656
Lessons learned (repetition)
• Alternative ways to assess the impact of various scientific outputs
• No common understanding of altmetrics yet
  • What do they really express?
  • Are they useful, and for which part of the research process?
• Not necessarily "better" metrics
  • E.g. gamification
• Can help to get an overview of a research field
  • Visualizations based on altmetrics
e-Science, e-Infrastructures, Content Mining
Modern Science: What has changed?
• 150 years later: searching for new particles like the Higgs boson with the Large Hadron Collider
• Built in collaboration with over 10,000 scientists and engineers from over 100 countries and hundreds of universities and laboratories; housed in a tunnel 27 km in circumference, 175 m deep, near Geneva
Motivation
• The Internet and science disciplines (e.g. physical sciences, biological sciences, medicine, and engineering) generate large and complex datasets (Big Data)
• These require more advanced database and architectural support
• A "new kind of research methodology" has emerged: the fourth paradigm of scientific exploration (Hey, 2007)
• Based on statistical exploration of large amounts of data
→ Led to e-Science
http://www.ksi.mff.cuni.cz/astropara/
e-Science
• Large-scale science (since 1999)
• Data-driven discovery
• Focus on computationally intensive science and how to tackle it using highly distributed environments
• Powerful computers: supercomputers, High Performance Computing (HPC), Grid, ...
• Distributed Computing
• Powerful research infrastructures – "e-infrastructures", grids, clouds
http://www.anandtech.com/show/6421/inside-the-titan-supercomputer-299k-amd-x86-cores-and-186k-nvidia-gpu-cores/3
Supercomputers
http://www.top500.org/lists/2014/06/
http://www.wikihow.com/Build-a-Supercomputer
• Large, expensive systems, usually housed in a single room, in which multiple processors are connected by a fast local network
• Suited for highly complex, real-time applications and simulations
Pros: data can move between processors rapidly → all processors can work together on the same task
Cons: expensive to build and maintain; do not scale well, e.g. adding more processors is challenging
Distributed Computing
• systems in which processors are not necessarily located in close proximity to one another—and can even be housed on different continents—but which are connected via the Internet or other networks
• Pros: much less expensive than supercomputers
• Cons: lower speed than supercomputers
Example: Hadoop
• Ecosystem of tools for processing big data
• Simple computational model: a two-stage method for processing large amounts of data
• Design an algorithm that operates on one chunk of the data in two stages (a Map and a Reduce stage); MapReduce automatically distributes that algorithm across the cluster → complexity is hidden in the framework
http://hadoop.apache.org
http://architects.dzone.com/articles/how-hadoop-mapreduce-works
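The Map/Reduce split above can be illustrated with the classic word-count example. A minimal single-machine sketch in Python (Hadoop would run the map stage on many chunks in parallel across the cluster and shuffle the emitted pairs by key before reducing):

```python
from collections import defaultdict

def map_phase(chunk):
    """Map stage: emit a (word, 1) pair for every word in one chunk of the data."""
    for word in chunk.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce stage: sum the counts emitted for each distinct key (word)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# On a real cluster, Hadoop runs map_phase on many chunks in parallel,
# shuffles the pairs by key, and reduces per key; here we simulate that
# sequentially on two toy chunks.
chunks = ["big data big science", "big data"]
pairs = [p for c in chunks for p in map_phase(c)]
counts = reduce_phase(pairs)
print(counts)  # {'big': 3, 'data': 2, 'science': 1}
```

The point of the framework is that only these two small functions are problem-specific; distribution, fault tolerance and the shuffle are handled by Hadoop.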
Hadoop in eScience: Example: Astronomical Image Processing
• Large telescopes survey sky over a prolonged period of time.
• Large Synoptic Survey Telescope (LSST): under construction; will capture half of the sky over 10 years; 30 TB of data every night; ~60 PB in 10 years
• Astronomers pick out faint objects for study by capturing multiple images of the same area and combining them – "coaddition"
• Challenge: how to organize and process all the resulting data.
http://www.lsst.org/lsst/
Using Hadoop to help with image coaddition
http://escience.washington.edu/get-help-now/astronomical-image-processing-hadoop
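At its core, coaddition stacks multiple exposures of the same sky area and averages them pixel by pixel, so random noise cancels while faint sources accumulate. A toy sketch in plain Python (a real pipeline would first align the images, and, as in the example above, distribute the work over image tiles with Hadoop):

```python
def coadd(images):
    """Coaddition: average multiple exposures of the same sky area pixel by
    pixel so that random noise cancels and faint objects emerge.
    `images` is a list of equally sized 2-D lists of pixel intensities."""
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / len(images)
             for c in range(cols)]
            for r in range(rows)]

# Three noisy 2x2 exposures of the same patch of sky (toy values)
exposures = [
    [[1.0, 0.0], [0.0, 3.0]],
    [[3.0, 0.0], [0.0, 3.0]],
    [[2.0, 0.0], [0.0, 3.0]],
]
stacked = coadd(exposures)
print(stacked)  # [[2.0, 0.0], [0.0, 3.0]]
```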
Example: Big Data in Science - European Exascale Projects
http://exascale-projects.eu
Exascale computing: computers capable of at least one exaflops (10^18 floating point operations per second) → not yet achieved; current systems are at ~10^15
Virtual Science Environments
• Not only HPC: sharing of knowledge and data is also becoming a requirement for scientific discovery
• Provide useful mechanisms to facilitate this sharing
• Preserve and organize research data
→ Virtual Science Environments: "virtual environments in which researchers work together through ubiquitous, trusted and easy access to services for scientific data, computing and networking, enabled by e-Infrastructures"
Defining e-Infrastructures
European e-Infrastructure Reflection Group (e-IRG):
'The term e-Infrastructure refers to this new research environment in which all researchers – whether working in the context of their home institutions or in national or multinational scientific initiatives – have shared access to unique or distributed scientific facilities (including data, instruments, computing and communications), regardless of their type and location in the world.'
http://www.e-irg.eu/about-e-irg.html
e-Infrastructures - Goals
• Opening access to knowledge through reliable, distributed and participatory data e-infrastructures
• Cost effective infrastructures for preservation and curation for re-use of data
• Persistent availability of information and linking people and data through flexible and robust digital identifiers
• Interoperability for consistency of approaches on global data exchange (e.g. standards)
• Enabling trust through authentication and authorisation mechanisms
http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/framework-for-action-in-h2020_en.pdf
Example: e-Infrastructure OpenAIRE
• The European Open Access Data Infrastructure for Scholarly and Scientific Communication
• Functionality:
  • Harvesting and storing of information about publications from various repositories (via OAI-PMH)
  • Enables searching for publications and related info (e.g. funding, ...)
  • Provides a list of OA repositories that can be used to store publications
  • Orphan repository
  • Shows statistics of the stored data
https://www.openaire.eu
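OAI-PMH, the protocol OpenAIRE harvests with, is plain HTTP: the harvester sends GET requests carrying a `verb` parameter and receives XML records back. A minimal sketch of building such a request (the endpoint URL is a placeholder, not a real repository):

```python
from urllib.parse import urlencode

def oai_request_url(base_url, verb, **kwargs):
    """Build an OAI-PMH request URL: a GET request with a 'verb'
    parameter plus verb-specific arguments; the response is XML."""
    params = {"verb": verb, **kwargs}
    return base_url + "?" + urlencode(params)

# List Dublin Core records from a hypothetical repository endpoint:
url = oai_request_url("http://example.org/oai", "ListRecords",
                      metadataPrefix="oai_dc")
print(url)  # http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

A harvester then pages through the XML responses (via the protocol's resumption tokens) and stores the metadata locally, which is what enables the search and statistics features above.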
OpenAIRE - Applications
Example: e-Infrastructures Austria 1/2
http://www.e-infrastructures.at
Example: e-Infrastructures Austria 2/2
e-Science, e-Infrastructures, Content Mining
Content Mining - Motivation
• Make science discoverable
• Extract facts for research
• Build reusable objects
• Aggregate
• Create new businesses
• Check for errors ⇒ better science
Content Mining
→ To extract, process and republish content manually or by machine
• Content: can be text, numerical data, static images, videos, audio, metadata, bibliographic data or any digital information, and/or a combination of them all → all types of information
• Mining: large-scale extraction of information from target content
http://access.okfn.org/2014/03/27/what-is-content-mining/
Data Mining
Content Mining vs. Data Mining → content mining is more generic
Content Mining Problems
• Secondary publishers create walled gardens, e.g. the ResearchGate portal
• Publishers' contracts ban content mining
• Publishers may cut off universities who mine
• Publishers lobby governments to require "licences for content mining"
• UK → "the right to read is the right to mine"
http://blogs.ch.cam.ac.uk/pmr/2013/10/02/text-and-data-mining-fighting-for-our-digital-future-peter-murray-rust-is-the-problem/
Example: ContentMine
http://contentmine.org
Idea:
• Facts cannot be copyrighted
• Billions of facts in copyright-protected research articles
→ Make them publicly accessible!
Content to mine in a scientific paper
[Figure: annotated article page highlighting extractable facts such as date, researcher, resource]
Machine Extraction of Scientific Facts
1. Crawl scientific literature
2. Scrape each scientific article
3. Extract facts
4. Index
5. Republish (WikiData)
https://github.com/ContentMine
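Step 3 can be approximated in a few lines: a toy fact extractor that pulls dates and simple "X was discovered by Y"-style statements out of text with regular expressions. The real ContentMine tools use proper scrapers and NLP, but the pipeline shape is the same:

```python
import re

def extract_facts(text):
    """Toy fact extractor: find years and simple 'X was discovered by Y'
    statements with regular expressions. Real systems use far more robust
    extraction, but produce the same kind of (type, value) facts."""
    facts = []
    for year in re.findall(r"\b(1[5-9]\d{2}|20\d{2})\b", text):
        facts.append(("year", year))
    for subj, obj in re.findall(r"(\w+) was discovered by ([\w\s.]+?)[.,]", text):
        facts.append(("discovered_by", subj, obj))
    return facts

text = "Penicillin was discovered by Alexander Fleming, in 1928."
facts = extract_facts(text)
print(facts)
# [('year', '1928'), ('discovered_by', 'Penicillin', 'Alexander Fleming')]
```

Steps 4 and 5 would then index these tuples and push them to an open database such as WikiData.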
Example: retrieve metadata for specific article
Example: Measuring quality of Wikipedia
Elisabeth Lex, Michael Voelske, Marcelo Errecalde, Edgardo Ferretti, Leticia Cagnina, Christopher Horn, Benno Stein, and Michael Granitzer. 2012. Measuring the quality of web content using factual information. In Proceedings of WebQuality '12 at WWW '12.
(a) Unbalanced (b) Balanced
Figure 1: Histograms of Wikipedia corpora for the unbalanced dataset and the balanced dataset.

is the word count of t, and t is a Wikipedia article. The same holds for "Factual-density/sentence-count".

The word count measure outperforms the factual density measure normalized to sentence count as well as to word count on the unbalanced corpus. Apparently, word count is a strong feature on the unbalanced corpus.

We then evaluated the factual density measure on the balanced corpus, where both featured/good and non-featured articles are more similar with respect to document length. The results for this experiment are shown in Figure 2(b) as precision-recall curves. On the balanced corpus, factual density normalized to sentence count as well as to word count performs much better than on the unbalanced corpus, while word count, as expected, performs worse. There is not much difference between the normalization to word or sentence count since here, the number of words per document has a smaller influence on the result.

We also analyzed the distributions of featured/good and non-featured articles if factual density is used as measure, as depicted in Figure 3. We found that the distribution of the featured/good articles is clearly separated from the distribution of the non-featured articles, with peaks at two different factual density values (0.06 and 0.03, respectively). This finding is in contrast to the fact that the distributions of featured/good articles and non-featured articles have a high degree of overlap if word count is used, as shown in Figure 1(b). Consequently, on the balanced corpus, factual density clearly outperforms our baseline word count.

Figure 3: Distribution of articles by factual density.

In a related experiment, we investigated the relational information contained in the binary relationships ReVerb extracts from sentences. We used the relations, i.e. only the predicates from the extracted triples, as a vocabulary to represent the documents. We then tested the discriminative power of these features by training a classifier to solve the binary classification problem of distinguishing featured/good from non-featured articles. The results reported in Table 1 were obtained using the WEKA (http://www.cs.waikato.ac.nz/~ml/weka/) implementation of a Naive Bayes classifier in combination with feature selection based on Information Gain (IG). From 40,000 relations, we selected the 10% best features in terms of IG. We achieved similar results for both corpora.

Table 1: Classification results using relational features on both corpora.

Measure     Unbalanced [%]   Balanced [%]
Accuracy    84.01            87.14
F-Measure   84               86.7
Precision   84               89.2
Recall      84               87.1

Apparently, relational features are more robust when the document length varies. However, we need to investigate this in more detail.
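The central measure of the excerpt reduces to a one-line formula: factual density is the number of extracted facts normalized by document length. A sketch (the toy numbers are chosen to match the distribution peaks of 0.06 and 0.03 reported above):

```python
def factual_density(num_facts, num_words):
    """Factual density: the number of extracted facts (e.g. ReVerb triples)
    normalized by document length, so that long articles are not favoured
    simply for containing more text."""
    return num_facts / num_words

# A featured-quality article tends to pack more facts per word than a
# non-featured one (toy counts, matching the peaks in Figure 3):
featured = factual_density(num_facts=60, num_words=1000)      # 0.06
non_featured = factual_density(num_facts=30, num_words=1000)  # 0.03
print(featured, non_featured)
```

Normalizing by sentence count instead simply swaps the denominator, which, per the excerpt, makes little difference on the balanced corpus.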
Possible questions for content mining
• Find references to papers by a given author. This is metadata and therefore factual. It is usually trivial to extract references and authors; more difficult, of course, to disambiguate them.
• Find papers about Science 2.0 in German. Highly tractable. A typical approach would be to find the 50 commonest words (e.g. "ein", "das", ...) in a paper and show that their frequency is very different from English ("one", "the", ...)
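The common-words approach from the second bullet can be sketched as follows, with tiny hand-picked stopword lists (a real implementation would use the ~50 commonest words of each language):

```python
from collections import Counter

def looks_german(text, n=5):
    """Crude language check: compare the text's most frequent words
    against small German and English stopword lists and see which
    list they overlap with more."""
    german = {"der", "die", "das", "und", "ein", "ist", "nicht", "mit"}
    english = {"the", "and", "one", "a", "is", "not", "of", "to"}
    common = [w for w, _ in Counter(text.lower().split()).most_common(n)]
    g = sum(w in german for w in common)
    e = sum(w in english for w in common)
    return g > e

print(looks_german("das ist ein Test und das ist ein weiterer Test"))  # True
print(looks_german("the quick brown fox and the lazy dog"))            # False
```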
Example: Facilitate exploratory search in social bookmarking sites
• By topic of interest
• Setting: social bookmarking dataset, URLs described by tags; dataset size: 61,665 posts (~430,000 triples)
• Research Questions:
  • What groups of interests exist?
  • Are they somehow related?
  • How do they evolve over time?
Approach
Take away message
• e-Science: data-driven, large-scale science
  • Supercomputers and distributed computing
  • Virtual research environments
  • e-Infrastructures
• Mining content/data in large repositories
  • E.g. fact extraction
  • E.g. exploratory analysis of large datasets
  • Find groups of interest expressed by user-generated tags, and their relations
Your Assignment!
Assignment 1/2
• Implementation (50%)
  1. Compute altmetrics (25 pts)
    • Use rOpenSci to first search arxiv.org for papers related to a topic of your choice and then retrieve their altmetrics with rAltmetric (http://ropensci.org)
      • Result: list of 10 DOIs from arxiv.org with the corresponding altmetrics from altmetric.com (10 pts)
    • Plot and interpret the results
      • Result: plot and textual interpretation (15 pts)
  2. Use the #altmetrics14 Twitter collection (25 pts): http://figshare.com/articles/An_altmetrics14_Twitter_Archive/1151577
    • Extract mentions of users (user A mentions user B in a tweet)
      • Result: table userid | userscreenname | mentions (10 pts)
    • Plot mentions in a matrix
      • Result: plot and textual interpretation (15 pts)
Assignment 2/2
• Report (25 pts)
  • Collect related work in the Mendeley group (tag it with your name) (5 pts)
  • Upload your paper and your source code to Mendeley (tag it with submission_yourname, e.g. submission_xyz) (5 pts)
  • Write a scientific paper (4 pages) (15 pts)
• Presentation (25 pts): present your paper in class
  • Motivate the work you have done (e.g. why altmetrics) in 1 slide
  • Present your results and how you got them
  • Bonus points: present your own ideas for the Twitter dataset and how you would tackle them → further bonus points if you implement them :)
Part 1: Compute altmetrics
Short intro into R
• The R Project for Statistical Computing: http://www.r-project.org
• Free software environment for statistical computing and graphics
• Classification, clustering, statistical tests, time-series analysis, ...
• Simple way to produce "publication-ready" plots
• Available for Windows, Unix and OS X via a CRAN mirror
The package rAltmetric
• Package that enables retrieving altmetric data from altmetric.com for publications
• Altmetric tracks what people are saying about papers online on behalf of publishers, authors, libraries and institutions
• http://cran.r-project.org/web/packages/rAltmetric/
• http://ropensci.github.io/rAltmetric/
• 2 major functions:
  • altmetrics() - download metrics
  • altmetric_data() - extract data
• Plus: functions to plot/print metrics
Example
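rAltmetric's altmetrics() function queries the public Altmetric API over HTTP. For orientation, a rough Python sketch of the same request; the endpoint and response keys follow the public v1 API as far as I know, and the assignment itself should of course use the R package:

```python
from urllib.request import urlopen
import json

# Free, rate-limited details endpoint of the public Altmetric API (v1)
ALTMETRIC_API = "https://api.altmetric.com/v1/doi/"

def altmetric_url(doi):
    """Build the Altmetric details URL for a DOI - roughly what
    rAltmetric's altmetrics() queries under the hood."""
    return ALTMETRIC_API + doi

def fetch_altmetrics(doi):
    """Download the metrics for one DOI and return them as a dict
    (response keys include e.g. 'score' - an assumption based on
    the public API docs)."""
    with urlopen(altmetric_url(doi)) as resp:
        return json.load(resp)

# e.g. fetch_altmetrics("10.1038/480426a")["score"]  (requires network)
print(altmetric_url("10.1038/480426a"))
```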
Part 2: Use #altmetrics14 Twitter collection
Analysis of Twitter dataset
• Determine impact of users at a scientific conference • Extract mentions of users (user A mentions user B
in tweet) • Plot mentions in matrix and interpret results
• https://www.miskatonic.org/2013/02/22/one-last-c4l13-tweet-thing-who-mentioned-whom/
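The mention extraction and matrix can be sketched as follows, on hypothetical toy tweets (the assignment uses the author and text fields of the #altmetrics14 archive):

```python
import re
from collections import defaultdict

MENTION = re.compile(r"@(\w+)")

def mention_matrix(tweets):
    """Build the 'who mentioned whom' counts from (author, text) pairs:
    matrix[a][b] = how often user a mentioned user b."""
    matrix = defaultdict(lambda: defaultdict(int))
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            matrix[author][mentioned.lower()] += 1
    return matrix

# Toy data; screen names and texts are made up
tweets = [
    ("alice", "Great talk by @bob on #altmetrics14!"),
    ("alice", "Agree with @bob and @carol"),
    ("bob",   "Thanks @Alice!"),
]
m = mention_matrix(tweets)
print(m["alice"]["bob"], m["bob"]["alice"])  # 2 1
```

The nested dict is exactly the matrix to plot, e.g. as a heatmap with rows = mentioning users and columns = mentioned users.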
Write the scientific paper about your work
• 4 pages, Springer LNCS format: http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0
• Structure of your paper:
  • Abstract (= a short, complete summary of the paper with key findings)
  • Introduction and Related Work (describes the theoretical background, indicates why the work is important, states a research question)
  • Experiments and Results
  • Conclusion
  • References
Presentation
• Workshop style: presentation and discussion
• Present your work in max. 10 min + 5 min for questions from the audience
• No exam situation :)
• Mandatory attendance, though
No plagiarism allowed!! But: Open Science – so if you use the work of others, cite it properly; if you use the work of your colleagues, cite them and give them credit!