Be on science's pulse with NLP or "What are the serious men talking about?"

81
Be on Science’s Pulse with NLP or “What are the serious men talking about?” TETIANA KODLIUK - SEPTEMBER 24th, 2016

Transcript of Be on science's pulse with NLP or "What are the serious men talking about?"

Be on Science’s Pulse with NLP or “What are the serious men talking about?”

TETIANA KODLIUK - SEPTEMBER 24th, 2016

www.vitech.com.ua

Who are you? Why did we wake up so early?

Data Scientist

Lead of Data Scientists

Lecturer in Apache Spark

Lecturer in MathPh. D. in Math

Analyst

Active Wizards

www.vitech.com.ua

Artificial Intelligence is everywhere. Isn't it?

www.digitalgov.gov

image recognition

knowledge managementprobabilistic reasoning

representation of human expression

robotics

speech to text

natural language processing

machine learning

www.vitech.com.ua

Why NLP?Main tasks in NLP:

● process text● order text● understand text● extract information● cluster text● generate text● summarize text ……...

http://milestoneseducation.com/

www.vitech.com.ua

What is NLP? Jumping NLP Curves (Stanford, 2014)

http://sentic.net/jumping-nlp-curves.pdf

www.vitech.com.ua

So our idea was born...

Extract the trendsfrom the scientific publications

https://www.ucl.ac.uk/human-evolution/

www.vitech.com.ua

Keywords extraction

Keyword (keyphrase) extraction is tasked with the automatic identification of terms (phrases)

that best describe the subject of a document

www.socialappshq.com

www.vitech.com.ua

Everyone needs it...

I want to know only KEY news.

I want to now KEY problems of customers.

I want to know KEY approaches in medicine.

I want to know KEY trends in Science.

Part 1The NLP methods

for the keywords (keyphrases) extraction

www.vitech.com.ua

Main problems: native language

language la langue valoda لغة мова ভাষা ulimi ภาษา言語 语言 sprache unicode:)

wikipedia.org

www.vitech.com.ua

Main problems: punctuation, grammar

facebook.com

www.vitech.com.ua

Main problems: polysemy

Водити за ніс Розвісити вуха Точити зуби

Клювати носом Робити з мухи слона

Прикусити язик

Чесати язики Пускати бісиківНе в своїй тарілці

facebook.com

www.vitech.com.ua

Main problems: slang, dialects, metaphors, loan

twitter.com

www.vitech.com.ua

ArXiv Categories

Category Number of subcategories

1 Statistics 5

2 Quantitative Biology 10

3 Computer Science 36

4 Nonlinear Sciences 5

5 Mathematics 32

6 Physics 39

www.vitech.com.ua

ArXiv monthly submissions

http://arxiv.org/stats

Sub

mis

sion

s

Year

www.vitech.com.ua

ArXiv submissions per category

http://arxiv.org/stats

Year Year

Frac

tiona

l S

ubm

issi

ons

Sub

mis

sion

s

www.vitech.com.ua

Computer Science submissions

http://arxiv.org/stats

Year Year

Frac

tiona

l S

ubm

issi

ons

Sub

mis

sion

s

www.vitech.com.ua

How does the input data look like?

arXiv.org

www.vitech.com.ua

Atom structure looks like this<?xml version="1.0" encoding="UTF-8"?><feed xmlns="http://www.w3.org/2005/Atom"> <updated>2016-08-26T00:00:00-04:00</updated> <entry> <id>http://arxiv.org/abs/cond-mat/0102536v1</id> <updated>2001-02-28T20:12:09Z</updated> <published>2001-02-28T20:12:09Z</published> <title>Impact of Electron-Electron Cusp on Configuration Interaction Energies</title> <summary> The effect of the electron-electron cusp on the convergence of configurationinteraction (CI) wave functions is examined. By analogy with the pseudopotential approach for electron-ion interactions, an effective electron-electron interaction is developed which closely reproduces the scattering of the Coulomb interaction but is smooth and finite at zero electron-electron separation.</summary> <author><name>David Prendergast</name> <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">Department of Physics</arxiv:affiliation> </author>

<author><name>M. Nolan</name> <arxiv:affiliation xmlns:arxiv="http://arxiv.org/schemas/atom">NMRC, University College, Cork, Ireland</arxiv:affiliation> </author> </entry>

arXiv.org

www.vitech.com.ua

Where you can find the Data?

Bulk Metadata Access

● ArXiv API

● OAI-PMH

● RSS

wikipedia.org

www.vitech.com.ua

You can use these queries

● ArXiv API

Url = “http://export.arxiv.org/api/query?id_list={}"● OAI

Url = “http://export.arxiv.org/oai2?verb=ListRecords&from={}&until={}&metadataPrefix=arXiv"

www.metalinjection.net

www.vitech.com.ua

It was pretty easy, but...

➔ The versions of the papers can change at any moment➔ Arxiv.org doesn’t allow to scrape

a lot of atoms at the same time➔ It can be the problem with an

internet➔ It can be the extra characters in

abstracts, author names, title etc.➔ The categories names can

change: "adap-org" = "nlin.AO“, "Q-alg" = "math.QA“➔ Ids of papers can have different format: 1606.04426, 160323

★ We update the versions of papers each month

★ We are waiting 20 second for scraping new atoms

★ We are retrying after 30 seconds each time.

★ We use encoding

★ We learn the changes and map them

Problems Solutions

www.vitech.com.ua

Keywords or keyphrases?

www.vitech.com.ua

Actual methods for keyphrases extraction

Statistical methods

RAKE

TextRank, KeyRank

Supervised Machine Learning

Neural Networks

www.vitech.com.ua

Methods

WordsN-grams, phrasesPOS patternsNamed entities….

FrequenciesWeights: TF-IDF, BM25Rank….

ManualSupervisedDepended….

www.vitech.com.ua

Where

techchunks.com

Where am I?

I thought, she will speak about serious men...

Some strange tables and no word about men...

www.vitech.com.ua

Frequences

techchunks.com

Keyphrase Freq

in this paper, we propose 597

in this paper, we present 323

this paper, we present a 187

we consider the problem of 179

in this paper, we study 163

in this paper we present 149

in this paper we propose 143

the effectiveness of the proposed 134

to the best of our 128

Keyphrase Freq

позич менi сонце в моє 17

в моє вiконце є, є 13

я до тебе, я до 12

тебе нема... коли тебе нема... 12

маю то, я маю то, 12

до тебе, я до тебе, 12

я до тебе, я до 12

коли тебе нема... коли тебе 12

менi сонце в моє вiконце 11

www.vitech.com.ua

Frequences (deleted punctuations ans stopwords, removed duplicates)

techchunks.com

Keyphrase Freq Keyphrase Freq

simultaneous wireless information power transfer 24 позич менi сонце вiконце 10

alternating direction method multipliers admm 20 бебі онлайн твоє лібідо 7

cloud radio access network cran 19 вільний син полетить 7

orthogonal frequency division multiplexing ofdm 17 стіна впала 7

wireless information power transfer swipt 16 ховалися зими 18 хвилин листопад 7

deep convolutional neural networks cnn 14 небi сяє нова зоря 7

improve oncurrent gram modeling techniques 12 десь живе свiтло дискотек бiблiотек 7

polynomial depth quantum circuit effects 10 я не здамся без бою 7

maximum likelihood reestimation procedure belonging

10 тінь твого тіла 7

www.vitech.com.ua

TF-IDF

TF-IDF = TF*IDF

Term Frequency

TF(t) = Count of term t in a document

Inverse Document Frequency

IDF(t) = log[Count of term t in a document / Total count of terms in the documents]

www.vitech.com.ua

TF-IDF

techchunks.com

Keyphrase TF-IDF Keyphrase TF-IDF

complexity extension finite fields note 0.26 коли тебе нема 0.57

upper bounds symmetric bilinear complexity 0.25 так знай другом твоїм 0.54

note bounds asymptotical uniform discuss 0.25 обійми мене 0.50

bounds asymptotical uniform discuss validity 0.25 стіна впала між нами 0.46

fields note bounds asymptotical uniform 0.25 не йди 0.45

establish new upper bounds symmetric 0.25 позич менi сонце моє вiконце 0.44

large number works address rdf 0.24 більше не можу без тебе 0.43

reduced problem application propose heuristic 0.24 твого тіла тінь 0.40

encode data semantic web data 0.24 моя сьюзi мила 0.37

www.vitech.com.ua

TextRank

techchunks.com

Key-sentencesIn this study, the historical roots of tribology are investigated using a newly developed scientometric method called Referenced Publication Years Spectroscopy.

It was found that for one group of contacts the oscillation amplitude nonmonotonously increases with the bias voltage increase.

Cortical Learning Algorithms based on the Hierarchical Temporal Memory, HTM have been developed by Numenta Incorporation from which variations and modifications are currently being investigated

While our Local Degree method is best for preserving connectivity and short distances, other newly introduced local variants are best for preserving the community structure

We propose a combined caching scheme where part of the available cache space is reserved for caching the most popular content in every SBS

Key-sentencesДоля таки зовсім не зла

Бо навіть мить для мене свято

Ми будуєм і ламаєм знов

Нам казали не чекайте, всі давно вже тут.

Там, де кожен день, там де кожен, як останній деньМи будуем і ламаєм знов, як не як свою любов.

І знову жива вода змиває старі сліди,І я думаю так буде завжди.

Так ся дивлю за тобою,Що й не мушу казати слів.

Не показуйте жаль, не впадайте в екстаз,Ми сьогодні залишимо більше для нас.

Ти там віддала більше ніж могла мені.

www.vitech.com.ua

Rapid Automatic Keyword Extraction (RAKE)

Author: Stuart Rose (2010)

● Unsupervised method for extracting keywords● Incorporate cooccurrence and frequency of words

www.vitech.com.ua

Candidate keywords selecting

RAKE partitions the text by using

stop words phrase delimiters

www.vitech.com.ua

Candidate keywords selectingCompatibility of systems of linear constraints over the set of natural numbers.Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types.

Compatibility – systems – linear constraints – set – natural numbers – Criteria – compatibility – system – linear Diophantine equations – strict inequations – nonstrict inequations – Upper bounds – components – minimal set – solutions – algorithms – minimal generating sets – solutions – systems – criteria – corresponding algorithms – constructing – minimal supporting set – solving – systems – systems

www.vitech.com.ua

Candidates scoring

Score of phrase = SUM(score(word))

Metrics for calculating word scores:

1. word frequency: freq(w), 2. word degree: deg(w),3. ratio of degree to frequency: deg(w)/freq(w)

www.vitech.com.ua

Graph of word co-occurrences

www.vitech.com.ua

Word scores calculated from the word co-occurrence graph

www.vitech.com.ua

Additional option: adjoining keywords

1. Find the same sequences2. Join keyphrases3. Score

www.vitech.com.ua

RAKE

techchunks.com

Keyphrase RAKE

spectral amplitude coding optical code division multiple access networks intelligent pinning based cooperative secondary control

51.64

service function localization enabling fine grained rdf data completeness assessment balanced ranking mechanisms convolutional neural networks

51.33

strongly magnetized neutron stars powering superluminous supernovae remarkable magnetostructural coupling

49.85

randomized version room temperature tetragonal noncollinear antiferromagnet ptmnga optimal system maneuver

49.62

Keyphrase RAKE

мiнiмум волi за мiнiмум долi 10.56

вона хотiла знiматись в кiно 10.04

вiн менi свої пiснi спiвав 8.83

я-а-а почую голос твiй 7.72

мало-мало-мало менi 7.72

вiсiм днiв вiн її шукав 7.07

усi знайомi ледь знайомi 7.07

ключ не пiдiйшов а може й не посмiв 7.07

я слухав звук дощу i Бiллi Холiдей 7.07

якби тодi сказала ти менi стати тiнню в ночi

7.07

але ж вiзьми собi з полицi молодiсть 7.07

Part 2Our approach

www.vitech.com.ua

Neural Network

www.vitech.com.ua

What is the best method?

Method Advantage Disadvantage

TF-IDF Important keyphrases extraction, n-grams possible Candidates extraction

TextRank Cooccurrences calculating Importance missing

RAKE Frequences, Coocurences calculating, candidates extraction Long phrases

Supervised ML High score of extraction Train dataset is needed

www.vitech.com.ua

Our Approach

TF-IDFRAKE

SciencePulse

www.vitech.com.ua

RAKE

techchunks.com

Keyphrase Weight Keyphrase Weight

massive multiple input output 7,25 все буде добре 1

long short term memory architecture 5,12 коли тебе нема 1

Live action virtual reality games 3,15 небо над дніпром 1

low rank hankel matrix completion 3,04 хочу напитись тобою 0,78

multi point wireless energy transmission 3,01 жити без мети 0,78

tree augmented naive bayes classifier 2,89 мила моя сьюзі 0,78

long short term memorized fusion 2,15 тінь твого тіла 0,75

fine grained entity type classification 1,51 коли настане день 0,75

high speed railway communication systems 1,27 кожну хвилину життя 0,75

partially observable markov decision process 1,13 коли тобі важко 0,75

www.vitech.com.ua

How to join different methods?

Input text RAKE TF-IDF

Keyphrase 1Keyphrase 2Keyphrase 3Keyphrase 4Keyphrase 5

RAKE weight TF-IDF score

www.vitech.com.ua

Text for keyphrases selectingM

onth

\ Y

ear

Cat

egor

yTitle 1Abstract 1-------Title 2Abstract 2-------Title nAbstract n-------

Keyphrase 1Keyphrase 2----------------Keyphrase i----------------Keyphrase 100----------------

Dup

licat

es

rem

ovin

gS

imila

r ke

yphr

ases

www.vitech.com.ua

Text preprocessing

● Formulas deleting● Encoding● Some punctuations removing● Extra spaces removing

www.vitech.com.ua

Duplicates removing

Bag of Words

Keyphrase 1Keyphrase 2----------------Keyphrase n----------------Keyphrase 100

Occurence in Keyphrase1

Occurence in Keyphrase2

word1 1 1

word2 0 0

wordn 1 0

www.vitech.com.ua

Duplicates removing

cosine_similarity = cosine_sim(keyphrase1, keyphrase2)

filter(similarity >= 0.8)

www.vitech.com.ua

Recommendations: Word2Vec

Wikipedia+Gigaword 5 Number of dimensions : 300Windows size: 10

WikipediaNumber of dimensions : 1000Windows size: 10

ArXiv (abstracts)Number of dimensions : 300Windows size: 10

www.vitech.com.ua

Additional rules for similarity

1. The year was selected as period of similar statements searching

2. The cosine distance between sets of words is calculated.

3. The lowest cosine similarity between statements should be equal 0.70

Part 3DEMO

www.vitech.com.ua

Text for keyphrasesWelcome to

Science Pulse

www.vitech.com.ua

Keyphrase-Atom relationships

Mean number of Atoms per Keyphrase = 1.6178478064Max number of Atoms per Keyphrase = 690Min number of Atoms per Keyphrase = 1

onmogul.com

www.vitech.com.ua

Keyphrase-Atom relationships

KeyphraseId Value AtomNumber CategoryId

2106584 low energy effective field theory 690 6

2250344 black hole mass bulge 641 6

2190498 star formation rate stellar mass 460 6

2099199 proposed algorithm outperforms state art 362 3

2106533 ultra high energy cosmic ray 336 6

2118614 high energy proton collisions 311 6

1534841 coupled dark matter energy 290 6

1534818 low mass standard model higgs 289 6

2118593 low energy heavy ion collisions 276 6

2153778 high energy gamma ray sources 270 6

www.vitech.com.ua

Atom-Keyphrases relationships

Mean number of Keyphrases per Atom = 1.69036455056Max number of Keyphrases per Atom = 36Min number of Keyphrases per Atom = 0

onmogul.com

www.vitech.com.ua

Atom-Keyphrases relationships

AtomId AtomTitle KeyphraseNumber

1145661 Active galactic nuclei at gamma-ray energies 36

774479Ultrahigh-Energy Cosmic Rays from the "En Caul" Birth of Magnetars

35

784859 The IBM 2016 Speaker Recognition System 32

777367Hybrid Digital and Analog Beamforming Design for Large-Scale Antenna Arrays

32

778646Chandra X-ray and Hubble Space Telescope Imaging of Optically Selected Kiloparsec-Scale Binary Active Galactic Nuclei II: Host Galaxy Morphology and AGN Activity

31

www.vitech.com.ua

Keyphrases Weights

Id Value WeightCalculation

IdCategory

Id

2250394 maximally supersymmetric yang mills theory 1.48377 829 6

2250395 silver coated cds quantum dots 1.48377 829 6

2250396 gauss bonnet ads black hole 1.44081 829 6

2250397exchange coupled ferri ferromagnetic composite

1.43178 829 6

2250398 hollow core photonic crystal fiber 1.43178 829 6

2250399 neutrinoless double beta minus decays 1.43178 829 6

2250404 cmos monolithic active pixel sensors 1.43178 829 6

2250405 high dimensional quantum key distribution 1.40149 829 6

2250406 body reduced density matrix functional 1.39866 829 6

2250407 hybrid halide perovskite solar cells 1.0000 829 6

2250408 spatially weighted generalized robertson walker 1.0000 829 6

www.vitech.com.ua

Keyphrases weights range

Date

Per

cent

of t

he w

eigh

ts >

1

www.vitech.com.ua

Processing timeThe duration of the general process is about 16 hours

Category: PhysicsYear: 2016Number of atoms: 34000Time of keyphrases calculating: 2 minutesTime of Atom searching: 4 minutesTime of results saving: 2 minutes AMD FX(tm)-6100 Six-Core Processor

RAM16GB

www.vitech.com.ua

● Harvester responsible for keyphrases extraction

● Visualization responsible for application

● MySQL is used as a storage

Components

www.vitech.com.ua

Harvester tools

www.vitech.com.ua

queries

search

REST API

Front end

Google Mail

ArxivAPl

cache

Visualization architecture

www.vitech.com.ua

Visualization stack

www.vitech.com.ua

Deployment pipeline

Harvester MySQL Visualisation

Z A B B I X

www.vitech.com.ua

Plans for future● Add trends, which people search● Add trends extraction per subcategory● Add trends analysis of other sources● Add Author’s analysis, H-index calculating● Add Google Analytics● Scoring

blazepress.com

www.vitech.com.ua

Задати це питання

What keyphrases could you extract from our talk?

http://blog.cleveland.com/

www.vitech.com.ua

Keyphrases Weight

natural language processing keywords extraction 1.0

method rapid automatic keyword extraction 1.0

scientific organizations explore artificial intelligence 0.7

scientists e-print repository arXiv 0.7

extracting hot topics 0.67

economize scientists time 0.67

human product generally text data 0.67

Keyphrases from our talk by SciencePulse

www.vitech.com.ua

Interesting observations