Interactive Visualization and Simplified Pattern Discovery ...
Transcript of Interactive Visualization and Simplified Pattern Discovery ...
1
000
001
002
003
004
005
006
007
008
009
010
011
012
013
014
015
016
017
018
019
020
021
022
023
024
025
026
027
028
029
030
031
032
033
034
035
036
037
038
039
040
041
042
043
044
045
046
047
048
049
050
051
052
053
054
055
056
057
058
059
060
061
062
063
064
065
066
067
068
069
070
071
072
073
074
075
076
077
078
079
080
081
082
083
084
085
086
087
088
089
090
091
092
093
094
095
096
097
098
099
ACL 2020 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Interactive Visualization and Simplified Pattern Discovery in theCOVID-19 Open Research Dataset(CORD-19)
Anonymous ACL submission
Abstract
This work explores the use of Natural Lan-guage Processing based algorithms for LargeText Mining and Interactive Visualizationfor the COVID-19 Open Research Dataset(CORD-19) Dataset. We developed a series ofeasy to use online interactive text visualizationbased on different percentages of mined textdata of diseases and chemical entities from theCORD-19 Dataset.This is to enable the studyof patterns based on the frequency of entitiesin a very large dataset of about 2.6 milliondisease and chemical entities extracted from31,376 papers. This will be useful to medicalprofessionals, especially those who are not fa-miliar with data mining techniques to interactwith diseases, symptoms, drugs and chemicalstexts entities to study patterns, relationshipsand trends to derive insights about the COVID-19 disease from publications about the diseaseand similar diseases. These extracted entitieswill also be made publicly available so thatmore work can be done with the dataset.
1 Introduction
In Wuhan, China, a novel and alarmingly conta-gious primary atypical (viral) pneumonia brokeout in December 2019. It has since been identi-fied as a zoonotic coronavirus, similar to SARScoronavirus and MERS coronavirus and namedCOVID-19(Liu et al., 2020). The World HealthOrganization (WHO) on March 11, 2020, has de-clared the novel coronavirus (COVID-19) outbreaka global pandemic(Cucinotta and Vanelli, 2020).Different measures are being taken globally totackle the pandemic, one of them is the releaseof the CORD-19 Dataset.On March 16, 2020, the Allen Institute for AI (AI2),in collaboration with partners at The White HouseOffice of Science and Technology Policy (OSTP),the National Library of Medicine (NLM), the ChanZuckerburg Initiative (CZI), Microsoft Research,
and Kaggle, coordinated by Georgetown Univer-sity’s Center for Security and Emerging Technol-ogy (CSET), released the first version of CORD-19.(Wang et al., 2020)The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers onCOVID-19 and related historical coronavirus re-search. CORD-19 is designed to facilitate the de-velopment of text mining and information retrievalsystems over its rich collection of metadata andstructured full text papers(Wang et al., 2020)This dataset is intended to mobilize researchersto apply recent advances in natural language pro-cessing to generate new insights in support of thefight against this infectious disease. The corpus isupdated regularly as new research is published inpeer-reviewed publications and archival serviceslike bioRxiv, medRxiv, and othersThis work aims to provide a simple interface formedical professionals via an interactive web-basedvisualization tool using Scattertext, a Python textvisualization library. This will provide a platformto study patterns, relationship based on frequenciesof disease and chemical named entities extractedusing scispaCy, a python Natural Language Pro-cessing Library.scispaCy is a specialized NLP library for process-ing biomedical texts which builds on the robustspaCy library scispaCy models are useful on a widevariety of types of text with a biomedical focus,such as clinical notes, academic papers, clinicaltrials reports and medical records(Neumann et al.,2019)In order to analyse the result of disease and chem-ical entities extraction of about 2.6 million to-kens/phrases, we explored the use of text data visu-alization.Finding words and phrases that discriminate cate-gories of text is a common application of statisticalNatural Language Processing(NLP). A wide range
2
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
ACL 2020 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
of visualizations have been used to highlight dis-criminating words– simple ranked lists of words,word clouds, word bubbles, and word-based scatterplots These techniques have a number of limita-tions. For example, the difficulty in comparingthe relative frequencies of two terms in a wordcloud, or in legibly displaying term labels in scat-terplots.(Kessler, 2017)Scattertext is an interactive, scalable tool whichovercomes many of these limitations. It is builtaround a scatterplot which displays a high num-ber of words and phrases used in a corpus. Pointsrepresenting terms are positioned to allow a highnumber of unobstructed labels and to indicate cat-egory association. The coordinates of a point in-dicate how frequently the word is used in eachcategory.(Kessler, 2017)
2 Method
For this analysis, The custom licence subset of theCORD-19 Data set was downloaded on the 23rd ofApril,2020. This PDF subset of the data containing31376 publications was used.During the data preprocessing,stop words and sin-gle lettered words were retained because of chemi-cal entities with single lettered symbols E.g. K forPotassium
Figure 1: Cord-19 Data Mining Process
Scispacy Named Entity Recognition Model(en ner bc5cdr md) trained on the BioCreativeV Chemical-Disease Relations (BC5CDR) cor-pus which consists of 1500 PubMed articles with4409 annotated chemicals, 5818 diseases and 3116chemical-disease interactions(Li et al., 2016) wasused to extract entities related to diseases and chem-icals in the CORD 19 dataset.The computation for the named entity extractionprocess was run in google colaboratory notebookGPU and the process took 8 hours to extract about2.6 Million disease and chemical entitiesExtracted entities of disease and chemicals were
then loaded as corpus for visualization with scatter-text, only tokens with at least 100 tokens were .Since the dataset is large, to aid exploration, the cor-pus was divided into 1,10,20,30,40,50,60,70,80,90and 100 percentages. Insights that can be gleanedat each percentage of corpus differs based on wordfrequency
3 Results
Table1 shows the breakdown of entities per entitytype
Entity Name Number of Extracted TokensDisease 1371743Chemical 1278831Total 2650680
Table 1: Number of Disease and Chemical Entitiesmined from 31376 CORD-19 Publications
Figure 2: View of Html Interface for Interactive textvisualization of disease and chemical entities
Figure 3: Interactive view of 1% of disease and chemi-cals entities corpus
3
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
ACL 2020 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Figure 4: Interactive view of 10% of disease and chem-icals entities corpus
Figure 5: Interactive view of 20% of disease and chem-icals entities corpus
Figure 6: Interactive view of 50% of disease and chem-icals entities corpus
Figure 7: Interactive view of 80% of disease and chem-icals entities corpus
Figure 8: Interactive view of 100% of disease andchemicals entities corpus
4 Discussion
Each of the data point in the text visualization,when clicked shows the text data in different con-texts from the CORD-19 dataset.There is also asearch box where users of this tool can type inwords related to diseases and chemicals they sus-pect may be significant to their study.E.g name ofa particular medication or symptom.Data points with deeper color tone implies that theparticular word has many occurrences in the mineddataset.Links to the deployed interactive web visualizationwill be included as embedding in the final submis-sion(This is to preserve anonymity)
5 Conclusion
This work presented a text visualization methodto interact with extracted diseases and chemicalentities data that was extracted using Named EntityRecognition from the COVID-19 Open Research
4
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
ACL 2020 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
Dataset.The interactive web interface is intuitiveand can be used by anyone who understand dis-eases and chemicals to explore the data.This canbe used by medical professionals to have masteryof the dynamic pattern in the COVID-19 manage-ment by understanding and exploring patterns andrelationships in an effortless manner.
ReferencesD Cucinotta and M Vanelli. 2020. Who declares covid-
19 a pandemic. Acta bio-medica: Atenei Parmensis,91(1):157–160.
Jason S Kessler. 2017. Scattertext: a browser-basedtool for visualizing how corpora differ. arXivpreprint arXiv:1703.00565.
Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sci-aky, Chih-Hsuan Wei, Robert Leaman, Allan PeterDavis, Carolyn J Mattingly, Thomas C Wiegers, andZhiyong Lu. 2016. Biocreative v cdr task corpus:a resource for chemical disease relation extraction.Database, 2016.
Ying Liu, Albert A Gayle, Annelies Wilder-Smith, andJoacim Rocklov. 2020. The reproductive numberof covid-19 is higher compared to sars coronavirus.Journal of travel medicine.
Mark Neumann, Daniel King, Iz Beltagy, and WaleedAmmar. 2019. Scispacy: Fast and robust modelsfor biomedical natural language processing. arXivpreprint arXiv:1902.07669.
Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar,Russell Reas, Jiangjiang Yang, Darrin Eide, KathrynFunk, Rodney Kinney, Ziyang Liu, William Merrill,et al. 2020. Cord-19: The covid-19 open researchdataset. arXiv preprint arXiv:2004.10706.