Interactive Visualization and Simplified Pattern Discovery ...

4
1 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095 096 097 098 099 ACL 2020 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE. Interactive Visualization and Simplified Pattern Discovery in the COVID-19 Open Research Dataset(CORD-19) Anonymous ACL submission Abstract This work explores the use of Natural Lan- guage Processing based algorithms for Large Text Mining and Interactive Visualization for the COVID-19 Open Research Dataset (CORD-19) Dataset. We developed a series of easy to use online interactive text visualization based on different percentages of mined text data of diseases and chemical entities from the CORD-19 Dataset.This is to enable the study of patterns based on the frequency of entities in a very large dataset of about 2.6 million disease and chemical entities extracted from 31,376 papers. This will be useful to medical professionals, especially those who are not fa- miliar with data mining techniques to interact with diseases, symptoms, drugs and chemicals texts entities to study patterns, relationships and trends to derive insights about the COVID- 19 disease from publications about the disease and similar diseases. These extracted entities will also be made publicly available so that more work can be done with the dataset. 1 Introduction In Wuhan, China, a novel and alarmingly conta- gious primary atypical (viral) pneumonia broke out in December 2019. It has since been identi- fied as a zoonotic coronavirus, similar to SARS coronavirus and MERS coronavirus and named COVID-19(Liu et al., 2020). The World Health Organization (WHO) on March 11, 2020, has de- clared the novel coronavirus (COVID-19) outbreak a global pandemic(Cucinotta and Vanelli, 2020). Different measures are being taken globally to tackle the pandemic, one of them is the release of the CORD-19 Dataset. On March 16, 2020, the Allen Institute for AI (AI2), in collaboration with partners at The White House Office of Science and Technology Policy (OSTP), the National Library of Medicine (NLM), the Chan Zuckerburg Initiative (CZI), Microsoft Research, and Kaggle, coordinated by Georgetown Univer- sity’s Center for Security and Emerging Technol- ogy (CSET), released the first version of CORD- 19.(Wang et al., 2020) The COVID-19 Open Research Dataset (CORD- 19) is a growing resource of scientific papers on COVID-19 and related historical coronavirus re- search. CORD-19 is designed to facilitate the de- velopment of text mining and information retrieval systems over its rich collection of metadata and structured full text papers(Wang et al., 2020) This dataset is intended to mobilize researchers to apply recent advances in natural language pro- cessing to generate new insights in support of the fight against this infectious disease. The corpus is updated regularly as new research is published in peer-reviewed publications and archival services like bioRxiv, medRxiv, and others This work aims to provide a simple interface for medical professionals via an interactive web-based visualization tool using Scattertext, a Python text visualization library. This will provide a platform to study patterns, relationship based on frequencies of disease and chemical named entities extracted using scispaCy, a python Natural Language Pro- cessing Library. scispaCy is a specialized NLP library for process- ing biomedical texts which builds on the robust spaCy library scispaCy models are useful on a wide variety of types of text with a biomedical focus, such as clinical notes, academic papers, clinical trials reports and medical records(Neumann et al., 2019) In order to analyse the result of disease and chem- ical entities extraction of about 2.6 million to- kens/phrases, we explored the use of text data visu- alization. Finding words and phrases that discriminate cate- gories of text is a common application of statistical Natural Language Processing(NLP). A wide range

Transcript of Interactive Visualization and Simplified Pattern Discovery ...

Page 1: Interactive Visualization and Simplified Pattern Discovery ...

1

000

001

002

003

004

005

006

007

008

009

010

011

012

013

014

015

016

017

018

019

020

021

022

023

024

025

026

027

028

029

030

031

032

033

034

035

036

037

038

039

040

041

042

043

044

045

046

047

048

049

050

051

052

053

054

055

056

057

058

059

060

061

062

063

064

065

066

067

068

069

070

071

072

073

074

075

076

077

078

079

080

081

082

083

084

085

086

087

088

089

090

091

092

093

094

095

096

097

098

099

ACL 2020 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.

Interactive Visualization and Simplified Pattern Discovery in theCOVID-19 Open Research Dataset(CORD-19)

Anonymous ACL submission

Abstract

This work explores the use of Natural Lan-guage Processing based algorithms for LargeText Mining and Interactive Visualizationfor the COVID-19 Open Research Dataset(CORD-19) Dataset. We developed a series ofeasy to use online interactive text visualizationbased on different percentages of mined textdata of diseases and chemical entities from theCORD-19 Dataset.This is to enable the studyof patterns based on the frequency of entitiesin a very large dataset of about 2.6 milliondisease and chemical entities extracted from31,376 papers. This will be useful to medicalprofessionals, especially those who are not fa-miliar with data mining techniques to interactwith diseases, symptoms, drugs and chemicalstexts entities to study patterns, relationshipsand trends to derive insights about the COVID-19 disease from publications about the diseaseand similar diseases. These extracted entitieswill also be made publicly available so thatmore work can be done with the dataset.

1 Introduction

In Wuhan, China, a novel and alarmingly conta-gious primary atypical (viral) pneumonia brokeout in December 2019. It has since been identi-fied as a zoonotic coronavirus, similar to SARScoronavirus and MERS coronavirus and namedCOVID-19(Liu et al., 2020). The World HealthOrganization (WHO) on March 11, 2020, has de-clared the novel coronavirus (COVID-19) outbreaka global pandemic(Cucinotta and Vanelli, 2020).Different measures are being taken globally totackle the pandemic, one of them is the releaseof the CORD-19 Dataset.On March 16, 2020, the Allen Institute for AI (AI2),in collaboration with partners at The White HouseOffice of Science and Technology Policy (OSTP),the National Library of Medicine (NLM), the ChanZuckerburg Initiative (CZI), Microsoft Research,

and Kaggle, coordinated by Georgetown Univer-sity’s Center for Security and Emerging Technol-ogy (CSET), released the first version of CORD-19.(Wang et al., 2020)The COVID-19 Open Research Dataset (CORD-19) is a growing resource of scientific papers onCOVID-19 and related historical coronavirus re-search. CORD-19 is designed to facilitate the de-velopment of text mining and information retrievalsystems over its rich collection of metadata andstructured full text papers(Wang et al., 2020)This dataset is intended to mobilize researchersto apply recent advances in natural language pro-cessing to generate new insights in support of thefight against this infectious disease. The corpus isupdated regularly as new research is published inpeer-reviewed publications and archival serviceslike bioRxiv, medRxiv, and othersThis work aims to provide a simple interface formedical professionals via an interactive web-basedvisualization tool using Scattertext, a Python textvisualization library. This will provide a platformto study patterns, relationship based on frequenciesof disease and chemical named entities extractedusing scispaCy, a python Natural Language Pro-cessing Library.scispaCy is a specialized NLP library for process-ing biomedical texts which builds on the robustspaCy library scispaCy models are useful on a widevariety of types of text with a biomedical focus,such as clinical notes, academic papers, clinicaltrials reports and medical records(Neumann et al.,2019)In order to analyse the result of disease and chem-ical entities extraction of about 2.6 million to-kens/phrases, we explored the use of text data visu-alization.Finding words and phrases that discriminate cate-gories of text is a common application of statisticalNatural Language Processing(NLP). A wide range

Page 2: Interactive Visualization and Simplified Pattern Discovery ...

2

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

ACL 2020 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.

of visualizations have been used to highlight dis-criminating words– simple ranked lists of words,word clouds, word bubbles, and word-based scatterplots These techniques have a number of limita-tions. For example, the difficulty in comparingthe relative frequencies of two terms in a wordcloud, or in legibly displaying term labels in scat-terplots.(Kessler, 2017)Scattertext is an interactive, scalable tool whichovercomes many of these limitations. It is builtaround a scatterplot which displays a high num-ber of words and phrases used in a corpus. Pointsrepresenting terms are positioned to allow a highnumber of unobstructed labels and to indicate cat-egory association. The coordinates of a point in-dicate how frequently the word is used in eachcategory.(Kessler, 2017)

2 Method

For this analysis, The custom licence subset of theCORD-19 Data set was downloaded on the 23rd ofApril,2020. This PDF subset of the data containing31376 publications was used.During the data preprocessing,stop words and sin-gle lettered words were retained because of chemi-cal entities with single lettered symbols E.g. K forPotassium

Figure 1: Cord-19 Data Mining Process

Scispacy Named Entity Recognition Model(en ner bc5cdr md) trained on the BioCreativeV Chemical-Disease Relations (BC5CDR) cor-pus which consists of 1500 PubMed articles with4409 annotated chemicals, 5818 diseases and 3116chemical-disease interactions(Li et al., 2016) wasused to extract entities related to diseases and chem-icals in the CORD 19 dataset.The computation for the named entity extractionprocess was run in google colaboratory notebookGPU and the process took 8 hours to extract about2.6 Million disease and chemical entitiesExtracted entities of disease and chemicals were

then loaded as corpus for visualization with scatter-text, only tokens with at least 100 tokens were .Since the dataset is large, to aid exploration, the cor-pus was divided into 1,10,20,30,40,50,60,70,80,90and 100 percentages. Insights that can be gleanedat each percentage of corpus differs based on wordfrequency

3 Results

Table1 shows the breakdown of entities per entitytype

Entity Name Number of Extracted TokensDisease 1371743Chemical 1278831Total 2650680

Table 1: Number of Disease and Chemical Entitiesmined from 31376 CORD-19 Publications

Figure 2: View of Html Interface for Interactive textvisualization of disease and chemical entities

Figure 3: Interactive view of 1% of disease and chemi-cals entities corpus

Page 3: Interactive Visualization and Simplified Pattern Discovery ...

3

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

293

294

295

296

297

298

299

ACL 2020 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.

Figure 4: Interactive view of 10% of disease and chem-icals entities corpus

Figure 5: Interactive view of 20% of disease and chem-icals entities corpus

Figure 6: Interactive view of 50% of disease and chem-icals entities corpus

Figure 7: Interactive view of 80% of disease and chem-icals entities corpus

Figure 8: Interactive view of 100% of disease andchemicals entities corpus

4 Discussion

Each of the data point in the text visualization,when clicked shows the text data in different con-texts from the CORD-19 dataset.There is also asearch box where users of this tool can type inwords related to diseases and chemicals they sus-pect may be significant to their study.E.g name ofa particular medication or symptom.Data points with deeper color tone implies that theparticular word has many occurrences in the mineddataset.Links to the deployed interactive web visualizationwill be included as embedding in the final submis-sion(This is to preserve anonymity)

5 Conclusion

This work presented a text visualization methodto interact with extracted diseases and chemicalentities data that was extracted using Named EntityRecognition from the COVID-19 Open Research

Page 4: Interactive Visualization and Simplified Pattern Discovery ...

4

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

345

346

347

348

349

350

351

352

353

354

355

356

357

358

359

360

361

362

363

364

365

366

367

368

369

370

371

372

373

374

375

376

377

378

379

380

381

382

383

384

385

386

387

388

389

390

391

392

393

394

395

396

397

398

399

ACL 2020 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.

Dataset.The interactive web interface is intuitiveand can be used by anyone who understand dis-eases and chemicals to explore the data.This canbe used by medical professionals to have masteryof the dynamic pattern in the COVID-19 manage-ment by understanding and exploring patterns andrelationships in an effortless manner.

ReferencesD Cucinotta and M Vanelli. 2020. Who declares covid-

19 a pandemic. Acta bio-medica: Atenei Parmensis,91(1):157–160.

Jason S Kessler. 2017. Scattertext: a browser-basedtool for visualizing how corpora differ. arXivpreprint arXiv:1703.00565.

Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sci-aky, Chih-Hsuan Wei, Robert Leaman, Allan PeterDavis, Carolyn J Mattingly, Thomas C Wiegers, andZhiyong Lu. 2016. Biocreative v cdr task corpus:a resource for chemical disease relation extraction.Database, 2016.

Ying Liu, Albert A Gayle, Annelies Wilder-Smith, andJoacim Rocklov. 2020. The reproductive numberof covid-19 is higher compared to sars coronavirus.Journal of travel medicine.

Mark Neumann, Daniel King, Iz Beltagy, and WaleedAmmar. 2019. Scispacy: Fast and robust modelsfor biomedical natural language processing. arXivpreprint arXiv:1902.07669.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar,Russell Reas, Jiangjiang Yang, Darrin Eide, KathrynFunk, Rodney Kinney, Ziyang Liu, William Merrill,et al. 2020. Cord-19: The covid-19 open researchdataset. arXiv preprint arXiv:2004.10706.