Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo and Ricardo Ciferri

A Process Based on Paragraph for Treatment Extraction in Scientific Papers of the Biomedical Domain. Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo and Ricardo Ciferri; presented by Juliana Duque. UFSCar Database Group and USP Natural Language Processing Group, São Carlos, BR.

Page 1: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

A Process Based on Paragraph for Treatment Extraction in Scientific Papers of the Biomedical Domain

Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo and Ricardo Ciferri

presented by Juliana Duque

UFSCar Database Group and USP Natural Language Processing Group

São Carlos, BR

Page 2: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

http://gbd.dc.ufscar.br

Context and Motivation

Many electronic documents report experiments, describing the treatment adopted, the patients with some kind of disease, the number of patients enrolled in the treatment, symptoms and risk factors, and positive and negative effects.

Researchers and doctors can no longer process this huge number of documents.

A Process for Treatment Extraction

Page 3: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Context and Motivation

These documents are in an unstructured format, i.e., plain text, especially PDF files.

These data must be transformed from an unstructured into a structured format before they can be submitted to an automatic knowledge discovery process.

Page 4: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Goal

Identify and extract treatments: drugs, therapies and procedures.

Process by paragraph: an empirical analysis of papers on Sickle Cell Anemia showed that treatments mainly occur in sentences that mention complications, or in nearby sentences of the same paragraph.

Approaches for extracting information: machine learning, rules and a dictionary.
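The paragraph-based idea can be illustrated with a small sketch. It assumes the sentences of a paragraph are already split and that an earlier step has flagged which ones mention a complication. The function name `candidate_sentences` and the `window` parameter are hypothetical; the slides do not state how "very near" is defined.

```python
def candidate_sentences(paragraph, is_complication, window=1):
    """Return indices of sentences worth checking for treatments.

    paragraph: list of sentences from ONE paragraph.
    is_complication: list of booleans, one flag per sentence.
    window: number of neighbouring sentences to include on each side
            (an assumption, not a value taken from the slides).
    """
    selected = set()
    for i, flagged in enumerate(is_complication):
        if flagged:
            lo = max(0, i - window)
            hi = min(len(paragraph), i + window + 1)
            selected.update(range(lo, hi))
    return sorted(selected)


# Placeholder sentences; only the third one is flagged as a complication.
paragraph = ["s0", "s1", "s2", "s3", "s4"]
flags = [False, False, True, False, False]
print(candidate_sentences(paragraph, flags))  # [1, 2, 3]
```

Restricting the search to one paragraph keeps the neighbourhood from crossing into unrelated text.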

Page 5: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Contributions

Theoretical: domain knowledge and a methodology of information extraction.

Practical: resources (collection of documents, dictionary and rules) and tools (information extraction).

Page 6: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Extraction Process for Treatment

Final goal: data mining!

Page 7: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Sentence Classification

Input sentences:

This result is both clinically meaningful and statistically significant.

Hydroxyurea (HU) is considered to be the most successful drug therapy for severe sickle cell disease (SCD).

The HU dose was given orally once a day, initially at 20 mg/kg.

The ML algorithm assigns each sentence to a class:

Others: "This result is both ..."

Treatment: "Hydroxyurea (HU) is ..." and "The HU dose was ..."

Page 8: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Sentence Classification Process: training and testing phase 1/2

Bag-of-words model

AVM configuration: minimum frequency = 2

Attribute selection: 1 if the n-gram occurs in the sentence (present), 0 otherwise (absent)

Attributes: 1- to 3-grams

Not considered: stopword removal and stemming

Partitioning method: 10-fold cross-validation

Parentheses, brackets and apostrophes were removed
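The configuration on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. Lowercasing, whitespace tokenization and counting frequency per sentence are assumptions, and the names `tokenize`, `ngrams` and `build_matrix` are hypothetical.

```python
import re
from collections import Counter


def tokenize(sentence):
    # Drop parentheses, brackets and apostrophes, as stated on the slide.
    # Lowercasing and whitespace splitting are assumptions of this sketch.
    cleaned = re.sub(r"[()\[\]']", "", sentence)
    return cleaned.lower().split()


def ngrams(tokens, n_min=1, n_max=3):
    # All contiguous n-grams with n_min <= n <= n_max (1- to 3-grams).
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_matrix(sentences, min_frequency=2):
    per_sentence = [set(ngrams(tokenize(s))) for s in sentences]
    # Frequency = number of sentences containing the n-gram (an assumption;
    # the slide does not say how frequency is counted).
    counts = Counter(g for grams in per_sentence for g in grams)
    vocabulary = sorted(g for g, c in counts.items() if c >= min_frequency)
    # Binary attributes: 1 if the n-gram is present in the sentence, else 0.
    matrix = [[1 if g in grams else 0 for g in vocabulary]
              for grams in per_sentence]
    return vocabulary, matrix


sentences = [
    "Hydroxyurea (HU) is considered to be the most successful drug therapy "
    "for severe sickle cell disease (SCD).",
    "The HU dose was given orally once a day, initially at 20 mg/kg.",
]
vocabulary, matrix = build_matrix(sentences)
print(vocabulary)  # ['hu', 'the']
print(matrix)      # [[1, 1], [1, 1]]
```

With only two sentences, the minimum frequency of 2 keeps just the n-grams shared by both; a real collection would yield a much larger vocabulary.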

Page 9: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Sentence Classification Process: training and testing phase 2/2

Filter pre-processing: 1) No Filter; 2) Randomize; 3) Remove Misclassified (removes noise); 4) Resample (balances the classes)

Algorithms: Support Vector Machine and Naïve Bayes

Best result: SVM with Remove Misclassified and Resample. C1: 95.01% accuracy; C2: 96.62% accuracy
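The slides do not say how the Resample filter works internally. As an illustration of class balancing in general, the sketch below uses random oversampling, a technique swapped in here for demonstration; it is not claimed to be the filter used in the experiments. The function name and the example data are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_oversampling(examples, labels, seed=0):
    """Duplicate randomly chosen examples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)
    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((example, label) for example in items + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical, deliberately unbalanced data: 1 Treatment vs 4 Others.
examples = ["s1", "s2", "s3", "s4", "s5"]
labels = ["Treatment", "Others", "Others", "Others", "Others"]
balanced = balance_by_oversampling(examples, labels)
print(len(balanced))  # 8
```

Balancing should be applied to training data only; evaluating on resampled data can overstate accuracy, which may be one reason the filtered results on this slide are higher than the results reported for automatic classification.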

Page 10: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Results - Automatic Classification (SVM Algorithm)

Classifier  Class         Quant. Sentences  Precision  Recall  F-measure  Accuracy
C1          Complication  120               85%        64%     73%        79%
C2          Treatment     107               88%        51%     64.5%      71%

Page 11: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Rules

Sentence without POS tags: Fourteen patients had brain MRI and MRA evaluation after 4 years of hydroxyurea therapy.

Sentence with POS tags: Fourteen_CD patients_NNS had_VBD brain_NN MRI_NNP and_CC MRA_NNP evaluation_NN after_IN 4_CD years_NNS of_IN hydroxyurea_NN therapy_NN ._.

Rule for treatment sentences (representative word + POS):

[\w\-]*_IN (?:[\w-/\\]* )?([\w\-]*_NN|[\w\-]*_NNP|[\w\-]*_NNS) (?:treatment_NN|therapy_NN)
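As a check on how this rule behaves, the sketch below applies it to the tagged sentence with Python's `re` module. One adaptation is needed: Python rejects the unescaped hyphen in `[\w-/\\]` as a bad character range, so it is written `[\w\-/\\]` here. The variable names are illustrative only.

```python
import re

# Rule from the slide, with the hyphen in the second character class escaped
# so that Python's re module accepts it.
RULE = re.compile(
    r"[\w\-]*_IN (?:[\w\-/\\]* )?"
    r"([\w\-]*_NN|[\w\-]*_NNP|[\w\-]*_NNS) "
    r"(?:treatment_NN|therapy_NN)"
)

tagged = (
    "Fourteen_CD patients_NNS had_VBD brain_NN MRI_NNP and_CC MRA_NNP "
    "evaluation_NN after_IN 4_CD years_NNS of_IN hydroxyurea_NN therapy_NN ._."
)

match = RULE.search(tagged)
# The captured group is the tagged noun; strip the POS tag to get the term.
candidate = match.group(1).rsplit("_", 1)[0] if match else None
print(match.group(0))  # of_IN hydroxyurea_NN therapy_NN
print(candidate)       # hydroxyurea
```

The rule fires on "of hydroxyurea therapy": a preposition, then a noun, then the representative word "therapy".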

Page 12: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Rules

Sentence without POS tags: All patients were treated with antibiotics and, on average, became afebrile after a mean of two days of hospitalization.

Sentence with POS tags: All_DT patients_NNS were_VBD treated_VBN with_IN antibiotics_NNS and_CC ,_, on_IN average_NN ,_, became_VBD afebrile_JJ after_IN a_DT mean_NN of_IN two_CD days_NNS of_IN hospitalization_NN ._.

Rule for treatment sentences (POS only):

(?:[\w\-]*_VBD|[\w\-]*_VBN) (?:[\w\-]*_IN )?([\w\-]*_NN|[\w\-]*_NNP|[\w\-]*_NNS)
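The POS-only rule can be run over the whole tagged sentence to list every candidate it produces. The sketch below uses `re.finditer`; the variable names are illustrative, and the tag is stripped by splitting on the last underscore.

```python
import re

# POS-only rule from the slide: a past-tense verb or past participle,
# an optional preposition, then a noun.
RULE = re.compile(
    r"(?:[\w\-]*_VBD|[\w\-]*_VBN) (?:[\w\-]*_IN )?"
    r"([\w\-]*_NN|[\w\-]*_NNP|[\w\-]*_NNS)"
)

tagged = (
    "All_DT patients_NNS were_VBD treated_VBN with_IN antibiotics_NNS and_CC "
    ",_, on_IN average_NN ,_, became_VBD afebrile_JJ after_IN a_DT mean_NN "
    "of_IN two_CD days_NNS of_IN hospitalization_NN ._."
)

raw = [m.group(1) for m in RULE.finditer(tagged)]
candidates = [token.rsplit("_", 1)[0] for token in raw]
print(raw)         # ['antibiotics_NN']
print(candidates)  # ['antibiotics']
```

Because the `_NN` alternative is tried first and nothing anchors the end of the token, the capture stops at `antibiotics_NN` even though the tag is `_NNS`; the extracted word is unaffected. A rule with no representative word accepts any verb-noun pattern, which is consistent with the lower precision reported for rules in the conclusions.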

Page 13: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Dictionary

Biomedical Database

In the MSH study, 299 adults were randomized to receive HU or placebo for a period of approximately 2 years.

These results confirm the benefit of HU, even in very young children, and its possible role in primary stroke prevention.

Term: Hydroxyurea

Variation: HU
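A dictionary lookup of this kind might look like the sketch below. Only the Hydroxyurea/HU entry comes from the slide; the data structure, the function name `find_treatments` and the case-sensitive whole-word matching are assumptions.

```python
import re

# Canonical term -> known surface forms. Only this entry is from the slide.
DICTIONARY = {"Hydroxyurea": ["Hydroxyurea", "HU"]}


def find_treatments(sentence, dictionary=DICTIONARY):
    """Return the canonical terms whose variations occur in the sentence."""
    found = []
    for term, variations in dictionary.items():
        for variation in variations:
            # \b keeps "HU" from matching inside longer words such as "HUMAN".
            if re.search(r"\b" + re.escape(variation) + r"\b", sentence):
                found.append(term)
                break
    return found


sentence = (
    "In the MSH study, 299 adults were randomized to receive HU or placebo "
    "for a period of approximately 2 years."
)
print(find_treatments(sentence))  # ['Hydroxyurea']
```

Mapping each variation back to one canonical term lets occurrences of "HU" and "Hydroxyurea" be counted as the same treatment.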

Page 14: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Conclusions

Classification: 79% accuracy for classifier C1 (Complication); 71% accuracy for classifier C2 (Treatment)

Rules: 45% precision and 70% recall; new experiments: 59% precision and 75% recall

Dictionary: 100% precision on known occurrences of treatments; variations of terms and synonyms

Page 15: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Future Work

Apply the proposed process to other terms in the context of Sickle Cell Anemia

Investigate the identification of treatment and symptom information in scientific papers on other diseases

Use indexes to speed up the identification of terms

Other biomedical areas may also benefit from our text mining approach

Page 16: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

A Process Based on Paragraph for Treatment Extraction in Scientific Papers of the Biomedical Domain

Questions?

UFSCar Database Group and USP Natural Language Processing Group

São Carlos, BR

Page 18: Juliana Duque, Pablo Matos, Cristina Ciferri, Thiago Pardo  and  Ricardo Ciferri

Formulas

Precision: TP / (TP + FP)

Recall: TP / (TP + FN)

F-measure: (2 x Precision x Recall) / (Precision + Recall)

Accuracy: (TP + TN) / (TP + TN + FN + FP)
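These formulas translate directly into code. In the sketch below the confusion-matrix counts are hypothetical and chosen only for illustration; the final line recomputes the F-measure from the precision and recall reported for classifier C1 (85% and 64%).

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f_measure(prec, rec):
    return 2 * prec * rec / (prec + rec)


def accuracy(tp, tn, fp, fn):
    # Note the parentheses: the sum TP + TN is divided by the total.
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts, not taken from the experiments.
tp, tn, fp, fn = 8, 5, 2, 5
print(precision(tp, fp))         # 0.8
print(accuracy(tp, tn, fp, fn))  # 0.65

# F-measure from the precision and recall reported for C1.
print(round(f_measure(0.85, 0.64), 2))  # 0.73
```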
