Sub-Topic Classification of HIV related Opportunistic Infections Miguel Anderson and Joseph Fonseca


Page 1

Sub-Topic Classification of HIV related Opportunistic Infections

Miguel Anderson and Joseph Fonseca

Page 2

Introduction

● Image collected from the CDC https://www.cdc.gov/hiv/basics/statistics.html

Page 3

Background Info

● What is HIV?

● How is it transferred?

● How is it treated?

● What are opportunistic infections?

Page 4

The viral particles

● “HIV Is a Retrovirus.” The Montgolfier Brothers, University of Bristol, www.chm.bris.ac.uk/webprojects2002/levasseur/hiv/hiv3.htm.

Page 5

Binding of the coat protein to receptor

Figure label: VIRAL COAT PROTEIN

● “HIV life cycle: How HIV infects a cell and replicates itself using reverse transcriptase,” YouTube video, posted by Kleptoplast, Jan 6, 2012. https://www.youtube.com/watch?v=odRyv7V8LAE

Page 6

Conformational change; binding of second protein

● “HIV life cycle: How HIV infects a cell and replicates itself using reverse transcriptase,” YouTube video, posted by Kleptoplast, Jan 6, 2012. https://www.youtube.com/watch?v=odRyv7V8LAE

Page 7

Fusion of the Membranes; phospholipids

● “HIV life cycle: How HIV infects a cell and replicates itself using reverse transcriptase,” YouTube video, posted by Kleptoplast, Jan 6, 2012. https://www.youtube.com/watch?v=odRyv7V8LAE

Page 8

Degradation of the matrix and capsid protein

Figure labels: MATRIX PROTEIN; CAPSID PROTEIN

● “HIV life cycle: How HIV infects a cell and replicates itself using reverse transcriptase,” YouTube video, posted by Kleptoplast, Jan 6, 2012. https://www.youtube.com/watch?v=odRyv7V8LAE

Page 9

HIV RNA in Cell

Figure labels: VIRAL SS RNA; VIRAL DNA with host nucleotides, single stranded; VIRAL SS DNA

● “HIV life cycle: How HIV infects a cell and replicates itself using reverse transcriptase,” YouTube video, posted by Kleptoplast, Jan 6, 2012. https://www.youtube.com/watch?v=odRyv7V8LAE

Page 10

Double stranded DNA with Reverse Transcriptase

Figure label: VIRAL DOUBLE STRANDED DNA

● “HIV life cycle: How HIV infects a cell and replicates itself using reverse transcriptase,” YouTube video, posted by Kleptoplast, Jan 6, 2012. https://www.youtube.com/watch?v=odRyv7V8LAE

Page 11

Integrase Protein

● “HIV life cycle: How HIV infects a cell and replicates itself using reverse transcriptase,” YouTube video, posted by Kleptoplast, Jan 6, 2012. https://www.youtube.com/watch?v=odRyv7V8LAE

Page 12

Inside the Nucleus

Figure labels: NUCLEAR PORE; HOST CHROMOSOME; VIRAL DNA

Integration establishes lifelong infection; endonuclease activity.

Page 13

Production of Viral Particles-Transcription

RNA POLYMERASE

● “HIV life cycle: How HIV infects a cell and replicates itself using reverse transcriptase,” YouTube video, posted by Kleptoplast, Jan 6, 2012. https://www.youtube.com/watch?v=odRyv7V8LAE

Page 14

Translation

Figure labels: Any viral protein; Ribosome; mRNA; ROUGH ER

● “HIV life cycle: How HIV infects a cell and replicates itself using reverse transcriptase,” YouTube video, posted by Kleptoplast, Jan 6, 2012. https://www.youtube.com/watch?v=odRyv7V8LAE

Page 15

Viral Polyproteins and RNA at Infected Cell Surface

Page 16

Virus Budding Off From Infected Cell

Page 17

Protease Clips Polyprotein Chains

Figure labels: PROTEASE; POLYPROTEIN CHAIN

Page 18

https://www.cdc.gov/hiv/pdf/library/factsheets/hiv101-consumer-info.pdf

Page 19

Treatment with Molecular Targets: Antiretrovirals

● Two to three drugs are combined to prevent resistance, since HIV has a fast mutation rate.

● Fusion inhibitors (block entry mediated by the viral envelope glycoproteins gp120 and gp41)

● CCR5 antagonists

● Nucleoside reverse transcriptase inhibitors (block reverse transcriptase with dummy nucleosides, each a base and a sugar)

● Integrase inhibitors (allosteric sites). Less integration causes fewer CD4 cells to go through apoptosis.

● Protease inhibitors (active site)

Page 20

Opportunistic Infections and Stages of HIV

● Opportunistic infections (OIs) are infections that occur more frequently and are more severe in individuals with weakened immune systems, including people with HIV.

● Stage 1: Acute infection, within 2-4 weeks of infection

● Stage 2: Clinical latency (HIV inactivity)

● Stage 3: AIDS, which can lead to opportunistic infections

Page 21

Overview

● Fonseca et al. (2018) - Social network analysis of HIV/AIDS literature

○ Similar to the Golgi 2 approach

● Pletscher-Frankild et al. (2015) - Co-occurrences of features in abstracts, based on occurrence counts

○ Different from the approach used here - tf*idf and a cosine similarity network

Page 22

Approaches tested

Train Model in Two Ways

● Binary

○ Viral vs Bacterial opportunistic infections

● Multi Class

○ HBV vs HCV vs Syphilis vs Tuberculosis

Precision and recall for the model

Page 23

Approaches tested

Infection Type   MeSH Term Searched         # of Abstracts Retrieved
Viral            Viral Infection            67667
Viral            Hepatitis C                2370
Viral            Hepatitis B                1866
Bacterial        Bacterial Infection        3429
Bacterial        Tuberculosis               1829
Bacterial        Syphilis                   581
Fungal           Fungal Infection           947
Fungal           Pneumocystis Pneumonia     459
Fungal           Candidiasis                377
Fungal           Cryptococcal Meningitis    131

All MeSH terms were searched with HIV and Boolean operators.

● Query the actual disease with HIV

● Retrieved by Golgi 2 and PubMed manual search
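The deck does not show the query strings themselves. As a rough illustration only, pairing each disease term with HIV through a Boolean AND could look like the sketch below. The helper `build_query`, the `SUBTOPICS` dictionary, and the use of the `[MeSH Terms]` field tag on both terms are assumptions made for this example, not the authors' recorded queries.

```python
# Hypothetical sketch: the slides do not list the exact query strings.
# "[MeSH Terms]" is a PubMed search field tag; using it on both terms
# is an assumption made for illustration.

def build_query(disease_term: str, base_term: str = "HIV") -> str:
    """Combine a disease MeSH term with the base term using Boolean AND."""
    return f'"{base_term}"[MeSH Terms] AND "{disease_term}"[MeSH Terms]'


# Sub-topics used for the multi-class model, grouped by infection type.
SUBTOPICS = {
    "Viral": ["Hepatitis C", "Hepatitis B"],
    "Bacterial": ["Tuberculosis", "Syphilis"],
}

# One query string per disease term.
queries = {
    term: build_query(term)
    for terms in SUBTOPICS.values()
    for term in terms
}

for term, query in queries.items():
    print(f"{term}: {query}")
```

Each resulting string could then be pasted into a manual PubMed search or passed to a scraper.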

Page 24

Exploratory Analysis Methods

Golgi 2 Parameters

Min doc frequency %: 5

# tokens that can be included in a phrase: 3

Threshold rank between 0 and 1: 0.6

Golgi 2 Parameters

Min doc frequency %: 3

# tokens that can be included in a phrase: 3

Threshold rank between 0 and 1: 0.7

Golgi 2 and PubMed scraper

● Vectorize documents and weighting scheme

● n-gram*Freq-IDF ranking

● Latent semantic analysis

● Semantic concept clustering

● Cluster visualization
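To make the vectorize, weight, and threshold steps concrete, here is a minimal standard-library sketch of tf*idf weighting followed by a cosine similarity network. It is not Golgi 2's implementation. The function names are hypothetical, and treating "min doc frequency" as a fraction of documents and the "threshold rank" as a cosine similarity cutoff are assumptions. The three documents are invented.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (unigrams only)."""
    return text.lower().split()


def tfidf_vectors(docs, min_doc_freq=0.0):
    """Return one {term: weight} dict per document.

    min_doc_freq is the fraction of documents a term must appear in
    to be kept (an assumed reading of "Min doc frequency %").
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vocab = {t for t, c in df.items() if c / n >= min_doc_freq}
    vectors = []
    for toks in tokenized:
        tf = Counter(t for t in toks if t in vocab)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_edges(docs, threshold=0.6, min_doc_freq=0.0):
    """Link every pair of documents whose similarity meets the threshold."""
    vecs = tfidf_vectors(docs, min_doc_freq)
    edges = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            score = cosine(vecs[i], vecs[j])
            if score >= threshold:
                edges.append((i, j, round(score, 3)))
    return edges


# Invented toy abstracts, for illustration only.
docs = [
    "hiv hepatitis c coinfection liver fibrosis",
    "hiv hepatitis c coinfection liver fibrosis",
    "hiv tuberculosis sputum culture treatment",
]
edges = similarity_edges(docs, threshold=0.6)
print(edges)
```

Because "hiv" appears in every toy document, its idf weight is zero, so it contributes nothing to similarity. The same effect would apply to any term shared by every abstract in a corpus retrieved with HIV in the query.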

Page 25

Predictive Analysis Methods

● LightSIDE binary classification using Naïve Bayes

● LightSIDE sub-topic classification using logistic regression

● Weka sub-topic classification using logistic regression

● 10-fold cross-validation for both

● Evaluation metric: accuracy of predicted labels
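As a rough picture of the binary setup, the sketch below trains a multinomial Naive Bayes classifier on unigram counts and scores it with k-fold cross-validation. It is a standard-library stand-in, not LightSIDE or Weka. The function names, the add-one smoothing, the way folds are assigned, and the toy abstracts are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit multinomial Naive Bayes on unigram counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, doc):
    """Return the label with the highest log-posterior for one document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in doc.lower().split():
            if word in vocab:  # words unseen in training are ignored
                # add-one (Laplace) smoothing
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=10):
    """Mean accuracy over k folds; fold i holds out items i, i+k, i+2k, ..."""
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, len(docs), k))
        if not test_idx:
            continue
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_nb(train_docs, train_labels)
        correct = sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Invented toy abstracts, for illustration only.
docs = (["hiv hepatitis liver viral load"] * 10
        + ["hiv tuberculosis sputum bacterial culture"] * 10)
labels = ["viral"] * 10 + ["bacterial"] * 10

mean_accuracy = cross_validate(docs, labels, k=10)
print(f"Mean 10-fold accuracy: {mean_accuracy:.2f}")
```

The toy classes share only the word "hiv", so they separate perfectly. Real abstracts overlap far more, which is where the reported accuracies below 100% come from.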

Page 26

Subtopic Classification Training Features

•Features were used for training and selected based on accuracy of the model

•Unigrams outperformed other features

•The best features of the LightSIDE and Weka models were then compared in

Page 27

Model Confusion Matrix for LightSIDE Model

•The confusion matrix shows where the model is accurately classifying the labels of the abstracts using the LightSIDE features.

•This model performed at 78% accuracy
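For readers unfamiliar with how accuracy, precision, and recall are read off a confusion matrix, a minimal sketch follows. The label lists are invented and do not reproduce the deck's matrix. The function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count each (actual label, predicted label) pair."""
    return Counter(zip(actual, predicted))


def accuracy(matrix):
    """Fraction of items on the diagonal (actual == predicted)."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total


def precision_recall(matrix, label):
    """Per-class precision and recall; false negatives lower recall."""
    tp = matrix[(label, label)]
    fp = sum(n for (a, p), n in matrix.items() if p == label and a != label)
    fn = sum(n for (a, p), n in matrix.items() if a == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented labels for illustration; these are not the deck's results.
actual = ["HBV", "HBV", "HBV", "HCV", "HCV", "HCV", "TB", "TB"]
predicted = ["HBV", "HCV", "HCV", "HCV", "HCV", "HBV", "TB", "TB"]

matrix = confusion_matrix(actual, predicted)
print("accuracy:", accuracy(matrix))
for label in ("HBV", "HCV", "TB"):
    print(label, precision_recall(matrix, label))
```

In this toy data, two of three HBV abstracts are predicted as HCV, so HBV recall falls to one third even though overall accuracy is 62.5%.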

Page 28

Model Confusion Matrix for Weka Model

•The confusion matrix shows where the model is accurately classifying the labels of the abstracts using the UMLS concepts in Weka.

•This model performed at 45% accuracy

Page 29

Results

•LightSIDE Model Testing Results

•The LightSIDE model and its selected features were used to test the model accuracy

•The model performed at 80.5% accuracy

Page 30

TAPoR Text Analysis Comparator

Figure to right represents the word distribution of the most common words. POI: relative ratio

Figure to left represents the word distribution of unique words. POI: text counts
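The kind of comparison these figures summarize can be sketched in a few lines: count the words in two texts, list the words unique to each, and compute a relative frequency ratio for the shared ones. This is not TAPoR's code. The function names, the reading of "relative ratio" as a ratio of relative frequencies, and both sample texts are assumptions.

```python
from collections import Counter


def word_counts(text):
    """Count lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())


def compare_texts(text1, text2):
    """Return words unique to each text and frequency ratios for shared words."""
    c1, c2 = word_counts(text1), word_counts(text2)
    unique1 = set(c1) - set(c2)
    unique2 = set(c2) - set(c1)
    n1, n2 = sum(c1.values()), sum(c2.values())
    # Ratio of relative frequencies (text 1 over text 2) for shared words.
    ratios = {w: (c1[w] / n1) / (c2[w] / n2) for w in set(c1) & set(c2)}
    return unique1, unique2, ratios


# Invented snippets standing in for the HBV (text 1) and HCV (text 2) corpora.
text1 = "hiv hbv vaccine response hbv"
text2 = "hiv hcv genotype fibrosis interferon hcv"

unique1, unique2, ratios = compare_texts(text1, text2)
print("unique to text 1:", sorted(unique1))
print("unique to text 2:", sorted(unique2))
print("relative ratios:", ratios)
```

A text with more unique words gives a unigram classifier more class-specific evidence to work with.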

Page 31

TAPoR Statistics on the text analysis

Page 32

Discussion/Conclusion

● The text comparison was run to see why the predictive model still confused the HBV (text 1) and HCV (text 2) subtopics, a possible explanation for the model's high rate of false negatives.

● The word counts of the abstracts for each subtopic were not equal.

● Although the same number of abstracts was used, the quality of the abstracts was not accounted for.

● Based on the analyses, it is apparent that classification was biased toward the HCV subtopic because of the number of unique words it possessed.

Page 33

Future Direction

•Combine UMLS and MeSH terms to see if this increases classification model accuracy

•Add more robust features to the abstract scraper that test for the quality of results

•Build a predictive model to test where common opportunistic infections may arise in a population. This could be used to determine whether there are undiagnosed HIV-positive patients in the population

Page 34

References

● Grimwade, K., & Swingler, G. H. (2006). Cotrimoxazole prophylaxis for opportunistic infections in children with HIV infection. Cochrane Database of Systematic Reviews. doi:10.1002/14651858.cd003508.pub2

● Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255-260. doi:10.1126/science.aaa8415

● Ortiz, M. S. (2018, June 11). Tokens [Video file]. Retrieved from https://drive.google.com/file/d/1n57f6eDKRHK7Zx_nqe9e_sy3JCnFSsN7/view

● Ortiz, M. S. (2018, June 11). How does TF-IDF weighting really work? [Video file]. Retrieved from https://drive.google.com/file/d/1uUnZgJhMZ4S7qQhHf395dOt6bMHzVm4-/view

● “HIV Is a Retrovirus.” The Montgolfier Brothers, University of Bristol, www.chm.bris.ac.uk/webprojects2002/levasseur/hiv/hiv3.htm

● “HIV life cycle: How HIV infects a cell and replicates itself using reverse transcriptase,” YouTube video, posted by Kleptoplast, Jan 6, 2012. https://www.youtube.com/watch?v=odRyv7V8LAE

● “HIV/AIDS.” Centers for Disease Control and Prevention, 30 May 2017, www.cdc.gov/hiv/basics/livingwithhiv/opportunisticinfections.html.