Modern Neoplasm Classification
Dept of Pathology, University of Michigan
October 27, 2005

Jules J. Berman, Ph.D., M.D.
jjberman@alum.mit.edu

What is a [tumor] classification?

A grouped taxonomy [listing of all tumors] with the following properties:

Inheritance: Hierarchical structure, with each class of tumors inheriting properties of its ancestors

Uniqueness: Each tumor occurs in only one place in the classification

Comprehensive: All tumors are included

Class-intransitive: A tumor from one class does not change into a tumor from another class (e.g. an adenocarcinoma does not become a lymphoma)

Ernst Mayr: The growth of biological thought: diversity, evolution and inheritance. Cambridge: Belknap Press; 1982.
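The four properties can be made concrete with a small data structure. The Python sketch below is illustrative only: the class names, the properties attached to them, and the helper functions `ancestors`, `inherited_properties`, and `same_class` are assumptions chosen for the example and are not taken from any published classification. Storing one parent per tumor enforces uniqueness, and walking up the parent chain gives inheritance.

```python
# Hypothetical miniature classification; names and properties are illustrative only.
# One parent per entry: each tumor occurs in exactly one place (uniqueness).
PARENT = {
    "neoplasm": None,
    "epithelial tumor": "neoplasm",
    "hematopoietic tumor": "neoplasm",
    "adenocarcinoma": "epithelial tumor",
    "lymphoma": "hematopoietic tumor",
}

# Properties asserted directly on a class (not on its descendants).
OWN_PROPERTIES = {
    "neoplasm": {"clonal growth"},
    "epithelial tumor": {"epithelial lineage"},
    "hematopoietic tumor": {"hematopoietic lineage"},
}


def ancestors(name):
    """Return the chain of classes from the parent of `name` up to the root."""
    chain = []
    parent = PARENT[name]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def inherited_properties(name):
    """Inheritance: a class carries its own properties plus those of every ancestor."""
    props = set(OWN_PROPERTIES.get(name, set()))
    for ancestor in ancestors(name):
        props |= OWN_PROPERTIES.get(ancestor, set())
    return props


def same_class(a, b):
    """True when two tumors share the same direct parent class."""
    return PARENT[a] == PARENT[b]
```

Under this toy tree, `same_class("adenocarcinoma", "lymphoma")` is false, which is the situation the class-intransitive property says should never change over the life of a tumor.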

Problems with current tumor classifications

Mixed bag of tumor classes based on:

Anatomic site (roughly distance from the tumor to the floor as in “head and neck” tumors)

Clinical specialty (dermatologic tumors)

Functional similarity of cell types (e.g. endocrine tumors)

Not based on any describable biologic premise.

Molecular classification of cancer

The so-called molecular classifications (based largely on gene expression arrays of tumors) are simply a way of finding variants within a population.

Mostly, you see experiments designed to cluster out variants of a tumor type (slow-growing, responsive to a specific treatment, prone to metastasize, etc.)

This is simply not classification (it ignores the class-intransitive property), and in fact no classification has emerged from any of the work that has been done with molecular diagnostics.

My opinion: Gene expression array studies do not create classifications – but are very useful taxon finders

Developmental Lineage Classification and Taxonomy of Neoplasms

Similar to (but different from) the classification efforts of the 1950s (particularly Willis)

Old hypothesis (more or less discredited) is that tumor development recapitulates embryologic development.

New (my) hypothesis is that tumors will tend to inherit the molecular pathways from their developmental ancestors. May be helpful in selecting classes of tumors responsive to molecular targets.

Despite the difference in hypotheses, either way you end up with a classification that follows embryologic lines and that fits in with the stem cell hypothesis.

Now 145,000+ terms (10+ Megabytes)

Publicly available and free

The latest version at:

www.pathologyinformatics.org

53 ways of writing prostate cancer

Prostate cancer is the concept, the 53 synonyms are the terms for the concept, and C4863000 is the code.

<name nci-code = "C4863000">prostate with adenoca</name>
<name nci-code = "C4863000">adenoca arising in prostate</name>
<name nci-code = "C4863000">adenoca involving prostate</name>
<name nci-code = "C4863000">adenoca arising from prostate</name>
<name nci-code = "C4863000">adenoca of prostate</name>
<name nci-code = "C4863000">adenoca of the prostate</name>
<name nci-code = "C4863000">prostate with adenocarcinoma</name>
<name nci-code = "C4863000">adenocarcinoma arising in prostate</name>
<name nci-code = "C4863000">adenocarcinoma involving prostate</name>
<name nci-code = "C4863000">adenocarcinoma arising from prostate</name>
<name nci-code = "C4863000">adenocarcinoma of prostate</name>
<name nci-code = "C4863000">adenocarcinoma of the prostate</name>
<name nci-code = "C4863000">adenocarcinoma arising in the prostate</name>
<name nci-code = "C4863000">adenocarcinoma involving the prostate</name>
<name nci-code = "C4863000">adenocarcinoma arising from the prostate</name>
<name nci-code = "C4863000">prostate with ca</name>
<name nci-code = "C4863000">ca arising in prostate</name>
<name nci-code = "C4863000">ca involving prostate</name>
<name nci-code = "C4863000">ca arising from prostate</name>
<name nci-code = "C4863000">ca of prostate</name>
<name nci-code = "C4863000">ca of the prostate</name>
<name nci-code = "C4863000">prostate with cancer</name>
<name nci-code = "C4863000">cancer arising in prostate</name>
<name nci-code = "C4863000">cancer involving prostate</name>
<name nci-code = "C4863000">cancer arising from prostate</name>
<name nci-code = "C4863000">cancer of prostate</name>

More:

<name nci-code = "C4863000">cancer of the prostate</name>
<name nci-code = "C4863000">cancer arising in the prostate</name>
<name nci-code = "C4863000">cancer involving the prostate</name>
<name nci-code = "C4863000">cancer arising from the prostate</name>
<name nci-code = "C4863000">prostate with carcinoma</name>
<name nci-code = "C4863000">carcinoma arising in prostate</name>
<name nci-code = "C4863000">carcinoma involving prostate</name>
<name nci-code = "C4863000">carcinoma arising from prostate</name>
<name nci-code = "C4863000">carcinoma of prostate</name>
<name nci-code = "C4863000">carcinoma of the prostate</name>
<name nci-code = "C4863000">carcinoma arising in the prostate</name>
<name nci-code = "C4863000">carcinoma involving the prostate</name>
<name nci-code = "C4863000">carcinoma arising from the prostate</name>
<name nci-code = "C4863000">prostate adenoca</name>
<name nci-code = "C4863000">prostate adenocarcinoma</name>
<name nci-code = "C4863000">prostate ca</name>
<name nci-code = "C4863000">prostate cancer</name>
<name nci-code = "C4863000">prostate carcinoma</name>
<name nci-code = "C4863000">prostatic cancer</name>
<name nci-code = "C4863000">prostatic carcinoma</name>
<name nci-code = "C4863000">prostatic adenocarcinoma</name>
<name nci-code = "C4863000">prostate gland adenocarcinoma</name>
<name nci-code = "C4863000">adenocarcinoma of the prostate gland</name>
<name nci-code = "C4863000">adenocarcinoma of prostate gland</name>
<name nci-code = "C4863000">prostate gland carcinoma</name>
<name nci-code = "C4863000">carcinoma of the prostate gland</name>
<name nci-code = "C4863000">carcinoma of prostate gland</name>
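The concept/term/code relationship in the list above amounts to a many-to-one lookup: many terms, one code. As a minimal Python sketch (the talk's own tooling is Perl, and the names `XML_NAMES`, `NAME_PATTERN`, `build_term_index`, and `terms_for_code` are hypothetical), the `<name>` elements can be read into a dictionary keyed by term. Only three of the 53 elements are repeated here to keep the example short.

```python
import re

# Three of the <name> elements listed above; the full list has 53 entries.
XML_NAMES = """
<name nci-code = "C4863000">adenoca of prostate</name>
<name nci-code = "C4863000">prostate cancer</name>
<name nci-code = "C4863000">carcinoma of the prostate gland</name>
"""

# Captures the code attribute and the term text of each element.
NAME_PATTERN = re.compile(r'<name nci-code = "([^"]+)">([^<]+)</name>')


def build_term_index(xml_text):
    """Map every term (synonym) to the code of its concept."""
    return {term: code for code, term in NAME_PATTERN.findall(xml_text)}


def terms_for_code(index, code):
    """Invert the index: list all terms that share one concept code."""
    return sorted(term for term, c in index.items() if c == code)


index = build_term_index(XML_NAMES)
```

With this index, `index["prostate cancer"]` and `index["adenoca of prostate"]` resolve to the same code, which is what lets differently worded reports be retrieved as one concept.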

Is the taxonomy comprehensive?

Let's compare it with SNOMED.

Comparing the Developmental Lineage Classification with SNOMED.

1. Used the 2005 version of UMLS (free from www.nlm.nih.gov)

2. The UMLS files (name, size in bytes, date):

MRCON05    650,948,750  1-18-05

and

MRCXT    1,610,612,736  1-18-05
MRCXT2   1,610,612,736  1-18-05
MRCXT3   1,610,612,736  1-18-05
MRCXT4   1,610,612,736  1-18-05
MRCXT5   1,610,612,736  1-18-05
MRCXT6   1,610,612,736  1-18-05
MRCXT7   1,196,031,492  1-18-05

4. Extracted the SNOMED CT terms from MRCON05 using the script:

MRCON05.PL  2,098  5-30-05

MRCON05.PL

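#!/usr/bin/perl
# MRCON05.PL: write the English SNOMED CT terms in mrcon05 to snom05
use strict;
use warnings;

my $start = time();
open(my $text, "<", "mrcon05") or die "Cannot open mrcon05: $!";
open(my $out, ">", "snom05") or die "Cannot open snom05: $!";
while (my $line = <$text>) {
    chomp $line;
    my @linearray = split(/\|/, $line, -1);  # pipe-delimited fields, keep empty ones
    next if (@linearray < 15);               # skip lines too short to hold a term
    my $cuinumber  = $linearray[0];           # concept identifier (CUI)
    my $language   = $linearray[1];
    my $vocabulary = $linearray[11];
    next if ("ENG" ne $language);             # English terms only
    next if ("SNOMEDCT" ne $vocabulary);      # SNOMED CT terms only
    print $out "$cuinumber $linearray[14]\n"; # CUI followed by the term
}
close($text);
close($out);
my $end = time();
my $total = $end - $start;
print "\ntotal time was $total seconds\n";
exit;

Execution time of 132 seconds on a 2.89 GHz PC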

5. This produced a 35+ MByte file:

SNOM05  35,127,210  5-30-05

6. Created a Perl script, neopull2.pl, that uses the MRCXT "Neoplasm" relationship to identify all the neoplasm CUIs in UMLS and to pull out any of the SNOMED terms that correspond to a neoplasm CUI

7. The output file is:

SNOM.OUT  567,372  5-30-05

8. This output file contains a lot of redundant terms and plurals, so I wrote snoclean.pl to get rid of the extraneous terms:

SNOCLEAN.PL  1,092  5-30-05

9. The final output file is:

SNOCLEAN.OUT  300,834  5-30-05

SNOMED contains 2,673 different neoplasm concepts and 7,696 neoplasm terms
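The scripts neopull2.pl and snoclean.pl are not reproduced in the slides, so the following Python sketch only guesses at the shape of steps 6 and 8. Everything in it is an assumption: the function names `pull_neoplasm_terms` and `clean_terms`, the made-up CUIs, and above all the plural rule (a term ending in "s" is dropped when the same CUI already has the term without the final "s").

```python
def pull_neoplasm_terms(snomed_lines, neoplasm_cuis):
    """Keep the (CUI, term) pairs whose CUI is in the neoplasm set (step 6)."""
    pairs = []
    for line in snomed_lines:
        # Each line is "CUI term", the layout written by MRCON05.PL.
        cui, _, term = line.strip().partition(" ")
        if cui in neoplasm_cuis and term:
            pairs.append((cui, term.lower()))
    return pairs


def clean_terms(pairs):
    """Drop duplicate terms and simple plurals (step 8, under an assumed rule)."""
    unique = set(pairs)
    return sorted(
        (cui, term)
        for cui, term in unique
        if not (term.endswith("s") and (cui, term[:-1]) in unique)
    )


# Made-up CUIs and terms, for illustration only.
lines = [
    "C0000001 Lipoma",
    "C0000001 lipoma",
    "C0000001 Lipomas",
    "C0000002 Fracture",
]
cleaned = clean_terms(pull_neoplasm_terms(lines, {"C0000001"}))
concept_count = len({cui for cui, _ in cleaned})
term_count = len(cleaned)
```

Counting distinct CUIs and distinct surviving terms in the cleaned list is one way to arrive at concept and term totals of the kind reported for SNOMED.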

SNOMED
The total number of neoplasm concepts is 2,673
The total number of neoplasm terms is 7,696

Developmental Lineage
The total number of neoplasm concepts is 6,193
The total number of neoplasm terms is 146,666

The Developmental Lineage has:
2.3 times as many neoplasm concepts as SNOMED
19 times as many neoplasm terms as SNOMED

Can one pathologist create a better nomenclature than the CAP? Maybe.
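The two ratios follow directly from the four totals. A short Python check, using only the figures on the slide (the variable names are arbitrary, and the terms-per-concept figure is an extra derived quantity not stated in the talk):

```python
# Totals as reported on the slide.
snomed = {"concepts": 2_673, "terms": 7_696}
lineage = {"concepts": 6_193, "terms": 146_666}

concept_ratio = lineage["concepts"] / snomed["concepts"]
term_ratio = lineage["terms"] / snomed["terms"]

# Derived, not from the slide: average number of synonyms carried per concept.
terms_per_concept = {
    "SNOMED": snomed["terms"] / snomed["concepts"],
    "Developmental Lineage": lineage["terms"] / lineage["concepts"],
}

print(f"{concept_ratio:.1f}x concepts, {term_ratio:.0f}x terms")
```

The larger gap in terms than in concepts reflects how many more synonyms per concept the Developmental Lineage taxonomy carries.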

The large curated nomenclatures can't be used for concept matching and are fast becoming obsolete for their intended mode of human-based implementation due to the explosive growth of the data domain

terabytes and terabytes every day – think about all types of digital data in medical information systems

Prakash Nadkarni, Roland Chen, Cynthia Brandt: UMLS concept indexing for production databases: a feasibility study. J Am Med Inform Assoc. 2001;8:80-91.

Conclusions: Considerable curation needs to be performed to define a UMLS subset that is suitable for concept matching.

What is the value of a comprehensive neoplasm classification?

1. A modern classification is the key to retrieving, organizing, and integrating the data held in biomedical databases (including the data held in hospital information systems)

Can we use the taxonomy to code our surgical pathology reports and other textual documents?

2. A classification is a hypothesis about the nature of reality.

Can we use the classification to select classes of tumors (rather than single tumors) for molecularly targeted cancer therapy? [We've done this with antibiotics with astounding success.]

Can we learn something about the biology of tumors by using the classification to stratify the data found in large biological databases and inspecting the results?

Autocoding Surgical Pathology Reports

What is the size of the data domain when we're talking about surgical pathology reports?

There are about 25 million surgical pathology reports generated in the U.S. each year (about 50 million cytology reports)

Allowing 1000 bytes per report, these reports occupy 25 Gigabytes of text (25 thousand million bytes)

Here is what 1000 bytes looks like:

To be, or not to be,--that is the question:--
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles,
And by opposing end them?--To die,--to sleep,--
No more; and by a sleep to say we end
The heartache, and the thousand natural shocks
That flesh is heir to,--'tis a consummation
Devoutly to be wish'd. To die,--to sleep;--
To sleep! perchance to dream:--ay, there's the rub;
For in that sleep of death what dreams may come,
When we have shuffled off this mortal coil,
Must give us pause: there's the respect
That makes calamity of so long life;
For who would bear the whips and scorns of time,
The oppressor's wrong, the proud man's contumely,
The pangs of despis'd love, the law's delay,
The insolence of office, and the spurns
That patient merit of the unworthy takes,
When he himself might his quietus make
With a bare bodkin? who would these fardels bear,
To grunt and sweat under a weary life, But

Compressed, all of the surgical pathology reports produced in the U.S. in one year will fit easily on one DVD (like 10 episodes of I Love Lucy).
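The size estimate is simple arithmetic, sketched here in Python. The report count and bytes per report come from the slides; the DVD capacity of 4.7 GB (a single-layer disc) is an assumption added for the example, since the talk does not say which kind of DVD is meant.

```python
REPORTS_PER_YEAR = 25_000_000        # from the slide
BYTES_PER_REPORT = 1_000             # from the slide
DVD_CAPACITY_BYTES = 4_700_000_000   # assumption: single-layer DVD, 4.7 GB

total_bytes = REPORTS_PER_YEAR * BYTES_PER_REPORT
total_gigabytes = total_bytes / 1_000_000_000

# How strongly the text must compress to fit on one disc.
required_compression = total_bytes / DVD_CAPACITY_BYTES

print(f"{total_gigabytes:.0f} GB uncompressed")
print(f"compression ratio needed for one DVD: {required_compression:.1f}:1")
```

Under these assumptions the reports need to shrink to a bit under a fifth of their raw size to fit on a single disc.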

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://www.purl.org/dc/elements/1.0/"
         xmlns:v="http://www.pathologyinformatics.org/informatics_r.htm">
  <rdf:Description about="urn:PMID-16160487">
    <dc:title> interobserver and intraobserver variability in the diagnosis of hydatidiform mole </dc:title>
    <v:autocode term="mole" code="C0000000" />
    <v:autocode term="hydatidiform mole" code="" />
    <de_id> * * and * * in the * of hydatidiform mole * *</de_id>
  </rdf:Description>
  <rdf:Description about="urn:PMID-16160486">
    <dc:title> primary glial tumor of the retina with features of myxopapillary ependymoma </dc:title>
    <v:autocode term="tumor" code="C0000000" />
    <v:autocode term="myxopapillary ependymoma" code="C0000000" />
    <v:autocode term="tumor of the retina" code="C0000000" />
    <v:autocode term="glial tumor" code="C3059000" />
    <v:autocode term="ependymoma" code="C0000000" />
    <de_id> * * glial tumor of the retina with * of myxopapillary ependymoma * *</de_id>
  </rdf:Description>
  <rdf:Description about="urn:PMID-16160485">
    <dc:title> cd20-negative t-cell-rich b-cell lymphoma as a progression of a nodular lymphocyte-predominant hodgkin lymphoma treated with rituximab a molecular analysis using laser capture microdissection </dc:title>
    <v:autocode term="lymphoma" code="C0000000" />
    <v:autocode term="hodgkin" code="C0000000" />
    <v:autocode term="b-cell lymphoma" code="C6858100" />
    <v:autocode term="t-cell-rich b-cell lymphoma" code="C9496100" />
    <v:autocode term="hodgkin lymphoma" code="" />
    <de_id> * * t-cell-rich b-cell lymphoma as a * of a * * hodgkin lymphoma * with * a * * using * * * * *</de_id>
  </rdf:Description>
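Output of this shape could come from a single pass that looks up phrases in a term dictionary and blanks out every word that is neither part of a matched term nor on a short list of common words. The Python sketch below illustrates that idea under stated assumptions; it is not the author's autocoder or the Doublet Method. `TERM_CODES`, `SAFE_WORDS`, `MAX_TERM_WORDS`, and `autocode_and_scrub` are hypothetical names, and the dictionary holds only three term/code pairs copied from the sample output, so its results will not reproduce the sample exactly.

```python
# Three term/code pairs copied from the sample output; a real dictionary is far larger.
TERM_CODES = {
    "glial tumor": "C3059000",
    "b-cell lymphoma": "C6858100",
    "t-cell-rich b-cell lymphoma": "C9496100",
}
# Assumed list of common words that are safe to leave in scrubbed text.
SAFE_WORDS = {"a", "and", "as", "in", "of", "the", "with"}
MAX_TERM_WORDS = 4


def autocode_and_scrub(text):
    """Return (codes, scrubbed_text) for one line of text."""
    words = text.lower().split()
    keep = [w in SAFE_WORDS for w in words]
    codes = []
    for start in range(len(words)):
        for length in range(1, MAX_TERM_WORDS + 1):
            if start + length > len(words):
                break
            phrase = " ".join(words[start:start + length])
            if phrase in TERM_CODES:
                codes.append((phrase, TERM_CODES[phrase]))
                # Words inside a matched term survive the scrub.
                for i in range(start, start + length):
                    keep[i] = True
    scrubbed = " ".join(w if k else "*" for w, k in zip(words, keep))
    return codes, scrubbed


codes, scrubbed = autocode_and_scrub("Primary glial tumor of the retina")
```

Because coding and scrubbing share one dictionary lookup, the scrubbed text keeps exactly the words that carry coded meaning and discards free text that might identify a patient.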

The autocoder prepares an XML file in RDF format (a self-describing document). It autocodes and scrubs text concurrently, at a speed of about 8,000 reports per second, and does an incomparably better job than human coders!

This means that it will code and scrub the 25 million surgical pathology reports produced in the U.S. each year in about an hour using a desktop PC

If we had access to a supercomputer (operating more than 3,000 times faster than my desktop PC), we could autocode and scrub every pathology report produced in the country in about a second.
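The timing claims can be checked with a few lines of Python. All three inputs are the figures quoted in the talk; the variable names are arbitrary.

```python
REPORTS = 25_000_000            # surgical pathology reports per year
DESKTOP_RATE = 8_000            # reports coded and scrubbed per second
SUPERCOMPUTER_SPEEDUP = 3_000   # speed relative to the desktop PC

desktop_seconds = REPORTS / DESKTOP_RATE
desktop_minutes = desktop_seconds / 60
supercomputer_seconds = desktop_seconds / SUPERCOMPUTER_SPEEDUP

print(f"desktop: {desktop_seconds:.0f} s ({desktop_minutes:.0f} min)")
print(f"supercomputer: {supercomputer_seconds:.2f} s")
```

The desktop figure comes to a little under an hour, and a 3,000-fold speedup brings it to roughly one second, consistent with both statements.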

Why is it so important to autocode fast?

Because we're not really talking about coding (coded datasets cannot be justified on the basis of their scientific value). We're really talking about re-coding very large datasets as necessary.

You almost always need to re-code!!!

1. Whenever you want to change from one nomenclature to another (eliminates problem of brand-name loyalty)

2. Whenever you introduce a new version of a nomenclature

3. Whenever you want to use a new coding algorithm (e.g. parsimonious versus comprehensive coding, or linking a code to a particular extracted portion of the report)

4. Whenever you add legacy data to your LIS

5. Whenever you merge different pathology datasets – forget mapping!!!

How can we integrate the neoplasm classification with OMIM to discover a new biological observation about tumors?

What is OMIM? OMIM (Online Mendelian Inheritance in Man) is a free, comprehensive listing of all the so-called Mendelian inherited diseases.

OMIM is 103,610,906 bytes (over 100 million bytes)

Shakespeare's Hamlet is 180,711 bytes

OMIM is about 573 times larger than Hamlet

Each record of OMIM lists the name of the inherited disease, and all the medical conditions (including neoplasms) that may be associated with the condition.

Let's autocode all of OMIM and examine the results:

1. The time to autocode was 92 seconds

2. The number of records in OMIM is 16,785

3. The number of records listing primitive tumors is 348

4. The number of records listing endoderm_or_ectoderm tumors is 1,220

5. The number of records listing mesoderm tumors is 1,766 (completely unlike what you might expect with non-inherited tumors)

6. The number of records listing neuroectoderm tumors is 747

So, because we have a class system, we can look at instance-coded datasets and make observations about CLASS
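Moving from instance codes to class counts is a lookup followed by a tally. The Python sketch below shows the idea with made-up data: the tumor-to-lineage mapping, the record contents, and the function name `records_per_class` are all hypothetical and are not drawn from OMIM or from the classification itself.

```python
from collections import Counter

# Hypothetical mapping from tumor concept to its top-level lineage class.
LINEAGE_OF = {
    "ependymoma": "neuroectoderm",
    "osteosarcoma": "mesoderm",
    "colon adenocarcinoma": "endoderm_or_ectoderm",
}

# Hypothetical autocoded records: record id -> tumor concepts found in the text.
RECORDS = {
    "rec1": ["ependymoma"],
    "rec2": ["osteosarcoma", "colon adenocarcinoma"],
    "rec3": ["osteosarcoma"],
    "rec4": [],
}


def records_per_class(records, lineage_of):
    """Count the records that list at least one tumor of each lineage class."""
    counts = Counter()
    for tumors in records.values():
        # A set, so a record counts once per class however many tumors it lists.
        classes = {lineage_of[t] for t in tumors if t in lineage_of}
        counts.update(classes)
    return counts


counts = records_per_class(RECORDS, LINEAGE_OF)
```

A record listing tumors from two classes is counted under both, which is why per-class totals of this kind can sum to more than the number of records that mention any neoplasm.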

Easy to count the three combinations of two-lineage (discordant) records:

The number of OMIM records with neoplasm concepts in the record text is 1,015.

ectoderm/mesoderm        72 OMIM records
ectoderm/neuroectoderm   24 OMIM records
mesoderm/neuroectoderm   39 OMIM records
total                   135 class-discordant OMIM records

So, 135/1,015 (13%) have a lineage discordance.
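Counting discordant records means tallying the records whose tumors fall in exactly two lineage classes. The Python sketch below shows one way to do it on hypothetical per-record lineage sets (`RECORD_CLASSES` and `discordant_pairs` are invented for the example), then re-derives the 13% figure from the counts given on the slide.

```python
from collections import Counter

# Hypothetical lineage sets, one per record, as a class-aware autocoder might yield.
RECORD_CLASSES = [
    {"mesoderm"},
    {"ectoderm", "mesoderm"},
    {"mesoderm", "neuroectoderm"},
    {"mesoderm", "ectoderm"},
]


def discordant_pairs(record_classes):
    """Tally records whose tumors span exactly two lineage classes, by pair."""
    pair_counts = Counter()
    for classes in record_classes:
        if len(classes) == 2:
            pair_counts["/".join(sorted(classes))] += 1
    return pair_counts


pairs = discordant_pairs(RECORD_CLASSES)

# Figures reported on the slide.
SLIDE_COUNTS = {
    "ectoderm/mesoderm": 72,
    "ectoderm/neuroectoderm": 24,
    "mesoderm/neuroectoderm": 39,
}
RECORDS_WITH_NEOPLASMS = 1_015

total_discordant = sum(SLIDE_COUNTS.values())
discordant_fraction = total_discordant / RECORDS_WITH_NEOPLASMS
print(f"{total_discordant} of {RECORDS_WITH_NEOPLASMS} = {discordant_fraction:.1%}")
```

Sorting the class names before joining them makes "mesoderm/ectoderm" and "ectoderm/mesoderm" count as the same pair.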

Causes for 135 cases of class discordance:

1. Inherited conditions with an (external) environmental factor

2. Physiologic (internal) effects that cross lineages (breast and ovarian cancers caused by an endocrine sensitivity that extends across lineages)

3. Conditions that included a tumor that occurs too infrequently to be correctly associated with the inherited condition

4. Mistakes in parsing OMIM (finding the name of a tumor in a record that was never intended to indicate that the condition is associated with the tumor)

5. Bad classification

How do you decide? In this case, you go back and read the 135 records and try to understand what went wrong in each case.

Classification papers

Autocoding papers (Doublet Method 20,000 times faster than other published methods)

Confidentiality/privacy papers

- De-identification and data scrubbing (Concept Match method)

- Zero-knowledge reconciliation of identities

- Threshold method for exchanging pieces of data

Data integration papers

www.pubmed.org

search on: berman jj

end
