Automatic Detection of Inconsistencies between Free Text and ... · IBM Research - Haifa...

Automatic Detection of Inconsistencies between Free Text and Coded Data in Sarcoma Discharge Letters

Ruty Rinott a, Michele Torresani b, Rossella Bertulli b, Abigail Goldsteen a, Paolo Casali b, Boaz Carmeli a and Noam Slonim a

a IBM Haifa Research Labs, 165 Aba Hushi st., Haifa 31905, Israel
b Fondazione IRCCS Istituto Nazionale dei Tumori, via Venezian, 1, Milano, Italy

Oral Presentation, MIE, Pisa, August 2012


Page 1: Automatic Detection of Inconsistencies between Free Text and Coded Data in Sarcoma Discharge Letters

Page 2: Free Text vs. Coded Fields in EHRs

§ Data in Electronic Health Records (EHRs) can be stored in free text or in coded fields

§ Coded fields are useful for querying, mining, analyzing and sharing data

§ Free text has more expressive power and is easier to use

[Figure: example Electronic Health Record]
Patient Name: John Doe
Gender: Male
Diagnosis (free text): myxoid liposarcoma of the right arm, locally extended, with areas cellulate > 5%, 7.0 cm in diameter.
ICDO-T (coded): C4.2, Connective, soft tissues of upper limb

Page 3: Free Text vs. Coded Fields in EHRs

§ Often both coded and free text fields are used to store the same type of information
  § Diagnosis
  § Treatment
  § ...

[Figure: example Electronic Health Record in which the two fields disagree]
Patient Name: John Doe
Gender: Male
Diagnosis (free text): myxoid liposarcoma of the right foot, locally extended, with areas cellulate > 5%, 7.0 cm in diameter.
ICDO-T (coded): C4.2, Connective, soft tissues of upper limb

§ This makes discordance in EHR data possible

§ Discordance may have devastating effects on patient care

§ Singh et al. 2009:
  § Of 56,000 prescriptions, 1% contained inconsistencies.
  § 20% of errors could have caused moderate to severe harm

Page 4: Automatic inconsistency detection

§ Previous works relied on extensive manual work by domain experts to establish that such inconsistencies are prevalent (Singh et al., Stein et al.)

§ We suggest an automatic method to identify potential inconsistencies between a coded field and the free text field(s) that store overlapping information.

[Figure: example Electronic Health Record in which the two fields disagree]
Patient Name: John Doe
Gender: Male
Diagnosis (free text): myxoid liposarcoma of the right foot, locally extended, with areas cellulate > 5%, 7.0 cm in diameter.
ICDO-T (coded): C4.2, Connective, soft tissues of upper limb


Page 5: Solution outline

[Figure: a text classifier, trained on example pairs of free text and code, reads the free-text diagnosis "myxoid liposarcoma of the right foot, locally extended, with areas cellulate > 5%, 7.0 cm in diameter." and predicts C3.1 (Connective, soft tissues of lower limb). The coded ICDO-T field of the same record (Patient Name: John Doe, Gender: Male) holds C4.2 (Connective, soft tissues of upper limb). Potential inconsistency!]

§ Train a machine learning classifier that predicts the most likely code from the free text data.

§ To determine whether record x has an inconsistency:
  § Use the classifier to predict a code for record x
  § Compare the prediction with the code stored in the record.

§ The final decision on whether a disagreement reflects a real inconsistency or a classification mistake remains in the hands of a domain expert

§ By highlighting potential inconsistencies, the number of records she needs to examine is dramatically reduced.
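The predict-and-compare step can be sketched in a few lines of Python. This is only an illustration: the record layout (`text`, `code`), the function names and the keyword rule standing in for a trained classifier are all assumptions, not the system used in this work.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]  # hypothetical layout: {"text": ..., "code": ...}


def flag_potential_inconsistencies(
    records: List[Record], predict: Callable[[str], str]
) -> List[Record]:
    """Return the records whose stored code differs from the predicted code."""
    flagged = []
    for record in records:
        predicted = predict(record["text"])
        if predicted != record["code"]:
            # Keep the prediction so the domain expert can see why it was flagged.
            flagged.append({**record, "predicted": predicted})
    return flagged


def toy_predict(text: str) -> str:
    """Stand-in for a trained classifier: a hand-written keyword rule."""
    lower_limb_words = ("foot", "thigh", "leg")
    if any(word in text.lower() for word in lower_limb_words):
        return "lower limb"
    return "upper limb"


records = [
    {"text": "myxoid liposarcoma of the right arm", "code": "upper limb"},
    {"text": "myxoid liposarcoma of the right foot", "code": "upper limb"},
]
flagged = flag_potential_inconsistencies(records, toy_predict)
print(flagged)
```

Only the flagged records go to the domain expert, who decides whether each one is a real inconsistency or a classification mistake.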

Page 6: How to train a classifier

§ Training a classifier requires "training data": "good examples" of what you want your classifier to learn

[Figure: training examples 1 to N are fed to a text classifier, which then maps new text to a code]
1. "Sept 07: appearance of lesions at a distance to the right thigh." Code C476
2. "November 08: Wide demolition loggia anteromedial thigh block with right superficial femoral vein." Code C523
...
N. "appearance of metastases in the right buttock." Code C476
New text -> Text classifier -> Code
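As a rough illustration of learning from such examples, here is a minimal multinomial Naive Bayes classifier in plain Python, trained on the three example texts from the slide. The tokenizer, the Laplace smoothing and all function names are assumptions made for this sketch; the slides do not describe the actual implementation.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def train_naive_bayes(examples: List[Tuple[str, str]]):
    """Count codes and per-code word frequencies from (text, code) pairs."""
    code_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    vocabulary = set()
    for text, code in examples:
        code_counts[code] += 1
        for word in tokenize(text):
            word_counts[code][word] += 1
            vocabulary.add(word)
    return code_counts, word_counts, vocabulary


def predict(model, text: str) -> str:
    """Return the code with the highest log-posterior (Laplace smoothing)."""
    code_counts, word_counts, vocabulary = model
    total_examples = sum(code_counts.values())
    best_code, best_score = None, -math.inf
    for code, count in code_counts.items():
        score = math.log(count / total_examples)
        denominator = sum(word_counts[code].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[code][word] + 1) / denominator)
        if score > best_score:
            best_code, best_score = code, score
    return best_code


training_data = [
    ("Sept 07: appearance of lesions at a distance to the right thigh.", "C476"),
    ("November 08: Wide demolition loggia anteromedial thigh block "
     "with right superficial femoral vein.", "C523"),
    ("appearance of metastases in the right buttock.", "C476"),
]
model = train_naive_bayes(training_data)
predicted_code = predict(model, "appearance of new lesions in the buttock")
print(predicted_code)
```

Three examples are far too few for a usable model; the point is only the shape of the data (text paired with code) and of the train and predict steps.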

Page 7: How to train a classifier?

§ Use the medical records we wish to examine as training data.

§ Assumption: in most records the free text agrees with the coded data

§ Note: some fraction of the training data will have mismatched codes (inconsistent records)

§ Overcome this with 2 rounds of training

[Figure: two-round training of the text classifier over the EHR records, each holding a free-text diagnosis and an ICDO-T code. Steps: Train -> Classify (predicted codes, e.g. Code C49) -> Compare & Filter -> Re-Train]
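The Train, Classify, Compare & Filter, Re-Train loop might look like the sketch below. The word-voting classifier is a deliberately crude stand-in, the two-word texts are invented toy data, and the function names are hypothetical; the slide does not say how the filtering was implemented.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (free text, code)


def train(examples: List[Example]) -> Callable[[str], str]:
    """Toy stand-in classifier: each word votes for the codes it was seen with."""
    votes = defaultdict(Counter)
    for text, code in examples:
        for word in text.lower().split():
            votes[word][code] += 1
    default_code = Counter(code for _, code in examples).most_common(1)[0][0]

    def predict(text: str) -> str:
        tally = Counter()
        for word in text.lower().split():
            tally.update(votes.get(word, Counter()))
        # Fall back to the most frequent code when no word is known.
        return tally.most_common(1)[0][0] if tally else default_code

    return predict


def two_round_training(examples: List[Example]):
    first_round = train(examples)                      # Train
    kept, removed = [], []
    for text, code in examples:                        # Classify
        if first_round(text) == code:                  # Compare & Filter
            kept.append((text, code))
        else:
            removed.append((text, code))
    return train(kept), kept, removed                  # Re-Train


examples = [
    ("right foot", "lower"), ("left foot", "lower"),
    ("right thigh", "lower"), ("left thigh", "lower"),
    ("right leg", "lower"), ("left leg", "lower"),
    ("right arm", "upper"), ("left arm", "upper"),
    ("right hand", "upper"), ("left hand", "upper"),
    ("left foot", "upper"),  # deliberately mismatched record
]
final_predict, kept, removed = two_round_training(examples)
print(removed)
```

The second classifier is trained only on records where the first round agreed with the stored code, so the mismatched record no longer influences it.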

Page 8: Ensemble Learning

§ How to improve classification performance?

§ Get a second opinion!

§ Work with an ensemble of classifiers
  § Naïve Bayes (NB)
  § K-Nearest Neighbors (KNN)
  § Multi-Class decision trees (MDT)

§ High recall: require at least one classifier to disagree with the EHR code

§ High precision: require all classifiers to disagree with the EHR code and agree with one another

§ In the following results we required high precision

[Figure: the code in the EHR (Code C49) compared with the classifications of NB, KNN and MDT (Code C35 or Code C49)]
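The two decision rules can be written down directly. The sketch below assumes each classifier's prediction is available in a dictionary keyed by classifier name; the function names and the example predictions are illustrative only.

```python
from typing import Dict


def high_recall_flag(ehr_code: str, predictions: Dict[str, str]) -> bool:
    """Flag if at least one classifier disagrees with the EHR code."""
    return any(code != ehr_code for code in predictions.values())


def high_precision_flag(ehr_code: str, predictions: Dict[str, str]) -> bool:
    """Flag only if all classifiers disagree with the EHR code
    and agree with one another."""
    distinct = set(predictions.values())
    return len(distinct) == 1 and ehr_code not in distinct


mixed = {"NB": "C35", "KNN": "C49", "MDT": "C35"}
unanimous = {"NB": "C35", "KNN": "C35", "MDT": "C35"}

print(high_recall_flag("C49", mixed), high_precision_flag("C49", mixed))
print(high_recall_flag("C49", unanimous), high_precision_flag("C49", unanimous))
```

With the EHR code C49, the mixed predictions are flagged only under the high-recall rule, while the unanimous C35 predictions are flagged under both.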

Page 9: Data

§ Anonymized discharge letters of Soft Tissue Sarcoma patients treated at the Italian National Cancer Institute in Milan (INT).

§ 734 discharge letters spanning 456 treatment programs.

§ Part of a work on the Cli-G decision support system (Wed. 1000 Fermi)

Coded field | Free text field(s) | # of instances | # of distinct words | # of distinct codes
Presentation (clinical status) | Presentation text; Disease extension; Clinical Summary | 261 | 2967 | 2
ICDO-T (Primary anatomic site) | Disease extension; Diagnostic text; Oncological history | 410 | 3792 | 15
ICDO-M (Morphology) | Diagnostic text | 435 | 385 | 11
Treatment program (TP) | Treatment; Treatment program | 128 | 633 | 8
RECIST | Clinical Summary | 218 | 1406 | 5

Page 10: Results

§ Classification Results

Coded field | Precision (ensemble) | Recall (ensemble)
Presentation | 0.98 | 0.77
ICDO-T | 0.93 | 0.54
ICDO-M | 0.96 | 0.73
Treatment program | 0.64 | 0.34
RECIST | 0.83 | 0.36

Page 11: Results

§ Manual validation of inconsistencies

Coded field | Cases predicted as inconsistent | True | Not enough info | Precision | Precision using top 50% of cases
Presentation | 5 | 3 | 0 | 0.67 | 0.67
ICDO-T | 17 | 5 | 0 | 0.29 | 0.57
ICDO-M | 14 | 6 | 0 | 0.43 | 0.86
TP | 18 | 15 | 0 | 0.83 | 0.75
RECIST | 16 | 4 | 7 | 0.44 | 0.57
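One plausible reading of the Precision column is the number of confirmed inconsistencies divided by the number of flagged cases that could be judged, that is, excluding the cases with not enough info. This formula is an assumption, not something the slide states. It reproduces the ICDO-T, ICDO-M, TP and RECIST rows; the Presentation row (0.67) does not follow from it, so treat the sketch as a guess.

```python
def validation_precision(predicted: int, true: int, not_enough_info: int = 0) -> float:
    """Fraction of decidable flagged cases that the expert confirmed.

    Assumption: cases with not enough information are dropped from the
    denominator rather than counted as wrong.
    """
    decidable = predicted - not_enough_info
    if decidable <= 0:
        raise ValueError("no decidable cases")
    return true / decidable


# (cases predicted as inconsistent, true, not enough info) from the table
rows = {
    "ICDO-T": (17, 5, 0),
    "ICDO-M": (14, 6, 0),
    "TP": (18, 15, 0),
    "RECIST": (16, 4, 7),
}
precisions = {
    field: round(validation_precision(*counts), 2) for field, counts in rows.items()
}
print(precisions)
```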

Page 12: Summary

§ Automatic method to highlight potential inconsistencies between free text and coded fields by classification

§ Can be used for retrospective correction of mistakes
  – requires validation by a domain expert

§ Can be used for online detection
  – draws the clinician's attention to potential mistakes as she is filling in the record

Page 13: Acknowledgements

§ Cli-G team (IBM)
  § Noam Slonim
  § Abigail Goldsteen
  § Boaz Carmeli

§ Istituto Nazionale dei Tumori
  § Michele Torresani
  § Dr. Rossella Bertulli
  § Dr. Paolo Casali

Page 14: How to Represent Free Text?

§ Classification needs a numerical representation of text.

§ Bag of Words: a popular representation.
  § Simple to use
  § Does not preserve relations between words.

[Figure: a document d and its bag-of-words vector BOW(d)]
d = "Leg amputation for sarcoma of the right foot, locally extended. myxoid liposarcoma, with areas cellulate > 5%, 7.0 cm in diameter. Appearance of lesions at a distance to the right thigh and the pelvic. Start new chemotherapy Trabectedin (ET-743)."
BOW(d) = (0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)
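A bag-of-words vector can be built in a few lines. The sketch below assumes a binary (present or absent) encoding, which matches the 0/1 vector in the figure but is not stated in the slides; the tokenizer and the function names are likewise illustrative.

```python
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


def build_vocabulary(documents: List[str]) -> Dict[str, int]:
    """Map each distinct word to a fixed vector position (sorted for reproducibility)."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(text: str, vocabulary: Dict[str, int]) -> List[int]:
    """Binary vector: 1 if the vocabulary word occurs in the text, else 0."""
    vector = [0] * len(vocabulary)
    for word in tokenize(text):
        if word in vocabulary:
            vector[vocabulary[word]] = 1
    return vector


docs = ["sarcoma of the right foot", "lesions of the right thigh"]
vocabulary = build_vocabulary(docs)
first = bag_of_words("right foot sarcoma", vocabulary)
second = bag_of_words("sarcoma foot right", vocabulary)
print(first, second)
```

Both calls produce the same vector, which shows concretely that word order, and with it any relation between words, is lost.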

Page 15: Results

Coded field | Cases predicted as inconsistent | True | Not enough info | Precision | Precision using top 50% of cases
ICDO-M | 14 | 6 | 0 | 0.43 | 0.86

§ The coding version used at the time had no ICDO-M code for Fibromyxosarcoma

§ Out of 26 cases with a Fibromyxosarcoma-related diagnosis, only 6 used the correct code, "Sarcoma, not otherwise specified"

§ The other 20 used the code "Malignant fibrous histiocytoma"

§ As a result, the classifiers incorrectly learned to classify Fibromyxosarcoma as "Malignant fibrous histiocytoma", leading to 6 wrong inconsistency identifications

§ Pathological case where the mistake is more common than the correct code

§ A code for Fibromyxosarcoma was added in the latest version.

Page 16:

[Figure: two-round training flow. A classifier is trained on data X (free text from the EHRs) and labels Y (codes). For each record i, the classifier takes the data Xi and outputs a prediction and confidence (Ŷi, Ci); the prediction Ŷi is compared with the label Yi to mark potentially mislabeled instances. The classifier is then trained again on the filtered training data.]

Page 17: Results

§ Classification Results

Coded field | Precision (best method) | Precision (ensemble) | Recall (ensemble)
Presentation | 0.95 (DT) | 0.98 | 0.77
ICDO-T | 0.74 (NB) | 0.93 | 0.54
ICDO-M | 0.91 (DT) | 0.96 | 0.73
Treatment program | 0.59 (NB) | 0.64 | 0.34
RECIST | 0.60 (DT) | 0.83 | 0.36

Page 18: Machine Learning 101

§ Learning: "any process by which a system improves performance from experience." (Herbert Simon)

§ Task: classify text into the correct code

§ Performance: % of correctly classified texts

§ Experience: training data, i.e. texts and their matching codes (labels)

[Figure: training examples 1 to N are fed to a text classifier, which then maps new text to a code]
1. "Sept 07: appearance of lesions at a distance to the right thigh." Code C476
2. "November 08: Wide demolition loggia anteromedial thigh block with right superficial femoral vein." Code C523
...
N. "appearance of metastases in the right buttock." Code C476
New text -> Text classifier -> Code
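The performance measure named on this slide, the percentage of correctly classified texts, is simple to compute. The sketch below uses made-up labels and predictions purely for illustration; the function name is hypothetical.

```python
from typing import List


def accuracy(true_codes: List[str], predicted_codes: List[str]) -> float:
    """Percentage of texts whose predicted code equals the matching code (label)."""
    if len(true_codes) != len(predicted_codes) or not true_codes:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(t == p for t, p in zip(true_codes, predicted_codes))
    return 100.0 * correct / len(true_codes)


labels = ["C476", "C523", "C476", "C476"]        # illustrative labels
predictions = ["C476", "C523", "C523", "C476"]   # illustrative classifier output
score = accuracy(labels, predictions)
print(score)  # 75.0
```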