LayoutLM: Pre-training of Text and Layout for Document Image Understanding
Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou
KDD 2020
Outline
1. Background
2. Motivation
3. Method
4. Experiments
5. Conclusion
1. Background
Document Understanding in Real World
Form Receipt Report Invoice
Born-digital Documents
Scanned Documents
Visually-rich Documents
Preprocessing
• Scanned documents
  • File format: .jpg, .png, …
  • Toolkit: Optical character recognition, a.k.a. OCR
  • Open source tools: Tesseract
• Born-digital documents
  • File format: .docx, .pdf, .pptx, …
  • Toolkit: DOCX parser, PDF parser, …
  • Open source tools: python-docx, pdfminer, PyMuPDF
Documents → OCR Tools or Specific Parser → Semi-structured Data
Typical Document Understanding Task
Key     | Value
TO      | Lorillard Corporation
ADDRESS | 666 Fifth Avenue
CITY    | New York
…       | …

Key     | Value
Total   | 4.95
Company | StarBucks Store
Address | 11302 Euclid Avenue, Cleveland, OH
Date    | 12/07/2014
Category: Form
Form Understanding Receipt Understanding Document Image Classification
Sequence Labeling
CRF LSTM
LSTM+CRF, BiLSTM+CRF
Huang, Zhiheng et al. “Bidirectional LSTM-CRF Models for Sequence Tagging.” arXiv abs/1508.01991 (2015).
Graph Convolution for Multimodal Information Extraction from Visually Rich Documents
� Propose a graph convolution based model to combine textual and visual information.
� Combine graph embedding with text embedding using a standard BiLSTM-CRF model.
Liu, Xiaojing et al. “Graph Convolution for Multimodal Information Extraction from Visually Rich Documents.” NAACL-HLT (2019).
Examples of VRDs and example entities to extract.
Document Modeling
• Model each document as a fully-connected graph of text segments
• A document D is a tuple (T, E), where T = {t_1, t_2, …, t_n} is a set of n text nodes
• R = {r_{i1}, r_{i2}, …, r_{in}} is the set of relation features between node t_i and the other nodes
• E ⊆ T × R × T is a set of directed edges of the form (t_i, r_{ij}, t_j)
Liu, Xiaojing et al. “Graph Convolution for Multimodal Information Extraction from Visually Rich Documents.” NAACL-HLT (2019).
Document graph
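The fully-connected document graph above can be sketched in a few lines. This is a minimal illustration only; `build_document_graph` and the edge representation are hypothetical names, not from the paper's code:

```python
# Build a fully-connected directed graph over n text segments.
# Each node is a text segment t_i; each ordered pair (t_i, t_j), i != j,
# gets a directed edge (t_i, r_ij, t_j) carrying a relation feature r_ij.

def build_document_graph(text_segments):
    """Return (nodes, edges) where edges are (i, ("r", i, j), j) triples."""
    n = len(text_segments)
    nodes = list(range(n))
    edges = [(i, ("r", i, j), j) for i in nodes for j in nodes if i != j]
    return nodes, edges

# A fully-connected directed graph on n nodes has n * (n - 1) edges.
nodes, edges = build_document_graph(["TO", "Lorillard Corporation", "ADDRESS"])
assert len(edges) == 3 * 2
```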
Feature Extraction
• Edge embedding: r_ij = [x_ij, y_ij, w_i/h_i, h_j/h_i, w_j/h_i], where
  • x_ij and y_ij are the horizontal and vertical distances between the two text boxes
  • w_i and h_i are the width and height of the corresponding text box
Liu, Xiaojing et al. “Graph Convolution for Multimodal Information Extraction from Visually Rich Documents.” NAACL-HLT (2019).
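The edge feature above can be computed directly from two bounding boxes. A minimal sketch, assuming boxes are given as (x, y, w, h) tuples (an assumed format; the paper also normalizes distances, which is omitted here for clarity):

```python
def edge_embedding(box_i, box_j):
    """Edge feature r_ij = [x_ij, y_ij, w_i/h_i, h_j/h_i, w_j/h_i].

    Boxes are (x, y, w, h) tuples. x_ij and y_ij are the horizontal
    and vertical distances between the two boxes; the remaining three
    terms are aspect ratios relative to the source box height h_i.
    """
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    x_ij = xj - xi  # horizontal distance between the two boxes
    y_ij = yj - yi  # vertical distance between the two boxes
    return [x_ij, y_ij, wi / hi, hj / hi, wj / hi]

r = edge_embedding((0, 0, 40, 10), (60, 20, 30, 20))
```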
BERT and Its Family
• Contextual embedding
• Pre-training technique
BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding
• Incorporate contextualized embeddings into the grid document representation
Denk, Timo I. and Christian Reisswig. “BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding.” Document Intelligence Workshop at NeurIPS (2019).
2. Motivation
Motivations
1. Previous work: contextual text embedding + non-contextual spatial information
2. Local invariance in document layout
3. Extra information in visually rich documents
Be Contextual
Problem: contextual text embedding + non-contextual spatial information
Contextualizing spatial information to represent local invariance
Local Invariance in Document Layout
� Relative positions of words in a document contribute a lot to the semantic representation.
• Local Invariance
  • Key-value layout: left/right or up/down
  • Table layout: grid
• The pre-training technique utilizes this local invariance and better aligns the layout information with the semantic representation.
Visual Feature in Document Style
• Document-level: the whole image can indicate the document layout
• Word-level: visual features and styles such as bold, underline, and italic
Insufficient and Expensive Labeled Data
Massive unlabeled documents Few labeled documents
Pre-training Techniques
Self-supervised training on large amounts of text.
Supervised training on a specific task with labeled data.
Language Understanding
Text-only feature
Document Image Understanding
Text feature
Layout feature
Style feature
…
Goals
1. 2D Language Model: contextual text embedding + contextual spatial information
2. Modeling and pre-training local invariance in document layout
3. Utilizing visual information in visually rich documents
3. Method
LayoutLM Architecture
[Figure: LayoutLM architecture — a BERT-style Transformer extended with 2-D position embeddings for each token's bounding box, plus image embeddings for the corresponding image regions.]
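The core idea of the input representation can be sketched as a sum of embeddings. This is a toy illustration with tiny deterministic lookup tables, not the model's learned parameters; `toy_embedding` and `layoutlm_input` are hypothetical names:

```python
# Minimal sketch of LayoutLM's input representation: each token's text
# embedding is summed with four 2-D position embeddings looked up from
# the token's bounding box coordinates (x0, y0, x1, y1).

DIM = 4  # toy embedding size

def toy_embedding(index, dim=DIM):
    """Deterministic stand-in for one row of a learned embedding table."""
    return [float(index * dim + k) for k in range(dim)]

def layoutlm_input(token_id, box):
    """Sum the text embedding with E(x0), E(y0), E(x1), E(y1)."""
    vec = toy_embedding(token_id)
    for coord in box:  # box = (x0, y0, x1, y1)
        emb = toy_embedding(coord)
        vec = [v + e for v, e in zip(vec, emb)]
    return vec
```

In the real model each coordinate indexes its own learned embedding table, and the five vectors are summed before the Transformer layers.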
Pre-training for LayoutLM
• Masked Visual-Language Model
[Figure: LayoutLM input for the masked sequence “Date [MASK] January 11, 1994 Contract [MASK] 4011”. Each token's text embedding is summed with four 2-D position embeddings E(x0), E(y0), E(x1), E(y1) from its bounding box; masked tokens keep their position embeddings, and [CLS] uses the whole-page box (0, 0, maxW, maxH).]
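The Masked Visual-Language Model corruption can be sketched as follows: hide the text of some tokens while keeping their 2-D positions, so the model must predict each word from context plus layout. The 15% mask rate follows BERT's convention and is an assumption here; the slide does not state the rate:

```python
import random

MASK = "[MASK]"

def mvlm_mask(tokens, boxes, mask_prob=0.15, seed=0):
    """Masked Visual-Language Model input corruption.

    Replaces each token's text with [MASK] with probability mask_prob,
    but leaves its bounding box untouched, so the 2-D position
    embeddings still describe where the hidden word sits on the page.
    Returns (masked_tokens, boxes, labels); labels hold the original
    word at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # prediction target
        else:
            masked.append(tok)
            labels.append(None)  # not predicted
    return masked, boxes, labels  # boxes pass through unchanged
```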
Pre-training for LayoutLM
• Document Image Classification
[Figure: LayoutLM input for the unmasked sequence “Date Routed: January 11, 1994 Contract No. 4011”. Each token's text embedding is summed with its four 2-D position embeddings E(x0), E(y0), E(x1), E(y1); the [CLS] token, with the whole-page box (0, 0, maxW, maxH), is used for document image classification.]
Pre-training Data
11 million scanned document images from IIT-CDIP Test Collection 1.0 https://ir.nist.gov/cdip/
4. Experiments
Downstream Tasks
Form Understanding
Receipt Understanding
Document Image Classification
Form Understanding with LayoutLM
[Task] Sequence labeling (B-I-O class labels) for key-value pairs from forms
[Data] 149 training, 50 testing
[Metric] Precision, Recall, F1
[Baseline] Pre-trained BERT and RoBERTa
FUNSD: Form Understanding in Noisy Scanned Documents
https://guillaumejaume.github.io/FUNSD/
Form Understanding with LayoutLM
Receipt Understanding with LayoutLM
"company": "STARBUCKS STORE #10208",
"date": "12/07/2014",
"address": "11302 EUCLID AVENUE, CLEVELAND, OH (216) 229-0749",
"total": "4.95"
ICDAR 2019 Robust Reading Challenge on Key Information Extraction from Scanned Receipts
https://rrc.cvc.uab.es/?ch=13&com=tasks
[Task] Sequence labeling (B-I-O class labels) for values from receipts
[Data] 626 training, 347 testing
[Metric] Precision, Recall, F1
[Baseline] Pre-trained BERT, RoBERTa
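The B-I-O labeling used in both the form and receipt tasks decodes into typed spans. A minimal decoder sketch (`bio_decode` is a hypothetical helper, not from the LayoutLM code):

```python
def bio_decode(tokens, tags):
    """Turn B-I-O tags into (label, text) entities, e.g. SROIE fields.

    A 'B-X' tag opens an entity of type X; following 'I-X' tags of the
    same type extend it; 'O' (or a type change) closes the open entity.
    """
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(label, " ".join(words)) for label, words in entities]

ents = bio_decode(
    ["STARBUCKS", "STORE", "Total", "4.95"],
    ["B-company", "I-company", "O", "B-total"],
)
```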
Receipt Understanding with LayoutLM
Document Image Classification with LayoutLM
[Task] Image classification (16 classes)
[Data] RVL-CDIP dataset (320K training, 40K validation, 40K testing)
[Metric] Accuracy
[Baseline] InceptionResNetV2, LadderNet, Multimodal
https://www.cs.cmu.edu/~aharley/rvl-cdip/
Document Image Classification with LayoutLM
Different Data and Epochs
Different Initialization Methods
Visualization: Table Detection Task on DocBank
BERT LayoutLM BERT LayoutLM
Error Correct Ground Truth
Li, Minghao et al. “DocBank: A Benchmark Dataset for Document Layout Analysis.” ArXiv abs/2006.01038 (2020).
5. Conclusion
• LayoutLM
  • 1st document-level pre-trained model using text and layout
  • Supports different downstream tasks
    • Form/Invoice understanding
    • Receipt understanding
    • Document image classification
• Paper: https://arxiv.org/abs/1912.13318
• Code: https://aka.ms/layoutlm
LayoutLM
How to conduct research as an undergraduate?
My suggestions
1. Being self-motivated and hard-working
2. Doing well in math and programming courses
3. Finding a group/professor/graduate student
4. Getting involved in a research project
Working with a Professor/Graduate Student
• Clear goal
  • A topic or an idea
  • Conference deadline
• Weekly one-to-one meetings
  • Progress report: reading, code, results
More Advice
• How to Do Research With a Professor?
  • Jason Eisner, CS professor at Johns Hopkins University, ACL Fellow
  • http://www.cs.jhu.edu/~jason/advice/how-to-work-with-a-professor.html
• How undergraduates can do successful research (in Chinese)
  • Minlie Huang, CS professor at Tsinghua University
  • http://coai.cs.tsinghua.edu.cn/hml/media/files/undergraduate-res.pdf
Life at MSRA
Novel Topic/Idea
• Mentorship
• Diverse research areas
Computing Resource
• Azure Machine Learning
Programming Skill
• Research & Development
Conditions of Good Research
Acknowledgement
Acknowledgement: MSRA NLC Group
Ming Zhou, Lei Cui, Furu Wei
UniLM Family: https://github.com/microsoft/unilm
• UniLM (v1@NeurIPS'19 | v2@ICML'20): unified pre-training for language understanding and generation
• MiniLM (arXiv'20): small pre-trained models for language understanding and generation
• LayoutLM (v1@KDD'20): multimodal (text + layout/format + image) pre-training for document understanding
• s2s-ft: sequence-to-sequence fine-tuning toolkit
© Copyright Microsoft Corporation. All rights reserved.
Thank you for listening