
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 323–333, July 5–10, 2020. © 2020 Association for Computational Linguistics


Contextualized Weak Supervision for Text Classification

Dheeraj Mekala 1  Jingbo Shang 1,2

1 Department of Computer Science and Engineering, University of California San Diego, CA, USA
2 Halıcıoğlu Data Science Institute, University of California San Diego, CA, USA

{dmekala, jshang}@ucsd.edu

Abstract

Weakly supervised text classification based on a few user-provided seed words has recently attracted much attention from researchers. Existing methods mainly generate pseudo-labels in a context-free manner (e.g., string matching); therefore, the ambiguous, context-dependent nature of human language has long been overlooked. In this paper, we propose a novel framework, ConWea, providing contextualized weak supervision for text classification. Specifically, we leverage contextualized representations of word occurrences and seed word information to automatically differentiate multiple interpretations of the same word, and thus create a contextualized corpus. This contextualized corpus is further utilized to train the classifier and expand seed words in an iterative manner. This process not only adds new contextualized, highly label-indicative keywords but also disambiguates initial seed words, making our weak supervision fully contextualized. Extensive experiments and case studies on real-world datasets demonstrate the necessity and significant advantages of using contextualized weak supervision, especially when the class labels are fine-grained.

1 Introduction

Weak supervision in text classification has recently attracted much attention from researchers, because it alleviates the burden on human experts of annotating massive documents, especially in specific domains. One of the popular forms of weak supervision is a small set of user-provided seed words for each class. Typical seed-driven methods follow an iterative framework — generate pseudo-labels using some heuristics, learn the mapping between documents and classes, and expand the seed set (Agichtein and Gravano, 2000; Riloff et al., 2003; Kuipers et al., 2006; Tao et al., 2015; Meng et al., 2018).

Most, if not all, existing methods generate pseudo-labels in a context-free manner; therefore, the ambiguous, context-dependent nature of human language has long been overlooked. Suppose the user gives "penalty" as a seed word for the sports class, as shown in Figure 1. The word "penalty" has at least two different meanings: the penalty in sports-related documents and the fine or death penalty in law-related documents. If the pseudo-label of a document is decided based only on the frequency of seed words, some documents about law may be mislabelled as sports. More importantly, such errors will further introduce wrong seed words, thus being propagated and amplified over the iterations.

In this paper, we introduce contextualized weak supervision to train a text classifier based on user-provided seed words. The "contextualized" here is reflected in two places: the corpus and seed words. Every word occurrence in the corpus may be interpreted differently according to its context; every seed word, if ambiguous, must be resolved according to its user-specified class. In this way, we aim to improve the accuracy of the final text classifier.

We propose a novel framework, ConWea, as illustrated in Figure 1. It leverages contextualized representation learning techniques, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), together with user-provided seed information to first create a contextualized corpus. This contextualized corpus is further utilized to train the classifier and expand seed words in an iterative manner. During this process, contextualized seed words are introduced by expanding and disambiguating the initial seed words. Specifically, for each word, we develop an unsupervised method to adaptively decide its number of interpretations and, accordingly, group all its occurrences based on their contextualized representations. We design a principled comparative ranking method to select highly label-indicative keywords from the contextualized corpus, leading to contextualized seed words.


Figure 1: Our proposed contextualized weakly supervised method leverages BERT to create a contextualized corpus. This contextualized corpus is further utilized to resolve interpretations of seed words, generate pseudo-labels, train a classifier, and expand the seed set in an iterative fashion.

We repeat the iterative classification and seed word expansion process until convergence.

To the best of our knowledge, this is the first work on contextualized weak supervision for text classification. It is also worth mentioning that our proposed framework is compatible with almost any contextualized representation learning model and text classification model. Our contributions are summarized as follows:
• We propose a novel framework enabling contextualized weak supervision for text classification.
• We develop an unsupervised method to automatically group occurrences of the same word into an adaptive number of interpretations based on contextualized representations and user-provided seed information.
• We design a principled ranking mechanism to identify words that are discriminative and highly label-indicative.
• We have performed experiments on real-world datasets for both coarse- and fine-grained text classification tasks. The results demonstrate the superiority of using contextualized weak supervision, especially when the labels are fine-grained.

Our code is made publicly available at GitHub1.

2 Overview

Problem Formulation. The input of our problem contains (1) a collection of n text documents D = {D_1, D_2, . . . , D_n} and (2) m target classes C = {C_1, C_2, . . . , C_m} and their seed words S = {S_1, S_2, . . . , S_m}. We aim to build a high-quality document classifier from these inputs, assigning a class label C_j ∈ C to each document D_i ∈ D.

1 https://github.com/dheeraj7596/ConWea
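To make the input concrete, the weak supervision can be represented as nothing more than a mapping from class names to seed words. The snippet below is only illustrative; the class names and seed words follow the example in Figure 1, and the variable names are ours.

```python
# Illustrative ConWea-style inputs: unlabeled documents plus a few seed words per class.
documents = [
    "Messi scored the penalty! ...",
    "The court issued a penalty ...",
]

seed_words = {
    "soccer": ["soccer", "goal", "penalty"],
    "law":    ["law", "judge", "court"],
}
```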

Note that all these words could be upgraded to phrases if phrase mining techniques (Liu et al., 2015; Shang et al., 2018) were applied as pre-processing. In this paper, we stick to words.

Framework Overview. We propose a framework, ConWea, enabling contextualized weak supervision. Here, "contextualized" is reflected in two places: the corpus and seed words. Therefore, we have developed two novel techniques accordingly to make both contextualizations happen.

First, we leverage contextualized representation learning techniques (Peters et al., 2018; Devlin et al., 2019) to create a contextualized corpus. We choose BERT (Devlin et al., 2019) as an example in our implementation to generate a contextualized vector of every word occurrence. We assume the user-provided seed words are of reasonable quality — the majority of the seed words are not ambiguous, and the majority of the occurrences of the seed words are about the semantics of the user-specified class. Based on these two assumptions, we are able to develop an unsupervised method to automatically group word occurrences of the same word into an adaptive number of interpretations, harvesting the contextualized corpus.

Second, we design a principled comparative ranking method to select highly label-indicative keywords from the contextualized corpus, leading to contextualized seed words. Specifically, we start with all possible interpretations of seed words and train a neural classifier. Based on the predictions, we compare and contrast the documents belonging to different classes, and rank contextualized words based on how label-indicative, frequent, and unusual these words are.


Figure 2: Document contextualization examples using the words "windows" and "penalty": (a) similarity distribution for "windows"; (b) cluster visualization for "windows"; (c) cluster visualization for "penalty". τ is decided based on the similarity distributions of all seed word occurrences. Two clusters are discovered for each of the two words.

During this process, we eliminate the wrong interpretations of initial seed words and also add more highly label-indicative contextualized words.

This entire process is visualized in Figure 1. We denote the number of iterations between classifier training and seed word expansion as T, which is the only hyper-parameter in our framework. We discuss these two novel techniques in detail in the following sections. To make our paper self-contained, we also briefly describe the pseudo-label generation and the document classifier.

3 Document Contextualization

We leverage contextualized representation techniques to create a contextualized corpus. The key objective of this contextualization is to disambiguate different occurrences of the same word into several interpretations. We treat every word separately, so in the rest of this section, we focus on a given word w. Specifically, given a word w, we denote all its occurrences as w_1, . . . , w_n, where n is its total number of occurrences in the corpus.

Contextualized Representation. First, we obtain a contextualized vector representation b_{w_i} for each w_i. Our proposed method is compatible with almost any contextualized representation learning model. We choose BERT (Devlin et al., 2019) as an example in our implementation to generate a contextualized vector for each word occurrence. In this contextualized vector space, we use the cosine similarity to measure the similarity between two vectors. Two word occurrences w_i and w_j of the same interpretation are expected to have a high cosine similarity between their vectors b_{w_i} and b_{w_j}. For ease of computation, we normalize all contextualized representations into unit vectors.

Choice of Clustering Methods. We model the word occurrence disambiguation problem as a clustering problem. Specifically, we propose to use the K-Means algorithm (Jain and Dubes, 1988) to cluster all contextualized representations b_{w_i} into K clusters, where K is the number of interpretations. We prefer K-Means because (1) the cosine similarity and Euclidean distance are equivalent for unit vectors and (2) it is fast, and we are clustering a significant number of times.

Automated Parameter Setting. We choose the value of K purely based on a similarity threshold τ. τ is introduced to decide whether two clusters belong to the same interpretation by checking if the cosine similarity between two cluster center vectors is greater than τ. Intuitively, we should keep increasing K until there exist no two clusters with the same interpretation. Therefore, we choose K to be the largest number such that the similarity between any two cluster centers is no more than τ:

    K = argmax_K { cos(c_i, c_j) < τ, ∀ i, j }    (1)

where c_i refers to the i-th cluster center vector after clustering all contextualized representations into K clusters. In practice, K is usually no more than 10, so we increase K gradually until the constraint is violated.

We pick τ based on user-provided seed information instead of hand-tuning. As mentioned, we make two "majority" assumptions: (1) for any seed word, the majority of its occurrences follow the interpretation intended by the user; and (2) the majority of the seed words are not ambiguous — they only have one interpretation. Therefore, for each seed word s, we take the median of pairwise cosine similarities between its occurrences:

    τ(s) = median({ sim(b_{s_i}, b_{s_j}) | ∀ i, j })    (2)


Algorithm 1: Corpus Contextualization
Input: Word occurrences w_1, w_2, . . . , w_n of the word w; seed words s_1, s_2, . . . , s_m and their occurrences s_{i,j}.
Output: Contextualized word occurrences w_1, w_2, . . . , w_n.

Obtain b_{w_i} and b_{s_{i,j}} using BERT.
Compute τ following Equation 3.
K ← 1
while True do
    Run K-Means on {b_{w_i}} for (K + 1) clusters.
    Obtain cluster centers c_1, c_2, . . . , c_{K+1}.
    if max_{i,j} cos(c_i, c_j) > τ then
        Break
    K ← K + 1
Run K-Means on {b_{w_i}} for K clusters.
Obtain cluster centers c_1, c_2, . . . , c_K.
for each occurrence w_i do
    Compute w_i following Equation 4.
Return w_1, w_2, . . . , w_n.

Then, we take the median of these medians over all seed words as τ. Mathematically,

    τ = median({ τ(s) | ∀ s })    (3)

The nested median solution makes the choice of τ safe and robust to outliers. For example, consider the word "windows" in the 20Newsgroup corpus. The word "windows" has two interpretations in this corpus — one represents an opening in the wall and the other is an operating system. We first compute the pairwise similarities between all its occurrences and plot the histogram as shown in Figure 2(a). From this plot, we can see that its median value is about 0.7. We apply the same procedure for all seed words and obtain τ following Equation 3; τ is calculated to be 0.82. Based on this value, we gradually increase K for "windows" and it ends up with K = 2. We visualize its K-Means clustering results using t-SNE (Maaten and Hinton, 2008) in Figure 2(b). Similar results can be observed for the word "penalty", as shown in Figure 2(c). These examples demonstrate how our document contextualization works for each word.

In practice, to make this more efficient, one can subsample the occurrences instead of enumerating all pairs in a brute-force manner.

Contextualized Corpus. The interpretation of each occurrence of w is decided by the cluster ID to which its contextualized representation belongs.

Figure 3: The HAN classifier used in our ConWea framework. It is trained on our contextualized corpus with the generated pseudo-labels.

Specifically, given each occurrence w_i, the word w is replaced by w_i in the corpus as follows:

    w_i = w        if K = 1
    w_i = w$j*     otherwise    (4)

where j* = argmax_{j=1,...,K} cos(b_{w_i}, c_j).

By applying this to all words and their occurrences, the corpus is contextualized. The pseudo-code for corpus contextualization is shown in Algorithm 1.
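A minimal sketch of this procedure is given below, assuming the unit-normalized contextualized vectors have already been computed (e.g., with BERT, as in Section 6.3). It uses scikit-learn's KMeans; all function and variable names are ours, not from the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_tau(seed_occurrence_vectors):
    """Equations 2-3: median over seed words of the median pairwise cosine
    similarity among each seed word's (unit-normalized) occurrence vectors."""
    per_seed_medians = []
    for vecs in seed_occurrence_vectors.values():    # {seed word -> (n_s, d) array}
        if len(vecs) < 2:
            continue                                 # a single occurrence has no pairs
        sims = vecs @ vecs.T                         # cosine similarity for unit vectors
        iu = np.triu_indices(len(vecs), k=1)
        per_seed_medians.append(np.median(sims[iu]))
    return float(np.median(per_seed_medians))

def adaptive_kmeans(vectors, tau, max_k=10):
    """Equation 1 / Algorithm 1: grow K until two cluster centers become more
    similar than tau, then keep the largest K that satisfies the constraint."""
    k = 1
    while k < min(max_k, len(vectors)):
        centers = KMeans(n_clusters=k + 1, n_init=10).fit(vectors).cluster_centers_
        centers /= np.linalg.norm(centers, axis=1, keepdims=True)
        sims = centers @ centers.T
        np.fill_diagonal(sims, -1.0)
        if sims.max() > tau:                         # two centers share an interpretation
            break
        k += 1
    return KMeans(n_clusters=k, n_init=10).fit(vectors)

def contextualize_word(word, vectors, tau):
    """Equation 4: rewrite each occurrence of `word` by its cluster id, e.g. penalty$0."""
    if len(vectors) < 2:
        return [word] * len(vectors)
    km = adaptive_kmeans(vectors, tau)
    if km.n_clusters == 1:
        return [word] * len(vectors)                 # unambiguous word: keep as is
    return [f"{word}${j}" for j in km.predict(vectors)]
```

For the word "penalty" in Figure 1, this yields tokens such as penalty$0 and penalty$1, which replace the original occurrences in the corpus.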

4 Pseudo-Label and Text Classifier

We generate pseudo-labels for unlabeled contextualized documents and train a classifier based on these pseudo-labels, similar to many other weakly supervised methods (Agichtein and Gravano, 2000; Riloff et al., 2003; Kuipers et al., 2006; Tao et al., 2015; Meng et al., 2018). These two parts are not the focus of this paper; we briefly introduce them to make the paper self-contained.

Pseudo-Label Generation. There are several ways to generate pseudo-labels from seed words. As proof of concept, we employ a simple but effective method based on counting: each document is assigned the label whose seed words have the maximum aggregated term frequency in that document. Let tf(w, d) denote the term frequency of a contextualized word w in the contextualized document d, and let S_c be the set of seed words of class c. The document d is assigned a label l(d) as follows:

    l(d) = argmax_l { Σ_i tf(s_i, d) | s_i ∈ S_l }    (5)
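A minimal sketch of this counting rule, assuming the contextualized documents are given as token lists and the seed sets as a dict keyed by class (names are illustrative):

```python
from collections import Counter

def pseudo_label(doc_tokens, seed_words):
    """Equation 5: assign the class whose seed words occur most often in the document."""
    counts = Counter(doc_tokens)
    scores = {label: sum(counts[s] for s in seeds)
              for label, seeds in seed_words.items()}
    return max(scores, key=scores.get)   # ties and all-zero counts need a policy in practice

# e.g. pseudo_label(["the", "court$1", "issued", "a", "penalty$0"], seed_words)
```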


Document Classifier. Our framework is compatible with any text classification model. We use Hierarchical Attention Networks (HAN) (Yang et al., 2016) as an example in our implementation. HAN considers the hierarchical structure of documents (document – sentences – words) and includes an attention mechanism that finds the most important words and sentences in a document while taking the context into consideration. There are two levels of attention: word-level attention identifies the important words in a sentence, and sentence-level attention identifies the important sentences in a document. The overall architecture of HAN is shown in Figure 3. We train a HAN model on the contextualized corpus with the generated pseudo-labels. The predicted labels are used in seed expansion and disambiguation.
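As a rough PyTorch illustration of this two-level attention (a sketch, not the authors' exact implementation; all layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """One attention level of HAN: score each step, softmax, then a weighted sum."""
    def __init__(self, hidden_dim, attn_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)
        self.context = nn.Linear(attn_dim, 1, bias=False)   # the "context vector"

    def forward(self, h):                        # h: (batch, steps, hidden_dim)
        u = torch.tanh(self.proj(h))             # (batch, steps, attn_dim)
        alpha = torch.softmax(self.context(u).squeeze(-1), dim=-1)
        return (alpha.unsqueeze(-1) * h).sum(dim=1)          # (batch, hidden_dim)

class HAN(nn.Module):
    def __init__(self, vocab_size, num_classes, emb_dim=100, gru_dim=50, attn_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.word_gru = nn.GRU(emb_dim, gru_dim, bidirectional=True, batch_first=True)
        self.word_attn = AttentionPool(2 * gru_dim, attn_dim)
        self.sent_gru = nn.GRU(2 * gru_dim, gru_dim, bidirectional=True, batch_first=True)
        self.sent_attn = AttentionPool(2 * gru_dim, attn_dim)
        self.classify = nn.Linear(2 * gru_dim, num_classes)

    def forward(self, docs):                     # docs: (batch, n_sents, n_words) token ids
        b, s, w = docs.shape
        x = self.emb(docs.view(b * s, w))        # embed every sentence independently
        h, _ = self.word_gru(x)                  # word-level encoder
        sents = self.word_attn(h).view(b, s, -1) # word attention -> sentence vectors
        hs, _ = self.sent_gru(sents)             # sentence-level encoder
        doc = self.sent_attn(hs)                 # sentence attention -> document vector
        return self.classify(doc)                # class logits, trained with cross-entropy
                                                 # against the generated pseudo-labels
```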

5 Seed Expansion and Disambiguation

Seed Expansion. Given contextualized documents and their predicted class labels, we propose to rank contextualized words and add the top few words into the seed word sets. The core element of this process is the ranking function. An ideal seed word s of label l is an unusual word that appears only in the documents belonging to label l, with significant frequency. Hence, for a given class C_j and a word w, we measure its ranking score based on the following three aspects:

• Label-Indicative. Since our pseudo-label generation follows the presence of seed words in the document, ideally, the posterior probability of a document belonging to the class C_j after observing the presence of word w (i.e., P(C_j | w)) should be very close to 100%. Therefore, we use P(C_j | w) as our label-indicative measure:

    LI(C_j, w) = P(C_j | w) = f_{C_j,w} / f_{C_j}

where f_{C_j} refers to the total number of documents that are predicted as class C_j, and among them, f_{C_j,w} documents contain the word w. All these counts are based on the prediction results on the input unlabeled documents.

• Frequent. Ideally, a seed word s of label l appears in the documents belonging to label l with significant frequency. To measure the frequency score, we first compute the average frequency of seed word s in all the documents belonging to label l. Since the average frequency is unbounded, we apply the tanh function to scale it, resulting in the frequency score

    F(C_j, w) = tanh( f_{C_j}(w) / f_{C_j} )

Here, different from f_{C_j,w} defined earlier, f_{C_j}(w) is the frequency of word w in documents that are predicted as class C_j.

• Unusual. We want our highly label-indicative and frequent words to be unusual. To incorporate this, we consider the inverse document frequency (IDF). Let n be the number of documents in the corpus D and f_{D,w} the document frequency of word w; the IDF of a word w is computed as follows:

    IDF(w) = log( n / f_{D,w} )

Similar to previous work (Tao et al., 2015), we combine these three measures using the geometric mean, resulting in the ranking score R(C_j, w) of a word w for a class C_j:

    R(C_j, w) = ( LI(C_j, w) × F(C_j, w) × IDF(w) )^(1/3)

Based on this aggregated score, we add the top words to expand the seed word set of the class C_j (a code sketch is given at the end of this section).

Seed Disambiguation. While the majority of user-provided seed words are clean, some of them may have multiple interpretations in the given corpus. We propose to disambiguate them based on the ranking. We first consider all possible interpretations of an initial seed word, generate the pseudo-labels, and train a classifier. Using the classified documents and the ranking function, we rank all possible interpretations of the same initial seed word. Because the majority of the occurrences of a seed word are assumed to belong to the user-specified class, the intended interpretation shall be ranked the highest. Therefore, we retain only the top-ranked interpretation of this seed word.

After this step, we have fully contextualized our weak supervision, including the initial user-provided seeds.
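A minimal sketch of the ranking score, assuming tokenized contextualized documents and their predicted labels are available (names are ours):

```python
import math
from collections import Counter

def ranking_score(word, label, docs, predicted_labels):
    """R(C_j, w) = (LI * F * IDF)^(1/3) over tokenized docs and their predicted labels."""
    n = len(docs)
    docs_in_class = [d for d, l in zip(docs, predicted_labels) if l == label]
    f_cj = len(docs_in_class)                                  # docs predicted as C_j
    f_cj_w = sum(1 for d in docs_in_class if word in d)        # ... that contain w
    f_cj_of_w = sum(Counter(d)[word] for d in docs_in_class)   # occurrences of w within C_j
    f_d_w = sum(1 for d in docs if word in d)                  # document frequency in corpus
    if f_cj == 0 or f_cj_w == 0 or f_d_w == 0:
        return 0.0
    li = f_cj_w / f_cj                                         # label-indicativeness
    freq = math.tanh(f_cj_of_w / f_cj)                         # scaled average frequency
    idf = math.log(n / f_d_w)                                  # unusualness
    return (li * freq * idf) ** (1.0 / 3.0)
```

Expansion keeps the top-scoring contextualized words per class, and disambiguation keeps only the highest-ranked interpretation of each ambiguous initial seed word.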

6 Experiments

In this section, we evaluate our framework and the compared methods on coarse- and fine-grained text classification tasks under the weakly supervised setting.


Table 1: Dataset statistics.

Dataset   # Docs   # Coarse   # Fine   Avg Doc Len
NYT       13,081   5          25       778
20News    18,846   6          20       400

6.1 Datasets

Following previous work (Tao et al., 2015; Meng et al., 2018), we use two news datasets in our experiments. The dataset statistics are provided in Table 1. Here are some details.
• The New York Times (NYT): The NYT dataset contains news articles written and published by The New York Times. These articles are classified into 5 wide genres (e.g., arts, sports) and 25 fine-grained categories (e.g., dance, music, hockey, basketball).
• The 20 Newsgroups (20News): The 20News dataset2 is a collection of newsgroup documents partitioned widely into 6 groups (e.g., recreation, computers) and 20 fine-grained classes (e.g., graphics, windows, baseball, hockey).

2 http://qwone.com/~jason/20Newsgroups/

We perform coarse- and fine-grained classifications on both the NYT and 20News datasets. The NYT dataset is imbalanced in both fine-grained and coarse-grained classifications. 20News is nearly balanced in fine-grained classification but imbalanced in coarse-grained classification. Being aware of these facts, we adopt micro- and macro-F1 scores as evaluation metrics.

6.2 Compared Methods

We compare our framework with a wide range of methods described below:
• IR-TF-IDF treats the seed word set for each class as a query. The relevance of a document to a label is computed by the aggregated TF-IDF values of its respective seed words. The label with the highest relevance is assigned to each document.
• Dataless (Chang et al., 2008) uses only label surface names as supervision and leverages Wikipedia to derive vector representations of labels and documents. Each document is labeled based on the document-label similarity.
• Word2Vec first learns word vector representations (Mikolov et al., 2013) for all terms in the corpus and derives label representations by aggregating the vectors of its respective seed words. Finally, each document is labeled with the most similar label based on cosine similarity.
• Doc2Cube (Tao et al., 2015) considers label surface names as the seed set and performs multi-dimensional document classification by learning dimension-aware embeddings.
• WeSTClass (Meng et al., 2018) leverages seed information to generate pseudo documents and refines the model through a self-training module that bootstraps on real unlabeled documents.

We denote our framework as ConWea, which includes corpus contextualization, seed word disambiguation, and iterative classification & keyword expansion. Besides, we have three ablated versions. ConWea-NoCon refers to the variant of ConWea trained without the contextualization of the corpus. ConWea-NoExpan is the variant of ConWea without the seed expansion module. ConWea-WSD refers to the variant of ConWea with the contextualization module replaced by the Lesk algorithm (Lesk, 1986), a classic word sense disambiguation (WSD) algorithm.

We also present the results of HAN-Supervised under the supervised setting for reference. We use an 80-10-10 train-validation-test split and report its test set results. All weakly supervised methods are evaluated on the entire datasets.

6.3 Experiment Settings

We use pre-trained BERT-base-uncased3 to obtain contextualized word representations. We follow Devlin et al. (2019) and concatenate the averaged word-piece vectors of the last four layers.
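A minimal sketch of this step with the HuggingFace transformers library, processing one sentence at a time; it averages the word pieces of each word and concatenates the last four hidden layers as described above (function and variable names are ours):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def contextualized_vectors(sentence):
    """Return one unit-normalized vector per word occurrence in the sentence."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).hidden_states           # tuple: embeddings + 12 layers
    last_four = torch.cat(hidden[-4:], dim=-1)[0]     # (num_word_pieces, 4 * 768)
    word_ids = enc.word_ids()                         # word index per word piece
    vectors = {}
    for w in sorted(set(i for i in word_ids if i is not None)):
        piece_idx = [k for k, i in enumerate(word_ids) if i == w]
        v = last_four[piece_idx].mean(dim=0)          # average the word pieces of a word
        vectors[w] = v / v.norm()                     # unit-normalize (Section 3)
    return vectors                                    # {word position -> vector}
```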

The seed words are obtained as follows: we asked 5 human experts to nominate 5 seed words per class, and then considered the majority words (i.e., > 3 nominations) as our final set of seed words. For every class, we mainly use the label surface name as seed words. For some multi-word class labels (e.g., "international business"), we have multiple seed words, but never more than four per class. The same seed words are utilized for all compared methods for fair comparisons.

For ConWea, we set T = 10. For any method using word embeddings, we set their dimension to 100. We use the public implementations of WeSTClass4 and Dataless5 with the hyper-parameters mentioned in their original papers.

3 https://github.com/google-research/bert
4 https://github.com/yumeng5/WeSTClass
5 https://cogcomp.org/page/software_view/Descartes


Table 2: Evaluation Results for All Methods on Fine-Grained and Coarse-Grained Labels. Both micro-F1 and macro-F1 scores are presented. Ablation and supervised results are also included.

                   NYT 5-Class (Coarse)   NYT 25-Class (Fine)   20News 6-Class (Coarse)   20News 20-Class (Fine)
Methods            Micro-F1   Macro-F1    Micro-F1   Macro-F1   Micro-F1   Macro-F1        Micro-F1   Macro-F1
IR-TF-IDF          0.65       0.58        0.56       0.54       0.49       0.48            0.53       0.52
Dataless           0.71       0.48        0.59       0.37       0.50       0.47            0.61       0.53
Word2Vec           0.92       0.83        0.69       0.47       0.51       0.45            0.33       0.33
Doc2Cube           0.71       0.38        0.67       0.34       0.40       0.35            0.23       0.23
WeSTClass          0.91       0.84        0.50       0.36       0.53       0.43            0.49       0.46
ConWea             0.95       0.89        0.91       0.79       0.62       0.57            0.65       0.64
ConWea-NoCon       0.91       0.83        0.89       0.74       0.53       0.50            0.58       0.57
ConWea-NoExpan     0.92       0.85        0.76       0.66       0.58       0.53            0.58       0.57
ConWea-WSD         0.83       0.78        0.72       0.64       0.52       0.46            0.49       0.47
HAN-Supervised     0.96       0.92        0.94       0.82       0.90       0.88            0.83       0.83

6.4 Performance Comparison

We summarize the evaluation results of all methods in Table 2. As one can observe, our proposed framework achieves the best performance among all the compared weakly supervised methods. We discuss the effectiveness of ConWea as follows:

• Our proposed framework ConWea outperforms all the other methods with significant margins. By contextualizing the corpus and resolving the interpretations of seed words, ConWea achieves inspiring performance, demonstrating the necessity and importance of using contextualized weak supervision.
• We observe that in fine-grained classification, the advantages of ConWea over other methods are even more significant. This can be attributed to the contextualization of the corpus and seed words. Once the corpus is contextualized properly, the subtle ambiguity between words is a drawback to other methods, whereas ConWea can distinguish them and predict them correctly.
• The comparison between ConWea and the ablation method ConWea-NoExpan demonstrates the effectiveness of our seed expansion. For example, for fine-grained labels on the 20News dataset, the seed expansion improves the micro-F1 score from 0.58 to 0.65.
• The comparison between ConWea and the two ablation methods ConWea-NoCon and ConWea-WSD demonstrates the effectiveness of our contextualization. Our contextualization, building upon (Devlin et al., 2019), is adaptive to the input corpus, without requiring any additional human annotations. However, WSD methods (e.g., (Lesk, 1986)) are typically trained for a general domain. If one wants to apply WSD to some specific corpus, additional annotated training data might be required to match our performance, which defeats the purpose of a weakly supervised setting. Therefore, we believe that our contextualization module has its unique advantages. Our experimental results further confirm the above reasoning empirically. For example, for coarse-grained labels on the 20News dataset, the contextualization improves the micro-F1 score from 0.53 to 0.62.
• We observe that ConWea performs quite close to supervised methods, for example, on the NYT dataset. This demonstrates that ConWea is quite effective in closing the performance gap between the weakly supervised and supervised settings.

6.5 Parameter Study

The only hyper-parameter in our algorithm is T, the number of iterations of iterative expansion and classification. We conduct experiments to study the effect of the number of iterations on the performance. The plot of performance w.r.t. the number of iterations is shown in Figure 4. We observe that the performance increases initially and gradually converges after 4 or 5 iterations. After the convergence point, the expanded seed words remain almost unchanged. While there is some fluctuation, a reasonably large T, such as T = 10, is a good choice.

6.6 Number of Seed Words

We vary the number of seed words per class and plot the F1 scores in Figure 5. One can observe that, in general, the performance increases as the number of seed words increases. There is a slightly different pattern on the 20News dataset when the labels are fine-grained. We conjecture that this is caused by the subtlety of seed words in fine-grained cases: additional seed words may bring some noise. Overall, three seed words per class are enough for reasonable performance.


Figure 4: Micro- and Macro-F1 scores w.r.t. the number of iterations on (a) NYT coarse, (b) NYT fine, (c) 20News coarse, and (d) 20News fine.

Figure 5: Micro- and Macro-F1 scores w.r.t. the number of seed words on (a) NYT coarse, (b) NYT fine, (c) 20News coarse, and (d) 20News fine.

6.7 Case Study

We present a case study to showcase the power of contextualized weak supervision. Specifically, we investigate the differences between the expanded seed words in the plain corpus and the contextualized corpus over iterations. Table 3 shows a column-by-column comparison for the class For Sale on the 20News dataset. The class For Sale refers to documents advertising goods for sale. Starting with the same seed sets in both types of corpora, we observe from Table 3 that, in the second iteration, "space" becomes part of the expanded seed set in the plain corpus. Here "space" has two interpretations: one stands for the physical universe beyond the Earth and the other is an area of land. This error gets propagated and amplified over the iterations, further introducing wrong seed words like "nasa", "shuttle", and "moon", related to its first interpretation. The seed set for the contextualized corpus addresses this problem and adds only the words with appropriate interpretations. Also, one can see that the initial seed word "offer" has been disambiguated as "offer$0".

7 Related Work

We review the literature about (1) weak supervision for text classification, (2) contextualized representation learning techniques, (3) document classifiers, and (4) word sense disambiguation.

7.1 Weak Supervision for Text Classification

Weak supervision has been studied for building document classifiers in various forms, including hundreds of labeled training documents (Tang et al., 2015; Miyato et al., 2016; Xu et al., 2017), class/category names (Song and Roth, 2014; Tao et al., 2015; Li et al., 2018), and user-provided seed words (Meng et al., 2018; Tao et al., 2015). In this paper, we focus on user-provided seed words as the source of weak supervision. Along this line, Doc2Cube (Tao et al., 2015) expands label keywords from label surface names and performs multi-dimensional document classification by learning dimension-aware embeddings; PTE (Tang et al., 2015) utilizes both labeled and unlabeled documents to learn text embeddings specifically for a task, which are later fed to logistic regression classifiers for classification; Meng et al. (2018) leverage seed information to generate pseudo documents and introduce a self-training module that bootstraps on real unlabeled data for model refining. This method is later extended to handle hierarchical classification based on a pre-defined label taxonomy (Meng et al., 2019). However, all these weak supervision methods operate in a context-free manner. Here, we propose to use contextualized weak supervision.

7.2 Contextualized Word Representations

Contextualized word representations originated from machine translation (MT). CoVe (McCann et al., 2017) generates contextualized representations for a word based on pre-trained MT models. More recently, ELMo (Peters et al., 2018) leverages neural language models to replace MT models, which removes the dependency on massive parallel texts and takes advantage of nearly unlimited raw corpora.


Table 3: Case Study: Seed word expansion of the For Sale class in context-free and contextualized corpora. The For Sale class contains documents advertising goods for sale. Potentially wrong seeds are marked with *.

Iter   Plain Corpus                                   Contextualized Corpus
1      sale, offer, forsale                           sale, offer, forsale
2      space*, price, shipping, sale, offer           shipping, forsale, offer$0, condition$0, sale
3      space*, price, shipping, sale, nasa*,          price, shipping, sale, forsale, condition$0,
       offer, package, email                          offer$0, package, email
4      space*, price, moon*, shipping, sale, nasa*,   price, shipping, sale, forsale, condition$0,
       offer, shuttle*, package, email                offer$0, package, email, offers$0, obo$0

Many models leveraging language modeling to build sentence representations (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019) emerged at almost the same time. Language models have also been extended to the character level (Liu et al., 2018; Akbik et al., 2018), which can generate contextualized representations for character spans.

Our proposed framework is compatible with all the above contextualized representation techniques. In our implementation, we choose to use BERT to demonstrate the power of using contextualized supervision.

7.3 Word Sense Disambiguation

Word Sense Disambiguation (WSD) is one of the challenging problems in natural language processing. Typical WSD models (Lesk, 1986; Zhong and Ng, 2010; Yuan et al., 2016; Raganato et al., 2017; Le et al., 2018; Tripodi and Navigli, 2019) are trained for a general domain. Recent works (Li and Jurafsky, 2015; Mekala et al., 2016; Gupta et al., 2019) also showed that machine-interpretable representations of words that consider their senses improve document classification. However, if one wants to apply WSD to some specific corpus, additional annotated training data might be required to match our performance, which defeats the purpose of a weakly supervised setting.

In contrast, our contextualization, building upon (Devlin et al., 2019), is adaptive to the input corpus without requiring any additional human annotations. Therefore, our framework is more suitable than WSD under the weakly supervised setting. Our experimental results have verified this reasoning and showed the superiority of our contextualization module over WSD in weakly supervised document classification tasks.

7.4 Document Classifier

The document classification problem has long been studied. In our implementation of the proposed ConWea framework, we used HAN (Yang et al., 2016), which considers the hierarchical structure of documents and includes attention mechanisms to find the most important words and sentences in a document. CNN-based text classifiers (Kim, 2014; Zhang et al., 2015; Lai et al., 2015) are also popular and can achieve inspiring performance.

Our framework is compatible with all the above text classifiers. We choose HAN just for demonstration purposes.

8 Conclusions and Future Work

In this paper, we proposed ConWea, a novel contextualized weakly supervised classification framework. Our method leverages contextualized representation techniques and initial user-provided seed words to contextualize the corpus. This contextualized corpus is further used to resolve the interpretation of seed words through iterative seed word expansion and document classifier training. Experimental results demonstrate that our model outperforms previous methods significantly, thereby signifying the superiority of contextualized weak supervision, especially when labels are fine-grained.

In the future, we are interested in generalizing contextualized weak supervision to hierarchical text classification problems. Currently, we perform coarse- and fine-grained classifications separately; there should be more useful information embedded in the tree structure of the label hierarchy. Also, extending our method to other types of textual data, such as short texts, multi-lingual data, and code-switched data, is a potential direction.

Acknowledgements

We thank Palash Chauhan and Harsh Jhamtani for valuable discussions.


References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries, pages 85–94. ACM.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649.

Ming-Wei Chang, Lev-Arie Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of semantic representation: Dataless classification. In AAAI, volume 2, pages 830–835.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Vivek Gupta, Ankit Saw, Pegah Nokhiz, Harshit Gupta, and Partha Talukdar. 2019. Improving document classification with multi-sense embeddings. arXiv preprint arXiv:1911.07918.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 328–339.

Anil K Jain and Richard C Dubes. 1988. Algorithms for Clustering Data. Englewood Cliffs: Prentice Hall, 1988.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Benjamin J Kuipers, Patrick Beeson, Joseph Modayil, and Jefferson Provost. 2006. Bootstrap learning of foundational representations. Connection Science, 18(2):145–158.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Minh Le, Marten Postma, Jacopo Urbani, and Piek Vossen. 2018. A deep dive into word sense disambiguation with LSTM. In Proceedings of the 27th International Conference on Computational Linguistics, pages 354–365.

Michael Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24–26.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070.

Keqian Li, Hanwen Zha, Yu Su, and Xifeng Yan. 2018. Unsupervised neural categorization for scientific publications. In SIAM Data Mining, pages 37–45. SIAM.

Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. 2015. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1729–1744.

Liyuan Liu, Jingbo Shang, Xiang Ren, Frank Fangzheng Xu, Huan Gui, Jian Peng, and Jiawei Han. 2018. Empower sequence labeling with task-aware neural language model. In Thirty-Second AAAI Conference on Artificial Intelligence.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.

Dheeraj Mekala, Vivek Gupta, Bhargavi Paranjape, and Harish Karnick. 2016. SCDV: Sparse composite document vectors using soft clustering over distributional representations. arXiv preprint arXiv:1612.06778.

Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2018. Weakly-supervised neural text classification. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 983–992. ACM.

Yu Meng, Jiaming Shen, Chao Zhang, and Jiawei Han. 2019. Weakly-supervised hierarchical text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6826–6833.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Takeru Miyato, Andrew M Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.


Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 99–110.

Ellen Riloff, Janyce Wiebe, and Theresa Wilson. 2003. Learning subjective nouns using extraction pattern bootstrapping. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 25–32. Association for Computational Linguistics.

Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, and Jiawei Han. 2018. Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering, 30(10):1825–1837.

Yangqiu Song and Dan Roth. 2014. On dataless hierarchical text classification. In AAAI.

Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165–1174. ACM.

Fangbo Tao, Chao Zhang, Xiusi Chen, Meng Jiang, Tim Hanratty, Lance Kaplan, and Jiawei Han. 2015. Doc2Cube: Automated document allocation to text cube via dimension-aware joint embedding. Dimension, 2016:2017.

Rocco Tripodi and Roberto Navigli. 2019. Game theory meets embeddings: A unified framework for word sense disambiguation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 88–99.

Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. 2017. Variational autoencoder for semi-supervised text classification. In AAAI.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. arXiv preprint arXiv:1603.07012.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83.