
Online Information Retrieval for Language Learning

Maria Chinkina Madeeswaran Kannan Detmar Meurers

Universität Tübingen, LEAD Graduate School

Department of Linguistics
{mchnkina,mkannan,dm}@sfs.uni-tuebingen.de

Abstract

The reading material used in a language learning classroom should ideally be rich in terms of the grammatical constructions and vocabulary to be taught and in line with the learner's interests. We developed an online Information Retrieval system that helps teachers search for texts appropriate in form, content, and reading level. It identifies the 87 grammatical constructions spelled out in the official English language curriculum of schools in Baden-Württemberg, Germany. The tool incorporates a classical efficient algorithm for reranking the results by assigning weights to selected constructions and prioritizing the documents containing them. Supplemented by an interactive visualization module, it allows for a multi-faceted presentation and analysis of the retrieved documents.

1 Introduction

The learner's exposure to a language influences their acquisition of it. The importance of input in second language (L2) learning has been repeatedly emphasized by the proponents of major Second Language Acquisition theories (Krashen, 1977; Gass and Varonis, 1994; Swain, 1985), with psycholinguists highlighting the significance of frequency and perceptual salience of target constructions (e.g., Slobin, 1985).

In line with this research, a pedagogical approach of input flood (Trahey and White, 1993) is extensively used by L2 teachers. However, manually searching for linguistically rich reading material takes a lot of time and effort. As a result, teachers often make use of easily accessible schoolbook texts. This, however, limits the choice of texts, and they are typically less up-to-date and less in line with students' interests than authentic texts. In the same vein, a survey conducted by Purcell et al. (2012) revealed that teachers expect their students to use online search engines in a typical research assignment with a very high probability of 94%, compared to the 18% usage of printed or electronic textbooks.

With this in mind, we developed an online Information Retrieval (IR) system that uses efficient algorithms to retrieve, annotate and rerank web documents based on the grammatical constructions they contain. The paper presents FLAIR [1] (Form-Focused Linguistically Aware Information Retrieval), a tool that provides a balance of content and form in the search for appropriate reading material.

2 Overview and Architecture

The FLAIR pipeline can be broadly reduced to four primary operations – Web Search, Text Crawling, Parsing and Ranking. As demonstrated by the diagram in Figure 1, the first three operations are delegated to the server as they require the most resources. Ranking, however, is performed locally on the client endpoint to reduce latency.

Web Crawling

We chose to use Microsoft Bing [2] as our primary search engine given its readily available Java bindings. By default, the top 20 results are fetched for any given search query. A basic filter is applied to exclude web documents with low text content. The search is conducted repeatedly until the resulting list of documents contains at least 20 items.
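To make the crawling step concrete, the search-and-filter loop can be sketched roughly as follows. This is an illustrative sketch only: SearchClient, fetchPage, and the 500-character threshold are hypothetical stand-ins, since the paper does not detail the Bing Java bindings or the exact low-content filter.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the repeated search: SearchClient and the 500-character threshold
// are illustrative placeholders, not FLAIR's actual Bing bindings or filter.
public class WebSearcher {

    public interface SearchClient {
        List<String> search(String query, int offset, int count); // returns result URLs
    }

    private static final int RESULTS_WANTED = 20;
    private static final int MIN_TEXT_LENGTH = 500;

    /** Repeats the search until at least 20 text-rich documents have been collected. */
    public static List<String> collectDocuments(SearchClient client,
                                                Function<String, String> fetchPage,
                                                String query) {
        List<String> texts = new ArrayList<>();
        int offset = 0;
        while (texts.size() < RESULTS_WANTED) {
            List<String> urls = client.search(query, offset, RESULTS_WANTED);
            if (urls.isEmpty()) {
                break;                        // no further results available
            }
            for (String url : urls) {
                String text = fetchPage.apply(url);
                if (text != null && text.length() >= MIN_TEXT_LENGTH) {
                    texts.add(text);          // keep only documents with enough text
                }
            }
            offset += RESULTS_WANTED;         // move on to the next page of results
        }
        return texts.size() > RESULTS_WANTED ? texts.subList(0, RESULTS_WANTED) : texts;
    }
}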

[1] The online tool is accessible at: http://purl.org/icall/flair

[2] http://bing.com

Figure 1: FLAIR architecture.

Text Extraction

The Text Extractor makes use of the Boilerpipe library [3], extracting plain text with the help of its DefaultExtractor. The choice is motivated by the high performance of the library as compared to other text extraction techniques (Kohlschütter et al., 2010).
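For reference, extracting the main text with Boilerpipe's DefaultExtractor amounts to a single call; the minimal sketch below assumes the HTML has already been downloaded as a string.

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.DefaultExtractor;

public class TextExtraction {
    // Strips boilerplate (navigation, ads, footers) and returns the main article text.
    public static String extract(String html) throws BoilerpipeProcessingException {
        return DefaultExtractor.INSTANCE.getText(html);
    }
}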

Parsing

Text parsing is facilitated by the Stanford CoreNLP library [4] (Manning et al., 2014), which was chosen for its robust, performant and open-source implementation. Our initial prototype used the standard PCFG parser for constituent parsing, but its cubic time complexity was a significant issue when parsing texts with long sentences. We therefore switched to a shift-reduce implementation [5] that scales linearly with sentence and parse length. While it resulted in a higher memory overhead due to its large language models, it allowed us to substantially improve the performance of our code.
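As a rough illustration, switching CoreNLP to the shift-reduce constituent parser only requires pointing the parse annotator at the shift-reduce model; the annotator list and model path below follow the standard CoreNLP distribution, while the example sentence is arbitrary.

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class ShiftReduceParsing {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, parse");
        // Use the shift-reduce model instead of the default (cubic-time) PCFG parser.
        props.setProperty("parse.model",
                "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("It is used to build rockets.");
        pipeline.annotate(document);
        for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
            // The constituent tree is later scanned for grammatical constructions.
            System.out.println(sentence.get(TreeCoreAnnotations.TreeAnnotation.class));
        }
    }
}

Note that the shift-reduce model is distributed separately from the default CoreNLP models package.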

[3] https://code.google.com/p/boilerpipe/
[4] http://nlp.stanford.edu/software/corenlp.shtml
[5] http://nlp.stanford.edu/software/srparser.shtml

Ranking

The final stage of the pipeline involves ranking the results according to a number of grammatical constructions and syntactic properties. Each parameter can be assigned a specific weight that then affects its ranking relative to the other parameters. The parsed data is cached locally on the client side for each session. This allows us to perform the ranking calculations on the local computer, thereby avoiding a server request-response roundtrip for each re-ranking operation.

We chose the classical IR algorithm BM25 (Robertson and Walker, 1994) as the basis for our ranking model. It helps to avoid the dominance of one single grammatical construction over the others and is independent of the normalization unit, as it uses a ratio of the document length to the average document length in the collection. The final score of each document determines its place in the ranking and is calculated as:

$$
G(q, d) = \sum_{t \in q \cap d} \frac{(k+1) \times \mathit{tf}_{t,d}}{\mathit{tf}_{t,d} + k \times \left(1 - b + b \times \frac{|d|}{\mathit{avdl}}\right)} \times \log \frac{N+1}{\mathit{df}_t}
$$

where q is a FLAIR query containing one or more linguistic forms, t is a linguistic form, d is a document, tf_{t,d} is the number of occurrences of t in d, |d| is the document length, avdl is the average document length in the collection, df_t is the number of documents containing t, and k is a free parameter set to 1.7. The free parameter b specifies the importance of the document length; the tool lets the user adjust it with a slider that assigns a value from 0 to 1 to b.
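A direct transcription of this scoring function might look as follows; the data structures are illustrative and not FLAIR's actual (client-side) implementation.

import java.util.Map;

public class FormWeightedScorer {
    private static final double K = 1.7; // free parameter k from the formula above

    /**
     * Scores one document d for a FLAIR query q (the set of selected linguistic forms).
     * tf maps each selected form t to tf_{t,d}; df maps t to df_t; docLen is |d|;
     * avgLen is avdl; n is N (number of retrieved documents); b is set by the slider.
     */
    public static double score(Map<String, Integer> tf, Map<String, Integer> df,
                               int docLen, double avgLen, int n, double b) {
        double score = 0.0;
        for (Map.Entry<String, Integer> entry : tf.entrySet()) {
            double tftd = entry.getValue();
            double dft = df.get(entry.getKey());
            double norm = tftd + K * (1 - b + b * docLen / avgLen);
            score += (K + 1) * tftd / norm * Math.log((n + 1) / dft);
        }
        return score;
    }
}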

2.1 Technical Implementation

FLAIR is written in Java and implemented as a Java EE web application. The core architecture revolves around a client-server implementation that uses WebSocket (Fette and Melnikov, 2011) and Ajax (Garrett et al., 2005) technologies for full-duplex, responsive communication. All server operations are performed in parallel, and each operation is divided into subtasks that are executed asynchronously. Operations initiated by the client are dispatched as asynchronous messages to the server. The client then waits for responses from the latter, which are relayed as rudimentary push messages encoded in JSON [6]. By using WebSockets to implement the server endpoint, we were able to reduce most of the overhead associated with HTTP responses.

The sequence of operations performed within the client boundary is described as follows:

1. Send search query to server and initiate web search

2. Wait for completion signal from server

3. Initiate text parsing

4. Wait for completion signal from server

5. Request parsed data from server

6. Cache parsed data

7. Re-rank results according to parameters

The sequence of operations performed within the server boundary is described as follows:

1. Receive search query from client

2. Begin web search operation:

(a) Fetch top N valid search results
(b) For each search result, fetch page text
(c) Signal completion

3. Wait for request from client

4. Begin text parsing operation:

(a) For each valid search result, parse text and collate data

(b) Signal completion

5. Wait for request from client

6. Send parsed data to client
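The server side of this exchange can be sketched with the standard Java EE WebSocket API as shown below; the endpoint path, message prefix, and JSON payload are invented for illustration and do not reflect FLAIR's actual wire format.

import java.util.concurrent.CompletableFuture;
import javax.websocket.OnMessage;
import javax.websocket.Session;
import javax.websocket.server.ServerEndpoint;

// Illustrative endpoint only: the path and message strings are hypothetical.
@ServerEndpoint("/pipeline")
public class PipelineEndpoint {

    @OnMessage
    public void onMessage(String message, Session session) {
        // Long-running work is handed off asynchronously so the socket thread stays free.
        if (message.startsWith("SEARCH:")) {
            String query = message.substring("SEARCH:".length());
            CompletableFuture
                    .runAsync(() -> crawl(query))   // web search + text extraction
                    .thenRun(() -> push(session, "{\"event\":\"searchComplete\"}"));
        }
    }

    private void crawl(String query) { /* web search and extraction omitted */ }

    // Relays a rudimentary JSON-encoded push message back to the client.
    private void push(Session session, String json) {
        session.getAsyncRemote().sendText(json);
    }
}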

[6] http://json.org

3 FLAIR Interface

The main layout consists of four elements – a settings panel, a search field, a list of results, and a reading interface, where the identified target constructions are highlighted. The interactive visualization incorporates the technique of parallel coordinates used for visualizing multivariate data (Inselberg and Dimsdale, 1991).

The visualization provides an overview of the distribution of the selected linguistic characteristics in the set of retrieved documents. Vertical axes represent parameters – linguistic forms, number of sentences, number of words, and the readability score – and each polyline stands for a document having certain linguistic characteristics and thus going through different points on the parameter axes. The interactive design allows for more control over a user-selected set of linguistic characteristics. Users can select a range of values for one or more constructions to precisely identify and retrieve documents.
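The selection step behind this interaction boils down to a range filter over per-document counts; the following sketch (with an illustrative data representation) shows the idea independently of the actual visualization code.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Keeps only documents whose count for the chosen construction lies within [min, max],
// mirroring a range selection on one of the parallel-coordinate axes.
public class RangeFilter {
    public static List<Map<String, Integer>> select(List<Map<String, Integer>> documents,
                                                    String construction, int min, int max) {
        return documents.stream()
                .filter(counts -> {
                    int c = counts.getOrDefault(construction, 0);
                    return c >= min && c <= max;
                })
                .collect(Collectors.toList());
    }
}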

Figures 2 and 3 demonstrate FLAIR in use: The user has entered the query Germany and selected Past Perfect and Present Perfect as target constructions. After reranking the 20 retrieved documents, the interactive visualization was used to select only the documents with a non-zero frequency of both constructions.

4 Detection of Linguistic Forms

We based our choice of the 87 linguistic forms on the official school curriculum for English in the state of Baden-Württemberg, Germany [7]. As most of the linguistic structures listed there do not have a one-to-one mapping to the standard output of NLP tools, we used a rule-based approach to approximate them.

For closed word classes, string matching (e.g., articles) or look-up lists (e.g., prepositions) can be used to differentiate between their forms. However, the detection of some grammatical constructions and syntactic structures requires a deeper syntactic analysis. Identifying the degrees of comparison of long adjectives requires keeping track of two consecutive tokens and their POS tags, as is the case with the construction used to, which cannot simply be string-matched (cf. the passive "It is used to build rockets").
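As an illustration of such a rule (simplified relative to FLAIR's actual patterns), the habitual used to construction can be approximated by checking the POS tags of adjacent tokens and excluding a preceding form of be, which signals the passive reading.

import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;

// Simplified illustration of a token-level rule; FLAIR's actual patterns are more elaborate.
public class UsedToRule {

    /** True if tokens[i] starts a habitual "used to" + base verb, e.g. "She used to swim." */
    public static boolean isHabitualUsedTo(List<CoreLabel> tokens, int i) {
        if (i + 2 >= tokens.size()) {
            return false;
        }
        boolean pattern = tokens.get(i).word().equalsIgnoreCase("used")
                && tokens.get(i).tag().startsWith("VB")       // "used" tagged as a verb form
                && tokens.get(i + 1).word().equalsIgnoreCase("to")
                && tokens.get(i + 2).tag().equals("VB");       // followed by a base-form verb

        // Exclude the passive reading, e.g. "It is used to build rockets."
        boolean precededByBe = i > 0
                && tokens.get(i - 1).word().toLowerCase().matches("is|was|are|were|be|been|being");

        return pattern && !precededByBe;
    }
}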

[7] The curricula for grades 2, 4, 6, 8, 10 are accessible on the website of the education portal of Baden-Württemberg: http://bildung-staerkt-menschen.de

Figure 2: FLAIR interface: the settings panel, the list of results and the reading interface.

Figure 3: The visualization component of FLAIR. Vertical axes correspond to text characteristics and the lines going through the axes represent documents.

More challenging structures, such as real and unreal conditionals and different grammatical tenses, are identified by means of complex patterns and additional constraints. For a more elaborate discussion of the detection of linguistic forms, the pilot evaluation, and the use cases, see Chinkina and Meurers (2016).

5 Performance Evaluation

Parallelization of the tool allowed us to reduce the overall processing time by at least a factor of 25 (e.g., 35 seconds compared to 15 minutes for the top 20 results). However, due to the highly parallel nature of the system, its performance is largely dependent on the hardware on which it is deployed. Amongst the different operations performed by the pipeline, web crawling and text annotation prove to be the most time-consuming and resource-intensive tasks. Web crawling is an I/O task that is contingent on external factors such as remote network resources and bandwidth, thereby making it a potential bottleneck and also an unreliable target for profiling. We conducted several searches and calculated the relative time each operation took. Fetching the results and extracting the documents took around 50-65% of the total time (from entering the query to displaying the list of results), and parsing them took around 20-30%.
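How the per-document work can be overlapped is sketched below with a plain thread pool; this is a simplification under assumed names, not FLAIR's actual task decomposition.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Sketch: each document is fetched, extracted, and parsed in its own task, so the
// I/O-bound crawling of one document overlaps with the CPU-bound parsing of another.
public class ParallelPipeline {

    static class ParsedDocument { /* parse trees, construction counts, ... */ }

    public static List<ParsedDocument> process(List<String> urls) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<CompletableFuture<ParsedDocument>> futures = urls.stream()
                .map(url -> CompletableFuture
                        .supplyAsync(() -> fetchAndExtract(url), pool)   // I/O-bound
                        .thenApplyAsync(ParallelPipeline::parse, pool))  // CPU-bound
                .collect(Collectors.toList());
        List<ParsedDocument> results = futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        pool.shutdown();
        return results;
    }

    static String fetchAndExtract(String url) { /* crawl + Boilerpipe */ return ""; }

    static ParsedDocument parse(String text) { /* CoreNLP parse */ return new ParsedDocument(); }
}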

The Stanford parser is responsible for text annotation operations, and its shift-reduce constituent parser offers best-in-class performance and accuracy [8]. We analyzed the performance of the parser on the constructions that our tool depends on for the detection of linguistic patterns.

[8] See http://nlp.stanford.edu/software/srparser.shtml

Among the biggest challenges were gerunds, which were annotated either as nouns (NN) or as gerunds/present participles (VBG). Phrasal verbs, such as settle in, also proved problematic for the parser and were sometimes not represented as a single unit in the list of dependencies.

The FLAIR light-weight algorithm for detecting linguistic forms builds upon the results of the Stanford parser while adding negligible overhead. To evaluate it, we collected nine news articles with an average length of 39 sentences by submitting three search queries and saving the top three results for each of them. We then annotated all sentences for the 87 grammatical constructions and compared the results to the system output. Table 1 provides the precision, recall, and F-measure for selected linguistic forms identified by FLAIR [9].

Linguistic target              Prec.  Rec.   F1
Yes/no questions               1.00   1.00   1.00
Irregular verbs                1.00   0.96   0.98
used to                        0.83   1.00   0.91
Phrasal verbs                  1.00   0.61   0.76
Tenses (Present Simple, ...)   0.95   0.84   0.88
Conditionals (real, unreal)    0.65   0.83   0.73
Mean (81 targets)              0.94   0.90   0.91
Median (81 targets)            1.00   0.97   0.95

Table 1: Evaluating the FLAIR algorithm

As the numbers show, some constructions are easily detectable (plural irregular noun forms, e.g., children), while others cannot be reliably identified by the parser (conditionals). The reasons for low performance are manifold: the ambiguity of a construction (real conditionals), the unreliable output of the text extractor module (simple sentences) or the Stanford Parser (-ing verb forms), and the FLAIR parser module itself (unreal conditionals). Given the decent F-scores and our goal of covering the whole curriculum, we included all constructions in the final system, independent of their F-score. As for the effectiveness of the tool in a real-life setting, full user studies with language teachers and learners are necessary for a proper evaluation of the distinctive components of FLAIR (see Section 7).

[9] The mean and the median are given for 81 targets because six grammatical constructions did not occur in the test set.

6 Related Work

While most of the state-of-the-art IR systems designed for language teachers and learners implement a text complexity module, they differ in how they treat vocabulary and grammar. Vocabulary models are built using either word lists (LAWSE by Ott and Meurers, 2011) or the data from learner models (REAP by Brown and Eskenazi, 2004). Grammar is given little to no attention: Bennöhr (2005) takes into account the complexity of different conjunctions in her TextFinder algorithm.

Distinguishing features of FLAIR aimed at making it usable in a real-life setting are that (i) it covers the full range of grammatical forms and categories specified in the official English curriculum for German schools, and (ii) its parallel processing model makes it possible to efficiently retrieve, annotate, and rerank 20 web documents in a matter of seconds.

7 Conclusion and Outlook

The paper presented FLAIR – an Information Retrieval system that uses state-of-the-art NLP tools and algorithms to maximize the number of specific linguistic forms in the top retrieved texts. It supports language teachers in their search for appropriate reading material in the following way:

• A parsing algorithm detects the 87 linguistic constructions spelled out in the official curriculum for the English language.

• Parallel processing allows fetching and parsing several documents at the same time, making the system efficient for real-life use.

• The responsive design of FLAIR ensures a seamless interaction with the system.

The tool offers input enrichment of online materials. In a broader context of computer-assisted language learning, it can be used to support input enhancement (e.g., WERTi by Meurers et al., 2010) and exercise generation (e.g., Language Muse by Burstein et al., 2012).

Recent work includes the integration of the Academic Word List (Coxhead, 2000) to estimate the register of documents on the fly and rerank them accordingly. The option of searching for and highlighting the occurrences of words from customized vocabulary lists has also been implemented. In addition to the already available length and readability filters, we are working on options to constrain the search space by including

support for (i) search restricted to specific web domains and data sets, such as Project Gutenberg [10] or news pages, and (ii) search through one's own data set. We also plan to implement and test more sophisticated text readability formulas (Vajjala and Meurers, 2014) and extend our information retrieval algorithm. Finally, a pilot online user study targeting language teachers is the first step we are taking to empirically evaluate the efficacy of the tool.

[10] http://gutenberg.org

On the technical side, FLAIR was built from the ground up to be easily scalable and extensible. Our implementation taps the parallelizability of text parsing and distributes the task homogeneously over any given hardware. While FLAIR presently supports the English language exclusively, its architecture enables us to add support for more languages and grammatical constructions with a minimal amount of work.

Acknowledgments

This research was funded by the LEAD Graduate School [GSC1028], a project of the Excellence Initiative of the German federal and state governments. Maria Chinkina is a doctoral student at the LEAD Graduate School.

We would also like to thank the language teachers at the Fachsprachzentrum Tübingen for trying out the tool and providing valuable feedback.

References

Jasmine Bennöhr. 2005. A web-based personalised TextFinder for language learners. Master's thesis, University of Edinburgh.

Jonathan Brown and Maxine Eskenazi. 2004. Retrieval of authentic documents for reader-specific lexical practice. In InSTIL/ICALL Symposium 2004.

Maria Chinkina and Detmar Meurers. 2016. Linguistically aware information retrieval: Providing input enrichment for second language learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, San Diego, CA.

Averil Coxhead. 2000. A new academic word list. TESOL Quarterly, 34(2):213–238.

Ian Fette and Alexey Melnikov. 2011. The WebSocket protocol.

Jesse James Garrett et al. 2005. Ajax: A new approach to web applications.

Susan M. Gass and Evangeline Marlos Varonis. 1994. Input, interaction, and second language production. Studies in Second Language Acquisition, 16(3):283–302.

Alfred Inselberg and Bernard Dimsdale. 1991. Parallel coordinates. In Human-Machine Interactive Systems, pages 199–233. Springer.

Christian Kohlschütter, Peter Fankhauser, and Wolfgang Nejdl. 2010. Boilerplate detection using shallow text features. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, pages 441–450. ACM.

Stephen Krashen. 1977. Some issues relating to the monitor model. On TESOL, 77:144–158.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Niels Ott and Detmar Meurers. 2011. Information retrieval for education: Making search engines language aware. Themes in Science and Technology Education, 3(1–2).

Kristen Purcell, Lee Rainie, Alan Heaps, Judy Buchanan, Linda Friedrich, Amanda Jacklin, Clara Chen, and Kathryn Zickuhr. 2012. How teens do research in the digital world. Pew Internet & American Life Project.

Stephen E. Robertson and Steve Walker. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 232–241.

Dan I. Slobin. 1985. Crosslinguistic evidence for the language-making capacity. The Crosslinguistic Study of Language Acquisition, 2:1157–1256.

Merrill Swain. 1985. Communicative competence: Some roles of comprehensible input and comprehensible output in its development. Input in Second Language Acquisition, 15:165–179.

Martha Trahey and Lydia White. 1993. Positive evidence and preemption in the second language classroom. Studies in Second Language Acquisition, 15(2):181–204.

Sowmya Vajjala and Detmar Meurers. 2014. Assessing the relative reading level of sentence pairs for text simplification. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL-14), Gothenburg, Sweden. Association for Computational Linguistics.