Language Identification of Search Engine Queries
![Page 1: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/1.jpg)
Language Identification of Search Engine Queries
Hakan Ceylan, Department of Computer Science, University of North Texas, Denton, TX 76203, [email protected]
Yookyung Kim, Yahoo! Inc., 2821 Mission College Blvd., Santa Clara, CA, [email protected]
ACL 2009
![Page 2: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/2.jpg)
Outline
• Introduction
• Data Generation
• Language Identification
• Conclusions and Future Work
![Page 3: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/3.jpg)
Introduction(1)
• Task: decide in which language a given text is written
• It is heavily studied
• It is of critical importance to search engines for queries
• Challenge: lack of any standard or publicly available data set
![Page 4: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/4.jpg)
Introduction(2)
• There are cases where a correct identification of the language is not necessary.
  Example: the query "homo sapiens", entered by a user from Spain.
• Add a non-linguistic feature to the system
![Page 5: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/5.jpg)
Introduction(3)
![Page 6: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/6.jpg)
Data Generation(1)
• Data set: constructed from queries with clicked urls
  Source: the Yahoo! Search Engine, for each language
  Time: a three-month period
![Page 7: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/7.jpg)
Data Generation(2)
• Preprocess:
  Remove any numbers, special characters, and extra spaces.
  Lowercase all the letters of the queries.
  Calculate the frequencies of the urls for each query.
• A web page is 474 words long on average
• Identify the language of each web page using one of the existing methods.
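The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `preprocess` is hypothetical, and it assumes that "special characters" means anything that is neither a letter nor whitespace, and that such characters are dropped rather than replaced by a space.

```python
def preprocess(query: str) -> str:
    """Drop digits and special characters, collapse spaces, lowercase.

    Assumption: every character that is neither a letter nor whitespace
    is removed outright.
    """
    # Keep letters (including accented ones) and whitespace only.
    kept = "".join(ch for ch in query if ch.isalpha() or ch.isspace())
    # Splitting and re-joining removes leading, trailing and repeated spaces.
    return " ".join(kept.split()).lower()
```

For instance, `preprocess("Homo   Sapiens?")` yields `"homo sapiens"`.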
![Page 8: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/8.jpg)
Data Generation(3)
• Use Table 1 (T1) and Table 2 (T2) to store the above information
  T1 : [ q , u , f_u ]
    q : query
    u : a unique url
    f_u : the frequency of u
  T2 : [ u , l ]
    u : url
    l : language identified for u
• Combine T1 and T2 into T3
  T3 : [ q , l , f_l , c_{u,l} ]
    l : a language
    f_l : the count of clicks for l
    c_{u,l} : the count of unique urls in language l
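A rough sketch of how T1 and T2 might be combined into T3 is given below. The function name `build_t3` and the in-memory representation (a list of tuples for T1, a dictionary for T2) are assumptions made for illustration; the slides do not say how the tables are stored.

```python
from collections import defaultdict

def build_t3(t1, t2):
    """Join T1 rows (q, u, f_u) with T2 (u -> l) into T3 rows.

    Returns a sorted list of (q, l, f_l, c_ul) tuples, where f_l is the
    total click count for language l and c_ul is the number of unique
    urls in language l for that query.
    """
    clicks = defaultdict(int)   # (q, l) -> summed click frequency
    urls = defaultdict(set)     # (q, l) -> unique urls seen
    for q, u, f_u in t1:
        l = t2.get(u)
        if l is None:           # url without an identified language: skip
            continue
        clicks[(q, l)] += f_u
        urls[(q, l)].add(u)
    return [(q, l, clicks[(q, l)], len(urls[(q, l)]))
            for (q, l) in sorted(clicks)]
```

With three clicked urls for one query, two of them Spanish and one English, the result contains one T3 row per language.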
![Page 9: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/9.jpg)
Data Generation(4)
• The data contains a lot of noise.
  1. A query maps to more than one language.
     Solution: give each query a weight w_{q,l} for each language and set a threshold parameter W; if w_{q,l} < W, remove the query.
  2. Navigational queries.
     Example: ACL 2009
![Page 10: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/10.jpg)
Data Generation(5)
     Solution: set two threshold parameters F and U; if F_q > F or U_q < U, remove the query.
• Algorithm
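The two noise filters can be sketched over T3 rows as below. Several details are assumptions, since the slides do not define them: w_{q,l} is taken to be the share of a query's clicks that fall on language l, F_q the total click count of the query, and U_q its total number of unique urls. The removal conditions are applied exactly as stated on the slides, and `filter_queries` is a hypothetical name.

```python
from collections import defaultdict

def filter_queries(t3, W=1.0, F=50, U=5):
    """Return {query: language} for queries that survive both filters."""
    by_query = defaultdict(list)
    for q, l, f_l, c_ul in t3:
        by_query[q].append((l, f_l, c_ul))

    kept = {}
    for q, rows in by_query.items():
        F_q = sum(f for _, f, _ in rows)   # assumed: total clicks for q
        U_q = sum(c for _, _, c in rows)   # assumed: total unique urls for q
        if F_q == 0:
            continue
        best_lang, best_clicks, _ = max(rows, key=lambda r: r[1])
        w = best_clicks / F_q              # assumed definition of w_{q,l}
        if w < W:                          # query maps to several languages
            continue
        if F_q > F or U_q < U:             # condition as stated on the slide
            continue
        kept[q] = best_lang
    return kept
```

With W = 1, a query is kept only when all of its clicks fall on pages in a single language.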
![Page 11: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/11.jpg)
Data Generation(6)
• How to tune the parameters: depends on the size of the data set (Silverstein et al., 1999)
  W = 1 , F = 50 , U = 5
• How many queries are filtered: 5%~10% of the queries
• Pick 500 queries at random and have them annotated by humans
  Category-1: If the query does not contain any foreign terms.
  Category-2: If there exist some foreign terms but the query would still be expected to bring web pages in the same language.
  Category-3: If the query belongs to other languages, or all the terms are foreign to the annotator.
![Page 12: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/12.jpg)
Data Generation(7)
• How much of this multilinguality does the parameter selection eliminate?
  Result:
  Category-1 : 47.6%
  Category-1+2 : 60.2%
![Page 13: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/13.jpg)
Language Identification(1)
• Implement three models, each using a different existing feature
  1. statistical model
  2. knowledge based model
  3. morphological model
• EuroParl Corpora
• Combine all three models in a machine learning framework using a novel approach
• Add a non-linguistic feature
![Page 14: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/14.jpg)
Language Identification(2)
• Test set: 3500 human-annotated queries
![Page 15: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/15.jpg)
Statistical model
• Character-based n-gram features (n = 1 to 7)
• Vocabulary from the training corpus (EuroParl)
• Generate a probability distribution from these counts
• The above can be done with the SRILM Toolkit, using Kneser-Ney discounting and interpolation
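To make the idea of character n-gram scoring concrete, here is a small self-contained sketch. It is not the SRILM setup described above: it swaps in a much simpler bag-of-n-grams scorer with add-one smoothing in place of Kneser-Ney discounting and interpolation, and the class and function names are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n_max=7):
    """Yield every character n-gram of the space-padded text, n = 1..n_max."""
    padded = f" {text} "
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

class CharNgramModel:
    """Add-one smoothed n-gram counts for one language (simplified)."""

    def __init__(self, corpus_lines, n_max=7):
        self.n_max = n_max
        self.counts = Counter()
        for line in corpus_lines:
            self.counts.update(char_ngrams(line.lower(), n_max))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1   # +1 slot for unseen n-grams

    def log_prob(self, text):
        denom = self.total + self.vocab
        return sum(math.log((self.counts[g] + 1) / denom)
                   for g in char_ngrams(text.lower(), self.n_max))

def identify(query, models):
    """Return the language whose model gives the query the highest score."""
    return max(models, key=lambda lang: models[lang].log_prob(query))
```

Here `models` is a dictionary from language code to a `CharNgramModel` trained on that language's corpus lines.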
![Page 16: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/16.jpg)
Knowledge based model
• Word-based n-gram features (n = 1)
• Vocabulary from the training corpus (EuroParl)
• Generate a probability distribution from these counts
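A word unigram model of this kind might look like the sketch below. The handling of out-of-vocabulary words (a fixed probability `floor`) is an assumption, as the slides do not say how unseen words are scored, and the function names are hypothetical.

```python
import math
from collections import Counter

def train_unigram(corpus_lines):
    """Count words in one language's corpus; returns (counts, total)."""
    counts = Counter(w for line in corpus_lines for w in line.lower().split())
    return counts, sum(counts.values())

def unigram_log_prob(query, model, floor=1e-6):
    """Sum of log P(word) over the query; unseen words get `floor`."""
    counts, total = model
    score = 0.0
    for w in query.lower().split():
        p = counts[w] / total if counts[w] else floor  # assumed OOV handling
        score += math.log(p)
    return score

def identify_by_words(query, models):
    """Return the language whose unigram model scores the query highest."""
    return max(models, key=lambda lang: unigram_log_prob(query, models[lang]))
```

Because it only recognises whole words seen in training, this model depends heavily on vocabulary coverage.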
![Page 17: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/17.jpg)
Morphological model
• Gather affix information from the corpora in an unsupervised fashion (Harald Hammarström, 2006)
• Give a score for each affix
![Page 18: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/18.jpg)
Language Identification(3)
• Performance
![Page 19: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/19.jpg)
Decision tree classification
• Each model can complement the others in certain cases
• Training data: the automatically annotated data set
• Feature: confidence score
• Use the kurtosis measure
![Page 20: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/20.jpg)
Decision tree classification
• An example: the query "the sovereign individual", which the statistical model identifies as English
  k = 7.6 > (4.47 + 1.96) = 6.43, so this query's confidence score is "en-HIGH"
• Implement the DT classifier with the Weka Machine Learning Toolkit (Witten and Frank, 2005)
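The confidence feature in the example can be sketched as follows. Much of this is assumed: the kurtosis is computed as the population (Pearson) kurtosis m4 / m2^2 of a model's scores over the candidate languages, the two constants 4.47 and 1.96 are simply taken from the example above, and the label `NOT-HIGH` is a hypothetical placeholder because the slides only show the HIGH case.

```python
def kurtosis(values):
    """Population (Pearson) kurtosis m4 / m2**2; an assumed definition."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        return 0.0              # flat distribution: no peak to measure
    return m4 / (m2 * m2)

def confidence_label(lang, k, a=4.47, b=1.96):
    """Label a prediction HIGH when k exceeds a + b (constants from the slide).

    'NOT-HIGH' is a placeholder; the other confidence levels are not
    given in the slides.
    """
    return f"{lang}-HIGH" if k > a + b else f"{lang}-NOT-HIGH"
```

With the slide's value, `confidence_label("en", 7.6)` gives `"en-HIGH"`; the resulting labels would then serve as input features for the decision tree.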
![Page 21: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/21.jpg)
Decision tree classification
• The decision tree classifier outperforms all the individual models for each size on average
![Page 22: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/22.jpg)
Decision tree classification
M_{l_i,l_j} : language l_i misclassified by the system as l_j
![Page 23: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/23.jpg)
non-linguistic feature
• The non-linguistic feature is the language information of the user's country
• It helps the search engine in guessing the language
  Example: the query "how to tape for plantar fasciits" (labelled as Category-2) is classified as a Portuguese query
![Page 24: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/24.jpg)
non-linguistic feature
• Increase test set size to 430 queries
![Page 25: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/25.jpg)
Conclusions
• A completely automated method to generate a reliable data set
• Built a decision tree classifier that improves the results on average
• Built a second classifier that takes into account the geographical information of the users
![Page 26: Language Identification of Search Engine Queries](https://reader036.fdocuments.net/reader036/viewer/2022062315/56814d58550346895dba95e6/html5/thumbnails/26.jpg)
Future Work
• Improve the accuracy of data generation
• Examine the parameter values more carefully
• Extend the number of languages in the data set
• Consider other alternatives to the decision tree framework