CS 5604 Information Storage and Retrieval
Solr Team Final Presentation
Presenters: Liuqing Li, Ye Wang, Anusha Pillai, Ke Tian
{liuqing, yewang16, anusha89, ketian}@vt.edu
Instructor: Dr. Edward A. Fox
Virginia Polytechnic Institute and State University
Blacksburg, VA, 24061
December 6, 2016

Outline

• Background
• Implementation
• Problems Faced
• Lessons Learned
• Future Work
• Acknowledgement

Background: Overview

Background: Updates

Component            | Spring 2016                          | Fall 2016
schema.xml           | Coarse-grained                       | Fine-grained
                     | No copy fields                       | Copy fields for all-fields search
                     | Create stopwords.txt & profanity.txt | Update the two files
morphlines.conf      | Two field types: string and text     | Multiple field types
                     | Field “time” => string               | Field “time” => datetime
                     | No multiple-valued fields            | Multiple-valued field parser
Basic Indexing       | Small collection                     | 1.2 billion tweets dataset
Incremental Indexing | Virtual Cloudera (VC)                | VC & Hadoop Cluster (HC)
Recommendation       | Brief description                    | Implemented in VC & HC
Custom Ranking       | Brief description                    | Implemented in VC & HC
Solr Admin UI        | Brief description                    | Detailed description
                     | Limited faceted search               | Detailed faceted search

Implementation: Basic Indexing

• Live Mode
  • Continuous stream of HBase cell updates into live search indexes
  • Simple and efficient
  • Cannot handle big data
• Batch Mode
  • Batch index tables in HBase by using MapReduce jobs
  • Write index files into HDFS (/user/cs5604f16_solr/…)
  • Can handle big data

Implementation: Basic Indexing

• schema.xml: fields configuration
  • field (e.g., ideal-cs5604f16-fake)
    • # of fields: 30
    • Types: string (22), text_general (2), int (2), float (2), long (1), date (1)
    • Stored: True (17), False (13)
  • dynamicField: matching multiple fields, using wildcard
  • copyField
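
As a rough illustration of the dynamicField and copyField ideas on this slide, the Python sketch below mimics how a wildcard pattern can assign a type to an undeclared field and how a copy rule gathers values into one catch-all field. The patterns (`*_s`, `*_i`) and the `text_all` destination are hypothetical and are not taken from the team's schema.xml; Solr performs this matching internally, and `fnmatchcase` is only a stand-in for its wildcard rules.

```python
from fnmatch import fnmatchcase

# Hypothetical dynamicField rules: wildcard pattern -> field type.
DYNAMIC_FIELDS = [("*_s", "string"), ("*_i", "int")]

# Hypothetical copyField rules: (source pattern, destination field).
COPY_FIELDS = [("*_s", "text_all")]


def resolve_type(field_name):
    """Return the type of the first dynamic rule whose pattern matches, else None."""
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


def apply_copy_fields(doc):
    """Return a copy of doc with matching source values gathered into each destination."""
    result = dict(doc)
    for pattern, dest in COPY_FIELDS:
        copied = [str(value) for key, value in doc.items() if fnmatchcase(key, pattern)]
        if copied:
            result[dest] = copied
    return result


doc = {"topic_label_s": "twitter", "t_month_i": 12}
indexed = apply_copy_fields(doc)
print(resolve_type("topic_label_s"), resolve_type("t_month_i"), indexed["text_all"])
```

A catch-all destination of this kind is what makes an "all fields" search possible without listing every field in the query.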

Implementation: Basic Indexing

• stopword.txt and profanity.txt
  • stopword.txt: tf-idf values are not calculated for the listed words
  • profanity.txt: allows a quick response to search queries containing the listed words
  • Solr loads the two files while reading schema.xml

Sources:
https://pypi.python.org/pypi/many-stop-words
http://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
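
The role of the two word lists can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the word sets below are made-up stand-ins for the real files, tokenization is plain whitespace splitting, and Solr applies its own analyzers rather than anything like these helper functions.

```python
# Tiny in-memory stand-ins for stopword.txt and profanity.txt (hypothetical contents).
STOPWORDS = {"the", "a", "of", "and"}
PROFANITY = {"badword"}


def index_terms(text):
    """Lowercase, split on whitespace, and drop stopwords so they are never weighted."""
    return [token for token in text.lower().split() if token not in STOPWORDS]


def is_profane(query):
    """Return True if any query token appears in the profanity list."""
    return any(token in PROFANITY for token in query.lower().split())


print(index_terms("The history of Solr and HBase"))
print(is_profane("some badword here"))
```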

Implementation: Basic Indexing

• morphlines.conf: mapping and parsing
  • Mapping data from HBase to Solr
  • Split multiple values into a list, e.g., "topic_label_s": "twitter;social;media;text"
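
The multi-value split can be pictured with the short Python sketch below. It only illustrates the transformation applied to the example value from the slide; the function name and the choice of `;` as the default separator are assumptions, and the actual parsing is done by morphline commands in morphlines.conf, not by Python code.

```python
def split_multi_valued(record, field, separator=";"):
    """Return a copy of record where `field` holds a list instead of one joined string."""
    result = dict(record)
    raw = result.get(field, "")
    result[field] = [part.strip() for part in raw.split(separator) if part.strip()]
    return result


record = {"topic_label_s": "twitter;social;media;text"}
print(split_multi_valued(record, "topic_label_s"))
```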

Implementation: Basic Indexing

• Index the big dataset

                 | ideal-cs5604f16              | ideal-cs5604f16-1204
Dataset          | All collections (raw tweets) | All collections (raw tweets + processed data)
Indexing         |                              |
  # of DataNodes | 18                           | 17
  Space cost     | 392.33 GB                    | 399.21 GB
  Time cost      |                              |
    Mapping      | 1h21m                        | 1h45m
    Reducing     | 5h11m                        | 5h13m
    Merging      | 3h18m                        | 3h10m
    Total        | 9h50m                        | 10h8m
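
As a quick arithmetic check, the totals in the table are the sums of the three phases. The sketch below recomputes them from the mapping, reducing and merging times reported on the slide; the helper names are only illustrative.

```python
import re


def to_minutes(duration):
    """Convert a string such as '1h21m' to a number of minutes."""
    hours, minutes = re.fullmatch(r"(\d+)h(\d+)m", duration).groups()
    return int(hours) * 60 + int(minutes)


def to_text(total_minutes):
    """Convert minutes back to the 'XhYm' notation used in the table."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes}m"


# Mapping, reducing and merging times per collection, as reported on the slide.
PHASES = {
    "ideal-cs5604f16": ["1h21m", "5h11m", "3h18m"],
    "ideal-cs5604f16-1204": ["1h45m", "5h13m", "3h10m"],
}

totals = {
    name: to_text(sum(to_minutes(phase) for phase in phases))
    for name, phases in PHASES.items()
}
print(totals)
```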

Implementation: Incremental Indexing

• Purpose
  • Process a continuous stream of HBase cell updates into live search indexes (Near Real-Time, NRT indexing)
  • Solve the problem of frequent inserts, deletes and updates
• How does it work?
  • Enabling HBase replication (column family)
  • Pointing an NRT Indexer Service at an HBase table
  • Starting an NRT Indexer Service
• Our work

Source: http://www.cloudera.com/documentation/enterprise/5-6-x/topics/search_config_hbase_indexer_for_search.html
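
Conceptually, incremental indexing means replaying a stream of row changes against the search index instead of rebuilding it. The Python sketch below is a toy model of that idea only: the event tuples, row keys and the dictionary used as an "index" are all hypothetical, and it does not reproduce how the NRT Indexer Service or HBase replication is implemented.

```python
def apply_updates(index, updates):
    """Apply (operation, row_key, fields) events to an in-memory index, in order."""
    for operation, row_key, fields in updates:
        if operation == "put":
            # An insert or an update: merge the changed cells into the document.
            index.setdefault(row_key, {}).update(fields)
        elif operation == "delete":
            # Removing a row removes the document; a missing row is ignored.
            index.pop(row_key, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return index


index = {}
stream = [
    ("put", "tweet-1", {"text": "hello"}),
    ("put", "tweet-2", {"text": "solr"}),
    ("put", "tweet-1", {"text": "hello hbase"}),
    ("delete", "tweet-2", {}),
]
apply_updates(index, stream)
print(index)
```

Each event touches a single document, which is why this style copes well with frequent inserts, deletes and updates but is not meant for bulk loads.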

Implementation: Incremental Indexing

• Create and check the NRT indexer
• Restart the HBase Solr Indexer service
  • Restart the service in VC
  • Restart the service in HC
• Check the results in HBase and Solr Admin UI

Implementation: Recommendation

• Types
  • Textual similarity based
  • Collaborative filtering
• MoreLikeThis component
  • Identifies documents similar to the documents in the search results
  • Can be configured as a request handler or a search component
  • Uses term vectors to compute similarity
    • Term vectors can be calculated at query time or precomputed during indexing
  • Extracts the highest-matching terms based on tf-idf similarity
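
The idea of picking the most characteristic terms of a document by tf-idf can be sketched as follows. This uses a textbook tf-idf weight (term count times log of inverse document frequency) on a made-up three-document corpus; the exact weighting inside Solr's MoreLikeThis is not reproduced here, and the function name is hypothetical.

```python
import math
from collections import Counter


def top_terms(doc_index, corpus, k=3):
    """Return the k terms of corpus[doc_index] with the highest tf-idf weight."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tf = Counter(tokenized[doc_index])
    weights = {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
    # Sort by descending weight, breaking ties alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


corpus = [
    "solr indexes tweets solr",
    "hbase stores tweets",
    "hadoop stores data",
]
print(top_terms(0, corpus, k=2))
```

The selected terms would then serve as the query used to look for similar documents.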

Implementation: Recommendation

• schema.xml
  • Set stored=true
  • Set termVectors=true (for calculating tf-idf)
  • After making changes, reindexing is mandatory
• solrconfig.xml
  • Enable mlt
  • Define other configuration parameters
    • e.g., mlt.fl, mlt.mintf, mlt.mindf, mlt.maxdf, mlt.qf
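
The same MoreLikeThis parameters can also be passed on a query URL. The sketch below only builds such a URL; the host, the `text` field and the `id:1` query are placeholders, the parameter values are arbitrary, and no request is sent. `mlt.count` (number of similar documents to return) is added here for illustration and is not listed on the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_mlt_url(base_url, collection, query, fields, mintf=1, mindf=1, count=5):
    """Build a /select URL that asks the MoreLikeThis component for similar documents."""
    params = {
        "q": query,
        "mlt": "true",               # turn the MoreLikeThis component on
        "mlt.fl": ",".join(fields),  # fields used to compute similarity
        "mlt.mintf": mintf,          # minimum term frequency in the source document
        "mlt.mindf": mindf,          # minimum number of documents containing a term
        "mlt.count": count,          # similar documents to return per result
        "wt": "json",
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host, field and query; the collection name is the one used in the slides.
url = build_mlt_url("http://localhost:8983/solr", "ideal-cs5604f16-fake", "id:1", ["text"])
print(url)
print(parse_qs(urlsplit(url).query)["mlt.fl"])
```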

Implementation: Recommendation

• RequestHandler

Link: https://drive.google.com/open?id=0B2iasHDgHqGyYUk0R3RkVktkM2M

Implementation: Recommendation

• SearchComponent

Link: https://drive.google.com/open?id=0B2iasHDgHqGyU0doVEpidlh3c2c

Implementation: Custom Ranking

• Purpose
  • Customize and optimize the ranked results
• How does it work?
  • SearchComponent
    • prepare(): pre-processing, invoked before the query is executed
    • process(): post-processing, invoked after all the results are fetched
  • Custom scoring
  • Re-ranking

\[
\mathrm{Score} = \mathrm{Doc}_{\mathrm{score},\,\mathrm{Solr}} + \mathrm{Doc}_{\mathrm{importance}} + W_{\mathrm{topic}} \times \mathrm{Doc}_{\mathrm{score},\,\mathrm{topic}} + W_{\mathrm{cluster}} \times \mathrm{Doc}_{\mathrm{score},\,\mathrm{cluster}}
\]
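
A minimal Python sketch of the re-ranking step, assuming the score on this slide is the Solr score plus a document importance value plus weighted topic and cluster scores. The field names, weights and sample values are hypothetical; the team's actual component is a Java SearchComponent packaged as a jar, not this code.

```python
def combined_score(doc, w_topic, w_cluster):
    """Solr score + importance + weighted topic score + weighted cluster score."""
    return (
        doc["solr_score"]
        + doc["importance"]
        + w_topic * doc["topic_score"]
        + w_cluster * doc["cluster_score"]
    )


def rerank(docs, w_topic=0.5, w_cluster=0.5):
    """Return the documents sorted by the combined score, highest first."""
    return sorted(docs, key=lambda d: combined_score(d, w_topic, w_cluster), reverse=True)


# Made-up result documents for illustration.
docs = [
    {"id": "a", "solr_score": 2.0, "importance": 0.1, "topic_score": 0.2, "cluster_score": 0.1},
    {"id": "b", "solr_score": 1.5, "importance": 0.3, "topic_score": 0.9, "cluster_score": 0.8},
]
print([d["id"] for d in rerank(docs)])
```

With both weights set to zero the order falls back to Solr score plus importance, which is a convenient way to compare the custom ranking against the baseline.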

Implementation: Custom Ranking

• Build and copy the jar file into the Hadoop Cluster
• Modify solrconfig.xml
• Update the instanceDir
• Reload the collection
• Check the results in Solr Admin UI

Implementation: Solr Admin UI

• Dashboard: provides basic functions for users to choose (e.g., Logging, to check Solr logs for debugging)
• Core Selector: selects the core (dataset) for queries
  • Choose ideal-cs5604f16-fake for querying
• Solr instance information: current versions, JVM information

Implementation: Solr Admin UI

• The request handler: /select
• The query: q
• Parameters for the query: fq (filter queries), sort (descending or ascending)
• Execute query
• Results output: JSON format
• Screenshot labels: field name, result statistics
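
The query form in the Admin UI corresponds to an HTTP request against the /select handler. The following sketch builds such a URL with the q, fq and sort parameters named on the slide; the host and the example filter and sort values are placeholders, and nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, core, q, fq=None, sort=None):
    """Build a /select URL; fq and sort are added only when given."""
    params = [("q", q), ("wt", "json")]
    if fq:
        params.append(("fq", fq))      # filter query: restricts the result set
    if sort:
        params.append(("sort", sort))  # e.g. "<field> desc" or "<field> asc"
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical host and parameter values; core and field names appear in the slides.
url = build_select_url(
    "http://localhost:8983/solr",
    "ideal-cs5604f16-fake",
    "*:*",
    fq="t_month_i:12",
    sort="t_month_i desc",
)
print(url)
print(parse_qs(urlsplit(url).query))
```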

Implementation: Solr Admin UI

• The faceted search query: range
• Faceted search field: t_month_i
• Parameters: true when enabled
• Search results: counts
• Search results: details
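
A range facet on t_month_i can be requested with the standard facet.range parameters. The sketch below assembles them and shows one way to read the counts back, assuming the default JSON layout in which counts arrive as a flat list of alternating bucket values and counts. The start, end and gap values and the sample counts are made up for illustration.

```python
def range_facet_params(field, start, end, gap):
    """Parameters for a range facet over a numeric field, returning counts only."""
    return {
        "q": "*:*",
        "rows": 0,                   # only the facet counts are wanted
        "facet": "true",             # faceting is active only when set to true
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,
        "facet.range.gap": gap,
        "wt": "json",
    }


def parse_range_counts(counts):
    """Turn a flat [value, count, value, count, ...] list into a dictionary."""
    return dict(zip(counts[0::2], counts[1::2]))


params = range_facet_params("t_month_i", 1, 13, 1)
sample_counts = ["1", 10, "2", 5]  # made-up values, not results from the collection
print(params["facet.range"], parse_range_counts(sample_counts))
```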

Problems Faced

• Cloudera and OS
  • Virtual Cloudera seems slow and often crashes because of limited memory
  • Not familiar with the whole architecture at the beginning
  • Versions of Cloudera and Solr
• Data
  • Consistency check
  • Not enough real data available to perform tests
  • Not much information available regarding logs to perform collaborative filtering
• Collaboration
  • Communication and modification

Lessons Learned

• Solr
• HBase
• HDFS
• Patience
• Carefulness
• Team collaboration

Future Work

• Search
  • Customize more request handlers
  • Deal with the profanity issue
• Custom Ranking
  • Customize more search components
• Recommendation
  • Create a custom recommendation component (Probabilities - CTA team)
  • Implement the collaborative filtering (Log files - FE team)
• Solr
  • Figure out SolrCloud, multiple Solr nodes in Cloudera Search

Acknowledgement

• Projects
  • NSF IIS-1319578 III: Small: Integrated Digital Event Archiving and Library (IDEAL)
  • NSF IIS-1619028 III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR)
• Teams
  • CMT, CMW, CLA, CTA, FE teams
• Persons
  • Instructor: Dr. Edward A. Fox
  • GRA: Sunshin Lee

Thank you!
Questions?