Develop open source search engine

DEVELOP OPEN SOURCE SEARCH ENGINE Ritesh Ambastha – CEO, iWillStudy.com 26th Feb 2012

Transcript of Develop open source search engine

Page 1: Develop open source search engine

DEVELOP OPEN SOURCE SEARCH ENGINE
Ritesh Ambastha – CEO, iWillStudy.com
26th Feb 2012

Page 2: Develop open source search engine

Open Source Search Engines

Sphinx Lucene DataparkSearch

Zettair YaCy Xapian

SWISH-E Seeks Recoll

OpenFTS Nutch Namazu

Page 3: Develop open source search engine

Platform Ideas!

Credits: http://zooie.wordpress.com

Page 4: Develop open source search engine

Comparison

Credits: http://zooie.wordpress.com

Page 5: Develop open source search engine

Comparison

Credits: http://zooie.wordpress.com

Page 6: Develop open source search engine

We are going to talk about

Sphinx & Apache-Solr

Page 7: Develop open source search engine

Sphinx

Sphinx is an open source full text search server.

It's written in C++ and works on Linux (RedHat, Ubuntu, etc.), Windows, MacOS, Solaris, FreeBSD, and a few other systems.

Sphinx lets you batch index and search data stored in an SQL database, NoSQL storage, or plain files, quickly and easily.

Page 8: Develop open source search engine

Sphinx

Text processing features

Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler
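
To give a feel for how short a SphinxQL query is, here is a minimal sketch in Python. It only builds the query text and does not talk to a server. The index name `articles` and the helper `build_sphinxql` are hypothetical, made up for this illustration. The sketch assumes the query would be sent through a MySQL client library that substitutes the `%s` placeholder on the client side.

```python
def build_sphinxql(index, limit=10):
    """Build a SphinxQL full-text query with a placeholder for the search text.

    `index` is a hypothetical index name; the search text itself is passed
    separately so the client library can escape it.
    """
    if not index.isidentifier():
        raise ValueError("index name must be a plain identifier")
    return f"SELECT id FROM {index} WHERE MATCH(%s) LIMIT {int(limit)}"


# Hypothetical usage: 'articles' is an assumed index name.
sql = build_sphinxql("articles", limit=5)
print(sql)
```

The resulting string would then be executed with the search text as its single parameter over a connection to the SphinxQL listener of `searchd`.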

Sphinx clusters scale up to billions of documents and tens of millions of search queries per day, powering top websites such as Craigslist, DailyMotion, NetLog, etc.

Page 9: Develop open source search engine

Performance and scalability

Indexing performance: Sphinx indexes up to 10-15 MB of text per second per single CPU core.

Searching performance: Searching through the 1,000,000-document, 1.2 GB text collection that the Sphinx developers use for everyday development and testing runs at 500+ queries/sec on a 2-core desktop machine with 2 GB of RAM.

Scalability: The biggest known Sphinx cluster indexes almost 5 billion documents, resulting in over 6 TB of data.

The busiest known one is, unsurprisingly, Craigslist, a top-10 website in the US that serves 50+ million search queries/day.
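
The figures above can be put on a common scale with some back-of-the-envelope arithmetic. This sketch assumes queries are spread evenly over the day (real traffic has peaks), that 1.2 GB means 1200 MB, and that indexing runs on a single core at the quoted 10-15 MB/s. The function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(queries_per_day):
    """Average queries per second, assuming a perfectly even load."""
    return queries_per_day / SECONDS_PER_DAY


def indexing_seconds(megabytes, mb_per_second):
    """Time to index a collection on one core at a constant rate."""
    return megabytes / mb_per_second


craigslist_qps = average_qps(50_000_000)   # 50+ million queries/day
slow = indexing_seconds(1200, 10)          # 1.2 GB at 10 MB/s
fast = indexing_seconds(1200, 15)          # 1.2 GB at 15 MB/s
print(round(craigslist_qps), slow, fast)   # prints: 579 120.0 80.0
```

So 50 million queries/day averages out to roughly 580 queries/sec, and the 1.2 GB test collection would take about 80 to 120 seconds to index on one core under these assumptions.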

Page 10: Develop open source search engine

Key Features

Batch and Real-Time full-text indexes
Non-text attributes support
SQL database indexing
Non-SQL storage indexing
Easy application integration
Advanced full-text searching syntax
Rich database-like querying features
Better relevance ranking
Flexible text processing
Distributed searching

Page 11: Develop open source search engine

http://lucene.apache.org/solr/

Page 12: Develop open source search engine

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project.

Page 13: Develop open source search engine

Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.
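
As a rough illustration of how a few of these features are switched on, here is a sketch that builds a Solr search URL with faceting and hit highlighting enabled. `q`, `wt`, `facet`, `facet.field`, `hl` and `hl.fl` are standard Solr query parameters. The base URL and the field names `category` and `body` are assumptions for this example and depend on the actual installation and schema.

```python
from urllib.parse import urlencode


def solr_select_url(base, text, facet_field, highlight_field):
    """Build a Solr select URL asking for JSON, facets and highlighting."""
    params = [
        ("q", text),                   # the full-text query
        ("wt", "json"),                # response format
        ("facet", "true"),             # enable faceted search
        ("facet.field", facet_field),  # field to count facet values on
        ("hl", "true"),                # enable hit highlighting
        ("hl.fl", highlight_field),    # field to highlight matches in
    ]
    return base + "?" + urlencode(params)


# Hypothetical host, path and field names.
url = solr_select_url(
    "http://localhost:8983/solr/select",
    "open source search",
    "category",
    "body",
)
print(url)
```

Fetching this URL with any HTTP client would return the matching documents together with facet counts and highlighted snippets, provided the named fields exist in the schema.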

Page 14: Develop open source search engine

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat.

Page 15: Develop open source search engine

Solr Features

Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces - XML, JSON and HTTP
Comprehensive HTML Administration Interfaces
Server statistics exposed over JMX for monitoring
Scalability - Efficient Replication to other Solr Search Servers
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture
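
Because the interfaces are plain JSON over HTTP, a client needs very little code to read a result. The sketch below parses a hand-written sample shaped like a Solr JSON response (`responseHeader`, then `response` with `numFound` and `docs`). The documents and their `id` and `title` fields are made up for illustration and are not output from a real server.

```python
import json

# Hand-written sample in the shape of a Solr JSON response (illustrative data).
sample = """
{"responseHeader": {"status": 0, "QTime": 3},
 "response": {"numFound": 2, "start": 0, "docs": [
   {"id": "1", "title": "Sphinx"},
   {"id": "2", "title": "Solr"}]}}
"""


def summarize(raw):
    """Return the total hit count and the ids of the returned documents."""
    body = json.loads(raw)
    result = body["response"]
    return result["numFound"], [doc["id"] for doc in result["docs"]]


total, ids = summarize(sample)
print(total, ids)
```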

Page 16: Develop open source search engine

What is it all about?

Page 17: Develop open source search engine
Page 18: Develop open source search engine

Solr is based on Lucene

Page 19: Develop open source search engine

More about Lucene