GOAT SEARCH Revorg GOAT Search Solution (Powered by Lucene)

About Me
  Grover Fields
  Revorg, LLC (Owner)
  M.S. Information System (Troy University)
  B.S. Industrial Engineering (Florida A&M University)
  Stanford Project Management Courses

About Me (cont.)
  10+ years of development, analysis, and implementation
  10+ years of ColdFusion experience
  2+ years of Java experience
  Commonspot, Strongmail, ClickFix (Developer)
  Email: [email protected]
  Web site: http://www.groverfields.com

Agenda
  What?
    What can we do with GOAT?
  Why?
    Why do we want to use GOAT and not Verity?
  How?
    How do we do that?
  Conclusion and alternative solutions

What? What is a Search Engine?
  Builds an index on text
  Answers queries using that index, a la Verity
  We already have a database, so what does a search engine offer?
    Scalability
    Relevance ranking
    Tweaking
    Integration of different sources (email, web pages, files, databases)

What is a search engine? (cont.)
  Works on words, not on substrings
    auto != automatic, automobile
  Indexing process:
    Convert document
    Extract text and metadata
    Normalize text
    Write (inverted) index
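
The indexing steps above can be sketched in a few lines. This is a toy illustration in Python, not Lucene's API: the names `normalize` and `build_index` and the three sample documents are assumptions made up for the example. It shows why a word-based index treats "auto" differently from a substring match.

```python
import re

def normalize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Write a tiny inverted index: term -> sorted list of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(normalize(text)):
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical documents, already converted to plain text.
docs = {
    1: "Auto parts for sale",
    2: "Automatic transmission repair",
    3: "Automobile insurance quotes",
}
index = build_index(docs)

# Word lookup: only the document that contains the word "auto".
word_hits = index.get("auto", [])

# Substring match: every document whose text contains the letters "auto".
substring_hits = [doc_id for doc_id, text in docs.items() if "auto" in text.lower()]
```

Here `word_hits` holds only document 1, while `substring_hits` holds all three documents, which is the "auto != automatic, automobile" point from the slide.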

Apache Lucene Overview
  Lucene Java 2.4
  A high-performance, full-featured text search engine library written entirely in Java.
  A technology suitable for nearly any application that requires full-text search, especially cross-platform.
  No GUI
  http://lucene.apache.org

Apache Lucene Overview (cont.)
  Java library for indexing and searching
  No dependencies
  Works with Java 1.4 or later
  Input for indexing: Document objects
    Each document: a set of Fields, each with a field name and field content
  Stores its index as files on disk or in memory
  No document converters
  No web crawler

Lucene Java users
  HBCU.info
  LinkedIn
  IBM OmniFind Yahoo! Edition
  Technorati.com
  Eclipse
  Monster.com
  …

Lucene Java Summary
  Java library for indexing and searching
  Lightweight / no dependencies
  Powerful, fast, and tested!
  No document conversion
  No GUI

Why?
  Cost of Enterprise Search Solution
  Need for search speed
  Java projects to work on
  Things to do

Verity Limitations
  10,000 documents for ColdFusion Developer Edition
  125,000 documents for ColdFusion Standard Edition
  250,000 documents for ColdFusion Enterprise Edition
  What do developers do in a shared hosting environment?
  Is it possible for the hosting company to limit the number of documents per Web site?

T-SQL Limitations?
  Search for “Yahoo” on my blog:

    SELECT entry.id
    FROM tbl_mango_entry AS entry
    INNER JOIN tbl_mango_post AS post ON entry.id = post.id
    WHERE entry.blog_id = 'default'
      AND (entry.title LIKE '%yahoo%'
           OR entry.content LIKE '%yahoo%'
           OR entry.excerpt LIKE '%yahoo%')
      AND post.posted_on <= getdate()
      AND entry.status = 'published'
    ORDER BY post.posted_on DESC

  Multiply that by 10, 100, 500, or 1,000 users per hour?

T-SQL Limitations? (cont.)
  Full table scan = #1 PERFORMANCE KILLER!!!
  No search sorting
  An RDBMS isn’t designed to do this, but it allows it
  Use the right tools!
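
To see the full scan for yourself, here is a small sketch. It uses Python's built-in SQLite as a stand-in, which is an assumption for the demo: the slides target SQL Server, whose execution plans look different, and the `entry` table below is a cut-down, hypothetical version of the blog table. The idea carries over, though: a `LIKE` pattern with a leading wildcard cannot seek into an index, while an exact match can.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
conn.execute("CREATE INDEX idx_entry_title ON entry (title)")
conn.executemany(
    "INSERT INTO entry (title, content) VALUES (?, ?)",
    [("Yahoo search tips", "Some tips"), ("Lucene basics", "Mentions Yahoo once")],
)

def plan(sql):
    """Return the query-plan detail text SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)

# Leading wildcard: the planner has to walk every row (or every index entry).
wildcard_plan = plan("SELECT id FROM entry WHERE title LIKE '%yahoo%'")

# Exact match: the planner can seek straight into the index.
exact_plan = plan("SELECT id FROM entry WHERE title = 'Lucene basics'")
```

The wildcard plan is reported as a SCAN and the exact-match plan as a SEARCH; the exact wording of the plan text varies between SQLite versions.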

How? GOAT Search Solution
  Lucene 2.4.0
  ColdFusion MX 8
    MX is fine but GUI needs to be rolled back
  Commons IO 1.4
  Simple package of .jar files
  Simple Web-based GUI

How? Macromedia JDBC Drivers
  Same drivers that ColdFusion uses
  No additional drivers to install
  Supports RDBMS ONLY
    MSSQL
    MySQL
    Oracle
  No file system support (yet)

Basics?
  Indexing extracts both meaning and structure from unstructured information by indexing each document.
  The index contains a complete list of all the words used in a given document, along with metadata about that document.
  Lucene creates a collection that normalizes both the structured and unstructured data.
  Search requests then check these collections rather than scanning the actual documents and database fields.
  This provides a faster search of information, regardless of the file type and whether the source is structured or unstructured.

Basics? (cont.)
  Collection
    A special database created by Lucene that contains metadata that describes the documents
  Documents
    A sequence of fields
    Similar to a row in a database table
      Row 1
      Row 2, etc.
  Fields
    A named sequence of terms
    Similar to a column in a table
      Primary Key
      Column 1
  Terms
    A term is a string
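
The row/column analogy can be made concrete with a small model. This is a Python sketch of the idea only; the `Field` and `Document` classes here are hypothetical stand-ins, not Lucene's Java classes, and the sample row is invented.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Field:
    name: str      # comparable to a column name
    content: str   # the raw column value

    def terms(self):
        """A field is a named sequence of terms; each term is a string."""
        return re.findall(r"[a-z0-9]+", self.content.lower())

@dataclass
class Document:
    # A document is a sequence of fields, comparable to one table row.
    fields: list = field(default_factory=list)

def row_to_document(row):
    """Turn one database row (a dict of column -> value) into a Document."""
    return Document([Field(name, str(value)) for name, value in row.items()])

# Hypothetical row: a primary key plus two text columns.
row = {"id": 7, "title": "ColdFusion Developer", "resume": "Ten years of ColdFusion and Java"}
doc = row_to_document(row)
terms_by_field = {f.name: f.terms() for f in doc.fields}
```

Each column becomes a field, and each field breaks down into plain string terms, for example `["coldfusion", "developer"]` for the title.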

Knowledge?
  Index
    A special database created by Lucene that contains metadata that describes the documents
  Query Syntax
    Similar to Google’s advanced search: field:value
    E.g. resume: coldfusion
    http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
  Results
    Primary key list of values
    XML based on the document
    CFX tag integration
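
A rough sketch of how a `field:value` query can come back as a list of primary keys. This is not Lucene's query parser and not the GOAT implementation: `build_field_index`, `search`, and the sample rows are assumptions for illustration, and only the simplest `field:value` form is handled. Multi-word values must all match here, which is a simplification of this sketch rather than how Lucene treats them.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_field_index(rows, key="id"):
    """Index every non-key column: (field name, term) -> set of primary keys."""
    index = defaultdict(set)
    for row in rows:
        for name, value in row.items():
            if name == key:
                continue
            for term in tokenize(str(value)):
                index[(name, term)].add(row[key])
    return index

def search(index, query):
    """Answer a single field:value query with a sorted list of primary keys."""
    field_name, _, value = query.partition(":")
    field_name = field_name.strip()
    terms = tokenize(value)
    if not field_name or not terms:
        return []
    hits = [index.get((field_name, term), set()) for term in terms]
    return sorted(set.intersection(*hits))

# Hypothetical rows standing in for a database table.
rows = [
    {"id": 1, "name": "Ann", "resume": "ColdFusion and Java developer"},
    {"id": 2, "name": "Bob", "resume": "Oracle DBA"},
    {"id": 3, "name": "Cy", "resume": "Senior ColdFusion engineer"},
]
index = build_field_index(rows)
result = search(index, "resume: coldfusion")
```

`result` is the list of matching primary keys, which the calling application can then use to fetch the full rows from the database.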

Alternative Solutions for Search
  Commercial vendors:
    FAST, $100k
    Autonomy, $80k
    Google, $50k
  Commercial search engines based on Lucene
    IBM OmniFind Yahoo! Edition
  RDBMS with integrated search
    Oracle
    MySQL
    MSSQL
    PERFORMANCE KILLERS

Road Map
  A set of guidelines, instructions, or explanations: "wrote an ethics code as a road map for the behavior of elected officials."
  Overhaul Java programming (still novice)
  Integrate with other products
    Aperture
    Nutch
    Solr
  File system integration
    .txt, .pdf, .doc, .ppt, etc.
  Geospatial-based searches
    E.g. all jobs within a 50-mile radius
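
For the radius search on the road map, one common approach is a great-circle (haversine) distance filter. The sketch below is only an illustration of that idea in Python, not a planned GOAT feature design: the job records, coordinates, and function names are hypothetical, and a real index would narrow the candidates first rather than test every record.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def within_radius(jobs, lat, lon, miles=50):
    """Return the ids of jobs no farther than `miles` from the given point."""
    return [
        job["id"]
        for job in jobs
        if haversine_miles(lat, lon, job["lat"], job["lon"]) <= miles
    ]

# Hypothetical job records with approximate coordinates.
jobs = [
    {"id": 1, "lat": 33.9526, "lon": -84.5499},  # roughly Marietta, GA
    {"id": 2, "lat": 30.4383, "lon": -84.2807},  # roughly Tallahassee, FL
]
nearby = within_radius(jobs, 33.7490, -84.3880)  # search centre: roughly Atlanta, GA
```

Only the first job falls inside the 50-mile radius; the second is a few hundred miles away.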

References
  Apache.org
  Adobe.com
  Ben Forta’s Blog
  Slideshare.net
    Multiple authors
  Other references