Pylucene- Part III: Hightlighting in Search

The search result can be customized to highlight the phrases that contain the requested keyword. The following code uses “Highlighter” class from Pylucene. We emit result in HTML formatted syntax.

from lucene import \
            QueryParser, IndexSearcher, IndexReader, StandardAnalyzer, \
        TermPositionVector, SimpleFSDirectory, File, SimpleSpanFragmenter, Highlighter, \
    QueryScorer, StringReader, SimpleHTMLFormatter, \
            VERSION, initVM, Version
import sys

FIELD_CONTENTS = "contents"
FIELD_PATH = "path"
#QUERY_STRING = "lucene and restored"
QUERY_STRING = sys.argv[1]
STORE_DIR = "/home/kanaujia/lucene_index"

if __name__ == '__main__':
    initVM()
    print 'lucene', VERSION

    # Get handle to index directory
    directory = SimpleFSDirectory(File(STORE_DIR))

    # Creates a searcher searching the provided index.
    ireader  = IndexReader.open(directory, True)

    # Implements search over a single IndexReader.
    # Use a single instance and use it across queries
    # to improve performance.
    searcher = IndexSearcher(ireader)

    # Get the analyzer
    analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)

    # Constructs a query parser.
    queryParser = QueryParser(Version.LUCENE_CURRENT, FIELD_CONTENTS, analyzer)

    # Create a query
    query = queryParser.parse(QUERY_STRING)

    topDocs = searcher.search(query, 50)

    # Get top hits
    scoreDocs = topDocs.scoreDocs
    print "%s total matching documents." % len(scoreDocs)

    HighlightFormatter = SimpleHTMLFormatter();
    query_score = QueryScorer (query)

    highlighter = Highlighter(HighlightFormatter, query_score)

    # Set the fragment size. We break text in to fragment of 64 characters
    fragmenter  = SimpleSpanFragmenter(query_score, 64);
    highlighter.setTextFragmenter(fragmenter); 

    for scoreDoc in scoreDocs:
        doc = searcher.doc(scoreDoc.doc)
    text = doc.get(FIELD_CONTENTS)
    ts = analyzer.tokenStream(FIELD_CONTENTS, StringReader(text))
        print doc.get(FIELD_PATH)
        print highlighter.getBestFragments(ts, text, 3, "...")
    print ""

The code is an extension of search code discussed in Part-II.

We create a HTML formatter with SimpleHTMLFormatter. We create a QueryScorer to iterate over resulting documents in non-decreasing doc ID.

HighlightFormatter = SimpleHTMLFormatter();
query_score = QueryScorer (query)

highlighter = Highlighter(HighlightFormatter, query_score)

We break the text content into 64 bytes character set.
fragmenter  = SimpleSpanFragmenter(query_score, 64);
highlighter.setTextFragmenter(fragmenter);

for scoreDoc in scoreDocs:
doc = searcher.doc(scoreDoc.doc)
text = doc.get(FIELD_CONTENTS)
ts = analyzer.tokenStream(FIELD_CONTENTS, StringReader(text))
print doc.get(FIELD_PATH)

Now we set number of lines for phrases in a document.

print highlighter.getBestFragments(ts, text, 3, “…”)

Results

kanaujia@ubuntu:~/work/Py/pylucy2/pylucy$ python searcher_highlight.py hello
lucene 3.6.1
50 total matching documents.
/home/kanaujia/Dropbox/PyConIndia/fsMgr/root/hello
hi hello

/home/kanaujia/Dropbox/PyConIndia/fsMgr.v4/root/hello
hi hello

/home/kanaujia/Dropbox/.dropbox.cache/2012-09-27/hello (deleted 505bda23-9-e8756a51)
hi hello

/home/kanaujia/Dropbox/PyConIndia/fsMgr.v1/root/hello.html

Hello htmls

{% module Hello() %}
Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s