Search Engine with Lucene – Summary – Clean Dump XML





On GitHub: macilry / slide-Search-Engine

Search Engine with Lucene

Summary

  • Clean Dump XML
  • Create index with Lucene
  • Request on index
  • Display result

Clean Dump XML

Extract title and entities

Tools used: SAX (an XML parser for Java) and regular expressions

To avoid overloading memory:

Write to the output file at each page node, instead of keeping the whole dump in memory

Write through a Java output stream
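
A minimal sketch of this streaming clean-up, using only the SAX parser that ships with the JDK. Several things here are assumptions and not taken from the slides: the element names `page`, `title` and `text`, the idea that an "entity" is the target of a wiki-style `[[link]]`, the class name `DumpCleaner`, and the tab-separated output line (the original tool writes a cleaned XML file).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps only one page in memory at a time.
class DumpCleaner extends DefaultHandler {
    // Assumed entity pattern: the target of [[link]] or [[link|label]].
    private static final Pattern ENTITY = Pattern.compile("\\[\\[([^\\]|#]+)");

    private final Writer out;
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> entities = new ArrayList<>();
    private String title = "";

    DumpCleaner(Writer out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Start collecting character data for the element that just opened.
        buffer.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        switch (qName) {
            case "title": {
                title = buffer.toString().trim();
                break;
            }
            case "text": {
                Matcher m = ENTITY.matcher(buffer);
                while (m.find()) {
                    entities.add(m.group(1).trim());
                }
                break;
            }
            case "page": {
                // Write the page as soon as it closes, then forget it.
                try {
                    out.write(title + "\t" + String.join("|", entities) + "\n");
                } catch (IOException e) {
                    throw new SAXException(e);
                }
                title = "";
                entities.clear();
                break;
            }
            default:
                break;
        }
    }

    // Parses the dump as a stream and writes one line per page to out.
    static void clean(InputStream dump, Writer out) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(dump, new DumpCleaner(out));
        out.flush();
    }
}
```

In practice `clean` would be called with a `FileInputStream` on the dump and a `BufferedWriter` on the output file, so that neither side is loaded into memory in full.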

Create index with Lucene

  • Read the cleaned XML file
  • For each page, create a document (Lucene's Document class)
  • Content of this document: the concatenated entities, the title and the id
  • Increase the boost of the title field by 1

Another point:

Specify the French analyzer in the Lucene configuration so that accented characters are handled correctly

Request on index

  • Search on multiple fields
  • Use the French analyzer to parse the request
  • The request returns the first 20 results
  • Create an object for each result and sort its entities by number of occurrences
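
The last step can be sketched in plain Java, without Lucene. This is only an illustration: the class name `EntityCounter`, the method name, and the assumption that each result carries its entities as a list of strings with repeats are not taken from the slides.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts the entities of one result and orders them.
class EntityCounter {
    // Returns (entity, count) pairs, most frequent first.
    // Ties keep the order in which the entities first appeared.
    static List<Map.Entry<String, Integer>> sortByOccurrences(List<String> entities) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String entity : entities) {
            counts.merge(entity, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equal counts stay in insertion order.
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return sorted;
    }
}
```

The counts produced this way are a natural input for a tag cloud, where each entity needs a weight.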

Display result

Technologies Used

  • Java Enterprise Edition (Servlet, JSP, JSTL)
  • Apache Tomcat
  • jQCloud library to generate the tag cloud in JavaScript
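
jQCloud takes its words as an array of objects with a `text` and a `weight` property. A possible way to prepare that array on the Java side is sketched below; the class name `TagCloudJson` and the choice to build the JSON by hand are assumptions, and the slides do not say how the data reaches the page.

```java
import java.util.Map;

// Hypothetical helper: turns entity counts into the word array jQCloud expects.
class TagCloudJson {
    // Produces e.g. [{"text":"France","weight":3}] in the map's iteration order.
    static String toJson(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder("[");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append("{\"text\":\"").append(escape(e.getKey()))
              .append("\",\"weight\":").append(e.getValue()).append('}');
        }
        return sb.append(']').toString();
    }

    // Escapes backslashes and double quotes only; control characters
    // are not handled in this sketch.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

The resulting string can be written into the JSP and passed to the jQCloud call in the page's JavaScript.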