Author Archives: gaurav2forever

Following a simple “create Index” through Apache Lucene internals

Introduction: I was trying to figure out how Indexing works in Lucene last weekend, i will try to share my findings with you, as I run a simple demo “create index” program provided with source code. To start with I downloaded the source code from Apache mirror and build it into eclipse for easier debugging. Apache Lucene Version used is 4.10.2 for testing, the testing done is just to explore various features and apply possible optimizations.

Lucene-Indexing: Apache Lucene is an open source project which offers solution for searching. It is based on inverted-index mechanism which is explained in next sections. This indexing process is very efficient as compared to the previously used searching mechanisms like grep, b-tree indexes etc.  For the sake of understanding Lucene indexing process and its components let’s suppose we have two text files which we want to index.

Doc 1 Doc 2
Search is awesome with lucene. The Web needs efficient search.

Sample code to create an index: Apache Lucene comes with a demo example to create an index, here is the highlighted snippets from the source code to see the major components used in creating lucene index. This code can be found in this package in Lucene core:

package org.apache.lucene.demo;

Directory dir = FSDirectory.open(new File(indexPath)); ----------------------------------(0)
Analyzer analyzer = new StandardAnalyzer(); ---------------------------------------------(1)
IndexWriterConfig iwc = new IndexWriterConfig() -----------------------------------------(2)
IndexWriter writer = new IndexWriter(dir, iwc);  ----------------------------------------(3)
Document doc = new Document(); ----------------------------------------------------------(4)
doc.add(new TextField("contents", new BufferedReader(new
InputStreamReader(fis, StandardCharsets.UTF_8))));
writer.addDocument(doc); ----------------------------------------------------------------(5) 

The big picture: The diagram below is used to visualize the overall components used while creating index. Each component in the picture is linked with the code according to its usage

lucene_component
Lucene Components in detail:

  • Lucene documents: is basically a unit of indexing and searching. Lucene documents are nothing but a set of fields and values, for example in our example field “path”, “contents”, “modified” have values path to file on directory, content of the actual document, and last modified date respectively. It’s the responsibility of application to parse the actual files into Lucene documents or there is something called Lucene document handler interface which we can use to parse documents like pdf, HTML, txt files etc.
  • Directory: Lucene uses this class for opening/creating or updating the directory structure for storing indexes. There are various directory types you can use SimpleFSDirectory OR NIOFSDirectory OR MMapDirectory.
  • Analyzers: So once you have index directory you can now pass Analyzers to the IndexWriterConfig which keeps all necessary configurations to store/open/modify indexes. Analyzers plays a important role in creating index by Tokenizing and Filtering the stream. I will talk about both of them in little detail in next section but for the time assume that it’s the Analyzers who Tokenize your data and removes any unwanted/meaningless words from it. The interesting thing here is that Lucene generates a token stream internally to apply filtering mechanism.

Tokenized document 1:    

“The”
“web”
“needs”
“efficient”
“search”

    Tokenized document 2:  

“Search”
  “is”
 “awesome”
“with”
“lucene”
  • Tokenizers: Tokenizers have a straightforward job of tokenizing the Lucene documents, in the docs it says “Construct a token stream processing the given input.” You can choose from a list of various filters provided by Lucene like whitespace based, porter stemmer based and many others. So in our example if we pass our data through Tokenizer this is what it would look like.
  • Filters: As the name suggest removes unnecessary words from the TokenStream based upon which filter you choose to use. In our case we use a simple stopFilter which will remove the stop words like “the”, “an” etc. See below the list of stopwords which comes with LuceneStopWordsList

After applying this StopFilter to our documents:

“Search”  “the”
  “is” “needs”
 “awesome” “efficient”
“with” “search”
“lucene” web”

Creation of term dictionaries and posting lists: After getting the list of important terms as Tokens from the documents, Lucene calculates the inverted index by calculating the term frequencies and posting lists. Tokens also contains a lot of extra information like Start offset, End Offset, PayLoad etc. A term frequency is nothing but a alphabetically sorted list of terms, their frequency of occurrence combined with the id of documents it occurred in. This is all a part of basic algorithm which creates an inverted index in Lucene.

Term Term-frequency docId  (POSTING-LIST)
awesome 1 1
efficient 1 2
lucene 1 1
needs 1 2
search 2 1,2
Web 1 2

Incremental indexing algorithm in Lucene: One particular awesome thing I found with Lucene was its incremental indexing algorithm which basically create segments from indexes. Segments are basically sets of indexes, merged into one, based up on the merge factor taken from MergePolicy.  MergeScheduler does all this for us. Basically when addIndexes() method of IndwxWriter is called it merges segments into one.  Suppose we create index for 10 documents and we keep a merge factor of 3 than this is the way this algorithm will work

  • A stack will be created to store indexes
  • Each index will be stored one by one incrementally.
  • Once the total number of indexes becomes three, index 1, 2, 3 will be merged into one.
  • These steps are followed continuously.

The diagram(Taken from one of lectures from Doug Cutting ) below shows the algo.

incrementalIndexing

There are so many thing i have not mentioned in the post, but this information should be good enough for anyone to understand how Lucene indexing process works. I am planning to see how “scoring” and “search” works in Lucene in next post

Refs:

I might be wrong somewhere in this post please correct me if you find something wrong here.
Thank for reading !!!