Lucene Snippets: Index Stats

Lucene 4.x provides an API for fetching index statistics for individual document fields.

The following example shows how to create an index with some sample documents and then fetch statistics for one of its fields.

Lucene Dependencies

Only one dependency is needed here: lucene-core. I have added the declarations for Maven and SBT below. If you are using Gradle or Buildr, you shouldn't have a problem creating your build file either.

Maven

Add the following dependency and repository to your Maven project's pom.xml:

<project>
    [..]
    <properties>
        <lucene.version>4.0-SNAPSHOT</lucene.version>
    </properties>
    <dependencies>
        <dependency>
            <artifactId>lucene-core</artifactId>
            <groupId>org.apache.lucene</groupId>
            <version>${lucene.version}</version>
        </dependency>
    </dependencies>
    <repositories>
        <repository>
            <id>lucene-repository</id>
            <name>Lucene Maven</name>
            <url>https://repository.apache.org/snapshots/</url>
            <snapshots>
                <enabled>true</enabled>
                <updatePolicy>always</updatePolicy>
            </snapshots>
        </repository>
    </repositories>
</project>

Simple Build Tool / SBT

To use Lucene with SBT, add the following lines to your build.sbt:

libraryDependencies += "org.apache.lucene" % "lucene-core" % "4.0.0-BETA"

resolvers += "apache-snapshots-repo" at "https://repository.apache.org/snapshots/"

Indexing and Analyzing

First we add 2000 documents to our index: 1999 documents with a field "id" and a field "title", plus one document that has only an "id".

Afterwards we use CollectionStatistics to fetch the statistics for a specific index field.

The CollectionStatistics API allows us to receive the following information:

docCount: returns the total number of documents that have at least one term for this field.
maxDoc: returns the total number of documents, regardless of whether they all contain values for this field.
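sumDocFreq: returns the total number of postings for this field.
sumTotalTermFreq: returns the total number of tokens for this field, or -1 if this value is not available, for example because the field does not index term frequencies (as is the case for the StringField used below).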
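To make the difference between these four values more tangible, here is a minimal sketch in plain Java that computes them for a tiny hand-made corpus. It does not use Lucene at all: the class FieldStatsSketch, its Stats holder and the compute method are hypothetical names made up for this illustration, and the sketch assumes an already tokenized field and ignores deleted documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; this is NOT how Lucene computes its statistics internally.
public class FieldStatsSketch {

    /** Plain holder whose field names mirror the CollectionStatistics methods. */
    public static final class Stats {
        public final long maxDoc;
        public final long docCount;
        public final long sumDocFreq;
        public final long sumTotalTermFreq;

        Stats(long maxDoc, long docCount, long sumDocFreq, long sumTotalTermFreq) {
            this.maxDoc = maxDoc;
            this.docCount = docCount;
            this.sumDocFreq = sumDocFreq;
            this.sumTotalTermFreq = sumTotalTermFreq;
        }
    }

    /**
     * @param fieldTokens one entry per document: the tokens of the field,
     *                    or null if the document has no value for the field
     */
    public static Stats compute(List<List<String>> fieldTokens) {
        long docCount = 0;
        long sumDocFreq = 0;
        long sumTotalTermFreq = 0;
        for (List<String> tokens : fieldTokens) {
            if (tokens == null || tokens.isEmpty()) {
                continue; // counts towards maxDoc only
            }
            docCount++;
            // every distinct term of a document contributes exactly one posting
            sumDocFreq += new HashSet<String>(tokens).size();
            // every token occurrence counts, repeated terms included
            sumTotalTermFreq += tokens.size();
        }
        return new Stats(fieldTokens.size(), docCount, sumDocFreq, sumTotalTermFreq);
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<List<String>>();
        docs.add(Arrays.asList("big", "book", "big")); // 2 distinct terms, 3 tokens
        docs.add(Arrays.asList("book"));               // 1 distinct term, 1 token
        docs.add(null);                                // document without the field

        Stats s = compute(docs);
        System.out.println("maxDoc=" + s.maxDoc + " docCount=" + s.docCount
                + " sumDocFreq=" + s.sumDocFreq + " sumTotalTermFreq=" + s.sumTotalTermFreq);
        // prints: maxDoc=3 docCount=2 sumDocFreq=3 sumTotalTermFreq=4
    }
}
```

The repeated term "big" is what separates the last two values: it adds a single posting but two tokens. When every document holds exactly one unique term for the field, docCount and sumDocFreq end up equal.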

package com.hascode.tutorial;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexStatsExample {
    private static final String INDEX = "target/index_stats";

    public static void main(final String[] args) throws IOException {
        createRandomData();
        // FSDirectory.open chooses a directory implementation suited to the platform
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(INDEX)));
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            CollectionStatistics stats = searcher.collectionStatistics("title");
            System.out.println("Statistics for the field 'title':");
            System.out.println("Number of documents with a term for the field: "
                    + stats.docCount());
            System.out.println("Total number of documents: " + stats.maxDoc());
            System.out.println("Total number of postings for the field: "
                    + stats.sumDocFreq());
            System.out.println("Total number of tokens for the field: "
                    + stats.sumTotalTermFreq());
        } finally {
            // release the index files once the statistics have been printed
            reader.close();
        }
    }

    public static void createRandomData() throws IOException {
        Directory dir = FSDirectory.open(new File(INDEX));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        // CREATE overwrites any index left over from a previous run
        iwc.setOpenMode(OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(dir, iwc);

        // 1999 documents that carry both an "id" and a "title"
        for (long i = 1; i < 2000; i++) {
            Document doc = new Document();
            doc.add(new LongField("id", i, Store.YES));
            doc.add(new StringField("title", "The big book of boredom - Part " + i,
                    Store.YES));
            writer.addDocument(doc);
        }
        // one extra document without a "title", so that docCount and maxDoc differ
        Document doc = new Document();
        doc.add(new LongField("id", 9999L, Store.YES));
        writer.addDocument(doc);
        writer.close();
    }
}

Running the program should produce output similar to the following (I've used SBT here; alternatively, use your IDE of choice or Maven with mvn exec:java -Dexec.mainClass=com.hascode.tutorial.IndexStatsExample):

> run-main com.hascode.tutorial.IndexStatsExample
[info] Running com.hascode.tutorial.IndexStatsExample
Statistics for the field 'title':
Number of documents with a term for the field: 1999
Total number of documents: 2000
Total number of postings for the field: 1999
Total number of tokens for the field: -1
[success] Total time: 1 s, completed Sep 7, 2012 8:45:13 PM

Tutorial Sources

Please feel free to view and download the complete sources for this tutorial from my GitHub repository, or, if you have Git installed, clone it with:

git clone https://github.com/hascode/lucene-4-tutorial.git
