Extending the Confluence Search Index

May 23rd, 2010 by

Developing plugins for the Confluence Wiki a developer sometimes needs to save additional metadata to a page object using Bandana or the ContentPropertyManager. Wouldn’t it be nice if this metadata was available in the built-in Lucene index?

That is were the Confluence Extractor Module comes into play..

 

Overview

An extractor allows the developer to add new fields to the lucene search index. Creating a new extractor is quite simple – just implement the interface com.atlassian.bonnie.search.Extractor or bucket.search.lucene.extractor.BaseAttachmentContentExtractor if you want to build a new file extractor.

There is a good documentation for both extractor types available at the Atlassian Wiki.

Example Application

The following demonstration plugin adds a new field “hascode” to the lucene search index, appends it to every existing page and fills the field with a string “XXX”. Maven archetypes are used for the tutorial – if you need some help on this topic – please take a look at this article.

  • Create a new Maven project using archetypes in your IDE or like by console (always use plugin type 2!):
    mvn archetype:generate -DarchetypeCatalog=http://svn.atlassian.com/svn/public/atlassian/maven-plugins/archetype-catalog
  • That’s my pom.xml
    <?xml version="1.0" encoding="UTF-8"?>
     
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    				 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    	<parent>
    		<groupId>com.atlassian.confluence.plugin.base</groupId>
    		<artifactId>confluence-plugin-base</artifactId>
    		<version>25</version>
    	</parent>
     
    	<modelVersion>4.0.0</modelVersion>
    	<groupId>com.hascode.confluence.plugin</groupId>
    	<artifactId>search-tutorial</artifactId>
    	<version>0.0.1-SNAPSHOT</version>
     
    	<name>hasCode.com - Confluence Search Tutorial</name>
    	<packaging>atlassian-plugin</packaging>
     
    	<properties>
    		<atlassian.plugin.key>com.hascode.confluence.plugin.search-tutorial</atlassian.plugin.key>
    		<!-- Confluence version -->
    		<atlassian.product.version>3.0</atlassian.product.version>
    		<!-- Confluence plugin functional test library version -->
    		<atlassian.product.test-lib.version>1.4.1</atlassian.product.test-lib.version>
    		<!-- Confluence data version -->
    		<atlassian.product.data.version>3.0</atlassian.product.data.version>
    	</properties>
     
    	<description>hasCode.com - Confluence Search Tutorial</description>
    	<url>https://www.hascode.com</url>
     
    </project>
  • Add the plugin descriptor for the extractor module to the atlassian-plugin.xml
    <atlassian-plugin key="${atlassian.plugin.key}" name="search-tutorial" pluginsVersion="2">
        <plugin-info>
            <description>hasCode.com - Confluence Search Tutorial</description>
            <version>${project.version}</version>
            <vendor name="hasCode.com" url="https://www.hascode.com"/>
        </plugin-info>
    	<extractor name="PageInformationExtractor" key="pageInformationExtractor" class="com.hascode.confluence.plugin.search_tutorial.extractor.PageExtractor" priority="1000">
    		<description>Adding some searchable fields for a page to the search index</description>
    	</extractor>
    </atlassian-plugin>
  • Creating a package com.hascode.confluence.plugin.search_tutorial.extractor and add the extractor class PageExtractor
    package com.hascode.confluence.plugin.search_tutorial.extractor;
     
    import org.apache.log4j.Logger;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
     
    import com.atlassian.bonnie.Searchable;
    import com.atlassian.bonnie.search.Extractor;
    import com.atlassian.confluence.core.ContentEntityObject;
     
    public class PageExtractor implements Extractor {
    	private final Logger logger = Logger
    			.getLogger("com.hascode.confluence.plugin.search-tutorial");
    	private static final String FIELD_NAME = "hascode";
     
    	public void addFields(Document doc, StringBuffer searchableText,
    			Searchable searchable) {
    		if (searchable instanceof ContentEntityObject) {
    			String info = "XXX";
    			searchableText.append(info).append(" ");
    			doc.add(new Field(FIELD_NAME, info, Field.Store.YES,
    					Field.Index.TOKENIZED));
    		} else {
    			logger.debug("searchable is no content entity");
    		}
     
    	}
    }
  • Build and deploy the plugin
    mvn atlassian-pdk:install
  • Open Confluence in your browser, head over to  Confluence Admin > Content Indexing and rebuild the search index
  • There is a nice feature in Confluence that allows you to take a look at the indexed entries and analyze the available fields: http://localhost:8080/admin/indexbrowser.jsp
  • *Update:* The index browser has been removed in Confluence 3.2.1 due to security issues – if you want to take a deeper look at the lucene index you should use LUKE – the Lucene Index Toolbox – downloadable as a jar at the project’s homepage.
  • In the index browser/Luke you should be able to spot the new field “hascode” with the content “XXX” – thats how it looks like in my system

  • Now it is possible to search for the new field in the global confluence search by adding hascode:XXX as parameter
  • That’s what the search result looks like.

Final thoughts:

A new field was added to the document/the index but in addition the field’s value was also added to the contentBody. This means that a search for “XXX” would show similar results. One should consider which behaviour is desired.

The following example adds the field “hascode” only to pages with a title containing the word “test” – this is the modified PageExtractor

package com.hascode.confluence.plugin.search_tutorial.extractor;
 
import org.apache.log4j.Logger;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
 
import com.atlassian.bonnie.Searchable;
import com.atlassian.bonnie.search.Extractor;
import com.atlassian.confluence.core.ContentEntityObject;
 
public class PageExtractor implements Extractor {
	private final Logger logger = Logger
			.getLogger("com.hascode.confluence.plugin.search-tutorial");
	private static final String FIELD_NAME = "hascode";
 
	public void addFields(Document doc, StringBuffer searchableText,
			Searchable searchable) {
		if (searchable instanceof ContentEntityObject) {
			ContentEntityObject ceo = (ContentEntityObject) searchable;
			if (ceo.getTitle().matches("test")) {
				String info = "XXX";
				searchableText.append(info).append(" ");
				doc.add(new Field(FIELD_NAME, info, Field.Store.YES,
						Field.Index.TOKENIZED));
			}
		} else {
			logger.debug("searchable is no content entity");
		}
 
	}
}

I have created a page named “test” in the demonstration space – this page has the field “hascode” – take a look at the new search result:

Resources

Article Updates

  • 2015-03-03: Table of contents added.

    <?xml version=”1.0″ encoding=”UTF-8″?>

    <project xmlns=”http://maven.apache.org/POM/4.0.0″
    xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance”
    xsi:schemaLocation=”http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd”>
    <parent>
    <groupId>com.atlassian.confluence.plugin.base</groupId>
    <artifactId>confluence-plugin-base</artifactId>
    <version>25</version>
    </parent>

    <modelVersion>4.0.0</modelVersion>
    <groupId>com.hascode.confluence.plugin</groupId>
    <artifactId>search-tutorial</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <name>hasCode.com – Confluence Search Tutorial</name>
    <packaging>atlassian-plugin</packaging>

    <properties>
    <atlassian.plugin.key>com.hascode.confluence.plugin.search-tutorial</atlassian.plugin.key>
    <!– Confluence version –>
    <atlassian.product.version>3.0</atlassian.product.version>
    <!– Confluence plugin functional test library version –>
    <atlassian.product.test-lib.version>1.4.1</atlassian.product.test-lib.version>
    <!– Confluence data version –>
    <atlassian.product.data.version>3.0</atlassian.product.data.version>
    </properties>

    <description>hasCode.com – Confluence Search Tutorial</description>
    <url>https://www.hascode.com</url>

    </project>

    Tags: , , , , , , , , , , , ,

    3 Responses to “Extending the Confluence Search Index”

    1. alejo Says:

      Hello,

      do you know how to extend search of confluence 3.4?

      What kind of plugin I have to do to extend the search? If a extractor, xwork….???

      I need intercept the search query to process and add extra funcionalities.

      Thanks for this tutorial and i wait for your response!

      Best regards!

    2. micha kops Says:

      The question is what you’re planning to do? If you want to save additional searchable fields in the search index for a page, attachment or any other content entity object then writing an extractor plugin would be the solution. if you want to manipulate the scoring mechanism in the confluence search then consider writing a lucene boosting strategy plugin. If you want to modify the search result view – you could write a renderer plugin to modify this – there is a really nice tutorial from Atlassian on this topic.

      Finally you always have the chance to dig a level deeper and to write a custom xworks action and use the confluence-search/lucene api directly to fit your needs or to write a servlet filter and intercept http requests to the search api and modify what needs to be modified … it all depends on what you want to implement there :)

    3. alejo Says:

      I’ve been able to create a plugin to achieve my goal, I think with a conveyor xwork plugin and using the search replace my default for my plugin.

      I now need is to insert my analyzer in Spanish.

      I wonder if you know it, is whether you can create a plugin to add a Spanish-language analyzer and from the analyzer to customize the search.

    Search
    Tags
    Categories