Solr Internals: Modifying Solr Documents Before Indexing

Solr is a great search platform built on top of Lucene. It works well out of the box, but there are times when you want to customize it to get something extra done. In this blog, we will explore one such advanced use case.

 

How to modify the way Solr indexes your documents.

This is useful when, for example, we want to add a new field derived from the value of another field in the document. How can we achieve this if we don’t have control over the document source?

Before going into the details of how to do this in Solr, we first need to understand the concept of update request processors (URPs).

Every update request received by Solr is run through a chain of plugins called URPs.

One can write these plugins to do any sort of pre-processing on Solr documents. You can add new fields or even drop a document that you don’t want indexed.

In fact, many of Solr’s features are implemented this way, as plugins, so it is essential to understand how they work and how to configure them.

How do you implement URPs?

URPs are created by extending two abstract classes: UpdateRequestProcessor and UpdateRequestProcessorFactory. The factory creates a processor instance for each incoming request, and the main business logic goes in the request processor. The factory can also take configuration parameters that change how the processor works.

Many such small request processors are combined into a chain of URPs. When a document is indexed, they are applied in the order in which they appear in the chain.
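To make the factory, processor and chain relationship concrete, here is a minimal sketch in plain Java. The types below (Processor, ProcessorFactory, TraceProcessor) are simplified stand-ins invented for illustration and are not Solr's classes; the real UpdateRequestProcessor works on Solr commands rather than on a Map. The sketch only shows how each factory wires its processor to the next one and why processors run in chain order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: these are hypothetical stand-ins, NOT Solr's API.
class UrpChainSketch {

    /** A processor does its own work, then hands the document to the next link. */
    abstract static class Processor {
        protected final Processor next;

        Processor(Processor next) {
            this.next = next;
        }

        void processAdd(Map<String, Object> doc) {
            if (next != null) {
                next.processAdd(doc);
            }
        }
    }

    /** A factory creates a processor instance wired to the next processor. */
    interface ProcessorFactory {
        Processor getInstance(Processor next);
    }

    /** Records its name in the document's "trace" field so the order is visible. */
    static class TraceProcessor extends Processor {
        private final String name;

        TraceProcessor(String name, Processor next) {
            super(next);
            this.name = name;
        }

        @Override
        @SuppressWarnings("unchecked")
        void processAdd(Map<String, Object> doc) {
            List<String> trace =
                (List<String>) doc.computeIfAbsent("trace", k -> new ArrayList<String>());
            trace.add(name);
            super.processAdd(doc); // pass the document down the chain
        }
    }

    /** Builds the chain back to front, so the first factory's processor runs first. */
    static Processor buildChain(List<ProcessorFactory> factories) {
        Processor next = null;
        for (int i = factories.size() - 1; i >= 0; i--) {
            next = factories.get(i).getInstance(next);
        }
        return next;
    }

    /** Runs one empty document through a chain of trace processors. */
    @SuppressWarnings("unchecked")
    static List<String> traceFor(String... names) {
        List<ProcessorFactory> factories = new ArrayList<>();
        for (String name : names) {
            factories.add(next -> new TraceProcessor(name, next));
        }
        Map<String, Object> doc = new HashMap<>();
        Processor head = buildChain(factories);
        if (head != null) {
            head.processAdd(doc);
        }
        return (List<String>) doc.getOrDefault("trace", new ArrayList<String>());
    }

    public static void main(String[] args) {
        System.out.println(traceFor("dedupe", "log", "run")); // prints [dedupe, log, run]
    }
}
```

A processor that never calls super.processAdd would stop the document at that point, which is how a processor in this model could drop a document.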

A quick look at solrconfig.xml should reveal several examples of update request processor chains, like the one below:

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">id</str>
        <bool name="overwriteDupes">false</bool>
        <str name="fields">name,features,cat</str>
        <str name="signatureClass">solr.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

Shown above is an update processor chain named “dedupe”. It generates a signature field from the specified field(s) and uses it to de-duplicate documents, so that the same document is not indexed multiple times.

In this example, the fields name, features and cat are used to generate the value of the id field, which identifies duplicates. Because overwriteDupes is set to false, duplicates are not overwritten. The signature itself is computed by the class solr.processor.Lookup3Signature.
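The idea of a signature built from selected fields can be sketched in a few lines of plain Java. This is only an illustration of the concept, not Solr's implementation: it uses MD5 from the JDK as a stand-in hash (the config above uses Lookup3Signature), and the class and method names are hypothetical. Two documents that agree on the chosen fields get the same signature, whatever their other fields contain.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: MD5 is a stand-in for the signature class in the config.
class SignatureSketch {

    /** Hashes the given fields (name and value) of a document into a hex string. */
    static String signature(Map<String, String> doc, String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String field : fields) {
                String value = doc.get(field);
                md.update(field.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so "ab"+"c" differs from "a"+"bc"
                if (value != null) {
                    md.update(value.getBytes(StandardCharsets.UTF_8));
                }
                md.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> first = new HashMap<>();
        first.put("name", "Widget");
        first.put("features", "blue");
        first.put("cat", "tools");
        first.put("price", "10");

        Map<String, String> second = new HashMap<>(first);
        second.put("price", "12"); // price is not part of the signature

        String a = signature(first, "name", "features", "cat");
        String b = signature(second, "name", "features", "cat");
        System.out.println(a.equals(b)); // prints true
    }
}
```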

The last two processors are part of the default processor chain and perform essential functions, so any custom chain usually contains them and they shouldn’t be removed.

We can define such chains or individual processors in solrconfig.xml and specify which chain or processor to use when indexing a document.

Let’s say our documents have a field called “Category”, for which we expect one of a known list of values. If the category in an incoming document is anything else, we want to change the field value to “Others”.

So our update request processor will look something like this:

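    import java.io.IOException;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class CustomRequestProcessor extends UpdateRequestProcessor {

        // ….. (constructor and other members elided)

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            Log.info("Processing the input document in the custom request processor");

            SolrInputDocument doc = cmd.getSolrInputDocument();
            String category = (String) doc.getFieldValue("Category");

            // If the category is not from the predefined list, replace it.
            // allowedCategories is a hypothetical Set<String> defined in the elided part.
            if (!allowedCategories.contains(category)) {
                doc.setField("Category", "Others");
            }

            // Pass the document on to the next processor in the chain.
            super.processAdd(cmd);
        }
    }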
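The “not from a predefined list” check is the heart of this processor, so it can help to see it in isolation. Below is a minimal, Solr-free sketch that applies the same rule to a plain Map standing in for the document. The class name, the ALLOWED set and its values are hypothetical, and treating a missing or non-string category as “Others” is an assumption, not something the use case above specifies.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: a plain Map stands in for a Solr document.
class CategoryNormalizerSketch {

    // Hypothetical list of accepted categories; a real plugin might read
    // these from the factory's configuration parameters instead.
    static final Set<String> ALLOWED =
        new HashSet<>(Arrays.asList("Books", "Electronics", "Clothing"));

    /** Replaces any category outside the allowed list with "Others". */
    static void normalize(Map<String, Object> doc) {
        Object category = doc.get("Category");
        // Assumption: a missing or non-string category also becomes "Others".
        if (!(category instanceof String) || !ALLOWED.contains(category)) {
            doc.put("Category", "Others");
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("Category", "Toys");
        normalize(doc);
        System.out.println(doc.get("Category")); // prints Others
    }
}
```

Keeping the rule in a small method like this also makes it easy to unit test without starting Solr.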

 

Once this is done, we just need to package the plugin classes into a jar and add a lib directive to solrconfig.xml to tell Solr where our plugins are. They will be loaded when the Solr core (re)loads. We can then reference our factory class in a processor definition like the one shown above.

URPs have many other use cases. For inspiration, here is a list of currently available URPs: https://lucene.apache.org/solr/guide/6_6/update-request-processors.html#UpdateRequestProcessors-UpdateRequestProcessorFactories


Also published on Medium.
