Rebuilding a Solr Index – the Hard Way

Sometimes you need to reindex your data, because of a schema change or a Solr upgrade. Or just because you enjoy torturing your hardware (you bad person!).

If you have Solr 3.6 or newer, you’re in luck: you can just use SolrEntityProcessor. Otherwise, you need to get your hands dirty and put something together to migrate your data, as we had to.

If all the data you need in the new index is stored in the old one, one way to migrate is to transform the Solr query response from the old index into a format the XML update handler understands. Since both the input and the output formats are XML, XSLT is a suitable tool for the job. Beyond that, we also need something to fetch the Solr response in the first place and to POST the transformed XML to the destination index.
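To make the shape of that transformation concrete, here is a minimal sketch written in Ruby with REXML rather than XSLT. It is not the stylesheet from this post, only an illustration of the mapping: each typed element in a response `<doc>` (such as `<str name="id">`) becomes a `<field name="id">`, and each child of a multi-valued `<arr>` becomes its own `<field>`. The function name `solr_response_to_add` and the sample document are made up for this example.

```ruby
require 'rexml/document'

# Hypothetical helper: turn a Solr XML query response into an <add> message
# that can be POSTed to the update handler of the new index.
def solr_response_to_add(response_xml)
  source = REXML::Document.new(response_xml)
  add = REXML::Element.new('add')

  source.elements.each('response/result/doc') do |doc|
    out_doc = add.add_element('doc')
    doc.elements.each do |value|
      name = value.attributes['name']
      if value.name == 'arr'
        # Multi-valued field: emit one <field> per child element.
        value.elements.each do |item|
          out_doc.add_element('field', 'name' => name).text = item.text
        end
      else
        # Single-valued field (<str>, <int>, <date>, ...): copy the text over.
        out_doc.add_element('field', 'name' => name).text = value.text
      end
    end
  end

  add.to_s
end

# Made-up sample response with one single-valued and one multi-valued field.
sample = <<XML
<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">42</str>
      <arr name="tags"><str>solr</str><str>xslt</str></arr>
    </doc>
  </result>
</response>
XML

puts solr_response_to_add(sample)
```

A real migration usually needs more than this, for example dropping fields that the new schema fills in through copyField, which is why the transformation has to be adapted to each schema.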

Our Solution

For this whole shebang to work, you need:

  • A UNIX-like OS (tested on Mac OS X and Linux)
  • Ruby (should work on both 1.8 and 1.9)
  • cURL
  • Java
  • The Saxon XSLT processor (you need Saxon-HE for Java)

XSLT

Let’s take a look at the XSLT stylesheet:

Save the stylesheet as response-from-solr.xslt. Treat it as an example: you will probably need to adapt it to your input and output schemas and to any other transformations you need.

Driver script

An XSLT stylesheet alone doesn’t do very much, as it is just a declaration of the desired transformations. We still need something to:

  • Split the work into batches
  • Fetch the batches from the source index
  • Process the response from the source index
  • Put the processed response into the new index
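The four steps above can be sketched in Ruby roughly as follows. This is not the script from this post, only an outline under stated assumptions: the URLs, the batch size, the temporary file names and the Saxon jar name `saxon9he.jar` are placeholders, and the total document count is hard-coded here, whereas in practice you would read it from the `numFound` attribute of a `rows=0` query. The sketch only prints the commands it would run.

```ruby
require 'uri'

# Placeholder settings (assumptions); change them to match your setup.
SOURCE_URL = 'http://localhost:8983/solr/old'
TARGET_URL = 'http://localhost:8983/solr/new'
SAXON_JAR  = 'saxon9he.jar'
STYLESHEET = 'response-from-solr.xslt'
BATCH_SIZE = 1000

# Step 1: start offsets of all batches needed to cover total_docs documents.
def batch_offsets(total_docs, batch_size)
  (0...total_docs).step(batch_size).to_a
end

# URL that fetches one batch from the source index as XML.
def select_url(base_url, start, rows)
  query = URI.encode_www_form('q' => '*:*', 'start' => start, 'rows' => rows, 'wt' => 'xml')
  "#{base_url}/select?#{query}"
end

# Steps 2 to 4 for one batch: fetch, transform, post.
def batch_commands(start, rows)
  raw = "batch-#{start}.xml"
  add = "batch-#{start}-add.xml"
  [
    ['curl', '-s', '-o', raw, select_url(SOURCE_URL, start, rows)],
    ['java', '-jar', SAXON_JAR, "-s:#{raw}", "-xsl:#{STYLESHEET}", "-o:#{add}"],
    ['curl', '-s', '-H', 'Content-Type: text/xml', '--data-binary', "@#{add}", "#{TARGET_URL}/update"]
  ]
end

# Dry run for a made-up total of 2500 documents.
batch_offsets(2500, BATCH_SIZE).each do |start|
  batch_commands(start, BATCH_SIZE).each { |cmd| puts cmd.join(' ') }
end
```

To run the commands for real, replace `puts cmd.join(' ')` with `system(*cmd)` and stop as soon as one of them fails. Documents only become searchable in the new index after a commit, for example by posting `<commit/>` to the same update URL once the last batch is in.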

So this is the quick and dirty script we came up with:

If you want to use the script, please be aware that although it works for us, we cannot guarantee that it will work for you. Use it at your own risk.

Take a look at all the variables above the # -- end config -- line and change them to your liking. Then you can run the script and wait until it finishes.

Conclusion

Fortunately, all the data we needed was stored in the old Solr instance. Performance in our case was great: we let the script run overnight, and by the next morning it had finished moving all the documents (around 50 million of them).

As I mentioned above, in Solr 3.6 and later you can use SolrEntityProcessor instead. SearchWorkings has a blog post on how to do that.
