Sometimes you need to reindex your data because of schema changes or Solr upgrades. Or just because you enjoy torturing your hardware (you bad person!).
If you have Solr 3.6 or newer, you’re in luck: you can just use SolrEntityProcessor. Otherwise, you need to get your hands dirty and put something together to migrate your data, as we had to.
If all the data you need in the new index is stored in the old one, one way to migrate is to transform the Solr response from the old index into a format the UpdateHandler can understand. Since both the input and output formats are XML, XSLT is a suitable tool for that. Beyond that, we also need something to fetch the Solr response and POST the transformed XML to the destination index.
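To make the mapping concrete, here is a minimal Ruby sketch of the same idea using REXML (shipped with Ruby, as a bundled gem since Ruby 3.0) instead of XSLT. The sample response and its field names (`id`, `title`) are made up for illustration, and this is not what our migration uses; it only shows how each value inside a result `doc` becomes a `field` inside an `add` message.

```ruby
require 'rexml/document'

# Hypothetical, trimmed-down Solr select response (field names are assumptions)
response_xml = <<XML
<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">42</str>
      <str name="title">Hello Solr</str>
    </doc>
  </result>
</response>
XML

# Turn every <doc> of a select response into a <doc> inside an <add> message
def response_to_add(xml)
  source = REXML::Document.new(xml)
  add = REXML::Element.new('add')
  source.elements.each('response/result/doc') do |doc|
    new_doc = add.add_element('doc')
    # Each child (<str>, <int>, ...) becomes a <field> with the same name
    doc.elements.each do |value|
      field = new_doc.add_element('field', 'name' => value.attributes['name'])
      field.text = value.text
    end
  end
  add.to_s
end

puts response_to_add(response_xml)
```

Multivalued fields (`arr` elements) would need extra handling here, which is one reason a stylesheet is more convenient for the real thing.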
Our Solution
For this whole shebang to work, you need:
- A UNIX-like OS (tested on Mac OS X and Linux)
- Ruby (should work on both 1.8 and 1.9)
- cURL
- Java
- Saxon XSLT processor. You need Saxon-HE for Java
XSLT
Let’s take a look at the XSLT stylesheet:
```xml
<?xml version="1.0"?>
<!-- This is an example, so you probably need to adapt it to your case -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xsl xs">

  <xsl:output method="xml" version="1.0" encoding="utf-8"/>

  <xsl:template match="/response">
    <add>
      <!-- transform each result document -->
      <xsl:for-each select="result/doc">
        <doc>
          <xsl:for-each select="child::*">
            <!-- let the solr-field template transform a result field to a document field -->
            <xsl:call-template name="solr-field"/>
          </xsl:for-each>
          <!-- you can also add some fields with static values: -->
          <field name="source">
            <xsl:text>bulk-import</xsl:text>
          </field>
        </doc>
      </xsl:for-each>
    </add>
  </xsl:template>

  <xsl:template name="solr-field">
    <xsl:choose>
      <!-- when a field needs special transformation like picking apart
           a multivalued field, you can add xsl:when statements to handle
           it like this. In this example, 'content' is multivalued -->
      <xsl:when test="@name = 'content'">
        <!-- Delegate transforming to 'handle-content' -->
        <xsl:call-template name="handle-content"/>
      </xsl:when>
      <!-- when you want to ignore a field, just use an empty xsl:when -->
      <xsl:when test="@name = 'ignorethis'"></xsl:when>
      <!-- by default, just use the supplied value -->
      <xsl:otherwise>
        <field name="{@name}">
          <xsl:value-of select="."/>
        </field>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template name="handle-content">
    <field name="title"><xsl:value-of select="str[2]"/></field>
    <field name="text"><xsl:value-of select="str[3]"/></field>
  </xsl:template>

</xsl:stylesheet>
```
Save the stylesheet as response-from-solr.xslt. Treat it as an example: you will probably need to adapt it to your input and output schemas and to any other transformations you need.
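Before pointing the stylesheet at a whole index, it can help to run it once against a single saved response and look at the result. A minimal sketch, assuming a file named sample-response.xml (hypothetical) that holds one saved select response, and using Saxon's `-s:` and `-xsl:` options for source and stylesheet:

```ruby
require 'shellwords'

# Build the Saxon command line for transforming one saved response file.
# All three file names are passed in; none of them is checked for existence.
def saxon_command(saxon_path, xslt_file, source_file)
  ['java', '-jar', saxon_path, "-s:#{source_file}", "-xsl:#{xslt_file}"].shelljoin
end

# sample-response.xml is a hypothetical file name
cmd = saxon_command('saxon9he.jar', 'response-from-solr.xslt', 'sample-response.xml')
puts cmd

# Uncomment to actually run it once Saxon and the sample file are in place:
# system(cmd)
```

Saxon writes the transformed XML to standard output, so you can check by eye that the fields end up where you expect them.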
Driver script
An XSLT stylesheet alone doesn’t do very much, as it is just a declaration of the desired transformations. We still need something to:
- Split the work in batches
- Fetch the batches from the source index
- Process the response from the source index
- Put the processed response into the new index
So this is the quick and dirty script we came up with:
```ruby
#!/usr/bin/env ruby
require 'cgi'

# ATTENTION: this only works if you have a numeric, strictly monotonically
# increasing ID field to create batches from. It should be possible to adapt
# this to other batching criteria (like an addedDate) or to leave out the
# batching (if you have a small index).
#
# USE IT AT YOUR OWN RISK

# The smallest and largest ID you need to move
start_id = 1
end_id = 100_000_000

# How big a batch should be. You might need to give Saxon more memory if you increase it
id_batch_size = 50_000

# URLs to the source and destination indexes (no trailing slash)
source_solr_url = "http://<source-host>:8080/solr"
destination_solr_url = "http://<destination-host>:8080/solr"

# Path to required files
saxon_path = "saxon9he.jar"
response_xslt_file = "response-from-solr.xslt"

# -- end config --

# Work from the highest IDs down to the lowest
batch_end_id = end_id

while batch_end_id >= start_id
  # Never reach below start_id, even if the range is not a multiple of the batch size
  batch_start_id = [batch_end_id - id_batch_size + 1, start_id].max

  batch_query = "id:[#{batch_start_id} TO #{batch_end_id}]"
  puts "moving #{batch_query}"

  batch_url = "#{source_solr_url}/select/?q=#{CGI.escape(batch_query)}&fl=*&rows=#{id_batch_size}"

  # Put the commands together:
  # Pull batch from source Solr
  cmd = "curl -s '#{batch_url}' | "
  # Pipe the response into Saxon
  cmd += "java -Xmx1024m -XX:MaxPermSize=256m -jar '#{saxon_path}' -s:- '#{response_xslt_file}' | "
  # Pipe Saxon's output into the destination Solr
  cmd += "curl -s --data-binary '@-' --header 'Content-Type: text/xml' '#{destination_solr_url}/update' > /dev/null"

  puts "executing #{cmd}"
  system cmd

  # Set ID range for the next batch
  batch_end_id = batch_start_id - 1
end
```
If you want to use the script, be aware that although it works for us, we cannot guarantee that it will work for you. Use it at your own risk.
Take a look at all the variables above the # -- end config -- line and change them to your liking. Then you can run the script and wait until it finishes.
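To see which ID ranges a given combination of start_id, end_id and id_batch_size produces before moving any data, the batching can be sketched in isolation. The helper name `id_batches` is made up for this example. This version walks from the highest IDs down and clamps the last batch at start_id, so it never asks for IDs outside the configured range:

```ruby
# Split [start_id, end_id] into inclusive ID ranges of at most batch_size IDs,
# starting with the highest IDs.
def id_batches(start_id, end_id, batch_size)
  batches = []
  batch_end = end_id
  while batch_end >= start_id
    # Clamp so the last batch stops at start_id
    batch_start = [batch_end - batch_size + 1, start_id].max
    batches << [batch_start, batch_end]
    batch_end = batch_start - 1
  end
  batches
end

# Small numbers to keep the output readable
id_batches(1, 25, 10).each do |from, to|
  puts "id:[#{from} TO #{to}]"
end
# prints:
# id:[16 TO 25]
# id:[6 TO 15]
# id:[1 TO 5]
```

Every ID in the range lands in exactly one batch, including start_id itself when the range is not a multiple of the batch size.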
Conclusion
Fortunately, we had all the data we needed stored on the old Solr instance. The performance in our case was great. We let it run overnight and it had finished moving all the documents (around 50 million) the next morning.
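One simple way to check that everything arrived is to compare the numFound attribute that both indexes report for the same query, for example `q=*:*&rows=0`. The sketch below only covers the parsing and comparison; the two responses are hard-coded, hypothetical examples, and in practice you would fetch them from the source and destination indexes (with curl, say). Keep in mind that added documents only become visible to queries after a commit on the destination.

```ruby
# Extract numFound from a Solr XML select response.
# Returns nil if the response contains no <result> element with numFound.
def num_found(response_xml)
  match = response_xml.match(/<result\b[^>]*\bnumFound="(\d+)"/)
  match && match[1].to_i
end

# Hypothetical responses, hard-coded for illustration
source_response      = '<response><result name="response" numFound="50000000" start="0"/></response>'
destination_response = '<response><result name="response" numFound="49999990" start="0"/></response>'

missing = num_found(source_response) - num_found(destination_response)
puts missing.zero? ? 'counts match' : "#{missing} documents missing"
```

Matching counts do not prove that every field was transformed correctly, but a mismatch is a cheap early warning that a batch failed.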
As mentioned above, with Solr 3.6 or newer you can use SolrEntityProcessor instead. SearchWorkings has a blog post on how to do that.
