Solr DataImportHandler delta scheduler

Background

The DataImportHandler (DIH) is a Solr contrib module that provides a configuration-driven way to import data from a database or XML files into Solr. It takes care of:

  • Full builds (aka full index)
  • Delta builds (incremental imports that index newly added or modified documents and remove deleted ones)

Official Solr wiki document

How to enable scheduler job

The dataimport scheduler is NOT included in any released Solr version. It exists only as a proposal attached to a very old Jira issue, and it may never be merged: all modern operating systems already have scheduling built in, so adding it to Solr would reinvent a very old wheel. Even so, I like running the scheduler job inside the Solr server, because it eases deployment and maintenance.

To get the scheduler job working, follow these steps (tested with Solr 3, 4, and 5):

1. Configure solrconfig.xml with dataimport handler

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
    <str name="config">dih-config.xml</str>
</lst>
</requestHandler>

2. Create dih-config.xml and save it under the “conf” directory

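<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/dbname"
              user="user-name"
              password="password"/>
  <document>
    <!-- `desc` is a reserved word in MySQL, so it is quoted with backticks. -->
    <!-- Assumption: mytable has a last_modified timestamp column. delta-import uses
         deltaQuery to find changed rows and deltaImportQuery to fetch them. -->
    <entity name="id" pk="id"
            query="select id,name,`desc` from mytable"
            deltaQuery="select id from mytable where last_modified &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="select id,name,`desc` from mytable where id='${dataimporter.delta.id}'">
    </entity>
  </document>
</dataConfig>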

3. Add your JDBC driver jar

Drop it into the <solr-home>/lib folder or the web app's WEB-INF/lib folder.

4. Add scheduler jar

Download the jar from here

5. Add a dataimport.properties file to the solr.home/conf/ folder with the mandatory parameters set (see the example dataimport.properties below)


#################################################
#     delta dataimport scheduler properties     #
#################################################

# to sync or not to sync
# 1 - active; anything else - inactive
syncEnabled=1

# which cores to schedule
# in a multi-core environment you can decide which cores you want synchronized
# leave empty or comment it out if using single-core deployment
syncCores=company,job

# solr server name or IP address
# [defaults to localhost if empty]
server=127.0.0.1

# solr server port
# [defaults to 80 if empty]
port=8983

# application name/context
# [defaults to current ServletContextListener's context (app) name]
webapp=solr

# URL params [mandatory]
# delta import command remainder of URL
params=/dataimport?command=delta-import&optimize=false&clean=false&commit=true

# schedule interval
# number of minutes between two runs
# [defaults to 30 if empty]
interval=10

6. Declare ApplicationListener in Solr’s web.xml

<listener>
  <listener-class>org.apache.solr.handler.dataimport.scheduler.ApplicationListener</listener-class>
</listener>

7. My directory structure

Solr home directory: [image: solr-dir]

Solr web directory: [image: solr-dir-web]

8. Run the full import command

http://127.0.0.1:8983/solr/company/dataimport?command=full-import

9. The delta import then runs every 10 minutes
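
For illustration, a fixed-interval job like this can be sketched in plain Java. This is not the scheduler jar's code: the class name DeltaTimerSketch and the helper every are hypothetical, and the demo uses milliseconds instead of minutes so that it finishes quickly.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DeltaTimerSketch {
    // Runs the task once per interval on a single background thread.
    // The first run happens after one full interval has passed.
    static ScheduledExecutorService every(long interval, TimeUnit unit, Runnable task) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        pool.scheduleAtFixedRate(task, interval, interval, unit);
        return pool;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        // Stand-in for the delta-import request: it only counts invocations.
        ScheduledExecutorService pool = every(10, TimeUnit.MILLISECONDS, runs::incrementAndGet);
        Thread.sleep(300);
        pool.shutdownNow(); // stop the background thread so the JVM can exit
        System.out.println("runs: " + runs.get());
    }
}
```

With interval=10 from the properties file, the equivalent call would be every(10, TimeUnit.MINUTES, task). Note that scheduleAtFixedRate cancels all later runs if the task throws, so a real job should catch and log its own exceptions.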

Note: after the first job runs, Solr generates a “dataimport.properties” file under the core's conf directory.

For example, solr/company/conf/dataimport.properties looks like this:


#Mon Nov 09 21:45:47 UTC 2015
company.last_index_time=2015-11-09 21\:45\:41
last_index_time=2015-11-09 21\:45\:41
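
The backslashes before the colons are standard java.util.Properties escaping, not part of the timestamp. A minimal sketch, assuming Java 8 or later and a hypothetical LastIndexTimeSketch class, shows how the value reads back:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Properties;

class LastIndexTimeSketch {
    // Timestamp layout seen in the generated file once the escapes are removed.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Parses properties text and returns the timestamp stored under the given key.
    // Throws NullPointerException if the key is absent.
    static LocalDateTime lastIndexTime(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText)); // load() turns "\:" into ":"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return LocalDateTime.parse(props.getProperty(key), FORMAT);
    }

    public static void main(String[] args) {
        String text = "#Mon Nov 09 21:45:47 UTC 2015\n"
                + "company.last_index_time=2015-11-09 21\\:45\\:41\n"
                + "last_index_time=2015-11-09 21\\:45\\:41\n";
        System.out.println(lastIndexTime(text, "last_index_time")); // prints 2015-11-09T21:45:41
    }
}
```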

The delta scheduler also writes log entries like these:


HTTPPostScheduler [company] <delta> Process started at .............. 2015.11.09 22:19:14
HTTPPostScheduler [company] <delta> Request method POST
HTTPPostScheduler [company] <delta> Using port 8983
HTTPPostScheduler [company] <delta> Application name solr
HTTPPostScheduler [company] <delta> URL params /dataimport?command=delta-import&optimize=false&clean=false&commit=true
HTTPPostScheduler [company] <delta> Full URL http://127.0.0.1:8983/solr/company/dataimport?command=delta-import&optimize=false&clean=false&commit=true
HTTPPostScheduler [company] <delta> Succesfully connected to server 127.0.0.1
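
These log lines suggest how the full URL is put together from the scheduler's dataimport.properties: server, port, webapp, the core name, then params. The sketch below is a rough reconstruction of that composition, not the scheduler's actual source. The names DeltaUrlSketch, valueOr, and buildUrl are hypothetical, and the fallback web app name "solr" is an assumption (the properties file says the real default is the current servlet context name).

```java
import java.util.Properties;

class DeltaUrlSketch {
    // Returns the trimmed property value, or the fallback when it is missing or blank.
    static String valueOr(Properties p, String key, String fallback) {
        String v = p.getProperty(key);
        return (v == null || v.trim().isEmpty()) ? fallback : v.trim();
    }

    // Builds the delta-import URL for one core; pass null or "" for a single-core setup.
    static String buildUrl(Properties p, String core) {
        String server = valueOr(p, "server", "localhost"); // documented default
        String port = valueOr(p, "port", "80");            // documented default
        String webapp = valueOr(p, "webapp", "solr");      // assumed fallback
        String params = valueOr(p, "params", "");          // mandatory in the real file
        String corePart = (core == null || core.isEmpty()) ? "" : "/" + core;
        return "http://" + server + ":" + port + "/" + webapp + corePart + params;
    }

    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty("server", "127.0.0.1");
        p.setProperty("port", "8983");
        p.setProperty("webapp", "solr");
        p.setProperty("params", "/dataimport?command=delta-import&optimize=false&clean=false&commit=true");
        // One URL per entry in syncCores=company,job
        for (String core : "company,job".split(",")) {
            System.out.println(buildUrl(p, core));
        }
    }
}
```

For the company core this produces the same string as the "Full URL" log line above, which is a quick way to check the values in dataimport.properties before deploying.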
