Solr MongoDB Data Import Handler: Delta Import, Delete, and Nested Object Support

Overview

MongoDB's full-text search is fairly basic, much like what relational databases offer, and that may be sufficient for you. If you need more advanced search capabilities, though, Solr will probably come to mind. Unfortunately, there is no official (supported and complete) solution for integrating MongoDB with Solr. Here are a couple of options.

Options

Option 1. Use Mongo Connector

Mongo Connector is a generic connection system that you can use to integrate MongoDB with another system with simple CRUD operational semantics (i.e. insert, update, delete, and search operations.) On startup, Mongo Connector copies your documents from MongoDB to your target system (could be Solr, ElasticSearch, etc). Afterwards, it constantly performs updates on the target system to keep MongoDB and the target in sync.

For more details, refer to the blog post and GitHub project below:

  • http://blog.mongodb.org/post/29127828146/introducing-mongo-connector
  • https://github.com/10gen-labs/mongo-connector

Option 2. Use DataImportHandler

The DataImportHandler is a Solr contrib module that provides a configuration-driven way to import data from a data source into Solr, both as full imports (full index builds) and as incremental delta imports (delta index). Its main advantages are that it requires no additional software development and integrates quickly with the data source. Out of the box, however, the DataImportHandler is documented mainly for relational databases such as MySQL and Oracle, and ships no data source for NoSQL databases such as MongoDB.

Step-by-step instructions to integrate MongoDB with DataImportHandler

Step 1. Understand the DataImportHandler

https://wiki.apache.org/solr/DataImportHandler This wiki page explains the handler well for the relational database case.

Step 2. Make sure your MongoDB is working

Assume the MongoDB database is named “posts” and the collection is named “sellposts”. Running db.sellposts.find() returns:

/* 1 */
{
    "_id" : "2bd571b04f374d71929560d04b58ba51",
    "categoryPath" : "/SALE/Appliances",
    "title" : "string",
    "price" : {
        "value" : 123456789.88,
        "currency" : "CAD",
        "currencySymbol" : "$"
    }
}

/* 2 */
{
    "_id" : "5d55c86945004dd79a4333bf2bcc6d83",
    "categoryPath" : "/SALE/Appliances",
    "title" : "Whrilpool cabrio set",
    "price" : {
        "value" : 629.0,
        "currency" : "USD",
        "currencySymbol" : "$"
    }
}

Step 3. Declare Solr fields in schema.xml

<!-- Sample Solr schema.xml -->
  <fields>    
    <field name="postId" type="string" indexed="true" required="true" />
    <field name="categoryPath" type="string" indexed="true" stored="true"/>
    <field name="title" type="textnosynonym" indexed="true" stored="true" />
    <field name="price" type="double" indexed="true" stored="true" />
    <field name="price_currency" type="string" indexed="true" stored="true" />
    <field name="price_currencySymbol" type="string" indexed="false" stored="true" />
  </fields>
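
The schema uses flat field names (price, price_currency, price_currencySymbol), while the MongoDB documents nest those values under "price". The sketch below illustrates one way such a mapping could work, assuming nested keys are joined with an underscore and that "_id" and "price_value" are renamed to "postId" and "price" as in the field mappings of dih-config.xml. The helper names flatten, to_solr_doc, and FIELD_MAP are hypothetical and are not part of any importer library.

```python
def flatten(doc, parent_key="", sep="_"):
    """Flatten nested dicts: {"price": {"value": 1}} -> {"price_value": 1}."""
    flat = {}
    for key, value in doc.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the nested object, carrying the joined prefix along.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical rename table mirroring <field column="..." name="..."/> entries.
FIELD_MAP = {"_id": "postId", "price_value": "price"}


def to_solr_doc(mongo_doc):
    """Flatten a MongoDB document and rename columns to schema.xml field names."""
    flat = flatten(mongo_doc)
    return {FIELD_MAP.get(key, key): value for key, value in flat.items()}


# Second sample document from the sellposts collection shown earlier.
sample = {
    "_id": "5d55c86945004dd79a4333bf2bcc6d83",
    "categoryPath": "/SALE/Appliances",
    "title": "Whrilpool cabrio set",
    "price": {"value": 629.0, "currency": "USD", "currencySymbol": "$"},
}

solr_doc = to_solr_doc(sample)
print(sorted(solr_doc))
```

The resulting keys line up with the six fields declared in schema.xml, which is a quick way to check that the schema and the mapping agree.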

Step 4. Declare dih-config.xml in solrconfig.xml

<config>
  ...
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-config.xml</str>
    </lst>
  </requestHandler>
  ...
</config>

Step 5. Define dih-config.xml in your Solr collection's conf folder (where schema.xml and solrconfig.xml are stored)

<?xml version="1.0" encoding="utf-8" ?>
<dataConfig>
    <!--clientUri=mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]] -->
    <dataSource name="MongoDS" type="MongoDataSource" database="posts" clientUri="mongodb://localhost:27017"/>
    <document name="import">
        <!-- if query="" then it imports everything -->
        <entity processor="MongoEntityProcessor"
                collection="sellposts"
                datasource="MongoDS"
                transformer="MongoMapperTransformer"
                name="sellpost"
                query="{$or : [{'status' : 'AVAILABLE'},{'status' : 'SOLD'} ]}"
                deltaQuery="{$and : [ {$or : [{'status' : 'AVAILABLE'},{'status' : 'SOLD'} ]}, {'modifiedAt':{$gt:{$date:'${dih.last_index_time}'}} } ] }"
                deltaImportQuery="{'_id':'${dih.delta._id}'}"
                deletedPkQuery="{$and : [ {$or : [{'status' : 'DELETED'},{'status' : 'UNLISTED'} ]}, {'modifiedAt':{$gt:{$date:'${dih.last_index_time}'}} } ] }" >

            <!--  If the MongoDB field name and the field name declared in schema.xml are the same, you don't need to declare a field mapping below.
                  If they differ, you have to map the MongoDB field to the field in schema.xml
                 (e.g. mongoField="EmpNumber" to name="EmployeeNumber").
                 <field column="EmpNumber" name="EmployeeNumber" mongoField="EmpNumber"/>
                 -->
            <field column="_id" name="postId"/>
            <field column="price_value" name="price"/>
        </entity>
    </document>
</dataConfig>
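
The three delta-related queries in the entity above divide the work between them: deltaQuery finds AVAILABLE or SOLD documents modified since the last import, deltaImportQuery re-fetches each of those by _id, and deletedPkQuery finds DELETED or UNLISTED documents modified since the last import so they can be removed from Solr. The following is a minimal in-memory sketch of that decision logic, not the importer's actual code. The classify function and the sample timestamps are hypothetical; only the status values and the modifiedAt field come from the configuration.

```python
from datetime import datetime

INDEXED_STATUSES = {"AVAILABLE", "SOLD"}    # matched by query and deltaQuery
REMOVED_STATUSES = {"DELETED", "UNLISTED"}  # matched by deletedPkQuery


def classify(doc, last_index_time):
    """Return 'index', 'delete', or 'skip' for one document in a delta run."""
    if doc["modifiedAt"] <= last_index_time:
        return "skip"    # not modified since the last import ($gt fails)
    if doc["status"] in INDEXED_STATUSES:
        return "index"   # deltaQuery hit, re-fetched via deltaImportQuery
    if doc["status"] in REMOVED_STATUSES:
        return "delete"  # deletedPkQuery hit, removed from the Solr index
    return "skip"        # any other status is ignored by both queries


# Hypothetical stand-in for ${dih.last_index_time}.
last_run = datetime(2015, 6, 1, 12, 0, 0)

docs = [
    {"_id": "a", "status": "AVAILABLE", "modifiedAt": datetime(2015, 6, 1, 12, 5)},
    {"_id": "b", "status": "SOLD", "modifiedAt": datetime(2015, 5, 30, 9, 0)},
    {"_id": "c", "status": "DELETED", "modifiedAt": datetime(2015, 6, 1, 13, 0)},
    {"_id": "d", "status": "DRAFT", "modifiedAt": datetime(2015, 6, 2, 8, 0)},
]

actions = {doc["_id"]: classify(doc, last_run) for doc in docs}
print(actions)
```

Note that this scheme relies on soft deletes: a document that is physically removed from MongoDB never matches deletedPkQuery, so it would stay in the Solr index until the next full import.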

Step 6. Run the full import

Assuming Solr is running on port 8983 and MongoDB on port 27017, open the following link to trigger a full import of the data from MongoDB into Solr: http://localhost:8983/solr/sellpost/dataimport?command=full-import

Try the search query: http://localhost:8983/solr/sellpost/query?q=*
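
If you would rather script these requests than paste them into a browser, the URLs can be assembled as in the sketch below. The solr_url helper is a hypothetical name, and the host, port, and core simply repeat the values used in this post. The block only builds the URLs and does not contact Solr.

```python
from urllib.parse import urlencode


def solr_url(core, handler, host="localhost", port=8983, **params):
    """Build a Solr request URL of the form .../solr/<core>/<handler>?<params>."""
    # safe="*:" keeps '*' and ':' readable in the q parameter.
    query = urlencode(params, safe="*:")
    base = f"http://{host}:{port}/solr/{core}/{handler}"
    return f"{base}?{query}" if query else base


full_import = solr_url("sellpost", "dataimport", command="full-import")
status = solr_url("sellpost", "dataimport", command="status")
search = solr_url("sellpost", "query", q="*")

print(full_import)
print(status)
print(search)
```

Passing one of these strings to urllib.request.urlopen would send the request. The status command is useful for checking whether an import is still running before you try the search.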

Step 7. Enable the delta import scheduler job

If you need the delta import to run automatically on a schedule, you can find more details here. For your convenience, I have also included the source code:

  • SolrDataImportProperties.java
  • ApplicationListener.java
  • HTTPPostScheduler.java

Step 7.1 Compile the above source and put it on the classpath

Step 7.2 Declare it in web.xml

<!-- web.xml -->
   <listener>
     <listener-class>org.apache.solr.handler.dataimport.scheduler.ApplicationListener</listener-class>
   </listener>

Step 7.3 Define dataimport.properties

${solrHome}/solr/conf/dataimport.properties

#################################################
# delta dataimport scheduler properties         #
#################################################
#  to sync or not to sync
#  1 - active; anything else - inactive
syncEnabled=1

#  which cores to schedule
#  in a multi-core environment you can decide which cores you want synchronized
#  leave empty or comment it out if using single-core deployment
syncCores=sellpost

#  solr server name or IP address
#  [defaults to localhost if empty]
server=127.0.0.1

#  solr server port
#  [defaults to 80 if empty]
port=8983

#  application name/context
#  [defaults to current ServletContextListener's context (app) name]
webapp=solr

#  URL params [mandatory]
#  delta import command remainder of URL
params=/dataimport?command=delta-import&clean=false&commit=true

#  define how frequent the delta import should run
#  (number of minutes between two runs)
#  [defaults to 10 if empty]
interval=5

Restart Solr and wait patiently. Solr should now import from your MongoDB incrementally.
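
To sanity-check the properties before restarting, it can help to see which URL the scheduler would end up calling. The sketch below assumes the scheduler joins the settings as http://server:port/webapp/core followed by params, which is my reading of the scheduler source and should be verified against the HTTPPostScheduler.java you compile. The function names are hypothetical, and the "solr" fallback for webapp is an assumption (the real default is the servlet context name).

```python
def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, since the params value contains '='.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def delta_import_urls(props):
    """Return one delta-import URL per configured core (assumed URL layout)."""
    server = props.get("server") or "localhost"  # documented default
    port = props.get("port") or "80"             # documented default
    webapp = props.get("webapp") or "solr"       # assumption, see lead-in
    params = props["params"]                     # mandatory
    cores = [c.strip() for c in props.get("syncCores", "").split(",") if c.strip()]
    if not cores:
        # Single-core deployment: no core segment in the path.
        return [f"http://{server}:{port}/{webapp}{params}"]
    return [f"http://{server}:{port}/{webapp}/{core}{params}" for core in cores]


SAMPLE = """
syncEnabled=1
syncCores=sellpost
server=127.0.0.1
port=8983
webapp=solr
params=/dataimport?command=delta-import&clean=false&commit=true
interval=5
"""

props = parse_properties(SAMPLE)
urls = delta_import_urls(props)
interval_minutes = int(props.get("interval") or 10)  # documented default is 10
print(urls, interval_minutes)
```

Opening the printed URL by hand is a quick way to confirm that the delta import works before relying on the scheduler to call it every few minutes.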

 
