Solr WordDelimiterFilterFactory

When implementing a search solution, you will often run a query that you know should match a particular document, yet the document does not appear in the results or, even if it does, is not at the top as the most relevant match. Most of the time, this kind of mismatch is caused by one of two factors:

  1. A mismatch between index-time document analysis and query-time analysis (while not often recommended, it is possible to analyze documents differently from queries).
  2. The Analyzer/Tokenizer/TokenFilters are modifying one or more terms differently than expected.

You can use Solr’s Analysis page, located at http://localhost:8080/solr/admin/analysis.jsp, to investigate both of these issues. The page accepts snippets of text for both queries and documents, together with the field name that determines how the text should be analyzed, and returns step-by-step results showing how the text is modified at each stage of analysis.

The following screen shows how the document title “2010 Audi A3 2.0T Premium Plus PZEV” and the related query “2010 audi a3” are analyzed.

The relevant part of schema.xml:

<fieldtype name="text" class="solr.TextField" sortMissingLast="true" omitNorms="true">
 <analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.SnowballPorterFilterFactory" language="English" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.SnowballPorterFilterFactory" language="English" />
  <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
 </analyzer>
</fieldtype>

<field name="title" type="text" multiValued="false" />

Notice that WordDelimiterFilterFactory splits “A3” into “A” and “3”, and StopFilterFactory then removes “A” because it is treated as a stop word (the filter is configured with ignoreCase="true").

This is not what we want: the query “2010 audi a3” should match “2010 Audi A3 2.0T Premium Plus PZEV” with a higher relevance score.

What can we do?

Change the parameter in solr.WordDelimiterFilterFactory so that it doesn’t split on letter-number transitions.

Here is what WordDelimiterFilterFactory does, according to the Solr wiki page:

Splits words into subwords and performs optional transformations on subword groups. By default, words are split into subwords with the following rules:

  1. Split on intra-word delimiters (all non-alphanumeric characters).
     “Wi-Fi” -> “Wi”, “Fi”
  2. Split on case transitions (can be turned off; see the splitOnCaseChange parameter).
     “PowerShot” -> “Power”, “Shot”
  3. Split on letter-number transitions (can be turned off; see the splitOnNumerics parameter).
     “SD500” -> “SD”, “500”
  4. Leading and trailing intra-word delimiters on each subword are ignored.
     “//hello---there, ‘dude’” -> “hello”, “there”, “dude”
  5. Trailing “’s” are removed for each subword (can be turned off; see the stemEnglishPossessive parameter).
     “O’Neil’s” -> “O”, “Neil”
     Note: this step isn’t performed in a separate filter because of possible subword combinations.
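
The three switches named in these rules are ordinary attributes on the filter element. The fragment below is only a sketch of how they could be written out explicitly; the values shown are the defaults as I understand them ("1" turns a rule on, "0" turns it off), so check them against the documentation for your Solr version.

```xml
<!-- Sketch only: each attribute toggles one of the splitting rules above -->
<filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        stemEnglishPossessive="1" />
<!-- splitOnCaseChange:     "PowerShot" becomes "Power", "Shot" -->
<!-- splitOnNumerics:       "SD500" becomes "SD", "500" -->
<!-- stemEnglishPossessive: a trailing 's is dropped from each subword -->
```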

For our case, we want “A3” to stay as “A3”, so we set splitOnNumerics="0" on the WordDelimiterFilterFactory filter, in both the index and query analyzers, so that it no longer splits on letter-number transitions.
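
As a sketch, the two WordDelimiterFilterFactory lines from the schema above might look like this with the parameter added. Every other attribute is copied unchanged from the original configuration. Applying the setting to both analyzers is an assumption made here so that documents and queries are tokenized the same way, and documents indexed before the change would need to be reindexed for it to take effect.

```xml
<!-- index analyzer: "A3" is kept as a single token -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0" />

<!-- query analyzer: same setting, so the query term "a3" is not split either -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0" />
```

After restarting Solr and reindexing, the Analysis page can be used again to check that “A3” now survives the filter chain as “a3” on both the index and query side.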
