Prefix Query vs Fuzzy Query

I recently started using ElasticSearch in my project. Today our customer came with a new requirement: keyword search that matches the beginning letters of words. For example, if a document contains “kimchi” as its username, a search for “kim” should return that document.

So let’s see what ElasticSearch can offer.

Prefix Query

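Prefix Query: Matches documents that have fields containing terms with a specified prefix. The prefix itself is not analyzed, so it is compared with the indexed terms exactly as typed (for example, it has to be lowercase if the field was lowercased at index time). The prefix query maps to Lucene’s PrefixQuery. For instance, a search for ‘tex’ would match the word ‘text’.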

Example in JSON:


{
"prefix" : { "keywords4search" : "tex" }
}

Example in Java:


QueryBuilders.prefixQuery("keywords4search", keywords);
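
To make the matching rule concrete, here is a minimal plain-Java sketch of what a prefix lookup against a sorted list of indexed terms looks like. It does not use ElasticSearch or Lucene at all; the class name PrefixSketch, the method termsWithPrefix and the sample terms are made up for illustration, and the real PrefixQuery is implemented differently inside Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: mimics a prefix lookup in a sorted term dictionary.
class PrefixSketch {

    // Returns every term that starts with the given prefix, in sorted order.
    static List<String> termsWithPrefix(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; terms sharing the prefix are contiguous from there.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // we have left the block of terms that share the prefix
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(
                Arrays.asList("kim", "kimchi", "king", "tea", "tex", "text"));
        System.out.println(termsWithPrefix(terms, "kim")); // [kim, kimchi]
        System.out.println(termsWithPrefix(terms, "tex")); // [tex, text]
        // The prefix is used as typed, so an uppercase prefix finds no lowercase terms.
        System.out.println(termsWithPrefix(terms, "Kim")); // []
    }
}
```

The last call shows why the "not analyzed" remark matters: the prefix has to be in the same form as the indexed terms.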

Fuzzy Query

Fuzzy Query: Finds terms that need at most a certain number of single-character modifications, known as ‘edits’, to match the query. For example, a search for ‘fuuzzy’ would still match the word ‘fuzzy’ because deleting a single ‘u’ turns one into the other. The fuzziness argument sets the maximum edit distance allowed between the query and a matching term. The default value of 0.5 is not recommended; fuzziness should be set to 1 or 2, so that at most 2 edits between the query and a term in a document are allowed. Larger differences are too expensive to compute efficiently and are not processed by Lucene.
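
The idea of counting edits can be sketched in plain Java with the classic Levenshtein distance. This is only an illustration: the class and method names (EditDistanceSketch, editDistance, matches) are made up, and it is not how Lucene evaluates fuzzy queries internally. It also counts a swap of two adjacent letters as two edits, whereas the fuzzy query by default treats such a transposition as a single edit.

```java
// Illustrative only: plain Levenshtein distance (insert, delete, substitute).
class EditDistanceSketch {

    // Returns the minimum number of single-character edits that turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // A term matches when it is within maxEdits of the query.
    static boolean matches(String query, String term, int maxEdits) {
        return editDistance(query, term) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("fuuzzy", "fuzzy")); // 1
        System.out.println(editDistance("kim", "kimchi"));   // 3
        System.out.println(matches("fuuzzy", "fuzzy", 2));   // true
        System.out.println(matches("kim", "kimchi", 2));     // false
    }
}
```

Note that “kim” and “kimchi” are three edits apart, which is already more than the maximum of 2.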

Example in JSON:

{
 "fuzzy" : {
     "user" : {
         "value" : "fuuzzy",
         "boost" : 1.0,
         "fuzziness" : 2,
         "prefix_length" : 1,
         "max_expansions": 100
      }
  }
}

Example in Java:


// boost takes a float; Fuzziness here is org.elasticsearch.common.unit.Fuzziness
QueryBuilders.fuzzyQuery("keywords4search", "fuuzzy")
    .boost(1.0f).fuzziness(Fuzziness.TWO).prefixLength(1).maxExpansions(100);

prefix_length: The number of initial characters which will not be “fuzzified”, that is, which must match exactly. This helps to reduce the number of terms which must be examined. Defaults to 0, which is not recommended. Generally speaking, the longer the prefix_length, the better the search performance.

max_expansions: The maximum number of terms that the fuzzy query will expand to (in other words, match before halting the search). Defaults to 50. Lowering it can improve performance, but if it is too small, some valid results may be excluded because the query terminates early.
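
How these two parameters narrow the work can be sketched in plain Java as follows. Everything here is hypothetical (the class FuzzyExpansionSketch, the method expand, the sample terms), and the sketch simply walks the terms in dictionary order and stops at the limit; how Lucene actually selects and ranks candidate terms is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: shows the effect of prefix_length and max_expansions on term expansion.
class FuzzyExpansionSketch {

    // Plain Levenshtein distance between a and b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Collects terms within maxEdits of the query, honouring prefixLength and maxExpansions.
    static List<String> expand(SortedSet<String> terms, String query,
                               int maxEdits, int prefixLength, int maxExpansions) {
        // The first prefixLength characters of the query must match exactly.
        String fixedPrefix = query.substring(0, Math.min(prefixLength, query.length()));
        List<String> expansions = new ArrayList<>();
        for (String term : terms) {
            if (expansions.size() >= maxExpansions) {
                break; // limit reached: remaining terms are never examined
            }
            if (!term.startsWith(fixedPrefix)) {
                continue; // cheap check that skips the edit-distance computation
            }
            if (editDistance(query, term) <= maxEdits) {
                expansions.add(term);
            }
        }
        return expansions;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(
                Arrays.asList("buzzy", "fizzy", "fuzz", "fuzzy", "fuzzyy", "muzzy"));
        System.out.println(expand(terms, "fuuzzy", 2, 0, 100)); // [buzzy, fizzy, fuzz, fuzzy, fuzzyy, muzzy]
        System.out.println(expand(terms, "fuuzzy", 2, 1, 100)); // [fizzy, fuzz, fuzzy, fuzzyy]
        System.out.println(expand(terms, "fuuzzy", 2, 1, 2));   // [fizzy, fuzz]
    }
}
```

With a prefix length of 1 the terms that do not start with “f” are skipped without computing any distance, and with a limit of 2 the sketch stops before it ever reaches “fuzzy”, which illustrates how a too small max_expansions can drop valid results.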

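From the above comparison, the prefix query suits our need, and with better performance. Matching by leading letters is exactly what it does, while a fuzzy query would not even return “kimchi” for “kim”: that takes three insertions, and at most two edits are allowed.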
