Enable UTF-8 encoding in Solr

In a recent Solr project, my client asked for Solr to support Russian for both indexing and searching. For example, a search for "kittens" looks like this:

http://127.0.0.1:8080/search/ru/select/?q=котята

Background

Solr can index any characters encoded in the UTF-8 charset. There are no known bugs with Solr’s character handling, but there have been reported issues with the way different application servers (servlet containers), and even different versions of the same application server, treat incoming and outgoing multibyte characters. In particular, people have reported better success with Tomcat than with Jetty.

The most important points are:

  • Documents have to be sent to the Solr server for indexing encoded as UTF-8.
  • The client needs to URL-encode the query in UTF-8 when sending a search request to the Solr server.
    So q=котята should be encoded as
    q=%D0%BA%D0%BE%D1%82%D1%8F%D1%82%D0%B0
  • The server needs to support UTF-8 query strings. For example, when setting up Tomcat to run Solr, you should be aware that although Solr supports UTF-8 by default, Tomcat does not. You have to enable the character encoding by setting URIEncoding on the Connector in Tomcat’s conf/server.xml:
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8"/>

Be sure to remove useBodyEncodingForURI from that Connector if it is set. For more details, see the Solr Wiki.
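The client-side encoding step can be sketched in Python as follows. This is only an illustration, assuming a Python client (the original setup does not specify one); it reuses the example URL from above, and the variable names are made up for the sketch. It also shows why the server setting matters: the same percent-encoded bytes decode to the original term as UTF-8, but to garbage as ISO-8859-1.

```python
from urllib.parse import quote, unquote, urlencode

query = "котята"

# Percent-encode the UTF-8 bytes of the query term.
encoded = quote(query, safe="", encoding="utf-8")
print(encoded)  # %D0%BA%D0%BE%D1%82%D1%8F%D1%82%D0%B0

# urlencode builds the whole query string and also defaults to UTF-8.
base_url = "http://127.0.0.1:8080/search/ru/select/"  # example URL from this post
url = base_url + "?" + urlencode({"q": query})
print(url)

# A server that decodes the bytes as UTF-8 recovers the original term...
decoded_utf8 = unquote(encoded, encoding="utf-8")

# ...while a server that decodes them as ISO-8859-1 sees a different string.
decoded_latin1 = unquote(encoded, encoding="iso-8859-1")
print(decoded_utf8 == query, decoded_latin1 == query)
```

Each Cyrillic letter takes two bytes in UTF-8, so the six-letter term becomes twelve percent-encoded bytes, and a single-byte decoding such as ISO-8859-1 turns it into twelve unrelated characters.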

However, even after I made the above changes, the response was still encoded in ISO-8859-1 when wt=json, whereas wt=xml worked perfectly (i.e. it returned the result in UTF-8).

So I configured the response writers explicitly in solrconfig.xml:


<queryResponseWriter name="xml" class="org.apache.solr.response.XMLResponseWriter"/>
<queryResponseWriter name="json" class="org.apache.solr.response.JSONResponseWriter" default="true"/>

The strange thing was that the results were unpredictable when testing in a browser (IE, Firefox, Chrome): sometimes the response came back in UTF-8, sometimes in ISO-8859-1. I turned on Firebug to check the response headers and eventually figured out that the browser’s cache was making the test results inconsistent. That reminded me to use wget instead, and everything worked fine from then on.


wget -Sd 'http://127.0.0.1:8080/search/ru/select/?q=%D0%BA%D0%BE%D1%82%D1%8F%D1%82%D0%B0'
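The same check can be scripted so that no browser cache is involved. The sketch below is an assumption-laden illustration, not part of the original setup: the helper names charset_of and fetch_charset are hypothetical, and it only reads the charset that the server declares in the Content-Type response header, using the Python standard library.

```python
from email.message import Message
from urllib.request import urlopen


def charset_of(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def fetch_charset(url):
    """Request url and return the charset declared in the response headers."""
    with urlopen(url) as response:
        return charset_of(response.headers.get("Content-Type", ""))


# Header values here are made-up examples of what a server might send.
print(charset_of("application/json; charset=UTF-8"))
print(charset_of("text/plain; charset=ISO-8859-1"))
```

Calling fetch_charset with the wget URL above, plus wt=json or wt=xml, would show which charset the server declares for each response writer.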
