One of our sites requires Japanese search capabilities, which looked easy on paper, but after playing around for the better part of a day I realized it wasn’t. It doesn’t take much to get it going, but sifting through mailing lists is pretty much the only way to get the details.
If you do a search using Japanese characters, you’ll notice that Solr will find results out of the box. The problem is that it uses English methods of searching. The main problem this creates is that it assumes spaces separate words, which is not true in Japanese. A “tokenizer” creates the words for Lucene to use, so we need a tokenizer that works with Japanese. The CJK Tokenizer does the trick.
Solr will not handle multiple languages in a single field, so we need to create a new field for each language. First, though, you need to create the field type:
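A sketch of what that field type might look like in schema.xml — the type name "text_ja" is my own choice, but `solr.CJKTokenizerFactory` is the CJK Tokenizer shipped with Solr 1.3/1.4:

```xml
<!-- schema.xml: a field type that tokenizes CJK text instead of
     splitting on whitespace; the name "text_ja" is arbitrary -->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```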
Now define the field:
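Something along these lines, assuming the "text_ja" type from above and a field named "body_ja" to hold the Japanese copy of the content:

```xml
<!-- schema.xml: a dedicated field for the Japanese text -->
<field name="body_ja" type="text_ja" indexed="true" stored="true"/>
```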
For whatever reason, I could not get this working without explicitly identifying which field the search should use, so I added “qf=body_ja” to my query string. If you see other syntax for defining the field type, don’t use it (at least with Solr 1.3–1.4). It seems to break the words up correctly, but you won’t be able to search.
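For reference, a query string with that parameter might look like the following; note that `qf` is a DisMax parameter, so this assumes you’re querying through the dismax handler (the host, port, and search term here are just placeholders):

```
http://localhost:8983/solr/select?defType=dismax&qf=body_ja&q=東京
```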
After you adjust your front end to save into the new fields, you should be off to the races.