Thinking Sphinx in Arabic/Unicode
While using Thinking Sphinx in one of my Rails projects, I needed to search Arabic content. Since Sphinx supports Unicode, I thought that would be easy. It was not, because Unicode support through Thinking Sphinx is barely documented. So here is what to do to support Arabic (Unicode) search.
After reading a little of the Sphinx documentation, I learned that to support non-English languages I had to define a charset_table for Sphinx to use while indexing my data. After some research, I found a nice charset table for several languages. So, I went to the configuration file created by Thinking Sphinx (config/development.sphinx.conf under the application root) and added an English/Arabic charset_table. I stopped searchd, reindexed and restarted it. Then I tried to search Arabic, with no luck! I noticed that my new configuration, including the charset_table, was gone! Why? Thinking Sphinx regenerates the configuration file before reindexing!
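To make the role of charset_table a bit more concrete, here is a rough Ruby sketch. It is not Sphinx's tokenizer, only an illustration: it parses a table string in the format used later in this post and checks whether every character of a word is listed. As far as I understand the Sphinx documentation, characters missing from the table are treated as word separators, which is why Arabic words are not searchable without the Arabic ranges. The helper names (codepoint, accepted_ranges, indexable?) are made up for this example.

```ruby
# Sketch only: mimics how a charset_table whitelists characters.
CHARSET_TABLE = "0..9, a..z, _, A..Z->a..z, U+621..U+63a, U+640..U+64a, " \
                "U+66e..U+66f, U+671..U+6d3, U+6d5, U+6e5..U+6e6, " \
                "U+6ee..U+6ef, U+6fa..U+6fc, U+6ff"

# Turn "U+621" or a literal character such as "a" into an integer codepoint.
def codepoint(token)
  match = token.match(/\AU\+(\h+)\z/i)
  match ? match[1].hex : token.ord
end

# Build the list of accepted codepoint ranges. For a mapping such as
# "A..Z->a..z" only the source side (before "->") matters here.
def accepted_ranges(table)
  table.split(",").map do |entry|
    first, last = entry.strip.split("->").first.split("..")
    codepoint(first)..codepoint(last || first)
  end
end

# A word stays in one piece only if every character is in some range.
def indexable?(word, ranges)
  word.each_char.all? { |ch| ranges.any? { |range| range.cover?(ch.ord) } }
end

ranges = accepted_ranges(CHARSET_TABLE)
puts indexable?("كتاب", ranges)  # Arabic letters only
puts indexable?("café", ranges)  # contains a character outside the table
```

Running a few sample words from your own data through a check like this can reveal characters (diacritics, for instance) that the table does not cover.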
After a lot of research, I discovered that to add custom configuration, you must create the file config/sphinx.yml (under the application root), which Thinking Sphinx uses to override its default configuration. Hey, why didn't anyone tell me that?!
After 2 hours of YAML syntax errors, I did it. Here is my sphinx.yml:
development: &my_settings
  enable_star: true
  min_prefix_len: 0
  min_infix_len: 1
  min_word_len: 1
  charset_table: "0..9, a..z, _, A..Z->a..z, U+621..U+63a, U+640..U+64a, U+66e..U+66f, U+671..U+6d3, U+6d5, U+6e5..U+6e6, U+6ee..U+6ef, U+6fa..U+6fc, U+6ff"
test:
  <<: *my_settings
production:
  <<: *my_settings
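Since most of my two hours went into YAML indentation mistakes, a quick sanity check before reindexing may save you some time. The sketch below is my own suggestion, not part of Thinking Sphinx: it parses a copy of the file (with a shortened charset_table) using Ruby's standard YAML library and prints the keys each environment ends up with. It assumes a Ruby whose YAML.safe_load accepts the aliases: keyword (Ruby 2.6 or later, as far as I know). In a real project you would read config/sphinx.yml with File.read instead of the inline string.

```ruby
require "yaml"

# Inline copy of the settings, with a shortened charset_table for brevity.
SPHINX_YML = <<~'YAML'
  development: &my_settings
    enable_star: true
    min_prefix_len: 0
    min_infix_len: 1
    min_word_len: 1
    charset_table: "0..9, a..z, _, A..Z->a..z, U+621..U+63a"
  test:
    <<: *my_settings
  production:
    <<: *my_settings
YAML

# aliases: true is required because the file uses &my_settings / *my_settings.
settings = YAML.safe_load(SPHINX_YML, aliases: true)

# Every environment should list the same keys if the merge keys resolved.
settings.each do |env, options|
  puts "#{env}: #{options.keys.sort.join(', ')}"
end
```

If the indentation is wrong, the parse either raises a syntax error or the options show up as top-level keys next to development, test and production, which is easy to spot in the output.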
Other Settings
- min_word_len: 1
Setting the minimum indexed word length to 1 means index everything.
- min_prefix_len: 0
Setting the minimum word prefix length to index to 0 disables prefix indexing. If set to a positive number, the indexer would index all the possible keyword prefixes (i.e. word beginnings) in addition to the keywords themselves.
- min_infix_len: 1
Setting the minimum infix length to index to 1 asks the indexer to index all the possible keyword infixes (i.e. substrings) in addition to the keywords themselves. This allows wildcard searching with 'start*', '*end', and '*middle*' patterns. However, indexing infixes makes the index grow significantly (because of many more indexed keywords) and degrades both indexing and searching times. Note that you can't enable both prefix and infix indexing at the same time; that's why I disabled prefix indexing.
- enable_star: true
This enables "star-syntax", or wildcard syntax, when searching through indexes which were created with prefix or infix indexing enabled. It only affects searching, so it can be changed without reindexing by simply restarting searchd.
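To get a feeling for why min_infix_len: 1 makes the index grow, here is a small Ruby illustration. It is not how Sphinx stores infixes internally; it simply enumerates every substring of a keyword that is at least min_len characters long, which is roughly the set of extra entries infix indexing has to account for. The infixes method is a hypothetical helper written for this post.

```ruby
# Illustration only: list every distinct substring ("infix") of a keyword
# whose length is at least min_len characters.
def infixes(word, min_len)
  chars = word.chars
  result = []
  (0...chars.length).each do |start|
    # Longest substring from this position runs to the end of the word.
    (min_len..(chars.length - start)).each do |len|
      result << chars[start, len].join
    end
  end
  result.uniq
end

puts infixes("كتاب", 1).length  # substrings of a 4-letter word, min length 1
puts infixes("كتاب", 3).length  # same word, min length 3
```

The count grows roughly with the square of the word length, so raising min_infix_len is one way to trade wildcard flexibility for a smaller index.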
Now, stop, reindex and restart searchd:
rake thinking_sphinx:stop
rake thinking_sphinx:index
rake thinking_sphinx:start
Finally, for the wildcard search to work, your controller should look something like this:
class PostsController < BaseController
  def search
    @posts = Post.search "*#{params[:search_query]}*"
  end
end
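One caveat with the controller above: the stars are added only at the very beginning and the very end of the query string, so for a query of several words the middle of the phrase gets no wildcards. A possible refinement, sketched here as plain Ruby, is to wrap each word separately. The wildcard_query helper is hypothetical (it is not part of Thinking Sphinx), and it does no escaping of Sphinx query operators.

```ruby
# Hypothetical helper: wrap every whitespace-separated term in stars so that
# each word of the query is matched as an infix.
def wildcard_query(raw)
  raw.to_s.split.map { |term| "*#{term}*" }.join(" ")
end

puts wildcard_query("كتاب جديد")

# In the controller this could be used as:
#   @posts = Post.search wildcard_query(params[:search_query])
```

Calling to_s first means a missing parameter (nil) becomes an empty query instead of raising an error. Thinking Sphinx also documents a :star => true search option that adds the wildcards for you; check whether the version you use supports it.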
You should be enjoying Arabic search now.







