Prefix FullText Index Search with asterisc (*), e.g. *earc*

We recently had a requirement that required us to enable prefix search in sql search – thus we had to make it possible to write a sql statement like:

select * from table where contains(*, ‘*earch*’)

The only other real option to fullfill this requirement would be to do major refactoring which was definitely out of the question, so after agreeing on the impossibility of this, we began comming up with ideas – and actually ended out with this list:

* Code own IWordBreaker implementation (I’ll describe this more in depth below)
* Create a separate table with reversed data. E.g. ‘word’ would be indexes as ‘drow’
* Extract the data that we would need in a separate column and do a LIKE on this
* Investigate other search engines than MS Search and Indexing Services

So, after some discussion we actually made this work with the custom iWordBreaker implementation. The idea was that FullText Indexing utilizes MS Indexing Services, and Indexing services uses this component, IWordBreaker, to break up chunks of text into words. I.e. given the string “some word” the job of the word breaker would be to extract the two words ‘some’ and ‘word’.

So, since the iWordBreaker provides full control over excatly what will be indexed, what we did was to not only index ‘word’, but also the word forms ‘^ord’ and ‘^rd’. Now, MS Search also utilizes the iwordbreaker, so when the user writes a command like:

select * from table where contains(data, ‘*ord’)

then we would replace * with ^ before checking the index. Cool:-)

There where some prerequisites that made this work. First of all we did not have an enormous amount of data – max 5 mio. rows in a table where each row could be a couple of thousand characters long. Enough data to make LIKE operator too slow.

Also, another issue was that – at least for sql2000 which we had to support – iWordBreaker implementations are not isolated to SQLServer, but rather everything which is being indexed on the computer and uses indexing services will use a specific iwordbreaker, so we had to make sure nothing else was using our implementation since it would then be hard to see through the consequences.

Also, we didn’t manage to make this work with managed code, so we had to fall-back to c++.

But, after some effort it worked like a charm. Actually, earlier on we also had some troubles indexing special characters with the standard US IWordBreaker implementation. Our custom iWordBreaker also fixed this problem.

Finally, I should mention that I myself was not the genious that implemented this stuff, and I’m not allowed to make the code public – but now the idea is out there if anyone happens to meet the same issue.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s