Effective full-text search: go simple

May 31, 2020

Are you developing enterprise software? Then you probably faced with the problem caused by LIKE ‘%...%' . There are some options how you can overcome on this issue even without switching to another database or database architecture once you are matured relational database like Microsoft SQL Server.

The personal part also matters, some solutions are not code based, read about it at the end of the article.

Business problem

When you are building a real enterprise software you usually have some sort of data distribution across table - supposed that you are using a relational database system - and in all cases you have several hundred thousands or million of records on certain tables. What you also have is a large number of users who used to use Google Search

with its simple “one textbox” query interface and they are happy with that so they are asking you to allow them to use something similar.

“Like” is your friend

In social media for sure… In SQL, maybe… Like can only work effectively when you are able to deal with your customer/users on the cost of the

implementation/complexity of the environment/functionality. If you can’t do that like can quickly emerge as a performance bottleneck. Let’s discuss why and what agreements you need to make to be able to use the simplest approach from an implementation perspective.

Prerequisites for using “Like”

Basically nothing, which makes it super easy to use. You just need to make sure you store your texts in character typed column like char , nchar , varchar and nvarchar . You can even use it with “unlimited” length ( nvarchar(max) , varchar(max) ).

Potential issues

Designing supporting indexes for the like statement is limited, but as a developer you can have multiple approaches. All of these have only one aim: avoid full table scans on large tables, basically to limit the number of strings in which the like checks for presence of the search phrase.

Of course it also matters when you need to check for certain words in fields containing book length text, but we will skip “War and peace” issue for now.

Only ending wildcard allowed

This is an easy to implement solution, but the result will not be Google Search like as you only search for the start of the string. Example:

SELECT [Id], [Name] FROM main WHERE [Name] LIKE 'bla%'

| Id | Name |

-------------------

| 1 | Black bird |

| 5 | Blast |

| 9 | Blackberry |

But the below records will not be found using the same search pattern:

| Id | Name |

------------------

| 2 | emblaze | << Matching pattern would be '%bla%' | 6 | rambla | << Matching pattern would be '%bla' or '%bla%'

So why does the first example perform well while the second is problematic? If you create a non-clustered secondary index on the [name] (searched) column you can avoid table scans and the index can be used for matching. Check the execution plan below without the additional index.

Execution plan for LIKE query without supporting index — full table scan

Let’s create the index on the table and check how the execution plan change.

CREATE NONCLUSTERED INDEX [IDX_main_Name] ON [dbo].[main] (

[name] ASC

)

Execution plan for LIKE query with supporting index — Index Seek

In case of the second and third patterns, the index will be completely useless for finding the records due to the order of records within the index. This ultimately means that the first type of LIKE query will use Index Seek operation while the rests remain using Clustered Index Scan (full table scan).

Keep in mind: if you have a lot of data on your table and you fulfill the request of your customer by implementing the LIKE '%...%' solution for millions of records, and the solution can only be wrong/slow, it will fall back on you!

You are the one responsible for the solution and implementing something that should never work/scale by definition is a waste. As a software developer you need to make sure that at least you warn your customer/product owner.

In other words if you don’t have any other options/columns to filter your table in addition, just agree with your customer that only the ending wildcard will be

allowed. Of course if you have additional columns just read on the next chapter.

Everything is allowed with limitations

There is another option how you can use the LIKE statement effectively. It is also about reducing the number of records that the SQL Server will need to scan but you prefilter the table in advance. For instance if you have a category field on your table with near even distribution of records within categories or creation time of the records matter you can apply additional WHERE filters and at the same time you make sure that only a smaller number of records will be checked by the LIKE statement.

SELECT [id],[name]

FROM [a].[dbo].[main]

WHERE categoryId = 1 AND [name] LIKE '%bla%'

Of course if you have a user interface - other than SSMS - you can apply UI restrictions like only allow selection of a week for a time period or choose only one category at a time.

On my test table with only a few records it wasn’t enough just to create an index on the categoryId column to make SQL Server use that. Without index hints I had to add another trick to make it seems less resource intensive than using a clustered index scan. The trick was that when I created the index I added and included column as well ( name ). With this I was able to achieve that when the SQL Server uses this new index for filtering on categories, it will not need to go back to the table for checking the value of the name field because it is in the index already which is in the memory.

CREATE NONCLUSTERED INDEX [IDX_main_categoryId] ON [dbo].[main] (

[categoryId] ASC

) INCLUDE([name])

Using included columns in indexes can be a good technic in many cases, but you need to avoid feeding up your indexes as well because with that you will just increase the disk and memory usage without real benefits.

Overall in this case as you can see below the execution plan has changed:  Like query on prefiltered table

If you are familiar with relational databases a little bit you might have heard about index hints as well for forcing SQL Server to use specific indexes. Using index hints is not something that people used to propose there are several other options to encourage index usage, but in general it is better to rely on the query optimizer to choose the best one for the certain query.

There is one thing that you should never do in your code:

Do not embed index hints in your non-SQL (C#, Java, etc.) code if you can’t change that any time in minutes. These can cause serious performance issues in a production environment and can reduce the toolset of your DBA to solve the problem.

By the way, do not use index hints at all in general.

Personal aspects

Some of the developers feel forced to implement something and rather avoid the discussion with the customer/product owner which used to result in people pointing to each other at the end.

Keep in mind that in most of the cases the customers/product owners would like to have usable and not too expensive application so they are ready for making deals on cost/functionality/delivery time.

IT people used to work with computers and not with people most of the time. As a result large part of them are introverted and used to belong to the Blue personality type which means they are extremely good for instance in coding but less good in value selling and communication.

If you are a product owner: keep asking questions.

If you are a programmer: understand that the good result is everyone’s interest and do not try to give technical solutions for all the problems. Changing the specification, adjusting the user interface or just replacing some functionality to achieve similar result are options you should always consider.

Next chapter

The story is continued with the Effective full-text search: go advanced article which will show you how to use the real full-text index functionality of SQL Server.

István Kovács
Cloud Architect