Effective full-text search: go deep

July 14, 2020

Searching in documents is easy with SQL Server, but quite challenging on the infrastructure side. In this article I will show you how to index files, what practices people use to index a large number of scanned documents and how to store files in your database.

First of all, everything detailed in the Effective full-text search series' Go advanced and Go relevant articles is applicable here. If you would like to write full-text queries, please check those for the details.

Table structure

SQL Server is a relational database, so we need a table with some columns to store the files and some metadata to support querying and indexing.

As a usual result of “normalization”, documents are put in a table of their own with only the document-related properties, like file name, size, etc., without mixing those with other, non-document-related ones.


CREATE TABLE [dbo].[Doc] (
    [ID] uniqueidentifier ROWGUIDCOL NOT NULL,
    [Extension] [varchar](10) NOT NULL,
    [Content] [varbinary](max) NOT NULL,
    [FileSize] [int] NOT NULL,
    [FileName] [nvarchar](500) NOT NULL,
    [ts] [timestamp] NOT NULL
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[Doc] WITH NOCHECK ADD
    CONSTRAINT [PK_Doc] PRIMARY KEY CLUSTERED ([ID]) ON [PRIMARY]
GO

ALTER TABLE [dbo].[Doc] ADD
    CONSTRAINT [DF_Doc_ID] DEFAULT (newid()) FOR [ID]
GO

Let’s check the columns one-by-one:

ID: you need a unique id simply to be able to turn on full-text indexing. A typical usage scenario for files stored in a database, especially in web-based systems, is downloading them. An ID of type UniqueIdentifier (GUID), as opposed to INT/BIGINT, can be used in a URL without giving users the option to download all the files from the database just by incrementing a number in the URL. Of course this is not real security; it only makes things harder.

Using UniqueIdentifier as a primary key instead of a number type like INT/BIGINT is generally not recommended, because it can result in a large number of page splits and costly index maintenance. Consider having a BIGINT as the primary key and a UniqueIdentifier as a unique key on the table to fulfill both the technical and the usage needs, according to your own scenario.
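A minimal sketch of that combined layout (the [Doc2] table and [DocId] column names are hypothetical, for illustration only):

CREATE TABLE [dbo].[Doc2] (
    -- Sequential surrogate key: cheap clustered index, few page splits
    [DocId] bigint IDENTITY(1,1) NOT NULL
        CONSTRAINT [PK_Doc2] PRIMARY KEY CLUSTERED,
    -- Public, non-guessable id to be used in download URLs
    [ID] uniqueidentifier ROWGUIDCOL NOT NULL
        CONSTRAINT [DF_Doc2_ID] DEFAULT (newid()),
    -- ... Extension, Content, FileSize, FileName, ts as above ...
    CONSTRAINT [UQ_Doc2_ID] UNIQUE NONCLUSTERED ([ID])
)
GO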

Extension: this is the extension of the file you store, like ‘.pdf’, ‘.docx’, ‘.xlsx’. This field is used by SQL Server to select the right indexer (iFilter) to get the text out of the file. The iFilters are not SQL Server components; they belong to the operating system.

Content: this should be a binary field for storing the content of the file, preferably a varbinary(max) column, because the image type is deprecated.

Filesize: the first optional field that I always add to the table because of a very typical use case: you need to show the size of the file for the users to be able to decide whether they would like to download it or not. Calculating the size on the fly is possible using the DATALENGTH function, but it can be very slow at query time in certain scenarios.

Filename: if you ever want to save your files to a file system or send them in emails, you will need a readable name, and the original file name is a good candidate instead of some default names or GUIDs. This field is not unique; you need to be prepared for duplicates.

Timestamp (ts): this field is required to be able to use incremental indexing instead of change tracking, which might be necessary in certain scenarios.

Timestamp on SQL Server has nothing to do with time. It is just an increasing number in binary format which can be used for detecting modifications and, for this reason, for instance to synchronize data across systems. If you need real time, use datetimeoffset.
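To illustrate how the timestamp supports change detection, here is a minimal sketch (the @LastTs variable and its bookkeeping are hypothetical):

-- Fetch the rows modified since the previous run; the caller stores
-- the highest [ts] value it has seen and passes it back next time.
DECLARE @LastTs binary(8) = 0x0000000000000000
SELECT [ID], [FileName], [ts]
FROM [dbo].[Doc]
WHERE [ts] > @LastTs
ORDER BY [ts]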

Trick against lazy developers

I have worked together multiple times with a real DBA, Viktor Suha, who is also an author on Pluralsight, and he mentioned a trick I will share with you:

Create a non-persisted computed column using a very simple formula: 1/0.

What is the result? It becomes impossible to run a query like SELECT * FROM [Doc], because 1/0 is “calculated” on the fly (it is not persisted) and the query will fail.
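A minimal sketch of the trick (the [DoNotSelectStar] column name is made up for illustration):

-- The computed column is not persisted, so the division only runs
-- when the column is actually selected and a row comes back.
ALTER TABLE [dbo].[Doc]
    ADD [DoNotSelectStar] AS (1/0)
GO

-- SELECT * now fails with a divide-by-zero error, while queries
-- with explicit column lists keep working:
SELECT [ID], [FileName] FROM [dbo].[Doc]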

The reason I am mentioning it, other than SELECT * being a really bad practice in general, is that in this specific case it can result in a huge server load: the server will start to read ALL the files stored in the table from disk (depending on the WHERE clause), and this is something that you don’t want to do on a production environment unless you need to.

Make sure you only apply this trick on your test system to avoid unexpected issues and be prepared for some comments from the developers.

Bad practice? What?

If you run queries like SELECT * FROM ..., your code will not be foolproof. When new columns are added to the table, you will get more data back which may not be required or processed by the caller, so it is a waste. Just imagine adding large binary columns… Will that stop or slow down your app?

When columns are removed from the table, your query will not fail, just the application processing the result, so if you don’t have wide test coverage you may not even notice the issue before regression testing… or after going live.

Configure full-text indexing

This time let’s use SQL Server Management Studio instead of scripting. Find your table, right-click and…

A wizard will show up, where the most important steps are selecting your binary column and adding a “type”, which needs to be the file extension.

You need to enable change tracking unless you are working on an edge case scenario.

The rest of the steps are quite obvious: select a full-text catalog that you already have, and so on.
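For reference, the scripted equivalent looks roughly like this, assuming no catalog exists yet (the [DocCatalog] name is just an example):

CREATE FULLTEXT CATALOG [DocCatalog]
GO

-- TYPE COLUMN tells SQL Server which column holds the file extension,
-- so it can pick the right iFilter for each row.
CREATE FULLTEXT INDEX ON [dbo].[Doc] ([Content] TYPE COLUMN [Extension])
    KEY INDEX [PK_Doc] ON [DocCatalog]
    WITH CHANGE_TRACKING AUTO
GO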

Add some records

The easiest is to add some “text” type records, which can be indexed by SQL Server by default.

INSERT INTO [dbo].[Doc] ([Extension], [Content], [FileSize], [Filename])
VALUES ('.txt', CAST('This is the content of my kind short text file.' AS varbinary(max)), 47, 'kind.txt')

INSERT INTO [dbo].[Doc] ([Extension], [Content], [FileSize], [Filename])
VALUES ('.txt', CAST('This is the content of my nice short text file.' AS varbinary(max)), 47, 'nice.txt')

INSERT INTO [dbo].[Doc] ([Extension], [Content], [FileSize], [Filename])
VALUES ('.htm', CAST('<html><body><h1>Terence Hill</h1> </body></html>' AS varbinary(max)), 48, 'terence.htm')

You probably noticed that the file size is not calculated automatically in this example; the business logic is expected to provide it. Alternatively, I encourage you to use a persisted computed column for this purpose instead.

Filesize can be calculated using the DATALENGTH function, whose return value is a BIGINT for varbinary(max), which makes it feasible for file length calculation, but only at INSERT/UPDATE time. Try to avoid using it at query time!
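A minimal sketch of the persisted computed column alternative (an illustration of the idea, not part of the article’s schema):

-- Replace the manually maintained size with a computed, stored value.
ALTER TABLE [dbo].[Doc] DROP COLUMN [FileSize]
GO

-- PERSISTED means DATALENGTH runs at INSERT/UPDATE time rather than
-- at query time, so reads stay cheap.
ALTER TABLE [dbo].[Doc]
    ADD [FileSize] AS (DATALENGTH([Content])) PERSISTED
GO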

There are two other things to notice:

HTML and text files are handled by the full-text search engine by default; you don’t need to install additional iFilters to index those.

The content of the files is converted (CAST) to binary.

Test the solution

As shown in the previous articles of the series, you can use a simple query like the one below:

SELECT [ID], [Extension], [FileSize], [FileName], [ts]
FROM [dbo].[Doc]
WHERE FREETEXT([Doc].[Content], 'Terence')

Indexing different file types

SQL Server relies on iFilters, just like Windows Search or SharePoint. This means that if you install additional iFilters on the SQL Server machine, the server will be able to index those types of files as well. It sounds great, but… at this point you need to think carefully and test before you install anything.

Every iFilter is different in memory and CPU usage. As a result, if you add a “heavy” iFilter and you have millions of documents of that type, your full index update time may increase to an unacceptable level, or newly added documents may only appear in the results after a long delay.

If you just add a new iFilter but there are already a lot of documents in your table, you need to run a full index population.
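A rough sketch of the commands involved, assuming the iFilter itself is already installed at the operating system level:

-- Let SQL Server pick up iFilters installed in the operating system;
-- the full-text service has to be restarted for this to take effect.
EXEC sp_fulltext_service 'load_os_resources', 1
GO

-- Re-index the existing rows with the newly available iFilter:
ALTER FULLTEXT INDEX ON [dbo].[Doc] START FULL POPULATION
GO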

The iFilters most typically missing from computers:

Microsoft Office

Microsoft publishes this as an independent package (the Office Filter Pack). You just need to download and install it to be able to index Office files.

PDF

There are both free and commercial iFilters on the market from multiple vendors. The most well-known ones are from Adobe, Foxit and PDFLib. Some of the free ones are limited to a single thread, which slows down processing.

TIFF

It is part of Windows, but as a separate component which is not installed by default. Indexing TIFF files basically means running OCR (Optical Character Recognition), which is quite resource-intensive.
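You can check which file types your instance is currently able to index, and which iFilter handles each of them, with a catalog view query:

-- Lists the registered document types and the iFilter behind each one
SELECT [document_type], [path], [version], [manufacturer]
FROM sys.fulltext_document_types
ORDER BY [document_type]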

Indexing scanned documents

If you have only a few documents, then you can even pick the TIFF iFilter. If you have thousands or millions of documents, you need to think about your whole document processing workflow.

The scanners which are capable of scanning millions of documents in a short time frame mostly come well equipped with software capable of creating hidden-text PDFs. These PDF files store the scanned picture as the background of the PDF and write the recognized text as a transparent layer over the picture, aligned to the printed characters, so people can even copy-paste from it.

There are multiple benefits here:

Usage of PDFs is easier: PDF readers are widely available, feature-rich and well known by users.

Text selection works, so reusing part of the document is relatively easy, apart from the occasional funny results produced by the OCR engine.

Hidden-text PDFs already contain the text, so there is no need for further OCRing. This results in a lower processing cost on the full-text indexing side.

Solutions of this type from large vendors like Canon or Kofax used to cost a fortune, so they are not for ad-hoc scenarios or for one-time scanning of a larger number of documents.

Availability

Before building an app on SQL Server’s file indexing capability, you need to validate whether it will work for you or not. If you only support on-premises or IaaS environments, you can make it work from an infrastructural perspective, but if you plan to support Azure SQL today, then this is almost a no-go for you.

Azure SQL doesn’t allow you to add any non-default filters, including Office and PDF. Without those, having filters only for different text file formats (not even CSV), it can be hard to cover your use case.

What’s next

This article has become long enough, so a short one will come next about optimizing file storage in SQL Server.

István Kovács
Cloud Architect