Monday, April 16, 2012

When numbers are important to your organization

I was recently contacted by a financial organization that had trouble with the ordering of their search results in FAST Search Server 2010 for SharePoint. When a search query included a year number, the year appeared to carry no weight in the ranking of the results.

A typical query could be income statement 2010, but the top results returned documents for 2006.

Clearly the year 2010 was being left out when the results were ranked, and a query term that is excluded from ranking points to a stop word threshold issue.

The total number of indexed documents wasn’t particularly large, around 1.1 million, but when documents contain financial data the dictionary will contain a lot of numbers, and chances are that some of them will reach the stop word threshold and be excluded when calculating rank. This is fine for words like “the”, “a” and “for”, but year numbers are different, as people commonly use them to categorize content.

Looking at the default rank profile, there are two properties concerning stop word thresholds, StopWordThreshold and PositionStopWordThreshold.
PS C:\FASTSearch\bin> Get-FASTSearchMetadataRankProfile

Name                              : default
isDefault                         : True
RankModelName                     : default
StopWordThreshold                 : 2000000
PositionStopWordThreshold         : 20000000
QualityWeight                     : 50
AuthorityWeight                   : 80
QueryAuthorityWeight              : 50
FreshnessWeight                   : 100
FreshnessResolution               : Day
FreshnessManagedPropertyReference : Write

TechNet describes them as follows:
StopWordThreshold

This integer parameter sets the stop word threshold of the rank profile.

A stop word is a search term that is so common in the result set that it is not counted as part of the relevancy calculation.

When a query term exceeds this threshold, FAST Search Server 2010 for SharePoint retries the query with a higher full text index importance level until it can find a level where the query term is not a stop word (see Set-FASTSearchMetadataFullTextIndexMapping for details about importance levels).

If this is not possible, the query term is not included in the result set's relevancy. A low StopWordThreshold value gives better search performance, but a lower result set relevancy (since there is a bigger chance that a query term does not influence which items are in the result set).

PositionStopWordThreshold

This integer parameter sets the position stop word threshold.

If a query term occurs more often than position-stop-word-threshold (independent of the number of items it occurs in), then proximity relevancy calculations are not done for that term.

If the query term count does not exceed the position stop word threshold, an extra rank score is added if query terms are positioned close to each other in the managed properties.

If you do not want to use proximity as part of the relevancy model, set this parameter to 0 to disable proximity calculation. This will decrease CPU use when searching.

The issue at hand was the PositionStopWordThreshold: the year 2010 occurred more than 20 million times in total across the index. That works out to an average of roughly 18 or more occurrences per document, which is not unlikely for 1.1 million financial documents.

The simple solution

Raise the threshold. In this case it was set to 100 million to be on the safe side, five times the original value.

One has to be aware that raising the threshold will increase CPU usage when searching, as more proximity calculations have to be done, but we’ll dismiss this as a scaling issue for now :)
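
A minimal sketch of the change, assuming it is run from the FAST Search Server 2010 for SharePoint shell on the admin server. As far as I know, Set-FASTSearchMetadataRankProfile is the cmdlet that carries the PositionStopWordThreshold parameter described above. The profile name default and the value 100000000 are simply the ones from this case, so adjust them to your own index.

```powershell
# Assumption: running inside the FAST Search Server 2010 for SharePoint shell,
# where the FAST cmdlets are already loaded.

# Show the current thresholds on the default rank profile.
Get-FASTSearchMetadataRankProfile -Name default |
    Format-List Name, StopWordThreshold, PositionStopWordThreshold

# Raise the position stop word threshold from 20 million to 100 million.
Set-FASTSearchMetadataRankProfile -Name default -PositionStopWordThreshold 100000000
```

Running Get-FASTSearchMetadataRankProfile again afterwards is an easy way to confirm that the new value was stored.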

The not so simple solution

If you are able to write some code and create a custom core results web part, you could do the following. Create a managed property called year and populate it with a custom extensibility stage during indexing. When executing a search, sort first on the year managed property if the query contains a year number, and second on the rank profile.

This way you can keep the stop word threshold as is.
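
The first step of that approach is deciding whether the query contains a year number at all. The sketch below shows one way to do that check in PowerShell, purely to illustrate the logic. The function name Get-QueryYear and the 1900-2099 range are my own assumptions, and in a real web part the same regular expression would live in the web part's own code.

```powershell
# Hypothetical helper: returns the first four-digit year (1900-2099) found in
# a query string, or $null when the query contains none.
function Get-QueryYear {
    param([string]$Query)

    # \b keeps the match from firing inside longer numbers such as 120105.
    $match = [regex]::Match($Query, '\b(19|20)\d{2}\b')
    if ($match.Success) {
        return [int]$match.Value
    }
    return $null
}

# Usage: decide the sort order based on whether a year was found.
$year = Get-QueryYear 'income statement 2010'
if ($null -ne $year) {
    "Sort on the year managed property first, then on rank"
} else {
    "Sort on rank only"
}
```

A word-boundary match is deliberately strict, so amounts and account numbers that merely contain a year-like sequence do not trigger the year sort.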

Pages 431-432 of “Working with Microsoft FAST Search Server 2010 for SharePoint” provide a sample of a custom core search results web part that does something similar.