A typical query could be “income statement 2010”, but the top results returned were documents for 2006.
Clearly the year 2010 was being left out when ranking the results, and a query term that is ignored by ranking points to a stop word threshold issue.
The total number of documents indexed wasn’t particularly large, around 1.1 million, but when these contain financial data the dictionary will contain a lot of numbers, and chances are that some of them will reach the stop word threshold and be excluded when calculating rank. This is fine for words like “the”, “a” and “for”, but year numbers are a bit different, as humans commonly use them when categorizing.
Looking at the default rank profile, there are two properties concerning stop word thresholds, StopWordThreshold and PositionStopWordThreshold.
PS C:\FASTSearch\bin> Get-FASTSearchMetadataRankProfile
Name                              : default
isDefault                         : True
RankModelName                     : default
StopWordThreshold                 : 2000000
PositionStopWordThreshold         : 20000000
QualityWeight                     : 50
AuthorityWeight                   : 80
QueryAuthorityWeight              : 50
FreshnessWeight                   : 100
FreshnessResolution               : Day
FreshnessManagedPropertyReference : Write
TechNet describes the two properties as follows.
StopWordThreshold
This integer parameter sets the stop word threshold of the rank profile.
A stop word is a search term that is so common in the result set that it is not counted as part of the relevancy calculation.
When a query term exceeds this threshold, FAST Search Server 2010 for SharePoint retries the query with a higher full text index importance level until it can find a level where the query term is not a stop word (see Set-FASTSearchMetadataFullTextIndexMapping for details about importance levels).
If this is not possible, the query term is not included in the result set's relevancy. A low StopWordThreshold value gives better search performance, but a lower result set relevancy (since there is a bigger chance that a query term does not influence which items are in the result set).
PositionStopWordThreshold
The issue at hand was the PositionStopWordThreshold, as the year 2010 occurs more than 20 million times altogether. That is an average of about 18 occurrences per document, which is not unlikely for 1.1 million financial documents.
This integer parameter sets the position stop word threshold.
If a query term occurs more often than position-stop-word-threshold (independent of the number of items it occurs in), then proximity relevancy calculations are not done for that term.
If the query term count does not exceed the position stop word threshold, an extra rank score is added if query terms are positioned close to each other in the managed properties.
If you do not want to use proximity as part of the relevancy model, set this parameter to 0 to disable proximity calculation. This will decrease CPU use when searching.
The simple solution
Up the threshold count. In this case it was set to 100 million to be sure, 5 times the original value. One has to be aware that raising the threshold will increase CPU usage when searching, as more calculations have to be done, but we’ll dismiss this as a scaling issue for now :)
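As a rough sketch, the change could be applied from the FAST Search PowerShell shell like this. It assumes the rank profile is the one named default, as in the output earlier, and that the Set-FASTSearchMetadataRankProfile cmdlet accepts the threshold as a parameter, as documented on TechNet; try it in a test environment first.

```powershell
# Raise the position stop word threshold on the default rank profile
# from 20 million to 100 million (5 times the original value).
Set-FASTSearchMetadataRankProfile -Name default -PositionStopWordThreshold 100000000

# Read the profile back to confirm the new value.
Get-FASTSearchMetadataRankProfile -Name default
```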
The not so simple solution
If you are able to write some code and create a custom core results web part, you could do the following. Create a managed property called year, which you populate with a custom pipeline extensibility stage during indexing. When executing a search, you first sort on the year managed property if the query contains a year number, and then on the rank profile.
This way you can keep the stop word threshold as is.
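The managed property part of this could look roughly like the sketch below. It is an assumption on my side, not a tested setup: the name year comes from the description above, type code 2 is taken to mean Integer as listed on TechNet, and the property is made sortable so the web part can sort on it. The extensibility stage that fills the property and the web part itself are not shown.

```powershell
# Create an integer managed property named "year" (type code 2 = Integer, assumed from TechNet).
New-FASTSearchMetadataManagedProperty -Name year -Type 2 -Description "Year the document applies to"

# Make the property sortable so a query can sort on it before rank.
Set-FASTSearchMetadataManagedProperty -Name year -SortableType SortableEnabled
```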
Pages 431-432 in “Working with Microsoft FAST Search Server 2010 for SharePoint” provide a sample of a custom core search results web part which does something similar.