Saturday, December 10, 2011

How to prevent an item from being indexed with FAST for SharePoint

Yes, it can be done!

[Update 2012-05-25: There is an even better hack to solve this issue, as Kai wrote in a comment below this post. Check out Segei Sergeev's answer in this TechNet thread.]

It’s Saturday, my kid has gone to sleep, and I finally have time to tell you guys the good news. Preventing an item from being indexed, or in other words dropping a document in the document processing pipeline, is indeed possible!

You can already prevent items from being indexed by excluding SharePoint lists and libraries from crawling in their library settings, and you can use crawl rules to exclude certain URL patterns. But what I am talking about is preventing an item from being indexed based on business rules in your organization, by looking at the metadata of the item or inside the text of a file.

There are many scenarios for not wanting an item searchable. You might want to prevent indexing items in your organization which contain the super secret codename “wobba”, or items of a certain ContentType. When indexing file shares you don't have much metadata to go on at all for excluding items, so creating your own module with the proper rules might be the only way to go.

Up until this post, this was not easily doable with FAST for SharePoint. With built-in SharePoint Search it’s still impossible (unless you create a custom crawler).
There have been at least a couple of threads about this at the FS4SP TechNet forum, and we concluded long ago that this is not an easy task and cannot be done in a supported manner. (Supported meaning not editing config files which the documentation on TechNet tells us not to touch.)

I had sort of thought about how to do this earlier, but I didn’t figure it out before reading Leonardo Souza’s blog post the other day about: How "Remove Duplicate Results" works in FAST Search for SharePoint.

Leo talks about a property called “documentsignaturecontribution” which can be used to append data to the document signature checksum in the FS4SP content processing pipeline. But in order to assign data to this property you have to create a managed property by that exact name, and output your custom data to a crawled property of your choosing which is mapped to the managed property.

The reason you have to work with a managed property is that the document signature stage appears after the stage which maps crawled properties to managed properties, and all stages below the mapper stage work on managed properties. This find by Leo is just so cool, and there is no documentation on this anywhere as far as I’ve seen.

So, over to our problem. Which stage runs just before the document signature stage and comes to our aid?

The “Offensive Content Filter” stage

This stage also has an additional attribute where you can assign data, named “ocfcontribution”. There’s only vague documentation on MSDN on how to assign data to this field, and it refers to using the XMLMapper. Using the XMLMapper means indexing XML documents, which is a bit limiting.

The thing about the offensive content filter is that it will prevent documents from being indexed if they contain a certain amount of bad language. If you get embarrassed by such words, then skip reading :)

So now we have a stage which can drop items. What remains is to assign enough bad words to “ocfcontribution” to get above the threshold it triggers on.

First off, enable the Offensive Content Filter by editing C:\FASTSearch\etc\config_data\DocumentProcessor\optionalprocessing.xml and setting active="yes" on the OffensiveContentFilter processor entry.

Next, create a managed property called “ocfcontribution” of type “Text”, and also a crawled property with this name. The GUID for the property set is one I have chosen for a test group in my system. Replace it with one that suits your own system.
$mp = New-FASTSearchMetadataManagedProperty -Name ocfcontribution -Type 1
$cp = New-FASTSearchMetadataCrawledProperty -Name ocfcontribution -Propset fa585f53-2679-48d9-976d-9ce62e7e19b7 -VariantType 31
New-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $mp -CrawledProperty $cp

In order to test this I have created an xml file named “drop.xml” which I placed in C:\FASTSearch\pipelinemodules with the following contents

<?xml version="1.0" encoding="utf-8"?>
<Document>
   <CrawledProperty propertySet="fa585f53-2679-48d9-976d-9ce62e7e19b7" propertyName="ocfcontribution" varType="31">fuck shit porn cunt cock dick</CrawledProperty>
</Document>

Next I added the following custom extensibility stage to C:\FASTSearch\etc\pipelineextensibility.xml

<Run command="copy C:\FASTSearch\pipelinemodules\drop.xml %(output)s">
    <Output>
        <CrawledProperty propertySet="fa585f53-2679-48d9-976d-9ce62e7e19b7" varType="31" propertyName="ocfcontribution"/>
    </Output>
</Run>

For each item, this stage assigns the contents of drop.xml to “ocfcontribution”, effectively dropping all items, which is OK for test purposes. In practice you would of course create a custom module instead, one which implements your business rules for when an item should be dropped.
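To give an idea of what such a custom module could look like, here is a minimal sketch in Python. It is not something I have run inside FS4SP, and it rests on several assumptions: that the Run command is changed to something like "python dropfilter.py %(input)s %(output)s" (dropfilter.py is a made-up name), that an <Input> element is added so the crawled properties you want to inspect are passed in, that the input file uses the same <Document>/<CrawledProperty> layout as drop.xml, and that an empty <Document/> is acceptable output for items that should pass. The rule itself (drop anything mentioning “wobba”) is just the example from earlier in this post.

```python
import shutil
import sys
import xml.etree.ElementTree as ET

# Path to the drop.xml file described above.
DROP_XML = r"C:\FASTSearch\pipelinemodules\drop.xml"
# Hypothetical business rule: drop anything mentioning this codename.
SECRET_WORD = "wobba"
# Assumed to be valid output for an item that should be indexed normally.
EMPTY_DOCUMENT = '<?xml version="1.0" encoding="utf-8"?>\n<Document/>\n'


def should_drop(input_path, secret=SECRET_WORD):
    """Return True if any crawled property value contains the secret word."""
    root = ET.parse(input_path).getroot()
    for prop in root.iter("CrawledProperty"):
        value = "".join(prop.itertext())
        if secret.lower() in value.lower():
            return True
    return False


def process(input_path, output_path, drop_xml=DROP_XML):
    """Write the stage output; returns True when the item is marked for dropping."""
    if should_drop(input_path):
        # Reuse drop.xml so ocfcontribution gets the trigger words.
        shutil.copyfile(drop_xml, output_path)
        return True
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(EMPTY_DOCUMENT)
    return False


# Expected invocation (an assumption): python dropfilter.py <input> <output>
if __name__ == "__main__" and len(sys.argv) == 3:
    process(sys.argv[1], sys.argv[2])
```

The point of reusing drop.xml is that the trigger words stay in one place, and the module only decides whether or not to hand them to the Offensive Content Filter.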

Next issue “psctrl reset” to reload your config files and use for example “docpush” to index an item, and it will not be indexed, as the output below shows.

PS C:\temp> docpush -c test sample.txt
[2011-12-10 20:31:02.677] ERROR      test Reported error with http://cohowinery.com/sample.txt: processing:OffensiveContentFilter:ERROR: Processor error status: NotPassing
[2011-12-10 20:31:03.678] INFO       test All add operations completed

I hope someone will find this trick useful. It seems you can use English words to trigger the filter, no matter the language of your items.

PS! If you enable the Offensive Content Filter and have content with explicit language, you could risk some items not being indexed with this method.