Thursday, February 5, 2015

Entity extraction in SharePoint based on the path managed property...

...is not possible, so a workaround is needed to accomplish this.
As background information, I’m indexing a file server with a structure like the one below:
Below each of these end points there can be any number of subfolders, for example \\server\share\HR\john\sickleaves\test.docx. The point is that I want a refiner for HR/F/MKT which should read:
  • Human Resources
  • Finance
  • Marketing
By creating a CSV dictionary, you can use the word extraction option on a managed property in SharePoint to look for a term and emit another term (the display form) in its place. The documentation for how to create and upload a dictionary can be found at
Note that this feature is not available for SharePoint Online even though the check boxes are available to you.

So back to the original issue. If you check off the option to use a dictionary for the path managed property (or originalpath or sitepath), it just won’t work. Why, I don’t know, but there is a workaround.
You can create a new managed property and map the same crawled properties to it as you find on the path managed property. Below I have created one called NoRecall which has no specific features set as I only want to use it as an extraction point.

  • Basic 11 - b725f130-47ef-101a-a5f1-02608c9eebac
  • Basic 9 - 49691c90-7e17-101a-a91c-08002b2ecda9
  • Web 2 - 70eb7a10-55d9-11cf-b75b-00aa0051fe20
The GUIDs listed are the property set IDs for the crawled properties. These are needed because there are multiple crawled properties with the same name in the Basic category under different GUIDs. If you do the mapping using the UI, include all of them to ensure you get the right one.

Once you have mapped the crawled properties to the managed property, uploaded the CSV file, checked off Word Extraction on the NoRecall managed property, and run a re-crawl, you will see the expanded values appearing in the managed property WordCustomRefiner1. You might want to add an alias to this property as well, to reference it more easily in your search solution.

My CSV file for the case above looks like this:

Key,Display form
server\share\HR\,Human Resources
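
If you have more than a handful of folders, it can be easier to generate the dictionary than to type it. The following is a minimal Python sketch, not part of my original setup. Only the HR row comes from the file above; the F and MKT folder names are assumptions based on the HR/F/MKT refiner described earlier, and build_dictionary_csv is a hypothetical helper name.

```python
import csv
import io

# Folder-to-label mapping. Only the HR entry comes from the original CSV;
# the F and MKT folder names are assumed from the "HR/F/MKT" description.
DEPARTMENTS = {
    "HR": "Human Resources",
    "F": "Finance",
    "MKT": "Marketing",
}


def build_dictionary_csv(root, departments):
    """Return CSV text with the 'Key,Display form' header and one row per folder."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Key", "Display form"])
    for folder, label in departments.items():
        # Keep the trailing backslash so the key ends at a folder boundary.
        writer.writerow([f"{root}\\{folder}\\", label])
    return buffer.getvalue()


csv_text = build_dictionary_csv(r"server\share", DEPARTMENTS)
print(csv_text)
```

Write csv_text to a file and upload that file as your dictionary in the usual way.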

To avoid false positives, include as much of the path as possible in each key.
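
To illustrate why the longer key matters, here is a rough Python sketch. It uses plain case-insensitive substring matching as a stand-in, which is an assumption; SharePoint's actual word extraction may tokenize and match differently. The marketing path and the matching_labels helper are hypothetical.

```python
def matching_labels(path, dictionary):
    """Return the display forms whose key occurs anywhere in the path.

    Plain case-insensitive substring matching, used only as an approximation.
    """
    lowered = path.lower()
    return [label for key, label in dictionary.items() if key.lower() in lowered]


# A short key versus a key that includes as much of the path as possible.
short_keys = {"HR": "Human Resources"}
long_keys = {"server\\share\\HR\\": "Human Resources"}

# Hypothetical marketing document that happens to sit in a subfolder named HR.
marketing_doc = r"\\server\share\MKT\campaigns\HR\flyer.docx"
hr_doc = r"\\server\share\HR\john\sickleaves\test.docx"

print(matching_labels(marketing_doc, short_keys))  # short key also hits the marketing file
print(matching_labels(marketing_doc, long_keys))   # long key does not
print(matching_labels(hr_doc, long_keys))          # long key still hits the HR file
```

With the short key the marketing document would be tagged as Human Resources; with the full path prefix only documents under the HR share are.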

To sum it up:
  • Create a new managed property (or use a reusable one)
  • Replicate the crawled property mappings of the path managed property
  • Upload a dictionary to associate with the word extraction
  • Check the setting to use word extraction on the managed property
  • Re-crawl