Wednesday, December 22, 2010

How To: Debug and log FAST Search for SharePoint pipeline extensibility stages with Visual Studio

One of the most powerful features of FAST Search for SharePoint is the ability to work on the indexed data before it is made searchable. This can include extracting location names from the documents being indexed, or enriching the data from external sources, for example adding financial data to a customer’s CRM record based on a lookup key. Only your imagination limits the possibilities.
As the extensibility demo code seems to be missing from MSDN, I decided to create a stage which counts the number of words in the crawled document. There is a special crawled property set with three fields: “body”, which contains the extracted text of the crawled item; “data”, which is the binary content of the source document in base64 encoding; and “url”, which is the link used when displaying results. My stage will use the body field.

First I create a new property set for the crawled property I will emit from my program. I could have used one of the existing ones, but I find it easier to have my custom properties in a separate location. I name the property set “mAdcOW” and assign it an arbitrary GUID. You can get a GUID in PowerShell with the following command:


[guid]::NewGuid()

The PowerShell command to create a new property set/category with my chosen guid looks like this:


New-FASTSearchMetadataCategory -Name "mAdcOW" -Propset FA585F53-2679-48d9-976D-9CE62E7E19B7
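
To verify that the category was created, Get-FASTSearchMetadataCategory should return it; a quick check, assuming you run it from the FS4SP PowerShell shell:

Get-FASTSearchMetadataCategory -Name "mAdcOW"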

The GUID is important, as it is later used in the pipeline extensibility configuration. By default, the property set will add newly discovered properties as they are seen during the crawl, which saves us the work of manually creating the crawled properties we are going to use.

For maintainability I create my own folder for my module below the FASTSearch root, named C:\FASTSearch\pipelinemodules. Check the %FASTSEARCH% environment variable for your actual FS4SP location.

Now over to the actual pipeline stage. In Visual Studio, create a new “Console Application” project. I give it the name “WordCount”.
[Screenshot: the New Project dialog]
In Program.cs I have the following code:
// Requires: using System; using System.Threading;
private static int Main(string[] args)
{
#if DEBUG
    // Pause so there is time to attach the Visual Studio debugger
    Thread.Sleep(1000 * 90);
#endif
    try
    {
        Logger.WriteLogFile(args[0], "input");
        WordCount wc = new WordCount();
        wc.DoProcessing(args[0], args[1]);
        Logger.WriteLogFile(args[1], "output");
    }
    catch (Exception e)
    {
        // This will end up in the crawl log, since exit code != 0
        Console.WriteLine("Failed: " + e.Message + "/" + e.StackTrace);
        return 1;
    }
    return 0;
}

Take notice of the #if DEBUG part. The pause is there to give you time to attach the Visual Studio debugger. I did try to use

System.Diagnostics.Debugger.Break()

but the context the pipeline stage runs under does not have permission to invoke the debugger.
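
An alternative I have not verified in the pipeline context is to poll Debugger.IsAttached instead of sleeping for a fixed time. Reading the property does not try to launch a debugger, so it may work where Break() does not; treat this as a sketch, not something I have tested under the pipeline account:

#if DEBUG
// Untested alternative: wait until a debugger attaches, with a timeout
// so a forgotten debug build cannot stall the pipeline indefinitely
DateTime timeout = DateTime.Now.AddSeconds(90);
while (!System.Diagnostics.Debugger.IsAttached && DateTime.Now < timeout)
{
    System.Threading.Thread.Sleep(100);
}
#endif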

You might also note the Logger.WriteLogFile lines in the Main function. This is something I picked up from an MSDN blog entry and modified a bit, restructuring the code. I also added a configuration key to turn logging on/off and a key for specifying the folder name of the log files. An important piece of information from the blog entry is that you only have write access to the C:\Users\username\AppData\LocalLow folder. Instead of hard coding the folder name, I added code which uses the Win32 API to resolve the correct path in case the profile folder resides on another drive or in a folder other than “Users”.
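
The Logger internals are not listed in this post, but a minimal sketch could look like the following. The SHGetKnownFolderPath call and the FOLDERID_LocalAppDataLow GUID are standard Win32; the “WordCount” subfolder, the file naming, and the missing on/off configuration key are my own simplifications, not the code from the MSDN blog entry.

using System;
using System.IO;
using System.Runtime.InteropServices;

internal static class Logger
{
    // FOLDERID_LocalAppDataLow, i.e. <profile>\AppData\LocalLow
    private static readonly Guid LocalAppDataLow = new Guid("A520A1A4-1780-4FF6-BD18-167343C5AF16");

    [DllImport("shell32.dll")]
    private static extern int SHGetKnownFolderPath(
        [MarshalAs(UnmanagedType.LPStruct)] Guid rfid,
        uint dwFlags, IntPtr hToken, out IntPtr ppszPath);

    private static string GetLocalLowPath()
    {
        IntPtr pszPath;
        if (SHGetKnownFolderPath(LocalAppDataLow, 0, IntPtr.Zero, out pszPath) != 0)
            throw new IOException("Could not resolve the LocalLow folder");
        try { return Marshal.PtrToStringUni(pszPath); }
        finally { Marshal.FreeCoTaskMem(pszPath); }
    }

    public static void WriteLogFile(string sourceFile, string suffix)
    {
        // Copy the pipeline xml file to LocalLow\WordCount with a timestamped name
        string logDir = Path.Combine(GetLocalLowPath(), "WordCount");
        Directory.CreateDirectory(logDir);
        string target = Path.Combine(logDir,
            string.Format("{0:yyyyMMdd_HHmmss_fff}_{1}.xml", DateTime.Now, suffix));
        File.Copy(sourceFile, target, true);
    }
}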

DoProcessing takes two arguments: the input file to read and the output file to write. These are passed in from the document processor pipeline, and this is how custom stages work: they read an XML file with the data to process, and write out a new one with the new/modified data.
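
To make this concrete, here is roughly what the two files look like for this stage. The Document/CrawledProperty structure follows the code below and the pipelineextensibility.xml configuration shown later; the values themselves are made up for illustration.

Input file:

<Document>
  <CrawledProperty propertySet="11280615-f653-448f-8ed8-2915008789f2" propertyName="body" varType="31">Hello FS4SP pipeline world</CrawledProperty>
  <CrawledProperty propertySet="11280615-f653-448f-8ed8-2915008789f2" propertyName="url" varType="31">http://intranet/docs/sample.docx</CrawledProperty>
</Document>

Output file:

<Document>
  <CrawledProperty propertySet="fa585f53-2679-48d9-976d-9ce62e7e19b7" propertyName="wordcount" varType="20">4</CrawledProperty>
</Document>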

The code which counts the words uses the XDocument class and LINQ to XML for reading and writing the input and output data. At the top you see a declaration of the GUID I used for my property set, and the GUID of the special crawled property set with the body property. These are the same as in the pipelineextensibility.xml configuration file; in short, we select what was specified there.
// Requires: using System; using System.Linq;
//           using System.Text.RegularExpressions; using System.Xml.Linq;
internal class WordCount
{
    // This propset contains url/body/data - http://msdn.microsoft.com/en-us/library/ff795815.aspx
    private static readonly Guid CrawledCategoryFAST = new Guid("11280615-f653-448f-8ed8-2915008789f2");
    private static readonly Guid CrawledCategorymAdcOW = new Guid("fa585f53-2679-48d9-976d-9ce62e7e19b7");
    private static readonly Regex WordSplit = new Regex(@"\s+", RegexOptions.Compiled);

    // Actual processing
    public void DoProcessing(string inputFile, string outputFile)
    {
        XDocument inputDoc = XDocument.Load(inputFile);

        // Fetch the body property from the input item
        var res = from cp in inputDoc.Descendants("CrawledProperty")
                  where new Guid(cp.Attribute("propertySet").Value).Equals(CrawledCategoryFAST) &&
                        cp.Attribute("propertyName").Value == "body" &&
                        cp.Attribute("varType").Value == "31" // 31 = string
                  select cp.Value;

        // Count the number of words separated by white space
        int wordCount = res.Sum(s => WordSplit.Split(s).Length);

        // Create the output item
        XElement outputElement = new XElement("Document");
        if (res.Count() > 0 && res.First().Length > 0)
        {
            outputElement.Add(
                new XElement("CrawledProperty",
                    new XAttribute("propertySet", CrawledCategorymAdcOW),
                    new XAttribute("propertyName", "wordcount"),
                    new XAttribute("varType", 20), wordCount) // 20 = integer
            );
        }
        outputElement.Save(outputFile);
    }
}



After compiling a debug build of the program I copy it over to the folder previously created, C:\FASTSearch\pipelinemodules.

By default an FS4SP installation has four document processors running, as nctrl status shows:

nctrl status

Document Processor              procserver_1             11644  Running
Document Processor              procserver_2              8224  Running
Document Processor              procserver_3              5452  Running
Document Processor              procserver_4              5920  Running


This means it will process four items in parallel. To ease debugging, we turn off all but one.

nctrl stop procserver_2 procserver_3 procserver_4

(Remember to start them again once you are done testing if this is a shared or production environment. Replace “stop” with “start” in the above command.)

Next I modify C:\FASTSearch\etc\pipelineextensibility.xml and add my word count stage.
<PipelineExtensibility>
  <Run command="C:\FASTSearch\pipelinemodules\WordCount.exe %(input)s %(output)s">
    <Input>
      <CrawledProperty propertySet="11280615-f653-448f-8ed8-2915008789f2" varType="31" propertyName="body"/>
      <!-- Included for debugging/traceability purposes -->
      <CrawledProperty propertySet="11280615-f653-448f-8ed8-2915008789f2" varType="31" propertyName="url"/>
    </Input>
    <Output>
      <CrawledProperty propertySet="fa585f53-2679-48d9-976d-9ce62e7e19b7" varType="20" propertyName="wordcount"/>
    </Output>
  </Run>
</PipelineExtensibility>



After saving the file I reset the document processors so they pick up the updated configuration.

psctrl reset

I have now deployed a new pipeline stage ready for testing. On the FAST Content SSA in SharePoint Central Administration I start a new full crawl of my test source.

Start Windows Task Manager, check “Show processes from all users”, and wait for an instance of the program to appear.

[Screenshot: WordCount.exe in Task Manager]

Switch back to Visual Studio and set a breakpoint in the code below the sleep statement.

[Screenshot: breakpoint set in Main]

Go to the “Debug” menu and choose “Attach to Process”.

[Screenshot: the Debug > Attach to Process menu]

Locate the process and click “Attach”. You might have to check “Show processes from all users” here as well for it to be displayed.

[Screenshot: the Attach to Process dialog]

Once the sleep statement completes you should be able to step through the code like you normally would in Visual Studio.

If logging is enabled in the configuration file, you will see files appearing in the logging folder

[Screenshot: input/output log files in the logging folder]

where the input files contain the url and body fields going in, and the output files the wordcount field going out, as specified in the configuration file.

My crawled property “wordcount” has also been added during the crawl.

[Screenshot: the wordcount crawled property]

I create a new managed property which can be used in the search result page, and map the crawled property to it (-Type 2 creates an integer managed property). This can also be done in the Admin UI instead of with PowerShell.
$managedproperty = New-FASTSearchMetadataManagedProperty -Name wordcount -Type 2 -Description "Number of words"
$wordcount = Get-FASTSearchMetadataCrawledProperty -Name wordcount
New-FASTSearchMetadataCrawledPropertyMapping -ManagedProperty $managedproperty -CrawledProperty $wordcount

The operation shows up in Central Admin

[Screenshot: the crawled property mapping in Central Administration]

and the result XML when executing a search now shows the newly added wordcount property. Remember to add the column to the “Fetched Properties” list in the Search Core Results Web Part.

[Screenshot: search result XML with the wordcount property]
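
If you prefer to verify the new property from code instead of the result XML, a quick test with the SharePoint 2010 KeywordQuery object model could look like the sketch below. This is not part of the sample project; the site URL and query text are placeholders, and it must run on a SharePoint server with references to Microsoft.SharePoint and Microsoft.Office.Server.Search.

using System;
using Microsoft.SharePoint;
using Microsoft.Office.Server.Search.Query;

internal class WordCountQueryTest
{
    private static void Main()
    {
        using (SPSite site = new SPSite("http://intranet")) // placeholder URL
        {
            KeywordQuery query = new KeywordQuery(site);
            query.QueryText = "contoso"; // placeholder query
            query.ResultTypes = ResultType.RelevantResults;
            query.SelectProperties.Add("title");
            query.SelectProperties.Add("wordcount"); // our new managed property

            ResultTableCollection results = query.Execute();
            ResultTable relevant = results[ResultType.RelevantResults];
            while (relevant.Read()) // ResultTable implements IDataReader
            {
                Console.WriteLine("{0}: {1} words", relevant["title"], relevant["wordcount"]);
            }
        }
    }
}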

The Visual Studio project for the pipeline stage as well as the pipelineextensibility.xml can be downloaded from my SkyDrive.

37 comments:

  1. Hi Mikael, I have started working with pipeline processing and am stuck at the very basics of FAST Search crawling.
    I uploaded a document and it's getting crawled by FAST. When I go to crawled properties, there are over 300 properties listed. How do I know which properties come from that document?

    ReplyDelete
  2. Hi Jeff,
    the easiest might be to include the spy stage when indexing. This works best if you can control one document at a time. Read my post at http://techmikael.blogspot.com/2011/01/how-to-spy-raw-data-and-available.html about this.

    Another possibility is to include all crawled properties in the extensibility config. This way you can look at them in your own code. Check out http://gallery.technet.microsoft.com/scriptcenter/834cd7a8-4e87-4b5a-bef9-a519fd1712ba for how to do this.

    ReplyDelete
  3. Hi Mikael, I was finally able to extend the pipeline and create logs using your code. Thanks a lot! I didn't get a chance to look at these links yet, but they certainly look helpful.
    Great Work!

    ReplyDelete
  4. Hi Mikael,
    I have this problem:
    ValueError: 'varType': Only crawled properties of string types are supported, got '20'
    and I don't know how I can solve this.
    Can you help me with this problem?
    Kind Regards
    Toni

    ReplyDelete
  5. tblog: At what stage did you get this error?

    ReplyDelete
    Replies
    1. Hi Mikael,

      I got the same problem as "tblog" above. It says:

      Unexpected exception occured during batch processing: ProcessorDeploymentException: For pipeline 'Office14 (webcluster)', creating processor CustomerExtensibility failed: ValueError: 'varType': Only crawled properties of string types are supported, got '64'

      can you give any solution for this ?

      Best regards
      Sebastian

      Delete
    2. Hi Sebastian,
      You get this error only when you add a custom stage to the config, right? If so, you have to recheck your configuration and make sure the CPs you are working with actually exist.

      Delete
    3. Hi Mikael,

      Thank you for the fast response. The error occurs only on this FAST crawler ... on another crawler (same version) it works fine; the problem exists only on this machine. The CP with varType "64" is the ows_Modified field, and this field does exist among the CPs.

      Delete
    4. Hi,
      If it fails on only one of several machines in the same farm I would try to copy all config xml files from one of the other servers. I guess you have also tried to stop and start FAST again?

      Delete
    5. Hi Mikael,

      I've restarted the server and this did not solve the problem. I can't copy the config XML files, because the two machines are not in the same farm. Do you have any other solutions?

      Delete
    6. Not directly related, but try the steps in this kb article: http://support.microsoft.com/kb/2468431

      Delete
  6. Hi Mikael!

    "Long time no see!" :)

    Thank you for this great blog post. I have successfully created my own pipeline extension following the steps in this post.

    -Erik

    ReplyDelete
  7. Hi Mikael,

    The FAST Search backend reported warnings when processing the item. ( Customer-supplied command failed: Process terminated abnormally: Unknown error (0x80131700) )

    I am getting the above error in the crawl logs after extending the pipeline.

    Is anything missing, either default settings or permissions?

    ReplyDelete
  8. There seems to be something wrong with your program; perhaps it crashed on unexpected input. I would suggest debugging it as described in this post, to see what goes wrong.

    ReplyDelete
    Replies
    1. Hi Mikael,
      I can't start WordCount.exe on my test server, and there is no WordCount.exe process in Windows Task Manager after I start a full crawl.
      Could you give me some advice?

      Regards,
      Fiyoung.

      Delete
    2. Hi,
      WordCount is an .exe file from the Visual Studio project linked in this post. You have to compile it and put it into the folder you specify in the pipelineextensibility.xml config file.

      https://skydrive.live.com/redir.aspx?cid=9ecc38025e460fc4&resid=9ECC38025E460FC4!737&parid=9ECC38025E460FC4!730&authkey=!

      Delete
    3. Thanks for quick replying.

      Actually, I did build the VS project and put it into the folder I specified in pipelineextensibility.xml.

      Here is the thing: I have two servers, one is the SP server and the other is the FS4SP server. VS is on the SP server. Is it possible that VS on SP can't debug WordCount.exe?

      Another thing: even when I start a crawl on the FAST Content SSA, I can't find WordCount.exe on the FS4SP server either, where I copied it from SP.

      I think if I can debug the program first, it will be a big step towards using pipeline extensibility.
      Hoping for more advice. Thank you.

      Regards,
      Fiyoung.

      Delete
    4. Hi,
      If you want to debug a document processor component it is best to have VS on the same machine, meaning on the FS4SP server. Remote debugging might be possible, but it is more of a hassle to get working.

      If, on the FS4SP server, you create a folder c:\fastsearch\pipelinemodules, copy the .exe file to that folder, add your entry to pipelineextensibility.xml, and run "psctrl reset" to reload the config file, then the module should run. If it fails you should see it in the crawler log on the Content SSA.

      You can also use "docpush.exe" to test indexing and any error in the pipeline will be written to the console.

      thanks,

      Delete
    5. Hi Mikael,
      Today I tried these steps, but they also failed; I hope to get more of your help.

      1. I enabled the Advanced Filter Pack.
      2. I checked the FASTSearch root folder permissions; the Security tab does contain "FASTSearchAdministrators", and it includes my service account.
      3. I also reinstalled the Microsoft Filter Pack on both SP and FS4SP.
      4. I used docpush and "doclog -a"; it throws the error "processing:IFilterConverter:Error:Missing input attribute:"mime"", and I have no idea how to fix it. However, when I use ifilter2html it generates plain-text HTML fine.

      Is there any way to fix this error?
      Thanks and Regards,
      Fiyoung.


      Delete
    6. Hi,
      I would try to reinstall the ifilter pack (http://www.microsoft.com/en-us/download/details.aspx?id=17062). If this does not help, it seems that a config file in FS4SP has been corrupted. Did you enable the advanced filter pack with the ps1 file or by editing it manually?

      Delete
    7. Hi Mikael,

      I used .\AdvancedFilterPack.ps1 -enable to enable the advanced filter pack, several times.
      Which config file in FS4SP may be corrupted? How can I fix it? When I try to run the Configuration Wizard it says "Please uninstall post configuration to reconfigure the system". Where is the post configuration? Is this the config file you mentioned?

      Please help me about this,So thank you.

      Regards,
      Fiyoung.

      Delete
  9. This comment has been removed by the author.

    ReplyDelete
  10. This comment has been removed by the author.

    ReplyDelete
  11. Hi Mikael,

    When the xml is changed to
    Run command="apptest.exe %(input)s %(output)s"
    and the exe is copied to FASTSearch\bin,
    the exe finally starts.

    Here is another important issue,
    When I use docpush it says:
    PS D:\FASTSearch\bin> docpush -c fast2 d:\file\Samsung.doc
    [2013-03-05 17:27:55.043] ERROR fast2 Reported error with http://cohowinery.com/d:\file\Samsung.doc: processing:IFilterConverter:ERROR: Missing input attribute: "mime"
    [2013-03-05 17:27:55.044] INFO fast2 All add operations completed

    Could you give me more advice about "Missing input attribute mime"?
    Thanks,
    Fiyoung.

    ReplyDelete
  12. Hi,

    After several tests, I found the varType was not correct.
    It finally generates the crawled property successfully.

    Thanks for your help,
    Regards,
    Fiyoung

    ReplyDelete
  13. Hi,

    I get the below warning during a crawl: "The FAST Search backend reported warnings when processing the item. ( Customer-supplied command failed: )". No error messages in the log.

    Regards,
    Arun

    ReplyDelete
  14. Hi,

    I am getting the following warning when I run a crawl; it seems like the pipeline exe is not getting triggered: "The FAST Search backend reported warnings while processing the item (Customer supplied command failed: )". No errors are reported in the event or SharePoint logs.

    ReplyDelete
  15. Hi,
    I am facing the same issue as Fiyoung.
    In my case I am getting the following error:

    "INFO Running customer-supplied command in a child process: apptest.exe %(input)s %(output)s
    WARNING Customer-supplied command failed: (warning code 0)
    "
    Even though my exe has an infinite loop, it does not appear in Task Manager.

    I followed all the steps mentioned in the article but no luck. DocPush.exe also gives the same warning.

    Hope you will help me.

    ReplyDelete
    Replies
    1. Hi, have you checked the logs on the FAST side? If you run your exe manually with a handcrafted input xml, does that work?

      Delete
  16. Nice post. How can we trigger content processing only for required types such as docx, xls, pdf etc.? In 2013, content enrichment allows this (PowerShell binding script); how can the same thing be done here?

    ReplyDelete
    Replies
    1. Hi, that's one of the drawbacks with FS4SP: there are no triggering mechanisms, so everything gets sent through. You need to add a check at the start of your code to decide whether to process the item or not.

      Delete
    2. What about a max file size specification, can we control it like in 2013 enrichment?
      Does the same code work for list items? If not, any inputs?

      Delete
    3. The main reason is, I am planning to parse data for required keywords and update properties accordingly; if I can do this for list items it would be great.

      Delete
    4. You need to use google :) See https://social.technet.microsoft.com/Forums/en-US/5e52652a-5c01-4526-9dd5-3f8a09384423/is-it-possible-to-index-a-1-gb-ms-office-document-in-fs4sp and https://msdn.microsoft.com/en-us/library/office/ff795815(v=office.14).aspx, or buy my book :D http://www.amazon.com/Working-Microsoft-Search-Server-SharePoint/dp/0735662223

      Delete
  17. Hi, can this post be used with Search Server Express 2010?
    If I have the same problem (changing a meta tag data type to "date"),
    can I solve it on Search Server Express 2010?

    ReplyDelete
    Replies
    1. Sorry no, FAST is a totally different engine.

      Delete