Thursday, June 21, 2012

Why more people should blog (or My bad day hunting down an error with FS4SP)

Today has been one of those days where you feel you haven’t done anything productive, thinking you have tried all tricks in the book to fix some weird error to no avail.

And just now the solution came to me, after spending some time with my family and letting my brain work its magic without my focus. I’m just about 100% sure it will work when I test it tomorrow morning.

The short version

Always use mklink /J when moving a folder from within your FAST installation to another volume (and blog about any error you encounter for others to read).
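
To make the short version concrete, here is a minimal sketch in Python of how the junction command could be built and run. The function names and the D:\FASTSearch\log target are hypothetical, chosen only for illustration; the sketch assumes the folder has already been moved to the other volume and that nothing exists at the link path yet.

```python
import subprocess


def junction_command(link_path, target_path):
    """Build the command line that creates a junction at link_path pointing to target_path."""
    # mklink is a cmd.exe built-in, so it has to be run through "cmd /c".
    return ["cmd", "/c", "mklink", "/J", link_path, target_path]


def create_junction(link_path, target_path):
    """Run the command (Windows only); link_path must not exist yet."""
    subprocess.run(junction_command(link_path, target_path), check=True)


if __name__ == "__main__":
    # D:\FASTSearch\log is a hypothetical target volume, not taken from the real setup.
    print(" ".join(junction_command(r"C:\FASTSearch\log", r"D:\FASTSearch\log")))
```

Keeping the command builder separate from the function that runs it makes it easy to print and review the command before touching a server.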

The long version

In October 2011 I was involved with creating the FS4SP architecture and topology for a customer. At the time we used mklink /d on the test server to move the log and data folders to separate volumes (mklink is the only way to split data over several volumes on FS4SP in a supported manner). I was only involved in the startup of the project, and then the hosting partner took over creating scripts and setting up the environments to be used later in the project.

Fast forward to April 2012 and I’m rehired to look over the solution, which has been implemented by another company. The setup looks OK, but I spot that the C:\FASTSearch\log folder on the staging servers is symlinked instead of linked with a junction point. I was able to spot this because, between October and April, “Working with FAST Search Server 2010 for SharePoint” was finished and I had learned that mklink /d and mklink /j have subtle differences, especially with FS4SP. I remember discussing this at length with Marcus Johansson and also getting in touch with FAST/MS support about the issue.

I sent an e-mail to the hosting company about my findings and was forwarded a copy of an e-mail discussion they had in late November 2011, where they discovered issues using mklink /d instead of /j. When using /d some services failed to start for no apparent reason:
  • indexer
  • nameservice
  • search-1
  • topfdispatch
Marcus had even been involved in the e-mail discussion, pointing out that /j is the switch to use, and Marcus knows his stuff :)

Fast forward again to late June 2012 (now) and I’m back on the project for the third time, this time to implement some changes to the FS4SP solution. When I’m about to test my work I discover that the SharePoint farm has SSA issues and crawling is not working in staging. I decide to reconfigure the SSAs and start by recreating the self-signed certificate on the FS4SP admin server. And what do I see? The four services mentioned above will not start and report their status as dead. Except, I don’t remember sending the e-mail back in April or the e-mail discussion I was forwarded from November. And the issue is still present. It should have been fixed in November, and it should have been fixed in April, but it hasn’t.

Today I have spent a full working day trying to figure out why on earth this is just not working. Google has not been of much help, and neither has checking every nook and cranny of FS4SP. Then, out of the blue this evening, I had an epiphany: I remembered seeing an arrow icon on the log folder, cross-referenced this with my e-mail from April, and realized the solution had been sitting in my own e-mail archive all along. Symlinks are evil and must die!
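
Instead of relying on spotting the arrow icon, the same check could be scripted. The following is a minimal sketch in Python, with a hypothetical helper name, that lists the folders directly under a given root that are symbolic links. It assumes Python 3.8 or later on Windows, where os.path.islink should report folders created with mklink /d but not junctions created with mklink /j.

```python
import os


def find_symlinked_dirs(root):
    """Return the sorted names of entries directly under root that are symlinks to folders."""
    found = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        # islink() is True for symbolic links (mklink /d);
        # a junction (mklink /j) is assumed to report False here.
        if os.path.islink(path) and os.path.isdir(path):
            found.append(name)
    return sorted(found)
```

Running find_symlinked_dirs(r"C:\FASTSearch") on a server would be expected to return the names of any symlinked folders, and an empty list when only junctions or regular folders are present.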

Back to why blogging is important

If someone at the hosting provider, or even Marcus, had blogged this back in November, I would have found the solution on Google. If I had blogged this in April based on the forwarded e-mail discussion, I would have found it on Google.

I very seldom search my e-mails for solutions, as I try to either blog about them or write them up in our internal issue tracker at Puzzlepart, because then I can use Google to find them later.

That’s why I’m blogging this, so that the next time I encounter this issue, I can read about it right away and not waste 8 hours.