Index cleanup

Say you crawl a content source every hour, and for some reason the crawl cannot reach the source for four hours. During that window it will have tried and failed to crawl the source three times, and all your items are effectively dropped from the search index. How convenient.
The index cleanup stage occurs when you delete a content source, a start address, or both from a search service application. It can also occur when the content indexing connector cannot find the host supplying content: the connector looks for the host during three consecutive crawls, and if the host is still not found it deletes the content source and sends the index into the cleanup stage.
On a real network this is a disaster waiting to happen for any search solution: routers go down and DNS servers stop responding all the time.
The solution could be quite simple, and it is one that other enterprise search products have already incorporated: if a crawl fails and tries to delete more than X percent of your index, cancel the operation and flag it for further investigation. We can only hope this makes its way into SharePoint and FAST for SharePoint as well.
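That threshold check could be sketched roughly like this. This is a minimal illustration of the idea only; the function and parameter names are invented, not anything SharePoint or FAST actually exposes:

```python
# Hypothetical safeguard: abort a post-crawl cleanup if it would delete
# more than a set fraction of the index, and flag it for investigation.
# All names here are illustrative, not a real SharePoint/FAST API.

MAX_DELETE_FRACTION = 0.10  # the tunable "X percent" threshold

def apply_crawl_result(index_size, items_to_delete):
    """Return (deleted_count, flagged) after a finished crawl.

    index_size      -- number of items currently in the index
    items_to_delete -- items the crawl marked as gone from the source
    """
    if index_size > 0 and len(items_to_delete) / index_size > MAX_DELETE_FRACTION:
        # Deletion volume looks suspicious -- likely a transient outage
        # (router down, DNS failure) rather than real content removal.
        # Keep the index intact and flag the crawl for review.
        return 0, True
    # Deletion volume looks plausible; proceed with the cleanup.
    return len(items_to_delete), False
```

With a 10 percent threshold, a crawl that suddenly wants to remove half of a 1,000-item index would be cancelled and flagged, while removing a handful of genuinely deleted items would go through as normal.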
When I tested this, it seemed to drop data only after three consecutive failed full crawls, not incremental crawls. So it's not as bad as it could be, but still.