To pick up data from <meta> tags when crawling web pages:
- Make sure the crawled markup has line breaks after tags
- Look in both the Web and Document Parser crawled property categories for your crawled properties
- Register the crawled file extension with the right MIME type
- Add the crawled file extension to the supported File Types
I’m in the process of crawling an external web site using SharePoint 2013, and in order to get structured data we are picking up data from the <meta> tags.
The rule in the SharePoint crawler is that a tag like <meta name="color" content="red" /> will create a crawled property named COLOR in the Web category of crawled properties, with the meta name turned into upper case. This was the case for my test the other day as well.
I had to pause for a day to work on another project and then came back to finish my crawl work. Perhaps I was wrong to assume it was working in the first place, because for the past two days I have not been able to figure it out. But after a couple of fresh mornings and paying my dues in the f-word bowl at the office (thanks Pam :), it appears the values are now in the Document Parser category of crawled properties instead.
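To see which category a meta tag actually ended up in, something like the following sketch can help. It assumes it is run in the SharePoint 2013 Management Shell against the default search service application, and it uses "color" from the example above as the name to look for. The filter is done client side with -like so that casing does not matter.

```powershell
$ssa = Get-SPEnterpriseSearchServiceApplication

# Check both categories where the meta tag may show up
foreach ($categoryName in "Web", "Document Parser") {
    $category = Get-SPEnterpriseSearchMetadataCategory -SearchApplication $ssa -Identity $categoryName

    # List the crawled properties in this category and keep the ones that match the meta name
    Get-SPEnterpriseSearchMetadataCrawledProperty -SearchApplication $ssa -Category $category -Limit ALL |
        Where-Object { $_.Name -like "*color*" } |
        Select-Object Name, CategoryName
}
```

If a property only shows up under Document Parser, that is a hint the page was not treated as regular html.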
The reason the meta tags won’t show up in the Web category is most likely that the server I’m crawling uses .obt as the file extension. This extension is mapped to “Microsoft Office Binder” by default, and won’t be treated as html.
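A quick way to check how an extension is currently mapped is to look it up in the search file formats. This is a minimal sketch, assuming the SharePoint 2013 Management Shell and that the extension (obt in my case) is already registered on the search service application.

```powershell
$ssa = Get-SPEnterpriseSearchServiceApplication

# Show the existing file format entry for the extension, including its name and MIME type
Get-SPEnterpriseSearchFileFormat -SearchApplication $ssa -Identity obt
```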
But hold on.. there is one more piece to the puzzle. If the markup in the crawled pages is missing line breaks between the tags, like
<html><head><meta name="color" content="red" />..
<meta name="color" content="red" />..
then it seems the html will not be parsed correctly.
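To spot this kind of markup before crawling, a rough check like the one below can count the <meta> tags that follow another tag on the same line. The sample string and the regular expression are my own illustration, not something SharePoint provides. In practice you would load the real page source into $html instead.

```powershell
# Hypothetical sample markup with no line breaks between the tags
$html = '<html><head><meta name="color" content="red" /><meta name="size" content="large" /></head></html>'

# Match a closing ">" followed by optional spaces or tabs (but no line break) and then "<meta"
$glued = [regex]::Matches($html, '>[ \t]*<meta\b')

"{0} <meta> tag(s) follow another tag without a line break" -f $glued.Count
```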
Also, when crawling non-default file extensions, make sure you add them to the File Types section on the SSA, and also add the correct file format handler using PowerShell:
# Get the search service application
$ssa = Get-SPEnterpriseSearchServiceApplication
# Register the extension so it is parsed as html
New-SPEnterpriseSearchFileFormat -SearchApplication $ssa -FormatId customext -FormatName "Web Page" -MimeType text/html
where you replace customext with your extension. In my case I cannot do this as obt is already in use.