Friday, December 8, 2023

AI model bias and why responsible technology matters – exemplified by image generation

In this era of AI, where ChatGPT and LLMs have become the hottest topic in computer science since the Apple Macintosh and the IBM PC, I figured I’d do a small write-up on AI model bias and why paying attention to bias is important. This is especially true in enterprise scenarios, where Microsoft is launching a broad range of AI-powered Copilot experiences.

The author of this article works for Microsoft (December 2023) and is an internal champion for responsible AI and technology, as well as an internal champion for privacy and compliance.

At Microsoft we have a high bar for delivering responsible AI solutions, which means a lot of work is put in place to ensure the output from AI systems follows Microsoft’s AI principles: be fair, inclusive, reliable and safe, account for privacy and security, and be accountable.

Any model, be it a large language model (LLM) or an image generation model, will inherently have bias built in due to the training data used. In smaller models you can manually verify the training data to counter some bias and balance the training set, but as models grow larger this becomes inherently harder. I’m not saying there are no systems in place to counter training bias already, but to truly counter bias it has to be built into the pre- and post-processing of input prompts and model outputs.

I will use image generation as an example, showing the difference between the image creator in Microsoft Designer (https://designer.microsoft.com/), built on DALL·E 3 from OpenAI, and Stable Diffusion XL (SDXL), an open source model from Stability AI (https://stability.ai/). The Microsoft solution has guardrails in place, where the open source solution does not – unless you add them yourself via prompting. Neither of them is perfect, as the examples will show.

I want to call out that any bias shown here is not statistically verified, and is only based on generating a set of random sample images with the same prompt.

Example 1 - photo of correctional officer in a well lit hallway eating a donut

image

The above eight images are from DALL·E 3. They are all close-up photos showing a fit, light-skinned male with dark hair.

image

In comparison, the SDXL images have a wider focal point showing the full body. It’s a mix of male and female people, and also a mix of light- and dark-skinned people. I would argue the SDXL model is more accurate to what people look like in 2023, while the DALL·E 3 model outputs “perfect” looking people. Whether this is due to the images the models are trained on, or the prompt being augmented to produce “perfect” looking people, I do not know.

The default color palette is also different where DALL·E 3 has more green and SDXL has more brownish colors.

If I add “overweight” to the DALL·E 3 prompt, the Responsible AI filter kicks in and blocks the generation. If I add “fat”, then it works.

image

With SDXL I can modify the prompt to “closeup photo of a slim white male correctional officer in a well lit hallway eating a donut” to mimic what DALL·E 3 outputs by default – countering the model’s bias towards wide angles and real-life looking people.

image

Example 2 – woman

Let’s try a simple prompt with the subject “woman”. For SDXL I added negative prompting to avoid any NSFW images – something that is blocked as part of DALL·E 3’s RAI principles.
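
As a side note, for readers who want to reproduce the negative prompting: below is a minimal sketch using the open source diffusers library – an assumption on my part, since the images in this post were generated with the Draw Things app (see References) – showing how a negative_prompt steers SDXL away from unwanted content.

# Minimal sketch of SDXL negative prompting with the diffusers library.
# Illustration only - not the Draw Things setup used for the images in this post.
import torch
from diffusers import StableDiffusionXLPipeline

# Load SDXL base; fp16 keeps memory usage reasonable on a consumer GPU.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# The negative prompt steers generation away from unwanted content.
image = pipe(
    prompt="woman",
    negative_prompt="nsfw, nude",
    num_inference_steps=30,
).images[0]
image.save("woman.png")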

image

DALL·E 3 seems to pivot towards portrait photos when no extra contextual information is given, as that is likely the intent behind a simple input subject. They are also all dark-haired and seemingly young women.

image

In comparison SDXL gives a wide variety of image types, pivoting to more art-like images instead of photos.

Example 3 – painting of a beautiful norwegian fjord with vikings, with a boing 737 in the sky, in the style of munch’s scream

image

The DALL·E 3 painting nails the airplane and pretty much the painting style of Edvard Munch.

image

The SDXL one is not bad either, but the Munch style is not as visible in this one sample. And the scale of the plane vs. the viking ship and buildings is way off.

Learnings

These simple examples show that articulating your intent when prompting is crucial. Either the system has to add guardrails and contextual information to the prompt, or the person prompting has to be articulate about what they want returned and what they do not want returned. And you have to generate many images to find that ONE you really like.

For online services like Microsoft Designer, going the safe route is the only approach, as people using the service come from a wide variety of backgrounds and age groups. Taking that extra measure to ensure everyone feels safe is important for trust in the service.

Open source solutions you can run on your own PC/phone/tablet can allow for fewer guardrails, as the individual running them likely has more skill and is using the tool themselves. Maybe the analogy of hiring a carpenter as a service vs. swinging the hammer yourself applies: you trust a hired professional to meet a certain bar, while you are responsible yourself for anything you do.

When it comes to LLMs, we know they are largely based on English text today, and favor input and output in that language. As they are built on public data, that will influence the default writing style as well. Fortunately, ChatGPT and Microsoft Copilots put a lot of effort into the system prompts placed around the user prompt to counter bias in the model. This is to ensure grounding in facts and to avoid hallucinations. More on that in another post.

References

I used the service at https://designer.microsoft.com/image-creator to create the DALL·E 3 images, and I used the Draw Things app on a MacBook with an 8-bit quantized version of the default SDXL model. The Draw Things app also works on iOS devices.

Thursday, August 10, 2023

How to paginate large results sets for SharePoint items using the Microsoft Graph Search API

If you want to paginate over a large result set using the Microsoft Graph Search API, you can employ the logic described for the SharePoint API at https://learn.microsoft.com/en-us/sharepoint/dev/general-development/pagination-for-large-result-sets. Note that this approach applies to OneDrive and SharePoint items, and not necessarily to other content sources available via the Graph Search API (not tested).

Use a basic JSON template like the one below for your search requests, or modify it to add other parameters needed for your request.

{
  "requests": [
    {
      "entityTypes": [
        "driveItem"
      ],
      "from": 0,
      "size": 500,
      "query": {
        "queryString": "contoso indexdocid>**LASTID**"
      },
      "fields": [
        "indexdocid"
      ],
      "sortProperties": [
        {
          "name": "[DocId]",
          "isDescending": "false"
        }
      ]
    }
  ]
}

Where **LASTID** is 0 on the initial request. Once you get results back, pick the value of indexdocid from the last result, and use that as **LASTID** on the next request. In the below screenshot you would use 2377359 for the second request. Continue this logic until you stop getting results, and you will have iterated over all files (driveItems) containing the term contoso for the above sample. A sketch of the full loop follows below the screenshot.

image

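To make the pagination loop concrete, here is a rough Python sketch of the logic above. It assumes you already have a valid Graph access token, and the exact path to indexdocid in the hit payload may need adjusting after inspecting a real response.

# Sketch: iterate all driveItems matching "contoso" using indexdocid as a cursor.
# Assumes a valid Graph access token; error handling omitted for brevity.
import requests

GRAPH_SEARCH_URL = "https://graph.microsoft.com/v1.0/search/query"

def fetch_all_hits(token):
    hits_out = []
    last_id = "0"  # **LASTID** is 0 on the initial request
    while True:
        body = {
            "requests": [{
                "entityTypes": ["driveItem"],
                "from": 0,
                "size": 500,
                "query": {"queryString": f"contoso indexdocid>{last_id}"},
                "fields": ["indexdocid"],
                "sortProperties": [{"name": "[DocId]", "isDescending": "false"}],
            }]
        }
        resp = requests.post(GRAPH_SEARCH_URL, json=body,
                             headers={"Authorization": f"Bearer {token}"})
        resp.raise_for_status()
        containers = resp.json()["value"][0].get("hitsContainers", [])
        hits = containers[0].get("hits", []) if containers else []
        if not hits:
            break  # no more results - every matching item has been seen
        hits_out.extend(hits)
        # The indexdocid of the last hit becomes **LASTID** for the next request.
        # The exact path to the field in the hit payload may differ - inspect
        # one response and adjust.
        last_id = hits[-1]["resource"]["listItem"]["fields"]["indexdocid"]
    return hits_out
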
Monday, July 3, 2023

MacBook Pro M1 with 4K monitors on a ThinkPad USB-C dock

I have a couple of 4K monitors at work connected via a ThinkPad dock. When running the native resolution of 3840x2160 pixels, fonts and icons were just too small for my aging eyes. The alternative native resolution was 1920x1080, and then things get too large.

The ideal for me is scaling to 2560x1440. Sure you can do this via the display system settings, but then everything is blurry. But there is a fix.

  1. Install DisplayLink Manager from https://www.synaptics.com/products/displaylink-graphics/downloads/macos 
  2. Check the experimental mode for 3008x and 2560x modes support
  3. Then pick scaled text in Display settings, getting you a natively scaled 2560x1440 which is not blurry in HiDPI mode. See https://support.displaylink.com/knowledgebase/articles/1993915.

Tuesday, April 25, 2023

New useful managed properties to use in Microsoft Search

image

For those working with hub sites in SharePoint, you have long been able to use the managed property DepartmentId, later accompanied by RelatedHubSites when hub site hierarchies were enabled.

Now the time has come for these properties, and a few more, to be added to the public documentation.

Take a peek at https://learn.microsoft.com/en-us/sharepoint/crawled-and-managed-properties-overview which covers these new properties available for online experiences.

The documentation UX is not ideal, so make sure to scroll the table of properties to the right to read the comment per property. Here’s a copy of the table for reference, where I moved the comment column for visibility.

Note the (*) highlighting that it’s not guaranteed that each item has a value in the property.

| Property name | Type | Comment | Multi-valued | Queryable | Searchable | Retrievable | Refinable | Sortable | Mapped crawled properties |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DepartmentId | Text | Site ID of the hub of the immediate hub. Applies to all items in the hub/associated sites. | No | Yes | No | Yes | Yes | No | ows_DepartmentId |
| RelatedHubSites | Text | Site IDs of associated hubs including hub hierarchies. Can be used instead of DepartmentId for most scenarios. Applies to all items in the hub/associated sites. | Yes | Yes | No | Yes | No | No | ows_RelatedHubSites |
| IsHubSite | Yes/No | Applies to the site result of a hub (contentclass=STS_Site) | No | Yes | No | Yes | No | No | ows_IsHubSite |
| ModifierAADIDs | Text | Semi-colon separated list of AADIDs for modifiers of a file or page ordered in date descending order. (*) | Yes | Yes | No | Yes | Yes | Yes | |
| ModifierDates | Date and Time | Semi-colon separated list of modification dates for modifiers of a file or page ordered in date descending order. (*) | Yes | No | No | Yes | No | No | |
| ModifierNames | Text | Semi-colon separated list of the names for modifiers of a file or page ordered in date descending order. (*) | Yes | Yes | No | Yes | No | No | |
| ModifierUPNs | Text | Semi-colon separated list of UPNs for modifiers of a file or page ordered in date descending order. (*) | Yes | No | No | Yes | No | No | |
| ChapterTitle | Text | Semi-colon separated list of auto-generated chapters on Teams meeting videos. (*) | Yes | Yes | Yes | Yes | No | No | ChapterTitle |
| ChapterOffset | Text | Semi-colon separated list of time codes matching the chapter titles for auto-generated chapters on Teams meeting videos. (*) | Yes | No | No | Yes | No | No | ChapterOffset |

* Property is not guaranteed to contain data.
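
To try the new properties out, they can be requested like any other retrievable managed property, e.g. via selectproperties on the SharePoint search REST API (hypothetical tenant URL and query):

https://tenant.sharepoint.com/_api/search/query?querytext='contoso'&selectproperties='Title,ModifierNames,ModifierUPNs,ModifierDates'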

Retirement of Dynamic Ordering feature in classic search experiences

In the MC44789 post from April 22nd, 2023, Microsoft announced the retirement of the dynamic ordering feature in classic search experiences.

If you don’t know what the feature is, the below image highlights it as seen in the query builder for classic search result sources, query rules and search web parts.

image

The above screenshot shows a rule where, if results match the term xrank in the title, they will be boosted to the top of the result list.

Wait what?? So I will no longer be able to boost items per my own logic? Sure you can, and this is called out in the MC post – “Functional parity may be achieved by adding XRANK clauses directly to the query template in the Query Builder dialog.”

Previously, when testing the query from the test tab, you could see the output of the final query. However, this is no longer the case, so I’ll teach you how to transition dynamic rules over to manual XRANK.

image

Today, using the constant boost (cb) parameter for ranking is not the recommended approach. The reason is that the internal rank scale has changed over the years, so the value 5,000 may or may not be enough to move something to the top. The below example has a rank of –17921, so adding 5,000 would not help.

image

The recommended approach today is standard deviation boost, via the stdb parameter.

See https://learn.microsoft.com/en-us/graph/search-concept-xrank or https://learn.microsoft.com/en-us/sharepoint/dev/general-development/keyword-query-language-kql-syntax-reference#dynamic-ranking-operator for all parameters.

Manually writing dynamic ordering rules as XRANK

A query template to boost a result to the top can then look like:

{?{searchTerms} XRANK(stdb=100) Title:xrank}

Feel free to replace 100 with a smaller or larger number as needed.

If you want to boost items with title=foo pretty high, and items with title=bar less, you can use a nested XRANK statement, similar to what multiple dynamic ordering rules would accomplish.

{?({searchTerms} XRANK(stdb=5) Title:foo) XRANK(stdb=2) Title:bar}

If you want to demote results instead of promoting them, use a negative number.

If you go for decimals instead of integers, I recommend reading https://www.techmikael.com/2014/11/you-should-use-exponential-notation.html to ensure they always work.
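
If you want to sanity-check a rewritten template outside the query builder, one option is to run the expanded query directly against the search REST API. A minimal Python sketch, assuming a placeholder tenant and token, with {searchTerms} already substituted:

# Illustration: verify an XRANK boost by running the expanded query via the
# SharePoint search REST API. Tenant URL and token are placeholders.
import requests
from urllib.parse import quote

query = "(contoso) XRANK(stdb=100) Title:xrank"  # {searchTerms} -> contoso
url = f"https://tenant.sharepoint.com/_api/search/query?querytext='{quote(query)}'"
resp = requests.get(url, headers={
    "Accept": "application/json;odata=verbose",
    "Authorization": "Bearer <token>",
})
resp.raise_for_status()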

Filter on managed properties in search with or without values

Back in 2014 I wrote the post How To: Search up items which don’t have a value set, which covers how to write Keyword Query Language (KQL) filters to return or restrict items depending on whether a specific managed property has a value or not. Recently, Microsoft added support to more easily query whether managed properties of type Text contain a value or not.

Here’s a link to the updated documentation:

https://learn.microsoft.com/en-us/sharepoint/dev/general-development/keyword-query-language-kql-syntax-reference#filter-on-items-where-a-text-property-is-empty-or-contains-a-value

Note: The new supported syntax only works for Microsoft 365 / Online

image

Items missing or having a text value

The syntax is as follows:

| KQL Syntax | Description |
| --- | --- |
| NOT <Property Name>:* | Items where a property does not have a value |
| <Property Name>:* | Items where a property has a value |

The documentation uses the following example to list SharePoint sites associated to a hub site.

(DepartmentId:* OR RelatedHubSites:*) AND contentclass:sts_site NOT IsHubSite:true

Deciphering the query:

| KQL | Description |
| --- | --- |
| (DepartmentId:* OR RelatedHubSites:*) | Return items which have a value in the original DepartmentId managed property or in the successor RelatedHubSites property |
| contentclass:sts_site | Return only site items |
| NOT IsHubSite:true | Exclude hub site results |

Note that hubs connected to another hub will not be included in the above query. If you want those, remove the NOT IsHubSite:true part and post-process the results as needed – see the sketch below.
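
Here is a minimal sketch of that approach in Python – placeholder tenant and token, and note that the verbose OData format wraps collections in results objects:

# Sketch: fetch hubs and their associated sites in one query, then separate
# them client-side instead of excluding hubs with NOT IsHubSite:true.
import requests
from urllib.parse import quote

query = "(DepartmentId:* OR RelatedHubSites:*) AND contentclass:sts_site"
url = ("https://tenant.sharepoint.com/_api/search/query"
       f"?querytext='{quote(query)}'"
       "&selectproperties='Title,Path,IsHubSite'")
resp = requests.get(url, headers={
    "Accept": "application/json;odata=verbose",
    "Authorization": "Bearer <token>",
})
resp.raise_for_status()
table = resp.json()["d"]["query"]["PrimaryQueryResult"]["RelevantResults"]["Table"]
rows = table["Rows"]["results"]

def cell(row, key):
    # Each row holds Key/Value cells in the verbose OData format
    return next((c["Value"] for c in row["Cells"]["results"] if c["Key"] == key), None)

hubs = [r for r in rows if cell(r, "IsHubSite") == "true"]
sites = [r for r in rows if cell(r, "IsHubSite") != "true"]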

For completeness let’s cover how to accomplish the same for other types of managed properties.

Items missing or having a YesNo value

The :* syntax applies to managed properties of type Text only. For Yes/No properties, you can instead query for both possible values to determine whether a value is set.

| KQL | Description |
| --- | --- |
| NOT (RefinableYesNo00:true OR RefinableYesNo00:false) | Return items not having a value in a Yes/No property |
| (RefinableYesNo00:true OR RefinableYesNo00:false) | Return items having a value in a Yes/No property |

 

Items missing or having a date value

To find items missing a value in a date property, the syntax shown in my 2014 post no longer seems to work, and should be replaced with the following, where the date used is some low, never-used date.

| KQL | Description |
| --- | --- |
| NOT RefinableDate00>1900-01-01 | Return items not having a value in a date property |
| RefinableDate00>1900-01-01 | Return items having a value in a date property |

 

Items missing or having a number value

For number type managed properties it’s easier as you typically know the range of values.

| KQL | Description |
| --- | --- |
| NOT Size>=0 | If the managed property only contains positive values, this returns all items with no value set |
| NOT RefinableDecimal00>=0 NOT RefinableDecimal00<0 | Return items where the property RefinableDecimal00 has no value |
| Size>=0 | Return all items having a value greater than or equal to your smallest value |

Friday, April 7, 2023

There are still new things to learn from the SharePoint Search API I won't share. I will NOT!

…I will, just a tad bit late :)

This was a tweet I made on Friday, October 21st, once I understood the root cause of an API issue which has popped up in recent months. The issue affected the PnP Modern Search web parts when query rules were enabled, and also the Search Query Tool with default settings.

I thought it was a weird API issue introduced as part of an ongoing service upgrade for Microsoft Search, but it turns out everything was working as expected – except that the expected behavior hasn’t been what anyone expected for many years. And a big thank you to the engineers at Microsoft for helping me understand the root cause.

On the API side it manifests itself with the following simple REST query executed in the SharePoint Search Query Tool, where you only get 2 main results when you would expect 10.

https://tenant.sharepoint.com/_api/search/query?querytext='mikael'&rowlimit=10

image

Where are the rest of my 10 results? Well, they happen to be located in the Secondary Results, a place I never looked.

image

I’ll explain the behavior – which is actually correct (sort of) – and why this happens now in 2022.

A trip down memory lane

When SharePoint 2013 was released, Microsoft introduced the query rules feature, which allowed you to bring result blocks into the search results, as seen below.

The below screenshot triggers the rule “People Name in SharePoint”, which brings in two result blocks.

image

And the query rule definition

image

The definition above says to always place people on top, and to possibly show documents authored by the person as a block within the results, or interleaved.

The thing is, the logic/setting for how the block is ranked has changed.

Today’s logic – changed in October 2021
image

The old logic

image

The old logic, introduced with SharePoint 2013, would start the block high up, and if results in the block were not clicked it would move down the page, and eventually off the page – which is what has happened for most customers over the years.

The new logic will ALWAYS interleave the block into the page 1 results, never moving it off the page.

So how does this affect the API?

By default an API query will invoke query rules unless they are explicitly turned off, e.g. the above query https://tenant.sharepoint.com/_api/search/query?querytext='mikael'&rowlimit=10

As Modern Search was introduced quite some time back, it has greatly reduced the use of the classic search center. This means people haven’t clicked results in result blocks for quite some time, so no clicks were recorded and the blocks moved off the page – never appearing in API queries.

Now that the logic has changed, the blocks come back – which is not a bug, but maybe not expected.

Together with query rules there is another API setting available, one I had never thought about, but it’s been there all along: “Enable Interleaving”, which is set to true by default, documented at https://learn.microsoft.com/en-us/previous-versions/office/sharepoint-csom/jj262234(v=office.15).

Of course, if you don’t need query rules for your scenario you should always disable them on the API call. Problem solved!

Then again, when using the PnP Modern Search web parts, a common scenario is to use promoted results, and thus you need query rules enabled. This leads to queries on people’s names triggering the original “People Name in SharePoint” rule, causing interleaving to happen and the results to be split into primary and secondary result tables in the response.

The solution then is to set EnableInterleaving=false.

Changing the query to https://contoso.sharepoint.com/_api/search/query?querytext='mikael'&enableinterleaving=false&rowlimit=10 ensures 10 results as expected in the primary result set.

image
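
In code, the fix is just the extra parameter; a rough sketch (placeholder tenant and token) where all rows now land in the primary result table:

# Sketch: disable interleaving so all hits stay in PrimaryQueryResult instead
# of being split out into secondary results by the query rule.
import requests

url = ("https://contoso.sharepoint.com/_api/search/query"
       "?querytext='mikael'&enableinterleaving=false&rowlimit=10")
resp = requests.get(url, headers={
    "Accept": "application/json;odata=verbose",
    "Authorization": "Bearer <token>",
})
resp.raise_for_status()
query = resp.json()["d"]["query"]
primary_rows = query["PrimaryQueryResult"]["RelevantResults"]["Table"]["Rows"]["results"]
print(len(primary_rows))  # expect the full 10 results here now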

I have released a fix to the Search Query Tool which disables interleaving by default, or you can set it yourself.

https://github.com/pnp/PnP-Tools/releases 

And I have made the same fix to the PnP Modern Search web parts v4.8, where any interleaving should be done manually at the template level if strictly needed.

Query variable trick in Microsoft Search verticals (and classic)

Microsoft has been working on both classic and modern scenarios for Microsoft Search, and evaluating existing solutions to determine the best way to support query variables. This post is not exclusive to Microsoft Search, and the same technique can be used with any SharePoint classic search experience. The only difference is the type of query variables that are supported for each experience.

For supported query variables in Microsoft Search modern experiences see https://learn.microsoft.com/en-us/microsoftsearch/manage-verticals#profile-query-variables.

For supported query variables in classic search see https://www.techmikael.com/2014/05/s15e03-query-variables-constant-trouble.html.

Solution

image

The sample case solution provides an option to filter search results down to a city. All items are tagged with a managed property City to allow for the filtering. On the SharePoint page of the solution the user can pick their own or a specific city. When picking their own, no query parameter with the city is passed. When picking a specific city, the user is sent to a vertical in Microsoft Search passing the city value as a query string parameter:

/_layouts/15/search.aspx/verticalname?City=Helsinki

Which brings us to the query template to use:

{searchTerms} {?City:{Request.City} NOT UNIQUESTRING}{?City:{Profile.positions.detail.company.address.city}}

To see what properties you can use for a Profile query variable, view the output of https://graph.microsoft.com/beta/me/profile in e.g. Graph Explorer. For our test case the City location for a person is available via the query variable {Profile.positions.detail.company.address.city}.

What the above query template achieves is: if a query string parameter City is present, it will be used as part of the query. If not present, the City value from the logged-in user’s profile will be used instead. The {?} notation means that if a query variable is missing, the part enclosed within the braces will be removed altogether from the template on evaluation (mimicked in the code sketch after the scenarios below).

I’m using a trick with UNIQUESTRING (which could be any random unique string not present in the search index) to invalidate the last part of the query if we have a query string parameter in the URL. It adds KQL, sort of, for a property which does not exist, and that part is thus ignored.

Let’s add some examples to illustrate the evaluation; in the ending queries below, everything from NOT UNIQUESTRING onwards is the part that gets ignored. The user’s profile value for city is Oslo.

Scenario 1 – Click on Oslo

  • ?City=Oslo
  • User’s City=Oslo

Ending query: City:Oslo NOT UNIQUESTRINGCity:Oslo    

Scenario 2 – Click on Helsinki

  • ?City=Helsinki
  • User’s City=Oslo

Ending query: City:Helsinki NOT UNIQUESTRINGCity:Oslo

Scenario 3 – Click on My City

  • ?City=<missing>
  • User’s City=Oslo

Ending query: City:Oslo

Scenario 4 – Click on Helsinki (and missing a city in the profile)

  • ?City=Helsinki
  • User’s City=<missing>

Ending query: City:Helsinki NOT UNIQUESTRING

Scenario 5 – Click on My City and missing a city in the profile

  • ?City=<missing>
  • User’s City=<missing>

Ending query: <empty>
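
For the curious, the {?} fallback semantics can be mimicked in a few lines of Python. This is a toy illustration of the evaluation rules described above, not the actual SharePoint implementation:

# Toy model of the {?...} evaluation: expand {Var} tokens from a dictionary,
# and drop an optional {?...} block entirely if any variable in it is missing.
import re

def evaluate(template, variables):
    def expand_block(match):
        body = match.group(1)
        missing = False
        def sub_var(m):
            nonlocal missing
            value = variables.get(m.group(1))
            if value is None:
                missing = True
                return ""
            return value
        expanded = re.sub(r"\{([^{}?]+)\}", sub_var, body)
        return "" if missing else expanded
    # Optional blocks first (allowing one level of nested {Var} tokens) ...
    result = re.sub(r"\{\?((?:[^{}]|\{[^{}]*\})*)\}", expand_block, template)
    # ... then plain variables such as {searchTerms}.
    return re.sub(r"\{([^{}?]+)\}",
                  lambda m: variables.get(m.group(1), ""), result).strip()

template = ("{searchTerms} {?City:{Request.City} NOT UNIQUESTRING}"
            "{?City:{Profile.positions.detail.company.address.city}}")

# Scenario 2: ?City=Helsinki in the URL, user's profile city is Oslo
print(evaluate(template, {
    "searchTerms": "",
    "Request.City": "Helsinki",
    "Profile.positions.detail.company.address.city": "Oslo",
}))  # -> City:Helsinki NOT UNIQUESTRINGCity:Oslo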

And that’s all there is to it. By matching values from a user’s profile with values on other data, you can create quick navigation and filtering scenarios. By adding SPFx into the mix, even more control and logic can be built around the search results pages and the values passed in.

Tuesday, March 14, 2023

Retirement of custom default result sources in Microsoft Search for modern search experiences

In November 2021 I posted about bookmarks being the successor feature to promoted results for organizational scoped searches in Microsoft Search, which was the first step to modernize the Microsoft Search stack and remove dependencies on classic SharePoint search features and APIs.

The next step is now under way, as announced by MC526131 - Retirement of custom default result sources in Microsoft Search for modern search experiences.

image

For most customers the change, which will start April 10th 2023, should have no impact. The way KQL rewrite works for a default result source was never intended for modern search experiences as any change was applied to all verticals showing SharePoint and OneDrive content. The ability for an admin to edit and add KQL per vertical in the modern experience is a better and more accurate feature – succeeding the result source feature which doesn’t really work well in modern search experiences in SharePoint. See https://learn.microsoft.com/en-us/microsoftsearch/manage-verticals#keyword-query-language-kql for more information on vertical management.

I want to be crystal clear that nothing happens to classic search experiences nor to experiences powered by the SharePoint Search API. Everything keeps on working as before – it’s just the out-of-the-box modern Microsoft Search experiences that stop reading the setting.

I also want to point out that this does not affect query rule triggered promoted results on SharePoint sites or SharePoint hub sites, as the modern experience will show a promoted result for these scopes regardless of the result source they may have been targeted towards.

And as a last note, this only applies to environments where search vertical administration has been rolled out.

That’s it – you can likely ignore this post, as it should not affect you.