Do you want SharePoint 2013 in-flight workflows to complete after a Full Farm Fail Over?

Important Update – 10/30/17

The approach outlined below, sharing a workflow farm between two SharePoint farms in a high availability scenario via SQL Always On asynchronous replication, works; however, it is not a scenario supported by Microsoft.

Intro

I’ve been spending a lot of time lately testing SharePoint + SQL Always On scenarios and found some really interesting and important things regarding SharePoint 2013 workflows. Before diving into the issue and solution, let me quickly recap 2013 workflows and some basics around SQL Always On.

We made a huge shift in how workflows are executed in SharePoint 2013. You stand up what’s now called a workflow farm, and each SharePoint 2013 farm creates a registration with the workflow farm (via a PowerShell cmdlet). The key benefit is that when a SharePoint 2013 workflow executes, the SharePoint farm/WFE submits the workflow instance to the WF farm for processing, so the WF farm does the heavy lifting. SharePoint 2013 is still backwards compatible in that you can create 2010 style workflows, which are processed locally the same way they are in SharePoint 2010.

With SQL Always On, it’s possible to replicate SharePoint databases to other SQL nodes in the same data center. This is referred to as synchronous replication, which gives you high availability for a single farm. It’s also possible to replicate those same databases asynchronously to a third SQL node. This third node resides in a separate data center, and a separate farm connects to those databases, which are in a read-only state. This is mainly for disaster recovery: if Data Center 1 dies, you can perform a full farm failover and be up and running against those databases in read/write. This is referred to as the Active/Passive SharePoint 2013 dual farm model.

Results of Failover Testing and Workflows

I wanted to see how several SharePoint features behave after simulating both a high availability failover via synchronous replication (same farm) and a DR failover via asynchronous replication (different farm).

For my test, I created a fairly simple 2013 list workflow using SharePoint Designer which pauses for 10 minutes, then resumes and updates the current item’s UpdatedByWF field in a list. It looks like:

clip_image002

Note: The workflow will automatically fire when creating an item in the list.

I performed the following test:

  1. Start the WF by creating a list item
  2. Fail over the availability group while the workflow is in progress
  3. Wait 10 minutes to see if the workflow completes

Test Time

1. Synchronous Failover to SQL 2 (same farm)

Results – Success (Workflows Complete)

2. Asynchronous Failover to SQL 3 (different farm)

Results – Fail (Workflows stay In Progress)

Note: I ran the same test with a SharePoint 2010 style workflow, and it completes successfully in both scenarios!

Issue

This means that out of the box, SharePoint 2013 in-flight workflows will not complete after a full farm failover until you fail back to the farm where the workflow was initiated. In my opinion, this is an eye opener because many large enterprise environments rely heavily on workflows for critical business functions.

Cause

The issue is caused by the way registrations are added to the workflow farm. When you register more than one SharePoint farm to the same workflow farm, you must use unique scope names. If you don’t, you’ll see an error like the following:

clip_image004

So in this case, the workflow farm contains two scope registrations, one for Farm A and one for Farm B. When I kick off a workflow, the workflow instance is written to the instance table in the WFInstanceManagement database. The workflow instance is stamped with Farm A’s scope ID (the farm where the workflow was initiated). For example:

clip_image006

clip_image008

After failover, Farm B can’t interact with this instance: it knows which instance to query, but it passes its own scope ID, which differs from the scope ID (Farm A’s) associated with the running workflow instance. The only supported workaround for this configuration is to fail back over to Farm A and let the workflow complete.

Resolution

The resolution is to set both farms to use the same scope name. It’s not quite that simple, so keep reading.

Q&A 

Question: Wait, I thought this wasn’t possible?
Answer: Yes, that’s what I thought too initially, but there is a Force parameter which you can use in Farm B to register with the same scope name.

Question: Wouldn’t that overwrite existing settings in Farm A’s scope registration?
Answer: Yes, it does, which requires further explanation before going through the steps to resolve the issue.

Question: What happens when I register a 2013 SharePoint Farm to a Workflow Farm?
Answer: Let’s assume I run the following for test purposes:

Register-SPWorkflowService -SPSite "http://intranet" -WorkflowHostUri "http://wf:12291" -ScopeName "contosoWF" -AllowOAuthHttp

This will create two scopes in the WF Farm.

Parent Scope with Path /ContosoWF
Child Scope with Path /ContosoWF/default

The scopes contain security configuration which allows the SharePoint server to access and call into the WF farm via server-to-server authentication. The parent scope contains a trusted issuer that is stamped with the STS Cert from Farm A. To see this information, run the following PowerShell on the WF farm:

$parentScope = Get-WFScope -ScopeUri http://wf:12291/ContosoWF
$parentScope.SecurityConfigurations.TrustedIssuer

The child scope is stamped with the Realm of Farm A as well as the STS Config’s NameIdentifier property.

$childScope = Get-WFScope -ScopeUri http://wf:12291/ContosoWF/default

Steps to Resolve Issue

First, I had a couple of moments where I was close to taking a sledgehammer to my monitor. Thanks a ton to the members of the WF PG who helped me out of a few ditches. The steps to work around the issue must be followed in this order:

  1. Both Active/Passive Farms must use the same STS Certificate
  2. Both Active/Passive Farms must use the same Realm
  3. Both Active/Passive Farms must use the same STS Config Name Identifier
  4. Both Farms must be registered using the same Workflow Scope Name
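
Before changing anything, it can help to capture the three values from the list above on each farm so you can compare them side by side. The following is a minimal sketch, not part of the original procedure; it assumes it is run in the SharePoint Management Shell on one server in each farm, and it reads the STS signing certificate through the LocalLoginProvider property of the STS config.

```powershell
# Sketch: run once on a server in Farm A and once on a server in Farm B,
# then compare the output of the two runs.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$stc = Get-SPSecurityTokenServiceConfig

[pscustomobject]@{
    Server            = $env:COMPUTERNAME
    # Thumbprint of the certificate the STS currently signs tokens with
    StsCertThumbprint = $stc.LocalLoginProvider.SigningCertificate.Thumbprint
    # Farm-level authentication realm (a GUID)
    Realm             = Get-SPAuthenticationRealm
    # Compared in Step 3
    NameIdentifier    = $stc.NameIdentifier
} | Format-List
```

All three values should match between the farms once Steps 1 through 3 are done.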

Step 1: Use the same STS Cert in both Farms

Follow the instructions here to replace the existing STS Cert with the same new one in both farms.

A couple of important notes before applying this step.

Note 1: Don’t try to reuse the existing STS Cert from Farm 1 in Farm 2. It doesn’t work; you must generate a new certificate for the STS and use that same certificate in both farms.

Note 2: Running the following will likely generate an error, which you can ignore: certutil -addstore -enterprise -f -v root $stsCertificate
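
For orientation, here is a rough sketch of what importing the new certificate can look like. It is not a substitute for the linked instructions. It assumes the new certificate has been exported to a .pfx file; the path is hypothetical. The same file must be imported in both farms.

```powershell
# Sketch only: run in the SharePoint Management Shell in EACH farm,
# importing the SAME newly generated certificate.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Hypothetical path; the password is prompted for rather than hard-coded.
$pfxPath     = 'C:\Certificates\sts.pfx'
$pfxPassword = Read-Host -Prompt 'PFX password' -AsSecureString

# Keep the private key exportable and persisted in the key store.
$flags = [System.Security.Cryptography.X509Certificates.X509KeyStorageFlags]'Exportable, PersistKeySet'
$stsCertificate = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($pfxPath, $pfxPassword, $flags)

# Replace the STS signing certificate for the farm.
Set-SPSecurityTokenServiceConfig -ImportSigningCertificate $stsCertificate

# Restart IIS and the timer service on this server so the change is picked up.
iisreset
Restart-Service SPTimerV4
```

The restarts only affect the local server, so I would expect to repeat them on the other servers in the farm.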

Step 2: Set Farm B to use Farm A’s Authentication Realm

1. Farm A: Run Get-SPAuthenticationRealm and copy the output

clip_image010

2. Farm B: Set the Authentication Realm to Farm A’s via Set-SPAuthenticationRealm -Realm <guid>

clip_image012
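
If you want to make this step safe to re-run, a small guard around the cmdlet is enough. This is a sketch under the assumption that you paste Farm A’s realm into the placeholder; the variable names are my own.

```powershell
# Sketch: run in the SharePoint Management Shell on a server in Farm B.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Placeholder: paste the value returned by Get-SPAuthenticationRealm on Farm A.
$farmARealm = '<guid>'

$currentRealm = Get-SPAuthenticationRealm
if ($currentRealm -ne $farmARealm) {
    # Only change the realm when it actually differs from Farm A's.
    Set-SPAuthenticationRealm -Realm $farmARealm
    Write-Host "Realm changed from $currentRealm to $farmARealm"
} else {
    Write-Host "Realm already matches Farm A"
}
```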

Step 3: Ensure both Farms use the same value for the following property: SPSecurityTokenServiceConfig.NameIdentifier

Important: In this case, you can simply copy the property value from Farm 1 and set Farm 2’s property with the copied value.

Farm 1: Launch PowerShell on any server and run the following:

$stc = Get-SPSecurityTokenServiceConfig
$stc.NameIdentifier

Copy the entire value. Note in my case it’s: 00000003-0000-0ff1-ce00-000000000000@fd0ed39d-de87-45d6-8c54-4ef4950ebbff

Farm 2: Launch PowerShell on any server and run the following:

$stc = Get-SPSecurityTokenServiceConfig
$stc.NameIdentifier

Note: The value will be different from Farm 1’s, so in my case I’m going to set this property to the value copied above.

$stc.NameIdentifier = "00000003-0000-0ff1-ce00-000000000000@fd0ed39d-de87-45d6-8c54-4ef4950ebbff"
$stc.Update()

Step 4: Register both Farms to the Workflow Farm

This is the most important part. In this case, we will register both farms using the same scope name. I’ll use testscope as my scope name. Out of the box, this cmdlet fails in Farm 2 once Farm 1 has registered the scope, so in Farm 2 we will run it with the Force parameter.

  1. From any SharePoint Server in Farm A, run the following PowerShell:

Register-SPWorkflowService -SPSite "https://intranet" -WorkflowHostUri "https://workflowserver.contoso.com:12290" -ScopeName "testscope"

  2. From any SharePoint Server in Farm B, run the following PowerShell:

Register-SPWorkflowService -SPSite "https://intranet" -WorkflowHostUri "https://workflowserver.contoso.com:12290" -ScopeName "testscope" -Force
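
To confirm the shared scope after both registrations, you can inspect it from the WF farm with the same Get-WFScope pattern used earlier. This is a sketch that assumes the scope URI is the workflow host URI followed by the scope name, as in the ContosoWF example above.

```powershell
# Sketch: run in the Workflow Manager PowerShell console on a WF farm server.
$scope = Get-WFScope -ScopeUri https://workflowserver.contoso.com:12290/testscope

# The trusted issuer should reflect the STS certificate both farms now share.
$scope.SecurityConfigurations.TrustedIssuer
```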

That should be it! A couple of things to remember while testing out in-flight workflows.

  1. The user initiating the workflow must have an associated user profile in both farms
  2. App Management Service Application must be provisioned in both farms
  3. Do not test by running workflows as the System Account. It will error.
  4. When updating DNS and initiating failover, you must flush the DNS cache on the WF Server and SharePoint WFEs
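
For the DNS flush in item 4, something along these lines saves logging on to each box. It is a sketch that assumes PowerShell remoting is enabled on the target servers; the server names are hypothetical.

```powershell
# Hypothetical server names; replace with your WF server and SharePoint WFEs.
$servers = 'WF01', 'SPWFE01', 'SPWFE02'

Invoke-Command -ComputerName $servers -ScriptBlock {
    # Flush the local DNS resolver cache on each remote server.
    ipconfig /flushdns | Out-Null
    "DNS cache flushed on $env:COMPUTERNAME"
}
```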

Thanks,

Russ Maxwell, MSFT