Logic Apps and the Service Bus Connector - The Case of the Incomplete Message!

The Symptom

The other day I tracked down an issue my team was seeing with an Azure App Service Logic App. The Logic App was being triggered by an Azure Service Bus Connector that was configured to work with a Service Bus Topic. With a message published to the Service Bus Topic, an instance of the Logic App would execute successfully — as expected on the trigger’s configured schedule. But then, unexpectedly, on the next scheduled time period, often a new instance of the Logic App would execute and reprocess the exact same message from the Service Bus Topic. It was as if the Service Bus Topic message was never successfully “completed” and thus removed from the Topic.

What You Need To Understand

In order to understand why the same Service Bus Topic message was being processed twice, this is what you need to know.

The Service Bus Connector operates in ReceiveMode.PeekLock, so it is not receiving and immediately deleting the message from the Topic in one atomic call like in ReceiveMode.ReceiveAndDelete. In PeekLock mode, two calls to the Service Bus must be made. The first call is to the Receive method, and the Service Bus puts a lock on the message for a specific lock duration. The default LockDuration for a Queue or Subscription is 1 minute, and currently that cannot be configured in the Service Bus Connector. Then a second call, to the Complete method, is required to mark that message as processed and delete it from the Service Bus.
After a call to a Connector’s trigger method returns a result indicating that data is available to be processed and that a Logic App instance should be created and executed, the Logic App trigger mechanism immediately calls the Connector’s trigger method again. This logically is being done in case there is even more data to be processed. This makes sense because if there is more data available to be processed, one typically would want that data being processed as soon as possible, and would not want to wait for the next scheduled trigger time period.
Connectors can have their own specified triggerState data passed from one call of their trigger method to subsequent calls of their trigger method.
The Service Bus Connector passes the message’s unique LockToken as triggerState from the first successful trigger invocation to its next trigger call that runs immediately after it.
When the Service Bus Connector trigger method is executed, one of the first things that it does is to determine if the triggerState data has been provided. If the triggerState data has been provided, the Service Bus Connector calls the Complete method of the message represented by that given triggerState/LockToken. Only after doing that will it attempt to receive the next message from the configured Topic or Queue.

The Cause

Now that we understand how the Complete method is called as part of the Service Bus Connector’s trigger mechanism, it is easy to understand the cause of the problem.

If the second call to the trigger fails (the call that would complete the message), for instance, because of a transient error on the gateway, then the Service Bus message will not be completed — until perhaps the next scheduled trigger.

However, after the default lock duration of 1 minute expires, the lock is lost and Service Bus makes the message available to be received again.

This is exactly what happened — the message was not completed and when the next scheduled trigger executes, the same message is processed again.

Proof — The Smoking Trigger

From the Trigger History of the Logic App, we can see the trigger that failed in the image below.

Logic App Trigger History with Failed Trigger

The call to this trigger returned a 502 “Bad Gateway” error.

1502 - Web server received an invalid response while acting as a gateway or proxy server.

The full JSON of the output message for the failed trigger is:

1{
2    "headers": {
3        "date": "Mon, 07 Sep 2015 23:28:32 GMT",
4        "server": "Microsoft-IIS/8.0"
5    },
6
7    "body": "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\r\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\r\n<head>\r\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\"/>\r\n<title>502 - Web server received an invalid response while acting as a gateway or proxy server.</title>\r\n<style type=\"text/css\">\r\n<!--\r\nbody{margin:0;font-size:.7em;font-family:Verdana, Arial, Helvetica, sans-serif;background:#EEEEEE;}\r\nfieldset{padding:0 15px 10px 15px;} \r\nh1{font-size:2.4em;margin:0;color:#FFF;}\r\nh2{font-size:1.7em;margin:0;color:#CC0000;} \r\nh3{font-size:1.2em;margin:10px 0 0 0;color:#000000;} \r\n#header{width:96%;margin:0 0 0 0;padding:6px 2% 6px 2%;font-family:\"trebuchet MS\", Verdana, sans-serif;color:#FFF;\r\nbackground-color:#555555;}\r\n#content{margin:0 0 0 2%;position:relative;}\r\n.content-container{background:#FFF;width:96%;margin-top:8px;padding:10px;position:relative;}\r\n-->\r\n</style>\r\n</head>\r\n<body>\r\n<div id=\"header\"><h1>Server Error</h1></div>\r\n<div id=\"content\">\r\n <div class=\"content-container\"><fieldset>\r\n  <h2>502 - Web server received an invalid response while acting as a gateway or proxy server.</h2>\r\n  <h3>There is a problem with the page you are looking for, and it cannot be displayed. When the Web server (while acting as a gateway or proxy) contacted the upstream content server, it received an invalid response from the content server.</h3>\r\n </fieldset></div>\r\n</div>\r\n</body>\r\n</html>\r\n"
8}

The Remedy

This situation should not happen often, but unfortunately it was occurring multiple times per day, and only ever on the second call to the trigger method. The only real solution is to follow Microsoft’s advice and design your systems to be idempotent, because it can and occasionally will happen.

In addition to that, to reduce the frequency of the situation, hopefully Microsoft will release a change in the near future so that failed calls to the trigger method are automatically retried.

Microsoft Azure Microsoft Azure App Service Logic Apps API Apps Service Bus Service Bus Connector Messaging

Asynchronous Message Exchange Pattern Terminology

In software development, it can be quite difficult to give a good name to a component or coding artefact (such as a variable, method or class).

If you do not intentionally put effort into searching for a good name, then you are probably increasing technical debt and ongoing maintenance pain with terminology that causes confusion for other developers - or for yourself the following day when your memory has faded!

A good name for a coding artefact typically fulfils the following criteria:

is as simple as possible;
is descriptive and intention revealing - it conveys the purpose or [single] responsibility of the artefact;
clearly differentiates its purpose from other artefacts;
is intuitive for other team members to understand; and
uses the accepted, ubiquitous terminology of the domain.

The last thing anyone wants is to be confused by a name that is ambiguous and overlaps with or overloads the name of other artefacts.

Deficiencies in the expressiveness of the English language itself can be a common cause of the difficulties we have. For instance, the Greek language distinguishes between at least four different ways of describing the one English word for “love”. Take that you imprecise and insufficient English language!

And so now we come to the word “asynchronous” when describing service call and message exchange patterns. Within the world of software, and especially in the integration, messaging or service-oriented architecture space, the word “asynchronous” is too overloaded.

Below is a list of the different types or usages of the word “asynchronous” that I am commonly communicating about and distinguishing between:

A client can perform an asynchronous, non-blocking call to a service operation - e.g. from a background thread.
A service operation may be called synchronously from a client, but the operation may return a response early to the caller and continue operating asynchronously in the background. For instance, this is possible with an Enterprise Service Bus / BizTalk orchestration.
A service operation may be called synchronously from a client, but the client may perform the call using a synchronous-over-asynchronous messaging pattern. For instance, the client sends an asynchronous request message to a queue or service bus. The client then subscribes and waits for a correlated response message. To the end user, the entire interaction appears synchronous, none the wiser that asynchronous messages were used instead of typical web service calls.
There might be a requirement for a service operation to be invokable both synchronously via a web service endpoint and also asynchronously via a message placed on a queue or service bus.

The word “asynchronous” is insufficient to clearly describe and distinguish between each of the points above.

So, borrowing concepts from both BizTalk ports and Command and Query Separation (CQRS), I am currently satisfied with using the following terminology:

I only use the word “asynchronous” when a unit of work is performed in the background - where there is no caller waiting for a response message on the same channel it used to make the call in the first place. Please note that there is a difference between a response message and the ACK/NACK that is returned to the caller to acknowledge successful receipt (or not) of the message.
A “Publish Message Command” pattern - where a caller sends/publishes a message to a queue or message bus and there may be multiple subscribers listening for that message.
A “One-Way [Asynchronous] Command” pattern - where a caller sends a message to a queue or message bus, and a unit of work is performed in the background.
A “One-Way [Asynchronous] Command with an Asynchronous Response” pattern (also known as a “Synchronous-over-Asynchronous Command / Query”) - where a caller sends a message to a queue or message bus and also subscribes to a corresponding/correlated response message that it waits for. Meanwhile, a unit of work is performed in the background and that unit of work publishes the corresponding/correlated response message to be received by the waiting original caller.

The terminology “One-Way” explicitly refers to the endpoint receiving a request and never sending a response message back to the caller (once again, the ACK/NACK is not considered a response message).

“Two-Way” however explicitly refers to the service endpoint receiving a request message and the service returning a response message back to the caller.

If we were to extend a similar style of terminology into the world of synchronous patterns, we would have:

A “Two-Way Command” pattern for a synchronous service operation - where the caller waits for a response; and
A “Two-Way Query” pattern for a synchronous service operation that is typically read only with no side effects (except perhaps for writing security audits).

To make things more interesting, different technologies may enable or prevent certain patterns from being implemented. For instance, in .NET, a “One-Way [Synchronous] Command” pattern can be implemented in WCF. In this case, the .NET method signature would have a “void” return type, operate synchronously and then at the end of the operation would return the typical ACK/NACK. This communication pattern however is not applicable in the world of BizTalk.

This WCF version of a “One-Way [Synchronous] Command” pattern is the only reason why I included the text “[Asynchronous]” in the name of the One-Way asynchronous patterns above - in order to clearly differentiate between them.

With this terminology, I have found that I can more quickly and clearly discuss message exchange patterns with my team. I hope it does the same for you!

Naming Async Asynchronous Messaging Patterns