[#PETALSDISTRIB-146] Reliability aka delivery guarantees for consumer-provider communication - Petals Link JIRA

Details

Type: New Feature
Status: New
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.3.0-beta-1
Fix Version/s: 5.4.0
Component/s: Components, Container
Security Level: Public

Description:
This question relates to many components of Petals, in particular the container and the CDK.

The main point is to be able to guarantee to a consumer that its message is handled as desired by the provider (and inversely for out messages and such).
The terms reliability and persistence are used in more or less the same way as in the JBI specification (appendix).

There are several aspects:

When the container is not impacted by any problem, the JBI specification provides everything needed: if the message exchange pattern is followed (i.e. either DONE, FAULT or IN/OUT messages are exchanged as the pattern dictates) then both actors can assume everything is ok, and if an error is set it means something went wrong and the actors are notified with this information.

When the container has a problem (say, a crash), messages shouldn't be lost, so that when it comes back the execution can continue as desired.

With such guarantees, service consumers and providers can focus on their business and do not have to manage anything to handle such failures.
Of course, in some corner cases this may not be attainable; this has to be discussed here.

For each of these aspects, there is work to do:

We need to ascertain that, along the whole path that an exchange takes (i.e. from send to accept), errors are caught as needed and the exchange is correctly set in error and returned to the sender. This means that:

The CDK should send back an error if it drops messages (see ~~PETALSCDK-135~~ and ~~PETALSCDK-90~~).

We should verify everything in the container (issue to be created if needed).

From the moment a send() call on the DeliveryChannel returns, the message should be considered safely persisted until a call to accept() is made on the other side.

We should take into account the distributed scenario (i.e. persistence must happen both before the transport is involved on the sender side, and after it on the receiver side). If possible, double persistence shouldn't happen in the local case.

A choice should be made between these two options (or this could be made configurable... why not... I'm not even sure these questions are not already answered by the JBI specs):

The execution from "send()" to "the message is stored in the DeliveryChannel queue" is more or less atomic, and persisting only in the DeliveryChannel queue (see PETALSESBCONT-339) is enough. If the message cannot be stored, the send method throws an exception and the sender has to deal with it.

As soon as send is called, the message is persisted where relevant, so that if something goes wrong before it can be stored in the DeliveryChannel of the receiver (assuming it is persisted as in PETALSESBCONT-339), it can be resent once the problem is solved. In that case send would return as soon as persistence has occurred. (Not sure why this behaviour would be preferred... we should discuss that.)
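
To make the difference between the two options concrete, here is a minimal in-memory sketch. Everything in it is hypothetical and only for discussion: `ReceiverQueue`, `sendAtomic` and `StoreAndForwardSender` are not Petals or JBI types, and the "persistence" is just a list standing in for a durable store.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical model of the two send() options; none of these types are Petals or JBI APIs. */
public class SendOptionsSketch {

    /** Stand-in for the receiver's persisted DeliveryChannel queue. */
    static class ReceiverQueue {
        final Queue<String> stored = new ArrayDeque<>();
        boolean available = true;

        void store(String message) {
            if (!available) {
                throw new IllegalStateException("receiver queue unavailable");
            }
            stored.add(message);
        }
    }

    /** Option 1: send returns only once the message is in the receiver queue, else it throws. */
    static void sendAtomic(ReceiverQueue queue, String message) {
        queue.store(message); // a failure propagates to the sender, which has to deal with it
    }

    /** Option 2: send returns as soon as the message is persisted on the sender side. */
    static class StoreAndForwardSender {
        final List<String> outbox = new ArrayList<>(); // stand-in for a durable sender-side store
        final ReceiverQueue queue;

        StoreAndForwardSender(ReceiverQueue queue) {
            this.queue = queue;
        }

        void send(String message) {
            outbox.add(message); // persisted first; from here on send may return
            flush();
        }

        /** Forwards pending messages; keeps them while the receiver is unavailable. */
        void flush() {
            while (!outbox.isEmpty() && queue.available) {
                queue.store(outbox.get(0));
                outbox.remove(0); // removed only after the receiver has stored it
            }
        }
    }

    public static void main(String[] args) {
        ReceiverQueue queue = new ReceiverQueue();
        queue.available = false;

        try {
            sendAtomic(queue, "m1");
        } catch (IllegalStateException e) {
            System.out.println("option 1: sender sees the failure");
        }

        StoreAndForwardSender sender = new StoreAndForwardSender(queue);
        sender.send("m2"); // returns normally, message waits in the outbox
        queue.available = true;
        sender.flush(); // message is delivered once the problem is solved
        System.out.println("option 2: delivered " + queue.stored);
    }
}
```

Note that in this sketch the second option removes a message from the outbox only after the receiver has stored it, so a crash between the two steps would lead to a resend, i.e. possible duplicates on the receiver side. Whether that is acceptable is part of the choice to discuss.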

Some preliminary discussions on persistence in the delivery channel are in PETALSESBCONT-339 (which should maybe be rejected and split into multiple issues because it's a mess...)

Environment:

-

Issue Links

Depends

This issue depends on:
~~PETALSCDK-90~~ No response to the service consumer if the message exchange is rejected by processors
PETALSESBCONT-339 Introduce a mechanism to handle a lot of messages in the DeliveryChannel
PETALSESBCONT-437 Reimplement the Remote TCP transporter with Apache Netty
PETALSCDK-175 Handle capacity saturation of exchange processor with various strategies
~~PETALSCDK-135~~ Do not drop messages when there is no thread for processors available
~~PETALSESBCONT-367~~ The NIO Transporter does not send back exception if an error happens while delivering the message on the receiving side
~~PETALSESBCONT-379~~ After/During component shutdown, reinject the exchanges left in the DeliveryChannel into the NMR
~~PETALSESBCONT-387~~ During/After SU shutdown (aka endpoint deactivation), reinject new exchanges left in the DeliveryChannel into the NMR
~~PETALSESBCONT-381~~ When receiving an answer to a timed out exchange, send an error back to the sender if possible

Activity

Victor NOËL added a comment - Thu, 2 Jul 2015 - 18:02:32 +0200

I agree yes, that's what I meant by "When the container is not impacted by any problem, the JBI specification provides everything needed".

Nevertheless, there are currently situations where messages are lost and where that could be avoided (for example, messages are dropped in the CDK, and if the container crashes, exchanges in the queue are not easily restorable).
That's what this issue is about: the question discussed here sits at a lower level than the MEP, and the MEP should rely on it.
Also, I guess you understood that this issue is an attempt to clear up the mess of the PETALSESBCONT-339 issue.

I don't understand the question "but which difference between both ?".

As for the send versus sendSync, yes, that's a good question. When using the word "send", I was actually also considering sendSync in the discussion, but things are a bit different yes.

In case of a crash, there is nothing to do there: the sender has been stopped, so it cannot receive the message when it is back online, but it also never considered that things were ok, since it never received its response.
If something happens without the sender being stopped, then it's the same as with the rest, I think, no? An error is present in the returned exchange when applicable, or else the sendSync will either block until the message is back or time out.
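
As a rough illustration of those three sendSync outcomes (a reply comes back, the exchange comes back in error, or the call times out), here is a small self-contained model. It is only a sketch: the `Outcome` enum and this `sendSync` method are hypothetical, and the provider's answer is modelled with a `CompletableFuture` rather than with the real JBI `DeliveryChannel`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical model of sendSync outcomes; these types are illustrative, not the JBI API. */
public class SendSyncOutcomeSketch {

    enum Outcome { REPLY, ERROR, TIMEOUT }

    /**
     * Waits for the provider's answer, modelled as a future that completes normally
     * with a reply, or exceptionally when the exchange is set in error.
     */
    static Outcome sendSync(CompletableFuture<String> answer, long timeoutMs) {
        try {
            answer.get(timeoutMs, TimeUnit.MILLISECONDS);
            return Outcome.REPLY;
        } catch (ExecutionException e) {
            return Outcome.ERROR; // the exchange came back, but in error
        } catch (TimeoutException e) {
            return Outcome.TIMEOUT; // nothing came back in time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            return Outcome.TIMEOUT; // simplification: treated like "no answer"
        }
    }

    public static void main(String[] args) {
        // A reply is already available.
        System.out.println(sendSync(CompletableFuture.completedFuture("out"), 50));

        // The exchange is returned in error.
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("provider failed"));
        System.out.println(sendSync(failed, 50));

        // Nothing ever comes back: the call times out.
        System.out.println(sendSync(new CompletableFuture<>(), 50));
    }
}
```

In all three cases the sender knows where it stands, which is why the crash of the sender itself is the only case where nothing can be done.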

Christophe DENEUX added a comment - Thu, 2 Jul 2015 - 17:44:25 +0200 - edited

In my mind, the delivery guarantee is linked to the MEP, and it matters when the consumer does not expect a reply from the provider, except an ACK telling it that its message has been taken into account:

for MEP InOut or InOptionalOut, it makes no sense because a reply is expected. If no reply occurs, it's a timeout, so the consumer can manage the situation,
for MEP InOnly and RobustInOnly: the delivery guarantee seems to make sense, but which difference between both ?

In your 2nd point, you talk about "From the moment a send() call on the DeliveryChannel returns", but what about DeliveryChannel.sendSync(...) ?
