Thursday, September 19, 2024

Yes, you can have exactly-once delivery

Introduction

This post is ostensibly about an obscure technical issue in distributed systems, but it's really about human communication, and how disagreements that on the surface appear to be about technical issues can sometimes turn out to actually be disagreements about the meanings of words.  I'm taking the time to write a fairly extensive post about this for two reasons.  First, I'm hoping it will be helpful to provide a case study of what appears on the surface to be a technical dispute but turns out to be a mere quibble over terminology.  Learning to recognize such disputes can help defuse tense situations.  And second, this might provide some cover for someone early in their career who comes to the same conclusion I did about a certain emperor having no clothes.

The dispute

I was reading Hacker News a few days ago and stumbled on a comment posted by /u/solatic in response to an article entitled "Falsehoods programmers believe about TCP".  I quote that comment here in its entirety:

Related: you can get at most once delivery or at least once delivery; you cannot get exactly once delivery. If I had a dollar for every junior who thought that a lack of exactly once delivery guarantees was a bug...

This struck me as odd because I could see a straightforward way to implement exactly-once delivery on top of at-least-once delivery.  But solatic seemed sufficiently confident that I thought I might be missing something, so I posted what I thought was a cautious response:

If you can get at-least-once delivery, why can you not build exactly-once on top of that?

I won't go into detail about what happened after that.  You can go read the resulting thread if you like, but the TL;DR is that the discussion spun wildly out of control, which ultimately motivated me to write this post.

Before I dive into the details, I want to give my credentials because I am about to challenge what is apparently some pretty deeply entrenched conventional wisdom.  Unfortunately, many of the people who hold these mistaken views seem to subscribe to arguments from authority.  I got a lot of admonitions that were tantamount to, "You obviously need some remedial training in basic computer science."  That's possible, but I want to provide some information about my background so you can adjust your Bayesian prior regarding that particular hypothesis.

I have a Ph.D. in Computer Science.  My specialty is (or at least was) AI and not distributed systems, but I nonetheless know my way around a network stack.  My first startup, about 30 years ago now, was an (unsuccessful) attempt to launch a new network architecture.  We built our own hardware.  I wrote the Linux device drivers for our custom NIC.  I currently work part-time as a consultant for a hardware design team that builds network routing chips.  I know how a modern network works literally from the transistors on up.  And I know about the two generals and the muddy children.

So with that, let's look at the claim that I found questionable: "You can get at most once delivery or at least once delivery; you cannot get exactly once delivery."

What does "delivery" mean here?  The entire kerfuffle turns on the ambiguity of the meaning of this word, which none of my interlocutors ever defined.  But whatever it means, my puzzlement stemmed from the fact that I could see a trivial way to build exactly-once delivery from at-least-once delivery, under any reasonable definition of "delivery".  Specifically, if you could get at-least-once delivery, then turning that into exactly-once delivery should be (it seemed to me) a simple matter of keeping track of all the delivered messages and removing duplicates.  Doesn't that seem plausible to you?  If so, congratulations, your intuitions are serving you well because that will in fact work.  It's not a particularly good solution from an engineering point of view, and it is absolutely not how you would do it in practice (at least not without a lot more refinement), but it does refute the claim that you "cannot get exactly-once delivery".  You can.
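
Just to make this concrete, here is a minimal sketch in Python (the code and all the names in it are mine, not anything anyone posted in that thread).  It assumes the at-least-once transport hands us a stream of messages, each tagged with a unique ID:

    def exactly_once(at_least_once_stream, deliver):
        seen = set()  # IDs of every message delivered so far
        for msg_id, payload in at_least_once_stream:
            if msg_id in seen:
                continue          # duplicate: discard it
            seen.add(msg_id)
            deliver(payload)      # each payload is delivered exactly once

    # The transport delivered message 1 twice; we deliver it once.
    incoming = [(1, "attack at dawn"), (1, "attack at dawn"), (2, "attack at noon")]
    exactly_once(incoming, deliver=print)

The obvious weakness -- the seen set grows without bound -- is fixable, and I'll come back to it at the end.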

So why do so many people deny this obvious truth?  Well, that's a good question.  I'm not 100% sure because no one in that HN thread ever presented a cogent argument or even a reference.  But one common theme that ran through the responses was that I needed to brush up on the two-generals problem.  So let's do that.

The two-generals problem

The Two Generals problem is a classic problem in theoretical computer science, first described by Akkoyunlu, Ekanadham, and Huber in 1975 and given its name by Jim Gray in 1978.  It was the first problem in computer communications ever proved to have no solution.  Some people claim that if you could get exactly-once delivery you could solve the two-generals problem, but this is false.  The two problems have nothing to do with each other, despite having some superficial resemblance.

The two-generals problem is this: there are two generals (imagine that) leading two armies against a common enemy.  They can defeat their enemy only if they successfully join forces and coordinate a simultaneous attack.  But they are geographically separated in an era before modern telecommunications.  The only way they can communicate is by sending messages via courier through enemy territory, where the couriers risk being captured and never heard from again.  In modern terms, their communications channel is half-duplex and unreliable.

It turns out that it is impossible for the generals to coordinate their attack under these conditions.  In order to coordinate the attack, each general needs to know that the other general has received and agreed to the same attack plan that they intend to undertake.  The technical term is that they need to achieve common knowledge, which turns out to be impossible using an unreliable half-duplex communications channel.  The proof is easy to understand: if there were such a protocol, then that protocol must have a termination condition, some point at which consensus has been reached, after which no further messages need to be sent.  But the protocol must still work even if the last message fails to arrive, because the channel is unreliable and the sender of that message has no way to know whether it arrived.  So sending the last message must not be necessary -- the protocol has to still work without it.  By induction, the protocol must work even if no messages are sent at all.  That is clearly impossible, and so our assumption that there exists such a protocol must be false, QED.

But achieving common knowledge has absolutely nothing to do with the question of whether exactly-once delivery is possible -- it is.  As I've already described above, the easiest way to do it is to keep track of all delivered messages and remove duplicates.  (Exercise: why does this strategy not succumb to the same proof of impossibility as the two-generals problem?)
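
If you want a hint, here is a sketch of the whole loop, again in Python and again entirely mine: a sender that retransmits over a lossy channel until it gets an acknowledgement, and a receiver that discards duplicates.  The receiver delivers each message exactly once no matter what the channel does.  What the sender never gets is certainty that delivery happened -- and that certainty is precisely the common knowledge the two generals cannot achieve, and precisely the thing de-duplication never needed:

    import random

    random.seed(42)  # make the example deterministic

    LOSS = 0.5       # every transmission (message or ack) is lost half the time

    seen = set()     # receiver state: message IDs already delivered
    delivered = []   # what actually got delivered

    def receive(msg_id, payload):
        """Receiver: de-duplicate, deliver, then try to ack."""
        if msg_id not in seen:
            seen.add(msg_id)
            delivered.append(payload)   # exactly-once delivery happens here
        return random.random() > LOSS   # the ack itself may be lost

    def send(msg_id, payload, max_tries=20):
        """Sender: retransmit until acked -- this is at-least-once."""
        for _ in range(max_tries):
            if random.random() > LOSS:        # did the message get through?
                if receive(msg_id, payload):  # did the ack get through?
                    return True
        return False  # gave up, STILL not knowing whether delivery happened

    send(1, "attack at dawn")
    send(2, "attack at noon")
    print(delivered)  # no message ever appears more than once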

Exactly-once delivery

Now, this is where things get tricky, because no one ever actually responded to this very straightforward suggestion.  So at this point I need to extrapolate from the replies I did receive and guess what the response would be if I really put someone's feet to the fire.  My guess is that they would say that what I've provided is not exactly-once delivery but exactly-once processing.  I've made the processing of incoming messages idempotent, so that it doesn't matter if I get duplicate delivery, but I have not actually eliminated duplicate delivery, or something like that.
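
For what it's worth, here is what that distinction looks like in code, sketched with a toy calendar example (the scenario and the names are mine).  De-duplication stops the second copy of a message before it reaches the handler at all; idempotent processing lets the second copy through, but the handler is written so that applying a message twice has exactly the same effect as applying it once:

    calendar = {}  # meeting_id -> time

    def idempotent_handler(msg):
        """Applying this twice has the same effect as applying it once."""
        calendar[msg["meeting_id"]] = msg["time"]  # an upsert, not an append

    handled_ids = set()

    def deduplicating_handler(msg):
        """The second copy never reaches the logic above at all."""
        if msg["id"] in handled_ids:
            return
        handled_ids.add(msg["id"])
        idempotent_handler(msg)

    invite = {"id": 7, "meeting_id": "standup", "time": "10:30 AM Nov 7"}
    deduplicating_handler(invite)
    deduplicating_handler(invite)  # duplicate: silently dropped
    print(calendar)                # one meeting booked, not two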

It is possible to define "delivery" in a way that makes the claim that exactly-once delivery is impossible true.  The problem is that such a definition only serves to advance pedantry and ignores important subtleties in human communication.  Let's go back to the original claim: "You can get at most once delivery, or at least once delivery; you cannot get exactly once delivery."  There are some things implied here that are not explicitly mentioned.  Specifically, there is an implication that the two things that you (supposedly) *can* have -- at-most-once and at-least-once -- are actually things that one might choose under realistic circumstances.  Consider:

You can get at most once delivery, or at least once delivery, or you can hit yourself in the head with a hammer.

That statement is true -- hitting yourself in the head with a hammer is an option.  But I hope I don't have to persuade you that this is not very useful, and the only thing you're likely to accomplish by pointing it out is to annoy people.  When you provide a list of options, there is an implied social contract that the items on the list have at least some plausible utility.

In a similar vein, one way to get at-most-once delivery is simply to disconnect the communications channel.  That will absolutely guarantee that no message is ever delivered more than once, because every message will in fact be delivered exactly zero times.  Like hitting yourself in the head with a hammer, this is an option you can undertake, but it's not very useful, and so asking someone to actually spend brain cycles considering it can't be anything more than a deliberate attempt to waste their time or insult their intelligence.

Because of this, someone hearing you say "You can get at most once delivery, or at least once delivery; you cannot get exactly once delivery" is entitled to assume that disconnecting the network is not what you meant with regard to the first option.  That assumption is reinforced by the inclusion of at-least-once delivery as a viable option, because that is actually not possible unless the odds of a message being delivered are greater than zero.  Likewise, your interlocutor is entitled to assume that the reliability of the network is less than 100%, not because you explicitly said so, but because networks are unreliable in point of actual fact, and also because that is the only thing that makes this topic worth discussing at all, and the only thing that makes the impossibility of exactly-once delivery even remotely plausible.  In a 100% reliable network you can ensure exactly-once delivery by (wait for it!) ... sending each message exactly once!

Likewise, "you cannot have exactly-once delivery, but you can have exactly-once processing" implies that there is a meaningful distinction to be drawn between "delivery" and "processing".  There isn't.  Processing is ubiquitous in modern networks.  Every router does it, and obviously every end-point does it.  Any distinction between "delivery" and "processing" that would make exactly-once delivery impossible would necessarily beg the question by stipulating that discarding duplicate messages is "processing".  You can't make this distinction in any other way because discarding duplicates can happen anywhere in the abstraction stack, including the device driver, or even in the NIC hardware.  No one does it that way, but only because it's easier and more convenient to discard duplicates in the network stack or the application, not because it's impossible to do it anywhere else.
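
To illustrate that, here is a sketch of a de-duplicator written as a wrapper that neither knows nor cares which layer it lives in (a toy, obviously -- real drivers are not written in Python):

    # De-duplication as a layer-agnostic wrapper: the same few lines work
    # whether "upstream" is an application callback, a socket layer, or
    # (in principle) firmware handing frames to a driver.

    def deduplicated(upstream):
        seen = set()
        def layer(msg_id, payload):
            if msg_id not in seen:
                seen.add(msg_id)
                upstream(msg_id, payload)
        return layer

    app = deduplicated(lambda mid, p: print("app got:", p))
    driver = deduplicated(app)  # wrap it again lower down; nothing changes

    driver(1, "attack at dawn")
    driver(1, "attack at dawn")  # discarded before the app ever sees it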

So yes, you can have exactly-once delivery.  That may not be what you want.  In fact, it is almost certainly not what you want for anything that is actually mission-critical, because a lot can go wrong downstream of message delivery.  But that is a different question.  If you want it, you can absolutely have it.

Just for the sake of completeness I should point out that removing duplicates at the receiver is a pretty extreme oversimplification of what you would do in practice to provide exactly-once delivery.  A complete solution would almost certainly be an example of Greenspun's Tenth Rule applied to the TCP protocol rather than Common Lisp.
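
To give a flavor of what that refinement would look like, here is a slightly less naive sketch -- still Python, still mine, still nowhere near production quality.  Instead of remembering every message ID forever, the sender numbers its messages sequentially and the receiver keeps just one integer per sender, TCP-style, which bounds the memory needed for de-duplication:

    class ExactlyOnceReceiver:
        """De-duplication with bounded state: one counter per sender."""

        def __init__(self, deliver):
            self.deliver = deliver
            self.expected = {}  # sender_id -> next sequence number expected

        def receive(self, sender_id, seq, payload):
            """Returns a cumulative ack: the next sequence number we want."""
            if seq == self.expected.get(sender_id, 0):
                self.deliver(payload)  # exactly-once delivery
                self.expected[sender_id] = seq + 1
            # anything else is a duplicate of something already delivered
            # (or an out-of-order message a real implementation would buffer)
            return self.expected.get(sender_id, 0)

    rx = ExactlyOnceReceiver(deliver=print)
    rx.receive("alice", 0, "attack at dawn")  # delivered
    rx.receive("alice", 0, "attack at dawn")  # duplicate, discarded
    rx.receive("alice", 1, "attack at noon")  # delivered

A real implementation would also have to retransmit on timeout, buffer and reorder out-of-sequence arrivals, negotiate initial sequence numbers, and persist its counters across crashes.  Which is exactly the point: by the time you were done, you would have reinvented most of TCP.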

Final Thoughts

This post was intended to be about human communication more than distributed systems or network protocols, so I want to address one last question: How did this myth of the impossibility of exactly-once delivery get so firmly established?  This is a genuine mystery to me.  I can find no academic citation anywhere, only a few blog posts, none of which have any references.  So I can only speculate.  My guess is that it stems from the fact that distributed systems are monstrously complicated both in theory and in practice, and there are some things that you really cannot guarantee in the face of certain kinds of failures.  But that is true of non-distributed systems too.  Distributed systems just amplify the problems; they don't introduce them.  Even in a non-distributed system you always have the possibility that a cosmic ray will flip a bit somewhere.  Yes, you can have ECC, but you can also have two cosmic rays, or a manufacturing defect, or sabotage, or a bug.  In the real world you can never eliminate all risks.  The best you can hope for is to reduce them to the point where you like the odds.  That's usually good enough.

That is even the case for human endeavors like this blog post.  It's possible that I've overlooked something and that my argument is wrong.  But I've thought about it, and I've done some research, and I'm able to account for all the data, so I'm casting my lot and publishing this, fully aware of the possibility that someone might point something out in the comments that will make me slap myself on the forehead and go "duh".  But even then I will count that as a good outcome because I will learn something.

That's how the scientific method works.

10 comments:

  1. You overlooked that in the general case you need infinite memory for dedupe.

    Replies
    1. No, you don't. You only need that for a naive implementation, and even then you only need an unbounded memory, not an infinite one.

    2. Would you describe a non-naive implementation?

    3. Yes, but first I need to know why you don't accept TCP, which I already cited, as the answer.

  2. Talking Past Each Other

    >What does "delivery" mean here? The entire kerfuffle turns on the ambiguity of the meaning of this word, which none of my interlocutors ever defined.

    Bognar defined it:
    "TCP is implemented in the kernel with a buffer, the kernel responds with ACKs before an application reads the data.
    There is no guarantee that the application reads from that buffer (e.g. the process could crash), so the client on the other end believes that the application has received the message even though it hasn't.

    The kernel is handling at-least-once delivery with the network boundary and turning it into at-most-once with the process boundary."

    Which you agree with, sort of:
    >OK, if that's how you're defining your terms, I agree that you cannot have exactly-once delivery.

    But then you tacked on this which makes no sense:

    >But that makes it a vacuous observation. You can always get at-most-once delivery by disconnecting the network entirely. That provides a 100% guarantee that no message will be delivered more than once (because no message will ever be delivered at all). But that doesn't seem very useful.

    Replies
    1. > But then you tacked on this which makes no sense:

      The point was (supposed to be) that there are interpretations of words that make statements correct but vacuous. In the interests of facilitating communication one ought to assume that those definitions are not what was intended.

  3. Take the Computer Out of It

    1. I send you a letter via US mail.

    2. The USPS carries the letter to you and a postman puts the letter in your mailbox.

    3. You take the letter out of the mailbox.

    4. You open the letter and it says "We have a meeting at 10:30 AM on November 7". You create an appointment in your calendar. Included is a postcard for you to return, acknowledging you received my letter.

    5. After a week, I don't get the postcard.

    6. I mail you a duplicate letter via US mail.

    7. The USPS carries the letter to you and a postman puts the letter in your mailbox.

    8. You take the letter out of the mailbox.

    9. You open the letter and it says "We have a meeting at 10:30 AM on November 7". Since you've already gotten this message before, you throw out the letter.

    End result:
    Two letter deliveries to you
    Only 1 calendar appointment booked.

    Now Put the Computer In

    1. My computer sends your computer a message

    2. The message travels over the internet and ends up in the buffer of your computer's NIC. Your NIC sends an ACK to my computer.

    3. Your calendar application reads the buffer of the NIC and copies it into the application.

    4. The calendar application processes the message, which it interprets as a calendar invite for a meeting at 10:30 AM on November 7, and writes the meeting onto the calendar.

    5. My computer doesn't receive the ACK

    6. My computer re-sends a duplicate of the message

    7. The message travels over the internet and ends up in the buffer of your computer's NIC. Your NIC sends an ACK to my computer.

    8. Your calendar application reads the buffer of the NIC and copies it into the application.

    9. The calendar application processes the message, which it interprets as a calendar invite for a meeting at 10:30 AM on November 7, and it either (a) recognizes it as a duplicate and throws it out, or (b) overwrites the meeting onto the calendar (idempotent message).

    End Result:
    Two messages transited the internet and were written into the buffer of your computer's NIC.
    Only 1 calendar appointment booked.

    What is Delivery?

    It's transport of a message over the network and being written into the buffer of your NIC. The NIC has no knowledge of the meaning of the message, it's just data in its buffer (the meaning of the data is, ahem, indeterminate).

    Note in the mail example, you got a letter delivered twice by the USPS. In the computer example, the NIC of your computer received two messages over the network into its buffer -- that's two deliveries.

    What is "Processing"?

    An unfortunate term, as "processing" is often used to mean "computation" -- but on a computer, everything is computation.
    A better term would be "interpreting" or "acting on" the message by an application. The data has semantic meaning to whatever is "processing" it.

    Deduplication could be done by the network driver (which understands the semantic meaning of message ids, or whatever is used for deduplication).

    The application could do deduplication in a similar way; or, since the message data has semantic meaning in the context of the application, the application could use idempotent messages and respond to them appropriately.

    Replies
    1. > Take the Computer Out of It

      That's a great idea. I actually did this in an earlier draft but I decided to remove it because it felt like I was belaboring the obvious, but apparently not. So take your scenario and add a secretary who opens my mail and discards duplicate letters. When the secretary puts the resulting de-duplicated mail on my desk I think it is reasonable to label that mail as having been delivered to me (and I think it's reasonable to call it that even if I do not act on it). If my secretary is reliable (which I think is also a reasonable assumption) then that constitutes something that IMHO is reasonable to call exactly-once delivery.

  4. The Blog Post Links Were Responsive

    The links to the blog posts were responsive to the topic.

    The blog post "The impossibility of exactly-once delivery" defines exactly-once delivery:

    "Exactly-once delivery guarantee is the guarantee that a message can be delivered to a recipient once, and only once. While having a message be delivered only once by a recipient, is the norm, it is impossible to guarantee it."

    He goes on to define the at-most-once and at-least-once delivery guarantees.

    He then gives the solution: Exactly-once processing

    "Exactly-once processing. Exactly-once processing is the guarantee that even though we may receive a message multiple times, in the end, we observe the effects of a single processing operation. This can be achieved in two ways:

    Deduplication: dropping messages if they are received more than once

    Idempotent processing: applying messages more than once has precisely the same effect as applying it exactly once"

    Which matches the possible solutions for my calendar program above -- drop duplicates, or overwrite the calendar invite.

    The blog post "You Cannot Have Exactly-Once Delivery" makes similar points as the prior blog post, with a similar conclusion:

    "The way we achieve exactly-once delivery in practice is by faking it. Either the messages themselves should be idempotent, meaning they can be applied more than once without adverse effects, or we remove the need for idempotency through deduplication."

    Your Solution

    Your solution is an abstraction layer that does deduplication.

    This is, as the second blog post says, "faking it."

    You receive multiple duplicate messages, but you fake getting only 1 message by dropping the duplicates.

    Faking It is ... Fake

    What your interlocutors mean by "exactly-once delivery" is that exactly one message transits over the network and into your NIC buffer.

    You can make it look logically to an application that only one message arrived via your deduplication scheme -- but that is faking it. You are not stopping duplicate messages arriving.

    To truly solve the exactly-once delivery problem, you would need to develop a system where one, and only one, message transits the network to your NIC's buffer. If you could guarantee that, you wouldn't need deduplication or idempotent messages.

    Replies
    1. > "Exactly-once delivery guarantee is the guarantee that a message can be delivered to a recipient once, and only once. While having a message be delivered only once by a recipient, is the norm, it is impossible to guarantee it."

      That's not a coherent definition. It's circular, and it begs the question by not defining "recipient".

      > Deduplication: dropping messages if they are received more than once

      Yes. *Received*. Not *delivered*. If, after de-duplication, you then *deliver* the de-duplicated messages, then you have exactly-once delivery (assuming you have at-least-once delivery into the de-duplicator).
