CRDTs

Aliases
  • CRDT
  • CRDTs
  • Conflict-Free Replicated Data Types
Image of Author
September 21, 2022 (last updated June 6, 2023)

CRDTs is an acronym for Conflict-Free Replicated Data Types. They are a way of structuring data such that you no longer require a centralized data store or centralized source of truth.

Types of CRDTs

Operation Based and State Based CRDTs

This is a confusing distinction. I think it can describe both message content when broadcasting messages between nodes, and how "state" is stored.

Regarding message broadcasting, you can either emit just the modified local state, or the entire, new, modified state. I.e., you can broadcast "I changed the author first name", or, "here is the entire new data blob" which contains the new author first name.

Regarding internal state management, you can think of operation-based CRDTs as analogous to event sourcing. It is an implementation detail, but presumably a "pure" operation-based CRDT will store "just" the events, and compute the state from that (which it may or may not store, internally). State-based CRDTs will store the state internally (and may or may not store message events depending on how broadcasting is implemented).

AWSet and ORSet CRDTs are the same

An ORSet, or "Observed-Remove Set", is the same as an AWSet, or "Add-Wins Set". It is a way of defining the tie-breaking semantics of conflicting, concurrent updates. If concurrent operations call for a remove and an add, the add wins.

I'm guessing that the way you get the "observed remove" phrasing to work is basically by thinking about which kinds of removes actually remove an item. A concurrent remove hasn't been "observed" yet. To be observed is to be historical, to have been propagated, to no longer be concurrent. That said, "add wins" is more intuitive, mainly because it is underspecified what is meant by "observed".

Okay I think I get the "observed remove" naming: You can only remove what's been added. And adding occurs always before removing, and so therefore you can guarantee you won't get a removal of an element you haven't seen added yet. The og node has to call add before remove. Until the add has been propogated to other nodes you can't remove it. This might mean there is no such thing as concurrent add and remove. Maybe if we are dealing with primitives like numbers this wouldn't apply though.

What is Strong Eventual Consistency?

From the wiki post:

Whereas eventual consistency is only a liveness guarantee (updates will be observed eventually), strong eventual consistency (SEC) adds the safety guarantee that any two nodes that have received the same (unordered) set of updates will be in the same state. If, furthermore, the system is monotonic, the application will never suffer rollbacks. A common approach to ensure SEC is conflict-free replicated data types.

This makes sense as a usual descriptor because it requires something like a commutative property for concurrent events.

From An optimized conflict-free replicated set:

Strong Eventual Consistency (SEC) requires a deterministic outcome for any pair of concurrent updates. Thus, different replicas can be updated in parallel, and concurrent updates are resolved locally, without requiring consensus. Some simple conditions (e.g., that concurrent updates commute with one another) are sufficient to ensure SEC. Data types that satisfy these conditions are called Conflict-Free Replicated Data Types (CRDTs). Replicas of a CRDT object can be updated without synchronization and are guaranteed to converge.

Strong eventual consistency seems like a "hard problem" because you need to design data types that resolve to the same outcome no matter the order in which concurrent events were received.

Weak eventual consistency is a guarantee of delivery "liveness", but seems to have no restrictions on resolutions. That is, nodes might be self-consistent but not consistent with each other

The answer is basically: no. But, you can find a variety of ways to say 'yes' if you really wanted to.

From a bird's eye view, CRDTs, blockchains, the git protocol, various data stores, and probably a million other tools and technologies, can all be brought up when using words like: distributed, decentralised, etc. But why? What are the shared abstractions they all have in common? I don't think I have a good answer, but, I think it is related to the shape of information within some of these systems. A shape that is conducive to distributed data. I'm talking about directed acyclic graphs.

What makes DAGs conducive to distributed data? It can represent the notion of concurrency easily, and without sacrificing otherwise normal temporal ordering. In a distributed system, there is a sense in which things can happen "at the same time", i.e., concurrently. For everything else, and to as maximum extent possible, you want a normal temporal ordering. This is pretty close to the notion of causal ordering that DAGs also easily represent. A clever DAG-based data model can represent distributed systems. Git can embed parent commits in children commits, and can represent branches and merges (water, like time, flows in only one direction). Blockchains can represent parent blocks within children blocks, and can allow or disallow branches. CRDTs can represent distributed data modifications as a DAG to aid in its consistency guaranteed internal algorithms.

When working with a distributed system, decentralised or otherwise, you deal with questions of concurrency. Many distributed algorithms will employ DAGs in their modelling. In conclusion, it is less that CRDTs and blockchain are related, and more that any distributed system is going to have DAG-like characteristics. I think it is well and good to see similarities between these technologies. It's just that the similarities speak to more fundamental structures that both of the higher order technologies themselves depend on.

Resources