What Is A Distributed System?

The definition for a distributed system changes depending on who you ask, but the general “consensus” (no pun intended) is that distributed systems contain two particular categories of components. The first category is referred to as “nodes.” Nodes are meant to represent separate machines or processes. This separation manifests in the real world as physical distance, such as in Bitcoin, or just a separation of components, similar to the CPU cores in your laptop. The second category is referred to as “message passing.” These are represented as arrows, or “edges” as we like to call them in the context of graph theory.

Regardless of their name, their purpose is to demonstrate that information can move between machines. It isn’t guaranteed that all machines are directly, or even indirectly, connected. However, that doesn’t stop them from working together.

Why Do We Need A Distributed System?

The biggest question on your mind, though, must be, “Why do distributed systems matter? What’s wrong with a single system?”. Well, the problem with machines in the real world is that they can be faulty, crash, lose memory, or even be corrupted. In this event, we can protect information and services by creating backups. In other words, putting many machines together to accomplish a goal together instead of just having a single machine. This brings us the main advantage of a more reliable system: even if one machine shuts down or misbehaves, it’s not the end of the world. We can create protocols withstanding failure, giving us the power to be safe in the event of a crash.

Properties of a Distributed System!

Let’s discuss the properties of a distributed system keeping in mind the two main two components to understand exactly what we’re working with. One of the most obvious components of a distributed system is concurrency. Components in this system process information concurrently, as opposed to a single process that operates one step at a time. Second, there is no global clock. While all nodes can attempt to stay synchronized, there is no single clock or computer to access for the current time. Each node is commendable for maintaining its own time. A fun fact, time, and its effects on the operation of a distributed system constantly comes up in heated debates across the academic world. Third, as mentioned before, there is the potential for failure of any individual component, whether this is a dropped message or a failed processor.

In these events, protocols protect the system against individual failure, ensuring that the job gets done no matter what. All these properties of distributed systems come straight from understanding the role of nodes and edges in this system. In the next section, we’re going to look at how we distinguish a working distributed system from a faulty one.

This graph shows a classic example problem from distributed systems. Consider an arrangement of an odd number of nodes as such. Imagine that each node randomly starts with one of two values: 0 or 1. This is known as a binary consensus problem, as there are only two possible inputs for each node and only two possible outputs for the entire network. The goal of this system is to, through messages and computation, return the majority value among all nodes. From looking at this graph, you can easily tell that the answer must be 0, given that there are five zeroes and only four ones.

What are we looking for?

Before we start considering the exact process by which this system can reach the answer, let’s first ask ourselves: what exactly are we looking for in this system? What properties do we want to uphold? What constitutes a correct answer? We’re about to answer those questions right now. Leslie Lamport, distributed systems researcher, and superhero designed a scheme by which we can formally prove the correctness of a distributed system. It’s a fundamentally simple scheme, and it’s a scheme you can apply to any kind of engineering. Lamport says that a system is correct if two things are true: that it doesn’t do bad things, and that it eventually does good things. The formal terms for these are safety and liveness respectively.

Safety and Liveness

Safety properties refer to things that will not happen, and liveness refers to things that eventually happen. To understand what this means, let’s use the previous example. Which types of results do we want, and which do we not? I would say it’s reasonably clear to see what we don’t want. If our goal is for the majority value to be returned, then we never want a program to return a value that is not the majority value. That’s a safety property.

A liveness property is that the majority value will eventually be returned. You’ll notice that the safety and liveness properties are difficult to fulfill simultaneously, as getting closer to one means getting further from the other. You’ll also notice why we can’t have just safety or just liveness. Take real life as an example: living your life as safely as possible would be staying inside all day and never talking to anyone. Nothing can go wrong, true, but … nothing will ever happen either. On the other hand, you could also be incredibly spontaneous to live the liveliest life imaginable.

However, without sufficient regulation of your actions, you’ll likely get into serious trouble. You need some of both, but not too much, to define a sense of what your program can and should do. We spent quite a bit of time understanding safety and liveness, but how do we apply those ideas of desirable and undesirable guidelines to a distributed system? How can we prove that our system achieves its goal?

The correctness of Distributed Systems

Well, for any distributed system to be correct, we know that it must come to some form of consensus on the correct answer. That’s the whole point of a distributed system: given some input, the nodes need to agree on the output. The names given to these consensus-achieving procedures are known as consensus algorithms. We can put consensus algorithms in the context of safety and liveness by establishing what must happen, and what cannot happen, for a system to come to a consensus.

Lucky for us, computer scientists much smarter than we have already done that, recognizing that the three requirements of any correct consensus algorithm are validity, agreement, and termination.

Validity means that any value decided upon by the network must be proposed by one of the processes. In other words, the consensus algorithm cannot arbitrarily agree on some results. It cannot be hardcoded to always return the value 0. Sure, all our nodes will agree that the value is 0 and agree, but is that truly valid? If every node started with 1, and we’re looking for the majority value, 0 is meaningless

Agreement means that all non-faulty processes must agree on the same value. It’s pointless to bother with the results of faulty processes since those can always be wrong, so what matters is the agreement of the remainder. Of the working nodes, we never want to end up in a situation where one says 0, and one says 1. If there are inconsistencies in our results, then which one is true? Which one is correct? For the result to be meaningful, all functioning processes must agree on it.

The last requirement is termination. Termination means that all non-faulty nodes will eventually decide on some value. In other words, the system must eventually return some value; it cannot forever hang in limbo. As you can tell, both validity and agreement are safety properties. They specify things that can never happen in a system correctly coming to a consensus. By having validity and agreement, we ensure that honest nodes never decide on trivial, random, or different values.

Termination, on the other hand, is a liveness property, as it specifies what must happen for the system to be considered correct. Without the guarantee of termination, we have no guarantee that consensus will ever be achieved.

Conclusion

By understanding these three essentials of a consensus algorithm, you now understand what is needed for any distributed system to come to a consensus. Also, identifying the goals of consensus algorithms is critical to understanding different approaches. Before we get into examples of consensus algorithms, however, we need to understand more of the caveats of distributed systems, and the tradeoffs that any system must make. I hope you enjoyed reading the article if so, do comment down below and also share it with your friends.

What Is a Distributed System? Features, Properties and Characteristics of Distributed Systems.

What Is A Distributed System?

Why Do We Need A Distributed System?

Properties of a Distributed System!

What are we looking for?

Safety and Liveness

The correctness of Distributed Systems

Conclusion

You May Also Like

No comments:

Recent Posts

Categories

Blog Archive

Popular Posts

Contact Form

Pages