What Is A Distributed System?
The definition
for a distributed system changes depending on who you ask, but the general
“consensus” (no pun intended) is that distributed systems contain two
particular categories of components. The first category is referred to as “nodes.”
Nodes are meant to represent separate machines or processes. This separation
manifests in the real world as physical distance, such as in Bitcoin, or just a
separation of components, similar to the CPU cores in your laptop. The second
category is referred to as “message passing.” These are represented as arrows,
or “edges” as we like to call them in the context of graph theory.
Regardless of
their name, their purpose is to demonstrate that information can move between
machines. It isn’t guaranteed that all machines are directly, or even
indirectly, connected. However, that doesn’t stop them from working together.
Why Do We Need A Distributed System?
The biggest
question on your mind, though, must be, “Why do distributed systems matter?
What’s wrong with a single system?”. Well, the problem with machines in the
real world is that they can be faulty, crash, lose memory, or even be
corrupted. In this event, we can protect information and services by creating
backups. In other words, putting many machines together to accomplish a goal
together instead of just having a single machine. This brings us the main
advantage of a more reliable system: even if one machine shuts down or
misbehaves, it’s not the end of the world. We can create protocols withstanding
failure, giving us the power to be safe in the event of a crash.
Properties of a Distributed System!
Let’s discuss
the properties of a distributed system keeping in mind the two main two
components to understand exactly what we’re working with. One of the most
obvious components of a distributed system is concurrency. Components in
this system process information concurrently, as opposed to a single process
that operates one step at a time. Second, there is no global clock.
While all nodes can attempt to stay synchronized, there is no single clock or
computer to access for the current time. Each node is commendable for
maintaining its own time. A fun fact, time, and its effects on the operation of
a distributed system constantly comes up in heated debates across the academic
world. Third, as mentioned before, there is the potential for failure of any
individual component, whether this is a dropped message or a failed processor.
In these events,
protocols protect the system against individual failure, ensuring that the job
gets done no matter what. All these properties of distributed systems come
straight from understanding the role of nodes and edges in this system. In the
next section, we’re going to look at how we distinguish a working distributed
system from a faulty one.
This graph shows
a classic example problem from distributed systems. Consider an arrangement of
an odd number of nodes as such. Imagine that each node randomly starts with one
of two values: 0 or 1. This is known as a binary consensus problem, as there
are only two possible inputs for each node and only two possible outputs for
the entire network. The goal of this system is to, through messages and
computation, return the majority value among all nodes. From looking at this
graph, you can easily tell that the answer must be 0, given that there are five
zeroes and only four ones.
What are we looking for?
Before we start
considering the exact process by which this system can reach the answer, let’s
first ask ourselves: what exactly are we looking for in this system? What
properties do we want to uphold? What constitutes a correct answer? We’re about
to answer those questions right now. Leslie Lamport, distributed systems
researcher, and superhero designed a scheme by which we can formally prove the
correctness of a distributed system. It’s a fundamentally simple scheme, and
it’s a scheme you can apply to any kind of engineering. Lamport says that a
system is correct if two things are true: that it doesn’t do bad things, and
that it eventually does good things. The formal terms for these are safety and
liveness respectively.
Safety and Liveness
Safety
properties refer to things that will not happen, and liveness refers to things
that eventually happen. To understand what this means, let’s use the previous
example. Which types of results do we want, and which do we not? I would say
it’s reasonably clear to see what we don’t want. If our goal is for the
majority value to be returned, then we never want a program to return a value
that is not the majority value. That’s a safety property.
A liveness
property is that the majority value will eventually be returned. You’ll notice
that the safety and liveness properties are difficult to fulfill
simultaneously, as getting closer to one means getting further from the other.
You’ll also notice why we can’t have just safety or just liveness. Take real
life as an example: living your life as safely as possible would be staying
inside all day and never talking to anyone. Nothing can go wrong, true, but …
nothing will ever happen either. On the other hand, you could also be
incredibly spontaneous to live the liveliest life imaginable.
However, without
sufficient regulation of your actions, you’ll likely get into serious trouble.
You need some of both, but not too much, to define a sense of what your program
can and should do. We spent quite a bit of time understanding safety and liveness,
but how do we apply those ideas of desirable and undesirable guidelines to a
distributed system? How can we prove that our system achieves its goal?
The correctness of Distributed Systems
Well, for any
distributed system to be correct, we know that it must come to some form of
consensus on the correct answer. That’s the whole point of a distributed
system: given some input, the nodes need to agree on the output. The names
given to these consensus-achieving procedures are known as consensus
algorithms. We can put consensus algorithms in the context of safety and
liveness by establishing what must happen, and what cannot happen, for a system
to come to a consensus.
Lucky for us,
computer scientists much smarter than we have already done that, recognizing
that the three requirements of any correct consensus algorithm are validity, agreement, and termination.
Validity means
that any value decided upon by the network must be proposed by one of the
processes. In other words, the consensus algorithm cannot arbitrarily agree on
some results. It cannot be hardcoded to always return the value 0. Sure, all
our nodes will agree that the value is 0 and agree, but is that truly valid? If
every node started with 1, and we’re looking for the majority value, 0 is
meaningless
Agreement means
that all non-faulty processes must agree on the same value. It’s pointless to
bother with the results of faulty processes since those can always be wrong, so
what matters is the agreement of the remainder. Of the working nodes, we never
want to end up in a situation where one says 0, and one says 1. If there are
inconsistencies in our results, then which one is true? Which one is correct?
For the result to be meaningful, all functioning processes must agree on it.
The last requirement is termination. Termination means that all non-faulty nodes will
eventually decide on some value. In other words, the system must eventually
return some value; it cannot forever hang in limbo. As you can tell, both
validity and agreement are safety properties. They specify things that can never
happen in a system correctly coming to a consensus. By having validity and
agreement, we ensure that honest nodes never decide on trivial, random, or
different values.
Termination, on the other hand, is a liveness property, as it specifies what must happen for
the system to be considered correct. Without the guarantee of termination, we
have no guarantee that consensus will ever be achieved.
Conclusion
By understanding these three essentials of a consensus algorithm, you now understand what is needed for any distributed system to come to a consensus. Also, identifying the goals of consensus algorithms is critical to understanding different approaches. Before we get into examples of consensus algorithms, however, we need to understand more of the caveats of distributed systems, and the tradeoffs that any system must make. I hope you enjoyed reading the article if so, do comment down below and also share it with your friends.
No comments:
Please let me know if you liked the post. Do share it with your friends