Distributed Systems 101

Pranav Kohli
3 min read · Aug 30, 2020


A layman’s guide to distributed systems

What is a distributed system?

A system of independent nodes (computers, servers), not necessarily at the same location, that talk to each other and get work done in parallel.

Too complicated? Let’s simplify it with an example.

Today, in COVID times, everybody is working remotely. Your manager or tech lead assigns you work via email, Zoom, etc., and you do it at home. Other employees similarly get their tasks from the manager and work independently. At the end of the day, the employees update the manager on the work done. This setup is a good analogy for a distributed system.

Manager: the task scheduler, who distributes the work

Employees: the workers, who each do an individual task and report the result back to the scheduler

Email, Jira: the message protocols for interaction between scheduler and worker, or between worker and worker
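To make the analogy a bit more concrete, here is a minimal Python sketch of a scheduler handing tasks to workers and collecting their reports. It is only an illustration: the names `do_task` and `run_scheduler` are made up for this post, and threads on one machine stand in for what would really be separate nodes talking over a network.

```python
from concurrent.futures import ThreadPoolExecutor


def do_task(task):
    # A worker (employee) handles one task on its own and reports back.
    return f"{task} done"


def run_scheduler(tasks, worker_count=2):
    # The scheduler (manager) hands tasks to a pool of workers and
    # collects their reports in the same order the tasks were given.
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        return list(pool.map(do_task, tasks))


reports = run_scheduler(["W1", "W2", "W3"])
print(reports)
```

The workers run independently and in parallel; the function calls and return values play the role that email or Jira plays in the analogy.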

[Figure: Distributed system]

Any discussion of distributed systems is kinda incomplete without a mention of the CAP theorem.

CAP stands for Consistency, Availability and Partition tolerance. The theorem asks system designers to choose between these three competing guarantees: it states that a system cannot provide all three at once, so you can have at most two of the three. In practice, network partitions cannot be ruled out, so the real choice is usually between consistency and availability for as long as a partition lasts.

Sounds simple, but wait! What do we mean by Consistency, Availability and Partition tolerance? Too many fancy words for comfort, so let’s take the remote working example above and see what they mean.

Manager: Rajesh

Senior Employee 1: Ramesh

Junior Employee 2: Suresh

Work 1: W1

Consistency: Rajesh assigns a task W1 to Ramesh. Ramesh syncs up with Suresh to explain the task, and they both start working on it. After some time, the requirements change and the task becomes W1.1. Rajesh emails the update to Ramesh. Ramesh then goes on a chai sutta break without passing the new details on to Suresh. If Rajesh now asks Suresh about the work, Suresh will still be working on W1 instead of W1.1. That is a lack of consistency, and no more Five Star for Ramesh and Suresh.

Simply, a system is said to be consistent if all nodes see the same data at the same time.

If we perform a read on a consistent system, it should return the value of the most recent write, no matter which node serves the read.
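The W1 versus W1.1 mix-up can be sketched in a few lines of Python. This is a toy model, not how any real database works: `Replica`, `write_without_sync` and `write_with_sync` are hypothetical names invented for this example.

```python
class Replica:
    # One node holding its own copy of the current task.
    def __init__(self, name):
        self.name = name
        self.task = None


def write_without_sync(replicas, value):
    # Only the first replica hears about the update (Ramesh on his break).
    replicas[0].task = value


def write_with_sync(replicas, value):
    # Every replica is updated before the write counts as finished.
    for replica in replicas:
        replica.task = value


ramesh, suresh = Replica("Ramesh"), Replica("Suresh")
write_with_sync([ramesh, suresh], "W1")

write_without_sync([ramesh, suresh], "W1.1")
stale_read = suresh.task        # Suresh still answers with the old task

write_with_sync([ramesh, suresh], "W1.1")
consistent_read = suresh.task   # now both replicas agree
```

After the unsynced write, a read served by Suresh returns the old value; once the write reaches every replica, any node gives the same answer.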

Availability: Now suppose that in the previous example, instead of going on a break, Ramesh syncs up with Suresh in a Zoom meeting and updates him with the new work details, W1.1.

If the manager Rajesh wants an update on the work at this point, he can’t get one, because both employees are busy. That is a lack of availability.

Availability in a distributed system means the system stays operational for every request: each request received by a working node gets a non-error response, regardless of the state of any other individual node.

This does not guarantee that the response contains the most recent write.
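Here is a small Python sketch of that trade-off, assuming a single node that is busy syncing (Ramesh and Suresh stuck in their meeting). The `Node` class and its two read methods are hypothetical, made up purely for illustration.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.syncing = False  # True while the node is busy catching up

    def read_blocking(self):
        # Refuses to answer while syncing: never stale, but unavailable.
        if self.syncing:
            raise RuntimeError("busy syncing, try again later")
        return self.value

    def read_available(self):
        # Always answers from the local copy, even if it may be out of date.
        return self.value


node = Node("W1")
node.syncing = True

available_answer = node.read_available()

try:
    node.read_blocking()
    blocked = False
except RuntimeError:
    blocked = True
```

The available read always comes back with something, here the possibly outdated W1, while the blocking read makes the caller wait until the sync is over.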

Partition Tolerance: Suppose Ramesh and Suresh stop communicating due to a misunderstanding (it’s a Five Star issue). Now the work given to them won’t be completed.

This means there is no partition tolerance.

Partition tolerance means that the cluster as a whole continues to function even if there is a “partition” (a communication break) between two nodes, i.e. both nodes are up but can’t communicate. No set of failures short of total network failure is allowed to cause the system to respond incorrectly.
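A rough Python sketch of what a system can do while a partition lasts: either reject writes to stay consistent, or accept them locally to stay available. The `Cluster` class and the "CP"/"AP" mode labels are assumptions made for this example, not any real library’s API.

```python
class Cluster:
    def __init__(self, mode):
        self.mode = mode              # "CP" or "AP" (hypothetical labels)
        self.partitioned = False      # True when nodes A and B can't talk
        self.data = {"A": "W1", "B": "W1"}

    def write(self, node, value):
        if not self.partitioned:
            # Healthy network: the write reaches every node.
            for name in self.data:
                self.data[name] = value
        elif self.mode == "CP":
            # Stay consistent by refusing the write (gives up availability).
            raise RuntimeError("partition: write rejected")
        else:
            # Stay available by accepting the write locally (nodes diverge).
            self.data[node] = value

    def read(self, node):
        return self.data[node]


ap = Cluster("AP")
ap.partitioned = True
ap.write("A", "W1.1")
ap_view = (ap.read("A"), ap.read("B"))

cp = Cluster("CP")
cp.partitioned = True
try:
    cp.write("A", "W1.1")
    cp_rejected = False
except RuntimeError:
    cp_rejected = True
```

In the AP cluster the two nodes end up disagreeing until the partition heals; in the CP cluster they keep agreeing, at the cost of turning the write away.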

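In practice, systems today get partway to all three rather than all the way: they stay available and partition tolerant, and relax strict consistency to eventual consistency, where nodes may lag behind briefly but converge on the same value once the updates have propagated.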
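One simple way to picture eventual consistency is a background sync in which every replica adopts the newest version it hears about. The Python sketch below assumes each replica stores a (version, value) pair and that the higher version wins; `merge` and `anti_entropy` are illustrative names, and real systems use more careful schemes for ordering and conflict resolution.

```python
def merge(a, b):
    # Each state is a (version, value) pair; the higher version wins.
    return a if a[0] >= b[0] else b


def anti_entropy(replicas):
    # Find the newest state among all replicas and have everyone adopt it.
    newest = replicas[0]
    for state in replicas[1:]:
        newest = merge(newest, state)
    return [newest for _ in replicas]


# Ramesh already has W1.1 (version 2); Suresh is still on W1 (version 1).
replicas = [(2, "W1.1"), (1, "W1")]
converged = anti_entropy(replicas)
print(converged)
```

Before the sync runs, a read can still return the old W1; after it, every replica answers W1.1.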

Hopefully this gives you some food for thought.

Cheerio!!
