Consensus is a fundamental problem in distributed systems and has been studied extensively since the late 1970s. Recently, with the advent of blockchain technology, interest in distributed consensus algorithms suitable for blockchain networks has been renewed.
Distributed systems are classified into two main categories: message passing and shared memory.
In the context of blockchain, we are concerned with message-passing distributed systems, where participants on the network communicate by exchanging messages.
Blockchain is a distributed system that relies upon a consensus mechanism to ensure the safety and liveness of the network.
Distributed consensus is a highly researched problem, and many fundamental ideas have been developed to describe and reason about it. Arguably the most famous of these is the Byzantine generals problem.
In addition to the Byzantine generals problem, we will look at relevant and important impossibility results, which help build a general understanding of consensus and its limitations. An understanding of these concepts is vital for seeing exactly what problem is being solved and how.
The Byzantine Generals Problem
The problem of reaching agreement in the presence of faults, or Byzantine consensus, was first formulated by M. Pease, R. Shostak, and L. Lamport. In distributed systems, a common goal is to achieve consensus (agreement) among nodes on the network even in the presence of faults.
In order to explain the problem, Lamport devised an allegorical representation and named it the Byzantine generals problem.
The Byzantine generals problem metaphorically depicts a situation where a Byzantine army, divided into different units, is spread around a city.
A general commands each unit, and the generals can communicate with one another only via messengers. To be successful, the generals must coordinate their plan and decide whether to attack or retreat.
The problem, however, is that some of the generals could be disloyal and act maliciously to obstruct agreement on a united plan.
The requirement, then, is that every honest general must somehow reach the same decision even in the presence of treacherous generals.
In order to address this issue, the honest (loyal) generals must reach a majority agreement on their plan. In the digital world, generals are represented by computers (nodes), messengers carrying messages correspond to communication links, and disloyal generals are faulty nodes.
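To see how loyal generals can agree despite a traitor, the following Python sketch implements a toy version of Lamport's OM(1) ("oral messages") algorithm for one faulty general among four. The specific lying behaviors of the traitor here are illustrative assumptions, not part of the original formulation.

```python
from collections import Counter

def majority(values):
    # Most common value; ties fall back to a fixed default ("retreat").
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "retreat"
    return counts[0][0]

def om1(commander_value, traitor, n=4):
    """One round of OM(1) with n generals; general 0 is the commander.

    `traitor` is the index of the single faulty general. A traitorous
    commander sends conflicting orders; a traitorous lieutenant flips
    the value when relaying it.
    """
    lieutenants = range(1, n)
    # Step 1: the commander sends its value to every lieutenant.
    received = {}
    for i in lieutenants:
        if traitor == 0:
            received[i] = "attack" if i % 2 == 0 else "retreat"
        else:
            received[i] = commander_value
    # Step 2: each lieutenant relays what it received to the others,
    # then decides by majority over all values it has seen.
    decisions = {}
    for i in lieutenants:
        votes = [received[i]]
        for j in lieutenants:
            if j == i:
                continue
            relayed = received[j]
            if traitor == j:  # a traitorous lieutenant lies when relaying
                relayed = "retreat" if relayed == "attack" else "attack"
            votes.append(relayed)
        decisions[i] = majority(votes)
    return decisions

# With 4 generals and 1 traitor (n >= 3F + 1), the loyal lieutenants agree:
print(om1("attack", traitor=2))  # traitorous lieutenant: loyal ones decide "attack"
print(om1("attack", traitor=0))  # traitorous commander: lieutenants still match
```

Note that with only three generals and one traitor the same scheme fails, which foreshadows the 3F + 1 lower bound discussed later.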
Fault Tolerance
A fundamental requirement in a consensus mechanism is that it must be fault-tolerant. In other words, it must continue to work correctly even when a number of nodes in the network fail.
This naturally means that there has to be some limit to the number of faults a network can handle, since no network can operate correctly if a large majority of its nodes are failing.
Types of Fault-Tolerant Consensus
Fault-tolerant consensus algorithms fall into two classes: crash fault-tolerant (CFT) and Byzantine fault-tolerant (BFT).
CFT covers only crash faults or, in other words, benign faults. In contrast, BFT deals with arbitrary faults, which can even be malicious.
Replication is a standard approach to make a system fault-tolerant. Replication results in a synchronized copy of data across all nodes in a network.
This technique improves the fault tolerance and availability of the network: even if some nodes become faulty, the overall system remains available because the data exists on multiple nodes.
There are two main types of replication techniques (a small sketch contrasting them follows this list):
- Active replication, in which every replica runs its own copy of the state machine and executes each request itself.
- Passive replication, in which only the primary node executes requests on its copy of the state machine, and the remaining replicas merely receive and store the resulting state updates.
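A minimal sketch of the contrast, using a trivial counter as the replicated state machine (the class and request values are hypothetical):

```python
class CounterMachine:
    """A trivial deterministic state machine: the state is one integer."""
    def __init__(self):
        self.state = 0

    def apply(self, amount):
        self.state += amount
        return self.state

requests = [5, -2, 7]

# Active replication: every replica executes every request itself.
active = [CounterMachine() for _ in range(3)]
for req in requests:
    for replica in active:
        replica.apply(req)

# Passive replication: only the primary executes requests; backups
# receive the resulting state, not the requests themselves.
primary, backups = CounterMachine(), [CounterMachine() for _ in range(2)]
for req in requests:
    primary.apply(req)
    for backup in backups:
        backup.state = primary.state

assert all(m.state == 10 for m in active + backups + [primary])
```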
State Machine Replication
State machine replication (SMR) is the de facto standard technique for providing deterministic replication services in order to achieve fault tolerance in a distributed system.
State machine replication was first proposed by Lamport in his 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System". Later, in 1990, Schneider formalized the approach in "Implementing Fault-Tolerant Services Using the State Machine Approach".
At an abstract level, a state machine is a mathematical model of a machine that can be in exactly one of a set of states at any given time.
A state machine stores the state of the system and, on receiving an input, transitions to the next state, producing an output along with the updated state.
The fundamental idea behind SMR can be summarized as follows (a minimal sketch follows this list):
- All servers always start with the same initial state.
- All servers receive requests in a totally ordered fashion (every server processes the same sequence of client requests).
- All servers produce the same deterministic output for the same input.
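The sketch below illustrates these three properties: replicas that start from the same state and apply the same totally ordered, deterministic log end in identical states, while a reordered log diverges. The toy operations ("add", "mul") are assumptions for illustration.

```python
def run_replica(log):
    """Apply a totally ordered log of (op, value) requests to a fresh state."""
    state = 0
    outputs = []
    for op, value in log:
        if op == "add":
            state += value
        elif op == "mul":
            state *= value
        outputs.append(state)  # deterministic output per request
    return state, outputs

log = [("add", 3), ("mul", 4), ("add", 1)]

# Same initial state + same ordered inputs + deterministic transitions
# => every replica ends in the same state with the same outputs.
assert run_replica(log) == run_replica(list(log))

# Reordering the log breaks agreement: (3 * 4) + 1 != (3 + 1) * 4.
reordered = [("add", 3), ("add", 1), ("mul", 4)]
assert run_replica(log)[0] != run_replica(reordered)[0]
```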
State machine replication is commonly implemented under a primary/backup paradigm, where a primary node is responsible for receiving, ordering, and broadcasting client requests.
This broadcast mechanism is called total order broadcast or atomic broadcast, which ensures that backup or replica nodes receive and execute the same requests in the same sequence as the primary.
Consequently, all replicas eventually reach the same state as the primary, thereby achieving consensus. In other words, total order broadcast and distributed consensus are equivalent problems: if you solve one, you have solved the other.
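A minimal sketch of the delivery side of total order broadcast, assuming a single fault-free primary that stamps each request with a sequence number: replicas execute requests strictly in sequence order even if the network delivers them out of order. Real atomic broadcast must also survive primary failure, which this sketch ignores.

```python
import heapq

class Replica:
    """Executes requests strictly in sequence-number order (total order)."""
    def __init__(self):
        self.executed = []
        self.next_seq = 0
        self.pending = []  # min-heap of (seq, request) not yet executable

    def deliver(self, seq, request):
        heapq.heappush(self.pending, (seq, request))
        # Execute any contiguous prefix that is now available.
        while self.pending and self.pending[0][0] == self.next_seq:
            _, req = heapq.heappop(self.pending)
            self.executed.append(req)
            self.next_seq += 1

# The primary sequences client requests; the network may reorder deliveries.
sequenced = list(enumerate(["w=1", "w=2", "r"]))
r1, r2 = Replica(), Replica()
for seq, req in sequenced:            # in-order delivery
    r1.deliver(seq, req)
for seq, req in reversed(sequenced):  # out-of-order delivery
    r2.deliver(seq, req)
assert r1.executed == r2.executed == ["w=1", "w=2", "r"]
```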
Now that we understand the basics of replication and fault tolerance, it is important to understand that fault tolerance works up to a certain threshold.
In some scenarios, it might be impossible to provide the required services due to a lack of resources in a system. In distributed computing, such impossible scenarios are researched and reported as impossibility results.
Impossibility results unfold deep aspects of distributed computing and enable us to understand why certain problems are difficult to solve and under what conditions a previously unsolved problem might be solved.
Results that establish the minimum resources required to solve a problem are known as lower bound results. Problems that are not solvable under any conditions are known as unsolvability results.
For example, it has been proven that asynchronous deterministic consensus is impossible. This result is known as the FLP impossibility result.
FLP Impossibility
FLP (Fischer, Lynch, and Paterson) impossibility is a fundamental unsolvability result in distributed computing theory. It states that in an asynchronous environment, deterministic consensus is impossible, even if only one process is faulty.
To circumvent FLP impossibility, several techniques have been introduced in the literature (a schematic of the randomized approach follows this list). These techniques include:
- Failure detectors, which can be seen as oracles associated with processors to detect failures.
- Randomized algorithms, which provide a probabilistic termination guarantee. The core idea behind randomized protocols is that a processor can make a random choice of decision value if it does not receive the required quorum of trusted messages.
- Synchrony assumptions, where additional synchrony and timing assumptions are made to ensure that the consensus algorithm terminates and makes progress.
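As a schematic of the randomized approach, the step below adopts a value only when a quorum of trusted messages agrees on it, and otherwise falls back to a coin flip. This is a simplified fragment in the spirit of Ben-Or's protocol, not a complete algorithm.

```python
import random

def randomized_round(received, quorum):
    """One decision step of a randomized binary consensus round (schematic).

    `received` is the list of 0/1 values reported by other processes this
    round. If a quorum of trusted messages carries the same value, adopt
    it; otherwise fall back to a coin flip, so an adversarial message
    schedule cannot block progress forever. Termination becomes
    probabilistic instead of deterministic, sidestepping FLP.
    """
    for v in (0, 1):
        if received.count(v) >= quorum:
            return v
    return random.choice((0, 1))

# With a quorum of 3: decisive rounds adopt the value, split rounds flip.
assert randomized_round([1, 1, 1, 0], quorum=3) == 1
print(randomized_round([1, 1, 0, 0], quorum=3))  # randomly 0 or 1
```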
Lower Bounds on the Number of Processors to Solve Consensus
There are proven results in distributed computing that establish several lower bounds, for example, on the minimum number of processors or the minimum number of rounds required to achieve consensus.
The most common and fundamental of these concern the minimum number of processors required for consensus, listed below:
- In the case of CFT, at least 2F + 1 nodes are required to achieve consensus.
- In the case of BFT, at least 3F + 1 nodes are required to achieve consensus.
Here, F represents the maximum number of faulty nodes that can be tolerated.
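These bounds translate directly into cluster-sizing arithmetic, as in the sketch below (the helper name is hypothetical):

```python
def min_nodes(f, byzantine):
    """Minimum cluster size that tolerates f simultaneous faulty nodes."""
    return 3 * f + 1 if byzantine else 2 * f + 1

# Tolerating a single fault: a CFT cluster needs 3 nodes, a BFT cluster 4.
assert min_nodes(1, byzantine=False) == 3  # 2F + 1
assert min_nodes(1, byzantine=True) == 4   # 3F + 1
```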
