The Ethereum ecosystem relies on its client diversity. The use of multiple client implementations reduces the risk of any single implementation becoming the vessel for network-wide catastrophic failures. Client diversity ensures that Ethereum can continue to function even if a client implementation is the target for attacks or experiences errors. Additionally, client diversity promotes healthy competition and innovation in the development of Ethereum clients.
Highly available Ethereum nodes using N-Version design
Like all software, Ethereum clients are vulnerable to faults in their underlying execution stack. Disruptions caused by unstable operating systems, networks, and hardware can result in downtime for external Ethereum users such as exchanges and dApps.
In this post, we share the most relevant insights of our paper: "Highly Available Blockchain Nodes with N-Version Design". We focus on how to achieve high availability for blockchain nodes under unstable execution environments, by taking advantage of the existing client diversity and making N-Version design practical.
We also provide details on the architecture of N-ETH, our N-version Ethereum client prototype, and on the experiments conducted to test its effectiveness in achieving high availability under unstable execution. Furthermore, we present the most relevant results of these experiments, including comparisons with regular single-version Ethereum clients.
Accessing the Ethereum Blockchain
Access to the Ethereum blockchain is commonly achieved through the JSON-RPC interface of an Ethereum node. This interface may be set up locally by running a full node or using a light client. Otherwise, users rely on third-party service providers to do so. The interface provides the means to query the blockchain for information such as account balances, transaction histories, and contract data.
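To make this concrete, here is a minimal sketch of what such a query looks like on the wire. It only builds request bodies and decodes a reply, without contacting any node. The helper names (`build_request`, `parse_quantity`), the zero address, and the sample response are illustrative assumptions; `eth_blockNumber` and `eth_getBalance` are standard Ethereum JSON-RPC methods.

```python
import json


def build_request(method, params, request_id=1):
    """Return a JSON-RPC 2.0 request body as a string."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}
    )


def parse_quantity(response_text):
    """Decode a hex-encoded quantity result (such as a block number) into an int."""
    body = json.loads(response_text)
    if "error" in body:
        raise RuntimeError(body["error"].get("message", "unknown JSON-RPC error"))
    return int(body["result"], 16)


# eth_blockNumber takes no parameters.
block_number_request = build_request("eth_blockNumber", [])

# eth_getBalance takes an address and a block tag; the address is a placeholder.
balance_request = build_request(
    "eth_getBalance",
    ["0x0000000000000000000000000000000000000000", "latest"],
    request_id=2,
)

# Hypothetical response text, written by hand for this example.
sample_response = '{"jsonrpc": "2.0", "id": 1, "result": "0x10d4f"}'
latest_block = parse_quantity(sample_response)
```

In a real deployment, the request body would be sent as an HTTP POST to the node's JSON-RPC endpoint, which is exactly the point where the failures discussed in this post become visible to the application.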
However, regardless of the target for requests, these access points may become unavailable or degraded at any point, due to unexpected faults in their execution stack. As described by the community, Ethereum clients experience bugs which often cause crashes, state corruptions, etc.
Figure 1. An external application queries the blockchain through JSON-RPC requests.
Figure 1 illustrates the simplest setup for an external application to connect to the Ethereum blockchain. We classify the availability states in which the application can perceive the client as follows:
- Available: The client is able to respond correctly to requests with up-to-date information.
- Degraded: The client is able to respond to requests, although the responses may be incomplete or outdated.
- Unavailable: The client is unable to respond to requests, or the requests time out.
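The three states above can be sketched as a small classification function. This is an illustration only: the inputs (`responded`, `complete`, `block_lag`) and the `max_lag` threshold used to decide when a response counts as outdated are assumptions made for this example, not the exact criteria used in the paper.

```python
from enum import Enum


class Availability(Enum):
    AVAILABLE = "available"
    DEGRADED = "degraded"
    UNAVAILABLE = "unavailable"


def classify(responded, complete=True, block_lag=0, max_lag=2):
    """Classify one request outcome as seen by the external application.

    responded: False if the request failed or timed out.
    complete:  False if the response was missing expected data.
    block_lag: how many blocks the response is behind a reference head.
    max_lag:   hypothetical tolerance before a response counts as outdated.
    """
    if not responded:
        return Availability.UNAVAILABLE
    if not complete or block_lag > max_lag:
        return Availability.DEGRADED
    return Availability.AVAILABLE
```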
One way of mitigating the risk of blockchain nodes becoming unavailable is to use N-Version design. Similar to Avizienis’ vision, the N-Version design concept involves running multiple versions of a system simultaneously, with each version producing independent outputs. The outputs are then analyzed to detect errors or discrepancies, which can be corrected or reconciled to produce a correct output.
In the context of Ethereum, N-Version design is made possible by the existing client diversity. Our N-version Ethereum node prototype, N-ETH, uses Geth, Besu, Nethermind, and Erigon as internal sub-nodes. These sub-nodes are encapsulated behind a proxy that routes and compares their responses. Figure 2 illustrates an overview of N-ETH. In this case, an external application sends requests to the node through the interface exposed by the proxy component. Relying on multiple internal nodes allows N-ETH to achieve high availability, as well as to hide failure scenarios from external users.
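As a rough illustration of how a proxy can compare sub-node responses, the sketch below applies a simple majority vote over the responses that actually arrived. This is one possible reconciliation rule chosen for the example; the comparison logic implemented in N-ETH is described in the paper and may differ. The function name `reconcile` and the sample values are hypothetical.

```python
from collections import Counter


def reconcile(responses):
    """Pick the result backed by a strict majority of the sub-nodes that answered.

    responses maps a sub-node name to its result, or to None if that
    sub-node failed or timed out. Returns None when no sub-node answered
    or when no result has a strict majority.
    """
    answered = [result for result in responses.values() if result is not None]
    if not answered:
        return None
    value, votes = Counter(answered).most_common(1)[0]
    # Strict majority among the responses received; a 2-2 split yields None.
    return value if votes * 2 > len(answered) else None


# One sub-node is down and one disagrees; the two agreeing nodes win.
agreed = reconcile(
    {"geth": "0x10", "besu": "0x10", "nethermind": None, "erigon": "0xf"}
)
```

With this rule, a crashed sub-node simply drops out of the vote, so the external application still receives an answer as long as the remaining sub-nodes agree.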
Figure 2. Overview of N-ETH: An N-version Ethereum client. The used versions were the latest available at the time of performing experiments (Oct 2022).
Simulating Unstable Execution Environments
To disrupt the availability of Ethereum clients and N-ETH, we simulate faulty execution environments by amplifying the error rate of system calls. System calls are an interface offered by the operating system to allocate and interact with a computer’s resources. These calls are frequently unsuccessful, but the calling processes expect this as normal behavior, and react accordingly.
To create a sensible fault model, we measured the spontaneous system call error rate of our four Ethereum client implementations. With this data, we assembled several fault injection strategies (FIS) with amplified system call failure rates. In total, we crafted 19 FISs with gradually increasing aggressiveness. That is, FIS 1 is the least aggressive, and FIS 19 is the most aggressive.
This process also ensures that the produced FISs are comparable and realistic, i.e. include system-call errors known to occur spontaneously in at least one of the analyzed client implementations.
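The idea of amplifying naturally occurring error rates can be sketched as follows. This is a toy model, not the tooling used in our experiments: the baseline rates, the linear amplification factors, and the function names are all hypothetical, and a real injector would intercept system calls at the operating system level rather than in Python.

```python
import random

# Hypothetical baseline: probability that a (syscall, error code) pair
# occurs spontaneously. The numbers are invented for illustration.
BASELINE = {
    ("read", "EAGAIN"): 0.02,
    ("futex", "ETIMEDOUT"): 0.01,
}


def make_strategy(baseline, factor):
    """Amplify every baseline error rate by `factor`, capped at 1.0."""
    return {key: min(1.0, rate * factor) for key, rate in baseline.items()}


def make_strategies(baseline, count=19):
    """Build `count` strategies of increasing aggressiveness.

    The linear factor schedule (2x, 3x, ...) is an assumption of this sketch.
    """
    return [make_strategy(baseline, level + 1) for level in range(1, count + 1)]


def should_fail(strategy, syscall, error_code, rng):
    """Decide whether this invocation should be forced to fail."""
    return rng.random() < strategy.get((syscall, error_code), 0.0)


strategies = make_strategies(BASELINE)
rng = random.Random(0)  # Seeded so that a run is reproducible.
injected = sum(
    should_fail(strategies[-1], "read", "EAGAIN", rng) for _ in range(1000)
)
```

Because each strategy only contains error codes present in the baseline, every injected failure is one that the client could also encounter spontaneously.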
Establishing a Baseline
To test the effectiveness of N-ETH in achieving high availability under unstable conditions, we first needed to establish a comparable baseline of regular single-version deployments.
We subject our four Ethereum client implementations to the previously computed FISs, along with a workload targeting the JSON-RPC interface. The external application applying the workload keeps a record of the requests, responses, and any corresponding error.
Results reveal a key pattern:
Ethereum client implementations behave differently, even under similarly faulty conditions.
Table 1 shows that under the same faulty conditions, the errors observed by an external application vary in type and frequency depending on the client implementation. Some errors are observed in all implementations, while others are specific to a single one.
|Post: Client.Timeout while awaiting headers||4||820||357||864|
|connect: connection refused||170||409||70||623|
|dial tcp: connect: connection reset by peer||-||-||-||1|
|http: server closed idle connection||-||-||-||2|
|malformed HTTP response||-||9||-||-|
|read: connection reset by peer||854||598||6||153|
|read: Client.Timeout while awaiting headers||-||-||1||-|
|Client.Timeout while reading body||3||-||-||1|
|gzip: invalid checksum||-||865||-||-|
|invalid byte in chunk length||-||2||599||-|
|invalid character in response||-||293||867||-|
|unexpected EOF||4||2||3||3,373|
|unexpected end of JSON input||-||1||-||-|
Table 1: Types and frequency of errors received by external application.
This observation is central to validating the usefulness of N-version design, as it shows that the existing Ethereum client implementations have non-overlapping error modes and can “cover for each other’s errors” under similar faulty environments. Specifically, there are 5 error types that are triggered in only one client implementation.
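Finding such client-specific error modes from a tally of observed errors is straightforward. The sketch below shows one way to do it; the client labels and counts are placeholders invented for the example and are not the values from Table 1.

```python
def client_specific_errors(error_counts):
    """Return {error_type: client} for errors observed in exactly one client.

    error_counts maps an error type to a {client: count} dictionary.
    """
    specific = {}
    for error_type, per_client in error_counts.items():
        seen_in = [client for client, count in per_client.items() if count > 0]
        if len(seen_in) == 1:
            specific[error_type] = seen_in[0]
    return specific


# Placeholder tally with hypothetical clients and counts.
tally = {
    "connection refused": {"client_a": 12, "client_b": 30, "client_c": 7},
    "invalid checksum": {"client_a": 0, "client_b": 41, "client_c": 0},
    "unexpected EOF": {"client_a": 2, "client_b": 0, "client_c": 5},
}
only_one_client = client_specific_errors(tally)
```

Here only "invalid checksum" is reported, attributed to `client_b`, since the other two error types appear in more than one client.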
Table 2 shows a comparison of the availability scores of Geth and Besu, aggregated by FIS. It provides further data showing the asymmetry of behavior under faulty execution.
|FIS [1 - 2]||1.000||0.000||0.000||0.999||0.000||0.000|
|FIS [4 - 8]||1.000||0.000||0.000||0.000||0.999||0.000|
Table 2: Comparison of Geth and Besu under system-call based fault injection strategies.
The most interesting finding is that although Geth has the highest average availability for all FISs, Besu is able to perform better under the most aggressive FISs. It follows that an N-version client can take advantage of this specific asymmetry and offer higher availability than any single client under unstable execution.
A full version of Table 2 containing the comparison of all 4 client implementations and 19 FISs is available in the paper.
High Availability Under Fault Injection
By executing N-ETH under the same FISs and workload as our baseline, we are able to measure any gains in availability.
|FIS [1 - 13]||1.000||0.000||0.000|
Table 3: N-ETH’s availability under system-call based fault injection strategies.
Table 3 shows that N-ETH achieved higher availability and reliability on average than any of the single-version Ethereum clients. Furthermore, the results show a noticeable increase in availability for the most aggressive FISs, and a suppression of unavailability, compared to the best-performing client. Availability increases from 80.2% for the best Ethereum client (Geth) to 94.4%. Unavailability decreases from 1.3% for the best Ethereum client (Besu) to 0.02%.
These results demonstrate the effectiveness of the N-Version design in achieving high availability for Ethereum nodes, and highlight the benefits of client diversity in the development of Ethereum clients.
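To put the reported figures in perspective, the short calculation below derives the absolute gain in availability and the relative reduction in unavailability. It uses only the percentages quoted above; the variable names are ours.

```python
# Figures reported above, in percent.
best_single_availability = 80.2    # Geth
neth_availability = 94.4
best_single_unavailability = 1.3   # Besu
neth_unavailability = 0.02

# Absolute gain, in percentage points.
availability_gain_points = round(neth_availability - best_single_availability, 1)

# How many times smaller the unavailability becomes.
unavailability_reduction_factor = round(
    best_single_unavailability / neth_unavailability, 1
)
```

This corresponds to a gain of 14.2 percentage points in availability and a roughly 65-fold reduction in unavailability.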
In this post, we present our approach of leveraging existing Ethereum client diversity to provide enhanced availability for external applications. The key finding in our experiments is that Ethereum clients’ asymmetric behavior under similar instability is sufficient to make N-version design viable. We take advantage of this finding in N-ETH, which is able to outperform all 4 tested state-of-the-art single-version Ethereum client deployments in terms of availability.
If you are interested in this work, the corresponding research paper is available online, as well as the code and experimental data of N-ETH. If you have any questions, feel free to contact us via email: javierro at kth.se.