Ethereum, the second-largest cryptocurrency by market capitalization, has proven to be a remarkably resilient platform since its launch in 2015. Despite facing numerous challenges over the years, such as security breaches and network congestion, Ethereum has continued to thrive and evolve.
A day before the one-month anniversary of the Shapella upgrade, an unidentified issue on the Ethereum Beacon Chain delayed transaction finality for about 25 minutes, starting at approximately 8:15 pm UTC. New blocks continued to be proposed, but they could not be finalized. In this blog post, we follow the sequence of events and the mitigations.
- The state of Non-Finality
- First Inactivity Leak
- Mitigation Plan
- Client Diversity in Ethereum
- Next Steps
The state of Non-Finality
Since The Merge, Ethereum has run on proof-of-stake (PoS) consensus. Here, non-finality means that the current state of the chain is not yet considered permanent or "final". In other words, there is still a possibility that some transactions or blocks on the chain may be reorganized or discarded in the future.
On May 11th, Ethereum lost finality even though blocks were still being proposed. The blockchain recorded a significant drop in the number of attestations from epoch 200,552 to epoch 200,554. The root cause was an influx of attestations for previous epochs, received by consensus clients, particularly Prysm nodes. These attestations did not correspond with the latest checkpoint in the fork choice.
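To relate the epoch numbers above to wall-clock time, here is a minimal sketch of the arithmetic. The constants (12-second slots, 32 slots per epoch, and the mainnet Beacon Chain genesis timestamp) are not stated in this post; they are assumed here from the mainnet configuration, and the function names are our own.

```python
from datetime import datetime, timezone

# Assumed mainnet Beacon Chain constants (not taken from this post).
GENESIS_TIME = 1_606_824_023  # Unix time of Beacon Chain genesis (Dec 1, 2020)
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
SECONDS_PER_EPOCH = SECONDS_PER_SLOT * SLOTS_PER_EPOCH  # 384 s, i.e. 6.4 minutes


def epoch_start_time(epoch: int) -> datetime:
    """Return the UTC time at which the given epoch begins."""
    timestamp = GENESIS_TIME + epoch * SECONDS_PER_EPOCH
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def epochs_in(minutes: float) -> float:
    """Return how many epochs fit into the given number of minutes."""
    return minutes * 60 / SECONDS_PER_EPOCH


if __name__ == "__main__":
    print(epoch_start_time(200_552))  # start of the first affected epoch
    print(epochs_in(25))              # length of the delay, measured in epochs
```

Under these assumptions, epoch 200,552 begins at 20:13:11 UTC on May 11, 2023, which lines up with the reported start at approximately 8:15 pm UTC, and a 25-minute delay corresponds to a little under four epochs.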
This situation is usually a result of peered nodes lacking access to all the blocks for the remainder of the epoch. Consequently, Prysm had to expend significant resources to replay the state, resulting in CPU spikes and out-of-memory (OOM) issues. This situation is often referred to as a "death spiral."
Consensus clients are responsible for keeping the network in agreement on the chain and maintaining the beacon state, and they play a key role in attesting to and proposing blocks. During this incident, the strain on these clients was considerably high. Attestations from nodes that were struggling to synchronize, or that had an outdated view of the chain, were broadcast across the network, increasing the pressure on consensus clients.
Nodes, especially those operating on weaker hardware, struggled to keep up with the chain, leading to the non-finalization of transactions. Consensus clients had to recreate the beacon state multiple times to validate these attestations, leading to high CPU usage and memory bloat. This issue was especially noticeable in Prysm, where a cache designed to handle this situation quickly filled up due to the increased size of the validator set and the volume of untimely attestations.
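The effect of a cache that fills up can be illustrated with a toy model. The sketch below is not Prysm's code: the `StateCache` class, its least-recently-used eviction policy, and the string stand-in for a replayed state are all assumptions made for illustration. It only shows why a bounded cache stops helping once attestations target more distinct old checkpoints than the cache can hold.

```python
from collections import OrderedDict


class StateCache:
    """Hypothetical bounded LRU cache of replayed beacon states."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: OrderedDict = OrderedDict()
        self.replays = 0  # number of expensive state replays performed

    def get_state(self, checkpoint: str) -> str:
        if checkpoint in self.entries:
            # Cache hit: mark the entry as most recently used.
            self.entries.move_to_end(checkpoint)
            return self.entries[checkpoint]
        # Cache miss: stand-in for the expensive state replay.
        self.replays += 1
        state = f"state@{checkpoint}"
        self.entries[checkpoint] = state
        if len(self.entries) > self.capacity:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
        return state


def count_replays(capacity: int, targets: list) -> int:
    """Count the replays needed to serve attestations for the given targets."""
    cache = StateCache(capacity)
    for target in targets:
        cache.get_state(target)
    return cache.replays


if __name__ == "__main__":
    targets = ["a", "b", "c"] * 2  # attestations cycling over 3 old checkpoints
    print(count_replays(3, targets))  # cache large enough for all targets
    print(count_replays(2, targets))  # cache one slot too small
```

In this toy model, a cache with room for all three targets replays each state once, while a cache that is one slot too small replays on every single lookup, which is the kind of repeated work described above.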
First Inactivity Leak
During this incident, the Ethereum mainnet experienced its first-ever inactivity leak. This is an emergency state that alters the rewards and penalties for validators. It is triggered when the beacon chain fails to finalize a checkpoint for more than four epochs.
The inactivity leak is a safeguard mechanism designed to restore finality if more than a third of the validators go offline. It operates by progressively diminishing the stakes of inactive validators until the active validators hold two-thirds of the remaining stake. Once this threshold is met, checkpoints can be finalized once again.
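The trigger and the recovery condition can be sketched in a few lines. The trigger check follows the "more than four epochs" rule described above. The recovery loop is a deliberately simplified toy model that shrinks inactive stake by a fixed fraction per epoch; the real penalty schedule is different, and `leak_rate` and the function names are hypothetical.

```python
MIN_EPOCHS_TO_INACTIVITY_PENALTY = 4  # the "more than four epochs" threshold


def is_in_inactivity_leak(previous_epoch: int, finalized_epoch: int) -> bool:
    """Return True once finality has lagged by more than four epochs."""
    finality_delay = previous_epoch - finalized_epoch
    return finality_delay > MIN_EPOCHS_TO_INACTIVITY_PENALTY


def epochs_until_supermajority(active_stake: float,
                               inactive_stake: float,
                               leak_rate: float) -> int:
    """Toy model: count epochs until active stake is 2/3 of the total.

    Assumes inactive stake shrinks by `leak_rate` (a fraction) every epoch,
    which is a simplification and not the protocol's actual penalty formula.
    """
    if not 0 < leak_rate < 1:
        raise ValueError("leak_rate must be between 0 and 1")
    epochs = 0
    # Active validators need at least two-thirds of the remaining stake.
    while 3 * active_stake < 2 * (active_stake + inactive_stake):
        inactive_stake *= 1 - leak_rate
        epochs += 1
    return epochs


if __name__ == "__main__":
    print(is_in_inactivity_leak(previous_epoch=105, finalized_epoch=100))
    print(epochs_until_supermajority(60.0, 40.0, leak_rate=0.1))
```

With 60% of the stake active, the toy model keeps leaking until the inactive share has dropped far enough for the active validators to hold two-thirds of what remains; with 70% active, no leak is needed at all.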
Mitigation Plan
During the "death spiral," a subtle bug was discovered in Prysm nodes where they were not utilizing the correct state for shuffling computation. In response to this issue, Ethereum developers have proposed several solutions. These suggestions include optimizing the caching scheme and implementing heuristics to filter out untenable attestations.
Moreover, a corrective measure is under development to enhance Prysm's resilience to similar incidents in the future. This solution involves disregarding attestations whose target belongs to an older epoch and has not served as a checkpoint in any chain known to the node.
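A filter of this kind might look roughly like the sketch below. This is an illustration of the rule as described in this post, not Prysm's implementation: the `Checkpoint` type, the `should_process` function, and the reading of "older" as "earlier than the current epoch" are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical stand-in for an attestation target (epoch plus block root)."""
    epoch: int
    root: str


def should_process(target: Checkpoint,
                   current_epoch: int,
                   known_checkpoints: set) -> bool:
    """Return False for attestations the node can cheaply disregard.

    An attestation is dropped when its target is from an older epoch
    (assumed here to mean earlier than the current epoch) and that target
    has never been a checkpoint in any chain this node knows about.
    """
    targets_older_epoch = target.epoch < current_epoch
    if targets_older_epoch and target not in known_checkpoints:
        return False
    return True


if __name__ == "__main__":
    known = {Checkpoint(epoch=9, root="0xaa")}
    print(should_process(Checkpoint(9, "0xaa"), 10, known))   # old but known
    print(should_process(Checkpoint(9, "0xbb"), 10, known))   # old and unknown
    print(should_process(Checkpoint(10, "0xcc"), 10, known))  # current epoch
```

The point of the design is that the check needs only a set lookup, so an unknown old target is rejected before the node spends any effort replaying state for it.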
Client Diversity in Ethereum
This incident highlights the importance of client diversity in a decentralized network like Ethereum. Had the bug been observed in any other client, the loss of finality could possibly have gone unnoticed. But had Prysm been as dominant as Geth is among execution clients, the outcome could have been even worse.
Client diversity contributes to the overall resilience of the network. It diminishes the risk of the entire network collapsing due to an issue with a single client and aids in preserving the decentralized nature of the network. Thus, it's an essential element of the network's long-term sustainability and security.
Next Steps
The developers are committed to rectifying the problem and preventing similar incidents from occurring in the future. They are currently drafting a document to detail the incident and the measures taken to resolve it.
We are announcing v4.0.3-hotfix. This release contains the optimization to prevent the beacon node from high resource usage during turbulent times. It is highly recommended to upgrade if your node is under heavy usage. See release notes for more info: https://t.co/SMb4hzL1cW
Source: Prysm Ethereum Client (@prylabs), May 13, 2023
In addition, an update is anticipated to be released soon. This will include several improvements to address the issues that arose during the incident. These improvements include the previously mentioned optimizations to the caching scheme, the application of heuristics to filter out unfeasible attestations, and a fix to enhance the resilience of Prysm.
Although the incident caused a temporary disruption, it didn't cripple the Ethereum network. The developers are diligently working on solutions and optimizations to avoid similar problems in the future. The inactivity leak that transpired during this incident is a crucial part of the network's resilience mechanism, designed to assure long-term chain protection in the event of catastrophic circumstances.
Credits: @superphiz, @potuz1, @benjaminion_xyz, @terencechain & @preston_vanloon
Cover image: Original photo by Brian Suman modified by team EtherWorld.