Retrospective on Non-Custodial Activation Issues

On October 25, 2022, at approximately 6:04 pm UTC, the Pocket blockchain experienced a chain halt at block 74622. This happened after a governance transaction was sent to activate the Non-Custodial feature at block height 74620 following the v0.9.1.1 network upgrade of Pocket Core.

Issue

Around 9:30 pm UTC, the internal engineering team uncovered a non-deterministic issue involving an attempt to reward an incorrect node output address after the non-custodial staking feature was activated.

Within the 1,000-validator set, no ⅔+ quorum agreed on whether the rewards should be sent to the non-custodial output address, because the local validator cache differed between nodes. This resulted in a split network in which the number of votes for a proposed block did not surpass the Byzantine Fault Tolerance threshold, leading to a chain halt: liveness was lost, but safety was maintained.
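To make the threshold concrete, here is a minimal sketch of a quorum check in which a block is only committed when strictly more than two-thirds of total voting power votes for it. The function names and the vote split are hypothetical and chosen for illustration; this is not Pocket Core's consensus code.

```python
from typing import Dict, Optional


def has_quorum(voting_power_for: int, total_voting_power: int) -> bool:
    """True only when strictly more than 2/3 of total voting power agrees."""
    # Integer arithmetic avoids floating-point rounding right at the threshold.
    return 3 * voting_power_for > 2 * total_voting_power


def tally(votes: Dict[str, int], total_voting_power: int) -> Optional[str]:
    """Return the block ID that reached quorum, or None if no block did."""
    for block_id, power in votes.items():
        if has_quorum(power, total_voting_power):
            return block_id
    return None


# Hypothetical split: validators disagree on the resulting state,
# so their votes are divided between two candidate blocks.
split_votes = {"block_a": 600, "block_b": 400}
committed = tally(split_votes, total_voting_power=1000)
```

With this split, `tally` returns `None`: no block is committed, so the chain stops producing blocks (liveness is lost), but two conflicting blocks can never both pass the threshold (safety is maintained).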

Solution

After uncovering the non-deterministic issue, the PNI Protocol Engineering team quickly deployed a patch (v0.9.1.2) that bypassed the validator cache, hard-coded the Non-Custodial activation height in the current state to 74622, jumped nodes to a higher round, and provided a snapshot for block height 74617 so that the network could synchronize. To give the node runner network time to coordinate on upgrading to v0.9.1.2, the nodes were put to sleep until 4 PM UTC on October 26th, temporarily suspending all execution so that validators would resume participating in consensus at the same time.

After the nodes were reactivated on October 26th, the PNI Protocol Engineering team began monitoring the network to see whether we would be able to achieve consensus. After validators started syncing in the same round, consensus quickly climbed from 50% to 66.4%. Despite voting power eventually reaching 67.8%, above the two-thirds threshold, there were no valid block proposals, with many validators signing “0” (i.e. invalid) on proposed blocks.
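The idea behind the coordinated sleep can be sketched as follows: every node computes its wait from the same absolute UTC timestamp, so nodes that upgrade at different times still wake up together. This is an illustrative sketch only; the names `WAKE_AT` and `seconds_until` are hypothetical and do not reflect how the release implemented the sleep.

```python
from datetime import datetime, timezone

# Coordinated wake-up time from this post: 4 PM UTC on October 26th, 2022.
WAKE_AT = datetime(2022, 10, 26, 16, 0, tzinfo=timezone.utc)


def seconds_until(wake_at: datetime, now: datetime) -> float:
    """Seconds a node should sleep before resuming; zero if already past."""
    return max(0.0, (wake_at - now).total_seconds())


# Two hypothetical nodes that upgraded at different times.
early_node = datetime(2022, 10, 26, 1, 0, tzinfo=timezone.utc)
late_node = datetime(2022, 10, 26, 15, 30, tzinfo=timezone.utc)

early_wait = seconds_until(WAKE_AT, early_node)  # 15 hours
late_wait = seconds_until(WAKE_AT, late_node)    # 30 minutes
```

Because the wait is derived from an absolute time rather than a fixed duration, a node that restarts after the wake-up time gets a wait of zero and joins consensus immediately.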

The Second Issue

To understand why we were seeing so many invalid block proposals despite having 67.8% of voting power moving through rounds in sync, the PNI Protocol Engineering team began digging into validator consensus data, both with internal tooling and with the support of our validator operators, who graciously shared files and opened endpoints for us to see how quickly other validators might move through rounds and sync to balance out the nil votes.

Through this analysis, we discovered that the invalid block proposals stemmed from a discrepancy between the data directories of validators running different software versions, which resulted in different user-configurable, cache-driven states between blocks 74620 and 74622.
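As a simplified illustration of why cache-driven state breaks consensus, the sketch below applies the same block on two validators whose local caches differ and compares the resulting state hashes. Everything here is hypothetical (the `apply_block` and `state_hash` functions, the address names, and the reward amounts); it is a toy model, not Pocket Core's state machine.

```python
import hashlib
import json
from typing import Dict


def apply_block(state: Dict[str, int], rewards: Dict[str, int],
                cached_outputs: Dict[str, str]) -> Dict[str, int]:
    """Credit each reward to the cached output address, else to the node itself."""
    new_state = dict(state)
    for node, amount in rewards.items():
        # The recipient depends on the local cache, which is the source of divergence.
        recipient = cached_outputs.get(node, node)
        new_state[recipient] = new_state.get(recipient, 0) + amount
    return new_state


def state_hash(state: Dict[str, int]) -> str:
    """Deterministic hash of the state (keys sorted before hashing)."""
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


rewards = {"node1": 10}

# Validator A's cache knows node1's output address; validator B's cache does not.
hash_a = state_hash(apply_block({}, rewards, {"node1": "output1"}))
hash_b = state_hash(apply_block({}, rewards, {}))

diverged = hash_a != hash_b
```

Both validators process identical transactions, yet `diverged` is `True`, so each would treat the other's proposed block as invalid. State transitions need to depend only on data that every validator shares, which is why applying a common snapshot removes this kind of divergence.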

The Second Solution

Once the corrupted data directory issue had been identified, the decision was made to again put the nodes to sleep, this time until 4 PM UTC on October 27th. A simple release (v0.9.1.3) targeting validator nodes was deployed on October 26th, which would once again have validators jump directly to the right round for faster synchronization. Additionally, all validators were asked to apply the snapshot provided with v0.9.1.2 when upgrading to v0.9.1.3 to avoid similar data corruption issues.

After the nodes were reactivated, the network attempted to pass to the next block (74622), but only 66.4% of voting power was participating, just short of what was required. Around 5 PM UTC on October 27th, the PNI Protocol Engineering team began coordinating with validators to identify those who had not yet applied the provided snapshot to their validator databases and encourage them to do so, which enabled the network to climb past block 74622 to 74623 at 5:50 pm UTC.

Redirecting Traffic

With consensus achieved and blocks once again being produced, the PNI Protocol Engineering team began monitoring the network, while the PNI Infrastructure Engineering team prepared the necessary PRs to reroute network traffic from backup infrastructure back to network servicers.

Redirecting traffic back to servicers required a redeployment of 18 dispatcher nodes as seeds, which the team executed in batches until all dispatchers were deployed around 3 AM UTC. At the end of every batch the Infrastructure team confirmed that the dispatchers were 100% synced and healthy. After the final batch had been deployed, the PNI Backend Engineering team re-routed the network traffic to service nodes, officially concluding the chain halt and fully restoring decentralized service.
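The batch-by-batch redeployment described above follows a common pattern: deploy a batch, confirm every node in it is healthy, and only then move on. The sketch below illustrates that pattern under stated assumptions: the batch size of 6, the node names, and the `deploy` and `is_healthy` callbacks are all hypothetical, and this is not the Infrastructure team's actual tooling.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rollout(nodes: Sequence[str], batch_size: int,
            deploy: Callable[[str], None],
            is_healthy: Callable[[str], bool]) -> List[str]:
    """Deploy in batches, stopping if any node in a batch fails its health check."""
    deployed: List[str] = []
    for batch in batches(nodes, batch_size):
        for node in batch:
            deploy(node)
        unhealthy = [node for node in batch if not is_healthy(node)]
        if unhealthy:
            # Halt before touching later batches so the problem stays contained.
            raise RuntimeError(f"unhealthy after deploy: {unhealthy}")
        deployed.extend(batch)
    return deployed


# 18 dispatchers, as in this post; the names and batch size are assumptions.
dispatchers = [f"dispatcher-{i}" for i in range(18)]
deploy_log: List[str] = []
completed = rollout(dispatchers, 6, deploy_log.append, lambda node: True)
```

Gating each batch on a health check limits the impact of a bad deployment to a single batch rather than the whole dispatcher fleet.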

Retrospective

Current state vs. future state. It’s important to remember that V0 is an MVP that found product-market fit, created an industry that others are now following, and was built on Cosmos, which was designed 6 years ago (a long time in crypto’s lifetime). V1 is meant to be more resilient to these types of issues.

Chain halts and decentralization. Hacks are very common in the industry, but we don’t hear of chain halts as often, even though many blockchains are built on top of Cosmos and Tendermint BFT. The reason is that, other than BTC and ETH, many chains are substantially centralized, so the companies designing the protocol control most of the validator power and can either do a re-genesis or restart all the validators themselves. The fact that most of the time was spent on coordinating a restart rather than finding the root cause shows that Pocket is much more decentralized than most other chains out there.

Importance of social coordination. In light of this decentralization of the node network, last week’s events required a high degree of coordination between PNI and the broader Pocket community. From coordinating ideal timing with node operators to pushing for adoption of new releases with validators, it proved to be a challenging week, but we nevertheless saw the benefits of a strong, aligned community in getting the network back on the right footing.

Servicing of relays. It’s also important to note that even during the chain halt, relays continued to be serviced by the Altruist Network, meaning that the Portal RPC service remained online (although node runners did not receive rewards during this time). Events like this highlight the critical role of the Altruist Network as a fallback for keeping RPC service up and running.

Wrapping Up

Last week was a challenging week, as we worked through the chain halt that was triggered by the activation of non-custodial staking. After fixing two distinct issues (around a node output address and invalid block proposals), we were able to deploy a new release in collaboration with our node runner community and started to see the block height climb again on October 27th. From that point, we were able to redirect traffic from Altruists back to servicers, and fully restore decentralized service.

As alluded to above, getting a decentralized network of nodes to recover from a chain halt involves painstaking community coordination and alignment, and we are full of gratitude for the way our node runners came together to get on the same page and move past these issues. Thanks to all for keeping up to date with our Discord communications, updating nodes as needed, and getting the chain back to new heights. We truly appreciate you!