Conversation

Contributor

@ibalajiarun ibalajiarun commented Dec 13, 2025

Description

Introduces a pipeline phase that communicates with secret_share_manager from #18223 to retrieve a shared secret key for the block and attempts to decrypt the transactions in the block. The pipeline stage returns early if the Option<SecretShareConfig> is None, which controls whether the decryption feature is enabled.

The pipeline stage consolidates multiple secret share derivation steps into a single stage because that keeps the implementation simpler and cleaner. Perf improvements will come in a future PR; they should be straightforward with tokio::spawn_blocking and will not require additional stages.
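The gating described above can be sketched as follows. This is a minimal, hypothetical illustration of the early-return pattern; all type and function names here are stand-ins, not the actual aptos-core types.

```rust
// Hypothetical sketch: the stage is a no-op when no SecretShareConfig is
// provided. Names are illustrative, not the PR's actual types.

#[derive(Clone)]
pub struct SecretShareConfig {
    pub threshold: usize,
}

pub struct Transaction {
    pub encrypted: bool,
    pub payload: Vec<u8>,
}

/// Returns the input unchanged when decryption is disabled (config is None);
/// otherwise "decrypts" each encrypted transaction (stubbed here).
pub fn decryption_stage(
    config: Option<&SecretShareConfig>,
    txns: Vec<Transaction>,
) -> Vec<Transaction> {
    // Early return: feature disabled, pass transactions through untouched.
    let Some(_config) = config else {
        return txns;
    };
    txns.into_iter()
        .map(|mut txn| {
            if txn.encrypted {
                // Real code would use the shared secret key here.
                txn.encrypted = false;
            }
            txn
        })
        .collect()
}
```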

@ibalajiarun ibalajiarun force-pushed the balaji/enc-pool-8 branch 2 times, most recently from e48b887 to e22d1fe Compare December 16, 2025 01:48
@ibalajiarun ibalajiarun changed the title [pipeline] Split prepare into prepare and sig verify [pipeline] introduce decryption pipeline stage Dec 16, 2025
@ibalajiarun ibalajiarun marked this pull request as ready for review December 16, 2025 01:57
to_truncate
} else {
encrypted_txns
};
Bug: Encrypted transactions beyond 10 are silently dropped

The code truncates encrypted transactions to a maximum of 10, and any transactions beyond this limit are permanently lost. After partition separates encrypted from unencrypted transactions, truncation discards the excess encrypted transactions. These discarded transactions appear neither in decrypted_txns (due to truncation) nor in unencrypted_txns (because they were encrypted), resulting in silent data loss. The TODO comment suggests this is known, but shipping code that drops transactions is a critical issue.


Contributor Author:

I am not yet "shipping" this code.
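The reported truncation behavior can be reproduced with a small sketch. This is an illustrative reconstruction of the pattern the report describes (partition, then cap the encrypted side), not the PR's actual code; the cap constant and names are hypothetical.

```rust
// Minimal illustration of the reported issue: partitioning then truncating
// the encrypted side silently drops everything past the cap. Names are
// illustrative, not the PR's actual code.

const MAX_ENCRYPTED: usize = 10;

/// Returns (kept_encrypted, unencrypted). Encrypted items beyond
/// MAX_ENCRYPTED are dropped and appear in neither output.
pub fn split_and_cap(txns: Vec<(bool, u32)>) -> (Vec<u32>, Vec<u32>) {
    let (encrypted, unencrypted): (Vec<_>, Vec<_>) =
        txns.into_iter().partition(|(enc, _)| *enc);
    let mut encrypted: Vec<u32> = encrypted.into_iter().map(|(_, id)| id).collect();
    encrypted.truncate(MAX_ENCRYPTED); // anything past index 9 is lost here
    let unencrypted = unencrypted.into_iter().map(|(_, id)| id).collect();
    (encrypted, unencrypted)
}
```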


let (encrypted_txns, unencrypted_txns): (Vec<_>, Vec<_>) = input_txns
.into_iter()
.partition(|txn| txn.is_encrypted_txn());
Bug: Transaction ordering changed by partition and concat

The partition operation separates transactions into encrypted and unencrypted groups, then concat rejoins them with all decrypted transactions before unencrypted ones. This reorders transactions from their original sequence. For example, [unenc1, enc1, unenc2, enc2] becomes [dec1, dec2, unenc1, unenc2]. In blockchain systems, transaction ordering is critical for deterministic execution and consensus. This reordering would cause nodes to disagree on execution results.


Contributor Author:

Good observation, this is expected reordering.

Contributor:

So are we prioritizing encrypted txns? Why not keep the original order?

Contributor Author:

That is to make sure encrypted txns are not cut when the block gas limit overflows. They will still be revealed, just not executed.
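The reordering under discussion can be reproduced with a small sketch: partition preserves relative order within each group, and concatenation then places the decrypted txns ahead of the unencrypted ones. The string labels and the decrypt stub below are purely illustrative.

```rust
// Sketch of the reordering discussed above: partition groups encrypted
// txns first, and concatenation places decrypted output ahead of the
// unencrypted txns. Labels are illustrative.

pub fn reorder(txns: Vec<&'static str>) -> Vec<String> {
    let (encrypted, unencrypted): (Vec<_>, Vec<_>) =
        txns.into_iter().partition(|t| t.starts_with("enc"));
    // Stand-in for decryption: enc -> dec.
    let decrypted: Vec<String> = encrypted
        .into_iter()
        .map(|t| t.replace("enc", "dec"))
        .collect();
    // Decrypted txns deliberately come first so the block gas limit
    // cannot cut them; they are revealed even if not executed.
    decrypted
        .into_iter()
        .chain(unencrypted.into_iter().map(String::from))
        .collect()
}
```

This matches the example in the report: [unenc1, enc1, unenc2, enc2] becomes [dec1, dec2, unenc1, unenc2].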

};
let Entry::Occupied(o) = entry else {
unreachable!("Entry must exist");
};

Bug: Coordinator may panic on vacant HashMap entry

In make_coordinator, when processing rand_ready_block or secret_ready_block, the code calls entry(first_block_id).and_modify(...) which does nothing if the key doesn't exist, returning an Entry::Vacant. The subsequent check let Entry::Occupied(o) = entry else { unreachable!(...) } will panic if the entry is vacant. This could occur if a ready signal arrives for a block that was never registered or was already processed and removed.


Contributor Author:

This case cannot happen, because the hashmap insert happens when blocks are sent to the rand and secret managers.
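For reference, the entry().and_modify() behavior the report describes can be seen in a minimal sketch. and_modify leaves a vacant entry untouched, so a let-else on Entry::Occupied panics for an unknown key; the version below returns None on the vacant path instead, purely to make it visible. Names are illustrative.

```rust
// Sketch of the pattern the report flags: entry().and_modify() leaves a
// Vacant entry untouched, so the let-else on Occupied would panic for an
// unknown key. Per the author, keys are always inserted before signals
// arrive, so the vacant case is unreachable in practice.

use std::collections::hash_map::{Entry, HashMap};

/// Returns Some(updated value) if the key exists, None if vacant
/// (instead of panicking with unreachable!).
pub fn mark_ready(map: &mut HashMap<u64, bool>, key: u64) -> Option<bool> {
    let entry = map.entry(key).and_modify(|ready| *ready = true);
    match entry {
        Entry::Occupied(o) => Some(*o.get()),
        // The PR uses `unreachable!` here; returning None shows the
        // vacant path explicitly.
        Entry::Vacant(_) => None,
    }
}
```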

max_txns_from_block_to_execute,
block_gas_limit,
));
}

Bug: Early return drops channel without sending signal

When secret_share_config is Some but there are no encrypted transactions, the function returns early without sending anything on derived_self_key_share_tx. This causes the receiver (derived_self_key_share_rx) to receive a channel error when the sender is dropped. The channel type is Sender<Option<SecretShare>>, suggesting None should be sent to indicate "no key share needed" rather than dropping the sender and causing an error in secret_sharing_derive_self_fut.


Contributor Author:

TODO
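The hazard described in the report can be illustrated with std::sync::mpsc standing in for the pipeline's channel: returning early and dropping the sender makes the receiver observe an error rather than a clean None. The function below is a hypothetical stand-in, not the PR's code.

```rust
// Illustration of the reported hazard with std::sync::mpsc standing in
// for the pipeline's channel: returning early and dropping the sender
// makes the receiver observe an error rather than a clean None.

use std::sync::mpsc;

/// When `has_encrypted` is false the sender is dropped without sending,
/// mirroring the early return; sending None explicitly instead would
/// avoid the receiver-side error.
pub fn run_stage(has_encrypted: bool) -> Result<Option<u32>, mpsc::RecvError> {
    let (tx, rx) = mpsc::channel::<Option<u32>>();
    if has_encrypted {
        tx.send(Some(42)).unwrap();
    }
    // tx dropped here if nothing was sent.
    drop(tx);
    rx.recv()
}
```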

@ibalajiarun ibalajiarun force-pushed the balaji/enc-pool-7 branch 4 times, most recently from ca3df8b to 1f1d14f Compare December 17, 2025 04:56
Base automatically changed from balaji/enc-pool-7 to main December 17, 2025 05:51


})
.collect();

let encryption_round = block.round();
Contributor:

There is one optimization to reduce the trusted setup size: encryption_round can be decoupled from the block round and incremented only when a block contains encrypted txns. Then the trusted setup size is the number of blocks containing encrypted txns rather than the total number of blocks. Not sure how important it is for now, but just a note for potential optimization.

Contributor Author:

Interesting, I will add this as a TODO. We can revisit it when Rex is back.
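The suggested optimization could be sketched as a counter that advances only for blocks carrying encrypted txns, so the trusted setup scales with encrypted-txn blocks rather than all blocks. The type below is a hypothetical illustration, not the PR's code.

```rust
// Sketch of the suggested optimization: encryption_round advances only
// for blocks that carry encrypted txns. Illustrative names only.

pub struct EncryptionRound(u64);

impl EncryptionRound {
    pub fn new() -> Self {
        EncryptionRound(0)
    }

    /// Advance the round only when the block actually contains
    /// encrypted txns; return the round to use for this block, if any.
    pub fn on_block(&mut self, has_encrypted_txns: bool) -> Option<u64> {
        if has_encrypted_txns {
            self.0 += 1;
            Some(self.0)
        } else {
            None
        }
    }
}
```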

)))
.unwrap();

// TODO(ibalajiarun): improve perf
Contributor:

Do you plan to break these task in separate phases later? or how do you plan to improve perf?

Contributor Author:

I don't think we need to split this into phases because we only implement the slow path for now. The digest and eval computation should be done within the 3 hops before ordering finishes and the decryption share exchange phase completes. We may need a separate thread pool to compute eval proofs; that's mostly the perf work here.

Contributor:

Makes sense.

.zip(txn_ciphertexts)
.map(|(mut txn, ciphertext)| {
let eval_proof = proofs.get(&ciphertext.id()).unwrap().into();
if let Ok(payload) = FPTX::decrypt_individual::<DecryptedPayload>(
Contributor:

How is the performance compared with batch decryption?

Contributor Author:

The implementation is essentially the same as batch decryption. I just moved the par_iter outside because it gives me individual Results.
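The per-item shape described here can be sketched as follows: each ciphertext yields its own Result, so one failure does not discard the rest. The real code uses rayon's par_iter; the sketch keeps a plain iter() to stay dependency-free, and the stub decryptor and names are illustrative.

```rust
// Sketch of decrypting per-item and collecting individual Results, as
// described above (the real code uses rayon's par_iter; plain iter()
// keeps this sketch dependency-free). Names are illustrative.

/// Stub decryptor: even ids "decrypt", odd ids fail.
fn decrypt_one(id: u32) -> Result<u32, String> {
    if id % 2 == 0 {
        Ok(id * 100)
    } else {
        Err(format!("failed to decrypt {}", id))
    }
}

/// Each ciphertext yields its own Result, so one failure does not
/// discard the rest, unlike a single all-or-nothing batch Result.
pub fn decrypt_all(ids: &[u32]) -> Vec<Result<u32, String>> {
    ids.iter().map(|&id| decrypt_one(id)).collect()
}
```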

author: Author,
epoch: u64,
validator: Arc<ValidatorVerifier>,
// wconfig: WeightedConfig,
Contributor:

todo here?


let maybe_decryption_key = secret_shared_key_rx.await.unwrap();
// TODO(ibalajiarun): account for the case where decryption key is not available
let decryption_key = maybe_decryption_key.expect("decryption key should be available");

Missing decryption key causes unhandled panic

When decryption is enabled and the secret share manager fails to produce a key, the code will panic. Line 101 calls .unwrap() on the channel receive which panics if the sender is dropped. Line 103 calls .expect() on Option<SecretSharedKey> which panics if the key is None. The TODO comment explicitly acknowledges this case isn't handled. If the secret sharing protocol fails for any reason, the node will crash rather than gracefully handling the failure.

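A graceful alternative to the expect() the report flags would be to match on the missing-key case and surface it as an outcome instead of panicking. The types below are hypothetical stand-ins for the channel payload, not the PR's actual types.

```rust
// Sketch of the graceful alternative flagged above: matching on the
// missing-key case instead of expect(). Types are illustrative
// stand-ins for the PR's channel payload.

pub enum StageOutcome {
    Decrypted(Vec<u8>),
    /// Key unavailable: surface the failure instead of panicking.
    KeyUnavailable,
}

pub fn handle_key(maybe_key: Option<[u8; 32]>, ciphertext: Vec<u8>) -> StageOutcome {
    match maybe_key {
        Some(_key) => StageOutcome::Decrypted(ciphertext), // stubbed decrypt
        None => StageOutcome::KeyUnavailable,
    }
}
```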

Contributor @JoshLind left a comment:

LGTM 😄 Unblocking.

let _ = ready_block_tx.send(ordered_blocks).await;
}
}
});

Coordinator task lacks shutdown mechanism causing panic

High Severity

The make_coordinator function spawns a tokio task with an infinite loop using select! on three channels, but provides no shutdown mechanism. When end_epoch is called, it sends stop signals to the rand_manager and secret_share_manager (closing their channels), but the coordinator has no reset channel. When all three channels close and return None, the select! macro has no matching branches and no else clause, causing the task to panic. The coordinator needs a reset/shutdown channel similar to the managers.

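The clean-shutdown behavior the report asks for can be illustrated with a std-library analogue of the coordinator loop: the loop exits when its input channel disconnects instead of panicking, which is what a select! needs an else branch (or a dedicated reset channel) to achieve. This sketch uses std::sync::mpsc rather than tokio and is not the PR's code.

```rust
// Std-library analogue of the shutdown gap described above: the loop
// exits cleanly when its input channel disconnects, which is the
// behavior a select! needs an else branch (or a reset channel) to get.

use std::sync::mpsc;
use std::thread;

/// Drains block ids until every sender is dropped, then returns what it
/// processed; dropping the sender acts as the shutdown signal.
pub fn run_coordinator(rx: mpsc::Receiver<u64>) -> thread::JoinHandle<Vec<u64>> {
    thread::spawn(move || {
        let mut processed = Vec::new();
        // recv() returns Err once all senders are gone: clean shutdown.
        while let Ok(block_id) = rx.recv() {
            processed.push(block_id);
        }
        processed
    })
}
```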


github-actions bot commented Jan 7, 2026

✅ Forge suite framework_upgrade success on a09bb94430a970de7bc45fe0d29bd33fd2e5a7d4 ==> 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23

Compatibility test results for a09bb94430a970de7bc45fe0d29bd33fd2e5a7d4 ==> 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23 (PR)
Upgrade the nodes to version: 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2304.75 txn/s, submitted: 2312.43 txn/s, failed submission: 7.68 txn/s, expired: 7.68 txn/s, latency: 1265.13 ms, (p50: 1200 ms, p70: 1400, p90: 1500 ms, p99: 1800 ms), latency samples: 210020
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2339.90 txn/s, submitted: 2346.45 txn/s, failed submission: 6.54 txn/s, expired: 6.54 txn/s, latency: 1275.73 ms, (p50: 1200 ms, p70: 1300, p90: 1800 ms, p99: 2600 ms), latency samples: 207462
5. check swarm health
Compatibility test for a09bb94430a970de7bc45fe0d29bd33fd2e5a7d4 ==> 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23 passed
Upgrade the remaining nodes to version: 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 2409.77 txn/s, submitted: 2416.89 txn/s, failed submission: 7.12 txn/s, expired: 7.12 txn/s, latency: 1230.73 ms, (p50: 1200 ms, p70: 1400, p90: 1500 ms, p99: 1800 ms), latency samples: 216682
Test Ok

github-actions bot commented Jan 7, 2026

✅ Forge suite compat success on a09bb94430a970de7bc45fe0d29bd33fd2e5a7d4 ==> 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23

Compatibility test results for a09bb94430a970de7bc45fe0d29bd33fd2e5a7d4 ==> 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23 (PR)
1. Check liveness of validators at old version: a09bb94430a970de7bc45fe0d29bd33fd2e5a7d4
compatibility::simple-validator-upgrade::liveness-check : committed: 12588.94 txn/s, latency: 2765.54 ms, (p50: 3000 ms, p70: 3000, p90: 3300 ms, p99: 3700 ms), latency samples: 415020
2. Upgrading first Validator to new version: 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 5926.94 txn/s, latency: 5724.01 ms, (p50: 6400 ms, p70: 6500, p90: 6600 ms, p99: 6700 ms), latency samples: 200300
3. Upgrading rest of first batch to new version: 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 6012.47 txn/s, latency: 5656.66 ms, (p50: 6300 ms, p70: 6400, p90: 6500 ms, p99: 6600 ms), latency samples: 205420
4. upgrading second batch to new version: 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 10075.73 txn/s, latency: 3246.59 ms, (p50: 3200 ms, p70: 3800, p90: 4400 ms, p99: 5100 ms), latency samples: 333160
5. check swarm health
Compatibility test for a09bb94430a970de7bc45fe0d29bd33fd2e5a7d4 ==> 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23 passed
Test Ok

github-actions bot commented Jan 8, 2026

✅ Forge suite realistic_env_max_load success on 9f2fab9b6eb4a748d49d6fdf271a9291b3f2de23

two traffics test: inner traffic : committed: 13621.77 txn/s, submitted: 13621.82 txn/s, expired: 0.05 txn/s, latency: 2767.81 ms, (p50: 2700 ms, p70: 2900, p90: 3000 ms, p99: 3600 ms), latency samples: 5074320
two traffics test : committed: 100.00 txn/s, latency: 704.66 ms, (p50: 700 ms, p70: 800, p90: 800 ms, p99: 1000 ms), latency samples: 1840
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 2.277, avg: 2.168", "ConsensusProposalToOrdered: max: 0.167, avg: 0.165", "ConsensusOrderedToCommit: max: 0.047, avg: 0.043", "ConsensusProposalToCommit: max: 0.213, avg: 0.208"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.63s no progress at version 5591269 (avg 0.07s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.49s no progress at version 5828936 (avg 0.37s) [limit 16].
Test Ok

@ibalajiarun ibalajiarun merged commit 92f0a31 into main Jan 8, 2026
77 of 82 checks passed
@ibalajiarun ibalajiarun deleted the balaji/enc-pool-8 branch January 8, 2026 00:49