YARN-11924. Add zkManager.exists(path) check to ZKConfigurationStore:…#8222
YARN-11924. Add zkManager.exists(path) check to ZKConfigurationStore:…#8222ferdelyi wants to merge 1 commit intoapache:trunkfrom
Conversation
|
💔 -1 overall
This message was automatically generated. |
Hean-Chhinling
left a comment
There was a problem hiding this comment.
Big thanks to you, @ferdelyi for working on this.
This is a huge help to the intermittent RM start up failures.
The code overall is really good.
I just have some improvement ideas and some questions
.../apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java
Outdated
Show resolved
Hide resolved
| // when the /confstore/CONF_STORE path does not exist, hence the | ||
| // getZkData method returns a null value causing the RM to fail. | ||
| // To prevent this, added a re-try mechanism before giving up. | ||
| int maxRetries = 6; |
There was a problem hiding this comment.
What do you think of making the Max retries configurable like the sleepBetweenRetries?
There was a problem hiding this comment.
I've discussed with @K0K0V0K, and he suggested that we are OK to only have the retry seconds configurable. But I am OK either way. It feels a bit overkill to me. We already have so many configs.
There was a problem hiding this comment.
I see, that is true. It is a little bit overkills to have more configurations. And maybe 5 retries is more than enough to try to get data from ZNode will another RM is formatted or the data has not fully written yet.
| protected byte[] getZkData(String path) throws Exception { | ||
| // Should a 'yarn resourcemanager -format-state-store' command is issued | ||
| // while one of the RM is in a starting state, there is a time period | ||
| // when the /confstore/CONF_STORE path does not exist, hence the |
There was a problem hiding this comment.
The issue also occur when the /confstore/CONF_STORE path exists but the data is not fully written this also cause another RM to fail without the root.default queue. Thus retry and make sure the data is not empty.
I think we could add the followings:
"when the /confstore/CONF_STORE path does exists or when the data is not fully written"
There was a problem hiding this comment.
When we format the confstore, because of YARN-11551, my current understanding is that we will end up with an empty /confstore, hence the /confstore/CONF_STORE does not exist. Can you please verify if my understanding is correct?
There was a problem hiding this comment.
I think there are different states of failures:
- When another RM format the conf-store and then Node is not exists, it will throw the following errors which you already addressed.
2026-01-23 00:10:45,594 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore: Failed to retrieve configuration from zookeeper store
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /confstore/CONF_STORE
- When another RM access the conf-store while another RM has not finished written data to the conf-store. It will throw the following error. The conf-store node exists but the data has fully written yet otherwise it will throw the above NoNode for CONF_STORE.
2026-01-26 06:08:40,268 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore: Exception while deserializing scheduler configuration from store
java.lang.NullPointerException
.../apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java
Outdated
Show resolved
Hide resolved
...che/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java
Outdated
Show resolved
Hide resolved
...che/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java
Show resolved
Hide resolved
|
|
||
| /** Retry period for ZKConfigurationStore will create znodes. */ | ||
| @Private | ||
| @Unstable |
There was a problem hiding this comment.
I am just curious here.
What does the "unstable" annotation means here and why do we it?
There was a problem hiding this comment.
When I clicked through it, it led to an explanation, but it is a bit unclear to me if we are still using it. @K0K0V0K do you know more on this topic?
There was a problem hiding this comment.
Last time when I clicked it, it did not go through. Now it did.
Seems like these annotation is used to inform users how much they should trust on certain configuration for the new releases. And Unstable is used for private configuration e.g. for YARN RM only (from my understanding)
...che/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java
Outdated
Show resolved
Hide resolved
|
💔 -1 overall
This message was automatically generated. |
…getZkData() and retry mechanism Should a "yarn resourcemanager -format-state-store" command be issued while one of the RM is starting and in the INIT state (because of YARN-11551), there is a time period when the /confstore/CONF_STORE path does not exist, hence the getZkData method returns a null value, causing the RM to fail. To prevent this, add a check and re-try mechanism before giving up.
|
💔 -1 overall
This message was automatically generated. |
…getZkData() and retry mechanism
Should a "yarn resourcemanager -format-state-store" command be issued while one of the RM is starting and in the INIT state (because of YARN-11551), there is a time period when the /confstore/CONF_STORE path does not exist, hence the getZkData method returns a null value, causing the RM to fail. To prevent this, add a check and re-try mechanism before giving up.
Description of PR
Rare race condition is addressed when "yarn resourcemanager -format-state-store" issued when an RM is in the INITING state (already initialized the confstore) right before reading it. This change avoids a null pointer exception.
How was this patch tested?
Manually with locks introduced in the RM at the confstore format step with sleep, so while one of the RM is formatting the statestore, the other RM will be at the getZkData method trying to read the confstore in the INIT state.
Also with added unit tests.
For code changes:
LICENSE,LICENSE-binary,NOTICE-binaryfiles?AI Tooling
If an AI tool was used:
where is the name of the AI tool used.
https://www.apache.org/legal/generative-tooling.html