start faq entry on cuda/native code error possible causes [skip ci] #992
Conversation
Signed-off-by: Erik Ordentlich <[email protected]>
docs/site/FAQ.md (Outdated)
### What are some possible causes of low-level CUDA and/or native code errors?

- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
- NCCL communication library does not allow communication between processes on the same GPU. Check your Spark GPU configs to ensure 1 task per GPU during fit() calls.
Looks like a check in the code here would be better.
Good suggestion, and we can address it in the future. There could be some perf penalty for null/NaN checking that we'd have to test.
I think Bobby's comment was about checking the tasks-per-GPU conf?
This line is a little confusing to me because we recommend fractional tasks per GPU in the performance docs, under the assumption that stage-level scheduling will handle the adjustment. Is this line intended for cases where stage-level scheduling isn't supported?
Nvm, didn't see the rest of the line.
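For reference, here is a minimal caller-side sketch of the null/NaN pre-filtering discussed in this thread. The input path and feature column names are hypothetical, and this is not part of the PR:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input and feature columns; adjust to your dataset.
df = spark.read.parquet("/path/to/training_data")
feature_cols = ["f0", "f1", "f2"]  # assumed to be float/double columns

# Drop rows with nulls in any feature column, then rows with NaNs,
# since such values are currently passed straight through to cuML.
clean_df = df.dropna(subset=feature_cols)
for c in feature_cols:
    clean_df = clean_df.filter(~F.isnan(F.col(c)))

# clean_df can now be passed to the estimator's fit() call.
```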
Signed-off-by: Erik Ordentlich <[email protected]>
Greptile Overview
Greptile Summary
Adds a new FAQ entry documenting common causes of low-level CUDA and native code errors to help users troubleshoot these issues.
- Added question "What are some possible causes of low-level CUDA and/or native code errors?"
- Lists three main causes: NaNs/nulls in input data, NCCL library limitations with multiple processes per GPU, and unknown bugs
- Includes reference to stage-level scheduling documentation
- Minor formatting issue: extra spaces after periods in bullet points
Confidence Score: 4/5
- Safe to merge with minor spacing issues
- Documentation-only change with technically accurate content verified against recent codebase changes. Three minor spacing inconsistencies detected (extra spaces after periods in bullet points) that should be fixed for consistency with standard markdown formatting.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| docs/site/FAQ.md | 4/5 | Added FAQ entry for CUDA/native code errors with three common causes: NaNs/nulls, NCCL multi-process limitations, and unknown bugs. Minor spacing issue found. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant Documentation
    participant FAQ
    User->>FAQ: Encounters CUDA/native code error
    FAQ->>User: Check for NaNs/nulls in input data
    FAQ->>User: Verify NCCL multi-process config
    Note over FAQ,User: Stage-level scheduling can help<br/>but not supported everywhere
    FAQ->>User: Check Spark GPU configs<br/>(1 task per GPU during fit)
    FAQ->>User: If issue persists, file bug report
```
1 file reviewed, 3 comments
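As a rough illustration of the "1 task per GPU during fit()" configuration referenced above, a hedged sketch of the relevant Spark session configs follows; the specific values are illustrative assumptions, not project defaults:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # One GPU per executor and a full GPU per task => at most one task per GPU.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    # Matching task CPUs to executor cores also caps task concurrency at 1.
    .config("spark.executor.cores", "1")
    .config("spark.task.cpus", "1")
    .getOrCreate()
)
```

Depending on the cluster manager, additional settings such as a GPU discovery script (spark.executor.resource.gpu.discoveryScript) may also be needed.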
### What are some possible causes of low-level CUDA and/or native code errors?

- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
syntax: Extra space before "These"
Suggested change:
- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
### What are some possible causes of low-level CUDA and/or native code errors?

- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
- NCCL communication library does not allow communication between processes on the same GPU. [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.
syntax: Extra space before "[Stage level scheduling]"
Suggested change:
- NCCL communication library does not allow communication between processes on the same GPU. [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.
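For context, the Spark mechanism behind the linked stage-level scheduling docs looks roughly like the generic PySpark sketch below. This is not spark-rapids-ml's internal code, and, as the bullet above notes, whether it takes effect depends on the Spark version and cluster manager:

```python
from pyspark.sql import SparkSession
from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

spark = SparkSession.builder.getOrCreate()

# Request one CPU and a full GPU per task for a specific stage, so that
# tasks in that stage do not end up sharing a GPU.
task_reqs = TaskResourceRequests().cpus(1).resource("gpu", 1.0)
profile = ResourceProfileBuilder().require(task_reqs).build

# Attach the profile to the RDD feeding the GPU stage.
rdd = spark.sparkContext.parallelize(range(8)).withResources(profile)
_ = rdd.map(lambda x: x).collect()
```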
- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
- NCCL communication library does not allow communication between processes on the same GPU. [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.
- Previously unknown bugs. Please file an issue.
syntax: Extra space before "Please"
Suggested change:
- Previously unknown bugs. Please file an issue.
wbo4958 left a comment:
LGTM.