@eordentlich
Collaborator

No description provided.

@eordentlich eordentlich changed the title from "start faq entry on cuda/native code error possible causes" to "start faq entry on cuda/native code error possible causes [skip ci]" Nov 1, 2025
docs/site/FAQ.md Outdated
### What are some possible causes of low-level CUDA and/or native code errors?

- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
- NCCL communication library does not allow communication between processes on the same GPU. Check your Spark GPU configs to ensure 1 task per GPU during fit() calls.
Collaborator

It looks like a check in the code here would be better.

Collaborator Author

Good suggestion; we can address it in the future. Null/NaN checking could carry some performance penalty that we'd have to test.
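For illustration only, the kind of pre-fit check discussed in this thread could look like the sketch below. It is plain Python over in-memory rows, not the project's code: the helper name `find_invalid_rows` and the sample data are hypothetical, and a real check would have to run on the Spark DataFrame, where its cost would need to be measured.

```python
import math
from typing import Iterable, List, Optional, Sequence


def find_invalid_rows(rows: Iterable[Sequence[Optional[float]]]) -> List[int]:
    """Return the indices of rows that contain a null (None) or NaN value.

    Hypothetical helper for illustration; not part of spark-rapids-ml.
    """
    bad = []
    for i, row in enumerate(rows):
        # `v is None` is tested first so math.isnan never receives None.
        if any(v is None or math.isnan(v) for v in row):
            bad.append(i)
    return bad


# Made-up sample data: row 1 holds a NaN, row 2 holds a null.
sample = [
    [1.0, 2.0],
    [float("nan"), 3.0],
    [4.0, None],
    [5.0, 6.0],
]
invalid = find_invalid_rows(sample)
print(invalid)
```

The check touches every value once, which is the source of the possible performance penalty mentioned above.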

Collaborator

@rishic3 rishic3 Nov 4, 2025

I think Bobby's comment was about checking task per GPU conf?

This line is a little confusing to me because we recommend fractional tasks per GPU in the performance docs, on the assumption that stage-level scheduling will handle the adjustment. Is this line intended for cases where stage-level scheduling isn't supported?
Nvm didn't see the rest of the line.
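As a rough aid for the "1 task per GPU" check raised here, the sketch below estimates how many tasks could share one GPU from a dictionary of Spark settings. It is a simplified model and an assumption, not Spark's scheduler logic: the helper name `max_tasks_per_gpu` is hypothetical, missing CPU settings default to 1, and stage-level scheduling is ignored. The configuration keys themselves are standard Spark properties.

```python
import math
from typing import Dict


def max_tasks_per_gpu(conf: Dict[str, str]) -> int:
    """Estimate the worst-case number of concurrent tasks on a single GPU.

    Hypothetical helper; a simplified model that ignores stage-level scheduling.
    """
    cores = int(conf.get("spark.executor.cores", "1"))  # assumed default of 1
    task_cpus = int(conf.get("spark.task.cpus", "1"))
    executor_gpus = float(conf.get("spark.executor.resource.gpu.amount", "0"))
    task_gpus = float(conf.get("spark.task.resource.gpu.amount", "0"))
    if executor_gpus < 1 or task_gpus <= 0:
        raise ValueError("GPU resources are not configured for executors and tasks")

    # CPU side: how many tasks can run at once on one executor.
    cpu_slots = cores // task_cpus
    # GPU side: a fractional task amount lets several tasks share one GPU.
    gpu_slots = 1 if task_gpus >= 1 else math.floor(1 / task_gpus)
    # In the worst case all CPU-limited tasks land on the same GPU.
    return min(cpu_slots, gpu_slots)


# Made-up example settings.
shared = {
    "spark.executor.cores": "8",
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "0.125",
}
exclusive = dict(shared, **{"spark.task.resource.gpu.amount": "1"})

print(max_tasks_per_gpu(shared))     # several tasks may share the GPU
print(max_tasks_per_gpu(exclusive))  # at most one task per GPU
```

A result greater than 1 means multiple processes could end up on the same GPU during fit() unless stage-level scheduling adjusts the task resources.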

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Adds a new FAQ entry documenting common causes of low-level CUDA and native code errors to help users troubleshoot these issues.

  • Added question "What are some possible causes of low-level CUDA and/or native code errors?"
  • Lists three main causes: NaNs/nulls in input data, NCCL library limitations with multiple processes per GPU, and unknown bugs
  • Includes reference to stage-level scheduling documentation
  • Minor formatting issue: extra spaces after periods in bullet points

Confidence Score: 4/5

  • Safe to merge with minor spacing issues
  • Documentation-only change with technically accurate content verified against recent codebase changes. Three minor spacing inconsistencies detected (extra spaces after periods in bullet points) that should be fixed for consistency with standard markdown formatting.
  • No files require special attention

Important Files Changed

File Analysis

Filename Score Overview
docs/site/FAQ.md 4/5 Added FAQ entry for CUDA/native code errors with three common causes: NaNs/nulls, NCCL multi-process limitations, and unknown bugs. Minor spacing issue found.

Sequence Diagram

sequenceDiagram
    participant User
    participant Documentation
    participant FAQ
    
    User->>FAQ: Encounters CUDA/native code error
    FAQ->>User: Check for NaNs/nulls in input data
    FAQ->>User: Verify NCCL multi-process config
    Note over FAQ,User: Stage-level scheduling can help<br/>but not supported everywhere
    FAQ->>User: Check Spark GPU configs<br/>(1 task per GPU during fit)
    FAQ->>User: If issue persists, file bug report

1 file reviewed, 3 comments


### What are some possible causes of low-level CUDA and/or native code errors?

- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
Contributor

syntax: Extra space before "These"

Suggested change
- NaNs or nulls in the input data.  These are currently passed directly into the cuML layer and may trigger such errors.
- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.

### What are some possible causes of low-level CUDA and/or native code errors?

- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
- NCCL communication library does not allow communication between processes on the same GPU. [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.
Contributor

syntax: Extra space before "[Stage level scheduling]"

Suggested change
- NCCL communication library does not allow communication between processes on the same GPU.  [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.
- NCCL communication library does not allow communication between processes on the same GPU. [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.


- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
- NCCL communication library does not allow communication between processes on the same GPU. [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.
- Previously unknown bugs. Please file an issue.
Contributor

syntax: Extra space before "Please"

Suggested change
- Previously unknown bugs.  Please file an issue.
- Previously unknown bugs. Please file an issue.

Collaborator

@wbo4958 wbo4958 left a comment

LGTM.

@eordentlich eordentlich merged commit 4965ecf into NVIDIA:main Nov 5, 2025
3 checks passed
@eordentlich eordentlich deleted the eo_faq_update branch November 5, 2025 04:48
3 participants