start faq entry on cuda/native code error possible causes [skip ci] #992

eordentlich · 2025-11-01T20:38:16Z

No description provided.

Signed-off-by: Erik Ordentlich <[email protected]>

wbo4958 · 2025-11-04T00:31:56Z

docs/site/FAQ.md

+### What are some possible causes of low-level CUDA and/or native code errors?
+
+  - NaNs or nulls in the input data.   These are currently passed directly into the cuML layer and may trigger such errors.
+  - NCCL communication library does not allow communication between processes on the same GPU.  Check your Spark GPU configs to ensure 1 task per GPU during fit() calls.


It looks like a check in the code would be better here.

Good suggestion; we can address it in the future. There could be some performance penalty for null/NaN checking that we'd have to test.

I think Bobby's comment was about checking the task-per-GPU conf?

This line is a little confusing to me because we recommend fractional tasks/GPU in the performance docs, under the assumption that stage-level scheduling will handle the adjustment. Is this line intended for cases where stage-level scheduling isn't supported?
Nvm, didn't see the rest of the line.
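
For illustration only, the task-per-GPU check raised in this thread might be sketched as below. This is a minimal sketch under stated assumptions, not code from this PR: `max_tasks_per_gpu`, `check_one_task_per_gpu`, and the plain-dict stand-in for the Spark conf are hypothetical. It looks only at `spark.task.resource.gpu.amount` (a real Spark setting) and ignores stage-level scheduling and CPU core limits, so the value it returns is an upper bound.

```python
import math
from typing import Mapping

# Real Spark config key; everything else in this sketch is hypothetical.
TASK_GPU_KEY = "spark.task.resource.gpu.amount"


def max_tasks_per_gpu(conf: Mapping[str, str]) -> int:
    """Upper bound on concurrent tasks per GPU implied by the task GPU amount."""
    amount = float(conf.get(TASK_GPU_KEY, "0"))
    if amount <= 0:
        raise ValueError(f"{TASK_GPU_KEY} is unset or not positive")
    if amount >= 1:
        # A whole GPU (or more) per task: tasks never share a GPU.
        return 1
    # Fractional amounts let several tasks share one GPU, e.g. 0.25 -> 4 tasks.
    return math.floor(1 / amount)


def check_one_task_per_gpu(conf: Mapping[str, str]) -> None:
    """Raise if the conf would let more than one task share a GPU."""
    n = max_tasks_per_gpu(conf)
    if n > 1:
        raise ValueError(
            f"{TASK_GPU_KEY}={conf[TASK_GPU_KEY]} allows up to {n} tasks per GPU; "
            "NCCL needs 1 task per GPU during fit()"
        )
```

A caller would pass in the effective Spark settings as a dict before `fit()`; whether such a check belongs in the library or in user code is the open question in this thread.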

greptile-apps

Greptile Overview

Greptile Summary

Adds a new FAQ entry documenting common causes of low-level CUDA and native code errors to help users troubleshoot these issues.

- Added question "What are some possible causes of low-level CUDA and/or native code errors?"
- Lists three main causes: NaNs/nulls in input data, NCCL library limitations with multiple processes per GPU, and unknown bugs
- Includes reference to stage-level scheduling documentation
- Minor formatting issue: extra spaces after periods in bullet points

Confidence Score: 4/5

- Safe to merge with minor spacing issues
- Documentation-only change with technically accurate content verified against recent codebase changes. Three minor spacing inconsistencies detected (extra spaces after periods in bullet points) that should be fixed for consistency with standard markdown formatting.
- No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| docs/site/FAQ.md | 4/5 | Added FAQ entry for CUDA/native code errors with three common causes: NaNs/nulls, NCCL multi-process limitations, and unknown bugs. Minor spacing issue found. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Documentation
    participant FAQ
    
    User->>FAQ: Encounters CUDA/native code error
    FAQ->>User: Check for NaNs/nulls in input data
    FAQ->>User: Verify NCCL multi-process config
    Note over FAQ,User: Stage-level scheduling can help<br/>but not supported everywhere
    FAQ->>User: Check Spark GPU configs<br/>(1 task per GPU during fit)
    FAQ->>User: If issue persists, file bug report

1 file reviewed, 3 comments

greptile-apps · 2025-11-04T03:10:21Z

docs/site/FAQ.md

+
+### What are some possible causes of low-level CUDA and/or native code errors?
+
+  - NaNs or nulls in the input data.   These are currently passed directly into the cuML layer and may trigger such errors.


syntax: Extra space before "These"

Suggested change

- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.

greptile-apps · 2025-11-04T03:10:22Z

docs/site/FAQ.md

+### What are some possible causes of low-level CUDA and/or native code errors?
+
+  - NaNs or nulls in the input data.   These are currently passed directly into the cuML layer and may trigger such errors.
+  - NCCL communication library does not allow communication between processes on the same GPU.  [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases.  Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.


syntax: Extra space before "[Stage level scheduling]"

Suggested change

- NCCL communication library does not allow communication between processes on the same GPU. [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases. Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.

greptile-apps · 2025-11-04T03:10:23Z

docs/site/FAQ.md

+
+  - NaNs or nulls in the input data.   These are currently passed directly into the cuML layer and may trigger such errors.
+  - NCCL communication library does not allow communication between processes on the same GPU.  [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this but it is not supported in all cases.  Check requirements and adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls if needed.
+  - Previously unknown bugs.  Please file an issue.


syntax: Extra space before "Please"

Suggested change

- Previously unknown bugs. Please file an issue.

wbo4958

LGTM.

start faq entry on cuda/native code error possible causes

6793d1e

Signed-off-by: Erik Ordentlich <[email protected]>

eordentlich changed the title ~~start faq entry on cuda/native code error possible causes~~ start faq entry on cuda/native code error possible causes [skip ci] Nov 1, 2025

wbo4958 reviewed Nov 4, 2025

add pointer to stage level scheduling

e2b45a8

Signed-off-by: Erik Ordentlich <[email protected]>

greptile-apps bot reviewed Nov 4, 2025

wbo4958 approved these changes Nov 5, 2025

eordentlich merged commit 4965ecf into NVIDIA:main Nov 5, 2025
3 checks passed

eordentlich deleted the eo_faq_update branch November 5, 2025 04:48

### What are some possible causes of low-level CUDA and/or native code errors?

- NaNs or nulls in the input data. These are currently passed directly into the cuML layer and may trigger such errors.
- The NCCL communication library does not allow communication between processes on the same GPU. [Stage level scheduling](https://nvidia.github.io/spark-rapids-ml/performance.html#stage-level-scheduling) can avoid this, but it is not supported in all cases. Check the requirements and, if needed, adjust your Spark GPU configs to ensure 1 task per GPU during fit() calls.
- Previously unknown bugs. Please file an issue.
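
As a rough illustration of the first cause, rows can be screened for nulls and NaNs before they reach the GPU layer. The sketch below uses plain Python lists in place of a Spark DataFrame; `has_missing` and `drop_missing` are hypothetical helper names and are not part of Spark Rapids ML.

```python
import math
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def has_missing(row: Row) -> bool:
    """True if any value in the row is None (null) or a floating-point NaN."""
    return any(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in row
    )


def drop_missing(rows: Sequence[Row]) -> List[Row]:
    """Keep only the rows that contain no nulls and no NaNs."""
    return [row for row in rows if not has_missing(row)]


if __name__ == "__main__":
    data = [[1.0, 2.0], [None, 1.0], [float("nan"), 3.0]]
    # Only the first row has neither a null nor a NaN.
    print(drop_missing(data))
```

In PySpark, `DataFrame.dropna()` and `pyspark.sql.functions.isnan` are the usual building blocks for an equivalent filter on scalar columns. The extra pass over the data has a cost, which is the performance concern raised in the review thread.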

Conversation

- eordentlich commented Nov 1, 2025
- wbo4958 Nov 4, 2025
- eordentlich Nov 4, 2025
- rishic3 Nov 4, 2025 (edited)
- greptile-apps bot left a comment (three inline comments, Nov 4, 2025)
- wbo4958 left a comment

3 participants