
[ENG-38911] Test hudi rs with hudi functional tests #18455

Open
Davis-Zhang-Onehouse wants to merge 1 commit into apache:master from Davis-Zhang-Onehouse:ENG-38911

Conversation

@Davis-Zhang-Onehouse
Contributor

@Davis-Zhang-Onehouse Davis-Zhang-Onehouse commented Apr 3, 2026

── Step 1: Build once (all dependent modules + test-compile for hudi-spark) ──

BUNDLE_JAR=/home/ubuntu/ws1/gluten-internal/package/target/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.5.1-SNAPSHOT.jar

mvn test-compile -am -pl hudi-spark-datasource/hudi-spark \
  -Dgluten.bundle.jar=$BUNDLE_JAR \
  -Dcheckstyle.skip=true -Dscalastyle.skip=true \
  -Dspotless.check.skip=true -Denforcer.skip=true \
  -T 1C -B -ntp

── Step 2: Run individual tests — no recompilation ──

export RUST_LOG=hudi_core=debug

# Common args for surefire:test (no -am needed — deps already compiled)
SUREFIRE_COMMON=(-pl hudi-spark-datasource/hudi-spark
  -Dgluten.bundle.jar=$BUNDLE_JAR
  -Dsurefire-log4j.file=log4j2-surefire-quiet.properties
  -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  -DskipUTs=true -Dcheckstyle.skip=true -Dscalastyle.skip=true
  -DredirectTestOutputToFile=false -Dspotless.check.skip=true)

mvn surefire:test "${SUREFIRE_COMMON[@]}" -Dtest=TestMORDataSource | grep "Using Gluten/Velox native"
mvn surefire:test "${SUREFIRE_COMMON[@]}" -Dtest=TestHoodieFileGroupReaderOnSpark | grep "Using Gluten/Velox native"
mvn surefire:test "${SUREFIRE_COMMON[@]}" -Dtest=TestHoodieSparkMergeOnReadTableRollback | grep "Using Gluten/Velox native"
mvn surefire:test "${SUREFIRE_COMMON[@]}" -Dtest=TestMergeHandle | grep "Using Gluten/Velox native"
mvn surefire:test "${SUREFIRE_COMMON[@]}" -Dtest=TestPositionBasedFileGroupRecordBuffer | grep "Using Gluten/Velox native"


# ScalaTest suites — use scalatest:test (not surefire:test)
SUREFIRE_SCALA=(-pl hudi-spark-datasource/hudi-spark
  -Dgluten.bundle.jar=$BUNDLE_JAR
  -Dsurefire-log4j.file=log4j2-surefire-quiet.properties
  -DfailIfNoTests=false
  -Dcheckstyle.skip=true -Dscalastyle.skip=true
  -DredirectTestOutputToFile=false -Dspotless.check.skip=true)

mvn scalatest:test "${SUREFIRE_SCALA[@]}" -DwildcardSuites=org.apache.spark.sql.hudi.dml.others.TestMergeModeCommitTimeOrdering

mvn scalatest:test "${SUREFIRE_SCALA[@]}" -DwildcardSuites=org.apache.spark.sql.hudi.dml.others.TestMergeModeEventTimeOrdering

mvn scalatest:test "${SUREFIRE_SCALA[@]}" -DwildcardSuites=org.apache.spark.sql.hudi.dml.others.TestPartialUpdateForMergeInto

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Apr 3, 2026
@Davis-Zhang-Onehouse Davis-Zhang-Onehouse changed the title from "Eng 38911" to "[ENG-38911] Test hudi rs with hudi functional tests" Apr 3, 2026
@hudi-bot
Collaborator

CI report:

Bot commands supported by @hudi-bot:
  • @hudi-bot run azure: re-run the last Azure build


@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — Several hardcoded filesystem paths and debug artifacts should be removed; test utilities exist for managing temp directories safely.


val BASE_OUTPUT_PATH = "/home/ubuntu/ws3/hudi-rs/tables/fdss"

private def cleanPath(path: String): Unit = {

🤖 nit: replace hardcoded /home/ubuntu/... path with a test-scoped temp directory (e.g., spark.sessionState.conf.warehouseDir or JUnit's @TempDir). This path is specific to your machine and will fail on CI/other developers' machines.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
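As a sketch of the reviewer's suggestion, a test can derive its base path from a per-run temporary directory instead of a fixed absolute path (plain `Files.createTempDirectory` shown here; JUnit 5's `@TempDir` or Hudi's `withTempDir` give the same isolation with automatic cleanup). The helper and prefix names are illustrative, not Hudi APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempTablePath {
    // Hypothetical helper: create a per-run base path instead of a
    // hardcoded /home/ubuntu/... directory.
    static Path newTableBasePath(String prefix) throws IOException {
        Path base = Files.createTempDirectory(prefix);
        base.toFile().deleteOnExit(); // best-effort cleanup on JVM exit
        return base;
    }

    public static void main(String[] args) throws IOException {
        Path tablePath = newTableBasePath("hudi-test-");
        System.out.println(Files.isDirectory(tablePath)); // prints "true"
    }
}
```

Because the path is created fresh each run, tests stay machine-independent and safe to clean up recursively.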

withRecordType()(withTempDir { tmp =>
withRecordType()({
val tableName = generateTableName
val currTime = System.currentTimeMillis()

🤖 nit: hardcoded /home/ubuntu/... path should be replaced with a test temp directory; this will fail on CI and other developers' machines.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

val tableName = generateTableName
val currTime = System.currentTimeMillis()
val tablePath = s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime"
println(s"fdss: ${tablePath}")

🤖 nit: remove the debug string "fdss: " from this println, or replace with a meaningful test identifier if logging is needed.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

@@ -28,13 +28,15 @@ import org.apache.hudi.common.testutils.HoodieTestUtils
import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase

🤖 nit: consider adding a comment to the var cnt: Int explaining its purpose (e.g., '// Counter to generate unique table paths for parallel test execution').

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.


@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the detailed RCA document — the Velox alignment analysis is thorough. However, this PR has several issues that make it unsuitable for merging into the Apache Hudi mainline as-is: it introduces internal/proprietary dependencies, leaves debug artifacts in production code, disables significant test coverage, and hardcodes local filesystem paths. These need to be addressed before this can land.

<spark33.version>3.3.4</spark33.version>
<spark34.version>3.4.3</spark34.version>
<spark35.version>3.5.5</spark35.version>
<spark35.version>3.5.2-INTERNAL-0.16</spark35.version>

🤖 This changes the Spark version from a public Apache release (3.5.5) to what appears to be a private internal build (3.5.2-INTERNAL-0.16). This would break builds for anyone without access to the internal repository and is not appropriate for the Apache Hudi mainline. Could you revert this and keep the Gluten integration working with a standard Spark version?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

</repository>
<repository>
<id>codeartifact</id>
<name>codeartifact</name>

🤖 Adding a private CodeArtifact repository (onehouse-194159489498.d.codeartifact.us-west-2.amazonaws.com) to the root POM would force all Hudi builders to have access to this internal repo. This should not be in the mainline POM — it belongs in a local settings.xml or a separate profile.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

// Additionally, we have to explicitly wrap around resulting [[RDD]] into the one
// injecting [[SQLConf]], which by default isn't propagated by Spark to the executor(s).
// [[SQLConf]] is required by [[AvroSerializer]]
logWarning(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}")

🤖 This logWarning dumps the full executed plan on every createRdd call in production code. This is a debug statement that should be removed before merging — it will produce excessive log output in production workloads and may leak schema details into logs.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.


private def cleanPath(path: String): Unit = {
val hadoopPath = new Path(path)
val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf())

🤖 This hardcodes an absolute path (/home/ubuntu/ws3/hudi-rs/tables/fdss) that only exists on a specific developer machine. Tests will fail or write to unexpected locations on any other system. Could you use withTempDir or getBasePath instead?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

withRecordType()({
val tableName = generateTableName
val currTime = System.currentTimeMillis()
val tablePath = s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime"

🤖 Same issue — hardcoded path /home/ubuntu/ws3/hudi-rs/tables/table... that only works on a specific machine. The original code used withTempDir { tmp => ... } which was correct; this change also removes the temp directory cleanup. The mutable cnt counter in the companion object is also not thread-safe if tests run in parallel.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

@@ -583,6 +576,9 @@ class TestMORDataSource extends HoodieSparkClientTestBase with SparkDatasetMixin
// First Operation:
// Producing parquet files to three default partitions.
// SNAPSHOT view on MOR table with parquet files only.

🤖 These commented-out lines (val BASE_OUTPUT_PATH, val tablePath) look like leftover debug code. Could you remove them?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
import org.apache.spark.sql.execution.datasources.SparkColumnarFileReader
import org.apache.spark.sql.hudi.MultipleColumnarFileFormatReader

🤖 The @Disabled annotations on tests like testCustomDelete use reasons like "Custom delete payload not supported" — but these features ARE supported by Hudi. They're only unsupported by Gluten/Velox. If this PR is meant to run alongside normal Hudi CI, these disabled tests would silently skip important coverage. Consider using @EnabledIf with a condition that checks for Gluten presence.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
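The `@EnabledIf` approach the comment suggests works by pointing the JUnit 5 annotation at a boolean condition method. The sketch below shows the underlying classpath probe as plain Java; the Gluten plugin class name used here is an assumption and may differ between Gluten versions.

```java
public class GlutenPresence {
    // Assumed plugin class name — verify against the Gluten version in use.
    static final String GLUTEN_PLUGIN_CLASS = "org.apache.gluten.GlutenPlugin";

    // Condition method a JUnit 5 test could reference via
    // @EnabledIf("...GlutenPresence#glutenOnClasspath"), so tests run
    // only in Gluten-enabled builds instead of being @Disabled outright.
    public static boolean glutenOnClasspath() {
        try {
            Class.forName(GLUTEN_PLUGIN_CLASS);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // On a JVM without the Gluten bundle jar this prints "false".
        System.out.println(glutenOnClasspath());
    }
}
```

With this gate, the custom-delete and custom-payload tests would still run in normal Hudi CI and be skipped only when Gluten is actually on the classpath.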

// "cow,current,false,false", "cow,current,false,true", "cow,current,true,false",
"mor,current,false,false", "mor,current,false,true", "mor,current,true,false").foreach { args =>
val argList = args.split(',')
val tableType = argList(0)

🤖 COW table type tests have been commented out (cow,current,...), which removes test coverage for Copy-on-Write tables with commit time ordering. This is a core Hudi feature. Was this intentional, or leftover from debugging?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

val mergeConfigClause = if (setRecordMergeConfigs) {
// with precombine field set, UPSERT operation is used automatically
if (tableVersion.toInt == 6) {
if (tableVersion.toInt == 6) {

🤖 This looks like an accidental indentation change (the if was de-indented by one level). Might cause confusion on future diffs.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

@@ -0,0 +1,373 @@
# RCA: Gluten/Velox MOR Test Failures

🤖 This is an excellent root-cause analysis document, but should it live in the repo root? It seems more suited for a wiki page, Jira ticket, or the PR description itself. If it does belong in the repo, a subdirectory like docs/ or dev/ might be more appropriate.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.


@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

CodeRabbit Walkthrough: This PR introduces Gluten/Velox native execution support to Hudi test infrastructure, adds comprehensive merge-on-read (MOR) test suites with complex data types, documents three failing MOR test cases, disables unsupported test scenarios (custom merge modes, custom payloads), and updates Maven configuration to pull specialized Spark dependencies.

Greptile Summary: This PR introduces Gluten/Velox native-execution support into Hudi's functional test suite (MOR read path), driven by an investigation into three failing sub-tests (testMORWithMap, testMORWithDecimal, testMORWithArray). It adds GlutenTestUtils for opt-in Gluten SparkConf injection, a gluten-velox Maven profile, new TestMORFileSliceLayouts and testReadMORTableWithAllDataTypes test coverage, and a number of @Disabled annotations to skip currently-unsupported test variants.

However, the PR contains several OSS-blocking artifacts from the investigation process:

  • Production debug logging: A logWarning that prints the full executed query plan at WARN level was left in HoodieSparkUtils.scala and will appear in every Hudi job's logs.
  • Private build infrastructure: pom.xml replaces the public Apache Spark 3.5.5 with 3.5.2-INTERNAL-0.16 (an Onehouse-internal build) and adds a private AWS CodeArtifact Maven repository — breaking the build for all external contributors and CI environments.
  • Hardcoded developer paths: TestMORFileSliceLayouts.scala and TestMergeModeCommitTimeOrdering.scala both write tables to hardcoded absolute paths (/home/ubuntu/ws3/hudi-rs/tables/...) instead of withTempDir, making these tests fail on any machine other than the original author's. TestMergeModeCommitTimeOrdering also introduces a global mutable cnt counter in a companion object to differentiate paths.
  • Internal investigation document: RCA_Gluten_Velox_MOR_Failures.md is committed to the repository root, an internal RCA artifact that does not belong in the OSS codebase.
  • Commented-out test parameters: TestMORDataSource.scala has large blocks of @CsvSource entries commented out (rather than cleanly removed), alongside a leftover debug path comment.
  • Significant test coverage reduction: Across multiple files, test parameters are narrowed to version 9 only and many CUSTOM/pre-v9 test cases are disabled without corresponding tracking tickets.

Greptile Confidence Score: 1/5
Not safe to merge — the PR contains private build infrastructure that breaks the OSS build, a production debug log statement, and hardcoded developer paths that will fail tests on any other machine.

Score reflects three blocking P1 issues: (1) the internal Onehouse Spark version and private CodeArtifact Maven repository in root pom.xml make the project unbuildable for all external contributors, (2) the logWarning in production HoodieSparkUtils.scala prints the full query plan at WARN level on every MOR read, and (3) two new test classes write to hardcoded absolute paths that only exist on one developer's machine.

pom.xml (private repo + internal Spark version), HoodieSparkUtils.scala (debug log), TestMORFileSliceLayouts.scala and TestMergeModeCommitTimeOrdering.scala (hardcoded paths), RCA_Gluten_Velox_MOR_Failures.md (internal document)

CodeRabbit: yihua#39 (review)
Greptile: yihua#39 (review)

// Additionally, we have to explicitly wrap around resulting [[RDD]] into the one
// injecting [[SQLConf]], which by default isn't propagated by Spark to the executor(s).
// [[SQLConf]] is required by [[AvroSerializer]]
logWarning(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}")

P1 Debug warning log left in production code

This logWarning call prints the entire executed query plan (as a tree string) at WARN level every time createRdd is called. This is clearly a debug statement added during the Gluten/Velox investigation and must not reach production:

  • It will spam the logs for every MOR read in every Hudi job.
  • treeString is non-trivial to compute and allocates for every partition scan.
  • WARN level means it will appear in most production logging configurations.

This line should be removed entirely before merging.

Greptile (original) (source:comment#3082852557)

@@ -170,7 +170,7 @@
<rocksdbjni.version>7.5.3</rocksdbjni.version>

P1 Internal Onehouse Spark build version in OSS project

The Spark 3.5 version has been changed from the public 3.5.5 to 3.5.2-INTERNAL-0.16, an Onehouse-internal build that is only resolvable via the private CodeArtifact repository added below (line ~1966). This will break the build for every developer and CI environment that does not have access to that repository.

This must be reverted to an official Apache Spark release before merging to the OSS project.

Greptile (original) (source:comment#3082852612)

</repository>
<repository>
<id>codeartifact</id>
<name>codeartifact</name>

P1 Private internal Maven repository in OSS project

The repository at https://onehouse-194159489498.d.codeartifact.us-west-2.amazonaws.com/maven/onehouse-internal/ is a private AWS CodeArtifact instance owned by Onehouse. Adding it to the root pom.xml means every Maven build of the project will attempt to resolve artifacts from a private, unauthenticated endpoint — this will fail for all external contributors and CI systems outside of Onehouse infrastructure.

Both this repository entry and the internal Spark version (3.5.2-INTERNAL-0.16) that depends on it must be removed.

Greptile (original) (source:comment#3082852660)

val hadoopPath = new Path(path)
val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf())
if (fs.exists(hadoopPath)) {
fs.delete(hadoopPath, true)

P1 Hardcoded developer-local path used as base output directory

val BASE_OUTPUT_PATH = "/home/ubuntu/ws3/hudi-rs/tables/fdss" is a hardcoded absolute path on a specific developer's machine. All four tests in this file write their tables to sub-directories of this path. These tests will fail (or worse, silently write data to an unexpected location) on any other machine.

This should be replaced with a temporary directory, consistent with how other test classes in the project handle isolation, e.g.:

Suggested change:
- fs.delete(hadoopPath, true)
+ val BASE_OUTPUT_PATH = createTempDir().getAbsolutePath

Or use the withTempDir pattern already used throughout HoodieSparkSqlTestBase.

Greptile (original) (source:comment#3082852745)

withRecordType()(withTempDir { tmp =>
withRecordType()({
val tableName = generateTableName
val currTime = System.currentTimeMillis()

P1 Hardcoded developer-local path and global mutable state

Two related problems here:

  1. Hardcoded path: s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime" is a hardcoded path on a specific developer's machine. Tests will fail on any other CI or developer environment.

  2. Global mutable cnt in companion object: The companion object TestMergeModeCommitTimeOrdering.cnt is a JVM-singleton var incremented across test invocations. This is a form of shared mutable state that makes test execution order-dependent and could cause path collisions or unexpected behavior if tests are run in parallel.

Both issues stem from the same root cause — the code was modified to write tables to a fixed developer path for manual inspection with hudi-rs. This logic should be reverted to use withTempDir { tmp => ... } as in the original code.

The same pattern appears in the second test block in this file (~line 255).

Greptile (original) (source:comment#3082852816)

val tableName = generateTableName
val currTime = System.currentTimeMillis()
val tablePath = s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime"
println(s"fdss: ${tablePath}")

⚠️ Potential issue | 🔴 Critical

Critical: Same hardcoded path and debug statement in second test.

This is a duplicate of the issue in the first test - hardcoded absolute path and debug println must be removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala`
around lines 260 - 262, Remove the hardcoded absolute path and debug println: in
the TestMergeModeCommitTimeOrdering test where tablePath is constructed using
TestMergeModeCommitTimeOrdering.cnt and System.currentTimeMillis(), replace the
absolute "/home/ubuntu/..." path with a temp or relative test directory (or use
the test framework's temporary folder utility) and delete the println(s"fdss:
${tablePath}") call; ensure tablePath is built from a configurable base (or
tempDir) so the test is not environment-specific.

CodeRabbit (original) (source:comment#3082895783)

val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf())
if (fs.exists(hadoopPath)) {
fs.delete(hadoopPath, true)
}

⚠️ Potential issue | 🟠 Major

Don't hard-code a workstation path into an automated test suite.

These tests assume /home/ubuntu/... exists and cleanPath recursively deletes under it. That will fail on CI/non-Ubuntu hosts and is unsafe on developer machines. Use a harness-provided temp directory instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMORFileSliceLayouts.scala`
around lines 48 - 55, Replace the hard-coded BASE_OUTPUT_PATH and the unsafe
cleanPath usage with a test-scoped temporary directory: change BASE_OUTPUT_PATH
to be created at runtime (e.g., use java.nio.file.Files.createTempDirectory or
the test harness/tempDir utility) and ensure cleanPath operates on that temp dir
only (or remove cleanPath and rely on the temp directory lifecycle). Update
references to BASE_OUTPUT_PATH and the cleanPath helper so all tests use this
generated path, and ensure the temp directory is deleted/cleaned in an
afterEach/afterAll teardown to avoid leaving artifacts.

CodeRabbit (original) (source:comment#3082895785)

.filterCompletedInstants().lastInstant().get().requestedTime()
fsView.getLatestMergedFileSlicesBeforeOrOn("", queryInstant)
.iterator().asScala.toSeq
}

⚠️ Potential issue | 🟠 Major

Close the file-system view and log readers on every exit path.

getMetaClientAndFileSystemView creates a HoodieTableFileSystemView that is never closed, and each HoodieLogFileReader only closes on the happy path. A failing read/assertion here will leak handles across this suite.

Also applies to: 111-131

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMORFileSliceLayouts.scala`
around lines 62 - 68, getQueryFileSlices currently obtains resources from
getMetaClientAndFileSystemView (which returns a HoodieTableFileSystemView) and
iterates log readers but does not guarantee closure on exception paths; update
getQueryFileSlices to ensure the HoodieTableFileSystemView and any
HoodieLogFileReader instances are closed in all exit paths (use try/finally or a
loan/withResource pattern) so that fsView.close() is always called and each log
reader is closed even if an assertion or read fails; locate
getMetaClientAndFileSystemView and any code that constructs HoodieLogFileReader
(also in the 111-131 region) and wrap their usage in try/finally blocks or a
managed-resource helper to release resources reliably.

CodeRabbit (original) (source:comment#3082895786)
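The try/finally (loan) pattern the comment asks for can be sketched generically; the real `HoodieTableFileSystemView` and `HoodieLogFileReader` APIs are not shown in this diff, so a stand-in `Closeable` is used below.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.function.Function;

public class ViewLoan {
    // Loan pattern: the resource is closed on every exit path, including
    // when the body throws — mirroring the fix suggested for the
    // file-system view and log readers in getQueryFileSlices.
    static <V extends Closeable, R> R withResource(V resource, Function<V, R> body)
            throws IOException {
        try {
            return body.apply(resource);
        } finally {
            resource.close(); // runs even if body throws
        }
    }

    public static void main(String[] args) throws IOException {
        final boolean[] closed = {false};
        Closeable view = () -> closed[0] = true; // stand-in for a file-system view
        try {
            withResource(view, v -> { throw new RuntimeException("read failed"); });
        } catch (RuntimeException expected) {
            // simulated failing read/assertion
        }
        System.out.println(closed[0]); // prints "true": closed despite the failure
    }
}
```

Wrapping both the view and each log reader this way (or in explicit try/finally blocks) prevents handle leaks when an assertion fails mid-suite.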

<spark33.version>3.3.4</spark33.version>
<spark34.version>3.4.3</spark34.version>
<spark35.version>3.5.5</spark35.version>
<spark35.version>3.5.2-INTERNAL-0.16</spark35.version>

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let me examine the pom.xml file structure and the specific lines mentioned
wc -l pom.xml

Repository: yihua/hudi

Length of output: 66


🏁 Script executed:

# Check the area around line 173 for the spark35.version
sed -n '170,180p' pom.xml

Repository: yihua/hudi

Length of output: 631


🏁 Script executed:

# Check the area around lines 1966-1970 for the Onehouse CodeArtifact repository
sed -n '1960,1975p' pom.xml

Repository: yihua/hudi

Length of output: 479


🏁 Script executed:

# Search for profile definitions to verify which is the default profile
rg -n '<profile>' pom.xml | head -20

Repository: yihua/hudi

Length of output: 433


🏁 Script executed:

# Look for activeByDefault or default profile settings
rg -n -A 3 'activeByDefault\|<id>spark3' pom.xml

Repository: yihua/hudi

Length of output: 36


🏁 Script executed:

# Search for spark3.5 profile definition and its activation settings
rg -n -A 15 '<id>spark3\.5' pom.xml

Repository: yihua/hudi

Length of output: 1058


🏁 Script executed:

# Look for all profiles with their IDs and activation settings
rg -n -B 2 -A 8 '<profile>' pom.xml | grep -E '(profile>|<id>|activeByDefault|activation)'

Repository: yihua/hudi

Length of output: 2375


🏁 Script executed:

# Check line 1974 onwards to see the first profile definition after repositories
sed -n '1974,2050p' pom.xml

Repository: yihua/hudi

Length of output: 2346


🏁 Script executed:

# Get the complete spark3.5 profile definition to check for activeByDefault
sed -n '2656,2715p' pom.xml

Repository: yihua/hudi

Length of output: 3074


🏁 Script executed:

# Check if there are any activeByDefault declarations in profiles
rg -n 'activeByDefault' pom.xml

Repository: yihua/hudi

Length of output: 159


🏁 Script executed:

# Check properties section to understand default spark module
sed -n '140,200p' pom.xml

Repository: yihua/hudi

Length of output: 3710


This makes the default Spark 3.5 build depend on private infrastructure.

spark3.5 is the active-by-default profile, and the default properties use ${spark35.version} which now points to 3.5.2-INTERNAL-0.16. With the Onehouse CodeArtifact repository configured in the global repositories section, every default build will attempt to resolve this internal coordinate from private AWS infrastructure. Public CI and external contributors cannot access this repository without Onehouse credentials. Please keep the public Spark coordinate as default and move the internal version and repository behind an explicit opt-in profile.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pom.xml` at line 173, The pom currently sets the Maven property
spark35.version to the internal coordinate "3.5.2-INTERNAL-0.16" which makes the
active-by-default profile spark3.5 depend on private Onehouse CodeArtifact
repos; revert the default <spark35.version> to the public Spark artifact (e.g.,
a public 3.5.x coordinate) and move the internal "3.5.2-INTERNAL-0.16" value and
the Onehouse repository declaration into a new opt-in Maven profile (e.g.,
profile id "onehouse-internal" or similar); update references to spark35.version
to remain unchanged so external builds use the public coordinate unless the
opt-in profile is activated.

CodeRabbit (original) (source:comment#3082895791)
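A minimal sketch of the opt-in profile both reviewers suggest. The Spark coordinate and repository URL are taken from the comments above; the profile id and exact POM placement are illustrative and would need to fit Hudi's existing profile layout.

```xml
<!-- Sketch only: keep the public Spark release as the default and gate the
     internal coordinate and repository behind an explicit opt-in profile. -->
<profile>
  <id>onehouse-internal</id>
  <properties>
    <spark35.version>3.5.2-INTERNAL-0.16</spark35.version>
  </properties>
  <repositories>
    <repository>
      <id>codeartifact</id>
      <url>https://onehouse-194159489498.d.codeartifact.us-west-2.amazonaws.com/maven/onehouse-internal/</url>
    </repository>
  </repositories>
</profile>
```

The profile would only activate with `mvn -Ponehouse-internal ...`, so default builds keep resolving the public Apache Spark release.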

at org.apache.spark.sql.catalyst.expressions.codegen.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$...$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at HoodieSparkUtils$.$anonfun$createRdd$2(HoodieSparkUtils.scala:107)
```

⚠️ Potential issue | 🟡 Minor

Add languages to the fenced code blocks.

markdownlint is already flagging these fences, so this doc will keep warning until the blocks are tagged (text, scala, java, cpp, etc.).

Also applies to: 32-36, 82-85, 184-203, 206-210, 245-251, 254-264, 320-327, 347-351

🧰 Tools
🪛 markdownlint-cli2 (0.22.0)

[warning] 14-14: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@RCA_Gluten_Velox_MOR_Failures.md` around lines 14 - 22, Several fenced code
blocks in the document are missing language specifiers; update each
triple-backtick fence to include an appropriate language token (e.g., ```java
for the Java stacktrace starting with java.lang.AssertionError and entries like
UnsafeRow.setTotalSize, ```scala for Scala snippets referencing
HoodieSparkUtils.createRdd, or ```text/```cpp as appropriate) so markdownlint
stops flagging them; apply this change to the reported ranges (including the
stacktrace block and the other blocks at 32-36, 82-85, 184-203, 206-210,
245-251, 254-264, 320-327, 347-351) ensuring each fence uses the correct
language tag that matches the contained code/sample.

CodeRabbit (original) (source:comment#3082895798)

