[ENG-38911] Test hudi rs with hudi functional tests #18455
Davis-Zhang-Onehouse wants to merge 1 commit into apache:master from
Conversation
Force-pushed: 5f9b15d to e48e75c, then e48e75c to 3c0e970.
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Style & Readability Review — Several hardcoded filesystem paths and debug artifacts should be removed; test utilities exist for managing temp directories safely.
```scala
val BASE_OUTPUT_PATH = "/home/ubuntu/ws3/hudi-rs/tables/fdss"
// ...
private def cleanPath(path: String): Unit = {
```
🤖 nit: replace hardcoded /home/ubuntu/... path with a test-scoped temp directory (e.g., spark.sessionState.conf.warehouseDir or JUnit's @TempDir). This path is specific to your machine and will fail on CI/other developers' machines.
```diff
-    withRecordType()(withTempDir { tmp =>
+    withRecordType()({
       val tableName = generateTableName
       val currTime = System.currentTimeMillis()
```
🤖 nit: hardcoded /home/ubuntu/... path should be replaced with a test temp directory; this will fail on CI and other developers' machines.
```scala
val tableName = generateTableName
val currTime = System.currentTimeMillis()
val tablePath = s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime"
println(s"fdss: ${tablePath}")
```
🤖 nit: remove the debug string "fdss: " from this println, or replace with a meaningful test identifier if logging is needed.
```diff
@@ -28,13 +28,15 @@ import org.apache.hudi.common.testutils.HoodieTestUtils
 import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
```
🤖 nit: consider adding a comment to the var cnt: Int explaining its purpose (e.g., '// Counter to generate unique table paths for parallel test execution').
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the detailed RCA document — the Velox alignment analysis is thorough. However, this PR has several issues that make it unsuitable for merging into the Apache Hudi mainline as-is: it introduces internal/proprietary dependencies, leaves debug artifacts in production code, disables significant test coverage, and hardcodes local filesystem paths. These need to be addressed before this can land.
```diff
     <spark33.version>3.3.4</spark33.version>
     <spark34.version>3.4.3</spark34.version>
-    <spark35.version>3.5.5</spark35.version>
+    <spark35.version>3.5.2-INTERNAL-0.16</spark35.version>
```
🤖 This changes the Spark version from a public Apache release (3.5.5) to what appears to be a private internal build (3.5.2-INTERNAL-0.16). This would break builds for anyone without access to the internal repository and is not appropriate for the Apache Hudi mainline. Could you revert this and keep the Gluten integration working with a standard Spark version?
```xml
    </repository>
    <repository>
      <id>codeartifact</id>
      <name>codeartifact</name>
```
🤖 Adding a private CodeArtifact repository (onehouse-194159489498.d.codeartifact.us-west-2.amazonaws.com) to the root POM would force all Hudi builders to have access to this internal repo. This should not be in the mainline POM — it belongs in a local settings.xml or a separate profile.
```scala
// Additionally, we have to explicitly wrap around resulting [[RDD]] into the one
// injecting [[SQLConf]], which by default isn't propagated by Spark to the executor(s).
// [[SQLConf]] is required by [[AvroSerializer]]
logWarning(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}")
```
🤖 This logWarning dumps the full executed plan on every createRdd call in production code. This is a debug statement that should be removed before merging — it will produce excessive log output in production workloads and may leak schema details into logs.
```scala
private def cleanPath(path: String): Unit = {
  val hadoopPath = new Path(path)
  val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf())
```
🤖 This hardcodes an absolute path (/home/ubuntu/ws3/hudi-rs/tables/fdss) that only exists on a specific developer machine. Tests will fail or write to unexpected locations on any other system. Could you use withTempDir or getBasePath instead?
```scala
withRecordType()({
  val tableName = generateTableName
  val currTime = System.currentTimeMillis()
  val tablePath = s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime"
```
🤖 Same issue — hardcoded path /home/ubuntu/ws3/hudi-rs/tables/table... that only works on a specific machine. The original code used withTempDir { tmp => ... } which was correct; this change also removes the temp directory cleanup. The mutable cnt counter in the companion object is also not thread-safe if tests run in parallel.
```diff
@@ -583,6 +576,9 @@ class TestMORDataSource extends HoodieSparkClientTestBase with SparkDatasetMixin
 // First Operation:
 // Producing parquet files to three default partitions.
 // SNAPSHOT view on MOR table with parquet files only.
```
🤖 These commented-out lines (val BASE_OUTPUT_PATH, val tablePath) look like leftover debug code. Could you remove them?
```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
import org.apache.spark.sql.execution.datasources.SparkColumnarFileReader
import org.apache.spark.sql.hudi.MultipleColumnarFileFormatReader
```
🤖 The @Disabled annotations on tests like testCustomDelete use reasons like "Custom delete payload not supported" — but these features ARE supported by Hudi. They're only unsupported by Gluten/Velox. If this PR is meant to run alongside normal Hudi CI, these disabled tests would silently skip important coverage. Consider using @EnabledIf with a condition that checks for Gluten presence.
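One way to keep these tests compiled in but skipped only when Gluten is absent is a JUnit 5 `@EnabledIf` condition method that probes the classpath. A minimal sketch; the probed class name `org.apache.gluten.GlutenPlugin` is an assumption here, not a verified Gluten API, and would need to match whatever class the Gluten bundle actually ships:

```java
public class GlutenCondition {
    // Condition method a test could reference via
    // @EnabledIf("com.example.GlutenCondition#glutenOnClasspath").
    // NOTE: the probed class name below is an assumption.
    public static boolean glutenOnClasspath() {
        try {
            Class.forName("org.apache.gluten.GlutenPlugin");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Without the Gluten bundle on the classpath this prints: false
        System.out.println(glutenOnClasspath());
    }
}
```

With this, the test run itself records the variants as skipped (with the condition's reason) instead of silently dropping coverage behind a misleading `@Disabled` message.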
```scala
// "cow,current,false,false", "cow,current,false,true", "cow,current,true,false",
"mor,current,false,false", "mor,current,false,true", "mor,current,true,false").foreach { args =>
  val argList = args.split(',')
  val tableType = argList(0)
```
🤖 COW table type tests have been commented out (cow,current,...), which removes test coverage for Copy-on-Write tables with commit time ordering. This is a core Hudi feature. Was this intentional, or leftover from debugging?
```diff
     val mergeConfigClause = if (setRecordMergeConfigs) {
       // with precombine field set, UPSERT operation is used automatically
-      if (tableVersion.toInt == 6) {
+    if (tableVersion.toInt == 6) {
```
🤖 This looks like an accidental indentation change (the if was de-indented by one level). Might cause confusion on future diffs.
```diff
@@ -0,0 +1,373 @@
+# RCA: Gluten/Velox MOR Test Failures
```
🤖 This is an excellent root-cause analysis document, but should it live in the repo root? It seems more suited for a wiki page, Jira ticket, or the PR description itself. If it does belong in the repo, a subdirectory like docs/ or dev/ might be more appropriate.
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
CodeRabbit Walkthrough: This PR introduces Gluten/Velox native execution support to Hudi test infrastructure, adds comprehensive merge-on-read (MOR) test suites with complex data types, documents three failing MOR test cases, disables unsupported test scenarios (custom merge modes, custom payloads), and updates Maven configuration to pull specialized Spark dependencies.
Greptile Summary: This PR introduces Gluten/Velox native-execution support into Hudi's functional test suite (MOR read path), driven by an investigation into three failing sub-tests (testMORWithMap, testMORWithDecimal, testMORWithArray). It adds GlutenTestUtils for opt-in Gluten SparkConf injection, a gluten-velox Maven profile, new TestMORFileSliceLayouts and testReadMORTableWithAllDataTypes test coverage, and a number of @Disabled annotations to skip currently-unsupported test variants.
However, the PR contains several OSS-blocking artifacts from the investigation process:
- Production debug logging: A `logWarning` that prints the full executed query plan at WARN level was left in `HoodieSparkUtils.scala` and will appear in every Hudi job's logs.
- Private build infrastructure: `pom.xml` replaces the public Apache Spark `3.5.5` with `3.5.2-INTERNAL-0.16` (an Onehouse-internal build) and adds a private AWS CodeArtifact Maven repository, breaking the build for all external contributors and CI environments.
- Hardcoded developer paths: `TestMORFileSliceLayouts.scala` and `TestMergeModeCommitTimeOrdering.scala` both write tables to hardcoded absolute paths (`/home/ubuntu/ws3/hudi-rs/tables/...`) instead of `withTempDir`, making these tests fail on any machine other than the original author's. `TestMergeModeCommitTimeOrdering` also introduces a global mutable `cnt` counter in a companion object to differentiate paths.
- Internal investigation document: `RCA_Gluten_Velox_MOR_Failures.md` is committed to the repository root, an internal RCA artifact that does not belong in the OSS codebase.
- Commented-out test parameters: `TestMORDataSource.scala` has large blocks of `@CsvSource` entries commented out (rather than cleanly removed), alongside a leftover debug path comment.
- Significant test coverage reduction: Across multiple files, test parameters are narrowed to version 9 only and many CUSTOM/pre-v9 test cases are disabled without corresponding tracking tickets.
Greptile Confidence Score: 1/5
Not safe to merge — the PR contains private build infrastructure that breaks the OSS build, a production debug log statement, and hardcoded developer paths that will fail tests on any other machine.
Score reflects three blocking P1 issues: (1) the internal Onehouse Spark version and private CodeArtifact Maven repository in root pom.xml make the project unbuildable for all external contributors, (2) the logWarning in production HoodieSparkUtils.scala prints the full query plan at WARN level on every MOR read, and (3) two new test classes write to hardcoded absolute paths that only exist on one developer's machine.
pom.xml (private repo + internal Spark version), HoodieSparkUtils.scala (debug log), TestMORFileSliceLayouts.scala and TestMergeModeCommitTimeOrdering.scala (hardcoded paths), RCA_Gluten_Velox_MOR_Failures.md (internal document)
CodeRabbit: yihua#39 (review)
Greptile: yihua#39 (review)
| // Additionally, we have to explicitly wrap around resulting [[RDD]] into the one | ||
| // injecting [[SQLConf]], which by default isn't propagated by Spark to the executor(s). | ||
| // [[SQLConf]] is required by [[AvroSerializer]] | ||
| logWarning(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}") |
Debug warning log left in production code
This logWarning call prints the entire executed query plan (as a tree string) at WARN level every time createRdd is called. This is clearly a debug statement added during the Gluten/Velox investigation and must not reach production:
- It will spam the logs for every MOR read in every Hudi job.
- `treeString` is non-trivial to compute and allocates for every partition scan.
- WARN level means it will appear in most production logging configurations.

This line should be removed entirely before merging.
— Greptile (original) (source:comment#3082852557)
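If any plan logging is kept at all, it should be demoted to debug level and guarded so the expensive `treeString` rendering only happens when that level is actually enabled. A minimal sketch with `java.util.logging` (Spark's `Logging` trait offers the same pattern via `log.isDebugEnabled` / the by-name argument of `logDebug`); `expensiveTreeString` is a stand-in for `df.queryExecution.executedPlan.treeString`:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class PlanLogging {
    private static final Logger LOG = Logger.getLogger("HoodieSparkUtils");
    private static int treeStringCalls = 0;

    // Stand-in for the costly plan rendering (executedPlan.treeString).
    static String expensiveTreeString() {
        treeStringCalls++;
        return "== Physical Plan ==\n...";
    }

    static void logPlanIfDebug() {
        // Guard: the plan string is only rendered when FINE (debug) is enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("createRdd executedPlan:\n" + expensiveTreeString());
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // debug disabled, as in a typical production config
        logPlanIfDebug();
        System.out.println(treeStringCalls); // 0: nothing was rendered
    }
}
```

The guard matters because the argument to an unguarded log call is evaluated even when the message is ultimately discarded.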
```diff
@@ -170,7 +170,7 @@
 <rocksdbjni.version>7.5.3</rocksdbjni.version>
```
Internal Onehouse Spark build version in OSS project
The Spark 3.5 version has been changed from the public 3.5.5 to 3.5.2-INTERNAL-0.16, an Onehouse-internal build that is only resolvable via the private CodeArtifact repository added below (line ~1966). This will break the build for every developer and CI environment that does not have access to that repository.
This must be reverted to an official Apache Spark release before merging to the OSS project.
— Greptile (original) (source:comment#3082852612)
```xml
    </repository>
    <repository>
      <id>codeartifact</id>
      <name>codeartifact</name>
```
Private internal Maven repository in OSS project
The repository at https://onehouse-194159489498.d.codeartifact.us-west-2.amazonaws.com/maven/onehouse-internal/ is a private AWS CodeArtifact instance owned by Onehouse. Adding it to the root pom.xml means every Maven build of the project will attempt to resolve artifacts from a private, unauthenticated endpoint — this will fail for all external contributors and CI systems outside of Onehouse infrastructure.
Both this repository entry and the internal Spark version (3.5.2-INTERNAL-0.16) that depends on it must be removed.
— Greptile (original) (source:comment#3082852660)
```scala
val hadoopPath = new Path(path)
val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf())
if (fs.exists(hadoopPath)) {
  fs.delete(hadoopPath, true)
```
Hardcoded developer-local path used as base output directory
val BASE_OUTPUT_PATH = "/home/ubuntu/ws3/hudi-rs/tables/fdss" is a hardcoded absolute path on a specific developer's machine. All four tests in this file write their tables to sub-directories of this path. These tests will fail (or worse, silently write data to an unexpected location) on any other machine.
This should be replaced with a temporary directory, consistent with how other test classes in the project handle isolation, e.g.:
```scala
val BASE_OUTPUT_PATH = createTempDir().getAbsolutePath
```
Or use the withTempDir pattern already used throughout HoodieSparkSqlTestBase.
— Greptile (original) (source:comment#3082852745)
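The temp-directory approach can be sketched with plain JDK APIs; `createTempDir()` in the suggestion above stands for whatever helper the test base class provides, and the recursive delete below mirrors what `cleanPath` did, but scoped to a directory the test itself created rather than a fixed workstation path:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class TempDirDemo {
    // Test-scoped base directory instead of a fixed absolute path.
    static Path newTestDir() throws IOException {
        return Files.createTempDirectory("hudi-mor-test-");
    }

    // Recursive delete replacing the manual cleanPath(...) helper;
    // sorted() materializes all paths first, so children are deleted before parents.
    static void deleteRecursively(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = newTestDir();
        Files.write(dir.resolve("marker.txt"), "data".getBytes());
        System.out.println(Files.exists(dir)); // true: table data would live here
        deleteRecursively(dir);
        System.out.println(Files.exists(dir)); // false: nothing leaks across runs
    }
}
```

A `withTempDir`-style harness helper does both steps for you and guarantees the cleanup even when the test body throws.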
```diff
-    withRecordType()(withTempDir { tmp =>
+    withRecordType()({
       val tableName = generateTableName
       val currTime = System.currentTimeMillis()
```
Hardcoded developer-local path and global mutable state
Two related problems here:

1. Hardcoded path: `s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime"` is a hardcoded path on a specific developer's machine. Tests will fail on any other CI or developer environment.
2. Global mutable `cnt` in companion object: The companion object `TestMergeModeCommitTimeOrdering.cnt` is a JVM-singleton `var` incremented across test invocations. This is a form of shared mutable state that makes test execution order-dependent and could cause path collisions or unexpected behavior if tests are run in parallel.
Both issues stem from the same root cause — the code was modified to write tables to a fixed developer path for manual inspection with hudi-rs. This logic should be reverted to use withTempDir { tmp => ... } as in the original code.
The same pattern appears in the second test block in this file (~line 255).
— Greptile (original) (source:comment#3082852816)
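If unique per-test paths are still needed on top of a temp directory, an `AtomicInteger` avoids the races of a plain `var` counter under parallel execution. A minimal sketch; the base-directory argument is illustrative, not the Hudi helper API:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class UniquePaths {
    // Thread-safe replacement for the mutable `var cnt` in the companion object.
    private static final AtomicInteger CNT = new AtomicInteger();

    // Unique suffix even when two tests start in the same millisecond.
    static String nextTablePath(String baseDir) {
        return baseDir + "/table" + CNT.getAndIncrement() + "_" + System.currentTimeMillis();
    }

    public static void main(String[] args) {
        String a = nextTablePath("/tmp/base");
        String b = nextTablePath("/tmp/base");
        // The counter guarantees distinct paths regardless of the clock.
        System.out.println(!a.equals(b)); // true
    }
}
```

That said, the cleaner fix remains reverting to `withTempDir`, which gives each test an isolated path with no shared state at all.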
```scala
val tableName = generateTableName
val currTime = System.currentTimeMillis()
val tablePath = s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime"
println(s"fdss: ${tablePath}")
```
Critical: Same hardcoded path and debug statement in second test.
This is a duplicate of the issue in the first test - hardcoded absolute path and debug println must be removed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala`
around lines 260 - 262, Remove the hardcoded absolute path and debug println: in
the TestMergeModeCommitTimeOrdering test where tablePath is constructed using
TestMergeModeCommitTimeOrdering.cnt and System.currentTimeMillis(), replace the
absolute "/home/ubuntu/..." path with a temp or relative test directory (or use
the test framework's temporary folder utility) and delete the println(s"fdss:
${tablePath}") call; ensure tablePath is built from a configurable base (or
tempDir) so the test is not environment-specific.
— CodeRabbit (original) (source:comment#3082895783)
```scala
val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf())
if (fs.exists(hadoopPath)) {
  fs.delete(hadoopPath, true)
}
```
Don't hard-code a workstation path into an automated test suite.
These tests assume /home/ubuntu/... exists and cleanPath recursively deletes under it. That will fail on CI/non-Ubuntu hosts and is unsafe on developer machines. Use a harness-provided temp directory instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMORFileSliceLayouts.scala`
around lines 48 - 55, Replace the hard-coded BASE_OUTPUT_PATH and the unsafe
cleanPath usage with a test-scoped temporary directory: change BASE_OUTPUT_PATH
to be created at runtime (e.g., use java.nio.file.Files.createTempDirectory or
the test harness/tempDir utility) and ensure cleanPath operates on that temp dir
only (or remove cleanPath and rely on the temp directory lifecycle). Update
references to BASE_OUTPUT_PATH and the cleanPath helper so all tests use this
generated path, and ensure the temp directory is deleted/cleaned in an
afterEach/afterAll teardown to avoid leaving artifacts.
— CodeRabbit (original) (source:comment#3082895785)
```scala
    .filterCompletedInstants().lastInstant().get().requestedTime()
fsView.getLatestMergedFileSlicesBeforeOrOn("", queryInstant)
  .iterator().asScala.toSeq
}
```
Close the file-system view and log readers on every exit path.
getMetaClientAndFileSystemView creates a HoodieTableFileSystemView that is never closed, and each HoodieLogFileReader only closes on the happy path. A failing read/assertion here will leak handles across this suite.
Also applies to: 111-131
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMORFileSliceLayouts.scala`
around lines 62 - 68, getQueryFileSlices currently obtains resources from
getMetaClientAndFileSystemView (which returns a HoodieTableFileSystemView) and
iterates log readers but does not guarantee closure on exception paths; update
getQueryFileSlices to ensure the HoodieTableFileSystemView and any
HoodieLogFileReader instances are closed in all exit paths (use try/finally or a
loan/withResource pattern) so that fsView.close() is always called and each log
reader is closed even if an assertion or read fails; locate
getMetaClientAndFileSystemView and any code that constructs HoodieLogFileReader
(also in the 111-131 region) and wrap their usage in try/finally blocks or a
managed-resource helper to release resources reliably.
— CodeRabbit (original) (source:comment#3082895786)
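The loan pattern the prompt describes can be sketched generically: a helper owns the resource, and a try/finally guarantees `close()` on every exit path, which is what `getQueryFileSlices` should do with the file-system view and each log reader. The `View` interface below is a stand-in for illustration, not a Hudi API:

```java
import java.util.function.Function;

public class LoanPattern {
    // Stand-in for a closeable resource such as HoodieTableFileSystemView.
    interface View extends AutoCloseable {
        int sliceCount();
        @Override void close();
    }

    // Loan helper: the resource is closed even when the body throws.
    static <T> T withView(View view, Function<View, T> body) {
        try {
            return body.apply(view);
        } finally {
            view.close();
        }
    }

    public static void main(String[] args) {
        boolean[] closed = {false};
        View v = new View() {
            public int sliceCount() { return 3; }
            public void close() { closed[0] = true; }
        };
        int n = withView(v, View::sliceCount);
        System.out.println(n);         // 3
        System.out.println(closed[0]); // true: close() ran on the happy path too
    }
}
```

In Scala the same shape is usually written with `scala.util.Using` or a `try { ... } finally { fsView.close() }` wrapper around the view and reader usage.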
```diff
     <spark33.version>3.3.4</spark33.version>
     <spark34.version>3.4.3</spark34.version>
-    <spark35.version>3.5.5</spark35.version>
+    <spark35.version>3.5.2-INTERNAL-0.16</spark35.version>
```
🧩 Analysis chain

🏁 Scripts executed against yihua/hudi to verify the finding:

```shell
# Examine the pom.xml structure and the spark35.version property
wc -l pom.xml
sed -n '170,180p' pom.xml
# Check the Onehouse CodeArtifact repository declaration
sed -n '1960,1975p' pom.xml
# Inspect profile definitions and activation settings
rg -n '<profile>' pom.xml | head -20
rg -n -A 3 'activeByDefault|<id>spark3' pom.xml
rg -n -A 15 '<id>spark3\.5' pom.xml
rg -n -B 2 -A 8 '<profile>' pom.xml | grep -E '(profile>|<id>|activeByDefault|activation)'
sed -n '1974,2050p' pom.xml
sed -n '2656,2715p' pom.xml
rg -n 'activeByDefault' pom.xml
# Check the properties section for the default Spark module
sed -n '140,200p' pom.xml
```
This makes the default Spark 3.5 build depend on private infrastructure.
spark3.5 is the active-by-default profile, and the default properties use ${spark35.version} which now points to 3.5.2-INTERNAL-0.16. With the Onehouse CodeArtifact repository configured in the global repositories section, every default build will attempt to resolve this internal coordinate from private AWS infrastructure. Public CI and external contributors cannot access this repository without Onehouse credentials. Please keep the public Spark coordinate as default and move the internal version and repository behind an explicit opt-in profile.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pom.xml` at line 173, The pom currently sets the Maven property
spark35.version to the internal coordinate "3.5.2-INTERNAL-0.16" which makes the
active-by-default profile spark3.5 depend on private Onehouse CodeArtifact
repos; revert the default <spark35.version> to the public Spark artifact (e.g.,
a public 3.5.x coordinate) and move the internal "3.5.2-INTERNAL-0.16" value and
the Onehouse repository declaration into a new opt-in Maven profile (e.g.,
profile id "onehouse-internal" or similar); update references to spark35.version
to remain unchanged so external builds use the public coordinate unless the
opt-in profile is activated.
— CodeRabbit (original) (source:comment#3082895791)
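Per the suggestion, the internal coordinates could live behind an explicit opt-in profile instead of the global POM sections. A sketch only: the profile id is illustrative, and the version and repository URL are the ones quoted in this review, so they would need to match the actual internal setup:

```xml
<!-- Hypothetical opt-in profile; default builds keep the public spark35.version. -->
<profile>
  <id>onehouse-internal</id>
  <properties>
    <spark35.version>3.5.2-INTERNAL-0.16</spark35.version>
  </properties>
  <repositories>
    <repository>
      <id>codeartifact</id>
      <url>https://onehouse-194159489498.d.codeartifact.us-west-2.amazonaws.com/maven/onehouse-internal/</url>
    </repository>
  </repositories>
</profile>
```

Builds that need the internal Spark would then activate it explicitly (e.g., `mvn -Ponehouse-internal ...`), while the default `spark3.5` profile keeps resolving public Apache artifacts.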
```text
    at org.apache.spark.sql.catalyst.expressions.codegen.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$...$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
    at HoodieSparkUtils$.$anonfun$createRdd$2(HoodieSparkUtils.scala:107)
```
Add languages to the fenced code blocks.
markdownlint is already flagging these fences, so this doc will keep warning until the blocks are tagged (text, scala, java, cpp, etc.).
Also applies to: 32-36, 82-85, 184-203, 206-210, 245-251, 254-264, 320-327, 347-351
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 14-14: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@RCA_Gluten_Velox_MOR_Failures.md` around lines 14 - 22, Several fenced code
blocks in the document are missing language specifiers; update each
triple-backtick fence to include an appropriate language token (e.g., ```java
for the Java stacktrace starting with java.lang.AssertionError and entries like
UnsafeRow.setTotalSize, ```scala for Scala snippets referencing
HoodieSparkUtils.createRdd, or ```text/```cpp as appropriate) so markdownlint
stops flagging them; apply this change to the reported ranges (including the
stacktrace block and the other blocks at 32-36, 82-85, 184-203, 206-210,
245-251, 254-264, 320-327, 347-351) ensuring each fence uses the correct
language tag that matches the contained code/sample.
— CodeRabbit (original) (source:comment#3082895798)
── Step 1: Build once (all dependent modules + test-compile for hudi-spark) ──

```shell
BUNDLE_JAR=/home/ubuntu/ws1/gluten-internal/package/target/gluten-velox-bundle-spark3.5_2.12-linux_amd64-1.5.1-SNAPSHOT.jar

mvn test-compile -am -pl hudi-spark-datasource/hudi-spark \
  -Dgluten.bundle.jar=$BUNDLE_JAR \
  -Dcheckstyle.skip=true -Dscalastyle.skip=true \
  -Dspotless.check.skip=true -Denforcer.skip=true \
  -T 1C -B -ntp
```

── Step 2: Run individual tests — no recompilation ──

```shell
export RUST_LOG=hudi_core=debug

# Common args for surefire:test (no -am needed — deps already compiled)
SUREFIRE_COMMON=(-pl hudi-spark-datasource/hudi-spark
  -Dgluten.bundle.jar=$BUNDLE_JAR
  -Dsurefire-log4j.file=log4j2-surefire-quiet.properties
  -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false
  -DskipUTs=true -Dcheckstyle.skip=true -Dscalastyle.skip=true
  -DredirectTestOutputToFile=false -Dspotless.check.skip=true)

mvn surefire:test "${SUREFIRE_COMMON[@]}" -Dtest=TestMORDataSource | grep "Using Gluten/Velox native"
mvn surefire:test "${SUREFIRE_COMMON[@]}" -Dtest=TestHoodieFileGroupReaderOnSpark | grep "Using Gluten/Velox native"
mvn surefire:test "${SUREFIRE_COMMON[@]}" -Dtest=TestHoodieSparkMergeOnReadTableRollback | grep "Using Gluten/Velox native"
mvn surefire:test "${SUREFIRE_COMMON[@]}" -Dtest=TestMergeHandle | grep "Using Gluten/Velox native"
mvn surefire:test "${SUREFIRE_COMMON[@]}" -Dtest=TestPositionBasedFileGroupRecordBuffer | grep "Using Gluten/Velox native"

# ScalaTest — use scalatest:test (not surefire:test)
SUREFIRE_SCALA=(-pl hudi-spark-datasource/hudi-spark
  -Dgluten.bundle.jar=$BUNDLE_JAR
  -Dsurefire-log4j.file=log4j2-surefire-quiet.properties
  -DfailIfNoTests=false
  -Dcheckstyle.skip=true -Dscalastyle.skip=true
  -DredirectTestOutputToFile=false -Dspotless.check.skip=true)

mvn scalatest:test "${SUREFIRE_SCALA[@]}" -DwildcardSuites=org.apache.spark.sql.hudi.dml.others.TestMergeModeCommitTimeOrdering
mvn scalatest:test "${SUREFIRE_SCALA[@]}" -DwildcardSuites=org.apache.spark.sql.hudi.dml.others.TestMergeModeEventTimeOrdering
mvn scalatest:test "${SUREFIRE_SCALA[@]}" -DwildcardSuites=org.apache.spark.sql.hudi.dml.others.TestPartialUpdateForMergeInto
```