Conversation

@suxiaogang223
Contributor
…revent OOM in file scan (apache#58759)

### What problem does this PR solve?

- Related PR: apache#58858

## Problem Summary

When querying tables in an external catalog (Hive, Iceberg, Paimon, etc.),
Doris splits files into multiple splits for parallel processing. In some
cases, especially with numerous small files, this can generate an
excessive number of splits, potentially causing:

1. **Memory pressure**: Too many splits consume significant memory in FE
2. **OOM issues**: Excessive split generation can lead to
OutOfMemoryError
3. **Performance degradation**: Managing too many splits impacts query
planning overhead

Previously, there was no upper limit on the number of splits in
non-batch mode, which could lead to problems when querying tables with
many small files.

## Solution

This PR introduces a new session variable `max_file_split_num` to limit
the maximum number of splits allowed per table scan in non-batch mode.

### Changes

1. **New Session Variable**: `max_file_split_num`
   - Type: `int`
   - Default: `100000`
   - Description: "The maximum number of splits allowed per table scan in non-batch mode, to prevent OOM caused by generating too many splits."
   - Forward to BE: `true`

2. **Implementation in FileQueryScanNode**:
   - Added method `applyMaxFileSplitNumLimit(long targetSplitSize, long totalFileSize)`
   - Dynamically calculates a minimum split size so that the split count does not exceed the limit
   - Formula: `minSplitSizeForMaxNum = (totalFileSize + maxFileSplitNum - 1) / maxFileSplitNum`
   - Returns: `Math.max(targetSplitSize, minSplitSizeForMaxNum)`

3. **Applied to multiple scan nodes**:
   - `HiveScanNode`
   - `IcebergScanNode`
   - `PaimonScanNode`
   - `TVFScanNode`

4. **Unit Tests**:
   - `FileQueryScanNodeTest`: Test base logic
   - `HiveScanNodeTest`: Test Hive-specific implementation
   - `IcebergScanNodeTest`: Test Iceberg-specific implementation
   - `PaimonScanNodeTest`: Test Paimon-specific implementation
   - `TVFScanNodeTest`: Test TVF-specific implementation
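
The limit logic in item 2 can be sketched as follows. This is a minimal, standalone illustration of the ceiling-division formula described above, not the actual Doris FE code; the class name and the explicit `maxFileSplitNum` parameter are illustrative (in Doris the cap comes from the session variable).

```java
// Sketch of the split-size limit described in the PR. Illustrative only;
// in Doris the cap is read from the session variable max_file_split_num.
public class SplitLimitSketch {
    // Returns an effective split size such that totalFileSize divided by
    // the result yields at most maxFileSplitNum splits.
    static long applyMaxFileSplitNumLimit(long targetSplitSize,
                                          long totalFileSize,
                                          long maxFileSplitNum) {
        if (maxFileSplitNum <= 0) {
            return targetSplitSize; // limit disabled, keep the target size
        }
        // Ceiling division: the smallest split size that keeps the
        // split count at or under the cap.
        long minSplitSizeForMaxNum =
                (totalFileSize + maxFileSplitNum - 1) / maxFileSplitNum;
        return Math.max(targetSplitSize, minSplitSizeForMaxNum);
    }

    public static void main(String[] args) {
        // 1 GiB of data, 64 MiB target splits, cap of 8 splits:
        // the split size is raised to 128 MiB so only 8 splits are produced.
        long size = applyMaxFileSplitNumLimit(64L << 20, 1L << 30, 8);
        System.out.println(size); // 134217728 (128 MiB)
    }
}
```

Because the method only ever raises the split size (`Math.max`), tables whose natural split count is already under the cap are unaffected.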

## Usage

Users can now control the maximum number of splits per table scan by
setting the session variable:

```sql
-- Set to 50000 splits maximum
SET max_file_split_num = 50000;

-- Disable the limit (set to 0 or negative)
SET max_file_split_num = 0;
```

(cherry picked from commit 3e5a70f)
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223
Contributor Author

run buildall

@suxiaogang223
Contributor Author

run buildall

@suxiaogang223 force-pushed the pick-58759-branch-4.0-v2 branch from 4f944e4 to 75641e0 (February 13, 2026 04:21)
@suxiaogang223
Contributor Author

run buildall
