
Adding Index Setting for Data compression #20355

Open

darjisagar7 wants to merge 4 commits into opensearch-project:feature/datafusion from darjisagar7:IndexSettingCompression

Conversation

@darjisagar7 commented Jan 2, 2026

Description

[Describe what this change achieves]

>....
        "compression.enabled": "true",
        "optimized.enabled": "true",
        "index.parquet.page_row_limit": 50000,
        "index.parquet.compression_type": "ZSTD"
    },
    "mappings": {
        "properties": {
            "id": {
                "type": "keyword"
            },
            "name": {
                "type": "keyword"
            },
            "age": {
                "type": "integer"
            },
            "salary": {
                "type": "long"
            },
            "score": {
                "type": "double"
            },
            "active": {
                "type": "boolean"
            },
            "created_date": {
                "type": "date"
            }
        }
    }
}'

{"acknowledged":true,"shards_acknowledged":true,"index":"index-7"}%                                                                                                                                                                                                                                                                                         
parquet-tools inspect OpenSearch/build/testclusters/runTask-0/data/nodes/0/indices/jhYkYQ8PQNGg-UAsLQEq8g/0/parquet/_parquet_file_generation_16.parquet

############ file meta data ############
created_by: parquet-rs version 53.4.1
num_columns: 15
num_rows: 6
num_row_groups: 1
format_version: 1.0
serialized_size: 2793


############ Columns ############
_routing
_doc_count
active
salary
_ignored
_seq_no
score
name
_id
id
created_date
_version
age
___row_id
_primary_term

############ Column(_routing) ############
name: _routing
path: _routing
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: ZSTD (space_saved: -47%)

############ Column(_doc_count) ############
name: _doc_count
path: _doc_count
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: ZSTD (space_saved: -47%)

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@coderabbitai bot commented Jan 2, 2026

Review skipped

Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@github-actions bot commented Jan 2, 2026

❌ Gradle check result for f60cfd8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


// Enhanced native methods that handle validation and provide better error reporting
public static native void createWriter(String file, long schemaAddress) throws IOException;
public static native void createWriter(String file, long schemaAddress, boolean isCompressionEnabled) throws IOException;
Member

This change makes the API specific to compression. Can we instead pass a map of attribute names and values here? Also, there are certain settings we may want to apply only to certain columns (some of these may get handled via mappings as well). Can you look into whether the argument passed to Rust can be shaped to handle both index-level and field-level settings?
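One possible shape of the suggested generic argument, sketched under the assumption of a simple string map (class and parameter names here are hypothetical, not the PR's actual API):

import java.io.IOException;
import java.util.Map;

// Hypothetical sketch (not the PR's API): a single options map can carry
// index-level and field-level settings, e.g. "compression.codec" or
// per-column keys like "bloom-filter.enabled.col1", without adding a new
// native overload for every setting.
public final class ParquetWriterJNI {
    public static native void createWriter(String file, long schemaAddress, Map<String, String> writerOptions)
        throws IOException;
}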

Contributor

Iceberg configuration, for reference:

Property                                        Default                Description
write.format.default                            parquet                Default file format for the table; parquet, avro, or orc
write.delete.format.default                     data file format       Default delete file format for the table; parquet, avro, or orc
write.parquet.row-group-size-bytes              134217728 (128 MB)     Parquet row group size
write.parquet.page-size-bytes                   1048576 (1 MB)         Parquet page size
write.parquet.page-row-limit                    20000                  Parquet page row limit
write.parquet.dict-size-bytes                   2097152 (2 MB)         Parquet dictionary page size
write.parquet.compression-codec                 zstd                   Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed
write.parquet.compression-level                 null                   Parquet compression level
write.parquet.bloom-filter-enabled.column.col1  (not set)              Hint to Parquet to write a bloom filter for column 'col1'
write.parquet.bloom-filter-max-bytes            1048576 (1 MB)         Maximum number of bytes for a bloom filter bitset
write.parquet.bloom-filter-fpp.column.col1      0.01                   False positive probability for a bloom filter applied to 'col1' (must be > 0.0 and < 1.0)

);

public static final Setting<Boolean> INDEX_COMPRESSION_ENABLED_SETTING = Setting.boolSetting(
"index.compression.enabled",
Member

Should we use the existing codec setting here? We can validate which values to support in the Parquet plugin itself.

Member

Also, some settings may come from Parquet directly. If a setting is specific to a data format, it should be declared in the plugin itself:

    public List<Setting<?>> getSettings() {
        return Collections.emptyList();
    }
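A minimal sketch of that direction, with hypothetical class and setting names (this is not the PR's code): the format-specific setting lives in the Parquet plugin and is registered via getSettings(), rather than in core.

import java.util.List;

import org.opensearch.common.settings.Setting;
import org.opensearch.plugins.Plugin;

// Hypothetical sketch: the Parquet plugin declares its own
// index-scoped setting and exposes it through getSettings().
public class ParquetFormatPlugin extends Plugin {

    public static final Setting<Integer> PARQUET_PAGE_ROW_LIMIT_SETTING = Setting.intSetting(
        "index.parquet.page_row_limit",
        20_000,                       // default, mirroring Iceberg's page-row-limit
        1,                            // minimum
        Setting.Property.IndexScope
    );

    @Override
    public List<Setting<?>> getSettings() {
        return List.of(PARQUET_PAGE_ROW_LIMIT_SETTING);
    }
}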

@darjisagar7 force-pushed the IndexSettingCompression branch from f60cfd8 to 04f4f7d on January 20, 2026 06:17
@github-actions bot

❌ Gradle check result for 04f4f7d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Sagar Darji <darjisagar7@gmail.com>
Signed-off-by: Sagar Darji <darjisagar7@gmail.com>
@darjisagar7 force-pushed the IndexSettingCompression branch from 04f4f7d to fe1b049 on February 19, 2026 09:42
@darjisagar7 force-pushed the IndexSettingCompression branch from fe1b049 to c01b518 on February 19, 2026 10:05
@github-actions bot

❌ Gradle check result for c01b518: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

"index.parquet.row_group_size_bytes",
new ByteSizeValue(128, ByteSizeUnit.MB),
Setting.Property.IndexScope,
Setting.Property.Dynamic
Member

For now, it's OK for this to not be dynamic.
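Concretely, the Dynamic property would just be dropped; a sketch of the non-dynamic form, mirroring the excerpt above (import paths may differ across OpenSearch versions):

import org.opensearch.common.settings.Setting;
import org.opensearch.core.common.unit.ByteSizeUnit;
import org.opensearch.core.common.unit.ByteSizeValue;

// Non-dynamic variant: index-scoped only, fixed at index creation time.
public static final Setting<ByteSizeValue> INDEX_PARQUET_ROW_GROUP_SIZE_SETTING = Setting.byteSizeSetting(
    "index.parquet.row_group_size_bytes",
    new ByteSizeValue(128, ByteSizeUnit.MB),
    Setting.Property.IndexScope
);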

…rence

Signed-off-by: Sagar Darji <darjisagar7@gmail.com>
@darjisagar7 force-pushed the IndexSettingCompression branch from c01b518 to 969e025 on February 19, 2026 10:42
@github-actions bot

❌ Gradle check result for 969e025: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

.set_compression(Compression::ZSTD(ZstdLevel::try_new(3).unwrap()))
.build();

let props = WriterPropertiesBuilder::build(&config);
Member

Handle this for merges as well.


// Push current settings to Rust store once on construction, then keep in sync on updates
pushSettingsToRust(indexSettings);
indexSettings.getScopedSettings().addSettingsUpdateConsumer(
Member

Since the settings are not dynamic for now, we can remove these.
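That is, with static settings the constructor only needs the one-time push; a sketch against the excerpt above (method and variable names as in that excerpt):

// Settings are fixed at index creation, so a single push suffices;
// the addSettingsUpdateConsumer registration can be dropped.
pushSettingsToRust(indexSettings);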

}

// Read configuration from the global settings store
let config: NativeSettings = SETTINGS_STORE.lock().unwrap().clone();
Member

How is this index-level? Reading from a single global settings store looks node-wide rather than per-index.

Signed-off-by: Sagar Darji <darjisagar7@gmail.com>
@github-actions bot

❌ Gradle check result for 178887e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
