HIVE-28615: OTEL: Expose FS counters for queries #5535
tanishq-chugh wants to merge 7 commits into apache:master
Conversation
ayushtkn
left a comment
I have doubts about the heap consumption this adds at HS2. We can't do it without a flag; otherwise people with OTEL disabled will also be impacted.
Let me think about it, discuss with folks, and circle back.
```java
private Long endTime;
private String state;
private QueryDisplay queryDisplay;
private QueryPlan queryPlan;
```
I have doubts about this; persisting the entire QueryPlan for huge queries might lead to heap challenges at HS2.
Agreed, but only for memory reasons.
Be aware that counters like this are only available after the DAG finishes (when they are aggregated in the AM), so including them in the scope of OTEL looks strange, as this metrics collection is supposed to be a live thing (while the query is running).
I guess if we run this code on a cluster while a query is running, it will hit an NPE somewhere, or return nonsense counter values.
I think for proper metrics, users should refer to the query history once its implementation is done.
As discussed in the meantime: OTEL integration also applies to finished queries, in which case we might want to include counters.
Two notes:
- I agree with not using QueryPlan. As part of the query history service I already implemented a RuntimeContext that's not committed yet, but it could be reused here. Would you mind me refactoring it into a separate change so you can use it here? That way both query history and this code path would use the same thing.
- If we're adding counters, should we consider adding more? Let me attach a query summary here so you can see and decide what could be useful. Primarily I'm thinking of:
DAGCounter
TaskCounter
LlapIOCounters
LlapWmCounters
INFO : org.apache.tez.common.counters.DAGCounter:
INFO : NUM_SUCCEEDED_TASKS: 23
INFO : TOTAL_LAUNCHED_TASKS: 23
INFO : DATA_LOCAL_TASKS: 18
INFO : AM_CPU_MILLISECONDS: 8920
INFO : AM_GC_TIME_MILLIS: 0
INFO : INITIAL_HELD_CONTAINERS: 0
INFO : TOTAL_CONTAINERS_USED: 23
INFO : TOTAL_CONTAINER_LAUNCH_COUNT: 23
INFO : TOTAL_CONTAINER_RELEASE_COUNT: 23
INFO : NODE_USED_COUNT: 2
INFO : NODE_TOTAL_COUNT: 2
INFO : File System Counters:
INFO : FILE_BYTES_READ: 3848955
INFO : FILE_BYTES_WRITTEN: 14729274
INFO : FILE_READ_OPS: 0
INFO : FILE_LARGE_READ_OPS: 0
INFO : FILE_WRITE_OPS: 0
INFO : S3A_BYTES_READ: 21267950
INFO : S3A_BYTES_WRITTEN: 0
INFO : S3A_READ_OPS: 108
INFO : S3A_LARGE_READ_OPS: 0
INFO : S3A_WRITE_OPS: 0
INFO : org.apache.tez.common.counters.TaskCounter:
INFO : REDUCE_INPUT_GROUPS: 1069247
INFO : REDUCE_INPUT_RECORDS: 1079849
INFO : COMBINE_INPUT_RECORDS: 0
INFO : SPILLED_RECORDS: 1623096
INFO : NUM_SHUFFLED_INPUTS: 60
INFO : NUM_SKIPPED_INPUTS: 2
INFO : NUM_FAILED_SHUFFLE_INPUTS: 0
INFO : MERGED_MAP_OUTPUTS: 44
INFO : TASK_DURATION_MILLIS: 122643
INFO : INPUT_RECORDS_PROCESSED: 1004974
INFO : INPUT_SPLIT_LENGTH_BYTES: 135671627
INFO : OUTPUT_RECORDS: 1580632
INFO : APPROXIMATE_INPUT_RECORDS: 1001564
INFO : OUTPUT_LARGE_RECORDS: 0
INFO : OUTPUT_BYTES: 39368971
INFO : OUTPUT_BYTES_WITH_OVERHEAD: 42530547
INFO : OUTPUT_BYTES_PHYSICAL: 14727862
INFO : ADDITIONAL_SPILLS_BYTES_WRITTEN: 0
INFO : ADDITIONAL_SPILLS_BYTES_READ: 5770624
INFO : ADDITIONAL_SPILL_COUNT: 0
INFO : SHUFFLE_CHUNK_COUNT: 24
INFO : SHUFFLE_BYTES: 17983447
INFO : SHUFFLE_BYTES_DECOMPRESSED: 57039563
INFO : SHUFFLE_BYTES_TO_MEM: 8957046
INFO : SHUFFLE_BYTES_TO_DISK: 0
INFO : SHUFFLE_BYTES_DISK_DIRECT: 9026401
INFO : NUM_MEM_TO_DISK_MERGES: 0
INFO : NUM_DISK_TO_DISK_MERGES: 0
INFO : SHUFFLE_PHASE_TIME: 8867
INFO : MERGE_PHASE_TIME: 5033
INFO : FIRST_EVENT_RECEIVED: 3608
INFO : LAST_EVENT_RECEIVED: 4347
INFO : DATA_BYTES_VIA_EVENT: 0
INFO : HIVE:
INFO : DESERIALIZE_ERRORS: 0
INFO : RECORDS_IN_Map_1: 102
INFO : RECORDS_IN_Map_2: 2875432
INFO : RECORDS_IN_Map_6: 10000
INFO : RECORDS_IN_Map_7: 500000
INFO : RECORDS_IN_Map_8: 10000
INFO : RECORDS_OUT_0: 0
INFO : RECORDS_OUT_INTERMEDIATE_Map_1: 0
INFO : RECORDS_OUT_INTERMEDIATE_Map_2: 1079849
INFO : RECORDS_OUT_INTERMEDIATE_Map_6: 366
INFO : RECORDS_OUT_INTERMEDIATE_Map_7: 500000
INFO : RECORDS_OUT_INTERMEDIATE_Map_8: 366
INFO : RECORDS_OUT_INTERMEDIATE_Reducer_3: 0
INFO : RECORDS_OUT_INTERMEDIATE_Reducer_4: 0
INFO : RECORDS_OUT_INTERMEDIATE_Reducer_5: 51
INFO : RECORDS_OUT_OPERATOR_FIL_143: 366
INFO : RECORDS_OUT_OPERATOR_FIL_146: 366
INFO : RECORDS_OUT_OPERATOR_FIL_149: 2699840
INFO : RECORDS_OUT_OPERATOR_FIL_150: 2724950
INFO : RECORDS_OUT_OPERATOR_FIL_159: 0
INFO : RECORDS_OUT_OPERATOR_FIL_165: 51
INFO : RECORDS_OUT_OPERATOR_FIL_168: 500000
INFO : RECORDS_OUT_OPERATOR_FIL_172: 2040
INFO : RECORDS_OUT_OPERATOR_FIL_177: 0
INFO : RECORDS_OUT_OPERATOR_FS_183: 0
INFO : RECORDS_OUT_OPERATOR_GBY_155: 539644
INFO : RECORDS_OUT_OPERATOR_GBY_156: 540205
INFO : RECORDS_OUT_OPERATOR_GBY_162: 534649
INFO : RECORDS_OUT_OPERATOR_GBY_164: 51
INFO : RECORDS_OUT_OPERATOR_GBY_171: 2048
INFO : RECORDS_OUT_OPERATOR_LIM_182: 0
INFO : RECORDS_OUT_OPERATOR_MAPJOIN_153: 542954
INFO : RECORDS_OUT_OPERATOR_MAPJOIN_154: 547920
INFO : RECORDS_OUT_OPERATOR_MAPJOIN_174: 0
INFO : RECORDS_OUT_OPERATOR_MAPJOIN_175: 0
INFO : RECORDS_OUT_OPERATOR_MAPJOIN_176: 0
INFO : RECORDS_OUT_OPERATOR_MAP_0: 0
INFO : RECORDS_OUT_OPERATOR_RS_145: 366
INFO : RECORDS_OUT_OPERATOR_RS_148: 366
INFO : RECORDS_OUT_OPERATOR_RS_157: 539644
INFO : RECORDS_OUT_OPERATOR_RS_158: 540205
INFO : RECORDS_OUT_OPERATOR_RS_161: 0
INFO : RECORDS_OUT_OPERATOR_RS_167: 51
INFO : RECORDS_OUT_OPERATOR_RS_170: 500000
INFO : RECORDS_OUT_OPERATOR_RS_180: 0
INFO : RECORDS_OUT_OPERATOR_SEL_144: 366
INFO : RECORDS_OUT_OPERATOR_SEL_147: 366
INFO : RECORDS_OUT_OPERATOR_SEL_151: 2699840
INFO : RECORDS_OUT_OPERATOR_SEL_152: 2724950
INFO : RECORDS_OUT_OPERATOR_SEL_160: 0
INFO : RECORDS_OUT_OPERATOR_SEL_163: 534649
INFO : RECORDS_OUT_OPERATOR_SEL_166: 51
INFO : RECORDS_OUT_OPERATOR_SEL_169: 500000
INFO : RECORDS_OUT_OPERATOR_SEL_173: 2040
INFO : RECORDS_OUT_OPERATOR_SEL_179: 0
INFO : RECORDS_OUT_OPERATOR_SEL_181: 0
INFO : RECORDS_OUT_OPERATOR_TNK_178: 0
INFO : RECORDS_OUT_OPERATOR_TS_0: 102
INFO : RECORDS_OUT_OPERATOR_TS_18: 500000
INFO : RECORDS_OUT_OPERATOR_TS_24: 10000
INFO : RECORDS_OUT_OPERATOR_TS_3: 2875432
INFO : RECORDS_OUT_OPERATOR_TS_6: 10000
INFO : Shuffle Errors:
INFO : BAD_ID: 0
INFO : CONNECTION: 0
INFO : IO_ERROR: 0
INFO : WRONG_LENGTH: 0
INFO : WRONG_MAP: 0
INFO : WRONG_REDUCE: 0
INFO : org.apache.hadoop.hive.llap.counters.LlapIOCounters:
INFO : ALLOCATED_BYTES: 38535168
INFO : ALLOCATED_USED_BYTES: 34036847
INFO : CACHE_HIT_BYTES: 0
INFO : CACHE_MISS_BYTES: 20493187
INFO : CONSUMER_TIME_NS: 3531916885
INFO : DECODE_TIME_NS: 1180278124
INFO : HDFS_TIME_NS: 21459595705
INFO : IO_CPU_NS: 3237492937
INFO : IO_USER_NS: 2940000000
INFO : METADATA_CACHE_MISS: 58
INFO : NUM_DECODED_BATCHES: 355
INFO : NUM_DECODED_ROWS: 3395534
INFO : NUM_INPUT_ROWS: 3395534
INFO : NUM_VECTOR_BATCHES: 3408
INFO : ROWS_EMITTED: 3395534
INFO : SELECTED_ROWGROUPS: 355
INFO : TOTAL_IO_TIME_NS: 13162230633
INFO : org.apache.hadoop.hive.llap.counters.LlapWmCounters:
INFO : GUARANTEED_QUEUED_NS: 0
INFO : GUARANTEED_RUNNING_NS: 0
INFO : SPECULATIVE_QUEUED_NS: 35824072
INFO : SPECULATIVE_RUNNING_NS: 118969998048
INFO : org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable$LlapExecutorCounters:
INFO : EXECUTOR_CPU_NS: 29116150979
INFO : EXECUTOR_USER_NS: 27090000000
INFO : org.apache.hadoop.hive.ql.exec.tez.HiveInputCounters:
INFO : GROUPED_INPUT_SPLITS_Map_1: 1
INFO : GROUPED_INPUT_SPLITS_Map_2: 11
INFO : GROUPED_INPUT_SPLITS_Map_6: 1
INFO : GROUPED_INPUT_SPLITS_Map_7: 4
INFO : GROUPED_INPUT_SPLITS_Map_8: 1
INFO : INPUT_DIRECTORIES_Map_1: 1
INFO : INPUT_DIRECTORIES_Map_2: 1
INFO : INPUT_DIRECTORIES_Map_6: 1
INFO : INPUT_DIRECTORIES_Map_7: 1
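To make the "which counters are useful" question above concrete, here is a hypothetical sketch of restricting the exported counters to an allow-list of groups, so HS2 would forward only the four groups suggested (DAGCounter, TaskCounter, LlapIOCounters, LlapWmCounters) and skip the very large per-operator HIVE group. The class and method names are illustrative, not actual Hive APIs; only the group names come from the summary above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: allow-list filter for counter groups exported to OTEL.
public class CounterGroupFilter {
    private static final Set<String> ALLOWED_GROUPS = new HashSet<>(Arrays.asList(
        "org.apache.tez.common.counters.DAGCounter",
        "org.apache.tez.common.counters.TaskCounter",
        "org.apache.hadoop.hive.llap.counters.LlapIOCounters",
        "org.apache.hadoop.hive.llap.counters.LlapWmCounters"));

    // Returns true when a counter group should be forwarded to OTEL.
    public static boolean isAllowed(String groupName) {
        return ALLOWED_GROUPS.contains(groupName);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("org.apache.tez.common.counters.DAGCounter")); // true
        System.out.println(isAllowed("HIVE")); // false: per-operator counters skipped
    }
}
```

Keeping the filter to whole groups (rather than individual counter names) keeps the export size bounded even for plans with hundreds of operators.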
I just created HIVE-28623; let me block you for 1-2 days with it.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Hm, this is closed as stale; is it really? Can we do anything to move forward?
Force-pushed: ab23d7d to 8994104
ayushtkn
left a comment
Thanks @tanishq-chugh. Can you try to check what the heap usage looks like when storing TezRuntimeContext, maybe for one query? We aren't storing the plans now, so I believe it should be constant and independent of the query. Can you double check once?
```diff
    "Frequency at which the OTEL Metrics are refreshed, A value of 0 or less disable the feature"),
- HIVE_OTEL_COLLECTOR_ENDPOINT("hive.otel.collector.endpoint", "",
+ HIVE_OTEL_COLLECTOR_ENDPOINT("hive.otel.collector.endpoint", "http://otelcollector:5650",
    "Defines the rewrite policy, the valid values are those defined in RewritePolicy enum"),
- HIVE_OTEL_METRICS_FREQUENCY_SECONDS("hive.otel.metrics.frequency.seconds", "0s",
+ HIVE_OTEL_METRICS_FREQUENCY_SECONDS("hive.otel.metrics.frequency.seconds", "2s",
```
```java
public synchronized void setTezRuntimeContext(TezRuntimeContext trc) {
  this.tezRuntimeContext = trc;
}

public synchronized TezRuntimeContext getTezRuntimeContext() {
  return this.tezRuntimeContext;
}

public synchronized JSONObject getCounters() {
  JSONObject countersJson = new JSONObject();
  if (this.tezRuntimeContext != null) {
    TezCounters counters = this.tezRuntimeContext.getCounters();
    if (counters != null) {
      for (CounterGroup group : counters) {
        for (TezCounter counter : group) {
          countersJson.put(group.getName() + " - " + counter.getDisplayName(), counter.getValue());
        }
      }
    }
  }
  return countersJson;
}
```
We should make it optional; only when the config is set to true should we persist it.
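A hypothetical sketch of the guard suggested above: retain the TezRuntimeContext on the query record only when OTEL export is enabled, so deployments with OTEL off pay no extra heap. The class and field names are illustrative stand-ins, not the actual Hive code.

```java
// Sketch: persist the runtime context only when OTEL export is enabled.
public class QueryInfoSketch {
    private final boolean otelEnabled;  // e.g. derived from the OTEL frequency config
    private Object tezRuntimeContext;   // stand-in for TezRuntimeContext

    public QueryInfoSketch(boolean otelEnabled) {
        this.otelEnabled = otelEnabled;
    }

    // With OTEL disabled this is a no-op, so nothing is retained on the heap.
    public synchronized void setTezRuntimeContext(Object context) {
        if (otelEnabled) {
            this.tezRuntimeContext = context;
        }
    }

    public synchronized boolean hasRuntimeContext() {
        return tezRuntimeContext != null;
    }

    public static void main(String[] args) {
        QueryInfoSketch off = new QueryInfoSketch(false);
        off.setTezRuntimeContext(new Object());
        System.out.println(off.hasRuntimeContext()); // false: nothing retained
    }
}
```

Checking the flag at the setter keeps the decision in one place instead of scattering null checks through every consumer.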
…to avoid storing TezRuntimeContext at multiple places redundantly
…ther minimize memory usage
…or api servlet method



What changes were proposed in this pull request?
Expose FS counters information via OTEL at the query level.
Why are the changes needed?
Better usability for OTEL.
Does this PR introduce any user-facing change?
Yes, FS-related information will be visible at OTEL receivers.
Is the change a dependency upgrade?
No.
How was this patch tested?
Manual testing.