[Bug]: BigQueryIO DIRECT_READ breaks for empty table during Bounded-to-Unbounded conversion #38007
Description
What happened?
Summary
When using BigQueryIO.read().withMethod(Method.DIRECT_READ) to read from an empty BigQuery table or a table that returns zero rows after applying RowRestriction, the BigQuery Storage API server returns a ReadSession with zero streams. BigQueryStorageSourceBase.split() handles this by returning ImmutableList.of() (an empty list of sources).
While this works correctly for purely bounded reads, it breaks any pipeline that wraps this bounded source into an unbounded one (via UnboundedReadFromBoundedSource).
When split() returns an empty list, UnboundedReadFromBoundedSource falls back to wrapping the original unsplit source directly (ImmutableList.of(boundedSource)). However, BigQueryStorageSourceBase.createReader() is not implemented — it unconditionally throws UnsupportedOperationException("BigQuery storage source must be split before reading"), because it is designed to only be read through its per-stream sub-sources (BigQueryStorageStreamSource). This causes the pipeline to get stuck in a loop of exceptions.
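The interaction described above can be illustrated with a small, self-contained sketch. These are hypothetical simplified stand-ins, not Beam's actual classes or API surface; they only model the control flow of `split()`, the adapter's fallback, and the unconditional throw in `createReader()`:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for BigQueryStorageSourceBase,
// modelling only the failure mode described above.
class StorageSource {
    // Mirrors split(): an empty table yields a ReadSession with zero
    // streams, so no per-stream sub-sources are produced.
    List<StorageSource> split(int streamCount) {
        if (streamCount == 0) {
            return Collections.emptyList(); // the problematic case
        }
        return Collections.nCopies(streamCount, new StorageSource());
    }

    // Mirrors createReader(): the parent source is never meant to be
    // read directly, only through its per-stream sub-sources.
    void createReader() {
        throw new UnsupportedOperationException(
            "BigQuery storage source must be split before reading");
    }
}

public class EmptySplitDemo {
    public static void main(String[] args) {
        StorageSource source = new StorageSource();
        List<StorageSource> subSources = source.split(0);

        // Mirrors the fallback in UnboundedReadFromBoundedSource: an
        // empty split result means the original source is wrapped as-is.
        List<StorageSource> toRead = subSources.isEmpty()
            ? Collections.singletonList(source)
            : subSources;

        try {
            toRead.get(0).createReader(); // throws on every attempt
        } catch (UnsupportedOperationException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Because the adapter retries reading from the wrapped source, the exception recurs on every attempt, which is the loop observed in the logs.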
Environment
- Apache Beam SDK version: 2.70.0 (reproduced; likely affects all versions with DIRECT_READ support)
- Runner: Dataflow Runner v1
Steps to Reproduce
- Create a Beam pipeline that uses `BigQueryIO.read().withMethod(Method.DIRECT_READ)` to read from a BigQuery table.
- Use this read in a streaming pipeline.
- Ensure the target BigQuery table is empty, or returns zero rows after `RowRestriction` is applied.
- Run the pipeline.
Observed Behavior
- The pipeline gets stuck, and the following error appears repeatedly in the logs:

```
Caused by: java.lang.UnsupportedOperationException: BigQuery storage source must be split before reading
    org.apache.beam.sdk.io.gcp.bigquery.BigQueryStorageSourceBase.createReader(BigQueryStorageSourceBase.java:200)
    org.apache.beam.sdk.io.gcp.bigquery.BigQueryStorageTableSource.createReader(BigQueryStorageTableSource.java:42)
    org.apache.beam.sdk.util.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$ResidualSource.advance(UnboundedReadFromBoundedSource.java:497)
```
Expected Behavior
The pipeline should not stall: downstream operations should proceed normally, simply processing zero elements from this branch.
Additional Context
Chain of events
- `BigQueryIO.read().withMethod(Method.DIRECT_READ)` uses `BigQueryStorageSourceBase` as the underlying `BoundedSource`.
- `split()` builds a `CreateReadSessionRequest` and sends it to the server.
- The server returns a `ReadSession` with `streams_count = 0` for an empty table.
- `split()` returns `ImmutableList.of()`.
- `UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter` receives an empty list of sub-sources and falls back to wrapping the original unsplit source directly (`ImmutableList.of(boundedSource)`).
- When it calls `createReader()` on this `BigQueryStorageSourceBase`, the method throws `UnsupportedOperationException("BigQuery storage source must be split before reading")`, because `BigQueryStorageSourceBase` does not implement `createReader()`; it is designed to be read only through its per-stream sub-sources.
- The job gets stuck in a loop, repeatedly hitting this exception without making progress. The watermark never advances, and the pipeline must be cancelled manually.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner