UPDATED PROPOSAL: Enforce character string data type for all type_id output_type properties
#52
Replies: 6 comments 2 replies
---
This basically seems OK to me. I'm still not fully clear on how this might impact downstream operations, like plotting. Say, for example, you have a hub that has some quantile forecasts and some categorical CDF forecasts, so the `type_id` column is treated as character. Then to plot those quantile forecasts we would still have to coerce the quantiles in the `type_id` column to numeric, right? And that operation would be specific to the calculated `type_id` data type class for the hub overall?
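For concreteness, a minimal sketch of that coercion step, assuming a model-output table laid out as in the proposal below (column names and data are illustrative):

```r
library(dplyr)

# Illustrative model-output data: the hub mixes quantile and categorical
# CDF output types, so type_id has been read as character throughout.
model_out <- tibble::tibble(
  output_type = c("quantile", "quantile", "cdf", "cdf"),
  type_id     = c("0.25", "0.75", "low", "high"),
  value       = c(10.2, 14.8, 0.3, 0.9)
)

# Before plotting the quantile forecasts, coerce type_id back to numeric.
quantiles <- model_out %>%
  filter(output_type == "quantile") %>%
  mutate(type_id = as.numeric(type_id))
```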
---
Yes, if the
---
To try to flesh out the example a bit more, I think it might help to think about the table that @annakrystalli included at the top as broken down by target type. The following table determines, given a target type, which output types are valid. [1]

[1] I note that in the current version of the documentation we have "categorical" and "ordinal" as target types but not "nominal", but I think this should be changed.

Some additional notes/explanations on the above table:

- Noting in general that the above roughly aligns with the "Valid prediction types by target type" table for Zoltar as well.
---
Thanks for the useful comment @nickreich! This is all great info for our docs. A quick note that in the upcoming schema version (v1.0.0), possible target types have been amended to
---
Given we seem to be in agreement, I will go ahead and start implementing this in the package.
---
I've had an idea of how to deal with the issues raised regarding the non-stability of the `type_id` column data type. What if we introduced an argument for fixing the `type_id` data type when connecting to a hub? A minimal sketch follows.

This way, users who want to develop more dependable downstream code can do so by fixing the `type_id` data type themselves.
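A rough sketch of what that could look like; the `type_id_datatype` argument name and its options are hypothetical, not settled API:

```r
# Hypothetical sketch: an argument on a connect_hub()-style function that
# lets users fix the type_id column data type instead of relying on the
# type determined from tasks.json. Names here are illustrative only.
connect_hub <- function(hub_path,
                        type_id_datatype = c("from_config", "character",
                                             "double", "integer")) {
  type_id_datatype <- match.arg(type_id_datatype)
  if (type_id_datatype == "from_config") {
    # default: derive the data type from the output types in tasks.json
    type_id_datatype <- "character" # placeholder for the derived type
  }
  # ... open the arrow dataset with a schema using type_id_datatype ...
  type_id_datatype
}

# Users wanting stable downstream code can pin the type explicitly:
connect_hub("path/to/hub", type_id_datatype = "character")
```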
---
Background
The ability to have categorical variable values as entries to `type_id` or `value` in some output types introduces a risk of data type inconsistencies in both the `type_id` and `value` columns in model-output data. This can affect the ability to open an arrow dataset successfully and consistently, as sketched below.
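A minimal illustration of the failure mode, assuming two submission files whose `type_id` columns were written with different data types (file names and values are made up):

```r
library(arrow)

# One team's quantile file stores type_id as double ...
write_parquet(
  data.frame(type_id = c(0.25, 0.5, 0.75), value = c(10, 12, 14)),
  "teamA.parquet"
)
# ... while another team's CDF file stores type_id as character.
write_parquet(
  data.frame(type_id = c("low", "high"), value = c(0.3, 0.9)),
  "teamB.parquet"
)

# The dataset opens against the first file's schema, but collecting the
# data errors because the second file's type_id column does not match.
ds <- open_dataset(c("teamA.parquet", "teamB.parquet"))
# dplyr::collect(ds) # -> type mismatch error
```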
The table below lists the possibility of encountering character/numeric data types in the `type_id` or `value` columns for each output type (`x` = possible, `-` = not possible).

Ideally:

- The `type_id` column data type is predictable. This is made difficult by mixing output types whose `type_id` data types differ, especially `character` with `numeric`.
- The `value` column always remains numeric. This is not possible for `sample` output types of categorical or epiweek variables.

Options
A few options to handle this were proposed and discussed.
Option A: enforce `type_id` to be a character column and `value` to remain numeric

PROs

- It's the simplest approach.

CONs

- Not possible to store `value` for `sample` output types of categorical variables.
- Downstream operations on numeric `type_id`s (e.g. filtering or plotting quantiles) require coercing `type_id` to numeric first.

Example:
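An illustrative example of model-output data under Option A (values made up); note that every `type_id` entry is a string, including the quantile levels:

```r
# Quantile levels and CDF category labels share one character type_id column.
tibble::tribble(
  ~output_type, ~type_id, ~value,
  "quantile",   "0.25",   10.2,
  "quantile",   "0.75",   14.8,
  "cdf",        "low",    0.3,
  "cdf",        "high",   0.9
)
```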
Option B
Another option discussed was to introduce an additional `type_id_label` column for storing character labels for categorical variables, mapping them to integer indices in the `type_id` column.

PROs

- Predictable data types: a numeric `type_id` and a character `type_id_label` column.
- Including `type_id_label` as a column ensures data files are self-contained, without needing to look up what values `type_id` indices map to.
- `type_id`s are not required to be converted to character to filter/analyse.
- Enables `sample` output types for categorical variables, as `type_id` integers can be sampled and stored in the `value` column.

CONs

- Requires validating that mappings between `type_id` and `type_id_label`s are consistent across rounds.
- To determine which categories the integers in the `value` column map to in a `sample` output type of a categorical variable, an additional column (e.g. `value_labels`) is required.

DECISION: OPTION A
It was decided to go with Option A as it is the simplest to implement at this stage. It means that `sample` output types for categorical variables are not currently supported, but that was deemed acceptable for now due to the rarity of that situation.

However, when looking into implementation, some aspects of the approach feel lacking.
The most jarring implementation feature would be the changes in the `tasks.json` config files:
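For illustration, a hypothetical sketch of how a `quantile` output type's `type_id` property might have to be declared in the schema under Option A (the property layout is simplified, not the actual schema):

```json
"type_id": {
  "required": {
    "type": "array",
    "items": { "type": "string" }
  },
  "optional": {
    "type": "array",
    "items": { "type": "string" }
  }
}
```

Were `items` of type `number`, the `minimum` and `maximum` keywords could document that quantile levels lie between 0 and 1; with `string` items those keywords have no effect.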
By enforcing `type_id` to be a character in all cases, the schema for `type_id` in some output types becomes really clunky. For example, in `quantile` not only does the `type` of the `required` and `optional` arrays change to `string`, but it is no longer possible to enforce/document a minimum and maximum value through the schema, as those keywords have no meaning for string types.

While these checks can of course be carried out in R rather than automatically through validation against the schema, the lack of encoding/documentation of the criteria within the schema feels very jarring, unintuitive and inefficient. It will have to be documented somewhere else, which feels clunky.
Proposal:
While consistency and predictability are indeed important for interacting with model-output data, fixing the `type_id` column to `character` may not be the best approach.

Instead, I propose that we use the group of output types specified in a hub's `tasks.json` across all rounds to determine whether a hub should have a character `type_id` column or not.

The proposal follows the principle of type coercion in R:
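As a quick refresher (values made up), base R coerces mixed vectors to the most general type present:

```r
c(1L, 2L)            # integer
c(1L, 2.5)           # double: integer is promoted
c(0.25, "EW202301")  # character: the most general type wins
```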
Principles
- `type_id` could be `integer`, `double` or `character`, dependent on the group of output types a hub uses.
- The `type_id` data type a given hub should have can predictably be determined by examining the collection of output types defined in the `tasks.json`, instead of being hard coded through the schema.

For example:

- A hub with output types `mean`, `median` and `quantile` would have a double `type_id` column.
- A hub that includes a `cdf` output type of peak week, with `type_id` specified as epiweeks (e.g. `EW202301`), would have a character `type_id` column.

A sketch of this determination rule follows.
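A minimal sketch of such a rule, assuming a helper (hypothetical, not package API) that receives the `type_id` values declared across all output types and rounds:

```r
# Hypothetical helper: given the type_id values declared for each output
# type in tasks.json (across all rounds), return the simplest data type
# that can represent them all, following R's coercion hierarchy.
determine_type_id_type <- function(type_ids_by_output_type) {
  classes <- vapply(
    type_ids_by_output_type,
    function(x) class(x)[1],
    character(1)
  )
  if (any(classes == "character")) return("character")
  if (any(classes == "numeric")) return("double")
  "integer"
}

# A mean/median/quantile hub resolves to double:
determine_type_id_type(list(quantile = c(0.25, 0.5, 0.75)))
#> [1] "double"

# Adding a cdf output type over epiweeks resolves to character:
determine_type_id_type(list(
  quantile = c(0.25, 0.5, 0.75),
  cdf = c("EW202301", "EW202302")
))
#> [1] "character"
```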
The predictability means the `tasks.json` config file can be used to consistently cast the `type_id` column to the appropriate data type when connecting to the hub.

Applying such rules and making the determination of the `type_id` data type predictable makes it easier to document. We would just transparently document the rules the software uses for determining the data type when connecting to the hub. I feel that's much easier to explain than having to explain why numeric output types must be defined as strings in `tasks.json`.

The fact that we use the collection of `type_id`s across all output types means that individual output type `type_id` properties can be defined and checked individually, according to the criteria `type_id`s for a given output type should adhere to.

Bonus benefit!
I've drawn up draft functionality in the form of the function `build_arrow_schema()` in branch `schema-from-config` to create a schema for use when opening an arrow dataset connection.

The fact that the determination is made from the collection of output types across all rounds means we don't have to enforce a string type for all `type_id` properties across all output types. We can therefore leave the schema as is and continue to use it to accurately encode the expectations of data for given output types.

The functionality required to use the config file to determine `type_id` can also be used to determine the correct data type for all columns and generate an overarching schema. This in turn can be used to open datasets of multiple file formats and combine them into a single hub connection.

Created on 2023-04-26 with reprex v2.0.2
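As a rough sketch (not the original reprex) of how such a schema could be used, assuming `build_arrow_schema()` returns an arrow schema; the field names below are illustrative:

```r
library(arrow)

# Illustrative hand-built equivalent of what build_arrow_schema() might
# return for a hub whose type_id resolves to character:
hub_schema <- schema(
  origin_date = date32(),
  output_type = utf8(),
  type_id = utf8(),
  value = float64()
)

# Supplying the schema up front means every file in the dataset is read
# with consistent column types, regardless of what each file would infer.
ds <- open_dataset("model-output/", format = "parquet", schema = hub_schema)
```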
Let me know what you think of this proposal, and of course I'm happy to answer questions if anything is unclear.