UPDATED PROPOSAL: Enforce character string data type for all type_id output_type properties
#52
Replies: 6 comments 2 replies
---
This basically seems OK to me. I'm still not fully clear on how this might impact downstream operations, like plotting. Say, for example, you have a hub that has some quantile forecasts and some categorical CDF forecasts, so the `type_id` column is treated as character. Then to plot those quantile forecasts we would still have to coerce the quantiles in the `type_id` column to numeric, right? And that operation would be specific to the calculated `type_id` data type class for the hub overall?
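For concreteness, a minimal sketch of that coercion step, assuming a model-output table laid out as in the proposal below (column names and data are illustrative):

```r
library(dplyr)

# Illustrative model-output data: the hub mixes quantile and categorical
# CDF output types, so type_id has been read as character throughout.
model_out <- tibble::tibble(
  output_type = c("quantile", "quantile", "cdf", "cdf"),
  type_id     = c("0.25", "0.75", "low", "high"),
  value       = c(10.2, 14.8, 0.3, 0.9)
)

# Before plotting the quantile forecasts, coerce type_id back to numeric.
quantiles <- model_out %>%
  filter(output_type == "quantile") %>%
  mutate(type_id = as.numeric(type_id))
```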
---
Yes, if the
---
To try to flesh out the example a bit more, I think it might help to think about the table that @annakrystalli included at the top as broken down by target type. The following table determines, given a target type, which output types are valid. [1]

[1] I note that in the current version of the documentation we have "categorical" and "ordinal" as target types but not "nominal", but I think this should be changed.

Some additional notes/explanations on the above table:

- Noting in general that the above roughly aligns with the "Valid prediction types by target type" table for Zoltar as well.
---
Thanks for the useful comment @nickreich! This is all great info for our docs. A quick note that in the upcoming schema version (v1.0.0), possible target types have been amended to
---
Given we seem to be in agreement, I will go ahead and start implementing this in the package.
---
I've had an idea of how to deal with the issues raised regarding the non-stability of the `type_id` column data type. What if we introduced an argument for fixing the `type_id` data type when connecting to a hub? A minimal sketch follows.

This way, users who want to develop more dependable downstream code can do so by fixing the `type_id` data type themselves.
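A rough sketch of what that could look like; the `type_id_datatype` argument name and its options are hypothetical, not settled API:

```r
# Hypothetical sketch: an argument on a connect_hub()-style function that
# lets users fix the type_id column data type instead of relying on the
# type determined from tasks.json. Names here are illustrative only.
connect_hub <- function(hub_path,
                        type_id_datatype = c("from_config", "character",
                                             "double", "integer")) {
  type_id_datatype <- match.arg(type_id_datatype)
  if (type_id_datatype == "from_config") {
    # default: derive the data type from the output types in tasks.json
    type_id_datatype <- "character" # placeholder for the derived type
  }
  # ... open the arrow dataset with a schema using type_id_datatype ...
  type_id_datatype
}

# Users wanting stable downstream code can pin the type explicitly:
connect_hub("path/to/hub", type_id_datatype = "character")
```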
---
Background
The ability to have categorical variable values as entries to `type_id` or `value` in some output types introduces a risk of data type inconsistencies in both the `type_id` and `value` columns in model-output data. This can affect the ability to open an arrow dataset successfully and consistently, as sketched below.
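A minimal illustration of the failure mode, assuming two submission files whose `type_id` columns were written with different data types (file names and values are made up):

```r
library(arrow)

# One team's quantile file stores type_id as double ...
write_parquet(
  data.frame(type_id = c(0.25, 0.5, 0.75), value = c(10, 12, 14)),
  "teamA.parquet"
)
# ... while another team's CDF file stores type_id as character.
write_parquet(
  data.frame(type_id = c("low", "high"), value = c(0.3, 0.9)),
  "teamB.parquet"
)

# The dataset opens against the first file's schema, but collecting the
# data errors because the second file's type_id column does not match.
ds <- open_dataset(c("teamA.parquet", "teamB.parquet"))
# dplyr::collect(ds) # -> type mismatch error
```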
The table below lists the possibility of encountering character/numeric data types in the `type_id` or `value` columns for each output type (`x` = possible, `-` = not possible).

Ideally:

- The `type_id` column data type is predictable. This is made difficult by mixing output types whose `type_id` data types differ, especially `character` with `numeric`.
- The `value` column always remains numeric. This is not possible for `sample` output types of categorical or epiweek variables.

Options
A few options to handle this were proposed and discussed.
Option A: enforce `type_id` to be a character column and `value` to remain numeric

PROs

- It's the simplest approach.

CONs

- Not possible to store `value` for `sample` output types of categorical variables.
- Downstream operations on numeric `type_id`s (e.g. filtering or plotting quantiles) require coercing `type_id` to numeric first.

Example:
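An illustrative example of model-output data under Option A (values made up); note that every `type_id` entry is a string, including the quantile levels:

```r
# Quantile levels and CDF category labels share one character type_id column.
tibble::tribble(
  ~output_type, ~type_id, ~value,
  "quantile",   "0.25",   10.2,
  "quantile",   "0.75",   14.8,
  "cdf",        "low",    0.3,
  "cdf",        "high",   0.9
)
```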
Option B
Another option discussed was to introduce an additional `type_id_label` column for storing character labels for categorical variables, mapping them to integer indices in the `type_id` column.

PROs

- Predictable data types: a numeric `type_id` and a character `type_id_label` column.
- Including `type_id_label` as a column ensures data files are self-contained, without needing to look up what values `type_id` indices map to.
- `type_id`s are not required to be converted to character to filter/analyse.
- Enables `sample` output types for categorical variables, as `type_id` integers can be sampled and stored in the `value` column.

CONs

- Requires validating that mappings between `type_id` and `type_id_label`s are consistent across rounds.
- To determine which categories the integers in the `value` column map to in a `sample` output type of a categorical variable, an additional column (e.g. `value_labels`) is required.

DECISION: OPTION A
It was decided to go with Option A as it is the simplest to implement at this stage. It means that `sample` output types for categorical variables are not currently supported, but that was deemed acceptable for now due to the rarity of that situation.

However, when looking into implementation, some aspects of the approach feel lacking.
The most jarring implementation feature would be the changes in the `tasks.json` config files:
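For illustration, a hypothetical sketch of how a `quantile` output type's `type_id` property might have to be declared in the schema under Option A (the property layout is simplified, not the actual schema):

```json
"type_id": {
  "required": {
    "type": "array",
    "items": { "type": "string" }
  },
  "optional": {
    "type": "array",
    "items": { "type": "string" }
  }
}
```

Were `items` of type `number`, the `minimum` and `maximum` keywords could document that quantile levels lie between 0 and 1; with `string` items those keywords have no effect.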
By enforcing `type_id` to be a character in all cases, the schema for `type_id` in some output types becomes really clunky. For example, in `quantile` not only does the `type` of the `required` and `optional` arrays change to `string`, but it is no longer possible to enforce/document a minimum and maximum value through the schema, as those keywords have no meaning for string types.

While these checks can of course be carried out in R rather than automatically through validation against the schema, the lack of encoding/documentation of the criteria within the schema feels very jarring, unintuitive and inefficient. It will have to be documented somewhere else, which feels clunky.
Proposal:
While consistency and predictability are indeed important for interacting with model-output data, fixing the `type_id` column to `character` may not be the best approach.

Instead, I propose that we use the group of output types specified in a hub's `tasks.json` across all rounds to determine whether a hub should have a character `type_id` column or not.

The proposal follows the principle of type coercion in R:
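As a quick refresher (values made up), base R coerces mixed vectors to the most general type present:

```r
c(1L, 2L)            # integer
c(1L, 2.5)           # double: integer is promoted
c(0.25, "EW202301")  # character: the most general type wins
```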
Principles
- `type_id` could be `integer`, `double` or `character`, dependent on the group of output types a hub uses.
- The `type_id` data type a given hub should have can predictably be determined by examining the collection of output types defined in the `tasks.json`, instead of being hard coded through the schema.

For example:

- A hub with output types `mean`, `median` and `quantile` would have a double `type_id` column.
- A hub that includes a `cdf` output type of peak week, with `type_id` specified as epiweeks (e.g. `EW202301`), would have a character `type_id` column.

A sketch of this determination rule follows.
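A minimal sketch of such a rule, assuming a helper (hypothetical, not package API) that receives the `type_id` values declared across all output types and rounds:

```r
# Hypothetical helper: given the type_id values declared for each output
# type in tasks.json (across all rounds), return the simplest data type
# that can represent them all, following R's coercion hierarchy.
determine_type_id_type <- function(type_ids_by_output_type) {
  classes <- vapply(
    type_ids_by_output_type,
    function(x) class(x)[1],
    character(1)
  )
  if (any(classes == "character")) return("character")
  if (any(classes == "numeric")) return("double")
  "integer"
}

# A mean/median/quantile hub resolves to double:
determine_type_id_type(list(quantile = c(0.25, 0.5, 0.75)))
#> [1] "double"

# Adding a cdf output type over epiweeks resolves to character:
determine_type_id_type(list(
  quantile = c(0.25, 0.5, 0.75),
  cdf = c("EW202301", "EW202302")
))
#> [1] "character"
```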
The predictability means the `tasks.json` config file can be used to consistently cast the `type_id` column to the appropriate data type when connecting to the hub.

Applying such rules and making the determination of the `type_id` data type predictable makes it easier to document. We would just transparently document the rules the software uses for determining the data type when connecting to the hub. I feel that's much easier to explain than having to explain why numeric output types must be defined as strings in `tasks.json`.

The fact that we use the collection of `type_id`s across all output types means that individual output type `type_id` properties can be defined and checked individually, according to the criteria `type_id`s for a given output type should adhere to.

Bonus benefit!
I've drawn up draft functionality in the form of the function `build_arrow_schema()` in branch `schema-from-config` to create a schema for use when opening an arrow dataset connection.

The fact that the determination is made from the collection of output types across all rounds means we don't have to enforce a string type for all `type_id` properties across all output types. We can therefore leave the schema as is and continue to use it to accurately encode the expectations of data for given output types.

The functionality required to use the config file to determine `type_id` can also be used to determine the correct data type for all columns and generate an overarching schema. This in turn can be used to open datasets of multiple file formats and combine them into a single hub connection.

Created on 2023-04-26 with reprex v2.0.2
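As a rough sketch (not the original reprex) of how such a schema could be used, assuming `build_arrow_schema()` returns an arrow schema; the field names below are illustrative:

```r
library(arrow)

# Illustrative hand-built equivalent of what build_arrow_schema() might
# return for a hub whose type_id resolves to character:
hub_schema <- schema(
  origin_date = date32(),
  output_type = utf8(),
  type_id = utf8(),
  value = float64()
)

# Supplying the schema up front means every file in the dataset is read
# with consistent column types, regardless of what each file would infer.
ds <- open_dataset("model-output/", format = "parquet", schema = hub_schema)
```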
Let me know what you think of this proposal, and of course I'm happy to answer questions if anything is unclear.