Description
The problem
I think it is still not clear what exactly the CVs are, particularly where in the information chain they sit.
What I hope to get out of this
I would like to get clarity about how the CVs are meant to evolve and be used. Not knowing this has sent @durack1 and me round in circles on multiple occasions. I think it is also the key barrier to building better tooling around the CVs: it has caused issues for me while building https://github.com/PCMDI/input4MIPs_CVs, and I think it is also causing issues with the tooling efforts in https://github.com/WCRP-CMIP/WCRP-universe/.
The use modes that I think cause confusion
Use mode implied by the name: defining allowed values
The name 'controlled vocabularies' suggests that these are the allowed set of values. In other words, the CVs define the values people can use, and it is then up to users to combine them as they wish.
Use mode in practice: an information source
This is what I think actually happens in practice. For example, in #177, the request was to add all information to the CVs so they could be used as the source from which to create citation entries.
In this second use mode, the CVs are seen as a source of information. In other words, the CVs don't define allowed values; they define the values themselves.
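To make the difference concrete, here is a minimal sketch of the two use modes. The source ID, its fields and the function names are all made up for illustration; none of them are taken from the real CVs.

```python
# Hypothetical CV content: the source ID and its fields are illustrative only.
SOURCE_ID_CV = {
    "EXAMPLE-source-1-0": {
        "contact": "someone@example.org",
        "further_info_url": "https://example.org/source",
    },
}


def is_allowed(source_id: str) -> bool:
    """Use mode 1: the CV is only consulted as a set of allowed values."""
    return source_id in SOURCE_ID_CV


def lookup(source_id: str, field: str) -> str:
    """Use mode 2: the CV is the information source itself."""
    return SOURCE_ID_CV[source_id][field]


allowed = is_allowed("EXAMPLE-source-1-0")
contact = lookup("EXAMPLE-source-1-0", "contact")
```

In the first mode only the keys matter. In the second mode, downstream tools (e.g. something generating citations) depend on which fields each entry carries.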
Why I think this ambiguity causes confusion
Put very simply, it isn't clear whether the CVs define the schema for our data, or whether they are the data. This is obviously a problem. You can use a schema to define data structures, which allows tools to build on the structure it provides. You can also use the data as an information source. However, you can't treat the same thing as both at once, because that defeats the point: if you're constantly adding new fields to your schemas, then you're constantly breaking downstream use; if you can never add new information to your data, then the data source isn't very useful.
A way out of this
I think @wolfiex has basically already shown the way out of this with the direction that https://github.com/WCRP-CMIP/WCRP-universe is heading: use something like JSON-LD consistently throughout. That provides both the schema and the data, in a way that clearly separates the two.
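As a rough illustration of that separation, here is a sketch of a JSON-LD style document built with the Python standard library only. The IRIs, term names and entry are hypothetical and are not taken from WCRP-universe. Note also that `@context` on its own only maps terms to IRIs; constraining the type of each field would presumably need an extra layer on top.

```python
import json

# Hypothetical JSON-LD document: the IRIs, terms and entry are made up.
document = {
    # Vocabulary side: which terms exist and what they refer to.
    "@context": {
        "source_id": "https://example.org/cv/source_id",
        "contact": "https://example.org/cv/contact",
    },
    # Data side: the actual registered entries.
    "@graph": [
        {
            "@id": "https://example.org/cv/source/EXAMPLE-source-1-0",
            "source_id": "EXAMPLE-source-1-0",
            "contact": "someone@example.org",
        },
    ],
}


def split_document(doc):
    """Return the declared terms and the data entries as separate objects."""
    return set(doc["@context"]), doc["@graph"]


def undefined_terms(doc):
    """List keys used in the entries that the context does not declare."""
    terms, entries = split_document(doc)
    used = {key for entry in entries for key in entry if not key.startswith("@")}
    return sorted(used - terms)


terms, entries = split_document(document)
serialised = json.dumps(document, indent=2)
```

The point of the sketch is only that the declared terms and the entries live in clearly different places, so a tool can check one against the other.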
As part of this effort, it would be great to clarify what the schema for source ID is. #177 has shown that the current understanding in this repo is not sufficient for creating citations (hence the consideration of #200). However, it is unclear which situation we are in. Is the proposed structure something that is going to be rolled out across all CVs? Is that structure already in place everywhere, and we have just missed it in input4MIPs? Or are we building something custom right now, without any clear understanding of how this will scale beyond input4MIPs? (@jitendra-kumar, maybe you already have an idea about this.)
An implication
One implication of having the CVs define both the schema and the values is that data providers have to register everything in advance. They can't just make data compliant with the schema and turn up; they also have to pre-register their metadata values. That's ok, but it is an extra step (that might take getting used to?). It makes validation tools really important, so that people can get clear, consistent feedback about what they need to fix and can validate data themselves, rather than having to do heaps of iterations, which can be very slow and frustrating for everyone.
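A minimal sketch of what such a self-service check could look like, assuming a hypothetical schema (`REQUIRED_FIELDS`) and a hypothetical set of registered values (`REGISTERED_SOURCE_IDS`). Neither reflects the actual CVs or any existing validation tool.

```python
# Hypothetical schema and registry: field names and values are illustrative.
REQUIRED_FIELDS = {"source_id": str, "contact": str}
REGISTERED_SOURCE_IDS = {"EXAMPLE-source-1-0"}


def validate(metadata: dict) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    errors = []

    # Step 1: schema compliance (fields present, with the expected type).
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in metadata:
            errors.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(metadata[field]).__name__}"
            )

    # Step 2: pre-registration (the value must already be in the CVs).
    source_id = metadata.get("source_id")
    if isinstance(source_id, str) and source_id not in REGISTERED_SOURCE_IDS:
        errors.append(f"source_id not registered: {source_id}")

    return errors
```

Reporting all problems in one pass, rather than stopping at the first, is a deliberate choice here: it is meant to cut down the number of iterations a data provider needs.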
Conclusion
I think there are basically two key next steps that come from this:
- Is there already some agreed form for source ID that we should just update to match here (which will then basically make DOE OSTI DOIs for input4MIPs #177 trivial)?
- Are there any documents that clarify how the CVs work/are built? Even something as simple as the following, put somewhere clearly accessible, would have helped me a lot: "The controlled vocabularies define the metadata used in CMIP. They are composed of the data schemas (e.g. which fields are allowed and what type is expected for each field) as well as the data itself (i.e. what data entries do we have and what do they mean)."
Interested in others' views. Pinging @durack1 and @jitendra-kumar, and also @taylor13 as this may be relevant for/solved by the CV TT. Please add/tag anyone else who might be interested.