Clarifying the point of the CVs #201

Related to #177 and #110

## The problem

I think it is still not clear what exactly the CVs are, particularly where in the information chain they sit.

## What I hope to get out of this

I would like to get clarity about how the CVs are meant to evolve and be used. Not knowing this has sent @durack1 and me round in circles on multiple occasions. I think it is also the key barrier to building better tooling around the CVs: it has caused issues for me building https://github.com/PCMDI/input4MIPs_CVs, and I think it is also causing issues with the tooling efforts in https://github.com/WCRP-CMIP/WCRP-universe/.

## The use modes that I think cause confusion

### Use mode implied by the name: defining allowed values

The name 'controlled vocabularies' suggests that these are the allowed set of values. In other words, the CVs define the values people can use, and it is then up to users to combine them as they wish.

### Use mode in practice: an information source

This is what I think actually happens in practice. For example, in #177, the request was to add all information to the CVs so they could be used as the source from which to create citation entries.

In this second use mode, the CVs are seen as a source of information. In other words, the CVs don't define allowed values, they define **the** values.
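To make the two use modes concrete, here is a minimal sketch. Every name and value in it (`ALLOWED_SOURCE_IDS`, `SOURCE_ID_RECORDS`, the example source IDs, the `contact` and `further_info_url` fields) is made up for illustration and is not taken from the actual CVs.

```python
# Hypothetical CV content, for illustration only (not taken from the real CVs).

# Use mode 1: the CV only lists which values are allowed.
ALLOWED_SOURCE_IDS = {"example-source-1-0", "example-source-2-0"}

# Use mode 2: the CV holds the information itself.
SOURCE_ID_RECORDS = {
    "example-source-1-0": {
        "contact": "someone@example.org",
        "further_info_url": "https://example.org/source-1",
    },
}


def is_allowed(source_id: str) -> bool:
    """Use mode 1: check a value supplied by the user against the CV."""
    return source_id in ALLOWED_SOURCE_IDS


def lookup(source_id: str) -> dict:
    """Use mode 2: the CV is the source of truth, so values are read from it."""
    return SOURCE_ID_RECORDS[source_id]
```

In the first mode the user brings the values and the CV only says yes or no. In the second mode the user brings a key and the CV supplies everything else, which is what generating citation entries from the CVs would require.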

## Why I think this ambiguity causes confusion

Put very simply, it isn't clear whether the CVs define the schema for our data or whether they are the data. This is a problem because the two serve different purposes. You can use a schema to define data structures, which lets tools build on the structure it provides. You can use the data as an information source. However, you can't treat one thing as both at once because it defeats the point (e.g. if you're constantly adding new fields to your schemas, then you're constantly breaking downstream use; if you can never add new information to your data, then the data source isn't very useful).

## A way out of this

I think @wolfiex has basically already shown the way out of this with the direction that https://github.com/WCRP-CMIP/WCRP-universe is heading: use something like JSON-LD consistently throughout. That provides both the schema and the data, in a way that clearly separates the two.

As part of this effort, it would be great to clarify what the schema for source ID is. #177 has shown that the current understanding in this repo is not sufficient for creating citations (hence the consideration of #200). However, is the proposed structure something that is going to be rolled out across all CVs, is that structure already in place everywhere and we have just missed it in input4MIPs, or are we building something custom right now without any clear understanding of how this will scale beyond input4MIPs? (@jitendra-kumar maybe you already have an idea about this.)
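As a rough picture of what "both the schema and the data, clearly separated" could look like in a JSON-LD-style document, consider the sketch below. It is an assumption-laden illustration only: the `https://example.org/...` URLs, the `SourceID` type and the `contact` field are placeholders, and it does not reflect the structure actually used in WCRP-universe. The only JSON-LD features it relies on are the standard keywords `@context`, `@vocab`, `@id` and `@type`.

```python
import json

# Hypothetical JSON-LD-style entry. URLs, type and field names are placeholders.
document = {
    # Points at where the terms used below are defined (the "schema" side).
    "@context": {"@vocab": "https://example.org/cv-schema/"},
    # Identifies this entry and says what kind of thing it is.
    "@id": "https://example.org/source-id/example-source-1-0",
    "@type": "SourceID",
    # The values themselves (the "data" side).
    "contact": "someone@example.org",
}


def split_entry(doc: dict) -> tuple[dict, dict]:
    """Separate JSON-LD keywords (keys starting with '@') from plain data keys."""
    keywords = {k: v for k, v in doc.items() if k.startswith("@")}
    data = {k: v for k, v in doc.items() if not k.startswith("@")}
    return keywords, data


keywords, data = split_entry(document)
print(json.dumps(data, indent=2))
```

The point of the sketch is only that the definitions and the values live in visibly different places, so adding a new entry does not require touching the definitions and vice versa.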

### An implication

One implication of having the CVs define both the schema and the values is that data providers have to register everything in advance. They can't just make data compliant with the schema and turn up; they also have to pre-register their metadata values. That's ok, but it is an extra step (that might take getting used to?), which makes validation tools really important so people can get clear, consistent feedback about what they need to fix and can validate data themselves (rather than having to do heaps of iterations, which can be very slow and frustrating for everyone).
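A minimal sketch of the kind of self-service validation meant here, assuming a made-up schema and a made-up set of pre-registered values (`SCHEMA`, `REGISTERED_SOURCE_IDS` and the field names are hypothetical, not the real CVs). It checks both the structure and the registration, and reports every problem at once rather than stopping at the first one.

```python
# Hypothetical schema and registered values, for illustration only.
SCHEMA = {"source_id": str, "contact": str}
REGISTERED_SOURCE_IDS = {"example-source-1-0"}


def validate(metadata: dict) -> list[str]:
    """Return every problem found, so it can all be fixed in one iteration."""
    errors = []
    # Schema check: required fields and their types.
    for field, expected_type in SCHEMA.items():
        if field not in metadata:
            errors.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(metadata[field]).__name__}"
            )
    # Schema check: no fields beyond those the schema defines.
    for field in metadata:
        if field not in SCHEMA:
            errors.append(f"unknown field: {field}")
    # Data check: the value must have been registered in advance.
    source_id = metadata.get("source_id")
    if isinstance(source_id, str) and source_id not in REGISTERED_SOURCE_IDS:
        errors.append(f"source_id not registered: {source_id}")
    return errors
```

For example, `validate({"source_id": "new-source", "extra": 1})` reports the missing `contact` field, the unknown `extra` field and the unregistered source ID in a single pass.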

## Conclusion

I think there are basically two key next steps that come from this:

1. Is there already some agreed form for source ID that we should just update to match here (which will then basically make #177 trivial)?
2. Are there any documents that clarify how the CVs work/are built? Even something as simple as the following, put somewhere clearly accessible, would have helped me a lot: "The controlled vocabularies define the metadata used in CMIP. They are composed of the data schemas (e.g. which fields are allowed and what type is expected for each field) as well as the data itself (i.e. what data entries do we have and what do they mean)."

Interested in others' views, pinging @durack1 and @jitendra-kumar, also @taylor13 as this may be relevant for/solved by the CV TT. Please add/tag anyone else that might be interested.
