Reconciled Wikidata IDs Difficult to Retrieve via SPARQL #419

@SCN-MNG

Currently, we are storing the reconciled Wikidata QID in two different ways:

  1. We store the QID with exact match (P2888) whenever the dataset already has a URI for the entity (e.g. https://www.diamm.ac.uk/people/1)
  2. We replace all instances of a particular string (e.g. "J. S. Bach") with a QID if the dataset does not have a URI for that particular entity.

Case 1: Dataset contains URI for entity — using exact match (P2888)

When a dataset already defines a URI (e.g., https://www.diamm.ac.uk/people/1), we use the Wikidata property exact match (P2888) to link that local URI to the corresponding Wikidata entity. This approach ensures the original URI from the dataset is preserved.

For example, if https://www.diamm.ac.uk/people/1 was reconciled to https://www.wikidata.org/entity/Q1339, there would be a triple in our graph stating:

<https://www.diamm.ac.uk/people/1> wdt:P2888 <https://www.wikidata.org/entity/Q1339> .

However, all other statements about this entity use only the original dataset URI, never the Wikidata URI. For example:

<https://www.diamm.ac.uk/people/1> wdt:P569 "1200-01-01T00:00:00Z"^^xsd:dateTime .
<https://www.diamm.ac.uk/people/1> wdt:P1449 "Beltrandus de Francia" .
<https://www.diamm.ac.uk/sources/1> wdt:P50 <https://www.diamm.ac.uk/people/1> .

In this case, a SPARQL query must first retrieve the local DIAMM URI and then follow exact match (P2888) from that URI to reach the Wikidata QID.
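
A minimal sketch of that two-step lookup, written against the example triples above. The prefix declaration is not shown in this issue, so binding wdt: to the usual Wikidata direct-property namespace is an assumption:

```sparql
# Assumption: wdt: expands to the standard Wikidata direct-property namespace.
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?author ?qid
WHERE {
  # Step 1: the author of the source is stored as a local DIAMM URI.
  <https://www.diamm.ac.uk/sources/1> wdt:P50 ?author .
  # Step 2: follow exact match (P2888) from the local URI to the Wikidata entity.
  ?author wdt:P2888 ?qid .
}
```

With only the Case 1 example triples loaded, ?author would bind to https://www.diamm.ac.uk/people/1 and ?qid to https://www.wikidata.org/entity/Q1339.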

Case 2: Dataset does not contain URI for entity — replacing string with QID

This case applies to most values in our datasets, since it is much more common to have strings rather than URIs.

For example, if "Anonymous" (a plain string with no URI) was reconciled to https://www.wikidata.org/entity/Q4233718, the Wikidata URI would be placed directly in the triple:

<https://www.diamm.ac.uk/sources/1> wdt:P50 <https://www.wikidata.org/entity/Q4233718> .
<https://www.diamm.ac.uk/compositions/1> wdt:P86 <https://www.wikidata.org/entity/Q4233718> .

In this case, the SPARQL query retrieves the Wikidata URI directly, with no intermediate lookup.
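
For contrast with Case 1, a sketch of the direct lookup for the example triples above. As before, the wdt: namespace IRI is an assumption, since the issue does not show the prefix declarations:

```sparql
# Assumption: wdt: expands to the standard Wikidata direct-property namespace.
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?author
WHERE {
  # The object of P50 is already the Wikidata URI, so no P2888 hop is needed.
  <https://www.diamm.ac.uk/sources/1> wdt:P50 ?author .
}
```

If the P2888 pattern from Case 1 were added to this query, it would presumably return no rows for this entity, because the graph as described holds no exact match triple for a value that was replaced in place.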

Problem with Having Two Different Schemas

Storing QIDs in two different ways confuses the LLM, since the two cases require different SPARQL query patterns.

Another issue is that a SPARQL query can retrieve a mix of Wikidata URIs and local URIs, instead of retrieving the complete set of one or the other.

For example, this query retrieves a mix of Wikidata URIs and The Global Jukebox URIs:

SELECT ?culture
WHERE {
  GRAPH gj: {        
    ?ensemble a gj:Ensemble ;
              wdt:P2596 ?culture .
  }
}
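
To illustrate what a single query has to do to cope with both schemas, here is one possible query-time workaround. It is a sketch, not something this issue proposes. It uses the DIAMM P50 example rather than the query above, because the IRI behind the gj: prefix is not given here; the wdt: namespace IRI is likewise an assumption:

```sparql
# Assumption: wdt: expands to the standard Wikidata direct-property namespace.
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?source ?entity
WHERE {
  ?source wdt:P50 ?author .
  # Case 1: ?author is a local URI that may carry an exact match link.
  OPTIONAL { ?author wdt:P2888 ?match . }
  # Prefer the linked Wikidata URI; otherwise keep the value as stored (Case 2).
  BIND(COALESCE(?match, ?author) AS ?entity)
}
```

A local URI that was never reconciled has no P2888 triple, so it would still come back as a local URI. The pattern hides the difference between the two cases but does not remove it.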
