Provenance section #42

Hello,

While processing some HGV metadata EpiDocs from papyri.info, I noticed most records include a `<provenance>` section, such as this one:

[HGV 140713](https://github.com/papyri/idp.data/blob/master/HGV_meta_EpiDoc/HGV141/140713.xml)
```xml
               <history>
                  <origin>
                     <origPlace>Soknopaiu Nesos (Arsinoites)</origPlace>
                     <origDate when="0144">144</origDate>
                  </origin>
                  <provenance type="located">
                     <p xml:id="geoIF5AB8">
                        <placeName n="1"
                                   type="ancient"
                                   ref="https://www.trismegistos.org/place/2157 https://pleiades.stoa.org/places/737053">Soknopaiu Nesos</placeName>
                        <placeName n="2"
                                   type="ancient"
                                   subtype="nome"
                                   ref="https://www.trismegistos.org/place/332 https://pleiades.stoa.org/places/736893">Arsinoites</placeName>
                        <placeName type="ancient" subtype="region">Ägypten</placeName>
                     </p>
                  </provenance>
               </history>
```

For some use cases, it might be nice to parse the provenance to get more detailed location info and/or Trismegistos and Pleiades location IDs. Is this something that falls within the project's goals?

For reference, I get the location in [my project](https://github.com/willem640/papyri-infilling/blob/main/preprocessing/maat/02_add_HGV_metadata.py) with this function:

```python
import re


def get_location(doc):
    # HGV has a provenance section with a Trismegistos place reference, which is better than the string name provided by doc.orig_place.
    # Use XPath + regex to extract it, but fall back to the Pleiades ID or the string name when Trismegistos is not provided.

    result = {}

    provenance_tm_xpath = '//ns:history/ns:provenance/ns:p/ns:placeName[@type="ancient" and @subtype="nome"]'
    provenance = doc.xpath(provenance_tm_xpath)
    if len(provenance) > 0:
        # EpiDoc has a section with a Trismegistos and/or Pleiades reference
        place_name = provenance[0]
        urls = place_name.get("ref")
        if urls:
            tm_place_id_regex = r"https://www\.trismegistos\.org/place/(\d+)"
            pleiades_id_regex = r"https://pleiades\.stoa\.org/places/(\d+)"
            # @ref holds space-separated URLs, so use re.search: re.match only
            # matches at the start of the string and would miss the Pleiades URL
            # whenever it follows the Trismegistos one.
            if tm_id := re.search(tm_place_id_regex, urls):
                result['TM'] = tm_id.group(1)
            if pl_id := re.search(pleiades_id_regex, urls):
                result['PL'] = pl_id.group(1)
            if place_name.text is not None:
                result['text'] = place_name.text

    # If none of these methods worked, try getting the "normal" placeName
    # EpiDoc.orig_place wants [@type="ancient"] set, but this is not the case in the HGV EpiDocs
    if 'text' not in result:
        orig_places = doc.get_desc('origPlace')
        if len(orig_places) > 0:
            result['text'] = orig_places[0].text  # just take the first one if there are multiple
    return result
```

It takes the first (or only) `<placeName>` with `subtype="nome"` and extracts its name and IDs. About two-thirds of HGV records have this section.
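
To illustrate what parsing the whole provenance section (not just the nome) could look like, here is a minimal sketch. It is not part of the project: `parse_provenance` is a hypothetical name, and it uses the standard library's `xml.etree.ElementTree` with the TEI namespace rather than the library's own API. The sample document is a trimmed-down version of the HGV 140713 snippet above.

```python
import re
import xml.etree.ElementTree as ET

# TEI namespace used by EpiDoc files
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
TM_RE = re.compile(r"https://www\.trismegistos\.org/place/(\d+)")
PLEIADES_RE = re.compile(r"https://pleiades\.stoa\.org/places/(\d+)")


def parse_provenance(root):
    """Hypothetical helper: one dict per <placeName> under <provenance>, in document order."""
    places = []
    path = ".//tei:history/tei:provenance/tei:p/tei:placeName"
    for el in root.iterfind(path, TEI_NS):
        ref = el.get("ref", "")  # space-separated URLs, may be absent
        tm = TM_RE.search(ref)
        pl = PLEIADES_RE.search(ref)
        places.append({
            "text": (el.text or "").strip() or None,
            "subtype": el.get("subtype"),  # e.g. "nome" or "region"; None for the place itself
            "TM": tm.group(1) if tm else None,
            "PL": pl.group(1) if pl else None,
        })
    return places


# Trimmed-down sample based on the HGV 140713 provenance shown above
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><history>
<provenance type="located"><p>
<placeName n="1" type="ancient" ref="https://www.trismegistos.org/place/2157 https://pleiades.stoa.org/places/737053">Soknopaiu Nesos</placeName>
<placeName n="2" type="ancient" subtype="nome" ref="https://www.trismegistos.org/place/332 https://pleiades.stoa.org/places/736893">Arsinoites</placeName>
<placeName type="ancient" subtype="region">Ägypten</placeName>
</p></provenance></history></TEI>"""

places = parse_provenance(ET.fromstring(SAMPLE))
for place in places:
    print(place)
```

Returning every `<placeName>` with its `subtype` leaves it to the caller to decide whether the village, the nome, or the region is the right level of detail.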
