Provenance section #42

@willem640

Hello,

While processing some HGV metadata EpiDocs from papyri.info, I noticed most records include a <provenance> section, such as this one:

HGV 140713

               <history>
                  <origin>
                     <origPlace>Soknopaiu Nesos (Arsinoites)</origPlace>
                     <origDate when="0144">144</origDate>
                  </origin>
                  <provenance type="located">
                     <p xml:id="geoIF5AB8">
                        <placeName n="1"
                                   type="ancient"
                                   ref="https://www.trismegistos.org/place/2157 https://pleiades.stoa.org/places/737053">Soknopaiu Nesos</placeName>
                        <placeName n="2"
                                   type="ancient"
                                   subtype="nome"
                                   ref="https://www.trismegistos.org/place/332 https://pleiades.stoa.org/places/736893">Arsinoites</placeName>
                        <placeName type="ancient" subtype="region">Ägypten</placeName>
                     </p>
                  </provenance>
               </history>

For some use cases, it might be nice to parse the provenance to get more detailed location info and/or Trismegistos and Pleiades location IDs. Is this something that falls within the project's goals?

For reference, I get the location in my project with this function:

import re

def get_location(doc):
    # HGV records have a provenance section with a Trismegistos place reference, which is
    # more reliable than the plain string name provided by doc.orig_place.
    # Extract it with XPath + regex, falling back to the Pleiades ID or the string name
    # when no Trismegistos ID is provided.

    result = {}

    provenance_tm_xpath = '//ns:history/ns:provenance/ns:p/ns:placeName[@type="ancient" and @subtype="nome"]'
    provenance = doc.xpath(provenance_tm_xpath)
    if len(provenance) > 0:
        # The EpiDoc has a section with a Trismegistos and/or Pleiades reference
        place_name = provenance[0]
        urls = place_name.get("ref")
        if urls is not None and len(urls) > 0:
            tm_place_id_regex = r"https://www\.trismegistos\.org/place/(\d+)"
            pleiades_id_regex = r"https://pleiades\.stoa\.org/places/(\d+)"
            # ref holds several space-separated URLs, so use re.search:
            # re.match only matches at the start of the string and would miss the second URL
            if tm_id := re.search(tm_place_id_regex, urls):
                result['TM'] = tm_id.group(1)
            if pl_id := re.search(pleiades_id_regex, urls):
                result['PL'] = pl_id.group(1)
            if place_name.text is not None:
                result['text'] = place_name.text

    # If none of these methods worked, fall back to the "normal" origPlace.
    # EpiDoc.orig_place expects [@type="ancient"] to be set, but this is not the case in the HGV EpiDocs
    if 'text' not in result:
        orig_places = doc.get_desc('origPlace')
        if len(orig_places) > 0:
            result['text'] = orig_places[0].text  # just take the first one if there are multiple
    return result

It takes the first (or only) "nome" placeName and extracts its name and IDs. About two thirds of HGV records have this section.
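To illustrate what parsing the whole provenance (rather than only the nome) might return, here is a minimal standalone sketch. It uses only the standard library instead of this project's API, so `parse_provenance` and the shape of the returned dicts are hypothetical, and it assumes the records use the usual TEI namespace (`http://www.tei-c.org/ns/1.0`), which the excerpt above does not show.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: HGV EpiDoc files declare the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# URL patterns taken from the ref attributes in the HGV 140713 sample.
TM_RE = re.compile(r"https://www\.trismegistos\.org/place/(\d+)")
PLEIADES_RE = re.compile(r"https://pleiades\.stoa\.org/places/(\d+)")

# Trimmed copy of the sample record, wrapped in the assumed namespace.
SAMPLE = """
<history xmlns="http://www.tei-c.org/ns/1.0">
  <provenance type="located">
    <p xml:id="geoIF5AB8">
      <placeName n="1" type="ancient"
                 ref="https://www.trismegistos.org/place/2157 https://pleiades.stoa.org/places/737053">Soknopaiu Nesos</placeName>
      <placeName n="2" type="ancient" subtype="nome"
                 ref="https://www.trismegistos.org/place/332 https://pleiades.stoa.org/places/736893">Arsinoites</placeName>
      <placeName type="ancient" subtype="region">Ägypten</placeName>
    </p>
  </provenance>
</history>
"""


def parse_provenance(history):
    """Return one dict per <placeName> under <provenance>, in document order.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    places = []
    for el in history.findall("tei:provenance/tei:p/tei:placeName", TEI_NS):
        ref = el.get("ref", "")
        # search (not match) because ref may hold several space-separated URLs
        tm = TM_RE.search(ref)
        pleiades = PLEIADES_RE.search(ref)
        places.append({
            "text": (el.text or "").strip(),
            "subtype": el.get("subtype"),
            "TM": tm.group(1) if tm else None,
            "PL": pleiades.group(1) if pleiades else None,
        })
    return places


places = parse_provenance(ET.fromstring(SAMPLE))
for place in places:
    print(place)
```

Returning every placeName with its subtype would let callers choose the village, nome, or region level themselves instead of hard-coding the nome.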
