Hello,
While processing some HGV metadata EpiDocs from papyri.info, I noticed most records include a <provenance> section, such as this one:
<history>
  <origin>
    <origPlace>Soknopaiu Nesos (Arsinoites)</origPlace>
    <origDate when="0144">144</origDate>
  </origin>
  <provenance type="located">
    <p xml:id="geoIF5AB8">
      <placeName n="1"
                 type="ancient"
                 ref="https://www.trismegistos.org/place/2157 https://pleiades.stoa.org/places/737053">Soknopaiu Nesos</placeName>
      <placeName n="2"
                 type="ancient"
                 subtype="nome"
                 ref="https://www.trismegistos.org/place/332 https://pleiades.stoa.org/places/736893">Arsinoites</placeName>
      <placeName type="ancient" subtype="region">Ägypten</placeName>
    </p>
  </provenance>
</history>

For some use cases, it might be nice to parse the provenance to get more detailed location info and/or Trismegistos and Pleiades location IDs. Is this something that falls within the project's goals?
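To illustrate what "more detailed location info" could look like, here is a rough sketch that lists every placeName in the provenance together with its subtype and reference URLs. It uses the standard library's ElementTree rather than this project's API, assumes the usual TEI namespace on the document, and `provenance_places` is just a made-up name for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the record uses the standard TEI namespace.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Trimmed copy of the sample record, with the namespace declared explicitly.
SAMPLE = """<history xmlns="http://www.tei-c.org/ns/1.0">
  <provenance type="located">
    <p xml:id="geoIF5AB8">
      <placeName n="1" type="ancient" ref="https://www.trismegistos.org/place/2157 https://pleiades.stoa.org/places/737053">Soknopaiu Nesos</placeName>
      <placeName n="2" type="ancient" subtype="nome" ref="https://www.trismegistos.org/place/332 https://pleiades.stoa.org/places/736893">Arsinoites</placeName>
      <placeName type="ancient" subtype="region">Ägypten</placeName>
    </p>
  </provenance>
</history>"""


def provenance_places(history):
    """Hypothetical helper: one dict per placeName under provenance/p."""
    places = []
    for el in history.findall("tei:provenance/tei:p/tei:placeName", TEI):
        places.append({
            "name": el.text,
            "subtype": el.get("subtype"),        # None for the settlement itself
            "refs": (el.get("ref") or "").split(),  # ref holds space-separated URLs
        })
    return places


places = provenance_places(ET.fromstring(SAMPLE))
for place in places:
    print(place["subtype"], place["name"], place["refs"])
```

The region entry has no ref attribute in the sample, so its list of URLs comes back empty.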
For reference, I get the location in my project with this function:
import re

def get_location(doc):
    # HGV has a provenance section with a Trismegistos places reference, which is
    # better than the string name provided by doc.orig_place.
    # Use xpath + regex to get it out, but fall back to Pleiades or the string
    # name when Trismegistos is not provided.
    result = {}
    provenance_tm_xpath = '//ns:history/ns:provenance/ns:p/ns:placeName[@type="ancient" and @subtype="nome"]'
    provenance = doc.xpath(provenance_tm_xpath)
    if len(provenance) > 0:
        # EpiDoc has a section with a Trismegistos and/or Pleiades reference
        place_name = provenance[0]
        urls = place_name.get("ref")
        if urls is not None and len(urls) > 0:
            tm_place_id_regex = r"https://www\.trismegistos\.org/place/(\d+)"
            pleiades_id_regex = r"https://pleiades\.stoa\.org/places/(\d+)"
            # re.search, not re.match: ref can hold several space-separated URLs,
            # and re.match only matches at the start of the string
            if tm_id := re.search(tm_place_id_regex, urls):
                result['TM'] = tm_id.group(1)
            if pl_id := re.search(pleiades_id_regex, urls):
                result['PL'] = pl_id.group(1)
        if place_name.text is not None:
            result['text'] = place_name.text
    # If none of these methods worked, try getting the "normal" placeName.
    # EpiDoc.orig_place wants [@type="ancient"] set, but this is not the case in the HGV EpiDocs
    if 'text' not in result:
        orig_places = doc.get_desc('origPlace')
        if len(orig_places) > 0:
            result['text'] = orig_places[0].text  # just take the first one if there are multiple
    return result

It takes the first (or only) "nome" placeName and extracts its name and IDs. About two thirds of HGV records have this section.
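Because the ref attribute can carry a Trismegistos URL and a Pleiades URL separated by a space, the ID extraction is easy to check on its own. A minimal sketch, assuming the URL shapes shown in the sample record; `parse_place_refs` is a hypothetical helper name, not something the library provides.

```python
import re

# Patterns assumed from the URLs in the sample record.
TM_PLACE_RE = re.compile(r"https?://www\.trismegistos\.org/place/(\d+)")
PLEIADES_RE = re.compile(r"https?://pleiades\.stoa\.org/places/(\d+)")


def parse_place_refs(ref):
    """Return the TM and Pleiades IDs found in a space-separated ref string."""
    ids = {}
    if not ref:
        return ids
    # search() scans the whole string, so the order of the URLs does not matter
    if m := TM_PLACE_RE.search(ref):
        ids["TM"] = m.group(1)
    if m := PLEIADES_RE.search(ref):
        ids["PL"] = m.group(1)
    return ids


ref = "https://www.trismegistos.org/place/332 https://pleiades.stoa.org/places/736893"
print(parse_place_refs(ref))  # {'TM': '332', 'PL': '736893'}
```

A ref with only one of the two URLs simply yields a dict with one key, and a missing ref yields an empty dict.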