Description
In order to generate pages.jsonl, I need to identify which resources in the WARC are pages (and not, for example, images).
Browsertrix adds a 'WARC-Resource-Type' field to the records it writes. If a record's 'WARC-Resource-Type' is 'document', the record is a page, and I should create an entry for it in pages.jsonl. The 'WARC-Resource-Type' field is under discussion for inclusion in the WARC standard. Browsertrix also provides pageinfo records for each crawl; each URL in a pageinfo record has a type, and a URL of type "document" is a page.
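For reference, the Browsertrix check could look something like the sketch below. It assumes the WARC headers of a record are already available as a plain dict of field name to value; the function name `is_page_by_resource_type` is made up for illustration, and the case-insensitive comparison of the value is my own leniency, not something Browsertrix documents.

```python
def is_page_by_resource_type(warc_headers):
    """Return True if the record carries WARC-Resource-Type: document.

    `warc_headers` is assumed to be a dict of WARC header names to values.
    WARC field names are case-insensitive, so the lookup ignores case.
    """
    for name, value in warc_headers.items():
        if name.lower() == "warc-resource-type":
            # Assumption: tolerate stray whitespace and casing in the value.
            return value.strip().lower() == "document"
    # Field absent (for example, a WARC not written by Browsertrix).
    return False
```

A record without the field returns False, which is exactly the case where a fallback strategy is needed.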
However, not all WARC files come from Browsertrix, and until 'WARC-Resource-Type' is standardised, I need another strategy for identifying which records refer to pages.
The HTTP Content-Type gives a good signal: if a record is text/html, it's probably a page. Is there a better way of doing this?
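To make the Content-Type idea concrete, here is a rough sketch of the heuristic I have in mind. The names `media_type`, `looks_like_page`, and `PAGE_MEDIA_TYPES` are hypothetical, and two choices are my own assumptions rather than anything from a spec: also accepting application/xhtml+xml, and only counting 2xx responses so that HTML error pages and redirects are skipped.

```python
# Assumption: media types treated as "pages"; only text/html is certain.
PAGE_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}


def media_type(content_type):
    """Strip parameters such as charset and normalise case."""
    if not content_type:
        return ""
    return content_type.split(";", 1)[0].strip().lower()


def looks_like_page(content_type, status=200):
    """Heuristic: a successful response with an HTML media type."""
    # Assumption: non-2xx responses are not listed as pages.
    if not 200 <= status < 300:
        return False
    return media_type(content_type) in PAGE_MEDIA_TYPES
```

This would still count HTML loaded inside iframes or fetched by scripts as pages, which is part of why I'm asking whether a better signal exists.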