Question: how can pages be identified from the WARC record? #164

In order to generate `pages.jsonl`, I need to identify which resources in the WARC are pages (and not, for example, images).

Browsertrix provides an additional `WARC-Resource-Type` field. If a record has that field with a value of `document`, it's a page, and I should create an entry for it in `pages.jsonl`. The `WARC-Resource-Type` field is [under discussion](https://github.com/iipc/warc-specifications/issues/96) for inclusion in the WARC standard. Browsertrix _also_ provides [`pageinfo` records](https://crawler.docs.browsertrix.com/user-guide/qa/#resources-and-page-info) for each crawl. Each URL in such a record has a type; if the URL is of type `document`, then it's a page.
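To make the two Browsertrix signals concrete, here is a minimal sketch. It assumes the WARC headers are already available as a plain dict, and that a `pageinfo` payload is JSON with a top-level `urls` object mapping each URL to an object with a `type` key. That payload shape is my reading of the linked docs and should be checked against real records. The function names are hypothetical.

```python
import json


def is_page_by_resource_type(warc_headers):
    """Return True if the record's WARC-Resource-Type header is 'document'.

    `warc_headers` is assumed to be a dict of WARC header names to values.
    Header names are compared case-insensitively.
    """
    for name, value in warc_headers.items():
        if name.lower() == "warc-resource-type":
            return value.strip().lower() == "document"
    # Header absent: this record gives no Browsertrix-style signal.
    return False


def page_urls_from_pageinfo(payload):
    """Return the URLs typed as 'document' in a pageinfo JSON payload.

    Assumed shape (not verified here):
    {"urls": {"<url>": {"type": "document", ...}, ...}}
    """
    info = json.loads(payload)
    urls = info.get("urls", {})
    return [
        url
        for url, meta in urls.items()
        if isinstance(meta, dict) and meta.get("type") == "document"
    ]
```

With a parser such as warcio, the headers dict would come from each record's WARC headers; that wiring is left out so the sketch stays library-independent.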

However, not all WARC files come from Browsertrix, and until 'WARC-Resource-Type' is standardised, I need another strategy for identifying which records refer to pages.

The HTTP Content-Type gives a good signal: if a record's Content-Type is `text/html`, it's _probably_ a page. Is there a better way of doing this?
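For reference, the Content-Type heuristic could be sketched roughly as below. The extra conditions (only `response` records, only HTTP 200, and also accepting `application/xhtml+xml`) are my own assumptions rather than anything the WARC standard prescribes, and the function name is hypothetical. It remains a "probably": embedded frames are also served as `text/html`, so this will over-count.

```python
def is_probable_page(warc_type, status, content_type):
    """Guess whether a record is a page, based on its HTTP Content-Type.

    warc_type:    value of the WARC-Type header, e.g. "response"
    status:       HTTP status code as an int
    content_type: raw HTTP Content-Type header value, or None
    """
    # Assumption: only successful HTTP responses are page candidates.
    if warc_type != "response" or status != 200:
        return False
    # Strip parameters such as "; charset=UTF-8" before comparing.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    # Assumption: treat XHTML like HTML.
    return media_type in {"text/html", "application/xhtml+xml"}
```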

