Question: how can pages be identified from the WARC record? #164

@extua

In order to generate pages.jsonl, I need to identify which resources in the WARC are pages (and not, for example, images).

Browsertrix provides an additional 'WARC-Resource-Type' field. If a record has that field with a value of 'document', it's a page, and I should create an entry for it in pages.jsonl. The 'WARC-Resource-Type' field is under discussion for inclusion in the WARC standard. Browsertrix also provides pageinfo records for each crawl, and each URL in such a record has a type. If the URL is of type "document", then it's a page.
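
As a rough sketch of the Browsertrix case, assuming the WARC headers of a record are already available as a plain mapping (the helper names `is_document_record` and `pages_jsonl_line` are hypothetical, and the `url`/`ts` keys reflect my understanding of pages.jsonl rather than anything confirmed here):

```python
import json

def is_document_record(warc_headers):
    """True if the record carries WARC-Resource-Type: document."""
    # WARC header names are case-insensitive, so normalise the keys first.
    headers = {name.lower(): value for name, value in warc_headers.items()}
    # Comparing the value case-insensitively is an assumption on my part.
    return headers.get("warc-resource-type", "").strip().lower() == "document"

def pages_jsonl_line(warc_headers):
    """Build one pages.jsonl line (URL and timestamp only) from a record's headers."""
    headers = {name.lower(): value for name, value in warc_headers.items()}
    return json.dumps({"url": headers["warc-target-uri"], "ts": headers["warc-date"]})

# Made-up example headers, not taken from a real crawl.
example = {
    "WARC-Type": "response",
    "WARC-Target-URI": "https://example.com/",
    "WARC-Date": "2024-01-01T00:00:00Z",
    "WARC-Resource-Type": "document",
}

if is_document_record(example):
    print(pages_jsonl_line(example))
```

In practice the mapping would come from whatever WARC reader is in use; only the header check itself is the point here.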

However, not all WARC files come from Browsertrix, and until 'WARC-Resource-Type' is standardised, I need another strategy for identifying which records refer to pages.

The HTTP Content-Type gives a good signal: if a record is text/html, it's probably a page. Is there a better way of doing this?
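
For reference, the Content-Type fallback I have in mind looks roughly like this. It is only a sketch: the function names are hypothetical, and restricting to `response` records with a 200 status and also accepting `application/xhtml+xml` are extra assumptions of mine, not part of any standard.

```python
# Media types treated as "page-like"; the XHTML entry is an assumption.
PAGE_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}

def media_type(content_type):
    """Strip parameters such as charset: 'text/html; charset=utf-8' becomes 'text/html'."""
    return content_type.split(";", 1)[0].strip().lower()

def probably_page(warc_type, http_status, content_type):
    """Heuristic: a successful HTTP response whose media type is HTML."""
    return (
        warc_type == "response"
        and http_status == 200
        and media_type(content_type or "") in PAGE_MEDIA_TYPES
    )

print(probably_page("response", 200, "text/html; charset=UTF-8"))
print(probably_page("response", 200, "image/png"))
```

This still counts iframes, error pages served with a 200 status, and HTML fragments fetched by scripts as pages, which is why I'm asking whether a stronger signal exists.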
