Provide additional info about source of 5000 WET-files to make leaderboard comparison accessible to non-students. #2

@vskogstad

Hi,
First of all, I really appreciate the course and am very grateful that you have made it openly available!
For assignment 4 you use the Together cluster to store data for Stanford students, and you provide non-cluster alternatives up until section 4. I would not expect you to host a 375 GB download, but could you specify the dump (CC-MAIN-2025-18?) and which segments of it you used to create the 5000 WET files stored on the cluster? If those were listed in the assignment, non-students could recreate the dataset and compare their results directly to the leaderboard. The Paloma validation dataset is available on Hugging Face, so I think that's the only missing piece.
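
To make the request concrete, here is a minimal sketch of how a published crawl ID and segment list could be turned into download URLs. It assumes the crawl really is CC-MAIN-2025-18 (I am only guessing that), and it relies on Common Crawl's public `wet.paths.gz` listing and `https://data.commoncrawl.org/` prefix. The function names and the segment IDs in the usage example are hypothetical placeholders, not the ones used for the course.

```python
from typing import Iterable, List, Optional, Set

# Public HTTPS prefix under which Common Crawl serves its data.
CC_BASE_URL = "https://data.commoncrawl.org/"


def wet_paths_listing_url(crawl_id: str) -> str:
    """Return the URL of the gzipped file listing every WET path in a crawl."""
    return f"{CC_BASE_URL}crawl-data/{crawl_id}/wet.paths.gz"


def segment_of(path: str) -> str:
    """Extract the segment ID, i.e. the path component right after 'segments'."""
    parts = path.split("/")
    return parts[parts.index("segments") + 1]


def wet_file_urls(
    paths: Iterable[str], segments: Optional[Set[str]] = None
) -> List[str]:
    """Turn lines from a decompressed wet.paths listing into download URLs.

    If `segments` is given, only paths belonging to those segment IDs are kept.
    Blank lines are skipped.
    """
    urls = []
    for line in paths:
        path = line.strip()
        if not path:
            continue
        if segments is not None and segment_of(path) not in segments:
            continue
        urls.append(CC_BASE_URL + path)
    return urls


if __name__ == "__main__":
    # Placeholder crawl ID and segment IDs, for illustration only.
    crawl_id = "CC-MAIN-2025-18"
    example_paths = [
        f"crawl-data/{crawl_id}/segments/0000000000000.00/wet/example-00000.warc.wet.gz",
        f"crawl-data/{crawl_id}/segments/1111111111111.11/wet/example-00001.warc.wet.gz",
    ]
    print(wet_paths_listing_url(crawl_id))
    for url in wet_file_urls(example_paths, segments={"0000000000000.00"}):
        print(url)
```

With the actual crawl ID and the list of segments (or the exact 5000 paths), anyone could download the same files this way.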
