Skip to content

Conversation

@ParamThakkar123
Copy link
Collaborator

@TheCedarPrince

This PR Fixes #12

Added a _huggingface_dataset_register to register datasets from huggingface hub using DataDeps.jl and implemented a load function to load the dataset using HealthSampleData.jl module for the users

Copy link
Member

@TheCedarPrince TheCedarPrince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @ParamThakkar123 for your work! I left a bunch of comments based on our call; apologies for the delay in reviewing with being out sick! Very excited here and happy to review multiple times as needed. Just let me know via ping again!

I think we also will need to implement #14 first before continuing with this PR however.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ParamThakkar123 , I think that some of the dependencies in here got mixed up. Some of these should instead be moved to testing dependencies in the test folder. You can read about this exact workflow here: https://docs.julialang.org/en/v1/stdlib/Test/#Workflow-for-Testing-Packages

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ParamThakkar123 , this still does not appear to be addressed -- could you take a look again? Thanks!

@TheCedarPrince
Copy link
Member

Hey @ParamThakkar123 , I think you are unblocked on this PR now! Could you rebase and address the comments I left here? Thanks and keep up the good work! :D

@ParamThakkar123
Copy link
Collaborator Author

Hey @ParamThakkar123 , I think you are unblocked on this PR now! Could you rebase and address the comments I left here? Thanks and keep up the good work! :D

Yes @TheCedarPrince I will start working on this now!!

@codecov
Copy link

codecov bot commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 31.57895% with 26 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@921e0ee). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/huggingface.jl 0.00% 21 Missing ⚠️
src/HuggingFaceDatasets/data.jl 0.00% 5 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #13   +/-   ##
=======================================
  Coverage        ?   29.54%           
=======================================
  Files           ?        5           
  Lines           ?       44           
  Branches        ?        0           
=======================================
  Hits            ?       13           
  Misses          ?       31           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@TheCedarPrince TheCedarPrince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Param, very very close here! Some final comments; thanks for the hard work!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ParamThakkar123 , this still does not appear to be addressed -- could you take a look again? Thanks!

using HealthSampleData

"""
register_huggingface_dataset(name::String, repo::String, filename::String)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there -- this really should be explicitly hardcoded like what we have for the Eunomia example. Here is what I am imagining the flow to be:

  1. download_hf_dataset(name = "Synthea")
  2. There should be a registered DataDep for Synthea in the data.jl file (like Eunomia)
  3. The download_hf_dataset should return that DataDep

You may want to create the Synthea DataDep as a constant or something that you can call from within data.jl.

using HealthSampleData
using Test

@testset "HealthSampleData.jl" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests for downloading; you could upload a very small file to HF JuliaHealthDatasets and use that for testing.

Copy link
Member

@TheCedarPrince TheCedarPrince left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Param, very very close here! Some final comments; thanks for the hard work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Download Data Sources from HuggingFace

2 participants