@Sam-Osian
Owner

@Sam-Osian Sam-Osian commented May 11, 2025

In this PR, I have created a CSV containing all scraped PFD reports (~5,600). As an interim step, this uses HTML + PDF scraping only (no LLM just yet).

This CSV is now accessible to the user through the new loader.py module. They just have to run the following (see new notebook):

from pfd_toolkit import load_reports

reports = load_reports()
reports

Once a week, this CSV is updated using PFDScraper's top_up() method, executed via a GitHub workflow. Each run of the workflow adds a summary detailing how many reports have been updated.
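
To make the weekly top-up concrete, here is a minimal sketch of the idea, not the actual top_up() implementation. It assumes each report row carries a unique key column (the name "URL" is hypothetical), appends only rows that are not already in the CSV, and writes a one-line summary to the file named by GITHUB_STEP_SUMMARY when that variable is set. The function names top_up_csv and write_summary are illustrative only.

```python
import csv
import os
from pathlib import Path


def top_up_csv(csv_path, new_rows, key="URL"):
    """Append rows whose `key` is not already in the CSV; return how many were added.

    `key` is a hypothetical unique column; the real CSV may use a different one.
    """
    csv_path = Path(csv_path)
    new_rows = list(new_rows)
    existing, fieldnames = [], None

    # Read whatever is already stored, if the file exists.
    if csv_path.exists():
        with csv_path.open(newline="", encoding="utf-8") as fh:
            reader = csv.DictReader(fh)
            fieldnames = reader.fieldnames
            existing = list(reader)

    # Keep only rows whose key has not been seen before.
    seen = {row[key] for row in existing}
    added = []
    for row in new_rows:
        if row[key] not in seen:
            seen.add(row[key])
            added.append(row)

    # Fall back to the new rows' columns when the file was missing or empty.
    if fieldnames is None:
        fieldnames = list(new_rows[0].keys()) if new_rows else [key]

    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(existing + added)
    return len(added)


def write_summary(added_count):
    """Append a one-line summary to the GitHub job summary file, if one is configured."""
    line = f"Added {added_count} new report(s).\n"
    summary_file = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_file:
        with open(summary_file, "a", encoding="utf-8") as fh:
            fh.write(line)
    return line
```

Running top_up_csv twice with the same rows adds nothing the second time, which is what makes a scheduled re-run safe in this sketch.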

This workflow only runs on main.


@johnpytch I could really do with your expertise here to make sure I've designed this appropriately! I've noticed that this new workflow isn't appearing in the Actions tab, so I'm not sure if I'm missing a final piece of the puzzle.

As well as running once a week, I want to be able to run the workflow on demand.

@johnpytch
Collaborator

Well done, it works. I made some small tweaks...

  • Made sure that PFD_Toolkit could be installed properly by uv sync by modifying pyproject.toml slightly (the path was wrong: it needed src/pfd_toolkit rather than just pfd_toolkit, and likewise for data/*.csv).
  • Updated the script to use a relative path (e.g. ./data rather than data/) so that the runner could find the data/*.csv reports file.
  • Minor tweaks to the workflow file, if anything; I think I may have even reverted them now.
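
As a side note on the path tweak above, one alternative (not what was changed in this PR) is to resolve the data file relative to the script's own location, so that it no longer depends on the runner's working directory. A minimal sketch, assuming the data/ folder sits next to the script; the helper name resolve_data_file and the file name reports.csv are hypothetical:

```python
from pathlib import Path


def resolve_data_file(name, anchor):
    """Return the absolute path to data/<name> beside the file given as `anchor`.

    Assumes (hypothetically) that the data/ directory lives in the same
    folder as the anchor file, e.g. the update script itself.
    """
    return Path(anchor).resolve().parent / "data" / name
```

Inside a script this would be called as resolve_data_file("reports.csv", __file__), which gives the same result whether the script is started from the repository root or from elsewhere.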

One thing I'll say: you may find that this fails to commit to main because the branch is protected. There may be a way to give the github-actions bot permission to push to main, or you could update the commit command to force push. When developing or testing this way, always be careful with force push, as these changes go straight into main. While they can be reverted, it's worth avoiding that headache.

If you're happy, go ahead and figure out the branch protection problem I pointed out and merge it! Lemme know how you get on 😄

@Sam-Osian Sam-Osian merged commit c4f47dc into main May 22, 2025
1 check passed
@Sam-Osian Sam-Osian deleted the auto_update branch May 26, 2025 11:34