Skip to content

Conversation

@LVerneyEC
Copy link
Contributor

@LVerneyEC LVerneyEC commented Jul 11, 2025

Hey,

This is a follow-up of #1144 as discussed there.

This PR makes the fullDomFetcher configurable through environment variables to support a wider range of setups:

  • Add support for proxy
  • Sandboxing can be disabled through environment variables (which is required by some Docker setups - e.g. AWS Fargate)
  • Disabling headless mode can be done through environment variables (which might be useful for some visual debugging of declarations or for additional stealth).

Thanks

Copy link
Contributor

@Ndpnt Ndpnt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work on this.
However, look at my comments, but it seems to me that the core issue hasn't been fully addressed.
Could you update the PR to include this, or let me know if I’m misunderstanding something.

@LVerneyEC
Copy link
Contributor Author

Updated logic has been implemented in the latest commits and replicated into htmlonlyfetcher as well.

@Ndpnt
Copy link
Contributor

Ndpnt commented Sep 10, 2025

Hi @LVerneyEC,

I went ahead and refactored some parts of the code and moved the proxy credentials into the browser object instead of keeping them in the module. Since I’m a bit tied up at the moment, it was quicker to implement than to explain.
I haven’t had a chance to test it yet — could you run it on your end and confirm that everything still works?

@LVerneyEC
Copy link
Contributor Author

LVerneyEC commented Sep 22, 2025

I went ahead and refactored some parts of the code and moved the proxy credentials into the browser object instead of keeping them in the module. Since I’m a bit tied up at the moment, it was quicker to implement than to explain. I haven’t had a chance to test it yet — could you run it on your end and confirm that everything still works?

I can confirm it works as expected.

@Ndpnt
Copy link
Contributor

Ndpnt commented Sep 30, 2025

@LVerneyEC, just one more step before we can deliver your contribution. Could you lint the commit messages and add a changelog entry?

@LVerneyEC
Copy link
Contributor Author

Should be rebased and cleaned. I have no way to test the CHANGELOG locally and tried to adapt to the best I understood. Please feel free to manually edit the CHANGELOG if not matching your practices before merging, as I don't want to be delaying this PR too much and this is becoming a blocking issue on our side. Thanks!

Add support for proxy, disabling the sandboxing (which is required by
some Docker setups) and disabling headless mode in the fullDomFetcher.
@Ndpnt
Copy link
Contributor

Ndpnt commented Oct 6, 2025

As stated in the CONTRIBUTING guidelines, each changelog release must include sponsor information.

For example, when a release is developed with the support of the French Ministry for Foreign Affairs, we include:

Development of this release was supported by the French Ministry for Foreign Affairs through its ministerial State Startups incubator, under the aegis of the Ambassador for Digital Affairs.

You can find other examples in the CHANGELOG.md.

Could you provide the sponsor information for this release?

@MattiSG
Copy link
Member

MattiSG commented Oct 7, 2025

Could you provide the sponsor information for this release?

Based on #1135, the sponsor information should be:

> Development of this release was supported by the [European Union](https://commission.europa.eu/).

@LVerneyEC let us know if something different should be used.

@Ndpnt I think it's fair that the core team takes responsibility for ensuring the changelog quality and makes the changes directly. Contributors can focus on their code changes, the publishing responsibility stays with the gatekeepers 🙂

Copy link
Member

@MattiSG MattiSG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems stalled for too long. @Ndpnt let's have a quick call about how to unblock this.

@Ndpnt
Copy link
Contributor

Ndpnt commented Oct 8, 2025

@Ndpnt I think it's fair that the core team takes responsibility for ensuring the changelog quality and makes the changes directly. Contributors can focus on their code changes, the publishing responsibility stays with the gatekeepers 🙂

Totally agree. I just needed the sponsor info to update it myself.

Co-authored-by: Matti Schneider <[email protected]>
@LVerneyEC
Copy link
Contributor Author

@LVerneyEC let us know if something different should be used.

I'm not sure to fully grasp the notion of "sponsor". Also it is quite hard to understand from an external contributor point of view as I would be contributing a feature and not a release (and the sponsor is attached to the release). Also given that in most open-source projects the release cycle is out of hands of individual contributors and OTA is very specific in this regards with a very short release cycle often with a single contribution.

Also, the wording "sponsor" is quite ambiguous and can imply a notion of money or advertisement (e.g. think of sponsored chairs in academia, where one would just pay for branding) beyond the mere contribution on a position paid by a third party.

Anyways, this contribution was made for the need of the EC instance running at https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/ on work time at the EC. I would rather not attach any sponsor wording but I guess the proposed sponsor wording could work (provided you switch "Union" to "Commission") if you need one. I would rather go for a more neutral wording such as

Development of this feature was contributed by the European Commission for its VLOP/VLOSE instance.

Ndpnt added 5 commits October 8, 2025 17:50
* main:
  Release v9.1.1
  Improve changelog
  Add test when mime type contains charset
  Fix PDF extraction when MIME type contains charset
@LVerneyEC
Copy link
Contributor Author

LVerneyEC commented Oct 22, 2025

Hey!

Just checking if there are any news on this PR and whether I can make anything to help merge it ASAP? This is an important feature for us to increase the overall quality of the ToS scraping we run.

Thanks!

@Ndpnt
Copy link
Contributor

Ndpnt commented Oct 23, 2025

@MattiSG Just a heads-up, if I don’t hear back from you, I’ll go ahead and merge this PR next week and update the contributing guidelines as mentioned earlier.

@MattiSG
Copy link
Member

MattiSG commented Nov 3, 2025

I'm not sure to fully grasp the notion of "sponsor"

Thank you for the feedback. I tried to clarify in d4173e3. As your contribution is part of employment by an institution, we acknowledge the funding of the project by your employer, as a matter of transparency. This has been requested by several funders.

Copy link
Member

@MattiSG MattiSG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@LVerneyEC LVerneyEC requested a review from Ndpnt November 5, 2025 13:15
Co-authored-by: Matti Schneider <[email protected]>
@Ndpnt Ndpnt merged commit 252386d into OpenTermsArchive:main Nov 5, 2025
37 of 38 checks passed
@Ndpnt
Copy link
Contributor

Ndpnt commented Nov 6, 2025

🎉 Finally merged!
This PR has been quite a journey, but the discussions have helped us clarify our contribution process and improve our documentation. Thanks to @LVerneyEC for sticking with it and @MattiSG for helping unblock things.

@LVerneyEC
Copy link
Contributor Author

Thanks and lot for the help on this one! This is IMO a really great feature to help in corporate environments, but also to make the data collection way more robust by pinpointing specific proxies/IPs/countries for data collection. I'll update our engine soon to make use of the latest version!

@MattiSG
Copy link
Member

MattiSG commented Nov 6, 2025

Thank you @Ndpnt for leading the integration and @LVerneyEC for your patience! It is a fine line to find between welcoming contributions and making sure documentation and maintenance are covered. I'm glad this feature lands!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants