-
Notifications
You must be signed in to change notification settings - Fork 37
Add support for extra options in the fullDomFetcher #1173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Ndpnt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work on this.
However, look at my comments, but it seems to me that the core issue hasn't been fully addressed.
Could you update the PR to include this, or let me know if I’m misunderstanding something.
|
Updated logic has been implemented in the latest commits and replicated into htmlonlyfetcher as well. |
|
Hi @LVerneyEC, I went ahead and refactored some parts of the code and moved the proxy credentials into the browser object instead of keeping them in the module. Since I’m a bit tied up at the moment, it was quicker to implement than to explain. |
I can confirm it works as expected. |
|
@LVerneyEC, just one more step before we can deliver your contribution. Could you lint the commit messages and add a changelog entry? |
|
Should be rebased and cleaned. I have no way to test the CHANGELOG locally and tried to adapt to the best I understood. Please feel free to manually edit the CHANGELOG if not matching your practices before merging, as I don't want to be delaying this PR too much and this is becoming a blocking issue on our side. Thanks! |
Add support for proxy, disabling the sandboxing (which is required by some Docker setups) and disabling headless mode in the fullDomFetcher.
|
As stated in the CONTRIBUTING guidelines, each changelog release must include sponsor information. For example, when a release is developed with the support of the French Ministry for Foreign Affairs, we include:
You can find other examples in the CHANGELOG.md. Could you provide the sponsor information for this release? |
Based on #1135, the sponsor information should be: @LVerneyEC let us know if something different should be used. @Ndpnt I think it's fair that the core team takes responsibility for ensuring the changelog quality and makes the changes directly. Contributors can focus on their code changes, the publishing responsibility stays with the gatekeepers 🙂 |
MattiSG
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems stalled for too long. @Ndpnt let's have a quick call about how to unblock this.
Totally agree. I just needed the sponsor info to update it myself. |
Co-authored-by: Matti Schneider <[email protected]>
I'm not sure to fully grasp the notion of "sponsor". Also it is quite hard to understand from an external contributor point of view as I would be contributing a feature and not a release (and the sponsor is attached to the release). Also given that in most open-source projects the release cycle is out of hands of individual contributors and OTA is very specific in this regards with a very short release cycle often with a single contribution. Also, the wording "sponsor" is quite ambiguous and can imply a notion of money or advertisement (e.g. think of sponsored chairs in academia, where one would just pay for branding) beyond the mere contribution on a position paid by a third party. Anyways, this contribution was made for the need of the EC instance running at https://code.europa.eu/dsa/terms-and-conditions-database/vlops-and-vloses/ on work time at the EC. I would rather not attach any sponsor wording but I guess the proposed sponsor wording could work (provided you switch "Union" to "Commission") if you need one. I would rather go for a more neutral wording such as
|
* main: Release v9.1.1 Improve changelog Add test when mime type contains charset Fix PDF extraction when MIME type contains charset
|
Hey! Just checking if there are any news on this PR and whether I can make anything to help merge it ASAP? This is an important feature for us to increase the overall quality of the ToS scraping we run. Thanks! |
|
@MattiSG Just a heads-up, if I don’t hear back from you, I’ll go ahead and merge this PR next week and update the contributing guidelines as mentioned earlier. |
Clarified sponsor information and contribution acknowledgment in the changelog format
Thank you for the feedback. I tried to clarify in d4173e3. As your contribution is part of employment by an institution, we acknowledge the funding of the project by your employer, as a matter of transparency. This has been requested by several funders. |
MattiSG
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Co-authored-by: Matti Schneider <[email protected]>
|
🎉 Finally merged! |
|
Thanks and lot for the help on this one! This is IMO a really great feature to help in corporate environments, but also to make the data collection way more robust by pinpointing specific proxies/IPs/countries for data collection. I'll update our engine soon to make use of the latest version! |
|
Thank you @Ndpnt for leading the integration and @LVerneyEC for your patience! It is a fine line to find between welcoming contributions and making sure documentation and maintenance are covered. I'm glad this feature lands! |
Hey,
This is a follow-up of #1144 as discussed there.
This PR makes the fullDomFetcher configurable through environment variables to support a wider range of setups:
Thanks