-
Notifications
You must be signed in to change notification settings - Fork 3
added exception handing & query parameter parsing for comics where fi… #111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…le is not in URL.path
|
Looks like this is failing on JackRabbit again; I swapped it to the line you had edited in the last PR. Testing locally I couldn't get it running still after you proposed the change. I opened a separate PR to with a change to jackrabbit as a hard comparison of @rel='Next' vs the contains function. Building with poetry and running the downloader with that seems to be running smoothly. |
| parts = urlparse(url) | ||
|
|
||
| try: | ||
| parsed_filepath = parts.path | ||
| file_extension = parsed_filepath[parsed_filepath.rindex(".") :] | ||
| except: | ||
| if "?" in url: | ||
| if "&" in parts.query: | ||
| parsed_queries = parts.query.split("&") | ||
| for current_query in parsed_queries: | ||
| try: | ||
| file_extension = current_query[current_query.rindex(".") :] | ||
| break | ||
| except: | ||
| # nothing, loop again | ||
| dummy_var = "" | ||
| else: | ||
| try: | ||
| file_extension = parts.query[parts.query.rindex(".") :] | ||
| except: | ||
| print( | ||
| "File extension unknown; setting as '.unknown' to preserve data" | ||
| ) | ||
| file_extension = ".unknown" | ||
| else: | ||
| # worst case we can't identify the extension, setting 'unknown' to allow saving file for evaluation | ||
| print("File extension unknown; setting as '.unknown' to preserve data") | ||
| file_extension = ".unknown" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- We'd probably want to extract this into a different method (like
get_file_paththat returns both file path and extension) - We could also use the
parse_qs(...)method available in theurllib.parsemodule. Here's an example of what the results would look like:
>>> from urllib.parse import parse_qs
>>> parts = urlparse("https://quantumvibe.com/disppageV3?story=qv&file=/simages/qv/qv1-001.jpg")
>>> parse_qs(parts.query)
{'story': ['qv'], 'file': ['/simages/qv/qv1-001.jpg']}
- We should also add unit tests for the extracted method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yeah that parse_qs would probably be more efficient, I didn't catch that in urlparse documentation.
I'll break this out in the next couple days due to #job
|
Quick comment, I haven't forgotten about this life stuff just happened. I still plan on a refactor |
Fix PR for Issue #110
Whilst adding "Quantum Vibe" I found a bug in the URL parser implementation within comic.py
This will throw a ValueError exception when the image is identified within a query parameter rather than a URL path.
Example that throws exception from "Quantum Vibe"
https://quantumvibe.com/disppageV3?story=qv&file=/simages/qv/qv1-001.jpgTo remedy this, I implemented exception handling and query parsing to identify where the filename/extension lives within the full URL. If none can be identified, it will gracefully save the file with a
.unknownextension to at least preserve the data and allow later recovery through a rename operation when the file type is known. I could probably implement some MIME detection here in that scenario, but have not yet.Sample config for Quantum Vibe for testing:
Limitations with this draft PR:
something.php?comic=somecomic&page=pagexyz.jpg)