Open
Description
Wget segfaults on certain websites (e.g. lanekassen.no) when spanning hosts, for example:
wget --config=wget_warc.conf --level=12 -H -Ptempstore -Dlanekassen.no --warc-file=test-prefix_wget-1-lanekassen_no lanekassen.no
Then, after about 15 minutes:
2025-10-29 16:56:53 URL:https://login.idporten.no/authorize/selector [9576] -> "/mnt/data/magbb/projects/git/maalfrid_toolkit/src/maalfrid_toolkit/warc/test-prefix_wget/tempstore/larested.lanekassen.no/robots.txt.tmp.html" [1]
Program received signal SIGSEGV, Segmentation fault.
Download failed: Invalid argument. Continuing without source file ./string/../sysdeps/x86_64/multiarch/strlen-evex-base.S.
__strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex-base.S:81
warning: 81 ../sysdeps/x86_64/multiarch/strlen-evex-base.S: No such file or directory
(gdb) bt
#0 __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex-base.S:81
#1 0x000055555559d623 in xstrdup (string=0x0) at ../lib/../../lib/xmalloc.c:339
#2 0x000055555558e660 in register_redirection (to=<optimized out>,
from=0x555555749760 "http://larested.lanekassen.no/robots.txt") at ../../src/convert.c:987
#3 retrieve_url (orig_parsed=orig_parsed@entry=0x5555557159c0,
origurl=origurl@entry=0x555555749760 "http://larested.lanekassen.no/robots.txt",
file=file@entry=0x7fffffffce40, newloc=newloc@entry=0x0, refurl=<optimized out>,
refurl@entry=0x0, dt=0x7fffffffcc8c, dt@entry=0x0, recursive=<optimized out>,
iri=0x555555792a70, register_status=false) at ../../src/retr.c:1136
#4 0x000055555558f43e in res_retrieve_file (iri=0x55555573e2b0, file=0x7fffffffce40,
url=0x55555575a3c0 "http://larested.lanekassen.no/") at ../../src/res.c:569
#5 download_child (upos=upos@entry=0x555555769460, parent=parent@entry=0x5555557c73f0,
depth=depth@entry=2, start_url_parsed=start_url_parsed@entry=0x5555555d8380,
blacklist=blacklist@entry=0x5555555d8500, iri=iri@entry=0x55555573e2b0)
at ../../src/recur.c:741
#6 0x000055555558fb4f in retrieve_tree (start_url_parsed=0x5555555d8380, pi=<optimized out>)
at ../../src/recur.c:468
#7 0x00005555555627e1 in main (argc=<optimized out>, argv=0x7fffffffd298)
at ../../src/main.c:2165
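The crash seems to be related to a robots.txt that cannot be parsed: the request for http://larested.lanekassen.no/robots.txt ends up at https://login.idporten.no/authorize/selector, and the backtrace shows xstrdup being called with a NULL string from register_redirection (convert.c:987).
Setting span_hosts=False or disabling robots resolves the problem, but neither is a good solution in the long run.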
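To illustrate the failure mode in frame #1, here is a minimal sketch of a NULL guard around string duplication. It is not wget's or gnulib's code and not a proposed patch: `dup_or_null` and `record_redirect` are hypothetical names invented for this example, and the assumption is simply that a redirection target can be NULL by the time it is recorded, as `xstrdup (string=0x0)` in the backtrace suggests.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of wget or gnulib): duplicate a string,
   returning NULL for NULL input instead of handing it to strlen(). */
static char *
dup_or_null (const char *s)
{
  if (s == NULL)
    return NULL;
  size_t n = strlen (s) + 1;
  char *copy = malloc (n);
  if (copy != NULL)
    memcpy (copy, s, n);
  return copy;
}

/* Hypothetical caller-side guard: store copies of both URLs only when
   both are known. Returns 1 if the pair was recorded, 0 if skipped. */
static int
record_redirect (const char *from, const char *to,
                 char **saved_from, char **saved_to)
{
  assert (saved_from != NULL && saved_to != NULL);
  *saved_from = NULL;
  *saved_to = NULL;

  /* A missing source or target URL is skipped rather than dereferenced. */
  if (from == NULL || to == NULL)
    return 0;

  char *f = dup_or_null (from);
  char *t = dup_or_null (to);
  if (f == NULL || t == NULL)
    {
      /* Allocation failed: release whatever was copied and skip. */
      free (f);
      free (t);
      return 0;
    }

  *saved_from = f;
  *saved_to = t;
  return 1;
}
```

With this guard, a call such as `record_redirect ("http://larested.lanekassen.no/robots.txt", NULL, &f, &t)` returns 0 and leaves both outputs NULL instead of crashing in `strlen`. Whether skipping the registration is the right behaviour for wget, or whether the NULL target points to an earlier bug in the redirect handling, would need to be checked against the actual source.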