wget segfault #9

@magbb

Description

Wget segfaults on certain websites (e.g. lanekassen.no) when spanning hosts, e.g.:

wget --config=wget_warc.conf --level=12 -H -Ptempstore -Dlanekassen.no --warc-file=test-prefix_wget-1-lanekassen_no lanekassen.no

Then, after 15 minutes:

2025-10-29 16:56:53 URL:https://login.idporten.no/authorize/selector [9576] -> "/mnt/data/magbb/projects/git/maalfrid_toolkit/src/maalfrid_toolkit/warc/test-prefix_wget/tempstore/larested.lanekassen.no/robots.txt.tmp.html" [1]

Program received signal SIGSEGV, Segmentation fault.
Download failed: Invalid argument.  Continuing without source file ./string/../sysdeps/x86_64/multiarch/strlen-evex-base.S.
__strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex-base.S:81
warning: 81     ../sysdeps/x86_64/multiarch/strlen-evex-base.S: No such file or directory
(gdb) bt
#0  __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex-base.S:81
#1  0x000055555559d623 in xstrdup (string=0x0) at ../lib/../../lib/xmalloc.c:339
#2  0x000055555558e660 in register_redirection (to=<optimized out>, 
    from=0x555555749760 "http://larested.lanekassen.no/robots.txt") at ../../src/convert.c:987
#3  retrieve_url (orig_parsed=orig_parsed@entry=0x5555557159c0, 
    origurl=origurl@entry=0x555555749760 "http://larested.lanekassen.no/robots.txt", 
    file=file@entry=0x7fffffffce40, newloc=newloc@entry=0x0, refurl=<optimized out>, 
    refurl@entry=0x0, dt=0x7fffffffcc8c, dt@entry=0x0, recursive=<optimized out>, 
    iri=0x555555792a70, register_status=false) at ../../src/retr.c:1136
#4  0x000055555558f43e in res_retrieve_file (iri=0x55555573e2b0, file=0x7fffffffce40, 
    url=0x55555575a3c0 "http://larested.lanekassen.no/") at ../../src/res.c:569
#5  download_child (upos=upos@entry=0x555555769460, parent=parent@entry=0x5555557c73f0, 
    depth=depth@entry=2, start_url_parsed=start_url_parsed@entry=0x5555555d8380, 
    blacklist=blacklist@entry=0x5555555d8500, iri=iri@entry=0x55555573e2b0)
    at ../../src/recur.c:741
#6  0x000055555558fb4f in retrieve_tree (start_url_parsed=0x5555555d8380, pi=<optimized out>)
    at ../../src/recur.c:468
#7  0x00005555555627e1 in main (argc=<optimized out>, argv=0x7fffffffd298)
    at ../../src/main.c:2165

This seems to be related to a robots.txt fetch whose redirect cannot be resolved: in the backtrace, register_redirection (frame #2) passes to=NULL into xstrdup, which calls strlen(NULL) and segfaults.

Setting span_hosts=False or disabling robots works around the problem, but this isn't a good solution in the long run.
