Greek wiktionary uses '0', '00' and '000' as parameters#363

Merged
xxyzz merged 2 commits into main from zerozerozero
Feb 17, 2025

@kristian-clausal
Collaborator

https://el.wiktionary.org/w/index.php?title=%CE%A0%CF%81%CF%8C%CF%84%CF%85%CF%80%CE%BF:%CE%B5%CF%84

-->|0={{{0|}}} <!-- no parenthesis 
-->|00={{{00|}}} {{#if:{{{nocat|}}}|{{{nocat|}}}}}<!--
-->|000={{{000|}}} {{#if:{{{nodisplay|}}}|{{{nodisplay|}}}}}<!--

Apparently parameter names below 1, like '0' or '00', aren't converted to integers... Or Scribunto handles all parameter names as strings and just does something special for the indexing.

Still need to figure out some things...

@kristian-clausal kristian-clausal force-pushed the zerozerozero branch 2 times, most recently from c1494ed to 803b416 Compare February 13, 2025 09:18
@kristian-clausal
Collaborator Author

I am trying to fix a Lua error, and I think I've figured out what's wrong:

$ lang=el; datadir=/home/kristian/Data/${lang};wiktwords --edition "$lang" --db-path "${datadir}/${lang}-wikt.db" --all --human-readable --out temp --errors temp.errors --all-languages --page gtmp/thelukos2.txt --override el-override/
2025-02-13 11:31:59,481 INFO: Capturing words for all available languages
gl: "           " lang: "       nil
*32*    table: 0x341b73c0
θηλυκός: ERROR: LUA error in #invoke('labels', 'main', '', '', 'label=γραμμ \U0010207b \U0010207b \U0010207b ', 'όρος= \U0010207b ', 'εμφ= \U0010207b ', 'σελ= \U0010207b  ', '0= ', '00=- \U0010207b', '000= \U0010207b', 'γλ= \U0010207b \U0010207b', 'ascii= \U0010207b \U0010207b') parent ('Πρότυπο:ετ', {1: 'γραμμ', '00': '-'}) at ['θηλυκός', 'Template:ετ', '#invoke', '#invoke']
[string "Module:labels"]:71: attempt to index field '?' (a nil value)
gl: "           " lang: "       nil
*32*    table: 0x341b73c0
θηλυκός: ERROR: LUA error in #invoke('labels', 'main', '', '', 'label=γραμμ \U0010207b \U0010207b \U0010207b ', 'όρος= \U0010207b ', 'εμφ= \U0010207b ', 'σελ= \U0010207b  ', '0= ', '00=- \U0010207b', '000= \U0010207b', 'γλ= \U0010207b \U0010207b', 'ascii= \U0010207b \U0010207b') parent ('Πρότυπο:ετ', {1: 'γραμμ', '00': '-'}) at ['θηλυκός', 'Template:ετ', '#invoke', '#invoke']
[string "Module:labels"]:71: attempt to index field '?' (a nil value)

In Greek wiktionary Module:labels we have

-- about languages, language specifics CHECK [[takeout]] [[φάλαινα]]
	print('gl: "', args['γλ'], '" lang: "', args['lang']) 
   	local lang_iso = args['γλ'] or args['lang'] or '' -- or args[2] at [[Template|ετ]] 
		if lang_iso == '' or lang_iso == nil then 
			if label == 'αμερ' or label == 'αμερ γρ' or label == 'αμερ σημασία'
			or label == 'βρετ' or label == 'βρετ γρ' or label == 'βρετ σημασία'
			then lang_iso = 'en'
			else lang_iso = 'el'
			end
		end
	print('*' .. string.byte(lang_iso) .. '*', languages)
	local lang_name = languages[lang_iso].name or ''

and it turns out that args['γλ'] is a single space character, just ASCII 32. This doesn't seem to be related to the '0' stuff, though, just coincidental... No idea why spaces don't get through like this more often; I'll try to investigate.

@kristian-clausal
Collaborator Author

[string "Module:labels"]:95: attempt to index field '?' (a nil value)
(venv) kristian@wiktextract-dev:~/Alt/wiktextract (el):
$ lang=el; datadir=/home/kristian/Data/${lang};wiktwords --edition "$lang" --db-path "${datad
2025-02-13 12:30:22,201 INFO: Capturing words for all available languages
NOT REACHED?    "γραμμ   "
gl: "           " lang: "       nil
*101*   table: 0xb500270
Label:  γραμμ
data[label]:    nil
θηλυκός: ERROR: LUA error in #invoke('labels', 'main', '', '', 'label=γραμμ \U0010207b \U0010207b \U0010207b', 'όρος= \U0010207b', 'εμφ= \U0010207b', 'σελ= \U0010207b', '0=', '00=- \U0010207b', '000= \U0010207b', 'γλ= \U0010207b \U0010207b', 'ascii= \U0010207b \U0010207b') parent ('Πρότυπο:ετ', {1: 'γραμμ', '00': '-'}) at ['θηλυκός', 'Template:ετ', '#invoke', '#invoke']
[string "Module:labels"]:95: attempt to index field '?' (a nil value)
NOT REACHED?    "γραμμ   "
gl: "           " lang: "       nil
*101*   table: 0xb500270
Label:  γραμμ
data[label]:    nil
θηλυκός: ERROR: LUA error in #invoke('labels', 'main', '', '', 'label=γραμμ \U0010207b \U0010207b \U0010207b', 'όρος= \U0010207b', 'εμφ= \U0010207b', 'σελ= \U0010207b', '0=', '00=- \U0010207b', '000= \U0010207b', 'γλ= \U0010207b \U0010207b', 'ascii= \U0010207b \U0010207b') parent ('Πρότυπο:ετ', {1: 'γραμμ', '00': '-'}) at ['θηλυκός', 'Template:ετ', '#invoke', '#invoke']
[string "Module:labels"]:95: attempt to index field '?' (a nil value)

aaaaargh I don't get why these template arguments don't get stripped properly. This is probably unrelated to the 000 issue: γραμμ is followed by three spaces (because there are three spaces between the {{{...|...}}} spans in |label={{{1|}}} {{#if:{{{label|}}}|{{{label|}}}}} {{#if:{{{topic|}}}|{{{topic|}}}}} {{#if:{{{ετικέτα|}}} in the source of Πρότυπο:ετ), which means the Lua code can't find the correct data under γραμμ in labels/data... There is a step somewhere that does .strip(). Because label= is used in Πρότυπο:ετ it should strip the stuff, so maybe it's a timing issue...

@xxyzz
Collaborator

xxyzz commented Feb 14, 2025

We should be able to pass this test, but currently we get a result of "nilnilnilnil":

    def test_el_zero_arg(self):
        self.wtp.start_page("Πρότυπο:ετ")
        self.wtp.add_page(
            "Module:test",
            828,
            """local export = {}
function export.test(frame)
  --print(mw.dumpObject(frame.args))
  return tostring(frame.args[0]) .. tostring(frame.args["0"]) .. tostring(frame.args[00]) .. tostring(frame.args["00"]) .. tostring(frame.args[42]) .. tostring(frame.args["42"]) .. tostring(frame.args[042]) .. tostring(frame.args["042"])
end
return export""",
        )
        self.assertEqual(self.wtp.expand(
            "{{#invoke:test|test|0=0|00=1|42=2|042=3}}"), "00012223"
        )

@xxyzz
Collaborator

xxyzz commented Feb 14, 2025

I'd like to suggest these changes:

at _sandbox_phase2.lua line 13, try string key first then try number key:

    local v = new_args._orig[key]
    if v == nil then
        local i = tonumber(key)
        if i ~= nil then
            key = i
        else
            return nil
        end
        v = new_args._orig[key]
        if v == nil then
            return nil
        end
    end

at luaexec.py line 443, only convert "0" and numbers that don't start with "0":

if k.isdigit() and (not k.startswith("0") or k == "0"):
    k = int(k)
    if k < 0 or k > 1000:
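
The two suggestions above can be tried out together in a small Python sketch. This is only an illustration under stated assumptions: `normalize_key` and `lookup` are hypothetical names, not functions from wikitextprocessor, the range check from the truncated snippet is left out, and an extra `isascii()` guard is added so that `int()` cannot fail on non-ASCII digits.

```python
from typing import Dict, Optional, Union

Key = Union[int, str]


def normalize_key(k: str) -> Key:
    """Convert a parameter name to int only for "0" and for digit
    strings without a leading zero; everything else stays a string."""
    # isascii() is an addition in this sketch (see the note above).
    if k.isascii() and k.isdigit() and (not k.startswith("0") or k == "0"):
        return int(k)
    return k


def lookup(args: Dict[Key, str], key: Key) -> Optional[str]:
    """Try the key exactly as given first, then fall back to its numeric form."""
    if key in args:
        return args[key]
    if isinstance(key, str):
        try:
            return args.get(int(key))
        except ValueError:
            return None
    return None


# Arguments from the test case: {{#invoke:test|test|0=0|00=1|42=2|042=3}}
raw = {"0": "0", "00": "1", "42": "2", "042": "3"}
args = {normalize_key(k): v for k, v in raw.items()}

# The Lua literals 00 and 042 are the numbers 0 and 42.
keys = [0, "0", 0, "00", 42, "42", 42, "042"]
result = "".join(str(lookup(args, k)) for k in keys)
print(result)  # 00012223
```

With both rules in place, '00' and '042' stay string keys and are found directly, while "0" and "42" are reached through the numeric fallback.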

@xxyzz
Collaborator

xxyzz commented Feb 14, 2025

For removing white spaces in expanded template arg:

    def test_el_strip_arg(self):
        self.wtp.start_page("θηλυκός")
        self.wtp.add_page(
            "Module:test",
            828,
            """local export = {}
function export.test(frame)
  return tostring(frame.args[0])
end
return export""",
        )
        self.assertEqual(self.wtp.expand(
            "{{#invoke:test|test|0={{{1|}}} {{{2|}}}}}"), ""
        )

at the end of preprocess() in luaexec.py:

ret = expand_all_templates(v)
if ret.strip() == "":
    ret = ""

Maybe it would be better to check for an empty string here:

@kristian-clausal
Collaborator Author

I think that only positive, non-zero integers are allowed in Wikitext parameters, so using int(k) > 0 should be correct.

The current problem is, I think, that when we do stuff in wikitextprocessor/core expand_recurse:

                    if kind == "T":
                        # Template transclusion or parser function call.
                        # Expand its arguments.
                        print(f"{kind=}, {args=}, {argmap=}")
                        new_args = tuple(
                            expand_args(x, argmap).removesuffix("\n")
                            for x in args
                        )
                        print(f"{new_args=}")
                        parts.append(self._save_value(kind, new_args, nowiki))
                        continue

That seems to always leave a magic character for T expansions (including #if), and usually that magic character has the same value as all the other magic characters left behind, because we keep on .pop()ing stuff from the magic character cookies stack and then putting something on top again and again. But there's no check for whether the template needs to have a cookie here.

new_args=('#invoke:labels', 'main', '', '', 'label=γραμμ \U0010207b \U0010207b \U0010207b ', 'όρος= \U0010207b ', 'εμφ= \U0010207b ', 'σελ= \U0010207b  ', '0= ', '00=- \U0010207b', '000= \U0010207b', 'γλ= \U0010207b \U0010207b', 'ascii= \U0010207b \U0010207b')
fn_name='#if', args=('', ''), ret=''
fn_name='#if', args=('', ''), ret=''
...
*101*   table: 0x1fa28c70
Label:  γραμμ
data[label]:    nil
θηλυκός: ERROR: LUA error in #invoke('labels', 'main', '', '', 'label=γραμμ \U0010207b \U0010207b \U0010207b ', 'όρος= \U0010207b ', 'εμφ= \U0010207b ', 'σελ= \U0010207b  ', '0= ', '00=- \U0010207b', '000= \U0010207b', 'γλ= \U0010207b \U0010207b', 'ascii= \U0010207b \U0010207b') parent ('Πρότυπο:ετ', {1: 'γραμμ', '00': '-'}) at ['θηλυκός', 'Template:ετ', '#invoke', '#invoke']
[string "Module:labels"]:95: attempt to index field '?' (a nil value)

What used to be normal magic characters with different values have been replaced by \U0010207b.

Why hasn't this come up before? I think it's because Lua simply discards the magic character, since it's outside Unicode's normal range of characters. But in this case, all the #if calls in the template have spaces in between them, so we're left with whitespace that isn't .strip()ped away as usual. All of this goes straight to the Lua interpreter, with the magic characters in place.
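
A minimal Python sketch of the ordering problem, assuming the placeholder simply expands to an empty string (as the empty #if calls do). The helper names are hypothetical and only model the behaviour described in this thread; the placeholder character is copied from the log output.

```python
MAGIC = "\U0010207b"  # placeholder character as it appears in the log output


def strip_then_expand(value: str, expansions: dict) -> str:
    """Strip first, while the placeholders are still in the string."""
    return "".join(expansions.get(ch, ch) for ch in value.strip())


def expand_then_strip(value: str, expansions: dict) -> str:
    """Expand the placeholders first, then strip what is left."""
    return "".join(expansions.get(ch, ch) for ch in value).strip()


# The value of label= from the log: γραμμ followed by three placeholders.
value = "γραμμ " + MAGIC + " " + MAGIC + " " + MAGIC + " "
expansions = {MAGIC: ""}  # every empty #if expands to nothing

early = strip_then_expand(value, expansions)
late = expand_then_strip(value, expansions)

print(repr(early))  # 'γραμμ   '
print(repr(late))   # 'γραμμ'
```

The placeholder is not whitespace, so stripping before expansion only removes the final space and the three separators survive.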

@xxyzz
Collaborator

xxyzz commented Feb 14, 2025

I think the first "0" arg problem is not fixed, because frame.args["0"] is nil in this branch.

Could you add a simplified test for the second magic characters problem? Update: This is the third problem; the second is whitespace in template arguments not being stripped (maybe they are the same problem or related?)...

Update: I think I kind of understand the magic characters problem in "cookies". This does look strange, but I'm not familiar with the _save_value() function and I'm currently reading the code to try to understand it...

Update: I searched the code and the Wtp.cookies list never seems to be popped? It only gets appended to in Wtp._save_value() and cleared in Wtp.start_page().

@kristian-clausal
Collaborator Author

Aaah, there's a check in _save_value, using rev_ht, to see if the cookie data already exists. Because the #if templates are all reduced to the same data (("#if", "", "") or something like that, basically an empty if), _save_value returns the same cookie for each of them. In https://el.wiktionary.org/w/index.php?title=%CE%A0%CF%81%CF%8C%CF%84%CF%85%CF%80%CE%BF:%CE%B5%CF%84 we have

-->|label={{{1|}}} {{#if:{{{label|}}}|{{{label|}}}}} {{#if:{{{topic|}}}|{{{topic|}}}}} {{#if:{{{ετικέτα|}}}|{{{ετικέτα|}}}}} <!-- forms Categories and/or link 
-->|όρος={{{όρος|}}} {{#if:{{{term|}}}|{{{term|}}}}} <!-- overrides the chosen text 
-->|εμφ={{{3|}}} {{#if:{{{show|{{{εμφ|}}}}}}|{{{show|{{{εμφ|}}}}}}}} <!-- shows chosen text at link εμφ π.χ. Μεγάλη Άρκτος
-->|σελ={{{σελ|}}} {{#if:{{{page|}}}|{{{page|}}}}}  <!-- σελ=1 συνδέει με το λήμμα, αντί την Κατηγορία - Links lemma, not default Category 
-->|0={{{0|}}} <!-- no parenthesis 
-->|00={{{00|}}} {{#if:{{{nocat|}}}|{{{nocat|}}}}}<!--
-->|000={{{000|}}} {{#if:{{{nodisplay|}}}|{{{nodisplay|}}}}}<!--
-->|γλ={{{2|}}} {{#if:{{{γλ|}}}|{{{γλ|}}}}} {{#if:{{{lang|}}}|{{{lang|}}}}}<!--
-->|ascii={{{ascii|}}} {{#if:{{{ascii|}}}|{{{ascii|}}}}} {{#if:{{{sort|}}}|{{{sort|}}}}}<!--

and it's exactly those #ifs that seem to remain as cookies.
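
The deduplication described above can be modelled with a toy class. This is a sketch, not the actual Wtp._save_value code: the class name, the `BASE` code point and the payload layout are assumptions; only the idea of a reverse lookup table (rev_ht) that returns an existing cookie for identical data comes from this thread.

```python
class CookieStore:
    """Toy stand-in for the placeholder ("cookie") bookkeeping."""

    BASE = 0x102000  # hypothetical start of the placeholder code point range

    def __init__(self) -> None:
        self.cookies = []  # index -> saved payload
        self.rev_ht = {}   # payload -> placeholder character

    def save_value(self, kind: str, args: tuple) -> str:
        payload = (kind, tuple(args))
        # Identical payloads reuse the placeholder that was already handed out.
        if payload in self.rev_ht:
            return self.rev_ht[payload]
        ch = chr(self.BASE + len(self.cookies))
        self.cookies.append(payload)
        self.rev_ht[payload] = ch
        return ch


store = CookieStore()
a = store.save_value("T", ("#if", "", ""))
b = store.save_value("T", ("#if", "", ""))
c = store.save_value("T", ("#if", "x", "x"))

print(a == b)              # True: both empty #if calls share one placeholder
print(a == c)              # False: different data gets a new placeholder
print(len(store.cookies))  # 2
```

This would explain why the log shows the same \U0010207b value repeated: every empty #if reduces to the same payload.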

@xxyzz
Collaborator

xxyzz commented Feb 14, 2025

So... I guess it's normal? They still expand correctly?

@kristian-clausal
Collaborator Author

The magic characters end up getting into Lua. They're not expanded. Lua just seems to discard the weird Unicode characters. At least it seems that way...

@xxyzz
Collaborator

xxyzz commented Feb 14, 2025

But don't the arguments get expanded here

before the args are passed to the Lua module function?

Do you mean that "preprocess()" can't expand these encoded #if magic characters?

@kristian-clausal
Collaborator Author

You are absolutely correct, how did I miss that... We might have to strip() the string at that point, somehow. Lua doesn't seem to have a native implementation of just .strip()...

@kristian-clausal
Collaborator Author

sandbox_phase2.lua starting line 21

    if not new_args._preprocessed[key] then
        local frame = new_args._frame
        v = frame:preprocess(v)
        if key ~= i then
            v = v:match "^%s*(.-)%s*$"
        end
        -- Cache preprocessed value so we only preprocess each argument once
        new_args._preprocessed[key] = true
        new_args._orig[key] = v

This seems to do it. The check for key ~= i is there to tell whether the argument was passed by name (aa=bb): as far as I understand the documentation, named arguments have their surrounding whitespace trimmed, while an unnamed | bb | keeps it. I'm not sure if 1= bb would be trimmed...
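
For readers more at home in Python, the same flow (expand on first access, trim named arguments afterwards, cache the result) might look roughly like this. It is a sketch only: `get_arg` and `fake_preprocess` are hypothetical, Python's `str.strip()` is not identical to the Lua pattern `%s`, and the explicit `1= bb` case is not handled here.

```python
from typing import Callable, Dict, Optional, Set, Union

Key = Union[int, str]


def get_arg(
    orig: Dict[Key, str],
    preprocessed: Set[Key],
    key: Key,
    preprocess: Callable[[str], str],
) -> Optional[str]:
    """Expand an argument on first access, trim it if it is named, cache it."""
    v = orig.get(key)
    if v is None:
        return None
    if key not in preprocessed:
        v = preprocess(v)
        if not isinstance(key, int):
            # Named argument: trim only now, after expansion.
            v = v.strip()
        # Cache the result so each argument is preprocessed only once.
        preprocessed.add(key)
        orig[key] = v
    return v


MAGIC = "\U0010207b"  # placeholder character as it appears in the log output
calls = []


def fake_preprocess(text: str) -> str:
    """Stand-in for frame:preprocess(): placeholders expand to nothing."""
    calls.append(text)
    return text.replace(MAGIC, "")


orig = {"γλ": " " + MAGIC + " " + MAGIC, 1: " bb "}
seen = set()
first = get_arg(orig, seen, "γλ", fake_preprocess)
second = get_arg(orig, seen, "γλ", fake_preprocess)
positional = get_arg(orig, seen, 1, fake_preprocess)

print(repr(first), repr(second), repr(positional), len(calls))
```

Here the 'γλ' value becomes an empty string, so a fallback such as `args['γλ'] or args['lang'] or ''` in Module:labels no longer picks up a lone space.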

@xxyzz
Collaborator

xxyzz commented Feb 14, 2025

"1= bb " does get stripped... and don't forget frame.args["0"] shouldn't be nil.

kristian-clausal added a commit that referenced this pull request Feb 14, 2025
Check out wikitextprocessor PR #363

Issue was that in some Greek templates, we had a ton of
`{{#if:...}}` templates as template arguments, and they
were written with spaces in between: `| {{#if:...}} {{#if:...}} |`.

The ifs left magic characters which prevented stripping
before they were expanded into empty strings, so the
trimming has to be done later.
@kristian-clausal
Collaborator Author

@xxyzz I had to make changes to one of your tests (Russian, remove \n at the end of unnamed arguments) and I would like you to check I didn't do anything dumb again.

@kristian-clausal kristian-clausal force-pushed the zerozerozero branch 3 times, most recently from a6f8c28 to f8b774e Compare February 14, 2025 10:17
Trim whitespace around frame args after expanding them

Check out wikitextprocessor PR #363

Issue was that in some Greek templates, we had a ton of
`{{#if:...}}` templates as template arguments, and they
were written with spaces in between: `| {{#if:...}} {{#if:...}} |`.

The ifs left magic characters which prevented stripping
before they were expanded into empty strings, so the
trimming has to be done later.

Co-authored-by: xxyzz <gitpull@protonmail.com>
very helpfully for debugging Lua modules
@xxyzz xxyzz merged commit 3c3f0a7 into main Feb 17, 2025
10 checks passed
@xxyzz xxyzz deleted the zerozerozero branch February 17, 2025 02:54
@kristian-clausal
Collaborator Author

Thanks for taking a look and completing the stuff I missed! 👍
