Post Translation: Add translation of posts based on GlotPress#90

Closed
dd32 wants to merge 18 commits into WordPress:trunk from dd32:feature/post-translation

Conversation

@dd32
Member

@dd32 dd32 commented Jul 28, 2022

This PR is a combination of the work done to translate Patterns, WordPress.org/gutenberg/, and now https://github.com/WordPress/wporg-main-2022.

The block parsing classes in this PR differ from those used for Patterns (see https://github.com/WordPress/pattern-directory/tree/trunk/public_html/wp-content/plugins/pattern-translations/includes). In testing I found issues with them and started writing my own parsers, which seemed to work more reliably for the purpose of post translation.

The strings for the wporg-main-2022 theme are currently in this project:
https://translate.wordpress.org/projects/disabled/posttranslation/wordpress-org-main-test
(The cron job is not run upon post update; it must be run manually through the CLI.)

See WordPress/wporg-main-2022#15

A list of TODOs from code and review:

  • Create GlotPress projects within a parent-project scope for translations
  • Re-evaluate how projects are selected when pulling a string. Currently this is done through post_meta and filtered, but it should probably be defined per-site rather than per-post, or be selected 100% automatically.
  • Re-evaluate how the GlotPress interaction in MakePot happens. Other projects on WordPress.org install that as a helper-plugin to translate.w.org and call a WP-CLI method instead.
  • Some Block Templates are known to not be caught by the the_content filters.
  • Some strings from post content include &nbsp; and <br> tags. It might be better to standardise some of these before inserting them into GlotPress (for example, replacing <br> with a literal \n), although that will make retrieving them harder.
  • Importing these strings directly into GlotPress results in a lot of HTML and class/href attributes. It might be nice if we could replace <a href="https://...../"> and <span class="has-many-attributes"> with <a> and <span> respectively in translations, filling them back in on retrieval. This is likely a v2/v3 feature, but it ties in to the <br> handling mentioned above.
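
The attribute-stripping idea in the last TODO could look roughly like the sketch below. It is only an illustration under stated assumptions: `strip_attributes` and `restore_attributes` are hypothetical helper names, the regex is a simplification rather than a real HTML parser, and restoring assumes the translator keeps the same tags (they may be reordered).

```python
import re

# Hypothetical, simplified pattern: an opening tag that carries attributes.
TAG_WITH_ATTRS = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\s+([^<>]*?)\s*(/?)>")
BARE_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)(/?)>")


def strip_attributes(html):
    """Return (simplified_html, saved), where saved lists (tag, attrs) in order."""
    saved = []

    def _strip(match):
        tag, attrs, slash = match.groups()
        saved.append((tag, attrs))
        return f"<{tag}{slash}>"

    return TAG_WITH_ATTRS.sub(_strip, html), saved


def restore_attributes(translated, saved):
    """Re-insert saved attributes into bare tags, matched by tag name and order."""
    remaining = list(saved)

    def _restore(match):
        tag, slash = match.groups()
        for index, (saved_tag, attrs) in enumerate(remaining):
            if saved_tag == tag:
                del remaining[index]
                return f"<{tag} {attrs}{slash}>"
        return match.group(0)  # nothing saved for this tag, leave it bare

    return BARE_TAG.sub(_restore, translated)


original = (
    'Get <a href="https://wordpress.org/download/">WordPress</a> '
    '<span class="has-many-attributes">now</span>'
)
simplified, saved = strip_attributes(original)
restored = restore_attributes(simplified, saved)
```

Here `simplified` is what a translator would see, and `restored` is what would be rendered after retrieval. Matching by tag name rather than by position lets a translation move the `<span>` ahead of the `<a>` without mixing up their attributes.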

@akirk
Member

akirk commented Jul 28, 2022

Really nice, this is quite something to pull off. I've gone through the code and noted how it works along the way (maybe useful for others) and added some remarks:

Further Remarks

The display locale must be set through code right now, potentially with a locale filter that looks at the URL?

Out of ignorance how block themes work exactly but will the_content be called multiple times for different segments (i.e. templates) of the page? So for example the header could use a different GlotPress project than the main page and thus be shared between all of the pages?

Using projects for each page in GlotPress has pros and cons:

  • Strings shared between pages (like header texts) would likely exist in multiple projects, this could lead to inconsistencies in translations and make it hard to find the correct place where to change an incorrect translation.
  • Updating a project with partial PO files is a pain because it either wipes out all other strings or you need to manually backfill the existing strings in the project so that they are not removed.
  • It's great to have all strings for a page in a single project but it is important that the sequence of originals matches the text on the page. We're not splitting longer texts into sentences which
    • makes out of sequence originals less of a problem, but
    • if something small in a longer text is changed, the whole translation will be either removed or fuzzied and because of how GlotPress handles these will be hard to determine what was exactly changed in the English text.

As already mentioned in WordPress/wporg-main-2022#15 (comment), if (RTL) languages need changes to the HTML, it might be necessary to do those transforms on the PHP level individually.

I see invalidation of translated pages upon translation save as quite necessary to validate good translations. It can be really hard to be sure that a new translation works if you cannot see it in context. It'd be unfeasible having to wait for the cache to expire.

Overall, the approach seems technically feasible and already exposes some of the difficulties we'll face in Gutenberg Phase 4 but also shows potential solutions for those. It will be exciting to see how this holds up in production, it looks quite ready for first experiments.

@dd32
Member Author

dd32 commented Jul 28, 2022

Thanks @akirk!

Remark: By default, all go into a single project

The code that throws everything into a single project is a hold-over from early prototypes (this has actually been built 3-4 times now). I would anticipate a project-per-site at a minimum.

which I see as flawed/dangerous because an import to a project in GlotPress will remove all strings not in the imported PO file.

This is actually handled by the MakePot import class, the same code is in use for translations on /patterns/ and also within Automattic I believe.

I would love to see this scenario handled natively by GlotPress rather than by code like this, though I'm not sure how; I suspect it would end up very similar. Storing it as project-per-page, with a translation UI that pulls strings from "current project, and all sub projects", might work here.
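
To make the "backfill" concern concrete, here is a minimal sketch of merging a partial import with the originals already in a project so that nothing is dropped. It is not the MakePot import class; `backfill_originals` and the `(context, singular)` key shape are assumptions made for illustration.

```python
def backfill_originals(existing, incoming):
    """Merge a partial import into the originals a project already holds.

    Both arguments map a (context, singular) key to a dict with a
    "references" list. Originals found only in `existing` are kept, so
    importing a single page does not remove strings from other pages.
    """
    merged = {key: dict(entry) for key, entry in existing.items()}
    for key, entry in incoming.items():
        if key in merged:
            # Same original used in several places: keep every reference.
            refs = merged[key].get("references", []) + entry.get("references", [])
            merged[key]["references"] = sorted(set(refs))
        else:
            merged[key] = dict(entry)
    return merged


existing = {
    (None, "Get WordPress"): {"references": ["https://wordpress.org/"]},
    (None, "Download"): {"references": ["https://wordpress.org/download/"]},
}
incoming = {
    (None, "Get WordPress"): {"references": ["https://wordpress.org/download/"]},
}
merged = backfill_originals(existing, incoming)
```

After the merge, "Download" is still present even though the partial import did not mention it, and "Get WordPress" lists both pages as references.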

Remark: We store references to the URL but we need to think about strings that might appear in multiple places but need to be translated differently

Handling strings that need to be translated differently on different pages (or in different blocks on the same page!) is a shortcoming of this. I suspect the best way to handle it will be to add a "context flag" to the Block in the editor, so that the strings within that Block (or Block Group) use it as the original's context. I haven't fleshed that idea out yet.

E.g., in pseudo-block markup, these would be two strings, one with header as its context and the other with download (showing both block-level and parent-block-group-level context):

<header>
   <button context="header">Get WordPress Now!</button>
</header>
<group context="download">
   <block>...</block>
   <button>Get WordPress Now!</button>
   <block>...</block>
</group>

The URL reference here is mostly to provide translators the ability to click through to the pages where it's used, not to provide string context (at least not in gettext context terms).
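
A rough sketch of how such a context flag might be resolved, assuming a parsed block tree is available. The `Block` class and `collect_strings` are hypothetical and exist only for this example; the rule shown is that a block's own context wins, otherwise the nearest ancestor's context is inherited.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    name: str
    text: Optional[str] = None
    context: Optional[str] = None  # the hypothetical "context flag"
    children: List["Block"] = field(default_factory=list)


def collect_strings(block, inherited=None):
    """Return (context, text) pairs for a block and all of its descendants."""
    context = block.context or inherited
    pairs = []
    if block.text:
        pairs.append((context, block.text))
    for child in block.children:
        pairs.extend(collect_strings(child, context))
    return pairs


# Mirrors the pseudo-block markup above.
page = Block("page", children=[
    Block("header", children=[
        Block("button", text="Get WordPress Now!", context="header"),
    ]),
    Block("group", context="download", children=[
        Block("block"),
        Block("button", text="Get WordPress Now!"),
        Block("block"),
    ]),
])
pairs = collect_strings(page)
```

The two identical button labels come out as separate originals, one with the `header` context and one with the `download` context inherited from its group.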

Remark: I'm not sure why the last changed time of all posts is part of the cache key.

This is an overly cautious approach, I admit. I chose to clear the whole translation cache upon any page alteration because the cache is only used for speed: it avoids running the block parsing on every page load. Clearing the cache every time a page is edited is not too bad. The 6hr cache timeframe could also be lowered to 1hr for the same reason, since it only exists to avoid parsing on every page load.
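
As an illustration of that caching strategy (not the PR's actual code), the sketch below folds a site-wide "last changed" value into the cache key, so any page edit makes every old key unreachable, and uses a time-to-live as a backstop. `translation_cache_key` and `TTLCache` are hypothetical names.

```python
import hashlib


def translation_cache_key(post_id, locale, posts_last_changed):
    """Build a key that changes whenever any post on the site is edited."""
    raw = f"{post_id}:{locale}:{posts_last_changed}"
    return "translated_post_" + hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock):
        self.ttl = ttl_seconds
        self.clock = clock  # callable returning the current time in seconds
        self._items = {}

    def get(self, key):
        item = self._items.get(key)
        if item is None or item[0] <= self.clock():
            return None
        return item[1]

    def set(self, key, value):
        self._items[key] = (self.clock() + self.ttl, value)


now = [0]
cache = TTLCache(ttl_seconds=6 * 3600, clock=lambda: now[0])
key_before = translation_cache_key(42, "de_DE", "2022-07-28 10:00:00")
cache.set(key_before, "<p>translated markup</p>")
key_after = translation_cache_key(42, "de_DE", "2022-07-28 11:30:00")
```

Looking up `key_after` misses even though the entry stored under `key_before` has not expired, which is the "clear everything on any edit" behaviour described above.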

[... cache key ...] Upon a saved translation in GlotPress (using a gp_translation_saved filter) the cache could be invalidated, too.

That would be beneficial for the GlotPress_Translate_Bridge plugin's caches too; that plugin is used by Themes and possibly Plugins descriptions, and now here.

Remark: I see using a per-language custom post type as a feasible place to store the translation as well.

That was an option I considered, and if it was the source-of-truth for the translations, it would make perfect sense. In this case however, we're pulling from the source-of-truth which is GlotPress and on-the-fly converting the page, just with caching.

On the Pattern Directory we do use similar code, but instead of on-the-fly-and-cache we created DB posts, as that was the quickest way to implement Searching of the content. That may be needed for Translated Pages in the future for sure, but for now, we can skip the DB step IMHO.

The display locale must be set through code right now, potentially with a locale filter that looks at the URL?

Correct. Currently we run a single theme, wporg-main, on all Rosetta sites (wordpress.org + de.wordpress.org, for example), with get_locale() returning de_DE on the latter. It's not anticipated that one would view https://wordpress.org/ in German right now, except through https://de.wordpress.org/ (although this could change later if we were to consolidate Rosetta sites into a single translated wordpress.org experience).

On other sites, such as https://learn.wordpress.org/, we have a Language switcher which also filters get_locale() to point to the locale being displayed, rather than using subdomains as WordPress.org does.
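
The subdomain-based variant can be pictured with a short sketch. The lookup table and `locale_for_host` are assumptions for illustration; only the `de` to `de_DE` pairing comes from the discussion, and this is not how Rosetta actually resolves locales.

```python
# Hypothetical subset of a subdomain -> locale table.
SUBDOMAIN_LOCALES = {"de": "de_DE"}


def locale_for_host(host, default="en_US"):
    """Pick a display locale from a hostname such as de.wordpress.org."""
    subdomain, _, rest = host.lower().partition(".")
    if rest == "wordpress.org":
        return SUBDOMAIN_LOCALES.get(subdomain, default)
    return default


german = locale_for_host("de.wordpress.org")
english = locale_for_host("wordpress.org")
```

A language-switcher site would replace the hostname lookup with whatever the visitor selected, but the result feeds the same locale filter.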

Out of ignorance how block themes work exactly but will the_content be called multiple times for different segments

Good question! I'm not actually sure of the answer; I don't fully understand how Block themes work under the hood either. the_content may not be the appropriate filter here, but it works (for both this and other potential uses of the plugin).
I suspect there are some block-template-specific filters that will also need to be hooked, as a previous iteration of this failed to translate some content that was not defined in the main Front Page (see the 'Some Block Templates are known to not be caught by the the_content filters.' TODO item above, which doesn't go into detail).

It's great to have all strings for a page in a single project but it is important that the sequence of originals matches the text on the page.

At this point, that's an issue that already exists AFAIK; this isn't going to make it any better or worse.
I think the point you're making is that GlotPress's UI is not always the best translation interface for a document. I suspect further work to implement Community-Translator in the future may help here, as translating would be done in-page with far more context provided. Right now, though, the focus is just on getting content translatable while also removing the need to hard-code strings in PHP files :)

if something small in a longer text is changed, the whole translation will be either removed or fuzzied and because of how GlotPress handles these will be hard to determine what was exactly changed in the English text.

This feels like two different issues.

  1. This is a case where the String Importer we're using might be able to help by storing a BlockID of sorts (although that doesn't exist right now). Instead of forcing GlotPress to determine "This string looks to be the previous version of this string..", we might be able to enhance the fuzzy code to understand that "This old original was originally in this block", even if the strings are 95% different.
  2. If GlotPress doesn't already do so, when it detects a fuzzy it should display a diff of the old original and new original in the UI, to show what changed and why it was marked fuzzy. This is out of scope for this issue though :)
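
Both points can be sketched together: pair old and new originals by a block ID instead of by string similarity, then show a word-level diff for each changed pair. The block IDs are hypothetical (as noted, they don't exist today), and `match_originals` and `word_diff` are illustrative helpers, not GlotPress functionality.

```python
import difflib


def match_originals(old, new):
    """Pair originals by block ID and return those whose text changed.

    `old` and `new` map a hypothetical block ID to the original string.
    """
    return [
        (block_id, old[block_id], text)
        for block_id, text in new.items()
        if block_id in old and old[block_id] != text
    ]


def word_diff(old_text, new_text):
    """Return changed words: '-' marks removed words, '+' marks added ones."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changes.extend("-" + word for word in old_words[i1:i2])
        if tag in ("replace", "insert"):
            changes.extend("+" + word for word in new_words[j1:j2])
    return changes


old = {"b1": "Download WordPress 6.0 today", "b2": "Swag"}
new = {"b1": "Download WordPress 6.1 today", "b2": "Swag", "b3": "New block"}
changed = match_originals(old, new)
diff = word_diff(changed[0][1], changed[0][2])
```

Because the pairing is by ID, a rewritten string would still be linked to its previous original even when a similarity-based fuzzy match would give up on it.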

I see invalidation of translated pages upon translation save as quite necessary to validate good translations. It can be really hard to be sure that a new translation works if you cannot see it in context. It'd be unfeasible having to wait for the cache to expire.

I would love to improve the loop-time for translations by clearing the cache, for sure!
Existing translations of these pages are often on a 24hr+ timeframe from GlotPress change to visible (once-daily cron export from GlotPress, and then the next WordPress.org deploy). Once this is working acceptably in production, though, we can start to clear caches proactively from the GlotPress side when a translation is changed, which should reduce this loop-time considerably.

@amieiro
Contributor

amieiro commented Jul 28, 2022

In a recent Polyglots meeting, @naokomc talked about how the Japanese team is translating the handbooks and how they have them updated:

  • They transform the WordPress content to Markdown using the wp-handbook-converter npm package.
  • They translate the Markdown strings.
  • I don't know how they convert from the translated Markdown to HTML inside the WordPress pages.

I think it could be interesting to use the translation process presented in this PR into the handbooks as beta testers, because translators have this need. What do you think?

@tellyworth
Contributor

Is there an angle to explore around the way Gutenberg saves block markup?

Generate special classes or IDs on translateable elements, for example. Wrap translateable text in a special tag. Store data in json blobs. Or even use formats as a mechanism - require content authors to explicitly identify text to be translated.

@dd32
Member Author

dd32 commented Jul 29, 2022

tellyworth: Is there an angle to explore around the way Gutenberg saves block markup?

100% there is - however this is something that's probably going to be best considered as part of Gutenberg Phase 4, purely because those who have their minds deep within the Gutenberg internals and have been thinking about it in the back of their minds forever, are sure to have a better idea of how that could be done.

tellyworth: Generate special classes or IDs on translateable elements, for example. Wrap translateable text in a special tag. Store data in json blobs.

The main problem I can see there, is that translations can be within HTML tags, or within attributes, or possibly even within the block meta-data in a custom field that's later inserted into / used by a dynamic block.

I could see using an ASCII control character (there's a literal Start/End of text pair - STX/ETX) as a viable flag to use to wrap textual pieces though.
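
A minimal sketch of the STX/ETX idea, assuming the markers survive storage and are stripped before output. The helper names `mark`, `extract_marked` and `translate_marked` are hypothetical.

```python
import re

STX, ETX = "\x02", "\x03"  # ASCII "start of text" and "end of text"
MARKED = re.compile(f"{STX}(.*?){ETX}", re.DOTALL)


def mark(text):
    """Wrap a translatable piece of text in STX/ETX control characters."""
    return f"{STX}{text}{ETX}"


def extract_marked(content):
    """Return every marked string, wherever in the markup it appears."""
    return MARKED.findall(content)


def translate_marked(content, translations):
    """Swap marked strings for their translations and drop the markers."""
    return MARKED.sub(lambda m: translations.get(m.group(1), m.group(1)), content)


content = f'<img alt="{mark("WordPress logo")}"><p>{mark("Get WordPress")}</p>'
originals = extract_marked(content)
rendered = translate_marked(content, {"Get WordPress": "WordPress holen"})
```

The same markers work inside an attribute and inside element text, which is the case that classes, IDs or wrapper tags cannot cover. Untranslated strings fall back to the original with the markers removed.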

amieiro: I think it could be interesting to use the translation process presented in this PR into the handbooks as beta testers, because translators have this need. What do you think?

Handbooks were the original target for the code I started working on for this, and then Developer Documentation (i.e. the Code Reference), then w.org/gutenberg/ and now this. I've started and stopped writing this code so many times it's not funny :D

Realistically though, Handbooks probably require a little more effort than what I'm currently targeting (Homepage/Downloads), as a Document needs more context from the surrounding text for good translations, whereas here it's mostly short strings that are separate from one another.

@akirk
Member

akirk commented Jul 29, 2022

The main problem I can see there, is that translations can be within HTML tags, or within attributes, or possibly even within the block meta-data in a custom field that's later inserted into / used by a dynamic block.

This! I could see us ending up with some sort of XPath definition file per Gutenberg post that acts like an array of C pointers to the translatable strings, although the text inside meta-data (very valid and real point, @dd32) might need some extra work.

The tricky part is not only that the text can be inside HTML tags but that potentially the translatable strings then could also contain HTML, so it's not like you could say, let's just translate all DOM textnodes.
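
To picture the "definition file of pointers" idea, here is a small sketch using the limited XPath subset in Python's `xml.etree.ElementTree`. It assumes well-formed markup, which real block content may not be, and the `(path, attribute)` pointer format is invented for this example.

```python
import xml.etree.ElementTree as ET


def inner_markup(element):
    """Return an element's content, keeping any inline HTML it contains."""
    parts = [element.text or ""]
    for child in element:
        # tostring() includes the child's tail text, so nothing is lost.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def extract_strings(root, pointers):
    """Resolve (path, attribute) pointers to translatable strings.

    attribute=None means "the element's inner markup"; otherwise the named
    attribute's value is used.
    """
    strings = []
    for path, attribute in pointers:
        element = root.find(path)
        if element is None:
            continue  # stale pointer: the markup changed since it was written
        if attribute is None:
            strings.append(inner_markup(element))
        else:
            strings.append(element.get(attribute, ""))
    return strings


root = ET.fromstring(
    '<div><p>Get <strong>WordPress</strong> now</p>'
    '<img alt="WordPress logo" src="logo.png" /></div>'
)
pointers = [("./p", None), ("./img", "alt")]
strings = extract_strings(root, pointers)
```

The paragraph pointer yields one string that still contains `<strong>`, which shows why translating individual DOM text nodes would split a sentence into fragments.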

@dd32 dd32 self-assigned this Aug 3, 2022
@dd32 dd32 added the [Type] Enhancement New feature or request label Aug 3, 2022
ryelle added a commit to WordPress/wporg-mu-plugins that referenced this pull request Apr 3, 2023
ryelle added a commit to WordPress/wporg-main-2022 that referenced this pull request Apr 3, 2023
This includes the updates from WordPress/wordpress.org#90, a new AttributeParser for attribute-only blocks, and a fix for the ListItem block to allow child lists.
Fixes #211
ryelle added a commit to WordPress/wporg-main-2022 that referenced this pull request Apr 4, 2023
…nifest (#247)

* Update parsers

This includes the updates from WordPress/wordpress.org#90, a new AttributeParser for attribute-only blocks, and a fix for the ListItem block to allow child lists.
Fixes #211

* Add phpunit test infrastructure and tests

Pulled from #181

* Remove swag page from manifest

* Update content with new parser

* Add a phpunit workflow
BreadGuy007 added a commit to BreadGuy007/WP_org that referenced this pull request Apr 15, 2024
@dd32
Member Author

dd32 commented Apr 14, 2026

Closing in favour of #602

@dd32 dd32 closed this Apr 14, 2026

Labels

[Type] Enhancement New feature or request

4 participants