Post Translation: Add translation of posts based on GlotPress by dd32 · Pull Request #90 · WordPress/wordpress.org

dd32 · 2022-07-28T05:41:53Z

This PR is a combination of the work done to translate Patterns, WordPress.org/gutenberg/, and now https://github.com/WordPress/wporg-main-2022

The block parsing classes in this PR differ from those used for Patterns (see https://github.com/WordPress/pattern-directory/tree/trunk/public_html/wp-content/plugins/pattern-translations/includes). In testing I found issues with the Patterns parsers, so I started writing my own, which seemed to work more reliably for post translation.

The strings for the wporg-main-2022 theme are currently in this project:
https://translate.wordpress.org/projects/disabled/posttranslation/wordpress-org-main-test
(The cron job is not run upon post update; it must be run manually through the CLI.)

See WordPress/wporg-main-2022#15

A list of TODOs from code and review:

- Create GlotPress projects within a parent-project scope for translations.
- Re-evaluate how projects are selected when deciding where to pull a string from. Currently this is done through post_meta and filtered, but it should probably be defined per-site rather than per-post, or be 100% automatically selected.
- Re-evaluate how the GlotPress interaction in MakePot happens. Other projects on WordPress.org install that as a helper plugin on translate.w.org and call a WP-CLI method instead.
- Some Block Templates are known to not be caught by the the_content filters.
- Some strings from post content include line-break tags such as <br>. It might be better to standardise some of these prior to inserting into GlotPress, for example by replacing <br> with a literal \n, although that will make retrieving them harder.
- Importing these strings directly into GlotPress results in a lot of HTML and class/href attributes. It might be nice if we could replace <a href="https://...../"> and similar attribute-laden tags with their bare forms (such as <a>) in translations, filling the attributes back in on retrieval. This is likely a v2/v3 feature though, but it ties in to the above point about line-break tags in the content.

akirk · 2022-07-28T08:01:06Z

Really nice, this is quite something to pull off. I've gone through the code and noted how it works along the way (maybe useful for others) and added some remarks:

Posts are processed upon saving as follows:
- A GlotPress project is associated with each post.
 - Remark: By default, all go into a single project which I see as flawed/dangerous because an import to a project in GlotPress will remove all strings not in the imported PO file. So if you save post A and then post B the strings from post A will be thrown out unless it's in a dedicated project. There is a todo that sounds like projects in GlotPress could be created automatically for this purpose.
- The strings are imported via a new MakePot class that analyzes the posts on Gutenberg Block level.
 - Remark: We store references to the URL, but we need to think about strings that appear in multiple places and need to be translated differently. We need to think about how a context could be added selectively. I'm thinking this could be an array of problematic strings, and they could get the block id as context.
- The parsing of blocks happens with custom parsers defined by the block type and they can individually extract strings as follows:
 - An HTML parser can extract strings from HTML tags and attributes. For example, for an image block it will make the contents of a <figcaption> tag translatable as well as the title and alt attributes of any tag. This uses regex to parse the HTML.
 - The BasicText parser extracts all text nodes individually.
- The import happens async using wp_cron.
[Here translation can happen by the community on the projects in translate.wordpress.org]
Translated posts are output using the_title, get_the_excerpt and the_content filters:
- Translations are fetched directly from GlotPress.
- The translations are inserted into the HTML using the custom block parsers from above, which have replace_strings() functions: they re-extract the strings and then use the provided translations to preg_replace() them, for example in the HTMLParser.
- wp_cache is used to cache the translated HTML for 6hrs.
  - Remark: I'm not sure why the last changed time of all posts is part of the cache key. Upon a saved translation in GlotPress (using a gp_translation_saved filter) the cache could be invalidated, too.
  - Remark: I see using a per-language custom post type as a feasible place to store the translation as well.
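
To make the extract-then-replace flow in the list above concrete, here is a minimal Python sketch. It is not the PR's PHP code: the two regular expressions, the helper names, and the sample markup are assumptions made for this illustration, and the real parsers may match quite differently.

```python
import html
import re

# Assumed patterns for this sketch; the real parsers may match differently.
FIGCAPTION_RE = re.compile(r"<figcaption[^>]*>(.*?)</figcaption>", re.S)
ATTRIBUTE_RE = re.compile(r'(?<![\w-])(?:alt|title)="([^"]*)"')


def extract_strings(markup):
    """Return the unique, non-empty strings found in captions and alt/title attributes."""
    found = []
    for pattern in (FIGCAPTION_RE, ATTRIBUTE_RE):
        for match in pattern.finditer(markup):
            text = match.group(1)
            if text and text not in found:
                found.append(text)
    return found


def _swap(translations, escape):
    """Build a re.sub callback that replaces only the captured string."""
    def callback(match):
        original = match.group(1)
        if original not in translations:
            return match.group(0)  # untranslated: leave the markup untouched
        translated = translations[original]
        if escape:
            # Attribute values must not break out of their quotes.
            translated = html.escape(translated, quote=True)
        whole, offset = match.group(0), match.start(0)
        return whole[: match.start(1) - offset] + translated + whole[match.end(1) - offset:]
    return callback


def replace_strings(markup, translations):
    """Re-extract the same strings and substitute their translations."""
    markup = FIGCAPTION_RE.sub(_swap(translations, escape=False), markup)
    return ATTRIBUTE_RE.sub(_swap(translations, escape=True), markup)
```

Because extraction and replacement share the same patterns, a string that was never extracted can never be replaced, which is the property the re-extract step in the list relies on.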

Further Remarks

The display locale must be set through code right now, potentially with a locale filter that looks at the URL?

Out of ignorance of how exactly block themes work: will the_content be called multiple times for different segments (i.e. templates) of the page? If so, could the header, for example, use a different GlotPress project than the main page and thus be shared between all of the pages?

Using projects for each page in GlotPress has pros and cons:

- Strings shared between pages (like header texts) would likely exist in multiple projects. This could lead to inconsistencies in translations and make it hard to find the correct place to change an incorrect translation.
- Updating a project with partial PO files is a pain because it either wipes out all other strings or you need to manually backfill the existing strings in the project so that they are not removed.
- It's great to have all strings for a page in a single project, but it is important that the sequence of originals matches the text on the page. We're not splitting longer texts into sentences, which
  - makes out-of-sequence originals less of a problem, but
  - means that if something small in a longer text is changed, the whole translation will be either removed or fuzzied, and because of how GlotPress handles these it will be hard to determine what exactly was changed in the English text.

As already mentioned in WordPress/wporg-main-2022#15 (comment), if (RTL) languages need changes to the HTML, it might be necessary to do those transforms on the PHP level individually.

I see invalidation of translated pages upon translation save as quite necessary to validate good translations. It can be really hard to be sure that a new translation works if you cannot see it in context. It'd be unfeasible having to wait for the cache to expire.

Overall, the approach seems technically feasible and already exposes some of the difficulties we'll face in Gutenberg Phase 4 but also shows potential solutions for those. It will be exciting to see how this holds up in production, it looks quite ready for first experiments.

dd32 · 2022-07-28T08:36:33Z

Thanks @akirk!

akirk: Remark: By default, all go into a single project

The code that throws things into a single project is a hold-over from early prototypes (this has actually been built 3-4 times now). I would anticipate that it would be a project-per-site at a minimum.

akirk: which I see as flawed/dangerous because an import to a project in GlotPress will remove all strings not in the imported PO file.

This is actually handled by the MakePot import class; the same code is in use for translations on /patterns/ and also within Automattic, I believe.

I would love to see this scenario handled natively by GlotPress rather than code like this, I'm not sure how though.. I suspect it would end up very similar to this. It might be that storing it as project-per-page and a translation UI that just pulled strings from "current project, and all sub projects" would work here though.
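
The thread does not show how the MakePot import class avoids dropping other posts' strings. Purely as an assumption, one way such an import could work is to track which posts reference each original and retire an original only when its last reference disappears. The function name and data shape below are hypothetical and are not taken from the PR.

```python
def import_post_strings(originals, post_ref, strings):
    """Update `originals` (string -> set of post references) in place for one post.

    Hypothetical sketch: strings belonging to other posts survive the import,
    and an original is dropped only once no post references it any more.
    """
    new = set(strings)
    for text in list(originals):
        refs = originals[text]
        if text not in new:
            refs.discard(post_ref)  # this post no longer uses the string
        if not refs:
            del originals[text]  # nobody uses it any more: retire the original
    for text in new:
        originals.setdefault(text, set()).add(post_ref)
    return originals
```

Saving post A and then post B leaves A's strings in place under this scheme, which is the behaviour akirk's remark was worried about losing.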

akirk: Remark: We store references to the URL but we need to think about strings that might appear in multiple places but need to be translated differently

Handling context of strings that need to be translated differently on different pages (or different blocks on the same page!) is a shortcoming of this, and I suspect the best way to handle this will be to add a "context flag" to the Block in the editor, so that the strings within that Block (or Block Group) have that as the original's context. I haven't fleshed that idea out yet.

E.g., in pseudo-block markup, these would be two strings, one having header as the context and the other download (showing both block-level and parent-block-group-level context):

<header>
   <button context="header">Get WordPress Now!</button>
</header>
<group context="download">
   <block>...</block>
   <button>Get WordPress Now!</button>
   <block>...</block>
</group>

The URL reference here is mostly to provide translators the ability to click through to the pages where it's used, not to provide string context (at least not in gettext context terms).
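
The "context flag" idea above can be sketched as a tree walk in which each block inherits the context of its nearest flagged ancestor. This is only an illustration in Python: the dictionary shape (`context`, `text`, `children`) is a hypothetical stand-in for parsed blocks and does not reflect Gutenberg's actual data model.

```python
def collect_originals(block, inherited_context=None):
    """Return (context, string) pairs for a block and all of its descendants.

    A block's own "context" flag wins; otherwise the nearest ancestor's
    context is inherited (hypothetical block shape, for illustration only).
    """
    context = block.get("context", inherited_context)
    found = []
    text = block.get("text")
    if text:
        found.append((context, text))
    for child in block.get("children", []):
        found.extend(collect_originals(child, context))
    return found


# The pseudo-block example from above, expressed as nested dictionaries.
page = {
    "children": [
        {"name": "header", "children": [
            {"name": "button", "context": "header", "text": "Get WordPress Now!"},
        ]},
        {"name": "group", "context": "download", "children": [
            {"name": "block"},
            {"name": "button", "text": "Get WordPress Now!"},
            {"name": "block"},
        ]},
    ]
}
```

With this input the same English string is collected twice, once under the header context and once under download, so it could be translated differently in each place.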

akirk: Remark: I'm not sure why the last changed time of all posts is part of the cache key.

This is an overly cautious approach, I admit. I chose to clear the whole translation cache upon any page alteration, as the cache here is only being used for speed: it avoids running the block parsing on every pageload. Clearing the cache every time a page edit is made is not too bad. The 6hr cache timeframe could also be lowered to 1hr for the same reason, since it only exists to avoid the parsing running on every pageload.
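
A minimal sketch of that caching scheme, written in Python with an in-memory dictionary standing in for wp_cache. The class and method names are hypothetical, and a simple counter plays the role of the "last changed" value. Bumping it makes every old key unreachable, which is how a single page edit (or, as suggested, a saved translation) can invalidate everything at once.

```python
import time


class TranslatedHtmlCache:
    """In-memory stand-in for the translated-HTML cache (illustrative only)."""

    def __init__(self, ttl_seconds=6 * 60 * 60, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.generation = 0  # plays the role of the "last changed" value
        self.entries = {}

    def bump_generation(self):
        """Call on any page edit or translation save; old keys stop matching."""
        self.generation += 1

    def get_or_render(self, post_id, locale, render):
        """Return cached HTML, or call render() and cache its result."""
        key = (post_id, locale, self.generation)
        hit = self.entries.get(key)
        now = self.clock()
        if hit is not None and hit[0] > now:
            return hit[1]
        rendered = render()  # the expensive block parsing and replacement
        self.entries[key] = (now + self.ttl, rendered)
        return rendered
```

This sketch never evicts stale entries; a real object cache would expire them on its own.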

akirk: [... cache key ...] Upon a saved translation in GlotPress (using a gp_translation_saved filter) the cache could be invalidated, too.

That would be beneficial for the GlotPress_Translate_Bridge plugin's caches, which are used by Themes and possibly Plugins descriptions, and now here.

akirk: Remark: I see using a per-language custom post type as a feasible place to store the translation as well.

That was an option I considered, and if it was the source-of-truth for the translations, it would make perfect sense. In this case however, we're pulling from the source-of-truth which is GlotPress and on-the-fly converting the page, just with caching.

On the Pattern Directory we do use similar code, but instead of on-the-fly-and-cache we created DB posts, as that was the quickest way to implement Searching of the content. That may be needed for Translated Pages in the future for sure, but for now, we can skip the DB step IMHO.

akirk: The display locale must be set through code right now, potentially with a locale filter that looks at the URL?

Correct. Currently we run a single theme, wporg-main, on all Rosetta sites (wordpress.org + de.wordpress.org for example), with get_locale() returning de_DE on the latter. It's not anticipated that one would view https://wordpress.org/ in German right now, except through https://de.wordpress.org/ (although this could change later, if we were to consolidate Rosetta sites into just a translated wordpress.org experience).

On other sites, such as https://learn.wordpress.org/ we have a Language switcher, which also filters get_locale() to point to the locale being displayed, rather than having subdomains as WordPress.org uses.
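
For illustration only, a locale lookup driven by the URL could look like the Python sketch below. The subdomain table is a hypothetical two-entry sample (only the de.wordpress.org to de_DE pairing comes from this thread), and the function name is an assumption. A language switcher like the one on learn.wordpress.org would instead read the visitor's choice.

```python
from urllib.parse import urlsplit

# Hypothetical sample table; only "de" -> "de_DE" is taken from the discussion.
SUBDOMAIN_LOCALES = {"de": "de_DE", "ja": "ja"}


def locale_for_url(url, default="en_US"):
    """Guess the display locale from a Rosetta-style subdomain."""
    host = (urlsplit(url).hostname or "").lower()
    suffix = ".wordpress.org"
    if host.endswith(suffix):
        subdomain = host[: -len(suffix)]
        return SUBDOMAIN_LOCALES.get(subdomain, default)
    return default
```

Subdomains that are not locale sites, such as learn, simply fall back to the default here.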

akirk: Out of ignorance how block themes work exactly but will the_content be called multiple times for different segments

Good question! I'm not actually sure of the answer. In practice, I don't understand how Block themes work under the hood either. the_content may not be the appropriate filter here, but it works (for both this, and other potential uses of the plugin).
I suspect there are some block-template-specific filters that will also need to be hooked, as with a previous iteration of this I do know that it failed to translate some content that was not defined in the main Front Page page (see the 'Some Block Templates are known to not be caught by the the_content filters.' TODO item above, which doesn't explain what the problem is).

akirk: It's great to have all strings for a page in a single project but it is important that the sequence of originals matches the text on the page.

At this point, that's an issue that already exists AFAIK; this isn't going to make it any better or worse.
The point I think you're trying to make here is that GlotPress's UI is not always the best translation interface for a document. I suspect further work to implement Community-Translator in the future may help here, as translating would be done in-page with far more context provided. Right now though, the focus is just getting content translatable while also removing the need to hard-code strings in PHP files :)

akirk: if something small in a longer text is changed, the whole translation will be either removed or fuzzied and because of how GlotPress handles these will be hard to determine what was exactly changed in the English text.

This feels like two different issues:

- This is a case where the String Importer we're using might be able to help by storing a BlockID of sorts (although that doesn't exist right now). Instead of forcing GlotPress to determine "This string looks to be the previous version of this string.." we might be able to enhance the fuzzy code to understand that "This old original was originally in this block" even if the strings are 95% different.
- If GlotPress doesn't already, then when it detects a fuzzy it should display a diff of the old original and new original in the UI, to show the difference in the strings and why it was marked fuzzy. This is out-of-scope of this issue though :)
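
As a rough idea of what such a diff could show a translator, the Python sketch below marks removed and added words between the old and new original. It uses the standard-library difflib module; the function name and the `[-…-]` / `{+…+}` markers are choices made for this example and say nothing about what GlotPress does.

```python
import difflib


def word_diff(old, new):
    """Return a one-line, word-level diff between two originals."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.extend(old_words[i1:i2])
            continue
        if i2 > i1:  # words present only in the old original
            parts.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j2 > j1:  # words present only in the new original
            parts.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(parts)
```

Pairing the old and new original by a stored block identifier, rather than by textual similarity, would let this work even when the two strings have little in common.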

akirk: I see invalidation of translated pages upon translation save as quite necessary to validate good translations. It can be really hard to be sure that a new translation works if you cannot see it in context. It'd be unfeasible having to wait for the cache to expire.

I would love to improve the loop-time for translations to clear the cache for sure!
Existing translations of these pages are often on a 24hr+ timeframe from GlotPress change to visible (Once daily cron export from GlotPress, and then the next WordPress.org deploy). Once this is working acceptably in production though, we can start to clear caches proactively from the GlotPress side when a translation is changed, which should be able to reduce this loop-time down much more.

amieiro · 2022-07-28T10:52:42Z

In a recent Polyglots meeting, @naokomc talked about how the Japanese team translates the handbooks and keeps them updated:

- They transform the WordPress content to Markdown using the wp-handbook-converter npm package.
- They translate the Markdown strings.
- I don't know how they convert from the translated Markdown to HTML inside the WordPress pages.

I think it could be interesting to try the translation process presented in this PR on the handbooks as a beta test, because translators have this need. What do you think?

tellyworth · 2022-07-29T00:36:40Z

Is there an angle to explore around the way Gutenberg saves block markup?

Generate special classes or IDs on translatable elements, for example. Wrap translatable text in a special tag. Store data in JSON blobs. Or even use formats as a mechanism: require content authors to explicitly identify text to be translated.

dd32 · 2022-07-29T07:16:55Z

tellyworth: Is there an angle to explore around the way Gutenberg saves block markup?

100% there is - however this is something that's probably going to be best considered as part of Gutenberg Phase 4, purely because those who have their minds deep within the Gutenberg internals and have been thinking about it in the back of their minds forever, are sure to have a better idea of how that could be done.

tellyworth: Generate special classes or IDs on translatable elements, for example. Wrap translatable text in a special tag. Store data in JSON blobs.

The main problem I can see there, is that translations can be within HTML tags, or within attributes, or possibly even within the block meta-data in a custom field that's later inserted into / used by a dynamic block.

I could see using an ASCII control character (there's a literal Start/End of text pair - STX/ETX) as a viable flag to use to wrap textual pieces though.
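
A small Python sketch of how STX/ETX markers could work in practice. Nothing here exists in the PR or in Gutenberg: the function names are hypothetical, and the sketch assumes the saved markup could carry these control characters at all.

```python
import re

STX, ETX = "\x02", "\x03"  # ASCII "start of text" / "end of text"
MARKED_RE = re.compile(f"{STX}(.*?){ETX}", re.S)


def mark(text):
    """Wrap a translatable piece of text in STX/ETX markers."""
    return f"{STX}{text}{ETX}"


def extract_marked(content):
    """Return every marked string, wherever it sits (tag body or attribute)."""
    return MARKED_RE.findall(content)


def translate_marked(content, translations):
    """Swap in translations and strip the markers; untranslated text is kept."""
    return MARKED_RE.sub(
        lambda m: translations.get(m.group(1), m.group(1)), content
    )
```

Because the markers are independent of HTML structure, the same mechanism covers text in tag bodies, attributes, and strings that themselves contain inline tags.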

amieiro: I think it could be interesting to use the translation process presented in this PR into the handbooks as beta testers, because translators have this need. What do you think?

Handbooks were the original target for the code I started working on for this, and then Developer Documentation (ie. the Code Reference), then w.org/gutenberg/ and now this.. I've started and stopped writing this code so many times it's not funny :D

Realistically though, Handbooks probably require a little more effort than what I'm currently targeting (Homepage/Downloads), as a document needs more context from the surrounding text for good translations, whereas with this it's mostly short strings that are separate from one another.

akirk · 2022-07-29T07:24:34Z

dd32: The main problem I can see there, is that translations can be within HTML tags, or within attributes, or possibly even within the block meta-data in a custom field that's later inserted into / used by a dynamic block.

This! I could see us ending up with some sort of XPath definition file per Gutenberg post that acts like an array of C pointers to the translatable strings, although the text inside meta-data (very valid and real point, @dd32) might need some extra work.

The tricky part is not only that the text can be inside HTML tags but that potentially the translatable strings then could also contain HTML, so it's not like you could say, let's just translate all DOM textnodes.
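
The problem with translating DOM text nodes one by one can be seen with a few lines of Python, using only the standard-library html.parser module. The class and function names are made up for this illustration.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect every non-blank text node, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        if data.strip():
            self.nodes.append(data)


def text_nodes(markup):
    """Return the text nodes of an HTML fragment in document order."""
    collector = TextNodeCollector()
    collector.feed(markup)
    collector.close()
    return collector.nodes
```

For a paragraph such as `<p>Get <strong>WordPress</strong> now!</p>`, this yields three separate fragments instead of one sentence, so a translator could not reorder the words around the emphasised part. Keeping the inline HTML inside the translatable string avoids that, at the cost of exposing markup to translators.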

dd32 · 2026-04-14T02:31:16Z

Closing in favour of #602

dd32 added 5 commits July 28, 2022 05:30

Add a parameter to ::translate()

5a6ff58

GlotPress Translate Bridge: Add support for the found parameter, to specify the string was actually translated (even if it remained the same).

143314a

GlotPress Translate Bridge: Add some hacks and debug/todos

072c49e

Post Translation: Import the initial Post Translation work, based on the work for /gutenberg/, /patterns/ pattern translation, etc.

8ca2589

TODO: Translations with and  

0c42f7e

dd32 mentioned this pull request Jul 28, 2022

How do we handle translation? WordPress/wporg-main-2022#15

Closed

dd32 added 6 commits August 3, 2022 06:07

Enable the Shortcode parser, which just imports a shortcode block shortcode as a string.

074db66

Specify a translation project per site, with optional per-post project.

ab2b95b

Create the Translation project and translation sets on-the-fly

4edc911

Variable typo..

446934c

Allow for some post-meta to be translated, if specified via a filter.

6eb5aa1

Bump version for artificial reasons

bb993ad

dd32 self-assigned this Aug 3, 2022

dd32 added the [Type] Enhancement label Aug 3, 2022

dd32 added 7 commits August 3, 2022 07:52

Remove one hack, and add another.

a4517d4

Load the 'admin' code in the rest api too.

2cdab25

Use the HTMLParser for Buttons rather than XPath, to get the attributes of the button

9709d7a

More temporary hackery, to run the GlotPress import at post save, rather than running it within a cron task that will never fire on a sandbox

adc2960

Only monitor for changes to published PAGES.

dfa3bd6

Try to catch blocks being rendered outside of the_content (such as block templates).

ab912e6

Don't parse strings out of column/group blocks, as we'll process the innerblocks with the strings in a bit

43a9328

ryelle mentioned this pull request Sep 13, 2022

Provide parser code for other projects to use WordPress/wporg-mu-plugins#269

Open

ryelle added a commit to WordPress/wporg-mu-plugins that referenced this pull request Jan 26, 2023

Port over changes from WordPress/wordpress.org#90

4f303bf

adamwoodnz mentioned this pull request Jan 26, 2023

Fix export of content with list items WordPress/wporg-main-2022#178

Merged

ryelle added a commit to WordPress/wporg-mu-plugins that referenced this pull request Apr 3, 2023

Port over changes from WordPress/wordpress.org#90

e5b334f

ryelle added a commit to WordPress/wporg-main-2022 that referenced this pull request Apr 3, 2023

Update parsers

2b9149d

This includes the updates from WordPress/wordpress.org#90, a new AttributeParser for attribute-only blocks, and a fix for the ListItem block to allow child lists. Fixes #211

ryelle added a commit to WordPress/wporg-main-2022 that referenced this pull request Apr 3, 2023

Update parsers

12fb450

This includes the updates from WordPress/wordpress.org#90, a new AttributeParser for attribute-only blocks, and a fix for the ListItem block to allow child lists. Fixes #211

ryelle mentioned this pull request Apr 3, 2023

Content Parsers: Update parsers, fix list-item parsing, fix broken manifest WordPress/wporg-main-2022#247

Merged

ryelle mentioned this pull request May 15, 2023

Update rosetta homepages to new homepage version WordPress/wporg-main-2022#266

Closed

dd32 closed this Apr 14, 2026

dd32 wants to merge 18 commits into WordPress:trunk from dd32:feature/post-translation

4 participants
