
T317274: Use free software implementation for Phonos on Wikimedia sites
Open, Needs Triage, Public

Description

The current plan for Phonos is to use a proprietary Google API when deployed to Wikimedia sites. There are some free software options for the same functionality, but they are supposedly not up to par. This task outlines and tracks those deficiencies, so we can file upstream bugs and switch once a free option can replace Google.

An analysis was done in T307624: [16 hours] Investigate: Options of TTS engines, with some notes at https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2022/Reading/IPA_audio_renderer/TTS_investigation

Event Timeline

[offtopic] Is there a separate task where we can track the reliance on a proprietary service instead of using a free software solution?

T307624: [16 hours] Investigate: Options of TTS engines. Some of us (myself included) were passionate about pushing for FOSS, but of the options we found, Google performed extraordinarily better and supports many more languages. The Language team, who will eventually take over ownership of Phonos, also seemed keener on using Google than on having to maintain/update third-party code. At any rate, Phonos was designed in such a way that we can add more engines if we find one better than Larynx and eSpeak, which we already support. For now, I believe the decision has been made that we're going with Google.
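As an aside for readers: the pluggable-engine design mentioned above might look something like the sketch below. This is purely illustrative, written in Python rather than the PHP the actual extension uses, and every class and method name here is hypothetical, not the real Phonos API.

```python
# Purely illustrative sketch of a pluggable TTS engine design; all names
# are hypothetical (the real Phonos extension is written in PHP and differs).
from abc import ABC, abstractmethod


class TtsEngine(ABC):
    """One backend (Google, Larynx, eSpeak, ...) behind a common interface."""

    @abstractmethod
    def get_audio(self, ipa: str, text: str, lang: str) -> bytes:
        """Render the given input to audio bytes (e.g. MP3)."""


class EspeakEngine(TtsEngine):
    def get_audio(self, ipa: str, text: str, lang: str) -> bytes:
        # A real implementation would shell out to espeak-ng here.
        raise NotImplementedError


# Adding or swapping an engine is then a registry/config change,
# not a code change elsewhere in the extension:
ENGINES: dict[str, type[TtsEngine]] = {"espeak": EspeakEngine}


def make_engine(name: str) -> TtsEngine:
    return ENGINES[name]()
```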

To be clear, I'm not asking that the decision be reconsidered now; rather, I would like actionable feedback that we can file upstream, so we can figure out what resources could be used to help build an acceptable free software replacement.

I reviewed that task and Meta-Wiki page (linked in main task description) but didn't feel it had enough detail that I could point an upstream at it and ask what it would take to get those issues fixed.

On the issue of language support: for ContentTranslation, AIUI, we use the free Apertium implementation where it's high quality, and only use proprietary services where it's not usable. Depending on what wikis this is initially targeting, maybe that's an option?
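A hedged sketch of that per-language routing idea follows; the engine names and the set of "good enough" languages below are invented purely for illustration.

```python
# Hypothetical per-language routing: prefer the free engine where its
# output quality is acceptable, fall back to the proprietary one elsewhere.
# The engine names and language codes are made up for illustration.
FREE_ENGINE_OK = {"pl", "es", "ka"}  # hypothetical "good enough" set


def pick_engine(lang: str) -> str:
    return "larynx" if lang in FREE_ENGINE_OK else "google"
```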

Remaining [offtopic] here, but I'd quite like to see if we can maintain https://larynx-tts.wmcloud.org/openapi. Perhaps I'll request a WMCS project for it, chuck a load of compute power at a VM, and see how the maximum settings sound (the generation time at that level tends to be fairly long on even moderately spec'd VMs).
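For anyone wanting to poke at such an instance, a minimal sketch of an HTTP query is below. It assumes an /api/tts endpoint taking text and voice query parameters and returning WAV audio; check the instance's /openapi spec for the real interface before relying on this.

```python
# Minimal sketch of querying a Larynx TTS server over HTTP. The endpoint
# path and query parameters here are assumptions; consult the /openapi
# spec of the actual instance for the real interface.
import requests

BASE = "https://larynx-tts.wmcloud.org"

resp = requests.get(
    f"{BASE}/api/tts",                                 # assumed endpoint
    params={"text": "hello world", "voice": "en-us"},  # assumed params
    timeout=120,  # high-quality settings can take a while to render
)
resp.raise_for_status()
with open("out.wav", "wb") as f:
    f.write(resp.content)
```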

Hello there, thanks for writing this and for asking for more details! I agree with you that our preferred route would definitely have been open source, and we tried hard to make it the case, as @MusikAnimal and @TheresNoTime noted above!

> I reviewed that task and Meta-Wiki page (linked in main task description) but didn't feel it had enough detail that I could point an upstream at it and ask what it would take to get those issues fixed.

Mind sharing what kind of details would be most helpful? Here is the functionality I see as most imperative for an open source engine to reach parity, if we were to go the Larynx/other open source route:

  • Supports as many language options as the option we selected
  • Has an accessible learning curve / low barrier to contributing for open source devs
  • Is as reliable as the option we selected
  • Sounds as close to human as the option we selected

> Depending on what wikis this is initially targeting, maybe that's an option?

The scope of the project is to benefit all wikis that use IPA and are public.

Happy to answer more! I will move this to Tracking instead of Backlog, since I know CommTech engineers (myself included) are still passionate about considering this route.

For the people developing this at Language-Team... please do not let it be just about IPA.

For languages like Georgian, Hausa, Polish, and Spanish (and even Mandarin, which is written with thousands of characters but where each one has a consistent pronunciation), the pronunciation is completely predictable from the orthography, so speakers don't need IPA to tell how a word is pronounced: they can just look at the spelling and know. (Or, in cases like Russian and Slovene, an auxiliary system fills in the details about the pronunciation that the orthography does not normally cover.)

Forcing users to input IPA just to get text spoken, when computers are just as capable of deriving the same audio from the orthographic form, would privilege those who can afford to learn it and would be highly inequitable. Please avoid it at all costs; orthography-to-audio has more demand anyway.
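To make the orthography-to-audio point concrete, here is a small sketch using the real espeak-ng command-line tool, whose -q flag suppresses audio output and whose --ipa flag prints the phonemes as IPA; the Polish voice chosen here is just an example.

```python
# Sketch: derive IPA from plain orthography for a language with
# predictable spelling, using the espeak-ng CLI (-q = no audio output,
# --ipa = print IPA phonemes, -v = voice/language).
import subprocess


def orthography_to_ipa(text: str, voice: str = "pl") -> str:
    result = subprocess.run(
        ["espeak-ng", "-q", "--ipa", "-v", voice, text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Example (requires espeak-ng to be installed):
# print(orthography_to_ipa("dziękuję"))  # IPA for a Polish word
```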