Comments (10)

ZhymabekRoman commented on July 3, 2024

@reddere, use GoogleTranslateV2 and wrap each of your "static" links/hashtags in a notranslate span tag:

<span class="notranslate">TAGS OR LINKS THERE</span>

For more information visit: https://cloud.google.com/translate/troubleshooting

In [5]: from translatepy.translators.google import GoogleTranslateV2

In [6]: dl = GoogleTranslateV2()

In [9]: dl.translate('Kado Thorne es un Vampiro y viajó en el tiempo desde el año 2020 cuando se presentó a la skin Oro.\n\n<span class="notranslate">#Fortnite</span> <span class="notranslate">#FortniteLastResort</span> <span class="notranslate">https://t.co/m1cE9sSrNb</span>', 'it')
Out[9]: TranslationResult(service=Translator(Google), source='Kado Thorne es un Vampiro y viajó en el tiempo desde el año 2020 cuando se presentó a la skin Oro.\n\n<span class="notranslate">#Fortnite</span> <span class="notranslate">#FortniteLastResort</span> <span class="notranslate">https://t.co/m1cE9sSrNb</span>', source_lang=Language(Spanish), dest_lang=Language(Italian), translation='Kado Thorne è un vampiro e ha viaggiato indietro nel tempo a partire dall\'anno 2020 quando gli è stata presentata la skin Oro.\n\n<span class="notranslate">#Fortnite</span> <span class="notranslate">#FortniteLastResort</span> <span class="notranslate">https://t.co/m1cE9sSrNb</span>')
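Wrapping the hashtags and links by hand gets tedious, so here is a minimal sketch of how the wrapping and unwrapping could be automated. The helper names `protect` and `unprotect` and the regex are assumptions made for illustration, they are not part of translatepy, and the pattern is deliberately simple:

```python
import re

# Assumed pattern: http(s) links and #hashtags. Adjust it to your own input.
PROTECTED = re.compile(r"https?://\S+|#\w+")
SPAN = re.compile(r'<span class="notranslate">(.*?)</span>')


def protect(text: str) -> str:
    """Wrap every hashtag and link in a notranslate span."""
    return PROTECTED.sub(
        lambda m: f'<span class="notranslate">{m.group(0)}</span>', text
    )


def unprotect(text: str) -> str:
    """Remove the notranslate spans, keeping their content."""
    return SPAN.sub(r"\1", text)


text = "Kado Thorne es un Vampiro.\n\n#Fortnite #FortniteLastResort https://t.co/m1cE9sSrNb"
wrapped = protect(text)
# `wrapped` is what would be passed to GoogleTranslateV2().translate(wrapped, 'it');
# the translated string would then go through unprotect().
restored = unprotect(wrapped)
print(wrapped)
print(restored == text)
```

The round trip only shows that the spans can be added and removed cleanly; whether the translator leaves the wrapped parts untouched still depends on the service.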

from translate.

Animenosekai commented on July 3, 2024

Do you have an example to reproduce ?

reddere commented on July 3, 2024

Do you have an example to reproduce ?

Absolutely @Animenosekai! Here is a text I got from a tweet. Notice how the letters of both hashtags and of the tweet link are altered. In the second hashtag, a letter even gets added out of nowhere.

from translatepy.translators.google import GoogleTranslate 

text = 'Kado Thorne es un Vampiro y viajó en el tiempo desde el año 2020 cuando se presentó a la skin Oro.\n\n#Fortnite #FortniteLastResort https://t.co/m1cE9sSrNb'

translator = GoogleTranslate()

italian_text = translator.translate(text, 'Italian')

print(italian_text)

Result:
Kado Thorne è un vampiro e ha viaggiato nel tempo dal 2020 quando apparve nell'oro della pelle.\n\n#FORTNITE #FORTNITLelasTResort https://t.co/M1ce9SSRNB

Even though the normal text was translated fine, the hashtags and the link were altered:

  • Hashtag 1 went from #Fortnite to #FORTNITE (letter case altered)
  • Hashtag 2 went from #FortniteLastResort to #FORTNITLelasTResort (letter case altered, the E of "Fortnite" dropped, and "Last" distorted into "Lelas", which doesn't mean anything in Italian)
  • The link went from https://t.co/m1cE9sSrNb to https://t.co/M1ce9SSRNB. This alteration breaks the link entirely.

Any ideas on how to fix this?
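For anyone who wants to check their own output for this kind of alteration, here is a small sketch that compares the hashtags and links before and after translation. The function name `altered_tokens` and the regex are assumptions for illustration only, and the strings are shortened versions of the ones above:

```python
import re

# Assumed pattern: http(s) links and #hashtags.
TOKEN = re.compile(r"https?://\S+|#\w+")


def altered_tokens(source: str, translated: str) -> list[str]:
    """Return the hashtags/links of `source` that are missing, as exact tokens, from `translated`."""
    kept = set(TOKEN.findall(translated))
    return [token for token in TOKEN.findall(source) if token not in kept]


source = "Kado Thorne es un Vampiro.\n\n#Fortnite #FortniteLastResort https://t.co/m1cE9sSrNb"
translated = "Kado Thorne è un vampiro.\n\n#FORTNITE #FORTNITLelasTResort https://t.co/M1ce9SSRNB"
print(altered_tokens(source, translated))
# ['#Fortnite', '#FortniteLastResort', 'https://t.co/m1cE9sSrNb']
```

Comparing whole tokens rather than substrings avoids counting #Fortnite as preserved just because #FortniteLastResort still contains it.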

Animenosekai commented on July 3, 2024

Parsing with a regex, maybe?

reddere commented on July 3, 2024

What do you mean? Are there params I can pass to the GoogleTranslate() instance that let me hide parts of the passed text using a regex?

Animenosekai commented on July 3, 2024

What do you mean? Are there params I can pass to the GoogleTranslate() instance that let me hide parts of the passed text using a regex?

Nope, not for now, but should I?

Here is the major problem that comes with this and with HTML translation, though:

#71 (comment)

TL;DR: This might work for Latin-based languages, but different languages have different structures, and the order of words might need to change from one language to another. (This is also one of the reasons why, when we translate, we don't translate each word individually and then put the pieces back together.)

reddere commented on July 3, 2024

Yeah, I mean implementing what I said would actually make it way better. The issue you mentioned kind of relates to the topic, and it's easily fixable by adding a space in the final result after dots or commas when one is missing. But implementing regex, or any other way to hide certain parts of the text, would be awesome, as they frequently get altered.

Animenosekai commented on July 3, 2024

Yes, this issue might be easier to handle than normal translations, as links don't exactly mean anything and don't need to be translated.

But here is the problem:

First, it is not possible to translate the parts separately, because that might not produce the best translation (words carry different meanings as a whole than they do individually). Also, as said before, there is no telling whether the position of the link will change, so we can't just pin the position of the link and put it back there after the translation:

(French) Je voudrais changer le lien https://google.com parce qu'il me semble y avoir trouvé une erreur
(Japanese) https://google.comのリンクに問題があると思うから変えたいです

Notice the change of position of the link

Now, if we let the translator translate everything and it ends up having issues with the links, we might want to find the link in the translated text and replace it with the previous one.

Something like this would be imaginable:

def link_correction(translated_text: str, links: list[str]) -> str:
    """A simple link correction function to keep the same links as before translation"""
    for link in links:
        # try to find the link in the translated text, ignoring case
        index = translated_text.lower().find(link.lower())
        if index == -1:
            continue  # altered beyond a case change: leave the text as-is
        # replace the altered link with the one from before translation
        translated_text = translated_text[:index] + link + translated_text[index + len(link):]
    return translated_text

Note
This is an oversimplification of what could be done

Now, as you mentioned previously:

Link went from https://t.co/m1cE9sSrNb to https://t.co/M1ce9SSRNB. This alteration breaks entirely the link.

So if we have two links that are identical once lowercased, they might both be replaced by the same link.
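One way around that collision, sketched below under assumptions, is to find the links in the translated text with a regex and hand out each original link at most once. The name `restore_links` and the URL pattern are hypothetical, and the sketch assumes that colliding links keep their relative order in the translation, which is not guaranteed:

```python
import re
from collections import defaultdict, deque

# Assumed pattern: anything starting with http(s):// up to the next whitespace.
# Trailing punctuation glued to a link would be captured too.
URL = re.compile(r"https?://\S+", re.IGNORECASE)


def restore_links(translated_text: str, links: list[str]) -> str:
    """Swap each link in the translation for an original that matches it case-insensitively."""
    pending = defaultdict(deque)
    for link in links:
        pending[link.lower()].append(link)  # originals sharing a lowercase form queue up

    def swap(match: re.Match) -> str:
        queue = pending.get(match.group(0).lower())
        # use each original once, in order; unknown links are left untouched
        return queue.popleft() if queue else match.group(0)

    return URL.sub(swap, translated_text)


links = ["https://t.co/AbC", "https://t.co/aBc"]
translated = "Vedi HTTPS://T.CO/ABC e HTTPS://T.CO/ABC"
print(restore_links(translated, links))
# Vedi https://t.co/AbC e https://t.co/aBc
```

This only repairs case changes. A link whose characters were added or dropped, like the second hashtag in the example above, would not match and is left as the translator returned it.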


Now what should I do ?

  • Should I implement something that takes a regular expression, splits the original text on it, translates each part individually and puts the pieces back together at the end, leaving the regex-matched parts untouched, even though this comes with the first issue mentioned?
  • Should I implement the oversimplified algorithm written above?
  • Should I also implement adding back the spaces after dots, even though this would only work for languages that put spaces after dots (Latin-based ones, for example) and might break the others?
  • And what if, for some reason, the user wants to translate the links?

Note
Even if I'm only talking about links here the same thing applies to the hashtags, with the exception that hashtags are even harder to correct after the translation as they might carry some meaning and might need to be translated
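To make the first option concrete, here is a rough sketch of the split-and-translate idea. Everything in it is an assumption for illustration: `translate_around` is a hypothetical helper, the regex is a simple one, and `translate` stands for any callable that maps a string to its translation (`str.upper` is used as a stand-in so the example runs offline):

```python
import re
from typing import Callable

# Assumed pattern; the capturing group makes re.split keep the matched parts.
PROTECTED = re.compile(r"(https?://\S+|#\w+)")


def translate_around(text: str, translate: Callable[[str], str]) -> str:
    """Translate only the parts of `text` that do not match PROTECTED."""
    parts = PROTECTED.split(text)  # protected matches land at the odd indices
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1 or not part.strip():
            out.append(part)  # protected match or whitespace-only piece: keep as-is
        else:
            out.append(translate(part))
    return "".join(out)


# str.upper is only a stand-in for a real translator call.
print(translate_around("Hola mundo #Fortnite https://t.co/m1cE9sSrNb", str.upper))
# HOLA MUNDO #Fortnite https://t.co/m1cE9sSrNb
```

As said above, this keeps the protected parts pinned to their original positions and translates each piece without its context, so it inherits the word-order problem. A real translator may also strip the whitespace around each piece, which the sketch does not compensate for.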

reddere commented on July 3, 2024

Thank you so much @ZhymabekRoman @Animenosekai. I haven't tested the workaround yet, but I kept my old GoogleTranslator until just 2 days ago, when I tried the ReversoTranslator, which to me seems to work even better than GoogleTranslator. Both lexically and in its choice of words, it seems to work decently in Italian.

Somehow, though, I found an issue with that one as well: it throws an error when words like única are in the source text. I think it's better to open a separate issue for that one: #96

Animenosekai commented on July 3, 2024

Was talking with Venom on Discord about possible workarounds and support for notranslate or other HTML parsing ways of not translating certain parts of a given input. Might consider this soon.
