Request #42359

From:
Account Type:
Premium Paid Account
Dreamwidth:
Account Name: [personal profile] juan_gandhi
Style: (S2) core: public, layout: public, theme: public, user: custom
Email confirmed? Yes
cluster: 10
data version: 10
scheme: lynx
Media storage used: 2901.353 MB (96.7%)
Support category:
Time posted:
Thu, 20 Aug 2020 21:44:27 GMT (246 weeks ago)
Status:
answered (still needs help)
Summary:
[Cyrillic] encoding failure in comments in one of my posts
Original Request:
Hi,

Please take a look here: https://juan-gandhi.dreamwidth.org/3660899.html?nc=140#comments

It's supposed to be Cyrillic. As far as I can tell, the texts are, so to speak, double-UTF-8-encoded. That is, text that is already UTF-8 gets mistaken for some other encoding and is encoded to UTF-8 again.
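
A minimal sketch of how this kind of double encoding can arise. It assumes the intermediate misreading is Latin-1; the encoding actually involved on the site's side is not known, and the variable names are purely illustrative.

```python
# Illustration only: how Cyrillic text can end up double-UTF-8-encoded.
# Assumption: the bytes are misread as Latin-1 (ISO-8859-1) in between.

original = "О чём"

# Step 1: the text is correctly encoded to UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Step 2 (the mistake): the UTF-8 bytes are treated as Latin-1 text,
# so every byte becomes a separate character.
misread = utf8_bytes.decode("latin-1")

# Step 3: the misread text is encoded to UTF-8 a second time.
double_encoded = misread.encode("utf-8")

# A reader decoding the stored bytes as UTF-8 gets the misread
# characters back instead of the original Cyrillic.
shown = double_encoded.decode("utf-8")

print(repr(original))
print(repr(shown))
```

Each Cyrillic letter occupies two bytes in UTF-8, and each of those bytes is encoded again as a two-byte sequence, which is why the garbled text shows roughly two accented Latin characters per original letter.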

How do I know? I spent 6 years working with localization tools (and developing them) at Borland, and this happened regularly, especially when Perl was used for handling strings. Do you use Perl?

Anyway, I hope you'll find a solution.

Best regards, and thanks a lot for what you do.

-Vlad Patryshev
Diagnostics: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36
[personal profile] highlander_ii - Highlander II
Answer (#123237)
Posted: Fri, 04 Nov 2022 08:26:36 GMT (131 weeks ago)
Hi there -

Apologies for the long delay in getting you a response; we weren't quite sure how to tackle this issue. However, we've made some recent changes that may have corrected this.

Grazie,
H2
[personal profile] juan_gandhi - Juan-Carlos Gandhi
Comment (#123241)
Posted: Fri, 04 Nov 2022 10:31:36 GMT (131 weeks ago)
Hi,

So far, the problem has not been corrected.
I could probably help you with this, after 6 years of handling localization at Borland and a couple of years doing similar things at Google.

The text I see on the page (for example, <a href="https://juan-gandhi.dreamwidth.org/3660899.html?nc=140#comments">here</a>), e.g. О чём они договорились, когда начинали проект?
...

is the UTF-8 encoding of Cyrillic characters. You can run it through an <a href="https://cafewebmaster.com/online_tools/utf8_decode">online UTF-8 decoder</a> and see perfectly good Cyrillic, like "О чём они договорились, когда начинали проект?"

I assume (and hope) that you use UTF-8 inside your system. Since I also see raw UTF-8 on the page, it means that the Cyrillic text was converted to UTF-8 twice during the import from LiveJournal. I ran into this a lot while at Borland, and my scripts simply checked for it before storing texts in the translation database.

I believe you may have tons of such problematic encodings in the db, not just in my small examples.

My solution was: a) have a script that detects double UTF-8 encoding, b) decode it once, leaving a good UTF-8 representation of Cyrillic (and maybe other scripts as well).
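
The two steps above can be sketched roughly as follows. This is not the original Borland script and not Dreamwidth's code; the function names `fix_double_utf8` and `is_double_encoded` are hypothetical, and the sketch assumes the extra layer came from reading UTF-8 bytes as Latin-1.

```python
def fix_double_utf8(text: str) -> str:
    """Remove one layer of double UTF-8 encoding if it is detected.

    Assumption: the damage came from UTF-8 bytes being misread as
    Latin-1 and then encoded to UTF-8 again. Text that does not match
    this pattern is returned unchanged.
    """
    try:
        # Recover the candidate bytes. This fails if the text contains
        # any character above U+00FF, as correctly stored Cyrillic does.
        raw = text.encode("latin-1")
        # Those bytes must themselves form valid UTF-8; ordinary
        # Latin-1 text such as "café" fails here.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def is_double_encoded(text: str) -> bool:
    """Step (a): detection. True if decoding once changes the text."""
    return fix_double_utf8(text) != text


if __name__ == "__main__":
    good = "О чём они договорились, когда начинали проект?"
    # Simulate the damaged form as a reader of the stored data sees it.
    broken = good.encode("utf-8").decode("latin-1")
    print(is_double_encoded(broken))
    print(fix_double_utf8(broken))
```

The check is a heuristic: a string that legitimately consists of Latin-1 characters forming a valid UTF-8 byte sequence would also be rewritten, so a real cleanup would probably be limited to the imported comments and reviewed before being written back. If the intermediate encoding was Windows-1252 rather than Latin-1, the recovery step would need to be adapted.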

Feel free to ping me in case you need any cooperation. I love Dreamwidth, have been using it for many years, and at times I even manage to pull people back from their Facebook accounts to dw.

Best regards,
-Vlad Patryshev