Request #42359
From:
Account Type:
Premium Paid Account
Dreamwidth:
Account Name:
juan_gandhi
Email confirmed?
Yes
data version: 10
scheme:
lynx
Support category:
Time posted:
Thu, 20 Aug 2020 21:44:27 GMT (246 weeks ago)
Status:
answered (still needs help)
Summary:
[Cyrillic] encoding failure in comments in one of my posts
Original Request:
Hi,
Please take a look here: https://juan-gandhi.dreamwidth.org/3660899.html?nc=140#comments
It's supposed to be Cyrillic. As far as I understand, the texts are, so to say, double-UTF-8-encoded. That is, we take UTF-8 bytes, think they are in some other encoding, and encode them as UTF-8 again.
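To illustrate what double encoding means, here is a minimal Python sketch. It assumes the intermediate misreading is Latin-1 (ISO-8859-1); which encoding is actually involved on the site is not known, and the variable names are only for illustration.

```python
# Illustration only: how correct UTF-8 turns into "double-encoded" text.
# ASSUMPTION: the wrong intermediate encoding is Latin-1 (ISO-8859-1).

original = "О чём они договорились, когда начинали проект?"

# Step 1: the text is correctly encoded as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Step 2: some component wrongly interprets those bytes as Latin-1 text,
# so every byte becomes one (wrong) character.
misread = utf8_bytes.decode("latin-1")

# Step 3: that wrong text is encoded as UTF-8 a second time.
double_encoded = misread.encode("utf-8")

# The double-encoded form is longer, because each non-ASCII byte
# from step 1 now takes two bytes.
print(len(utf8_bytes), len(double_encoded))

# A browser decoding double_encoded as UTF-8 shows the garbled text.
print(double_encoded.decode("utf-8"))
```

Undoing steps 3 and 2 in reverse order (decode as UTF-8, re-encode as Latin-1, decode as UTF-8) gives back the original string in this sketch.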
How do I know? 6 years working with localization tools (and developing them) at Borland. Happened regularly. Especially if you use Perl for handling strings. Do you?
Anyway, I hope you'll find a solution.
Best regards, and thanks a lot for what you do.
-Vlad Patryshev
Diagnostics: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36
Answer (#123237)
Posted: Fri, 04 Nov 2022 08:26:36 GMT (131 weeks ago)
Hi there -
Apologies for the long delay in getting you a response; we weren't quite sure how to tackle this issue. However, we've made some recent changes that may have corrected this.
Grazie,
H2
Comment (#123241)
Posted: Fri, 04 Nov 2022 10:31:36 GMT (131 weeks ago)
Hi,
So far the problem has not been corrected.
I probably could help you with this, after 6 years of handling localization at Borland, and a couple of years doing similar things at Google.
The text I see on the page, e.g. <a href="https://juan-gandhi.dreamwidth.org/3660899.html?nc=140#comments">here</a>, e.g. Ð ÑÑм они договоÑилиÑÑ, когда наÑинали пÑоекÑ?
...
is the UTF-8 encoding of Cyrillic characters. You can try an <a href="https://cafewebmaster.com/online_tools/utf8_decode">online UTF-8 decoder</a> and see perfectly good Cyrillic, like "О чём они договорились, когда начинали проект?"
I assume (hopefully correctly) that you use UTF-8 inside your system. Since on the page I also see raw UTF-8, it means that the Cyrillic text was converted to UTF-8 twice during importing from LiveJournal. I ran into this a lot while at Borland, and my scripts simply checked for it before storing texts into the translation database.
I believe you may have tons of such problematic encodings in the db, not just in my small examples.
My solution was: a) have a script that detects double UTF-8 encoding, b) decode it once, leaving a good UTF-8 representation of Cyrillic (and maybe other scripts as well).
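The two steps above could be sketched roughly as follows in Python. This is not the Borland script and not anything from the Dreamwidth codebase; the function names `looks_double_encoded` and `repair` are hypothetical, and the sketch assumes the wrong intermediate encoding was Latin-1 and that a whole string is uniformly mis-encoded.

```python
def looks_double_encoded(text: str) -> bool:
    """Heuristic check (hypothetical helper): does `text` look like
    UTF-8 bytes that were misread as Latin-1?"""
    try:
        # If every character fits in Latin-1, recover the candidate bytes.
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        # Characters outside Latin-1 (e.g. real Cyrillic): not this error.
        return False
    if raw.isascii():
        # Pure ASCII is identical in both encodings; nothing to repair.
        return False
    try:
        # Genuine Latin-1 text with accents rarely forms valid UTF-8.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def repair(text: str) -> str:
    """Decode once if the text looks double-encoded, else leave it alone."""
    if looks_double_encoded(text):
        return text.encode("latin-1").decode("utf-8")
    return text


good = "О чём они договорились, когда начинали проект?"
broken = good.encode("utf-8").decode("latin-1")  # simulate the failure

print(repair(broken) == good)  # the garbled form is repaired
print(repair(good) == good)    # correct text is left untouched
```

Being a heuristic, it can misfire on short Latin-1 strings that happen to be valid UTF-8, so any bulk repair of a database would probably want a dry run and a human look at the results first.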
Feel free to ping me in case you need any cooperation. I love Dreamwidth, have been using it for many years, and at times I even manage to pull people back from their Facebook accounts to dw.
Best regards,
-Vlad Patryshev