Character encoding
Posted by Utumno on Mon 10 Nov 2008 at 10:12
Tags: none.

(locale: en_US.UTF-8; system is fully UTF-8)

I am a Pole living in Taiwan, so I frequently need to view Chinese characters or data with Polish diacritics (I don't need to input them, just view them). This mostly works in a graphical environment but fails miserably in the console. Also, I cannot seem to serve files encoded in ISO-8859-2 correctly with Apache (Polish-specific letters are garbled). More specifically, when I set the charset in Apache to ISO-8859-2, file contents are displayed correctly, but file NAMES are not (take a look for yourself: www.koltunski.pl/test). I suspect that this and the garbled Polish data in the console are really one and the same problem.

[RANT]
Shouldn't all this just work transparently?? Isn't a fully UTF-8 system all about being able to view whatever you want, whenever you want?? Why is it so complicated?? We have to have kernel support for various codepages, we probably have to mount filesystems with the correct 'charset' and 'codepage' options, we have to set all those "LANG"s, "LC_ALL"s and whatnot, we have to install appropriate fonts and God knows what else...
[/RANT]

 

Comments on this Entry

Re: Character encoding
Posted by Anonymous (213.227.xx.xx) on Mon 10 Nov 2008 at 13:08
Whoa! One problem at a time!

> This mostly works in graphical environment but fails miserably in the console.

When you say console, do you actually mean the console of the computer you are sitting at, or are you using something else to access the console?
Maybe you mean PuTTY, in which case set the character set in PuTTY to UTF-8.

> I cannot seem to be able to correctly serve files encoded with ISO-8859-2 with Apache ( Polish-specific letters are garbled ). More specifically, when I set charset in Apache to ISO-8859-2, then file contents are displayed correctly, but file NAMES are not ( take a look for yourself: www.koltunski.pl/test )

You can force a particular character set for a given file extension, but remember that on a UTF-8 system the filenames themselves are normally stored as UTF-8.

Why not convert the files to utf-8 with iconv? It's 2008.
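A minimal sketch of that conversion, assuming the file really is ISO-8859-2 and that iconv is installed. The file names and the sample word are made up for illustration.

```shell
#!/bin/sh
# Scratch directory and hypothetical file names, for illustration only.
workdir=$(mktemp -d)

# "się" in ISO-8859-2: the byte 0xEA (octal 352) is "ę".
printf 'si\352\n' > "$workdir/latin2.txt"

# Re-encode the contents; the original file is left untouched.
iconv -f ISO-8859-2 -t UTF-8 "$workdir/latin2.txt" > "$workdir/utf8.txt"

# Dump the result as hex: "ę" is now the two-byte sequence c4 99.
converted_hex=$(od -An -tx1 "$workdir/utf8.txt" | tr -d ' \n')
echo "$converted_hex"
```

Note that this only re-encodes file contents. A file's name is a separate string of bytes and has to be fixed by renaming the file.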


Re: Character encoding
Posted by Utumno (118.160.xx.xx) on Mon 10 Nov 2008 at 17:26
Of course I tried to convert the filenames:

leszek@utumno:~/encoding-tests$ ls
BytyÃ& Atilde;ƒâ€ â€& acirc;„¢ÃƒÆ’ââ ;‚¬Å¡Ãƒâ€š&Atild e;‚±
leszek@utumno:~/encoding-tests$ echo `ls` | iconv -f ISO-8859-2 -t UTF-8
BytyÃ& Atilde;ƒÂ¢Ã¢â€šÂ ;¬Ã…¾Ãƒ&At ilde;‚¢Ã¢â €šÂ¬à ;ƒâ€¦Ã‚¡Ã& #131;ƒÆ’à ¢â‚¬à ;…¾ÃƒÂ&A circ;¢ÃƒÂ¢Ã¢â‚ ¬Å¡Ã‚¬Ãâ ;€šÃ‚¦


( should be 'BytyÃââ‚&n ot;žÃ‚ ¹ÃƒÂÂ&mac r;¿& Atilde;ƒâ€šÃ‚½' ; as you can see in http://www.koltunski.pl/test/ISO-8859-2 )

Edit: sweet! The Polish characters do not work here, either! The first and second occurrences should be garbled, as they are in my console (although they are garbled differently here than in my console!), but the third is a correct UTF-8 string, copied and pasted, and if all this were so easy, it would be displayed here correctly!
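One common way to end up with output that gets longer and more garbled after iconv is double encoding: the text was already UTF-8, and declaring it ISO-8859-2 makes iconv encode every byte a second time. Whether that is what happened to the file above cannot be told from the listing. The sketch below only shows the effect on a made-up word ("mój"), assuming iconv is installed.

```shell
#!/bin/sh
# "mój" stored correctly as UTF-8: "ó" is the two bytes c3 b3.
utf8_text=$(printf 'm\303\263j')

# Wrongly declare those bytes to be ISO-8859-2. iconv then reads c3 and b3
# as two separate Latin-2 characters and encodes each of them as UTF-8.
double_hex=$(printf '%s' "$utf8_text" \
    | iconv -f ISO-8859-2 -t UTF-8 \
    | od -An -tx1 | tr -d ' \n')
echo "double-encoded: $double_hex"

# Running the mistaken step in reverse recovers the original bytes.
restored_hex=$(printf '%s' "$utf8_text" \
    | iconv -f ISO-8859-2 -t UTF-8 \
    | iconv -f UTF-8 -t ISO-8859-2 \
    | od -An -tx1 | tr -d ' \n')
echo "restored:       $restored_hex"
```

Also note that piping the output of ls through iconv only transcodes the listing as text. It does not change the name stored on disk.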


Re: Character encoding
Posted by Utumno (118.160.xx.xx) on Mon 10 Nov 2008 at 17:30
Heh, let's try to copy/paste some Chinese characters:

野è&iu ml;¿½â€°Ã¨Å½ “å­¸à ;©ï¿½â€¹Ã¦& Acirc;­Â£Ã¥Â¼ï&iques t;½Ã¥Â®Å¡Ã¨&Ac irc;ªÂ¿Ã¯Â¼ÅR 17;主å&Acir c;¼ÂµÃ¤Â¿Â&re g;æâ€Â¹

Polish characters:

oplatajÄ… siÄ™


Re: Character encoding
Posted by Utumno (118.160.xx.xx) on Mon 10 Nov 2008 at 17:40
Lol

And I am not just talking about Debian or Linux: character encodings are broken every freaking where, including Windows, Mac, the Internet, ISO standards, everywhere...

And it's been broken ever since some moron decided to use a 7-bit charset back in the (?) 60s. Since then, one could only keep building an increasingly complicated mess on top of it.


Re: Character encoding
Posted by Utumno (118.160.xx.xx) on Mon 10 Nov 2008 at 17:44
Wait, there's more. Go to Firefox -> Menu -> View -> Character Encoding and set your encoding to something more funky, like the Chinese Big5. Then watch the hilarious mess this site has turned into.

That's just pathetic. I feel like grabbing the 'inventor' of ASCII and slapping some sense into him.


Re: Character encoding
Posted by Anonymous (213.227.xx.xx) on Tue 11 Nov 2008 at 07:58
Calm down.

You're doing something wrong.

1. Your filenames on disk are stored in UTF-8.
2. Your files are stored either in a Polish encoding (ISO-8859-2) or in UTF-8.

To view the contents of a file properly, your editor and your console/terminal emulator must both use the same character set as the file. If one is using UTF-8 and the other ISO-8859-2 (as seems to be the case here), the text will come out garbled.

Can you:

1. Tell us which editor you are using.
2. Tell us which console/terminal emulator you are using.

Try first with vim, and gnome-terminal/konsole/PuTTY.
Make sure you set the character set to UTF-8.

Can you see the UTF-8 files correctly? If not, how do you know they are UTF-8?
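One rough way to answer the "how do you know" question is to ask iconv to decode the file as UTF-8 and see whether it complains. This is only a sketch: `is_valid_utf8` is a made-up helper name, the sample files are invented, and a pass does not prove much for short files (plain ASCII passes too).

```shell
#!/bin/sh
# Hypothetical helper: exit status 0 only if the file decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

workdir=$(mktemp -d)
printf 'si\304\231\n' > "$workdir/utf8.txt"     # "się" encoded as UTF-8
printf 'si\352\n'     > "$workdir/latin2.txt"   # "się" encoded as ISO-8859-2

for f in "$workdir/utf8.txt" "$workdir/latin2.txt"; do
    if is_valid_utf8 "$f"; then
        echo "$f: decodes as UTF-8"
    else
        echo "$f: not valid UTF-8"
    fi
done
```

ISO-8859-2 text containing Polish letters usually fails this check, because a lone byte above 0x7F rarely forms a valid UTF-8 sequence.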


Re: Character encoding
Posted by Utumno (60.248.xx.xx) on Tue 11 Nov 2008 at 10:36
I know how to view the files, and I could have converted the Chinese characters to UTF-8 before pasting them here, which, I really do hope, would have worked.

That, however, is not the point, and my entries here are not intended to be questions; they are intended to be a rant against the current state of i18n in the IT world.

I could delve into technical details why it is HORRIBLY broken everywhere including Windows, Macs or the Internet - but instead I am simply going to say this -

I18N in the IT world sucks. BIGTIME.

Take that as a hint not to try to convince me that it doesn't. No law of nature says that copying Russian characters from wherever you happen to have them and pasting them into a Chinese online forum cannot simply work.


Re: Character encoding
Posted by Anonymous (213.227.xx.xx) on Tue 11 Nov 2008 at 10:55
You came here with a problem asking for help. I tried to help you. You respond telling me you don't want any help and start ranting.


Re: Character encoding
Posted by Utumno (220.133.xx.xx) on Tue 11 Nov 2008 at 14:36
Well, I came here to rant :)


Re: Character encoding
Posted by rjc (85.12.xx.xx) on Tue 11 Nov 2008 at 12:45
The problem is that you're using ISO-8859-2 on a UTF-8 system, but ISO-8859-2 is NOT a subset of UTF-8.
Convert everything from ISO-8859-2 to UTF-8 and use only the latter. That's the way I did it: I'm using en_GB.UTF-8 as the system locale and read/write both Polish and Hebrew in UTF-8.
My advice: forget ISO in a multi-language environment; UTF-8 is not ideal, but for some languages it is good enough.

Regards,
rjc
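To make the "not a subset" point concrete, here is a small sketch that prints the bytes of the same character in both encodings. `bytes_in` is a made-up helper for this illustration, and iconv is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical helper: take a UTF-8 string ($2), convert it to the target
# encoding ($1), and print the resulting bytes as hex.
bytes_in() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
}

a_ogonek=$(printf '\304\205')    # "ą" written as its UTF-8 bytes

ascii_latin2=$(bytes_in ISO-8859-2 'a')
ascii_utf8=$(bytes_in UTF-8 'a')
polish_latin2=$(bytes_in ISO-8859-2 "$a_ogonek")
polish_utf8=$(bytes_in UTF-8 "$a_ogonek")

echo "a:        ISO-8859-2=$ascii_latin2  UTF-8=$ascii_utf8"
echo "a-ogonek: ISO-8859-2=$polish_latin2  UTF-8=$polish_utf8"
```

Plain ASCII letters come out identical in both encodings, which is why English text survives any mix-up. The Polish letter is one byte (b1) in ISO-8859-2 and two bytes (c4 85) in UTF-8, so a program that guesses the wrong encoding shows garbage.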


Re: Character encoding
Posted by Utumno (220.133.xx.xx) on Tue 11 Nov 2008 at 14:44
Forget ISO? Heh. When I go to a Russian site, copy some Russian characters from it, then go to a Chinese forum and paste them there, I expect them to be shown correctly, but they come out garbled. Whose fault is this? Did I just fail to 'forget ISO'?

No, it's the fault of whatever moron assigned 1 byte to the char (was it Ritchie or his buddy Kernighan?) and of another one who decided to use only 7 bits of it for the charset.

7 bits ought to be enough for everybody, no?


Re: Character encoding
Posted by rjc (85.12.xx.xx) on Tue 11 Nov 2008 at 15:54
Were the Russian and Chinese sites both UTF-8? I guess not.


Re: Character encoding
Posted by simonw (84.45.xx.xx) on Tue 11 Nov 2008 at 21:10
Whilst I understand the need to rant, having recently fought similar issues in Perl, which was automatically encoding binary data for us (sigh), the primary problem these days is the die-hards who still use ISO character encodings.

K&R's famous contributions came a LONG time after ASCII appeared (about 1963).

7-bit ASCII characters won't cause you any issues at all if you use UTF-8 everywhere.

If you don't like 7 bit ASCII you should go back in time and try the alternatives (EBCDIC is no doubt still fun for some folks).


Re: Character encoding
Posted by Utumno (60.248.xx.xx) on Wed 12 Nov 2008 at 06:54
I use ISO-8859-2 in the small forum I host.

Reason? Users cling to their IE5s and IE6s like their very lives depended on it. Now, I don't know; it's probably possible to view and post in UTF-8 using IE5, but users report they have 'problems' and I don't have one around to test. I probably should do it, though.


Re: Character encoding
Posted by Anonymous (202.134.xx.xx) on Thu 27 Nov 2008 at 20:04
People generally miss the fact that the application reading the data needs to know what encoding it's in to understand the text correctly. The output of iconv -f x -t utf-8 isn't going to display properly if the console isn't displaying UTF-8. Run "locale charmap" to see what it's using, then "locale" to see what LC_CTYPE is. If you use UTF-8 everywhere you can basically just cut and paste and it'll work. I can paste from basically any site into my forums and it will generally just work.
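The two checks mentioned above can be sketched as a small script, assuming the standard `locale` utility is installed. Note that this only reports what the locale claims. It cannot tell whether the terminal emulator itself is set to the same encoding.

```shell
#!/bin/sh
# Character encoding that the current locale declares (for example UTF-8).
charmap=$(locale charmap)
echo "charmap: $charmap"

# LC_CTYPE is the locale category that governs character encoding.
locale | grep '^LC_CTYPE='

if [ "$charmap" = "UTF-8" ]; then
    echo "locale-aware programs will treat text as UTF-8"
else
    echo "text converted to UTF-8 may not display correctly under this locale"
fi
```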

I did have the same problem with filenames when I first moved to UTF-8: I had a filename with extended characters stored in ISO-8859-1. I just had to rename some files to their correct names. Now I have filenames in Cyrillic, hiragana, Hangul, and Latin characters.
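A rough sketch of such a rename, run in a scratch directory so nothing real is touched. The file name is made up, and the sketch assumes a Linux filesystem that accepts arbitrary bytes in names and that iconv is installed.

```shell
#!/bin/sh
# Scratch directory with one hypothetical legacy file name.
workdir=$(mktemp -d)

# "café.txt" with "é" stored as the single ISO-8859-1 byte 0xE9 (octal 351).
old_name=$(printf 'caf\351.txt')
: > "$workdir/$old_name"

# Work out the UTF-8 spelling of the same name, then rename the file.
new_name=$(printf '%s' "$old_name" | iconv -f ISO-8859-1 -t UTF-8)
mv -- "$workdir/$old_name" "$workdir/$new_name"

# Show the new name as hex: "é" is now the two bytes c3 a9.
new_hex=$(printf '%s' "$new_name" | od -An -tx1 | tr -d ' \n')
echo "$new_hex"
```

Before running anything like this on real files, check that the names are not already UTF-8. Converting a name that is already correct double-encodes it.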
