I spent a fair part of my past life not really understanding character encodings. That was mostly harmless back when a data format or file was read by the same program that wrote it and rarely crossed country or language borders. Nowadays it matters. A lot. So what is this all about?
Let's have a look at how characters can be encoded. The simplest approach is to put up a table of characters and assign each one a number between 0 and 255. That gives you a character pool of 256 elements, each addressable with a single byte. This approach is still widely in use; by far the most popular implementation is ASCII, which defines the lower 128 of those values. These characters did a great job for a long time. By dividing the table into a lower and an upper part, system manufacturers could give e.g. Germany its beloved special characters, and likewise provide language-specific characters to other nationalities, such as France or Spain. You gain the ability to encode the locally most important subset of characters in a single byte, but you lose portability: a file created on a German-speaking workstation wouldn't be displayed correctly outside that language zone. That's where the concept of code pages has its origin. Each system was equipped with a code page, which was nothing more than a table mapping byte values to characters.
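To make the code-page problem concrete, here is a tiny Java sketch. The class name, the sample byte 0xE4 and the two ISO-8859 code pages are just my own illustration of the idea, not anything from a specific system: the very same byte decodes to two different characters depending on which table you look it up in.

    public class CodePageDemo {
        public static void main(String[] args) throws java.io.UnsupportedEncodingException {
            byte[] data = { (byte) 0xE4 };   // one byte above the ASCII range
            // The same byte means different things under different code pages:
            System.out.println(new String(data, "ISO-8859-1")); // decodes to "ä" (Western European)
            System.out.println(new String(data, "ISO-8859-7")); // decodes to "δ" (Greek)
        }
    }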
So in times of the internet and global networking there was a growing need to encode characters so that they are viewable everywhere, in a consistent manner. Many approaches exist, but the one most widely used today, and the de-facto standard, is UTF-8. Fundamentally it is a multi-byte encoding: each character is encoded using one to four bytes, and the leading bits of the first byte tell you how many. A first byte starting with 0 means a single byte, 110 means two bytes, 1110 three, and 11110 four. This is a clever design, because every plain ASCII file is automatically valid UTF-8. That lower range, for your interest, includes the basic digits, whitespace, punctuation, and the alphabet in lower- and uppercase.
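A small Java snippet, with the class name and sample characters picked purely for illustration, shows how the byte count grows with the character:

    public class Utf8Lengths {
        public static void main(String[] args) throws java.io.UnsupportedEncodingException {
            // The number of UTF-8 bytes grows with the code point:
            System.out.println("A".getBytes("UTF-8").length);            // 1 byte  (plain ASCII)
            System.out.println("\u00E4".getBytes("UTF-8").length);       // 2 bytes (ä)
            System.out.println("\u20AC".getBytes("UTF-8").length);       // 3 bytes (€)
            System.out.println("\uD834\uDD1E".getBytes("UTF-8").length); // 4 bytes (U+1D11E, outside the BMP)
        }
    }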
So nowadays one should definitely use UTF-8. Why? The best thing is to have everything in UTF-8, e.g. XML documents and text documents. Most modern languages handle strings as Unicode internally, only changing the external representation on request. UTF-8 simplifies the development of easily localizable applications tremendously, while being simple to manage. As exaggerated as it may sound: UTF-8 is indeed the answer to most encoding problems.
If you want to use it in Java, you are already equipped with what you need. If you want to build a UTF-8-capable application in C, use the iconv library: it is open source and included in practically every Linux distribution as well as in Cygwin. Iconv can convert almost any encoding into any other, so you can use it to read UTF-8 or to extend the C standard library (which, at the time of this writing, cannot handle UTF-8 on its own).
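In Java, "already equipped" mostly means naming the encoding explicitly instead of relying on the platform default. A minimal sketch, with hello.txt and the umlaut sample text as made-up examples:

    import java.io.*;

    public class Utf8FileDemo {
        public static void main(String[] args) throws IOException {
            // Name the encoding explicitly; the platform default differs from machine to machine.
            Writer out = new OutputStreamWriter(new FileOutputStream("hello.txt"), "UTF-8");
            out.write("H\u00E4ll\u00F6 W\u00F6rld\n");   // "Hällö Wörld", hypothetical sample text
            out.close();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream("hello.txt"), "UTF-8"));
            System.out.println(in.readLine());
            in.close();
        }
    }

The important part is the explicit "UTF-8" argument; leave it out and the file ends up in whatever encoding the platform default happens to be.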
Hopefully I could give you a brief introduction to what encodings are, and are not, and why UTF-8 should be used. Keep it in mind!
A fellow of mine just started his blog, http://www.codebu.de, dedicated to all things Rails and Ruby, so be sure to check it out.
So this is the last post from my room in Finland, tomorrow I'm heading home. I'm very grateful for the people I've met, the things I've learned and the endless nights spent on the road. Thank you all!
A short note on a previous post where I talked about my project of a simple file-sharing application that basically works using drag'n'drop and some zeroconf to find other peers. I wasn't exactly surprised to find something that matches that description pretty well. It's called giver and should run on any platform that has some kind of .NET/Mono framework. I haven't tried it yet, but I'll surely give it a shot and tell you about it.
I'm already thinking about dropping the Java project and instead writing a client for the giver protocol in Cocoa. I would be excited to hear from someone who actually uses giver! Moritz.
Running Mac OS X, I forgot to mention that in the headline. Why? Apple is most likely busy updating its own stuff and promoting its platform, and since the iPhone came up, many people have started learning Cocoa, so the developer base there has grown, too. Additionally, the general user base is also growing, making it a more interesting target for software companies. There aren't many rules that hold everywhere, but one of them is that Java desktop applications just don't integrate well. Programs using SWT are perhaps an exception, but still, the native look and feel is something different.
So the question I'm asking is whether it's really that bad that there is no current Java support. Yes! Absolutely! At least if you use or build programs depending on it. Of course, most programs work just fine with Java 5, but some simply don't. And so one of Java's main selling points falls flat: write once, run anywhere. And while Sun, the company behind Java, provides runtime environments for Windows and Linux, it doesn't for the Mac, so it's not Apple's fault alone. But regardless of whose fault it is, it just sucks, plainly spoken.