I spent a fair part of my past life not fully understanding character encodings. While this was largely unimportant in past times, when a data format or file was typically written or created by the same program reading it, most likely never crossing country or language borders, nowadays it matters. Very much. So what is this all about?
Let's have a look at how characters can be encoded. The simplest approach is to put up a table of characters and assign each character a number between 0 and 255. By doing this, you get a character pool with 256 elements and can address each element with 1 byte. This approach is still widely in use; the most popular implementation is by far ASCII, which strictly speaking only defines the lower 128 values (0 to 127). These characters did a great job for a long time. By dividing the table into a lower and an upper part, system manufacturers were able to provide e.g. Germany with its beloved special characters in the upper half, and were likewise able to provide language-specific special characters to other nationalities, like the French or Spanish. By doing this, you gain the ability to use one byte to encode the locally most important subset of characters, but lose portability. A file created on a German-speaking workstation wouldn't be displayed correctly somewhere outside this language zone. That's where the concept of code pages has its origin. Each system was equipped with a code page, which was nothing more than a table mapping byte values to characters.
So in times of the internet and global networking, there was a growing need to encode characters so that they were viewable everywhere, in a consistent manner. Many approaches exist, but the one that is most widely used today and the de-facto standard for encodings is called UTF-8. Fundamentally, it is a multi-byte encoding, that is, each character can be encoded using up to 4 bytes. The exact byte count is determined by the highest-order bits of the first byte: if the top bit is 0, the character occupies a single byte; a prefix of 110 means two bytes, 1110 means three, and 11110 means four, with every continuation byte starting with 10. This is a clever approach, since any text consisting only of the lower table of ASCII characters is automatically valid UTF-8, just by design. The lower table, for your interest, includes all basic digits, whitespace, and the alphabet in lower- and uppercase.
So nowadays, one should definitely use UTF-8. Why? The best thing would be to have everything in UTF-8, e.g. XML documents, text documents. Most modern languages handle strings as Unicode by default (often stored internally as UTF-8, or as UTF-16 in the case of Java), only changing the external representation on request. UTF-8 simplifies the development of easily localizable applications tremendously, while remaining simple to manage. As exaggerated as it may sound: UTF-8 is indeed the answer to most encoding problems.
If you want to use it in Java, you are already equipped with what you need. If you want to use C to build a UTF-8-capable application, use the iconv library; it is open-source and included in virtually every Linux distribution, as well as in Cygwin. Iconv is able to convert almost any existing format into any other, and it can read and write UTF-8, effectively extending the C standard library (which, at the time of this writing, cannot handle UTF-8 on its own).
Hopefully I could give you a brief introduction to what encodings are (and are not), and why UTF-8 should be used. Keep it in mind!