Have you ever tried to write Java code with non-ASCII characters? Like having French method names?
The other day I stumbled upon Java classes written in French. Class names like “Opération”, method names like “getRéalisateur”, and log messages and comments in French as well.
At first you say “not common but cool” (and you start thinking about writing code in Chinese because your boss always wondered how we could forbid clients from decompiling our classes without using an obfuscator).
But cool it is not!
Why? Because of encoding!
Here is a quiz: which encoding were those Java files saved in?
- UTF-8 (after all, class files store their strings in a UTF-8 variant)
- ASCII (come on, everybody writes code in English)
- MacRoman (why not?)
Think about it for a moment.
The answer is #3, MacRoman, because the Java IDE (Eclipse in this case) saves files in the platform encoding by default, and those classes were created on a Mac.
I actually had no problem reading and compiling them because I also use Eclipse on a Mac, and because the Java compiler also assumes by default that source files are in the platform encoding.
So what, nothing wrong then? Yeah, except the integration server is running on Ubuntu and sometimes I work on Windows as well. And on those platforms the default encoding is not MacRoman…
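To make the problem concrete, here is a minimal sketch of what happens when MacRoman bytes are read on another platform. It relies on the fact that MacRoman stores é as the single byte 0x8E; the class and method names are made up for the illustration, and the byte array is hardcoded so the example does not depend on a MacRoman charset being installed in your JRE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PlatformEncodingDemo {

    // The class name from the post as a MacRoman editor would write it:
    // the accented letter is the single byte 0x8E.
    static final byte[] MAC_ROMAN_BYTES = {
        'O', 'p', (byte) 0x8E, 'r', 'a', 't', 'i', 'o', 'n'
    };

    // Decode the same bytes under whatever charset the reader assumes.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // What this JVM would pick when nobody specifies an encoding.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // ISO-8859-1 maps 0x8E to an invisible control character.
        System.out.println("As ISO-8859-1: "
                + decode(MAC_ROMAN_BYTES, StandardCharsets.ISO_8859_1));

        // In UTF-8 a lone 0x8E is malformed, so it becomes U+FFFD.
        System.out.println("As UTF-8: "
                + decode(MAC_ROMAN_BYTES, StandardCharsets.UTF_8));
    }
}
```

Neither result is the identifier the author typed, which is why the same file that compiles on the Mac can break elsewhere.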
The interesting part is that this is always the case! Even when you code in plain English, chances are your IDE writes the files in the platform encoding. Nobody notices because, as long as you only use characters in the 7-bit ASCII range, they are encoded the same way in almost all encodings.
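You can check this overlap yourself. The sketch below (names are illustrative) encodes an identifier with three charsets that every JVM is required to support and compares the resulting bytes. The accented letter is written as an escape so that the example itself survives any file encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiOverlapDemo {

    // True when the text produces identical bytes in US-ASCII,
    // ISO-8859-1 and UTF-8.
    static boolean sameBytesEverywhere(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, latin1) && Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        String plain = "getRealisateur";
        String accented = "getR\u00e9alisateur"; // e with acute accent

        System.out.println(plain + " -> " + sameBytesEverywhere(plain));
        System.out.println(accented + " -> " + sameBytesEverywhere(accented));

        // The accented letter takes one byte in ISO-8859-1 but two in UTF-8.
        System.out.println("ISO-8859-1 length: "
                + accented.getBytes(StandardCharsets.ISO_8859_1).length);
        System.out.println("UTF-8 length: "
                + accented.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The plain identifier gives the same bytes in all three, while the accented one does not.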
So what is the solution? Well, it depends on whether you really want to code in French (or in Chinese). My advice anyway is “don’t do that”; externalize localized strings instead. However, if you really insist, you have two solutions:
- Make the whole production chain encoding-explicit: configure your IDE to save files as UTF-8, and tell the Java compiler in your build that the source files are UTF-8 (with javac, this is the -encoding option). UTF-8 is the better choice in most cases.
- Make sure you only use 7-bit ASCII characters in your files and replace every non-ASCII character with its \uXXXX equivalent (even in comments).
However, be aware that #1 is not always possible: you might be using processing tools that offer no option other than the platform encoding.
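If you go for #2, you probably do not want to do the replacement by hand. Here is a rough sketch of the conversion, similar in spirit to the native2ascii tool that shipped with older JDKs. The class and method names are hypothetical, and it assumes the source text has already been read into a String with the correct encoding.

```java
public class UnicodeEscaper {

    // Replace every character outside 7-bit ASCII with a backslash-u
    // escape made of four hex digits. Characters beyond the Basic
    // Multilingual Plane come out as two escapes (a surrogate pair),
    // which is the form Java source code expects.
    static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input is itself written with an escape to keep this file ASCII-only.
        System.out.println(escapeNonAscii("getR\u00e9alisateur()"));
    }
}
```

The output is pure ASCII, so it reads the same no matter which platform encoding the next tool in the chain assumes.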
Have fun with encoding :)
Image Credits: Arite