
And you think 160 is not enough?

SMS are not 160 characters long, they are 140 bytes long! This is what I discovered today after my SO complained that her mobile operator was charging her for SMS she never sent…

And when you know how computers work, it totally makes sense!

“So what?” you are going to ask. Well, this is yet another nice example of how character encodings can drive you crazy. According to Wikipedia, there are 3 encodings used in text messages, which respectively use 7 bits, 8 bits and 16 bits to encode a single character.

Depending on the characters you use in your message, your phone decides which encoding to use, thus reducing the maximum number of characters to, respectively, 160, 140 and 70 (and even fewer, see below). Any extra character will cause your message to be split into multiple SMS and, obviously, increase your bill.
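The three limits follow directly from the 140-byte payload. Here is a minimal sketch of the arithmetic in Python; the names `PAYLOAD_BYTES` and `max_chars` are made up for this illustration, and it assumes every character takes the same number of bits.

```python
# One SMS carries 140 bytes of payload, i.e. 1120 bits.
PAYLOAD_BYTES = 140


def max_chars(bits_per_char):
    """How many fixed-width characters fit in a single 140-byte SMS."""
    return (PAYLOAD_BYTES * 8) // bits_per_char


for bits in (7, 8, 16):
    print(f"{bits}-bit encoding: {max_chars(bits)} characters")
# 7-bit encoding: 160 characters
# 8-bit encoding: 140 characters
# 16-bit encoding: 70 characters
```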

By default the 7-bit encoding used is GSM 03.38, which has the following 128-character alphabet: @, £, $, ¥, è, é, ù, ì, ò, Ç, LF, Ø, ø, CR, Å, å, Δ, _, Φ, Γ, Λ, Ω, Π, Ψ, Σ, Θ, Ξ, ESC, Æ, æ, ß, É, SP, !, ", #, ¤, %, &, ', (, ), *, +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, ¡, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, Ä, Ö, Ñ, Ü, §, ¿, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, ä, ö, ñ, ü, à

If you use only those characters, then your text messages can have 160 characters. However, any character outside of this alphabet forces the use of a different encoding. And if you are using exotic scripts, your messages will be encoded in UTF-16. In this encoding the most common Chinese characters take 2 bytes each, but rarer ones outside the Basic Multilingual Plane take 4 bytes, so in the worst case your Chinese message is limited to 35 characters.

I guess that now that smartphones support international scripts and transparently break up text messages, a lot of people get trapped. The only recommendation I can think of is to make your phone display the character count when you type text messages. I noticed that my iPhone changes the maximum number of characters according to the encoding it’s going to use to send my message.

If you want to know more about character encodings I absolutely recommend the following article by Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

UPDATE: At Stephanie’s request here is how to activate message count on your iPhone (at least on my 3GS with iOS 4.1).

Go to your iPhone settings, scroll down to “Messages” then toggle “Character Count” on. When you write a text message the count will show up only if you have at least two lines of text :)

Fun with Java files encoding

Have you ever tried to write Java code with non-ASCII characters? Like having French method names?

The other day I stumbled upon Java classes written in French. Class names like “Opération”, method names like “getRéalisateur”, and embedded log messages and comments in the same vein.

At first you say “not common but cool” (and you start thinking about writing code in Chinese because your boss always wondered how we could prevent clients from decompiling our classes without using an obfuscator).

But cool it is not!

Why? Because of encoding!

Here is a quiz: which encoding were those Java files saved in?

  1. UTF-8 (after all this is how strings are encoded in the JVM)
  2. ASCII (come on, everybody is writing code in English)
  3. MacRoman (why not?)

Think about it for a moment.

The answer is #3, because the Java IDE (Eclipse in this case) by default uses the platform encoding to save files. And those classes were created on a Mac.

I actually had no problem reading and compiling them because I also use Eclipse on a Mac and because the Java compiler also assumes the source files are in the platform encoding.

So what, nothing wrong then? Yeah, except the integration server is running on Ubuntu and sometimes I work on Windows as well. And on those platforms the default encoding is not MacRoman…

The interesting thing is that it is always like that! I mean, even when you code in plain English, chances are that your IDE is going to write the files in the platform encoding. But nobody notices, because as long as you only use characters in the ASCII-7 range, they will be encoded the same way in almost all encodings.
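You can see this at the byte level with a few lines of Python (used here only because it makes poking at bytes easy; the same applies to the bytes Eclipse writes). This is a small sketch: the helper `encoded_forms` is made up for the example, and cp1252 stands in for "a typical Windows default", which is an assumption about the machine.

```python
def encoded_forms(text, codecs=("mac_roman", "cp1252", "utf-8")):
    """Encode the same text with several encodings and return the bytes."""
    return {codec: text.encode(codec) for codec in codecs}


# Plain ASCII identifier: every encoding produces the same bytes.
ascii_forms = encoded_forms("getName")
print(len(set(ascii_forms.values())))   # 1

# French identifier: the "é" is stored differently in each encoding.
french_forms = encoded_forms("getRéalisateur")
for codec, raw in french_forms.items():
    print(codec, raw)

# A file saved on the Mac, then read with another platform's default.
saved_on_mac = "Opération".encode("mac_roman")
print(saved_on_mac.decode("cp1252"))    # OpŽration
```

Reading the same MacRoman bytes as UTF-8 does not even produce a wrong character: the lone byte 0x8E is not valid UTF-8, so a strict decoder rejects it.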

So what is the solution? Well, it depends on whether you really want to code in French (or in Chinese). My advice anyway is “don’t do that”, and externalize localized strings instead. However, if you really insist, you have two solutions:

  1. Make the whole production chain encoding-explicit: Configure your IDE to use UTF-8 and specify in your build that the Java compiler is going to deal with UTF-8 encoded files (UTF-8 is better in most cases).
  2. Make sure you only use ASCII-7 characters in your files and replace all non-ASCII-7 characters with their \uXXXX equivalent (even in comments).

However, be aware that #1 is not always possible: you might be using processing tools that do not offer the option to use anything other than the platform encoding.
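For solution #2, the replacement is mechanical enough to script. Here is a minimal Python sketch; `to_java_escapes` is a hypothetical helper written for this post, not an existing tool. It works on text you have already decoded correctly, emits one \uXXXX per UTF-16 code unit (so characters outside the BMP become two escapes), and makes no attempt to handle source that already contains backslash escapes.

```python
def to_java_escapes(source):
    """Replace every non-ASCII character with Java \\uXXXX escapes."""
    out = []
    for ch in source:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # Java escapes are UTF-16 code units, so go through UTF-16.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04x" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


print(to_java_escapes("public String getRéalisateur() { ... }"))
# public String getR\u00e9alisateur() { ... }
```

The result contains only ASCII-7 characters, so it survives being saved and compiled with whatever platform encoding happens to be in use.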

Have fun with encoding :)