Monday, May 28, 2012

Character Encoding Troubles

A while back, I ran into an issue with a legacy application that had been migrated from WebLogic Server on an old Windows server to a newer Tomcat application server on Red Hat Linux. The application had been running without issues for weeks when the developer suddenly started complaining that it was not saving the registered trademark (®) and copyright (©) symbols to the database. He added that when he ran the application from his Windows laptop, both symbols were saved correctly.

Earlier in my career, I was a developer for a software company that built multilingual software specializing in Asian languages, so I recognized this as a character encoding issue. I asked the developer to send me the source code related to this issue, and here's the relevant part:

    // Sending updates to the database
    update.setValue("data",encodeString(text));
    // Inserting the data to the database
    insert.setValue("data",encodeString(text));

That seemed odd. What does the method encodeString do? Here's the implementation:

public String encodeString(String value) throws java.io.UnsupportedEncodingException
{
  log("encodeString", "Begin");
  if (value == null)
  {
    log("encodeString", "value == null");
    return value;
  }
  
  byte [] btValue = value.getBytes();
  String encodedValue = new String(btValue, _ISO88591);
  
  /*
  Charset utf8charset = Charset.forName("UTF-8");
  Charset iso88591charset = Charset.forName("ISO-8859-1");

  ByteBuffer inputBuffer = ByteBuffer.wrap(btValue);

  // decode UTF-8
  CharBuffer data = utf8charset.decode(inputBuffer);

  // encode ISO-8559-1
  ByteBuffer outputBuffer = iso88591charset.encode(data);
  byte[] outputData = outputBuffer.array();
  byte[] inputData = inputBuffer.array();
 
  log("ISO-8859-1: ", new String(outputData));
  log("UTF-8: ", new String(inputData));
  
  //String encodedValue = new String(btValue, _ISO88591);
  String encodedValue = new String(inputData);
  String encodedValue_ISO88591 = new String(inputData, _ISO88591);
  //encodedValue.getBytes("UTF-8");
  log("Encoded UTF: ", encodedValue);
  log("Encoded ISO88591: ", encodedValue_ISO88591);
  */
  
  return encodedValue;
}

Wow! I can see that the developer was trying to get a handle on this encoding business, hence all the commented-out R&D code, but clearly he was not familiar with character set encoding issues.

The main problem is with the following line of code:

  byte [] btValue = value.getBytes();

According to the Java API documentation, the no-argument getBytes() method encodes the string into a sequence of bytes using the platform's default charset. On the old Windows server that this application originally ran on, the default was probably Windows code page 1252, which differs from ISO-8859-1 only in the 0x80-0x9F range. Both encodings map ® to byte 0xAE and © to byte 0xA9, so the registered and copyright symbols survived the round trip through new String(btValue, _ISO88591). On the Red Hat Linux server, however, the default encoding was ASCII, which cannot represent either symbol, so getBytes() replaced each one with a question mark before the data ever reached the database.
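
The difference between the two servers can be reproduced on a single machine. The following is a minimal sketch, not the application's code: the class name DefaultCharsetDemo and the method encodeWithDefault are hypothetical, and the platform default is passed in as a parameter so that both environments can be simulated. It assumes the JDK in use provides the windows-1252 charset, which standard JDK distributions normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {

    // Hypothetical stand-in for encodeString(): the charset that getBytes()
    // would pick up from the platform is passed in explicitly instead.
    static String encodeWithDefault(String value, String platformCharset) {
        // Unmappable characters are replaced with the charset's
        // replacement byte, which is '?' for US-ASCII.
        byte[] bytes = value.getBytes(Charset.forName(platformCharset));
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "® ©" written as Unicode escapes so the source file's own
        // encoding cannot interfere with the demonstration.
        String text = "\u00AE \u00A9";

        // Simulated Windows server: the symbols survive the round trip.
        System.out.println(encodeWithDefault(text, "windows-1252").equals(text)); // true

        // Simulated Linux server with an ASCII default: the symbols are lost.
        System.out.println(encodeWithDefault(text, "US-ASCII")); // ? ?
    }
}
```

Once the symbols have been replaced by question marks the original characters cannot be recovered, which is why the data looked corrupted only when it was saved from the Linux server.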

Java strings are internally Unicode (UTF-16), and JDBC drivers typically handle the conversion to and from the database's character set. The fix is therefore simple: get rid of the encodeString() method entirely and change the two lines of code that call it:

    // Changed from : update.setValue("data",encodeString(text));
    update.setValue("data",text);
    // Changed from : insert.setValue("data",encodeString(text));
    insert.setValue("data",text);
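
When diagnosing a problem like this, it helps to confirm what the JVM's default charset actually is on each server, and to name the charset explicitly wherever bytes really are needed. The sketch below is illustrative only: the class name CharsetCheck and the helper toUtf8 are hypothetical and not part of the application. Note that since JDK 18 the default charset is UTF-8 on all platforms, so the values printed depend on the Java version as well as the operating system.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetCheck {

    // Encoding with an explicit charset gives the same bytes on every
    // platform, unlike the no-argument getBytes().
    static byte[] toUtf8(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What the no-argument getBytes() would use on this JVM.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // The registered symbol (U+00AE) is two bytes in UTF-8: C2 AE.
        for (byte b : toUtf8("\u00AE")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Running this on both the Windows laptop and the Linux server would show whether their defaults differ, which is the condition that triggered the bug described above.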