
Processing foreign-language form submissions

Your website can likely be accessed from any browser in the world. When the client submits a form, it encodes the submission with the encoding type set in the browser. As a result, you cannot assume that all form submissions use the same encoding type. In Internet Explorer, you can see the encoding type by selecting View > Encoding.

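This section describes the following methods of decoding request data: getting the request encoding type, using encoding-aware String constructors, and using setCharacterEncoding.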

Getting request encoding type

When a browser uses a character set that is not ISO-8859-1, it is supposed to send the encoding character set in the Content-Type header of the request. You use the request object's getCharacterEncoding method to get the character set from the Content-Type header. You can use that value to decode the form data and work with the response using the correct character set.

For example, if the client submits a form using EUC-JP and sets the charset in the request's Content-Type header to EUC-JP, the processing servlet can determine how the request was encoded and properly decode it.
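As a rough illustration of what getCharacterEncoding reports, the following stdlib-only sketch extracts the charset parameter from a Content-Type header value. The class name ContentTypeCharset and the method charsetOf are hypothetical and not part of the servlet API, and a real container's parsing is likely more thorough. In a servlet you would simply call request.getCharacterEncoding().

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the servlet API or JRun.
class ContentTypeCharset {

    // Returns the charset parameter of a Content-Type value, or null if it is absent,
    // mirroring the null that getCharacterEncoding returns in that case.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=EUC-JP"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```

The second call in main has no charset parameter, which is the case in which a servlet has to pick an encoding by some other means.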

Using encoding-aware String constructors

Your server decodes the form data using the default character set ISO-8859-1 (also called Latin-1), without regard for the encoding type used by the client's browser. If the form was submitted using the EUC-JP character set, the request data becomes corrupted because JRun decodes it using the default character set, which differs from the character set used to encode it.

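Java provides the following two String constructors that let you specify the encoding type, so that the decoding is performed correctly:

String(byte[] bytes, String charsetName)
String(byte[] bytes, int offset, int length, String charsetName)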

The following example gets the character encoding type and the raw byte array of the original form input and then constructs a String with the correct encoding type:

...
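String encoding = request.getCharacterEncoding(); // null if the client sent no charset
String httpDefaultEncoding = "ISO-8859-1";
String corruptData = request.getParameter("name"); // the data is decoded using ISO-8859-1 by default.
String correctData = corruptData;
if (encoding != null && corruptData != null) {
    // Recover the raw bytes, then decode them with the encoding the client used.
    correctData = new String(corruptData.getBytes(httpDefaultEncoding), encoding);
}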
...
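The round trip in the example above can be tried outside a servlet container. The following sketch simulates the server's default ISO-8859-1 decoding and then repairs the result. The class and method names are hypothetical, and the sketch assumes the JDK in use supports the EUC-JP character set.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class; simulates the container's default decoding without a servlet.
class RedecodeDemo {

    // Simulates the server default: decode the client's bytes as ISO-8859-1.
    static String decodeAsLatin1(byte[] clientBytes) throws UnsupportedEncodingException {
        return new String(clientBytes, "ISO-8859-1");
    }

    // The repair: recover the original bytes, then decode them with the
    // encoding the client actually used.
    static String redecode(String corruptData, String clientEncoding)
            throws UnsupportedEncodingException {
        return new String(corruptData.getBytes("ISO-8859-1"), clientEncoding);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Three Japanese characters, written as Unicode escapes to keep the source ASCII.
        String original = "\u65E5\u672C\u8A9E";
        byte[] clientBytes = original.getBytes("EUC-JP"); // what the browser would send
        String corrupt = decodeAsLatin1(clientBytes);
        String repaired = redecode(corrupt, "EUC-JP");
        System.out.println(original.equals(corrupt));  // false: decoded with the wrong charset
        System.out.println(original.equals(repaired)); // true: the original bytes were recovered
    }
}
```

The repair works because ISO-8859-1 maps every byte value to exactly one character, so converting the corrupted String back to ISO-8859-1 bytes reproduces what the client sent.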

Unfortunately, most browsers do not currently include the character set in the HTTP Content-Type request header, regardless of the encoding type used in the request. As a result, calls to the getCharacterEncoding method usually return null.

This type of String constructor throws an UnsupportedEncodingException if the encoding type is not supported.
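Because getCharacterEncoding often returns null and the constructor can throw UnsupportedEncodingException, the conversion usually needs a guard around it. The following is a minimal sketch of one possible guard. The SafeRedecode class and its fallback policy are assumptions for illustration, not JRun behavior.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical helper; the fallback policy shown here is an assumption.
class SafeRedecode {

    // Re-decodes a parameter that the container decoded as ISO-8859-1.
    // Uses fallbackEncoding when the request reported no encoding, and returns
    // the input unchanged when no usable encoding is available.
    static String redecode(String corruptData, String requestEncoding, String fallbackEncoding) {
        if (corruptData == null) {
            return null;
        }
        String encoding = (requestEncoding != null) ? requestEncoding : fallbackEncoding;
        if (encoding == null) {
            return corruptData;
        }
        try {
            return new String(corruptData.getBytes("ISO-8859-1"), encoding);
        } catch (UnsupportedEncodingException e) {
            // The named encoding is not supported by this JVM; leave the data as it is.
            return corruptData;
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "\u65E5\u672C\u8A9E";
        String corrupt = new String(original.getBytes("UTF-8"), "ISO-8859-1");
        System.out.println(original.equals(redecode(corrupt, null, "UTF-8")));
        System.out.println(corrupt.equals(redecode(corrupt, "no-such-charset", "UTF-8")));
    }
}
```

In a servlet, a call might look like SafeRedecode.redecode(request.getParameter("name"), request.getCharacterEncoding(), "EUC-JP"), where the fallback is whatever encoding your forms are known to use.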

Using setCharacterEncoding

While there is no declarative solution to determining the client's character encoding, the servlet API includes the following convenience method that sets the request object's encoding so that the remaining request data can be processed correctly:

request.setCharacterEncoding

This method lets you assign an encoding type to the request, so that all future calls to the request object decode the request's data correctly. Using setCharacterEncoding lets you avoid converting the request data from the default encoding to another encoding.

You must set the request's encoding before any calls to getParameter or getReader.

The following example sets the request's encoding type so that Japanese parameters from a Shift_JIS-encoded form can be read with standard getParameter methods:

request.setCharacterEncoding("Shift_JIS");
String username = request.getParameter("username");
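To show why the chosen encoding matters when parameters are parsed, the following stdlib-only sketch imitates how a form-urlencoded body might be decoded with a given encoding. BodyDecodingDemo and its parameter method are hypothetical stand-ins for the container's parsing, and the sketch assumes the JDK supports Shift_JIS.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical stand-in for the container's body parsing; not JRun code.
class BodyDecodingDemo {

    // Returns the decoded value of one parameter from a form-urlencoded body,
    // decoding percent-escapes with the given encoding, or null if it is absent.
    static String parameter(String body, String name, String encoding)
            throws UnsupportedEncodingException {
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            String key = (eq >= 0) ? pair.substring(0, eq) : pair;
            String value = (eq >= 0) ? pair.substring(eq + 1) : "";
            if (URLDecoder.decode(key, encoding).equals(name)) {
                return URLDecoder.decode(value, encoding);
            }
        }
        return null;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // A two-character Japanese name, written as Unicode escapes.
        String username = "\u5C71\u7530";
        // What a browser would post from a Shift_JIS-encoded form.
        String body = "username=" + URLEncoder.encode(username, "Shift_JIS");
        System.out.println(username.equals(parameter(body, "username", "Shift_JIS")));  // true
        System.out.println(username.equals(parameter(body, "username", "ISO-8859-1"))); // false
    }
}
```

The same bytes yield the correct name only when decoded with the encoding used by the form, which is the reason the encoding has to be set before the first call that reads parameters.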

Comments


No screen name said on Jul 12, 2005 at 9:31 PM :
The following code in the last section
request.setCharacterEncoding("Shift_JIS");
String username = request.getParameter("username");
does not produce a correctly decoded Unicode String for username.

In my testing, request.getParameter("name") always converts the HTTP header stream to Unicode using ISO-8859-1, not the encoding returned by request.getCharacterEncoding().

No screen name said on Jul 13, 2005 at 4:45 AM :
By definition, request.setCharacterEncoding() sets the character encoding used in the body of this request, so it helps properly decode POST data in the request body.

If you want to decode GET parameters using the encoding set by request.setCharacterEncoding(), you need to configure the container to let the servlet use the body's character encoding to decode the request URI. For example, in Tomcat 5.0, setting useBodyEncodingForURI="true" or setting URIEncoding to the encoding you want on the Connector element solves the problem.


Current page: http://livedocs.adobe.com/jrun/4/Programmers_Guide/i10n6.htm