Code Page 437 Refuses to Die

console: 1. to alleviate or lessen the grief, sorrow, or disappointment of; give solace or comfort. — That’s what I need more of after trying to demystify the behavior of System.out in the Windows console. Read on if you want to be consoled and enlightened.

I am updating volume 2 of Core Java and got stuck on the section on text files in the internationalization chapter. How hard can this be in 2016? Surely everyone uses UTF-8 these days. Or do they? Make a file Test.java:

public class Test {
   public static void main(String[] args) {
      System.out.println(java.nio.charset.Charset.defaultCharset());
   }
}

Run

java Test

in Windows 10. I have a standard US English version, and I get

windows-1252

Your mileage may differ, of course, depending on the local version of Windows that you have.

Windows-1252 is a superset of the 8-bit ISO 8859-1 encoding, with most of the non-printing characters in the range 0x80-0x9F replaced by goodies such as curly quotes and the Euro character € (U+20AC).
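To make the difference between the two encodings concrete, here is a minimal sketch that decodes the single byte 0x80 both ways. The class name Byte80Demo and its helper are made up for illustration, and the sketch assumes that your JDK ships the windows-1252 charset (standard JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is not from the article.
class Byte80Demo {
    // Decodes a single byte value with the given charset.
    static String decode(int b, Charset cs) {
        return new String(new byte[] { (byte) b }, cs);
    }

    public static void main(String[] args) {
        // Assumption: the windows-1252 charset is available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");
        String latin1 = decode(0x80, StandardCharsets.ISO_8859_1);
        String win = decode(0x80, cp1252);
        // ISO 8859-1 maps 0x80 to a C1 control character.
        System.out.printf("ISO 8859-1:   U+%04X%n", (int) latin1.charAt(0));
        // Windows-1252 maps 0x80 to the Euro sign.
        System.out.printf("Windows-1252: U+%04X%n", (int) win.charAt(0));
    }
}
```

The same byte is the control character U+0080 in ISO 8859-1 and the Euro sign U+20AC in Windows-1252.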

One of the characters that is not encoded by Windows-1252 is the Greek letter uppercase sigma ∑ (U+03A3). So, what do you think will happen when you add this line?

System.out.println("\u20AC\u03A3");

Have a guess:

  1. This line will print \u20AC\u03A3
  2. This line will print €∑
  3. This line will print €?
  4. This line will print ?∑

Of course, the first answer is wrong. \u20AC and \u03A3 are Unicode escapes, representing € and ∑ in the UTF-16 encoding that Java uses in String objects.

The second answer would be right if the default charset was UTF-8. But it can’t be right, since the ∑ character isn’t in Windows-1252. So, the third choice must be the answer.

Actually, it’s the fourth.

The Windows console uses a different character set, the truly archaic IBM437 or “code page 437” from the original 1981 IBM Personal Computer. Interestingly, Java knows about that tidbit (see below).
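Why the fourth answer? Code page 437 has an uppercase sigma but no Euro sign. The following sketch checks that with a CharsetEncoder. The class name Cp437Demo is hypothetical, and the sketch assumes that the IBM437 charset is available in your JDK.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the name is not from the article.
class Cp437Demo {
    // Assumption: the IBM437 charset is available in this JDK.
    static final Charset CP437 = Charset.forName("IBM437");

    // Reports whether code page 437 has a byte for this character.
    static boolean canEncode(char c) {
        return CP437.newEncoder().canEncode(c);
    }

    // Encodes a string; unmappable characters become the replacement byte '?'.
    static byte[] encode(String s) {
        return s.getBytes(CP437);
    }

    public static void main(String[] args) {
        System.out.println("Euro encodable:  " + canEncode('\u20AC'));
        System.out.println("Sigma encodable: " + canEncode('\u03A3'));
        for (byte b : encode("\u20AC\u03A3")) {
            System.out.printf("0x%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

The Euro sign turns into 0x3F (a question mark), and the sigma becomes 0xE4, which the console displays as a sigma.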

Now try

java Test > out
type out

What do you think happens now?

  1. The output contains €∑
  2. The output contains €?
  3. The output contains ?∑
  4. The output contains Ç?

If you picked the last choice, pat yourself on the back and do something better with your time than reading this blog.

For the rest of us, where does Ç? come from???

To understand that, remember that System.out is an instance of java.io.PrintStream. That actually makes no sense since you send characters and strings, not bytes, to System.out. But the Writer class was added only in Java 1.1, and of course by then it was far too late to change System.out to a PrintWriter since it might have broken some of the dozens of Java programs that were out in the field already.

When you look at the source code for PrintStream, you’ll find a field

private OutputStreamWriter charOut;

That’s the writer to which println sends its output. It’s easy enough to get it through reflection:

Field f = PrintStream.class.getDeclaredField("charOut");
f.setAccessible(true);
OutputStreamWriter charOut = (OutputStreamWriter) f.get(System.out);

Now we can ask it for its encoding:

System.out.println(charOut.getEncoding());

When you run java Test without redirection, this line prints

Cp437

With redirection (java Test > out), you get

Cp1252

It is interesting that the encoding for System.out changes when you redirect the output. But that still doesn’t explain the Ç character. Actually, out contains two bytes: 0x80, the Windows-1252 encoding of €, and 0x3F, the encoding of ?. The encoder for Windows-1252 produced a ? when it couldn’t encode the ∑.

When you type that file on the Windows console, which uses code page 437, then the 0x80 shows up as Ç, the character with code page 437 encoding 0x80. And the 0x3F shows up as ? since the ASCII characters have the same encoding in both code pages.
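You can replay that round trip without a Windows console. This sketch encodes with one charset and decodes with another. The class MojibakeDemo and the method writeThenType are invented names, and both charsets are assumed to be present in your JDK.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the name is not from the article.
class MojibakeDemo {
    // Encodes text the way the writer would, then decodes the bytes
    // the way a console with a different code page would.
    static String writeThenType(String text, Charset writer, Charset console) {
        byte[] bytes = text.getBytes(writer);
        return new String(bytes, console);
    }

    public static void main(String[] args) {
        // Assumption: both charsets are available in this JDK.
        Charset cp1252 = Charset.forName("windows-1252");
        Charset cp437 = Charset.forName("IBM437");
        String shown = writeThenType("\u20AC\u03A3", cp1252, cp437);
        // Print code points so that the result does not depend on your terminal.
        for (char c : shown.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The result is U+00C7 (Ç) followed by U+003F (?), which matches what the console shows.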

That’s pretty crazy. You can run

chcp 1252

so that the console and Java writers have the same encoding. Then you get

windows-1252
€?
Cp1252

Or you can switch the Windows console to Unicode:

chcp 65001

Then you get

windows-1252
�?
Cp1252

In other words, the Java program knows that the console is no longer using code page 437, but it doesn’t want to believe its good fortune that it’s actually using UTF-8, so it falls back to Windows-1252, emitting € as 0x80 and ? as 0x3F (for the ∑ that Windows-1252 can’t encode). The Windows console can’t make sense of 0x80, which should never be the first byte of a UTF-8 coding sequence, so it displays a replacement character � (U+FFFD).
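Java’s own UTF-8 decoder reacts to a stray 0x80 in the same way, so you can observe the effect in a small sketch. The class Utf8Mismatch and its helpers are hypothetical names. The second line of output shows the bytes that a UTF-8 writer would have produced instead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name is not from the article.
class Utf8Mismatch {
    // Decodes bytes as UTF-8; malformed input becomes U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The two bytes that the Windows-1252 writer emitted.
        byte[] written = { (byte) 0x80, 0x3F };
        String shown = decodeAsUtf8(written);
        System.out.printf("U+%04X U+%04X%n",
            (int) shown.charAt(0), (int) shown.charAt(1));
        // The bytes that a UTF-8 writer would emit for the same string.
        System.out.println(hex("\u20AC\u03A3".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The first line reads U+FFFD U+003F. The second reads E2 82 AC CE A3: three bytes for the Euro sign and two for the sigma, none of which is a lone 0x80.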

That is utter madness. To really get it to work, do this:

chcp 65001
java -Dfile.encoding=UTF-8 Test

Then you can finally see

UTF-8
€∑
UTF8

in the console.

Disclaimer: The file.encoding property is undocumented and not officially supported, and it has been reported to act inconsistently across Java versions and platforms. This simple use for changing the character encoding for System.out seems to work. But don’t use it as a mechanism for setting the Charset for arbitrary Writer instances.
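If you would rather not depend on file.encoding, one documented alternative is to wrap standard output in your own PrintStream with an explicit encoding. This is only a sketch, not something from the experiment above: the class Utf8Out and its utf8 method are made-up names, and the console presumably still needs chcp 65001 to display the bytes correctly.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper class; the name is not from the article.
class Utf8Out {
    // Wraps any output stream in an autoflushing PrintStream that encodes as UTF-8.
    static PrintStream utf8(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        // FileDescriptor.out is the standard output of the process.
        PrintStream out = utf8(new FileOutputStream(FileDescriptor.out));
        out.println("\u20AC\u03A3");
    }
}
```

Since the encoding is passed to the constructor, the stream does not consult the default charset at all.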

PS. Here is the complete program for you to copy/paste and experiment.

import java.io.*;
import java.lang.reflect.*;

public class Test {
   public static void main(String[] args) throws ReflectiveOperationException {
      // Default charset of the JVM
      System.out.println(java.nio.charset.Charset.defaultCharset());
      // Euro sign followed by uppercase sigma
      System.out.println("\u20AC\u03A3");
      // Peek at the private writer inside System.out
      Field f = PrintStream.class.getDeclaredField("charOut");
      f.setAccessible(true);
      OutputStreamWriter charOut = (OutputStreamWriter) f.get(System.out);
      System.out.println(charOut.getEncoding());
   }
}
