Java Cookbook
Detect the Charset of a URL
Problem:
You want to know what character set a web page is encoded in.
Solution:
You can use ICU4J's CharsetDetector class to produce a best guess at the charset of a given input stream. This is very much a guess, based only on the bytes it is shown, and should not be relied on too heavily.
To get the charset of a web page:
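import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;

public class DetectCharset {
    public static void main(String[] args) {
        try {
            // The URL to check
            URL url = new URL("http://ru.yahoo.com");
            // Buffer for the first 2000 bytes of the page
            byte[] bytes = new byte[2000];
            int total = 0;
            // try-with-resources closes the stream even if reading fails
            try (InputStream is = url.openStream()) {
                int n;
                // A single read() may return fewer bytes than requested, so loop
                while (total < bytes.length
                        && (n = is.read(bytes, total, bytes.length - total)) != -1) {
                    total += n;
                }
            }
            // Get a CharsetDetector
            CharsetDetector cd = new CharsetDetector();
            // Set the text to detect (only the bytes actually read)
            cd.setText(Arrays.copyOf(bytes, total));
            // Detect
            CharsetMatch match = cd.detect();
            // Print the name of the most likely match; detect() returns null if nothing matched
            System.out.println(match != null ? match.getName() : "unknown");
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}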
The output:
UTF-8
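Because the detector only guesses, one way to hedge is to prefer the charset the server declares in its Content-Type header and fall back to the guess only when it is reasonably confident. The following is a minimal sketch of that idea, not part of ICU4J: the class CharsetChooser, its methods, and the minConfidence threshold are hypothetical names chosen for illustration, and the final fallback to UTF-8 is an assumption. It uses only the standard library, so the guessed name and confidence are passed in as plain values.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper (not an ICU4J class): picks a charset from a declared
// Content-Type value and a detector's guess.
final class CharsetChooser {

    // Returns the charset parameter of a Content-Type value, or null if absent.
    static String declaredCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // True if the JVM knows this charset name; illegal names count as unsupported.
    static boolean isSupported(String name) {
        try {
            return name != null && Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Prefer the declared charset, then a sufficiently confident guess,
    // then UTF-8 (an assumed default, not something the recipe prescribes).
    static Charset choose(String contentType, String guessedName,
                          int confidence, int minConfidence) {
        String declared = declaredCharset(contentType);
        if (isSupported(declared)) {
            return Charset.forName(declared);
        }
        if (confidence >= minConfidence && isSupported(guessedName)) {
            return Charset.forName(guessedName);
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // The declared charset wins over the guess
        System.out.println(choose("text/html; charset=ISO-8859-1", "UTF-8", 90, 50));
    }
}
```

In the recipe above, the arguments could come from url.openConnection().getContentType(), match.getName(), and match.getConfidence(), which ICU4J reports on a scale of 0 to 100. The threshold of 50 used in main is arbitrary and would need tuning for real pages.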
If you are testing any of these recipes in Eclipse and the characters do not display correctly in your console, visit http://i18ncookbook.com/eclipse_settings.