Detect the Charset of a URL

Problem:

You want to know what character set a web page is encoded in.

Solution:

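Java programmers can use the CharsetDetector class from icu4j to get a best guess at the charset of a byte array or input stream. The result is a statistical guess, not a guarantee, so it should not be relied upon on its own; CharsetMatch.getConfidence() reports how confident the detector is, on a scale of 0 to 100.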

 

To get the charset of a web page:

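// Imports needed at the top of the file
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;

try {
    // The URL to check
    URL url = new URL("http://ru.yahoo.com");
    // Sample up to the first 2000 bytes of the page
    byte[] bytes = new byte[2000];
    int total = 0;
    // try-with-resources closes the stream even if a read fails
    try (InputStream is = url.openStream()) {
        int n;
        // A single read() may return fewer bytes than requested, so loop
        while (total < bytes.length
                && (n = is.read(bytes, total, bytes.length - total)) != -1) {
            total += n;
        }
    }
    // Get a CharsetDetector and give it only the bytes actually read
    CharsetDetector cd = new CharsetDetector();
    cd.setText(Arrays.copyOf(bytes, total));
    // Detect; the result may be null if nothing matches
    CharsetMatch match = cd.detect();
    // Print the name of the most likely match
    System.out.println(match != null ? match.getName() : "unknown");
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}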

 


The output:

UTF-8
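
Because the detected charset is only a guess, it can be useful to compare it with the charset the server declares in its Content-Type header (available from URLConnection.getContentType()). The following is a minimal sketch of that cross-check using only the standard library; the class name DeclaredCharset and the method fromContentType are hypothetical names chosen for illustration, and the parsing is deliberately simple rather than a complete header parser.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper (not part of icu4j): reads the charset a server declares.
final class DeclaredCharset {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8". Returns empty if the parameter is
    // missing, malformed, or names a charset this JVM does not support.
    static Optional<Charset> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Parameter names are case-insensitive
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Sample header values; in practice pass url.openConnection().getContentType()
        System.out.println(fromContentType("text/html; charset=UTF-8"));
        System.out.println(fromContentType("text/html"));
    }
}
```

If the declared charset and the detector's best match disagree, treat the result with extra caution; neither source is authoritative, since servers can also send a wrong or missing charset.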