The i18n Cookbook - recipes for a global society

Detect the Charset of a URL

 Problem:

You want to know what character set a web page is encoded in.

Solution:

You can use ICU4J's CharsetDetector class to make a best guess at the charset of an input stream or byte array. Detection is a statistical guess based on the bytes it is given, so do not rely on it too heavily. It works best with at least a few hundred bytes of text that is mostly in a single language.

To get the charset of a web page:

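// Imports needed by this snippet
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;

try {
    // The URL to check
    URL url = new URL("http://ru.yahoo.com");
    byte[] bytes = new byte[2000];
    int length = 0;
    // Open the stream; try-with-resources closes it even if a read fails
    try (InputStream is = url.openStream()) {
        // A single read() may return fewer bytes than requested, so keep
        // reading until the buffer is full or the stream ends
        int n;
        while (length < bytes.length
                && (n = is.read(bytes, length, bytes.length - length)) != -1) {
            length += n;
        }
    }
    // Get a CharsetDetector and give it only the bytes actually read
    CharsetDetector cd = new CharsetDetector();
    cd.setText(Arrays.copyOf(bytes, length));
    // Detect; detect() returns null when nothing matches the input
    CharsetMatch match = cd.detect();
    // Print the name of the most likely match
    System.out.println(match != null ? match.getName() : "unknown");
} catch (MalformedURLException e) {
    // The URL string was not a valid URL
    e.printStackTrace();
} catch (IOException e) {
    // The page could not be opened or read
    e.printStackTrace();
}
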

The output (it depends on what the page serves when you run the code):

UTF-8
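
Because the detector only guesses, it can be worth checking first whether the server declares a charset at all, and falling back to CharsetDetector only when it does not. The sketch below is an illustration and not part of the original recipe: the class name DeclaredCharset and the method fromContentType are hypothetical helpers, and the parsing is deliberately simple. It pulls the charset parameter out of a Content-Type header value such as the string returned by url.openConnection().getContentType().

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper (not part of ICU4J or the JDK): reads the charset
// that a server declares in its Content-Type header, if any.
final class DeclaredCharset {

    private static final String PARAM = "charset=";

    /**
     * Extracts the charset parameter from a Content-Type header value,
     * for example "text/html; charset=UTF-8". Returns an empty Optional
     * when the value is null or declares no charset.
     */
    static Optional<String> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Parameters are separated by semicolons: "type/subtype; name=value"
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Parameter names are case-insensitive
            if (p.toLowerCase(Locale.ROOT).startsWith(PARAM)) {
                String value = p.substring(PARAM.length()).trim();
                // The value may be wrapped in double quotes
                if (value.length() >= 2
                        && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return Optional.of(value);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // In real use the argument would come from
        // url.openConnection().getContentType()
        System.out.println(
            fromContentType("text/html; charset=UTF-8").orElse("(none declared)"));
        System.out.println(
            fromContentType("text/html").orElse("(none declared)"));
    }
}
```

A declared charset can itself be wrong, so when the header and the detector disagree, treat both as hints and decide which one to trust.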

 


If you are testing any of these recipes in Eclipse and the characters are not displaying correctly in your console, visit http://i18ncookbook.com/eclipse_settings.
