One of the most important tasks a Java internationalization programmer has is the handling of different writing systems. Character set conversion used to be a huge task for all developers working with multiple locales. While the widespread acceptance of Unicode has made that job much easier than before, there are still many situations where character set conversion becomes important.
Transliteration is another technique that the localization programmer will find useful. Transliteration should not be confused with translation. Translation is the conversion of text from one language to another. Transliteration is the conversion of text from one writing system (script) to another. This is useful for handling names and other non-translatable text.
You want to convert text from one script to another.
The Transliterator class from icu4j allows for easy conversion of text from one writing system to another. To transliterate text you get an instance of the Transliterator class using an id that names the writing system you are converting from and the writing system you are converting to (for example "Latin-Hangul"), and then call the transliterate method on the object.
It is important to note that not all writing systems contain all sounds, nor do they handle sounds in the same way. This means that round-trip transliteration is often lossy: converting to another script and back again may not return the original text.
To transliterate English text from Latin script to Hangul:
//Get a Transliterator instance for converting Latin script to Hangul (Korean) script
Transliterator trans = Transliterator.getInstance("Latin-Hangul");
//The text to transliterate
String txt = "Transliteration is very cool.";
//Output the example text
System.out.println(txt);
//Transliterate the text
String korean = trans.transliterate(txt);
//Output the Hangul text
System.out.println("To Hangul: " + korean);
//Get an instance of a transliterator going in the reverse.
//This is the same as calling Transliterator.getInstance("Hangul-Latin");
trans = Transliterator.getInstance("Latin-Hangul",Transliterator.REVERSE);
//Output the transliterated English value
//Note the English doesn't match the original value exactly.
//This is due to the different sounds available in each script
System.out.println("Back to English: " + trans.transliterate(korean));
The output:
Transliteration is very cool.
To Hangul: 트란스리테라티온 잇 베류 초올.
Back to English: teulanseulitelation is belyu chool.
You want to know what character set a web page is encoded in.
You can use icu4j's CharsetDetector class to make a best guess at the charset of a byte array or input stream. The result is only a guess based on the bytes it examines, so it should not be relied upon too heavily.
To get the charset of a web page:
try{
//The url to check
URL url = new URL("http://ru.yahoo.com");
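//get an inputstream from the url
InputStream is = url.openStream();
//Read up to 2000 bytes. A single read() may return fewer bytes than requested,
//so keep reading until the buffer is full or the stream ends
byte[] buffer = new byte[2000];
int length = 0;
int n;
while(length < buffer.length && (n = is.read(buffer, length, buffer.length - length)) != -1){
length += n;
}
is.close();
//Keep only the bytes actually read (java.util.Arrays) so that unused
//trailing zeros in the buffer do not skew the detection
byte[] bytes = Arrays.copyOf(buffer, length);
//get a CharsetDetector
CharsetDetector cd = new CharsetDetector();
//set the text to detect
cd.setText(bytes);
//detect
CharsetMatch match = cd.detect();
//Output the name of the most likely match; detect() may return null if nothing matches
System.out.println(match != null ? match.getName() : "unknown");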
}catch (MalformedURLException e){
e.printStackTrace();
}catch (IOException e){
e.printStackTrace();
}
The output:
UTF-8
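Because detection is only a guess, one way to hedge is to confirm that the bytes really decode without errors under the reported charset before trusting it. The following is a minimal sketch that uses only the standard library's CharsetDecoder, not icu4j; the class name CharsetCheck and the method decodesCleanly are hypothetical names chosen for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of icu4j.
class CharsetCheck {

    // Returns true if every byte sequence in the array is valid in the named charset.
    // Charset.forName throws an unchecked exception if the name is unknown to the JVM.
    static boolean decodesCleanly(byte[] bytes, String charsetName) {
        try {
            Charset.forName(charsetName)
                    .newDecoder()
                    // Report errors instead of silently substituting replacement characters
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Russian text encoded as UTF-8
        byte[] utf8 = "\u041f\u0440\u0438\u0432\u0435\u0442".getBytes(StandardCharsets.UTF_8);
        // "cafe" with an accented e, encoded as ISO-8859-1
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 bytes as UTF-8: " + decodesCleanly(utf8, "UTF-8"));
        System.out.println("ISO-8859-1 bytes as UTF-8: " + decodesCleanly(latin1, "UTF-8"));
    }
}
```

A clean decode is necessary but not sufficient: single-byte charsets such as ISO-8859-1 accept any byte sequence, so this check is most useful for ruling out multi-byte encodings such as UTF-8.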
You want to retrieve all available source ids for a Transliterator.
A Transliterator is created using a combination of a source and a target id joined by a hyphen.
To get a list of all sources:
//Get an Enumeration of available source ids
Enumeration<String> ids = Transliterator.getAvailableSources();
//Loop through available ids and output them to the console
while(ids.hasMoreElements()){
String id = ids.nextElement();
System.out.println(id);
}
The output:
Arabic
Hangul
Tamil
Thaana
Gujarati
Simplified
Han
Telugu
Syriac
Devanagari
Name
Publishing
Digit
Latin
Kannada
NumericPinyin
Jamo
Any
Fullwidth
Cyrillic
Armenian
Georgian
Katakana
Hex
Malayalam
Oriya
Pinyin
Tone
Thai
Greek
Hiragana
Halfwidth
Hebrew
Accents
Traditional
Bengali
Gurmukhi
You want to get all available transliterator ids.
You obtain a Transliterator by specifying an id. The id is a String that contains a source name and a target name separated by a hyphen, for example "Latin-Hangul". Some ids also carry a variant after a slash, for example "Greek-Latin/UNGEGN".
To obtain a list of all ids:
//Get an Enumeration of available ids
Enumeration<String> ids = Transliterator.getAvailableIDs();
//Loop through available ids and output them to the console
while(ids.hasMoreElements()){
String id = ids.nextElement();
System.out.println(id);
}
This outputs:
Accents-Any
Any-Accents
Any-Publishing
Arabic-Latin
Armenian-Latin
Bengali-Devanagari
Bengali-Gujarati
Bengali-Gurmukhi
Bengali-Kannada
Bengali-Latin
Bengali-Malayalam
Bengali-Oriya
Bengali-Tamil
Bengali-Telugu
Cyrillic-Latin
Devanagari-Bengali
Devanagari-Gujarati
Devanagari-Gurmukhi
Devanagari-Kannada
Devanagari-Latin
Devanagari-Malayalam
Devanagari-Oriya
Devanagari-Tamil
Devanagari-Telugu
Digit-Tone
Fullwidth-Halfwidth
Georgian-Latin
Greek-Latin
Greek-Latin/UNGEGN
Gujarati-Bengali
Gujarati-Devanagari
Gujarati-Gurmukhi
Gujarati-Kannada
Gujarati-Latin
Gujarati-Malayalam
Gujarati-Oriya
Gujarati-Tamil
Gujarati-Telugu
Gurmukhi-Bengali
Gurmukhi-Devanagari
Gurmukhi-Gujarati
Gurmukhi-Kannada
Gurmukhi-Latin
Gurmukhi-Malayalam
Gurmukhi-Oriya
Gurmukhi-Tamil
Gurmukhi-Telugu
Halfwidth-Fullwidth
Han-Latin
Hangul-Latin
Hebrew-Latin
Hiragana-Katakana
Hiragana-Latin
Jamo-Latin
Kannada-Bengali
Kannada-Devanagari
Kannada-Gujarati
Kannada-Gurmukhi
Kannada-Latin
Kannada-Malayalam
Kannada-Oriya
Kannada-Tamil
Kannada-Telugu
Katakana-Hiragana
Katakana-Latin
Latin-Arabic
Latin-Armenian
Latin-Bengali
Latin-Cyrillic
Latin-Devanagari
Latin-Georgian
Latin-Greek
Latin-Greek/UNGEGN
Latin-Gujarati
Latin-Gurmukhi
Latin-Han
Latin-Hangul
Latin-Hebrew
Latin-Hiragana
Latin-Jamo
Latin-Kannada
Latin-Katakana
Latin-Malayalam
Latin-NumericPinyin
Latin-Oriya
Latin-Syriac
Latin-Tamil
Latin-Telugu
Latin-Thaana
Latin-Thai
Malayalam-Bengali
Malayalam-Devanagari
Malayalam-Gujarati
Malayalam-Gurmukhi
Malayalam-Kannada
Malayalam-Latin
Malayalam-Oriya
Malayalam-Tamil
Malayalam-Telugu
NumericPinyin-Latin
NumericPinyin-Pinyin
Oriya-Bengali
Oriya-Devanagari
Oriya-Gujarati
Oriya-Gurmukhi
Oriya-Kannada
Oriya-Latin
Oriya-Malayalam
Oriya-Tamil
Oriya-Telugu
Pinyin-NumericPinyin
Publishing-Any
Simplified-Traditional
Syriac-Latin
Tamil-Bengali
Tamil-Devanagari
Tamil-Gujarati
Tamil-Gurmukhi
Tamil-Kannada
Tamil-Latin
Tamil-Malayalam
Tamil-Oriya
Tamil-Telugu
Telugu-Bengali
Telugu-Devanagari
Telugu-Gujarati
Telugu-Gurmukhi
Telugu-Kannada
Telugu-Latin
Telugu-Malayalam
Telugu-Oriya
Telugu-Tamil
Thaana-Latin
Thai-Latin
Tone-Digit
Traditional-Simplified
Any-Null
Any-Remove
Any-Hex/Unicode
Any-Hex/Java
Any-Hex/C
Any-Hex/XML
Any-Hex/XML10
Any-Hex/Perl
Any-Hex
Hex-Any/Unicode
Hex-Any/Java
Hex-Any/C
Hex-Any/XML
Hex-Any/XML10
Hex-Any/Perl
Hex-Any
Any-Lower
Any-Upper
Any-Title
Any-Name
Name-Any
Any-NFC
Any-NFD
Any-NFKC
Any-NFKD
Any-Latin
Any-Telugu
Any-Malayalam
Any-Oriya
Any-Gurmukhi
Any-Gujarati
Any-Bengali
Any-Devanagari
Any-Kannada
Any-Tamil
Any-Han
Any-Katakana
Any-Hiragana
Any-Armenian
Any-Cyrillic
Any-Hangul
Any-Arabic
Any-Greek
Any-Greek/UNGEGN
Any-Hebrew
Any-Thai
Any-Syriac
Any-Thaana
Any-Georgian
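The ids in this list follow a source-target pattern, sometimes with a /variant suffix. As a small illustration of that structure, the sketch below splits such an id into its parts using plain string operations. The class TransliteratorIdParts is a hypothetical helper, not an icu4j class, and it assumes only the simple form seen in the list above; it is not a parser for every id that icu4j may accept.

```java
// Hypothetical value class for illustration; not part of icu4j.
final class TransliteratorIdParts {
    final String source;
    final String target;
    final String variant; // empty string when the id has no variant

    private TransliteratorIdParts(String source, String target, String variant) {
        this.source = source;
        this.target = target;
        this.variant = variant;
    }

    // Splits a simple id of the form source-target or source-target/variant.
    static TransliteratorIdParts parse(String id) {
        int hyphen = id.indexOf('-');
        if (hyphen <= 0 || hyphen == id.length() - 1) {
            throw new IllegalArgumentException("Expected source-target, got: " + id);
        }
        String source = id.substring(0, hyphen);
        String rest = id.substring(hyphen + 1);
        // Anything after a slash is treated as the variant
        int slash = rest.indexOf('/');
        String target = slash < 0 ? rest : rest.substring(0, slash);
        String variant = slash < 0 ? "" : rest.substring(slash + 1);
        return new TransliteratorIdParts(source, target, variant);
    }

    public static void main(String[] args) {
        String[] ids = {"Latin-Hangul", "Greek-Latin/UNGEGN", "Any-Hex/Java"};
        for (String id : ids) {
            TransliteratorIdParts p = parse(id);
            System.out.println(id + " -> source=" + p.source
                    + ", target=" + p.target + ", variant=" + p.variant);
        }
    }
}
```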
You have a source id but you want to retrieve all available target ids for the source id.
A Transliterator is retrieved using an id that is a combination of a source and target id.
To retrieve all possible target IDs for the source "Latin":
//Get an Enumeration of available target ids for the source "Latin"
Enumeration<String> ids = Transliterator.getAvailableTargets("Latin");
//Loop through available ids and output them to the console
while(ids.hasMoreElements()){
String id = ids.nextElement();
System.out.println(id);
}
The output:
Gujarati
Jamo
Han
Katakana
Hiragana
Armenian
Cyrillic
NumericPinyin
Gurmukhi
Bengali
Hangul
Arabic
Greek
Devanagari
Hebrew
Thai
Oriya
Tamil
Syriac
Malayalam
Kannada
Thaana
Telugu
Georgian
You want to read a UTF-8 encoded file into memory.
To read a file with a charset other than the system default you should specify the charset when you create the reader. You can specify either a Charset object or the String name of the charset.
To read a UTF-8 file containing Japanese characters:
//Surround with try catch to handle potential exception
try{
//Get an InputStream
FileInputStream fis = new FileInputStream("C:\\files\\test.txt");
//Get a reader specifying the charset
InputStreamReader isr = new InputStreamReader(fis,"UTF-8");
//wrap with a buffered reader for performance
BufferedReader br = new BufferedReader(isr);
//Read it line by line and output each line
String txt;
while((txt = br.readLine()) != null){
System.out.println(txt);
}
//Close the reader, which also closes the underlying streams
br.close();
}catch (FileNotFoundException e){
e.printStackTrace();
}catch (IOException e){
e.printStackTrace();
}
The output:
これはテストです。
高松
日本
米国
英国
世界
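The text above notes that the reader accepts either a Charset or a String name. The sketch below shows the Charset form with StandardCharsets.UTF_8, which avoids a misspelled charset name at run time, and uses try-with-resources so the reader is always closed. The class name CharsetReadSketch, the method readLines, and the temporary file are assumptions made so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this illustration.
class CharsetReadSketch {

    // Reads all lines of a file, decoding its bytes with the given charset.
    static List<String> readLines(Path file, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Create a small UTF-8 file so the example does not depend on an existing path
        Path tmp = Files.createTempFile("charset-read", ".txt");
        try {
            Files.write(tmp, "\u65e5\u672c\n\u4e16\u754c\n".getBytes(StandardCharsets.UTF_8));
            for (String line : readLines(tmp, StandardCharsets.UTF_8)) {
                System.out.println(line);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Whether the Japanese characters display correctly on the console depends on the console's own encoding, which is separate from the encoding used to read the file.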
You want to write a file to disk in an encoding other than the default.
The convenience FileWriter class writes files in the default character encoding of the JVM, unless you use the constructors added in Java 11 that accept a Charset. If you want to specify an encoding on any Java version, you should create a FileOutputStream and pass it to an OutputStreamWriter. The OutputStreamWriter class allows you to specify the encoding as a Charset or as a String name.
To write a file to disk as Shift_JIS:
//Handle potential exceptions
try{
//Our text to write out to the file. In this case sample Japanese text
String example = "これはテストです。高松日本米国英国世界";
//Create an output stream
FileOutputStream fos = new FileOutputStream("C:\\files\\testOut.html");
//Create a writer specifying our output stream and character set.
OutputStreamWriter osw = new OutputStreamWriter(fos,"Shift_JIS");
//Let's buffer it for performance
BufferedWriter bw = new BufferedWriter(osw);
//write the file
bw.write(example);
//close the writer
bw.close();
} catch (FileNotFoundException e){
e.printStackTrace();
} catch (IOException e){
e.printStackTrace();
}
To test the output, open the file in your browser and change the encoding to Shift_JIS.
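Checking in a browser is a manual step. As a programmatic alternative, the sketch below writes text with a given charset, reads the file back with the same charset, and compares the result with the original. It assumes the JVM ships the Shift_JIS charset and falls back to UTF-8 otherwise; the class name CharsetWriteSketch, the method writeAndReadBack, and the temporary file are hypothetical choices for this illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class name, used only for this illustration.
class CharsetWriteSketch {

    // Writes the text to a temporary file in the given charset,
    // then decodes the file's bytes with the same charset and returns the result.
    static String writeAndReadBack(String text, Charset charset) throws IOException {
        Path tmp = Files.createTempFile("charset-write", ".txt");
        try {
            try (BufferedWriter bw = new BufferedWriter(
                    new OutputStreamWriter(Files.newOutputStream(tmp), charset))) {
                bw.write(text);
            }
            return new String(Files.readAllBytes(tmp), charset);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: Shift_JIS is available in this JVM; otherwise use UTF-8
        Charset charset = Charset.isSupported("Shift_JIS")
                ? Charset.forName("Shift_JIS")
                : StandardCharsets.UTF_8;
        String example = "\u3053\u308c\u306f\u30c6\u30b9\u30c8\u3067\u3059\u3002";
        String roundTripped = writeAndReadBack(example, charset);
        System.out.println(charset.name() + " round trip intact: " + example.equals(roundTripped));
    }
}
```

The comparison is useful because OutputStreamWriter replaces characters the target charset cannot represent with a substitution sequence instead of failing, so a mismatch after the round trip indicates that the chosen encoding lost data.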