Lexical Tools

UTF-8

What is UTF-8:

The term "UTF-8" stands for "Unicode Transformation Format, 8-bit form." UTF-8 encodes each Unicode character (code point) as a sequence of one to four 8-bit bytes. ASCII characters take a single byte, so plain ASCII text is also valid UTF-8.
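
As a rough illustration of the one-to-four-byte range, the sketch below encodes four sample characters and reports how many bytes each takes. The class name Utf8Lengths, the helper utf8Length, and the sample characters are chosen for this example only.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the UTF-8 encoding of s occupies (hypothetical helper).
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One code point each: Latin A, e with acute accent, a CJK ideograph,
        // and an emoji outside the Basic Multilingual Plane.
        String[] samples = { "A", "\u00E9", "\u4E2D", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", s.codePointAt(0), utf8Length(s));
        }
    }
}
```

Run as written, this should report 1, 2, 3, and 4 bytes for U+0041, U+00E9, U+4E2D, and U+1F600 respectively.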

Is my file in UTF-8 format?

A common mistake among Unicode users is not saving the file in Unicode (UTF-8) encoding. Even if non-ASCII characters (code points > 127), such as letters with diacritics and ligatures, appear correctly in the text editor, that does not mean the file is in UTF-8. It could be in MS Word format or another encoding. For example, Chinese characters could be saved in Chinese Traditional (Big5) encoding and still appear correctly in the text editor. Make sure to save the file as Unicode (UTF-8) if you are using Unicode I/O.
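
One way to check a file programmatically is to decode its bytes strictly as UTF-8 and see whether the decoder reports an error. The following is a minimal sketch, not part of the Lexical Tools; the class name Utf8Check and the method isValidUtf8 are hypothetical. Note that passing this check only shows the bytes are well-formed UTF-8 (plain ASCII always passes); it cannot prove which encoding the author intended.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting a replacement character for bad input.
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "well-formed UTF-8" : "not UTF-8");
    }
}
```

A file saved in a legacy encoding that contains non-ASCII characters will usually fail this check, because its byte sequences rarely happen to be well-formed UTF-8.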

How to read and write data in UTF-8 in Java:

Java supports UTF-8 the same way it does any other encoding. Use "UTF-8" as the encoding name.

  • Read In:

    From a given file:
    BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(fName), "UTF-8"));

    From standard input:
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));

  • Write out:

    To a given file:
    BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fName), "UTF-8"));

    To standard output:
    BufferedWriter out = new BufferedWriter(new OutputStreamWriter(System.out, "UTF-8"));
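
The reader and writer constructions above can be combined into a small round-trip test: write a line to a file as UTF-8, then read it back as UTF-8 and compare. This is a sketch under a few assumptions: the class name Utf8RoundTrip, the method roundTrip, and the temporary file are illustrative, and it passes StandardCharsets.UTF_8 (available since Java 7) instead of the string "UTF-8", which avoids the checked UnsupportedEncodingException.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Writes one line to a temporary file as UTF-8, then reads it back as UTF-8.
    static String roundTrip(String line) {
        File tmp = null;
        try {
            tmp = File.createTempFile("utf8-demo", ".txt");
            // try-with-resources closes (and flushes) the writer before reading.
            try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(tmp), StandardCharsets.UTF_8))) {
                out.write(line);
                out.newLine();
            }
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(tmp), StandardCharsets.UTF_8))) {
                return in.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                tmp.delete();
            }
        }
    }

    public static void main(String[] args) {
        String text = "na\u00EFve \u4E2D\u6587";
        // Prints true when the text survives the write/read cycle unchanged.
        System.out.println(text.equals(roundTrip(text)));
    }
}
```

Closing the writer matters here: a BufferedWriter that is never flushed or closed may leave the file empty or truncated, which can look like an encoding problem.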