Java正确判别出文件的字符集(尤其是带BOM和不带BOM的UTF-8字符)
前几天在项目中需要读取用户上传过来的txt文件,但不确定txt文件的字符集
UTF-16、UTF-8(带BOM)、Unicode可以根据前三个字节区别
- public String getTxtEncode(FileInputStream in) throws IOException{
- byte[] head = new byte[3];
- in.read(head);
- String code = "GBK";
- if (head[0] == -1 && head[1] == -2 )
- code = "UTF-16";
- if (head[0] == -2 && head[1] == -1 )
- code = "Unicode";
- //带BOM
- if(head[0]==-17 && head[1]==-69 && head[2] ==-65)
- code = "UTF-8";
- if("Unicode".equals(code)){
- code = "UTF-16";
- }
- return code;
- }
但不带BOM的UTF-8和GBK前三个字节不确定,用以上方法无法区别
通过在google上搜索发现不带BOM的识别是Java遗留的一个bug,呵呵,终于找到根源了,Java提供了此bug的解决方案
- package com.justsy.sts.utf8;
- import java.io.*;
- /**
- * This inputstream will recognize unicode BOM marks and will skip bytes if
- * getEncoding() method is called before any of the read(...) methods.
- *
- * Usage pattern: String enc = "ISO-8859-1"; // or NULL to use systemdefault
- * FileInputStream fis = new FileInputStream(file); UnicodeInputStream uin = new
- * UnicodeInputStream(fis, enc); enc = uin.getEncoding(); // check and skip
- * possible BOM bytes InputStreamReader in; if (enc == null) in = new
- * InputStreamReader(uin); else in = new InputStreamReader(uin, enc);
- */
- public class UnicodeInputStream extends InputStream {
- PushbackInputStream internalIn;
- boolean isInited = false;
- String defaultEnc;
- String encoding;
- private static final int BOM_SIZE = 4;
- public UnicodeInputStream(InputStream in, String defaultEnc) {
- internalIn = new PushbackInputStream(in, BOM_SIZE);
- this.defaultEnc = defaultEnc;
- }
- public String getDefaultEncoding() {
- return defaultEnc;
- }
- public String getEncoding() {
- if (!isInited) {
- try {
- init();
- } catch (IOException ex) {
- IllegalStateException ise = new IllegalStateException(
- "Init method failed.");
- ise.initCause(ise);
- throw ise;
- }
- }
- return encoding;
- }
- /**
- * Read-ahead four bytes and check for BOM marks. Extra bytes are unread
- * back to the stream, only BOM bytes are skipped.
- */
- protected void init() throws IOException {
- if (isInited)
- return;
- byte bom[] = new byte[BOM_SIZE];
- int n, unread;
- n = internalIn.read(bom, 0, bom.length);
- if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
- && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
- encoding = "UTF-32BE";
- unread = n - 4;
- } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
- && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
- encoding = "UTF-32LE";
- unread = n - 4;
- } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB)
- && (bom[2] == (byte) 0xBF)) {
- encoding = "UTF-8";
- unread = n - 3;
- } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
- encoding = "UTF-16BE";
- unread = n - 2;
- } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
- encoding = "UTF-16LE";
- unread = n - 2;
- } else {
- // Unicode BOM mark not found, unread all bytes
- encoding = defaultEnc;
- unread = n;
- }
- // System.out.println("read=" + n + ", unread=" + unread);
- if (unread > 0)
- internalIn.unread(bom, (n - unread), unread);
- isInited = true;
- }
- public void close() throws IOException {
- // init();
- isInited = true;
- internalIn.close();
- }
- public int read() throws IOException {
- // init();
- isInited = true;
- return internalIn.read();
- }
- }
通过使用上述InputStream类的实现可以正确的读取出不带BOM和带BOM的字符集
- package com.justsy.sts.utf8;
- import java.io.BufferedReader;
- import java.io.File;
- import java.io.FileInputStream;
- import java.io.IOException;
- import java.io.InputStreamReader;
- import java.nio.charset.Charset;
- public class UTF8Test {
- public static void main(String[] args) throws IOException {
- File f = new File("D:"+File.separator+"Order.txt");
- FileInputStream in = new FileInputStream(f);
- String dc = Charset.defaultCharset().name();
- UnicodeInputStream uin = new UnicodeInputStream(in,dc);
- BufferedReader br = new BufferedReader(new InputStreamReader(uin));
- String line = br.readLine();
- while(line != null)
- {
- System.out.println(line);
- line = br.readLine();
- }
- }
- }
结合Java提供的方案,我们就可以比较完整的判别出各种字符集了
- public String getTxtEncode(FileInputStream in) throws IOException{
- String dc = Charset.defaultCharset().name();
- UnicodeInputStream uin = new UnicodeInputStream(in,dc);
- if("UTF-8".equals(uin.getEncoding())){
- uin.close();
- return "UTF-8";
- }
- uin.close();
- byte[] head = new byte[3];
- in.read(head);
- String code = "GBK";
- if (head[0] == -1 && head[1] == -2 )
- code = "UTF-16";
- if (head[0] == -2 && head[1] == -1 )
- code = "Unicode";
- //带BOM
- if(head[0]==-17 && head[1]==-69 && head[2] ==-65)
- code = "UTF-8";
- if("Unicode".equals(code)){
- code = "UTF-16";
- }
- return code;
- }
相关推荐
利用chardet,cpdetector包获取文件格式,并判断文件类型是否带BOM
正确使用各种Java运算符,如一元运算符,算术运算符,关系运算符,逻辑运算符,条件运算符和赋值运算符等。 辨别使用if,if…else,switch选择结构执行不同的动作。 辨别使用while,for,do…...
matlab-判别分析-源代码
模式识别原理---判别函数的PPT讲义,超好
判别分析-距离判别.pptx
Fisher判别率,也就是数据复杂度衡量指标之F1。计算数据复杂度,数据输入和输出代码以附带。
数据结构实验-括号匹配的检验-链栈实现
N-MORB和E-MORB数据挖掘--玄武岩判别图及洋中脊源区地幔性质的讨论.pdf
判别分析-费歇判别法.ppt
汇编语言 输入字符的判别。从键盘输入字符,若是 0~9,则直接显示。若是 A~Z 或 a~z,则均显示‘*’。若是其它字符则不显示,继续等待新的字符输入。用回车键结束程序。
研究了Bayes判别与可拓判别两种判别法的优缺点,在Bayes判别规则中引入了关联度,建立了集Bayes判别与可拓判别两种判别方法优点的Bayes 可拓判别法,并证明了Bayes判别法与...
2020年C题--面向康复工程的脑电信号分析和判别模型.doc
数学建模-判别分析-4.zip
最好在myeclipse里面运行,可以直接在下面输入一段字符串,运行便可以看出结果。
自然语言处理——情绪判别相关文件
--|------baseface 图片、根据这些图片训练的128d向量,以及文件夹与人名的映射文件 ---------|------0 第一个人的图片tag=0 ---------|------1 第二个人的图片tag=1 ...... ---------|------n 第n个人的图片tag...
行业资料-建筑装置-带判别纸张类发行日期的厚度检测装置及纸币处理装置.zip
欢迎大家交流-判别分析-4.ppt
行业分类-设备装置-基于标注关系的手写汉字正确性判别方法.zip
判别分析-4.ppt