Detecting a Text File's Encoding from Its BOM

balaschen

Copied from the Tomcat source code and modified:

/**
 * Guesses the encoding from the first bytes of a stream.
 *
 * @param b4    buffer holding the leading bytes (up to four are examined)
 * @param count number of valid bytes in b4
 * @return a two-element array: the encoding name (String) and the number
 *         of leading bytes that were matched (Integer)
 * @throws Exception if the encoding cannot be determined
 */
private Object[] getEncodingName(byte[] b4, int count) throws Exception {
    if (count < 2) {
        throw new Exception("unknown");
    }

    // UTF-16, with BOM
    int b0 = b4[0] & 0xFF;
    int b1 = b4[1] & 0xFF;
    if (b0 == 0xFE && b1 == 0xFF) {
        // UTF-16, big-endian
        return new Object[] {"UTF-16BE", Integer.valueOf(2)};
    }
    if (b0 == 0xFF && b1 == 0xFE) {
        // UTF-16, little-endian
        return new Object[] {"UTF-16LE", Integer.valueOf(2)};
    }

    // not enough bytes to check for the 3-byte UTF-8 BOM
    if (count < 3) {
        throw new Exception("unknown");
    }

    // UTF-8 with a BOM
    int b2 = b4[2] & 0xFF;
    if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
        return new Object[] {"UTF-8", Integer.valueOf(3)};
    }

    // not enough bytes to check the 4-byte signatures
    if (count < 4) {
        throw new Exception("unknown");
    }

    // other encodings: no BOM here; these patterns are '<' (0x3C) and
    // '?' (0x3F), the start of an XML declaration, in various encodings
    int b3 = b4[3] & 0xFF;
    if (b0 == 0x00 && b1 == 0x00 && b2 == 0x00 && b3 == 0x3C) {
        // UCS-4, big-endian (1234)
        return new Object[] {"ISO-10646-UCS-4", Integer.valueOf(4)};
    }
    if (b0 == 0x3C && b1 == 0x00 && b2 == 0x00 && b3 == 0x00) {
        // UCS-4, little-endian (4321)
        return new Object[] {"ISO-10646-UCS-4", Integer.valueOf(4)};
    }
    if (b0 == 0x00 && b1 == 0x00 && b2 == 0x3C && b3 == 0x00) {
        // UCS-4, unusual octet order (2143)
        // REVISIT: What should this be?
        return new Object[] {"ISO-10646-UCS-4", Integer.valueOf(4)};
    }
    if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x00) {
        // UCS-4, unusual octet order (3412)
        // REVISIT: What should this be?
        return new Object[] {"ISO-10646-UCS-4", Integer.valueOf(4)};
    }
    if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) {
        // UTF-16, big-endian, no BOM
        // (or could turn out to be UCS-2...)
        // REVISIT: What should this be?
        return new Object[] {"UTF-16BE", Integer.valueOf(4)};
    }
    if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) {
        // UTF-16, little-endian, no BOM
        // (or could turn out to be UCS-2...)
        return new Object[] {"UTF-16LE", Integer.valueOf(4)};
    }
    if (b0 == 0x4C && b1 == 0x6F && b2 == 0xA7 && b3 == 0x94) {
        // EBCDIC ("<?xm")
        // a la xerces1, return CP037 instead of EBCDIC here
        return new Object[] {"CP037", Integer.valueOf(4)};
    }

    throw new Exception("unknown");
}
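
For plain text files (as opposed to XML), only the BOM checks are usually relevant. The following is a minimal, BOM-only sketch of the same idea, not the Tomcat code: the class name `BomSniffer` and its `detect` methods are hypothetical names chosen for illustration. It also assumes that `FF FE 00 00` should be read as a UTF-32LE BOM rather than a UTF-16LE BOM followed by U+0000, so the 4-byte patterns are tested first. It returns `null` instead of throwing when no BOM is found, and it needs Java 11 or later for `Path.of` and `readNBytes`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class BomSniffer {

    /** Returns the charset name implied by a leading BOM, or null if there is none. */
    static String detect(byte[] head, int count) {
        // Bytes beyond 'count' are treated as absent (-1 never matches a BOM byte).
        int b0 = count > 0 ? head[0] & 0xFF : -1;
        int b1 = count > 1 ? head[1] & 0xFF : -1;
        int b2 = count > 2 ? head[2] & 0xFF : -1;
        int b3 = count > 3 ? head[3] & 0xFF : -1;

        // UTF-32 first: FF FE 00 00 begins with the UTF-16LE BOM (assumption, see above).
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UTF-32BE";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UTF-32LE";
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";
        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        return null;
    }

    /** Reads up to four bytes from the start of the file and inspects them. */
    static String detect(Path file) throws IOException {
        byte[] head = new byte[4];
        try (InputStream in = Files.newInputStream(file)) {
            int count = in.readNBytes(head, 0, head.length);
            return detect(head, count);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + ": " + detect(Path.of(arg)));
        }
    }
}
```

A file without a BOM yields `null`, in which case the caller has to fall back to a default charset or some other heuristic.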

2008-01-24 15:19