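How do you migrate HBase data? Here is one approach: http://blog.mozilla.com/data/2011/02/04/migrating-hbase-in-the-trenches/ . I have not verified it yet, because I ran into a thornier problem: my two clusters sit on two separate LANs and cannot reach each other. (It is possible, though, to have one machine with two network cards connected to both clusters.)
First, get familiar with: /app/cloud/hadoop/bin/hadoop distcp src dest
Original post: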
We recently had a situation where we needed to copy a lot of HBase data while migrating from our old datacenter to our new one. The old cluster was running Cloudera’s CDH2 with HBase 0.20.6 and the new one is running CDH3b3. Usually I would use Hadoop’s distcp utility for such a job. As it turned out, we were unable to use distcp while HBase was still running on the source cluster. Part of the reason for this is that HFTP will throw XML errors due to HBase modifying files (particularly if HBase removes a directory). Transferring our entire dataset at the time was also going to take well over a day. This presented a serious problem because we couldn’t accept that kind of downtime. We were also about 75% full in the source cluster, so doing an HBase export was out as well. Thus I created a utility called Backup.

Backup is designed to do essentially the same work as distcp, with a few differences. The first is that Backup is designed to move beyond failures. Since we’re still running HBase on the source cluster, we can actually expect quite a few failures. So Backup’s MapReduce job will by design catch generic exceptions. This is probably a bit over-zealous, but I really needed it not to fail no matter what, especially after a few hours in.

One of the other differences is that I designed Backup to always use relative paths. It does this by generating a common path between the source and destination via regular expression. Distcp, on the other hand, will do some really interesting things depending on what options you’ve enabled. If you use the -f flag for providing a file list, it will take all the files and write them directly to the target directory, rather than putting them in their respective sub-directories based on the source path. If you run with the -update flag, it seems to put the source directory inside the destination rather than realizing that I want these two directories to look the same.

The last major difference is that Backup is designed to run in update mode always. This came about because our network connection could only push about 200 MB/s between datacenters. We later found that a firewall was the bottleneck, but we didn’t want to drop our pants to the world either. Distcp would take hours just to stat and compare the files, because it currently does this in a single thread before it runs its MapReduce job. For context, we had something on the order of 300K-400K files we were looking to transfer. The single-threaded approach actually makes sense when considering that distcp is only a single MapReduce job and it wants to distribute the copy evenly. Since we needed to minimize downtime, the first thing I did was distribute the file stat comparisons. In exchange, we currently take a hit on not being able to evenly distribute the copy work.

Backup uses a hack to attempt to get better distribution, but it’s nowhere near ideal. Currently it looks at the top-level directories just under the main source directory. It then splits that list of directories into mapred.map.tasks number of files. Since the data is small (remember, this is paths and not the actual data), you’re pretty much guaranteed MapReduce will take your suggestion for once. This splits up the copy pretty well, especially for the first run. On subsequent runs, however, you’ll get bottlenecked by a few nodes doing all the work. You can always up mapred.map.tasks even higher, but really I need to split it out into two MapReduce jobs. I also added a -f flag so that we could specify file lists. I’ll explain later on why this was really useful for us.

So back to our situation. I ran the first Backup job while HBase was running. This copied the bulk of our 28 TB dataset, obviously with a bunch of failures because HBase had deleted some directories. Now that we had most of the data, we could do subsequent Backups within a smaller time window. We ingest about 300 GB/day, so our skinny pipe between datacenters was able to make subsequent transfers in hours and not days.

During scheduled downtime we would shut down the source HBase. Then we copied the data to a secondary cluster in the new datacenter. As soon as the transfer was finished, we would verify that the source and destination matched. If so, we were all good to start up the source cluster again and resume normal production operation. Meanwhile we would copy the data from the secondary cluster to the new production cluster. The reason for doing this was that HBase 0.89+ would change the region directories, and we also needed to allow Socorro web developers to do their testing. So having the two separate clusters was a real blessing. It allowed us to keep a pristine backup at all times on secondary while testing against the new production cluster. We did this a number of times the week before launch, always trying to keep everything as up to date as we could before we threw the switch to cut over.

It was during this last week that I added the -f flag, which allowed giving Backup a source file list. We would run “hadoop fs -lsr /hbase” on both the source and the destination cluster. I wrote a simple python utility (lsr_diff) to compare these two files and figure out what needed to be copied and what needed to be deleted. The files to copy could be given to the Backup job, while the deletes could be handled with a short shell script (Backup doesn’t have delete functionality). The process looked something like this:

RUN ON SOURCE CLUSTER:
    hadoop fs -lsr /hbase > source_hbase.txt

RUN ON TARGET CLUSTER:
    hadoop fs -lsr /hbase > target_hbase.txt
    scp source_host:./source_hbase.txt .
    python lsr_diff.py source_hbase.txt target_hbase.txt
    sort copy-paths.txt -o copy-paths.sorted
    sudo -u hdfs hadoop fs -put copy-paths.sorted copy-paths.sorted
    nohup sudo -u hdfs hadoop jar akela-job.jar com.mozilla.hadoop.Backup -Dmapred.map.tasks=112 -f hdfs://target_host:8020/user/hdfs/copy-paths.sorted hftp://source_host:50070/hbase hdfs://target_host:8020/hbase

The number of map tasks I refined over time, but I started the initial run with (# of hosts * # of map task slots). On subsequent runs I ended up doubling that number. After the backup job completed each time, we would run “hadoop fs -lsr” and diff again to make sure that everything copied over. I saw a lot of times that wasn’t the case when the source was HFTP from one datacenter to another. However, when copying files from an HDFS source within our new datacenter, I never saw an issue with copying.

Due to other issues (there always are, right?) we had a pretty tight timeline and this system was pretty hacked together, but it worked for us. In the future I would love to see some modifications made to distcp. Here’s my wishlist based on our experiences:

1.) Distribute the file stat comparisons and then run a second MapReduce job to do the actual copying.
2.) Do proper relative path copies.
3.) Distribute deletes too.

To be honest, though, I found the existing distcp code a bit overly complex; otherwise I might have made the modifications myself. Perhaps the best thing is for someone to take a crack at a fresh rewrite of distcp altogether. I would love to hear people’s feedback.

Note: if anyone has a better approach, please let me know. The solution described above does not fit my situation.
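
The listing comparison that the quoted article attributes to lsr_diff can be sketched roughly as follows. This is not the original Mozilla utility, whose source is not shown here. It is a minimal illustration that assumes each file line of `hadoop fs -lsr` output has eight whitespace-separated columns (permissions, replication, owner, group, size, date, time, path) and that a file needs copying when it is missing on the target or its size differs. The function names `parse_lsr` and `diff_listings` are hypothetical.

```python
def parse_lsr(lines):
    """Map each file path in an `hadoop fs -lsr` listing to its size in bytes.

    Assumed column layout (an assumption, not taken from the article):
    permissions, replication, owner, group, size, date, time, path.
    """
    files = {}
    for line in lines:
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip blank or malformed lines and directory entries
        files[fields[7].rstrip("\n")] = int(fields[4])
    return files


def diff_listings(source, target):
    """Return (paths to copy, paths to delete) as sorted lists.

    A path is copied when the target lacks it or holds a different size;
    a path is deleted when it exists only on the target.
    """
    to_copy = sorted(p for p, size in source.items() if target.get(p) != size)
    to_delete = sorted(p for p in target if p not in source)
    return to_copy, to_delete
```

In the process above, the two inputs would be the lines of source_hbase.txt and target_hbase.txt, and the copy list would be written out one path per line as copy-paths.txt. Comparing by size only is a simplification: it would miss a file rewritten with the same length, so a real run would still need the follow-up “hadoop fs -lsr” check the article describes.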