
hbase - table replication/snapshot/backup within/across clusters

 
 
Solutions (for each: level, precondition, run on, flow, advantages, shortcomings, use cases):

1. direct client API (level: log)
   flow: transfer data via client calls against both clusters

2. export/import (level: log)
   run on: src, then target
   flow:
   - MR job generates HDFS sequence files
   - transfer the files
   - import with MR
   advantages: supports a time-range filter

3. copy table (level: stream)
   run on: src
   flow, two cases:
   1. copy the data (memstore + HFiles) directly to the other cluster (if cluster-to-cluster connectivity is enabled)
   2. if cluster-to-cluster connectivity is NOT enabled: same as export/import, but the last step uses hdfs put to upload the files

4. replication (level: wal)
   flow: sync the WAL with the new cluster

5. bulkload

6. snapshot (level: file)
   precondition: flush before snapshotting if the table is online
   run on: src, then target
   flow:
   - create a snapshot
   - clone it to a new table
   - restore from the new table [cluster internal]

7. distcp (level: file)
   run on: src
   flow:
   - flush the memstore
   - distcp the files between both clusters
   shortcomings:
   - cannot copy data within a specified date range, but it can be used as the final step to transfer the target files generated by other solutions
   - HBase must be stopped before running distcp
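
As a concrete illustration of solution 2, a minimal Export/Import round trip with a time-range filter could look as follows (the table name, output path and the single-versions argument are only examples; the two long numbers are starttime/endtime in epoch milliseconds, and how the exported files get moved depends on whatever connectivity exists between the clusters):

 on the source cluster:
 hbase org.apache.hadoop.hbase.mapreduce.Export tableX /tmp/tableX-export 1 1401552000000 1404057600000

 (ship /tmp/tableX-export to the target cluster: distcp if the clusters can reach each other, otherwise get/scp/put)

 on the target cluster (the table must already exist there):
 hbase org.apache.hadoop.hbase.mapreduce.Import tableX /tmp/tableX-export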

 
                 
                 

 

Now, I want to retrieve last month's data from a table and back it up to another cluster, but the two clusters cannot connect to each other (no MR between them), so I came up with the following steps:

1. Subset the table data (last month: 2014-06-01 --> 2014-06-30)

 

hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Dhbase.client.scanner.caching=1000 -Dmapred.map.tasks.speculative.execution=false --starttime=1401552000000 --endtime=1404057600000 --new.name=new-tableX tableX
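
In case it helps, --starttime/--endtime are epoch milliseconds, and the endtime acts as the exclusive upper bound of the scanned time range. Assuming the month boundaries are taken at midnight in UTC+8 (which is what the two values above correspond to), they can be reproduced with GNU date:

 TZ=Asia/Shanghai date -d '2014-06-01 00:00:00' +%s000
 TZ=Asia/Shanghai date -d '2014-06-30 00:00:00' +%s000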

 Then you MUST flush this table, since some of the data still lies in the memstores and the next step operates directly at the file level:

 

 echo "flush 'new-tableX' "|hbase shell

 

 

2. Download the table's HFiles from HDFS

 hadoop fs -get /hbase/new-tableX new-tableX

 (of course you can run this command on multiple nodes in parallel by splitting the directories into subtasks, as sketched below)
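
 A rough sketch of parallelising the download across region directories on one node (the parallelism of 4 and the paths are just examples):

 mkdir -p new-tableX
 hadoop fs -ls /hbase/new-tableX | grep '^d' | awk '{print $NF}' | xargs -P 4 -I{} hadoop fs -get {} new-tableX/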

 

3. Transfer these files to the other cluster in parallel

 a. scp the part files to local nodes A, B, C ...

 b. from each node, scp its part files to a peer node of the other cluster

 (this balances the transfer across nodes, so neither side is limited by a single node's network bandwidth; see the sketch below)
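
 For example (hostnames and the way the downloaded files are split are purely illustrative; each source node ships its own subset to a different node of the target cluster):

 on nodeA: scp -r new-tableX-partA peerA:/data/staging/
 on nodeB: scp -r new-tableX-partB peerB:/data/staging/
 ...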

 

4. Now import the data into HDFS on the target cluster

 hadoop fs -put part-files /hbase

 (just mkdir the target directory if it does not exist; a concrete form is shown below)
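
 One concrete form, assuming the staged directory on the target side kept the table name new-tableX:

 hadoop fs -mkdir /hbase
 (only needed if the directory does not exist yet)
 hadoop fs -put new-tableX /hbase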

 

5. Register these HFiles in META and assign the regions

 hbase hbck -fixMeta

 then

 hbase hbck -fixAssignments

 (run the second command once more to judge whether the table is readable or not; a quick check is shown below)
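
 A quick readability check from the shell, using the table name from this walkthrough:

 echo "scan 'new-tableX', {LIMIT => 1}" | hbase shell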

 

6. Rename the new table to the original table name [optional]

hbase shell> disable 'tableName'
hbase shell> snapshot 'tableName', 'tableSnapshot'
hbase shell> clone_snapshot 'tableSnapshot', 'newTableName'
hbase shell> delete_snapshot 'tableSnapshot'
hbase shell> drop 'tableName'

 The snapshot utility is supported from version 0.94.6 on, and you can also backport the patch if you are on an older version.
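
 Applied to this walkthrough, i.e. renaming new-tableX back to tableX on the target cluster (assuming no tableX exists there yet; the snapshot name is arbitrary), the same sequence becomes:

hbase shell> disable 'new-tableX'
hbase shell> snapshot 'new-tableX', 'new-tableX-snap'
hbase shell> clone_snapshot 'new-tableX-snap', 'tableX'
hbase shell> delete_snapshot 'new-tableX-snap'
hbase shell> drop 'new-tableX'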

 

Some optimizations for step 1:

-mapreduce failure handling

 max attempts per map: -Dmapred.map.max.attempts=2

 allowed map failure percentage: -Dmapred.max.map.failures.percent=5 (the property takes an integer percentage, here 5%)

 -turn off HLog (WAL) writing (maybe by refactoring Import.Importer.java)

-decrease the block replication

-Ddfs.replication=2 or -Ddfs.replication=1

 -increase the client write buffer

-Dhbase.client.write.buffer=10485760

 -presplit the new table when it is created in step 1

 {NUMREGIONS => ..., SPLITALGO => 'HexStringSplit'} (see [1] for choosing the number of regions)

 

 [1] hbase - how many regions fit a table when presplitting or keeping it running
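
 Putting these tweaks together, step 1 could look roughly like this; the column family name 'cf' and NUMREGIONS => 16 are only placeholders, and the -D values are just the examples suggested above:

hbase shell> create 'new-tableX', 'cf', {NUMREGIONS => 16, SPLITALGO => 'HexStringSplit'}

 hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Dhbase.client.scanner.caching=1000 -Dmapred.map.tasks.speculative.execution=false -Dmapred.map.max.attempts=2 -Dmapred.max.map.failures.percent=5 -Ddfs.replication=2 -Dhbase.client.write.buffer=10485760 --starttime=1401552000000 --endtime=1404057600000 --new.name=new-tableX tableX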

 

ref:

Using distcp for parallel HDFS copying (用distcp进行hdfs的并行复制)

Another way to copy HBase data across clusters (HBase跨集群复制数据的另一种方法)

CDH: introduction-to-apache-hbase-snapshots

JIRA: snapshot of table (principle docs attached)

Copying part of an HBase table for testing (复制部分HBase表用于测试; some of the tools call Java classes from the shell)

 

 

 

 
