Rebalancing a Datanode in Hadoop 2.0 when the cluster is UNBALANCED

This is something of a continuation of the earlier article, Rebalancing a Datanode in Hadoop 2.0 When the Cluster Is Balanced.

Due to a bug in HDFS 2.7.3 and earlier (the version shipped with HDP 2.6.5), newly added HDFS disks will not be balanced when they are added to an existing datanode in the cluster. This is not the case when adding an entirely new datanode with new disks; it only happens when adding disks to an existing datanode. Naturally, you would assume you could run the HDFS Balancer utility from Ambari Server, or the hdfs balancer command from the namenode, to correct the issue, but unfortunately this only works to a certain extent: the balancer evens data out across datanodes, not across the disks within a single datanode.
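For reference, a cluster-level balancer run with the same threshold used later in this article looks something like the sketch below; running it via su as the hdfs service user is an assumption, so adjust it for your environment.

# Run from the namenode (or any node with the HDFS client) as the hdfs user.
# -threshold 5 means each datanode should end up within 5% of the average
# cluster utilization -- it says nothing about the disks inside a datanode.
su - hdfs -c "hdfs balancer -threshold 5"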

Here is an example of what one of our datanodes looked like after adding two new 1TB disks (/grid/15 and /grid/16) and running the balancer with a threshold of 5:

/dev/sdd1      1007G  865G  142G  86% /grid/01
/dev/sdn1      1007G  815G  193G  81% /grid/11
/dev/sdi1      1007G  846G  161G  85% /grid/06
/dev/sdq1      1007G  838G  170G  84% /grid/14
/dev/sdg1      1007G  858G  150G  86% /grid/04
/dev/sdp1      1007G  851G  157G  85% /grid/13
/dev/sdj1      1007G  801G  207G  80% /grid/07
/dev/sdl1      1007G  849G  159G  85% /grid/09
/dev/sdh1      1007G  823G  184G  82% /grid/05
/dev/sdm1      1007G  817G  191G  82% /grid/10
/dev/sde1      1007G  841G  167G  84% /grid/02
/dev/sdk1      1007G  858G  150G  86% /grid/08
/dev/sdf1      1007G  820G  188G  82% /grid/03
/dev/sdo1      1007G  796G  211G  80% /grid/12
/dev/sdt1      1007G   42G  915G   5% /grid/15
/dev/sdu1      1007G   43G  914G   5% /grid/16
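(The listing above is simply df -h output trimmed to the data disks; assuming they are all mounted under /grid, something like the following reproduces it.)

# Show only the HDFS data disks (assumes they are all mounted under /grid).
df -h | grep '/grid/'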

So, what can be done about this? Well, you have two options.

Option 1 – Let HDP do the work

Decommission the datanode, allowing HDFS to move all data off of the node, then recommission it, allowing HDFS to move the data back onto the node in a balanced manner. When tested, this took roughly 5 days to complete; the time will vary depending on the speed of your disks. You also need to be aware of your total HDFS capacity in this scenario: can you afford to be short terabytes of space for a few days? This option takes the longest, but it is also the safest and least destructive.
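In an Ambari-managed cluster the easiest way to do this is the Decommission and Recommission actions on the host's DataNode component in the Ambari UI. If you prefer the command line, the mechanism underneath is the exclude file referenced by dfs.hosts.exclude in hdfs-site.xml; a rough sketch follows (the exclude file path and hostname are examples for illustration, not values from this cluster):

# 1. Add the datanode's hostname to the exclude file named by dfs.hosts.exclude
#    (the path and hostname here are examples -- check your hdfs-site.xml).
echo "datanode05.example.com" >> /etc/hadoop/conf/dfs.exclude

# 2. Have the namenode re-read its host lists; the node will show as
#    "Decommission In Progress" until all of its blocks are re-replicated.
su - hdfs -c "hdfs dfsadmin -refreshNodes"

# 3. Watch progress until the node reports "Decommissioned".
su - hdfs -c "hdfs dfsadmin -report"

# 4. To recommission, remove the hostname from the exclude file, run
#    -refreshNodes again, and restart the DataNode service on that host.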

Option 2 – Do the work yourself

You can perform the data distribution manually by following these steps. Plan for some downtime to be safe, as you will be moving blocks around by hand.

Note: work on one datanode at a time, and move on to the next only after verifying the results. It is advisable to back up your data before executing these steps, and to test them in a development environment before attempting them in production.
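Before moving anything, it is also worth recording a baseline of block health and per-node usage so you have something to compare against afterwards; a minimal check, run as the hdfs user, might look like:

# Overall filesystem health -- note the missing, corrupt and under-replicated
# block counts before making any changes.
su - hdfs -c "hdfs fsck /" | tail -n 25

# Per-datanode capacity and usage, for comparison after the rebalance.
su - hdfs -c "hdfs dfsadmin -report"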

1. Shut down the HDFS datanode service on the datanode you are balancing (sample stop/start commands are sketched after these steps).

2. Copy block directories directly from the old disks (the disks that have filled above 80%) to the new disks, then remove them from the old disks. Here, BP<-> stands for the block pool directory under each data directory. For example:

cp -R /grid/01/hdfs/data/current/BP<->/current/finalized/subdir0 /grid/15/hdfs/data/current/BP<->/current/finalized/subdir0

rm -rf /grid/01/hdfs/data/current/BP<->/current/finalized/subdir0 

cp -R /grid/02/hdfs/data/current/BP<->/current/finalized/subdir0 /grid/16/hdfs/data/current/BP<->/current/finalized/subdir0 

rm -rf /grid/02/hdfs/data/current/BP<->/current/finalized/subdir0

3. Repeat Step 2, selecting different .../finalized/subdirX directories to copy, until the space used on the disks is close to balanced.

4. Repeat Steps 2 and 3 for all datanodes that have disks filled above the threshold.

5. Start the HDFS datanode service on the datanode you just manually balanced.
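Steps 1 and 5 can be done from the Ambari UI. If you would rather use the shell, something like the following works on an HDP-style layout (the hadoop-daemon.sh path is an assumption, so adjust it for your installation), and the checks at the end confirm the node re-registered cleanly:

# Step 1: stop the datanode service on the node being balanced
# (script path assumes an HDP layout -- adjust for your install).
su - hdfs -c "/usr/hdp/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh stop datanode"

# ... move block directories as described in steps 2-4 ...

# Step 5: start the datanode service again.
su - hdfs -c "/usr/hdp/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh start datanode"

# Confirm the node is back and no blocks went missing during the move.
su - hdfs -c "hdfs dfsadmin -report"
su - hdfs -c "hdfs fsck /" | tail -n 25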

This option takes far less time than decommissioning an entire datanode, and you can even write a script to perform the process for you (a rough sketch follows below). Using a script to do the copying and deletion, it took only about 2 hours in total to balance one 1TB disk. I would recommend this method if you are running low on HDFS capacity and can't afford to lose an entire datanode for a few days.
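As a starting point, here is a rough sketch of what such a script could look like. The disk names and the subdirectory to move are placeholders, it assumes a single block pool and that the datanode service on the host is already stopped, and it should be run as root or the hdfs user so block ownership is preserved. Verify every path before running anything like it.

#!/bin/bash
# Rough sketch: move one finalized subdir from a full disk to a new disk.
# SRC_DISK, DST_DISK and SUBDIR are placeholders for your environment.
set -euo pipefail

SRC_DISK=/grid/01
DST_DISK=/grid/15
SUBDIR=subdir0

# Resolve the block pool directory name (assumes a single block pool).
BP=$(basename "$(ls -d "${SRC_DISK}"/hdfs/data/current/BP-*)")

SRC="${SRC_DISK}/hdfs/data/current/${BP}/current/finalized/${SUBDIR}"
DST="${DST_DISK}/hdfs/data/current/${BP}/current/finalized/${SUBDIR}"

# Refuse to overwrite anything that is already on the new disk.
if [ -e "${DST}" ]; then
    echo "Destination ${DST} already exists; pick another subdir." >&2
    exit 1
fi

# Normally a no-op: the datanode creates this structure when the disk is added.
mkdir -p "$(dirname "${DST}")"

# Copy with ownership and permissions preserved, then remove the source
# only if the copy succeeded (set -e aborts the script otherwise).
cp -a "${SRC}" "${DST}"
rm -rf "${SRC}"

# Show the result so you can decide whether another subdir needs to move.
df -h "${SRC_DISK}" "${DST_DISK}"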

Happy Hadooping!

Written by Ryan St. Louis