This is somewhat of a continuation of the article, Rebalancing a datanode in hadoop 2.0 when the cluster is balanced.
Due to a bug in HDFS 2.7.3 and earlier (the version shipped with HDP 2.6.5), newly added HDFS disks are not balanced when they are added to an existing datanode in the cluster. This is NOT the case when adding an entirely new datanode with new disks, only when adding disks to an existing datanode. Naturally you would assume you could run the HDFS balancer utility from Ambari Server, or the HDFS balancer command from the namenode, to correct the issue, but unfortunately this only works to a certain extent: the balancer evens out usage between datanodes, not between the individual disks inside a datanode.
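For reference, a minimal command-line invocation of the balancer, assuming you run it as the hdfs user from the namenode (or any client node), looks like this; the threshold of 5 matches the run shown below:

# Threshold is the allowed deviation, in percent, from average cluster utilization
su - hdfs -c "hdfs balancer -threshold 5"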
Here is an example of what one of our datanodes looked like after adding two new 1TB disks (/grid/15 and /grid/16), and balancing them with a threshold of 5:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdd1      1007G  865G  142G  86% /grid/01
/dev/sdn1      1007G  815G  193G  81% /grid/11
/dev/sdi1      1007G  846G  161G  85% /grid/06
/dev/sdq1      1007G  838G  170G  84% /grid/14
/dev/sdg1      1007G  858G  150G  86% /grid/04
/dev/sdp1      1007G  851G  157G  85% /grid/13
/dev/sdj1      1007G  801G  207G  80% /grid/07
/dev/sdl1      1007G  849G  159G  85% /grid/09
/dev/sdh1      1007G  823G  184G  82% /grid/05
/dev/sdm1      1007G  817G  191G  82% /grid/10
/dev/sde1      1007G  841G  167G  84% /grid/02
/dev/sdk1      1007G  858G  150G  86% /grid/08
/dev/sdf1      1007G  820G  188G  82% /grid/03
/dev/sdo1      1007G  796G  211G  80% /grid/12
/dev/sdt1      1007G   42G  915G   5% /grid/15
/dev/sdu1      1007G   43G  914G   5% /grid/16
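If you want to reproduce this view, the per-disk usage can be pulled on the datanode with something like the following (assuming all of the data disks are mounted under /grid):

df -h | grep '/grid/'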
So, what can be done about this? Well, you have two options.
Option 1 – Let HDP do the work
Decommission the datanode, allowing HDFS to move all data off the node, then recommission it, allowing HDFS to move the data back onto the node in a balanced manner. When tested, this took roughly 5 days to complete; the time will vary depending on the speed of your disks. You also need to be aware of your total HDFS capacity in this scenario: can you afford to be down terabytes of space for several days? This option takes the longest, but it is also the safest and least destructive.
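In HDP this is normally driven from the Ambari UI, but for reference, a rough sketch of the manual decommission/recommission flow follows; the excludes file path and hostname are placeholders, and your dfs.hosts.exclude setting may point somewhere else:

# 1. Add the datanode's hostname to the excludes file referenced by dfs.hosts.exclude
echo "datanode05.example.com" >> /etc/hadoop/conf/dfs.exclude

# 2. Tell the NameNode to re-read its host lists; decommissioning starts shortly after
su - hdfs -c "hdfs dfsadmin -refreshNodes"

# 3. Watch progress until the node reports as Decommissioned
su - hdfs -c "hdfs dfsadmin -report"

# 4. Remove the hostname from the excludes file, then refresh again to recommission
su - hdfs -c "hdfs dfsadmin -refreshNodes"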
Option 2 – Do the work yourself
You can perform the data distribution manually by following the steps below; to be safe, plan for some downtime, as you will be moving blocks around.
Note: work on one datanode at a time, and move on to the other datanodes only after verifying the results. It is advisable to back up your data before executing these steps, and you should test them in a development environment before running them in anything like production.
1. Shut down the HDFS datanode service on the datanode you are balancing (a sketch of the stop/start commands follows this list).
2. Move/overwrite block directories directly from the old disks (those filled above 80%) to the new disks:
cp -R /grid/01/hdfs/data/current/BP<->/current/finalized/subdir0 /grid/15/hdfs/data/current/BP<->/current/finalized/subdir0
rm -rf /grid/01/hdfs/data/current/BP<->/current/finalized/subdir0
cp -R /grid/02/hdfs/data/current/BP<->/current/finalized/subdir0 /grid/16/hdfs/data/current/BP<->/current/finalized/subdir0
rm -rf /grid/02/hdfs/data/current/BP<->/current/finalized/subdir0
3. Repeat Step 2, selecting different .../finalized/subdirX directories to copy, until disk usage is close to balanced.
4. Repeat Steps 2 and 3 on every datanode that has disks filled above the threshold.
5. Start the HDFS datanode service on the datanode you just manually balanced.
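If you prefer the command line over Ambari for Steps 1 and 5, the usual HDP way of stopping and starting a DataNode is roughly the following; verify the hadoop-daemon.sh path on your own nodes before relying on it:

# Step 1: stop the DataNode before moving any blocks
su - hdfs -c "/usr/hdp/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh stop datanode"

# Step 5: start it again once the disks on this node are balanced
su - hdfs -c "/usr/hdp/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh start datanode"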
This option takes far less time than decommissioning an entire datanode, and you can even write a script to perform the process for you (a rough sketch follows). Using a script to do the copying and deletion, it took only about 2 hours to balance one 1TB disk. I would recommend this method if you are running low on HDFS capacity and can't afford to lose an entire datanode for a few days.
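As a rough illustration of such a script (not the exact one used here), the sketch below walks a list of source-disk/destination-disk/subdir moves with the DataNode stopped. The block pool ID and the disk/subdir pairs are placeholders you would replace for your own cluster, and it should be run as the hdfs user so the copied block files keep the right ownership:

#!/bin/bash
# Sketch only: copy finalized block subdirectories from full disks to the new disks,
# then remove the originals. Run on the datanode, as the hdfs user, with the
# DataNode service stopped.
set -euo pipefail

BP="BP-XXXXXXXX-X.X.X.X-XXXXXXXXXX"              # placeholder: your block pool ID
REL="hdfs/data/current/$BP/current/finalized"

# One move per line: "source_disk destination_disk subdir" (placeholders -- adjust)
MOVES="
/grid/01 /grid/15 subdir0
/grid/02 /grid/16 subdir0
/grid/03 /grid/15 subdir1
/grid/04 /grid/16 subdir1
"

echo "$MOVES" | while read -r SRC DST SUB; do
  [ -z "$SRC" ] && continue                      # skip blank lines
  SRC_PATH="$SRC/$REL/$SUB"
  DST_PATH="$DST/$REL/$SUB"
  if [ ! -d "$SRC_PATH" ]; then
    continue                                     # nothing to move from this disk
  fi
  if [ -e "$DST_PATH" ]; then
    echo "skipping $SUB: $DST_PATH already exists" >&2
    continue
  fi
  mkdir -p "$(dirname "$DST_PATH")"              # make sure the tree exists on the new disk
  cp -R "$SRC_PATH" "$DST_PATH"                  # copy the whole subdir, keeping its name
  rm -rf "$SRC_PATH"                             # remove the original only after the copy succeeds
  echo "moved $SRC_PATH -> $DST_PATH"
done

# Check usage afterwards and add more subdir moves if the disks are still uneven
df -h | grep '/grid/'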
Happy Hadooping!
Written by Ryan St. Louis