Maybe you lost a drive, added drives, or deleted and added a lot of data. The overall balance of your cluster looks great, but you are still getting failures on drives because they are full, even though there are completely empty drives in the same machine waiting for data!
You are operating a little outside the ordinary (at least according to the original design specs).
How did you get here?
In Hadoop 2.0, DataNodes write data to their hard drives in round robin. That's great if you only write data once, and okay if you delete data sometimes, but after a large shift of data things end up unbalanced. (Note: this issue has been fixed in Hadoop 3.0, but not all of us get the luxury of upgrading when it's convenient. FYI, "upgrading to 3.0" is more of a migration than an "upgrade.")
Ok so what now? Is all lost?
No. There are at least two solutions.
(A little heavy-handed, but hey, it will do the job.) Decommissioning essentially ensures all the data is moved off the node, and recommissioning moves it all back on. This lays the data down in a nice round robin, evenly filling the drives. It is likely the safest path forward, and may even be what support recommends, since it has been rigorously tested. It will take a ton of time and bandwidth, but it is safe. Be prepared to wait a time proportional to the amount of data on the node, and you also need enough free space in the rest of the cluster to absorb that data. I should also call out that if this takes you close to 90% cluster utilization, you could suffer an outage. (YARN gets twitchy using hard drives that have less than 10% disk free, so if you haven't separated YARN drives from DataNode drives, YARN will complain and stop working.)
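For reference, here's a minimal sketch of that cycle driven by hand with a dfs.hosts.exclude file. The hostname and exclude-file path are placeholders, and if you run a management tool such as Cloudera Manager or Ambari it will do these steps for you:

# Decommission: add the host to the file referenced by dfs.hosts.exclude in hdfs-site.xml
echo "dn01.example.com" >> /etc/hadoop/conf/dfs.exclude
hdfs dfsadmin -refreshNodes    # NameNode starts re-replicating the node's blocks elsewhere
hdfs dfsadmin -report          # wait until the node shows as "Decommissioned"

# Recommission: remove the host from the exclude file, refresh again, restart its DataNode,
# then run the cluster balancer so data flows back onto the now-underused node.
hdfs dfsadmin -refreshNodes
hdfs balancer -threshold 10    # allowed deviation from average utilization, in percent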
What if you don't have enough space to decommission a node? Is there another way? Of course. What if, instead of decommissioning, you stop the DataNode, delete all of its data, start the DataNode back up, and run a balance? Won't this get the job done faster than decommission/recommission? Well, sort of, but you are intentionally reducing your replication factor and opening yourself up to data loss. Not really a strategy I can get behind; the risk is a lot higher. You might try to reduce the risk by only deleting half the data, and only on the drives that are "too full" rather than the "emptier" drives. That does reduce the risk, but I can do you one better: what if instead of deleting the data we moved it? This is where you might say, "You can't move data blocks. Only the NameNode can move data blocks."
Let's not forget to check the documentation to see if there's an easier way. If only there were a way for the DataNode to tell the NameNode where the data is, instead of the NameNode dictating where files are located. The last line of the documentation seems to bring some promise:
When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files and sends this report to the NameNode: this is the Blockreport.
Hmmm. Ok, so it seems there is a way for the DataNode to report where the data is located. Given this, it seems we can get the DataNode to dictate where the data now lives. This opens the door for a strategy: stop the DataNode, change the location of the blocks on disk (think DataNode disk balancing), then start the DataNode and have it report the new, "balanced" locations. There are more details that must be followed, but at a high level this can be done. (Extra details: moving the block, the block metadata, and maintaining the folder structure; see the excerpt and sketch below.)
If you want more detail on how to do this, here's an excerpt from the disk-balancer-proposal.pdf (located here) describing the fix that is in Hadoop 3.0. It validates our belief that it is possible to fix the data by moving it on the DataNode, even if it's not automated.
Current Apache Solution is far from Ideal

3.12. On an individual datanode, how do you balance the blocks on the disk?

Hadoop currently does not have a method by which to do this automatically. To do this manually:

1. Shutdown the DataNode involved.
2. Use the UNIX mv command to move the individual block replica and meta pairs from one directory to another on the selected host. On releases which have HDFS-6482 (Apache Hadoop 2.6.0+) you also need to ensure the subdir-named directory structure remains exactly the same when moving the blocks across the disks. For example, if the block replica and its meta pair were under /data/1/dfs/dn/current/BP-1788246909-172.23.1.202-1412278461680/current/finalized/subdir0/subdir1/, and you wanted to move it to /data/5/disk, then it MUST be moved into the same subdirectory structure underneath that, i.e. /data/5/dfs/dn/current/BP-1788246909-172.23.1.202-1412278461680/current/finalized/subdir0/subdir1/. If this is not maintained, the DN will no longer be able to locate the replicas after the move.
3. Restart the DataNode.
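To make the excerpt concrete, here's a minimal sketch of those three steps, reusing the block pool path from the example above. The block ID, drive mounts, and HADOOP_HOME are placeholders, and your distribution may stop and start the DataNode through a service manager rather than the daemon script:

# 1. Stop the DataNode (Hadoop 2.x daemon script)
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode

# 2. Recreate the identical subdir structure on the destination drive, then move the
#    block replica together with its .meta file so the pair stays side by side.
SRC=/data/1/dfs/dn/current/BP-1788246909-172.23.1.202-1412278461680/current/finalized/subdir0/subdir1
DST=/data/5/dfs/dn/current/BP-1788246909-172.23.1.202-1412278461680/current/finalized/subdir0/subdir1
mkdir -p "$DST"
mv "$SRC/blk_1073741825" "$DST/"            # example block replica (made-up ID)
mv "$SRC/blk_1073741825_1001.meta" "$DST/"  # its matching metadata file

# 3. Start the DataNode; on startup it rescans its disks and reports the new
#    block locations to the NameNode in its block report.
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode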
There is even an existing tool, found here, that is supposed to do this and may work. I haven't tried it yet, and it's not well documented. I have forked it and will try it soon and properly document how it works. For now, please take it "as is," as something that might do what you need.
Here's a great article on how to balance a DataNode in Hadoop 3.0.
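If you are on 3.0 (or a distribution that backports it), the fix from that proposal ships as the hdfs diskbalancer tool, which handles the plan/move/verify cycle for you while the DataNode stays online. A rough sketch of its use, with a placeholder hostname, assuming dfs.disk.balancer.enabled is set to true in hdfs-site.xml:

hdfs diskbalancer -report -top 5               # which DataNodes would benefit most from balancing
hdfs diskbalancer -plan dn01.example.com       # generates a move plan and prints where the .plan.json was written
hdfs diskbalancer -execute <path-to-plan.json> # runs the moves on that DataNode
hdfs diskbalancer -query dn01.example.com      # check progress of the running plan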