This week I spent some time looking at Cloudera Data Platform (CDP) in the cloud and how it stacks up against its alternatives. For this article I will be looking at Hadoop alternatives and will not be discussing other data pipelining tools.
I feel that the easiest one to use by far was Cloudera Data Platform (CDP). Set up your “stuff” in Amazon and you can start working with clusters right away. I say “stuff” as there is a bit of work that requires a professional to set up. (I’ll post a follow-up article soon on the gotchas I think others might hit.) CDP is by far the Cadillac option: it makes all setup ridiculously easy and takes a security-first approach. CDP still has a lot of DNA from Cloudbreak. (Some of the documents even still point to Cloudbreak git repos.) But where Cloudbreak was beta software (it was!), CDP has really matured into a production tool. Spinning up a new cluster is a couple of clicks away. It already incorporates the concept of Data Lake vs. cluster and really makes management a breeze. This tool is an administrator’s dream, as it takes away the “grunt work” and allows you to focus on bigger problems. And that is really good, as it means you aren’t paying your high-priced consultants to do grunt work. This saves a lot of money, and it makes this product our top choice.
Out of all the options we’ll discuss, it also comes with the biggest price tag on a per-instance basis. Where you save money is the grunt work. If you have a ton of short-lived clusters, this is the choice for you. If you have a more junior staff, this is also likely a great option, as all the clusters are set up the right way, drawing on Cloudera’s years of experience. If you have senior staff, this is the tool for you because you can keep them working on the hard stuff and, again, not doing grunt work. I hope I’m calling this out in a way that’s clear: it may be the most expensive option if you are only looking at the per-instance cost, but for that price you get a lot of expertise already baked into the cluster, which actually ends up saving you money.
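To make the per-instance vs. grunt-work trade-off concrete, here is a rough sketch of how I would pencil it out. Every number, name and rate in it is invented for illustration; none of them are real CDP or EMR prices, so plug in your own figures before drawing any conclusions.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    instance_rate: float  # hypothetical cost per instance-hour
    admin_hours: float    # hypothetical staff hours per month spent on grunt work


def monthly_cost(option, instance_hours, staff_rate):
    """Infrastructure cost plus the labor cost of the grunt work."""
    return option.instance_rate * instance_hours + option.admin_hours * staff_rate


# All figures below are made up for illustration only.
managed = Option("higher per-instance price, little grunt work",
                 instance_rate=1.0, admin_hours=10)
cheaper = Option("lower per-instance price, lots of grunt work",
                 instance_rate=0.5, admin_hours=80)

instance_hours = 2000  # hypothetical monthly usage
staff_rate = 150.0     # hypothetical hourly cost of senior staff

for option in (managed, cheaper):
    print(option.name, monthly_cost(option, instance_hours, staff_rate))
```

With these made-up inputs the option that is twice as expensive per instance still comes out well ahead once staff time is counted. Flip the assumptions (lots of instance hours, very little admin time) and the cheaper per-instance option wins, which is why it is worth running the numbers for your own shop.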
OK, but how does this stack up against Amazon EMR? Well, Amazon EMR is likely the cheapest per-instance option we’ll discuss. What Amazon EMR does really well is burst computing for jobs that you need run right now. You will need to build security into Amazon yourself. If you already have staff who are seasoned in how to do it correctly in Amazon, this isn’t an issue. You can’t simulate the CDP SDX security/lineage layer, as all the security tools for the cluster are missing from the EMR image. To compete with the CDP option described above, you’ll also have to construct a data lake and find a way to track lineage and security. This is where there is a lot more hands-on grunt work, as you have to build out the infrastructure that is just there when you use CDP. You can also use EMR to set up permanent clusters; they are just more work to maintain because they have no UI for Hadoop management. You can train developers on how to submit jobs to EMR directly, wind them up and let them go. In my experience this can be problematic. There is a certain breed of developer that can’t see the forest for the trees. They’ll copy and paste code and configuration that can end up costing you dearly as they button-mash your Amazon bill into oblivion. Yes, there are absolutely controls in Amazon to help you handle this type of situation. I would prefer the approach of providing them with what they’re allowed to use instead of having them experiment. I believe it’s a better user experience to hand them what they can use instead of killing their job.
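As a rough idea of what “hand them what they can use” could look like, here is a minimal sketch of an allow-list check you might run before a cluster request ever reaches Amazon. The template names, node limits and hour limits are all hypothetical, and the function is my own illustration, not part of EMR or any AWS SDK.

```python
# Hypothetical pre-approved cluster templates. The names and limits are
# made up for illustration and are not EMR defaults.
APPROVED_TEMPLATES = {
    "small-batch": {"instance_type": "m5.xlarge", "max_nodes": 5, "max_hours": 4},
    "large-batch": {"instance_type": "r5.2xlarge", "max_nodes": 20, "max_hours": 8},
}


def validate_request(template, nodes, hours):
    """Return a list of problems; an empty list means the request is allowed."""
    limits = APPROVED_TEMPLATES.get(template)
    if limits is None:
        choices = ", ".join(sorted(APPROVED_TEMPLATES))
        return [f"unknown template {template!r}; choose one of: {choices}"]

    problems = []
    if nodes > limits["max_nodes"]:
        problems.append(f"{nodes} nodes exceeds the limit of {limits['max_nodes']}")
    if hours > limits["max_hours"]:
        problems.append(f"{hours} hours exceeds the limit of {limits['max_hours']}")
    return problems


# A request inside the limits passes; an oversized one is rejected up front
# with a reason, instead of being killed halfway through a run.
print(validate_request("small-batch", nodes=3, hours=2))
print(validate_request("small-batch", nodes=50, hours=2))
```

The point of the design is that the developer gets a clear “no, and here is why” before anything is launched, rather than finding out when a budget control terminates their job.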
OK, well, if EMR isn’t the right call, what is? Well, there’s nothing stopping you from building your own clusters on Amazon, if you have a trusted advisor who can effectively rebuild CDP on EC2. This may be a better option for larger customers that have full-time dedicated staff for administration of Hadoop. You will spend a lot of money on “grunt work,” but you can make exactly what you need. (On a per-instance basis this will be more expensive than Amazon EMR and cheaper than CDP.) This option definitely requires seasoned staff and should only be attempted with experts you trust. Do not attempt this with a rookie team; it will be painful, no matter what they tell you. I have gone down this road before and I think it’s completely manageable; it just requires a level of expertise.