The New and Improved Cloudera Data Platform on Public Cloud

This week, Cloudera put on a YouTube livestream to show off what was new with their CDP Public Cloud platform. During this livestream we were introduced to new features for the Datahub and Dataflow platforms that have long been requested by the Cloudera community. Some of the big ticket items they announced that should catch your eye are the following;

New NiFi components and queue emptying functionality.
Machine Learning behind Operational Databases like Hbase.
- auto-scale
- auto-tune
- auto-heal
Evolutionary Schemas for Relational Databases i.e. Phoenix and Hbase.

The New NiFi 1.13.x

The newest version of NiFi was the first up on stage. If you haven’t seen this already, here are the release notes for NiFi 1.13.0. The points that I think are the most valuable and want to focus on, are listed below;

Status history is now held for system level metrics and a new endpoint for Prometheus/Grafana is available.

Metrics tell a factual story about your cluster, as they are able to show you exactly how your resources are being used. This can provide you insight on your path when scaling cluster resources up or down in the future. With this new bunch of statistics being tracked, you will now have the ability to view these metrics in tools like Grafana. This will allow you to track your NiFi cluster’s health and efficiency over time with ease to ensure your critical dataflows are running smoothly within NiFi. This is a fantastic change, as we here at Andruff Solutions have used Grafana metric data from other datahub components for analysis when assisting clients in both upgrading and downsizing components of their Hadoop clusters.

Now supports a ListenFTP processor to allow NiFi itself to act as an FTP server.

The presenters did not touch on many specifics about the new NiFi processors being added, but this one is definitely worth mentioning. This is a great quality of life change from Cloudera, as this simply consolidates commonly used technologies together. Having this feature in the past would have been very useful for applications we have built prior, where our clients needed to collect and send files through NiFi. Now files can be directly sent to NiFi and exist locally, without the hassle of needing to setup a separate FTP server.

Now able to automatically alter Hive table structures based on evolution in the schema of incoming data.

To me, it sounds like Hive tables are getting some of the love that the relational databases have gotten, witch schemas being dynamically changed depending on the data that flows in through NiFi. While this topic was not heavily touched on during the livestream by the presenters, a similar question was asked about it during the Q&A. Hive itself will remain as the Data Warehouse we know it as, but the tables within it will be able to be altered directly by NiFi when the data requires the schema to be changed.

You can now empty all queues in a given process group with a single request.

Saving the best for last, this change takes the cake. While this is not something recommended for production flows, it very much aids in testing for development environments. From personal past experience with clients, there have sometimes been hiccups when developing new flows in NiFi, which is usually related to queues hitting capacity. The scenario would go something like this..

A new flow has been developed within a new or existing process group.
The new flow is tested and is either not flowing as intended or the flow is larger than expected.
The queue components begin to fill up, and brings NiFi to a crawling pace.
Every maxed out queue needs to be bled off before the flow becomes concerningly large or all components must be shutoff when NiFi itself is restarted.

Usually, the solution has been the latter by killing all processors and having to turn them on manually again when NiFi restarts. Now we have a proper kill switch of sorts for fiascos like this so, cheers to less headaches!

Machine Learning Behind Hbase

Hbase has seen quite the evolution in this release from just being the operational database it once was, by including features that aim to make it an autonomous database. The presenters go on to talk about how the new automated tasks of deploying, scaling, tuning and healing that assist both admins and developers here.

The first part being about the automated spinning up and scaling of a database. Using the elastic cloud functionality seen in big data environments today, developers can now bring up their own database without the need of an administrator to allocate resources and set them up. Cloudera has advertised that with just 3 clicks and 20 minutes, a new standalone database can become accessible. Behind the scenes, we now have machine learning that tracks the workload of this database to scale it as needed. Usually this task would be for an admin or DBA that would work alongside the team to help get this part of a project’s foundation started, but it is now automated for ease of use with developers in mind. Details on how to connect and start using your new database are presented to you as well, instead of being buried within documentation.

Going beyond the task of scaling your database, this is where auto-tuning tasks aim to help pick up the slack. The example given in the presentation is fairly realistic for smaller clusters, where they express that if all database traffic is funneled to just one node, no amount of auto-scaling will be able to rid you of performance hurdles. The auto-tuning comes into play here to learn about the workload and access patterns being done, then tuning performance for those tables where needed. This should slightly help with common developer growing pains when the performance of their queries and jobs begins to slow as a result of the application or database itself increasing in size.

The last aspect that was touched on only briefly was the new auto-healing capability. We can only assume that Cloudera has implemented some simple dynamic troubleshooting scripts of sort to assist in fixing minor corruption or bad data entries, alongside it’s replication tactics. Just another step in the right direction to try and make Hbase more bulletproof from the sound of it.

Evolutionary Schemas for Relational Databases

The star of the show in this segment was Apache Phoenix by implementing support relational and evolutionary schemas. As your application grows, it can sometimes be the case that the existing technology your application lives, no longer supports the direction that you’re application is going. With databases this can often be the case where your schema type needs to change and adapt to the data that is being ingested, but redefining tables for every query can get cumbersome. Within CDP you will be able to port over any existing MarianDB or MySQL database straight into Phoenix with an underlying evolutionary schema straight away. No table redefining or heavy lifting of the database/tables needed, just direct plug and play essentially. This is great as it simplifies the process of migrating tables from one schema type to the other and it even allows for your existing code to work without needing to retrofit it for the schemas changes. This can work backwards as well, by creating views from your evolutionary schema to match what the original table looked like, but just as before, the schema for the view can also be changed without a new definition needed. This allows you to build, grow and filter the data in your tables with ease.

Conclusion

Coming from a background on the HDP 2.6.x side, it’s clear to see Cloudera has listened to it’s community made some momentous changes to reflect our needs on their platform. They are coming through with ways to make their products fit their customer’s needs, rather than expecting their client’s tools to adapt to these new innovations. It’s an exciting time to be on the Big Data playing field and Cloudera is making it clear that they are here to put on a show.

Written by Ryan St. Louis

« Removing old Host Data from Ambari Server and Tuning the Database

The New and Improved Cloudera Data Platform on Public Cloud

The New NiFi 1.13.x

Machine Learning Behind Hbase

Conclusion

Recent Articles

Removing old Host Data from Ambari Server and Tuning the Database

Kafka Clients and SSL – Qlik Replicate

Fixing Orphaned Configs in the Ambari DB

We would love to hear from you