Some people are uncomfortable connecting to hive through the zookeeper namespace. Perhaps they don’t understand it, or perhaps they’re just old school and like a good old-fashioned host and port.

For this tutorial our hive server is located at hive.server.com:10000

Let me de-mystify what is happening when you use the zookeeper namespace.

When you connect via beeline you likely use a string that looks like this:

jdbc:hive2://<ZOOKEEPER QUORUM>/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver

Specifically, you likely use a command that looks like this:

beeline -u "jdbc:hive2://server1.com:2181,server2.com:2181,server3.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver"

So let’s break down this URL-like string to explain what it’s doing and why you should use it.

jdbc:hive2://<ZOOKEEPER QUORUM>/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver

jdbc:hive2 –> tells the driver to use the HiveServer2 JDBC protocol.
//<ZOOKEEPER QUORUM>/; –> lists the servers and ports to contact. (The next setting tells the driver that they are zookeeper nodes, not hive servers.)
serviceDiscoveryMode=zooKeeper; –> the method used to discover the address we will actually connect to.
zooKeeperNamespace=hiveserver –> the name of the place in zookeeper where the actual connection settings are stored.
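
To make the breakdown concrete, here is a minimal Python sketch that splits a discovery URL into the same pieces. It is only an illustration, not how the Hive driver itself parses the string: the function name parse_discovery_url is made up for this example, and the sketch ignores the optional ?hiveconf and #hivevar sections a real URL may carry.

```python
def parse_discovery_url(url):
    """Split a jdbc:hive2 discovery URL into hosts, database and settings.

    Illustrative only: assumes the simple form
    jdbc:hive2://<hosts>/<database>;key=value;key=value
    """
    prefix = "jdbc:hive2://"
    if not url.startswith(prefix):
        raise ValueError("not a jdbc:hive2 URL")
    rest = url[len(prefix):]
    # Everything before the first ';' is the host list plus an optional "/database".
    authority, _, params = rest.partition(";")
    hosts_part, _, database = authority.partition("/")
    hosts = hosts_part.split(",")
    # The remaining ';'-separated items are key=value session settings.
    settings = dict(item.split("=", 1) for item in params.split(";") if "=" in item)
    return hosts, database, settings


hosts, database, settings = parse_discovery_url(
    "jdbc:hive2://server1.com:2181,server2.com:2181,server3.com:2181/"
    ";serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver"
)
print(hosts)     # the zookeeper quorum, one host:port per entry
print(settings)  # serviceDiscoveryMode and zooKeeperNamespace
```

Nothing in the host list says "hive". It is the serviceDiscoveryMode setting that tells the driver to treat those hosts as zookeeper nodes.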

OK, so now we understand how the URL works. Big deal, why do we care?

Well, let’s use this info to see what zookeeper is doing. (Notice that we re-use both the ZOOKEEPER QUORUM and the zooKeeperNamespace.)

/path/to/zookeeper/bin/zkCli.sh -server server1.com:2181,server2.com:2181,server3.com:2181 ls /hiveserver

This command produces a lot of output, but look at the final lines of it:

[serverUri=hive.server.com:10000;version=1.2.1000.2.6.5.178-1;sequence=0000000197]

So you can see that actually, all beeline is doing is looking up the value from zookeeper.
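
As a rough illustration of that lookup, the Python sketch below takes the entry printed by zkCli.sh and turns it into a plain host-and-port URL. It assumes the entry has exactly the serverUri=...;version=...;sequence=... shape shown above. The helper name direct_url_from_znode is made up for this example and is not part of beeline or the Hive driver.

```python
def direct_url_from_znode(znode_name):
    """Build a direct jdbc:hive2 URL from a zookeeper entry name.

    Assumes the name looks like
    serverUri=<host>:<port>;version=<version>;sequence=<number>
    """
    name = znode_name.strip("[]")  # zkCli.sh prints the children inside [ ]
    fields = dict(part.split("=", 1) for part in name.split(";") if "=" in part)
    host, _, port = fields["serverUri"].rpartition(":")
    return f"jdbc:hive2://{host}:{int(port)}/"


entry = "[serverUri=hive.server.com:10000;version=1.2.1000.2.6.5.178-1;sequence=0000000197]"
print(direct_url_from_znode(entry))  # jdbc:hive2://hive.server.com:10000/
```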

Well, if that’s all it is doing, why not just use the host and port and skip the discovery?

Why you should use Zookeeper namespace to look up hive settings:

Zookeeper is always right
- When you update the port of hive, zookeeper is updated.
- When you update the server hive is on, zookeeper is updated.
- If there are multiple hive servers, zookeeper serviceDiscoveryMode will randomly assign you to one of them. No need for a load balancer. Why have another piece of hardware you don’t need?
- When the server is down, it’s not in zookeeper. (This is actually more handy than you think.) You have a way of knowing if the server is up. If you are using serviceDiscoveryMode, it won’t route traffic to a server that is down.
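
The last two points can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the driver’s actual code: the list stands in for the entries currently registered under the namespace, and both host names and the pick_server function are hypothetical.

```python
import random


def pick_server(registered, rng=random):
    """Pick one registered hive server at random, as discovery mode does."""
    if not registered:
        raise RuntimeError("no hive server is registered in zookeeper")
    return rng.choice(registered)


# Hypothetical servers currently registered under the namespace.
registered = ["hive1.server.com:10000", "hive2.server.com:10000"]
print(pick_server(registered))  # either of the two servers

# hive2 goes down, so its entry disappears from zookeeper.
registered.remove("hive2.server.com:10000")
print(pick_server(registered))  # only hive1 is left to choose from
```

A hardcoded host and port would keep pointing at the dead server. The lookup only ever sees servers that are currently registered.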

Basically you get free load-balancing, and you don’t have to worry about hardcoding.
