Connecting to hive with zookeeper vs server:port

Some people are uncomfortable connecting to hive with the zookeeper namespace.  It’s possible they don’t understand it, or perhaps they’re just old school and like a good old-fashioned host and port.

For this tutorial our hive server will be located at: hive.server.com:10000

Let me de-mystify what is happening when you use the zookeeper namespace.

When you connect via beeline you likely use a string that looks like this:

jdbc:hive2://<ZOOKEEPER QUORUM>/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver

Specifically, you likely use a command that looks like this:

beeline -u "jdbc:hive2://server1.com:2181,server2.com:2181,server3.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver"

So let’s break down this URL-like string to explain what it’s doing and why you should use it.

jdbc:hive2://<ZOOKEEPER QUORUM>/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver

  • jdbc:hive2 –> this tells the driver to use the hive JDBC protocol.
  • //<ZOOKEEPER QUORUM>/; –> this tells the driver which servers and ports to contact.  (The next setting is what actually tells the driver they are zookeeper nodes.)
  • serviceDiscoveryMode=zooKeeper; –> the method used to discover the address we are actually going to connect to.
  • zooKeeperNamespace=hiveserver –> the name of the place in zookeeper where the actual connection settings are stored.
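
To make the breakdown above concrete, here is a rough Python sketch that splits the same string into those pieces. The function name parse_hive_jdbc_url is made up for this post, and it is not how the real driver parses things. It only handles the part of the URL shown here and ignores any extra sections a full hive URL may carry.

```python
def parse_hive_jdbc_url(url):
    """Split a hive2 JDBC URL into its host list, database, and settings.

    Illustrative only: this is a hypothetical helper, not the driver's parser.
    """
    prefix = "jdbc:hive2://"
    if not url.startswith(prefix):
        raise ValueError("not a hive2 JDBC URL: " + url)
    rest = url[len(prefix):]
    # Everything before the first ';' is the server list plus an optional '/database'.
    authority, _, settings = rest.partition(";")
    hosts_part, _, database = authority.partition("/")
    hosts = [h for h in hosts_part.split(",") if h]
    # The remaining 'key=value' pairs are separated by ';'.
    params = dict(pair.split("=", 1) for pair in settings.split(";") if "=" in pair)
    return {"hosts": hosts, "database": database, "params": params}


parsed = parse_hive_jdbc_url(
    "jdbc:hive2://server1.com:2181,server2.com:2181,server3.com:2181/"
    ";serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver"
)
print(parsed["hosts"])   # the zookeeper quorum
print(parsed["params"])  # the discovery mode and the namespace
```

The host list comes out as the three zookeeper nodes, and the two settings land in a dictionary, which is all the information needed for the lookup.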

Ok, so now we understand how the URL works.  Big deal, why do we care?

Well, let’s use this info to see what zookeeper is doing (notice that we will re-use both the ZOOKEEPER QUORUM and the zooKeeperNamespace):

/path/to/zookeeper/bin/zkCli.sh -server server1.com:2181,server2.com:2181,server3.com:2181 ls /hiveserver

This command produces a lot of output, but look at the final lines of it:

[serverUri=hive.server.com:10000;version=1.2.1000.2.6.5.178-1;sequence=0000000197]

So you can see that actually, all beeline is doing is looking up the value from zookeeper.
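
If you really do want the plain host and port, the entry above already contains them. Here is a minimal Python sketch that pulls them out and builds the direct connection string. The helper name parse_hs2_znode is made up for illustration, and it assumes the entry always has the serverUri=...;version=...;sequence=... shape shown above.

```python
def parse_hs2_znode(znode_name):
    """Return (host, port) from an entry like the one zkCli printed above.

    Hypothetical helper: assumes 'key=value' fields separated by ';'.
    """
    fields = dict(part.split("=", 1) for part in znode_name.split(";") if "=" in part)
    # serverUri holds host:port; split on the last ':' to separate the two.
    host, _, port = fields["serverUri"].rpartition(":")
    return host, int(port)


host, port = parse_hs2_znode(
    "serverUri=hive.server.com:10000;version=1.2.1000.2.6.5.178-1;sequence=0000000197"
)
direct_url = "jdbc:hive2://{}:{}/".format(host, port)
print(direct_url)  # jdbc:hive2://hive.server.com:10000/
```

That printed string is the old school host and port connection, just looked up by hand instead of by the driver.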

Well, if that’s all it is doing, why not just use the host and port and skip the discovery?

Why you should use Zookeeper namespace to look up hive settings:

  • Zookeeper is always right

    • When you update the port of hive, zookeeper is updated.
    • When you update the server hive is on, zookeeper is updated.
    • If there are multiple hive servers, zookeeper serviceDiscoveryMode will randomly assign you to one of the multiple servers.  No need for a load balancer.  Why have another piece of hardware you don’t need?
    • When the server is down, it’s not in zookeeper.  (This is actually more handy than you think.)  You have a way of knowing if the server is up.  If you are using serviceDiscoveryMode, it won’t route traffic to a server that is down.
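
The last two points can be sketched in a few lines of Python. This is only a toy model of the behavior described above, not the driver’s actual code. The function name pick_server and the second server hive2.server.com are made up for the example.

```python
import random


def pick_server(znode_names, rng=random):
    """Pick one registered hive server at random from a list of entries.

    Toy model: a server that is down has no entry, so it can never be picked.
    """
    uris = []
    for name in znode_names:
        for part in name.split(";"):
            if part.startswith("serverUri="):
                uris.append(part[len("serverUri="):])
    if not uris:
        raise RuntimeError("no hive server is registered in zookeeper")
    return rng.choice(uris)


# Hypothetical listing with two live servers (the second one is invented).
registered = [
    "serverUri=hive.server.com:10000;version=1.2.1000.2.6.5.178-1;sequence=0000000197",
    "serverUri=hive2.server.com:10000;version=1.2.1000.2.6.5.178-1;sequence=0000000198",
]
print(pick_server(registered))  # one of the two servers, chosen at random
```

If one of the two servers goes down, its entry disappears from the list and every pick lands on the remaining server. If the list is empty, you know right away that nothing is up.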

Basically, you get free load-balancing, and you don’t have to worry about hardcoding a host and port.