If you have an external 3rd party system that needs to connect to Hive/LLAP via [host]:[port] and you are not using Kerberos, here is a straightforward setup that uses open source tools to load balance between LLAP/Hive servers. Of course, you don’t need to do this if your 3rd party tool can read ZooKeeper namespaces. I explain ZooKeeper namespaces and discovery in this post. It’s super handy to have more than one instance of Hive so you can complete in-place maintenance without having to take critical systems offline. I will discuss what is required to set up two instances of Hive in a separate post. Here the idea is that you are already set up and want to know what to do.
Log in to your load balancer:
ssh loadbalancer.server.com
Quick setup for HAProxy:
sudo yum install haproxy
Configure HAProxy:
sudo vim /etc/haproxy/haproxy.cfg
Typical Config:
global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 debug
    maxconn 45000 # Total max connections.
    daemon
    nbproc 1 # Number of HAProxy processes (this directive was removed in HAProxy 2.5; drop it on newer versions).

defaults
    # Timeouts without a unit are in milliseconds, so 86400000 is 24 hours.
    timeout server 86400000
    timeout connect 86400000
    timeout client 86400000
    timeout queue 1000s

frontend stats
    # Stats page; for more info see https://www.haproxy.com/blog/exploring-the-haproxy-stats-page/
    bind *:8404
    mode http # The stats page is served over HTTP.
    stats enable
    stats uri /stats
    stats refresh 10s
    stats admin if LOCALHOST

frontend hive
    bind *:10500
    mode tcp
    option tcplog
    default_backend hive-backend

backend hive-backend
    balance roundrobin
    option tcplog
    # http://www.haproxy.org/download/1.4/doc/configuration.txt
    # weight can be used to move load around; check out the docs.
    # If you are using Hive Warehouse Connector, it doesn't load balance, because you have to specify LLAP daemons to it (llap0).
    # This is why you might want to weight things to offset this load. I suggest figuring out what weighting works for you.
    server server1 server1.server.com:10500 weight 1 check # gets 1/2 the load that server2 gets. --> weight 1
    server server2 server2.server.com:10500 weight 2 check # gets twice the load that server1 gets. --> weight 2
Start HAProxy and enable it to start on boot:
sudo systemctl start haproxy
sudo systemctl enable haproxy
If you make changes to the config, test it first, then reload the HAProxy config (safe to do even when you have traffic):
sudo systemctl reload haproxy
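One way to make sure a broken config never gets reloaded is to run HAProxy's check mode (`haproxy -c -f <file>`) first and reload only if it passes. This is a minimal sketch; the `check_and_reload` function name and the `HAPROXY_CFG` variable are my own, not part of HAProxy, and the path is assumed to be the one edited above.

```shell
# Path to the config edited above; override by exporting HAPROXY_CFG (hypothetical variable).
HAPROXY_CFG="${HAPROXY_CFG:-/etc/haproxy/haproxy.cfg}"

# Validate the config in check mode (-c); reload only when validation succeeds.
check_and_reload() {
    if sudo haproxy -c -f "$HAPROXY_CFG"; then
        sudo systemctl reload haproxy
    else
        echo "config check failed; not reloading" >&2
        return 1
    fi
}
```

Run `check_and_reload` after each edit; on a failed check the running HAProxy keeps its old config.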
Now you can point your 3rd party tool at:
loadbalancer.server.com:10500
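To take a Hive server out of rotation for maintenance, comment out its server line in the backend (server2 in this example):
... other config
backend hive-backend
    balance roundrobin
    option tcplog
    # ...
    server server1 server1.server.com:10500 weight 1 check
    # server server2 server2.server.com:10500 weight 2 check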
Reload the config (it is safe to do this under load):
sudo systemctl reload haproxy
You can then use the stats page to watch the connections drain, or monitor the Hive server log.
Once the queries are finished, you can restart the Hive server as needed.
To add the server back in, follow the same process: uncomment the server line and reload the config.
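If you would rather watch the drain from a terminal than from the stats page, HAProxy can also export the same stats as CSV by appending `;csv` to the stats URI. The sketch below assumes the stats frontend from the config is reachable over HTTP on port 8404; `backend_sessions` is a helper name I made up. It prints the current session count (`scur`, the 5th CSV column) for each server in a backend.

```shell
# Reads HAProxy stats CSV on stdin and prints "<server> <current sessions>"
# for every real server in the backend named by $1.
backend_sessions() {
    awk -F, -v be="$1" \
        '$1 == be && $2 != "BACKEND" && $2 != "FRONTEND" { print $2, $5 }'
}

# Assumed usage against the stats frontend defined earlier:
# curl -s "http://loadbalancer.server.com:8404/stats;csv" | backend_sessions hive-backend
```

Note that a server you have commented out disappears from the stats after the reload, so for the drain itself the Hive server log is still the place to confirm that the last query has finished.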
Alternatively, just turn off the Hive server. This will kill any running queries, but the load balancer is set up to check whether the server is up. This is handy, as you can use this detection to your advantage: remove the server from the load balancer, drain the queries from Hive, turn off Hive, and then immediately return the server to the load balancer config. HAProxy will run a check, see that the server is off, and will not add it to the load balancer rotation until the server is back up.
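The detection described above comes from the `check` keyword on each server line, which by default is a plain TCP connect test. As a rough sketch, assuming HAProxy's documented `inter`, `fall`, and `rise` server options, you could tune how quickly a stopped Hive server is noticed and re-added. The values are illustrative only, and `server1`/`server2` are placeholder server names.

```
backend hive-backend
    balance roundrobin
    # inter 5s: probe every 5 seconds (the default is 2s)
    # fall 3:   mark the server down after 3 consecutive failed probes
    # rise 2:   mark it up again after 2 consecutive successful probes
    server server1 server1.server.com:10500 weight 1 check inter 5s fall 3 rise 2
    server server2 server2.server.com:10500 weight 2 check inter 5s fall 3 rise 2
```

With these numbers a stopped server leaves the rotation after roughly 15 seconds and returns about 10 seconds after Hive is listening again.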
Now that you have set up your load balancer, you can point the 3rd party tool to:
loadbalancer.server.com:10500