Setting up HAProxy to work with LLAP (without Kerberos integration)

If you have an external third-party system that needs to connect to Hive/LLAP via [host]:[port] and you are not using Kerberos, here is a straightforward setup that uses open source tools to load balance between LLAP/Hive servers. Of course, you don't need to do this if your third-party tool can read ZooKeeper namespaces. I explain ZooKeeper namespaces and discovery in another post. It's super handy to have more than one instance of Hive so you can complete in-place maintenance without having to take critical systems offline. I will discuss what is required to set up two instances of Hive in a separate post. Here the idea is that you are already set up and want to know what to do.

ssh loadbalancer.server.com

Quick setup for HAProxy:

sudo yum install haproxy

Configure HAProxy:

sudo vim /etc/haproxy/haproxy.cfg

Typical Config:

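global
        log 127.0.0.1   local0
        log 127.0.0.1   local1 debug
        maxconn   45000 # Total max connections.
        daemon
        nbproc      1 # Number of processes (not cores). Removed in newer HAProxy releases; drop this line if the config check rejects it.
defaults
        log     global # Use the log targets from the global section; option tcplog needs one.
        timeout server 86400000 # 24 hours, in milliseconds
        timeout connect 86400000
        timeout client 86400000
        timeout queue   1000s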

# Stats page. For more info see https://www.haproxy.com/blog/exploring-the-haproxy-stats-page/
frontend stats
    bind *:8404
    mode http # the stats page is served over HTTP
    stats enable
    stats uri /stats
    stats refresh 10s
    stats admin if LOCALHOST

frontend hive
    bind *:10500
    mode tcp
    option tcplog
    default_backend hive-backend

backend hive-backend
    balance roundrobin
    option tcplog
    
# http://www.haproxy.org/download/1.4/doc/configuration.txt
# "weight" can be used to shift load between servers; check out the docs above.
# Hive Warehouse Connector is not load balanced, because you have to point it
# at specific LLAP daemons (llap0), so one server can end up carrying extra load.
# If you use Hive Warehouse Connector, work out a weighting that offsets that load.

    # Syntax: server <name> <address>:<port> [options]
    server server1 server1.server.com:10500 weight 1 check # gets half the load that server2 gets
    server server2 server2.server.com:10500 weight 2 check # gets twice the load that server1 gets
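
To see what those weights mean in practice: with roundrobin, each server's share of new connections is roughly its weight divided by the sum of all weights. Here is a minimal sketch of that arithmetic. The helper name weight_shares is made up for illustration and is not part of HAProxy.

```shell
# weight_shares: print each weight's approximate share of new connections.
# Hypothetical helper for illustration only; HAProxy does this internally.
weight_shares() {
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    for w in "$@"; do
        # awk handles the floating point division that POSIX sh lacks
        awk -v w="$w" -v t="$total" 'BEGIN { printf "weight %d -> %.1f%%\n", w, 100 * w / t }'
    done
}

weight_shares 1 2
```

With weights 1 and 2, the first server takes about a third of new connections and the second about two thirds.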

Start HAProxy and enable it to start on boot:

sudo systemctl start haproxy
sudo systemctl enable haproxy

If you make changes to the config, reload HAProxy to apply them (safe to do even when you have traffic):

sudo systemctl reload haproxy
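
Before reloading, it can be worth checking that the edited file still parses. HAProxy's -c flag checks a configuration file and exits non-zero on errors. Here is a small sketch that only reloads when the check passes. The function name reload_if_valid is my own, and the default path is the config file edited above.

```shell
# reload_if_valid: reload HAProxy only if the config file passes its own check.
# Hypothetical helper; pass a different path as the first argument if needed.
reload_if_valid() {
    cfg="${1:-/etc/haproxy/haproxy.cfg}"
    # -c: check the config and exit without starting; -f: config file to read
    if sudo haproxy -c -f "$cfg"; then
        sudo systemctl reload haproxy
    else
        echo "haproxy config check failed; not reloading" >&2
        return 1
    fi
}
```

After editing the config, run reload_if_valid with no arguments.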

Now you can point your third-party tool at:

loadbalancer.server.com:10500
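
Many third-party tools take a HiveServer2 JDBC URL rather than a bare host and port. Here is a sketch of what that URL would look like through the load balancer. The database name default and the helper name hive_jdbc_url are assumptions, and your tool may need extra URL parameters.

```shell
# hive_jdbc_url: build a HiveServer2 JDBC URL from host, port and database.
# Hypothetical helper; the database defaults to "default" when omitted.
hive_jdbc_url() {
    host="$1"; port="$2"; db="${3:-default}"
    echo "jdbc:hive2://${host}:${port}/${db}"
}

hive_jdbc_url loadbalancer.server.com 10500
```

If Beeline is installed, you could test the connection by passing that URL to beeline -u, adding -n and -p for your username and password.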

Gracefully Remove a server from the load balancer

... other config
  
backend hive-backend
   balance roundrobin
   option tcplog

   # ... 

   server server1 server1.server.com:10500 weight 1 check
   # server server2 server2.server.com:10500 weight 2 check

Reload the config (it is safe to do this under load):

sudo systemctl reload haproxy

You can then use the stats page to watch the connections drain, or monitor the Hive server log.
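
If you would rather watch from a terminal, the stats page can also be fetched as CSV by appending ;csv to the stats URI. Here is a rough sketch that prints the current session count and status for each server in a backend. The function name backend_sessions is made up, and the URL in the comment assumes the stats frontend on port 8404 from the config above. One caveat: after a reload the stats are served by the new HAProxy process, so sessions still held by the old process may not appear there, and the Hive server log remains the check to trust.

```shell
# Reads HAProxy stats CSV on stdin and prints "server current_sessions status"
# for every row of the given backend.
# CSV columns used: 1 = pxname, 2 = svname, 5 = scur, 18 = status.
backend_sessions() {
    awk -F, -v be="$1" '$1 == be { print $2, $5, $18 }'
}

# Example against the stats page (not run here):
# curl -s "http://loadbalancer.server.com:8404/stats;csv" | backend_sessions hive-backend
```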

Once the queries are finished, you can restart the Hive server as needed.

To add the server back in, follow the same process: uncomment the server and reload the config.

Not So Gracefully Remove a server from the load balancer

Just turn off the Hive server. This will kill any running queries, but the load balancer is set up to check whether the server is up. This is handy, as you can use this detection to your advantage: you can remove the server from the load balancer, drain the queries from Hive, turn off Hive, and then immediately return the server to the load balancer config. HAProxy will run its check, see that the server is down, and keep it out of the rotation until the server is back up.
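
The check keyword on each server line is what makes this work. With no other check options in TCP mode, HAProxy periodically tries to open a TCP connection to the server and marks it down when that fails. Here is a rough shell equivalent of a single probe, useful for confirming by hand whether a Hive server is accepting connections. The function name tcp_up is made up, and the sketch relies on bash's /dev/tcp feature and the coreutils timeout command.

```shell
# tcp_up: succeed if a TCP connection to host:port can be opened within 2 seconds.
# Hypothetical helper; this mimics one probe, not HAProxy's full check logic.
tcp_up() {
    host="$1"; port="$2"
    # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
    # timeout stops the attempt after 2 seconds if nothing answers.
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (not run here):
# tcp_up server1.server.com 10500 && echo up || echo down
```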

Finally

Now that you have set up your load balancer, you can point the third-party tool to:

loadbalancer.server.com:10500
