Author: Pan Xiaodong
LBaaS is OpenStack's load-balancing service; by default it uses HAProxy as the driver that implements the actual load balancing. Out of the box, LBaaS offers no high availability, which means that the failure of a single LBaaS instance can break the load balancing of the workloads behind it. This article discusses how to make LBaaS highly available.
Principle
To make LBaaS highly available we first need to understand how it works, and the most direct way is to read the code, which lives under /usr/lib/python2.7/site-packages/neutron_lbaas. Every LBaaS agent shows up in neutron agent-list, but with several agents, how are LBaaS service instances assigned to them? Round-robin? Does every agent host every instance? Something else? The answer is in services/loadbalancer/agent_scheduler.py:
class ChanceScheduler(object):
    """Allocate a loadbalancer agent for a vip in a random way."""

    def schedule(self, plugin, context, pool, device_driver):
        """Schedule the pool to an active loadbalancer agent if there
        is no enabled agent hosting it.
        """
        with context.session.begin(subtransactions=True):
            lbaas_agent = plugin.get_lbaas_agent_hosting_pool(
                context, pool['id'])
            if lbaas_agent:
                LOG.debug('Pool %(pool_id)s has already been hosted'
                          ' by lbaas agent %(agent_id)s',
                          {'pool_id': pool['id'],
                           'agent_id': lbaas_agent['id']})
                return

            active_agents = plugin.get_lbaas_agents(context, active=True)  # get the live agents
            if not active_agents:
                LOG.warn(_LW('No active lbaas agents for pool %s'), pool['id'])
                return

            candidates = plugin.get_lbaas_agent_candidates(device_driver,
                                                           active_agents)  # keep only the live agents that support the driver
            if not candidates:
                LOG.warn(_LW('No lbaas agent supporting device driver %s'),
                         device_driver)
                return

            chosen_agent = random.choice(candidates)  # pick one agent at random
            binding = PoolLoadbalancerAgentBinding()  # bind the pool to that agent
            binding.agent = chosen_agent
            binding.pool_id = pool['id']
            context.session.add(binding)
As the code shows, when a pool is scheduled, one agent is picked at random from the live agents and bound to the pool, and once the binding is made it never changes. Looking at the database, the table poolloadbalanceragentbindings is exactly where this binding is recorded.
MariaDB [neutron]> select * from poolloadbalanceragentbindings;
+--------------------------------------+--------------------------------------+
| pool_id                              | agent_id                             |
+--------------------------------------+--------------------------------------+
| 06db8082-2c49-49d2-a0dd-f857bc3db380 | 421d7ae3-24f9-4ef4-be7e-d7a5555686e6 |
+--------------------------------------+--------------------------------------+
No other table stores this relationship. From the analysis above: to create a pool, one of the live LBaaS agents is chosen and bound to it, so LBaaS is in fact distributed and horizontally scalable, but it is not highly available. To make it highly available, the agent must be re-bound, that is, the binding in poolloadbalanceragentbindings must be changed.
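For illustration, a re-bind is nothing more than updating that one row. Below is a minimal sketch, assuming direct access to the neutron database; the connection parameters, pool id and agent id are placeholders that would in practice come from the deployment, neutron lb-pool-list and neutron agent-list.

import MySQLdb

# Placeholders for illustration: the pool to move and the id of the
# surviving lbaas agent it should be re-bound to.
POOL_ID = "06db8082-2c49-49d2-a0dd-f857bc3db380"
NEW_AGENT_ID = "<id of the surviving lbaas agent>"

conn = MySQLdb.connect(host="127.0.0.1", port=3306, db="neutron",
                       user="neutron", passwd="xxx")
cur = conn.cursor()
# Re-point the binding; this single row is all that ties a pool to an agent.
cur.execute("UPDATE poolloadbalanceragentbindings SET agent_id=%s "
            "WHERE pool_id=%s", (NEW_AGENT_ID, POOL_ID))
conn.commit()
conn.close()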
Network
Is changing the binding enough to make LBaaS highly available? Let's look at it from the network side. To build a pool, LBaaS first creates a VIP and then attaches a few backend members. Here is what that looks like on the network:
[root@con01 neutron_lbaas(keystone_admin)]$ neutron lb-pool-list
+--------------------------------------+------+----------+-------------+----------+----------------+--------+
| id                                   | name | provider | lb_method   | protocol | admin_state_up | status |
+--------------------------------------+------+----------+-------------+----------+----------------+--------+
| 06db8082-2c49-49d2-a0dd-f857bc3db380 | pl04 | haproxy  | ROUND_ROBIN | TCP      | True           | ACTIVE |
+--------------------------------------+------+----------+-------------+----------+----------------+--------+
[root@con01 neutron_lbaas(keystone_admin)]$ neutron lb-agent-hosting-pool 06db8082-2c49-49d2-a0dd-f857bc3db380
+--------------------------------------+-------+----------------+-------+
| id                                   | host  | admin_state_up | alive |
+--------------------------------------+-------+----------------+-------+
| 421d7ae3-24f9-4ef4-be7e-d7a5555686e6 | con02 | True           | :-)   |
+--------------------------------------+-------+----------------+-------+
As shown above, the pool is bound to con02. On con02:
[root@con02 ~]# ip netns list
qdhcp-d267e13d-703e-43a5-863c-e6878390562d
qdhcp-bf124fc6-92f0-4453-bdcc-4c6c39de67a4
qrouter-8bbd8b10-2284-4d9c-8915-6d2c96d9a81b
qlbaas-06db8082-2c49-49d2-a0dd-f857bc3db380

[root@con02 ~]# ip netns exec qlbaas-06db8082-2c49-49d2-a0dd-f857bc3db380 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
14: tap97031725-2b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
    link/ether fa:16:3e:d9:e2:3a brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.6/24 brd 192.168.1.255 scope global tap97031725-2b
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fed9:e23a/64 scope link
       valid_lft forever preferred_lft forever
From the output above, con02 has a network namespace named qlbaas-{id}, and inside it a tap device, tap97031725-2b, configured with 192.168.1.6, which is exactly the VIP address. This tap device is the VIP port.
[root@con02 ~]# ovs-vsctl show
167f700f-d100-4543-bd5c-2bd1912f1fa1
    Bridge "br-eth1"
        Port "br-eth1"
            Interface "br-eth1"
                type: internal
        Port "eth1"
            Interface "eth1"
        Port "phy-br-eth1"
            Interface "phy-br-eth1"
                type: patch
                options: {peer="int-br-eth1"}
    Bridge br-int
        fail_mode: secure
        Port "qr-2555d9a4-b1"
            tag: 2
            Interface "qr-2555d9a4-b1"
                type: internal
        Port "tap5dc6c7f3-8a"
            tag: 2
            Interface "tap5dc6c7f3-8a"
                type: internal
        Port br-int
            Interface br-int
                type: internal
        Port "int-br-eth1"
            Interface "int-br-eth1"
                type: patch
                options: {peer="phy-br-eth1"}
        Port "tap9f060cb7-b1"
            tag: 3
            Interface "tap9f060cb7-b1"
                type: internal
        Port "tap97031725-2b"          // the VIP port
            tag: 2
            Interface "tap97031725-2b"
                type: internal
        Port "ha-634bd8e2-ed"
            tag: 1
            Interface "ha-634bd8e2-ed"
                type: internal
Looking at the bridges, tap97031725-2b is plugged into br-int, on the same network plane as the router, so a packet that reaches tap97031725-2b can in principle be forwarded to any tenant network. To switch LBaaS agents, tap97031725-2b would have to be recreated on the target host, plugged into br-int, and configured with the VIP address. All of this could be done with keepalived, but reading the LBaaS code reveals a simpler way: when the LBaaS agent restarts, it reloads all of its pools, as follows:
vim services/loadbalancer/agent/agent_manager.py
def _reload_pool(self, pool_id):
    try:
        logical_config = self.plugin_rpc.get_logical_device(pool_id)
        driver_name = logical_config['driver']
        LOG.info("xx: driver_name %s" % driver_name)
        if driver_name not in self.device_drivers:
            LOG.error(_LE('No device driver on agent: %s.'), driver_name)
            self.plugin_rpc.update_status(
                'pool', pool_id, constants.ERROR)
            return

        self.device_drivers[driver_name].deploy_instance(logical_config)  # deploy every pool
        self.instance_mapping[pool_id] = driver_name
        self.plugin_rpc.pool_deployed(pool_id)
    except Exception:
        LOG.exception(_LE('Unable to deploy instance for pool: %s'),
                      pool_id)
        self.needs_resync = True
vim drivers/haproxy/namespace_driver.py
@n_utils.synchronized('haproxy-driver')
def deploy_instance(self, loadbalancer):
    """Deploys loadbalancer if necessary

    :return: True if loadbalancer was deployed, False otherwise
    """
    LOG.info("xx: deploy_instance, %s" % str(loadbalancer))
    if not self.deployable(loadbalancer):
        LOG.info(_LI("Loadbalancer %s is not deployable.") %
                 loadbalancer.id)
        return False

    if self.exists(loadbalancer.id):
        self.update(loadbalancer)  # if it already exists, update it
    else:
        self.create(loadbalancer)  # otherwise create it
    return True
The HAProxy driver shows that LBaaS has a sync mechanism for every pool: if the instance exists it is updated, and if it does not it is created. So all the work described above can be done with LBaaS's own machinery; keepalived is not needed for the data-plane part, which keeps the implementation simple.
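Putting the two observations together, a failover comes down to two steps: re-point the bindings in the neutron database, then restart neutron-lbaas-agent on the surviving host so that its resync path rebuilds the qlbaas namespaces, VIP ports and haproxy processes. Below is a rough manual sketch; the agent id and database parameters are placeholders, and the keepalived notify script in the next section automates essentially these steps (plus toggling the agents' admin_state_up so the scheduler prefers the master).

import os
import MySQLdb

# Placeholder: id of the lbaas agent on the surviving host (from neutron agent-list).
NEW_AGENT_ID = "<agent id on the surviving host>"

conn = MySQLdb.connect(host="127.0.0.1", port=3306, db="neutron",
                       user="neutron", passwd="xxx")
cur = conn.cursor()
# Step 1: bind every pool to the surviving agent.
cur.execute("UPDATE poolloadbalanceragentbindings SET agent_id=%s",
            (NEW_AGENT_ID,))
conn.commit()
conn.close()

# Step 2 (run on the surviving host): restart the agent; _reload_pool() /
# deploy_instance() then rebuild the qlbaas namespace, the VIP tap port on
# br-int and the haproxy process for each bound pool.
os.system("service neutron-lbaas-agent restart")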
Implementation
First, keepalived is used to monitor the state of the LBaaS service. Since our base cloud already runs keepalived, it can simply be reused. The keepalived configuration is as follows:
[root@con01 neutron_lbaas(keystone_admin)]$ cat /etc/keepalived/keepalived.conf
vrrp_script chk_haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 1
}
vrrp_instance VI_PUBLIC {
    interface br-ex
    state BACKUP
    virtual_router_id 66
    priority 103
    virtual_ipaddress {
        172.16.154.50 dev br-ex
    }
    track_script {
        chk_haproxy
    }

    notify_master "/etc/keepalived/lbaas_state_manager.py MASTER"    // the LBaaS switchover script
    notify_backup "/etc/keepalived/lbaas_state_manager.py BACKUP"
    notify_fault "/etc/keepalived/lbaas_state_manager.py FAULT"
}
vrrp_sync_group VG1 {
    group {
        VI_PUBLIC
    }
}
[root@con01 neutron_lbaas(keystone_admin)]$ cat /etc/keepalived/lbaas_state_manager.py
#!/usr/bin/env python
# coding: utf-8
import MySQLdb
import socket
import time
import os
import logging
import sys

DB_HOST = "172.16.154.50"
DB_PORT = 3306
DB_NAME = "neutron"
DB_USER = "neutron"
DB_PASS = "xxx"

UNBIND_TIMEOUT = 3
UNBIND_NUM = 3
LOG_FILE = "/var/log/neutron/lbaas-state.log"
STATE_FILE = "/etc/keepalived/lbaas_state"


fmt = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
logging.basicConfig(format=fmt, filename=LOG_FILE, level=logging.INFO)


# database access helper, omitted
class DBManger:
    def __init__(self, host, port, db, user, password, debug=False):
        ...


def get_hostname():
    return socket.gethostname()


# get the id of the lbaas agent running on this host
def get_lbid(db, hostname):
    sql = "select id from agents where host='%s' and topic='n-lbaas_agent'" % hostname
    data = db.exec_sql_ret_one(sql)
    if data:
        return data['id']
    else:
        return None


# mark this host's agent as administratively up
def change_master_state(db, hostname):
    sql = "update agents set admin_state_up=1 where host='%s' and topic='n-lbaas_agent'" % hostname
    db.exec_sql(sql)


# mark this host's agent as administratively down
def change_backup_state(db, hostname):
    sql = "update agents set admin_state_up=0 where host='%s' and topic='n-lbaas_agent'" % hostname
    db.exec_sql(sql)


# bind all pools to the master's agent
def bind_to_master(db, lbid):
    sql = "update poolloadbalanceragentbindings set agent_id='%s'" % lbid
    db.exec_sql(sql)


def is_bind(db, lbid):
    sql = "select * from poolloadbalanceragentbindings where agent_id='%s'" % lbid
    if db.exec_sql_ret_one(sql):
        return True
    else:
        return False


def write_state(state):
    with open(STATE_FILE, "w") as fd:
        fd.write(state)


def service_reload():
    cmd = "service neutron-lbaas-agent restart"
    os.system(cmd)


def main(state):
    db = DBManger(DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASS, True)
    hostname = get_hostname()
    lbid = get_lbid(db, hostname)
    print lbid
    if state == "MASTER":
        change_master_state(db, hostname)
        bind_to_master(db, lbid)
        service_reload()
        logging.info("change state to master succ")
        write_state("MASTER")
    else:
        change_backup_state(db, hostname)
        i = 0
        # wait until the new master has taken over all bindings
        while is_bind(db, lbid):
            if i < UNBIND_NUM:
                time.sleep(UNBIND_TIMEOUT)
                i += 1
            else:
                logging.error("change to backup error, unbind failed")
                break
        service_reload()
        if not is_bind(db, lbid):
            logging.info("change to backup succ")
            write_state("BACKUP")
        else:
            write_state("UNBIND ERROR")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print "usage: lbaas_state_manager.py MASTER|BACKUP"
        sys.exit(1)
    else:
        state = sys.argv[1]
        main(state)
This is the LBaaS switchover program. When keepalived detects a failure it triggers a switchover and elects a new master. On the master, the script binds every pool to the master's LBaaS agent; on the backup, it sets the agent's admin_state_up to false, so that when LBaaS schedules a pool it no longer picks an agent at random but always lands on the master. After the database changes, the LBaaS agent is restarted to keep the service running. Experiments show that the switchover completes correctly.
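To double-check a switchover by hand, a small helper along the following lines can be used. It is hypothetical and not part of the notify script; it reuses the database parameters from the script above and confirms that every pool is bound to the local agent and that its qlbaas namespace exists on this host.

import socket
import subprocess
import MySQLdb


def verify_failover(host="172.16.154.50", user="neutron", passwd="xxx"):
    """Return True if all pools are bound to the local lbaas agent and their
    qlbaas-<pool_id> namespaces exist on this host."""
    db = MySQLdb.connect(host=host, port=3306, db="neutron",
                         user=user, passwd=passwd)
    cur = db.cursor()
    # Resolve the local agent id, the same way the notify script does.
    cur.execute("SELECT id FROM agents WHERE host=%s AND topic='n-lbaas_agent'",
                (socket.gethostname(),))
    row = cur.fetchone()
    if row is None:
        print "no lbaas agent registered for this host"
        db.close()
        return False
    agent_id = row[0]
    cur.execute("SELECT pool_id, agent_id FROM poolloadbalanceragentbindings")
    bindings = cur.fetchall()
    db.close()

    namespaces = subprocess.check_output(["ip", "netns", "list"])
    ok = True
    for pool_id, bound_agent in bindings:
        if bound_agent != agent_id:
            print "pool %s is still bound to agent %s" % (pool_id, bound_agent)
            ok = False
        elif ("qlbaas-" + pool_id) not in namespaces:
            print "namespace for pool %s has not been created yet" % pool_id
            ok = False
    return ok


if __name__ == "__main__":
    print "failover ok" if verify_failover() else "failover incomplete"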
Summary
High availability for LBaaS is not supported by OpenStack itself, and no complete recipe for it could be found online. By analyzing how LBaaS works, we achieved high availability with keepalived plus a switchover script, without modifying the LBaaS code and without shared storage. Our experiments show that a failover done this way completes within seconds; further testing is planned to validate its stability.
Source: TStack official WeChat account