How to Troubleshoot OpenStack Without Losing Sleep
TRUSTED CLOUD SOLUTIONS
OpenStack Summit Austin
WE HAVE OUR RIGHT TO SLEEP
Sadique Puthen & Dustin Black, Cloud Success Architects, 26th April 2016
How to Troubleshoot OpenStack Without Losing Sleep
[email protected] / @sadiquepp
[email protected] / @dustinlblack
Manifestation of a Problem
“Our compute service on the compute node is stuck in a state of activating.”
“Most OpenStack Overcloud neutron services inactive and disabled”
No valid host was found. Exceeded max scheduling attempts 3 for instance
PortLimitExceeded: Maximum number of ports exceeded
“User unable to launch new instances”
Instance failed to spawn
Over-Working RabbitMQ
Over-Working RabbitMQ
Problem Description: Our compute service on the compute node is stuck in a state of activating.
The initial evidence is a set of non-descriptive timeouts:
# journalctl --all --this-boot --no-pager | grep nova
May 27 16:20:50 host.example.com systemd[1]: openstack-nova-compute.service operation timed out. Terminating.
May 27 16:20:50 host.example.com systemd[1]: Unit openstack-nova-compute.service entered failed state.
May 27 16:20:50 host.example.com systemd[1]: openstack-nova-compute.service holdoff time over, scheduling restart.
Rebooting the compute node doesn’t help.
Over-Working RabbitMQ
Problem Description: Our compute service on the compute node is stuck in a state of activating.
An strace of the nova-compute service reveals our trouble communicating with rabbit:
# grep :5672 compute.strace
12938 03:29:28.320069 write(3, "2015-05-28 03:29:28.319 12938 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.100.47:5672 is unreachable: Socket closed. Trying again in 1 seconds.\n", 169) = 169 <0.000019>
12938 03:29:29.321779 write(3, "2015-05-28 03:29:29.321 12938 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnecting to AMQP server on 192.168.100.48:5672\n", 126) = 126 <0.000061>
12938 03:29:30.333894 write(3, "2015-05-28 03:29:30.333 12938 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 192.168.100.48:5672\n", 123) = 123 <0.000013>
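When the service logs themselves are quiet, a capture like the one above can be taken by attaching strace to the running worker. A minimal sketch follows; the PID lookup and file names are assumptions, and sample lines stand in for a live capture so the filter step is runnable:

```shell
# Assumed invocation on a live node (commented out here, since it needs
# a running nova-compute and root privileges):
#   strace -f -tt -T -p "$(pgrep -of nova-compute)" -o compute.strace
# Sample capture lines stand in for real output.
cat > compute.strace <<'EOF'
12938 03:29:28.320069 write(3, "... AMQP server on 192.168.100.47:5672 is unreachable ...", 169) = 169 <0.000019>
12938 03:29:30.333894 write(3, "... Connected to AMQP server on 192.168.100.48:5672 ...", 123) = 123 <0.000013>
EOF
# Pull out only the AMQP (port 5672) traffic, as on the slide.
grep ':5672' compute.strace
```

The `-tt -T` flags add timestamps and per-syscall durations, which is what makes the reconnect loop visible in the capture.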
Over-Working RabbitMQ
The strace leads to more logs... The logs lead to an existing bug report... The bug report leads to an upstream discussion...
Yadda Yadda Yadda
The rabbitmq-server process is out of file descriptors!
Problem Description: Our compute service on the compute node is stuck in a state of activating
https://github.com/puppetlabs/puppetlabs-rabbitmq/pull/215#discussion_r24977957
Now you Know!
Too few RabbitMQ file descriptors is a recipe for sleepless nights.
Set the rabbitmq-server NOFILE limit to 65436*
*Be careful if you’re using pacemaker -- limits are set by the resource agent.
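One way to apply that limit, sketched under the assumption of a systemd-managed rabbitmq-server. The real drop-in path is /etc/systemd/system/rabbitmq-server.service.d/; a relative scratch directory is used here so the sketch runs without root:

```shell
# Create a systemd drop-in raising the file-descriptor limit.
# On a real node this directory would be
# /etc/systemd/system/rabbitmq-server.service.d/, followed by
# `systemctl daemon-reload` and a restart of rabbitmq-server.
mkdir -p rabbitmq-server.service.d
cat > rabbitmq-server.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65436
EOF
cat rabbitmq-server.service.d/limits.conf
```

As the slide warns, if pacemaker manages rabbitmq-server, the resource agent sets the limits instead and this drop-in will not take effect.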
Knowledge-Centered Support
● Continuous improvement of the knowledgebase simplifies troubleshooting of future issues
● Knowledge automatically captured as a by-product of the problem solving process
● Search and reuse as core disciplines of the support team
● Fast track to publication means easier self-resolution
https://access.redhat.com/solutions/1465753
Issue #2: Random failure while spawning a large number of instances
$ nova list
ERROR (ConnectionRefused): Unable to establish connection to http://192.168.1.1:35357/v2.0/tokens
● Connections to the various OpenStack service APIs (nova-api, cinder-api, neutron-api, etc.) time out randomly.
● Not reproducible in most environments. When it happens, the failure is random, without any pattern: sometimes 1 in 100 requests, sometimes 1 in 500.
● Meanwhile, keystone is up and running perfectly fine.
[Diagram: nova-api, cinder-api, and neutron-api all receiving "connection refused!!" from Keystone]
Issue #2: The symptom is the same as issue #1.
Result: Random failures when spawning instances, creating volumes, networks, etc.
The first suspect is Keystone, but it is innocent. Where can one go wrong?
Looking at the error message, it's natural to point fingers at keystone.
● Looked at the keystone API logs. No clue!
● Could see an abnormal number of keystone connections in CLOSE_WAIT status. Focused on that direction and wasted a lot of time investigating it.
● Time to understand how connections flow from the end user to the APIs and keystone, by focusing on how the dots are connected.
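The CLOSE_WAIT pile-up mentioned above is quick to quantify. On a live controller this would be something like `ss -tan state close-wait | wc -l`; here, assumed sample socket data stands in so the filter is runnable as-is:

```shell
# Sample netstat-style socket list (assumed data); column 6 is the TCP state.
cat > sockets.txt <<'EOF'
tcp 0 0 192.168.1.10:53124 192.168.1.1:35357 CLOSE_WAIT
tcp 0 0 192.168.1.10:53125 192.168.1.1:35357 CLOSE_WAIT
tcp 0 0 192.168.1.10:53200 192.168.1.1:5672 ESTABLISHED
EOF
# Count sockets stuck in CLOSE_WAIT (here: two toward keystone's port 35357).
awk '$6 == "CLOSE_WAIT"' sockets.txt | wc -l
```

A large, growing count here is a symptom, not a cause: something downstream is holding or dropping connections, which is exactly the trap the speakers fell into before tracing the full path.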
How does it work under the hood?
[Architecture diagram: a VIP in front of three controllers (controller-1, controller-2, controller-3); each controller runs haproxy, nova-api, keystone, and mariadb-galera. The "connection refused!!" error surfaces on the path through haproxy.]
Possibilities?
Keystone is already ruled out.
● Intermittent network packet drop? No, ruled out by network troubleshooting.
● Haproxy (load balancer) drops the connection? Likely, but on which hop?
  ○ end user -> nova: Highly unlikely, as the error occurs when nova connects to keystone.
  ○ nova -> keystone: Slightly likely.
  ○ keystone -> database: Highly likely. Enabled logging and found heavy client-termination messages.
haproxy[22346]: 10.243.232.62:48999 [10/Jul/2015:01:41:34.706] galera galera/pcmk-hovsh0800sdc-06 1/0/8734961 37181 cD 1369/1337/1337/1337/0 0/0
haproxy[22346]: 10.243.232.14:53092 [10/Jul/2015:02:37:43.666] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 2875 cD 1375/1337/1337/1337/0 0/0
haproxy[22346]: 10.243.232.62:41742 [10/Jul/2015:01:47:44.819] galera galera/pcmk-hovsh0800sdc-06 1/0/8400246 38448 cD 1376/1336/1336/1336/0 0/0
haproxy[22346]: 10.243.232.14:53318 [10/Jul/2015:02:37:47.499] galera galera/pcmk-hovsh0800sdc-06 1/0/5400005 3414 cD 1384/1335/1335/1335/0 0/0
haproxy[22346]: 10.243.232.62:42507 [10/Jul/2015:02:37:47.529] galera galera/pcmk-hovsh0800sdc-06 1/0/5400006 2875 cD 1383/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.62:42609 [10/Jul/2015:02:37:49.103] galera galera/pcmk-hovsh0800sdc-06 1/0/5400315 35783 cD 1384/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.62:42684 [10/Jul/2015:02:37:50.598] galera galera/pcmk-hovsh0800sdc-06 1/0/5400259 28994 cD 1384/1334/1334/1334/0 0/0
haproxy[22346]: 10.243.232.14:53493 [10/Jul/2015:02:37:50.885] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 2875 cD 1383/1333/1333/1333/0 0/0
haproxy[22346]: 10.243.232.14:53674 [10/Jul/2015:02:37:53.874] galera galera/pcmk-hovsh0800sdc-06 1/0/5400007 3498 cD 1404/1335/1335/1335/0 0/0
haproxy[22346]: 10.243.232.14:54625 [10/Jul/2015:02:38:11.399] galera galera/pcmk-hovsh0800sdc-06 1/0/5400008 12461 cD 1407/1335/1335/1335/0 0/0
galera sessions: 2000 (limit: 2000)
Hold on, but where did I set it? Nowhere!
● Then where does this limit come from? It is the default hard-coded limit for each proxy when one is not explicitly defined.
● Then why is there no proper error message? The connection is put by haproxy into a queue to wait for a free database connection, then terminated when it hits the timeout.
Haproxy has hit maxconn for galera!

listen galera
  bind 10.243.232.62:3306
  mode tcp
  option tcplog
  option httpchk
  option tcpka
  stick on dst
  stick-table type ip size 2
  timeout client 90m
  timeout server 90m
  server controller-1 10.243.232.14:3306 check inter 1s on-marked-down shutdown-sessions
  server controller-2 10.243.232.15:3306 check inter 1s on-marked-down shutdown-sessions
  server controller-3 10.243.232.16:3306 check inter 1s on-marked-down shutdown-sessions

global
  daemon
  group haproxy
  maxconn 40000
  pidfile /var/run/haproxy.pid
  user haproxy

defaults
  log 127.0.0.1 local2 warning
  mode tcp
  option tcplog
  option redispatch
  retries 3
  timeout connect 5s
  timeout client 30s
  timeout server 30s

maxconn 2000 (the implicit per-proxy default; nothing sets maxconn for the galera listener)
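A minimal sketch of the fix, assuming the maxconn figure derived later in the talk: set it explicitly on the galera listener (or in defaults) so the hard-coded 2000 never applies. The servers and options would stay exactly as in the config above.

```
listen galera
  bind 10.243.232.62:3306
  mode tcp
  maxconn 4960
  # ... remaining options and server lines unchanged ...
```

Whatever value is chosen here must be backed by an at-least-as-large max_connections on the MariaDB/Galera side, or the bottleneck simply moves one hop further.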
I solved your problem, can I go and sleep? Hold on..
● It took more time to determine the right value for the maximum number of database connections, because it depends on:
  ○ How many workers are spawned by each API? This depends on the api_workers/workers configuration for each service, which defaults to the number of CPU cores on each controller and can differ from deployment to deployment.
  ○ Each worker process opens five long-lived database connections.
  ○ There are also some short-lived connections from each worker.
What should be the maxconn for galera?
Now I can sleep like him.
# Number of workers for OpenStack API service. The default will be the number of CPUs available. (integer value)
What should be the maxconn for galera? (Based on a default deployment by RHEL OpenStack Platform Director.)

Each of the three controllers (controller-1, controller-2, controller-3) has 24 cores and runs:
● nova-api: 24 x 3 = 72 workers
● keystone: 24 x 2 = 48 workers
● neutron-server: 24 x 2 = 48 workers
● glance-api: 24 x 1 = 24 workers
● cinder-api: 24 x 1 = 24 workers
● glance-registry: 24 x 1 = 24 workers
● nova-conductor: 24 x 1 = 24 workers
Total per controller: 264 workers x 5 connections = 1320 connections to mariadb-galera.

Through the haproxy VIP, the three controllers total 3960 connections.

Add 1024 for:
1 - Short-lived connections
2 - Other services
3 - New services
Total = 4960
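The worker math above can be sketched in shell arithmetic. The per-service multipliers are the Director defaults quoted on the slide; note that 3960 + 1024 is strictly 4984, which the slides round down to 4960:

```shell
# Per-controller worker count on a 24-core controller, using the slide's
# multipliers: nova-api x3, keystone x2, neutron-server x2, and one core
# each for glance-api, cinder-api, glance-registry, nova-conductor.
CORES=24
WORKERS=$(( CORES*3 + CORES*2 + CORES*2 + CORES + CORES + CORES + CORES ))
# Each worker holds ~5 long-lived database connections.
PER_CONTROLLER=$(( WORKERS * 5 ))
# Three controllers sit behind the haproxy VIP.
TOTAL=$(( PER_CONTROLLER * 3 ))
# Headroom for short-lived connections and other/new services.
MAXCONN=$(( TOTAL + 1024 ))
echo "workers=$WORKERS per_controller=$PER_CONTROLLER total=$TOTAL maxconn=$MAXCONN"
```

Rerunning this with your own core counts and worker settings gives a deployment-specific maxconn instead of a guessed one.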
To sleep like a …..?
Setting the right maxconn value upfront for the database proxy can save you from sleepless nights.
● Decide how many worker processes are required by each API for optimum performance. A 96-core system does not need 96 x 3 nova worker processes.
● Automate this calculation and set it at deployment time, for both haproxy's maxconn and the database server's max_connections.
● If you use a different load balancer, make sure to address this problem there too, if applicable.
Decide and set the right value upfront before going to bed.
● Proactive alerts
● Real-time risk assessment
● No infrastructure cost
● Validated resolution
● Tailored resolution
● Quick setup
● SaaS
Discover the Beta: access.redhat.com/insights
[email protected] / @sadiquepp
[email protected] / @dustinlblack
THANK YOU
plus.google.com/+RedHat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews
linkedin.com/company/red-hat