NOTE: This post was initially written a few years ago, so some details may be outdated.
What is Cinder
High-level Cinder Architecture
Cinder contains 3 main components: Cinder API (c-api), Cinder Scheduler (c-sched) and Cinder Volume (c-vol). Cinder Backup is a bit different and will be described later.
All Cinder components communicate with each other over RPC. In our case, it’s the AMQP protocol with a RabbitMQ backend. All data is stored in the database (MySQL DB).
Cinder is responsible only for the volume management plane. Cinder backends (storage systems) are responsible for the data plane and all volume manipulations.
Like all OpenStack services, Cinder provides an HTTP REST API. Today, there are 3 versions of the API:
- v1 - removed
- v2 - deprecated, contains all features up to the Mitaka release plus performance improvements like paging, limits, etc.
- v3 - current version, contains all features of API v2; all new features are implemented as microversions.
I do not recommend using API v2. The latest python-cinderclient uses v3 by default.
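To make the microversion idea concrete, here is a minimal sketch of how a client could pin a request to a specific v3 microversion through the `OpenStack-API-Version` header. The helper name `volume_api_headers` and the token value are hypothetical; only the header format is taken from the API itself.

```python
def volume_api_headers(token, microversion="3.0"):
    """Build request headers pinned to a Cinder v3 microversion.

    `token` is a Keystone token obtained elsewhere (hypothetical input).
    """
    major, _, minor = microversion.partition(".")
    if major != "3" or not minor.isdigit():
        raise ValueError(f"not a v3 microversion: {microversion!r}")
    return {
        "X-Auth-Token": token,
        # Service type "volume" followed by the requested microversion.
        "OpenStack-API-Version": f"volume {microversion}",
        "Accept": "application/json",
    }
```

A performance test can then record the microversion next to each measurement, so results taken against different API behaviors are not mixed.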
There is an official OpenStack Cinder client called ‘python-cinderclient’. Unfortunately, python-openstackclient doesn’t support all Cinder APIs, so the community recommends using python-cinderclient.
All logs are stored in the /var/log/cinder/ directory. Each service writes its log to a separate file. E.g.:
If debug logging is not turned on, the logs contain very little data for troubleshooting. To enable debug logs, follow these steps:
- Set DEBUG for the root logger in
- Restart all Cinder services
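As an illustration of what "DEBUG for the root logger" looks like, here is a small sketch in the standard Python `logging.config` file format, which is the format oslo.log accepts through `log_config_append`. The handler and formatter names are made up for the example, and the config is kept in an in-memory string instead of a real Cinder file.

```python
import configparser
import logging
import logging.config

# Hypothetical minimal logging config; only [logger_root] level matters here.
LOGGING_CONF = """
[loggers]
keys = root

[handlers]
keys = stderr

[formatters]
keys = simple

[logger_root]
level = DEBUG
handlers = stderr

[handler_stderr]
class = StreamHandler
args = (sys.stderr,)
formatter = simple

[formatter_simple]
format = %(asctime)s %(levelname)s %(name)s %(message)s
"""


def apply_logging_conf(text):
    """Load a fileConfig-style logging config and return the root logger."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    logging.config.fileConfig(parser, disable_existing_loggers=False)
    return logging.getLogger()
```

After `apply_logging_conf(LOGGING_CONF)` the root logger level is `logging.DEBUG`; a running service still needs a restart to pick up a changed file.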
What kind of tests are described in this document?
There are different approaches to performance testing. This document describes the performance of the Cinder API and Cinder components. By performance, we mean how much time it takes to complete an API call. We don’t test the performance of backends, the DB, the message queue, etc.
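Since "performance" here means the time to complete an API call, the basic building block is a timer around a single call plus a summary over many samples. The sketch below is one possible way to do it; `time_call` and `summarize` are names I made up, and the callable can be any function that performs one API request.

```python
import math
import statistics
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def summarize(durations):
    """Summarize call durations; p95 uses the nearest-rank method."""
    ordered = sorted(durations)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "count": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

Reporting the median and a high percentile, not only the average, makes slow outliers visible.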
Why do we need these tests?
We want to know what Cinder performance to expect:
- What is the expected time to handle request A?
- How many requests B can Cinder handle in parallel?
- What impacts Cinder performance?
- Where is the performance bottleneck?
As a result of performance testing, we should have answers to the questions above and recommendations on how to configure Cinder to work faster. The test results could also be a source of new performance-related bug reports.
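For the "how many requests in parallel" question, a rough sketch is to fire the same call from a thread pool and look at per-call latency together with overall throughput. This is not how Rally does it internally; `run_parallel` is a hypothetical helper, and `func` stands for one API request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_parallel(func, total_calls, concurrency):
    """Call func total_calls times using `concurrency` worker threads."""

    def timed(_):
        start = time.perf_counter()
        func()
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = list(pool.map(timed, range(total_calls)))
    wall = time.perf_counter() - wall_start
    return {
        "durations": durations,
        "wall_time": wall,
        # Completed calls per second over the whole run.
        "throughput": total_calls / wall if wall > 0 else float("inf"),
    }
```

Repeating the run with growing `concurrency` shows where latency starts to climb while throughput stops growing, which is a hint at the bottleneck.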
What tools/frameworks should we use?
There are community and industry-standard tools for performance testing like Rally, Mongoose, Wally, Tsung, JMeter, etc. We can use existing Rally and Wally scenarios to test Cinder performance.
Performance testing for different Cinder components
We should define what data we actually want. Different components require different tests. Let’s discuss CRUD (Create-Read-Update-Delete) API calls.
In general, there are two different types of ‘create’ commands: those that create something on the backend (volume, snapshot, consistency group, etc) and those that only create a DB record (volume type, extra spec, quota, etc).
Create APIs without backend interactions
These API calls are executed only by c-api services. The following components and configuration options could affect performance:
- Database performance
- Memory allocation
- CPU usage
- How many c-api workers are running on the node
- Whether we use eventlet or Apache/Nginx+WSGI
Create APIs with backend interactions
These API calls are executed by c-api, c-sched and c-vol services.
Usually, such operations are asynchronous, so after we invoke the cinder create volume command, the API returns a 202 Accepted code. It means Cinder created a DB record and sent a message over AMQP to c-sched or c-vol.
In such cases, ideally, we should have 2 different types of tests:
- How quickly c-api creates something
- How quickly the Cinder resource reaches the ‘available’ state instead of an ‘-ing’ state
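The second measurement can be sketched as a polling loop that records how long a resource stays in an ‘-ing’ state. The function below is only an illustration: `get_status` is a hypothetical callable that returns the current status string (for example by calling the show API), and the timeout and interval defaults are arbitrary.

```python
import time


def wait_for_status(get_status, target="available", timeout=300.0, interval=1.0):
    """Poll get_status() until it returns `target`; return elapsed seconds.

    Raises RuntimeError on an error status and TimeoutError on timeout.
    """
    start = time.monotonic()
    while True:
        status = get_status()
        elapsed = time.monotonic() - start
        if status == target:
            return elapsed
        if status.startswith("error"):
            raise RuntimeError(f"resource entered status {status!r}")
        if elapsed >= timeout:
            raise TimeoutError(f"still {status!r} after {elapsed:.1f}s")
        time.sleep(interval)
```

Note that the polling interval limits the resolution of this number, so it should be much smaller than the expected time to become available.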
The following components and configuration options could affect performance:
- All components described in the ‘Create APIs without backend interactions’ section
- When c-api, c-sched and c-vol are located on different nodes, we should be sure that networking is not a bottleneck
- RabbitMQ performance
- Cinder backend performance
Depending on the backend and Cinder driver, Cinder communicates with storage in different ways: invoking a Linux executable (e.g. LVM), calling 3rd party libraries and/or an HTTP REST API. It means that we can’t say ‘Cinder creates a volume in 5 seconds’ for a given backend. Many components are involved, and each of them could affect performance tests. That’s why I strongly recommend not using any real backend for Cinder performance tests. We have to use the Fake driver for tests. For more details, please see the ‘Storage Backends’ sections.
Proposal: implement ‘create volume(image)/snapshot’ tests for each tested backend without Cinder to measure how much time each backend takes to create a volume/snapshot/etc. Then we can compare the results of such tests with the Cinder performance test results.
Read API tests are almost the same as for ‘Create APIs without backend interactions’. Cinder reads data from the DB and returns it to the consumer.
There are exceptions to this rule: a few APIs call c-vol for data, so they are expected to be slower due to RPC communication.
Please read the ‘Create APIs’ section. Update APIs are very similar to Create APIs, with the same behaviors.
Please read the ‘Create APIs’ section. Delete APIs are very similar to Create APIs, with the same behaviors.
Delete volumes/snapshots/backups performance testing
Storage systems and Cinder drivers implement ‘delete’ differently, so volume/snapshot/etc deletion could take some time. E.g. the LVM driver uses the ‘dd’ tool to overwrite the volume with data from /dev/zero. Ceph has its own solution to shred a volume after deletion. For Cinder performance tests we have to disable such options if possible.
Proposal: implement backend-specific tests for volume/snapshot deletion and compare their results with the Cinder tests.
Testing using Fake Driver
Fake Driver allows you to run both functional and performance tests without the impact of a real storage backend. It means we test only Cinder with the minimum required 3rd party software and hardware. The minimum requirements are Cinder, Keystone, RabbitMQ, MySQL, [Nova and Glance for related tests].
What could impact Cinder performance?
NOTE: Until the Ocata release, c-api and c-vol services cannot be run in Active/Active HA mode. We use HAProxy for c-api and config hacks for c-vol to achieve a working HA solution.
By default, Cinder runs one API worker per CPU. We can change this option if needed when we use the Eventlet WSGI implementation. We don’t need to change it if Cinder works under Apache or Nginx.
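As far as I know, the worker count is controlled by the `osapi_volume_workers` option in the `[DEFAULT]` section of cinder.conf. The sketch below only renders such a fragment for a test run; the helper name and the value 4 are hypothetical, and it does not touch a real config file.

```python
import configparser
import io


def render_api_workers(workers):
    """Render a cinder.conf fragment that sets the number of c-api workers."""
    if workers < 1:
        raise ValueError("workers must be a positive integer")
    conf = configparser.ConfigParser()
    conf["DEFAULT"] = {"osapi_volume_workers": str(workers)}
    buf = io.StringIO()
    conf.write(buf)
    return buf.getvalue()
```

Generating the fragment per test run makes it easy to repeat the same scenario with 1, 2, 4, ... workers and compare the results.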
Usually, the Cinder scheduler runs on Controller nodes. There should not be a big performance impact if we run more or fewer c-sched services. We start several c-sched services to achieve HA mode.
In general, we run one c-vol service per backend. We will be able to run more c-vol services per backend once Active/Active HA is implemented. We can’t run multiple c-vol services for LVM and Block Device Driver now. You can configure multiple c-vol services per backend for Ceph and any other distributed or remote storage (e.g. NetApp, EMC, SolidFire, etc). Multiple c-vol services could increase the performance of Cinder CRUD operations if storage is not a bottleneck. We have to test how increasing or decreasing the number of c-vol services impacts performance.
3rd party components
Different backends, and even different configurations of the same backend, have different CRUD performance. So we have to test each backend separately from Cinder to get numbers for CRUD operation performance. After these tests, we have to run Cinder CRUD performance tests with the real backend and compare the results to answer the question: what is Cinder’s overhead when working with the specified backend?
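The comparison itself is simple arithmetic. As a sketch, assuming we already have duration samples (in seconds) for the same operation measured directly on the backend and through Cinder, the overhead can be taken as the difference of the medians. The function name and the choice of the median are mine, not something prescribed by any tool.

```python
import statistics


def cinder_overhead(backend_samples, cinder_samples):
    """Compare backend-only and through-Cinder durations for one operation."""
    backend = statistics.median(backend_samples)
    cinder = statistics.median(cinder_samples)
    overhead = cinder - backend
    return {
        "backend_median": backend,
        "cinder_median": cinder,
        "overhead_seconds": overhead,
        # Fraction of the end-to-end time that is not spent in the backend.
        "overhead_share": overhead / cinder,
    }
```

For example, with made-up samples of 2.0 s on the backend and 2.5 s through Cinder, the overhead is 0.5 s, or 20% of the end-to-end time.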
Create/delete LVM volume operations are fast enough and depend on disk I/O. By default, Cinder uses the secure delete feature to fill the volume with zeros using the ‘dd’ tool before deletion. That’s why disk I/O could impact performance. We can/should disable this feature for Cinder performance testing.
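To my knowledge, the zeroing is controlled per backend by the `volume_clear` option, and `volume_clear = none` turns it off. Before a test run it is handy to check which enabled backends would still wipe volumes. The sketch below does that on a config passed as text; it assumes the default is `zero` when the option is unset, and the section names in the usage note are invented.

```python
import configparser


def backends_with_secure_delete(conf_text):
    """Return enabled backend sections where volume_clear is not 'none'."""
    conf = configparser.ConfigParser(interpolation=None)
    conf.read_string(conf_text)
    enabled = conf.get("DEFAULT", "enabled_backends", fallback="")
    names = [name.strip() for name in enabled.split(",") if name.strip()]
    flagged = []
    for name in names:
        if not conf.has_section(name):
            continue
        # Assumption: an unset option means the volume is zeroed on delete.
        mode = conf.get(name, "volume_clear", fallback="zero")
        if mode.strip().lower() != "none":
            flagged.append(name)
    return flagged
```

For a config with `enabled_backends = lvm-1, lvm-2` where only `[lvm-2]` sets `volume_clear = none`, the function returns `["lvm-1"]`.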
Creating volumes and snapshots in Ceph is a fast operation. But Ceph uses a secure delete feature during deletion, which increases the deletion time depending on the volume size.
Apache/Nginx or Eventlet-based
Eventlet-based deployments are supported only for old releases now. Apache+mod_wsgi or Nginx+uWSGI will be faster than eventlet. Configuring the Cinder API with Apache/Nginx should be done like for any other WSGI application, following industry best practices.
NOTE: I did some performance testing for this in the past, but the results are lost.
We have to track DB performance during Cinder testing to find the slowest queries and to be sure that there are enough resources for the DB server.
DB drivers have a big impact on performance and concurrency. PyMySQL works better with eventlet under concurrency than MySQL-Python. We have to test Cinder CRUD operations with the Fake Driver to get real numbers.
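The driver in use is visible in the SQLAlchemy connection URL (the `connection` option in the `[database]` section): the part after `+` in the scheme names the driver. Here is a small sketch that extracts it, so a test report can record which driver was measured. The function name and the example URL are hypothetical.

```python
from urllib.parse import urlsplit


def db_driver(connection_url):
    """Return (dialect, driver) from a SQLAlchemy-style connection URL."""
    scheme = urlsplit(connection_url).scheme
    dialect, _, driver = scheme.partition("+")
    # No "+driver" part means SQLAlchemy picks its default driver.
    return dialect, driver or "default"
```

For `mysql+pymysql://cinder:secret@db.example.org/cinder` this returns `("mysql", "pymysql")`.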
Message bus (RabbitMQ)
During Cinder performance testing we should monitor the RabbitMQ state to be sure that there are enough OS resources and that both RabbitMQ and oslo.messaging work well under the current load.
Because Cinder is a distributed component which runs on different nodes, we need enough networking resources (I/O bandwidth and low latency) to provide fast communication between Cinder nodes.
Like any component, Cinder should have enough CPU resources. We should monitor CPU load, load average, and the main CPU consumers on each node while Cinder is running.
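A very small sketch of load sampling with the Python standard library is shown below. It is Unix-only (`os.getloadavg` is not available on every platform) and is not a replacement for a real monitoring system; the function name is made up.

```python
import os


def normalized_load():
    """Return the 1, 5 and 15 minute load averages divided by the CPU count."""
    one, five, fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1
    return {"1m": one / cpus, "5m": five / cpus, "15m": fifteen / cpus}
```

Values close to or above 1.0 during a test run suggest that the node is CPU-bound and the measured API timings are affected by it.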
In general, please see the ‘CPU Load’ section.
For backends like LVM and Block Device Driver, we have to provide enough disk I/O bandwidth for c-vol services. It could impact operations such as create volume from image and upload volume to image.
Please see the ‘CPU Load’ section.