“Measure what is measurable, and make measurable what is not so.”
At Sentinel.la, one of the services we provide is the centralization of data and statistics with an OpenStack-centered approach, from OpenStack services (nova-*, neutron-*, keystone and so on…) to the performance and status of vital server resources. All this information is gathered regardless of the role each server plays: All-in-Ones, dedicated Controller/Compute/Storage deployments and Converged deployments are all setups we must support and fetch data from.
Managing all this information requires a very flexible way of organizing and handling it. Our first Proof of Concept was an agent that gathers server information at the Operating System level, so the basic information was captured: CPU, Disk Usage, Memory Usage and Load Average. All of it was stored in a Relational Database.
The problem with Relational Databases is that they are not optimal for handling large amounts of this kind of data. Instead of unleashing the power of such great information, you feel like you are playing Jenga with it: with every new row that is added, you can’t help but feel you are losing a little bit of performance and scalability. Imagine having millions of rows of CPU data from thousands of servers… that won’t end well.
What about using a NoSQL database? Well, standard NoSQL databases help a lot with managing large chunks of document data, but time series are different: imagine that instead of growing as vertical rows, your data grows sideways, and it depends heavily on the time when the data was saved. So, if not a standard NoSQL database, what should we use to save our metrics? And what if, instead of just 5 metrics, we want to capture “n” metrics for “n” services on “n” devices?
This is where a Time-Series database is useful. In this type of database the timestamp is the equivalent of the Id, so your values are always associated with it. Those values are organized in series, each of which is the combination of a measurement (CPU usage, disk usage, etc.) and the tags that you use to identify that measurement (server name, cloud id, server location, etc.).
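As a rough illustration of this data model, the sketch below represents a point as a timestamp plus a measurement, a value and a set of tags, and groups points into series by measurement and tag set. The names used here (Point, make_point, series_key, group_into_series) and the sample servers are hypothetical, and this is not a description of how InfluxDB stores data internally.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """One sample: a value for a measurement at a given time."""
    timestamp: int      # seconds since the epoch; plays the role of the Id
    measurement: str    # e.g. "cpu_usage"
    value: float
    tags: tuple = ()    # sorted (key, value) pairs identifying the source


def make_point(timestamp, measurement, value, **tags):
    # Sort the tags so the same tag set always yields the same series key.
    return Point(timestamp, measurement, value, tuple(sorted(tags.items())))


def series_key(point):
    # A series is identified by its measurement plus its tag set.
    return (point.measurement, point.tags)


def group_into_series(points):
    series = defaultdict(list)
    for p in points:
        series[series_key(p)].append(p)
    for pts in series.values():
        pts.sort(key=lambda p: p.timestamp)  # keep each series in time order
    return dict(series)


# Hypothetical samples from two servers in one cloud.
points = [
    make_point(1060, "cpu_usage", 14.0, server="ctrl-01", cloud="lab"),
    make_point(1000, "cpu_usage", 12.5, server="ctrl-01", cloud="lab"),
    make_point(1000, "cpu_usage", 55.0, server="compute-01", cloud="lab"),
    make_point(1000, "disk_usage", 71.2, server="ctrl-01", cloud="lab"),
]
series = group_into_series(points)
for (measurement, tags), pts in series.items():
    print(measurement, dict(tags), [(p.timestamp, p.value) for p in pts])
```

Adding a new metric or a new server only adds new series; nothing about the existing ones has to change, which is what makes the “n” metrics for “n” services on “n” devices case manageable.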
Having the data stored in a Time-Series database enables you to think of the information as points, which are easy to identify, search, display and graph. You have many functions to manipulate the data and get the right information. In our case we realized that we could use aggregation and transformation functions to see things like behavior over time with great precision.
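To make the idea of an aggregation over time concrete, here is a minimal sketch in plain Python that averages raw samples into fixed time windows, roughly the kind of operation a time-series database performs for you. The function name mean_per_window and the sample values are assumptions made for illustration; a real deployment would run this inside the database rather than in application code.

```python
from collections import defaultdict
from statistics import mean


def mean_per_window(samples, window_seconds):
    """Average (timestamp, value) samples into fixed-size time windows.

    Returns a list of (window_start, mean_value) sorted by window_start.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the window it falls in.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical CPU usage samples as (seconds, percent).
cpu_samples = [(0, 10.0), (20, 20.0), (40, 30.0), (60, 50.0), (80, 70.0), (130, 90.0)]
per_minute = mean_per_window(cpu_samples, 60)
print(per_minute)  # [(0, 20.0), (60, 60.0), (120, 90.0)]
```

Changing window_seconds trades resolution for smoothness: small windows show short spikes, while large ones show the overall trend.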
For these purposes we chose InfluxDB as our time-series database: Monasca uses it, and while we were playing with Monasca we found out that it was perfect for what we do. InfluxDB can also be used “as a service”, since the same people at InfluxData who created the product offer it as a hosted service. This way we can use (and love) InfluxDB features with High Availability without having to operate it, and thus we can focus on our core business.
We feel very fortunate that our development coincided with the InfluxDB lifecycle. We started using it at the very moment the 0.9 version was released. This version was a turning point because it added support for tags. It also differs a little in syntax and in other functionality, and a new thresholding and alerting component (Kapacitor) was introduced the very same week we were researching and developing our metrics alerting engine!
A whole new world
With the database backend solved and performance and reliability no longer a limit, now comes the sweet part: we can store all the measurements we want. We began collecting I/O values from servers, starting with OpenStack service-related information. How much CPU does nova-api use? Is nova-scheduler having memory peaks? What’s the uptime of the nova-compute process? The limit is only our (OpenStack) imagination.
InfluxData: https://influxdata.com/