Telemetry and Microservices part1

Telemetry is a key component of a microservices architecture. Even if your solution is a monolith or an old-fashioned SOA, you still need some level of telemetry.

Many companies out there still use Nagios and Zabbix. These tools are not the new kids on the block anymore; they were born into a monolithic, bare-metal world.

Currently, we live in a cloud-native, microservice-oriented world. This is a bigger trend than you might think. Gartner predicts that by 2020 companies will be doing more algorithms on their own. Considering that we are transitioning from a centralized architecture with a monolithic UI, monolithic service, and monolithic database to a distributed-* one, we need to scale and work with different approaches for telemetry.

Observability: The concepts

According to Wikipedia, telemetry is defined as:

From a DevOps / SRE perspective, telemetry is defined as observability, which includes:

  • Monitoring
  • Alerting and Visualizations
  • Distributed Systems Tracking
  • Log Aggregation 
  • Automated Canary Analysis
  • Dynamic Thresholds with ML

You might also see lots of people talking about telemetry in an IoT context. Why should you care? Well, your telemetry should be as big as your system is. As your infrastructure and architecture scale out, you will need to scale your telemetry platform too.

Monitoring: The Basics

The basic need is to know whether your system is up and running. In a microservices world this is not so simple: there are multiple middleware servers, caches, engines, and database clusters using different protocols and languages. People tend to lean on plugins in solutions like Nagios and Sensu. Nowadays you will likely need custom development, because you will almost certainly have to adapt the telemetry to your architecture.
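
To make the "is it up?" question concrete, here is a minimal sketch in Python of a custom health aggregator. Everything in it (CheckResult, run_checks, overall_status, and the two probes) is a hypothetical name I made up for illustration, not part of Nagios, Sensu, or any other tool. Real probes would speak the protocol of each component.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CheckResult:
    name: str
    healthy: bool
    detail: str = ""


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, CheckResult]:
    """Run every probe; a crashing probe must not hide the other results."""
    results = {}
    for name, probe in checks.items():
        try:
            ok = bool(probe())
            results[name] = CheckResult(name, ok, "" if ok else "probe returned false")
        except Exception as exc:  # a probe that raises counts as unhealthy
            results[name] = CheckResult(name, False, repr(exc))
    return results


def overall_status(results: Dict[str, CheckResult]) -> str:
    """The system is UP only when every component is healthy."""
    return "UP" if all(r.healthy for r in results.values()) else "DEGRADED"


def failing_cache_probe() -> bool:
    # Hypothetical probe that simulates an unreachable cache.
    raise ConnectionError("cache unreachable")


# Hypothetical components; replace the callables with real protocol checks.
checks = {
    "database": lambda: True,
    "cache": failing_cache_probe,
}
results = run_checks(checks)
status = overall_status(results)
```

The point of the design is isolation: one broken probe should mark its own component as unhealthy and nothing else.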

Alerting and Visualization

Alerting is often done using a very particular data model called time series. Not all solutions use a time series database (TSDB); some use RRD-style storage, for instance. But if you want scalability, you will need to consider one. For me the best options are OpenTSDB (HBase based) and Apache Cassandra. There is one database getting lots of traction called InfluxDB, which is very nice and easy to use; however, there are cases with issues at scale.
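
If the time series model is new to you, the following Python sketch may help. It is a toy in-memory store, assuming a metric name plus tags identifies a series and each point is a (timestamp, value) pair. TinySeriesStore and should_alert are hypothetical names, and this is not how OpenTSDB, Cassandra, or InfluxDB are implemented.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Point = Tuple[int, float]  # (timestamp in seconds, value)
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


class TinySeriesStore:
    """Toy store: one append-only list of points per (metric, tags) pair."""

    def __init__(self) -> None:
        self._series: Dict[SeriesKey, List[Point]] = defaultdict(list)

    def write(self, metric: str, tags: Dict[str, str], timestamp: int, value: float) -> None:
        key = (metric, frozenset(tags.items()))
        self._series[key].append((timestamp, value))

    def query(self, metric: str, tags: Dict[str, str], start: int, end: int) -> List[Point]:
        key = (metric, frozenset(tags.items()))
        return [(t, v) for t, v in self._series.get(key, []) if start <= t <= end]


def should_alert(points: List[Point], threshold: float, min_breaches: int) -> bool:
    """Fire only when the last `min_breaches` points all exceed the threshold."""
    recent = points[-min_breaches:]
    return len(recent) == min_breaches and all(v > threshold for _, v in recent)


store = TinySeriesStore()
for ts, latency in [(0, 120.0), (10, 480.0), (20, 510.0), (30, 530.0)]:
    store.write("http.latency_ms", {"service": "checkout"}, ts, latency)

points = store.query("http.latency_ms", {"service": "checkout"}, 0, 30)
fire = should_alert(points, threshold=450.0, min_breaches=3)
```

Requiring several consecutive breaches instead of a single one is a simple way to avoid paging people for one noisy data point.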

Visualization is another key aspect, and it is not just a simple chart. One of the best solutions for me is Grafana. People often do all kinds of aggregation and window analysis to check trends and run analytics to spot issues, performance degradation, and potential incidents.
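
As an illustration of window analysis, here is a small Python sketch of a moving average, which is the kind of smoothing a dashboard might apply before drawing a trend line. The function name and the sample values are my own assumptions; this is not Grafana code.

```python
from collections import deque
from typing import Deque, Iterable, List


def moving_average(values: Iterable[float], window: int) -> List[float]:
    """Average of the last `window` values, emitted once per input value."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf: Deque[float] = deque(maxlen=window)
    total = 0.0
    out: List[float] = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # the oldest value is about to be evicted
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


smoothed = moving_average([1, 2, 3, 4, 5], window=3)
```

Until the window fills up, the average is taken over the values seen so far, so the output has the same length as the input.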

Another key idea that is very popular nowadays is using advanced math to predict and visualize your telemetry data. Netflix has a very interesting solution for this called Atlas. Atlas is built with Scala, Akka, and Spray.
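
To give a feeling for what "math to predict" can mean, here is a minimal Python sketch of exponential smoothing used as a one-step-ahead forecast, with a relative tolerance acting as a dynamic threshold. This is a generic technique I picked for illustration; it is not a description of how Atlas works internally. The function names, alpha, and tolerance are all assumptions.

```python
from typing import List, Sequence


def ewma_forecast(values: Sequence[float], alpha: float) -> List[float]:
    """One-step-ahead forecasts: forecasts[i] is the prediction for values[i]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not values:
        return []
    forecasts: List[float] = []
    level = values[0]  # seed the smoothed level with the first observation
    for v in values:
        forecasts.append(level)
        level = alpha * v + (1 - alpha) * level
    return forecasts


def anomalies(values: Sequence[float], alpha: float, tolerance: float) -> List[int]:
    """Indexes where the value deviates from its forecast by more than
    `tolerance` times the forecast (a relative, moving threshold)."""
    forecasts = ewma_forecast(values, alpha)
    return [
        i
        for i, (v, f) in enumerate(zip(values, forecasts))
        if abs(v - f) > tolerance * abs(f)
    ]


requests_per_second = [100.0, 102.0, 98.0, 101.0, 300.0, 100.0]
spikes = anomalies(requests_per_second, alpha=0.2, tolerance=0.5)
```

A small alpha keeps the forecast stable, so a single spike does not drag the threshold up and cause the next normal value to be flagged as well.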

Distributed Systems Tracking

One way of thinking is to consider everything that happens in your system as an immutable event. This is particularly interesting because you are then dealing with a stream problem. That's exactly what Riemann does. Riemann is written in Clojure, and you can do all sorts of complex math because you are coding in Clojure. This is very sexy :-)
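
The "everything is an immutable event on a stream" idea can be sketched in a few lines of Python. To be clear, this is not Riemann's API (Riemann is configured in Clojure); the Event fields and the where and count_per_window helpers are hypothetical names chosen for this example, and the sketch assumes events arrive in time order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)  # frozen makes each event immutable once created
class Event:
    host: str
    service: str
    time: int  # seconds since some epoch
    metric: float


def where(events: Iterable[Event], predicate: Callable[[Event], bool]) -> Iterator[Event]:
    """Lazily keep only the events that match the predicate."""
    return (e for e in events if predicate(e))


def count_per_window(events: Iterable[Event], window_seconds: int) -> Iterator[Tuple[int, str, int]]:
    """Yield (window_start, service, count); assumes time-ordered input."""
    current: Optional[int] = None
    counts: Dict[str, int] = {}
    for e in events:
        start = e.time - e.time % window_seconds
        if current is not None and start != current:
            for service, n in sorted(counts.items()):
                yield (current, service, n)
            counts = {}
        current = start
        counts[e.service] = counts.get(e.service, 0) + 1
    if current is not None:  # flush the last, still open window
        for service, n in sorted(counts.items()):
            yield (current, service, n)


events = [
    Event("a", "api", 1, 500.0),
    Event("a", "api", 3, 200.0),
    Event("b", "db", 4, 500.0),
    Event("a", "api", 12, 500.0),
]
errors = where(events, lambda e: e.metric >= 500)
error_counts = list(count_per_window(errors, window_seconds=10))
```

Because both helpers are generators, nothing is held in memory beyond the current window, which is what makes the stream view attractive.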

Several solutions work with a concept called retention, and some solutions, such as Prometheus, do not keep every data point around. It's particularly interesting to store all events because you will possibly apply regression and compare with past values.
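
Here is a rough Python sketch of why keeping history matters: with past points available you can fit a simple trend and compare a new observation against it. I am assuming ordinary least squares on (time, value) pairs; fit_line and the sample numbers are hypothetical, and a real system would likely use a proper statistics library.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical history: (hour, disk usage in GB) kept past the usual retention.
history = [(0.0, 10.0), (1.0, 12.0), (2.0, 14.0), (3.0, 16.0)]
slope, intercept = fit_line(history)

expected_now = slope * 4.0 + intercept  # what the trend predicts for hour 4
observed_now = 30.0                     # made-up new observation
deviation = observed_now - expected_now
```

If the old points had already been dropped by a retention policy, there would be nothing to fit the line against.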

Log Aggregation 

No matter whether you are using containers like LXC and Docker or virtualization, you will have several servers, and being cloud-native or container-native requires a stateless and ephemeral solution. Having said that, you can't store your logs on your local file system like you did before. You will need solutions like ELK (Elasticsearch, Logstash, and Kibana) or Graylog.
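
On the application side, a common pattern is to write structured logs to a stream and let a shipper forward them to the aggregator. Below is a minimal Python sketch using only the standard logging module. JsonFormatter, build_logger, and the "service" field are names I made up; in a container you would pass sys.stdout, and the StringIO buffer is only a stand-in so the output can be inspected.

```python
import io
import json
import logging
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # "service" is a hypothetical field supplied through `extra`.
            "service": getattr(record, "service", "unknown"),
        }
        return json.dumps(payload)


def build_logger(name: str, stream: TextIO) -> logging.Logger:
    """Logger that writes JSON lines to a stream instead of a local file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


buffer = io.StringIO()  # stand-in for sys.stdout
logger = build_logger("checkout", buffer)
logger.info("order %s placed", "A-1", extra={"service": "checkout"})
record = json.loads(buffer.getvalue())
```

One JSON object per line keeps parsing on the aggregator side simple, and nothing is left behind on the ephemeral host.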

Some people use ELK as a telemetry solution (for storage and visualization). I think this is wrong; you can only get away with it for very simple, low-volume data.

Diego Pacheco
