Posts

Internal System Design: The Forgotten Discipline

Several years ago it was impossible to think about software development without at least a business specification (usually a Word document) and a bunch of UML diagrams: package, class, and sequence diagrams. That was the pre-agile era, when process and formality ruled the software development industry (yes folks, I'm old). Back then, software engineering was pretty much conceived as "process". We live in a more civilized era now; however, that time also produced some interesting and valuable things (tools, techniques, principles, and even practices). As the English expression goes, you need to throw out the bathwater but keep the baby. I was wondering why this phenomenon happened and still happens today. First of all, we are social animals and we tend to follow fashions, going with the "flow". Needless to say, UML and internal software design are not very popular with the new kids today. I don't miss the …

Code for Reliability

It's normal to talk about failure when we talk about cloud-native architectures. Pretty much anything that runs in the cloud (a data center) means more distribution (networking), and distributed systems tend to fail all the time. Chaos Engineering is great, but it targets the infrastructure and is therefore focused on the big picture; by big picture I mean outside of a microservice. Reliability is not just about the outside boundary or infrastructure but also about the inside of a service or system. Reliability inside the system goes by many names, such as Defensive Programming (very popular in Brazil in the 2000s). There is a lot of synergy between reliability, defensive programming, anti-fragility, and efficient internal system design. Efficient design is not only about making your system efficient in an economic sense (less code, better readability, reduced maintenance cost) but also about reliability. Today I want to share an internal system design way of thinking in order to achieve bette…
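To make the idea concrete, here is a minimal, hypothetical sketch of defensive programming inside a service (the class and client names are illustrative only, not from any real project): validate the input, bound the remote call, and degrade gracefully instead of letting a failure spread.

import java.time.Duration;
import java.util.Optional;

public class PriceService {

    // Hypothetical downstream dependency; in a real service this would be a remote call.
    interface PriceClient {
        Optional<Double> fetchPrice(String productId, Duration timeout) throws Exception;
    }

    private static final Duration TIMEOUT = Duration.ofMillis(500);
    private static final double FALLBACK_PRICE = 0.0;

    private final PriceClient client;

    public PriceService(PriceClient client) {
        this.client = client;
    }

    public double priceFor(String productId) {
        // Defensive check: never trust the caller's input.
        if (productId == null || productId.trim().isEmpty()) {
            throw new IllegalArgumentException("productId must not be blank");
        }
        try {
            // Bound the external call with a timeout and handle the "no result" case explicitly.
            return client.fetchPrice(productId, TIMEOUT).orElse(FALLBACK_PRICE);
        } catch (Exception e) {
            // Degrade gracefully: return a safe default instead of propagating the failure.
            return FALLBACK_PRICE;
        }
    }
}

The point is not the specific fallback value but the habit: every boundary inside the service assumes the other side can misbehave.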

C Unit Testing with Check

When I started in IT, several years ago (15+), C was the first language I learned. But it was not the first language I mastered and deployed to production, which was... surprisingly, Visual Basic 6.0. When I learned C, back in that old and arcane time, nobody was into testing; now it is impossible to work professionally in any language without tests. C is a low-level language, but it still kicks ass and has several interesting libs/frameworks. People tend to fear C because of its lack of syntactic sugar, but once you get used to it, it is not that hard at all. Today I want to show how you can do unit testing in C. I will not show how to do unit testing with C++ but with ANSI C. We will be using Check, a unit test lib for ANSI C.



Generic and Script Remediation

Some time ago I shared my experiences with a Dynomite remediation process I wrote. I've been using remediation systems for a while and IMHO they add lots of value, since they automate manual cloud operations work and save people time. I recently refactored my remediation code, and now the same code supports Dynomite / Dynomite Manager but also Apache Cassandra. There are very similar concepts between Dynomite and Cassandra remediation, such as discovering the AWS EC2 IPs to be remediated, AWS resourcing (creating SGs, deleting LCs, updating ASGs), and health checking (Is the node up and running? Can I remediate right now, or are you in the middle of a backup or just booting up?). Once I identified that core concept I was able to create a high-level, generic design for Cassandra and Dynomite. This is great for cases like Patch / Fix (apply a new AMI) or Scale Up (increase memory or CPU, for instance). Although those are the main and most important use cases, there were some corner cases and…
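A hypothetical sketch of that generic design (the interface and method names below are illustrative only, not the actual project code) could look like this, with Dynomite and Cassandra each providing their own implementation while the orchestration stays shared.

import java.util.List;

// Generic remediation contract; concrete classes would exist for Dynomite and Cassandra.
public interface RemediationTarget {

    // Discover the AWS EC2 IPs that belong to the cluster and need remediation.
    List<String> discoverInstanceIps();

    // Health check: is the node up, and is it safe to remediate now
    // (e.g. not in the middle of a backup, not still booting)?
    boolean isSafeToRemediate(String instanceIp);

    // AWS resourcing: create SGs, delete LCs, update ASGs as needed.
    void prepareAwsResources();

    // Apply the actual change, e.g. a new AMI (patch/fix) or a bigger instance type (scale up).
    void remediate(String instanceIp);
}

A generic driver can then loop over discoverInstanceIps(), skip nodes that are not safe, and call remediate() one node at a time, regardless of which database sits behind the interface.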

Mocking and Testing AWS APIs with TestContainers and LocalStack

Cloud computing is the default today. I believe the future is containers and multi-cloud solutions; however, today I work a lot with AWS. Specific endpoints such as S3, AutoScaling, and Route53 are among the ones I use most in my day-to-day work. The AWS API is easy to use but not so easy to test: distributed systems tend to be hard to test, and quick feedback is very important for engineers. Some kinds of tasks need to interact with AWS APIs, like S3 for backups, for instance. But if you need to wait for a deploy to AWS to test something because it is basically impossible to test it locally, then we have a problem. We need to be able to do end-to-end testing, but while you are coding or troubleshooting it is important to move faster. There are two specific projects that can help us with this task: TestContainers and LocalStack. Today I will show how to use LocalStack and TestContainers together with JUnit in order to write unit tests mocking the S3 API.…
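As a rough sketch of the approach (assuming the TestContainers LocalStack module and the AWS SDK for Java v1; exact container versions and method names in the actual post may differ), a JUnit 4 test can start LocalStack as a container and point a regular S3 client at it.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.junit.ClassRule;
import org.junit.Test;
import org.testcontainers.containers.localstack.LocalStackContainer;
import org.testcontainers.containers.localstack.LocalStackContainer.Service;

import static org.junit.Assert.assertEquals;

public class S3BackupTest {

    // Starts a LocalStack container with a fake S3 endpoint once for the whole test class.
    @ClassRule
    public static LocalStackContainer localstack =
            new LocalStackContainer().withServices(Service.S3);

    @Test
    public void writesAndReadsBackupFromFakeS3() {
        // Point the normal AWS SDK client at LocalStack instead of real AWS.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(localstack.getEndpointConfiguration(Service.S3))
                .withCredentials(localstack.getDefaultCredentialsProvider())
                .build();

        s3.createBucket("backups");
        s3.putObject("backups", "db-backup.txt", "backup-content");

        assertEquals("backup-content", s3.getObjectAsString("backups", "db-backup.txt"));
    }
}

The test runs entirely against the local Docker daemon, so the feedback loop stays in seconds instead of waiting for a deploy to AWS.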

DevOps Engineering with Service Orientation

DevOps engineering is often "perceived" as an operations matter. That's partly true, but it is not 100% JUST ops; there are blurred lines between engineering and operations. When we look at the DevOps ecosystem we can highlight some great tools like Ansible, Terraform, Jenkins, Packer, and so many others. Provisioning, release, observability, and SRE are different aspects of a broader DevOps adoption. The bigger benefit comes when we start applying more engineering to DevOps. It's very important to have good tooling around, and engineering clearly makes sense for tooling; however, even outside of the tooling space there are areas that definitely need an engineering injection, like cloud production operations automation. Releasing is the easy part; things get complicated when you automate the operational aspects of your systems. It sounds very meta, or like Inception (if you watched the movie), but systems managing systems looks like the way to go for me. This pa…

Dynomite Remediation

Working with the cloud is great, but it's not easy. The cloud lets us spin up machines pretty easily; however, updating those machines is not such an easy task, especially if we are talking about databases. Today I want to share some experience with a remediation process I've built for Dynomite in my current project. I've been using NetflixOSS Dynomite in production (AWS) for more than 2 years now, both as a cache and as a source of truth, and I have to say it is rock solid and just works. However, we often need to increase machine capacity, i.e., add more memory or increase the disk size. There are times when we need to apply patches to the OS - I use Amazon Linux in production on EC2, and recently there was the Meltdown and Spectre situation. Other times there are telemetry or even provisioning and configuration bugs that need to be fixed. No matter the reason, you will always need, from time to time, to change something in the underlying AMI or in the Java or C application which is running…