It’s a new year! As tradition goes, a new year should be accompanied by new changes. Following that tradition, I’ve decided to try new things too.

It’s been ages since I blogged, and I’ve always loved blogging. I’m never short of thoughts or stuff to share, but blogging requires a little patience, at least for me. There have been times when I spent ~2 months on a single blog post to make sure I got the content right. And I believe this extreme scrutiny sometimes stops me from putting a chunk of time aside to jot down a new post.

I’ve been working at a startup called Instabase for the past year. When you work at a startup, time becomes an even more valuable asset, and it becomes even more challenging to carve that time out. On the flip side, there’s so much learning and new stuff happening that there’s never a shortage of things to share. So the plan is to leverage this advantage to cater to my love for blogging, fix the scrutiny process, and write more. I’m going to do a weekly update series, where I’ll cover the different domains of applications I’m working on and the problems I tackle on a day-to-day basis. The content is basically going to be a mix of things I’m learning and problems I’m tackling, with a blend of opinions (which can be biased), so I’d recommend taking everything with a pinch of salt 🙂

Context

My team handles the monitoring and observability side of our platform/infra. Monitoring and observability is a pretty hot area in the industry because every tech company, directly or indirectly, needs it. There are tons of companies like Datadog, New Relic, AppDynamics, Sumo Logic, Pixie, etc. that specifically cater to the monitoring needs of organizations that don’t have the expertise or resources to build their own monitoring platform. Outsourcing observability to these vendors allows organizations to focus on the product they’re building. These vendors are in high demand because they offer good services, but that being said, it can get pretty expensive too. The cost implications of every extra log line and metric your product generates need to be accounted for.

Building your own monitoring platform can be challenging, but it can be cost-effective in the long run.

What do I work on?

I hope you got the gist of the domain, so let’s come back to the topic: what exactly do I work on? Our team creates the monitoring infra and framework. To achieve that, we use a plethora of existing tools and solve the problems that come with monitoring a dynamic platform (think Kubernetes). Here is a rough list of some of the necessary tools:

On a day-to-day basis, my work is basically divided into 2 categories:

Every developer is aware of the application domain: it involves writing code and building apps using various languages and libraries. In operations, you acquire domain-specific knowledge of a tool/technology to deploy, serve, test, or monitor your applications.

For example, say my target is to deploy a website on Google Kubernetes Engine and make it accessible at a particular URL. This is what I may need to do:
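To make this concrete: under the hood, such a deployment boils down to a handful of Kubernetes manifests. Here’s a rough sketch (the app name, project, and image are all hypothetical placeholders) of the Deployment and Service expressed as Python dicts, the same structures you’d normally write as YAML:

```python
# Sketch of the core Kubernetes objects needed to serve a website.
# Names, labels, and the container image below are hypothetical.

def make_deployment(name: str, image: str, replicas: int = 2) -> dict:
    """Deployment: runs `replicas` copies of the website container."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template's labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": name, "image": image,
                         "ports": [{"containerPort": 80}]}
                    ]
                },
            },
        },
    }


def make_service(name: str) -> dict:
    """Service: gives the pods a stable in-cluster address."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "selector": {"app": name},
            "ports": [{"port": 80, "targetPort": 80}],
        },
    }


deployment = make_deployment("my-website", "gcr.io/my-project/website:v1")
service = make_service("my-website")
```

In practice these would be written as YAML files and applied with `kubectl apply -f`, with an Ingress (or a LoadBalancer Service) on top to map the public URL to the Service.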


There’s so much more to each of the domains, technologies, and tools that I’ve covered so far. For the sake of avoiding information overload, I’ll stop here and use future posts to drill down on each of them.

Weekly Update

For a couple of weeks, I’ve been working on delivering our monitoring platform to one of our customers. Delivering involves preparing container images for different services, creating deployment files, and then working with the customer’s engineers to deploy and test things end to end. Instabase is an OS that we deploy wherever our customers want: it could be on gcloud, AWS, Azure, OpenShift, or their own on-prem infra. We’ve realised that every cloud infra comes with its own set of challenges.

We’re doing a beta release of our monitoring platform for our customers to test the waters and figure out the different challenges we might run into. The purpose is to find out if we need to do certain things differently. I faced a few blockers, but the major one was the lack of data around the CPU and memory usage of microservices. If we don’t have that data, neither our visualization dashboards nor our alerting rules work, which breaks quite a few things.
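To illustrate why missing usage data is so disruptive, here’s a minimal sketch (not our actual rule engine; the function name and threshold are illustrative) of a threshold-style alerting rule. With an empty series, the rule silently evaluates to “no alert”, so you lose alert coverage without any signal that you’ve lost it:

```python
# Minimal sketch of a threshold alerting rule over CPU usage samples.
# Illustrative only: real systems evaluate queries against a metrics store.

def evaluate_cpu_alert(samples: list[float], threshold: float = 0.8):
    """Return an alert message if average CPU usage exceeds threshold."""
    if not samples:
        # The failure mode we hit: no data at all, so the rule can never
        # fire, and nothing tells you that coverage has been lost.
        return None
    avg = sum(samples) / len(samples)
    return f"high cpu: {avg:.2f}" if avg > threshold else None
```

The dashboards fail the same way: a panel over an empty series simply renders nothing.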

I brainstormed different ways we could get the missing data, tested a few of those solutions in my sandbox Kubernetes environment, and was all set to deploy at the customer’s. But unfortunately, we ran into permission issues while performing certain actions on the customer’s cloud, and we had to involve their cloud team to help us.


On the application side, I continued working on instrumenting the code for our database library. The purpose of instrumentation is to get telemetry data on the health and performance of the code/service. This telemetry data could be:

I also created a design doc outlining the different telemetry data we plan on capturing, why we’re capturing it, the implementation plan, and what problems we intend to solve with this data. I’ll try to discuss the importance of a design doc in the next post.
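The instrumentation idea above can be sketched as a decorator that records per-call latency and error counts. This is a hedged illustration, not our actual library: the operation name, the in-memory dicts, and `run_query` are hypothetical stand-ins for a real metrics exporter and a real database call.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical sketch of instrumentation: in a real service these numbers
# would be exported to a metrics backend, not kept in module-level dicts.
latencies = defaultdict(list)  # operation name -> list of call durations
errors = defaultdict(int)      # operation name -> failure count


def instrumented(name: str):
    """Wrap a function so every call records its duration and failures."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                errors[name] += 1  # count failures per operation
                raise
            finally:
                # Record wall-clock duration of every call, success or not.
                latencies[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("db.query")
def run_query(sql: str):
    # Placeholder for a real database call.
    return f"result of {sql}"
```

From data like this you can derive the usual health signals: request rate, error rate, and latency percentiles per operation.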

I’ve kept this weekly update small and concise because it was important to provide the meta-context of “what I work on” first. In the coming weeks, the plan is to make the updates more elaborate so I can actually share my learnings.

Cheers to more learning..!!