The internship that I was so excited about in late 2019 is finally over, and I’m back here to share my learnings. Even though my job profile stated “Software Engineer”, it was quite different from traditional software engineering because I was part of the “Infrastructure team”.
A simple Google search for “infrastructure engineer” returns something like this:
An Infrastructure Engineer is responsible for designing, coordinating and maintaining the infrastructure used by an organisation to deploy, run and scale their software.
Here is a small story for more context:
What is Infrastructure Engineering?
Imagine you invented a new type of tasty chocolate through some chemical experimentation. You’re sure that this chocolate would change the food industry forever and you’ll be a billionaire once you launch it in the market.
You bought a factory, cooking machines, and chocolate material in bulk, and hired some workers to do the baking. You started production on a large scale. By the way, is chocolate baked or cooked? Never mind.
But on the first day of the market launch, you suffered a huge loss because no one was buying your chocolate. To figure out what went wrong, you opened one chocolate packet to taste it, and to your surprise, it tasted nothing like what you invented. What caused this? Is the recipe wrong? Is it the workers’ fault? Or is there something wrong with your factory’s infrastructure (i.e. the machines and chocolate material)?
This is how crucial the underlying infrastructure is to the success of any idea. It doesn’t matter how tasty your chocolate is on paper or in the kitchen: if you cannot produce it at scale with minimal difference between kitchen-cooked and factory-cooked chocolate, you can’t become rich. To produce perfect chocolate, you need to build an efficient infrastructure for your factory, and to build and operate that infrastructure, you need to hire engineers.
The same goes for software and web applications. Would you have used Google if it took 2 minutes every single time you searched for “nearest hospital”? That’s why infrastructure is critical to the success of an organization, and it is the reason most organizations are moving to the cloud (see Google Cloud and AWS), where they outsource the design and implementation of hardware infrastructure (server racks, networking, hardware load balancing) to the best players in the industry.
I got interested in infra when I was learning how to design systems. The phrase “building a scalable and reliable system” used to bring a sparkle to my eyes. The opportunity to explore this field came knocking at my door when I cracked the interviews at a startup and got the option to join either the applications team or the infrastructure team, and the decision was a no-brainer.
When I started working as an infrastructure engineer, I was largely unfamiliar with infra-oriented terminology. My manager slash mentor did his best to guide me in the right direction. Before starting the internship, I knew that this experience would be different from traditional software engineering, but I was not sure in what way. I couldn’t find sources that told me what to expect or what framework I should adopt to perform better in this field.
So, after working as an infra engineer for the past 6 months, I wrote this article for anyone who’s starting in the same field or is interested in knowing what challenges one faces and what learning goals one should set. Although 6 months is a very short time to form opinions or give advice, my motive in writing this essay is to suggest a framework that gives you a head start before you begin work as an infra engineer. I believe this article will be a good read if you’re a noob like me 🙂
Disclaimer: My work mainly revolved around microservices and cloud, therefore most of the examples are cited from these particular domains.
I’ve tried my best to define the framework to be stack agnostic. Therefore, this essay won’t list the topics you should learn or the courses you should take.
Anyway, let’s get back to the topic. Here is a list of a few pointers to keep in mind if you’re a noob infra engineer like me:
1. Understand modern cloud application & systems
In the very first month, I was deploying systems to the cloud and I had very little idea what I was doing. I was dealing with tools like Kubernetes (k8s), the gcloud/aws CLIs, load balancers, Docker, etc. These tools dominate modern cloud applications, and a good understanding of basic CS concepts helps you grasp the ideas underlying all of them.
While operating such systems, we spend a considerable amount of time making them run. Remember the first time you ran an open-source project on your local machine? Well, you better get good at these sorts of things. Then there are different ways to run tools: run a binary, build the tool from source or just run a simple docker image.
Some of the challenges that I faced involved running a local Kubernetes cluster, deploying a platform that I was working on to a k8s cluster in Google Cloud (took me a few days), and running a time-series database in the cloud (took me weeks due to an error in mounting a volume).
There was a time when I was deploying systems ~20 times a day and then monitoring their stability, performance, and reliability. A good understanding of cloud apps and systems gradually helped me speed up this process.
2. Develop a sense of understanding and comparing software solutions
As an infra engineer, you make critical decisions that have a considerable impact on the organization’s future. Those decisions include choosing which third-party software to use for specific needs, for example which automation tool to use for CI/CD, which load balancer to use for traffic, or which database to use for storing logs and metrics. Whenever I made such decisions, I received extensive feedback on what should be considered.
Here are some rough parameters that should be taken into account while making decisions:
- Does the software have good community adoption? (Is it being used by developers and engineering organizations?)
- Does it scale? (As the load increases, does its performance deteriorate?)
- Is it stable? (We don’t want all hell to break loose in production)
- Do the maintainers roll out new features at regular intervals?
- Is the solution being maintained and developed by a good organization?
- Is this an efficient solution in terms of resource usage (CPU, memory, disk, and network)?
- Would this require extra installation complexity?
I had to answer all the above questions whenever I made a case to adopt a particular app or tool. I worked on a requirement that had 7-8 potential candidates, and the list dropped to 2-3 because we didn’t want extra installation complexity or high CPU and memory usage. It’s wise to choose a particular version of a tool not based on how many features it offers but on how stable it is. This was a revelation for me because I used to think that the latest is the best.
Currently, I’m working on getting better at reading and understanding software release notes.
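One way to keep such comparisons honest is to turn the checklist above into a small weighted score. The sketch below is only an illustration: the criterion names, the weights, the 1-to-5 rating scale, and the two imaginary tools are all assumptions made up for this example, not part of any real tool or process.

```python
from typing import Dict, List

# Hypothetical weights (an assumption for illustration): a higher number
# means the criterion matters more when comparing candidates.
WEIGHTS: Dict[str, int] = {
    "community_adoption": 2,
    "scalability": 3,
    "stability": 3,
    "release_cadence": 1,
    "maintainer_quality": 2,
    "resource_efficiency": 2,
    "install_simplicity": 2,
}


def score(ratings: Dict[str, int]) -> float:
    """Return the weighted average of 1-5 ratings; unrated criteria count as 0."""
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(weight * ratings.get(name, 0) for name, weight in WEIGHTS.items())
    return weighted_sum / total_weight


def rank(candidates: Dict[str, Dict[str, int]]) -> List[str]:
    """Return candidate names ordered from highest to lowest score."""
    return sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)


if __name__ == "__main__":
    # Made-up ratings for two imaginary tools.
    candidates = {
        "tool_a": {name: 4 for name in WEIGHTS},
        "tool_b": {**{name: 5 for name in WEIGHTS}, "stability": 1, "install_simplicity": 1},
    }
    for name in rank(candidates):
        print(name, round(score(candidates[name]), 2))
```

With these made-up numbers, the solid-everywhere tool_a outranks tool_b, which rates higher on most criteria but poorly on stability and installation. The weights are the part worth arguing about with your team; the arithmetic is trivial.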
3. Learn to measure performance
Another important area that infra engineers spend their time on is evaluating the performance of tools, services, and systems. Although third-party tools like caches and databases publish rough performance numbers, that should not stop you from doing your own testing. Production environments are far from ideal. There, different tools interact with different components, and the true picture (in terms of performance and stability) remains unclear until you run your system in production and people start using it.
Everything looks efficient on paper until all hell breaks loose in prod. To cite an example, I worked with a database that promised faster queries, but when I deployed it in the cloud, it showed 10x the expected latency. I debugged the issue for ~15 days, only to realize that the problem lay in a different tool that sat between the user and the database. I only realized this after I isolated the database and ran a load test on it.
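To make “do your own testing” concrete, here is a minimal sketch of timing an operation repeatedly and summarizing the results. It is a sequential timing loop, not a full load test, and `fake_query` is a stand-in I made up; you would replace it with a real call to the system you are measuring. The 95th percentile uses the nearest-rank method, which is one of several common definitions.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def measure_latency_ms(operation: Callable[[], object], runs: int = 100) -> List[float]:
    """Call operation() `runs` times and return each call's latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Return the median, 95th percentile (nearest-rank), and maximum latency."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integer arithmetic.
    rank_95 = (95 * len(ordered) + 99) // 100
    return {
        "p50": median(ordered),
        "p95": ordered[rank_95 - 1],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Stand-in workload (an assumption); swap in a real query against your system.
    def fake_query() -> None:
        time.sleep(0.001)

    print(summarize(measure_latency_ms(fake_query, runs=50)))
```

Looking at p95 and max alongside the median matters because an average can hide the slow requests that users actually complain about.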
4. Don’t just fix the errors, learn from them
This is an important point when you are aiming to maximize learning. Whenever I solve an issue, I do a ton of Google searches and try all the different options until I either run out of options or the issue is resolved. It’s very tempting not to look at the actual cause of an issue after it’s resolved, but try to resist that temptation. Evaluate all the solutions that you tried, figure out what worked and what didn’t, and then document the findings for future reference. Trust me, this process works like a charm.
5. Master the art of isolating the problem
While operating systems at scale, infra engineers encounter many issues. Once you face an issue, you come up with a probable cause or a list of bottlenecks. The faster you identify the root cause, the more time you save. In my experience, three things help engineers solve an issue: the first is a smart Google search 😛, the second is historical knowledge, and the third is the art of isolating the problem.
The art of isolating a problem helps you reach the root cause faster. It shrinks the space of probable causes by discarding them on reasonable grounds, one at a time.
I faced an issue where the load balancer talked to a visualization tool, and the visualization tool talked to the database and a monitoring tool. There were performance issues with the database, and I assumed the problem lay within the database. After debugging for approx. 10 days, I eventually added the load balancer and the visualization tool to the list of probable causes, and it turned out that the visualization tool was the culprit. Those 10 days could easily have been a few hours if I had been better at isolating the problem. There’s no doubt that one gets better at this with time, but always keep an eye out for issues where you need this skill.
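One simple way to isolate a slow component in a chain like this is to send the same request at each layer, starting from the innermost one, and look at how much latency each layer adds. The sketch below assumes you can hit each component directly and that the components form a simple chain; the component names and numbers are made up for illustration and are not real measurements.

```python
from typing import Dict, List, Tuple


def added_latency_per_hop(measurements: List[Tuple[str, float]]) -> Dict[str, float]:
    """Estimate how much latency (in ms) each component adds.

    `measurements` is ordered from the innermost component outwards. Each
    latency is measured by sending the same request directly to that
    component, so it includes everything behind it.
    """
    contributions = {}
    previous = 0.0
    for name, latency_ms in measurements:
        contributions[name] = latency_ms - previous
        previous = latency_ms
    return contributions


def likely_culprit(measurements: List[Tuple[str, float]]) -> str:
    """Return the component that adds the most latency."""
    contributions = added_latency_per_hop(measurements)
    return max(contributions, key=contributions.get)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the idea.
    measurements = [
        ("database", 20.0),
        ("visualization_tool", 520.0),
        ("load_balancer", 530.0),
    ]
    print(added_latency_per_hop(measurements))
    print(likely_culprit(measurements))
```

In this made-up case the database itself answers quickly, and almost all of the delay appears once the request goes through the visualization tool, which is the kind of signal that would have pointed me away from the database much earlier.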
6. Learn vim
As an infra engineer, I spent around 60% of my time in the terminal. Oftentimes I had to exec into a container or ssh into a virtual machine, and the only text editors I had at my disposal were vim or nano. I needed to add code to a file, run it, edit existing files, etc., and I didn’t have the option of doing that in a fancy text editor (like VS Code or Sublime). This is where vim comes into the picture. It’s quite powerful, and it would not be an overstatement to call vim as powerful as a programming language. It has its own commands, some intriguing shortcuts, and tons of cool features, which is how it helps to optimize many repetitive text-editing operations.
The learning curve of vim is a bit steep, and I’ve struggled to memorize commands and shortcuts too, but once you get acquainted with it you can save tons of hours. I’m still a noob when it comes to using vim, but it’s a work in progress.
7. Take security seriously
The internet is an evil place where millions of attackers are always ready to exploit the vulnerabilities in your system, so security should always be a priority, especially for infra engineers. I learned this the hard way: I almost pulled off the equivalent of “an intern deleting the production database” when I exposed a port of a cloud virtual machine to the internet, and the machine almost got cryptojacked.
As infra engineers, we deploy many apps in the cloud and then expose them to the internet so that they are accessible. We need to be aware of security policies and best practices to minimize the impact if our system gets attacked. For example, it’s not wise to run the process inside a Docker container as the root user. If the container runs as a non-root user, then even if an attacker takes control of it, they cannot do as much damage, because most of the lethal commands require root access.
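The usual place to switch a container to a non-root user is the image itself (the Dockerfile `USER` instruction). As an extra safety net, an app can also refuse to start when it finds itself running as root. Here is a minimal sketch of such a startup check for Unix-like systems; the function names are my own, and whether a hard failure is the right behavior depends on your setup.

```python
import os
from typing import Optional


def is_root(euid: int) -> bool:
    """On Unix-like systems, effective user ID 0 is root."""
    return euid == 0


def ensure_not_root(euid: Optional[int] = None) -> None:
    """Raise PermissionError if the process would run with root privileges."""
    if euid is None:
        # os.geteuid() is available on Unix-like systems only.
        euid = os.geteuid()
    if is_root(euid):
        raise PermissionError("refusing to run as root; start the process as a non-root user")


if __name__ == "__main__":
    ensure_not_root()
    print("running as non-root user", os.geteuid())
```

Calling `ensure_not_root()` at the top of an entry point makes a misconfigured image fail loudly at startup instead of quietly running with more privileges than it needs.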
8. Document, Document & Document
This is true for engineers, no matter the field. The amount of value you get by documenting what you work on and learn is immense. I regret not documenting anything that I learned during a previous 2-month internship. As a result, I don’t remember most of the things I built there.
Documentation helps you to present your story: the challenges you faced, the new findings, and how you solved problems. I made notes of everything that I did during my Internship as an infra engineer, and now whenever I look at those notes, everything that I learned and experienced rushes back to my mind at one glance.
9. Designing and understanding the system at scale
This is one of the cool skills that an infra engineer generally possesses. I came across many system diagrams of various tools, and it’s always cool to understand why they are designed the way they are. Knowledge of systems helps you answer questions like: monolith or microservices, pull or push, synchronous or asynchronous processing, cache or database?
Understanding multiple systems helps you develop a broader sense of things, and gradually you become capable of designing a system of your own. A system is designed by multiple people, but it’s difficult to test its scalability, performance, and reliability without an infra engineer 😉
Conclusion
I’m quite positive that if you adopt this framework from day one, you’ll perform a lot better at your job. Then you’ll learn lessons of your own, and the process goes on. Identifying these points doesn’t mean that I have mastered them, but I believe that identifying them is the first step towards getting better at infrastructure engineering.
The real reason I like infra engineering is that it draws on the concepts that made me interested in computer science in the first place (like OS, databases, computer architecture, networking, etc.). If you are also interested in these concepts and this essay sparks your interest in infrastructure engineering, then maybe it’s something you should explore.