How to "open source" cloud operations
Open source has become the defining way of developing software. One major factor in its success is that, if you have a problem with the software, you can peek into the details of the code, search the issue tracker, ask for help, and maybe even provide a fix. Most users never write a line of code, yet the mere fact that everything is open helps them all the same.
Today, in the age of cloud computing, we consume services that we expect to just work, and our applications are a complex mesh of those services. Software has to be configured on demand with elasticity, resilience, security, and self-service in mind. So the implementation and operation of those services, or what some call “the cloud”, has become just as complicated.
Now if open source made software great, how do we “open source” an implementation or the operation of something?
By definition, it is always different: there is no single binary that gets deployed multiple times; instead, each one is an implementation of a procedure, a process. The same goes for operations: it is all the live data of metrics, logs, and tickets, and how the software and the operations team react to it. So all implementations of a cloud, be it a large-scale proprietary public service or an on-premise private cloud, are snowflakes. Yes, best practices exist and there are excellent books. But still, you can’t “git clone cloud” or “rpm -i cloud”.
So, maybe we can open up what it takes to stand up and operate a production-grade cloud. This must include not only architecture documents, installation instructions, and configuration files, but also all the data produced along the way: metrics, logs, and tickets. You have probably heard the AI mantra that “Data is the new Gold” multiple times already, and there is some deep truth in it. As we have seen, software is no longer the differentiating factor. It’s the data. Dashboards, post-mortems, chat logs - everything. Basically, public read-only access to all of it, and signing up for read-write access should be easy. Lowering the barrier to access was the winning factor of open source, so let’s lower the barrier to peeking into the back office of the cloud as well.
This opens up a slew of new opportunities. Suddenly we can create a real operations community. The current operations communities center either on a particular piece of technology, like the great Prometheus monitoring community, or on a certain approach to operations, like the Site Reliability Engineering (SRE) methodology. By “real”, I don’t mean to dismiss the existing ones, but to bring the conversation down from the meta-level to the real world, where you can touch things. If you can’t log into it, it does not exist.
We can also extend the community to people who operate their own clouds. Those DevOps practitioners can watch and learn how a cloud is operated, then contribute by sharing their opinions on architectural decisions or their internal practices. They may even get involved in operating bits of the open cloud. It’s the same progression as with open source projects.
Then there’s a term called “Shift-Left”, which says that we should involve testing really early in the development cycle, that is, move it left in the process. This is already done with unit and integration tests. No line of code gets merged if it does not pass the tests. But what about operations?
At Red Hat we coined the term “Operate First” for this. It is similar to “Upstream First”, where we strive to get every line of code into an upstream project before we ship it in a product. In Operate First, we want the software to be run in an operational context by the group that develops it. And since we develop mainly in open source communities, this extends our open cloud to another group of people, the engineering community. The very authors of the code can be pulled into an incident ticket about a misbehaving piece of the cloud. Not only does this increase the probability of getting the incident closed quickly, but it also exposes the software developers to the operational context of their brainchild. Maybe they come back later and just watch how their software is being used, and make future design decisions based on what they see in operations. The next level would be to try out new features in bleeding-edge alpha versions of a particular service and get real workloads instead of fake test data.
And speaking of data brings us to the next audience of an open cloud: the research and AI community. AIOps is another term that is used frequently - and to be honest, it is as nebulous as the term “cloud” was a decade ago. To me, it means augmenting IT operations with the tools of AI, which can happen on all levels, starting with data exploration. If a DevOps person uses a Jupyter notebook to cluster some metrics, I would call it an AIOps technique. And since the data is available in the open cloud, it should be pretty easy. But the road up to the self-driving cluster is paved with a lot of data, labeled data. You will find large data sets of images that are labeled as a cat, but try to find data sets of clusters that are labeled with incidents. Creating such data sets and publishing them under an open license will spark the interest of AI researchers, because suddenly we can be more precise about a problem; we can be data-driven. We can try to predict an outage before it happens. And once a model is trained and tested against the test data, the open cloud lets us go one step further.
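To make the “cluster some metrics” idea concrete, here is a minimal sketch of what such a notebook cell might look like. Everything in it is an assumption for illustration: the sample values are made up (normalized CPU and memory utilization per node), the function names are hypothetical, and plain k-means is only one of many possible techniques. It uses only the Python standard library so that it stays self-contained; in a real notebook you would more likely load data from your monitoring system and reach for a data science library.

```python
import random


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def kmeans(points, k, iterations=20, seed=0):
    """Group points (tuples of floats) into k clusters with plain k-means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    centroids = rng.sample(points, k)  # start from k distinct samples
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    labels = [nearest_centroid(p, centroids) for p in points]
    return centroids, labels


# Hypothetical samples: (cpu_utilization, memory_utilization) per node, both in 0..1.
samples = [
    (0.21, 0.30), (0.25, 0.35), (0.19, 0.28), (0.23, 0.33),  # quiet nodes
    (0.91, 0.88), (0.87, 0.93), (0.95, 0.90),                # heavily loaded nodes
]

centroids, labels = kmeans(samples, k=2)
for sample, label in zip(samples, labels):
    print(f"cpu={sample[0]:.2f} mem={sample[1]:.2f} -> cluster {label}")
```

On this toy data the quiet and the heavily loaded nodes end up in separate clusters. Which cluster gets which number depends on the random starting points, so treat the labels as group names rather than as a ranking.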
Researchers can collaborate with the operations team to validate their models against a live target. Operations can then adopt the model to enhance their operational excellence and finally involve software engineering. Because ultimately, you want the model and the intelligence captured in code, right in the software that is being deployed. That software will be deployed in another data center, in another incarnation of a cloud, so it will improve the operational excellence of all the clouds. This brings us closer to a world where the operation of a cloud can be shared and installed, since it’s embedded in the software itself. To get there, we need that feedback cycle and an open source community that involves all three parties (operations, engineering, and research), and we need a living environment to iterate upon.
Sounds like a story from the future? The process has already begun: Red Hat is working with an evolving open cloud community at the Massachusetts Open Cloud to help define an architecture for an open cloud environment where operability is paramount and data-driven tools can play a key role. All discussions happen in public meetings and, even better, are tracked in a git repository, so that we can involve all parties early in the process and trace back how we came to a certain decision, since the decision process is as important as the final outcome. All operational data will be accessible, and it will be easy to run a workload there and to get access to backend data.
If you’re interested in collaborating, join us at https://www.operate-first.cloud
This article appeared first in Red Hat Research Quarterly Issue 2:2