At the company I work for, the software we develop is still mostly used on premises. In a world where cloud is the future, we are slowly making parts of our software available as a cloud service. This involves a huge shift in our way of working, as we now become responsible for operations ourselves. We have very limited experience with this: until now we have only had to maintain our own infrastructure, running the programs we use internally. Keeping our services available for our customers around the clock requires a completely different approach.
Doing something new always brings the possibility of failing. Since operations will become a critical part of the software, we should do everything we can to minimize the chance of failure. The good thing is that we are entering this world pretty late: many other companies already provide cloud services and have gained experience in doing so. A hot topic of the past few years is DevOps, which is meant to increase the efficiency of both development and operations, minimize conflicts and decrease time to market. Although it is still evolving and discussions (such as whether development should get access to the production environment) are still raging, DevOps is something that should not be ignored.
The lack of experience with operations must have scared management: instead of doing DevOps and keeping (or attracting) expertise in-house, they have decided to partner with an external company that will handle our operations. While this is a quick fix to make sure that our regular development activities are not interrupted too much by operations, it can be dangerous to place all this responsibility with an external company:
- They become a dependency of development to release new versions.
- They have no knowledge of the software, what can go wrong and how to fix it.
I am open to any type of change, and I don’t believe this arrangement has to be bad. But let’s look at the deal we have with them:
- Development manages the development environment.
- They manage the staging and production environments.
- Development has no admin access to staging or production.
- Development sets up a first version of the infrastructure, which is cleaned up by them for staging and production.
This separation of responsibility will lead to difficult situations, because you want all environments to be the same. There is no point in running something in the development environment if its setup is completely different from production; things could still go terribly wrong after deployment. This means that every change we make in development has to be applied by them in staging and production, and every change they make in staging and production has to be applied by us in development. This wouldn’t be much of a problem if it wasn’t for the fact that they are not familiar with the concept of branches. Instead, they use a separate tree structure for every environment, which makes carrying changes over a lot harder than it should be.
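One way to make differences between environments visible is to compare their settings mechanically rather than by hand. The following is a minimal sketch in Python, assuming each environment's configuration can be flattened into a simple key/value mapping. The function name `config_drift` and the sample settings are hypothetical and only serve as illustration; they do not describe our actual setup.

```python
def config_drift(left, right):
    """Return {key: (left_value, right_value)} for every key that differs.

    A key that is missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(set(left) | set(right)):
        left_value = left.get(key)
        right_value = right.get(key)
        if left_value != right_value:
            drift[key] = (left_value, right_value)
    return drift


# Hypothetical settings, purely for illustration.
development = {"db_pool_size": 10, "config_dir": "/data/config", "tls": True}
production = {"db_pool_size": 50, "config_dir": "/data/config"}

for key, (dev_value, prod_value) in config_drift(development, production).items():
    print(f"{key}: development={dev_value!r} production={prod_value!r}")
```

A report like this does not solve the problem of who applies which change, but it at least shows both parties where the environments have diverged.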
Because we are blind in production, it will be hard for us to fix any problem that arises. Since they have no knowledge of the software we write, they will have to ask us to find the cause, which we are unable to do without access. This has already occurred, and we aren’t even live yet. The production environment was not coming up as it should and showed a lot of errors. We were asked to figure out what was wrong with nothing more than a log, which is like looking at a car from the outside and trying to find out why it won’t start. It is not possible unless you pop the hood and start looking and feeling inside.
Eventually we did figure out what was wrong: a disk was not mounted properly. Because of this, the software could not access crucial configuration files it needs to work properly. Even in these hard circumstances we managed to find the problem, and given the nature of the cause it was clearly something we could not fix. It is a problem with the infrastructure and should be the responsibility of the external company. But apparently we had to fix it as well. Fine, but it wasn’t causing any problems in development, and we aren’t allowed to update production. Are we expected to ‘fix’ it in development and ask the external company to try the new version? That sounds like a very bad way of working to me. We used a dirty hack to get things working, after which the external company discovered they had done something wrong in one of their scripts.
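A failure like this is easier to diagnose from a log alone when the software checks its own prerequisites at startup and says plainly what is missing. Below is a minimal sketch in Python of such a preflight check, under the assumption that the required configuration files are known in advance. The function name `missing_required_files` and the path `/data/config/app.ini` are hypothetical; this is not the hack we actually used.

```python
from pathlib import Path


def missing_required_files(paths):
    """Return the entries of `paths` that cannot be opened for reading."""
    missing = []
    for raw in paths:
        path = Path(raw)
        try:
            # Opening the file catches absent files, unmounted disks
            # and permission problems in one go.
            with path.open("rb"):
                pass
        except OSError:
            missing.append(str(path))
    return missing


# Hypothetical file list; the real paths depend on the application.
REQUIRED = ["/data/config/app.ini"]

problems = missing_required_files(REQUIRED)
if problems:
    print("Cannot start, unreadable configuration files:", ", ".join(problems))
```

A single clear message at startup would have pointed at the infrastructure straight away, instead of leaving a long trail of follow-up errors to interpret.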
In my opinion this way of working will cause more problems than it solves. The responsibilities are not clear. Is the external company only responsible for monitoring and updating the staging and production environments? Is it up to the development team to actually fix everything as soon as something starts failing? That sounds like a bad deal for the development team, if you ask me. This will inevitably lead to more friction between development and operations.