"Why can't they access the database? Why can't they access servers? You are only trying to make my job harder," I accused him. I was tired of fighting with the security team.
The security team lead stood his ground, saying he was trying to ensure the company could stay in business. I was angry. I needed specific answers about why my team had to keep running a toil-heavy service desk, doing things in production on behalf of developers. He kept saying that giving everyone access to prod was dangerous.
After failing to get an answer to what it means to be safe, I reverse-engineered what I was doing. If we built a system that encoded all the things our team did, everyone would be able to access prod.
Answer the questions security teams don't know how to ask
Asking the security team how to do things will get you in an endless fight. Ask what, not how.
Overloaded security teams deal with policies, compliance, regulations, and more. They won't learn how Kubernetes network plugins provide layer 4 isolation. They will ask you for a separate network, and you'll have to manage a new cluster. If you ask them what the requirement is, they will tell you that you need to prevent someone with access to the servers in one network from accessing the other. You can then explain how container networking can do that. One less cluster to manage.
After years of having this sort of discussion, I realized that when it comes to access, every security requirement boils down to four features. Let's call it the 4A framework to keep the sec vibe.
1. Access: where is it, and how can I get there?
Break-glass scenarios aren't going away.
In theory, folks should use instrumentation and observability tools to build and operate their services. In practice, teams of any size need ad-hoc access to production for break-glass scenarios. So you either embrace it and prepare, or you live with the risks and bottlenecks. The problem with access is complexity. If a developer needs to use four or five interfaces, you have a problem.
Consolidate access in one or two interfaces. Here's how:
- Map the resources. You may use everything AWS, but inside AWS, you may have ELK, Kubernetes, and RabbitMQ. These are different resources.
- Map interfaces. ELK, Kubernetes, and RabbitMQ have different interfaces even though you use them as AWS services.
- Centralize. Combine services into fewer interfaces. Having ten places to go during an outage increases the time to fix the problem, and managing ten interfaces is expensive.
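The three steps above can be sketched as a small inventory. This is a minimal illustration, not a tool: the resource-to-interface mapping, the `access gateway` name, and both helper functions are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical inventory: each resource mapped to the interface
# developers use to reach it today.
RESOURCES = {
    "ELK": "Kibana web UI",
    "Kubernetes": "kubectl",
    "RabbitMQ": "RabbitMQ management UI",
}


def interfaces_in_use(resources):
    """Group resources by the interface used to reach them."""
    by_interface = defaultdict(list)
    for resource, interface in resources.items():
        by_interface[interface].append(resource)
    return dict(by_interface)


def consolidate(resources, gateway, covered):
    """Return a new mapping where every resource in `covered`
    is reached through the single `gateway` interface."""
    return {
        resource: (gateway if resource in covered else interface)
        for resource, interface in resources.items()
    }


if __name__ == "__main__":
    before = interfaces_in_use(RESOURCES)
    after = interfaces_in_use(
        consolidate(RESOURCES, "access gateway", {"Kubernetes", "RabbitMQ"})
    )
    print(len(before), "interfaces before,", len(after), "after")
```

Counting interfaces before and after each consolidation step gives you a simple number to track: how many places someone has to go during an outage.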
2. Authentication: are you you?
SSO is standard, but it is hard to add it to all interfaces. You may add SSO to GitHub and CircleCI in a few clicks. For Kubernetes and databases, depending on how you run them, it will take weeks to configure. In-house tools take weeks or months to develop.
Each of your resources will be at one of these levels. The further down the list, the stronger the authentication, but the more expensive it is to configure and manage.
- Shared password
- Individual users with passwords
- Individual users with passwords + MFA
- Centralized user directory with OpenID or SAML federation
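One way to make the scale above actionable is to record the level of each resource and list the ones that fall below a target. The sketch below assumes made-up assessments for four resources; the enum names and the `below` helper are illustrative, not part of any tool.

```python
from enum import IntEnum


class AuthLevel(IntEnum):
    """The four authentication levels, weakest to strongest."""
    SHARED_PASSWORD = 1
    INDIVIDUAL_PASSWORD = 2
    INDIVIDUAL_PASSWORD_MFA = 3
    FEDERATED_SSO = 4  # centralized directory with OpenID or SAML


# Hypothetical assessment of where each resource stands today.
CURRENT_LEVELS = {
    "GitHub": AuthLevel.FEDERATED_SSO,
    "CircleCI": AuthLevel.FEDERATED_SSO,
    "Kubernetes": AuthLevel.INDIVIDUAL_PASSWORD,
    "database": AuthLevel.SHARED_PASSWORD,
}


def below(levels, target):
    """Return resources whose level is under `target`, weakest first."""
    weak = [name for name, level in levels.items() if level < target]
    return sorted(weak, key=lambda name: (levels[name], name))


if __name__ == "__main__":
    for name in below(CURRENT_LEVELS, AuthLevel.INDIVIDUAL_PASSWORD_MFA):
        print(name, CURRENT_LEVELS[name].name)
```

The weakest resources come first, which is usually where the next piece of work pays off most.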
3. Authorization: should you be doing this?
Complexity increases as you add more controls. For example, you could manually add and remove users in ten places, but it soon takes a significant portion of your team's time. Automating the provisioning of policies and roles inside all these systems is out of reach for most teams. Using something like OPA takes a dedicated ten-person team, so it is only available to Netflix and other large companies.
Here you can go from everyone with root access to fine-grained roles. What is your level on the authorization scale?
- Everyone with root access
- Everyone with full access but non-root
- Everyone reads everything, and only leaders can write
- One role per team accessing only their resources
- One role per user with ad-hoc provisioned permissions to resources related to features, incidents, or bugs they are working on
- Time-based grants of access with permissions based on user context, such as incidents or bugs
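The last level on the scale, time-based grants tied to context, can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions: the `Grant` record, the `grant_access` and `is_allowed` helpers, and the incident ID are all hypothetical, and a real system would also need storage, approval, and revocation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Grant:
    user: str
    resource: str
    actions: frozenset
    reason: str           # context for the grant, e.g. an incident or bug ID
    expires_at: datetime  # access stops being valid at this instant


def grant_access(user, resource, actions, reason, ttl, now):
    """Create a grant that expires `ttl` after `now`."""
    return Grant(user, resource, frozenset(actions), reason, now + ttl)


def is_allowed(grants, user, resource, action, now):
    """True if any unexpired grant covers this user, resource, and action."""
    return any(
        g.user == user
        and g.resource == resource
        and action in g.actions
        and now < g.expires_at
        for g in grants
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    grants = [
        grant_access("ana", "orders-db", {"read"}, "INC-123",
                     timedelta(hours=1), now)
    ]
    print(is_allowed(grants, "ana", "orders-db", "read", now))
    print(is_allowed(grants, "ana", "orders-db", "read",
                     now + timedelta(hours=2)))
```

Because every grant carries a reason and an expiry, nobody has to remember to remove access after the incident is over.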
4. Audit: Track who did what, when, where, and why.
The cherry on the cake. After solving the authentication and authorization problems, you have to monitor what is happening so you get alerts if anything goes wrong.
Managed services charge a lot for audit, and many tools don't have built-in support. It'll take weeks to months to configure and store Kubernetes API audit trails. But the API is only one piece of the problem. How do you track commands executed inside pods? How about commands run inside processes in those pods, like the Rails console? Each layer of the onion has a different resource that needs trails.
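Whatever the layer, it helps if every trail records the same five fields: who, what, when, where, and why. The sketch below shows one possible shape for such a record as a JSON line; the `audit_event` function and the sample values are assumptions for illustration, not the format of any particular tool.

```python
import json
from datetime import datetime, timezone


def audit_event(who, what, where, why, when=None):
    """Serialize one audit record as a JSON line.

    `when` defaults to the current UTC time if not supplied.
    """
    when = when or datetime.now(timezone.utc)
    record = {
        "who": who,      # authenticated user
        "what": what,    # command or API call that was run
        "when": when.isoformat(),
        "where": where,  # resource, e.g. a pod or a database
        "why": why,      # context, e.g. an incident or bug ID
    }
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical values: a Rails console command run inside a pod.
    print(audit_event("ana", "User.count", "pod/web-1 (Rails console)", "INC-123"))
```

Using one record shape across the Kubernetes API, pod shells, and consoles means the trails from different layers can be stored and searched together.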
When you multiply the number of resources by the complexity of adding each of these features, you understand two things: why the security team couldn't answer you and why, many times, the default solution is to block access.