
dft street manager spike 0013 k8s aws


AWS Self-managed K8s

This ODP describes the design, implementation and learnings from spike STREEWORK-198.

High-Level PoC Design

(Diagram: Street Manager high-level data model)

Internet publishing approach

To expose services to the public, Kubernetes uses Ingresses. This approach uses a single cloud-provided load balancer and Nginx-as-reverse-proxy pods. In that scenario we are able to use Nginx to its full potential (i.e. TLS termination, header modifications, redirects, etc.).
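
For illustration only, a minimal Ingress manifest of this kind might look as follows (the application, host and secret names are placeholders, not taken from the spike):

```yaml
# Hypothetical Ingress handled by the Nginx ingress controller pods behind a
# single cloud-provided load balancer; TLS is terminated at Nginx and plain
# HTTP is redirected to HTTPS.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: street-manager-web                             # assumed service name
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-redirect: "true"   # force HTTP -> HTTPS
spec:
  tls:
  - hosts:
    - web.example.com                                  # placeholder hostname
    secretName: web-example-com-tls                    # cert/key held in a Secret
  rules:
  - host: web.example.com
    http:
      paths:
      - path: /
        backend:
          serviceName: street-manager-web
          servicePort: 80
```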

Another approach is to publish services directly using cloud-provided load balancers. The downside is the lack of control over Layer 7.

Implementation Tooling

Terraform will be used to provision most of the necessary components. It is a well-proven tool which enables an Infrastructure-as-Code approach.

The Kubernetes cluster will be provisioned and maintained using the KOPS tool. KOPS uses an S3 bucket as a backend to store configuration. It also allows exporting and importing the cluster configuration to/from YAML files, as described in its documentation. All changes are deployed gradually using a rolling deployment approach and verified afterwards.

This approach allows us to do proper code reviews, test and promote changes, version the configuration using source control (i.e. Git) and deploy small changes quickly without any service downtime.

K8s cluster reconfiguration / scale-up / scale-down approach

As the configuration will be stored as code in Git, a WebOps engineer will create a branch, make the necessary changes, push the local branch to the origin server, open a Pull Request and assign it to a senior team member. After code review and passing tests, the senior team member will merge the pull request to the master branch. The CI server will deploy the delta.
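
As a sketch of what such a delta might look like, scaling the worker pool would typically be a change to the KOPS InstanceGroup manifest held in Git (the cluster name, instance type and sizes below are assumed values):

```yaml
# Hypothetical KOPS InstanceGroup manifest; editing minSize/maxSize on a branch
# and merging it is the kind of small change CI would roll out.
apiVersion: kops/v1alpha2                  # apiVersion may differ between KOPS releases
kind: InstanceGroup
metadata:
  name: nodes
  labels:
    kops.k8s.io/cluster: k8s.example.com   # placeholder cluster name
spec:
  role: Node
  machineType: m4.large
  minSize: 3                               # scale the worker pool by editing these values
  maxSize: 5
  subnets:
  - eu-west-2a
```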

Application rolling-update approach

Unless specified otherwise, Kubernetes uses a rolling update strategy by default. Any update to a Deployment resource (i.e. a container image update) will be treated that way.

All configuration will reside as YAML files in Git. This approach allows us to do proper code reviews, test and promote changes, version the configuration and deploy/promote releases quickly without any service downtime.
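
A minimal Deployment sketch of this, assuming a hypothetical street-manager-api application and registry:

```yaml
# Hypothetical Deployment: changing spec.template (e.g. the image tag) triggers
# the default rolling update, replacing pods a few at a time.
apiVersion: apps/v1                # apps/v1beta2 on pre-1.9 clusters
kind: Deployment
metadata:
  name: street-manager-api         # assumed application name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: street-manager-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # keep full capacity during the rollout
      maxSurge: 1                  # add one extra pod at a time
  template:
    metadata:
      labels:
        app: street-manager-api
    spec:
      containers:
      - name: api
        image: registry.example.com/street-manager-api:1.2.3   # bump the tag to release
        ports:
        - containerPort: 8080
```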

Application scale-up & scale-down approach

Application scaling can be done using the Horizontal Pod Autoscaler based on various metrics. The most common metric is CPU usage, although it is well known to be less effective than latency.
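
For example, a CPU-based autoscaler targeting the hypothetical Deployment above could be declared as follows (latency-based scaling would instead need the autoscaling/v2beta1 API plus a custom metrics adapter):

```yaml
# Hypothetical HorizontalPodAutoscaler scaling between 3 and 10 replicas
# to keep average CPU utilisation around 70%.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: street-manager-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: street-manager-api
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
```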

Required CICD tooling, code

CircleCI/Travis is sufficient, as long as we can configure kubectl credentials for interaction with the Kubernetes clusters. Both solutions are cloud-based and well integrated with GitHub.
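
One way to provide those credentials, sketched here with placeholder names, is a kubeconfig stored as a CI secret and referenced by the build job:

```yaml
# Hypothetical kubeconfig for the CI jobs; the cluster name, API endpoint and
# user are placeholders, and the token would come from a dedicated
# ServiceAccount rather than a personal credential.
apiVersion: v1
kind: Config
clusters:
- name: k8s.example.com
  cluster:
    server: https://api.k8s.example.com
    certificate-authority-data: <base64 CA certificate>
users:
- name: ci-deployer
  user:
    token: <ServiceAccount token>
contexts:
- name: ci
  context:
    cluster: k8s.example.com
    user: ci-deployer
current-context: ci
```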

We might need to build and maintain a small Jenkins instance to use Terraform more efficiently.

FAQ

Q: How would we automate the testing & deployment of our apps - ie what are the industry/community-recommended approaches for CICD within this domain?

The easiest way to accomplish this would be to use containers. The Docker suite contains the docker-compose tool, which can be used to run whole applications (with mocks if required) or parts of them.

It is much easier to implement a proper development process and set up pipelines within any CI/CD tool, because once the app is containerised you can use the same container in every environment.
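
As an illustration, a docker-compose.yml of this kind might run the application container alongside a database it depends on (service names, image and credentials below are assumptions):

```yaml
# Hypothetical docker-compose.yml: the same application image built here is the
# one later pushed to the registry and deployed to Kubernetes.
version: "3"
services:
  api:
    build: .
    image: registry.example.com/street-manager-api:local
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://postgres:postgres@db:5432/postgres
    depends_on:
      - db
  db:
    image: postgres:9.6
    environment:
      POSTGRES_PASSWORD: postgres
```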

Q: What might the developer experience and workflow look like? Will developers be able to have full access to investigate and resolve CICD pipeline issues, for example? Will they be able to define and deploy apps to dev environments with zero ops involvement?

Developers will be able to build and run containers locally.

Once their code is pushed to the remote repo / merged to the proper branch, CI/CD will build and publish (upload to a private repo) the container.

CI/CD will take care of development deployments. For any other deployments (qa, demo, stage, production, ...) we will be using Helm.
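
In that model the chart stays the same and each environment gets its own values file; a hypothetical values-production.yaml (the keys depend entirely on how the chart is written) could be applied with something like `helm upgrade --install street-manager ./chart -f values-production.yaml`:

```yaml
# Hypothetical Helm values override for one environment.
replicaCount: 5
image:
  repository: registry.example.com/street-manager-api
  tag: "1.2.3"
ingress:
  enabled: true
  host: web.example.com
resources:
  requests:
    cpu: 250m
    memory: 256Mi
```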

Q: How much of the solution can be covered by IaaC tooling? What gaps are there? What tools/approaches are available to plug those gaps?

The AWS API is 100% covered by Terraform. For Kubernetes cluster management we will initially use KOPS (with an option to use only Terraform later on).

Q: What would the patching process look like? How could this be achieved without impacting service?

The application patching process is no different from any usual release.

Server patching is possible via a rolling update approach with either Terraform or KOPS. It is also possible to build our own customised images using Packer. Ad-hoc patching can also be done via SSH.

Q: Are internal services deployed with TLS enabled? If not, is it an option?

Initially, no TLS certificates are deployed.

Enabling TLS for internal services might also require deploying internal-nginx-rev-proxy containers. Certificates can be provided via the built-in secret storage.

For externally-exposed services, TLS can be managed using kube-lego or provided via the built-in secret storage.
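
A certificate provided via the built-in secret storage is simply a kubernetes.io/tls Secret referenced by the Ingress; with kube-lego the Secret is populated automatically once the Ingress carries the kubernetes.io/tls-acme annotation. A sketch of the manual variant (name and data are placeholders):

```yaml
# Hypothetical TLS Secret; the Ingress references it via spec.tls[].secretName.
apiVersion: v1
kind: Secret
metadata:
  name: web-example-com-tls
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded certificate>
  tls.key: <base64-encoded private key>
```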

Q: Is it possible to use acls/security groups to whitelist trusted IPs/ports at the perimeter? eg locking down cluster/apps in dev environments to Kainos IPs only

Kubernetes API and SSH access can be restricted to certain CIDRs.

Additional security groups can be attached, as well as ACL sets when using a non-default CNI.
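
In KOPS the API and SSH restrictions are part of the Cluster spec; a fragment with placeholder CIDRs might look like this:

```yaml
# Hypothetical fragment of the KOPS Cluster manifest restricting the Kubernetes
# API endpoint and SSH to trusted ranges (addresses are placeholders).
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  name: k8s.example.com
spec:
  kubernetesApiAccess:
  - 203.0.113.0/24      # e.g. Kainos office range
  sshAccess:
  - 203.0.113.0/24
```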

Q: Can the service provide fully automatic auto-scaling (and auto-healing)?

Auto-healing is implemented by design. Additionally, pod healthchecks will be implemented.
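
A sketch of such healthchecks on the application container, assuming a hypothetical /healthz endpoint (the path and timings are placeholders):

```yaml
# Hypothetical pod spec fragment: the readiness probe gates traffic to the pod,
# the liveness probe lets Kubernetes restart (auto-heal) a wedged container.
containers:
- name: api
  image: registry.example.com/street-manager-api:1.2.3
  readinessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 20
```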

Application auto-scaling can be implemented using Horizontal Pod Autoscaler.

Cluster auto-scaling can be implemented using Cluster Autoscaler.
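
The Cluster Autoscaler runs inside the cluster and adjusts the AWS Auto Scaling group that KOPS created for the worker pool; a fragment of its Deployment, with assumed bounds and group name, might look like this:

```yaml
# Hypothetical container fragment from the cluster-autoscaler Deployment;
# --nodes takes min:max:ASG-name and the group name here is a placeholder.
containers:
- name: cluster-autoscaler
  image: k8s.gcr.io/cluster-autoscaler:v1.1.0
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=3:10:nodes.k8s.example.com
  - --skip-nodes-with-local-storage=false
```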

Q: Will the 'frontend' and 'backend', connected via a 'Service' object, allow us to deploy apps in a multi-tier, public frontend /private backend fashion?

Yes. Unless defined otherwise, every deployed app is private.
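
In practice the backend is exposed only through a ClusterIP Service (the default), reachable by the frontend pods but never published via an Ingress; a minimal sketch with placeholder names:

```yaml
# Hypothetical backend Service: ClusterIP keeps it private to the cluster,
# while the frontend is published through the Nginx Ingress shown earlier.
apiVersion: v1
kind: Service
metadata:
  name: street-manager-backend
spec:
  type: ClusterIP
  selector:
    app: street-manager-backend
  ports:
  - port: 80
    targetPort: 8080
```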

Q: Will the private-via-bastion network layout cause us issues when publishing services to the internet?

No. This is implemented only for better control over SSH access. Additionally we will get NAT Gateways with Elastic IP addresses attached.

Q: How does outbound internet access work, eg for third party API calls? Do outbound calls originate from a fixed IP, to allow IP whitelisting on the remote side?

See above: outbound traffic leaves via the NAT Gateways, so it originates from their fixed Elastic IP addresses, which can be whitelisted on the remote side.

Q: What technical gaps, limitations or technical challenges were encountered with the technology, tooling or process, during the spike?

None.
