When we started thinking about the kind of architecture we wanted to build for Playtomic, we knew what we didn't want to end up with: a huge monolith. Whether we call them microservices or just plain old boring services, we wanted a decentralized system that can survive if one or more of its pieces go down.
We also wanted to be able to build and test different parts of the system independently, so different teams could focus on solving the problems that matter to us without having to understand every small detail of the whole system. Being able to compose features by creating and reusing services is key to growing at the pace we currently need.
So we started to think about the different components we would need to put such a system together. Let's look at some of them.
How are we going to build services?
We decided to build our services on Spring Boot, using both Java and Kotlin. Most people on the team had previous experience using it to build monolithic applications, and we thought it would be a good starting point. Spring Boot has proven itself to be a good tool for building services because of the hundreds of frameworks and tools you can use out of the box (security, monitoring, logging, testing, ...). The main problems we've found are startup time and memory footprint. We're currently trying different things to reduce both.
Where are we going to run our stuff?
We need something that lets us deploy services independently and keeps everything up and running continuously without too much pain.
We also don't want to depend too much on manual provisioning every time we come up with a new service (or at least we want to minimize it as much as possible). Docker and Docker Stacks seemed like a great solution for us.
We looked at Docker Swarm and found it simple to set up, easy to understand, and able to scale pretty well. I personally had some experience with Docker Swarm in the past, and although I had some serious problems with stability and quality of service, those problems were mostly due to the system's lack of maturity at that time, so we decided to give it another try.
We set up a Docker Swarm of 3 manager nodes and 4 worker nodes deployed on our own servers. It's been working smoothly so far without any problems, so we're pretty happy with our decision.
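A quick note on why 3 managers is a sensible number. Swarm managers keep the cluster state consistent with Raft, which only makes progress while a majority of managers is reachable. The sketch below is nothing more than that majority arithmetic written in Java. The class and method names are made up for illustration and are not part of any Docker API.

```java
// Illustrative only: majority (quorum) arithmetic for a group of Swarm managers.
// SwarmQuorum and its methods are hypothetical names, not a Docker API.
final class SwarmQuorum {
    private SwarmQuorum() {}

    /** Smallest number of managers that forms a majority. */
    static int quorum(int managers) {
        if (managers < 1) {
            throw new IllegalArgumentException("at least one manager is required");
        }
        return managers / 2 + 1;
    }

    /** Number of managers that can be lost while a majority still remains. */
    static int tolerableFailures(int managers) {
        return managers - quorum(managers);
    }

    public static void main(String[] args) {
        // Print the trade-off for small manager counts.
        for (int n = 1; n <= 5; n++) {
            System.out.printf("managers=%d quorum=%d tolerable failures=%d%n",
                    n, quorum(n), tolerableFailures(n));
        }
    }
}
```

With 3 managers the quorum is 2, so the cluster keeps working if one manager goes down. A fourth manager would raise the quorum to 3 without letting us lose any more nodes, which is why odd manager counts are the usual recommendation.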
For us, Kubernetes seems too complicated for what we need. We're glad that Docker now supports Kubernetes natively alongside Swarm though, so we could take advantage of it in the future if we need to.
We're also using Docker Flow Proxy to route traffic from the different applications and APIs to the corresponding service in the Swarm.
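To give an idea of what this kind of routing boils down to, here is a minimal Java sketch of path-prefix routing, where the longest matching prefix wins. It is not how Docker Flow Proxy is implemented (the proxy is configured through the Swarm services themselves, not through code like this), and the paths and service names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: maps request paths to service names by longest matching prefix.
// PrefixRouter, the paths and the service names are all hypothetical.
final class PrefixRouter {
    // path prefix -> service name
    private final Map<String, String> routes = new HashMap<>();

    void register(String pathPrefix, String service) {
        routes.put(pathPrefix, service);
    }

    /** Returns the service registered under the longest prefix of the path, if any. */
    Optional<String> resolve(String requestPath) {
        String best = null;
        for (String prefix : routes.keySet()) {
            boolean matches = requestPath.startsWith(prefix);
            if (matches && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        return Optional.ofNullable(best).map(routes::get);
    }

    public static void main(String[] args) {
        PrefixRouter router = new PrefixRouter();
        router.register("/v1/bookings", "booking-service");
        router.register("/v1/users", "user-service");
        System.out.println(router.resolve("/v1/bookings/42").orElse("no route"));
        System.out.println(router.resolve("/v1/unknown").orElse("no route"));
    }
}
```

The point of putting a proxy in front is that clients only need to know one entry point, while services can be added, moved or scaled inside the Swarm without clients noticing.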
How can we be sure that everything is working properly?
In all systems, but especially in a distributed one, it's very important to be able to see the whole picture to understand how the system is behaving. We've built a monitoring system based on Prometheus to gather real-time metrics and Grafana to visualize the services and how they are working together.
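Prometheus works by periodically scraping a plain-text endpoint on each service. As a rough illustration of what a service exposes, the Java sketch below keeps a per-service request counter and renders it in the Prometheus text exposition format. It uses only the JDK, and the class, metric and label names are assumptions for the example. In a real Spring Boot service you would normally rely on an existing metrics library rather than hand-rolling this.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: a counter rendered in the Prometheus text exposition format.
// RequestCounter, the metric name and the "service" label are hypothetical.
final class RequestCounter {
    private final String name;
    private final String help;
    // Sorted map so the rendered output has a stable order.
    private final Map<String, LongAdder> byService = new ConcurrentSkipListMap<>();

    RequestCounter(String name, String help) {
        this.name = name;
        this.help = help;
    }

    void increment(String service) {
        byService.computeIfAbsent(service, s -> new LongAdder()).increment();
    }

    /** Renders HELP and TYPE lines followed by one sample line per service. */
    String render() {
        StringBuilder out = new StringBuilder();
        out.append("# HELP ").append(name).append(' ').append(help).append('\n');
        out.append("# TYPE ").append(name).append(" counter\n");
        // Note: label values are not escaped here, so this assumes simple service names.
        byService.forEach((service, count) -> out.append(name)
                .append("{service=\"").append(service).append("\"} ")
                .append(count.sum()).append('\n'));
        return out.toString();
    }

    public static void main(String[] args) {
        RequestCounter requests = new RequestCounter("http_requests_total", "Total HTTP requests handled.");
        requests.increment("booking-service");
        requests.increment("booking-service");
        requests.increment("user-service");
        System.out.print(requests.render());
    }
}
```

Grafana then queries Prometheus for series like these and plots them, which is how we get a view of the services side by side.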
Logging is also very important for quickly finding out where a problem is. We started out running our own ELK stack, but we recently decided to move to logz.io. Setting up our own Elasticsearch cluster and keeping it running in production was too much for us. We felt we were spending too much time on things that are not our focus, so we moved that piece out.
Logz.io also has a great insights feature. Their AI analyses the logs, comparing them against thousands of articles on different sites to give you valuable insights into your system. Using this feature, we've been able to find bugs in production that otherwise would very likely have slipped under our radar.
We know we still have a lot of things to explore and improve but we're pretty happy with our current setup and we wanted to share it with you. What kind of infrastructure are you using? What ideas do you have to improve ours? Let us know in the discussion below.