Early this year, two of Yola’s operations engineers in Cape Town, Duncan and Jonathan, organized ScaleConf, a tech conference dealing with scalability issues and the problems faced by those building and running services on the internet today. They felt that these issues needed more exposure, and that the tech community in Cape Town would greatly benefit from such an event.
ScaleConf took place on the 26th and 27th of January in the beautiful Kirstenbosch Botanical Gardens. The conference was a huge success, with nearly 300 attendees and a number of international speakers. As one of the organizers, Jonathan (Operations Team Leader at Yola) gave the introductory talk, an overview of the kinds of issues the conference would cover. You can check out the slides for his talk on SlideShare: Jonathan Hitchcock – “Clearly, I Have Made Some Bad Decisions”.
Jonathan began by talking about how some people think they don’t have the sort of scaling problems the conference would discuss. Perhaps because they don’t have millions of users, they think they don’t need to worry about “scaling”. However, “scaling” is not about how many users you have, but about whether you can handle a rapid increase in traffic, or any other change in the way things normally run. And change always happens; there is nothing you can do to prevent it. People also make mistakes, and you don’t always discover a mistake until much later. Problems are going to occur, and you need to be ready for them when they do, so that they don’t have a catastrophic effect.
Jonathan then walked through some of the things you can do to prepare for these problems:
- Use monitoring tools to watch how your systems behave at all times. This matters because you need to know whether their current state is normal or not. Otherwise, you might think something is a problem when it is not, and waste time trying to “fix” it when the real issue lies elsewhere.
- Store your infrastructure as code. Instead of setting up your systems by hand, write code that sets them up automatically. This way, you can manage your systems with the same tools you already use to manage your source code. When you need a new system, you just run the code, and you have a new one!
- Pay attention to how your code runs, not just what it does. Think about how to get your code onto your servers in a way you can easily repeat again and again, so that updating your systems is trivial and you always know exactly what is deployed where. Too often, people just copy their code onto a server by hand and assume that is sufficient.
- Make lots of little changes instead of one huge change. If you continuously push out new changes, you can see their effect immediately, and if it isn’t desirable, you just undo them. This way, you get constant, instant feedback and know exactly what the consequences of your actions are as you take them. You become comfortable making small mistakes, fix them straight away, and are not afraid to innovate or try new things.
- Test how your systems behave when things go wrong. Things will go wrong one day, and it is much better to be in control when they do, so you can see how your failsafes behave and improve them. Then, when things go wrong and you’re not in control, it won’t matter, because you have already rehearsed the situation and know that your systems can cope.
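To make the monitoring point a little more concrete, here is a minimal sketch of how a check might decide whether a reading is “normal”: it compares the latest value of a metric against the mean and standard deviation of recent readings. The function name `is_normal`, the three-standard-deviation threshold, and the sample response times are all hypothetical illustrations, not something from Jonathan’s talk or from Yola’s own monitoring setup.

```python
from statistics import mean, stdev


def is_normal(history: list[float], current: float, threshold: float = 3.0) -> bool:
    """Hypothetical baseline check: return True if `current` lies within
    `threshold` sample standard deviations of the mean of `history`."""
    if len(history) < 2:
        # Not enough data to establish a baseline yet; don't raise an alarm.
        return True
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return current == baseline
    return abs(current - baseline) <= threshold * spread


# Hypothetical response times (in milliseconds) from the last few minutes.
recent = [120.0, 118.0, 125.0, 122.0, 119.0, 121.0]
print(is_normal(recent, 123.0))  # True: within the usual range
print(is_normal(recent, 480.0))  # False: far outside it
```

A fixed threshold like this is about the simplest baseline possible, and it would misfire on metrics with strong daily cycles. That is part of why the talk stresses watching your systems all the time: you can only judge what is abnormal once you know what normal looks like.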
Jonathan wrapped up by saying that if you have these practices in place, it won’t matter when things inevitably go wrong, because you will have prepared yourself and your systems for exactly that.
We hope you found this insight into what our engineers need to think about interesting!