As the environment surrounding IT services keeps changing, balancing speed and stability in software development has become a key concern for development teams. Engineers at Slack, the team communication tool, have described their software deployment flow on the company blog. How do they achieve development that balances speed and stability?
At Slack, development is done using Git, and once a change passes code review and tests, the developer can merge the pull request into the master branch. Deployment from the master branch takes place only during the business hours of the North American headquarters, so that staff are available to respond to unexpected problems. Deploys happen 12 times per day, and each deploy is assigned a person responsible for it. Splitting the work into multiple deploys minimizes the impact of any errors.
When Slack releases software as a new build, it first creates a release branch in Git. The release branch makes it possible to tag the release history and serves as the point for fixing problems found during rollout to production. Once the release branch is created, the new build is deployed to a closed test environment and tested. Builds that pass the tests are evaluated by internal staff, and if no problems are found, the new build is released in stages to 10%, 25%, 50%, 75%, and finally 100% of all Slack users.
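The staged percentages can be illustrated with a minimal sketch. The article does not say how Slack chooses which users receive a build at each stage, so the hash-based bucketing below, along with the names `bucket` and `gets_new_build`, is an assumption made purely for illustration.

```python
import hashlib

# Stage percentages named in the article.
ROLLOUT_STAGES = [10, 25, 50, 75, 100]


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99 (hypothetical scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def gets_new_build(user_id: str, stage_percent: int) -> bool:
    """Return True if this user is inside the given rollout stage."""
    return bucket(user_id) < stage_percent


if __name__ == "__main__":
    for percent in ROLLOUT_STAGES:
        print(percent, gets_new_build("user-42", percent))
```

Because each user's bucket never changes, a user who receives the new build at one stage keeps it at every later stage, and at 100% everyone is included.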
If a problem occurs during a deploy, the person responsible for that deploy directs the response. If the pull request that caused the problem is found early, the faulty part can be fixed and a new build created; if the problem appears after deployment to the production environment, the deploy is rolled back to the previous build.
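The rollback rule can be sketched as a small control loop. This is not Slack's tooling: the function `staged_rollout`, the `is_healthy` callback, and the build names are hypothetical, and the health check stands in for whatever signals the person responsible for the deploy actually watches.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    new_build: str,
    previous_build: str,
    is_healthy: Callable[[str, int], bool],
    stages: Sequence[int] = (10, 25, 50, 75, 100),
) -> Tuple[str, List[int]]:
    """Advance through the stages; fall back to the previous build on a problem.

    Returns the build left in production and the stages that completed cleanly.
    """
    completed: List[int] = []
    for percent in stages:
        # is_healthy is a hypothetical check run after exposing `percent` of users.
        if not is_healthy(new_build, percent):
            return previous_build, completed  # roll back to the previous build
        completed.append(percent)
    return new_build, completed


if __name__ == "__main__":
    # Simulate a problem that only shows up once half of the users are exposed.
    print(staged_rollout("build-2", "build-1", lambda build, percent: percent < 50))
```

Stopping at the first unhealthy stage limits a bad build to the users already exposed, which is the stated reason for releasing in stages.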
Reportedly, there was a great deal of trial and error before this development flow was established. When Slack was much smaller than it is today, the service ran on 10 AWS instances, and all servers were synchronized via rsync. The flow was simple: if tests passed in the test environment, the code was deployed to production, and developers were reportedly free to deploy their own code to the servers.
However, as the Slack service grew and development accelerated, the approach of pushing code to every server and synchronizing them all with rsync reached its limit. For this reason, Slack switched to a scheme in which each server watches a Consul key, and a server that is notified of a key change fetches the build from Amazon S3. These changes reportedly allowed the system to keep up with the faster pace of development that came with the growth of the service.
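The shift from pushing code to each server to having servers pull it can be sketched as follows. The sketch makes no real network calls: the class `DeployServer`, its `poll` method, and the in-memory dictionary standing in for Amazon S3 are all hypothetical, and the watched key is passed in as a plain string.

```python
from typing import Callable, Dict, List, Optional


class DeployServer:
    """A server that pulls a build whenever the watched key changes (illustrative only)."""

    def __init__(self, name: str, fetch_build: Callable[[str], bytes]) -> None:
        self.name = name
        self._fetch_build = fetch_build  # stand-in for downloading from object storage
        self.seen_key: Optional[str] = None
        self.current_build: Optional[bytes] = None

    def poll(self, key_value: str) -> bool:
        """Compare the key with the last value seen and pull the build on a change."""
        if key_value == self.seen_key:
            return False  # nothing new to deploy
        self.current_build = self._fetch_build(key_value)
        self.seen_key = key_value
        return True


if __name__ == "__main__":
    # In-memory stand-in for the build store; the build names are made up.
    store: Dict[str, bytes] = {"build-1": b"first build", "build-2": b"second build"}
    servers: List[DeployServer] = [
        DeployServer(f"web-{i}", store.__getitem__) for i in range(3)
    ]
    for key in ["build-1", "build-1", "build-2"]:
        print(key, [server.poll(key) for server in servers])
```

In this arrangement the deploy tool only has to change one key. Each server does its own download, so the work is spread across the fleet instead of being funneled through a single machine that pushes to all of them.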
Another mechanism that supports the current deployment system is atomic deploys. Before this was adopted, new code was deployed directly into the running production environment, so code could call a new function before that function had been deployed, causing errors or broken web pages. With atomic deploys, each server keeps a hot directory, which holds the running version, and a cold directory, which is not in use. The new version is applied to the cold directory, and the hot and cold directories are then swapped, which reportedly makes it possible to deploy a new version without causing errors.
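One common way to swap a hot and a cold directory in a single step is to point a symbolic link at whichever directory is live and replace that link with an atomic rename. The article does not say that Slack uses symbolic links, so the sketch below is a swapped-in technique for illustration; the directory names `a`, `b`, and `current` and the helper functions are hypothetical, and the atomicity of the rename is assumed for POSIX systems.

```python
import os
import tempfile


def write_build(directory: str, version: str) -> None:
    """Write a (toy) build into the given directory."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "VERSION"), "w") as f:
        f.write(version)


def atomic_switch(link_path: str, target_dir: str) -> None:
    """Repoint link_path at target_dir by renaming a temporary symlink over it."""
    tmp_link = link_path + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target_dir, tmp_link)
    os.replace(tmp_link, link_path)  # a rename, which is atomic on POSIX


def deploy(root: str, version: str) -> str:
    """Write the new version into the cold directory, then make it hot."""
    link = os.path.join(root, "current")
    dirs = [os.path.join(root, "a"), os.path.join(root, "b")]
    hot = os.path.realpath(link) if os.path.islink(link) else None
    cold = dirs[1] if hot == os.path.realpath(dirs[0]) else dirs[0]
    write_build(cold, version)
    atomic_switch(link, cold)
    return cold


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    for version in ["v1", "v2", "v3"]:
        deploy(root, version)
        with open(os.path.join(root, "current", "VERSION")) as f:
            print(f.read())
```

New files only ever land in the directory that is not serving traffic, so a request never sees a half-copied build. The switch itself is a single rename.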
Slack says it will continue to improve its development system through better tools and automation.