First, we want to apologize for the disruption of service on Monday, January 12th. We know how important reliability is, and we take the impact of this incident and our responsibility seriously. We are your partners in making your business successful and your customers happy, and we would like to share the details of our post-mortem and root cause analysis.
On January 12th at 1:27 PM MST, we deployed a planned performance and reliability release. The deployment completed successfully and began receiving production traffic at 1:28 PM. Shortly after it started handling full load, an issue surfaced that had not been detected in our automated or manual testing. As soon as the errors were observed (within 30 to 60 seconds), we initiated a rollback of the release.
Unfortunately, while we were able to revert the change, the rollback took longer than expected: approximately 30 minutes instead of the typical 2–3 minutes. The cause of the delayed rollback has been fully diagnosed and is being remediated before any further releases are performed in this area of the system.
By 2:00 PM MST, all services were fully restored to operation. We have spent this week investigating the root cause of the failed release and identifying key improvements to our testing, our release mechanisms, and our proactive communication.
We are confident that we understand the causes of the issues in both the release and our release mechanism, and that our improvements will reduce the likelihood of disruptions in the future.
In the spirit of proactive communication, we want to share that we are making many significant improvements to nearly all of JobNimbus in the coming year, based on your feedback, the needs of a growing platform, and the exciting opportunities AI brings to internal tooling and customer features. We are committed to making JobNimbus the platform to run your business on, and we are excited to share this journey with you!
Thank you for your continued partnership and patience.