GitLab deploys on a Friday and ... is down for a few hours

Snafu blamed on config change

Updated: GitLab, a hosted git service not unlike Microsoft's GitHub, was down for some users as of Friday morning, Pacific Time.

Around 1634 UTC (0934 PT), the code hosting service started returning 503 Service Unavailable errors to those attempting to access the website.

Software developers who depend on the service were quick to celebrate the unexpected day off.

They also took time to cite sysadmin superstition about not deploying on a Friday. "GitLab seems to have deployed on a Friday breaking their site," quipped UK-based dev Luke Warlow. "Which is annoying cause it's stopping me deploying on a Friday and breaking my site."

The issue page for the IT breakdown itself returned an error banner when loading: "An error occurred while fetching the incident status. Please reload the page."

Nonetheless, the page did load, and it presently describes the cause of the downtime as a "config change."

"The service is currently being restored, we're taking multiple measures to have an immediate restore of the service, as long as a targeted fix to the root cause," the issue page explains.

"More information will be added as we investigate the issue. For customers believed to be affected by this incident, please subscribe to this issue or monitor our status page for further updates."

The impact is described as a site-wide outage and some customers, it's said, should expect their projects to be unavailable "for a period of time after service is restored."

GitLab did not immediately respond to a request for further information.

The GitLab status page appears to blame Google Cloud, noting that the affected location is "Google Compute Engine."

(The only glitch we can see on Google Cloud is some disruption around the world stemming from the Google Kubernetes Engine, but that is just a problem with "unexpected additional messages in GKE cluster logs" rather than unavailable systems. So we take GitLab's status page to mean that the downtime was caused by something within its GCE deployment.)

GitLab's status page lists the following GitLab services as disrupted: Git Operations, Container Registry, GitLab Pages, CI/CD - GitLab SaaS Shared Runners, CI/CD - GitLab SaaS Private Runners, CI/CD - Windows Shared Runners (Beta), SAML SSO - GitLab SaaS, Background Processing, and Canary.

As of 1846 UTC (1146 PT), the status page reported that the issue was still being investigated: "We have implemented a fix to mitigate Web/API services. Investigation is ongoing for other services."

At least the incident does not appear to be as severe as GitLab's 2017 loss of production data, in which an administrator deleted a directory on the wrong server during a replication process, resulting in the loss of 300 GB of live production data. ®

Updated to add

According to a postmortem report by GitLab, the outage was caused in part by a change request: "an old pipeline was triggered, applying an obsolete Terraform plan to the production environment."

While you're here... We just want to flag up that the Fedora Linux project is considering adding the collection of usage metrics – some might call it telemetry – to the distribution from release 40 on an opt-out basis. The current release is 38. The project hasn't yet worked out what metrics to collect, and says it is keen to preserve users' privacy. We're keeping an eye on it.
