The following page may contain information related to upcoming products, features, and functionality. It is important to note that this information is presented for informational purposes only; please do not rely on it for purchasing or planning decisions. As with all projects, the items mentioned on this page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
The cost of IT service disruptions continues to rise as every company becomes a tech company. Services that were previously offered during business hours only are now expected to run 24/7 with six nines (99.9999%) of uptime. Moreover, operating these services grows increasingly complex in this age of digital change: new technologies emerge daily, software development teams are adopting CI/CD practices, and legacy platforms are evolving into globally distributed networks of microservices. It is critical for modern operations teams to implement an accurate and flexible IT alerting system that enables them to detect and resolve problems proactively.
Teams responsible for maintaining available, reliable services require a stack of tools to monitor the different layers of technology that comprise software services. These tools capture events (changes in the state of an IT environment) and generate alerts for critical events that indicate degraded application or system behavior. The complexity of IT applications, systems, and architectures, and the many tools required to monitor them, cause multiple alerting problems for operators.

First, it is challenging to identify the right metrics to monitor and the right thresholds to alert on. Most teams end up defining alerts too broadly for fear of missing critical issues, resulting in a constant barrage of alert notifications, a problem that is compounded when multiple tools alert concurrently. When this happens, teams are forced to react to problems rather than proactively mitigate them: they cannot keep up with the stream of alerts and are constantly switching between tools and interfaces. This causes 'alert fatigue' and leads to high stress and low morale. What these teams need is a single, central interface where they can easily add new alerts and fine-tune existing ones as soon as they learn that their system needs improvement.
We are not currently investing in alert management.