Thursday, July 9, 2020

Microsoft open-sources Lumos, a Python library for automatically monitoring web app metrics

Microsoft recently open-sourced Lumos, a Python library for automatically detecting and diagnosing metric regressions in web-scale applications. In a technical paper, company researchers claim Lumos has been deployed in millions of sessions across the developer teams at Skype and Microsoft Teams, enabling engineers to detect hundreds of changes in metrics and reject thousands of false alarms surfaced by anomaly detectors.
The health of online services is typically monitored by tracking key performance indicator (KPI) metrics over time. Regressions in these metrics require follow-up because they could indicate major problems that lead to added costs and lost users. But it's time-consuming to track down the root cause of every KPI regression, because a single anomaly can take days or weeks to investigate.
Lumos is a novel methodology that incorporates existing, domain-specific anomaly detectors but reduces the false-positive alert rate by more than 90%, according to the researchers. It eliminates the process of establishing whether a change is due to a shift in population or a product update by providing a prioritized list of the variables that matter most in explaining changes in the metric value. The library also serves the wider purpose of understanding the difference in a metric between any two corpora, including bias, by comparing a control data set and a treatment data set while remaining agnostic to the time series component.
"[Lumos] provides product owners with key insights about demographics changes of their application, and it identifies opportunities for service owners to improve their engineering system," wrote the paper's coauthors. "[Lumos enables engineers] to spend less time in diagnosing metric regressions and more time on building exciting features."

Lumos leverages the principles of A/B testing to compare pairs of data sets. Each data set is tabular: rows correspond to samples, and columns hold the values of interest, including variables that represent the KPI, variables that describe the population (e.g., platform, device type, network type, and country), and variables that provide hypotheses for diagnosing metric regressions. An accompanying configuration file specifies hyperparameters (variables) for running the workflow and details which columns in the data sets correspond to the metric, invariant, and hypothesis columns.
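To make that layout concrete, here is a minimal sketch of what a control data set and its configuration might look like. Every key and column name below (`metric_column`, `call_ok`, `codec`, and so on) is an assumption made for illustration, not Lumos's actual configuration schema, and the `missing_columns` helper is hypothetical.

```python
# Hypothetical configuration: these key names are illustrative only and are
# NOT taken from Lumos's real configuration file format.
config = {
    "metric_column": "call_ok",                       # the KPI being compared
    "invariant_columns": ["platform", "country"],     # describe the population
    "hypothesis_columns": ["app_version", "codec"],   # candidate explanations
    "significance_level": 0.05,                       # a workflow hyperparameter
}

# One row per sample (for example, one call); values are made up.
control = [
    {"call_ok": 1, "platform": "desktop", "country": "US",
     "app_version": "1.0", "codec": "h264"},
    {"call_ok": 0, "platform": "mobile", "country": "IN",
     "app_version": "1.0", "codec": "vp8"},
]


def missing_columns(rows, config):
    """Return the configured columns that are absent from at least one row."""
    required = {
        config["metric_column"],
        *config["invariant_columns"],
        *config["hypothesis_columns"],
    }
    missing = set()
    for row in rows:
        missing |= required.difference(row)  # iterating a dict yields its keys
    return sorted(missing)


print(missing_columns(control, config))  # an empty list means the data fits the config
```

Checking the data against the configuration up front is a design choice of this sketch: it surfaces a mislabeled column before any statistics are computed.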
Lumos begins by checking whether the regression in the metric between the data sets is statistically significant. It then follows up with a population bias check and bias normalization to account for any population changes between the two data sets. If there's no statistically significant regression in the metric after the data has been normalized, the regression can be explained by the change in the population. But if the delta in the metric is still statistically significant, the features are ranked according to their contribution to the delta in the target metric.
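The first three steps can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Lumos's implementation: it assumes a binary metric, uses a pooled two-proportion z-test for significance, and stands in post-stratification on a single invariant column for the bias normalization step. The function names and the sample data are hypothetical, and feature ranking is left out.

```python
import math
from collections import Counter


def p_value(ok_a, n_a, ok_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (ok_a + ok_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both samples are all-success or all-failure
    z = (ok_a / n_a - ok_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def shares(rows, column):
    """Fraction of rows that fall into each value of `column`."""
    counts = Counter(row[column] for row in rows)
    return {value: count / len(rows) for value, count in counts.items()}


def normalized_rate(rows, metric, column, target_shares):
    """Metric rate after reweighting each stratum to the target population mix."""
    weighted, covered = 0.0, 0.0
    for value, share in target_shares.items():
        stratum = [row[metric] for row in rows if row[column] == value]
        if stratum:  # strata absent from `rows` are skipped and renormalized away
            weighted += share * sum(stratum) / len(stratum)
            covered += share
    if covered == 0:
        raise ValueError("the two data sets share no population strata")
    return weighted / covered


def diagnose(control, treatment, metric, invariant, alpha=0.05):
    n_c, n_t = len(control), len(treatment)
    ok_c = sum(row[metric] for row in control)
    ok_t = sum(row[metric] for row in treatment)
    # Step 1: is the raw difference statistically significant?
    if p_value(ok_c, n_c, ok_t, n_t) >= alpha:
        return "no significant regression"
    # Steps 2-3: reweight treatment to the control mix, then test again.
    # Reusing n_t as the sample size is an approximation made for brevity.
    adjusted = normalized_rate(treatment, metric, invariant, shares(control, invariant))
    if p_value(ok_c, n_c, adjusted * n_t, n_t) >= alpha:
        return "explained by population change"
    return "significant after normalization"


def make_rows(platform, n, n_ok):
    """Build n synthetic rows for a platform, the first n_ok of them successful."""
    return [{"platform": platform, "call_ok": int(i < n_ok)} for i in range(n)]


control = make_rows("desktop", 800, 760) + make_rows("mobile", 200, 160)
# Same per-platform success rates as control, but a much more mobile-heavy mix.
shifted = make_rows("desktop", 200, 190) + make_rows("mobile", 800, 640)
# Same mix as control, but desktop calls succeed less often.
regressed = make_rows("desktop", 800, 700) + make_rows("mobile", 200, 160)

print(diagnose(control, shifted, "call_ok", "platform"))
print(diagnose(control, regressed, "call_ok", "platform"))
```

In the synthetic data, the first comparison shows a raw drop from 92% to 83% that disappears once the treatment rows are reweighted to the control platform mix, so it is attributed to the population shift. The second comparison keeps the same mix and still shows a drop, which is the case where feature ranking would take over.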
The Microsoft researchers say Lumos serves as the primary tool for scenario monitoring of hundreds of metrics related to the reliability of calling, meetings, and public switched telephone network (PSTN) services at Microsoft. It's running on Azure Databricks, the company's Apache Spark-based big data analytics service, with multiple jobs configured based on priority, complexity, and metrics type. Jobs complete asynchronously: whenever an anomaly is detected, it triggers the Lumos workflow, which raises an incident alert (ticket) if the library determines the anomaly to be a legitimate issue.
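The trigger-then-ticket flow described above can be sketched with Python's `asyncio`. This only illustrates the control flow: the anomaly fields, the `run_lumos_workflow` stand-in, and the ticket text are all hypothetical, and the real deployment runs as Azure Databricks jobs rather than coroutines.

```python
import asyncio


async def run_lumos_workflow(anomaly):
    """Hypothetical stand-in for a diagnosis job; True means a legitimate issue."""
    await asyncio.sleep(0)  # yield control, as a job would while waiting on a cluster
    return anomaly["significant_after_normalization"]


async def handle_anomaly(anomaly, tickets):
    # Only anomalies that survive the workflow become incident tickets.
    if await run_lumos_workflow(anomaly):
        tickets.append(f"Incident: regression in {anomaly['metric']}")


async def process(anomalies):
    tickets = []
    # Schedule higher-priority anomalies (lower number) first.
    ordered = sorted(anomalies, key=lambda a: a["priority"])
    await asyncio.gather(*(handle_anomaly(a, tickets) for a in ordered))
    return tickets


# Made-up anomalies, as an upstream anomaly detector might emit them.
anomalies = [
    {"metric": "meeting_join_success", "priority": 2,
     "significant_after_normalization": False},
    {"metric": "pstn_call_setup", "priority": 1,
     "significant_after_normalization": True},
]

tickets = asyncio.run(process(anomalies))
print(tickets)
```

Here the false alarm is dropped without a ticket, which mirrors the stated goal of filtering detector output before it reaches engineers.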
"We have 15 primary metrics each of which are being monitored against key dimensions like platform, tenant, meeting type, [join, dial out, and create call], resulting in thousands of aggregated time series we track for a single metric. We have millions of call legs per day and each leg generates hundreds of telemetry fields serving as the input for Lumos," wrote the coauthors, who claim Lumos freed up 65% to 95% of teams' development time. One incident that Lumos was able to detect involved a bug in code that affected video-based screen sharing. Two different teams released updates that conflicted with each other, so users who tried to use the screen sharing functionality experienced errors.
The Microsoft researchers caution that Lumos isn't guaranteed to catch all regressions in services and that it can't provide insights without a sufficiently large amount of data. In an effort to address this, they plan to expand support for continuous metrics, perform feature ranking using multivariate features, and introduce feature clustering to tackle the problem of multicollinearity in feature ranking.
