Detailed Diagnosis in Enterprise Networks
By studying trouble tickets from small enterprise networks, we conclude that their operators need detailed fault diagnosis. That is, the diagnostic system should be able to diagnose not only generic faults (e.g., performance-related) but also application specific faults (e.g., error codes). It should also identify culprits at a fine granularity such as a process or firewall configuration. We build a system, called NetMedic, that enables detailed diagnosis by harnessing the rich information exposed by modern operating systems and applications. It formulates detailed diagnosis as an inference problem that more faithfully captures the behaviors and interactions of fine-grained network components such as processes. The primary challenge in solving this problem is inferring when a component might be impacting another. Our solution is based on an intuitive technique that uses the joint behavior of two components in the past to estimate the likelihood of them impacting one another in the present. We find that our deployed prototype is effective at diagnosing faults that we inject in a live environment. The faulty component is correctly identified as the most likely culprit in 80% of the cases and is almost always in the list of top five culprits.

Summary
The paper presents an approach to diagnosing the cause of performance problems in small enterprise networks. It argues that many methods already exist for debugging performance problems in large enterprise networks, but that these localize generic faults or reachability faults only at a coarse granularity. The authors studied small enterprise networks and found that they differ from large enterprises: the administration is less sophisticated, the connectivity is less rich, and many resources are shared.
Like many others [1], the paper frames diagnosis as an inference problem. However, instead of representing a component with a single variable, it stores the state of each component as multiple variables (a somewhat similar approach is presented in [2]). The edges in the inference graph are directed to indicate causality. Each edge carries a weight, a crude estimate of conditional probability, that reflects the impact of one component on another. Finally, path weights are computed to generate a ranked list of root causes. The approach is evaluated experimentally and compared with coarser methods [1]. Overall, a nice paper.
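The edge-weight idea summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's exact formula: the function names, the normalization of state variables to [0, 1], and the use of a mean absolute difference are all hypothetical. Only the threshold of 1/3 and the fallback weight of 0.8 come from the values this review quotes from Section 5.3.2, and using 0.8 whenever no past state is similar enough is itself an assumption.

```python
from typing import Sequence

DELTA = 1 / 3          # similarity threshold quoted in this review (Section 5.3.2)
DEFAULT_WEIGHT = 0.8   # fallback weight quoted in this review; its use here is an assumption


def state_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two state vectors.

    Assumes every state variable is already normalized to [0, 1].
    """
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def edge_weight(src_hist, dst_hist, src_now, dst_now,
                delta=DELTA, default=DEFAULT_WEIGHT):
    """Estimate how likely the source is impacting the destination now.

    Find past time steps where the source looked like it does now, then
    check whether the destination also looked like it does now.
    """
    num = 0.0
    den = 0.0
    for s_then, d_then in zip(src_hist, dst_hist):
        s_diff = state_diff(s_then, src_now)
        if s_diff > delta:
            continue  # source state too different to be informative
        w = 1.0 - s_diff  # closer source states count more
        num += w * (1.0 - state_diff(d_then, dst_now))
        den += w
    if den == 0.0:
        return default  # no usable history for this edge
    return num / den


if __name__ == "__main__":
    # Hypothetical one-variable states (e.g. normalized CPU load).
    src_hist = [[0.1], [0.9], [0.85]]
    dst_hist = [[0.2], [0.8], [0.9]]
    print(round(edge_weight(src_hist, dst_hist, [0.9], [0.85]), 2))
```

A high result means that whenever the source resembled its current state, the destination also resembled its current state, which the approach treats as evidence of impact. A low result means the destination's current state is unlike what it was in those periods.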
Strengths
• The paper presents an approach that enables detailed analysis at a finer granularity.
• Application configuration is the major cause of performance problems in small enterprise systems. This observation is very insightful.
• The approach combines multiple techniques, which makes it robust.
o A simple way to compute the impact of one component on another, which yields causal dependencies.
o Inference, or belief propagation, over the resulting dependencies.
o Determining the state of the system and the edge weights by comparing the values of multiple state variables.
• The approach could even be applied to root cause analysis in large-scale enterprises after problems have been identified at a coarser level (see [1]).
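The ranking step mentioned in the summary (path weights turned into a ranked list of root causes) can be illustrated as follows. The review does not say how edge weights are combined along a path, so this sketch assumes the score of a path is the product of its edge weights and that each candidate is scored by its strongest path to the affected component. The paper's actual rule may differ, and `rank_causes` and the component names are hypothetical.

```python
import heapq
from collections import defaultdict


def rank_causes(edges, affected):
    """Rank components by their strongest path to the affected component.

    edges: dict mapping (src, dst) to a weight in (0, 1].
    Returns a list of (component, score) pairs, highest score first.
    """
    # Index edges by destination so we can walk backwards from the victim.
    incoming = defaultdict(list)
    for (src, dst), w in edges.items():
        incoming[dst].append((src, w))

    best = {affected: 1.0}
    heap = [(-1.0, affected)]  # max-heap via negated scores
    while heap:
        neg, node = heapq.heappop(heap)
        score = -neg
        if score < best.get(node, 0.0):
            continue  # stale heap entry
        for src, w in incoming[node]:
            cand = score * w  # products never grow, so greedy search is safe
            if cand > best.get(src, 0.0):
                best[src] = cand
                heapq.heappush(heap, (-cand, src))

    del best[affected]
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical dependency graph with already-computed edge weights.
    edges = {
        ("firewall", "web"): 0.9,
        ("web", "client"): 0.8,
        ("db", "web"): 0.3,
        ("db", "client"): 0.2,
    }
    for name, score in rank_causes(edges, "client"):
        print(name, round(score, 2))
```

Because every weight is at most 1, extending a path can never raise its score, which is why a Dijkstra-style search finds the strongest path for each candidate.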
Weaknesses
• A large amount of information is required.
• Several thresholds were chosen empirically. They work well in the authors' experiments but may need tuning in other scenarios. Guidance on tuning them would strengthen the paper.
o δ = 1/3 (Section 5.3.2).
o An edge weight of 0.8 for variables for which no historical data is available (Section 5.3.2).
o The size of the bins, K, would also affect the results. How does one choose an appropriate size (Section 5.3.2)? (See [2].)
• Though this approach could be applied to large-scale networks, as the authors suggest, the diagnosis error would grow with the size of the graph. Limiting the error in that setting would require more thought.
• The use of conditional probability to identify causality has no firm theoretical basis, although it has been shown to work in practice. A theoretical justification would be welcome.
• The approach requires a large amount of data, detail, and domain knowledge to do well. It would be good to know how it performs when not all of these details are available (see [3]).
• Every time one debugs a performance problem, one must repeat all the steps: obtaining the dependency graph, computing edge weights, and so on. Could this be made more efficient, given that some dependencies may not change over time?
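The threshold concern raised in the list above can be probed with a small sensitivity sweep. The sketch below is hypothetical: it assumes a history-weighted edge estimate of the kind this review describes, reduced to a single scalar state per component in [0, 1], and it is not the paper's implementation. The history values are made up for illustration. Only δ = 1/3 and the fallback of 0.8 are taken from the values quoted above.

```python
def edge_weight(history, src_now, dst_now, delta, default=0.8):
    """Scalar edge-weight estimate (an assumed simplification).

    history: list of (src_state, dst_state) pairs, each value in [0, 1].
    Only past steps whose source state is within `delta` of the current
    source state contribute; `default` is returned if none qualify.
    """
    num = den = 0.0
    for s_then, d_then in history:
        s_diff = abs(s_then - src_now)
        if s_diff > delta:
            continue
        w = 1.0 - s_diff
        num += w * (1.0 - abs(d_then - dst_now))
        den += w
    return num / den if den else default


def sweep_delta(history, src_now, dst_now, deltas):
    """Return the edge weight obtained for each candidate threshold."""
    return {d: edge_weight(history, src_now, dst_now, d) for d in deltas}


if __name__ == "__main__":
    # Made-up history: (source state, destination state) per time step.
    history = [(0.2, 0.9), (0.5, 0.9), (0.75, 0.2)]
    result = sweep_delta(history, src_now=0.9, dst_now=0.2,
                         deltas=(0.1, 1 / 3, 0.8))
    for delta, weight in result.items():
        print(f"delta={delta:.2f} -> weight={weight:.2f}")
```

On this toy history, a strict threshold leaves no usable history and falls back to the default, the quoted threshold keeps only the closest past state, and a loose threshold mixes in dissimilar states and lowers the weight. A sweep of this kind on real traces is one way to check how much a ranking depends on the chosen values.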
References
[1] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang. Towards highly reliable enterprise network services via inference of multi-level dependencies. In SIGCOMM, Aug. 2007.
[2] M. K. Agarwal. Performance management for large scale service delivery platforms. In SCC, Bangalore, India, 2009.
[3] E. Cecchet, M. Natu, V. Sadaphal, P. Shenoy, and H. Vin. Performance debugging in data centers: Doing more with less. In COMSNETS, Bangalore, India, 2009.