
CloudSure - K8S_LS_HPA: LandSlide counter display is inconsistent.

Symptoms

K8S_LS_HPA: I added a new chart, the Landslide Failure Chart (displaying Landslide error counters), along with Landslide Discovery message counts. The test iterates through multiple Landslide tests. I expected the counters either to go back to zero after each test completes or to continue from the previous test's results. Instead, the counters drop to a value somewhere in between, which does not make sense. The counters should drop to zero, not to an intermediate value.

See the image attached as a reference.
 
Environment
 
  • CloudSure
  • K8S_LS_HPA
  • CloudSure 22.05.3822
Explanation/Resolution
 
Add a 6-second delay between Landslide iterations
  • CR-01546416 
  • JIRA CLDCNTR-5301
From the defect above:
 
Focusing on the NF Discover Requests data and NF Discover Successes data, the lines in the chart image in the ticket appear to be the aggregation of the values across all the Landslide test cases.

One point of interest is that the first Landslide test case rises from zero and then stays at a steady value throughout all the remaining test cases. Since the chart aggregates all the test cases, the steady-state value of test case 0 effectively becomes the new zero (or reset) value for the test cases that follow. If test case 1 resets its count to zero while running, for example, the total aggregated value will not fall below the steady value of test case 0. Each test case that leaves a non-zero count raises the effective “zero” (or reset) value for the test cases still to come. The charts below show the test cases charted as an aggregate and then individually. You will see that while some of the test cases reset to zero, the aggregated value never falls below the steady-state values left by previous test cases.
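The aggregation effect described above can be illustrated with a minimal Python sketch. The sample values and test case names are invented for illustration and are not taken from the ticket, and summing per fetch is an assumption based on how the chart appears to aggregate.

```python
# Hypothetical per-test-case counter values, one entry per CloudSure fetch.
# These numbers are invented; they are not taken from the ticket.
samples = {
    "test_case_0": [0, 40, 80, 80, 80, 80, 80, 80],  # rises, then holds steady
    "test_case_1": [0, 0, 0, 30, 60, 0, 0, 0],       # runs, then resets to zero
    "test_case_2": [0, 0, 0, 0, 0, 0, 25, 50],       # starts later
}


def aggregate(series_by_case):
    """Sum the per-test-case values at each fetch index."""
    return [sum(values) for values in zip(*series_by_case.values())]


aggregated = aggregate(samples)

# Once test case 0 reaches its steady value (fetch index 2 onward),
# the aggregate cannot fall below that value, even when test case 1 resets.
floor_after_steady_state = min(aggregated[2:])
print(aggregated)
print(floor_after_steady_state)
```

In this sketch, test case 1 does return to zero, but the aggregated line only drops back to the steady value left by test case 0.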

There is something else to consider when looking for a particular value, say zero. CloudSure queries Landslide for the current values on its own interval, so CloudSure only sees the values that exist at the moment of each fetch. If a Landslide metric is reset to zero and then quickly increases again, for example, there is a good chance that the CloudSure fetch will not occur at the exact point in time when the value was zero (depending, of course, on how quickly the value is incrementing). Unless the value is reset to zero and stays there for longer than one CloudSure fetch interval, the fetch is likely to return a non-zero value. If you aggregate more than one test case in your charts, it is even less likely that you will see exactly zero: while a fetch might catch a subset of test cases at zero, it is less likely to land at a time when the counts of all the test cases are zero.
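The sampling effect described above can be sketched as follows. The counter model, its parameters, and the function names are hypothetical; only the 1-second fetch interval corresponds to a default mentioned in this article.

```python
def counter_at(t_ms, reset_ms, hold_ms, rate_per_s, before=100):
    """Hypothetical counter: holds `before` until `reset_ms`, stays at zero
    for `hold_ms`, then grows by `rate_per_s` counts per second."""
    if t_ms < reset_ms:
        return before
    if t_ms < reset_ms + hold_ms:
        return 0
    return (t_ms - reset_ms - hold_ms) * rate_per_s // 1000


def poll(interval_ms, duration_ms, **counter):
    """Return the values a poller would see when fetching every `interval_ms`."""
    return [counter_at(t, **counter) for t in range(0, duration_ms + 1, interval_ms)]


# The counter sits at zero for only 200 ms, between two 1-second fetches.
brief = poll(1000, 5000, reset_ms=2300, hold_ms=200, rate_per_s=50)

# The counter is held at zero for a full fetch interval.
held = poll(1000, 5000, reset_ms=2300, hold_ms=1000, rate_per_s=50)

print(brief)
print(held)
```

In the first run the poller never observes the zero value, although the counter did reset. In the second run the zero is held long enough for one fetch to land on it.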

Aggregation of all test cases:

 

Test Case 0:
 

Test Case 1:
 

Test Case 2:
 

Test Case 3:
 

Two timeouts can affect an attempt to identify a specific value from the Landslide results:
  1. Landslide Test Status Poll Interval (sec): This value defines the poll interval that CloudSure uses when checking the state of the Landslide test. By default, this value is set to 3 seconds.
  2. Landslide Test Live Results Poll Interval (sec): This value is used as the poll interval for gathering results. By default, this interval is set to 1 second.
These values can be found in the Timeouts section of a Landslide-related test.
When looking at the output of the provided test, it is clear that some Landslide test cases reset to a lower value. We noticed that most of the Landslide test cases hit a zero value or were very close to it, but not all simultaneously. Our understanding of the metrics tracked in the test provided in the ticket is that they are counters. Hence, the metrics have to be purposely reset to a lower value (presumably zero) by the Landslide test. Are we thinking of this correctly?
We are only tracking the Landslide states (i.e., not controlling them). If you want the state transitions in the CloudSure charts to line up with the Landslide test cases aggregating to a zero value, you should reset all Landslide test cases and hold all their values at zero for at least one "Status Poll Interval".

If you simply need the results to line up on a zero value (without regard to us noticing a state transition), you will need to hold those metrics at zero across all the Landslide tests being aggregated for at least one full "Live Results Poll Interval".
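As a rough planning aid, the hold requirement above can be expressed as a check on how long all aggregated test cases are at zero at the same time. This is a sketch only: the zero windows and function names are hypothetical, and the two constants are simply the default intervals quoted in this article.

```python
# Default poll intervals quoted in this article, in seconds.
STATUS_POLL_INTERVAL_S = 3
LIVE_RESULTS_POLL_INTERVAL_S = 1


def common_zero_window(windows):
    """Return how long (in seconds) all test cases are at zero together.

    `windows` is a list of (start_s, end_s) pairs, one per test case,
    giving the period during which that test case's counter is zero.
    """
    latest_start = max(start for start, _ in windows)
    earliest_end = min(end for _, end in windows)
    return max(0, earliest_end - latest_start)


def zero_is_observable(windows, poll_interval_s):
    """True if the shared zero period spans at least one full poll interval."""
    return common_zero_window(windows) >= poll_interval_s


# Hypothetical zero windows for three aggregated test cases.
zero_windows = [(10, 14), (11, 16), (12, 15)]

overlap = common_zero_window(zero_windows)
seen_by_results_poll = zero_is_observable(zero_windows, LIVE_RESULTS_POLL_INTERVAL_S)
seen_by_status_poll = zero_is_observable(zero_windows, STATUS_POLL_INTERVAL_S)
print(overlap, seen_by_results_poll, seen_by_status_poll)
```

Here the three test cases share only a 2-second zero period, which covers one Live Results Poll Interval but not one Status Poll Interval at the default settings.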

 
Root Cause

Working as designed
 
Attachments
Attachment Description
CloudSure - K8S_LS_HPA: LandSlide counter display is inconsistent.

Product : Landslide,CloudSure