tballx
Registered:1335566952 Posts: 2
Posted 1335568796
#1
I would like to use an equation to solve for the extent to which data in a time series deviate from an expected value. I am using data sets where I expect high correlation. If the data in my sample do not correlate highly, I would like to trigger a business process of data validation. However, I need a mathematical way to identify which data points are most anomalous and should be investigated. Is there a statistical formula to quantify this? Below is an example chart. The data sets are plotted on two y-axes. Points 1 & 5 represent instances where each data set produces a result that visually agrees with expectations, while points 4 & 2 do not and should be investigated. Point 3 represents a deviation between the data sets, but only marginally so. Is there an application or formula that can be used to calculate a statistical measure that demonstrates this phenomenon quantitatively?

Anders
Registered:1275392797 Posts: 18
Posted 1335605551
#2
Are the series plotted on different axes? The two y-axes in your chart have very different scales, so all the points in the series would deviate significantly from each other if the series were plotted on the same axis.

tballx
Registered:1335566952 Posts: 2
Posted 1335621238
#3
Exactly correct. One of the two data sets must be converted to an equivalent value in the other scale before any mathematical comparisons can be accomplished.
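
One possible way to do that conversion, offered only as a sketch: fit a least-squares line that maps one series onto the scale of the other, then look at the standardized residuals to see which points disagree most. The function name and the sample numbers below are made up for illustration, and the fit itself is pulled toward any outliers, so treat the scores as a rough ranking rather than a formal test.

```python
import numpy as np

def rescale_and_score(a, b):
    """Map series b onto the scale of series a with a least-squares line,
    then return one standardized residual per time point."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Fit a ~= slope * b + intercept
    slope, intercept = np.polyfit(b, a, 1)
    b_on_a_scale = slope * b + intercept
    residuals = a - b_on_a_scale
    # Divide by the sample standard deviation so the scores are unitless
    return residuals / residuals.std(ddof=1)

# Hypothetical values, one series in the thousands and one in the hundreds
series_a = [1200, 1350, 1100, 1500, 1400]
series_b = [120, 100, 112, 185, 139]

scores = rescale_and_score(series_a, series_b)
for point, score in enumerate(scores, start=1):
    print(f"point {point}: {score:+.2f}")
```

Points with the largest absolute score are the ones where the two series disagree most after rescaling.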

pzajkowski
Registered:1240321507 Posts: 46
Posted 1335812280
#4
Hmm... I'll throw some thoughts out. First, it sounds like you are not interested in visualizing the data, but rather attempting to employ a calculation that highlights a "concern" that needs further investigation.

Second -- Although your example graph suggests that sometimes the two series correlate well, you need to be careful with such a conclusion. Please take a look at Stephen Few's book "Now You See It" (pg. 172 - 174) where he illustrates a variety of graphs that appear to have two series of data that seem to correlate, but in fact might not (and vice-versa -- i.e. two series that don't seem to correlate, but in fact actually do.)

Review the techniques he suggests for comparing rates of change. In particular, Stephen notes the value of converting raw data points to a log scale. In one example, two lines appear to have the same rates of change by plotting actual dollars, but when the dollar values are converted to a log scale, it becomes quite clear that the rates of change for the two series are indeed different (pg 173, figure 7.45).

In your example, if the scales of the two series were more similar (rather than comparing thousands to hundreds), I'd suggest calculating the rate of change from a single baseline data point (e.g., March 11) within each series, and then comparing the rates of change between the two series; if the rates disagree by more than a certain percentage for any given date, that would be an indicator that something needs to be investigated.
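
A minimal sketch of that baseline comparison, with assumptions stated up front: the function names, the 10-point threshold, and the sample values are all hypothetical, and "disagree by a certain percentage" is read here as a gap in percentage points between the two rates of change.

```python
def pct_change_from_baseline(values, baseline_index=0):
    """Percent change of every point relative to one baseline point."""
    base = values[baseline_index]
    return [(v - base) / base * 100.0 for v in values]

def flag_disagreements(series_a, series_b, threshold_pct=10.0, baseline_index=0):
    """Return the indices where the two rates of change differ by more
    than threshold_pct percentage points (threshold is an assumption)."""
    rates_a = pct_change_from_baseline(series_a, baseline_index)
    rates_b = pct_change_from_baseline(series_b, baseline_index)
    return [
        i
        for i, (ra, rb) in enumerate(zip(rates_a, rates_b))
        if abs(ra - rb) > threshold_pct
    ]

# Made-up numbers: the third date moves +20% in one series and +50% in the other
series_a = [1000, 1100, 1200, 900]
series_b = [100, 110, 150, 91]
flagged = flag_disagreements(series_a, series_b)
print(flagged)
```

The result depends heavily on which baseline point is chosen, so an unusual baseline date would distort every comparison made against it.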

Given the difference in scales between your two series, maybe the values could be converted to log first, and then the rates of change calculated against a baseline data point (that's been converted to log). Then compare the rates of change between the series for each date slice. For a given date slice comparison, if the rates are drastically different, then that would perhaps be an indicator for further investigation. (I personally don't recall if I've ever done such an analysis, but I think it might work.)
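
The log variant could be sketched roughly as follows. This is an untested idea, as noted above, not an established method; the function names and sample values are invented for illustration, and it assumes every value is strictly positive so that the logarithm is defined.

```python
import math

def log_change_from_baseline(values, baseline_index=0):
    """Difference between each log value and the log of the baseline point."""
    base = math.log(values[baseline_index])
    return [math.log(v) - base for v in values]

def log_gaps(series_a, series_b, baseline_index=0):
    """Per-date gap between the two series' log changes.
    A gap near 0 means both series moved by about the same ratio."""
    changes_a = log_change_from_baseline(series_a, baseline_index)
    changes_b = log_change_from_baseline(series_b, baseline_index)
    return [ca - cb for ca, cb in zip(changes_a, changes_b)]

# Made-up numbers in the thousands and in the hundreds
series_a = [1000, 1100, 1200]
series_b = [100, 110, 150]
gaps = log_gaps(series_a, series_b)
for date_index, gap in enumerate(gaps):
    print(f"date {date_index}: gap = {gap:+.3f}")
```

Because a difference of logs is the log of a ratio, the difference in scale between the two series drops out, and a cutoff on the absolute gap can be chosen to suit the data.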

sfew
Moderator
Registered:1135986598 Posts: 805
Posted 1335914673
#5
Tballx, Rather than responding to your question as asked, I'd like to probe your plan a little. How are you coming up with the expected time series to which you're comparing the actual values? I ask because it is quite common for organizations to focus on variation that isn't meaningful, in that it is routine and doesn't indicate anything that can be controlled. Unless your expectations are based on a viable statistical model, it isn't likely that you'll find it useful to track deviation from those expectations. If what you need is a way to identify variation in time series data that indicates a potential problem that could be addressed, I recommend that you read the works of Donald Wheeler, beginning with the book Understanding Variation.
__________________ Stephen Few

pzajkowski
Registered:1240321507 Posts: 46
Posted 1336145244
#6
Stephen,

Your comments to Tballx prompted me to seek out Wheeler's articles and website on the internet to gain a better understanding of your recommendation -- i.e., what does Wheeler offer outside of the scope of Stephen Few? Turns out Wheeler is an advocate of using control charts (SPC charts). I last created a control chart for work about five years ago... So, I spent last evening "getting up to speed" with SPC charts -- indeed they are fascinating and quite useful.

Employing control charts seems to require a bit of statistics knowledge, however, coupled with knowing which type of control chart to use to make an appropriate analysis -- i.e., use the wrong type of control chart and risk ending up making incorrect conclusions. In contrast, your books and articles typically steer clear of complex statistical formulas, and the ensuing graphs, in favor of the fundamentals of visual data analysis to achieve insight -- (pg 234 in "Now You See It", figure 10.36, is one of my "all time favorites" of insight merely by changing the order of male vs. female salary distributions.)

So, can other means of visualization be employed in situations where a control chart might typically be leveraged? SPC charts are a topic I have not seen you write about, so I'm not sure whether you leverage alternative methods or whether control charts don't fit into your message to the masses of "simple visualization techniques for quantitative analysis." (I'm hoping this question will supplement the original poster's inquiry rather than lead to a separate discussion entirely.)

--Peter

sfew
Moderator
Registered:1135986598 Posts: 805
Posted 1336146788
#7
Peter, If you take a look at one of Donald Wheeler's books, you'll find to your delight that he keeps the statistics well within reach. He teaches simple calculations for setting the limits of routine variation in control charts. As I've been thinking during the last couple of years about the topic of my next book, I'm increasingly inclined to help people like you who have taken the journey that my books have provided so far to take the next step in analytical thinking. To do this, we'll need to expand our knowledge of statistics. An understanding of variation--the focus of Wheeler's work--is a logical component of this. Rest assured, however, that everything in my next book will be as easy to understand as everything that I've written about so far. The concepts of statistics can be taught in readily accessible ways. It is unfortunate that this is rarely done.
__________________ Stephen Few
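
For readers wondering what such a simple calculation can look like, here is a sketch of the limits for an XmR (individuals and moving range) chart, the kind of control chart associated with Wheeler's work. The constants 2.66 and 3.27 are the standard scaling factors for that chart; the function names and the sample data are made up for illustration, and this is not taken from the thread.

```python
def xmr_limits(values):
    """Limits of routine variation for an XmR chart:
    mean +/- 2.66 * average moving range."""
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    mean = sum(values) / len(values)
    # Moving range: absolute difference between consecutive values
    moving_ranges = [abs(values[i] - values[i - 1]) for i in range(1, len(values))]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return {
        "mean": mean,
        "lower": mean - 2.66 * mr_bar,
        "upper": mean + 2.66 * mr_bar,
        "upper_range_limit": 3.27 * mr_bar,
    }

def points_outside(values):
    """Indices of values that fall outside the limits."""
    limits = xmr_limits(values)
    return [
        i for i, v in enumerate(values)
        if v < limits["lower"] or v > limits["upper"]
    ]

# Hypothetical measurements with one unusually high final value
data = [10, 12, 11, 13, 12, 11, 30]
print(xmr_limits(data))
print(points_outside(data))
```

Values inside the limits are treated as routine variation; values outside them are the ones worth investigating. In practice the limits are usually computed from a baseline period rather than from data that already includes the suspect point.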