
Posts: 247
Reply with quote  #1 
This is a bit long, but I thought this real-world example of Graphics in action might be of interest to the users that frequent Stephen's forums...

Summary: It appears that the 30-year daily average snowfall data for the individual weather stations is badly biased toward the beginning and end of each month.  (I detected this problem by graphing the data.)


I was looking for winter-related data, to do some timely graphs.  I found a NOAA/NCDC (National Oceanic and Atmospheric Administration / National Climatic Data Center) website with some free snowfall data:



And specifically these tables of data:




I imported the raw data (http://cdo.ncdc.noaa.gov/climatenormals/clim20-02/NWS_SNOW_MNFALL_dly.dat ) into our graphing software and made various plots of it.  To make it easier to analyze the data both geographically & temporally, I set up a map showing the locations of the ~500 weather stations; when you click on a weather station, it plots all the data for that station (i.e., NOAA's 365 daily data points for that station, summarizing the 30 years of snow data).


Here's the map (please excuse the "gratuitous" use of colors & snowflakes - I was trying to create "fun" holiday graphs ;)  Sorry, but the drilldowns won't work for you, since the machine running the drilldowns is on our intranet:




First I did a graphical representation of NOAA's ascii summary tables of data.  Here's a snapshot of one such graph (this is what the map drills-down to):



After looking at several of these graphs, I noticed that for many of the weather stations, there were “mysteriously” high snowfall amounts at the beginning/end of the month, when there was little/no snowfall for several days before/after – this seemed very odd & unlikely.


Then I created the following plots, which vividly show that there is indeed a serious spike/bias toward having values at the beginning/end of the month (and occasionally on the 15th/middle of the month), where there are no (or much lower) snow values for the days before/after.  This is most evident for the locations with somewhat sparse snow, therefore my plots show the data for locations where the maximum snowfall is less than 2/10ths of an inch.

I experimented with several graphs, but here are 2 that definitely show the problem:


(this 2nd graph was a co-worker's suggestion, and probably the most direct/elegant "proof" of my theory)
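
For anyone who wants to try a similar check without the graphs, here is a minimal sketch of one way to tally the bias. It is not the code behind my plots. The `stations` layout (station name mapped to a dict of `(month, day)` to snowfall in tenths of an inch) and the function name are assumptions made for illustration, and how Trace values are encoded is left to whoever loads the file.

```python
from collections import Counter

def nonzero_counts_by_day_of_month(stations, max_allowed=2):
    """Count nonzero daily values per day-of-month for low-snow stations.

    stations: hypothetical layout {name: {(month, day): tenths_of_inch}}.
    max_allowed: keep only stations whose largest daily value is below
                 this (2 = 2/10ths of an inch, as in the post).
    """
    counts = Counter()
    for daily in stations.values():
        if max(daily.values()) >= max_allowed:
            continue  # skip stations with heavier snowfall
        for (month, day), value in daily.items():
            if value > 0:
                counts[day] += 1
    return counts

# Toy data, invented for illustration only.
toy = {
    "A": {(3, 1): 1, (3, 2): 0, (3, 15): 0},
    "B": {(4, 1): 1, (4, 2): 0},
    "C": {(1, 1): 5, (1, 2): 3},  # too snowy, excluded
}
counts = nonzero_counts_by_day_of_month(toy)
```

If the daily values were spread evenly, the counts would be roughly flat across days 1 through 28; a pile-up on the first and last days of the month is the pattern described above.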

NOAA's data description file (http://cdo.ncdc.noaa.gov/climatenormals/clim20-02/normalsnwssnow.pdf ) has a “computational methodology” section which indicates that …


“Daily snowfall and snow depth values are not simple means of the observed daily values. They are interpolated from the much less variable monthly normals by use of the natural spline function (Greville, 1967). The procedure involved constructing a cumulative series of monthly sums from the monthly normals. The cumulative series was for a 24-month period (July, August, …, December, January, …, December, January, …, June), so that the interpolating function could adequately fit the end points in the annual series.”
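
To make the quoted procedure concrete, here is a rough sketch of interpolating daily values from monthly normals with a natural cubic spline through cumulative monthly sums. This is not NOAA's code. The knot placement (month boundaries), the differencing of the cumulative curve to get daily values, the non-leap month lengths, and all function names are my assumptions; the quoted text does not spell out those details.

```python
import bisect

MONTH_LENGTHS = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # non-leap year

def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward sweep of the tridiagonal solve (second derivative is 0 at both ends).
    diag, mu, z = [1.0] * (n + 1), [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / diag[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / diag[i]
    # Back substitution for the polynomial coefficients of each interval.
    b, c, d = [0.0] * n, [0.0] * (n + 1), [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])

    def evaluate(x):
        j = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate

def daily_from_monthly(normals):
    """normals: 12 monthly normals (January..December) -> 365 daily values."""
    # 24-month series: Jul..Dec, then Jan..Dec, then Jan..Jun.
    order = list(range(6, 12)) + list(range(12)) + list(range(6))
    xs, ys = [0], [0.0]
    for m in order:
        xs.append(xs[-1] + MONTH_LENGTHS[m])  # knot at each month boundary
        ys.append(ys[-1] + normals[m])        # cumulative monthly sum
    spline = natural_cubic_spline(xs, ys)
    start = sum(MONTH_LENGTHS[6:])  # days in the leading Jul..Dec padding
    return [spline(start + day + 1) - spline(start + day) for day in range(365)]

# Invented monthly normals (inches), for illustration only.
example_normals = [5.0, 4.0, 2.0, 0.5, 0, 0, 0, 0, 0, 0.2, 1.0, 3.0]
example_daily = daily_from_monthly(example_normals)
```

With this construction the daily values within each month add up to that month's normal, but nothing in the sketch keeps the cumulative curve from dipping, so individual daily values can come out slightly negative near snow-free months.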


I suspect this technique is not a good one to use on this data, or the technique was incorrectly applied to the data(?)  Otherwise, perhaps the raw data is biased/corrupted(?)


Whatever the underlying cause of the corruption, the end result is that it makes it look like (for example) there is snow on March 1st and April 1st, when there is no snow for the weeks prior & after those dates.


I forwarded my concern to NOAA, and quickly received a reply - they said they're going to have 2 of their best people investigate, and they'll let me know what they find (probably early January).  When I get their reply, I'll post it up here, in case anyone is curious how it comes out!




Posts: 247
Reply with quote  #2 
Appended below is the reply I got, confirming that there is indeed a problem:


The information below came from the product developer and confirms your suspicions about daily snowfall problems for stations that have very little snow.  We will be posting a notice on our snowfall normals web access page alerting customers to the problems in the data.


Thank you for bringing this to our attention.


Tom Whitehurst


The issue is driven by a subroutine that returns daily snowfall values for months with totals that are less than the number of days in a month times 0.1" (e.g., January monthly totals less than 3.1").


For such months, the previous and next month's monthly totals are evaluated.  If the previous month's total is greater than the next month's total, then the daily values are distributed at the beginning of the month (and vice-versa).  If the previous and next month's values are the same, then the daily values are distributed in the middle of the month.


Case 1:  Huntsville, AL, February

January: 1.3"  and March: 0.4"  (previous greater than next, distribution at beginning of month):


DAY  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  MONTH
FEB  1  1  1  1  1  1  1  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T               7


Case 2:  Tuscaloosa, AL, November

October: 0"  and December: 0.2"  (previous less than next, distribution at end of month):


DAY  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

24 25 26 27 28 29 30 31  MONTH


NOV   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

0  0  0  0  0  0  T  1         1


Case 3:  Montgomery, AL, February

January: 0.2"  and March: 0.2"  (previous equal to next, distribution in middle of month):


Mean Snowfall (tenths of an inch, T = Trace)

DAY  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  MONTH
FEB  T  0  0  0  0  0  0  0  0  0  0  0  0  T  1  T  0  0  0  0  0  0  0  0  0  0  0  T               1


Obviously, there are some real deficiencies to this approach, and it should be identified as problematic to the public in the context of the arbitrary nature of daily snowfall values generated from a spline fit to begin with.
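
For readers who want to see the rule from the reply in code form, here is a minimal sketch of the decision it describes. It is not the actual NOAA subroutine. The function names are made up for illustration, totals are taken in inches, and the sketch only decides where in the month the values are placed, since the reply does not say how many days receive a value.

```python
def is_low_snow_month(monthly_total_inches, days_in_month):
    """True when the monthly total is below days_in_month * 0.1 inches."""
    return monthly_total_inches < days_in_month * 0.1

def placement(prev_total_inches, next_total_inches):
    """Where the daily values go, based on the neighboring months' totals."""
    if prev_total_inches > next_total_inches:
        return "beginning"
    if prev_total_inches < next_total_inches:
        return "end"
    return "middle"

# The three cases from the reply (previous month, next month):
case1 = placement(1.3, 0.4)  # Huntsville, February
case2 = placement(0.0, 0.2)  # Tuscaloosa, November
case3 = placement(0.2, 0.2)  # Montgomery, February
```

The three calls return "beginning", "end", and "middle", matching where the nonzero values sit in the tables above.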


Posts: 853
Reply with quote  #3 

I think it's wonderful that you brought this problem to the National Climatic Data Center's attention and equally wonderful that they took it seriously and are doing something to fix the problem. This is precisely what data analysis is ultimately about – acquiring new knowledge and then using it to make something better.

Stephen Few