Discussion


sfew

Moderator
Registered:
Posts: 823
Reply with quote  #1 
This discussion thread is in response to a post in my blog that critiques a research paper about connected scatterplots. I was asked by the researcher Steve Franconeri to redesign a connected scatterplot that originally appeared on the website FiveThirtyEight about M. Night Shyamalan's films (see below).

Connected Scatterplot Shyamalan.png 

Based on the points made in the original article, below is an alternative to the original chart that, in my opinion, is easier to read and understand and suffers no losses compared to the original.

Shyamalan's Films.png 
I have no doubt that this can be further improved, so I welcome others to share their versions as well.

The correlation, or lack thereof, between Rotten Tomato ratings and revenues was not mentioned in the original article. If it were part of the story, however, and it was important to show how this changed through time, the graph below is one way to do it. The ratio of rating points to revenues directly encodes correlation information, so a single line that displays this ratio does a good job of revealing the fact that the relationship between ratings and revenues varied significantly and that The Last Airbender was a significant outlier.

Revenue per Rating.png
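
For readers who want to reproduce the ratio behind a graph like this, here is a minimal sketch. The film names and figures are hypothetical placeholders, not the values from the FiveThirtyEight article, and the helper name `revenue_per_rating_point` is made up for illustration.

```python
def revenue_per_rating_point(revenue, rating):
    """Return revenue divided by rating points for a single film."""
    if rating <= 0:
        raise ValueError("rating must be positive")
    return revenue / rating

# Hypothetical placeholder data: (title, revenue in $ millions, rating in points).
films = [
    ("Film A", 300.0, 85),
    ("Film B", 100.0, 50),
    ("Film C", 130.0, 6),
]

# One ratio per film, in release order, ready to plot as a single line.
ratios = [(title, revenue_per_rating_point(rev, rating)) for title, rev, rating in films]

# A film with a very low rating but solid revenue stands out as the largest ratio.
outlier = max(ratios, key=lambda item: item[1])[0]
```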


__________________
Stephen Few
jannepyykko

Registered:
Posts: 39
Reply with quote  #2 
I'm one of those who think connected scatterplots have certain advantages over line graphs. That said, I don't think the case above is a very good example of that. Therefore, I'd like to provide a new case, if you don't mind.

I live in Finland, so I'm telling you a story about my home country. See below how Finland has developed over 175 years.

[23783751260_6f3cbb54b0_b]

In general, people live longer and family sizes get smaller as years pass by. You might see there some interesting turns of events, right? How prominent are the anomalies? Do you start asking questions about them?

Now, let's see the same data in a connected scatterplot.

[23996816861_8c523d5440_b]

Wow. What causes the loops in the data? They are very prominent. They must mean something significant! (You start asking questions.)

Let's add some background information.

[23996816851_4fa59c4a1a_b] 

The famine of 1866-1868 was extremely horrible in Finland. During 1868, the median age of people who starved to death was only 8 years 40 days [Wikipedia]. The other prominent loops represent wars that interrupted otherwise steady development towards longer lives and smaller families (in Finland, World War II consisted of two separate wars against the Soviet Union -- that is also visible in the scatterplot as zigzagging).

As an aside, the statistics are very precise thanks to the active record keeping of the churches in Finland.

Another important question arises. Why are the loops always below the general trajectory?

The reason is, these variables are connected by nature with a delay. When a catastrophe happens, it first results in more deaths and therefore decreased life expectancy. When the catastrophe is over, the birth rate increases -- sometimes to a new level such as after World War II.

The delayed connection of variables did not show up in the line graph.
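
The link between a delay and a loop can be sketched numerically. This is only an illustration with synthetic deviations from a baseline (not the Gapminder data): one variable dips, the other rises either at the same moment or a few steps later, and the signed area enclosed by the x-y path (shoelace formula) tells whether the path opens into a loop. The function names and the triangular pulse shape are assumptions made for the example.

```python
def pulse(k, half_width=3):
    """Triangular bump that peaks at k = 0 and vanishes for |k| >= half_width."""
    return max(0.0, 1.0 - abs(k) / half_width)

def trajectory(lag, n=20, onset=5):
    """Synthetic deviations: x dips at `onset`, y rises `lag` steps later."""
    xs = [-pulse(t - onset) for t in range(n)]
    ys = [pulse(t - onset - lag) for t in range(n)]
    return xs, ys

def signed_area(xs, ys):
    """Shoelace formula for the closed path through the points, in order."""
    total = 0.0
    n = len(xs)
    for i in range(n):
        j = (i + 1) % n  # wrap around to close the path
        total += xs[i] * ys[j] - xs[j] * ys[i]
    return total / 2.0

# Without a delay the path goes out and back along the same line: no loop.
area_no_lag = signed_area(*trajectory(lag=0))
# With a delay the outbound and return paths differ, enclosing a nonzero area.
area_lagged = signed_area(*trajectory(lag=2))
```

In a pair of line graphs the same delay appears as a horizontal offset between two bumps, so which view makes it easier to spot is a matter of perception rather than of information content.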

Let's get some more background information. How is Finland in comparison with Sweden?

[24053333726_059fa25541_b] 

In a single line graph, you quickly run out of options to present two variables of two countries clearly. The solution above is one of the possibilities. Another (and probably better) way is to use two different graphs, one for life expectancy and another for children per woman (same colors for countries for both graphs).

In any case, the line graph shows that during the last 175 years people have lived longer in Sweden than in Finland -- and the birth rate is lower in Sweden in most years. In Sweden, the anomalies are smaller than in Finland.

Now, let's see the connected scatterplot.

[23971276392_59be99a5a4_b] 

In Sweden the data creates loops too, but not as big as in Finland. I've marked some common years in the graph to show the difference. Sweden did not take part in World War II, so there's no loop at all.

In general, in a connected scatterplot, you can put reference data (other countries) in the same graph more easily than in a line graph (but as said before, you can create two line graphs to include more countries -- however, I suppose two graphs require more eye movement and are thus a bit slower to decipher).

Summary:

I would definitely keep connected scatterplots in my data analysis toolbox. Some significant phenomena, such as loops, are very prominent in a connected scatterplot. Loops tell a story about the delayed connectivity of variables (and there might be other interesting aspects in CSs as well, though I'm not aware of them yet). All in all, you must train your ability to read connected scatterplots in order to make the best use of them for your needs.

As a side note, I downloaded the data from Gapminder.org. Here's a link to the first connected scatterplot (Finland 1840-2015 in Gapminder). Since Gapminder has an ability to draw connected scatterplots (plus it has a useful tool tip that helps exploring and understanding), I think that would be a useful tool for online data analysis -- not the only tool of course, but one special tool in the toolbox.

PS. The discussion started in Stephen's blog from the word "engagement" and what it means, which I find a bit odd (I'm more interested in finding ways to understand the nature of data and engagement comes along). I hope I did not change the topic too much -- if so, please remove the text (I have a copy).

__________________
-- Mr. Janne Pyykkö, Espoo, Finland, Europe
sfew

Moderator
Registered:
Posts: 823
Reply with quote  #3 
Janne,

I find your connected scatterplots harder to read and understand than your line graphs. They suffer from all of the same problems as the examples that were cited in the research article.

I'd like to challenge a claim that the authors of the connected scatterplots paper made, which you've also made in your comments above. You claimed that it is easier to insert annotations in a connected scatterplot than in a line graph. This is not the case. In fact, even in your dual-axis line charts it is easier to associate annotations with particular points along the lines and it is a whole lot easier to do it if you create two line graphs, one above the other, as I have.

__________________
Stephen Few
acraft

Registered:
Posts: 51
Reply with quote  #4 

I've given this a bit of thought over the last couple days and the problem I have with connected scatterplots is that people don't treat them like scatterplots - they treat them like line charts.  I think that's the main reason a line chart is better - because it actually IS a line chart.

Scatterplots examine relationships between two variables (sometimes more, but not as well).  Sometimes samples are measured at points in time, and it makes sense that these dots could be connected in a chronological order.  I don't question that it's a good practice - it's more information, which can be helpful.  But it's still a scatterplot; it's still designed to show, first and foremost, the relationship between two variables.

People focus too much on the trajectory of the points, which is really just secondary information after the relationship between variables.  A scatterplot of M. Night Shyamalan's movies might suggest a correlation between revenue and ratings (the FiveThirtyEight chart shows this with a best fit line).  Connecting the dots shows the order of his movies, which helps the user determine exactly that - the order of his movies - but it's a pretty lousy way to examine the trend over time, if that's what you're after.

As far as the Finland charts above, the scatterplot suggests a relationship between higher life expectancy and lower birth rate.  Samples are over time, so it makes sense to connect them.  But examining trends that way is a headache - I'd rather look at line charts.  Looking at the scatterplot as a timeline, aside from the outlier in 1868, you might as well just throw away 1840 to 1910 (almost half the data).  Good luck looking for loops in that tangled mess.

Speaking of loops, I feel their benefits are overestimated.  Are loops a more prominent way to show delayed connections than in typical line charts?  Sure, but are fleeting delays (outliers, in a sense) in variables' relationships really so important that we need to feature them so prominently?  And if yes, then are loops in CS plots really the best way to identify them?  A line chart that plots each variable's shift from sample to sample shows them pretty clearly too:

sample.png 

I agree that CS plots might have a place in data visualization, but I think that can only be the case if they are used to examine relationships between variables - if they are treated as scatterplots.  Whereas if they are being examined for trends (if they are being treated like line charts) then line charts are clearly the better choice.


bilal

Registered:
Posts: 2
Reply with quote  #5 

I agree with Janne that connected scatterplots have their place in the data analysis toolbox, even for paired time-series. For example, they are extensively used in physics and mathematics to visualize parametric equations, where two independent variables depend on time. You learn different things about the phenomenon at hand by looking at the line charts and at their respective "connected scatterplots", as the following example shows (theta is for time):

circle.png

For somebody who did not study mathematics, it would be hard to infer the relation between the two variables just by looking at the separate line charts. Likewise, it is hard to tell that the variables follow a sine wave just by looking at the "connected scatterplot".
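
A quick numerical check of the circle example, as a minimal sketch: two series that look like plain sine waves when plotted against theta trace the unit circle when plotted against each other. The choice of 100 steps is arbitrary.

```python
import math

# Sample theta over one full turn.
thetas = [2 * math.pi * k / 100 for k in range(101)]

# The two "line chart" series, each a simple wave over theta.
x = [math.cos(t) for t in thetas]
y = [math.sin(t) for t in thetas]

# Plotted against each other, every point lies at distance 1 from the origin.
radii = [math.hypot(a, b) for a, b in zip(x, y)]
```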
Beyond circles and ellipses, a connected scatterplot can show a variety of patterns that would be hard to see in separate line charts. Spirals are one example: (http://mathworld.wolfram.com/NielsensSpiral.html)

[NielsensSpiral_1000]

Of course a data analyst deals mostly with (unknown) data, not with equations. So plotting a connected scatterplot of real data will probably not result in perfect spirals and ellipses. But it can reveal many patterns in the data that might be very interesting (depending on the phenomenon being studied) and hard to discern from line charts. In his post above, Janne nicely illustrated examples of this.

Connected scatterplots surely have their issues with clutter and with the interpretation of time, compared with the more intuitive line charts. But IMO they have a different purpose: While the line charts give visual primacy to time (assigning it the most efficient visual variable), a connected scatterplot gives visual primacy to the relation between the two variables and embeds time as secondary information over the data points.
So for analysis, I would like to see my data in both views and hope to gain better understanding of the data than just using one view.
For presentation, I will follow Occam's razor, and use line charts if they convey all the findings I want to communicate.

sfew

Moderator
Registered:
Posts: 823
Reply with quote  #6 
Bilal,

Can you provide an actual example of a connected scatterplot with paired time series that you believe is particularly useful? I agree that if two variables, as they changed through time, formed a spiral pattern in a connected scatterplot, viewing the data in a connected scatterplot would be useful. I'm not aware of any actual datasets, however, that form such a pattern.

Regarding your statement, "For somebody who did not study mathematics, it would be hard to infer the relation between the two variables just by looking at the separate line charts," I believe that the opposite is true. Most people who have not studied mathematics or statistics do not know how to read a scatterplot, so the line graphs would work better for them even though they are not the ideal display for focusing primarily on the relationship between two quantitative variables. I know this from many years of experience working mostly with people who are untrained in either mathematics or statistics.

You mentioned that line graphs do a better job of showing change through time and connected scatterplots give visual primacy to the relationship between the two variables independent of time. I would amend your statement slightly to say that a scatterplot (not a connected scatterplot) does a better job than a line graph of showing the relationship between two variables independent of time. Keep in mind that no single chart will do everything well. If you want to view two variables as they change through time, use line graphs. If you want to focus on the relationship between two quantitative variables, use a scatterplot. Attempting to use a single chart for both purposes would not make sense.

__________________
Stephen Few
danz

Registered:
Posts: 190
Reply with quote  #7 
Happy New Year to everyone!

My first post in this thread is a redesign of the original connected scatterplot from the FiveThirtyEight website about M. Night Shyamalan's movies. The chart should be clear enough to show the decline in performance on both critics' score and box office revenue, referenced to the most successful movie. The extra scales added on the right side are properly aligned to the common percentage scale. Obvious discrepancies can be noticed for "Unbreakable - 2000" and "The Last Airbender - 2010".

   Comparative Score-Revenue.png 
   

This is, in general, my preferred design when I need to compare the variation in time of two or more variables. It does not always need to relate to first values, but it can be used in relation with any other significant statistical value: Minimum, Maximum, Mean, Median, Range Center. 
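
A minimal sketch of the rescaling behind this kind of chart, assuming each series is expressed as a percentage of a chosen reference value (here the maximum, but the first value, mean, or median would work the same way). The function name and the sample numbers are illustrative only.

```python
def percent_of_reference(values, reference=None):
    """Express each value as a percentage of a reference (default: the maximum)."""
    ref = max(values) if reference is None else reference
    if ref == 0:
        raise ValueError("reference value must be non-zero")
    return [100.0 * v / ref for v in values]

# Illustrative series on very different scales.
scores = [50, 100, 25]
revenues = [200.0, 400.0, 300.0]

# After rescaling, both series share one percentage axis.
scores_pct = percent_of_reference(scores)
revenues_pct = percent_of_reference(revenues)
```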

Dan

danz

Registered:
Posts: 190
Reply with quote  #8 

My second post is related to the connected scatterplot display.

Obviously we have limited abilities in decoding multiple pieces of graphical information from the same view. While it is true that a connected scatterplot may bring some extra information, there are better alternatives for displaying variations of two variables over time. Not that it matters much, but time is not the only thing that can be used to connect the points; any ordinal variable can be used for that. I personally consider that points can be connected in a scatter plot for better purposes, the first that comes to mind being the display of a Tukey bag, the 2D generalization of a box plot.

A connected scatter plot brings primarily two more geometrical elements, both of them related to the variation of the two variables: the distance between consecutive points, and the slope and orientation of the connecting lines. The distance within any space defined by two variables is a metric we cannot actually interpret properly. The pure geometric formula for distance is sqrt(dx*dx + dy*dy), which has little meaning in most cases. The orientation and the slope of individual lines bring extra information related to the direction of variation and the ratio between the two variations within their own range.
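
To make the two geometric elements concrete, here is a minimal sketch that computes the length and orientation of each connecting segment. The sample points are made up; the second call rescales one variable to illustrate why the raw distance is hard to interpret when the two axes carry different units.

```python
import math

def segment_geometry(xs, ys):
    """Return (length, angle in degrees) for each segment between consecutive points."""
    result = []
    for i in range(1, len(xs)):
        dx = xs[i] - xs[i - 1]
        dy = ys[i] - ys[i - 1]
        length = math.hypot(dx, dy)               # sqrt(dx*dx + dy*dy)
        angle = math.degrees(math.atan2(dy, dx))  # orientation of the segment
        result.append((length, angle))
    return result

# Made-up points for illustration.
xs = [0.0, 3.0, 3.0]
ys = [0.0, 4.0, 0.0]
segs = segment_geometry(xs, ys)

# The same data with y expressed in a unit ten times smaller:
# lengths and angles change although nothing about the data did.
scaled = segment_geometry(xs, [10 * v for v in ys])
```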

In order to compare two variations over time, each within its own range, just derive from each variable a new one that measures exactly that. The new value at the i-th point will be (V_i - V_(i-1)) / (V_max - V_min), expressed as a percentage. Then just use one chart with two line series, each of them showing the above variation. If Gapminder provided a way to extract the data behind a chart (if it does, tell me how), it would take no time to design the above solution.
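
The derived variable described above can be sketched as follows. The function name and the sample values are illustrative; the calculation is the step-to-step change divided by the variable's own range, expressed in percent.

```python
def range_normalized_changes(values):
    """Return 100 * (V_i - V_(i-1)) / (V_max - V_min) for i = 1 .. n-1."""
    span = max(values) - min(values)
    if span == 0:
        raise ValueError("values must not all be equal")
    return [100.0 * (values[i] - values[i - 1]) / span
            for i in range(1, len(values))]

# Illustrative series: the range is 30 - 10 = 20.
series = [10, 20, 15, 30]
changes = range_normalized_changes(series)
```

Applying the same function to both variables gives two series on a common percentage scale that can share one line chart.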

Enriching a scatter plot with ordinal connection lines might be informative now and then, but for proper time variation analysis better methods exist. There is no point in stretching the minds of designers and their audience to design and then decode complicated charts. In the end, I don't think I need an extra hint to be engaged in studying the variability of measures over time. If this makes any sense, I would do it using properly connected calculated variables.

Dan
bilal

Registered:
Posts: 2
Reply with quote  #9 
Quote:
Originally Posted by sfew
Bilal,

Can you provide an actual example of a connected scatterplot with paired time series that you believe is particularly useful?


I believe the following connected scatterplot by Moritz Stefaner is particularly useful: (source http://truth-and-beauty.net/projects/remixing-rosling)
[01] 
The reason why I find it useful (compared to 2 x 12 line charts) is that I can compare the development of countries in many ways that are not easily enabled using line charts. For example, I can see that Vietnam reached in 2009 the same life expectancy and fertility rate as the US in the 1980s (Moritz colored both countries to emphasize this finding). Such a relative time comparison between countries based on both variables would probably be hard to detect in 24 separate charts.
Quote:
Originally Posted by sfew

I agree that if two variables, as they changed through time, formed a spiral pattern in a connected scatterplot, viewing the data in a connected scatterplot would be useful. I'm not aware of any actual datasets, however, that form such a pattern.

Well, the parametric equations do actually involve actual data (coordinates, current, voltage, field intensity, force, etc.) and govern the relations between these variables. Plotting these equations using a connected scatterplot is very common in physics textbooks (many simulation software packages even use animation to draw these plots over time).

Quote:
Originally Posted by sfew

Regarding your statement, "For somebody who did not study mathematics, it would be hard to infer the relation between the two variables just by looking at the separate line charts," I believe that the opposite is true.

I wanted to say that it will probably be hard for people to infer that the two variables in my previous post, which follow a sine wave, will result in a circle (or an ellipse in general) when they are plotted against each other, without having studied this in school mathematics.

Quote:
Originally Posted by sfew

Most people who have not studied mathematics or statistics do not know how to read a scatterplot, so the line graphs would work better for them even though they are not the ideal display for focusing primarily on the relationship between two quantitative variables. I know this from many years of experience working mostly with people who are untrained in either mathematics or statistics.

I agree

Quote:
Originally Posted by sfew

You mentioned that line graphs do a better job of showing change through time and connected scatterplots give visual primacy to the relationship between the two variables independent of time. I would amend your statement slightly to say that a scatterplot (not a connected scatterplot) does a better job than a line graph of showing the relationship between two variables independent of time. Keep in mind that no single chart will do everything well. If you want to view two variables as they change through time, use line graphs. If you want to focus on the relationship between two quantitative variables, use a scatterplot. Attempting to use a single chart for both purposes would not make sense.

I totally agree
jannepyykko

Registered:
Posts: 39
Reply with quote  #10 
In order to reveal additional useful possibilities of connected scatterplots (other than prominent loops), I'd like to provide two more cases.

When commenting on the "Finland's course" example (image again below), acraft wrote: "Looking at the scatterplot as a timeline, aside from the outlier in 1868, you might as well just throw away 1840 to 1910 (almost half the data). Good luck looking for loops in that tangled mess."

[23996816851_4fa59c4a1a_b] 

I don't want to throw away the tangled mess, because it carries important information.

During the years 1840-1909 Finland was an agricultural country, where family size averaged 4.5-5.2 children and life expectancy varied between 33 and 47 years. In the big picture, there is no need to analyze micro movements within those years. As a person interested in history, I can summarize it by saying: It was just normal life for that era.

In 1910 modernization (better health care) overtook Finland enough to be seen in the connected scatterplot. Since then, there was no way back to the "old normal". A new normal was a long development towards longer lives and smaller families.

Let me repeat this, because this is an important discovery: Year 1910 was the first year of statistically-proven improved health care in Finland. And that is very easy to see from the connected scatterplot.

More questions arise:
- Was 1910 early or late compared to other countries, such as Sweden?

In fact, I can check this from Gapminder very quickly, because it provides connected scatterplots:

In Sweden, the old normal varied in the range of 40-48 years and 4-4.5 children. New normal started in 1887.
In Denmark, the old normal varied in the range of 39-48 years and 4-4.5 children. New normal started in 1894.
In Norway, the old normal varied in the range of 45-53 years and 4.2-4.8 children. New normal started in 1901.
In United Kingdom, the old normal varied in the range of 38-45 years and 4.7-5 children. New normal started in 1881.

And so on.

Therefore, Finland (1910) was late. Sweden started to modernize health care 23 years before Finland.

Question: How easily can you get this information by using line graphs?

------------------

As I pointed out in the example above, a connected scatterplot is good at showing where the "normal limits" are, because humans are good at perceiving 2-dimensional space. In the example above the "normal limits" formed roughly a rectangular area. It doesn't have to be so.

Let me introduce a fictional case. Suppose a factory makes thousands of cars, of which some get complaints about extra fuel usage. Hopefully each car has a database that stores variables every five seconds, such as velocity, fuel usage, power output of the motor, motor temperature, outside temperature, jittering of the car, and so on.

In a scatterplot, the normal limits can be as follows.

[24039875291_59a322f4e1_n] 

The more power we want out of the motor, the more fuel it takes. I've drawn the "normal limits" area so that the relation is not linear but polynomial, of degree 2 or so.

The data points in the scatterplot are not within the normal limits = the motor takes too much fuel.

Now the question is, when did the motor start to use too much fuel? If I can find that moment, I can check what was the velocity, outside temperature and jittering for that moment. Perhaps that can give a cause for the unfortunate extra fuel usage?

Because the "normal limits" area is not a rectangle, the best tool I can imagine for this task is a connected scatterplot. So I take the history data, and if the car motor was not broken in the beginning, there is a tangled mess within the normal limits. At some point in time the line goes outside the limits and that's the moment I was looking for. Simple.
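
The search for that moment can also be sketched in code. Everything here is hypothetical, in keeping with the fictional case: the quadratic upper limit, its coefficients, and the sample records are invented for illustration.

```python
def upper_limit(power):
    """Hypothetical upper edge of the 'normal limits' band: quadratic in power."""
    return 0.5 + 0.02 * power + 0.001 * power ** 2

def first_excursion(samples):
    """Return the timestamp of the first sample whose fuel use exceeds the limit.

    `samples` is a list of (timestamp, power, fuel) tuples in time order.
    Returns None if every sample stays within the limit.
    """
    for timestamp, power, fuel in samples:
        if fuel > upper_limit(power):
            return timestamp
    return None

# Invented log records: (seconds since start, power, fuel usage).
samples = [
    (0, 10, 0.7),   # limit 0.8: inside
    (5, 20, 1.2),   # limit 1.3: inside
    (10, 20, 1.5),  # limit 1.3: outside, first excursion
    (15, 30, 2.5),  # limit 2.0: outside
]

moment = first_excursion(samples)
```

Once the timestamp is known, the other logged variables (velocity, outside temperature, jittering) can be looked up for that same record.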

I challenge you to solve this second problem without a connected scatterplot. (I'm not saying that there isn't any. I'm only saying a connected scatterplot would have been my tool to do it.)

__________________
-- Mr. Janne Pyykkö, Espoo, Finland, Europe
sfew

Moderator
Registered:
Posts: 823
Reply with quote  #11 
Bilal,

The example of the time-series scatterplot by Moritz Stefaner is worth considering. This particular design was originally inspired by Hans Rosling, who added tails to show the paths taken by time series in his animated bubble plots so the paths could be compared. This work was later extended by George Robertson et al., who examined various ways of displaying the tails. As you can see, one of the conditions that makes this work is the fact that the data for each country is quite separate in location, with little overlap; otherwise it wouldn't be readable. Notice that this example does not exhibit any of the patterns that the paper by Haroz, Kosara, and Franconeri identified as particularly revealing (e.g., loops and Ls). Also notice that, if your purpose were to compare countries at particular points in time, this plot would not work well. I'm planning to get the data that Stefaner used for this chart and experiment with line graphs to see if a connected scatterplot actually provides any benefits in comparison. Stay tuned for the results.

Janne,

Regarding your example showing the fertility rate and life expectancy in Finland, your additional points don't actually demonstrate any benefits of connected scatterplots compared to line graphs. Every one of the features that you've mentioned can be shown more clearly in line graphs. Regarding your hypothetical example (power and fuel), it illustrates the usefulness of scatterplots, not connected scatterplots. There is no need in this case to connect the values to show a time series. If there were, line graphs would handle this more effectively. Line graphs are routinely used to show values in relation to ranges of routine behavior (e.g., process control charts).

__________________
Stephen Few
jannepyykko

Registered:
Posts: 39
Reply with quote  #12 
Quote:
Originally Posted by sfew

Janne,

Regarding your example showing the fertility rate and life expectancy in Finland, your additional points don't actually demonstrate any benefits of connected scatterplots compared to line graphs. Every one of the features that you've mentioned can be shown more clearly in line graphs. Regarding your hypothetical example (power and fuel), it illustrates the usefulness of scatterplots, not connected scatterplots. There is no need in this case to connect the values to show a time series. If there were, line graphs would handle this more effectively. Line graphs are routinely used to show values in relation to ranges of routine behavior (e.g., process control charts).


Stephen,

I've read your book Signal, so process control charts came to my mind after posting (I also wrote a positive book review on our company intranet).

That is, with suitable process control charts, I think it would be possible to find the year 1910, when modernization (better health care) overtook Finland.

However, as scatterplots and connected scatterplots are available more easily (=no mathematical model building), I would prefer those as quick tools to analyze situations mentioned above. For example, I needed only 1 minute for each country to check the year when (statistically proven) better health care started in Sweden, Denmark, Norway, and United Kingdom.

PS. You can download data from Gapminder.org by clicking "Source(s)" icon near X and Y axis.

__________________
-- Mr. Janne Pyykkö, Espoo, Finland, Europe
sfew

Moderator
Registered:
Posts: 823
Reply with quote  #13 
Janne,

The shift from large to small families is not caused by improvements in healthcare. Assuming that improvements in healthcare caused lifespan to lengthen, a single line graph displaying time series of lifespan values per country would have clearly shown when this began to occur.

__________________
Stephen Few
danz

Registered:
Posts: 190
Reply with quote  #14 
The only benefit I can see in using time-connected scatterplots over dual line charts is the possibility of visually detecting different correlation behaviors over time.

In Janne's example, 
1840-1910 very weak or no linear correlation
1910-1945 negative linear correlation with obvious outliers during the Civil War and World War 2
1948-1973 negative linear correlation
1973-2015 weak positive linear correlation

Janne CS.png   



A scatterplot's main goal is to study correlations between two quantitative variables. A third ordinal or time variable used to connect a reasonable number of points might reveal different aspects of the correlation over intervals. However, I personally find it more interesting to draw partial trend curves to reveal hidden aspects of data correlation.
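
Correlation by interval can also be checked numerically. The sketch below computes a Pearson coefficient separately for each year range; the data are synthetic and chosen only so that the sign flips between the two intervals, and the helper names are made up for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_by_interval(years, xs, ys, intervals):
    """Return {(start, end): r} using only the points whose year is in [start, end]."""
    result = {}
    for start, end in intervals:
        idx = [i for i, year in enumerate(years) if start <= year <= end]
        result[(start, end)] = pearson([xs[i] for i in idx], [ys[i] for i in idx])
    return result

# Synthetic data: y falls with x in the first interval and rises in the second.
years = list(range(2000, 2010))
xs = [float(i) for i in range(10)]
ys = [5.0, 4.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0]

r = correlation_by_interval(years, xs, ys, [(2000, 2004), (2005, 2009)])
```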

Below is a case where I enriched a scatterplot with partial trend lines and their 95% confidence bands, using a third quantitative variable for coloring and grouping information. A connected scatterplot would have brought nothing but a mess.

split correlation.png 


And below is a quick redesign of Janne's connected scatterplot with a similar approach, using years as the third variable.
 
 Janne CS alternative.png 

Is this study worth taking further? Looking into individual segment lengths and their angles means studying variations, or the correlation between variations. Unaligned line lengths and slopes are not the best way to encode or decode quantitative information. There are better ways to study variations over time. There are better ways of enriching scatterplots using a third variable. In the end, instead of squeezing all relevant information into one view, it is always better to split it into clearer and more appropriate views.

I don't find it harmful to learn the different geometric aspects of connected scatterplots or to use them now and then (Stefaner's example is one such case). I personally find it evident how to interpret a vertical line, a horizontal line, right or opposite angles, yet I see no benefit in losing time decoding those weak (and possibly busy) signals. What I find harmful is encouraging the use of a connected scatterplot when better solutions exist.



jannepyykko

Registered:
Posts: 39
Reply with quote  #15 
Quote:
Originally Posted by danz
The only benefit I can see in using time-connected scatterplots over dual line charts is the possibility of visually detecting different correlation behaviors over time.

In Janne's example, 
1840-1910 very weak or no linear correlation
1910-1945 negative linear correlation with obvious outliers during the Civil War and World War 2
1948-1973 negative linear correlation
1973-2015 weak positive linear correlation

[...]

I don't find it harmful to learn the different geometric aspects of connected scatterplots or to use them now and then (Stefaner's example is one such case). [...]


Thank you Dan for finding my "Finland's course" example useful to detect year ranges 1840-1910, 1910-1945, 1948-1973, and 1973-2015.

That's pretty much what I'm trying to say in this thread (with my examples) = a connected scatterplot is useful in some cases, so I'm not going to throw it away from my data analysis toolbox.

__________________
-- Mr. Janne Pyykkö, Espoo, Finland, Europe