Discussion


bella_gotie
Post #61

In my opinion, what should be remembered is the information that we come to understand through data visualization, rather than the data visualization itself. Data visualization is a tool for translating data into "human language".

acraft
Post #62
Dan,

I posted on Ben's website - my understanding now is that in his particular solution (adding elements to his display to better connect decision-makers to their customers) memorability was a goal/requirement.  His was part of a larger presentation, and was likely to be buried among other topics and forgotten, even with a clear and comprehensible visualization.  Given that, I can understand and agree that more research into memorability might seem to be of consequence, despite Stephen's objections.

That said, I also believe that the reason his visualization was memorable has more to do with how it engaged the audience (not by catching their attention, but by encouraging them to interact with it through discussion and connect with their customers) than anything related to its actual appearance being an image that stuck in the execs' minds.  So IF any research into memorability is needed, studying how well people remember different things after only seeing them for 10 seconds is most certainly NOT it.  Especially if the visualizations don't directly concern them.

I can understand Ben's side of the discussion (despite his misunderstanding of Stephen's), but the Borkin et al. study was clearly the wrong direction (with or without all the glaring flaws pointed out in the article).
sfew (Moderator)
Post #63
Acraft (Andrew),

Memorability of information is indeed useful. It's called learning. As a teacher, I passionately care about learning. Michelle Borkin's paper, however, did not concern itself with the memorability of information. Studies into memory in general and learning in particular are abundant. They are primarily done in the field of psychology. We now know a great deal about memory and what it takes to retain something in memory. I would enthusiastically support studies that help us understand how the information that we glean from data visualizations can be more easily retained when it is useful. The phrase "when it is useful" is important to keep in mind. As I've explained, most uses of data visualization do not require retention of information beyond a few seconds. There are exceptions, of course, and in those cases learning to visualize data in ways that make it easier to retain would be useful. To learn this, studies very different from the one that Borkin did are needed.

In post #54, I took some time to describe what I think usually happens when we look at visualizations, beginning with the words "So what happens when you view a chart?" I was attempting to illustrate my point that retention of information when viewing visualizations usually lasts no more than a few seconds. No one has commented on this yet, and I'm curious whether you or anyone else has detected any flaws in my description based on your experience and understanding of cognition.

__________________
Stephen Few
sfew (Moderator)
Post #64

A few people who are prominent in the infovis research community have suggested that I crossed a line in my critique of the paper by Michelle Borkin and her co-authors. Precisely where is the line that should not be crossed? That’s a question that hasn’t been addressed in specifics. Let me suggest a line that I believe we should never cross. Research papers that make invalid claims should not be published. This line is built into science. As scientists, we should not make invalid claims. If a researcher inadvertently makes a claim that is invalid, then the work should either be corrected or blocked from publication during the peer review process. If the peer review process fails to detect invalid claims, which are later discovered, the publication should acknowledge those errors publicly. This is the corrective mechanism that ought to operate in science.

Attempts to discredit the messenger while ignoring flaws in the research serve no one's interests, and certainly not those of the infovis research community. Circling the wagons and calling in the cavalry to shoot dissenters is not a sign of a healthy community. Despite the diversionary nature of attacks on my character while the content of my critique is ignored, it is worthwhile to consider where the line should be drawn to encourage useful critique of published research papers. Let's examine each of the statements that I made about the infovis research community in general, and about the research paper and its authors in particular, that have been deemed aggressive and even vitriolic, to assess more thoughtfully whether a line was actually crossed.

"Research in the field of information visualization is usually mediocre, often severely flawed, and only occasionally well done."

This is the opening sentence of my article. What I described here is the frequency distribution of research papers in the field based on their merits. A distribution of this type has a name in statistics; it's called a normal distribution. Most of the papers fall in the middle of the bell-shaped curve. Mediocre means middling, of average quality. Many papers are severely flawed. They live in the tail located in the low-value portion of the distribution. Only a few are well done and live in the tail at the high end of the distribution. Compared to the total number of papers that are submitted for publication each year, most of which are rejected, it is only the occasional paper that is well done. A well-done paper involves a study that is properly designed, executed, and reported, contains only valid claims, and is potentially useful. The backlash created by my description is an example of "protesting too much" of Shakespearean proportions. The community is suffering from the Lake Wobegon syndrome, derived from Garrison Keillor's fictitious town where all of the children are above average. Does the community really believe that it is superior to other fields of research? Despite their protests, leaders in the community know better. In truth, I was generous in my description. The immature field of infovis research is still producing work that is skewed: more poor than good. This wouldn't be a great problem if the community recognized its immaturity and was doing what it should to improve. That isn't happening—at least not nearly enough.

"Before reading further, take some time first to review Borkin’s paper for yourself."

This is the first time that I referred to the research paper as Michelle Borkin's, rather than naming the other authors. In the first paragraph of the article, I named all of the authors, so there was clearly no intention to rob the others of appropriate credit. After identifying all of the authors, I referred to the paper as Borkin's, because she was the primary author, the person who was given primary credit for the work in the press, and the person who, as I understand it, was primarily responsible for the work. The objection has been made that by referencing the paper as Borkin's, my critique was too personal. It was suggested that I should have referred instead to "the authors." Let's consider this. Would any objection have been made if I had written a positive critique of the paper and referred to it as Borkin's? Actually, I know the answer to this. When I wrote a favorable review of Borkin's 2011 paper, I failed entirely to mention any of the other authors, and not a peep was made in protest. So, the objection focuses on the fact that I narrowed my reference to Borkin alone in a critique that was negative. Is it appropriate to focus on the primary author when a critique is positive but not when it is negative? Even though I confessed in my response to Jeff Heer that, in retrospect, I wish I had referenced the paper's authorship more abstractly, because I was using the paper as an example of a greater, systemic problem, I do not believe that my use of a personal reference crossed some inviolable line.

I'm in the habit of naming the people who are responsible for the work that I critique, whether the critique is positive or negative. This is a conscious decision that I made after much thought long ago. Work is created by people. In my own writing, I routinely avoid indirect references to others such as "the authors," just as I avoid referring to myself as "the author" rather than using the personal pronoun "I." Communication is clearest when it is direct. Style manuals for writing tend to agree.

In his response to me, Jean-Daniel Fekete implied that I was abusing a junior member of the research team. This was not the case. Michelle Borkin was not the junior member of the team. Even if she were, she is an Assistant Professor at a university. A professor who publishes research is not being abused when she is identified as the author of that research. She should take responsibility for her work. When it is criticized in a rational and objective manner, as I endeavored to do, she shouldn’t cry foul and others shouldn’t jump to her defense unless they can do so rationally in response to the substance of the critique.

Some have suggested that I singled Borkin out because she is a woman. A quick review of my work will show the absurdity of that insinuation. I can't help but wonder, however, if others are guilty of the sexist tendency that they attribute to me. Would they have so readily jumped to the defense of a man? Does Borkin need their help because she is a woman? A professor is a professor. A researcher is a researcher. Whether they are male or female is of no consequence in this matter.

"Borkin’s study illustrates a fundamental problem in many visualization research studies: the researchers do not understand what people actually do with data visualizations or how visualizations work perceptually and cognitively. Consequently, they don’t know what’s worth studying."

Anyone who understands data visualization, not strictly from a theoretical or algorithmic perspective but from actual knowledge of the ways it is used in the world and of how our brains process it, recognizes my statement as true. Many of those who write research papers have a narrow understanding of data visualization. This can be addressed by establishing standard curricula that all researchers should study before submitting papers for publication.

"Most information visualization research is done by people who have not been trained in the scientific method."

This opinion has emerged from my reviews of infovis research papers over the years. Common flaws in the papers illustrate a lack of training. The scientific method is not required training in most academic infovis programs. This is a problem that several leaders in the field recognize. Their efforts to address this problem need to rise to the attention of the community at large.

"Information visualization publications do not adequately vet research studies."

The fact that published papers such as the one that I critiqued are not uncommon is evidence of this fact. Vetting is done, but it is not adequate. Again, this is something that many leaders in the field recognize, but their efforts have not yet effected the changes in the peer review process that are needed.

"The information visualization community is complacent."

There are certainly those in the community who are not complacent, but they are not yet as vocal as they should be. All of the problems that I’ve pointed out have always existed, yet they prevail. Why? It is in part because the community at large is ignoring them. If that is not a sign of complacency, what’s the explanation?

"Borkin didn’t produce a flawed study because she lacks talent…I suspect that her studies of memorability were dysfunctional because she lacked the experience and training required to do this type of research."

I made this statement with kind intentions. I was saying that Borkin’s memorability papers were not flawed because she lacks talent but because she managed to get through her studies without being exposed to the full spectrum of knowledge that studies of this type require. In other words, I was pointing out that the problem is systemic, not unique to her. Those who supervised her studies share much of the responsibility.

"She [Borkin] is now an Assistant Professor at Northeastern University, teaching the next generation of students. I’m concerned that she will teach them to produce pseudo-science."

The term “pseudo-science,” which also appeared in the title of my article, is a button pusher. I know this. The word choice was intentional. I wanted it to be provocative because the seriousness of the problem demands it. Pseudo-science is anything that claims to be science without actually conforming to the methods of science. The paper that I critiqued and many others in the field are in this sense pseudo-science.

I used Borkin as an example of a larger problem. Some professors working in the field are teaching poor research practices. My concern is genuine. If they’re exhibiting poor research practices in their papers, they’re teaching them in their classrooms. They are passing on to the next generation of researchers the failures of the current generation. Unless this is corrected, these problems will continue. No one who cares about infovis research can ignore this problem. Anyone who chooses to ignore it is contributing to the problem.

----------------------------------

That’s it. I believe that I’ve identified all of the statements in my article that some feel crossed the line. If I’ve missed any, feel free to add to the list. I am not writing about this to defend my position. I actually don’t think my position needs a defense. I am writing this because critique of infovis research must be embraced, and if lines of decorum need to be drawn, they should be drawn consciously and clearly, not as knee-jerk reactions. If you think I have crossed a line, I believe that line has been drawn in the wrong place. I did not set out to offend or hurt anyone. If anyone has actually been harmed in any way, I apologize—that was not my intention. In reviewing the statements that I have listed above, do you really feel that I have crossed a line that should serve as a barrier to discourse within the infovis research community?


__________________
Stephen Few
acraft
Post #65
@Stephen
Quote:
Memorability of information is indeed useful. It's called learning. As a teacher, I passionately care about learning. Michelle Borkin's paper, however, did not concern itself with the memorability of information.

I wasn't implying that the paper was useful (in fact, I specifically said that it wasn't).  And I understand that you care about learning and memorability of information - I wasn't trying to suggest otherwise.

Quote:
In post #54... I was attempting to illustrate my point that retention of information when viewing visualizations is usually for no more than a few seconds. No one has commented on this yet and I'm curious if you or anyone else has detected any flaws in my description...
Nope, that's generally what I find among my users.  Needing to remember any actual data beyond the few seconds it takes to make a decision (or just long enough to do further research) never happens.
sfew (Moderator)
Post #66
Thanks Andrew (acraft). I appreciate the follow-up and clarification.
__________________
Stephen Few
tamara_munzner
Post #67

[Tamara Munzner responded in her own blog, and I (Stephen Few) have reproduced her comments here. I have also responded to Tamara’s points directly. My responses appear in brackets and red italics, beginning with my initials: SCF.]

I’m writing in response to a still-unfolding debate and conversation within the visualization community that was catalyzed by two newsletter/blog posts from Stephen Few. He wrote two strongly negative critiques of two papers on memorability from a group of researchers at Harvard and MIT; Michelle Borkin was first author on both of these papers. He also critiqued the InfoVis conference itself, where these papers were published.

The first paper was published at InfoVis13: What Makes a Visualization Memorable? by Borkin, Vo, Bylinskii, Isola, Sunkavalli, Oliva, and Pfister. In my posts, I'll call it [Mem13] for short. The second was published at InfoVis15: Beyond Memorability: Visualization Recognition and Recall by Borkin, Bylinskii, Kim, Bainbridge, Yeh, Borkin, Pfister, and Oliva. I'll call it [Mem15]. Few's critique of Mem13 is Chart Junk: A Magnet for Misguided Research; I'll call it [Few13]. His critique of Mem15 was called Information Visualization Research as Pseudo-Science; I'll call it [Few15]. The discussion about that article is on a separate set of pages.

I note two roles of my own, for full disclosure and context.

I was Michelle Borkin's postdoc supervisor from mid-summer 2014 through mid-summer 2015. I was not personally involved with any of the memorability research, which was done while she was a PhD student at Harvard with Hanspeter Pfister.

I’ve been heavily involved with InfoVis for quite a while now. I’ve attended every single one since it started in 1995, and first published there in 1996. My first organizational role was being webmaster in 1999, I started the posters program in 2001, I was papers chair in 2003 and 2004, and I’ve been a member of the steering committee since 2011.

All of which is to say yes, I do have some skin in the game on both of these fronts.

On Conventions Between Fields in Experimental Design and Analysis

I think of science as a conversation that is carried out through paper-sized units. Any single paper can only do so much – it must have finite scope, so that the work behind it can be done in finite time and described in a finite number of pages. There is a limit on how much framing and explanation can fit into any paper. Supplemental materials can expand that scope somewhat, but even without explicit length limits for them there must still be a boundary.

In the particular case of InfoVis as a venue, the restriction on length is 9 pages of text (plus one more for references). That's fewer than at venues such as cognitive psychology journals, where authors might have dozens of pages. In those journals, it's the common case that a single paper covers a series of multiple experiments that hit on different facets of the same fundamental research question. The InfoVis length is longer than at venues such as some bioinformatics journals, where the main paper is sometimes only a few pages, with the bulk of the heavy lifting done in supplemental materials.

[SCF: Science is not “a conversation that is carried out through paper-sized units.” It is much more fluid and ongoing—or should be. Only the publication of scientific findings is confined to documents consisting of a few pages. Some of the problems that plague science are a result of thinking of it as paper-sized units. Too often, publication of a paper, whatever it takes, becomes the goal, rather than the production of good science.]

This inescapable fact of finite scope means that fields develop conventions of the standard practice: what’s normally done, the level of detail that’s used to describe it, and the amount of justification that’s reasonable to expect for each decision. These conventions can diverge dramatically between fields. The interdisciplinarity of InfoVis can lead to very different points of view of what’s reasonable and what’s valid.

[SCF: While it is true that the interdisciplinary approach to infovis "can lead to very different points of view of what's reasonable and what's valid," this situation creates problems that must be resolved. Conventions must be developed that specifically support information visualization, despite the many disciplines that inform it. We visualize information for particular purposes. This should never be forgotten when infovis research is conducted, regardless of the background and training of the researchers. The fact that vision science researchers participated in the two memorability studies done by Michelle Borkin and her colleagues does not mean that the conventions of vision science apply. Attempting to discover what catches someone's attention or remains in memory after brief exposure to an image might be of interest in and of itself to vision scientists, but it should only be of interest to infovis researchers if it pertains to the use of visual representations of data to make sense of or communicate data. These studies were not properly designed with this objective in mind.]

We discussed both of the memorability papers in our visualization reading group at UBC. The difference in initial opinions based on backgrounds was remarkable.

A person with a vision science background initially thought the methods were completely straightforward: they were closely in line with decades of work in her specific field of vision science in particular, and aligned with the larger field of experimental psychology in general. Although the vision scientist could identify some minor quibbles, she was fully satisfied with the rigor. She was intrigued to see that the methods of vision science, which are typically directed to experiments with extremely simple stimuli, were successfully being applied to the more complex stimuli that are of interest in visualization.

In contrast, a person with a biomedical statistics background initially thought the methods were completely indefensible, with far too many variables under study to make any of the statistical inferences meaningful, and most importantly no discussion of confidence intervals or odds ratios. (I was well aware of confidence intervals, but I hadn’t heard of odds ratios. For a concise introduction to these ideas, see Explaining Odds Ratios by Szumilas.)
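
(For readers who haven't met these measures, here is a minimal Python sketch, using made-up counts rather than data from either paper, of how an odds ratio and its 95% confidence interval fall out of a 2x2 table. This is the kind of effect-size reporting the biostatistician was looking for.)

    # Hypothetical counts, invented purely for illustration:
    # rows = two conditions, columns = remembered / not remembered.
    import math

    a, b = 45, 55   # condition A: remembered, not remembered
    c, d = 30, 70   # condition B: remembered, not remembered

    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)              # SE of log(OR)
    lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)    # 95% CI (Woolf method)
    hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    print(f"OR = {odds_ratio:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")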

The biostatistician had had this highly negative reaction to many of the papers she'd been reading in the visualization literature, and had been thinking long and hard for the past year about how to understand her misgivings at a deeper level than a first knee-jerk reaction of "they're just ignorant of the methods of science". She articulated several crucial points that have helped me think much more crisply about these questions.

There are several fundamental differences between the experimental methods used in vision science and the methods considered the gold standard in medical experiments that test the effectiveness of a particular drug for treating a disease: randomized controlled trials.

Two of the most crucial points are the ability to manipulate the experimental variables/factors, and effect sizes.

First, in many medical contexts, some kinds of manipulation of experimental variables are off the table. Repeated-measures designs are impossible because of carryover effects: you can't just give the same cancer patient 100 different cancer drugs, one after the other, because the effects will linger instead of stopping when the treatment stops. With great care, it's sometimes possible to carefully design "case-crossover" experiments for just two conditions, where for example two drugs are tested on the same person, but it's certainly not possible to test many conditions on the same person. That's why the common case is to design experiments with between-subjects comparisons rather than within-subjects comparisons. Moreover, the trial lasts a long time: months or even years. Thus, the number of trials is typically equal to the number of participants.

Second, when manipulating variables that affect human subjects, you also have to consider harm to the participant. In medicine, there are many situations where you either cannot manipulate a variable (you can't retroactively expose somebody to asbestos 20 years ago in order to see how sick they are today, and you can't just divide a set of people into two groups and give one of those groups brain cancer), or you should not manipulate it for ethical reasons (you shouldn't deliberately expose somebody to a massive dose of radiation today to see how sick they get tomorrow). One response to this situation is to develop methods for "observational" (aka "correlational") studies, rather than "experimental" studies where the experimenter has full control of the independent variable. For example, in one kind of retrospective observational study, a "cohort" is identified (a group that has been identified as having some property, such as exposure to an environmental toxin) and then compared to a similar group that hasn't been exposed. Selecting appropriate participants for each of these groups is an extremely tricky problem, because of the possibility that the cohort also varies from the control group according to some confounding variable that has a stronger effect than the intended target of study.

What I used to think of as "experimental" studies turn out to be more properly called "quasi-experimental" methods because the experimenter doesn't have full control of the independent variable: they can't tell people to smoke or not to smoke, but they can ask the people who already smoke to do something else – but there's still the extreme hazard of confounds. What if you divided the groups so that one happens to have more heavy smokers than the other, or what if an underlying reason that people smoke is stress, so that you're really measuring stress rather than the effects of smoking per se? The randomized controlled trials that are the gold standard of medicine are in this category. You can divide cancer patients into two groups, one that gets the experimental treatment and the control group that gets the placebo, and then analyze the differences in outcomes to try to uncover their linkage to the intervention. But you can't control for how virulent a strain of cancer they have, because you didn't give them cancer. And, as above, you can't give the same patient both the experimental drug and the placebo.

(One good reference for all of this is the book “How to Design and Report Experiments” by Andy Field and Graham Hole, especially Section 3.2 on “Different Methods for Doing Research”.)

Above, I’ve been alluding to the other crucial aspect, effect size. The typical goal in medicine is to detect quite subtle effects, and thus experiments need to be designed for large statistical power in order to have a hope of detecting these effects.

In contrast, in vision science, life is very different: experimental trials are fast, independent, and harmless; frequently, effect sizes are big. First, trials are very short: just a few seconds in total for the full thing, and the actual exposure to the visual stimulus is often much shorter than one second! Moreover, it's straightforward to design experiments that preclude carryover effects when you're testing a perceptual reaction to a visual stimulus instead of a physiological reaction to an experimental drug. Thus, it's extremely common to run many trials with each participant: dozens, hundreds, or even thousands of trials per participant. When considering the statistical power of an experiment, the designer is concerned with the total number of trials, which is in the realm of hundreds or thousands. The number of participants is typically far, far smaller than in medical experiments, where in order to have thousands of trials you need thousands of participants. Also, in this domain, it's not just feasible to design within-subjects experiments, it's actively preferable whenever possible – because these designs provide greater statistical power for the same number of trials compared to between-subjects designs, since you can control for intersubject variability.
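
(To make the trials-versus-participants point concrete, here is a small Python sketch using the power routines in statsmodels. The effect size and the correlation between conditions are assumed values chosen purely for illustration; the point is simply that the same underlying effect needs far fewer participants in a within-subjects design.)

    # Assumed numbers for illustration: how many participants does each design need?
    import math
    from statsmodels.stats.power import TTestIndPower, TTestPower

    d = 0.3                   # assumed raw effect size (Cohen's d)
    alpha, power = 0.05, 0.8

    # Between-subjects: participants needed per group.
    n_between = TTestIndPower().solve_power(effect_size=d, alpha=alpha, power=power)

    # Within-subjects: each person sees both conditions; with an assumed
    # correlation r between a person's scores in the two conditions, the
    # paired effect size grows to d / sqrt(2 * (1 - r)).
    r = 0.7
    d_paired = d / math.sqrt(2 * (1 - r))
    n_within = TTestPower().solve_power(effect_size=d_paired, alpha=alpha, power=power)

    print(f"between-subjects: ~{math.ceil(n_between)} participants per group")
    print(f"within-subjects:  ~{math.ceil(n_within)} participants total")
    # Running many short trials per participant shrinks within-person noise
    # further, which is why total trials, not head count, drives power here.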

The combination of these two things — the ability to control for intersubject variability through within-subjects designs, and the ability to run many trials — means that there is not nearly so much concern for confounding variables based on splitting your subjects into groups improperly. One implication is that in this experimental paradigm, multi-factor / “factorial” designs are entirely practical and reasonable. That is, a single experiment can test more than one experimental variable, and each variable might be set to several values. For example, the visual stimuli shown to the participant might systematically vary according to multiple properties, resulting in many possibilities. Another implication is that “convenience sampling” is extremely common and does not require special justification, for example undergrads on campus or workers on Mechanical Turk.

Moreover, it’s even possible to design between-subjects experiments with multi-factor designs, given a crucial assumption: that individual differences have a smaller effect size than the effect size that we’re trying to study. This assumption is reasonable because there’s a huge amount of evidence from decades of work in vision science that it’s true – and moreover you can test that assumption in your statistical analysis of the results. And this point brings me back to the concept of effect sizes as the second key difference between the methods of medical research and vision science. In medical research, individual difference effects (how virulent is your cancer) are usually enormous compared to the variable under study (does the drug help). In vision science, individual differences in low-level visual perception are typically very small compared to the variable under study (does the size of the dot on the screen affect your speed of detecting its color).
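
(That last assumption can indeed be checked in the analysis. Below is a simulated, purely illustrative sketch of a two-factor design with many trials per participant, fit with a mixed model so that the factor effects can be compared against the between-subject variability. All of the numbers are invented.)

    # Invented data: two factors, many trials per participant.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    rows = []
    for s in range(20):                              # 20 participants
        subj_offset = rng.normal(0, 10)              # small individual difference (ms)
        for size in (1, 2, 3):                       # factor 1: stimulus size
            for color in ("red", "blue"):            # factor 2: stimulus color
                for _ in range(30):                  # 30 trials per cell
                    rt = (400 + 40 * size + (15 if color == "blue" else 0)
                          + subj_offset + rng.normal(0, 25))
                    rows.append((f"s{s}", size, color, rt))
    df = pd.DataFrame(rows, columns=["subject", "size", "color", "rt"])

    # Mixed model: fixed factorial effects, random intercept per subject.
    # Compare the estimated size/color effects against the subject variance term.
    fit = smf.mixedlm("rt ~ C(size) * C(color)", df, groups=df["subject"]).fit()
    print(fit.summary())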

All of these points are part of the reason that work in vision science is scientifically valid, because the methods are appropriate to the context – even though multi-factor testing with a small number of participants would be ridiculous in the very different context of medical drug trials.

Coming back to visualization, we’re in a context that’s very close to HCI (human-computer interaction) – and controlled laboratory experiments in HCI are a lot closer to vision science than to medicine. It’s common to use multi-factor designs and we run many trials on each participant. There is significant trickiness with carryover effects, typically more so than in vision science, and we often consider “learning effects” in particular as something that must be carefully controlled for in our designs. Our trial times are typically longer than in vision science, ranging from a minute to many minutes – but still far shorter than in medicine. There’s more to say here, but I’ll leave that discussion to another post because I have more ground to cover in this one.

Coming all the way back to the memorability papers and Few's response to them, this analysis allowed me to interpret a comment from Few somewhat more charitably: his complaint, in his response to the paper, about the demographics of Mechanical Turk not matching up with the population of the US. In the context of HCI research, this seems extremely naive, because there has been enough previous work establishing how to use MTurk in a way that replicates in-person lab experiments that most of us in the field consider it a settled issue. By considering it in the context of randomized drug trials, as I describe above, I can better understand why Few might have thought along these lines – and my discussion above also covers why his criticism is not valid in this context.

[SCF: My critique of the experimental methods that were used in Borkin’s paper was not influenced by a background in a different research discipline (e.g., medicine). Instead, I was addressing the specific ways in which experimental research should be designed to produce meaningful and valid findings regarding information visualization. Nothing useful can be said about information visualization based on Borkin’s paper.]

(Two of the most relevant papers are from Heer's group: Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design by Heer and Bostock, from CHI 2010, and Strategies for Crowdsourcing Social Data Analysis by Willett, Heer, and Agrawala, from CHI 2012.)

Again coming back to these papers, a contentious point in this whole debate is whether these experiments had sufficient statistical power to draw valid conclusions. Few has contended that the Mem15 paper can’t possibly be valid because there are too few participants. As above, I think this argument is missing the point that in this kind of experiment the power is more appropriately analyzed in terms of the number of trials.

[SCF: Even based on this assumption about the number of trials, did Borkin’s research demonstrate appropriate statistical power? We have no reason to believe that it did.]

I would certainly be happier with the Mem13 paper if it explicitly discussed confidence intervals and/or effect sizes, but it does not. That’s the common case right now in HCI and vis: most papers in HCI and vis don’t, although a few do. I note that Stephen Few did specifically state that he’s critiquing the whole field through this paper as an exemplar, so saying “everybody does it” isn’t a good defense – that’s exactly his point!

Pierre Dragicevic has written extensively and eloquently about how HCI and Visualization as a community might achieve culture change on the question of how to do statistical analysis by emphasizing confidence intervals rather than just doing t-tests: that is, null-hypothesis significance testing (NHST). I do highly recommend his site http://www.aviz.fr/badstats. I also note that he gave a keynote on this very topic at the BELIV14 workshop, a sister event to InfoVis 2014, which sparked extensive discussion. This kind of attention and activity is one of the many reasons I don’t agree with Few’s characterization of the vis research community as being “complacent”.
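
(To illustrate the distinction with simulated numbers of my own, not data from any paper: the same paired comparison can be reported as a bare p-value, or as an effect estimate with a 95% confidence interval that conveys both its size and its uncertainty.)

    # Simulated paired data, invented for illustration only.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 40
    baseline = rng.normal(500, 60, n)                 # hypothetical response times (ms)
    redesign = baseline - rng.normal(20, 30, n)       # assumed ~20 ms improvement

    diff = baseline - redesign
    t, p = stats.ttest_rel(baseline, redesign)
    print(f"NHST report:       t({n - 1}) = {t:.2f}, p = {p:.4f}")

    mean_d = diff.mean()
    sem = diff.std(ddof=1) / np.sqrt(n)
    ci_lo, ci_hi = stats.t.interval(0.95, n - 1, loc=mean_d, scale=sem)
    print(f"Estimation report:  mean difference = {mean_d:.1f} ms, "
          f"95% CI [{ci_lo:.1f}, {ci_hi:.1f}] ms")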

[SCF: You are skirting the point of my critique. My belief that the infovis community is suffering from complacency is primarily based on the fact that papers such as Borkin’s are accepted and promoted as good when they are in fact poorly done and invalid. When I begin to see the overall quality of infovis research improve and a greater openness to thoughtful critiques, I will have a reason to believe that complacency is diminishing.]

(Dragicevic also contributed to the online discussion of Few15, with posts 6, 19, 40, and 45.)

The biostatistician in my group argued that even this culture change might not be the best end goal; she sees confidence intervals as just one mechanism towards a larger goal of using methods that take effect sizes into account as a central concern and report on them explicitly in the analysis. She points out that in the medical community there is the concept of levels of evidence: while randomized controlled trials are the gold standard in terms of being the highest level of evidence, they're absolutely not the only way to do science. In fact, it's well understood that studies providing lower levels of evidence are exactly what is required as steps along the way towards such a gold standard. They're not invalid — or pseudo-science — they use different methods to achieve different goals. (For a concise introduction to these ideas, see The Levels of Evidence and their role in Evidence-Based Medicine by Burns, Rohrich, and Chung.)

[SCF: If Borkin’s study were properly designed to produce valid results, I would not have called it pseudo-science. I have not judged this paper as pseudo-science because it used methods that are different from mine, but because it used methods that were not designed to produce valid findings about information visualization. I was quite specific in my critique of the study’s flaws. You have not addressed any of those flaws specifically and directly. If you wish to argue that this study qualifies as legitimate science, then show how the specific flaws that I addressed are in fact examples of legitimate scientific design.]

The upshot is that I do think this question of statistical validity is complex and subtle, and that Few’s approach of just asserting “you’re not following the scientific method” is dramatically oversimplifying a complex reality in a way that’s not very productive.

[SCF: You are misrepresenting my argument. I did not say that this paper was flawed solely because of statistical problems. My lengthy critique is not guilty of oversimplification. To the contrary, you are oversimplifying this matter by misrepresenting my position and failing to address the many flaws that I identified.]

I hope that my analysis above starts to give some sense of the nuance here: the methods of science depend very much on the specific context of what is being studied. Yes, it's true that we talk about "the" scientific method: observe, hypothesize, predict, test, analyze, model. But when we operationalize this very general idea, the much more interesting point is that there are many, many methods used in science. There is no single answer, and a lot of the training of a scientist involves learning when to use which method; and within every method are many smaller methods that require judgement, and so on – arguably it's methods all the way down. Methods appropriate for medical drug trials aren't even the same as those for epidemiology, much less for low-level perception as in vision science, or human behavior as in social science, or the complicated mix of low-level perception, mid-level cognition, and high-level decision making that is visualization.

Moreover, all of this discussion has just been about the relatively narrow question of controlled experiments featuring quantitative measurement! There’s an enormous field of qualitative research methods that are also extremely useful in the context of visualization.

On the InfoVis Review Process

The process of reviewing papers is relevant in this memorability discussion, since the Few15 critique specifically called into question whether the peer review process at InfoVis yields appropriate quality.

Papers as a Mix of Strengths and Weaknesses

No paper is perfect; every paper has a mix of strengths and weaknesses. The job of the reviewers is to decide whether the strengths outweigh the weaknesses, and it is valid for two reasonable scientists to disagree, given that it is an individual judgement call. That is, all papers have flaws; the judgement call for the reviewer is whether those flaws are fatal. Few argues that the Mem13 paper and the Mem15 paper have fatal flaws. I disagree with this assessment, and I explain why at length below.

[SCF: You and I understand the job of reviewers differently. You believe that a paper should be accepted for publication if the “strengths outweigh the weaknesses.” That’s a rather low bar. I believe that the review process should determine if the paper is scientifically valid and worthwhile. In your final sentence above, you disagree with my assessment that Borkin’s paper has fatal flaws and promise to “explain why at length below,” but you never do. At no point do you address the specific flaws that I identified.]

Peer Review and the Conversation of Science

I’ll echo and expand on the words of two other InfoVis steering committee members that a conference is a conversation (Fekete), and that science is a conversation (Heer).  The review process is an intrinsic part of that conversation, even though much of it is not visible to the readers of the final draft of the paper.

Papers are the major units of speech in the scientific conversation. Papers cite and discuss past work, and frame their new contributions with respect to the limitations of that past work. Typically, the way somebody argues against the conclusions drawn in a paper is to write another paper that carefully shows why the original one didn't get the story right. The strength and validity of that argument is judged in the peer review process, where frequently the reviewers are the authors of the very papers that the new paper is characterizing as having limitations. It's not usually quite so simplistic as just saying the old work is flat-out wrong (although that sometimes does occur). It's often a matter of noting situations where it falls short, or extending it to new situations not previously considered, or proposing the existence of new confounding factors that serve to illuminate a previously murky assumption or explanation.

Like most practitioners, Few doesn't take part in that academic conversation as an author. That's not surprising – if he did, we'd normally call him an academic, since that choice to engage in publishing research is exactly the dividing line between those categories.

Unlike many practitioners, Few has engaged with scientific papers at a sufficiently detailed level that he has been asked to take part in that conversation as a reviewer. He has chosen to decline the most recent invitation because of his belief that anonymous peer review is implicitly unethical.

[SCF: I decline to participate in the infovis paper review process because I believe that anonymity invites bad behavior and that I have no right to pass judgment on someone’s work while remaining anonymous. Unlike the review processes for other events, which allow reviewers the choice of remaining anonymous or revealing their identities, the infovis process forbids reviewers from revealing their identities, which is absurd. In a court of law, the accused have the right to know the identity of their accusers. There’s a reason for this, which I believe applies to the paper review process as well, even though the consequences are not as dire.]

While of course Few is free to make his own choices in this situation, since they affect only himself, I strongly disagree with the assertion that anonymous reviews are fatally flawed. Anonymous reviews provide the opportunity for honest assessment without fear of future retribution or retaliation. They are a structural check against the problem that papers could be rejected out of grudges rather than judged on their merits. They also protect junior people's ability to honestly assess the work of senior people without the fear of such retaliation as unenthusiastic letters when tenure time rolls around. Neither of these situations is a problem for Few personally, since he doesn't submit papers or want tenure, but they are very real concerns for academics.

[SCF: You are ignoring the fact that anonymity also allows reviewers to reject papers because of grudges that they hold against the papers' authors. In your effort to protect reviewers, you are putting authors at risk.]

In the comment thread, Few expresses concern that anonymity supports irresponsible or incompetent behavior "in the shadows". What he isn't acknowledging is that there is indeed considerable and significant oversight in the review process, which happens at multiple levels. Reviewer identity is only anonymous *to the authors*. It is not at all anonymous to the other members of the program committee or the papers chairs!

First, there's a two-tier reviewing system, where the (primary and secondary) reviewers who are on the program committee have positions of higher responsibility than the external reviewers that they invite. These program committee members are carefully chosen based on the quality of the reviews they have written in the past.

The primary reviewer exercises judgement about the competence and thoughtfulness of the other reviewers when writing up the meta-review. As Jeff Heer alluded to in his first and second comments, all four reviewers read what the others wrote, and then discuss – sometimes at length. I consider it a sign of strength, not a process problem, that reviewers can and do regularly disagree on the merits of a particular paper. Usually these discussions end with some level of agreement, where either an initially positive person gets convinced by arguments about flaws from a more negative reviewer that there is a problem, or vice versa – that a reviewer who champions the worth of a paper (despite inevitable imperfections) convinces the others that it should see the light of day. As a PC member, I most certainly notice if an external on that team does a poor or incoherent job of reviewing, and I make it a point to not invite them again (and would sound an alarm if I saw that another PC member tried to do that in the future for a paper where I was on board).

Second, there's oversight from the three papers chairs, who read every single review. They explicitly note cases where there is a review quality problem. Program committee members whose review quality is too low — or who consistently invite unqualified externals who write low-quality reviews — are not invited to participate in subsequent years. At this point the pool has been sufficiently carefully vetted that there are only a few per year who are disinvited, and some years there's no need to eliminate anybody. Moreover, if the papers chairs are concerned that they don't have enough information to judge a particular paper, they may call in a "crash reviewer" to do an additional review with just a few days of turnaround time. I asked for these a few times when I was papers chair, and I've done a few of them myself in later years.

[SCF: These processes and safeguards are not effectively addressing the problems that I’ve identified. Invalid research papers are getting through the review process.]

It's true that the strengths and weaknesses of anonymous review are an active issue of debate across many scientific communities, and visualization is no exception. While I think that it's reasonable to discuss whether InfoVis should change the process, I believe that the stance that anonymity necessarily begets irresponsibility is overly simplistic. The strength of a single-blind reviewing system very much depends on process questions of how it is run, and I think InfoVis has an extremely robust and careful process. It yields higher-quality results than most other communities that I'm aware of.

I may well write further about this question in some later blog post, but that’s enough for now.

Quality of Evaluation Papers at InfoVis

The bar for ‘publishable’ and ‘strong’ typically moves over time at most venues. I’m confident that it’s gone in the right direction at InfoVis for evaluation papers: quality has increased. In the early years of InfoVis, there were no controlled experiments at all. Then there were a few, and they were fairly weak. As there came to be more and more, the bar was gradually raised, where they needed to be stronger and stronger to get in. I believe we’re now in a place where most are strong, and a few are great. I don’t believe we’ll ever be in a place where everybody thinks every single paper that gets in is great, because there is so much variation in the judgement about what it means to be great. That’s true for any venue at all.

On Tone

Punching Up vs Punching Down

Few clearly sees himself as punching up: he’s the David, the lone voice in the wilderness, the underdog. The Goliath that he’s fighting against is the slowly turning wheels of entrenched academia in general, of which the academics who dominate conferences like InfoVis are an instance in particular. All of his language frames himself as somebody who is fighting the good fight.

[SCF: This characterization of my position is "punchy," but ill chosen. I do not see myself as David facing down Goliath. I merely see myself as someone who knows and cares a great deal about data visualization and is concerned about the quality of data visualization research. That's it. Goliath was the champion of the Philistines, who were exercising oppressive dominion over the Israelites. The infovis research community exercises dominion over nothing but its own members. The infovis research community has little effect on the world. I've been trying to change that by helping you become more useful and relevant to the world. If you're searching for a biblical analogy, perhaps the Good Samaritan would be a better fit. I've taken the time to notice your wounds and give a damn. Few others in the world of data visualization practice have bothered. Given the reception that I've received, their indifference is easy to understand.]

In contrast, nearly every academic I’ve heard from who has seen his newsletter has reacted in shock, and there’s a palpable sense that Few crossed a line. I think that’s because we see it as punching down: he’s a senior person who is publicly attacking a junior person, and there’s a strong convention against doing that in academia.

[SCF: I find it revealing that “nearly every academic” feels a “palpable sense that Few crossed a line,” but apparently no one can explain where that line is drawn, who drew it, and why I should respect it. I’ve requested an explanation, but no one has responded.]

I need to think more about exactly why that social convention exists. My first speculation is that it’s a reaction to the strong hierarchical system of academia. Senior people have direct power over more junior ones in so many ways (hiring, reviews, tenure) that there’s a sense of noblesse oblige – that those with power and privilege have a duty to those who lack that power. (Or, if you like the pop culture superhero version better than the snooty French version – “with great power comes great responsibility”.)

[SCF: As a non-academic, I don't subscribe to this sense of hierarchy. I don't think of Michelle Borkin as junior. She is an assistant professor at a university. She has students of her own. She publishes research papers for the world to see (and, I thought, for the world to critique). We're all adults. When we put our work out there in the world, we must accept responsibility for it. It's that simple. I would expect the world of academia, perhaps above all others, to be open to critique. Alas, to my great dismay, I've found that this is far less the case than in the world of business where I've spent most of my career.]

I might in the future write a longer post just on this subject, but there’s a lot of ground that I want to cover so I’ll move on.

Pseudo-Science as Fighting Words

It’s disingenuous at best for Few to accuse somebody of doing ‘pseudo-science’ and then express surprise that people are getting upset. That’s like complaining that I can’t believe that person over there hit me in the face — when all I did was kick him in the stomach!

[SCF: I was not surprised that people were upset. I was surprised and disappointed by the unreasonable ways in which many in the academic community have responded (i.e., in the form of unwarranted personal attacks rather than by addressing the content of my critique). My claims were rational, accurate, and supported by evidence. Only two academics have responded in the thoughtful manner that I would expect from scientists: Jeff Heer and Pierre Dragicevic.]

His later comments said that the academics are not open to feedback and are slamming the door in the faces of people who don't have PhDs following their names. I don't agree that the irritation expressed by many academics at his remarks can fairly be interpreted as a sign of disrespect towards all non-academics; it's a sign that his rhetorical choices have made people angry at him in particular.

[SCF: Let's get straight what I'm saying. Many (perhaps most) infovis researchers are out of touch with the real world of data visualization practice and are responding to my critique in ways that demonstrate no concern for getting in touch. I'm tired of the claim that my so-called "rhetorical choices" excuse the academic community from responding thoughtfully. This is nothing but a diversion from a very real set of concerns that few in the academic community are willing to acknowledge, let alone address.]

‘Pseudo-science’ is fighting words: that label is a direct personal insult to the intelligence and integrity of a scientist. Of course there will be an emotional response. It's implausible to me that Few does not understand that this word choice would be a red flag. He made the deliberate choice to frame this debate as a fight rather than a discussion. He even admits in a later comment to being "intentionally provocative".

[SCF: Please don't falsely assign inflammatory intentions to me. I have not insulted anyone's intelligence or integrity. If you believe otherwise, you are welcome to provide specific examples. If my behavior were really at fault, you would not need to exaggerate it as you have. I went out of my way in my article to point out that the failures of Borkin's paper were not failures of intelligence or integrity.]

I suggest that the label of ‘pseudo-science’ should be reserved for things like Intelligent Design, where there is a deliberate attempt to cloak a non-scientific practice in the garb of science to deceive.

[SCF: I intentionally used the term “pseudo-science” to emphasize the harmful nature of the problem—a problem that is being propagated by the infovis research community, including several of its leaders. Rather than worrying about my use of the term “pseudo-science,” I suggest that you worry about fixing the problems that I have taken great pains to describe. Put your energy where it’s most needed. Opposing me is helping no one.]

If his goal is to make a useful contribution to the extensive and ongoing debate about the methods of science, he should not start things out by slinging personal insults. That choice makes it a lot harder to find a way to work with him. Given his choice of rhetoric, I’m not sympathetic to his position that the people who protest his tone are missing the point of his scientific critique. He made that bed, he gets to lie in it.

[SCF: No examples of "slinging personal insults" have come from me, although a few have been directed at me. I have appropriately assigned responsibility for a flawed research paper to the people who wrote it. When people find problems in my work, they assign responsibility to me. That's how this works. This is a responsibility that we must all accept when we put our work out there in the world. A fitting example of "slinging personal insults" actually comes from you in the "On Rhetoric" section below.]

On Rhetoric

Few frequently uses a family of rhetorical devices that I find very irritating.

[SCF: What I find irritating are false accusations, especially when they suggest that I am guilty of deception.]

One of these I’ve seen called by many different names, including the loaded question, begging the question, circular reasoning, or presupposed guilt. Perhaps the best-known example of this device is “Have you stopped beating your wife?”, where either a yes or a no answer implies guilt because of the false presupposition.

[SCF: As a longtime student of rhetoric, I can assure you that I am not guilty here of the loaded question, begging the question, circular reasoning, or presupposed guilt. I’ll respond to each of your specific claims below.]

Here’s one of many examples from the comments:

“If you disagree, you should defend the review process, not by quoting statistics about the number of papers, etc., but by explaining why poor research papers are accepted.” (Few to Fekete, post 27)

No. Fekete doesn’t have to explain *why* poor research papers are accepted because he did not agree with your assertion *that* poor papers are accepted.

[SCF: In my article, I pointed out several specific problems in infovis research. At no point did I assume or suggest that Fekete accepted my claims that these problems exist. At no point did I ask a question of Fekete that he could not answer without admitting fault. At no point did I lay a deceptive trap for him. This is not a court of law where someone is confined to a yes or no answer. I made the case that poor research papers are being accepted. He was welcome to respond by countering my argument that they are. Instead, Fekete made a speech about the glories of infovis research. In a debate, when you make an argument, your opponent is obligated to respond to it with reason and evidence. Fekete chose to ignore my arguments entirely. In other words, he chose to treat this like a televised debate among political candidates rather than a serious debate among peers.]

Here’s another example that’s even more blatant:

“My statement that professors who produce research papers such as this one will encourage their students to produce pseudo-science is not speculative, assuming that you accept my premise that this paper qualifies as pseudo-science.” (Few to Heer, post 20)

No. Heer explicitly *rejected* the premise that this paper qualified as pseudo-science, in the directly preceding paragraph. He most certainly did not accept the premise.

[SCF: I neither said nor implied that Jeff Heer accepted my premise. Furthermore, you should reread Jeff’s comments in the paragraph preceding my comments. He did not explicitly reject my premise as you claim.]

A related device is the insinuation of things that people did not mean, for example:

“What do you suggest that the infovis research community should do to prevent the kinds of flaws that you and I have both identified in this paper?” (Few to Heer)

Misleading. This phrasing strongly implies that Heer agreed with all of Few’s assertions – but he did not. Heer’s answer deftly sidesteps the attempted trap: “… I would have raised the issues I noted above (which only partially intersect with yours)”. (Heer to Few, post 33)

[SCF: No, the phrasing of my question does not imply that Jeff agreed with all of my assertions. Your argument here is an example of the flaw that you are accusing me of committing. You are insinuating something that I neither said nor meant. I laid no trap for Jeff. Also, the response from Jeff that you quoted as a deft sidestepping of my “attempted trap” is his answer to an entirely different question. The question to which Jeff was responding was, “If you had reviewed the ‘Beyond Memorability’ paper, would you have recommended it for publication in its current form?” I would not describe his sidestepping as deft, but as fearful. I believe that Jeff would not have accepted this paper, but that he fears the recrimination that would result from this admission.]

A third rhetorical device is the continual interweaving between facts that are well substantiated and agreed on by others, and his own opinions – without clearly distinguishing between the two – to present the misleading impression that everything he says is a faithful reflection of the conventional wisdom.

[SCF: When you make an accusation such as this, you should provide an example. I have no idea what you’re referring to.]

On Passion

Even as I’m irritated by Few’s choices of tone and rhetorical style, the silver lining is that I appreciate his passion for the cause of improving the work that we all do in the field of visualization. I’m delighted that the field is vibrant enough that we both care enough to argue about it – and that a bunch of other people care enough to follow that argument as well through tweets, blogs, and other social media avenues. That’s much better than apathy or disinterest!

[SCF: I’ve contributed a bit more than passion. Suggesting that I can only be valued for my passion is an example of the attitude that makes many leaders among data visualization practitioners dismiss the infovis research community as insular and irrelevant. Why should they get involved if this is how you respond to someone who has contributed as much as I have to the field of data visualization?]

sfew

Moderator
Registered:
Posts: 802
Reply with quote  #68 

I appreciate the fact that Tamara Munzner took time to respond to my article. She has added several good points to the discussion. Overall, however, I find her response disheartening. Three members of VisWeek’s Infovis Steering Committee have now responded: Jean-Daniel Fekete, Ben Shneiderman, and Tamara Munzner. They have unanimously failed to address my concerns. Tamara came the closest to responding, but failed to directly address the issues that I raised. The responses from these three infovis leaders leave me discouraged about near-term progress in infovis research.


__________________
Stephen Few
markb101

Registered:
Posts: 3
Reply with quote  #69 

An interesting discussion, which brings back memories of my time as an academic in a related research field. It is disheartening to see that the same problems that turned me off academia (publishing for the sake of career development rather than the advancement of knowledge) appear to still be ingrained.

The responses from a number of the academic “experts” reminded me of the famous Feynman speech where he introduces the notion of cargo cult science. As Feynman outlined, the “A-Number-1” experiment is not the one that comes up with fabulous evidence to support theory A or B, but the one that identifies all possible limitations of the proposed experiment and controls for those factors before it is run. Only then can we evaluate what we are truly interested in with any confidence.

http://calteches.library.caltech.edu/51/2/CargoCult.htm

Sadly, this type of rigor will not help secure tenure. 

What Stephen is pointing out is an important problem: a lot of noise being generated by a research field without any care for understanding the limitations of the experimental design. This ultimately has a knock-on effect in the real world. In the ’90s, Paul Meehl wrote repeatedly about the same problem in psychology:

Paul E. Meehl, "Why Summaries of Research on Psychological Theories Are Often Uninterpretable," Psychological Reports, 66 (Monograph Supplement 1-V66), 1990, pp. 195-244.

Anyhoo, the reason I am here is to take issue with Pierre’s comment on a sample size of 2.

“Here is a simple (perhaps simplistic) example involving a lab study. Suppose an investigator, John, wants to establish the existence of a strictly positive effect using conventional hypothesis testing (a t-test with alpha = .05). Also suppose John has strong reasons to believe that his metric of interest is normally distributed, and has good reasons to think that the effect size is enormous (i.e., a Cohen's d of about 10 -- bear with me).”

This is a dangerous misinterpretation of study design. As a trained and working statistician, I can think of no instance where a sample size of 2 would be appropriate for a study of this nature. Irrespective of the pre-specified effect size and variability, the power (1 − the Type II error rate) would be so low as to render the study result meaningless. We would have no confidence that the outcome was not due to chance. This also assumes that the hypothesis was pre-specified and that only that hypothesis was tested, rather than a trawling expedition through multiple competing hypotheses in search of a significant p-value or a confidence interval excluding zero (which is sadly the norm in such studies). We would also have to assume that all confounding variables were controlled for. Unless practice has changed dramatically since I was in academia, I could count on one finger the number of studies that pre-specify the research hypothesis up front.

This blog post sums it up:

http://andrewgelman.com/2014/11/17/power-06-looks-like-get-used/
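
To put rough numbers on the power problem, here is a minimal sketch in Python (scipy). The effect sizes are my own illustrative assumptions, not taken from any particular study; the calculation is for a two-sided, two-sample t-test with equal group sizes.

import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test when the true effect is Cohen's d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)   # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # rejection probability = mass of the noncentral t beyond the critical values
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

# d = 0.5 is a conventionally "medium" effect, a common planning assumption in the
# behavioural sciences (an illustrative choice, not a claim about any infovis study)
for n in (2, 16, 64):
    print(f"d = 0.5, n = {n:2d} per group: power = {two_sample_power(0.5, n):.2f}")

With a medium effect and two participants per group, the power is barely above the 5% false-positive rate; it takes roughly 64 per group to reach the conventional 80%. That is the low-power territory the Gelman post describes.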

On the topic of memorability, I can't say it has any relevance to presenting findings in the field I work in (life sciences). Clear and accurate summaries of the data are crucial to support understanding. Any graphical summary that is not motivated by these ideals would be memorable, but for the wrong reasons.

On the issue of crossing the line: when did academics become so meek? There is a great tradition in statistics and other fields of publishing peer reviews of newly published research, along with rejoinders.


dragice

Registered:
Posts: 7
Reply with quote  #70 
Thanks Mark. Can you elaborate on why it is a dangerous misinterpretation? In my example the power is 0.8.
markb101

Registered:
Posts: 3
Reply with quote  #71 

Sure, I can elaborate, but I doubt I can do a better job than the Meehl article I put up as a reference.

My comment related to the type of user study/experiment under discussion (e.g., one supporting a working hypothesis about human cognition), not to the contrived example that followed on from the quote I selected (but I will comment on that later).

My main objection is similar to Stephen's, in that a sample size of two would not support general statements about the wider population. We can go into the theory of Student's original test and small sample sizes, but I doubt the underlying assumptions that would make the t-test valid for a sample size of two would hold for a user study. There are numerous confounding factors that may influence such an experiment: selection biases, measurement biases, and so on. There is also the issue of pre-specification (and multiplicity), which I mentioned in the previous post.

I don't doubt that examples exist in the physical sciences where the measurement of an effect is so accurate that a small sample size would suffice. But I am sure you are aware that the same conditions and assumptions soon unravel when it comes to the soft sciences (which is where I would place this type of research).

In terms of your contrived example, I think you indirectly highlight the issue with null hypothesis significance testing. The assumed effect is large (I'm going to gloss over what you mean by a large effect for a user study), and we can achieve 80% power to reject the null hypothesis (significance level of 0.05) with a sample size of 2 using a t-test (again, I will gloss over whether this is a one-sample, two-sample, or paired t-test). I am also assuming the hypothesis is pre-specified and is the only hypothesis to be tested.
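
Since the choice of test is being glossed over, for what it's worth it does move the number. Here is a quick check (my assumptions, since they weren't specified: two-sided tests at alpha = 0.05, Cohen's d = 10, and "a sample size of 2" read either as two paired observations or as two observations per group):

import numpy as np
from scipy import stats

def nct_power(t_crit, df, ncp):
    # two-sided rejection probability under the noncentral t distribution
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

d, alpha = 10.0, 0.05

# reading 1: paired / one-sample t-test on n = 2 observations (or difference scores)
n = 2
df, ncp = n - 1, d * np.sqrt(n)
print("paired / one-sample:", round(nct_power(stats.t.ppf(1 - alpha / 2, df), df, ncp), 2))

# reading 2: two-sample t-test with n = 2 per group
df, ncp = 2 * n - 2, d * np.sqrt(n / 2)
print("two-sample         :", round(nct_power(stats.t.ppf(1 - alpha / 2, df), df, ncp), 2))

The two readings give noticeably different power figures, so a single quoted figure like 80% presupposes one particular configuration. None of this changes the substantive point; it just shows how much rides on design details that were left unstated.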

After completion of the experiment, rejecting the null hypothesis is of little meaningful value. As I am sure you are aware, rejecting such a weak null hypothesis does not necessarily translate into evidence supporting the alternative hypothesis you wish to support. A more meaningful test would be to evaluate the effect-size assumption used to design the study and to check whether the new experiment validated it. That would be the more meaningful analysis.
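
To illustrate that last point, here is a rough simulation (entirely my own construction: the stipulated d of about 10, unit-variance normal data, and a two-sided, two-sample t-test at alpha = 0.05). Most of these tiny experiments reject the null, yet the estimated effect size varies so wildly that a single significant result cannot tell you whether the d of about 10 planning assumption was anywhere near right.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n, alpha, n_sim = 10.0, 2, 0.05, 20_000

rejected, d_hat = 0, []
for _ in range(n_sim):
    a = rng.normal(0.0, 1.0, n)        # "control" group, sd = 1
    b = rng.normal(true_d, 1.0, n)     # "treatment" group, true Cohen's d = 10
    _, p = stats.ttest_ind(a, b)       # two-sample, two-sided t-test
    rejected += p < alpha
    s_pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d_hat.append((b.mean() - a.mean()) / s_pooled)

lo, hi = np.percentile(d_hat, [2.5, 97.5])
print(f"rejection rate (empirical power): {rejected / n_sim:.2f}")
print(f"95% of estimated d values fall roughly in [{lo:.1f}, {hi:.1f}]")

The rejection rate is high, but the spread of estimates is enormous, which is exactly why rejecting the weak null says so little about the effect-size assumption that motivated the design.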

I would urge you to read the Meehl article closely and then rethink the contrived scenario you proposed in the context of your research field.

If you could also give concrete examples where prior research supported such a contrived scenario, this would help the credibility of your position.

Finally, it would also help if you critically evaluated the references you cite. I wanted to quote back to you a sentence from the Winter paper that you used to support a sample size of 2: “Taking this further, it can be argued that if a psychologist observes a statistically significant effect based on an extremely small sample size, it is probably grossly inflated with respect to the true effect, because effect sizes in psychological research are typically small. Accordingly, researchers should always do a comprehensive literature study, think critically, and investigate whether their results are credible in line with existing evidence in the research field”. Food for thought?

dragice

Registered:
Posts: 7
Reply with quote  #72 
Mark, I think you misunderstood my point. If you read my post carefully and in context, you will see that I am not using this example to advocate the use of NHST and very small samples in actual research. My example is a thought experiment used to support a logical argument.

My point was that we cannot judge the reliability of a study based on information about sample size alone, e.g., by examining whether N is above or below whatever happens to be our favorite sample size. I could easily have found an example where N=10,000 is insufficient, but I like the case where a ridiculously small sample (N=2) is sufficient because it's striking and perhaps counter-intuitive to many.
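
For completeness, the other direction is just as easy to show. A quick check (my own illustrative numbers: a two-sided, two-sample t-test at alpha = 0.05 and a tiny true effect of d = 0.01) gives a power figure nowhere near adequate even with ten thousand participants per group:

import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    # power of a two-sided, two-sample t-test when the true effect is Cohen's d
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

print(f"d = 0.01, N = 10,000 per group: power = {two_sample_power(0.01, 10_000):.2f}")

So N on its own, whether 2 or 10,000, says nothing about the reliability of a study without the rest of the design.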

All the problems you discuss are real and important, but they are irrelevant to the point I was trying to make. More specifically, all of your objections are valid if we think of my example as actual research, and they would remain just as valid if my example assumed a larger N and the same statistical power (and therefore a smaller effect size).

Note that I'm assuming that all this discussion happens within the NHST framework, because this is the framework under which most researchers operate, including the reviewers who reject studies based on sample size alone. I do realize that if we switch to a Bayesian framework, sophisticated arguments against small samples become possible.
markb101

Registered:
Posts: 3
Reply with quote  #73 
I understood why you were raising the example: to support the argument that sample size alone doesn't help critique a study and shouldn't be a reason to reject one. I also agree that a large sample size will not save a poorly designed study. I think we are singing from the same hymn sheet. But I still don't think it's a good example to use in this context. The example could send the wrong message: that it is only the statistical analysis that needs to change, not the study design or measurement. So let's agree to disagree.

I also don't agree that a Bayesian analysis will resolve anything if the study fundamentals (design and measurement) are flawed. Although power is no longer the aim during planning, you still need to collect enough data points for robust model estimation.

Anyway, it's an interesting discussion, and I do commend you for trying to fix your research field from the inside. It's a tough task. Good luck.