Discussion


MattC

Registered:
Posts: 5
Reply with quote  #1 
Hi all,

I'm looking for comments on the top graph depicted here

[leu2013336f1] 

which was taken from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3918868/figure/fig1/ and this journal article http://www.nature.com/leu/journal/v28/n2/full/leu2013336a.html 


This has now become the "industry standard" graph for depicting this type of result, with the author trying to show 4 features:

1. The number/percentage of patients (y-axis) per gene mutation (x-axis), with 47 gene mutations being evaluated using over 800 patients.

2. That there is a Pareto principle, in that a small number of genes account for most of the patients, but that there is a "long tail", i.e. the distribution shows right (positive) skew.

3. The composition of each gene mutation by the 8 different WHO subtypes (RA, RARS, RARS-T, etc., as denoted by the different colours) differs per gene mutation. For example, the 13th category on the x-axis, JAK2, is composed mainly of the RARS-T WHO subtype.

4. The converse of point 3, i.e. that most RARS and RCMD-RS are found in the 2nd category, SF3B1.
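
Points 3 and 4 are two different normalisations of the same gene-by-subtype table: point 3 divides each cell by its gene's total, and point 4 divides each cell by its subtype's total. A minimal sketch in Python, assuming a small table of invented counts (the numbers and the function names are hypothetical and not taken from either paper). Because a patient can carry more than one mutation, the subtype totals here count mutation instances rather than patients:

```python
# Hypothetical counts per gene (rows) and WHO subtype (columns).
# The numbers are invented for illustration; they are not from either paper.
counts = {
    "SF3B1": {"RARS": 40, "RCMD-RS": 30, "RAEB-1": 10},
    "RUNX1": {"RARS": 2, "RCMD-RS": 3, "RAEB-1": 15},
    "JAK2": {"RARS": 1, "RCMD-RS": 1, "RAEB-1": 2},
}


def share_within_gene(counts):
    """Point 3: for each gene, the fraction of its instances in each subtype."""
    result = {}
    for gene, row in counts.items():
        total = sum(row.values())
        result[gene] = {subtype: n / total for subtype, n in row.items()}
    return result


def share_within_subtype(counts):
    """Point 4: for each subtype, the fraction of its instances in each gene."""
    totals = {}
    for row in counts.values():
        for subtype, n in row.items():
            totals[subtype] = totals.get(subtype, 0) + n
    return {
        gene: {subtype: n / totals[subtype] for subtype, n in row.items()}
        for gene, row in counts.items()
    }
```

With these invented numbers, `share_within_gene(counts)["SF3B1"]["RAEB-1"]` and `share_within_gene(counts)["RUNX1"]["RAEB-1"]` give the kind of direct RAEB-1 comparison that is hard to read off stacked bars.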

All 4 features are relevant and need to be represented, yet I was wondering if there was a more effective way of demonstrating them in a single graph. I find it difficult to compare, for example, the frequency of RAEB-1 (in light blue) in the SF3B1 category with that in the RUNX1 category.
 
(see also figure 1A in http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3837510/figure/F1/ which depicts the same thing but uses number as opposed to frequency on the y-axis, and is copyrighted)

Thanks in advance


sfew

Moderator
Registered:
Posts: 823
Reply with quote  #2 
Hi MattC,

There is a considerably more effective way to display this information, which addresses all four of the points that you described. I will gladly illustrate, but it would work best if I have the underlying data. Do you have it or know how I can easily get it?

__________________
Stephen Few
MattC

Registered:
Posts: 5
Reply with quote  #3 
Hi Stephen,

Many thanks for taking the time to look at this and trying to find a better way of representing the data.

Like I mentioned, these two papers are basically the first to show this type of genetic data but, with the advent of whole genome sequencing of different diseases, they certainly won't be the last!

As these two papers were the first in their field and both have used this way of graphically representing the data then everyone will (have to) adopt this graphical method because, "well, that's how this data has already been represented". And so it becomes a circular argument as everyone then graphs it this way.  

Ok, diatribe over! So the data from both publications is publicly available but may not be quite so easy to work with.

The one from the inserted graph is found here as Supplementary Information (not Supplementary Figures):

 http://www.nature.com/leu/journal/v28/n2/suppinfo/leu2013336s1.html

Unfortunately, this is a Word document and the raw data is tabulated in Supplementary Table S8.

It may be easier to work with the data from the hyperlinked study at the bottom of my previous post and repeated here:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3837510/figure/F1/

The author of this study has provided the same type of data in .xlsx format, but with 13 categories and 106 genes.

This is available here http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3837510/bin/supp_122_22_3616__index.html and is named Table S2. Gene mutations in study.

The critical columns in the .xlsx sheet are SAMPLE NAME, Category, and Gene. One thing to note is that each SAMPLE NAME in the .xlsx sheet can have more than one Gene. This is shown in "c" in the picture I previously inserted into the original post. This is why there are 652 SAMPLE NAME entries but 2259 rows in the .xlsx file.

To make your life easier, if you wanted to filter on "Oncogenic" in the "Annotation" column, that is perfectly acceptable to reduce the number of Genes.
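
The counting this implies could be sketched as follows, in Python. This is only an illustration under stated assumptions: the rows are invented stand-ins for the sheet (the sample names, the category labels and the "Unknown" annotation value are hypothetical), loading the .xlsx file is left out, and it assumes a sample listed twice for the same gene should be counted once.

```python
from collections import Counter

# Hypothetical rows using the column names from the sheet.
# All values are invented; the real sheet has 2259 rows for 652 samples.
rows = [
    {"SAMPLE NAME": "S1", "Category": "RARS", "Gene": "SF3B1", "Annotation": "Oncogenic"},
    {"SAMPLE NAME": "S1", "Category": "RARS", "Gene": "JAK2", "Annotation": "Oncogenic"},
    {"SAMPLE NAME": "S2", "Category": "RAEB-1", "Gene": "RUNX1", "Annotation": "Oncogenic"},
    {"SAMPLE NAME": "S2", "Category": "RAEB-1", "Gene": "RUNX1", "Annotation": "Oncogenic"},
    {"SAMPLE NAME": "S3", "Category": "RARS", "Gene": "SF3B1", "Annotation": "Unknown"},
]


def count_samples(rows, annotation=None):
    """Count distinct samples per gene and per (gene, category) cell.

    If `annotation` is given, only rows with that Annotation value are used.
    """
    seen = set()
    for row in rows:
        if annotation is not None and row["Annotation"] != annotation:
            continue
        # A set keeps each (sample, category, gene) combination only once.
        seen.add((row["SAMPLE NAME"], row["Category"], row["Gene"]))
    per_gene = Counter(gene for _, _, gene in seen)
    per_cell = Counter((gene, category) for _, category, gene in seen)
    return per_gene, per_cell
```

Calling `count_samples(rows, annotation="Oncogenic")` applies the filter; calling it without the argument keeps every row.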

I hope this makes some sense and I'm looking forward to your solution.

Many thanks again.

Matt 
 
sfew

Moderator
Registered:
Posts: 823
Reply with quote  #4 
Hi Matt,

I used a reduced set of data from the third source that you mentioned above to create the example below. Notice that the first column following the gene names on the left shows the total count of instances for each gene and the following columns display values per category. The quantitative scale for the category columns is different from the scale for the total column because the category values are smaller. Compared to displaying this information as stacked bars, this approach allows you to easily compare the values for each category.

Genes and Categories.png 
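
For readers who cannot see the image, the layout described above can be approximated in plain text. This is a rough sketch in Python, not the tool used to make the example: the function names and the counts are invented, and it assumes all category columns share one scale that is separate from the total column's scale.

```python
def bar(value, scale_max, width=20):
    """Return a text bar whose length is proportional to value / scale_max."""
    if scale_max <= 0:
        return ""
    return "#" * round(width * value / scale_max)


def render(table, categories, width=20):
    """Return one text line per gene: a total bar, then one bar per category.

    Genes are sorted by total, largest first. The total column uses its own
    scale; the category columns share a second, smaller scale.
    """
    totals = {gene: sum(row.get(c, 0) for c in categories) for gene, row in table.items()}
    genes = sorted(totals, key=totals.get, reverse=True)
    total_max = max(totals.values())
    category_max = max(row.get(c, 0) for row in table.values() for c in categories)
    lines = []
    for gene in genes:
        cells = [bar(totals[gene], total_max, width).ljust(width)]
        for c in categories:
            cells.append(bar(table[gene].get(c, 0), category_max, width).ljust(width))
        lines.append(gene.ljust(8) + " | " + " | ".join(cells))
    return lines


# Hypothetical counts, invented for illustration only.
table = {
    "RUNX1": {"RARS": 4, "RAEB-1": 16},
    "SF3B1": {"RARS": 40, "RAEB-1": 10},
}

for line in render(table, ["RARS", "RAEB-1"]):
    print(line)
```

Because every category column starts from its own baseline, the RAEB-1 bars for two genes can be compared by length alone, which stacked segments do not allow.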


__________________
Stephen Few
MattC

Registered:
Posts: 5
Reply with quote  #5 
Hi Stephen,

Many thanks for this. I knew that there must be an alternative and better approach for visualizing the data. It contains the same data but makes it far easier to compare the different categories.

Unfortunately, as previously stated, the previous method now seems to be the industry standard. I did show your graph to a colleague (who had previously published in a journal using the other approach) and they agreed that your approach was far clearer and was something we should adopt.

The same colleague and I then discussed how neither of us found "Circos" plots helpful. These are again quite widespread in the same field of study to show any correlation between genetic mutations and are a different way of representing the data found in the plot of the original post. An example of its use can be found here http://www.bloodjournal.org/content/126/6/790.figures-only

So, I would be interested to hear your thoughts or alternatives on that approach. I'm quite happy to post another thread and try to provide some examples which can be posted under the Creative Commons licence, but am unsure whether these should be posted in the "good graphs" or "bad graphs" category. I think the latter but others, obviously, think the former due to their widespread use.

Matt
sfew

Moderator
Registered:
Posts: 823
Reply with quote  #6 
Hi Matt,

Is this what you're calling a Circos plot?

Circos Graph.png 


__________________
Stephen Few
MattC

Registered:
Posts: 5
Reply with quote  #7 
Hi Stephen,

Yes. That's the one. Again, it's becoming a really common way to look at large numbers of cancer patients and visually represent correlations between gene mutations, and it has been used in very highly ranked scientific journals.

Apologies if you've covered this before. I have tried searching in the archives using the "circos" term but it didn't return any results. However, it may have been referred to under a different name. If you have covered it before please let me know the alternative name and I'll search on that.

Many thanks again,

Matt
sfew

Moderator
Registered:
Posts: 823
Reply with quote  #8 
Matt,

This is an example of a circular network graph. I wasn’t familiar with the Circos term, so I looked it up. Circos is the name of a particular software product that can be used to create several different types of graphs that are circular. Most circular graphs are of little use (pie charts, radar charts, polar charts, etc.), but the circular network graph can be useful when it's used and designed properly. It's properly used to show relationships between a number of items that all belong to the same category. For example, you could use one to show relationships between different genes, as you mentioned. These graphs can be hard to read if they consist of many items with overlapping bands that connect them to show relationships. They always benefit from being interactive, with the ability to select a particular item to highlight its connections to other items.
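
The data behind such a graph is a set of pairwise links. As a minimal sketch in Python, assuming the relationship of interest is two gene mutations occurring in the same sample (the sample names, the gene assignments and the function names are invented for illustration), the links and the "select one item" interaction could be computed like this:

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping from sample to the genes mutated in it (invented data).
sample_genes = {
    "S1": {"SF3B1", "JAK2"},
    "S2": {"SF3B1", "JAK2", "RUNX1"},
    "S3": {"RUNX1", "SF3B1"},
}


def co_occurrence(sample_genes):
    """Count, for every pair of genes, the samples in which both are mutated."""
    pairs = Counter()
    for genes in sample_genes.values():
        # Sorting makes each pair appear in one fixed order, e.g. (JAK2, SF3B1).
        for a, b in combinations(sorted(genes), 2):
            pairs[(a, b)] += 1
    return pairs


def connections(pairs, gene):
    """Return only the links that involve the selected gene."""
    linked = {}
    for (a, b), n in pairs.items():
        if gene == a:
            linked[b] = n
        elif gene == b:
            linked[a] = n
    return linked
```

Each entry in `pairs` corresponds to one band in the circular graph, and `connections(pairs, "SF3B1")` is the subset an interactive version would highlight when SF3B1 is selected.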

__________________
Stephen Few
MattC

Registered:
Posts: 5
Reply with quote  #9 
Hi Stephen,

I would be inclined to agree. I can see how they would have utility when embedded in a webpage where you can isolate, for example, an individual gene to see the network of relationships.

However, when I see something like this in a high impact factor journal (which the company itself proudly advertises as one of its successes, especially as it made the front cover)

http://www.cell.com/action/showImagesData?pii=S1535-6108%2813%2900003-2

Well, I just switch off as it just becomes incomprehensible (and I actually understand what they are trying to show!). Maybe it's just me, though!

Anyway, thanks again for your time and advice. I've already applied the method you suggested in your earlier post to visualize some of my data from a project I'm working on, and it makes deciphering the data far easier.

Matt