Some personal background (feel free to jump to the first table below if you wish):
Back in the day, when I did my first Master's degree in Statistics and Operations Research, way before Data Science was even a thing, I went to my adviser, Prof. Yoav Benjamini, and asked him for a cool project for my thesis. That was around 2003, and Bioinformatics was the hot, trendy field to be in. He told me he had a nice project with Prof. Dan Graur and Dr. Shaul Shaul that involved interesting genomic data and had great potential to benefit from statistical analysis, specifically classification modeling (CART). I immediately dived into the fascinating field of Bioinformatics (or computational biology, if you prefer that term), which ended up in a publication in a respected peer-reviewed journal, RNA.
Back then, it took me hours to plot the following tree visualization with R, and then to carefully edit it by hand in PowerPoint, centering the numbers and letters at the nodes and leaves. I wished there was a better way to do it. Maybe even an R package that specializes in just that.
I then moved to another hot topic in the field, analyzing gene expression data from mRNA microarray chips, and left phylogeny trees behind. A couple of years later, Dr. Tal Galili continued working on the project and published a follow-up paper, also in the RNA journal, with much nicer tree plots. Tal created his own R package, dendextend, to plot and compare phylogeny trees like the following. Such a nice plot!
And then (finally) came the fun days of the tidyverse and ggplot. What an exciting era to live through, when yesterday’s worst nightmare of convoluted data structures and spaghetti code has become much more elegant, or even… tidy.
After a couple more fun years of microarray data analysis and multi-omic data, I transitioned into a Data Science role in industry, and instead of counting genes and proteins I used my well-crafted R tools to count… clicks and IOs. Occasionally I wondered what my genomic data analysis would have looked like if I had used the new(ish) packages that are out there for phylogeny tree analysis and tree visualization. While I try to keep myself updated on new Bioconductor S4 classes, unfortunately I didn’t have the motivation to dive deeper and try these packages. I did, however, take a mental note of the package names. They are even easier to remember if they have a tidy!#$@ or gg^&*^ prefix.
One shiny day, a challenging data structure arrived at my desk. It was some sort of deeply nested list of lists, similar to a file system of folder and file names, an organization chart of roles, or a JSON file. It required wrangling tasks beyond reshaping into a tabular shape, such as recursive aggregations across the “paths” of the nested parent-child edges/nodes. At first I felt overwhelmed, but then, for a moment, it reminded me of a … tree data structure! And then I knew that the day I was patiently waiting for had finally arrived. This would be my motivation to dust off the mental notes that I had been collecting over the years, and finally spend some time learning and using these packages.
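To make the “recursive aggregation across paths” idea concrete, here is a minimal, language-agnostic sketch in plain Python (not one of the R packages discussed in this post). The nested dictionary, its folder and file names, and the `aggregate_sizes` helper are all hypothetical, made up purely for illustration: folders are dictionaries, files are integer sizes, and each folder's total is the sum of everything beneath it.

```python
# Hypothetical nested structure: folders are dicts, files are integer sizes.
file_system = {
    "home": {
        "docs": {"thesis.pdf": 120, "notes.txt": 5},
        "pics": {"tree.png": 300},
    },
    "tmp": {"cache.bin": 40},
}


def aggregate_sizes(node, path="", totals=None):
    """Return (total size under node, dict of folder path -> total size)."""
    if totals is None:
        totals = {}
    if not isinstance(node, dict):
        # A leaf (file): its size is its own total.
        return node, totals
    subtotal = 0
    for name, child in node.items():
        # Recurse into each child, extending the path by the child's name.
        child_total, _ = aggregate_sizes(child, f"{path}/{name}", totals)
        subtotal += child_total
    # Record this folder's total; the root folder is stored under "/".
    totals[path or "/"] = subtotal
    return subtotal, totals


total, folder_totals = aggregate_sizes(file_system)
```

The same pattern (visit the children, combine their results, store the result at the parent) is what a dedicated tree package would typically let you express as a traversal plus an aggregation function.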
So where to begin? Well, everything is tidy now, so there is no need to reinvent the wheel, and no need to bet on some random unmaintained package that might do the job but has a limited scope and carries the risk of leaving you stuck with limited resources. Luckily, the tidy#%^$ packages did provide what I was hoping to find. The first promising sign was that the package documentation referenced the legacy ancestor packages that preceded the current one, and in many cases the new packages inherit their class types from the original classes (with some extensions and generalizations). When I was still in doubt about some functionality, it was relatively easy to jump into the source code, the .R file in the GitHub repository, and see the exact definition and dependencies of the function. I then tried to conceptualize the similarities between the packages and summarize the relationships between them. I tried to organize them by:
- Which packages are part of the same ecosystem and depend on similar classes.
- Are there utility functions to easily coerce and convert one class to another?
Trees are bi-directional, ordered graphs:
Exploring old and new tree data structure and visualization packages was not enough. There was some overlap in functionality with graph/network packages. Well, this is not surprising at all, since a tree is a special case of a graph: a connected graph with no cycles that is also bi-directional and ordered! Bi-directional means that you can navigate from parent to child and vice versa. Ordered means that the sort order of the children of a parent node is well-defined. Or, the other way around… a graph is a generalization of a tree.
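As a small illustration of these two properties, here is a sketch in plain Python (again, not the API of any of the packages in this post). The `Node` class and the organization-chart names are hypothetical. Each node keeps a pointer to its parent and an ordered list of children, so you can walk in both directions, and the same structure can be read off as a graph edge list.

```python
class Node:
    """A tree node with a parent pointer and an ordered list of children."""

    def __init__(self, name):
        self.name = name
        self.parent = None   # link upward (child -> parent)
        self.children = []   # links downward; a list preserves sibling order

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def path_to_root(self):
        """Walk upward through parent pointers, collecting names."""
        node, path = self, []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path

    def edges(self):
        """Yield (parent, child) name pairs depth-first, in sibling order."""
        for child in self.children:
            yield (self.name, child.name)
            yield from child.edges()


# Hypothetical organization chart.
root = Node("CEO")
cto = root.add_child(Node("CTO"))
cfo = root.add_child(Node("CFO"))
dev = cto.add_child(Node("Developer"))

upward = dev.path_to_root()     # child -> parent navigation
edge_list = list(root.edges())  # the tree viewed as graph edges
```

Seen as an edge list, the tree is just a graph with extra guarantees (one parent per node, no cycles), which is why graph/network tooling can handle it as well.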
Eventually, the following summary table of packages and ecosystems emerged:
If you don’t care much about the “historic” ancestors of the “modern” packages, the main take-home message from the above table is:
ggtree is the modern package suite for everything-tree-related analysis + visualizations.
data.tree is a specialized package for tree “operations” such as: traversal, search, and sort operations, and an infrastructure for recursive tree programming.
The tidygraph package is a modern wrapper around the igraph package that leverages ‘tidy’ style/structure/verbs/formats. It goes hand in hand with the ggraph package for visualization and many other algorithms.
I also compiled a more detailed functionality comparison, but didn’t fill in all of the blanks, so I will keep it incomplete for now:
So which package should I use?
Cool, so whatever I wish to do with dedicated packages for tree data structures, I can also do with graph/network packages. In both cases, I can use base R and do everything from scratch, use the good old original packages, use the modern tidy/gg packages, or dive even deeper with the specialized packages (e.g. data.tree). So which one should I choose?
Well, since most packages have utility functions to coerce from one class to another, in theory I could use any one of them, and then pick the one that best fits my needs (e.g. customized visualization, potential to extend and relax some of the constraints on the analysis).
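To show what such a coercion amounts to conceptually, here is a minimal sketch in plain Python that converts a nested (tree-style) representation into an edge list (graph-style) and back. It does not reproduce any package's actual conversion function; the helper names `nested_to_edges` and `edges_to_nested` and the example data are hypothetical, and the sketch assumes that node names are unique.

```python
def nested_to_edges(tree, parent):
    """Flatten a nested dict of children into (parent, child) edges."""
    edges = []
    for name, subtree in tree.items():
        edges.append((parent, name))
        edges.extend(nested_to_edges(subtree, name))
    return edges


def edges_to_nested(edges, root):
    """Rebuild the nested dict from an edge list (assumes unique node names)."""
    children = {}
    for parent, child in edges:
        # Appending in edge order preserves the original sibling order.
        children.setdefault(parent, []).append(child)

    def build(name):
        return {child: build(child) for child in children.get(name, [])}

    return build(root)


# Hypothetical organization chart, rooted at "CEO".
nested = {"CTO": {"Developer": {}, "Analyst": {}}, "CFO": {}}
edge_list = nested_to_edges(nested, "CEO")
round_trip = edges_to_nested(edge_list, "CEO")
```

If the round trip returns the structure you started with, nothing was lost in the conversion, which is a useful sanity check before committing to one representation for the rest of an analysis.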
However, the reality is that you often start with one package, but then get stuck at some dead end with an annoying unresolved issue. No matter how hard you try to fix an error message, going through endless Stack Overflow pages and GitHub issues, you still can’t get it to work. Sometimes, even though there is a big community around R open source tools, encountering such issues feels very lonely and frustrating, and makes you want to scream:
How come I am the only one in the entire universe who gets this annoying error message! ^&*#$%#@
In such frustrating moments, it is good to try an alternative package, even if it wasn’t your first choice. Moreover, jumping between packages and their different documentation resources often helps you figure out parts of the other package that did not resonate when you first went through them. Hurray!!! You are now an expert in both packages.
In a future blog post I hope to demonstrate a tree data structure and analysis case study with ALL of the above packages. Who knows, maybe even with the exact same data I had in my Master’s thesis. Or maybe even using an image analysis package that takes the above picture as an input and breaks it back into the data structure. What an exciting era. Stay tuned!
I usually try to ignore the snake in the room, but Python probably also has cool packages for this type of data structure. Well, pardon my French, but in some analytics domains one language’s open source community is still way behind the other’s. This of course goes both ways. It almost reminds me of the days when I used only base R. Here are a couple of Python tree data structure resources that I found:
Python - Binary Tree
Tree represents the nodes connected by edges. It is a non-linear data structure. It has the following properties − One…
- ape: https://cran.rstudio.com/web/packages/ape/index.html https://nantucketdeveloper.github.io/2019Workshop/tree-structure/
- phangorn: https://cran.r-project.org/web/packages/phangorn/index.html
- phylobase: https://cran.r-project.org/web/packages/phylobase/index.html
- treeio: http://bioconductor.org/packages/release/bioc/html/treeio.html
- tidytree: https://yulab-smu.top/treedata-book/
- dendextend: https://talgalili.github.io/dendextend/index.html
- data.tree: https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html#tree-traversal https://cran.r-project.org/web/packages/data.tree/vignettes/applications.html
- tidygraph + igraph: https://www.data-imaginist.com/2017/introducing-tidygraph/