Difference between revisions of "SPR Supertrees README"

From Bioinformatics Software
Jump to: navigation, search
Line 1: Line 1:
 
<pre>
 
<pre>
 
################################################################################
 
################################################################################
rspr
+
spr_supertree
  
 
################################################################################
 
################################################################################
  
Usage: rspr [OPTIONS]
+
Usage: spr_supertree [OPTIONS]
Calculate approximate and exact Subtree Prune and Regraft (rSPR)
+
      spr_supertree-omp [OPTIONS]
distances and the associated maximum agreement forests (MAFs) between pairs
+
Calculate supertrees that minimize the SPR distance from the input
of rooted binary trees from STDIN in newick format. Supports arbitrary labels.
+
trees. By default calculates a rooted SPR supertree from a list
The second tree may be multifurcating.  
+
of rooted binary trees from STDIN in newick format. An initial
 +
tree is built by greedily adding taxa in decreasing order of
 +
ocurrence. The tree is then improved by SPR rearrangements.
 +
Additional options allow for unrooted and/or multifurcating input trees.
  
Can also compare the first input tree to each other tree with -total or
+
Copyright 2013-2014 Chris Whidden
compute a pairwise distance matrix with -pairwise.
+
 
+
Copyright 2009-2014 Chris Whidden
+
 
whidden@cs.dal.ca
 
whidden@cs.dal.ca
http://kiwi.cs.dal.ca/Software/RSPR
+
http://kiwi.cs.dal.ca/Software/SPR_Supertrees
April 29, 2014
+
March 3, 2014
Version 1.2.2
+
Version 1.2.1
  
This file is part of rspr.
+
This file is part of spr_supertrees.
  
rspr is free software: you can redistribute it and/or modify
+
spr_supertrees is free software: you can redistribute it and/or modify
 
it under the terms of the GNU General Public License as published by
 
it under the terms of the GNU General Public License as published by
 
the Free Software Foundation, either version 3 of the License, or
 
the Free Software Foundation, either version 3 of the License, or
 
(at your option) any later version.
 
(at your option) any later version.
rspr is distributed in the hope that it will be useful,
+
 
 +
spr_supertrees is distributed in the hope that it will be useful,
 
but WITHOUT ANY WARRANTY; without even the implied warranty of
 
but WITHOUT ANY WARRANTY; without even the implied warranty of
 
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
Line 32: Line 33:
  
 
You should have received a copy of the GNU General Public License
 
You should have received a copy of the GNU General Public License
along with rspr.  If not, see <http://www.gnu.org/licenses/>.
+
along with spr_supertrees.  If not, see <http://www.gnu.org/licenses/>.
  
 
*******************************************************************************
 
*******************************************************************************
Line 38: Line 39:
 
*******************************************************************************
 
*******************************************************************************
  
These options control what algorithm is used
+
These options control what algorithm is used to determine the SPR distance
 +
from the supertree to the input trees. By default -bb is used.
  
 
-fpt        Calculate the exact rSPR distance with an FPT algorithm
 
-fpt        Calculate the exact rSPR distance with an FPT algorithm
  
 
-bb        Calculate the exact rSPR distance with a branch-and-bound
 
-bb        Calculate the exact rSPR distance with a branch-and-bound
             FPT algorithm. This is enabled by default.
+
             FPT algorithm. This is the default option.
 +
 
 +
-approx    Calculate just a linear -time 3-approximation of the rSPR distance
  
-approx    Calculate just a linear-time 3-approximation of the rSPR distance
+
-max k      Calculate the exact rSPR distance if it is k or less and
 +
            otherwise use the 3-approximation
  
 
-split_approx
 
-split_approx
 
-split_approx x  Calculate the exact rSPR distance if it is k or less and
 
-split_approx x  Calculate the exact rSPR distance if it is k or less and
 
                 otherwise use the exponential-time approximation
 
                 otherwise use the exponential-time approximation
 
-cluster_test  Use the cluster reduction to speed up the exact algorithm.
 
                This is enabled by default.
 
 
-total          Find the total SPR distance from the first input tree to
 
                the rest of the list of trees. Uses the other algorithm
 
                options as specified (including unrooted options).
 
  
 
*******************************************************************************
 
*******************************************************************************
Line 64: Line 62:
 
These options control the use of optimized branching. All optimizations are
 
These options control the use of optimized branching. All optimizations are
 
enabled by default. Specifying any subset of -cob, -cab, and -sc will use
 
enabled by default. Specifying any subset of -cob, -cab, and -sc will use
just that subset of optimizations.
+
just that subset of optimizations. See the README for more information.
  
-allopt   Use -cob -cab -sc and a new set of improvements. This is the
+
-allopt   Use -cob -cab -sc and a new set of optimizations. This is the default
          default
+
          option
option
+
  
-noopt     Use 3-way branching for all FPT algorithms
+
-noopt   Use 3-way branching for all FPT algorithms
  
-cob       Use "cut one b" improved branching
+
-cob     Use "cut one b" improved branching
  
-cab       Use "cut all b" improved branching
+
-cab     Use "cut all b" improved branching
  
-sc       Use "separate components" improved branching
+
-sc       Use "separate components" improved branching
 +
 
 +
-bipartition_cluster x  Do not consider supertree rearrangements that violate
 +
                        biparitions supported by x% of gene trees containing
 +
                        at least two members from each side of the bipartition.
 +
                        Enabled by default with x=0.5
  
 
*******************************************************************************
 
*******************************************************************************
 
MULTIFURCATING COMPARISON OPTIONS
 
MULTIFURCATING COMPARISON OPTIONS
 
*******************************************************************************
 
*******************************************************************************
 +
 +
-allow_multi  Allow multifurcating gene trees
  
 
-support x    Collapse bipartitions with less than x support
 
-support x    Collapse bipartitions with less than x support
Line 88: Line 92:
 
*******************************************************************************
 
*******************************************************************************
  
-unrooted  Compare the first input tree to each other input tree.
+
-unrooted  Compare the supertree to each rooting of the input trees.
             Output the best found distance and agreement forest.
+
             Use the best found distance
            This option can be used with gen_rooted_trees.pl to provide
+
            the rootings.
+
            Note that this option is a bit unintuitive to maintain
+
            compatibility with previous versions of rSPR.
+
            If -total or -pairwise analysis is used then there is no need
+
            to specify rootings.
+
  
-unrooted_min_approx    Compare the first input tree to each other input tree.
+
-unrooted_min_approx    Compare the supertree to each rooting of the
                         Run the exact algorithms on the pair with the
+
                        input trees.
 +
                         Run the exact algorithm on the rooting with the
 
                         minimum approximate rspr distance
 
                         minimum approximate rspr distance
  
-simple_unrooted        Root the gene trees using
+
-simple_unrooted        Root the gene trees at each iteration using
 
                         a bipartition balanced accuracy measure
 
                         a bipartition balanced accuracy measure
                         (fast but potentially less accurate). Only
+
                         (fast but potentially less accurate)
                         used with -total.
+
                         Reports an unrooted SPR distance comparison
 +
                        at the end of each iteration for comparable
 +
                        iteration scores
 +
 
 +
-simple_unrooted x      Root the gene trees at the first x iterations
 +
 
 +
-simple_unrooted_fast  The same as -simple_unrooted but does not use
 +
                        an unrooted comparison at the end of each
 +
                        iteration
 +
 
 +
-outgroup FILE          Root the gene trees with the outgroup taxa
 +
                        listed in FILE, one per line. Trees with a
 +
                        polyphyletic outgroup are considered invalid.
 +
 
 +
-reroot                Reroot the super tree at each iteration using
 +
                        a bipartition balanced accuracy measure
 +
 
 +
-rspr_reroot            Root trees using the SPR distance instead
 +
                        of the bipartition balanced accuracy
 +
 
 +
 
  
 
*******************************************************************************
 
*******************************************************************************
PAIRWISE COMPARISON OPTIONS
+
SEARCH STRATEGY OPTIONS
 
*******************************************************************************
 
*******************************************************************************
  
-pairwise
+
-i x    Run for x iterations of the global rearrangement search
-pairwise a b
+
 
-pairwise a b c d        Compare each input tree to each other tree and output
+
-r x    Only consider transfers of length x in the global rearrangement
                        the resulting SPR distance matrix. If -unrooted is
+
        search. Default is infinite (All SPRs). For NNI search use
                        enabled this will compute the "best rooting" SPR
+
        -r 1
                        distance by testing each rooting of the trees. The
+
 
                        optional arguments a b c d compute only rows a-b and/or
+
-include_only <file>  Build the supertree only from taxa included in
                        columns c-d of the matrix.
+
                      <file>, one per line
  
-no-symmetric-pairwise  By default, -pairwise will ignore the symmetric lower
+
-initial_tree <file>  Begin the search with the tree in <file>
                        left triangle of the matrix. With this option the
+
                        lower triangle is filled in.
+
  
-pairwise_max x         Use with -pairwise to only compute distances at most x.
+
-num_leaves x         Build the supertree from the x taxa that are found
                        Larger values are output as -1. Very efficient for
+
                      in the largest number of trees
                        small distances (e.g. 1-10).
+
 
 +
-random_insert_order  Insert taxa in random order when building the
 +
                      greedy addition tree. The default order is
 +
                      descending occurence
 +
 
 +
-rf_ties              Break SPR distance ties with the RF distance
  
 
*******************************************************************************
 
*******************************************************************************
OTHER OPTIONS
+
LGT ANALYSIS
 
*******************************************************************************
 
*******************************************************************************
-cc        Calculate a potentially better approximation with a quadratic time
 
            algorithm
 
  
-q         Quiet; Do not output the input trees or approximation
+
-lgt_analysis         Conduct an LGT analysis with the initial user-specified
 +
                      or greedy addition tree
 +
 
 +
-lgt_evaluate          Print inferred transfers for each tree with the initial
 +
                      user-specified or greedy addition tree
 +
 
 +
-lgt_csv              Output the LGT analysis seperated by commas rather than
 +
                      spaces.
 +
 
 +
-lgt_groups FILE      Specify a set of groups (e.g. genus or class) to analyze
 +
                      with -lgt_analysis. The group FILE contains a set of
 +
                      groups consisting of a group name on one line, group
 +
                      members one per line, and a blank line to seperate each
 +
                      group.
 +
                     
 +
*******************************************************************************
 +
OTHER OPTIONS
 
*******************************************************************************
 
*******************************************************************************
 +
-time                  Print iteration and total CPU time used at each
 +
                      iteration
  
Example:
+
-cc                    Calculate a potentially better approximation with a
$ ./rspr.exe -fpt <test_trees/trees2.txt
+
                      quadratic time algorithm
T1: ((((1,2),(3,4)),((5,6),(7,8))),(((9,10),(11,12)),((13,14),(15,16))))
+
T2: (((7,8),((1,(2,(14,5))),(3,4))),(((11,(6,12)),10),((13,(15,16)),9)))
+
  
F1: (((1,2),(3,4)),(7,8)) 14 13 5 12 11 6 9 10 (15,16)
+
-valid_trees          Output the set of trees that appear valid
F2: ((7,8),((1,2),(3,4))) 14 5 13 12 6 11 9 (15,16) 10
+
-valid_trees_rooted    Output the set of trees that appear valid after applying
approx drSPR=9
+
                      any rooting options.
  
3 4
+
-multi_trees          Output the set of multifurcating or invalid trees
F1: ((((1,2),(3,4)),(7,8)),((10,(11,12)),(13,(15,16)))) 14 6 9 5
+
F2: (((7,8),((1,2),(3,4))),(((11,12),10),(13,(15,16)))) 14 6 9 5
+
exact drSPR=4
+
  
 
################################################################################
 
################################################################################
Line 156: Line 189:
 
Chris Whidden
 
Chris Whidden
 
whidden@cs.dal.ca
 
whidden@cs.dal.ca
http://kiwi.cs.dal.ca/Software/RSPR
+
http://kiwi.cs.dal.ca/Software/SPR_Supertrees
  
 
################################################################################
 
################################################################################
  
 
FILES
 
FILES
 
  
 
ClusterForest.h  Cluster Decomposition
 
ClusterForest.h  Cluster Decomposition
 +
ClusterInstance.h Cluster Decomposition
 
Forest.h          Forest data structure
 
Forest.h          Forest data structure
gen_rooted_trees.pl   Generate all rootings of an unrooted binary tree
+
gen_rooted_trees.pl Generate all rootings of an unrooted binary tree
 
gpl.txt          The GPL license
 
gpl.txt          The GPL license
 
LCA.h            Compute LCAs of tree leaves
 
LCA.h            Compute LCAs of tree leaves
 +
lgt.h            LGT Analysis
 
Makefile          Makefile
 
Makefile          Makefile
 
Node.h            Node data structure
 
Node.h            Node data structure
 
README.txt        This README
 
README.txt        This README
rspr.h            Library to calculate rSPR distances between pairs of trees
+
rspr.h            Calculate rSPR distances between pairs of trees
rspr.cpp         Calculate rSPR distances between pairs or sets of trees
+
SiblingPair.h    Sibling pair data structure
test_trees/      Folder of test tree pairs
+
spr_supertree.cpp Main file
SiblingPair.h    Sibling Pair class
+
spr_supertree    Compute supertrees that minimize spr distance
 +
UndoMachine.h    Structure to record and undo tree alterations
  
 
################################################################################
 
################################################################################
Line 180: Line 215:
 
INSTALLATION
 
INSTALLATION
  
rSPR is a command-line program written in C++. To use it, simply
+
SPR Supertrees is a command-line program written in C++. To use it, simply
compile rspr.cpp and execute the resulting program. On systems with
+
compile spr_supertree.cpp and execute the resulting program. On systems
the g++ compiler and make program, the included make file will
+
with the g++ compiler and make program, the included make file will
compile rspr; simply run `make'.
+
compile spr_supertree; simply run `make'.
 +
 
 +
SPR Supertrees can also use multiple cores on SMP machines through OpenMP.
 +
Compile with the -fopenmp flag or run `make omp'. The multicore executable
 +
will be called spr_supertree-omp
  
 
################################################################################
 
################################################################################
Line 189: Line 228:
 
INPUT
 
INPUT
  
rSPR requires pairs of Newick format trees with arbitrary labels
+
SPR Supertrees requires a list of Newick format trees with arbitrary labels
as input. The first tree must be binary and rooted. The second tree
+
as input. A sample Newick tree is shown below:
may be multifurcating and rooted. A sample Newick tree is shown below:
+
  
 
((1,2),(3,4),(5,6));
 
((1,2),(3,4),(5,6));
  
rSPR can also compare a rooted reference tree to an unrooted test tree.
+
By default the trees must be rooted and binary.
First use gen_rooted_trees.pl to generate all rootings of the unrooted
+
If you wish to allow multifurcating input trees use the -allow_multi
test tree. Then use the -unrooted or -unrooted_min_approx options and
+
option. Bipartitions with less than x% support can be collapsed with
input the test tree and the set of rootings. rSPR will find the best
+
-support x.
rooting of the test tree with the -unrooted option and guess the best
+
rooting based on the approximation algorithm with the
+
-unrooted_min_approx option. Alternatively, the -total option with
+
the -unrooted or -unrooted_min_approx options will provide just the
+
distance. The -total option with -simple_unrooted will use a faster
+
biparition based measure to approximate the optimal rooting.
+
  
The -support x option can be used to collapse poorly supported branches
+
SPR Supertrees can also construct a rooted tree from unrooted gene
of the second tree.
+
trees. Use the -unrooted, -unrooted_min_approx, -simple_unrooted, or
 +
-simple_unrooted_fast options rSPR will find the best rooting of each input
 +
tree with respect to the current supertree using the -unrooted option, guess
 +
the best rooting based on the approximation algorithm with the
 +
-unrooted_min_approx option, and guess the best rooting based on
 +
a bipartition balanced accuracy measure with the -simple_unrooted or
 +
-simple_unrooted_fast options. These are much faster but may be less accurate.
  
With the -pairwise option, rSPR will compare each pair of input trees
+
The -outgroup <FILE> option roots gene trees based on a list of outgroup taxa.
and output the results as a distance matrix. To save time, only the
+
This option ignores gene trees with a polyphyletic outgroup or no outgroup
upper right triangle is output as the lower left triangle is symmetric.
+
members. To root these trees, one can construct a supertree from just the trees
Use the included fill_matrix program to fill in missing values or the
+
where the outgroup is monophyletic and then root the remainder of the trees
-no-symmetric-pairwise option to explicitly compute these values.
+
with the -simple_unrooted 1 option.
Optional arguments to -pairwise can be used to compute subsets of the
+
matrix (e.g. for partitioning computation over multiple processes).
+
The -pairwise_max x option can be used to quickly find trees with
+
SPR distance at most x when x is small (e.g. 1-10).
+
  
 +
With the -lgt_analysis option, the program conducts
 +
an LGT analysis of an initial or greedy addition supertree. The gene trees
 +
should be rooted, either as input or using -simple_unrooted_fast. This analysis
 +
considers a single minimal reconciliation scenario between the supertree and
 +
each gene tree. The output is a series of matrices (comma-seperated with the
 +
-lgt_csv option) showing the number of inferred SPR moves, transfers, and
 +
transfers ignoring direction between groups of taxa or to "mixed" portions of
 +
the tree. The -lgt_groups <FILE> option is required and specifys taxonomic
 +
groups or individual taxa.
  
 
################################################################################
 
################################################################################
Line 230: Line 273:
 
/////////////////////
 
/////////////////////
  
$ ./rspr < test_trees/trees2.txt
+
$ ./spr_supertree -i 1 < test_trees/trees2.txt  
T1: ((((1,2),(3,4)),((5,6),(7,8))),(((9,10),(11,12)),((13,14),(15,16))))
+
NUM_ITERATIONS=1
T2: ((((3,4),(8,(2,((11,12),1)))),((15,16),(7,(6,5)))),(14,((10,13),9)))
+
skipped 0 lines with no opening bracket
 +
skipped 0 multifurcating or invalid trees
 +
skipped 0 trees with less than 4 leaves
 +
2 gene trees remaining
  
F1: ((3,4),(5,6)) 13 14 10 (11,12) 9 1 8 7 2 (15,16)
+
Initial Supertree: ((15,14),(13,12))
F2: ((3,4),(6,5)) 13 10 14 (11,12) 1 9 8 2 7 (15,16)
+
Adding leaf 11  (5/16)
approx drSPR=12
+
(((16,15),(14,13)),12)
 +
Adding leaf 10  (6/16)
 +
(((16,15),(14,13)),(12,11))
 +
Adding leaf 9  (7/16)
 +
(((16,15),(14,13)),((12,11),10))
 +
Adding leaf 8  (8/16)
 +
((((16,15),(14,13)),9),((12,11),10))
 +
Adding leaf 7  (9/16)
 +
(((((16,15),(14,13)),9),((12,11),10)),8)
 +
Adding leaf 6  (10/16)
 +
(((((16,15),(14,13)),9),((12,11),10)),(8,7))
 +
Adding leaf 5  (11/16)
 +
(((((16,15),(14,13)),9),((12,11),10)),((8,7),6))
 +
Adding leaf 4  (12/16)
 +
(((((16,15),(14,13)),9),((12,11),10)),((8,7),(6,5)))
 +
Adding leaf 3  (13/16)
 +
(((((16,15),(14,13)),9),((12,11),10)),((8,7),((6,4),5)))
 +
Adding leaf 2  (14/16)
 +
(((((16,15),(14,13)),9),((12,11),10)),((8,7),((6,(4,3)),5)))
 +
Adding leaf 1   (15/16)
 +
(((((16,15),(14,13)),9),((12,11),10)),((8,7),((6,(4,3)),(5,2))))
 +
Adding leaf 0  (16/16)
 +
(((((16,15),(14,13)),9),((12,11),10)),((8,7),((6,(4,3)),((5,2),1))))
  
4
+
Initial Supertree:
F1: ((((1,2),(3,4)),((5,6),7)),((9,10),14)) 13 (11,12) 8 (15,16)
+
(((((16,15),(14,13)),9),((12,11),10)),((8,7),((6,(4,3)),((5,2),1))))
F2: ((((3,4),(2,1)),(7,(6,5))),(14,(10,9))) 13 (11,12) 8 (15,16)
+
Total Distance: 5
exact BB drSPR=4
+
Current Supertree:
 +
(((((16,15),(14,13)),9),((12,11),10)),((6,(8,7)),((4,3),((5,2),1))))
 +
Total Distance: 4
 +
Final Supertree:
 +
(((((16,15),(14,13)),9),((12,11),10)),((6,(8,7)),((4,3),((5,2),1))))
 +
Final Distance: 4
  
 
/////////////////////
 
/////////////////////
  
The first set of lines show the input trees. The second set of lines are the
+
The first set of lines indicate the options chosen, the number of invalid
approximate agreement forests and the corresponding approximate rSPR distance.
+
trees and the number of valid trees.
The third set of lines are the maximum agreement forests and the corresponding
+
The program then builds a supertree greedily by placing the most
exact rSPR distance. When calculating exact distances, the distance
+
frequent taxa first. Finally, the program applies 25 iterations of
currently being considered is printed on the first line of this section.
+
global SPR rearrangements (or a user-specified number using the -i option
 +
as shown here ) and outputs the best tree and distance found at the end of
 +
each iteration. To build larger trees the -r x option will limit the
 +
SPR rearrangements to transfers of length at most x. For example,
 +
-r 1 uses only NNI rearrangements.
  
Each component of an agreement forest corresponds to an rSPR operation.
 
The set of rSPR operations required to turn one tree into the other can
 
be found by applying rSPR operations that move these components to their
 
correct place in the other tree.
 
 
An agreement forest may contain p (rho) as a component. This represents
 
the root of the trees and indicates that an extra rSPR operation is
 
required to correctly root the tree.
 
 
################################################################################
 
 
OUTPUT WITH CLUSTERING
 
 
/////////////////////
 
 
$ rspr < test_trees/cluster_test
 
T1: (((x,((b1,b3),b2)),y),(f,(a,c)))
 
T2: (((x,y),f),((a,((b1,b2),b3)),c))
 
 
F1: (((0,((1,2),3)),4),(5,(6,7)))
 
F2: (((0,4),5),((6,((1,3),2)),7))
 
approx drSPR=9
 
 
 
CLUSTERS
 
C1_1: ((1,2),3)
 
C1_2: ((1,3),2)
 
cluster approx drSPR=3
 
 
1
 
F1_1: (1,2) 3
 
F1_2: (1,2) 3
 
cluster exact drSPR=1
 
 
C2_1: (((0,(1,2)),4),(5,(6,7)))
 
C2_2: (((0,4),5),((6,(1,2)),7))
 
cluster approx drSPR=6
 
 
2
 
F2_1: (5,(6,7)) (1,2) (0,4)
 
F2_2: (5,(6,7)) (1,2) (0,4)
 
cluster exact drSPR=2
 
 
F1: (f,(a,c)) b2 (b1,b3) (x,y)
 
F2: (f,(a,c)) b2 (b1,b3) (x,y)
 
total exact drSPR=3
 
 
/////////////////////
 
 
When clustering is enabled (as it is by default), each solved
 
cluster is displayed along with its approximate and exact distance in
 
an intermediate representation with labels mapped from 0-(N-1) where
 
N is the number of labels. The final agreement forest and distance
 
are output last.
 
 
################################################################################
 
 
OUTPUT WITH PAIRWISE
 
 
$ cat test_trees/big_test* | rspr -pairwise
 
0,46,0,46
 
,0,46,50
 
,,0,46
 
,,,0
 
 
cat test_trees/big_test* | rspr -pairwise | fill_matrix
 
0,46,0,46
 
46,0,46,50
 
0,46,0,46
 
46,50,46,0
 
  
 
################################################################################
 
################################################################################
Line 329: Line 336:
 
leaves in the trees.
 
leaves in the trees.
  
The unoptimized FPT and branch-and-bound algorithms run in O(3^k n) time, where
+
the exact algorithms run in O(2.42^k n) time, where $k$ is the computed
k is the rSPR distance and n is the number of leaves in the trees. The
+
SPR distance. Using a set of new optimizations we conjecture that
branch-and-bound algorithm should be significantly faster in practice.
+
the running time has been improved to O(2^k n) time.
  
Using all 3 of the -cob -cab and -sc optimizations improves the running times of
+
When using the -unrooted option, the exact algorithms run in O(2.42^k
the algorithms to O(2.42^k n) time. This provides a significant improvement in
+
n^2) time. (conjectured O(2^k n^2)). The -simple_unrooted option
practice and is provably correct, thus this is the default.
+
has the same worst case performance as the regular exact algorithms.
  
In addition, this version contains new improvements that
+
When using the -max x option, the exact algorithms will run up to
give a bound of O(2^k n). This provides another significant improvement
+
a distance of x and then the approximation is used. This provides
and is provably correct so these options are also enabled by default.
+
a running time of O(n + 2^x n) or O(n + 2^x n^2) for rooted
 +
trees and allows for a trade-off between space and efficiency.
 +
The -split_approx x option works similarly but is both much more
 +
accurate and slower. -split_approx is recommended over -max.
  
For much larger trees, the -split_approx option will compute an
+
Since there are O(n^2) possible SPR rearrangements, the total running
exponential time approximation of the distance that is exact for
+
time is O(i * n^2 * X), where i is the number of iterations and X
small distances and generally within a few percent of the optimal
+
is the running time of the chosen SPR computation method.
distance otherwise.
+
NOTE: The exact algorithms are exponential algorithms that exactly solve an
 +
NP-hard problem.  Thus the algorithms may not finish in a reasonable amount
 +
of time for very large rSPR distances without the -split_approx or -max
 +
options. For very large supertrees, it may also be necessary to
 +
limit the scope of the search with the -r option.
  
When using the -unrooted option, the exact algorithms run in O(2^k n^2) time.
+
The -bipartition_cluster x option ignores SPR rearrangments that violate any
 
+
bipartition that agrees with x% of the gene trees that contain at least two
The cluster reduction improves the running time of the
+
taxa from each side of the bipartition. This is enabled by default with x=0.5
algorithm to O(2^k n) time where k is the largest rSPR distance of
+
and grealy accelerates tree searches at the expense of some searching power.
any cluster (as opposed to the full rSPR distance). This provides a large
+
This option can be disabled with -bipartition_cluster 1, requiring total
speedup when the trees are clusterable.
+
agreement.
 
+
With the -pairwise option on m rooted trees, the program takes O(m^2
+
2^k n) time, where k is the largest SPR distance computed. With -unrooted
+
this becomes O(m^2 2^k n^3). The -pairwise_max x option limits k to x,
+
but does not use clustering and is slow for large distances.
+
-
+
 
+
NOTE: This is an exponential algorithm that exactly solves an NP-hard problem.
+
Thus the algorithms may not finish in a reasonable amount of time for large
+
rSPR distances (> 20 without optimizations and > 70 with optimizations).
+
  
 
################################################################################
 
################################################################################
Line 371: Line 375:
 
Whidden, C., Zeh, N., Beiko, R.G.  Fixed-Parameter and Approximation
 
Whidden, C., Zeh, N., Beiko, R.G.  Fixed-Parameter and Approximation
 
Algorithms for Maximum Agreement Forests of Multifurcating Trees.
 
Algorithms for Maximum Agreement Forests of Multifurcating Trees.
(Submitted). 2013. Preprint available at
+
(In Preparation). 2013. Preprint available at
 
http://arxiv.org/abs/1305.0512
 
http://arxiv.org/abs/1305.0512
 
Whidden, C., Zeh, N., Beiko, R.G.  Supertrees based on the subtree
 
prune-and-regraft distance.  To appear in Systematic Biology. 2014. Preprint
 
available at https://peerj.com/preprints/18/
 
  
 
Whidden, C., Beiko, R.G., Zeh, N. Fixed-Parameter Algorithms for Maximum
 
Whidden, C., Beiko, R.G., Zeh, N. Fixed-Parameter Algorithms for Maximum
Line 382: Line 382:
 
Available at http://epubs.siam.org/doi/abs/10.1137/110845045
 
Available at http://epubs.siam.org/doi/abs/10.1137/110845045
  
Whidden, C. Efficient Computation of Maximum Agreement Forests and their
+
Whidden, C., Zeh, N., Beiko, R.G. Supertrees based on the subtree
Applications. PhD Thesis. Dalhousie University, Canada. 2013. Available at
+
prune-and-regraft distance. To appear in Systematic Biology. 2014. Preprint
www.cs.dal.ca/~whidden
+
available at https://peerj.com/preprints/18/
  
 
Whidden, C., Beiko, R.G., Zeh, N. Fast FPT Algorithms for Computing
 
Whidden, C., Beiko, R.G., Zeh, N. Fast FPT Algorithms for Computing
Line 403: Line 403:
 
################################################################################
 
################################################################################
  
CITING rSPR
+
CITING SPR Supertrees
 
+
If you use rSPR in your research, please cite:
+
 
+
Whidden, C., Beiko, R.G., Zeh, N.  Computing the SPR Distance of Binary
+
Rooted Trees in O(2^k n) Time. (In Preparation). 2013.
+
 
+
Whidden, C., Beiko, R.G. Zeh, N.  Fixed-Parameter and Approximation
+
Algorithms for Maximum Agreement Forests of Multifurcating Trees.
+
(Submitted). 2013.
+
  
 +
If you use SPR Supertrees in your research, please cite:
 
Whidden, C., Zeh, N., Beiko, R.G.  Supertrees based on the subtree
 
Whidden, C., Zeh, N., Beiko, R.G.  Supertrees based on the subtree
prune-and-regraft distance. Syst. Biol. Advance Access published April
+
prune-and-regraft distance. Syst. Biol. 63 (4): 566-581. 2014.
2, 2014, doi:10.1093/sysbio/syu023.
+
doi:10.1093/sysbio/syu023.
  
Whidden, C., Beiko, R.G., Zeh, N. Fixed-Parameter Algorithms for Maximum
+
################################################################################
Agreement Forests. SIAM Journal on Computing 42.4 (2013), pp. 1431-1466.
+
Available at http://epubs.siam.org/doi/abs/10.1137/110845045
+
  
Whidden, C., Beiko, R.G., Zeh, N. Fast FPT Algorithms for Computing
 
Rooted Agreement Forests: Theory and Experiments. Experimental Algorithms.
 
Ed. by P. Festa. Vol. 6049. Lecture Notes in Computer Science. Springer
 
Berlin Heidelberg, 2010, pp. 141-153. Available at
 
http://link.springer.com/chapter/10.1007/978-3-642-13193-6_13
 
 
Whidden, C., Zeh, N. A Unifying View on Approximation and FPT of
 
Agreement Forests. In: WABI 2009. LNCS, vol. 5724, pp. 390.401.
 
Springer-Verlag (2009).
 
 
################################################################################
 
 
</pre>
 
</pre>

Revision as of 18:28, 18 June 2014

################################################################################
spr_supertree

################################################################################

Usage: spr_supertree [OPTIONS]
       spr_supertree-omp [OPTIONS]
Calculate supertrees that minimize the SPR distance from the input
trees. By default calculates a rooted SPR supertree from a list
of rooted binary trees from STDIN in newick format. An initial
tree is built by greedily adding taxa in decreasing order of
ocurrence. The tree is then improved by SPR rearrangements.
Additional options allow for unrooted and/or multifurcating input trees.

Copyright 2013-2014 Chris Whidden
whidden@cs.dal.ca
http://kiwi.cs.dal.ca/Software/SPR_Supertrees
March 3, 2014
Version 1.2.1

This file is part of spr_supertrees.

spr_supertrees is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

spr_supertrees is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with spr_supertrees.  If not, see <http://www.gnu.org/licenses/>.

*******************************************************************************
ALGORITHM
*******************************************************************************

These options control what algorithm is used to determine the SPR distance
from the supertree to the input trees. By default -bb is used.

-fpt        Calculate the exact rSPR distance with an FPT algorithm

-bb         Calculate the exact rSPR distance with a branch-and-bound
            FPT algorithm. This is the default option.

-approx     Calculate just a linear -time 3-approximation of the rSPR distance

-max k      Calculate the exact rSPR distance if it is k or less and
            otherwise use the 3-approximation

-split_approx
-split_approx x  Calculate the exact rSPR distance if it is k or less and
                 otherwise use the exponential-time approximation

*******************************************************************************
OPTIMIZATIONS
*******************************************************************************

These options control the use of optimized branching. All optimizations are
enabled by default. Specifying any subset of -cob, -cab, and -sc will use
just that subset of optimizations. See the README for more information.

-allopt   Use -cob -cab -sc and a new set of optimizations. This is the default
          option

-noopt    Use 3-way branching for all FPT algorithms

-cob      Use "cut one b" improved branching

-cab      Use "cut all b" improved branching

-sc       Use "separate components" improved branching

-bipartition_cluster x  Do not consider supertree rearrangements that violate
                        biparitions supported by x% of gene trees containing
                        at least two members from each side of the bipartition.
                        Enabled by default with x=0.5

*******************************************************************************
MULTIFURCATING COMPARISON OPTIONS
*******************************************************************************

-allow_multi   Allow multifurcating gene trees

-support x     Collapse bipartitions with less than x support

*******************************************************************************
UNROOTED COMPARISON OPTIONS
*******************************************************************************

-unrooted   Compare the supertree to each rooting of the input trees.
            Use the best found distance

-unrooted_min_approx    Compare the supertree to each rooting of the
                        input trees.
                        Run the exact algorithm on the rooting with the
                        minimum approximate rspr distance

-simple_unrooted        Root the gene trees at each iteration using
                        a bipartition balanced accuracy measure
                        (fast but potentially less accurate)
                        Reports an unrooted SPR distance comparison
                        at the end of each iteration for comparable
                        iteration scores

-simple_unrooted x      Root the gene trees at the first x iterations

-simple_unrooted_fast   The same as -simple_unrooted but does not use
                        an unrooted comparison at the end of each
                        iteration

-outgroup FILE          Root the gene trees with the outgroup taxa
                        listed in FILE, one per line. Trees with a
                        polyphyletic outgroup are considered invalid.

-reroot                 Reroot the super tree at each iteration using
                        a bipartition balanced accuracy measure

-rspr_reroot            Root trees using the SPR distance instead
                        of the bipartition balanced accuracy



*******************************************************************************
SEARCH STRATEGY OPTIONS
*******************************************************************************

-i x    Run for x iterations of the global rearrangement search

-r x    Only consider transfers of length x in the global rearrangement
        search. Default is infinite (All SPRs). For NNI search use
        -r 1

-include_only <file>  Build the supertree only from taxa included in
                      <file>, one per line

-initial_tree <file>  Begin the search with the tree in <file>

-num_leaves x         Build the supertree from the x taxa that are found
                      in the largest number of trees

-random_insert_order  Insert taxa in random order when building the
                      greedy addition tree. The default order is
                      descending occurence

-rf_ties              Break SPR distance ties with the RF distance

*******************************************************************************
LGT ANALYSIS
*******************************************************************************

-lgt_analysis          Conduct an LGT analysis with the initial user-specified
                       or greedy addition tree

-lgt_evaluate          Print inferred transfers for each tree with the initial
                       user-specified or greedy addition tree

-lgt_csv               Output the LGT analysis seperated by commas rather than
                       spaces.

-lgt_groups FILE       Specify a set of groups (e.g. genus or class) to analyze
                       with -lgt_analysis. The group FILE contains a set of
                       groups consisting of a group name on one line, group
                       members one per line, and a blank line to seperate each
                       group.
                       
*******************************************************************************
OTHER OPTIONS
*******************************************************************************
-time                  Print iteration and total CPU time used at each
                       iteration

-cc                    Calculate a potentially better approximation with a
                       quadratic time algorithm

-valid_trees           Output the set of trees that appear valid
-valid_trees_rooted    Output the set of trees that appear valid after applying
                       any rooting options.

-multi_trees           Output the set of multifurcating or invalid trees

################################################################################

CONTACT INFORMATION

Chris Whidden
whidden@cs.dal.ca
http://kiwi.cs.dal.ca/Software/SPR_Supertrees

################################################################################

FILES

ClusterForest.h   Cluster Decomposition
ClusterInstance.h Cluster Decomposition
Forest.h          Forest data structure
gen_rooted_trees.pl Generate all rootings of an unrooted binary tree
gpl.txt           The GPL license
LCA.h             Compute LCAs of tree leaves
lgt.h             LGT Analysis
Makefile          Makefile
Node.h            Node data structure
README.txt        This README
rspr.h            Calculate rSPR distances between pairs of trees
SiblingPair.h     Sibling pair data structure
spr_supertree.cpp Main file
spr_supertree     Compute supertrees that minimize spr distance
UndoMachine.h     Structure to record and undo tree alterations

################################################################################

INSTALLATION

SPR Supertrees is a command-line program written in C++. To use it, simply
compile spr_supertree.cpp and execute the resulting program. On systems
with the g++ compiler and make program, the included make file will
compile spr_supertree; simply run `make'.

SPR Supertrees can also use multiple cores on SMP machines through OpenMP.
Compile with the -fopenmp flag or run `make omp'. The multicore executable
will be called spr_supertree-omp

################################################################################

INPUT

SPR Supertrees requires a list of Newick format trees with arbitrary labels
as input.  A sample Newick tree is shown below:

((1,2),(3,4),(5,6));

By default the trees must be rooted and binary.
If you wish to allow multifurcating input trees use the -allow_multi
option. Bipartitions with less than x% support can be collapsed with
-support x.

SPR Supertrees can also construct a rooted tree from unrooted gene
trees. Use the -unrooted, -unrooted_min_approx, -simple_unrooted, or
-simple_unrooted_fast options rSPR will find the best rooting of each input
tree with respect to the current supertree using the -unrooted option, guess
the best rooting based on the approximation algorithm with the
-unrooted_min_approx option, and guess the best rooting based on
a bipartition balanced accuracy measure with the -simple_unrooted or
-simple_unrooted_fast options. These are much faster but may be less accurate.

The -outgroup <FILE> option roots gene trees based on a list of outgroup taxa.
This option ignores gene trees with a polyphyletic outgroup or no outgroup
members. To root these trees, one can construct a supertree from just the trees
where the outgroup is monophyletic and then root the remainder of the trees
with the -simple_unrooted 1 option.

With the -lgt_analysis option, the program conducts
an LGT analysis of an initial or greedy addition supertree. The gene trees
should be rooted, either as input or using -simple_unrooted_fast. This analysis
considers a single minimal reconciliation scenario between the supertree and
each gene tree. The output is a series of matrices (comma-seperated with the
-lgt_csv option) showing the number of inferred SPR moves, transfers, and
transfers ignoring direction between groups of taxa or to "mixed" portions of
the tree. The -lgt_groups <FILE> option is required and specifys taxonomic
groups or individual taxa.

################################################################################

OUTPUT

rspr writes to standard output.

A sample command line and output are shown below:

/////////////////////

$ ./spr_supertree -i 1 < test_trees/trees2.txt 
NUM_ITERATIONS=1
skipped 0 lines with no opening bracket 
skipped 0 multifurcating or invalid trees
skipped 0 trees with less than 4 leaves
2 gene trees remaining

Initial Supertree:  ((15,14),(13,12))
Adding leaf 11  (5/16)
(((16,15),(14,13)),12)
Adding leaf 10  (6/16)
(((16,15),(14,13)),(12,11))
Adding leaf 9   (7/16)
(((16,15),(14,13)),((12,11),10))
Adding leaf 8   (8/16)
((((16,15),(14,13)),9),((12,11),10))
Adding leaf 7   (9/16)
(((((16,15),(14,13)),9),((12,11),10)),8)
Adding leaf 6   (10/16)
(((((16,15),(14,13)),9),((12,11),10)),(8,7))
Adding leaf 5   (11/16)
(((((16,15),(14,13)),9),((12,11),10)),((8,7),6))
Adding leaf 4   (12/16)
(((((16,15),(14,13)),9),((12,11),10)),((8,7),(6,5)))
Adding leaf 3   (13/16)
(((((16,15),(14,13)),9),((12,11),10)),((8,7),((6,4),5)))
Adding leaf 2   (14/16)
(((((16,15),(14,13)),9),((12,11),10)),((8,7),((6,(4,3)),5)))
Adding leaf 1   (15/16)
(((((16,15),(14,13)),9),((12,11),10)),((8,7),((6,(4,3)),(5,2))))
Adding leaf 0   (16/16)
(((((16,15),(14,13)),9),((12,11),10)),((8,7),((6,(4,3)),((5,2),1))))

Initial Supertree:
(((((16,15),(14,13)),9),((12,11),10)),((8,7),((6,(4,3)),((5,2),1))))
Total Distance: 5
Current Supertree:
(((((16,15),(14,13)),9),((12,11),10)),((6,(8,7)),((4,3),((5,2),1))))
Total Distance: 4
Final Supertree:
(((((16,15),(14,13)),9),((12,11),10)),((6,(8,7)),((4,3),((5,2),1))))
Final Distance: 4

/////////////////////

The first set of lines indicate the options chosen, the number of invalid
trees and the number of valid trees.
The program then builds a supertree greedily by placing the most
frequent taxa first. Finally, the program applies 25 iterations of
global SPR rearrangements (or a user-specified number using the -i option
as shown here ) and outputs the best tree and distance found at the end of
each iteration. To build larger trees the -r x option will limit the
SPR rearrangements to transfers of length at most x. For example,
-r 1 uses only NNI rearrangements.


################################################################################

EFFICIENCY

The 3-approximation algorithm runs in O(n) time, where n is the number of
leaves in the trees.

the exact algorithms run in  O(2.42^k n) time, where $k$ is the computed
SPR distance. Using a set of new optimizations we conjecture that
the running time has been improved to O(2^k n) time.

When using the -unrooted option, the exact algorithms run in O(2.42^k
n^2) time. (conjectured O(2^k n^2)). The -simple_unrooted option
has the same worst case performance as the regular exact algorithms.

When using the -max x option, the exact algorithms will run up to
a distance of x and then the approximation is used. This provides
a running time of O(n + 2^x n) or O(n + 2^x n^2) for rooted
trees and allows for a trade-off between space and efficiency.
The -split_approx x option works similarly but is both much more
accurate and slower. -split_approx is recommended over -max.

Since there are O(n^2) possible SPR rearrangements, the total running
time is O(i * n^2 * X), where i is the number of iterations and X
is the running time of the chosen SPR computation method.
NOTE: The exact algorithms are exponential algorithms that exactly solve an
NP-hard problem.  Thus the algorithms may not finish in a reasonable amount
of time for very large rSPR distances without the -split_approx or -max
options. For very large supertrees, it may also be necessary to
limit the scope of the search with the -r option.

The -bipartition_cluster x option ignores SPR rearrangments that violate any
bipartition that agrees with x% of the gene trees that contain at least two
taxa from each side of the bipartition. This is enabled by default with x=0.5
and grealy accelerates tree searches at the expense of some searching power.
This option can be disabled with -bipartition_cluster 1, requiring total
agreement.

################################################################################

REFERENCES

For more information on the algorithms see:

Whidden, C., Zeh, N., Beiko, R.G.  Fixed-Parameter and Approximation
Algorithms for Maximum Agreement Forests of Multifurcating Trees.
(In Preparation). 2013. Preprint available at
http://arxiv.org/abs/1305.0512

Whidden, C., Beiko, R.G., Zeh, N. Fixed-Parameter Algorithms for Maximum
Agreement Forests. SIAM Journal on Computing 42.4 (2013), pp. 1431-1466.
Available at http://epubs.siam.org/doi/abs/10.1137/110845045

Whidden, C., Zeh, N., Beiko, R.G.  Supertrees based on the subtree
prune-and-regraft distance.  To appear in Systematic Biology. 2014. Preprint
available at https://peerj.com/preprints/18/

Whidden, C., Beiko, R.G., Zeh, N. Fast FPT Algorithms for Computing
Rooted Agreement Forests: Theory and Experiments. Experimental Algorithms.
Ed. by P. Festa. Vol. 6049. Lecture Notes in Computer Science. Springer
Berlin Heidelberg, 2010, pp. 141-153. Available at
http://link.springer.com/chapter/10.1007/978-3-642-13193-6_13

Whidden, C., Zeh, N. A Unifying View on Approximation and FPT of
Agreement Forests. In: WABI 2009. LNCS, vol. 5724, pp. 390.401.
Springer-Verlag (2009). Available at
http://www.springerlink.com/content/n56q2846v645p655/

Whidden, C. A Unifying View on Approximation and FPT of Agreement Forests.
Masters Thesis. Dalhousie University, Canada. 2009. Available at
www.cs.dal.ca/~whidden

################################################################################

CITING SPR Supertrees

If you use SPR Supertrees in your research, please cite:
Whidden, C., Zeh, N., Beiko, R.G.  Supertrees based on the subtree
prune-and-regraft distance.  Syst. Biol. 63 (4): 566-581. 2014.
doi:10.1093/sysbio/syu023.

################################################################################