
Building a phylogenetic tree

The logic behind phylogenetic trees. How to build a tree using data about features that are present or absent in a group of organisms.

Key points:

  • Phylogenetic trees represent hypotheses about the evolutionary relationships among a group of organisms.
  • A phylogenetic tree may be built using morphological (body shape), biochemical, behavioral, or molecular features of species or other groups.
  • In building a tree, we organize species into nested groups based on shared derived traits (traits different from those of the group's ancestor).
  • The sequences of genes or proteins can be compared among species and used to build phylogenetic trees. Closely related species typically have few sequence differences, while less related species tend to have more.


We're all related, and I don't just mean us humans, though that's most definitely true! Instead, all living things on Earth can trace their descent back to a common ancestor. Any smaller group of species can also trace its ancestry back to a common ancestor, often a much more recent one.
Given that we can't go back in time and see how species evolved, how can we figure out how they are related to one another? In this article, we'll look at the basic methods and logic used to build phylogenetic trees, or trees that represent the evolutionary history and relationships of a group of organisms.

Overview of phylogenetic trees

In a phylogenetic tree, the species of interest are shown at the tips of the tree's branches. The branches themselves connect up in a way that represents the evolutionary history of the species—that is, how we think they evolved from a common ancestor through a series of divergence (splitting-in-two) events. At each branch point lies the most recent common ancestor shared by all of the species descended from that branch point. The lines of the tree represent long series of ancestors that extend from one species to the next.
Image modified from Taxonomy and phylogeny: Figure 2, by Robert Bear et al., CC BY 4.0
For a more detailed explanation, check out the article on phylogenetic trees.
Even once you feel comfortable reading a phylogenetic tree, you may have the nagging question: How do you build one of these things? In this article, we'll take a closer look at how phylogenetic trees are constructed.

The idea behind tree construction

How do we build a phylogenetic tree? The underlying principle is Darwin’s idea of “descent with modification.” Basically, by looking at the pattern of modifications (novel traits) in present-day organisms, we can figure out—or at least, make hypotheses about—their path of descent from a common ancestor.
As an example, let's consider the phylogenetic tree below (which shows the evolutionary history of a made-up group of mouse-like species). We see three new traits arising at different points during the evolutionary history of the group: a fuzzy tail, big ears, and whiskers. Each new trait is shared by all of the species descended from the ancestor in which the trait arose (shown by the tick marks), but absent from the species that split off before the trait appeared.
When we are building phylogenetic trees, traits that arise during the evolution of a group and differ from the traits of the ancestor of the group are called derived traits. In our example, a fuzzy tail, big ears, and whiskers are derived traits, while a skinny tail, small ears, and lack of whiskers are ancestral traits. An important point is that a derived trait may appear through either loss or gain of a feature. For instance, if there were another change on the E lineage that resulted in loss of a tail, taillessness would be considered a derived trait.
Derived traits shared among the species or other groups in a dataset are key to helping us build trees. As shown above, shared derived traits tend to form nested patterns that provide information about when branching events occurred in the evolution of the species.
When we are building a phylogenetic tree from a dataset, our goal is to use shared derived traits in present-day species to infer the branching pattern of their evolutionary history. The trick, however, is that we can’t watch our species of interest evolving and see when new traits arose in each lineage.
Instead, we have to work backwards. That is, we have to look at our species of interest – such as A, B, C, D, and E – and figure out which traits are ancestral and which are derived. Then, we can use the shared derived traits to organize the species into nested groups like the ones shown above. A tree made in this way is a hypothesis about the evolutionary history of the species – typically, one with the simplest possible branching pattern that can explain their traits.

Example: Building a phylogenetic tree

If we were biologists building a phylogenetic tree as part of our research, we would have to pick which set of organisms to arrange into a tree. We'd also have to choose which characteristics of those organisms to base our tree on (out of their many different physical, behavioral, and biochemical features).
If we're instead building a phylogenetic tree for a class (which is probably more likely for readers of this article), odds are that we'll be given a set of characteristics, often in the form of a table, that we need to convert into a tree. For example, this table shows presence (+) or absence (0) of various features:
Feature    Lamprey  Antelope  Bald eagle  Alligator  Sea bass
Lungs      0        +         +           +          0
Jaws       0        +         +           +          +
Feathers   0        0         +           0          0
Gizzard    0        0         +           +          0
Fur        0        +         0           0          0
Table modified from Taxonomy and phylogeny: Figure 4, by Robert Bear et al., CC BY 4.0
Next, we need to know which form of each characteristic is ancestral and which is derived. For example, is the presence of lungs an ancestral trait, or is it a derived trait? As a reminder, an ancestral trait is what we think was present in the common ancestor of the species of interest. A derived trait is a form that we think arose somewhere on a lineage descended from that ancestor.
Without the ability to look into the past (which would be handy but, alas, impossible), how do we know which traits are ancestral and which derived?
  • In the context of homework or a test, the question you are solving may tell you which traits are derived vs. ancestral.
  • If you are doing your own research, you may have knowledge that allows you to identify ancestral and derived traits (e.g., based on fossils).
  • You may be given information about an outgroup, a species that's more distantly related to the species of interest than they are to one another.
If we are given an outgroup, the outgroup can serve as a proxy for the ancestral species. That is, we may be able to assume that its traits represent the ancestral form of each characteristic.
For instance, in our example (data repeated below for convenience), the lamprey, a jawless fish that lacks a true skeleton, is our outgroup. As shown in the table, the lamprey lacks all of the listed features: it has no lungs, jaws, feathers, gizzard, or fur. Based on this information, we will assume that absence of these features is ancestral, and that presence of each feature is a derived trait.
Feature    Lamprey  Antelope  Bald eagle  Alligator  Sea bass
Lungs      0        +         +           +          0
Jaws       0        +         +           +          +
Feathers   0        0         +           0          0
Gizzard    0        0         +           +          0
Fur        0        +         0           0          0
Table modified from Taxonomy and phylogeny: Figure 4, by Robert Bear et al., CC BY 4.0
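The outgroup reasoning above can be sketched in a few lines of Python. This is only an illustration, not a standard tool: the dictionary layout and the function name `polarize` are assumptions made for this example, and the 1/0 values simply encode the +/0 pattern described in the text.

```python
# Character table from the article: 1 = feature present (+), 0 = absent.
TABLE = {
    "lungs":    {"lamprey": 0, "antelope": 1, "bald eagle": 1, "alligator": 1, "sea bass": 0},
    "jaws":     {"lamprey": 0, "antelope": 1, "bald eagle": 1, "alligator": 1, "sea bass": 1},
    "feathers": {"lamprey": 0, "antelope": 0, "bald eagle": 1, "alligator": 0, "sea bass": 0},
    "gizzard":  {"lamprey": 0, "antelope": 0, "bald eagle": 1, "alligator": 1, "sea bass": 0},
    "fur":      {"lamprey": 0, "antelope": 1, "bald eagle": 0, "alligator": 0, "sea bass": 0},
}


def polarize(table, outgroup):
    """Assume the outgroup's state is ancestral for every feature.

    Returns {feature: (ancestral_state, species_with_derived_state)}.
    `polarize` is a hypothetical helper name used only in this sketch.
    """
    result = {}
    for feature, states in table.items():
        ancestral = states[outgroup]
        derived_in = sorted(
            species
            for species, state in states.items()
            if species != outgroup and state != ancestral
        )
        result[feature] = (ancestral, derived_in)
    return result


for feature, (ancestral, derived_in) in polarize(TABLE, "lamprey").items():
    print(f"{feature}: ancestral state = {ancestral}, derived in {derived_in}")
```

Because the lamprey has a 0 for every feature, absence comes out as the ancestral state in each row, and every species with a + is flagged as carrying the derived form.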
Now, we can start building our tree by grouping organisms according to their shared derived features. A good place to start is by looking for the derived trait that is shared between the largest number of organisms. In this case, that's the presence of jaws: all the organisms except the outgroup species (lamprey) have jaws. So, we can start our tree by drawing the lamprey lineage branching off from the rest of the species, and we can place the appearance of jaws on the branch carrying the non-lamprey species.
Image based on Taxonomy and phylogeny: Figure 6, by Robert Bear et al., CC BY 4.0
Next, we can look for the derived trait shared by the next-largest group of organisms. This would be lungs, shared by the antelope, bald eagle, and alligator, but not by the sea bass. Based on this pattern, we can draw the lineage of the sea bass branching off, and we can place the appearance of lungs on the lineage leading to the antelope, bald eagle, and alligator.
Image based on Taxonomy and phylogeny: Figure 6, by Robert Bear et al., CC BY 4.0
Following the same pattern, we can now look for the derived trait shared by the next-largest number of organisms. That would be the gizzard, which is shared by the alligator and the bald eagle (and absent from the antelope). Based on this data, we can draw the antelope lineage branching off from the alligator and bald eagle lineage, and place the appearance of the gizzard on the latter.
Image based on Taxonomy and phylogeny: Figure 6, by Robert Bear et al., CC BY 4.0
What about our remaining traits of fur and feathers? These traits are derived, but they are not shared, since each is found only in a single species. Derived traits that aren't shared don't help us build a tree, but we can still place them on the tree in their most likely location. For feathers, this is on the lineage leading to the bald eagle (after divergence from the alligator). For fur, this is on the antelope lineage, after its divergence from the alligator and bald eagle.
Image based on Taxonomy and phylogeny: Figure 6, by Robert Bear et al., CC BY 4.0
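The grouping steps we just walked through can be sketched in Python. This is a minimal sketch under stated assumptions: the function names `derived_groups` and `is_nested` are invented for this example, the lamprey is taken as the outgroup, and the data are the +/0 values from the table encoded as 1/0.

```python
# Character table from the article: 1 = feature present (+), 0 = absent.
TABLE = {
    "lungs":    {"lamprey": 0, "antelope": 1, "bald eagle": 1, "alligator": 1, "sea bass": 0},
    "jaws":     {"lamprey": 0, "antelope": 1, "bald eagle": 1, "alligator": 1, "sea bass": 1},
    "feathers": {"lamprey": 0, "antelope": 0, "bald eagle": 1, "alligator": 0, "sea bass": 0},
    "gizzard":  {"lamprey": 0, "antelope": 0, "bald eagle": 1, "alligator": 1, "sea bass": 0},
    "fur":      {"lamprey": 0, "antelope": 1, "bald eagle": 0, "alligator": 0, "sea bass": 0},
}


def derived_groups(table, outgroup):
    """List (feature, species sharing the derived state), largest group first."""
    groups = []
    for feature, states in table.items():
        ancestral = states[outgroup]  # the outgroup stands in for the ancestor
        members = frozenset(sp for sp, s in states.items() if s != ancestral)
        groups.append((feature, members))
    # Start with the derived trait shared by the most species, as in the text.
    groups.sort(key=lambda group: len(group[1]), reverse=True)
    return groups


def is_nested(groups):
    """True if every pair of groups is either disjoint or one inside the other."""
    sets = [members for _, members in groups]
    return all(
        a <= b or b <= a or not (a & b)
        for i, a in enumerate(sets)
        for b in sets[i + 1:]
    )


groups = derived_groups(TABLE, "lamprey")
for feature, members in groups:
    print(f"{feature}: {sorted(members)}")
print("Groups are nested:", is_nested(groups))
```

Reading the printed groups from top to bottom gives the same order in which we drew branches: jaws first, then lungs, then the gizzard, with feathers and fur each confined to a single species. The nestedness check passes here because the example data contain no conflicting traits; real datasets often do.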

Parsimony and pitfalls in tree construction

When we were building the tree above, we used an approach called parsimony. Parsimony essentially means that we are choosing the simplest explanation that can account for our observations. In the context of making a tree, it means that we choose the tree that requires the fewest independent genetic events (appearances or disappearances of traits) to take place.
For example, we could have also explained the pattern of traits we saw using the following tree:
Image based on Taxonomy and phylogeny: Figure 6, by Robert Bear et al., CC BY 4.0
This series of events also provides an evolutionary explanation for the traits we see in the five species. However, it is less parsimonious because it requires more independent changes in traits to take place. Because of where we've put the sea bass, we have to hypothesize that jaws arose independently two separate times (once in the sea bass lineage, and once in the lineage leading to antelopes, bald eagles, and alligators). This gives the tree a total of 6 tick marks, or trait change events, versus 5 in the more parsimonious tree above.
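To make the tick-mark counting concrete, here is a hedged Python sketch that counts the minimum number of trait changes each tree requires. Several things are assumptions for illustration: trees are written as nested tuples, the root is fixed to the ancestral state (absence, as inferred from the lamprey outgroup), and the alternative tree is assumed to place the sea bass as the earliest branch, which is consistent with the description of jaws arising twice. The function names are invented for this example.

```python
import math

# Character table from the article: 1 = feature present (+), 0 = absent.
TABLE = {
    "lungs":    {"lamprey": 0, "antelope": 1, "bald eagle": 1, "alligator": 1, "sea bass": 0},
    "jaws":     {"lamprey": 0, "antelope": 1, "bald eagle": 1, "alligator": 1, "sea bass": 1},
    "feathers": {"lamprey": 0, "antelope": 0, "bald eagle": 1, "alligator": 0, "sea bass": 0},
    "gizzard":  {"lamprey": 0, "antelope": 0, "bald eagle": 1, "alligator": 1, "sea bass": 0},
    "fur":      {"lamprey": 0, "antelope": 1, "bald eagle": 0, "alligator": 0, "sea bass": 0},
}

# Trees as nested tuples; a string is a tip (species).
PARSIMONIOUS_TREE = ("lamprey", ("sea bass", ("antelope", ("alligator", "bald eagle"))))
# Assumed topology for the less parsimonious tree (sea bass branches off first).
ALTERNATIVE_TREE = ("sea bass", ("lamprey", ("antelope", ("alligator", "bald eagle"))))


def min_changes(tree, states):
    """Fewest 0<->1 changes for one trait, with the root fixed to state 0."""

    def cost(node):
        # Returns (cost if node is in state 0, cost if node is in state 1).
        if isinstance(node, str):
            return (0, math.inf) if states[node] == 0 else (math.inf, 0)
        totals = [0, 0]
        for child in node:
            child_cost = cost(child)
            for parent_state in (0, 1):
                # A change along the branch to the child costs one tick mark.
                totals[parent_state] += min(
                    child_cost[child_state] + (child_state != parent_state)
                    for child_state in (0, 1)
                )
        return tuple(totals)

    return cost(tree)[0]


def total_changes(tree, table=TABLE):
    """Sum the minimum changes over all traits (the tree's tick-mark count)."""
    return sum(min_changes(tree, states) for states in table.values())


print("Parsimonious tree:", total_changes(PARSIMONIOUS_TREE))  # 5
print("Alternative tree: ", total_changes(ALTERNATIVE_TREE))   # 6
```

Under these assumptions the counts match the tick marks described above: 5 changes for the first tree and 6 for the alternative, with the extra change coming from jaws.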
In this example, it may seem fairly obvious that there is one best tree, and counting up the tick marks may not seem very necessary. However, when researchers make phylogenies as part of their work, they often use a large number of characteristics, and the patterns of these characteristics rarely agree 100% with one another. Instead, there are some conflicts, where one tree would fit better with the pattern of one trait, while another tree would fit better with the pattern of another trait. In these cases, the researcher can use parsimony to choose the one tree (hypothesis) that fits the data best.
You may be wondering: Why don't the trees all agree with one another, regardless of what characteristics they're built on? After all, the evolution of a group of species did happen in one particular way in the past. The issue is that, when we build a tree, we are reconstructing that evolutionary history from incomplete and sometimes imperfect data. For instance:
  • We may not always be able to distinguish features that reflect shared ancestry (homologous features) from features that are similar but arose independently (analogous features arising by convergent evolution).
  • Traits can be gained and lost multiple times over the evolutionary history of a species. A species may have a derived trait, but then lose that trait (revert back to the ancestral form) over the course of evolution.
Biologists often use many different characteristics to build phylogenetic trees because of sources of error like these. Even when all of the characteristics are carefully chosen and analyzed, there is still the potential for some of them to lead to wrong conclusions (because we don't have complete information about events that happened in the past).

Using molecular data to build trees

A tool that has revolutionized, and continues to revolutionize, phylogenetic analysis is DNA sequencing. With DNA sequencing, rather than using physical or behavioral features of organisms to build trees, we can instead compare the sequences of their orthologous (evolutionarily related) genes or proteins.
The basic principle of such a comparison is similar to what we went through above: there's an ancestral form of the DNA or protein sequence, and changes may have occurred in it over evolutionary time. However, a gene or protein doesn't just correspond to a single characteristic that exists in two states.
Instead, each nucleotide of a gene or amino acid of a protein can be viewed as a separate feature, one that can flip to multiple states (e.g., A, T, C, or G for a nucleotide) via mutation. So, a gene with 300 nucleotides in it could represent 300 different features existing in 4 states! The amount of information we get from sequence comparisons—and thus, the resolution we can expect to get in a phylogenetic tree—is much higher than when we're using physical traits.
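As a small illustration of treating each position as its own feature, the Python sketch below scans an alignment column by column and reports the positions where the species differ. The sequences and species names are made up for this example (they are not real genes), and `variable_sites` is a hypothetical helper name.

```python
def variable_sites(alignment):
    """Return (position, {species: nucleotide}) for each column that varies.

    `alignment` maps species names to aligned sequences of equal length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    # zip(*...) walks the alignment one column (one feature) at a time.
    for position, column in enumerate(zip(*alignment.values())):
        if len(set(column)) > 1:
            sites.append((position, dict(zip(alignment, column))))
    return sites


# Made-up toy alignment, for illustration only.
toy_alignment = {
    "species_1": "ATGCCA",
    "species_2": "ATGCTA",
    "species_3": "ATACTA",
}

for position, states in variable_sites(toy_alignment):
    print(position, states)
```

Each reported column plays the same role as one row of a trait table, except that it can take four states instead of two.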
To analyze sequence data and identify the most probable phylogenetic tree, biologists typically use computer programs and statistical algorithms. In general, though, when we compare the sequences of a gene or protein between species:
  • A larger number of differences corresponds to more distantly related species
  • A smaller number of differences corresponds to more closely related species
For example, suppose we compare the beta chain of hemoglobin (the oxygen-carrying protein in blood) between humans and a variety of other species. If we compare the human and gorilla versions of the protein, we'll find only 1 amino acid difference. If we instead compare the human and dog proteins, we'll find 15 differences. With human versus chicken, we're up to 45 amino acid differences, and with human versus lamprey (a jawless fish), we see 127 differences [1]. These numbers reflect that, among the species considered, humans are most closely related to the gorilla and most distantly related to the lamprey.
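The comparison itself is simple to sketch in Python. The function below counts mismatched positions between two aligned sequences, and the dictionary holds the hemoglobin beta-chain difference counts quoted above. The function and variable names are assumptions made for this example, and real analyses rely on dedicated alignment and tree-building software rather than a raw count like this.

```python
def count_differences(seq_a, seq_b):
    """Count positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Amino acid differences from the human beta chain, as quoted in the text.
DIFFERENCES_VS_HUMAN = {"gorilla": 1, "dog": 15, "chicken": 45, "lamprey": 127}

# Fewer differences suggests a closer relationship to humans.
ranked = sorted(DIFFERENCES_VS_HUMAN, key=DIFFERENCES_VS_HUMAN.get)
print("Closest to most distant:", ranked)

# Toy strings (not real protein sequences) to show the counting step.
print(count_differences("ACGT", "ACCT"))  # 1
```

Sorting by difference count reproduces the ordering described in the text, from gorilla (closest) to lamprey (most distant).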
You can see Sal working through an example involving phylogenetic trees and sequence data in this AP biology free response question video.
