PhyloTrace Version 1.5.0
Web: www.phylotrace.com
Contact: info@phylotrace.com
Github: https://github.com/infinity-a11y/PhyloTrace
PhyloTrace is a platform for bacterial pathogen monitoring on a genomic level. Its components revolve around Core-Genome Multilocus Sequence Typing (cgMLST) and Antimicrobial Resistance Screening. Complex analyses and computations are wrapped in an appealing and easy-to-handle graphical user interface. Users build a local database of analyzed isolates that can be managed directly within the application. The visualization of isolate relationships and genetic profiles is highly interactive and helps to reveal patterns that explain outbreak dynamics and events by connecting genomic information with epidemiologic variables. PhyloTrace achieves universal compatibility by assigning unique hashes based on sequence and allele information. This implementation enables efficient comparison and sharing of results between laboratories.
PhyloTrace is intended for research and academic purposes only.
Install the application by following the steps described in the README document on GitHub. Launch PhyloTrace from the applications menu of your system. The app runs in the system’s default browser. PhyloTrace is optimized for Chrome, Chromium, Brave, Opera and Vivaldi. Avoid using Firefox, as some elements are distorted or not visible at all.
PhyloTrace does not require, but encourages, building a local database
and iteratively adding new bacterial isolates together with their
allelic profile and meta data. Upon first launch, either load an already
existing database or create a new one.
To start completely from scratch with no previously built database
available, select + Create New on the start screen (Figure 1) and choose
a path where the database should be built. A folder named Database will
be created in the respective location. Make sure to select a location
with read and write permissions. Since no entries have been added and no
schemes downloaded yet, the database is empty and you are directed
immediately to the > Manage Scheme tab after clicking on Load. The
drop-down menu lists all bacterial species that are available in the
cgMLST.org Nomenclature Server (h25). Selecting a species will display
information about the scheme, such as the seed genome or the curators.
Pick the species you want to work with and press Download. You can now
proceed to type the first assemblies belonging to the respective
bacterial species (see 3 Allelic Typing).
If you or your working group / institution have already used PhyloTrace
before, the respective database folder might already be saved on the
internal file system. Click Browse on the start screen (Figure 1) and
select the path of the database folder. PhyloTrace will automatically
recognize whether the selected folder contains compatible data.
The database is structured in folders, one for each bacterial species
you have worked with (see Figure 3). Therefore, when loading a local
database, select which species you want to work with in this session.
For example, if the database contains entries typed with
Bordetella pertussis, Burkholderia pseudomallei and
Klebsiella pneumoniae schemes, you can choose one of them. Proceed by
clicking on the Load button. The database section containing the data
for the selected species will load.
If the existing database doesn’t include the species you want to work
with, pick any available species and load the database. Then head over
to the > Manage Scheme tab and select your desired bacterial species
from the list. Proceed to download the scheme files comprising gene
variants and scheme info by clicking Download. After the download is
complete you are prompted to load the database again (see Figure 4).
Select the species that was just downloaded and confirm. Proceed to
start the first typing process for this species (see 3 Allelic Typing).
The currently loaded species/scheme is displayed at the top of the sidebar below the PhyloTrace logo. If more than one scheme is available in the current database directory, the scheme can be changed within the same session. To switch, click the button next to the displayed scheme and choose the new one. After confirmation, the database is loaded with the newly selected scheme. If you would like to switch to a scheme in a database located in a different directory, restart the app and select the path of that database folder.
The typing process is the fundamental step that generates the data (i.e. the allelic profile) for the genomic comparison. The method applied is based on core-genome multilocus sequence typing (cgMLST). An allelic profile is generated for selected bacterial isolates. The allelic profile specifies which allele variant is present for each gene in the cgMLST scheme. If the process was successful, the results, i.e. the allelic profile of the respective isolate as well as epidemiologic meta data, are added as an entry to the local database (see 4 Database Browser). By repeating this process with further isolates, a foundation for a library of bacterial isolates is created. Technically there is no limit to the number of entries in the database, although performance might be reduced if there are several hundred entries in the currently loaded scheme (depending on system capacity). The variant calling and alignment steps of the typing process are carried out by BLAT (BLAST-like Alignment Tool) for whole genome assemblies and the KMA (k-mer alignment) algorithm for raw reads (1,2). Allelic typing for raw reads will be available soon.
In the sidebar of the > Allelic Typing tab select ☑ Single | ☐ Multi
(see Figure 5). Clicking on Browse will open a window in which an
assembly file from the local system can be selected. Any of the commonly
used FASTA file extensions (.fasta, .fna or .fa) is accepted. Selecting
an incompatible file type will prevent the typing process from starting.
Make sure that the assembly file contains sequence data of a bacterial
species that matches the selected scheme. Afterwards the basic meta data
(i.e. Assembly ID, Assembly Name, Isolation Date, Host, Country, City)
can be declared. Filling out every field is not mandatory if you don’t
wish to or don’t have the respective information. Note that the
Assembly ID has to be unique; proceeding is not possible if the same ID
is already present in the local database. Except for the Assembly ID,
these isolate variables can still be changed afterwards in the
tab >> Browse Entries. Clicking on Confirm will save the metadata and
make the process ready to start.
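The two preconditions described above (an accepted FASTA extension and a unique Assembly ID) can be sketched in a few lines of R. This is only an illustration: the helper names `is_fasta_file` and `is_new_assembly_id` and the example file names are hypothetical, and the checks PhyloTrace performs internally may differ.

```r
# Hypothetical helpers illustrating the input checks; not PhyloTrace internals.
accepted_ext <- c("fasta", "fna", "fa")

# TRUE if the file extension is one of the accepted FASTA extensions
is_fasta_file <- function(path) {
  tolower(tools::file_ext(path)) %in% accepted_ext
}

# TRUE if the ID is non-empty and not yet used in the local database
is_new_assembly_id <- function(id, existing_ids) {
  nzchar(id) && !(id %in% existing_ids)
}

files <- c("isolate_01.fasta", "isolate_02.fna", "isolate_03.fa", "notes.txt")
is_fasta_file(files)
## [1]  TRUE  TRUE  TRUE FALSE

existing_ids <- c("isolate_01", "isolate_02")
is_new_assembly_id("isolate_03", existing_ids)
## [1] TRUE
is_new_assembly_id("isolate_01", existing_ids)
## [1] FALSE
```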
Before starting the process, select whether to save the assembly file
to the local database. If an assembly file is not saved, screening for
resistance and virulence genes will not be available later for the
respective isolate (see 5 AMR Screening). The assembly file cannot be
added retrospectively. Pressing Start will launch the typing process.
The alignment algorithm now searches the selected assembly for the
alleles contained in the scheme and checks which variant is present. The
loading bar provides feedback on the progress. The duration depends on
the capability of your system and on the number of alleles and variants
included in the scheme, and can take a while. Once 100% is reached, the
typing results are evaluated and appended to the local database.
Database changes in the tab >> Browse Entries are automatically blocked
during this finalization step to avoid conflicts. After this last step
is finished you can reset to start another typing process. If the typing
was successful, the addition of a new entry is indicated by a pulsating
button in the tab >> Browse Entries. Click this button to load the
updated database including the newly added entry.
Multi typing is recommended for larger collections of assemblies
belonging to the same species. This saves the time needed to start the
process for each assembly individually. In the sidebar of the
tab > Allelic Typing switch to ☐ Single | ☑ Multi and click on Browse
to select a folder containing the assemblies. If you plan to type just a
subset of the selected folder, untick the unwanted assemblies in the
table below and choose a compatible Assembly ID. The multi typing
process can only be started if no incompatible files are ticked. Because
all files are piped seamlessly into the process, the basic meta data can
only be declared once for all assemblies. The values declared for
Isolation Date, Host, Country and City will apply to every new entry
that is produced in this multi typing process. The Assembly Name will
initially be identical to the Assembly ID, both representing unique
identifiers of the assembly. The file name of the respective assembly is
automatically assigned to both. However, all of the basic meta data
values, except Assembly ID, can be changed retrospectively once the
entry has been successfully added to the database. After confirming the
metadata, the Start button will be rendered. Note that if you choose not
to save the assembly files to the local database, screening for
resistance and virulence genes will not be available for the respective
isolates later (see 5 AMR Screening). The assembly files also cannot be
added retrospectively.
Upon starting the multi typing process, a field in which the progress
is logged is displayed. The process can be monitored with this overview.
The log of the multi typing process can be downloaded as a text file.
Notifications providing feedback about the status of the multi typing
process show up for every relevant event, such as the (un-)successful
addition of an entry or the finalization of the multi typing process. A
pending typing process can be canceled by clicking Terminate. While the
process is in the typing or alignment phase (indicated by Processing in
the log), you can keep working with PhyloTrace, e.g. visualizing or
editing the local database. However, just as for single typing, the app
automatically recognizes when the process switches to the evaluation and
addition phase (indicated by Attaching in the log), during which any
database changes are blocked. After each successful addition you can
reload the database in the tab >> Browse Entries to inspect the new
entry. Unsuccessful typing attempts are captured in the log and in the
multi typing summary once the process has been finalized (see Figure 6).
Individual results can be inspected by choosing them from the selector
in the right column. Only notable events are displayed, e.g. the
discovery of a new allele variant or unsuccessful allele calling
attempts. Press Reset to start another multi typing process.
After each variant from the cgMLST scheme has been searched and
aligned to the assembly, the results are evaluated to determine which
allele variant is present for each locus. This is conducted by a
conditional multi-step process that ensures correctness and minimizes
false positive assignments. The steps and the logic applied in this
process are shown in Figure 7. If none of the variants
from the scheme could be found in the bacterial isolate, the presence of
a potential new gene variant is evaluated (see 3.3.1 New Variant Validation).
In case none of the variants from the locally available scheme match
perfectly, the locus is checked for the existence of a new and valid
variant. To ascertain whether this variant is valid, the locus must
fulfill conditions such that it is likely to encode a gene. If there are
multiple different nucleotide regions in the assembly possibly coding
for a gene, each of them is sorted and passed through the validation
logic (see Figure 8).
Unlike the genetic distance between a pair of sequences, which sums up the number of positions at which the nucleotides differ, the allelic distance considers entire loci/alleles. To obtain the allelic distance, algorithms based on the distance calculation method introduced by Hamming in 1950, originally meant for information technology, are used (3). The Hamming distance is a metric that quantifies the discrepancy between two strings of equal length. It counts the number of positions at which the characters differ between the two strings. Essentially, it indicates the minimum number of substitutions required to transform one string into the other. For cgMLST with PhyloTrace, the allelic profile is represented by hashes, i.e. 64-bit words, organized in an array. The positions of the array elements correspond to the loci in the scheme and each hash represents the allele sequence of the respective locus. This allelic profile is generated during the typing process. Thus, in a pairwise comparison of the allelic profiles of two isolates, the total number of discrepant alleles is the allelic distance. Comparing a selection of isolates results in a distance matrix (see 4.4 Distance Matrix), which is then used to compute a tree (see 6 Visualization).
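As a minimal sketch of this idea, the allelic distance between two complete profiles (no missing values) is simply the number of array positions that hold different hashes. The function name `allelic_distance` and the shortened hash strings below are made up for illustration; representing the hashes as character strings is an assumption made for readability.

```r
# Allelic distance: number of loci at which two profiles carry different alleles.
# Assumes both profiles list the same loci in the same order and contain no NA.
allelic_distance <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(x != y)
}

# Hypothetical, shortened hashes; one element per locus
isolate_a <- c(locus1 = "3fa9c1d2", locus2 = "b7e04411",
               locus3 = "90cc12ab", locus4 = "5d5e0f77")
isolate_b <- c(locus1 = "3fa9c1d2", locus2 = "b7e04999",
               locus3 = "90cc12ab", locus4 = "11a2b3c4")

allelic_distance(isolate_a, isolate_b)
## [1] 2
```

Note that locus2 contributes exactly 1 to the distance, no matter how many characters of the two hashes (or nucleotides of the underlying sequences) differ.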
If no variant could be assigned for some genes contained in the scheme, NA values are placed in the allelic profile at the position of the respective gene/locus. This can happen if the corresponding gene is not found in the assembly sequence, if there are multiple hits, or if the variant in the assembly is non-coding (refer to 3.3 Variant Assignment).
To showcase how allelic distances are calculated for isolates with
missing values, we set up an example. For simplicity we consider just
three isolates, Isolate 1, Isolate 2 and Isolate 3, with only three
loci, Locus A, Locus B and Locus C. For Isolate 1 let Locus A have
variant 1, Locus B a missing value NA and Locus C variant 1. For
Isolate 2 let Locus A be a missing value NA, Locus B variant 1 and
Locus C variant 1. For Isolate 3 let Locus A have variant 2, Locus B
also variant 2 and Locus C variant 1.
allelic_profile <- data.frame(A = c(1, NA, 2), B = c(NA, 1, 2), C = c(1, 1, 1),
                              row.names = c("Isolate 1", "Isolate 2", "Isolate 3"))
allelic_profile
##            A  B C
## Isolate 1  1 NA 1
## Isolate 2 NA  1 1
## Isolate 3  2  2 1
Option 1: Ignore missing values for pairwise comparison
If the first option is selected as the missing value handling strategy, NAs are ignored in the pairwise comparison between two isolates. Unlike Option 2, only the individual missing values are ignored, not the entire locus.
# Option 1: count only loci where both isolates have a variant and the variants differ
hamming.distIgnore <- function(x, y) {
  sum((x != y) & !is.na(x) & !is.na(y))
}
proxy::dist(allelic_profile, method = hamming.distIgnore)
##           Isolate 1 Isolate 2
## Isolate 2         0
## Isolate 3         1         1
Isolates 1 and 2 each have an NA for one of the first two loci, A and
B, while the third locus C is identical. Their allelic distance is 0,
hence these two isolates are considered identical in their allelic
profile. The two other pairs, Isolate 1 & 3 and Isolate 2 & 3, both
result in an allelic distance of 1.
Option 2: Omit loci with missing values for all assemblies
If the second option is selected, loci containing at least one missing value are ignored in the calculation of allelic distances. Unlike Option 1, the loci with missing values are omitted entirely for all pairwise comparisons. Even if both isolates of a pair have valid variant numbers for a locus, the locus is not included in the analysis if it contains just one NA for another isolate. For the missing value statistics shown in Figure 10 [5.5 Missing Values], 41 loci, displayed as columns in the missing value table, would not be considered for the distance calculation. For this option the respective loci are filtered out of the allelic profile before the distance computation is applied. Because this option can skew the whole picture, choosing it is only recommended if very few loci are affected by missing values.
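# Option 2: drop every locus that has a missing value in any isolate
hamming.distOmit <- function(x, y) {
  sum(x != y)
}
# Loci A and B contain NA values and are removed (select() comes from the dplyr package)
allelic_profile_noNA <- dplyr::select(allelic_profile, -A, -B)
proxy::dist(allelic_profile_noNA, method = hamming.distOmit)
##           Isolate 1 Isolate 2
## Isolate 2         0
## Isolate 3         0         0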
Loci A and B are omitted before calculating the distance. This leads to
all isolates being considered identical with an allelic distance of 0,
because they all carry variant 1 for the only remaining locus C.
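In the example the loci to drop are named explicitly. For real schemes with thousands of loci, the affected columns would have to be detected programmatically. A possible base R sketch (not necessarily how PhyloTrace implements the filter) is:

```r
allelic_profile <- data.frame(A = c(1, NA, 2), B = c(NA, 1, 2), C = c(1, 1, 1),
                              row.names = c("Isolate 1", "Isolate 2", "Isolate 3"))

# TRUE for every locus (column) that has no missing value in any isolate
complete_loci <- colSums(is.na(allelic_profile)) == 0

# drop = FALSE keeps the data frame structure even if a single locus remains
allelic_profile_noNA <- allelic_profile[, complete_loci, drop = FALSE]
names(allelic_profile_noNA)
## [1] "C"
```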
Option 3: Treat missing values as allele variant
The third option is rather specific and, considering the consequences for subsequent calculation of allelic distances and analyses, should be used with caution. Here, NA values are treated as if they were a separate variant.
# Option 3: NA counts as its own variant; NA vs. variant differs, NA vs. NA does not
hamming.distCategory <- function(x, y) {
  sum((x != y | xor(is.na(x), is.na(y))) & !(is.na(x) & is.na(y)))
}
proxy::dist(allelic_profile, method = hamming.distCategory)
##           Isolate 1 Isolate 2
## Isolate 2         2
## Isolate 3         2         2
Because NA is treated as a further valid variant, all isolate pairs receive an allelic distance of 2.
Depending on the option for NA handling applied to these three allelic profiles, the resulting allelic distances differ. The results of these example calculations are summarized in the table below.
| Pair | Option 1 | Option 2 | Option 3 |
|---|---|---|---|
| Isolate 1 & 2 | 0 | 0 | 2 |
| Isolate 1 & 3 | 1 | 0 | 2 |
| Isolate 2 & 3 | 1 | 0 | 2 |
The > Database Browser tab allows you to examine and manage the
information saved in the local database of the selected scheme. It is
divided into the tabs >> Browse Entries, >> Scheme Info, >> Loci Info,
>> Distance Matrix and >> Missing Values.
Each assembly that has been successfully typed is added to the table
in >> Browse Entries. This overview allows you to edit (see 4.1.1 Edit
Meta Data), delete (see 4.1.3 Delete Entries), inspect (see 4.1.4 Browse
the Allelic Profile) and add (see 4.1.2 Custom Variables) information
connected with the entries. The table can also be downloaded (see 4.1.5
Download Entry Table). The table contains both the meta data and the
allelic profile for each entry. The meta data as well as custom
variables (see 4.1.2 Custom Variables) appear first, in the left part of
the table, while the allelic profile with the assigned variants is
positioned in the right part of the table (see 4.1.4 Browse the Allelic
Profile). The Index automatically assigns a number to each entry and is
updated when entries are deleted (see 4.1.3 Delete Entries). The Include
status determines whether the respective entry is included in or
excluded from further analyses, such as Visualization (see 6
Visualization).
The basic meta data comprising Assembly Name, Isolation Date, Host,
Country and City can be edited in the entry table by left-clicking in
the corresponding field. As soon as changes are detected, a pulsating
button appears that saves the changes on click. If you decide otherwise,
press the Undo button to go back to the previous state. Assembly ID is
the name of the isolate in the Isolate directory of the local database
and can’t be changed. The Index number as well as the assigned hashes
representing the allele variants in the allelic profile also can’t be
edited, because this would invalidate the analysis.
Figure: >> Browse Entries overview with several options and functions
to manage the local database.
There is also the option to add custom variables using the controls in
the >> Browse Entries sidebar. Choose a name for the variable and press
the green + button to add it. In the dialogue window select the variable
type, categorical (character) or continuous (numeric). After
confirmation the variable is ready to be filled with values. These can
be changed retrospectively in the same way as basic meta data (see 4.1.1
Edit Meta Data). Note that the database needs to be saved, otherwise the
custom variables are not added permanently. The type and name of a
custom variable can’t be changed retrospectively, but the variable can
be deleted by selecting it from the drop-down menu in the sidebar and
clicking the red - button. If more than five custom variables are
present, a table summarizing them is displayed in the sidebar.
The Delete Entries panel in the top right corner of the >> Browse
Entries tab allows you to delete single or multiple entries at once.
Select one or multiple entries to be deleted according to their Index in
the drop-down menu. Clicking the red x button will open a dialogue
window asking you to confirm that the selection should be deleted. The
deletion removes the respective entry completely, together with all its
meta data, custom variable values and allelic profile. However, as long
as the database has not been saved after the deletion, the deletion can
be undone with the Undo button in the same session, and the entry will
reappear in the next session. Note that if you select all entries for
deletion, confirmation will immediately and irreversibly empty the
database for the currently selected scheme and you will not have the
option to undo this action.
Scrolling the entry table to the right will reveal the allelic
profile. The variant numbers for each allele/locus are arranged
column-wise for each entry. By default, only the first 20 loci are
displayed. It’s possible to change manually which loci are shown by
selecting or deselecting them in the Compare Loci panel on the right
below the Delete Entries panel. The respectively assigned hash,
representing a distinct allele sequence, is truncated to the first and
last four digits. Locus columns containing at least one entry with an
allele variant that is different from the others are highlighted in
green.
If the Only Varying Loci option is activated, only loci with differing
variants (i.e. the highlighted columns) are displayed. For missing
variant values, i.e. if no variant could be allocated to a locus (see
3.4.1 Missing Value Handling), the corresponding cell appears empty.
The entry table can be downloaded as a CSV file. There are two options
to control this output. As you might sometimes want to include only a
subset of entries in the current analysis, there is the option to
include only the entries of interest in the output file. Activate the
switch Only included Entries to include only the entries that are
checkmarked in the Include column. Control the Include status either by
checking or unchecking the checkboxes in the Include column, or select
or deselect all at once by using the buttons at the top left of the
entry table. Note that the database has to be saved for the changes to
take effect. The Index of the entries marked as included is highlighted
in green, and only these entries are considered in visualization (see 6
Visualization). Moreover, you can choose whether and which loci should
be included in the download. By default only the meta data and custom
variables of the entries are included in the CSV file. If you activate
the switch Include Displayed Loci, the currently displayed loci are
included as well. Use the control in the Compare Loci box to decide
which and how many loci are displayed. Upon clicking the Download button
you can choose the location on your system where the file should be
saved.
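The effect of the two switches on the exported file can be pictured with a small R sketch. The data frame, its column names and the helper `export_entries` are hypothetical stand-ins for the entry table and are not taken from the PhyloTrace source.

```r
# Hypothetical miniature entry table: meta data on the left, one locus on the right
entries <- data.frame(
  Index = 1:3,
  Include = c(TRUE, FALSE, TRUE),
  "Assembly ID" = c("iso_1", "iso_2", "iso_3"),
  Country = c("DE", "FR", "DE"),
  BP0001 = c("3fa9c1d2", "3fa9c1d2", "b7e04411"),
  check.names = FALSE, stringsAsFactors = FALSE
)
meta_columns <- c("Index", "Include", "Assembly ID", "Country")
displayed_loci <- "BP0001"

# only_included ~ switch "Only included Entries"
# include_loci  ~ switch "Include Displayed Loci"
export_entries <- function(entries, only_included, include_loci) {
  rows <- if (only_included) entries$Include else rep(TRUE, nrow(entries))
  cols <- if (include_loci) c(meta_columns, displayed_loci) else meta_columns
  entries[rows, cols, drop = FALSE]
}

out <- export_entries(entries, only_included = TRUE, include_loci = FALSE)
csv_path <- tempfile(fileext = ".csv")
write.csv(out, csv_path, row.names = FALSE)
```

With both switches off, all three entries and only the meta data columns would be written.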
The tab >> Scheme Info allows you to inspect the properties of the
currently selected scheme. The table displays information regarding the
cgMLST scheme downloaded from the cgMLST.org Nomenclature Server (h25).
It comprises the name of the scheme, the version, the seed genome, genus
and species, the number of loci included, the complex type distance and
count parameters, the date of the most recent changes, the official
curators, publications addressing this scheme as well as the accessory
scheme.
Figure: >> Scheme Info tab showing information about a
Bordetella pertussis scheme from cgMLST.org.
The overview in the tab >> Loci Info provides information on the loci
included in the scheme as well as the distribution of alleles among the
isolates present in the local database. The table allows you to browse
the Locus ID (e.g. BP0001, BP2483), if known the gene identifier
(e.g. glpK, pykA), the position of the loci in the seed genome, the
length in nucleotides (e.g. 1233), the gene product (e.g. pyruvate
kinase, chromosome partitioning protein) as well as the number of
variants included in the base scheme. There is the option to filter the
table by keywords or numbers. Note that the filter applies to all
attributes, so searching for “566” would display loci whose ID includes
this number (e.g. “BP0566”, “BP1566”, etc.), as well as loci whose
position (e.g. 317566, 1255669), length (e.g. 1566) or any other
attribute contains the keyword or number.
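The behaviour of such a filter across all attributes can be reproduced with a short base R sketch. The table contents and the helper `filter_all_columns` are invented for illustration; the actual table widget used by PhyloTrace may match keywords differently.

```r
# Hypothetical excerpt of a loci table
loci <- data.frame(
  Locus = c("BP0056", "BP0566", "BP1000", "BP2483"),
  Gene = c("glpK", NA, "pykA", NA),
  Position = c(58123L, 317566L, 1255669L, 2041877L),
  Length = c(1233L, 912L, 1566L, 450L),
  stringsAsFactors = FALSE
)

# Keep every row in which at least one attribute contains the keyword
filter_all_columns <- function(df, keyword) {
  hits <- lapply(df, function(col) grepl(keyword, as.character(col), fixed = TRUE))
  df[Reduce(`|`, hits), , drop = FALSE]
}

filter_all_columns(loci, "566")$Locus
## [1] "BP0566" "BP1000"
```

BP0566 is matched through its ID, whereas BP1000 is matched through its position (1255669) and length (1566).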
Figure: >> Loci Info tab showing information on the selected locus
BP0008 with allele sequences and frequency.
Selecting a locus from the table will render the alleles present in the
database and their respective DNA sequences. Browse alleles by choosing
them from the selector, which shows the frequency of the selected allele
in the database. The sequence can be copied to the clipboard. A FASTA
file comprising all hashed allele sequences of the currently selected
locus can be exported with Save FASTA. To export the table with the
metadata of all loci included in the scheme, click the download button
right next to the header Loci at the top.
The tab >> Distance Matrix shows a heatmap matrix of the allelic
distances between the entries. For details on how the allelic distances
are derived refer to 3.4 Calculation of Allelic Distance. For each pair
of entries, the number of allele variants that are not identical,
i.e. the allelic distance, is displayed in the respective cell. The
choice of how missing values, i.e. unsuccessful variant allocations for
some loci, are handled can have a small or a large impact on these
values, depending on different parameters (see 3.4.1 Missing Value
Handling). In addition to the visualization with tree plots, changes in
the missing value handling strategy can be observed directly in this
overview. The readability of the matrix is enhanced by a heatmap. The
values are normalized, resulting in a color gradient from light green to
dark red. The lowest value, which is always 0 on the diagonal (the
allelic distance of an entry to itself is zero), is highlighted in light
green. The highest value (dark red) varies and depends on the highest
allelic distance value in the matrix.
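The following sketch shows one way to build such a matrix and to normalize it for a color gradient in base R. The example profiles, the division by the maximum distance and the exact color names are assumptions made for illustration; PhyloTrace may normalize and color the cells differently.

```r
# Hypothetical allelic profiles: three isolates, four loci
profiles <- rbind(
  "Isolate 1" = c(A = 1, B = 1, C = 1, D = 1),
  "Isolate 2" = c(A = 1, B = 2, C = 1, D = 1),
  "Isolate 3" = c(A = 2, B = 2, C = 2, D = 1)
)

# Pairwise allelic distance that ignores missing values (Option 1)
hamming_ignore <- function(x, y) sum((x != y) & !is.na(x) & !is.na(y))

n <- nrow(profiles)
dist_matrix <- matrix(0, n, n,
                      dimnames = list(rownames(profiles), rownames(profiles)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dist_matrix[i, j] <- hamming_ignore(profiles[i, ], profiles[j, ])
  }
}

# Scale to [0, 1] so that 0 maps to light green and the maximum to dark red
max_dist <- max(dist_matrix)
scaled <- if (max_dist > 0) dist_matrix / max_dist else dist_matrix
gradient <- colorRamp(c("lightgreen", "darkred"))
cell_colors <- matrix(rgb(gradient(as.vector(scaled)), maxColorValue = 255),
                      n, n, dimnames = dimnames(dist_matrix))
```

Here the largest distance is 3 (Isolate 1 vs. Isolate 3), so that cell receives the dark red end of the gradient, while the diagonal stays light green.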
There is the option to change the appearance of the matrix. Choose
whether Assembly Name, Assembly ID or Index is displayed as column and
row header. As the focus might sometimes be on the subset of entries
that are marked as included in the entry table, the switch Only
Included Entries can be toggled to show only this selection. The display
of the diagonal line and the upper triangle can also be activated or
deactivated using the switches Show Diagonal and Show Upper Triangle
respectively. The distance matrix can be downloaded as a CSV file. Note
that the matrix is downloaded as currently displayed, including all the
changes made to the appearance (e.g. with or without diagonal, or Index
instead of Assembly Name as header).
Missing values occur if a locus cannot be found in the assembly or if
the present allele contains mutations leading to a dysfunctional gene.
As long as no entry in the local database has any missing values, the
tab >> Missing Values is not displayed. When a new entry with NA
value(s) is added to a local database that has contained no missing
values so far, reloading the database will automatically render the
tab >> Missing Values to call attention to the newly occurring missing
values. This tab provides statistical information about the occurrence
of missing values and, most importantly, control buttons to select the
strategy by which missing values are treated in subsequent analyses. The
selected strategy directly impacts the calculation of the allelic
distances between the bacterial isolates. The options to choose from are
detailed in 3.4.1 Missing Value Handling. Due to the importance of
missing values and how they are treated, the tab >> Missing Values will
always be rendered first upon loading a local database containing at
least one missing value.
Figure 14 shows statistics about the missing values of the entries in this database. There are 1069 unsuccessful allele allocations in total, i.e. the global sum of NA values of all entries and loci. There are 2983 loci in total in the selected Bordetella pertussis scheme and 217 of these have one or more missing values, which makes up about 7.3 %. Isolates for which more than 5% of loci contain missing values are highlighted in orange. These should be included in further analyses with caution because a significant share of alleles couldn’t be determined.
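The statistics described here can be derived from an allelic profile table with a few lines of base R. The sketch below reuses the small three-isolate example from 3.4.1 rather than the database shown in the figure; the variable names are chosen for illustration only.

```r
allelic_profile <- data.frame(A = c(1, NA, 2), B = c(NA, 1, 2), C = c(1, 1, 1),
                              row.names = c("Isolate 1", "Isolate 2", "Isolate 3"))
na_mask <- is.na(allelic_profile)

# Global sum of NA values over all entries and loci
total_na <- sum(na_mask)

# Number and share of loci with one or more missing values
loci_with_na <- sum(colSums(na_mask) > 0)
share_loci_with_na <- 100 * loci_with_na / ncol(allelic_profile)

# Isolates with more than 5 % missing loci (the ones highlighted in orange)
na_share_per_isolate <- 100 * rowSums(na_mask) / ncol(allelic_profile)
flagged <- names(na_share_per_isolate)[na_share_per_isolate > 5]
flagged
## [1] "Isolate 1" "Isolate 2"

# The share reported for the Bordetella pertussis example: 217 of 2983 loci
round(100 * 217 / 2983, 1)
## [1] 7.3
```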
Each row in the table on the right shows an entry that contains at
least one missing value. The next column, Errors, gives the number of
missing values for that isolate. The following columns are the loci with
at least one missing value (denoted by NA).
Screening for species-specific genes of interest, e.g. antibiotic
resistance, virulence or stress genes, can be performed using the
integrated NCBI/AMRFinder tool. The tab > Resistance Profile provides
the interface for this feature and lets users inspect the screening
results in >> Browse Entries and perform the screening from the
tab >> Screening. Note that not every species is available for screening
with AMRFinder. The availability for the currently selected scheme is
checked automatically.
Use the tab >> Screening to run AMRFinder. Selecting one or multiple
isolates and clicking Start initiates the process. The runtime is
estimated at less than a minute per isolate. Only isolates for which the
respective assembly file is present in the local database can be
screened. The results can be inspected in parallel using the selector on
the right, which appears once the screening of at least one isolate has
finished. Feedback on unsuccessful screening attempts is displayed as
well.
There are two viewing modes available to browse the resistance profile
resulting from gene screening. Selecting the view mode
☑ Picker | ☐ Table renders the option to select isolates from a simple
selector. The table showing the resistance profile (also including
virulence genes, stress genes, etc.) for the selected isolate will
appear below. The view mode ☐ Picker | ☑ Table shows the isolate entry
table above the resistance profile instead of the selector and
therefore, in addition to providing a good overview, enables filtering
and sorting. Select an entry from the table to render the resistance
profile of this isolate. The currently selected table can be exported as
CSV with Profile Table.
Based on the allelic distances in the distance matrix (see 4.4 Distance
Matrix), different tree plots can be created. PhyloTrace allows you to
choose between three different tree construction algorithms,
Minimum-Spanning, Neighbour-Joining and UPGMA. The tree type can be
selected in the sidebar of the > Visualization tab (see Figure 17). On
click of the Create Tree button, a tree plot of the currently selected
tree type will be computed and displayed. You can switch to a different
tree type and create another tree without losing the tree created
before. If you switch back to the previous tree type, you will still
have the previously created tree. Unless you create a new tree for the
same tree type, the plot will be preserved in the current session.
Switching between different tree types makes it possible to seamlessly
compare trees created with the same data set but different tree
construction algorithms. Changes to the entry table in the tab
>> Browse Entries, such as the inclusion of additional isolates (via
ticking Include) or edited variables, will only take effect in the tree
plot if you save the database with the changes and click Create Tree
again. Once a tree has been created, it can be modified and customized
without having to reload it.
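To give an impression of how a tree is derived from a distance matrix, the sketch below builds a UPGMA clustering with base R. UPGMA corresponds to average-linkage hierarchical clustering, which `stats::hclust()` provides via `method = "average"`. The distance values are hypothetical, and this is not necessarily the routine or package PhyloTrace uses internally.

```r
# Hypothetical allelic distance matrix for three isolates
isolates <- c("Isolate 1", "Isolate 2", "Isolate 3")
dist_matrix <- matrix(c(0, 1, 3,
                        1, 0, 2,
                        3, 2, 0),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(isolates, isolates))

# UPGMA = average-linkage hierarchical clustering on the distance matrix
upgma_tree <- hclust(as.dist(dist_matrix), method = "average")

# Isolates 1 and 2 are joined first (distance 1); Isolate 3 joins at the
# average of its distances to both, (3 + 2) / 2
upgma_tree$height
## [1] 1.0 2.5

# plot(upgma_tree) would draw the corresponding dendrogram
```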
The minimum-spanning-tree (MST) algorithm constructs a tree by connecting the closest points or nodes of the distance matrix without forming cycles. It connects all nodes in such a way that the total edge length of the resulting tree is minimal. Refer to 6.1.1 MST Modification to find out how the tree appearance can be modified. The nodes represent single bacterial isolates. Isolates with identical allelic profiles, i.e. a distance value of 0, are summarized in a single node. If the allelic distance between isolates lies within a certain threshold, clusters are drawn.
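The principle can be illustrated with Prim's algorithm, one of the classic ways to compute a minimum spanning tree. The function `prim_mst` and the distance values below are a hypothetical sketch; the manual does not state which MST algorithm or package PhyloTrace uses, and collapsing identical isolates into one node is left out here.

```r
# Prim's algorithm on a symmetric distance matrix with row and column names
prim_mst <- function(d) {
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))  # start from the first isolate
  from <- character(0); to <- character(0); weight <- numeric(0)
  while (!all(in_tree)) {
    # All edges leading from the current tree to isolates not yet connected
    candidates <- d[in_tree, !in_tree, drop = FALSE]
    best <- which(candidates == min(candidates), arr.ind = TRUE)[1, ]
    from <- c(from, rownames(candidates)[best[1]])
    to <- c(to, colnames(candidates)[best[2]])
    weight <- c(weight, candidates[best[1], best[2]])
    in_tree[match(colnames(candidates)[best[2]], rownames(d))] <- TRUE
  }
  data.frame(from, to, weight, stringsAsFactors = FALSE)
}

isolates <- c("A", "B", "C", "D")
d <- matrix(c(0, 1, 4, 5,
              1, 0, 2, 6,
              4, 2, 0, 3,
              5, 6, 3, 0),
            nrow = 4, byrow = TRUE, dimnames = list(isolates, isolates))

mst <- prim_mst(d)
mst
##   from to weight
## 1    A  B      1
## 2    B  C      2
## 3    C  D      3
```

The tree connects all four isolates with three edges and a total length of 6; adding any further edge would create a cycle.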
Figure 18 shows the modification panels for MST plots. These are divided
into Layout (see 6.1.1.1 Layout), Nodes (see 6.1.1.2 Nodes) and Edges
(see 6.1.1.3 Edges). There are several options to customize MST graphs,
e.g. colors, forms, sizes, titles, labels, and more. Note that, due to
the way MST plots are generated, the plot is reset to its initial
position whenever one of the modification parameters is changed. MST
graphs can be enriched with information by mapping variables to the
plot.
The Layout control panel lets you add a title, subtitle and footer to
the graph by typing them into the text fields. Change their colors
individually using the color buttons below the text fields. The overall
background color can be modified as well. Toggle the Transparent switch
to make the background transparent.
The Nodes control panel controls the appearance of the nodes and related elements such as the label. The upper left controls relate to the label, which indicates the isolates represented by the respective node. Using the drop-down menu, the label can be changed to any variable present for the respective isolates in the entry table. The color of the node labels can be modified using the color button, and their size by clicking the blue menu button right next to it. The color of the nodes themselves can be changed using the color button in the control panel on the upper right. Clicking the menu button there lets you change the opacity.
Node colors can also be used to map a variable to the graph. Nodes are colored according to the value present for the respective isolates. If several clonal isolates are summarized in a single node, the node is rendered as a pie chart showing the distribution of values. Currently only categorical variables can be used for this feature.
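To illustrate how a pie-chart node can summarize a categorical variable, the following Python sketch computes the share of each value among the isolates of one node. The function name and the example values are hypothetical; how PhyloTrace computes the slices internally is not described here.

```python
from collections import Counter


def pie_fractions(values):
    """Return the share of each categorical value among a node's isolates."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical node containing four clonal isolates and their City values.
node_cities = ["Graz", "Graz", "Vienna", "Graz"]
print(pie_fractions(node_cities))
```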
The node size can be controlled from the bottom left controls. The
size of nodes containing multiple isolates with identical allelic
profiles can be scaled by the number of isolates they contain. Toggle
the Scale by Duplicates switch to activate this feature. The node size
slider then changes from a single value to a range selection. In this
way, the size of the smallest nodes (containing just one isolate), the
size of the nodes containing the most isolates, and the overall range
can be controlled.
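A minimal Python sketch of such a range-based scaling is shown below. It assumes a linear interpolation between the lower and upper end of the selected range; the actual scaling function used by PhyloTrace is not specified in this manual, and the function name and default range are hypothetical.

```python
def node_size(n_isolates, max_isolates, size_range=(20, 40)):
    """Interpolate a node size from the number of isolates in the node.

    Assumption: sizes grow linearly from the lower bound (one isolate)
    to the upper bound (the node containing the most isolates).
    """
    smallest, largest = size_range
    if max_isolates <= 1:
        return float(smallest)
    fraction = (n_isolates - 1) / (max_isolates - 1)
    return smallest + fraction * (largest - smallest)


# Hypothetical nodes containing 1, 3 and 5 identical isolates.
for count in (1, 3, 5):
    print(count, node_size(count, max_isolates=5))
```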
The form of the nodes can be customized using the control panel on
the bottom right. Activate the Show Shadows switch to display shadows
for the nodes. The shape of the nodes can be changed here as well.
Choose between shapes that render the node labels below them (Diamond,
Hexagon, Dot, Square) or inside them (Circle, Box, Text). If a variable
is mapped, the form Pie Chart is locked in and cannot be changed.
The Edges control panel controls the appearance of the edges and
related elements. Each edge is labelled with the allelic distance
between the isolates of the two nodes it connects. Apart from its
appearance, this label currently cannot be changed. Its color and size
can be modified using the upper left controls under Label. The color of
the edges themselves can be controlled by Color in the upper right
controls. Click the menu button to see the control for the transparency
of the edges. On the bottom left, there is the option to scale the edge
lengths by the allelic distance they represent. Toggle the Scale Edge
Length switch to activate this effect. The multiplier of this effect can
be customized using the slider below. Activating this option when the
isolates displayed in the MST graph have very different allelic
distances, e.g. a maximum of 200 and a minimum of 10, can lead to an
untidy plot. Drag the slider to lower values to minimize this issue.
The clustering controls can be found at the bottom right of the Edges panel. By default, the “Complex Type Distance” value given for each scheme available on the cgMLST.org Nomenclature Server is selected as the cluster threshold. The threshold can be changed to any desired value. Nodes whose distances lie within the selected threshold are enclosed by cluster shapes. These are colored differently to distinguish the cluster groups. Choose between the Rainbow and Viridis scales to modify the coloring. There are two types of cluster shapes available: Area and Skeleton. The cluster type Area renders an area surrounding the nodes that are part of a cluster. Skeleton instead uses the edges to visualize clusters. This can be particularly useful if the selection of isolates is complex, which can lead to overlapping clusters with the Area cluster type.
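One way to think about threshold-based clustering is to keep only the tree edges whose distance lies within the threshold and to treat the remaining connected groups of nodes as clusters. The Python sketch below follows this idea with a small union-find structure. It is an assumption about the general technique, not a description of PhyloTrace's code; the function name, the edges and the threshold of 10 are hypothetical.

```python
def clusters_from_edges(n_nodes, edges, threshold):
    """Group nodes connected by edges with distance <= threshold.

    edges is a list of (node_a, node_b, distance) tuples.
    Returns a sorted list of clusters with at least two nodes.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, distance in edges:
        if distance <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(group for group in groups.values() if len(group) > 1)


# Hypothetical MST edges between five nodes.
example_edges = [(0, 1, 3), (1, 2, 4), (2, 3, 15), (3, 4, 8)]
print(clusters_from_edges(5, example_edges, threshold=10))
```

Lowering the threshold splits clusters apart, raising it merges them, which mirrors the effect of changing the threshold value in the panel.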
The Neighbour-Joining (NJ) method constructs a tree by iteratively joining pairs of nodes based on their pairwise distances. It aims to minimize the total branch length in the tree and is commonly used for constructing phylogenetic trees from distance matrices. Refer to 6.4 NJ and UPGMA Modification for information on how the tree appearance can be modified.
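For illustration, a single joining step of the standard NJ algorithm can be sketched in Python as follows. The pair to join is the one minimizing Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). This is the textbook formulation, not necessarily the exact routine PhyloTrace calls; the function name and the five-taxon matrix are hypothetical.

```python
def nj_pick_pair(dist):
    """Select the pair of nodes joined first by Neighbour-Joining.

    Returns (i, j, limb_i, limb_j), where limb_i and limb_j are the
    branch lengths from i and j to their new parent node.
    Requires at least three nodes.
    """
    n = len(dist)
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    limb_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    limb_j = dist[i][j] - limb_i
    return i, j, limb_i, limb_j


# Hypothetical distance matrix for five isolates a to e.
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(example))
```

Note that the pair with the smallest raw distance (d and e here) is not necessarily the pair joined first, because Q also accounts for how far each node is from all others.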
The Unweighted Pair Group Method with Arithmetic Mean (UPGMA) builds a tree by merging the two closest clusters at each step and averaging their distances to all remaining clusters. It produces an ultrametric tree, i.e. all tips have the same distance from the root, and is often used for hierarchical clustering of data. Refer to 6.4 NJ and UPGMA Modification for information on how the tree appearance can be modified.
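The merging and averaging can be sketched in a few lines of Python. The sketch below returns the tree in Newick notation and is meant as an illustration of the general UPGMA procedure only; the function name and the three-isolate matrix are hypothetical, and PhyloTrace's own implementation may differ.

```python
def upgma(names, dist):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns the tree as a Newick string with branch lengths.
    """
    n = len(names)
    # cluster id -> (Newick text, number of tips, height above the tips)
    clusters = {i: (names[i], 1, 0.0) for i in range(n)}
    d = {
        frozenset((i, j)): float(dist[i][j])
        for i in range(n)
        for j in range(i + 1, n)
    }
    next_id = n
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(pair)
        height = d[pair] / 2              # new node sits halfway between them
        (text_a, size_a, height_a) = clusters.pop(a)
        (text_b, size_b, height_b) = clusters.pop(b)
        # Size-weighted average distance to every remaining cluster.
        merged = {
            frozenset((next_id, c)): (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
            for c in clusters
        }
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(merged)
        text = f"({text_a}:{height - height_a:g},{text_b}:{height - height_b:g})"
        clusters[next_id] = (text, size_a + size_b, height)
        next_id += 1
    (text, _, _), = clusters.values()
    return text + ";"


# Hypothetical allelic distances between three isolates.
tree = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
print(tree)
```

In the resulting tree every tip has the same total distance to the root, which is why UPGMA trees always have their tips lined up.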
The tree elements can be customized in great detail and supplemented
with additional information such as variables (see 6.4.4 Variable
Mapping). The basic appearance, e.g. text and element sizes, is
automatically adjusted to the number and properties of the entries
selected for the tree. Because data sets vary, it is sometimes
necessary to readjust some elements manually to obtain a balanced
look. While Minimum-Spanning trees have slightly different modification
features and control inputs, NJ and UPGMA trees share the same control
inputs. This is due to the different visualization technique used for
the creation and display of MST plots. The controls to modify the tree
are arranged in panels and divided into Layout, Label, Elements and
Variables. In some panels you will find small menu buttons (highlighted
in light blue). They let you modify the elements addressed by the
respective panel in more detail (e.g. position or font style).
The appearance of the general layout can be modified in detail. There
is a range of options, e.g. for controlling theme, colors, title and
subtitle, size, legend and other elements. To switch to these controls,
navigate to the Visualization tab and click the Layout button in the
menu to the left of the control panels.
Layout themes change the geometrical appearance of the tree. You can choose from a selection of themes that are categorized into linear and circular layouts. While the visual look changes when switching between a linear and a circular theme, the topology of the hierarchical NJ and UPGMA trees, i.e. the order and arrangement of the branches, stays the same.
Linear: Rectangular, Roundrect, Slanted and Ellipse
Circular: Circular, Inward
Moreover, a Rootedge can be added by turning on the switch. The root
of the tree can be considered the starting point, representing a
theoretical “common ancestor” with an initial allelic profile, from
which all other isolates developed. Besides aesthetics, displaying this
element can help to distinguish “normal” branches, representing actual
allelic distance between the isolates, from the root. The root menu
lets you further modify its length and line type.
The Ladderize switch is turned on by default. It sorts the tree
branches by their length.
The color of lines, text and background can be modified in the Color
panel. The colored buttons show the color currently displayed as well
as the respective HEXA code. Clicking them opens the color menu. You
can select a color either by choosing it directly from the gradient
field or by providing a HEXA or RGBA code. Note that the Lines/Text
color applies to the tree branches, legend text and title, but not to
the tip labels. Their color can be modified in the respective Label
menu (see 6.4.2.1 Tips).
Add title and subtitle in the Title panel. Their color follows the selected Lines/Text color, but can be modified separately. The title menu lets you customize the font size.
The Sizing panel provides control of plot dimensions and position.
For the aspect ratio, you can choose from 16:10, 16:9 and 4:3. The
overall size can be scaled with the slider below. If some elements are
cut off, you can zoom out using the slider at the bottom. Trees with a
circular layout in particular can sometimes appear small, with too much
white space around them. In this case zooming in might be beneficial.
The Sizing menu lets you position the content horizontally and
vertically.
Legend and tree scale controls share the same panel. The tree scale helps to estimate the actual allelic distance represented by the branch length. If you prefer not to show this element, you can hide it by toggling the switch. Its length can be changed in the tree scale menu and scales proportionally with the branch length. If the scale overlaps other elements, adjust its position by dragging the sliders in the menu.
If variables are mapped to the plot, a legend will appear. It can be
oriented either horizontally or vertically (see Figure 22). The legend
menu also lets you adjust its position and size.
The Label menu controls whether and how certain labels are displayed.
There are three different kinds of labels: Tips, Branches and Custom
Labels. They can be modified in many different ways, e.g. in color,
size or position.
The labels at the tips represent the actual entries, whose allelic
profiles determine their position in the tree. By default, the
Assembly Name is displayed as tip label. However, it is possible to
select other basic variables, e.g. Host, Country, City or Isolation
Date, from the drop-down menu, or to hide the tip labels altogether by
toggling the Show switch. Instead of being positioned right next to the
tips, the tip labels can be aligned to the right by activating the
Align switch. UPGMA trees always have the tip labels aligned, and NJ
trees only have this activated by default for circular layouts. The
menu on the right provides further customization options. The Opacity
slider can be used to change the transparency. The Position parameter
modifies the offset of the labels from the tip. Angle, size and font
face can be changed as well. Customize the color of the label text with
the color button and the color of the panel with the color button
below. The panels envelop the tip labels and are not shown by default.
The controls in the panel menu let you modify the size of the panels
(not the text itself) and smooth their form.
Branch labels supplement the tree with additional information by
labelling the branch leading to each tip with a variable connected to
the respective isolate. To show this element, toggle the Show switch in
the Branches panel. The drop-down menu lets you choose which variable
or metadata to annotate. The color of the panel surrounding the branch
label can be changed with the color button below. The menu button
includes further controls, e.g. opacity, size, horizontal and vertical
position, font face and edge smoothing. Note that branch labels do not
work for trees with a circular layout. More complex linear trees with
many isolates also mostly lack the space for branch labels. Instead,
consider mapping a variable to other tree elements such as tip points
(see 6.4.4 Variable Mapping).
If there is a need for labels somewhere other than tips or branches,
you can create custom ones. The Custom Labels panel lets you define
the label. Click the green + button to add it. The label will be
positioned at the center of the plot. Create more labels by giving them
a name and adding them in the same way. To change the size and
position, select the respective label from the drop-down menu and open
the menu next to the + button. Make the desired changes and click the
Apply button for them to take effect. Figure 25 shows a tree with two
highlighted clades (see 6.4.3.5 Clade Highlight). The custom label
function was used to annotate them.
The Elements menu provides control over several special elements
such as tip and node points or a heatmap. These are not essential but
can increase the explanatory power of the tree. Elements can be
activated or deactivated and their appearance can be changed.
Tip points are located at the end of the tree branches and correspond
to the isolates displayed. They can be modified in color or size to
make the ends of the tree more prominent. Alternatively, this element
can be used to map a variable (see 6.4.4 Variable Mapping).
Node points, in contrast to tip points, solely represent theoretical predecessors and relatives of the isolates and their allelic profiles. Except for the option to map a variable, their look can be customized in the same way as that of tip points. Mapping variables is not possible because node points connect several isolates, which may have differing values for a chosen variable.
Tiles are supplementary elements that can be used to map variables to
the plot. They work with both circular and linear layouts. Up to five
different tiles can be added by activating them in the Variables menu
(see 6.4.4 Variable Mapping). To modify opacity, width or position,
select the tile you wish to change with the selector at the top left
corner of the panel. Any modification applies only to the selected
tile. Opacity defines the transparency of the tile, which makes it
possible to overlay it, e.g. over the tree. The width slider controls
the width. Changing the position moves the tiles horizontally in linear
layouts, and inwards or outwards in relation to the center of the
circle in circular layouts.
Heatmaps can be a powerful tool to visualize related variables of the
same type (either categorical or continuous). For more details refer to
6.4.4 Variable Mapping. If the heatmap is activated in the Variables
menu, it can be modified using the respective control panel in the
Elements menu. Width changes apply to the heatmap overall, not to
single columns. Just as with tiles, the position control moves the
heatmap horizontally in linear layouts and inwards or outwards in
circular layouts. In some situations, e.g. for long variable names or
in circular layouts, it might make sense to modify the angle and/or
position of the column headers. This can be done using the controls in
the heatmap menu.
Isolates are grouped in distinct hierarchical clades, which are
defined by nodes that comprise several isolates or other daughter nodes
and their respective isolates. To emphasize one or several clades,
toggle the Node View switch and note the node index of each clade you
wish to highlight. Select the nodes in the drop-down menu below and
deactivate the Node View again to see the highlighted clades. If only
one clade is highlighted, its color can be customized with the color
button below. If more than one clade is selected, you choose from a
color scale instead. The menu in the Clade Highlight control panel
controls the alignment of the clade highlights to each other. The
borders of the colored squares can be set to a round or rectangular
appearance. Clades located within another clade that is higher in the
hierarchy can also be highlighted (see Figure 31).
Mapping variables, representing epidemiologic metadata or other
properties of the isolates displayed, is a powerful way of enriching the
plot with information. The Variables menu provides full control over
which variables are mapped, the elements they are mapped to and the
color scale that represents the different values of the selected
variable. The control panel is ordered into Element, Variable and Color
Scale columns (see Figure 32). The switches in the Element column can
be turned on or off to activate or deactivate the display of a variable
with the respective element. Select the variable to be mapped from the
drop-down menu to the right of the element switch. It contains the
basic metadata (Isolation Date, Host, City, Country) as well as the
manually added custom variables (see 4.1.2 Custom Variables). The
currently selected variable is checked for its number of distinct
values and its variable type (categorical or continuous). As this
information is relevant for selecting the color scale, it is displayed
directly next to the color scale selection menus. For categorical
variables, the selectable color scales change automatically depending
on the number of distinct values. If the number of distinct values is 7
or fewer, you can select from qualitative color scales. As the
qualitative color scales contain only a limited number of distinct
colors, they are not selectable if the variable exceeds 7 distinct
values. Instead, gradient color scales can be selected. Continuous
variables have continuous and divergent color scales available. Using
the colorblind-friendly gradients Viridis and Cividis is recommended.
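The selection rule described above can be summarized as a small decision function. The Python sketch below follows a literal reading of the text; whether gradient scales also remain selectable for categorical variables with 7 or fewer distinct values is not stated here, so that detail is an assumption, and the function name is hypothetical.

```python
def selectable_scale_families(var_type, n_distinct, max_qualitative=7):
    """Return the color scale families offered for a mapped variable.

    Assumption: categorical variables get qualitative scales up to
    max_qualitative distinct values and gradient scales beyond that.
    """
    if var_type == "continuous":
        return ("continuous", "divergent")
    if var_type == "categorical":
        if n_distinct <= max_qualitative:
            return ("qualitative",)
        return ("gradient",)
    raise ValueError(f"unknown variable type: {var_type!r}")


print(selectable_scale_families("categorical", 2))    # e.g. City with two values
print(selectable_scale_families("categorical", 12))
print(selectable_scale_families("continuous", 40))
```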
Divergent color scales are useful for visualizing data with a clear
central point of interest, highlighting positive and negative
deviations from a central value such as 0. An example use case is gene
expression data, e.g. fold change values, with colors indicating
whether the change is positive (upregulation) or negative
(downregulation) relative to a baseline expression level of 0 (no
change).
In Figure 33 the Isolation Date variable is mapped to the tip label
color (see 6.4.2.1 Tips). Hence the tip labels indicate both the
Assembly Name and the Isolation Date, with the Greys color scale
highlighting more recently added isolates in darker shades. The tip
point color displays the categorical City variable, i.e. the city in
which the sample was acquired, which in this example has only two
values, Graz and Vienna. The qualitative scale Set2 is chosen to
distinguish this variable as well as possible from the other variables.
The tip point shapes circle and triangle represent the host from which
the bacterial sample was taken. As the variable values are represented
by shapes instead of colors, there is no color scale for this option.
Continuous values cannot be represented by shapes. There are six
different shapes available, hence the tip point shape can only
represent categorical variables with 6 or fewer distinct values. The
custom variables Patient Age and ftsA, which stands for expression
values of the ftsA gene, are mapped to Tile 1 and 2 respectively.
Except for the color values, which are assigned by the variable
mapping, the appearance of the elements, such as tip point sizes, can
still be modified (e.g. 6.4.3.1 Tip Points).
Figure 35 shows an example of gene expression fold changes mapped to a
heatmap. While white/yellow colors indicate baseline expression levels
around 0, green colors indicate upregulation and red colors
downregulation. When a diverging scale is selected, you can choose the
midpoint of the scale (Zero, Mean or Median) using the drop-down menu
that appears to the right of the color scale selector. Zero assigns the
middle color of the diverging color scale to the value 0. The choices
Mean and Median assign the middle color to the arithmetic mean and the
median of the respective value range. The appearance of the heatmap,
such as width and position, can be modified using the respective
control panel in the Elements menu (see 6.4.3.4 Heatmap).
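The effect of the midpoint choice can be illustrated with a short Python sketch that maps each value to a position between 0 and 1 on a diverging scale, where 0.5 corresponds to the middle color. The symmetric scaling around the midpoint is an assumption made for this example, and the function names and fold change values are hypothetical.

```python
from statistics import mean, median


def scale_midpoint(values, mode):
    """Return the value that receives the middle color of the scale."""
    if mode == "Zero":
        return 0.0
    if mode == "Mean":
        return mean(values)
    if mode == "Median":
        return median(values)
    raise ValueError(f"unknown midpoint mode: {mode!r}")


def diverging_position(value, values, mode="Zero"):
    """Map a value to [0, 1] so that the midpoint lands exactly on 0.5.

    Assumption: the scale extends equally far on both sides of the
    midpoint, as far as the largest deviation found in the data.
    """
    mid = scale_midpoint(values, mode)
    half_width = max(abs(v - mid) for v in values)
    if half_width == 0:
        return 0.5
    return 0.5 + 0.5 * (value - mid) / half_width


# Hypothetical fold change values for one heatmap column.
fold_changes = [-2, 0, 1, 4]
for mode in ("Zero", "Mean", "Median"):
    positions = [diverging_position(v, fold_changes, mode) for v in fold_changes]
    print(mode, scale_midpoint(fold_changes, mode), positions)
```

With Zero, unchanged expression always receives the middle color; with Mean or Median, the middle color shifts to the center of the data, which can be useful when all values lie on one side of 0.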
Neighbour-Joining and UPGMA trees can be downloaded in PNG, JPEG, BMP
and SVG format. Minimum-Spanning trees can be downloaded in PNG, JPEG
and BMP format. In addition, they can be downloaded as HTML to preserve
the interactivity of dragging, zooming and moving the MST graph. To
initiate the download, head to the Visualization tab. In the sidebar,
below the Create Tree button, you will find the drop-down menu to
select the file type and the download button right next to it. Note: In
order for the download to work, the plot has to be created first.
A report in HTML format can be created by clicking the Print Report
button, located in the sidebar of the Visualization tab. There are
several options to control which information is included in the report.
The elements are categorized in Entry Table, General, Analysis and
Attach Plot (see 7.2.1 Report Elements). Note that the report requires
prior creation of a tree plot. The entry table in the report, the
attached plot and some analysis parameters, such as the tree algorithm,
are all fixed at the moment a tree is created. Therefore a proper
report can only be generated after tree creation. For the entry table,
instead of the entire local database for the respective scheme, only
the isolates of interest, i.e. the ones that have been used to generate
the currently displayed tree, are listed in the report. The download
will be saved to the location set in your browser download settings.
The sub-elements belonging to General are Date, Operator, Institute
and Comment. If you wish to include only a selection of these elements,
tick or untick them accordingly. Unticking the General element
deactivates the display of all its sub-elements as well.
Ticking Isolate Table prints the entry table, comprising the isolate
names as well as the selected metadata columns, in the report. Note
that only entries marked as Included in the database (Browse Entries
tab) are printed. Hence only isolates that are shown in the current
tree are included.
The sub-elements belonging to the Analysis parameters are Scheme,
Tree, Distance, NA Handling and Version. These parameters are
automatically derived from the session and the created tree, and can
only be shown or hidden. As with the General parameters, unticking
Analysis Parameter hides all sub-elements.