User Manual

PhyloTrace Version 1.5.0

Github: https://github.com/infinity-a11y/PhyloTrace

PhyloTrace is a platform for bacterial pathogen monitoring on a genomic level. Its components revolve around Core-Genome Multilocus Sequence Typing (cgMLST) and Antimicrobial Resistance Screening. Complex analyses and computation are wrapped into an appealing and easy-to-handle graphical user interface. Users build a local database of analyzed isolates that can be managed directly within the application. The visualization of isolate relationships and genetic profiles is highly interactive and helps to reveal patterns that explain outbreak dynamics and events by connecting genomic information with epidemiologic variables. PhyloTrace achieves universal compatibility by assigning unique hashes based on sequence and allele information. This implementation enables efficient comparison and sharing of results between labs.

PhyloTrace is intended for research and academic purposes only.

1 Launch Application

Install the application by following the steps described in the README document on GitHub. Launch PhyloTrace from the applications menu of your system. The app runs in the system’s default browser. PhyloTrace is optimized for Chrome, Chromium, Brave, Opera and Vivaldi. Avoid using Firefox, as some elements are distorted or not visible at all.

2 Loading Database

PhyloTrace does not require a local database, but encourages you to build one and to iteratively add new bacterial isolates together with their respective allelic profiles and meta data. Upon first launch, either load an existing database or create a new one.

Figure 1: This screen appears upon app launch, prompting you to choose between loading or creating a database. Here an existing database was selected. It also contains invalid elements.

2.1 Creating New Database

To start completely from scratch with no previously built database available, select + Create New on the start screen (Figure 1) and choose a path where the database should be built. A folder named Database will be created in that location. Make sure to select a location with read and write permissions. Since no entries have been added and no schemes downloaded yet, the database is empty and you are directed to the > Manage Scheme tab immediately after clicking on Load. The drop-down menu lists all bacterial species that are available on the cgMLST.org Nomenclature Server (h25). Selecting a species displays information about the scheme, such as the seed genome or the curators. Pick the species you want to work with and press Download. You can now proceed to type the first assemblies belonging to that bacterial species (see 3 Allelic Typing).

Figure 2: Downloading schemes from the cgMLST.org Nomenclature Server (h25).

2.2 Loading Existing Database

If you or your working group / institution have used PhyloTrace before, the respective database folder might already be saved on the internal file system. Click Browse on the start screen (Figure 1) and select the path of the database folder. PhyloTrace automatically recognizes whether the selected folder contains compatible data.

The database is structured by folders for each bacterial species you have worked with (see Figure 3). Therefore, when loading a local database, select which species you want to work with in this session. For example, if the database contains entries typed with Bordetella pertussis, Burkholderia pseudomallei and Klebsiella pneumoniae schemes, you can choose one of them. Proceed by clicking on the Load button. The database section containing data for the selected species will load.

Figure 3: Database structured by species. The species folder holds the scheme data, comprising a folder with assembly files and gene screening results of included isolates (Isolates), allele variants for each locus (e.g. Bordetella_pertussis_alleles), scheme meta information (scheme_info.html and targets.csv) and the saved allelic profiles together with epidemiologic variables (Typing.rds). Note that Typing.rds and the Isolates directory are not present if no isolates have been added yet.

If the existing database doesn’t include the species you want to work with, pick any available species and load the database. Then head over to the > Manage Scheme tab and select your desired bacterial species from the list. Download the scheme files comprising gene variants and scheme info by clicking Download. After the download is complete, you are prompted to load the database again (see Figure 4). Select the species that was just downloaded and confirm. Proceed to start the first typing process for this species (see 3 Allelic Typing).

Figure 4: Download cgMLST schemes from the cgMLST.org publicly accessible database. Once the download is successful, choose the just downloaded scheme from the appearing dialog window.

2.3 Switching Database

The currently loaded species/scheme is displayed at the top of the sidebar below the PhyloTrace logo. If more than one scheme is available in the current database directory, the scheme can be changed within the same session. To switch, click the button next to the displayed scheme and choose the new one. After confirmation, the database is loaded with the newly selected scheme. If you would like to switch to a scheme in a database located in a different directory, restart the app and select the path of that database folder.

3 Allelic Typing

The typing process is the fundamental step that generates the data (i.e. the allelic profile) for the genomic comparison. The method applied is based on core-genome multilocus sequence typing (cgMLST). An allelic profile is generated for selected bacterial isolates. The allelic profile specifies which allele variant is present for each gene in the cgMLST scheme. If the process was successful, the results, i.e. the allelic profile of the respective isolate as well as epidemiologic meta data, are added as an entry to the local database (see 4 Database Browser). By repeating this process with further isolates, a foundation for a library of bacterial isolates is created. Technically there is no limit to the number of entries in the database, although performance might be reduced if there are several hundred entries in the currently loaded scheme (depending on system capacity). The variant calling and alignment steps of the typing process are facilitated by BLAT (BLAST-like Alignment Tool) for whole genome assemblies and the KMA (k-mer alignment) algorithm for raw reads ^1,2. Allelic typing for raw reads will be available soon.

3.1 Single Typing

In the sidebar of the > Allelic Typing tab select ☑ Single | ☐ Multi (see Figure 5). Clicking on Browse opens a window in which an assembly file from the local system can be selected. Any of the commonly used FASTA file extensions (.fasta, .fna or .fa) is accepted. Selecting an incompatible file type prevents the typing process from starting. Make sure that the assembly file contains sequence data of a bacterial species that matches the selected scheme. Afterwards the basic meta data (i.e. Assembly ID, Assembly Name, Isolation Date, Host, Country, City) can be declared. Filling out every field is not mandatory if you don’t wish to or don’t have the respective information. Note that the Assembly ID has to be unique; proceeding is not possible if the same ID is already present in the local database. Except for the Assembly ID, these isolate variables can still be changed afterwards in the >> Browse Entries tab. Clicking on Confirm saves the meta data and makes the process executable.
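The two input checks described above (accepted file extension and unique Assembly ID) can be sketched in R. This is only an illustration, not code taken from PhyloTrace: the helper names `is_fasta_file` and `is_unique_id` are hypothetical, and treating the extension as case-insensitive is an assumption.

```r
# Hypothetical helpers for illustration; not part of the PhyloTrace source.

# TRUE if the file name ends in one of the accepted FASTA extensions.
# Case-insensitive matching is an assumption.
is_fasta_file <- function(path) {
  grepl("\\.(fasta|fna|fa)$", path, ignore.case = TRUE)
}

# TRUE if the proposed Assembly ID is not yet used in the local database.
is_unique_id <- function(id, existing_ids) {
  !(id %in% existing_ids)
}

is_fasta_file(c("isolate_01.fasta", "isolate_02.fna", "reads.fastq"))
# [1]  TRUE  TRUE FALSE

is_unique_id("isolate_01", existing_ids = c("isolate_01", "isolate_07"))
# [1] FALSE
```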

Figure 5: Single typing interface with declared meta data, ready to start the typing process.

Before starting the process, select whether to save the assembly file to the local database. If an assembly file is not saved, screening for resistance and virulence genes will not be available later for the respective isolate (see 5 AMR Screening). The assembly file cannot be added in retrospect. Pressing Start launches the typing process. The alignment algorithm then searches the selected assembly for the loci contained in the scheme and checks which variant is present. The loading bar provides feedback on the progress. The duration depends on the capability of your system and on the number of loci and variants included in the scheme, and the process can take a while. Once 100% is reached, the typing results are evaluated and appended to the local database. Database changes in the >> Browse Entries tab are automatically blocked during this finalization step to avoid issues. After this last step is finished, you can reset the interface to start another typing process. If the typing was successful, the addition of a new entry is indicated by a pulsating button in the >> Browse Entries tab. Click this button to load the updated database including the newly added entry.

3.2 Multi Typing

Multi typing is recommended for larger collections of assemblies belonging to the same species, as it saves the time needed to start each process individually. In the sidebar of the > Allelic Typing tab switch to ☐ Single | ☑ Multi and click on Browse to select a folder containing the assemblies. If you plan to type just a subset of the selected folder, untick the unwanted assemblies in the table below and choose a compatible Assembly ID. The multi typing process can only be started if no incompatible files are ticked.

Because all files are seamlessly piped into the process, the basic meta data can only be declared once for all assemblies. The values declared for Isolation Date, Host, Country and City apply to every new entry produced in this multi typing process. The file name of the respective assembly is automatically assigned to both the Assembly ID and the Assembly Name, so the two are initially identical and serve as unique identifiers of the assembly. However, all basic meta data values except the Assembly ID can be changed in retrospect once the entry has been successfully added to the database. After confirming the meta data, the Start button is rendered. Note that if you choose not to save the assembly files to the local database, screening for resistance and virulence genes will not be available later for the respective isolates (see 5 AMR Screening). The assembly files also cannot be added in retrospect.

Upon starting the multi typing process, a field is displayed in which the progress is logged, so the process can be monitored. The log can be downloaded as a text file. Notifications providing feedback about the status of the multi typing process show up for every relevant event, such as the (un-)successful addition of an entry or the finalization of the multi typing process. A pending typing process can be canceled by clicking Terminate.

While the process is in the typing or alignment phase (indicated by Processing in the log), you can keep working with PhyloTrace, e.g. visualizing or editing the local database. However, just as for single typing, the app automatically recognizes when the process switches to the evaluation and addition phase (indicated by Attaching in the log), during which any database changes are prohibited. After each successful addition you can reload the database in the >> Browse Entries tab to inspect the new entry. Unsuccessful typing attempts are captured in the log and in the multi typing summary once the process has been finalized (see Figure 6). Individual results can be inspected by choosing them from the selector in the right column. Only notable events are displayed, e.g. the discovery of a new allele variant or unsuccessful allele calling attempts. Press Reset to start another multi typing process.

Figure 6: Finalized multi typing feedback showing successes and failures.

3.3 Variant Assignment

After each variant from the cgMLST scheme has been searched and aligned to the assembly, the results are evaluated to determine which allele variant is present for each locus. This is conducted by a conditional multi-step process that ensures correctness and minimizes false positive assignments. The steps and the logic applied in this process are shown in Figure 7. If none of the variants from the scheme could be found in the bacterial isolate, the presence of a potential new gene variant is evaluated (see 3.3.1 New Variant Validation).

Figure 7: Overview of the steps for allelic typing. Four different outcomes are possible: If not enough loci could be located, the process is aborted. If a variant from the local scheme is perfectly matching, it is determined to be present for the respective locus. If none of the variants from the local scheme are perfectly matching, the locus from the assembly is checked for the presence of a new and valid variant.

3.3.1 New Variant Validation

In case none of the variants from the locally available scheme match perfectly, the locus is checked for the existence of a new and valid variant. To ascertain whether this variant is valid, the locus must fulfill conditions such that it is likely to encode a gene. If there are multiple different nucleotide regions in the assembly possibly coding for a gene, each of them is sorted and passed through the validation logic (see Figure 8).

Figure 8: Variant validation process to verify the presence of a coding gene. The main conditions examined are the presence of frameshifts, start codons and (internal) stop codons, as well as the total length, which should not exceed that of the reference variants for the respective locus by more than 9 nucleotides.
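The conditions named in the caption of Figure 8 can be illustrated with a short R sketch. It is not PhyloTrace's implementation; the authoritative logic is the one shown in Figure 8. The function name `is_valid_variant`, the set of start codons, the requirement of a terminal stop codon and the comparison against the longest reference variant are all assumptions made for this illustration.

```r
# Hypothetical validation sketch; the actual decision logic is defined in Figure 8.
is_valid_variant <- function(dna, ref_lengths, max_extra = 9) {
  dna <- toupper(dna)
  n <- nchar(dna)

  # A length that is not a multiple of three indicates a frameshift.
  if (n < 6 || n %% 3 != 0) return(FALSE)

  # Assumption: compare against the longest reference variant of the locus.
  if (n > max(ref_lengths) + max_extra) return(FALSE)

  # Split the sequence into codons.
  starts <- seq(1, n, by = 3)
  codons <- substring(dna, starts, starts + 2)
  stops  <- c("TAA", "TAG", "TGA")

  # Assumption: common bacterial start codons and a terminal stop codon.
  has_start      <- codons[1] %in% c("ATG", "GTG", "TTG")
  ends_with_stop <- codons[length(codons)] %in% stops
  internal_stop  <- any(codons[-length(codons)] %in% stops)

  has_start && ends_with_stop && !internal_stop
}

is_valid_variant("ATGAAATAA", ref_lengths = 9)     # start, no internal stop
is_valid_variant("ATGTAAAAATAA", ref_lengths = 12) # internal stop codon
is_valid_variant("ATGAAATA", ref_lengths = 9)      # frameshift (8 nt)
```

Under these assumptions the first call returns TRUE and the other two return FALSE.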

3.4 Calculation of Allelic Distance

Unlike the genetic distance between a pair of sequences, which sums up the number of positions at which the nucleotides differ, the allelic distance considers entire loci/alleles. To obtain the allelic distance, algorithms based on the distance calculation method introduced by Hamming in 1950, originally meant for information technology, are used³. The Hamming distance is a metric that quantifies the discrepancy between two strings of equal length. It counts the number of positions at which the characters differ between the two strings. Essentially, it indicates the minimum number of substitutions required to transform one string into the other. For cgMLST with PhyloTrace, hashes, i.e. 64-bit words, organized in an array represent the allelic profile. The positions of the array elements correspond to the loci in the scheme, and each hash represents the allele sequence for the respective locus. This allelic profile is generated during the typing process. Thus, for the pairwise comparison of the allelic profiles of two isolates, the total number of discrepant alleles is the allelic distance. Comparing a selection of isolates results in a distance matrix (see 4.4 Distance Matrix), which is then used to compute a tree (see 6 Visualization).
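As a minimal sketch of this idea in R, the following example compares two toy allelic profiles locus by locus. The four-character strings are shortened placeholders rather than real hashes, and the function name `allelic_distance` is chosen for illustration only.

```r
# Toy allelic profiles: one placeholder hash per locus (not real hashes).
isolate_a <- c(locus1 = "a3f9", locus2 = "77b2", locus3 = "c01d", locus4 = "e5e5")
isolate_b <- c(locus1 = "a3f9", locus2 = "90aa", locus3 = "c01d", locus4 = "1b4c")

# Allelic (Hamming) distance: number of loci at which the hashes differ.
allelic_distance <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(x != y)
}

allelic_distance(isolate_a, isolate_b)
# [1] 2
```

The two profiles differ at locus2 and locus4, so the allelic distance is 2 regardless of how many nucleotides differ within those alleles.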

3.4.1 Missing Value Handling

If no variant could be assigned for some genes contained in the scheme, NA values are placed in the allelic profile at the respective position of the gene/locus. This can happen if the corresponding gene is not found in the assembly sequence, if there are multiple hits, or if the variant in the assembly is non-coding (refer to 3.3 Variant Assignment).

To showcase how allelic distances are calculated for isolates with missing values, we set up an example. For simplicity we consider just three isolates, Isolate 1, Isolate 2 and Isolate 3, with only three loci, Locus A, Locus B and Locus C. For Isolate 1 let Locus A have variant 1, Locus B a missing value NA and Locus C variant 1. For Isolate 2 let Locus A be a missing value NA, Locus B variant 1 and Locus C variant 1. For Isolate 3 let Locus A be 2, Locus B also 2 and Locus C 1.

allelic_profile <- data.frame(A = c(1, NA, 2), B = c(NA, 1, 2), C = c(1, 1, 1), 
                              row.names = c("Isolate 1", "Isolate 2", "Isolate 3"))
allelic_profile

##            A  B C
## Isolate 1  1 NA 1
## Isolate 2 NA  1 1
## Isolate 3  2  2 1

Option 1: Ignore missing values for pairwise comparison

Selecting the first option as the missing value handling strategy causes NAs to be ignored in the pairwise comparison between two isolates. Unlike Option 2, only the individual missing values are ignored, not the entire locus.

# Option 1

hamming.distIgnore <- function(x, y) {
    sum( (x != y) & !is.na(x) & !is.na(y) )
}

proxy::dist(allelic_profile, method = hamming.distIgnore)

##           Isolate 1 Isolate 2
## Isolate 2         0          
## Isolate 3         1         1

Isolates 1 and 2 each have an NA at one of the first two loci, A and B, while the third locus, C, is identical. Their allelic distance is 0, hence these two isolates are considered identical in their allelic profile. The two other pairs, Isolate 1 & 3 and Isolate 2 & 3, both result in an allelic distance of 1.

Option 2: Omit loci with missing values for all assemblies

If the second option is selected, loci containing at least one missing value are ignored in the calculation of allelic distances. Unlike Option 1, the loci with missing values are omitted entirely for all pairwise comparisons. Even if both isolates of a pair have valid variant numbers for a locus, the locus is not included in the analysis if it contains an NA for just one other isolate. For the missing value statistics shown in Figure 14 (see 4.5 Missing Values), all loci displayed as columns in the missing value table would not be considered for the distance calculation. For this option the respective loci are filtered out of the allelic profile before the distance computation is applied. Because this option has the potential to skew the whole picture, choosing it is only recommended if very few loci are affected by missing values.

# Option 2

hamming.distOmit <- function(x, y) {
    sum(x != y)
}

# Drop loci A and B, which contain NA values (select() comes from dplyr)
allelic_profile_noNA <- dplyr::select(allelic_profile, -A, -B)

proxy::dist(allelic_profile_noNA, method = hamming.distOmit)

##           Isolate 1 Isolate 2
## Isolate 2         0          
## Isolate 3         0         0

Loci A and B are omitted before the distance is calculated. As a result, all isolates are considered identical with an allelic distance of 0, because they all carry variant 1 for the only remaining locus, C.
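For larger schemes the affected loci would not be removed by name. A minimal base R sketch, using the same toy data as above, identifies and drops every locus that contains at least one NA. This is an illustration of the filtering step and not necessarily how PhyloTrace implements it.

```r
# Same toy data as in the example above.
allelic_profile <- data.frame(A = c(1, NA, 2), B = c(NA, 1, 2), C = c(1, 1, 1),
                              row.names = c("Isolate 1", "Isolate 2", "Isolate 3"))

# Keep only loci (columns) without any missing value.
complete_loci <- colSums(is.na(allelic_profile)) == 0
allelic_profile_noNA <- allelic_profile[, complete_loci, drop = FALSE]

names(allelic_profile_noNA)
# [1] "C"
```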

Option 3: Treat missing values as allele variant

The third option is rather specific and, considering the consequences for subsequent calculation of allelic distances and analyses, should be used with caution. Here, NA values are treated as if they were a separate variant.

# Option 3

hamming.distCategory <- function(x, y) {
    sum((x != y | xor(is.na(x), is.na(y))) & !(is.na(x) & is.na(y)))
}

proxy::dist(allelic_profile, method = hamming.distCategory)

##           Isolate 1 Isolate 2
## Isolate 2         2          
## Isolate 3         2         2

Because NA is considered a further valid variant, all isolate pairs receive an allelic distance of 2.

Depending on the NA handling option applied to these allelic profiles, the resulting allelic distances differ. The results of the example calculations are summarized in the table below.

Example calculation of allelic distance for all isolate pairs using the three different options. Option 1: *Ignore missing values for pairwise comparison*, Option 2: *Omit loci with missing values for all assemblies*, Option 3: *Treat missing values as allele variant*.
	Resulting Allelic Distance
Pair	Option 1	Option 2	Option 3
Isolate 1 & 2	0	0	2
Isolate 1 & 3	1	0	2
Isolate 2 & 3	1	0	2

4 Database Browser

The > Database Browser tab allows you to examine and manage information saved in the local database of the selected scheme. It is divided into the >> Browse Entries, >> Scheme Info, >> Loci Info, >> Distance Matrix and >> Missing Values tabs.

4.1 Browse Entries

Each assembly that has been successfully typed is added to the table in >> Browse Entries. This overview allows you to edit (see 4.1.1 Edit Meta Data), delete (see 4.1.3 Delete Entries), inspect (see 4.1.4 Browse the Allelic Profile) and add (see 4.1.2 Custom Variables) information connected with the entries. The table can also be downloaded (see 4.1.5 Download Entry Table). The table contains both the meta data and the allelic profile for each entry. The meta data and the custom variables (see 4.1.2 Custom Variables) appear first, in the left part of the table, while the allelic profile with the assigned variants is positioned in the right part (see 4.1.4 Browse the Allelic Profile). The Index automatically assigns a number to each entry and is updated if entries are deleted (see 4.1.3 Delete Entries). The Include status determines whether the respective entry is included in or excluded from further analyses, such as visualization (see 6 Visualization).

4.1.1 Edit Meta Data

The basic meta data comprising Assembly Name, Isolation Date, Host, Country and City can be edited in the entry table by left-clicking the corresponding field. As soon as changes are detected, a pulsating button appears that saves the changes on click. If you decide otherwise, press the Undo button to go back to the previous state. The Assembly ID is the name of the isolate in the Isolates directory of the local database and can’t be changed. The Index number and the assigned hashes representing the allele variants in the allelic profile can’t be edited either, because this would invalidate the analysis.

Figure 9: The >> Browse Entries overview with several options and functions to manage the local database.

4.1.2 Custom Variables

There is also the option to add custom variables using the controls in the >> Browse Entries sidebar. Choose a name for the variable and press the green + button to add it. In the dialogue window select the variable type, categorical (character) or continuous (numeric). After confirmation the variable is ready to be filled with values. These can be changed in retrospect in the same way as basic meta data (see 4.1.1 Edit Meta Data). Note that the database needs to be saved, otherwise the custom variables are not added permanently. The type and name of a custom variable can’t be changed in retrospect, but the variable can be deleted by selecting it from the drop-down menu in the sidebar and clicking the red - button. If more than five custom variables are present, a table summarizing them is displayed in the sidebar.

4.1.3 Delete Entries

The Delete Entries panel in the top right corner of the >> Browse Entries tab allows you to delete single or multiple entries at once. Select one or multiple entries to be deleted according to their Index in the drop-down menu. Clicking the red x button opens a dialogue window that asks you to confirm the deletion of the selection. The deletion removes the respective entries completely, together with all their meta data, custom variable values and allelic profiles. However, if the database is not saved after the deletion, the entries will appear again in the next session, and within the same session the deletion can be undone with the Undo button. Note that if you select all entries for deletion, confirmation will immediately and irreversibly empty the database for the currently selected scheme, and you will not have the option to undo this action.

4.1.4 Browse the Allelic Profile

Scrolling the entry table to the right reveals the allelic profile. The variants for each allele/locus are arranged column-wise for each entry. By default, only the first 20 loci are displayed. It is possible to change manually which loci are shown by selecting or deselecting them in the Compare Loci panel on the right, below the Delete Entries panel. The assigned hash, which represents a distinct allele sequence, is truncated to its first and last four characters. Locus columns containing at least one entry with an allele variant that differs from the others are highlighted in green.

Figure 10: Comparing the allelic profile in the entry table.

If the Only Varying Loci option is activated, only loci with differing variants (i.e. the highlighted columns) are displayed. For missing variant values, i.e. if no variant could be allocated to a locus (see 3.4.1 Missing Value Handling), the corresponding cell appears empty.

4.1.5 Download Entry Table

The entry table can be downloaded as a CSV file. There are two options to control this output. As you might sometimes want to include only a subset of entries in the current analysis, there is the option to include only the entries of interest in the output file. Activate the switch Only included Entries to include only the entries that are checkmarked in the Include column. Control the Include status either by checking or unchecking the checkboxes in the Include column, or select or unselect all entries at once by using the buttons at the top left of the entry table. Note that the database has to be saved for the changes to take effect. The Index of each entry marked as included is highlighted in green, and only these entries are considered in visualization (see 6 Visualization). Moreover, you can choose whether and which loci should be included in the download. By default only the meta data and custom variables of the entries are included in the CSV file. If you activate the switch Include Displayed Loci, the currently displayed loci are included as well. Use the control in the Compare Loci box to decide which and how many loci are displayed. Upon clicking the Download button you can choose the location on your system where the file should be saved.

4.2 Scheme Info

The tab >> Scheme Info allows you to inspect the properties of the currently selected scheme. The table displays information regarding the cgMLST scheme downloaded from the cgMLST.org Nomenclature Server (h25). It comprises the name of the scheme, the version, the seed genome, genus and species, the number of loci included, the complex type distance and count parameters, the date of the most recent changes, the official curators, publications addressing this scheme as well as the accessory scheme.

Figure 11: >> Scheme Info tab showing information about a Bordetella pertussis scheme from cgMLST.org.

4.3 Loci Info

The overview in the tab >> Loci Info provides information on the loci included in the scheme as well as the distribution of alleles among isolates present in the local database. The table allows you to browse the Locus ID (e.g. BP0001, BP2483), the gene identifier if known (e.g. glpK, pykA), the position of the locus in the seed genome, the length in nucleotides (e.g. 1233), the gene product (e.g. pyruvate kinase, chromosome partitioning protein) as well as the number of variants included in the base scheme. There is the option to filter the table by keywords or numbers. Note that the filter applies to all attributes, so searching for “566” would display loci whose ID includes this number (e.g. “BP0566”, “BP1566”), as well as loci whose position (e.g. 317566, 1255669), length (e.g. 1566) or any other attribute contains the search term.
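The behaviour of such a filter across all attributes can be sketched in R as follows. The table contents are invented for illustration, the function name `filter_all_columns` is hypothetical, and plain substring matching is an assumption about how the search works.

```r
# Hypothetical loci table; all values are illustrative only.
loci <- data.frame(
  locus    = c("BP0001", "BP0566", "BP1200", "BP2483"),
  position = c(61234, 317566, 1255669, 2600010),
  length   = c(1233, 900, 750, 1566),
  stringsAsFactors = FALSE
)

# Keep rows in which at least one attribute contains the search term.
filter_all_columns <- function(tbl, term) {
  hits_per_column <- lapply(tbl, function(col) {
    grepl(term, as.character(col), fixed = TRUE)
  })
  tbl[Reduce(`|`, hits_per_column), , drop = FALSE]
}

filter_all_columns(loci, "566")$locus
# [1] "BP0566" "BP1200" "BP2483"
```

Here BP0566 matches through its ID and position, BP1200 through its position (1255669) and BP2483 through its length (1566).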

Figure 12: >> Loci Info tab showing information on the selected locus BP0008 with allele sequences and frequency.

Selecting a locus from the table will render alleles present in the database and their respective DNA sequence. Browse alleles by choosing them from the selector showing the respective frequency of the selected allele in the database. The sequence can be copied to the clipboard. A FASTA file comprising all hashed allele sequences from the currently selected locus can be exported with Save FASTA. To export the table with metadata of all loci included in the scheme, click the download button right next to the header Loci at the top.

4.4 Distance Matrix

The tab >> Distance Matrix shows a heatmap matrix of the allelic distances between the entries. For details on how the allelic distances are derived, refer to 3.4 Calculation of Allelic Distance. For each pair of entries, the number of allele variants that are not identical, i.e. the allelic distance, is displayed in the respective cell. The choice of how missing values (i.e. unsuccessful variant allocations for some loci) are handled can have a small or a large impact on these values, depending on different parameters (see 3.4.1 Missing Value Handling). In addition to the visualization with tree plots, the effect of changing the missing value handling strategy can be observed directly in this overview. The readability of the matrix is enhanced by a heatmap. The values are normalized, resulting in a color gradient from light green to dark red. The lowest value, which is always 0 on the diagonal (the allelic distance of an entry to itself is zero), is highlighted in light green. The highest value (dark red) depends on the highest allelic distance in the matrix.
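How normalized distances can be mapped to such a gradient is sketched below in base R. Min-max normalization and the color names "lightgreen" and "darkred" are assumptions made for this illustration; the exact scaling and colors used by PhyloTrace may differ.

```r
# Toy distance matrix (illustrative values).
d <- matrix(c( 0, 3, 12,
               3, 0,  9,
              12, 9,  0), nrow = 3, byrow = TRUE,
            dimnames = list(paste("Isolate", 1:3), paste("Isolate", 1:3)))

# Min-max normalization to [0, 1]; the minimum is the 0 on the diagonal.
d_norm <- (d - min(d)) / (max(d) - min(d))

# Map the normalized values to a gradient (color names are assumptions).
gradient <- colorRamp(c("lightgreen", "darkred"))
cell_colors <- matrix(rgb(gradient(as.vector(d_norm)), maxColorValue = 255),
                      nrow = nrow(d), dimnames = dimnames(d))
```

With this scaling the diagonal cells receive the lightest color and the pair with the largest allelic distance (Isolate 1 & 3 in the toy data) receives the darkest.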

Figure 13: Distance matrix showing the allelic distance of each entry pair.

There is the option to change the appearance of the matrix. Choose whether Assembly Name, Assembly ID or Index is displayed as column and row headers. As the focus might sometimes be on the subset of entries that are marked as included in the entry table, the switch Only Included Entries can be toggled to show only this selection. The display of the diagonal and the upper triangle can be activated or deactivated using the switches Show Diagonal and Show Upper Triangle, respectively. The distance matrix can be downloaded as a CSV file. Note that the matrix is downloaded as currently displayed, including all changes made to its appearance (e.g. with or without diagonal, or Index instead of Assembly Name as header).

4.5 Missing Values

Missing values occur if a locus cannot be found in the assembly or if the allele present contains mutations leading to a dysfunctional gene. As long as no entry in the local database has any missing values, the >> Missing Values tab is not displayed. When a new entry with NA value(s) is added to a local database that so far contained no missing values, reloading the database automatically renders the >> Missing Values tab to call attention to the newly occurring missing values. This tab provides statistical information about the occurrence of missing values and, most importantly, controls for selecting the strategy by which missing values are treated in subsequent analyses. This selection directly impacts the calculation of the allelic distances between the bacterial isolates. The options to choose from are detailed in 3.4.1 Missing Value Handling. Due to the importance of missing values and how they are treated, the >> Missing Values tab is always rendered first when a local database containing at least one missing value is loaded.

Figure 14: Missing values tab showing table and statistics about the currently loaded Bordetella pertussis database.

Figure 14 shows statistics about the missing values of the entries in this database. There are 1069 unsuccessful allele allocations in total, i.e. the global sum of NA values of all entries and loci. There are 2983 loci in total in the selected Bordetella pertussis scheme and 217 of these have one or more missing values, which makes up about 7.3 %. Isolates for which more than 5% of loci contain missing values are highlighted in orange. These should be included in further analyses with caution because a significant share of alleles couldn’t be determined.

Each row in the table on the right shows an entry that contains at least one missing value. The Errors column contains the sum of missing values for that isolate. The following columns are the loci that include at least one missing value (denoted by NA).
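The statistics described above can be reproduced on a small toy profile in R. The data and variable names below are invented for illustration and do not reflect the structure of the data stored by PhyloTrace.

```r
# Toy allelic profile (rows: isolates, columns: loci); NA marks a failed allocation.
profile <- data.frame(
  L1 = c("a1", NA, "a1", "a2"),
  L2 = c("b1", "b1", "b1", "b1"),
  L3 = c(NA, NA, "c2", "c1"),
  L4 = c("d1", "d1", "d1", NA),
  row.names = paste("Isolate", 1:4)
)

na_mask <- is.na(profile)

total_na     <- sum(na_mask)                # global sum of NA values
loci_with_na <- sum(colSums(na_mask) > 0)   # loci with one or more NA
share_loci   <- 100 * loci_with_na / ncol(profile)
errors       <- rowSums(na_mask)            # "Errors" value per isolate

# Isolates for which more than 5% of loci are missing.
flagged <- names(errors)[errors / ncol(profile) > 0.05]

# Share of affected loci reported for Figure 14: 217 of 2983 loci.
round(100 * 217 / 2983, 1)
# [1] 7.3
```

In the toy data there are 4 missing values in total, 3 of 4 loci are affected, and Isolates 1, 2 and 4 exceed the 5% threshold.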

5 AMR Screening

Screening for species-specific genes of interest, e.g. antibiotic resistance, virulence or stress genes, can be performed using the integrated NCBI/AMRFinder tool. The tab > Resistance Profile provides the interface for this feature and lets users inspect the screening results in >> Browse Entries and perform the screening from the tab >> Screening. Note that not every species is available for screening with AMRFinder. The availability for the currently selected scheme is checked automatically.

5.1 Perform Screening

Use the tab >> Screening to run AMRFinder. Selecting one or multiple isolates and clicking Start initiates the process. The runtime is estimated at less than a minute per isolate. Only isolates for which the respective assembly file is present in the local database can be screened. The results can be inspected in parallel using the selector on the right, which appears once at least one isolate has finished screening. Feedback on unsuccessful screening attempts is displayed as well.

Figure 15: Screening interface showing the result of a successful run.

5.2 Resistance Profile

There are two viewing modes available to browse the resistance profile resulting from gene screening. Selecting the view mode ☑ Picker | ☐ Table renders the option to select isolates from a simple selector. The table showing the resistance profile (including also virulence genes, stress genes, etc.) for the selected isolate will appear below. The view mode ☐ Picker | ☑ Table shows the isolate entry table above the resistance profile instead of the selector and therefore, besides providing a good overview, enables filtering and sorting. Select an entry from the table to render the respective resistance profile for this isolate. The currently selected table can be exported as CSV with Profile Table.

Figure 16: Resistance Profile showing screening results of a Klebsiella pneumoniae isolate in Picker view mode.

6 Visualization

Based on the allelic distances in the distance matrix (see 4.4 Distance Matrix), different tree plots can be created. PhyloTrace offers three tree construction algorithms: Minimum-Spanning, Neighbour-Joining and UPGMA. The tree type can be selected in the sidebar of the > Visualization tab (see Figure 17). Clicking the Create Tree button computes and displays a tree plot of the currently selected tree type. You can switch to a different tree type and create another tree without losing the tree created before. If you switch back to the previous tree type, the previously created tree is still available. Unless you create a new tree for the same tree type, the plot is conserved in the current session. Switching between tree types therefore makes it possible to seamlessly compare trees created from the same data set with different tree construction algorithms. Changes to the entry table in the >> Browse Entries tab, such as the inclusion of additional isolates (via ticking Include) or edited variables, only take effect in the tree plot if you save the database with the changes and click Create Tree again. Once a tree has been created, it can be modified and customized without having to reload it.

Figure 17: Visualization tab showing some of the sidebar controls as well as the modification menu in the middle.

6.1 Minimum-Spanning-Tree

The minimum-spanning-tree (MST) algorithm constructs a tree by connecting the closest nodes of the distance matrix without forming cycles. It finds the set of edges that connects all nodes while minimizing the total edge length. Refer to 6.1.1 MST Modification to find out how the tree appearance can be modified. The nodes represent single bacterial isolates. Isolates with identical allelic profiles, i.e. a distance value of 0, are summarized in a single node. If the allelic distance between isolates lies within a certain threshold, clusters are drawn.
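To make the idea concrete, the sketch below builds a minimum spanning tree from a small distance matrix using Prim's algorithm, after first collapsing isolates with a distance of 0 into one node. It illustrates the principle only: the manual does not state which MST algorithm PhyloTrace uses internally, and the function names and the example matrix are hypothetical.

```python
from typing import List, Tuple


def collapse_identical(dist: List[List[int]]) -> List[List[int]]:
    """Group isolate indices whose pairwise distance is 0."""
    groups: List[List[int]] = []
    for i in range(len(dist)):
        for group in groups:
            if dist[group[0]][i] == 0:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def minimum_spanning_tree(names: List[str], dist: List[List[int]]):
    """Return node labels and MST edges (Prim's algorithm)."""
    groups = collapse_identical(dist)
    reps = [g[0] for g in groups]          # one representative per node
    labels = ["/".join(names[i] for i in g) for g in groups]
    in_tree = {0}
    edges: List[Tuple[str, str, int]] = []
    while len(in_tree) < len(reps):
        # Cheapest edge from a node inside the tree to a node outside it.
        d, a, b = min(
            (dist[reps[a]][reps[b]], a, b)
            for a in in_tree
            for b in range(len(reps)) if b not in in_tree
        )
        edges.append((labels[a], labels[b], d))
        in_tree.add(b)
    return labels, edges


# Made-up allelic distances between four isolates; A and B are identical.
names = ["A", "B", "C", "D"]
dist = [
    [0, 0, 3, 7],
    [0, 0, 3, 7],
    [3, 3, 0, 5],
    [7, 7, 5, 0],
]
labels, edges = minimum_spanning_tree(names, dist)
print(labels)
print(edges)
```

Here A and B end up in a shared node, and the tree uses the edges of length 3 and 5 rather than the longer edge of length 7.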

6.1.1 MST Modification

Figure 18 shows the modification panels for MST plots. These are divided into Layout (see 6.1.1.1 Layout), Nodes (see 6.1.1.2 Nodes) and Edges (see 6.1.1.3 Edges). There are several options to customize MST graphs, e.g. colors, forms, sizes, titles, labels, and more. Note that, due to the way MST plots are generated, the plot is reset to its initial position when one of the modification parameters is changed. MST graphs can be enriched with information by mapping variables to the plot.

Figure 18: Modification panels for MST plots.

6.1.1.1 Layout

The Layout control panel lets you add a title, subtitle and footer to the graph by typing them in the text fields. Change their colors individually using the color button below each text field. The overall background color can be modified as well. Toggle the Transparent switch to make the background transparent.

6.1.1.2 Nodes

The Nodes control panel controls the appearance of the nodes and related elements such as the label. The upper left controls are related to the label, i.e. which isolates are represented by the respective node. Using the drop-down menu, the label can be changed to any variable present for the respective isolates according to the entry table. The color of the node labels can be modified using the color button and their size by clicking on the blue menu button right next to it. The color of the nodes themselves can be changed using the color button from the control panels on the upper right. Clicking the menu button lets you change the opacity.

Node colors can also be used to map a variable to the graph. Nodes are colored according to the value present for the respective isolates and transformed into a pie chart to show the distribution of values if several clonal isolates are summarized in a single node. Currently only variables of categorical type can be used in this feature.

The node size can be controlled from the bottom left controls. The size of nodes containing multiple isolates with identical allelic profiles can be scaled by the number of isolates contained in them. Toggle the Scale by Duplicates switch to activate this feature. The slider to set the node size then changes from a single value to a range selection. In this way, the size of the smallest nodes (containing just one isolate), the size of the nodes containing the most isolates, and the overall range can be controlled.
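One way to picture Scale by Duplicates is as a mapping from the number of isolates in a node to a size within the selected range. The sketch below assumes a simple linear mapping; the manual does not specify the exact formula PhyloTrace applies, and the function name, node names and size values are made up for illustration.

```python
from typing import Dict, Tuple


def scaled_node_sizes(counts: Dict[str, int],
                      size_range: Tuple[float, float]) -> Dict[str, float]:
    """Map isolate counts per node linearly onto [smallest, largest] size."""
    smallest, largest = size_range
    c_min, c_max = min(counts.values()), max(counts.values())
    if c_min == c_max:
        # All nodes hold the same number of isolates: nothing to scale.
        return {node: smallest for node in counts}
    return {
        node: smallest + (c - c_min) / (c_max - c_min) * (largest - smallest)
        for node, c in counts.items()
    }


# Hypothetical nodes with 1, 3 and 5 identical isolates, size range 20 to 40.
sizes = scaled_node_sizes({"n1": 1, "n2": 3, "n3": 5}, (20.0, 40.0))
print(sizes)
```

The lower end of the range applies to nodes with a single isolate, the upper end to the node containing the most isolates, and everything else falls in between.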

The form of the nodes can be customized using the control panels on the bottom right. Activate the switch Show Shadows to display shadows for the nodes. The shape of the nodes can be changed here as well. Choose between shapes that render the node labels below (Diamond, Hexagon, Dot, Square) or inside them (Circle, Box, Text). If a variable is mapped, the form Pie Chart is locked in and cannot be changed.

6.1.1.3 Edges

The Edges control panel controls the appearance of the edges and related elements. Each edge is labelled with the allelic distance between the isolates of the nodes it connects. Apart from its appearance, this label currently can’t be changed. Its color and size can be modified using the upper left controls Label. The color of the edges themselves can be controlled by Color in the upper right controls. Click the menu button to see the control for the transparency of the edges. On the bottom left, there is the option to scale the edge lengths by the allelic distance they represent. Toggle the Scale Edge Length switch to activate this effect. The multiplier of this effect can be customized using the slider below. Activating this option when the isolates displayed in the MST graph have very different allelic distances, e.g. a maximum of 200 and a minimum of 10, can lead to an untidy look of the plot. Drag the slider to lower values to minimize this issue.

6.1.1.4 Clustering

The clustering controls are found in the Edges panel at the bottom right. By default, the “Complex Type Distance” value disclosed for each scheme available on the cgMLST.org Nomenclature Server is selected as the current cluster threshold. The threshold can be changed to any desired value. Nodes with distances that lie within the selected threshold are enclosed by cluster shapes. These are colored differently in order to distinguish between the cluster groups. Choose between the Rainbow and Viridis scales to modify the coloring. There are two types of cluster shapes available: Area and Skeleton. The cluster type Area renders an area surrounding the nodes that are part of a cluster. Skeleton instead uses the edges to visualize clusters. This can be particularly useful if the selection of isolates is complex, which can lead to overlapping clusters with the Area cluster type.
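Threshold-based clustering can be sketched as follows: nodes joined by edges whose distance does not exceed the threshold end up in the same cluster. This is an illustrative reading of the description above, not the application's code. Whether "within the threshold" includes the threshold value itself, and whether only MST edges are considered, are assumptions here. The edge list and threshold values are made up.

```python
from typing import Dict, List, Tuple


def clusters_from_edges(nodes: List[str],
                        edges: List[Tuple[str, str, int]],
                        threshold: int) -> List[List[str]]:
    """Group nodes connected by edges with distance <= threshold."""
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(n: str) -> str:
        # Follow parent links to the cluster root (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, d in edges:
        if d <= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    # Only groups with at least two nodes are drawn as clusters.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical MST edges as (node, node, allelic distance).
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B", 2), ("B", "C", 9), ("C", "D", 12), ("D", "E", 40)]
print(clusters_from_edges(nodes, edges, threshold=10))
print(clusters_from_edges(nodes, edges, threshold=15))
```

With a threshold of 10, nodes A, B and C form one cluster. Raising the threshold to 15 pulls D into the same cluster, while E remains unclustered in both cases.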

6.2 Neighbour-Joining-Tree

The Neighbour-Joining (NJ) method constructs a tree by iteratively joining pairs of nodes based on their pairwise distances. It aims to minimize the total branch length in the tree and is commonly used for constructing phylogenetic trees from distance matrices. Refer to 6.4 NJ and UPGMA Modification for information on how the tree appearance can be modified.
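The pair selection at the heart of Neighbour-Joining can be illustrated with the standard Q criterion: for n nodes, Q(i, j) = (n - 2) * d(i, j) - sum of row i - sum of row j, and the pair with the lowest Q is joined first. The sketch below performs only this first joining step on a made-up five-taxon matrix. It shows the textbook method, not necessarily the exact routine PhyloTrace calls, and the function name is hypothetical.

```python
from typing import List, Tuple


def nj_first_join(labels: List[str],
                  dist: List[List[float]]) -> Tuple[str, str, float, float, float]:
    """Pick the first pair to join and the branch lengths to the new node."""
    n = len(labels)
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q criterion: low values indicate neighbours.
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    q, i, j = best
    # Branch lengths from i and j to the newly created internal node.
    len_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    return labels[i], labels[j], q, len_i, len_j


labels = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
first = nj_first_join(labels, dist)
print(first)
```

Note that a and b are joined first (Q = -50) even though d and e have the smallest raw distance (3). This is the characteristic difference from simple closest-pair clustering: NJ corrects each distance for how far the two nodes are from all others. The branch lengths to the new node come out unequal, 2 for a and 3 for b.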

6.3 UPGMA-Tree

The Unweighted Pair Group Method with Arithmetic Mean (UPGMA) computes tree plots by grouping the two most similar isolates or groups together at each step and then averaging their distances to all remaining groups. It produces an ultrametric tree, i.e. all tips have the same distance from the root, and is often used for hierarchical clustering of data. Refer to 6.4 NJ and UPGMA Modification for information on how the tree appearance can be modified.
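A compact sketch of UPGMA is given below to show the merge-and-average steps. It is a generic textbook version with hypothetical names and made-up distances, not the implementation used by PhyloTrace. Each merge is recorded with its height, which is half the distance between the two merged groups.

```python
from typing import List


def upgma(labels: List[str], dist: List[List[float]]) -> list:
    """Return the list of merges as (left subtree, right subtree, height)."""
    n = len(labels)
    # Cluster id -> (nested tuple describing the subtree, number of tips).
    clusters = {i: (labels[i], 1) for i in range(n)}
    # Pairwise distances, keyed by (smaller id, larger id).
    d = {(i, j): float(dist[i][j]) for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        (a, b), d_ab = min(d.items(), key=lambda kv: kv[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merges.append((tree_a, tree_b, d_ab / 2))
        # Distance to every remaining cluster: size-weighted arithmetic mean.
        new = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return merges


labels = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
merges = upgma(labels, dist)
for left, right, height in merges:
    print(left, right, height)
```

The merge heights only increase (1, 3, 5 in this example), and every tip ends up at the same distance from the root. This is why UPGMA trees always have their tips lined up, in contrast to NJ trees.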

6.4 NJ and UPGMA Modification

The tree elements can be customized in great detail and supplemented with additional information such as variables (see 6.4.4 Variable Mapping). The basic appearance, e.g. text and element sizes, is automatically adjusted to the number and properties of the entries that were selected for inclusion in the tree. Because data sets vary, it is sometimes necessary to manually readjust some elements to achieve a balanced look. While Minimum-Spanning trees have slightly different modification features and control inputs, NJ and UPGMA trees share the same control inputs. This is due to the different visualization technique used for the creation and display of MST plots. The controls to modify the tree are arranged in panels and divided into Layout, Label, Elements and Variables. In some panels you will find small menu buttons (highlighted in light blue). They let you modify the elements addressed by the respective panel in more detail (e.g. position or font style).

6.4.1 Layout

The appearance of the general layout can be modified in detail. There is a range of different options, e.g. for controlling theme, colors, title & subtitle, size, legend and other elements. To switch to these controls, navigate to the > Visualization tab and click the Layout button in the menu to the left of the control panels.

Figure 19: Layout menu to control Theme, Color, Title, Sizing, Tree Scale and Legend.

6.4.1.1 Theme

Layout themes change the geometrical appearance of the tree. You can choose from a selection of themes that are categorized into linear and circular layouts. While the visual look changes when switching between a linear and a circular theme, the topology of the hierarchical NJ and UPGMA trees, i.e. the order and arrangement, stays the same.

Linear: Rectangular, Roundrect, Slanted and Ellipse

Circular: Circular, Inward

Moreover, a Rootedge can be added by turning on the switch. The root of the tree can be considered as the starting point, representing a theoretical “common ancestor” with an initial allelic profile, from which all other isolates developed. Besides aesthetics, displaying this element can help to distinguish “normal” branches, representing actual allelic distance between the isolates, from the root. The root menu lets you further modify its length and line type.

The Ladderize switch is turned on by default. It sorts the tree branches by their length.

Figure 20: Different tree layouts. Top left: NJ tree with rectangular layout and deactivated ladderize. Top right: NJ tree with roundrect layout and rootedge. Bottom left: UPGMA tree with circular layout and reduced space in the center. Bottom right: UPGMA tree with inward layout.

6.4.1.2 Color

The color of lines, text and background can be modified in the Color panel. The colored buttons show the color currently displayed as well as the respective HEXA code. Clicking them opens the color menu. You can either select a color by choosing it directly from the gradient field or by providing a HEXA or RGBA code. Note that the Lines/Text color applies to the tree branches, legend text and title, but not to the tip labels. Their color can be modified in the respective Label menu (see 6.4.2.1 Tips).

Figure 21: Inward UPGMA tree with customized background and line color.

6.4.1.3 Title

Add title and subtitle in the Title panel. Their color changes in accordance with the selected Lines/Text color, but can be modified separately. The title menu lets you customize the font size.

6.4.1.4 Sizing

The Sizing panel provides control of plot dimensions and position. For the aspect ratio, you can choose from 16:10, 16:9 and 4:3. The overall size can be scaled with the slider below. If some elements are cut off, you can zoom out using the slider at the bottom. Trees with a circular layout in particular can sometimes appear small, with too much white space around them. In this case zooming in might be beneficial. The Sizing menu lets you position the content horizontally and vertically.

6.4.1.5 Tree Scale

Legend and tree scale controls share the same panel. The tree scale helps to estimate the actual allelic distance represented by the branch length. If you prefer not to show this element, you can hide it by toggling the switch. Its length can be changed in the tree scale menu and scales proportionally with the branch length. If the scale overlaps other elements, adjust its position by dragging the sliders in the menu.

6.4.1.6 Legend

If variables are mapped to the plot, a legend will appear. For the orientation, the options are either horizontal or vertical (see Figure 22). The legend menu allows to also adjust position and size.

Figure 22: Two different legend orientation modes horizontal (top left) and vertical (bottom right).

6.4.2 Label

The Label menu controls whether and how certain labels are displayed. There are three different kinds of labels: Tips, Branches and Custom Labels. They can be modified in many different ways, e.g. in color, size or position.

Figure 23: Label menu to add and control Tip, Branch and Custom labels.

6.4.2.1 Tips

The labels at the tips represent the actual entries with their allelic profiles, which determined their position in the tree. By default, the Assembly Name is displayed as tip label. However, it is possible to select other basic variables, e.g. Host, Country, City or Isolation Date, from the drop-down menu, or to hide tip labels altogether by toggling the Show switch. Instead of being positioned right next to the tips, the tip labels can be aligned to the right by activating the Align switch. UPGMA trees always have the tip labels aligned, and NJ trees only have this activated by default for circular layouts. The menu on the right provides further customization options. The Opacity slider can be used to change the transparency. The Position parameter modifies the offset of the labels from the tip. Angle, size and font face can be changed as well. Customize the color of the label text with the color button and the color of the panel with the color button below. The panels envelop the tip labels and are not shown by default. The controls in the panel menu let you modify the size of the panels (not the text itself) and smooth their form.

Figure 24: NJ tree with aligned and colored tip labels.

6.4.2.2 Branches

Branch labels supplement the tree with additional information by labelling the branch leading to each tip with a variable that is connected to the respective isolate. To show this element, toggle the Show switch in the Branches panel. The drop-down menu lets you choose which variable or metadata to annotate. The color of the panel surrounding the branch label can be changed with the color button below. The menu button includes further controls, e.g. opacity, size, horizontal and vertical position, font face as well as edge smoothing. Note that branch labels don’t work for trees with a circular layout. Also, more complex linear trees with many isolates usually offer too little space for adding branch labels. Instead, consider mapping a variable to other tree elements such as tip points (see 6.4.4 Variable Mapping).

Figure 25: NJ tree with branch labels showing the city of sample collection and panelled tip labels showing the sample host.

6.4.2.3 Custom Label

If there is a need for labels somewhere other than tips or branches, there is the option to create customized ones. The panel Custom Labels lets you define the label. Click the green + button to add it. The label will be positioned at the plot center. Create more labels by giving them a name and adding them again. To change the size and position, select the respective label from the drop-down menu and open the menu next to the + button. Make the desired changes and click the Apply button for them to take effect. Figure 25 shows a tree with two highlighted clades (see 6.4.3.5 Clade Highlight). The custom label function was used to annotate them.

Figure 25: Tree with highlighted clades and customized labels to annotate them.

6.4.3 Elements

The Elements menu provides control over several special elements such as tip and node points or a heatmap. These are not essential but can amplify the explanatory power of the tree. Elements can be deactivated or activated and their appearance can be changed.

Figure 26: Elements menu to add and control Tip and Node Points, Tiles, Heatmap and Clades.

6.4.3.1 Tip Points

Tip points are located at the end of the tree branches and correspond to the isolates displayed. They can be modified in color or size to bring the ends of the tree into prominence. Alternatively this element can be used to map a variable (see 6.4.4 Variable Mapping).

Figure 27: Circular NJ tree with highlighted Tip (green) and Node points (red).

6.4.3.2 Node Points

Node points, in contrast to tip points, solely represent theoretical predecessors and relatives with respect to the isolates and their allelic profiles. Apart from the option to map a variable, their look can be customized in the same way as tip points. Mapping variables is not possible because node points connect several isolates, which may have different values for a chosen variable.

6.4.3.3 Tiles

Tiles are supplementary elements that can be used to map variables to the plot. They work with both circular and linear layouts. Up to five different tiles can be added by activating them in the Variables menu (see 6.4.4 Variable Mapping). To modify opacity, width or position, select the tile you wish to change with the selector at the top left corner of the panel. Any modifications apply only to the selected tile. Opacity defines the transparency of the tile, which makes it possible to overlay it, e.g. over the tree. The width slider controls the width. Changing the position moves the tiles horizontally in linear layouts, while in circular layouts they are moved inwards or outwards in relation to the center of the circle.

Figure 28: Three different tiles that map continuous and categorical variables.

6.4.3.4 Heatmap

Heatmaps can be a powerful tool to visualize related variables of the same type (either categorical or continuous). For more details refer to 6.4.4 Variable Mapping. If the heatmap is activated in the Variables menu, it can be modified using the respective control panel in the Elements menu. Width changes apply to the heatmap overall, not to single columns. Just as with tiles, the position control moves the heatmap horizontally in linear layouts and inwards or outwards in circular layouts. In some situations, e.g. for long variable names or in circular layouts, it might make sense to modify the angle and/or position of the column headers. This can be done using the controls in the heatmap menu.

Figure 29: Heatmap which assigns the gene expression of a selection of genes to the corresponding isolates.

6.4.3.5 Clade Highlight

Isolates are grouped in distinct hierarchical clades, which are defined by nodes that comprise several isolates or other daughter nodes and their respective isolates. To emphasize one or several clades, toggle the Node View switch and look up the node index of each clade you wish to highlight. Select the nodes in the drop-down menu below and deactivate the Node View again to see the highlighted clades. If only one clade is highlighted, its color can be customized with the color button below. If more than one clade is selected, you choose from a color scale instead. Use the menu in the Clade Highlight control panel to control the alignment of the clade highlights to each other. The borders of the colored squares can be set to a round or rectangular appearance.

Figure 30: Toggled Node View showing the numbered nodes for highlighting distinct clades.

Clades that are located within another clade higher in the hierarchy can also be highlighted (see Figure 31).

Figure 31: Tree from Figure 29 with the clades of node 42 and its daughter node 44 highlighted.

6.4.4 Variable Mapping

Mapping variables, representing epidemiologic metadata or other properties of the isolates displayed, is a powerful way of enriching the plot with information. The Variables menu provides full control over which variables are mapped, the elements they are mapped to and the color scale that represents the different values of the selected variable. The control panel is ordered into Element, Variable and Color Scale columns (see Figure 32). The switches in the Element column can be turned on or off to activate or deactivate the display of a variable with the respective element. Select the variable to be mapped from the drop-down menu to the right of the element switch. It contains the basic metadata (Isolation Date, Host, City, Country) as well as the manually added custom variables (see 4.1.2 Custom Variables). The currently selected variable is checked for its number of distinct values and variable type (categorical or continuous). As this information is relevant for selecting the color scale, it is displayed directly next to the color scale selection menus. For categorical variables, the selectable color scales automatically change depending on the number of distinct values. If the number of distinct values is 7 or fewer, you can select from qualitative color scales. As there is a limited number of distinct colors available in the qualitative color scales, they are not selectable if the variable exceeds 7 distinct values. Instead, gradient color scales can be selected. Continuous variables have continuous and divergent color scales available. Using the colorblind-friendly gradients Viridis and Cividis is recommended. Divergent color scales are useful for visualizing data with a clear central point of interest, highlighting positive and negative deviations from a central value such as 0. An example use case is gene expression data, e.g. fold change values, with colors indicating whether the change is positive (upregulation) or negative (downregulation) relative to a baseline expression level of 0 (no change).
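The selection rule for color scales described above can be summarized in a short sketch. The limit of 7 distinct values is taken from the text. Everything else is an assumption for illustration: the function name, the way a variable is recognized as continuous (all values numeric), and the question of whether gradient scales remain available for categorical variables with few values, which is left open here.

```python
from typing import List, Sequence, Tuple, Union

Value = Union[str, int, float]


def selectable_scale_families(values: Sequence[Value]) -> Tuple[str, int, List[str]]:
    """Return (variable type, number of distinct values, offered scale families)."""
    distinct = len(set(values))
    # Assumption: a variable counts as continuous if all its values are numeric.
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                  for v in values)
    if numeric:
        return "continuous", distinct, ["continuous", "divergent"]
    if distinct <= 7:
        return "categorical", distinct, ["qualitative"]
    # Too many categories for the limited qualitative palettes.
    return "categorical", distinct, ["gradient"]


print(selectable_scale_families(["Graz", "Vienna", "Graz"]))
print(selectable_scale_families([f"city_{i}" for i in range(8)]))
print(selectable_scale_families([0.5, -1.2, 2.0]))
```

The first call describes a categorical variable with two values, for which qualitative scales are offered. The second exceeds 7 distinct values and falls back to gradient scales. The third is a continuous variable, such as a fold change.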

Figure 32: Variables menu to control variable mapping.

In Figure 33 the Isolation Date variable is mapped to the tip label color (see 6.4.2.1 Tips). Hence the tip labels indicate both the Assembly Name and the Isolation Date, with the Greys color scale highlighting more recently added isolates in darker shades. The tip point color displays the categorical City variable, i.e. the city in which the sample was acquired. In this example it has only two values, the cities Graz and Vienna. The qualitative scale Set2 is chosen to distinguish the variable as clearly as possible from other variables. The tip point shapes circle and triangle represent the host from which the bacterial sample was taken. As the variable values are represented by shapes instead of colors, there is no color scale for this option. Continuous values can’t be represented by shapes. There are six different shapes available, hence selecting the tip point shape to represent categorical variables is only possible if there are 6 or fewer distinct values. The custom variables Patient Age and ftsA, which stands for expression values of the ftsA gene, are mapped to Tile 1 and 2 respectively. Except for the color values, which are assigned by the variable mapping, the appearance of the elements, such as tip point sizes, can still be modified (e.g. 6.4.3.1 Tip Points).

Figure 34: Circular UPGMA plot with several variables mapped.

Figure 35 shows an example of gene expression fold changes mapped on a heatmap. While white/yellow colors indicate baseline expression levels around 0, green colors indicate upregulation and red colors downregulation. When a diverging scale is selected, you can choose the midpoint of the scale (Zero, Mean or Median) using the drop-down menu that appears to the right of the color scale selector. Zero assigns the middle color of the diverging color scale to the value 0. The choices Mean and Median assign the middle color to the arithmetic mean and median of the respective values. The appearance of the heatmap, such as width and position, can be modified using the respective control panel from the Elements menu (see 6.4.3.4 Heatmap).
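The effect of the midpoint choice can be sketched by converting each value into a position between 0 and 1 on the color scale, where 0.5 is the middle color. The sketch below assumes symmetric scaling around the midpoint, a common convention for diverging scales. The manual does not state how PhyloTrace rescales the values, and the function name and data are made up.

```python
from statistics import mean, median
from typing import List, Sequence


def diverging_positions(values: Sequence[float], midpoint: str = "Zero") -> List[float]:
    """Map values to [0, 1] so that the chosen midpoint lands on 0.5."""
    mid = {"Zero": 0.0, "Mean": mean(values), "Median": median(values)}[midpoint]
    # Largest deviation from the midpoint defines both ends of the scale.
    span = max(abs(v - mid) for v in values) or 1.0
    return [0.5 + 0.5 * (v - mid) / span for v in values]


fold_changes = [-2.0, 0.0, 1.0, 4.0]  # hypothetical fold change values
print(diverging_positions(fold_changes, "Zero"))
print(diverging_positions([1.0, 2.0, 3.0], "Mean"))
```

With Zero, a fold change of 0 receives the middle color, negative values fall below 0.5 and positive values above it. With Mean or Median, the middle color shifts to the center of the data instead, which can be useful when all values lie on one side of 0.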

Figure 35: Circular UPGMA tree with a heatmap representing gene expression fold changes for a selection of four genes.

7 Reporting Results

7.1 Download Plots

Neighbour-Joining and UPGMA trees can be downloaded in PNG, JPEG, BMP and SVG format. Minimum-Spanning trees can be downloaded in PNG, JPEG and BMP format. In addition they can be downloaded as HTML to preserve the interactivity of dragging, zooming and moving the MST graph. To initiate the download head to the > Visualization tab. In the sidebar, below the Create Tree button, you find the drop down to select the file type as well as the download button right next to it. Note: In order for the download to work, the plots have to be created first.

7.2 Generating Report Document

A report in HTML format can be created by clicking the button Print Report, located in the sidebar of the > Visualization tab. There are several options to control which information is included in the report. The elements are categorized into Entry Table, General, Analysis and Attach Plot (see 7.2.1 Report Elements). Note that the report requires prior creation of a tree plot. The entry table in the report, the attached plot and some analysis parameters, such as the tree algorithm, are all fixed at the moment a tree is created. Therefore a proper report can only be generated after tree creation. For the entry table, only the isolates of interest, i.e. the ones that were used to generate the currently displayed tree, are listed in the report instead of the entire local database for the respective scheme. The report is downloaded to the location set in your browser’s download settings.

Figure 36: Setting general report parameters Date, Operator, Institute and Comment.

7.2.1 Report Elements

The sub-elements belonging to General are Date, Operator, Institute and Comment. If you wish to include only a selection of these elements, tick or untick them accordingly. Unticking the General element will deactivate the display of any sub-elements as well.

Figure 37: Upper part of the report showing general information (Date, Operator, Institute, Comment) and an attached NJ tree.

Ticking the Isolate Table prints the entry table, comprising the isolate names as well as the selected metadata columns, in the report. Note that only entries that are marked as Included in the database (>> Browse Entries) are printed. Hence only isolates that are shown in the current tree are included.

The sub-elements belonging to the Analysis parameters are Scheme, Tree, Distance, NA Handling and Version. These parameters are automatically derived from the session as well as the created tree and can only be selected to be shown or hidden. As with the General parameters, unticking Analysis Parameter will hide all sub-elements.

Figure 38: Lower part of the report showing the entry table (top) as well as analysis parameters (bottom left) and information about the cgMLST scheme (bottom right).

References

(1) Clausen, P. T. L. C.; Aarestrup, F. M.; Lund, O. Rapid and Precise Alignment of Raw Reads Against Redundant Databases with KMA. BMC Bioinformatics 2018, 19 (1), 307. https://doi.org/10.1186/s12859-018-2336-6.

(2) Kent, W. J. BLAT–the BLAST-Like Alignment Tool. Genome Research 2002, 12 (4), 656–664. https://doi.org/10.1101/gr.229202.

(3) Hamming, R. W. Error Detecting and Error Correcting Codes. The Bell System Technical Journal 1950, 29 (2), 147–160. https://doi.org/10.1002/j.1538-7305.1950.tb00463.x.