NX4 User Guide

What is NX4?

NX4 is a web-based visualization tool for the exploration of aligned viral sequences. The tool was created as an alternative to matrix-based visualizations of multiple sequence alignments (MSAs).

A note about the supported data format

NX4 currently supports a single FASTA file containing all the sequences pertaining to a calculated alignment. Each sequence block should follow the standard FASTA format and should be separated by one line break from the next sequence.
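To check a file before loading it, the format described above can be sketched in a few lines of Python. This is a minimal illustration and not the parser NX4 uses: the function names parse_fasta and alignment_length are hypothetical, and the sketch assumes that every sequence in a calculated alignment has the same length and that headers are unique.

```python
from typing import Dict, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}; the leading '>' is dropped."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines between sequence blocks
        if line.startswith(">"):
            header = line[1:]
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # A sequence may span several lines; join them into one string.
            records[header] += line
    return records


def alignment_length(records: Dict[str, str]) -> int:
    """Return the shared sequence length, or raise if the lengths differ."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


example = ">seq1 first\nMKT-A\n>seq2 second\nMKTGA\n"
records = parse_fasta(example)
print(len(records), alignment_length(records))
```

For the two-sequence example, the script reports 2 records with an alignment length of 5. A ValueError from alignment_length suggests the file is not a finished alignment.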

The application supports an ID extraction feature that allows the user to identify and export specific sequences. However, because there is no full consensus on how to format the sequence header in the FASTA format, some users may find that the sequence IDs show as "undefined" when trying to extract them in the tool. To avoid that, we recommend using either the UniProt standard or the NCBI standard to format your headers, as we have tried to accommodate several different formatting scenarios. The examples below show how different header formats will show as IDs in the extraction module:

Example header: >db|UniqueIdentifier|EntryName ProteinName OS=OrganismName OX=OrganismIdentifier [GN=GeneName ]PE=ProteinExistence SV=SequenceVersion
Example extracted ID: db|UniqueIdentifier|EntryName

Example header: >P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
Example extracted ID: P01013

Example header: >Ebola_virus_H.sapiens_wt/GBR/2015/Makona_UK3|KR025228|United_Kingdom|2015_03_12
Example extracted ID: Ebola_virus_H.sapiens_wt/GBR/2015/Makona_UK3|KR025228|2015_03_12

Example header: >ASingleStringHeader
Example extracted ID: >ASingleStringHeader

Loading data into the platform

To load data into the platform, drag and drop a valid file into the gray area on the top right of the home screen, or click on it to open a file browser.

Interfacing with the tool

Once data has been successfully loaded, the website will automatically display four separate views or modules:

  1. Overview module

  2. Detailed entropy module

  3. Frequencies module

  4. ID extraction module

Below we explain how to interact with each module.

1. The overview module

The overview module provides a general display of the calculated Shannon entropy for the entire length of the sequence, and it includes a draggable element known as a "brush" that allows the user to home in on a region of the sequence. To interact with the brush, simply click on the gray rectangular area and drag to the left or right to display a portion of the sequence below.
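For readers unfamiliar with the measure, the Shannon entropy of one alignment column is H = -Σ p(x) log2 p(x), where p(x) is the fraction of sequences showing symbol x at that position. The sketch below is only an illustration: the function names are hypothetical, and the choice of log base 2 and the treatment of the gap character "-" as an ordinary symbol are assumptions, not confirmed details of how NX4 computes its values.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy in bits of the symbols found in one alignment column."""
    total = len(column)
    counts = Counter(column)
    # Assumption: gaps ('-') are counted like any other symbol.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(sequences: List[str]) -> List[float]:
    """Entropy of every column of an alignment given as equal-length strings."""
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("all sequences must have the same length")
    # zip(*sequences) walks the alignment column by column.
    return [column_entropy("".join(col)) for col in zip(*sequences)]


print(alignment_entropy(["ACT", "AGT", "ACT", "AGT"]))
```

A fully conserved column has an entropy of 0, and a column split evenly between two symbols has an entropy of 1 bit, so peaks in the chart mark the most variable positions.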

2. The detailed entropy module

This module, located below the overview module, displays a "zoomed-in" portion of the sequence based on the selection of the brush (see the previous module for details). This line chart also displays the Shannon entropy, and the user can hover over the chart to display specific values and proportions. This module and the one below are coordinated, which means that when the user hovers over a specific position in the chain, they will see a marker in both modules at the same time.

3. The frequencies module

This module is a heat map visualization of the calculated frequencies for each amino acid. The color reference helps the user approximate the frequency of a given letter at a specific position, and hovering with the mouse displays the actual value next to the letter, on the left side of the heat map.

The user can also click on any square to obtain, in the module below, a list of all the IDs for that particular letter and position. For instance, if 85% of sequences show a T in position 54, clicking on that rectangle returns the IDs of all 0.85 * N sequences that have a T in that position, where N is the total number of sequences.
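The frequency calculation and the click-to-select behavior described above can be sketched as follows. This is a hedged illustration rather than the tool's own code: the function names, the 1-based position numbering, and the example IDs are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def residue_frequencies(records: Dict[str, str], position: int) -> Dict[str, float]:
    """Fraction of sequences showing each symbol at a 1-based position."""
    column = [seq[position - 1] for seq in records.values()]
    return {symbol: n / len(column) for symbol, n in Counter(column).items()}


def ids_with_residue(records: Dict[str, str], position: int, residue: str) -> List[str]:
    """IDs of the sequences that show `residue` at a 1-based position."""
    return [seq_id for seq_id, seq in records.items() if seq[position - 1] == residue]


# Hypothetical four-sequence alignment, keyed by sequence ID.
records = {"s1": "AT", "s2": "AT", "s3": "AT", "s4": "AC"}
print(residue_frequencies(records, 2))
print(ids_with_residue(records, 2, "T"))
```

In this example, T has a frequency of 0.75 at position 2, so selecting it yields 0.75 * 4 = 3 IDs: s1, s2, and s3.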

4. The ID extraction module

This module allows the user to quickly and easily explore all the sequences selected in the frequencies module. The user can copy this list of IDs to use in another application for further analysis.

Data sources

There are two datasets available on the homepage for you to test or use for replication. You may download them here:

101 Sequences – Ebola (EBOV)

From Gire et al., Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak, Science, 2014.

Click here for the raw data. If you prefer to get it from the Science site, follow the link, navigate to "Figures and Data", and download the zip file "File S1". After uncompressing it, you will find a FASTA file called "ebov.mafft.fasta" inside the "alignments" folder.

1,824 Sequences – Ebola (EBOV)

This file was kindly given to us for testing by researchers from the Sabeti Lab at the Broad Institute of MIT and Harvard. The sequences were aligned using MAFFT v7.221. Click here for the data (it's ~32 MB).
