MAVID user guide



Introduction

MAVID is a multiple alignment program suited to aligning large numbers of DNA sequences. The sequences can be small mitochondrial genomes or large genomic regions up to megabases long. The MAVID server integrates MAVID with various phylogenetic tree construction programs and visualization tools, so that biomedical researchers who have a collection of related genomic sequences can rapidly identify conserved regions for further analysis. For an overview of MAVID see the poster presented at the 2005 Biology of Genomes meeting at Cold Spring Harbor Laboratory.

The only requirement for use of the server is that the user have available a set of homologous genomic sequences. Preparation of the sequences for submission is discussed in the section on submitting sequences to the server. If annotations are available for one of the submitted sequences, these can be provided as well (see using annotations) and will be used in generating some of the subsequent plots. Furthermore, the annotations are transferred to the other sequences according to the multiple alignment.

After a request has been processed, the results are displayed on a specially created website that is stored for future reference. This website is accessible only to the user who submitted the sequences. The results page contains the MAVID-generated multiple alignment in various formats, the phylogenetic tree constructed from the sequences, and VISTA pictures of the alignments.

Use of the server is summarized below:

Input: 
 - Sequences in multi-FASTA format
 - Optional annotations for one of the sequences
Output:
 - Phylogenetic tree
 - Pairwise or multiple alignment in various formats
 - Sets of VISTA pictures showing conservation in the pairwise alignments 
   generated from the MAVID multiple alignment

Submitting sequences to the server

The only required submission to the MAVID server is a file containing the sequences, to be input in the "DNA sequences file" box. The sequence file should be a text file containing the sequences in multi-FASTA format. Considerable care has been taken to ensure that sequence files can be parsed irrespective of extra spaces or other format deviations; however, a submission can still fail because the input file is not to specification. Indeed, this is one of the most common errors made by users of the MAVID server.

FASTA identifiers should be short and descriptive. They are used by subsequent programs, and long, clumsy headers do not display well in some of the output (for example, when viewing the phylogenetic tree).
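A quick local check before uploading can catch many format problems. The following Python sketch is not part of MAVID and does not reproduce the server's own parser. It reads multi-FASTA text and reports a few likely problems. The function names, the 20-character header limit, and the accepted alphabet (A, C, G, T, N) are assumptions chosen for illustration.

```python
import re


def parse_multi_fasta(text):
    """Return a list of (identifier, sequence) pairs from multi-FASTA text."""
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line.replace(" ", ""))  # tolerate spaces inside a line
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_records(records, max_id_length=20):
    """Return a list of problems found; an empty list means none were found.

    max_id_length and the allowed alphabet are illustrative assumptions,
    not limits documented for the MAVID server.
    """
    problems = []
    seen = set()
    for header, seq in records:
        if not header:
            problems.append("empty FASTA header")
        if len(header) > max_id_length:
            problems.append(f"header '{header}' is longer than {max_id_length} characters")
        if header in seen:
            problems.append(f"duplicate header '{header}'")
        seen.add(header)
        if not seq:
            problems.append(f"'{header}' has no sequence")
        elif re.search(r"[^ACGTNacgtn]", seq):
            problems.append(f"'{header}' contains characters other than A, C, G, T, N")
    return problems
```

For example, check_records(parse_multi_fasta(open("sequences.fasta").read())) returns an empty list when none of these problems is present.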

Once sequences have been submitted to the server, a temporary page is displayed showing progress in processing the job. The page auto-refreshes periodically until the job is complete, at which point the results are displayed. Typical jobs complete within seconds; larger submissions (for example, multiple alignments of megabases of sequence) may take longer. If a large job is submitted, it is advisable to bookmark the URL of the "job progress" page and return to it later (the URL of the results page will be the same).

Using annotations

The inclusion of annotations with the submitted sequences is optional. Annotations are currently used in generating the VISTA pictures. Users can submit annotations for only one of the sequences; that sequence is indicated by copying its FASTA header into the box marked "FASTA line of sequence to which annotation corresponds". The raw annotations in GAF format need to be uploaded separately.

A useful feature of MAVID is the transfer of annotations to the other sequences. This is done automatically using the multiple alignment, and the results can be viewed and downloaded on the VISTA page.
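The idea of transferring an annotation through an alignment can be sketched in a few lines of Python. This is an illustration of the general technique only, not MAVID's implementation. It assumes two gapped rows of equal length taken from an alignment, with "-" as the gap character, and 0-based, end-exclusive coordinates. The function names are hypothetical.

```python
def build_position_map(aligned_src, aligned_dst):
    """Map each ungapped source position to a target position.

    Entry i of the result is the 0-based position in the target sequence
    aligned to base i of the source sequence, or None if that base is
    aligned to a gap.
    """
    if len(aligned_src) != len(aligned_dst):
        raise ValueError("aligned rows must have the same length")
    mapping = []
    dst_pos = 0
    for s, d in zip(aligned_src, aligned_dst):
        if s != "-":
            mapping.append(dst_pos if d != "-" else None)
        if d != "-":
            dst_pos += 1
    return mapping


def transfer_interval(mapping, start, end):
    """Transfer a 0-based, end-exclusive source interval to the target.

    Returns (start, end) in target coordinates, or None if no base of the
    interval is aligned to a target base.
    """
    hits = [p for p in mapping[start:end] if p is not None]
    if not hits:
        return None
    return (min(hits), max(hits) + 1)
```

An exon annotated at source positions 2 to 5 would be transferred with transfer_interval(build_position_map(src_row, dst_row), 2, 5). Features that fall entirely in a gap of the target have no image and yield None.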

Organization

The output is organized into two main categories: Download and View. Downloadable files include the multiple alignment in PHYLIP or multi-FASTA format, along with the phylogenetic tree. In the case where just two sequences are submitted for alignment, the pairwise alignment is available for download in AVID format. Please note that the alignment files are not compressed, and can be large.

The visualization options currently include the VISTA plots (which open in another browser window) and the phylogenetic tree (which opens in a Java applet). We are currently finishing a new visualization method for multiple alignments called MATA, which will be available shortly.

The phylogenetic tree

The phylogenetic tree generated from the MAVID alignment can be downloaded directly in Newick format, or viewed using the ATV applet (the tree is rooted using the midpoint method). The ATV applet is a useful tool for examining and manipulating the phylogenetic tree. Options include display of branch lengths, and the ability to reroot the tree or view it with unscaled branch lengths. It is important to note that the branch lengths correspond to the lengths of the horizontal segments in the tree. The branch lengths have been estimated from the entire alignment, without regard to the amount of functional sequence, and so they may or may not reflect the neutral rate of evolution for the submitted regions.
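A downloaded Newick file can also be inspected outside the applet. The Python sketch below is a minimal parser for simple Newick strings, written for illustration and not taken from MAVID or ATV. It assumes unquoted names, no comments, and no whitespace inside the tree. It sums branch lengths from the root to each leaf, which corresponds to the horizontal extent of each leaf in a scaled drawing. The tree string in the usage note is made up.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {'name': str, 'length': float, 'children': list}.
    A missing branch length is stored as 0.0.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = 0.0
        if text[pos] == ":":
            pos += 1  # consume ':'
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_depths(node, above=0.0):
    """Return {leaf name: summed branch length from the root to that leaf}."""
    here = above + node["length"]
    if not node["children"]:
        return {node["name"]: here}
    depths = {}
    for child in node["children"]:
        depths.update(leaf_depths(child, here))
    return depths
```

With the made-up tree "((mouse:0.125,rat:0.125):0.25,human:0.5);", leaf_depths(parse_newick(...)) gives mouse and rat a depth of 0.375 and human a depth of 0.5.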

Viewing and using the VISTA pictures

Clicking on the "VISTA plots" link opens a new browser window which contains links to VISTA plots generated from the multiple alignment. A plot in Adobe PDF format is generated for each of the submitted sequences. For a given sequence, its alignments with the other sequences are extracted from the multiple alignment and displayed. If an annotation file has been provided, the plot will show the locations of UTRs and coding exons. The plots have been designed to show conservation and not just similarity, so the cutoffs for shading conserved regions are set dynamically based on the evolutionary distance between the sequences. Similarly, the baseline of the VISTA curve is set to be roughly the background amount of similarity expected for the submitted sequences based on their evolutionary distance.
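Extracting a pairwise alignment from a multiple alignment amounts to taking two rows and dropping the columns in which both rows have a gap. The Python sketch below illustrates this under the assumption that the rows are equal-length strings with "-" as the gap character. It is not the code used by the server, and the function names are hypothetical.

```python
def project_pair(row_a, row_b):
    """Return the pairwise alignment induced by two rows of a multiple alignment.

    Columns in which both rows contain a gap are removed.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


def percent_identity(row_a, row_b):
    """Percentage of alignment columns in which both rows have the same base.

    Columns pairing a base with a gap count as mismatches; this convention
    is an assumption made for the example.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    if not row_a:
        return 0.0
    matches = sum(
        1 for a, b in zip(row_a.upper(), row_b.upper()) if a == b and a != "-"
    )
    return 100.0 * matches / len(row_a)
```

Given rows for human, mouse and rat, project_pair(human_row, mouse_row) yields the human and mouse pairwise alignment without the columns that exist only because of insertions in rat.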

If annotations have been provided for one of the sequences, these will be mapped to the other sequences using the multiple alignment, and the VISTA plots for each of the sequences will be annotated correctly (the inferred annotations can also be downloaded on the VISTA page).

A worked example

The following example illustrates, step by step, how to use the server and how to interpret the results:
Step 1: Download the file sequences.fasta (right-click the link and select save).
        This file contains three sequences (human, mouse and rat).
Step 2: Open the MAVID server window by clicking here.
Step 3: Click on the "Browse" button on the MAVID server page (to the right of the "DNA sequences file" box),
        find the file sequences.fasta on your computer, and select it.
Step 4: Download the file annotation.
Step 5: Insert this file into the "Annotation file" box using the "Browse" button.
        Then type "human" in the "FASTA line" box, since the annotation file
        provides annotations for the human sequence.
Step 6: You are now ready to run the MAVID server. Just hit the "Submit" button.
Step 7: You will see an intermediate page showing the progress of the server in
        processing your job. When it is complete (after about 1 minute),
        you should see a page that looks the same as this one (except with a different URL and possibly a different animal).
Step 8: The View section of the page shows different aspects of the results that can be viewed.
        Clicking on the phylogenetic tree will open the ATV applet, a tool produced
        at Washington University for visualizing phylogenetic trees (see the ATV user manual for more information).
        Note that the tree is rooted, and that the branch lengths can be examined by pressing
        "show branch lengths" in the applet. Your intuition about the evolutionary history of human, mouse and rat
        should be confirmed.
Step 9: Explore the VISTA page. Note that there are three different PDF files for download, one corresponding
        to each of the sequences. Click on the "mouse_all.pdf" link. The VISTA plot shows the locations of the genes, UTRs and
        coding exons. Conserved non-coding sequences are colored in red. Notice that the criterion for declaring a region
        conserved is that it must be at least 100bp long with 73% alignment (between human and mouse), with the percentage
        raised to 86% between mouse and rat. This is because mouse and rat are closer to each other, and
        therefore the threshold for declaring a region conserved is higher. The numbers have been determined from the
        overall branch lengths in the tree for this particular region.
        You can also download the annotations for the mouse and the rat sequences
        (these have been inferred from the multiple alignment).
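
The length and percentage criteria quoted in Step 9 can be explored on a downloaded pairwise alignment with a simple sliding-window scan. The Python sketch below is an approximation written for illustration; it is not the algorithm VISTA uses. It assumes that the percentage refers to identical bases per window of alignment columns, takes the 100 and 73 from the worked example as defaults, and reports regions in alignment-column coordinates rather than sequence coordinates. The function name is hypothetical.

```python
def conserved_regions(row_a, row_b, min_length=100, min_identity=73.0):
    """Return merged (start, end) column ranges that pass the identity cutoff.

    A window of min_length columns qualifies when its percentage of
    identical, non-gap columns is at least min_identity. Overlapping or
    touching qualifying windows are merged. Ranges are 0-based and
    end-exclusive, in alignment columns.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(row_a.upper(), row_b.upper())
    ]
    n = len(matches)
    regions = []
    if n < min_length:
        return regions
    count = sum(matches[:min_length])  # matches in the first window
    for start in range(n - min_length + 1):
        if start > 0:
            # slide the window one column to the right
            count += matches[start + min_length - 1] - matches[start - 1]
        if count * 100.0 / min_length >= min_identity:
            end = start + min_length
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions
```

Calling conserved_regions(mouse_row, rat_row, min_identity=86.0) applies the stricter cutoff quoted for the mouse and rat comparison.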