PRABI-Doua

Pôle Rhône-Alpes de Bioinformatique Site Doua

File Formats Used at PRABI-Doua

FASTA format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than ('>') symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.

Example:

>MyGene
CCTCTCGGAGCTGGAAATGCAGCTATTGAGATCTTCGAATGCTGCGGAGCTGGAGGCGGA
GGCAGCTGGGGAGGTCCGAGCGATGTGACCAGGCCGCCATCGCTCGTCTCTTCCTCTCTC
CTGCCGCCTCCTGTGTCGAAAATAACTTTTTTAGTCTAAAGAAAGAAAGACAAAAGTAGT
CGTCCGCCCCTCACGCCCTCTCTTCCTCTCAGCCTTCCGCCCGGTGAGGAAGCCCGGGGT
GGCTGCTCCGCCGTCGGGGCCGCGCCGCCGAGCCCCAGCGCCCCGGGCCGCCCCCGCACG
CCGCCCCCATGCATCCCTTCTACACCCGGGCCGCCACCATGATAGGCGAGATCGCCGCCG
CCGTGTCCTTCATCTCCAAGTTTCTCCGCACCAAGGGGCTGACGAGCGAGCGACAGCTGC
AGACCTTCAGCCAGAGCCTGCAGGAGCTGCTGGCAGAACATTATAAACATCACTGGTTCC
CAGAAAAGCCATGCAAGGGATCGGGTTACCGTTGTATTCGCATCAACCATAAAATGGATC

or

>MyProtein
MAVTQTAQACDLVIFGAKGDLARRKLLPSLYQLEKAGQLNPDTRIIGVGRADWDKAAYTK
VVREALETFMKETIDEGLWDTLSARLDFCNLDVNDTAAFSRLGAMLDQKNRITINYFAMP
PSTFGAICKGLGEAKLNAKPARVVMEKPLGTSLATSQEINDQVGEYFEECQVYRIDHYLG
KETVLNLLALRFANSLFVNNWDNRTIDHVEITV
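
The layout described above can be read with a few lines of Python. The following is only a minimal sketch, not the parser used by any PRABI-Doua program: the function name parse_fasta is hypothetical, and the sketch assumes that blank lines between records can be ignored and that trailing spaces on the description line are not significant.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) pairs."""
    records = []
    description = None
    chunks = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # '>' in the first column starts a new record
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # every other line is sequence data for the current record
            chunks.append(line.strip())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


example = """>MyProtein
MAVTQTAQACDLVIFGAKGDLARRKLLPSLYQLEKAGQLNPDTRIIGVGRADWDKAAYTK
KETVLNLLALRFANSLFVNNWDNRTIDHVEITV
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Sequence lines are joined without separators, so the line length chosen when the file was written does not affect the result.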

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
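
The clean-up step described above (upper-casing and dealing with digits) might look like the sketch below. The function name normalize_query and its unknown parameter are hypothetical. Whether digits should be removed or replaced depends on what they mean in your data, so the choice is left to the caller; dropping whitespace is a further assumption of this sketch.

```python
def normalize_query(sequence, unknown=None):
    """Upper-case a query sequence and handle numerical digits.

    If `unknown` is None, digits are removed. Otherwise each digit is
    replaced by `unknown` (for example "N" for nucleic acids or "X" for
    amino acids). Whitespace is dropped in both cases (an assumption).
    """
    out = []
    for ch in sequence:
        if ch.isspace():
            continue
        if ch in "0123456789":
            if unknown is not None:
                out.append(unknown)
            continue
        # letters are mapped to upper case; '-' and '*' pass through unchanged
        out.append(ch.upper())
    return "".join(out)


print(normalize_query("acgt 12 acg-t"))          # digits removed
print(normalize_query("mav1tq*", unknown="X"))   # digit replaced by X
```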

The nucleic acid codes supported are:

	A --> adenosine           M --> A C (amino)
	C --> cytidine            S --> G C (strong)
	G --> guanosine           W --> A T (weak)
	T --> thymidine           B --> G T C
	U --> uridine             D --> G A T
	R --> G A (purine)        H --> A C T
	Y --> T C (pyrimidine)    V --> G C A
	K --> G T (keto)          N --> A G C T (any)
	                          -  gap of indeterminate length
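
The table above translates naturally into a lookup structure. The sketch below is illustrative only: the names IUPAC_NUCLEOTIDES, possible_bases and unsupported_codes are hypothetical, and treating a gap as "no base" is an assumption of this example.

```python
# Each code maps to the bases it can stand for, as listed in the table.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT", "B": "GTC", "D": "GAT",
    "H": "ACT", "V": "GCA", "N": "AGCT",
}


def possible_bases(code):
    """Return the set of bases a code stands for (empty set for a gap)."""
    if code == "-":
        return set()
    return set(IUPAC_NUCLEOTIDES[code.upper()])


def unsupported_codes(sequence):
    """Return the sorted list of characters that are not in the table."""
    allowed = set(IUPAC_NUCLEOTIDES) | {"-"}
    return sorted(set(sequence.upper()) - allowed)


print(possible_bases("r"))
print(unsupported_codes("ACGTXZ1"))
```

A check such as unsupported_codes can be run before submitting a query to catch stray characters early.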

For those programs that use amino acid query sequences, the accepted amino acid codes are:

	A  alanine                         P  proline
	B  aspartate or asparagine         Q  glutamine
	C  cysteine                        R  arginine
	D  aspartate                       S  serine
	E  glutamate                       T  threonine
	F  phenylalanine                   U  selenocysteine
	G  glycine                         V  valine
	H  histidine                       W  tryptophan
	I  isoleucine                      Y  tyrosine
	K  lysine                          Z  glutamate or glutamine
	L  leucine                         X  any
	M  methionine                      *  translation stop
	N  asparagine                      -  gap of indeterminate length

MASE format

This format is used to store nucleotide or protein multiple alignments. The file must begin with a header of at least one line, although that line may be empty. Header lines must begin with ';;'. The body of the file is structured as follows. Each entry begins with one or more comment lines, which start with the character ';'; a comment line may also be empty. After the comments, the name of the sequence is written on a separate line. Finally, the sequence itself is written on the following lines.

Example:

;;Aligned by clustal on Fri Jul  7 10:54:01 1995
;no description
Sequence1
MAPG-SWFSPLLIAVVTLGLP-QEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFE
RTYIPEDQRYTN-KNSQAAFCYSETIPAPTGKDDAQQKSDMELLRFSLVLIQSWLTPVQY
...
H-LRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI
;no description
Sequence2
M-------GQVFLLMPVLLVSCFLSQG--AAMENQRLFNIAVNRVQHLHLMAQKMFNDFE
GTLLPDERR-QLNKIFLLDFCNSDSIVSPIDKLETQKSSVLKLLHISFRLIESWEYPSQT
...
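
The structure described above (header lines, then comment lines, a name line and sequence lines for each entry) can be read as sketched below. This is a minimal illustration under stated assumptions, not a reference implementation: the function name parse_mase and the dictionary keys are hypothetical, blank lines are ignored, and a ';' line that follows sequence data is taken to start the next entry.

```python
def parse_mase(text):
    """Parse MASE text into (header_lines, entries).

    Each entry is a dict with the keys "comments", "name" and "sequence".
    """
    header = []
    entries = []
    current = None          # entry being filled, None while in the header
    expecting_name = False  # True between the comments and the name line
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith(";;") and current is None:
            header.append(line[2:])
        elif line.startswith(";"):
            if current is None or not expecting_name:
                # first entry, or a comment following sequence data
                if current is not None:
                    entries.append(current)
                current = {"comments": [], "name": None, "sequence": ""}
                expecting_name = True
            current["comments"].append(line[1:])
        elif not line:
            continue  # assumption: blank lines are ignored
        elif current is None:
            raise ValueError("sequence data found before any comment line")
        elif expecting_name:
            current["name"] = line.strip()
            expecting_name = False
        else:
            current["sequence"] += line.strip()
    if current is not None:
        entries.append(current)
    return header, entries


sample = """;;Aligned by clustal
;no description
Sequence1
MAPG-SWF
RTYIPED
;no description
Sequence2
M----GQV
"""
header, entries = parse_mase(sample)
for entry in entries:
    print(entry["name"], entry["sequence"])
```

Gap characters ('-') are kept as they are, since they carry the alignment.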

Raw format

Allowed characters for nucleotide sequences are: A, a, C, c, G, g, T, t, U, u, Y, y, R, r. Line length is not fixed and may vary from one line to another.

Example:

TTTGATGAAAATCGCTTAGGCCTTGCTCTTCAAACAATCCAGCTTCTTTCACTC
TCAAGTTGCAAGAAGCAAGTGTAGCAATGTGCACGCGACAGCCGGGTGTGTGACGCTGG
CCAATCAGAGCGCAGAGCTCCGAAAGTTTACCTTTTATGGCTAGAGCCGGCATCTGC
CATATAAAAGAGCGCGCCCAGCGTCTCAGCCTCACTTTGAGCACACGCAGCTAG
TGCGGAATATCATCTGCCTGTAACCCATTCTCTAAAGTCGACAAACCCCCCCAAACCTAA
GGTGAGTTGATCT
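
Reading a raw-format sequence amounts to joining the lines and checking the characters. The sketch below is illustrative only: the names RAW_ALLOWED and read_raw are hypothetical, and it assumes the allowed set is the letters A, C, G, T, U, Y and R in either case.

```python
# Assumption: both cases of A, C, G, T, U, Y and R are allowed.
RAW_ALLOWED = set("AaCcGgTtUuYyRr")


def read_raw(text):
    """Join the lines of a raw-format sequence and check its characters."""
    sequence = "".join(line.strip() for line in text.splitlines())
    bad = sorted(set(sequence) - RAW_ALLOWED)
    if bad:
        raise ValueError("characters not allowed in raw format: " + "".join(bad))
    return sequence


sample = """TTTGATGAAAATCGC
TTAGGCCTTGCTCTTCAAACAATCC
GGTGAGTTGATCT
"""
print(len(read_raw(sample)))
```

Because lines are simply concatenated, lines of unequal length are handled without any special case.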