File Formats Used at PRABI-Doua
FASTA format
A sequence in FASTA format begins with a single-line description, followed by
lines of sequence data. The description line is distinguished from the sequence
data by a greater-than ('>') symbol in the first column. It is
recommended that all lines of text be shorter than 80 characters in length.
Example:
>MyGene
CCTCTCGGAGCTGGAAATGCAGCTATTGAGATCTTCGAATGCTGCGGAGCTGGAGGCGGA
GGCAGCTGGGGAGGTCCGAGCGATGTGACCAGGCCGCCATCGCTCGTCTCTTCCTCTCTC
CTGCCGCCTCCTGTGTCGAAAATAACTTTTTTAGTCTAAAGAAAGAAAGACAAAAGTAGT
CGTCCGCCCCTCACGCCCTCTCTTCCTCTCAGCCTTCCGCCCGGTGAGGAAGCCCGGGGT
GGCTGCTCCGCCGTCGGGGCCGCGCCGCCGAGCCCCAGCGCCCCGGGCCGCCCCCGCACG
CCGCCCCCATGCATCCCTTCTACACCCGGGCCGCCACCATGATAGGCGAGATCGCCGCCG
CCGTGTCCTTCATCTCCAAGTTTCTCCGCACCAAGGGGCTGACGAGCGAGCGACAGCTGC
AGACCTTCAGCCAGAGCCTGCAGGAGCTGCTGGCAGAACATTATAAACATCACTGGTTCC
CAGAAAAGCCATGCAAGGGATCGGGTTACCGTTGTATTCGCATCAACCATAAAATGGATC
or
>MyProtein
MAVTQTAQACDLVIFGAKGDLARRKLLPSLYQLEKAGQLNPDTRIIGVGRADWDKAAYTK
VVREALETFMKETIDEGLWDTLSARLDFCNLDVNDTAAFSRLGAMLDQKNRITINYFAMP
PSTFGAICKGLGEAKLNAKPARVVMEKPLGTSLATSQEINDQVGEYFEECQVYRIDHYLG
KETVLNLLALRFANSLFVNNWDNRTIDHVEITV
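The layout described above (a '>' description line followed by sequence lines) can be read and written with a few lines of Python. The following is a minimal sketch, not a reference implementation: the function names parse_fasta and format_fasta are hypothetical, and the default width of 60 characters is an assumption chosen to match the examples and to stay under the recommended 80 characters.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def format_fasta(description, sequence, width=60):
    """Write one record with sequence lines of at most `width` characters."""
    lines = [">" + description]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

Sequence lines are simply concatenated, so a file with lines of any length is read back as one continuous sequence per record.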
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
The nucleic acid codes supported are:
A --> adenosine           M --> A C (amino)
C --> cytidine            S --> G C (strong)
G --> guanine             W --> A T (weak)
T --> thymidine           B --> G T C
U --> uridine             D --> G A T
R --> G A (purine)        H --> A C T
Y --> T C (pyrimidine)    V --> G C A
K --> G T (keto)          N --> A G C T (any)
                          -  gap of indeterminate length
For those programs that use amino acid query sequences, the accepted amino acid codes are:
A  alanine                  P  proline
B  aspartate or asparagine  Q  glutamine
C  cysteine                 R  arginine
D  aspartate                S  serine
E  glutamate                T  threonine
F  phenylalanine            U  selenocysteine
G  glycine                  V  valine
H  histidine                W  tryptophan
I  isoleucine               Y  tyrosine
K  lysine                   Z  glutamate or glutamine
L  leucine                  X  any
M  methionine               *  translation stop
N  asparagine               -  gap of indeterminate length
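The two code tables above, together with the rules on case and digits, suggest a simple pre-submission check. The sketch below is only an illustration under stated assumptions: the names NUCLEIC_CODES, AMINO_CODES and normalize_sequence are hypothetical, and the choice to drop digits and whitespace (rather than replace digits with N or X) is one of the two options the text allows.

```python
# Accepted letters, taken from the two code tables above.
NUCLEIC_CODES = set("ACGTUMRWSYKVHDBN-")
AMINO_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def normalize_sequence(seq, alphabet):
    """Upper-case a sequence, drop digits and whitespace, and check its letters.

    Raises ValueError if any remaining character is not in `alphabet`.
    """
    cleaned = "".join(
        ch for ch in seq.upper() if not ch.isdigit() and not ch.isspace()
    )
    invalid = sorted(set(cleaned) - alphabet)
    if invalid:
        raise ValueError("unsupported characters: " + " ".join(invalid))
    return cleaned
```

For example, normalize_sequence("acgt 123n-", NUCLEIC_CODES) yields "ACGTN-", while a protein sequence containing J is rejected because J does not appear in the amino acid table.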
MASE format
This format is used to store nucleotide or protein multiple alignments.
The file must begin with a header of at least one line (the content of
this header may be empty). Header lines must begin with ';;'.
The body of the file has the following structure:
Each entry must begin with one or more comment lines. Comment
lines begin with the character ';'. Again, a comment line
may be empty. After the comments, the name of the sequence is written
on a separate line. Finally, the sequence itself is written on the following lines.
Example:
;;Aligned by clustal on Fri Jul 7 10:54:01 1995
;no description
Sequence1
MAPG-SWFSPLLIAVVTLGLP-QEAAATFPAMPLSNLFANAVLRAQHLHLLAAETYKEFE
RTYIPEDQRYTN-KNSQAAFCYSETIPAPTGKDDAQQKSDMELLRFSLVLIQSWLTPVQY
...
H-LRNEDALLKNYGLLSCFKKDLHKVETYLKVMKCRRFGESNCTI
;no description
Sequence2
M-------GQVFLLMPVLLVSCFLSQG--AAMENQRLFNIAVNRVQHLHLMAQKMFNDFE
GTLLPDERR-QLNKIFLLDFCNSDSIVSPIDKLETQKSSVLKLLHISFRLIESWEYPSQT
...
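The MASE structure described above (';;' header lines, then for each entry its ';' comment lines, a name line, and sequence lines) maps onto a small state machine. The following Python sketch is an assumption-laden illustration rather than the parser used by any particular program: the function name parse_mase and the dictionary keys are hypothetical, and ';;' lines are treated as header only while no entry has started.

```python
def parse_mase(text):
    """Return (header_lines, entries) parsed from MASE-formatted text.

    Each entry is a dict with the keys 'comments', 'name' and 'sequence'.
    """
    header = []
    entries = []
    comments, name, chunks = [], None, []

    def close_entry():
        entries.append(
            {"comments": comments, "name": name, "sequence": "".join(chunks)}
        )

    for line in text.splitlines():
        line = line.rstrip()
        at_top = not entries and name is None and not comments
        if line.startswith(";;") and at_top:
            header.append(line[2:])  # file header line
        elif line.startswith(";"):
            if name is not None:
                # A comment after sequence data starts the next entry.
                close_entry()
                comments, name, chunks = [], None, []
            comments.append(line[1:])
        elif not line.strip():
            continue  # ignore blank lines
        elif name is None:
            name = line.strip()  # first non-comment line is the name
        else:
            chunks.append(line.strip())
    if name is not None:
        close_entry()
    return header, entries
```

The '...' lines in the example mark elided data; this sketch would treat them as sequence text, so they should be removed before parsing.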
Raw format
Allowed characters for nucleotide sequences are: A, a, C, c, G, g, T, t, U, u, Y, y, R, r. The length of a line is not fixed and may vary from one line to another.
Example:
TTTGATGAAAATCGCTTAGGCCTTGCTCTTCAAACAATCCAGCTTCTTTCACTC
TCAAGTTGCAAGAAGCAAGTGTAGCAATGTGCACGCGACAGCCGGGTGTGTGACGCTGG
CCAATCAGAGCGCAGAGCTCCGAAAGTTTACCTTTTATGGCTAGAGCCGGCATCTGC
CATATAAAAGAGCGCGCCCAGCGTCTCAGCCTCACTTTGAGCACACGCAGCTAG
TGCGGAATATCATCTGCCTGTAACCCATTCTCTAAAGTCGACAAACCCCCCCAAACCTAA
GGTGAGTTGATCT
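Because raw files carry no description line and no fixed line length, reading one amounts to joining the lines and checking the characters. A minimal sketch, assuming the allowed set is exactly the letters listed above; the names RAW_ALLOWED and read_raw are hypothetical:

```python
# Characters listed above as allowed for raw nucleotide sequences.
RAW_ALLOWED = set("ACGTUYRacgtuyr")


def read_raw(text):
    """Join the lines of a raw-format sequence and validate its characters."""
    sequence = "".join(text.split())  # removes line breaks and spaces
    invalid = sorted(set(sequence) - RAW_ALLOWED)
    if invalid:
        raise ValueError("characters not allowed: " + " ".join(invalid))
    return sequence
```

Case is preserved here, since the format explicitly accepts both upper-case and lower-case letters.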