PRABI-Doua

Pôle Rhône-Alpes de Bioinformatique Site Doua

ACNUC glossary

Alphabetically sorted terms : acnuc - gcgacnuc, acnuctaxo, division, flat file, genetic code, help file, index file, label, tree

acnuc - gcgacnuc : two environment variables used by all ACNUC programs. $acnuc names the directory where index files are located, and $gcgacnuc names the directory where flat files are located. Flat files can be located in the directory pointed to by $gcgacnuc, or in subdirectories of it.
In the LBBE setup, acnuc flat files can also be stored in the lab's iRODS server. In that case, gcgacnuc is set to a value of the form irods://lbbeZone/home/.....
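As an illustration of how the two variables are read, here is a minimal Python sketch. It is not part of the ACNUC programs: the function name acnuc_locations and the example paths are hypothetical, and the only rules it applies are the ones stated above (both variables must be defined, and an irods:// prefix marks flat files stored on the iRODS server).

```python
import os


def acnuc_locations(env=None):
    """Return (index_dir, flat_location, flat_is_irods).

    index_dir comes from $acnuc and flat_location from $gcgacnuc.
    The helper name and return shape are illustrative assumptions.
    """
    env = os.environ if env is None else env
    index_dir = env.get("acnuc")
    flat_location = env.get("gcgacnuc")
    if index_dir is None or flat_location is None:
        raise RuntimeError("both acnuc and gcgacnuc must be defined")
    # A value starting with irods:// means flat files live on the iRODS server.
    return index_dir, flat_location, flat_location.startswith("irods://")


if __name__ == "__main__":
    # Hypothetical local paths, for demonstration only.
    demo = {"acnuc": "/db/genbank/index", "gcgacnuc": "/db/genbank/flat_files"}
    print(acnuc_locations(demo))
```

Passing a dictionary instead of the real environment makes the helper easy to try without changing the shell setup.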

acnuctaxo : an environment variable that defines the directory where the NCBI taxonomy files (nodes.dmp and names.dmp) are located. Programs acnucgener and readncbitaxo read these two files. These files come from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.

division : each flat file is called a database division. When running management programs, a division is generally referred to by its filename without extension (e.g. gbbct1). Division names are all stored in index file SMJYT. When the flat file is located in a subdirectory of $gcgacnuc, the name of this subdirectory is part of the division name.
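The naming rule can be sketched in Python as follows. This is only an illustration: the function name division_name is hypothetical, and the sketch assumes that the subdirectory and the file name are joined with a slash, which the text above does not specify.

```python
from pathlib import PurePosixPath


def division_name(flat_file, gcgacnuc):
    """Derive a division name from a flat file path.

    Assumption: the name is the path relative to $gcgacnuc with the
    extension removed, so a subdirectory becomes part of the name.
    """
    relative = PurePosixPath(flat_file).relative_to(PurePosixPath(gcgacnuc))
    return str(relative.with_suffix(""))


if __name__ == "__main__":
    # Hypothetical directory layout, for demonstration only.
    print(division_name("/db/flat/gbbct1.seq", "/db/flat"))
    print(division_name("/db/flat/bacteria/gbbct1.seq", "/db/flat"))
```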

flat file : acnuc databases index files that hold sequences together with their annotations. These are called flat files, that is, plain text files, as distributed by the database creators. GenBank flat files are named gbxxx.seq, EMBL and SwissProt flat files are named xxx.dat. Currently, flat files can exceed 4GB in size because two 4-byte integers are devoted to storing an address within that file. All acnuc programs, except compressnewdiv, access flat files in read-only mode. Flat files are always located in a directory whose name is given by environment variable gcgacnuc or in a subdirectory thereof.
In the LBBE setup, flat files can optionally be stored in the laboratory's iRODS server.
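To see why two 4-byte integers are enough to address files larger than 4GB, consider the Python sketch below. It is an assumption-laden illustration, not the actual acnuc record layout: it supposes that one integer holds the high 32 bits and the other the low 32 bits of the offset, both unsigned and little-endian, and the function names are hypothetical.

```python
import struct

# Assumed layout: two unsigned 32-bit little-endian integers (high, low).
_ADDRESS = struct.Struct("<II")


def pack_address(offset):
    """Split a file offset into two 4-byte integers (8 bytes in total)."""
    high, low = divmod(offset, 1 << 32)
    return _ADDRESS.pack(high, low)


def unpack_address(raw):
    """Rebuild the file offset from its two 4-byte halves."""
    high, low = _ADDRESS.unpack(raw)
    return (high << 32) | low


if __name__ == "__main__":
    five_gb = 5 * 1024 ** 3  # beyond what a single 4-byte integer can hold
    raw = pack_address(five_gb)
    print(len(raw), unpack_address(raw) == five_gb)
```

A single unsigned 4-byte integer stops at 4GB minus one byte, so the second integer is what lifts that limit.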

genetic code : nucleotide sequence databases use a number of variant genetic codes to properly translate CDS into protein sequences. These genetic codes, defined by NCBI and distributed together with the species classification, are identified by two numerical ids, one given by NCBI, one defined by acnuc. The genetic code used by each species is stored in acnuc in species labels.

List of defined genetic codes (* : stop codon)

NCBI code ID  acnuc code ID  differences from universal code  target taxon and genome
1             0              Universal genetic code           Standard
3             1              CUN=T AUA=M UGA=W                Yeast Mitochondrial
2             2              AGR=* AUA=M UGA=W                Vertebrate Mitochondrial
4             3              UGA=W                            Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial; Mycoplasma; Spiroplasma
5             4              AUA=M UGA=W AGR=S                Invertebrate Mitochondrial
12            5              CUG=S                            Alternative Yeast Nuclear
6             6              UAR=Q                            Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear
10            7              UGA=C                            Euplotid Nuclear
9             8              UGA=W AGR=S AAA=N                Echinoderm Mitochondrial; Flatworm Mitochondrial
13            9              UGA=W AGR=G AUA=M                Ascidian Mitochondrial
14            10             UGA=W AGR=S UAA=Y AAA=N          Alternative Flatworm Mitochondrial
15            11             UAG=Q                            Blepharisma Macronuclear
11            12             NUG=AUN=M when initiation codon  Bacterial, Archaeal and Plant Plastid
16            13             UAG=L                            Chlorophycean Mitochondrial
21            14             AUA=M UGA=W AGR=S AAA=N          Trematode Mitochondrial
22            15             UAG=L UCA=*                      Scenedesmus obliquus mitochondrial
23            16             UUA=*                            Thraustochytrium mitochondrial
24            17             UGA=W AGA=S AGG=K                Pterobranchia Mitochondrial
25            18             UGA=G                            Candidate Division SR1 and Gracilibacteria
26            19             CUG=A                            Pachysolen tannophilus Nuclear
27            20             UAR=Q UGA=W CUG=A                Karyorelict Nuclear
28            21             UAR=Q UGA=W CUG=A                Condylostoma Nuclear
29            22             UAR=Y CUG=A                      Mesodinium Nuclear
30            23             UAR=E CUG=A                      Peritrich Nuclear
31            24             UAR=E UGA=W                      Blastocrithidia Nuclear
32            25             UAG=W                            Balanophoraceae Plastid
33            26             UAA=Y UGA=W AGA=S AGG=K          Cephalodiscidae Mitochondrial
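
The "differences from universal code" notation can be applied mechanically, as the following Python sketch shows. It is an illustration and not acnuc code: the names variant_code, translate and NCBI_TO_ACNUC are hypothetical, R, Y and N are expanded as the usual purine, pyrimidine and any-base symbols, and the initiation-codon rule of the bacterial code is not handled. The id mapping simply copies the first two columns of the list above.

```python
from itertools import product

BASES = "UCAG"
# Universal code, codons enumerated in UCAG order (UUU, UUC, UUA, UUG, UCU, ...).
UNIVERSAL_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDDDEEEEGGGG"
UNIVERSAL = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), UNIVERSAL_AA)}

# Ambiguity symbols used in the list: R = A or G, Y = U or C, N = any base.
AMBIGUOUS = {"R": "AG", "Y": "UC", "N": "UCAG"}

# NCBI code ID -> acnuc code ID, copied from the list above.
NCBI_TO_ACNUC = {1: 0, 3: 1, 2: 2, 4: 3, 5: 4, 12: 5, 6: 6, 10: 7, 9: 8,
                 13: 9, 14: 10, 15: 11, 11: 12, 16: 13, 21: 14, 22: 15,
                 23: 16, 24: 17, 25: 18, 26: 19, 27: 20, 28: 21, 29: 22,
                 30: 23, 31: 24, 32: 25, 33: 26}


def variant_code(differences):
    """Build a codon table from rules such as 'AGR=*' or 'UGA=W'."""
    table = dict(UNIVERSAL)
    for rule in differences:
        pattern, amino_acid = rule.split("=")
        choices = [AMBIGUOUS.get(base, base) for base in pattern]
        for codon in product(*choices):
            table["".join(codon)] = amino_acid
    return table


def translate(rna, table):
    """Translate an RNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(rna) - len(rna) % 3
    return "".join(table[rna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    vertebrate_mito = variant_code(["AGR=*", "AUA=M", "UGA=W"])
    print(translate("AUAUGAAGA", UNIVERSAL))        # I*R
    print(translate("AUAUGAAGA", vertebrate_mito))  # MW*
```

The same three codons read as Ile, stop, Arg under the universal code and as Met, Trp, stop under the vertebrate mitochondrial code, which is why the correct code must be known for each CDS.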

help file : Text files HELP and HELP_WIN contain on-line help information for the query and query_win programs. They also contain summary information: name of the database, release number, and total numbers of sequences, references and residues. Both files are located in the $acnuc directory.

index file : acnuc databases are made up of a series of index files (see physical structure) that allow efficient access to sequence files according to various retrieval criteria. Index files are ACCESS, AUTHOR, BIBLIO, EXTRACT (not for protein databases), KEYWORDS, LOCUS, LONGL, MERES (optional, serves only to allow quicker launch), SHORTL, SHORTL2 (optional), SPECIES, SUBSEQ, TAXIDS (optional, to implement retrieval by taxon ID), TAXTREE (optional, contains all species tree information), TEXT. Acnuc index files are always located in a directory whose name is given by environment variable acnuc.
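The list above can be turned into a quick completeness check, sketched here in Python. This is a hypothetical helper, not an ACNUC program: it assumes that each index file is stored under exactly the name given above, and it only covers the files named in this entry.

```python
# File names taken from the index file entry above.
REQUIRED = ["ACCESS", "AUTHOR", "BIBLIO", "KEYWORDS", "LOCUS", "LONGL",
            "SHORTL", "SPECIES", "SUBSEQ", "TEXT"]
NUCLEOTIDE_ONLY = ["EXTRACT"]  # not used by protein databases
OPTIONAL = ["MERES", "SHORTL2", "TAXIDS", "TAXTREE"]


def missing_index_files(present_names, protein=False):
    """Return the non-optional index files absent from present_names."""
    expected = REQUIRED + ([] if protein else NUCLEOTIDE_ONLY)
    present = set(present_names)
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Hypothetical directory listing, for demonstration only.
    listing = REQUIRED + ["TAXIDS"]
    print(missing_index_files(listing))                # nucleotide database
    print(missing_index_files(listing, protein=True))  # protein database
```

In practice the listing could come from os.listdir applied to the directory named by the acnuc environment variable.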

label : species, keywords, journal codes and type names can each optionally have a label, which is a descriptive character string stored in index file TEXT.
For species, the label can also store genetic code information: the label begins with mtgc:#| to give the number of the mitochondrial genetic code of this species, or with gc:#| to give the number of its nuclear genetic code. A species label can also store NCBI's taxon ID information: the label starts with id:#| .
For gene family databases such as Hovergen and Hobacgen, the label can also store taxonomic level information such as [species] or [suborder].
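A species label of this kind can be taken apart with a short Python sketch. It rests on assumptions that the text above does not settle: that several prefixes (mtgc:#|, gc:#|, id:#|) may follow one another at the start of the label, in any order, and that # is a decimal number. The function name parse_species_label and the example label are hypothetical.

```python
import re

# One leading field such as "mtgc:2|", "gc:1|" or "id:9606|".
_PREFIX = re.compile(r"(mtgc|gc|id):(\d+)\|")


def parse_species_label(label):
    """Return (fields, remaining_text) for a species label.

    fields maps 'mtgc', 'gc' and/or 'id' to the integer found in the
    label; remaining_text is whatever follows the recognised prefixes.
    """
    fields = {}
    position = 0
    while True:
        match = _PREFIX.match(label, position)
        if match is None:
            break
        fields[match.group(1)] = int(match.group(2))
        position = match.end()
    return fields, label[position:]


if __name__ == "__main__":
    # Hypothetical label, for demonstration only.
    print(parse_species_label("mtgc:2|id:9606|example description"))
```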

tree : species (or, more generally, taxon names) and keywords are organized in two trees in acnuc databases. As a result, selecting a tree node selects all sequences attached to that node and to all nodes placed below it in the tree. For taxon names, the tree structure is extensive (nearly all nodes are properly placed in the tree) and follows NCBI's classification of species. For keywords, the tree structure is very sparse (most keywords are at the tree top, with nothing below), but it proves useful to organize some logically related keywords.
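The selection rule can be sketched in Python as a simple traversal. The data structures here are assumptions made for the example (two dictionaries, one giving the child nodes of each node and one giving the sequences attached to each node) and do not reflect how the acnuc index files store the tree. The taxon and sequence names are invented.

```python
def select_from_node(children, attached, node):
    """Collect sequences attached to node and to every node below it.

    children: dict mapping a node name to the names of its child nodes.
    attached: dict mapping a node name to the sequences attached to it.
    """
    selected = set()
    visited = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        selected.update(attached.get(current, ()))
        stack.extend(children.get(current, ()))
    return selected


if __name__ == "__main__":
    # Hypothetical miniature taxon tree and sequence names.
    children = {"genus A": ["species A1", "species A2"]}
    attached = {"species A1": ["SEQ1", "SEQ2"], "species A2": ["SEQ3"]}
    print(sorted(select_from_node(children, attached, "genus A")))
    print(sorted(select_from_node(children, attached, "species A2")))
```

Selecting the upper node returns the sequences of both species below it, while selecting a leaf returns only its own sequences.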