ACNUC glossary
Alphabetically sorted terms : acnuc - gcgacnuc, acnuctaxo, division, flat file, genetic code, help file, index file, label, tree
acnuc - gcgacnuc : two environment
variables used by all ACNUC programs to define the name of the directories where
index and flat files are located, respectively.
Flat files can be located in the directory pointed to by $gcgacnuc, or in subdirectories of it.
In the LBBE setup, acnuc flat files can also be stored in the lab's iRODS server.
In that case gcgacnuc is set to a value of the form irods://lbbeZone/home/.....
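
As a rough illustration of how a helper script might resolve these two variables, the Python sketch below reads them from the environment and checks whether the flat files are on a local directory or on an iRODS location. The function name `acnuc_locations` and the layout of the returned dictionary are hypothetical and not part of ACNUC.

```python
import os


def acnuc_locations(env=None):
    """Return where index and flat files are expected, based on the environment.

    `env` defaults to os.environ; passing a plain dict makes the function
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    index_dir = env.get("acnuc")      # directory holding the index files
    flat_root = env.get("gcgacnuc")   # directory (or iRODS URL) holding flat files
    if index_dir is None or flat_root is None:
        raise KeyError("both 'acnuc' and 'gcgacnuc' must be set")
    return {
        "index_dir": index_dir,
        "flat_root": flat_root,
        # In the LBBE setup the value may be of the form irods://...
        "flat_on_irods": flat_root.startswith("irods://"),
    }
```

Calling `acnuc_locations()` with no argument uses the variables of the current process.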
acnuctaxo : an environment variable that defines the directory where the NCBI taxonomy files (nodes.dmp and names.dmp) are located. Programs acnucgener and readncbitaxo read these two files. These files come from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.
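
For readers who want to inspect these files themselves, here is a minimal Python sketch for reading the NCBI taxdump format, in which fields are separated by a tab, a vertical bar and a tab, and each row ends with a tab and a vertical bar. It relies on the first two columns of nodes.dmp being the taxon ID and the parent taxon ID. The function names are hypothetical, and this is not how acnucgener or readncbitaxo are implemented.

```python
def split_dmp_line(line):
    """Split one line of an NCBI taxdump .dmp file into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]          # drop the row terminator
    return [field.strip() for field in line.split("\t|\t")]


def read_parents(nodes_path):
    """Map each taxon ID to its parent taxon ID, as listed in nodes.dmp."""
    parents = {}
    with open(nodes_path, encoding="utf-8") as handle:
        for line in handle:
            fields = split_dmp_line(line)
            parents[int(fields[0])] = int(fields[1])
    return parents
```

A call such as `read_parents(os.path.join(os.environ["acnuctaxo"], "nodes.dmp"))` would use the directory named by the acnuctaxo variable (after `import os`).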
division : each flat file is called a database division. Divisions are generally referred to by the filename without its extension (e.g. gbbct1) when running management programs. All division names are stored in index file SMJYT. When the flat file is located in a subdirectory of $gcgacnuc, the name of this subdirectory is part of the division name.
flat file : acnuc databases index sequence +
annotation files called flat files, that is, plain text files, as distributed by the database
creators. GenBank flat files are named gbxxx.seq, EMBL and SwissProt flat files are named xxx.dat.
Currently, flat files can exceed 4GB in size because two 4-byte
integers are devoted to storing an address within that file. All acnuc programs, except
compressnewdiv, access flat files in read-only mode. Flat
files are always located in a directory whose name is given by environment variable gcgacnuc
or in a subdirectory thereof.
In the LBBE setup, flat files can optionally be stored in the laboratory's iRODS server.
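
To make the arithmetic behind the 4GB remark concrete: one unsigned 4-byte integer can address at most 2^32 bytes (4GB), whereas a pair of them can represent a 64-bit offset. The Python sketch below shows one way such a pair could be combined. The split into a high and a low word, and the function names, are assumptions made for illustration; the text above does not describe how acnuc actually lays out the two integers.

```python
MASK32 = 0xFFFFFFFF  # largest value of an unsigned 4-byte integer


def combine_address(high, low):
    """Build a 64-bit file offset from two unsigned 4-byte integers."""
    if not (0 <= high <= MASK32 and 0 <= low <= MASK32):
        raise ValueError("both parts must fit in 4 bytes")
    return (high << 32) | low


def split_address(offset):
    """Split a file offset into (high, low) unsigned 4-byte integers."""
    if not 0 <= offset < (1 << 64):
        raise ValueError("offset must fit in 8 bytes")
    return offset >> 32, offset & MASK32
```

With this layout, any offset below 4GB has a high word of 0, and larger offsets simply use a non-zero high word.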
genetic code : nucleotide sequence databases use a number of variant genetic codes to properly translate CDS into protein sequences. These genetic codes, defined by NCBI and distributed together with the species classification, are identified by two numerical ids, one given by NCBI, one defined by acnuc. Which genetic code each species uses is stored in acnuc in species labels.
NCBI code ID | acnuc code ID | differences from universal code (* : stop codons) | target taxon and genome |
---|---|---|---|
1 | 0 | Universal genetic code | Standard |
3 | 1 | CUN=T AUA=M UGA=W | Yeast Mitochondrial |
2 | 2 | AGR=* AUA=M UGA=W | Vertebrate Mitochondrial |
4 | 3 | UGA=W | Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial; Mycoplasma; Spiroplasma |
5 | 4 | AUA=M UGA=W AGR=S | Invertebrate Mitochondrial |
12 | 5 | CUG=S | Alternative Yeast Nuclear |
6 | 6 | UAR=Q | Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear |
10 | 7 | UGA=C | Euplotid Nuclear |
9 | 8 | UGA=W AGR=S AAA=N | Echinoderm Mitochondrial; Flatworm Mitochondrial |
13 | 9 | UGA=W AGR=G AUA=M | Ascidian Mitochondrial |
14 | 10 | UGA=W AGR=S UAA=Y AAA=N | Alternative Flatworm Mitochondrial |
15 | 11 | UAG=Q | Blepharisma Macronuclear |
11 | 12 | NUG=AUN=M when initiation codon | Bacterial, Archaeal and Plant Plastid |
16 | 13 | UAG=L | Chlorophycean Mitochondrial |
21 | 14 | AUA=M UGA=W AGR=S AAA=N | Trematode Mitochondrial |
22 | 15 | UAG=L UCA=* | Scenedesmus obliquus mitochondrial |
23 | 16 | UUA=* | Thraustochytrium mitochondrial |
24 | 17 | UGA=W AGA=S AGG=K | Pterobranchia Mitochondrial |
25 | 18 | UGA=G | Candidate Division SR1 and Gracilibacteria |
26 | 19 | CUG=A | Pachysolen tannophilus Nuclear |
27 | 20 | UAR=Q, UGA=W, CUG=A | Karyorelict Nuclear |
28 | 21 | UAR=Q, UGA=W, CUG=A | Condylostoma Nuclear |
29 | 22 | UAR=Y, CUG=A | Mesodinium Nuclear |
30 | 23 | UAR=E, CUG=A | Peritrich Nuclear |
31 | 24 | UAR=E, UGA=W | Blastocrithidia Nuclear |
32 | 25 | UAG=W | Balanophoraceae Plastid |
33 | 26 | UAA=Y, UGA=W, AGA=S, AGG=K | Cephalodiscidae Mitochondrial |
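
The two ID columns of the table above can be transcribed as a lookup table when converting between the numbering schemes. The Python sketch below does only that. The names `NCBI_TO_ACNUC` and `ACNUC_TO_NCBI` are hypothetical, and the values are copied from the table rather than taken from any acnuc source file.

```python
# NCBI genetic code ID -> acnuc genetic code ID, transcribed from the table.
NCBI_TO_ACNUC = {
    1: 0, 3: 1, 2: 2, 4: 3, 5: 4, 12: 5, 6: 6, 10: 7, 9: 8,
    13: 9, 14: 10, 15: 11, 11: 12, 16: 13, 21: 14, 22: 15,
    23: 16, 24: 17, 25: 18, 26: 19, 27: 20, 28: 21, 29: 22,
    30: 23, 31: 24, 32: 25, 33: 26,
}

# Reverse lookup: acnuc genetic code ID -> NCBI genetic code ID.
ACNUC_TO_NCBI = {acnuc: ncbi for ncbi, acnuc in NCBI_TO_ACNUC.items()}
```

For example, `NCBI_TO_ACNUC[11]` gives the acnuc ID of the Bacterial, Archaeal and Plant Plastid code, and `ACNUC_TO_NCBI[1]` gives the NCBI ID of the Yeast Mitochondrial code.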
help file : Text files HELP and HELP_WIN contain on-line help information for the query and query_win programs. They also contain summary information: name of database, release number, and total numbers of sequences, references and residues. Both files are located in the $acnuc directory.
index file : acnuc databases are made up of a series of index files (see physical structure) that allow efficient access to sequence files according to various retrieval criteria. Index files are ACCESS, AUTHOR, BIBLIO, EXTRACT (not for protein databases), KEYWORDS, LOCUS, LONGL, MERES (optional, serves only to allow quicker launch), SHORTL, SHORTL2 (optional), SPECIES, SUBSEQ, TAXIDS (optional, to implement retrieval by taxon ID), TAXTREE (optional, contains all species tree information), TEXT. Acnuc index files are always located in a directory whose name is given by environment variable acnuc.
label : species, keywords, journal codes, type
names all optionally have a label which is a descriptive character string stored in
index file TEXT.
For species, the label can also store genetic code information: the label begins
with mtgc:#| or with gc:#| to give the number of the mitochondrial or nuclear
genetic code, respectively, of this species. A species label can also store NCBI's taxon ID information:
the label starts with id:#| .
For gene family databases such as Hovergen and Hobacgen, the label can also store taxonomic level
information such as [species] or [suborder].
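
The prefixes described above can be picked apart with a small parser. The Python sketch below assumes that each prefix has the form `name:digits|`, that several prefixes may follow one another at the start of the label in any order, and that whatever follows is free descriptive text. These are assumptions about the layout, not a specification, and the function name is hypothetical. The numbers are returned as plain integers without further interpretation.

```python
import re

# One leading prefix: mtgc, gc or id, then ':', digits and a closing '|'.
_PREFIX = re.compile(r"(mtgc|gc|id):(\d+)\|")


def parse_species_label(label):
    """Return (info, rest): leading prefix values and the remaining text."""
    info = {}
    pos = 0
    while True:
        match = _PREFIX.match(label, pos)
        if match is None:
            break
        info[match.group(1)] = int(match.group(2))
        pos = match.end()
    return info, label[pos:]
```

A label without any recognised prefix is returned unchanged together with an empty dictionary.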
tree : species (or, more generally, taxon names) and keywords are organized in two trees in acnuc databases. As a result, selecting a tree node also selects all sequences attached to all nodes placed below it in the tree. For taxon names, the tree structure is extensive (nearly all nodes are properly placed in the tree) and follows NCBI's classification of species. For keywords, the tree structure is very sparse (most keywords are at the tree top, with nothing below), but it proves useful to organize some logically related keywords.
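
The selection rule for trees can be sketched as a simple traversal: starting from the chosen node, visit every node below it and gather the attached sequences. In the Python sketch below, the two dictionaries `children` and `sequences` are a hypothetical in-memory representation chosen for illustration. They do not reflect how acnuc index files store the tree.

```python
def select_subtree(children, sequences, node):
    """Collect sequences attached to `node` and to every node below it.

    children:  dict mapping a node name to a list of its child node names
    sequences: dict mapping a node name to the sequence names attached to it
    """
    selected = set()
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue              # guard against visiting a node twice
        seen.add(current)
        selected.update(sequences.get(current, ()))
        stack.extend(children.get(current, ()))
    return selected
```

Selecting a leaf returns only its own sequences, while selecting a node near the top of the tree returns the union over all of its descendants.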