PRABI-Doua: ACNUC glossary

ACNUC glossary

Alphabetically sorted terms : acnuc - gcgacnuc, acnuctaxo, division, flat file, genetic code, help file, index file, label, tree

acnuc - gcgacnuc : two environment variables used by all ACNUC programs to define the directories where the index files (acnuc) and the flat files (gcgacnuc) are located. Flat files can be located in the directory pointed to by $gcgacnuc, or in subdirectories of it.
In the LBBE setup, acnuc flat files can also be stored in the lab's iRODS server. In that case gcgacnuc is set to a value of the form irods://lbbeZone/home/.....
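
As an illustration only, the lookup of these two variables could be sketched as follows in Python. The function name acnuc_locations is hypothetical and is not part of the ACNUC programs; the sketch merely assumes that a gcgacnuc value starting with irods:// designates the iRODS server rather than a local directory.

```python
import os
from pathlib import Path


def acnuc_locations(env=os.environ):
    """Return (index_dir, flat_location) read from the acnuc and gcgacnuc variables.

    Hypothetical helper: flat_location is a Path for a local directory,
    or the raw string when the value is an irods:// location.
    """
    try:
        index_dir = Path(env["acnuc"])
        flat = env["gcgacnuc"]
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None
    # An irods:// value is not a local directory, so it is returned unchanged.
    if flat.startswith("irods://"):
        return index_dir, flat
    return index_dir, Path(flat)
```

Passing a plain dictionary instead of os.environ makes the helper easy to try without changing the real environment.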

acnuctaxo : an environment variable that defines the directory where the NCBI taxonomy files (nodes.dmp and names.dmp) are located. Programs acnucgener and readncbitaxo read these two files. These files come from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.

division : each flat file is called a database division. When running management programs, a division is generally referred to by its filename without extension (e.g. gbbct1). Division names are all stored in index file SMJYT. When the flat file is located in a subdirectory of $gcgacnuc, the name of this subdirectory is part of the division name.

flat file : acnuc databases index sequence and annotation files called flat files, that is, plain text files as distributed by the database creators. GenBank flat files are named gbxxx.seq; EMBL and SwissProt flat files are named xxx.dat. Flat files can currently exceed 4GB in size because two 4-byte integers are devoted to storing an address within a file. All acnuc programs except compressnewdiv access flat files in read-only mode. Flat files are always located in the directory whose name is given by environment variable gcgacnuc, or in a subdirectory thereof.
In the LBBE setup, flat files can optionally be stored in the laboratory's iRODS server.
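
To see why two 4-byte integers are enough to address files larger than 4GB, here is a minimal Python sketch that splits a file offset into a high and a low 32-bit word and recombines them. The byte order and the high/low ordering are assumptions made for the example, and the function names are hypothetical; the actual on-disk layout of the acnuc index files is not described here.

```python
import struct


def pack_address(offset):
    """Split a file offset into two unsigned 4-byte integers (high word, low word)."""
    if not 0 <= offset < 2**64:
        raise ValueError("offset out of range for two 4-byte integers")
    # ">II" = big-endian, two unsigned 32-bit integers (assumed layout).
    return struct.pack(">II", offset >> 32, offset & 0xFFFFFFFF)


def unpack_address(raw):
    """Rebuild the file offset from the two 4-byte integers."""
    high, low = struct.unpack(">II", raw)
    return (high << 32) | low
```

A single 4-byte integer stops at 2**32 - 1 bytes (just under 4GB); with the second integer holding the upper bits, an offset of 5 * 2**30 (5GB) round-trips unchanged.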

genetic code : nucleotide sequence databases use a number of variant genetic codes to properly translate CDS into protein sequences. These genetic codes, defined by NCBI and distributed together with the species classification, are identified by two numerical ids, one given by NCBI and one defined by acnuc. Which genetic code is used by which species is stored in acnuc in species labels.

List of defined genetic codes
ncbi code ID	acnuc code ID	differences from universal code (* : stop codons)	target taxon and genome
1	0	Universal genetic code	Standard
3	1	CUN=T AUA=M UGA=W	Yeast Mitochondrial
2	2	AGR=* AUA=M UGA=W	Vertebrate Mitochondrial
4	3	UGA=W	Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial; Mycoplasma; Spiroplasma
5	4	AUA=M UGA=W AGR=S	Invertebrate Mitochondrial
12	5	CUG=S	Alternative Yeast Nuclear
6	6	UAR=Q	Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear
10	7	UGA=C	Euplotid Nuclear
9	8	UGA=W AGR=S AAA=N	Echinoderm Mitochondrial; Flatworm Mitochondrial
13	9	UGA=W AGR=G AUA=M	Ascidian Mitochondrial
14	10	UGA=W AGR=S UAA=Y AAA=N	Alternative Flatworm Mitochondrial
15	11	UAG=Q	Blepharisma Macronuclear
11	12	NUG=M and AUN=M when used as initiation codon	Bacterial, Archaeal and Plant Plastid
16	13	UAG=L	Chlorophycean Mitochondrial
21	14	AUA=M UGA=W AGR=S AAA=N	Trematode Mitochondrial
22	15	UAG=L UCA=*	Scenedesmus obliquus mitochondrial
23	16	UUA=*	Thraustochytrium mitochondrial
24	17	UGA=W AGA=S AGG=K	Pterobranchia Mitochondrial
25	18	UGA=G	Candidate Division SR1 and Gracilibacteria
26	19	CUG=A	Pachysolen tannophilus Nuclear
27	20	UAR=Q, UGA=W	Karyorelict Nuclear
28	21	UAR=Q, UGA=W	Condylostoma Nuclear
29	22	UAR=Y	Mesodinium Nuclear
30	23	UAR=E	Peritrich Nuclear
31	24	UAR=E, UGA=W	Blastocrithidia Nuclear
32	25	UAG=W	Balanophoraceae Plastid
33	26	UAA=Y, UGA=W, AGA=S, AGG=K	Cephalodiscidae Mitochondrial
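
The table above can be put to use programmatically. The following Python sketch copies the NCBI-to-acnuc id correspondence from the table and expands a "differences" entry such as AGR=* AUA=M UGA=W into concrete codons. The names NCBI_TO_ACNUC, expand and overrides are hypothetical, chosen for this example only. The row for NCBI code 11, which describes initiation codons rather than a plain codon reassignment, is not handled.

```python
# NCBI code id -> acnuc code id, copied from the table above.
NCBI_TO_ACNUC = {
    1: 0, 3: 1, 2: 2, 4: 3, 5: 4, 12: 5, 6: 6, 10: 7, 9: 8,
    13: 9, 14: 10, 15: 11, 11: 12, 16: 13, 21: 14, 22: 15,
    23: 16, 24: 17, 25: 18, 26: 19, 27: 20, 28: 21, 29: 22,
    30: 23, 31: 24, 32: 25, 33: 26,
}
ACNUC_TO_NCBI = {acnuc_id: ncbi_id for ncbi_id, acnuc_id in NCBI_TO_ACNUC.items()}

# Ambiguity symbols used in the codon patterns of the table.
AMBIGUITY = {"R": "AG", "N": "ACGU"}


def expand(pattern):
    """Expand a codon pattern such as 'AGR' into the concrete codons it covers."""
    codons = [""]
    for symbol in pattern:
        codons = [c + base for c in codons for base in AMBIGUITY.get(symbol, symbol)]
    return codons


def overrides(differences):
    """Turn an entry such as 'AGR=* AUA=M UGA=W' into a {codon: amino acid} dict."""
    table = {}
    for item in differences.replace(",", " ").split():
        pattern, amino_acid = item.split("=")
        for codon in expand(pattern):
            table[codon] = amino_acid
    return table
```

Applying such an override dictionary on top of a universal-code table would give the full translation table of a variant code; the universal table itself is left out to keep the sketch short.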

help file : Text files HELP and HELP_WIN contain on-line help information for the query and query_win programs. They also contain summary information: name of the database, release number, and total numbers of sequences, references and residues. Both files are located in the $acnuc directory.

index file : acnuc databases are made up of a series of index files (see physical structure) that allow efficient access to sequence files according to various retrieval criteria. Index files are ACCESS, AUTHOR, BIBLIO, EXTRACT (not for protein databases), KEYWORDS, LOCUS, LONGL, MERES (optional, serves only to allow quicker launch), SHORTL, SHORTL2 (optional), SMJYT, SPECIES, SUBSEQ, TAXIDS (optional, to implement retrieval by taxon ID), TAXTREE (optional, contains all species tree information), TEXT. Acnuc index files are always located in the directory whose name is given by environment variable acnuc.

label : species, keywords, journal codes, type names all optionally have a label which is a descriptive character string stored in index file TEXT.
For species, the label can also store genetic code information: the label begins with mtgc:#| or with gc:#| to give the number of the mitochondrial or nuclear genetic code, respectively, of this species. A species label can also store NCBI's taxon ID information: the label starts with id:#| .
For gene family databases such as Hovergen and Hobacgen, the label can also store taxonomic level information such as [species] or [suborder].
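
A species label of this form can be taken apart with a few lines of Python. This is a sketch under one stated assumption: that several prefix fields (mtgc, gc, id) may follow each other at the start of the same label, which the description above does not spell out. The function name parse_species_label and the sample label used to try it are hypothetical.

```python
import re

# One prefix field: a name, a colon, a number, and a closing vertical bar.
FIELD = re.compile(r"(mtgc|gc|id):(\d+)\|")


def parse_species_label(label):
    """Return (fields, remaining_text) for a species label.

    fields maps 'mtgc', 'gc' or 'id' to the integer found in the label.
    Assumption: prefix fields may be chained at the start of the label.
    """
    fields = {}
    pos = 0
    while True:
        match = FIELD.match(label, pos)
        if match is None:
            break
        fields[match.group(1)] = int(match.group(2))
        pos = match.end()
    return fields, label[pos:]
```

A label without any recognized prefix is returned unchanged, with an empty dictionary of fields.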

tree : species (or, more generally, taxon names) and keywords are organized in two trees in acnuc databases. The effect of this is that selecting a tree node selects all sequences attached to that node and to all nodes placed below it in the tree. For taxon names, the tree structure is extensive (nearly all nodes are properly placed in the tree) and follows NCBI's classification of species. For keywords, the tree structure is very sparse (most keywords are at the tree top, with nothing below), but it proves useful to organize some logically related keywords.
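
The selection rule for tree nodes can be illustrated with a short Python sketch. The two dictionaries (children of each node, sequences attached to each node) and the function name select are hypothetical in-memory stand-ins; they are not the way acnuc index files store the trees.

```python
def select(node, children, attached):
    """Collect the sequences attached to a node and to every node below it."""
    selected = set()
    stack = [node]
    while stack:
        current = stack.pop()
        # Sequences attached directly to the current node.
        selected.update(attached.get(current, ()))
        # Continue with all nodes placed just below the current one.
        stack.extend(children.get(current, ()))
    return selected


# Toy data, invented for the example.
children = {"Primates": ["Homo", "Pan"], "Homo": ["Homo sapiens"]}
attached = {"Homo sapiens": {"SEQ1", "SEQ2"}, "Pan": {"SEQ3"}, "Mus": {"SEQ4"}}
```

With this toy data, selecting Primates gathers SEQ1, SEQ2 and SEQ3, while selecting Homo gathers only SEQ1 and SEQ2.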