PRABI-Doua: ACNUC physical structure

ACNUC physical structure

An ACNUC database is made of a series of index files ( ACCESS, AUTHOR, BIBLIO, EXTRACT, KEYWORDS , LOCUS, LONGL, SHORTL, SHORTL2 (optional), SHORTL_KEY (optional), SMJYT, SPECIES, SUBSEQ, TEXT, MERES (optional), TAXIDS (optional), TAXTREE (optional) ) that allow efficient access to sequences and annotations through a variety of selection criteria. Sequences and annotations are stored in flat files (e.g., fun.dat, gbbct1.seq) created by the database producers (e.g., EMBL, GenBank, SwissProt) that are accessed by ACNUC in a strictly readonly mode.


Index files are made of a series of fixed-length records containing several fields that are 4-byte unsigned integer values except when indicated. Records are referred to by their number or rank, counting from 1. Binary integer values can be either all big-endian or all little-endian.

The first record of all index files (except MERES, TAXIDS and TAXTREE) follows this structure :
total |sorting state| end_sorted|

total: number of last written record in index file.
sorting state: a 6-char string that may be "SORTED" or "1/2SOR", indicating that records are alphabetically sorted or partially sorted, respectively; anything else means the file may not be sorted.
end_sorted: only if 1/2SOR, gives the rank of the last alphabetically sorted record.
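
As a rough illustration of the header record described above, the following Python sketch decodes it from raw bytes. The function name parse_header is hypothetical, and the byte layout (4-byte total, 6-char sorting state, 4-byte end_sorted, stored back to back with no padding) is an assumption, since the text above does not spell out the exact offsets.

```python
import struct

def parse_header(record, big_endian=True):
    """Decode the first record of an index file (assumed packed layout)."""
    order = ">" if big_endian else "<"
    # Assumption: total, sorting state and end_sorted follow each other
    # with no padding bytes in between (4 + 6 + 4 = 14 bytes).
    total, state, end_sorted = struct.unpack(order + "I6sI", record[:14])
    state = state.decode("ascii", errors="replace")
    if state == "SORTED":
        sorted_upto = total        # every record is in alphabetical order
    elif state == "1/2SOR":
        sorted_upto = end_sorted   # only records up to this rank are sorted
    else:
        sorted_upto = 0            # the file may not be sorted at all
    return total, state, sorted_upto
```

A caller could use the third returned value to decide between a binary search (over the sorted part) and a linear scan (over the rest).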

Parameters:
L_MNEMO = length of sequence names (variable in new format, fixed to 16 in old one)
WIDTH_SP = length of species names (variable in new format, fixed to 40 in old one)
WIDTH_KW = length of keywords (variable in new format, fixed to 40 in old one)
WIDTH_AUT = length of author names (variable in new format, fixed to 20 in old one)
WIDTH_BIB = length of BIBLIO names (variable in new format, fixed to 40 in old one)
WIDTH_SMJ = length of code in file SMJYT (variable in new format, fixed to 20 in old one)
SUBINLNG = number of SUBSEQ pointers in LONGL records (variable in new format, fixed to 63 in old one)
VALINSHRT2 = number of values in SHORTL2 records (0 means file SHORTL2 is not used; 1 and 2 are not permitted)
VALINSHRT_KEY = number of values in SHORTL_KEY records (0 means file SHORTL_KEY is not used; 1 and 2 are not permitted; VALINSHRT2 must be non zero for VALINSHRT_KEY to be non zero)
ACC_LENGTH = value ≥8 read at run-time when the database is opened
lrtxt = length of records of TEXT file (variable in new format, fixed to 60 in old one)
hsub, hkwsp : control hashings of seq names, species and keywords.

SUBSEQ one record for each parent or sub-sequence

name |length|  type  |pext  P:≤0 , S:>0 | plkey   | plinf        |    phase     |  h   |
     |      |to SMJYT| P: subseq list   |SHORTL(2)|P: LOCUS      |100*code+frame|SUBSEQ|
                     | S: to EXTRACT    |         |S: feat start |

name : padded by spaces to L_MNEMO uppercase characters; subsequences are named by adding a dot and an extension to their parent's name.
length : when 0, indicates a deleted record; deleted records appear in the list starting at record #3 of file LONGL.
type : to SMJYT, for seq type.
pext : its sign determines if parent (P, when ≤0) or sub-sequence (S, when >0);

> 0 : to EXTRACT for start of chain of corresponding exons
= 0 : this is a parent sequence without subsequence
< 0 : - pext is to LONGL for start of long list of subsequences.

plkey : to SHORTL or SHORTL2 or SHORTL_KEY, for list of attached keywords.
plinf : if Parent, to LOCUS for corresponding record; if Subsequence, to SHORTL for start of annotations.
phase : for protein-coding subseqs, combination of genetic code and reading frame (0,1,2) information according to 100*code+frame, or 0.
h : next element of chain of SUBSEQ records with same hashing value, 0 at end of chain.
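
The sign convention of pext and the packing of phase can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes pext has already been read as a signed 32-bit integer (it must be signed for the convention above to work).

```python
def describe_pext(pext):
    """Return (kind, target_file, record_number) for a SUBSEQ pext value.

    Assumes pext was decoded as a signed integer.
    """
    if pext > 0:
        # Sub-sequence: pext is the first EXTRACT record of its exon chain.
        return ("subsequence", "EXTRACT", pext)
    if pext < 0:
        # Parent with sub-sequences: -pext is the start of the LONGL list.
        return ("parent", "LONGL", -pext)
    # Parent sequence without any sub-sequence.
    return ("parent", "", 0)

def split_phase(phase):
    """Split phase = 100*code + frame into (genetic_code, reading_frame)."""
    code, frame = divmod(phase, 100)
    return code, frame
```

For example, split_phase(402) yields genetic code 4 and reading frame 2, and describe_pext(-17) points to LONGL record 17.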

SUBSEQ records can be sorted (by programs newordalphab and sortsubseq), and if so, they are alphabetically sorted at the parent sequence level and by order of appearance in annotations at the subsequence level.

LOCUS one record for each parent sequence

sub   |pnuc|pinf| pnuc2 |pinf2 |spec       |host   |plref    |molec|placc    |stat | org | div |date|
SUBSEQ|    |    |       |      |N:SPECIES  |SPECIES|SHORTL(2)|SMJYT|SHORTL(2)|SMJYT|SMJYT|     |    |
                               |P:SHORTL(2)|

sub : to SUBSEQ for corresponding record; 0 for a deleted record.
pnuc, pinf : least-significant half of the 64-bit address in flat file of rank div of start of sequence (pnuc) and annotations (pinf).
pnuc2, pinf2 : most-significant half of these 64-bit addresses.
spec : to SPECIES for corresponding species in nucleotide db (N); to SHORTL or SHORTL2 for list of species in protein db (P).
host : (implemented in EMBL only) to SPECIES for organism host to the sequence.
plref : to SHORTL or SHORTL2 for list of attached references.
molec : to SMJYT for molecule.
placc : to SHORTL or SHORTL2 for list of attached accession numbers.
stat : to SMJYT for status.
org : to SMJYT for organelle.
div : rank of flat file (see SMJYT ) where the sequence appears.
date : 11-char field following format dd-MMM-yyyy for date of seq entry in database.
(in the old ACNUC format, date is a 16-char field as MM/DD/YYMM/DD/YY).
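
A minimal Python sketch of how the two halves of a 64-bit flat-file address (pnuc/pnuc2 or pinf/pinf2) can be recombined; the function name flat_file_offset is hypothetical.

```python
def flat_file_offset(low, high):
    """Combine the least-significant (low) and most-significant (high)
    32-bit halves into one 64-bit byte offset within the flat file."""
    return (high << 32) | (low & 0xFFFFFFFF)
```

With pnuc = 5 and pnuc2 = 1, the sequence would start at offset 4294967301 in the flat file of rank div.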

KEYWORDS and SPECIES one record for each keyword or taxon

name|libel|plsub| desc | syno   |    h   |plhost|
    |TEXT |LONGL|SHORTL|KEYWORDS|KEYWORDS|
                       |SPECIES |SPECIES |LONGL |

The last field, plhost, exists in SPECIES and is absent from KEYWORDS.

name : uppercase only; padded by spaces to WIDTH_SP or WIDTH_KW characters; set to "xxx...xxx" when deleted.
libel : 0, or to TEXT for a lrtxt-char label; in SPECIES, this label may indicate the genetic codes adequate for the species, may contain the NCBI's taxon id, and may also contain the taxonomic level (e.g., genus, order, family) of the taxon.
plsub : to LONGL for list of attached sequences (only parent seqs for species, any seq type for keywords).
desc : to SHORTL for list of descendants in tree structure; the absolute value of the first element of this list is the rank of the corresponding record in KEYWORDS/SPECIES; the sign of this number is negative iff there are sequences associated with this record; other elements of the list are "desc" values of records of descendants in the tree; desc = 0 for synonyms.
syno : to KEYWORDS/SPECIES to implement keyword or species synonymy, or 0 if none; synonymous keyword/species are chained in a looped chain; one and only one member from this looped chain has a negative syno value and is the major keyword/species and the only one with non zero plsub and desc; other members of chain have a positive syno value; |syno| is the rank in KEYWORDS/SPECIES of the next synonym.
h : next element of chain of KEYWORDS/SPECIES records with same hashing value, 0 at end of chain.
plhost : to LONGL for list of seqs that have this species as host (used in EMBL, e.g., to relate viral or plasmid sequences to their host).
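
The looped synonym chain described for the syno field can be walked as in the following Python sketch. It works on a hypothetical in-memory mapping from record rank to syno value rather than on the index file itself; both function names are illustrative only.

```python
def major_of(syno, rank):
    """Follow the synonym loop from `rank` until the major record is found.

    `syno` maps a KEYWORDS/SPECIES rank to its syno field value.
    The major record is the only member with a negative syno value.
    """
    if syno[rank] == 0:
        return rank                      # no synonymy at all
    current = rank
    for _ in range(len(syno)):           # guard against a malformed loop
        if syno[current] < 0:
            return current
        current = syno[current]
    raise ValueError("synonym chain has no major record")

def synonym_ring(syno, rank):
    """List all members of the synonym loop containing `rank`."""
    ring = [rank]
    nxt = abs(syno[rank])                # |syno| is the rank of the next synonym
    while nxt != 0 and nxt != rank and len(ring) <= len(syno):
        ring.append(nxt)
        nxt = abs(syno[nxt])
    return ring
```

With syno = {3: 5, 5: -8, 8: 3}, records 3, 5 and 8 form one loop and record 5 is the major one.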

BIBLIO one record for each reference

name|plsub    |plaut    |  j  |  y  |
    |SHORTL(2)|SHORTL(2)|SMJYT|SMJYT|

name : uppercase only; padded by spaces to WIDTH_BIB characters.
journal citations appear as JournalCode/volume_number/first_page
book citations as BOOK/year/first_author
theses citations as THESIS/year/first_author
patent citations as PATENT/number
other citations as UNPUBL/year/first_author
plsub, plaut : to SHORTL or SHORTL2 for lists of attached sequences and authors, respectively.
j, y : to SMJYT records for corresponding journal and publication year, respectively.
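
The naming conventions for BIBLIO records can be illustrated with a small Python parser. This is only a sketch: the function name and the returned dictionary keys are hypothetical, and the example names used with it are made up for illustration.

```python
def parse_biblio_name(name):
    """Split a BIBLIO name into its components according to its prefix."""
    parts = name.rstrip(" ").split("/")   # names are padded by spaces
    kind = parts[0]
    if kind in ("BOOK", "THESIS", "UNPUBL"):
        # KIND/year/first_author
        return {"kind": kind, "year": parts[1],
                "first_author": "/".join(parts[2:])}
    if kind == "PATENT":
        # PATENT/number
        return {"kind": kind, "number": "/".join(parts[1:])}
    # Anything else is taken as JournalCode/volume_number/first_page.
    return {"kind": "JOURNAL", "journal_code": parts[0],
            "volume": parts[1], "first_page": parts[2]}
```

For a made-up name such as "JMB/215/403", the sketch returns journal code JMB, volume 215 and first page 403.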

AUTHOR one record for each author name

name|plref    |
    |SHORTL(2)|

name : uppercase only; padded by spaces to WIDTH_AUT characters; last name only, no initials.
plref : to SHORTL or SHORTL2 for list of references attached to this author.
the old format had an unused int field after plref, removed in the new format

ACCESS one record for each accession number

name|plsub    |
    |SHORTL(2)|

name : padded by spaces to ACC_LENGTH characters.
plsub : to SHORTL or SHORTL2 for list of parent seqs attached to this accession number.

SMJYT one record for each status, molecule, journal, year, type, organelle, division, and db structure information

name|plong|libel|
    |LONGL|TEXT |

name : padded by spaces to WIDTH_SMJ characters; first 2 characters identify the nature of the object : status("00"), molecule("01"), journal("02"), year("03"), type("04"), organelle("05"), division("06"), and db structure information("07"); uppercase only except for "06".
plong : 0 or to LONGL for list of sequences attached to this object.
libel : 0, or to TEXT for a lrtxt-char label.
More information

Names starting with "06" can be "06FLTfname" or "06GCGfname" and indicate whether sequences and annotations are in flat or in GCG-structured files, and give the name of corresponding files (extension excluded; e.g., 06FLTgbbct1 for flat file gbbct1.seq). The file name can include a directory name (e.g., 06FLTlh/wgs_lhaa01_pro) when the corresponding file is in a subdirectory of $gcgacnuc.
The labels of "06" records are of the form "rank:xx" and give the rank of the corresponding division, counting from 0.
Presence of one record named "07HASHING_ALGORITHM" and with label such as "Java algorithm" indicates that the java hashing algorithm is used for species, keywords and seq names; absence of such record means a previous algorithm is used.
Presence of one record named "07BIG_ANNOTS" indicates that annotations and sequences are addressed by a combination of 2 fields : field div of LOCUS gives the rank of the division the seq belongs to, fields pinf/pinf2 and pnuc/pnuc2 of LOCUS give the offsets within the division where annotations and the sequence begin, respectively; absence of such record is no longer supported.
Presence of one record named "07ALLOW_PUNCTUATION" indicates that sequence data is interspersed with punctuation data in flat files (special for databases of rRNA sequences).
Presence of one record named "07NOCHANGESUBSEQNAME" indicates that feature qualifiers /gene=, /standard_name= will not be used to construct the extension part of subsequence names.
Presence of one record named "07PROTEIN_IDSUBSEQNAME" indicates that feature qualifier /protein_id= will be used to construct the extension part of subsequence names.
Parameter values L_MNEMO, WIDTH_BIB, WIDTH_AUT, WIDTH_SP, WIDTH_KW, SUBINLNG, ACC_LENGTH are set by presence of a record named 07L_MNEMO, 07WIDTH_BIB, 07WIDTH_AUT, 07WIDTH_SP, 07WIDTH_KW, 07SUBINLNG, 07ACCESSION, respectively. Each record points to a TEXT label containing "parameter_name = parameter_value".
A record named "07ENDIANNESS", when present, points to a TEXT label containing either "BIG_ENDIAN" or "LITTLE_ENDIAN", which gives the endianness of the binary integer data stored in index files.
Presence of one record named "07PRIORITY_TO_TAXID" indicates that the feature qualifier /dbxref="taxon:###" has priority over /organism="xxx" to define the species associated to a database sequence.
Presence of one record named "07VALINSHRT2" indicates that index file SHORTL2 is used. This record points to a TEXT label containing "VALINSHRT2=##" where ## is the numerical value of the VALINSHRT2 parameter.
Presence of one record named "07VALINSHRT_KEY" indicates that index file SHORTL_KEY is used. This record points to a TEXT label containing "VALINSHRT_KEY=##" where ## is the numerical value of the VALINSHRT_KEY parameter.
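
The SMJYT naming rules above lend themselves to a short Python sketch. All function names here are hypothetical; the prefix table, the "06FLT"/"06GCG" forms and the "parameter_name = parameter_value" label form are taken from the description above.

```python
SMJYT_KINDS = {
    "00": "status", "01": "molecule", "02": "journal", "03": "year",
    "04": "type", "05": "organelle", "06": "division", "07": "db structure",
}

def classify_smjyt(name):
    """Return (kind, rest) from the 2-character prefix of an SMJYT name."""
    name = name.rstrip(" ")               # names are padded by spaces
    return SMJYT_KINDS.get(name[:2], "unknown"), name[2:]

def parse_division(name):
    """Split '06FLTfname' or '06GCGfname' into (storage, file_name)."""
    body = name.rstrip(" ")[2:]
    return body[:3], body[3:]             # file_name has no extension

def parse_parameter_label(label):
    """Parse a TEXT label of the form 'parameter_name = parameter_value'."""
    key, _, value = label.partition("=")
    return key.strip(), int(value.strip())
```

For instance, parse_division("06FLTgbbct1") gives ("FLT", "gbbct1"), and parse_parameter_label("VALINSHRT2=25") gives ("VALINSHRT2", 25).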

EXTRACT (for nucleotide databases only) one record for each exon of each subsequence

mere  |deb|fin| next  |
SUBSEQ|   |   |EXTRACT|

mere : to SUBSEQ for rank of parent seq containing this exon.
deb, fin : endpoints in parent sequence of the exon.
next : to next exon of same sub-sequence, or 0 if no more.
(the old format had an int field pnuc between fields fin and next with the address in flat file of start of parent sequence containing this exon; this is absent in the new format)
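
Following the next field from the record pointed to by pext of a sub-sequence gives its exons in order. The Python sketch below assumes the EXTRACT records are already available as a hypothetical mapping from rank to (mere, deb, fin, next) tuples; the function name is illustrative.

```python
def exon_chain(extract, first):
    """Collect (mere, deb, fin) for every exon of one sub-sequence.

    `extract` maps an EXTRACT rank to a (mere, deb, fin, next) tuple;
    `first` is the positive pext value of the sub-sequence.
    """
    exons = []
    seen = set()                          # protects against a corrupt loop
    rank = first
    while rank != 0 and rank not in seen:
        seen.add(rank)
        mere, deb, fin, nxt = extract[rank]
        exons.append((mere, deb, fin))
        rank = nxt                        # 0 marks the last exon
    return exons
```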

TEXT one lrtxt-character record for each label of a species, keyword, or SMJYT

   label  |

In the case of species, labels may contain information about the correct genetic codes for this species, about NCBI's taxon ids, and about the name of the taxonomic level (e.g., order, family).

LONGL one record for each group of SUBINLNG elements of a long list

sub[0],sub[1],...,sub[SUBINLNG-1] |next |
     SUBSEQ,...                   |LONGL|

sub[i] : 0, or an element of the long list that is always a SUBSEQ record number.
next : 0, or rank of another LONGL record containing other elements of the list.
existing long lists [from: field holding start of list] :

parent seqs attached to a species (from: field plsub of SPECIES)
parent seqs attached to a host (from: field plhost of SPECIES)
seqs attached to an SMJYT element (from: field plong of SMJYT)
seqs attached to a keyword (from: field plsub of KEYWORDS)
sub-seqs of a parent sequence (from: opposite of field pext of SUBSEQ)
all parent seqs in the database (must start at record # 2)
all records deleted from file SUBSEQ (must start at record # 3)
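
Reading a long list amounts to following the next field and skipping the 0 entries. The Python sketch below assumes LONGL records are available as a hypothetical mapping from rank to (sub_values, next) pairs, where sub_values holds the SUBINLNG pointers of the record.

```python
def read_long_list(longl, start):
    """Return all SUBSEQ record numbers of the long list starting at `start`."""
    members = []
    seen = set()                          # protects against a corrupt loop
    rank = start
    while rank != 0 and rank not in seen:
        seen.add(rank)
        subs, nxt = longl[rank]
        members.extend(s for s in subs if s != 0)   # 0 means an empty slot
        rank = nxt
    return members
```

Under this sketch, read_long_list(longl, 2) would enumerate all parent sequences of the database and read_long_list(longl, 3) all deleted SUBSEQ records, since those two lists must start at records #2 and #3.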

SHORTL either consecutive blocks of two integer values, or linked lists of values.

File SHORTL can optionally contain also all the data that can be stored by index files SHORTL2 and SHORTL_KEY. This occurs when the value of parameter VALINSHRT2 is 0.

val | next |
    |SHORTL|

val : an element of the short list; it may be a signed integer or an unsigned record number.
next : 0, or rank of another SHORTL record containing another element of the list.
descending nodes of a species or of a keyword (from: field desc of KEYWORDS/SPECIES; val: SHORTL record #)
File SHORTL also contains all subsequence annotation start addresses : ( from: field plinf of SUBSEQ; val,next: low,high order halves of annotation start address).
File SHORTL also contains data that determine 4 parameter values and other data used for hashing of seq names, species and keywords. Hashing is controlled by two positive, odd integer parameters, hsub (for seq names) and hkwsp (for species and keywords).
Record #2 of file SHORTL contains two values in fields val and next whose absolute values are equal to hsub and hkwsp, respectively. The new ACNUC format applies when val < 0 and next > 0. The old format applies when val < 0 and next < 0.
In the new format, record #3 of file SHORTL contains the parameter values WIDTH_SMJ and lrtxt. In the old format, WIDTH_SMJ = 20 and lrtxt = 60.
Starting from record #4 (or #3 in the old format), there are (hsub+1)/2 records containing hsub values for seq name hashing, then (hkwsp+1)/2 records containing hkwsp values for keyword hashing, and then (hkwsp+1)/2 records containing hkwsp values for species hashing. True short lists begin after these records, that is, at number (hsub+1)/2 + (hkwsp+1) + 3 + 1. Each one of the hsub values stored in SHORTL starting at record #4 is the record number of the start of the chain of SUBSEQ records that share a common hashing value from the range [1,hsub]. Similarly, further data are starts of chains of keywords, and later of species, that share a common hashing value from the range [1,hkwsp]. Hashing values are computed by functions hashmn (seq names) and hasnum (keywords and species).
File SHORTL can optionally also contain all the data otherwise stored in files SHORTL2 and SHORTL_KEY. They are stored one value per SHORTL record.
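
The record arithmetic for the hashing area of SHORTL can be checked with a small Python sketch. The function names and the returned dictionary keys are hypothetical; the sign test and the record counts follow the description above, with the old-format start (#3) taken from the same text.

```python
def detect_format(val, nxt):
    """Interpret record #2 of SHORTL: returns (format, hsub, hkwsp)."""
    hsub, hkwsp = abs(val), abs(nxt)
    if val < 0 and nxt > 0:
        fmt = "new"
    elif val < 0 and nxt < 0:
        fmt = "old"
    else:
        fmt = "unknown"                   # combination not described above
    return fmt, hsub, hkwsp

def shortl_layout(hsub, hkwsp, new_format=True):
    """Compute the first record number of each hashing block in SHORTL."""
    seq_start = 4 if new_format else 3    # record #3 holds WIDTH_SMJ, lrtxt
    n_sub = (hsub + 1) // 2               # two values per SHORTL record
    n_kwsp = (hkwsp + 1) // 2
    kw_start = seq_start + n_sub
    sp_start = kw_start + n_kwsp
    lists_start = sp_start + n_kwsp       # first record of true short lists
    return {"seq_names": seq_start, "keywords": kw_start,
            "species": sp_start, "short_lists": lists_start}
```

With hsub = 5 and hkwsp = 7 in the new format, true short lists begin at record 15, which agrees with (hsub+1)/2 + (hkwsp+1) + 3 + 1 = 3 + 8 + 3 + 1.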

SHORTL2 one record for each group of VALINSHRT2 values of a short list

The existence of index file SHORTL2 is determined by the value of parameter VALINSHRT2: when it is 0, file SHORTL2 is not used and all data described here are stored, one value per record, in index file SHORTL; when it is > 0, file SHORTL2 is used, and the parameter value is the number of values in each file record.

next   | val1 | val2 |...
SHORTL2|

next : 0, or rank of another SHORTL2 record containing another group of values.
val1, val2, … : a group of VALINSHRT2 values; they are unsigned record numbers.
existing short lists [from: field holding start of list; val: nature of list values] :

sequences attached to a reference (from: field plsub of BIBLIO; val: SUBSEQ record #)
references attached to a sequence (from: field plref of LOCUS; val: BIBLIO record #)
authors attached to a reference (from: field plaut of BIBLIO; val: AUTHOR record #)
references attached to an author (from: field plref of AUTHOR; val: BIBLIO record #)
sequences attached to an accession number (from: field plsub of ACCESS; val: SUBSEQ record #)
keywords attached to a sequence (from: field plkey of SUBSEQ; val: KEYWORDS record #)
These lists are in SHORTL_KEY if VALINSHRT_KEY is non zero.
accession numbers attached to a sequence (from: field placc of LOCUS; val: ACCESS record #)
species attached to a sequence (for protein databases only) ( from: field spec of LOCUS; val: SPECIES record #)

SHORTL_KEY one record for each group of VALINSHRT_KEY values of a short list

The existence of index file SHORTL_KEY is determined by the value of parameter VALINSHRT_KEY: when it is 0, file SHORTL_KEY is not used and all data described here are stored, one value per record, in index files SHORTL or SHORTL2; when it is > 0, file SHORTL_KEY is used, and the parameter value is the number of values in each file record. VALINSHRT_KEY is always zero if VALINSHRT2 == 0.

next      | val1 | val2 |...
SHORTL_KEY|

next : 0, or rank of another SHORTL_KEY record containing another group of values.
val1, val2, … : a group of VALINSHRT_KEY values; they are unsigned record numbers.
existing short lists [from: field holding start of list; val: nature of list values] :

keywords attached to a sequence (from: field plkey of SUBSEQ; val: KEYWORDS record #)

MERES an optional index file that allows faster opening of the database.
lenw binary integer values that encode the bitlist of all parent sequences of the database.

TAXIDS an optional index file that implements the TID= retrieval criterion
count | ... count integer values ... |

count: number of following values.
the i th value gives the rank in index file SPECIES of NCBI's taxid i or 0 if no corresponding taxon is in this file.
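
A possible way to read TAXIDS and resolve a taxid is sketched below in Python. Two points are assumptions rather than documented facts: that values are 4-byte unsigned integers like those of the other index files, and that "the i th value" is counted from 1 (so taxid 1 is the first value after count). The function names are hypothetical.

```python
import struct

def load_taxids(data, big_endian=True):
    """Decode the TAXIDS content: a count followed by `count` integers."""
    order = ">" if big_endian else "<"
    (count,) = struct.unpack_from(order + "I", data, 0)
    return list(struct.unpack_from(order + "%dI" % count, data, 4))

def species_rank(values, taxid):
    """Return the SPECIES rank for an NCBI taxid, or 0 if absent.

    Assumption: values are numbered from 1, so taxid i is values[i - 1].
    """
    if 1 <= taxid <= len(values):
        return values[taxid - 1]
    return 0
```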

TAXTREE an optional index file containing all information of the species tree used to accelerate the loadtaxonomy function of the remote acnuc server.
This ASCII file contains one line for each taxon name of the form: rank&parent&count&"name"&"label"
where

rank: rank of the taxon in SPECIES file (the root has rank 2 and parent 0).
parent: if ≥ 0, rank of its parent taxon; if < 0, taxon is a synonym of taxon of rank |parent|.
count: number of sequences directly attached to this taxon.
name: double-quoted taxon name.
label: double-quoted taxon label.
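
A TAXTREE line can be split with a regular expression, as in the Python sketch below. The function name and dictionary keys are hypothetical, the sample lines in the usage note are invented, and the pattern assumes that taxon names do not themselves contain the sequence "& followed by a quote.

```python
import re

# rank & parent & count & "name" & "label", tolerating stray spaces.
TAXTREE_LINE = re.compile(
    r'^\s*(\d+)\s*&\s*(-?\d+)\s*&\s*(\d+)\s*&\s*"(.*?)"\s*&\s*"(.*)"\s*$'
)

def parse_taxtree_line(line):
    """Parse one TAXTREE line into a dictionary, or return None."""
    match = TAXTREE_LINE.match(line)
    if match is None:
        return None
    rank, parent, count = (int(match.group(i)) for i in (1, 2, 3))
    return {
        "rank": rank,
        # parent >= 0: rank of the parent taxon (0 for the root).
        "parent": parent if parent >= 0 else None,
        # parent < 0: this taxon is a synonym of taxon |parent|.
        "synonym_of": -parent if parent < 0 else None,
        "count": count,
        "name": match.group(4).strip(),
        "label": match.group(5).strip(),
    }
```

For an invented line such as 9&-7&0&"SOME NAME"&"", the result marks the taxon of rank 9 as a synonym of the taxon of rank 7.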