PRABI-Doua

Pôle Rhône-Alpes de Bioinformatique Site Doua


ACNUC physical structure

An ACNUC database is made of a series of index files (ACCESS, AUTHOR, BIBLIO, EXTRACT, KEYWORDS, LOCUS, LONGL, SHORTL, SHORTL2 (optional), SHORTL_KEY (optional), SMJYT, SPECIES, SUBSEQ, TEXT, MERES (optional), TAXIDS (optional), TAXTREE (optional)) that allow efficient access to sequences and annotations through a variety of selection criteria. Sequences and annotations are stored in flat files (e.g., fun.dat, gbbct1.seq) created by the database producers (e.g., EMBL, GenBank, SwissProt); ACNUC accesses these files in strictly read-only mode.


Index files are made of a series of fixed-length records. Each record contains several fields, which are 4-byte unsigned integer values unless otherwise indicated. Records are referred to by their number (or rank), counting from 1. Binary integer values can be either all big-endian or all little-endian.

The first record of all index files (except MERES, TAXIDS and TAXTREE) follows this structure:
total | sorting state | end_sorted |
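As a minimal sketch of the layout described above, the following Python reads a record by rank and decodes the three header fields. The function names, the explicit `record_length` and `byteorder` arguments, and the assumption that the header fields occupy the first 12 bytes of the file are illustrative only; how ACNUC itself determines the byte order and record length is not described here.

```python
import io
import struct

def read_record(f, rank, record_length, byteorder="<"):
    """Return record number `rank` (counting from 1) as a tuple of 4-byte unsigned ints.

    Assumes every field of the record is an integer; records that also hold
    character fields (such as names) would need a different format string.
    """
    if rank < 1:
        raise ValueError("record ranks count from 1")
    if record_length % 4 != 0:
        raise ValueError("record_length must be a multiple of 4")
    f.seek((rank - 1) * record_length)  # fixed-length records: simple offset
    data = f.read(record_length)
    if len(data) != record_length:
        raise EOFError("record lies beyond the end of the file")
    return struct.unpack(f"{byteorder}{record_length // 4}I", data)

def read_header(f, byteorder="<"):
    """Return (total, sorting_state, end_sorted) from the start of the file.

    Assumption: the three fields are the first 12 bytes of the first record.
    """
    f.seek(0)
    data = f.read(12)
    if len(data) != 12:
        raise EOFError("file too short to hold a header")
    return struct.unpack(f"{byteorder}3I", data)

# Usage with an in-memory file holding two 16-byte little-endian records.
demo = io.BytesIO(struct.pack("<4I", 5, 1, 4, 0) + struct.pack("<4I", 7, 8, 9, 10))
print(read_header(demo))         # (5, 1, 4)
print(read_record(demo, 2, 16))  # (7, 8, 9, 10)
```

Passing `byteorder=">"` reads the same layout from a big-endian database.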


Parameters:
L_MNEMO = length of sequence names (variable in new format, fixed to 16 in old one)
WIDTH_SP = length of species names (variable in new format, fixed to 40 in old one)
WIDTH_KW = length of keywords (variable in new format, fixed to 40 in old one)
WIDTH_AUT = length of author names (variable in new format, fixed to 20 in old one)
WIDTH_BIB = length of BIBLIO names (variable in new format, fixed to 40 in old one)
WIDTH_SMJ = length of code in file SMJYT (variable in new format, fixed to 20 in old one)
SUBINLNG = number of SUBSEQ pointers in LONGL records (variable in new format, fixed to 63 in old one)
VALINSHRT2 = number of values in SHORTL2 records (0 means file SHORTL2 is not used; 1 and 2 are not permitted)
VALINSHRT_KEY = number of values in SHORTL_KEY records (0 means file SHORTL_KEY is not used; 1 and 2 are not permitted; VALINSHRT2 must be non zero for VALINSHRT_KEY to be non zero)
ACC_LENGTH = value ≥8 read at run-time when the database is opened
lrtxt = length of records of TEXT file (variable in new format, fixed to 60 in old one)
hsub, hkwsp = parameters that control the hashing of sequence names, species and keywords
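The constraints stated above for VALINSHRT2 and VALINSHRT_KEY can be sketched as a small consistency check. The function name and the returned message strings are hypothetical; only the rules themselves come from the parameter list.

```python
def check_short_list_params(valinshrt2, valinshrt_key):
    """Return a list of problems; the list is empty when both values are consistent."""
    problems = []
    for name, value in (("VALINSHRT2", valinshrt2), ("VALINSHRT_KEY", valinshrt_key)):
        # 0 means "file not used"; 1 and 2 are not permitted.
        # Negative values are rejected too, since the fields are unsigned.
        if value < 0 or value in (1, 2):
            problems.append(f"{name} must be 0 or at least 3, got {value}")
    # SHORTL_KEY can only be used when SHORTL2 is used.
    if valinshrt_key != 0 and valinshrt2 == 0:
        problems.append("VALINSHRT_KEY must be 0 when VALINSHRT2 is 0")
    return problems

print(check_short_list_params(3, 0))  # []
```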


SUBSEQ: one record for each parent sequence or subsequence

name |length|  type  |pext  P:≤0 , S:>0 | plkey   | plinf        |    phase     |  h   |
     |      |to SMJYT| P: subseq list   |SHORTL(2)|P: LOCUS      |100*code+frame|SUBSEQ|
                     | S: to EXTRACT    |         |S: feat start |       
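Two fields of the SUBSEQ record carry packed or sign-dependent meaning, which the following sketch decodes. The function names are hypothetical, and `pext` is assumed to have already been read as a signed integer, since the table distinguishes values ≤0 from values >0.

```python
def decode_phase(phase):
    """Split the phase field, stored as 100*code+frame, into (code, frame)."""
    code, frame = divmod(phase, 100)
    return code, frame

def subseq_kind(pext):
    """Classify a SUBSEQ record from its pext field.

    Per the table: parent sequences (P) have pext <= 0,
    subsequences (S) have pext > 0.
    """
    return "subsequence" if pext > 0 else "parent"

print(decode_phase(102))  # (1, 2)
print(subseq_kind(0))     # parent
```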

SUBSEQ records can be sorted (by the programs newordalphab and sortsubseq). When they are sorted, parent sequences are in alphabetical order and subsequences are in order of appearance in the annotations.


LOCUS: one record for each parent sequence

sub   |pnuc|pinf| pnuc2 |pinf2 |spec       |host   |plref    |molec|placc    |stat | org | div |date|
SUBSEQ|    |    |       |      |N:SPECIES  |SPECIES|SHORTL(2)|SMJYT|SHORTL(2)|SMJYT|SMJYT|     |    |
                               |P:SHORTL(2)|

KEYWORDS and SPECIES: one record for each keyword or taxon

name|libel|plsub| desc | syno   |    h   |plhost|
    |TEXT |LONGL|SHORTL|KEYWORDS|KEYWORDS|
                       |SPECIES |SPECIES |LONGL |

The last field, plhost, exists in SPECIES and is absent from KEYWORDS.


BIBLIO: one record for each reference

name|plsub    |plaut    |  j  |  y  |
    |SHORTL(2)|SHORTL(2)|SMJYT|SMJYT|

AUTHOR: one record for each author name

name|plref    |
    |SHORTL(2)|

ACCESS: one record for each accession number

name|plsub    |
    |SHORTL(2)|

SMJYT: one record for each status, molecule, journal, year, type, organelle, division, and item of database structure information

name|plong|libel|
    |LONGL|TEXT |

EXTRACT (for nucleotide databases only): one record for each exon of each subsequence

mere  |deb|fin| next  |
SUBSEQ|   |   |EXTRACT|

TEXT: one lrtxt-character record for each label of a species, keyword, or SMJYT entry

   label  |

In the case of species, labels may contain information about the correct genetic codes for this species, about NCBI's taxon ids, and about the name of the taxonomic level (e.g., order, family).


LONGL: one record for each group of SUBINLNG elements of a long list

sub[0],sub[1],...,sub[SUBINLNG-1] |next |
     SUBSEQ,...                   |LONGL|
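A long list spread over chained LONGL records could be traversed roughly as follows. This is a sketch under two assumptions not stated in the description: a `next` value of 0 ends the chain (plausible because record ranks count from 1), and a 0 entry marks an unused slot. The `records` mapping stands in for records already read from the file.

```python
def walk_long_list(records, first, subinlng):
    """Yield the SUBSEQ record numbers of a long list starting at LONGL record `first`.

    `records` maps a LONGL rank to a tuple of SUBINLNG + 1 integers:
    sub[0] .. sub[SUBINLNG-1], then next.
    """
    seen = set()
    rank = first
    while rank != 0:                      # assumption: 0 terminates the chain
        if rank in seen:
            raise ValueError(f"cycle detected at LONGL record {rank}")
        seen.add(rank)
        record = records[rank]
        subs, nxt = record[:subinlng], record[subinlng]
        for sub in subs:
            if sub != 0:                  # assumption: 0 marks an unused slot
                yield sub
        rank = nxt

# Two chained records with SUBINLNG = 3: record 2 points to record 3.
demo_records = {2: (5, 7, 0, 3), 3: (9, 0, 0, 0)}
print(list(walk_long_list(demo_records, 2, 3)))  # [5, 7, 9]
```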

SHORTL: either consecutive blocks of two integer values, or linked lists of values.

File SHORTL can optionally also contain all the data that would otherwise be stored in index files SHORTL2 and SHORTL_KEY. This occurs when the value of parameter VALINSHRT2 is 0.

val | next |
    |SHORTL|

SHORTL2: one record for each group of VALINSHRT2 values of a short list

The existence of index file SHORTL2 is determined by the value of parameter VALINSHRT2: when it is 0, file SHORTL2 is not used and all data described here are stored, one value per record, in index file SHORTL; when it is > 0, file SHORTL2 is used, and the parameter value is the number of values in each file record.

next   | val1 | val2 |...
SHORTL2|
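Unlike LONGL, a SHORTL2 record places the `next` pointer first, followed by the values. A minimal sketch of splitting one raw record, assuming all fields are 4-byte unsigned integers so that a record is 4 * (VALINSHRT2 + 1) bytes long (the function name and the `byteorder` argument are illustrative):

```python
import struct

def unpack_shortl2_record(data, valinshrt2, byteorder="<"):
    """Split one raw SHORTL2 record into (next, values)."""
    expected = 4 * (valinshrt2 + 1)  # one `next` field plus VALINSHRT2 values
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    fields = struct.unpack(f"{byteorder}{valinshrt2 + 1}I", data)
    return fields[0], fields[1:]

raw = struct.pack("<4I", 12, 3, 4, 5)   # next = 12, then three values
print(unpack_shortl2_record(raw, 3))    # (12, (3, 4, 5))
```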

SHORTL_KEY: one record for each group of VALINSHRT_KEY values of a short list

The existence of index file SHORTL_KEY is determined by the value of parameter VALINSHRT_KEY: when it is 0, file SHORTL_KEY is not used and all data described here are stored, one value per record, in index files SHORTL or SHORTL2; when it is > 0, file SHORTL_KEY is used, and the parameter value is the number of values in each file record. VALINSHRT_KEY is always zero if VALINSHRT2 == 0.

next      | val1 | val2 |...
SHORTL_KEY|

MERES: an optional index file that allows faster opening of the database.
It contains lenw binary integer values that encode the bit list of all parent sequences of the database.
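Testing membership in such a bit list might look like the sketch below. The bit numbering is an assumption, not something the description specifies: element `num` (counting from 1) is taken to be bit `(num - 1) % 32` of word `(num - 1) // 32`, least significant bit first, with 32-bit words.

```python
def bit_is_set(words, num, word_bits=32):
    """Return True if element `num` (counting from 1) is marked in the bit list.

    `words` is the sequence of integers read from the file.
    Assumed layout: least significant bit first within each word.
    """
    if num < 1:
        raise ValueError("elements are numbered from 1")
    index, bit = divmod(num - 1, word_bits)
    return bool((words[index] >> bit) & 1)

demo_words = [0b101, 0b1]          # elements 1, 3 and 33 are marked
print(bit_is_set(demo_words, 3))   # True
print(bit_is_set(demo_words, 2))   # False
```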


TAXIDS: an optional index file that implements the TID= retrieval criterion
count | ... count integer values ... |
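The count-prefixed layout above can be read as in this sketch, which assumes that the count and the values are all 4-byte unsigned integers in the database's byte order. The function name and the `byteorder` argument are illustrative, and the meaning of the individual values is not interpreted here.

```python
import struct

def read_counted_values(data, byteorder="<"):
    """Return the `count` integer values that follow the leading count field."""
    if len(data) < 4:
        raise ValueError("data too short to hold the count field")
    (count,) = struct.unpack_from(f"{byteorder}I", data, 0)
    needed = 4 + 4 * count
    if len(data) < needed:
        raise ValueError(f"expected at least {needed} bytes, got {len(data)}")
    return struct.unpack_from(f"{byteorder}{count}I", data, 4)

raw_taxids = struct.pack("<4I", 3, 10, 20, 30)  # count = 3, then three values
print(read_counted_values(raw_taxids))          # (10, 20, 30)
```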


TAXTREE: an optional index file containing all the information of the species tree, used to accelerate the loadtaxonomy function of the remote acnuc server.
This ASCII file contains one line for each taxon name of the form: rank&parent &count&" name"&"label "
where