PRABI-Doua: ACNUC management

ACNUC management

Acnuc management programs in alphabetical order (acnuc & gcgacnuc environment variables identify the database): acnucgener, check, compressnewdiv, connectindex, convert_shortl_key, crenewdiv, crenewrelnum, cretaxtree, flattoaddress, initf, listlostfeat, listtoaddress, modkeylength, modhconst, modnet, namesindiv, nbrfgenerdiv, newordalphab, ordnet, processft, raadbstatus, readncbitaxo, setcode, smjytload, sortsubseq, supold, suppr_unused, swabacnuc, test_all_codes, updatehelp, voyage, wwwspecies.

Acnuc management programs in functional groups : (in LBBE all are in directory ~banques/debian-bin)

Add / remove sequences to/from an acnuc database.

initf creates an empty acnuc database.
acnucgener adds sequences to an acnuc database (nucleotide or protein sequences).
nbrfgenerdiv adds sequences to an acnuc database of protein sequences (PIR codata format).
supold removes sequences from an acnuc database.
processft scans the features table of already-indexed sequences and creates missing subsequences.

Deal with biological classification of species, keywords tree, and genetic codes.

readncbitaxo reproduces in an acnuc database a given classification of species, typically NCBI's.
wwwspecies prepares a file containing the acnuc species tree formatted for use by the acnuc web species browser.
modnet interactive program to edit the tree structure of species or of keywords.
setcode assigns genetic code numbers to CDS subsequences.
test_all_codes lists all genetic codes defined in acnuc.
cretaxtree creates the optional TAXTREE index file.

Maintain clean, coherent, efficient acnuc index files.

connectindex maintains coherence between a set of flat files and a set of index files.
newordalphab optimizes access to acnuc index files by rewriting all index files.
suppr_unused removes all unused references, authors, acc nos, species, keywords, or records from file SMJYT.
updatehelp updates the summary information giving sequence and residue totals of a database.
sortsubseq alphabetically sorts sequence names and accession nos.
compressnewdiv removes unused bytes in a series of division files.
ordnet reorders species names or keywords compatibly with their tree structure.
modhconst changes the hashing constants of a db and/or db conversion to variable-length record key format.
modkeylength changes the maximum length of record keys in an acnuc database.
convert_shortl_key makes an acnuc database use optional index file SHORTL_KEY.

Miscellaneous.

listtoaddress computes division names & file offsets of a series of sequences.
flattoaddress computes division names & file offsets of a series of flat files.
smjytload adds, deletes, or modifies an element of index file SMJYT.
voyage interactive program to examine the content of any record of any acnuc index file.
namesindiv computes the list of names of sequences that belong to given acnuc division files.
listlostfeat scans the features table of all seqs of a database and detects missing subsequences.
crenewrelnum creates a new RELEASE # keyword in the tree of keywords.
crenewdiv creates a new division.
raadbstatus signals when a remotely accessible db is (un)available or sets db password.
swabacnuc changes the endianness of index files of an acnuc db.
series of check programs: detects inconsistencies within index files.

initf creates an empty acnuc database.
This is the only acnuc program that does not use the acnuc environment variable. It produces a series of empty index files in the current directory.

usage:

initf  db_type [gcg] [punctuation] [hsub=xx] [hkwsp=xx] [acc=xx] [halgo=java|old] [standardextonly]  
      [protein_idext] [sub=x] [key=x] [spec=x] [aut=x] [bib=x] [smj=x] [txt=x] [lng=x]
	or
initf -h           (to get a usage message)
where

db_type : genbank or embl or swissprot or nbrf, according to the need
gcg : use this option to create a database that indexes GCG files
punctuation : use this option so the created database allows punctuation to appear in sequence data
hsub=xx : use this to set the value of the seq name hashing constant hsub (default=1000003, use a prime number of the same magnitude as the total number of seqs in the to-be-built database)
hkwsp=xx : use this to set the value of the species and keyword name hashing constant hkwsp (default=500009, use a prime number of the same magnitude as the total number of keywords in the to-be-built database)
acc=xx : sets the max length of accession numbers (default 8)
halgo=java|old : sets what algorithm will be used for hashing names of sequences, species and keywords (default: java)
standardextonly : use this option so the created database does not use /gene= and /standard_name= feature qualifiers to construct subsequence name extensions.
protein_idext : use this option so the created database uses /protein_id= feature qualifiers to construct subsequence name extensions.
sub=xx : to set the max length of (sub)sequence names (default = 20)
key=xx : to set the max length of keywords (default = 40)
spec=xx : to set the max length of species names (default = 70)
aut=xx : to set the max length of author names (default = 20)
bib=xx : to set the max length of references (default = 40)
smj=xx : to set the max length of records keys in file SMJYT (default = 30)
txt=xx : to set the max length of labels in file TEXT (default = 70)
lng=xx : to set the number of SUBSEQ pointers in each LONGL record (default = 63)
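
The advice on hsub and hkwsp (a prime number of the same magnitude as the expected number of sequences or keywords) can be sketched as follows. This is only an illustration: the helper names is_prime, next_prime and initf_args are hypothetical, and the choice of "smallest prime at or above the expected count" is one reasonable reading of "same magnitude", not a rule stated by initf.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for hashing constants of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def next_prime(n: int) -> int:
    """Smallest prime greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


def initf_args(db_type: str, expected_seqs: int, expected_keywords: int) -> list[str]:
    """Assemble an initf argument list using the documented hsub= and hkwsp= options."""
    if db_type not in ("genbank", "embl", "swissprot", "nbrf"):
        raise ValueError(f"unknown db_type: {db_type}")
    return [
        "initf",
        db_type,
        f"hsub={next_prime(expected_seqs)}",    # assumption: prime just above the expected count
        f"hkwsp={next_prime(expected_keywords)}",
    ]


if __name__ == "__main__":
    # Prints the command line only; initf itself is not run here.
    print(" ".join(initf_args("embl", 1_000_000, 200_000)))
```

With one million expected sequences, next_prime returns 1000003, which matches the documented default for hsub.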

acnucgener adds sequences to an acnuc database (nucleotide or protein sequences).
Usage:

acnucgener a address_file [-mmap index ... ]
	or
acnucgener d division_name1 [division_name2 ...] [-mmap index ... ]
where

address_file : name of a file typically created by connectindex or by listtoaddress containing the names, divisions and file offsets of sequences to enter

division_name : name(s) of one or more divisions (e.g., gbnew for file gbnew.seq, fname for fname.dat); the name can also contain a subdirectory of $gcgacnuc (e.g., lh/wgs_lhaa01_pro); all sequences in these divisions will be indexed, except those already indexed with the same date; existing seqs with anterior date are suppressed and then re-indexed.

index : indicates an index file to be processed entirely in virtual memory; one of ksub, kloc, kkey, kspec, kshrt, kshrt2, klng, ksmj, kaut, kacc, kbib; can be repeated as in: -mmap ksub -mmap kshrt -mmap kkey ; for large databases, it is recommended to use the -mmap option at least with each of ksub, kshrt, kkey, kspec, klng.
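
As an illustration of the two call forms and of the repeated -mmap option, here is a small sketch that assembles an acnucgener argument list and rejects index names outside the documented set. The function name acnucgener_args is hypothetical, and nothing is executed.

```python
# Index names accepted after -mmap, as listed above.
MMAP_INDEXES = ("ksub", "kloc", "kkey", "kspec", "kshrt", "kshrt2",
                "klng", "ksmj", "kaut", "kacc", "kbib")

# Minimum set recommended above for large databases.
RECOMMENDED_FOR_LARGE_DBS = ("ksub", "kshrt", "kkey", "kspec", "klng")


def acnucgener_args(mode, targets, mmap=()):
    """Build the argument list for 'acnucgener a file' or 'acnucgener d div1 div2 ...'."""
    if mode not in ("a", "d"):
        raise ValueError("mode must be 'a' (address file) or 'd' (division names)")
    if not targets:
        raise ValueError("at least one address file or division name is required")
    if mode == "a" and len(targets) != 1:
        raise ValueError("mode 'a' takes exactly one address file")
    args = ["acnucgener", mode, *targets]
    for index in mmap:
        if index not in MMAP_INDEXES:
            raise ValueError(f"unknown index for -mmap: {index}")
        args += ["-mmap", index]  # the option is repeated once per index
    return args


if __name__ == "__main__":
    print(" ".join(acnucgener_args("d", ["gbnew"], RECOMMENDED_FOR_LARGE_DBS)))
```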

Program acnucgener creates subsequences for those items in sequence features that are known by the database as types. By default, known types are CDS, tRNA, rRNA, scRNA, snRNA, misc_RNA. Use program smjytload to define additional types if desired. The /transl_table feature qualifier and the genetic code information from the NCBI classification (see below) are used to assign variant genetic codes to CDS subsequences. Qualifier values found in /EC_number=, /evidence=, /gene=, /product=, /protein_id=, /standard_name= are added as keywords of the subsequence. Qualifier values found in /gene= and /standard_name= are by default used to define the extension of the subsequence name; this behavior is turned off if entry 07NOCHANGESUBSEQNAME exists in file SMJYT. Alternatively, subsequence names can be made from the value of /protein_id= qualifiers if entry 07PROTEIN_IDSUBSEQNAME exists in file SMJYT.

Program acnucgener takes species names first from the /organism= qualifier and next from the /db_xref="taxon:###" qualifier of the source entry of the features table. This rule is reversed (taxon:### first and /organism= next) if entry 07PRIORITY_TO_TAXID is present in index file SMJYT (this entry can be created/deleted with program smjytload). If no such information is found in the features table, it uses the ORGANISM or OS records. Program acnucgener reads the full NCBI classification given in files names.dmp + nodes.dmp from directory $acnuctaxo to classify new species. If these files are not found or do not classify the species name, acnucgener uses the information of ORGANISM/OC lines to classify it. However, acnucgener does not update the acnuc species tree when the classification of a pre-existing species has changed. For this reason, program readncbitaxo is useful to reflect changes of the NCBI classification of species in the acnuc species tree.

Customized processing of feature qualifiers is possible. Defined qualifiers can be detected and a keyword can be created from the qualifier or its value and attached to the subsequence corresponding to the feature entry. This is obtained by creating in the $acnuc directory a plain text file called custom_qualifier_policy that describes the desired custom feature qualifier processing. Follow this model (case is not significant) :

	Qualifier = GENE_FAMILY              
	Use_Value = True                     
	Parent_Keyword = GENE FAMILIES        

	qualifier = GENE_EXPRESSION
	use_value = TRUE
	parent_keyword = GENE EXPRESSIONS

Groups of lines deal with distinct qualifiers. The qualifier line begins a group and names the feature qualifier that requires custom processing (e.g., presence of /GENE_FAMILY in qualifiers). The use_value line says True if the value of the qualifier is used to define the keyword (e.g., keyword HBG00234 is used when /GENE_FAMILY="HBG00234" appears). By default the qualifier itself is used as a keyword. The parent_keyword line names a keyword under which to place the keyword in the tree of keywords (e.g., HBG00234 will be placed under GENE FAMILIES). By default the keyword is at the top of tree. The standard output of acnucgener describes what custom processing is used.
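
The policy file format described above can be generated rather than typed by hand. The sketch below renders groups of lines following the model; render_policy is a hypothetical helper, and two details are assumptions rather than documented behavior: groups are separated by one blank line (as in the model), and the Use_Value and Parent_Keyword lines are simply omitted when the default applies.

```python
def render_policy(entries):
    """Render custom_qualifier_policy text from a list of dicts.

    Each dict has a 'qualifier' key and optional 'use_value' (bool)
    and 'parent_keyword' (str) keys.
    """
    groups = []
    for entry in entries:
        lines = [f"Qualifier = {entry['qualifier']}"]  # the qualifier line opens a group
        if entry.get("use_value"):
            lines.append("Use_Value = True")
        if entry.get("parent_keyword"):
            lines.append(f"Parent_Keyword = {entry['parent_keyword']}")
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    text = render_policy([
        {"qualifier": "GENE_FAMILY", "use_value": True, "parent_keyword": "GENE FAMILIES"},
        {"qualifier": "GENE_EXPRESSION", "use_value": True, "parent_keyword": "GENE EXPRESSIONS"},
    ])
    # To install it, the text would be written to a file named
    # custom_qualifier_policy in the $acnuc directory.
    print(text, end="")
```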

nbrfgenerdiv adds sequences to an acnuc database of protein sequences (PIR codata format).
This program has become obsolete given the fusion of the PIR and SwissProt databases into UNIPROT.
Usage:

nbrfgenerdiv
Name of address file?  address_file_name
Date de la release? (format 12/31/89)  rel_date

where
address_file_name : name of file of seq names and file offsets typically created by connectindex.
rel_date : date used only for seqs lacking date info in their annotation.

processft scans the features table of already-indexed sequences and creates missing subsequences.
Usage:

processft name_file

where
name_file: file of sequence names, one per line, typically created by connectindex or by listlostfeat.

Some situations arise where use of program acnucgener fails to correctly create all subsequences that should arise from sequence feature tables. One such case arises when a subsequence declared in the features table of seq. A is JOINed to a fragment of seq. B and when seq. B, but not seq. A, is updated, in the sense that its date is changed. Program connectindex detects the date change, so seq. B is removed (by supold) and re-indexed (by acnucgener), but acnucgener is not in a position to re-create the subsequence because it does not scan A's features table that defines the subseqs. File xxx.lost, created by connectindex, contains the name of seq. A, so running program processft with this file completes the database update by re-creating the subseq.

Another case is when a subsequence-associated feature table entry is added to a sequence without changing its date. Program connectindex does not detect this kind of change. The solution is to run listlostfeat that detects all missing subsequences from an acnuc database, and then processft on its output, to create these missing subseqs.

listlostfeat scans the features table of all seqs of a database and detects missing subsequences.
This program, run without arguments, reads the features table of all seqs of an acnuc database and detects missing sub-sequences. For each such case, it writes on its standard output the name of the parent sequence and the feature entry corresponding to the missing subsequence. If sent to a file, this output data is suitable to be used as argument for the processft program.
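
The listlostfeat / processft pairing can be chained as in the sketch below. It is a minimal illustration under stated assumptions: the function name and the default file name lost.features are hypothetical, both programs are assumed to be on the PATH with the acnuc environment variables already set, and check=True assumes they return a non-zero status on failure, which this page does not state. By default the function only returns the plan and runs nothing.

```python
import subprocess


def repair_missing_subsequences(lost_file="lost.features", dry_run=True):
    """Run listlostfeat into a file, then feed that file to processft."""
    steps = [
        (["listlostfeat"], lost_file),   # stdout is redirected to lost_file
        (["processft", lost_file], None),
    ]
    if dry_run:
        return steps
    for argv, stdout_path in steps:
        if stdout_path is not None:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, stdout=out, check=True)
        else:
            subprocess.run(argv, check=True)
    return steps


if __name__ == "__main__":
    for argv, target in repair_missing_subsequences():
        print(" ".join(argv) + (f" > {target}" if target else ""))
```
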
connectindex maintains coherence between a set of flat files and a set of index files.
This program can be used in 3 modes:
update mode : Connects an existing set of index files to an updated set of flat files and identifies changed, new, and disappeared sequences. Typically used to prepare acnuc indexing of a new release of flat files. Flat files can optionally be gzip'ed, in which case they are decompressed on the fly.
install mode : Connects a set of index files to a set of flat files and hides access to sequences present in index files but not in flat files. Typically used after copying index files from a distribution to ensure their coherence with local flat files.
scan mode : Does install mode on a given series of flat files rather than on all flat files.

Usage:

connectindex -h

gives a summary of program arguments

connectindex -update -basename base_name [-gz gzdirname] [-threads n] -divfile divlist

where:
base_name: base name of a series of output files to be created by the program

gzdirname: name of directory where gzip'ed flat files sit. Compressed files are read from this directory and decompressed to the $gcgacnuc directory.

n: optional number of parallel threads to use

divlist: name of file containing list of all division names, one per line. When flat files are in subdirectories of $gcgacnuc, include the subdirectory name in the division name.

In update mode, five output files are created. File disparu.mne lists names of sequences present in indices but absent from flat files. File base_name.1 lists new seqs (present in flat, absent in indices). File base_name.2 lists modified seqs (seq date or length or subsequences differ between indices and flat files). File base_name.lost lists names of seqs to be processed later by program processft because their features table changed. File base_name.address gives division names and file offsets of all new or changed sequences; it is to be used as an argument of program acnucgener.
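
For scripting an update run, it can help to compute the command line and the names of the five output files from base_name. The sketch below does only that; the function names and dictionary keys are hypothetical, and connectindex is not executed.

```python
def connectindex_update_args(base_name, divlist, gzdir=None, threads=None):
    """Build the documented 'connectindex -update' argument list."""
    args = ["connectindex", "-update", "-basename", base_name]
    if gzdir is not None:
        args += ["-gz", gzdir]
    if threads is not None:
        args += ["-threads", str(threads)]
    args += ["-divfile", divlist]
    return args


def update_outputs(base_name):
    """Names of the five files written in update mode."""
    return {
        "disappeared": "disparu.mne",         # fixed name, independent of base_name
        "new": f"{base_name}.1",
        "modified": f"{base_name}.2",
        "lost": f"{base_name}.lost",          # input for processft
        "address": f"{base_name}.address",    # input for 'acnucgener a'
    }


if __name__ == "__main__":
    print(" ".join(connectindex_update_args("rel", "divisions.txt", threads=4)))
    print(update_outputs("rel"))
```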

connectindex -install [-gz gzdirname] [-threads n] -divfile divlist

connectindex [-threads n] -scan=number div1 div2 ...

where number is the number of following division names

The update/install modes can also be obtained by running the program without arguments and replying to a program dialog.

In update mode, the dialog replies are

u
f   or    g     (for flat or GCG formatted division files, respectively)
base_name       (base name of a series of output files to be created by the program)
number          (number of divisions in the acnuc database)
xxx             (names of these divisions on successive lines, without extension)

In install mode, the dialog replies are

i
f   or    g     (for flat or GCG formatted division files, respectively)
y   or    n     (if y an additional dialog item is needed)
	new_div_name (only if previous reply was y, a new division with this name 
	             is created in index files)
number          (number of divisions in the acnuc database)
xxx             (names of these divisions on successive lines, without extension)
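
When connectindex is driven through its dialog from a script, the replies listed above can be prepared as one text block and fed to the program's standard input. The following sketch builds that text for both modes; connectindex_dialog is a hypothetical helper, and it assumes each reply is read as one line.

```python
def connectindex_dialog(mode, divisions, base_name=None, gcg=False, new_division=None):
    """Return the dialog replies for update or install mode, one per line."""
    fmt = "g" if gcg else "f"
    if mode == "update":
        if base_name is None:
            raise ValueError("update mode needs a base_name")
        head = ["u", fmt, base_name]
    elif mode == "install":
        # 'y' is followed by the name of the new division to create; 'n' stands alone.
        head = ["i", fmt] + (["y", new_division] if new_division else ["n"])
    else:
        raise ValueError("mode must be 'update' or 'install'")
    lines = head + [str(len(divisions)), *divisions]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(connectindex_dialog("update", ["gbbct1", "gbbct2"], base_name="rel"), end="")
```

The resulting text could be passed as the input argument of subprocess.run(["connectindex"], input=..., text=True).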

newordalphab optimizes access to acnuc index files by rewriting all index files.
This program duplicates all index files in the directory pointed to by the acnuc environment variable under names xxx.NEW, and then deletes all old index files and renames the new files.

usage
newordalphab
             ...wait for termination with message "Normal end" on stdout.

listtoaddress computes division names & file offsets of a series of sequences.
usage:

listtoaddress names_file output_file
where

names_file : file of names of seqs to be processed.
output_file : file with division names and file offsets of these sequences.
Typical usage is to re-index a series of sequences by doing :

listtoaddress mylist.names mylist.address
supold mylist.names -mmap
acnucgener a mylist.address -mmap ksub -mmap kshrt

flattoaddress computes division names & file offsets of a series of flat files.
usage:

flattoaddress outfname flatfname...
where

outfname : name of output file with division names & file offsets of all entries present in flat files
flatfname : names(s) of input flat files containing sequence entries
Typical usage is to index a series of flat files by doing :

flattoaddress new.address flat1.dat flat2.dat
crenewdiv flat1
crenewdiv flat2
acnucgener a new.address
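
The same sequence of commands can be derived from the list of flat files. The sketch below is an illustration only: flat_indexing_plan is a hypothetical name, and it assumes, as in the example above, that each division is named after its flat file without the extension.

```python
from pathlib import Path


def flat_indexing_plan(address_file, flat_files):
    """List the commands needed to index a series of flat files, in order."""
    plan = [["flattoaddress", address_file, *flat_files]]
    for flat in flat_files:
        # assumption: division name = file name minus extension (flat1.dat -> flat1)
        plan.append(["crenewdiv", Path(flat).stem])
    plan.append(["acnucgener", "a", address_file])
    return plan


if __name__ == "__main__":
    for argv in flat_indexing_plan("new.address", ["flat1.dat", "flat2.dat"]):
        print(" ".join(argv))
```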

crenewdiv creates a new division
usage: crenewdiv division_name
supold removes sequences from an acnuc database.
usage:

supold names_file [-mmap ]
where

names_file : file of names of seqs to be removed, one per line
-mmap : this option lets the program work faster for a large number of sequences

smjytload adds, deletes, or modifies an element of index file SMJYT.
The acnuc index file SMJYT contains one record for each name of molecule, journal, publication year, sequence type. It contains also the names of the division files of the database (not processed by this program), and records that specify optional database features.

smjytload is an interactive program for creating, renaming, or deleting such names. It can also modify the label of these names.

smjytload is useful to create new sequence types, so that corresponding subsequences be created by program acnucgener. Each type has a code and a label. Its code is the feature name, converted to uppercase (e.g., CDS, EXON, INTRON). Its label must begin with ".XX" where XX are the two letters used to construct subsequence names (e.g. .PE for CDS to get xxxx.PE1 as a subseq name); the rest of the label may describe the type.
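
The code and label rules for a new sequence type can be checked before entering them in smjytload. The sketch below is hypothetical throughout: the function names are invented, the single space between ".XX" and the description is an assumption, and the naming scheme (parent name, ".XX", running number) is inferred only from the xxxx.PE1 example above.

```python
def sequence_type_entry(feature_name, letters, description=""):
    """Return (code, label) for a sequence type to be entered with smjytload."""
    if len(letters) != 2 or not letters.isalpha():
        raise ValueError("exactly two letters are needed for the .XX prefix")
    code = feature_name.upper()              # the code is the feature name in uppercase
    label = f".{letters}"
    if description:
        label += f" {description}"           # assumption: free text follows after a space
    return code, label


def subsequence_name(parent, label, rank):
    """Name built from the parent name, the .XX prefix of the label, and a number."""
    return f"{parent}{label[:3]}{rank}"


if __name__ == "__main__":
    code, label = sequence_type_entry("cds", "PE", "protein coding region")
    print(code, "|", label, "|", subsequence_name("xxxx", label, 1))
```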

smjytload is also useful to correct journal codes (remove duplicates for example).

compressnewdiv removes unused bytes in a series of divisions.
When an acnuc database is daily updated, new sequences are added at the end of division files dedicated to holding them (example, gbnew.seq). Such new sequences may be further modified, so new versions of them will appear further down the divisions of new seqs, and so previous versions will no longer be indexed.

compressnewdiv reads a series of division files (typically only those holding daily updates), compresses them in place by removing their unindexed portions, and updates pointers to all data that changed place in these files.
Usage:
```
compressnewdiv division_name...
```
where
division_name : one or several names of division files to be compressed in place
modkeylength changes the maximum length of record keys in an acnuc database.
Usage: modkeylength param=length
where param is one of sub, acc, spec, key, aut, bib, smj, txt, lng, shrt2 corresponding to index files SUBSEQ, ACCESS, SPECIES, KEYWORDS, AUTHORS, BIBLIO, SMJYT, TEXT, LONGL, and SHORTL2 respectively
and length is the new desired max length of record keys for the corresponding index file.
Notes:

Max length can be safely reduced.
With parameter lng, length is the number of SUBSEQ pointers in each LONGL record.
With parameter shrt2, length is the number of values in each list of SHORTL2 which is created if it was not used before.
Length must be ≥ 10 for acc.
This program accepts only dbs in the variable-length record key format. Use modhconst to convert a db to this format.
Program voyage gives the current maximum length of all record keys.
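
The parameter names and the constraint on acc can be validated before calling the program. The sketch below only builds the argument list; modkeylength_args is a hypothetical helper and the positivity check is an added assumption.

```python
# Parameter name -> index file, as listed above.
PARAM_TO_INDEX = {
    "sub": "SUBSEQ", "acc": "ACCESS", "spec": "SPECIES", "key": "KEYWORDS",
    "aut": "AUTHORS", "bib": "BIBLIO", "smj": "SMJYT", "txt": "TEXT",
    "lng": "LONGL", "shrt2": "SHORTL2",
}


def modkeylength_args(param, length):
    """Build 'modkeylength param=length' after checking the documented constraints."""
    if param not in PARAM_TO_INDEX:
        raise ValueError(f"unknown parameter: {param}")
    if length <= 0:
        raise ValueError("length must be positive")   # assumption, not stated above
    if param == "acc" and length < 10:
        raise ValueError("length must be at least 10 for acc")
    return ["modkeylength", f"{param}={length}"]


if __name__ == "__main__":
    print(" ".join(modkeylength_args("spec", 100)), "->", PARAM_TO_INDEX["spec"])
```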

modhconst changes the hashing constants of an acnuc database, and/or, if necessary, converts the database to the new format with variable-length record keys.
Usage:
modhconst [hsub=new_value] [hkwsp=new_value]

Access by sequence name in a large database will be faster if constant hsub is a prime number with the magnitude of the total number of seqs in the database. The same holds for constant hkwsp and the total number of keywords.

convert_shortl_key makes an acnuc database use the optional index file SHORTL_KEY.
Usage: convert_shortl_key VALINSHRT_KEY=v
where v is the desired number of values in each list stored in SHORTL_KEY.
The acnuc database must already use optional index file SHORTL2. Program modkeylength can turn a database into using optional file SHORTL2.
readncbitaxo reproduces in an acnuc database a given classification of species, typically NCBI's.
The program reads files $acnuctaxo/names.dmp and $acnuctaxo/nodes.dmp that contain a classification of species and reproduces it entirely in the current acnuc database, except for species that exist in acnuc but not in the input file, that remain unchanged, and for species of the input file absent from acnuc, that are not created in acnuc (unless option -keepall is used, see below).
The program creates a log file (id.log in current directory) describing the input classification, the current acnuc classification, and all operations done to transform the second into the first.

There are four optional arguments to this program:
-partial : instructs the program not to delete synonyms existing in the current acnuc classification but not in the input classification
-niveau : instructs the program to use taxonomic level information of the input classification as node label (used by databases such as Hovergen).
-setcode : instructs the program to create file ncbicodes.out, which summarizes the genetic code information present in the input files, and file setcode.dialog, formatted as input for the setcode program.
-keepall : instructs the program to create in the acnuc database all the species found in the input tree, even if no sequence is attached to them.

Option -h lists possible program options.

wwwspecies prepares a file containing the acnuc species tree formatted for use by the acnuc web species browser.
This file is written on the standard output of the program.
setcode assigns genetic code numbers to CDS subsequences.
Setcode is a dialog-based program that repeatedly asks for taxon name, genetic code id, and boolean mitochondrial information, and assigns this genetic code id to all subsequences of type CDS (and possibly of organelle MITOCHONDRION) from that taxon or taxa below in the tree.

The dialog is
taxon name or stop ( the program stops if stop)
y or n (for mitochondrial or genomic genetic code info, respect.)
acnuc-genetic-code-id (an acnuc-defined genetic code id)
[loop back to asking taxon name]
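
A dialog of this shape can be prepared as text and fed to setcode on standard input. The sketch below follows the order of replies listed above (taxon name, y or n, genetic code id) and ends with stop; setcode_dialog is a hypothetical helper, and the taxon names and code ids in the example are placeholders, not real acnuc genetic code ids.

```python
def setcode_dialog(assignments):
    """Build setcode dialog text from (taxon, mitochondrial, code_id) tuples."""
    lines = []
    for taxon, mitochondrial, code_id in assignments:
        lines.append(taxon)
        lines.append("y" if mitochondrial else "n")  # y = mitochondrial, n = genomic
        lines.append(str(code_id))
    lines.append("stop")                             # ends the dialog
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder taxa and ids, for illustration only.
    print(setcode_dialog([("TAXON A", True, 1), ("TAXON B", False, 3)]), end="")
```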

Procedure setcodegenbank.com runs the setcode program with the setcode.dialog information. It thus applies the genetic code information present in the NCBI classification to all of an acnuc database.

The flow leading to a CDS subsequence with its correct genetic code in acnuc is as follows. Program readncbitaxo writes in acnuc the genetic code information given in files names.dmp/nodes.dmp as part of the label of any leaf node or any sequence-bearing node. Program acnucgener uses this information to assign the adequate genetic code number to any CDS subsequence it creates. But this flow fails when acnucgener creates a new species and associated subsequences, because the genetic code information is not available to the program at that time. Program setcode is thus useful to enforce coherent genetic code information throughout an acnuc database.

test_all_codes lists all genetic codes defined in acnuc.
This lists on stdout all genetic codes defined in acnuc in a format that allows comparison with NCBI's gencode.dmp file. The output also gives both NCBI's and acnuc's genetic code ids. One can then detect if new genetic codes appeared in NCBI and define them in acnuc.
cretaxtree creates the optional TAXTREE index file.
modnet interactive program to edit the tree structure of species or of keywords.
A series of operations can be done :

 
 0  Orientation towards Species or Keywords
 1  Creation of a node
 2  Modification of the name and/or the label of a node
 3  Creation of a branch
 4  Move of a branch
 5  State of a node 
 6  Delete a node or a synonym 
 7  Browse the tree 
 8  Create synonyms
 9  List isolated or unused nodes and detect tree loops
10  Modify the order of descendants of a node
11  Remove all unused nodes

modnet is suited to correcting a few branches or nodes in the classification of species. Program readncbitaxo is to be used for more extensive changes.
modnet is the main way to organize a series of keywords as a tree.

voyage interactive program to examine the content of any record of any acnuc index file.
Voyage is a utility program that helps debugging acnuc programs.
namesindiv computes the list of names of sequences that belong to given acnuc division files.
Usage: namesindiv outfname divname ...
where
outfname: name of an output file to be filled with seq names, one per line
divname : one or several names of acnuc division files (e.g., gbbct2 est_fun)
suppr_unused removes all unused references, authors, acc nos, species, keywords, or records from file SMJYT.
Usage: suppr_unused oper_id
where
oper_id: one of bib aut acc spec key smj to specify references, authors, acc nos, species, keywords, or records from file SMJYT, respectively.

Operation bib should be done before operation aut to be efficient.
When dealing with species or keywords, nodes whose descendants in the tree are all unused nodes are also deleted.

ordnet reorders species names or keywords compatibly with their tree structure.
Usage: ordnet oper_id
where
oper_id: s or k to specify species or keywords, respectively.

This program allows high-ranking taxa to appear before low-ranking ones in the index file of species names, which makes the output of browsing the species tree cleaner. The same applies to the keywords index file.

updatehelp updates the summary information giving sequence and residue totals of a database.
Usage: updatehelp [ -noupdate ]
The program computes the total number of sequences, subsequences, references, and nucleotides or amino acids in the current acnuc database, and writes this information at the top of on-line help files HELP and HELP_WIN.
The program also writes the date of the day the program is run, unless run with the -noupdate argument.
sortsubseq alphabetically sorts sequence names and accession nos.
This program, used without argument, is useful during the procedure of daily indexing after the acnucgener run to have again all sequences and accession numbers alphabetically sorted by name. Subsequences are sorted in the order of their appearance in the features table.
crenewrelnum creates a new RELEASE # keyword in the tree of keywords.
Looks for the 1st descendant of RELEASE NUMBERS that should be of the form RELEASE # and creates a new keyword with a number incremented by one. Useful for the GenBank format, after indexing a full database release and before starting daily updates, so new sequences be associated with the release number of the next full release. Useless with EMBL format because release numbers are read in annotations rather than guessed at by acnucgener.
swabacnuc converts the endianness of an acnuc db to the host computer's endianness.
Notes:

Does nothing if the db's and host's endiannesses are equal.
All programs can transparently read/write a db with opposite endianness to the host's, but this program makes it possible to equalize the db's and host's endiannesses.
Program voyage shows the current endianness of an acnuc db.
SPARC, PowerPC are big-endian computer architectures; intel, alpha are little-endian.

raadbstatus signals when a database becomes unavailable or available again; or sets a database password.
Usage: raadbstatus -f knowndbfile -p namedpipe -n dbname { on | off }
or
raadbstatus -f knowndbfile -a -n dbname (to set password of a protected database)

knowndbfile: name of file with list of remotely accessible acnuc databases (environment variable raalist gives this name)
namedpipe: name of pipe to communicate with the racnucd daemon (environment variable raadisable gives this name)
dbname: name of database, taken from first column of knowndbfile
on | off: use off to make db unavailable, on to make it available

Example to set the swissprot database offline:
raadbstatus -f $raalist -p $raadisable -n swissprot off

Example to password-protect the nbrf database:
raadbstatus -f $raalist -a -n nbrf
Enter password: *******
Repeat password: *******
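
For use in maintenance scripts, the on/off form of the command can be assembled as in this sketch. The function name raadbstatus_args is hypothetical; the fallback to environment variables raalist and raadisable follows the description above, and the command is built but not run.

```python
import os


def raadbstatus_args(dbname, state, knowndbfile=None, namedpipe=None):
    """Build 'raadbstatus -f knowndbfile -p namedpipe -n dbname on|off'."""
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    # Fall back to the environment variables named in the text above.
    knowndbfile = knowndbfile or os.environ["raalist"]
    namedpipe = namedpipe or os.environ["raadisable"]
    return ["raadbstatus", "-f", knowndbfile, "-p", namedpipe, "-n", dbname, state]


if __name__ == "__main__":
    # Explicit paths are placeholders for illustration.
    print(" ".join(raadbstatus_args("swissprot", "off", "/path/to/knowndbs", "/path/to/pipe")))
```

A typical pattern would be to switch a database off before an update run and back on afterwards.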

series of check programs: detects inconsistencies within index files.
A series of programs that help detect several sorts of inconsistencies within index files, for example, a link from a sequence to a keyword that is not matched by a corresponding link from keyword to sequence. These programs are :

checkacc : acc no <==> seq links
checkarbre : tree structure in species and keyword index files
checkaut : author <==> reference links
checkbc : coherence between SQ / SUMMARY annotation lines and sequence data
checkbib : sequence <==> reference links
checkhash : integrity of hashing of sequence, species and keyword names
checkinfnucpointers : integrity of pointers to annotations and sequences
checkkw : sequence <==> keyword links
checklng : integrity of all data in LONGL index file
checkmefi : parent-sequence <==> subsequence links
checksmj : sequence <==> SMJYT links
checkspec : sequence <==> species links
checksyno : integrity of synonymy data in species and keyword trees

Each program runs on all of the database and writes a description of any detected inconsistency on stdout.
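
Since each check program reports on stdout, a full check pass can be scripted by running them in turn and keeping one log per program. The sketch below is an illustration under assumptions: run_checks and the .log file names are hypothetical, and each program is assumed to be on the PATH and to run without arguments. By default it only returns the plan.

```python
import subprocess

CHECK_PROGRAMS = (
    "checkacc", "checkarbre", "checkaut", "checkbc", "checkbib", "checkhash",
    "checkinfnucpointers", "checkkw", "checklng", "checkmefi", "checksmj",
    "checkspec", "checksyno",
)


def run_checks(programs=CHECK_PROGRAMS, dry_run=True):
    """Run each check program, saving its stdout to <program>.log."""
    plan = [(name, f"{name}.log") for name in programs]
    if dry_run:
        return plan
    for name, log in plan:
        with open(log, "w") as out:
            # assumption: no arguments; the exit status is not interpreted here
            subprocess.run([name], stdout=out, check=False)
    return plan


if __name__ == "__main__":
    for name, log in run_checks():
        print(f"{name} > {log}")
```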