ACNUC management
Acnuc management programs in alphabetical order (acnuc & gcgacnuc environment variables identify the database): acnucgener, check, compressnewdiv, connectindex, convert_shortl_key, crenewdiv, crenewrelnum, cretaxtree, flattoaddress, initf, listlostfeat, listtoaddress, modkeylength, modhconst, modnet, namesindiv, nbrfgenerdiv, newordalphab, ordnet, processft, raadbstatus, readncbitaxo, setcode, smjytload, sortsubseq, supold, suppr_unused, swabacnuc, test_all_codes, updatehelp, voyage, wwwspecies.
Acnuc management programs in functional groups (at LBBE, all are in directory ~banques/debian-bin):
- Add / remove sequences to/from an acnuc database.
  - initf creates an empty acnuc database.
  - acnucgener adds sequences to an acnuc database (nucleotide or protein sequences).
  - nbrfgenerdiv adds sequences to an acnuc database of protein sequences (PIR codata format).
  - supold removes sequences from an acnuc database.
  - processft scans the features table of already-indexed sequences and creates missing subsequences.
- Deal with biological classification of species, keywords tree, and genetic codes.
  - readncbitaxo reproduces in an acnuc database a given classification of species, typically NCBI's.
  - wwwspecies prepares a file containing the acnuc species tree formatted for use by the acnuc web species browser.
  - modnet is an interactive program to edit the tree structure of species or of keywords.
  - setcode assigns genetic code numbers to CDS subsequences.
  - test_all_codes lists all genetic codes defined in acnuc.
  - cretaxtree creates the optional TAXTREE index file.
- Maintain clean, coherent, efficient acnuc index files.
  - connectindex maintains coherence between a set of flat files and a set of index files.
  - newordalphab optimizes access to acnuc index files by rewriting all index files.
  - suppr_unused removes all unused references, authors, acc nos, species, keywords, or records from file SMJYT.
  - updatehelp updates the summary information giving sequence and residue totals of a database.
  - sortsubseq alphabetically sorts sequence names and accession nos.
  - compressnewdiv removes unused bytes in a series of division files.
  - ordnet reorders species names or keywords compatibly with their tree structure.
  - modhconst changes the hashing constants of a db and/or converts a db to the variable-length record key format.
  - modkeylength changes the maximum length of record keys in an acnuc database.
  - convert_shortl_key makes an acnuc database use optional index file SHORTL_KEY.
- Miscellaneous.
  - listtoaddress computes division names & file offsets of a series of sequences.
  - flattoaddress computes division names & file offsets of a series of flat files.
  - smjytload adds, deletes, or modifies an element of index file SMJYT.
  - voyage is an interactive program to examine the content of any record of any acnuc index file.
  - namesindiv computes the list of names of sequences that belong to given acnuc division files.
  - listlostfeat scans the features table of all seqs of a database and detects missing subsequences.
  - crenewrelnum creates a new RELEASE # keyword in the tree of keywords.
  - crenewdiv creates a new division.
  - raadbstatus signals when a remotely accessible db is (un)available or sets db password.
  - swabacnuc changes the endianness of index files of an acnuc db.
  - series of check programs: detect inconsistencies within index files.
- initf creates an empty acnuc database.
This is the only acnuc program that does not use the acnuc environment variable. It produces a series of empty index files in the current directory.
usage:
initf db_type [gcg] [punctuation] [hsub=xx] [hkwsp=xx] [acc=xx] [halgo=java|old] [standardextonly] [protein_idext] [sub=xx] [key=xx] [spec=xx] [aut=xx] [bib=xx] [smj=xx] [txt=xx] [lng=xx]
or
initf -h (to get a usage message)
where
db_type : genbank or embl or swissprot or nbrf, according to the need
gcg : use this option to create a database that indexes GCG files
punctuation : use this option so the created database allows punctuation to appear in sequence data
hsub=xx : sets the value of the seq name hashing constant hsub (default = 1000003; use a prime number of the same magnitude as the total number of seqs in the to-be-built database)
hkwsp=xx : sets the value of the species and keyword name hashing constant hkwsp (default = 500009; use a prime number of the same magnitude as the total number of keywords in the to-be-built database)
acc=xx : sets the max length of accession numbers (default = 8)
halgo=java|old : sets the algorithm used for hashing names of sequences, species and keywords (default: java)
standardextonly : use this option so the created database does not use /gene= and /standard_name= feature qualifiers to construct subsequence name extensions.
protein_idext : use this option so the created database uses /protein_id= feature qualifiers to construct subsequence name extensions.
sub=xx : to set the max length of (sub)sequence names (default = 20)
key=xx : to set the max length of keywords (default = 40)
spec=xx : to set the max length of species names (default = 70)
aut=xx : to set the max length of author names (default = 20)
bib=xx : to set the max length of references (default = 40)
smj=xx : to set the max length of records keys in file SMJYT (default = 30)
txt=xx : to set the max length of labels in file TEXT (default = 70)
lng=xx : to set the number of SUBSEQ pointers in each LONGL record (default = 63)
Usage:
acnucgener a address_file [-mmap index ... ]
or
acnucgener d division_name1 [division_name2 ...] [-mmap index ... ]
where
address_file : name of a file, typically created by connectindex or by listtoaddress, containing the names, divisions and file offsets of sequences to enter
division_name : name(s) of one or more divisions (e.g., gbnew for file gbnew.seq, fname for fname.dat); the name can also contain a subdirectory of $gcgacnuc (e.g., lh/wgs_lhaa01_pro); all sequences in these divisions will be indexed, except those already indexed with the same date; existing seqs with an earlier date are suppressed and then re-indexed.
index : indicates an index file to be processed entirely in virtual memory; one of ksub, kloc, kkey, kspec, kshrt, kshrt2, klng, ksmj, kaut, kacc, kbib; can be repeated as in: -mmap ksub -mmap kshrt -mmap kkey ; for large databases, it is recommended to use the -mmap option at least with each of ksub, kshrt, kkey, kspec, klng.
Program acnucgener creates subsequences for those items in sequence features that are known by the database as a type. By default, the known types are CDS, tRNA, rRNA, scRNA, snRNA, misc_RNA. Use program smjytload to define additional types if desired. The /transl_table feature qualifier and the genetic code information from the NCBI classification (see below) are used to assign variant genetic codes to CDS subsequences. Qualifier values found in /EC_number=, /evidence=, /gene=, /product=, /protein_id=, /standard_name= are added as keywords of the subsequence. By default, qualifier values found in /gene= and /standard_name= are used to define the extension of the subsequence name; this behavior is turned off if entry 07NOCHANGESUBSEQNAME exists in file SMJYT. Alternatively, subsequence names can be made from the value of /protein_id= qualifiers if entry 07PROTEIN_IDSUBSEQNAME exists in file SMJYT.
To identify the species of a sequence, program acnucgener first uses the name found in the /organism= qualifier, and next the /dbxref="taxon:###" qualifier, of the source entry of the features table. This rule is reversed (taxon:### first and /organism next) if entry 07PRIORITY_TO_TAXID is present in index file SMJYT (this entry can be created/deleted with program smjytload). If neither qualifier is available, it uses the ORGANISM or OS records. Program acnucgener reads the full NCBI classification given in files names.dmp + nodes.dmp from directory $acnuctaxo to classify new species. If these files are not found or do not classify the species name, acnucgener uses the information of ORGANISM/OC lines to classify it. However, acnucgener does not update the acnuc species tree when the classification of a pre-existing species has changed. For this reason, program readncbitaxo is useful to reflect changes of the NCBI classification of species in the acnuc species tree.
Customized processing of feature qualifiers is possible. Defined qualifiers can be detected and a keyword can be created from the qualifier or its value and attached to the subsequence corresponding to the feature entry. This is obtained by creating in the $acnuc directory a plain text file called custom_qualifier_policy that describes the desired custom feature qualifier processing. Follow this model (case is not significant) :
Qualifier = GENE_FAMILY
Use_Value = True
Parent_Keyword = GENE FAMILIES

qualifier = GENE_EXPRESSION
use_value = TRUE
parent_keyword = GENE EXPRESSIONS
Groups of lines deal with distinct qualifiers. The qualifier line begins a group and names the feature qualifier that requires custom processing (e.g., presence of /GENE_FAMILY in qualifiers). The use_value line says True if the value of the qualifier is used to define the keyword (e.g., keyword HBG00234 is used when /GENE_FAMILY="HBG00234" appears). By default the qualifier itself is used as a keyword. The parent_keyword line names a keyword under which to place the keyword in the tree of keywords (e.g., HBG00234 will be placed under GENE FAMILIES). By default the keyword is at the top of tree. The standard output of acnucgener describes what custom processing is used.
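The grouping rules described above can be illustrated with a small Python sketch. It is not part of acnuc: the function name parse_qualifier_policy is hypothetical, and the handling of blank or malformed lines is an assumption, since only the three "name = value" line kinds are documented here.

```python
def parse_qualifier_policy(text):
    """Parse custom_qualifier_policy text into one dict per qualifier group."""
    groups = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or "=" not in line:
            continue  # assumption: blank or malformed lines are ignored
        key, value = (part.strip() for part in line.split("=", 1))
        key = key.lower()  # case is not significant
        if key == "qualifier":
            # A qualifier line begins a group; start from the documented defaults:
            # the qualifier itself is the keyword, placed at the top of the tree.
            groups.append(
                {"qualifier": value.upper(), "use_value": False, "parent_keyword": None}
            )
        elif groups and key == "use_value":
            groups[-1]["use_value"] = value.lower() == "true"
        elif groups and key == "parent_keyword":
            groups[-1]["parent_keyword"] = value
    return groups


EXAMPLE = """\
Qualifier = GENE_FAMILY
Use_Value = True
Parent_Keyword = GENE FAMILIES

qualifier = GENE_EXPRESSION
use_value = TRUE
parent_keyword = GENE EXPRESSIONS
"""

policy = parse_qualifier_policy(EXAMPLE)
for group in policy:
    print(group)
```

A sketch like this can serve to check a policy file before running acnucgener, whose standard output remains the authoritative report of the custom processing actually used.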
Program nbrfgenerdiv has become obsolete given the fusion of the PIR and SwissProt databases into UNIPROT.
Usage:
Name of address file? ? address_file_name
Date de la release? (format 12/31/89) rel_date
where
address_file_name : name of file of seq names and file offsets
typically created by connectindex.
rel_date : date used only for seqs lacking date info in their annotation.
Usage:
where
name_file: file of sequence names, one per line, typically
created by connectindex or by listlostfeat.
Some situations arise where use of program acnucgener fails to correctly create all subsequences that should arise from sequence feature tables. One such case arises when a subsequence declared in the features table of seq. A is JOINed to a fragment of seq. B and when seq. B, but not seq. A, is updated, in the sense that its date is changed. Program connectindex detects the date change, so seq. B is removed (by supold) and re-indexed (by acnucgener), but acnucgener is not in a position to re-create the subsequence because it does not scan A's features table that defines the subseqs. File xxx.lost, created by connectindex, contains the name of seq. A, so running program processft with this file completes the database update by re-creating the subseq.
Another case is when a subsequence-associated feature table entry is added to a sequence without changing its date. Program connectindex does not detect this kind of change. The solution is to run listlostfeat that detects all missing subsequences from an acnuc database, and then processft on its output, to create these missing subseqs.
Program listlostfeat, run without arguments, reads the features table of all seqs of an acnuc database and detects missing subsequences. For each such case, it writes on its standard output the name of the parent sequence and the feature entry corresponding to the missing subsequence. If sent to a file, this output is suitable for use as the argument of the processft program.
Program connectindex can be used in 3 modes:
update mode : Connects an existing set of index files to an updated set of flat files and identifies changed, new, and disappeared sequences. Typically used to prepare acnuc indexing of a new release of flat files. Flat files can optionally be gzip'ed, in which case they are decompressed on the fly.
install mode : Connects a set of index files to a set of flat files and hides access to sequences present in index files but not in flat files. Typically used after copying index files from a distribution to ensure their coherence with local flat files.
scan mode : Does install mode on a given series of flat files rather than on all flat files.
Usage:
connectindex -h (gives a summary of program arguments)
connectindex -update -basename base_name [-gz gzdirname] [-threads n] -divfile divlist
where:
base_name: base name of a series of output files to be created by the program
gzdirname: name of directory where gzip'ed flat files sit. Compressed files are read from this directory and decompressed to the $gcgacnuc directory.
n: optional number of parallel threads to use
divlist: name of file containing list of all division names, one per line. When flat files are in subdirectories of $gcgacnuc, include the subdirectory name in the division name.
In update mode, five output files are created. File disparu.mne lists names of sequences present in indices but absent from flat files. File base_name.1 lists new seqs (present in flat, absent in indices). File base_name.2 lists modified seqs (seq date or length or subsequences differ between indices and flat files). File base_name.lost lists names of seqs to be processed later by program processft because their features table changed. File base_name.address gives division names and file offsets of all new or changed sequences; it is to be used as an argument of program acnucgener.
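The three-way comparison behind the update-mode output files can be pictured with the following Python sketch. It is only an illustration under simplifying assumptions: each sequence is reduced to a hypothetical (date, length) pair, the subsequence comparison done by connectindex is left out, and the function and sequence names are invented for the example.

```python
def classify_sequences(indexed, flat):
    """Compare two mappings of sequence name -> (date, length).

    indexed: what the index files currently hold
    flat:    what the updated flat files contain
    """
    # Present in indices but absent from flat files (cf. disparu.mne).
    disappeared = sorted(name for name in indexed if name not in flat)
    # Present in flat files but absent from indices (cf. base_name.1).
    new = sorted(name for name in flat if name not in indexed)
    # Present in both, but date or length differs (cf. base_name.2).
    modified = sorted(
        name for name in flat if name in indexed and flat[name] != indexed[name]
    )
    return {"disappeared": disappeared, "new": new, "modified": modified}


# Hypothetical data, for illustration only.
indexed = {
    "SEQA": ("01-JAN-2020", 1200),
    "SEQB": ("01-JAN-2020", 800),
    "SEQC": ("05-FEB-2019", 500),
}
flat = {
    "SEQA": ("01-JAN-2020", 1200),
    "SEQB": ("15-MAR-2021", 800),
    "SEQD": ("20-APR-2021", 950),
}

result = classify_sequences(indexed, flat)
print(result)
```

In this picture, the new and modified sets together correspond to the sequences whose division names and file offsets end up in base_name.address.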
connectindex [-threads n] -scan=number div1 div2 ...
where number is the number of following division names
The update/install modes can also be obtained by running the program without arguments and replying to a program dialog.
In update mode, the dialog replies are
u
f or g (for flat or GCG formatted division files, respectively)
base_name (base name of a series of output files to be created by the program)
number (number of divisions in the acnuc database)
xxx (names of these divisions on successive lines, without extension)
In install mode, the dialog replies are
i
f or g (for flat or GCG formatted division files, respectively)
y or n (if y an additional dialog item is needed)
new_div_name (only if previous reply was y, a new division with this name is created in index files)
number (number of divisions in the acnuc database)
xxx (names of these divisions on successive lines, without extension)
Program newordalphab duplicates all index files in the directory pointed to by the acnuc environment variable under names xxx.NEW, and then deletes all old index files and renames the new files.
usage: newordalphab
Wait for termination with message "Normal end" on stdout.
usage:
listtoaddress names_file output_file where
names_file : file of names of seqs to be processed.
output_file : file with division names and file offsets of these sequences.
Typical usage is to re-index a series of sequences by doing :
listtoaddress mylist.names mylist.address
supold mylist.names -mmap
acnucgener a mylist.address -mmap ksub -mmap kshrt
usage:
flattoaddress outfname flatfname... where
outfname : name of output file with division names & file offsets of all entries present in flat files
flatfname : name(s) of input flat files containing sequence entries
Typical usage is to index a series of flat files by doing :
flattoaddress new.address flat1.dat flat2.dat
crenewdiv flat1
crenewdiv flat2
acnucgener a new.address
usage: crenewdiv division_name
usage:
supold names_file [-mmap]
where
names_file : file of names of seqs to be removed, one per line
-mmap : this option lets the program work faster for a large number of sequences
The acnuc index file SMJYT contains one record for each name of molecule, journal, publication year, sequence type. It contains also the names of the division files of the database (not processed by this program), and records that specify optional database features.
smjytload is an interactive program used to create, rename, or delete such names. It can also modify the label of these names.
smjytload is useful to create new sequence types, so that the corresponding subsequences are created by program acnucgener. Each type has a code and a label. Its code is the feature name, converted to uppercase (e.g., CDS, EXON, INTRON). Its label must begin with ".XX" where XX are the two letters used to construct subsequence names (e.g. .PE for CDS to get xxxx.PE1 as a subseq name); the rest of the label may describe the type.
smjytload is also useful to correct journal codes (remove duplicates for example).
When an acnuc database is updated daily, new sequences are added at the end of division files dedicated to holding them (for example, gbnew.seq). Such new sequences may later be modified again, so new versions of them appear further down the divisions of new seqs, and the previous versions are no longer indexed.
compressnewdiv reads a series of division files (typically only those holding daily updates),
compresses them in place by removing their unindexed portions, and updates pointers to all data
that changed place in these files.
Usage:
compressnewdiv division_name...
where
division_name : one or several names of division files to be compressed in place
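The idea of removing unindexed portions while keeping pointers valid can be sketched in Python as follows. This is an illustration only, not the compressnewdiv implementation: it builds a new buffer instead of working in place, and the function name, the (offset, length) pointer model, and the sample data are all assumptions made for the example.

```python
def compact_division(data, indexed):
    """Keep only indexed regions of a division and recompute their offsets.

    data:    bytes content of a division file
    indexed: mapping of sequence name -> (offset, length) of its indexed entry
    Returns the compacted bytes and a mapping name -> (new_offset, length).
    """
    compacted = bytearray()
    new_offsets = {}
    # Process entries in file order so the surviving data keeps its ordering.
    for name, (offset, length) in sorted(indexed.items(), key=lambda item: item[1][0]):
        new_offsets[name] = (len(compacted), length)
        compacted += data[offset:offset + length]
    return bytes(compacted), new_offsets


# Hypothetical division: old versions of A and C are no longer indexed.
data = b"OLD-A;NEW-B;OLD-C;NEW-A;"
indexed = {"B": (6, 6), "A": (18, 6)}

compacted, pointers = compact_division(data, indexed)
print(compacted)
print(pointers)
```

Every surviving entry can still be read at its recomputed offset, which is the property the pointer update is meant to preserve.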
Usage: modkeylength param=length
where param is one of sub, acc, spec, key, aut, bib, smj, txt, lng, shrt2 corresponding to index files SUBSEQ, ACCESS, SPECIES, KEYWORDS, AUTHORS, BIBLIO, SMJYT, TEXT, LONGL, and SHORTL2 respectively
and length is the new desired max length of record keys for the corresponding index file.
Notes:
- Max length can be safely reduced.
- With parameter lng, length is the number of SUBSEQ pointers in each LONGL record.
- With parameter shrt2, length is the number of values in each list of SHORTL2 which is created if it was not used before.
- Length must be ≥ 10 for acc.
- This program accepts only dbs in the variable-length record key format. Use modhconst to convert a db to this format.
- Program voyage gives the current maximum length of all record keys.
Usage:
modhconst [hsub=new_value] [hkwsp=new_value]
Access by sequence name in a large database will be faster if constant hsub is a prime number of the same magnitude as the total number of seqs in the database. The same holds for constant hkwsp and the total number of keywords.
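A suitable prime can be found with a few lines of Python. The sketch below uses plain trial division, which is fast enough for values of this size; the function names and the sequence count are hypothetical, and only the hsub= argument syntax comes from the usage line above.

```python
def is_prime(n):
    """Trial-division primality test, adequate for hashing-constant sizes."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def next_prime(n):
    """Return the smallest prime greater than or equal to n."""
    candidate = max(n, 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


# Hypothetical expected number of sequences in the database.
expected_sequences = 2_500_000
print(f"modhconst hsub={next_prime(expected_sequences)}")
```

Applied to one million, this search yields 1000003, which is the default value of hsub in initf.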
Usage: convert_shortl_key VALINSHRT_KEY=v
where v is the desired number of values in each list stored in SHORTL_KEY.
The acnuc database must already use optional index file SHORTL2. Program modkeylength can turn a database into using optional file SHORTL2.
Program readncbitaxo reads files $acnuctaxo/names.dmp and $acnuctaxo/nodes.dmp, which contain a classification of species, and reproduces it entirely in the current acnuc database, with two exceptions: species that exist in acnuc but not in the input files remain unchanged, and species of the input files absent from acnuc are not created in acnuc (unless option -keepall is used, see below).
The program creates a log file (id.log in current directory) describing the input classification, the current acnuc classification, and all operations done to transform the second into the first.
There are four optional arguments to this program:
-partial : instructs the program not to delete synonyms existing in the current acnuc classification but not in the input classification
-niveau : instructs the program to use taxonomic level information of the input classification as node label (used by databases such as Hovergen).
-setcode : instructs the program to create file ncbicodes.out, which summarizes the genetic code information present in the input files, and file setcode.dialog, formatted as input for the setcode program.
-keepall : instructs the program to create in the acnuc database all the species found in the input tree, even if no sequence is attached to them.
Option -h lists possible program options.
Program wwwspecies writes its output file, the acnuc species tree formatted for the web species browser, on its standard output.
Setcode is a dialog-based program that repeatedly asks for taxon name, genetic code id, and boolean mitochondrial information, and assigns this genetic code id to all subsequences of type CDS (and possibly of organelle MITOCHONDRION) from that taxon or taxa below in the tree.
The dialog is
taxon name or stop ( the program stops if stop)
y or n (for mitochondrial or genomic genetic code info, respect.)
acnuc-genetic-code-id (an acnuc-defined genetic code id)
[loop back to asking taxon name]
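The reply sequence of this dialog can be prepared programmatically, as in the Python sketch below. It assumes one reply per line, as the dialog listing suggests; the function name is hypothetical, and the taxon names and genetic code ids are placeholders rather than real acnuc-defined ids.

```python
def build_setcode_dialog(assignments):
    """Build the text of setcode dialog replies.

    assignments: iterable of (taxon_name, is_mitochondrial, code_id) tuples.
    Each tuple yields three reply lines; a final 'stop' ends the dialog.
    """
    lines = []
    for taxon, mitochondrial, code_id in assignments:
        lines.append(taxon)                           # taxon name
        lines.append("y" if mitochondrial else "n")   # mitochondrial or genomic
        lines.append(str(code_id))                    # acnuc genetic code id
    lines.append("stop")
    return "\n".join(lines) + "\n"


# Placeholder values, for illustration only.
dialog = build_setcode_dialog(
    [
        ("VERTEBRATA", True, 2),
        ("MYCOPLASMA", False, 4),
    ]
)
print(dialog, end="")
```

The resulting text could then be supplied to setcode on its standard input, assuming the program reads its dialog replies from there.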
Procedure setcodegenbank.com runs the setcode program with the setcode.dialog information. It thus applies the genetic code information present in the NCBI classification to all of an acnuc database.
The flow leading to a CDS subsequence with its correct genetic code in acnuc is as follows. Program readncbitaxo writes in acnuc the genetic code information given in files names.dmp/nodes.dmp as part of the label of any leaf node or any sequence-bearing node. Program acnucgener uses this information to assign the adequate genetic code number to any CDS subsequence it creates. But this flow fails when acnucgener creates a new species and associated subsequences, because the genetic code information is not yet available to the program at that point. Program setcode is thus useful to enforce coherent genetic code information throughout an acnuc database.
Program test_all_codes lists on stdout all genetic codes defined in acnuc in a format that allows comparison with NCBI's gencode.dmp file. The output also gives both NCBI's and acnuc's genetic code ids. One can then detect whether new genetic codes appeared in NCBI and define them in acnuc.
Program modnet offers the following series of operations:
0 Orientation towards Species or Keywords
1 Creation of a node
2 Modification of the name and/or the label of a node
3 Creation of a branch
4 Move of a branch
5 State of a node
6 Delete a node or a synonym
7 Browse the tree
8 Create synonyms
9 List isolated or unused nodes and detect tree loops
10 Modify the order of descendants of a node
11 Remove all unused nodes
modnet is suited to correcting a few branches or nodes in the classification of species. Program readncbitaxo is to be used for more extensive changes.
modnet is the main way to organize a series of keywords as a tree.
Voyage is a utility program that helps debug acnuc programs.
Usage: namesindiv outfname divname ...
where
outfname: name of an output file to be filled with seq names, one per line
divname : one or several names of acnuc division files (e.g., gbbct2 est_fun)
Usage: suppr_unused oper_id
where
oper_id: one of bib aut acc spec key smj to specify references, authors, acc nos, species, keywords, or records from file SMJYT, respectively.
Operation bib should be done before operation aut to be efficient.
When dealing with species or keywords, nodes whose descendants in the tree are all unused nodes are also deleted.
Usage: ordnet oper_id
where
oper_id: s or k to specify species or keywords, respectively.
This program allows high-ranking taxa to appear before low-ranking ones in the index file of species names, which makes the output of browsing the species tree cleaner. The same applies to the keywords index file.
Usage: updatehelp [ -noupdate ]
The program computes the total number of sequences, subsequences, references, and nucleotides or amino acids in the current acnuc database, and writes this information at the top of on-line help files HELP and HELP_WIN.
The program also writes the date of the day the program is run, unless run with the -noupdate argument.
Program sortsubseq, used without argument, is useful during the procedure of daily indexing, after the acnucgener run, to restore the alphabetical order by name of all sequences and accession numbers. Subsequences are sorted in the order of their appearance in the features table.
Program crenewrelnum looks for the 1st descendant of RELEASE NUMBERS, which should be of the form RELEASE #, and creates a new keyword with a number incremented by one. Useful for the GenBank format, after indexing a full database release and before starting daily updates, so that new sequences are associated with the release number of the next full release. Useless with EMBL format because release numbers are read in annotations rather than guessed at by acnucgener.
Notes:
- Program swabacnuc does nothing if the db's and host's endiannesses are equal.
- All programs can transparently read/write a db with opposite endianness to the host's, but swabacnuc makes it possible to equalize the db's and host's endiannesses.
- Program voyage shows the current endianness of an acnuc db.
- SPARC and PowerPC are big-endian computer architectures; Intel and Alpha are little-endian.
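What changing endianness means for binary data can be shown with Python's standard struct module. The sketch below is a generic illustration: it treats a record as a run of 32-bit integers, which is an assumption made for the example and not a description of the actual acnuc index record layout; the function names are hypothetical.

```python
import struct


def swap_int32(value):
    """Return the 32-bit signed integer obtained by reversing value's 4 bytes."""
    return struct.unpack("<i", struct.pack(">i", value))[0]


def swap_record(raw, int_count):
    """Reverse the byte order of int_count consecutive 32-bit integers."""
    values = struct.unpack(f">{int_count}i", raw)   # read as big-endian
    return struct.pack(f"<{int_count}i", *values)   # write as little-endian


# The same three integers stored in both byte orders.
big = struct.pack(">3i", 1, 256, 65536)
little = swap_record(big, 3)
print(big.hex())
print(little.hex())
```

A program reading data of the opposite endianness has to perform such a swap on every integer it reads or writes, which is the per-access cost that equalizing the db's and host's endiannesses removes.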
Usage: raadbstatus -f knowndbfile -p namedpipe -n dbname { on | off }
or
raadbstatus -f knowndbfile -a -n dbname (to set password of a protected database)
knowndbfile: name of file with list of remotely accessible acnuc databases (environment
variable raalist gives this name)
namedpipe: name of pipe to communicate with the racnucd daemon
(environment variable raadisable gives this name)
dbname: name of database, taken from first column of knowndbfile
on | off: use off to make db unavailable, on to make it available
Example to set the swissprot database offline:
raadbstatus -f $raalist -p $raadisable -n swissprot off
Example to password-protect the nbrf database:
raadbstatus -f $raalist -a -n nbrf
Enter password: *******
Repeat password: *******
A series of programs that help detect several sorts of inconsistencies within index files, for example, a link from a sequence to a keyword that is not matched by a corresponding link from keyword to sequence. These programs are :
- checkacc : acc no <==> seq links
- checkarbre : tree structure in species and keyword index files
- checkaut : author <==> reference links
- checkbc : coherence between SQ / SUMMARY annotation lines and sequence data
- checkbib : sequence <==> reference links
- checkhash : integrity of hashing of sequence, species and keyword names
- checkinfnucpointers : integrity of pointers to annotations and sequences
- checkkw : sequence <==> keyword links
- checklng : integrity of all data in LONGL index file
- checkmefi : parent-sequence <==> subsequence links
- checksmj : sequence <==> SMJYT links
- checkspec : sequence <==> species links
- checksyno : integrity of synonymy data in species and keyword trees
Each program runs on all of the database and writes a description of any detected inconsistency on stdout.
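The kind of two-way link check these programs perform can be pictured with a short Python sketch. It models the links as plain dictionaries of sets, which is an assumption made for illustration and says nothing about how the index files store them; the function name and sample data are hypothetical.

```python
def find_unmatched_links(seq_to_kw, kw_to_seq):
    """Report links present in one direction but missing in the other.

    seq_to_kw: mapping of sequence name -> set of keywords
    kw_to_seq: mapping of keyword -> set of sequence names
    """
    problems = []
    for seq, keywords in seq_to_kw.items():
        for kw in keywords:
            if seq not in kw_to_seq.get(kw, set()):
                problems.append((seq, kw, "missing keyword->sequence link"))
    for kw, seqs in kw_to_seq.items():
        for seq in seqs:
            if kw not in seq_to_kw.get(seq, set()):
                problems.append((seq, kw, "missing sequence->keyword link"))
    return sorted(problems)


# Hypothetical links with one inconsistency in each direction.
seq_to_kw = {"SEQA": {"KINASE", "MEMBRANE"}, "SEQB": {"KINASE"}}
kw_to_seq = {"KINASE": {"SEQA", "SEQB"}, "GLOBIN": {"SEQB"}}

problems = find_unmatched_links(seq_to_kw, kw_to_seq)
for problem in problems:
    print(problem)
```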