PRABI-Doua: ACNUC C programming interface

ACNUC C Application Programming Interface

header file: dir_acnuc.h ---- full source code: acnucsoft.tar

CONSTANTS / GLOBAL VARIABLES / TYPEDEFs
OPENING / CLOSING: acnucopen, simpleopen, dir_acnucopen, dir_acnucclose.
ACCESS BY SEQUENCE NAME: gsnuml, isenum.
ACCESS TO SEQUENCES: gfrag, prep_extract, extract_1_seq, fin_extract.
ACCESS TO SEQUENCE ANNOTATIONS: seq_to_annots64, read_annots64, next_annots64, short_descr, short_descr_p, read_loc_qualif.
TRANSLATION / GENETIC CODES: codaa, init_codon_to_aa, translate_cds, translate_init_codon, get_ncbi_gc_number, get_acnuc_gc_number, get_code_descr.
ACCESS BY SPECIES, KEYWORD, AUTHOR, REFERENCE, ACC. NO, etc: iknum, fcode, shkseq, sel_seqs_1_node, taxidtosp, sptotaxid, descen, get_ancestor_taxon.

C code to find seqs attached to an accession no.
C code to find seqs attached to a taxon, taxID or keyword
C code to find all keywords attached to a sequence
C code to find keywords placed below one keyword in the keyword tree
C code to find all species below one taxon in the taxon tree

ACCESS BY THE QUERY LANGUAGE

Other global variables: tlist, defbitlist, defoccup, deflnames, deflocus, defgenre, defllen.
Query language API: prep_acnuc_requete, proc_requete, free_list.
Query API usage example
Query language

READING/WRITING ACNUC INDEX FILES: readacc, readsub,... , writeacc, writesub,... , read_first_rec, write_first_rec.
USING BIT LISTS: bit1, bit0, testbit, irbit, ou, et, non, bcount, lngbit.
UTILITY FUNCTIONS: compact, complementer_base, complementer_seq, endian_test, hashmn, hasnum, majuscules, padtosize, strcmptrail, trim_key.
SIMULTANEOUS ACCESS TO SEVERAL ACNUC DATABASES: chg_acnuc, store_acnuc_status, set_current_acnuc_db, sizeof_acnuc_status.
DATABASE MANAGEMENT FUNCTIONS: addshrt, addlng, supshrt, suplng, mdshrt, mdlng, cre_new_division, crespecies, crekeyword, cretaxids, addhsh, suphsh, delseq, dir_set_mmap, dir_acnucflush, write_quick_meres.
SUPPORT OF iRODS-BASED STORAGE: irods_fopen, irods_fgets, irods_ftello, irods_fseeko, irods_fclose.

INTRODUCTION

ACNUC environment
ACNUC databases are made of a series of flat text files (called divisions) containing annotations and sequences and a series of index files allowing efficient access to sequence data. Two environment variables acnuc and gcgacnuc are used by all ACNUC programs to define the name of the directories where index and flat files are located, respectively.
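
As a minimal sketch of the environment requirement above, a program could check both variables before opening a database. The helper names dirs_defined and check_acnuc_env are hypothetical and not part of the ACNUC API; only the variable names acnuc and gcgacnuc come from the text.

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 when both directory names are present and non-empty, 0 otherwise.
   Hypothetical helper, not part of the ACNUC API. */
int dirs_defined(const char *index_dir, const char *flat_dir)
{
	if (index_dir == NULL || *index_dir == '\0') return 0;
	if (flat_dir == NULL || *flat_dir == '\0') return 0;
	return 1;
}

/* Reads the two environment variables named in the text:
   acnuc    -> directory of the index files
   gcgacnuc -> directory of the flat files */
int check_acnuc_env(void)
{
	if (!dirs_defined(getenv("acnuc"), getenv("gcgacnuc"))) {
		fprintf(stderr, "acnuc and gcgacnuc must both be set\n");
		return 0;
	}
	return 1;
}
```

A program would typically call check_acnuc_env() before acnucopen() and stop early when it returns 0.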

The C ACNUC API contains a series of functions that carry out common basic operations and the full query language. Other operations require understanding the logical structure of ACNUC databases for direct navigation through index files. The header file dir_acnuc.h describes in detail the organization of all index files. The API can handle 'large' files via 64-bit file offsets, both for division and index files.

Usage of the C ACNUC API is as follows:
- Include file dir_acnuc.h as the first include statement of your program.
- Link with library libcacnuc.a (or libcacnucdeb.a in the LBBE).

Anonymous ftp access to C source files
All C sources are in the publicly accessible file acnucsoft.tar. The makefile therein builds several ACNUC programs, including query, the line-oriented ACNUC retrieval program, as well as the ACNUC library libcacnuc.a. Thus a program prog.c that uses the ACNUC API and is placed in the same directory as the ACNUC source files can be compiled with:
gcc -o prog prog.c -L. -lcacnuc

LBBE (i.e., Lyon) access to C source files and libraries
In the LBBE computer setup, all ACNUC software is in directory /panhome/banques/csrc/trunk . Thus a program prog.c using the ACNUC API can be compiled by :
gcc -o prog -I/panhome/banques/csrc/trunk prog.c -L/panhome/banques/csrc/trunk -lcacnucdeb
Simple C API example. See also the main acnuc header file dir_acnuc.h.

CONSTANTS / GLOBAL VARIABLES / TYPEDEFs

int L_MNEMO : the fixed length of (sub-)sequence names; may vary according to the database.
int WIDTH_SP : the fixed length of species; may vary according to the database.
int WIDTH_KW : the fixed length of keywords; may vary according to the database.
int ACC_LENGTH : the max length of accession numbers; may vary according to the database.
int WIDTH_AUT : the fixed length of author names; may vary according to the database.
int WIDTH_BIB : the fixed length of references; may vary according to the database.
int WIDTH_SMJ : the fixed length of SMJYT codes; may vary according to the database.
int lrtxt : the fixed length of TEXT labels; may vary according to the database.
constant WIDTH_MAX = 150 (≥ any of L_MNEMO, WIDTH_SP, WIDTH_KW, WIDTH_AUT, WIDTH_BIB, WIDTH_SMJ, ACC_LENGTH, lrtxt)
int SUBINLNG : the number of SUBSEQ pointers in each LONGL record; may vary according to the database.
int VALINSHRT2 : the number of values in each record of index file SHORTL2; may vary according to the database; 0 means SHORTL2 is not used.
int VALINSHRT_KEY : the number of values in each record of index file SHORTL_KEY; may vary according to the database; 0 means SHORTL_KEY is not used; cannot be non-zero if VALINSHRT2 is zero.
DIR_FILE : A normally opaque struct type used for buffered and random access to ACNUC index files.
kacc, kaut, kbib, kext, kkey, kloc, klng, kshrt, kshrt2, kshrt_key, ksmj, kspec, ksub, ktxt : global variables of type pointer to DIR_FILE associated to each of the ACNUC index files named ACCESS, AUTHOR, BIBLIO, EXTRACT, KEYWORDS, LOCUS, LONGL, SHORTL, SHORTL2, SHORTL_KEY, SMJYT, SPECIES, SUBSEQ, TEXT, respectively.
enum shortl_kind { to_shortl = 0, sub_of_bib, spec_of_loc, bib_of_loc, aut_of_bib, bib_of_aut, sub_of_acc, key_of_sub, acc_of_loc}
Identifies a kind of short list stored in the SHORTL, SHORTL2 or SHORTL_KEY index files.
- to_shortl: other lists
- sub_of_bib: sequences of a reference
- spec_of_loc: species of a sequence (swissprot format only)
- bib_of_loc: references of a sequence
- aut_of_bib: authors of a reference
- bib_of_aut: references of an author
- sub_of_acc: sequences of an accession #
- key_of_sub: keywords of a sequence
- acc_of_loc: accession #s of a sequence
nseq : total number of records in file SUBSEQ (= maximum # of bits in a bit list of sequences)
maxa : the largest total record number among files SPECIES and KEYWORDS
lenbit : the largest among nseq and maxa (size in bits of a bit list that can contain either sequences, species, or keywords).
lenw : size in int of a bitlist holding lenbit bits (useful to allocate a bitlist of sequences, species, or keywords).
longa : size in int of a bitlist holding maxa bits (useful to allocate a bitlist of species or keywords).
flat_format : TRUE when using text flat files; FALSE when using GCG files.
genbank : TRUE iff annotations follow the GenBank syntax
embl : TRUE iff annotations follow the EMBL syntax
nbrf : TRUE iff annotations follow the PIR/NBRF syntax
swissprot : TRUE iff annotations follow the SwissProt syntax
divisions : rank of the last division file of the database (counting from 0, so there are divisions+1 divisions)
char **gcgname : array of division names, all without extension
int *annotopened : tells whether each division is currently opened
FILE **divannot : arrays of streams associated to currently opened divisions
int hsub, hkwsp : parameters that control hashing of sequence, keywords and species names
int hoffst : 3 means the new format that allows variable-length names; 2 means the old fixed-length format
int must_swap_bytes : TRUE iff endianness of index files and of host computer differ; the API transparently accepts reading and writing index files in this case also.
int irods_flat_files : TRUE iff the database's flat files are stored in the LBBE iRODS server.
flat_fopen, flat_fgets, flat_ftello, flat_fseeko, flat_fclose : pointers to functions that open, read, get and set position, close, respectively, an ACNUC flat file. If flat files are stored in the filesystem, these pointers point to fopen, fgets, .... In the LBBE setup, flat files can optionally be stored in an iRODS server (the value of the gcgacnuc environment variable determines whether this is true); in that case, these pointers point to other functions (see SUPPORT OF iRODS-BASED STORAGE) that open, read, get and set position, and close an iRODS-located file. It is also necessary to define the ALLOW_IRODS_FLAT_FILES preprocessor variable when compiling the ACNUC source code to support the storage of ACNUC flat files under iRODS.
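
The relation between lenbit and lenw (or maxa and longa) described above can be sketched as a ceiling division. This is an illustration only: it assumes a bit list uses every bit of each int, and the library may compute its own values differently. The names words_for_bits and alloc_bitlist are hypothetical.

```c
#include <limits.h>
#include <stdlib.h>

/* Number of int words needed to hold nbits bits (ceiling division).
   Assumption: each int contributes CHAR_BIT * sizeof(int) usable bits. */
int words_for_bits(int nbits)
{
	int bits_per_word = (int)(CHAR_BIT * sizeof(int));
	return (nbits + bits_per_word - 1) / bits_per_word;
}

/* Allocates an empty (all 0s) bit list able to hold nbits bits.
   Returns NULL if memory is exhausted; the caller frees the result. */
int *alloc_bitlist(int nbits)
{
	return (int *)calloc((size_t)words_for_bits(nbits), sizeof(int));
}
```

With the real API, the globals lenw and longa already hold these sizes, so calloc(lenw, sizeof(int)) is enough for a list of sequences, species or keywords.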

OPENING / CLOSING

void acnucopen(void);
Opens the ACNUC database identified by the acnuc and gcgacnuc environment variables for full, read-only access.
void simpleopen(void);
Opens the ACNUC database identified by the acnuc and gcgacnuc environment variables for partial, read-only access : only access by sequence name and to annotations and sequences is possible.
void dir_acnucopen(char *db_access);
Opens the ACNUC database identified by the acnuc and gcgacnuc environment variables for full access. Access is read-only if db_access is the string "RO", or read/write if it is the string "WP".
void dir_acnucclose(void);
Closes access to the current ACNUC database.

ACCESS BY SEQUENCE NAME
ACNUC nucleotide databases contain parent sequences, that are regular database entries, and subsequences, that are one or several fragments of one or several parents as defined in a features table entry. Subsequences are named by adding to the parent name a dot and an extension (e.g., ECOTGP.TRPA).

int gsnuml(char *name, int *length, int *frame, int *gencode);

name : sequence name terminated with \0 (upper/lowercase accepted)
*length : upon return, contains the sequence length
*frame : ignored if NULL, or returned with reading frame (0, 1, 2)
*gencode : ignored if NULL, or returned with id of genetic code (0=usual code)
returned value : the rank in the database of the (sub)sequence named "name", or 0 if none exists

int isenum(char *name);

name : null-terminated sequence name (upper/lowercase accepted)
returned value : the rank in the database of the (sub)sequence named "name", or 0 if none exists

ACCESS TO SEQUENCES
int gfrag(int nsub, int first, int lfrag, char *dseq);
Gets any part of any (sub)sequence of the database.

nsub : rank of (sub)sequence
first : starting position (counting from 1) for sequence access
lfrag : number of positions asked for
dseq : upon return, null-terminated string filled with bases or aa read
dseq is allocated by the caller
fewer than lfrag positions are read if the sequence end is reached
returned value : actual number of residues read, or 0 if any error
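
A long sequence can be read in pieces with repeated calls. The sketch below shows one possible loop; it takes the fetching function as a parameter with the same signature as gfrag so that it compiles without the library. The names read_whole_sequence and fake_gfrag are hypothetical, and fake_gfrag only imitates the documented behaviour on a 10-residue test string.

```c
#include <stdlib.h>
#include <string.h>

/* Same signature as gfrag in the text. */
typedef int (*fetch_fragment_fn)(int nsub, int first, int lfrag, char *dseq);

/* Reads positions 1..length of sequence nsub, at most chunk residues per call.
   Returns a malloc'ed null-terminated string, or NULL on error. */
char *read_whole_sequence(fetch_fragment_fn fetch, int nsub, int length, int chunk)
{
	char *seq, *buf;
	int first = 1;

	if (length <= 0 || chunk <= 0) return NULL;
	seq = (char *)malloc((size_t)length + 1);
	buf = (char *)malloc((size_t)chunk + 1); /* + 1 for the terminating null */
	if (seq == NULL || buf == NULL) { free(seq); free(buf); return NULL; }
	while (first <= length) {
		int want = length - first + 1, got;
		if (want > chunk) want = chunk;
		got = fetch(nsub, first, want, buf);
		if (got <= 0 || got > want) { /* 0 signals an error */
			free(seq); free(buf); return NULL;
		}
		memcpy(seq + first - 1, buf, (size_t)got);
		first += got;
	}
	seq[length] = '\0';
	free(buf);
	return seq;
}

/* Stand-in for gfrag used only to exercise the loop above. */
int fake_gfrag(int nsub, int first, int lfrag, char *dseq)
{
	static const char data[] = "ACGTACGTAC";
	int len = (int)strlen(data), n;

	(void)nsub;
	if (first < 1 || first > len || lfrag < 1) return 0;
	n = len - first + 1;
	if (n > lfrag) n = lfrag;
	memcpy(dseq, data + first - 1, (size_t)n);
	dseq[n] = '\0';
	return n;
}
```

With the real library, the call would presumably be read_whole_sequence(gfrag, num, length, 1000), with num and length obtained from gsnuml.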

int prep_extract(int usefilename, out_option option, char *fname, extract_option choice, char *feature_name, char *bounds, char *min_bounds, char **message);
Prepares later extraction operations (performed by calling extract_1_seq on one or several sequences) by specifying extraction format, type, and output destination.

usefilename : if TRUE, output goes to the file named in the fname argument; if FALSE, output goes through a caller-provided function
option : one of these enum values: acnuc, gcg, fasta, analseq, flat, coordinates

acnuc, gcg, fasta, analseq, flat: name of format of extracted sequences
coordinates: the function outputs coordinates in parent sequences of target sequence fragments (rather than sequence data)
Output is formatted by series of rank=%d&start=%d&end=%d| giving the rank of parent sequence, start position and end position of each target fragment. Fragments related to a contiguous sequence occur on one line; line changes indicate distinct sequences. start > end indicates fragment is on complementary strand of parent sequence.

fname : a filename if usefilename argument was TRUE, or a pointer to a struct stream_output that allows output to go through caller-provided function.
choice : one of these enum values: simple, translate, fragment, feature, region

simple: sequences are fully extracted by later calls to extract_1_seq
translate: protein-coding sequences are translated (nothing gets extracted if applied to sequences of type != CDS; does not apply if option == coordinates)
fragment: extracts any part of processed sequences. The part is specified by the bounds and min_bounds arguments according to the syntax suggested by these examples:

132,1600        to extract from nucl. 132 to nucl 1600 of the sequence. 
                If applied to a subsequence, extraction is done in the parent seq 
                relatively to the subsequence start point.
-10,10          to extract from 10 nucl. BEFORE the 5' end of the sequence
                to nucl. 10 of it. Useful only for subsequences, and produces
                a fragment extracted from its parent sequence.
e-20,e+10       to extract from 20 nucl. BEFORE the 3' end of the sequence
                to 10 nucl. AFTER its 3' end. Useful only for subsequences, and 
                produces a fragment extracted from its parent sequence.
-20,e+5         to extract from 20 nucl. BEFORE the 5' end of the sequence
                to 5 nucl. AFTER its 3' end.

feature: feature tables of sequences are scanned, each instance of the feature key given in feature_name argument is extracted; meaningful only for parent sequences (subsequences have no feature table)
region: the fragment operation is applied to all entries of the specified kind in the feature table of processed sequences. The bounds and min_bounds arguments specify what part of feature data are extracted.

feature_name : (useful only if choice is feature or region) a feature key (CDS, mRNA,...)
bounds : (useful only if choice is fragment or region) see syntax above
min_bounds : (useful only if choice is fragment or region) NULL or same syntax as bounds. When the sequence data is too short for this quantity to be extracted, nothing is extracted. When the sequence data is between min_bounds and bounds, extracted sequence data is extended by N's to the desired length. NULL is the same as setting min_bounds = bounds.
message : upon return, pointer to an error-describing message
returned value : 0 if OK, != 0 if error

typedef int (*writefunction)(const char *, void *);
struct stream_output {
	void *stream;
	writefunction outonelinef;
};

All extraction output is sent to caller-provided function outonelinef that is called with one line of output data as first argument and with opaque data pointer stream as second.
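
As an illustration of the callback mechanism, the sketch below defines a function of type writefunction that parses lines in the coordinates format (rank=%d&start=%d&end=%d|) and accumulates simple statistics through the opaque stream pointer. The two type definitions are copied from the text so that the example compiles alone; omit them when dir_acnuc.h is included. The names fragment_stats and count_fragments are hypothetical, and returning 0 from the callback is an assumption, since the text does not say what value the library expects.

```c
#include <stdio.h>

/* Copied from the text; omit when dir_acnuc.h is included. */
typedef int (*writefunction)(const char *, void *);
struct stream_output { void *stream; writefunction outonelinef; };

/* Hypothetical accumulator reached through the opaque stream pointer. */
struct fragment_stats {
	int fragments;       /* number of fragments seen */
	long total_length;   /* summed fragment lengths */
	int complementary;   /* fragments with start > end */
};

/* Callback of type writefunction: parses one line of coordinates output. */
int count_fragments(const char *line, void *stream)
{
	struct fragment_stats *st = stream;
	int rank, start, end, used;

	for (;;) {
		used = 0; /* stays 0 if the trailing '|' is not matched */
		if (sscanf(line, "rank=%d&start=%d&end=%d|%n",
				&rank, &start, &end, &used) != 3 || used == 0)
			break;
		st->fragments++;
		st->total_length += (start > end ? start - end : end - start) + 1;
		if (start > end) st->complementary++; /* complementary strand */
		line += used;
	}
	return 0; /* assumption: the expected return value is not documented here */
}
```

A caller would fill a struct stream_output with the address of a fragment_stats and count_fragments, then pass its address (cast to char *) as the fname argument of prep_extract with usefilename set to FALSE.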

int extract_1_seq(int seqnum, char *bounds);
Extracts sequence of rank seqnum according to extraction rules given by previous call to prep_extract.

seqnum : rank of (sub)sequence to process
bounds : (only for fragment or region operations) same as bounds argument of prep_extract call
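
The bounds syntax described for prep_extract (for example 132,1600 or e-20,e+10) can be read as two endpoints, each an optional e prefix followed by a signed number. The parser below is a sketch of that reading only; it is not the library's parser, and the names bound, parse_endpoint and parse_bounds are hypothetical.

```c
#include <stdlib.h>

/* One endpoint of a bounds string. */
struct bound {
	int from_3prime; /* 1 if written with the "e" prefix (relative to the 3' end) */
	long offset;     /* signed number following the optional prefix */
};

/* Parses one endpoint such as "132", "-10", "e-20" or "e+10".
   Returns a pointer just past the parsed text, or NULL on a syntax error. */
static const char *parse_endpoint(const char *s, struct bound *b)
{
	char *end;

	b->from_3prime = 0;
	if (*s == 'e' || *s == 'E') { b->from_3prime = 1; s++; }
	b->offset = strtol(s, &end, 10);
	if (end == s) return NULL; /* no digits found */
	return end;
}

/* Parses "endpoint,endpoint". Returns 1 if the whole string is valid, else 0. */
int parse_bounds(const char *text, struct bound *start, struct bound *stop)
{
	const char *p = parse_endpoint(text, start);

	if (p == NULL || *p != ',') return 0;
	p = parse_endpoint(p + 1, stop);
	return p != NULL && *p == '\0';
}
```

For instance, e-20,e+10 yields two endpoints relative to the 3' end with offsets -20 and +10, while -10,10 yields two endpoints relative to the 5' end.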

void fin_extract(void);
To be called once after all calls to extract_1_seq to close the extraction output.

ACCESS TO SEQUENCE ANNOTATIONS
For a parent sequence, the only possible access is to its first annotation line and to following lines.
For a subsequence, the only possible access is to the first line of the corresponding FEATURE (e.g., CDS, tRNA, etc...) and to following lines.
Moreover, access to a previously accessed annotation line is possible provided the address of this line, returned by the next_annots64 function, is memorized.

void seq_to_annots64(int numseq, off_t *faddr, int *div);
This function gives the caller the information needed to access the first annotation line of a (sub)sequence.

numseq : rank of parent or subsequence
*faddr, *div : upon return, couple of data used to access annotations via the read_annots64 function.

char *read_annots64(off_t faddr, int div);
Returns in static memory the annotation line addressed by the faddr and div arguments. Trailing \n and spaces are removed.
To access following annotation lines, use :
char *next_annots64(NULL);
Returns the annotation line following the last one read.
char *next_annots64(off_t *pfaddr);
This alternative call is useful to allow re-access to an annotation line, later in the program. First, read this line with next_annots64 and a non-NULL argument, memorize the off_t value obtained upon return, and use this value as the faddr argument of a call to read_annots64 any time later. The necessary div argument is the same for any annotation line of one sequence.
char *short_descr(int seqnum, char *text, int maxlen);

seqnum : (sub)sequence rank
text : upon return, char string filled with a short sequence description built with the sequence name and, for a parent sequence, from DE/DEFINITION lines, and for a subsequence, from corresponding "qualifiers".
maxlen : max memory size for text
returned value : pointer to text

char *short_descr_p(int seqnum, char *text, int maxlen);
same as short_descr for a parent sequence;
for a subsequence, applies short_descr to its main parent.
int read_loc_qualif(int isub, char *location, int maxlocat, char *type, char *qualifiers, int maxqualif);
To return location, qualifiers or feature key of a subsequence.

isub: rank of subsequence
location: NULL or memory to receive the subsequence's location
maxlocat: typically sizeof(location), useless if location is NULL
type: NULL or memory to receive the subsequence's feature key
qualifiers: NULL or memory to receive the subsequence's qualifiers
maxqualif: typically sizeof(qualifiers), useless if qualifiers is NULL
returned value: FALSE if OK; TRUE if isub is not a subsequence or if not enough memory to receive all required data.

TRANSLATION / GENETIC CODES

char codaa(char *codon, int code);

codon : pointer to trinucleotide (e.g. acu, GGT)
code : genetic code id (e.g., computed by gsnuml, or 0 for the usual code)
returned value : the corresponding amino acid on one character
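
To make the codon-to-amino-acid mapping concrete, here is a self-contained sketch limited to the standard ("usual") genetic code. It is a stand-in written for illustration, not the implementation of codaa: the name standard_codon_to_aa is hypothetical, and returning 'X' for a codon containing anything other than A, C, G, T or U is an assumption.

```c
#include <ctype.h>

/* Maps a base to 0..3 in T, C, A, G order (U is treated as T); -1 otherwise. */
static int base_index(char b)
{
	switch (toupper((unsigned char)b)) {
	case 'T': case 'U': return 0;
	case 'C': return 1;
	case 'A': return 2;
	case 'G': return 3;
	default: return -1;
	}
}

/* Translates one codon (upper or lower case, RNA or DNA alphabet)
   with the standard genetic code; '*' denotes a stop codon. */
char standard_codon_to_aa(const char *codon)
{
	/* 64 amino acids, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG */
	static const char table[] =
		"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
	int i, idx = 0;

	for (i = 0; i < 3; i++) {
		int b = base_index(codon[i]);
		if (b < 0) return 'X'; /* assumption: unknown or ambiguous codon */
		idx = idx * 4 + b;
	}
	return table[idx];
}
```

The two codons quoted in the text, acu and GGT, give T and G respectively under this table.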

char init_codon_to_aa(char *codon, int gc);

codon : pointer to initiation codon (e.g. aug, GTG)
gc : genetic code id
returned value : the corresponding amino acid on one character using the initiation codon rule of the genetic code.

char *get_code_descr(int code);

code : genetic code id (e.g., computed by gsnuml)
returned value : string <= 60 chars describing how this genetic code differs from the usual one (e.g. AGR=* AUA=M UGA=W )

char *translate_cds(int seqnum);
Complete translation, returned in malloc'ed memory, of sequence of rank seqnum (often a subsequence) using the sequence's genetic code and its rule concerning the initiation codon.
char translate_init_codon(int seqnum, int gc, int codon_start /* 1, 2, or 3 */);
returns in one char the translation of the initiation codon of sequence of rank seqnum using the genetic code of id gc and the offset codon_start for correct reading frame.
int get_ncbi_gc_number(int gc);
returns the NCBI id of the genetic code with ACNUC id gc
int get_acnuc_gc_number(int ncbi_gc);
returns the ACNUC id of the genetic code with NCBI id ncbi_gc
returns 0 (=usual code) if not found.

ACCESS BY SPECIES, KEYWORD, AUTHOR, REFERENCE, ACC. NUMBER, etc...

int iknum(char *name, DIR_FILE *fp);

name : taxon or keyword name (null-terminated string ignoring case)
fp : kkey for keyword or kspec for a taxon name
returned value : rank of name, or 0 if it does not exist

int fcode(DIR_FILE *fp, char *key, int lcompar);

fp : kacc, kaut, ksmj, kbib for accession-number, author-name, SMJYT, or reference, respectively
key : string to search (case is ignored)
lcompar : number of used characters in key during search
returned value : rank of found key in corresponding index file, or 0 if key does not exist.

int shkseq(char *name, int *bitlist, int oper);

name : taxon or keyword name (null-terminated string ignoring case); can contain @ characters to indicate wildcards.
bitlist : integer array of size at least lenw to be filled upon return with the bitlist of seqs attached to all taxa or keywords placed below name in the species or keywords trees.
oper : (input) 1 for species, 2 for host, 3 for keywords.
returned value : 1 when OK; 2 when nothing matches name in index file

void sel_seqs_1_node (DIR_FILE *kan, int recnum, int *bitlist, int host);

kan : kspec for species or kkey for keywords
recnum : rank in index file addressed by kan of a species or a keyword
bitlist : integer array of size at least lenw to be filled upon return with the bitlist of seqs attached to all taxa or keywords placed below name in the species or keywords trees. Normally transmitted empty (= all 0s) by caller.
host: TRUE iff kan==kspec and host sequences of taxon are expected

int taxidtosp(int tid);

return value : rank in file SPECIES of the taxon of taxID tid, or 0 if no such taxon.

int sptotaxid(char *taxname, int rank);

taxname: NULL or taxon name (case is not significant)
rank: (used only if taxname == NULL) the acnuc rank of a taxon
return value : NCBI taxon ID of the taxon given by name or by ACNUC rank, or 0 if no such taxon exists or if this taxon has no NCBI taxon ID.

void descen (DIR_FILE *kan, int recnum, int *bitlist);

kan : kspec for species or kkey for keywords
recnum : starting record rank in file addressed by kan
bitlist : integer array of size at least longa to be filled upon return with the bitlist of taxa or of keywords placed below node of rank recnum in the species or keywords trees.

char *get_ancestor_taxon (char *name, int rank, int *pancestor);

name: NULL or a taxon name (case is not significant)
rank: (only if name == NULL) a taxon rank in index file SPECIES
pancestor: NULL, or returned filled with the rank in SPECIES of ancestor of name/rank
return value: the name of the ancestor in static memory ("Root" when name is at tree top)
or NULL if not enough memory or name does not exist.

C code to find seqs attached to an accession no.:

	
	char access[] = "M00001";
	int num, seq, rank;
	unsigned point;
	num = fcode(kacc, access, ACC_LENGTH);
	if(num == 0) return; /* this accession no does not exist */
	readacc(num);
	point = pacc->plsub; rank = 0;
	while (point != 0) {
		/* seq is the rank of a sequence attached to given acc no. */
		seq = follow_shortl(&point, sub_of_acc, &rank);
		}

C code to find seqs attached to a taxon, taxID or keyword

	
	char my_taxon[] = "Bovidae"; /* case ignored */
	char my_kw[] = "ribosomal protein"; /* case ignored */
	int tid = 284813 ; /* taxon id of Encephalitozoon cuniculi */
	int num, err, *list, numsp;

	list = (int *)calloc(lenw, sizeof(int));
	err = shkseq(my_taxon, list, 1);
	if(err == 2) return; /* taxon does not exist */
	num = 1;
	while( (num = irbit(list, num, nseq)) != 0) {
		/* here num is the rank of a seq attached to taxon my_taxon */
		}

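	numsp = taxidtosp(tid);
	if(numsp != 0) {
		/* sel_seqs_1_node expects an empty list: clear the previous result */
		memset(list, 0, lenw * sizeof(int));
		sel_seqs_1_node(kspec, numsp, list, FALSE);
		num = 1;
		while( (num = irbit(list, num, nseq)) != 0) {
			/* here num is the rank of a seq attached to taxID tid */
			}
		}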

	err = shkseq(my_kw, list, 3);
	if(err == 2) return; /* keyword does not exist */
	num = 1;
	while( (num = irbit(list, num, nseq)) != 0) {
		/* here num is the rank of a seq attached to keyword my_kw */
		}

	free(list);

C code to find all keywords attached to a sequence

	
	int num, kw, rank;
	unsigned point;

	num = isenum("ECOTGP"); /* get rank of starting sequence name */
	readsub(num);
	point = psub->plkey;
	rank = 0;
	while (point != 0) {
		/* here kw is the rank of an attached keyword */
		kw = follow_shortl(&point, key_of_sub, &rank);
		}

C code to find keywords placed below one keyword in the keyword tree

	
	int kw, *liste_kw, num;

	liste_kw = (int *)calloc(longa, sizeof(int)); /* start from an empty list */
	kw = iknum("division names", kkey); /* get rank of starting keyword */
	if(kw == 0) return; /* keyword does not exist */
	descen(kkey, kw, liste_kw);
	/* list liste_kw contains all keywords placed below starting keyword in the tree 
	of keywords, including itself */
	bit0(liste_kw, kw); /* remove starting keyword from list */
	num = 1;
	while((num = irbit(liste_kw, num, maxa)) != 0) {
		readkey(num); /* here num is the rank of a descending 
			keyword in the tree of keywords */
		}

C code to find all species below one taxon in the taxon tree

	
	int sp, *liste_sp, num;

	liste_sp = (int *)calloc(longa, sizeof(int)); /* start from an empty list */
	sp = iknum("Mammalia", kspec); /* starting taxon */
	if(sp == 0) return; /* taxon does not exist */
	descen(kspec, sp, liste_sp);
	/* list liste_sp contains all taxa placed below starting taxon in the tree of taxa, 
	including itself */
	num = 1;
	while((num = irbit(liste_sp, num, maxa)) != 0) {
		readspec(num); /* here num is the rank of a descending 
			taxon in the tree of taxa */
		if(pspec->plsub == 0) bit0(liste_sp, num);
		/* if a taxon has no associated seq, remove it from list */
		}

ACCESS BY THE QUERY LANGUAGE

Other global variables :

int tlist = 50 : total number of usable bitlists
int defoccup[] : array giving the occupancy state of bitlists, TRUE when occupied.
int *defbitlist : array holding all (occupied and free) bitlists; this array is pre-allocated by the API; each bitlist is lenw int-long and k^th bitlist begins at defbitlist + k * lenw
char *deflnames[] : array of names of bitlists, converted to uppercase, malloc'ed when created, and free'ed when deleted.
int deflocus[] : array indicating whether bitlists contain parent sequences only (TRUE) or both parent and subsequences (FALSE).
char defgenre[] : array indicating the type of bitlists; 'S', sequences; 'E', species; 'K' keywords.
int defllen[] : array giving the number of elements in each bitlist

Query language API

#include "requete_acnuc.h"
necessary when following functions are used
void prep_acnuc_requete(void);
call this once before using the proc_requete function any number of times
int proc_requete(char *query, char message[100], char *listname, int *listrank);
computes the bitlist of sequences (sometimes species or keywords) that match a query;

query : the query string, for example sp=homo sapiens or sp=bos taurus
message : upon return, and in case of error, filled with an error describing message
listname : (input) name to be given, after conversion to uppercase, to the bitlist to be constructed; if a list with this (uppercase only) name already exists, the list will be replaced by the new one.
listrank : upon return, points to the rank of the created bitlist, so that defbitlist + (*listrank)*lenw points to the beginning of this list.
returned value : 0 if OK, != 0 indicates error.

void free_list(int num);
frees bitlist of rank num for use by future queries.

Query API usage example
Here is a commented example of usage. It boils down to :

	
#include "dir_acnuc.h"
#include "requete_acnuc.h"
acnucopen();
prep_acnuc_requete();
	
apply function proc_requete to the query string
scan the bitlist produced by this function
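
The last step, scanning a bitlist located inside the pool of lists, can be sketched without the library as follows. The pool addressing mirrors the expression defbitlist + k * lenw given above. The bit layout (bits counted from 1, stored from the lowest bit of each int) is an assumption, and list_in_pool, set_bit, test_bit and next_bit are hypothetical stand-ins; next_bit searches strictly after its second argument, which is how the examples in this document use irbit.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD ((int)(CHAR_BIT * sizeof(int)))

/* Start of the k-th list in a pool of lists of words_per_list ints each. */
int *list_in_pool(int *pool, int k, int words_per_list)
{
	return pool + (size_t)k * (size_t)words_per_list;
}

/* Sets bit n (counted from 1, an assumption) in list. */
void set_bit(int *list, int n)
{
	((unsigned *)list)[(n - 1) / BITS_PER_WORD] |= 1u << ((n - 1) % BITS_PER_WORD);
}

/* Returns 1 if bit n is set in list, 0 otherwise. */
int test_bit(const int *list, int n)
{
	unsigned word = ((const unsigned *)list)[(n - 1) / BITS_PER_WORD];
	return (int)((word >> ((n - 1) % BITS_PER_WORD)) & 1u);
}

/* Returns the first set bit strictly after 'after' and not beyond 'last',
   or 0 if there is none. */
int next_bit(const int *list, int after, int last)
{
	int n;

	for (n = after + 1; n <= last; n++)
		if (test_bit(list, n)) return n;
	return 0;
}
```

With the real API, the list produced by proc_requete starts at defbitlist + (*listrank) * lenw and would be scanned with irbit up to nseq.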

Query language
All ACNUC queries can be processed by the proc_requete function. The query language defines several selection criteria and operations between lists of elements matching criteria. It creates mainly lists of sequences, but also lists of species (or, more generally, taxa) and of keywords.

Selection criteria are : (no space before the = sign)

SP=taxon : seqs attached to taxon or any other below in tree; @ wildcard possible
TID=id : seqs attached to given numerical NCBI's taxon id
H=taxon : seqs whose host is taxon or any other below in tree; @ wildcard possible
K=keyword : seqs attached to keyword or any other below in tree; @ wildcard possible
T=type : seqs of specified type
J=journal_name : seqs published in journal specified using defined journal code
R=refcode : seqs from reference specified such as in jcode/volume/page (e.g., JMB/13/5432)
AU=name : seqs from references having specified author (only last name, no initial)
AC=accession_no : seqs attached to specified accession number
N=seq_name : seqs of given name (ID or LOCUS); @ wildcard possible
NS=taxon_name : taxon of given name; @ wildcard possible
NK=keyword_name : keyword of given name; @ wildcard possible
Y=year : seqs published in specified year; > and < can be used instead of =
O=organelle : seqs from specified organelle named following defined code (e.g., chloroplast)
M=molecule : seqs from specified molecule as named in ID or LOCUS annotation records
ST=status : seqs from specified data class (EMBL) or review level (UniProt)
F=file_name : seqs whose names are in given file, one name per line
FA=file_name : seqs attached to accession numbers in given file, one number per line
FK=file_name : produces the list of keywords named in given file, one keyword per line
FS=file_name : produces the list of species named in given file, one species per line
list_name : the named list that must have been previously constructed

Operators are : (always followed and preceded by spaces or parentheses)

AND or ET : intersection of the 2 list operands
OR or OU : union of the 2 list operands
NOT or NO : complementation of the single list operand
PAR or ME : compute the list of parent seqs of members of the single list operand
SUB or FI : add subsequences of members of the single list operand
PS : project to species: list of species attached to member sequences of the operand list
PK : project to keywords: list of keywords attached to member sequences of the operand list
UN : unproject: list of seqs attached to members of the species or keywords list operand
SD : compute the list of species placed in the tree below the members of the species list operand
KD : compute the list of keywords placed in the tree below the members of the keywords list operand

The query language is case-insensitive except where filenames occur. Parentheses can be used to specify the scope of operators. Three operators (AND, OR, NOT) can be ambiguous because they can also occur within valid criterion values. Such ambiguities can be resolved by enclosing elementary selection criteria in double quotes. For example:

"sp=Beak and feather disease virus" and "au=ritchie"

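When a query is assembled by a program, quoting each criterion systematically avoids the ambiguity described above. The helper below is a small sketch of that idea; build_and_query is a hypothetical name, and criteria that themselves contain a double quote are not handled.

```c
#include <stdio.h>

/* Writes the query  "crit_a" and "crit_b"  (with the double quotes) into out.
   Returns 1 on success, 0 if out is too small to hold the whole query. */
int build_and_query(char *out, size_t size, const char *crit_a, const char *crit_b)
{
	int n = snprintf(out, size, "\"%s\" and \"%s\"", crit_a, crit_b);
	return n >= 0 && (size_t)n < size;
}
```

The resulting string can then be passed as the query argument of proc_requete.
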
READING/WRITING ACNUC INDEX FILES
Macros or functions are devoted to the reading of one record for each index file in C structures that are always accessible through global variables.

	
Function or macro          File    Pntr to record  DIR_FILE name
void readacc(int recnum);  ACCESS	pacc		kacc
void readsub(int recnum);  SUBSEQ	psub		ksub
void readloc(int recnum);  LOCUS	ploc		kloc
void readshrt(int recnum); SHORTL	pshrt		kshrt
void readlng(int recnum);  LONGL	plng		klng
void readext(int recnum);  EXTRACT	pext		kext
void readsmj(int recnum);  SMJYT	psmj		ksmj
void readaut(int recnum);  AUTHOR	paut		kaut
void readbib(int recnum);  BIBLIO	pbib		kbib
void readkey(int recnum);  KEYWORDS	pkey		kkey
void readspec(int recnum); SPECIES	pspec		kspec
void readtxt(int recnum);  TEXT         ptxt		ktxt

Writing is done similarly with macros writeacc, writesub, etc...
dir_acnuc.h details the structure associated to the records of each ACNUC index file. For example, readsub(n) reads the nth record of file SUBSEQ into the following C structure pointed to by global variable psub :

	
	struct rsub {     /* SUBSEQ : one record for each (sub)sequence */
		int length, /* seq length; or 0 if record was deleted */
		type, /* to SMJYT, for seq type */
		pext, /* if > 0 this is a subsequence, pext points to EXTRACT for list of exons;
			if <= 0 this is a parent sequence, -pext points to LONGL for list of subseqs */
		plkey, /* to SHORTL for list of keywords */
		plinf, /* if parent sequence, plinf points to LOCUS for corresponding record;
			if subsequence, points to SHORTL for list of address of start of annotations;
			this list contains only one element to be combined with the division rank
			for access to annotations */
		phase, /* 100 * code_number + reading_frame_0_1_2 */
		h; /* to SUBSEQ for next record with same hashing value or 0 */
		char name[1]; /* sequence name padded by spaces to L_MNEMO chars */
	} *psub;
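
The phase field packs two values, as its comment states (100 * code_number + reading_frame). A minimal sketch of the packing and unpacking arithmetic follows; phase_parts, split_phase and join_phase are hypothetical names, and non-negative values are assumed.

```c
/* The two values packed in the phase field of a SUBSEQ record. */
struct phase_parts {
	int code_number;   /* genetic code number */
	int reading_frame; /* 0, 1 or 2 */
};

/* Unpacks phase = 100 * code_number + reading_frame. */
struct phase_parts split_phase(int phase)
{
	struct phase_parts p;

	p.code_number = phase / 100;
	p.reading_frame = phase % 100;
	return p;
}

/* Packs the two values back into a single phase value. */
int join_phase(int code_number, int reading_frame)
{
	return 100 * code_number + reading_frame;
}
```

For example, a phase of 302 corresponds to code number 3 and reading frame 2. In normal use, gsnuml returns both values directly.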

Two functions allow reading and writing the first record of each index file which differs from all other records by holding the total record number in the index:

int read_first_rec(DIR_FILE *fp, int *endsort);

fp : variable associated to an index file
*endsort : returned with the rank of the last alphabetically sorted record; ignored if NULL
return value : total record number in index file (counted from 1)

void write_first_rec(DIR_FILE *fp, int total, int endsort);
Updates the total record count of an index file.

fp : variable associated with an index file
total : total number of records in the index file
endsort : rank of the last alphabetically sorted record, or 0 if not sorted at all (applies to ksub, kaut, kbib, kacc, ksmj only).

Index files contain fixed-length, space-padded strings. These are not C strings because they are not terminated by a null byte. A true C string is obtained as follows:

char nom[L_MNEMO + 1]; memcpy(nom, psub->name, L_MNEMO); nom[L_MNEMO] = 0; trim_key(nom);

Conversely, to write the C string name to an ACNUC index file buffer, use:

padtosize(psub->name, name, L_MNEMO);

This call may overwrite other fields of the structure, because padtosize also writes a final null byte after the padded string; such fields should therefore be filled afterwards.
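The two conversions above (fixed-length field to C string, and back) can be mimicked with standard C only. The following is a minimal sketch, not the library's code: the names field_to_cstr, cstr_to_field and FIELD_LEN are hypothetical stand-ins for the memcpy/trim_key idiom, for padtosize and for L_MNEMO. Unlike padtosize, the writer shown here stores exactly len bytes and no final null byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIELD_LEN 16 /* hypothetical field width, standing in for L_MNEMO */

/* Copy a space-padded, non-terminated field into dst (size >= len + 1),
   terminate it and strip trailing spaces. Returns the resulting length. */
size_t field_to_cstr(char *dst, const char *field, size_t len)
{
    assert(dst != NULL && field != NULL);
    memcpy(dst, field, len);
    while (len > 0 && dst[len - 1] == ' ')
        len--;
    dst[len] = '\0';
    return len;
}

/* Store C string src into a fixed-width field: truncate if too long,
   pad with spaces if too short. Exactly len bytes are written. */
void cstr_to_field(char *field, const char *src, size_t len)
{
    size_t n;
    assert(field != NULL && src != NULL);
    n = strlen(src);
    if (n > len)
        n = len;
    memcpy(field, src, n);
    memset(field + n, ' ', len - n);
}
```

Writing exactly len bytes avoids touching neighbouring structure fields, at the price of leaving the field without a terminator.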

Reading example :

	
int num, type;
char seqname[] = "ecotgp.trpa";
#define LCODE sizeof(psmj->name)
char code[LCODE +  1];

num = isenum(seqname); /* get the seq rank from its name */
readsub(num); /* read SUBSEQ record of rank num into buffer pointed to by psub */
type = psub->type; /* this field indicates the seq type */
readsmj(type); /* read SMJYT record corresponding to type */
memcpy(code, psmj->name, LCODE); /* prepare a C string from the name field of the SMJYT record */
code[LCODE] = 0;
trim_key(code);
printf("type of sequence %s is %s\n", seqname, code);

Functions to read/write short lists from index files SHORTL, SHORTL2 and SHORTL_KEY.
Prefer follow_shortl() whenever possible, because it provides a single API covering all three index files.

int follow_shortl (unsigned *p_recnum, enum shortl_kind kind, int *p_rank);

p_recnum : pointer to a record number in file SHORTL, SHORTL2 or SHORTL_KEY.
The function updates the pointed value so that the next call delivers the next value of the list. When this value is 0 upon return, the end of the list has been reached.
kind : indicates what sort of list is involved
p_rank : pointer to an integer that must be initialized to 0 and that is changed at each function call.
p_rank can be set to NULL to access only the first value in the list.
return value : one value stored in the list.

int read_shortl_record(DIR_FILE *k, unsigned recnum, struct rshrt *ps);
Reads record # recnum of file SHORTL open as k, and puts the read data in ps.
Returns 0 iff OK.

int write_shortl_record(DIR_FILE *k, unsigned recnum, struct rshrt *ps);
Writes the data in ps to record # recnum of file SHORTL open as k.
Returns 0 iff OK.

int read_shortl2_record(DIR_FILE *k, unsigned recnum, struct rshrt2 *ps);
Reads record # recnum of file SHORTL2 or SHORTL_KEY according to k, and puts the read data in ps.
Returns 0 iff OK.

int write_shortl2_record(DIR_FILE *k, unsigned recnum, struct rshrt2 *ps);
Writes the data in ps to record # recnum of file SHORTL2 or SHORTL_KEY according to k.
Returns 0 iff OK.

struct rshrt2* read_shortl2_record_new(enum shortl_kind slkind, unsigned recnum, DIR_FILE **pf);
Reads record # recnum of file SHORTL2 or SHORTL_KEY according to the kind of short list slkind, and puts in *pf the DIR_FILE variable of the corresponding index file.
Returns the content of the read record.

USING BIT LISTS
Bit lists make it possible to handle lists of sequences, species or keywords. List elements are represented by their rank, that is, the number of the corresponding record in the ACNUC index files. Ranks are computed by gsnuml or isenum for sequences and by iknum for species or keywords. Bit lists are arrays of integers. The range of rank values begins at 2 because index file records are numbered starting from 1 and record # 1 is reserved for holding the file's total record number.

Allocation of an empty list:

int *mylist; mylist = (int *)calloc(lenw, sizeof(int));

For a species or keyword list, longa can be used instead of lenw.

void bit1(int *mylist, int num) : adds element of rank num to list mylist.

bit1(mylist, num);

void bit0(int *mylist, int num) : removes element of rank num from list mylist.

bit0(mylist, num);

int testbit(int *mylist, int num) : tests for presence of element of rank num in list mylist.

if (testbit(mylist, num)) { /* num is present */ } else { /* num is absent */ }

int irbit(int *mylist, int from, int last) : loop over all elements of a list.

int num = 1; while ((num = irbit(mylist, num, lenbit)) != 0) { /* work with element of rank num */ }

For a species or keyword list, lenbit can be replaced by maxa.
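To make the behaviour of these four calls concrete, here is a minimal self-contained sketch of a bit list built on an array of int. It is not the library's implementation: the toy_ names are hypothetical, and the mapping of rank r to bit (r modulo word size) of word (r / word size) is an assumption, since the actual ACNUC bit layout is not described here. The semantics of toy_irbit (first present rank strictly after from, up to last, or 0) is inferred from the loop idiom shown above.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define WORD_BITS ((int)(sizeof(int) * CHAR_BIT))

/* Number of int words needed to hold ranks 0..maxrank (assumed layout). */
int toy_words(int maxrank) { return maxrank / WORD_BITS + 1; }

/* Add element of rank num to the list. */
void toy_bit1(int *list, int num)
{
    unsigned *w = (unsigned *)list;
    assert(num >= 0);
    w[num / WORD_BITS] |= 1u << (num % WORD_BITS);
}

/* Remove element of rank num from the list. */
void toy_bit0(int *list, int num)
{
    unsigned *w = (unsigned *)list;
    assert(num >= 0);
    w[num / WORD_BITS] &= ~(1u << (num % WORD_BITS));
}

/* Return 1 if element of rank num is present, 0 otherwise. */
int toy_testbit(const int *list, int num)
{
    const unsigned *w = (const unsigned *)list;
    return (int)((w[num / WORD_BITS] >> (num % WORD_BITS)) & 1u);
}

/* Return the first present rank r with from < r <= last, or 0 if none. */
int toy_irbit(const int *list, int from, int last)
{
    int r;
    for (r = from + 1; r <= last; r++)
        if (toy_testbit(list, r))
            return r;
    return 0;
}
```

A list for ranks up to maxrank is then allocated with calloc(toy_words(maxrank), sizeof(int)) and traversed with the same while loop as shown for irbit.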

Emptying a bit list:

memset(mylist, 0, lenw * sizeof(int));

void ou(int *result, int *list1, int *list2, int nwords) : Add two lists.

ou(result, list1, list2, lenw); /* replace lenw by longa for species or keywords lists */

The list result, which must be allocated beforehand, receives the elements of list1 and those of list2; it can be list1 or list2 itself.

void et(int *result, int *list1, int *list2, int nwords) : Intersection of two lists.

et(result, list1, list2, lenw); /* replace lenw by longa for species or keywords lists */

The list result, which must be allocated beforehand, receives the elements common to list1 and list2; it can be list1 or list2 itself.

void non(int *result, int *list1, int nwords): complementation of a list.
Combine "non" with "et" to remove from a list the elements of another list:

non(result, list2, lenw); et(result, list1, result, lenw);

The list result, which must be allocated beforehand, receives the elements of list1 that are absent from list2.
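The three set operations reduce to word-wise bit operations on the underlying integer arrays. The sketch below illustrates this under the assumption that a list is a plain array of nwords int words; the toy_ names are hypothetical and this is not the library's code. As with ou and et, the result may be the same array as one of the operands.

```c
#include <assert.h>

/* Union: result = list1 OR list2, word by word. */
void toy_ou(int *result, const int *list1, const int *list2, int nwords)
{
    int i;
    assert(nwords >= 0);
    for (i = 0; i < nwords; i++)
        result[i] = list1[i] | list2[i];
}

/* Intersection: result = list1 AND list2, word by word. */
void toy_et(int *result, const int *list1, const int *list2, int nwords)
{
    int i;
    assert(nwords >= 0);
    for (i = 0; i < nwords; i++)
        result[i] = list1[i] & list2[i];
}

/* Complement: result = NOT list1. This also sets bits that correspond
   to no valid rank; a following intersection clears them again. */
void toy_non(int *result, const int *list1, int nwords)
{
    int i;
    assert(nwords >= 0);
    for (i = 0; i < nwords; i++)
        result[i] = ~list1[i];
}
```

The difference "list1 minus list2" is then toy_non(result, list2, n) followed by toy_et(result, list1, result, n), mirroring the non/et combination above.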

int bcount(int *mylist, int maxbits): count the number of elements in a list.

int nbr = bcount(mylist, lenbit);

void lngbit(int recnum, int *blist): reads a long list from ACNUC indexes as a bitlist:
recnum: record number of the start of a long list
blist: a preallocated sequence bitlist

UTILITY FUNCTIONS

char complementer_base(char nucl);

nucl : a character, normally one of aAcCgGtTuUrRyYnN
returned value : the complementary base (lowercase, n if nucl is unknown char)

void complementer_seq(char *seq, int length);
In-place reverse complementation of a sequence (complementation and inversion).
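As an illustration of what these two functions compute, here is a minimal stand-alone sketch. The toy_ names are hypothetical and the code is not taken from the library. It handles only the characters listed above and returns lowercase results, as documented for complementer_base; how the real functions treat other ambiguity codes or letter case in sequences is not specified here.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Lowercase complement of one base; 'n' for any unrecognized character. */
char toy_complement_base(char nucl)
{
    switch (tolower((unsigned char)nucl)) {
    case 'a': return 't';
    case 'c': return 'g';
    case 'g': return 'c';
    case 't':
    case 'u': return 'a';
    case 'r': return 'y';
    case 'y': return 'r';
    default:  return 'n';
    }
}

/* Reverse and complement seq[0..length-1] in place. */
void toy_complement_seq(char *seq, int length)
{
    int i, j;
    assert(seq != NULL && length >= 0);
    for (i = 0, j = length - 1; i < j; i++, j--) {
        char tmp = toy_complement_base(seq[i]);
        seq[i] = toy_complement_base(seq[j]);
        seq[j] = tmp;
    }
    if (i == j) /* middle base of an odd-length sequence */
        seq[i] = toy_complement_base(seq[i]);
}
```

Swapping from both ends towards the middle performs the inversion and the complementation in a single pass without a second buffer.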

void padtosize(char *pname, char *name, int length);
Pads a string with spaces to a given length.

pname : upon return, string made from name padded/truncated to length (must be large enough to hold final null and must not overlap string name)
name : unchanged input string
length : length that pname has upon return

int strcmptrail(char *s1, int l1, char *s2, int l2);
String comparison limited to lengths l1 and l2 and ignoring terminal spaces.
With s2==NULL and l2==0, s1 can be compared to a string of spaces only.
Returns as strcmp.
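The comparison rule described for strcmptrail can be sketched with standard C as follows. This is an illustrative stand-in with a hypothetical name, not the library's implementation; it assumes that neither buffer contains a null byte within its given length, and it only guarantees the sign of the result, as strcmp does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compare s1[0..l1-1] and s2[0..l2-1], ignoring trailing spaces.
   s2 may be NULL when l2 is 0. Returns <0, 0 or >0 as strcmp. */
int toy_strcmptrail(const char *s1, int l1, const char *s2, int l2)
{
    int n, cmp;
    assert(l1 >= 0 && l2 >= 0);
    while (l1 > 0 && s1[l1 - 1] == ' ')
        l1--;
    while (l2 > 0 && s2[l2 - 1] == ' ')
        l2--;
    n = l1 < l2 ? l1 : l2;
    cmp = n > 0 ? memcmp(s1, s2, (size_t)n) : 0;
    if (cmp != 0)
        return cmp;
    return (l1 > l2) - (l1 < l2); /* the shorter trimmed string sorts first */
}
```

With s2 == NULL and l2 == 0 the function returns 0 exactly when s1 holds only spaces, which matches the special case described above.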

void majuscules(char *name);
Applies toupper to all of name.

int trim_key(char *name);
Removes trailing spaces from name; returns the resulting length.

void compact(char *string);
Removes all space characters from string.
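For illustration, an in-place removal of spaces such as the one compact performs can be written as below. The name toy_compact is hypothetical and this is only a sketch: it assumes that "space characters" means the ' ' character alone, not tabs or newlines.

```c
#include <assert.h>
#include <stddef.h>

/* Remove every ' ' from the null-terminated string, in place. */
void toy_compact(char *string)
{
    char *dst = string;
    assert(string != NULL);
    for (; *string != '\0'; string++)
        if (*string != ' ')
            *dst++ = *string; /* keep non-space characters, shifted left */
    *dst = '\0';
}
```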
int hashmn(char *seqname);
Returns the hashing value, in range [1..hsub], of seqname, which must have been padded by spaces to L_MNEMO characters.

int hasnum(char *spkwname, int len);
Returns the hashing value, in range [1..hkwsp], of the species or keyword name, which must have been padded by spaces to len (typically WIDTH_SP/WIDTH_KW) characters.

enum endianness endian_test(void);
Returns the host computer's endianness using enum endianness {big_endian, little_endian}.
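A common way to perform such a test is to look at the first byte in memory of a known integer value. The sketch below shows this technique; the toy_ names are hypothetical and nothing here is taken from the library's own endian_test.

```c
#include <assert.h>
#include <string.h>

enum toy_endianness { toy_big_endian, toy_little_endian };

/* Inspect the lowest-addressed byte of the integer 1: it is 1 on a
   little-endian host and 0 on a big-endian host. */
enum toy_endianness toy_endian_test(void)
{
    const unsigned int probe = 1;
    unsigned char first;
    assert(sizeof probe > 1); /* the test is meaningless for 1-byte ints */
    memcpy(&first, &probe, 1);
    return first == 1 ? toy_little_endian : toy_big_endian;
}
```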

SIMULTANEOUS ACCESS TO SEVERAL ACNUC DATABASES

int chg_acnuc(char *acnucvar, char *gcgacnucvar);
Sets the values of environment variables acnuc and gcgacnuc to direct the API to a desired database.
Returns TRUE iff not enough memory.

void *store_acnuc_status(void);
Memorizes the data relative to access to an opened ACNUC database.
Returns NULL iff not enough memory.

void set_current_acnuc_db(void *db);
Directs the API to a database whose access data was previously memorized by store_acnuc_status.

int sizeof_acnuc_status(void);
Returns the byte size of the memorized data structure.

Usage example:

	
#include "dir_acnuc.h"

/* declare prototypes */
int chg_acnuc(char *acnucvar, char *gcgacnucvar);
void *store_acnuc_status(void);
void set_current_acnuc_db(void *db);

/* declare a void * for each used database */
void *db1, *db2;
char seq[61]; /* buffer for 60 residues + final null byte */

/* open + memorize access to 1st database */
chg_acnuc("/banques0/genbank/index", "/banques0/genbank/flat_files");
acnucopen();
db1 = store_acnuc_status();
if(db1 == NULL) {
	/* not enough memory */
	exit(ERREUR);
	}

/* open + memorize access to 2nd database */
chg_acnuc("/banques0/swissprot/index", "/banques0/swissprot/flat_files");
acnucopen();
db2 = store_acnuc_status();
if(db2 == NULL) {
	exit(ERREUR);
	}

/* directs the API to the 1st database */
set_current_acnuc_db(db1);
/* now access to the 1st database is possible */
gfrag(2, 1, 60, seq);
readsub(2);
printf("%.16s %s\n", psub->name, seq);

/* directs the API to the 2nd database */
set_current_acnuc_db(db2);
/* now access to the 2nd database is possible */
gfrag(2, 1, 60, seq);
readsub(2);
printf("%.16s %s\n", psub->name, seq);

DATABASE MANAGEMENT FUNCTIONS

int dir_set_mmap(DIR_FILE *kan);
(Unix only) Attempts to place the whole index file referred to by kan in virtual memory, through the mmap system call, for faster access. The API for access to the mmap'ed index file is unchanged. Returns != 0 if mmap was impossible; I/O operations can still be performed in that case, but through plain read/write calls.

void delseq(int nsub);
Completely removes the (sub)sequence of rank nsub from the database.

void addhsh(int recnum, DIR_FILE *kan);
Adds record of rank recnum to the hashing structure of index file kan (can be ksub, kspec, or kkey).

void suphsh(int recnum, DIR_FILE *kan);
Removes record of rank recnum from the hashing structure of index file kan.

void dir_acnucflush(void);
Flushes to disk all changes to ACNUC index files.

int mdshrt(DIR_FILE *kan, int nrec, int offset, int val, int *newplist);
Modification of a short list

kan : index file containing the starting address of the short list: kloc, ksub, kbib, kacc, kaut, kspec, kkey
nrec : rank in kan of the record containing the list starting address
offset : position within record of the starting address of short list;
>0 indicates addition to list, <0 indicates removal from list
val : value to be added or removed
*newplist : if not NULL, upon return pointer to start of modified short list
return value : 1 if ok, 2 if error

int mdlng(DIR_FILE *kan, int nrec, int offset, int val, int *newplist);
Modification of a long list

kan : index file containing the starting address of the long list: ksub,ksmj,kspec,kkey
nrec : rank in kan of the record containing the list starting address
offset : position within record of the starting address of long list;
>0 indicates addition to list, <0 indicates removal from list
val : value to be added or removed
*newplist : if not NULL, upon return pointer to start of modified long list
return value : 1 if ok, 2 if error

int crespecies(char *ascend, char *name);
Creation of a species or taxon name

ascend : name of taxon under which to place the newly created taxon in the tree (if NULL, new taxon is placed at root of tree)
name : name of taxon or species to create (no creation if name already exists)
return value : rank of newly created taxon

int crekeyword(char *ascend, char *name);
Creation of a keyword

ascend : name of keyword under which to place the newly created keyword in the tree (if NULL, new keyword is placed at root of tree)
name : name of keyword to create (no creation if name already exists)
return value : rank of newly created keyword

void cre_new_division(char *name);
Creation of a new flat or gcg file division.
name : name of the division file (without extension, example: gbnew)

int add_shortl(unsigned point, int value, enum shortl_kind slkind);
Adds a value to a short list.

point : rank of the record where the list begins in index file SHORTL, SHORTL2 or SHORTL_KEY
value : value to add to the list
slkind : kind of short list
return value : 1 when OK; 2 when the value was already present in the list.

int addshrt(int point, int value);
Adds a value to a short list. Prefer add_shortl() whenever possible, since it covers all variants of the SHORTL file with the same API.

point : rank of the record where the list begins in index file SHORTL
value : value to be added to the list
return value : 1 when OK; 2 when the value was already present in the list.

int addlng(int point, int value);
Adds a value to a long list.

point : rank of the record where the list begins in index file LONGL
value : value to be added to the list
return value : 1 when OK; 2 when the value was already present in the list.

int sup_shortl(unsigned point, int value, enum shortl_kind slkind);
Removes a value from a short list.

point : rank of the record where the list begins in index file SHORTL, SHORTL2 or SHORTL_KEY
value : value to remove from the list
slkind : kind of short list
return value : 1 when OK; 2 when the list becomes empty after suppression; 3 when the value was not present in the list.

int supshrt(int point, int value);
Removes a value from a short list.

point : rank of the record where the list begins in index file SHORTL
value : value to be removed from the list
return value : 1 when OK; 2 when the list becomes empty after suppression; 3 when the value was not present in the list.

int suplng(int point, int value);
Removes a value from a long list.

point : rank of the record where the list begins in index file LONGL
value : value to be removed from the list
return value : 1 when OK; 2 when the list becomes empty after suppression; 3 when the value was not present in the list.

int cretaxids(void);
Fully computes and writes index file TAXIDS by reading all id:#| data from species labels.
Returns 0 iff no error.

void write_quick_meres(void);
Writes index file MERES. Must be called after having closed the modified acnuc db.

SUPPORT OF iRODS-BASED STORAGE (only available at LBBE in file /panhome/banques/csrc/trunk/irodsLBBEAPI.c|.h)

irodsFILE* irods_fopen(const char *fname, const char *mode);
Opens file fname stored in the LBBE iRODS server for reading (mode is "r") or writing (mode is "w"). fname is of the form "irods://lbbeZone/home/...". The returned value is NULL when the opening failed. Otherwise it is a pointer to an opaque structure. Opening for reading without an iRODS account is possible if the targeted file allows read access to the anonymous account. Other operations require an active iRODS account. Opening for writing creates any missing intermediate collection (or directory) present in fname.

irodsFILE* irods_fopen_ext(const char *fname, const char *mode, int parallel);
Same as irods_fopen(). In addition, if parallel is nonzero, several opening operations can be safely performed by threads running in parallel.

int irods_fgetc(irodsFILE *f);
Reads one character from an iRODS-based file previously opened with irods_fopen(). The returned value is EOF when the end of file is reached.

char* irods_fgets(char *line, int l, irodsFILE *f);
Reads one line from an iRODS-based file previously opened with irods_fopen(). The returned value is NULL when the end of file is reached. Otherwise arguments are as in the standard fgets() function.

int irods_fputs(irodsFILE *f, char *line);
Writes one line to an iRODS-based file previously opened with irods_fopen(). The returned value is 0 when OK.

off_t irods_ftello(irodsFILE *f);
Returns the current reading position within an iRODS-based file previously opened with irods_fopen().

off_t irods_fseeko(irodsFILE *f, off_t offset, int whence);
Changes the current reading position within an iRODS-based file previously opened with irods_fopen(). Arguments are as in the standard fseeko() function.

int irods_fclose(irodsFILE *f);
Closes an iRODS-based file previously opened with irods_fopen(). Returned value is as in the standard fclose() function.

int irods_setbuffer(irodsFILE *f, size_t s);
Changes the buffer size of an iRODS-based file previously opened with irods_fopen(). s is the new size of the buffer. Returns 0 if OK.
