PRABI-Doua: WWW-QUERY & CROSS-TAXA HELP

WWW-QUERY & CROSS-TAXA HELP

Frequently Asked Questions
Buttons
Databases
WWW-Query Criteria
Cross-Taxa Search
Lists

Use of the buttons

In the following order, you should:

Select the type of search (for WWW-Query only). Use the button "Search for sequences" to retrieve a set of sequences, or "Search for families" to retrieve a set of families. If you select the "family" button, only the databases containing gene families will be available in the list.
Select the type of sequences. Use the button "Protein" to select databases that contain proteins, or the button "Nucleotide" to select databases that contain nucleotide sequences.
Choose the database.

This field lets you select one of the proposed databases with a radio button. Use the button 'protein databank' or 'nucleotide databank' to select databases containing protein sequences (Swiss-Prot, etc.) or nucleotide sequences (EMBL, GenBank, etc.). Some databases (HoGenom, etc.) provide both a protein and a nucleotide database. The databases and their current contents are described here.

WWW-Query Selection Criteria

Many selection criteria are available. Most of them correspond to the structured elements of the sequence documentation in the data banks. These selection criteria (Keyword, Keywords list, Name, Name list, Accession number, Acc. number list, Species, Family ID, Family ID list, Type, Year, Organelle, Molecule, Reference, Author, Journal, Status) can be combined with logical operators.

Keyword

back to criteria
Enter the keyword name, using * to stand for any series of characters in order to catch several keywords in one shot. Spaces are allowed. Examples: RNA POLYMERASE, *POLYMERASE, *TRANSFER RNA*SYNTHETASE*. Keywords are partially tree-structured. A match also catches all the keywords placed below it in the tree.
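The wildcard and tree behaviour described above can be sketched in Python. This is only an illustration: the keyword tree `TREE` and the functions `wildcard_to_regex`, `descendants` and `search` are hypothetical and do not reflect the actual keyword tree or code of the server.

```python
import re

# Hypothetical keyword tree (parent -> children); the real tree is not shown on this page.
TREE = {
    "POLYMERASE": ["RNA POLYMERASE", "DNA POLYMERASE"],
    "RNA POLYMERASE": ["RNA POLYMERASE II"],
    "DNA POLYMERASE": [],
    "RNA POLYMERASE II": [],
}

def wildcard_to_regex(pattern):
    """Turn a query such as '*POLYMERASE' into a regex where * matches any characters."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))

def descendants(keyword):
    """Return every keyword placed below `keyword` in the tree."""
    found = []
    stack = list(TREE.get(keyword, []))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(TREE.get(current, []))
    return found

def search(pattern):
    """Keywords matching the pattern, plus everything below each match."""
    regex = wildcard_to_regex(pattern)
    hits = set()
    for keyword in TREE:
        if regex.fullmatch(keyword):
            hits.add(keyword)
            hits.update(descendants(keyword))
    return sorted(hits)
```

With this toy tree, `search("RNA POLYMERASE")` returns the keyword itself and the one placed below it, while `search("DNA*")` returns only `DNA POLYMERASE`.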

Keywords list

back to criteria
Give the name of a keywords list that you have transferred to the server with the send a list utility.

Name

back to criteria
Enter a sequence name, possibly using * to match any string of characters. Use of * is VERY slow when placed at the beginning of the query, otherwise the reply is fast. Examples: ECTRPA, ECTRP*.
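One plausible reason for this speed difference, assuming the names are held in a sorted index (this page does not describe the actual implementation), is that a trailing * can be answered by a binary search while a leading * forces a scan of every name. The sketch below illustrates the idea; apart from ECTRPA, the names in `NAMES` are made up, and both functions are hypothetical.

```python
import bisect

# Toy sorted index; only ECTRPA comes from the examples above, the rest is made up.
NAMES = sorted(["ECTRPA", "ECTRPB", "ECTRPC", "ECLACZ", "HSHBB"])

def prefix_search(names, prefix):
    """Answer a query like 'ECTRP*' with two binary searches on the sorted list."""
    low = bisect.bisect_left(names, prefix)
    high = bisect.bisect_left(names, prefix + "\uffff")  # just past the last name with this prefix
    return names[low:high]

def suffix_search(names, suffix):
    """Answer a query like '*TRPA': every name has to be examined."""
    return [name for name in names if name.endswith(suffix)]
```

The first function touches only a handful of entries whatever the size of the index; the second one grows linearly with the number of names.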

Names list

back to criteria
Enter the name of a list containing sequence names. For that purpose you may either use a list previously created with WWW-Query or a list you have sent with the send a list utility.

Accession number

back to criteria
Enter an accession number. Example: L04470. All accession numbers listed in sequence annotations are indexed.

Accession number list

back to criteria
Enter the name of a list containing sequence accession numbers that you have transferred to the server with the send a list utility.

Species

back to criteria
Enter the species name, using * to stand for any series of characters in order to catch several species in one shot. Spaces are allowed. Examples: ESCHERICHIA COLI, *COLI, E*COLI. Species names are tree-structured according to the biological classification of species.

Family ID

back to criteria
Enter a family name, possibly using * to match any string of characters. Use of * is VERY slow when placed at the beginning of the query, otherwise the reply is fast. Examples: HBG000020, HBG00002*.

Family ID list

back to criteria
Enter the name of a list containing family names. For that purpose you may either use a list previously created with WWW-Query or a list you have sent with the send a list utility.

Type

back to criteria
Sequence type identifies the nature of the encoded molecule (e.g., protein, tRNA, rRNA). Type should not be confused with molecule which denotes the chemical nature of the sequenced molecule (e.g., DNA, mRNA, tRNA). Type is defined only for the nucleotide sequence banks (GenBank, EMBL, Hovergen, NRSub and CGDB). Presently the existing types are:

    ID            Locus entry                (EMBL, SWISS-PROT, NRSub)
    LOCUS         Locus entry                (GenBank, Hovergen, EMGLib)
    CDS          .PE protein coding region   (all)
    RRNA         .RR mature ribosomal RNA    (all)
    TRNA         .TR mature transfer RNA     (all)
    MISC_RNA     .RN other structural RNA
                  coding region              (EMBL, GenBank, Hovergen, NRSub,
                                              EMGLib)
    SNRNA        .SN small nuclear RNA       (EMBL, GenBank, Hovergen, EMGLib)
    SCRNA        .SC small cytoplasmic RNA   (EMBL, GenBank, Hovergen, NRSub,
                                              EMGLib)
    3'INT        .3I 3' intron               (Hovergen)
    3'NCR        .3F 3' non-coding region    (Hovergen)
    5'INT        .5I 5' intron               (Hovergen)
    5'NCR        .5F 5' non-coding region    (Hovergen)
    CPG          .CG region > 200 bp with
                  CpGobs/CpGexp > 0.5        (Hovergen)
    INT_INT      .IN internal intron         (Hovergen)

Each entry of a FEATURE TABLE describing a coding region of a DNA fragment gives rise to a subsequence equal to the fragments described in the location of the feature. The type of the resulting subsequence equals the key of the corresponding feature table entry. The name of the resulting subsequence is built by adding to the parent sequence's name an extension uniquely identifying this particular feature.
Sequences of a given type are generally subsequences, i.e., fragments of parent sequences, except if the coding region covers totally the parent sequence, in which case ACNUC does not create a subsequence.
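The CPG type in the table above relies on the ratio of observed to expected CpG dinucleotides. This page does not give the formula; a commonly used definition is obs/exp = (number of CpG x length) / (number of C x number of G). The sketch below uses that definition as an assumption, and Hovergen may compute the ratio differently. The function names are hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio, using the common (assumed) definition."""
    s = seq.upper()
    c_count = s.count("C")
    g_count = s.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    cpg_count = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return cpg_count * len(s) / (c_count * g_count)

def is_cpg_region(seq):
    """Apply the two thresholds quoted in the table: > 200 bp and ratio > 0.5."""
    return len(seq) > 200 and cpg_obs_exp(seq) > 0.5
```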

Year

back to criteria
Type the desired year of publication in the box, with one of these operations: > (after), = (this year), or < (before).

Organelle

back to criteria
Organelle (e.g., chloroplast, mitochondrion) denotes the nature of the genome that harbors a particular gene. By extension, WWW-Query also treats "nuclear" as an organelle. Likewise, a nuclear-encoded gene coding for a protein imported into an organelle is treated as a nuclear gene by WWW-Query. The existing organelles are:

    CHLOROPLAST     Chloroplast genome   (EMBL, GenBank, NBRF, Hovergen)
    MITOCHONDRION   Mitochondrial genome (EMBL, GenBank, NBRF, Hovergen)
    KINETOPLAST     Kinetoplast genome   (EMBL, GenBank, Hovergen)
    NUCLEAR         Nuclear genome       (all)

Molecule

back to criteria
In ACNUC, molecule denotes the chemical nature of the sequenced molecule (e.g., DNA, mRNA, tRNA). Molecule should not be confused with type which identifies the encoded molecule (e.g., protein, tRNA, rRNA). Thus the sequence of a tRNA gene has DNA for molecule because DNA rather than tRNA was sequenced. The subsequence covering the tRNA region has tRNA for type because this is the nature of the encoded product. Molecule is defined only for the nucleotide sequence banks (GenBank, EMBL, Hovergen, NRSub, and CGDB). Presently the existing molecules are:

    DNA          Sequenced molecule is DNA   (all)
    RNA          Sequenced molecule is RNA   (all)
    MRNA         Sequenced molecule is mRNA  (GenBank, Hovergen)
    RRNA         Sequenced molecule is rRNA  (GenBank, Hovergen)
    TRNA         Sequenced molecule is tRNA  (GenBank, Hovergen)
    URNA         Sequenced molecule is snRNA (GenBank, Hovergen)

Reference

back to criteria
Enter the reference name. References are specified as follows depending on the type of document:

    Document                   Format                          Example

    Journal article            journal_code/volume/1st_page    jme/34/17
    Book                       book/year/1st_author            book/1980/broker
    Thesis                     thesis/year/1st_author          thesis/1984/wildgruber
    Patent                     patent/patent_coded_number      patent/ep0238993
    Unpublished or submitted   unpubl/year/1st_author          unpubl/1993/cho
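As an illustration, the reference codes in the table above could be assembled as follows. The helper functions are hypothetical, and the lower-casing is an assumption based only on the examples shown in the table.

```python
def journal_ref(journal_code, volume, first_page):
    """Journal article: journal_code/volume/1st_page."""
    return f"{journal_code}/{volume}/{first_page}".lower()

def authored_ref(kind, year, first_author):
    """Book, thesis or unpublished document: kind/year/1st_author."""
    if kind not in ("book", "thesis", "unpubl"):
        raise ValueError(f"unknown document kind: {kind}")
    return f"{kind}/{year}/{first_author}".lower()

def patent_ref(coded_number):
    """Patent: patent/patent_coded_number."""
    return f"patent/{coded_number}".lower()
```

For instance, `authored_ref("thesis", 1984, "Wildgruber")` produces the thesis example of the table.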

Author

back to criteria
Enter an author name, possibly using * to match any string of characters (slow). Examples: YANOFSKI, YANOF*. Only last names are indexed; initials are ignored. All authors of journal articles are indexed. For books, theses, patents and other documents, only the first author is indexed.

Journal

back to criteria
Enter a journal code.

Status

back to criteria
Status denotes the completion level of sequence annotations. This information exists only in data banks in EMBL or SWISS-PROT format. The existing statuses are:

    PRELIMINARY         Preliminary annotated sequence
    STANDARD            Fully annotated sequence
    UNANNOTATED         Only DE, AC and R[NPXATL]
    UNREVIEWED          Sequence with unreviewed annotation

Logical Operators

back to criteria
Elementary selection criteria (e.g., by species, by keyword) may be logically combined to create multi-criterion queries using operators. Only one operator is available for the first selection criterion: NOT. The default option, DEFAULT, has no effect on the query and is present only for aesthetic purposes. For the three other criteria, four operators are available: AND, OR, AND NOT, OR NOT.
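In terms of sets of sequences, the operators can be read as in the following Python sketch. It assumes that the criteria are applied from left to right without any precedence rule, which this page does not state; the `combine` function and its arguments are hypothetical.

```python
def combine(universe, steps):
    """Combine criteria results.

    universe: set of all sequences in the database.
    steps: list of (operator, hits); the first operator is "DEFAULT" or "NOT",
           the following ones are "AND", "OR", "AND NOT" or "OR NOT".
    """
    operator, hits = steps[0]
    result = universe - hits if operator == "NOT" else set(hits)
    for operator, hits in steps[1:]:
        if operator == "AND":
            result = result & hits
        elif operator == "OR":
            result = result | hits
        elif operator == "AND NOT":
            result = result - hits
        elif operator == "OR NOT":
            result = result | (universe - hits)
        else:
            raise ValueError(f"unknown operator: {operator}")
    return result
```

For example, a species criterion followed by AND NOT and a keyword criterion keeps the sequences of that species that do not carry the keyword.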

Cross-Taxa Search

The left page ("Taxon selection") is used to build a Cross-Taxa query, which retrieves all gene families that are shared by a given set of taxa (the upper list) and that are not associated with another set of taxa (the lower list).

The right page ("Taxonomy helper") can be used to check the taxonomy of the species of interest.

Cross-Taxa gives access to a family retrieval system based on taxonomic criteria. Its web interface is composed of two text fields.

It retrieves all gene families that are shared, strictly or not, by a first set of taxa defined in the first field and that are not associated with a second set of taxa defined in the second field. Taxa of any taxonomic level can be used and mixed to compose the query (e.g., Homo sapiens, Primates, Mammalia). For example, it is possible to retrieve the families of bacterial genes specific to a toxic strain of Escherichia coli, the gene families found in mammals but not in birds, or the gene families found in mammals only.

The first set of taxa can be used for an inclusive or an exclusive selection of families.
It is also possible to pre-select the families by their number of sequences or species, as shown in this example.

Warning! Cross-Taxa queries can take a long time. For simple queries on families (for example, to retrieve all the families containing a sequence from Mammalia), we recommend using WWW-Query.

Two types of search are available:

Inclusive Search:
Any family containing at least one species from each taxon of the list will be selected.

Usage:

if you specify Primates in list 1 (with an empty list 2), you will get all the families with at least one sequence from Primates.
if you specify Homo and Mus in list 1 (with an empty list 2), you will get all the families with at least one sequence from Homo and one sequence from Mus (for example a 3-sequence family, with one sequence from Homo, one sequence from Mus and one sequence from Bos).
if you specify Mammalia in list 1 and Primates in list 2, you will get all the families with at least one sequence from Mammalia but no sequence from Primates (for example a 15-sequence family, with 5 sequences from Bos, 5 sequences from Mus, 2 sequences from Rattus and 3 sequences from Xenopus).

back to Cross-Taxa
Exclusive Search:
Any family in which every taxon of the list is represented and which contains species from these taxa only (i.e. none from other taxa) will be selected.

Usage:

if you specify Primates in list 1 (with an empty list 2), you will get all the families with sequences from Primates only.
if you specify Homo and Mus in list 1 (with an empty list 2), you will get all the families with at least one sequence from Homo and one sequence from Mus and no sequence from any other species (for example a 3-sequence family, with 2 sequences from Homo and one sequence from Mus).
if you specify Homo and Primates in list 1 (with an empty list 2), you will get all the families with sequences from Primates only and at least one sequence from Homo (for example a 5-sequence family, with 3 sequences from Homo and 2 sequences from Pan).
if you specify Mammalia in list 1 and Primates in list 2, you will get all the families containing sequences from Mammalia only and no sequence from Primates (for example an 18-sequence family, with 3 sequences from Bos, 7 sequences from Mus and 8 sequences from Rattus).
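The two search modes can be summarised in a short Python sketch. It is only an illustration under stated assumptions: the `LINEAGE` table and the `select` function are hypothetical and are not part of Cross-Taxa, and a family is modelled simply as a list giving the genus of each of its sequences.

```python
# Hypothetical, heavily simplified lineages: genus -> taxa it belongs to.
LINEAGE = {
    "Homo": {"Homo", "Primates", "Mammalia"},
    "Pan": {"Pan", "Primates", "Mammalia"},
    "Mus": {"Mus", "Mammalia"},
    "Rattus": {"Rattus", "Mammalia"},
    "Bos": {"Bos", "Mammalia"},
    "Xenopus": {"Xenopus", "Amphibia"},
}

def in_taxon(genus, taxon):
    return taxon in LINEAGE[genus]

def select(family, list1, list2=(), exclusive=False):
    """Return True if the family passes the inclusive (default) or exclusive search."""
    # Every taxon of list 1 must be represented by at least one sequence.
    if not all(any(in_taxon(g, t) for g in family) for t in list1):
        return False
    # No sequence may belong to a taxon of list 2.
    if any(in_taxon(g, t) for g in family for t in list2):
        return False
    # Exclusive search: every sequence must belong to some taxon of list 1.
    if exclusive:
        return all(any(in_taxon(g, t) for t in list1) for g in family)
    return True
```

With these definitions, the family made of one Homo, one Mus and one Bos sequence is selected by the inclusive search on Homo and Mus but rejected by the exclusive one, as in the usage examples above.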

back to Cross-Taxa

Selection of families by number of sequences or species

You can select families by their number of sequences and/or their number of species. For example, this is useful to exclude families containing only one sequence or one species.
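A minimal sketch of such a filter is given below. The `keep_family` function and its minimum thresholds are hypothetical; the options actually offered by the form may differ.

```python
def keep_family(species_per_sequence, min_sequences=2, min_species=2):
    """species_per_sequence: one species name per sequence of the family."""
    n_sequences = len(species_per_sequence)
    n_species = len(set(species_per_sequence))
    return n_sequences >= min_sequences and n_species >= min_species
```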

Nota Bene:
The numbers of sequences and taxa displayed with the list of families are correct for protein sequences only. If you are using a nucleotide database, the real numbers of sequences and taxa in the family (as given on the family page) can be different. Moreover, slight differences can occasionally appear between the numbers of taxa and sequences given with the list (precalculated) and the real ones (given on the family page), even for protein databases.

Example

An example of use is given here
back to Cross-Taxa

List Name

Under WWW-Query, the result of each query is saved in a file stored locally on our server. In this way, it is not immediately lost and the user can re-use it to build other queries or to perform further processing.

The lists are stored in a sub-directory of /ftp/ftpdir/pub/ADE-User/data/ created via a cookie for the user (your data are currently stored in the directory /ftp/ftpdir/pub/ADE-User/data/1398423241; you can check your previous operations here).

It is up to the user to give a name to a list. If no name is given, the system uses the default name list. Be aware that some lists are created automatically by the system. These lists are always called list and overwrite any list previously defined with this name. The sequence list of a family "FAMILY_NAME" is automatically called "FAMILY_NAME_lst" (or "P_FAMILY_NAME_lst" after a species selection).
Note that files older than 1 week in the directory created for the user are automatically deleted.
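The naming and clean-up rules above can be restated as a small Python sketch. The functions are hypothetical and only mirror the rules given in the text; they are not the code used by the server.

```python
import time

ONE_WEEK = 7 * 24 * 3600  # seconds

def list_name(user_name=None):
    """Name given by the user, or the default name 'list'."""
    return user_name or "list"

def family_list_name(family_name, after_species_selection=False):
    """Automatic name of the sequence list of a family."""
    prefix = "P_" if after_species_selection else ""
    return f"{prefix}{family_name}_lst"

def is_expired(modification_time, now=None):
    """True if a file is older than one week (times in seconds since the epoch)."""
    if now is None:
        now = time.time()
    return now - modification_time > ONE_WEEK
```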

Frequently Asked Questions

This page is under development, sorry. Last update: January 7, 2004.

How can I retrieve a protein or a gene?
I know the name of a sequence, what can I do with it?
There are a lot of databases available, which one should I use?
I do not find my sequence in your databases. Why?
The buttons do not work ... (in construction)
I can not select the database ... (in construction)
How can I retrieve sequences associated to a keyword? (in construction)
How can I retrieve sequences associated to a taxon? (in construction)
What are families? (in construction)
What is the aim of the family databases? (in construction)
How can I retrieve families associated to a keyword? (in construction)
How can I retrieve families associated to a taxon? (in construction)
Which family database should I use? (in construction)
What is the meaning of the nucleotide and protein buttons? (in construction)
What is the meaning of the sequence and family buttons? (in construction)
How to use WWW-Query?
How to use Cross-Taxa?
Where are my data stored?
How can I retrieve my data?
What is the difference between Cross-Taxa and WWW-Query?

How can I retrieve a protein or a gene?
You should go to the WWW-Query page (here). This is an "expert-user" page allowing complex queries.
- For a quick search, click on the button "Quick Search". This page retrieves all the sequences (or families) associated with a word, which can be a name, an accession number, a keyword, a species, etc. The results are thus more exhaustive than with WWW-Query.
  To retrieve a sequence, use the left form of the page. Enter a word, select a database, then click on "submit". If you check the "exact match" box, only exact matches will be retrieved. Several lists of sequences (or families) are usually generated. For example, search for the word "BTG1" in SWISS-PROT: a list (called "name") of sequences whose name matches the word "BTG1" and a list (called "keyword") of sequences with a keyword matching the word "BTG1" are generated. All these sequences are then grouped in a global list (called "all") and displayed. If the "exact match" box is checked, only the sequences associated with the keyword "BTG1" are retrieved.
- For a simple query, click on the button "Go to Simple Search". This page allows you to retrieve sequences according to simple criteria such as the sequence name, the accession number or a keyword.
- Complex queries are possible on the WWW-Query page (also accessible via the "Go to Expert Search" button). First choose whether you want to retrieve sequences or families of sequences. You can then fill in the form as for a simple query, except that more criteria are available and that you can combine several of them (this is optional: if you want to use only one criterion, leave the 3 other fields empty).
back to FAQ
I know the name of a sequence, what can I do with it?

You can

retrieve this sequence in one of the databases to get its annotations and its sequence data, or apply several bioinformatics tools such as BLAST, CLUSTALW, secondary structure prediction, pattern search, and many NPSA tools, etc.

retrieve the family associated with this sequence, get all the sequences in the family and modify this list of sequences if needed, apply several bioinformatics tools to these sequences, display the alignment and the phylogenetic tree, get the partial alignment of sequences associated with particular species, etc.

back to FAQ

There are a lot of databases available, which one should I use?

Several databases are available on the server:

General databases, such as EMBL, GenBank or SWISS-PROT, can be queried with the different tools and utilities proposed by the PBIL. These databases are regularly updated (daily for GenBank and EMBL, weekly for SWISS-PROT).

Other, specialised databases are dedicated to particular organisms, molecules, functions and/or phylogenetic analysis. For example, the Hobacgen database contains families of homologous genes from bacteria and archaea. These databases are described on the home page of the server.
Database contents are given here.

back to FAQ

I do not find my sequence in your databases. Why?

First of all, your sequence may simply not be present in the database you are querying (for example, if you are looking for a protein sequence in EMBL, for an animal sequence in Hobacprot/Hobacnucl, or for a CDS in Hobacprot, etc.). See this question for more information about the different databases.

There may also have been a confusion between the name and the accession number of the sequence when using WWW-Query. WWW-Query allows you to search for a sequence by its name or by its accession number; if, for example, an accession number is given instead of the name, the sequence will not be retrieved.
Alternatively, Quick Search allows you to retrieve all the sequences associated with a word, which can be a name, an accession number, a keyword, a species, etc. The results are thus more exhaustive than with WWW-Query.

Finally, in several databases, such as Hoverprot and Hobacprot, the sequence names can be slightly different from the SWISS-PROT ones, due to the duplication of the sequences. To avoid this problem, use the accession number instead of the sequence name to retrieve your sequence.

back to FAQ

The buttons do not work ...

In construction...
back to FAQ

I can not select the database ...

In construction...
back to FAQ

How can I retrieve sequences associated to a keyword?

In construction...
back to FAQ

How can I retrieve sequences associated to a taxon?

In construction...
back to FAQ

What are families?

In construction...
back to FAQ

What is the aim of the family databases?

In construction...
back to FAQ

How can I retrieve families associated to a keyword?

In construction...
back to FAQ

How can I retrieve families associated to a taxon?

In construction...
back to FAQ

Which family database should I use?

In construction...
back to FAQ

What is the meaning of the nucleotide and protein buttons?

In construction...
back to FAQ

What is the meaning of the sequence and family buttons?

In construction...
back to FAQ

How to use WWW-Query?

In construction...
back to FAQ

How to use Cross-Taxa?

In construction...
back to FAQ

Where are my data stored?

Under WWW-Query and Cross-Taxa, the result of each query is saved in a file stored locally on our server. In this way, it is not immediately lost and the user can re-use it to build other queries or to perform further processing. Thanks to the storage zone defined for each user, there is no confusion when many users are generating lists with the same name at the same moment. The lists (of sequences or families) are stored in a sub-directory of ftp://pbil.univ-lyon1.fr/pub/ADE-User/data created via a cookie for the user (for example, your data are currently stored in the directory ftp://pbil.univ-lyon1.fr/pub/ADE-User/data/1398423241, and you can check your previous operations here). It is up to the user to give a name to the list to be generated. If no name is given, the system uses the default name list. Be aware that some lists are created automatically by the system. These lists are always called list and overwrite any list previously defined with this name. The sequence list of a family named "FAMILY_NAME" is automatically called "FAMILY_NAME_lst" (or "P_FAMILY_NAME_lst" after a species selection).
Other data, such as alignment files or phylogenetic tree files, are stored in the user directory as well. Partial alignments are stored in a sub-directory of the user directory called ALN.
Note that files older than 1 week in the directory created for the user are automatically deleted.
back to FAQ

How can I retrieve my data?

You can download all your data at the URL ftp://pbil.univ-lyon1.fr/pub/ADE-User/data/1398423241
It is recommended that you use a dedicated FTP client to retrieve them instead of a Web browser like Netscape or Internet Explorer. You can also retrieve sequence data with the Retrieve button.
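As an alternative to a graphical FTP client, the files could be fetched with a short script. The sketch below uses Python's standard ftplib module and assumes that the server accepts anonymous FTP logins and that the directory listing returns plain file names; the functions `split_ftp_url` and `download_all` are hypothetical helpers, and the URL is the example directory quoted above.

```python
import os
from ftplib import FTP, error_perm
from urllib.parse import urlparse

def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory path)."""
    parsed = urlparse(url)
    return parsed.netloc, parsed.path

def download_all(url, dest_dir):
    """Download every plain file of the user directory into dest_dir."""
    host, path = split_ftp_url(url)
    os.makedirs(dest_dir, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login()  # anonymous login (assumption)
        ftp.cwd(path)
        for name in ftp.nlst():
            target = os.path.join(dest_dir, name)
            try:
                with open(target, "wb") as handle:
                    ftp.retrbinary("RETR " + name, handle.write)
            except error_perm:
                # Probably a sub-directory (such as ALN); it is skipped in this sketch.
                os.remove(target)

# Example call (not executed here):
# download_all("ftp://pbil.univ-lyon1.fr/pub/ADE-User/data/1398423241", "my_data")
```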
back to FAQ

What is the difference between Cross-Taxa and WWW-Query?

WWW-Query allows you to retrieve sequences or families, whereas Cross-Taxa is used to retrieve families only.
WWW-Query retrieves all the sequences which fulfil several criteria of different kinds, then generates the list of these sequences, or the list of the families associated with these sequences.
Cross-Taxa retrieves families on a taxonomic basis, allowing a more precise taxonomic selection than WWW-Query.
It is possible to combine results from Cross-Taxa and WWW-Query (for example, to cross a family list generated with Cross-Taxa with a family list generated with WWW-Query).
back to FAQ