Availability
============

The source code for this package is available from

    http://www.gene.com/share/gmap

License terms are provided in the COPYING file.


Building and installing GMAP
============================

Prerequisites: a Unix system (including Cygwin on Windows), a C
compiler, and Perl.

Step 1: Set your site-specific variables by editing the file
config.site. In particular, you should set appropriate values for
"prefix" and probably for "with_gmapdb", as explained in that file.

Step 2: Build, test, and install the programs, by running the
following GNU commands

    ./configure
    make
    make check
    make install

Note: Instead of editing the config.site file in step 1, you may type
everything on the command line for the ./configure script in step 2,
like this

    ./configure --prefix=/your/usr/local/path --with-gmapdb=/path/to/gmapdb

If you omit --with-gmapdb, it defaults to ${prefix}/share. If you
omit --prefix, it defaults to /usr/local.


Downloading a GMAP database
===========================

You can use the program gmap_setup to build your own database (as
described below), but you can get started quickly by downloading a
pre-built GMAP database from the same place you obtained the GMAP
program (see above for URL). Place the database in the GMAPDB
directory you specified in the config.site file when you built the
gmap program. You should include a subdirectory for each GMAP
database; for example, if you downloaded a database called NHGD_R35,
your directory structure should look like this

    /path/to/gmapdb/NHGD_R35/
    /path/to/gmapdb/NHGD_R35/NHGD_R35.chromosome
    /path/to/gmapdb/NHGD_R35/NHGD_R35.chromosome.iit
    ...
    /path/to/gmapdb/NHGD_R35/NHGD_R35.version


Building a GMAP database (quick guide)
======================================

The program gmap_setup can build and install a GMAP database from a
set of FASTA files containing either entire chromosomes or contigs
that represent pieces of chromosomes. Run this command

    gmap_setup -d ...

and then follow the directions.
Note that the term ... above indicates that multiple files can be
listed. The files can be in any order, and the contigs can be in any
order within these files. The GMAP setup process will assemble the
contigs and chromosomes into their appropriate alphanumeric order.

If your FASTA headers contain information about the chromosomal
coordinates, gmap_setup will try to parse that information from the
headers. The program knows how to parse the following patterns:

    chr=1:217281..257582  [may insert spaces around '=', or omit '=' character]
    chr=1                 [may insert spaces around '=', or omit '=' character]
    chromosome 1          [NCBI .mfa format]
    chromosome:NCBI35:22:1:49554710:1   [Ensembl format]
    /chromosome=2         [Celera format]
    /chromosome=2 /alignment=(88840247-88864134) /orientation=rev   [Celera format]
    chr1:217281..257582
    chr1                  [may insert spaces after 'chr']

If only the chromosome is specified, without coordinates, the program
will assign its own chromosomal coordinates by concatenating the
contigs within each chromosome.


Building a GMAP database (using an MD file)
===========================================

Genomes from NCBI typically include an ".md" file (like
seq_contig.md) that specifies the chromosomal coordinates for each
contig. To use this information, provide the -M flag to gmap_setup,
like this

    gmap_setup -d -M ...

The program will then try to parse the mdfile (which often changes
formats) and verify with you which columns contain the contig names
and chromosomal coordinates.


Building a GMAP database (details)
==================================

The gmap_setup program will create a Makefile, called Makefile..
You will then be prompted to use this Makefile through the following
commands:

    make -f Makefile. coords
    make -f Makefile. gmapdb
    make -f Makefile. fullascii [optional]
    (edit the .chrsubset file [optional])
    make -f Makefile. install

1. The first step in using this Makefile is to create a file called
coords..
You may manually edit this file, if you wish, before proceeding with
the rest of the Makefile steps. The coords file contains one contig
per line, with the contig name in column 1, the chromosomal_mapping in
column 2, and an optional strain in column 3, where the
chromosomal_mapping is in the form chromosome:start..end. Here are
some examples:

    NT_077911.1     1:217281..257582
    NT_091704.1     22U:1..166566

If you want the contig to be inserted as its reverse complement, then
list the coordinates in the reverse direction (starting with the
higher number), like this:

    NT_039199.1     1:61563373..61273712

You may delete lines or comment them out with a '#' character, which
will effectively omit those contigs from your genome build. You may
also change chromosomal assignments (in column 2) or assign contigs to
alternative strains (in column 3).

2. The optional "make fullascii" command is intended to represent
genomes where it is important to show lower-case or non-standard
characters (anything other than A, C, G, T, N, or X) in the alignment.
Normally, gmap_setup creates only a compressed version of the genome,
in the file .genomecomp, which can hold only the standard, upper-case
A, C, G, T, N, and X characters. It converts all lower-case
characters to upper-case, and all non-ACGTNX characters to 'N'. If
you wish to show lower-case or non-standard characters in the
alignment, you may create a full ASCII genome file with the command

    make -f Makefile. fullascii

after the "make gmapdb" command. This will create the file .genome,
in addition to the .genomecomp file. The full ASCII genome can then
be used by GMAP instead of the compressed genome by specifying the -G
flag to GMAP.

Note that a full ASCII genome file has a size equal to the total
genome length, and some computers cannot handle files larger than 2
gigabytes. In such cases, only the standard compressed genome will
work. Note, however, that the full ASCII genome does not affect the
GMAP computation, only the printed alignment shown by GMAP with its -A
flag.
Also, you do not need to mask the genome; GMAP will work fine on the
entire genome. But if you wish to indicate masked regions in the
alignment with lower-case characters, then you will need to create a
full ASCII genome file.

3. The full usage for gmap_setup is as follows:

    Usage: gmap_setup -d [-D ] [-o ] [-M <.md file>] [-S] [-W] [-E] [-q interval]

        -d    genome name
        -D    destination directory for installation (defaults to gmapdb
              directory specified at configure time)
        -o    name of output Makefile (default is "Makefile.")
        -M    use coordinates from an .md file (e.g., seq_contig.md file
              from NCBI)
        -S    treat each sequence as a separate chromosome
        -W    write some output directly to file, instead of using RAM
              (slow; use only if RAM is limited)
        -E    interpret argument as a command, instead of a list of
              FASTA files
        -q    GMAP indexing interval (default: 6 nt)

These flags are explained below:

* The -S flag: If your FASTA files contain separate sequences without
any chromosomal information, you can treat each sequence as its own
separate "chromosome" by adding the -S flag to the gmap_setup command,
like this:

    gmap_setup -S -d ...

GMAP can handle an unlimited number of "chromosomes", with arbitrarily
long names. In this way, GMAP can act like a BLAST-type of program
for near-identity matches.

* The -E flag: If you need to pre-process the FASTA files before using
these programs, perhaps because they are compressed or because you
need to insert chromosomal information in the header lines, you can
specify a command instead of multiple fasta_files, like these
examples:

    gmap_setup -d -E 'gunzip -c chr*.gz'
    gmap_setup -d -E 'cat *.fa | ./add-chromosomal-info.pl'

You can think of the command as a Unix pipe for processing each FASTA
file before it is read by gmap_setup.

* The -W flag: The gmap_setup process works best if you have a
computer with enough RAM to hold the entire genome (e.g., 3 gigabytes
for a human- or mouse-sized genome).
Since the resulting genome files work across all machine
architectures, you can find any machine with sufficient RAM to build
the genome files and then transfer the files to another machine.
(GMAP itself runs fine on machines with limited RAM.) If you cannot
find any machine with sufficient RAM for gmap_setup, you can run the
program with the -W flag to write the files directly, but this can be
very slow.

* The -q flag: If you specify a smaller interval (for example, 3 for
the GMAP interval), you can create a higher-resolution database, which
can be useful for mapping small oligomers (smaller than 18 nt).
However, the corresponding genome index files will be larger (twice as
big if you specify -q 3). These index files may exceed the 2 gigabyte
file offset limit on some computers, and will therefore fail to work
on those computers.


Running GMAP
============

To see the full set of options, type "gmap --help". The following are
some common examples of usage. For more examples, see the document
available at

    http://www.gene.com/share/gmap/paper/demo-slides.pdf

For each of the examples below, we assume that you have installed a
genome database called NHGD_R35 in your GMAPDB directory. (If your
database is located elsewhere, you can specify the -D flag to gmap or
set the environment variable GMAPDB to point to that directory.)

* Mapping only: To map one or more cDNAs in a FASTA file onto a
genome, run GMAP as follows:

    gmap -d NHGD_R35

* Mapping and alignment: If you want to map and align the cDNAs, add
the -A flag:

    gmap -d NHGD_R35 -A

* Alignment only: To align one or more cDNAs in a FASTA file onto a
given genomic segment (also in a FASTA file), use the -g flag instead
of the -d flag:

    gmap -g -A

* Batch mode: If you have a large number of cDNAs to run, and you have
sufficient RAM (see below for guidelines) to run in batch mode, add
the "-B 1" or "-B 2" option:

    gmap -d NHGD_R35 -B 1 -A

The "-B 1" option pre-loads the genomic indices only into RAM.
The "-B 2" option pre-loads both the indices and genome into RAM. For
increased speed, the genomic indices are far more important than the
genome for pre-loading into RAM.

Guidelines: The "-B 1" option pre-loads the .idxpositions file. The
"-B 2" option pre-loads that file, plus the .genomecomp file. Look at
the sizes of these files to determine if you have enough RAM to hold
them in memory continuously. Note that other programs running on your
computer also need RAM.

* Multithreaded mode: If your machine has several processors, you can
make batch mode run even faster by specifying multiple threads with
the -t flag:

    gmap -d NHGD_R35 -B 1 -A -t

Note that with multiple threads, the output results will appear in
random order, depending on which thread finishes its computation
first. If you wish your output to be in the same order as the input
cDNA file, add the '-O' (letter O, not the number 0) flag to get
ordered output.

Guidelines: The -t flag specifies the number of computational threads.
In addition, if your machine supports threads, GMAP also uses one
thread for reading the input query sequences, and one thread for
printing the output results. Therefore, the total number of threads
will be 2 plus the number you specify. The program will work
optimally if it uses one thread per available processor. Note that
other programs running on your computer also need processors.

* Compressed output: If you want to store the alignment results in a
compressed format, use the -Z flag. You can uncompress the results by
using the gmap_uncompress.pl program:

    gmap -d NHGD_R35 -Z > x
    cat x | gmap_uncompress


Defining chromosome subsets (for advanced use only)
===================================================

GMAP has the ability to restrict its search of the genome to a subset
of the available chromosomes. A user may specify a chromosomal subset
with the "-c" flag to GMAP. The available chromosome subsets are
listed in the chrsubset file in the GMAP database.
In our running example, this would be the file

    /path/to/gmapdb/NHGD_R35/NHGD_R35.chrsubset

The gmap_setup process automatically creates a file by this name, and
pre-defines some basic subsets in that file, namely the subset "all"
(which stands for all chromosomes) and a subset for each individual
chromosome. However, you may edit this file manually to define your
own chromosome subsets, using a FASTA-like syntax.

Chromosome subsets can be defined either by listing the chromosomes to
be included (i.e., starting the line with a plus '+' sign), or by
listing those to be excluded (i.e., starting the line with a minus '-'
sign). For example, if you wish to exclude chromosomes that contain
unmapped contigs (such as "22U"), you can add the following lines to
the chrsubset file:

    >vanilla
    -1U,2U,3U,4U,5U,6U,7U,8U,9U,10U,12U,13U,15U,16U,17U,18U,19U,22U,XU

Equivalently, this subset could have been defined inclusively:

    >vanilla
    +1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,MT,X,Y

The user may then restrict the search to these chromosomes by
providing the "-c vanilla" flag to GMAP. If the user does not specify
any chromosome subset, then the default subset is used. The default
subset is the first one in the chrsubset file.

The user also has the option, with the "-C" flag to GMAP, of
specifying a chrsubset file other than the one in the GMAPDB
directory.


Building map files (for advanced use only)
==========================================

This package includes an implementation of interval index trees
(IITs), which permits efficient lookup of interval information. The
gmap program also allows you (with its -m flag) to look up pre-mapped
annotation information that overlaps your query cDNA sequence. These
interval index trees (or map files) are built using the iit_store
program included in this package. To build a map file, do the
following:

Step 1: Put your map information for a given genome (e.g.
NHGD_R35) into a FASTA file with the following format:

    >label start end optional_tag
    optional_annotation (which may be zero, one, or multiple lines)

For example, the label may be an EST accession, with the start and end
numbers representing their position in the genome database. Tags are
very general and can be used for a variety of purposes.

For mapping purposes, GMAP understands two conventions for coordinates
and tags: whole-genome and chromosomal. At this time, the chromosomal
convention is preferred, but the earlier, whole-genome convention is
still supported for backwards compatibility. The chromosomal option
uses chromosomal coordinates and the tag indicates the chromosome and
strand (using '+' or '-'):

    >NM_004448 35109780 35138441 17+

The whole-genome convention has just two tags to indicate the two
strands of the genome. The tags should be either in ("FWD","REV")
style or ("+","-") style (don't mix the two styles in the same file).
In this convention, the coordinates are universal coordinates, which
are the coordinates computed when chromosomes are concatenated. The
GMAP program provides these coordinates in its alignments. In
addition, the get-genome program can translate chromosomal coordinates
to universal coordinates with its -C flag:

    get-genome -d NHGD_R35 -C 17:35109780..35138441

Since this returns "2690002715--2690031376", the resulting header
would be

    >NM_004448 2690002715 2690031376 FWD

Step 2: Run iit_store on this FASTA file, and put the file into the
maps subdirectory of the corresponding genome directory. The file
should end with ".iit"; if you don't specify this to the -o flag, the
program will add ".iit" to the filename.
    iit_store -o myannot myannot.fa

Now you can retrieve this information with iit_get

    iit_get myannot.iit start [optional_end [optional_tags...]]

You can also retrieve this information with gmap, if you store the map
file in the appropriate genome database, like this

    mv myannot.iit /path/to/genome/directory/NHGD_R35/NHGD_R35.maps/

Then you can retrieve map information for a given cDNA sequence by
specifying the desired map file with the -m flag:

    gmap -d NHGD_R35 -m myannot

The iit_get program has other capabilities, including the ability to
retrieve information by label, like this:

    iit_get myannot.iit label

More details can be found by doing "iit_get --help".

Finally, GMAP and the IIT utilities support the GFF3 format. GMAP can
generate its results in GFF3 format, and iit_store can parse GFF3
files using its -G and -l flags. More details about iit_store can be
found by doing "iit_store --help".
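
As an illustration of the map file format described in Step 1 of
"Building map files", here is a minimal Python sketch that writes one
entry in the chromosomal convention (header ">label start end tag",
where the tag is the chromosome followed by '+' or '-'). The function
name format_map_entry is hypothetical and is not part of this package;
the sketch only formats text and does not call any GMAP program.

```python
def format_map_entry(label, start, end, chromosome, strand, annotation=""):
    """Return one map-file entry in the chromosomal convention.

    Hypothetical helper, not part of GMAP. The header has the form
    '>label start end tag', where the tag is the chromosome name
    followed by the strand ('+' or '-').
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    header = ">%s %d %d %s%s" % (label, start, end, chromosome, strand)
    # The annotation is optional and may span zero, one, or several lines.
    lines = [header] + [ln for ln in annotation.splitlines() if ln]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Coordinates taken from the NM_004448 example in Step 1.
    print(format_map_entry("NM_004448", 35109780, 35138441, "17", "+"), end="")
```

If the output of such a script is saved as myannot.fa, it should be
usable as the input to the iit_store command shown in Step 2.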