samtools(1)                  Bioinformatics tools                  samtools(1)



NAME
       samtools - Utilities for the Sequence Alignment/Map (SAM) format

SYNOPSIS
       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

       samtools sort aln.bam aln.sorted

       samtools index aln.sorted.bam

       samtools view aln.sorted.bam chr2:20,100,000-20,200,000

       samtools merge out.bam in1.bam in2.bam in3.bam

       samtools faidx ref.fasta

       samtools pileup -f ref.fasta aln.sorted.bam

       samtools tview aln.sorted.bam ref.fasta


DESCRIPTION
       Samtools  is  a  set of utilities that manipulate alignments in the BAM
       format. It imports from and exports to the SAM (Sequence Alignment/Map)
       format, sorts, merges and indexes alignments, and allows reads in any
       region to be retrieved quickly.

       Samtools is designed to work on a stream. It regards an input file  `-'
       as  the  standard  input (stdin) and an output file `-' as the standard
       output (stdout). Several commands can thus be combined with Unix pipes.
       Samtools always writes warning and error messages to the standard error
       output (stderr).

       Samtools is also able to open a BAM (not SAM) file on a remote  FTP  or
       HTTP  server  if  the  BAM file name starts with `ftp://' or `http://'.
       Samtools checks the current working directory for the index file and
       downloads the index if it is absent. Samtools does not retrieve the
       entire alignment file unless it is asked to do so.


COMMANDS AND OPTIONS
       import    samtools import <in.ref_list> <in.sam> <out.bam>

                 Since 0.1.4, this command is an alias of:

                 samtools view -bt <in.ref_list> -o <out.bam> <in.sam>


       sort      samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>

                 Sort alignments by leftmost coordinates. File
                 <out.prefix>.bam will be created. This command may also
                 create temporary files <out.prefix>.%d.bam when the whole
                 alignment does not fit into memory (controlled by option -m).

                 OPTIONS:

                 -n      Sort by read names rather than by chromosomal coordi-
                         nates

                 -m INT  Approximate maximum memory required. [500000000]


       merge     samtools   merge   [-h   inh.sam]  [-n]  <out.bam>  <in1.bam>
                 <in2.bam> [...]

                 Merge multiple sorted alignments.  The header reference lists
                 of  all  the input BAM files, and the @SQ headers of inh.sam,
                 if  any,  must  all  refer  to  the  same  set  of  reference
                 sequences.   The header reference list and (unless overridden
                 by -h) `@' headers of in1.bam will be copied to out.bam,  and
                 the headers of other files will be ignored.

                 OPTIONS:

                 -h FILE Use  the lines of FILE as `@' headers to be copied to
                         out.bam, replacing any header lines that would other-
                         wise  be  copied  from in1.bam.  (FILE is actually in
                         SAM format, though any alignment records it may  con-
                         tain are ignored.)

                 -n      The  input alignments are sorted by read names rather
                         than by chromosomal coordinates


       index     samtools index <aln.bam>

                 Index sorted alignment for fast  random  access.  Index  file
                 <aln.bam>.bai will be created.


       view      samtools  view  [-bhuHS]  [-t  in.refList]  [-o  output]  [-f
                 reqFlag] [-F skipFlag] [-q minMapQ] [-l  library]  [-r  read-
                 Group] <in.bam>|<in.sam> [region1 [...]]

                 Extract/print all alignments or a subset of them in SAM or
                 BAM format. If no region is specified, all the alignments
                 will be printed; otherwise only alignments overlapping the
                 specified regions will be output. An alignment may be output
                 multiple times if it overlaps several regions. A region can
                 be given, for example, in one of the following formats:
                 `chr2' (the whole chr2), `chr2:1000000' (region starting
                 from 1,000,000bp) or `chr2:1,000,000-2,000,000' (region
                 between 1,000,000 and 2,000,000bp including the end points).
                 Coordinates are 1-based.
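
                 As an illustration only, the region syntax above can be
                 parsed as in the following Python sketch. The names
                 parse_region and overlaps are hypothetical and are not part
                 of samtools; an omitted start or end is represented as None.

```python
import re

# Hypothetical helpers for illustration; not part of samtools.
# Matches `name', `name:start' or `name:start-end', where the
# numbers may contain thousands separators (commas).
_REGION = re.compile(r"^([^:]+)(?::([\d,]+)(?:-([\d,]+))?)?$")


def parse_region(region):
    """Return (name, start, end); start/end are 1-based or None if omitted."""
    m = _REGION.match(region)
    if m is None:
        raise ValueError("unrecognised region: %r" % region)
    name, start, end = m.groups()

    def to_int(s):
        # Strip commas so that 1,000,000 becomes 1000000.
        return int(s.replace(",", "")) if s else None

    return name, to_int(start), to_int(end)


def overlaps(region, ref, aln_start, aln_end):
    """True if an alignment on `ref' covering [aln_start, aln_end]
    (1-based, inclusive) overlaps the parsed region tuple."""
    name, start, end = region
    if ref != name:
        return False
    if start is not None and aln_end < start:
        return False
    if end is not None and aln_start > end:
        return False
    return True
```

                 For example, parse_region("chr2:1,000,000-2,000,000")
                 returns ("chr2", 1000000, 2000000), with both end points
                 treated as inclusive.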

                 OPTIONS:

                 -b      Output in the BAM format.

                 -u      Output uncompressed BAM. This option saves time spent
                         on compression/decompression and is thus preferred
                         when the output is piped to another samtools command.

                 -h      Include the header in the output.

                 -H      Output the header only.

                 -S      Input  is in SAM. If @SQ header lines are absent, the
                         `-t' option is required.

                 -t FILE This file is TAB-delimited. Each  line  must  contain
                         the  reference  name and the length of the reference,
                         one line  for  each  distinct  reference;  additional
                         fields  are ignored. This file also defines the order
                         of the reference sequences in  sorting.  If  you  run
                         `samtools  faidx  <ref.fa>', the resultant index file
                         <ref.fa>.fai can be used as this <in.ref_list>  file.

                 -o FILE Output file [stdout]

                 -f INT  Only  output  alignments with all bits in INT present
                         in the FLAG field. INT can be in hex in the format of
                         /^0x[0-9A-F]+/ [0]

                 -F INT  Skip alignments with bits present in INT [0]

                 -q INT  Skip alignments with MAPQ smaller than INT [0]

                 -l STR  Only output reads in library STR [null]

                 -r STR  Only output reads in read group STR [null]
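
                 The combined effect of -f, -F and -q can be sketched as a
                 single predicate. This Python sketch is illustrative and is
                 not the samtools source; passes_filters and parse_flag_arg
                 are hypothetical names, and the sketch assumes that -F
                 rejects an alignment when any bit of INT is set in FLAG.

```python
def parse_flag_arg(text):
    """Accept a decimal value or a 0x-prefixed hex value, as described for -f."""
    return int(text, 0)


def passes_filters(flag, mapq, req_flag=0, skip_flag=0, min_mapq=0):
    """Return True if an alignment with this FLAG and MAPQ would be kept."""
    # -f: every bit set in req_flag must also be set in FLAG.
    if (flag & req_flag) != req_flag:
        return False
    # -F: assumed to reject when ANY bit of skip_flag is set in FLAG.
    if flag & skip_flag:
        return False
    # -q: skip alignments whose MAPQ is smaller than min_mapq.
    return mapq >= min_mapq
```

                 With the defaults of 0 for all three options, every
                 alignment passes, which matches the [0] defaults listed
                 above.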


       faidx     samtools faidx <ref.fasta> [region1 [...]]

                 Index  reference sequence in the FASTA format or extract sub-
                 sequence from indexed reference sequence.  If  no  region  is
                 specified, faidx will index the file and create
                 <ref.fasta>.fai on the disk. If regions are specified, the
                 subsequences  will  be retrieved and printed to stdout in the
                 FASTA format. The input file can be compressed  in  the  RAZF
                 format.


       pileup    samtools   pileup  [-f  in.ref.fasta]  [-t  in.ref_list]  [-l
                 in.site_list]   [-iscgS2]   [-T   theta]   [-N   nHap]    [-r
                 pairDiffRate] <in.bam>|<in.sam>

                 Print  the alignment in the pileup format. In the pileup for-
                 mat, each line represents a genomic position,  consisting  of
                 chromosome name, coordinate, reference base, read bases, read
                 qualities and alignment  mapping  qualities.  Information  on
                 match, mismatch, indel, strand, mapping quality and start and
                 end of a read are all encoded at the  read  base  column.  At
                 this  column,  a dot stands for a match to the reference base
                 on the forward strand, a comma for a  match  on  the  reverse
                 strand,  `ACGTN'  for  a  mismatch  on the forward strand and
                 `acgtn' for a mismatch  on  the  reverse  strand.  A  pattern
                 `\+[0-9]+[ACGTNacgtn]+'   indicates  there  is  an  insertion
                 between this reference position and the next reference  posi-
                 tion.  The length of the insertion is given by the integer in
                 the pattern, followed by the inserted sequence. Similarly,  a
                 pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the
                 reference. The deleted bases will be presented as `*' in  the
                 following lines. Also in the read base column, a symbol `^'
                 marks the start of a read segment, which is a contiguous
                 subsequence of the read separated by `N/S/H' CIGAR
                 operations. The ASCII code of the character following `^'
                 minus 33 gives the mapping quality. A symbol `$' marks the
                 end of a read segment.
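
                 The encoding of the read base column described above can be
                 decoded with a small tokenizer. The following Python sketch
                 is illustrative only; tokenize_read_bases and its token
                 names are hypothetical. It reads exactly as many indel bases
                 as the integer specifies, so that a mismatch base directly
                 following an indel is not absorbed into it.

```python
def tokenize_read_bases(col):
    """Split a pileup read-base column into (kind, value) tokens."""
    tokens = []
    i = 0
    while i < len(col):
        c = col[i]
        if c == "^":
            # Start of a read segment; the next character encodes MAPQ + 33.
            tokens.append(("start", ord(col[i + 1]) - 33))
            i += 2
        elif c == "$":
            tokens.append(("end", None))
            i += 1
        elif c in ".,":
            # Dot: match on the forward strand; comma: match on the reverse.
            tokens.append(("match", "+" if c == "." else "-"))
            i += 1
        elif c in "+-":
            # Indel: sign, decimal length, then exactly that many bases.
            j = i + 1
            while j < len(col) and col[j].isdigit():
                j += 1
            n = int(col[i + 1:j])
            kind = "insertion" if c == "+" else "deletion"
            tokens.append((kind, col[j:j + n]))
            i = j + n
        elif c == "*":
            # Placeholder for a base deleted from the reference.
            tokens.append(("deleted", None))
            i += 1
        elif c in "ACGTNacgtn":
            # Upper case: mismatch on the forward strand; lower case: reverse.
            tokens.append(("mismatch", c))
            i += 1
        else:
            raise ValueError("unexpected character %r at offset %d" % (c, i))
    return tokens
```

                 For example, the column `^I.,$' yields a segment start with
                 mapping quality 40 (ASCII 73 minus 33), a forward match, a
                 reverse match and a segment end.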

                 If option -c is applied,  the  consensus  base,  Phred-scaled
                 consensus  quality, SNP quality (i.e. the Phred-scaled proba-
                 bility of the consensus being identical to the reference) and
                 root  mean square (RMS) mapping quality of the reads covering
                 the site will be inserted between the  `reference  base'  and
                 the  `read  bases'  columns.  An indel occupies an additional
                 line. Each indel line consists of  chromosome  name,  coordi-
                 nate,  a  star, the genotype, consensus quality, SNP quality,
                 RMS mapping quality, # covering reads, the first allele, the
                 second  allele,  # reads supporting the first allele, # reads
                 supporting the second allele and #  reads  containing  indels
                 different from the top two alleles.

                 OPTIONS:


                 -s        Print  the mapping quality as the last column. This
                           option makes the output easier to  parse,  although
                           this format is not space efficient.


                 -S        The input file is in SAM.


                 -i        Only output pileup lines containing indels.


                 -f FILE   The  reference  sequence in the FASTA format. Index
                           file FILE.fai will be created if absent.


                 -M INT    Cap mapping quality at INT [60]


                 -t FILE   List of reference names and sequence lengths, in
                           the format described for the import command. If
                           this option is present, samtools assumes the input
                           <in.alignment> is in SAM format; otherwise it
                           assumes the input is in BAM format.


                 -l FILE   List of sites at which pileup is output. This  file
                           is  space  delimited.  The  first  two  columns are
                           required to be chromosome and  1-based  coordinate.
                           Additional columns are ignored. It is recommended
                           to use option -s together with -l, because the
                           mapping quality may not be known in the default
                           format.


                 -c        Call the consensus sequence using the MAQ consensus
                           model. Options -T, -N, -I and -r are only effective
                           when -c or -g is in use.


                 -g        Generate  genotype  likelihood  in the binary GLFv3
                           format. This option suppresses -c, -i and -s.


                 -T FLOAT  The theta parameter (error dependency coefficient)
                           in the MAQ consensus calling model [0.85]


                 -N INT    Number of haplotypes in the sample (>=2) [2]


                 -r FLOAT  Expected  fraction of differences between a pair of
                           haplotypes [0.001]


                 -I INT    Phred probability of an indel  in  sequencing/prep.
                           [40]



       tview     samtools tview <in.sorted.bam> [ref.fasta]

                 Text alignment viewer (based on the ncurses library). In the
                 viewer, press `?' for help and press `g' to view the
                 alignment starting from a region given in a format like
                 `chr10:10,000,000'.


       fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>

                 Fill in mate coordinates, ISIZE and mate related flags from a
                 name-sorted alignment.


       rmdup     samtools rmdup <input.srt.bam> <out.bam>

                 Remove potential PCR duplicates: if multiple read pairs have
                 identical external coordinates, only retain the pair with
                 the highest mapping quality. This command ONLY works with FR
                 orientation and requires ISIZE to be correctly set.


       rmdupse   samtools rmdupse <input.srt.bam> <out.bam>

                 Remove potential duplicates for single-ended reads. This
                 command will treat all reads as single-ended even if they
                 are in fact paired.


       fillmd    samtools fillmd [-e] <aln.bam> <ref.fasta>

                 Generate the MD tag. If the MD tag is already  present,  this
                 command  will  give a warning if the MD tag generated is dif-
                 ferent from the existing tag.

                 OPTIONS:

                 -e      Convert a read base to `=' if it is identical to the
                         aligned reference base. The indel caller does not
                         support `=' bases at the moment.



SAM FORMAT
       SAM is TAB-delimited. Apart from the header lines, which start with
       the `@' symbol, each alignment line consists of:


       +----+-------+----------------------------------------------------------+
       |Col | Field |                       Description                        |
       +----+-------+----------------------------------------------------------+
       | 1  | QNAME | Query (pair) NAME                                        |
       | 2  | FLAG  | bitwise FLAG                                             |
       | 3  | RNAME | Reference sequence NAME                                  |
       | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
       | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
       | 6  | CIGAR | extended CIGAR string                                    |
       | 7  | MRNM  | Mate Reference sequence NaMe (`=' if same as RNAME)      |
       | 8  | MPOS  | 1-based Mate POSition                                    |
       | 9  | ISIZE | Inferred insert SIZE                                     |
       |10  | SEQ   | query SEQuence on the same strand as the reference       |
       |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
       |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
       +----+-------+----------------------------------------------------------+
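
       As a minimal sketch of the column layout above, an alignment line can
       be split into its fields in Python as follows. The names Alignment,
       parse_sam_line and base_qualities are hypothetical, and the sketch
       assumes a well-formed line with all 11 mandatory fields present.

```python
from collections import namedtuple

# Field names follow the table above; `opt' collects columns 12 onwards.
Alignment = namedtuple(
    "Alignment",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual opt",
)


def parse_sam_line(line):
    """Parse one TAB-delimited alignment line (not a `@' header line)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 fields, got %d" % len(f))
    return Alignment(
        f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
        f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:],
    )


def base_qualities(qual):
    """Decode QUAL: the ASCII code of each character minus 33."""
    return [ord(c) - 33 for c in qual]
```

       For example, the QUAL string `II5!' decodes to the Phred base
       qualities 40, 40, 20 and 0.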

       Each bit in the FLAG field is defined as:


             +-------+--------------------------------------------------+
             | Flag  |                   Description                    |
             +-------+--------------------------------------------------+
             |0x0001 | the read is paired in sequencing                 |
             |0x0002 | the read is mapped in a proper pair              |
             |0x0004 | the query sequence itself is unmapped            |
             |0x0008 | the mate is unmapped                             |
             |0x0010 | strand of the query (1 for reverse)              |
             |0x0020 | strand of the mate                               |
             |0x0040 | the read is the first read in a pair             |
             |0x0080 | the read is the second read in a pair            |
             |0x0100 | the alignment is not primary                     |
             |0x0200 | the read fails platform/vendor quality checks    |
             |0x0400 | the read is either a PCR or an optical duplicate |
             +-------+--------------------------------------------------+
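
       The FLAG bits can be decoded with a simple bitwise test per row of the
       table. In this Python sketch, decode_flag and the short labels are
       hypothetical names chosen for illustration. The label for 0x0020
       assumes that, as for 0x0010, a set bit means the reverse strand.

```python
# (bit, label) pairs in the order of the table above.
FLAG_BITS = [
    (0x0001, "paired"),
    (0x0002, "proper_pair"),
    (0x0004, "unmapped"),
    (0x0008, "mate_unmapped"),
    (0x0010, "reverse"),
    (0x0020, "mate_reverse"),  # assumption: set bit means reverse strand
    (0x0040, "first_in_pair"),
    (0x0080, "second_in_pair"),
    (0x0100, "not_primary"),
    (0x0200, "qc_fail"),
    (0x0400, "duplicate"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in FLAG."""
    return [label for bit, label in FLAG_BITS if flag & bit]
```

       For example, a FLAG of 99 (0x63) decodes to paired, proper_pair,
       mate_reverse and first_in_pair.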

LIMITATIONS
       o Unaligned   words  used  in  bam_import.c,  bam_endian.h,  bam.c  and
         bam_aux.c.

       o CIGAR operation P is not properly handled at the moment.

       o In merging, the input files are required to have the same  number  of
         reference  sequences.  The  requirement  can be relaxed. In addition,
         merging does not reconstruct the header  dictionaries  automatically.
         End users have to provide the correct header. Picard is better at
         merging.

       o Samtools' rmdup does not work for single-end data and does not remove
         duplicates across chromosomes. Picard is better.


AUTHOR
       Heng  Li from the Sanger Institute wrote the C version of samtools. Bob
       Handsaker from the Broad Institute implemented the BGZF library and Jue
       Ruan  from  Beijing  Genomics Institute wrote the RAZF library. Various
       people in the 1000 Genomes Project contributed to the SAM format
       specification.


SEE ALSO
       Samtools website: <http://samtools.sourceforge.net>



samtools-0.1.7                 10 November 2009                    samtools(1)
