Mapping tags to a genome with LAST
==================================

LAST has many adjustable parameters, providing many ways of mapping
tags to a genome.  We cannot tell you which way is best, but we can
describe a few options.

Option 1: Find all matches with up to N mismatches
--------------------------------------------------

Suppose we have tags of length 25, and we wish to find all matches to
the genome, allowing up to one mismatch.  One method is to start by
finding exact matches of length 12: a single mismatch splits the
25-mer into two exactly matching pieces totalling 24 bases, so at
least one piece has length 12 or more.  We can then check whether each
12-mer match can be extended to a 25-mer match with at most one
mismatch.  This method may be very slow, because there will be many
unproductive 12-mer matches.

A better method is to use a spaced seed: 11111011111011111011.  Any
25-mer with one mismatch is guaranteed to have an exact match using
this seed.  Since the seed has 17 matched positions (17 "1"s), we will
get far fewer unproductive matches.
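
Both guarantees can be checked by brute force.  The following Python
sketch is only an illustration: the function names are made up, and it
assumes the seed is built by repeating the pattern cyclically until it
holds the requested number of "1"s.  It tries every mismatch position
in a tag and asks whether some seed placement avoids it::

  from itertools import combinations

  def build_seed(unit, weight):
      """Repeat `unit` cyclically until the seed holds `weight` ones."""
      seed = ""
      i = 0
      while seed.count("1") < weight:
          seed += unit[i % len(unit)]
          i += 1
      return seed

  def is_lossless(seed, tag_len, mismatches):
      """True if every way of placing `mismatches` mismatches in a tag
      leaves some seed offset whose "1" positions are all matches."""
      ones = [i for i, c in enumerate(seed) if c == "1"]
      offsets = range(tag_len - len(seed) + 1)
      for bad in combinations(range(tag_len), mismatches):
          if not any(all(o + i not in bad for i in ones) for o in offsets):
              return False
      return True

  seed = build_seed("111110", 17)
  print(seed)                                          # 11111011111011111011
  print(is_lossless("1" * 12, 25, 1))                  # True
  print(is_lossless("1" * 13, 25, 1))                  # False
  print(is_lossless(seed, 25, 1))                      # True
  print(is_lossless(build_seed("111110", 18), 25, 1))  # False

So, for 25-mers with one mismatch, a contiguous seed cannot be longer
than 12, whereas this spaced seed can require 17 matched positions
(but not 18).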

With LAST, we can do this as follows::

  lastdb -m111110 mydb genome.fa
  lastal -l17 -m999999999 -j1 -r2 -d47 -y1 mydb tags.fa

In the lastdb command, the seed pattern gets cyclically repeated, so
we only need to specify the part up to the "0".  In the lastal
command, we used -l17 to require 17 matched positions in initial
matches, and -m999999999 to accept hugely repeated initial matches.
We also used -j1 to request gapless alignments, -r2 to set the match
score to 2, and -d47 to request alignments with score >= 47.  This
will give us all 25-mer alignments with at most one mismatch.
Finally, we used -y1 to halt extensions if the score drops by more
than 1: this makes it faster without affecting the result.
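
The choice of -d47 follows from simple arithmetic.  The sketch below
assumes the mismatch cost is left at its default of 1 (it is not set
in the command above); the function name is only illustrative::

  def gapless_score(length, mismatches, match_score=2, mismatch_cost=1):
      """Score of a gapless alignment spanning `length` positions."""
      return (length - mismatches) * match_score - mismatches * mismatch_cost

  for k in range(3):
      print(k, gapless_score(25, k))   # prints: 0 50, then 1 47, then 2 44

A 25-mer with one mismatch scores exactly 47, and one with two
mismatches falls below the threshold.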

The following table shows optimal spaced seed patterns for various tag
sizes and numbers of mismatches.  Each entry shows the number of
matched positions (e.g. 17) and the pattern (e.g. 11111011000).

======== ==========  ================  ==================  ====================
Tag size 1 mismatch  2 mismatches      3 mismatches        4 mismatches
======== ==========  ================  ==================  ====================
16       10 11110     7 1110100         4 11010000          3 1110
17       11 11110     7 1110100         5 11010000          4 1110
18       12 11110     8 1110100         5 11010000          4 1110
19       12 11110     8 1110100         6 11010000          4 1110
20       13 11110     8 1110100         6 11010000          4 1100010000
21       14 11110     9 1110100         6 11010000          5 1100010000
22       15 11110    10 1110100         7 1110100000        5 1100010000
23       16 11110    11 1110100         7 11101001000       5 1100010000
24       16 111110   11 1110100         8 11101001000       5 1100010000
25       17 111110   12 1110100         8 11101001000       6 1100010000
26       18 111110   12 1110100         9 11101001000       6 1100010000
27       19 111110   12 1110100         9 11101001000       6 1110100000000
28       20 111110   13 1110100         9 11101001000       7 1110100000000
29       20 111110   14 1110100        10 11101001000       7 1110100000000
30       21 111110   15 1110100        10 11101001000       8 1110100000000
31       22 111110   15 1110100        11 1110110100000     8 1110100000000
32       23 111110   16 1110100        11 111101011001000   8 111010010000000
33       24 111110   16 1110100        12 111101011001000   8 111010010000000
34       25 111110   17 1110111110100  12 111101011001000   9 111010010000000
35       25 1111110  17 11111011000    13 111101011001000   9 111010010000000
36       26 1111110  18 11111011000    13 111101011001000   9 11110010000001000
37       27 1111110  19 11111011000    14 111101011001000  10 11110010000001000
38       28 1111110  19 11111011000    15 111101011001000  10 11110010000001000
39       29 1111110  20 11111011000    15 111101011001000  10 11110010000001000
40       30 1111110  21 11111011000    15 111101011001000  11 11110010000001000
41       30 1111110  21 11111011000    16 111101011001000  11 11110010000001000
======== ==========  ================  ==================  ====================

This table was made using software kindly provided by Laurent Noé,
described in [G Kucherov, L Noé, M Roytberg IEEE/ACM Trans Comput Biol
Bioinform 2005 2(1):51-61].  For larger tags, the calculation time
becomes prohibitive.

Weaknesses of this approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~

* It does not allow for insertions or deletions.

* It does not allow for higher error rates near the ends of tags.

* Usually, some tags match repetitively to millions of genome
  locations: this approach will find them all, which is slow and
  produces huge output.

Option 2: Fast, simple, pragmatic
---------------------------------

We can avoid all these problems by using LAST in a more
straightforward way, although we lose the guarantee of finding all
matches with up to N mismatches.  Suppose we have tags of length 36::

  lastdb mydb genome.fa
  lastal -e30 mydb tags.fa

Here, we used -e30 to get alignments with score >= 30.  In the default
scoring scheme, matches score +1, mismatches score -1, and gaps score
-(7 + gap size).  Therefore, this method allows a few mismatches
and/or a few small gaps.  You can tune this by changing the score
threshold and the scoring scheme.
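
To get a feel for what a given threshold permits, here is a rough
sketch that considers only gapless alignments covering the whole tag,
with matches scoring +1 and mismatches -1 as above.  The function name
is hypothetical::

  def max_mismatches(tag_len, min_score, match_score=1, mismatch_cost=1):
      """Largest number of mismatches a gapless, full-length alignment
      can contain while still scoring at least `min_score`.
      Returns None if even a perfect match scores too low."""
      best = None
      for k in range(tag_len + 1):
          if (tag_len - k) * match_score - k * mismatch_cost >= min_score:
              best = k
      return best

  print(max_mismatches(36, 30))   # 3

So a full-length alignment of a 36-mer can contain up to three
mismatches at this threshold.  Shorter alignments that omit part of
the tag can also pass, provided their score reaches 30.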

Since we did not specify a spaced seed, lastal starts by finding exact
matches.  Specifically, it finds all exact matches, of whatever size,
that occur no more than ten times in the genome.  These matches
typically have a size of around 13, but can be much larger for
repetitive sequences.  This approach works well when there are more
errors near the ends of tags, because errors near the ends are less
likely to disrupt long exact matches.
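
The idea can be illustrated with a toy Python sketch.  It is not how
LAST is implemented: it uses naive string search, the names are made
up, and it assumes that from each start position in the tag we take
the shortest exact match that occurs no more than ``max_hits`` times::

  def count_occurrences(genome, word):
      """Count occurrences of `word` in `genome`, including overlaps."""
      count = 0
      start = genome.find(word)
      while start != -1:
          count += 1
          start = genome.find(word, start + 1)
      return count

  def adaptive_seed_length(genome, tag, start, max_hits=10):
      """Length of the shortest match starting at tag[start] that
      occurs at least once and at most `max_hits` times, else None."""
      for length in range(1, len(tag) - start + 1):
          hits = count_occurrences(genome, tag[start:start + length])
          if hits == 0:
              return None
          if hits <= max_hits:
              return length
      return None

  genome = "AC" * 15 + "GATTACA"    # a small, repetitive toy "genome"
  print(adaptive_seed_length(genome, "ACGATT", 0))                # 3
  print(adaptive_seed_length(genome, "ACGATT", 0, max_hits=20))   # 1

Here "A" and "AC" occur 18 and 16 times, so the match must grow to
"ACG" before it is rare enough.  This is why matches are longer in
repetitive sequence.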

Getting information on repetitive tags
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One problem with this approach is that it gives no information on tags
that map repetitively to many locations.  We can get some information
on repetitive tags, without resorting to -m999999999, by counting
initial matches::

  lastal -j0 mydb tags.fa

A small risk with database volumes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

LAST is designed to work with 2 GB of memory, so it splits large
(e.g. mammalian) genomes into "volumes", and maps tags to each volume
in turn as if they were separate genomes.  There is a small chance
that a tag maps uniquely to one volume but repetitively to another.
If we get no information about the repetitive mappings, we might
mistakenly infer that the tag maps uniquely to the genome.  You can
check for this situation by using
lastal -j0 (or resorting to -m999999999).  If you have enough memory
(around 16 GB for a mammalian genome), you can put the whole genome
into one volume using the -s option of lastdb.  (Even if the genome is
in one volume, the two strands get searched separately.)
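
If you have per-volume match counts for each tag, however you obtained
them, the check amounts to summing over volumes before deciding that a
tag is unique.  A minimal sketch, with made-up volume names, tag names
and counts, and a hypothetical function name::

  def total_hits(per_volume_hits):
      """Sum each tag's hit count over all volumes.

      `per_volume_hits` maps volume name -> {tag name -> hit count}.
      """
      totals = {}
      for hits in per_volume_hits.values():
          for tag, n in hits.items():
              totals[tag] = totals.get(tag, 0) + n
      return totals

  # Made-up counts: tagB looks unique if we only see the first volume.
  counts = {"mydb0": {"tagA": 1, "tagB": 1},
            "mydb1": {"tagB": 250}}
  totals = total_hits(counts)
  unique = sorted(t for t, n in totals.items() if n == 1)
  print(unique)   # ['tagA']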

Other options
-------------

It's possible to use a hybrid of the two options above.  For example,
use a spaced seed but don't use -m999999999.
