last-map-probs.py
=================

This script reads alignments of DNA reads to a genome, and estimates
the probability that each alignment represents the genomic source of
the read.

It writes each alignment with its "mismap" probability, i.e. the
probability that the alignment does not represent the genomic source
of the read.  By default, it discards alignments with mismap
probability > 0.01.

Typical usage
-------------

These commands map DNA reads to the human genome:

  lastdb -m1111110 hu human/chr*.fa
  lastal -Q1 -d108 -e120 hu reads.fastq | last-map-probs.py -s150 > myalns.maf

Explanation of typical usage
----------------------------

These commands find alignments with mismap probability <= 0.01 and
score >= 150 (-s150).  The score threshold should be high enough to
avoid random, spurious alignments: otherwise, the mismap probabilities
will not be reliable.  A threshold of 150 is often reasonable.  For
instance, if we compare 50 bp reads to the human genome, we expect a
random alignment with score >= 150 once every few thousand reads.

The lastal command finds alignments with score >= 120 (-e120).  This
is because the mismap probability of an alignment with score 150 may
depend on alignments with score < 150.  Setting e = s-30 seems to work
well.
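
To see why the lower-scoring alignments matter, here is a minimal
sketch that applies the formula from the Method section to one read.
The scores and the value of t are made up for illustration, and the
function name is hypothetical: it is not taken from the script.

```python
import math

def mismap_of_best(scores, t):
    """Mismap probability of the top-scoring alignment among `scores`."""
    top = max(scores)
    # Dividing every term by exp(top/t) leaves the ratio unchanged.
    weights = [math.exp((s - top) / t) for s in scores]
    return 1.0 - 1.0 / sum(weights)

T = 4.0  # hypothetical value; the real t is written in the lastal header

# Hypothetical read with one alignment of score 150 and a rival of score 135.
seen_with_e120 = [150, 135]
# With -e150, lastal would never report the rival.
seen_with_e150 = [s for s in seen_with_e120 if s >= 150]

with_rival = mismap_of_best(seen_with_e120, T)
without_rival = mismap_of_best(seen_with_e150, T)
```

With these made-up numbers, the alignment of score 150 gets a mismap
probability above 0.01 when the rival is visible, but a mismap
probability of zero when it is hidden, which would be overconfident.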

Finally, the -d108 option makes lastal use a score threshold of 108 in
its gapless alignment phase.  This is needed because the default (d =
e*3/5, which is 72 here) is too low and would make lastal too slow.
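
The arithmetic behind these thresholds can be sketched as follows.
The helper name is hypothetical, and the use of integer division for
the default d is an assumption (it makes no difference when e = 120).

```python
def lastal_thresholds(s, margin=30):
    """Return (e, default_d) for a last-map-probs.py score threshold s.

    e follows the rule of thumb e = s - 30, and default_d is the
    gapless threshold lastal would use by default, d = e*3/5.
    """
    e = s - margin
    default_d = e * 3 // 5  # assumption: rounded down to an integer
    return e, default_d

e, default_d = lastal_thresholds(150)
```

For s = 150 this gives e = 120 and a default d of 72, which is well
below the 108 used in the typical usage above.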

Options
-------

  -h, --help          Print a help message and exit.
  -m M, --mismap=M    Don't write alignments with mismap probability > M
                      (default: 0.01).
  -s S, --score=S     Don't write alignments with score < S.

Details
-------

* This script can read alignments in either of the formats produced by
  lastal (maf or tabular).

* The script reads one batch of alignments at a time (by looking for
  lines starting with "# batch").  If the batches are huge, it might
  need too much memory.  You can make the batches smaller using
  lastal's -i option.
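
Batch-at-a-time reading can be sketched roughly like this.  It is an
illustration only, not the script's actual code: the function name is
hypothetical, and it assumes that each "# batch" line marks the start
of a new batch.

```python
def split_batches(lines):
    """Yield lists of lines, starting a new list at each '# batch' line."""
    batch = []
    for line in lines:
        if line.startswith("# batch") and batch:
            yield batch  # hand over the finished batch before starting anew
            batch = []
        batch.append(line)
    if batch:
        yield batch  # whatever follows the last '# batch' line

# Made-up input: a header line, then two batches of alignment lines.
example = ["# header", "# batch 0", "aln1", "aln2", "# batch 1", "aln3"]
batches = list(split_batches(example))
```

Because the function is a generator, only one batch is held in memory
at a time when it is fed a file object or sys.stdin.  In this sketch,
any header lines before the first "# batch" line come out as a batch
of their own.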

Limitations
-----------

* It is possible that two or more alignments reflect the origin of one
  query sequence, for instance if the query arose by splicing.  This
  script makes no allowance for that possibility.

Method
------

Suppose one query sequence has three alignments, with scores s1, s2,
and s3.  The probability that the first alignment is the one that
reflects the origin of the query is:

        exp(s1/t) / [exp(s1/t) + exp(s2/t) + exp(s3/t)]

Here, t is a parameter that depends on the scoring scheme: it is
written in the lastal header.  The mismap probability of the first
alignment is one minus this value.
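
As an illustration, the calculation for any number of alignments of
one query might look like the sketch below.  The function name, the
scores and the value of t are all hypothetical.  Subtracting the top
score before exponentiating is an added precaution against overflow:
it does not change the result.

```python
import math

def mismap_probabilities(scores, t):
    """Return the mismap probability of each alignment of one query."""
    if not scores:
        return []
    top = max(scores)
    # exp((s - top)/t) is exp(s/t) divided by a common factor, so the
    # ratios below are the same as in the formula above.
    weights = [math.exp((s - top) / t) for s in scores]
    total = sum(weights)
    return [1.0 - w / total for w in weights]

# Hypothetical scores for three alignments of one query, and a
# hypothetical t (the real value is written in the lastal header).
probs = mismap_probabilities([150, 130, 120], t=4.0)
```

With these made-up numbers, the first alignment has a mismap
probability of about 0.007, so it would pass the default cutoff of
0.01, while the other two have mismap probabilities close to 1.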

Reference
---------

For more information, please see this article:
  Incorporating sequence quality data into alignment improves DNA read mapping
  Frith MC, Wan R, Horton P
  Nucleic Acids Res. 2010 38:e100
