last-map-probs.py
=================

This script aims to choose between several alignments of a query
sequence.  It assumes that exactly one reported alignment reflects the
origin of each query sequence.  It calculates the probability that
each alignment is a "mismap" (is not the one that reflects the origin
of the query).

Example 1: If a query sequence only has one reported alignment, its
           mismap probability will be zero.

Example 2: If a query sequence has two reported alignments, with
           identical alignment scores, their mismap probabilities will
           be 0.5.

Options
-------

  -h, --help          Print a help message and exit.
  -m M, --mismap=M    Don't write alignments with mismap probability > M.
  -s S, --score=S     Don't write alignments with score < S.

Typical usage
-------------

This is a typical way of mapping short DNA reads to a genome:

  lastdb -m1111110 humandb human/chr*.fa
  lastal -Q1 -d108 -e120 humandb reads.fastq > myalns.maf
  last-map-probs.py -s150 myalns.maf > myalns2.maf

* The final alignments have score >= 150 (-s150).  This score is high
  enough that random, spurious alignments are expected only once every
  few thousand reads (for ~50 bp reads versus a mammalian genome).

* The initial alignments have score >= 120 (-e120).  This is because
  the mismap probability of an alignment with score 150 may depend on
  alignments with score < 150.  Setting e = s-30 seems to work well.

* The gapless score threshold when finding initial alignments is 108
  (-d108).  This should be high enough to avoid triggering too many
  time-consuming gapped alignments.
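
The reason for keeping e below s can be checked with a little arithmetic.
The following is only a sketch: the function name ``top_mismap`` and the
value of ``T`` are hypothetical, chosen for illustration (the real t
depends on the scoring scheme).  It compares the mismap probability of a
score-150 alignment when a competing score-140 alignment is, or is not,
among the initial alignments.

```python
import math


def top_mismap(scores, t):
    """Mismap probability of the highest-scoring alignment of one query."""
    best = max(scores)
    # Weights relative to the best score, so the best alignment has weight 1.
    weights = [math.exp((s - best) / t) for s in scores]
    return 1.0 - 1.0 / sum(weights)


T = 4.0  # hypothetical value, for illustration only

# One read with alignments of score 150 and 140.
with_low_threshold = top_mismap([150, 140], T)  # e.g. -e120: both are found
with_high_threshold = top_mismap([150], T)      # e.g. -e150: 140 is never seen

print(with_low_threshold, with_high_threshold)
```

With the higher initial threshold, the competing alignment is invisible,
so the score-150 alignment would wrongly appear to be unambiguous.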

Details
-------

* This script can read alignments in either of the formats produced by
  lastal (maf or tabular).

* The input must be a real file (not a pipe), because the script makes
  two passes over it.
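
To illustrate why two passes are needed, here is a minimal sketch, not
the script's actual code.  It assumes tabular input in which column 1 is
the score and column 7 is the query name, and it takes t as an argument
instead of reading it from the header.  The function name
``two_pass_mismap`` and its parameters are hypothetical.  The first pass
gathers every score for each query; only then can the second pass compute
each alignment's mismap probability and apply the two filters.

```python
import math
from collections import defaultdict


def _records(path):
    """Yield (line, score, query_name) for each alignment line in the file."""
    with open(path) as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip header/comment lines and blank lines
            fields = line.split()
            yield line, float(fields[0]), fields[6]


def two_pass_mismap(path, t, max_mismap=1.0, min_score=0.0):
    """Yield (line, mismap) for alignments that pass both filters."""
    # Pass 1: collect all scores of each query.
    scores = defaultdict(list)
    for _, score, query in _records(path):
        scores[query].append(score)

    # Per query: the top score, and the sum of exp((s - top) / t).
    # Subtracting the top score avoids overflow without changing the ratios.
    summary = {}
    for query, values in scores.items():
        top = max(values)
        summary[query] = (top, sum(math.exp((s - top) / t) for s in values))

    # Pass 2: re-read the file and filter.  The score filter is applied
    # only here, so low-scoring alignments still count in the denominator.
    for line, score, query in _records(path):
        top, total = summary[query]
        mismap = 1.0 - math.exp((score - top) / t) / total
        if score >= min_score and mismap <= max_mismap:
            yield line, mismap
```

A pipe cannot be rewound for the second pass, which is why a real file is
required.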

Limitations
-----------

* It is possible that two or more alignments reflect the origin of one
  query sequence, for instance if the query arose by splicing.  This
  script makes no allowance for that possibility.

Method
------

Suppose one query sequence has three alignments, with scores s1, s2,
and s3.  The probability that the first alignment is the one that
reflects the origin of the query is:

        exp(s1/t) / [exp(s1/t) + exp(s2/t) + exp(s3/t)]

Its mismap probability is one minus this value.  Here, t is a
parameter that depends on the scoring scheme: lastal writes it in the
header of its output.
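
The formula above can be sketched in a few lines of Python.  This is an
illustration under stated assumptions, not the script's actual code: the
function name ``mismap_probabilities`` is hypothetical, and the value of
t used in the examples is made up.

```python
import math


def mismap_probabilities(scores, t):
    """Return the mismap probability of each alignment of one query."""
    # Subtract the top score before exponentiating.  This avoids overflow
    # for large scores and leaves the ratios in the formula unchanged.
    top = max(scores)
    weights = [math.exp((s - top) / t) for s in scores]
    total = sum(weights)
    # Mismap probability = 1 - (probability that this alignment is correct).
    return [1.0 - w / total for w in weights]


T = 4.0  # hypothetical value; the real t depends on the scoring scheme

print(mismap_probabilities([100], T))      # Example 1: [0.0]
print(mismap_probabilities([80, 80], T))   # Example 2: [0.5, 0.5]
print(mismap_probabilities([150, 140, 120], T))
```

The first two calls reproduce Examples 1 and 2 from the top of this
document.  In the third, the highest-scoring alignment gets the lowest
mismap probability.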

Reference
---------

For more information, please see this article:
  Incorporating sequence quality data into alignment improves DNA read mapping
  Frith MC, Wan R, Horton P
  Nucleic Acids Res. 2010 38:e100
