Aligning bisulfite-converted DNA reads to a genome
==================================================

Bisulfite is used to detect methylated cytosines.  It converts
unmethylated Cs to Ts, but it leaves methylated Cs intact.  If we then
sequence the DNA and align it to a reference genome, we can infer
cytosine methylation.

To align the DNA accurately, we should take the C->T conversion into
account.  Here is how to do it with LAST.

Let's assume we have bisulfite-converted DNA reads in a file called
"reads.fastq" (in fastq-sanger format), and the genome is in
"mygenome.fa" (in fasta format).  We will also assume that all the
reads are from the converted strand, and not its reverse-complement
(i.e. they have C->T conversions and not G->A conversions).

First, we need to run lastdb twice, for forward-strand and
reverse-strand alignments::

  lastdb -u bisulfite_f.seed my_f mygenome.fa
  lastdb -u bisulfite_r.seed my_r mygenome.fa

Then find alignments, one strand at a time::

  lastal -p bisulfite_f.mat -s1 -Q1 -e120 my_f reads.fastq > temp_f
  lastal -p bisulfite_r.mat -s0 -Q1 -e120 my_r reads.fastq > temp_r

Take care not to mistype these commands, especially f/r and s1/s0.
Finally, merge the alignments and estimate which one represents the
genomic source of each read::

  last-merge-batches.py temp_f temp_r | last-map-probs.py -s150 > myalns.maf

These commands refer to files (bisulfite_f.seed etc), which are in the
examples directory.  You need to specify exactly where they are
(e.g. "-u examples/bisulfite_f.seed").

Avoiding biased methylation estimates
-------------------------------------

Imagine that one genomic cytosine is methylated in 50% of cells in
your sample, so that 50% of reads covering it have C and 50% have T.
It is possible that the reads with C are easier to align, so we align
more of them.  The methylation rate would then look higher than 50%.

We can avoid this bias by converting all Cs in the reads to Ts, before
aligning them::

  perl -pe 'y/C/t/ if $. % 4 == 2' reads.fastq | lastal ... my_f - > temp_f
  perl -pe 'y/C/t/ if $. % 4 == 2' reads.fastq | lastal ... my_r - > temp_r

This perl command assumes that the fastq file has uppercase sequences,
and no line-wrapping or blank lines.  It converts Cs to lowercase Ts:
lowercase has no effect on the alignment, but it lets you see where
the Cs were in the output.

Tips
----

* To go faster, try gapless alignment (add -j1 to the lastal options).
  Often, this is only minusculely less accurate than gapped alignment.

* The standard recipe discards alignments that have less than 99%
  confidence (including multi-mappers).  If you wish to keep these,
  increase the last-map-probs -m parameter::

    ... | last-map-probs.py -s150 -m0.9 > myalns.maf

* You can avoid temp files by using named pipes::

    mkfifo pipe1 pipe2
    lastal -p bisulfite_f.mat -s1 -Q1 -e120 my_f reads.fastq > pipe1 &
    lastal -p bisulfite_r.mat -s0 -Q1 -e120 my_r reads.fastq > pipe2 &
    last-merge-batches.py pipe1 pipe2 | last-map-probs.py -s150 > myalns.maf

  This probably works best if you have enough memory for both my_f and
  my_r (~30 GB for the human genome).

* To reduce the memory requirement, use lastdb's -w option::

    lastdb -w2 -u bisulfite_f.seed my_f mygenome.fa
    lastdb -w2 -u bisulfite_r.seed my_r mygenome.fa

  This makes the named pipe recipe use under 20 GB for the human
  genome, and the accuracy was not harmed in our tests.

* The .tis files created by lastdb (e.g. my_f.tis, my_r.tis) are
  identical.  So you can shave a few GB by linking them::

    ln -f my_f.tis my_r.tis

Paired-end DNA reads
--------------------

You can align paired-end reads by combining the recipe in this
document with the one in `<last-pair-probs.html>`_.  For large
datasets, you would want to go faster by using multiple CPUs.  This
gets a bit complicated, so we provide a lastal-bisulfite-paired script
in the examples directory.  Typical usage::

  lastdb -w2 -u bisulfite_f.seed my_f mygenome.fa
  lastdb -w2 -u bisulfite_r.seed my_r mygenome.fa

  lastal-bisulfite-paired.sh my_f my_r reads1.fastq reads2.fastq > results.maf

This assumes that reads1.fastq are all from the converted strand
(i.e. they have C->T conversions) and reads2.fastq are all from the
reverse-complement (i.e. they have G->A conversions).  It requires GNU
parallel to be installed.  You are encouraged to customize this
script.
