Converting position-sorted BAM files for use with streaming processors such as Salmon

Author: Shawn Baker, Software Engineer at Ocean Genomics This project has been made possible by the team at Ocean Genomics, and by a grant from the Chan Zuckerberg Initiative.


Problem

While many tools allow BAM files to be sorted by genomic position, tools that employ techniques such as online learning may require the records to be in a different order. In particular, for reads in a BAM file, Salmon assumes that:

  1. All the alignments for a particular read should come together before records for the next read appear.
  2. Alignments of reads should come in random order. (This is because observing all alignments of one gene before observing alignments from other genes will heavily bias the model during the learning phase.)

Position-sorted genomic BAM files violate both of these expectations. Alignments are sorted by the genomic position (violating expectation 2), and alignments for multimapping reads are scattered across the genome (violating expectation 1).

Converting coordinates with mudskipper

If you need to translate genomic coordinates to transcriptomtic coordinates, you can use the muskipper command. If you want to avoid producing genomic-aligned coordinates, you can have mudskipper sort BAM files by read and then deterministically shuffle every possible alignment either specifying the -u or --shuffle option to the bam2bam subcommand. The general format of this type invocation is:

mudskipper bam2bam --shuffle --bam <bam-file> --gtf <gtf-file> --out <output-file>

Alternatively, if you have a BAM file that doesn’t require coordinate translation but is position-sorted, you can invoke mudkskipper with the shuffle subcommand. An example would be:

mudskipper shuffle --bam <bam-file> --out <output-file>

Notes on CPU and memory usage

Specifying --shuffle to the bam2bam subcommand or directly using mudskipper shuffle requires auxillary storage space equal to the BAM file to implement the two-pass algorithm required to reorder the BAM records. If the BAM file does not fit into system memory, it will be bucketed into (bam_file_bytes / allowed_memory) file-based buckets. By default, this phase of mudskipper will only use 75% of maximum physical system memory. You can override this by specifying --max_mem_mb, which will then use whatever integer amount of memory you specify, in megabytes.

By default, mudskipper spawns threads equal to the number of logical cores available to it. If, for whatever reason, you want a different number of threads, you can pass --threads to either subcommand.