ksw2 aligner

Documentation

namespace hugenomic

Copyright (c) 2021, Huxelerate S.r.l. All rights reserved.

namespace gpu
class ksw2
#include <ksw2.hpp>

Accelerated GPU implementation of a sequence aligner based on the ksw2_extd method from: https://github.com/lh3/ksw2

The aligner computes the optimal scores and the CIGAR for global and extension alignment of two sequences using dual gap affine cost function and a custom scoring matrix. The aligner also supports banded alignment and the z-drop heuristics: https://doi.org/10.1093/bioinformatics/bty191

Public Types

typedef uint32_t cigar_op

The data type used for storing a CIGAR operation within a CIGAR (e.g. 10M, 2I, 4D). A CIGAR operation encodes both the operation type and its length.

Note

We use the same encoding of the ksw2 library.

Public Functions

ksw2(uint32_t gpu_id, size_t gpu_memory)

Instantiate a ksw2 object to align sequences on a given GPU.

Parameters
  • gpu_id: The identifier of the target GPU to use

  • gpu_memory: The amount of GPU memory in bytes to use, if 0 all the free GPU memory will be used

Exceptions
  • std::invalid_argument: If an invalid GPU identifier is specified or if the specified GPU memory is not enough or exceeds the free GPU memory

~ksw2()

Release all resources and the allocated GPU memory.

std::vector<response> align_batch(const std::vector<std::pair<std::vector<uint8_t>, std::vector<uint8_t>>> &pairs_of_sequences, int8_t start_gap_cost, int8_t extend_gap_cost, int8_t start_gap_cost_2, int8_t extend_gap_cost_2, const std::vector<int8_t> scoring_matrix, int32_t bandwidth, int32_t zdrop, int32_t end_bonus, bool extension_only_cigar)

Performs a sequence alignment of one or more pairs of query and target sequences and returns the alignment scores and the CIGAR for each pair of sequences.

The function computes at once the score of 4 different types of alignments. These alignments are: global alignment, global query alignment, global target alignment, extension alignment. For each alignment type, each pair of query and target sequences are aligned from their beginning up to an end position (query, target) that depends on the alignment type as follows:

  • global alignment (query_length - 1, target_length - 1)

  • global query alignment (query_length - 1, FREE)

  • global target alignment (FREE, target_length - 1)

  • extension alignment (FREE, FREE)

Where FREE means that the algorithm selects the position that maximizes the alignment score.

The function returns only the CIGAR of one alignment among: extension, global and global query alignment. To specify which CIGAR to return the user can use the extension_only_cigar and the end_bonus parameters. The following rules determine which CIGAR is returned:

  • global alignment CIGAR

    when extension_only_cigar = false

  • global query alignment CIGAR

    when extension_only_cigar = true AND global_query_max_score + end_bonus > extension_max_score

  • extension alignment CIGAR

    when extension_only_cigar = true AND global_query_max_score + end_bonus <= extension_max_score

However, if the alignment is z-dropped, the extension alignment CIGAR is returned by default.

The cost function for gaps is computed as the minimum of two affine gap functions using the following formula, where l is the length of the gap:

min{start_gap_cost + l * extend_gap_cost, start_gap_cost_2 + l * extend_gap_cost_2}

The cost function associated to a match/mismatch is provided by a scoring matrix. The scoring matrix is an M*M matrix encoded as a linear array that determines the score to assign when comparing an element q of the query to an element t of the target sequence. In details, the score is assigned as:

score = scoring_matrix[t * M + q]

The value M must be in the range [2, MAX_SCORING_MATRIX_SIZE].

Notice that the values stored in the query and target sequences must be between 0 and M-1 otherwise an std::invalid_argument will be thrown.

PERFORMANCE TIP: you should always try to maximize the number of pairs of sequences in order to obtain maximum performance.

Return

Vector of ksw2::response containing the alignment result for all pairs of sequences

Parameters
  • pairs_of_sequences: The pairs of query and target sequences to align where the first element of each pair is a query sequence and the second element of each pair is a target sequence

  • start_gap_cost: The cost for starting a gap

  • extend_gap_cost: The cost for extending a gap

  • start_gap_cost_2: The cost for starting a gap (second gap function)

  • extend_gap_cost_2: The cost for extending a gap (second gap function)

  • scoring_matrix: Scoring matrix of size M*M encoded as a linear vector

  • bandwidth: The width of the band for banded alignment, use a negative number for full alignment

  • zdrop: The z value for the z-drop heuristic, use a negative number to avoid z-drop

  • end_bonus: The bonus score to determine if the CIGAR of global query alignment should be returned instead of the CIGAR for extension alignment when extension_only_cigar is true

  • extension_only_cigar: Whether to return the CIGAR for extension alignment/global query alignment (true) or the CIGAR for global alignment (false). If the alignment is z-dropped, the flag is not considered and the CIGAR for extension alignment is returned by default.

Exceptions
  • std::invalid_argument: In case of invalid sequences or scoring matrix

std::vector<response> align_batch_indirect_storage(const std::vector<const uint8_t*> &query_pointers, const std::vector<uint32_t> &query_sizes, const std::vector<const uint8_t*> &target_pointers, const std::vector<uint32_t> &target_sizes, int8_t start_gap_cost, int8_t extend_gap_cost, int8_t start_gap_cost_2, int8_t extend_gap_cost_2, const std::vector<int8_t> scoring_matrix, int32_t bandwidth, int32_t zdrop, int32_t end_bonus, bool extension_only_cigar)

This method provides the same functionality of ksw2::align_batch but supports a more flexible data storage for the input sequences.

Each sequence is provided as input using a memory pointer to the sequence and its size.

For details on the functionality of this method refer to ksw2::align_batch.

Return

Vector of ksw2::response containing the alignment result for all pairs of sequences

Parameters
  • query_pointers: Vector of memory pointers to each query sequence

  • query_sizes: Vector of query sizes

  • target_pointers: Vector of memory pointers to each target sequence

  • target_sizes: Vector of target sizes

  • start_gap_cost: The cost for starting a gap

  • extend_gap_cost: The cost for extending a gap

  • start_gap_cost_2: The cost for starting a gap (second gap function)

  • extend_gap_cost_2: The cost for extending a gap (second gap function)

  • scoring_matrix: Scoring matrix of size M*M encoded as a linear vector

  • bandwidth: The width of the band for banded alignment, use a negative number for full alignment

  • zdrop: The z value for the z-drop heuristic, use a negative number to avoid z-drop

  • end_bonus: The bonus score to determine if the CIGAR of global query alignment should be returned instead of the CIGAR for extension alignment when extension_only_cigar is true

  • extension_only_cigar: Whether to return the CIGAR for extension alignment/global query alignment (true) or the CIGAR for global alignment (false). If the alignment is z-dropped, the flag is not considered and the CIGAR for extension alignment is returned by default.

Exceptions
  • std::invalid_argument: In case of invalid sequences, invalid scoring matrix or mismatching sizes for query and target vectors

Public Static Functions

static inline char get_cigar_op_type(cigar_op op)

Retrieves the operation type of a CIGAR operation.

Return

The operation type, can be one of:

  • ’M’: match/mismatch

  • ’I’: insertion

  • ’D’: deletion

  • ’U’: undefined (only if the cigar operation is incorrect)

Parameters
  • op: The cigar operation

static inline uint32_t get_cigar_op_length(cigar_op cigar_op)

Retrieves the length of a CIGAR operation.

Return

The length of the CIGAR operation

Parameters
  • cigar_op: The CIGAR operation

Public Static Attributes

static const uint64_t MAX_SCORING_MATRIX_SIZE = 64

The maximum supported size for the custom scoring matrix

static const int32_t MAX_NOT_FOUND = -0x40000000

A special value assigned to a max score returned by the aligner when the max score is not found (such as when the alignment is z-dropped)

Private Members

std::experimental::propagate_const<std::unique_ptr<impl>> pimpl
struct response
#include <ksw2.hpp>

The result of an alignment on two pair of sequences

Public Members

int32_t extension_max_score

Max score of extension alignment:

  • query from 0 to [X]

  • target from 0 to [Y]

can be equal to MAX_NOT_FOUND if not found

int32_t extension_query_end_position

Query end position ([X]) for extension alignment

int32_t extension_target_end_position

Target end position ([Y]) for extension alignment

int32_t global_query_max_score

Max score of global query alignment:

  • query from 0 to query_len - 1

  • target from 0 to [Y]

can be equal to MAX_NOT_FOUND if not found

int32_t global_query_target_end_position

Target end position ([Y]) for global query alignment

int32_t global_target_max_score

Max score of global target alignment:

  • query from 0 to [X]

  • target from 0 to target_len - 1

can be equal to MAX_NOT_FOUND if not found

int32_t global_target_query_end_position

Query end position ([X]) for global target alignment

int32_t global_max_score

Max score of global alignment:

  • query from 0 to query_len - 1

  • target from 0 to target_len - 1

can be equal to MAX_NOT_FOUND if not found

std::vector<cigar_op> cigar

The CIGAR of one of the following alignments:

  • global alignment: if extension_only_cigar=false

  • global query alignment: if extension_only_cigar=true and global_query_max_score + end_bonus > extension_max_score

  • extension alignment: if extension_only_cigar=true and global_query_max_score + end_bonus <= extension_max_score

Note

If the alignment has been z-dropped, the CIGAR of extension alignment is returned by default

bool global_query_cigar

Whether the CIGAR for global query alignment is returned

bool zdropped

Whether the alignment has been z-dropped

Usage examples

examples/ksw2/1_cigar_operations_counting

/** 
 * Copyright (c) 2021, Huxelerate S.r.l.
 * All rights reserved.
 *
 * Sample code for computing the average number of matches/mismatches, insertions and deletions
 * for global alignment between random DNA reads.
 */

#include <iostream>
#include <iomanip>
#include <string>
#include <sys/time.h>
#include <vector>
#include <utility>
#include "ksw2.hpp"    // HUGenomic library

using namespace hugenomic::gpu;

// number of pairs of sequences to generate and their lengths
const unsigned int NUM_PAIRS = 200000;
const size_t QUERY_SEQUENCES_LEN = 256;
const size_t TARGET_SEQUENCES_LEN = 256;

// scoring system used by the sequence alignment algorithm
const int START_GAP_COST = 4;
const int EXTEND_GAP_COST = 2;
const int START_GAP_COST_2 = 13;
const int EXTEND_GAP_COST_2 = 1;

// in this example we use 5x5 scoring matrix, where we assume the following 
// encoding for positions (rows/columns) within the matrix:
// 0 -> A,   1 -> C,   2 -> G,   3 -> T,   4 -> * 
std::vector<int8_t> SCORING_MATRIX = {
     3, -4, -4, -4,  0,
    -4,  3, -4, -4,  0,
    -4, -4,  3, -4,  0,
    -4, -4, -4,  3,  0,
     0,  0,  0,  0,  0
};

const int ZDROP = -1;       // we do not use the z-drop heuristics
const int BANDWIDTH = -1;   // we do full alignment (not banded)

const int END_BONUS = 0;
const int EXTENSION_ONLY_CIGAR = false;

// utility function to generate a random dna sequence with a given length
std::vector<uint8_t> random_dna_sequence(size_t length);

// utility function to retrieve the current time in microseconds
unsigned long get_time();

int main()
{
    // initialize the ksw2 aligner
    const int gpu_id = 0;
    ksw2 aligner(gpu_id, 0);

    // generate a random set of pairs of DNA reads 
    std::cout << "Generating " << NUM_PAIRS 
              << " random pairs of DNA reads, each pair with read lengths ("
              << QUERY_SEQUENCES_LEN << ", " << TARGET_SEQUENCES_LEN << ")" 
              << std::endl;

    srand(0);
    std::vector< std::pair< std::vector<uint8_t>, std::vector<uint8_t> > > sequences_pairs;

    for(unsigned int s = 0; s < NUM_PAIRS; s++) {
        std::vector<uint8_t> query_sequence = random_dna_sequence(QUERY_SEQUENCES_LEN);
        std::vector<uint8_t> target_sequence = random_dna_sequence(TARGET_SEQUENCES_LEN);
        sequences_pairs.push_back({query_sequence, target_sequence});
    }
    std::cout << "Generation complete!" << std::endl << std::endl;

    // run the alignment on each pair of random reads
    std::cout << "Running alignment..." << std::endl;

    int total_match_mismatch = 0;
    int total_deletion = 0;
    int total_insertion = 0;

    unsigned long start_compute_time = get_time();

    std::vector<ksw2::response> alignment_results = aligner.align_batch(sequences_pairs,
        START_GAP_COST, EXTEND_GAP_COST, START_GAP_COST_2, EXTEND_GAP_COST_2,
        SCORING_MATRIX, BANDWIDTH, ZDROP, END_BONUS, EXTENSION_ONLY_CIGAR);
    
    unsigned long end_compute_time = get_time();

    for(const auto &alignment: alignment_results) {

        for(ksw2::cigar_op cigar_op: alignment.cigar) {

            char cigar_op_type = ksw2::get_cigar_op_type(cigar_op);
            uint32_t cigar_op_length = ksw2::get_cigar_op_length(cigar_op);

            if(cigar_op_type == 'M')
                total_match_mismatch += cigar_op_length;
            else if(cigar_op_type == 'I')
                total_insertion += cigar_op_length;
            else if(cigar_op_type == 'D')
                total_deletion += cigar_op_length;
        }
    }
    

    // print the average number of matches/mismatches, insertions and deletions
    std::cout << std::fixed << std::setprecision(2);
    std::cout << "Alignment complete!" << std::endl << std::endl;
    std::cout << "Average number of matches/mismatches = " 
              << total_match_mismatch / (double) NUM_PAIRS << std::endl << std::endl;
    std::cout << "Average number of insertions = " 
              << total_insertion / (double) NUM_PAIRS << std::endl << std::endl;
    std::cout << "Average number of deletions = " 
              << total_deletion / (double) NUM_PAIRS << std::endl << std::endl;

    // measure execution time and performance
    double time_ms = (end_compute_time - start_compute_time) / 1000.0;
    size_t cell_updates = QUERY_SEQUENCES_LEN * TARGET_SEQUENCES_LEN * NUM_PAIRS;
    std::cout << "Alignment time: " << time_ms << " ms" << std::endl
              << "GCUPS (Giga Cell Update Per Second) =  " 
              << (cell_updates / time_ms / 1000000.0) << std::endl;

    return 0;
}


std::vector<uint8_t> random_dna_sequence(size_t length)
{
    std::vector<uint8_t> sequence;
    for(size_t i = 0; i < length; i++) {
        sequence.push_back( rand() % 5 );
    }

    return sequence;
}


unsigned long get_time()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000 * 1000 + tv.tv_usec;
}