ksw2c aligner (ksw2_cigar)

Documentation

Copyright (c) 2021, Huxelerate S.r.l. All rights reserved.

Provides a set of APIs to interact with the HUGenomic ksw2_cigar core that performs accelerated sequence alignment on FPGA.

The core functionality is based on the ksw2_extd method from the ksw2 library: https://github.com/lh3/ksw2

The core computes the optimal scores and the CIGAR for global and extension alignment of two sequences using dual gap affine cost function and a custom scoring matrix. The core also supports banded alignment and the z-drop heuristics: https://doi.org/10.1093/bioinformatics/bty191

Before running the mothods of this library, you must ensure that the target FPGA is properly configured with the ksw2_cigar core through the fpga_configure function.

Note

If you only need the optimal scores but not the CIGAR it is recommended to use the ksw2_score core since it provides better performance.

namespace hugenomic

Typedefs

typedef uint32_t ksw2c_cigar_op

The data type used for storing a CIGAR operation within a CIGAR (e.g. 10M, 2I, 4D). A CIGAR operation encodes both the operation type and its length.

Note

We use the same encoding of the ksw2 library.

Functions

char ksw2c_get_cigar_op_type(ksw2c_cigar_op cigar_op)

Retrieves the operation type of a CIGAR operation.

Return

The operation type, can be one of:

  • ’M’: match/mismatch

  • ’I’: insertion

  • ’D’: deletion

  • ’U’: undefined (only if the cigar operation is incorrect)

Parameters
  • cigar_op: The cigar operation

uint32_t ksw2c_get_cigar_op_length(ksw2c_cigar_op cigar_op)

Retrieves the length of a CIGAR operation.

Return

The length of the CIGAR operation

Parameters
  • cigar_op: The CIGAR operation

std::vector<ksw2c_response> ksw2c_batch(uint32_t fpga_id, const std::vector<std::pair<std::vector<uint8_t>, std::vector<uint8_t>>> &pairs_of_sequences, int8_t start_gap_cost, int8_t extend_gap_cost, int8_t start_gap_cost_2, int8_t extend_gap_cost_2, const std::vector<int8_t> scoring_matrix, int32_t bandwidth, int32_t zdrop, int32_t end_bonus, bool extension_only_cigar)

Performs a sequence alignment of one or more pairs of query and target sequences on a target FPGA and returns the alignment scores and the CIGAR for each pair of sequences.

The ksw2_cigar core computes at once the score of 4 different types of alignments. These alignments are: global alignment, global query alignment, global target alignment, extension alignment. For each alignment type, each pair of query and target sequences are aligned from their beginning up to an end position (query, target) that depends on the alignment type as follows:

  • global alignment (query_length - 1, target_length - 1)

  • global query alignment (query_length - 1, FREE)

  • global target alignment (FREE, target_length - 1)

  • extension alignment (FREE, FREE)

Where FREE means that the algorithm selects the position that maximizes the alignment score.

The core returns only the CIGAR of one alignment among: extension, global and global query alignment. To specify which CIGAR to return the user can use the extension_only_cigar and the end_bonus parameters. The following rules determine which CIGAR is returned:

  • global alignment CIGAR

    when extension_only_cigar = false

  • global query alignment CIGAR

    when extension_only_cigar = true AND global_query_max_score + end_bonus > extension_max_score

  • extension alignment CIGAR

    when extension_only_cigar = true AND global_query_max_score + end_bonus <= extension_max_score

However, if the alignment is z-dropped, the extension alignment CIGAR is returned by default.

The cost function for gaps is computed as the minimum of two affine gap functions using the following formula, where l is the length of the gap:

min{start_gap_cost + l * extend_gap_cost, start_gap_cost_2 + l * extend_gap_cost_2}

The cost function associated to a match/mismatch is provided by a scoring matrix. The scoring matrix is an M*M matrix encoded as a linear array that determines the score to assign when comparing an element q of the query to an element t of the target sequence. In details, the score is assigned as:

score = scoring_matrix[t * M + q]

The value M must be in the range [2, KSW2C_MAX_SCORING_MATRIX_SIZE].

Notice that the values stored in the query and target sequences must be between 0 and M-1 otherwise an std::invalid_argument will be thrown.

Furthermore each sequence cannot be longer than KSW2C_MAX_SEQUENCE_SIZE.

PERFORMANCE TIP: you should always try to maximize the number of pairs of sequences in order to obtain maximum performance.

Return

Vector of ksw2c_response containing the alignment result for all pairs of sequences

Parameters
  • fpga_id: The identifier of the target FPGA to use

  • pairs_of_sequences: The pairs of query and target sequences to align where the first element of each pair is a query sequence and the second element of each pair is a target sequence

  • start_gap_cost: The cost for starting a gap

  • extend_gap_cost: The cost for extending a gap

  • start_gap_cost_2: The cost for starting a gap (second gap function)

  • extend_gap_cost_2: The cost for extending a gap (second gap function)

  • scoring_matrix: Scoring matrix of size M*M encoded as a linear vector

  • bandwidth: The width of the band for banded alignment, use a negative number for full alignment

  • zdrop: The z value for the z-drop heuristic, use a negative number to avoid z-drop

  • end_bonus: The bonus score to determine if the CIGAR of global query alignment should be returned instead of the CIGAR for extension alignment when extension_only_cigar is true

  • extension_only_cigar: Whether to return the CIGAR for extension alignment/global query alignment (true) or the CIGAR for global alignment (false). If the alignment is z-dropped, the flag is not considered and the CIGAR for extension alignment is returned by default.

Exceptions

Variables

const uint64_t KSW2C_MAX_SEQUENCE_SIZE = 49152

The maximum supported sequence size

const uint64_t KSW2C_MAX_SCORING_MATRIX_SIZE = 64

The maximum supported size for the custom scoring matrix

const int32_t KSW2C_MAX_NOT_FOUND = -0x40000000

A special value assigned to a max score returned by the aligner when the max score is not found (such as when the alignment is z-dropped)

struct ksw2c_response
#include <ksw2_cigar.hpp>

A structure that defines the result of the execution of a ksw2_cigar alignment on two pair of sequences

Public Members

int32_t extension_max_score

Max score of extension alignment:

  • query from 0 to [X]

  • target from 0 to [Y]

can be equal to KSW2C_MAX_NOT_FOUND if not found

int32_t extension_query_end_position

Query end position ([X]) for extension alignment

int32_t extension_target_end_position

Target end position ([Y]) for extension alignment

int32_t global_query_max_score

Max score of global query alignment:

  • query from 0 to query_len - 1

  • target from 0 to [Y]

can be equal to KSW2C_MAX_NOT_FOUND if not found

int32_t global_query_target_end_position

Target end position ([Y]) for global query alignment

int32_t global_target_max_score

Max score of global target alignment:

  • query from 0 to [X]

  • target from 0 to target_len - 1

can be equal to KSW2C_MAX_NOT_FOUND if not found

int32_t global_target_query_end_position

Query end position ([X]) for global target alignment

int32_t global_max_score

Max score of global alignment:

  • query from 0 to query_len - 1

  • target from 0 to target_len - 1

can be equal to KSW2C_MAX_NOT_FOUND if not found

std::vector<ksw2c_cigar_op> cigar

The CIGAR of one of the following alignments:

  • global alignment: if extension_only_cigar=false

  • global query alignment: if extension_only_cigar=true and global_query_max_score + end_bonus > extension_max_score

  • extension alignment: if extension_only_cigar=true and global_query_max_score + end_bonus <= extension_max_score

Note

If the alignment has been z-dropped, the CIGAR of extension alignment is returned by default

bool global_query_cigar

Whether the CIGAR for global query alignment is returned

bool zdropped

Whether the alignment has been z-dropped

Usage examples

examples/ksw2_cigar/1_cigar_operations_counting

/** 
 * Copyright (c) 2021, Huxelerate S.r.l.
 * All rights reserved.
 *
 * Sample code for computing the average number of matches/mismatches, insertions and deletions
 * for global alignment between random DNA reads.
 */

#include <iostream>
#include <iomanip>
#include <string>
#include <sys/time.h>
#include <vector>
#include <utility>
#include "hugenomic.hpp"    // HUGenomic library

using namespace hugenomic;

// number of pairs of sequences to generate and their lengths
const unsigned int NUM_PAIRS = 48;
const size_t QUERY_SEQUENCES_LEN = 4000;
const size_t TARGET_SEQUENCES_LEN = 4000;

// scoring system used by the sequence alignment algorithm
const int START_GAP_COST = 4;
const int EXTEND_GAP_COST = 2;
const int START_GAP_COST_2 = 13;
const int EXTEND_GAP_COST_2 = 1;

// in this example we use 5x5 scoring matrix, where we assume the following 
// encoding for positions (rows/columns) within the matrix:
// 0 -> A,   1 -> C,   2 -> G,   3 -> T,   4 -> * 
std::vector<int8_t> SCORING_MATRIX = {
     3, -4, -4, -4,  0,
    -4,  3, -4, -4,  0,
    -4, -4,  3, -4,  0,
    -4, -4, -4,  3,  0,
     0,  0,  0,  0,  0
};

const int ZDROP = -1;       // we do not use the z-drop heuristics
const int BANDWIDTH = -1;   // we do full alignment (not banded)

const int END_BONUS = 0;
const int EXTENSION_ONLY_CIGAR = false;

// utility function to generate a random dna sequence with a given length
std::vector<uint8_t> random_dna_sequence(size_t length);

// utility function to retrieve the current time in microseconds
unsigned long get_time();

int main()
{
    // initialize one FPGA with the accelerated alignment core
    const int fpga_id = 0;
    fpga_configure(ksw2_cigar, fpga_id);

    // generate a random set of pairs of DNA reads 
    std::cout << "Generating " << NUM_PAIRS 
              << " random pairs of DNA reads, each pair with read lengths ("
              << QUERY_SEQUENCES_LEN << ", " << TARGET_SEQUENCES_LEN << ")" 
              << std::endl;

    srand(0);
    std::vector< std::pair< std::vector<uint8_t>, std::vector<uint8_t> > > sequences_pairs;

    for(int s = 0; s < NUM_PAIRS; s++) {
        std::vector<uint8_t> query_sequence = random_dna_sequence(QUERY_SEQUENCES_LEN);
        std::vector<uint8_t> target_sequence = random_dna_sequence(TARGET_SEQUENCES_LEN);
        sequences_pairs.push_back({query_sequence, target_sequence});
    }
    std::cout << "Generation complete!" << std::endl << std::endl;

    // run the alignment on each pair of random reads
    std::cout << "Running alignment..." << std::endl;

    int total_score = 0;
    int total_length = 0;

    int total_match_mismatch = 0;
    int total_deletion = 0;
    int total_insertion = 0;

    unsigned long start_compute_time = get_time();

    std::vector<ksw2c_response> alignment_results = ksw2c_batch(fpga_id, sequences_pairs,
        START_GAP_COST, EXTEND_GAP_COST, START_GAP_COST_2, EXTEND_GAP_COST_2,
        SCORING_MATRIX, BANDWIDTH, ZDROP, END_BONUS, EXTENSION_ONLY_CIGAR);
    
    unsigned long end_compute_time = get_time();

    for(const auto &alignment: alignment_results) {

        for(ksw2c_cigar_op cigar_op: alignment.cigar) {

            char cigar_op_type = ksw2c_get_cigar_op_type(cigar_op);
            uint32_t cigar_op_length = ksw2c_get_cigar_op_length(cigar_op);

            if(cigar_op_type == 'M')
                total_match_mismatch += cigar_op_length;
            else if(cigar_op_type == 'I')
                total_insertion += cigar_op_length;
            else if(cigar_op_type == 'D')
                total_deletion += cigar_op_length;
        }
    }
    

    // print the average number of matches/mismatches, insertions and deletions
    std::cout << std::fixed << std::setprecision(2);
    std::cout << "Alignment complete!" << std::endl << std::endl;
    std::cout << "Average number of matches/mismatches = " 
              << total_match_mismatch / (double) NUM_PAIRS << std::endl << std::endl;
    std::cout << "Average number of insertions = " 
              << total_insertion / (double) NUM_PAIRS << std::endl << std::endl;
    std::cout << "Average number of deletions = " 
              << total_deletion / (double) NUM_PAIRS << std::endl << std::endl;

    // measure execution time and performance
    double time_ms = (end_compute_time - start_compute_time) / 1000.0;
    size_t cell_updates = QUERY_SEQUENCES_LEN * TARGET_SEQUENCES_LEN * NUM_PAIRS;
    std::cout << "Alignment time: " << time_ms << " ms" << std::endl
              << "GCUPS (Giga Cell Update Per Second) =  " 
              << (cell_updates / time_ms / 1000000.0) << std::endl;

    // release the FPGA 
    fpga_release(fpga_id);
    return 0;
}


std::vector<uint8_t> random_dna_sequence(size_t length)
{
    std::vector<uint8_t> sequence;
    for(size_t i = 0; i < length; i++) {
        sequence.push_back( rand() % 5 );
    }

    return sequence;
}


unsigned long get_time()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000 * 1000 + tv.tv_usec;
}