Sequence aligner (smith_waterman)

Documentation

Copyright (c) 2019, Huxelerate S.r.l. All rights reserved.

Provides a set of APIs to interact with the HUGenomic smith_waterman core that performs accelerated sequence alignment on FPGA.

The core supports local, global and semi-global (a.k.a. overlap) pairwise alignments with custom affine scoring system. For more information about pairwise alignments refer to the documentation of the Seqan library documentation: https://seqan.readthedocs.io/en/master/Tutorial/Algorithms/Alignment/PairwiseSequenceAlignment.html

Before running the mothods of this library, you must ensure that the target FPGA is properly configured with the smith_waterman core through the fpga_configure function.

namespace hugenomic

Functions

std::vector<sw_response> sw_batch(uint32_t fpga_id, const std::vector<std::pair<std::string, std::string>> &pairs_of_sequences, int32_t match_score, int32_t mismatch_score, int32_t extend_gap_score, int32_t start_gap_score, bool global_alignment, bool free_top, bool free_left, bool free_right, bool free_down)

Performs an optimal sequence alignment of one or more pairs of sequences on a target FPGA.

Sequences cannot be longer than SW_MAX_SEQUENCE_SIZE and the smaller sequence of a pair cannot exceed SW_MAX_SIZE_OF_SMALLER_SEQUENCE.

PERFORMANCE TIP: you should always try to maximize the number of pairs of sequences in order to obtain maximum performance.

Return

Vector of sw_response containing the alignment result for the pairs of sequences

Parameters
  • fpga_id: The identifier of the target FPGA to use

  • pairs_of_sequences: The pairs of horizontal and vertical sequences to align where the first element of each pair is a horizontal sequence and the second element of each pair is a vertical sequence

  • match_score: The score associated to a match

  • mismatch_score: The score associated to a mismatch

  • extend_gap_score: The score associated to a gap extension

  • start_gap_score: The score associated to starting a gap

  • global_alignment: Whether to perform global alignment (true) or local alignment (false)

  • free_top: Whether there is no penalty for skipping characters at the beginning of the vertical sequence (only for global alignment)

  • free_left: Whether there is no penalty for skipping characters at the beginning of the horizontal sequence (only for global alignment)

  • free_right: Whether there is no penalty for skipping characters at the end of the horizontal sequence (only for global alignment)

  • free_down: Whether there is no penalty for skipping characters at the end of the vertical sequence (only for global alignment)

Exceptions
  • std::invalid_argument: if a sequence is empty or a pair of sequences exceeds the size constraints

  • invalid_fpga_configuration: if the target FPGA is not configured with the smith_waterman core

  • invalid_fpga_identifier: if an invalid FPGA identifier is specified

std::vector<sw_response> sw_all_to_all(uint32_t fpga_id, const std::vector<std::string> &horizontal_sequences, const std::vector<std::string> &vertical_sequences, int32_t match_score, int32_t mismatch_score, int32_t extend_gap_score, int32_t start_gap_score, bool global_alignment, bool free_top, bool free_left, bool free_right, bool free_down)

Runs an all-to-all optimal sequence alignment on a given FPGA.

The method receives as input two vector of std::string containing the pairs of horizontal and vertical sequences to align against each other and return a vector of sw_response of size num_horizontal_sequences X num_vertical_sequences containing the alignment results for all the pairs of sequences.

Sequences cannot be longer than SW_MAX_SEQUENCE_SIZE and the smaller sequence of a pair of horizontal and vertical sequences cannot exceed SW_MAX_SIZE_OF_SMALLER_SEQUENCE.

PERFORMANCE TIP: Whenever possible try to maximize the number of alignments to perform in a single function call.

Return

Vector of sw_response containing all the alignment results stored in row-wise format. row 0: from result[0] to result[1 * horizontal_sequences.size() - 1] row 1: from result[1 * horizontal_sequences.size()] to result[2 * horizontal_sequences.size() - 1] row 2: from result[2 * horizontal_sequences.size()] to result[3 * horizontal_sequences.size() - 1] …

Parameters
  • fpga_id: The identifier of the target FPGA

  • horizontal_sequences: Vector containing all the horizontal sequences

  • vertical_sequences: Vector containing all the vertical sequences

  • match_score: The score associated to a match

  • mismatch_score: The score associated to a mismatch

  • extend_gap_score: The score associated to a gap extension

  • start_gap_score: The score associated to starting a gap

  • global_alignment: Whether to perform global alignment (true) or local alignment (false)

  • free_top: Whether there is no penalty for skipping characters at the beginning of the vertical sequence (only for global alignment)

  • free_left: Whether there is no penalty for skipping characters at the beginning of the horizontal sequence (only for global alignment)

  • free_right: Whether there is no penalty for skipping characters at the end of the horizontal sequence (only for global alignment)

  • free_down: Whether there is no penalty for skipping characters at the end of the vertical sequence (only for global alignment)

Exceptions
  • std::invalid_argument: If there is an empty sequence or a pair of vertical and horizontal sequences exceed size constraints (see above)

  • invalid_fpga_configuration: If the target FPGA is not configured with the smith_waterman core

  • invalid_fpga_identifier: If an invalid FPGA identifier is specified

Variables

const uint64_t SW_MAX_SIZE_OF_SMALLER_SEQUENCE = 65535

The maximum supported size of the smaller among a pair of horizontal and vertical sequences

const uint64_t SW_MAX_SEQUENCE_SIZE = 715762091

The maximum supported size of a sequence

struct sw_response
#include <smith_waterman.hpp>

A structure that defines the result of the execution of the smith_waterman core on two pair of sequences

Public Members

int32_t alignment_score

the optimal alignment score between the two sequences

uint32_t vertical_sequence_end_position

The end position of the vertical sequence in the optimal alignment

uint32_t horizontal_sequence_end_position

The end position of the horizontal sequence in the optimal alignment

uint32_t vertical_sequence_start_position

The start position of the vertical sequence in the optimal alignment

uint32_t horizontal_sequence_start_position

The start position of the horizontal sequence in the optimal alignment

Usage examples

examples/smith_waterman/1_random_global_alignment

/** 
 * Copyright (c) 2019, Huxelerate S.r.l.
 * All rights reserved.
 *
 * Sample code for computing the average score of an overlap alignment between random DNA reads.
 */

#include <iostream>
#include <iomanip>
#include <string>
#include <sys/time.h>
#include <vector>
#include <utility>
#include "hugenomic.hpp"    // HUGenomic library

using namespace hugenomic;

// number of pairs of sequences to generate and their lengths
const unsigned int NUM_PAIRS = 128;
const size_t HORIZONTAL_SEQUENCES_LEN = 2000;
const size_t VERTICAL_SEQUENCES_LEN = 2000;

// scoring system used by the sequence alignment algorithm
const int MATCH_SCORE = 1;
const int MISMATCH_PENALTY = -1;
const int EXTEND_GAP_PENALTY = -1;
const int OPEN_GAP_PENALTY = -1;

// specifies an overlap alignment: a global alignment with no penalties for start/end gaps
const bool IS_GLOBAL_ALIGNMENT = true;
const bool FREE_TOP = true;
const bool FREE_LEFT = true;
const bool FREE_RIGHT = true;
const bool FREE_DOWN = true;

// utility function to generate a random dna sequence with a given length
std::string random_dna_sequence(size_t length);

// utility function to retrieve the current time in microseconds
unsigned long get_time();

int main()
{
    // initialize one FPGA with the accelerated alignment core
    const int fpga_id = 0;
    fpga_configure(smith_waterman, fpga_id);

    // generate a random set of pairs of DNA reads 
    std::cout << "Generating " << NUM_PAIRS 
              << " random pairs of DNA reads, each pair with read lengths ("
              << HORIZONTAL_SEQUENCES_LEN << ", " << VERTICAL_SEQUENCES_LEN << ")" 
              << std::endl;

    srand(0);
    std::vector< std::pair< std::string, std::string > > sequences_pairs;

    for(int s = 0; s < NUM_PAIRS; s++) {
        std::string horizontal_sequence = random_dna_sequence(HORIZONTAL_SEQUENCES_LEN);
        std::string vertical_sequence = random_dna_sequence(VERTICAL_SEQUENCES_LEN);
        sequences_pairs.push_back({horizontal_sequence, vertical_sequence});
    }
    std::cout << "Generation complete!" << std::endl << std::endl;

    // run the overlap alignment on each pair of random reads
    std::cout << "Running overlap alignment..." << std::endl;

    int total_score = 0;

    unsigned long start_compute_time = get_time();

    std::vector<sw_response> alignment_results = sw_batch(fpga_id, sequences_pairs,
        MATCH_SCORE, MISMATCH_PENALTY, EXTEND_GAP_PENALTY, OPEN_GAP_PENALTY, 
        FREE_TOP, FREE_LEFT, FREE_RIGHT, FREE_DOWN, IS_GLOBAL_ALIGNMENT);

    for(int s = 0; s < NUM_PAIRS; s++) {
        total_score += alignment_results[s].alignment_score;
    }

    unsigned long end_compute_time = get_time();

    // print the average score for a random overlap alignment
    std::cout << std::fixed << std::setprecision(2);
    std::cout << "Alignment complete!" << std::endl << std::endl;
    std::cout << "Average overlap alignment score = " 
              << total_score / (double) NUM_PAIRS << std::endl << std::endl;

    // measure execution time and performance
    double time_ms = (end_compute_time - start_compute_time) / 1000.0;
    size_t cell_updates = HORIZONTAL_SEQUENCES_LEN * VERTICAL_SEQUENCES_LEN * NUM_PAIRS;
    std::cout << "Alignment time: " << time_ms << " ms" << std::endl
              << "GCUPS (Giga Cell Update Per Second) =  " 
              << (cell_updates / time_ms / 1000000.0) << std::endl;

    // release the FPGA 
    fpga_release(fpga_id);
    return 0;
}


std::string random_dna_sequence(size_t length)
{
    char possible_characters[4] = {'A', 'C', 'T', 'G'};

    std::string sequence;
    for(size_t i = 0; i < length; i++) {
        sequence += possible_characters[rand() % 4];
    }

    return sequence;
}


unsigned long get_time()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000 * 1000 + tv.tv_usec;
}

examples/smith_waterman/2_all_to_all_alignment

/** 
 * Copyright (c) 2019, Huxelerate S.r.l.
 * All rights reserved.
 *
 * Sample code for computing the scores of the all-to-all alignment of random DNA sequences.
 */

#include <iostream>
#include <iomanip>
#include <sstream>
#include <sys/time.h>
#include "hugenomic.hpp"    // HUGenomic library

using namespace hugenomic;

// number of horizontal and vertical sequences and their lengths
const unsigned int NUM_HORIZONTAL_SEQUENCES = 11;
const unsigned int NUM_VERTICAL_SEQUENCES = 32;
const size_t HORIZONTAL_SEQUENCES_LEN = 2000;
const size_t VERTICAL_SEQUENCES_LEN = 2000;

// scoring system used by the sequence alignment algorithm
const int MATCH_SCORE = 1;
const int MISMATCH_PENALTY = -1;
const int EXTEND_GAP_PENALTY = -1;
const int OPEN_GAP_PENALTY = -1;

// specifies an overlap alignment: a global alignment with no penalties for start/end gaps
const bool IS_GLOBAL_ALIGNMENT = true;
const bool FREE_TOP = true;
const bool FREE_LEFT = true;
const bool FREE_RIGHT = true;
const bool FREE_DOWN = true;

// utility function to generate a random dna sequence with a given length
std::string random_dna_sequence(size_t length);

// utility function to retrieve the current time in microseconds
unsigned long get_time();

int main()
{
    // initializes one FPGA with the accelerated alignment core
    const int fpga_id = 0;
    fpga_configure(smith_waterman, fpga_id);

    // generates random DNA sequences
    std::vector<std::string> horizontal_sequences;
    std::vector<std::string> vertical_sequences;

    srand(0);
    std::cout << "Generating " << NUM_HORIZONTAL_SEQUENCES << " horizontal sequences each "
              << "with length " << HORIZONTAL_SEQUENCES_LEN << std::endl;
    
    for(int s = 0; s < NUM_HORIZONTAL_SEQUENCES; s++) {
        horizontal_sequences.push_back(random_dna_sequence(HORIZONTAL_SEQUENCES_LEN));
    }

    std::cout << "Generating " << NUM_VERTICAL_SEQUENCES << " vertical sequences each "
              << "with length " << VERTICAL_SEQUENCES_LEN << std::endl;

    for(int s = 0; s < NUM_VERTICAL_SEQUENCES; s++) {
        vertical_sequences.push_back(random_dna_sequence(VERTICAL_SEQUENCES_LEN));
    }

    std::cout << "Generation complete!" << std::endl << std::endl;


    // run the all-to-all alignment
    std::cout << "Running all-to-all overlap alignment..." << std::endl;

    unsigned long start_compute_time = get_time();

    std::vector<sw_response> alignment_results = sw_all_to_all(fpga_id,
        horizontal_sequences, vertical_sequences, 
        MATCH_SCORE, MISMATCH_PENALTY, EXTEND_GAP_PENALTY, OPEN_GAP_PENALTY, 
        FREE_TOP, FREE_LEFT, FREE_RIGHT, FREE_DOWN, IS_GLOBAL_ALIGNMENT);

    unsigned long end_compute_time = get_time();


    // print the matrix of scores for the all-to-all alignments
    std::cout << std::fixed << std::setprecision(2);
    std::cout << "All-to-all alignment complete!" << std::endl << std::endl;

    for(int i = 0; i < NUM_VERTICAL_SEQUENCES; i++){
        for(int j = 0; j < NUM_HORIZONTAL_SEQUENCES; j++){
            std::cout << alignment_results[i * NUM_HORIZONTAL_SEQUENCES + j].alignment_score << "\t";
        }
        std::cout << std::endl;
    }

    std::cout << std::endl;

    // measures execution time and performance
    double time = end_compute_time - start_compute_time;
    std::cout << "Alignment time: " << time / 1000.0 << " ms" << std::endl
              << "GCUPS (Giga Cell Update Per Second) =  " 
              << ((NUM_VERTICAL_SEQUENCES * VERTICAL_SEQUENCES_LEN * 
                    NUM_HORIZONTAL_SEQUENCES * HORIZONTAL_SEQUENCES_LEN) / time / 1000.0)
              << std::endl;

    // release the FPGA 
    fpga_release(fpga_id);

    return 0;
}


std::string random_dna_sequence(size_t length)
{
    char possible_characters[4] = {'A', 'C', 'T', 'G'};

    std::string sequence;
    for(size_t i = 0; i < length; i++) {
        sequence += possible_characters[rand() % 4];
    }

    return sequence;
}


unsigned long get_time()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000 * 1000 + tv.tv_usec;
}