Sequence aligner (smith_waterman)¶
Documentation¶
Copyright (c) 2019, Huxelerate S.r.l. All rights reserved.
Provides a set of APIs to interact with the HUGenomic smith_waterman core that performs accelerated sequence alignment on FPGA.
The core supports local, global and semi-global (a.k.a. overlap) pairwise alignments with custom affine scoring system. For more information about pairwise alignments refer to the documentation of the Seqan library documentation: https://seqan.readthedocs.io/en/master/Tutorial/Algorithms/Alignment/PairwiseSequenceAlignment.html
Before running the mothods of this library, you must ensure that the target FPGA is properly configured with the smith_waterman core through the fpga_configure function.
-
namespace
hugenomic
Functions
-
std::vector<sw_response>
sw_batch
(uint32_t fpga_id, const std::vector<std::pair<std::string, std::string>> &pairs_of_sequences, int32_t match_score, int32_t mismatch_score, int32_t extend_gap_score, int32_t start_gap_score, bool global_alignment, bool free_top, bool free_left, bool free_right, bool free_down)¶ Performs an optimal sequence alignment of one or more pairs of sequences on a target FPGA.
Sequences cannot be longer than SW_MAX_SEQUENCE_SIZE and the smaller sequence of a pair cannot exceed SW_MAX_SIZE_OF_SMALLER_SEQUENCE.
PERFORMANCE TIP: you should always try to maximize the number of pairs of sequences in order to obtain maximum performance.
- Return
Vector of sw_response containing the alignment result for the pairs of sequences
- Parameters
fpga_id
: The identifier of the target FPGA to usepairs_of_sequences
: The pairs of horizontal and vertical sequences to align where the first element of each pair is a horizontal sequence and the second element of each pair is a vertical sequencematch_score
: The score associated to a matchmismatch_score
: The score associated to a mismatchextend_gap_score
: The score associated to a gap extensionstart_gap_score
: The score associated to starting a gapglobal_alignment
: Whether to perform global alignment (true) or local alignment (false)free_top
: Whether there is no penalty for skipping characters at the beginning of the vertical sequence (only for global alignment)free_left
: Whether there is no penalty for skipping characters at the beginning of the horizontal sequence (only for global alignment)free_right
: Whether there is no penalty for skipping characters at the end of the horizontal sequence (only for global alignment)free_down
: Whether there is no penalty for skipping characters at the end of the vertical sequence (only for global alignment)
- Exceptions
std::invalid_argument
: if a sequence is empty or a pair of sequences exceeds the size constraintsinvalid_fpga_configuration
: if the target FPGA is not configured with the smith_waterman coreinvalid_fpga_identifier
: if an invalid FPGA identifier is specified
-
std::vector<sw_response>
sw_all_to_all
(uint32_t fpga_id, const std::vector<std::string> &horizontal_sequences, const std::vector<std::string> &vertical_sequences, int32_t match_score, int32_t mismatch_score, int32_t extend_gap_score, int32_t start_gap_score, bool global_alignment, bool free_top, bool free_left, bool free_right, bool free_down)¶ Runs an all-to-all optimal sequence alignment on a given FPGA.
The method receives as input two vector of std::string containing the pairs of horizontal and vertical sequences to align against each other and return a vector of sw_response of size num_horizontal_sequences X num_vertical_sequences containing the alignment results for all the pairs of sequences.
Sequences cannot be longer than SW_MAX_SEQUENCE_SIZE and the smaller sequence of a pair of horizontal and vertical sequences cannot exceed SW_MAX_SIZE_OF_SMALLER_SEQUENCE.
PERFORMANCE TIP: Whenever possible try to maximize the number of alignments to perform in a single function call.
- Return
Vector of sw_response containing all the alignment results stored in row-wise format. row 0: from result[0] to result[1 * horizontal_sequences.size() - 1] row 1: from result[1 * horizontal_sequences.size()] to result[2 * horizontal_sequences.size() - 1] row 2: from result[2 * horizontal_sequences.size()] to result[3 * horizontal_sequences.size() - 1] …
- Parameters
fpga_id
: The identifier of the target FPGAhorizontal_sequences
: Vector containing all the horizontal sequencesvertical_sequences
: Vector containing all the vertical sequencesmatch_score
: The score associated to a matchmismatch_score
: The score associated to a mismatchextend_gap_score
: The score associated to a gap extensionstart_gap_score
: The score associated to starting a gapglobal_alignment
: Whether to perform global alignment (true) or local alignment (false)free_top
: Whether there is no penalty for skipping characters at the beginning of the vertical sequence (only for global alignment)free_left
: Whether there is no penalty for skipping characters at the beginning of the horizontal sequence (only for global alignment)free_right
: Whether there is no penalty for skipping characters at the end of the horizontal sequence (only for global alignment)free_down
: Whether there is no penalty for skipping characters at the end of the vertical sequence (only for global alignment)
- Exceptions
std::invalid_argument
: If there is an empty sequence or a pair of vertical and horizontal sequences exceed size constraints (see above)invalid_fpga_configuration
: If the target FPGA is not configured with the smith_waterman coreinvalid_fpga_identifier
: If an invalid FPGA identifier is specified
Variables
-
const uint64_t
SW_MAX_SIZE_OF_SMALLER_SEQUENCE
= 65535¶ The maximum supported size of the smaller among a pair of horizontal and vertical sequences
-
const uint64_t
SW_MAX_SEQUENCE_SIZE
= 715762091¶ The maximum supported size of a sequence
-
struct
sw_response
¶ - #include <smith_waterman.hpp>
A structure that defines the result of the execution of the smith_waterman core on two pair of sequences
Public Members
-
int32_t
alignment_score
¶ the optimal alignment score between the two sequences
-
uint32_t
vertical_sequence_end_position
¶ The end position of the vertical sequence in the optimal alignment
-
uint32_t
horizontal_sequence_end_position
¶ The end position of the horizontal sequence in the optimal alignment
-
uint32_t
vertical_sequence_start_position
¶ The start position of the vertical sequence in the optimal alignment
-
uint32_t
horizontal_sequence_start_position
¶ The start position of the horizontal sequence in the optimal alignment
-
int32_t
-
std::vector<sw_response>
Usage examples¶
examples/smith_waterman/1_random_global_alignment¶
/**
* Copyright (c) 2019, Huxelerate S.r.l.
* All rights reserved.
*
* Sample code for computing the average score of an overlap alignment between random DNA reads.
*/
#include <iostream>
#include <iomanip>
#include <string>
#include <sys/time.h>
#include <vector>
#include <utility>
#include "hugenomic.hpp" // HUGenomic library
using namespace hugenomic;
// number of pairs of sequences to generate and their lengths
const unsigned int NUM_PAIRS = 128;
const size_t HORIZONTAL_SEQUENCES_LEN = 2000;
const size_t VERTICAL_SEQUENCES_LEN = 2000;
// scoring system used by the sequence alignment algorithm
const int MATCH_SCORE = 1;
const int MISMATCH_PENALTY = -1;
const int EXTEND_GAP_PENALTY = -1;
const int OPEN_GAP_PENALTY = -1;
// specifies an overlap alignment: a global alignment with no penalties for start/end gaps
const bool IS_GLOBAL_ALIGNMENT = true;
const bool FREE_TOP = true;
const bool FREE_LEFT = true;
const bool FREE_RIGHT = true;
const bool FREE_DOWN = true;
// utility function to generate a random dna sequence with a given length
std::string random_dna_sequence(size_t length);
// utility function to retrieve the current time in microseconds
unsigned long get_time();
int main()
{
// initialize one FPGA with the accelerated alignment core
const int fpga_id = 0;
fpga_configure(smith_waterman, fpga_id);
// generate a random set of pairs of DNA reads
std::cout << "Generating " << NUM_PAIRS
<< " random pairs of DNA reads, each pair with read lengths ("
<< HORIZONTAL_SEQUENCES_LEN << ", " << VERTICAL_SEQUENCES_LEN << ")"
<< std::endl;
srand(0);
std::vector< std::pair< std::string, std::string > > sequences_pairs;
for(int s = 0; s < NUM_PAIRS; s++) {
std::string horizontal_sequence = random_dna_sequence(HORIZONTAL_SEQUENCES_LEN);
std::string vertical_sequence = random_dna_sequence(VERTICAL_SEQUENCES_LEN);
sequences_pairs.push_back({horizontal_sequence, vertical_sequence});
}
std::cout << "Generation complete!" << std::endl << std::endl;
// run the overlap alignment on each pair of random reads
std::cout << "Running overlap alignment..." << std::endl;
int total_score = 0;
unsigned long start_compute_time = get_time();
std::vector<sw_response> alignment_results = sw_batch(fpga_id, sequences_pairs,
MATCH_SCORE, MISMATCH_PENALTY, EXTEND_GAP_PENALTY, OPEN_GAP_PENALTY,
FREE_TOP, FREE_LEFT, FREE_RIGHT, FREE_DOWN, IS_GLOBAL_ALIGNMENT);
for(int s = 0; s < NUM_PAIRS; s++) {
total_score += alignment_results[s].alignment_score;
}
unsigned long end_compute_time = get_time();
// print the average score for a random overlap alignment
std::cout << std::fixed << std::setprecision(2);
std::cout << "Alignment complete!" << std::endl << std::endl;
std::cout << "Average overlap alignment score = "
<< total_score / (double) NUM_PAIRS << std::endl << std::endl;
// measure execution time and performance
double time_ms = (end_compute_time - start_compute_time) / 1000.0;
size_t cell_updates = HORIZONTAL_SEQUENCES_LEN * VERTICAL_SEQUENCES_LEN * NUM_PAIRS;
std::cout << "Alignment time: " << time_ms << " ms" << std::endl
<< "GCUPS (Giga Cell Update Per Second) = "
<< (cell_updates / time_ms / 1000000.0) << std::endl;
// release the FPGA
fpga_release(fpga_id);
return 0;
}
std::string random_dna_sequence(size_t length)
{
char possible_characters[4] = {'A', 'C', 'T', 'G'};
std::string sequence;
for(size_t i = 0; i < length; i++) {
sequence += possible_characters[rand() % 4];
}
return sequence;
}
unsigned long get_time()
{
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec * 1000 * 1000 + tv.tv_usec;
}
examples/smith_waterman/2_all_to_all_alignment¶
/**
* Copyright (c) 2019, Huxelerate S.r.l.
* All rights reserved.
*
* Sample code for computing the scores of the all-to-all alignment of random DNA sequences.
*/
#include <iostream>
#include <iomanip>
#include <sstream>
#include <sys/time.h>
#include "hugenomic.hpp" // HUGenomic library
using namespace hugenomic;
// number of horizontal and vertical sequences and their lengths
const unsigned int NUM_HORIZONTAL_SEQUENCES = 11;
const unsigned int NUM_VERTICAL_SEQUENCES = 32;
const size_t HORIZONTAL_SEQUENCES_LEN = 2000;
const size_t VERTICAL_SEQUENCES_LEN = 2000;
// scoring system used by the sequence alignment algorithm
const int MATCH_SCORE = 1;
const int MISMATCH_PENALTY = -1;
const int EXTEND_GAP_PENALTY = -1;
const int OPEN_GAP_PENALTY = -1;
// specifies an overlap alignment: a global alignment with no penalties for start/end gaps
const bool IS_GLOBAL_ALIGNMENT = true;
const bool FREE_TOP = true;
const bool FREE_LEFT = true;
const bool FREE_RIGHT = true;
const bool FREE_DOWN = true;
// utility function to generate a random dna sequence with a given length
std::string random_dna_sequence(size_t length);
// utility function to retrieve the current time in microseconds
unsigned long get_time();
int main()
{
// initializes one FPGA with the accelerated alignment core
const int fpga_id = 0;
fpga_configure(smith_waterman, fpga_id);
// generates random DNA sequences
std::vector<std::string> horizontal_sequences;
std::vector<std::string> vertical_sequences;
srand(0);
std::cout << "Generating " << NUM_HORIZONTAL_SEQUENCES << " horizontal sequences each "
<< "with length " << HORIZONTAL_SEQUENCES_LEN << std::endl;
for(int s = 0; s < NUM_HORIZONTAL_SEQUENCES; s++) {
horizontal_sequences.push_back(random_dna_sequence(HORIZONTAL_SEQUENCES_LEN));
}
std::cout << "Generating " << NUM_VERTICAL_SEQUENCES << " vertical sequences each "
<< "with length " << VERTICAL_SEQUENCES_LEN << std::endl;
for(int s = 0; s < NUM_VERTICAL_SEQUENCES; s++) {
vertical_sequences.push_back(random_dna_sequence(VERTICAL_SEQUENCES_LEN));
}
std::cout << "Generation complete!" << std::endl << std::endl;
// run the all-to-all alignment
std::cout << "Running all-to-all overlap alignment..." << std::endl;
unsigned long start_compute_time = get_time();
std::vector<sw_response> alignment_results = sw_all_to_all(fpga_id,
horizontal_sequences, vertical_sequences,
MATCH_SCORE, MISMATCH_PENALTY, EXTEND_GAP_PENALTY, OPEN_GAP_PENALTY,
FREE_TOP, FREE_LEFT, FREE_RIGHT, FREE_DOWN, IS_GLOBAL_ALIGNMENT);
unsigned long end_compute_time = get_time();
// print the matrix of scores for the all-to-all alignments
std::cout << std::fixed << std::setprecision(2);
std::cout << "All-to-all alignment complete!" << std::endl << std::endl;
for(int i = 0; i < NUM_VERTICAL_SEQUENCES; i++){
for(int j = 0; j < NUM_HORIZONTAL_SEQUENCES; j++){
std::cout << alignment_results[i * NUM_HORIZONTAL_SEQUENCES + j].alignment_score << "\t";
}
std::cout << std::endl;
}
std::cout << std::endl;
// measures execution time and performance
double time = end_compute_time - start_compute_time;
std::cout << "Alignment time: " << time / 1000.0 << " ms" << std::endl
<< "GCUPS (Giga Cell Update Per Second) = "
<< ((NUM_VERTICAL_SEQUENCES * VERTICAL_SEQUENCES_LEN *
NUM_HORIZONTAL_SEQUENCES * HORIZONTAL_SEQUENCES_LEN) / time / 1000.0)
<< std::endl;
// release the FPGA
fpga_release(fpga_id);
return 0;
}
std::string random_dna_sequence(size_t length)
{
char possible_characters[4] = {'A', 'C', 'T', 'G'};
std::string sequence;
for(size_t i = 0; i < length; i++) {
sequence += possible_characters[rand() % 4];
}
return sequence;
}
unsigned long get_time()
{
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec * 1000 * 1000 + tv.tv_usec;
}