Function debruijn::filter::filter_kmers
pub fn filter_kmers<K: Kmer, V: Vmer, D1: Clone, DS, S: KmerSummarizer<D1, DS>>(
seqs: &[(V, Exts, D1)],
summarizer: &dyn Deref<Target = S>,
stranded: bool,
report_all_kmers: bool,
memory_size: usize
) -> (BoomHashMap2<K, Exts, DS>, Vec<K>) where
DS: Debug,
Process DNA sequences into kmers and determine the set of valid kmers,
their extensions, and summarize associated label/'color' data. The input
sequences are converted to kmers of type K, and like kmers are grouped together.
All instances of each kmer, along with their label data, are passed to
summarizer, an implementation of the KmerSummarizer trait, which decides if
the kmer is 'valid' by an arbitrary predicate of the kmer data, and
summarizes the individual labels into a single label data structure
for the kmer. Care is taken to keep the memory consumption small.
Less than 4G of temporary memory should be allocated to hold intermediate kmers.
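The group-then-summarize flow described above can be illustrated with a minimal, standard-library-only sketch. It is not the crate's implementation: the names filter_kmers_sketch and summarize are hypothetical, kmers are plain strings rather than a packed type K, labels are single u8 values, extensions are ignored, and the input is treated as stranded. The validity predicate used here (a minimum observation count) is just one possible choice.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a summarizer: a kmer is valid when it was
/// observed at least `min_count` times; the summary is the sorted set of
/// distinct labels seen with that kmer.
fn summarize(labels: &[u8], min_count: usize) -> (bool, Vec<u8>) {
    let mut merged: Vec<u8> = labels.to_vec();
    merged.sort_unstable();
    merged.dedup();
    (labels.len() >= min_count, merged)
}

/// Split each labeled sequence into kmers of length `k`, group equal kmers,
/// and keep only those the summarizer declares valid.
fn filter_kmers_sketch(
    seqs: &[(&str, u8)],
    k: usize,
    min_count: usize,
) -> BTreeMap<String, Vec<u8>> {
    let mut groups: BTreeMap<String, Vec<u8>> = BTreeMap::new();
    for &(seq, label) in seqs {
        // Sequences shorter than k contribute no kmers.
        if k == 0 || seq.len() < k {
            continue;
        }
        for i in 0..=seq.len() - k {
            groups
                .entry(seq[i..i + k].to_string())
                .or_default()
                .push(label);
        }
    }
    groups
        .into_iter()
        .filter_map(|(kmer, labels)| {
            let (valid, summary) = summarize(&labels, min_count);
            if valid {
                Some((kmer, summary))
            } else {
                None
            }
        })
        .collect()
}
```

For example, calling filter_kmers_sketch(&[("ACGTA", 1), ("CGTAC", 2)], 3, 2) keeps only CGT and GTA, each summarized with the labels 1 and 2, because ACG and TAC are observed once.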
Arguments
seqs: a slice of (sequence, extensions, data) tuples. Each tuple represents an input sequence. The input sequence must implement Vmer<K>. The data slot is an arbitrary data structure labeling the input sequence. If complete sequences are passed in, the extensions entry should be set to Exts::empty(). In sharded DBG construction (for example, when using minimizer-based partitioning of the input strings), the input sequence is a sub-string of the original input string. In this case the extensions of the sub-string in the original string should be passed in the extensions.
summarizer: an implementation of KmerSummarizer<D1,DS> that decides whether a kmer is valid (e.g. based on the number of observations of the kmer), and summarizes the data about the individual kmer observations. See CountFilter and CountFilterSet for examples.
stranded: if true, preserve the strandedness of the input sequences, effectively assuming they are all in the positive strand. If false, the kmers will be canonicalized to the lexicographic minimum of the kmer and its reverse complement.
report_all_kmers: if true, returns the vector of all the observed kmers and performs the kmer-based filtering.
memory_size: gives the size bound, in GB, on the memory to use; the number of passes needed is determined automatically from it.
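The effect of the stranded flag can be illustrated with a small standard-library sketch. It assumes uppercase ASCII DNA strings and uses the hypothetical helpers reverse_complement and canonical; the crate itself works on packed kmer types, so this only mirrors the rule stated above (take the lexicographic minimum of the kmer and its reverse complement when stranded is false).

```rust
/// Reverse complement of an uppercase DNA string (A<->T, C<->G).
/// Characters other than A, C, G, T are passed through unchanged.
fn reverse_complement(kmer: &str) -> String {
    kmer.chars()
        .rev()
        .map(|base| match base {
            'A' => 'T',
            'C' => 'G',
            'G' => 'C',
            'T' => 'A',
            other => other,
        })
        .collect()
}

/// Return the kmer unchanged when `stranded` is true; otherwise return the
/// lexicographically smaller of the kmer and its reverse complement.
fn canonical(kmer: &str, stranded: bool) -> String {
    if stranded {
        return kmer.to_string();
    }
    let rc = reverse_complement(kmer);
    if rc.as_str() < kmer {
        rc
    } else {
        kmer.to_string()
    }
}
```

With stranded set to false, canonical("CGT", false) and canonical("ACG", false) both give "ACG", so observations of a kmer on either strand fall into the same group. With stranded set to true, "CGT" and "ACG" stay distinct.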
Returns
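A tuple of a BoomHashMap2 object (check rust-boomphf for details) mapping each valid kmer to its extensions and summarized data, and a Vec<K> holding all the observed kmers when report_all_kmers is true.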