pomoxis package¶

Submodules¶

pomoxis.assess_homopolymers module¶

class pomoxis.assess_homopolymers.AverageScore(cumulative_score=0, count=0)[source]¶

Bases: object

Keep track of a simple average of inputs.

property average_score¶: Return the average of the accumulated samples.

pomoxis.assess_homopolymers.analyse_counts(counts, prefix)[source]¶

pomoxis.assess_homopolymers.analyse_homopolymers(args)[source]¶

pomoxis.assess_homopolymers.combine_complementary_relative_lengths(data)[source]¶

pomoxis.assess_homopolymers.count_homopolymers(args)[source]¶

pomoxis.assess_homopolymers.counts_factory()[source]¶

pomoxis.assess_homopolymers.find_homopolymers(seq, min_length, alphabet='ACGT')[source]¶

Parameters

seq – the sequence to search
min_length – minimum hp length to find
alphabet – alphabet of bases to look for

Returns

pomoxis.assess_homopolymers.get_fraction_correct(counts)[source]¶: Calculate fraction correct vs length from counts.

pomoxis.assess_homopolymers.get_longest_homopolymer(seq, base)[source]¶

pomoxis.assess_homopolymers.get_next_aligned_base(align, seq_start)[source]¶

pomoxis.assess_homopolymers.get_query_coords(align_pairs, seq_start, hp_len, base)[source]¶

pomoxis.assess_homopolymers.get_query_region(query_sequence, query_start, query_end, base)[source]¶

pomoxis.assess_homopolymers.get_relative_lengths(counts)[source]¶

pomoxis.assess_homopolymers.load_counts(pkl)[source]¶

pomoxis.assess_homopolymers.main()[source]¶: Entry point for homopolymer accuracy counting.

pomoxis.assess_homopolymers.merge_counts(pkl_files)[source]¶

pomoxis.assess_homopolymers.plot_fraction_correct(df, fname, cols=('A', 'T', 'G', 'C'))[source]¶: Plot fraction correct vs length from counts.

pomoxis.assess_homopolymers.plot_relative_lengths(data, fname)[source]¶

pomoxis.assess_homopolymers.process_bam(input_file, out_fh, homo_len, score)[source]¶

pomoxis.assess_homopolymers.save_counts(counts, fp)[source]¶

pomoxis.bio module¶

pomoxis.bio.reverse_complement(seq)[source]¶

Reverse complement sequence.

Param: input sequence string.
Returns: reverse-complemented string.

pomoxis.bio.shotgun_library(fasta_file, mu, sigma, direction=(1, - 1))[source]¶

Generate random fragment sequences of a given input sequence

Parameters

seq – input sequence.
mu – mean fragment length.
sigma – stdv of fragment length.
direction – tuple represention direction of output sequences with respect to the input sequence.

Yields

sequence fragments.

Note

Could be made more efficient using buffers for random samples and handling cases separately.

pomoxis.catalogue_errors module¶

class pomoxis.catalogue_errors.AlignSeg(rname, qname, pairs, rlen)¶

Bases: tuple

property pairs¶: Alias for field number 2

property qname¶: Alias for field number 1

property rlen¶: Alias for field number 3

property rname¶: Alias for field number 0

class pomoxis.catalogue_errors.ClassifyErrorTest(methodName='runTest')[source]¶

Bases: unittest.case.TestCase

setUp()[source]¶: Hook method for setting up the test fixture before exercising it.

test_HP_complete_sub()[source]¶

test_HP_complex_sub()[source]¶

test_HP_del_join()[source]¶

test_HP_del_join2()[source]¶

test_HP_del_join3()[source]¶

test_HP_del_join4()[source]¶

test_HP_flk_sub_ext()[source]¶

test_HP_flk_sub_trc()[source]¶

test_HP_ins_split()[source]¶

test_HP_ins_split2()[source]¶

test_HP_ins_split3()[source]¶

test_HP_ins_split4()[source]¶

test_HP_ins_split5()[source]¶

test_HP_ins_split6()[source]¶

test_HP_messy_sub()[source]¶

test_HP_messy_sub_ext()[source]¶

test_HP_not_complete_sub()[source]¶

test_HP_sub_join()[source]¶

test_HP_sub_split()[source]¶

test_HP_sub_swap_ext()[source]¶

test_HP_sub_swap_trc()[source]¶

test_HP_sub_swap_trc2()[source]¶

test_HP_swap_sub_trc()[source]¶

test_complex_HP_del()[source]¶

test_fwd_repeat_ins()[source]¶

test_hp_del_2_0()[source]¶

test_hp_del_6_5()[source]¶

test_hp_ins()[source]¶

test_long_multi_del()[source]¶

test_long_multi_ins()[source]¶

test_multi_del()[source]¶

test_multi_ins()[source]¶

class pomoxis.catalogue_errors.Context(p_i, qb, rb)¶

Bases: tuple

property p_i¶: Alias for field number 0

property qb¶: Alias for field number 1

property rb¶: Alias for field number 2

class pomoxis.catalogue_errors.Error(rp, rname, qp, qname, ref, match, read, counts, klass, aggr_klass)¶

Bases: tuple

property aggr_klass¶: Alias for field number 9

property counts¶: Alias for field number 7

property klass¶: Alias for field number 8

property match¶: Alias for field number 5

property qname¶: Alias for field number 3

property qp¶: Alias for field number 2

property read¶: Alias for field number 6

property ref¶: Alias for field number 4

property rname¶: Alias for field number 1

property rp¶: Alias for field number 0

pomoxis.catalogue_errors.analyze_counts(counts, total_ref_length)[source]¶

pomoxis.catalogue_errors.are_adjacent(inds)[source]¶

“Check if all int indices in the interable are consecutive.

Parameters: inds – iterable of ints
Returns: bool

pomoxis.catalogue_errors.classify_error(context, indel_sizes=None)[source]¶

Classify error within an alignment.

Parameters: context – Context object
Indel_sizes: iterable of int, for binning indel sizes. indels >= to indel_sizes[0] will not be considered as HP splitting/joining indels
Returns: (str reference_context, str match_line, str query_context, dict counts of sub/ins/del within context)

pomoxis.catalogue_errors.classify_hp_indel(p_i, key, errors, runs1, seq2)[source]¶

Look for a specific kind of HP indel that splits or joins two HPs

Parameters

p_i – int, index of error
key – key of error type within errors (should be ‘ins’ or ‘del’)
errors – dict of error positions

Runs1

np.ndarray, rle encoding of sequence1 (query if deletions join two HPs, or ref if insertions split a HP

Seq2

iterable of str of sequence2 (ref if deletions join two HPs, or query if insertions split a HP.

Returns

str classification or None

pomoxis.catalogue_errors.classify_hp_sub(p_i, adjacent, errors, match_line, rb_runs, qb_runs, qp_is_hp, rp_is_hp)[source]¶

pomoxis.catalogue_errors.get_aggr_counts(total_counts)[source]¶

pomoxis.catalogue_errors.get_aggr_klass(klass)[source]¶

pomoxis.catalogue_errors.get_errors(aln, tree=None)[source]¶

Find positions of errors in an aligment.

Parameters

aln – iterable of AlignPos objects.
bed_file – path to .bed file of regions to include in analysis.
tree – intervaltree.IntervalTree object of regions to analyse.

Returns

( [(ri, qi, ‘error_type’, last_ri, last_qi)], aligned_ref_len) ri, qi: ref and query positions error_type: ‘D’, ‘I’ or ‘S’ last_ri, last_qi: ref and query positions of the last match aligned_ref_len: total aligned reference length (taking account of masking tree)

pomoxis.catalogue_errors.get_match_line_and_err_index(context)[source]¶

pomoxis.catalogue_errors.get_run(i, runs)[source]¶

Find run to which the i’th element belongs.

Parameters: i – int, element index wihin input to rle.
Returns: int, element index within runs to which i belongs.

pomoxis.catalogue_errors.is_in_hp(seq, p_i)[source]¶

pomoxis.catalogue_errors.main()[source]¶

pomoxis.catalogue_errors.plot_summary(df, outdir, prefix, ref_len)[source]¶: Create a plot showing Q-scores as largest remaining error klass is removed

pomoxis.catalogue_errors.preprocess_error(p, aln, search_by_q, offset=10)[source]¶

Parameters

p – int, position (ref position, or query position)
aln – iterable of AlignPos objects.
search_by_q – bool, whether to search by query position (typically done if qi is None).

Returns

Context object

pomoxis.catalogue_errors.qscore(d)[source]¶: Calculate a qscore

pomoxis.catalogue_errors.rle(it)[source]¶

Calculate a run length encoding (rle), of an input vector.

Parameters: it – iterable.
Returns: structured array with fields start, length, and value.

pomoxis.catalogue_errors.simple_klass(adjacent, n, err_type, indel_sizes)[source]¶

pomoxis.common_errors_from_bam module¶

pomoxis.common_errors_from_bam.count_errors(errors)[source]¶

pomoxis.common_errors_from_bam.get_errors(aln, ref_seq=None)[source]¶

pomoxis.common_errors_from_bam.get_qscores(counts, ref_len)[source]¶

pomoxis.common_errors_from_bam.main()[source]¶

pomoxis.common_errors_from_bam.scuff_ref(ref_seq, errors)[source]¶

pomoxis.coverage_from_bam module¶

pomoxis.coverage_from_bam.coverage_of_region(region, bam_fp, stride)[source]¶: Get coverage data for a region

pomoxis.coverage_from_bam.coverage_summary_of_region(*args)[source]¶

pomoxis.coverage_from_bam.main()[source]¶

pomoxis.find_indels module¶

pomoxis.find_indels.get_trimmed_pairs(aln)[source]¶

Trim aligned pairs to the alignment.

Parameters: aln – pysam.AlignedSegment object
Yields pairs

pomoxis.find_indels.main(arguments=None)[source]¶

pomoxis.qscores_from_summary module¶

pomoxis.qscores_from_summary.extract_vals(vals, name_map, col_ind)[source]¶

pomoxis.qscores_from_summary.get_ref_cover(lines, ref)[source]¶

pomoxis.qscores_from_summary.main()[source]¶

pomoxis.qscores_from_summary.qscore_from_summary(summaries, median, ref)[source]¶

pomoxis.ref_seqs_from_bam module¶

pomoxis.ref_seqs_from_bam.main()[source]¶

pomoxis.stats_from_bam module¶

Tabulate some simple alignment stats from sam.

pomoxis.stats_from_bam.main(arguments=None)[source]¶

pomoxis.stats_from_bam.masked_stats_from_aligned_read(read, references, lengths, tree)[source]¶

Create summary information for an aligned read over regions in bed file.

Parameters: read – pysam.AlignedSegment object

pomoxis.stats_from_bam.stats_from_aligned_read(read, references, lengths)[source]¶

Create summary information for an aligned read

Parameters: read – pysam.AlignedSegment object

pomoxis.subsample_bam module¶

pomoxis.subsample_bam.filter_read(r, bam, args, logger)[source]¶: Decide whether a read should be filtered out, returning a bool

pomoxis.subsample_bam.main()[source]¶

pomoxis.subsample_bam.subsample_region_proportionally(region, args)[source]¶

pomoxis.subsample_bam.subsample_region_uniformly(region, args)[source]¶

pomoxis.summary_from_stats module¶

Tabulate some simple alignment stats from sam.

pomoxis.summary_from_stats.main(arguments=None)[source]¶

pomoxis.summary_from_stats.qscore(d)[source]¶

pomoxis.summary_from_stats.summarise_stats(d, percentiles=(10, 50, 90))[source]¶

pomoxis.trim_alignments module¶

pomoxis.trim_alignments.main()[source]¶

pomoxis.util module¶

class pomoxis.util.AlignPos(qpos, qbase, rpos, rbase)¶

Bases: tuple

property qbase¶: Alias for field number 1

property qpos¶: Alias for field number 0

property rbase¶: Alias for field number 3

property rpos¶: Alias for field number 2

class pomoxis.util.FastxWrite(fname, mode='w', width=80, force_q=False, mock_q=10)[source]¶

Bases: object

write(name, seq, qual=None, comment='')[source]¶

class pomoxis.util.Region(ref_name, start, end)¶

Bases: tuple

property end¶: Alias for field number 2

property ref_name¶: Alias for field number 0

property start¶: Alias for field number 1

class pomoxis.util.SeqLen(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)[source]¶

Bases: argparse.Action

Parse a sequence length from str such as 4.8mb or from fastx.

pomoxis.util.cat(files, output, chunks=10485760)[source]¶

Concatenate a set of files.

Parameters

files – input filenames.
output – output filenames.
chunks – buffersize for filecopy.

pomoxis.util.chunks(iterable, n)[source]¶

Generate fixed length chunks of an interable.

Parameters

iterable – input sequence.
n – chunk size.

pomoxis.util.coverage_from_fastx()[source]¶

pomoxis.util.extract_long_reads()[source]¶: Filter fastq to longest reads.

pomoxis.util.fast_convert()[source]¶: Convert between fasta<->fastq.

pomoxis.util.get_pairs(aln)[source]¶

Return generator of pairs.

Parameters: aln – pysam.AlignedSegment object.
Returns: generator of AlignPos objects.

pomoxis.util.get_seq_lens(fastx)[source]¶: Get sequence lengths from fastx file

pomoxis.util.get_trimmed_pairs(aln)[source]¶

Trim aligned pairs to the alignment.

Parameters: aln – pysam.AlignedSegment object
Yields pairs

pomoxis.util.intervaltrees_from_bed(path_to_bed)[source]¶

Created dict of intervaltrees from a .bed file, indexed by chrom.

Parameters: path_to_bed – str, path to .bed file.
Returns: { str chrom: intervaltree.IntervalTree obj }.

pomoxis.util.parse_regions(regions, ref_lengths=None)[source]¶

Parse region strings into Region objects.

Parameters

regions – iterable of str
ref_lengths – {str ref_names: int ref_lengths}, if provided Region.end will default to the reference length instead of None.

>>> parse_regions(['Ecoli'])[0]
Region(ref_name='Ecoli', start=0, end=None)
>>> parse_regions(['Ecoli:1000-2000'])[0]
Region(ref_name='Ecoli', start=1000, end=2000)
>>> parse_regions(['Ecoli:-1000'])[0]
Region(ref_name='Ecoli', start=0, end=1000)
>>> parse_regions(['Ecoli:500-'])[0]
Region(ref_name='Ecoli', start=500, end=None)
>>> parse_regions(['Ecoli'], ref_lengths={'Ecoli':4800000})[0]
Region(ref_name='Ecoli', start=0, end=4800000)
>>> parse_regions(['NC_000921.1:10000-20000'])[0]
Region(ref_name='NC_000921.1', start=10000, end=20000)

pomoxis.util.reverse_bed()[source]¶: Convert bed-file coordinates to coordinates on the reverse strand.

pomoxis.util.split_fastx(fname, output, chunksize=10000)[source]¶

Split records in a fasta/q into fixed lengths.

Parameters

fname – input filename.
output – output filename.
chunksize – (maximum) length of output records.

pomoxis.util.split_fastx_cmdline()[source]¶: Split records in a fasta/q file into chunks of a maximum size.

pomoxis.util.tag_bam()[source]¶: Command line tool to add tags to a bam.

pomoxis.util.yield_from_bed(bedfile)[source]¶

Module contents¶

pomoxis.get_prog_path(prog)[source]¶

Get the absolute path of bundled executables.

Parameters: prog – programme (file) name.
Returns: absolute path to executable.

pomoxis.run_prog(prog, args, stdout=None)[source]¶

Run one of the bundled executables.

Parameters

prog – programme name.
args – commandline arguments for programme.
stdout – filehandle/path for stdout of programme.

Returns

result of subprocess.call.

pomoxis.show_prog_path()[source]¶: Print the path of bundled executables.