pomoxis package

Submodules

pomoxis.assess_homopolymers module

class pomoxis.assess_homopolymers.AverageScore(cumulative_score=0, count=0)[source]

Bases: object

Keep track of a simple average of inputs.

property average_score

Return the average of the accumulated samples.

pomoxis.assess_homopolymers.analyse_counts(counts, prefix)[source]
pomoxis.assess_homopolymers.analyse_homopolymers(args)[source]
pomoxis.assess_homopolymers.combine_complementary_relative_lengths(data)[source]
pomoxis.assess_homopolymers.count_homopolymers(args)[source]
pomoxis.assess_homopolymers.counts_factory()[source]
pomoxis.assess_homopolymers.find_homopolymers(seq, min_length, alphabet='ACGT')[source]
Parameters
  • seq – the sequence to search

  • min_length – minimum hp length to find

  • alphabet – alphabet of bases to look for

Returns

pomoxis.assess_homopolymers.get_fraction_correct(counts)[source]

Calculate fraction correct vs length from counts.

pomoxis.assess_homopolymers.get_longest_homopolymer(seq, base)[source]
pomoxis.assess_homopolymers.get_next_aligned_base(align, seq_start)[source]
pomoxis.assess_homopolymers.get_query_coords(align_pairs, seq_start, hp_len, base)[source]
pomoxis.assess_homopolymers.get_query_region(query_sequence, query_start, query_end, base)[source]
pomoxis.assess_homopolymers.get_relative_lengths(counts)[source]
pomoxis.assess_homopolymers.load_counts(pkl)[source]
pomoxis.assess_homopolymers.main()[source]

Entry point for homopolymer accuracy counting.

pomoxis.assess_homopolymers.merge_counts(pkl_files)[source]
pomoxis.assess_homopolymers.plot_fraction_correct(df, fname, cols=('A', 'T', 'G', 'C'))[source]

Plot fraction correct vs length from counts.

pomoxis.assess_homopolymers.plot_relative_lengths(data, fname)[source]
pomoxis.assess_homopolymers.process_bam(input_file, out_fh, homo_len, score)[source]
pomoxis.assess_homopolymers.save_counts(counts, fp)[source]

pomoxis.bio module

pomoxis.bio.reverse_complement(seq)[source]

Reverse complement sequence.

Param

input sequence string.

Returns

reverse-complemented string.

pomoxis.bio.shotgun_library(fasta_file, mu, sigma, direction=(1, - 1))[source]

Generate random fragment sequences of a given input sequence

Parameters
  • seq – input sequence.

  • mu – mean fragment length.

  • sigma – stdv of fragment length.

  • direction – tuple represention direction of output sequences with respect to the input sequence.

Yields

sequence fragments.

Note

Could be made more efficient using buffers for random samples and handling cases separately.

pomoxis.catalogue_errors module

class pomoxis.catalogue_errors.AlignSeg(rname, qname, pairs, rlen)

Bases: tuple

property pairs

Alias for field number 2

property qname

Alias for field number 1

property rlen

Alias for field number 3

property rname

Alias for field number 0

class pomoxis.catalogue_errors.ClassifyErrorTest(methodName='runTest')[source]

Bases: unittest.case.TestCase

setUp()[source]

Hook method for setting up the test fixture before exercising it.

test_HP_complete_sub()[source]
test_HP_complex_sub()[source]
test_HP_del_join()[source]
test_HP_del_join2()[source]
test_HP_del_join3()[source]
test_HP_del_join4()[source]
test_HP_flk_sub_ext()[source]
test_HP_flk_sub_trc()[source]
test_HP_ins_split()[source]
test_HP_ins_split2()[source]
test_HP_ins_split3()[source]
test_HP_ins_split4()[source]
test_HP_ins_split5()[source]
test_HP_ins_split6()[source]
test_HP_messy_sub()[source]
test_HP_messy_sub_ext()[source]
test_HP_not_complete_sub()[source]
test_HP_sub_join()[source]
test_HP_sub_split()[source]
test_HP_sub_swap_ext()[source]
test_HP_sub_swap_trc()[source]
test_HP_sub_swap_trc2()[source]
test_HP_swap_sub_trc()[source]
test_complex_HP_del()[source]
test_fwd_repeat_ins()[source]
test_hp_del_2_0()[source]
test_hp_del_6_5()[source]
test_hp_ins()[source]
test_long_multi_del()[source]
test_long_multi_ins()[source]
test_multi_del()[source]
test_multi_ins()[source]
class pomoxis.catalogue_errors.Context(p_i, qb, rb)

Bases: tuple

property p_i

Alias for field number 0

property qb

Alias for field number 1

property rb

Alias for field number 2

class pomoxis.catalogue_errors.Error(rp, rname, qp, qname, ref, match, read, counts, klass, aggr_klass)

Bases: tuple

property aggr_klass

Alias for field number 9

property counts

Alias for field number 7

property klass

Alias for field number 8

property match

Alias for field number 5

property qname

Alias for field number 3

property qp

Alias for field number 2

property read

Alias for field number 6

property ref

Alias for field number 4

property rname

Alias for field number 1

property rp

Alias for field number 0

pomoxis.catalogue_errors.analyze_counts(counts, total_ref_length)[source]
pomoxis.catalogue_errors.are_adjacent(inds)[source]

“Check if all int indices in the interable are consecutive.

Parameters

inds – iterable of ints

Returns

bool

pomoxis.catalogue_errors.classify_error(context, indel_sizes=None)[source]

Classify error within an alignment.

Parameters

contextContext object

Indel_sizes

iterable of int, for binning indel sizes. indels >= to indel_sizes[0] will not be considered as HP splitting/joining indels

Returns

(str reference_context, str match_line, str query_context, dict counts of sub/ins/del within context)

pomoxis.catalogue_errors.classify_hp_indel(p_i, key, errors, runs1, seq2)[source]

Look for a specific kind of HP indel that splits or joins two HPs

Parameters
  • p_i – int, index of error

  • key – key of error type within errors (should be ‘ins’ or ‘del’)

  • errors – dict of error positions

Runs1

np.ndarray, rle encoding of sequence1 (query if deletions join two HPs, or ref if insertions split a HP

Seq2

iterable of str of sequence2 (ref if deletions join two HPs, or query if insertions split a HP.

Returns

str classification or None

pomoxis.catalogue_errors.classify_hp_sub(p_i, adjacent, errors, match_line, rb_runs, qb_runs, qp_is_hp, rp_is_hp)[source]
pomoxis.catalogue_errors.get_aggr_counts(total_counts)[source]
pomoxis.catalogue_errors.get_aggr_klass(klass)[source]
pomoxis.catalogue_errors.get_errors(aln, tree=None)[source]

Find positions of errors in an aligment.

Parameters
  • aln – iterable of AlignPos objects.

  • bed_file – path to .bed file of regions to include in analysis.

  • treeintervaltree.IntervalTree object of regions to analyse.

Returns

( [(ri, qi, ‘error_type’, last_ri, last_qi)], aligned_ref_len) ri, qi: ref and query positions error_type: ‘D’, ‘I’ or ‘S’ last_ri, last_qi: ref and query positions of the last match aligned_ref_len: total aligned reference length (taking account of masking tree)

pomoxis.catalogue_errors.get_match_line_and_err_index(context)[source]
pomoxis.catalogue_errors.get_run(i, runs)[source]

Find run to which the i’th element belongs.

Parameters

i – int, element index wihin input to rle.

Returns

int, element index within runs to which i belongs.

pomoxis.catalogue_errors.is_in_hp(seq, p_i)[source]
pomoxis.catalogue_errors.main()[source]
pomoxis.catalogue_errors.plot_summary(df, outdir, prefix, ref_len)[source]

Create a plot showing Q-scores as largest remaining error klass is removed

pomoxis.catalogue_errors.preprocess_error(p, aln, search_by_q, offset=10)[source]
Parameters
  • p – int, position (ref position, or query position)

  • aln – iterable of AlignPos objects.

  • search_by_q – bool, whether to search by query position (typically done if qi is None).

Returns

Context object

pomoxis.catalogue_errors.qscore(d)[source]

Calculate a qscore

pomoxis.catalogue_errors.rle(it)[source]

Calculate a run length encoding (rle), of an input vector.

Parameters

it – iterable.

Returns

structured array with fields start, length, and value.

pomoxis.catalogue_errors.simple_klass(adjacent, n, err_type, indel_sizes)[source]

pomoxis.common_errors_from_bam module

pomoxis.common_errors_from_bam.count_errors(errors)[source]
pomoxis.common_errors_from_bam.get_errors(aln, ref_seq=None)[source]
pomoxis.common_errors_from_bam.get_qscores(counts, ref_len)[source]
pomoxis.common_errors_from_bam.main()[source]
pomoxis.common_errors_from_bam.scuff_ref(ref_seq, errors)[source]

pomoxis.coverage_from_bam module

pomoxis.coverage_from_bam.coverage_of_region(region, bam_fp, stride)[source]

Get coverage data for a region

pomoxis.coverage_from_bam.coverage_summary_of_region(*args)[source]
pomoxis.coverage_from_bam.main()[source]

pomoxis.find_indels module

pomoxis.find_indels.get_trimmed_pairs(aln)[source]

Trim aligned pairs to the alignment.

Parameters

alnpysam.AlignedSegment object

Yields pairs

pomoxis.find_indels.main(arguments=None)[source]

pomoxis.qscores_from_summary module

pomoxis.qscores_from_summary.extract_vals(vals, name_map, col_ind)[source]
pomoxis.qscores_from_summary.get_ref_cover(lines, ref)[source]
pomoxis.qscores_from_summary.main()[source]
pomoxis.qscores_from_summary.qscore_from_summary(summaries, median, ref)[source]

pomoxis.ref_seqs_from_bam module

pomoxis.ref_seqs_from_bam.main()[source]

pomoxis.stats_from_bam module

Tabulate some simple alignment stats from sam.

pomoxis.stats_from_bam.main(arguments=None)[source]
pomoxis.stats_from_bam.masked_stats_from_aligned_read(read, references, lengths, tree)[source]

Create summary information for an aligned read over regions in bed file.

Parameters

readpysam.AlignedSegment object

pomoxis.stats_from_bam.stats_from_aligned_read(read, references, lengths)[source]

Create summary information for an aligned read

Parameters

readpysam.AlignedSegment object

pomoxis.subsample_bam module

pomoxis.subsample_bam.filter_read(r, bam, args, logger)[source]

Decide whether a read should be filtered out, returning a bool

pomoxis.subsample_bam.main()[source]
pomoxis.subsample_bam.subsample_region_proportionally(region, args)[source]
pomoxis.subsample_bam.subsample_region_uniformly(region, args)[source]

pomoxis.summary_from_stats module

Tabulate some simple alignment stats from sam.

pomoxis.summary_from_stats.main(arguments=None)[source]
pomoxis.summary_from_stats.qscore(d)[source]
pomoxis.summary_from_stats.summarise_stats(d, percentiles=(10, 50, 90))[source]

pomoxis.trim_alignments module

pomoxis.trim_alignments.main()[source]

pomoxis.util module

class pomoxis.util.AlignPos(qpos, qbase, rpos, rbase)

Bases: tuple

property qbase

Alias for field number 1

property qpos

Alias for field number 0

property rbase

Alias for field number 3

property rpos

Alias for field number 2

class pomoxis.util.FastxWrite(fname, mode='w', width=80, force_q=False, mock_q=10)[source]

Bases: object

write(name, seq, qual=None, comment='')[source]
class pomoxis.util.Region(ref_name, start, end)

Bases: tuple

property end

Alias for field number 2

property ref_name

Alias for field number 0

property start

Alias for field number 1

class pomoxis.util.SeqLen(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)[source]

Bases: argparse.Action

Parse a sequence length from str such as 4.8mb or from fastx.

pomoxis.util.cat(files, output, chunks=10485760)[source]

Concatenate a set of files.

Parameters
  • files – input filenames.

  • output – output filenames.

  • chunks – buffersize for filecopy.

pomoxis.util.chunks(iterable, n)[source]

Generate fixed length chunks of an interable.

Parameters
  • iterable – input sequence.

  • n – chunk size.

pomoxis.util.coverage_from_fastx()[source]
pomoxis.util.extract_long_reads()[source]

Filter fastq to longest reads.

pomoxis.util.fast_convert()[source]

Convert between fasta<->fastq.

pomoxis.util.get_pairs(aln)[source]

Return generator of pairs.

Parameters

alnpysam.AlignedSegment object.

Returns

generator of AlignPos objects.

pomoxis.util.get_seq_lens(fastx)[source]

Get sequence lengths from fastx file

pomoxis.util.get_trimmed_pairs(aln)[source]

Trim aligned pairs to the alignment.

Parameters

alnpysam.AlignedSegment object

Yields pairs

pomoxis.util.intervaltrees_from_bed(path_to_bed)[source]

Created dict of intervaltrees from a .bed file, indexed by chrom.

Parameters

path_to_bed – str, path to .bed file.

Returns

{ str chrom: intervaltree.IntervalTree obj }.

pomoxis.util.parse_regions(regions, ref_lengths=None)[source]

Parse region strings into Region objects.

Parameters
  • regions – iterable of str

  • ref_lengths – {str ref_names: int ref_lengths}, if provided Region.end will default to the reference length instead of None.

>>> parse_regions(['Ecoli'])[0]
Region(ref_name='Ecoli', start=0, end=None)
>>> parse_regions(['Ecoli:1000-2000'])[0]
Region(ref_name='Ecoli', start=1000, end=2000)
>>> parse_regions(['Ecoli:-1000'])[0]
Region(ref_name='Ecoli', start=0, end=1000)
>>> parse_regions(['Ecoli:500-'])[0]
Region(ref_name='Ecoli', start=500, end=None)
>>> parse_regions(['Ecoli'], ref_lengths={'Ecoli':4800000})[0]
Region(ref_name='Ecoli', start=0, end=4800000)
>>> parse_regions(['NC_000921.1:10000-20000'])[0]
Region(ref_name='NC_000921.1', start=10000, end=20000)
pomoxis.util.reverse_bed()[source]

Convert bed-file coordinates to coordinates on the reverse strand.

pomoxis.util.split_fastx(fname, output, chunksize=10000)[source]

Split records in a fasta/q into fixed lengths.

Parameters
  • fname – input filename.

  • output – output filename.

  • chunksize – (maximum) length of output records.

pomoxis.util.split_fastx_cmdline()[source]

Split records in a fasta/q file into chunks of a maximum size.

pomoxis.util.tag_bam()[source]

Command line tool to add tags to a bam.

pomoxis.util.yield_from_bed(bedfile)[source]

Module contents

pomoxis.get_prog_path(prog)[source]

Get the absolute path of bundled executables.

Parameters

prog – programme (file) name.

Returns

absolute path to executable.

pomoxis.run_prog(prog, args, stdout=None)[source]

Run one of the bundled executables.

Parameters
  • prog – programme name.

  • args – commandline arguments for programme.

  • stdout – filehandle/path for stdout of programme.

Returns

result of subprocess.call.

pomoxis.show_prog_path()[source]

Print the path of bundled executables.