hapsburg.PackagesSupport.h5_python.h5_functions

Contains various functions to operate on h5 files: loading h5 files, as well as converting them to VCFs.

Author: Harald Ringbauer, 2019. All rights reserved.

Functions

load_h5(path[, output])

Load HDF5 from path and return hdf5 object

save_data_h5(gt, ad, ref, alt, pos, rec, samples, path)

Create a new HDF5 File with Input Data.

to_vcf(chrom, pos, ref, alt, gt, iids, vcf_path[, ...])

Saves VCF. If Genotype Likelihoods given (pl), save them too.

add_gt_data(df, gt[, pl, iids, m_sym])

Add Genotype and Allele Depth Fields [l,n,2] for iids to pandas dataframe df.

hdf5_to_vcf(path_h5, path_vcf[, iids, markers, chrom, ...])

Load HDF5 from path_h5, extract iids and (if given) markers by position, and save the VCF to path_vcf.

ad_to_gentoypeL(ad[, error])

Convert Allele Depth Fields to Genotype Likelihoods.

gl_to_pl(gl)

Convert Genotype Probabilities to normalized PHRED scores

merge_in_ld_map(path_h5, path_snp1240k[, chs, write_mode])

Merge in MAP from eigenstrat .snp file into hdf5 file.

bring_over_samples(h5_original, h5_target[, field, dt])

Bring over field from one h5 to another. Assume field does not exist in target

merge_chr_hdf5(fs[, path_combined_h5, chs])

Combine Genotype hdf5s from different chromosomes into one hdf5

concat_fields(f, f2, field1, field2[, axis])

Concatenate two hdf5 fields and return data

combine_hdf5s(f, g, path_new)

Combine Genotype hdf5s and save at path_new

mpileup2hdf5(path2mpileup, refHDF5[, iid, s, e, ...])

Function to convert Pileup to HDF5 format.

mpileups2hdf5([iid, chs, mpileup_path, out_path, ...])

Function to transform Pileups from several chromosomes to hdf5s.

pull_down_pileup([path_bam, iid, chs, processes, ...])

Produces a Pull Down File from a .bam file and outputs a pileup file in the standard format IID.chrX.mpileup

bam_to_hdf5([iid, chs, processes, path_bam, path_bed, ...])

Converts a bam file to a HDF5 file.

Module Contents

hapsburg.PackagesSupport.h5_python.h5_functions.load_h5(path, output=True)

Load HDF5 from path and return hdf5 object

hapsburg.PackagesSupport.h5_python.h5_functions.save_data_h5(gt, ad, ref, alt, pos, rec, samples, path, gp=[], af=[], ch=[], compression='gzip', ad_group=True, gt_type='int8')

Create a new HDF5 File with Input Data. Genotype data is saved as int8, read count data as int16.

gt: Genotype data [l,k,2]
ad: Allele depth [l,k,2]
ref: Reference Allele [l]
alt: Alternate Allele [l]
pos: Position [l]
ch: Chromosome [l]; only numerical values (int8) allowed
m: Map position [l]
af: Allele Frequencies [l]
samples: Sample IDs [k]
ad_group: Whether to save allele depth
gt_type: Which data type to use when saving genotypes

hapsburg.PackagesSupport.h5_python.h5_functions.to_vcf(chrom, pos, ref, alt, gt, iids, vcf_path, header=[], pl=[])

Saves VCF. If Genotype Likelihoods given (pl), save them too.
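To illustrate what saving genotypes with optional PL values involves, here is a minimal sketch that formats one VCF data line. It assumes the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample). The helper name vcf_line and the choice to treat negative allele codes as missing are hypothetical and are not taken from the hapsburg source.

```python
def vcf_line(chrom, pos, ref, alt, gts, pls=None, m_sym="."):
    """Format one VCF data line (hypothetical helper, not hapsburg's code).

    gts: list of (allele1, allele2) integer pairs, one per sample.
    pls: optional list of (PL00, PL01, PL11) integer triples, one per sample.
    """
    fmt = "GT" if pls is None else "GT:PL"
    # ID, QUAL, FILTER and INFO are left as the missing symbol in this sketch.
    fields = [str(chrom), str(pos), m_sym, ref, alt, m_sym, m_sym, m_sym, fmt]
    for i, pair in enumerate(gts):
        # Assumption: negative allele codes mark missing calls.
        call = "/".join(m_sym if a < 0 else str(a) for a in pair)
        if pls is not None:
            call += ":" + ",".join(str(v) for v in pls[i])
        fields.append(call)
    return "\t".join(fields)


# One heterozygous sample and one sample with a missing call.
print(vcf_line(3, 1000, "A", "G", [(0, 1), (-1, -1)]))
```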

hapsburg.PackagesSupport.h5_python.h5_functions.add_gt_data(df, gt, pl=[], iids=[], m_sym='.')

Add Genotype and Allele Depth Fields [l,n,2] for iids to pandas dataframe df. Return the modified data frame. If pl (Genotype Likelihoods) is given, add them too.

hapsburg.PackagesSupport.h5_python.h5_functions.hdf5_to_vcf(path_h5, path_vcf, iids=[], markers=[], chrom=0, pl_field=False)

Load HDF5 from path_h5, extract iids and (if given) markers by position, and save the VCF to path_vcf.

pl_field: If True, also save Genotype Likelihoods.
chrom: Value for chromosome (otherwise loaded from the h5)
iids: Which individuals to match and save. If none are given, save all.

hapsburg.PackagesSupport.h5_python.h5_functions.ad_to_gentoypeL(ad, error=0.001)

Convert Allele Depth Fields to Genotype Likelihoods.

ad: [l,n,2] array of allele read counts (integers)
error: Flip error for a read
return: Genotype Probabilities (Pr(G|RC)) [l,n,3] for 00/01/11
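The conversion from read counts to genotype probabilities can be sketched for a single site and individual as follows. This assumes a binomial read-sampling model in which a read shows the alternate allele with probability error, 0.5, or 1 - error for genotypes 00, 01, and 11, and it normalizes under a flat prior. Both the model details and the helper name genotype_probs are assumptions for illustration; the module's actual computation may differ, for example in whether it normalizes.

```python
from math import comb


def genotype_probs(ref_count, alt_count, error=0.001):
    """Return [P(00), P(01), P(11)] for one site (illustrative sketch)."""
    n = ref_count + alt_count
    # Probability that a single read shows the alternate allele,
    # given 0, 1 or 2 copies of the alternate allele.
    p_alt = [error, 0.5, 1.0 - error]
    # Binomial likelihood of the observed read counts under each genotype.
    lik = [comb(n, alt_count) * p ** alt_count * (1.0 - p) ** ref_count
           for p in p_alt]
    total = sum(lik)
    # Assumption: flat prior over the three genotypes.
    return [x / total for x in lik]


# Two reference reads and two alternate reads favour the heterozygote.
print(genotype_probs(2, 2))
```

With no reads at all, the three genotypes come out equally likely, which is the expected behaviour for uncovered sites under this model.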

hapsburg.PackagesSupport.h5_python.h5_functions.gl_to_pl(gl)

Convert Genotype Probabilities to normalized PHRED scores.

gl: [l,n,3] Probabilities Pr(G|RC) (not log scale)
return: [l,n,3] vector
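As a worked illustration of normalized PHRED scaling, the sketch below converts one probability triple using PL = -10 * log10(p) and then subtracts the smallest value so that the most likely genotype gets 0. The helper name probs_to_pl, the rounding to integers, and the cap of 255 for zero probabilities are assumptions made here, not details confirmed by the source.

```python
from math import log10


def probs_to_pl(probs, max_pl=255):
    """Convert genotype probabilities to normalized PHRED scores (sketch)."""
    inf = float("inf")
    # Raw PHRED scale; a probability of zero maps to infinity.
    raw = [-10.0 * log10(p) if p > 0 else inf for p in probs]
    best = min(raw)
    pl = []
    for r in raw:
        if r == inf:
            # Assumption: cap impossible genotypes at max_pl.
            pl.append(max_pl)
        else:
            # Normalize so the most likely genotype scores 0.
            pl.append(min(max_pl, round(r - best)))
    return pl


print(probs_to_pl([0.9, 0.09, 0.01]))  # -> [0, 10, 20]
```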

hapsburg.PackagesSupport.h5_python.h5_functions.merge_in_ld_map(path_h5, path_snp1240k, chs=range(1, 23), write_mode='a')

Merge in MAP from eigenstrat .snp file into hdf5 file. The modified h5 is saved in place.

path_h5: Path to hdf5 file to modify.
path_snp1240k: Path to Eigenstrat .snp file whose map to use.
chs: Which chromosomes to merge in the HDF5 [list].
write_mode: Which mode to use on the hdf5. a: New field. r+: Change field.

hapsburg.PackagesSupport.h5_python.h5_functions.bring_over_samples(h5_original, h5_target, field='samples', dt='S32')

Bring over field from one h5 to another. Assumes the field does not exist in the target.

h5_original: The original hdf5 path
h5_target: The target hdf5 path
field: Which field to copy over

hapsburg.PackagesSupport.h5_python.h5_functions.merge_chr_hdf5(fs, path_combined_h5='', chs=[])

Combine Genotype hdf5s from different chromosomes into one hdf5 and save it at path_combined_h5. For now, only Allele Depths and GT are saved, but not GP (update still to be implemented).

fs: List of hdf5s.
path_combined_h5: Where to save the new massive hdf5.
chs: List of chromosomes.

hapsburg.PackagesSupport.h5_python.h5_functions.concat_fields(f, f2, field1, field2, axis=0)

Concatenate two hdf5 fields and return data

hapsburg.PackagesSupport.h5_python.h5_functions.combine_hdf5s(f, g, path_new)

Combine Genotype hdf5s and save at path_new.

f, g: Genotype hdf5s. g will be appended to f.
path_new: Where to save the new massive hdf5.

hapsburg.PackagesSupport.h5_python.h5_functions.mpileup2hdf5(path2mpileup, refHDF5, iid='', s=-np.inf, e=np.inf, outPath='', output=True)

Function to convert Pileup to HDF5 format. Outputs HDF5 file at outPath in format

hapsburg.PackagesSupport.h5_python.h5_functions.mpileups2hdf5(iid='', chs=range(1, 23), mpileup_path='', out_path='', refh5_path='', s=-np.inf, e=np.inf, output=True, processes=1)

Function to transform Pileups from several chromosomes to hdf5s. Effectively a wrapper of mpileup2hdf5. Assumes input pileups are in the format IID.chrX.mpileup. Produces output with the standard file name chrX.hdf5.

iid: Name of the individual to run. Used in encoding of the input and as the name in the hdf5.
chs: List of chromosomes to run. Used in input and output file names.
mpileup_path: Where to find the actual pileup files.
out_path: Where to find the output files. Folder in form /PATH/
refh5_path: Reference HDF5s, in format /PATH/chr
processes: How many processes to run in parallel
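The file naming convention described here (input IID.chrX.mpileup, output chrX.hdf5) can be sketched as a small path builder. The helper pileup_io_paths is hypothetical, and it assumes both folders are given with a trailing slash (the /PATH/ form) so that plain string concatenation yields valid paths.

```python
def pileup_io_paths(iid, chs, mpileup_path, out_path):
    """List (input pileup, output hdf5) path pairs per chromosome (sketch)."""
    pairs = []
    for ch in chs:
        # Input follows the IID.chrX.mpileup convention.
        src = f"{mpileup_path}{iid}.chr{ch}.mpileup"
        # Output follows the chrX.hdf5 convention.
        dst = f"{out_path}chr{ch}.hdf5"
        pairs.append((src, dst))
    return pairs


for src, dst in pileup_io_paths("I0001", range(1, 3), "/data/pileup/", "/data/h5/"):
    print(src, "->", dst)
```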

hapsburg.PackagesSupport.h5_python.h5_functions.pull_down_pileup(path_bam='', iid='', chs=range(1, 23), processes=4, path_bed='', out_path='', q=30, Q=30, output=True)

Produces a Pull Down File from a .bam file and outputs a pileup file in the standard format IID.chrX.mpileup.

path_bam: Which bam to pull down from.
chs: Which chromosomes to run [LIST]
path_bed: Path to the BED file of the SNP set to pull down. Format PATH.chr
out_path: Where to pull down to.
processes: How many processes to run

hapsburg.PackagesSupport.h5_python.h5_functions.bam_to_hdf5(iid='', chs=range(1, 23), processes=4, path_bam='', path_bed='', pileup_path='', outh5_path='', refh5_path='', q=30, Q=30, output=False)

Converts a bam file to a HDF5 file. Goes via a samtools pulldown file as intermediate. Produces output with the standard file name chrX.hdf5. Runs multiple chromosomes, and can be parallelized (into multiple processes).

Parameters:
iid: What IID to save to [STRING]
chs: Which chromosomes to run [LIST]
path_bam: Complete path to the bam to pull down.
path_bed: Complete path to the BED file of the SNP set to pull down. Format PATH.chr but not X.bed
pileup_path: Where to pull down to. Folder in form /PATH/
outh5_path: Where to put the hdf5 output files. Folder in form /PATH/
refh5_path: Reference HDF5s, in format /PATH/chr
processes: How many processes to run in parallel
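The intermediate pulldown step might look roughly like the sketch below, which only builds a samtools mpileup argument list for one chromosome and does not run it. The exact command hapsburg issues is not shown in this reference, so this is an assumption: it maps q and Q to the samtools minimum mapping quality (-q) and minimum base quality (-Q) options, passes the per-chromosome BED file via -l, and expects the caller to redirect standard output to the pileup file. The helper name mpileup_command is hypothetical.

```python
def mpileup_command(path_bam, iid, ch, path_bed, pileup_path, q=30, Q=30):
    """Build a samtools mpileup call for one chromosome (illustrative sketch).

    Returns the argument list and the pileup file that stdout should go to.
    """
    # path_bed has the form PATH.chr, so the chromosome and ".bed" are appended.
    bed_file = f"{path_bed}{ch}.bed"
    # Pileup output follows the IID.chrX.mpileup convention.
    out_file = f"{pileup_path}{iid}.chr{ch}.mpileup"
    cmd = ["samtools", "mpileup",
           "-q", str(q),    # minimum mapping quality
           "-Q", str(Q),    # minimum base quality
           "-l", bed_file,  # restrict to positions listed in the BED file
           path_bam]
    return cmd, out_file


cmd, out_file = mpileup_command("/data/I0001.bam", "I0001", 3,
                                "/data/snps.chr", "/data/pileup/")
print(" ".join(cmd), ">", out_file)
```

Returning the argument list rather than a shell string keeps the sketch easy to pass to subprocess.run with stdout redirected to out_file, and one such call per chromosome could be distributed over several processes.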