hapsburg.PackagesSupport.hapsburg_run

Function to run HAPSBURG on a full individual and full reference Dataset, with all relevant keywords. Function for running on single chromsome, with all relevant keywords. @Author: Harald Ringbauer, 2019

Module Contents

Functions

prepare_path_general(basepath, iid, prefix, suffix, ...)

preload(iid, ch, start, end, path_targets, ...[, ...])

hapsb_chunk_negloglik_preload(hmm, c)

hapsb_multiChunk_preload(c, hmms[, processes])

hapsb_femaleROHcontam_preload(iid, roh_list, ...[, ...])

Estimating autosomal contamination rate from a list of ROH blocks. Need at least one ROH for inference.

hapsb_chrom(iid[, ch, save, save_fp, n_ref, ...])

Run Hapsburg analysis for one chromosome on eigenstrat data

hapsb_ind(iid[, chs, path_targets_prefix, ...])

Analyze a full single individual in a parallelized fashion. Run multiple chromosome analyses in parallel.

hapsb_chromXs([iids, processes, file_result, ch, ...])

Run multiple X chromosome pairs. iids is list of two individuals.

hapCon_chrom_BFGS_legacy([iid, hdf5, n_ref, ...])

hapCon_chrom_BFGS([iid, mpileup, bam, bamTable, q, Q, ...])

Run HapCon to estimate male X chromosome contamination.

hapsb_chrom_lowmem(iid[, ch, save, save_fp, n_ref, ...])

Run Hapsburg analysis for one chromosome on eigenstrat data

hapsburg.PackagesSupport.hapsburg_run.prepare_path_general(basepath, iid, prefix, suffix, logfile)
hapsburg.PackagesSupport.hapsburg_run.preload(iid, ch, start, end, path_targets, h5_path1000g, meta_path_ref, folder_out, conPop=['CEU'], roh_jump=300, e_rate=0.01, e_rate_ref=0.001, n_ref=2504, exclude_pops=[])
hapsburg.PackagesSupport.hapsburg_run.hapsb_chunk_negloglik_preload(hmm, c)
hapsburg.PackagesSupport.hapsburg_run.hapsb_multiChunk_preload(c, hmms, processes=1)
hapsburg.PackagesSupport.hapsburg_run.hapsb_femaleROHcontam_preload(iid, roh_list, h5_path1000g, meta_path_ref, hdf5_path=None, folder_out=None, init_c=0.025, trim=0.005, minLen=0.05, conPop=['CEU'], roh_jump=300, e_rate=0.01, e_rate_ref=0.001, processes=1, n_ref=2504, exclude_pops=['AFR'], prefix=None, logfile=False, cleanup=False)

Estimating autosomal contamination rate from a list of ROH blocks. Need at least one ROH for inference.

Parameters:
  • iid (str) – IID of the sample. We assume that the mpileup file has the format $iid.chr[1-22].mpileup.

  • roh_list (str) – Path to a file containing a list of ROH blocks. This file should have the same format as the output of hapROH.

  • h5_path1000g (str) – Path to the reference panel.

  • meta_path_ref (str) – Path to the metadata of reference panel.

  • hdf5_path (str) – Directory of hdf5 files. One file for each autosome.

  • folder_out (str) – Directory in which you want the output to reside. If not given, all output files will be in the parent directory of mpileup_path.

  • init_c (float) – Initial value for the BFGS search.

  • trim (float) – Trim both ends of inferred ROH blocks (in Morgan).

  • minLen (float) – Minimum length of ROH blocks to use in estimating contamination (in Morgan).

  • conPop (list of str) – Contaminant Ancestry. Must correspond to names in the super_pop or pop column in the 1000G metadata file.

  • roh_jump (float) – Copying jump rate.

  • e_rate (float) – If mpileup_path is provided, the sequencing error rate will be estimated from flanking sites, so this parameter has no effect. If hdf5_path is provided, then this parameter is the error rate used for inference. So one should obtain sequencing error estimate from external source if you use the hdf5_path option.

  • e_rate_ref (float) – Haplotype copying error rate.

  • processes (int) – Number of processes to use.

  • n_ref (int) – Number of samples in the reference panel.

  • exclude_pops (list of str) – A list of populations to exclude from the reference panel.

  • prefix (str) – Prefix of the output and log file. The output will follow $iid.$prefix.hapCON_ROH.txt. And the log file will follow $iid.$prefix.hapCON_ROH.log.

  • logfile (bool) – Whether to produce a log file.

  • cleanup (bool) – Whether to delete intermediary HDF5 files generated during this function run.

Returns:

  • conMLE (float) – MLE estimate for contamination.

  • se (float) – Standard error of the estimated contamination rate.

hapsburg.PackagesSupport.hapsburg_run.hapsb_chrom(iid, ch=3, save=True, save_fp=False, n_ref=2504, diploid_ref=True, exclude_pops=[], e_model='EigenstratPacked', p_model='MosaicHDF5', readcounts=True, random_allele=True, downsample=False, post_model='Standard', path_targets=None, h5_path1000g=None, meta_path_ref=None, folder_out=None, prefix_out='', c=0.0, conPop=['CEU'], roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, max_gap=0.005, roh_min_l_initial=0.02, roh_min_l_final=0.05, min_len1=0.02, min_len2=0.04, cutoff_post=0.999, logfile=True)

Run Hapsburg analysis for one chromosome on eigenstrat data Wrapper for HMM Class.

Parameters:
  • iid (str) – IID of the Target Individual, as found in Eigenstrat.

  • ch (int) – Which chromosomes to call ROH.

  • path_targets (str) – Path of the target files. You need only specify one of path_targets_prefix or path_targets.

  • h5_path1000g (str) – Path of the reference genotypes

  • meta_path_ref (str) – Path of the meta file for the references

  • folder_out (str) – Path of the basis folder for output

  • prefix_out (str) – Path to insert in output string, e.g. test/ [str]

  • e_model (str) – Emission model to use, should be one of haploid/diploid_gt/readcount

  • p_model (str) – Preprocessing model to use, should be one of EigenstratPacked/EigenstratUnpacked/MosaicHDF5

  • post_model (str) – Model to post-process the data, should be one of Standard/MMR (experimental)

  • save (bool) – Whether to save the inferred ROH

  • save_fp (bool) – Whether to save the full posterior matrix

  • n_ref (int) – Number of (diploid) reference Individuals to use

  • diploid (bool) – Whether the reference panel is diploid or not (e.g., for autosome, True and for male X chromosome, False).

  • exclude_pops (list of str) – Which populations to exclude from reference

  • readcounts (bool) – Whether to load readcount data

  • random_allele (bool) – Whether to pick a random of the two target alleles per locus

  • downsample – If not false (i.e. float), downsample readcounts to this target average coverage

  • c (float) – Contamination rate. This is only applicable if the emission model is readcount_contam.

  • conPop (list of str) – Ancestry of contamination source. Only applicable if the emission model is readcount_contam.

  • roh_in (float) – Parater to jump into ROH state (per Morgan)

  • roh_out (float) – Parameter to jump out of ROH state (per Morgan)

  • roh_jump (float) – Parameter to jump (per Morgan)

  • e_rate (float) – Sequencing error rate.

  • e_rate_ref (float) – Haplotype miscopying rate.

  • cutoff_post (float) – Posterior cutoff for ROH calling

  • max_gap (float) – Maximum gap to merge two adjacent short ROH blocks (in Morgan)

  • roh_min_l_initial (float) – Minimum length of ROH blocks to use before merging adjacent ones (in Morgan)

  • roh_min_l_final (float) – Minimum length of ROH blcoks to output after merging (in Morgan)

  • min_len1 (float) – Minimum length of the shorter candidate block in two adjacent blocks that can be merged (in Morgan)

  • min_len2 (float) – Minimum length of the longer candidate block in two adjacent blocks that can be merged (in Morgan)

  • logfile (bool) – Whether to use logfile

hapsburg.PackagesSupport.hapsburg_run.hapsb_ind(iid, chs=range(1, 23), path_targets_prefix='', path_targets='', h5_path1000g=None, meta_path_ref=None, folder_out=None, prefix_out='', e_model='haploid', p_model='Eigenstrat', post_model='Standard', processes=1, delete=False, output=True, save=True, save_fp=False, n_ref=2504, diploid_ref=True, exclude_pops=[], readcounts=True, random_allele=True, downsample=False, c=0.0, conPop=['CEU'], roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, cutoff_post=0.999, max_gap=0.005, roh_min_l_initial=0.02, roh_min_l_final=0.04, min_len1=0.02, min_len2=0.04, logfile=True, combine=True, file_result='_roh_full.csv')

Analyze a full single individual in a parallelized fashion. Run multiple chromosome analyses in parallel. Then brings together the result ROH tables from each chromosome into one genome-wide summary ROH table. This function wraps hapsb_chrom. The default Parameters are finetuned for pseudo-haploid 1240k aDNA data.

Parameters:
  • iid (str) – IID of the Target Individual, as found in Eigenstrat.

  • chs (list) – Which set of chromosomes to call ROH.

  • path_targets_prefix – A directory containing a hdf5 file for each chromosome. The file name should follow $iid.chr$ch.hdf5.

  • path_targets (str) – Path of the target files. You need only specify one of path_targets_prefix or path_targets.

  • h5_path1000g (str) – Path of the reference genotypes

  • meta_path_ref (str) – Path of the meta file for the references

  • folder_out (str) – Path of the basis folder for output

  • prefix_out (str) – Path to insert in output string, e.g. test/ [str]

  • e_model (str) – Emission model to use, should be one of haploid/diploid_gt/readcount

  • p_model (str) – Preprocessing model to use, should be one of EigenstratPacked/EigenstratUnpacked/MosaicHDF5

  • post_model (str) – Model to post-process the data, should be one of Standard/MMR (experimental)

  • processes (int) – How many Processes to use

  • delete (bool) – Whether to delete raw posterior per locus

  • output (bool) – Whether to print extensive output

  • save (bool) – Whether to save the inferred ROH

  • save_fp (bool) – Whether to save the full posterior matrix

  • n_ref (int) – Number of (diploid) reference Individuals to use

  • diploid (bool) – Whether the reference panel is diploid or not (e.g., for autosome, True and for male X chromosome, False).

  • exclude_pops (list of str) – Which populations to exclude from reference

  • readcounts (bool) – Whether to load readcount data

  • random_allele (bool) – Whether to pick a random of the two target alleles per locus

  • downsample – If not false (i.e. float), downsample readcounts to this target average coverage

  • c (float) – Contamination rate. This is only applicable if the emission model is readcount_contam.

  • conPop (list of str) – Ancestry of contamination source. Only applicable if the emission model is readcount_contam.

  • roh_in (float) – Parater to jump into ROH state (per Morgan)

  • roh_out (float) – Parameter to jump out of ROH state (per Morgan)

  • roh_jump (float) – Parameter to jump (per Morgan)

  • e_rate (float) – Sequencing error rate.

  • e_rate_ref (float) – Haplotype miscopying rate.

  • cutoff_post (float) – Posterior cutoff for ROH calling

  • max_gap (float) – Maximum gap to merge two adjacent short ROH blocks (in Morgan)

  • roh_min_l_initial (float) – Minimum length of ROH blocks to use before merging adjacent ones (in Morgan)

  • roh_min_l_final (float) – Minimum length of ROH blcoks to output after merging (in Morgan)

  • min_len1 (float) – Minimum length of the shorter candidate block in two adjacent blocks that can be merged (in Morgan)

  • min_len2 (float) – Minimum length of the longer candidate block in two adjacent blocks that can be merged (in Morgan)

  • logfile (bool) – Whether to use logfile

  • combine (bool) – Wether to combine output of all chromosomes

  • file_result (str) – Appendix to individual results

  • Return (If combine is true, return a pandas dataframe that contains information of all detected ROH blocks. Otherwise nothing is returned.) –

hapsburg.PackagesSupport.hapsburg_run.hapsb_chromXs(iids=[['I15595', 'I15970']], processes=1, file_result='', ch=23, save=True, save_fp=False, n_ref=2504, diploid_ref=True, exclude_pops=[], e_model='EigenstratPacked', p_model='MosaicHDF5', readcounts=True, random_allele=True, downsample=False, post_model='Standard', path_targets=None, h5_path1000g=None, meta_path_ref=None, folder_out=None, prefix_out='', c=0.0, conPop=['CEU'], roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, max_gap=0.005, roh_min_l_initial=0.02, roh_min_l_final=0.05, min_len1=0.02, min_len2=0.04, cutoff_post=0.999, logfile=True)

Run multiple X chromosome pairs. iids is list of two individuals. For parameters see hapsb_chrom. Additional: iids: list of pairs of iids to run [list of lists of len 2] processes: How many processes to use [int]. Think about sufficient memory.

hapsburg.PackagesSupport.hapsburg_run.hapCon_chrom_BFGS_legacy(iid='', hdf5=None, n_ref=2504, diploid_ref=False, exclude_pops=['AFR'], conPop=['CEU'], h5_path1000g=None, meta_path_ref=None, folder_out='', c=0.025, roh_jump=300, e_rate=0.01, e_rate_ref=0.001, output=False)
hapsburg.PackagesSupport.hapsburg_run.hapCon_chrom_BFGS(iid='', mpileup=None, bam=None, bamTable=None, q=30, Q=30, n_ref=2504, diploid_ref=False, exclude_pops=['AFR'], conPop=['CEU'], h5_path1000g=None, meta_path_ref=None, folder_out='', c=0.025, roh_jump=300, e_rate_ref=0.001, damage=False, logfile=False, output=False, cleanup=False, prefix='hapCon')

Run HapCon to estimate male X chromosome contamination.

Parameters:
  • iid (str) – IID of the Target Individual, if not provided, will be deduced from the prefix of BAM or mpileup file.

  • mpileup (str) – path to mpileup file of chrX.

  • bam (str) – path to BAM file.

  • bamTable (str) – path to output of BamTable. You need to specify one of mpileup/BAM/bamTable. We recommend using BamTable as your first choice.

  • q (int) – minimum mappling quality for reads in BAM file to be considered. Only applicable if used in conjunction with the bam option.

  • Q (int) – minimum base quality for reads in BAM file to be considered. Only applicable if used in conjunction with the bam option.

  • h5_path1000g (str) – Path of the reference genotypes.

  • meta_path_ref (str) – Path of the meta file for the references.

  • folder_out (str) – Path of the basis folder for output, if not provided, output will reside in the same directory as BAM or mpileup file.

  • output (bool) – Whether to print extensive output. Default False.

  • n_ref (int) – Maximum Number of (diploid) reference Individuals to use. Default 2504.

  • exclude_pops (list of str) – Which populations to exclude from reference. Default is to exclude African haplotypes. Note that the population label supplied in the list must correspond to entries in the “pop” or “super_pop” column in the metadata file. A list of population labels (column “pop”) can be seen https://www.coriell.org/1/NHGRI/Collections/1000-Genomes-Collections/1000-Genomes-Project. The “super_pop” column has 5 possible values: AFR, EUR, EAS, SAS, AMR. The column “pop” is a refinement of “super_pop”. For example, to exclude AFR and EUR haplotypes, one can use exclude_pops=[AFR, EUR]. You can also mix “pop” and “super_pop” labels, for example, exluce_pops=[AFR, IBS]. The same principle applies for the conPop argument as well.

  • conPop (list of str) – use which population in the ref panel as the contaminating pop. If empty list, then use all samples in the ref panel to cauclate allele freq. Default is to CEU allele frequency.

  • c (float) – initial contamination rate to start the BFGS optimization procedure. Default to 0.025.

  • roh_jump (float) – Parameter to jump (per Morgan). Default to 300.

  • e_rate_ref (float) – Haplotype coping error rate. Default to 0.001.

  • logfile (bool) – Whether to produce a logfile. Default False.

  • cleanup (bool) – Whether to delete the intermediary .hdf5 file. Default False.

  • prefix (str) – The output file will have file name $iid.$prefix.txt

Returns:

  • conMLE (float) – MLE point estimate for contamination.

  • lower95 (float) – lower bound for the 95% CI of estimated contamination.

  • upper95 (float) – upper bound for the 95% CI of estimated contamination.

hapsburg.PackagesSupport.hapsburg_run.hapsb_chrom_lowmem(iid, ch=3, save=True, save_fp=False, n_ref=2504, diploid_ref=True, exclude_pops=[], e_model='EigenstratPacked', p_model='MosaicHDF5', readcounts=True, random_allele=True, downsample=False, post_model='Standard', path_targets=None, h5_path1000g=None, meta_path_ref=None, folder_out=None, prefix_out='', c=0.0, conPop=['CEU'], roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, max_gap=0.005, roh_min_l_initial=0.02, roh_min_l_final=0.05, min_len1=0.02, min_len2=0.04, cutoff_post=0.999, verbose=True, logfile=True)

Run Hapsburg analysis for one chromosome on eigenstrat data Wrapper for HMM Class.

Parameters:
  • iid (str) – IID of the Target Individual, as found in Eigenstrat.

  • ch (int) – Which chromosomes to call ROH.

  • path_targets (str) – Path of the target files. You need only specify one of path_targets_prefix or path_targets.

  • h5_path1000g (str) – Path of the reference genotypes

  • meta_path_ref (str) – Path of the meta file for the references

  • folder_out (str) – Path of the basis folder for output

  • prefix_out (str) – Path to insert in output string, e.g. test/ [str]

  • e_model (str) – Emission model to use, should be one of haploid/diploid_gt/readcount

  • p_model (str) – Preprocessing model to use, should be one of EigenstratPacked/EigenstratUnpacked/MosaicHDF5

  • post_model (str) – Model to post-process the data, should be one of Standard/MMR (experimental)

  • save (bool) – Whether to save the inferred ROH

  • save_fp (bool) – Whether to save the full posterior matrix

  • n_ref (int) – Number of (diploid) reference Individuals to use

  • diploid (bool) – Whether the reference panel is diploid or not (e.g., for autosome, True and for male X chromosome, False).

  • exclude_pops (list of str) – Which populations to exclude from reference

  • readcounts (bool) – Whether to load readcount data

  • random_allele (bool) – Whether to pick a random of the two target alleles per locus

  • downsample – If not false (i.e. float), downsample readcounts to this target average coverage

  • c (float) – Contamination rate. This is only applicable if the emission model is readcount_contam.

  • conPop (list of str) – Ancestry of contamination source. Only applicable if the emission model is readcount_contam.

  • roh_in (float) – Parater to jump into ROH state (per Morgan)

  • roh_out (float) – Parameter to jump out of ROH state (per Morgan)

  • roh_jump (float) – Parameter to jump (per Morgan)

  • e_rate (float) – Sequencing error rate.

  • e_rate_ref (float) – Haplotype miscopying rate.

  • cutoff_post (float) – Posterior cutoff for ROH calling

  • max_gap (float) – Maximum gap to merge two adjacent short ROH blocks (in Morgan)

  • roh_min_l_initial (float) – Minimum length of ROH blocks to use before merging adjacent ones (in Morgan)

  • roh_min_l_final (float) – Minimum length of ROH blcoks to output after merging (in Morgan)

  • min_len1 (float) – Minimum length of the shorter candidate block in two adjacent blocks that can be merged (in Morgan)

  • min_len2 (float) – Minimum length of the longer candidate block in two adjacent blocks that can be merged (in Morgan)

  • logfile (bool) – Whether to use logfile