hapsburg.PackagesSupport.hapsburg_run ===================================== .. py:module:: hapsburg.PackagesSupport.hapsburg_run .. autoapi-nested-parse:: Function to run HAPSBURG on a full individual and full reference Dataset, with all relevant keywords. Function for running on a single chromosome, with all relevant keywords. @Author: Harald Ringbauer, 2019 Functions --------- .. autoapisummary:: hapsburg.PackagesSupport.hapsburg_run.prepare_path_general hapsburg.PackagesSupport.hapsburg_run.preload hapsburg.PackagesSupport.hapsburg_run.hapsb_chunk_negloglik_preload hapsburg.PackagesSupport.hapsburg_run.hapsb_multiChunk_preload hapsburg.PackagesSupport.hapsburg_run.hapsb_femaleROHcontam_preload hapsburg.PackagesSupport.hapsburg_run.hapsb_chrom hapsburg.PackagesSupport.hapsburg_run.hapsb_ind hapsburg.PackagesSupport.hapsburg_run.hapsb_chromXs hapsburg.PackagesSupport.hapsburg_run.hapCon_chrom_BFGS_legacy hapsburg.PackagesSupport.hapsburg_run.hapCon_chrom_BFGS hapsburg.PackagesSupport.hapsburg_run.hapsb_chrom_lowmem hapsburg.PackagesSupport.hapsburg_run.hapsb_ind_lowmem Module Contents --------------- .. py:function:: prepare_path_general(basepath, iid, prefix, suffix, logfile) .. py:function:: preload(iid, ch, start, end, path_targets, h5_path1000g, meta_path_ref, folder_out, conPop=['CEU'], roh_jump=300, e_rate=0.01, e_rate_ref=0.001, n_ref=2504, exclude_pops=[], lowmem=False) .. py:function:: hapsb_chunk_negloglik_preload(hmm, c) .. py:function:: hapsb_multiChunk_preload(c, hmms, processes=1) .. py:function:: hapsb_femaleROHcontam_preload(iid, roh_list, h5_path1000g, meta_path_ref, hdf5_path=None, folder_out=None, init_c=0.025, trim=0.005, minLen=0.05, conPop=['CEU'], roh_jump=300, e_rate=0.01, e_rate_ref=0.001, processes=1, n_ref=2504, exclude_pops=['AFR'], prefix=None, logfile=False, cleanup=False, lowmem=False) Estimating autosomal contamination rate from a list of ROH blocks. Need at least one ROH for inference. :param iid: IID of the sample. We assume that the mpileup file has the format $iid.chr[1-22].mpileup. :type iid: str :param roh_list: Path to a file containing a list of ROH blocks. This file should have the same format as the output of hapROH. :type roh_list: str :param h5_path1000g: Path to the reference panel. :type h5_path1000g: str :param meta_path_ref: Path to the metadata of reference panel. :type meta_path_ref: str :param hdf5_path: Directory of hdf5 files. One file for each autosome. :type hdf5_path: str :param folder_out: Directory in which you want the output to reside. If not given, all output files will be in the parent directory of mpileup_path. :type folder_out: str :param init_c: Initial value for the BFGS search. :type init_c: float :param trim: Trim both ends of inferred ROH blocks (in Morgan). :type trim: float :param minLen: Minimum length of ROH blocks to use in estimating contamination (in Morgan). :type minLen: float :param conPop: Contaminant Ancestry. Must correspond to names in the super_pop or pop column in the 1000G metadata file. :type conPop: list of str :param roh_jump: Copying jump rate. :type roh_jump: float :param e_rate: If mpileup_path is provided, the sequencing error rate will be estimated from flanking sites, so this parameter has no effect. If hdf5_path is provided, then this parameter is the error rate used for inference. So one should obtain sequencing error estimate from external source if you use the hdf5_path option. :type e_rate: float :param e_rate_ref: Haplotype copying error rate. :type e_rate_ref: float :param processes: Number of processes to use. :type processes: int :param n_ref: Number of samples in the reference panel. :type n_ref: int :param exclude_pops: A list of populations to exclude from the reference panel. :type exclude_pops: list of str :param prefix: Prefix of the output and log file. The output will follow $iid.$prefix.hapCON_ROH.txt. And the log file will follow $iid.$prefix.hapCON_ROH.log. :type prefix: str :param logfile: Whether to produce a log file. :type logfile: bool :param cleanup: Whether to delete intermediary HDF5 files generated during this function run. :type cleanup: bool :returns: * **conMLE** (*float*) -- MLE estimate for contamination. * **se** (*float*) -- Standard error of the estimated contamination rate. .. py:function:: hapsb_chrom(iid, ch=3, save=True, save_fp=False, n_ref=2504, diploid_ref=True, exclude_pops=[], e_model='EigenstratPacked', p_model='MosaicHDF5', readcounts=True, random_allele=True, downsample=False, post_model='Standard', path_targets=None, h5_path1000g=None, meta_path_ref=None, folder_out=None, prefix_out='', c=0.0, conPop=['CEU'], roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, max_gap=0.005, roh_min_l_initial=0.02, roh_min_l_final=0.05, min_len1=0.02, min_len2=0.04, cutoff_post=0.999, logfile=True) Run Hapsburg analysis for one chromosome on eigenstrat data Wrapper for HMM Class. :param iid: IID of the Target Individual, as found in Eigenstrat. :type iid: str :param ch: Which chromosomes to call ROH. :type ch: int :param path_targets: Path of the target files. You need only specify one of path_targets_prefix or path_targets. :type path_targets: str :param h5_path1000g: Path of the reference genotypes :type h5_path1000g: str :param meta_path_ref: Path of the meta file for the references :type meta_path_ref: str :param folder_out: Path of the basis folder for output :type folder_out: str :param prefix_out: Path to insert in output string, e.g. test/ [str] :type prefix_out: str :param e_model: Emission model to use, should be one of haploid/diploid_gt/readcount :type e_model: str :param p_model: Preprocessing model to use, should be one of EigenstratPacked/EigenstratUnpacked/MosaicHDF5 :type p_model: str :param post_model: Model to post-process the data, should be one of Standard/MMR (experimental) :type post_model: str :param save: Whether to save the inferred ROH :type save: bool :param save_fp: Whether to save the full posterior matrix :type save_fp: bool :param n_ref: Number of (diploid) reference Individuals to use :type n_ref: int :param diploid: Whether the reference panel is diploid or not (e.g., for autosome, True and for male X chromosome, False). :type diploid: bool :param exclude_pops: Which populations to exclude from reference :type exclude_pops: list of str :param readcounts: Whether to load readcount data :type readcounts: bool :param random_allele: Whether to pick a random of the two target alleles per locus :type random_allele: bool :param downsample: If not false (i.e. float), downsample readcounts to this target average coverage :param c: Contamination rate. This is only applicable if the emission model is readcount_contam. :type c: float :param conPop: Ancestry of contamination source. Only applicable if the emission model is readcount_contam. :type conPop: list of str :param roh_in: Parater to jump into ROH state (per Morgan) :type roh_in: float :param roh_out: Parameter to jump out of ROH state (per Morgan) :type roh_out: float :param roh_jump: Parameter to jump (per Morgan) :type roh_jump: float :param e_rate: Sequencing error rate. :type e_rate: float :param e_rate_ref: Haplotype miscopying rate. :type e_rate_ref: float :param cutoff_post: Posterior cutoff for ROH calling :type cutoff_post: float :param max_gap: Maximum gap to merge two adjacent short ROH blocks (in Morgan) :type max_gap: float :param roh_min_l_initial: Minimum length of ROH blocks to use before merging adjacent ones (in Morgan) :type roh_min_l_initial: float :param roh_min_l_final: Minimum length of ROH blcoks to output after merging (in Morgan) :type roh_min_l_final: float :param min_len1: Minimum length of the shorter candidate block in two adjacent blocks that can be merged (in Morgan) :type min_len1: float :param min_len2: Minimum length of the longer candidate block in two adjacent blocks that can be merged (in Morgan) :type min_len2: float :param logfile: Whether to use logfile :type logfile: bool .. py:function:: hapsb_ind(iid, chs=range(1, 23), path_targets_prefix='', path_targets='', h5_path1000g=None, meta_path_ref=None, folder_out=None, prefix_out='', e_model='haploid', p_model='Eigenstrat', post_model='Standard', processes=1, delete=False, output=True, save=True, save_fp=False, n_ref=2504, diploid_ref=True, exclude_pops=[], readcounts=True, random_allele=True, downsample=False, c=0.0, conPop=['CEU'], roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, cutoff_post=0.999, max_gap=0.005, roh_min_l_initial=0.02, roh_min_l_final=0.04, min_len1=0.02, min_len2=0.04, logfile=True, combine=True, file_result='_roh_full.csv') Analyze a full single individual in a parallelized fashion. Run multiple chromosome analyses in parallel. Then brings together the result ROH tables from each chromosome into one genome-wide summary ROH table. This function wraps hapsb_chrom. The default Parameters are finetuned for pseudo-haploid 1240k aDNA data. :param iid: IID of the Target Individual, as found in Eigenstrat. :type iid: str :param chs: Set of chromosomes to call ROH on :type chs: list :param path_targets_prefix: A directory containing an HDF5 file for each chromosome. The file name should follow $iid.chr$ch.hdf5. :param path_targets: Path of the target files. You need only specify one of path_targets_prefix or path_targets. :type path_targets: str :param h5_path1000g: Path of the reference genotypes :type h5_path1000g: str :param meta_path_ref: Path of the meta file for the references :type meta_path_ref: str :param folder_out: Path of the basis folder for output :type folder_out: str :param prefix_out: Path to insert in output string, e.g. test/ [str] :type prefix_out: str :param e_model: Emission model to use, should be one of haploid/diploid_gt/readcount :type e_model: str :param p_model: Preprocessing model to use, should be one of EigenstratPacked/EigenstratUnpacked/MosaicHDF5 :type p_model: str :param post_model: Model to post-process the data, should be one of Standard/MMR (experimental) :type post_model: str :param processes: How many Processes to use :type processes: int :param delete: Whether to delete raw posterior per locus :type delete: bool :param output: Whether to print extensive output :type output: bool :param save: Whether to save the inferred ROH :type save: bool :param save_fp: Whether to save the full posterior matrix :type save_fp: bool :param n_ref: Number of (diploid) reference Individuals to use :type n_ref: int :param diploid: Whether the reference panel is diploid or not (e.g., for autosome, True and for male X chromosome, False). :type diploid: bool :param exclude_pops: Which populations to exclude from reference :type exclude_pops: list of str :param readcounts: Whether to load readcount data :type readcounts: bool :param random_allele: Whether to pick a random of the two target alleles per locus :type random_allele: bool :param downsample: If not false (i.e. float), downsample readcounts to this target average coverage :param c: Contamination rate. This is only applicable if the emission model is readcount_contam. :type c: float :param conPop: Ancestry of contamination source. Only applicable if the emission model is readcount_contam. :type conPop: list of str :param roh_in: Parater to jump into ROH state (per Morgan) :type roh_in: float :param roh_out: Parameter to jump out of ROH state (per Morgan) :type roh_out: float :param roh_jump: Parameter to jump (per Morgan) :type roh_jump: float :param e_rate: Sequencing error rate. :type e_rate: float :param e_rate_ref: Haplotype miscopying rate. :type e_rate_ref: float :param cutoff_post: Posterior cutoff for ROH calling :type cutoff_post: float :param max_gap: Maximum gap to merge two adjacent short ROH blocks (in Morgan) :type max_gap: float :param roh_min_l_initial: Minimum length of ROH blocks to use before merging adjacent ones (in Morgan) :type roh_min_l_initial: float :param roh_min_l_final: Minimum length of ROH blcoks to output after merging (in Morgan) :type roh_min_l_final: float :param min_len1: Minimum length of the shorter candidate block in two adjacent blocks that can be merged (in Morgan) :type min_len1: float :param min_len2: Minimum length of the longer candidate block in two adjacent blocks that can be merged (in Morgan) :type min_len2: float :param logfile: Whether to use logfile :type logfile: bool :param combine: Wether to combine output of all chromosomes :type combine: bool :param file_result: Appendix to individual results :type file_result: str :param Return: :type Return: If combine is true, return a pandas dataframe that contains information of all detected ROH blocks. Otherwise nothing is returned. .. py:function:: hapsb_chromXs(iids=[['I15595', 'I15970']], processes=1, file_result='', ch=23, save=True, save_fp=False, n_ref=2504, diploid_ref=True, exclude_pops=[], e_model='EigenstratPacked', p_model='MosaicHDF5', readcounts=True, random_allele=True, downsample=False, post_model='Standard', path_targets=None, h5_path1000g=None, meta_path_ref=None, folder_out=None, prefix_out='', c=0.0, conPop=['CEU'], roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, max_gap=0.005, roh_min_l_initial=0.02, roh_min_l_final=0.05, min_len1=0.02, min_len2=0.04, cutoff_post=0.999, logfile=True) Run multiple X chromosome pairs. iids is list of two individuals. For parameters see hapsb_chrom. Additional: iids: list of pairs of iids to run [list of lists of len 2] processes: How many processes to use [int]. Think about sufficient memory. .. py:function:: hapCon_chrom_BFGS_legacy(iid='', hdf5=None, n_ref=2504, diploid_ref=False, exclude_pops=['AFR'], conPop=['CEU'], h5_path1000g=None, meta_path_ref=None, folder_out='', c=0.025, roh_jump=300, e_rate=0.01, e_rate_ref=0.001, output=False) .. py:function:: hapCon_chrom_BFGS(iid='', mpileup=None, bam=None, bamTable=None, q=30, Q=30, n_ref=2504, diploid_ref=True, exclude_pops=['AFR'], conPop=['CEU'], h5_path1000g=None, meta_path_ref=None, folder_out='', c=0.025, roh_jump=300, e_rate_ref=0.001, lowmem=False, logfile=False, output=False, cleanup=False, prefix='hapCon') Run HapCon to estimate male X chromosome contamination. :param iid: IID of the Target Individual, if not provided, will be deduced from the prefix of BAM or mpileup file. :type iid: str :param mpileup: path to mpileup file of chrX. :type mpileup: str :param bam: path to BAM file. :type bam: str :param bamTable: path to output of BamTable. You need to specify one of mpileup/BAM/bamTable. We recommend using BamTable as your first choice. :type bamTable: str :param q: minimum mappling quality for reads in BAM file to be considered. Only applicable if used in conjunction with the bam option. :type q: int :param Q: minimum base quality for reads in BAM file to be considered. Only applicable if used in conjunction with the bam option. :type Q: int :param h5_path1000g: Path of the reference genotypes. :type h5_path1000g: str :param meta_path_ref: Path of the meta file for the references. :type meta_path_ref: str :param folder_out: Path of the basis folder for output, if not provided, output will reside in the same directory as BAM or mpileup file. :type folder_out: str :param output: Whether to print extensive output. Default False. :type output: bool :param n_ref: Maximum Number of (diploid) reference Individuals to use. Default 2504. :type n_ref: int :param exclude_pops: Which populations to exclude from reference. Default is to exclude African haplotypes. Note that the population label supplied in the list must correspond to entries in the "pop" or "super_pop" column in the metadata file. A list of population labels (column "pop") can be seen https://www.coriell.org/1/NHGRI/Collections/1000-Genomes-Collections/1000-Genomes-Project. The "super_pop" column has 5 possible values: AFR, EUR, EAS, SAS, AMR. The column "pop" is a refinement of "super_pop". For example, to exclude AFR and EUR haplotypes, one can use exclude_pops=[AFR, EUR]. You can also mix "pop" and "super_pop" labels, for example, exluce_pops=[AFR, IBS]. The same principle applies for the conPop argument as well. :type exclude_pops: list of str :param conPop: use which population in the ref panel as the contaminating pop. If empty list, then use all samples in the ref panel to cauclate allele freq. Default is to CEU allele frequency. :type conPop: list of str :param c: initial contamination rate to start the BFGS optimization procedure. Default to 0.025. :type c: float :param roh_jump: Parameter to jump (per Morgan). Default to 300. :type roh_jump: float :param e_rate_ref: Haplotype coping error rate. Default to 0.001. :type e_rate_ref: float :param logfile: Whether to produce a logfile. Default False. :type logfile: bool :param cleanup: Whether to delete the intermediary .hdf5 file. Default False. :type cleanup: bool :param prefix: The output file will have file name $iid.$prefix.txt :type prefix: str :returns: * **conMLE** (*float*) -- MLE point estimate for contamination. * **lower95** (*float*) -- lower bound for the 95% CI of estimated contamination. * **upper95** (*float*) -- upper bound for the 95% CI of estimated contamination. .. py:function:: hapsb_chrom_lowmem(iid, ch=3, save=True, save_fp=False, n_ref=2504, diploid_ref=True, exclude_pops=[], e_model='readcount', p_model='HDF5_lowmem', readcounts=True, random_allele=True, downsample=False, post_model='Standard', path_targets=None, h5_path1000g=None, meta_path_ref=None, folder_out=None, prefix_out='', c=0.0, conPop=['CEU'], roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, max_gap=0.005, roh_min_l_initial=0.02, roh_min_l_final=0.05, min_len1=0.02, min_len2=0.04, cutoff_post=0.999, verbose=True, logfile=True) Run Hapsburg analysis for one chromosome on eigenstrat data Wrapper for HMM Class. :param iid: IID of the Target Individual, as found in Eigenstrat. :type iid: str :param ch: Which chromosomes to call ROH. :type ch: int :param path_targets: Path of the target files. You need only specify one of path_targets_prefix or path_targets. :type path_targets: str :param h5_path1000g: Path of the reference genotypes :type h5_path1000g: str :param meta_path_ref: Path of the meta file for the references :type meta_path_ref: str :param folder_out: Path of the basis folder for output :type folder_out: str :param prefix_out: Path to insert in output string, e.g. test/ [str] :type prefix_out: str :param e_model: Emission model to use, should be one of haploid/diploid_gt/readcount :type e_model: str :param p_model: Preprocessing model to use, should be one of EigenstratPacked/EigenstratUnpacked/MosaicHDF5 :type p_model: str :param post_model: The model to post-process the data, should be one of Standard/MMR (experimental) :type post_model: str :param save: Whether to save the inferred ROH :type save: bool :param save_fp: Whether to save the full posterior matrix :type save_fp: bool :param n_ref: Number of (diploid) reference Individuals to use :type n_ref: int :param diploid: Whether the reference panel is diploid or not (e.g., for autosome, True and for male X chromosome, False). :type diploid: bool :param exclude_pops: Which populations to exclude from the reference :type exclude_pops: list of str :param readcounts: Whether to load readcount data :type readcounts: bool :param random_allele: Whether to pick a random of the two target alleles per locus :type random_allele: bool :param downsample: If not false (i.e. float), downsample readcounts to this target average coverage :param c: Contamination rate. This is only applicable if the emission model is readcount_contam. :type c: float :param conPop: Ancestry of contamination source. Only applicable if the emission model is readcount_contam. :type conPop: list of str :param roh_in: Parater to jump into ROH state (per Morgan) :type roh_in: float :param roh_out: Parameter to jump out of ROH state (per Morgan) :type roh_out: float :param roh_jump: Parameter to jump (per Morgan) :type roh_jump: float :param e_rate: Sequencing error rate. :type e_rate: float :param e_rate_ref: Haplotype miscopying rate. :type e_rate_ref: float :param cutoff_post: Posterior cutoff for ROH calling :type cutoff_post: float :param max_gap: Maximum gap to merge two adjacent short ROH blocks (in Morgan) :type max_gap: float :param roh_min_l_initial: Minimum length of ROH blocks to use before merging adjacent ones (in Morgan) :type roh_min_l_initial: float :param roh_min_l_final: Minimum length of ROH blocks to output after merging (in Morgan) :type roh_min_l_final: float :param min_len1: Minimum length of the shorter candidate block in two adjacent blocks that can be merged (in Morgan) :type min_len1: float :param min_len2: Minimum length of the longer candidate block in two adjacent blocks that can be merged (in Morgan) :type min_len2: float :param logfile: Whether to use logfile :type logfile: bool .. py:function:: hapsb_ind_lowmem(iid, chs=range(1, 23), path_targets_prefix='', path_targets='', h5_path1000g=None, meta_path_ref=None, folder_out=None, prefix_out='', e_model='haploid', p_model='Eigenstrat', post_model='Standard', processes=1, delete=False, output=True, save=True, save_fp=False, n_ref=2504, diploid_ref=True, exclude_pops=[], readcounts=True, random_allele=True, downsample=False, c=0.0, conPop=['CEU'], roh_in=1, roh_out=20, roh_jump=300, e_rate=0.01, e_rate_ref=0.0, cutoff_post=0.999, max_gap=0.005, roh_min_l_initial=0.02, roh_min_l_final=0.04, min_len1=0.02, min_len2=0.04, verbose=True, logfile=True, combine=True, file_result='_roh_full.csv') Analyze a full single individual in a parallelized fashion. Run multiple chromosome analyses in parallel. It then brings together the result ROH tables from each chromosome into a genome-wide summary ROH table. This function wraps hapsb_chrom. The default Parameters are fine-tuned for pseudo-haploid 1240k aDNA data. :param iid: IID of the Target Individual, as found in Eigenstrat. :type iid: str :param chs: Which set of chromosomes to call ROH. :type chs: list :param path_targets_prefix: A directory containing an HDF5 file for each chromosome. The file name should follow $iid.chr$ch.hdf5. :param path_targets: Path of the target files. You need only specify one of path_targets_prefix or path_targets. :type path_targets: str :param h5_path1000g: Path of the reference genotypes :type h5_path1000g: str :param meta_path_ref: Path of the meta file for the references :type meta_path_ref: str :param folder_out: Path of the basis folder for output :type folder_out: str :param prefix_out: Path to insert in output string, e.g. test/ [str] :type prefix_out: str :param e_model: Emission model to use, should be one of haploid/diploid_gt/readcount :type e_model: str :param p_model: Preprocessing model to use, should be one of EigenstratPacked/EigenstratUnpacked/MosaicHDF5 :type p_model: str :param post_model: Model to post-process the data, should be one of Standard/MMR (experimental) :type post_model: str :param processes: How many Processes to use :type processes: int :param delete: Whether to delete raw posterior per locus :type delete: bool :param output: Whether to print extensive output :type output: bool :param save: Whether to save the inferred ROH :type save: bool :param save_fp: Whether to save the full posterior matrix :type save_fp: bool :param n_ref: Number of (diploid) reference Individuals to use :type n_ref: int :param diploid: Whether the reference panel is diploid or not (e.g., for autosome, True and for male X chromosome, False). :type diploid: bool :param exclude_pops: Which populations to exclude from reference :type exclude_pops: list of str :param readcounts: Whether to load readcount data :type readcounts: bool :param random_allele: Whether to pick a random of the two target alleles per locus :type random_allele: bool :param downsample: If not false (i.e. float), downsample readcounts to this target average coverage :param c: Contamination rate. This is only applicable if the emission model is readcount_contam. :type c: float :param conPop: Ancestry of contamination source. Only applicable if the emission model is readcount_contam. :type conPop: list of str :param roh_in: Parater to jump into ROH state (per Morgan) :type roh_in: float :param roh_out: Parameter to jump out of ROH state (per Morgan) :type roh_out: float :param roh_jump: Parameter to jump (per Morgan) :type roh_jump: float :param e_rate: Sequencing error rate. :type e_rate: float :param e_rate_ref: Haplotype miscopying rate. :type e_rate_ref: float :param cutoff_post: Posterior cutoff for ROH calling :type cutoff_post: float :param max_gap: Maximum gap to merge two adjacent short ROH blocks (in Morgan) :type max_gap: float :param roh_min_l_initial: Minimum length of ROH blocks to use before merging adjacent ones (in Morgan) :type roh_min_l_initial: float :param roh_min_l_final: Minimum length of ROH blcoks to output after merging (in Morgan) :type roh_min_l_final: float :param min_len1: Minimum length of the shorter candidate block in two adjacent blocks that can be merged (in Morgan) :type min_len1: float :param min_len2: Minimum length of the longer candidate block in two adjacent blocks that can be merged (in Morgan) :type min_len2: float :param logfile: Whether to use logfile :type logfile: bool :param combine: Whether to combine the output of all chromosomes :type combine: bool :param file_result: Appendix to individual results :type file_result: str :param Return: :type Return: If combine is true, return a pandas dataframe that contains information of all detected ROH blocks. Otherwise nothing is returned.