hapsburg.preprocessing
Class for preprocessing and loading the data. It is the interface to the data folders. It has subclasses for different types of data, as well as a factory method that returns the right subclass based on a keyword. Use this factory method to load classes.
@ Author: Harald Ringbauer, 2019, All rights reserved
Attributes
Classes
PreProcessing | Class for PreProcessing the Data.
PreProcessingHDF5 | Class for PreProcessing the Data.
PreProcessingEigenstrat | Class for PreProcessing Eigenstrat Files
PreProcessingEigenstratX | Class for PreProcessing Eigenstrat Files
PreProcessingHDF5Sim | Class for PreProcessing simulated 1000 Genome Data (Mosaic Simulations).
PreProcessingFolder | Preprocessing if data has been saved into a folder
Functions
load_preprocessing | Factory method to load the preprocessing class.
Module Contents
- class hapsburg.preprocessing.PreProcessing
Bases: object
Class for PreProcessing the Data. Standard: Intersect Reference Data with Individual Data and return the Intersection Dataset.
- save = True
- output = True
- iid = ''
- ch = 0
- segM = []
- n_ref = 0
- abstractmethod load_data(iid='MA89', ch=6)
Return Reference Matrix [k,l], Genotype/Readcount Matrix [2,l], as well as linkage map [l].
- set_params(**kwargs)
Set the Parameters. Takes keyworded arguments
- set_output_folder(iid, ch)
Set the output folder after folder_out. General Structure for HAPSBURG: folder_out/iid/chrX/
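The keyword-based parameter setting and the folder_out/iid/chrX/ layout described above can be illustrated with a small stand-in class. This is only a sketch: `ToyPreProcessing` is hypothetical and not the hapsburg class, and both the use of `setattr` in `set_params` and the fact that the folder is merely constructed (not created on disk) are assumptions that this page does not confirm.

```python
import os


class ToyPreProcessing:
    """Hypothetical stand-in for illustration; not the hapsburg class."""

    folder_out = "./Empirical/1240k/"
    save = True
    output = True

    def set_params(self, **kwargs):
        """Set attributes from keyword arguments (assumed behaviour)."""
        for key, value in kwargs.items():
            setattr(self, key, value)

    def set_output_folder(self, iid, ch):
        """Build the path folder_out/iid/chrX/, with X the chromosome label."""
        # The trailing "" makes os.path.join end the path with a separator.
        return os.path.join(self.folder_out, str(iid), f"chr{ch}", "")


pp_toy = ToyPreProcessing()
pp_toy.set_params(folder_out="./out/", save=False)
folder = pp_toy.set_output_folder("MA89", 6)
print(folder)
```

Keeping the path construction in one method means subclasses only need to override that method to change the layout, which is what the X chromosome subclass on this page appears to do.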
- class hapsburg.preprocessing.PreProcessingHDF5(conPop=[], save=True, output=True)
Bases: PreProcessing
Class for PreProcessing the Data. Standard: Intersect Reference Data with Individual Data and return the Intersection Dataset.
- path_targets = './../ancient-sardinia/output/h5_rev/mod_reich_sardinia_ancients_rev_mrg_dedup_3trm_anno.h5'
- h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'
- meta_path_ref = './Data/1000Genomes/Individuals/meta_df.csv'
- excluded = ['TSI']
- conPop = []
- folder_out = './Empirical/1240k/'
- prefix_out_data = ''
- samples_field = 'samples'
- readcounts = False
- diploid_ref = True
- random_allele = True
- only_calls = True
- flipstrand = True
- max_mm_rate = 0.9
- downsample = False
- save = True
- output = True
- get_index_iid_legacy(iid, fs=0)
Get the index of IID in fs. iid: IID to extract. fs: reference HDF5.
- get_index_iid(iid, f=0, samples_field='samples')
Get the index of IID in f. iid: IID to extract. f: reference HDF5.
- get_ref_ids(f, samples_field='samples')
OVERWRITE: Get the indices of the individuals in the HDF5 to extract. Here: allows subsetting to individuals from different 1000G populations. samples_field: field of all sample IIDs in the HDF5.
- load_data(iid='MA89', ch=6, start=-np.inf, end=np.inf)
Return Matrix of reference [k,l], Matrix of Individual Data [2,l], as well as linkage Map [l]
- optional_postprocessing(gts_ind, gts, r_map, pos, out_folder, pCon, read_counts=[])
Postprocessing steps of gts_ind, gts, r_map, and the folder, based on boolean fields of the class.
- destroy_phase_func(gts_ind, dtype='int8')
Randomly shuffles phase for gts [2,n_loci]
- markers_called(gts_ind, read_counts)
Return boolean array of markers which are called. If read_counts exist, use that for downsampling. Otherwise Use Genotype Field
- load_h5(path)
Load and return the HDF5 File from Path
- merge_2hdf(f, g, start=-np.inf, end=np.inf)
Merge two HDF5 files f and g. Return Indices of Overlap Individuals. f is the Sardinian HDF5, g the Reference HDF5.
- save_info(folder, cm_map, pos, gt_individual=[], read_counts=[])
Save Linkage Map, Readcount and Genotype Data per Individual (needed for later plotting). gt_individual: genotypes of the individual; if given, they are saved as well.
- extract_snps_contaminationPop(h5, ids_con, markers)
- extract_snps_hdf5(h5, ids_ref, markers, diploid=False, dtype='int8', removeIncompleteHap=True, start=0, end=-1)
Extract genotypes from h5 on ids and markers. If diploid, concatenate haplotypes along axis 0. Extract individuals first, then subset to SNPs in memory. Return 2D array [# haplotypes, # markers].
- extract_rc_hdf5(h5, ids, markers, dtype=np.int8)
Extract Readcount data from h5 on a single (!) id and markers. int8: Watch out - limited to max RC 127!
- extract_rmap_hdf5(h5, markers, col='variants/MAP')
Extract a column like rec map from h5. Can also be used for Positions
- get_segment(r_map, markers_obs, markers_ref)
Extract only markers in self.segM and downsamples the key array. Return downsampled versions.
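Several of the array operations described for this class (intersecting target and reference markers, shuffling the phase of a [2, n_loci] genotype array, and the int8 limit of 127 on read counts) can be sketched with NumPy alone. The function names and signatures below are hypothetical and are not the hapsburg methods. The sketch assumes that marker positions are unique within a chromosome and that clipping is an acceptable way to handle read counts above 127; the page itself only warns about the limit.

```python
import numpy as np


def intersect_markers(pos_target, pos_ref):
    """Return indices into both position arrays for the shared positions.

    Assumes positions are unique within each array.
    """
    _, idx_target, idx_ref = np.intersect1d(
        pos_target, pos_ref, return_indices=True
    )
    return idx_target, idx_ref


def shuffle_phase(gts_ind, rng):
    """Swap the two rows of a [2, n_loci] array at randomly chosen loci."""
    gts = np.asarray(gts_ind)
    swap = rng.integers(0, 2, size=gts.shape[1]).astype(bool)
    out = gts.copy()
    out[:, swap] = gts[::-1, swap]  # reversed row order at swapped loci
    return out


def to_int8_counts(read_counts):
    """Clip read counts to the int8 maximum (127) before casting."""
    max_rc = np.iinfo(np.int8).max
    return np.clip(read_counts, 0, max_rc).astype(np.int8)


pos_target = np.array([100, 250, 400, 900])
pos_ref = np.array([50, 250, 400, 1000])
idx_t, idx_r = intersect_markers(pos_target, pos_ref)

rng = np.random.default_rng(0)
gts_ind = np.array([[0, 1, 0, 1], [1, 1, 0, 0]], dtype="int8")
shuffled = shuffle_phase(gts_ind, rng)

counts = to_int8_counts(np.array([[3, 200], [0, 500]]))
```

Shuffling the phase leaves the per-locus allele count unchanged, so a quick sanity check is to compare the column sums before and after.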
- class hapsburg.preprocessing.PreProcessingEigenstrat(save=True, output=True, packed=1, sep='\\s+')
Bases: PreProcessingHDF5
Class for PreProcessing Eigenstrat Files. Same as PreProcessingHDF5 for the reference, but with Eigenstrat code for the target.
- path_targets = ''
- h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'
- packed = -1
- sep = '\\s+'
- flipstrand = True
- save = True
- output = True
- es_get_index_iid(es, iid)
Get the index of the IID in the Eigenstrat object.
- get_1000G_path(h5_path1000g, ch)
Construct and return the path to the 1000 Genome reference panel
- get_ref_ids(f, samples_field='samples')
OVERWRITE: Get the indices of the individuals in the HDF5 to extract. Here: allows subsetting to individuals from different 1000G populations. samples_field: field of all sample IIDs in the HDF5.
- optional_postprocessing(gts_ind, gts, r_map, pos, out_folder, read_counts=[])
Postprocessing steps of gts_ind, gts, r_map, and the folder, based on boolean fields of the class.
- load_data(iid='MA89', ch=6)
Return Matrix of reference [k,l], Matrix of Individual Data [2,l], as well as linkage Map [l] and the output folder. Save the loaded data if self.save==True. Various modifiers are set via class fields (see also PreProcessingHDF5).
- extract_snps_es(es, id, markers)
Use Eigenstrat object. Extract genotypes for individual index i (integer) for list of markers. Do conversion from Eigenstrat GT to format used here
- merge_es_hdf5(es, f_ref)
Merge Eigenstrat and HDF5 loci, return intersection indices. es: LoadEigenstrat object. f_ref: reference HDF5. If self.flipstrand: return indices [its], [its] and flip boolean vector [its].
- to_read_counts(gts_ind)
Transforms vector of genotypes [2,l] to vector of read counts [2,l]. Return this vector of read counts
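The conversion from genotypes to read counts can be pictured with a short NumPy sketch. The page does not state the encoding, so everything here is an assumption: 0 is taken to mean the reference allele, 1 the alternative allele, any other value missing, and the target is treated as pseudo-haploid so that only the first row is read and each called marker contributes a single read. `genotypes_to_read_counts` is a hypothetical name, not the hapsburg method.

```python
import numpy as np


def genotypes_to_read_counts(gts_ind):
    """Turn a [2, l] genotype array into a [2, l] read-count array.

    Assumed encoding: 0 = reference, 1 = alternative, anything else =
    missing. Only the first row is used (pseudo-haploid assumption).
    Row 0 of the result counts reference reads, row 1 alternative reads.
    """
    gts = np.asarray(gts_ind)
    read_counts = np.zeros(gts.shape, dtype="int8")
    read_counts[0, :] = gts[0, :] == 0
    read_counts[1, :] = gts[0, :] == 1
    return read_counts


gts_ind = np.array([[0, 1, -1, 1], [0, 1, -1, 1]])
rc = genotypes_to_read_counts(gts_ind)
```

Under these assumptions a missing genotype yields zero reads for both alleles, so downstream code can treat it like an uncovered marker.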
- class hapsburg.preprocessing.PreProcessingEigenstratX(save=True, output=True, packed=1, sep='\\s+')
Bases: PreProcessingEigenstrat
Class for PreProcessing Eigenstrat Files. Same as PreProcessingEigenstrat, but will load and combine two X chromosomes (they have to be male ones!).
- path_targets = ''
- h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'
- packed = -1
- sep = '\\s+'
- flipstrand = True
- set_output_folder(iid, ch='X')
Set the output folder after folder_out. General Structure for HAPSBURG: folder_out/iid1_iid2/chrX/. Return this folder.
- get_1000G_path(h5_path1000g, ch='X')
Construct and return the path to the 1000 Genome reference panel
- es_get_index_iid(es, iid)
Get index of IIDs in eigenstrat. Here for X: iid is a list
- extract_snps_es(es, id, markers)
Use Eigenstrat object. Extract genotypes for individual index i (integer) for list of markers. Do conversion from Eigenstrat GT to format used here
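Combining two male X chromosomes into one [2, l] genotype array, and naming the output folder after both individuals, can be sketched as follows. The helper names are hypothetical, and the sketch assumes each male contributes one haploid genotype vector over the same markers; how the real class extracts those vectors from the Eigenstrat object is not shown on this page.

```python
import os

import numpy as np


def combine_male_x(hap_a, hap_b):
    """Stack two haploid X genotype vectors [l] into one [2, l] array."""
    hap_a, hap_b = np.asarray(hap_a), np.asarray(hap_b)
    if hap_a.shape != hap_b.shape:
        raise ValueError("both haplotypes must cover the same markers")
    return np.stack([hap_a, hap_b], axis=0)


def x_output_folder(folder_out, iids):
    """Build folder_out/iid1_iid2/chrX/ from a list of two IIDs."""
    return os.path.join(folder_out, "_".join(iids), "chrX", "")


# Hypothetical IIDs and genotypes, for illustration only.
gts_x = combine_male_x([0, 1, 1, 0], [1, 1, 0, 0])
folder_x = x_output_folder("./out/", ["ind1", "ind2"])
```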
- class hapsburg.preprocessing.PreProcessingHDF5Sim(conPop=[], save=True, output=True)
Bases: PreProcessingHDF5
Class for PreProcessing simulated 1000 Genome Data (Mosaic Simulations). Same as PreProcessingHDF5 but with the right Folder Structure. MODIFY
- out_folder = ''
- prefix_out_data = ''
- path_targets = ''
- h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'
- class hapsburg.preprocessing.PreProcessingFolder(save=True, output=True)
Bases: PreProcessing
Preprocessing if data has been saved into a folder (such as in a Simulation)
- save = True
- output = True
- readcounts = False
- load_data(iid='MA89', ch=6, n_ref=503, folder='')
Return Matrix of reference [k,l], Matrix of Individual Data [2,l], as well as linkage Map [l]
- load_linkage_map(folder, nr_snps)
Load and Return the Linkage Map
- hapsburg.preprocessing.load_preprocessing(p_model='SardHDF5', conPop=[], save=True, output=True)
Factory method to load the preprocessing class. Returns the PreProcessing subclass selected by the keyword p_model.
- hapsburg.preprocessing.pp
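The keyword-based factory behind load_preprocessing can be pictured with a small registry. This is a sketch under stated assumptions, not the hapsburg implementation: the stub classes are hypothetical stand-ins, only the keyword 'SardHDF5' appears on this page (as the default of p_model), its mapping to an HDF5-style class is assumed, and the 'Eigenstrat' keyword is invented for illustration.

```python
class PreProcessingStub:
    """Hypothetical base class standing in for PreProcessing."""

    def __init__(self, save=True, output=True):
        self.save = save
        self.output = output


class HDF5Stub(PreProcessingStub):
    """Hypothetical stand-in for an HDF5-based loader."""

    def __init__(self, conPop=None, save=True, output=True):
        super().__init__(save=save, output=output)
        self.conPop = list(conPop) if conPop else []


class EigenstratStub(PreProcessingStub):
    """Hypothetical stand-in for an Eigenstrat-based loader."""


# Keyword -> class. Only "SardHDF5" is taken from the page; the rest is assumed.
_REGISTRY = {"SardHDF5": HDF5Stub, "Eigenstrat": EigenstratStub}


def load_preprocessing_sketch(p_model="SardHDF5", conPop=None,
                              save=True, output=True):
    """Return an instance of the class registered under p_model."""
    try:
        cls = _REGISTRY[p_model]
    except KeyError:
        raise NotImplementedError(
            f"Unknown preprocessing keyword: {p_model!r}"
        ) from None
    if cls is HDF5Stub:
        return cls(conPop=conPop, save=save, output=output)
    return cls(save=save, output=output)


loader = load_preprocessing_sketch("SardHDF5", save=False)
```

A dictionary registry keeps the keyword list in one place, so adding a new data type means adding one entry rather than another branch in the factory.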