hapsburg.preprocessing
======================

.. py:module:: hapsburg.preprocessing

.. autoapi-nested-parse::

   Class for preprocessing and loading the data.
   It is the interface to the data folders.
   It has subclasses for different types of data, as well as a factory method
   that returns the right subclass based on a keyword.
   Use this factory method to load classes.
   @ Author: Harald Ringbauer, 2019, All rights reserved



Attributes
----------

.. autoapisummary::

   hapsburg.preprocessing.pp


Classes
-------

.. autoapisummary::

   hapsburg.preprocessing.PreProcessing
   hapsburg.preprocessing.PreProcessingHDF5
   hapsburg.preprocessing.PreProcessingEigenstrat
   hapsburg.preprocessing.PreProcessingEigenstratX
   hapsburg.preprocessing.PreProcessingHDF5Sim
   hapsburg.preprocessing.PreProcessingFolder


Functions
---------

.. autoapisummary::

   hapsburg.preprocessing.load_preprocessing


Module Contents
---------------

.. py:class:: PreProcessing

   Bases: :py:obj:`object`


   Class for preprocessing the data.
   Standard: intersect the reference data with the individual data and
   return the intersection dataset.


   .. py:attribute:: save
      :value: True


   .. py:attribute:: output
      :value: True


   .. py:attribute:: iid
      :value: ''


   .. py:attribute:: ch
      :value: 0


   .. py:attribute:: segM
      :value: []


   .. py:attribute:: n_ref
      :value: 0


   .. py:method:: load_data(iid='MA89', ch=6)
      :abstractmethod:


      Return the reference matrix [k,l], the genotype/readcount matrix [2,l],
      and the linkage map [l].



   .. py:method:: set_params(**kwargs)

      Set the parameters.
      Takes keyword arguments.



   .. py:method:: set_output_folder(iid, ch)

      Set the output folder after folder_out.
      General structure for HAPSBURG: folder_out/iid/chrX/



.. py:class:: PreProcessingHDF5(conPop=[], save=True, output=True)

   Bases: :py:obj:`PreProcessing`


   Class for preprocessing the data.
   Standard: intersect the reference data with the individual data and
   return the intersection dataset.


   .. py:attribute:: path_targets
      :value: './../ancient-sardinia/output/h5_rev/mod_reich_sardinia_ancients_rev_mrg_dedup_3trm_anno.h5'


   .. py:attribute:: h5_path1000g
      :value: './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'


   .. py:attribute:: meta_path_ref
      :value: './Data/1000Genomes/Individuals/meta_df.csv'


   .. py:attribute:: excluded
      :value: ['TSI']


   .. py:attribute:: conPop
      :value: []


   .. py:attribute:: folder_out
      :value: './Empirical/1240k/'


   .. py:attribute:: prefix_out_data
      :value: ''


   .. py:attribute:: samples_field
      :value: 'samples'


   .. py:attribute:: readcounts
      :value: False


   .. py:attribute:: diploid_ref
      :value: True


   .. py:attribute:: random_allele
      :value: True


   .. py:attribute:: only_calls
      :value: True


   .. py:attribute:: flipstrand
      :value: True


   .. py:attribute:: max_mm_rate
      :value: 0.9


   .. py:attribute:: downsample
      :value: False


   .. py:attribute:: save
      :value: True


   .. py:attribute:: output
      :value: True


   .. py:method:: get_index_iid_legacy(iid, fs=0)

      Get the index of IID in fs.
      iid: sample to extract
      fs: reference HDF5



   .. py:method:: get_index_iid(iid, f=0, samples_field='samples')

      Get the index of IID in fs.
      iid: sample to extract
      fs: reference HDF5



   .. py:method:: get_ref_ids(f, samples_field='samples')

      OVERWRITE: Get the indices of the individuals in the HDF5 to extract.
      Here: allow subsetting to individuals from different 1000G populations.
      samples_field: field of all sample iids in the HDF5



   .. py:method:: load_data(iid='MA89', ch=6, start=-np.inf, end=np.inf)

      Return the matrix of reference haplotypes [k,l], the matrix of
      individual data [2,l], and the linkage map [l].



   .. py:method:: optional_postprocessing(gts_ind, gts, r_map, pos, out_folder, pCon, read_counts=[])

      Postprocessing steps of gts_ind, gts, r_map, and the folder,
      based on boolean fields of the class.



   .. py:method:: destroy_phase_func(gts_ind, dtype='int8')

      Randomly shuffle the phase for gts [2,n_loci].



   .. py:method:: markers_called(gts_ind, read_counts)

      Return a boolean array of the markers that are called.
      If read_counts exist, use them for downsampling.
      Otherwise use the genotype field.



   .. py:method:: load_h5(path)

      Load and return the HDF5 file from path.



   .. py:method:: merge_2hdf(f, g, start=-np.inf, end=np.inf)

      Merge two HDF5 files f and g. Return the indices of the overlap.
      f is the Sardinian HDF5, g the reference HDF5.



   .. py:method:: save_info(folder, cm_map, pos, gt_individual=[], read_counts=[])

      Save linkage map, readcount and genotype data per individual
      (needed for later plotting).
      Genotypes of the individual: if given, save them as well.



   .. py:method:: extract_snps_contaminationPop(h5, ids_con, markers)


   .. py:method:: extract_snps_hdf5(h5, ids_ref, markers, diploid=False, dtype='int8', removeIncompleteHap=True, start=0, end=-1)

      Extract genotypes from h5 on ids and markers.
      If diploid, concatenate haplotypes along axis 0.
      Extract individuals first, and then subset to SNPs in memory.
      Return a 2D array [# haplotypes, # markers].



   .. py:method:: extract_rc_hdf5(h5, ids, markers, dtype=np.int8)

      Extract readcount data from h5 on a single (!) id and markers.
      int8: Watch out, this is limited to a maximum read count of 127!



   .. py:method:: extract_rmap_hdf5(h5, markers, col='variants/MAP')

      Extract a column such as the recombination map from h5.
      Can also be used for positions.



   .. py:method:: get_segment(r_map, markers_obs, markers_ref)

      Extract only the markers in self.segM and downsample the key array.
      Return the downsampled versions.



.. py:class:: PreProcessingEigenstrat(save=True, output=True, packed=1, sep='\\s+')

   Bases: :py:obj:`PreProcessingHDF5`


   Class for preprocessing Eigenstrat files.
   Same as PreProcessingHDF5 for the reference, but with Eigenstrat code
   for the target.


   .. py:attribute:: path_targets
      :value: ''


   .. py:attribute:: h5_path1000g
      :value: './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'


   .. py:attribute:: packed
      :value: -1


   .. py:attribute:: sep
      :value: '\\s+'


   .. py:attribute:: flipstrand
      :value: True


   .. py:attribute:: save
      :value: True


   .. py:attribute:: output
      :value: True


   .. py:method:: es_get_index_iid(es, iid)

      Get the index of the IID.



   .. py:method:: get_1000G_path(h5_path1000g, ch)

      Construct and return the path to the 1000 Genomes reference panel.



   .. py:method:: get_ref_ids(f, samples_field='samples')

      OVERWRITE: Get the indices of the individuals in the HDF5 to extract.
      Here: allow subsetting to individuals from different 1000G populations.
      samples_field: field of all sample iids in the HDF5



   .. py:method:: optional_postprocessing(gts_ind, gts, r_map, pos, out_folder, read_counts=[])

      Postprocessing steps of gts_ind, gts, r_map, and the folder,
      based on boolean fields of the class.



   .. py:method:: load_data(iid='MA89', ch=6)

      Return the matrix of reference haplotypes [k,l], the matrix of
      individual data [2,l], the linkage map [l], and the output folder.
      Save the loaded data if self.save==True.
      Various modifiers are set in class fields (see also PreProcessingHDF5).



   .. py:method:: extract_snps_es(es, id, markers)

      Use the Eigenstrat object.
      Extract genotypes for individual index i (integer) for a list of markers.
      Convert from Eigenstrat GT to the format used here.



   .. py:method:: merge_es_hdf5(es, f_ref)

      Merge Eigenstrat and HDF5 loci and return the intersection indices.
      es: LoadEigenstrat object
      f_ref: reference HDF5
      If self.flipstrand: return indices [its], [its] and the flip boolean
      vector [its].



   .. py:method:: to_read_counts(gts_ind)

      Transform a vector of genotypes [2,l] into a vector of read counts [2,l].
      Return this vector of read counts.



.. py:class:: PreProcessingEigenstratX(save=True, output=True, packed=1, sep='\\s+')

   Bases: :py:obj:`PreProcessingEigenstrat`


   Class for preprocessing Eigenstrat files.
   Same as PreProcessingEigenstrat, but loads and combines two
   X chromosomes (these have to be male ones!).


   .. py:attribute:: path_targets
      :value: ''


   .. py:attribute:: h5_path1000g
      :value: './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'


   .. py:attribute:: packed
      :value: -1


   .. py:attribute:: sep
      :value: '\\s+'


   .. py:attribute:: flipstrand
      :value: True


   .. py:method:: set_output_folder(iid, ch='X')

      Set the output folder after folder_out.
      General structure for HAPSBURG: folder_out/iid1_iid2/chrX/
      Return this folder.



   .. py:method:: get_1000G_path(h5_path1000g, ch='X')

      Construct and return the path to the 1000 Genomes reference panel.



   .. py:method:: es_get_index_iid(es, iid)

      Get the index of the IIDs in the Eigenstrat file.
      Here for X: iid is a list.



   .. py:method:: extract_snps_es(es, id, markers)

      Use the Eigenstrat object.
      Extract genotypes for individual index i (integer) for a list of markers.
      Convert from Eigenstrat GT to the format used here.



.. py:class:: PreProcessingHDF5Sim(conPop=[], save=True, output=True)

   Bases: :py:obj:`PreProcessingHDF5`


   Class for preprocessing simulated 1000 Genomes data (mosaic simulations).
   Same as PreProcessingHDF5 but with the right folder structure.
   MODIFY


   .. py:attribute:: out_folder
      :value: ''


   .. py:attribute:: prefix_out_data
      :value: ''


   .. py:attribute:: path_targets
      :value: ''


   .. py:attribute:: h5_path1000g
      :value: './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'


.. py:class:: PreProcessingFolder(save=True, output=True)

   Bases: :py:obj:`PreProcessing`


   Preprocessing if the data has been saved into a folder
   (such as in a simulation).


   .. py:attribute:: save
      :value: True


   .. py:attribute:: output
      :value: True


   .. py:attribute:: readcounts
      :value: False


   .. py:method:: load_data(iid='MA89', ch=6, n_ref=503, folder='')

      Return the matrix of reference haplotypes [k,l], the matrix of
      individual data [2,l], and the linkage map [l].



   .. py:method:: load_linkage_map(folder, nr_snps)

      Load and return the linkage map.



.. py:function:: load_preprocessing(p_model='SardHDF5', conPop=[], save=True, output=True)

   Factory method to load the Transition Model.
   Return


.. py:data:: pp
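
The module description says that a factory method returns the right subclass based on a keyword. The sketch below illustrates that pattern with small stand-in classes. It is not the hapsburg implementation: the classes are simplified placeholders, the name ``load_preprocessing_sketch`` and the ``"Folder"`` keyword are hypothetical, and the mapping of ``'SardHDF5'`` (the documented default of ``p_model``) to ``PreProcessingHDF5`` is an assumption.

```python
class PreProcessing:
    """Stand-in base class (NOT the real hapsburg class)."""

    def __init__(self, save=True, output=True):
        self.save = save
        self.output = output

    def load_data(self, iid="MA89", ch=6):
        # Abstract in the documented API: subclasses provide the loading logic.
        raise NotImplementedError("Implemented by subclasses")

    def set_params(self, **kwargs):
        # Set the parameters from keyword arguments, as documented above.
        for key, value in kwargs.items():
            setattr(self, key, value)


class PreProcessingHDF5(PreProcessing):
    """Stand-in for the HDF5 subclass; only the constructor is sketched."""

    def __init__(self, conPop=None, save=True, output=True):
        super().__init__(save=save, output=output)
        # Use None as the default to avoid sharing one mutable list.
        self.conPop = [] if conPop is None else list(conPop)


class PreProcessingFolder(PreProcessing):
    """Stand-in for the folder-based subclass."""


def load_preprocessing_sketch(p_model="SardHDF5", conPop=None,
                              save=True, output=True):
    """Return a preprocessing object chosen by keyword (illustrative only)."""
    registry = {
        # Assumption: the default keyword selects the HDF5 subclass.
        "SardHDF5": lambda: PreProcessingHDF5(conPop=conPop, save=save,
                                              output=output),
        # Hypothetical keyword, used here only for illustration.
        "Folder": lambda: PreProcessingFolder(save=save, output=output),
    }
    if p_model not in registry:
        raise ValueError(f"Unknown preprocessing model: {p_model!r}")
    return registry[p_model]()


# Usage: pick a subclass by keyword, then adjust fields via set_params.
pp_demo = load_preprocessing_sketch("SardHDF5", conPop=["CEU"])
pp_demo.set_params(readcounts=True, max_mm_rate=0.9)
```

Keeping the keyword-to-class mapping in one dictionary means a new data source only needs a new subclass and one registry entry, while callers keep using the same factory call.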