hapsburg.preprocessing

Classes for preprocessing and loading the data. This module is the interface to the data folders. It has subclasses for different types of data, as well as a factory method that returns the right subclass based on a keyword. Use this factory method to load classes. @ Author: Harald Ringbauer, 2019, All rights reserved

Attributes

pp

Classes

`PreProcessing`	Class for PreProcessing the Data.
`PreProcessingHDF5`	Class for PreProcessing the Data.
`PreProcessingEigenstrat`	Class for PreProcessing Eigenstrat Files
`PreProcessingEigenstratX`	Class for PreProcessing Eigenstrat Files
`PreProcessingHDF5Sim`	Class for PreProcessing simulated 1000 Genome Data (Mosaic Simulations).
`PreProcessingFolder`	Preprocessing if data has been saved into a folder

Functions

load_preprocessing([p_model, conPop, save, output])

Factory method to load the preprocessing class.

Module Contents

class hapsburg.preprocessing.PreProcessing

Bases: object

Class for preprocessing the data. Standard behavior: intersect the reference data with the individual data and return the intersection dataset.

save = True

output = True

iid = ''

ch = 0

segM = []

n_ref = 0

abstractmethod load_data(iid='MA89', ch=6): Return Reference Matrix [k,l], Genotype/Readcount Matrix [2,l] as well as linkage map [l]

set_params(**kwargs): Set the Parameters. Takes keyworded arguments

set_output_folder(iid, ch): Set the output folder after folder_out. General Structure for HAPSBURG: folder_out/iid/chrX/
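The folder layout described for set_output_folder can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `build_output_folder` and the use of `os.path.join` are assumptions, and only the `folder_out/iid/chrX/` pattern comes from the description above.

```python
import os


def build_output_folder(folder_out, iid, ch):
    """Hypothetical helper: build folder_out/<iid>/chr<ch>/ as a path string."""
    # The trailing empty component makes os.path.join end the path with a separator.
    return os.path.join(folder_out, str(iid), f"chr{ch}", "")


path = build_output_folder("./Empirical/1240k/", "MA89", 6)
print(path)
```

Creating the directory itself (for example with `os.makedirs(path, exist_ok=True)`) is left out so that the sketch has no side effects.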

class hapsburg.preprocessing.PreProcessingHDF5(conPop=[], save=True, output=True)

Bases: PreProcessing

Class for preprocessing the data. Standard behavior: intersect the reference data with the individual data and return the intersection dataset.

path_targets = './../ancient-sardinia/output/h5_rev/mod_reich_sardinia_ancients_rev_mrg_dedup_3trm_anno.h5'

h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'

meta_path_ref = './Data/1000Genomes/Individuals/meta_df.csv'

excluded = ['TSI']

conPop = []

folder_out = './Empirical/1240k/'

prefix_out_data = ''

samples_field = 'samples'

readcounts = False

diploid_ref = True

random_allele = True

only_calls = True

flipstrand = True

max_mm_rate = 0.9

downsample = False

save = True

output = True

get_index_iid_legacy(iid, fs=0): Get the index of iid in fs. iid: sample ID to extract. fs: reference HDF5.

get_index_iid(iid, f=0, samples_field='samples'): Get the index of iid in f. iid: sample ID to extract. f: reference HDF5.

get_ref_ids(f, samples_field='samples'): OVERWRITE: Get the indices of the individuals in the HDF5 to extract. Here: allow subsetting to individuals from different 1000G populations. samples_field: field of all sample iids in the HDF5.

load_data(iid='MA89', ch=6, start=-np.inf, end=np.inf): Return Matrix of reference [k,l], Matrix of Individual Data [2,l], as well as linkage Map [l]

optional_postprocessing(gts_ind, gts, r_map, pos, out_folder, pCon, read_counts=[]): Postprocessing steps of gts_ind, gts, r_map, and the folder, based on boolean fields of the class.

destroy_phase_func(gts_ind, dtype='int8'): Randomly shuffles phase for gts [2,n_loci]
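As a rough illustration of what "randomly shuffles phase" can mean for a [2, n_loci] genotype matrix, the sketch below swaps the two haplotype rows independently at each locus with probability 1/2. The function name `shuffle_phase`, the `rng` parameter, and the use of plain lists instead of an int8 array are assumptions made for this example; the library's own implementation may differ.

```python
import random


def shuffle_phase(gts_ind, rng=None):
    """Hypothetical sketch: per locus, swap the two haplotype entries with probability 1/2."""
    rng = rng or random.Random()
    top, bottom = [], []
    for a, b in zip(gts_ind[0], gts_ind[1]):
        if rng.random() < 0.5:
            a, b = b, a  # swap the phase at this locus
        top.append(a)
        bottom.append(b)
    return [top, bottom]


gts = [[0, 1, 1, 0], [1, 1, 0, 0]]
shuffled = shuffle_phase(gts, random.Random(0))
print(shuffled)
```

The unordered genotype at every locus is unchanged; only the assignment of alleles to the two rows is randomized.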

markers_called(gts_ind, read_counts): Return boolean array of markers which are called. If read_counts exist, use that for downsampling. Otherwise, use the genotype field.

load_h5(path): Load and return the HDF5 File from Path

merge_2hdf(f, g, start=-np.inf, end=np.inf): Merge two HDF5 files f and g. Return Indices of Overlap Individuals. f is the Sardinian HDF5, g the reference HDF5.
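A merge of this kind can be pictured as an index intersection. The sketch below assumes the overlap is computed on marker positions and that `start` and `end` bound those positions; the description does not state either explicitly. The function name `intersect_positions` and the plain-list inputs are hypothetical.

```python
def intersect_positions(pos_f, pos_g, start=float("-inf"), end=float("inf")):
    """Hypothetical sketch: return index lists (i_f, i_g) of positions present in both inputs."""
    # Map each position in g to its index for constant-time lookup.
    lookup = {p: j for j, p in enumerate(pos_g)}
    i_f, i_g = [], []
    for i, p in enumerate(pos_f):
        j = lookup.get(p)
        if j is not None and start <= p <= end:
            i_f.append(i)
            i_g.append(j)
    return i_f, i_g


i_f, i_g = intersect_positions([100, 200, 300, 400], [200, 250, 400, 500])
print(i_f, i_g)  # [1, 3] [0, 2]
```

Returning index pairs rather than copies of the data lets the caller subset genotypes, positions, and the linkage map consistently.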

save_info(folder, cm_map, pos, gt_individual=[], read_counts=[]): Save linkage map, readcount and genotype data per individual (needed for later plotting). gt_individual: if given, save as well.

extract_snps_contaminationPop(h5, ids_con, markers)

extract_snps_hdf5(h5, ids_ref, markers, diploid=False, dtype='int8', removeIncompleteHap=True, start=0, end=-1): Extract genotypes from h5 on ids and markers. If diploid, concatenate haplotypes along 0 axis. Extract individuals first, and then subset to SNPs in memory. Return 2D array [# haplotypes, # markers]

extract_rc_hdf5(h5, ids, markers, dtype=np.int8): Extract readcount data from h5 on a single (!) id and markers. int8: Watch out - limited to max RC 127!
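The int8 warning can be made concrete with a small example. A signed 8-bit integer holds values from -128 to 127, so larger read counts need handling before storage. Clamping, shown below, is one possible guard and is not stated to be what the library does; the helper name `to_int8_counts` is hypothetical, and the standard-library `array` module stands in for a NumPy int8 array.

```python
from array import array


def to_int8_counts(counts, cap=127):
    """Hypothetical helper: clamp read counts to the int8 maximum before storing them."""
    # Type code "b" is a signed 8-bit integer, the same range as int8.
    return array("b", (min(int(c), cap) for c in counts))


safe = to_int8_counts([3, 0, 250])
print(safe.tolist())  # [3, 0, 127]

try:
    array("b", [250])  # without clamping, 250 does not fit into 8 signed bits
except OverflowError as err:
    print("overflow:", err)
```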

extract_rmap_hdf5(h5, markers, col='variants/MAP'): Extract a column like rec map from h5. Can also be used for Positions

get_segment(r_map, markers_obs, markers_ref): Extract only markers in self.segM and downsamples the key array. Return downsampled versions.
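One way to read this description is as a filter on the linkage map, applied consistently to all three arrays. The sketch below assumes a segment is given as a `(start, end)` pair in the same units as `r_map` and that the three inputs have equal length; neither assumption is confirmed by the description, and the name `restrict_to_segment` is hypothetical.

```python
def restrict_to_segment(r_map, markers_obs, markers_ref, seg):
    """Hypothetical sketch: keep only entries whose map position lies inside seg = (start, end)."""
    lo, hi = seg
    keep = [i for i, r in enumerate(r_map) if lo <= r <= hi]
    return (
        [r_map[i] for i in keep],
        [markers_obs[i] for i in keep],
        [markers_ref[i] for i in keep],
    )


r_sub, obs_sub, ref_sub = restrict_to_segment(
    [0.0, 0.1, 0.2, 0.3], [10, 11, 12, 13], [20, 21, 22, 23], seg=(0.1, 0.25)
)
print(r_sub, obs_sub, ref_sub)  # [0.1, 0.2] [11, 12] [21, 22]
```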

class hapsburg.preprocessing.PreProcessingEigenstrat(save=True, output=True, packed=1, sep='\\s+')

Bases: PreProcessingHDF5

Class for preprocessing Eigenstrat files. Same as PreProcessingHDF5 for the reference, but with Eigenstrat code for the target.

path_targets = ''

h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'

packed = -1

sep = '\\s+'

flipstrand = True

save = True

output = True

es_get_index_iid(es, iid): Get the index of iid in the Eigenstrat object.

get_1000G_path(h5_path1000g, ch): Construct and return the path to the 1000 Genome reference panel

get_ref_ids(f, samples_field='samples'): OVERWRITE: Get the indices of the individuals in the HDF5 to extract. Here: allow subsetting to individuals from different 1000G populations. samples_field: field of all sample iids in the HDF5.

optional_postprocessing(gts_ind, gts, r_map, pos, out_folder, read_counts=[]): Postprocessing steps of gts_ind, gts, r_map, and the folder, based on boolean fields of the class.

load_data(iid='MA89', ch=6): Return Matrix of reference [k,l], Matrix of Individual Data [2,l], as well as linkage Map [l] and the output folder. Save the loaded data if self.save==True Various modifiers in class fields (check also PreProcessingHDF5)

extract_snps_es(es, id, markers): Use Eigenstrat object. Extract genotypes for individual index i (integer) for list of markers. Do conversion from Eigenstrat GT to format used here

merge_es_hdf5(es, f_ref): Merge Eigenstrat and HDF5 loci and return the intersection indices. es: LoadEigenstrat object. f_ref: reference HDF5. If self.flipstrand, return indices [its], [its] and flip boolean vector [its].
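The idea of returning two index vectors plus a flip vector can be sketched as below. This is an illustration under assumptions, not the library's algorithm: loci are matched on position, a locus counts as flipped when the reference and alternative alleles are swapped between the two sources, and loci with otherwise disagreeing alleles are dropped. The function name and the tuple layout `(position, ref, alt)` are hypothetical.

```python
def intersect_with_flips(es_snps, ref_snps):
    """Hypothetical sketch: match loci by position and flag allele-swapped ones.

    Both inputs are lists of (position, ref_allele, alt_allele) tuples.
    Returns (indices into es_snps, indices into ref_snps, flip flags).
    """
    lookup = {pos: j for j, (pos, _, _) in enumerate(ref_snps)}
    i_es, i_ref, flipped = [], [], []
    for i, (pos, ref, alt) in enumerate(es_snps):
        j = lookup.get(pos)
        if j is None:
            continue  # position not in the reference panel
        _, ref2, alt2 = ref_snps[j]
        if (ref, alt) == (ref2, alt2):
            flip = False
        elif (ref, alt) == (alt2, ref2):
            flip = True  # same alleles, opposite coding
        else:
            continue  # alleles disagree, drop the locus
        i_es.append(i)
        i_ref.append(j)
        flipped.append(flip)
    return i_es, i_ref, flipped


es = [(100, "A", "G"), (200, "C", "T"), (300, "G", "A"), (400, "A", "C")]
ref = [(100, "A", "G"), (300, "A", "G"), (400, "G", "T"), (500, "C", "T")]
result = intersect_with_flips(es, ref)
print(result)  # ([0, 2], [0, 1], [False, True])
```

All three returned lists have the length of the intersection, matching the three [its] vectors in the description.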

to_read_counts(gts_ind): Transforms vector of genotypes [2,l] to vector of read counts [2,l]. Return this vector of read counts
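One plausible conversion from genotypes to read counts treats each of the two alleles at a locus as a single read. The sketch below assumes row 0 of the output counts reference alleles (coded 0) and row 1 counts alternative alleles (coded 1); that layout is an assumption, as are the function name and the use of plain lists.

```python
def genotypes_to_read_counts(gts_ind):
    """Hypothetical sketch: count 0-alleles (row 0) and 1-alleles (row 1) per locus."""
    pairs = list(zip(gts_ind[0], gts_ind[1]))
    ref_counts = [int(a == 0) + int(b == 0) for a, b in pairs]
    alt_counts = [int(a == 1) + int(b == 1) for a, b in pairs]
    return [ref_counts, alt_counts]


rc = genotypes_to_read_counts([[0, 1, 1], [0, 0, 1]])
print(rc)  # [[2, 1, 0], [0, 1, 2]]
```

The output keeps the [2, l] shape of the input, so it can be passed wherever a read count matrix is expected.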

class hapsburg.preprocessing.PreProcessingEigenstratX(save=True, output=True, packed=1, sep='\\s+')

Bases: PreProcessingEigenstrat

Class for preprocessing Eigenstrat files. Same as PreProcessingEigenstrat, but loads and combines two X chromosomes (both have to be from males!).

path_targets = ''

h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'

packed = -1

sep = '\\s+'

flipstrand = True

set_output_folder(iid, ch='X'): Set the output folder after folder_out. General structure for HAPSBURG: folder_out/iid1_iid2/chrX/. Return this folder.
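For the two-individual X chromosome case, the folder name combines both sample IDs. A minimal sketch of the `folder_out/iid1_iid2/chrX/` pattern follows; the helper name `build_x_output_folder` and the joining with an underscore via `"_".join` are assumptions based only on the pattern shown.

```python
import os


def build_x_output_folder(folder_out, iids, ch="X"):
    """Hypothetical helper: build folder_out/<iid1>_<iid2>/chr<ch>/ from a list of sample IDs."""
    combined = "_".join(str(i) for i in iids)
    # The trailing empty component makes the path end with a separator.
    return os.path.join(folder_out, combined, f"chr{ch}", "")


x_path = build_x_output_folder("./Empirical/1240k/", ["I0001", "I0002"])
print(x_path)
```

The sample IDs `I0001` and `I0002` are placeholders for illustration.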

get_1000G_path(h5_path1000g, ch='X'): Construct and return the path to the 1000 Genome reference panel

es_get_index_iid(es, iid): Get index of IIDs in eigenstrat. Here for X: iid is a list

extract_snps_es(es, id, markers): Use Eigenstrat object. Extract genotypes for individual index i (integer) for list of markers. Do conversion from Eigenstrat GT to format used here

class hapsburg.preprocessing.PreProcessingHDF5Sim(conPop=[], save=True, output=True)

Bases: PreProcessingHDF5

Class for preprocessing simulated 1000 Genome data (mosaic simulations). Same as PreProcessingHDF5, but with the right folder structure. MODIFY

out_folder = ''

prefix_out_data = ''

path_targets = ''

h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'

class hapsburg.preprocessing.PreProcessingFolder(save=True, output=True)

Bases: PreProcessing

Preprocessing if data has been saved into a folder (such as in a Simulation)

save = True

output = True

readcounts = False

load_data(iid='MA89', ch=6, n_ref=503, folder=''): Return Matrix of reference [k,l], Matrix of Individual Data [2,l], as well as linkage Map [l]

load_linkage_map(folder, nr_snps): Load and Return the Linkage Map

hapsburg.preprocessing.load_preprocessing(p_model='SardHDF5', conPop=[], save=True, output=True): Factory method to load the preprocessing class. Returns the subclass selected by the keyword p_model.
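The keyword-based factory pattern can be sketched in a self-contained way. The stub classes below only mimic the names on this page so that the example runs without hapsburg installed. The mapping of `'SardHDF5'` to `PreProcessingHDF5`, the extra keyword `'Folder'`, and the error raised for unknown keywords are all assumptions, not documented behavior.

```python
class PreProcessing:
    """Stub base class standing in for hapsburg.preprocessing.PreProcessing."""

    def __init__(self, save=True, output=True):
        self.save = save
        self.output = output


class PreProcessingHDF5(PreProcessing):
    """Stub subclass (assumed target of the 'SardHDF5' keyword)."""


class PreProcessingFolder(PreProcessing):
    """Stub subclass (the 'Folder' keyword is hypothetical)."""


# Keyword -> class registry; the keys are illustrative.
_REGISTRY = {"SardHDF5": PreProcessingHDF5, "Folder": PreProcessingFolder}


def load_preprocessing_sketch(p_model="SardHDF5", save=True, output=True):
    """Hypothetical factory: return an instance of the subclass registered under p_model."""
    try:
        cls = _REGISTRY[p_model]
    except KeyError:
        raise NotImplementedError(f"Unknown preprocessing keyword: {p_model!r}") from None
    return cls(save=save, output=output)


pp_obj = load_preprocessing_sketch("SardHDF5", save=False)
print(type(pp_obj).__name__, pp_obj.save)  # PreProcessingHDF5 False
```

A registry dictionary keeps the keyword lookup in one place, so adding a new data type only requires one new entry.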

hapsburg.preprocessing.pp