hapsburg.preprocessing

Classes for preprocessing and loading the data. This module is the interface to the data folders. It has subclasses for different types of data, as well as a factory method that returns the right subclass based on a keyword. Use this factory method to load classes. Author: Harald Ringbauer, 2019. All rights reserved.

Attributes

pp

Classes

PreProcessing

Class for PreProcessing the Data.

PreProcessingHDF5

Class for PreProcessing the Data.

PreProcessingEigenstrat

Class for PreProcessing Eigenstrat Files

PreProcessingEigenstratX

Class for PreProcessing Eigenstrat Files

PreProcessingHDF5Sim

Class for PreProcessing simulated 1000 Genome Data (Mosaic Simulations).

PreProcessingFolder

Preprocessing if data has been saved into a folder

Functions

load_preprocessing([p_model, conPop, save, output])

Factory method to load the preprocessing class.

Module Contents

class hapsburg.preprocessing.PreProcessing

Bases: object

Class for preprocessing the data. Standard: intersect the reference data with the individual data and return the intersection dataset.

save = True
output = True
iid = ''
ch = 0
segM = []
n_ref = 0
abstractmethod load_data(iid='MA89', ch=6)

Return the reference matrix [k,l], the genotype/readcount matrix [2,l], as well as the linkage map [l].

set_params(**kwargs)

Set the Parameters. Takes keyworded arguments
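
A keyword-based setter of this kind is commonly written with setattr. The following is a minimal sketch of that pattern, not the HAPSBURG source: the class name ParamHolder and the rejection of unknown names are assumptions made for illustration.

```python
class ParamHolder:
    """Hypothetical stand-in for a class with PreProcessing-style fields."""

    save = True
    output = True
    readcounts = False

    def set_params(self, **kwargs):
        """Overwrite existing fields from keyword arguments."""
        for key, value in kwargs.items():
            # Assumption: reject unknown names so that typos do not pass silently.
            if not hasattr(self, key):
                raise AttributeError(f"Unknown parameter: {key}")
            setattr(self, key, value)


holder = ParamHolder()
holder.set_params(save=False, readcounts=True)
print(holder.save, holder.readcounts)  # False True
```

Setting values on the instance leaves the class-level defaults untouched, so other instances keep their defaults.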

set_output_folder(iid, ch)

Set the output folder after folder_out. General Structure for HAPSBURG: folder_out/iid/chrX/
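
The folder layout described above can be sketched as follows. The helper name build_output_folder is hypothetical, and the sketch assumes that the X in chrX stands for the chromosome label passed as ch.

```python
import posixpath


def build_output_folder(folder_out, iid, ch):
    """Return 'folder_out/iid/chr<ch>/' with a trailing slash."""
    # The empty last component makes posixpath.join append the trailing "/".
    return posixpath.join(folder_out, str(iid), f"chr{ch}", "")


print(build_output_folder("./Empirical/1240k/", "MA89", 6))
# ./Empirical/1240k/MA89/chr6/
```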

class hapsburg.preprocessing.PreProcessingHDF5(conPop=[], save=True, output=True)

Bases: PreProcessing

Class for preprocessing the data. Standard: intersect the reference data with the individual data and return the intersection dataset.

path_targets = './../ancient-sardinia/output/h5_rev/mod_reich_sardinia_ancients_rev_mrg_dedup_3trm_anno.h5'
h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'
meta_path_ref = './Data/1000Genomes/Individuals/meta_df.csv'
excluded = ['TSI']
conPop = []
folder_out = './Empirical/1240k/'
prefix_out_data = ''
samples_field = 'samples'
readcounts = False
diploid_ref = True
random_allele = True
only_calls = True
flipstrand = True
max_mm_rate = 0.9
downsample = False
save = True
output = True
get_index_iid_legacy(iid, fs=0)

Get the index of iid in fs. iid: IID to extract. fs: reference HDF5.

get_index_iid(iid, f=0, samples_field='samples')

Get the index of iid in f. iid: IID to extract. f: reference HDF5.
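
Looking up a sample index amounts to searching the sample list for the requested IID. The sketch below uses a plain Python list in place of the HDF5 samples field. The helper name find_sample_index and the error on missing or duplicated IIDs are assumptions, not documented behaviour.

```python
def find_sample_index(samples, iid):
    """Return the position of iid in samples; raise if missing or duplicated."""
    hits = [i for i, sample in enumerate(samples) if sample == iid]
    if len(hits) != 1:
        raise ValueError(f"Expected exactly one match for {iid!r}, found {len(hits)}")
    return hits[0]


samples = ["MA85", "MA89", "MA100"]
print(find_sample_index(samples, "MA89"))  # 1
```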

get_ref_ids(f, samples_field='samples')

OVERWRITE: Get the indices of the individuals in the HDF5 to extract. Here: allow subsetting to individuals from different 1000G populations. samples_field: field of all sample IIDs in the HDF5.

load_data(iid='MA89', ch=6, start=-np.inf, end=np.inf)

Return Matrix of reference [k,l], Matrix of Individual Data [2,l], as well as linkage Map [l]

optional_postprocessing(gts_ind, gts, r_map, pos, out_folder, pCon, read_counts=[])

Postprocessing steps of gts_ind, gts, r_map, and the folder, based on boolean fields of the class.

destroy_phase_func(gts_ind, dtype='int8')

Randomly shuffles phase for gts [2,n_loci]
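
One way to shuffle phase is to swap the two alleles at each locus with probability 1/2. The sketch below illustrates that idea on nested lists to stay dependency-free. The function name shuffle_phase and the per-locus coin flip are assumptions about the approach, not the HAPSBURG implementation.

```python
import random


def shuffle_phase(gts_ind, rng=None):
    """Randomly swap the two alleles at each locus of a [2, n_loci] table."""
    rng = rng or random.Random()
    top, bottom = list(gts_ind[0]), list(gts_ind[1])
    for j in range(len(top)):
        # Swap with probability 1/2, independently at every locus.
        if rng.random() < 0.5:
            top[j], bottom[j] = bottom[j], top[j]
    return [top, bottom]


gts = [[0, 0, 1, 1], [1, 0, 0, 1]]
shuffled = shuffle_phase(gts, rng=random.Random(0))
print(shuffled)
```

The unordered genotype at each locus is unchanged. Only the assignment of alleles to the two rows is randomised.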

markers_called(gts_ind, read_counts)

Return boolean array of markers which are called. If read_counts exist, use them for downsampling. Otherwise, use the genotype field.
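
The two branches described above can be sketched as follows. This is a hedged illustration: the function name called_markers is hypothetical, and the sketch assumes that a marker counts as called when it has at least one read, or, without read counts, when neither genotype entry is negative (a common missing-data code).

```python
def called_markers(gts_ind, read_counts=None):
    """Return one boolean per marker: True where the marker is called."""
    if read_counts:
        # Assumption: rows hold reference and alternative read counts.
        return [ref + alt > 0 for ref, alt in zip(read_counts[0], read_counts[1])]
    # Assumption: negative genotype values mark missing data.
    return [a >= 0 and b >= 0 for a, b in zip(gts_ind[0], gts_ind[1])]


gts = [[0, -1, 1], [1, -1, 1]]
rc = [[2, 0, 0], [0, 1, 0]]
print(called_markers(gts))      # [True, False, True]
print(called_markers(gts, rc))  # [True, True, False]
```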

load_h5(path)

Load and return the HDF5 File from Path

merge_2hdf(f, g, start=-np.inf, end=np.inf)

Merge two HDF5 files f and g. Return indices of overlap individuals. f is the Sardinian HDF5, g the reference HDF5.
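
A merge of this kind typically comes down to intersecting two position lists and returning matching indices into each. The sketch below shows that step on plain lists. The function name intersect_positions, matching on base-pair position, and the inclusive start and end bounds are all assumptions.

```python
import math


def intersect_positions(pos_f, pos_g, start=-math.inf, end=math.inf):
    """Return indices into pos_f and pos_g for positions found in both."""
    index_g = {pos: j for j, pos in enumerate(pos_g)}
    idx_f, idx_g = [], []
    for i, pos in enumerate(pos_f):
        # Keep only shared positions inside the requested window.
        if start <= pos <= end and pos in index_g:
            idx_f.append(i)
            idx_g.append(index_g[pos])
    return idx_f, idx_g


pos_target = [100, 250, 400, 900]
pos_ref = [250, 300, 400, 900, 1200]
print(intersect_positions(pos_target, pos_ref))           # ([1, 2, 3], [0, 2, 3])
print(intersect_positions(pos_target, pos_ref, end=500))  # ([1, 2], [0, 2])
```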

save_info(folder, cm_map, pos, gt_individual=[], read_counts=[])

Save linkage map, readcount, and genotype data per individual (needed for later plotting). gt_individual: genotypes of the individual; if given, save them as well.

extract_snps_contaminationPop(h5, ids_con, markers)
extract_snps_hdf5(h5, ids_ref, markers, diploid=False, dtype='int8', removeIncompleteHap=True, start=0, end=-1)

Extract genotypes from h5 on ids and markers. If diploid, concatenate haplotypes along the 0 axis. Extract individuals first, and then subset to SNPs in memory. Return a 2D array [# haplotypes, # markers].
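
The diploid concatenation can be illustrated on nested lists. The sketch assumes a genotype table laid out as gt[marker][individual][haplotype], which is an assumption about the HDF5 layout. The function name extract_haplotypes is hypothetical, and the removeIncompleteHap, start, and end options are not modelled.

```python
def extract_haplotypes(gt, ids, markers, diploid=False):
    """Return a [# haplotypes, # markers] table for the chosen individuals."""
    # Subset individuals first, then markers, following the order described above.
    by_individual = [[row[i] for i in ids] for row in gt]
    subset = [by_individual[m] for m in markers]
    n_markers = len(markers)
    first = [[subset[m][k][0] for m in range(n_markers)] for k in range(len(ids))]
    if not diploid:
        return first
    second = [[subset[m][k][1] for m in range(n_markers)] for k in range(len(ids))]
    # Concatenate along axis 0: all first haplotypes, then all second haplotypes.
    return first + second


gt = [
    [[0, 1], [1, 1]],
    [[0, 0], [1, 0]],
    [[1, 1], [0, 1]],
]
print(extract_haplotypes(gt, ids=[0, 1], markers=[0, 2], diploid=True))
# [[0, 1], [1, 0], [1, 1], [1, 1]]
```

With diploid=True, k individuals yield 2k rows. With diploid=False, only the first haplotype of each individual is kept in this sketch.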

extract_rc_hdf5(h5, ids, markers, dtype=np.int8)

Extract readcount data from h5 on a single (!) id and markers. int8: Watch out - limited to max RC 127!
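
The 127 limit follows from the range of a signed 8-bit integer. The sketch below uses the standard-library ctypes module to show what an unchecked store does to a larger count, and one possible guard. The helper names as_int8 and clip_read_count are illustrative, and clipping is a suggestion rather than documented HAPSBURG behaviour.

```python
import ctypes

INT8_MAX = 127  # largest value a signed 8-bit integer can hold


def as_int8(value):
    """Return value as stored in a signed 8-bit integer without range checks."""
    return ctypes.c_int8(value).value


def clip_read_count(value):
    """Cap a read count at the int8 maximum instead of letting it wrap."""
    return min(value, INT8_MAX)


print(as_int8(127))          # 127
print(as_int8(200))          # -56
print(clip_read_count(200))  # 127
```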

extract_rmap_hdf5(h5, markers, col='variants/MAP')

Extract a column like rec map from h5. Can also be used for Positions

get_segment(r_map, markers_obs, markers_ref)

Extract only the markers in self.segM and downsample the key arrays. Return the downsampled versions.

class hapsburg.preprocessing.PreProcessingEigenstrat(save=True, output=True, packed=1, sep='\\s+')

Bases: PreProcessingHDF5

Class for preprocessing Eigenstrat files. Same as PreProcessingHDF5 for the reference, but with Eigenstrat code for the target.

path_targets = ''
h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'
packed = -1
sep = '\\s+'
flipstrand = True
save = True
output = True
es_get_index_iid(es, iid)

Get the index of the IID in the Eigenstrat object.

get_1000G_path(h5_path1000g, ch)

Construct and return the path to the 1000 Genome reference panel
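
Given a default prefix such as './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr', the path is presumably the prefix followed by the chromosome label and a file extension. The sketch below shows that construction. The helper name build_reference_path and the '.hdf5' suffix are assumptions.

```python
def build_reference_path(h5_path1000g, ch, suffix=".hdf5"):
    """Append the chromosome label and a file suffix to the path prefix."""
    # Assumption: one reference file per chromosome, named <prefix><ch><suffix>.
    return f"{h5_path1000g}{ch}{suffix}"


prefix = "./Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr"
print(build_reference_path(prefix, 6))
# ./Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr6.hdf5
```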

get_ref_ids(f, samples_field='samples')

OVERWRITE: Get the indices of the individuals in the HDF5 to extract. Here: allow subsetting to individuals from different 1000G populations. samples_field: field of all sample IIDs in the HDF5.

optional_postprocessing(gts_ind, gts, r_map, pos, out_folder, read_counts=[])

Postprocessing steps of gts_ind, gts, r_map, and the folder, based on boolean fields of the class.

load_data(iid='MA89', ch=6)

Return the matrix of reference [k,l], the matrix of individual data [2,l], as well as the linkage map [l] and the output folder. Save the loaded data if self.save==True. Various modifiers are set in class fields (see also PreProcessingHDF5).

extract_snps_es(es, id, markers)

Use the Eigenstrat object es to extract genotypes for the individual with index id (integer) at a list of markers. Convert from Eigenstrat GT to the format used here.

merge_es_hdf5(es, f_ref)

Merge Eigenstrat and HDF5 loci and return the intersection indices. es: LoadEigenstrat object. f_ref: reference HDF5. If self.flipstrand, return indices [its] [its] and a flip boolean vector [its].
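
The intersection with a flip vector can be sketched on plain tuples. Everything here is illustrative: the function name merge_loci, the (position, ref, alt) tuples, and the rule that a locus is flipped when its two alleles appear in swapped order are assumptions about what the flip vector encodes.

```python
def merge_loci(es_loci, ref_loci, flipstrand=True):
    """Return (indices into es_loci, indices into ref_loci, flip flags)."""
    ref_index = {pos: (j, ref, alt) for j, (pos, ref, alt) in enumerate(ref_loci)}
    idx_es, idx_ref, flip = [], [], []
    for i, (pos, ref, alt) in enumerate(es_loci):
        if pos not in ref_index:
            continue
        j, ref_r, alt_r = ref_index[pos]
        same = (ref, alt) == (ref_r, alt_r)
        swapped = (ref, alt) == (alt_r, ref_r)
        # Keep matching loci; keep swapped loci only when flipping is enabled.
        if same or (flipstrand and swapped):
            idx_es.append(i)
            idx_ref.append(j)
            flip.append(swapped and not same)
    return idx_es, idx_ref, flip


es = [(100, "A", "G"), (200, "C", "T"), (300, "G", "A"), (400, "T", "C")]
ref = [(100, "A", "G"), (300, "A", "G"), (400, "T", "G"), (500, "C", "T")]
print(merge_loci(es, ref))
# ([0, 2], [0, 1], [False, True])
```

The locus at position 400 is dropped because its alleles match the reference in neither order.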

to_read_counts(gts_ind)

Transforms vector of genotypes [2,l] to vector of read counts [2,l]. Return this vector of read counts
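
One plausible conversion treats each of the two genotype entries as a single read. The sketch below does this on nested lists. It assumes that 0 codes the reference allele and 1 the alternative allele, that row 0 of the result counts reference reads and row 1 alternative reads, and that any other value (such as a missing-data code) contributes no read. The function name is hypothetical.

```python
def genotypes_to_read_counts(gts_ind):
    """Turn a [2, l] genotype table into a [2, l] table of read counts."""
    pairs = list(zip(gts_ind[0], gts_ind[1]))
    # Row 0: number of reference alleles; row 1: number of alternative alleles.
    ref_counts = [int(a == 0) + int(b == 0) for a, b in pairs]
    alt_counts = [int(a == 1) + int(b == 1) for a, b in pairs]
    return [ref_counts, alt_counts]


gts = [[0, 1, 1, -1], [0, 0, 1, -1]]
print(genotypes_to_read_counts(gts))
# [[2, 1, 0, 0], [0, 1, 2, 0]]
```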

class hapsburg.preprocessing.PreProcessingEigenstratX(save=True, output=True, packed=1, sep='\\s+')

Bases: PreProcessingEigenstrat

Class for preprocessing Eigenstrat files. Same as PreProcessingEigenstrat, but loads and combines two X chromosomes (these have to be male ones!).

path_targets = ''
h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'
packed = -1
sep = '\\s+'
flipstrand = True
set_output_folder(iid, ch='X')

Set the output folder after folder_out. General structure for HAPSBURG: folder_out/iid1_iid2/chrX/. Return this folder.

get_1000G_path(h5_path1000g, ch='X')

Construct and return the path to the 1000 Genome reference panel

es_get_index_iid(es, iid)

Get index of IIDs in eigenstrat. Here for X: iid is a list

extract_snps_es(es, id, markers)

Use Eigenstrat object. Extract genotypes for individual index i (integer) for list of markers. Do conversion from Eigenstrat GT to format used here

class hapsburg.preprocessing.PreProcessingHDF5Sim(conPop=[], save=True, output=True)

Bases: PreProcessingHDF5

Class for preprocessing simulated 1000 Genome data (mosaic simulations). Same as PreProcessingHDF5, but with the right folder structure. MODIFY

out_folder = ''
prefix_out_data = ''
path_targets = ''
h5_path1000g = './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'
class hapsburg.preprocessing.PreProcessingFolder(save=True, output=True)

Bases: PreProcessing

Preprocessing if data has been saved into a folder (such as in a Simulation)

save = True
output = True
readcounts = False
load_data(iid='MA89', ch=6, n_ref=503, folder='')

Return Matrix of reference [k,l], Matrix of Individual Data [2,l], as well as linkage Map [l]

load_linkage_map(folder, nr_snps)

Load and Return the Linkage Map

hapsburg.preprocessing.load_preprocessing(p_model='SardHDF5', conPop=[], save=True, output=True)

Factory method to load the preprocessing class. Return the right PreProcessing subclass based on the keyword p_model.
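
A keyword-based factory is often built on a dictionary that maps keywords to classes. The sketch below shows that pattern with stand-in classes. Only the keyword 'SardHDF5' appears in this reference; the keyword 'Folder', the class names, the registry, and the choice to return an instance are assumptions made for illustration.

```python
class BasePre:
    """Hypothetical stand-in for the PreProcessing base class."""

    def __init__(self, save=True, output=True):
        self.save = save
        self.output = output


class HDF5Pre(BasePre):
    """Hypothetical stand-in for an HDF5-based subclass."""


class FolderPre(BasePre):
    """Hypothetical stand-in for a folder-based subclass."""


# Assumption: keywords map one-to-one onto subclasses.
REGISTRY = {"SardHDF5": HDF5Pre, "Folder": FolderPre}


def load_pre(p_model="SardHDF5", save=True, output=True):
    """Return an instance of the subclass registered under p_model."""
    try:
        cls = REGISTRY[p_model]
    except KeyError:
        raise NotImplementedError(f"Unknown preprocessing keyword: {p_model}") from None
    return cls(save=save, output=output)


pre = load_pre("SardHDF5", save=False)
print(type(pre).__name__, pre.save)  # HDF5Pre False
```

Failing loudly on an unknown keyword keeps a misspelled p_model from silently falling back to a default class.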

hapsburg.preprocessing.pp