hapsburg.preprocessing
======================

.. py:module:: hapsburg.preprocessing

.. autoapi-nested-parse::

   Classes for preprocessing and loading the data.
   They are the interface to the data folders.
   There are subclasses for different types of data, as well as a factory
   method that returns the right subclass based on a keyword. Use this
   factory method to load classes.
   @ Author: Harald Ringbauer, 2019, All rights reserved


Attributes
----------

.. autoapisummary::

   hapsburg.preprocessing.pp


Classes
-------

.. autoapisummary::

   hapsburg.preprocessing.PreProcessing
   hapsburg.preprocessing.PreProcessingHDF5
   hapsburg.preprocessing.PreProcessingEigenstrat
   hapsburg.preprocessing.PreProcessingEigenstratX
   hapsburg.preprocessing.PreProcessingHDF5Sim
   hapsburg.preprocessing.PreProcessingFolder


Functions
---------

.. autoapisummary::

   hapsburg.preprocessing.load_preprocessing


Module Contents
---------------

.. py:class:: PreProcessing

   Bases: :py:obj:`object`


   Base class for preprocessing the data.
   Standard: intersect the reference data with the individual data
   and return the intersection dataset.


   .. py:attribute:: save
      :value: True


   .. py:attribute:: output
      :value: True


   .. py:attribute:: iid
      :value: ''


   .. py:attribute:: ch
      :value: 0


   .. py:attribute:: segM
      :value: []


   .. py:attribute:: n_ref
      :value: 0


   .. py:method:: load_data(iid='MA89', ch=6)
      :abstractmethod:


      Return Reference Matrix [k,l], Genotype/Readcount Matrix [2,l]
      as well as linkage map [l]


   .. py:method:: set_params(**kwargs)

      Set the Parameters.
      Takes keyworded arguments
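
      A minimal sketch of this pattern, assuming each keyword is simply
      assigned as an attribute of the same name (the actual implementation
      is not shown here; ``ParamHolder`` is a hypothetical stand-in class):

      .. code-block:: python

         class ParamHolder:
             """Hypothetical stand-in for a preprocessing object."""

             save = True
             output = True

             def set_params(self, **kwargs):
                 # Assumption: every keyword becomes an attribute of the same name.
                 for key, value in kwargs.items():
                     setattr(self, key, value)


         holder = ParamHolder()
         holder.set_params(save=False, folder_out="./out/")

      Fields that are not passed keep their class-level defaults.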


   .. py:method:: set_output_folder(iid, ch)

      Set the output folder after folder_out.
      General Structure for HAPSBURG: folder_out/iid/chrX/
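
      The folder layout above could be assembled roughly as follows. This
      sketch only builds the path string; ``output_folder`` is a
      hypothetical helper, and whether the real method also creates the
      directory is not stated here:

      .. code-block:: python

         import os


         def output_folder(folder_out, iid, ch):
             """Hypothetical helper: build folder_out/iid/chr<ch>/ as a string."""
             # The trailing empty component makes the path end with a separator.
             return os.path.join(folder_out, str(iid), "chr" + str(ch), "")


         path = output_folder("./Empirical/1240k/", "MA89", 6)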


.. py:class:: PreProcessingHDF5(conPop=[], save=True, output=True)

   Bases: :py:obj:`PreProcessing`


   Class for preprocessing the data.
   Standard: intersect the reference data with the individual data
   and return the intersection dataset.


   .. py:attribute:: path_targets
      :value: './../ancient-sardinia/output/h5_rev/mod_reich_sardinia_ancients_rev_mrg_dedup_3trm_anno.h5'


   .. py:attribute:: h5_path1000g
      :value: './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'


   .. py:attribute:: meta_path_ref
      :value: './Data/1000Genomes/Individuals/meta_df.csv'


   .. py:attribute:: excluded
      :value: ['TSI']


   .. py:attribute:: conPop
      :value: []


   .. py:attribute:: folder_out
      :value: './Empirical/1240k/'


   .. py:attribute:: prefix_out_data
      :value: ''


   .. py:attribute:: samples_field
      :value: 'samples'


   .. py:attribute:: readcounts
      :value: False


   .. py:attribute:: diploid_ref
      :value: True


   .. py:attribute:: random_allele
      :value: True


   .. py:attribute:: only_calls
      :value: True


   .. py:attribute:: flipstrand
      :value: True


   .. py:attribute:: max_mm_rate
      :value: 0.9


   .. py:attribute:: downsample
      :value: False


   .. py:attribute:: save
      :value: True


   .. py:attribute:: output
      :value: True


   .. py:method:: get_index_iid_legacy(iid, fs=0)

      Get the index of iid in fs.
      iid: Individual to extract. fs: Reference HDF5


   .. py:method:: get_index_iid(iid, f=0, samples_field='samples')

      Get the index of iid in f.
      iid: Individual to extract. f: Reference HDF5
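
      A sketch of such a lookup on a plain list of sample names.
      ``index_of_iid`` is a hypothetical helper, and the requirement of
      exactly one match is an assumption rather than documented behavior:

      .. code-block:: python

         def index_of_iid(samples, iid):
             """Hypothetical helper: position of iid in a list of sample names."""
             matches = [i for i, name in enumerate(samples) if name == iid]
             # Assumption: each individual should appear exactly once.
             if len(matches) != 1:
                 raise ValueError(f"expected one match for {iid!r}, found {len(matches)}")
             return matches[0]


         samples = ["MA85", "MA89", "MA110"]
         idx = index_of_iid(samples, "MA89")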


   .. py:method:: get_ref_ids(f, samples_field='samples')

      OVERWRITE: Get the Indices of the individuals
      in the HDF5 to extract. Here: Allow to subset for Individuals from
      different 1000G Populations
      samples_field: Field of all sample iids in hdf5


   .. py:method:: load_data(iid='MA89', ch=6, start=-np.inf, end=np.inf)

      Return Matrix of reference [k,l], Matrix of Individual Data [2,l],
      as well as linkage Map [l]


   .. py:method:: optional_postprocessing(gts_ind, gts, r_map, pos, out_folder, pCon, read_counts=[])

      Postprocessing steps of gts_ind, gts, r_map, and the folder,
      based on boolean fields of the class.


   .. py:method:: destroy_phase_func(gts_ind, dtype='int8')

      Randomly shuffles phase for gts [2,n_loci]
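
      One way to picture phase shuffling, assuming it means swapping the
      two alleles at each locus independently with probability 1/2 (the
      actual implementation may differ; ``shuffle_phase`` is a hypothetical
      name):

      .. code-block:: python

         import random


         def shuffle_phase(gts, rng):
             """Swap the two alleles at each locus with probability 1/2."""
             top, bottom = list(gts[0]), list(gts[1])
             for j in range(len(top)):
                 if rng.random() < 0.5:
                     top[j], bottom[j] = bottom[j], top[j]
             return [top, bottom]


         gts = [[0, 1, 1, 0], [1, 0, 1, 0]]
         shuffled = shuffle_phase(gts, random.Random(1))

      The unordered genotype at every locus is unchanged; only the
      assignment of alleles to the two rows is randomized.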


   .. py:method:: markers_called(gts_ind, read_counts)

      Return boolean array of markers which are called.
      If read_counts exist, use that for downsampling.
      Otherwise Use Genotype Field
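
      A sketch of the two branches on plain lists. The encodings are
      assumptions: a locus counts as called if it has at least one read,
      or, without read counts, if its genotype is not negative:

      .. code-block:: python

         def markers_called(gts_ind, read_counts=None):
             """Hypothetical sketch: flag loci with data for one individual."""
             if read_counts:
                 # Assumption: a locus is called if it has at least one read.
                 return [ref + alt > 0
                         for ref, alt in zip(read_counts[0], read_counts[1])]
             # Assumption: negative genotype values encode missing data.
             return [g >= 0 for g in gts_ind[0]]


         gts_ind = [[0, -1, 1], [1, -1, 1]]
         by_genotype = markers_called(gts_ind)
         by_reads = markers_called(gts_ind, [[2, 0, 0], [0, 0, 1]])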


   .. py:method:: load_h5(path)

      Load and return the HDF5 File from Path


   .. py:method:: merge_2hdf(f, g, start=-np.inf, end=np.inf)

      Merge two HDF 5 f and g. Return Indices of Overlap Individuals.
      f is Sardinian HDF5,
      g the Reference HDF5
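
      A sketch of an index-returning merge, assuming the overlap is
      computed on marker positions. ``overlap_indices`` is a hypothetical
      helper that works on plain lists instead of HDF5 files:

      .. code-block:: python

         def overlap_indices(pos_f, pos_g):
             """Hypothetical sketch: indices of positions shared by both lists."""
             lookup = {p: j for j, p in enumerate(pos_g)}
             idx_f, idx_g = [], []
             for i, p in enumerate(pos_f):
                 if p in lookup:
                     idx_f.append(i)
                     idx_g.append(lookup[p])
             return idx_f, idx_g


         idx_f, idx_g = overlap_indices([100, 250, 400, 900], [250, 300, 400])

      Indexing each dataset with its own index list yields rows that
      refer to the same markers in the same order.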


   .. py:method:: save_info(folder, cm_map, pos, gt_individual=[], read_counts=[])

      Save Linkage Map, Readcount and Genotype Data per Individual.
      (Needed for later Plotting)
      Genotypes Individual: If given, save as well


   .. py:method:: extract_snps_contaminationPop(h5, ids_con, markers)


   .. py:method:: extract_snps_hdf5(h5, ids_ref, markers, diploid=False, dtype='int8', removeIncompleteHap=True, start=0, end=-1)

      Extract genotypes from h5 on ids and markers.
      If diploid, concatenate haplotypes along 0 axis.
      Extract individuals first, and then subset to SNPs
      in Memory.
      Return 2D array [# haplotypes, # markers]
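
      The extraction order and the diploid concatenation can be sketched on
      nested lists. The layout ``gt[marker][individual] = (hap0, hap1)`` is
      an assumption made for illustration, and ``extract_haplotypes`` is a
      hypothetical name:

      .. code-block:: python

         def extract_haplotypes(gt, ids, markers, diploid=False):
             """Hypothetical sketch; gt[m][i] is an allele pair (hap0, hap1)."""
             # Subset to the requested individuals first ...
             by_ind = [[row[i] for i in ids] for row in gt]
             # ... then subset to the requested markers in memory.
             by_ind = [by_ind[m] for m in markers]
             # One row per haplotype, one column per marker.
             hap0 = [[row[k][0] for row in by_ind] for k in range(len(ids))]
             if not diploid:
                 return hap0
             hap1 = [[row[k][1] for row in by_ind] for k in range(len(ids))]
             return hap0 + hap1  # concatenate along axis 0


         gt = [
             [(0, 1), (1, 1)],
             [(0, 0), (1, 0)],
             [(1, 1), (0, 1)],
         ]
         haps = extract_haplotypes(gt, ids=[1], markers=[0, 2], diploid=True)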


   .. py:method:: extract_rc_hdf5(h5, ids, markers, dtype=np.int8)

      Extract Readcount data from h5 on single (!) id and markers
      int8: Watch out - limited to max RC 127!
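
      To illustrate the limit: a signed 8-bit integer holds at most 127, so
      larger counts cannot be stored faithfully. A caller who wants to guard
      against this could cap counts beforehand. This is a sketch of that
      idea, not a description of what the method itself does;
      ``cap_read_counts`` is a hypothetical helper:

      .. code-block:: python

         INT8_MAX = 127  # largest value a signed 8-bit integer can hold


         def cap_read_counts(counts):
             """Clip read counts so that they fit into int8."""
             return [min(c, INT8_MAX) for c in counts]


         capped = cap_read_counts([3, 127, 300])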


   .. py:method:: extract_rmap_hdf5(h5, markers, col='variants/MAP')

      Extract a column like rec map from h5.
      Can also be used for Positions


   .. py:method:: get_segment(r_map, markers_obs, markers_ref)

      Extract only the markers in self.segM and downsample
      the key arrays. Return the downsampled versions.


.. py:class:: PreProcessingEigenstrat(save=True, output=True, packed=1, sep='\\s+')

   Bases: :py:obj:`PreProcessingHDF5`


   Class for PreProcessing Eigenstrat Files
   Same as PreProcessingHDF5 for reference, but with Eigenstrat code
   for target


   .. py:attribute:: path_targets
      :value: ''


   .. py:attribute:: h5_path1000g
      :value: './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'


   .. py:attribute:: packed
      :value: -1


   .. py:attribute:: sep
      :value: '\\s+'


   .. py:attribute:: flipstrand
      :value: True


   .. py:attribute:: save
      :value: True


   .. py:attribute:: output
      :value: True


   .. py:method:: es_get_index_iid(es, iid)

      Get the index of iid in the Eigenstrat object es


   .. py:method:: get_1000G_path(h5_path1000g, ch)

      Construct and return the path
      to the 1000 Genome reference panel
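
      A sketch of per-chromosome path construction from the documented
      prefix. The ``.hdf5`` suffix and the helper name ``reference_path``
      are assumptions:

      .. code-block:: python

         def reference_path(h5_path1000g, ch):
             """Hypothetical helper: one reference file per chromosome."""
             # Assumption: files are named <prefix><chromosome>.hdf5
             return h5_path1000g + str(ch) + ".hdf5"


         ref_path = reference_path("./Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr", 6)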


   .. py:method:: get_ref_ids(f, samples_field='samples')

      OVERWRITE: Get the Indices of the individuals
      in the HDF5 to extract. Here: Allow to subset for Individuals from
      different 1000G Populations
      samples_field: Field of all sample iids in hdf5


   .. py:method:: optional_postprocessing(gts_ind, gts, r_map, pos, out_folder, read_counts=[])

      Postprocessing steps of gts_ind, gts, r_map, and the folder,
      based on boolean fields of the class.


   .. py:method:: load_data(iid='MA89', ch=6)

      Return Matrix of reference [k,l], Matrix of Individual Data [2,l],
      as well as linkage Map [l] and the output folder.
      Save the loaded data if self.save==True
      Various modifiers in class fields (check also PreProcessingHDF5)


   .. py:method:: extract_snps_es(es, id, markers)

      Use Eigenstrat object. Extract genotypes for individual index id
      (integer) for a list of markers. Convert from Eigenstrat GT
      to the format used here


   .. py:method:: merge_es_hdf5(es, f_ref)

      Merge Eigenstrat and HDF5 Loci, return intersection indices
      es: LoadEigenstrat Object
      f_ref: Reference HDF5
      If self.flipstrand, return both index arrays [its]
      and a boolean flip vector [its]
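
      A sketch of an intersection with strand-flip detection on plain
      tuples. It assumes each SNP is described by
      ``(position, ref_allele, alt_allele)`` and that a flip means the two
      alleles are swapped between the datasets; ``merge_with_flip`` is a
      hypothetical name:

      .. code-block:: python

         def merge_with_flip(es_snps, ref_snps):
             """Hypothetical sketch; each SNP is (position, ref, alt)."""
             lookup = {pos: (j, ref, alt)
                       for j, (pos, ref, alt) in enumerate(ref_snps)}
             idx_es, idx_ref, flip = [], [], []
             for i, (pos, ref, alt) in enumerate(es_snps):
                 if pos not in lookup:
                     continue
                 j, r_ref, r_alt = lookup[pos]
                 if (ref, alt) == (r_ref, r_alt):
                     flipped = False
                 elif (ref, alt) == (r_alt, r_ref):
                     flipped = True
                 else:
                     continue  # alleles match in neither orientation
                 idx_es.append(i)
                 idx_ref.append(j)
                 flip.append(flipped)
             return idx_es, idx_ref, flip


         es_snps = [(10, "A", "G"), (20, "C", "T"), (30, "G", "A"), (40, "T", "C")]
         ref_snps = [(20, "T", "C"), (30, "G", "A"), (40, "A", "G"), (50, "C", "T")]
         idx_es, idx_ref, flip = merge_with_flip(es_snps, ref_snps)

      All three returned lists have the same length, the number of
      intersecting loci.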


   .. py:method:: to_read_counts(gts_ind)

      Transforms vector of genotypes [2,l] to
      vector of read counts [2,l]. Return this
      vector of read counts
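
      A sketch of one possible conversion, assuming row 0 of the result
      counts reference alleles (coded 0), row 1 counts alternative alleles
      (coded 1), and negative genotypes are missing and yield no reads:

      .. code-block:: python

         def to_read_counts(gts_ind):
             """Hypothetical sketch: pseudo read counts from genotypes [2, l]."""
             pairs = list(zip(gts_ind[0], gts_ind[1]))
             # Booleans add up as integers, giving counts of 0, 1 or 2.
             ref = [(a == 0) + (b == 0) for a, b in pairs]
             alt = [(a == 1) + (b == 1) for a, b in pairs]
             return [ref, alt]


         counts = to_read_counts([[0, 1, 1, -1], [0, 0, 1, -1]])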


.. py:class:: PreProcessingEigenstratX(save=True, output=True, packed=1, sep='\\s+')

   Bases: :py:obj:`PreProcessingEigenstrat`


   Class for PreProcessing Eigenstrat Files
   Same as Eigenstrat, but will load and combine two
   X Chromosomes (have to be male ones!!)


   .. py:attribute:: path_targets
      :value: ''


   .. py:attribute:: h5_path1000g
      :value: './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'


   .. py:attribute:: packed
      :value: -1


   .. py:attribute:: sep
      :value: '\\s+'


   .. py:attribute:: flipstrand
      :value: True


   .. py:method:: set_output_folder(iid, ch='X')

      Set the output folder after folder_out.
      General Structure for HAPSBURG: folder_out/iid1_iid2/chrX/
      Return this folder


   .. py:method:: get_1000G_path(h5_path1000g, ch='X')

      Construct and return the path
      to the 1000 Genome reference panel


   .. py:method:: es_get_index_iid(es, iid)

      Get index of IIDs in eigenstrat.
      Here for X: iid is a list


   .. py:method:: extract_snps_es(es, id, markers)

      Use Eigenstrat object. Extract genotypes for individual index id
      (integer) for a list of markers. Convert from Eigenstrat GT
      to the format used here


.. py:class:: PreProcessingHDF5Sim(conPop=[], save=True, output=True)

   Bases: :py:obj:`PreProcessingHDF5`


   Class for PreProcessing simulated 1000 Genome Data (Mosaic Simulations).
   Same as PreProcessingHDF5 but with the right Folder Structure
   MODIFY


   .. py:attribute:: out_folder
      :value: ''


   .. py:attribute:: prefix_out_data
      :value: ''


   .. py:attribute:: path_targets
      :value: ''


   .. py:attribute:: h5_path1000g
      :value: './Data/1000Genomes/HDF5/1240kHDF5/Eur1240chr'


.. py:class:: PreProcessingFolder(save=True, output=True)

   Bases: :py:obj:`PreProcessing`


   Preprocessing if data has been saved into a folder
   (such as in a Simulation)


   .. py:attribute:: save
      :value: True


   .. py:attribute:: output
      :value: True


   .. py:attribute:: readcounts
      :value: False


   .. py:method:: load_data(iid='MA89', ch=6, n_ref=503, folder='')

      Return Matrix of reference [k,l], Matrix of Individual Data [2,l],
      as well as linkage Map [l]


   .. py:method:: load_linkage_map(folder, nr_snps)

      Load and Return the Linkage Map


.. py:function:: load_preprocessing(p_model='SardHDF5', conPop=[], save=True, output=True)

   Factory method to load the preprocessing class.
   Return an instance of the PreProcessing subclass selected by the
   keyword p_model.
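
   A minimal sketch of keyword-based dispatch in the spirit of this
   factory. All class names below are hypothetical stand-ins, and only
   ``'SardHDF5'`` is a keyword taken from the signature above; the
   ``'Folder'`` keyword is made up for illustration:

   .. code-block:: python

      class StubPreProcessing:
          """Hypothetical stand-in for the PreProcessing base class."""

          def __init__(self, save=True, output=True):
              self.save = save
              self.output = output


      class StubHDF5(StubPreProcessing):
          """Hypothetical stand-in for an HDF5-backed subclass."""


      class StubFolder(StubPreProcessing):
          """Hypothetical stand-in for a folder-backed subclass."""


      # "SardHDF5" is the documented default keyword; "Folder" is made up here.
      REGISTRY = {"SardHDF5": StubHDF5, "Folder": StubFolder}


      def load_stub(p_model="SardHDF5", save=True, output=True):
          """Return an instance of the class registered under p_model."""
          if p_model not in REGISTRY:
              raise NotImplementedError(f"unknown preprocessing keyword: {p_model!r}")
          return REGISTRY[p_model](save=save, output=output)


      pp_obj = load_stub("Folder", save=False)

   Callers then work against the shared base-class interface and do not
   need to know which subclass was selected.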


.. py:data:: pp