fip package
Submodules
fip.chem module
- fip.chem.rdmol2brics_blocs_smiles(mol, min_fragment_size=1)
Decomposes the provided rdmol instance into BRICS fragment SMILES, as designed by Degen et al. in "On the Art of Compiling and Using ‘Drug-Like’ Chemical Fragment Spaces", https://doi.org/10.1002/cmdc.200800178
- Parameters
mol – the source RDKit Mol instance
min_fragment_size – passed to the minFragmentSize of the RDKit implementation
- Returns
SMILES of the BRICS fragments
- fip.chem.rdmol2morgan_feature_smiles(mol, radius=3, min_radius=1, *, all_bonds_explicit=False, canonical_smiles=True, isomeric_smiles=False, all_H_explicit=True)
Breaks a molecule, given as an RDKit Mol instance, into a set of ECFP-like fragments with selected radius, and returns them in SMILES notation.
- Parameters
mol – the molecule for fragmenting, as RDKit Mol
radius – EC fragment radius
min_radius – minimal fragment radius to consider, default 1. Can be set to ignore lower-scope fragments.
all_bonds_explicit – boolean indicating whether all bond orders will be explicitly stated in the output. Default False.
canonical_smiles – boolean indicating whether to attempt to canonicalize the fragment SMILES. Default True.
isomeric_smiles – boolean indicating whether to include stereo information in the fragments. Default False.
all_H_explicit – boolean indicating whether to explicitly include all hydrogen atoms. Default True.
- Returns
a set of SMILES strings
- fip.chem.rdmol2smiles(mol)
Simple conversion of an RDKit Mol instance into a SMILES string. Wrapped in case some standardization/postprocessing is needed.
- Parameters
mol – RDKit Mol instance
- Returns
SMILES string
- fip.chem.rdmol_bonds2fragment_smiles(mol, bonds, *, all_bonds_explicit=False, canonical_smiles=True, isomeric_smiles=False, all_H_explicit=True)
Generates a fragment in SMILES notation from a molecule given as an RDKit Mol instance, based on the provided bond IDs within the molecule.
- Parameters
mol – the molecule containing the fragment, as RDKit Mol
bonds – a set of bonds that are part of the fragment
all_bonds_explicit – boolean indicating whether all bond orders will be explicitly stated in the output. Default False.
canonical_smiles – boolean indicating whether to attempt to canonicalize the fragment SMILES. Default True.
isomeric_smiles – boolean indicating whether to include stereo information in the fragments. Default False.
all_H_explicit – boolean indicating whether to explicitly include all hydrogen atoms. Default True.
- Returns
a SMILES string of the fragment, None if there are no atoms matched
- fip.chem.rdmol_has_substructure_pattern(rdmol, pattern)
Direct wrap of RDKit HasSubstructMatch. Wrapped in case some standardization/postprocessing is needed.
- Parameters
rdmol – RDKit Mol instance to search the pattern in
pattern – the pattern to search, also in RDKit Mol form
- Returns
bool or None (in case of RDKit Mol instance error)
- fip.chem.rdmol_locations2fragments_smiles(mol, fragment_locations, min_radius=0, *, all_bonds_explicit=False, canonical_smiles=True, isomeric_smiles=False, all_H_explicit=True)
Generates a set of fragments in SMILES notation from a molecule given as an RDKit Mol instance, based on provided atom indices and radii.
- Parameters
mol – the molecule for fragmenting, as RDKit Mol
fragment_locations – an iterable of fragment locations, in (atom, radius) tuples
min_radius – minimal feature radius to consider, default 0. Can be set to ignore lower-scope features.
all_bonds_explicit – boolean indicating whether all bond orders will be explicitly stated in the output. Default False.
canonical_smiles – boolean indicating whether to attempt to canonicalize the fragment SMILES. Default True.
isomeric_smiles – boolean indicating whether to include stereo information in the fragments. Default False.
all_H_explicit – boolean indicating whether to explicitly include all hydrogen atoms. Default True.
- Returns
a set of SMILES strings
- fip.chem.sdf2rdmols(sdf_path)
Converts an SDF file to RDKit Mol instances. Wrap of RDKit SDMolSupplier, modified to omit RDKit Mol instances that can’t be parsed.
- Parameters
sdf_path – Path to the SDF file
- Returns
a generator yielding RDKit Mol instances
- fip.chem.smarts2rdmol(smarts)
Simple conversion of a SMARTS string into an RDKit Mol instance. Wrapped in case some standardization/postprocessing is needed.
- Parameters
smarts – SMARTS string
- Returns
RDKit Mol instance
- fip.chem.smiles2rdmol(smiles)
Simple conversion of a SMILES string into an RDKit Mol instance. Wrapped in case some standardization/postprocessing is needed.
- Parameters
smiles – SMILES string
- Returns
RDKit Mol instance
- fip.chem.standardize_mol(mol, *, remove_hydrogens=True, remove_stereo=True)
Simple structure standardization.
- Parameters
mol – RDKit Mol instance to standardize
remove_hydrogens – whether to remove all hydrogens using RDKit RemoveAllHs
remove_stereo – whether to remove stereo information using RDKit RemoveStereochemistry
- Returns
standardized RDKit Mol instance
fip.profiles module
- class fip.profiles.CooccurrenceProbabilityProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)
Bases:
fip.profiles.InterrelationProfile
A co-occurrence probability profile, holding information about the observed probabilities for feature pair co-occurrences within the characterized set.
- classmethod from_cooccurrence_profile(cooccurrence_profile, *args, vector_count=None, **kwargs)
Creates a CooccurrenceProbabilityProfile from the provided CooccurrenceProfile. Calculating probabilities requires the overall count of feature vectors. The count is either provided explicitly through the vector_count kwarg, or inferred from cooccurrence_profile.attrs['vector_count'] or, failing that, taken as the maximum of its 'count' column.
- Parameters
cooccurrence_profile – the source CooccurrenceProfile instance
args – any further arguments to be passed to the InterrelationProfile init
vector_count – explicit count of feature vectors, i.e. samples, to manually adjust the probabilities
kwargs – any further keyword arguments to be passed to the InterrelationProfile init
- Returns
a CooccurrenceProbabilityProfile instance
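The count-to-probability step described above can be illustrated with a minimal standard-library sketch. The helper name `cooccurrence_probabilities` and the plain-dict data layout are assumptions made for illustration; this is not the library's implementation, which operates on its own profile objects.

```python
def cooccurrence_probabilities(counts, vector_count=None):
    """Turn raw pair counts into observed probabilities (hypothetical helper)."""
    if vector_count is None:
        # Fallback described above: use the largest observed count as the sample size.
        vector_count = max(counts.values())
    return {pair: count / vector_count for pair, count in counts.items()}


# (f, f) keys hold standalone occurrence counts, (f1, f2) keys hold co-occurrence counts.
counts = {("a", "a"): 4, ("b", "b"): 2, ("a", "b"): 1}
inferred = cooccurrence_probabilities(counts)
explicit = cooccurrence_probabilities(counts, vector_count=8)
```

Passing an explicit `vector_count` matters when no single feature occurs in every sample, because the maximum count then underestimates the true sample size.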
- get_imputation_value(*args)
Interrelation probability imputation based on the “most-optimistic” scenario, in which the co-occurrence would happen in the (n+1)-th sample.
- Returns
the imputation value as a float
- imputable_standalone_probabilities(*args, vector_count=None)
Calculates standalone probabilities for individual features, to be used in the “most-optimistic” imputation scheme, i.e. presuming that the n+1 feature vector will contain all observed features co-occurring.
- Parameters
vector_count – explicit count of feature vectors, i.e. samples, to optionally manually adjust the probabilities
- Returns
a dictionary of feature:probability
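One possible reading of the “most-optimistic” n+1 scheme is sketched below. The arithmetic (adding one to every standalone count and to the sample size) is an assumption derived from the description, and `optimistic_imputation` is a hypothetical name; the library may compute these values differently.

```python
def optimistic_imputation(standalone_counts, vector_count):
    """Sketch of an n+1 'most-optimistic' imputation (an assumed reading)."""
    n_plus_one = vector_count + 1
    # Assumption: the hypothetical extra vector contains every observed feature,
    # so each standalone count grows by one.
    standalone = {f: (c + 1) / n_plus_one for f, c in standalone_counts.items()}
    # Assumption: a so-far unobserved pair co-occurs exactly once, in that extra vector.
    pair_probability = 1 / n_plus_one
    return standalone, pair_probability


standalone, pair_probability = optimistic_imputation({"a": 3, "b": 1}, vector_count=3)
```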
- mean_interrelation_value()
Provides mean value of all interrelation values within the profile, including imputed values. Ignores self-relations.
- Returns
the mean interrelation value as a float
- standard_interrelation_deviation()
Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.
- Returns
the standard deviation value of interrelations as a float
- class fip.profiles.CooccurrenceProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)
Bases:
fip.profiles.InterrelationProfile
A feature co-occurrence profile, which is usually the core profile for further interrelation analysis. Holds raw counts of how many times the features occurred and co-occurred within the characterized set.
- add_another_cooccurrence_profile(other)
Adds the contents of another co-occurrence profile to this one, in full outer join fashion. The addition is done inplace, i.e. on this instance.
- Parameters
other – another CooccurrenceProfile instance
- Returns
self
- classmethod from_feature_lists(feature_lists, *args, **kwargs)
Generate a Co-occurrence profile from an iterable of feature lists.
- Parameters
feature_lists – the iterable of feature lists to derive the CooccurrenceProfile from
args – any further arguments to be passed to the CooccurrenceProfile init
kwargs – any further keyword arguments to be passed to the CooccurrenceProfile init
- Returns
the corresponding CooccurrenceProfile instance
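The kind of counting such a profile holds can be sketched with the standard library alone. The helper `count_cooccurrences`, the tuple keys, and the de-duplication of features within each list are illustrative assumptions, not the library's internal representation.

```python
from collections import Counter
from itertools import combinations_with_replacement


def count_cooccurrences(feature_lists):
    """Count occurrences and co-occurrences over feature lists (hypothetical helper)."""
    counts = Counter()
    vector_count = 0
    for features in feature_lists:
        vector_count += 1
        # Each feature is counted once per list; (f, f) pairs hold standalone counts.
        unique = sorted(set(features))
        counts.update(combinations_with_replacement(unique, 2))
    return counts, vector_count


counts, vector_count = count_cooccurrences([["a", "b"], ["a", "c"], ["a", "b", "c"]])
```

Sorting the features before pairing keeps each unordered pair under a single key, so (a, b) and (b, a) are never counted separately.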
- classmethod from_feature_lists_split_on_feature(iterable, feature)
Generates and returns two CooccurrenceProfile instances from the provided iterable and the provided feature: a main profile built from the feature vectors containing the given feature, and a reference profile built from all other feature vectors.
- Parameters
iterable – the iterable of feature lists to derive the CooccurrenceProfile from
feature – the feature to split the set on
- Returns
a tuple of the CooccurrenceProfile instances (with_feature, without_feature)
- get_imputation_value(*args)
Returns interrelation imputation value. For co-occurrences, this is flat 0, unless z-scored.
- Returns
imputation value
- mean_interrelation_value()
Provides mean value of all interrelation values within the profile, including imputed values. Ignores self-relations.
- Returns
the mean interrelation value as a float
- standard_interrelation_deviation()
Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.
- Returns
the standard deviation value of interrelations as a float
- class fip.profiles.InterrelationProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)
Bases:
object
A generic parent class representing an interrelation profile, and implementing their common functionality. Not meant for instantiation.
- convert_to_zscore()
Converts the values within the InterrelationProfile into Z-scores, i.e. subtracts mean, divides by standard deviation.
- Returns
None, the InterrelationProfile is changed itself
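The z-score conversion (subtract the mean, divide by the standard deviation) can be sketched as follows. The helper `zscores` is hypothetical, it returns a new mapping rather than modifying anything in place, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def zscores(values):
    """Convert a mapping of values into z-scores (hypothetical helper)."""
    data = list(values.values())
    mu = mean(data)
    # Assumption: population standard deviation; the library may use the sample form.
    sigma = pstdev(data)
    return {key: (value - mu) / sigma for key, value in values.items()}


scored = zscores({("a", "b"): 2.0, ("a", "c"): 4.0, ("b", "c"): 6.0})
```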
- distinct_features(selection=None)
Provides a set of all distinct features present within the interrelation profile. Optionally, can be provided a selection containing interrelation profile subset, to return distinct features within that subset.
- Parameters
selection – a subset DataFrame for the interrelation profile, to narrow the scope. Optional, default None.
- Returns
feature names as a set of strings
- static features2cooccurrences(features, *, omit_self_relations=False)
Processes an iterable of features into a string set of feature co-occurrences.
- Parameters
features – an iterable of features to process, e.g. (feature1, feature2, feature4, …)
omit_self_relations – Whether to ignore feature self-relations, i.e. (f1, f1). Default False.
- Returns
a co-occurrence generator
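A minimal sketch of the pair enumeration described above, using only the standard library. The function name `cooccurrence_pairs`, the tuple output, and the choice to sort and de-duplicate the features are assumptions for illustration; the library's own output format may differ.

```python
from itertools import combinations, combinations_with_replacement


def cooccurrence_pairs(features, *, omit_self_relations=False):
    """Yield unordered feature pairs from an iterable of features (hypothetical helper)."""
    # Assumption: features are de-duplicated and sorted so each pair appears once.
    unique = sorted(set(features))
    if omit_self_relations:
        # Only pairs of two different features, e.g. (f1, f2).
        yield from combinations(unique, 2)
    else:
        # Also includes self-relations such as (f1, f1).
        yield from combinations_with_replacement(unique, 2)


pairs = list(cooccurrence_pairs(["b", "a", "c"]))
pairs_no_self = list(cooccurrence_pairs(["b", "a", "c"], omit_self_relations=True))
```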
- features_interrelation_values(features, *, omit_self_relations=False)
Yields interrelation values within the profile for all features within a given feature list. Includes imputed values.
- Parameters
features – features to look up within the profile
omit_self_relations – whether to omit self-relations in the lookup, default False.
- Returns
a generator yielding the interrelation values, usually floats or ints
- classmethod from_dataframe(dataframe, *args, **kwargs)
Loads an interrelation profile from a dataframe containing ‘feature1’, ‘feature2’ and ‘value’ columns.
- Parameters
dataframe – the dataframe to load
zscore – whether to produce a z-scored profile instead of raw values. Keyword argument, default False.
min_cutoff_value – if defined, drop all interrelations with values below the given limit. Default None.
args – any further arguments to be passed to the InterrelationProfile init
kwargs – any further keyword arguments to be passed to the InterrelationProfile init
- Returns
the corresponding InterrelationProfile instance
- classmethod from_dict(value_dict, *args, **kwargs)
Loads an interrelation profile from a dictionary of {(feature1, feature2): value}. feature1, feature2 are handled as strings and serve as a multiindex for the value.
- Parameters
value_dict – the dictionary to load
zscore – whether to produce a z-scored profile instead of raw values. Keyword argument, default False.
min_cutoff_value – if defined, drop all interrelations with values below the given limit. Default None.
args – any further arguments to be passed to the InterrelationProfile init
kwargs – any further keyword arguments to be passed to the InterrelationProfile init
- Returns
the corresponding InterrelationProfile instance
- abstract get_imputation_value(f1, f2)
Provides the imputation value for feature pair that does not occur within the interrelation profile. Implemented individually within different FeatureInterrelation types, due to imputation differences.
- interrelation_value(f1, f2=None)
Returns the interrelation value for the feature pair provided in the arguments. If second argument f2 is not filled in, returns self-relation (f1, f1).
- Parameters
f1 – the first feature
f2 – the second feature, default None
- Returns
The interrelation value between the two features within the profile. Usually int or float.
- iterate_feature_interrelations()
Yields all interrelation values between features in the profile, as tuples. Omits self-relations of features. Includes imputed values.
- Returns
yields tuples of (feature1, feature2, interrelation_value)
- mean_feature_interrelation_value(features, *, omit_self_relations=True)
Returns the mean interrelation value within the profile for all features within a given feature list. Corresponds to feature “tightness” measure for profiles such as (Z)PMI and (Z)PKLD. Includes imputed values.
- Parameters
features – features to look up within the profile
omit_self_relations – whether to omit self-relations in the lookup, default True.
- Returns
mean interrelation value, as a float
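The averaging over feature pairs, with imputation for pairs absent from the profile, might look like the sketch below. The helper name, the dict-based profile, and the flat `imputation_value` are assumptions for illustration; actual profiles impute per pair through get_imputation_value.

```python
from itertools import combinations
from statistics import mean


def mean_feature_interrelation(values, features, imputation_value=0.0):
    """Mean interrelation value over all pairs in a feature list (sketch)."""
    pairs = combinations(sorted(set(features)), 2)  # self-relations omitted
    # Pairs missing from the profile fall back to the imputation value.
    return mean([values.get(pair, imputation_value) for pair in pairs])


profile = {("a", "b"): 2.0, ("a", "c"): 1.0}
tightness = mean_feature_interrelation(profile, ["a", "b", "c"])
```

Here the pair (b, c) is not in the profile, so it contributes the imputation value 0.0 to the mean.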
- abstract mean_interrelation_value()
Provides mean value of all interrelation values within the profile, including imputed values. Implemented individually within different FeatureInterrelation types, due to imputation differences.
- mean_raw_interrelation_value()
Provides mean value of all explicit interrelation values within the profile, i.e. not counting in the imputation values. Ignores self-relations.
- Returns
the mean explicit interrelation value as a float
- mean_self_relation_value()
Provides mean value of all self-relation values within the profile. Ignores interrelations.
- Returns
the mean self-relation value as a float
- num_features()
Provides the count of all features (individual features, not their interrelations) that occur within the profile.
- Returns
count of all features
- num_max_interrelations()
Provides the count of all possible feature interrelations that can exist within the profile, based solely on the amount of observed features.
- Returns
count of all possible interrelations
- num_raw_interrelations()
Provides the count of all interrelations that explicitly occur within the profile, i.e. all interrelations that are non-imputed, non-self-relations.
- Returns
count of all explicit interrelation values
- static row_zscore(mean, standard_deviation, row, *, input_column_name='value', output_column_name='value')
A drop-in static method to calculate the Z-score from a value, mean and standard deviation. The Z-score, also known as the standard score, is the number of standard deviations a given value is from the mean.
- Parameters
mean – the mean value within the distribution
standard_deviation – the standard deviation within the distribution
row – a Pandas row to calculate the z score for
input_column_name – column name for the processed value, default ‘value’
output_column_name – column name for the output z-score value, default ‘value’ (i.e. overwrite)
- Returns
modified row containing the z-score
- select_all()
Provides all explicit feature relations (both self-relations and interrelations) within the profile as a DataFrame. The DataFrame itself is also directly accessible through InterrelationProfile.df
- Returns
all feature relations as a Pandas DataFrame
- select_major_interrelations(zscore_cutoff=1.0)
Provides all explicit feature interrelations within the profile that are higher or lower than the profile average by at least the number of standard deviations given by zscore_cutoff. In other words, zscore_cutoff is a relative relation strength cutoff, expressed as the number of standard deviations an individual interrelation value lies from the mean.
Returns a selection of the DataFrame. The DataFrame itself is also directly accessible through InterrelationProfile.df
- Parameters
zscore_cutoff – Relative relation strength cutoff value. Default 1.0
- Returns
major feature interrelations as a Pandas DataFrame
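The cutoff logic can be sketched without pandas as below. The helper `major_relations`, the inclusive comparison, and the use of the population standard deviation over explicit values only are assumptions; the library returns a DataFrame selection and may compute the mean and deviation differently.

```python
from statistics import mean, pstdev


def major_relations(values, zscore_cutoff=1.0):
    """Keep values at least zscore_cutoff standard deviations from the mean (sketch)."""
    data = list(values.values())
    mu = mean(data)
    # Assumption: population standard deviation over the explicit values only.
    sigma = pstdev(data)
    return {k: v for k, v in values.items() if abs(v - mu) >= zscore_cutoff * sigma}


values = {("a", "b"): 1.0, ("a", "c"): 5.0, ("b", "c"): 5.0, ("c", "d"): 9.0}
selected = major_relations(values)
```

Both unusually strong and unusually weak relations are kept, since the comparison uses the absolute distance from the mean.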
- select_major_self_relations(zscore_cutoff=1.0)
Provides all explicit feature self-relations within the profile that are higher or lower than the profile average by at least the number of standard deviations given by zscore_cutoff. In other words, zscore_cutoff is a relative relation strength cutoff, expressed as the number of standard deviations an individual self-relation value lies from the mean.
Returns a selection of the DataFrame. The DataFrame itself is also directly accessible through InterrelationProfile.df
- Parameters
zscore_cutoff – Relative relation strength cutoff value. Default 1.0
- Returns
major feature self-relations as a Pandas DataFrame
- select_raw_interrelations(selection=None)
Provides all explicit (i.e. not imputed) feature interrelations (i.e. not self-relations) within the profile as a DataFrame subset selection.
- Returns
feature interrelations as a Pandas DataFrame subset
- select_raw_interrelations_involving(features, depth=0)
Selects all raw interrelations associated with the given feature or group of features, either directly (depth = 0) or by extension through other relations (depth of 1, 2, …)
- Parameters
features – a feature or a list of features whose interrelations are to be selected
depth – how far the interrelations are tracked. Default 0, i.e. only direct interrelations.
- Returns
the associated interrelations as a Pandas DataFrame subset
- select_self_relations(selection=None)
Provides all feature self-relations within the profile as a DataFrame subset selection.
- Returns
feature self-relations as a Pandas DataFrame subset
- self_relations_dict()
Returns self-relation values of all features in the profile as a dictionary.
- Returns
a dictionary of {feature: self_interrelation_value} pairs
- abstract standard_interrelation_deviation()
Provides standard deviation for all interrelation values within the profile, including imputed values. Implemented individually within different FeatureInterrelation types, due to imputation differences.
- standard_raw_interrelation_deviation()
Provides standard deviation from all explicit (i.e. non-imputed) interrelation values within the profile. Ignores self-relations.
- Returns
The standard deviation as a float
- standard_self_relation_deviation()
Provides standard deviation from all self-relation values within the profile. Ignores interrelations.
- Returns
The standard deviation as a float
- to_csv(target_file=None)
Export the interrelation matrix to a CSV file
- Parameters
target_file – the path or buffer to the export. Default None
- Returns
None or string
- to_distance_matrix(selection=None, *, distance_conversion_function=None, zero_self_relations=True)
Transforms the interrelation profile, or its subset provided by the selection, into an explicit distance matrix based on interrelation values. Without a specified selection, the entire interrelation profile is converted.
- Parameters
selection – an optional subset to form the explicit matrix on. Default None.
distance_conversion_function – f(x) to transform the interrelation value x into distance. Default 1/x+1.
zero_self_relations – turns distances of all features to themselves into explicit 0. Default True.
- Returns
the explicit interrelation table as a DataFrame
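A rough sketch of turning pair values into a square distance matrix follows. The documented default conversion is written as 1/x+1; the sketch reads it as 1/(x+1), which is an assumption. The nested-dict output, the `missing` fallback value, and the name `distance_matrix` are likewise illustrative; the library returns a DataFrame.

```python
def distance_matrix(values, features, convert=lambda x: 1 / (x + 1), missing=0.0):
    """Build a nested-dict distance matrix from symmetric pair values (sketch)."""
    matrix = {}
    for f1 in features:
        matrix[f1] = {}
        for f2 in features:
            if f1 == f2:
                # Mirrors zero_self_relations=True: a feature is at distance 0 from itself.
                matrix[f1][f2] = 0.0
                continue
            # Pairs are stored once, so look the key up in both orders.
            value = values.get((f1, f2), values.get((f2, f1), missing))
            matrix[f1][f2] = convert(value)
    return matrix


matrix = distance_matrix({("a", "b"): 3.0}, ["a", "b", "c"])
```

With this conversion, stronger interrelations map to shorter distances, and pairs with a value of 0 sit at distance 1.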
- to_explicit_matrix(selection=None)
Transforms the interrelation profile, or its subset provided by the selection, into an explicit square matrix of interrelation values. Without a specified selection, the entire interrelation profile is converted.
- Parameters
selection – an optional subset to form the explicit matrix on. Default None.
- Returns
the explicit interrelation table as a DataFrame
- class fip.profiles.PointwiseJeffreysDivergenceProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)
Bases:
fip.profiles.PointwiseKLDivergenceProfile
An interrelation profile consisting of pointwise Jeffreys divergence values, a measure of statistical distances for each feature pair, between its observed co-occurrence in the characterized set and its observed co-occurrence in a reference set - and vice versa. A symmetric variant of the KL divergence.
- classmethod from_cooccurrence_probability_profiles(cooccurrence_probability_profile, reference_probability_profile, *args, **kwargs)
Creates a pointwise Jeffreys divergence interrelation profile quantifying how well the co-occurrence probabilities in the given interrelation profile match those in the given reference interrelation profile.
pJD(F1|F2) == pJD(F2|F1) = abs(pKLD(F1|F2)) + abs(pKLD(F2|F1))
where F1 and F2 are observed features, pKLD(F1|F2) is their pointwise KL divergence.
- Parameters
cooccurrence_probability_profile – first CooccurrenceProbabilityProfile instance
reference_probability_profile – second CooccurrenceProbabilityProfile instance
args – any further arguments to be passed to the InterrelationProfile init
kwargs – any further keyword arguments to be passed to the InterrelationProfile init
- Returns
PointwiseJeffreysDivergenceProfile instance
- class fip.profiles.PointwiseKLDivergenceProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)
Bases:
fip.profiles.InterrelationProfile
An interrelation profile consisting of pointwise Kullback–Leibler divergence values, a measure of statistical distances for each feature pair, between its observed co-occurrence in the characterized set and its observed co-occurrence in a reference set.
- classmethod from_cooccurrence_probability_profiles(cooccurrence_probability_profile, reference_probability_profile, *args, **kwargs)
Creates a pointwise KL divergence interrelation profile quantifying how well the co-occurrence probabilities in the given interrelation profile match those in the given reference interrelation profile.
pKLD(F1|F2) = log2( P(F1|F2) / Q(F1|F2) )
where F1 and F2 are observed features, P(F1|F2) is their co-occurrence probability within the evaluated interrelation profile, and Q(F1|F2) is the same within the reference interrelation profile.
- Parameters
cooccurrence_probability_profile – the CooccurrenceProbabilityProfile instance to be evaluated
reference_probability_profile – the CooccurrenceProbabilityProfile instance to serve as a reference
args – any further arguments to be passed to the InterrelationProfile init
kwargs – any further keyword arguments to be passed to the InterrelationProfile init
- Returns
PointwiseKLDivergenceProfile instance
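The pKLD formula above can be evaluated for a single feature pair as in this sketch. The helper `pointwise_kld` is hypothetical and takes the two probabilities directly; it assumes both are strictly positive, so pairs missing from either profile would need imputed probabilities first.

```python
from math import log2


def pointwise_kld(p, q):
    """pKLD = log2(P / Q) for one feature pair (hypothetical helper)."""
    # p: co-occurrence probability in the evaluated profile
    # q: co-occurrence probability in the reference profile
    return log2(p / q)


enriched = pointwise_kld(0.5, 0.125)  # pair is more common in the evaluated set
depleted = pointwise_kld(0.125, 0.5)  # pair is more common in the reference set
```

Positive values indicate pairs over-represented in the evaluated profile, negative values indicate pairs over-represented in the reference.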
- get_imputation_value(*args)
Pointwise KL imputation for the case where the features co-occur in neither the evaluated nor the reference interrelation profile. It is based on the imputation probability for the individual feature profiles.
- Returns
Imputation pointwise KLD for feature co-occurrence appearing in neither profile
- mean_interrelation_value()
Provides mean value of all interrelation values within the profile, including imputed values. Ignores self-relations.
- Returns
the mean interrelation value as a float
- relative_feature_divergence(features)
Provides the relative feature divergence, i.e. a relative feature tightness (RFT) measure for a given set of features, against this pointwise KL divergence profile between two interrelation profiles. The value quantifies how closely the feature co-occurrence combinations in the provided feature vector match the interrelations prevalent in the source profile (positive values) versus those more prevalent in the reference profile (negative values).
- Parameters
features – an iterable of features
- Returns
Feature divergence value as a float
- standard_interrelation_deviation()
Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.
- Returns
the standard deviation value of interrelations as a float
- class fip.profiles.PointwiseMutualInformationProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)
Bases:
fip.profiles.InterrelationProfile
An interrelation profile consisting of Pointwise Mutual Information (PMI) values observed between the features present in the characterized set. PMI is a measure of association between two features - a ratio between the observed co-occurrence probability of the feature pair, versus the projected co-occurrence if the features were completely independent, based on their individual occurrence rate.
- classmethod from_cooccurrence_probability_profile(cooccurrence_probability_profile, *args, **kwargs)
Generate a PMI interrelation profile.
- Parameters
cooccurrence_probability_profile – the source CooccurrenceProbabilityProfile instance
args – any further arguments to be passed to the InterrelationProfile init
kwargs – any further keyword arguments to be passed to the InterrelationProfile init
- Returns
PointwiseMutualInformationProfile instance
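For reference, the standard PMI definition for one feature pair can be written as below. The base-2 logarithm is chosen to match the log2 used elsewhere in this module's formulas; the helper name `pmi` and its scalar arguments are illustrative and not part of the library's API.

```python
from math import log2


def pmi(p_pair, p_first, p_second):
    """Pointwise mutual information in bits for one feature pair."""
    # Ratio of the observed co-occurrence probability to the probability
    # expected if the two features were independent.
    return log2(p_pair / (p_first * p_second))


associated = pmi(0.25, 0.5, 0.25)    # co-occur more often than by chance
independent = pmi(0.125, 0.5, 0.25)  # co-occur exactly as often as by chance
```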
- get_imputation_value(feature1, feature2)
PMI imputation is based on the assumption that the two least frequently occurring features within the set can be expected to have no interrelation, i.e. their PMI value would be 0. For more frequently occurring features, the lack of co-occurrence is a correspondingly larger surprise, i.e. the imputed PMI values become negative, meaning that the features co-occur less often than would be expected from their individual occurrence probabilities if they were independent. The imputation PMI value (iPMI) for two features is therefore computed as:
iPMI(base) = log2[(p_least_common_feature * p_least_common_feature) / (p_least_common_feature * p_least_common_feature)] = log2[1] = 0
based on which:
iPMI(feature1, feature2) = log2[p_least_common_feature**2 / (p_feature1 * p_feature2)]
where p_least_common_feature is the stand-alone occurrence probability for the least common feature within the profile, and p_feature1 and p_feature2 are the stand-alone occurrence probabilities for the features whose PMI is being imputed.
- Parameters
feature1 – First feature for PMI imputation
feature2 – Second feature for PMI imputation
- Returns
The pair PMI based on imputed values
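The iPMI formula above can be checked numerically with a short sketch. The function `imputed_pmi` and the example probabilities are hypothetical; only the formula itself comes from the description.

```python
from math import log2


def imputed_pmi(p_feature1, p_feature2, p_least_common):
    """iPMI(feature1, feature2) = log2(p_least_common**2 / (p_feature1 * p_feature2))."""
    return log2(p_least_common ** 2 / (p_feature1 * p_feature2))


# Illustrative stand-alone occurrence probabilities.
standalone = {"a": 0.5, "b": 0.25, "c": 0.125}
p_min = min(standalone.values())

base = imputed_pmi(standalone["c"], standalone["c"], p_min)    # least common feature with itself
common = imputed_pmi(standalone["a"], standalone["b"], p_min)  # two frequent features
```

The base case evaluates to 0, while the pair of frequent features receives a negative value (here -3 bits), reflecting that their lack of co-occurrence is more surprising.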
- mean_interrelation_value()
Provides mean value of all PMI values within the profile, including imputed values. Ignores self-relations.
- Returns
the mean PMI value as a float
- relative_feature_tightness(features)
Provides the relative feature tightness (RFT) measure for a given set of features. RFT quantifies how well the feature co-occurrence combinations in the provided feature vector match the interrelations within this reference (Z)PMI profile.
- Parameters
features – an iterable of features
- Returns
Feature tightness value as a float
- standard_interrelation_deviation()
Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.
- Returns
the standard deviation value of interrelations as a float