fip package

Submodules

fip.chem module

fip.chem.rdmol2brics_blocs_smiles(mol, min_fragment_size=1)

Decomposes the provided rdmol instance into BRICS fragment SMILES, as designed by Degen et al. in On the Art of Compiling and Using ‘Drug-Like’ Chemical Fragment Spaces, https://doi.org/10.1002/cmdc.200800178

Parameters
  • mol – the source RDKit Mol instance

  • min_fragment_size – passed to the minFragmentSize of the RDKit implementation

Returns

SMILES of the BRICS fragments

fip.chem.rdmol2morgan_feature_smiles(mol, radius=3, min_radius=1, *, all_bonds_explicit=False, canonical_smiles=True, isomeric_smiles=False, all_H_explicit=True)

Breaks a molecule, given as an RDKit Mol instance, into a set of ECFP-like fragments with selected radius, and returns them in SMILES notation.

Parameters
  • mol – the molecule for fragmenting, as RDKit Mol

  • radius – EC fragment radius

  • min_radius – minimal fragment radius to consider, default 1. Can be set to ignore lower-scope fragments.

  • all_bonds_explicit – boolean indicating whether all bond orders will be explicitly stated in the output. Default False.

  • canonical_smiles – boolean indicating whether to attempt to canonicalize the fragment SMILES. Default True.

  • isomeric_smiles – boolean indicating whether to include stereo information in the fragments. Default False.

  • all_H_explicit – boolean indicating whether to explicitly include all hydrogen atoms. Default True.

Returns

a set of SMILES strings

fip.chem.rdmol2smiles(mol)

Simple conversion of an RDKit Mol instance into a SMILES string. Wrapped in case some standardization/postprocessing is needed.

Parameters

mol – RDKit Mol instance

Returns

SMILES string

fip.chem.rdmol_bonds2fragment_smiles(mol, bonds, *, all_bonds_explicit=False, canonical_smiles=True, isomeric_smiles=False, all_H_explicit=True)

Generates a fragment in SMILES notation from a molecule given as an RDKit Mol instance, based on the provided bond IDs within the molecule.

Parameters
  • mol – the molecule containing the fragment, as RDKit Mol

  • bonds – a set of bonds that are part of the fragment

  • all_bonds_explicit – boolean indicating whether all bond orders will be explicitly stated in the output. Default False.

  • canonical_smiles – boolean indicating whether to attempt to canonicalize the fragment SMILES. Default True.

  • isomeric_smiles – boolean indicating whether to include stereo information in the fragments. Default False.

  • all_H_explicit – boolean indicating whether to explicitly include all hydrogen atoms. Default True.

Returns

a SMILES string of the fragment, None if there are no atoms matched

fip.chem.rdmol_has_substructure_pattern(rdmol, pattern)

Direct wrap of RDKit HasSubstructMatch. Wrapped in case some standardization/postprocessing is needed.

Parameters
  • rdmol – RDKit Mol instance to search the pattern in

  • pattern – the pattern to search, also in RDKit Mol form

Returns

bool or None (in case of RDKit Mol instance error)

fip.chem.rdmol_locations2fragments_smiles(mol, fragment_locations, min_radius=0, *, all_bonds_explicit=False, canonical_smiles=True, isomeric_smiles=False, all_H_explicit=True)

Generates a set of fragments in SMILES notation from a molecule given as an RDKit Mol instance, based on provided atom indices and radii.

Parameters
  • mol – the molecule for fragmenting, as RDKit Mol

  • fragment_locations – an iterable of fragment locations, in (atom, radius) tuples

  • min_radius – minimal feature radius to consider, default 0. Can be set to ignore lower-scope features.

  • all_bonds_explicit – boolean indicating whether all bond orders will be explicitly stated in the output. Default False.

  • canonical_smiles – boolean indicating whether to attempt to canonicalize the fragment SMILES. Default True.

  • isomeric_smiles – boolean indicating whether to include stereo information in the fragments. Default False.

  • all_H_explicit – boolean indicating whether to explicitly include all hydrogen atoms. Default True.

Returns

a set of SMILES strings

fip.chem.sdf2rdmols(sdf_path)

Converts an SDF file to RDKit Mol instances. Wrap of RDKit SDMolSupplier, modified to omit RDKit Mol instances that can’t be parsed.

Parameters

sdf_path – Path to the SDF file

Returns

a generator yielding RDKit Mol instances

fip.chem.smarts2rdmol(smarts)

Simple conversion of a SMARTS string into an RDKit Mol instance. Wrapped in case some standardization/postprocessing is needed.

Parameters

smarts – SMARTS string

Returns

RDKit Mol instance

fip.chem.smiles2rdmol(smiles)

Simple conversion of a SMILES string into an RDKit Mol instance. Wrapped in case some standardization/postprocessing is needed.

Parameters

smiles – SMILES string

Returns

RDKit Mol instance

fip.chem.standardize_mol(mol, *, remove_hydrogens=True, remove_stereo=True)

Simple structure standardization.

Parameters
  • mol – RDKit Mol instance to standardize

  • remove_hydrogens – whether to remove all hydrogens using RDKit RemoveAllHs

  • remove_stereo – whether to remove stereo information using RDKit RemoveStereochemistry

Returns

standardized RDKit Mol instance

fip.profiles module

class fip.profiles.CooccurrenceProbabilityProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: fip.profiles.InterrelationProfile

A co-occurrence probability profile, holding information about the observed probabilities for feature pair co-occurrences within the characterized set.

classmethod from_cooccurrence_profile(cooccurrence_profile, *args, vector_count=None, **kwargs)

Creates a CooccurrenceProbabilityProfile from the provided CooccurrenceProfile. Calculating the probabilities requires the overall count of feature vectors. The count is either provided explicitly through the vector_count kwarg, or inferred from cooccurrence_profile.attrs[‘vector_count’] or, failing that, taken as the maximum of its ‘count’ column.

Parameters
  • cooccurrence_profile – the source CooccurrenceProfile instance

  • args – any further arguments to be passed to the InterrelationProfile init

  • vector_count – explicit count of feature vectors, i.e. samples, to manually adjust the probabilities

  • kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

a CooccurrenceProbabilityProfile instance
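The conversion from counts to probabilities can be sketched with plain dictionaries. This is only an illustration under stated assumptions: the dictionary keyed by (feature1, feature2) tuples and the function name cooccurrence_probabilities are hypothetical and are not part of the fip API.

```python
def cooccurrence_probabilities(counts, vector_count=None):
    """Turn raw {(f1, f2): count} entries into observed probabilities."""
    if vector_count is None:
        # Fallback as described above: use the largest observed count
        # as the number of feature vectors.
        vector_count = max(counts.values())
    return {pair: count / vector_count for pair, count in counts.items()}


# Hypothetical counts; self-relations (f, f) hold plain occurrence counts.
counts = {("a", "a"): 3, ("a", "b"): 2, ("b", "b"): 2}

probabilities = cooccurrence_probabilities(counts, vector_count=4)
print(probabilities[("a", "b")])  # 0.5

inferred = cooccurrence_probabilities(counts)
print(inferred[("a", "a")])  # 1.0
```

Note that the fallback can only underestimate the true number of vectors, so passing vector_count explicitly is the safer choice whenever it is known.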

get_imputation_value(*args)

Interrelation probability imputation based on the “most-optimistic” scenario, i.e. that the co-occurrence would happen in the n+1 sample.

Returns

the imputation value as a float

imputable_standalone_probabilities(*args, vector_count=None)

Calculates standalone probabilities for individual features, to be used in the “most-optimistic” imputation scheme, i.e. presuming that the n+1 feature vector will contain all observed features co-occurring.

Parameters

vector_count – explicit count of feature vectors, i.e. samples, to optionally manually adjust the probabilities

Returns

a dictionary of feature:probability

mean_interrelation_value()

Provides mean value of all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns

the mean interrelation value as a float

standard_interrelation_deviation()

Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns

the standard deviation value of interrelations as a float

class fip.profiles.CooccurrenceProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: fip.profiles.InterrelationProfile

A feature co-occurrence profile, which is usually the core profile for further interrelation analysis. Holds raw counts of how many times the features occur and co-occur within the characterized set.

add_another_cooccurrence_profile(other)

Adds the contents of another co-occurrence profile to this one, in full outer join fashion. The addition is done inplace, i.e. on this instance.

Parameters

other – another CooccurrenceProfile instance

Returns

self
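The full outer join behaviour can be illustrated with two collections.Counter objects. The Counter representation is an assumption made for this example only; the class itself works on a DataFrame.

```python
from collections import Counter

# Hypothetical raw counts keyed by (feature1, feature2).
first = Counter({("a", "a"): 2, ("a", "b"): 1})
second = Counter({("a", "a"): 1, ("b", "b"): 3})

# In-place addition: shared pairs are summed, and pairs present in
# only one of the two profiles are kept as they are.
first.update(second)

print(sorted(first.items()))
# [(('a', 'a'), 3), (('a', 'b'), 1), (('b', 'b'), 3)]
```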

classmethod from_feature_lists(feature_lists, *args, **kwargs)

Generate a Co-occurrence profile from an iterable of feature lists.

Parameters
  • feature_lists – the iterable of feature lists to derive the CooccurrenceProfile from

  • args – any further arguments to be passed to the CooccurrenceProfile init

  • kwargs – any further keyword arguments to be passed to the CooccurrenceProfile init

Returns

the corresponding CooccurrenceProfile instance
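A minimal sketch of co-occurrence counting over feature lists, using only the standard library. It assumes that each feature is counted at most once per list and that self-relations (f, f) carry the plain occurrence counts; the actual implementation may differ. The name count_cooccurrences is hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def count_cooccurrences(feature_lists):
    """Count how many feature lists contain each (feature1, feature2) pair."""
    counts = Counter()
    for features in feature_lists:
        # Deduplicate and sort so that each pair is counted once per list
        # and always appears in the same (f1, f2) order.
        unique = sorted(set(features))
        # combinations_with_replacement also yields self-relations (f, f).
        counts.update(combinations_with_replacement(unique, 2))
    return counts


feature_lists = [["a", "b", "c"], ["a", "b"], ["a"]]
counts = count_cooccurrences(feature_lists)
print(counts[("a", "a")], counts[("a", "b")], counts[("b", "c")])  # 3 2 1
```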

classmethod from_feature_lists_split_on_feature(iterable, feature)

Generates and returns two CooccurrenceProfile instances from the provided iterable and the provided feature. Returns a main profile built from the feature vectors containing the given feature, and a reference profile built from all other feature vectors.

Parameters
  • iterable – the iterable of feature lists to derive the CooccurrenceProfile from

  • feature – the feature to split the set on

Returns

a tuple of the CooccurrenceProfile instances (with_feature, without_feature)

get_imputation_value(*args)

Returns interrelation imputation value. For co-occurrences, this is flat 0, unless z-scored.

Returns

imputation value

mean_interrelation_value()

Provides mean value of all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns

the mean interrelation value as a float

standard_interrelation_deviation()

Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns

the standard deviation value of interrelations as a float

class fip.profiles.InterrelationProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: object

A generic parent class representing an interrelation profile, and implementing their common functionality. Not meant for instantiation.

convert_to_zscore()

Converts the values within the InterrelationProfile into Z-scores, i.e. subtracts mean, divides by standard deviation.

Returns

None; the InterrelationProfile is modified in place

distinct_features(selection=None)

Provides a set of all distinct features present within the interrelation profile. Optionally, can be provided a selection containing interrelation profile subset, to return distinct features within that subset.

Parameters

selection – a subset DataFrame for the interrelation profile, to narrow the scope. Optional, default None.

Returns

feature names as a set of strings

static features2cooccurrences(features, *, omit_self_relations=False)

Processes an iterable of features into a string set of feature co-occurrences.

Parameters
  • features – an iterable of features to process, e.g. (feature1, feature2, feature4, …)

  • omit_self_relations – Whether to ignore feature self-relations, i.e. (f1, f1). Default False.

Returns

a co-occurrence generator

features_interrelation_values(features, *, omit_self_relations=False)

Yields interrelation values within the profile for all features within a given feature list. Includes imputed values.

Parameters
  • features – features to look up within the profile

  • omit_self_relations – whether to omit self-relations in the lookup, default False.

Returns

a generator yielding the interrelation values, usually floats or ints

classmethod from_dataframe(dataframe, *args, **kwargs)

Loads an interrelation profile from a dataframe containing ‘feature1’, ‘feature2’ and ‘value’ columns.

Parameters
  • dataframe – the dataframe to load

  • zscore – whether to produce a z-scored profile instead of raw values. Keyword argument, default False.

  • min_cutoff_value – if defined, drop all interrelations with values below the given limit. Default None.

  • args – any further arguments to be passed to the InterrelationProfile init

  • kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

the corresponding InterrelationProfile instance

classmethod from_dict(value_dict, *args, **kwargs)

Loads an interrelation profile from a dictionary of {(feature1, feature2): value}. feature1, feature2 are handled as strings and serve as a multiindex for the value.

Parameters
  • value_dict – the dictionary to load

  • zscore – whether to produce a z-scored profile instead of raw values. Keyword argument, default False.

  • min_cutoff_value – if defined, drop all interrelations with values below the given limit. Default None.

  • args – any further arguments to be passed to the InterrelationProfile init

  • kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

the corresponding InterrelationProfile instance

abstract get_imputation_value(f1, f2)

Provides the imputation value for a feature pair that does not occur within the interrelation profile. Implemented individually within different FeatureInterrelation types, due to imputation differences.

interrelation_value(f1, f2=None)

Returns the interrelation value for the feature pair provided in the arguments. If second argument f2 is not filled in, returns self-relation (f1, f1).

Parameters
  • f1 – the first feature

  • f2 – the second feature, default None

Returns

The interrelation value between the two features within the profile. Usually int or float.

iterate_feature_interrelations()

Yields all interrelation values between features in the profile, as tuples. Omits self-relations of features. Includes imputed values.

Returns

yields tuples of (feature1, feature2, interrelation_value)

mean_feature_interrelation_value(features, *, omit_self_relations=True)

Returns the mean interrelation value within the profile for all features within a given feature list. Corresponds to feature “tightness” measure for profiles such as (Z)PMI and (Z)PKLD. Includes imputed values.

Parameters
  • features – features to look up within the profile

  • omit_self_relations – whether to omit self-relations in the lookup, default True.

Returns

mean interrelation value, as a float
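The idea of averaging pairwise values with an imputation fallback can be sketched as follows. The dictionary representation, the single flat imputation_value, and the function name are assumptions for illustration; in the profiles documented here, imputation is computed per profile type.

```python
from itertools import combinations
from statistics import mean


def mean_feature_interrelation(values, features, imputation_value=0.0):
    """Mean value over all distinct feature pairs, self-relations omitted."""
    # Sorting gives every pair a fixed (f1, f2) order for the lookup.
    pairs = list(combinations(sorted(set(features)), 2))
    # Pairs missing from the explicit values fall back to the imputation value.
    return mean([values.get(pair, imputation_value) for pair in pairs])


# Hypothetical explicit interrelation values; ("b", "c") is not present.
values = {("a", "b"): 2.0, ("a", "c"): 1.0}
print(mean_feature_interrelation(values, ["a", "b", "c"]))  # 1.0
```

With fewer than two distinct features there are no pairs to average, and statistics.mean raises StatisticsError.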

abstract mean_interrelation_value()

Provides mean value of all interrelation values within the profile, including imputed values. Implemented individually within different FeatureInterrelation types, due to imputation differences.

mean_raw_interrelation_value()

Provides mean value of all explicit interrelation values within the profile, i.e. not counting in the imputation values. Ignores self-relations.

Returns

the mean explicit interrelation value as a float

mean_self_relation_value()

Provides mean value of all self-relation values within the profile. Ignores interrelations.

Returns

the mean self-relation value as a float

num_features()

Provides the count of all features (individual features, not their interrelations) that occur within the profile.

Returns

count of all features

num_max_interrelations()

Provides the count of all possible feature interrelations that can exist within the profile, based solely on the amount of observed features.

Returns

count of all possible interrelations

num_raw_interrelations()

Provides the count of all interrelations that explicitly occur within the profile, i.e. all interrelations that are non-imputed, non-self-relations.

Returns

count of all explicit interrelation values

static row_zscore(mean, standard_deviation, row, *, input_column_name='value', output_column_name='value')

A drop-in static method to calculate Z-score from a value, mean and standard deviation. Z-score, aka. standard score, is the number of standard deviations a given value is from the mean.

Parameters
  • mean – the mean value within the distribution

  • standard_deviation – the standard deviation within the distribution

  • row – a Pandas row to calculate the z score for

  • input_column_name – column name for the processed value, default ‘value’

  • output_column_name – column name for the output z-score value, default ‘value’ (i.e. overwrite)

Returns

modified row containing the z-score
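The Z-score calculation itself is a one-liner. The sketch below uses the population standard deviation from the statistics module; whether the library uses the population or the sample standard deviation is not stated here, so treat that choice as an assumption.

```python
from statistics import mean, pstdev


def zscore(value, mean_value, standard_deviation):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean_value) / standard_deviation


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu = mean(values)       # 5.0
sigma = pstdev(values)  # 2.0 (population standard deviation)
print(zscore(9.0, mu, sigma))  # 2.0
```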

select_all()

Provides all explicit feature relations (both self-relations and interrelations) within the profile as a DataFrame. The DataFrame itself is also directly accessible through InterrelationProfile.df

Returns

all feature relations as a Pandas DataFrame

select_major_interrelations(zscore_cutoff=1.0)

Provides all explicit feature interrelations within the profile that are higher or lower than the profile average by at least the number of standard deviations given by the zscore_cutoff value. In other words, zscore_cutoff is the relative relation strength cutoff, expressed as the number of standard deviations of the individual interrelation values from their mean.

Returns a selection of the DataFrame. The DataFrame itself is also directly accessible through InterrelationProfile.df

Parameters

zscore_cutoff – Relative relation strength cutoff value. Default 1.0

Returns

major feature interrelations as a Pandas DataFrame
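The cutoff logic can be sketched on a plain dictionary of values. The population standard deviation, the inclusive comparison, and the name select_major are assumptions for this example; the method itself operates on the profile DataFrame.

```python
from statistics import mean, pstdev


def select_major(values, zscore_cutoff=1.0):
    """Keep pairs whose value deviates from the mean by at least the cutoff."""
    observed = list(values.values())
    mu = mean(observed)
    sigma = pstdev(observed)
    return {
        pair: value
        for pair, value in values.items()
        if abs(value - mu) >= zscore_cutoff * sigma
    }


# Hypothetical interrelation values with one clear outlier.
values = {("a", "b"): 1.0, ("a", "c"): 2.0, ("b", "c"): 3.0, ("a", "d"): 10.0}
print(select_major(values))  # {('a', 'd'): 10.0}
```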

select_major_self_relations(zscore_cutoff=1.0)

Provides all explicit feature self-relations within the profile that are higher or lower than the profile average by at least the number of standard deviations given by the zscore_cutoff value. In other words, zscore_cutoff is the relative relation strength cutoff, expressed as the number of standard deviations of the individual self-relation values from their mean.

Returns a selection of the DataFrame. The DataFrame itself is also directly accessible through InterrelationProfile.df

Parameters

zscore_cutoff – Relative relation strength cutoff value. Default 1.0

Returns

major feature self-relations as a Pandas DataFrame

select_raw_interrelations(selection=None)

Provides all explicit (i.e. not imputed) feature interrelations (i.e. not self-relations) within the profile as a DataFrame subset selection.

Returns

feature interrelations as a Pandas DataFrame subset

select_raw_interrelations_involving(features, depth=0)

Selects all raw interrelations associated with the given feature or group of features, either directly (depth = 0) or by extension through other relations (depth of 1, 2, …)

Parameters
  • features – a feature or a list of features whose interrelations are to be selected

  • depth – how far the interrelations are tracked. Default 0, i.e. only direct interrelations.

Returns

the associated interrelations as a Pandas DataFrame subset
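One possible reading of the depth parameter is a step-wise expansion of the selected feature set through existing interrelations. The sketch below implements that reading on a list of pairs; both the interpretation and the name interrelations_involving are assumptions, not the library's confirmed behaviour.

```python
def interrelations_involving(pairs, features, depth=0):
    """Select pairs touching the given features, expanded depth times."""
    selected = set(features)
    for _ in range(depth):
        # Add every partner of an already selected feature.
        partners = set()
        for f1, f2 in pairs:
            if f1 in selected or f2 in selected:
                partners.update((f1, f2))
        selected |= partners
    # Self-relations (f, f) are excluded from the result.
    return [
        (f1, f2)
        for f1, f2 in pairs
        if f1 != f2 and (f1 in selected or f2 in selected)
    ]


# Hypothetical chain of interrelations: a-b, b-c, c-d, d-e.
pairs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(interrelations_involving(pairs, ["a"]))           # [('a', 'b')]
print(interrelations_involving(pairs, ["a"], depth=1))  # [('a', 'b'), ('b', 'c')]
```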

select_self_relations(selection=None)

Provides all feature self-relations within the profile as a DataFrame subset selection.

Returns

feature self-relations as a Pandas DataFrame subset

self_relations_dict()

Returns self-relation values of all features in the profile as a dictionary.

Returns

a dictionary of {feature: self_interrelation_value} pairs

abstract standard_interrelation_deviation()

Provides standard deviation for all interrelation values within the profile, including imputed values. Implemented individually within different FeatureInterrelation types, due to imputation differences.

standard_raw_interrelation_deviation()

Provides standard deviation from all explicit (i.e. non-imputed) interrelation values within the profile. Ignores self-relations.

Returns

The standard deviation as a float

standard_self_relation_deviation()

Provides standard deviation from all self-relation values within the profile. Ignores interrelations.

Returns

The standard deviation as a float

to_csv(target_file=None)

Exports the interrelation matrix to a CSV file.

Parameters

target_file – the path or buffer to the export. Default None

Returns

None or string

to_distance_matrix(selection=None, *, distance_conversion_function=None, zero_self_relations=True)

Transforms the interrelation profile, or its subset provided by the selection, into an explicit distance matrix based on interrelation values. Without a specified selection, the entire interrelation profile is converted.

Parameters
  • selection – an optional subset to form the explicit matrix on. Default None.

  • distance_conversion_function – f(x) to transform the interrelation value x into distance. Default 1/x+1.

  • zero_self_relations – turns distances of all features to themselves into explicit 0. Default True.

Returns

the explicit distance matrix as a DataFrame

to_explicit_matrix(selection=None)

Transforms the interrelation profile, or its subset provided by the selection, into an explicit square matrix of interrelation values. Without a specified selection, the entire interrelation profile is converted.

Parameters

selection – an optional subset to form the explicit matrix on. Default None.

Returns

the explicit interrelation table as a DataFrame

class fip.profiles.PointwiseJeffreysDivergenceProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: fip.profiles.PointwiseKLDivergenceProfile

An interrelation profile consisting of pointwise Jeffreys divergence values, a measure of statistical distances for each feature pair, between its observed co-occurrence in the characterized set and its observed co-occurrence in a reference set - and vice versa. A symmetric variant of the KL divergence.

classmethod from_cooccurrence_probability_profiles(cooccurrence_probability_profile, reference_probability_profile, *args, **kwargs)

Creates a pointwise Jeffreys divergence interrelation profile quantifying how well the co-occurrence probabilities in the given interrelation profile match those in the given reference interrelation profile.

pJD(F1|F2) == pJD(F2|F1) = abs(pKLD(F1|F2)) + abs(pKLD(F2|F1))

where F1 and F2 are observed features, pKLD(F1|F2) is their pointwise KL divergence.

Parameters
  • cooccurrence_probability_profile – first CooccurrenceProbabilityProfile instance

  • reference_probability_profile – second CooccurrenceProbabilityProfile instance

  • args – any further arguments to be passed to the InterrelationProfile init

  • kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

PointwiseJeffreysDivergenceProfile instance

class fip.profiles.PointwiseKLDivergenceProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: fip.profiles.InterrelationProfile

An interrelation profile consisting of pointwise Kullback–Leibler divergence values, a measure of statistical distances for each feature pair, between its observed co-occurrence in the characterized set and its observed co-occurrence in a reference set.

classmethod from_cooccurrence_probability_profiles(cooccurrence_probability_profile, reference_probability_profile, *args, **kwargs)

Creates a pointwise KL divergence interrelation profile quantifying how well the co-occurrence probabilities in the given interrelation profile match those in the given reference interrelation profile.

pKLD(F1|F2) = log2( P(F1|F2) / Q(F1|F2) )

where F1 and F2 are observed features, P(F1|F2) is their co-occurrence probability within the evaluated interrelation profile, and Q(F1|F2) is the same within the reference interrelation profile.

Parameters
  • cooccurrence_probability_profile – the CooccurrenceProbabilityProfile instance to be evaluated

  • reference_probability_profile – the CooccurrenceProbabilityProfile instance to serve as a reference

  • args – any further arguments to be passed to the InterrelationProfile init

  • kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

PointwiseKLDivergenceProfile instance
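The pKLD formula above can be evaluated directly for a single feature pair. In this sketch the two probabilities are passed in as plain floats, and the function name pointwise_kld is hypothetical.

```python
from math import log2


def pointwise_kld(p, q):
    """pKLD for one feature pair: log2 of evaluated over reference probability."""
    return log2(p / q)


# Pair is four times more probable in the evaluated set than in the reference.
print(pointwise_kld(0.5, 0.125))  # 2.0
# Swapping the roles of the two sets flips the sign.
print(pointwise_kld(0.125, 0.5))  # -2.0
```

Positive values therefore mark pairs that are over-represented in the evaluated profile, negative values mark pairs over-represented in the reference, and equal probabilities give 0.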

get_imputation_value(*args)

Pointwise KL imputation for the case that the features co-occur in neither the evaluated nor the reference interrelation profile. It is based on the imputation probability for the individual feature profiles.

Returns

Imputation pointwise KLD for feature co-occurrence appearing in neither profile

mean_interrelation_value()

Provides mean value of all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns

the mean interrelation value as a float

relative_feature_divergence(features)

Provides relative feature divergence, i.e. a relative feature tightness (RFT) measure for a given set of features, against this pointwise KL divergence profile between two interrelation profiles. The value quantifies how much the feature co-occurrence combination in the provided feature vector matches the interrelations prevalent in the source profile (positive values) compared to those more prevalent in the reference profile (negative values).

Parameters

features – an iterable of features

Returns

Feature divergence value as a float

standard_interrelation_deviation()

Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns

the standard deviation value of interrelations as a float

class fip.profiles.PointwiseMutualInformationProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: fip.profiles.InterrelationProfile

An interrelation profile consisting of Pointwise Mutual Information (PMI) values observed between the features present in the characterized set. PMI is a measure of association between two features - a ratio between the observed co-occurrence probability of the feature pair, versus the projected co-occurrence if the features were completely independent, based on their individual occurrence rate.

classmethod from_cooccurrence_probability_profile(cooccurrence_probability_profile, *args, **kwargs)

Generate a PMI interrelation profile.

Parameters
  • cooccurrence_probability_profile – the source CooccurrenceProbabilityProfile instance

  • args – any further arguments to be passed to the InterrelationProfile init

  • kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

PointwiseMutualInformationProfile instance
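For reference, the standard PMI definition for one feature pair is sketched below. The base-2 logarithm is an assumption, chosen to match the log2 used in the imputation formula of this class; the function name pmi is hypothetical.

```python
from math import log2


def pmi(p_joint, p_feature1, p_feature2):
    """Observed co-occurrence probability versus the one expected if independent."""
    return log2(p_joint / (p_feature1 * p_feature2))


# Co-occurrence twice as likely as expected under independence.
print(pmi(0.25, 0.5, 0.25))   # 1.0
# Co-occurrence exactly as likely as expected under independence.
print(pmi(0.125, 0.5, 0.25))  # 0.0
```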

get_imputation_value(feature1, feature2)

PMI imputation is based on the assumption that the two least frequently occurring features within the set can be expected to have no interrelation, i.e. their PMI value would be 0. For more frequently occurring features, their lack of co-occurrence is a correspondingly larger surprise, i.e. the imputed PMI values become negative, meaning that the features co-occur less than would be expected from their individual occurrence probabilities if they were independent. The imputation PMI value (iPMI) for two features is therefore computed as:

iPMI(base) = log2[(p_least_common_feature * p_least_common_feature) / (p_least_common_feature * p_least_common_feature)] = log2[1] = 0

based on which:

iPMI(feature1, feature2) = log2[p_least_common_feature**2 / (p_feature1 * p_feature2)]

where p_least_common_feature is the stand-alone occurrence probability of the least common feature within the profile, and p_feature1 and p_feature2 are the stand-alone occurrence probabilities of the features whose PMI is being imputed.

Parameters
  • feature1 – First feature for PMI imputation

  • feature2 – Second feature for PMI imputation

Returns

The pair PMI based on imputed values
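The iPMI formula above can be checked numerically with a few stand-alone probabilities. The probabilities and the function name imputed_pmi are made up for this illustration.

```python
from math import log2


def imputed_pmi(p_feature1, p_feature2, p_least_common_feature):
    """iPMI(feature1, feature2) as given by the formula above."""
    return log2(p_least_common_feature ** 2 / (p_feature1 * p_feature2))


# Hypothetical stand-alone occurrence probabilities.
standalone = {"a": 0.5, "b": 0.25, "c": 0.125}
p_least = min(standalone.values())

# The least common feature paired with itself gives the base value of 0.
print(imputed_pmi(standalone["c"], standalone["c"], p_least))  # 0.0
# More frequent features that never co-occur get a negative value.
print(imputed_pmi(standalone["a"], standalone["b"], p_least))  # -3.0
```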

mean_interrelation_value()

Provides mean value of all PMI values within the profile, including imputed values. Ignores self-relations.

Returns

the mean PMI value as a float

relative_feature_tightness(features)

Provides a relative feature tightness (RFT) measure for a given set of features. RFT quantifies how well the feature co-occurrence combination in the provided feature vector matches the interrelations within this reference (Z)PMI profile.

Parameters

features – an iterable of features

Returns

Feature tightness value as a float

standard_interrelation_deviation()

Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns

the standard deviation value of interrelations as a float

Module contents