fip package

Submodules

fip.chem module

fip.chem.rdmol2brics_blocs_smiles(mol, min_fragment_size=1)

Decomposes the provided RDKit Mol instance into BRICS fragment SMILES, as described by Degen et al. in “On the Art of Compiling and Using ‘Drug-Like’ Chemical Fragment Spaces”, https://doi.org/10.1002/cmdc.200800178

Parameters

mol – the source RDKit Mol instance
min_fragment_size – passed as the minFragmentSize argument to the RDKit BRICS implementation

Returns

SMILES of the BRICS fragments

fip.chem.rdmol2morgan_feature_smiles(mol, radius=3, min_radius=1, *, all_bonds_explicit=False, canonical_smiles=True, isomeric_smiles=False, all_H_explicit=True)

Breaks a molecule, given as an RDKit Mol instance, into a set of ECFP-like fragments with selected radius, and returns them in SMILES notation.

Parameters

mol – the molecule for fragmenting, as RDKit Mol
radius – EC fragment radius
min_radius – minimal fragment radius to consider, default 1. Can be set to ignore lower-scope fragments.
all_bonds_explicit – boolean indicating whether all bond orders will be explicitly stated in the output. Default False.
canonical_smiles – boolean indicating whether to attempt to make the fragment SMILES canonical. Default True.
isomeric_smiles – boolean indicating whether to include stereo information in the fragments. Default False.
all_H_explicit – boolean indicating whether to explicitly include all hydrogen atoms. Default True.

Returns

a set of SMILES strings

fip.chem.rdmol2smiles(mol)

Simple conversion of an RDKit Mol instance into a SMILES string. Wrapped in case some standardization/postprocessing is needed.

Parameters: mol – RDKit Mol instance
Returns: SMILES string

fip.chem.rdmol_bonds2fragment_smiles(mol, bonds, *, all_bonds_explicit=False, canonical_smiles=True, isomeric_smiles=False, all_H_explicit=True)

Generates a fragment in SMILES notation from a molecule given as an RDKit Mol instance, based on the provided bond IDs within the molecule.

Parameters

mol – the molecule containing the fragment, as RDKit Mol
bonds – a set of bonds that are part of the fragment
all_bonds_explicit – boolean indicating whether all bond orders will be explicitly stated in the output. Default False.
canonical_smiles – boolean indicating whether to attempt to make the fragment SMILES canonical. Default True.
isomeric_smiles – boolean indicating whether to include stereo information in the fragments. Default False.
all_H_explicit – boolean indicating whether to explicitly include all hydrogen atoms. Default True.

Returns

a SMILES string of the fragment, None if there are no atoms matched

fip.chem.rdmol_has_substructure_pattern(rdmol, pattern)

Direct wrapper around RDKit HasSubstructMatch. Wrapped in case some standardization/postprocessing is needed.

Parameters

rdmol – RDKit Mol instance to search the pattern in
pattern – the pattern to search, also in RDKit Mol form

Returns

bool or None (in case of RDKit Mol instance error)

fip.chem.rdmol_locations2fragments_smiles(mol, fragment_locations, min_radius=0, *, all_bonds_explicit=False, canonical_smiles=True, isomeric_smiles=False, all_H_explicit=True)

Generates a set of fragments in SMILES notation from a molecule given as an RDKit Mol instance, based on provided atom indices and radii.

Parameters

mol – the molecule for fragmenting, as RDKit Mol
fragment_locations – an iterable of fragment locations, in (atom, radius) tuples
min_radius – minimal feature radius to consider, default 0. Can be set to ignore lower-scope features.
all_bonds_explicit – boolean indicating whether all bond orders will be explicitly stated in the output. Default False.
canonical_smiles – boolean indicating whether to attempt to make the fragment SMILES canonical. Default True.
isomeric_smiles – boolean indicating whether to include stereo information in the fragments. Default False.
all_H_explicit – boolean indicating whether to explicitly include all hydrogen atoms. Default True.

Returns

a set of SMILES strings

fip.chem.sdf2rdmols(sdf_path)

Converts an SDF file to RDKit Mol instances. Wraps RDKit SDMolSupplier, modified to skip entries that cannot be parsed into an RDKit Mol instance.

Parameters: sdf_path – Path to the SDF file
Returns: a generator yielding RDKit Mol instances

fip.chem.smarts2rdmol(smarts)

Simple conversion of a SMARTS string into an RDKit Mol instance. Wrapped in case some standardization/postprocessing is needed.

Parameters: smarts – SMARTS string
Returns: RDKit Mol instance

fip.chem.smiles2rdmol(smiles)

Simple conversion of a SMILES string into an RDKit Mol instance. Wrapped in case some standardization/postprocessing is needed.

Parameters: smiles – SMILES string
Returns: RDKit Mol instance

fip.chem.standardize_mol(mol, *, remove_hydrogens=True, remove_stereo=True)

Simple structure standardization.

Parameters

mol – RDKit Mol instance to standardize
remove_hydrogens – whether to remove all hydrogens using RDKit RemoveAllHs
remove_stereo – whether to remove stereo information using RDKit RemoveStereochemistry

Returns

standardized RDKit Mol instance

fip.profiles module

class fip.profiles.CooccurrenceProbabilityProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: fip.profiles.InterrelationProfile

A co-occurrence probability profile, holding information about the observed probabilities for feature pair co-occurrences within the characterized set.

classmethod from_cooccurrence_profile(cooccurrence_profile, *args, vector_count=None, **kwargs)

Creates a CooccurrenceProbabilityProfile from the provided CooccurrenceProfile. Calculating the probabilities requires the overall count of feature vectors. The count is either provided explicitly through the vector_count keyword argument, or inferred from cooccurrence_profile.attrs['vector_count'] or, failing that, taken as the maximum of its ‘count’ column.

Parameters

cooccurrence_profile – the source CooccurrenceProfile instance
args – any further arguments to be passed to the InterrelationProfile init
vector_count – explicit count of feature vectors, i.e. samples, to manually adjust the probabilities
kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

a CooccurrenceProbabilityProfile instance
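
As a rough illustration of the calculation described above, the sketch below divides raw co-occurrence counts by the number of feature vectors. The helper name counts_to_probabilities and the plain-dict data layout are assumptions made for this example and are not part of fip; the fallback to the largest count mirrors the last-resort inference mentioned above.

```python
def counts_to_probabilities(counts, vector_count=None):
    """Turn {(feature1, feature2): count} into {(feature1, feature2): probability}.

    Hypothetical helper for illustration only, not the fip implementation.
    """
    if vector_count is None:
        # Last-resort fallback: use the largest observed count.
        vector_count = max(counts.values())
    return {pair: count / vector_count for pair, count in counts.items()}


# Self-relations such as ("a", "a") hold the standalone occurrence counts.
counts = {("a", "a"): 4, ("b", "b"): 2, ("a", "b"): 1}
probabilities = counts_to_probabilities(counts, vector_count=4)
inferred = counts_to_probabilities(counts)  # vector_count inferred as 4
```

Passing vector_count explicitly matters when no single feature occurs in every sample, because the largest count then underestimates the number of feature vectors.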

get_imputation_value(*args)

Interrelation probability imputation based on the “most-optimistic” scenario that the co-occurrence would happen in the n+1 sample.

Returns: the imputation value as a float

imputable_standalone_probabilities(*args, vector_count=None)

Calculates standalone probabilities for individual features, to be used in the “most-optimistic” imputation scheme, i.e. presuming that the n+1 feature vector will contain all observed features co-occurring.

Parameters: vector_count – explicit count of feature vectors, i.e. samples, to optionally manually adjust the probabilities
Returns: a dictionary of feature:probability

mean_interrelation_value()

Provides mean value of all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns: the mean interrelation value as a float

standard_interrelation_deviation()

Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns: the standard deviation value of interrelations as a float

class fip.profiles.CooccurrenceProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: fip.profiles.InterrelationProfile

A feature co-occurrence profile, which is usually the core profile for further interrelation analysis. Holds raw counts of how many times the features occur and co-occur within the characterized set.

add_another_cooccurrence_profile(other)

Adds the contents of another co-occurrence profile to this one, in full outer join fashion. The addition is done inplace, i.e. on this instance.

Parameters: other – another CooccurrenceProfile instance
Returns: self

classmethod from_feature_lists(feature_lists, *args, **kwargs)

Generate a Co-occurrence profile from an iterable of feature lists.

Parameters

feature_lists – the iterable of feature lists to derive the CooccurrenceProfile from
args – any further arguments to be passed to the CooccurrenceProfile init
kwargs – any further keyword arguments to be passed to the CooccurrenceProfile init

Returns

the corresponding CooccurrenceProfile instance
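
To make the idea of raw occurrence and co-occurrence counts concrete, here is a minimal standard-library sketch. It is not the fip implementation: the function name count_cooccurrences is hypothetical, and deduplicating each feature list and sorting the pair keys are assumptions chosen so that (a, b) and (b, a) end up under one key.

```python
from collections import Counter
from itertools import combinations_with_replacement


def count_cooccurrences(feature_lists):
    """Count self-relations (occurrences) and pair co-occurrences."""
    counts = Counter()
    for features in feature_lists:
        # Assumption: a feature counts once per list; sorting unifies pair order.
        distinct = sorted(set(features))
        counts.update(combinations_with_replacement(distinct, 2))
    return counts


feature_lists = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
cooccurrence_counts = count_cooccurrences(feature_lists)
```

Here the key ("a", "a") records in how many lists the feature a occurs, while ("a", "b") records in how many lists a and b occur together.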

classmethod from_feature_lists_split_on_feature(iterable, feature)

Generates and returns two CooccurrenceProfile instances from the provided iterable, split on the provided feature: a main profile built from the feature vectors containing the given feature, and a reference profile built from all other feature vectors.

Parameters

iterable – the iterable of feature lists to derive the CooccurrenceProfile from
feature – the feature to split the set on

Returns

a tuple of the CooccurrenceProfile instances (with_feature, without_feature)

get_imputation_value(*args)

Returns interrelation imputation value. For co-occurrences, this is flat 0, unless z-scored.

Returns: imputation value

mean_interrelation_value()

Provides mean value of all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns: the mean interrelation value as a float

standard_interrelation_deviation()

Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns: the standard deviation value of interrelations as a float

class fip.profiles.InterrelationProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: object

A generic parent class representing an interrelation profile, and implementing their common functionality. Not meant for instantiation.

convert_to_zscore()

Converts the values within the InterrelationProfile into Z-scores, i.e. subtracts mean, divides by standard deviation.

Returns: None, the InterrelationProfile is changed itself

distinct_features(selection=None)

Provides a set of all distinct features present within the interrelation profile. Optionally, can be provided a selection containing interrelation profile subset, to return distinct features within that subset.

Parameters: selection – a subset DataFrame for the interrelation profile, to narrow the scope. Optional, default None.
Returns: feature names as a set of strings

static features2cooccurrences(features, *, omit_self_relations=False)

Processes an iterable of features into a string set of feature co-occurrences.

Parameters

features – an iterable of features to process, e.g. (feature1, feature2, feature4, …)
omit_self_relations – Whether to ignore feature self-relations, i.e. (f1, f1). Default False.

Returns

a co-occurrence generator

features_interrelation_values(features, *, omit_self_relations=False)

Yields interrelation values within the profile for all features within a given feature list. Includes imputed values.

Parameters

features – features to look up within the profile
omit_self_relations – whether to omit self-relations in the lookup, default False.

Returns

a generator yielding the interrelation values, usually floats or ints

classmethod from_dataframe(dataframe, *args, **kwargs)

Loads an interrelation profile from a dataframe containing ‘feature1’, ‘feature2’ and ‘value’ columns.

Parameters

dataframe – the dataframe to load
zscore – whether to produce a z-scored profile instead of raw values. Keyword argument, default False.
min_cutoff_value – if defined, drop all interrelations with values below the given limit. Default None.
args – any further arguments to be passed to the InterrelationProfile init
kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

the corresponding InterrelationProfile instance

classmethod from_dict(value_dict, *args, **kwargs)

Loads an interrelation profile from a dictionary of {(feature1, feature2): value}. feature1, feature2 are handled as strings and serve as a multiindex for the value.

Parameters

value_dict – the dictionary to load
zscore – whether to produce a z-scored profile instead of raw values. Keyword argument, default False.
min_cutoff_value – if defined, drop all interrelations with values below the given limit. Default None.
args – any further arguments to be passed to the InterrelationProfile init
kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

the corresponding InterrelationProfile instance

abstract get_imputation_value(f1, f2)

Provides the imputation value for a feature pair that does not occur within the interrelation profile. Implemented individually within different FeatureInterrelation types, due to imputation differences.

interrelation_value(f1, f2=None)

Returns the interrelation value for the feature pair provided in the arguments. If second argument f2 is not filled in, returns self-relation (f1, f1).

Parameters

f1 – the first feature
f2 – the second feature, default None

Returns

The interrelation value between the two features within the profile. Usually int or float.

iterate_feature_interrelations()

Yields all interrelation values between features in the profile, as tuples. Omits self-relations of features. Includes imputed values.

Returns: yields tuples of (feature1, feature2, interrelation_value)

mean_feature_interrelation_value(features, *, omit_self_relations=True)

Returns the mean interrelation value within the profile for all features within a given feature list. Corresponds to feature “tightness” measure for profiles such as (Z)PMI and (Z)PKLD. Includes imputed values.

Parameters

features – features to look up within the profile
omit_self_relations – whether to omit self-relations in the lookup, default True.

Returns

mean interrelation value, as a float
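
The averaging with imputation described above can be sketched as follows. This is an illustrative stand-in, not fip code: the function mean_pair_value, the dict-based lookup, and the impute callback are all assumptions introduced for the example.

```python
from itertools import combinations
from statistics import mean


def mean_pair_value(features, values, impute):
    """Mean interrelation value over all distinct feature pairs.

    `values` maps sorted (feature1, feature2) tuples to explicit values;
    `impute` supplies a value for pairs missing from `values`.
    Self-relations are omitted, matching omit_self_relations=True.
    """
    pairs = combinations(sorted(set(features)), 2)
    return mean(
        values[pair] if pair in values else impute(*pair)
        for pair in pairs
    )


values = {("a", "b"): 2.0, ("a", "c"): 1.0}
# ("b", "c") is not in the profile, so it is imputed as 0.0 here.
tightness = mean_pair_value(["a", "b", "c"], values, lambda f1, f2: 0.0)
```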

abstract mean_interrelation_value()

Provides mean value of all interrelation values within the profile, including imputed values. Implemented individually within different FeatureInterrelation types, due to imputation differences.

mean_raw_interrelation_value()

Provides mean value of all explicit interrelation values within the profile, i.e. not counting in the imputation values. Ignores self-relations.

Returns: the mean explicit interrelation value as a float

mean_self_relation_value()

Provides mean value of all self-relation values within the profile. Ignores interrelations.

Returns: the mean self-relation value as a float

num_features()

Provides the count of all features (individual features, not their interrelations) that occur within the profile.

Returns: count of all features

num_max_interrelations()

Provides the count of all possible feature interrelations that can exist within the profile, based solely on the number of observed features.

Returns: count of all possible interrelations

num_raw_interrelations()

Provides the count of all interrelations that explicitly occur within the profile, i.e. all interrelations that are non-imputed, non-self-relations.

Returns: count of all explicit interrelation values

static row_zscore(mean, standard_deviation, row, *, input_column_name='value', output_column_name='value')

A drop-in static method to calculate a Z-score from a value, mean and standard deviation. The Z-score, also known as the standard score, is the number of standard deviations a given value lies from the mean.

Parameters

mean – the mean value within the distribution
standard_deviation – the standard deviation within the distribution
row – a Pandas row to calculate the z score for
input_column_name – column name for the processed value, default ‘value’
output_column_name – column name for the output z-score value, default ‘value’ (i.e. overwrite)

Returns

modified row containing the z-score
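
For reference, the Z-score arithmetic amounts to (value - mean) / standard_deviation. The sketch below reproduces it on a plain dict standing in for a Pandas row; it is not the fip implementation, and the use of the population standard deviation (statistics.pstdev) is an assumption made only to get round numbers.

```python
from statistics import mean, pstdev


def row_zscore(mean_value, standard_deviation, row, *,
               input_column_name="value", output_column_name="value"):
    """Return a copy of `row` with the z-score written to the output column."""
    result = dict(row)
    result[output_column_name] = (
        (row[input_column_name] - mean_value) / standard_deviation
    )
    return result


values = [2, 4, 4, 4, 5, 5, 7, 9]
mu, sigma = mean(values), pstdev(values)  # mean 5, standard deviation 2.0
row = {"feature1": "a", "feature2": "b", "value": 9}
scored = row_zscore(mu, sigma, row)
kept = row_zscore(mu, sigma, row, output_column_name="zscore")
```

With the default column names the raw value is overwritten; passing a different output_column_name keeps the raw value alongside the score.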

select_all()

Provides all explicit feature relations (both self-relations and interrelations) within the profile as a DataFrame. The DataFrame itself is also directly accessible through InterrelationProfile.df

Returns: all feature relations as a Pandas DataFrame

select_major_interrelations(zscore_cutoff=1.0)

Provides all explicit feature interrelations within the profile that are higher or lower than the profile average by the number of standard deviations given by zscore_cutoff. In other words, zscore_cutoff is the relative relation strength cutoff, expressed as the number of standard deviations by which an individual interrelation value differs from the mean.

Returns a selection of the DataFrame. The DataFrame itself is also directly accessible through InterrelationProfile.df

Parameters: zscore_cutoff – Relative relation strength cutoff value. Default 1.0
Returns: major feature interrelations as a Pandas DataFrame
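
The selection rule can be pictured with a small standard-library sketch. It is not how fip implements the method: the dict layout, the function name select_major, the population standard deviation, and the inclusive comparison (>=) are all assumptions for illustration.

```python
from statistics import mean, pstdev


def select_major(values, zscore_cutoff=1.0):
    """Keep entries lying at least `zscore_cutoff` deviations from the mean."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    return {
        pair: value
        for pair, value in values.items()
        # Both unusually high and unusually low values qualify.
        if abs(value - mu) >= zscore_cutoff * sigma
    }


values = {
    ("a", "b"): 1.0,
    ("a", "c"): 5.0,
    ("b", "c"): 5.0,
    ("b", "d"): 5.0,
    ("c", "d"): 9.0,
}
major = select_major(values, zscore_cutoff=1.0)
```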

select_major_self_relations(zscore_cutoff=1.0)

Provides all explicit feature self-relations within the profile that are higher or lower than the profile average by the number of standard deviations given by zscore_cutoff. In other words, zscore_cutoff is the relative relation strength cutoff, expressed as the number of standard deviations by which an individual self-relation value differs from the mean.

Returns a selection of the DataFrame. The DataFrame itself is also directly accessible through InterrelationProfile.df

Parameters: zscore_cutoff – Relative relation strength cutoff value. Default 1.0
Returns: major feature self-relations as a Pandas DataFrame

select_raw_interrelations(selection=None)

Provides all explicit (i.e. not imputed) feature interrelations (i.e. not self-relations) within the profile as a DataFrame subset selection.

Returns: feature interrelations as a Pandas DataFrame subset

select_raw_interrelations_involving(features, depth=0)

Selects all raw interrelations associated with the given feature or group of features, either directly (depth = 0) or by extension through other relations (depth of 1, 2, …)

Parameters

features – a feature or a list of features whose interrelations are selected
depth – how far the interrelations are tracked. Default 0, i.e. only direct interrelations.

Returns

the associated interrelations as a Pandas DataFrame subset
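
One way to read the depth parameter is as a number of extra hops through already selected interrelations. The sketch below implements that reading on a plain set of feature pairs; the function select_involving and this exact interpretation of depth are assumptions, and the fip method may expand the selection differently.

```python
def select_involving(pairs, features, depth=0):
    """Return the pairs reachable from `features` within `depth` extra hops."""
    if isinstance(features, str):
        features = [features]  # accept a single feature as well as a list
    known = set(features)
    selected = set()
    for _ in range(depth + 1):
        selected = {pair for pair in pairs if known & set(pair)}
        # Features reached in this round seed the next round.
        known |= {feature for pair in selected for feature in pair}
    return selected


pairs = {("a", "b"), ("b", "c"), ("c", "d")}
direct = select_involving(pairs, "a")            # depth 0: direct only
one_hop = select_involving(pairs, "a", depth=1)  # also pairs involving b
```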

select_self_relations(selection=None)

Provides all feature self-relations within the profile as a DataFrame subset selection.

Returns: feature self-relations as a Pandas DataFrame subset

self_relations_dict()

Returns self-relation values of all features in the profile as a dictionary.

Returns: a dictionary of {feature: self_interrelation_value} pairs

abstract standard_interrelation_deviation()

Provides standard deviation for all interrelation values within the profile, including imputed values. Implemented individually within different FeatureInterrelation types, due to imputation differences.

standard_raw_interrelation_deviation()

Provides standard deviation from all explicit (i.e. non-imputed) interrelation values within the profile. Ignores self-relations.

Returns: The standard deviation as a float

standard_self_relation_deviation()

Provides standard deviation from all self-relation values within the profile. Ignores interrelations.

Returns: The standard deviation as a float

to_csv(target_file=None)

Exports the interrelation matrix to a CSV file.

Parameters: target_file – the path or buffer to export to. Default None
Returns: None or string

to_distance_matrix(selection=None, *, distance_conversion_function=None, zero_self_relations=True)

Transforms the interrelation profile, or its subset provided by the selection, into an explicit distance matrix based on interrelation values. Without a specified selection, the entire interrelation profile is converted.

Parameters

selection – an optional subset to form the explicit matrix on. Default None.
distance_conversion_function – f(x) to transform the interrelation value x into distance. Default 1/x+1.
zero_self_relations – turns distances of all features to themselves into explicit 0. Default True.

Returns

the explicit distance matrix as a DataFrame

to_explicit_matrix(selection=None)

Transforms the interrelation profile, or its subset provided by the selection, into an explicit square matrix of interrelation values. Without a specified selection, the entire interrelation profile is converted.

Parameters: selection – an optional subset to form the explicit matrix on. Default None.
Returns: the explicit interrelation table as a DataFrame

class fip.profiles.PointwiseJeffreysDivergenceProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: fip.profiles.PointwiseKLDivergenceProfile

An interrelation profile consisting of pointwise Jeffreys divergence values, a measure of statistical distance for each feature pair between its observed co-occurrence in the characterized set and its observed co-occurrence in a reference set, and vice versa. A symmetric variant of the KL divergence.

classmethod from_cooccurrence_probability_profiles(cooccurrence_probability_profile, reference_probability_profile, *args, **kwargs)

Creates a pointwise Jeffreys divergence interrelation profile quantifying how well the co-occurrence probabilities in the given interrelation profile match those in the given reference interrelation profile.

pJD(F1|F2) == pJD(F2|F1) = abs(pKLD(F1|F2)) + abs(pKLD(F2|F1))

where F1 and F2 are observed features, pKLD(F1|F2) is their pointwise KL divergence.

Parameters

cooccurrence_probability_profile – first CooccurrenceProbabilityProfile instance
reference_probability_profile – second CooccurrenceProbabilityProfile instance
args – any further arguments to be passed to the InterrelationProfile init
kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

PointwiseJeffreysDivergenceProfile instance

class fip.profiles.PointwiseKLDivergenceProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: fip.profiles.InterrelationProfile

An interrelation profile consisting of pointwise Kullback–Leibler divergence values, a measure of statistical distance for each feature pair between its observed co-occurrence in the characterized set and its observed co-occurrence in a reference set.

classmethod from_cooccurrence_probability_profiles(cooccurrence_probability_profile, reference_probability_profile, *args, **kwargs)

Creates a pointwise KL divergence interrelation profile quantifying how well the co-occurrence probabilities in the given interrelation profile match those in the given reference interrelation profile.

pKLD(F1|F2) = log2( P(F1|F2) / Q(F1|F2) )

where F1 and F2 are observed features, P(F1|F2) is their co-occurrence probability within the evaluated interrelation profile, and Q(F1|F2) is the same within the reference interrelation profile.

Parameters

cooccurrence_probability_profile – the CooccurrenceProbabilityProfile instance to be evaluated
reference_probability_profile – the CooccurrenceProbabilityProfile instance to serve as a reference
args – any further arguments to be passed to the InterrelationProfile init
kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

PointwiseKLDivergenceProfile instance
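
The pKLD formula above is a base-2 log ratio, which a few lines of Python make tangible. The function pointwise_kld is a hypothetical helper for a single feature pair, not a fip API, and it assumes both probabilities are strictly positive.

```python
from math import log2


def pointwise_kld(p, q):
    """log2(P / Q) for one feature pair; both probabilities must be > 0."""
    return log2(p / q)


more_common_in_evaluated = pointwise_kld(0.5, 0.25)   # P is twice Q
more_common_in_reference = pointwise_kld(0.125, 0.5)  # P is a quarter of Q
equally_common = pointwise_kld(0.25, 0.25)
```

Positive values mean the pair co-occurs more often in the evaluated profile than in the reference, negative values the opposite, and 0 means both profiles agree.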

get_imputation_value(*args)

Pointwise KL imputation for the case where the features co-occur in neither the evaluated nor the reference interrelation profile. It is based on the imputation probability for the individual feature profiles.

Returns: Imputation pointwise KLD for feature co-occurrence appearing in neither profile

mean_interrelation_value()

Provides mean value of all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns: the mean interrelation value as a float

relative_feature_divergence(features)

Provides relative feature divergence, i.e. the relative feature tightness (RFT) measure for a given set of features, against this pointwise KL divergence profile between two interrelation profiles. The value quantifies how closely the feature co-occurrence combination in the provided feature vector matches the interrelations prevalent in the source profile (positive values) compared to those more prevalent in the reference profile (negative values).

Parameters: features – an iterable of features
Returns: Feature divergence value as a float

standard_interrelation_deviation()

Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns: the standard deviation value of interrelations as a float

class fip.profiles.PointwiseMutualInformationProfile(df, *, zscore=False, min_cutoff_value=None, **kwargs)

Bases: fip.profiles.InterrelationProfile

An interrelation profile consisting of Pointwise Mutual Information (PMI) values observed between the features present in the characterized set. PMI is a measure of association between two features, based on the ratio between the observed co-occurrence probability of the feature pair and the co-occurrence probability expected if the features were completely independent, given their individual occurrence rates.

classmethod from_cooccurrence_probability_profile(cooccurrence_probability_profile, *args, **kwargs)

Generate a PMI interrelation profile.

Parameters

cooccurrence_probability_profile – the source CooccurrenceProbabilityProfile instance
args – any further arguments to be passed to the InterrelationProfile init
kwargs – any further keyword arguments to be passed to the InterrelationProfile init

Returns

PointwiseMutualInformationProfile instance
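
For orientation, the textbook PMI definition is the logarithm of the observed joint probability over the product of the standalone probabilities. The sketch below uses base 2, on the assumption that it matches the log2 used in the imputation formula of this class; the function name pmi is hypothetical and this is not the fip implementation.

```python
from math import log2


def pmi(p_joint, p_feature1, p_feature2):
    """Pointwise mutual information of one feature pair, in bits."""
    return log2(p_joint / (p_feature1 * p_feature2))


independent = pmi(0.25, 0.5, 0.5)     # co-occur exactly as often as expected
associated = pmi(0.5, 0.5, 0.5)       # co-occur twice as often as expected
avoiding = pmi(0.0625, 0.5, 0.5)      # co-occur a quarter as often as expected
```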

get_imputation_value(feature1, feature2)

PMI imputation is based on the assumption that two of the least frequently occurring features within the set can be expected to have no interrelation, i.e. their PMI value would be 0. For more frequently occurring features, their lack of co-occurrence is a correspondingly larger surprise, i.e. the imputed PMI values become negative, meaning that the features co-occur less than could be expected from their individual occurrence probabilities if they were independent. The imputation PMI value (iPMI) for two features is therefore computed as:

iPMI(base) = log2[(p_least_common_feature * p_least_common_feature) / (p_least_common_feature * p_least_common_feature)] = log2[1] = 0

based on which:

iPMI(feature1, feature2) = log2[p_least_common_feature**2 / (p_feature1 * p_feature2)]

where p_least_common_feature is the standalone occurrence probability of the least common feature within the profile, and p_feature1 and p_feature2 are the standalone occurrence probabilities of the features whose PMI is being imputed.

Parameters

feature1 – First feature for PMI imputation
feature2 – Second feature for PMI imputation

Returns

The pair PMI based on imputed values
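
The iPMI formula above translates directly into code. The helper imputed_pmi below is a hypothetical standalone function taking the three probabilities as arguments, written only to show the arithmetic; the fip method itself takes feature names and looks the probabilities up in the profile.

```python
from math import log2


def imputed_pmi(p_feature1, p_feature2, p_least_common_feature):
    """iPMI = log2(p_least_common_feature**2 / (p_feature1 * p_feature2))."""
    return log2(p_least_common_feature ** 2 / (p_feature1 * p_feature2))


p_min = 0.125
rare_pair = imputed_pmi(0.125, 0.125, p_min)  # both features are the rarest
common_pair = imputed_pmi(0.5, 0.5, p_min)    # both features are frequent
```

The rarest possible pair gets an imputed PMI of 0, and the value drops further below zero the more frequent the two features are.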

mean_interrelation_value()

Provides mean value of all PMI values within the profile, including imputed values. Ignores self-relations.

Returns: the mean PMI value as a float

relative_feature_tightness(features)

Provides the relative feature tightness (RFT) measure for a given set of features. RFT quantifies how well the feature co-occurrence combination in the provided feature vector matches the interrelations within this reference (Z)PMI profile.

Parameters: features – an iterable of features
Returns: Feature tightness value as a float

standard_interrelation_deviation()

Provides standard deviation for all interrelation values within the profile, including imputed values. Ignores self-relations.

Returns: the standard deviation value of interrelations as a float
