Note

Go to the end to download the full example code or to run this example in your browser via Binder.

Rhomboid proteases¶

This example shows how the Rhomboid dataset is organized and how you can use it to test cocoatree’s features.

Load the dataset

import numpy as np
from cocoatree.datasets import load_rhomboid_proteases
import cocoatree.msa as c_msa
import cocoatree.statistics.position as c_pos

dataset = load_rhomboid_proteases()
print(dataset.keys())

dict_keys(['sequence_ids', 'alignment', 'sector_positions', 'metadata', 'pdb_sequence', 'pdb_positions'])

dataset is a dictionary with 6 elements:

sequence_ids: a list of the sequence identifiers for the MSA
alignment: the aligned sequences
sector_positions: a Numpy NpzFile of the sectors’ positions as found in

the Mihaljević and Urban study

metadata: a Pandas dataframe containing metadata associated with the

alignment, such as taxonomic information.

pdb_sequence: a tuple of the 2NRF pdb sequence from E. coli
pdb_positions: a list of the named positions found in the PDB

Access the sequence data¶

loaded_seqs = dataset["alignment"]
loaded_seqs_id = dataset["sequence_ids"]
n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)

print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
      positions.")

The loaded MSA has 2767 sequences and 135       positions.

Example of sequence filtering¶

sequences, sequences_id, positions = c_msa.filter_sequences(
    loaded_seqs, loaded_seqs_id, gap_threshold=0.4, seq_threshold=0.2)
n_pos = len(positions)
print(f"After filtering, we have {n_pos} remaining positions.")
print(f"After filtering, we have {len(sequences)} remaining sequences.")

seq_weights, m_eff = c_pos.compute_seq_weights(sequences)
print('Number of effective sequences %d' %
      np.round(m_eff))

After filtering, we have 135 remaining positions.
After filtering, we have 2767 remaining sequences.
Number of effective sequences 1792

Access sector information¶

sectors = dataset["sector_positions"]
print("There are", len(sectors), "sectors")

for sect in sectors:
    print(sect, "is composed of:", len(sectors[sect]), "positions.")

There are 3 sectors
sector_1 is composed of: 14 positions.
sector_2 is composed of: 10 positions.
sector_3 is composed of: 17 positions.

Note that the above sector positions are relative to E. coli’s PDB sequence.

Access metadata¶

metadata = dataset["metadata"]
print(metadata)

                Seq_ID                     Protein_names  ...  AlphaFoldDB  PDB
       C7N0Y4_SLAHD  Uncharacterized membrane protein  ...      C7N0Y4;  NaN
       F8DCV9_HALXS           Rhomboid family protein  ...      F8DCV9;  NaN
       C7P1C2_HALMD           Rhomboid family protein  ...      C7P1C2;  NaN
   A0A087SLN7_AUXPR            Rhomboid protease GluP  ...  A0A087SLN7;  NaN
   A0A0G0YMT9_9BACT                               NaN  ...          NaN  NaN
...                ...                               ...  ...          ...  ...
      GLPG_ENT38            Rhomboid protease GlpG  ...       A4WFK8  NaN
      GLPG_CITK8            Rhomboid protease GlpG  ...       A8AQX4  NaN
      GLPG_SHIDS            Rhomboid protease GlpG  ...       Q32AN6  NaN
      GLPG_SALAR            Rhomboid protease GlpG  ...       A9MMA7  NaN
      GLPG_SALTY            Rhomboid protease GlpG  ...       Q8ZLH5  NaN

[2765 rows x 19 columns]

Access PDB information¶

pdb_seq = dataset["pdb_sequence"]
pdb_pos = dataset["pdb_positions"]
print("Start of 2NRF E. coli's sequence:\n", pdb_seq[:10])
print("Named positions of the first ten residues in E. coli's sequence:\n",
      pdb_pos[:10])

Start of 2NRF E. coli's sequence:
 ('ERAGPVTWVMMIACVVVFIAMQILGDQEVMLWLAWPFDPTLKFEFWRYFTHALMHFSLMHILFNLLWWWYLGGAVEKRLGSGKLIVITLISALLSGYVQQKFSGPWFGGLSGVVYALMGYVWLRGERDPQSGIYLQRGLIIFALIWIVAGWFDLFGMSMANGAHIAGLAVGLAMAFVDSLNA',)
Named positions of the first ten residues in E. coli's sequence:
 ['91', '92', '93', '94', '95', '96', '97', '98', '99', '100']

References¶

Total running time of the script: (0 minutes 1.001 seconds)

Gallery generated by Sphinx-Gallery