Rhomboid proteases

This example shows how the Rhomboid dataset is organized and how you can use it to test cocoatree’s features.

Load the dataset

import numpy as np
from cocoatree.datasets import load_rhomboid_proteases
import cocoatree.msa as c_msa
import cocoatree.statistics.position as c_pos

dataset = load_rhomboid_proteases()
print(dataset.keys())
dict_keys(['sequence_ids', 'alignment', 'sector_positions', 'metadata', 'pdb_sequence', 'pdb_positions'])
dataset is a dictionary with 6 elements:
  • sequence_ids: a list of the sequence identifiers for the MSA

  • alignment: the aligned sequences

  • sector_positions: a Numpy NpzFile of the sectors’ positions as found in

the Mihaljević and Urban study
  • metadata: a Pandas dataframe containing metadata associated with the

alignment, such as taxonomic information.
  • pdb_sequence: a tuple of the 2NRF pdb sequence from E. coli

  • pdb_positions: a list of the named positions found in the PDB

Access the sequence data

loaded_seqs = dataset["alignment"]
loaded_seqs_id = dataset["sequence_ids"]
n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)

print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
      positions.")
The loaded MSA has 2767 sequences and 135       positions.

Example of sequence filtering

sequences, sequences_id, positions = c_msa.filter_sequences(
    loaded_seqs, loaded_seqs_id, gap_threshold=0.4, seq_threshold=0.2)
n_pos = len(positions)
print(f"After filtering, we have {n_pos} remaining positions.")
print(f"After filtering, we have {len(sequences)} remaining sequences.")

seq_weights, m_eff = c_pos.compute_seq_weights(sequences)
print('Number of effective sequences %d' %
      np.round(m_eff))
After filtering, we have 135 remaining positions.
After filtering, we have 2767 remaining sequences.
Number of effective sequences 1792

Access sector information

sectors = dataset["sector_positions"]
print("There are", len(sectors), "sectors")

for sect in sectors:
    print(sect, "is composed of:", len(sectors[sect]), "positions.")
There are 3 sectors
sector_1 is composed of: 14 positions.
sector_2 is composed of: 10 positions.
sector_3 is composed of: 17 positions.

Note that the above sector positions are relative to E. coli’s PDB sequence.

Access metadata

metadata = dataset["metadata"]
print(metadata)
                Seq_ID                     Protein_names  ...  AlphaFoldDB  PDB
0         C7N0Y4_SLAHD  Uncharacterized membrane protein  ...      C7N0Y4;  NaN
1         F8DCV9_HALXS           Rhomboid family protein  ...      F8DCV9;  NaN
2         C7P1C2_HALMD           Rhomboid family protein  ...      C7P1C2;  NaN
3     A0A087SLN7_AUXPR            Rhomboid protease GluP  ...  A0A087SLN7;  NaN
4     A0A0G0YMT9_9BACT                               NaN  ...          NaN  NaN
...                ...                               ...  ...          ...  ...
2760        GLPG_ENT38            Rhomboid protease GlpG  ...       A4WFK8  NaN
2761        GLPG_CITK8            Rhomboid protease GlpG  ...       A8AQX4  NaN
2762        GLPG_SHIDS            Rhomboid protease GlpG  ...       Q32AN6  NaN
2763        GLPG_SALAR            Rhomboid protease GlpG  ...       A9MMA7  NaN
2764        GLPG_SALTY            Rhomboid protease GlpG  ...       Q8ZLH5  NaN

[2765 rows x 19 columns]

Access PDB information

pdb_seq = dataset["pdb_sequence"]
pdb_pos = dataset["pdb_positions"]
print("Start of 2NRF E. coli's sequence:\n", pdb_seq[:10])
print("Named positions of the first ten residues in E. coli's sequence:\n",
      pdb_pos[:10])
Start of 2NRF E. coli's sequence:
 ('ERAGPVTWVMMIACVVVFIAMQILGDQEVMLWLAWPFDPTLKFEFWRYFTHALMHFSLMHILFNLLWWWYLGGAVEKRLGSGKLIVITLISALLSGYVQQKFSGPWFGGLSGVVYALMGYVWLRGERDPQSGIYLQRGLIIFALIWIVAGWFDLFGMSMANGAHIAGLAVGLAMAFVDSLNA',)
Named positions of the first ten residues in E. coli's sequence:
 ['91', '92', '93', '94', '95', '96', '97', '98', '99', '100']

References

Total running time of the script: (0 minutes 1.102 seconds)

Gallery generated by Sphinx-Gallery