Note
Go to the end to download the full example code. or to run this example in your browser via Binder
Rhomboid proteases¶
This example shows how the Rhomboid dataset is organized and how you can use it to test cocoatree’s features.
Load the dataset
import numpy as np
from cocoatree.datasets import load_rhomboid_proteases
import cocoatree.msa as c_msa
import cocoatree.statistics.position as c_pos
dataset = load_rhomboid_proteases()
print(dataset.keys())
dict_keys(['sequence_ids', 'alignment', 'sector_positions', 'metadata', 'pdb_sequence', 'pdb_positions'])
- dataset is a dictionary with 6 elements:
sequence_ids: a list of the sequence identifiers for the MSA
alignment: the aligned sequences
sector_positions: a Numpy NpzFile of the sectors’ positions as found in
- the Mihaljević and Urban study
metadata: a Pandas dataframe containing metadata associated with the
- alignment, such as taxonomic information.
pdb_sequence: a tuple of the 2NRF pdb sequence from E. coli
pdb_positions: a list of the named positions found in the PDB
Access the sequence data¶
loaded_seqs = dataset["alignment"]
loaded_seqs_id = dataset["sequence_ids"]
n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)
print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
positions.")
The loaded MSA has 2767 sequences and 135 positions.
Example of sequence filtering¶
sequences, sequences_id, positions = c_msa.filter_sequences(
loaded_seqs, loaded_seqs_id, gap_threshold=0.4, seq_threshold=0.2)
n_pos = len(positions)
print(f"After filtering, we have {n_pos} remaining positions.")
print(f"After filtering, we have {len(sequences)} remaining sequences.")
seq_weights, m_eff = c_pos.compute_seq_weights(sequences)
print('Number of effective sequences %d' %
np.round(m_eff))
After filtering, we have 135 remaining positions.
After filtering, we have 2767 remaining sequences.
Number of effective sequences 1792
Access sector information¶
There are 3 sectors
sector_1 is composed of: 14 positions.
sector_2 is composed of: 10 positions.
sector_3 is composed of: 17 positions.
Note that the above sector positions are relative to E. coli’s PDB sequence.
Access metadata¶
metadata = dataset["metadata"]
print(metadata)
Seq_ID Protein_names ... AlphaFoldDB PDB
0 C7N0Y4_SLAHD Uncharacterized membrane protein ... C7N0Y4; NaN
1 F8DCV9_HALXS Rhomboid family protein ... F8DCV9; NaN
2 C7P1C2_HALMD Rhomboid family protein ... C7P1C2; NaN
3 A0A087SLN7_AUXPR Rhomboid protease GluP ... A0A087SLN7; NaN
4 A0A0G0YMT9_9BACT NaN ... NaN NaN
... ... ... ... ... ...
2760 GLPG_ENT38 Rhomboid protease GlpG ... A4WFK8 NaN
2761 GLPG_CITK8 Rhomboid protease GlpG ... A8AQX4 NaN
2762 GLPG_SHIDS Rhomboid protease GlpG ... Q32AN6 NaN
2763 GLPG_SALAR Rhomboid protease GlpG ... A9MMA7 NaN
2764 GLPG_SALTY Rhomboid protease GlpG ... Q8ZLH5 NaN
[2765 rows x 19 columns]
Access PDB information¶
Start of 2NRF E. coli's sequence:
('ERAGPVTWVMMIACVVVFIAMQILGDQEVMLWLAWPFDPTLKFEFWRYFTHALMHFSLMHILFNLLWWWYLGGAVEKRLGSGKLIVITLISALLSGYVQQKFSGPWFGGLSGVVYALMGYVWLRGERDPQSGIYLQRGLIIFALIWIVAGWFDLFGMSMANGAHIAGLAVGLAMAFVDSLNA',)
Named positions of the first ten residues in E. coli's sequence:
['91', '92', '93', '94', '95', '96', '97', '98', '99', '100']
References¶
Total running time of the script: (0 minutes 1.102 seconds)