Note
Go to the end to download the full example code. or to run this example in your browser via Binder
DHFR proteases¶
This example shows how the DHFR dataset is organized and how you can use it to test cocoatree’s features.
import numpy as np
from cocoatree.datasets import load_DHFR
import cocoatree.msa as c_msa
import cocoatree.statistics.position as c_pos
Load the dataset¶
dict_keys(['sequence_ids', 'alignment', 'sector_positions', 'pdb_sequence', 'pdb_positions'])
- dataset is a dictionary with 5 elements:
sequence_ids: a list of the sequence identifiers for the MSA
alignment: the aligned sequences
sector_positions: a Numpy NpzFile of the sectors’ positions as found in
- the Kalmer study
pdb_sequence: a tuple of the pdb sequence of 3QL3 (wild-type E. coli
- DHFR)
pdb_positions: a list of the named positions found in the PDB
Access the sequence data¶
loaded_seqs = dataset["alignment"]
loaded_seqs_id = dataset["sequence_ids"]
n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)
print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
positions.")
The loaded MSA has 4422 sequences and 802 positions.
Example of sequence filtering¶
sequences, sequences_id, positions = c_msa.filter_sequences(
loaded_seqs, loaded_seqs_id, gap_threshold=0.4, seq_threshold=0.2)
n_pos = len(positions)
print(f"After filtering, we have {n_pos} remaining positions.")
print(f"After filtering, we have {len(sequences)} remaining sequences.")
seq_weights, m_eff = c_pos.compute_seq_weights(sequences)
print('Number of effective sequences %d' %
np.round(m_eff))
After filtering, we have 156 remaining positions.
After filtering, we have 3806 remaining sequences.
Number of effective sequences 3332
Access sector information¶
There are 4 sectors
arr_0 is composed of: 12 positions.
arr_1 is composed of: 19 positions.
arr_2 is composed of: 7 positions.
arr_3 is composed of: 11 positions.
Access PDB information¶
Start of E. coli DHFR's sequence:
('MISLIAALAVDRVIGMENAMPWPPLPADLAWFKRNTLNKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDADAQNAHSYCFEILERR',)
Named positions of the first ten residues in E. coli's sequence:
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
References¶
Total running time of the script: (0 minutes 3.176 seconds)