DHFR proteases

This example shows how the DHFR dataset is organized and how you can use it to test cocoatree’s features.

import numpy as np

from cocoatree.datasets import load_DHFR
import cocoatree.msa as c_msa
import cocoatree.statistics.position as c_pos

Load the dataset

dataset = load_DHFR()
print(dataset.keys())
dict_keys(['sequence_ids', 'alignment', 'sector_positions', 'pdb_sequence', 'pdb_positions'])
dataset is a dictionary with 5 elements:
  • sequence_ids: a list of the sequence identifiers for the MSA

  • alignment: the aligned sequences

  • sector_positions: a Numpy NpzFile of the sectors’ positions as found in

the Kalmer study
  • pdb_sequence: a tuple of the pdb sequence of 3QL3 (wild-type E. coli

DHFR)
  • pdb_positions: a list of the named positions found in the PDB

Access the sequence data

loaded_seqs = dataset["alignment"]
loaded_seqs_id = dataset["sequence_ids"]
n_loaded_pos, n_loaded_seqs = len(loaded_seqs[0]), len(loaded_seqs)

print(f"The loaded MSA has {n_loaded_seqs} sequences and {n_loaded_pos} \
      positions.")
The loaded MSA has 4422 sequences and 802       positions.

Example of sequence filtering

sequences, sequences_id, positions = c_msa.filter_sequences(
    loaded_seqs, loaded_seqs_id, gap_threshold=0.4, seq_threshold=0.2)
n_pos = len(positions)
print(f"After filtering, we have {n_pos} remaining positions.")
print(f"After filtering, we have {len(sequences)} remaining sequences.")

seq_weights, m_eff = c_pos.compute_seq_weights(sequences)
print('Number of effective sequences %d' %
      np.round(m_eff))
After filtering, we have 156 remaining positions.
After filtering, we have 3806 remaining sequences.
Number of effective sequences 3332

Access sector information

sectors = dataset["sector_positions"]
print("There are", len(sectors), "sectors")

for sect in sectors:
    print(sect, "is composed of:", len(sectors[sect]), "positions.")
There are 4 sectors
arr_0 is composed of: 12 positions.
arr_1 is composed of: 19 positions.
arr_2 is composed of: 7 positions.
arr_3 is composed of: 11 positions.

Access PDB information

pdb_seq = dataset["pdb_sequence"]
pdb_pos = dataset["pdb_positions"]
print("Start of E. coli DHFR's sequence:\n", pdb_seq[:10])
print("Named positions of the first ten residues in E. coli's sequence:\n",
      pdb_pos[:10])
Start of E. coli DHFR's sequence:
 ('MISLIAALAVDRVIGMENAMPWPPLPADLAWFKRNTLNKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDADAQNAHSYCFEILERR',)
Named positions of the first ten residues in E. coli's sequence:
 ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']

References

Total running time of the script: (0 minutes 3.176 seconds)

Gallery generated by Sphinx-Gallery