Load a PDB structure file

This small example shows how to use cocoatree.io.load_pdb() to import your own PDB structure file.

This function extracts the amino acid sequence stored in the PDB file, together with the residue numbering. You need that numbering to specify which residues to highlight in a viewer such as PyMOL.

Import necessary packages

from cocoatree.io import load_pdb

# Provide path for thumbnail image
# sphinx_gallery_thumbnail_path = '../../doc/images/3TGI.png'

We will use the PDB structure of rat trypsin as an example. The file can be downloaded at: https://www.rcsb.org/structure/3TGI. It is also the structure that is included in cocoatree.datasets.load_S1A_serine_proteases().

pdb_seq, pdb_pos = load_pdb('data/3TGI.pdb', pdb_id='3TGI', chain='E')

The pdb_id argument is the ID that will be used for the structure. Here we chose 3TGI, but it could be TRYPSIN or any other name you find fitting.

To understand the function’s chain argument, open the PDB file in a text editor and look at its header records:

"""HEADER    COMPLEX (SERINE PROTEASE/INHIBITOR)     15-JUL-98   3TGI
TITLE     WILD-TYPE RAT ANIONIC TRYPSIN COMPLEXED WITH BOVINE
TITLE    2 PANCREATIC TRYPSIN INHIBITOR (BPTI)
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: TRYPSIN;
COMPND   3 CHAIN: E;
COMPND   4 EC: 3.4.21.4;
COMPND   5 MOL_ID: 2;
COMPND   6 MOLECULE: BOVINE PANCREATIC TRYPSIN INHIBITOR;
COMPND   7 CHAIN: I;
COMPND   8 SYNONYM: BPTI
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: RATTUS NORVEGICUS;
SOURCE   3 ORGANISM_COMMON: NORWAY RAT;
SOURCE   4 ORGANISM_TAXID: 10116;
SOURCE   5 ORGAN: PANCREATIC;
SOURCE   6 MOL_ID: 2;
SOURCE   7 ORGANISM_SCIENTIFIC: BOS TAURUS;
SOURCE   8 ORGANISM_COMMON: CATTLE;
SOURCE   9 ORGANISM_TAXID: 9913"""
As the COMPND records show, this PDB file actually contains two molecules:
  • rat trypsin
  • bovine pancreatic trypsin inhibitor

It is necessary to specify which molecule you wish to load by using the chain argument. In this case, it is chain='E' for trypsin, and chain='I' for the trypsin inhibitor.
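If you would rather not open the file in an editor, the chain identifiers can also be listed programmatically. The following is a minimal sketch, not part of cocoatree: chains_from_header is a hypothetical helper that reads the COMPND records shown above, and it assumes each molecule name fits on a single record line.

```python
def chains_from_header(header_text):
    """Map each chain ID to its molecule name, based on COMPND records.

    Hypothetical helper for illustration; assumes every MOLECULE name
    fits on one line (long names split over several records are not merged).
    """
    chains = {}
    molecule = None
    for line in header_text.splitlines():
        if not line.startswith("COMPND"):
            continue
        # The record content starts at column 11; drop the trailing ';'
        content = line[10:].strip().rstrip(";")
        key, _, value = content.partition(":")
        key, value = key.strip(), value.strip()
        if key == "MOLECULE":
            molecule = value
        elif key == "CHAIN":
            # A molecule may be present as several chains, e.g. "A, B"
            for chain_id in value.split(","):
                chains[chain_id.strip()] = molecule
    return chains


# COMPND records copied from the 3TGI header shown above
header = """\
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: TRYPSIN;
COMPND   3 CHAIN: E;
COMPND   4 EC: 3.4.21.4;
COMPND   5 MOL_ID: 2;
COMPND   6 MOLECULE: BOVINE PANCREATIC TRYPSIN INHIBITOR;
COMPND   7 CHAIN: I;
COMPND   8 SYNONYM: BPTI
"""

for chain_id, name in chains_from_header(header).items():
    print(chain_id, name)
```

For a real file, you could pass the text returned by reading the PDB file instead of the hard-coded header string.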

For comparison, here are the first lines of the PDB file of E. coli dihydrofolate reductase (the structure included in cocoatree.datasets.load_DHFR()):

"""HEADER    OXIDOREDUCTASE                          02-FEB-11   3QL0
TITLE     CRYSTAL STRUCTURE OF N23PP/S148A MUTANT OF E. COLI DIHYDROFOLATE
TITLE    2 REDUCTASE
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: DIHYDROFOLATE REDUCTASE;
COMPND   3 CHAIN: A;
COMPND   4 EC: 1.5.1.3;
COMPND   5 ENGINEERED: YES;
COMPND   6 MUTATION: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: ESCHERICHIA COLI;
SOURCE   3 ORGANISM_TAXID: 364106;
SOURCE   4 STRAIN: UTI89 / UPEC;
SOURCE   5 GENE: FOLA, UTI89_C0054;
SOURCE   6 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE   7 EXPRESSION_SYSTEM_TAXID: 562"""

In this case, there is only one molecule, which is accessed by specifying chain='A'.

You can now access the PDB sequence and the residue numbering:

print(pdb_seq)
print(pdb_pos)

# .. seealso::
#       :ref:`sphx_glr_auto_examples_a_quick_start_03_plot_map_alignments.py`
#       to see how to perform a mapping of the positions between your MSA and
#       the PDB structure.
IVGGYTCQENSVPYQVSLNSGYHFCGGSLINDQWVVSAAHCYKSRIQVRLGEHNINVLEGNEQFVNAAKIIKHPNFDRKTLNNDIMLIKLSSPVKLNARVATVALPSSCAPAGTQCLISGWGNTLSSGVNEPDLLQCLDAPLLPQADCEASYPGKITDNMVCVGFLEGGKDSCQGDSGGPVVCNGELQGIVSWGYGCALPDNPGVYTKVCNYVDWIQDTIAAN
['16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '127', '128', '129', '130', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '180', '181', '182', '183', '184', '184A', '185', '186', '187', '188', '188A', '189', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '200', '201', '202', '203', '204', '209', '210', '211', '212', '213', '214', '215', '216', '217', '219', '220', '221', '221A', '222', '223', '224', '225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '235', '236', '237', '238', '239', '240', '241', '242', '243', '244', '245']
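Note that the residue numbers are strings: the numbering has gaps (34 is followed by 37) and insertion codes such as '184A', so it cannot be treated as a simple integer range. As a minimal sketch of how the two outputs can be used together, the code below pairs each PDB position with its residue and builds a PyMOL selection string. It uses only an excerpt of the output above (the first 22 residues), and pymol_selection is a hypothetical helper, not a cocoatree function.

```python
# Excerpt of the output above: first 22 residues of chain E and their numbers
pdb_seq = "IVGGYTCQENSVPYQVSLNSGY"
pdb_pos = [str(n) for n in range(16, 35)] + ["37", "38", "39"]

# Sequence and numbering are expected to have the same length
assert len(pdb_seq) == len(pdb_pos)

# Look up a residue by its PDB number rather than by its index in the string
residue_at = dict(zip(pdb_pos, pdb_seq))
print(residue_at["16"])    # first residue of the chain
print("35" in residue_at)  # numbers 35 and 36 are absent from the structure


def pymol_selection(positions, chain):
    """Build a PyMOL selection expression for the given residue numbers.

    Hypothetical helper for illustration; positions are PDB residue
    numbers given as strings, as in pdb_pos.
    """
    return f"chain {chain} and resi {'+'.join(positions)}"


print(pymol_selection(["16", "37", "39"], chain="E"))
```

The resulting string can be pasted into a PyMOL select command to highlight the chosen residues.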
