importing molecular data gPDB and PDB format

User Avatar
Member
78 posts
Joined: Sept. 2008
Offline
Hi there

I see that Houdini now ships a standalone executable, ‘gpdb’, that converts
PDB files (a format that contains data about molecules) into bgeo format with a variety of point attributes.

I've converted a basic PDB file and imported the resulting point cloud.
I was thinking of using a simple Add SOP to connect corresponding points,
then a PolyWire SOP to flesh them out. But at the moment I can't find the right attribute to connect them by.

There's no reference in the help docs yet, so I was hoping someone might have some experience using this in Houdini and be willing to give some advice?

Thanks

John
User Avatar
Member
78 posts
Joined: Sept. 2008
Offline
Hey all
From information gleaned from the PDB format docs and Jenny over at SideFX support:
'The PDB converter parses ATOM and HETATM lines,
and stores some corresponding fields as point attributes.
By using chainID and resSeq you can get some connectivity.'
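
As a rough illustration of the fields mentioned in that quote, here is a minimal Python sketch that reads ATOM and HETATM records using the fixed column positions from the PDB format documentation and groups atoms by (chainID, resSeq). The function names `parse_atom_line` and `group_by_residue` are made up for this example; this is not how gpdb itself is implemented.

```python
from collections import defaultdict


def parse_atom_line(line):
    """Return a dict of fields for an ATOM/HETATM record, or None for other lines.

    Column positions follow the fixed-width PDB layout (slices are 0-based).
    """
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),      # atom name, e.g. N, CA, C, O
        "resName": line[17:20].strip(),   # three-letter residue name, e.g. ALA
        "chainID": line[21:22].strip(),
        "resSeq": int(line[22:26]),
        "pos": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "element": line[76:78].strip(),   # may be empty in older files
    }


def group_by_residue(lines):
    """Group parsed atoms by (chainID, resSeq), keeping file order within each group."""
    residues = defaultdict(list)
    for line in lines:
        atom = parse_atom_line(line)
        if atom is not None:
            residues[(atom["chainID"], atom["resSeq"])].append(atom)
    return dict(residues)
```

Each group is one residue, so any per-residue connection scheme can work on one list at a time.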

I need to do more research, but I think the “name” attribute carries some connectivity too, with AA and AB being connected, etc.

I've had a bash at organising the imported point cloud, if anyone's interested in having a play and building on the scene.
Edited by - Sept. 26, 2014 06:04:40
User Avatar
Member
78 posts
Joined: Sept. 2008
Offline
Here's a screen grab of a protein I imported.

Attachments:
gPDB_screen.JPG (408.7 KB)

User Avatar
Member
78 posts
Joined: Sept. 2008
Offline
I've looked into the subject a bit more and now have some undergrad knowledge of molecular biology, well, almost.
The connectivity in the imported and translated PDB is a lot more complex than I first thought.
It involves separating out the different residues and then connecting them using their abbreviated three-letter names,
e.g. ALA for alanine,
then connecting each with an Add SOP in the described order.
This doc, http://www.wwpdb.org/documentation/PDB_format_1992.pdf [wwpdb.org],
gives a full description of the different residues and amino acids in its appendix.

Cheers

John
User Avatar
Member
1743 posts
Joined: March 2012
Offline
Good to hear that someone's getting good use out of this feature I slapped together in an hour or two! If you're not already, you can read them with a File SOP directly, instead of using gpdb to convert them.

The awkward thing about most PDB files is that the bonds are assumed to be evident from the distances between pairs of atoms, so the files don't include them explicitly. That means a lot of special cases and looking up numbers in charts, because different types of bonds have different lengths and, to a lesser extent, so do bonds in different arrangements. Most PDB files also omit the hydrogen atoms, unless the atom positions were found using nuclear magnetic resonance (NMR, the technique behind MRI) instead of x-ray crystallography.

As I'm sure you found in your scouring of the PDB format info, a few of the fields in PDB files from the Protein Data Bank are related to secondary protein structures, like alpha helices and beta sheets, as well as larger protein pieces, so they indicate on a broader scale how the atoms are structured, but I'm not sure if you can quite discern bond information from that without a ton of work.

For some of the larger structures, the PDB file format hits its atom count limit (100K?), but they have a similar file format for which I managed to write a simple program to split it into a bunch of PDB files, and then used copy stamping to load them all in. I haven't added it to gpdb, though. Attached is an image of the HIV capsid (something like 2.5 million atoms), with a blue volume emphasizing how much the outer structure looks like snowflakes, for a Christmas image contest.

I also attached a simple visualization just colouring all carbons dark grey, all oxygens red, all nitrogens blue, and all sulfurs yellow. Most atoms in PDB files will probably be C, O, N, S, or P, with maybe a few other things like Fe in hemoglobin and Mg in chlorophyll. You'll probably want to delete the heterogen atoms (probably mostly O, i.e. H2O), since they're just what the protein was suspended in (almost always water). I had gpdb put them in a point group for easy deletion.
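
For anyone doing the colouring outside Houdini, the scheme above can be sketched in a few lines of Python. The RGB values, the fallback colour, and the atom dictionaries with `record` and `element` keys are assumptions made for this example, not the values or data layout used in the attached image or in gpdb.

```python
# Colours follow the scheme described above; exact RGB values are guesses.
ELEMENT_COLORS = {
    "C": (0.2, 0.2, 0.2),  # dark grey
    "O": (1.0, 0.0, 0.0),  # red
    "N": (0.0, 0.0, 1.0),  # blue
    "S": (1.0, 1.0, 0.0),  # yellow
}
DEFAULT_COLOR = (1.0, 0.0, 1.0)  # arbitrary fallback for any other element


def color_for_element(element):
    """Look up an RGB colour for an element symbol such as 'C' or ' n'."""
    return ELEMENT_COLORS.get(element.strip().upper(), DEFAULT_COLOR)


def drop_heterogens(atoms):
    """Keep only atoms that came from ATOM records, discarding HETATM ones."""
    return [atom for atom in atoms if atom["record"] != "HETATM"]
```

Inside Houdini the same idea would be a colour attribute set per point plus a delete on the heterogen point group.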

Attachments:
TestPDBVisualizationSmall.png (176.8 KB)
CapsidLowRes.jpg (217.3 KB)

Writing code for fun and profit since... 2005? Wow, I'm getting old.
https://www.youtube.com/channel/UC_HFmdvpe9U2G3OMNViKMEQ [www.youtube.com]
User Avatar
Member
78 posts
Joined: Sept. 2008
Offline
Hey Mr. Dickson
That's excellent, a couple of hours well spent I'd say. Thanks very much.

I've been geeking out on this as if it's a 3D jigsaw puzzle and pieced together a scene file which does a foreach on each of the 20 standard ‘ATOM’ residues, or amino acids, and like you say, it's a ton of work!

The other non-standard ‘HETATM’ residues I'm having to add as they turn up. These ones are much more complex and very beautiful. But I've noticed that sometimes there are some miscellaneous things like HOH that I have as spare parts floating about the molecule. Is this the water you mentioned?

The thing I'm finding the hardest is connecting the backbone, or primary structure, together. I've tried tackling this by deleting everything but the first N C C O in each of the residues and then adding them together. This works half of the time when I do a foreach on separate chains, but connecting in some of the non-standard HETATM residues is beyond my knowledge at the moment.

Should I be connecting the bonds based on proximity, for the backbone at least?

I'm really interested in moving on to some of the bigger structures. What's the other file format you mention, and will you be adding the copy stamping functionality any time soon? I'd be really interested in giving that a go or beta testing it for you.

Thanks again for adding this functionality to Houdini

John
User Avatar
Member
606 posts
Joined: May 2007
Offline
This is very interesting stuff!

At one point I started on a PDB importer of my own; I hooked up BioPython and planned to use its PDB parsing functionality.

Then the Houdini gpdb came out and I got side-sidetracked to another sidetrack, and it's been gathering dust since.

Anyway, the BioPython documentation might offer some additional insights into the format: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec156 [biopython.org]

Looking forward to your progress here!
User Avatar
Member
78 posts
Joined: Sept. 2008
Offline
Hey eetu and Mr. Dickson
Have a go with this scene file with your PDBs.
Lots of room for improvement, especially with the backbone, and the layered foreach SOP is a little slow.
But I think the main structure is there.

Cheers

John

Attachments:
PDB_import01_JB.hip (2.4 MB)

User Avatar
Member
1743 posts
Joined: March 2012
Offline
HUGHSPEERS
The other non-standard ‘HETATM’ residues I'm having to add as they turn up. These ones are much more complex and very beautiful. But I've noticed that sometimes there are some miscellaneous things like HOH that I have as spare parts floating about the molecule. Is this the water you mentioned?
Yep. They're mostly water in which the protein is suspended. They might use other things in the solution, but I'm pretty sure they're always compounds that are not the protein of interest, but are just there for completeness, to show the environment in which the protein was captured. I think they wouldn't ever be bonded to the other atoms. They might be more interesting when looking at interactions between proteins and sites on other molecules, though they can also contain random bits of other proteins that just happened to be around.

The thing I'm finding the hardest is connecting the backbone, or primary structure, together. I've tried tackling this by deleting everything but the first N C C O in each of the residues and then adding them together. This works half of the time when I do a foreach on separate chains, but connecting in some of the non-standard HETATM residues is beyond my knowledge at the moment.

Should I be connecting the bonds based on proximity, for the backbone at least?
Do your PDB files have CONECT lines in them? If so, those would specify bonds, though gpdb doesn't look for them at the moment, since the PDB files on Protein Data Bank don't seem to ever have them, so far as I could find.
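
If a file does contain CONECT records, reading them is straightforward. The following is a minimal Python sketch, assuming the standard fixed-width layout (source atom serial in columns 7-11, up to four bonded serials in the 5-column fields that follow); `parse_conect` is a hypothetical helper, and as noted above gpdb does not currently do this.

```python
def parse_conect(lines):
    """Collect bonds from CONECT records as sorted (low_serial, high_serial) pairs.

    Bonds are usually listed once from each end, so pairs are de-duplicated.
    """
    bonds = set()
    for line in lines:
        if not line.startswith("CONECT"):
            continue
        source = int(line[6:11])
        # Up to four bonded atom serials, each in a 5-character field.
        for start in range(11, 31, 5):
            field = line[start:start + 5].strip()
            if field:
                target = int(field)
                bonds.add((min(source, target), max(source, target)))
    return sorted(bonds)
```

The serial numbers refer to the atom serial field of the ATOM/HETATM records, so they need mapping to point numbers before building polylines.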

Years ago, when I was first looking at PDB files for something else, I was told that the bonds were implicit based on the distance between the atoms, but that means looking at bond length tables [en.wikipedia.org], and if you want to figure out whether bonds are double, triple, or aromatic, it gets more complicated, especially with different orbital hybridizations. For example, the single bond in C=C-C=C (with the appropriate hydrogens so that each carbon has 4 bonds) may be shorter than the middle single bond in C-C-C-C. Cyclobutabenzene has aromatic bonds that are longer than the single bonds in cyclobutane, which really throws a wrench into things. Still, you can use some rough heuristics to get things right most of the time, since proteins and DNA hopefully don't have too many crazy edge cases, apart from hydrogen bonds.

Note that the distances from PDB files are in angstroms [en.wikipedia.org], not picometres, so you'll need to multiply by 100 to get picometres or divide distances in picometres by 100 to get angstroms.
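
One common rough heuristic of this kind is to call two atoms bonded when their distance is no more than the sum of their covalent radii plus a small tolerance. Below is a minimal Python sketch of that idea, working in angstroms. The radii are approximate published covalent radii, and the 0.4 angstrom tolerance, the default radius, and the function names are assumptions for illustration; it does not distinguish single, double, or aromatic bonds, and the brute-force pair loop is only practical for small molecules.

```python
import math

# Approximate single-bond covalent radii in angstroms (illustrative values).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "P": 1.07, "S": 1.05}
DEFAULT_RADIUS = 0.77  # assumed fallback for elements not in the table


def angstroms_to_picometres(distance):
    """1 angstrom is 100 picometres."""
    return distance * 100.0


def guess_bonds(atoms, tolerance=0.4):
    """Guess bonds from distances alone.

    atoms is a list of (element, (x, y, z)) with positions in angstroms.
    Returns index pairs (i, j) with i < j. Checks every pair, so O(n^2).
    """
    bonds = []
    for i in range(len(atoms)):
        elem_i, pos_i = atoms[i]
        radius_i = COVALENT_RADIUS.get(elem_i, DEFAULT_RADIUS)
        for j in range(i + 1, len(atoms)):
            elem_j, pos_j = atoms[j]
            cutoff = radius_i + COVALENT_RADIUS.get(elem_j, DEFAULT_RADIUS) + tolerance
            if math.dist(pos_i, pos_j) <= cutoff:
                bonds.append((i, j))
    return bonds
```

For large structures the pair loop would need replacing with a spatial lookup that only visits nearby points.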

For the size of the spheres, you'll probably want to use one of the various definitions of atomic radius [en.wikipedia.org], and maybe scale it so that the visualization looks better.

These sorts of things are probably easiest to do with the AttribWrangle SOP, though you'd need to know how to use VEX code, especially the pcopen function and related functions for finding points within a certain radius of the current point, as well as the addprim function to make polygon curves between the atoms. I'll see if I can find time to put together an example.

I'm really interested in moving on to some of the bigger structures. What's the other file format you mention, and will you be adding the copy stamping functionality any time soon? I'd be really interested in giving that a go or beta testing it for you.
Protein Data Bank provides a PDB file for this one [rcsb.org], but it's incomplete, since it maxes out at a bit under 100,000 atoms, so I used the CIF file, which looked like a fairly similar format. The one that has everything is the 3j3q.cif.gz file.
User Avatar
Member
271 posts
Joined: March 2012
Offline
ndickson
For some of the larger structures, the PDB file format hits its atom count limit (100K?), but they have a similar file format for which I managed to write a simple program to split it into a bunch of PDB files, and then used copy stamping to load them all in. I haven't added it to gpdb, though. Attached is an image of the HIV capsid (something like 2.5 million atoms), with a blue volume emphasizing how much the outer structure looks like snowflakes, for a Christmas image contest.

ndickson,
How did you manage to get the capsid structure? I can never seem to get this from the pdb file.
Edited by Anti-Distinctlyminty - June 18, 2016 07:23:07
User Avatar
Member
1743 posts
Joined: March 2012
Offline
Anti-Distinctlyminty
How did you manage to get the capsid structure? I can never seem to get this from the pdb file.
I used the “PDBx/mmCIF Format” download link, then wrote a quick program to split it into about 25 PDB files, and loaded each of them into Houdini. Ooh, maybe I should test out my Load Data Table [orbolt.com] asset on it (after deleting unnecessary parts), to save the trouble of splitting it up.
User Avatar
Member
271 posts
Joined: March 2012
Offline
ndickson
Ooh, maybe I should test out my Load Data Table [orbolt.com] asset on it (after deleting unnecessary parts), to save the trouble of splitting it up.

Yes. Yes you should.
This is something that we have to do so often that a decent streamlined workflow will have to be developed, so anything you can do to help will be much appreciated.
User Avatar
Member
1743 posts
Joined: March 2012
Offline
Anti-Distinctlyminty
This is something that we have to do so often that a decent streamlined workflow will have to be developed, so anything you can do to help will be much appreciated
Well, when I deleted all of the lines except those starting with “ATOM” and saved it as 3j3q.cif.txt, then used the attached HIP file, it took 7 minutes to parse and load, but it seems to have gotten everything. You could probably pipe the files through grep (or a simple Python script) to get only the lines that start with “ATOM”, if you need to automate it.
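
The "simple Python script" option could look something like this minimal sketch. The function names and file paths are placeholders for this example; it keeps only lines that begin with "ATOM", which also drops HETATM lines and the mmCIF header and loop definitions.

```python
def keep_atom_lines(lines):
    """Return only the lines that start with 'ATOM'."""
    return [line for line in lines if line.startswith("ATOM")]


def filter_file(src_path, dst_path):
    """Stream src_path to dst_path, keeping only lines that start with 'ATOM'.

    Reads one line at a time, so memory use stays small even for huge files.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.startswith("ATOM"):
                dst.write(line)
```

Called as `filter_file("3j3q.cif", "3j3q.cif.txt")` on the uncompressed download, this should give the same kind of file as deleting the other lines by hand.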

The Load Data Table asset would probably run a lot faster if the input didn't have the ton of columns that don't seem to mean much, though I might be able to add a wacky trick to parallelize it for big tables like this. I don't know what most of the columns are, so the attributes have silly names, and I probably should have gone with the fixed width option, instead of delimited, but it looked like all columns had non-whitespace content, so the default of tab, space, comma, and semi-colon delimited worked.

Attachments:
ProteinDataBank_CIF.hip (93.7 KB)
