PDG + Hqueue

Member
92 posts
Joined: July 2005
Hi all, reposting this one from the main forum…

I'm just beginning to play with the hqueuescheduler in TOPs and having no luck getting things going. The ropgeometry1_ropfetch* jobs seem to run but make no forward progress, and hqserver.log isn't giving me much help. I'm wondering which ports need to be open for this, as we run hqueue in a fairly locked-down environment. The hqueue installation is running fine for normal render jobs.

More investigation suggests that the PDG processes spawned by hqueue on the renderfarm are trying to connect back (via xmlrpc) to the originating workstation. Unfortunately our renderfarm and workstation networks are firewalled off from each other. So it looks like a no-go for the time being, unless we set up some convoluted tunneling.

I'm wondering if this is going to be a limitation for all job schedulers, as I'm interested in implementing a SLURM scheduler here. How many big installations allow renderfarm nodes to see the artist workstations and vice versa?
Staff
320 posts
Joined: May 2017
The HQueue scheduler parm interface lets you set custom callback port ranges. Is there any possibility of setting up a custom range, then opening just those ports through the firewall? Or, if you can at least run one test with the firewall off and then another with just those ports allowed, we can confirm that the firewall is the problem.
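For that kind of test, a quick way to see whether a farm node can reach the workstation at all is to try plain TCP connections on the chosen callback ports. This is only a minimal sketch: the host and port range below are placeholders (assumptions, not values taken from the HQueue scheduler), and `check_ports` is a hypothetical helper, not part of Houdini or HQueue.

```python
import socket


def check_ports(host, ports, timeout=2.0):
    """Return {port: bool}: True if a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results


if __name__ == "__main__":
    # Placeholder values: substitute the workstation address and the
    # callback port range configured on the scheduler node.
    workstation = "127.0.0.1"
    port_range = range(61000, 61005)
    for port, ok in check_ports(workstation, port_range).items():
        print(port, "reachable" if ok else "blocked or nothing listening")
```

Run it from a farm node while a cook is in progress on the workstation. A failure can also mean that nothing is listening on that port, so it only points at the firewall if the same check succeeds from a machine on the workstation network.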
Member
92 posts
Joined: July 2005
I had a chat with the network admins and they're loath to do this, since the firewalls are there for security purposes. So now I'm thinking a way around it is to run a VPN on the farm's cloud nodes and have the workstations sit on that VPN. It's going to be complicated.

This is what I'm seeing running on the farm node.

[hquser@worker-large-16cpu-centos7-1 ~]$ ps -ef | grep hq
hquser    2355     1  1 Feb26 ?        05:55:09 hserver
root     12383 12369  0 13:28 pts/0    00:00:00 sudo -i -u hquser
hquser   12384 12383  0 13:28 pts/0    00:00:00 -bash
hquser   12409 29869  0 13:29 ?        00:00:00 /bin/bash -c python -c "import xmlrpclib;s = xmlrpclib.ServerProxy('http://150.203.248.126:61034');s.start_cook('ropgeometry1_ropfetch1_1_9', '$JOBID');" && export HFS="$HQROOT/houdini_distros/hfs.$HQCLIENTARCH" && cd $HFS && source ./houdini_setup && "$HFS/bin/hython" "$HQROOT//g/data/z03/drw900/tmp/PDG/pdgtemp/37925/scripts/rop.py" -p "$HQROOT//g/data/z03/drw900/tmp/PDG/untitled.hip" -n "/obj/topnet1/ropgeometry1/ropnet1/geometry1" -to "/obj/topnet1/ropgeometry1" -i "ropgeometry1_ropfetch1_1_9" -s "150.203.248.126:61034" -fs 1 -fe 1 -fi 1
hquser   12412 12409  1 13:29 ?        00:00:00 /local/hquser/hqclient/./bin/python2.7-bin -c import xmlrpclib;s = xmlrpclib.ServerProxy('http://150.203.248.126:61034');s.start_cook('ropgeometry1_ropfetch1_1_9', '789');
hquser   12419 12384  0 13:29 pts/0    00:00:00 ps -ef
hquser   12420 12384  0 13:29 pts/0    00:00:00 grep --color=auto hq
hquser   29869   906  0 Mar15 ?        00:19:29 ./bin/python2.7-bin hqnode.py
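To check the callback path without involving PDG, the one-liner in the process list can be imitated with a stand-in listener. The sketch below is an assumption-laden test harness, not PDG's implementation: only the method name `start_cook` and its two arguments come from the command line above, while `start_stub_server` and `call_start_cook` are hypothetical helpers. It uses the Python 3 `xmlrpc` modules, whereas the farm node above is running Python 2.7's `xmlrpclib`.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def start_stub_server(host="127.0.0.1", port=0):
    """Start a stand-in XML-RPC listener; return (server, list of received calls)."""
    calls = []

    def start_cook(item_name, job_id):
        # Stand-in only: record the call. What the real handler does
        # is not shown in this thread.
        calls.append((item_name, job_id))
        return True

    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(start_cook, "start_cook")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, calls


def call_start_cook(host, port, item_name, job_id):
    """Make the same kind of call as the farm-side one-liner."""
    proxy = ServerProxy("http://%s:%d" % (host, port))
    return proxy.start_cook(item_name, job_id)


if __name__ == "__main__":
    server, calls = start_stub_server()
    host, port = server.server_address
    print(call_start_cook(host, port, "ropgeometry1_ropfetch1_1_9", "789"))
    print(calls)
    server.shutdown()
    server.server_close()
```

Running the stub on the workstation (bound to its real address and one of the callback ports) and the client call on a farm node would show whether an XML-RPC round trip can get through at all.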

BTW, I note that there is a potential problem here besides the network issue. The hip file, sitting on the same mounted file system /g/data/z03, is not strictly under

hqserver.sharedNetwork.path.linux = /g/data/z03/hqueue

which is where houdini_distros/hfs.linux-x86_64 etc. live.

This works fine for hqueue rendering, which doesn't add $HQROOT in front of the absolute path "/g/data/z03/drw900/tmp/PDG/untitled.hip".
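The mismatch can be shown with a small path check. This is a sketch only: `is_under_root` and `naive_prefix` are hypothetical helpers, and the plain string concatenation is a guess at how the "$HQROOT//g/data/..." form in the command line could arise, not PDG's actual code.

```python
import posixpath


def is_under_root(path, root):
    """True if the absolute POSIX path lies inside root (or equals it)."""
    path = posixpath.normpath(path)
    root = posixpath.normpath(root)
    return posixpath.commonpath([path, root]) == root


def naive_prefix(root_var, path):
    """Mimic joining a root variable and a path by plain concatenation."""
    return root_var + "/" + path


if __name__ == "__main__":
    shared_root = "/g/data/z03/hqueue"
    hip = "/g/data/z03/drw900/tmp/PDG/untitled.hip"
    print(is_under_root(hip, shared_root))  # the hip file is outside the shared root
    print(naive_prefix("$HQROOT", hip))     # reproduces the doubled slash
```

With the paths from this post, the hip file is not under the shared root, and prefixing it anyway gives a path that only resolves if $HQROOT happens to expand to an empty string.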

Edited by drew - March 22, 2019 01:47:15