Hi all, reposting this one from the main forum…
I'm just beginning to play with the hqueuescheduler in TOPS and having no luck getting things going. The ropgeometry1_ropfetch* jobs seem to run but make no forward progress, and hqserver.log isn't giving me much help. I'm wondering which ports need to be open for this, as we run hqueue in a fairly locked-down environment. The hqueue installation is running fine for normal render jobs.
More investigation suggests that the PDG processes being spawned by hqueue on the renderfarm are trying to connect back (via xmlrpc) to the originating workstation. Unfortunately our renderfarm and workstation networks are firewalled off from each other. So it looks like a no-go for the time being, short of some convoluted tunneling setup.
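For anyone following along, here is a minimal sketch of the callback pattern described above, run entirely on one machine. It is not PDG's actual implementation: the toy server, the helper names `start_callback_server` and `notify_start`, and the use of Python 3's `xmlrpc` modules (the farm command uses Python 2's `xmlrpclib`) are all assumptions for illustration. Only the `start_cook` method name and its two arguments come from the command line seen on the farm node.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def start_callback_server(host="127.0.0.1", port=0):
    """Start a toy XML-RPC server standing in for the workstation side.

    Port 0 asks the OS for any free port. Returns the server and a list
    that records every start_cook call it receives.
    """
    started = []
    server = SimpleXMLRPCServer((host, port), logRequests=False)

    def start_cook(item_name, job_id):
        # Hypothetical handler: just record which work item reported in.
        started.append((item_name, job_id))
        return True

    server.register_function(start_cook, "start_cook")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, started


def notify_start(address, item_name, job_id):
    """Play the farm-side role: call back to the given (host, port)."""
    proxy = ServerProxy("http://%s:%d" % address)
    return proxy.start_cook(item_name, job_id)


if __name__ == "__main__":
    server, started = start_callback_server()
    notify_start(server.server_address, "ropgeometry1_ropfetch1_1_9", "789")
    print(started)
    server.shutdown()
    server.server_close()
```

The point of the sketch is that the farm process is the one opening the connection, so a firewall that blocks farm-to-workstation traffic stops the call before any cooking is reported.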
I'm wondering whether this is going to be a limitation for all job schedulers, as I'm interested in implementing a SLURM scheduler here. How many big installations allow renderfarm nodes to see the artist workstations and vice versa?
PDG + Hqueue
- drew
- Member
- 117 posts
- Joined: July 2005
- Offline
- seelan
- Member
- 571 posts
- Joined: May 2017
- Offline
The HQueue scheduler parm interface lets you set custom callback port ranges. Is there any possibility of setting up a custom range, then opening just those ports through the firewall? Or, if you can, at least run one test with the firewall off and then another with only those ports allowed, so that we can confirm the firewall is the problem.
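A rough way to run that test from a farm node is a small TCP probe like the sketch below. The function names and the example host and port range are placeholders, not anything from HQueue or PDG. Note that a port only reports "open" if something is actually listening on it at the workstation; a "timeout" usually points at a firewall silently dropping packets, while "refused" means the host was reached but nothing was listening.

```python
import socket


def probe_port(host, port, timeout=2.0):
    """Classify one TCP port as 'open', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "error"


def probe_range(host, first_port, last_port, timeout=2.0):
    """Probe every port in [first_port, last_port]; return {port: status}."""
    return {
        port: probe_port(host, port, timeout)
        for port in range(first_port, last_port + 1)
    }


if __name__ == "__main__":
    # Placeholder host and range: substitute the workstation address and
    # the callback port range configured on the scheduler node.
    for port, status in probe_range("127.0.0.1", 61000, 61005).items():
        print(port, status)
```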
- drew
- Member
- 117 posts
- Joined: July 2005
- Offline
I had a chat with the network admins and they're loath to do this; the firewalls are there for security purposes. So now I'm thinking a way around it is to run a VPN on the farm's cloud nodes and have the workstations sit on that VPN. It's going to be complicated.
This is what I'm seeing running on the farm node.
[hquser@worker-large-16cpu-centos7-1 ~]$ ps -ef | grep hq
hquser 2355 1 1 Feb26 ? 05:55:09 hserver
root 12383 12369 0 13:28 pts/0 00:00:00 sudo -i -u hquser
hquser 12384 12383 0 13:28 pts/0 00:00:00 -bash
hquser 12409 29869 0 13:29 ? 00:00:00 /bin/bash -c python -c "import xmlrpclib;s = xmlrpclib.ServerProxy('http://150.203.248.126:61034');s.start_cook('ropgeometry1_ropfetch1_1_9', '$JOBID');" && export HFS="$HQROOT/houdini_distros/hfs.$HQCLIENTARCH" && cd $HFS && source ./houdini_setup && "$HFS/bin/hython" "$HQROOT//g/data/z03/drw900/tmp/PDG/pdgtemp/37925/scripts/rop.py" -p "$HQROOT//g/data/z03/drw900/tmp/PDG/untitled.hip" -n "/obj/topnet1/ropgeometry1/ropnet1/geometry1" -to "/obj/topnet1/ropgeometry1" -i "ropgeometry1_ropfetch1_1_9" -s "150.203.248.126:61034" -fs 1 -fe 1 -fi 1
hquser 12412 12409 1 13:29 ? 00:00:00 /local/hquser/hqclient/./bin/python2.7-bin -c import xmlrpclib;s = xmlrpclib.ServerProxy('http://150.203.248.126:61034');s.start_cook('ropgeometry1_ropfetch1_1_9', '789');
hquser 12419 12384 0 13:29 pts/0 00:00:00 ps -ef
hquser 12420 12384 0 13:29 pts/0 00:00:00 grep --color=auto hq
hquser 29869 906 0 Mar15 ? 00:19:29 ./bin/python2.7-bin hqnode.py
BTW I note that there is a potential problem here as well, besides the network issue. The hip file, sitting on the same mounted file system /g/data/z03, is not strictly under
hqserver.sharedNetwork.path.linux = /g/data/z03/hqueue
which is where houdini_distros/hfs.linux-x86_64 etc lives.
This layout works fine for hqueue rendering, which doesn't add $HQROOT in front of the absolute path “/g/data/z03/drw900/tmp/PDG/untitled.hip”, whereas the PDG job command above does.
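To make the path problem concrete, here is a small sketch of the difference between blindly prefixing $HQROOT and only rewriting paths that really live under the shared root. The helper names `is_under` and `farm_path` are hypothetical and this is not how PDG or HQueue build their paths; it only illustrates why "$HQROOT//g/data/..." cannot resolve when the hip file sits outside /g/data/z03/hqueue.

```python
import posixpath


def is_under(path, root):
    """True if path is root itself or lies inside root."""
    path = posixpath.normpath(path)
    root = posixpath.normpath(root)
    return path == root or path.startswith(root + "/")


def farm_path(path, shared_root, hq_root_var="$HQROOT"):
    """Rewrite path in terms of $HQROOT only when it is under shared_root.

    Paths outside the shared root are returned as normalized absolute
    paths, since prefixing them would point at a nonexistent location.
    """
    if is_under(path, shared_root):
        rel = posixpath.relpath(path, shared_root)
        return hq_root_var if rel == "." else posixpath.join(hq_root_var, rel)
    return posixpath.normpath(path)


if __name__ == "__main__":
    shared = "/g/data/z03/hqueue"
    hip = "/g/data/z03/drw900/tmp/PDG/untitled.hip"
    print("$HQROOT/" + hip)          # naive prefix, as seen in the job command
    print(farm_path(hip, shared))    # left as an absolute path
    print(farm_path(shared + "/houdini_distros/hfs.linux-x86_64", shared))
```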
Edited by drew - March 22, 2019 01:47:15
- GeordieM
- Member
- 11 posts
- Joined: Nov. 2013
- Offline
- chrisgreb
- Member
- 603 posts
- Joined: Sept. 2016
- Offline
GeordieM
This is unfortunate. I would have assumed all communication between workstation and clients would be proxied through the HQ Server since that's how normal rendering works.
It has actually since been changed to work like that. All communication from jobs is now routed through a message queue that runs on the farm, so there should be no problems with firewalls or other restrictions.