PDG/TOPs » Tractor scheduler, using a geo rop fetch generates socket errors
mestela
We have a ROP fetch calling the render node inside a File Cache node in a SOP network. When run on the farm, Tractor kicks off a Python socket error:

[SG update_context_variables] Ignoring update context variables, hdefereval not defined, most likely you are running a job in the farm and this automation is not needed.
Loading .hip file //mnt/ala/mav/2019/sandbox/studio1/tech/matt/pdg/pdg_clean_v01.hipnc.
Traceback (most recent call last):
  File "//mnt/ala/mav/2019/sandbox/studio1/tech/matt/pdg/pdgtemp/44001/scripts/rop.py", line 502, in <module>
    cooker.cookSingleFrame(args)
  File "//mnt/ala/mav/2019/sandbox/studio1/tech/matt/pdg/pdgtemp/44001/scripts/rop.py", line 203, in cookSingleFrame
    reportResultData(parm.evalAtFrame(args.start), server_addr=args.server)
  File "/mnt/ala/mav/2019/sandbox/studio1/tech/matt/pdg/pdgtemp/44001/scripts/pdgcmd.py", line 222, in reportResultData
    result_data_tag, hash_code, jobid)
  File "/opt/hfs17.5.173/python/lib/python2.7/xmlrpclib.py", line 1243, in __call__
    return self.__send(self.__name, args)
  File "/opt/hfs17.5.173/python/lib/python2.7/xmlrpclib.py", line 1602, in __request
    verbose=self.__verbose
  File "/opt/hfs17.5.173/python/lib/python2.7/xmlrpclib.py", line 1283, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/opt/hfs17.5.173/python/lib/python2.7/xmlrpclib.py", line 1311, in single_request
    self.send_content(h, request_body)
  File "/opt/hfs17.5.173/python/lib/python2.7/xmlrpclib.py", line 1459, in send_content
    connection.endheaders(request_body)
  File "/opt/hfs17.5.173/python/lib/python2.7/httplib.py", line 1038, in endheaders
    self._send_output(message_body)
  File "/opt/hfs17.5.173/python/lib/python2.7/httplib.py", line 882, in _send_output
    self.send(msg)
  File "/opt/hfs17.5.173/python/lib/python2.7/httplib.py", line 844, in send
    self.connect()
  File "/opt/hfs17.5.173/python/lib/python2.7/httplib.py", line 821, in connect
    self.timeout, self.source_address)
  File "/opt/hfs17.5.173/python/lib/python2.7/socket.py", line 575, in create_connection
    raise err
socket.error: [Errno 113] No route to host
[SG update_context_variables] Ignoring update context variables, hdefereval not defined, most likely you are running a job in the farm and this automation is not needed.
PDG_RESULT: ropfetch1_1;-1;'/mnt/ala/mav/2019/sandbox/studio1/tech/matt/pdg/geo/cache.0001.bgeo.sc';;0


Any clues?
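(For context: the failing call is reportResultData in pdgcmd.py phoning home to the PDG callback server over XML-RPC, using the server address passed to the job script. A bare connect test from a blade reproduces the same errno; the host and port below are placeholders for that address:)

    import socket

    # Placeholders: substitute the callback server address from the task's command line
    host, port = "192.168.0.10", 44001

    try:
        socket.create_connection((host, port), timeout=5).close()
        print("reachable")
    except socket.error as e:
        # Errno 113 here means the blade has no route back to the workstation
        print("unreachable: %s" % e)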
chrisgreb
Could you please attach a hip file that reproduces this?
mestela
Yep, I attached it to a support ticket (Bug ID# 95628); can share it here if it's easier.

Jenny asked if I was trying this under an Apprentice licence. We're using Education licences (.hipnc) here at the uni studio; might that be the cause of the problems? We've been able to do the same process of fetching to a RenderMan RIS ROP fine.
user1111
I think I was having a similar problem using a farm with TOPs. When you go to the task info for each frame, is the -s “IP ADDRESS” argument correct for the machine that started the cook, as seen from the rest of the farm?
If it isn't, the TOPs Python command doesn't seem to be reading off the correct network device (it wasn't for me), so I had to change it to read from an os.environ variable set to an IP the farm could see; see the sketch below.
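Something along these lines; FARM_VISIBLE_IP is a hypothetical variable name you'd set yourself, and the fallback just mirrors the usual local hostname lookup:

    import os
    import socket

    def callback_host():
        # FARM_VISIBLE_IP is a hypothetical env var set to an address the
        # farm can route to; otherwise fall back to resolving the local
        # hostname, which is the default-style behaviour.
        return os.environ.get("FARM_VISIBLE_IP",
                              socket.gethostbyname(socket.gethostname()))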
chrisgreb
A recent fix might resolve this issue; it's in 17.5.212.
If it doesn't, please let us know so we can look into it.
mestela
We realised that PDG was trying to communicate over a random port for every task. Our farm is in a datacenter two firewalls away from our studio network. After temporarily unblocking all ports on a workstation, and only allowing these jobs to run on idle workstations within the studio, jobs worked.

We saw there are options to specify a port range on the scheduler. We're gonna have our IT folk talk to the other IT departments to get those port ranges open on the firewalls between us and the farm.
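For anyone else checking this, here's a quick probe of whether a candidate range is actually open through the firewalls. The host and range below are placeholders; the useful distinction is that "connection refused" means the firewall passed the packet, while a timeout or "no route to host" usually means it was dropped:

    import errno
    import socket

    host = "workstation.studio.example"   # placeholder: the submitting machine
    for port in range(44000, 44010):      # placeholder: candidate callback range
        try:
            socket.create_connection((host, port), timeout=2).close()
            print("%d: open, something is listening" % port)
        except socket.error as e:
            if e.errno == errno.ECONNREFUSED:
                # The firewall let the packet through; no listener is fine here.
                print("%d: reachable" % port)
            else:
                # A timeout or "no route to host" usually means a firewall drop.
                print("%d: blocked (%s)" % (port, e))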
chrisgreb
Great.

There is a plan to remove the need for exposed ports on the user's machine. Instead, a tracker job will execute on the farm and communicate with the task jobs (hopefully there is no firewall between farm blades), similar to how Houdini distributed sims work. That should spare IT from having to expose a port range.
Leon_Y
Hi

I am wondering whether the port range stuff has been updated in 17.5.293?
chrisgreb
Yes. The farm jobs now communicate back to a message queue process running as a job on the farm. The only requirements now are that PDG can contact that farm machine at a particular port, and that farm machines can contact each other.