Manual Distributed Simulation tracker problems

Member
90 posts
Joined: April 2011
Hi,
I'm currently trying to find out how to run a FLIP simulation across multiple computers using Royal Render (I don't have access to HQueue, and my teachers don't want to install it just for simulating). If Royal Render can't do it, the alternative is to set up the distributed jobs without HQueue, based on the masterclass by Jeff Wagner (http://www.sidefx.com/index.php?option=com_content&task=view&id=1516&Itemid=9), which will be much more painful.

So, I've followed Jeff's steps until the end, just before he automates the distribution.

And here is where the trouble starts:
I always get the same message,

Write Connect Failure. Error occured
My length 34
My adress “Computer name” by 8000
Message in error state after connection atempt.

I first suspected the tracker, but it can be accessed, even from another computer.
Any ideas about where this could come from? Or could someone explain how to use Royal Render to sim across multiple computers (as it doesn't take HQueue simulations into account)?

Thanks in advance
Member
6806 posts
Joined: July 2005
I'm having a similar issue: we can't use HQueue but need to distribute sims. The docs are a bit lacking on how to do this. I see “tracker.py” referenced but, uh, how do you use it?
Member
35 posts
Joined: May 2013
Same here, but only for distributed sims.
Distributed Mantra rendering works fine.
Member
41 posts
Joined: June 2010
I just dealt with this about a month ago. I did not use HQueue; instead, I ran a script on each machine.

Using the same setup, make sure you disable resize container (slices don't work with resize)

In your Houdini file, you have to set the host name of your tracker. You will find this under “DISTRIBUTE_pyro_CONTROLS”. You can probably just use localhost.

Make sure you run all the commands in a Houdini shell:
(houdini install path)\houdini\14.0.233\bin\hcmd.exe

When you have the shell open, you need to run a hython command to start the sim tracker:

hython (houdini install path)\houdini\14.0.233\houdini\python2.7libs\simtracker.py 8000 9000

Once the sim tracker is running, you can run the houdini shell on your remote machines and use this command:

hbatch -c "setenv SLICE=0; render /obj/distribute_uprespyro/saveslices;quit;" houdinifilepath\distributed_pyro.hip

You'll of course need to change the object and file paths to suit your needs and you will also need to change the slice on each computer.

If any one of the machines fails, the tracker will report the error.
~t.goat
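
Since the slice number has to change on every machine, the command above can be generated rather than retyped. The following is only a minimal sketch: it builds the same hbatch argument list as the post for each slice index and prints it. The function names are made up for illustration, and the node path and .hip file are the example values from the post, so they need replacing with your own.

```python
# Sketch: build one hbatch command per slice, mirroring the command in the post.
# slice_command / all_slice_commands are hypothetical helper names.


def slice_command(slice_index, rop_path, hip_file):
    """Return the hbatch argument list that renders one slice."""
    script = "setenv SLICE={}; render {};quit;".format(slice_index, rop_path)
    return ["hbatch", "-c", script, hip_file]


def all_slice_commands(num_slices, rop_path, hip_file):
    """Return one argument list per slice, numbered from 0."""
    return [slice_command(i, rop_path, hip_file) for i in range(num_slices)]


if __name__ == "__main__":
    # Example values taken from the post; adjust to your scene.
    commands = all_slice_commands(
        3, "/obj/distribute_uprespyro/saveslices", "distributed_pyro.hip"
    )
    for command in commands:
        print(command)
```

Each printed list is meant to be run in a Houdini shell on a different machine (for example via `subprocess.call(command)`), after the sim tracker has been started.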
Staff
4938 posts
Joined: July 2005
trojan_goat
Using the same setup, make sure you disable resize container (slices don't work with resize)

In H15, resize works with distribution!
Member
35 posts
Joined: May 2013
Write Connect Failure. Error occured
My length 34
My adress “Computer name” by 8000
Message in error state after connection atempt.

—–

So is it a known problem?
Trojan, does running this script on each machine give any feedback in the HQueue monitor?

thx guys.
Staff
4938 posts
Joined: July 2005
That error is generated by the Houdini session when it fails to connect to another machine.

In this case, the port 8000 means you are likely trying to connect to the machine running the tracker.

Is “Computer name” running the tracker?

You should be able to point a web browser to
http://<computer name>:9000
and see the tracker status there.
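
The same check can be done without a browser, from each machine that is supposed to simulate. This is a rough sketch only: it just tests whether a TCP connection to the tracker's two ports can be opened. The ports 8000 and 9000 are the ones used in the simtracker.py command earlier in the thread, and `tracker_host` is a placeholder to replace with the tracker machine's name.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        sock = socket.create_connection((host, port), timeout)
    except socket.error:
        # Covers refused connections, timeouts and failed name lookups.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    tracker_host = "localhost"  # placeholder: use the tracker machine's name
    for port in (8000, 9000):  # tracker port and web status port from the thread
        state = "reachable" if can_connect(tracker_host, port) else "NOT reachable"
        print("{}:{} is {}".format(tracker_host, port, state))
```

If this reports the ports as reachable on the tracker machine itself but not from the other machines, the problem is most likely name resolution or a firewall rather than Houdini.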
Member
35 posts
Joined: May 2013
Jeff, I can only access the tracker's status page from the machine it runs on, so I'm almost sure it's a permission/firewall issue. I just don't know which service or port to enable, since the tracker uses a different port every time.

I'm using Linux, btw.

Hqueue status: NODE02 computer

The Houdini 15.0.244.16 environment has been initialized.
ALF_PROGRESS 0%
Write Connect failure. code: Bad Address
My length 0
My address node03 by 35332
Write Connect failure. code: Bad Address
My length 31
My address node03 by 35332
Message in error state after connection attempt!
—- Pump enters error status —-
Write Connect failure. code: Bad Address
My length 140515278872504
My address node03 by 35332
Write Connect failure. code: Bad Address
My length 29
My address node03 by 35332
Message in error state after connection attempt!


Tracker message: NODE03
REFRESH(30 sec): http://node03:57854/

Active List

Job: flipsolver1_2: @1446651059.781678, (n: 2, a: 1, d: 0, e: 0)

Peer Info acquire->sync acquire->done sync->done
peer #3 (@1446651059.781682) - 192.168.0.11 : 45239 pending pending pending
__________________________________________________________________

Barriers
__________________________________________________________________

Done List
Member
35 posts
Joined: May 2013
The documentation shows something important. I'll try to fix it using static DNS.

Verify that network connections are possible between machines.
The client machines will communicate with the server and with the shared folder host machine. Check that every client machine can locate the HQueue Server machine by its domain (DNS) name. Similarly check that the clients can locate the Shared Folder Server machine by its DNS name.

Additionally check that the host names (or computer names) of the client machines match their DNS names. This is important for when the HQueue server needs to contact the clients.
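
A quick way to run the checks quoted above is to ask each machine what its own host name resolves to. The snippet below is a sketch under simple assumptions: it only does an IPv4 lookup through the system resolver (which also reads /etc/hosts), and the list of names is a placeholder that you would extend with your server and client names.

```python
import socket


def resolve(name):
    """Return the IPv4 address a host name resolves to, or None on failure."""
    try:
        return socket.gethostbyname(name)
    except socket.error:
        return None


if __name__ == "__main__":
    # Placeholder list: add the tracker/HQueue server and the client names here.
    names = [socket.gethostname()]
    for name in names:
        address = resolve(name)
        if address is None:
            print("{}: does not resolve".format(name))
        else:
            print("{}: resolves to {}".format(name, address))
```

Running it on every machine and comparing the printed addresses shows whether all machines agree on who is who.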
Staff
4938 posts
Joined: July 2005
So you got Hqueue to work? Excellent. That is the best approach.

However, as you notice, it allows the tracker to open any port rather than a fixed port. This is important to ensure more than one tracker can run on one machine, and that you don't get failures due to a port still being held from a previous simulation.

The client-to-client communication also allocates a port at run time. There is no fixed set of ports to open between the machines. There must be no firewall/shorewall between the machines that are simulating, or between them and the tracker. There MUST, of course, be a firewall between your machines and the rest of the internet!

If you can't access the tracker's status from other machines, then they also will be unable to access the tracker directly. This is the first thing to fix. Unfortunately, there are many Linux configurations out there, so it is hard to provide more than general pointers.
Member
35 posts
Joined: May 2013
OK!
The fix for the earlier error was to set up /etc/hosts with hostname/IP entries.
I get no error messages now, but it seems stuck in the same state.

tracker message:

Active List
Job: flipsolver1_2: @1446662733.185918, (n: 4, a: 1, d: 0, e: 0)
Peer Info acquire->sync acquire->done sync->done
peer #3 (@1446662733.185923) - 127.0.0.1 : 48640 pending pending pending

Staff
4938 posts
Joined: July 2005
Only one machine successfully connected to the tracker. The only machine that connected was the machine running the tracker, thus the 127.0.0.1 loopback address.

The two problems I see are:
1) The other three machines aren't connecting/finding the tracker
2) The 127.0.0.1 suggests you have your own machine name in /etc/hosts as 127.0.0.1. Some Linux machines are configured this way. If you have a line
127.0.0.1 mymachinename
it should be removed.
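
To find such a line without reading the file by eye, something like the following can be used. It is a minimal sketch: it scans hosts-file text for entries that map a given machine name to a 127.x.x.x address. The function name and the sample text are made up for illustration; on a real machine you would pass in the contents of /etc/hosts and the machine's own host name.

```python
def loopback_entries(hosts_text, machine_name):
    """Return the hosts-file lines that map machine_name to a 127.x.x.x address."""
    hits = []
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        if address.startswith("127.") and machine_name in names:
            hits.append(line.strip())
    return hits


if __name__ == "__main__":
    # Hypothetical sample; replace with open("/etc/hosts").read().
    sample = (
        "127.0.0.1 localhost\n"
        "127.0.0.1 mymachinename\n"
        "192.168.0.11 mymachinename\n"
    )
    for entry in loopback_entries(sample, "mymachinename"):
        print("remove or fix:", entry)
```

The `127.0.0.1 localhost` line is left alone on purpose; only the machine's real name should stop pointing at the loopback address.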
Member
35 posts
Joined: May 2013
Perfect! It's working now. Just one last point.

I had to temporarily stop the firewall on all machines…
So I still have to allow something through the firewall. Would it be a service or a port? Apparently the tracker always uses a random port, so I imagine it is a service that needs to be allowed, right?

final status tracker:

Active List

Job: flipsolver1_228: @1446673101.530937, (n: 3, a: 2, d: 0, e: 0)
Peer Info acquire->sync acquire->done sync->done
peer #2 (@1446673101.530939) - 192.168.0.10 : 35440 pending pending pending
peer #0 (@1446673101.536830) - 192.168.0.11 : 48590 pending pending pending


Barriers

Done List

Job: flipsolver1_228 - Pressure Exchange: @1446673101.446718, (n: 3, a: 3, d: 3, e: 0)
acquire->sync: 0.001169s
sync->done: 0.008716s
Peer Info acquire->sync acquire->done sync->done
peer #1 (@1446673101.446720) - 192.168.0.3 : 47357 0.001167s 0.009683 0.008516
peer #2 (@1446673101.446785) - 192.168.0.10 : 35440 0.001102s 0.009733 0.008631
peer #0 (@1446673101.447858) - 192.168.0.11 : 48590 0.000029s 0.007951 0.007922


Job: flipsolver1_228 - Pressure Request: @1446673101.426048, (n: 3, a: 3, d: 3, e: 0)
acquire->sync: 0.001146s
sync->done: 0.012683s
Peer Info acquire->sync acquire->done sync->done
peer #2 (@1446673101.426050) - 192.168.0.10 : 35440 0.001144s 0.013817 0.012673
peer #1 (@1446673101.426124) - 192.168.0.3 : 47357 0.001070s 0.013174 0.012104
peer #0 (@1446673101.427141) - 192.168.0.11 : 48590 0.000053s 0.012087 0.012034
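
For anyone watching this status page while debugging, the job lines can be read by a small script instead of by eye. This is only a sketch, and it rests on an assumption: that `n` is the number of peers the job expects and `a` the number that have reached the tracker so far. That reading is inferred from the replies in this thread (n: 4, a: 1 was described as only one machine having connected), not taken from documentation. The function names are hypothetical.

```python
import re

# Matches lines such as:
# Job: flipsolver1_2: @1446662733.185918, (n: 4, a: 1, d: 0, e: 0)
JOB_LINE = re.compile(
    r"Job:\s*(?P<name>.+?):\s*@(?P<stamp>[\d.]+),\s*"
    r"\(n:\s*(?P<n>\d+),\s*a:\s*(?P<a>\d+),\s*d:\s*(?P<d>\d+),\s*e:\s*(?P<e>\d+)\)"
)


def parse_job_line(line):
    """Return the job name and the n/a/d/e counters as a dict, or None."""
    match = JOB_LINE.search(line)
    if match is None:
        return None
    info = {key: int(match.group(key)) for key in ("n", "a", "d", "e")}
    info["name"] = match.group("name")
    return info


def missing_peers(info):
    """Assumed meaning: expected peers (n) minus peers that connected (a)."""
    return info["n"] - info["a"]


if __name__ == "__main__":
    line = "Job: flipsolver1_228: @1446673101.530937, (n: 3, a: 2, d: 0, e: 0)"
    info = parse_job_line(line)
    print(info["name"], "is waiting for", missing_peers(info), "more peer(s)")
```

A job that stays in the Active List with a nonzero count here is a hint that some machine still cannot reach the tracker.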
Staff
4938 posts
Joined: July 2005
There are no specific ports, I'm afraid. I'm not sure how your firewall is configured, but all ports should be opened for peer-to-peer traffic. I suspect the pre-built “services” are to predefine port-sets.

It is very important you keep a firewall between your compute nodes and the rest of the internet, however. A usual configuration is your node machines are all behind a single router/firewall box but have free connection to each other.
Member
7 posts
Joined: April 2016
Great topic. It solved my distributed simulation tracker problem.