TractorScheduler farm issues

davidoberst

March 3, 2020 07:02:07

We are having a couple of problems getting TractorScheduler set up with our farm:

1. In your prtractor.py, the TractorScheduler._initEngineConnection() method explicitly sets the Tractor API engine client “session filename” (which caches the session connection info) with:
```
tq.setEngineClientParam(sessionFilename=sessionFilename)
```
You build sessionFilename starting at the user home area:
```
home = os.path.expanduser("~")
```
This seems to bake in the assumption that for a job submitted by, say, “fred.flintstone”, any of our Linux farm blades running it will have something like “/home/fred.flintstone” in existence, or creatable. Our crew don't log into the farm blades, so home areas don't exist for them, and of course can't be created on the fly by anything other than root. So when a TractorScheduler task makes a tractor API call like tq.tasks() (as the tick() method does), Tractor will throw an exception trying to create the session file you have specified:
TractorQueryError: problem creating session directory '/home/fred.flintstone': Permission denied: '/home/fred.flintstone'

It would be good if there were some way to specify an override for the directory to create the session files in. Perhaps it could check an environment variable first, changing the line above to:
```
home = os.environ.get("PDG_TRACTOR_SESSION_DIR", os.path.expanduser("~"))
```
You already put the hostname in the resulting filename, so the override could point at a globally accessible area somewhere, or just use something like “/tmp” locally.
2. When using the “Submit Graph as Job” option, the individual cooking tasks don't have to report back to the originating machine, but do they still do some sort of RPC reporting back to the master task? They all have something like “PDG_RESULT_SERVER=farmblade123.ourfarm.net:39242” set, and are currently failing with a Python “No route to host” socket error, presumably because of how our farm's firewalling is set up. Is there any documentation/control on the range of ports used, or are you just asking the network stack for any available port#?
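For the first problem, the suggested override could be sketched roughly as below. This is only an illustration of the idea, not the actual prtractor.py code: the helper names and the session filename pattern are made up, and PDG_TRACTOR_SESSION_DIR is just the variable name proposed above. Unlike the one-line change, it also checks that the chosen directory is writable and falls back to the system temp directory.

```python
import os
import socket
import tempfile


def resolve_session_dir(env_var="PDG_TRACTOR_SESSION_DIR"):
    """Pick a writable directory for the Tractor session file.

    Order: explicit override, then the user's home area (the current
    behaviour), then the system temp directory as a last resort.
    """
    candidates = [
        os.environ.get(env_var),
        os.path.expanduser("~"),
        tempfile.gettempdir(),
    ]
    for path in candidates:
        if path and os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    raise RuntimeError("no writable directory for the session file")


def build_session_filename(directory, hostname=None):
    # Hypothetical naming scheme; the real file name used by
    # prtractor.py is not shown in this thread.
    hostname = hostname or socket.gethostname()
    return os.path.join(directory, ".pdg_tractor_session_{}".format(hostname))
```

On a blade with no home area for the submitting user, setting the variable to a shared or local scratch path would make resolve_session_dir() return that path instead of “/home/fred.flintstone”.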

chrisgreb

March 3, 2020 10:27:05

davidoberst
It would be good if there were some way to specify an override for the directory to create the session files in.

Yes, that makes sense; we can add that.

Is there any documentation/control on the range of ports used, or are you just asking the network stack for any available port#?

In that mode it's just using an available port #. The assumption is that there is no firewall between farm machines. However we can expose a parm to control which port is used.

davidoberst

March 3, 2020 10:57:25

It should probably be a “range” sort of parameter, so that one could specify “between 30000 and 31000”, etc. and our IT could set a firewall rule to allow that range only. In theory a farm blade could be running multiple PDG->Cook job tasks each needing to acquire a separate port#, since they are lightweight controllers (you have a separate service key field for these tasks and the cooking tasks). I was trying to find where your code acquires the port# it uses - is it possible to specify an acceptable range at that point?

chrisgreb

March 3, 2020 11:22:02

The port in question is ‘Task Callback Port’, which is usually set on the PDG Message Queue job; this parm is at the bottom of the Tractor node UI. In the submit-as-job case the TOP Cook job hosts its own task callback port, so it makes sense to use that same parm. It works by trying to bind ports starting at the given port number and going up from there, for a fixed number of attempts (50). That seems like it should be enough for everyone, but we could probably expose that range length as well if it might be an issue.
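The bind-and-increment behaviour described here can be illustrated with plain sockets. This is a generic sketch of the technique, not the PDG callback server code; the function name and its parameters are invented for the example, and only the start port and the 50-attempt limit come from the description above.

```python
import socket


def bind_in_range(start_port, attempts=50, host=""):
    """Bind a TCP socket to the first free port in
    [start_port, start_port + attempts) and return (socket, port)."""
    last_error = None
    for port in range(start_port, start_port + attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except socket.error as exc:
            # Port is taken (or otherwise unusable); try the next one.
            sock.close()
            last_error = exc
            continue
        return sock, port
    raise socket.error(
        "no free port in {}-{}: {}".format(
            start_port, start_port + attempts - 1, last_error
        )
    )
```

With a start port of 30000, a firewall rule allowing 30000 to 30049 would cover every port this loop can return, including the case where several controller tasks share one blade.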

davidoberst

March 3, 2020 14:38:31

I had tried filling in Task Callback Port, on the chance that it would be used by the Submit Graph as Job option as well, but it wasn't. If that does a +50 check, that should be more than enough, provided the Tractor controlling task uses that mechanism as well.

Until then, where does the current code bind to a port? If I can find the right file and procedure, I may be able to monkey-patch in some different behaviour. That's what we are going to do as an interim solution for #1: monkey-patch in an altered _initEngineConnection() that calls the original, then makes another setEngineClientParam() call to provide a more suitable sessionFilename.
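The shape of that interim monkey-patch is roughly as follows. FakeScheduler and use_tmp_session are stand-ins so the sketch runs on its own; in the real patch the class would be TractorScheduler, and the hook would make the second setEngineClientParam(sessionFilename=...) call rather than setting an attribute.

```python
import functools


def call_after(cls, method_name, hook):
    """Replace cls.method_name with a wrapper that runs the original
    method, then hook(self), and returns the original's result."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        hook(self)
        return result

    setattr(cls, method_name, wrapper)
    return original  # kept so the patch can be undone


class FakeScheduler(object):
    """Stand-in for the real scheduler class (hypothetical)."""

    def __init__(self):
        self.session_filename = None

    def _initEngineConnection(self):
        self.session_filename = "/home/fred.flintstone/session"
        return True


def use_tmp_session(scheduler):
    # Placeholder for the follow-up call that supplies a more
    # suitable session filename.
    scheduler.session_filename = "/tmp/session"


call_after(FakeScheduler, "_initEngineConnection", use_tmp_session)
```

Because the wrapper calls the original first, anything else _initEngineConnection() sets up is left intact and only the session filename is overridden afterwards.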

chrisgreb

March 3, 2020 15:03:00

In prtractor.py:532 you can add:

self._callbackserver.custom_port_range = (taskcallbackport, taskcallbackport + 50)