Deadline scheduler and AWS

Member
68 posts
Joined: Jan. 2014
Hi there.

I'm trying to send a PDG graph to run on AWS using the Deadline scheduler. I have read the docs and the farm troubleshooting page, but still haven't found a way to make it work.

Background info:
- Amazon instances run Houdini 18.0.416 (Linux)
- Submitter machine runs Houdini 18.0.416 (Windows)
- The .hip is a simple scene with a box and a topnet (2 work items), which get rendered through a ropfetch node (V-Ray)
- I can render normally using Deadline's own ROP node (not PDG), so the installation on the cloud instances looks good.

When running the scheduler in the “regular” mode (not submitting the graph as a job), it errors on the instances (socket.gaierror) because they cannot reach the submitter machine to report PDG progress back, which makes sense.

When submitting the graph as a job, it errors on the instances with the following message:

2020-04-24 08:20:50: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2020-04-24 08:20:50: Exception Details
2020-04-24 08:20:50: RenderPluginException – Failed after four attempts to copy “PDGDeadline.param”. Please make sure that the file exists in the plugin directory.
2020-04-24 08:20:50: RenderPluginException.Cause: JobError (2)
2020-04-24 08:20:50: RenderPluginException.Level: Major (1)
2020-04-24 08:20:50: RenderPluginException.HasSlaveLog: True
2020-04-24 08:20:50: RenderPluginException.SlaveLogFileName: /var/log/Thinkbox/Deadline10/deadlineslave_renderthread_0-ip-10-128-22-17-0000.log
2020-04-24 08:20:50: Exception.Data: ( )
2020-04-24 08:20:50: Exception.TargetSite: Deadline.Slaves.Messaging.PluginResponseMemento d(Deadline.Net.DeadlineMessage, System.Threading.CancellationToken)
2020-04-24 08:20:50: Exception.Source: deadline
2020-04-24 08:20:50: Exception.HResult: -2146233088
2020-04-24 08:20:50: Exception.StackTrace:
2020-04-24 08:20:50: at Deadline.Plugins.SandboxedPlugin.d(DeadlineMessage bbm, CancellationToken bbn
2020-04-24 08:20:50: at Deadline.Plugins.SandboxedPlugin.SyncFilesForJob(Job job, Boolean cleanup, String& message, CancellationToken cancellationToken
2020-04-24 08:20:50: at Deadline.Slaves.SlaveRenderThread.e(String ady, Job adz, CancellationToken aea
2020-04-24 08:20:50: at Deadline.Slaves.SlaveRenderThread.b(TaskLogWriter adu, CancellationToken adv)
2020-04-24 08:20:50: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

…I'm not sure whether this PDGDeadline.param issue is related to Houdini or to Deadline itself. I have tried many combinations, and also tried setting PDG_USE_PDGNET=1 in my houdini.env file, but I always get this error.

Any suggestions?
Thank you.
Member
68 posts
Joined: Jan. 2014
...and of course, 2 minutes after posting, I found the simple solution.

I needed to copy $HFS\houdini\pdg\plugins\PDGDeadline to C:\DeadlineRepository10\plugins
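For anyone who wants to script that copy so it can be repeated after a Houdini upgrade, here is a minimal sketch. The function name is made up for illustration, the two paths are the ones from this post, and it assumes a standalone Python 3.8+ interpreter (needed for `dirs_exist_ok`), not Houdini's bundled Python.

```python
import shutil
from pathlib import Path


def sync_pdg_deadline_plugin(hfs, repo_plugins):
    """Copy <hfs>/houdini/pdg/plugins/PDGDeadline into <repo_plugins>/PDGDeadline.

    Hypothetical helper: the folder layout is taken from the post above.
    """
    src = Path(hfs) / "houdini" / "pdg" / "plugins" / "PDGDeadline"
    dst = Path(repo_plugins) / "PDGDeadline"
    if not src.is_dir():
        raise FileNotFoundError(f"plugin folder not found: {src}")
    # dirs_exist_ok (Python 3.8+) lets a re-run overwrite an older copy in place.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

It could be called as `sync_pdg_deadline_plugin(os.environ["HFS"], r"C:\DeadlineRepository10\plugins")` after `import os`, provided HFS is set in that shell.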

rtfm
Edited by fabriciochamon - April 24, 2020 05:30:37
Member
571 posts
Joined: May 2017
You shouldn't need to copy the PDGDeadline plugin files manually. The TOP Deadline scheduler should take care of that for you (unless you toggled off the `Copy Plugin To Working Directory` parm on the node, in the Advanced > PDG Deadline Plugin parms section). By default the plugin gets copied to your PDG working_directory/pdgtemp/Plugins. Please check that this directory gets created.

But the error says that Deadline is not able to copy PDGDeadline.param, so the most likely culprit is that the farm machine cannot access the working directory the plugin was copied to. Do your farm machines have network share access to the copied plugins directory while the job is running?
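One way to check this from a farm instance is to search the mounted working directory for the file. This is only a sketch: the helper name is hypothetical, and it walks the whole tree instead of assuming the exact subfolder layout under pdgtemp/Plugins.

```python
import os


def find_plugin_param(root, filename="PDGDeadline.param"):
    """Return every path under root whose file name equals filename."""
    matches = []
    # os.walk yields nothing if root is missing or unreadable,
    # so an unreachable share simply produces an empty list.
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return matches
```

An empty result when run on the instance against the shared working directory would suggest the instance cannot see the copied plugin.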

It could also be an issue of mixed paths, considering you are scheduling from Windows and running the job on Linux. Make sure you enable and specify the correct Remote Shared Path. You can check the generated job files to see which plugin path the farm machines will use.
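To illustrate the kind of mismatch to look for, here is a small sketch that translates a Windows path under a shared root into the matching Linux path. It is not how the scheduler does the mapping internally. The function name and the example roots (`Z:\projects`, `/mnt/projects`) are hypothetical.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_remote_path(local_path, local_root, remote_root):
    """Map a Windows path under local_root to a POSIX path under remote_root."""
    local = PureWindowsPath(local_path)
    # relative_to raises ValueError if local_path is not under local_root.
    rel = local.relative_to(PureWindowsPath(local_root))
    return str(PurePosixPath(remote_root).joinpath(*rel.parts))
```

For example, `to_remote_path(r"Z:\projects\shot\pdgtemp\Plugins", r"Z:\projects", "/mnt/projects")` gives the path a Linux instance would need to be able to open. If the plugin path in the job files still looks like a Windows path, the Remote Shared Path is probably not being applied.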

But what you ended up doing (i.e. manually copying over the PDGDeadline folder) is okay as a last resort. Just remember to update this folder whenever you update your Houdini installation, since we might change the PDGDeadline plugin.

As for your first error with the regular job submission: if you have VPN access to your AWS network, then it is probably due to the lack of DNS on your AWS network. If your AWS machines can't “see” your local submission machine then yes, they will fail to report back. For the DNS issue, you can turn on `Use IP Address for PDGMQ`, which might help. Or grab today's Houdini build (18.0.446), which uses the new MQ server; it always uses the IP address and therefore should not error out.
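A quick way to confirm the DNS theory is to test name resolution from one of the AWS instances. socket.gaierror is the exception Python raises when `getaddrinfo` cannot resolve a name. The helper name below is made up for this sketch.

```python
import socket


def can_resolve(hostname):
    """Return True if hostname resolves to an address, False on a lookup failure."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Same exception type the instances reported in the regular mode.
        return False
```

If `can_resolve` returns False for your submitter's hostname but True for its IP address, the problem is name resolution and not basic connectivity.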
Edited by seelan - April 24, 2020 08:09:47
Member
68 posts
Joined: Jan. 2014
Thank you Seelan.

I'm carrying this conversation over to support, since we could exchange more information there.