[Sidefx-houdini-list] Distributed sims failing
Friday, 3 February 2017
Hi Gary,

Sorry for the late reply.

I tried out the splash tank .hip file on our farm here and was able to simulate through all 120 frames across 5 slices/machines. So I don't think the errors are specific to the scene or simulation setup.

Looking at the errors from the job output, it looks like general networking issues. It's as if communication between the client machines is disrupted, causing a breakdown.

I noticed from the diagnostics files that you are using MacOS for your client machines, so I wonder if the networking issue is specific to Mac. For example, we have a couple of users here on MacBook Pros using El Capitan and Sierra and they complain that their Macs disconnect from the wifi for inexplicable reasons at least once or twice a day. When I tested the splash tank scene I used an all Linux farm since I didn't have enough dedicated Mac machines.

Maybe try simplifying the setup by reducing the number of slices/machines and reducing the number of frames? Perhaps only one of the machines has a networking issue which is causing everything to fail. You can submit the sim job to different machines to see if a certain combination succeeds or fails.

Cheers,
Rob

On 2017-01-31 12:46 PM, Gary Jaeger wrote:
Thanks Antoine-

Just checked and they all have at least 600GB avail on their boot drives.

Gary Jaeger / 650.728.7957 direct / 415.518.1419 mobile
http://corestudio.com

> On Jan 31, 2017, at 9:14 AM, Antoine Durr <antoinedurr at gmail.com> wrote:
>
> It sounds like you're running out of disk space on one of the machines used for the distributed sim.
>
> - Antoine
>
>> On Jan 31, 2017, at 7:11 AM, Gary Jaeger <gary at corestudio.com> wrote:
>>
>> Hi Gary,
>>
>> If I were to guess it almost looks like a general network disruption. I'm basing that on the "Write pipe error" messages. Perhaps a disruption occurs while a message is passed from one machine to another so the message is incomplete?
>>
>> Anyway, would you be able to post the job output and diagnostics files for one of the failed slice jobs? And also the .hip file?
>>
>> I can give it a whirl here and see if it's an issue with HQueue or distributed sims.
>>
>> Cheers,
>> Rob
>>
>> On 2017-01-29 7:08 PM, Gary Jaeger wrote:
>>> A quick follow up. I watched one of the machines on the farm doing a sim - I
>>> just did a quick splash tank to make sure it wasn't my scene. Early on the
>>> CPUs are pegged and everything moves along. RAM is not even close to being
>>> an issue, the process is using up about 2GB. Looks like maybe it's the
>>> hython process? Anyway, at some point the CPU usage just drops to nothing.
>>> The process is still alive, but doesn't seem to be doing anything.
>>>
>>> On Sun, Jan 29, 2017 at 12:30 PM, Gary Jaeger <gary at corestudio.com> wrote:
>>>
>>>> Anybody have any insight into this? I have a flip sim that I want to
>>>> distribute. I'm pretty sure it's all set up correctly, because the sim
>>>> starts and all the slices get part of the way through, but always end up
>>>> failing.
>>>>
>>>> I've tried both slice and slice along. When I try a slice along, the job
>>>> has been getting about 14% through, then just hanging up. No error
>>>> messages, etc. It just never progresses. I've also tried slice and chopping
>>>> the sim into 4 quadrants. In that case I was seeing things like this:
>>>>
>>>>
>>>> ALF_PROGRESS 27%
>>>> ALF_PROGRESS 28%
>>>> Read error on ack: Error Occurred of 12
>>>> Error occurred in message 12 state is 5
>>>> ---- Pump enters error status ----
>>>> Tracker reports an error, aborting
>>>>
>>>>
>>>> ALF_PROGRESS 27%
>>>> ALF_PROGRESS 28%
>>>> Tracker reports an error, aborting
>>>> Write pipe error: Error Occurred offset 0 of 4
>>>> Error occurred in message 4 state is 9
>>>> EOF in pipe at position 0 of 4
>>>> Error occurred in message 4 state is 6
>>>>
>>>> --
>>>>
>>>> Though the tracker task isn't reporting any errors in hqueue that I can
>>>> see.
>>>>
>>>> Any ideas?
>>>>
>>>>
>>>>
>>>> --
>>>> Gary Jaeger // Core Studio
>>>> 249 Princeton Avenue
>>>> Half Moon Bay, CA 94019
>>>> 650.728.7957 (direct) / 650.728.7060 (main)
>>>> http://corestudio.com
>>>>
>>>
>> _______________________________________________
>> Sidefx-houdini-list mailing list
>> Sidefx-houdini-list at sidefx.com
>> https://lists.sidefx.com:443/mailman/listinfo/sidefx-houdini-list
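
For anyone comparing job output across slices while narrowing down which machine fails first, a rough sketch like the one below may help. It is only an illustration, not part of HQueue or Houdini: the function name `summarize_slice_log` is hypothetical, and the error strings are copied from the job output quoted in this thread, so other failures may use wording it does not catch. It reports the last ALF_PROGRESS percentage seen before the first error line in a slice's output.

```python
import re

# Error signatures copied from the job output quoted in this thread.
# Assumption: other HQueue/tracker failures may be worded differently.
ERROR_PATTERNS = [
    r"Read error on ack",
    r"Write pipe error",
    r"EOF in pipe",
    r"Pump enters error status",
    r"Tracker reports an error",
]
PROGRESS_RE = re.compile(r"ALF_PROGRESS\s+(\d+)%")
ERROR_RE = re.compile("|".join(ERROR_PATTERNS))


def summarize_slice_log(text):
    """Return (last progress % before the first error, first error line).

    Hypothetical helper for reading one slice's job output.
    Either element is None if it never appears in the text.
    """
    last_progress = None
    for line in text.splitlines():
        match = PROGRESS_RE.search(line)
        if match:
            last_progress = int(match.group(1))
            continue
        if ERROR_RE.search(line):
            return last_progress, line.strip()
    return last_progress, None


if __name__ == "__main__":
    sample = (
        "ALF_PROGRESS 27%\n"
        "ALF_PROGRESS 28%\n"
        "Tracker reports an error, aborting\n"
        "Write pipe error: Error Occurred offset 0 of 4\n"
    )
    print(summarize_slice_log(sample))
```

Running it over the saved output of each slice job and comparing which slice hits a "Read error" or "Write pipe error" first is one way to guess which machine dropped the connection, under the assumption that the other slices only abort in response.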