XPBD implementation (C++)

Member
Joined: June 2018

Hi!

I've been working on implementing XPBD inside Houdini, and so far I've got a prototype up and running which looks promising. I'm using the inlinecpp Python module to compile and run the code.

Here is one of my latest tests:

https://youtu.be/xoi51a6bkkw [youtu.be]


 

I want to optimize this as much as possible, but I think I've run into some memory bandwidth issues (naive guess). 90% of the collision solving and constraints can be solved in parallel, but when I implement multithreading I see little to no performance gains.

As this is my first try at C++ and the HDK in general I'm probably doing something wrong.

 

The code itself is quite long but here is a small summary of what might be the issue:

 

First I fetch all the entities into a std::vector<entity> e; this is basically where I store each rigid body object. For each object I store 30+ attributes: position, velocity, mass, inertia tensor, etc. I can then easily iterate over the entities later in the code.
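For reference, the entity struct is basically just a plain container for those attributes; a rough sketch (only a few of the 30+ members shown, and the names/types are simplified):

// Minimal sketch of the entity container (illustrative; the real struct holds 30+
// attributes per rigid body).
struct entity
{
    UT_Vector3D p,  px;     // position and previous position
    UT_Vector3D v,  vx;     // linear velocity and previous linear velocity
    UT_Vector3D w,  wx;     // angular velocity and previous angular velocity
    double      mass;       // rigid body mass
    UT_Matrix3D inertia;    // inertia tensor
};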

// fetch point attributes
GA_RWHandleV3D p_h( gdp->findAttribute(GA_ATTRIB_POINT, "p"));
GA_RWHandleV3D px_h(gdp->findAttribute(GA_ATTRIB_POINT, "px"));
GA_RWHandleV3D v_h( gdp->findAttribute(GA_ATTRIB_POINT, "v"));
GA_RWHandleV3D vx_h(gdp->findAttribute(GA_ATTRIB_POINT, "vx"));
GA_RWHandleV3D w_h( gdp->findAttribute(GA_ATTRIB_POINT, "w"));
GA_RWHandleV3D wx_h(gdp->findAttribute(GA_ATTRIB_POINT, "wx"));

// create entities
const int entities = gdp->getNumPoints();

std::vector< entity > e;
e.resize(entities);
    
for (int i = 0; i < entities; i++){

	// handles are addressed by point offset, not point index
	GA_Offset ptoff = gdp->pointOffset(i);

	e[i].p  = p_h.get(ptoff);
	e[i].px = px_h.get(ptoff);
	e[i].v  = v_h.get(ptoff);
	e[i].vx = vx_h.get(ptoff);
	e[i].w  = w_h.get(ptoff);
	e[i].wx = wx_h.get(ptoff);

}

//
// XPBD SOLVER
//

// ... at the end of the program I save the updated point attributes
for (int i = 0; i < entities; i++){

	GA_Offset ptoff = gdp->pointOffset(i);

	p_h.set( ptoff, e[i].p);
	px_h.set(ptoff, e[i].px);
	v_h.set( ptoff, e[i].v);
	vx_h.set(ptoff, e[i].vx);
	w_h.set( ptoff, e[i].w);
	wx_h.set(ptoff, e[i].wx);

}
After fetching constraints and possible collisions we run the main XPBD loop. To get a stable simulation I run this loop 50 times per frame.
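For reference, the substep size h that gets passed into the solver functions further down is presumably just the frame time divided by the iteration count, and XPBD scales each constraint's compliance by 1/h²; this is from the XPBD paper, not my exact code:

// Sketch of how the substep size relates to the XPBD update (based on the XPBD
// paper, not the exact code). The frame length at 24 fps is an assumption.
double frame_dt   = 1.0 / 24.0;               // seconds per frame
int    iterations = 50;                       // substeps per frame
double h          = frame_dt / iterations;    // substep size passed to the solvers

// Per constraint, XPBD computes the Lagrange multiplier update as
//   delta_lambda = ( -C - (alpha / (h*h)) * lambda )
//                / (  grad_C * M_inv * grad_C^T + alpha / (h*h) )
// so a smaller h makes every constraint effectively stiffer and more stable.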

I store collision pairs in sections (batches) that can be run in parallel. Each pair contains the indices of the two entities I want to solve collisions for.
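To make that concrete, the batching only has to guarantee that no two pairs inside the same section touch the same entity. A minimal greedy sketch of that idea (not the actual code; collisionPairs is a placeholder name, and the ground index -1 is assumed safe to share because the ground is static):

// Greedy batching sketch (illustrative, needs <vector>, <set>, <utility>): put each
// collision pair into the first section where neither of its entities is already
// used, so all pairs within a section can be solved in parallel.
std::vector< std::vector< std::pair<int,int> > > sections;
std::vector< std::set<int> >                     used;   // entity indices per section

for (const auto &pair : collisionPairs)
{
    size_t s = 0;
    for (; s < sections.size(); ++s)
    {
        // -1 is the static ground and is assumed safe to appear in any section
        if (!used[s].count(pair.first) &&
            (pair.second == -1 || !used[s].count(pair.second)))
            break;
    }
    if (s == sections.size())
    {
        sections.emplace_back();
        used.emplace_back();
    }
    sections[s].push_back(pair);
    used[s].insert(pair.first);
    if (pair.second != -1)
        used[s].insert(pair.second);
}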

This is a quick summary of the code:

for (int ss = 0; ss<iterations; ++ss) {

    // update entity position
    for (int i=0; i<entities; i++){
        // apply gravity etc.
    }
    
    // solve collisions
    int collisionIdx = 0;
    for (int i = 0; i < (int)sections.size(); i++){
        for (const auto& pair : sections[i]) {
        
            entity &e0 = e[pair.first];
            entity &e1 = (pair.second == -1) ? e_ground : e[pair.second];
                
            collisions_solve( e0, e1, collisions, collisionIdx );
            collisionIdx += collisionsMax;
        }
    }

    // solve constraints (same setup as collisions but for constraints)
    

    // PBD velocity update
    for (int i=0; i<entities; i++){
        // Update the linear velocity
    }
    
    // The velocity solver - we run this additional solver for every collision that we found
    collisionIdx = 0;
    for (int i = 0; i < (int)sections.size(); i++){
        for (const auto& pair : sections[i]) {
        
            entity &e0 = e[pair.first];
            entity &e1 = (pair.second == -1) ? e_ground : e[pair.second];
                
            velocity_solver( e0, e1, collisions, collisionIdx );
            collisionIdx += collisionsMax;
        }
    }
}

 

Here is a longer version of the collision solver. For the multithreading I store two lists, e0 and e1, which contain the collision pairs that can be computed in parallel:

int collisionIdx = 0;
for (int i = 0; i < sections.size(); ++i) {

    // small sections are not worth the threading overhead, solve them single threaded
    if( sections[i].size() < 100 ) {

        // for every collision pair
        for (const auto& pair : sections[i]) {

            entity &e0 = e[pair.first];
            entity &e1 = (pair.second == -1) ? e_ground : e[pair.second];
            
            collisions_solve( e0, e1, collisions, collisionsMax, collisionIdx, h);

            collisionIdx += collisionsMax;
        }
        
    } else {
    
        vector<int> e0;
        vector<int> e1;
        
        for (const auto& pair : sections[i]) {
            e0.push_back(pair.first);
            e1.push_back(pair.second);
        }
        
        threads.collision_solve( e, e0, e1, e_ground, collisions, collisionsMax, collisionIdx, h );
        
        collisionIdx += collisionsMax * sections[i].size();
        
    }

}

 

Finally, here is the threaded collision solve function. _e0 and _e1 hold the lists of entities that we can solve in parallel:

class threadedObject
{
public:
    THREADED_METHOD8(                   // Constructs an eight parameter threaded method (from UT/UT_ThreadedAlgorithm.h)
                    threadedObject,     // Name of class
                    true,               // Evaluated to see if we should multithread.
                    collision_solve,    // Name of function
                    
                    vector<entity> &, e,
                    vector<int>, _e0,
                    vector<int>, _e1,
                    entity, e_ground,
                    vector< Collision_Constraint > &, collisions,
                    int, collisionsMax,
                    int, collisionIdx,
                    double, h
                    
                    )          
                    
    void collision_solvePartial(vector<entity> &e, vector<int> _e0, vector<int> _e1, entity e_ground, vector< Collision_Constraint > &collisions, int collisionsMax, int collisionIdx, double h, const UT_JobInfo &info)
    {
        int i, n;
        for (info.divideWork( _e0.size() , i, n); i < n; i++)
        {
        
            int collision_idx = collisionIdx + (i*collisionsMax);
            
            entity &e0 = e[ _e0[i] ];
            entity &e1 = (_e1[i] == -1) ? e_ground : e[ _e1[i] ];
            
            collisions_solve( e0, e1, collisions, collisionsMax, collision_idx, h);
            
        }
    }
};
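One small thing that might reduce the threading overhead (just a guess, untested): every by-value parameter above gets copied for each job, so the two index vectors could be declared as const references instead, for example:

class threadedObject
{
public:
    // Same declaration as above, but with the index lists passed as const references
    // so each job does not receive its own copy (sketch, untested).
    THREADED_METHOD8(
                    threadedObject,
                    true,
                    collision_solve,

                    vector<entity> &, e,
                    const vector<int> &, _e0,
                    const vector<int> &, _e1,
                    entity, e_ground,                 // kept by value so the non-const reference e1 in the body can still bind to it
                    vector< Collision_Constraint > &, collisions,
                    int, collisionsMax,
                    int, collisionIdx,
                    double, h

                    )

    // body unchanged from above, only the parameter types differ
    void collision_solvePartial(vector<entity> &e, const vector<int> &_e0, const vector<int> &_e1, entity e_ground, vector< Collision_Constraint > &collisions, int collisionsMax, int collisionIdx, double h, const UT_JobInfo &info);
};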

 

Should I do this in another way?

Instead of building the entities list I could fetch the values only when they are needed. That would eliminate the whole entities list, possibly make it easier to feed the code into a threaded function, and maybe make it more thread safe since Houdini does all the attribute handling.

Here I have a custom struct, struct_handles, that contains all the point attribute handles.

 

Example:

void collisions_solve( int e0, int e1, struct_handles &handles, vector< Collision_Constraint > &collisions, int collision_idx )
{
    // fetch attributes
    UT_Vector3D p0 = handles.p_handle.get(e0);
    UT_Vector3D p1 = handles.p_handle.get(e1);
    
    // do calculations
    // save collision
    
    // store attributes
    handles.p_handle.set(e0, p0);
    handles.p_handle.set(e1, p1);
}



int i, n;
for (info.divideWork( _e0.size() , i, n); i < n; i++)
{
    // fetch the point indices for this collision pair
    int e0 = _e0[i];
    int e1 = _e1[i];

    // solve collision
    collisions_solve( e0, e1, handles, collisions, collision_idx );
}

 

This goes beyond my limited HDK/C++ knowledge, so any help or insight on the subject is highly appreciated!

Sorry for the inconsistent use of brackets.

 

 

All credit goes to Felipe Kersting for the original codebase. I had to implement some custom constraints myself and of course make a lot of adjustments to get it working in Houdini, but 90% of the code is copied from this repository:

https://github.com/felipeek/raw-physics [github.com]

Member
Joined: May 2012
I haven't looked at your code too closely, but when multithreading be careful of data races: if different threads access the same memory at the same time and at least one of them writes, you get undefined behavior.
With the e0 and e1 arrays, if several threads end up touching the same entities through them (even though the index arrays themselves never change), some bad stuff might happen.
If you need to share values between threads, ring buffers are the way to go.
If you already took care of all that, sorry - like I said, I didn't have time to take a closer look at the code.
Interesting project anyway :) good luck!
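For reference, a minimal sketch of the kind of single-producer/single-consumer ring buffer meant here (illustrative only; the fixed capacity, the names and the memory-ordering choices are all assumptions):

// Minimal single-producer/single-consumer ring buffer sketch: one thread pushes,
// another pops; the fixed power-of-two capacity and the atomic indices avoid locking.
#include <array>
#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class RingBuffer
{
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T &value)                    // called by the producer thread only
    {
        const std::size_t head = m_head.load(std::memory_order_relaxed);
        const std::size_t tail = m_tail.load(std::memory_order_acquire);
        if (head - tail == N) return false;      // buffer full
        m_data[head & (N - 1)] = value;
        m_head.store(head + 1, std::memory_order_release);
        return true;
    }
    bool pop(T &value)                           // called by the consumer thread only
    {
        const std::size_t tail = m_tail.load(std::memory_order_relaxed);
        const std::size_t head = m_head.load(std::memory_order_acquire);
        if (head == tail) return false;          // buffer empty
        value = m_data[tail & (N - 1)];
        m_tail.store(tail + 1, std::memory_order_release);
        return true;
    }
private:
    std::array<T, N>          m_data{};
    std::atomic<std::size_t>  m_head{0};         // next slot to write
    std::atomic<std::size_t>  m_tail{0};         // next slot to read
};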
Member
Joined: June 2018
Thanks! I hadn't heard about ring buffers, I'll check them out.

I think I'm more interested in how I should handle a large amount of rigid bodies in general. Right now I'm gathering all the attributes in a vector<entity>; each entity is then populated with the corresponding point attributes (30+ attributes: mass, velocity, etc.).


Instead of using the vector<entity> I could just query the GA_RWHandleV3D handles whenever I need an attribute; maybe that's better?
GA_RWHandleV3D p_handle( gdp->findAttribute(GA_ATTRIB_POINT, "p"));
UT_Vector3D p = p_handle.get(0);
and then update the point attribute:
p_handle.set(0, p);

I will be getting and setting the GA_RWHandles millions of times; is that better than dealing with a vector<entity> e that contains all the necessary point attributes for each point?
Member
Joined: June 2018
Update:

It seems like a std::vector<> is thread safe as long as each thread only touches its own elements.
Working with GA_RWHandleV3D is not necessarily thread safe.

You have to use the GA_PageIterator to divide the point attributes into splittable chunks; you can then safely assign a thread to work on that memory page with the GA_RWHandleV3D:

GA_RWHandleV3D p_handle( gdp->findAttribute(GA_ATTRIB_POINT, "P"));
GA_SplittableRange range( gdp->getPointRange() );

// this runs inside the *Partial method, which receives the UT_JobInfo
for (GA_PageIterator pit = range.beginPages(info); !pit.atEnd(); ++pit) {

	GA_Offset start, end;

	for (GA_Iterator it(pit.begin()); it.blockAdvance(start, end);) {

		for (GA_Offset ptoff = start; ptoff < end; ++ptoff) {
			UT_Vector3D p = p_handle.get(ptoff);
			// do work for each point
			p_handle.set(ptoff, p);
		}
	}
}


The reason (most likely) why my code isn't running any faster is that the thread overhead is larger than the cost of doing the tasks single threaded. I got some improvement when I divided the tasks into larger chunks of entities to be processed, instead of handing one entity to each thread until all the tasks are done, but it's not worth it when the task count is too small.

Here is the threaded code that splits vector<entity> e into chunks; I divide e so that each thread gets one chunk:
int tasks = e.size();
int chunks = nthreads;
        
int avgTasksPerGroup = tasks / chunks;
int remainingTasks = tasks % chunks;

int i, n;
for (info.divideWork( chunks , i, n); i < n; i++)
{
        
	int taskStart = i*avgTasksPerGroup;
	int taskEnd = (i+1)*avgTasksPerGroup;
	if( i == chunks-1 ) taskEnd += remainingTasks;

	for(int x=taskStart; x<taskEnd;x++) {
		// do work on e[x] here
	}
}

Most of the code I want to iterate over is less than 200 lines and not that complex. I found that around 500 tasks is the threshold where switching over to multithreading makes the threaded method faster. This is not ideal for my case, as most batches have fewer than 300 tasks in a typical tank simulation.
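That threshold could probably go straight into the second argument of THREADED_METHOD8, since that is the expression that is evaluated to decide whether to multithread at all (just an idea, untested):

// Sketch (untested): let the macro fall back to the single threaded path whenever
// the batch is smaller than the ~500 task threshold measured above.
THREADED_METHOD8(
                threadedObject,
                _e0.size() >= 500,   // only multithread for large enough batches
                collision_solve,

                vector<entity> &, e,
                vector<int>, _e0,
                vector<int>, _e1,
                entity, e_ground,
                vector< Collision_Constraint > &, collisions,
                int, collisionsMax,
                int, collisionIdx,
                double, h

                )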

Also, I think the main bottleneck in my program is that I have to rebuild the vector<entity> e every frame (this vector becomes quite big). It would be much better if I could build it once when the simulation starts and then keep the internal data for the next frame.
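One possible direction (purely a sketch, assuming the solver state can be kept alive between frames): cache the entity vector and only resize it when the point count changes, refreshing attributes instead of rebuilding everything from scratch. SolverCache and sync are made-up names:

// Sketch (assumption: the solver state survives between frames): keep the entity
// vector cached, reallocate only on topology changes, and refresh the attributes
// that Houdini may have modified upstream.
struct SolverCache
{
    std::vector<entity> e;

    void sync(const GU_Detail *gdp, const GA_ROHandleV3D &p_h /*, ...other handles */)
    {
        const GA_Size npts = gdp->getNumPoints();
        if ((GA_Size)e.size() != npts)
            e.resize(npts);                       // reallocate only when the point count changes

        for (GA_Size i = 0; i < npts; ++i)
        {
            GA_Offset ptoff = gdp->pointOffset(i);
            e[i].p = p_h.get(ptoff);              // refresh only what changed upstream
            // ... other attributes
        }
    }
};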

I have some other questions too, but I think this thread is becoming too long to read and not specific enough, so I'll ask those questions in a new thread.

Thanks!