Since: 20.5
Overview ¶
This is a general-purpose ML training node to apply to regression problems. In the context of deep learning, regression means training a feedforward neural network (a model) so it can closely approximate a continuous function from a fixed number of input variables to a fixed number of output variables. In Houdini, this function may be provided in the form of a procedural network. See Machine Learning documentation for more general information.
For example, regression ML may learn how to realistically deform a character based on its rig pose to improve over linear blend skinning. In this case, you use ML to learn the function that maps a rig pose to a skin deformation. See the ML Train Deformer recipe. Another example is to learn how to upres a pyro simulation. See the ML Train Volume Upres recipe.
ML Train Regression trains a model given a data set consisting of labeled examples, see ML Example. Each labeled example is a pair consisting of an input component and a corresponding target component. For regression, both the input component and the target component consist of continuous variables. These may include point coordinates, colors, or PCA components (among other types of values). The term regression is used because each target component consists of continuous variables, as opposed to the discrete ones used in classification. Labeled examples are the basis for training. In the specific case of the ML Deformer, each input component is a (serialized) pose and the corresponding target component is the (PCA-compressed) skin deformation obtained by a flesh simulation. In the case of the ML Volume Upres, both inputs and outputs are 3D volume tiles (voxel grids).
ML Train Regression trains a model that predicts a target component given some user-specified input component. The goal is to have a well-fitted model that generalizes well: it should produce useful predictions for unseen inputs, that is, inputs that were not included in the data set used for training. ML Train Regression provides several methods, called regularization methods, that can help ensure a trained model performs well on unseen inputs.
The trained model produced by ML Train Regression can be exported to the ONNX format. It can also be saved as a PyTorch model; however, Houdini has no built-in nodes that can use this saved model. The easiest way to bring a model produced by ML Train Regression into Houdini is the ML Regression Inference tool.
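Outside Houdini, an exported ONNX model can be loaded with any ONNX runtime. The following is only a minimal sketch using the onnxruntime Python package; the file name and input width are hypothetical and must match what your training actually produced.

    import numpy as np
    import onnxruntime as ort  # assumption: onnxruntime is installed

    # Hypothetical path; use the file written to Models Folder / Model Base Name.
    session = ort.InferenceSession("model.onnx")
    input_name = session.get_inputs()[0].name

    # One flattened input component (for example, a serialized pose) as float32.
    # num_inputs is hypothetical and must match the model's input width.
    num_inputs = 54
    example_input = np.zeros((1, num_inputs), dtype=np.float32)

    prediction = session.run(None, {input_name: example_input})[0]
    print(prediction.shape)  # (1, number of target variables)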
ML Train Regression creates a feedforward neural network. The number of input units of this network must correspond to the number of variables in each input component of the data set. The number of output units of the network must correspond to the number of variables in each target component of the data set. When you use a standard kind of model, ML Train Regression automatically figures out the number of inputs and outputs of the network using the dimensions stored in the provided raw data set.
In between the input layer and the output layer, there may be several hidden layers. The number of hidden layers and their width can be controlled by the user. The behavior of the network is controlled by a set of parameters. For linear layers, these parameters may include weights (scaling factors) and biases (additive constants). The training aims to optimize these parameters for the regression task.
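As a rough illustration of the kind of network described above, a small feedforward network with two hidden layers could be sketched in PyTorch as follows. The layer widths and activation type are hypothetical; ML Train Regression determines the input and output sizes from the data set and builds the network internally.

    import torch

    # Hypothetical sizes; the node derives the real ones from the data set.
    num_inputs, hidden_width, num_outputs = 54, 128, 32

    model = torch.nn.Sequential(
        torch.nn.Linear(num_inputs, hidden_width),    # linear layer: weights + biases
        torch.nn.Tanh(),                              # activation layer
        torch.nn.Linear(hidden_width, hidden_width),  # hidden layer
        torch.nn.Tanh(),
        torch.nn.Linear(hidden_width, num_outputs),   # output layer
    )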
ML Train Regression performs multiple passes through the training portion of the data set. Each such pass is called an epoch.
The training process may be repeated multiple times, each time with different settings on the training node. Such training settings are commonly referred to as hyperparameters. Common examples of hyperparameters are settings such as the learning rate, the number of hidden layers of the network, and the width of the network.
The Wedge TOP allows the training process to be repeated with varying hyperparameters. For each setting in a wedge, the resulting model can be saved to a separate file by adding a TOP attribute name at the end of the Model Base Name.
ML Train Regression internally splits the data set provided by the user into two parts: a training portion and a validation portion. The relative sizes of these portions can be controlled by the user. The training portion is used to optimize the network’s parameters during each training epoch. The validation portion is used to verify the network’s performance on data that is not used for the training.
Underfitting occurs when a model is not sufficiently complex to fit the training data well. When this occurs, the model cannot be expected to produce accurate results on test data either. In that case, a more complex network with more parameters (higher capacity) may be needed. This may be achieved, for example, by increasing the number of hidden layers or increasing the width of each hidden layer.
Overfitting occurs when a model produces outputs that are accurate for inputs in the training data but inaccurate for unseen inputs. To help reduce overfitting, the ML Train Regression TOP supports two simple regularization strategies: early stopping and weight decay. Early stopping checks whether the validation loss (the loss on the validation set) stops decreasing. If this is the case, then the training terminates. Weight decay adds to the loss function a term that penalizes large weights.
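To illustrate these two ideas in plain PyTorch terms, consider the sketch below. It is illustrative only, not the node’s internal implementation; the toy model, random data, and patience value are placeholders.

    import torch

    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(8, 32), torch.nn.Tanh(), torch.nn.Linear(32, 4)
    )
    x_train, y_train = torch.randn(256, 8), torch.randn(256, 4)
    x_val, y_val = torch.randn(64, 8), torch.randn(64, 4)

    # Weight decay is passed to the optimizer and penalizes large parameter values.
    # (Unlike the node, this simple sketch does not exclude biases from the penalty.)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

    best_val, patience, bad_evals = float("inf"), 5, 0
    for epoch in range(200):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x_train), y_train)
        loss.backward()
        optimizer.step()

        with torch.no_grad():
            val = torch.nn.functional.mse_loss(model(x_val), y_val).item()
        if val < best_val:
            best_val, bad_evals = val, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:  # early stopping: validation loss stopped improving
                break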
Partial Training ¶
Instead of training a model all at once, a model can be partially trained bit by bit, over multiple passes, where each training pass performs only a limited number of epochs. This allows the TOP network to perform other tasks in between training passes. For example, fresh training data can be generated before each partial training pass, which may be applicable when the data set is generated within Houdini through a procedural network. Training on a stream of fresh data, rather than repeatedly revisiting points in a fixed-size data set, may produce a model that generalizes better to unseen data. Another advantage is that the amount of training data does not have to be decided upfront.
Partial training can also be a useful way of limiting the amount of work/time performed by a single cook of ML Train Regression, while allowing the training to be resumed from where it stopped at a later time. This may be relevant when training is performed on a farm, for example.
Note that starting in Houdini 21, partial training is no longer required in order to periodically save and/or export partially trained models. There are now separate options that allow you to do this without partial training.
ML Train Regression can be set up for partial training:

-  Turn on Enable Partial Training.
-  Place ML Train Regression inside a feedback block, see Feedback Begin.
-  Set Create Iterations From on Feedback Begin to Custom Count.
-  Set Iterations to the total number of partial training passes you want to perform.
-  Set Max Epochs per Cook on ML Train Regression to control the maximum number of epochs performed for each training pass. For example, you could set Max Epochs per Cook such that each partial training pass takes some approximate amount of time. The appropriate value for Max Epochs per Cook depends on the size of the data set.
-  (Optional) To output and retain partially trained ONNX models, you can specify on the ML Train Regression node, in the Files tab, a Model Base Name that includes an attribute such as @loopiter.
-  (Optional) To generate fresh training data before each partial training pass, you can put a ROP Fetch that triggers the generation of a new data set in front of ML Train Regression inside the feedback loop. You should include the attribute @loopiter in the file name of the data set.
File Access ¶
ML Train Regression accesses a variety of files. Each file name can be optionally modified by inserting one or more TOP attributes into the base name of a file, for example @wedgeindex or @loopiter.
This enables various types of TOP ML training setups such as:

-  Training for varying hyperparameters with the use of a Wedge TOP.
-  Generating additional training data on demand in a TOP feedback loop.
The files ML Train Regression accesses fall into three categories: read-only, write-only (or append-only), and read-write (both read from and written to).
In the read-only category, there are (custom network) hyperparameters and the data set. The Hyperparameters tab allows you to provide a dictionary that can be accessed from custom network snippets, for example to control the neural network structure. The Input tab allows you to specify the source of training data (and validation data if early stopping is on).
In the write-only category, there are trained (or partially trained) models, either saved or exported to the ONNX format, with file names determined by Models Folder and Model Base Name. These can be found under the Output tab. Diagnostic and progress information is controlled by the Log tab. If a log file already exists, log information is appended to it. This allows a single log file to be created even when a single training is broken up into parts by invoking ML Train Regression multiple times, for example in a TOP feedback loop.
In the read-write category, there are parameters under the State tab that specify the current state of training. These are used when partial training is on. The training state is read at the start of a single invocation of ML Train Regression and written at the end. The training state allows ML Train Regression to resume training where it left off in a previous invocation, and is only valid when the state name is the same as in the previous invocation. The training state can act like a checkpoint, allowing an incomplete training to resume.
The controls under the Save Frequency and Export Frequency tabs allow you to control how often models are saved and/or exported. Saving and exporting can happen multiple times during one cook of ML Train Regression, even when partial training is turned off.
Re-Training from Scratch ¶
ML Train Regression trains from scratch only if its expected output files are not found on disk. This behavior enables partial training setups. The expected output files include the exported ONNX model, training state, and training log. If these files exist, ML Train Regression will not re-train from scratch.
To make ML Train Regression re-train, remove the output files from disk. You can remove files through the TOPs UI:
-  Right-click the ML Train Regression node and select Delete this Node’s Output Files from Disk.
-  For a single work item, right-click the work item and select Delete File Outputs from Disk.
Neural Networks ¶
ML Train Regression allows you to easily use a standard kind of neural network, such as an MLP, without any coding. However, you can also program your own custom network structure if you want to. This requires writing one or more snippets consisting of a few lines of Python code, making use of the PyTorch library.
An MLP consists of several fully connected layers. In theory, you could always use an MLP for your training. However, this is not always practical. For larger input and output sizes, your MLP model may end up having a very large capacity. Such a large model may require a very large number of training examples to achieve good generalization. Also, large models may exceed the amount of memory available on your machine.
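A rough way to see why: in a fully connected network, a linear layer with n_in inputs and n_out outputs has n_in × n_out weights plus n_out biases, so the parameter count grows quickly with the input and output sizes. A small, illustrative calculation (the layer widths below are hypothetical):

    # Parameter count of a fully connected (MLP) network: each linear layer with
    # n_in inputs and n_out outputs contributes n_in * n_out weights + n_out biases.
    def mlp_parameter_count(widths):
        return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

    # A pose-sized problem stays small...
    print(mlp_parameter_count([54, 128, 128, 32]))          # 27,680 parameters
    # ...while volume-sized inputs and outputs (hypothetical 32^3-voxel tiles) explode:
    print(mlp_parameter_count([32**3, 4096, 4096, 32**3]))  # 285,253,632 parameters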
Custom Neural Networks ¶
When a standard model or a standard loss does not suffice, you can create your own model or loss function. This applies when Network Composition is set to Partitioned. However, to allow even greater flexibility, you can switch Network Composition to Unified. In this case, you have full freedom to create a custom unified network that takes an input component and a target component and outputs a scalar value.
You can create your own model, loss function, or unified network using PyTorch network modules by writing Python snippets. Each such snippet parameter has a default, which can be useful as a reference.
Each Python snippet may access a kwargs dictionary and must set a dictionary called result. You can write snippets as if the torch library has already been imported. See the PyTorch documentation for more information about creating neural networks using PyTorch modules.
The kwargs dictionary stores the following values: a tuple of input shapes (input_shapes), a tuple of target shapes (target_shapes), and a dictionary of (custom network) hyperparameters (hyperparameters). In a Python snippet, these values could be extracted and put in separate variables like this:
    input_shapes = kwargs["input_shapes"]
    target_shapes = kwargs["target_shapes"]
    hyperparameters = kwargs["hyperparameters"]
Both input_shapes and target_shapes are determined by the data set. In Houdini 21, the input components and target components are always flattened and concatenated, so input_shapes and target_shapes each contain only a single shape, and this single shape has only a single dimension.
In this flattened, concatenated representation, the total number of (scalar) elements of an input component is always input_shapes[0][0]. Similarly, the total number of (scalar) elements of a target component is always target_shapes[0][0].
The dictionary hyperparameters is loaded from a JSON file. You can specify this file’s location using the Hyperparameter Folder and Hyperparameter Base Name parameters. The contents of this hyperparameter dictionary can control the structure of the custom network.
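For example, the hyperparameter file might contain the JSON shown in the comment below (the key names are hypothetical), and a custom network snippet could read those values through kwargs:

    # Hypothetical contents of <Hyperparameter Folder>/<Hyperparameter Base Name>.json:
    #
    #   {"hidden_width": 256, "num_hidden_layers": 3}
    #
    # Inside a custom network snippet, these values arrive via the kwargs dictionary:
    hyperparameters = kwargs["hyperparameters"]
    hidden_width = hyperparameters.get("hidden_width", 128)
    num_hidden_layers = hyperparameters.get("num_hidden_layers", 2)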
A snippet that defines a custom network must always set the result dictionary. This dictionary must assign values to both module_class and module_kwargs. The value associated with module_class must be a Python class object, for a class that is a subclass of torch.nn.Module. For example, in a Python snippet, setting the result could look like:
    class MyModule(torch.nn.Module):
        ...

    my_module_kwargs = ...

    result = {"module_class": MyModule, "module_kwargs": my_module_kwargs}
Here my_module_kwargs should be valid keyword arguments for the construction of MyModule.
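Putting the pieces together, a fuller sketch of a custom Model Module snippet could look like the following, assuming Network Composition is set to Partitioned. The hidden width and the "hidden_width" hyperparameter key are hypothetical; torch, kwargs, and result are provided by the snippet environment as described above.

    # Sketch of a custom Model Module snippet.
    input_size = kwargs["input_shapes"][0][0]    # flattened input component size
    target_size = kwargs["target_shapes"][0][0]  # flattened target component size
    hidden = kwargs["hyperparameters"].get("hidden_width", 128)

    class MyModule(torch.nn.Module):
        def __init__(self, input_size, target_size, hidden):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(input_size, hidden),
                torch.nn.Tanh(),
                torch.nn.Linear(hidden, target_size),
            )

        def forward(self, x):
            return self.net(x)

    result = {
        "module_class": MyModule,
        "module_kwargs": {
            "input_size": input_size,
            "target_size": target_size,
            "hidden": hidden,
        },
    }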
Limitations ¶
ML Train Regression and the ML Example family of supervised ML nodes in SOP and ROP provide an easy starting point for experimentation with supervised ML in Houdini. The applicability of ML Train Regression can be extended through the script options that let you create your own model networks and loss functions. The Unified option on Network Composition stretches this flexibility even further, so ML Train Regression can be applied to many more ML use cases.
Even with customizations, you may find that ML Train Regression is not general enough for your use case. In that case, you can extract and copy the underlying scripts of this node, located in $HHP/hutil/ml/regression, and use these as a starting point for creating a custom training node. You may still be able to use many of the other tools of the supervised ML toolkit that exist in SOP and ROP to prepare your data set, such as the ML Example operator family.
Parameters ¶
Network Composition
How to compose the neural network that will be used for training.
Partitioned
The neural network is composed of separate parts: a model, a loss function, and a regularizer. The model takes only the input component from each example (batch), returning the model output. The loss function takes both the model output (batch) and the target component from an example (batch). The regularizer takes only the model output (batch), but also has access to the model parameters.
Unified
The neural network accepts the input and target components of an example (batch) and outputs a loss (batch). This option offers greater flexibility than having separated network parts for the model, the loss function and the regularizer.
Model Kind
Either a standard model (e.g., an MLP) or a user-defined model.
Standard
A standard kind of model network.
Custom
A user-defined kind of model network.
Standard Type
The type of standard model network.
MLP
Creates a model network consisting of fully connected linear layers alternated by activation layers. This network architecture is commonly referred to as a multilayer perceptron (MLP).
Model Module
A python script that describes how a model should be created. See Custom Network Structures for details.
Loss Kind
Either a standard loss (e.g., Mean Squared Error) or a user-defined loss.
Standard
A standard kind of loss network.
Custom
A user-defined kind of loss network.
Standard Loss Type
The specific standard loss function. Currently supports only Mean Squared Error (MSE).
Loss Module
A python script that describes how a loss function should be created. See Custom Network Structures for details. This is expected to be a loss that is an average over the examples in a batch, not a sum.
Weight Decay
The higher this value, the more the training session will try to keep the weights small (only the weights, not the biases). This is a very basic method for preventing overfitting.
Random Seed
The random seed used to initialize neural network parameters. Different seeds result in different models with varying accuracies. This hyperparameter is suitable for wedging.
Unified Module
A python script that describes how a model should be created. See Custom Network Structures for details.
Shuffle
When on (recommended), reorders the data set elements randomly before use. This ensures the validation set consists of a random sample of the data set.
Shuffle Random Seed
The random seed used to shuffle the examples in the data set.
Limit Size
When on, only an initial part of the data set is preserved; the remaining data is deleted. This step takes place right after shuffling, before any of the data is used for training. This option is useful for finding out how the generalization error of the trained model depends on the data size, for example with the use of a Wedge TOP. The resulting curve may indicate whether more data would be beneficial to improve generalization.
Upper Limit
Specifies an upper limit on the number of data set elements that are preserved. The size of the remaining data set is the minimum of the original data set size and this limit.
Enable Partial Training
Enforce a hard upper limit on the number of epochs per work item during training. This is useful when ML Train Regression is used in a feedback loop in TOPs. During each iteration of the feedback, the number of epochs can be limited using this parameter.
Max Epochs (for a single cook)
Hard upper limit on the number of epochs for one single cook of ML Train Regression. On the next cook, training resumes at the next epoch.
Enable Validation
When on, splits the data set into training and validation parts, controlled by Training Data Proportion and Validation Data Proportion. The validation part is evaluated every Epochs per Validation epochs.
Training Data Proportion
Applies only when Enable Validation is on. The proportion of the data set used to train the model.
Validation Data Proportion
Applies only when Enable Validation is on. The proportion of the data set used to validate the performance of the model. The validation set consists of a contiguous range of elements at the end of the data set. It is recommended to turn on the Shuffle option; otherwise, the validation set will generally not consist of a random sample of the entire data set.
Epochs per Validation
The number of epochs that are trained before each validation loss evaluation.
Maximum Epochs Total (over all cooks)
Maximum number of epochs trained over all cooks of ML Train Regression. Training terminates when this number of epochs is reached. This is most useful when partial training is on.
Enable Early Stopping
When on, terminates training when performance of the model on the validation set stops improving.
Patience
The number of times the validation loss is evaluated without finding an improvement over the current best validation loss before training stops.
Note
This parameter is expressed as number of evaluations, not epochs. See the Epochs per Validation parameter to see how many epochs are trained between validations.
Algorithm
Determines which optimization algorithm to use during the training process.
Adam
Uses the Adam algorithm, see pytorch documentation for more detail.
Adadelta
Uses the ADADELTA algorithm, see pytorch documentation for more detail.
SGD
Uses the Stochastic Gradient Descent algorithm, see pytorch documentation for more detail.
ASGD
Uses the Averaged Stochastic Gradient Descent algorithm, see pytorch documentation for more detail.
Learning Rate
Controls step size during training. Larger steps mean larger parameter updates. Smaller rates may take longer but help avoid skipping locally optimal solutions.
Beta1
This coefficient is specific to the Adam optimization algorithm. See Pytorch documentation.
Beta2
This coefficient is specific to the Adam optimization algorithm. See Pytorch documentation.
Rho
When Algorithm is set to Adadelta, determines the rho value used by the optimization algorithm. A higher value results in a faster rate of training, but may also decrease stability.
Momentum
When Algorithm is set to SGD, determines the momentum value. This value is optional, but helps the training process converge faster by allowing past gradient calculations to influence the optimization process. The momentum value determines how heavily past results influence the optimization.
Rate Scheduler
Determines the type of scheduler which varies the learning rate over the training process.
Cosine Annealing
Uses a cosine annealing schedule, see pytorch docs
Linear
Maintains a constant learning rate as defined in the Learning Rate parameter, and then linearly decays it towards a value of 0 near the end of the training process. The Linear Decay parameter determines the number of training iterations that should be spent reducing the learning rate value to 0.
Step
Maintains a constant learning rate which is multiplied by a factor of Gamma after every Step Size training iterations.
Exponential
Each epoch, the learning rate is multiplied by a constant.
Linear Decay
When Rate Scheduler is set to Linear, determines how many iterations should be spent decaying the learning rate toward 0. For example, if set to 50, the learning rate will decay towards zero over the last 50 training iterations.
Step Size
When Rate Scheduler is set to Step, determines the number of training iterations between updates to the learning rate.
Gamma
When Rate Scheduler is set to Step, determines the scale factor applied to learning rate each time a scheduler step is reached.
Exponential Style
The way the constant by which the learning rate is multiplied is specified.
Multiplicative Factor
Specify the constant directly
End Learning Rate
Determine the constant from the start learning rate and a given end learning rate.
Gamma
When Rate Scheduler is set to Exponential and Exponential Style is set to Multiplicative Factor, this is the scale factor applied to the learning rate after each epoch.
End Learning Rate
When Rate Scheduler is set to Exponential and Exponential Style is set to End Learning Rate, this sets the scale factor such that the learning rate after the last epoch equals the specified end learning rate.
Max Batch Size
Upper limit on the number of labeled examples, randomly selected from the training set for each optimization step. This is the maximum size of a minibatch.
Hyperparameter Folder
Source folder that contains json files that store (custom network) hyperparameters that are made available to neural network specification and creation scripts.
Hyperparameter Base Name
The base name of a hyperparameters file, excluding the .json extension.
Data Set Folder
Source folder that contains one or more data sets.
Data Set Base Name
The base name of a data set, excluding the .raw extension.
Models Folder
Destination folder for trained models.
Model Base Name
The base name of a trained model, excluding the .onnx extension.
Logs Folder
Folder that contains one or more training logs.
Log Base Name
The base name of a training log, excluding the .txt extension.
States Folder
Destination folder where the training node keeps the training state.
State Base Name
The base name of a training state, excluding any extensions. This is the state from which the training node will resume.
Model Save Event
The type of event that causes the model to be saved while training.
Never
The model is never saved during training. Only the final model is saved.
Per Epoch
The model may be saved at each epoch, regardless of whether it is an improvement.
Per Improvement
The model may be saved only when it improves.
Model Save Epochs per Output
Number of epochs between checks for model save events.
Model Save Append Epoch to Name
When on, the epoch number is appended to the file name of each saved model.
Model Export Event
The type of event that causes the model to be exported while training.
Never
The model is never exported during training. Only the final model is exported.
Per Epoch
The model may be exported at each epoch, regardless of whether it is an improvement.
Per Improvement
The model may be exported only when it improves.
Model Export Epochs per Output
The minimum number of epochs between any two exports. The final model is always exported, regardless of this parameter.
Model Export Append Epoch to Name
When on, the epoch number is appended to the file name of each exported model.
Loss Log Event
The type of event that causes the loss to be logged while training.
Never
The loss is never logged during training.
Per Epoch
The loss is logged after each epoch.
Per Batch
The loss is logged after each minibatch.
Log to Standard Output
When on, information is written to the standard output during training. This does not stop the same information from being written out to log files.
Use CPU Exclusively
When on, the entire training runs on the CPU without using the GPU. This is not recommended as it is very slow; the option exists for debugging purposes.
Environment Path
The path to the python virtual environment in which the internal training script of this node is run.
Use Pip Cache
When enabled, pip will attempt to use cached packages on the local system instead of downloading them every time. This can improve the installation times when repeatedly installing the same Python package in different virtual environments.
TOP Scheduler Override
This parameter overrides the TOP scheduler for this node.
Schedule When
When enabled, this parameter can be used to specify an expression that determines which work items from the node should be scheduled. If the expression returns zero for a given work item, that work item will immediately be marked as cooked instead of being queued with a scheduler. If the expression returns a non-zero value, the work item is scheduled normally.
Work Item Label
Determines how the node should label its work items. This parameter allows you to assign non-unique label strings to your work items which are then used to identify the work items in the attribute panel, task bar, and scheduler job names.
Use Default Label
The work items in this node will use the default label from the TOP network, or have no label if the default is unset.
Inherit From Upstream Item
The work items inherit their labels from their parent work items.
Custom Expression
The work item label is set to the Label Expression custom expression which is evaluated for each item.
Node Defines Label
The work item label is defined in the node’s internal logic.
Label Expression
When on, this parameter specifies a custom label for work items created by this node. The parameter can be an expression that includes references to work item attributes or built-in properties. For example, $OS: @pdg_frame
will set the label of each work item based on its frame value.
Work Item Priority
This parameter determines how the current scheduler prioritizes the work items in this node.
Inherit From Upstream Item
The work items inherit their priority from their parent items. If a work item has no parent, its priority is set to 0.
Custom Expression
The work item priority is set to the value of Priority Expression.
Node Defines Priority
The work item priority is set based on the node’s own internal priority calculations.
This option is only available on the
Python Processor TOP,
ROP Fetch TOP, and ROP Output TOP nodes. These nodes define their own prioritization schemes that are implemented in their node logic.
Priority Expression
This parameter specifies an expression for work item priority. The expression is evaluated for each work item in the node.
This parameter is only available when Work Item Priority is set to Custom Expression.