Since: 20.5
Overview ¶
This is a general-purpose ML training node to apply to regression problems. In the context of deep learning, regression means training a feedforward neural network (a model) so it can closely approximate a continuous function from a fixed number of input variables to a fixed number of output variables. In Houdini, this function may be provided in the form of a procedural network. See Machine Learning documentation for more general information.
For example, regression ML may learn how to realistically deform a character based on its rig pose to improve over linear blend skinning. In this case, you use ML to learn the function that maps a rig pose to a skin deformation. See the ML Train Deformer recipe. Another example is to learn how to upres a pyro simulation. See the ML Train Volume Upres recipe.
ML Train Regression trains a model given a data set consisting of labeled examples, see ML Example. Each labeled example is a pair consisting of an input component and a corresponding target component. For regression, both the input component and the target component consist of continuous variables. These may include point coordinates, colors, or PCA components (among other types of values). The term regression is used because each target component consists of continuous variables, as opposed to the discrete ones used in classification. Labeled examples are the basis for training. In the specific case of the ML Deformer, each input component is a (serialized) pose and the corresponding target component is the (PCA-compressed) skin deformation obtained by a flesh simulation. In the case of the ML Volume Upres, both inputs and outputs are 3D volume tiles (voxel grids).
ML Train Regression trains a model that predicts a target component given some user-specified input component. The goal is to have a well-fitted model that generalizes well: it should produce useful predictions for unseen inputs, that is, inputs that were not included in the data set used for training. ML Train Regression provides several methods, called regularization methods, that can help ensure a trained model performs well on unseen inputs.
The trained model produced by ML Train Regression can be exported to the ONNX format. It can also be saved as a PyTorch model; however, Houdini has no built-in nodes that can use this saved model. The easiest way to bring a model produced by ML Train Regression into Houdini is the ML Regression Inference tool.
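Outside Houdini, an exported ONNX model can be loaded with any ONNX runtime. The following is only a minimal sketch using the onnxruntime Python package; the file name and input width are hypothetical and must match what your training actually produced.

    import numpy as np
    import onnxruntime as ort  # assumption: onnxruntime is installed

    # Hypothetical path; use the file written to Models Folder / Model Base Name.
    session = ort.InferenceSession("model.onnx")
    input_name = session.get_inputs()[0].name

    # One flattened input component (for example, a serialized pose) as float32.
    # num_inputs is hypothetical and must match the model's input width.
    num_inputs = 54
    example_input = np.zeros((1, num_inputs), dtype=np.float32)

    prediction = session.run(None, {input_name: example_input})[0]
    print(prediction.shape)  # (1, number of target variables)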
ML Train Regression creates a feedforward neural network. The number of input units of this network must correspond to the number of variables in each input component of the data set. The number of output units of the network must correspond to the number of variables in each target component of the data set. When you use a standard kind of model, ML Train Regression automatically figures out the number of inputs and outputs of the network using the dimensions stored in the provided raw data set.
In between the input layer and the output layer, there may be several hidden layers. The number of hidden layers and their width can be controlled by the user. The behavior of the network is controlled by a set of parameters. For linear layers, these parameters may include weights (scaling factors) and biases (additive constants). The training aims to optimize these parameters for the regression task.
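As a rough illustration of the kind of network described above, a small feedforward network with two hidden layers could be sketched in PyTorch as follows. The layer widths and activation type are hypothetical; ML Train Regression determines the input and output sizes from the data set and builds the network internally.

    import torch

    # Hypothetical sizes; the node derives the real ones from the data set.
    num_inputs, hidden_width, num_outputs = 54, 128, 32

    model = torch.nn.Sequential(
        torch.nn.Linear(num_inputs, hidden_width),    # linear layer: weights + biases
        torch.nn.Tanh(),                              # activation layer
        torch.nn.Linear(hidden_width, hidden_width),  # hidden layer
        torch.nn.Tanh(),
        torch.nn.Linear(hidden_width, num_outputs),   # output layer
    )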
ML Train Regression performs multiple passes through the training portion of the data set. Each such pass is called an epoch.
The training process may be repeated multiple times, each time with different settings on the training node. Such training settings are commonly referred to as hyperparameters. Common examples of hyperparameters are settings such as the learning rate, the number of hidden layers of the network, and the width of the network.
The Wedge TOP allows the training process to be repeated with varying hyperparameters. For each setting in a wedge, the resulting model can be saved to a separate file by adding a TOP attribute name at the end of the Model Base Name.
ML Train Regression internally splits the data set provided by the user into two parts: a training portion and a validation portion. The relative sizes of these portions can be controlled by the user. The training portion is used to optimize the network’s parameters during each training epoch. The validation portion is used to verify the network’s performance on data that is not used for the training.
Underfitting occurs when a model is not sufficiently complex to fit the training data well. When this occurs, the model cannot be expected to produce accurate results on test data either. In that case, a more complex network with more parameters (higher capacity) may be needed. This may be achieved, for example, by increasing the number of hidden layers or increasing the width of each hidden layer.
Overfitting occurs when a model produces outputs that are accurate for inputs in the training data but inaccurate for unseen inputs. To help reduce overfitting, the ML Train Regression TOP supports two simple regularization strategies: early stopping and weight decay. Early stopping checks whether the validation loss (the loss on the validation set) stops decreasing. If this is the case, then the training terminates. Weight decay adds to the loss function a term that penalizes large weights.
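To illustrate these two ideas in plain PyTorch terms, consider the sketch below. It is illustrative only, not the node’s internal implementation; the toy model, random data, and patience value are placeholders.

    import torch

    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(8, 32), torch.nn.Tanh(), torch.nn.Linear(32, 4)
    )
    x_train, y_train = torch.randn(256, 8), torch.randn(256, 4)
    x_val, y_val = torch.randn(64, 8), torch.randn(64, 4)

    # Weight decay is passed to the optimizer and penalizes large parameter values.
    # (Unlike the node, this simple sketch does not exclude biases from the penalty.)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

    best_val, patience, bad_evals = float("inf"), 5, 0
    for epoch in range(200):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x_train), y_train)
        loss.backward()
        optimizer.step()

        with torch.no_grad():
            val = torch.nn.functional.mse_loss(model(x_val), y_val).item()
        if val < best_val:
            best_val, bad_evals = val, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:  # early stopping: validation loss stopped improving
                break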
Partial Training ¶
Instead of training a model all at once, a model can be partially trained bit by bit, over multiple passes, where each training pass performs only a limited number of epochs. This allows the TOP network to perform other tasks in between training passes. For example, fresh training data can be generated before each partial training pass, which may be applicable when the data set is generated within Houdini through a procedural network. Training on a stream of fresh data, rather than repeatedly revisiting points in a fixed-size data set, may produce a model that generalizes better to unseen data. Another advantage is that the amount of training data does not have to be decided upfront.
Partial training can also be a useful way of limiting the amount of work/time performed by a single cook of ML Train Regression, while allowing the training to be resumed from where it stopped at a later time. This may be relevant when training is performed on a farm, for example.
Note that starting in Houdini 21, partial training is no longer required in order to periodically save and/or export partially trained models. There are now separate options that allow you to do this without partial training.
ML Train Regression can be set up for partial training:

-  Turn on Enable Partial Training.
-  Place ML Train Regression inside a feedback block, see Feedback Begin.
-  Set Create Iterations From on Feedback Begin to Custom Count.
-  Set Iterations to the total number of partial training passes you want to perform.
-  Set Max Epochs per Cook on ML Train Regression to control the maximum number of epochs performed for each training pass. For example, you could set Max Epochs per Cook such that each partial training pass takes some approximate amount of time. The appropriate value for Max Epochs per Cook depends on the size of the data set.
-  (Optional) To output and retain partially trained ONNX models, you can specify on the ML Train Regression node, in the Files tab, a Model Base Name that includes an attribute such as @loopiter.
-  (Optional) To generate fresh training data before each partial training pass, you can put a ROP Fetch that triggers the generation of a new data set in front of ML Train Regression inside the feedback loop. You should include the attribute @loopiter in the file name of the data set.
File Access ¶
ML Train Regression accesses a variety of files. Each file name can be optionally modified by inserting one or more TOP attributes into the base name of a file, for example @wedgeindex or @loopiter.
This enables various types of TOP ML training setups such as:

-  Training for varying hyperparameters with the use of a Wedge TOP.
-  Generating additional training data on demand in a TOP feedback loop.
The files ML Train Regression accesses fall into three categories: read-only, write-only (or append-only), and read-write (both read from and written to).
In the read-only category, there are (custom network) hyperparameters and the data set. The Hyperparameters tab allows you to provide a dictionary that can be accessed from custom network snippets, for example to control the neural network structure. The Input tab allows you to specify the source of training data (and validation data if early stopping is on).
In the write-only category, there are trained (or partially trained) models, either saved or exported to the ONNX format, with file names determined by Models Folder and Model Base Name. These can be found under the Output tab. Diagnostic and progress information is controlled by the Log tab. If a log file already exists, log information is appended to it. This allows a single log file to be created even when a single training is broken up into parts by invoking ML Train Regression multiple times, for example in a TOP feedback loop.
In the read-write category, there are parameters under the State tab that specify the current state of training. These are used when partial training is on. The training state is read at the start of a single invocation of ML Train Regression and written at the end. The training state allows ML Train Regression to resume training where it left off in a previous invocation, and is only valid when the state name is the same as in the previous invocation. The training state can act like a checkpoint, allowing an incomplete training to resume.
The controls under the Save Frequency and Export Frequency tabs allow you to control how often models are saved and/or exported. Saving and exporting can happen multiple times during one cook of ML Train Regression, even when partial training is turned off.
Re-Training from Scratch ¶
ML Train Regression trains from scratch only if its expected output files are not found on disk. This behavior enables partial training setups. The expected output files include the exported ONNX model, training state, and training log. If these files exist, ML Train Regression will not re-train from scratch.
To make ML Train Regression re-train, remove the output files from disk. You can remove files through the TOPs UI:
-  Right-click the ML Train Regression node and select Delete this Node’s Output Files from Disk.
-  For a single work item, right-click the work item and select Delete File Outputs from Disk.
Neural Networks ¶
ML Train Regression allows you to easily use a standard kind of neural network, such as an MLP, without any coding. However, you can also program your own custom network structure if you want to. This requires writing one or more snippets consisting of a few lines of Python code, making use of the PyTorch library.
An MLP consists of several fully connected layers. In theory, you could always use an MLP for your training. However, this is not always practical. For larger input and output sizes, your MLP model may end up having a very large capacity. Such a large model may require a very large number of training examples to achieve good generalization. Also, large models may exceed the amount of memory available on your machine.
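A rough way to see why: in a fully connected network, a linear layer with n_in inputs and n_out outputs has n_in × n_out weights plus n_out biases, so the parameter count grows quickly with the input and output sizes. A small, illustrative calculation (the layer widths below are hypothetical):

    # Parameter count of a fully connected (MLP) network: each linear layer with
    # n_in inputs and n_out outputs contributes n_in * n_out weights + n_out biases.
    def mlp_parameter_count(widths):
        return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

    # A pose-sized problem stays small...
    print(mlp_parameter_count([54, 128, 128, 32]))          # 27,680 parameters
    # ...while volume-sized inputs and outputs (hypothetical 32^3-voxel tiles) explode:
    print(mlp_parameter_count([32**3, 4096, 4096, 32**3]))  # 285,253,632 parameters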
Custom Neural Networks ¶
When a standard model or a standard loss does not suffice, you can create your own model or loss function. This applies when Network Composition is set to Partitioned. However, to allow even greater flexibility, you can switch Network Composition to Unified. In this case, you have full freedom to create a custom unified network that takes an input component and a target component and outputs a scalar value.
You can create your own model, loss function, or unified network using PyTorch network modules by writing Python snippets. Each such snippet parameter has a default, which can be useful as a reference.
Each Python snippet may access a kwargs dictionary and must set a dictionary called result. You can write snippets as if the torch library has already been imported. See the PyTorch documentation for more information about creating neural networks using PyTorch modules.
The kwargs dictionary stores the following values: a tuple of input shapes (input_shapes), a tuple of target shapes (target_shapes), and a dictionary of (custom network) hyperparameters (hyperparameters). In a Python snippet, these values could be extracted and put in separate variables like this:
    input_shapes = kwargs["input_shapes"]
    target_shapes = kwargs["target_shapes"]
    hyperparameters = kwargs["hyperparameters"]
Both input_shapes and target_shapes are determined by the data set. In Houdini 21, the input components and target components are always flattened and concatenated, so input_shapes and target_shapes each contain only a single shape, and this single shape has only a single dimension.
In this flattened, concatenated representation, the total number of (scalar) elements of an input component is always input_shapes[0][0]. Similarly, the total number of (scalar) elements of a target component is always target_shapes[0][0].
The dictionary hyperparameters is loaded from a JSON file. You can specify this file’s location using the Hyperparameter Folder and Hyperparameter Base Name parameters. The contents of this hyperparameter dictionary can control the structure of the custom network.
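For example, the hyperparameter file might contain the JSON shown in the comment below (the key names are hypothetical), and a custom network snippet could read those values through kwargs:

    # Hypothetical contents of <Hyperparameter Folder>/<Hyperparameter Base Name>.json:
    #
    #   {"hidden_width": 256, "num_hidden_layers": 3}
    #
    # Inside a custom network snippet, these values arrive via the kwargs dictionary:
    hyperparameters = kwargs["hyperparameters"]
    hidden_width = hyperparameters.get("hidden_width", 128)
    num_hidden_layers = hyperparameters.get("num_hidden_layers", 2)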
A snippet that defines a custom network must always set the result dictionary. This dictionary must assign values to both module_class and module_kwargs. The value associated with module_class must be a Python class object, for a class that is a subclass of torch.nn.Module. For example, in a Python snippet, setting the result could look like:
    class MyModule(torch.nn.Module):
        ...

    my_module_kwargs = ...

    result = {"module_class": MyModule, "module_kwargs": my_module_kwargs}
Here my_module_kwargs should be valid keyword arguments for the construction of MyModule.
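Putting the pieces together, a fuller sketch of a custom Model Module snippet could look like the following, assuming Network Composition is set to Partitioned. The hidden width and the "hidden_width" hyperparameter key are hypothetical; torch, kwargs, and result are provided by the snippet environment as described above.

    # Sketch of a custom Model Module snippet.
    input_size = kwargs["input_shapes"][0][0]    # flattened input component size
    target_size = kwargs["target_shapes"][0][0]  # flattened target component size
    hidden = kwargs["hyperparameters"].get("hidden_width", 128)

    class MyModule(torch.nn.Module):
        def __init__(self, input_size, target_size, hidden):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(input_size, hidden),
                torch.nn.Tanh(),
                torch.nn.Linear(hidden, target_size),
            )

        def forward(self, x):
            return self.net(x)

    result = {
        "module_class": MyModule,
        "module_kwargs": {
            "input_size": input_size,
            "target_size": target_size,
            "hidden": hidden,
        },
    }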
Limitations ¶
ML Train Regression and the ML Example family of supervised ML nodes in SOP and ROP provide an easy starting point for experimentation with supervised ML in Houdini. The applicability of ML Train Regression can be extended through the script options that let you create your own model networks and loss functions. The Unified option on Network Composition stretches this flexibility even further, so ML Train Regression can be applied to many more ML use cases.
Even with customizations, you may find that ML Train Regression is not general enough for your use case. In that case, you can extract and copy the underlying scripts of this node, located in $HHP/hutil/ml/regression, and use these as a starting point for creating a custom training node. You may still be able to use many of the other tools of the supervised ML toolkit that exist in SOP and ROP to prepare your data set, such as the ML Example operator family.
Parameters ¶
Network Composition
How to compose the neural network that will be used for training.
Partitioned
The neural network is composed of separate parts: a model, a loss function, and a regularizer. The model takes only the input component from each example (batch), returning the model output. The loss function takes both the model output (batch) and the target component from an example (batch). The regularizer takes only the model output (batch), but also has access to the model parameters.
Unified
The neural network accepts the input and target components of an example (batch) and outputs a loss (batch). This option offers greater flexibility than having separated network parts for the model, the loss function and the regularizer.
Model Kind
Either a standard model (e.g., an MLP) or a user-defined model.
Standard
A standard kind of model network.
Custom
A user-defined kind of model network.
Standard Type
The type of standard model network.
MLP
Creates a model network consisting of fully connected linear layers alternated by activation layers. This network architecture is commonly referred to as a multilayer perceptron (MLP).
Model Module
A python script that describes how a model should be created. See Custom Network Structures for details.
Loss Kind
Either a standard loss (e.g., Mean Squared Error) or a user-defined loss.
Standard
A standard kind of loss network.
Custom
A user-defined kind of loss network.
Standard Loss Type
The specific standard loss function. Currently supports only Mean Squared Error (MSE).
Loss Module
A python script that describes how a loss function should be created. See Custom Network Structures for details. This is expected to be a loss that is an average over the examples in a batch, not a sum.
Weight Decay
The higher this value, the more the training session will try to keep the weights small (only the weights, not the biases). This is a very basic method for preventing overfitting.
Random Seed
The random seed used to initialize neural network parameters. Different seeds result in different models with varying accuracies. This hyperparameter is suitable for wedging.
Unified Module
A python script that describes how a model should be created. See Custom Network Structures for details.
Shuffle
When on (recommended), reorders the data set elements randomly before use. This ensures the validation set consists of a random sample of the data set.
Shuffle Random Seed
The random seed used to shuffle the examples in the data set.
Limit Size
When on, only an initial part of the data set is preserved; the remaining data is deleted. This step takes place right after shuffling, before any of the data is used for training. This option is useful for finding out how the generalization error of the trained model depends on the data size, for example with the use of a Wedge TOP. The resulting curve may indicate whether more data would be beneficial to improve generalization.
Upper Limit
Specifies an upper limit on the number of data set elements that are preserved. The size of the remaining data set is the minimum of the original data set size and this limit.
Enable Partial Training
Enforce a hard upper limit on the number of epochs per work item during training. This is useful when ML Train Regression is used in a feedback loop in TOPs. During each iteration of the feedback, the number of epochs can be limited using this parameter.
Max Epochs (for a single cook)
Hard upper limit on the number of epochs for one single cook of ML Train Regression. On the next cook, training resumes at the next epoch.
Enable Validation
When on, splits the data set into training and validation parts, controlled by Training Data Proportion and Validation Data Proportion. The validation part is evaluated every Epochs per Validation epochs.
Training Data Proportion
Applies only when Enable Validation is on. The proportion of the data set used to train the model.
Validation Data Proportion
Applies only when Enable Validation is on. The proportion of the data set used to validate the performance of the model. The validation set consists of a contiguous range of elements at the end of the data set. It is recommended to turn on the Shuffle option; otherwise, the validation set will generally not consist of a random sample of the entire data set.
Epochs per Validation
The number of epochs that are trained before each validation loss evaluation.
Maximum Epochs Total (over all cooks)
Maximum number of epochs trained over all cooks of ML Train Regression. Training terminates when this number of epochs is reached. This is most useful when partial training is on.
Enable Early Stopping
When on, terminates training when performance of the model on the validation set stops improving.
Patience
The number of times the validation loss is evaluated without finding an improvement over the current best validation loss before training stops.
Note
This parameter is expressed as number of evaluations, not epochs. See the Epochs per Validation parameter to see how many epochs are trained between validations.
Algorithm
Determines which optimization algorithm to use during the training process.
Adam
Uses the Adam algorithm, see pytorch documentation for more detail.
Adadelta
Uses the ADADELTA algorithm, see pytorch documentation for more detail.
SGD
Uses the Stochastic Gradient Descent algorithm, see pytorch documentation for more detail.
ASGD
Uses the Averaged Stochastic Gradient Descent algorithm, see pytorch documentation for more detail.
Learning Rate
Controls step size during training. Larger steps mean larger parameter updates. Smaller rates may take longer but help avoid skipping locally optimal solutions.
Beta1
This coefficient is specific to the Adam optimization algorithm. See Pytorch documentation.
Beta2
This coefficient is specific to the Adam optimization algorithm. See Pytorch documentation.
Rho
When Algorithm is set to Adadelta, determines the rho value used by the optimization algorithm. A higher value results in a faster rate of training, but may also decrease stability.
Momentum
When Algorithm is set to SGD, determines the momentum value. This value is optional, but helps the training process converge faster by allowing past gradient calculations to influence the optimization process. The momentum value determines how heavily past results influence the optimization.
Rate Scheduler
Determines the type of scheduler which varies the learning rate over the training process.
Cosine Annealing
Uses a cosine annealing schedule, see pytorch docs
Linear
Maintains a constant learning rate as defined in the Learning Rate parameter, and then linearly decays it towards a value of 0 near the end of the training process. The Linear Decay parameter determines the number of training iterations that should be spent reducing the learning rate value to 0.
Step
Maintains a constant learning rate which is multiplied by a factor of Gamma after every Step Size training iterations.
Exponential
Each epoch, the learning rate is multiplied by a constant.
Linear Decay
When Rate Scheduler is set to Linear, determines how many iterations should be spent decaying the learning rate toward 0. For example, if set to 50, the learning rate will decay towards zero over the last 50 training iterations.
Step Size
When Rate Scheduler is set to Step, determines the number of training iterations between updates to the learning rate.
Gamma
When Rate Scheduler is set to Step, determines the scale factor applied to learning rate each time a scheduler step is reached.
Exponential Style
The way the constant by which the learning rate is multiplied is specified.
Multiplicative Factor
Specify the constant directly
End Learning Rate
Determine the constant from the start learning rate and a given end learning rate.
Gamma
When Rate Scheduler is set to Exponential and Exponential Style is set to Multiplicative Factor, this is the scale factor applied to the learning rate after each epoch.
End Learning Rate
When Rate Scheduler is set to Exponential and Exponential Style is set to End Learning Rate, this sets the scale factor such that the learning rate after the last epoch equals the specified end learning rate.
Max Batch Size
Upper limit on the number of labeled examples, randomly selected from the training set for each optimization step. This is the maximum size of a minibatch.
Hyperparameter Folder
Source folder that contains json files that store (custom network) hyperparameters that are made available to neural network specification and creation scripts.
Hyperparameter Base Name
The base name of a hyperparameters file, excluding the .json extension.
Data Set Folder
Source folder that contains one or more data sets.
Data Set Base Name
The base name of a data set, excluding the .raw extension.
Models Folder
Destination folder for trained models.
Model Base Name
The base name of a trained model, excluding the .onnx extension.
Logs Folder
Folder that contains one or more training logs.
Log Base Name
The base name of a training log, excluding the .txt extension.
States Folder
Destination folder where the training node keeps the training state.
State Base Name
The base name of a training state, excluding any extensions. This is the state from which the training node will resume.
Model Save Event
The type of event that causes the model to be saved while training.
Never
The model is never saved during training. Only the final model is saved.
Per Epoch
The model may be saved at each epoch, regardless of whether it is an improvement.
Per Improvement
The model may be saved only when it improves.
Model Save Epochs per Output
Number of epochs between checks for model save events.
Model Save Append Epoch to Name
When on, the epoch number is appended to the file name of each saved model.
Model Export Event
The type of event that causes the model to be exported while training.
Never
The model is never exported during training. Only the final model is exported.
Per Epoch
The model may be exported at each epoch, regardless of whether it is an improvement.
Per Improvement
The model may be exported only when it improves.
Model Export Epochs per Output
The minimum number of epochs between any two exports. The final model is always exported, regardless of this parameter.
Model Export Append Epoch to Name
When on, the epoch number is appended to the file name of each exported model.
Loss Log Event
The type of event that causes the loss to be logged while training.
Never
The loss is never logged during training.
Per Epoch
The loss is logged after each epoch.
Per Batch
The loss is logged after each minibatch.
Log to Standard Output
When on, information is written to the standard output during training. This does not stop the same information from being written out to log files.
Use CPU Exclusively
When on, the entire training runs on the CPU without using the GPU. This is not recommended as it is very slow; the option exists for debugging purposes.
Environment Path
The path to the python virtual environment in which the internal training script of this node is run.
Use Pip Cache
When enabled, pip will attempt to use cached packages on the local system instead of downloading them every time. This can improve the installation times when repeatedly installing the same Python package in different virtual environments.
TOP Scheduler Override
This parameter overrides the TOP scheduler for this node.
Schedule When
When enabled, this parameter can be used to specify an expression that determines which work items from the node should be scheduled. If the expression returns zero for a given work item, that work item will immediately be marked as cooked instead of being queued with a scheduler. If the expression returns a non-zero value, the work item is scheduled normally.
Work Item Label
Determines how the node should label its work items. This parameter allows you to assign non-unique label strings to your work items which are then used to identify the work items in the attribute panel, task bar, and scheduler job names.
Use Default Label
The work items in this node will use the default label from the TOP network, or have no label if the default is unset.
Inherit From Upstream Item
The work items inherit their labels from their parent work items.
Custom Expression
The work item label is set to the Label Expression custom expression which is evaluated for each item.
Node Defines Label
The work item label is defined in the node’s internal logic.
Label Expression
When on, this parameter specifies a custom label for work items created by this node. The parameter can be an expression that includes references to work item attributes or built-in properties. For example, $OS: @pdg_frame
will set the label of each work item based on its frame value.
Work Item Priority
This parameter determines how the current scheduler prioritizes the work items in this node.
Inherit From Upstream Item
The work items inherit their priority from their parent items. If a work item has no parent, its priority is set to 0.
Custom Expression
The work item priority is set to the value of Priority Expression.
Node Defines Priority
The work item priority is set based on the node’s own internal priority calculations.
This option is only available on the
Python Processor TOP,
ROP Fetch TOP, and ROP Output TOP nodes. These nodes define their own prioritization schemes that are implemented in their node logic.
Priority Expression
This parameter specifies an expression for work item priority. The expression is evaluated for each work item in the node.
This parameter is only available when Work Item Priority is set to Custom Expression.