I manage a small render farm that uses Thinkbox’s Deadline for render management and VRay for Maya for most significant renders. We are adjusting the farm to include VRay distributed rendering slaves. The techniques I’m discussing in this article should work with any distributed rendering software that does not mind its slaves being interrupted mid-render, and with any render farm software that allows job interruption based on priority (needed for the Loose option).
Note that none of these techniques will speed up any renders when the farm is very busy. We will still get the same number of renders in the same amount of time. There is a slight advantage for Limited in getting a smaller subset more quickly, which I will discuss later.
I have thought through two methods of handling distributed rendering on a farm, and I have labeled them Limited and Loose. I will explain the difference between the two and why I believe Limited is better for most situations.
Let’s first understand the base case without using distributed rendering at all. In the farm, each render node will receive a job and render it by itself. It does not ask for help from any other render nodes, regardless of how busy (or not busy) the farm is. This means if we have 12 render nodes, we are at peak performance only if there are at least 12 jobs to be rendered. If there are fewer than 12 jobs, the effectiveness drops in direct proportion to the number of nodes not being used. Only 3 jobs to render? 9 of our render nodes sit idle.
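To put numbers on the base case, here is a minimal sketch in Python. The function names are mine, purely for illustration, and it assumes every node is identical and takes exactly one job at a time.

```python
def idle_nodes(total_nodes: int, jobs: int) -> int:
    """Nodes left with nothing to do when each job occupies exactly one node."""
    return max(total_nodes - jobs, 0)


def utilization(total_nodes: int, jobs: int) -> float:
    """Fraction of the farm doing work, without distributed rendering."""
    return min(jobs, total_nodes) / total_nodes


print(idle_nodes(12, 3))    # 9 nodes sitting idle
print(utilization(12, 3))   # 0.25 of the farm in use
print(utilization(12, 20))  # 1.0, the farm is saturated
```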
Now to add distributed rendering! Adding distributed rendering never makes things worse: apart from a small bit of overhead, it will not slow down render times. The only cost is more complexity in our farm.
The Limited distributed rendering option assigns each render node to one of two categories: a Render Master or a Render Slave. The Render Masters will exist in the render farm software, handling scene files as they are received from the users. When a job is received and started, the Render Master will ask for a limited number of Render Slaves to help with the render job (the Render Master can help do the work as well, assuming the software can support it and the machine is powerful enough). This will apply the power of only a section of the farm onto each rendered image. With a Limited farm, we are at peak performance when we have at least as many jobs as we have Render Master nodes. If the farm has fewer jobs than Render Masters, the efficiency goes down at a rate of (# of unused Render Masters)/(# of all Render Masters).
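The efficiency ratio for a Limited farm can be sketched the same way. This is only an illustration of the formula above; the function name is hypothetical, and it assumes every Render Master has the same number of Render Slaves behind it.

```python
from fractions import Fraction


def limited_efficiency_loss(render_masters: int, jobs: int) -> Fraction:
    """(# of unused Render Masters) / (# of all Render Masters)."""
    unused = max(render_masters - jobs, 0)
    return Fraction(unused, render_masters)


# A farm with 3 Render Masters:
print(limited_efficiency_loss(3, 1))  # 2/3 of the farm is idle
print(limited_efficiency_loss(3, 3))  # 0, peak performance
print(limited_efficiency_loss(3, 8))  # 0, extra jobs simply wait in the queue
```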
In the Loose distributed rendering option, each render node is considered equal. Every node will be set up to have a low-priority job that is interruptible (by the render farm software). This job will run a Distributed Rendering Slave (DRS for short). When a job from a user is received and started, whichever node picks up the job will interrupt its DRS job and ask all other render nodes with the DRS job running to help with the user’s job. This will apply the power of all available render nodes onto the rendered image. With a Loose farm, we are at peak performance starting at 1 job. However, we lose performance as soon as a busy farm starts to free up.
At first glance, it sounds like Loose is better than Limited. Why wouldn’t we want to always throw all of our rendering power behind every image? There are two problems with Loose and one with Limited. Limited’s problem has already been discussed: only allowing a subsection of the farm to be used on each render. Loose’s problems include: 1) only speeding up the render time of the first user job to get all of the DRS jobs, and 2) having no improvement over a farm without distributed rendering when the farm is very busy.
Loose’s problem 1) can be fixed or avoided if the distributed rendering software continues to query for helping slaves after the render has started. In my case, VRay does not ask for help after the initial query has been sent. This means that if I send two jobs to our farm of 12 nodes using Loose, the first job will get 10 helpers and the second job will have to render by itself for the entire duration. I cannot, however, come up with a solution for Loose’s problem 2) because of the nature of the system.
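The two-job example can be modeled with a small Python sketch. It is a toy model, not how Deadline or VRay actually behave internally: it assumes all submitted jobs are picked up by their nodes before the first helper query goes out, and that helpers are claimed once at job start and never reassigned. The function name is hypothetical.

```python
def loose_helpers(total_nodes: int, jobs: int) -> list[int]:
    """Helpers each job receives when slaves are only claimed at job start."""
    # Each job interrupts the DRS job on the node that picks it up,
    # so those nodes are no longer available as helpers.
    free_slaves = max(total_nodes - jobs, 0)
    helpers = []
    for _ in range(min(jobs, total_nodes)):
        helpers.append(free_slaves)  # the first job claims every free slave
        free_slaves = 0              # nothing is left for later jobs
    return helpers


print(loose_helpers(12, 1))  # [11]
print(loose_helpers(12, 2))  # [10, 0]
```

With a single job the whole farm helps; with two jobs the second one renders alone, which is problem 1) in a nutshell.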
Limited’s problem and Loose’s problem 2) are similarly caused by the nature of their design, but Limited has a slight advantage on a very busy farm: we will get smaller groups of images quicker using Limited than using Loose. If all 12 nodes are busy in a Loose farm, we must wait the full duration of rendering a single image to get our 12 images all at once. If we run the same jobs on a Limited farm with 3 Render Masters, each with 3 Render Slaves (12 nodes total), we will get smaller groups of images (3 at a time) in 1/4 the time. This allows for faster feedback for the user, which may result in jobs being canceled and resubmitted sooner.
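Here is a rough sketch of that timeline in Python. It assumes perfect linear scaling (4 nodes render an image in 1/4 the time), no distributed rendering overhead, and that each Render Master works alongside its 3 Render Slaves. The 60-minute render time and the function name are made up for the example.

```python
import math


def completion_timeline(images: int, parallel_images: int, time_per_batch: float):
    """List of (elapsed_time, images_finished) after each batch completes."""
    batches = math.ceil(images / parallel_images)
    timeline = []
    for batch in range(1, batches + 1):
        finished = min(batch * parallel_images, images)
        timeline.append((batch * time_per_batch, finished))
    return timeline


T = 60.0  # hypothetical minutes for one node to render one image alone

# Busy Loose farm: 12 nodes each render one image alone.
loose = completion_timeline(12, parallel_images=12, time_per_batch=T)
# Limited farm: 3 groups of 4 nodes, each group finishing an image in T / 4.
limited = completion_timeline(12, parallel_images=3, time_per_batch=T / 4)

print(loose)    # [(60.0, 12)]
print(limited)  # [(15.0, 3), (30.0, 6), (45.0, 9), (60.0, 12)]
```

Both farms finish all 12 images at the 60-minute mark, but the Limited farm hands back its first 3 images after 15 minutes.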
Therefore, using a Limited distributed rendering system would be the superior of the two options.
Please let me know in the comments if there is an alternative, or there is some advantage that I have missed.