A single kernel scatters the residuals and runs the forward pass together, rather than copying and grouping them first.
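The difference between the two approaches can be sketched on the CPU in plain Python. This is only an illustration under assumptions the note does not state: each residual row is routed to one group that has its own weight matrix, and the forward pass is a single matrix-vector product. The names `forward_copy_then_group`, `forward_fused`, `group_of`, and `weights` are hypothetical.

```python
def matvec(w, x):
    """Apply weight matrix w (a list of rows) to vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def forward_copy_then_group(residuals, group_of, weights):
    """Baseline: copy rows into per-group buffers, run each group, write results back."""
    buffers = {g: [] for g in range(len(weights))}
    origins = {g: [] for g in range(len(weights))}
    for i, row in enumerate(residuals):
        buffers[group_of[i]].append(list(row))  # explicit copy into the group buffer
        origins[group_of[i]].append(i)          # remember where the row came from
    out = [None] * len(residuals)
    for g, w in enumerate(weights):
        for i, x in zip(origins[g], buffers[g]):
            out[i] = matvec(w, x)
    return out


def forward_fused(residuals, group_of, weights):
    """Fused: one pass reads each row in place by index and writes its output directly."""
    out = [None] * len(residuals)
    for i, x in enumerate(residuals):  # on a GPU this loop would be spread over threads
        out[i] = matvec(weights[group_of[i]], x)
    return out


# Assumed toy data: three rows, two groups.
residuals = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
group_of = [1, 0, 1]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # group 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # group 1: scale by 2
]

fused = forward_fused(residuals, group_of, weights)
baseline = forward_copy_then_group(residuals, group_of, weights)
```

Both functions return the same values; the fused version simply skips the intermediate per-group buffers, which is the saving the single kernel is after.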