GLU Variants Revolutionize Deep Learning With Improved Transformer Performance

The paper "GLU Variants Improve Transformer" (Computer Science > Machine Learning, submitted on 12 Feb 2020) addresses a key challenge in Transformer models: improving the quality of the feed-forward layers. These layers typically use ReLU or GELU activations, and the paper explores variations of the Gated Linear Unit (GLU) as alternatives. A GLU is the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid.

Google researcher Noam Shazeer tests these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model and finds that some of them yield quality improvements over the typically-used ReLU or GELU activations. Since the emergence of Transformer-based models, different activation functions and GLU variants have been tried in these layers, and the gated versions do seem to perform better. In summary, GLU variants contribute to improved Transformer models by enhancing the feed-forward layers, increasing parameter efficiency, and achieving better performance metrics on language tasks.
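To make the idea of swapping the gate function concrete, here is a minimal, dependency-free sketch of a gated feed-forward sublayer. It assumes the bias-free form (gate(xW) * xV) W2 and the variant names GLU, Bilinear, ReGLU, GEGLU and SwiGLU as used in the paper. The helper names (`matvec`, `ffn_glu`, `GATES`) and the toy weights are hypothetical and chosen only for illustration. It is not a reference implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gelu(z):
    # Exact GELU: z times the standard normal CDF evaluated at z.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def swish(z, beta=1.0):
    # Swish with beta = 1 (an assumed default for this sketch).
    return z * sigmoid(beta * z)


# Each variant differs only in the function applied to the gate projection.
GATES = {
    "GLU": sigmoid,                  # original gated linear unit
    "Bilinear": lambda z: z,         # linear gate, no nonlinearity
    "ReGLU": lambda z: max(0.0, z),  # ReLU gate
    "GEGLU": gelu,                   # GELU gate
    "SwiGLU": swish,                 # Swish gate
}


def matvec(x, W):
    """Multiply row vector x (length n) by matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def ffn_glu(x, W, V, W2, gate):
    """Gated feed-forward sublayer: (gate(xW) * xV) W2, without bias terms."""
    gated = [gate(a) * b for a, b in zip(matvec(x, W), matvec(x, V))]
    return matvec(gated, W2)


# Toy sizes for illustration only: d_model = 2, d_ff = 3.
x = [1.0, -2.0]
W = [[0.5, 1.0, 0.0], [0.25, -0.5, 1.0]]
V = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs = {name: ffn_glu(x, W, V, W2, gate) for name, gate in GATES.items()}
for name, y in outputs.items():
    print(name, [round(v, 4) for v in y])
```

Compared with a standard feed-forward layer, the gated form uses three weight matrices (`W`, `V`, `W2`) instead of two. A fair comparison therefore needs a smaller hidden size to keep the parameter count roughly equal, which is where the parameter-efficiency argument comes in.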