Blocking email spam is a constant, ever-evolving battle, and Gmail’s latest technique results in a 38% boost to detection thanks to better text identification.
Spammers often use homoglyphs (characters that look similar to actual letters), invisible characters, keyword stuffing, and other “adversarial text manipulations” to bypass Gmail’s text classification models that identify phishing attacks, scams, and other harmful content.
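To make these manipulations concrete, here is a minimal Python sketch of how a homoglyph and an invisible character can slip past a naive keyword filter. The blocklist, the sample messages, and the helper names (`naive_match`, `strip_invisible`) are hypothetical and chosen only for illustration. This is not how Gmail's classifiers work.

```python
import unicodedata

# Hypothetical blocklist, for illustration only.
BLOCKED = {"free", "prize"}


def naive_match(text: str) -> bool:
    """Return True if any whitespace-separated token is on the blocklist."""
    return any(token in BLOCKED for token in text.lower().split())


def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (category Cf), such as the zero-width space."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


clean = "claim your free prize"

# Cyrillic 'е' (U+0435) stands in for Latin 'e', and a zero-width
# space (U+200B) is hidden inside "prize".
evasive = "claim your fr\u0435\u0435 pri\u200bze"

print(naive_match(clean))                     # the plain message is caught
print(naive_match(evasive))                   # the manipulated one is not
print(naive_match(strip_invisible(evasive)))  # removing U+200B restores "prize"

# Unicode normalization does not fold the Cyrillic letter into the Latin one,
# so the homoglyph survives this kind of preprocessing.
print(unicodedata.normalize("NFKC", "fr\u0435\u0435") == "free")
```

Hand-written cleanup rules like `strip_invisible` handle one trick at a time, which is part of why a classifier that is robust to such input is attractive.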
Google is countering with RETVec (Resilient & Efficient Text Vectorizer). Open sourced by Google Research, this approach “helps models achieve state-of-the-art classification performance and drastically reduces computational cost,” while supporting “every language and all UTF-8 characters without the need for text preprocessing.” This makes it ideal for on-device, web, and other large-scale use cases:
- “Models trained with RETVec can be seamlessly converted to TFLite for mobile and edge devices, as a result of a native implementation in TensorFlow Text. For web application model deployment, we provide a TensorflowJS layer implementation that is available on Github and you can check out a demo web page running a RETVec-based model.”
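One way a vectorizer can cover every UTF-8 character without a vocabulary or preprocessing step is to encode each character's Unicode code point as a fixed-width bit pattern. The Python sketch below illustrates that general idea only. It is not RETVec's actual implementation, and the function name `encode_word`, the 16-character word length, and the 24-bit width are assumptions made for this example.

```python
from typing import List


def encode_word(word: str, max_chars: int = 16, bits: int = 24) -> List[List[int]]:
    """Encode a word as a max_chars x bits grid of 0/1 values.

    Each row holds the binary form of one character's Unicode code point,
    least significant bit first. Shorter words are padded with zero rows and
    longer words are truncated. The sizes are assumptions for illustration.
    """
    rows = []
    for ch in word[:max_chars]:
        code_point = ord(ch)  # at most 0x10FFFF, so it fits in 24 bits
        rows.append([(code_point >> i) & 1 for i in range(bits)])
    # Pad so that every word maps to the same fixed shape.
    rows.extend([0] * bits for _ in range(max_chars - len(rows)))
    return rows


latin = encode_word("free")
mixed = encode_word("fr\u0435\u0435")  # Cyrillic 'е' (U+0435) in place of Latin 'e'

print(len(latin), len(latin[0]))  # fixed shape regardless of script
print(latin == mixed)             # the homoglyph gets a different encoding
```

Because any code point maps to a valid row, no character is ever out of vocabulary. Treating visually similar characters as similar is then left to a model trained on top of such an encoding, since the raw bit patterns of a homoglyph and its look-alike differ.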
In Gmail, RETVec has improved the “spam detection rate over the baseline by 38%,” while reducing both the false positive rate (by 19.4%) and Tensor Processing Unit usage (by 83%).
Google attributes these improvements to RETVec’s very lightweight word embedding model (~200k parameters), which allowed the team to reduce the Transformer model’s size at equal or better performance, and to the ability to split computation between the host and the TPU in a network- and memory-efficient manner.
Google says it has “battle-tested RETVec extensively” over the past year “and found it to be highly effective for security and anti-abuse applications.”
If you would like to use RETVec for your own use cases or research, Google has created a tutorial to help you get started.