I've been spending a lot of my time over the last year working on getting machine learning running on microcontrollers, and so it was great to finally start talking about it in public for the first time today at the TensorFlow Developer Summit. Even better, I was able to demonstrate TensorFlow Lite running on a Cortex…
Have you ever Googled some random question, such as how many countries are there in the world, and been impressed to see Google presenting the precise answer to you rather than just a list of links? This feature is clearly nifty and useful, but is also still limited; a search for a slightly more complex question such as how long do I need to bike to burn the calories in a Big Mac will not yield a nice answer, even though any person could look over the content of the first or second link and find the answer.
ELMo (Peterset al., 2018), - Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word rep- resentations. In NAACL
Generative Pre-trained Transformer (OpenAIGPT) (Radford et al., 2018) - Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language under- standing with unsupervised learning. Technical re- port, OpenAI.
Transformer (Vaswani et al., 2017) - Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Pro- cessing Systems , pages 6000–6010.
SQuAD question answering (Rajpurkar et al., 2016) - Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 .
Masked language model - inspired by by the Cloze task (Taylor, 1953).
introduce a “next sentence prediction” task that jointly pre-trains text-pair representations.
natural language inference (Bowman et al., 2015; Williams et al., 2018)
There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning.
The feature-based approach, such as ELMo (Peters et al., 2018), uses tasks-specific architectures that include the pre-trained representations as additional features.
The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018),
The major limitation is that standard language models are unidirectional,
In addition to the masked language model, we also introduce a “next sentence prediction” task that jointly pre-trains text-pair representations.
ELMo advances the state-of-the-art for several major NLP benchmarks (Peters et al., 2018) including question answering (Rajpurkar et al., 2016) on SQuAD, sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).
A recent trend in transfer learning from language models (LMs) is to pre-train some model architecture on a LM objective before fine-tuning that same model for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018)
The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentencelevel tasks from the GLUE benchmark (Wang et al., 2018).
transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (Mc- Cann et al., 2017).
We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary
denoising auto-encoders (Vincent et al., 2008)
Adam with learning rate of 1e-4, 1 = 0:9, 2 = 0:999, L2 weight decay of 0:01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate.
We use a dropout probability of 0.1 on all layers.
We use a gelu activation (Hendrycks and Gimpel, 2016) rather than the standard relu,
training loss is the sum of the mean masked LM likelihood and mean next sentence prediction likelihood.
We also observed that large data sets (e.g., 100k+ labeled training examples) were far less sensitive to hyperparameter choice than small data sets. Fine-tuning is typically very fast, so it is reasonable to simply run an exhaustive search over the above parameters and choose the model that performs best on the development set.
A benefit of using ensembles of decision tree methods like gradient boosting is that they can automatically provide estimates of feature importance from a trained predictive model. In this post you will discover how you can estimate the importance of features for a predictive modeling problem using the XGBoost library in Python. After reading this …