"this paper proposes a collection of four tasks designed to evaluate different prerequisite qualities of end-to-end dialog systems"
"QA Dataset: Tests the ability to answer factoid questions that can be answered without relation to previous dialog. The context consists of the question only.
• Recommendation Dataset: Tests the ability to provide personalized responses to the user via recommendations (in this case, of movies) rather than universal facts as above.
• QA+Recommendation Dataset: Tests the ability to maintain short dialogs involving both factoid and personalized content, where conversational state has to be maintained.
• Reddit Dataset: Tests the ability to identify most likely replies in discussions on Reddit.
• Joint Dataset: All our tasks are dialogs. They can be combined into a single dataset, testing the ability of an end-to-end model to perform well at all skills at once."
"We employ the MemN2N architecture of Sukhbaatar et al. (2015) in our experiments, with some additional modifications to construct both long-term and short-term context memories"
"Retrieving long-termmemories For each word in the last N messages we performa hash lookup to return all long-term memories (sentences) from a database that also contain that word. Words above a certain frequency cutoff can be ignored to avoid sentences that only share syntax or unimportant words. We employ the movie knowledge base of Sec. 2.1 for our long-term memories,"
"The wholemodel is trained using stochastic gradient descent byminimizing a standard cross-entropy loss between ˆa and the true label a."
"For matching two documents supervised semantic indexing (SSI) was shown to be superior to unsupervised latent semantic indexing (LSI) (Bai et al., 2009"
"we believe this is a surprisingly strong baseline that is often neglected in evaluations"
"Recurrent Neural Networks (RNNs) have proven successful at several tasks involving natural language, language modeling (Mikolov et al., 2011"
"LSTMs are not known however for tasks such as QA or item recommendation, and so we expect them to find our datasets challenging."
"We chose the method of Bordes et al. (2014)10 as our baseline. This system learns embeddings that match questions to database entries, and then ranks the set of entries, and has been shown to achieve good performance on the WEBQUESTIONS benchmark (Berant et al., 2013)."
"Answering Factual Questions Memory Networks and the baseline QA system are the two methods that have an explicit long-term memory via access to the knowledge base (KB). On the task of answering factual questions where the answers are contained in the KB, they outperform the other methods convincingly, with LSTMS being particularly poor"
"Making Recommendations In this task a long-term memory does not bring any improvement, with LSTMs, Supervised Embeddings and Memory Networks all performing similarly, and all outperforming the SVD baseline."
"LSTMs performpoorly: the posts in Reddit are quite long and the memory of the LSTMis relatively short, as pointed out by Sordoni et al. (2015).
"Testing more powerful recurrent networks such as Seq2Seq or LSTMs with attention on these benchmarks remains as future wor"