No, this is not an April Fools' joke: Amazon plans to make available a large dataset targeting natural language processing research. The Seattle company said today that in September 2019 it will release the Topical Chat dataset, a corpus of human-to-human conversations that has been provided to teams competing in the Alexa Prize Socialbot Grand Challenge.
The Topical Chat dataset contains more than 210,000 utterances, or over 4,100,000 words, Amazon says, making it one of the largest public social conversation and knowledge datasets. Each conversation and turn in the corpus is linked to the knowledge provided to the crowd workers, and that knowledge is said to have been gathered from a variety of “unstructured” and “loosely structured” text resources relating to a set of entities.
Amazon senior principal scientist Dilek Hakkani-Tur made it clear in a blog post that none of the conversations are interactions with Alexa customers.
“The aim of this collection is to enable the next stage of research in neural response systems that are based on knowledge, addressing hard challenges in natural conversation that are not covered by other publicly available data sets,” said Hakkani-Tur. “This will enable researchers to focus on how human beings change between subjects, choice of information and enrichment, and the integration of facts and opinions into dialogue… [and support] publishing high quality repeatable research.”
Amazon says that teams competing for the Alexa Prize will have access to an extended version of the dataset, the aptly named Extended Topical Chat dataset, which includes the results of ongoing collections and annotations.
Today's announcement comes about six months after Amazon open-sourced a data set that could be used to train AI models to identify names across languages and script types. It is called a “multilingual named-entity transliteration system,” and it includes nearly 400,000 names in languages such as Arabic, English, Hebrew, Japanese Katakana, and Russian, scraped from Wikipedia.