Science & Technology



Microsoft's AI learns to answer questions about scenes from image-text pairs

[2019.10.08, Tue 19:05] Machines struggle to make sense of scenes and language without detailed accompanying annotations, yet labeling is generally time-consuming and expensive, and even the best labels convey an understanding of scenes but not of language. In an attempt to remedy the problem, Microsoft researchers conceived of an AI system that trains on image-text pairs, mimicking the way humans improve their understanding of the world. They say their single-model encoder-decoder Vision-Language Pre-training model, which can both generate image descriptions and answer natural language questions about scenes, lays the groundwork for future frameworks that could reach human parity. A model pretrained on three million image-text pairs is available in open source on GitHub. "Making sense of the world around us is a skill we as human beings begin to learn from an early age. The more we interact with our physical environments, the better we become at understanding and using language to explain the items that exist and the things that are happening in our surroundings," wrote Microsoft senior researcher Hamid Palangi in a blog post. The researchers report that the model not only outperformed state-of-the-art models on several image captioning and visual question answering metrics, but also managed to answer questions about images with which previous models, trained only on language, had struggled.
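The key idea reported above is that a single encoder-decoder can serve both tasks: given only an image it generates a caption, and given an image plus a question it generates an answer. The toy sketch below illustrates that sharing with made-up embeddings and a greedy decoder; it is not Microsoft's VLP model or its API, just a minimal illustration of one model handling both modes.

```python
import numpy as np

# Toy illustration (NOT Microsoft's VLP): one shared encoder-decoder
# handles both captioning (image only) and VQA (image + question).
rng = np.random.default_rng(0)
VOCAB = ["<eos>", "a", "dog", "cat", "ball", "what", "is", "this"]
DIM = 8
EMB = {w: rng.normal(size=DIM) for w in VOCAB}  # pretend word embeddings

def encode(image_feat, question_tokens=()):
    """Fuse image features and (optional) question tokens into one state."""
    vecs = [image_feat] + [EMB[t] for t in question_tokens]
    return np.mean(vecs, axis=0)

def decode(state, max_len=5):
    """Greedy decoding: repeatedly emit the word most similar to the state."""
    out = []
    for _ in range(max_len):
        word = max(EMB, key=lambda w: state @ EMB[w])
        if word == "<eos>":
            break
        out.append(word)
        state = state - 0.5 * EMB[word]  # shift state away from emitted word
    return out

# Pretend image features close to the "dog" embedding.
image = EMB["dog"] + 0.1 * rng.normal(size=DIM)

caption = decode(encode(image))                          # captioning mode
answer = decode(encode(image, ("what", "is", "this")))   # VQA mode
print("caption:", caption)
print("answer:", answer)
```

The point of the sketch is architectural: both calls route through the same `encode`/`decode` pair, which is what "single-model encoder-decoder" means in the article; the real system replaces these toy functions with a pretrained Transformer.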



(c) 2019 Geo Glance