Microsoft announced that it has built a new caption-generating artificial intelligence model that can describe images more accurately than humans can.
Xuedong Huang, a Microsoft Technical Fellow, explained that image captioning is one of the core computer vision capabilities underpinning a wide range of services.
Microsoft’s new caption generation AI model is available through the Computer Vision offering in Azure Cognitive Services, part of Azure AI services, and developers can use it to improve the accessibility of their own services. The model is already included in Seeing AI, a camera app for blind and low-vision users developed by Microsoft, and will be integrated into the Windows and macOS versions of Microsoft Word, Outlook, and PowerPoint in late 2020.
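As a concrete illustration, the captioning capability can be reached through the Computer Vision REST API's "describe" operation. The sketch below assumes the v3.1 endpoint and response shape; the endpoint URL and subscription key are placeholders you would replace with your own Azure resource values.

```python
import json
import urllib.request

# Placeholder values -- substitute your own Azure Computer Vision resource.
ENDPOINT = "https://example-resource.cognitiveservices.azure.com"
API_KEY = "<your-subscription-key>"

def describe_image(image_url: str) -> dict:
    """Call the Computer Vision v3.1 'describe' operation on a public image URL."""
    request = urllib.request.Request(
        f"{ENDPOINT}/vision/v3.1/describe?maxCandidates=3",
        data=json.dumps({"url": image_url}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": API_KEY,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def best_caption(result: dict) -> str:
    """Pick the highest-confidence caption from a 'describe' response."""
    captions = result["description"]["captions"]
    return max(captions, key=lambda c: c["confidence"])["text"]
```

For example, a response of the form `{"description": {"captions": [{"text": "a dog lying on a couch", "confidence": 0.92}]}}` would yield `"a dog lying on a couch"` from `best_caption`.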
The model can add captions to any photo, from images displayed in search engines to photos embedded in PowerPoint. Saqib Shaikh, a software engineering manager in Microsoft's AI Platform group, explained that using this captioning capability to generate descriptions for photos in web pages and documents is important for people who are blind or have low vision.
The development team has integrated the caption generation model into Seeing AI. When a user points the camera at a scene, Seeing AI generates a caption, helping blind users understand what is in front of them. Ideally, every image in documents, on the web, and on social media would be captioned, giving visually impaired people access to the same information and letting conversations continue as naturally as if they were there in person. Because not everyone describes or captions every image they share, an app that supplies image captions automatically fills an important gap.
Lijuan Wang, a research manager at Microsoft Research, says image captioning is a central challenge in computer vision: the AI must correctly identify the elements of an image, understand what is happening and how objects and actions relate to one another, and summarize all of this in a natural-language sentence.
Microsoft's caption generation model is said to have matched or exceeded human scores on nocaps, an image captioning benchmark. nocaps measures how accurately a model captions images containing objects not included in the dataset it was trained on. Microsoft's model is pre-trained on a rich dataset of images paired with word tags, reinforcing the mapping between specific objects and words.
The way Microsoft reinforces this object-to-word mapping is similar to teaching a child about cats using a picture book that shows a cat alongside the printed word "cat". Having learned individual words in this pre-training stage, the model is then fine-tuned on an image dataset that includes full captions, improving caption accuracy. As a result, the model can generate accurate captions with natural vocabulary even for images it has never seen.
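The two-stage idea can be sketched with a deliberately simplified toy: detected objects stand in for real visual features, co-occurrence counts stand in for pre-training, and a fixed template stands in for the learned caption generator. This is only an analogy for the picture-book example above, not Microsoft's actual transformer-based model; all names here are invented for illustration.

```python
from collections import Counter, defaultdict

def pretrain_vocabulary(tagged_images):
    """Stage 1 (toy): count which word tags co-occur with each detected
    object across (objects, tags) pairs, then keep the most frequent word
    per object -- a crude stand-in for a learned visual vocabulary."""
    counts = defaultdict(Counter)
    for objects, tags in tagged_images:
        for obj in objects:
            counts[obj].update(tags)
    return {obj: c.most_common(1)[0][0] for obj, c in counts.items()}

def caption(objects, vocab):
    """Stage 2 (toy): compose the grounded words into a sentence.
    The real model learns this composition from fully captioned images."""
    words = [vocab.get(obj, "something") for obj in objects]
    return "a photo of " + " and ".join(words)

# Three tagged training images: detector outputs paired with word tags.
data = [
    (["det_cat"], ["cat"]),
    (["det_cat", "det_dog"], ["cat", "dog"]),
    (["det_dog"], ["dog"]),
]
vocab = pretrain_vocabulary(data)
print(caption(["det_cat", "det_dog"], vocab))  # a photo of cat and dog
```

Here `det_cat` co-occurs with "cat" twice and "dog" once, so pre-training grounds it to "cat"; captioning then only has to arrange the grounded words, mirroring how the real model's fine-tuning stage learns sentence structure on top of a pre-trained visual vocabulary.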
In addition, on other benchmarks widely used in the industry, the new model is said to be twice as good as the image captioning model that has shipped in Microsoft products since 2015.