Microsoft’s Project Alexandria parses documents using unsupervised learning

In 2014, Microsoft launched Project Alexandria, a research effort within its Cambridge research division dedicated to discovering entities (topics of information) and their associated properties. Building on the lab’s work on knowledge mining using probabilistic programming, Alexandria aimed to construct a full knowledge base automatically from a set of documents.

Alexandria technology powers the recently announced Microsoft Viva Topics, which automatically organizes large amounts of content and expertise in an organization. Specifically, the Alexandria team is responsible for identifying topics and metadata, employing AI to parse the content of documents in datasets.

To get a sense of how far Alexandria has come — and still has to go — VentureBeat spoke with Viva Topics director of product development Naomi Moneypenny, Alexandria project lead John Winn, and Alexandria engineering manager Yordan Zaykov in an interview conducted via email. They shared insights on the goals of Alexandria as well as major breakthroughs to date, and on challenges the development team faces that might be overcome with future innovations.

Parsing knowledge

Finding information in an enterprise can be hard, and a number of studies suggest that this inefficiency can impact productivity. According to one survey, employees could potentially save four to six hours a week if they didn’t have to search for information. And Forrester estimates that common business scenarios like onboarding new employees could be 20% to 35% faster.

Alexandria addresses this in two ways: topic mining and topic linking. Topic mining involves the discovery of topics in documents and the maintenance and upkeep of those topics as documents change. Topic linking brings together knowledge from a range of sources into a unified knowledge base.

“When I started this work, machine learning was mainly applied to arrays of numbers — images, audio. I was interested in applying machine learning to more structured things: collections, strings, and objects with types and properties,” Winn said. “Such machine learning is very well suited to knowledge mining, since knowledge itself has a rich and complex structure. It is very important to capture this structure in order to represent the world accurately and meet the expectations of our users.”

The idea behind Alexandria has always been to automatically extract knowledge into a knowledge base, initially with a focus on mining knowledge from websites like Wikipedia. But a few years ago, the project transitioned to the enterprise, working with data such as documents, messages, and emails.

“The transition to the enterprise has been very exciting. With public knowledge, there is always the possibility of using manual editors to create and maintain the knowledge base. But inside an organization, there is huge value to having a knowledge base be created automatically, to make the knowledge discoverable and useful for doing work,” Winn said. “Of course, the knowledge base can still be manually curated, to fill gaps and correct any errors. In fact, we have designed the Alexandria machine learning to learn from such feedback, so that the quality of the extracted knowledge improves over time.”

Knowledge mining

Alexandria achieves topic mining and linking through a machine learning approach called probabilistic programming, in which a program describes the process by which topics and their properties come to be mentioned in documents. The same program can be run backward to extract topics from documents. An advantage of this approach is that information about the task is encoded in the probabilistic program itself rather than in labeled data. That enables the process to run unsupervised, meaning it can perform these tasks automatically, without any human input.
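
To make "running the program backward" concrete, here is a toy sketch in plain Python. It is not Alexandria's model or code. The forward function, the 0.9 accuracy figure, the uniform prior, and the candidate names are all assumptions made for illustration. The forward model says how likely a document is to mention a given owner if the true owner is known; Bayes' rule inverts it to ask which owner is most probable given the mentions observed.

```python
def forward_mention_prob(true_value, mentioned, candidates, accuracy=0.9):
    """Forward model (assumed): P(a document mentions `mentioned` | true value).

    A document states the true value with probability `accuracy`; otherwise
    it names one of the other candidates uniformly at random.
    """
    if mentioned == true_value:
        return accuracy
    return (1.0 - accuracy) / (len(candidates) - 1)


def posterior(observations, candidates, accuracy=0.9):
    """Invert the forward model with Bayes' rule.

    Returns P(true value = candidate | observed mentions) for each candidate,
    starting from a uniform prior. Requires at least two candidates.
    """
    weights = {}
    for candidate in candidates:
        weight = 1.0 / len(candidates)  # uniform prior over candidates
        for mentioned in observations:
            weight *= forward_mention_prob(candidate, mentioned, candidates, accuracy)
        weights[candidate] = weight
    total = sum(weights.values())
    return {candidate: weight / total for candidate, weight in weights.items()}


if __name__ == "__main__":
    candidates = ["Jane Smith", "John Doe", "Ann Lee"]   # hypothetical owners
    mentions = ["Jane Smith", "Jane Smith", "John Doe"]  # one document disagrees
    for name, prob in posterior(mentions, candidates).items():
        print(f"{name}: {prob:.3f}")
```

With two of three documents naming Jane Smith, the posterior puts about 0.94 on her under these assumed numbers. No labeled training data is involved; everything the sketch knows about the task lives in the forward model.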

“A lot of progress has been made in the project since its founding. In terms of machine learning capabilities, we built numerous statistical types to allow for extracting and representing a large number of entities and properties, such as the name of a project, or the date of an event,” Zaykov said. “We also developed a rigorous conflation algorithm to confidently determine whether the information retrieved from different sources refers to the same entity. As to engineering advancements, we had to scale up the system — parallelize the algorithms and distribute them across machines, so that they can operate on truly big data, such as all the documents of an organization or even the entire web.”

To narrow down the information that needs to be processed, Alexandria first runs a query engine, which can scale to over a billion documents, to extract from each document the snippets with a high probability of containing knowledge. For example, if the model were parsing a document related to a company initiative called Project Alpha, the engine would extract phrases likely to contain entity information, like “Project Alpha will be released on 9/12/2021” or “Project Alpha is run by Jane Smith.”

The parsing process requires identifying which parts of text snippets correspond to specific property values. In this approach, the model looks for a set of patterns — templates — such as “Project {name} will be released on {date}.” By matching a template to text, the process can identify which parts of the text correspond with certain properties. Alexandria performs unsupervised learning to create templates from both structured and unstructured text, and the model can readily work with thousands of templates.
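
The template idea above can be sketched in a few lines of Python. This is a deliberately simplified, deterministic stand-in: it compiles a template into a regular expression and reads off the slots. Alexandria learns its templates and matches them probabilistically. The helper names `template_to_regex` and `match_template` are hypothetical.

```python
import re


def template_to_regex(template):
    """Compile a template such as 'Project {name} will be released on {date}'
    into a regex whose named groups correspond to the {slots}."""
    # re.split with a capturing group alternates literal text and slot names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)        # literal text, matched exactly
        else:
            pattern += f"(?P<{part}>.+?)"     # slot, captured lazily
    # Allow an optional trailing period and require the snippet to end there.
    return re.compile(pattern + r"\.?$")


def match_template(template, snippet):
    """Return a dict of slot values if the snippet fits the template, else None."""
    match = template_to_regex(template).match(snippet)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(match_template("Project {name} will be released on {date}",
                         "Project Alpha will be released on 9/12/2021"))
    # → {'name': 'Alpha', 'date': '9/12/2021'}
    print(match_template("Project {name} is run by {owner}",
                         "Project Alpha is run by Jane Smith."))
    # → {'name': 'Alpha', 'owner': 'Jane Smith'}
```

A snippet that fits no template simply returns `None`, so a real system would try many templates per snippet and keep the ones that match.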

The next step is linking, which identifies duplicate or overlapping entities and merges them using a clustering process. Typically, Alexandria merges hundreds or thousands of items to create entries along with a detailed description of the extracted entity, according to Winn.
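
As a rough illustration of linking, the sketch below groups extracted records whose names normalize to the same string and pools their properties into one entry. This is an assumption-heavy simplification: the article describes a clustering process, and deciding that two mentions are the same entity is much harder than comparing lowercase names. The record layout and function names are hypothetical.

```python
from collections import defaultdict


def normalize(name):
    """Crude identity key (assumed): lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def link_records(records):
    """Cluster records by normalized name and merge their properties.

    Each record is a dict with a 'name' key plus any extracted properties.
    Returns {entity_key: {property: sorted list of distinct values}}.
    """
    clusters = defaultdict(lambda: defaultdict(set))
    for record in records:
        entity = clusters[normalize(record["name"])]
        for prop, value in record.items():
            entity[prop].add(value)
    return {
        key: {prop: sorted(values) for prop, values in props.items()}
        for key, props in clusters.items()
    }


if __name__ == "__main__":
    extracted = [
        {"name": "Project Alpha", "owner": "Jane Smith"},
        {"name": "project  alpha", "release_date": "9/12/2021"},
        {"name": "Project Beta", "owner": "John Doe"},
    ]
    for key, props in link_records(extracted).items():
        print(key, props)
```

Here the two Project Alpha records collapse into one entry carrying both the owner and the release date, while Project Beta stays separate.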

Alexandria’s probabilistic program can also help sort out errors introduced by humans, like documents in which a project owner was recorded incorrectly. And the linking process can analyze knowledge coming from other sources, even if that knowledge wasn’t mined from a document. Wherever the information comes from, it’s linked together to provide a single unified knowledge base.

Real-world applications

As Alexandria pivoted to the enterprise, the team began exploring experiences that could support employees working with organizational knowledge. One of these experiences grew into Viva Topics, a module of Viva, Microsoft’s collaboration platform that brings together communications, knowledge, and continuous learning.

Viva Topics taps Alexandria to organize information into topics delivered through apps like SharePoint, Microsoft Search, and Office, and soon Yammer, Teams, and Outlook. Extracted projects, events, and organizations with related metadata about people, content, acronyms, definitions, and conversations are presented in contextually aware cards.

“With Viva Topics, [companies] are able to use our AI technology to do much of the heavy lifting. This frees [them] up to work on contributing [their] own perspectives and generating new knowledge and ideas based on the work of others,” Moneypenny said. “Viva Topics customers are organizations of all sizes with similar challenges: for example, when onboarding new people, changing roles within a company, scaling individual’s knowledge, or being able to transmit what has been learned faster from one team to another, and innovating on top of that shared knowledge.”

Technical challenges lie ahead for Alexandria, but also opportunities, according to Winn and Zaykov. In the near term, the team hopes to create a schema exactly tailored to the needs of each organization. This would let employees find all events of a given type (e.g. “machine learning talk”) happening at a given time (“the next two weeks”) in a given place (“the downtown office building”), for example.
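
To show what such a schema-backed query might look like, here is a small Python sketch. The `Event` type, its fields, and the `find_events` helper are hypothetical and stand in for whatever schema an organization would actually end up with.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Event:
    """Hypothetical typed entity with the properties the query needs."""
    name: str
    event_type: str
    day: date
    place: str


def find_events(events, event_type, start, end, place):
    """Return events of the given type held at `place` between start and end (inclusive)."""
    return [
        event for event in events
        if event.event_type == event_type
        and start <= event.day <= end
        and event.place == place
    ]


if __name__ == "__main__":
    today = date(2021, 7, 1)  # fixed date so the example is reproducible
    events = [
        Event("Intro to topic models", "machine learning talk", date(2021, 7, 6), "downtown office building"),
        Event("Scaling inference", "machine learning talk", date(2021, 8, 3), "downtown office building"),
        Event("Quarterly town hall", "town hall", date(2021, 7, 8), "downtown office building"),
        Event("Embeddings 101", "machine learning talk", date(2021, 7, 9), "campus"),
    ]
    upcoming = find_events(events, "machine learning talk",
                           today, today + timedelta(weeks=2),
                           "downtown office building")
    print([event.name for event in upcoming])
    # → ['Intro to topic models']
```

The query only works because type, date, and place are explicit, typed properties, which is the point of tailoring the schema to each organization.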

Beyond this, the Alexandria team aims to develop a knowledge base that leverages an understanding of what a user is trying to achieve and automatically provides relevant information to help them achieve it. Winn calls this “switching from passive to active use of knowledge,” because the idea is to switch from passively recording the knowledge in an organization to actively supporting work being done.

“We can learn from past examples what steps are required to achieve particular goals and help assist with and track these steps,” Winn explained. “This could be particularly useful when someone is doing a task for the first time, as it allows them to draw on the organization’s knowledge of how to do the task, what actions are needed, and what has and hasn’t worked in the past.”
