From a knowledge-representation standpoint, this diverse information is formally represented using relational models. Knowledge graphs are large-scale efforts to formally represent as much information as possible so that it can be manipulated and queried by computers. YAGO, for instance, assembles multiple sources (such as Wikipedia and GeoNames) into a consistent relational structure that covers as much general knowledge as possible.
The use of knowledge graphs in data science faces two challenges. The first is data preparation: the information contained in knowledge graphs must be extracted to be fed into a statistical-modeling algorithm, such as a supervised learning model. Traditionally, this step is done manually, with the data scientist crafting SQL or SPARQL queries, and is very time consuming. The second is that integrating information across multiple sources, both to build a knowledge graph and to combine information in a data-science analysis, must cope with variability in how shared entities are written: “Londres” in one dataset may need to be matched with “London” in another. This problem is known as entity matching in NLP and database research, or as deduplication and record linkage in data management.
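As a toy illustration of the surface-form problem, a character n-gram similarity can relate spellings that exact string matching would miss. This is a minimal sketch, not the method proposed for the internship; function names and the choice of trigrams are illustrative:

```python
def char_ngrams(s, n=3):
    """Set of character n-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_jaccard(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

# "Londres" and "London" share trigrams ("lon", "ond"), unlike "Paris"
print(ngram_jaccard("Londres", "London"))  # 2 shared trigrams out of 7 -> ~0.29
print(ngram_jaccard("Paris", "London"))    # 0.0
```

Exact matching would give these pairs the same score (zero); the n-gram overlap is what lets “Londres” surface as a candidate match for “London”.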
Proposed work: To address these problems, we propose to adapt knowledge-graph embedding techniques, building on the most recent models in this family, to generate numerical representations (feature vectors) for all the entities a graph represents. To deal with heterogeneity in the “surface form” (the string representation) of entities, we will add a string-modeling layer, using sequence-modeling tools as developed for NLP.
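To picture what a string-modeling layer buys: in the simplest case, an entity vector can be built from hashed character n-grams (fastText-style), so that close surface forms get close vectors. The sketch below is only one way to illustrate the idea; the dimensions, the CRC hashing trick, and the function names are assumptions for this example, and the internship would use learned sequence models instead:

```python
import zlib
import numpy as np

def ngram_vector(s, dim=64, ns=(2, 3)):
    """Entity vector: normalized sum of per-n-gram pseudo-random vectors."""
    s = s.lower()
    vec = np.zeros(dim)
    for n in ns:
        for i in range(len(s) - n + 1):
            gram = s[i:i + n]
            # Deterministic vector per n-gram, seeded by a CRC hash of the gram
            rng = np.random.RandomState(zlib.crc32(gram.encode()) % (2 ** 32))
            vec += rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

def cosine(u, v):
    """Cosine similarity of two unit-normalized vectors."""
    return float(u @ v)

# Shared n-grams make "Londres" closer to "London" than "Paris" is
print(cosine(ngram_vector("London"), ngram_vector("Londres")))
print(cosine(ngram_vector("London"), ngram_vector("Paris")))
```

Because shared n-grams contribute identical component vectors, “London” and “Londres” end up with a clearly higher cosine similarity than unrelated names, which is the property the string-modeling layer should give entity embeddings.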
- Background in machine learning or applied maths (mathematical optimization and statistics)
- Some familiarity with fitting deep neural networks (typically PyTorch)
Research theme: Machine learning, data science
Keywords: Neural networks, knowledge graph embedding, text modeling, deep learning, dirty data, relational data
Duration & salary: 3 to 6 months, between €500 and €800 monthly
Research teams: Parietal & Soda (INRIA Saclay)
Adviser: Gaël Varoquaux
Application: Interested candidates should send a CV and a motivation letter