From a knowledge-representation standpoint, this diverse information is formally represented using relational models. Knowledge graphs are large-scale efforts to formally represent as much information as possible so that it can be manipulated and queried by computers. YAGO, for instance, assembles multiple sources (such as Wikipedia and GeoNames) into a consistent relational structure that covers as much general knowledge as possible.
The use of knowledge graphs in data science faces two challenges. The first is data preparation: the information contained in knowledge graphs must be extracted to be fed into a statistical-modeling algorithm, such as a supervised learning model. Traditionally, this step is done manually, with the data scientist crafting SQL or SPARQL queries, and is very time consuming. The second is that integrating information across multiple sources, both to build a knowledge graph and to combine information in a data-science analysis, must cope with variability in how shared entities are written: “Londres” in one dataset may need to be matched with “London” in another. This problem is known as entity matching in NLP and database research, or as deduplication and record linkage in data management.
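As a toy illustration of the surface-form problem, a character n-gram similarity can relate spellings that exact string matching would miss. This is a minimal sketch, not the method proposed for the internship; function names and the choice of trigrams are illustrative:

```python
def char_ngrams(s, n=3):
    """Set of character n-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_jaccard(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

# "Londres" and "London" share trigrams ("lon", "ond"), unlike "Paris"
print(ngram_jaccard("Londres", "London"))  # 2 shared trigrams out of 7 -> ~0.29
print(ngram_jaccard("Paris", "London"))    # 0.0
```

Exact matching would give these pairs the same score (zero); the n-gram overlap is what lets “Londres” surface as a candidate match for “London”.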
Proposed work: To address these problems, we propose to adapt knowledge-graph embedding techniques, building on the most recent models in this family, to generate numerical representations (feature vectors) for all the entities a graph represents. To deal with heterogeneity in the “surface form” (the string representation) of entities, we will add a string-modeling layer, using sequence-modeling tools as developed for NLP.
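To picture what a string-modeling layer buys: in the simplest case, an entity vector can be built from hashed character n-grams (fastText-style), so that close surface forms get close vectors. The sketch below is only one way to illustrate the idea; the dimensions, the CRC hashing trick, and the function names are assumptions for this example, and the internship would use learned sequence models instead:

```python
import zlib
import numpy as np

def ngram_vector(s, dim=64, ns=(2, 3)):
    """Entity vector: normalized sum of per-n-gram pseudo-random vectors."""
    s = s.lower()
    vec = np.zeros(dim)
    for n in ns:
        for i in range(len(s) - n + 1):
            gram = s[i:i + n]
            # Deterministic vector per n-gram, seeded by a CRC hash of the gram
            rng = np.random.RandomState(zlib.crc32(gram.encode()) % (2 ** 32))
            vec += rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

def cosine(u, v):
    """Cosine similarity of two unit-normalized vectors."""
    return float(u @ v)

# Shared n-grams make "Londres" closer to "London" than "Paris" is
print(cosine(ngram_vector("London"), ngram_vector("Londres")))
print(cosine(ngram_vector("London"), ngram_vector("Paris")))
```

Because shared n-grams contribute identical component vectors, “London” and “Londres” end up with a clearly higher cosine similarity than unrelated names, which is the property the string-modeling layer should give entity embeddings.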
- Background in machine learning or applied maths (mathematical optimization and statistics)
- Some familiarity with fitting deep neural networks (typically PyTorch)
Research theme: Machine learning, data science
Keywords: Neural networks, knowledge graph embedding, text modeling, deep learning, dirty data, relational data
Duration & salary: 3 to 6 months, between €500 and €800 monthly
Research teams: Parietal & Soda (INRIA Saclay)
Adviser: Gaël Varoquaux
Application: Interested candidates should send a CV and a motivation letter