Probabilistic databases are the method of choice to store uncertain data, which may arise from unreliable sources or derived by machine learning methods, in large scale knowledge bases like NELL or Google’s Knowledge Vault, to mention a few. The traditional model of probabilistic databases, building on the relational database model, assigns a probability with each ground fact, i.e., each tuple appearing in a relation. The ground facts are regarded as independent events. (This means no loss of generality, because dependencies can be introduced via constraints.) This model inherently makes a closed world assumption: facts not appearing in the database have zero probability, i.e., cannot possibly be true. This does not reflect our intuition and leads to implausible results.
We may instead regard facts not appearing in the database as being unlikely—chances are that otherwise we would have detected them—but not completely impossible. Based on this intuition, Ceylan, Darwiche, and van den Broek recently suggested an open world semantics for probabilistic databases. They set the framework and obtained some very interesting results, but leave a lot of room for further investigations.
The topic of the proposed thesis is to develop the theory of open world probabilistic databases. Guiding questions are:
- What are efficient query evaluation algorithms and heuristics, in particular for queries that go beyond the unions of conjunctive queries considered by Ceylan et al.?
- Can we exploit assumptions on the (expected) structure of the database, rather than assumptions on the structure of the query, for efficient query evaluation? What are realistic structural assumptions?
- The model of Ceylan et al. assumes a known finite universe (and in fact the running time of their algorithms is measured in terms of the size of the universe). A more realistic assumption would be an unknown universe of unknown size, which may be more adequately modelled by an infinite set. Is it feasible to extend the open-world probabilistic database model in this direction?
For all these topics, it will probably be helpful to deepen the connection to statistical relational AI.