Statistical Linguistic Topology

This project considers the unification of two view of language: that from natural language processing and that from linguistic typology. Our view is that typological information is both useful for solving real-world natural language processing and automatically derivable from language data. This research first explores how to use typological knowledge to improve performance on problems such as dependency parsing and machine translation for low density languages.

Intuitively, our statistical models waste time exploring a hypothesis space that is too big: the space of realistic grammars is much smaller than the space of all grammars. The second part of this research considers the automatic acquisition and bootstrapping of typological knowledge from raw text. The outcome of this research is: (a) improved statistical models for hard natural language processing problems; and (b) a larger library of typological universals that have been derived automatically from data. Our outcomes are empirically evaluated on the raw language processing tasks and in terms of the quality of the universal implications mined from data, but comparing them with known repositories of universals.

Our results will impact the fields of natural language processing and linguistics. From the research side, this research will find applications in a wider variety of problems than the ones we intend to study; in particular, the use of linguistic universals in natural language processing technology will fundamentally change the way multilinguality is addressed in this field. From a linguistics perspective, the goal of this project is to shed new light on linguistic universals. This should impact not only the area of typology, but also the study and preservation of endangered languages. By automatically identifying typological features and implications from data, the process of documenting endangered languages could be made more efficient: leading to a smaller loss of knowledge of these languages.

Principal Investigators