Language influences our thinking and affects many aspects of cognition, from how we perceive the world to how we interact socially. Thus, objectively characterizing linguistic background is crucial for research in many areas, including second language acquisition, psycholinguistics, and cognitive science. Traditional language proficiency tests, however, are manually composed by experts, limiting their scope for both lab and online settings. Here, we propose a pipeline that automatically derives a language proficiency test from a corpus of text and applies it to create new tests for 1,939 languages. Using this approach, we conducted a large-scale survey examining L1 and L2 proficiency across 34 countries, with participants tested on all 34 languages. Drawing on human ratings from 4,137 participants, our results validate that our test can effectively distinguish native speakers, second-language speakers, and nonspeakers within one minute, making it an effective tool for evaluating linguistic proficiency. We show that participants' linguistic and demographic backgrounds systematically influence both their language proficiency and their self-reported skills, and we map the prevalence of global languages, such as English and Spanish, among online participants. Moreover, we show that our vocabulary tests are strongly correlated with other linguistic competences, such as listening and writing, in a set of typologically varied languages, demonstrating that our test is an efficient instrument to assess language proficiency. More broadly, our work offers a significant resource for investigating global variation in language skills and contributes to reducing the overreliance on the English language in the cognitive and social sciences.
Published in: Proceedings of the National Academy of Sciences
Volume 123, Issue 13, e2420179123