<i>Escherichia coli</i> is a highly diverse organism that includes a range of commensal and pathogenic variants found across many niches worldwide. In addition to causing severe intestinal and extraintestinal disease, <i>E. coli</i> is considered a priority pathogen due to high levels of observed drug resistance. The diversity in the <i>E. coli</i> population is driven by high genome plasticity and a very large gene pool. Together, these features have made <i>E. coli</i> one of the most well-studied organisms, as well as a commonly used laboratory strain. Today, there are thousands of sequenced <i>E. coli</i> genomes stored in public databases. While data are widely available, accessing the information in order to perform analyses can still be a challenge. Collecting the relevant data requires accessing different sources, where data may be stored in a range of formats, and the data often require further manipulation and processing before analyses can be applied and useful information extracted. In this study, we collated and intensively curated a collection of over 10 000 <i>E. coli</i> and <i>Shigella</i> genomes to provide a single, uniform, high-quality dataset. <i>Shigella</i> were included as they are considered specialized pathovars of <i>E. coli</i>. We provide these data in a number of easily accessible formats that can be used as the foundation for future studies addressing the biological differences between <i>E. coli</i> lineages and the distribution and flow of genes in the <i>E. coli</i> population at a high resolution. The analysis we present emphasizes how limited our understanding of the true diversity of the <i>E. coli</i> species is, and how biased our current picture of the genetic diversity of such a key pathogen remains.