Goals and Vision

We are very pleased to be hosting the Workshop on Data Science and String Theory. This is the first meeting of its kind; its goal is to unite big data techniques with string theory in order to systematically understand the string landscape.

String theory is perhaps the most promising candidate for a unified theory of physics. As a quantum theory of gravity that naturally gives rise to general relativity at long distances and the building blocks for realistic particle and cosmological sectors, it satisfies a number of non-trivial necessary conditions for any unified theory. In fact, it is the only known theory that satisfies these necessary conditions. However, its extra dimensions of space allow for many compactifications to four dimensions, which give rise to a large landscape of vacua that may realize many different incarnations of particle physics and cosmology. Taming the landscape is therefore a central problem in theoretical physics, and is critical to making progress in understanding unification in string theory.

At this workshop, we will come together as a community to treat the landscape as what it clearly is: a big data problem. In fact, the datasets that arise in string theory may be among the largest in science. For example, in the type IIB region of the landscape it was originally estimated that there are 10⁵⁰⁰ flux vacua, an estimate that has in recent years exploded to 10^272,000. These vacua arise from turning on generalized fluxes in the extra-dimensional geometries of string theory, and there are over 10⁷⁵⁵ such geometries. The difficulty of dealing with these large numbers is further exacerbated by the computational complexity of the landscape, making it even clearer that sophisticated techniques are required.

Such problems are tailor-made for modern techniques in data science. Over the last few years, the application of data science techniques to known problems has changed many fields. In addition to impressive examples that are of less practical use, such as using deep learning to train programs that beat the world champion at Go (a system with 10¹⁷⁰ states) or mastered classic Atari games, data science has led to veritable revolutions in genetics, oceanography, climate science, and many other fields.

While the use of data science techniques to address problems in the string landscape is relatively new, there has already been promising progress. For example, genetic algorithms have been used to efficiently find viable string vacua and compute cohomology groups, where the latter determine the massless particle content of string vacua and are often inefficient to compute across large ensembles. Machine learning has been used to study Calabi-Yau manifolds and quiver gauge theories, as well as to generate a conjecture that led to a rigorous result proving the existence of E₆ models in 10⁷⁵¹ geometries, a subset of the 10⁷⁵⁵ mentioned above.

These efforts are closely related to those of the String Vacuum Project (SVP), an NSF-funded multi-institution effort that studied the string landscape from 2008 to 2014. The SVP led to many works related to particle physics and cosmology in the landscape and to new collaborations between its members. The many goals of the SVP included:

the enumeration and classification of string vacua;
the development of a detailed understanding of those string vacua with realistic low-energy phenomenologies;
the development of more explicit connections between string vacua and LHC data and phenomenology generally; and
statistical studies across the entire “landscape” of string vacua.

There was also a series of collaboration workshops, including a 2008 workshop at Arizona, 2010 workshops at KITP and Ohio State, and a 2011 workshop at UPenn.

In comparison, a renewed thrust to study the landscape (articulated, e.g., at String Phenomenology 2017) has many of the same goals as the SVP but places an increased emphasis on using modern techniques in data science. This approach is worth pursuing because the past decade has seen rapid advances in machine learning, distributed artificial intelligence, and generative adversarial networks. Furthermore, software tools such as RStudio, scikit-learn, TensorFlow and Torch have made it easier than ever for newcomers to rapidly approach the frontier questions of data science. Finally, cloud computing platforms (AzureML, Google Cloud MLE, Predictive Analytics with AWS, etc.) have made high-performance computing and machine learning affordable and scalable. Given the progress these developments have made possible in other fields, many in our community believe a similar revolution in our understanding of fundamental physics may well be at hand.

At this meeting we will share recent progress and discuss next steps as a community.