# String Data Program

The program will involve a number of seminars on recent research and also breakout groups on specific topics within data science.

Since this is an interdisciplinary conference with participants from multiple communities, some introductory and vision talks are in order. Physicist Michael Douglas will give an overview talk on the string landscape, physicist Cumrun Vafa will review aspects of the landscape and swampland, and physicist Tom Rudelius will introduce string theory and its data structures to data scientists. In a breakout group, computer scientists Scott Neal Reilly and Jeff Druce of Charles River Analytics will broadly discuss machine learning from the point of view of computer science and industry. Other breakout groups will introduce theory and code related to network science, reinforcement learning, and supervised machine learning.

## Speakers

- Frederik Denef, Columbia University.
- Keith Dienes, University of Arizona.
- Michael Douglas, Simons Center for Geometry and Physics.
- Yang-Hui He, City University of London.
- Sven Krippendorf, LMU Munich.
- Cody Long, Northeastern University.
- Fernando Marchesano, IFT-Madrid.
- Bryan Ostdiek, University of Oregon.
- Fernando Quevedo, ICTP and Cambridge University.
- Tom Rudelius, IAS.
- Fabian Ruehle, Oxford University.
- Rak-Kyeong Seong, Tsinghua University.
- Gary Shiu, University of Wisconsin.
- Washington Taylor, MIT.
- Cumrun Vafa, Harvard University.
- Yi-Nan Wang, MIT.

## Breakout Groups

The workshop will include a number of smaller 90-minute breakout groups focusing on specific topics in data science. Each breakout group will consist of three components:

- **Theory**. An overview of the theory behind the topic in data science.
- **Code**. Publicly available code exemplifying the topic, and sometimes its application in string theory settings. Instructions for pulling and running the code will be posted one week ahead of the workshop so that interested participants can follow and run code in real time.
- **Discussion**. An open moderator-led conversation about the topic and its potential applications in string theory. Possible questions for discussion will be distributed to the participants one week ahead of the workshop.

The breakout groups are:

**Practical Advances in Machine Learning: A Computer Science Perspective** - Dr. Scott Neal Reilly and Dr. Jeff Druce, Charles River Analytics

**Abstract:** Charles River Analytics is a small computer-science R&D consulting company focused on R&D that will enable new practical applications of AI, machine learning, data science, and big data. They will provide a short presentation of some of the latest advances in these areas, including discussions of probabilistic programming, novel deep learning architectures, and various types of ensemble machine learning. They will talk about some of the challenges of various machine learning problems and some of the techniques available to overcome them. They will also discuss some of the opportunities and challenges associated with collaborations between academic and industrial research groups in the hope of opening a productive dialog about how to create a two-way pipeline of data, domain requirements, and technical advances between the two communities. The breakout will conclude with time for discussion.

**Discussion Moderators:** Brent Nelson and Fernando Quevedo

**Possible Discussion Questions:**

*1)* Which ML techniques and tools do string theorists currently use?

*2)* What are the challenges associated with those techniques and tools?

*3)* What are the biggest challenges associated with applying ML to string theory problems?

*4)* What are the string theory problems (in layman’s terms if possible!) that are most appropriate for ML to help with?

*5)* What kinds of university-industry collaborations have string theorists engaged in? What worked well or didn’t work so well?

*6)* How do computer scientists determine which techniques to apply to one type of problem versus another? What information is necessary to begin addressing this question?

*7)* When does it benefit industry to partner with academics?

**Reinforcement Learning** - Fabian Ruehle, Oxford University.

**Abstract:** Reinforcement learning is a machine learning technique based on behavioural psychology. The idea is to have a machine interact with its environment (e.g. traverse the string landscape). The machine receives rewards for interactions that lead to good results (e.g. approaching a physically interesting string vacuum) and/or is punished for steps that lead to undesirable results (e.g. violating the mathematical consistency of the model). The machine explores its environment with the goal of maximizing its long-term reward, thus hopefully finding interesting states in its environment (like the Standard Model of Particle Physics).

In order to exemplify the idea behind reinforcement learning, we will set up a very simple environment called gridworld to mimic the exploration of the string landscape. This environment is essentially a maze with walls, pitfalls, and an exit. In the analogy, the walls would be boundaries of the string landscape (such as a negative number of branes), the pitfalls would be undesirable physical properties (such as mathematically inconsistent states), and the exit would be a Standard Model state. We then expose an agent to this environment and let it learn to navigate the maze and find the exit.

**Discussion Moderators:** Keith Dienes and Wati Taylor

**Possible Discussion Questions:**

*1)* In what ways is gridworld a good or bad analogy to the string landscape?

*2)* What could a gridworld look like for F-Theory model building, intersecting brane models in IIA/B, Heterotic CYs with vector bundles, Heterotic SCFTs (free fermionic, orbifolds), etc.?

*3)* How does reinforcement learning compare to other machine learning techniques?

*4)* What are the relative advantages of different reinforcement learning algorithms (SARSA, DQN, DDPG, A3C, Wolpertinger, Rainbow, …)? How might the relative advantages be utilized in string theory?

*5)* How might reinforcement learning be useful elsewhere in the landscape, such as in inflationary setups?

*6)* What libraries currently exist or are being developed for RL and what algorithms are implemented? How should one determine which algorithm to use?
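
The gridworld described in the abstract above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the code used in the breakout group: the 4x4 maze layout, the reward values, and the choice of tabular Q-learning (rather than any of the algorithms named in question 4) are all assumptions made for the example.

```python
import random

# Hypothetical 4x4 maze: S = start, E = exit, # = wall, P = pitfall, . = free cell.
GRID = ["S..#",
        ".#..",
        ".P..",
        "...E"]
START = (0, 0)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    # Leaving the grid or hitting a wall keeps the agent in place (assumed penalty).
    if not (0 <= nr < len(GRID) and 0 <= nc < len(GRID[0])) or GRID[nr][nc] == "#":
        return state, -1.0, False
    cell = GRID[nr][nc]
    if cell == "P":
        return (nr, nc), -10.0, True   # pitfall: punished, episode ends
    if cell == "E":
        return (nr, nc), 10.0, True    # exit: rewarded, episode ends
    return (nr, nc), -0.1, False       # small step cost favours short paths


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy exploration policy."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * 4 for r in range(len(GRID)) for c in range(len(GRID[0]))}
    for _ in range(episodes):
        state = START
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.randrange(4)
            else:
                action = max(range(4), key=q[state].__getitem__)
            nxt, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


def greedy_path(q, max_steps=20):
    """Follow the learned greedy policy from the start cell."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(range(4), key=q[state].__getitem__)
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


if __name__ == "__main__":
    print(greedy_path(train()))
```

Running the script prints the sequence of cells the trained agent visits. Changing the reward values or the maze layout is a quick way to see how the learned route depends on how "good" and "bad" states are scored.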

**Network Science** - Will Cunningham, Northeastern University.

**Abstract:** Network science lies at the center of graph theory, data mining, computer vision, and statistical inference. Researchers use methods from these fields to control, predict, and understand the social (Facebook, Twitter), technological (Internet, power grid), and biological (brain, proteins) networks we encounter in our daily lives. In this interactive lecture, we study these problems using graph algorithms in Python to look at structural properties, dynamical growth models, and diffusion processes.

**Discussion Moderators:** Thomas Grimm and Cody Long

**Possible Discussion Questions:**

*1)* What objects in string theory can be modeled as graphs, i.e., a collection of objects with relations? What about as statistical ensembles of graphs?

*2)* In the string landscape, what types of processes can you imagine occurring? Ideas for toy models?

*3)* How might you model information entanglement using networks?

*4)* What measures of centrality would we expect to be important in these models?

*5)* In network science, we also think about information routing, community detection and link prediction, and network resilience in the face of link or node failures. How might these concepts apply to string theory?
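
As one concrete instance of the centrality measures raised in question 4, the sketch below computes eigenvector centrality by power iteration in plain Python. The toy graph and the function names are hypothetical and chosen only for illustration. The iteration uses the shifted update x ← x + Ax, which has the same leading eigenvector as the adjacency matrix A but avoids oscillation on bipartite graphs; a connected undirected graph is assumed.

```python
from math import sqrt


def build_adjacency(edges):
    """Turn an undirected edge list into {node: set_of_neighbours}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def eigenvector_centrality(adjacency, iterations=200, tol=1e-10):
    """Power iteration with the shifted update x <- x + A x, normalised to unit length."""
    nodes = sorted(adjacency)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {n: score[n] + sum(score[m] for m in adjacency[n]) for n in nodes}
        norm = sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        converged = max(abs(new[n] - score[n]) for n in nodes) < tol
        score = new
        if converged:
            break
    return score


# Hypothetical toy graph: node "a" is a hub, "e" hangs off "d".
EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("d", "e")]

if __name__ == "__main__":
    centrality = eigenvector_centrality(build_adjacency(EDGES))
    for node, value in sorted(centrality.items(), key=lambda kv: -kv[1]):
        print(node, round(value, 3))
```

In this toy graph the hub "a" receives the highest score, and "d" outranks the leaves "b" and "c" because it is connected to the hub and has a neighbour of its own.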

**Supervised Machine Learning** - Sven Krippendorf, LMU Munich.

**Abstract:** Supervised learning is the machine learning task of inferring a function from a training data set. The theory part aims to give a short introduction to some of the algorithms in use, including support vector machines with different regression methods, linear discriminant analysis, decision trees, the k-nearest neighbour algorithm, and neural networks. We also aim to point out some of the common issues in supervised learning, such as the bias vs. variance trade-off, complexity of training data, and overfitting of data.

The aim of the code section is to discuss two or three examples of supervised learning: (1) the "harmonic oscillator" of machine learning, handwritten number recognition; (2) learning geometric properties of data; (3) pattern recognition in timed sequences, such as in speech recognition.

**Discussion Moderators:** Per Berglund and Rak-Kyeong Seong

**Possible Discussion Questions:**

*1)* Optimising learning algorithms vs more data collection in string phenomenology.

*2)* How heterogeneous is our data in string theory?

*3)* Can we use different kernel functions in SVMs?

*4)* What are the relative advantages of popular open source ML packages?

*5)* What datasets are currently publicly available in string theory for training, and what questions might supervised ML be able to address?

*6)* What further data sets do we need to expand our understanding of string theory, such as moving away from toric geometry to broader geometric setups?
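
To make the abstract's remarks on the k-nearest neighbour algorithm and overfitting concrete, here is a minimal sketch in plain Python. The training points, labels, and function name are hypothetical and exist only for this example. One point is deliberately mislabeled to show how a small k fits noise in the training data while a larger k smooths it out.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two clusters, plus one mislabeled point inside cluster "A".
TRAIN = [
    ((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((0.1, 0.3), "A"),
    ((1.0, 1.0), "B"), ((0.9, 1.2), "B"), ((1.1, 0.8), "B"),
    ((0.15, 0.15), "B"),  # label noise
]

if __name__ == "__main__":
    query = (0.16, 0.16)
    print("k=1:", knn_predict(TRAIN, query, k=1))  # follows the noisy neighbour
    print("k=3:", knn_predict(TRAIN, query, k=3))  # majority of the local cluster
```

With k=1 the prediction copies the single mislabeled neighbour (high variance), whereas k=3 recovers the label of the surrounding cluster. Pushing k too high would eventually blur the two clusters together, which is the bias side of the trade-off.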


## Workshop Code Repository

We encourage participants to share data and code related to this workshop in this GitHub repo, which is where the breakout group codes will be posted.

## Titles & Abstracts

The titles and abstracts of plenary talks, in order of appearance, are listed below.

**Michael Douglas**, *Computational Exploration of the String Landscape*

Abstract: We introduce the string landscape and the physical and mathematical tools used to study it. We then survey a few of the testable predictions we hope to get: the Kaluza-Klein and supersymmetry breaking scales, and the quantity and types of additional matter which might be discovered at colliders or as dark matter. We then discuss how computational methods can help, and what we will need as inputs: to start with, databases of many mathematical structures. These include finite groups, singularities, Calabi-Yau manifolds, lattices, automorphic functions, etc., and tools to work with this data. We will also need to formalize physical concepts such as effective field theories and maintain databases of these as well. If well designed, these databases and tools will be a repository of mathematical knowledge of permanent value for the mathematical and scientific communities. Thus one of our primary near-term efforts should be to establish collaborations with the mathematical and computational communities, agree on our shared goals, and work together to realize them.

**Cumrun Vafa**, *Reflections on the String Landscape*

Abstract: I review aspects of the string landscape and the string swampland.

**Tom Rudelius**, *What Is String Theory? An Introduction for Data Scientists*

Abstract: String theory is the only known mathematically-consistent theory of quantum gravity. We will introduce string theory, explore the string landscape, and speculate on the marriage of string theory and machine learning. This talk is intended for those without a background in fundamental physics.

**Fabian Ruehle**, *Branes with Brains - Reinforcement learning in the landscape of intersecting brane worlds*

Abstract: We apply reinforcement learning to study the string landscape of Type II intersecting branes with orientifolds. I will give a basic introduction to the string setup and reinforcement learning. After that, I will explain how to map the string problem we are interested in, i.e. finding the Standard Model of Particle Physics, to an environment that can be analysed with A3C (asynchronous advantage actor-critic method) in reinforcement learning. I will end with preliminary results of our Type II string landscape analysis.

**Bryan Ostdiek**, *Learning through weak supervision*

Abstract: Determining the best method for training a machine learning algorithm is critical to maximizing its ability to classify data. In this talk, I compare the standard “fully supervised” approach (which relies on knowledge of event-by-event truth-level labels) with a recent proposal that instead utilizes class ratios as the only discriminating information provided during training. This so-called “weakly supervised” technique has access to less information than the fully supervised method and yet is still able to yield impressive discriminating power. In addition, weak supervision seems particularly well suited to particle physics since quantum mechanics is incompatible with the notion of mapping an individual event onto any single Feynman diagram. The technique is examined in detail – both analytically and numerically – with a focus on the robustness to issues of mischaracterizing the training samples. Weakly supervised networks turn out to be remarkably insensitive to a class of systematic mismodeling.

**Washington Taylor**, *Machine learning, incomputably large data sets, and the string landscape*

Abstract: The classes of problems for which machine learning is well-suited do not immediately match up with many of the questions that physicists want to answer about the large landscape of string vacuum solutions. In this talk I discuss the kinds of problems that arise in analyzing the string landscape, and more generally when dealing with a data set that in principle contains more elements than there are particles in the observable universe. I introduce as a specific example a well-defined graph describing the “skeleton” of the 4D F-theory landscape. This graph seems to contain at least 10^{3000} nodes, each corresponding to a distinct class of geometries, and can be explored locally but its global structure and statistics are not well understood.

**Rak-Kyeong Seong**, *Machine Learning of Calabi-Yau Volumes*

Abstract: In this talk, I will illustrate how machine learning techniques can be used to study the volume minimum of Sasaki-Einstein base manifolds of non-compact toric Calabi-Yau 3-folds. Under the AdS/CFT correspondence, the minimum volumes relate to central charges of a class of 4d N=1 superconformal field theories that arise from Type IIB brane configurations known as Brane Tilings. Employing machine learning techniques, I will show that the usual volume minimization procedure can be circumvented, giving us a way of obtaining central charges without the usual extremization procedure.

**Frederik Denef**, *Thoughts on the Future Landscape*

Abstract: I review some basic facts about the theory of the string landscape, with particular emphasis on the fact that we don’t actually have a theory. Although a fundamental microscopic description of the “Hilbert Space of Everything” presumably exists, it surely will look very different from anything we are currently familiar with in this line of research. To illustrate in what sense it might look different, I will outline a recent proposal for a complete microscopic definition of the Hilbert space of higher spin quantum gravity with a positive cosmological constant. Based on these observations, I will advocate it might be wise to develop AI / Machine Learning / Data Science applications in such a way that they are not tied specifically to our current favorite descriptions of physics. With flexible, general-purpose, newbie-friendly AI tools, future generations of young theorists are bound to get creative in as yet unanticipated directions. This would include in particular consolidation and organization of the ever-growing mountain of existing theoretical knowledge, currently scattered over O(10^6) research articles, unlearnable by humans but learnable by machines.

**Cody Long**, *Vacuum Selection from Cosmology on Networks of String Geometries*

Abstract: I will introduce network science as a tool to study cosmological models on large networks of string vacua. I will discuss two large networks of string geometries that can be explicitly constructed, where nodes are extra-dimensional six-manifolds and edges represent topological transitions between them. I will then show that a bubble cosmology model on the networks has late-time behavior determined by the eigenvector centrality of the networks, which provides a dynamical mechanism for vacuum selection in the string landscape.

**Michael Douglas**, *TBD*

**Sven Krippendorf**, *Towards a guiding principle for string model building using Shannon entropy*

**Gary Shiu**, *Topological Data Analysis for Cosmology and String Theory*

Abstract: Topological data analysis (TDA) is a multi-scale approach in computational topology used to analyze the “shape” of large datasets by identifying which homological characteristics persist over a range of scales. In this talk, I will discuss how TDA can be used to extract physics from cosmological datasets (e.g., primordial non-Gaussianities generated by cosmic inflation) and to explore the structure of the string landscape.

**Fernando Marchesano**, *TBD*

**Keith R. Dienes**, *Cautionary Tales from the Landscape*

Abstract: In this talk, I first describe a number of explicit results which come from a random statistical investigation of the heterotic landscape, focusing on correlations between gauge groups, degrees of supersymmetry, and cosmological constants. I then proceed to discuss a number of logistical issues which are likely to plague any random search through the landscape. In general these issues arise when attempting to extract statistical correlations from a large data set to which our computational access is necessarily limited. As one example, I discuss the problem of “floating correlations”, which reflects the fact that not all physically distinct string models are equally likely to be sampled in any random search through the landscape. This thereby causes apparent statistical correlations to “float” as a function of sample size. I also discuss a number of other similar complications, and propose several possible computational methods that can be used to overcome these problems.

**Yi-Nan Wang**, *The web of threefold bases in F-theory and machine learning*

Abstract: The classification of complex threefold bases is the foundation of the “geometric landscape program” under the 4D F-theory framework. In this talk, I will present our recent scanning results of the web of toric threefold bases used in 4D F-theory, based on the paper arXiv:1710.11235. We construct these base threefolds using a random blow-up sequence from a starting point base such as P3, and they are separated into two classes: the “good bases” and the “resolvable bases”. They correspond to different classes of low-energy physical models. We estimate the total number of these bases using a statistical approach on the directed graph that we scanned. Finally, I will talk about a number of possible applications of machine learning on this dataset.

**Yang-Hui He**, *Deep-Learning the Landscape*

Abstract: We propose a paradigm to deep-learn the ever-expanding databases which have emerged in mathematical physics and particle phenomenology, as diverse as the statistics of string vacua or combinatorial and algebraic geometry. As concrete examples, we establish multi-layer neural networks as both classifiers and predictors and train them with a host of available data ranging from Calabi-Yau manifolds and vector bundles, to quiver representations for gauge theories. We find that even a relatively simple neural network can learn many significant quantities to astounding accuracy in a matter of minutes and can also predict hitherto unencountered results. This paradigm should prove a valuable tool in various investigations in landscapes in physics as well as pure mathematics.