Martin Vechev

Martin Vechev is one of the rising stars in program analysis. He brings together statistical learning, program synthesis, and code analysis for new and exciting applications in code transformation and code deobfuscation. Practitioners may know him for JSNice, a tool that automatically deobfuscates JavaScript code by inferring the most likely names and types.

Martin is a tenure-track Assistant Professor of Computer Science at ETH Zürich, where he heads the Software Reliability Lab. He has won Faculty Awards and Extraordinary Accomplishment Awards from Google, Facebook, and IBM, and recently received an ERC Starting Grant, Europe’s highest research funding for individuals. He is also a keynote speaker at ISSTA 2016.

Martin – in your JSNice tool, anyone can paste in a piece of obfuscated JavaScript, and the tool automatically replaces the obfuscated names with readable ones that make sense. In three sentences: how does this work?

The approach works in two steps. First, during training, we learn a probabilistic graphical model from a corpus of un-obfuscated JavaScript programs available on GitHub. This model captures how good names tend to appear in programs. Second, at query time, JSNice predicts the most likely good names (according to the learned model) to replace the obfuscated ones.
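
To make the two-step idea concrete, here is a minimal, hypothetical sketch (not the actual JSNice implementation, which learns a much richer probabilistic model over program dependencies). It counts how readable names co-occur with syntactic relations in a small made-up training corpus, then predicts the highest-scoring name for an obfuscated identifier; all relations and names below are invented for illustration.

```python
# A hypothetical sketch of the two-step idea (not the real JSNice model):
# learn how readable names co-occur with syntactic relations, then pick
# the most likely name for an obfuscated identifier at query time.
from collections import Counter, defaultdict

# Step 1 (training): count how often a readable name appears in a given
# syntactic relation extracted from readable code. The relations below
# are invented for illustration.
training_corpus = [
    ("arg-of:getElementById", "elementId"),
    ("arg-of:getElementById", "id"),
    ("loop-counter", "i"),
    ("loop-counter", "index"),
    ("length-of", "count"),
]
model = defaultdict(Counter)
for relation, name in training_corpus:
    model[relation][name] += 1

# Step 2 (query): for an obfuscated variable, look at the relations it
# participates in and predict the highest-scoring readable name.
def predict_name(relations):
    scores = Counter()
    for r in relations:
        scores.update(model.get(r, Counter()))
    return scores.most_common(1)[0][0] if scores else None

# An obfuscated variable `a` used as a loop counter:
print(predict_name(["loop-counter"]))  # -> 'i' (tied with 'index' here)
```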

How was the response to JSNice? Anyone want to buy you out yet?

The response to JSNice around the Web (e.g., social media) has been very positive. The tool is being used by several hundred users daily and has had more than 150,000 users, from every country worldwide. We also open sourced the framework upon which JSNice is built at Nice2Predict.org, helping others build similar tools more quickly. We are also in the process of commercializing some of our most advanced research developments.

In your work, you make frequent use of Conditional Random Fields (CRFs). Can you explain their role in your work?

CRFs are a powerful probabilistic graphical model widely used in areas such as computer vision. The main advantage of CRFs is their ability to provide a joint prediction for a set of elements whose predictions depend on one another. For instance, JSNice predicts the names of an entire set of variables jointly, in one shot, by maximizing an overall score. CRFs are a good model when the task at hand involves predicting the values of multiple, dependent elements at the same time.
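
As a toy illustration of what joint prediction means here, the following sketch scores two dependent variable names together and picks the assignment that maximizes the combined score by brute-force enumeration. The unary and pairwise scores are invented; real CRFs learn them from data and use structured inference rather than enumeration.

```python
# A toy example of joint (CRF-style) prediction over two dependent
# variables, using brute-force MAP inference. All scores are invented;
# real CRFs learn them from data and use structured inference.
import itertools

labels = ["i", "len", "count"]

# Unary scores: how well a label fits each variable in isolation.
unary = {
    "x": {"i": 2.0, "len": 0.5, "count": 0.3},
    "y": {"i": 0.2, "len": 1.5, "count": 1.4},
}

# Pairwise score: suppose x is compared against y in `x < y`, so a
# counter/length pair should score higher than two counters.
def pairwise(lx, ly):
    return 1.0 if (lx == "i" and ly in ("len", "count")) else 0.0

best = max(
    itertools.product(labels, labels),
    key=lambda p: unary["x"][p[0]] + unary["y"][p[1]] + pairwise(*p),
)
print(best)  # -> ('i', 'len'): a joint choice, not two independent picks
```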

Your work is built on learning from “Big Code” – a large body of examples. Are these examples always correct? Isn’t there a risk that you could perpetuate the average in coding, rather than have people follow the role models?

To the first question, it is entirely possible that some examples are noisy: they are either incorrect or just poorly written programs. If the number of such cases is small, the probabilistic model will naturally discount them (assign them low probability) and thus they will have minimal influence on the predictions the model makes.

As to the second question, finding good data sets to learn from is important. However, if the data set consists of some bad, some great, and mostly average examples, we can still use probabilistic models to identify the great ones: the model can find the few outliers and then ask an oracle (e.g., a user) whether those outliers are bad or good. Since the difference between bad and good is typically significant, the model can use the oracle’s answers to classify the remaining outliers and identify the good ones. Those can then be used for various purposes (e.g., presented to the user).
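
A hypothetical sketch of that outlier-plus-oracle workflow, with a single invented "quality" score standing in for the learned model and a stand-in function playing the oracle's role:

```python
# A hypothetical sketch of the outlier-plus-oracle idea. Each example has
# a single "quality" score under some learned model; most examples sit
# near the average, and the outliers split into two well-separated groups.
scores = {
    "ex1": 0.52, "ex2": 0.48, "ex3": 0.55, "ex4": 0.50,  # average examples
    "ex5": 0.05, "ex6": 0.08,                            # one outlier group
    "ex7": 0.95, "ex8": 0.97,                            # another outlier group
}

mean = sum(scores.values()) / len(scores)
outliers = [e for e, s in scores.items() if abs(s - mean) > 0.3]

def oracle(example):
    """Stand-in for a user judging a single representative example."""
    return "great" if scores[example] > 0.5 else "bad"

# Ask the oracle about one representative per outlier group, then label
# the remaining outliers by proximity to the answered ones.
labeled = {e: oracle(e) for e in ("ex5", "ex7")}
for e in outliers:
    if e not in labeled:
        nearest = min(labeled, key=lambda l: abs(scores[l] - scores[e]))
        labeled[e] = labeled[nearest]

print(labeled)  # ex6 follows ex5 ('bad'), ex8 follows ex7 ('great')
```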

Our POPL’16 paper is a step in that direction.

Would it be possible to combine probabilistic inference with exact symbolic inference?

There are many interesting ways to combine the two. One possibility is to connect them in a feedback-directed guess-and-check loop: use probabilistic inference to make predictions about programs, followed by a symbolic analysis that checks these predictions. If a probabilistic prediction is incorrect, use the counterexample from the symbolic check to refine the probabilistic model before making the next prediction.
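
A minimal sketch of such a loop, where a ranked list of candidate guesses stands in for the probabilistic model and exhaustive testing over a small domain stands in for the symbolic checker; the task (guessing max(a, b)) and the candidates are invented for illustration.

```python
# A minimal sketch of the guess-and-check loop: a ranked list of candidate
# guesses stands in for the probabilistic model, and exhaustive testing
# over a small domain stands in for the symbolic checker.

candidates = [                       # "probabilistic" guesses, best first
    ("a", lambda a, b: a),
    ("b", lambda a, b: b),
    ("a if a > b else b", lambda a, b: a if a > b else b),
]

def spec(a, b):
    return max(a, b)

def symbolic_check(f):
    """Stand-in for a symbolic analysis: look for a counterexample."""
    for a in range(-5, 6):
        for b in range(-5, 6):
            if f(a, b) != spec(a, b):
                return (a, b)        # counterexample found
    return None                      # verified on this domain

counterexamples = []
for name, f in candidates:
    # Skip guesses already refuted by previously found counterexamples.
    if any(f(a, b) != spec(a, b) for a, b in counterexamples):
        continue
    cex = symbolic_check(f)
    if cex is None:
        print("verified:", name)     # -> verified: a if a > b else b
        break
    counterexamples.append(cex)      # feed back into the next guess
```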

Could people systematically defeat your analysis? How?

I am not sure I should answer this 🙂 An obvious way is to make the data set poor quality. A less obvious way is to learn the internals of the analysis (e.g., the kinds of features and probabilistic models it uses) and then transform the programs in the data set into semantically equivalent, similarly efficient programs from which it is harder to extract the information the learning works over.

There are interesting, fundamental questions here about the trade-off between how much of its internals the learning exposes and how much influence over its prediction accuracy one can obtain.

Your work requires experience in static analysis, machine learning, program synthesis, and more. Where do you get your students?

About half the students in my group came through the ETH M.Sc. program, which covers a wide range of topics, including the above. These students usually take my graduate course, which nowadays spans both theoretical and practical aspects of program analysis, synthesis, and probabilistic learning as they relate to programs. Others have done an M.Sc. elsewhere, come directly for a PhD, and pick up the necessary material as they go. I have been very lucky with the students in my group; they are all great.

While on the topic of students and Big Code, I believe Veselin Raychev’s upcoming PhD thesis will become a definitive reference for how to combine these areas. His thesis covers the entire spectrum, from core concepts all the way to state-of-the-art working systems (e.g., JSNice).

Should software researchers and/or computer scientists get more training in these fields? At the expense of what?

I believe most computer scientists would benefit from knowing the core concepts of automated program analysis (i.e., abstract interpretation) and synthesis as well as their relation to probabilistic learning.

A central concept of static analysis is approximation: abstracting the infinite (concrete) into the finite (abstract). That concept enables automated reasoning about complex systems and properties. Understanding how abstraction works is useful for developing clean concepts. Interestingly, approximation is also an important concept in statistical learning, and efficiently representing and manipulating sets of elements (concrete states in the case of static analysis) is also central in synthesis (where one represents sets of programs).
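
A textbook-style illustration of this kind of abstraction (not tied to any particular tool): the infinitely many integers are mapped to the finite domain of signs, over which arithmetic can still be tracked soundly.

```python
# A textbook-style example of abstraction: the infinitely many integers
# are mapped to the finite domain of signs, over which arithmetic can
# still be tracked soundly.

def alpha(n):
    """Abstraction function: map a concrete integer to its sign."""
    return "neg" if n < 0 else ("zero" if n == 0 else "pos")

def abstract_mul(x, y):
    """Abstract multiplication over the sign domain."""
    if "zero" in (x, y):
        return "zero"
    return "pos" if x == y else "neg"

def abstract_add(x, y):
    """Abstract addition; 'top' means the sign cannot be determined."""
    if x == y:
        return x
    if x == "zero":
        return y
    if y == "zero":
        return x
    return "top"                     # pos + neg could have any sign

# Soundness check on a few concrete pairs: the abstract result always
# covers the sign of the concrete result.
for a, b in [(3, 4), (-2, 5), (0, -7), (-3, -8)]:
    assert abstract_mul(alpha(a), alpha(b)) == alpha(a * b)
    assert abstract_add(alpha(a), alpha(b)) in (alpha(a + b), "top")
print("sign abstraction is consistent on the sampled pairs")
```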

Program synthesis contains various important concepts, including structuring the search space, representing sets of programs efficiently, and counterexample-guided search. In synthesis, a creative combination of the machine and the human (or other oracles) enables the synthesizer to uncover specs that are nearly impossible to formalize a priori, and to find solutions to tasks that are otherwise difficult to solve. Many tasks in systems, networks, security, and even cyber-physical systems can be framed as synthesis problems.
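
A minimal sketch of these search ideas, assuming a tiny invented grammar and task: enumerate a structured space of candidate expressions and use examples, via a counterexample-producing checker, to guide the search.

```python
# A minimal sketch of enumerative, example-guided search over a tiny
# invented grammar: expressions built from x, small constants, and one
# binary operator. The checker returns a counterexample when a candidate
# disagrees with the examples.
import itertools

atoms = ["x", "1", "2"]
ops = ["+", "*"]

def candidates():
    yield from atoms
    for a, op, b in itertools.product(atoms, ops, atoms):
        yield f"({a} {op} {b})"

def check(expr, examples):
    """Return an (input, expected) counterexample, or None if consistent."""
    for x, expected in examples:
        if eval(expr, {"x": x}) != expected:
            return (x, expected)
    return None

examples = [(0, 0), (3, 6)]           # target behavior: double the input
for expr in candidates():
    if check(expr, examples) is None:
        print("synthesized:", expr)   # -> synthesized: (x + x)
        break
```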

Probabilistic techniques are useful to know because they are good at dealing with noise and uncertainty, as well as at generalizing beyond what is seen in the training data. Both machine learning and program synthesis involve learning from examples, but usually in different settings: continuously parameterized spaces for machine learning and discrete search spaces for synthesis.
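
A small illustration of that contrast, with invented data: the continuous setting fits a parameter by gradient steps, while the discrete setting searches a finite list of candidate expressions directly.

```python
# A small, invented illustration of the contrast: fit a continuous
# parameter by gradient steps versus search a discrete space of
# candidate expressions. Both learn from the same examples.
examples = [(1, 3), (2, 5), (4, 9)]     # pairs (x, y) with y = 2*x + 1

# Continuous setting: learn the slope w of y ~ w*x + 1 by gradient descent.
w = 0.0
for _ in range(200):
    grad = sum(2 * ((w * x + 1) - y) * x for x, y in examples)
    w -= 0.01 * grad
print(f"learned slope: {w:.2f}")        # about 2.00

# Discrete setting: enumerate a small space of candidate programs directly.
programs = ["x + 1", "2*x", "2*x + 1", "x*x"]
consistent = [p for p in programs
              if all(eval(p, {"x": x}) == y for x, y in examples)]
print("consistent programs:", consistent)  # -> ['2*x + 1']
```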

At ETH, I teach all three of these directions in a single, interdisciplinary graduate course, allowing students to understand the fundamental concepts, the applications, and, importantly, the connections between the areas. The course is taken by students from various areas, and it is my impression that teaching all of these concepts together is important. In general, especially at the graduate level, it is critical to establish connections between areas. Students need to get into the mindset early on that they should search for and define these connections.

Where do you see the future of program analysis and testing? Is it in Big Code?

I see future work taking place along a number of different directions. At least in my group, we are exploring some of these.

First, beyond applications and tools, the area of Big Code is scientifically interesting for two reasons: (i) learning from Big Code means learning from many potentially noisy examples (and most existing program synthesizers can’t deal with noise very well), and (ii) unlike the traditional setting of Big Data (e.g., a dataset consisting of a set of images), these examples (i.e., programs) have well-defined semantics and can be automatically analyzed (interestingly, the analysis need not be sound), meaning that one can extract intermediate representations of programs over which a model is learned. Overall, this leads to a rich, exciting space of possibilities for combining analysis, synthesis, and probabilistic learning.
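
A hypothetical sketch of point (ii): because programs can be parsed and analyzed, a learning pipeline can work over an analysis-derived intermediate representation rather than raw text. Here, a few simple AST-derived features stand in for such a representation; the chosen features are invented for illustration.

```python
# A hypothetical sketch of point (ii): programs can be parsed and analyzed,
# so a learning pipeline can work over an intermediate representation
# (here, a few simple AST-derived features) instead of raw text.
import ast
from collections import Counter

source = """
def total(items):
    s = 0
    for item in items:
        s += item.price
    return s
"""

features = Counter()
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.FunctionDef):
        features[f"defines:{node.name}"] += 1
    elif isinstance(node, ast.For):
        features["has-loop"] += 1
    elif isinstance(node, ast.Attribute):
        features[f"reads-attribute:{node.attr}"] += 1

# Feature counts like these, extracted across many programs, are the kind
# of analysis-derived representation a probabilistic model can learn over.
print(dict(features))
```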

Second, I believe we will see more work in the space of analyzing and testing probabilistic programs, rather than only traditional programs. Probabilistic programs capture probabilistic models (e.g., Bayesian networks) and aim to simplify the design of machine learning models. Automated techniques can enable one to reason about and optimize these programs, for instance, to enable faster probabilistic inference or to prove probabilistic assertions about the program.
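
As a rough illustration (real probabilistic programming systems provide dedicated languages and inference engines), here is a tiny probabilistic program written as plain Python with sampling, together with a Monte Carlo check of a probabilistic assertion; the model and the assertion are invented.

```python
# A tiny probabilistic program written as plain Python with sampling, plus
# a Monte Carlo check of a probabilistic assertion. The model and the
# assertion are invented for illustration.
import random

def model():
    skill = random.gauss(0.7, 0.1)       # latent ability
    passed = random.random() < skill     # noisy observation
    return skill, passed

samples = [model() for _ in range(100_000)]
passed = [skill for skill, ok in samples if ok]

# Probabilistic assertion: "given a pass, skill exceeds 0.5 with
# probability at least 0.9", estimated here by sampling.
estimate = sum(s > 0.5 for s in passed) / len(passed)
print(f"P(skill > 0.5 | passed) is about {estimate:.3f}")
assert estimate >= 0.9
```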

Third, there will be further application of analysis and synthesis techniques in the areas of systems, security, and networks. Analysis techniques can be used to ensure the reliability of software-defined networks, eventually consistent data stores, mobile apps, spreadsheets, and other applications. Advanced synthesis techniques are almost entirely unexplored in the areas of security and networks.

Fourth, as machine learning systems continue to gain popularity in end-user domains (e.g., human-computer interaction, self-driving cars, etc.), their programmability, reliability, and predictability will become even more important. Analysis and synthesis techniques can be of significant help towards achieving these objectives.

I believe we are already seeing a “tech transfer” of analysis techniques to a variety of application domains, and in the next few years we will see more of that.

Finally: What are the central problems that you would like to see more research activity in?

Concretely, I would like to see more work at the intersection of automated analysis, synthesis, and probabilistic learning. While there has been great work in each area, we lack clear knowledge of the connections and overlaps between them. For instance, even after years of research in synthesis, we still have a poor understanding of how the shape of the hypothesis space affects the asymptotic upper bound on the number of questions a machine has to ask a human to find a solution in that space.

I would also like to see research that explores applications of analysis and synthesis techniques to critical domains relevant to the wide public (e.g., disease prediction).

Find more interviews behind the link.