Publications
2024
Eladawy, Hadeel; Goues, Claire Le; Brun, Yuriy Automated Program Repair, What Is It Good For? Not Absolutely Nothing! Proceedings Article In: Proceedings of the 46th International Conference on Software Engineering (ICSE), Lisbon, Portugal, 2024. Industrial deployments of automated program repair (APR), e.g., at Facebook and Bloomberg, signal a new milestone for this exciting and potentially impactful technology. In these deployments, developers use APR-generated patch suggestions as part of a human-driven debugging process. Unfortunately, little is known about how using patch suggestions affects developers during debugging. This paper conducts a controlled user study with 40 developers with a median of 6 years of experience. The developers engage in debugging tasks on nine naturally-occurring defects in real-world, open-source, Java projects, using Recoder, SimFix, and TBar, three state-of-the-art APR tools. For each debugging task, the developers either have access to the project's tests, or, also, to code suggestions that make all the tests pass. These suggestions are either developer-written or APR-generated, which can be correct or deceptive. Deceptive suggestions, which are a common APR occurrence, make all the available tests pass but fail to generalize to the intended specification. Through a total of 160 debugging sessions, we find that access to a code suggestion significantly increases the odds of submitting a patch. Correct APR suggestions increase the odds of debugging success by 14,000%, but deceptive suggestions decrease the odds of success by 65%. Correct suggestions also speed up debugging. Surprisingly, we observe no significant difference in how novice and experienced developers are affected by APR, suggesting that APR may find uses across the experience spectrum. Overall, developers come away with a strong positive impression of APR, suggesting promise for APR-mediated, human-driven debugging, despite existing challenges in APR-generated repair quality.
Gaba, Aimen; Kaufman, Zhanna; Cheung, Jason; Shvakel, Marie; Hall, Kyle Wm; Brun, Yuriy; Bearfield, Cindy Xiong My Model is Unfair, Do People Even Care? Visual Design Affects Trust and Perceived Bias in Machine Learning Journal Article In: IEEE Transactions on Visualization and Computer Graphics (TVCG), vol. 30, no. 1, pp. 327-337, 2024. Machine learning technology has become ubiquitous, but, unfortunately, often exhibits bias. As a consequence, disparate stakeholders need to interact with and make informed decisions about using machine learning models in everyday systems. Visualization technology can support stakeholders in understanding and evaluating trade-offs between, for example, accuracy and fairness of models. This paper aims to empirically answer "Can visualization design choices affect a stakeholder's perception of model bias, trust in a model, and willingness to adopt a model?" Through a series of controlled, crowd-sourced experiments with more than 1,500 participants, we identify a set of strategies people follow in deciding which models to trust. Our results show that men and women prioritize fairness and performance differently and that visual design choices significantly affect that prioritization. For example, women trust fairer models more often than men do, participants value fairness more when it is explained using text than as a bar chart, and being explicitly told a model is biased has a bigger impact than showing past biased performance. We test the generalizability of our results by comparing the effect of multiple textual and visual design choices and offer potential explanations of the cognitive mechanisms behind the difference in fairness perception and trust. Our research guides design considerations to support future work developing visualization systems for machine learning.
2023
First, Emily; Rabe, Markus; Ringer, Talia; Brun, Yuriy Baldur: Whole-Proof Generation and Repair with Large Language Models Proceedings Article In: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), San Francisco, CA, USA, 2023. Formally verifying software properties is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time, and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once, rather than one step at a time. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. As its main contributions, this paper demonstrates for the first time that: (1) Whole-proof generation using transformers is possible and is as effective as search-based techniques without requiring costly search. (2) Giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair and further improves automated proof generation. (3) We establish a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. In addition to empirically showing the effectiveness of whole-proof generation, repair, and added context, we show that Baldur improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.
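To make the generate-then-repair loop concrete, here is a minimal sketch. The `propose_proof` and `check` functions are invented stubs standing in for the fine-tuned LLM and the Isabelle/HOL proof checker; they do not reflect Baldur's actual interfaces.

```python
# Hypothetical sketch of a generate-then-repair loop in Baldur's style.
# `propose_proof` stands in for a fine-tuned LLM and `check` for the
# Isabelle/HOL proof checker; neither reflects Baldur's real API.

def propose_proof(theorem: str, context: str = "") -> str:
    # Stub: a real implementation would sample from the LLM.
    return "by auto"

def check(theorem: str, proof: str) -> tuple[bool, str]:
    # Stub: a real implementation would invoke the proof assistant.
    return (False, "Failed to apply 'auto'")

def prove(theorem: str, attempts: int = 2) -> str | None:
    proof = propose_proof(theorem)            # whole proof, in one shot
    for _ in range(attempts):
        ok, error = check(theorem, proof)
        if ok:
            return proof
        # Repair: re-prompt with the failed attempt and error message.
        context = f"Failed proof: {proof}\nError: {error}"
        proof = propose_proof(theorem, context)
    return None
```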
Hoag, Austin; Kostas, James E.; Silva, Bruno Castro; Thomas, Philip S.; Brun, Yuriy Seldonian Toolkit: Building Software with Safe and Fair Machine Learning Proceedings Article In: Proceedings of the Demonstrations Track at the 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 2023. We present the Seldonian Toolkit, which enables software engineers to integrate provably safe and fair machine learning algorithms into their systems. Software systems that use data and machine learning are routinely deployed in a wide range of settings, including medical applications, autonomous vehicles, the criminal justice system, and hiring processes. These systems, however, can produce unsafe and unfair behavior, such as suggesting potentially fatal medical treatments, making racist or sexist predictions, or facilitating radicalization and polarization. To reduce these undesirable behaviors, software engineers need the ability to easily integrate their machine-learning-based systems with domain-specific safety and fairness requirements defined by domain experts, such as doctors and hiring managers. The Seldonian Toolkit provides special machine learning algorithms that enable software engineers to incorporate such expert-defined requirements of safety and fairness into their systems, while provably guaranteeing those requirements will be satisfied. A video demonstrating the Seldonian Toolkit is available at https://youtu.be/wHR-hDm9jX4/.
Agrawal, Arpan; First, Emily; Kaufman, Zhanna; Reichel, Tom; Zhang, Shizhuo; Zhou, Timothy; Sanchez-Stern, Alex; Ringer, Talia; Brun, Yuriy Proofster: Automated Formal Verification Proceedings Article In: Proceedings of the Demonstrations Track at the 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 2023. Formal verification is an effective but extremely work-intensive method of improving software quality. Verifying the correctness of software systems often requires significantly more effort than implementing them in the first place, despite the existence of proof assistants, such as Coq, aiding the process. Recent work has aimed to fully automate the synthesis of formal verification proofs, but little tool support exists for practitioners. This paper presents Proofster, a web-based tool aimed at assisting developers with the formal verification process via proof synthesis. Proofster inputs a Coq theorem specifying a property of a software system and attempts to automatically synthesize a formal proof of the correctness of that property. When it is unable to produce a proof, Proofster outputs the proof-space search tree its synthesis explored, which can guide the developer to provide a hint to enable Proofster to synthesize the proof. Proofster runs online at https://proofster.cs.umass.edu/ and a video demonstrating Proofster is available at https://youtu.be/xQAi66lRfwI/.
Talebipour, Saghar; Park, Hyojae; Baral, Kesina; Yee, Leon; Khan, Safwat Ali; Moran, Kevin; Brun, Yuriy; Medvidovic, Nenad; Zhao, Yixue AVGUST: A Tool for Generating Usage-Based Tests from Videos of App Executions Proceedings Article In: Proceedings of the Demonstrations Track at the 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 2023. Creating UI tests for mobile applications is a difficult and time-consuming task. As such, there has been a considerable amount of work carried out to automate the generation of mobile tests, largely focused upon the goals of maximizing code coverage or finding crashes. However, comparatively fewer automated techniques have been proposed to generate a highly sought-after type of test: usage-based tests. These tests exercise targeted app functionalities for activities such as regression testing. In this paper, we present the Avgust tool for automating the construction of usage-based tests for mobile apps. Avgust learns usage patterns from videos of app executions collected by beta testers or crowd-workers, translates these into an app-agnostic state-machine encoding, and then uses this encoding to generate new test cases for an unseen target app. We evaluated Avgust on 374 videos of use cases from 18 popular apps and found that it can successfully exercise the desired usage in 69% of the tests. Avgust is an open-source tool available at https://github.com/felicitia/UsageTesting-Repo/tree/demo/. A video illustrating the capabilities of Avgust can be found at: https://youtu.be/LPICxVd0YAg.
Motwani, Manish; Brun, Yuriy Understanding Why and Predicting When Developers Adhere to Code-Quality Standards Proceedings Article In: Proceedings of the Software Engineering in Practice Track at the 45th International Conference on Software Engineering (ICSE SEIP), Melbourne, Australia, 2023. Static analysis tools are widely used in software development. While research has focused on improving tool accuracy, evidence at Microsoft suggests that developers often consider some accurately detected warnings not worth fixing: what these tools and developers consider to be true positives differs. Thus, improving tool utilization requires understanding when and why developers fix static-analysis warnings. We conduct a case study of Microsoft's Banned API Standard used within the company, which describes 195 APIs that can potentially cause vulnerabilities and 142 recommended replacements. We find that developers often (84% of the time) consciously deviate from this standard, specifying their rationale, allowing us to study why and when developers deviate from standards. We then identify 23 factors that correlate with developers using the preferred APIs and build a model that predicts whether the developers would use the preferred or discouraged APIs under different circumstances with 92% accuracy. We also train a model to predict the kind of APIs developers would use in the future based on their past development activity, with 86% accuracy. We outline a series of concrete suggestions static analysis developers can use to prioritize and customize their output, potentially increasing their tools' usefulness.
Motwani, Manish; Brun, Yuriy Better Automatic Program Repair by Using Bug Reports and Tests Together Proceedings Article In: Proceedings of the 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 2023. Automated program repair is already deployed in industry, but concerns remain about repair quality. Recent research has shown that one of the main reasons repair tools produce incorrect (but seemingly correct) patches is imperfect fault localization (FL). This paper demonstrates that combining information from natural-language bug reports and test executions when localizing bugs can have a significant positive impact on repair quality. By modifying existing repair tools to use FL that combines bug reports and tests, we are able to correctly repair 7 defects in Defects4J that no prior tools have repaired correctly. We develop Blues, the first information-retrieval-based, statement-level FL technique that requires no training data. We further develop RAFL, the first unsupervised method for combining multiple FL techniques, which outperforms a supervised method. Using RAFL, we create SBIR by combining Blues with a spectrum-based (SBFL) technique. Evaluated on 815 real-world defects, SBIR consistently ranks buggy statements higher than its underlying techniques. Finally, we modify three state-of-the-art repair tools, Arja, SequenceR, and SimFix, to use SBIR, SBFL, and Blues as their internal FL. We evaluate the quality of the produced patches on 689 real-world defects. Arja and SequenceR significantly benefit from SBIR: Arja using SBIR correctly repairs 28 defects, but only 21 using SBFL, and only 15 using Blues; SequenceR using SBIR correctly repairs 12 defects, but only 10 using SBFL, and only 4 using Blues. SimFix (which has internal mechanisms to overcome poor FL) correctly repairs 30 defects using SBIR and SBFL, but only 13 using Blues. Our promising findings direct further research into combining data from bug reports and test executions for FL and program repair.
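The abstract does not spell out RAFL's aggregation method, so the sketch below conveys only the general idea of combining two fault-localization rankings without training data, using a simple normalized score sum as a hypothetical stand-in; the statement names and scores are invented.

```python
# Hypothetical illustration of combining two fault-localization (FL)
# score maps without training data. RAFL's actual aggregation method is
# not described here; a normalized score sum merely conveys the idea.

def normalize(scores: dict[str, float]) -> dict[str, float]:
    hi = max(scores.values()) or 1.0
    return {stmt: s / hi for stmt, s in scores.items()}

def combine(sbfl: dict[str, float], ir_fl: dict[str, float]) -> list[str]:
    sbfl, ir_fl = normalize(sbfl), normalize(ir_fl)
    stmts = set(sbfl) | set(ir_fl)
    total = {s: sbfl.get(s, 0.0) + ir_fl.get(s, 0.0) for s in stmts}
    # Rank statements by combined suspiciousness, most suspicious first.
    return sorted(stmts, key=lambda s: -total[s])

ranking = combine({"Foo.java:42": 0.9, "Foo.java:43": 0.2},
                  {"Foo.java:42": 0.7, "Bar.java:10": 0.6})
print(ranking)  # 'Foo.java:42' ranks first, agreeing with both signals
```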
Brun, Yuriy; Lin, Tian; Somerville, Jessie Elise; Myers, Elisha M.; Ebner, Natalie Blindspots in Python and Java APIs Result in Vulnerable Code Journal Article In: ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 32, no. 3, pp. 76:1–76:31, 2023, ISSN: 1049-331X. Blindspots in APIs can cause software engineers to introduce vulnerabilities, but such blindspots are, unfortunately, common. We study the effect APIs with blindspots have on developers in two languages by replicating a 109-developer, 24-Java-API controlled experiment. Our replication applies to Python and involves 129 new developers and 22 new APIs. We find that using APIs with blindspots statistically significantly reduces the developers' ability to correctly reason about the APIs in both languages, but that the effect is more pronounced for Python. Interestingly, for Java, the effect increased with complexity of the code relying on the API, whereas for Python, the opposite was true. This suggests that Python developers are less likely to notice potential for vulnerabilities in complex code than in simple code, whereas Java developers are more likely to recognize the extra complexity and apply more care, but are more careless with simple code. Whether the developers considered API uses to be more difficult, less clear, and less familiar did not have an effect on their ability to correctly reason about them. Developers with better long-term memory recall were more likely to correctly reason about APIs with blindspots, but short-term memory, processing speed, episodic memory, and memory span had no effect. Surprisingly, professional experience and expertise did not improve the developers' ability to reason about APIs with blindspots across both languages, with long-term professionals with many years of experience making mistakes as often as relative novices. Finally, personality traits did not significantly affect the Python developers' ability to reason about APIs with blindspots, but less extraverted and more open developers were better at reasoning about Java APIs with blindspots. Overall, our findings suggest that blindspots in APIs are a serious problem across languages, and that experience and education alone do not overcome that problem, suggesting that tools are needed to help developers recognize blindspots in APIs as they write code that uses those APIs.
Johnson, Brittany; Bartola, Jesse; Angell, Rico; Witty, Sam; Giguere, Stephen J.; Brun, Yuriy Fairkit, Fairkit, on the Wall, Who's the Fairest of Them All? Supporting Data Scientists in Training Fair Models Journal Article In: EURO Journal on Decision Processes, vol. 11, 2023. Modern software relies heavily on data and machine learning, and affects decisions that shape our world. Unfortunately, recent studies have shown that because of biases in data, software systems frequently inject bias into their decisions, from producing more errors when transcribing women's than men's voices to overcharging people of color for financial loans. To address bias in software, data scientists and software engineers need tools that help them understand the trade-offs between model quality and fairness in their specific data domains. Toward that end, we present fairkit-learn, an interactive toolkit for helping engineers reason about and understand fairness. Fairkit-learn supports over 70 definitions of fairness and works with state-of-the-art machine learning tools, using the same interfaces to ease adoption. It can evaluate thousands of models produced by multiple machine learning algorithms, hyperparameters, and data permutations, and compute and visualize a small Pareto-optimal set of models that describe the optimal trade-offs between fairness and quality. Engineers can then iterate, improving their models and evaluating them using fairkit-learn. We evaluate fairkit-learn via a user study with 54 students, showing that students using fairkit-learn produce models that provide a better balance between fairness and quality than students using scikit-learn and IBM AI Fairness 360 toolkits. With fairkit-learn, users can select models that are up to 67% more fair and 10% more accurate than the models they are likely to train with scikit-learn.
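The Pareto-optimal model set at the heart of this workflow can be sketched in a few lines; the model names and (accuracy, fairness) scores below are invented, and higher is assumed better for both dimensions.

```python
# A minimal sketch of the Pareto-front computation the fairkit-learn
# abstract describes: keep only models not dominated on both accuracy
# and fairness (higher is better for each). All numbers are made up.

def pareto(models: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    def dominated(a, b):  # True if b is at least as good everywhere and differs
        return b[0] >= a[0] and b[1] >= a[1] and b != a
    return {name: m for name, m in models.items()
            if not any(dominated(m, other) for other in models.values())}

front = pareto({"logreg": (0.91, 0.60), "tree": (0.88, 0.75),
                "forest": (0.90, 0.74), "svm": (0.87, 0.55)})
print(front)  # 'svm' drops out: 'logreg' beats it on both dimensions
```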
Sanchez-Stern, Alex; First, Emily; Zhou, Timothy; Kaufman, Zhanna; Brun, Yuriy; Ringer, Talia Passport: Improving Automated Formal Verification Using Identifiers Journal Article In: ACM Transactions on Programming Languages and Systems (TOPLAS), 2023, ISSN: 0164-0925. Formally verifying system properties is one of the most effective ways of improving system quality, but its high manual effort requirements often render it prohibitively expensive. Tools that automate formal verification by learning from proof corpora to synthesize proofs have just begun to show their promise. These tools are effective because of the richness of the data the proof corpora contain. This richness comes from the stylistic conventions followed by communities of proof developers, together with the powerful logical systems beneath proof assistants. However, this richness remains underexploited, with most work thus far focusing on architecture rather than on how to make the most of the proof data. This paper systematically explores how to most effectively exploit one aspect of that proof data: identifiers. We develop the Passport approach, a method for enriching the predictive Coq model used by an existing proof-synthesis tool with three new encoding mechanisms for identifiers: category vocabulary indexing, subword sequence modeling, and path elaboration. We evaluate our approach's enrichment effect on three existing base tools: ASTactic, Tac, and Tok. In head-to-head comparisons, Passport automatically proves 29% more theorems than the best-performing of these base tools. Combining the three tools enhanced by the Passport approach automatically proves 38% more theorems than combining the three base tools. Finally, together, these base tools and their enhanced versions prove 45% more theorems than the combined base tools. Overall, our findings suggest that modeling identifiers can play a significant role in improving proof synthesis, leading to higher-quality software.
2022
Brun, Yuriy The Promise and Perils of Using Machine Learning When Engineering Software (Keynote Paper) Proceedings Article In: Proceedings of the International Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), Singapore, 2022. Machine learning has radically changed what computing can accomplish, including the limits of what software engineering can do. I will discuss recent software engineering advances machine learning has enabled, from automatically repairing software bugs to data-driven software systems that automatically learn to make decisions. Unfortunately, with the promises of these new technologies come serious perils. For example, automatically generated program patches can break as much functionality as they repair. And self-learning, data-driven software can make decisions that result in unintended consequences, including unsafe, racist, or sexist behavior. But to build solutions to these shortcomings we may need to look no further than machine learning itself. I will introduce multiple ways machine learning can help verify software properties, leading to higher-quality systems.
Zhao, Yixue; Talebipour, Saghar; Baral, Kesina; Park, Hyojae; Yee, Leon; Khan, Safwat Ali; Brun, Yuriy; Medvidovic, Nenad; Moran, Kevin Avgust: Automating Usage-Based Test Generation from Videos of App Executions Proceedings Article In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 421–433, Singapore, 2022, (ACM artifact badges granted: Artifact Available, Artifact Functional.). Writing and maintaining UI tests for mobile apps is a time-consuming and tedious task. While decades of research have produced automated approaches for UI test generation, these approaches typically focus on testing for crashes or maximizing code coverage. By contrast, recent research has shown that developers prefer usage-based tests, which center around specific uses of app features, to help support activities such as regression testing. Very few existing techniques support the generation of such tests, as doing so requires automating the difficult task of understanding the semantics of UI screens and user inputs. In this paper, we introduce Avgust, which automates key steps of generating usage-based tests. Avgust uses neural models for image understanding to process video recordings of app uses to synthesize an app-agnostic state-machine encoding of those uses. Then, Avgust uses this encoding to synthesize test cases for a new target app. We evaluate Avgust on 374 videos of common uses of 18 popular apps and show that 69% of the tests Avgust generates successfully execute the desired usage, and that Avgust's classifiers outperform the state of the art.
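A toy rendering of the state-machine encoding and test-synthesis step, with invented states and actions; the real system derives these encodings with neural models over video frames, and `find a path` here is a plain breadth-first search rather than Avgust's actual synthesis.

```python
# Sketch of the app-agnostic state-machine idea from Avgust's abstract.
# States and actions are invented labels for a tiny "sign-in" usage.
from collections import deque

usage = {
    "start":       [("tap_sign_in", "login_form")],
    "login_form":  [("enter_credentials", "form_filled")],
    "form_filled": [("tap_submit", "home_screen")],
}

def generate_test(start: str, goal: str) -> list[str] | None:
    # Breadth-first search for a sequence of UI actions reaching the goal.
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, steps = queue.popleft()
        if state == goal:
            return steps
        for action, nxt in usage.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [action]))
    return None

print(generate_test("start", "home_screen"))
# ['tap_sign_in', 'enter_credentials', 'tap_submit']
```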
Shahbazian, Arman; Karthik, Suhrid; Brun, Yuriy; Medvidovic, Nenad eQual: Informing Early Design Decisions Proceedings Article In: Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 1039–1051, Sacramento, CA, USA, 2022, (ACM artifact badges granted: Artifact Available.). When designing a software system, architects make a series of design decisions that directly impact the system's quality. The number of available design alternatives grows rapidly with system size, creating an enormous space of intertwined design concerns that renders manual exploration impractical. We present eQual, a model-driven technique for simulation-based assessment of architectural designs. While it is not possible to guarantee optimal decisions so early in the design process, eQual improves decision quality. eQual is effective in practice because it (1) limits the amount of information the architects have to provide and (2) adapts optimization algorithms to effectively explore massive spaces of design alternatives. We empirically demonstrate that eQual yields designs whose quality is comparable to a set of systems' known optimal designs. A user study shows that, compared to the state-of-the-art, engineers using eQual produce statistically significantly higher-quality designs with a large effect size, are statistically significantly more confident in their designs, and find eQual easier to use.
Johnson, Brittany; Brun, Yuriy Fairkit-learn: A Fairness Evaluation and Comparison Toolkit Proceedings Article In: Proceedings of the Demonstrations Track at the 44th International Conference on Software Engineering (ICSE), pp. 70–74, Pittsburgh, PA, USA, 2022. Advances in how we build and use software, specifically the integration of machine learning for decision making, have led to widespread concern around model and software fairness. We present fairkit-learn, an interactive Python toolkit designed to support data scientists' ability to reason about and understand model fairness. We outline how fairkit-learn can support model training, evaluation, and comparison and describe the potential benefit that comes with using fairkit-learn in comparison to the state-of-the-art. Fairkit-learn is open source at https://go.gmu.edu/fairkit-learn/.
First, Emily; Brun, Yuriy Diversity-Driven Automated Formal Verification Proceedings Article In: Proceedings of the 44th International Conference on Software Engineering (ICSE), pp. 749–761, Pittsburgh, PA, USA, 2022, (ACM SIGSOFT Distinguished Paper Award. ACM artifact badges granted: Artifact Available). Formally verified correctness is one of the most desirable properties of software systems. But despite great progress made via interactive theorem provers, such as Coq, writing proof scripts for verification remains one of the most effort-intensive (and often prohibitively difficult) software development activities. Recent work has created tools that automatically synthesize proofs or proof scripts. For example, CoqHammer can prove 26.6% of theorems completely automatically by reasoning using precomputed facts, while TacTok and ASTactic, which use machine learning to model proof scripts and then perform biased search through the proof-script space, can prove 12.9% and 12.3% of the theorems, respectively. Further, these three tools are highly complementary; together, they can prove 30.4% of the theorems fully automatically. Our key insight is that control over the learning process can produce a diverse set of models, and that, due to the unique nature of proof synthesis (the existence of the theorem prover, an oracle that infallibly judges a proof's correctness), this diversity can significantly improve these tools' proving power. Accordingly, we develop Diva, which uses a diverse set of models with TacTok's and ASTactic's search mechanism to prove 21.7% of the theorems. That is, Diva proves 68% more theorems than TacTok and 77% more than ASTactic. Complementary to CoqHammer, Diva proves 781 theorems (27% added value) that CoqHammer does not, and 364 theorems no existing tool has proved automatically. Together with CoqHammer, Diva proves 33.8% of the theorems, the largest fraction to date. We explore nine dimensions for learning diverse models, and identify which dimensions lead to the most useful diversity. Further, we develop an optimization to speed up Diva's execution by 40X. Our study introduces a completely new idea for using diversity in machine learning to improve the power of state-of-the-art proof-script synthesis techniques, and empirically demonstrates that the improvement is significant on a dataset of 68K theorems from 122 open-source software projects.
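Diva's key insight, that an infallible checker lets an ensemble of diverse, imperfect models add proving power, can be sketched as follows; the models and checker below are stubs, not Diva's real components.

```python
# Diva's insight in miniature: with the proof checker as an infallible
# oracle, many diverse models can be tried in turn, and only accepted
# proofs are kept, so model diversity only adds proving power.

def make_model(temperature):
    # Stand-in for training with different hyperparameters or data order.
    def model(theorem):
        return f"proof-attempt(T={temperature})"
    return model

def checker_accepts(theorem, proof):
    # Stub for Coq; here, only one model's attempt happens to succeed.
    return proof == "proof-attempt(T=0.7)"

def prove_with_ensemble(theorem, models):
    for model in models:
        proof = model(theorem)
        if checker_accepts(theorem, proof):  # the oracle filters mistakes
            return proof
    return None

models = [make_model(t) for t in (0.3, 0.5, 0.7)]
print(prove_with_ensemble("forall n, n + 0 = n", models))
```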
Giguere, Stephen; Metevier, Blossom; Brun, Yuriy; Silva, Bruno Castro; Thomas, Philip S.; Niekum, Scott Fairness Guarantees under Demographic Shift Proceedings Article In: Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022. Recent studies found that using machine learning for social applications can lead to injustice in the form of racist, sexist, and otherwise unfair and discriminatory outcomes. To address this challenge, recent machine learning algorithms have been designed to limit the likelihood such unfair behavior occurs. However, these approaches typically assume the data used for training is representative of what will be encountered in deployment, which is often untrue. In particular, if certain subgroups of the population become more or less probable in deployment (a phenomenon we call demographic shift), prior work's fairness assurances are often invalid. In this paper, we consider the impact of demographic shift and present a class of algorithms, called Shifty algorithms, that provide high-confidence behavioral guarantees that hold under demographic shift when data from the deployment environment is unavailable during training. Shifty, the first technique of its kind, demonstrates an effective strategy for designing algorithms to overcome demographic shift's challenges. We evaluate Shifty using the UCI Adult Census dataset, as well as a real-world dataset of university entrance exams and subsequent student success. We show that the learned models avoid bias under demographic shift, unlike existing methods. Our experiments demonstrate that our algorithm's high-confidence fairness guarantees are valid in practice and that our algorithm is an effective tool for training models that are fair when demographic shift occurs.
Motwani, Manish; Soto, Mauricio; Brun, Yuriy; Just, René; Goues, Claire Le Quality of Automated Program Repair on Real-World Defects Journal Article In: IEEE Transactions on Software Engineering (TSE), vol. 48, no. 2, pp. 637–661, 2022, ISSN: 0098-5589. Automated program repair is a promising approach to reducing the costs of manual debugging and increasing software quality. However, recent studies have shown that automated program repair techniques can be prone to producing patches of low quality, overfitting to the set of tests provided to the repair technique, and failing to generalize to the intended specification. This paper rigorously explores this phenomenon on real-world Java programs, analyzing the effectiveness of four well-known repair techniques, GenProg, Par, SimFix, and TrpAutoRepair, on defects made by the projects' developers during their regular development process. We find that: (1) When applied to real-world Java code, automated program repair techniques produce patches for between 10.6% and 19.0% of the defects, which is less frequent than when applied to C code. (2) The produced patches often overfit to the provided test suite, with only between 13.8% and 46.1% of the patches passing an independent set of tests. (3) Test suite size has an extremely small but significant effect on the quality of the patches, with larger test suites producing higher-quality patches, though, surprisingly, higher-coverage test suites correlate with lower-quality patches. (4) The number of tests that a buggy program fails has a small but statistically significant positive effect on the quality of the produced patches. (5) Test suite provenance, whether the test suite is written by a human or automatically generated, has a significant effect on the quality of the patches, with developer-written tests typically producing higher-quality patches. And (6) the patches exhibit insufficient diversity to improve quality through some method of combining multiple patches. We develop JaRFly, an open-source framework for implementing techniques for automatic search-based improvement of Java programs. Our study uses JaRFly to faithfully reimplement GenProg and TrpAutoRepair to work on Java code, and makes the first public release of an implementation of Par. Unlike prior work, our study carefully controls for confounding factors and produces a methodology, as well as a dataset of automatically-generated test suites, for objectively evaluating the quality of Java repair techniques on real-world defects.
2021
Afzal, Afsoon; Motwani, Manish; Stolee, Kathryn T.; Brun, Yuriy; Goues, Claire Le SOSRepair: Expressive Semantic Search for Real-World Program Repair Journal Article In: IEEE Transactions on Software Engineering (TSE), vol. 47, no. 10, pp. 2162–2181, 2021, ISSN: 0098-5589. Automated program repair holds the potential to significantly reduce software maintenance effort and cost. However, recent studies have shown that it often produces low-quality patches that repair some but break other functionality. We hypothesize that producing patches by replacing likely faulty regions of code with semantically-similar code fragments, and doing so at a higher level of granularity than prior approaches can better capture abstraction and the intended specification, and can improve repair quality. We create SOSRepair, an automated program repair technique that uses semantic code search to replace candidate buggy code regions with behaviorally-similar (but not identical) code written by humans. SOSRepair is the first such technique to scale to real-world defects in real-world systems. On a subset of the ManyBugs benchmark of such defects, SOSRepair produces patches for 23 (35%) of the 65 defects, including 3, 5, and 8 defects for which previous state-of-the-art techniques Angelix, Prophet, and GenProg do not, respectively. On these 23 defects, SOSRepair produces more patches (8, 35%) that pass all independent tests than the prior techniques. We demonstrate a relationship between patch granularity and the ability to produce patches that pass all independent tests. We then show that fault localization precision is a key factor in SOSRepair's success. Manually improving fault localization allows SOSRepair to patch 24 (37%) defects, of which 16 (67%) pass all independent tests. We conclude that (1) higher-granularity, semantic-based patches can improve patch quality, (2) semantic search is promising for producing high-quality real-world defect repairs, (3) research in fault localization can significantly improve the quality of program repair techniques, and (4) semi-automated approaches in which developers suggest fix locations may produce high-quality patches.
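The semantic-search idea can be rendered as a toy: index human-written snippets by their input-output behavior and retrieve the ones matching the behavior the buggy region should exhibit. SOSRepair performs this search with SMT-based encodings rather than by running tests; the snippet database and test values below are invented.

```python
# Toy rendering of semantic code search for repair: retrieve snippets
# whose observed behavior matches the buggy region's desired behavior.
# SOSRepair does this symbolically (SMT), not by executing tests.

TESTS = [((2, 3), 3), ((5, 1), 5), ((4, 4), 4)]  # desired (inputs, output)

SNIPPETS = {
    "return a + b": lambda a, b: a + b,
    "return max(a, b)": lambda a, b: max(a, b),
    "return a if a > b else b": lambda a, b: a if a > b else b,
}

def behaviorally_matching(tests):
    # Keep snippets consistent with every desired input-output pair.
    return [code for code, fn in SNIPPETS.items()
            if all(fn(*ins) == out for ins, out in tests)]

print(behaviorally_matching(TESTS))  # both max-like snippets match
```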
2020
Pinckney, Donald; Guha, Arjun; Brun, Yuriy Wasm/k: Delimited Continuations for WebAssembly Proceedings Article In: Proceedings of the ACM SIGPLAN International Symposium on Dynamic Languages (DLS), pp. 16–28, Chicago, IL, USA, 2020. WebAssembly is designed to be an alternative to JavaScript that is a safe, portable, and efficient compilation target for a variety of languages. The performance of high-level languages depends not only on the underlying performance of WebAssembly, but also on the quality of the generated WebAssembly code. In this paper, we identify several features of high-level languages that current approaches can only compile to WebAssembly by generating complex and inefficient code. We argue that these problems could be addressed if WebAssembly natively supported first-class continuations. We then present Wasm/k, which extends WebAssembly with delimited continuations. Wasm/k introduces no new value types, and thus does not require significant changes to the WebAssembly type system (validation). Wasm/k is safe, even in the presence of foreign function calls (e.g., to and from JavaScript). Finally, Wasm/k is amenable to efficient implementation: we implement Wasm/k as a local change to Wasmtime, an existing WebAssembly JIT. We evaluate Wasm/k by implementing C/k, which adds delimited continuations to C/C++. C/k uses Emscripten and its implementation serves as a case study on how to use Wasm/k in a compiler that targets WebAssembly. We present several case studies using C/k, and show that on implementing green threads, it can outperform the state-of-the-art approach Asyncify with an 18% improvement in performance and a 30% improvement in code size. |
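Python has no delimited continuations, but generators can mimic the suspend-and-resume core of the green-threads use case the paper evaluates; the sketch below is an analogy only, not a model of Wasm/k's actual control operators.

```python
# Cooperative "green threads" via generators: an analogy for what
# first-class, resumable control (as in Wasm/k) makes cheap to express.
from collections import deque

def worker(name, steps):
    for i in range(steps):
        print(f"{name}: step {i}")
        yield  # suspend: hand control back to the scheduler

def run_round_robin(threads):
    # Resume each suspended "thread" in turn until all finish.
    ready = deque(threads)
    while ready:
        t = ready.popleft()
        try:
            next(t)            # resume the captured continuation
            ready.append(t)
        except StopIteration:
            pass               # thread finished

run_round_robin([worker("A", 2), worker("B", 3)])
```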
First, Emily; Brun, Yuriy; Guha, Arjun TacTok: Semantics-Aware Proof Synthesis Journal Article In: Proceedings of the ACM on Programming Languages (PACMPL) Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) issue, vol. 4, pp. 231:1–231:31, 2020. Formally verifying software correctness is a highly manual process. However, because verification proof scripts often share structure, it is possible to learn from existing proof scripts to fully automate some formal verification. The goal of this paper is to improve proof script synthesis and enable fully automating more verification. Interactive theorem provers, such as the Coq proof assistant, allow programmers to write partial proof scripts, observe the semantics of the proof state thus far, and then attempt more progress. Knowing the proof state semantics is a significant aid. Recent research has shown that the proof state can help predict the next step. In this paper, we present TacTok, the first technique that attempts to fully automate proof script synthesis by modeling proof scripts using both the partial proof script written thus far and the semantics of the proof state. Thus, TacTok more completely models the information the programmer has access to when writing proof scripts manually. We evaluate TacTok on a benchmark of 26 software projects in Coq, consisting of over 10 thousand theorems. We compare our approach to five tools: two prior techniques, CoqHammer, the state-of-the-art proof synthesis technique, and ASTactic, a proof script synthesis technique that models proof state; and three new proof script synthesis techniques we create ourselves: SeqOnly, which models only the partial proof script and the initial theorem being proven, and WeightedRandom and WeightedGreedy, which use metaheuristic search biased by frequencies of proof tactics in existing, successful proof scripts. We find that TacTok outperforms WeightedRandom and WeightedGreedy, and is complementary to CoqHammer and ASTactic: for 24 out of the 26 projects, TacTok can synthesize proof scripts for some theorems the prior tools cannot. Together with TacTok, 11.5% more theorems can be proven automatically than by CoqHammer alone, and 20.0% than by ASTactic alone. Compared to a combination of CoqHammer and ASTactic, TacTok can prove an additional 3.6% more theorems, proving 115 theorems no tool could previously prove. Overall, our experiments provide evidence that partial proof script and proof state semantics, together, provide useful information for proof script modeling, and that metaheuristic search is a promising direction for proof script synthesis. TacTok is open-source and we make public all our data and a replication package of our experiments.
Johnson, Brittany; Brun, Yuriy; Meliou, Alexandra Causal Testing: Understanding Defects' Root Causes Proceedings Article In: Proceedings of the 42nd International Conference on Software Engineering (ICSE), pp. 87–99, Seoul, Republic of Korea, 2020, (ACM SIGSOFT Distinguished Artifact Award. ACM artifact badges granted: Artifact Available, Artifact Reusable.). Isolating and repairing buggy software behavior requires finding where the bug is happening and understanding the root cause of the buggy behavior. While there exist tools that can help developers identify where a bug is located, existing work has paid little, if any, attention to helping developers understand the root cause of buggy behavior. We present Causal Testing, a new method of root-cause analysis that relies on the theory of counterfactual causality to identify a set of executions that likely hold key causal information necessary to understand and repair buggy behavior. Evaluating Causal Testing on the Defects4J benchmark, we find that Causal Testing could be applied to 71% of real-world defects, and for 77% of those, it can help developers identify the root cause of the defect. A controlled experiment with 37 developers showed that Causal Testing improved participants' ability to identify the cause of the defect: Users with standard testing tools correctly identified the cause 88% of the time in comparison to 96% of the time with Causal Testing. Overall, participants agreed Causal Testing provided useful information they could not get using tools like JUnit alone.
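The counterfactual idea in miniature: starting from a failing input, search nearby inputs for minimally different executions that pass, which points at what about the input causes the failure. The defective `buggy_abs` function below is invented purely for illustration.

```python
# Sketch of the counterfactual-causality idea behind Causal Testing:
# contrast a failing execution with minimally different passing ones.

def buggy_abs(x: int) -> int:
    return x if x > -1 else x  # bug: forgets to negate negatives

def behaves_correctly(x: int) -> bool:
    return buggy_abs(x) == abs(x)

def similar_passing_inputs(failing: int, radius: int = 3) -> list[int]:
    return [failing + d for d in range(-radius, radius + 1)
            if d != 0 and behaves_correctly(failing + d)]

# buggy_abs(-1) is wrong, but the nearby inputs 0, 1, 2 behave
# correctly, pointing at the input's sign as the root cause.
print(similar_passing_inputs(-1))  # [0, 1, 2]
```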
Beschastnikh, Ivan; Liu, Perry; Xing, Albert; Wang, Patty; Brun, Yuriy; Ernst, Michael D. Visualizing distributed system executions Journal Article In: ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 29, no. 2, pp. 9:1–9:38, 2020, ISSN: 1049-331X. Distributed systems pose unique challenges for software developers. Understanding the system's communication topology and reasoning about concurrent activities of system hosts can be difficult. The standard approach, analyzing system logs, can be a tedious and complex process that involves reconstructing a system log from multiple hosts' logs, reconciling timestamps among hosts with non-synchronized clocks, and understanding what took place during the execution encoded by the log. This paper presents a novel approach for tackling three tasks frequently performed during analysis of distributed system executions: (1) understanding the relative ordering of events, (2) searching for specific patterns of interaction between hosts, and (3) identifying structural similarities and differences between pairs of executions. Our approach consists of XVector, which instruments distributed systems to capture partial ordering information that encodes the happens-before relation between events, and ShiViz, which processes the resulting logs and presents distributed system executions as interactive time-space diagrams. Two user studies with a total of 109 students and a case study with 2 developers showed that our method was effective, helping participants answer statistically significantly more system-comprehension questions correctly, with a very large effect size.
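The happens-before relation such instrumentation captures is conventionally encoded with vector clocks; the sketch below (an assumption about the standard mechanism, not a description of XVector's implementation) shows the update and comparison rules that time-space visualizations rely on.

```python
# Vector clocks: the conventional encoding of the happens-before
# relation between events in a distributed execution.

def local_event(clock: dict[str, int], host: str) -> dict[str, int]:
    clock = dict(clock)
    clock[host] = clock.get(host, 0) + 1  # tick the host's own entry
    return clock

def on_receive(clock, host, msg_clock):
    # Merge entrywise maxima with the sender's clock, then tick locally.
    merged = {h: max(clock.get(h, 0), msg_clock.get(h, 0))
              for h in set(clock) | set(msg_clock)}
    return local_event(merged, host)

def happens_before(a: dict[str, int], b: dict[str, int]) -> bool:
    hosts = set(a) | set(b)
    return (all(a.get(h, 0) <= b.get(h, 0) for h in hosts)
            and any(a.get(h, 0) < b.get(h, 0) for h in hosts))

send = local_event({}, "alice")         # {'alice': 1}
recv = on_receive({}, "bob", send)      # {'alice': 1, 'bob': 1}
assert happens_before(send, recv)       # the send precedes the receive
```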
2019
Metevier, Blossom; Giguere, Stephen; Brockman, Sarah; Kobren, Ari; Brun, Yuriy; Brunskill, Emma; Thomas, Philip S. Offline Contextual Bandits with High Probability Fairness Guarantees Proceedings Article In: Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NeurIPS), Advances in Neural Information Processing Systems 32, pp. 14893–14904, Vancouver, BC, Canada, 2019. We present RobinHood, an offline contextual bandit algorithm designed to satisfy a broad family of fairness constraints. Our algorithm accepts multiple fairness definitions and allows users to construct their own unique fairness definitions for the problem at hand. We provide a theoretical analysis of RobinHood, which includes a proof that it will not return an unfair solution with probability greater than a user-specified threshold. We validate our algorithm on three applications: a tutoring system in which we conduct a user study and consider multiple unique fairness definitions; a loan approval setting (using the Statlog German credit data set) in which well-known fairness definitions are applied; and criminal recidivism (using data released by ProPublica). In each setting, our algorithm is able to produce fair policies that achieve performance competitive with other offline and online contextual bandit algorithms.
Thomas, Philip S.; Silva, Bruno Castro; Barto, Andrew G.; Giguere, Stephen; Brun, Yuriy; Brunskill, Emma Preventing Undesirable Behavior of Intelligent Machines Journal Article In: Science, vol. 366, no. 6468, pp. 999-1004, 2019, ISSN: 0036-8075. Intelligent machines using machine learning algorithms are ubiquitous, ranging from simple data analysis and pattern recognition tools to complex systems that achieve super-human performance on various tasks. Ensuring that they do not exhibit undesirable behavior — that they do not, for example, cause harm to humans — is therefore a pressing problem that we address here. We propose a general and flexible framework for designing machine learning algorithms that simplifies the problem of specifying and regulating undesirable behavior. To show the viability of this framework, we use it to create machine learning algorithms that preclude the dangerous behavior caused by standard machine learning algorithms in our experiments. Our framework for designing machine learning algorithms simplifies the safe and responsible application of machine learning.
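The framework separates candidate selection from a safety test on held-out data; the sketch below illustrates such a safety test using Hoeffding's inequality, with an invented behavioral constraint ("mean error below tau with confidence 1 - delta") and invented data. The real framework supports far more general constraints and estimators.

```python
# A minimal safety-test sketch in the spirit of the paper's framework:
# certify, with confidence 1 - delta, that a candidate model's mean
# error on held-out safety data is below tau. Data here are invented.
import math

def safety_test(errors: list[float], tau: float, delta: float) -> bool:
    # errors must lie in [0, 1] for Hoeffding's inequality to apply.
    n = len(errors)
    mean = sum(errors) / n
    bound = math.sqrt(math.log(1 / delta) / (2 * n))
    return mean + bound <= tau  # high-confidence upper bound below tau

candidate_ok = safety_test(errors=[0.1, 0.0, 0.2, 0.1] * 50,
                           tau=0.25, delta=0.05)
print(candidate_ok)  # deploy the candidate only if the guarantee holds
```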
Jangda, Abhinav; Pinckney, Donald; Brun, Yuriy; Guha, Arjun Formal Foundations of Serverless Computing Journal Article In: Proceedings of the ACM on Programming Languages (PACMPL) Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) issue, vol. 3, pp. 149:1–149:26, 2019, ISSN: 2475-1421, (ACM SIGPLAN Distinguished Paper Award.). Serverless computing (also known as functions as a service) is a new cloud computing abstraction that makes it easier to write robust, large-scale web services. In serverless computing, programmers write what are called serverless functions, and the cloud platform transparently manages the operating system, resource allocation, load-balancing, and fault tolerance. When demand for the service spikes, the platform automatically allocates additional hardware to the service and manages load-balancing; when demand falls, the platform silently deallocates idle resources; and when the platform detects a failure, it transparently retries affected requests. In 2014, Amazon Web Services introduced the first serverless platform, AWS Lambda, and similar abstractions are now available on all major cloud computing platforms. Unfortunately, the serverless computing abstraction exposes several low-level operational details that make it hard for programmers to write and reason about their code. This paper sheds light on this problem by presenting λλ, an operational semantics of the essence of serverless computing. Despite being a small (half a page) core calculus, λλ models all the low-level details that serverless functions can observe. To show that λλ is useful, we present three applications. First, to ease reasoning about code, we present a simplified semantics of serverless execution and precisely characterize when the simplified semantics and λλ coincide. Second, we augment λλ with a key-value store to allow reasoning about stateful serverless functions. Third, since a handful of serverless platforms support serverless function composition, we show how to extend λλ with a composition language. We have implemented this composition language and show that it outperforms prior work.
Motwani, Manish; Brun, Yuriy Automatically Generating Precise Oracles from Structured Natural Language Specifications Proceedings Article In: Proceedings of the 41st International Conference on Software Engineering (ICSE), pp. 188–199, Montreal, QC, Canada, 2019, (ACM artifact badges granted: Artifact Available, Artifact Reusable.). Software specifications often use natural language to describe the desired behavior, but such specifications are difficult to verify automatically. We present Swami, an automated technique that extracts test oracles and generates executable tests from structured natural language specifications. Swami focuses on exceptional behavior and boundary conditions that often cause field failures but that developers often fail to manually write tests for. Evaluated on the official JavaScript specification (ECMA-262), 98.4% of the tests Swami generated were precise to the specification. Using Swami to augment developer-written test suites improved coverage and identified 1 previously unknown defect and 15 missing JavaScript features in Rhino, 1 previously unknown defect in Node.js, and 18 semantic ambiguities in the ECMA-262 specification.
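Template-driven oracle extraction can be pictured with one invented rule: match specification sentences of the form "Throws E if condition" and emit a test skeleton. Swami's actual rules and generated-test format differ; this is only a sketch of the idea.

```python
# Toy oracle extraction from structured specification text. The single
# regex rule and the emitted assertion skeleton are invented, not
# Swami's real templates or output.
import re

RULE = re.compile(r"Throws (\w+) if (.+?)\.")

def oracle_templates(spec: str):
    for exception, condition in RULE.findall(spec):
        # Emit a Node-style test skeleton for the exceptional behavior.
        yield (f"assert.throws(() => callUnderTest(), {exception});"
               f"  // when {condition}")

spec_text = "Throws RangeError if the index is negative."
for test in oracle_templates(spec_text):
    print(test)
```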
2018
Brun, Yuriy; Meliou, Alexandra Software Fairness Proceedings Article In: Proceedings of the New Ideas and Emerging Results Track at the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 754–759, Lake Buena Vista, FL, USA, 2018. A goal of software engineering research is advancing software quality and the success of the software engineering process. However, while recent studies have demonstrated a new kind of defect in software related to its ability to operate in fair and unbiased manner, software engineering has not yet wholeheartedly tackled these new kinds of defects, thus leaving software vulnerable. This paper outlines a vision for how software engineering research can help reduce fairness defects and represents a call to action by the software engineering research community to reify that vision. Modern software is riddled with examples of biased behavior, from automated translation injecting gender stereotypes, to vision systems failing to see faces of certain races, to the US criminal justice system relying on biased computational assessments of crime recidivism. While systems may learn bias from biased data, bias can also emerge from ambiguous or incomplete requirement specification, poor design, implementation bugs, and unintended component interactions. We argue that software fairness is analogous to software quality, and that numerous software engineering challenges in the areas of requirements, specification, design, testing, and verification need to be tackled to solve this problem.
Angell, Rico; Johnson, Brittany; Brun, Yuriy; Meliou, Alexandra Themis: Automatically Testing Software for Discrimination Proceedings Article In: Proceedings of the Demonstrations Track at the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 871–875, Lake Buena Vista, FL, USA, 2018. Bias in decisions made by modern software is becoming a common and serious problem. We present Themis, an automated test suite generator to measure two types of discrimination, including causal relationships between sensitive inputs and program behavior. We explain how Themis can measure discrimination and aid its debugging, describe a set of optimizations Themis uses to reduce test suite size, and demonstrate Themis' effectiveness on open-source software. Themis is open-source and all our evaluation data are available at http://fairness.cs.umass.edu/. See a video of Themis in action: https://youtu.be/brB8wkaUesY
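Causal discrimination testing reduces to a simple experiment: hold every input fixed, flip only the sensitive attribute, and count how often the decision changes. The `decide` function below is an invented, deliberately biased stand-in for the software under test.

```python
# Sketch of the causal-discrimination measure: flip only the sensitive
# attribute and observe how often the program's decision changes.
import random

def decide(age: int, gender: str) -> bool:
    return age > 30 and gender == "male"  # deliberately biased example

def causal_discrimination(trials: int = 1000) -> float:
    flipped = 0
    for _ in range(trials):
        age = random.randint(18, 80)
        a = decide(age, "male")
        b = decide(age, "female")  # identical input, gender flipped
        flipped += (a != b)
    return flipped / trials

print(f"causal discrimination: {causal_discrimination():.2f}")
```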
Bang, Jae; Brun, Yuriy; Medvidovic, Nenad Collaborative Design Conflicts: Costs and Solutions Journal Article In: IEEE Software, vol. 35, no. 6, pp. 25–31, 2018, ISSN: 0740-7459. Collaborative design exposes software architects to the risk of making conflicting modeling changes that either cannot be merged, or, when merged, violate consistency rules, non-functional requirements, or other system constraints. Such design conflicts are common and incur a high cost, including having to redo and abandon work. Proactive design conflict detection can alleviate this risk. We motivate the need for design conflict detection, describe the benefits of such detection to practitioners, and identify requirements for building detection tools. Finally, we present FLAME, our collaborative-design framework that efficiently and continuously detects design conflicts.
Motwani, Manish; Sankaranarayanan, Sandhya; Just, René; Brun, Yuriy Do Automated Program Repair Techniques Repair Hard and Important Bugs? Journal Article In: Empirical Software Engineering (EMSE), vol. 23, no. 5, pp. 2901–2947, 2018, ISSN: 1382-3256. Existing evaluations of automated repair techniques focus on the fraction of the defects for which the technique can produce a patch, the time needed to produce patches, and how well patches generalize to the intended specification. However, these evaluations have not focused on the applicability of repair techniques and the characteristics of the defects that these techniques can repair. Questions such as "Can automated repair techniques repair defects that are hard for developers to repair?" and "Are automated repair techniques less likely to repair defects that involve loops?" have not, as of yet, been answered. To address such questions, we annotate two large benchmarks totaling 409 C and Java defects in real-world software, ranging from 22K to 2.8M lines of code, with measures of the defect's importance, the developer-written patch's complexity, and the quality of the test suite. We then analyze relationships between these measures and the ability to produce patches for the defects of seven automated repair techniques: AE, GenProg, Kali, Nopol, Prophet, SPR, and TrpAutoRepair. We find that automated repair techniques are less likely to produce patches for defects that required developers to write a lot of code or edit many files, or that have many tests relevant to the defect. Java techniques are more likely to produce patches for high-priority defects. Neither the time it took developers to fix a defect nor the test suite's coverage correlate with the automated repair techniques' ability to produce patches. Finally, automated repair techniques are less capable of fixing defects that require developers to add loops and new function calls, or to change method signatures. These findings identify strengths and shortcomings of the state-of-the-art of automated program repair along new dimensions. The presented methodology can drive research toward improving the applicability of automated repair techniques to hard and important bugs.
Shin, Seung Yeob; Brun, Yuriy; Balasubramanian, Hari; Henneman, Philip L.; Osterweil, Leon J. Discrete-Event Simulation and Integer Linear Programming for Constraint-Aware Resource Scheduling Journal Article In: IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 9, pp. 1578–1593, 2018. This paper presents a method for scheduling resources in complex systems that integrate humans with diverse hardware and software components, and for studying the impact of resource schedules on system characteristics. The method uses discrete-event simulation and integer linear programming, and relies on detailed models of the system's processes, specifications of the capabilities of the system's resources, and constraints on the operations of the system and its resources. As a case study, we examine processes involved in the operation of a hospital emergency department, studying the impact staffing policies have on such key quality measures as patient length of stay, number of handoffs, staff utilization levels, and cost. Our results suggest that physician and nurse utilization levels of 70% for clinical tasks result in a good balance between length of stay and cost. Allowing shift lengths to vary and shifts to overlap increases scheduling flexibility. Clinical experts provided face validation of our results. Our approach improves on the state of the art by enabling using detailed resource and constraint specifications effectively to support analysis and decision making about complex processes in domains that currently rely largely on trial and error and other ad hoc methods.
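A skeletal discrete-event simulation of the kind the paper builds on, with invented arrival and service rates; the real models capture far richer processes, resources, and constraints (and couple the simulation with integer linear programming for scheduling).

```python
# Skeletal discrete-event simulation: patients arrive, wait for one of
# n clinicians, and we record length of stay. All rates are invented.
import heapq, random

def simulate(n_clinicians=2, n_patients=100, seed=0):
    random.seed(seed)
    t, free_at, stays = 0.0, [0.0] * n_clinicians, []
    events = []  # (arrival_time, service_time), ordered by arrival
    for _ in range(n_patients):
        t += random.expovariate(1 / 10)         # ~10 min between arrivals
        heapq.heappush(events, (t, random.expovariate(1 / 15)))
    while events:
        arrive, service = heapq.heappop(events)
        clin = min(range(n_clinicians), key=lambda c: free_at[c])
        start = max(arrive, free_at[clin])      # wait if no one is free
        free_at[clin] = start + service
        stays.append(free_at[clin] - arrive)    # waiting plus treatment
    print(f"mean length of stay: {sum(stays) / len(stays):.1f} min")

simulate()
```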
Oliveira, Daniela Seabra; Lin, Tian; Rahman, Muhammad Sajidur; Akefirad, Rad; Ellis, Donovan; Perez, Eliany; Bobhate, Rahul; DeLong, Lois A.; Cappos, Justin; Brun, Yuriy; Ebner, Natalie C. API Blindspots: Why Experienced Developers Write Vulnerable Code Proceedings Article In: Proceedings of the USENIX Symposium on Usable Privacy and Security (SOUPS), pp. 315–328, Baltimore, MD, USA, 2018. Despite the best efforts of the security community, security vulnerabilities in software are still prevalent, with new vulnerabilities reported daily and older ones stubbornly repeating themselves. One potential source of these vulnerabilities is shortcomings in the used language and library APIs. Developers tend to trust APIs, but can misunderstand or misuse them, introducing vulnerabilities. We call the causes of such misuse blindspots. In this paper, we study API blindspots from the developers' perspective to: (1) determine the extent to which developers can detect API blindspots in code and (2) examine the extent to which developer characteristics (i.e., perception of code correctness, familiarity with code, confidence, professional experience, cognitive function, and personality) affect this capability. We conducted a study with 109 developers from four countries solving programming puzzles that involve Java APIs known to contain blindspots. We find that (1) The presence of blindspots correlated negatively with the developers' accuracy in answering implicit security questions and the developers' ability to identify potential security concerns in the code. This effect was more pronounced for I/O-related APIs and for puzzles with higher cyclomatic complexity. (2) Higher cognitive functioning and more programming experience did not predict better ability to detect API blindspots. (3) Developers exhibiting greater openness as a personality trait were more likely to detect API blindspots. This study has the potential to advance API security in (1) design, implementation, and testing of new APIs; (2) addressing blindspots in legacy APIs; (3) development of novel methods for developer recruitment and training based on cognitive and personality assessments; and (4) improvement of software development processes (e.g., establishment of security and functionality teams).
Goues, Claire Le; Brun, Yuriy; Apel, Sven; Berger, Emery; Khurshid, Sarfraz; Smaragdakis, Yannis Effectiveness of Anonymization in Double-Blind Review Journal Article In: Communications of the ACM, vol. 61, no. 6, pp. 34–37, 2018. @article{LeGoues18cacm, Double-blind review relies on the authors' ability and willingness to effectively anonymize their submissions. We explore anonymization effectiveness at ASE 2016, OOPSLA 2016, and PLDI 2016 by asking reviewers if they can guess author identities. We find that 74%-90% of reviews contain no correct guess and that reviewers who self-identify as experts on a paper's topic are more likely to attempt to guess, but no more likely to guess correctly. We present our findings, summarize the PC chairs' comments about administering double-blind review, discuss the advantages and disadvantages of revealing author identities part of the way through the process, and conclude by advocating for the continued use of double-blind review. |
Motwani, Manish; Sankaranarayanan, Sandhya; Just, René; Brun, Yuriy Do Automated Program Repair Techniques Repair Hard and Important Bugs? Proceedings Article In: Proceedings of the Journal First Track at the International Conference on Software Engineering (ICSE), pp. 25, Gothenburg, Sweden, 2018. @inproceedings{Motwani18emse-icseJF, The full version of this article is: Manish Motwani, Sandhya Sankaranarayanan, René Just, and Yuriy Brun, ``Do Automated Program Repair Techniques Repair Hard and Important Bugs?'' in Empirical Software Engineering, https://doi.org/10.1007/s10664-017-9550-0 |
Shahbazian, Arman; Lee, Youn Kyu; Brun, Yuriy; Medvidovic, Nenad Poster: Making Well-Informed Software Design Decisions Proceedings Article In: Proceedings of the Poster Track at the International Conference on Software Engineering (ICSE), pp. 262–263, Gothenburg, Sweden, 2018. @inproceedings{Shahbazian18icse-poster, Design decisions software architects make directly impact system quality. Real-world systems involve a large number of such decisions, and each decision is typically influenced by others and involves trade-offs in system properties. This paper poses the problem of making complex, interacting design decisions relatively early in the project’s lifecycle and outlines a search-based and simulation-based approach for helping architects make these decisions and understand their effects. |
Brun, Yuriy; Johnson, Brittany; Meliou, Alexandra (Ed.) Proceedings of the IEEE/ACM International Workshop on Software Fairness (FairWare) Proceedings Gothenburg, Sweden, 2018, ISBN: 978-1-4503-5746-3. @proceedings{Brun18fairwareproceeings, |
Brun, Yuriy; Johnson, Brittany; Meliou, Alexandra Message from the FairWare 2018 Chairs Proceedings Article In: Proceedings of the IEEE/ACM International Workshop on Software Fairness, Gothenburg, Sweden, 2018. @inproceedings{Brun18fairwarewelcome, |
Shahbazian, Arman; Lee, Youn Kyu; Le, Duc; Brun, Yuriy; Medvidovic, Nenad Recovering Architectural Design Decisions Proceedings Article In: Proceedings of the IEEE International Conference on Software Architecture (ICSA), pp. 95–104, Seattle, WA, USA, 2018. @inproceedings{Shahbazian18icsa, Designing and maintaining a software system's architecture typically involve making numerous design decisions, each potentially affecting the system's functional and nonfunctional properties. Understanding these design decisions can help inform future decisions and implementation choices and can avoid introducing regressions and architectural inefficiencies later. Unfortunately, design decisions are rarely well documented and are typically a lost artifact of the architecture creation and maintenance process. The loss of this information can thus hurt development. To address this shortcoming, we develop RecovAr, a technique for automatically recovering design decisions from the project's readily available history artifacts, such as an issue tracker and version control repository. RecovAr uses state-of-the-art architectural recovery techniques on a series of version control commits and maps those commits to issues to identify decisions that affect system architecture. While some decisions can still be lost through this process, our evaluation on Hadoop and Struts, two large open-source systems with over 8 years of development each and, on average, more than 1 million lines of code, shows that RecovAr has a recall of 75% and a precision of 77%. Our work formally defines architectural design decisions and develops an approach for tracing such decisions in project histories. Additionally, the work introduces methods to classify whether decisions are architectural and to map decisions to code elements. Finally, our work contributes a methodology engineers can follow to preserve design-decision knowledge in their projects. |
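RecovAr maps version-control commits to issue-tracker entries. A common way to establish such links, shown below as our own illustration rather than RecovAr's exact heuristic, is to scan commit messages for issue keys like Hadoop's JIRA identifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative commit-to-issue linking: extract JIRA-style issue keys
// (e.g., "HADOOP-1234") from a commit message. A sketch of the general
// idea only, not RecovAr's actual mapping heuristic.
public class IssueLinker {
    private static final Pattern ISSUE_KEY =
            Pattern.compile("\\b[A-Z][A-Z0-9]+-\\d+\\b");

    public static List<String> linkedIssues(String commitMessage) {
        List<String> keys = new ArrayList<>();
        Matcher m = ISSUE_KEY.matcher(commitMessage);
        while (m.find()) keys.add(m.group());
        return keys;
    }
}
```

For example, linkedIssues("HADOOP-1234: harden RPC retry logic") returns [HADOOP-1234], which can then be joined against the issue tracker to recover the rationale behind the change.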
Lemos, Rogério; Garlan, David; Ghezzi, Carlo; Giese, Holger; Andersson, Jesper; Litoiu, Marin; Schmerl, Bradley; Weyns, Danny; Baresi, Luciano; Bencomo, Nelly; Brun, Yuriy; Camara, Javier; Calinescu, Radu; Cohen, Myra B.; Gorla, Alessandra; Grassi, Vincenzo; Grunske, Lars; Inverardi, Paola; Jezequel, Jean-Marc; Malek, Sam; Mirandola, Raffaela; Mori, Marco; Müller, Hausi A.; Rouvoy, Romain; Rubira, Cecilia M. F.; Rutten, Eric; Shaw, Mary; Tamburrelli, Giordano; Tamura, Gabriel; Villegas, Norha M.; Vogel, Thomas; Zambonelli, Franco Software Engineering for Self-Adaptive Systems: Research Challenges in the Provision of Assurances Book Section In: Lemos, Rogério; Garlan, David; Ghezzi, Carlo; Giese, Holger (Ed.): Software Engineering for Self-Adaptive Systems III, vol. 9640, pp. 3–30, Springer, 2018, ISBN: 978-3-319-74183-3. @incollection{Lemos18SEfSAS, An important concern for modern software systems is to be cost-effective while remaining versatile, flexible, resilient, dependable, energy-efficient, customisable, configurable, and self-optimising when reacting to run-time changes that may occur within the system itself, its environment, or its requirements. One of the most promising approaches to achieving such properties is to equip software systems with self-managing capabilities using self-adaptation mechanisms. Despite recent advances in this area, one key aspect of self-adaptive systems that remains to be tackled in depth is the provision of assurances, i.e., the collection, analysis, and synthesis of evidence that the system satisfies its stated functional and non-functional requirements during its operation in the presence of self-adaptation. The provision of assurances for self-adaptive systems is challenging since run-time changes introduce a high degree of uncertainty. This research-challenges paper complements previous roadmap papers on software engineering for self-adaptive systems by covering a different set of assurance-related topics: perpetual assurances, composition and decomposition of assurances, and assurances obtained from control theory. The paper is one of the many results of the Dagstuhl Seminar 13511 on Software Engineering for Self-Adaptive Systems: Assurances, which took place in December 2013. |
Schmerl, Bradley; Andersson, Jesper; Vogel, Thomas; Cohen, Myra B.; Rubira, Cecilia M. F.; Brun, Yuriy; Gorla, Alessandra; Zambonelli, Franco; Baresi, Luciano Challenges in Composing and Decomposing Assurances for Self-Adaptive Systems Book Section In: Lemos, Rogério; Garlan, David; Ghezzi, Carlo; Giese, Holger (Ed.): Software Engineering for Self-Adaptive Systems III. Assurances, vol. 9640, pp. 64–89, Springer, 2018, ISBN: 978-3-319-74183-3. @incollection{Schmerl18SEfSAS, Self-adaptive software systems adapt to changes in the environment, in the system itself, in their requirements, or in their business objectives. Typically, these systems attempt to maintain system goals at run time and often provide assurance that they will meet their goals under dynamic and uncertain circumstances. While significant research has focused on ways to engineer self-adaptive capabilities into both new and legacy software systems, less work has been conducted on how to assure that self-adaptation maintains system goals. For traditional, especially safety-critical software systems, assurance techniques decompose assurances into sub-goals and evidence that can be provided by parts of the system. Approaches also exist for composing assurances, both across multiple goals and across systems of systems. While some of these techniques may be applied to self-adaptive systems, in this chapter we argue that several significant challenges remain in doing so. We discuss how existing assurance techniques can be applied to composing and decomposing assurances for self-adaptive systems, highlight the challenges in applying them, summarize existing research to address some of these challenges, and identify gaps and opportunities to be addressed by future research. |
2017 |
Goues, Claire Le; Brun, Yuriy; Forrest, Stephanie; Weimer, Westley Clarifications on the Construction and Use of the ManyBugs Benchmark Journal Article In: IEEE Transactions on Software Engineering (TSE), comment paper, vol. 43, no. 11, pp. 1089–1090, 2017, ISSN: 0098-5589. @article{LeGoues17tse, High-quality research requires timely dissemination and the incorporation of feedback. Since the publication of the ManyBugs benchmark and its release on http://repairbenchmarks.cs.umass.edu, Fan Long and Martin Rinard have provided feedback on the benchmark's construction and use. Here, we describe that feedback and our subsequent improvements to the ManyBugs benchmark. We thank these researchers for the feedback. |
Weiss, Aaron; Guha, Arjun; Brun, Yuriy Tortoise: Interactive System Configuration Repair Proceedings Article In: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 625–636, Urbana-Champaign, IL, USA, 2017. @inproceedings{Weiss17ase, System configuration languages provide powerful abstractions that simplify managing large-scale, networked systems. Thousands of organizations now use configuration languages, such as Puppet. However, specifications written in configuration languages can have bugs and the shell remains the simplest way to debug a misconfigured system. Unfortunately, it is unsafe to use the shell to fix problems when a system configuration language is in use: a fix applied from the shell may cause the system to drift from the state specified by the configuration language. Thus, despite their advantages, configuration languages force system administrators to give up the simplicity and familiarity of the shell. This paper presents a synthesis-based technique that allows administrators to use configuration languages and the shell in harmony. Administrators can fix errors using the shell and the technique automatically repairs the higher-level specification written in the configuration language. The approach (1) produces repairs that are consistent with the fix made using the shell; (2) produces repairs that are maintainable by minimizing edits made to the original specification; (3) ranks and presents multiple repairs when relevant; and (4) supports all shells the administrator may wish to use. We implement our technique for Puppet, a widely used system configuration language, and evaluate it on a suite of benchmarks under 42 repair scenarios. The top-ranked repair is selected by humans 76% of the time and the human-equivalent repair is ranked 1.31 on average. |
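One of the abstract's criteria is that repairs should minimize edits to the original specification, keeping repairs maintainable. The sketch below illustrates that ranking criterion in isolation, using plain Levenshtein distance over specification text as a stand-in for Tortoise's actual edit metric.

```java
import java.util.Comparator;
import java.util.List;

// Rank candidate repaired specifications by closeness to the original.
// Levenshtein distance is a stand-in for Tortoise's real edit metric.
public class RepairRanker {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(subst, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    // Sort candidates so the fewest-edit (most maintainable) repair is first.
    public static void rank(String original, List<String> candidates) {
        candidates.sort(Comparator.comparingInt(c -> levenshtein(original, c)));
    }
}
```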
Galhotra, Sainyam; Brun, Yuriy; Meliou, Alexandra Fairness Testing: Testing Software for Discrimination Proceedings Article In: Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 498–510, Paderborn, Germany, 2017, (ACM SIGSOFT Distinguished Paper Award.). @inproceedings{Galhotra17fse, This paper defines the notions of software fairness and discrimination and develops a testing-based method for measuring if and how much software discriminates. Specifically, the paper focuses on measuring causality in discriminatory behavior. Modern software contributes to important societal decisions and evidence of software discrimination has been found in systems that recommend criminal sentences, grant access to financial loans and products, and determine who is allowed to participate in promotions and receive services. Our approach, Themis, measures discrimination in software by generating efficient, discrimination-testing test suites. Given a schema describing valid system inputs, Themis generates discrimination tests automatically and, notably, does not require an oracle. We evaluate Themis on 20 software systems, 12 of which come from prior work with explicit focus on avoiding discrimination. We find that (1) Themis is effective at discovering software discrimination, (2) state-of-the-art techniques for removing discrimination from algorithms fail in many situations, at times discriminating against as much as 98% of an input subdomain, (3) Themis optimizations are effective at producing efficient test suites for measuring discrimination, and (4) Themis is more efficient on systems that exhibit more discrimination. We thus demonstrate that fairness testing is a critical aspect of the software development cycle in domains with possible discrimination and provide initial tools for measuring software discrimination. |
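The core causal check behind this kind of fairness testing is simple to state: hold every attribute of an input fixed, flip only a sensitive attribute, and see whether the software's decision changes. Below is a minimal sketch of that check; the classifier interface and attribute representation are hypothetical, not Themis's actual API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Causal discrimination check: flip only the sensitive attribute and
// compare decisions. Interface and attribute names are hypothetical.
public class CausalTest {
    public static boolean decisionFlips(
            Function<Map<String, String>, Boolean> decide,
            Map<String, String> input, String sensitiveAttr, String altValue) {
        Map<String, String> flipped = new HashMap<>(input);
        flipped.put(sensitiveAttr, altValue);
        return !decide.apply(input).equals(decide.apply(flipped));
    }
}
```

Running this check over many inputs generated from the input schema and reporting the fraction for which the decision flips yields a causal discrimination score for the sensitive attribute.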
Bang, Jae; Brun, Yuriy; Medvidovic, Nenad Continuous Analysis of Collaborative Design Proceedings Article In: Proceedings of the IEEE International Conference on Software Architecture (ICSA), pp. 97–106, Gothenburg, Sweden, 2017, (Best Paper Award.). @inproceedings{Bang17icsa, In collaborative design, architects' individual design decisions may conflict and, when joined, may violate system consistency rules or non-functional requirements. These design conflicts can hinder collaboration and result in wasted effort. Proactive detection of code-level conflicts has been shown to improve collaborative productivity; however, the computational resource requirements for proactively computing design conflicts have hindered its applicability in practice. Our survey and interviews of 50 architects from six large software companies find that 60% of their projects involve collaborative design, that architects consider integration costly, and that design conflicts are frequent and lead to lost work. To aid collaborative design, we re-engineer FLAME, our prior design conflict detection technique, to use cloud resources and a novel prioritization algorithm that, together, achieve efficient and nonintrusive conflict detection, and guarantee a bound on the time before a conflict is discovered. Two controlled experiments with 90 students trained in software architecture in a professional graduate program demonstrate that architects using FLAME design more efficiently, produce higher-quality designs, repair conflicts faster, and prefer using FLAME. An empirical performance evaluation demonstrates FLAME's scalability and verifies its time-bound guarantees. |
Wang, Qianqian; Brun, Yuriy; Orso, Alessandro Behavioral Execution Comparison: Are Tests Representative of Field Behavior? Proceedings Article In: Proceedings of the 10th IEEE International Conference on Software Testing, Verification, and Validation (ICST), pp. 321–332, Tokyo, Japan, 2017. @inproceedings{Wang17icst, Software testing is the most widely used approach for assessing and improving software quality, but it is inherently incomplete and may not be representative of how the software is used in the field. This paper addresses two questions: to what extent do tests represent how real users use software, and how can we measure behavioral differences between test and field executions? We study four real-world systems, one used by end-users and three used by other (client) software, and compare test suites written by the systems' developers to field executions using four models of behavior: statement coverage, method coverage, mutation score, and a temporal-invariant-based model we developed. We find that developer-written test suites fail to accurately represent field executions: the tests, on average, miss 6.2% of the statements and 7.7% of the methods exercised in the field; the behavior exercised only in the field kills an extra 8.6% of the mutants; finally, the tests miss 52.6% of the behavioral invariants that occur in the field. In addition, augmenting the in-house test suites with automatically-generated tests only marginally improves the tests' behavioral representativeness. These differences between field and test executions, and in particular the finer-grained and more sophisticated ones that we measured using our invariant-based model, can provide insight for developers and suggest a better method for measuring test suite quality. |
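The coverage-based portion of such a comparison reduces to set arithmetic over covered program elements. A minimal sketch of that computation, with the covered-element sets standing in for the output of a coverage tool:

```java
import java.util.HashSet;
import java.util.Set;

// What fraction of field-exercised elements (statements, methods) do the
// tests miss? The sets stand in for the output of a coverage tool.
public class CoverageDiff {
    public static double missedFraction(Set<String> testCovered,
                                        Set<String> fieldCovered) {
        Set<String> missed = new HashSet<>(fieldCovered);
        missed.removeAll(testCovered);
        return fieldCovered.isEmpty()
                ? 0.0 : (double) missed.size() / fieldCovered.size();
    }
}
```

The paper's 6.2% (statements) and 7.7% (methods) figures are missed fractions of this kind; its invariant-based model captures finer-grained behavioral differences that coverage alone cannot.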
2016 |
Beschastnikh, Ivan; Wang, Patty; Brun, Yuriy; Ernst, Michael D. Debugging Distributed Systems Journal Article In: Communications of the ACM, vol. 59, no. 8, pp. 32–37, 2016. @article{Beschastnikh16cacm, Distributed systems pose unique challenges for software developers. Reasoning about concurrent activities of system nodes and even understanding the system's communication topology can be difficult. A standard approach to gaining insight into system activity is to analyze system logs. Unfortunately, this can be a tedious and complex process. This article looks at several key features and debugging challenges that differentiate distributed systems from other kinds of software. The article presents several promising tools and ongoing research to help resolve these challenges. |
Ohmann, Tony; Stanley, Ryan; Beschastnikh, Ivan; Brun, Yuriy Visually Reasoning about System and Resource Behavior Proceedings Article In: Proceedings of the Demonstrations Track of the 38th International Conference on Software Engineering (ICSE Demo), pp. 601–604, Austin, TX, USA, 2016. @inproceedings{Ohmann16tool-demo-icse, As devices proliferate and software is ported to new platforms, developers must reconstruct how their software utilizes resources, such as network bandwidth and memory. Existing software comprehension approaches rarely consider how resource utilization affects system behavior. We present Perfume, a general-purpose tool to help developers understand how resource utilization impacts their systems' control flow. Perfume is broadly applicable, as it is configurable to parse a wide variety of execution log formats and applies to all resource types that can be represented numerically. Perfume mines temporal properties that hold over the logged executions and represents system behavior in a resource finite state automaton that satisfies the mined properties. Perfume's interactive interface allows the developer to understand system behavior and formulate and test hypotheses about system executions. A controlled experiment with 40 students shows that Perfume effectively supports understanding and debugging tasks. Students using Perfume answered 8.3% more questions correctly than those using execution logs alone and did so 15.5% more quickly. Perfume is open source and deployed at http://perfume.cs.umass.edu/. |
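In the spirit of Perfume's mining step, the sketch below records, for each observed pair of consecutive log events, the minimum and maximum change in a numeric resource; such bounds are the raw material for resource-annotated temporal properties. The log representation and event names are our assumptions for illustration, not Perfume's parser or property templates.

```java
import java.util.HashMap;
import java.util.Map;

// For each consecutive event pair in a logged execution, track the min and
// max change in a numeric resource (e.g., elapsed time or bytes sent).
// The log representation is an illustrative assumption, not Perfume's format.
public class ResourceMiner {
    record Bounds(double min, double max) {}

    public static Map<String, Bounds> mine(String[] events, double[] resource) {
        Map<String, Bounds> bounds = new HashMap<>();
        for (int i = 1; i < events.length; i++) {
            String edge = events[i - 1] + " -> " + events[i];
            double delta = resource[i] - resource[i - 1];
            bounds.merge(edge, new Bounds(delta, delta),
                    (old, cur) -> new Bounds(Math.min(old.min(), cur.min()),
                                             Math.max(old.max(), cur.max())));
        }
        return bounds;
    }
}
```

Aggregating these per-edge bounds across many logged executions yields the kind of resource-annotated behavioral model over which a developer can formulate and test hypotheses.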
Shin, Seung Yeob; Brun, Yuriy; Osterweil, Leon J. Specification and Analysis of Human-Intensive System Resource-Utilization Policies Proceedings Article In: Proceedings of the 8th International Workshop on Software Engineering in Healthcare Systems (SEHS), pp. 8–14, Austin, TX, USA, 2016. @inproceedings{Shin16SEHS, Complex, human-intensive systems, such as those used in hospital Emergency Departments, typically require the effective support of many types of resources, each governed by potentially complex utilization policies. Resource utilization policies range from simple, e.g., sickest patient first, to extremely complex, responding to changes in system environment, state, and stimuli. Further, policies may at times conflict with each other, requiring conflict resolution strategies that further increase the complexity. Sound policies for the management of these resources are crucial in assuring that these systems achieve their key goals. To help system developers make sound resource management decisions, this paper presents a resource utilization policy specification and analysis framework for complex human-intensive systems. We provide (1) a precise specification language to describe very diverse and potentially complex resource utilization policies, (2) a process- and resource-aware discrete-event simulation engine that executes simulations to dynamically evaluate the policies' effects on the outcomes achieved by systems that use the resources, and (3) a process- and resource-aware finite state verification framework that supports formal verification that resource management policies are correctly implemented by these simulations. |
Beschastnikh, Ivan; Wang, Patty; Brun, Yuriy; Ernst, Michael D. Debugging Distributed Systems Journal Article In: ACM Queue, vol. 14, no. 2, pp. 91–110, 2016. @article{Beschastnikh16queue, Distributed systems pose unique challenges for software developers. Reasoning about concurrent activities of system nodes and even understanding the system's communication topology can be difficult. A standard approach to gaining insight into system activity is to analyze system logs. Unfortunately, this can be a tedious and complex process. This article looks at several key features and debugging challenges that differentiate distributed systems from other kinds of software. The article presents several promising tools and ongoing research to help resolve these challenges. |