Publications
2024
Eladawy, Hadeel; Goues, Claire Le; Brun, Yuriy Automated Program Repair, What Is It Good For? Not Absolutely Nothing! Proceedings Article In: Proceedings of the 46th International Conference on Software Engineering (ICSE), Lisbon, Portugal, 2024. Industrial deployments of automated program repair (APR), e.g., at Facebook and Bloomberg, signal a new milestone for this exciting and potentially impactful technology. In these deployments, developers use APR-generated patch suggestions as part of a human-driven debugging process. Unfortunately, little is known about how using patch suggestions affects developers during debugging. This paper conducts a controlled user study with 40 developers with a median of 6 years of experience. The developers engage in debugging tasks on nine naturally-occurring defects in real-world, open-source, Java projects, using Recoder, SimFix, and TBar, three state-of-the-art APR tools. For each debugging task, the developers either have access to the project's tests, or, also, to code suggestions that make all the tests pass. These suggestions are either developer-written or APR-generated, which can be correct or deceptive. Deceptive suggestions, which are a common APR occurrence, make all the available tests pass but fail to generalize to the intended specification. Through a total of 160 debugging sessions, we find that access to a code suggestion significantly increases the odds of submitting a patch. Correct APR suggestions increase the odds of debugging success by 14,000%, but deceptive suggestions decrease the odds of success by 65%. Correct suggestions also speed up debugging. Surprisingly, we observe no significant difference in how novice and experienced developers are affected by APR, suggesting that APR may find uses across the experience spectrum. Overall, developers come away with a strong positive impression of APR, suggesting promise for APR-mediated, human-driven debugging, despite existing challenges in APR-generated repair quality.
Gaba, Aimen; Kaufman, Zhanna; Cheung, Jason; Shvakel, Marie; Hall, Kyle Wm; Brun, Yuriy; Bearfield, Cindy Xiong My Model is Unfair, Do People Even Care? Visual Design Affects Trust and Perceived Bias in Machine Learning Journal Article In: IEEE Transactions on Visualization and Computer Graphics (TVCG), vol. 30, no. 1, pp. 327-337, 2024. Machine learning technology has become ubiquitous, but, unfortunately, often exhibits bias. As a consequence, disparate stakeholders need to interact with and make informed decisions about using machine learning models in everyday systems. Visualization technology can support stakeholders in understanding and evaluating trade-offs between, for example, accuracy and fairness of models. This paper aims to empirically answer "Can visualization design choices affect a stakeholder's perception of model bias, trust in a model, and willingness to adopt a model?" Through a series of controlled, crowd-sourced experiments with more than 1,500 participants, we identify a set of strategies people follow in deciding which models to trust. Our results show that men and women prioritize fairness and performance differently and that visual design choices significantly affect that prioritization. For example, women trust fairer models more often than men do, participants value fairness more when it is explained using text than as a bar chart, and being explicitly told a model is biased has a bigger impact than showing past biased performance. We test the generalizability of our results by comparing the effect of multiple textual and visual design choices and offer potential explanations of the cognitive mechanisms behind the difference in fairness perception and trust. Our research guides design considerations to support future work developing visualization systems for machine learning.
2023
First, Emily; Rabe, Markus; Ringer, Talia; Brun, Yuriy Baldur: Whole-Proof Generation and Repair with Large Language Models Proceedings Article In: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), San Francisco, CA, USA, 2023. Formally verifying software properties is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time, and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once, rather than one step at a time. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. As its main contributions, this paper demonstrates for the first time that: (1) Whole-proof generation using transformers is possible and is as effective as search-based techniques without requiring costly search. (2) Giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair and further improves automated proof generation. (3) We establish a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. In addition to empirically showing the effectiveness of whole-proof generation, repair, and added context, we show that Baldur improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.
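To make the generate-then-repair loop concrete, here is a minimal sketch. The `propose_proof` and `check` functions are invented stubs standing in for the fine-tuned LLM and the Isabelle/HOL proof checker; they do not reflect Baldur's actual interfaces.

```python
# Hypothetical sketch of a generate-then-repair loop in Baldur's style.
# `propose_proof` stands in for a fine-tuned LLM and `check` for the
# Isabelle/HOL proof checker; neither reflects Baldur's real API.

def propose_proof(theorem: str, context: str = "") -> str:
    # Stub: a real implementation would sample from the LLM.
    return "by auto"

def check(theorem: str, proof: str) -> tuple[bool, str]:
    # Stub: a real implementation would invoke the proof assistant.
    return (False, "Failed to apply 'auto'")

def prove(theorem: str, attempts: int = 2) -> str | None:
    proof = propose_proof(theorem)            # whole proof, in one shot
    for _ in range(attempts):
        ok, error = check(theorem, proof)
        if ok:
            return proof
        # Repair: re-prompt with the failed attempt and error message.
        context = f"Failed proof: {proof}\nError: {error}"
        proof = propose_proof(theorem, context)
    return None
```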
Hoag, Austin; Kostas, James E.; Silva, Bruno Castro; Thomas, Philip S.; Brun, Yuriy Seldonian Toolkit: Building Software with Safe and Fair Machine Learning Proceedings Article In: Proceedings of the Demonstrations Track at the 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 2023. We present the Seldonian Toolkit, which enables software engineers to integrate provably safe and fair machine learning algorithms into their systems. Software systems that use data and machine learning are routinely deployed in a wide range of settings, including medical applications, autonomous vehicles, the criminal justice system, and hiring processes. These systems, however, can produce unsafe and unfair behavior, such as suggesting potentially fatal medical treatments, making racist or sexist predictions, or facilitating radicalization and polarization. To reduce these undesirable behaviors, software engineers need the ability to easily integrate their machine-learning-based systems with domain-specific safety and fairness requirements defined by domain experts, such as doctors and hiring managers. The Seldonian Toolkit provides special machine learning algorithms that enable software engineers to incorporate such expert-defined requirements of safety and fairness into their systems, while provably guaranteeing those requirements will be satisfied. A video demonstrating the Seldonian Toolkit is available at https://youtu.be/wHR-hDm9jX4/.
Agrawal, Arpan; First, Emily; Kaufman, Zhanna; Reichel, Tom; Zhang, Shizhuo; Zhou, Timothy; Sanchez-Stern, Alex; Ringer, Talia; Brun, Yuriy Proofster: Automated Formal Verification Proceedings Article In: Proceedings of the Demonstrations Track at the 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 2023. Formal verification is an effective but extremely work-intensive method of improving software quality. Verifying the correctness of software systems often requires significantly more effort than implementing them in the first place, despite the existence of proof assistants, such as Coq, aiding the process. Recent work has aimed to fully automate the synthesis of formal verification proofs, but little tool support exists for practitioners. This paper presents Proofster, a web-based tool aimed at assisting developers with the formal verification process via proof synthesis. Proofster inputs a Coq theorem specifying a property of a software system and attempts to automatically synthesize a formal proof of the correctness of that property. When it is unable to produce a proof, Proofster outputs the proof-space search tree its synthesis explored, which can guide the developer to provide a hint to enable Proofster to synthesize the proof. Proofster runs online at https://proofster.cs.umass.edu/ and a video demonstrating Proofster is available at https://youtu.be/xQAi66lRfwI/.
Talebipour, Saghar; Park, Hyojae; Baral, Kesina; Yee, Leon; Khan, Safwat Ali; Moran, Kevin; Brun, Yuriy; Medvidovic, Nenad; Zhao, Yixue AVGUST: A Tool for Generating Usage-Based Tests from Videos of App Executions Proceedings Article In: Proceedings of the Demonstrations Track at the 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 2023. Creating UI tests for mobile applications is a difficult and time-consuming task. As such, there has been a considerable amount of work carried out to automate the generation of mobile tests, largely focused upon the goals of maximizing code coverage or finding crashes. However, comparatively fewer automated techniques have been proposed to generate a highly sought-after type of test: usage-based tests. These tests exercise targeted app functionalities for activities such as regression testing. In this paper, we present the Avgust tool for automating the construction of usage-based tests for mobile apps. Avgust learns usage patterns from videos of app executions collected by beta testers or crowd-workers, translates these into an app-agnostic state-machine encoding, and then uses this encoding to generate new test cases for an unseen target app. We evaluated Avgust on 374 videos of use cases from 18 popular apps and found that it can successfully exercise the desired usage in 69% of the tests. Avgust is an open-source tool available at https://github.com/felicitia/UsageTesting-Repo/tree/demo/. A video illustrating the capabilities of Avgust can be found at: https://youtu.be/LPICxVd0YAg.
Motwani, Manish; Brun, Yuriy Understanding Why and Predicting When Developers Adhere to Code-Quality Standards Proceedings Article In: Proceedings of the Software Engineering in Practice Track at the 45th International Conference on Software Engineering (ICSE SEIP), Melbourne, Australia, 2023. Static analysis tools are widely used in software development. While research has focused on improving tool accuracy, evidence at Microsoft suggests that developers often consider some accurately detected warnings not worth fixing: what these tools and developers consider to be true positives differs. Thus, improving tool utilization requires understanding when and why developers fix static-analysis warnings. We conduct a case study of Microsoft's Banned API Standard used within the company, which describes 195 APIs that can potentially cause vulnerabilities and 142 recommended replacements. We find that developers often (84% of the time) consciously deviate from this standard, specifying their rationale, allowing us to study why and when developers deviate from standards. We then identify 23 factors that correlate with developers using the preferred APIs and build a model that predicts whether the developers would use the preferred or discouraged APIs under different circumstances with 92% accuracy. We also train a model to predict the kind of APIs developers would use in the future based on their past development activity, with 86% accuracy. We outline a series of concrete suggestions static analysis developers can use to prioritize and customize their output, potentially increasing their tools' usefulness.
Motwani, Manish; Brun, Yuriy Better Automatic Program Repair by Using Bug Reports and Tests Together Proceedings Article In: Proceedings of the 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 2023. Automated program repair is already deployed in industry, but concerns remain about repair quality. Recent research has shown that one of the main reasons repair tools produce incorrect (but seemingly correct) patches is imperfect fault localization (FL). This paper demonstrates that combining information from natural-language bug reports and test executions when localizing bugs can have a significant positive impact on repair quality. By modifying existing repair tools to use FL that combines bug reports and tests, we are able to correctly repair 7 defects in Defects4J that no prior tools have repaired correctly. We develop Blues, the first information-retrieval-based, statement-level FL technique that requires no training data. We further develop RAFL, the first unsupervised method for combining multiple FL techniques, which outperforms a supervised method. Using RAFL, we create SBIR by combining Blues with a spectrum-based (SBFL) technique. Evaluated on 815 real-world defects, SBIR consistently ranks buggy statements higher than its underlying techniques. Finally, we modify three state-of-the-art repair tools, Arja, SequenceR, and SimFix, to use SBIR, SBFL, and Blues as their internal FL. We evaluate the quality of the produced patches on 689 real-world defects. Arja and SequenceR significantly benefit from SBIR: Arja using SBIR correctly repairs 28 defects, but only 21 using SBFL, and only 15 using Blues; SequenceR using SBIR correctly repairs 12 defects, but only 10 using SBFL, and only 4 using Blues. SimFix (which has internal mechanisms to overcome poor FL) correctly repairs 30 defects using SBIR and SBFL, but only 13 using Blues. Our promising findings direct further research into combining data from bug reports and test executions for FL and program repair.
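The abstract does not spell out RAFL's aggregation method, so the sketch below conveys only the general idea of combining two fault-localization rankings without training data, using a simple normalized score sum as a hypothetical stand-in; the statement names and scores are invented.

```python
# Hypothetical illustration of combining two fault-localization (FL)
# score maps without training data. RAFL's actual aggregation method is
# not described here; a normalized score sum merely conveys the idea.

def normalize(scores: dict[str, float]) -> dict[str, float]:
    hi = max(scores.values()) or 1.0
    return {stmt: s / hi for stmt, s in scores.items()}

def combine(sbfl: dict[str, float], ir_fl: dict[str, float]) -> list[str]:
    sbfl, ir_fl = normalize(sbfl), normalize(ir_fl)
    stmts = set(sbfl) | set(ir_fl)
    total = {s: sbfl.get(s, 0.0) + ir_fl.get(s, 0.0) for s in stmts}
    # Rank statements by combined suspiciousness, most suspicious first.
    return sorted(stmts, key=lambda s: -total[s])

ranking = combine({"Foo.java:42": 0.9, "Foo.java:43": 0.2},
                  {"Foo.java:42": 0.7, "Bar.java:10": 0.6})
print(ranking)  # 'Foo.java:42' ranks first, agreeing with both signals
```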
Brun, Yuriy; Lin, Tian; Somerville, Jessie Elise; Myers, Elisha M.; Ebner, Natalie Blindspots in Python and Java APIs Result in Vulnerable Code Journal Article In: ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 32, no. 3, pp. 76:1–76:31, 2023, ISSN: 1049-331X. Blindspots in APIs can cause software engineers to introduce vulnerabilities, but such blindspots are, unfortunately, common. We study the effect APIs with blindspots have on developers in two languages by replicating a 109-developer, 24-Java-API controlled experiment. Our replication applies to Python and involves 129 new developers and 22 new APIs. We find that using APIs with blindspots statistically significantly reduces the developers' ability to correctly reason about the APIs in both languages, but that the effect is more pronounced for Python. Interestingly, for Java, the effect increased with complexity of the code relying on the API, whereas for Python, the opposite was true. This suggests that Python developers are less likely to notice potential for vulnerabilities in complex code than in simple code, whereas Java developers are more likely to recognize the extra complexity and apply more care, but are more careless with simple code. Whether the developers considered API uses to be more difficult, less clear, and less familiar did not have an effect on their ability to correctly reason about them. Developers with better long-term memory recall were more likely to correctly reason about APIs with blindspots, but short-term memory, processing speed, episodic memory, and memory span had no effect. Surprisingly, professional experience and expertise did not improve the developers' ability to reason about APIs with blindspots across both languages, with long-term professionals with many years of experience making mistakes as often as relative novices. Finally, personality traits did not significantly affect the Python developers' ability to reason about APIs with blindspots, but less extraverted and more open developers were better at reasoning about Java APIs with blindspots. Overall, our findings suggest that blindspots in APIs are a serious problem across languages, and that experience and education alone do not overcome that problem, suggesting that tools are needed to help developers recognize blindspots in APIs as they write code that uses those APIs.
Johnson, Brittany; Bartola, Jesse; Angell, Rico; Witty, Sam; Giguere, Stephen J.; Brun, Yuriy Fairkit, Fairkit, on the Wall, Who's the Fairest of Them All? Supporting Data Scientists in Training Fair Models Journal Article In: EURO Journal on Decision Processes, vol. 11, 2023. Modern software relies heavily on data and machine learning, and affects decisions that shape our world. Unfortunately, recent studies have shown that because of biases in data, software systems frequently inject bias into their decisions, from producing more errors when transcribing women's than men's voices to overcharging people of color for financial loans. To address bias in software, data scientists and software engineers need tools that help them understand the trade-offs between model quality and fairness in their specific data domains. Toward that end, we present fairkit-learn, an interactive toolkit for helping engineers reason about and understand fairness. Fairkit-learn supports over 70 definitions of fairness and works with state-of-the-art machine learning tools, using the same interfaces to ease adoption. It can evaluate thousands of models produced by multiple machine learning algorithms, hyperparameters, and data permutations, and compute and visualize a small Pareto-optimal set of models that describe the optimal trade-offs between fairness and quality. Engineers can then iterate, improving their models and evaluating them using fairkit-learn. We evaluate fairkit-learn via a user study with 54 students, showing that students using fairkit-learn produce models that provide a better balance between fairness and quality than students using scikit-learn and IBM AI Fairness 360 toolkits. With fairkit-learn, users can select models that are up to 67% more fair and 10% more accurate than the models they are likely to train with scikit-learn.
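The Pareto-optimal model set at the heart of this workflow can be sketched in a few lines; the model names and (accuracy, fairness) scores below are invented, and higher is assumed better for both dimensions.

```python
# A minimal sketch of the Pareto-front computation the fairkit-learn
# abstract describes: keep only models not dominated on both accuracy
# and fairness (higher is better for each). All numbers are made up.

def pareto(models: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    def dominated(a, b):  # True if b is at least as good everywhere and differs
        return b[0] >= a[0] and b[1] >= a[1] and b != a
    return {name: m for name, m in models.items()
            if not any(dominated(m, other) for other in models.values())}

front = pareto({"logreg": (0.91, 0.60), "tree": (0.88, 0.75),
                "forest": (0.90, 0.74), "svm": (0.87, 0.55)})
print(front)  # 'svm' drops out: 'logreg' beats it on both dimensions
```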
Sanchez-Stern, Alex; First, Emily; Zhou, Timothy; Kaufman, Zhanna; Brun, Yuriy; Ringer, Talia Passport: Improving Automated Formal Verification Using Identifiers Journal Article In: ACM Transactions on Programming Languages and Systems (TOPLAS), 2023, ISSN: 0164-0925. Formally verifying system properties is one of the most effective ways of improving system quality, but its high manual effort requirements often render it prohibitively expensive. Tools that automate formal verification by learning from proof corpora to synthesize proofs have just begun to show their promise. These tools are effective because of the richness of the data the proof corpora contain. This richness comes from the stylistic conventions followed by communities of proof developers, together with the powerful logical systems beneath proof assistants. However, this richness remains underexploited, with most work thus far focusing on architecture rather than on how to make the most of the proof data. This paper systematically explores how to most effectively exploit one aspect of that proof data: identifiers. We develop the Passport approach, a method for enriching the predictive Coq model used by an existing proof-synthesis tool with three new encoding mechanisms for identifiers: category vocabulary indexing, subword sequence modeling, and path elaboration. We evaluate our approach's enrichment effect on three existing base tools: ASTactic, Tac, and Tok. In head-to-head comparisons, Passport automatically proves 29% more theorems than the best-performing of these base tools. Combining the three tools enhanced by the Passport approach automatically proves 38% more theorems than combining the three base tools. Finally, together, these base tools and their enhanced versions prove 45% more theorems than the combined base tools. Overall, our findings suggest that modeling identifiers can play a significant role in improving proof synthesis, leading to higher-quality software.
2022
Brun, Yuriy The Promise and Perils of Using Machine Learning When Engineering Software (Keynote Paper) Proceedings Article In: Proceedings of the International Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), Singapore, 2022. Machine learning has radically changed what computing can accomplish, including the limits of what software engineering can do. I will discuss recent software engineering advances machine learning has enabled, from automatically repairing software bugs to data-driven software systems that automatically learn to make decisions. Unfortunately, with the promises of these new technologies come serious perils. For example, automatically generated program patches can break as much functionality as they repair. And self-learning, data-driven software can make decisions that result in unintended consequences, including unsafe, racist, or sexist behavior. But to build solutions to these shortcomings we may need to look no further than machine learning itself. I will introduce multiple ways machine learning can help verify software properties, leading to higher-quality systems.
Zhao, Yixue; Talebipour, Saghar; Baral, Kesina; Park, Hyojae; Yee, Leon; Khan, Safwat Ali; Brun, Yuriy; Medvidovic, Nenad; Moran, Kevin Avgust: Automating Usage-Based Test Generation from Videos of App Executions Proceedings Article In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 421–433, Singapore, 2022, (ACM artifact badges granted: Artifact Available, Artifact Functional.). Writing and maintaining UI tests for mobile apps is a time-consuming and tedious task. While decades of research have produced automated approaches for UI test generation, these approaches typically focus on testing for crashes or maximizing code coverage. By contrast, recent research has shown that developers prefer usage-based tests, which center around specific uses of app features, to help support activities such as regression testing. Very few existing techniques support the generation of such tests, as doing so requires automating the difficult task of understanding the semantics of UI screens and user inputs. In this paper, we introduce Avgust, which automates key steps of generating usage-based tests. Avgust uses neural models for image understanding to process video recordings of app uses to synthesize an app-agnostic state-machine encoding of those uses. Then, Avgust uses this encoding to synthesize test cases for a new target app. We evaluate Avgust on 374 videos of common uses of 18 popular apps and show that 69% of the tests Avgust generates successfully execute the desired usage, and that Avgust's classifiers outperform the state of the art.
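A toy rendering of the state-machine encoding and test-synthesis step, with invented states and actions; the real system derives these encodings with neural models over video frames, and `find a path` here is a plain breadth-first search rather than Avgust's actual synthesis.

```python
# Sketch of the app-agnostic state-machine idea from Avgust's abstract.
# States and actions are invented labels for a tiny "sign-in" usage.
from collections import deque

usage = {
    "start":       [("tap_sign_in", "login_form")],
    "login_form":  [("enter_credentials", "form_filled")],
    "form_filled": [("tap_submit", "home_screen")],
}

def generate_test(start: str, goal: str) -> list[str] | None:
    # Breadth-first search for a sequence of UI actions reaching the goal.
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, steps = queue.popleft()
        if state == goal:
            return steps
        for action, nxt in usage.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [action]))
    return None

print(generate_test("start", "home_screen"))
# ['tap_sign_in', 'enter_credentials', 'tap_submit']
```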
Shahbazian, Arman; Karthik, Suhrid; Brun, Yuriy; Medvidovic, Nenad eQual: Informing Early Design Decisions Proceedings Article In: Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 1039–1051, Sacramento, CA, USA, 2022, (ACM artifact badges granted: Artifact Available.). When designing a software system, architects make a series of design decisions that directly impact the system's quality. The number of available design alternatives grows rapidly with system size, creating an enormous space of intertwined design concerns that renders manual exploration impractical. We present eQual, a model-driven technique for simulation-based assessment of architectural designs. While it is not possible to guarantee optimal decisions so early in the design process, eQual improves decision quality. eQual is effective in practice because it (1) limits the amount of information the architects have to provide and (2) adapts optimization algorithms to effectively explore massive spaces of design alternatives. We empirically demonstrate that eQual yields designs whose quality is comparable to a set of systems' known optimal designs. A user study shows that, compared to the state-of-the-art, engineers using eQual produce statistically significantly higher-quality designs with a large effect size, are statistically significantly more confident in their designs, and find eQual easier to use.
Johnson, Brittany; Brun, Yuriy Fairkit-learn: A Fairness Evaluation and Comparison Toolkit Proceedings Article In: Proceedings of the Demonstrations Track at the 44th International Conference on Software Engineering (ICSE), pp. 70–74, Pittsburgh, PA, USA, 2022. Advances in how we build and use software, specifically the integration of machine learning for decision making, have led to widespread concern around model and software fairness. We present fairkit-learn, an interactive Python toolkit designed to support data scientists' ability to reason about and understand model fairness. We outline how fairkit-learn can support model training, evaluation, and comparison and describe the potential benefit that comes with using fairkit-learn in comparison to the state-of-the-art. Fairkit-learn is open source at https://go.gmu.edu/fairkit-learn/.
First, Emily; Brun, Yuriy Diversity-Driven Automated Formal Verification Proceedings Article In: Proceedings of the 44th International Conference on Software Engineering (ICSE), pp. 749–761, Pittsburgh, PA, USA, 2022, (ACM SIGSOFT Distinguished Paper Award. ACM artifact badges granted: Artifact Available). Formally verified correctness is one of the most desirable properties of software systems. But despite great progress made via interactive theorem provers, such as Coq, writing proof scripts for verification remains one of the most effort-intensive (and often prohibitively difficult) software development activities. Recent work has created tools that automatically synthesize proofs or proof scripts. For example, CoqHammer can prove 26.6% of theorems completely automatically by reasoning using precomputed facts, while TacTok and ASTactic, which use machine learning to model proof scripts and then perform biased search through the proof-script space, can prove 12.9% and 12.3% of the theorems, respectively. Further, these three tools are highly complementary; together, they can prove 30.4% of the theorems fully automatically. Our key insight is that control over the learning process can produce a diverse set of models, and that, due to the unique nature of proof synthesis (the existence of the theorem prover, an oracle that infallibly judges a proof's correctness), this diversity can significantly improve these tools' proving power. Accordingly, we develop Diva, which uses a diverse set of models with TacTok's and ASTactic's search mechanism to prove 21.7% of the theorems. That is, Diva proves 68% more theorems than TacTok and 77% more than ASTactic. Complementary to CoqHammer, Diva proves 781 theorems (27% added value) that CoqHammer does not, and 364 theorems no existing tool has proved automatically. Together with CoqHammer, Diva proves 33.8% of the theorems, the largest fraction to date. We explore nine dimensions for learning diverse models, and identify which dimensions lead to the most useful diversity. Further, we develop an optimization to speed up Diva's execution by 40X. Our study introduces a completely new idea for using diversity in machine learning to improve the power of state-of-the-art proof-script synthesis techniques, and empirically demonstrates that the improvement is significant on a dataset of 68K theorems from 122 open-source software projects.
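Diva's key insight, that an infallible checker lets an ensemble of diverse, imperfect models add proving power, can be sketched as follows; the models and checker below are stubs, not Diva's real components.

```python
# Diva's insight in miniature: with the proof checker as an infallible
# oracle, many diverse models can be tried in turn, and only accepted
# proofs are kept, so model diversity only adds proving power.

def make_model(temperature):
    # Stand-in for training with different hyperparameters or data order.
    def model(theorem):
        return f"proof-attempt(T={temperature})"
    return model

def checker_accepts(theorem, proof):
    # Stub for Coq; here, only one model's attempt happens to succeed.
    return proof == "proof-attempt(T=0.7)"

def prove_with_ensemble(theorem, models):
    for model in models:
        proof = model(theorem)
        if checker_accepts(theorem, proof):  # the oracle filters mistakes
            return proof
    return None

models = [make_model(t) for t in (0.3, 0.5, 0.7)]
print(prove_with_ensemble("forall n, n + 0 = n", models))
```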
Giguere, Stephen; Metevier, Blossom; Brun, Yuriy; Silva, Bruno Castro; Thomas, Philip S.; Niekum, Scott Fairness Guarantees under Demographic Shift Proceedings Article In: Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022. Recent studies found that using machine learning for social applications can lead to injustice in the form of racist, sexist, and otherwise unfair and discriminatory outcomes. To address this challenge, recent machine learning algorithms have been designed to limit the likelihood such unfair behavior occurs. However, these approaches typically assume the data used for training is representative of what will be encountered in deployment, which is often untrue. In particular, if certain subgroups of the population become more or less probable in deployment (a phenomenon we call demographic shift), prior work's fairness assurances are often invalid. In this paper, we consider the impact of demographic shift and present a class of algorithms, called Shifty algorithms, that provide high-confidence behavioral guarantees that hold under demographic shift when data from the deployment environment is unavailable during training. Shifty, the first technique of its kind, demonstrates an effective strategy for designing algorithms to overcome demographic shift's challenges. We evaluate Shifty using the UCI Adult Census dataset, as well as a real-world dataset of university entrance exams and subsequent student success. We show that the learned models avoid bias under demographic shift, unlike existing methods. Our experiments demonstrate that our algorithm's high-confidence fairness guarantees are valid in practice and that our algorithm is an effective tool for training models that are fair when demographic shift occurs.
Motwani, Manish; Soto, Mauricio; Brun, Yuriy; Just, René; Goues, Claire Le Quality of Automated Program Repair on Real-World Defects Journal Article In: IEEE Transactions on Software Engineering (TSE), vol. 48, no. 2, pp. 637–661, 2022, ISSN: 0098-5589. Automated program repair is a promising approach to reducing the costs of manual debugging and increasing software quality. However, recent studies have shown that automated program repair techniques can be prone to producing patches of low quality, overfitting to the set of tests provided to the repair technique, and failing to generalize to the intended specification. This paper rigorously explores this phenomenon on real-world Java programs, analyzing the effectiveness of four well-known repair techniques, GenProg, Par, SimFix, and TrpAutoRepair, on defects made by the projects' developers during their regular development process. We find that: (1) When applied to real-world Java code, automated program repair techniques produce patches for between 10.6% and 19.0% of the defects, which is less frequent than when applied to C code. (2) The produced patches often overfit to the provided test suite, with only between 13.8% and 46.1% of the patches passing an independent set of tests. (3) Test suite size has an extremely small but significant effect on the quality of the patches, with larger test suites producing higher-quality patches, though, surprisingly, higher-coverage test suites correlate with lower-quality patches. (4) The number of tests that a buggy program fails has a small but statistically significant positive effect on the quality of the produced patches. (5) Test suite provenance, whether the test suite is written by a human or automatically generated, has a significant effect on the quality of the patches, with developer-written tests typically producing higher-quality patches. And (6) the patches exhibit insufficient diversity to improve quality through some method of combining multiple patches. We develop JaRFly, an open-source framework for implementing techniques for automatic search-based improvement of Java programs. Our study uses JaRFly to faithfully reimplement GenProg and TrpAutoRepair to work on Java code, and makes the first public release of an implementation of Par. Unlike prior work, our study carefully controls for confounding factors and produces a methodology, as well as a dataset of automatically-generated test suites, for objectively evaluating the quality of Java repair techniques on real-world defects.
2021
Afzal, Afsoon; Motwani, Manish; Stolee, Kathryn T.; Brun, Yuriy; Goues, Claire Le SOSRepair: Expressive Semantic Search for Real-World Program Repair Journal Article In: IEEE Transactions on Software Engineering (TSE), vol. 47, no. 10, pp. 2162–2181, 2021, ISSN: 0098-5589. Automated program repair holds the potential to significantly reduce software maintenance effort and cost. However, recent studies have shown that it often produces low-quality patches that repair some but break other functionality. We hypothesize that producing patches by replacing likely faulty regions of code with semantically-similar code fragments, and doing so at a higher level of granularity than prior approaches can better capture abstraction and the intended specification, and can improve repair quality. We create SOSRepair, an automated program repair technique that uses semantic code search to replace candidate buggy code regions with behaviorally-similar (but not identical) code written by humans. SOSRepair is the first such technique to scale to real-world defects in real-world systems. On a subset of the ManyBugs benchmark of such defects, SOSRepair produces patches for 23 (35%) of the 65 defects, including 3, 5, and 8 defects for which previous state-of-the-art techniques Angelix, Prophet, and GenProg do not, respectively. On these 23 defects, SOSRepair produces more patches (8, 35%) that pass all independent tests than the prior techniques. We demonstrate a relationship between patch granularity and the ability to produce patches that pass all independent tests. We then show that fault localization precision is a key factor in SOSRepair's success. Manually improving fault localization allows SOSRepair to patch 24 (37%) defects, of which 16 (67%) pass all independent tests. We conclude that (1) higher-granularity, semantic-based patches can improve patch quality, (2) semantic search is promising for producing high-quality real-world defect repairs, (3) research in fault localization can significantly improve the quality of program repair techniques, and (4) semi-automated approaches in which developers suggest fix locations may produce high-quality patches.
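The semantic-search idea can be rendered as a toy: index human-written snippets by their input-output behavior and retrieve the ones matching the behavior the buggy region should exhibit. SOSRepair performs this search with SMT-based encodings rather than by running tests; the snippet database and test values below are invented.

```python
# Toy rendering of semantic code search for repair: retrieve snippets
# whose observed behavior matches the buggy region's desired behavior.
# SOSRepair does this symbolically (SMT), not by executing tests.

TESTS = [((2, 3), 3), ((5, 1), 5), ((4, 4), 4)]  # desired (inputs, output)

SNIPPETS = {
    "return a + b": lambda a, b: a + b,
    "return max(a, b)": lambda a, b: max(a, b),
    "return a if a > b else b": lambda a, b: a if a > b else b,
}

def behaviorally_matching(tests):
    # Keep snippets consistent with every desired input-output pair.
    return [code for code, fn in SNIPPETS.items()
            if all(fn(*ins) == out for ins, out in tests)]

print(behaviorally_matching(TESTS))  # both max-like snippets match
```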
2020
Pinckney, Donald; Guha, Arjun; Brun, Yuriy Wasm/k: Delimited Continuations for WebAssembly Proceedings Article In: Proceedings of the ACM SIGPLAN International Symposium on Dynamic Languages (DLS), pp. 16–28, Chicago, IL, USA, 2020. WebAssembly is designed to be an alternative to JavaScript that is a safe, portable, and efficient compilation target for a variety of languages. The performance of high-level languages depends not only on the underlying performance of WebAssembly, but also on the quality of the generated WebAssembly code. In this paper, we identify several features of high-level languages that current approaches can only compile to WebAssembly by generating complex and inefficient code. We argue that these problems could be addressed if WebAssembly natively supported first-class continuations. We then present Wasm/k, which extends WebAssembly with delimited continuations. Wasm/k introduces no new value types, and thus does not require significant changes to the WebAssembly type system (validation). Wasm/k is safe, even in the presence of foreign function calls (e.g., to and from JavaScript). Finally, Wasm/k is amenable to efficient implementation: we implement Wasm/k as a local change to Wasmtime, an existing WebAssembly JIT. We evaluate Wasm/k by implementing C/k, which adds delimited continuations to C/C++. C/k uses Emscripten and its implementation serves as a case study on how to use Wasm/k in a compiler that targets WebAssembly. We present several case studies using C/k, and show that on implementing green threads, it can outperform the state-of-the-art approach Asyncify with an 18% improvement in performance and a 30% improvement in code size. |
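Python has no delimited continuations, but generators can mimic the suspend-and-resume core of the green-threads use case the paper evaluates; the sketch below is an analogy only, not a model of Wasm/k's actual control operators.

```python
# Cooperative "green threads" via generators: an analogy for what
# first-class, resumable control (as in Wasm/k) makes cheap to express.
from collections import deque

def worker(name, steps):
    for i in range(steps):
        print(f"{name}: step {i}")
        yield  # suspend: hand control back to the scheduler

def run_round_robin(threads):
    # Resume each suspended "thread" in turn until all finish.
    ready = deque(threads)
    while ready:
        t = ready.popleft()
        try:
            next(t)            # resume the captured continuation
            ready.append(t)
        except StopIteration:
            pass               # thread finished

run_round_robin([worker("A", 2), worker("B", 3)])
```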
First, Emily; Brun, Yuriy; Guha, Arjun TacTok: Semantics-Aware Proof Synthesis Journal Article In: Proceedings of the ACM on Programming Languages (PACMPL) Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) issue, vol. 4, pp. 231:1–231:31, 2020. Formally verifying software correctness is a highly manual process. However, because verification proof scripts often share structure, it is possible to learn from existing proof scripts to fully automate some formal verification. The goal of this paper is to improve proof script synthesis and enable fully automating more verification. Interactive theorem provers, such as the Coq proof assistant, allow programmers to write partial proof scripts, observe the semantics of the proof state thus far, and then attempt more progress. Knowing the proof state semantics is a significant aid. Recent research has shown that the proof state can help predict the next step. In this paper, we present TacTok, the first technique that attempts to fully automate proof script synthesis by modeling proof scripts using both the partial proof script written thus far and the semantics of the proof state. Thus, TacTok more completely models the information the programmer has access to when writing proof scripts manually. We evaluate TacTok on a benchmark of 26 software projects in Coq, consisting of over 10 thousand theorems. We compare our approach to five tools: two prior techniques, CoqHammer, the state-of-the-art proof synthesis technique, and ASTactic, a proof script synthesis technique that models proof state; and three new proof script synthesis techniques we create ourselves: SeqOnly, which models only the partial proof script and the initial theorem being proven, and WeightedRandom and WeightedGreedy, which use metaheuristic search biased by frequencies of proof tactics in existing, successful proof scripts. We find that TacTok outperforms WeightedRandom and WeightedGreedy, and is complementary to CoqHammer and ASTactic: for 24 out of the 26 projects, TacTok can synthesize proof scripts for some theorems the prior tools cannot. Together with TacTok, 11.5% more theorems can be proven automatically than by CoqHammer alone, and 20.0% than by ASTactic alone. Compared to a combination of CoqHammer and ASTactic, TacTok can prove an additional 3.6% more theorems, proving 115 theorems no tool could previously prove. Overall, our experiments provide evidence that partial proof script and proof state semantics, together, provide useful information for proof script modeling, and that metaheuristic search is a promising direction for proof script synthesis. TacTok is open-source and we make public all our data and a replication package of our experiments.
Johnson, Brittany; Brun, Yuriy; Meliou, Alexandra Causal Testing: Understanding Defects' Root Causes Proceedings Article In: Proceedings of the 42nd International Conference on Software Engineering (ICSE), pp. 87–99, Seoul, Republic of Korea, 2020, (ACM SIGSOFT Distinguished Artifact Award. ACM artifact badges granted: Artifact Available, Artifact Reusable.). Isolating and repairing buggy software behavior requires finding where the bug is happening and understanding the root cause of the buggy behavior. While there exist tools that can help developers identify where a bug is located, existing work has paid little, if any, attention to helping developers understand the root cause of buggy behavior. We present Causal Testing, a new method of root-cause analysis that relies on the theory of counterfactual causality to identify a set of executions that likely hold key causal information necessary to understand and repair buggy behavior. Evaluating Causal Testing on the Defects4J benchmark, we find that Causal Testing could be applied to 71% of real-world defects, and for 77% of those, it can help developers identify the root cause of the defect. A controlled experiment with 37 developers showed that Causal Testing improved participants' ability to identify the cause of the defect: Users with standard testing tools correctly identified the cause 88% of the time in comparison to 96% of the time with Causal Testing. Overall, participants agreed Causal Testing provided useful information they could not get using tools like JUnit alone.
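The counterfactual idea in miniature: starting from a failing input, search nearby inputs for minimally different executions that pass, which points at what about the input causes the failure. The defective `buggy_abs` function below is invented purely for illustration.

```python
# Sketch of the counterfactual-causality idea behind Causal Testing:
# contrast a failing execution with minimally different passing ones.

def buggy_abs(x: int) -> int:
    return x if x > -1 else x  # bug: forgets to negate negatives

def behaves_correctly(x: int) -> bool:
    return buggy_abs(x) == abs(x)

def similar_passing_inputs(failing: int, radius: int = 3) -> list[int]:
    return [failing + d for d in range(-radius, radius + 1)
            if d != 0 and behaves_correctly(failing + d)]

# buggy_abs(-1) is wrong, but the nearby inputs 0, 1, 2 behave
# correctly, pointing at the input's sign as the root cause.
print(similar_passing_inputs(-1))  # [0, 1, 2]
```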
Beschastnikh, Ivan; Liu, Perry; Xing, Albert; Wang, Patty; Brun, Yuriy; Ernst, Michael D. Visualizing distributed system executions Journal Article In: ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 29, no. 2, pp. 9:1–9:38, 2020, ISSN: 1049-331X. Distributed systems pose unique challenges for software developers. Understanding the system's communication topology and reasoning about concurrent activities of system hosts can be difficult. The standard approach, analyzing system logs, can be a tedious and complex process that involves reconstructing a system log from multiple hosts' logs, reconciling timestamps among hosts with non-synchronized clocks, and understanding what took place during the execution encoded by the log. This paper presents a novel approach for tackling three tasks frequently performed during analysis of distributed system executions: (1) understanding the relative ordering of events, (2) searching for specific patterns of interaction between hosts, and (3) identifying structural similarities and differences between pairs of executions. Our approach consists of XVector, which instruments distributed systems to capture partial ordering information that encodes the happens-before relation between events, and ShiViz, which processes the resulting logs and presents distributed system executions as interactive time-space diagrams. Two user studies with a total of 109 students and a case study with 2 developers showed that our method was effective, helping participants answer statistically significantly more system-comprehension questions correctly, with a very large effect size.
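The happens-before relation such instrumentation captures is conventionally encoded with vector clocks; the sketch below (an assumption about the standard mechanism, not a description of XVector's implementation) shows the update and comparison rules that time-space visualizations rely on.

```python
# Vector clocks: the conventional encoding of the happens-before
# relation between events in a distributed execution.

def local_event(clock: dict[str, int], host: str) -> dict[str, int]:
    clock = dict(clock)
    clock[host] = clock.get(host, 0) + 1  # tick the host's own entry
    return clock

def on_receive(clock, host, msg_clock):
    # Merge entrywise maxima with the sender's clock, then tick locally.
    merged = {h: max(clock.get(h, 0), msg_clock.get(h, 0))
              for h in set(clock) | set(msg_clock)}
    return local_event(merged, host)

def happens_before(a: dict[str, int], b: dict[str, int]) -> bool:
    hosts = set(a) | set(b)
    return (all(a.get(h, 0) <= b.get(h, 0) for h in hosts)
            and any(a.get(h, 0) < b.get(h, 0) for h in hosts))

send = local_event({}, "alice")         # {'alice': 1}
recv = on_receive({}, "bob", send)      # {'alice': 1, 'bob': 1}
assert happens_before(send, recv)       # the send precedes the receive
```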
2019
Metevier, Blossom; Giguere, Stephen; Brockman, Sarah; Kobren, Ari; Brun, Yuriy; Brunskill, Emma; Thomas, Philip S. Offline Contextual Bandits with High Probability Fairness Guarantees Proceedings Article In: Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NeurIPS), Advances in Neural Information Processing Systems 32, pp. 14893–14904, Vancouver, BC, Canada, 2019. We present RobinHood, an offline contextual bandit algorithm designed to satisfy a broad family of fairness constraints. Our algorithm accepts multiple fairness definitions and allows users to construct their own unique fairness definitions for the problem at hand. We provide a theoretical analysis of RobinHood, which includes a proof that it will not return an unfair solution with probability greater than a user-specified threshold. We validate our algorithm on three applications: a tutoring system in which we conduct a user study and consider multiple unique fairness definitions; a loan approval setting (using the Statlog German credit data set) in which well-known fairness definitions are applied; and criminal recidivism (using data released by ProPublica). In each setting, our algorithm is able to produce fair policies that achieve performance competitive with other offline and online contextual bandit algorithms.
Thomas, Philip S.; Silva, Bruno Castro; Barto, Andrew G.; Giguere, Stephen; Brun, Yuriy; Brunskill, Emma Preventing Undesirable Behavior of Intelligent Machines Journal Article In: Science, vol. 366, no. 6468, pp. 999-1004, 2019, ISSN: 0036-8075. Intelligent machines using machine learning algorithms are ubiquitous, ranging from simple data analysis and pattern recognition tools to complex systems that achieve super-human performance on various tasks. Ensuring that they do not exhibit undesirable behavior — that they do not, for example, cause harm to humans — is therefore a pressing problem that we address here. We propose a general and flexible framework for designing machine learning algorithms that simplifies the problem of specifying and regulating undesirable behavior. To show the viability of this framework, we use it to create machine learning algorithms that preclude the dangerous behavior caused by standard machine learning algorithms in our experiments. Our framework for designing machine learning algorithms simplifies the safe and responsible application of machine learning.
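The framework separates candidate selection from a safety test on held-out data; the sketch below illustrates such a safety test using Hoeffding's inequality, with an invented behavioral constraint ("mean error below tau with confidence 1 - delta") and invented data. The real framework supports far more general constraints and estimators.

```python
# A minimal safety-test sketch in the spirit of the paper's framework:
# certify, with confidence 1 - delta, that a candidate model's mean
# error on held-out safety data is below tau. Data here are invented.
import math

def safety_test(errors: list[float], tau: float, delta: float) -> bool:
    # errors must lie in [0, 1] for Hoeffding's inequality to apply.
    n = len(errors)
    mean = sum(errors) / n
    bound = math.sqrt(math.log(1 / delta) / (2 * n))
    return mean + bound <= tau  # high-confidence upper bound below tau

candidate_ok = safety_test(errors=[0.1, 0.0, 0.2, 0.1] * 50,
                           tau=0.25, delta=0.05)
print(candidate_ok)  # deploy the candidate only if the guarantee holds
```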
Jangda, Abhinav; Pinckney, Donald; Brun, Yuriy; Guha, Arjun Formal Foundations of Serverless Computing Journal Article In: Proceedings of the ACM on Programming Languages (PACMPL) Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) issue, vol. 3, pp. 149:1–149:26, 2019, ISSN: 2475-1421, (ACM SIGPLAN Distinguished Paper Award.). Serverless computing (also known as functions as a service) is a new cloud computing abstraction that makes it easier to write robust, large-scale web services. In serverless computing, programmers write what are called serverless functions, and the cloud platform transparently manages the operating system, resource allocation, load-balancing, and fault tolerance. When demand for the service spikes, the platform automatically allocates additional hardware to the service and manages load-balancing; when demand falls, the platform silently deallocates idle resources; and when the platform detects a failure, it transparently retries affected requests. In 2014, Amazon Web Services introduced the first serverless platform, AWS Lambda, and similar abstractions are now available on all major cloud computing platforms. Unfortunately, the serverless computing abstraction exposes several low-level operational details that make it hard for programmers to write and reason about their code. This paper sheds light on this problem by presenting λλ, an operational semantics of the essence of serverless computing. Despite being a small (half a page) core calculus, λλ models all the low-level details that serverless functions can observe. To show that λλ is useful, we present three applications. First, to ease reasoning about code, we present a simplified semantics of serverless execution and precisely characterize when the simplified semantics and λλ coincide. Second, we augment λλ with a key-value store to allow reasoning about stateful serverless functions. Third, since a handful of serverless platforms support serverless function composition, we show how to extend λλ with a composition language. We have implemented this composition language and show that it outperforms prior work.
Motwani, Manish; Brun, Yuriy Automatically Generating Precise Oracles from Structured Natural Language Specifications Proceedings Article In: Proceedings of the 41st International Conference on Software Engineering (ICSE), pp. 188–199, Montreal, QC, Canada, 2019, (ACM artifact badges granted: Artifact Available, Artifact Reusable.). Software specifications often use natural language to describe the desired behavior, but such specifications are difficult to verify automatically. We present Swami, an automated technique that extracts test oracles and generates executable tests from structured natural language specifications. Swami focuses on exceptional behavior and boundary conditions that often cause field failures but that developers often fail to manually write tests for. Evaluated on the official JavaScript specification (ECMA-262), 98.4% of the tests Swami generated were precise to the specification. Using Swami to augment developer-written test suites improved coverage and identified 1 previously unknown defect and 15 missing JavaScript features in Rhino, 1 previously unknown defect in Node.js, and 18 semantic ambiguities in the ECMA-262 specification.
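Template-driven oracle extraction can be pictured with one invented rule: match specification sentences of the form "Throws E if condition" and emit a test skeleton. Swami's actual rules and generated-test format differ; this is only a sketch of the idea.

```python
# Toy oracle extraction from structured specification text. The single
# regex rule and the emitted assertion skeleton are invented, not
# Swami's real templates or output.
import re

RULE = re.compile(r"Throws (\w+) if (.+?)\.")

def oracle_templates(spec: str):
    for exception, condition in RULE.findall(spec):
        # Emit a Node-style test skeleton for the exceptional behavior.
        yield (f"assert.throws(() => callUnderTest(), {exception});"
               f"  // when {condition}")

spec_text = "Throws RangeError if the index is negative."
for test in oracle_templates(spec_text):
    print(test)
```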
2018
Brun, Yuriy; Meliou, Alexandra Software Fairness Proceedings Article In: Proceedings of the New Ideas and Emerging Results Track at the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 754–759, Lake Buena Vista, FL, USA, 2018. A goal of software engineering research is advancing software quality and the success of the software engineering process. However, while recent studies have demonstrated a new kind of defect in software related to its ability to operate in fair and unbiased manner, software engineering has not yet wholeheartedly tackled these new kinds of defects, thus leaving software vulnerable. This paper outlines a vision for how software engineering research can help reduce fairness defects and represents a call to action by the software engineering research community to reify that vision. Modern software is riddled with examples of biased behavior, from automated translation injecting gender stereotypes, to vision systems failing to see faces of certain races, to the US criminal justice system relying on biased computational assessments of crime recidivism. While systems may learn bias from biased data, bias can also emerge from ambiguous or incomplete requirement specification, poor design, implementation bugs, and unintended component interactions. We argue that software fairness is analogous to software quality, and that numerous software engineering challenges in the areas of requirements, specification, design, testing, and verification need to be tackled to solve this problem.
Angell, Rico; Johnson, Brittany; Brun, Yuriy; Meliou, Alexandra Themis: Automatically Testing Software for Discrimination Proceedings Article In: Proceedings of the Demonstrations Track at the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 871–875, Lake Buena Vista, FL, USA, 2018. Bias in decisions made by modern software is becoming a common and serious problem. We present Themis, an automated test suite generator to measure two types of discrimination, including causal relationships between sensitive inputs and program behavior. We explain how Themis can measure discrimination and aid its debugging, describe a set of optimizations Themis uses to reduce test suite size, and demonstrate Themis' effectiveness on open-source software. Themis is open-source and all our evaluation data are available at http://fairness.cs.umass.edu/. See a video of Themis in action: https://youtu.be/brB8wkaUesY
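Causal discrimination testing reduces to a simple experiment: hold every input fixed, flip only the sensitive attribute, and count how often the decision changes. The `decide` function below is an invented, deliberately biased stand-in for the software under test.

```python
# Sketch of the causal-discrimination measure: flip only the sensitive
# attribute and observe how often the program's decision changes.
import random

def decide(age: int, gender: str) -> bool:
    return age > 30 and gender == "male"  # deliberately biased example

def causal_discrimination(trials: int = 1000) -> float:
    flipped = 0
    for _ in range(trials):
        age = random.randint(18, 80)
        a = decide(age, "male")
        b = decide(age, "female")  # identical input, gender flipped
        flipped += (a != b)
    return flipped / trials

print(f"causal discrimination: {causal_discrimination():.2f}")
```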
Bang, Jae; Brun, Yuriy; Medvidovic, Nenad Collaborative Design Conflicts: Costs and Solutions Journal Article In: IEEE Software, vol. 35, no. 6, pp. 25–31, 2018, ISSN: 0740-7459. Collaborative design exposes software architects to the risk of making conflicting modeling changes that either cannot be merged, or, when merged, violate consistency rules, non-functional requirements, or other system constraints. Such design conflicts are common and incur a high cost, including having to redo and abandon work. Proactive design conflict detection can alleviate this risk. We motivate the need for design conflict detection, describe the benefits of such detection to practitioners, and identify requirements for building detection tools. Finally, we present FLAME, our collaborative-design framework that efficiently and continuously detects design conflicts.
Motwani, Manish; Sankaranarayanan, Sandhya; Just, René; Brun, Yuriy Do Automated Program Repair Techniques Repair Hard and Important Bugs? Journal Article In: Empirical Software Engineering (EMSE), vol. 23, no. 5, pp. 2901–2947, 2018, ISSN: 1382-3256. Existing evaluations of automated repair techniques focus on the fraction of the defects for which the technique can produce a patch, the time needed to produce patches, and how well patches generalize to the intended specification. However, these evaluations have not focused on the applicability of repair techniques and the characteristics of the defects that these techniques can repair. Questions such as "Can automated repair techniques repair defects that are hard for developers to repair?" and "Are automated repair techniques less likely to repair defects that involve loops?" have not, as of yet, been answered. To address such questions, we annotate two large benchmarks totaling 409 C and Java defects in real-world software, ranging from 22K to 2.8M lines of code, with measures of the defect's importance, the developer-written patch's complexity, and the quality of the test suite. We then analyze relationships between these measures and the ability to produce patches for the defects of seven automated repair techniques: AE, GenProg, Kali, Nopol, Prophet, SPR, and TrpAutoRepair. We find that automated repair techniques are less likely to produce patches for defects that required developers to write a lot of code or edit many files, or that have many tests relevant to the defect. Java techniques are more likely to produce patches for high-priority defects. Neither the time it took developers to fix a defect nor the test suite's coverage correlate with the automated repair techniques' ability to produce patches. Finally, automated repair techniques are less capable of fixing defects that require developers to add loops and new function calls, or to change method signatures. These findings identify strengths and shortcomings of the state-of-the-art of automated program repair along new dimensions. The presented methodology can drive research toward improving the applicability of automated repair techniques to hard and important bugs.
Shin, Seung Yeob; Brun, Yuriy; Balasubramanian, Hari; Henneman, Philip L.; Osterweil, Leon J. Discrete-Event Simulation and Integer Linear Programming for Constraint-Aware Resource Scheduling Journal Article In: IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 9, pp. 1578–1593, 2018. This paper presents a method for scheduling resources in complex systems that integrate humans with diverse hardware and software components, and for studying the impact of resource schedules on system characteristics. The method uses discrete-event simulation and integer linear programming, and relies on detailed models of the system's processes, specifications of the capabilities of the system's resources, and constraints on the operations of the system and its resources. As a case study, we examine processes involved in the operation of a hospital emergency department, studying the impact staffing policies have on such key quality measures as patient length of stay, number of handoffs, staff utilization levels, and cost. Our results suggest that physician and nurse utilization levels of 70% for clinical tasks result in a good balance between length of stay and cost. Allowing shift lengths to vary and shifts to overlap increases scheduling flexibility. Clinical experts provided face validation of our results. Our approach improves on the state of the art by enabling using detailed resource and constraint specifications effectively to support analysis and decision making about complex processes in domains that currently rely largely on trial and error and other ad hoc methods.
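A skeletal discrete-event simulation of the kind the paper builds on, with invented arrival and service rates; the real models capture far richer processes, resources, and constraints (and couple the simulation with integer linear programming for scheduling).

```python
# Skeletal discrete-event simulation: patients arrive, wait for one of
# n clinicians, and we record length of stay. All rates are invented.
import heapq, random

def simulate(n_clinicians=2, n_patients=100, seed=0):
    random.seed(seed)
    t, free_at, stays = 0.0, [0.0] * n_clinicians, []
    events = []  # (arrival_time, service_time), ordered by arrival
    for _ in range(n_patients):
        t += random.expovariate(1 / 10)         # ~10 min between arrivals
        heapq.heappush(events, (t, random.expovariate(1 / 15)))
    while events:
        arrive, service = heapq.heappop(events)
        clin = min(range(n_clinicians), key=lambda c: free_at[c])
        start = max(arrive, free_at[clin])      # wait if no one is free
        free_at[clin] = start + service
        stays.append(free_at[clin] - arrive)    # waiting plus treatment
    print(f"mean length of stay: {sum(stays) / len(stays):.1f} min")

simulate()
```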
Oliveira, Daniela Seabra; Lin, Tian; Rahman, Muhammad Sajidur; Akefirad, Rad; Ellis, Donovan; Perez, Eliany; Bobhate, Rahul; DeLong, Lois A.; Cappos, Justin; Brun, Yuriy; Ebner, Natalie C. API Blindspots: Why Experienced Developers Write Vulnerable Code Proceedings Article In: Proceedings of the USENIX Symposium on Usable Privacy and Security (SOUPS), pp. 315–328, Baltimore, MD, USA, 2018. Despite the best efforts of the security community, security vulnerabilities in software are still prevalent, with new vulnerabilities reported daily and older ones stubbornly repeating themselves. One potential source of these vulnerabilities is shortcomings in the used language and library APIs. Developers tend to trust APIs, but can misunderstand or misuse them, introducing vulnerabilities. We call the causes of such misuse blindspots. In this paper, we study API blindspots from the developers' perspective to: (1) determine the extent to which developers can detect API blindspots in code and (2) examine the extent to which developer characteristics (i.e., perception of code correctness, familiarity with code, confidence, professional experience, cognitive function, and personality) affect this capability. We conducted a study with 109 developers from four countries solving programming puzzles that involve Java APIs known to contain blindspots. We find that (1) The presence of blindspots correlated negatively with the developers' accuracy in answering implicit security questions and the developers' ability to identify potential security concerns in the code. This effect was more pronounced for I/O-related APIs and for puzzles with higher cyclomatic complexity. (2) Higher cognitive functioning and more programming experience did not predict better ability to detect API blindspots. (3) Developers exhibiting greater openness as a personality trait were more likely to detect API blindspots. This study has the potential to advance API security in (1) design, implementation, and testing of new APIs; (2) addressing blindspots in legacy APIs; (3) development of novel methods for developer recruitment and training based on cognitive and personality assessments; and (4) improvement of software development processes (e.g., establishment of security and functionality teams).
Goues, Claire Le; Brun, Yuriy; Apel, Sven; Berger, Emery; Khurshid, Sarfraz; Smaragdakis, Yannis Effectiveness of Anonymization in Double-Blind Review Journal Article In: Communications of the ACM, vol. 61, no. 6, pp. 34–37, 2018. @article{LeGoues18cacm, Double-blind review relies on the authors' ability and willingness to effectively anonymize their submissions. We explore anonymization effectiveness at ASE 2016, OOPSLA 2016, and PLDI 2016 by asking reviewers if they can guess author identities. We find that 74%-90% of reviews contain no correct guess and that reviewers who self-identify as experts on a paper's topic are more likely to attempt to guess, but no more likely to guess correctly. We present our findings, summarize the PC chairs' comments about administering double-blind review, discuss the advantages and disadvantages of revealing author identities part of the way through the process, and conclude by advocating for the continued use of double-blind review. |
Motwani, Manish; Sankaranarayanan, Sandhya; Just, René; Brun, Yuriy Do Automated Program Repair Techniques Repair Hard and Important Bugs? Proceedings Article In: Proceedings of the Journal First Track at the International Conference on Software Engineering (ICSE), pp. 25, Gothenburg, Sweden, 2018. @inproceedings{Motwani18emse-icseJF, The full version of this article is: Manish Motwani, Sandhya Sankaranarayanan, René Just, and Yuriy Brun, ``Do Automated Program Repair Techniques Repair Hard and Important Bugs?'' in Empirical Software Engineering, https://doi.org/10.1007/s10664-017-9550-0 |
Shahbazian, Arman; Lee, Youn Kyu; Brun, Yuriy; Medvidovic, Nenad Poster: Making Well-Informed Software Design Decisions Proceedings Article In: Proceedings of the Poster Track at the International Conference on Software Engineering (ICSE), pp. 262–263, Gothenburg, Sweden, 2018. @inproceedings{Shahbazian18icse-poster, Design decisions software architects make directly impact system quality. Real-world systems involve a large number of such decisions, and each decision is typically influenced by others and involves trade-offs in system properties. This paper poses the problem of making complex, interacting design decisions relatively early in the project’s lifecycle and outlines a search-based and simulation-based approach for helping architects make these decisions and understand their effects. |
Brun, Yuriy; Johnson, Brittany; Meliou, Alexandra (Ed.) Proceedings of the IEEE/ACM International Workshop on Software Fairness (FairWare) Proceedings Gothenburg, Sweden, 2018, ISBN: 978-1-4503-5746-3. @proceedings{Brun18fairwareproceeings, |
Brun, Yuriy; Johnson, Brittany; Meliou, Alexandra Message from the FairWare 2018 Chairs Proceedings Article In: Proceedings of the IEEE/ACM International Workshop on Software Fairness, Gothenburg, Sweden, 2018. @inproceedings{Brun18fairwarewelcome, |
Shahbazian, Arman; Lee, Youn Kyu; Le, Duc; Brun, Yuriy; Medvidovic, Nenad Recovering Architectural Design Decisions Proceedings Article In: Proceedings of the IEEE International Conference on Software Architecture (ICSA), pp. 95–104, Seattle, WA, USA, 2018. @inproceedings{Shahbazian18icsa, Designing and maintaining a software system's architecture typically involve making numerous design decisions, each potentially affecting the system's functional and nonfunctional properties. Understanding these design decisions can help inform future decisions and implementation choices and can avoid introducing regressions and architectural inefficiencies later. Unfortunately, design decisions are rarely well documented and are typically a lost artifact of the architecture creation and maintenance process. The loss of this information can thus hurt development. To address this shortcoming, we develop RecovAr, a technique for automatically recovering design decisions from the project's readily available history artifacts, such as an issue tracker and version control repository. RecovAr uses state-of-the-art architectural recovery techniques on a series of version control commits and maps those commits to issues to identify decisions that affect system architecture. While some decisions can still be lost through this process, our evaluation on Hadoop and Struts, two large open-source systems with over 8 years of development each and, on average, more than 1 million lines of code, shows that RecovAr has a recall of 75% and a precision of 77%. Our work formally defines architectural design decisions and develops an approach for tracing such decisions in project histories. Additionally, the work introduces methods to classify whether decisions are architectural and to map decisions to code elements. Finally, our work contributes a methodology engineers can follow to preserve design-decision knowledge in their projects. |
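RecovAr maps version-control commits to issue-tracker entries. A common way to establish such links, shown below as our own illustration rather than RecovAr's exact heuristic, is to scan commit messages for issue keys like Hadoop's JIRA identifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative commit-to-issue linking: extract JIRA-style issue keys
// (e.g., "HADOOP-1234") from a commit message. A sketch of the general
// idea only, not RecovAr's actual mapping heuristic.
public class IssueLinker {
    private static final Pattern ISSUE_KEY =
            Pattern.compile("\\b[A-Z][A-Z0-9]+-\\d+\\b");

    public static List<String> linkedIssues(String commitMessage) {
        List<String> keys = new ArrayList<>();
        Matcher m = ISSUE_KEY.matcher(commitMessage);
        while (m.find()) keys.add(m.group());
        return keys;
    }
}
```

For example, linkedIssues("HADOOP-1234: harden RPC retry logic") returns [HADOOP-1234], which can then be joined against the issue tracker to recover the rationale behind the change.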
Lemos, Rogério; Garlan, David; Ghezzi, Carlo; Giese, Holger; Andersson, Jesper; Litoiu, Marin; Schmerl, Bradley; Weyns, Danny; Baresi, Luciano; Bencomo, Nelly; Brun, Yuriy; Camara, Javier; Calinescu, Radu; Cohen, Myra B.; Gorla, Alessandra; Grassi, Vincenzo; Grunske, Lars; Inverardi, Paola; Jezequel, Jean-Marc; Malek, Sam; Mirandola, Raffaela; Mori, Marco; Müller, Hausi A.; Rouvoy, Romain; Rubira, Cecilia M. F.; Rutten, Eric; Shaw, Mary; Tamburrelli, Giordano; Tamura, Gabriel; Villegas, Norha M.; Vogel, Thomas; Zambonelli, Franco Software Engineering for Self-Adaptive Systems: Research Challenges in the Provision of Assurances Book Section In: Lemos, Rogério; Garlan, David; Ghezzi, Carlo; Giese, Holger (Ed.): Software Engineering for Self-Adaptive Systems III, vol. 9640, pp. 3–30, Springer, 2018, ISBN: 978-3-319-74183-3. @incollection{Lemos18SEfSAS, An important concern for modern software systems is to be cost-effective while remaining versatile, flexible, resilient, dependable, energy-efficient, customisable, configurable, and self-optimising when reacting to run-time changes that may occur within the system itself, its environment, or its requirements. One of the most promising approaches to achieving such properties is to equip software systems with self-managing capabilities using self-adaptation mechanisms. Despite recent advances in this area, one key aspect of self-adaptive systems that remains to be tackled in depth is the provision of assurances, i.e., the collection, analysis, and synthesis of evidence that the system satisfies its stated functional and non-functional requirements during its operation in the presence of self-adaptation. The provision of assurances for self-adaptive systems is challenging since run-time changes introduce a high degree of uncertainty. This research-challenges paper complements previous roadmap papers on software engineering for self-adaptive systems by covering a different set of assurance-related topics: perpetual assurances, composition and decomposition of assurances, and assurances obtained from control theory. The paper is one of the many results of the Dagstuhl Seminar 13511 on Software Engineering for Self-Adaptive Systems: Assurances, which took place in December 2013. |
Schmerl, Bradley; Andersson, Jesper; Vogel, Thomas; Cohen, Myra B.; Rubira, Cecilia M. F.; Brun, Yuriy; Gorla, Alessandra; Zambonelli, Franco; Baresi, Luciano Challenges in Composing and Decomposing Assurances for Self-Adaptive Systems Book Section In: Lemos, Rogério; Garlan, David; Ghezzi, Carlo; Giese, Holger (Ed.): Software Engineering for Self-Adaptive Systems III. Assurances, vol. 9640, pp. 64–89, Springer, 2018, ISBN: 978-3-319-74183-3. @incollection{Schmerl18SEfSAS, Self-adaptive software systems adapt to changes in the environment, in the system itself, in their requirements, or in their business objectives. Typically, these systems attempt to maintain system goals at run time and often provide assurance that they will meet their goals under dynamic and uncertain circumstances. While significant research has focused on ways to engineer self-adaptive capabilities into both new and legacy software systems, less work has been conducted on how to assure that self-adaptation maintains system goals. For traditional, especially safety-critical software systems, assurance techniques decompose assurances into sub-goals and evidence that can be provided by parts of the system. Approaches also exist for composing assurances, both across multiple goals and across systems of systems. While some of these techniques may be applied to self-adaptive systems, in this chapter we argue that several significant challenges remain in doing so. We discuss how existing assurance techniques can be applied to composing and decomposing assurances for self-adaptive systems, highlight the challenges in applying them, summarize existing research to address some of these challenges, and identify gaps and opportunities to be addressed by future research. |
2017 |
Goues, Claire Le; Brun, Yuriy; Forrest, Stephanie; Weimer, Westley Clarifications on the Construction and Use of the ManyBugs Benchmark Journal Article In: IEEE Transactions on Software Engineering (TSE), comment paper, vol. 43, no. 11, pp. 1089–1090, 2017, ISSN: 0098-5589. @article{LeGoues17tse, High-quality research requires timely dissemination and the incorporation of feedback. Since the publication of the ManyBugs benchmark and its release on http://repairbenchmarks.cs.umass.edu, Fan Long and Martin Rinard have provided feedback on the benchmark's construction and use. Here, we describe that feedback and our subsequent improvements to the ManyBugs benchmark. We thank these researchers for the feedback. |
Weiss, Aaron; Guha, Arjun; Brun, Yuriy Tortoise: Interactive System Configuration Repair Proceedings Article In: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 625–636, Urbana-Champaign, IL, USA, 2017. @inproceedings{Weiss17ase, System configuration languages provide powerful abstractions that simplify managing large-scale, networked systems. Thousands of organizations now use configuration languages, such as Puppet. However, specifications written in configuration languages can have bugs and the shell remains the simplest way to debug a misconfigured system. Unfortunately, it is unsafe to use the shell to fix problems when a system configuration language is in use: a fix applied from the shell may cause the system to drift from the state specified by the configuration language. Thus, despite their advantages, configuration languages force system administrators to give up the simplicity and familiarity of the shell. This paper presents a synthesis-based technique that allows administrators to use configuration languages and the shell in harmony. Administrators can fix errors using the shell and the technique automatically repairs the higher-level specification written in the configuration language. The approach (1) produces repairs that are consistent with the fix made using the shell; (2) produces repairs that are maintainable by minimizing edits made to the original specification; (3) ranks and presents multiple repairs when relevant; and (4) supports all shells the administrator may wish to use. We implement our technique for Puppet, a widely used system configuration language, and evaluate it on a suite of benchmarks under 42 repair scenarios. The top-ranked repair is selected by humans 76% of the time and the human-equivalent repair is ranked 1.31 on average. |
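One of the abstract's criteria is that repairs should minimize edits to the original specification, keeping repairs maintainable. The sketch below illustrates that ranking criterion in isolation, using plain Levenshtein distance over specification text as a stand-in for Tortoise's actual edit metric.

```java
import java.util.Comparator;
import java.util.List;

// Rank candidate repaired specifications by closeness to the original.
// Levenshtein distance is a stand-in for Tortoise's real edit metric.
public class RepairRanker {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(subst, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    // Sort candidates so the fewest-edit (most maintainable) repair is first.
    public static void rank(String original, List<String> candidates) {
        candidates.sort(Comparator.comparingInt(c -> levenshtein(original, c)));
    }
}
```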
Galhotra, Sainyam; Brun, Yuriy; Meliou, Alexandra Fairness Testing: Testing Software for Discrimination Proceedings Article In: Proceedings of the 11th Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), pp. 498–510, Paderborn, Germany, 2017, (ACM SIGSOFT Distinguished Paper Award.). @inproceedings{Galhotra17fse, This paper defines the notions of software fairness and discrimination and develops a testing-based method for measuring if and how much software discriminates. Specifically, the paper focuses on measuring causality in discriminatory behavior. Modern software contributes to important societal decisions and evidence of software discrimination has been found in systems that recommend criminal sentences, grant access to financial loans and products, and determine who is allowed to participate in promotions and receive services. Our approach, Themis, measures discrimination in software by generating efficient, discrimination-testing test suites. Given a schema describing valid system inputs, Themis generates discrimination tests automatically and, notably, does not require an oracle. We evaluate Themis on 20 software systems, 12 of which come from prior work with explicit focus on avoiding discrimination. We find that (1) Themis is effective at discovering software discrimination, (2) state-of-the-art techniques for removing discrimination from algorithms fail in many situations, at times discriminating against as much as 98% of an input subdomain, (3) Themis optimizations are effective at producing efficient test suites for measuring discrimination, and (4) Themis is more efficient on systems that exhibit more discrimination. We thus demonstrate that fairness testing is a critical aspect of the software development cycle in domains with possible discrimination and provide initial tools for measuring software discrimination. |
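The core causal check behind this kind of fairness testing is simple to state: hold every attribute of an input fixed, flip only a sensitive attribute, and see whether the software's decision changes. Below is a minimal sketch of that check; the classifier interface and attribute representation are hypothetical, not Themis's actual API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Causal discrimination check: flip only the sensitive attribute and
// compare decisions. Interface and attribute names are hypothetical.
public class CausalTest {
    public static boolean decisionFlips(
            Function<Map<String, String>, Boolean> decide,
            Map<String, String> input, String sensitiveAttr, String altValue) {
        Map<String, String> flipped = new HashMap<>(input);
        flipped.put(sensitiveAttr, altValue);
        return !decide.apply(input).equals(decide.apply(flipped));
    }
}
```

Running this check over many inputs generated from the input schema and reporting the fraction for which the decision flips yields a causal discrimination score for the sensitive attribute.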
Bang, Jae; Brun, Yuriy; Medvidovic, Nenad Continuous Analysis of Collaborative Design Proceedings Article In: Proceedings of the IEEE International Conference on Software Architecture (ICSA), pp. 97–106, Gothenburg, Sweden, 2017, (Best Paper Award.). @inproceedings{Bang17icsa, In collaborative design, architects' individual design decisions may conflict and, when joined, may violate system consistency rules or non-functional requirements. These design conflicts can hinder collaboration and result in wasted effort. Proactive detection of code-level conflicts has been shown to improve collaborative productivity; however, the computational resource requirements for proactively computing design conflicts have hindered its applicability in practice. Our survey and interviews of 50 architects from six large software companies find that 60% of their projects involve collaborative design, that architects consider integration costly, and that design conflicts are frequent and lead to lost work. To aid collaborative design, we re-engineer FLAME, our prior design conflict detection technique, to use cloud resources and a novel prioritization algorithm that, together, achieve efficient and nonintrusive conflict detection, and guarantee a bound on the time before a conflict is discovered. Two controlled experiments with 90 students trained in software architecture in a professional graduate program demonstrate that architects using FLAME design more efficiently, produce higher-quality designs, repair conflicts faster, and prefer using FLAME. An empirical performance evaluation demonstrates FLAME's scalability and verifies its time-bound guarantees. |
Wang, Qianqian; Brun, Yuriy; Orso, Alessandro Behavioral Execution Comparison: Are Tests Representative of Field Behavior? Proceedings Article In: Proceedings of the 10th IEEE International Conference on Software Testing, Verification, and Validation (ICST), pp. 321–332, Tokyo, Japan, 2017. @inproceedings{Wang17icst, Software testing is the most widely used approach for assessing and improving software quality, but it is inherently incomplete and may not be representative of how the software is used in the field. This paper addresses two questions: to what extent do tests represent how real users use software, and how can we measure behavioral differences between test and field executions? We study four real-world systems, one used by end-users and three used by other (client) software, and compare test suites written by the systems' developers to field executions using four models of behavior: statement coverage, method coverage, mutation score, and a temporal-invariant-based model we developed. We find that developer-written test suites fail to accurately represent field executions: the tests, on average, miss 6.2% of the statements and 7.7% of the methods exercised in the field; the behavior exercised only in the field kills an extra 8.6% of the mutants; finally, the tests miss 52.6% of the behavioral invariants that occur in the field. In addition, augmenting the in-house test suites with automatically-generated tests only marginally improves the tests' behavioral representativeness. These differences between field and test executions, and in particular the finer-grained and more sophisticated ones that we measured using our invariant-based model, can provide insight for developers and suggest a better method for measuring test suite quality. |
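The coverage-based portion of such a comparison reduces to set arithmetic over covered program elements. A minimal sketch of that computation, with the covered-element sets standing in for the output of a coverage tool:

```java
import java.util.HashSet;
import java.util.Set;

// What fraction of field-exercised elements (statements, methods) do the
// tests miss? The sets stand in for the output of a coverage tool.
public class CoverageDiff {
    public static double missedFraction(Set<String> testCovered,
                                        Set<String> fieldCovered) {
        Set<String> missed = new HashSet<>(fieldCovered);
        missed.removeAll(testCovered);
        return fieldCovered.isEmpty()
                ? 0.0 : (double) missed.size() / fieldCovered.size();
    }
}
```

The paper's 6.2% (statements) and 7.7% (methods) figures are missed fractions of this kind; its invariant-based model captures finer-grained behavioral differences that coverage alone cannot.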
2016 |
Beschastnikh, Ivan; Wang, Patty; Brun, Yuriy; Ernst, Michael D. Debugging Distributed Systems Journal Article In: Communications of the ACM, vol. 59, no. 8, pp. 32–37, 2016. @article{Beschastnikh16cacm, Distributed systems pose unique challenges for software developers. Reasoning about concurrent activities of system nodes and even understanding the system's communication topology can be difficult. A standard approach to gaining insight into system activity is to analyze system logs. Unfortunately, this can be a tedious and complex process. This article looks at several key features and debugging challenges that differentiate distributed systems from other kinds of software. The article presents several promising tools and ongoing research to help resolve these challenges. |
Ohmann, Tony; Stanley, Ryan; Beschastnikh, Ivan; Brun, Yuriy Visually Reasoning about System and Resource Behavior Proceedings Article In: Proceedings of the Demonstrations Track of the 38th International Conference on Software Engineering (ICSE Demo), pp. 601–604, Austin, TX, USA, 2016. @inproceedings{Ohmann16tool-demo-icse, As devices proliferate and software is ported to new platforms, developers must reconstruct how their software utilizes resources, such as network bandwidth and memory. Existing software comprehension approaches rarely consider how resource utilization affects system behavior. We present Perfume, a general-purpose tool to help developers understand how resource utilization impacts their systems' control flow. Perfume is broadly applicable, as it is configurable to parse a wide variety of execution log formats and applies to all resource types that can be represented numerically. Perfume mines temporal properties that hold over the logged executions and represents system behavior in a resource finite state automaton that satisfies the mined properties. Perfume's interactive interface allows the developer to understand system behavior and formulate and test hypotheses about system executions. A controlled experiment with 40 students shows that Perfume effectively supports understanding and debugging tasks. Students using Perfume answered 8.3% more questions correctly than those using execution logs alone and did so 15.5% more quickly. Perfume is open source and deployed at http://perfume.cs.umass.edu/. |
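In the spirit of Perfume's mining step, the sketch below records, for each observed pair of consecutive log events, the minimum and maximum change in a numeric resource; such bounds are the raw material for resource-annotated temporal properties. The log representation and event names are our assumptions for illustration, not Perfume's parser or property templates.

```java
import java.util.HashMap;
import java.util.Map;

// For each consecutive event pair in a logged execution, track the min and
// max change in a numeric resource (e.g., elapsed time or bytes sent).
// The log representation is an illustrative assumption, not Perfume's format.
public class ResourceMiner {
    record Bounds(double min, double max) {}

    public static Map<String, Bounds> mine(String[] events, double[] resource) {
        Map<String, Bounds> bounds = new HashMap<>();
        for (int i = 1; i < events.length; i++) {
            String edge = events[i - 1] + " -> " + events[i];
            double delta = resource[i] - resource[i - 1];
            bounds.merge(edge, new Bounds(delta, delta),
                    (old, cur) -> new Bounds(Math.min(old.min(), cur.min()),
                                             Math.max(old.max(), cur.max())));
        }
        return bounds;
    }
}
```

Aggregating these per-edge bounds across many logged executions yields the kind of resource-annotated behavioral model over which a developer can formulate and test hypotheses.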
Shin, Seung Yeob; Brun, Yuriy; Osterweil, Leon J. Specification and Analysis of Human-Intensive System Resource-Utilization Policies Proceedings Article In: Proceedings of the 8th International Workshop on Software Engineering in Healthcare Systems (SEHS), pp. 8–14, Austin, TX, USA, 2016. @inproceedings{Shin16SEHS, Complex, human-intensive systems, such as those used in hospital Emergency Departments, typically require the effective support of many types of resources, each governed by potentially complex utilization policies. Resource utilization policies range from simple, e.g., sickest patient first, to extremely complex, responding to changes in system environment, state, and stimuli. Further, policies may at times conflict with each other, requiring conflict resolution strategies that further increase the complexity. Sound policies for the management of these resources are crucial in assuring that these systems achieve their key goals. To help system developers make sound resource management decisions, this paper presents a resource utilization policy specification and analysis framework for complex human-intensive systems. We provide (1) a precise specification language to describe very diverse and potentially complex resource utilization policies, (2) a process- and resource-aware discrete-event simulation engine that executes simulations to dynamically evaluate the policies' effects on the outcomes achieved by systems that use the resources, and (3) a process- and resource-aware finite state verification framework that supports formal verification that resource management policies are correctly implemented by these simulations. |
Beschastnikh, Ivan; Wang, Patty; Brun, Yuriy; Ernst, Michael D. Debugging Distributed Systems Journal Article In: ACM Queue, vol. 14, no. 2, pp. 91–110, 2016. @article{Beschastnikh16queue, Distributed systems pose unique challenges for software developers. Reasoning about concurrent activities of system nodes and even understanding the system's communication topology can be difficult. A standard approach to gaining insight into system activity is to analyze system logs. Unfortunately, this can be a tedious and complex process. This article looks at several key features and debugging challenges that differentiate distributed systems from other kinds of software. The article presents several promising tools and ongoing research to help resolve these challenges. |