Nothing ever becomes real till it
is experienced; even a proverb is no
proverb to you till your life has illustrated it.
— John Keats
Last update 18-January, 2019
Experimentation in Computing
Project Description
Spring 2019
Ninth Mason / Skövde Workshop on Experimentation in Computing
Program Chairs:
Jeff Offutt & Birgitta Lindström
Technical Program Committee:
All students taking Experimentation in Computing
The Mason / Skövde Workshop on Experimentation in Computing
provides a forum for discussing
current experimental studies in the field of computing.
Papers are solicited for the studies listed
in this CFP,
as well as for other studies.
Accepted papers will not be published in any conference proceedings.
Submitted papers must not have been published previously,
but they may be submitted elsewhere in the future.
All submitted papers will be accepted.
Full-Length Papers:
Papers should be submitted single-spaced
in a font size no smaller than 11 points,
fully justified.
Papers must not exceed 12 pages
including references and figures,
and will not be refereed by external reviewers.
All papers should indicate what is interesting about the presented work.
The first page should include an abstract of at most 150 words,
a list of keywords,
the author’s name,
affiliation,
and contact information
(email address and URL).
Papers should be single-author.
The citations and references should be formatted in
standard computing format,
that is,
with bracketed citations (“[1]”)
and citation keys that are either numeric or
strings based on the authors’ names (“[Basi91]”).
Presentations:
You will be allowed 20 minutes
for your presentation,
plus 5 minutes for questions.
Submission Procedure:
A first draft of each paper
must be submitted before
22 April
by posting on the Piazza bulletin board.
Each paper will receive at least three reviews,
one from a program chair and two from technical program committee members
(your classmates).
Reviews will be returned on
29 April,
and the final paper must be submitted electronically by
13 May.
Final papers must be submitted in PDF format (not MS Word or LaTeX!).
The final paper must be single spaced and in 10 point font.
Milestones                      Date
Topic selection:                4 February
Experimental design review:     25 February
Draft paper submitted:          22 April
Reviews due:                    29 April
Final paper submitted:          13 May
Presentations:                  TBD
Don’t mind criticism --
If it is untrue, disregard it,
If it is unfair, don’t let it irritate you,
If it is ignorant, smile,
If it is justified, learn from it.
- Anonymous
Following are possible topics for your empirical study.
You may choose any topic you wish,
either from this list or your own idea.
I specifically encourage you to consider carrying out an experiment related to your current research.
Talk with your research advisor or supervisor.
Many of these suggestions are related to software testing.
This emphatically does not imply a preference in the class;
it simply reflects the limits of our creativity:
most of our ideas are about testing problems.
The suggestions are listed in no particular order.
There might be a fair amount of overlap between these studies,
and you may want to share programs, test data sets,
or other artifacts.
Trading experimental artifacts of this kind is greatly encouraged!
Some of these studies could use a partner to carry out some of the work,
to avoid bias from having one person conduct the entire experiment.
I encourage you to help each other;
please communicate among yourselves if you need help ...
ask and offer.
These descriptions are concise overviews
and most are fairly open-ended, by design,
to encourage more creativity and divergent thinking.
If you need help understanding or refining a suggestion,
please ask your instructors.
Suggestions for empirical studies
- What are the effects of test driven development?
TDD reverses the traditional software development process.
Instead of the traditional process of
(1) writing functional requirements,
(2) implementing code to satisfy the requirements,
(3) and then designing tests to evaluate how well the code satisfies the requirements,
TDD engineers
(1) create automated tests to specify functional behavior,
(2) develop code to ensure the tests pass,
(3) and then refactor the code to improve non-functional attributes such as maintainability and efficiency.
At least two empirical questions remain open.
First, is the resulting TDD code different from code built in the traditional way?
Is it larger, smaller, more or less complex, more reliable, or different in any other quantitative way?
Second, how good are the resulting TDD tests as compared with traditionally created tests?
What do they miss in terms of coverage,
failure detection, etc?
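As a minimal illustration of the test-first step (the function and its tests below are hypothetical examples, not part of any proposed study), the TDD cycle might look like this in Python:

```python
# Step 1 of TDD: write an automated test that specifies the intended
# behavior before any implementation exists.
def test_leap_year():
    assert leap_year(2000) is True    # divisible by 400
    assert leap_year(1900) is False   # divisible by 100 but not by 400
    assert leap_year(2016) is True    # divisible by 4
    assert leap_year(2019) is False   # not divisible by 4

# Step 2: write just enough code to make the test pass.
def leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Step 3 would refactor while keeping the test green.
test_leap_year()
```

The empirical questions above ask whether code and tests produced this way differ measurably from those produced in the traditional order.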
- RACC vs. CACC in real life?
Restricted Active Clause Coverage (RACC)
and
Correlated Active Clause Coverage (CACC)
are test criteria based on logic expressions.
The difference between the definitions of
RACC and CACC
is small and subtle.
Some RACC requirements are infeasible
when the CACC requirements on the same logic predicate
are feasible.
But is this difference significant in real software?
That is,
how many predicates in existing software
behave differently under RACC than under CACC?
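The definitional difference can be made concrete by brute-force enumeration. The sketch below (illustrative code, assuming the standard textbook definitions) counts CACC and RACC test pairs for the major clause a of the predicate p = a and (b or c):

```python
from itertools import product

def cacc_racc_pairs(pred, n, major):
    """Enumerate test pairs for one major clause of an n-clause predicate.

    A pair (t, f) satisfies CACC if the major clause determines the
    predicate in both rows and takes opposite values; RACC additionally
    requires every minor clause to have identical values in the two rows.
    """
    rows = list(product([False, True], repeat=n))

    def determines(row):
        flipped = list(row)
        flipped[major] = not row[major]
        return pred(row) != pred(tuple(flipped))

    true_rows = [r for r in rows if r[major] and determines(r)]
    false_rows = [r for r in rows if not r[major] and determines(r)]
    cacc = [(t, f) for t in true_rows for f in false_rows]
    racc = [(t, f) for (t, f) in cacc
            if all(t[i] == f[i] for i in range(n) if i != major)]
    return cacc, racc

# p = a and (b or c); clause a is the major clause (index 0)
p = lambda r: r[0] and (r[1] or r[2])
cacc, racc = cacc_racc_pairs(p, 3, 0)
# 9 CACC pairs but only 3 RACC pairs for this predicate
```

Running the same enumeration over predicates mined from real programs is one way to measure how often the two criteria actually diverge.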
- How are mutation tests different from human-designed tests?
While researchers have evaluated the quality of human-designed tests
by measuring them against mutation,
nobody has asked whether
human-designed tests tend to miss
particular types of mutants.
Unkilled mutants may reveal types of faults that humans tend to miss.
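A toy example of the phenomenon (hypothetical function, single relational-operator mutant):

```python
def is_positive(x):      # original
    return x > 0

def ror_mutant(x):       # relational operator replacement: > becomes >=
    return x >= 0

# Typical human-designed tests pick an obviously positive and an
# obviously negative input, so the mutant survives both:
human_tests = [5, -3]
survives = all(is_positive(t) == ror_mutant(t) for t in human_tests)

# Only the boundary input 0 distinguishes the two, i.e., kills the mutant:
killed_by_boundary = is_positive(0) != ror_mutant(0)
```

A study could classify surviving mutants from real human-designed suites by operator type to see whether boundary-style mutants like this one dominate.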
- Quality of JUnit assertions (test oracles)?
My former PhD student, Nan Li, and I extended the traditional
RIP (reachability-infection-propagation) model
to the RIPR model by adding revealability.
We noticed that even when tests cause incorrect behavior,
the test oracle sometimes does not observe the incorrect part of the output space,
so the fault is not revealed.
This brings the question:
How good are the test oracles in automated tests?
Or more specifically,
how often do JUnit assertions fail to reveal incorrect behavior?
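A small Python sketch of the revealability problem (the faulty function is hypothetical; the weak oracle mirrors a JUnit assertion that checks only part of the output):

```python
def divmod_faulty(a, b):
    # Fault: the quotient is truncated toward zero instead of floored,
    # so it is wrong for negative dividends.
    q = int(a / b)           # should be a // b
    r = a - (a // b) * b     # the remainder is computed correctly
    return q, r

# The input (-7, 2) reaches the fault, infects the state (q is -3 but
# should be -4), and the wrong value propagates to the output.
q, r = divmod_faulty(-7, 2)

weak_oracle_passes = (r == 1)      # checks only r: the fault is NOT revealed
strong_oracle_passes = (q == -4)   # also checks q: would fail, revealing it
```

Counting how often real JUnit suites behave like the weak oracle here would directly address the question above.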
- Does weak mutation work with minimal mutation?
Ammann, Delamaro, Kurtz, and Offutt recently invented
the mutation subsumption graph,
which allows us to identify the minimal set of mutants needed,
a set that is much smaller than the full set.
Years ago, experiments found that weak mutation,
where results are checked immediately after the mutated statement
rather than at the end of execution,
works almost as well as strong mutation.
However, these results may not hold with minimal mutation,
thus a new experiment is needed to validate minimal-weak mutation.
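The weak/strong distinction can be illustrated in a few lines (hypothetical program and mutant):

```python
def original(x):
    y = x + 1        # statement targeted for mutation
    return abs(y)

def mutant(x):
    y = x - 1        # arithmetic operator mutant: + replaced by -
    return abs(y)

def weakly_killed(x):
    # Weak mutation compares program state immediately after the
    # mutated statement (the value of y).
    return (x + 1) != (x - 1)

def strongly_killed(x):
    # Strong mutation compares final outputs.
    return original(x) != mutant(x)

# For x = 0 the mutant is weakly killed (y is 1 vs -1) but not strongly
# killed, because abs() collapses the difference.
```

The proposed experiment would ask whether such gaps matter when only the minimal (dominator) mutants are used.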
- Covering the model versus covering the program:
If we design and generate tests to cover a model of a program,
for example,
a finite state machine or UML diagram,
how well will those tests cover the program on the same coverage criterion?
Note that this study could be done with multiple test generation criteria.
- Major vs. muJava vs. PIT vs. javalanche:
Several mutation tools are available,
each of which uses a different collection of mutation operators.
Clearly, these operators will result in different tests,
but how different are they in terms of strength?
The simplest comparison would be a cross-scoring,
where tests are created to kill all mutants for each tool,
then run against all mutants generated by the other tools.
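The cross-scoring itself is straightforward to script. In this illustrative sketch (not tied to any tool's API), a mutant is abstracted as the set of tests that kill it, and each suite is the test set built against one tool's mutants:

```python
def cross_score(suites, mutants):
    """Return the fraction of each tool's mutants killed by each suite.

    suites:  {tool_name: set of test names}
    mutants: {tool_name: list of kill-sets, one per mutant, where a
              kill-set is the set of tests that kill that mutant}
    """
    scores = {}
    for s_tool, suite in suites.items():
        for m_tool, mset in mutants.items():
            killed = sum(1 for kill_set in mset if suite & kill_set)
            scores[(s_tool, m_tool)] = killed / len(mset)
    return scores

# Toy data: suite A kills all of tool A's mutants, suite B all of B's.
mutants = {"A": [{"t1"}, {"t2"}], "B": [{"t2"}, {"t3"}]}
suites  = {"A": {"t1", "t2"},     "B": {"t2", "t3"}}
scores = cross_score(suites, mutants)
# Diagonal entries are 1.0 by construction; off-diagonal entries
# measure how well one tool's tests transfer to another's mutants.
```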
- Comparing input space partitioning criteria:
Dozens of studies comparing structural, data flow, and mutation test criteria
have been published.
But I have not seen any studies that compared input space partitioning criteria
such as each choice, base choice, pair-wise, and multiple base choice.
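As a concrete reference point, Base Choice Coverage is simple to generate. In this sketch (the characteristics and blocks are invented for illustration), the first block of each characteristic forms the base test:

```python
def base_choice_tests(blocks):
    """Base Choice Coverage: take the first block of each characteristic
    as the base test, then vary one characteristic at a time."""
    base = [b[0] for b in blocks]
    tests = [tuple(base)]
    for i, block_list in enumerate(blocks):
        for b in block_list[1:]:
            t = list(base)
            t[i] = b
            tests.append(tuple(t))
    return tests

# Three characteristics with two blocks each:
blocks = [["small", "large"], ["empty", "full"], ["read", "write"]]
tests = base_choice_tests(blocks)
# 4 tests here, versus 8 for all combinations; Each Choice needs only 2.
```

Comparing the fault detection of suites generated under each criterion would be the heart of the study.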
- Metrics comparison:
Researchers have suggested many ways to measure the
complexity and quality of software.
These software metrics are difficult to evaluate,
particularly on an analytical basis.
An interesting project would be to take two or more metrics,
measure a number of software systems,
and compare the measurements in an objective way.
The difficult part of this study would be the evaluation method:
How can we compare different software metrics?
To come up with a sensible answer to this question,
start with a deeper question:
What do we want from our metrics?
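One objective starting point is to compare how two metrics rank the same set of systems, for example with a Spearman rank correlation. In the sketch below the measurement values are invented purely for illustration:

```python
def rank(xs):
    """Return the rank position of each value (no tie correction)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for pos, i in enumerate(order):
        ranks[i] = pos
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: do two metrics order systems alike?"""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

loc      = [120, 300, 80, 450]   # hypothetical lines-of-code measurements
branches = [10, 35, 6, 40]       # hypothetical branch counts, same systems
# Here the two metrics rank the four systems identically (correlation 1.0),
# which would suggest one of them is redundant.
```

Whether rank agreement is the right notion of "comparison" is itself part of the deeper question posed above.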