The “Naturalness” of Software: A Research Vision
Abram Hindle, Earl Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu
    public class FunctionCall {
        public static void funct1 () {
            System.out.println ("Inside funct1");
        }
        public static void main (String[] args) {
            int val;
            System.out.println ("Inside main");
            funct1();
            System.out.println ("About to call funct2");
            val = funct2(8);
            System.out.println ("funct2 returned a value of " + val);
            System.out.println ("About to call funct2 again");
            val = funct2(-3);
            System.out.println ("funct2 returned a value of " + val);
        }
        public static int funct2 (int param) {
            System.out.println ("Inside funct2 with param " + param);
            return param * 2;
        }
    }
English, Tamil, German
Can be rich, powerful, expressive
..but “in nature” are mostly simple, repetitive, boring
➡ Statistical Models
Two Examples
A speech recognizer example: “European Central Fish”?
Another speech recognizer example: “fish++”?

Repetition ➡ Mathematical Models ➡ Useful Software
Is software really repetitive?
The “Uniqueness” of Code
Mark Gabel, Zhendong Su
“A Study of the Uniqueness of Source Code”, Gabel and Su, ACM SIGSOFT FSE 2010
How Redundant is Code?
How much code? 6,000 projects (C, C++, Java); 430,000,000 LOC
How long? Token sequences of length 6-77
(1) How matched? Exact match, or within 1-4 edits
(2) How matched? Raw tokens, or with renamed identifiers
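The measurement idea can be sketched as follows (an illustration only, not Gabel and Su's actual tooling): slide a fixed-length token window over a fragment and ask what fraction of its windows also occur, exactly, somewhere in a reference corpus.

```python
def redundancy(tokens, corpus_tokens, length):
    """Fraction of `length`-token windows of `tokens` that also
    appear somewhere in `corpus_tokens` (exact-match criterion)."""
    corpus_windows = {
        tuple(corpus_tokens[i:i + length])
        for i in range(len(corpus_tokens) - length + 1)
    }
    windows = [tuple(tokens[i:i + length])
               for i in range(len(tokens) - length + 1)]
    if not windows:
        return 0.0
    hits = sum(1 for w in windows if w in corpus_windows)
    return hits / len(windows)

# Toy example: a target fragment vs. a one-line "corpus".
corpus = "for ( i = 0 ; i < n ; i ++ )".split()
target = "for ( i = 0 ; i < 10 ; i ++ )".split()
print(redundancy(target, corpus, 3))  # 8 of 11 windows match, ~0.73
```

Renaming identifiers before windowing (mapping every identifier to a placeholder) yields the study's second, looser matching criterion.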
[Figure: Non-Uniqueness (Redundancy) in a Large Java Corpus — percent redundancy (0-100) vs. length of candidate code fragment in tokens (5-80), for exact tokens and with identifiers renamed.]
Software is really repetitive. How can we use this?
How has the “naturalness” (repetitive structure) of natural language been exploited?
Large Corpora ➡ Language Models ➡ Speech Recognition, Translation, etc.
Language Models
For any utterance U, 0 ≤ p(U) ≤ 1.
If Ua is uttered more often than Ub, then p(Ua) > p(Ub).
p(“European Central Fish”) < p(“European Central Bank”)
p(for(i = 0; i < 10; fish++)) < p(for(i = 0; i < 10; i++))
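A toy illustration of the inequality: a maximum-likelihood model over whole utterances just uses relative frequency, so the more frequently uttered string gets the higher probability. The corpus counts below are invented.

```python
from collections import Counter

# Hypothetical corpus of whole utterances; an MLE "language model"
# uses relative frequency: p(U) = count(U) / total.
corpus = (["European Central Bank"] * 99) + ["European Central Fish"]
counts = Counter(corpus)
total = sum(counts.values())

def p(utterance):
    return counts[utterance] / total

print(p("European Central Bank"))  # 0.99
print(p("European Central Fish"))  # 0.01
```

Real models assign probabilities to unseen utterances too, by decomposing them into smaller events, as the n-gram models later in the talk do.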
History of Language Models in NLP
• Initially, “Rationalist Methods” based on linguistic and logical theories...
• ...then enter the “Empiricist” approach:
Most “natural” utterances are repetitive, simple
✓ Large on-line speech corpora
✓ Faster computers
➡ Good, high-quality language models
➡ Rapid, revolutionary advances

“Every time I fire a linguist, the performance of our recognizer goes up” (Fred Jelinek)
Language Models: a Revolution in NLP
The design and estimation of language models is at the heart of modern NLP.
Good language models have been used for:
➡ Speech recognition
➡ Natural language translation
➡ Document summarization
➡ Document retrieval
But what about code? and “code language models”?
Exploiting Code Language Models Suggest the next token for developers
Complete the current token for developers
Assistive (speech, gesture) coding
Summarization and retrieval as translation
Fast, “good guess” static analysis
Search-based Software Engineering
Building a Language Model
Model Design ➡ Estimation Algorithm, run over a Large Text Corpus (Training) ➡ Statistical Model
(Model estimated using frequency of occurrence!)
Then: Statistical Model + Large Text Corpus (Test) ➡ Evaluation ➡ Model Quality
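The pipeline above can be sketched end to end: estimate a unigram model from a training corpus by frequency of occurrence, then score a held-out test corpus by per-word surprise (cross-entropy, defined a few slides later). The two corpora are invented; add-one smoothing stands in for the fancier estimation methods real models need for unseen words.

```python
import math
from collections import Counter

# Training and test corpora (invented for illustration).
train = "the cat sat on the mat".split()
test = "the dog sat on the mat".split()

# Estimation: unigram counts, with add-one smoothing so an unseen
# test word ("dog") does not get probability zero.
counts = Counter(train)
vocab = set(train) | set(test)

def p(word):
    return (counts[word] + 1) / (len(train) + len(vocab))

# Evaluation: average surprise (in bits) per test word.
quality = -sum(math.log2(p(w)) for w in test) / len(test)
print(f"cross-entropy: {quality:.2f} bits/word")  # ~2.56
```

A better model (more data, richer context) would lower this number on the same test corpus.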
What a Language Model Does
“..of the European Central Bank”
Language Model assigns: p(of), p(the), p(European), p(Central), p(Bank)
Real-world language models are vastly more complex,
almost always face data sparsity,
and rely on novel, NLP-specific estimation methods.
Evaluating Language Model Quality
A good model is one that finds the words it encounters not “too surprising”:
frequently encountered language events are assigned higher probability,
infrequent language events are assigned lower probability.
....measured using “Cross-Entropy”
Background: Cross Entropy
Is the language model a good description of a given program (e.g., the FunctionCall example above)?
Low cross-entropy ➡ the model describes the code well.
High cross-entropy ➡ the model describes the code poorly.
Measuring Goodness: Cross Entropy
For a document with n words, where p(e_i) is the probability the model assigns to word e_i:

    H = -(1/n) * Σ_{i=1..n} log p(e_i)

Higher if the model assigns low probability to frequent events;
lower if the model assigns high probability to frequent events.
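The formula is a direct average of per-word surprise; a minimal transcription, assuming the model's per-word probabilities are already given:

```python
import math

def cross_entropy(probs):
    """H = -(1/n) * sum(log2 p(e_i)) over the n words of a document,
    where p(e_i) is the probability the model assigned to word i."""
    n = len(probs)
    return -sum(math.log2(p) for p in probs) / n

# A model that finds a document unsurprising (high per-word probability)...
print(cross_entropy([0.5, 0.5, 0.5, 0.5]))      # 1.0 bit per word
# ...versus one that finds the same document surprising.
print(cross_entropy([0.01, 0.01, 0.01, 0.01]))  # ~6.64 bits per word
```

Base-2 logs give the answer in bits, matching the “3-4 bits” and “five bits” figures later in the talk.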
What language model gives low cross-entropy?
n-gram models
• Intuition: Local Context Helps
• Examples (NL, then code):
  • “multiple choice question”
  • “item = item→next”
More context helps more!
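A minimal sketch of the idea: a trigram model (“use just the previous two tokens”) trained by counting over a toy token stream. Real models add smoothing for unseen contexts; the code fragment below is invented.

```python
from collections import Counter, defaultdict

def train_trigrams(tokens):
    """Count, for each two-token context, which token follows it."""
    model = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)][c] += 1
    return model

def suggest(model, a, b):
    """Most frequent continuation of the context (a, b), if any."""
    following = model[(a, b)]
    return following.most_common(1)[0][0] if following else None

code = "item = item -> next ; item = item -> next ;".split()
model = train_trigrams(code)
print(suggest(model, "item", "->"))  # 'next'
```

Longer contexts (4-grams, 5-grams, ...) sharpen the counts, which is why cross-entropy keeps dropping with n in the results that follow.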
n-gram models of code: Experimental Results
[Figure: n-gram cross entropy (0-10 bits) vs. model order (1-gram through 8-gram), for English and for code (Java and C datasets); code levels off around 3-4 bits, well below English.]
The Skeptic Asks... Is it just that C, Java, Python... are simpler than English?
➡ Do cross-project testing!
➡ Train on one project, test on the others
➡ If it’s all “in the language”, entropy should be similar
Train on one project, test on the others.
[Figure: cross entropy (2-14) per corpus project (Ant, Batik, Cassandra, Eclipse, Log4j, Lucene, Maven2, Maven3, Xalan-J, Xerces2), with each project's “self” cross entropy shown for comparison.]
Train on one Ubuntu application domain, test on the others.
[Figure: cross entropy (2.5-6.0) per corpus category (Admin, Doc, Graphics, Interpreters, Mail, Net, Sound, Tex, Text, Web), with each category's “self” cross entropy shown for comparison and numeric point labels (e.g., 15, 86, 135).]
The “Naturalness” Vision
Suggest the next token for developers
Complete the current token for developers
Assistive (speech, gesture) coding
Summarization and retrieval as translation
Stupid, statistical, static analysis
Search-based Software Engineering
Suggesting Tokens
What token could appear here? (Eclipse's suggestion engine uses type, scope, etc.)
What token has most often appeared here? Use just the previous two tokens!
Do n-grams help?
Eclipse Suggestion Engine + Suggestion from Language Model
➡ Merge Algorithm ➡ Evaluation on a Test Set (existing code)
➡ Additional benefit from the Language Model?
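The slides do not spell out the merge algorithm; one plausible sketch, purely illustrative, interleaves the two ranked suggestion lists best-first and drops duplicates. Both suggestion lists below are hypothetical.

```python
from itertools import zip_longest

def merge(eclipse_suggestions, lm_suggestions, limit=10):
    """Interleave two ranked suggestion lists, best-first, dropping
    duplicates. (Illustrative only; not the authors' actual algorithm.)"""
    merged, seen = [], set()
    for pair in zip_longest(eclipse_suggestions, lm_suggestions):
        for s in pair:
            if s is not None and s not in seen:
                seen.add(s)
                merged.append(s)
    return merged[:limit]

eclipse = ["toString", "hashCode", "equals"]   # hypothetical engine output
lm = ["toString", "println", "valueOf"]        # hypothetical n-gram output
print(merge(eclipse, lm))
# ['toString', 'hashCode', 'println', 'equals', 'valueOf']
```

Evaluation then replays existing code token by token and counts how often the correct next token appears in the merged list but not in Eclipse's alone.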
How many more correct suggestions?
Merging the language model's ranked suggestions (Suggestion1, Suggestion2, ... Suggestion10) into Eclipse's list:
Language Models ALWAYS improve performance.
[Figure: percent gain over Eclipse (0-120%) and raw gain in suggestion count (0-4000) vs. suggestion length in tokens (3-15); improved performance at every token length.]
N-gram suggestions always add value to the native Eclipse suggestion engine, in a very large trial.
Natural language: can be rich, powerful, expressive, but mostly simple, repetitive, boring ➡ Statistical Models
The “Naturalness” Vision
Suggest the next token for developers
Complete the current token for developers
Assistive (speech, gesture) coding ?????
Summarization and retrieval as translation
????? Fast, “good guess” static analysis
Search-based Software Engineering
Assisted Coding
Dasher++ (Rachel Aurand, graduate student)
Eclipse
The “Naturalness” Vision Suggest the next token for developers
Complete the current token for developers
Assistive (speech, gesture) coding
Summarization and retrieval as translation
Fast, “good guess” static analysis
Search-based Software Engineering
Noisy Channel Model
“Comment allez vous?”
He's trying to speak English, but it got systematically “messed up” into French.
Maybe he's saying: “Do you comment all your code?”
Oh, it must be: “Fine, thank you, how are you?”
What was the most likely English sentence he was trying to say?

    p(E | F) = p(F | E) . p(E) / p(F)

Maximize the numerator over E to get the best translation.
Where do the probability distributions come from?
p(F | E): the most likely way “it got messed up” (a joint distribution from an aligned corpus)
p(E): an English language model
p(F): a normalizing constant
Noisy Channel Model, for code
Toast.makeText(context, “hello”, 5).show();
He's trying to speak English, but it comes out as funny-sounding code.
Maybe his code means: “Make me some toast?”
Oh, it must be: “Pop up a message window!”
What was the most likely English summary of this code?

    p(E | C) = p(C | E) . p(E) / p(C)

Where do the probability distributions come from?
p(C | E): a code-English joint (aligned) corpus
p(E): a “domain-specific” English language model
p(C): a normalizing constant
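A toy decoder for this equation: pick the English summary E maximizing p(C | E) · p(E), since the denominator p(C) is constant over E. All probabilities below are invented for illustration; in practice p(C | E) would be estimated from an aligned code-English corpus and p(E) from an English language model.

```python
# Candidate English summaries E for the code C = Toast.makeText(...).show()
p_english = {                       # p(E): English language model (invented)
    "Pop up a message window": 0.6,
    "Make me some toast": 0.4,
}
p_code_given_english = {            # p(C | E): channel model for this C (invented)
    "Pop up a message window": 0.5,
    "Make me some toast": 0.001,
}

def decode():
    """argmax over E of p(C | E) * p(E) -- Bayes without the denominator."""
    return max(p_english, key=lambda e: p_code_given_english[e] * p_english[e])

print(decode())  # 'Pop up a message window'
```

The same argmax structure, with the roles of code and English swapped, drives the “retrieval as translation” direction.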
The “Naturalness” Vision Suggest or Complete next tokens
Assistive (speech, gesture) coding
Summarization and Retrieval as Translation
Learn and Enforce Coding Conventions
Syntax Errors
Machine Translation for Porting
Fast, “good guess” static analysis
Search-based Software Engineering