The Naturalness of Software: A Research Vision. Abram Hindle, Earl Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu





public class FunctionCall {
    public static void funct1() {
        System.out.println("Inside funct1");
    }

    public static void main(String[] args) {
        int val;
        System.out.println("Inside main");
        funct1();
        System.out.println("About to call funct2");
        val = funct2(8);
        System.out.println("funct2 returned a value of " + val);
        System.out.println("About to call funct2 again");
        val = funct2(-3);
        System.out.println("funct2 returned a value of " + val);
    }

    public static int funct2(int param) {
        System.out.println("Inside funct2 with param " + param);
        return param * 2;
    }
}




English, Tamil, German:
Can be rich, powerful, expressive
...but "in nature" are mostly simple, repetitive, boring
➡ Statistical Models

Two Examples
A speech recognizer example: "European Central Fish"?
Another speech recognizer example: "fish++"?

Repetition ➡ Mathematical Models ➡ Useful Software

Is software really repetitive?

The “Uniqueness” of Code

Mark Gabel

Zhendong Su

A Study of the Uniqueness of Source Code, Gabel and Su, ACM SIGSOFT FSE 2010

How Redundant is Code?
How much code? 6,000 projects (C, C++, Java); 430,000,000 LOC
How long? Sequences of 6-77 tokens
(1) How matched? Exact match, or within 1-4 edits
(2) How matched? Raw tokens, or with renamed identifiers
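As a rough sketch of this style of measurement (not the actual Gabel-Su tooling; the toy corpus, window length, and exact-match criterion below are simplified assumptions), one can count how many fixed-length token windows recur across a corpus:

```python
from collections import defaultdict

def redundancy(files, n):
    """Fraction of n-token windows that occur more than once across
    the corpus (exact match on raw tokens)."""
    counts = defaultdict(int)
    windows = []
    for tokens in files:
        for i in range(len(tokens) - n + 1):
            window = tuple(tokens[i:i + n])
            windows.append(window)
            counts[window] += 1
    duplicated = sum(1 for w in windows if counts[w] > 1)
    return duplicated / len(windows) if windows else 0.0

# Two toy "files" that share a loop idiom but use different identifiers.
f1 = "for ( i = 0 ; i < n ; i ++ )".split()
f2 = "for ( j = 0 ; j < n ; j ++ )".split()
```

On raw tokens these two files share no 6-token window, since every window contains the differing identifier; mapping identifiers to a common placeholder first, as in the "renamed identifiers" condition, would make every window match.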

[Figure: Non-Uniqueness (Redundancy) in a Large Java Corpus. Percent redundancy (0-100%) vs. length of candidate code fragment in tokens (5-80); one curve for exact tokens, one with identifiers renamed.]

Software is really repetitive. How can we use this?

How has the “naturalness” (repetitive structure) of natural language been exploited?

Large Corpora ➡ Language Models ➡ Speech Recognition, Translation, etc.

Language Models
For any utterance U, 0 ≤ p(U) ≤ 1.
If Ua is uttered more often than Ub, then p(Ua) > p(Ub).

p("European Central Fish") < p("European Central Bank")
p(for (i = 0; i < 10; fish++)) < p(for (i = 0; i < 10; i++))
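Inequalities like these fall out of simple counting. A minimal sketch (the training corpus here is invented for illustration; real language models also smooth and back off rather than assigning hard zeros):

```python
from collections import Counter

def bigram_model(tokens):
    """Maximum-likelihood bigram estimates: p(b | a) = count(a b) / count(a)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    def p(a, b):
        return pairs[(a, b)] / contexts[a] if contexts[a] else 0.0
    return p

# Invented toy corpus: "Central Bank" occurs; "Central Fish" never does.
corpus = ("the European Central Bank said the Central Bank will act "
          "while a fish swam past the bank").split()
p = bigram_model(corpus)
```

Here p("Central", "Bank") is 1.0 while p("Central", "Fish") is 0.0, mirroring the inequality above; the zero for unseen events is exactly why real models need smoothing.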

History of Language Models in NLP
• Initially, "Rationalist Methods" based on linguistic and logical theories...
• ...enter the "Empiricist" approach:
✓ Most "natural" utterances are repetitive, simple
✓ Faster computers + large on-line corpora
➡ Good, high-quality language models
➡ Rapid, revolutionary advances

"Every time I fire a linguist, the performance of our speech recognizer goes up." (Fred Jelinek)

Language Models: a Revolution in NLP
The design and estimation of language models is at the heart of modern NLP.
Good Language Models have been used for:
➡ Speech recognition
➡ Natural language translation
➡ Document summarization
➡ Document retrieval

But what about code, and "code language models"?

Exploiting Code Language Models
Suggest the next token for developers

Complete the current token for developers

Assistive (speech, gesture) coding

Summarization and retrieval as translation

Fast, “good guess” static analysis

Search-based Software Engineering

Building a Language Model
Design: Large Text Corpus (Training) ➡ Estimation Algorithm ➡ Statistical Model
(the Model is estimated using frequency of occurrence!)
Evaluation: Large Text Corpus (Test) ➡ Model ➡ Model Quality


What a Language Model Does
"...of the European Central Bank" ➡ Language Model ➡ p(of), p(the), p(European), p(Central), p(Bank)

Vastly more complex Language Models

Almost always face data-sparsity

Novel, NLP-specific estimation methods

Evaluating Language Model Quality
A good model does not find the words it encounters "too surprising":
Frequently encountered language events are assigned higher probability.
Infrequent language events are assigned lower probability.
...measured using "Cross-Entropy".

Background: Cross Entropy
Is the Language Model a good description of the code (the FunctionCall example above)?
If yes: Low Cross Entropy!!
If no: High Cross Entropy!!

Measuring Goodness: Cross entropy
For a document with n words, where p(e_i) is the probability assigned by the Model to word e_i:

H = -(1/n) Σ_{i=1..n} log p(e_i)

Higher if the Model assigns low probability to frequent events;
lower if the Model assigns high probability to frequent events.
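Concretely, this is just the negative average log-probability of the document under the model, reported in bits. A minimal sketch with made-up per-word probabilities:

```python
import math

def cross_entropy(word_probs):
    """-(1/n) * sum of log2 p(e_i): the model's average surprise,
    in bits, over a document of n words."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

# A model that expected what it saw (p = 1/2 per word) vs. one that
# was consistently surprised (p = 1/32 per word). Invented numbers.
confident = cross_entropy([0.5] * 8)     # 1 bit per word
surprised = cross_entropy([1 / 32] * 8)  # 5 bits per word
```

The "five bits" and "3-4 bits" figures later in the deck are exactly this quantity, averaged over held-out text.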


What language model gives low cross-entropy?

n-gram models
• Intuition: Local Context Helps
• Examples (NL, then code):
• "multiple choice question"
• item = item→next

More context helps more!


n-gram models of code: Experimental Results (Java and C datasets)
[Figure: n-gram cross entropy vs. n-gram order (1-gram to 8-gram), for English and for code. English flattens near five bits; code drops to 3-4 bits.]

The Skeptic Asks... Is it just that C, Java, Python... are simpler than English?

➡ Do cross-project testing!

➡ Train on one project, test on the others

➡ If it’s all “in the language”, entropy should be similar

Train on one project, test on the others.

[Figure: Self entropy vs. cross entropy (roughly 2-14 bits) for ten Java corpus projects: Ant, Batik, Cassandra, Eclipse, Log4j, Lucene, Maven2, Maven3, Xalan-J, Xerces2.]

Train on one Ubuntu application domain, test on the others.

[Figure: Self entropy vs. cross entropy (roughly 2.5-6.0 bits) across Ubuntu corpus categories: Admin, Doc, Graphics, Interpreters, Mail, Net, Sound, Tex, Text, Web.]

The "Naturalness" Vision
Suggest the next token for developers
Complete the current token for developers

Assistive (speech, gesture) coding

Summarization and retrieval as translation

Stupid, statistical, static analysis

Search-based Software Engineering

Uses type, scope, etc.!

Suggesting Tokens
What token could appear here? What token has most often appeared here?
Use just the previous two tokens!
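That two-token lookup can be sketched directly (the token stream below is invented; a real engine trains on a large corpus and merges its candidates with Eclipse's own):

```python
from collections import Counter, defaultdict

def train_suggester(tokens):
    """Index every (prev2, prev1) context to the tokens seen after it,
    ranked most-frequent-first."""
    table = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        table[(a, b)][c] += 1
    def suggest(a, b):
        return [tok for tok, _ in table[(a, b)].most_common()]
    return suggest

# Invented training stream of code tokens.
stream = ("item = item -> next ; item = item -> next ; "
          "item = item -> prev ;").split()
suggest = train_suggester(stream)
```

Given the context ("item", "->"), the suggester ranks "next" above "prev" because it occurred more often after that context in the training stream.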

Do n-grams help?
The Eclipse Suggestion Engine and the Language Model each produce suggestions; a Merge Algorithm combines them, and evaluation on a Test Set (existing code) measures the additional benefit from the Language Model.


How many more correct suggestions?
[Two ranked suggestion lists: Eclipse's alone, and Eclipse's merged with the Language Model's.]
Language Models ALWAYS improve performance.

[Figure: Percent gain over Eclipse (up to ~120%) and raw gain in counts (up to ~4000) vs. suggestion length in tokens (3-15). Improved performance at every token length.]

N-Gram suggestions always add value to the native Eclipse suggestion engine, in a very large trial.

Can be rich, powerful, expressive; mostly simple, repetitive, boring ➡ Statistical Models

The “Naturalness” Vision Suggest the next token for developers

Complete the current token for developers

Assistive (speech, gesture) coding

?????

Summarization and retrieval as translation

?????

Fast, "good guess" static analysis

Search-based Software Engineering

Assisted Coding
Dasher++: Rachel Aurand (Graduate Student)
Eclipse

The “Naturalness” Vision Suggest the next token for developers

Complete the current token for developers

Assistive (speech, gesture) coding

Summarization and retrieval as translation

Fast, “good guess” static analysis

Search-based Software Engineering

Noisy Channel Model
"Comment allez vous?"
He's trying to speak English, but it is systematically "messed up" into French. What was the most likely English sentence he was trying to say?
Maybe he's saying: "Do you comment all your code?"
Oh, it must be: "Fine, thank you. How are you?"

p(E | F) = p(F | E) · p(E) / p(F)

Maximize the numerator over E to get the best translation: the Most Likely English Sentence.
Where do the probability distributions come from?
p(F | E): the most likely way "it got messed up" (a Joint Distribution from an Aligned Corpus)
p(E): an English Language Model
p(F): a Normalizing Constant
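Decoding then just maximizes that numerator over candidate sentences E. A toy sketch of the argmax (the candidate set and all probabilities below are invented for illustration):

```python
def decode(candidates, channel_p, lm_p):
    """Noisy-channel decoding: argmax over E of p(obs | E) * p(E).
    p(obs) is the same for every candidate, so it can be dropped."""
    return max(candidates, key=lambda e: channel_p[e] * lm_p[e])

# Acoustically, "Fish" and "Bank" sound alike (similar channel scores),
# but the language model overwhelmingly prefers "Bank".
channel_p = {"European Central Fish": 0.40, "European Central Bank": 0.35}
lm_p = {"European Central Fish": 1e-9, "European Central Bank": 1e-4}
best = decode(list(channel_p), channel_p, lm_p)
```

The same argmax drives the code-summarization variant of the model, with the channel distribution estimated from an aligned code-English corpus.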

Noisy Channel Model
Toast.makeText(context, "hello", 5).show();
He's trying to speak English, but it comes out as funny-sounding code. What was the most likely English summary of this code?
Maybe his code means: "Make me some toast?"
Oh, it must be: "Pop up a message window!"

p(E | C) = p(C | E) · p(E) / p(C)

Where do the probability distributions come from?
p(C | E): a Code-English Joint (aligned) Corpus
p(E): a "Domain-Specific" English Language Model
p(C): a Normalizing Constant

The “Naturalness” Vision Suggest or Complete next tokens

Assistive (speech, gesture) coding

Summarization and Retrieval as Translation

Learn and Enforce Coding Conventions

Syntax Errors

Machine Translation for Porting

Fast, “good guess” static analysis

Search-based Software Engineering

Suggest Documents