Python Natural Language Code Parser

Demo

As can be seen in the demo, my program (using a TKinter interface), can accept Natural Language as input and parse this into Python - allowing a user to dictate their code.

How it works

The repository is open-source and available here, with the important part of the code found in the interpret.py file.

The program loads a set of patterns and uses recursive pattern matching against the user’s input (or “utterance”) to find all patterns which represent a possible interpretation. Patterns consist of “phrases” and “literals”. By breaking down every small command we could want to understand (fewer than 300 small patterns such as the following), we can build up a highly complex and comprehensive parser.

In the following pattern:


X plus Y

plus is a literal (denoted by lowercase); if an utterance contains several words with the word “plus” in the middle, it will match. X and Y are phrases (denoted by uppercase) which matches with any number of other words. If there a match exists, the phrases are assigned to X and Y in the same way that variables might be assigned.

Given the above pattern, one matching utterance might be my variable plus one is equal to five, where my variable is assigned to X and one is equal to five is assigned to Y.

After a match is found, X and Y are evaluated in the same way as the initial utterance (recursively) until it has been split up as much as possible.

The main patterns for the program to match are found in /Python/main.txt. If no patterns create a full match (for the initial utterance and all the sub-utterances/ sub-phrases), the program simply says “no match”.

The atomic building block (that which can’t be split further) is the variable name. This program doesn’t store special values for variable names (it would be impossible to store every possible variable name), so it must detect when a user is talking about some variable name. This means that when no more patterns match during a recursion, a variable name is created from the remaining phrase.

In the section Introducing types, I explain how I prevent it from turning everything it doesn’t have a pattern for into a variable name.

Disambiguation

There may be many pattern matches for a single utterance - this means the “true intent” of an utterance must be disambiguated from other matches. The program finds the best match based on the highest specificity (i.e. the pattern containing the most literals that match with the utterance). For example, if there is a pattern which matches an utterance word-for-word, it’s quite likely that I want to interpret it with that pattern. Here’s how it works with an example:

Let’s say we have two patterns documented, which we know how to interpret:


# Natural Language Utterance :: Matching Python Code

X plus X :: X + X
N plus equals X :: N += X

If the user utters "my variable plus equals five", there are two matches to disambiguate:


my_variable + equals_five # X plus X :: X + X
my_variable += 5          # N plus equals X :: N += X

The goal here is to disambiguate the += shorthand from adding my_variable to another variable which happens to be called equals_five - both solutions match the pattern (the program is dumb and doesn’t store any special values → it must decide which words are variable names).

To do this disambiguation, it chooses the most specific pattern match. In this case, X plus X matches one word in the utterance, whereas, N plus equals X matches two. The more specific match is the second pattern, which it selects.

It turns out, this is a really good rule of thumb for disambiguation in this way!

Introducing types

I introduced different types that a phrase can be in my patterns (remember all uppercases are phrases in my patterns):

X: statement (or block of code)

N: name of a variable

B: boolean statement

V: value (e.g. string/ number)

I: iterable

Each type has it’s own file which stores the patterns associated with that type. When the pattern matcher hits a B variable, it will open the B patterns and only try to match the phrase against the boolean patterns.

For example, in the pattern:


# Natural Language Utterance :: Matching Python Code

if B X :: if B:\n \t X

The interpreter will find a match for something like, if three is greater than two and my variable is true print that my variable is true, where the patterns for B create a several matches within the B patterns for three is greater than four and my variable is true and print that my variable is true creates a match (or several) for X. The output becomes:


if 3 > 2 and my_variable == True:
	print("my variable is True")

Only those phrases which are stored as the N type will be interpreted as variable names and the parser will turn into snake case. This restricts the interpreter from turning every indivisible pattern into a variable name.

Adjacent phrases and complexity

Note that if B X is an example where there are two adjacent phrases in a pattern. Currently this searches all the variations of splitting the words uttered after the word if to try and find a pattern which satisfies the pattern.

Looking back on the project, the complexity of the matching algorithm (recursive with no memoisation, plus the ability to evaluate multiple adjacent phrases of different types) is the problem I would most like to improve upon if I looked at this system again.

Pattern examples

Here are some examples of patterns I took a short amount of time to write, which cover a lot of parsing NL into Python. Note that some of the natural language in the demo could easily have been made to accept more natural sounding input (e.g. "add a comment to say X" instead of "comment X") by tweaking the patterns or adding more special case patterns.


#####################
# VARIABLE STATEMENTS
#####################
N is V :: N = V
N equals X :: N = V
N is a list of X :: N = [ X ]
N equals true :: N = True
N equals false :: N = False
N is true :: N = True
N is false :: N = False
###############
# IF STATEMENTS
###############
if :: if <?>: \n \t
if else :: if <?>:\n \t<?>\n\belse:\n \t
if B :: if B:\n \t
if N :: if N:\n \t
if B X :: if B:\n \t X
if B then X :: if V :\n \t X
if N then X :: if N :\n \t X
if B then X else :: if X :\n \t X \n\belse: \n
if B then X else X :: if X :\n \t X \n\belse: \n X
else X :: \belse: \n \t X
elif B :: \belif X: \n \t
elif B then X :: \belif X: \n \t X
################
# FOR STATEMENTS
################
for N in X :: for N in X : \n \t
for N in X do X :: for N in X : \n \t X
for N in V X :: for N in V : \n \t X
for N in range len V X :: for N in range(len(V)) : \n \t X

It’s easy to see how some simple patterns can build up to make complex interpretations.

What was this based on: Enguage

I built this system based on my understanding of the Natural Language Processing Software called Enguage, which I was working with in 2019-20.

Here’s how Enguage creates an interpretation from an utterance (a video I created to explain part of how the software works):