It appears that since I spend (way too much) time writing code, I apparently should be writing, even if I'm apparently not sure that this will interest people. So, let's go down this path again fighting my procrastination and my perfectionism daemons while sharing what I'm doing and what I'm discovering.
Let's start with why, how and what I'm doing on (one of) my (too many) last project.
The Baron project goal is to a create an AST for the python programming language that guarantees a lossless conversion between the source code and the AST.
Not clear? Let's start with definitions.
AST stands for "abstract syntax tree", it is an abstract representation of what some source code file means from the compiler/interpreter point of view.
In general, when you execute (or compile) some source code, for e.g by doing "python my_script.py", the interpreter/compiler will parse the source file, transform it into an AST, then transform this AST into something he understand (bytecode for example). (The reality is more complicated with a lot more steps).
While this is the most frequent use of an AST, this is not the only one: it can be use for everything related to code analysis, modification, creating tools, modifying the inner representation of code before sending it to the interpreter (some libs do that, for e.g. py.test does this with the asserts) etc... Those are the case that are interesting me.
So Baron is going to offer you an AST for the python language where the
source code → Baron's AST → source code will give an identical source
Yes, python standard lib allows you very easily to play with python's AST which is very cool. The problems are that:
isinstanceeverywhere) but none of them are very cool or pleasant to use (I've already done that several times)
There are several other existing tools, but none of them guaranteed a lossless conversion. The closest one is this tokenize lib that is very close to a lossless conversion but which is not made for it (and I was too lazy to hack it for that, and it's "only" the tokenizing part).
This question is already partially answered, but let's hit the nail one last time. So, having an AST with a lossless conversion with the source code will allow, among other things:
Also, a fun consequence: once you have this high level AST, if you add a conversion between this AST and another format, you'll be able to edit you python code in this format and convert it back to its original (or modified) source code. For example: if you transform this AST into xml, you'll be able to use BeautifulSoup or lxml or xpath or xslt on your python source code. I'm not sure if this will useful but I like this idea.
My current strategy can be summarize by one quote from this very interesting article:
The details of this paper aren't quite as important as the general concept: a compiler is nothing more than a series of transformations of the internal representation of a program. The authors promote using dozens or hundreds of compiler passes, each being as simple as possible. Don't combine transformations; keep them separate.
You can find the current version of the code here.
Next post will probably be on how I've written the splitting part.
Thanks for reading and have a nice day,