MentalJS Sandbox/Parser
Thursday, 18 October 2012
I was driving in the car on my own in a lot of traffic lights and I suddenly had an idea. To take the work I did in JSReg and make a real parser by matching starting and ending characters. I began to code it in my head as I was sitting in the lights. As the idea formed in my mind so did the idea of what to call it. MentalJS! It’s a cool name because you can say “whoa that’s mental” 🙂
How it works
I was new to how parsers worked so I looked around at the basic concepts but noticed there is a lexing step. I was confused because you can discover tokens on the fly as you parse. The main argument against this is you know the state before for example “{” to decide if it’s a block or object literal but by the time you get to “}” you don’t know the state since there could be many objects inside those pairs of characters. The core of my technique takes advantage of the fact that these characters require each other and a starting “{” character can’t be outside the last “}” character because that would be a syntax error. The great thing about this technique is that there are more characters in which you can match this way such as [ and ] and ( and ).
How do you know that “{” belongs to “}”? You just need a stack to pop states out when a closing “}” is found. Here is what the structure looks like:
lookups = {
'[':{
end:']',
states:[]
},
'{':{
end:'}',
states:[]
},
'(':{
end:')',
states:[]
}
};
When you encounter a “[” you populate the lookups state with the current state to the stack of the matching character, then when you encounter the ending “]” you read the lookups again and pop the state stack but return the known state. I do this because as you go inwards depending how many characters you have each pop will return the correct state for the matching character and these states will only be stored until the last outer character. If you have any states left at the end then you know there is a syntax error since there is a unmatched character.
There is a complication with this technique, some characters overlap such as the ternary operator which overlaps with object literals and labels since “:” is used by all. I got round this in two ways, for object literals I populated another state which stored the parent state and for case statements I used a current expression state to store the state until “:” was encounted. I used these states in the lookup function since when the closing character is encountered I can track the parent state and expression.
What I noticed with this technique is that it’s super fast since the lexing step/comment removal happens as it’s parsed. The code isn’t perfect however and I need to improve how I track states and automate the valid syntax and correct tokens and states because I’m doing it manually which is pretty messy at the moment.
Note I’m still learning parser terminology so apologises if I was incorrect with any references.
Sandboxing
I decided to dump support for non ES5 browsers since I want to execute within the window but prevent sandboxed code from writing to native objects. ES5 is the only way to do this reliably. Another reason is I want to use the enumerable feature which allows me to make properties non-enumerable for objects. This is a big difference to JSReg where I had to manually rewrite for loops to remove properties. Now I create a mirror of new object literals e.g. ({a,123}) will be ({“$a$”:123,”a”:1}) allowing me to whitelist properties but enabling the “in” operator to work. The core sandboxing features are the same as JSReg since they were pretty strong and there weren’t many hacks on the actual sandbox system.
Demo
I’ve created a simple demo to see how it works, it’s still early days yet but it’s parsing quite a lot of stuff and doing a lot of sandbox work.
MentalJS Demo
Credits
Thanks especially to LeverOne (Alexey Silin) (from the sla.ckers forums) since his JavaScript skills and hacks lead me to develop a real parser that could handle his attacks. I’d also like to thank all those who broke JSReg: Jonas Magazinius, Mario, Sirdarckcat, David Lindsay, Stefano Di Paola, Soroush Dalili, Giorgio Maone and anyone else I forgot.
No. 1 — October 19th, 2012 at 9:15 am
can you provide some docs with more info on parsers and lexing?
This was not very clear to me….thanks!
No. 2 — October 19th, 2012 at 3:38 pm
http://www.codeconscious.com/rebol/parse-tutorial.html
http://llvm.org/docs/tutorial/OCamlLangImpl1.html