2014-04-13

Bison

Today I decided to take a bite of Bison. In my distant past I had already experimented with it (though in a strongly yacc-like fashion), and by this I mean I had read a few quick and simple tutorials: nothing more than the omnipresent infix calculator with frills, which by the way is also the main example in the manual, with enhancements found in mfcalc (hopefully updated for the future). To me the matter (using the tool as well as understanding bits of its inner workings) is vast, deep and really interesting, but this is also the reason why I was always pushed towards stack-based languages in my experiments with this world. Stack-based languages can smooth away complexity, which is largely undesired in toy languages. But at the very same time this makes these toy languages almost alien. Stack-based or not, beyond a certain point a computer language can't do without a tool like Bison, unless you want to do a lot of handcrafting yourself. There could be good reasons to do so, but I can't imagine one that fits the world of a toy language.

So, maybe only to make a noise and a vibration here and there, here's the result and, ladies and gentlemen, it is … hold on tight … the omnipresent basic infix calculator! More or less. In fact, you can assign the result of an expression to a symbol, and use it later. The lexical scanner reads only from standard input, and … again hold on tight … you can write 0.5a instead of 0.5*a! I admit it, the MetaFONT book was very influential on me, and so was the MetaFONT language, which is, off the top of my head, the only language I know that accepts a more natural notation for multiplication. Think about it: 2a is a syntax error in the vast majority of computer programming languages, at least among the best known. Even languages designed to handle math stuff (I am thinking mainly about R and Octave, and also Maxima!) disallow this syntactic sugar. Nothing bad, but my very simple infix calculator makes it possible! This is an incredible feature!! (Irony here, of course.)
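To make the 0.5a idea concrete, here is a minimal sketch of a calculator with assignment and implicit multiplication. It is not the Bison grammar from the gist: it is a hand-written recursive descent evaluator in Python, and every name in it (tokenize, Calc, evaluate) is made up for the illustration. One assumption of mine: implicit multiplication is triggered only when a factor is followed directly by a symbol or an opening parenthesis.

```python
import re

# One token per match: a number, a name, or any other single character.
TOKEN = re.compile(r"\s*(?:(\d+\.?\d*|\.\d+)|([A-Za-z_]\w*)|(.))")

def tokenize(text):
    tokens = []
    for number, name, other in TOKEN.findall(text):
        if number:
            tokens.append(("num", float(number)))
        elif name:
            tokens.append(("id", name))
        elif other.strip():              # skip stray whitespace
            tokens.append((other, other))
    return tokens + [("end", None)]

class Calc:
    def __init__(self):
        self.symbols = {}                # symbol table: name -> value

    def evaluate(self, line):
        self.toks, self.pos = tokenize(line), 0
        target = None
        # Assignment has the shape: id '=' expr
        if self.toks[0][0] == "id" and self.toks[1][0] == "=":
            target, self.pos = self.toks[0][1], 2
        value = self.expr()
        if self.peek() != "end":
            raise SyntaxError("unexpected token %r" % (self.peek(),))
        if target is not None:
            self.symbols[target] = value
        return value

    def peek(self):
        return self.toks[self.pos][0]

    def advance(self):
        tok = self.toks[self.pos]
        self.pos += 1
        return tok

    def expr(self):                      # expr: term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.advance()[0] == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):                      # term: factor (('*' | '/')? factor)*
        value = self.factor()
        while True:
            if self.peek() == "*":
                self.advance()
                value *= self.factor()
            elif self.peek() == "/":
                self.advance()
                value /= self.factor()
            elif self.peek() in ("id", "("):
                value *= self.factor()   # implicit multiplication: 0.5a, 2(a+1)
            else:
                return value

    def factor(self):
        kind, val = self.advance()
        if kind == "num":
            return val
        if kind == "id":
            if val not in self.symbols:
                raise NameError("undefined symbol " + val)
            return self.symbols[val]
        if kind == "-":
            return -self.factor()
        if kind == "(":
            value = self.expr()
            if self.advance()[0] != ")":
                raise SyntaxError("missing ')'")
            return value
        raise SyntaxError("unexpected token %r" % (kind,))
```

With this sketch, evaluating "a = 4" and then "0.5a" gives 2.0. In a Bison grammar the same effect would come from an extra rule for juxtaposed factors, which is where precedence conflicts may need some care.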

If you are interested in these basic things for beginners and in a complete, messy, but working example to play with, you can take it from a gist of mine. I have avoided full C++ style (other examples show the “driver class” C++ approach) and used C++ only where it makes things easier (the STL map class; for the rest, C would have sufficed). The next idea will be similar, maybe, and it will be about lambda. Indeed, since Easter is near, I have started these tests in order to build a toy tool to play with Church's lambda calculus (shame on you, a cool Xyz already exists that you can use very profitably! OK, being profitable or whatever is not my plan, I am just playing to keep my last two surviving neurons almost alive), but I suppose I will be late, as usual.

Final note: on gist, if you give the file a name, you can't choose the highlighting you want. So, since .yy is an unknown extension, it was impossible for me to select C++ highlighting. We live in a world dominated by extensions rather than by users' will.

2014-02-22

Plans for Golfrun

The C Golfrun interpreter is stuck, as it was before. I think there's no reason to hold on to the idea of a full formalization of the language. I still want to rewrite it in C++11, but I don't want to find myself trying to shoehorn in this or that feature of the language for the sake of it. Rather, I will use bits of C++11 when and if they fit; otherwise plain C++, and not even very C++-ish: it will be C++ for the STL rather than for heavy object orientation. Likely the STL will be just a cozy replacement for glib, and that'll be all.

In the meantime the language drifted and diverged from GolfScript in my mind. And from the current Golfrun too! The changes I thought about were:
  • no comments: code golfing can do without comments; and if you need them, a string that you then drop from the stack can be used instead. There's a difference of course, since the string becomes a token and is pushed on the stack at run time, while a comment is consumed by the lexical analyzer and never results in a token. In contrast with common good practice, the rule is: avoid comments! The change makes the symbol # available for other magic;
  • strings: use only "; strings without interpretation of escape sequences can be added through another syntax, like _"string". Another symbol, ', becomes available;
  • case-insensitive symbols: a change of case can be used to separate symbols; e.g. thisIS has two tokens, THIS and IS. It could be useful to save some extra space. A sequence like ThIs will produce 4 tokens: T, H, I, S.
  • maybe, rational numbers: a syntax like 0r13/3 could be used. A number like 0.123 would be written as 0r123/1000, which has length 10 against 5. Cumbersome, and bad for code golfing in few strokes. Maybe I should accept the fact that a new symbol must be used for this; e.g. 0'123. No, I don't want to make it impossible to duplicate a number without adding extra space(s), e.g. 12.13++ must leave 37 on the stack and not give a stack underflow instead. So maybe dup must become ' and the dot must go back to its common meaning as part of the syntax for numbers.
    • Rather the ' could be used as part of the syntax to introduce some kind of literals, e.g. '0.123 (where the dot is the decimal separator)
  • underscore can't be part of a symbol anymore, so hey_ and _hey will both result in two symbol tokens. But if followed by a number, it will be the unary minus, as in J; so you can write 5_5+ instead of 5 -5+ and the minus will be only a dyadic operator.
  • assignment syntax can't be used to assign to single-character non-alphabetic symbols, e.g. {5}:* won't work. Instead, the syntax could be used to mean some sort of symbol modifier, i.e. it is interpreted as the token :*. The longer assignment syntax will be used (to be defined; it will be similar to the lookup system service).
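A rough sketch of how two of these rules (case changes separating symbols, and underscore as the sign of a number) could look in a tokenizer. This is Python, not Golfrun's actual scanner; the function name is hypothetical, normalizing symbols to upper case is my assumption based on the thisIS example, and strings, blocks, decimals and rationals are left out entirely.

```python
import re

# Hypothetical token shapes, for illustration only:
#   _?\d+   an integer, optionally negated by a leading underscore (as in J)
#   [a-z]+  a run of lower-case letters (one symbol)
#   [A-Z]+  a run of upper-case letters (one symbol)
#   \S      any other single non-blank character (operators, lone underscore)
TOKEN = re.compile(r"_?\d+|[a-z]+|[A-Z]+|\S")

def tokenize(source):
    tokens = []
    for text in TOKEN.findall(source):
        if text[-1].isdigit():
            # "_5" is the number -5; the minus sign stays a purely dyadic operator
            tokens.append(int(text.replace("_", "-")))
        elif text.isalpha():
            # symbols are case-insensitive: normalize each run to upper case
            tokens.append(text.upper())
        else:
            tokens.append(text)
    return tokens
```

Under these assumptions thisIS yields THIS and IS, ThIs yields T, H, I, S, and 5_5+ yields the integers 5 and -5 followed by the + operator, while hey_ and _hey both give two tokens.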
Other syntax changes were already made, in particular those that allow pushing complex numbers onto the stack and writing the colon symbol (simply by doubling it). Data types are or will be:
  • numbers
    • integers (arbitrary precision using the GNU Multiprecision arithmetic library)
    • complex integers, i.e. there's a real (integer) part and an imaginary (integer) part
    • rationals (complex or not), maybe
  • strings (of bytes; not C strings, so that they can contain zero bytes as well)
  • blocks: they are strings after all, but with a different syntax, and they can trigger different behaviour of operators
  • arrays (collections of heterogeneous objects)
  • hashmaps (keys are only strings; these strings can be the string representation of an object)
There will be 2 stacks: an operand stack and a context stack. The context stack "contains" the operand stack and the symbol table. Currently, Golfrun can restore the original symbol table using a specific “system service”. This won't be needed anymore, and the mechanisms of the context can be used instead. The context basically provides local variables and a local stack.
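One possible reading of the context idea, sketched in Python. Nothing here is Golfrun's real design: the class and method names are hypothetical, and the carry/results parameters (how many operands move into and out of a context) are my own assumption about how a local stack might exchange values with its parent. The standard library's ChainMap gives the "local variables" part almost for free, since definitions made in a child map do not touch the parent.

```python
from collections import ChainMap

class Context:
    """A context owns an operand stack and a view of the symbol table."""
    def __init__(self, symbols):
        self.operands = []
        self.symbols = symbols

class Interpreter:
    def __init__(self, builtins):
        # The bottom context holds the original symbol table.
        self.contexts = [Context(ChainMap(dict(builtins)))]

    @property
    def current(self):
        return self.contexts[-1]

    def enter(self, carry=0):
        """Push a new context, moving the top `carry` operands into it."""
        parent = self.current
        child = Context(parent.symbols.new_child())  # writes stay local
        if carry:
            child.operands = parent.operands[-carry:]
            del parent.operands[-carry:]
        self.contexts.append(child)

    def leave(self, results=0):
        """Pop the current context, handing its top `results` operands back."""
        if len(self.contexts) == 1:
            raise RuntimeError("cannot leave the outermost context")
        child = self.contexts.pop()
        if results:
            self.current.operands.extend(child.operands[-results:])
```

Leaving a context discards whatever was redefined inside it, so the original symbol table comes back without a dedicated system service.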

Some extra built-ins will be kept, e.g.
  • dd (as 2dup in Forth); shorter synonym: D
  • sys (“system service”); single symbol ":" (written :: in the syntax) as synonym
  • stack (debug purpose mainly: dump the stack => stack associated with the topmost context)
  • sqrt (now it could return rational numbers approximating the result); shorter synonym: ST
  • type (return the type of the object on the stack, without dropping it); shorter synonym: T
Others, added: e.g. 2swap (Forth) as SS, and rotation of more than 3 objects (@), as R.

A lot of symbols are now "free", and others need to have a defined behaviour with some kinds of arguments. Coercion/implicit conversions need a clear, easy-to-remember rule.

An example of unclear behaviour is: [97 98]""+1/ will result in an array with two single-character strings, "a" and "b". How do you go back? With something more elaborate, such as {(\;}%. This means that ( (or )) over a string will behave as if the string were an array of integers, except that the "head" (or "tail") is pushed as an integer, while the rest of the string remains a string. This is the same in GolfScript, where "ab") will result in two objects on the stack, the string "a" and the integer 98. To get the "head" or the "tail" as a single-character string, we need some extra work. In the eyes of some operators, a string is an array of integers. Which is not wrong, but sometimes the information about the original interpretation is lost, and coercion makes sense only if I am going to sum it with an integer. Something like Erlang's $a could be desirable (not using $ but something else instead).
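The same loss of information can be reproduced with Python's bytes type, which also treats a byte string as a sequence of integers when a single element is taken out. The helpers below only mimic the behaviour described above for this one example; their names are invented here and they are not GolfScript or Golfrun code.

```python
def from_codes(codes):
    """[97 98]""+ : turn an array of integers into a byte string."""
    return bytes(codes)

def split_every(value, n):
    """1/ : split a string (or array) into pieces of length n."""
    return [value[i:i + n] for i in range(0, len(value), n)]

def uncons_left(value):
    """( : return the rest and the first element."""
    return value[1:], value[0]

def uncons_right(value):
    """) : return the rest and the last element."""
    return value[:-1], value[-1]

def last_as_string(value):
    """The 'extra work': keep the last element as a one-character string."""
    return value[:-1], value[-1:]

pieces = split_every(from_codes([97, 98]), 1)      # [b"a", b"b"]
codes = [uncons_left(piece)[1] for piece in pieces]  # back to [97, 98]
```

Here uncons_right(b"ab") returns the string b"a" together with the integer 98, just like "ab") in GolfScript, while last_as_string(b"ab") keeps the tail as the string b"b". A dedicated character literal would make the second form the easy one.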

OK, I think it's time I started to code, without needing to have all this already planned in detail; otherwise I will never start again.

2013-12-31

Am I a human?

Indeed, sometimes I have doubts about it.

Anyway, Happy New 2014.