In my current job I've had to deal with a few big Git merges lately. The difficulty with big merges doesn't lie so much in consolidating the conflicts in the various source files as in getting a proper Xcode project file out of it. Xcode project files often require manual intervention when merge conflicts occur, and if you've ever had to fix conflicts in one, you probably know that having a functioning project file is just the start. Once the project loads in Xcode again, it's about resolving duplicate filenames, adding files to targets, restoring missing file references and sometimes marrying Core Data models from the source and target branches. So I thought: wouldn't it be nice to have a tool that could assist me a little along the way? Such a tool would of course have to understand the Xcode format on both a syntactic and a semantic level. The Xcode project file isn't JSON, but it is in a format quite similar to it. So I wrote a parser for it, which I'd like to talk about in this blog post.
Structure & Implementation
To implement a parser, one first has to understand the building blocks of the files to be parsed. At a basic level, an Xcode project file is just a sequence of key-value pairs, with key and value separated by an equals sign. While the key can always be treated as a string, the value can take one of three forms:
- a string value
- a list of values
- a dictionary of values
When dealing with lists and dictionaries, their values can again be any combination of string, list and dictionary values.
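The three value forms, each of which can nest the others, map naturally onto a recursive Swift enum. This is only a sketch of the idea; the type and case names below are my own, not taken from the parser's actual source:

```swift
// Illustrative model of the three value forms an Xcode project file allows.
enum ProjectValue {
    case string(String)
    case list([ProjectValue])
    case dictionary([String: ProjectValue])
}

// A nested value, as it might appear in a project's build settings section.
let buildSettings: ProjectValue = .dictionary([
    "PRODUCT_NAME": .string("MyApp"),
    "OTHER_LDFLAGS": .list([.string("-ObjC"), .string("-lz")])
])
```

Because lists and dictionaries carry `ProjectValue` payloads themselves, arbitrary nesting comes for free.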
With the building blocks identified, parsing an Xcode project boils down to iterating over a state machine while scanning the project file. As shown in Picture 1, the state machine first skips any comments. Once the comments at the parser's current position have been consumed, the next element is a key, which is read and stored in a placeholder.
What comes next is the value, which can be one of three things. If it's a string value, it can be stored in a dictionary under the saved key. In the case of a list or a dictionary, however, the corresponding structure needs to be parsed first before it can be added. Since the structure is the same at every level (no matter where the parser currently is in the project file, it deals with the same combination of key-value pairs), scanning a dictionary or a list can easily be handled by a recursive call. And since there is only a finite number of key-value pairs, the recursion ends when all of them have been scanned.
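The recursion described above can be sketched as a small recursive-descent parser. This is a deliberately simplified stand-in (bare unquoted strings only, no comments, and it assumes well-formed input), with names of my own choosing, but it shows the shape of the real thing: one routine per structure level, each recursing into `parseValue`:

```swift
// Simplified recursive-descent sketch of the parsing scheme described above.
enum Value: Equatable {
    case string(String)
    case list([Value])
    case dict([String: Value])
}

struct MiniParser {
    let chars: [Character]
    var pos = 0

    init(_ s: String) { chars = Array(s) }

    mutating func skipWhitespace() {
        while pos < chars.count, chars[pos].isWhitespace { pos += 1 }
    }

    // The same rules apply at every level, so every container recurses here.
    mutating func parseValue() -> Value {
        skipWhitespace()
        switch chars[pos] {
        case "{": return .dict(parseDict())
        case "(": return .list(parseList())
        default:  return .string(parseString())
        }
    }

    mutating func parseString() -> String {
        var s = ""
        while pos < chars.count, !"=;,(){} \n\t".contains(chars[pos]) {
            s.append(chars[pos]); pos += 1
        }
        return s
    }

    mutating func parseDict() -> [String: Value] {
        pos += 1                               // consume "{"
        var dict: [String: Value] = [:]
        while true {
            skipWhitespace()
            if chars[pos] == "}" { pos += 1; return dict }
            let key = parseString()
            skipWhitespace()
            pos += 1                           // consume "=" (input assumed well-formed)
            dict[key] = parseValue()           // recursion
            skipWhitespace()
            if chars[pos] == ";" { pos += 1 }
        }
    }

    mutating func parseList() -> [Value] {
        pos += 1                               // consume "("
        var items: [Value] = []
        while true {
            skipWhitespace()
            if chars[pos] == ")" { pos += 1; return items }
            items.append(parseValue())         // recursion
            skipWhitespace()
            if chars[pos] == "," { pos += 1 }
        }
    }
}

var parser = MiniParser("{ name = MyApp; files = (a.swift, b.swift); }")
let tree = parser.parseValue()
```

The recursion terminates naturally: every call consumes at least one key-value pair or list element, and the input is finite.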
Unfortunately, scanning the structure of a project file is just half the battle. The other half is won by dealing with scanning keys, string values, lists and dictionaries at a more granular level. Reading string values can easily be solved with regular expressions. They can't, however, be used to scan lists and dictionaries: matching the same number of opening and closing braces (for dictionaries) or parentheses (for lists) requires recursion, and recursive patterns are unfortunately not available in Foundation's regular expression API, which uses a flavour of ICU regular expressions that doesn't support them. The parser solves this problem with a stack-based algorithm: each opening parenthesis pushes the current read position onto a stack, and each matching closing parenthesis pops it off again. Matching the stored position of each opening parenthesis with the position of its corresponding closing parenthesis determines the start and end of each sub-list.
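The stack-based matching can be sketched in a few lines. The function name and the index-pair representation are my own choices for illustration, not the parser's actual API:

```swift
// Stack-based matching: map each opening parenthesis to its closing
// counterpart by character index, as described above.
func matchParentheses(in text: String) -> [Int: Int]? {
    var stack: [Int] = []          // read positions of unmatched "("
    var pairs: [Int: Int] = [:]    // open index -> close index
    for (i, ch) in text.enumerated() {
        if ch == "(" {
            stack.append(i)        // push the current read position
        } else if ch == ")" {
            guard let open = stack.popLast() else { return nil }  // unbalanced
            pairs[open] = i        // pop: this ")" closes that "("
        }
    }
    return stack.isEmpty ? pairs : nil
}

// matchParentheses(in: "(a, (b, c), d)") maps 0 -> 13 and 4 -> 9,
// giving the start and end of the outer list and the nested sub-list.
```

The same walk works for braces; only the two characters being matched change.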
Measure first, then optimise ...
I tested the parser with a simple Xcode project, which it took roughly 4 seconds to scan. Since I wasn't happy with this result, I tried to improve it on various levels:
- I replaced all regular expressions with state machines in Swift
- I determined the ranges of all sub-expressions once and stored them in a look-up table
- I reduced the number of string copies made when dealing with sub-expressions
- I scanned all sub-structures in parallel
I knew early on that using regular expressions is not the fastest way to scan a string. The reason I went down this path anyway was that it got me a running implementation of the parser pretty quickly, and replacing the various regular expressions was something I could do gradually. Since I had written unit tests for pretty much all of the parser's methods, replacing them with switch-based state machines was straightforward.
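To give a flavour of what such a replacement looks like, here is a hypothetical switch-based state machine for one of the jobs the text mentions, skipping a `/* ... */` comment. The states and function name are my own; the real parser's machines will differ in detail:

```swift
// Switch-based state machine sketch: consume a /* ... */ block comment
// starting at `start`, returning the position after it (or `start` if
// there is no complete comment there).
enum CommentState { case code, slash, comment, star }

func skipBlockComment(_ chars: [Character], from start: Int) -> Int {
    var state = CommentState.code
    var i = start
    while i < chars.count {
        switch (state, chars[i]) {
        case (.code, "/"):    state = .slash     // possible comment start
        case (.code, _):      return start       // no comment here
        case (.slash, "*"):   state = .comment   // inside the comment body
        case (.slash, _):     return start       // "/" alone, not a comment
        case (.comment, "*"): state = .star      // possible comment end
        case (.comment, _):   break
        case (.star, "/"):    return i + 1       // comment fully consumed
        case (.star, _):      state = .comment   // "*" was part of the body
        }
        i += 1
    }
    return start                                 // unterminated comment
}
```

Each regular expression becomes one such loop: no pattern compilation, no match objects, just a switch over the current state and character.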
Another way to speed things up was to determine the position of each sub-expression in the parser's init method, before the actual scan. Storing the start position and range of each expression in a dictionary was much more efficient and enabled me to scan sub-structures in parallel later on.
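Combined with the stack-based matching from earlier, that look-up table can be built in a single pass. The following is a sketch under my own naming; it records, for every opening brace or parenthesis, the full range of the structure it opens:

```swift
// One pass over the text builds a table from each structure's start
// position to its full range, computed once up front in the initialiser.
struct RangeIndex {
    let subranges: [Int: Range<Int>]   // start index -> range of the structure

    init(_ text: String) {
        var stack: [Int] = []
        var table: [Int: Range<Int>] = [:]
        for (i, ch) in text.enumerated() {
            switch ch {
            case "(", "{":
                stack.append(i)                          // structure opens here
            case ")", "}":
                if let open = stack.popLast() {
                    table[open] = open..<(i + 1)         // structure's full extent
                }
            default:
                break
            }
        }
        subranges = table
    }
}
```

With the ranges known in advance, the scan of each sub-structure can later be handed out as an independent unit of work.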
When dealing with regular expressions, I did a lot of copying to convert substrings into strings. Once all those state machines were doing their job, I was able to optimise the scan time further by dropping the substring-to-string conversions as well.
Last but not least, I added an operation queue to scan sub-structures in parallel. Lists and dictionaries don't depend on each other, which makes them good candidates for parallel processing.
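Hypothetically, with the range table from before, the fan-out could look like this. The ranges and the "parse" closure below are stand-ins (the real parser does far more per range than slice out a string), and the lock simply serialises writes to the shared results array:

```swift
import Foundation

// Sketch: each independent sub-structure becomes one operation on a queue.
let queue = OperationQueue()
queue.maxConcurrentOperationCount = ProcessInfo.processInfo.activeProcessorCount

let chars = Array("(a,b)(c,d)")
let ranges: [Range<Int>] = [0..<5, 5..<10]      // hypothetical sub-structure ranges
var results = [String?](repeating: nil, count: ranges.count)
let lock = NSLock()

for (i, range) in ranges.enumerated() {
    queue.addOperation {
        let parsed = String(chars[range])        // stand-in for the real scan
        lock.lock(); results[i] = parsed; lock.unlock()
    }
}
queue.waitUntilAllOperationsAreFinished()
// Writing by index keeps the results ordered regardless of completion order.
```

Because each operation writes to its own slot, the only shared mutable state is the array itself, which the lock protects.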
By and large, these optimisations reduced the time to scan the sample project from 4 seconds down to 0.5 seconds.
The parser unfortunately doesn't do so well with big project files. My initial sample project had about 1,000 lines; real-world projects are much bigger, usually 10 to 20 times that size. When I used a project of about 17,000 lines, the parser needed about 15 seconds to scan it. I had expected an upper bound of less than 8.5 seconds (a linear scale-up of the 0.5 seconds for 1,000 lines), since the parser should do better than linear scaling. The speed bumps in the current implementation are the data structures the project is scanned into. Hence, to improve the parser further I need to concentrate on
- using different data structures to hold the project data
- pre-allocating memory so that allocating new structures doesn't impact scan time
The project gave me something to play around with after getting home from my main job. I love puzzles, and when things didn't go as expected during the implementation, I wanted to figure out why. I was therefore tempted to optimise the parser even further before writing this post, but the news about NIO then convinced me to move on. So further optimisations need to wait a little longer. In the meantime, anybody who's interested can find the source code for the current parser implementation here.