What is this article about?
This is the third article in a series about the life of developers of IDEs for databases.
Its structure is similar to that of the first and second articles, but this time I am not going to talk about text parsing. Instead, this article presents some tricks for working with files and various problems of building a big desktop application on the .NET platform.
To understand this article, it is not necessary to read the first and second parts in full, although the first article has several paragraphs that are perfect for delving into the development context. This part of the series should be interesting to a wider audience than the previous ones, but it still would not hurt to look through them beforehand. If you don't feel like doing that or lack the time, here are a few key points from the previous articles:
We are making a line of IDEs for the MySQL, SQL Server, Oracle, and PostgreSQL DBMSs.
It is a desktop application on the .NET stack, with everything that implies.
Many features are aimed at SQL code analysis, for which we use a heavily modified ANTLR.
SQL parsing is a difficult task in terms of performance and memory, so we constantly have to use various optimization tricks.
Throughout the article, I will add links to the other parts:
Part 1. The difficulties of parsing. Stories about manually enhancing ANTLR.
Part 2. Optimizing string processing and file opening.
Part 3. The life of Visual Studio extensions. Working with IO. Using SQL in unconventional ways.
Part 4. Working with exceptions and the influence of data on the development process. Using ML.NET.
Up to a point, this part of the article is similar in structure to the previous ones. For the sake of convenient navigation, and for those who are only interested in certain blocks, here are the contents:
What are the difficulties?
To start with, note that all the problems below unfold in the context of desktop development on the .NET stack.
The 32-bit problem
First of all, some users opt for the standalone versions of our products, while others work inside Visual Studio and SQL Server Management Studio, for which we develop many extensions. One of them is SQL Complete. To clarify, it replaces the standard code completion of SSMS and VS for SQL with a more powerful one and adds a number of useful functions, such as permanent document backup. Some of these functions are, by their nature, resource-consuming in terms of RAM.
However, at the dawn of 2020, VS and SSMS are still 32-bit applications. In theory, they can use only 3.25 GB of memory, and in practice much less. Plug-ins are loaded into the very same process, that is, they share this memory. SQL parsing is a very costly process, both in terms of CPU and RAM. To suggest the list of objects in user scripts without unnecessary calls to the server, we store an object cache in RAM. Oftentimes it doesn't take up much space, but some of our users have databases that contain up to a quarter of a million objects.
The tight RAM limits and the peculiarities of working with SQL led to a number of tasks that were genuinely interesting to solve. I spent days and even whole weeks running memory profilers and trying dozens of experiments. One case was a revelation to me: when we saved 6 bytes per instance of the Token class, we ended up saving about 100 MB of RAM, simply because there were so many of these objects.
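To make the arithmetic concrete, here is a minimal C# sketch. The field layouts are purely hypothetical and are not our actual Token class; they only show how a few bytes per instance scale across millions of objects, and the 17.5 million token count is an assumed figure that reproduces the order of magnitude:

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical layouts, not the real Token class: they only illustrate how
// shrinking the per-instance size scales across millions of objects.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct FatToken
{
    public int StartOffset;  // 4 bytes
    public int Length;       // 4 bytes: far wider than any real token needs
    public int TokenType;    // 4 bytes: the type enum fits in far fewer bits
}

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct PackedToken
{
    public int StartOffset;  // 4 bytes
    public ushort Length;    // 2 bytes is enough for a single token
    // the token type is assumed to be squeezed into spare bits elsewhere
}

static class TokenFootprint
{
    // Total savings for a given token count, in megabytes.
    public static long SavedMegabytes(long tokenCount) =>
        (Marshal.SizeOf<FatToken>() - Marshal.SizeOf<PackedToken>()) * tokenCount
        / (1024 * 1024);
}
```

With roughly 17.5 million tokens in memory, 6 bytes per instance adds up to about 100 MB, which matches the order of magnitude described above.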
Another exciting trick was the idea of throwing away everything parsed as soon as the user switches to another document. We can easily restore it all from the script text right after the user returns. This can cause a freeze of a second or two on a large file, yet it allows you to work with a hundred very big files.
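The drop-and-reparse idea can be sketched like this; the class and member names are invented for illustration and this is not our real code:

```csharp
using System;

// Sketch of the drop-everything-on-switch idea: only the active document
// keeps its parse result; switching away discards it, and returning
// reparses everything from the script text.
class ActiveDocumentCache<TTree> where TTree : class
{
    private readonly Func<string, TTree> _parse;
    private string _activeDocId;
    private TTree _tree;

    public ActiveDocumentCache(Func<string, TTree> parse) => _parse = parse;

    public TTree GetTree(string docId, string scriptText)
    {
        if (docId != _activeDocId || _tree == null)
        {
            _tree = null;               // drop the old tree before reparsing
            _tree = _parse(scriptText); // may freeze briefly on a huge file
            _activeDocId = docId;
        }
        return _tree;
    }
}
```

The design trade is explicit: a one-off reparse on document switch buys a steady-state footprint of only one parse tree, no matter how many files are open.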
Working with SQL is quite different from working with other languages. In C#, there are practically no files even a thousand lines long, whereas in SQL a developer can work with a database dump consisting of several million lines of code. There is nothing unusual about it.
DLL-Hell inside VS
The .NET Framework has a handy tool for developing plugins: application domains. Everything runs in isolation, and unloading is possible. In fact, the implementation of extensions is perhaps the main reason application domains were introduced.
There is also the MAF framework, designed by Microsoft to solve the problem of creating add-ins for a program. It isolates add-ins to such an extent that it can move them into a separate process and take over all the communication. Frankly speaking, this solution is too cumbersome and has not gained much popularity.
Unfortunately, for various reasons, Microsoft Visual Studio and the SQL Server Management Studio built upon it implement the extension system differently. On the one hand, this makes it simple for plugins to access the hosting application, but on the other, it forces them to fit in together within one process and domain.
Just like any other application in the 21st century, ours has a lot of dependencies. The majority of them are well-known, time-proven, popular libraries in the .NET world. Unfortunately, this is exactly what causes various collisions. In one case, as part of an update, Microsoft updated Newtonsoft.Json, which is used inside VS, and added the corresponding binding redirect to the configuration file. This broke another library we were using, which depended on an older version, and we urgently had to release an update.
Many a time the source of problems has been the DevExpress controls library, upon which our UI is built. The problem is that this library is good not only for us but also for other extension authors for VS. Sometimes users install our application together with applications from other developers. We have our own release cycle, they have theirs, and the user has his own idea of when which products should be updated.
As a result, assemblies built against different versions of DevExpress end up in the same domain (AppDomain). Perhaps .NET itself could resolve such a situation, but DevExpress takes over assembly resolution in such a way that two versions of this library within one process are guaranteed to result in a good deal of exceptions.
We experienced tons of problems when we made a humble attempt to start creating all new windows in WPF. The root of them all was the fact that WPF, by default, does not specify the StrongName of assemblies, so the CLR may simply take a random assembly that is not the one we need. This was very disappointing for us, as a number of our tasks would have been much easier to solve with WPF's geometry.
Some of the above-mentioned problems were solved by subscribing to the tricky AssemblyResolve event on the current AppDomain. The event handler changed in response to complaints from users with different sets of extensions, and after yet another change it became clear that we had started losing some of the initial cases. So we wrote a unit test for each of the scenarios and left a comment in the handler code reminding whoever changes it to write a test for every change.
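The shape of such a handler can be sketched as follows. AppDomain.CurrentDomain.AssemblyResolve is the real .NET event, but the fallback policy shown here (reuse an already-loaded assembly with the same simple name, ignoring version and StrongName) is only one illustrative policy, not our actual handler, which encodes many user-reported cases:

```csharp
using System;
using System.Linq;
using System.Reflection;

// Illustrative sketch: when binding fails, fall back to an assembly that is
// already loaded under the same simple name, regardless of version.
static class ResolveFallback
{
    public static Assembly FindLoaded(AssemblyName requested, Assembly[] loaded) =>
        loaded.FirstOrDefault(a =>
            string.Equals(a.GetName().Name, requested.Name,
                          StringComparison.OrdinalIgnoreCase));

    public static void Install() =>
        AppDomain.CurrentDomain.AssemblyResolve += (_, args) =>
            FindLoaded(new AssemblyName(args.Name),
                       AppDomain.CurrentDomain.GetAssemblies());
}
```

Keeping the policy in a pure method like FindLoaded is what makes the per-scenario unit tests mentioned above cheap to write.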
Pumping messages inside a lock
It is not widely known that the .NET Framework pumps the Windows message queue inside every WaitHandle. To put it another way, inside every lock, any handler of any event in the application can be called, provided the lock has time to switch to kernel mode, that is, it is not released during the spin-wait phase. This can result in re-entrancy in very unexpected places. A few times it led to problems like "Collection was modified during enumeration" and various ArgumentOutOfRangeExceptions, and it took a lot of time to understand what had really happened.
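One defensive pattern against the "Collection was modified" symptom is to enumerate a snapshot of the subscription list, so a handler that re-enters and mutates it cannot break the enumeration. This is an illustrative sketch, not our actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Raise events over a snapshot: a re-entrant handler may subscribe or
// unsubscribe freely without invalidating the ongoing enumeration.
class SafeEvent
{
    private readonly List<Action> _handlers = new List<Action>();

    public void Subscribe(Action h) => _handlers.Add(h);
    public void Unsubscribe(Action h) => _handlers.Remove(h);

    public void Raise()
    {
        // Enumerating _handlers directly would throw
        // "Collection was modified" if a handler unsubscribes itself.
        foreach (var h in _handlers.ToArray())
            h();
    }
}
```

The snapshot does not eliminate re-entrancy itself (the lock can still pump messages); it only makes this particular failure mode impossible.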
Cool reinventing-the-wheel solutions
Adding an assembly to a solution using SQL
To a man with a hammer, everything looks like a nail.
When a project grows, the initially simple task of adding assemblies develops into a dozen complicated steps, and it is easy to forget one of them. In my opinion, some of them could be handled by creating project templates, but apparently not all.
Once, during a big refactoring, we had to add a dozen different assemblies to the solutions. The thing is that nearly 80 solutions, including product and test ones, were built from around 300 .NET projects. Based on the product solutions, Inno Setup files were written; they include the lists of assemblies that are packaged into the installer the user downloads. The algorithm of adding a project looked the following way:
Create a new project.
Add a certificate to it. Set up the tag of the build.
Add a version file.
Reconfigure the paths where the project output goes.
Rename the folder to match the internal specification.
Add the project to the solution once again.
Add a couple of assemblies that all projects need links to.
Add the build to all necessary solutions: test and product.
For all the product solutions, add the assemblies to the installation.
All in all, these 9 steps had to be repeated about 10 times. Steps 8 and 9 are not that trivial, and it is easy to forget to add the build somewhere. Faced with such a big, routine task, any normal programmer would want to automate it, and that is exactly what we did. But how do we indicate which solutions and installations the newly created project should be added to?
There are many scenarios, and some of them are difficult to predict. Then a crazy idea came to us. Solutions are related to projects as many-to-many, projects to installations in the same way, and SQL is exactly the tool that has been created and perfected for decades to solve this kind of task.
Here is what we decided to do. We created a .NET Core console app that scans all the .sln files in the source folder, retrieves the list of projects from them with the help of the DotNet CLI, and puts it into an SQLite database. The program has a few modes:
New – creates a project and all the necessary folders, adds a certificate, sets up a tag, and adds a version file and the minimum essential assemblies.
Add-Project – adds the project to all solutions that satisfy the SQL query given as one of the parameters. To add the project to a solution, the program uses the DotNet CLI internally.
Add-ISS – adds the project to all installations that satisfy the SQL query.
Although the idea of indicating the list of solutions through an SQL query may seem cumbersome, it completely covered all the existing cases and, most likely, any possible cases in the future.
Let me demonstrate a usage scenario: create a project "A" and add it to all solutions where project "B" is used:
dbforgeasm add-project Folder1Folder2A "SELECT s.Id FROM Projects p JOIN Solutions s ON p.SolutionId = s.Id WHERE p.Name = 'B'"
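To show what the query expresses, here is the same relation modeled in memory with LINQ. The record and method names are invented, and the schema is inferred from the sample query above; the real tool runs the SQL against its SQLite catalog instead:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// In-memory sketch of the catalog the tool keeps in SQLite; the table and
// column names are taken from the sample query, the rest is guesswork.
record Project(int Id, string Name, int SolutionId);
record Solution(int Id, string Name);

static class Catalog
{
    // Equivalent of: SELECT s.Id FROM Projects p
    //                JOIN Solutions s ON p.SolutionId = s.Id
    //                WHERE p.Name = 'B'
    public static IEnumerable<int> SolutionsUsing(
        string projectName,
        IEnumerable<Project> projects,
        IEnumerable<Solution> solutions) =>
        from p in projects
        join s in solutions on p.SolutionId equals s.Id
        where p.Name == projectName
        select s.Id;
}
```

Expressing the selection declaratively is the whole point: any future "add this project wherever X is used" rule becomes a WHERE clause rather than new code.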
A problem with LiteDB
A couple of years ago, we were given the task of developing a background function for saving user documents. It had two main flows: the ability to instantly close the IDE and leave, then start from where you left off upon returning; and the ability to restore in urgent situations like blackouts or program crashes.
To implement this task, it was necessary to save the contents of files somewhere on the side, and to do it often and quickly. Apart from the contents, it was necessary to save some metadata, which made direct storage in the file system inconvenient. At that point, we found the LiteDB library, which impressed us with its simplicity and performance. LiteDB is a fast, lightweight embedded database written entirely in C#. We examined it in much detail before using it, and the speed and overall simplicity won us over. I wrote more about this here (habr, coding).
During development, the whole team was happy working with LiteDB; the main problems, however, started after the release. Unfortunately, some customers encountered problems that had not appeared at the testing stage. It took a great deal of communication with customers, log and stack analysis, and several maintenance releases to stabilize the product.
So what was the matter after all? The official documentation guaranteed that the database ensured proper work under concurrent access from multiple threads as well as from several processes. Aggressive synthetic tests showed, however, that the database did not work correctly in a multithreaded environment.
To fix the problem quickly, we synchronized the processes with each other using a self-written interprocess ReadWriteLock. Fortunately, we did not need concurrent writes from several processes. A little later, we created about 10 issues on GitHub and made a few pull requests to fix the situation. Now, almost three years later, LiteDB is working much better. Nevertheless, we learned our lesson and now examine every new third-party library in the project much more thoroughly.
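Interprocess synchronization like this can be built on a named Mutex, which .NET shares between processes that open the same name. Below is a minimal sketch of the mutual-exclusion core; all names are invented, and the lock we actually wrote was a read/write variant, which is more involved:

```csharp
using System;
using System.Threading;

// Sketch of cross-process locking on top of a named Mutex: every process
// that constructs the lock with the same name contends for the same OS
// object, so only one of them holds it at a time.
sealed class InterprocessLock : IDisposable
{
    private readonly Mutex _mutex;

    public InterprocessLock(string name) =>
        _mutex = new Mutex(initiallyOwned: false, name);

    // Blocks until the lock is free or the timeout expires; dispose the
    // returned handle to release the lock.
    public IDisposable Acquire(TimeSpan timeout)
    {
        if (!_mutex.WaitOne(timeout))
            throw new TimeoutException("Another process is holding the lock.");
        return new Releaser(_mutex);
    }

    private sealed class Releaser : IDisposable
    {
        private readonly Mutex _mutex;
        public Releaser(Mutex mutex) => _mutex = mutex;
        public void Dispose() => _mutex.ReleaseMutex();
    }

    public void Dispose() => _mutex.Dispose();
}
```

Note that a Mutex is reentrant within one thread, so this sketch guards against other threads and processes, not against nested acquisition on the same thread.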
The next problem is the opposite of the case of partial lexical analysis described in the previous part of the article. Let's start with some context. When working with text, it is more convenient to treat it as a list of strings. Strings can be requested in random order, but a certain density of memory access is still present. At some point, we needed to run several tasks that process very big files without loading them into memory in full. The idea was as follows:
Read the file line by line and remember the offset of each line in the file.
Upon request, return the requested line: seek to the stored offset and read the data.
The main task was completed: this structure takes up little space compared to the file size. At the testing stage, we thoroughly checked the memory footprint for big and very big files. Obviously, large files take a long time to process and small ones are processed immediately, but there was no reference point for checking the execution time.
RAM is called Random Access Memory for a reason; this is its competitive advantage over SSDs and especially over HDDs, which handle random access very badly. I can't recall exactly how we discovered it, but it turned out that this approach slowed the work down by almost 40 times compared to fully loading the file into memory. On top of that, we were reading the file 2.5 to 10 full times, depending on the context.
The solution was simple, and the improvement was enough that the operation takes only a little longer than when the file is fully loaded into memory, while RAM consumption remains insignificant. We found inspiration in the way the processor loads data from RAM into its cache: when you access an array element, the processor copies dozens of neighboring elements into its cache, because the elements you need next often turn out to be nearby.
A number of data structures exploit this processor optimization to gain top performance, and it is because of this peculiarity that random access to array elements is much slower than sequential access. We implemented a similar mechanism: we read a set of a thousand strings and remember their offsets. When the 1001st string is accessed, we drop the first 500 strings and load the next 500. If we then need any of the first 500 lines again, we go to each of them separately, because we already have the offsets.
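Such a reader can be sketched as follows. All names are invented, and this version reloads aligned windows of 1000 lines instead of the half-window scheme described above, but the idea is the same: keep every line's offset, keep only a window of decoded lines, and read the file sequentially within a window:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Sketch of an offset-indexed reader: the offsets of all lines stay in
// memory, but only a sliding window of decoded lines is cached.
class WindowedLineReader
{
    private const int Window = 1000;                     // lines cached at once
    private readonly string _path;
    private readonly List<long> _starts = new List<long>(); // byte offset of each line
    private readonly long _length;
    private string[] _cache = Array.Empty<string>();
    private int _cacheFirst = -1;                        // index of first cached line

    public WindowedLineReader(string path)
    {
        _path = path;
        using var fs = File.OpenRead(path);
        _length = fs.Length;
        _starts.Add(0);
        int b;
        long pos = 0;
        while ((b = fs.ReadByte()) != -1)               // index pass: find line starts
        {
            pos++;
            if (b == '\n' && pos < _length) _starts.Add(pos);
        }
    }

    public int LineCount => _starts.Count;

    public string GetLine(int index)
    {
        if (_cacheFirst < 0 || index < _cacheFirst || index >= _cacheFirst + _cache.Length)
            LoadWindow(index / Window * Window);         // one sequential read per window
        return _cache[index - _cacheFirst];
    }

    private void LoadWindow(int first)
    {
        int count = Math.Min(Window, LineCount - first);
        long end = first + count < LineCount ? _starts[first + count] : _length;
        var bytes = new byte[end - _starts[first]];
        using var fs = File.OpenRead(_path);
        fs.Seek(_starts[first], SeekOrigin.Begin);
        int read = 0;
        while (read < bytes.Length)
            read += fs.Read(bytes, read, bytes.Length - read);
        var lines = Encoding.UTF8.GetString(bytes).Split('\n');
        _cache = new string[count];
        for (int i = 0; i < count; i++)
            _cache[i] = lines[i].TrimEnd('\r');
        _cacheFirst = first;
    }
}
```

Within a window the drive is read strictly sequentially, which is exactly what HDDs and SSDs are good at; random access only decides which window to load next.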
The moral of this story: a programmer does not necessarily need to carefully formulate and check non-functional requirements, but he should at least keep a checklist of them in his head. Then again, such a checklist may lead to premature optimization, and excessive concentration on non-functional requirements can delay development without fixing every possible problem. In any case, we remembered for future reference that persistent memory should be accessed sequentially.
Let us draw the following intermediate conclusions:
VS and SSMS are still 32-bit applications, and the extensions for them have to fit in well with one another within a 32-bit address space.
All the assemblies of all the extensions are loaded into one AppDomain, so shared third-party libraries can cause problems.
It is necessary to carefully examine every third-party library included in the project, not only for convenience and performance but also for stability.
Let me repeat the conclusion of the previous article: after a successful performance optimization, take a memory usage profile, and vice versa.
See you in the final part!