PVS-Studio impressed by the quality of the ABBYY NeoML code



The other day, ABBYY published the source code of its NeoML framework. We were asked to check this library with PVS-Studio. The project is interesting from the analysis point of view, so we didn't put it off. Reading this article won't take you much time, as the project is of high quality :).



The NeoML source code can be found on GitHub. This cross-platform framework is designed for implementing machine learning models. ABBYY developers use it to solve computer vision and natural language processing problems, including image processing, document analysis, and so on. Currently, C++, Java, and Objective-C are supported, and Python should be added to this list soon. The framework itself is written mainly in C++.



Running analysis



It was very easy to run the analysis on this framework. After generating a Visual Studio solution with CMake, I launched the PVS-Studio analysis in Visual Studio for the projects we were interested in, excluding third-party libraries. Besides NeoML itself, the solution included ABBYY's NeoOnnx and NeoMathEngine libraries, so I added them to the list of projects to be analyzed as well.



Analysis results



Of course, I really wanted to find some terrible errors, but... the code turned out to be quite clean, and only a few warnings were issued. It is likely that static analysis had already been used during development. Many of the warnings were triggered by the same diagnostics on similar code fragments.



For example, calling a virtual method from a constructor is very common in this project. In general, this is a dangerous approach. The V1053 diagnostic responds to such cases: "Calling the 'foo' virtual function in the constructor/destructor may lead to unexpected result at runtime". A total of 10 such warnings were issued. You can read more about why this practice is dangerous and what problems it causes in Scott Meyers' article "Never Call Virtual Functions during Construction or Destruction". However, the developers apparently understand what they are doing, and there are no errors here.



There are also 11 warnings of the V803 diagnostic from the micro-optimizations section. This diagnostic recommends replacing a postfix increment with a prefix one if the previous value of the iterator is not used. With a postfix increment, an extra temporary object is created. Of course, this is not an error, just a small detail. If such diagnostics are not interesting, you can simply turn them off when using the analyzer. Besides, the "micro-optimizations" set is turned off by default anyway.



Actually, I think you understand that since the article has come down to such trifles as iterator increments, everything is generally fine, and we simply don't know what to find fault with.



Very often, some diagnostics may be inapplicable or uninteresting for a particular user, and it is better not to eat a cactus but to spend a little time configuring the analyzer. You can read more about the steps that let you get straight to the most interesting warnings in our article "How to quickly see the interesting warnings that the PVS-Studio analyzer generates for C and C++ code?"



Among the warnings from the "micro-optimizations" section there are also interesting V802 diagnostic warnings, which recommend arranging structure fields in descending order of type size, making it possible to reduce the structure's size.



V802 On 64-bit platform, structure size can be reduced from 24 to 16 bytes by rearranging the fields according to their sizes in decreasing order. HierarchicalClustering.h 31



struct CParam {
  TDistanceFunc DistanceType; 
  double MaxClustersDistance;
  int MinClustersCount; 
};


By simply swapping the MaxClustersDistance field of the double type with the DistanceType enumeration field, you can reduce the structure size from 24 to 16 bytes.



struct CParam {
  double MaxClustersDistance;
  int MinClustersCount; 
  TDistanceFunc DistanceType; 
};


TDistanceFunc is an enum, so its size is equivalent to int or less, which is why it should be moved to the end of the structure.



Again, this is not a mistake, but if the user is interested in micro-optimizations, or if they are important for the project in principle, such warnings can quickly point out places for at least initial refactoring.



In general, all the code is written neatly and legibly, but the V807 diagnostic pointed to a couple of places that could be made a little more optimal and readable. Let me give the most striking example:



V807 Decreased performance. Consider creating a reference to avoid using the same expression repeatedly. GradientBoostFullTreeBuilder.cpp 469



(Screenshot: the original code from GradientBoostFullTreeBuilder.cpp repeats the curLevelStatistics[i]->ThreadStatistics[j] expression several times.)


The repeated expression curLevelStatistics[i]->ThreadStatistics[j] can be replaced with a separate variable. There are no calls to complex methods in this chain, so there may not be much of an optimization here, but in my opinion the code becomes easier to read and more compact. In addition, when maintaining this code in the future, it will be clearly visible that exactly these indices must be addressed, with no room for error. For clarity, here is the code with the expression replaced by a variable:



const auto& threadStatistics = curLevelStatistics[i]->ThreadStatistics[j];

if(threadStatistics.FeatureIndex != NotFound) {
  if(   threadStatistics.Criterion > criterion
     || ( .... ))
  {
    criterion = threadStatistics.Criterion;
    curLevelStatistics[i]->FeatureIndex    = threadStatistics.FeatureIndex;
    curLevelStatistics[i]->Threshold       = threadStatistics.Threshold;
    curLevelStatistics[i]->LeftStatistics  = threadStatistics.LeftStatistics;
    curLevelStatistics[i]->RightStatistics = threadStatistics.RightStatistics;
  }
}


Conclusion



As you can see, from the point of view of static analysis, the code base of this framework turned out to be very clean.



It should be understood that a single run of the analyzer on an actively developed project poorly demonstrates the benefits of static analysis, since many errors, especially critical ones, have already been found by other means that are much more time-consuming and resource-intensive. This point is discussed in more detail in the article "Errors that static code analysis does not find because it is not used".



But even with this fact in mind, few warnings were issued for NeoML, and I want to express my respect for the quality of the code in this project, regardless of whether the developers used static analysis or not.





If you want to share this article with an English-speaking audience, please use the link to the translation: Victoria Khanieva. PVS-Studio Impressed by the Code Quality of ABBYY NeoML.


