Microsoft Open XML SDK Code Quality Study

image1.png


My acquaintance with the Open XML SDK started when I needed a library for creating Word documents with some reporting. After working with the Word API for over 7 years, I wanted to try something new and more convenient. This is how I found out that Microsoft has an alternative solution. Traditionally, we pre-check programs and libraries used in the company using the PVS-Studio analyzer.



Introduction



Office Open XML, also known as OpenXML or OOXML, is an XML-based format for office documents, including word processing documents, spreadsheets, presentations, and diagrams, shapes, and other graphics. The specification was developed by Microsoft and adopted by ECMA International in 2006. In June 2014, Microsoft released the Open XML SDK in open source. The source is now available on GitHub under the MIT license.



To find errors in the source code of the library, we used PVS-Studio . It is a tool for detecting bugs and potential vulnerabilities in the source code of programs written in C, C ++, C # and Java. Works in 64-bit systems on Windows, Linux and macOS.



The project is small enough and there were few warnings. But the choice of the title picture was based precisely on the results. There are a lot of useless conditional operators in the code. It seems to me that if you refactor all such places in the code, then the volume will be noticeably reduced. As a result, the readability of the code will also increase.



Why Word API and not Open XML SDK



As you might have guessed from the title, I continued to use the Word API. This method has a lot of disadvantages:



  • Old awkward API;
  • Microsoft Office must be installed;
  • The need to distribute a distribution kit with Office libraries;
  • Dependence of the Word API on the system locale settings;
  • Low speed of work.


With the locale in general, an amusing incident happened. Windows has a dozen regional settings. On one of the server computers there was a mess from the USA and UK locales. Because of this, Word documents were created, where instead of the dollar symbol there was a ruble, and pounds were not displayed at all. Improving the operating system settings fixed the problem.



Listing all this, I wondered again why I still use it ...



But no, I still like the Word API better, and here's why.



OOXML looks like this:



<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document ....>
  <w:body>
    <w:p w:rsidR="00E22EB6"
         w:rsidRDefault="00E22EB6">
      <w:r>
        <w:t>This is a paragraph.</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="00E22EB6"
         w:rsidRDefault="00E22EB6">
      <w:r>
        <w:t>This is another paragraph.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>


Where <w: r> (Word Run) is not a sentence, and not even a word, but any piece of text that has attributes that differ from the adjacent text fragments.



It is programmed with something like this:



Paragraph para = body.AppendChild(new Paragraph());
Run run = para.AppendChild(new Run());
run.AppendChild(new Text(txt));


The document has a specific internal structure, and in the code you need to create the same elements. The Open XML SDK, in my opinion, does not have enough abstract data access layer. Creating a document using the Word API will be clearer and shorter. Especially when it comes to tables and other complex data structures.



In turn, the Open XML SDK solves a wide range of tasks. With it, you can create documents not only for Word, but also for Excel and PowerPoint. This library is probably more suitable for some tasks, but I decided to stay on the Word API for now. In any case, it will not be possible to completely abandon it, because for internal needs, we are developing a plugin for Word, and there it is possible to use only the Word API.



Two values ​​for string



V3008 The '_rawOuterXml' variable is assigned values ​​twice successively. Perhaps this is a mistake. Check lines: 164, 161. OpenXmlElement.cs 164



internal string RawOuterXml
{
    get => _rawOuterXml;

    set
    {
        if (string.IsNullOrEmpty(value))
        {
            _rawOuterXml = string.Empty;
        }

        _rawOuterXml = value;
    }
}


The string type can have 2 types of values: null and text value. It is definitely safer to use textual meaning, but both approaches are valid. In this project , it is unacceptable to use the null value and it is rewritten to string.Empty ... at least that's how it was intended. But due to an error in RawOuterXml , you can still write null , and then refer to this field, receiving a NullReferenceException .



V3022 Expression 'namespaceUri! = Null' is always true. OpenXmlElement.cs 497



public OpenXmlAttribute GetAttribute(string localName, string namespaceUri)
{
    ....
    if (namespaceUri == null)
    {
        // treat null string as empty.
        namespaceUri = string.Empty;
    }
    ....
    if (HasAttributes)
    {
        if (namespaceUri != null)  // <=
        {
            ....
        }
        ....
    }
    ....
}


This snippet uses the same approach, the author of the code didn't make any major mistake, but it still smells like a failed refactoring. Most likely, one check can be removed here, which will reduce the width of the code, and, therefore, increase its readability.



About the compactness of the code



image2.png


V3009 It's odd that this method always returns one and the same value of '".xml"'. CustomXmlPartTypeInfo.cs 31



internal static string GetTargetExtension(CustomXmlPartType partType)
{
    switch (partType)
    {
        case CustomXmlPartType.AdditionalCharacteristics:
            return ".xml";

        case CustomXmlPartType.Bibliography:
            return ".xml";

        case CustomXmlPartType.CustomXml:
            return ".xml";

        case CustomXmlPartType.InkContent:
            return ".xml";

        default:
            return ".xml";
    }
}


I don’t know if there is any typo here, or if the author of the code wrote "nice" code in his opinion. I am sure that there is no point in returning so many values ​​of the same type from a function, and the code can be greatly reduced.



This is not the only such place. Here are a couple more of these warnings:



  • V3009 It's odd that this method always returns one and the same value of '".xml"'. CustomPropertyPartTypeInfo.cs 25
  • V3009 It's odd that this method always returns one and the same value of '".bin"'. EmbeddedControlPersistenceBinaryDataPartTypeInfo.cs 22


It would be interesting to hear why write like that.



V3139 Two or more case-branches perform the same actions. OpenXmlPartReader.cs 560



private void InnerSkip()
{
    Debug.Assert(_xmlReader != null);

    switch (_elementState)
    {
        case ElementState.Null:
            ThrowIfNull();
            break;

        case ElementState.EOF:
            return;

        case ElementState.Start:
            _xmlReader.Skip();
            _elementStack.Pop();
            GetElementInformation();
            return;

        case ElementState.End:
        case ElementState.MiscNode:
            // cursor is end element, pop stack
            _xmlReader.Skip();
            _elementStack.Pop();
            GetElementInformation();
            return;
        ....
    }
    ....
}


There are fewer questions about this code. Most likely, identical cases can be combined and the code will become shorter and more obvious.



A few more such places:



  • V3139 Two or more case-branches perform the same actions. OpenXmlMiscNode.cs 312
  • V3139 Two or more case-branches perform the same actions. CustomPropertyPartTypeInfo.cs 30
  • V3139 Two or more case-branches perform the same actions. CustomXmlPartTypeInfo.cs 15
  • V3139 Two or more case-branches perform the same actions. OpenXmlElement.cs 1803


Those Always true / false



Now is the time to provide some code examples that determined my choice of title image.



Warning 1



V3022 Expression 'Complete ()' is always false. ParticleCollection.cs 243



private bool IsComplete => Current is null ||
                           Current == _collection._element.FirstChild;

public bool MoveNext()
{
    ....
    if (IsComplete)
    {
        return Complete();
    }

    if (....)
    {
        return Complete();
    }

    return IsComplete ? Complete() : true;
}


The IsComplete property is used 2 times, and it is easy to understand from the code that its value will not change. Thus, at the end of the function, you can simply return the second value of the ternary operator - true .



Warning 2



V3022 Expression '_elementStack.Count> 0' is always true. OpenXmlDomReader.cs 501



private readonly Stack<OpenXmlElement> _elementStack;

private bool MoveToNextSibling()
{
    ....
    if (_elementStack.Count == 0)
    {
        _elementState = ElementState.EOF;
        return false;
    }
    ....
    if (_elementStack.Count > 0) // <=
    {
        _elementState = ElementState.End;
    }
    else
    {
        // no more element, EOF
        _elementState = ElementState.EOF;
    }
    ....
}


Obviously, if there are not 0 elements in the _elementStack , then there are more of them. The code can be shortened by at least 8 lines.



Warning 3



V3022 Expression 'rootElement == null' is always false. OpenXmlPartReader.cs 746



private static OpenXmlElement CreateElement(string namespaceUri, string name)
{
    if (string.IsNullOrEmpty(name))
    {
        throw new ArgumentException(....);
    }

    if (NamespaceIdMap.TryGetNamespaceId(namespaceUri, out byte nsId)
        && ElementLookup.Parts.Create(nsId, name) is OpenXmlElement element)
    {
        return element;
    }

    return new OpenXmlUnknownElement();
}

private bool ReadRoot()
{
  ....
  var rootElement = CreateElement(....);

  if (rootElement == null) // <=
  {
      throw new InvalidDataException(....);
  }
  ....
}


The CreateElement function cannot return null . If the company made a rule to write methods for creating xml nodes that either return a valid object or throw an exception, then users of such functions can not abuse additional checks.



Warning 4



V3022 Expression 'nameProvider' is always not null. The operator '?.' is excessive. OpenXmlSimpleTypeExtensions.cs 50



public static XmlQualifiedName GetSimpleTypeQualifiedName(....)
{
    foreach (var validator in validators)
    {
        if (validator is INameProvider nameProvider &&
            nameProvider?.QName is XmlQualifiedName qname) // <=
        {
            return qname;
        }
    }

    return type.GetSimpleTypeQualifiedName();
}


The is operator has the following pattern:



expr is type varname


If expr evaluates to true , then the varname object is valid and does not need to be compared to null again , as is done in this code snippet.



Warning 5



V3022 Expression 'extension == ".xlsx" || extension == ".xlsm" 'is always false. PresentationDocument.cs 246



public static PresentationDocument CreateFromTemplate(string path)
{
    ....
    string extension = Path.GetExtension(path);
    if (extension != ".pptx" && extension != ".pptm" &&
        extension != ".potx" && extension != ".potm")
    {
        throw new ArgumentException("...." + path, nameof(path));
    }

    using (PresentationDocument template = PresentationDocument.Open(....)
    {
        PresentationDocument document = (PresentationDocument)template.Clone();

        if (extension == ".xlsx" || extension == ".xlsm")
        {
            return document;
        }
        ....
    }
    ....
}


An interesting code turned out. First author weed out all the documents with the following extensions are not .pptx , .pptm ,. potx and. potm , and then decided to check that there were no .xlsx and .xlsm among them . The PresentationDocument function is definitely a victim of refactoring.



Warning 7



V3022 Expression 'OpenSettings.MarkupCompatibilityProcessSettings == null' is always false. OpenXmlPackage.cs 661



public MarkupCompatibilityProcessSettings MarkupCompatibilityProcessSettings
{
    get
    {
        if (_mcSettings is null)
        {
            _mcSettings = new MarkupCompatibilityProcessSettings(....);
        }

        return _mcSettings;
    }

    set
    {
        _mcSettings = value;
    }
}

public MarkupCompatibilityProcessSettings MarkupCompatibilityProcessSettings
{
    get
    {
        if (OpenSettings.MarkupCompatibilityProcessSettings == null) // <=
        {
            return new MarkupCompatibilityProcessSettings(....);
        }
        else
        {
            return OpenSettings.MarkupCompatibilityProcessSettings;
        }
    }
}


The MarkupCompatibilityProcessSettings property never returns null . If it turns out in the getter that the class field is null , then the object is overwritten with a new one. Also, note that this is not a recursive call to one property, but properties of the same name from different classes. Perhaps some confusion has led to the writing of unnecessary checks.



Other warnings



Warning 1



V3080 Possible null dereference. Consider inspecting 'previousSibling'. OpenXmlCompositeElement.cs 380



public OpenXmlElement PreviousSibling()
{
    if (!(Parent is OpenXmlCompositeElement parent))
    {
        return null;
    }
    ....
}

public override T InsertBefore<T>(T newChild, OpenXmlElement referenceChild)
{
    ....
    OpenXmlElement previousSibling = nextNode.PreviousSibling();
    prevNode.Next = nextNode;
    previousSibling.Next = prevNode;    // <=
    ....
}


And here is an example where additional verification is just not enough. The PreviousSibling method can return null , and the result of this function is immediately used without checking.



2 more dangerous places:



  • V3080 Possible null dereference. Consider inspecting 'prevNode'. OpenXmlCompositeElement.cs 489
  • V3080 Possible null dereference. Consider inspecting 'prevNode'. OpenXmlCompositeElement.cs 497


Warning 2



V3093 The '&' operator evaluates both operands. Perhaps a short-circuit '&&' operator should be used instead. UniqueAttributeValueConstraint.cs 60



public override ValidationErrorInfo ValidateCore(ValidationContext context)
{
    ....
    foreach (var e in root.Descendants(....))
    {
        if (e != element & e.GetType() == elementType) // <=
        {
            var eValue = e.ParsedState.Attributes[_attribute];

            if (eValue.HasValue && _comparer.Equals(....))
            {
                return true;
            }
        }
    }
    ....
}


Some people like to apply the '&' operator to logical expressions where they shouldn't. In the case of this operator, the second operand is evaluated first, regardless of the result of the first. This is not a very serious error here, but such sloppy code after refactoring can lead to potential NullReferenceException exceptions .



Warning 3



V3097 Possible exception: type marked by [Serializable] contains non-serializable members not marked by [NonSerialized]. OpenXmlPackageValidationEventArgs.cs 15



[Serializable]
[Obsolete(ObsoleteAttributeMessages.ObsoleteV1ValidationFunctionality, false)]
[EditorBrowsable(EditorBrowsableState.Never)]
public sealed class OpenXmlPackageValidationEventArgs : EventArgs
{
    private string _message;

    [NonSerialized]
    private readonly object _sender;

    [NonSerialized]
    private OpenXmlPart _subPart;

    [NonSerialized]
    private OpenXmlPart _part;

    ....

    internal DataPartReferenceRelationship
        DataPartReferenceRelationship { get; set; } // <=
}


Serialization of the OpenXmlPackageValidationEventArgs class can fail because one of the properties was forgotten to be marked as non-serializable. Or it is necessary to modify the return type of this property to be serializable, otherwise an exception may occur at runtime.



Conclusion



We at the company love Microsoft projects and technologies. In the section where we list Open Source projects verified with PVS-Studio, we even allocated a separate section for Microsoft. There have already accumulated 21 projects, about which 26 articles have been written. This is the 27th.



I'm sure you are wondering if there is Microsoft among our customers. The answer is yes! But let's not forget that this is a huge corporation leading development around the world. There are definitely divisions that already use PVS-Studio in their projects, but there are even more of those that do not use! And our experience with open source projects shows that they clearly lack a good tool for finding bugs;).





Also from the news, who are interested in analyzing code in C ++, C # and Java: we have recently added support for the OWASP standard and are actively increasing its coverage.





If you want to share this article with an English-speaking audience, please use the translation link: Svyatoslav Razmyslov. Analyzing the Code Quality of Microsoft's Open XML SDK .



All Articles