CodeQL workshop for Java: Unsafe deserialization in Apache Struts


Problem statement

Serialization is the process of converting in-memory objects to text or binary output formats, usually to share or save program state. This serialized data can later be loaded back into memory through the process of deserialization.


In languages such as Java, Python and Ruby, deserialization can restore not only primitive data but also complex types such as library and user-defined classes. This provides great power and flexibility, but introduces a significant attack vector if untrusted user data is deserialized without restriction.
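
To make the round trip concrete, here is a minimal sketch using Java's built-in object serialization (not XStream). The `Session` class and the method names are hypothetical and exist only for this illustration. It shows an unrestricted read, which will instantiate whatever class the byte stream names, next to a read restricted by an `ObjectInputFilter` pattern (available from Java 9).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputFilter;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerializationRoundTrip {
    /** Hypothetical user-defined type, used only for this illustration. */
    static final class Session implements Serializable {
        private static final long serialVersionUID = 1L;
        final String user;
        final int visits;

        Session(String user, int visits) {
            this.user = user;
            this.visits = visits;
        }
    }

    /** Serialization: in-memory object to bytes. */
    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /** Deserialization without restriction: the stream decides which classes are created. */
    static Object deserializeUnrestricted(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    /** Deserialization restricted by a filter pattern such as "com.example.*;!*". */
    static Object deserializeFiltered(byte[] data, String pattern)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            in.setObjectInputFilter(ObjectInputFilter.Config.createFilter(pattern));
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = serialize(new Session("alice", 3));

        Session restored = (Session) deserializeUnrestricted(data);
        System.out.println(restored.user + " visited " + restored.visits + " times");

        try {
            deserializeFiltered(data, "!*"); // a pattern that rejects every class
        } catch (InvalidClassException rejected) {
            System.out.println("rejected by filter: " + rejected.getMessage());
        }
    }
}
```

The filtered variant is only one possible restriction. The point is that the unrestricted variant places no limit on which classes untrusted input can cause to be created.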


Apache Struts is a popular open-source MVC framework for creating web applications in Java. In 2017, a researcher from the predecessor of the GitHub Security Lab found CVE-2017-9805, an XML deserialization vulnerability in Apache Struts that would allow remote code execution.

The problem arose because Apache Struts can accept requests in multiple different formats, or content types. It provides a pluggable system for supporting these content types through the ContentTypeHandler interface, which declares the following method:

    /**
     * Populates an object using data from the input stream
     * @param in The input stream, usually the body of the request
     * @param target The target, usually the action class
     * @throws IOException If unable to write to the output stream
     */
    void toObject(Reader in, Object target) throws IOException;

New content type handlers are defined by implementing the interface and defining a toObject method which takes data in the specified content type (in the form of a Reader) and uses it to populate the Java object target, often via a deserialization routine. However, the in parameter is typically populated from the body of a request without sanitization or safety checks. This means it should be treated as “untrusted” user data, and only deserialized under certain safe conditions.
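
As a rough sketch of what such a handler can look like, the example below defines a simplified stand-in interface, `SimpleContentTypeHandler`, that models only the `toObject` method, and a hypothetical `KeyValueHandler` for an invented "key=value per line" content type. Neither name comes from Struts. This handler copies plain strings into a `Map` target and never instantiates a type named by the input, which is one way of keeping the untrusted `in` data harmless.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

final class KeyValueHandlerSketch {
    /** Simplified stand-in for the Struts interface; only toObject is modeled. */
    interface SimpleContentTypeHandler {
        void toObject(Reader in, Object target) throws IOException;
    }

    /** Hypothetical handler for an invented "key=value per line" content type. */
    static final class KeyValueHandler implements SimpleContentTypeHandler {
        @Override
        @SuppressWarnings("unchecked")
        public void toObject(Reader in, Object target) throws IOException {
            if (!(target instanceof Map)) {
                throw new IllegalArgumentException("target must be a Map");
            }
            Map<String, String> fields = (Map<String, String>) target;
            BufferedReader lines = new BufferedReader(in);
            String line;
            while ((line = lines.readLine()) != null) {
                int eq = line.indexOf('=');
                if (eq <= 0) {
                    continue; // skip lines without a key, including empty lines
                }
                // Only strings are copied; the request body never selects a class to create.
                fields.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> target = new LinkedHashMap<>();
        new KeyValueHandler().toObject(new StringReader("name=alice\nrole=admin\n"), target);
        System.out.println(target);
    }
}
```

A handler that instead passed `in` straight to a general-purpose deserializer would let the request body decide which objects get created. That is the pattern this workshop looks for.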


In this workshop, we will write a query to find CVE-2017-9805 in a database built from the known vulnerable version of Apache Struts.


Setup instructions for Visual Studio Code

To take part in the workshop you will need to follow these steps to set up the CodeQL development environment:

  1. Install the Visual Studio Code IDE.
  2. Download and install the CodeQL extension for Visual Studio Code. Full setup instructions are here.
  3. Set up the starter workspace.
    • **Important**: Don’t forget to git clone --recursive or git submodule update --init --remote, so that you obtain the standard query libraries.
  4. Open the starter workspace: File > Open Workspace > Browse to vscode-codeql-starter/vscode-codeql-starter.code-workspace.
  5. Download and unzip the database.
  6. Choose this database in CodeQL (using Ctrl + Shift + P to open the command palette, then selecting “CodeQL: Choose Database”).
  7. Create a new file in the codeql-custom-queries-java directory called UnsafeDeserialization.ql.

The workshop is split into several steps. You can write one query per step, or work with a single query that you refine at each step. Each step has a hint that describes useful classes and predicates in the CodeQL standard libraries for Java. You can explore these in your IDE using the autocomplete suggestions (Ctrl + Space) and the jump-to-definition command (F12).


Section 1: Finding XML deserialization

XStream is a Java framework, used by Apache Struts, for serializing Java objects to XML. It provides a method XStream.fromXML for deserializing XML to a Java object. By default, the input is not validated in any way, which leaves it vulnerable to remote code execution exploits. In this section, we will identify calls to fromXML in the codebase.

  1. Find all method calls in the program.

    Hint

    - A method call is represented by the `MethodAccess` type in the CodeQL Java library.

    Solution

    ```ql
    import java

    from MethodAccess call
    select call
    ```

  2. Update your query to report the method being called by each method call.

    Hints

    - Add a CodeQL variable called `method` with type `Method`.
    - `MethodAccess` has a predicate called `getMethod()` for returning the method.
    - Add a `where` clause.

    Solution

    ```ql
    import java

    from MethodAccess call, Method method
    where call.getMethod() = method
    select call, method
    ```

  3. Find all calls in the program to methods called fromXML.

    Hint

    - `Method.getName()` returns a string representing the name of the method.

    Solution

    ```ql
    import java

    from MethodAccess fromXML, Method method
    where
      fromXML.getMethod() = method and
      method.getName() = "fromXML"
    select fromXML
    ```

    However, as we now want to report only the call itself, we can inline the temporary `method` variable like so:

    ```ql
    import java

    from MethodAccess fromXML
    where fromXML.getMethod().getName() = "fromXML"
    select fromXML
    ```

  4. The XStream.fromXML method deserializes the first argument (i.e. the argument at index 0). Update your query to report the deserialized argument.

    Hint

    - `MethodAccess.getArgument(int i)` returns the argument at the i-th index.
    - The arguments are _expressions_ in the program, represented by the CodeQL class `Expr`. Introduce a new variable to hold the argument expression.

    Solution

    ```ql
    import java

    from MethodAccess fromXML, Expr arg
    where
      fromXML.getMethod().getName() = "fromXML" and
      arg = fromXML.getArgument(0)
    select fromXML, arg
    ```

  5. Recall that predicates allow you to encapsulate logical conditions in a reusable format. Convert your previous query to a predicate which identifies the set of expressions in the program which are deserialized directly by fromXML. You can use the following template:

    predicate isXMLDeserialized(Expr arg) {
      exists(MethodAccess fromXML |
        // TODO fill me in
      )
    }

    exists is a mechanism for introducing temporary variables with a restricted scope. You can think of an exists as its own from-where-select. In this case, we use it to introduce the fromXML temporary variable, with type MethodAccess.

    Hint

    - Copy the `where` clause of the previous query.

    Solution

    ```ql
    import java

    predicate isXMLDeserialized(Expr arg) {
      exists(MethodAccess fromXML |
        fromXML.getMethod().getName() = "fromXML" and
        arg = fromXML.getArgument(0)
      )
    }

    from Expr arg
    where isXMLDeserialized(arg)
    select arg
    ```

Section 2: Find the implementations of the toObject method from ContentTypeHandler

Like predicates, classes in CodeQL can be used to encapsulate reusable portions of logic. Classes represent single sets of values, and they can also include operations (known as member predicates) specific to that set of values. You have already seen numerous instances of CodeQL classes (MethodAccess, Method etc.) and associated member predicates (MethodAccess.getMethod(), Method.getName(), etc.).


  1. Create a CodeQL class called ContentTypeHandler to find the ContentTypeHandler interface. You can use this template:

    class ContentTypeHandler extends RefType {
      ContentTypeHandler() {
          // TODO Fill me in
      }
    }

    Hint

    - Use `RefType.hasQualifiedName(string packageName, string className)` to identify classes with the given package name and class name. For example:

      ```ql
      from RefType r
      where r.hasQualifiedName("java.lang", "String")
      select r
      ```

    - Within the characteristic predicate you can use the magic variable `this` to refer to the RefType.

    Solution

    ```ql
    import java

    /** The interface `ContentTypeHandler`. */
    class ContentTypeHandler extends RefType {
      ContentTypeHandler() {
        this.hasQualifiedName("", "ContentTypeHandler")
      }
    }
    ```

  2. Create a CodeQL class called ContentTypeHandlerToObject for identifying Methods called toObject on classes whose direct super-types include ContentTypeHandler.

    Hint

    - Use `Method.getName()` to identify the name of the method.
    - To identify whether the method is declared on a class whose direct super-type includes `ContentTypeHandler`, you will need to:
      - Identify the declaring type of the method using `Method.getDeclaringType()`.
      - Identify the super-types of that type using `RefType.getASupertype()`.
      - Use `instanceof` to assert that one of the super-types is a `ContentTypeHandler`.

    Solution

    ```ql
    /** A `toObject` method on a subtype of `ContentTypeHandler`. */
    class ContentTypeHandlerToObject extends Method {
      ContentTypeHandlerToObject() {
        this.getDeclaringType().getASupertype() instanceof ContentTypeHandler and
        this.hasName("toObject")
      }
    }
    ```

  3. toObject methods should consider the first parameter as untrusted user input. Write a query to find the first (i.e. index 0) parameter for toObject methods.

    Hint

    - Use `Method.getParameter(int index)` to get the parameter at the given index.
    - Create a query with a single CodeQL variable of type `ContentTypeHandlerToObject`.

    Solution

    ```ql
    from ContentTypeHandlerToObject toObjectMethod
    select toObjectMethod.getParameter(0)
    ```

Section 3: Unsafe XML deserialization

We have now identified (a) places in the program which receive untrusted data and (b) places in the program which potentially perform unsafe XML deserialization. We now want to tie these two together to ask: does the untrusted data ever flow to the potentially unsafe XML deserialization call?


In program analysis we call this a data flow problem. Data flow helps us answer questions like: does this expression ever hold a value that originates from a particular other place in the program?


We can visualize the data flow problem as one of finding paths through a directed graph, where the nodes of the graph are elements in the program, and the edges represent the flow of data between those elements. If a path exists between two nodes, then data flows from the first to the second.


Consider this example Java method:


    int func(int tainted) {
       int x = tainted;
       if (someCondition) {
         int y = x;
       } else {
         return x;
       }
       return -1;
    }

The data flow graph for this method will look something like this:



This graph represents the flow of data from the tainted parameter. The nodes of the graph represent program elements that have a value, such as function parameters and expressions. The edges of the graph represent flow between these nodes.
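
As a rough illustration of this graph view, the Java sketch below encodes the flow edges of func by hand and answers the question "does data flow from this node to that one?" with a simple worklist search. The node labels (`tainted`, `x`, `y`, `return x`, `return -1`) and the class name are assumptions made for this example. This is not how the CodeQL library represents or computes data flow.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DataFlowGraphSketch {
    /** Hand-written flow edges for func: each key flows directly to the listed nodes. */
    static final Map<String, List<String>> EDGES = Map.of(
        "tainted", List.of("x"),        // int x = tainted;
        "x", List.of("y", "return x")); // int y = x;  and  return x;

    /** Returns true if a path of flow edges leads from source to sink. */
    static boolean flowsTo(String source, String sink) {
        Deque<String> worklist = new ArrayDeque<>(List.of(source));
        Set<String> visited = new HashSet<>();
        while (!worklist.isEmpty()) {
            String node = worklist.pop();
            if (node.equals(sink)) {
                return true;
            }
            if (visited.add(node)) { // expand each node at most once
                worklist.addAll(EDGES.getOrDefault(node, List.of()));
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(flowsTo("tainted", "return x"));  // reachable via x
        System.out.println(flowsTo("tainted", "return -1")); // the constant never receives tainted data
    }
}
```

In this encoding the constant returned by `return -1` has no incoming edge, so no value originating from `tainted` can reach it.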


CodeQL for Java provides data flow analysis as part of the standard library, which you import at the top of your query. The library models nodes using the DataFlow::Node CodeQL class. These nodes are separate and distinct from the AST (Abstract Syntax Tree, which represents the basic structure of the program) nodes, to allow for flexibility in how data flow is modeled.

There are a small number of data flow node types – expression nodes and parameter nodes are most common.


In this section we will create a data flow query by populating this template:


    /**
     * @name Unsafe XML deserialization
     * @kind problem
     * @id java/unsafe-deserialization
     */
    import java
    import semmle.code.java.dataflow.DataFlow

    // TODO add previous class and predicate definitions here

    class StrutsUnsafeDeserializationConfig extends DataFlow::Configuration {
      StrutsUnsafeDeserializationConfig() { this = "StrutsUnsafeDeserializationConfig" }
      override predicate isSource(DataFlow::Node source) {
        exists(/** TODO fill me in **/ |
          source.asParameter() = /** TODO fill me in **/
        )
      }
      override predicate isSink(DataFlow::Node sink) {
        exists(/** TODO fill me in **/ |
          /** TODO fill me in **/
          sink.asExpr() = /** TODO fill me in **/
        )
      }
    }

    from StrutsUnsafeDeserializationConfig config, DataFlow::Node source, DataFlow::Node sink
    where config.hasFlow(source, sink)
    select sink, "Unsafe XML deserialization"

  1. Complete the isSource predicate using the query you wrote for Section 2.

    Hint

    - You can translate from a query clause to a predicate by:
      - Converting the variable declarations in the `from` part to the variable declarations of an `exists`.
      - Placing the `where` clause conditions (if any) in the body of the `exists`.
      - Adding a condition which equates the `select` to one of the parameters of the predicate.
    - Remember to include the `ContentTypeHandlerToObject` class you defined earlier.

    Solution

    ```ql
    override predicate isSource(DataFlow::Node source) {
      exists(ContentTypeHandlerToObject toObjectMethod |
        source.asParameter() = toObjectMethod.getParameter(0)
      )
    }
    ```

  2. Complete the isSink predicate by using the final query you wrote for Section 1. Remember to use the isXMLDeserialized predicate!

    Hint

    - Complete the same process as above.

    Solution

    ```ql
    override predicate isSink(DataFlow::Node sink) {
      exists(Expr arg |
        isXMLDeserialized(arg) and
        sink.asExpr() = arg
      )
    }
    ```

You can now run the completed query. You should find exactly one result, which is the CVE reported by our security researchers in 2017!


For this result, it is easy to verify that it is correct, because both the source and sink are in the same method. However, for many data flow problems this is not the case.


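We can update the query so that it reports not only the sink, but also the source and the path from that source. To do this, we convert the query to a path problem query. There are five parts we will need to change:

  • Change the @kind in the query metadata from problem to path-problem.
  • Add the import DataFlow::PathGraph.
  • Change the types of the source and sink variables from DataFlow::Node to DataFlow::PathNode.
  • Use hasFlowPath instead of hasFlow.
  • Change the select clause to report the source and sink as the second and third columns.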


  1. Convert your previous query to a path-problem query.

    Solution

    ```ql
    /**
     * @name Unsafe XML deserialization
     * @kind path-problem
     * @id java/unsafe-deserialization
     */

    import java
    import semmle.code.java.dataflow.DataFlow
    import DataFlow::PathGraph

    predicate isXMLDeserialized(Expr arg) {
      exists(MethodAccess fromXML |
        fromXML.getMethod().getName() = "fromXML" and
        arg = fromXML.getArgument(0)
      )
    }

    /** The interface `ContentTypeHandler`. */
    class ContentTypeHandler extends RefType {
      ContentTypeHandler() {
        this.hasQualifiedName("", "ContentTypeHandler")
      }
    }

    /** A `toObject` method on a subtype of `ContentTypeHandler`. */
    class ContentTypeHandlerToObject extends Method {
      ContentTypeHandlerToObject() {
        this.getDeclaringType().getASupertype() instanceof ContentTypeHandler and
        this.hasName("toObject")
      }
    }

    class StrutsUnsafeDeserializationConfig extends DataFlow::Configuration {
      StrutsUnsafeDeserializationConfig() { this = "StrutsUnsafeDeserializationConfig" }

      override predicate isSource(DataFlow::Node source) {
        exists(ContentTypeHandlerToObject toObjectMethod |
          source.asParameter() = toObjectMethod.getParameter(0)
        )
      }

      override predicate isSink(DataFlow::Node sink) {
        exists(Expr arg |
          isXMLDeserialized(arg) and
          sink.asExpr() = arg
        )
      }
    }

    from StrutsUnsafeDeserializationConfig config, DataFlow::PathNode source, DataFlow::PathNode sink
    where config.hasFlowPath(source, sink)
    select sink, source, sink, "Unsafe XML deserialization"
    ```
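
To see what a path result adds over a plain yes/no answer, the following Java sketch runs a breadth-first search over a small hand-written flow graph and returns the sequence of nodes from source to sink. The graph, its node labels and the `findPath` helper are all hypothetical. The sketch only illustrates the idea of reporting a path and does not reflect how CodeQL computes one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FlowPathSketch {
    /** Returns the nodes on a shortest path from source to sink, or an empty list if none exists. */
    static List<String> findPath(Map<String, List<String>> edges, String source, String sink) {
        Map<String, String> predecessor = new HashMap<>();
        predecessor.put(source, null); // the source has no predecessor
        Deque<String> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            if (node.equals(sink)) {
                // Walk the predecessor links back to the source, then reverse.
                List<String> path = new ArrayList<>();
                for (String step = sink; step != null; step = predecessor.get(step)) {
                    path.add(step);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : edges.getOrDefault(node, List.of())) {
                if (!predecessor.containsKey(next)) { // first visit only
                    predecessor.put(next, node);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        // Hypothetical flow: a request body is wrapped before it reaches the deserializer.
        Map<String, List<String>> edges = Map.of(
            "in", List.of("buffered", "logged"),
            "buffered", List.of("fromXML argument"));
        System.out.println(findPath(edges, "in", "fromXML argument"));
    }
}
```

When source and sink sit in different methods, it is the intermediate steps of such a path that make a result practical to review.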

For more information on how the vulnerability was identified, you can read the blog disclosing the original problem.


Although we have created a query from scratch to find this problem, it can also be found with one of our default security queries, UnsafeDeserialization.ql. You can see this on a vulnerable copy of Apache Struts that has been analyzed on LGTM.com, our free open source analysis platform.

What’s next?