nginScript Use Case: Data Masking for User Privacy

Original: https://www.nginx.com/blog/data-masking-user-privacy-nginscript/


The effect of wearing clothing made of a fabric designed to thwart face-recogntion software is similar to data masking in log files
Fabric pattern that confuses facial-recognition software, designed by Adam Harvey

In October 2016, the Court of Justice of the European Union ruled that IP addresses are “personal information” and as such fall under the Data Protection Directive and General Data Protection Regulation (GDPR). For many website owners, this presents challenges for archiving and analyzing log files if the data leaves the EU. For data moving into the US, the EU‑US Privacy Shield provides some protection but faces legal challenges by privacy groups and governments who do not believe the level of protection is adequate.

However, protecting personal data in log files is not just an EU problem. For organizations with security certifications like ISO/ICE 27001, moving log files outside of the security realm where they were generated, say from network ops to marketing, can compromise the scope and compliance of the certification.

In this blog post, we describe some simple solutions for sanitizing NGINX Plus and NGINX log files so that they can be safely exported without exposing what is often called personally identifiable information (PII).

Editor – This is the fourth in a series of blog posts about nginScript. The first post discusses why NGINX, Inc. developed its own implementation of JavaScript, and presents a sample use case. The subsequent posts explore additional use cases:

Simple Approaches Do Not Work

The simplest approach to personal data protection is to strip IP addresses from logs before they are exported. This is easy to achieve with standard Linux command line tools, but log analysis systems may expect log files in a standard format and fail to import logs that omit the IP address field. Even if logs are successfully imported, the value of log processing may be significantly reduced if the analysis system relies on IP addresses to track a user across a site.

Another potential approach, substituting fake or random values for real IP addresses, results in log files that look complete, but the quality of log analysis is compromised because each log entry appears to originate from a different randomly generated IP address.

Masking the Client IP Address

The most effective solution is to use a technique called data masking to transform the real IP address into one that does not identify the end user but still allows correlation of website activity for a particular user. Data masking algorithms always produce the same pseudorandom value for a given input value in a way that ensures it cannot be converted back to the original input value. Every occurrence of an IP address is always transformed to the same pseudorandom value.

You can implement IP address masking in NGINX and NGINX Plus with the nginScript module. nginScript is a unique JavaScript implementation for NGINX and NGINX Plus, designed specifically for server‑side use cases and per‑request processing. In this case we execute a small amount of JavaScript code to mask the client IP address as each request is logged.

Instructions for enabling nginScript with NGINX and NGINX Plus appear at the end of this article.

NGINX and NGINX Plus Configuration for IP Address Masking

The log_format directive controls which information appears in the access logs. NGINX and NGINX Plus ship with a default log format called combined which produces log files that can be processed by most log‑processing tools.

For this configuration we create a new log format, masked, which is identical to the combined format except for the first field, where we replace the $remote_addr variable with $remote_addr_masked. The new variable is evaluated by executing our JavaScript code, which generates a masked version of the client IP address.

To have NGINX and NGINX Plus write access logs in this format we configure a server block with an access_log directive that specifies the masked format.

As we are going to use nginScript to mask the client IP address, we use the js_include directive to specify the location of our JavaScript code. The js_set directive specifies the JavaScript function that will be executed when evaluating the $remote_addr_masked variable.

Notice that we use two access_log directives. The first uses the default log format to produce access logs that can be used by administrators for operational purposes. The second specifies the masked log format. With this configuration we write two access logs for each request – one for sysadmins and DevOps, and one for export.

Finally, the location block defines a very simple response using the return directive to show that data masking is working. In production, this would most likely contain a proxy_pass directive to direct requests to a backend server.

nginScript Code for IP Address Masking

We construct masked IP addresses using three simple JavaScript functions. Dependent functions must appear first so we will discuss them in that order.

The essence of the data masking solution is to use a one‑way hashing algorithm to transform the client IP address. In this example we are using the FNV‑1a hash algorithm, which is compact, fast, and has reasonably good distribution characteristics. Its other advantage is that it returns a positive 32‑bit integer (the same size as an IPv4 address), which makes it trivial to present as an IP address. The fnv32a function is a JavaScript implementation of the FNV‑1a algorithm.

The i2ipv4 function converts a 32‑bit integer to an IPv4 address in quad‑dotted notation. It takes the hashed values from fnv32a() and provides a representation that “looks right” in our access log. Both IPv6 addresses and IPv4 addresses are represented in IPv4 format

Finally, we have the maskRemoteAddress function, which is referenced by the js_set directive in the NGINX and NGINX Plus configuration above. It has a single parameter, req, which is the JavaScript object representing the HTTP request. The remoteAddress property contains the value of the client IP address (equivalent to the $remote_addr variable).

IP Address Masking in Action

With the above configuration in place, we can make a simple request to our server and check the response and the resultant access log entries.

$ curl http://localhost/
127.0.0.1 -> 8.163.209.30
$ sudo tail --lines=1 /var/log/nginx/access*.log
==> /var/log/nginx/access.log <==
127.0.0.1 - - [16/Mar/2017:19:08:19 +0000] "GET / HTTP/1.1" 200 26 "-" "curl/7.47.0"

==> /var/log/nginx/access_masked.log <==
8.163.209.30 - - [16/Mar/2017:19:08:19 +0000] "GET / HTTP/1.1" 200 26 "-" "curl/7.47.0"

Masking Personal Data in the Query String

Log files can also contain personal data other than IP addresses. Email addresses, postal addresses, and other identifiers can be passed as query‑string parameters and consequently be logged as part of the request URI. If your application passes personal data in this way, you can extend the nginScript IP address‑masking solution to sanitize personal data in the query string.

NGINX and NGINX Plus Configuration for Query String Masking

Like the default combined format, the masked log format defined above for IP address masking logs the $request variable, which captures three components of a request: HTTP method, URI (including query string), and HTTP version. We need to mask only the query string, so in the interests of code efficiency we use a separate variable for each of the three components, transforming only the request URI (second component) with the $request_uri_masked variable and using standard variables ($request_method and $server protocol) for the first and third components.

The server block requires another js_set directive to define how the $request_uri_masked variable is evaluated.

nginScript Code for Query String Masking

We add the maskRequestURI JavaScript function to the logmask.js file shown above. Like maskRemoteAddress(), maskRequestURI() depends on the fnv32a hashing function and so appears below it in the file.

The maskRequestURI function iterates through each key‑value pair in the query string, looking for specific keys that are known to contain personal data. For each of these keys, the value is transformed to a masked value.

Depending on the type of processing to be carried out on NGINX and NGINX Plus log files, the masked query string values may need to resemble genuine data. In the example above we have formatted zip to be five digits and email to comply with RFC 821. Other keys may require more sophisticated formatting or dedicated functions to construct them.

Query String Masking in Action

With these additions to our configuration, we can see query string masking in action.

$ curl "http://localhost/index.php?foo=bar&zip=90210&email=liam@nginx.com"
127.0.0.1 -> 8.163.209.30
$ sudo tail --lines=1 /var/log/nginx/access*.log
==> /var/log/nginx/access.log <==
127.0.0.1 - - [16/Mar/2017:20:05:55 +0000] "GET /index.php?foo=bar&zip=90210&email=liam@nginx.com HTTP/1.1" 200 26 "-" "curl/7.47.0"

==> /var/log/nginx/access_masked.log <==
8.163.209.30 - - [16/Mar/2017:20:05:55 +0000] "GET /index.php?foo=bar&zip=38643&email=2852675791@example.com HTTP/1.1" 200 26 "-" "curl/7.47.0"

Summary

NGINX and NGINX Plus with nginScript provide a simple and powerful solution for applying custom logic to request processing. In this article we demonstrated how nginScript can be used to mask personal data so that log files can be written in a way that allows offline analysis without violating data protection requirements.

We would love to hear about how you are using nginScript to solve business problems – please tell us about it in the comments section below.

To try NGINX Plus, start your free 30‑day trial today or contact us for a demo.


Enabling nginScript for NGINX and NGINX Plus

Loading nginScript for NGINX Plus

nginScript is available as a free dynamic module for NGINX Plus subscribers (for open source NGINX, see Loading nginScript for Open Source NGINX below).

  1. Obtain the module itself by installing it from the NGINX Plus repository.

    • For Ubuntu and Debian systems:

      $ sudo apt‑get install nginx-plus-module-njs
    • For RedHat, CentOS, and Oracle Linux systems:

      $ sudo yum install nginx-plus-module-njs
    • Enable the module by including a load_module directive for it in the top‑level ("main") context of the nginx.conf configuration file (not in the http or stream context). This example loads the nginScript modules for both HTTP and TCP/UDP traffic.

      load_module modules/ngx_http_js_module.so;
      load_module modules/ngx_stream_js_module.so;
    • Reload NGINX Plus to load the nginScript modules into the running instance.

      $ sudo nginx -s reload

Loading nginScript for Open Source NGINX

If your system is configured to use the official prebuilt packages for open source NGINX and your installed version is 1.9.11 or later, then you can install nginScript as a prebuilt package for your platform.

  1. Install the prebuilt package.

    • For Ubuntu and Debian systems:

      $ sudo apt-get install nginx-module-njs
    • For RedHat, CentOS, and Oracle Linux systems:

      $ sudo yum install nginx-module-njs
  2. Enable the module by including a load_module directive for it in the top‑level ("main") context of the nginx.conf configuration file (not in the http or stream context). This example loads the nginScript modules for both HTTP and TCP/UDP traffic.

    load_module modules/ngx_http_js_module.so;
    load_module modules/ngx_stream_js_module.so;
  3. Reload NGINX Plus to load the nginScript modules into the running instance.

    $ sudo nginx -s reload

Compiling nginScript as a Dynamic Module for Open Source NGINX

If you prefer to compile an NGINX module from source:

  1. Follow these instructions to build either or both the HTTP and TCP/UDP nginScript modules from the open source repository.
  2. Copy the module binaries (ngx_http_js_module.so, ngx_stream_js_module.so) to the modules subdirectory of the NGINX root (usually /etc/nginx/modules).
  3. Perform Steps 2 and 3 in Loading nginScript for Open Source NGINX.
Retrieved by Nick Shadrin from nginx.com website.