Role-based Debloating for Web Applications


Previous debloating schemes produce a single debloated copy of the target application that includes the features required by all users. In this work, we built a pipeline to identify clusters of users that interact with similar sets of features and assign them to dynamically generated roles. Next, we produce debloated applications tailored to these roles. As a result, we produce smaller web applications than prior work that are exposed to fewer CVEs. Our tool, named DBLTR, comes with the clustering algorithm to generate roles. It also incorporates a transparent reverse-proxy that identifies successful logins and redirects users to their underlying debloated web applications.

The paper is available at


Source Code Repository

The source code of DBLTR is available at:
The main modules in the repository are as follows:

  • "LIM" includes CLI tools to debloat web applications based on dynamic code-coverage. It also extracts source code features (e.g., class names, function names, etc.) used in the clustering algorithm. The CLI tools are invoked by the Jupyter notebook during debloating.
  • "analysis" includes the Jupyter notebook docker file used for clustering, debloating, and analysis of attack-surface reduction after debloating.
  • "docker" includes the configuration files for the docker images. "docker/reverse-proxy/lua" contains the configuration files to extract successful logins and session cookies.
  • "dockerfiles" includes the content delivery environment. The "bootstrap" container populates the Redis datastore with the user-to-role mappings produced by the clustering algorithm. The "db" container hosts the database of web applications. "redis" holds the mapping of authentication cookies to users and the mapping of users to roles. The "web" containers host the debloated copies of the web applications.
  • "docker-compose.yml" is the main compose file that runs the web applications and the machinery to route users to their respective debloated copies based on their roles. The "reverse-proxy" container uses Lua configuration files to identify logins and populate the Redis datastore.
  • "webapps" directory includes the original and debloated copies of web applications.

Adding New Web Applications
To onboard new web applications, we first need a training phase in which we collect usage traces from users (e.g., line-level code-coverage). Next, we use the analysis docker environment to train our classifier and generate the roles and debloated web applications. We then configure the Lua scripts to identify login attempts on the new web application and extract the session cookie values. Finally, we use the docker-compose file generated by the "analysis" docker environment to host the web applications.

DBLTR Playbook

In this playbook, we go over the steps for debloating and serving a web application using DBLTR. At a high level, we first use the Less is More platform to generate a baseline usage profile of the web application's users in the form of line-coverage logs. Next, we import the code-coverage data into the DBLTR Jupyter notebook and run the clustering to group users with similar behavior together under the same role (i.e., cluster). Finally, we deploy the docker-compose environment with the configurations produced by DBLTR to serve the debloated web applications to the users.

LIM Setup

Less is More can be set up using the following guide: More details and a playbook are available at: Once this step is done, we export the code coverage of each user into CSV format. The provided script can help automate this process. For this demonstration, Less is More is hosted under the LIM/training directory for debloating phpMyAdmin. In this setup, we have 5 users (Alice, Bob, Charlie, David, and Eli) who perform minimal actions on phpMyAdmin (kept minimal for demonstration purposes). Alice creates a database and inserts some rows of data; Bob does the same but also views the list of users; Charlie views the existing databases without making any changes; David views databases and runs manual queries; and finally, Eli only views various phpMyAdmin parameters. After exporting the code-coverage data from LIM, we use the provided Python script ( on a system with mysql-server installed to convert the database backup to CSV files for DBLTR.

  • files.csv: lists the "filename" of each covered file.
  • lines.csv: is in "filename, line_number" format.
The output CSV files generated by LIM are available under LIM/training/users/.
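As an illustration of the expected format, the per-user lines.csv can be loaded into a coverage set with a few lines of Python. The sample data and the helper name below are hypothetical; the real LIM export may use slightly different column headers.

```python
import csv
import io

# Hypothetical sample in the "filename, line_number" format described above;
# real paths and headers in the LIM export may differ.
sample_lines_csv = """filename,line_number
libraries/classes/Database.php,42
libraries/classes/Database.php,43
index.php,10
"""

def load_line_coverage(fp):
    """Parse a lines.csv stream into a set of (filename, line_number) pairs."""
    reader = csv.DictReader(fp)
    return {(row["filename"], int(row["line_number"])) for row in reader}

coverage = load_line_coverage(io.StringIO(sample_lines_csv))
```

A set representation like this makes the later similarity computations between users straightforward set operations.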

Generating Roles and Debloating Web Applications

Now we switch our focus to the Jupyter notebook "rbd_dataanalysis". This notebook is hosted under the analysis directory and can be set up using the provided docker-compose environment by running docker compose up -d and then navigating to http://localhost:8888/lab/tree/work/rbd_dataanalysis.ipynb. The token to access this notebook is set in the docker-compose env variable and is currently "jupytersecrettokenabhsyd68ay". We can then follow the cells in the notebook. Certain steps can take a long time, from 30 minutes to a couple of hours, to complete on large applications with many users. We have therefore also provided the output of the lengthy steps in the form of Python pickled objects. At the end of each section, the pickle files are restored; this is an alternative to running the individual cells in that section.

Jupyter notebook sections

For the sections where a pickle file is available, you can jump to the end of the section and quickly restore the data from the pickle file. For new web applications outside our dataset, the whole process needs to be run instead of restoring pickle files.
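The restore-or-recompute pattern used by the notebook can be sketched as follows. This is a minimal illustration, not the notebook's actual code; the function and file names are made up for the example.

```python
import os
import pickle
import tempfile

def restore_or_compute(pickle_path, compute_fn):
    """Load a previously pickled result if present; otherwise run the slow
    computation and cache its result for future runs."""
    if os.path.exists(pickle_path):
        with open(pickle_path, "rb") as fh:
            return pickle.load(fh)
    result = compute_fn()
    with open(pickle_path, "wb") as fh:
        pickle.dump(result, fh)
    return result

# Trivial stand-in for an expensive clustering/debloating step:
path = os.path.join(tempfile.mkdtemp(), "demo_step.pkl")
value = restore_or_compute(path, lambda: sum(range(10)))
```

For applications in our dataset the pickle files ship with the repository, so the load branch is taken; for new applications the compute branch runs once and caches the result.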

  • Lib Imports: Prepares the packages required for the debloating and analysis of the results.
  • Import CSV Files [Pickle file available]: Imports the CSV files containing the file- and line-coverage information of web application users.
  • Add Source Code Features [Pickle file available]: Extracts features from the code-coverage data that are used to identify similar usage patterns during clustering. This includes the files, functions, classes, and namespaces seen in each user's code coverage.
  • Clustering [Pickle file available]: We use the spectral clustering algorithm in combination with the Jaccard similarity metric to perform the clustering.
  • Evaluate Clusters: This step compares the debloating results for various cluster counts to identify the optimal number of clusters (i.e., roles) for each web application. The slope of the lines plotting the number of remaining functions after debloating against the total number of roles can be used to choose the number of roles. We want the minimal number of roles that provides the best debloating; this is also referred to as the elbow method.
  • Optimal Cluster Size: Includes the number of roles determined by the previous step. In our example, 6 and 7 roles for phpMyAdmin and WordPress, respectively.
  • Generate Artifacts [Pickle file available]: The output of clustering is the set of roles and the mapping of users to roles. Based on this information, we merge the code coverage of the users assigned to each role, debloat a copy of the web application specific to each role, and generate the docker-compose file to serve these applications. Finally, we provide the user-to-role mapping to our reverse-proxy to route user traffic towards the correspondingly debloated web applications.
  • Generate Docker Environment Files: This step generates the docker files. This is the last step required to produce debloated web applications. We can now use the provided user-to-role mappings and docker-compose configuration to serve the web applications.
  • Attack Surface Reduction Analysis: This step is optional and can be used to extract and analyze information about the reduced lines of code, removed CVEs, and gadget chains after debloating.
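The similarity computation underlying the Clustering step can be illustrated with a small pure-Python sketch. The per-user sets below are hypothetical (line coverage abbreviated to integers); the notebook feeds a matrix like this to spectral clustering as a precomputed affinity.

```python
def jaccard(a, b):
    """Jaccard similarity between two coverage sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical per-user line-coverage sets, abbreviated to integers
# for readability (real entries would be (filename, line) pairs).
users = {
    "alice":   {1, 2, 3, 4},
    "bob":     {1, 2, 3, 4, 9},   # Alice's actions plus viewing the user list
    "charlie": {1, 7, 8},
}

names = sorted(users)
similarity = [[jaccard(users[r], users[c]) for c in names] for r in names]
# This pairwise similarity matrix is the kind of precomputed affinity
# that spectral clustering consumes to group users into roles.
```

Users with heavily overlapping coverage (Alice and Bob) score close to 1.0 and tend to land in the same role, while users exercising different features (Charlie) score low and form separate roles.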

Serving the debloated web applications

To serve the debloated web applications, we use the generated mappings.txt configuration file, which includes the user-to-role mappings, along with the docker-compose.yaml in the root of this repository to host the DBLTR setup. The web applications will be served under localhost:8080. Upon login, each user's authentication cookie is extracted by our OpenResty Lua modules and stored in the Redis datastore. Subsequent requests containing the authentication cookie instruct the reverse-proxy to transparently route the user towards their custom debloated web application. Responses from DBLTR include an "active_proxy" HTTP header indicating which backend served the request.
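The routing decision performed by the Lua modules against Redis can be sketched in Python with plain dictionaries standing in for the two Redis mappings. The cookie value, key names, and backend naming scheme below are illustrative assumptions, not the actual values used by DBLTR.

```python
# Plain dicts stand in for the two Redis mappings: authentication
# cookies -> users, and users -> roles. All values here are made up.
cookie_to_user = {"phpMyAdmin_sess=abc123": "alice"}
user_to_role = {"alice": "role_2"}

def pick_backend(session_cookie, default="web_full"):
    """Resolve a session cookie to the debloated backend for that user's role.
    Unrecognized cookies fall back to a default backend (an assumption here;
    the real proxy's fallback behavior may differ)."""
    user = cookie_to_user.get(session_cookie)
    if user is None:
        return default
    role = user_to_role.get(user, "role_default")
    return f"web_{role}"   # e.g., the "web" container serving that role's copy

backend = pick_backend("phpMyAdmin_sess=abc123")
```

In the real deployment this lookup happens inside OpenResty on every request, which is why it must be a constant-time Redis access rather than anything per-application.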

Demo of DBLTR protecting users against CVE-2019-12616:

Adding new web applications to DBLTR

  1. Set up the web application under a LIM-like setup to collect code-coverage data from web application users for a period of time.
  2. Import the code-coverage data into the debloating pipeline (Jupyter notebook) to produce the debloating roles.
  3. Create the user authentication detection logic as a new OpenResty Lua module.
  4. Use the provided configuration to host the debloated web applications.

OpenResty Authentication Detection Lua Module

The files for this module are located under docker/reverse-proxy/lua/. The skeleton of this code is available in common.lua, along with application-specific files under pma (phpMyAdmin login detection) and wp (WordPress login detection). The default.conf file, an Nginx/OpenResty configuration file, is used to activate the Lua module. At a high level:


We are a team of security researchers at PragSec Lab, Stony Brook University (
For any queries or questions, contact Babak Amin Azad at [email protected]