It’s not a secret if you publish it on PyPI
Python packages can contain sensitive information. Here's how software development teams can keep secrets secret.
Modern software development has reached an admirable level of maturity when it comes to productivity. Integrating continuous integration and continuous delivery (CI/CD) pipelines into the development process has made the delivery of software changes to customers more frequent and automated. Nevertheless, software code still has to be written. To speed up development, software engineers often use open source projects to avoid spending time on implementing common functionality.
The benefits of third-party code reuse are obvious, but developers must be aware of the risks associated with it. Among the most notable risks are the presence of malware in third-party dependencies and licensing violations originating from their inappropriate usage. Various studies estimate that open source code makes up between 60 and 80 percent of the code in software projects. Since open source dependencies often have a whole set of their own dependencies, the number of such transitive dependencies can escalate quickly. Software Composition Analysis (SCA) tools like ReversingLabs Secure Software Platform are used to effectively manage licensing and security issues originating from such complex dependency relations.
Security risks introduced by the use of third-party dependencies aren’t the only ones that need to be managed. Mistakes made while creating deployment artifacts can result in leaking sensitive information like access credentials or API tokens. This blog will show a few real-world examples of such sensitive information leaks detected in the PyPI repository. We will also demonstrate how our Secure Software Platform can recognize these security risks and prevent them from reaching the public.
Sensitive information leaks
The first thing that probably comes to mind when you hear “sensitive information” is access credentials for the services you use. But that is not the only information that can be considered sensitive. Many web services, like Amazon Web Services (AWS), manage client usage by means of various access tokens. Such long-term access keys make programmatic use of various APIs easier. The user guide for “AWS Identity and Access Management” clearly states that you should “Manage your access keys as securely as you do your user name and password”. Unfortunately, many developers use such tokens in the wrong way: by hard-coding them into their codebase. When that code gets published to GitHub, or any other public code repository, the hard-coded credentials become visible to everybody.
There are also developers who are aware of such mistakes and handle API credentials carefully. One way to go about it, specifically in git repositories, is to put all access credentials into a configuration file and instruct git to ignore it with the “.gitignore” file. As a result, the excluded configuration file doesn’t get published to the public repository, and credentials remain a secret. But there is another step where such credentials can leak. When developers create release artifacts, they usually do it with a tool integrated into their CI/CD pipeline, turning that action into an automated task. Some of those tools package the entire directory structure, which can often contain the aforementioned configuration files. Even though the developer excluded those files from publishing to the code repository, in the end they get published to a package repository, undoing the previous effort.
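The gap described above can be checked by listing what actually ended up inside a built package. The following is a minimal sketch, not the method used by any particular tool: it assumes the artifact is a wheel or zip archive or a tar-based source distribution, and the file name patterns in SUSPICIOUS_PATTERNS are hypothetical examples that should be replaced with whatever your project excludes in .gitignore.

```python
import fnmatch
import tarfile
import zipfile
from pathlib import PurePosixPath

# Hypothetical deny-list: file names a project would normally keep out of git.
SUSPICIOUS_PATTERNS = ["config.py", "local_settings.py", ".env", "*.pem", "*.conf"]


def list_members(archive_path):
    """Return the file paths stored in a wheel/zip or a tar-based sdist."""
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            return [name for name in zf.namelist() if not name.endswith("/")]
    with tarfile.open(archive_path) as tf:
        return [member.name for member in tf.getmembers() if member.isfile()]


def find_suspicious(archive_path, patterns=SUSPICIOUS_PATTERNS):
    """Return packaged files whose base name matches one of the patterns."""
    hits = []
    for name in list_members(archive_path):
        base = PurePosixPath(name).name
        if any(fnmatch.fnmatch(base, pattern) for pattern in patterns):
            hits.append(name)
    return sorted(hits)
```

Calling find_suspicious() on each file in the build output directory before uploading would flag a packaged config.py even when git itself never tracked it.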
ReversingLabs Secure Software Platform can help detect leakage of more than thirty different sensitive information types, including SSH, SSL and PGP keys, Amazon Web Services, Square and PayPal access credentials, and several other access tokens and API keys. During our research on the PyPI repository, almost 3,000,000 releases in over 300,000 different projects were analyzed in search of sensitive information leaks.
Figure 1: Sensitive data detected in PyPI packages
Figure 1 shows statistics for the detected access tokens. Access tokens for the most popular API services, like the ones from Google and Amazon, were leaked the most often. Even though many detected tokens were generated and used in examples and tests (which was explicitly noted in the code comments), there is still a considerable number of credentials that were leaked accidentally.
Developer who almost got it right
An excellent example of a package where a developer accidentally leaks their AWS access credentials together with some additional credentials is the dynamo_lock PyPI package. The report generated from the Secure Software Platform analysis shows that a file named config.py containing web service access credentials is extracted from the package.
Figure 2: Detected sensitive data leak in the config.py file inside dynamo_lock package
The contents of this config.py file show that, besides the access credentials for Amazon’s DynamoDB database service used in other parts of the project's code, the file contains access credentials for two more services. Those services, Amazon's Simple Queue Service (SQS) and Gigya’s identity management platform, are not used anywhere in the code.
Figure 3: Contents of the “config.py” file
Obviously, none of this information should be public. The project homepage linked in the package description shows that this file was never uploaded to GitHub. The commit history reveals that the project author actually wanted to keep this file private and, in a commit conveniently named “prepare for pypi”, added config.py to the .gitignore file to prevent it from being uploaded.
Figure 4: Commit in which config.py was added to .gitignore
Unfortunately, the author forgot or wasn’t aware that the tool used for creating the PyPI package added all the files from the project directory structure to the package, which was later published to the PyPI repository. To prevent this, a good practice would be to perform an overall software quality analysis of your release artifacts after you build them and before they get published. Secure Software Platform can help you detect such sensitive information leaks, together with other software quality issues, before your artifacts get uploaded to public repositories.
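A very small version of such a pre-publish check can be sketched in a few lines. This is an illustration only and does not represent how Secure Software Platform works: the three regular expressions are commonly used heuristics for AWS access key IDs, Google API keys and PEM private key headers, the rule names are made up for this example, and a real scanner covers far more token types with fewer false positives.

```python
import re
import tarfile
import zipfile

# Heuristic token shapes (assumptions for illustration, not an exhaustive rule set).
TOKEN_PATTERNS = {
    "aws_access_key_id": re.compile(rb"(?<![0-9A-Z])(?:AKIA|ASIA)[0-9A-Z]{16}(?![0-9A-Z])"),
    "google_api_key": re.compile(rb"AIza[0-9A-Za-z_\-]{35}"),
    "private_key_block": re.compile(rb"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def iter_files(archive_path):
    """Yield (member_name, content_bytes) for every regular file in the archive."""
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)
    else:
        with tarfile.open(archive_path) as tf:
            for member in tf:
                if member.isfile():
                    yield member.name, tf.extractfile(member).read()


def scan_archive(archive_path):
    """Return (file_name, rule_name) pairs for every rule that matches a file."""
    findings = []
    for name, data in iter_files(archive_path):
        for rule, pattern in TOKEN_PATTERNS.items():
            if pattern.search(data):
                findings.append((name, rule))
    return findings
```

The sketch reports only the file and the rule that fired, not the matched value, so the check itself does not copy secrets into build logs. A CI job could fail the release whenever scan_archive() returns a non-empty list.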
The developer who got it wrong
Another example is a developer who published a bunch of access keys together with email credentials. The package where these credentials were found is called django-pegasus-cms. As Figure 5 shows, sensitive data was detected in three files: dev.py, local_settings.py and prod.py.
Figure 5: Detected sensitive data leaks in the django-pegasus-cms package
The most interesting of the three files is the last one, prod.py, which contains the final configuration prepared for use in the production environment. Its contents reveal that the developer leaked not only the access credentials for Amazon Web Services, but also the email and database credentials together with the database server URL.
Figure 6: Access credentials leaked in the prod.py file
Anyone with the slightest knowledge of information security can tell that leaking email credentials is bad. Hopefully, those access credentials weren’t reused to access other email services. It seems that the developer misunderstood the general idea behind the aforementioned advice from the AWS documentation: “Manage your access keys as securely as you do your user name and password”.
The developer who got it totally wrong
The last example we’ll discuss is the bad practice detected in several packages published by a developer employed in the Xiaomi ML department (as stated in his resume). Our analysis showed that several PyPI packages published by the “18550288233” user contain various API access tokens, including AWS and Google Cloud API keys. Some of them also contain database access credentials together with the IP addresses of the endpoints where those databases are hosted. In one of the packages, the developer listed not only his own access keys, but also those belonging to several of his team members.
One of the packages published by this developer is named mi, and it stood out because the developer works at Xiaomi. The project description header (translated from Chinese) reads: “TODO: The company's internal needs are encapsulated in mi, and others are encapsulated in meutils”, probably as a reminder to keep company-specific code in mi and move everything else into the MeUtils package published by the same developer. Several files in the mi package contain plaintext username and password pairs for access to various databases. However, the code reveals that they are used to authenticate to services accessed through local IP addresses, meaning that those services can probably only be reached from the private network. Nevertheless, it is bad practice to publish plaintext credentials to public repositories.
Figure 7: Xiaomi File Storage Service credentials leaked from mi package
Still, there are other files that leak credentials used to access public services, like the Xiaomi File Storage Service access key and secret found in the mi/mifds/xiaomi.py file. Another example is the access credentials for the Baidu Artificial Intelligence platform leaked in the mi/sdk/BaiduApi.py file.
Figure 8: Baidu AI platform credentials leaked from mi package
A different package from the same author, named ctrzoo, shows that handling access credentials in this way is not an accident, but rather a habit. This package contains a conf/cm.conf file that lists access keys and secrets belonging to several other Xiaomi employees, judging by the “xiaomi.com” email domain.
Figure 9: Leaked access credentials belonging to Xiaomi employees
This developer published 37 projects on PyPI, and a large number of them have similar problems with the handling of sensitive information. Evidently, the developer is either unaware of these problems or is not concerned about them.
Keep secrets secret
Most modern software solutions need to interact with a remote API to realize their full functionality. Such APIs usually employ some type of access control to protect themselves from abuse. Every type of access control implies the use of some kind of secret that can prove the identity of API consumers. For practical reasons, developers often put those secrets somewhere in the codebase to be able to use them in an automated way. That approach makes development and testing faster and easier. Before the software is published and deployed, secrets should be removed from the codebase and the release artifacts. Unfortunately, many developers aren’t aware, or sometimes forget, that the release artifacts still contain secrets used during development.
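One common way to keep secrets out of both the codebase and the release artifacts is to read them from the process environment at run time. The sketch below assumes that approach; the load_secret helper, the MissingSecretError class and the EXAMPLE_API_TOKEN variable name are hypothetical and exist only for illustration.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not provided by the environment."""


def load_secret(name, environ=None):
    """Fetch a secret from the environment instead of hard-coding it.

    `environ` defaults to os.environ; passing a dict makes the helper testable.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        # Fail loudly rather than falling back to a default baked into the code.
        raise MissingSecretError(f"environment variable {name} is not set")
    return value
```

With a helper like this, the code calls load_secret("EXAMPLE_API_TOKEN") and the token value is supplied by the CI/CD system or the deployment environment, so no file containing the value exists in the project directory for a packaging tool to pick up.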
Leaking sensitive information, like access credentials and tokens, can cause unexpected financial expenses related to account abuse. ReversingLabs Secure Software Platform can help to detect sensitive information leaks before they result in financial losses. Our platform can detect more than thirty different access secrets inside any of the files extracted during our static file analysis.
Static file analysis of release artifacts should be mandatory in every development lifecycle, not only to detect potentially leaked secrets, but also to find potential vulnerabilities and code quality issues that might impact your software product. Detecting such issues before the product gets published and delivered to customers can significantly reduce potential remediation costs.
Our Secure Software Platform can generate a detailed and easily understandable report for your software product. Produced software quality reports contain comprehensive information on issues related to licensing, known vulnerabilities, digital signatures, gaps in implementation of vulnerability mitigations, and sensitive information leaks. In addition to general software quality information, these reports can help you detect supply chain attacks. From the report, you can find out if any known malware was found inside your release artifacts. You can also examine behavior indicators triggered on the extracted components, and by comparing different version reports, you can discover if some unexpected behavior changes were introduced in the new version of your software.
Leaking sensitive information can cause significant financial losses. Always check if any kind of sensitive information is present in your release artifacts before you publish them.