Magical file type detection in Python
Be magic with Magika
Sorry for the clickbait title, but I was searching for a rhyme for the library I will present to you. 😆
Introduction
There are many use cases where we need to know the correct type of a file:
- When returning or rendering a file in a web application or API, knowing the MIME type helps the browser display it correctly.
- In webmail or in drives such as Dropbox or Google Drive, knowing the file type makes it possible to show the right icon or preview, giving users a clue as to what the file contains.
- In mail clients, attackers may disguise executable scripts behind innocent-looking extensions such as .txt, so knowing the real type of a file is crucial to preventing some attacks.
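To make the last point concrete, here is a minimal sketch of why trusting the extension alone is not enough. It compares the guess made by Python's standard mimetypes module (which only looks at the file name) with a tiny content check based on leading "magic" bytes. The signature table, the MIME labels in it, the function names, and the sample attachment are all made up for illustration; real detectors know far more formats.

```python
import mimetypes

# Hypothetical, deliberately tiny signature table (illustration only).
# The MIME labels on the right are our own choice for this sketch.
SIGNATURES = {
    b"\x7fELF": "application/x-executable",   # Linux binaries
    b"MZ": "application/x-dosexec",           # Windows executables
    b"#!": "text/x-shellscript",              # scripts starting with a shebang
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def guess_from_name(filename):
    """Guess the MIME type from the extension only; the content is never read."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


def sniff_from_content(data):
    """Return a MIME type if the first bytes match a known signature, else None."""
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return None


# A made-up attachment: a shell script hiding behind a .txt extension.
attachment_name = "notes.txt"
attachment_body = b"#!/bin/sh\necho 'this is not a text note'\n"

by_name = guess_from_name(attachment_name)
by_content = sniff_from_content(attachment_body)

if by_content is not None and by_content != by_name:
    print(f"Suspicious: name says {by_name}, content looks like {by_content}")
```

The name-based guess happily reports a plain text file, while even this naive content check notices the shebang. The flip side is that a handful of fixed signatures is easy to fool, which is exactly the weakness of purely heuristic detection.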
Until now, the best tool for detecting the type of a file has been the file/libmagic project. It relies on hand-written heuristics about file formats, such as signature bytes expected at known offsets. This approach can be error-prone, because malicious users can confuse the detection with adversarially crafted payloads. This is why a team at Google decided to tackle the problem with deep learning, using the Keras library. They developed a small, highly accurate model for detecting file types and integrated it into a library called Magika (now you know why I chose the title for this article)…