Magical file type detection in Python

Kevin Tewouda
6 min read · Mar 24, 2024

Be magic with Magika

Photo by Almos Bechtold on Unsplash

Sorry for the clickbait title, but I was searching for a rhyme for the library I will present to you. 😆

Introduction

There are many use cases where we need to know the correct type of a file:

  • When returning or rendering a file in a web application or API, knowing the MIME type helps the browser display it correctly.
  • In webmail or cloud storage services such as Dropbox or Google Drive, knowing the file type lets the interface show the right icon or thumbnail, giving users a clue as to what the file contains.
  • In mail clients again, malicious users may try to pass off executable scripts under innocent-looking extensions such as .txt, so knowing the real type of a file is crucial for preventing some attacks.
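
To make the last point concrete, here is a minimal sketch that uses only the Python standard library. It compares the type suggested by a file's extension with the type suggested by its first bytes. The signature table, the MIME labels and the helper names (guess_from_name, guess_from_content, looks_suspicious) are my own illustrative assumptions, not part of any library, and a real detector checks far more than a handful of prefixes.

```python
import mimetypes
from typing import Optional

# A few well-known leading byte signatures ("magic numbers").
# Illustrative only: this table is tiny and the MIME labels are assumptions.
SIGNATURES = [
    (b"\x7fELF", "application/x-executable"),  # Linux/Unix binaries
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"#!", "text/x-shellscript"),  # any script starting with a shebang
]


def guess_from_name(filename: str) -> Optional[str]:
    """Guess the MIME type from the file extension only."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


def guess_from_content(data: bytes) -> Optional[str]:
    """Guess the MIME type from the first bytes, or None if unknown."""
    for signature, mime in SIGNATURES:
        if data.startswith(signature):
            return mime
    return None


def looks_suspicious(filename: str, data: bytes) -> bool:
    """Flag files whose content signature contradicts their extension."""
    by_name = guess_from_name(filename)
    by_content = guess_from_content(data)
    return by_content is not None and by_content != by_name


if __name__ == "__main__":
    # A shell script disguised as a plain text attachment.
    payload = b"#!/bin/sh\necho 'not so innocent'\n"
    print(guess_from_name("notes.txt"))
    print(guess_from_content(payload))
    print(looks_suspicious("notes.txt", payload))
```

The extension says "plain text" while the content says "script", which is exactly the mismatch a mail client would want to catch. Content with no known signature is simply not flagged here, so this sketch only shows the idea and is in no way a security control.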

So far, the best tool for detecting the type of a file has been the file/libmagic project, which relies on file format heuristics. This approach can be error-prone, because malicious users can confuse the detection with adversarially crafted payloads. This is why a team at Google decided to tackle the problem with deep learning, using the Keras library. They developed a small, highly accurate model for detecting file types and integrated it into a library called Magika (now you know why I chose the title for this article)…

Kevin Tewouda

Cameroonian deserter now living in France. Passionate about programming, sports, cinema and manga. I write in French and English because of my origins.