Nanjing University (NJU) Unified Identity Authentication CAPTCHA Recognition: A Full-Stack Open-Source Practice from Dataset Construction to Model Deployment

(Edited)

This article has been automatically translated by AI and may contain inaccuracies

This article was last modified on . Some content may be outdated. Feel free to ask the author if you have questions.

Introduction

In my earlier project NJUlogin, logging in with an account and password required recognizing a CAPTCHA, which I handled with ddddocr at decent accuracy.

I also deployed a server-side service and had a friend write a Tampermonkey script to automatically fill in the CAPTCHA for me every time I needed to log in (account and password are auto-filled by the browser), so I only needed to click login.

Recently, however, I wanted to make the recognition model lighter and better suited to edge deployment, which led to this project. (Please give it a Star if you like it ＞︿＜; if you just want to use it without the technical details, skip to the end, where the NJU server API version is recommended.)

Data Collection

https://github.com/Do1e/NJUcaptcha/tree/main/build_dataset

Dataset construction is largely automated, relying primarily on two tools:

ddddocr: used for initial CAPTCHA recognition
NJUlogin: used to validate the correctness of recognition results

I slightly modified NJUlogin so that it reports whether each recognition result was correct, and saved the images into separate folders accordingly. The incorrectly recognized ones (a few hundred?) were renamed by hand afterward.

To collect 100,000 images, I ran it in the background for about 3–4 days. The time.sleep interval couldn’t be too small, or else the IP would get blocked. ＞︿＜
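The collection loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in build_dataset: `fetch`, `recognize`, and `verify` are hypothetical callables standing in for downloading a CAPTCHA, running ddddocr on it, and checking the answer through the modified NJUlogin (in practice the download and the check would share one login session). The folder names and the default delay are assumptions as well.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable


def collect(
    fetch: Callable[[], bytes],
    recognize: Callable[[bytes], str],
    verify: Callable[[str], bool],
    out_dir: Path,
    n_images: int,
    delay_s: float = 2.0,
) -> dict[str, int]:
    """Fetch, label, and file CAPTCHAs until n_images distinct files are saved."""
    counts = {"correct": 0, "wrong": 0}
    for bucket in counts:
        (out_dir / bucket).mkdir(parents=True, exist_ok=True)

    while sum(counts.values()) < n_images:
        image = fetch()
        text = recognize(image).lower()
        # Route the image by whether the recognized text was accepted.
        bucket = "correct" if verify(text) else "wrong"
        digest = hashlib.md5(image).hexdigest()
        path = out_dir / bucket / f"{text}_{digest}.jpg"
        if not path.exists():  # skip exact duplicates
            path.write_bytes(image)
            counts[bucket] += 1
        # Throttle requests so the server does not block the IP.
        time.sleep(delay_s)
    return counts
```

Files that land in the "wrong" folder carry the misrecognized text in their name, which is why they still need to be renamed by hand. As a sanity check on the pacing, 100,000 images over 3 to 4 days works out to roughly 2.6 to 3.5 seconds per image on average.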

The result is this dataset, which you are welcome to download and use. It contains 100,000 CAPTCHA images named in the format {CAPTCHA text}_{image md5}.jpg, all lowercase.
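Because the label is stored in the file name, reading a sample back only takes a little parsing. The helper below is a small sketch under one assumption that the article does not state: that the CAPTCHA text consists only of ASCII letters and digits. The function names are made up for illustration and are not taken from dataset.py.

```python
import hashlib
import re
from pathlib import Path

# {CAPTCHA text}_{image md5}.jpg, all lowercase.
# Assumption: the text part contains only a-z and 0-9.
NAME_RE = re.compile(r"^(?P<text>[a-z0-9]+)_(?P<md5>[0-9a-f]{32})\.jpg$")


def parse_name(filename: str) -> tuple[str, str]:
    """Split a dataset file name into (label, md5 hex digest)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return match["text"], match["md5"]


def is_intact(path: Path) -> bool:
    """Check that the file content still matches the md5 in its name."""
    _, digest = parse_name(path.name)
    return hashlib.md5(path.read_bytes()).hexdigest() == digest
```

Running `is_intact` over the extracted archive is a cheap way to catch truncated or corrupted files before training.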

Dataset download link: NJU-captcha-dataset.7z

Decryption password: @Do1e

Dataset structure:

https://github.com/Do1e/NJUcaptcha/blob/main/model/dataset.py

Recognition Model

https://github.com/Do1e/NJUcaptcha/tree/main/model

With the dataset ready, we can now design and train the model. This time, I fully delegated the model design to AI, and the results are quite satisfactory.

Model size: 12.98MiB → 2.25MiB

Model accuracy: 99.37% → 99.83%

Throughput: 173.95 images/sec → 1076.56 images/sec [AMD Ryzen 7 8845H]
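To put the three figures above in proportion, the arithmetic below reads each arrow as "before → after", on the assumption that the left-hand numbers belong to the earlier ddddocr-based setup and the right-hand numbers to the new model.

```python
# Figures copied from the comparison above (before -> after).
size_before, size_after = 12.98, 2.25      # MiB
acc_before, acc_after = 99.37, 99.83       # percent
tput_before, tput_after = 173.95, 1076.56  # images/sec

size_ratio = size_before / size_after
speedup = tput_after / tput_before
err_before = 100 - acc_before
err_after = 100 - acc_after
err_ratio = err_before / err_after

print(f"size: {size_ratio:.2f}x smaller")     # size: 5.77x smaller
print(f"throughput: {speedup:.2f}x higher")   # throughput: 6.19x higher
print(f"error rate: {err_before:.2f}% -> {err_after:.2f}% "
      f"({err_ratio:.2f}x fewer errors)")     # 0.63% -> 0.17% (3.71x fewer errors)
```

In other words, the model is a bit under one sixth of the original size, runs about six times faster on the same CPU, and makes between a quarter and a third as many mistakes.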

https://github.com/Do1e/NJUcaptcha/blob/main/model/model.py

Maybe it could be even smaller? ~~Leave that for future upgrades~~

Server Deployment

https://github.com/Do1e/NJUcaptcha/tree/main/service

Previously, I used FastAPI to quickly implement a recognition server that receives a base64-encoded image and returns the CAPTCHA text. This time, I also deployed it on Vercel as part of the open-source release.

Linux test command:

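curl -s -L "https://authserver.nju.edu.cn/authserver/captcha.html" -o "captcha.jpg" && [ -f "captcha.jpg" ] && curl -s -X POST -H "Content-Type: application/x-www-form-urlencoded" -d "captcha=$(base64 -i captcha.jpg | tr -d '\n')" "https://njucaptcha.vercel.app" || { echo "Failed to download captcha image"; exit 1; }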
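For readers who prefer Python to curl, the same exchange can be sketched with the standard library alone. The form field name `captcha` and the endpoint URL come from the project's curl test command. Everything else is an assumption, including the helper names and `decode_body`, which only illustrates what a server has to undo and is not the code in the service directory.

```python
import base64
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

API_URL = "https://njucaptcha.vercel.app"  # endpoint from the curl test command


def build_request(image: bytes, url: str = API_URL) -> Request:
    """Client side: wrap the image as a base64 form field in a POST request."""
    payload = base64.b64encode(image).decode("ascii")
    # urlencode percent-escapes '+', '/' and '=' so the base64 survives form parsing.
    body = urlencode({"captcha": payload}).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def decode_body(body: bytes) -> bytes:
    """Server side (illustrative): recover the raw image bytes from the form body."""
    fields = parse_qs(body.decode("ascii"), strict_parsing=True)
    return base64.b64decode(fields["captcha"][0], validate=True)
```

Passing the result of `build_request(open("captcha.jpg", "rb").read())` to `urllib.request.urlopen` would send it. The response is presumably the recognized text, but its exact format is not shown in the article.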

Tampermonkey Script for Automatic Filling

As mentioned earlier, to avoid manual CAPTCHA recognition and input during login, I wrote a Tampermonkey script to auto-fill it. The previous version was server-based:

https://github.com/Do1e/NJUcaptcha/blob/main/njucaptcha.user.js

The open-source version still calls the Vercel service, which is slow and only reachable after you have logged into p.nju, so it cannot be used on the p.nju login page itself. (￣﹃￣)

My own solution is to set up a local server within campus and map it via frp to my public server, accessing the internal service when logging into p.nju:

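// Placeholder addresses: the public endpoint and the campus-internal one.
const url_pub = 'https://example.com/';
const url_nju = 'https://nju.example.com/';
// Use the internal service only on the p.nju login page.
const currentUrl = window.location.href;
const serverUrl = currentUrl.includes('//p.nju.edu.cn') ? url_nju : url_pub;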

The hardest part of this entire project was getting ONNX inference to run directly in the client. After spending several hours experimenting with AI tools, I finally succeeded. The implementation uses ONNX Runtime Web.

https://github.com/Do1e/NJUcaptcha/blob/main/njucaptcha_onnx.user.js

A downside of the ONNX version is that, without a cache, it needs internet access through a proxy to download the inference dependencies (ort-wasm-simd-threaded.jsep.mjs and ort-wasm-simd-threaded.jsep.wasm). After the first use these files are cached, but the cache lasts only 7 days, which is not very long. If anyone knows how to get near-permanent caching, as @resource provides, PRs are welcome.

In short, each of the above approaches has its pros and cons. The most recommended option is either deploying it yourself following my method or directly using the NJU server API version provided at the end of this article.

The Tampermonkey scripts above can be installed directly via the links below (assuming Tampermonkey is already installed):

	Vercel API Version	NJU Server API Version	ONNX Local Inference Version
Pros	No proxy needed	Best practice; the setup I personally consider ideal	Very fast: the CAPTCHA is filled in before the page finishes loading, and it works even when logging into p.nju (once cached)
Cons	Very slow; cannot be used when logging into p.nju	Requires both an internal and an external server; I won’t be able to maintain it after graduation	Without a cache, a proxy is needed to download some files and it cannot be used when logging into p.nju; the cache lasts only 7 days

Note: This code is released under the GPL-3.0 license. Please ignore the following remarks about licensing. ~~Too lazy to update the website code—I keep the right to interpret my site as I see fit~~
