New addon for categorizing visited pages

Hi.

IPFire Categorizer Engine - Technical Overview

Purpose

A Perl script that processes Squid proxy logs and categorizes web traffic using domain-based blacklists. Generates hourly/daily statistics for network monitoring and creates alerts for suspicious activity.

Core Functionality

  1. Blacklist Loading
  • Scans /var/ipfire/urlfilter/blacklists/ directory at startup.
  • Each subdirectory = one category (e.g., adult, malware, phishing).
  • Loads domains file from each category into memory hash tables.
  • Current implementation: Category name = directory name (automatic detection).
  • Loads ~500k domains across 78 categories into RAM.
  • Uses caching to avoid repeated lookups (100% cache hit ratio achieved).
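
A minimal sketch of the loading step described above, written in Python for brevity (the engine itself is Perl). The function name `load_blacklists` is hypothetical. It assumes each category directory holds a plain-text file named `domains` with one domain per line, and that lines starting with `#` can be skipped.

```python
import os

def load_blacklists(root):
    """Return {category: set_of_domains}, one entry per subdirectory of root."""
    categories = {}
    for entry in sorted(os.listdir(root)):
        domains_file = os.path.join(root, entry, "domains")
        if not os.path.isfile(domains_file):
            continue  # not a category directory, or it has no domains file
        with open(domains_file, encoding="utf-8", errors="replace") as fh:
            # Category name = directory name; a set gives O(1) membership tests.
            categories[entry] = {
                line.strip().lower()
                for line in fh
                if line.strip() and not line.startswith("#")
            }
    return categories
```

On IPFire this would be called as `load_blacklists("/var/ipfire/urlfilter/blacklists")`.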
  2. Squid Log Processing
    Input: /var/log/squid/access.log (standard Squid format)
    Process flow:

Performance: Currently processes 226 lines/second.
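
For illustration, here is a rough Python sketch of parsing one line of Squid's native access.log format and extracting the requested host. The function name `parse_line` and the returned keys are hypothetical, and the engine's own Perl parser may differ. It assumes the default native field order (timestamp, elapsed, client, result/status, bytes, method, URL, ...).

```python
from urllib.parse import urlsplit

def parse_line(line):
    """Parse one native-format access.log line; return None if malformed."""
    fields = line.split()
    if len(fields) < 7:
        return None
    timestamp, _elapsed, client, status, size, method, url = fields[:7]
    if method == "CONNECT":
        # CONNECT requests log "host:port" instead of a full URL.
        host = url.rsplit(":", 1)[0]
    else:
        host = urlsplit(url).hostname or ""
    try:
        return {
            "time": float(timestamp),
            "client": client,
            "status": status,
            "bytes": int(size),
            "method": method,
            "host": host.lower(),
        }
    except ValueError:
        return None  # non-numeric timestamp or size
```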

  3. Domain Categorization Logic

Priority system: Security categories (malware, phishing, ddos, cryptojacking, stalkerware) checked first.
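
The priority idea can be sketched in Python as follows. This is not the engine's actual code: the names `candidate_names` and `categorize` are hypothetical, and the parent-domain walk (trying `b.example.com`, then `example.com`) is an assumption about how a lookup could work. The remaining categories are simply checked in alphabetical order.

```python
SECURITY_CATEGORIES = ("malware", "phishing", "ddos", "cryptojacking", "stalkerware")

def candidate_names(host):
    """Yield the host and each parent domain, stopping before the bare TLD."""
    labels = host.lower().strip(".").split(".")
    for i in range(max(1, len(labels) - 1)):
        yield ".".join(labels[i:])

def categorize(host, blacklists):
    """Return the first matching category, checking security categories first."""
    ordered = [c for c in SECURITY_CATEGORIES if c in blacklists]
    ordered += sorted(c for c in blacklists if c not in SECURITY_CATEGORIES)
    for category in ordered:
        domains = blacklists[category]
        if any(name in domains for name in candidate_names(host)):
            return category
    return "uncategorized"
```

With this lookup, a host listed under both a security category and a general one is reported under the security category.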

  4. Statistics Generated
  • Hourly JSON files: Request counts, bandwidth, cache efficiency, top domains/IPs/categories.
  • Metrics index: Lightweight index for anomaly detection.
  • Alerts: Generated for sensitive categories, off-hours access, traffic spikes.
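
As a rough illustration of the hourly statistics, the Python sketch below groups already-parsed records by UTC hour. The function name, the record keys and the output field names are all hypothetical, since the actual JSON layout is not shown in this thread. Counting any result code that contains `HIT` as a cache hit is a simplification.

```python
import json
from collections import Counter, defaultdict
from datetime import datetime, timezone

def hourly_stats(records, top_n=3):
    """Group parsed log records by UTC hour and summarise each bucket."""
    buckets = defaultdict(lambda: {
        "requests": 0, "bytes": 0, "hits": 0,
        "domains": Counter(), "categories": Counter(),
    })
    for rec in records:
        hour = datetime.fromtimestamp(rec["time"], tz=timezone.utc).strftime("%Y-%m-%dT%H")
        bucket = buckets[hour]
        bucket["requests"] += 1
        bucket["bytes"] += rec["bytes"]
        if "HIT" in rec["status"]:
            bucket["hits"] += 1
        bucket["domains"][rec["host"]] += 1
        bucket["categories"][rec.get("category", "uncategorized")] += 1

    summary = {}
    for hour, bucket in sorted(buckets.items()):
        summary[hour] = {
            "requests": bucket["requests"],
            "bytes": bucket["bytes"],
            "cache_hit_ratio": round(bucket["hits"] / bucket["requests"], 3),
            "top_domains": bucket["domains"].most_common(top_n),
            "top_categories": bucket["categories"].most_common(top_n),
        }
    return summary
```

Each hour's entry can then be written to its own file with `json.dump`.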

Current Issue

Problem: Domain xvideos.com appears as “uncategorized” despite having adult content.
Root cause: Blacklist contains thousands of xvideos subdomains (xvideoss.com, xvideosporn.com) but NOT the main domain xvideos.com.
Why it matters: This suggests the blacklist may be incomplete for other major adult sites’ main domains.

Questions for IPFire Community

  1. URLFilter integration: How does IPFire’s URLFilter consume these blacklists? Does it use Squid’s url_rewrite_program or SquidGuard?
  2. Custom blacklists: What’s the recommended way to add missing domains that persists across blacklist updates? Is /var/ipfire/urlfilter/blacklists/custom/ the correct location?
  3. Blacklist updates: Do IPFire blacklist updates preserve entries in custom/ subdirectories?
  4. Domain matching: Does URLFilter perform subdomain matching (e.g., blocking *.xvideos.com if xvideos.com is listed)?
  5. Recommended blacklist source: Are UT1 Toulouse blacklists the recommended source, or are there better-maintained alternatives compatible with IPFire?
  6. Performance: For a proxy handling 500k blacklist domains, are there recommended optimizations for Squid/SquidGuard?

Technical Environment

IPFire version: [current version]
Squid log format: Standard access.log
Blacklist structure: UT1 Toulouse format (one domain per line)
Processing: Incremental (tracks last position)
Optimization: Memoization, regex precompilation, object pooling
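
Incremental processing of this kind can be sketched in Python as below. It is an illustration only, not the engine's code: the function name `read_new_lines` and the separate state file are assumptions. The sketch stores a byte offset, consumes only complete lines, and starts over when the log has shrunk (for example after rotation).

```python
import os

def read_new_lines(log_path, state_path):
    """Return lines appended to log_path since the offset saved in state_path."""
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as fh:
            offset = int(fh.read().strip() or 0)
    if offset > os.path.getsize(log_path):
        offset = 0  # log was rotated or truncated: start from the beginning
    with open(log_path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    # Leave a trailing partial line (still being written) for the next run.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    with open(state_path, "w") as fh:
        fh.write(str(offset + end))
    return lines
```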

The engine works correctly - it categorizes based on what’s in the blacklists. The issue is the blacklist data quality, specifically missing primary domains for major sites.

They are the only available blacklist.

There used to be the Shalla list and the MESD list, but Shalla closed its doors and MESD became unreachable back in 2022, so both were removed from the URL Filter list in CU164.
https://www.ipfire.org/docs/configuration/network/proxy/url-filter#automatic-blacklist-update

This only left the Toulouse University list and since then no other list has been identified by anyone to be included.

If anyone identifies a list they can evaluate it as the URL Filter allows you to define a Custom source URL and download the list.

As for your other questions, maybe @bbitsch can help, as he has a lot more experience with this function.

The assumptions about the location of the lists are right.
Squid uses SquidGuard as rewriter.
SquidGuard processes the defined lists. The set of lists consists of the defined external lists and the local custom lists. The update process changes the copy of the external lists only.

Your example:
‘xvideos.com’ isn’t the main domain.
According to the FQDN syntax, xvideoss.com and xvideosporn.com are other domain names. The hierarchy is built with the ‘.’ character.
So, if xvideos.com is defined as a domain to be blocked, all subdomains defined by the regex ‘.*\.xvideos\.com’ are blocked as well.
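
To make the label-boundary rule concrete, here is a small Python sketch. It only illustrates the matching rule described above and is not SquidGuard's implementation. The function name `is_blocked_by` is hypothetical.

```python
import re

def is_blocked_by(listed_domain, host):
    """True if host equals listed_domain or is a subdomain of it."""
    # The optional prefix must end in a literal dot, so "xvideoss.com"
    # cannot match an entry for "xvideos.com".
    pattern = re.compile(r"(?:.*\.)?" + re.escape(listed_domain), re.IGNORECASE)
    return pattern.fullmatch(host) is not None
```

Under this rule an entry for xvideos.com covers www.xvideos.com, but xvideoss.com and xvideosporn.com are separate domains that need their own entries, and vice versa.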

BTW: UrlFilter logs to /var/log/squidGuard.

I have created this script for that function (I still need to make the interface), but it does not categorize certain pages correctly.

categorizer-engine.zip (15,4 KB)

Create the directory structure in /var/ipfire/categorizer with the .json files with the categorizations already made.

Hi,
Just a thought out loud.
I’ve always wondered why the IPFire community doesn’t manage the lists.

That would require a big handful of volunteers to identify web sites and then categorize each site. And then the head(s) of that group would need to review and approve.

If you are interested, please speak up. A forward moving effort starts with just one person leading the effort!

Hi,
I understand the work and the people involved, but it could be a service that the IPFire community could offer by subscription to support itself.
But as I said, it’s just a thought.
If they decide to do it, I’d be available.

Hi,

I had thought of a pseudo-automatic system, not quite automatic. For pages that don’t appear as categorized (uncategorized) at the University of Toulouse, a button would be enabled so I could send a recategorization request (an anonymous email) to an email address. The recipient would categorize it and update a “custom” repository. All IPFires that have the addon would update the “custom” files in that repository, and thus, little by little, build a good database.

I’m slowly designing the interface. When I have something, I’ll add it. (without the pseudo-categorization module, since that needs careful thought.)

Bye.

Hi.

As I mentioned, I was creating a small add-on to categorize visited pages. Here it is.

To install:

  1. Decompress the zip file:
    categorizer.ipfire.tgz.zip (53,8 KB)

  2. Copy the file “categorize.ipfire.tgz” to /opt/pakfire/tmp.

  3. Unpack it with:

tar xvf categorize.ipfire.tgz

To install:

./install.sh

To uninstall:

./uninstall.sh

  4. You’ll see the menu inside “IPFire”.

There’s a feature called “Upload Logs,” which I’ve leveraged to upload and create reports generated by Squid. To do this, go to “Logs → Proxy Logs” and regenerate whatever you want. Then, save it from your browser as a .dat file and you can create the report.

Any incident reports you may have will be appreciated.

Bye.

Give me the budget and I will do it. We are looking at a full-time job for several people here…

The things that are sometimes being asked for on this forum cost a lot of money. And indeed the features that we are bringing you also cost a lot of money. These things cannot be financed by just donating $5 once a year…

I don’t know what the budget might be, but I think it could be a project sold as a service.
I know that other companies, even well-known ones, do it.

But as I said, it’s just an idea said out loud.

Yes, you’re absolutely right. But first, I need to figure out how to implement what you’re talking about in the addon I’ve made, and then add these manually cataloged lists to a parallel “custom” list so that all the installed addons can feed back. I was thinking about implementing something like an “uncategorized” button that would (obviously anonymously) send the request so that I could categorize it and upload it to a “custom” repository, which the addon would then download every so often. I’d have to think about it.

If anyone comes up with a better idea…

Bye.

This is why all the free lists are going away.

Hi guys,

I’ve created an addon (and improved it) so that pages that appear as “Uncategorized” now have a “Report” button. This sends an anonymous email to “reports@northsecure.es” so I can update a repository and the installed addons will be updated from it. I’ve created a short document showing how it works.

IPFire Categorizer.pdf (169,8 KB)

This is an example of the email I receive:

Mail example.pdf (98,2 KB)

Repository structure:

Within each category, there is a file called “domains” with a domain/subdomain on each line.

Here is the new addon:

categorizer.ipfire.v2.0.tgz.zip (60,8 KB)

What @jon said motivated me to do it:

that would require a big handful of volunteers to identify web sites and then categorize each site. And then the head(s) of that group would need to review and approve.

If you are interested, please speak up. A forward moving effort starts with just one person leading the effort!

I hope all my efforts are worth something.

Best regards, and give it a try.

Hi,
I haven’t tried the component yet (I will as soon as possible), but from what you’ve described, it seems really interesting.
As I’ve already said, I’m available to catalog the various sites.

Hi,
I tried the system on a test machine, and I’d say it works pretty well.
Allow me to offer some advice. I’d also provide a zip file with the remote server structure.

Another note: the strings to change for your own customization are not documented.