What is the correct way to use Active Learning on a huge and unbalanced text dataset?

Data Science Asked by LexABzH on January 7, 2021

I am working on a huge, unlabeled text dataset with over 30M lines (i.e., sentences).

We are trying to detect illegal sentences across 20 different legal issues (e.g. racism, insults, etc.), so this is a multi-label classification problem. As you can guess, most sentences are legal, which makes the dataset very unbalanced.

We decided to focus on 3 main classes (based on feedback) and started using Active Learning (AL) to build a training set and a validation set (the test set will be created later on):

  • We extracted a first set of 1,500 sentences and labelled them. This set was built from half random sentences and half judiciously selected sentences (via regex & embedding proximity), so it does not represent the true class distribution.

  • We split this first set into a training set and a validation set (500 for training / 1,000 for validation).

  • From this, we trained a model on the training set and used an AL selection function to extract 1,000 more samples from the pool to be labelled (a minimal sketch of such a selection step is shown after this list). Once labelled, we injected those samples into the training set and repeated this process several times.

  • In the end, we have 7,000 samples in our training set and 1,000 in our validation set.
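For context, here is a minimal sketch of what one round of that selection step can look like. It is only illustrative: our real model is a Bi-LSTM and the selection function is our own, whereas this sketch uses TF-IDF + one-vs-rest logistic regression with least-confidence sampling, and the variable names (`labeled_texts`, `pool_texts`, etc.) are placeholders, not from our actual pipeline.

```python
# Illustrative active-learning selection step (least-confidence sampling).
# The real setup uses a Bi-LSTM and a custom AL function; this is a stand-in.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Placeholder data standing in for the labelled seed set and the unlabelled pool.
labeled_texts = ["example sentence one", "another labelled sentence",
                 "a third labelled sentence", "and a fourth one"]
labels = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])  # 3 focus classes
pool_texts = ["unlabelled sentence a", "unlabelled sentence b", "unlabelled sentence c"]

vectorizer = TfidfVectorizer()
X_labeled = vectorizer.fit_transform(labeled_texts)
X_pool = vectorizer.transform(pool_texts)

model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
model.fit(X_labeled, labels)

# Least-confidence selection: keep the pool sentences whose predicted label
# probabilities sit closest to 0.5, i.e. where the model is most unsure.
probas = model.predict_proba(X_pool)                  # shape (n_pool, n_labels)
uncertainty = np.mean(np.abs(probas - 0.5), axis=1)   # lower = more uncertain
query_idx = np.argsort(uncertainty)[:2]               # batch to send to annotators
print([pool_texts[i] for i in query_idx])
```

In our real loop, the selected batch is labelled by annotators and then moved from the pool into the training set before the next round.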

Once this process was done, we trained a Bi-LSTM and were surprised to get a better F1-score on our validation set (~0.97) than on our training set (~0.90). Our understanding is that, because of the AL selection, the training set contains more complex cases than the validation set, in which case we are clearly unable to check whether the model performs well on those "complex cases".

It also creates annoying side effects, such as triggering early stopping too soon, or saving "the best model" according to the validation set even though other epochs were clearly better at handling more varied cases.

We would like to add some of these harder cases to our validation set. So we think that, in our process, we should route a portion of the newly labelled samples (say 15%) into the validation set instead of the training set (a minimal sketch of this split is shown after the pros and cons below).

Pros:

  • More varied cases in our validation set, hence more meaningful metrics
  • No more annoying side effects during the training phase

Cons:

  • Less data in our training set
  • Risk of overfitting to these corner cases?
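To make the proposal concrete, here is a minimal sketch of how each newly labelled AL batch could be split, assuming a plain random 85/15 split; the batch variables are placeholder data, not our real sentences or labels.

```python
# Illustrative 85/15 split of one newly labelled AL batch between the
# training set and the validation set (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split

batch_texts = np.array(["new sentence %d" % i for i in range(20)])  # one labelled batch
batch_labels = np.random.randint(0, 2, size=(20, 3))                # 3 focus classes

train_texts, val_texts, train_labels, val_labels = train_test_split(
    batch_texts, batch_labels, test_size=0.15, random_state=42
)
# train_* is appended to the training set, val_* to the validation set,
# so hard AL-selected cases appear on both sides of the split.
print(len(train_texts), len(val_texts))  # 17 / 3 for a batch of 20
```

With rare positive classes, a label-aware split (e.g. iterative stratification) might keep them represented on both sides; a plain random split is shown here only for simplicity.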

Do you think this is a good approach?
Is this something common with Active Learning?
Have we done something wrong in our labelling process?
