Efficient keyword replacement and sensitive word filtering tool

1. Introduction to the algorithm

The tool builds a keyword tree using an efficient trie, as shown in the following figure, and then scans the string to check whether consecutive characters form a path in the tree.

(Figure: trie)

An article on Juejin ("Nuggets") explains the principle in more detail. Read it first if you want the background, since the details are not repeated here.
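The idea above can be sketched in a few lines of Go. This is only an illustration of the general trie technique, not go-zero's actual implementation; the names `node`, `add` and `longestAt` are made up for this sketch.

```go
package main

import "fmt"

// node is one trie node. Children are keyed by rune so that
// multi-byte text (for example Chinese) is handled per character.
type node struct {
	children map[rune]*node
	end      bool // true if a keyword ends at this node
}

func newNode() *node { return &node{children: make(map[rune]*node)} }

// add inserts one keyword into the trie, creating nodes as needed.
func (n *node) add(word string) {
	cur := n
	for _, r := range word {
		next, ok := cur.children[r]
		if !ok {
			next = newNode()
			cur.children[r] = next
		}
		cur = next
	}
	cur.end = true
}

// longestAt returns the length in runes of the longest keyword that
// starts at chars[start], or 0 if no keyword starts there.
func (n *node) longestAt(chars []rune, start int) int {
	cur, longest := n, 0
	for i := start; i < len(chars); i++ {
		next, ok := cur.children[chars[i]]
		if !ok {
			break // the characters no longer form a path in the tree
		}
		cur = next
		if cur.end {
			longest = i - start + 1
		}
	}
	return longest
}

func main() {
	root := newNode()
	for _, w := range []string{"he", "hers", "his"} {
		root.add(w)
	}
	chars := []rune("ushers")
	for i := range chars {
		if n := root.longestAt(chars, i); n > 0 {
			fmt.Printf("keyword %q at rune offset %d\n", string(chars[i:i+n]), i)
		}
	}
	// prints: keyword "hers" at rune offset 2
}
```

Each lookup walks at most as many nodes as the longest keyword has characters, regardless of how many keywords are stored.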

2. Keyword replacement

Overlapping keywords are supported, and the longest matching keyword is selected automatically. A code example follows.

replacer := stringx.NewReplacer(map[string]string{
	"Japan":           "France",
	"Japan's capital": "Tokyo",
	"Tokyo":           "Japan's capital",
})
// "Japan" and "Japan's capital" both match at the start; the longer one wins.
fmt.Println(replacer.Replace("Japan's capital is Tokyo"))

This prints:

Tokyo is Japan's capital

The sample code can be found in stringx/replace/replace.go
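To make the "longest keyword wins" rule concrete, here is a minimal standalone sketch of longest-match replacement on a trie. It is an assumption about how such a replacer could work, not the code behind `stringx.NewReplacer`; `trieNode`, `buildTrie` and `replaceLongest` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// trieNode stores the replacement on the node where a keyword ends.
type trieNode struct {
	children    map[rune]*trieNode
	replacement string
	end         bool
}

// buildTrie inserts every non-empty key of the mapping into a fresh trie.
func buildTrie(mapping map[string]string) *trieNode {
	root := &trieNode{children: map[rune]*trieNode{}}
	for key, val := range mapping {
		if key == "" {
			continue
		}
		cur := root
		for _, r := range key {
			next, ok := cur.children[r]
			if !ok {
				next = &trieNode{children: map[rune]*trieNode{}}
				cur.children[r] = next
			}
			cur = next
		}
		cur.end, cur.replacement = true, val
	}
	return root
}

// replaceLongest scans text left to right. At each position it takes the
// longest keyword starting there, emits its replacement, and skips past it.
// Replacement text is never rescanned.
func replaceLongest(root *trieNode, text string) string {
	chars := []rune(text)
	var sb strings.Builder
	for i := 0; i < len(chars); {
		cur, matchLen, repl := root, 0, ""
		for j := i; j < len(chars); j++ {
			next, ok := cur.children[chars[j]]
			if !ok {
				break
			}
			cur = next
			if cur.end {
				matchLen, repl = j-i+1, cur.replacement
			}
		}
		if matchLen > 0 {
			sb.WriteString(repl)
			i += matchLen
		} else {
			sb.WriteRune(chars[i])
			i++
		}
	}
	return sb.String()
}

func main() {
	root := buildTrie(map[string]string{
		"Japan":           "France",
		"Japan's capital": "Tokyo",
		"Tokyo":           "Japan's capital",
	})
	fmt.Println(replaceLongest(root, "Japan's capital is Tokyo"))
	// prints: Tokyo is Japan's capital
}
```

Because the output of a replacement is not scanned again, "Tokyo" produced from "Japan's capital" is not turned back into "Japan's capital".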

3. Find sensitive words

The code example is as follows.

filter := stringx.NewTrie([]string{
	"AV actor",
	"Aoi Air",
	"AV",
	"Japanese AV actress",
	"AV actor porn",
})
keywords := filter.FindKeywords("Japanese AV actress and TV and movie actress. Aoi Air AV actress is xx debut, Japanese AV actresses the best performance is AV actor porn performance")
fmt.Println(keywords)

This prints the matched keywords, for example:

[Aoi Air Japanese AV actress AV actor porn AV AV actor]

4. Sensitive word filtering

Code examples are as follows.

filter := stringx.NewTrie([]string{
	"AV actor",
	"Aoi Air",
	"AV",
	"Japanese AV actress",
	"AV actor porn",
}, stringx.WithMask('?')) // the default mask is *
safe, keywords, found := filter.Filter("Japanese AV actress and TV and movie actress. Aoi Air AV actress is xx debut, Japanese AV actresses the best performance is AV actor porn performance")
fmt.Println(safe)
fmt.Println(keywords)
fmt.Println(found)

This prints:

??????????????????? and TV and movie actress. ??????? ?? actress is xx debut, ???????????????????es the best performance is ????????????? performance
[Aoi Air Japanese AV actress AV actor porn AV AV actor]
true

Example code is in stringx/filter/filter.go
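The filtering step can be illustrated with a small standalone sketch: find every keyword occurrence with a trie, then overwrite the covered characters with the mask. This is a simplified assumption of the approach, not go-zero's code; `newTrie` and `filter` are hypothetical helpers, and the keywords here are returned in first-seen order.

```go
package main

import "fmt"

type node struct {
	children map[rune]*node
	end      bool // true if a keyword ends at this node
}

// newTrie builds a trie from the given keywords.
func newTrie(words []string) *node {
	root := &node{children: map[rune]*node{}}
	for _, w := range words {
		cur := root
		for _, r := range w {
			next, ok := cur.children[r]
			if !ok {
				next = &node{children: map[rune]*node{}}
				cur.children[r] = next
			}
			cur = next
		}
		cur.end = true
	}
	return root
}

// filter masks every rune covered by any keyword match. It returns the
// masked text, the distinct keywords found, and whether any were found.
func filter(root *node, text string, mask rune) (string, []string, bool) {
	chars := []rune(text)
	masked := []rune(text) // separate copy so matching still sees the original
	seen := map[string]bool{}
	var keywords []string
	for i := range chars {
		cur := root
		for j := i; j < len(chars); j++ {
			next, ok := cur.children[chars[j]]
			if !ok {
				break
			}
			cur = next
			if cur.end {
				for k := i; k <= j; k++ {
					masked[k] = mask
				}
				if kw := string(chars[i : j+1]); !seen[kw] {
					seen[kw] = true
					keywords = append(keywords, kw)
				}
			}
		}
	}
	return string(masked), keywords, len(keywords) > 0
}

func main() {
	root := newTrie([]string{"bad", "bad word", "ugly"})
	safe, keywords, found := filter(root, "a bad word and an ugly one", '?')
	fmt.Println(safe)            // a ???????? and an ???? one
	fmt.Printf("%q\n", keywords) // ["bad" "bad word" "ugly"]
	fmt.Println(found)           // true
}
```

Overlapping matches such as "bad" and "bad word" simply mask the same characters twice, so no special overlap handling is needed in this sketch.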

5. Benchmark

| Sentences | Keywords | Regex    | go-zero |
| --------- | -------- | -------- | ------- |
| 10000     | 10000    | 16min10s | 27.2ms  |
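The source does not show how the regex baseline was built. One plausible form, given here purely as an assumption, joins all keywords into a single alternation with Go's standard `regexp` package. `buildKeywordRegex` and `maskWithRegex` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// buildKeywordRegex joins the keywords into one alternation pattern.
// Each keyword is escaped so that it is matched literally.
func buildKeywordRegex(keywords []string) *regexp.Regexp {
	quoted := make([]string, len(keywords))
	for i, k := range keywords {
		quoted[i] = regexp.QuoteMeta(k)
	}
	re := regexp.MustCompile(strings.Join(quoted, "|"))
	re.Longest() // prefer the longest alternative at each position
	return re
}

// maskWithRegex replaces every match with one '*' per matched rune.
func maskWithRegex(re *regexp.Regexp, text string) string {
	return re.ReplaceAllStringFunc(text, func(m string) string {
		return strings.Repeat("*", utf8.RuneCountInString(m))
	})
}

func main() {
	re := buildKeywordRegex([]string{"bad", "bad word", "ugly"})
	text := "a bad word and an ugly one"
	fmt.Println(maskWithRegex(re, text))             // a ******** and an **** one
	fmt.Printf("%q\n", re.FindAllString(text, -1)) // ["bad word" "ugly"]
}
```

Unlike the trie-based approach, this baseline reports only non-overlapping matches, so "bad" is not listed separately when "bad word" matches at the same position.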