A Case-Insensitive Bad-Word Filtering Algorithm for .NET

http://www.sina.com.cn  June 23, 2008, 10:57  IT168.com

[IT168 Technical Document]

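  Besides adding case-insensitive matching, this version also improves performance in a few other ways. These improvements rely mainly on the class below, StringSegment, which implements case-insensitive comparison and GetHashCode while avoiding calls to Substring.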
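public class StringSegment
{
    private string original;
    private int offset = 0;
    private int length = 0;

    public StringSegment(string s)
    {
        this.original = s;
        this.length = original.Length;
    }

    // Point the segment at original[offset, offset + length) without copying.
    public void Slice(int offset, int length)
    {
        this.offset = offset;
        this.length = length;
    }

    // Case-insensitive comparison of two segments of equal length.
    public override bool Equals(object obj)
    {
        StringSegment sg = obj as StringSegment;
        return sg != null && sg.length == this.length &&
            string.Compare(this.original, this.offset, sg.original, sg.offset, this.length, true) == 0;
    }

    public override int GetHashCode()
    {
        // The article elides this body: call char.ToLower on each char and compute the hash.
        // Simple stand-in only (an assumption, not the string-derived version the author used):
        int hash = 0;
        for (int i = offset; i < offset + length; i++)
            hash = unchecked(hash * 31 + char.ToLower(original[i]));
        return hash;
    }
}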
  The GetHashCode implementation is modeled entirely on the one in string, so it is not reproduced here; it is a fairly long piece of code.
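
As a rough illustration of why the segment class avoids Substring, the sketch below reuses a single segment as a probe into a hash set of bad words. It is a minimal sketch under stated assumptions, not the author's code: the names Segment, SegmentDemo, Build and IndexOfBadWord are hypothetical, the scan is a naive one over every start position, the hash is a simple 31-multiplier loop rather than the string-derived one mentioned above, and it uses StringComparison.OrdinalIgnoreCase with char.ToUpperInvariant so that Equals and GetHashCode are intended to agree regardless of the current culture.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-alone variant of the article's StringSegment.
public sealed class Segment
{
    private readonly string original;
    private int offset;
    private int length;

    public Segment(string s)
    {
        if (s == null) throw new ArgumentNullException("s");
        original = s;
        length = s.Length;
    }

    // Re-point the window at another part of the same string; nothing is copied.
    // A segment already stored in a hash set must not be sliced again.
    public void Slice(int offset, int length)
    {
        if (offset < 0 || length < 0 || offset > original.Length - length)
            throw new ArgumentOutOfRangeException();
        this.offset = offset;
        this.length = length;
    }

    public override bool Equals(object obj)
    {
        Segment other = obj as Segment;
        return other != null && other.length == length &&
            string.Compare(original, offset, other.original, other.offset,
                           length, StringComparison.OrdinalIgnoreCase) == 0;
    }

    // Assumed simple hash: fold the upper-cased chars of the window.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            for (int i = offset; i < offset + length; i++)
                hash = hash * 31 + char.ToUpperInvariant(original[i]);
            return hash;
        }
    }
}

public static class SegmentDemo
{
    // Each bad word becomes a whole-string segment used as a set key.
    public static HashSet<Segment> Build(params string[] words)
    {
        HashSet<Segment> set = new HashSet<Segment>();
        foreach (string w in words) set.Add(new Segment(w));
        return set;
    }

    // Naive scan: returns the start index of the first bad word of at most
    // maxLength chars, or -1. One probe object is reused for every lookup.
    public static int IndexOfBadWord(HashSet<Segment> badWords, string text, int maxLength)
    {
        Segment probe = new Segment(text);
        for (int start = 0; start < text.Length; start++)
        {
            for (int len = 1; len <= maxLength && start + len <= text.Length; len++)
            {
                probe.Slice(start, len);
                if (badWords.Contains(probe)) return start;
            }
        }
        return -1;
    }
}
```

With the set built from "Foo" and "bar", scanning "xxfOOyy" with maxLength 3 finds the match at index 2 without allocating any substring; only the probe's offset and length change between lookups.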

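  In addition, the data used for the feature checks is planned as 64K x 4 bytes, that is, each char gets 4 bytes of feature data. If the algorithm is improved later, only the layout within those 4 bytes needs to be re-planned. The current definition is as follows: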
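using System.Runtime.InteropServices;

// Note: FieldOffset and Size are measured in bytes, not bits. As declared, this
// struct occupies 32 bytes per char; the offsets read like bit positions within
// the 4 bytes (32 bits) described in the text.
[StructLayout(LayoutKind.Explicit, Size = 32)]
internal struct FastCheckFlag
{
    [FieldOffset(0)]
    public byte occur;         // 1st~8th char occurrence
    [FieldOffset(8)]
    public byte length;        // 2~9 word length, for words beginning with this char
    [FieldOffset(16)]
    public byte rlength;       // 2~9 word length, for words ending with this char (reverse length)
    [FieldOffset(24)]
    public bool single;        // this char alone is a bad word
    [FieldOffset(25)]
    public bool last;          // last occurrence
    [FieldOffset(26)]
    public byte occurParity;   // parity flag for the 9th and later occurrences
    [FieldOffset(28)]
    public byte lengthParity;  // parity flag for lengths of 10 and above
    [FieldOffset(30)]
    public byte rlengthParity; // parity flag for reverse lengths of 10 and above
}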
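
Because FieldOffset counts bytes, one way to actually fit the flags into 4 bytes per char is to pack the small fields into a single byte and expose them through properties. The following is only a sketch under assumptions: the type name PackedCheckFlag and the property names are hypothetical, and the 2-bit width of each parity field is inferred from the gaps between the offsets 26, 28 and 30 in the definition above, not stated by the article.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical 4-byte variant: three plain bytes plus one byte of packed bits.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct PackedCheckFlag
{
    public byte occur;    // byte 0
    public byte length;   // byte 1
    public byte rlength;  // byte 2
    private byte flags;   // byte 3: bit 0 single, bit 1 last,
                          // bits 2-3 occurParity, bits 4-5 lengthParity,
                          // bits 6-7 rlengthParity (assumed widths)

    private bool GetBit(int bit)
    {
        return (flags & (1 << bit)) != 0;
    }

    private void SetBit(int bit, bool value)
    {
        flags = value ? (byte)(flags | (1 << bit))
                      : (byte)(flags & ~(1 << bit));
    }

    // Read a 2-bit field starting at the given bit position.
    private byte GetPair(int shift)
    {
        return (byte)((flags >> shift) & 0x3);
    }

    // Write the low 2 bits of value into the field at the given bit position.
    private void SetPair(int shift, byte value)
    {
        flags = (byte)((flags & ~(0x3 << shift)) | ((value & 0x3) << shift));
    }

    public bool Single { get { return GetBit(0); } set { SetBit(0, value); } }
    public bool Last { get { return GetBit(1); } set { SetBit(1, value); } }
    public byte OccurParity { get { return GetPair(2); } set { SetPair(2, value); } }
    public byte LengthParity { get { return GetPair(4); } set { SetPair(4, value); } }
    public byte RLengthParity { get { return GetPair(6); } set { SetPair(6, value); } }
}
```

Since array elements are variables in C#, a statement such as table[c].Single = true still updates the element in place even though Single is a property, so initialization code written against public fields would need only the renamed members.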
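  Also, when the data is initialized, both the lowercase and the uppercase form of each character are processed together, for example:
if (word.Length == 1)
{
    // Flag both case forms so that lookups need no case conversion.
    fastCheck[char.ToLower(word[0])].single = true;
    fastCheck[char.ToUpper(word[0])].single = true;
}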
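
The same idea extends to the other per-char flags: each flag is written under both case forms of the character at build time, so the scan can index the table with the raw input char. The sketch below shows this for a position bit mask only. It is a sketch under assumptions: the names FastCheckInit, BuildOccurTable and MayOccurAt are hypothetical, reading occur as "bit i set means the char appears at position i of some bad word" is an interpretation of the struct comment, and the invariant-culture casing methods are used here instead of char.ToLower and char.ToUpper to keep the result independent of the current culture.

```csharp
using System;

public static class FastCheckInit
{
    // Builds a 64K-entry table, one byte per UTF-16 code unit.
    // Bit i (0-based, i < 8) of table[c] is set when c, in either case,
    // appears at position i of at least one word.
    public static byte[] BuildOccurTable(string[] words)
    {
        byte[] occur = new byte[char.MaxValue + 1];
        foreach (string word in words)
        {
            for (int i = 0; i < word.Length && i < 8; i++)
            {
                byte bit = (byte)(1 << i);
                // Record the flag under both case forms of the character.
                occur[char.ToLowerInvariant(word[i])] |= bit;
                occur[char.ToUpperInvariant(word[i])] |= bit;
            }
        }
        return occur;
    }

    // Quick rejection test: can char c be at this position of any bad word?
    public static bool MayOccurAt(byte[] occur, char c, int position)
    {
        return position < 8 && (occur[c] & (1 << position)) != 0;
    }
}
```

The table costs a little more time to build, but each lookup during scanning is a single array access with no call to ToLower or ToUpper on the input text.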
