The language used for tokenization is specified at the column level, or it can be identified through a filter component within
Tokenization is important because it defines the units within the data that are compared to each other.
If you are unsure, a good general choice is the neutral word breaker, which tokenizes purely on white space and punctuation.
The transformation provides a default set of delimiters used to tokenize the data, but you can add new delimiters that improve the tokenization of your data.
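
The effect of adding delimiters can be illustrated with a minimal Python sketch. It is not the transformation's actual implementation: the function name `tokenize`, the parameter `extra_delimiters`, and the contents of `DEFAULT_DELIMITERS` are all assumptions chosen for illustration, and the real default delimiter set may differ.

```python
import string

# Assumed default delimiters for illustration only: white space plus a few
# common punctuation marks. The real default set may be different.
DEFAULT_DELIMITERS = set(string.whitespace) | set(",.;:-")


def tokenize(text, extra_delimiters=""):
    """Split text into tokens at every delimiter character.

    extra_delimiters is a hypothetical parameter: each character in it is
    treated as an additional delimiter on top of the assumed defaults.
    """
    delimiters = DEFAULT_DELIMITERS | set(extra_delimiters)
    tokens = []
    current = []
    for ch in text:
        if ch in delimiters:
            # A delimiter closes the token being built, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# With the assumed defaults, "_" and "/" are not delimiters:
default_tokens = tokenize("ACME_Corp/Seattle-WA")
# -> ['ACME_Corp/Seattle', 'WA']

# Adding "_" and "/" yields finer-grained units for comparison:
custom_tokens = tokenize("ACME_Corp/Seattle-WA", extra_delimiters="_/")
# -> ['ACME', 'Corp', 'Seattle', 'WA']
```

Because the tokens are the units that get compared, the second call would let "ACME" and "Corp" match other records independently, whereas the first treats "ACME_Corp/Seattle" as a single unit.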