MyCaffe  1.12.2.41
Deep learning software for Windows C# programmers.
MyCaffe.layers.gpt.TextListData Class Reference

The TextListData manages parallel lists of data where the first list contains the encoder input data and the second the decoder input/target data.

Inheritance diagram for MyCaffe.layers.gpt.TextListData:
MyCaffe.layers.gpt.InputData

Public Types

enum  VOCABUARY_TYPE { CHARACTER , WORD }
 Defines the vocabulary type to use.
 

Public Member Functions

 TextListData (Log log, string strSrcFile, string strVocabFile, bool bIncludeTarget, TokenizedDataParameter.VOCABULARY_TYPE vocabType, int? nRandomSeed=null, Phase phase=Phase.NONE)
 The constructor.
 
override bool GetDataAvailabilityAt (int nIdx, bool bIncludeSrc, bool bIncludeTrg)
 Returns true if data is available at the given index.
 
override Tuple< float[], float[]> GetData (int nBatchSize, int nBlockSize, InputData trgData, out int[] rgnIdx)
 Retrieve random blocks from the source data, where the target is the same sequence as the data but offset by +1 element.
 
override Tuple< float[], float[]> GetDataAt (int nBatchSize, int nBlockSize, int[] rgnIdx)
 Fill a batch of data from a specified array of indexes.
 
override List< int > Tokenize (string str, bool bAddBos, bool bAddEos)
 Tokenize an input string using the internal vocabulary.
 
override string Detokenize (float[] rgfTokIdx, int nStartIdx, int nCount, bool bIgnoreBos, bool bIgnoreEos)
 Detokenize an array into a string.
 
override string Detokenize (int nTokIdx, bool bIgnoreBos, bool bIgnoreEos)
 Detokenize a single token.
 
- Public Member Functions inherited from MyCaffe.layers.gpt.InputData
 InputData (int? nRandomSeed=null)
 The constructor.
 

Properties

override List< string > RawData [get]
 Return the raw data.
 
override uint TokenSize [get]
 The text data token size is a single character.
 
override uint VocabularySize [get]
 Returns the number of unique characters in the data.
 
override char BOS [get]
 Return the special begin of sequence character.
 
override char EOS [get]
 Return the special end of sequence character.
 
- Properties inherited from MyCaffe.layers.gpt.InputData
abstract List< string > RawData [get]
 Returns the raw data.
 
abstract uint TokenSize [get]
 Returns the size of a single token (e.g. 1 for character data).
 
abstract uint VocabularySize [get]
 Returns the size of the vocabulary.
 
abstract char BOS [get]
 Return the special begin of sequence character.
 
abstract char EOS [get]
 Return the special end of sequence character.
 

Additional Inherited Members

- Protected Attributes inherited from MyCaffe.layers.gpt.InputData
Random m_random
 Specifies the random object made available to the derived classes.
 

Detailed Description

The TextListData manages parallel lists of data where the first list contains the encoder input data and the second the decoder input/target data.

Definition at line 608 of file TokenizedDataPairsLayer.cs.

Member Enumeration Documentation

◆ VOCABUARY_TYPE

Defines the vocabulary type to use.

Enumerator
CHARACTER 

Specifies a character vocabulary.

WORD 

Specifies a space separated word vocabulary.

Definition at line 621 of file TokenizedDataPairsLayer.cs.
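
The difference between the two vocabulary types can be illustrated with a small standalone sketch. This is not the MyCaffe implementation: the class `VocabSketch` and its methods `BuildCharacterVocab` and `BuildWordVocab` are hypothetical names, and assigning indexes in sorted order is an assumption made only to keep the result deterministic.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of MyCaffe.
public static class VocabSketch
{
    // CHARACTER: every unique character becomes one vocabulary entry.
    public static Dictionary<char, int> BuildCharacterVocab(IEnumerable<string> lines)
    {
        var chars = new SortedSet<char>();
        foreach (string line in lines)
            foreach (char ch in line)
                chars.Add(ch);

        var vocab = new Dictionary<char, int>();
        foreach (char ch in chars)
            vocab.Add(ch, vocab.Count); // index = insertion order (assumption)
        return vocab;
    }

    // WORD: every unique space-separated word becomes one vocabulary entry.
    public static Dictionary<string, int> BuildWordVocab(IEnumerable<string> lines)
    {
        var words = new SortedSet<string>(StringComparer.Ordinal);
        foreach (string line in lines)
            foreach (string w in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
                words.Add(w);

        var vocab = new Dictionary<string, int>();
        foreach (string w in words)
            vocab.Add(w, vocab.Count);
        return vocab;
    }
}
```

For the same input, the character vocabulary is typically much smaller than the word vocabulary but produces longer token sequences, which is the usual trade-off between the two settings.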

Constructor & Destructor Documentation

◆ TextListData()

MyCaffe.layers.gpt.TextListData.TextListData ( Log  log,
string  strSrcFile,
string  strVocabFile,
bool  bIncludeTarget,
TokenizedDataParameter.VOCABULARY_TYPE  vocabType,
int?  nRandomSeed = null,
Phase  phase = Phase.NONE 
)

The constructor.

Parameters
log  Specifies the output log.
strSrcFile  Specifies the text file name for the data source.
strVocabFile  Specifies the vocabulary file (used by the SENTENCEPIECE type).
bIncludeTarget  Specifies to create the target tokens.
vocabType  Specifies the vocabulary type to use.
nRandomSeed  Optionally, specifies a random seed for testing.
phase  Specifies the currently running phase.

Definition at line 643 of file TokenizedDataPairsLayer.cs.

Member Function Documentation

◆ Detokenize() [1/2]

override string MyCaffe.layers.gpt.TextListData.Detokenize ( float[]  rgfTokIdx,
int  nStartIdx,
int  nCount,
bool  bIgnoreBos,
bool  bIgnoreEos 
)
virtual

Detokenize an array into a string.

Parameters
rgfTokIdx  Specifies the array of tokens to detokenize.
nStartIdx  Specifies the starting index where detokenizing begins.
nCount  Specifies the number of tokens to detokenize.
bIgnoreBos  Specifies to ignore the BOS token.
bIgnoreEos  Specifies to ignore the EOS token.
Returns
The detokenized string is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 890 of file TokenizedDataPairsLayer.cs.
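
The parameters above can be read as a range walk over a token array with optional filtering of the special characters. The following is a minimal character-level sketch under stated assumptions, not the MyCaffe source: `DetokenizeSketch`, the `indexToChar` lookup table, and the choice to silently skip out-of-range indexes are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of MyCaffe.
public static class DetokenizeSketch
{
    public static string Detokenize(float[] tokens, int start, int count,
        IReadOnlyList<char> indexToChar, char bos, char eos,
        bool ignoreBos, bool ignoreEos)
    {
        var sb = new StringBuilder();
        // Clamp the range so a large count cannot run past the array.
        int end = Math.Min(start + count, tokens.Length);

        for (int i = start; i < end; i++)
        {
            // Token indexes arrive as floats and are truncated to int.
            int idx = (int)tokens[i];
            if (idx < 0 || idx >= indexToChar.Count)
                continue; // assumption: unknown indexes are skipped

            char ch = indexToChar[idx];
            if (ignoreBos && ch == bos) continue;
            if (ignoreEos && ch == eos) continue;
            sb.Append(ch);
        }
        return sb.ToString();
    }
}
```

The token array is `float[]` here because the documented signature takes `float[] rgfTokIdx`; the sketch only casts each value back to an integer index before the lookup.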

◆ Detokenize() [2/2]

override string MyCaffe.layers.gpt.TextListData.Detokenize ( int  nTokIdx,
bool  bIgnoreBos,
bool  bIgnoreEos 
)
virtual

Detokenize a single token.

Parameters
nTokIdx  Specifies an index to the token to be detokenized.
bIgnoreBos  Specifies to ignore the BOS token.
bIgnoreEos  Specifies to ignore the EOS token.
Returns
The detokenized character is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 912 of file TokenizedDataPairsLayer.cs.

◆ GetData()

override Tuple< float[], float[]> MyCaffe.layers.gpt.TextListData.GetData ( int  nBatchSize,
int  nBlockSize,
InputData  trgData,
out int[]  rgnIdx 
)
virtual

Retrieve random blocks from the source data, where the target is the same sequence as the data but offset by +1 element.

Parameters
nBatchSize  Specifies the batch size.
nBlockSize  Specifies the block size.
trgData  Specifies the matching target data used to verify that both source and target have data at each chosen index.
rgnIdx  Returns an array of the indexes of the data returned.
Returns
A tuple containing the data and target is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 754 of file TokenizedDataPairsLayer.cs.
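
The "+1 offset" relationship between data and target can be sketched on a flat token array. This is an illustration under stated assumptions rather than the library's code: `BlockSketch` and `GetBlocks` are hypothetical names, the input is a single `int[]` instead of the class's internal lists, and the `trgData` availability check is omitted.

```csharp
using System;

// Hypothetical helper for illustration only; not part of MyCaffe.
public static class BlockSketch
{
    public static Tuple<float[], float[]> GetBlocks(int[] tokens, int batchSize,
        int blockSize, Random random, out int[] indexes)
    {
        if (tokens.Length < blockSize + 1)
            throw new ArgumentException("Need at least blockSize + 1 tokens.");

        var data = new float[batchSize * blockSize];
        var target = new float[batchSize * blockSize];
        indexes = new int[batchSize];

        for (int b = 0; b < batchSize; b++)
        {
            // The exclusive upper bound leaves room for the shifted target.
            int start = random.Next(tokens.Length - blockSize);
            indexes[b] = start;

            for (int j = 0; j < blockSize; j++)
            {
                data[b * blockSize + j] = tokens[start + j];
                target[b * blockSize + j] = tokens[start + j + 1]; // offset +1
            }
        }
        return Tuple.Create(data, target);
    }
}
```

Each batch row is stored contiguously, so row `b` occupies elements `b * blockSize` through `(b + 1) * blockSize - 1` of both returned arrays, and `indexes[b]` records where that row started.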

◆ GetDataAt()

override Tuple< float[], float[]> MyCaffe.layers.gpt.TextListData.GetDataAt ( int  nBatchSize,
int  nBlockSize,
int[]  rgnIdx 
)
virtual

Fill a batch of data from a specified array of indexes.

Parameters
nBatchSize  Specifies the number of blocks in the batch.
nBlockSize  Specifies the size of each block.
rgnIdx  Specifies the array of indexes to the data to be retrieved.
Returns
A tuple containing the data and target is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 823 of file TokenizedDataPairsLayer.cs.

◆ GetDataAvailabilityAt()

override bool MyCaffe.layers.gpt.TextListData.GetDataAvailabilityAt ( int  nIdx,
bool  bIncludeSrc,
bool  bIncludeTrg 
)
virtual

Returns true if data is available at the given index.

Parameters
nIdx  Specifies the index to check.
bIncludeSrc  Specifies to include the source in the check.
bIncludeTrg  Specifies to include the target in the check.
Returns
If the data is available, true is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 734 of file TokenizedDataPairsLayer.cs.
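
One way to picture an availability check over parallel lists is shown below. The page does not state what "available" means, so this sketch assumes an entry is available when the index is in range and the entry is non-empty; `AvailabilitySketch` and `IsAvailableAt` are hypothetical names and the logic is not taken from MyCaffe.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of MyCaffe.
public static class AvailabilitySketch
{
    public static bool IsAvailableAt(IReadOnlyList<string> src, IReadOnlyList<string> trg,
        int idx, bool includeSrc, bool includeTrg)
    {
        if (idx < 0)
            return false;

        // Assumption: "available" means in range and non-empty.
        if (includeSrc && (idx >= src.Count || string.IsNullOrEmpty(src[idx])))
            return false;
        if (includeTrg && (idx >= trg.Count || string.IsNullOrEmpty(trg[idx])))
            return false;

        return true;
    }
}
```

Checking both lists at the same index is what keeps an encoder input paired with its decoder input/target when samples are drawn at random.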

◆ Tokenize()

override List< int > MyCaffe.layers.gpt.TextListData.Tokenize ( string  str,
bool  bAddBos,
bool  bAddEos 
)
virtual

Tokenize an input string using the internal vocabulary.

Parameters
str  Specifies the string to tokenize.
bAddBos  Add the begin of sequence token.
bAddEos  Add the end of sequence token.
Returns
A list of tokens corresponding to the input is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 876 of file TokenizedDataPairsLayer.cs.
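
A character-level version of this operation can be sketched as a dictionary lookup per character, with the BOS and EOS tokens added on request. The names `TokenizeSketch` and `Tokenize`, the externally supplied `vocab` dictionary, and the decision to skip characters missing from the vocabulary are assumptions for illustration; the actual class uses its own internal vocabulary.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of MyCaffe.
public static class TokenizeSketch
{
    public static List<int> Tokenize(string str, IReadOnlyDictionary<char, int> vocab,
        char bos, char eos, bool addBos, bool addEos)
    {
        var tokens = new List<int>();

        // The vocabulary is assumed to contain the BOS and EOS characters;
        // the indexer throws KeyNotFoundException if it does not.
        if (addBos)
            tokens.Add(vocab[bos]);

        foreach (char ch in str)
        {
            int idx;
            if (vocab.TryGetValue(ch, out idx))
                tokens.Add(idx); // assumption: unknown characters are skipped
        }

        if (addEos)
            tokens.Add(vocab[eos]);

        return tokens;
    }
}
```

With both flags set, the returned list is two entries longer than the number of known characters in the input, which matters when sizing blocks for the decoder.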

Property Documentation

◆ BOS

override char MyCaffe.layers.gpt.TextListData.BOS
get

Return the special begin of sequence character.

Definition at line 920 of file TokenizedDataPairsLayer.cs.

◆ EOS

override char MyCaffe.layers.gpt.TextListData.EOS
get

Return the special end of sequence character.

Definition at line 928 of file TokenizedDataPairsLayer.cs.

◆ RawData

override List<string> MyCaffe.layers.gpt.TextListData.RawData
get

Return the raw data.

Definition at line 706 of file TokenizedDataPairsLayer.cs.

◆ TokenSize

override uint MyCaffe.layers.gpt.TextListData.TokenSize
get

The text data token size is a single character.

Definition at line 714 of file TokenizedDataPairsLayer.cs.

◆ VocabularySize

override uint MyCaffe.layers.gpt.TextListData.VocabularySize
get

Returns the number of unique characters in the data.

Definition at line 722 of file TokenizedDataPairsLayer.cs.


The documentation for this class was generated from the following file: