MyCaffe  1.12.2.41
Deep learning software for Windows C# programmers.
MyCaffe.layers.gpt.TextInputData Class Reference

The TextInputData manages character data read in from a text file. Data is tokenized into indexes that reference each character within the vocabulary.

Inheritance diagram for MyCaffe.layers.gpt.TextInputData:
MyCaffe.layers.gpt.InputData

Public Member Functions

 TextInputData (string strSrc, TokenizedDataParameter.VOCABULARY_TYPE vocabType=TokenizedDataParameter.VOCABULARY_TYPE.CHARACTER, int? nRandomSeed=null, string strDebugIndexFile=null, Phase phase=Phase.NONE)
 The constructor.
 
override bool GetDataAvailabilityAt (int nIdx, bool bIncludeSrc, bool bIncludeTrg)
 Returns true if data is available at the given index.
 
override Tuple< float[], float[]> GetData (int nBatchSize, int nBlockSize, InputData trgData, out int[] rgnIdx)
 Retrieves random blocks from the source data. The target matches the data but is offset by +1 element.
 
override Tuple< float[], float[]> GetDataAt (int nBatchSize, int nBlockSize, int[] rgnIdx)
 Not used by this class; throws a NotImplementedException.
 
override List< int > Tokenize (string str, bool bAddBos, bool bAddEos)
 Tokenize an input string using the internal vocabulary.
 
override string Detokenize (int nTokIdx, bool bIgnoreBos, bool bIgnoreEos)
 Detokenize a single token.
 
override string Detokenize (float[] rgfTokIdx, int nStartIdx, int nCount, bool bIgnoreBos, bool bIgnoreEos)
 Detokenize an array into a string.
 
- Public Member Functions inherited from MyCaffe.layers.gpt.InputData
 InputData (int? nRandomSeed=null)
 The constructor.
 

Properties

override List< string > RawData [get]
 Return the raw data.
 
override uint TokenSize [get]
 The text data token size is a single character.
 
override uint VocabularySize [get]
 Returns the number of unique characters in the data.
 
override char BOS [get]
 Return the special begin of sequence character.
 
override char EOS [get]
 Return the special end of sequence character.
 
- Properties inherited from MyCaffe.layers.gpt.InputData
abstract List< string > RawData [get]
 Returns the raw data.
 
abstract uint TokenSize [get]
 Returns the size of a single token (e.g. 1 for character data).
 
abstract uint VocabularySize [get]
 Returns the size of the vocabulary.
 
abstract char BOS [get]
 Return the special begin of sequence character.
 
abstract char EOS [get]
 Return the special end of sequence character.
 

Additional Inherited Members

- Protected Attributes inherited from MyCaffe.layers.gpt.InputData
Random m_random
 Specifies the random object made available to the derived classes.
 

Detailed Description

The TextInputData manages character data read in from a text file. Data is tokenized into indexes that reference each character within the vocabulary.

For example, if the data source contains the text "a red fox ran.", the vocabulary would be:

Vocabulary: ' ', '.', 'a', 'd', 'e', 'f', 'n', 'o', 'r', 'x'
Index Vals:  0,   1,   2,   3,   4,   5,   6,   7,   8,   9

Tokenizing is the process of converting each input character to its respective 'token', which in this case is its index value. For example, 'a' is tokenized as index 2 and 'd' is tokenized as index 3.
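
The mapping described above can be illustrated with a minimal, standalone sketch. It does not use MyCaffe; the class CharVocabSketch and its methods are hypothetical helpers, and ordering the vocabulary by ordinal character value is an assumption made for the illustration rather than a statement about the TextInputData implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only; CharVocabSketch is a hypothetical helper
// and is not part of MyCaffe.
public static class CharVocabSketch
{
    // Collect the unique characters of the text.
    // Assumption: the vocabulary is ordered by ordinal character value.
    public static List<char> BuildVocabulary(string strText)
    {
        return strText.Distinct().OrderBy(ch => ch).ToList();
    }

    // Convert each character to its index within the vocabulary.
    // Throws KeyNotFoundException for characters not in the vocabulary.
    public static List<int> Tokenize(string str, List<char> rgVocab)
    {
        Dictionary<char, int> rgCharToIdx = new Dictionary<char, int>();
        for (int i = 0; i < rgVocab.Count; i++)
            rgCharToIdx[rgVocab[i]] = i;

        return str.Select(ch => rgCharToIdx[ch]).ToList();
    }

    // Convert each index back to its character and join the result.
    public static string Detokenize(IEnumerable<int> rgTokens, List<char> rgVocab)
    {
        return new string(rgTokens.Select(nIdx => rgVocab[nIdx]).ToArray());
    }
}
```

With a vocabulary built from "a red fox ran.", calling CharVocabSketch.Tokenize("ad", rgVocab) yields the indexes 2 and 3, and passing those indexes to Detokenize returns "ad".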

Definition at line 500 of file TokenizedDataLayer.cs.

Constructor & Destructor Documentation

◆ TextInputData()

MyCaffe.layers.gpt.TextInputData.TextInputData ( string  strSrc,
TokenizedDataParameter.VOCABULARY_TYPE  vocabType = TokenizedDataParameter.VOCABULARY_TYPE.CHARACTER,
int?  nRandomSeed = null,
string  strDebugIndexFile = null,
Phase  phase = Phase.NONE 
)

The constructor.

Parameters
strSrc  Specifies the data source as the filename of the text data file.
vocabType  Specifies the vocabulary type to use.
nRandomSeed  Optionally, specifies a random seed for testing.
strDebugIndexFile  Optionally, specifies the debug index file containing index values in the form 'idx = #', one per line.
phase  Specifies the currently running phase.

Definition at line 520 of file TokenizedDataLayer.cs.

Member Function Documentation

◆ Detokenize() [1/2]

override string MyCaffe.layers.gpt.TextInputData.Detokenize ( float[]  rgfTokIdx,
int  nStartIdx,
int  nCount,
bool  bIgnoreBos,
bool  bIgnoreEos 
)
virtual

Detokenize an array into a string.

Parameters
rgfTokIdx  Specifies the array of tokens to detokenize.
nStartIdx  Specifies the starting index where detokenizing begins.
nCount  Specifies the number of tokens to detokenize.
bIgnoreBos  Specifies whether to ignore the BOS token.
bIgnoreEos  Specifies whether to ignore the EOS token.
Returns
The detokenized string is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 694 of file TokenizedDataLayer.cs.
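
As a rough illustration of the parameters above, the following standalone sketch walks a float token array over the range [nStartIdx, nStartIdx + nCount) and maps each value back to a character. It is not the MyCaffe implementation: DetokenizeSketch, the rgVocab lookup list, and the chBos and chEos parameters are hypothetical, and treating 'ignore' as 'omit the character from the output' is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative sketch only; DetokenizeSketch is a hypothetical helper
// and is not part of MyCaffe.
public static class DetokenizeSketch
{
    // rgVocab, chBos and chEos are hypothetical stand-ins for the
    // vocabulary and special characters held by the real class.
    public static string Detokenize(float[] rgfTokIdx, int nStartIdx, int nCount,
        bool bIgnoreBos, bool bIgnoreEos, IList<char> rgVocab, char chBos, char chEos)
    {
        StringBuilder sb = new StringBuilder();

        for (int i = nStartIdx; i < nStartIdx + nCount; i++)
        {
            // Token values are stored as floats, so cast back to an index.
            char ch = rgVocab[(int)rgfTokIdx[i]];

            // Assumption: 'ignore' means the character is left out of the result.
            if (bIgnoreBos && ch == chBos)
                continue;
            if (bIgnoreEos && ch == chEos)
                continue;

            sb.Append(ch);
        }

        return sb.ToString();
    }
}
```

For a vocabulary of '<', '>', 'a', 'b' with '<' as BOS and '>' as EOS, the token array { 0, 2, 3, 1 } detokenizes to "ab" when both flags are true and to "<ab>" when both are false.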

◆ Detokenize() [2/2]

override string MyCaffe.layers.gpt.TextInputData.Detokenize ( int  nTokIdx,
bool  bIgnoreBos,
bool  bIgnoreEos 
)
virtual

Detokenize a single token.

Parameters
nTokIdx  Specifies the index of the token to detokenize.
bIgnoreBos  Specifies whether to ignore the BOS token.
bIgnoreEos  Specifies whether to ignore the EOS token.
Returns
The detokenized character is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 680 of file TokenizedDataLayer.cs.

◆ GetData()

override Tuple< float[], float[]> MyCaffe.layers.gpt.TextInputData.GetData ( int  nBatchSize,
int  nBlockSize,
InputData  trgData,
out int[]  rgnIdx 
)
virtual

Retrieves random blocks from the source data. The target matches the data but is offset by +1 element.

Parameters
nBatchSize  Specifies the batch size.
nBlockSize  Specifies the block size.
trgData  Specifies the target data, which is used to check for data availability at the selected data index.
rgnIdx  Returns an array of indexes of the data returned.
Returns
A tuple containing the data and target is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 599 of file TokenizedDataLayer.cs.
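
The +1 offset between data and target can be sketched in a few lines of standalone C#. This is a simplified illustration rather than the MyCaffe implementation: BlockSamplerSketch is a hypothetical helper, the token array and Random object are passed in explicitly, the availability check against trgData is omitted, and the flat batch-major layout of the returned arrays is an assumption.

```csharp
using System;

// Illustrative sketch only; BlockSamplerSketch is a hypothetical helper
// and is not part of MyCaffe.
public static class BlockSamplerSketch
{
    // Draw nBatchSize random blocks of nBlockSize tokens each. Every target
    // block covers the same span as its data block, shifted right by one.
    public static Tuple<float[], float[]> GetData(int[] rgTokens, int nBatchSize,
        int nBlockSize, Random random, out int[] rgnIdx)
    {
        if (rgTokens.Length <= nBlockSize)
            throw new ArgumentException("The token data must be longer than the block size.");

        // Assumption: blocks are stored back to back in one flat array.
        float[] rgData = new float[nBatchSize * nBlockSize];
        float[] rgTarget = new float[nBatchSize * nBlockSize];
        rgnIdx = new int[nBatchSize];

        for (int i = 0; i < nBatchSize; i++)
        {
            // The exclusive upper bound leaves room for the +1 target offset.
            int nStart = random.Next(rgTokens.Length - nBlockSize);
            rgnIdx[i] = nStart;

            for (int j = 0; j < nBlockSize; j++)
            {
                rgData[i * nBlockSize + j] = rgTokens[nStart + j];
                rgTarget[i * nBlockSize + j] = rgTokens[nStart + j + 1];
            }
        }

        return Tuple.Create(rgData, rgTarget);
    }
}
```

Each entry of rgnIdx records where the corresponding block starts in the token array, so element j of target block i always equals rgTokens[rgnIdx[i] + j + 1].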

◆ GetDataAt()

override Tuple< float[], float[]> MyCaffe.layers.gpt.TextInputData.GetDataAt ( int  nBatchSize,
int  nBlockSize,
int[]  rgnIdx 
)
virtual

Not used by this class; throws a NotImplementedException.

Parameters
nBatchSize  Specifies the number of blocks in the batch.
nBlockSize  Specifies the size of each block.
rgnIdx  Not used.
Exceptions
NotImplementedException

Implements MyCaffe.layers.gpt.InputData.

Definition at line 656 of file TokenizedDataLayer.cs.

◆ GetDataAvailabilityAt()

override bool MyCaffe.layers.gpt.TextInputData.GetDataAvailabilityAt ( int  nIdx,
bool  bIncludeSrc,
bool  bIncludeTrg 
)
virtual

Returns true if data is available at the given index.

Parameters
nIdx  Specifies the index to check.
bIncludeSrc  Specifies whether to include the source in the check.
bIncludeTrg  Specifies whether to include the target in the check.
Returns
If the data is available, true is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 585 of file TokenizedDataLayer.cs.

◆ Tokenize()

override List< int > MyCaffe.layers.gpt.TextInputData.Tokenize ( string  str,
bool  bAddBos,
bool  bAddEos 
)
virtual

Tokenize an input string using the internal vocabulary.

Parameters
str  Specifies the string to tokenize.
bAddBos  Specifies whether to add the begin of sequence (BOS) token.
bAddEos  Specifies whether to add the end of sequence (EOS) token.
Returns
A list of tokens corresponding to the input is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 668 of file TokenizedDataLayer.cs.

Property Documentation

◆ BOS

override char MyCaffe.layers.gpt.TextInputData.BOS
get

Return the special begin of sequence character.

Definition at line 712 of file TokenizedDataLayer.cs.

◆ EOS

override char MyCaffe.layers.gpt.TextInputData.EOS
get

Return the special end of sequence character.

Definition at line 720 of file TokenizedDataLayer.cs.

◆ RawData

override List<string> MyCaffe.layers.gpt.TextInputData.RawData
get

Return the raw data.

Definition at line 554 of file TokenizedDataLayer.cs.

◆ TokenSize

override uint MyCaffe.layers.gpt.TextInputData.TokenSize
get

The text data token size is a single character.

Definition at line 565 of file TokenizedDataLayer.cs.

◆ VocabularySize

override uint MyCaffe.layers.gpt.TextInputData.VocabularySize
get

Returns the number of unique characters in the data.

Definition at line 573 of file TokenizedDataLayer.cs.


The documentation for this class was generated from the following file: