The TextListData manages parallel lists of data where the first list contains the encoder input data and the second the decoder input/target data.

Inherits MyCaffe.layers.gpt.InputData.

Public Types
enum	VOCABUARY_TYPE { CHARACTER , WORD }
	Defines the vocabulary type to use.

Public Member Functions
	TextListData (Log log, string strSrcFile, string strVocabFile, bool bIncludeTarget, TokenizedDataParameter.VOCABULARY_TYPE vocabType, int? nRandomSeed=null, Phase phase=Phase.NONE)
	The constructor.

override bool	GetDataAvailabilityAt (int nIdx, bool bIncludeSrc, bool bIncludeTrg)
	Returns true if data is available at the given index.

override Tuple< float[], float[]>	GetData (int nBatchSize, int nBlockSize, InputData trgData, out int[] rgnIdx)
	Retrieves random blocks from the source data. The data and target hold the same sequence, with the target offset by +1 element from the data.

override Tuple< float[], float[]>	GetDataAt (int nBatchSize, int nBlockSize, int[] rgnIdx)
	Fill a batch of data from a specified array of indexes.

override List< int >	Tokenize (string str, bool bAddBos, bool bAddEos)
	Tokenize an input string using the internal vocabulary.

override string	Detokenize (float[] rgfTokIdx, int nStartIdx, int nCount, bool bIgnoreBos, bool bIgnoreEos)
	Detokenize an array into a string.

override string	Detokenize (int nTokIdx, bool bIgnoreBos, bool bIgnoreEos)
	Detokenize a single token.

Public Member Functions inherited from MyCaffe.layers.gpt.InputData
	InputData (int? nRandomSeed=null)
	The constructor.

Properties
override List< string >	RawData `[get]`
	Return the raw data.

override uint	TokenSize `[get]`
	The text data token size is a single character.

override uint	VocabularySize `[get]`
	Returns the number of unique characters in the data.

override char	BOS `[get]`
	Return the special begin of sequence character.

override char	EOS `[get]`
	Return the special end of sequence character.

Properties inherited from MyCaffe.layers.gpt.InputData
abstract List< string >	RawData `[get]`
	Returns the raw data.

abstract uint	TokenSize `[get]`
	Returns the size of a single token (e.g. 1 for character data).

abstract uint	VocabularySize `[get]`
	Returns the size of the vocabulary.

abstract char	BOS `[get]`
	Return the special begin of sequence character.

abstract char	EOS `[get]`
	Return the special end of sequence character.

Additional Inherited Members
Protected Attributes inherited from MyCaffe.layers.gpt.InputData
Random	m_random
	Specifies the random object made available to the derived classes.

Detailed Description

The TextListData manages parallel lists of data where the first list contains the encoder input data and the second the decoder input/target data.

Definition at line 608 of file TokenizedDataPairsLayer.cs.

Member Enumeration Documentation

◆ VOCABUARY_TYPE

enum MyCaffe.layers.gpt.TextListData.VOCABUARY_TYPE

Defines the vocabulary type to use.

Enumerator
CHARACTER	Specifies a character vocabulary.
WORD	Specifies a space separated word vocabulary.

Definition at line 621 of file TokenizedDataPairsLayer.cs.
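
The difference between the two enumerators can be illustrated with a minimal sketch. This is not the MyCaffe implementation: the class `VocabularySketch`, the enum `VocabKind`, and both methods are hypothetical names introduced only for this example, and the first-appearance index assignment is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of MyCaffe: shows the two vocabulary granularities.
public static class VocabularySketch
{
    public enum VocabKind { Character, Word }

    // Split text into tokens: one per character, or one per space-separated word.
    public static List<string> Split(string text, VocabKind kind)
    {
        if (kind == VocabKind.Character)
            return text.Select(c => c.ToString()).ToList();

        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList();
    }

    // Assign each distinct token an index in order of first appearance (an assumption).
    public static Dictionary<string, int> Build(IEnumerable<string> lines, VocabKind kind)
    {
        var vocab = new Dictionary<string, int>();
        foreach (string line in lines)
        {
            foreach (string tok in Split(line, kind))
            {
                if (!vocab.ContainsKey(tok))
                    vocab.Add(tok, vocab.Count);
            }
        }
        return vocab;
    }
}
```

In this sketch a character vocabulary treats the space as a token of its own, while a word vocabulary uses spaces only as separators, so the same text generally yields vocabularies of different sizes.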

Constructor & Destructor Documentation

◆ TextListData()

MyCaffe.layers.gpt.TextListData.TextListData	(	Log	log,
		string	strSrcFile,
		string	strVocabFile,
		bool	bIncludeTarget,
		TokenizedDataParameter.VOCABULARY_TYPE	vocabType,
		int?	nRandomSeed = `null`,
		Phase	phase = `Phase.NONE`
	)

The constructor.

Parameters

log	Specifies the output log.
strSrcFile	Specifies the text file name for the data source.
strVocabFile	Specifies the vocabulary file (used by the SENTENCEPIECE vocabulary type).
bIncludeTarget	Specifies to create the target tokens.
vocabType	Specifies the vocabulary type to use.
nRandomSeed	Optionally, specifies a random seed for testing.
phase	Specifies the currently running phase.

Definition at line 643 of file TokenizedDataPairsLayer.cs.

Member Function Documentation

◆ Detokenize() [1/2]

override string MyCaffe.layers.gpt.TextListData.Detokenize	(	float[]	rgfTokIdx,
		int	nStartIdx,
		int	nCount,
		bool	bIgnoreBos,
		bool	bIgnoreEos
	)

virtual

Detokenize an array into a string.

Parameters

rgfTokIdx	Specifies the array of tokens to detokenize.
nStartIdx	Specifies the starting index where detokenizing begins.
nCount	Specifies the number of tokens to detokenize.
bIgnoreBos	Specifies to ignore the BOS token.
bIgnoreEos	Specifies to ignore the EOS token.

Returns: The detokenized string is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 890 of file TokenizedDataPairsLayer.cs.
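
How the parameters interact can be sketched as follows. This is a simplified illustration, not the code in TokenizedDataPairsLayer.cs: the class `DetokenizeSketch`, the index-to-token map, and the `BosIdx`/`EosIdx` values are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of MyCaffe.
public static class DetokenizeSketch
{
    // Assumed special token indexes, chosen only for this example.
    public const int BosIdx = 1;
    public const int EosIdx = 2;

    // Convert tokens[start .. start + count) back to text, optionally skipping BOS/EOS.
    public static string Detokenize(float[] tokens, int start, int count,
        bool ignoreBos, bool ignoreEos, IReadOnlyDictionary<int, string> indexToToken)
    {
        var sb = new StringBuilder();
        int first = Math.Max(0, start);
        int end = Math.Min(start + count, tokens.Length);

        for (int i = first; i < end; i++)
        {
            // Token indexes are stored as floats, so cast back to int before the lookup.
            int idx = (int)tokens[i];

            if (ignoreBos && idx == BosIdx) continue;
            if (ignoreEos && idx == EosIdx) continue;

            // Indexes missing from the map are skipped silently in this sketch.
            if (indexToToken.TryGetValue(idx, out string tok))
                sb.Append(tok);
        }

        return sb.ToString();
    }
}
```

The sketch concatenates tokens directly, which suits a character vocabulary; a word vocabulary would also need a separator between tokens.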

◆ Detokenize() [2/2]

override string MyCaffe.layers.gpt.TextListData.Detokenize	(	int	nTokIdx,
		bool	bIgnoreBos,
		bool	bIgnoreEos
	)

virtual

Detokenize a single token.

Parameters

nTokIdx	Specifies an index to the token to be detokenized.
bIgnoreBos	Specifies to ignore the BOS token.
bIgnoreEos	Specifies to ignore the EOS token.

Returns: The detokenized character is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 912 of file TokenizedDataPairsLayer.cs.

◆ GetData()

override Tuple< float[], float[]> MyCaffe.layers.gpt.TextListData.GetData	(	int	nBatchSize,
		int	nBlockSize,
		InputData	trgData,
		out int[]	rgnIdx
	)

virtual

Retrieves random blocks from the source data. The data and target hold the same sequence, with the target offset by +1 element from the data.

Parameters

nBatchSize	Specifies the batch size.
nBlockSize	Specifies the block size.
trgData	Specifies the matching target data used to verify that both source and target have data at each chosen index.
rgnIdx	Returns an array of the indexes of the data returned.

Returns: A tuple containing the data and target is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 754 of file TokenizedDataPairsLayer.cs.
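
The offset-by-one relationship described above can be sketched over a single flat token array. This is an illustration only, not the MyCaffe implementation: `BlockSamplingSketch` and `Sample` are hypothetical names, and the check against trgData is not modeled.

```csharp
using System;

// Hypothetical helper, not part of MyCaffe.
public static class BlockSamplingSketch
{
    // Returns (data, target), each batchSize * blockSize long; starts receives the chosen offsets.
    public static Tuple<float[], float[]> Sample(float[] tokens, int batchSize, int blockSize,
        Random random, out int[] starts)
    {
        if (tokens.Length < blockSize + 1)
            throw new ArgumentException("Need at least blockSize + 1 tokens.");

        float[] data = new float[batchSize * blockSize];
        float[] target = new float[batchSize * blockSize];
        starts = new int[batchSize];

        for (int b = 0; b < batchSize; b++)
        {
            // Random.Next's upper bound is exclusive, so start + blockSize is always a valid index.
            int start = random.Next(tokens.Length - blockSize);
            starts[b] = start;

            // The target block is the data block shifted forward by one element.
            Array.Copy(tokens, start, data, b * blockSize, blockSize);
            Array.Copy(tokens, start + 1, target, b * blockSize, blockSize);
        }

        return Tuple.Create(data, target);
    }
}
```

Passing a seeded `Random` makes the chosen offsets repeatable, which is the purpose the constructor's optional nRandomSeed is documented to serve for testing.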

◆ GetDataAt()

override Tuple< float[], float[]> MyCaffe.layers.gpt.TextListData.GetDataAt	(	int	nBatchSize,
		int	nBlockSize,
		int[]	rgnIdx
	)

virtual

Fill a batch of data from a specified array of indexes.

Parameters

nBatchSize	Specifies the number of blocks in the batch.
nBlockSize	Specifies the size of each block.
rgnIdx	Specifies the array of indexes to the data to be retrieved.