The TextInputData manages character data read in from a text file. Data is tokenized into indexes that reference each character within the vocabulary.

Inherits MyCaffe.layers.gpt.InputData.

Public Member Functions
	TextInputData (string strSrc, TokenizedDataParameter.VOCABULARY_TYPE vocabType=TokenizedDataParameter.VOCABULARY_TYPE.CHARACTER, int? nRandomSeed=null, string strDebugIndexFile=null, Phase phase=Phase.NONE)
	The constructor.

override bool	GetDataAvailabilityAt (int nIdx, bool bIncludeSrc, bool bIncludeTrg)
	Returns true if data is available at the given index.

override Tuple< float[], float[]>	GetData (int nBatchSize, int nBlockSize, InputData trgData, out int[] rgnIdx)
	Retrieves random blocks from the source data. The target holds the same data as the input, offset by +1 element.

override Tuple< float[], float[]>	GetDataAt (int nBatchSize, int nBlockSize, int[] rgnIdx)
	Not used by this class; throws NotImplementedException.

override List< int >	Tokenize (string str, bool bAddBos, bool bAddEos)
	Tokenize an input string using the internal vocabulary.

override string	Detokenize (int nTokIdx, bool bIgnoreBos, bool bIgnoreEos)
	Detokenize a single token.

override string	Detokenize (float[] rgfTokIdx, int nStartIdx, int nCount, bool bIgnoreBos, bool bIgnoreEos)
	Detokenize an array into a string.

Public Member Functions inherited from MyCaffe.layers.gpt.InputData
	InputData (int? nRandomSeed=null)
	The constructor.

Properties
override List< string >	RawData `[get]`
	Return the raw data.

override uint	TokenSize `[get]`
	The text data token size is a single character.

override uint	VocabularySize `[get]`
	Returns the number of unique characters in the data.

override char	BOS `[get]`
	Return the special begin of sequence character.

override char	EOS `[get]`
	Return the special end of sequence character.

Properties inherited from MyCaffe.layers.gpt.InputData
abstract List< string >	RawData `[get]`
	Returns the raw data.

abstract uint	TokenSize `[get]`
	Returns the size of a single token (e.g. 1 for character data).

abstract uint	VocabularySize `[get]`
	Returns the size of the vocabulary.

abstract char	BOS `[get]`
	Return the special begin of sequence character.

abstract char	EOS `[get]`
	Return the special end of sequence character.

Additional Inherited Members
Protected Attributes inherited from MyCaffe.layers.gpt.InputData
Random	m_random
	Specifies the random object made available to the derived classes.

Detailed Description

The TextInputData manages character data read in from a text file. Data is tokenized into indexes that reference each character within the vocabulary.

For example, if the data source contains the text "a red fox ran.", the vocabulary would be:

Vocabulary: ' ', '.', 'a', 'd', 'e', 'f', 'n', 'o', 'r', 'x'
Index Vals:  0,   1,   2,   3,   4,   5,   6,   7,   8,   9

Tokenizing is the process of converting each input character to its respective 'token', which in this case is the character's index value in the vocabulary. For example, 'a' is tokenized as index 2 and 'd' is tokenized as index 3.
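
The character-to-index mapping described above can be sketched in a few lines of C#. This is a minimal illustration only, not the MyCaffe implementation: the class `CharVocabSketch` and its methods are hypothetical names, and sorting the unique characters before assigning indexes is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of MyCaffe) that builds a character
// vocabulary and tokenizes text against it.
public static class CharVocabSketch
{
    // Collect the unique characters of the text, sort them (an assumption),
    // and assign each character its position as its index value.
    public static Dictionary<char, int> Build(string strText)
    {
        List<char> rgChars = strText.Distinct().ToList();
        rgChars.Sort();

        Dictionary<char, int> rgCharToIdx = new Dictionary<char, int>();
        for (int i = 0; i < rgChars.Count; i++)
        {
            rgCharToIdx.Add(rgChars[i], i);
        }

        return rgCharToIdx;
    }

    // Convert each character of the input to its index value.
    // Throws KeyNotFoundException for characters outside the vocabulary.
    public static List<int> Tokenize(Dictionary<char, int> rgCharToIdx, string str)
    {
        List<int> rgTokens = new List<int>();
        foreach (char ch in str)
        {
            rgTokens.Add(rgCharToIdx[ch]);
        }

        return rgTokens;
    }
}
```

With the vocabulary built from "a red fox ran.", looking up 'a' gives index 2 and 'd' gives index 3 under these assumptions.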

Definition at line 500 of file TokenizedDataLayer.cs.

Constructor & Destructor Documentation

◆ TextInputData()

MyCaffe.layers.gpt.TextInputData.TextInputData	(	string	strSrc,
		TokenizedDataParameter.VOCABULARY_TYPE	vocabType = `TokenizedDataParameter.VOCABULARY_TYPE.CHARACTER`,
		int?	nRandomSeed = `null`,
		string	strDebugIndexFile = `null`,
		Phase	phase = `Phase.NONE`
	)

The constructor.

Parameters

strSrc	Specifies the data source as the filename of the text data file.
vocabType	Specifies the vocabulary type to use.
nRandomSeed	Optionally, specifies a random seed for testing.
strDebugIndexFile	Optionally, specifies the debug index file containing index values in the form 'idx = #', one per line.
phase	Specifies the currently running phase.

Definition at line 520 of file TokenizedDataLayer.cs.
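
The parsing rules for the debug index file are not given here beyond the 'idx = #' form. The sketch below shows one way such lines might be read; `DebugIndexFileSketch` and `ParseIndexLines` are hypothetical names that do not appear in MyCaffe, and silently skipping malformed lines is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of MyCaffe) that reads index values
// from lines of the form 'idx = #'.
public static class DebugIndexFileSketch
{
    public static List<int> ParseIndexLines(IEnumerable<string> rgstrLines)
    {
        List<int> rgIdx = new List<int>();

        foreach (string strLine in rgstrLines)
        {
            // Expect exactly one '=' separating the name from the value.
            string[] rgstr = strLine.Split('=');
            if (rgstr.Length != 2 || rgstr[0].Trim() != "idx")
                continue; // assumption: lines in any other form are skipped

            int nIdx;
            if (int.TryParse(rgstr[1].Trim(), out nIdx))
                rgIdx.Add(nIdx);
        }

        return rgIdx;
    }
}
```

The lines could be supplied by, for example, `System.IO.File.ReadLines` called with the debug index file name.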

Member Function Documentation

◆ Detokenize() [1/2]

override string MyCaffe.layers.gpt.TextInputData.Detokenize	(	float[]	rgfTokIdx,
		int	nStartIdx,
		int	nCount,
		bool	bIgnoreBos,
		bool	bIgnoreEos
	)

virtual

Detokenize an array into a string.

Parameters

rgfTokIdx	Specifies the array of tokens to detokenize.
nStartIdx	Specifies the starting index where detokenizing begins.
nCount	Specifies the number of tokens to detokenize.
bIgnoreBos	Specifies to ignore the BOS token.
bIgnoreEos	Specifies to ignore the EOS token.

Returns: The detokenized string is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 694 of file TokenizedDataLayer.cs.
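
As a rough illustration of detokenizing a range of an array, the sketch below maps each float token index back to a character and appends it to a string. It is not the MyCaffe implementation: `DetokenizeSketch` is a hypothetical name, the BOS and EOS character values are placeholders chosen for the example, and treating "ignore" as "skip the token" is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of MyCaffe) that converts a range of
// token indexes back into a string.
public static class DetokenizeSketch
{
    // Placeholder special characters; the actual values are an assumption.
    public const char BOS = (char)1;
    public const char EOS = (char)2;

    public static string Detokenize(IList<char> rgIdxToChar, float[] rgfTokIdx,
        int nStartIdx, int nCount, bool bIgnoreBos, bool bIgnoreEos)
    {
        StringBuilder sb = new StringBuilder();

        for (int i = nStartIdx; i < nStartIdx + nCount; i++)
        {
            // Token indexes are stored as floats, so cast before the lookup.
            char ch = rgIdxToChar[(int)rgfTokIdx[i]];

            if (bIgnoreBos && ch == BOS)
                continue;
            if (bIgnoreEos && ch == EOS)
                continue;

            sb.Append(ch);
        }

        return sb.ToString();
    }
}
```

Passing the index-to-character table as an `IList<char>` keeps the sketch independent of how the vocabulary is actually stored.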

◆ Detokenize() [2/2]

override string MyCaffe.layers.gpt.TextInputData.Detokenize	(	int	nTokIdx,
		bool	bIgnoreBos,
		bool	bIgnoreEos
	)

virtual

Detokenize a single token.

Parameters

nTokIdx	Specifies an index to the token to be detokenized.
bIgnoreBos	Specifies to ignore the BOS token.
bIgnoreEos	Specifies to ignore the EOS token.

Returns: The detokenized character is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 680 of file TokenizedDataLayer.cs.

◆ GetData()

override Tuple< float[], float[]> MyCaffe.layers.gpt.TextInputData.GetData	(	int	nBatchSize,
		int	nBlockSize,
		InputData	trgData,
		out int[]	rgnIdx
	)

virtual

Retrieves random blocks from the source data. The target holds the same data as the input, offset by +1 element.

Parameters

nBatchSize	Specifies the batch size.
nBlockSize	Specifies the block size.
trgData	Specifies the target data provided to check for data availability at the selected data index.
rgnIdx	Returns an array of indexes of the data returned.

Returns: A tuple containing the data and target is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 599 of file TokenizedDataLayer.cs.
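
The relationship between the data and the target can be sketched as follows. This is an illustration under stated assumptions rather than the MyCaffe code: `BlockSamplerSketch` and `GetBlocks` are hypothetical names, the tokens are passed in as a plain `int[]`, and the availability check against the target data is omitted.

```csharp
using System;

// Hypothetical helper (not part of MyCaffe) that samples random blocks
// where the target is the data shifted forward by one token.
public static class BlockSamplerSketch
{
    public static Tuple<float[], float[]> GetBlocks(int[] rgTokens, int nBatchSize,
        int nBlockSize, Random random, out int[] rgnIdx)
    {
        if (rgTokens.Length < nBlockSize + 1)
            throw new ArgumentException("Need at least nBlockSize + 1 tokens.");

        float[] rgData = new float[nBatchSize * nBlockSize];
        float[] rgTarget = new float[nBatchSize * nBlockSize];
        rgnIdx = new int[nBatchSize];

        for (int i = 0; i < nBatchSize; i++)
        {
            // Pick a start that leaves room for nBlockSize + 1 tokens.
            int nStart = random.Next(rgTokens.Length - nBlockSize);
            rgnIdx[i] = nStart;

            for (int j = 0; j < nBlockSize; j++)
            {
                rgData[i * nBlockSize + j] = rgTokens[nStart + j];
                // The target is the token one position after the data token.
                rgTarget[i * nBlockSize + j] = rgTokens[nStart + j + 1];
            }
        }

        return Tuple.Create(rgData, rgTarget);
    }
}
```

Each block's target at position j equals its data at position j + 1, so a model trained on these pairs learns to predict the next token.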

◆ GetDataAt()

override Tuple< float[], float[]> MyCaffe.layers.gpt.TextInputData.GetDataAt	(	int	nBatchSize,
		int	nBlockSize,
		int[]	rgnIdx
	)

virtual

Not used by this class; throws NotImplementedException.

Parameters

nBatchSize	Specifies the number of blocks in the batch.
nBlockSize	Specifies the size of each block.
rgnIdx	Not used.

Exceptions

NotImplementedException

Implements MyCaffe.layers.gpt.InputData.

Definition at line 656 of file TokenizedDataLayer.cs.

◆ GetDataAvailabilityAt()

override bool MyCaffe.layers.gpt.TextInputData.GetDataAvailabilityAt	(	int	nIdx,
		bool	bIncludeSrc,
		bool	bIncludeTrg
	)

virtual

Returns true if data is available at the given index.

Parameters

nIdx	Specifies the index to check.
bIncludeSrc	Specifies to include the source in the check.
bIncludeTrg	Specifies to include the target in the check.

Returns: If the data is available, true is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 585 of file TokenizedDataLayer.cs.

◆ Tokenize()

override List< int > MyCaffe.layers.gpt.TextInputData.Tokenize	(	string	str,
		bool	bAddBos,
		bool	bAddEos
	)

virtual

Tokenize an input string using the internal vocabulary.

Parameters

str	Specifies the string to tokenize.
bAddBos	Specifies whether to add the begin of sequence (BOS) token.
bAddEos	Specifies whether to add the end of sequence (EOS) token.

Returns: A list of tokens corresponding to the input is returned.

Implements MyCaffe.layers.gpt.InputData.

Definition at line 668 of file TokenizedDataLayer.cs.

Property Documentation

◆ BOS

override char MyCaffe.layers.gpt.TextInputData.BOS

get

Return the special begin of sequence character.

Definition at line 712 of file TokenizedDataLayer.cs.

◆ EOS

override char MyCaffe.layers.gpt.TextInputData.EOS

get

Return the special end of sequence character.

Definition at line 720 of file TokenizedDataLayer.cs.

◆ RawData

override List<string> MyCaffe.layers.gpt.TextInputData.RawData

get

Return the raw data.

Definition at line 554 of file TokenizedDataLayer.cs.

◆ TokenSize

override uint MyCaffe.layers.gpt.TextInputData.TokenSize

get

The text data token size is a single character.

Definition at line 565 of file TokenizedDataLayer.cs.

◆ VocabularySize

override uint MyCaffe.layers.gpt.TextInputData.VocabularySize

get

Returns the number of unique characters in the data.

Definition at line 573 of file TokenizedDataLayer.cs.

The documentation for this class was generated from the following file:

C:/Data/Data/SS_Projects/Intelligence/GitHub/MyCaffe/MyCaffe.layers.gpt/layers.gpt/TokenizedDataLayer.cs
