API Reference

Training

class pycrfsuite.Trainer

The trainer class.

This class maintains a data set for training, and provides an interface to various training algorithms.

append_dicts(self, xseq, yseq, int group=0)

Append an instance (item/label sequence) to the data set.

Parameters:

xseq : a sequence of {string: float} dicts

The item sequence of the instance. Each dict should be a string -> float mapping where keys are observed features and values are their weights.

yseq : a sequence of strings

The label sequence of the instance. The number of elements in yseq must be identical to that in xseq.

group : int, optional

The group number of the instance. Group numbers are used to select a subset of the data for holdout evaluation.
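As a minimal sketch of the dict format described above, the helpers below build one feature dict per token and append the resulting instance. The names token_features and add_sentence and the feature names are illustrative assumptions, not part of pycrfsuite; trainer is assumed to be any object that exposes append_dicts() as documented here.

```python
def token_features(tokens, i):
    """Build a {string: float} feature dict for the token at position i."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "lower=" + word.lower(): 1.0,
        "is_title": float(word.istitle()),
        "length": float(len(word)),
    }
    if i == 0:
        features["BOS"] = 1.0  # first token of the sequence
    if i == len(tokens) - 1:
        features["EOS"] = 1.0  # last token of the sequence
    return features


def add_sentence(trainer, tokens, labels, group=0):
    """Append one (item sequence, label sequence) instance to the trainer."""
    if len(tokens) != len(labels):
        raise ValueError("xseq and yseq must have the same length")
    xseq = [token_features(tokens, i) for i in range(len(tokens))]
    # Hypothetical wrapper around the documented append_dicts() call.
    trainer.append_dicts(xseq, list(labels), group)
    return xseq
```

The length check mirrors the requirement that yseq has as many elements as xseq, so a mismatch is reported before the data reaches the trainer.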

append_stringlists(self, xseq, yseq, int group=0)

Append an instance (item/label sequence) to the data set.

Parameters:

xseq : a sequence of lists of strings

The item sequence of the instance. Each list should contain observed features (as strings).

yseq : a sequence of strings

The label sequence of the instance. The number of elements in yseq must be identical to that in xseq.

group : int, optional

The group number of the instance. Group numbers are used to select a subset of the data for holdout evaluation.
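The string-list format carries feature names only, with no explicit weights. The sketch below shows one way such lists might be produced; token_strings, to_stringlists and the feature names are illustrative assumptions.

```python
def token_strings(tokens, i):
    """Return the observed features of token i as a list of strings."""
    word = tokens[i]
    features = ["lower=" + word.lower(), "suffix2=" + word[-2:]]
    if word.isdigit():
        features.append("is_digit")
    if i > 0:
        features.append("prev=" + tokens[i - 1].lower())
    return features


def to_stringlists(tokens):
    """Convert a token list into a sequence of lists of strings."""
    return [token_strings(tokens, i) for i in range(len(tokens))]
```

The result can then be passed as xseq, for example trainer.append_stringlists(to_stringlists(tokens), labels), where trainer is a pycrfsuite.Trainer instance.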

clear(self)

Remove all instances in the data set.

get(self, name)

Get the value of a training parameter. This function gets a parameter value for the graphical model and training algorithm specified by the Trainer.select() method.

Parameters:

name : string

The parameter name.

help(self, name)

Get the description of a training parameter. This function obtains the help message for the parameter with the given name. The graphical model and training algorithm must be selected with the Trainer.select() method before calling this method.

Parameters:

name : string

The parameter name.

Returns:

string :

The description (help message) of the parameter.

message(self, message)

Receive messages from the training algorithm. Override this method to receive messages from the training process.

By default, this method uses the Python logging subsystem to output the messages (the logger name is ‘pycrfsuite’).

Parameters:

message : string

The message.
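Because the default implementation logs to the ‘pycrfsuite’ logger, training output can be collected with the standard logging module alone, without overriding message(). A minimal sketch follows; ListHandler and capture_training_log are hypothetical names, and the level is set to DEBUG because the level at which messages are logged is not stated here.

```python
import logging


class ListHandler(logging.Handler):
    """Logging handler that keeps log messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def capture_training_log(level=logging.DEBUG):
    """Attach a collecting handler to the 'pycrfsuite' logger."""
    logger = logging.getLogger("pycrfsuite")
    logger.setLevel(level)
    handler = ListHandler()
    logger.addHandler(handler)
    return handler
```

Calling capture_training_log() before training returns a handler whose messages list fills up as the trainer reports progress.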

params(self)

Obtain the list of parameters.

This function returns the list of parameter names available for the graphical model and training algorithm specified by the Trainer.select() method.

Returns:

list of strings :

The list of parameters available for the current graphical model and training algorithm.
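The params(), get() and help() methods can be combined to inspect what a selected algorithm accepts. The helper below is a sketch; describe_parameters is a hypothetical name, and trainer is assumed to be an object on which select() has already been called.

```python
def describe_parameters(trainer):
    """Map each available parameter name to its (value, help text) pair."""
    return {
        name: (trainer.get(name), trainer.help(name))
        for name in trainer.params()
    }
```

Printing the returned dict is a quick way to see the current values next to their descriptions before changing any of them.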

select(self, string algorithm, string type='crf1d')

Initialize the training algorithm.

Parameters:

algorithm : {‘lbfgs’, ‘l2sgd’, ‘ap’, ‘pa’, ‘arow’}

The name of the training algorithm.

  • ‘lbfgs’ for Gradient descent using the L-BFGS method
  • ‘l2sgd’ for Stochastic Gradient Descent with an L2 regularization term
  • ‘ap’ for Averaged Perceptron
  • ‘pa’ for Passive Aggressive
  • ‘arow’ for Adaptive Regularization Of Weight Vector

type : string, optional

The name of the graphical model.

set(self, name, value)

Set a training parameter. This function sets a parameter value for the graphical model and training algorithm specified by the Trainer.select() method.

Parameters:

name : string

The parameter name.

value : string

The value of the parameter.
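Since parameters belong to the selected algorithm, select() has to come before set(). The sketch below wraps that order in one helper and converts values to strings, as the value parameter is documented as a string. The name configure is hypothetical, and which parameter names are valid depends on the algorithm, so the helper checks them against params().

```python
def configure(trainer, algorithm, params=None, model_type="crf1d"):
    """Select an algorithm, then apply parameter overrides to it."""
    params = params or {}
    trainer.select(algorithm, model_type)
    available = set(trainer.params())
    for name, value in params.items():
        if name not in available:
            raise KeyError("unknown parameter for %s: %s" % (algorithm, name))
        trainer.set(name, str(value))  # value is documented as a string
    # Read the values back so the caller sees what the trainer stored.
    return {name: trainer.get(name) for name in params}
```

A call might look like configure(trainer, 'l2sgd', {'c2': 0.1}), where ‘c2’ stands in for whatever names params() reports for the chosen algorithm.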

train(self, model, int holdout=-1)

Run the training algorithm. This function starts the training algorithm on the data set built with the Trainer.append_dicts() or Trainer.append_stringlists() methods.

Parameters:

model : string

The name of the file in which the trained model is stored. If this value is empty, this function does not write out a model file.

holdout : int, optional

The group number used for holdout evaluation. Instances with this group number are not used for training, only for holdout evaluation. The default value is -1, meaning “use all instances for training”.
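The group argument of the append methods and the holdout argument of train() work together. The sketch below assigns instances to groups round-robin and holds one group out; train_with_holdout, n_groups and the round-robin scheme are illustrative assumptions, and trainer is assumed to expose clear(), append_dicts() and train() as documented above.

```python
def train_with_holdout(trainer, instances, model_path, n_groups=5, holdout=0):
    """Spread instances over n_groups groups and hold one group out.

    instances is an iterable of (xseq, yseq) pairs in the dict format.
    Returns the number of instances placed in each group.
    """
    if not 0 <= holdout < n_groups:
        raise ValueError("holdout must be one of the assigned group numbers")
    trainer.clear()  # start from an empty data set
    sizes = [0] * n_groups
    for index, (xseq, yseq) in enumerate(instances):
        group = index % n_groups  # round-robin group assignment
        trainer.append_dicts(xseq, yseq, group)
        sizes[group] += 1
    # Instances in group `holdout` are evaluated on, not trained on.
    trainer.train(model_path, holdout)
    return sizes
```

Running the function once per group number gives a simple cross-validation loop, at the cost of re-appending the data each time.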

Tagging

class pycrfsuite.Tagger

The tagger class.

This class provides the functionality for predicting label sequences for input sequences using a model.

close(self)

Close the model.

dump(self, filename=None)

Dump a CRF model in plain-text format.

Parameters:

filename : string, optional

File name to dump the model to. If None, the model is dumped to stdout.

info(self)

Return a ParsedDump structure with model internal information.

labels(self)

Obtain the list of labels.

Returns:

list of strings :

The list of labels in the model.

marginal(self, y, pos)

Compute the marginal probability of the label y at position pos for the current input sequence (i.e. a sequence set using one of the Tagger.set() methods or a sequence used in a previous Tagger.tag() call).

Parameters:

y : string

The label.

pos : int

The position of the label.

Returns:

float :

The marginal probability of the label y at position pos.
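One use of marginal() is to attach a confidence value to each predicted label. The sketch below relies on the statement above that a sequence passed to Tagger.tag() becomes the current input sequence. The name tag_with_confidence is hypothetical, and positions are assumed to be zero-based, which is not stated in this reference.

```python
def tag_with_confidence(tagger, xseq):
    """Return a (label, marginal probability) pair for each position."""
    labels = tagger.tag(xseq)
    # tag() makes xseq the current sequence, so marginal() refers to it.
    # Assumption: positions are counted from 0.
    return [(label, tagger.marginal(label, pos))
            for pos, label in enumerate(labels)]
```

Positions with a low marginal probability are natural candidates for manual review.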

open(self, name)

Open a model file.

Parameters:

name : string

The file name of the model file.

probability(self, yseq)

Compute the probability of the label sequence for the current input sequence (i.e. a sequence set using the Tagger.set() method or a sequence used in a previous Tagger.tag() call).

Parameters:

yseq : list of strings

The label sequence.

Returns:

float :

The probability P(yseq|xseq).
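probability() can be used to compare alternative labelings of the same input. A minimal sketch, in which rank_label_sequences is a hypothetical helper and tagger is assumed to have a model open:

```python
def rank_label_sequences(tagger, xseq, candidates):
    """Sort candidate label sequences by P(yseq|xseq), highest first."""
    tagger.set(xseq)  # make xseq the current input sequence
    scored = [(tagger.probability(list(yseq)), list(yseq))
              for yseq in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Each candidate must have the same number of labels as xseq has items.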

set(self, xseq, feature_format=None)

Set an instance (item sequence) for future calls of Tagger.tag(), Tagger.probability() and Tagger.marginal() methods.

See also: Tagger.set_dicts(), Tagger.set_stringlists().

Parameters:

xseq : item sequence

The sequence of features.

feature_format : {‘stringlist’, ‘dict’}, optional

Item sequence data format.

  • ‘stringlist’ means that xseq must be a sequence of lists of strings, where each list contains observed features (all weights are assumed to be 1.0).
  • ‘dict’ means that xseq must be a sequence of {string: float} dicts, where each dict has observed features as keys and their weights as values.

By default, the feature_format passed to the Tagger constructor is used, or ‘stringlist’ if no feature_format was passed to the Tagger constructor.

set_dicts(self, xseq)

Set an instance (item sequence) for future calls of Tagger.tag(), Tagger.probability() and Tagger.marginal() methods.

Parameters:

xseq : a sequence of {string: float} dicts

The sequence of feature dicts to be tagged. Each dict should be a string -> float mapping where keys are observed features and values are their weights.

set_stringlists(self, xseq)

Set an instance (item sequence) for future calls of Tagger.tag(), Tagger.probability() and Tagger.marginal() methods.

Parameters:

xseq : a sequence of lists of strings

The sequence of feature lists. Each list should contain observed features (as strings).

tag(self, xseq=None, feature_format=None)

Predict the label sequence for the item sequence.

Parameters:

xseq : item sequence, optional

The sequence of features. If omitted, the current sequence is used (e.g. a sequence set using the Tagger.set() method).

feature_format : {‘stringlist’, ‘dict’}, optional

Item sequence data format.

  • ‘stringlist’ means that xseq must be a sequence of lists of strings, where each list contains observed features (all weights are assumed to be 1.0).
  • ‘dict’ means that xseq must be a sequence of {string: float} dicts, where each dict has observed features as keys and their weights as values.

By default, the feature_format passed to the Tagger constructor is used, or ‘stringlist’ if no feature_format was passed to the Tagger constructor.

Returns:

list of strings :

The label sequence predicted.
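A typical tagging session opens a model, tags some sequences and closes the model again. The sketch below wraps those three calls so that close() runs even if tagging fails. The name tag_all is hypothetical, and tagger is assumed to be an object with open(), tag() and close() as documented above.

```python
def tag_all(tagger, model_path, sequences, feature_format=None):
    """Open a model, tag every item sequence, and always close the model."""
    tagger.open(model_path)
    try:
        return [tagger.tag(xseq, feature_format) for xseq in sequences]
    finally:
        tagger.close()
```

Passing feature_format=None leaves the choice of format to the tagger's default, as described for the feature_format parameter.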

Debugging

class pycrfsuite._dumpparser.ParsedDump

CRFsuite model parameters. Objects of this type are returned by the pycrfsuite.Tagger.info() method.

Attributes

transitions : dict

{(from_label, to_label): weight} dict with learned transition weights.

state_features : dict

{(attribute, label): weight} dict with learned (attribute, label) weights.

header : dict

Metadata from the file header.

labels : dict

{name: internal_id} dict with model labels.

attributes : dict

{name: internal_id} dict with known attributes.
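Because transitions and state_features are plain dicts keyed by tuples, they can be ranked with ordinary Python. The helpers below are a sketch; top_transitions and features_for_label are hypothetical names, and info is assumed to be the object returned by Tagger.info().

```python
def top_transitions(info, n=5):
    """Return the n highest-weighted (from_label, to_label, weight) triples."""
    ranked = sorted(info.transitions.items(),
                    key=lambda item: item[1], reverse=True)
    return [(src, dst, weight) for (src, dst), weight in ranked[:n]]


def features_for_label(info, label, n=5):
    """Return the n highest-weighted (attribute, weight) pairs for a label."""
    pairs = [(attr, weight)
             for (attr, lab), weight in info.state_features.items()
             if lab == label]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
```

Sorting in ascending order instead shows the transitions and features the model has learned to penalize most.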