pybdm package¶
Subpackages¶
Submodules¶
pybdm.algorithms module¶
Algorithms based on BDM objects.
class pybdm.algorithms.PerturbationExperiment(bdm, X=None, metric='bdm')[source]¶
Bases: object
Perturbation experiment class.
A perturbation experiment studies the change of BDM / entropy under changes applied to the underlying dataset. This is the main tool for detecting parts of a system that have causal significance, as opposed to noise parts.
Parts which, when perturbed, yield a negative contribution to the overall complexity after the change are likely to be important for the system, since their removal makes it more noisy. On the other hand, parts that yield a positive contribution to the overall complexity after the change are likely to be noise, since they elongate the system’s description length.
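A minimal sketch of this interpretation, using the class API documented below (the dataset, the index selection via numpy.argwhere, and the variable names are purely illustrative):
>>> import numpy as np
>>> from pybdm import BDM, PerturbationExperiment
>>> X = np.random.randint(0, 2, (100, 100))
>>> pe = PerturbationExperiment(BDM(ndim=2), X, metric='bdm')
>>> idx = np.argwhere(X == 1)       # candidate elements to perturb
>>> deltas = pe.run(idx)            # BDM change per perturbed element
>>> important = idx[deltas < 0]     # perturbation lowers complexity
>>> noise_like = idx[deltas > 0]    # perturbation raises complexity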
bdm¶
BDM object. It has to be configured properly to handle the dataset that is to be studied.
- Type
BDM
X¶
Dataset for perturbation analysis. May be set later.
- Type
array_like (optional)
metric¶
Which metric to use for perturbing.
- Type
{‘bdm’, ‘ent’}
See also
pybdm.bdm.BDM
BDM computations
Examples
>>> import numpy as np
>>> from pybdm import BDM, PerturbationExperiment
>>> X = np.random.randint(0, 2, (100, 100))
>>> bdm = BDM(ndim=2)
>>> pe = PerturbationExperiment(bdm, metric='bdm')
>>> pe.set_data(X)
>>> idx = np.argwhere(X)  # Perturb only ones (1 --> 0)
>>> delta_bdm = pe.run(idx)
>>> len(delta_bdm) == idx.shape[0]
True
More examples can be found in Usage.
property ndim¶
Getter for the number of axes of the data.
perturb(idx, value=-1, keep_changes=False)[source]¶
Perturb an element of the dataset.
- Parameters
idx (tuple) – Index tuple of an element.
value (int or callable or None) – Value to assign. If negative then a new value is randomly selected from the set of other possible values. For binary data this is just a bit flip and no random number generation is involved in the process.
keep_changes (bool) – If True then changes in the dataset are persistent, so each perturbation step depends on the previous ones.
- Returns
BDM value change.
- Return type
float
Examples
>>> import numpy as np
>>> from pybdm import BDM, PerturbationExperiment
>>> bdm = BDM(ndim=1)
>>> X = np.ones((30, ), dtype=int)
>>> perturbation = PerturbationExperiment(bdm, X)
>>> perturbation.perturb((10, ), -1)
26.91763012739709
run(idx=None, values=None, keep_changes=False)[source]¶
Run perturbation experiment.
- Parameters
idx (array_like or None) – Numpy integer array providing indexes (in rows) of elements to perturb. If None then all elements are perturbed.
values (array_like or None) – Values to assign during perturbation. Negative values correspond to changing the value to another randomly selected symbol from the alphabet. If None then all values are assigned this way. If set then its dimensions must agree with the dimensions of idx (they are horizontally stacked).
keep_changes (bool) – If True then changes in the dataset are persistent, so each perturbation step depends on the previous ones.
- Returns
1D float array with perturbation values.
- Return type
array_like
Examples
>>> import numpy as np
>>> from pybdm import BDM, PerturbationExperiment
>>> bdm = BDM(ndim=1)
>>> X = np.ones((30, ), dtype=int)
>>> perturbation = PerturbationExperiment(bdm, X)
>>> changes = np.array([10, 20])
>>> perturbation.run(changes)
array([26.91763013, 27.34823681])
set_data(X)[source]¶
Set dataset for the perturbation experiment.
- Parameters
X (array_like) – Dataset to perturb.
property shape¶
Data shape getter.
property size¶
Data size getter.
pybdm.bdm module¶
Block Decomposition Method
The BDM class provides a top-level interface for configuring an instance of the block decomposition method as well as running actual computations approximating the algorithmic complexity of given datasets.
The configuration step is necessary for specifying the dimensionality of allowed datasets, the reference CTM data, boundary conditions for block decomposition, etc. This is why BDM is implemented in an object-oriented fashion: an instance is first configured properly and then exposes a public method BDM.bdm() for computing approximated complexity via BDM.
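A minimal sketch of this configure-then-compute workflow (the datasets and parameter values below are only illustrative; see the class documentation that follows for details):
>>> import numpy as np
>>> from pybdm import BDM
>>> bdm2d = BDM(ndim=2)                                 # binary 2D matrices
>>> k2d = bdm2d.bdm(np.random.randint(0, 2, (64, 64)))  # approximate complexity of one dataset
>>> bdm4 = BDM(ndim=1, nsymbols=4)                      # 1D sequences over a 4-symbol alphabet
>>> k4 = bdm4.bdm(np.random.randint(0, 4, (120, )))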
class pybdm.bdm.BDM(ndim, nsymbols=2, shape=None, partition=<class 'pybdm.partitions.PartitionIgnore'>, ctmname=None, warn_if_missing_ctm=True, raise_if_zero=True, **kwds)[source]¶
Bases: object
Block decomposition method.
ndim¶
Number of dimensions of target dataset objects. Positive integer.
- Type
int
nsymbols¶
Number of symbols in the alphabet.
- Type
int
partition¶
Partition algorithm class object. The class is called with the shape attribute determined automatically if not passed and other attributes passed via **kwds.
- Type
Partition class
ctmname¶
Name of the CTM dataset. If None then a CTM dataset is selected automatically based on ndim and nsymbols.
- Type
str
warn_if_missing_ctm¶
Should BDMRuntimeWarning be sent in case there is a missing CTM value. Some CTM values may be missing for larger alphabets as it is computationally infeasible to explore the entire space of parts. Missing CTM values are imputed with mean CTM complexities over all parts of a given shape. This can also be disabled globally with the global option of the same name, i.e. pybdm.options.set(warn_if_missing_ctm=False).
- Type
bool
raise_if_zero¶
Should an error be raised if the BDM value is zero. A zero value indicates that a dataset could have incorrect dimensions. This can also be disabled globally with the global option of the same name, i.e. pybdm.options.set(raise_if_zero=False).
- Type
bool
Notes
The block decomposition method in PyBDM is implemented in an object-oriented fashion. This design choice was dictated by the fact that BDM cannot be applied willy-nilly to any dataset, but has to be configured for a particular type of data (e.g. binary matrices). Hence, it is more convenient to first configure and instantiate a particular instance of BDM and then apply it freely to data, instead of passing a lot of arguments at every call.
BDM also has a natural structure corresponding to the so-called split-apply-combine strategy in data analysis. First, a large dataset is decomposed into smaller blocks for which precomputed CTM values can be efficiently looked up. Then the CTM values for the blocks are aggregated in a theory-informed way into a global approximation of the complexity of the full dataset. Thus, a BDM computation naturally decomposes into four stages (a step-by-step sketch follows the list below):
Partition (decomposition) stage. First a dataset is decomposed into blocks. This is done by the decompose() method. The method itself depends on the partition attribute, which points to a pybdm.partitions object that implements and configures a particular variant of the decomposition algorithm. A detailed description of the available algorithms can be found in Theory & Design.
Lookup stage. At this stage CTM values for blocks are looked up. This is when the CTM reference dataset is used. It is implemented in the lookup() method.
Count stage. Unique dataset blocks are counted and arranged in an efficient data structure together with their CTM values.
Aggregate stage. The final BDM value is computed based on the block counter data structure.
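The four stages can also be run by hand, which is useful for inspecting intermediate results. A minimal sketch using the methods documented below (the dataset is the same as in the count() example further down, so the final value is that block's CTM value plus log2 of its count):
>>> import numpy as np
>>> from pybdm import BDM
>>> bdm = BDM(ndim=1)
>>> X = np.ones((24, ), dtype=int)
>>> blocks = bdm.decompose(X)          # partition stage: split into blocks
>>> ctms = bdm.lookup(blocks)          # lookup stage: fetch CTM values per block
>>> counter = bdm.count(ctms)          # count stage: tally unique blocks
>>> value = bdm.compute_bdm(counter)   # aggregate stage: combine into one BDM value
>>> round(value, 6)
26.610414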
See also
pybdm.ctmdata
available CTM datasets
pybdm.partitions
available partition and boundary condition classes
bdm(X, normalized=False, check_data=True)[source]¶
Approximate complexity of a dataset with BDM.
- Parameters
X (array_like) – Dataset representation as a numpy.ndarray. Number of axes must agree with the ndim attribute.
normalized (bool) – Should BDM be normalized to be in the [0, 1] range.
check_data (bool) – Should data format be checked. May be disabled to gain some speed when calling multiple times.
- Returns
Approximate algorithmic complexity.
- Return type
float
- Raises
TypeError – If X is not an integer array and check_data=True.
ValueError – If X has more than nsymbols unique values and check_data=True.
ValueError – If X has symbols outside of the 0 to nsymbols-1 range and check_data=True.
ValueError – If the computed BDM value is 0 and raise_if_zero is True.
Notes
Detailed description can be found in Theory & Design.
Examples
>>> import numpy as np
>>> bdm = BDM(ndim=2)
>>> bdm.bdm(np.ones((12, 12), dtype=int))
25.176631293734488
compute_bdm(*counters)[source]¶
Approximate Kolmogorov complexity based on the BDM formula.
- Parameters
*counters – Counter objects grouping object keys and occurrences.
- Returns
Approximate algorithmic complexity.
- Return type
float
Notes
Detailed description can be found in Theory & Design.
Examples
>>> from collections import Counter
>>> bdm = BDM(ndim=1)
>>> c1 = Counter([('111111111111', 1.95207842085224e-08)])
>>> c2 = Counter([('111111111111', 1.95207842085224e-08)])
>>> bdm.compute_bdm(c1, c2)
1.000000019520784
compute_ent(*counters)[source]¶
Compute block entropy from a counter object.
- Parameters
*counters – Counter objects grouping object keys and occurrences.
- Returns
Block entropy in base 2.
- Return type
float
Examples
>>> from collections import Counter
>>> bdm = BDM(ndim=1)
>>> c1 = Counter([('111111111111', 1.95207842085224e-08)])
>>> c2 = Counter([('000000000000', 1.95207842085224e-08)])
>>> bdm.compute_ent(c1, c2)
1.0
count(ctms)[source]¶
Count unique blocks.
- Parameters
ctms (sequence of 2-tuples) – Ordered 1D sequence of string keys and CTM values.
- Returns
Set of unique blocks with their CTM values and numbers of occurrences.
- Return type
Counter
Examples
>>> import numpy as np
>>> bdm = BDM(ndim=1)
>>> data = np.ones((24, ), dtype=int)
>>> parts = bdm.decompose(data)
>>> ctms = bdm.lookup(parts)
>>> bdm.count(ctms)
Counter({('111111111111', 25.610413747641715): 2})
decompose(X)[source]¶
Decompose a dataset into blocks.
- Parameters
x (array_like) – Dataset of arbitrary dimensionality represented as a Numpy array.
- Yields
array_like – Dataset blocks.
- Raises
AttributeError – If blocks’ shape and dataset’s shape have different numbers of axes.
Acknowledgments
Special thanks go to Paweł Weroński for the help with the design of the non-recursive partition algorithm.
Examples
>>> import numpy as np
>>> bdm = BDM(ndim=2, shape=(2, 2))
>>> [ x for x in bdm.decompose(np.ones((4, 3), dtype=int)) ]
[array([[1, 1], [1, 1]]), array([[1, 1], [1, 1]])]
decompose_and_count(X, lookup_ctm=True)[source]¶
Decompose and count blocks.
- Parameters
X (array_like) – Dataset representation as a numpy.ndarray. Number of axes must agree with the ndim attribute.
lookup_ctm (bool) – Should CTM values be looked up.
- Returns
Lookup table mapping 2-tuples with string keys and CTM values to numbers of occurrences.
- Return type
collections.Counter
Notes
This is equivalent to calling decompose(), lookup() and count().
Examples
>>> import numpy as np
>>> bdm = BDM(ndim=1)
>>> bdm.decompose_and_count(np.ones((12, ), dtype=int))
Counter({('111111111111', 25.610413747641715): 1})
ent(X, normalized=False, check_data=True)[source]¶
Block entropy of a dataset.
- Parameters
X (array_like) – Dataset representation as a numpy.ndarray. Number of axes must agree with the ndim attribute.
normalized (bool) – Should entropy be normalized to be in the [0, 1] range.
check_data (bool) – Should data format be checked. May be disabled to gain some speed when calling multiple times.
- Returns
Block entropy in base 2.
- Return type
float
- Raises
TypeError – If X is not an integer array and check_data=True.
ValueError – If X has more than nsymbols unique values and check_data=True.
ValueError – If X has symbols outside of the 0 to nsymbols-1 range and check_data=True.
Examples
>>> import numpy as np
>>> bdm = BDM(ndim=2)
>>> bdm.ent(np.ones((12, 12), dtype=int))
0.0
lookup(blocks, lookup_ctm=True)[source]¶
Look up CTM values for blocks in a reference dataset.
- Parameters
blocks (sequence) – Ordered sequence of dataset parts.
lookup_ctm (bool) – Should CTM values be looked up.
- Yields
tuple – 2-tuple with string representation of a dataset part and its CTM value.
- Raises
KeyError – If key of an object can not be found in the reference CTM lookup table.
- Warns
BDMRuntimeWarning – If warn_if_missing_ctm=True and there is no precomputed CTM value for a part during the lookup stage. This can always be disabled with the global option of the same name.
Examples
>>> import numpy as np
>>> bdm = BDM(ndim=1)
>>> data = np.ones((12, ), dtype=int)
>>> parts = bdm.decompose(data)
>>> [ x for x in bdm.lookup(parts) ]
[('111111111111', 25.610413747641715)]
pybdm.encoding module¶
Encoding and decoding of arrays with fixed number of unique symbols.
While computing BDM, dataset blocks have to be encoded into simple hashable objects, such as strings or integers, for efficient lookup of CTM values from reference datasets.
Currently string-based keys are used in CTM datasets. However, this may be changed to integer keys in the future in order to lower the memory footprint.
Integer encoding can also be used for easy generation of objects of fixed dimensionality, as each such object over a fixed, finite alphabet of symbols can be uniquely mapped to an integer code.
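For instance, the sequence coders documented below are mutual inverses (the values here are taken from the encode_sequence() and decode_sequence() examples further down):
>>> import numpy as np
>>> from pybdm.encoding import encode_sequence, decode_sequence
>>> encode_sequence(np.array([1, 0, 0]))   # base-2 integer code of the sequence
4
>>> decode_sequence(4)                     # maps the code back to the sequence
array([1, 0, 0])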
pybdm.encoding.array_from_string(x, shape, cast_to=<class 'int'>)[source]¶
Make array from string code.
- Parameters
x (str) – String code.
shape (tuple) – Desired shape of the output array.
cast_to (type or None) – Cast array to given type. No casting if None. Defaults to integer type.
- Returns
Array encoded in the string code.
- Return type
array_like
Examples
>>> array_from_string('1010', shape=(4,))
array([1, 0, 1, 0])
>>> array_from_string('1000', shape=(2, 2))
array([[1, 0], [0, 0]])
pybdm.encoding.decode_array(code, shape, base=2, **kwds)[source]¶
Decode array of integer-symbols from a sequence code.
- Parameters
code (int) – Non-negative integer.
shape (tuple of ints) – Expected array shape.
base (int) – Encoding base.
**kwds – Keyword arguments passed to numpy.reshape().
- Returns
Numpy array.
- Return type
array_like
pybdm.encoding.decode_sequence(code, base=2, min_length=None)[source]¶
Decode sequence from a sequence code.
- Parameters
code (int) – Non-negative integer.
base (int) – Encoding base. Should be equal to the number of unique symbols in the alphabet.
min_length (int or None) – Minimal number of represented bits. Use shortest representation if None.
- Returns
1D Numpy array.
- Return type
array_like
Examples
>>> decode_sequence(4)
array([1, 0, 0])
pybdm.encoding.encode_array(x, base=2, **kwds)[source]¶
Encode array of integer-symbols.
- Parameters
x ((N, k) array_like) – Array of integer symbols.
base (int) – Encoding base.
**kwds – Keyword arguments passed to numpy.ravel().
- Returns
Integer code of an array.
- Return type
int
pybdm.encoding.encode_sequence(seq, base=2)[source]¶
Encode sequence of integer-symbols.
- Parameters
seq ((N, ) array_like) – Sequence of integer symbols represented as 1D Numpy array.
base (int) – Encoding base. Should be equal to the number of unique symbols in the alphabet.
- Returns
Integer code of a sequence.
- Return type
int
- Raises
AttributeError – If seq is not 1D.
TypeError – If seq is not of integer type.
ValueError – If seq contains values which are negative or beyond the size of the alphabet (encoding base).
Examples
>>> import numpy as np
>>> encode_sequence(np.array([1, 0, 0]))
4
pybdm.encoding.normalize_array(X)[source]¶
Normalize array so symbols are consecutively mapped to 0, 1, 2, …
- Parameters
X (array_like) – Numpy array of arbitrary dimensions.
- Returns
Numpy array of the same dimensions with mapped symbols.
- Return type
array_like
Examples
>>> import numpy as np
>>> X = np.array([1, 2, 3], dtype=int)
>>> normalize_array(X)
array([0, 1, 2])
>>> X = np.array([[1,2],[2,1]], dtype=int)
>>> normalize_array(X)
array([[0, 1], [1, 0]])
pybdm.encoding.normalize_key(key)[source]¶
Normalize part key so symbols are consecutively mapped to 0, 1, 2, …
- Parameters
key (str) – Part key as returned by string_from_array().
- Returns
Normalized key with mapped symbols.
- Return type
str
Examples
>>> normalize_key('123')
'012'
>>> normalize_key('40524')
'01230'
pybdm.exceptions module¶
PyBDM exception and warning classes.
exception pybdm.exceptions.BDMConfigurationError[source]¶
Bases: AttributeError
General BDM configuration error.
pybdm.options module¶
Global package options.
pybdm.options.warn_if_missing_ctm¶
Should warnings for missing CTM values be sent.
- Type
bool
pybdm.options.raise_if_zero¶
Should an error be raised in the case of a zero BDM value, which is usually indicative of malformed data.
- Type
bool
pybdm.options.get(name=None)[source]¶
Get option value or options dict.
- Parameters
name (str or None) – If None then a copy of the options dict is returned. If str then the given option value is returned.
See also
set()
description of the available global options
- Raises
KeyError – If name does not give a proper option name.
pybdm.options.set(warn_if_missing_ctm=None, raise_if_zero=None)[source]¶
Set global package options.
- Parameters
warn_if_missing_ctm (bool) – Should warnings for missing CTM values be sent.
raise_if_zero (bool) – Should an error be raised in the case of a zero BDM value, which is usually indicative of malformed data.
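A short sketch of working with these options (the option names are those documented above; the rest of the snippet is illustrative):
>>> import pybdm
>>> pybdm.options.set(warn_if_missing_ctm=False)   # disable missing-CTM warnings globally
>>> pybdm.options.get('warn_if_missing_ctm')
False
>>> opts = pybdm.options.get()                     # copy of the whole options dict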
pybdm.partitions module¶
Partition algorithm classes.
Partition algorithms are used during the decomposition stage of BDM (see Theory & Design and pybdm.bdm), in which datasets are sliced into blocks of appropriate sizes.
Decomposition can be done in multiple ways that handle boundaries differently. This is why partition algorithms have to be properly configured, so that it is well specified exactly which approach is to be used.
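A hedged sketch of plugging a partition class into BDM (based on the BDM attribute documentation above; the dataset and the min_length value are only illustrative):
>>> import numpy as np
>>> from pybdm import BDM
>>> from pybdm.partitions import PartitionRecursive
>>> bdm = BDM(ndim=1, partition=PartitionRecursive, min_length=2)  # extra kwds go to the partition class
>>> bdm.partition.name
'recursive'
>>> blocks = list(bdm.decompose(np.ones((20, ), dtype=int)))       # leftovers handled per the 'recursive' boundary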
class pybdm.partitions.PartitionCorrelated¶
Bases: pybdm.partitions.PartitionIgnore
Partition with the ‘correlated’ boundary condition.
shape¶
Part shape.
- Type
tuple
shift¶
Shift parameter for the sliding window.
- Type
int (positive)
Notes
See Theory & Design for a detailed description.
- Raises
AttributeError – If shift is not positive.
decompose(X)[source]¶
Decompose with the ‘correlated’ boundary.
_Partition.decompose(X)¶
Decompose a dataset into blocks.
- Parameters
x (array_like) – Dataset of arbitrary dimensionality represented as a Numpy array.
- Yields
array_like – Dataset blocks.
class pybdm.partitions.PartitionIgnore(shape)[source]¶
Bases: pybdm.partitions._Partition
Partition with the ‘ignore’ boundary condition.
shape¶
Part shape.
- Type
tuple
Notes
See Theory & Design for a detailed description.
decompose(X)[source]¶
Decompose with the ‘ignore’ boundary.
_Partition.decompose(X)¶
Decompose a dataset into blocks.
- Parameters
x (array_like) – Dataset of arbitrary dimensionality represented as a Numpy array.
- Yields
array_like – Dataset blocks.
name = 'ignore'¶
class pybdm.partitions.PartitionRecursive(shape, min_length=2)[source]¶
Bases: pybdm.partitions._Partition
Partition with the ‘recursive’ boundary condition.
shape¶
Part shape.
- Type
tuple
min_length¶
Minimum parts’ length. Non-negative. In case of multidimensional objects it specifies minimum length of any single dimension.
- Type
int
Notes
See Theory & Design for a detailed description.
decompose(X)[source]¶
Decompose with the ‘recursive’ boundary.
_Partition.decompose(X)¶
Decompose a dataset into blocks.
- Parameters
x (array_like) – Dataset of arbitrary dimensionality represented as a Numpy array.
- Yields
array_like – Dataset blocks.
name = 'recursive'¶
property params¶
pybdm.utils module¶
Utility functions.
pybdm.utils.decompose_dataset(X, shape, shift=0)[source]¶
Decompose a dataset into blocks.
- Parameters
X (array_like) – Dataset represented as a Numpy array.
shape (tuple) – Slice shape.
shift (int) – Shift value for slicing. Non-overlapping slicing if non-positive.
- Yields
array_like – Dataset blocks.
Examples
>>> import numpy as np
>>> X = np.ones((5, 3), dtype=int)
>>> [ x for x in decompose_dataset(X, (3, 3)) ]
[array([[1, 1, 1], [1, 1, 1], [1, 1, 1]]), array([[1, 1, 1], [1, 1, 1]])]
pybdm.utils.get_ctm_dataset(name)[source]¶
Get CTM dataset by name.
This function uses a global cache, so each CTM dataset is loaded into memory only once.
- Parameters
name (str) – Name of a dataset.
- Returns
CTM lookup table.
- Return type
dict
- Raises
ValueError – If a non-existent CTM dataset is requested.
pybdm.utils.iter_part_shapes(X, shape, shift=0)[source]¶
Iterate over part shapes induced by slicing.
- Parameters
X (array_like) – Dataset represented as a Numpy array.
shape (tuple) – Slice shape.
shift (int) – Shift value for slicing. Non-overlapping slicing if non-positive.
- Yields
tuple – Part shapes.
Examples
>>> import numpy as np
>>> X = np.ones((5, 3), dtype=int)
>>> [ x for x in iter_part_shapes(X, (3, 3)) ]
[(3, 3), (2, 3)]
pybdm.utils.iter_slices(X, shape, shift=0)[source]¶
Iterate over slice indices of a dataset.
Slicing is done in a way that ensures that only pieces on the boundaries of the sliced dataset can have leftovers with regard to the specified shape.
- Parameters
X (array_like) – Dataset represented as a Numpy array.
shape (tuple) – Slice shape.
shift (int) – Shift value for slicing. Non-overlapping slicing if non-positive.
- Yields
slice – Slice indices.
Examples
>>> import numpy as np
>>> X = np.ones((5, 3), dtype=int)
>>> [ x for x in iter_slices(X, (3, 3)) ]
[(slice(0, 3, None), slice(0, 3, None)), (slice(3, 5, None), slice(0, 3, None))]
Module contents¶
PyBDM: Block Decomposition Method
This package provides the pybdm.BDM class for computing approximated algorithmic complexity of arbitrarily large binary 1D and 2D arrays, as well as 1D arrays with 4, 5, 6 or 9 unique symbols, based on the Block Decomposition Method (BDM).
Theory and the design of the package are described in Theory & Design.