`latdata` — Tabular Data

Named-axis data structures built on top of NDArray.

The latdata module provides lightweight labeled 2D containers. The core design philosophy is that a Table is an NDArray with row and column labels — not a full DataFrame clone. This means:

Selection always delegates to NDArray under the hood, and Table inherits NDArray's computation methods.
Label-based indexing is the primary interface, but integer positional indexing also works.
There is no separate index object, no multi-index, and no set-operations on labels. The Axis is simply an ordered list of names with optional shortcuts (aliases).

Axis

from latpy.latdata import Axis

An Axis is an ordered list of labels with support for name-based lookups and user-defined aliases. Every Table has two Axes (.rows and .cols).

Signature	Description
`Axis(name, labels)`	Create axis with string name and label list
`len(axis)`	Number of labels
`axis.has(label) -> bool`	Check if label exists
`axis.pos(label) -> int`	Integer position of label (0-based)
`axis.alias(name, sel)`	Register a named selector (slice, int, label list, predicate)
`axis.resolve(sel) -> tuple`	Convert a user selector to internal index form

Selectors (accepted by resolve):

Selector Type	Example	Meaning
`slice`	`1:5:2`	Positional slice (integers only)
`int`	`2`	Direct integer index (supports negative)
`Label`	`"NYC"`	Single label lookup
`Sequence[Label]`	`["a", "b", "c"]`	List of labels
`(Label, Label)`	`("jan", "jun")`	Inclusive label range (by axis order, supports reverse)
`Callable[[Label], bool]`	`lambda x: x.startswith("t")`	Predicate filter
Alias name (str)	`"top_10"`	Expands to previously registered selector

Basic usage

axis = Axis("city", ["NYC", "LA", "CHI", "HOU"])

len(axis)          # 4
axis.has("LA")     # True
axis.has("SF")     # False
axis.pos("CHI")    # 2

# Resolve various selectors:
axis.resolve(0)            # ("int", 0, ["NYC"])
axis.resolve(-1)           # ("int", 3, ["HOU"])   — negative wraps
axis.resolve("LA")         # ("int", 1, ["LA"])
axis.resolve(["NYC", "HOU"])  # ("list", [0, 3], ["NYC", "HOU"])
axis.resolve(("CHI", "LA"))   # ("slice", slice(2, 0, -1), ["CHI", "LA"])  (reverse range, inclusive)
axis.resolve(lambda x: len(x) == 3)  # ("list", [0, 2, 3], ["NYC", "CHI", "HOU"])
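
The selector types listed above can be read as a dispatch on the selector's Python type. The following is a minimal stand-alone sketch of such a dispatch over a plain label list. It is not latpy's implementation: the function name resolve_selector is hypothetical, it raises KeyError/IndexError instead of ShapeError so that it runs without latpy, and alias handling is left out.

```python
from typing import Sequence


def resolve_selector(labels: Sequence[str], sel):
    """Hypothetical stand-in for Axis.resolve, operating on a plain label list."""
    n = len(labels)
    index = {}
    for i, label in enumerate(labels):
        index.setdefault(label, i)          # first occurrence wins

    def pos(label):
        if label not in index:
            raise KeyError(f"unknown label {label!r}")
        return index[label]

    if sel is None:                         # None is treated as "all labels"
        sel = slice(None)
    if isinstance(sel, slice):              # positional slice
        start, stop, step = sel.indices(n)
        # A normalized stop of -1 would wrap around if reused, so store None.
        s = slice(start, stop if stop >= 0 else None, step)
        return ("slice", s, list(labels[s]))
    if isinstance(sel, int):                # positional index; negatives wrap
        i = sel + n if sel < 0 else sel
        if not 0 <= i < n:
            raise IndexError("index out of bounds")
        return ("int", i, [labels[i]])
    if isinstance(sel, str):                # single label
        i = pos(sel)
        return ("int", i, [labels[i]])
    if isinstance(sel, tuple) and len(sel) == 2:   # inclusive label range
        a, b = pos(sel[0]), pos(sel[1])
        if a <= b:
            s = slice(a, b + 1, 1)
        else:                               # reverse range walks backwards
            s = slice(a, b - 1 if b > 0 else None, -1)
        return ("slice", s, list(labels[s]))
    if callable(sel):                       # predicate filter
        idx = [i for i, label in enumerate(labels) if sel(label)]
        return ("list", idx, [labels[i] for i in idx])
    idx = [pos(label) for label in sel]     # sequence of labels
    return ("list", idx, [labels[i] for i in idx])


cities = ["NYC", "LA", "CHI", "HOU"]
print(resolve_selector(cities, ("LA", "HOU")))   # ('slice', slice(1, 4, 1), ['LA', 'CHI', 'HOU'])
print(resolve_selector(cities, lambda c: c.endswith("U")))   # ('list', [3], ['HOU'])
```

The order of the checks matters in this sketch: strings are tested before the generic sequence branch, because a Python str is itself iterable.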

Aliases

Aliases are named shortcuts. They are stored by name and expanded during resolve. This is useful for giving semantic names to frequently used sub-selections, e.g. "features", "top_10", "validation_set".

Rationale: Aliases make code self-documenting. Instead of repeating table[:, ["col1", "col3", "col7"]] everywhere, you define table.cols.alias("features", ["col1", "col3", "col7"]) once and write table[:, "features"] thereafter.

axis = Axis("row", ["train", "val", "test"])
axis.alias("train_val", ["train", "val"])
axis.alias("not_test", lambda x: x != "test")

axis.resolve("train_val")  # ("list", [0, 1], ["train", "val"])
axis.resolve("not_test")   # ("list", [0, 1], ["train", "val"])

Aliases can wrap any selector type: a single label, a list, a slice, a callable, or even another alias name.
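
One way to picture this deferred expansion is a dictionary of stored selectors that is consulted only when a selector is resolved. The sketch below is a hypothetical stand-alone model, not latpy code: expand_alias and its cycle check are assumptions, and it assumes alias names are looked up before labels.

```python
def expand_alias(aliases: dict, sel):
    """Follow alias names until a non-alias selector is reached (hypothetical helper)."""
    seen = set()
    while isinstance(sel, str) and sel in aliases:
        if sel in seen:                      # guard against alias loops (an assumption)
            raise ValueError(f"alias cycle at {sel!r}")
        seen.add(sel)
        sel = aliases[sel]
    return sel


labels = ["train", "val", "test"]
aliases = {
    "train_val": ["train", "val"],   # alias for a list of labels
    "tv": "train_val",               # alias for another alias
    "bad": ["train", "zzz"],         # unknown label is stored without complaint
}

selected = expand_alias(aliases, "tv")
positions = [labels.index(label) for label in selected]
print(selected, positions)           # ['train', 'val'] [0, 1]
```

Because nothing is validated when the dictionary entry is written, a selector such as "bad" only fails once its labels are actually looked up.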

Edge cases

Empty labels list — Axis allows zero labels. resolve and pos will raise errors on lookups, but len(axis) returns 0 and the Axis can still store aliases.

empty = Axis("x", [])
len(empty)          # 0
empty.has("a")      # False
empty.pos("a")      # ShapeError: Axis[x]: unknown label 'a'

Duplicate labels: Constructing an Axis with duplicate labels does not raise an error, and nothing is removed. The labels list keeps every entry, so len(axis) counts duplicates, but the internal _index dict stores only the first position of each label, so lookups resolve to the first occurrence.

ax = Axis("x", ["a", "b", "a"])   # duplicate "a"
len(ax)                 # 3
ax.has("a")             # True
ax.pos("a")             # 0   (first occurrence)
ax.labels               # ["a", "b", "a"]  (duplicates are kept)

This can produce surprising behavior — resolve returns label lists based on the stored labels, but positional lookups always use the first occurrence. Users are encouraged to avoid duplicate labels.
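
Because duplicates are accepted silently, it can help to check a label list before building an Axis. The following is a small stand-alone sketch using only the standard library; find_duplicates is a hypothetical helper, not part of latpy. It also shows how a first-occurrence index of the kind described above can arise.

```python
from collections import Counter


def find_duplicates(labels):
    """Return labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label in counts if counts[label] > 1]


labels = ["a", "b", "a"]

# Build a label -> position map that keeps only the first position.
first_index = {}
for position, label in enumerate(labels):
    first_index.setdefault(label, position)

print(first_index)              # {'a': 0, 'b': 1}
print(find_duplicates(labels))  # ['a']
```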

Alias with an unknown label — If an alias name is registered but its selector refers to a label that does not exist, the error surfaces at resolve time, not at alias registration time.

ax = Axis("x", ["a", "b"])
ax.alias("bad", ["a", "z"])        # registers fine
ax.resolve("bad")                  # ShapeError: Axis[x]: unknown label 'z'

Resolve with a missing label — Direct label lookup on a non-existent label raises ShapeError.

ax = Axis("x", ["a", "b"])
ax.resolve("z")     # ShapeError: Axis[x]: unknown label 'z'

Resolve with None — Special case: None is treated as slice(None), meaning “all labels”.

ax = Axis("x", ["a", "b", "c"])
ax.resolve(None)  # ("slice", slice(0, 3, 1), ["a", "b", "c"])

Table

from latpy.latdata import Table

A Table is a 2D labeled grid: a single NDArray together with a row Axis and a column Axis. It is not a full DataFrame — it intentionally omits pivot tables, joins, stacked operations, and fancy index manipulation. Its strength is label-based reading and writing with a clean T[rows, cols] syntax.

Signature	Description
`Table.from_list(values_2d, row_labels, col_labels, dtype=None, name_rows="rows", name_cols="cols") -> Table`	Construct from nested lists; `row_labels` and `col_labels` may be omitted, in which case labels are auto-generated
`.data`	`NDArray` backing store
`.rows`	Row `Axis`
`.cols`	Column `Axis`
`.tolist() -> list[list]`	Convert to Python lists
`table[key] -> scalar \| NDArray \| Table`	Index by label / integer / slice
`.where(mask, a, b) -> Table`	Element-wise select (like `ndarray.where`)
`.sum(axis="rows") -> NDArray`	Sum over rows or columns

Construction

# From nested lists with explicit labels
T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)
# T.data is NDArray of shape (2, 3), dtype i64
# T.rows → Axis("rows", ["r1", "r2"])
# T.cols → Axis("cols", ["a", "b", "c"])
# T.tolist() → [[1, 2, 3], [4, 5, 6]]

# Auto-generated labels when omitted:
T2 = Table.from_list([[1, 2], [3, 4]])
# T2.rows.labels → ["r0", "r1"]
# T2.cols.labels → ["c0", "c1"]

Indexing — return-type rules

The indexing expression T[rows, cols] follows these rules:

rows selector	cols selector	return type	Example
single label/int	single label/int	scalar	`T["r1", "b"] → 2`
single label/int	multiple	1D NDArray	`T["r1", :] → [1,2,3]`
multiple	single label/int	1D NDArray	`T[:, "b"] → [2,5]`
`:`	`:`	Table (view)	`T[:, :] → Table`
multiple (non-slice)	multiple (non-slice)	Table (copy)	`T[["r1"], ["a","c"]] → Table`

Rationale: 1D results are plain NDArray objects (not 1-column Tables) because a 1D NDArray is simpler, lighter, and interoperates directly with mathematical operations. A 2D result is always wrapped in a Table so that labels are preserved for further chaining.
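
These rules reduce to one question per dimension: did the selector resolve to a single position or to several? The sketch below models that on nested Python lists, with positions that have already been resolved. It is an illustration only; select is a hypothetical function, and plain lists stand in for NDArray and Table.

```python
def select(values, rows, cols):
    """rows/cols are resolved positions: an int (single) or a list of ints (multiple)."""
    rows_single = isinstance(rows, int)
    cols_single = isinstance(cols, int)
    if rows_single and cols_single:
        return values[rows][cols]                          # scalar
    if rows_single:
        return [values[rows][c] for c in cols]             # 1D: one row, several columns
    if cols_single:
        return [values[r][cols] for r in rows]             # 1D: several rows, one column
    return [[values[r][c] for c in cols] for r in rows]    # 2D grid


values = [[1, 2, 3],
          [4, 5, 6]]
print(select(values, 0, 1))            # 2
print(select(values, 0, [0, 1, 2]))    # [1, 2, 3]
print(select(values, [0, 1], 1))       # [2, 5]
print(select(values, [0], [0, 2]))     # [[1, 3]]
```

The last call shows that a one-element list still counts as "multiple": the result stays two-dimensional even though it has a single row.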

T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)

# Scalar
T["r1", "b"]      # → 2  (Python int)

# 1D NDArray (single row, all cols)
T["r1", :]        # → NDArray(shape=(3,), dtype=i64, axes=("cols",))
T["r1", :].tolist()  # [1, 2, 3]

# 1D NDArray (all rows, single column)
T[:, "b"]         # → NDArray(shape=(2,), dtype=i64, axes=("rows",))
T[:, "b"].tolist()  # [2, 5]

# Table (row range, column range)
S = T[("r1", "r2"), ("a", "b")]   # inclusive label ranges are (start, end) tuples; slices are positional only
# S is Table(shape=(2, 2), rows=rows, cols=cols)
S.tolist()        # [[1, 2], [4, 5]]

# Table (full slice — view)
S = T[:, :]
S.tolist()        # [[1, 2, 3], [4, 5, 6]]

# Column shortcut: T["a"] is T[:, "a"]
T["a"]            # NDArray([1, 4])

# List-of-labels selection
T[:, ["a", "c"]]  # Table(shape=(2, 2), rows=rows, cols=cols)
T[:, ["a", "c"]].tolist()  # [[1, 3], [4, 6]]

Boolean / predicate indexing

When you pass a callable predicate as a selector, the same dimensionality rules apply. If both dimensions resolve to multiple entries, the result is a Table.

T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]],
    row_labels=["train", "val", "test"],
    col_labels=["a", "b", "c"],
)

# Predicate on rows → Table (multiple rows, all cols)
S = T[lambda r: r != "val", :]
# S is Table(shape=(2, 3), rows=rows, cols=cols)
S.rows.labels  # ["train", "test"]
S.tolist()     # [[1, 2, 3], [7, 8, 9]]

# Predicate on cols → Table (all rows, filtered cols)
S2 = T[:, lambda c: c > "a"]
S2.cols.labels  # ["b", "c"]
S2.tolist()     # [[2, 3], [5, 6], [8, 9]]
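
Predicates on string labels such as c > "a" use Python's ordinary string comparison, which is lexicographic by code point. The stand-alone sketch below (plain lists, no latpy) shows the filtering step and one consequence: uppercase labels sort before lowercase ones, so a mixed-case axis may filter differently than expected.

```python
col_labels = ["a", "b", "c"]
values = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]

# Keep the positions whose label satisfies the predicate.
keep = [j for j, label in enumerate(col_labels) if label > "a"]
kept_labels = [col_labels[j] for j in keep]
kept_values = [[row[j] for j in keep] for row in values]
print(kept_labels)   # ['b', 'c']
print(kept_values)   # [[2, 3], [5, 6], [8, 9]]

# Comparison is by code point: "B" (66) sorts before "a" (97).
mixed = ["a", "B", "c"]
mixed_kept = [label for label in mixed if label > "a"]
print(mixed_kept)    # ['c']
```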

Edge cases

from_list with empty list — Raises ShapeError because the underlying NDArray cannot infer shape from [].

Table.from_list([], row_labels=[], col_labels=[])
# ShapeError: array(): expected list or nested list

from_list with mismatched label lengths — Raises ShapeError in __post_init__ because the data shape must match the label count.

Table.from_list(
    [[1, 2], [3, 4]],
    row_labels=["r1"],         # length 1, data has 2 rows
    col_labels=["a", "b"],
)
# ShapeError: Table: row labels length must match data.shape[0]
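
A length check of this kind can be sketched with a dataclass whose __post_init__ compares the data shape against both label lists. LabeledGrid below is a hypothetical stand-in, not the latpy Table class, and it raises ValueError where latpy is documented to raise ShapeError.

```python
from dataclasses import dataclass


@dataclass
class LabeledGrid:
    """Hypothetical stand-in for Table, holding nested lists instead of an NDArray."""
    values: list
    row_labels: list
    col_labels: list

    def __post_init__(self):
        n_rows = len(self.values)
        n_cols = len(self.values[0]) if self.values else 0
        if len(self.row_labels) != n_rows:
            raise ValueError("row labels length must match number of rows")
        if len(self.col_labels) != n_cols:
            raise ValueError("col labels length must match number of columns")


try:
    LabeledGrid([[1, 2], [3, 4]], row_labels=["r1"], col_labels=["a", "b"])
except ValueError as err:
    print(err)   # row labels length must match number of rows
```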

Indexing with an unknown label: Raises ShapeError from Axis.pos.

T[:, "z"]   # ShapeError: Axis[cols]: unknown label 'z'

Indexing with out-of-range integer — Also a ShapeError.

T[10, :]    # ShapeError: Axis[rows]: index out of bounds

Mixed label/integer indexing — Labels and integers can be freely mixed. An integer is treated as a positional index, not a label name.

# T here is the 2×3 table from the Construction section.
T[0, "b"]          # scalar: 2  (row 0, column "b")
T["r1", 2]         # scalar: 3  (row "r1", column index 2)
T[1, 0]            # scalar: 4  (row 1, column 0)
T[0, ["a", "c"]]   # NDArray([1, 3])  — row 0, cols "a" and "c"

Computation

sum — Reduces along one axis, returning a 1D NDArray. Compatible axis name aliases: "rows", "row", "r", "0" for axis=0; "cols", "col", "c", "1" for axis=1.

T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)

T.sum(axis="rows")   # NDArray([5, 7, 9])   — sum down each column
T.sum(axis="cols")   # NDArray([6, 15])     — sum across each row
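
The accepted axis names amount to a small lookup table. A minimal stand-alone sketch follows, with normalize_axis and sum_axis as hypothetical helpers over nested lists (they are not latpy functions):

```python
AXIS_NAMES = {
    "rows": 0, "row": 0, "r": 0, "0": 0,
    "cols": 1, "col": 1, "c": 1, "1": 1,
}


def normalize_axis(axis: str) -> int:
    """Map an axis name to 0 (rows) or 1 (cols)."""
    if axis not in AXIS_NAMES:
        raise ValueError(f"unknown axis {axis!r}")
    return AXIS_NAMES[axis]


def sum_axis(values, axis="rows"):
    """Sum nested lists along the named axis."""
    if normalize_axis(axis) == 0:
        return [sum(col) for col in zip(*values)]   # collapse rows: one total per column
    return [sum(row) for row in values]             # collapse cols: one total per row


data = [[1, 2, 3],
        [4, 5, 6]]
print(sum_axis(data, "r"))     # [5, 7, 9]
print(sum_axis(data, "col"))   # [6, 15]
```

Naming the axis that is collapsed (rather than the one that survives) is why axis="rows" returns one value per column.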

where — Element-wise select between two values based on a boolean mask. The mask can be:

A Table with b1 data
A callable (row_label, col_label) -> bool
A broadcastable NDArray with dtype b1

T = Table.from_list(
    [[1, 2],
     [3, 4]],
    row_labels=["a", "b"],
    col_labels=["x", "y"],
)

# Mask via callable: put 99 wherever label pair matches
result = T.where(lambda r, c: r == "a" and c == "x", 99, T)
result.tolist()  # [[99, 2], [3, 4]]
# result is a Table with same shape and labels

# Mask via b1 Table
from latpy.latmath.array import zeros
from latpy.latmath.array.dtypes import B1
mask_arr = zeros((2, 2), B1)
mask_arr[0, 0] = 1
mask_t = Table(mask_arr, T.rows, T.cols)
result2 = T.where(mask_t, T, 0)
result2.tolist()  # [[1, 0], [0, 0]]
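
The element-wise selection can be modeled without latpy to make the semantics explicit. In the sketch below, where is a hypothetical free function over nested lists: the mask is either a callable taking (row_label, col_label) or a nested list of booleans, and scalar arguments are broadcast to every cell. NDArray masks and general broadcasting are not modeled.

```python
def where(mask, a, b, row_labels, col_labels):
    """Pick a where the mask is true and b elsewhere; scalars are broadcast."""

    def at(x, i, j):
        return x[i][j] if isinstance(x, list) else x

    out = []
    for i, r in enumerate(row_labels):
        row = []
        for j, c in enumerate(col_labels):
            hit = mask(r, c) if callable(mask) else mask[i][j]
            row.append(at(a, i, j) if hit else at(b, i, j))
        out.append(row)
    return out


values = [[1, 2], [3, 4]]
rows, cols = ["a", "b"], ["x", "y"]
print(where(lambda r, c: r == "a" and c == "x", 99, values, rows, cols))  # [[99, 2], [3, 4]]
print(where([[True, False], [False, False]], values, 0, rows, cols))      # [[1, 0], [0, 0]]
```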

GroupBy

from latpy.latdata import GroupBy

GroupBy implements split-apply-combine over a Table’s rows. You group rows by a column of the table, by row label, or by a callable predicate, then apply an aggregation (sum, mean, count, min, max).

Rationale: GroupBy is the classic split-apply-combine pattern from SQL/pandas. Splitting by row labels is the most common use case for labeled data, since row labels often encode categorical membership (e.g. "train", "val", "test"). Each aggregation produces a new Table whose row labels are the group keys and whose columns match the original table.

Signature	Description
`GroupBy(table, by)`	Group rows by label string, label list, or callable predicate
`.sum() -> Table`	Per-column sum within each group
`.mean() -> Table`	Per-column mean within each group
`.count() -> Table`	Number of rows in each group, repeated for every column
`.min() -> Table`	Per-column minimum within each group
`.max() -> Table`	Per-column maximum within each group

Basic example

t = Table.from_list(
    [[1, 2],
     [3, 4],
     [5, 6]],
    row_labels=["a", "b", "a"],
    col_labels=["x", "y"],
)

gb = GroupBy(t, "row")      # group by row label
result = gb.sum()
result.tolist()              # [[6, 8], [3, 4]]
result.rows.labels           # ["a", "b"]   (group keys become row labels)
result.cols.labels           # ["x", "y"]   (columns preserved)

Output of each aggregation on this data:

Aggregation	Output
`gb.sum()`	`[[6, 8], [3, 4]]`
`gb.mean()`	`[[3.0, 4.0], [3.0, 4.0]]`
`gb.count()`	`[[2, 2], [1, 1]]`
`gb.min()`	`[[1, 2], [3, 4]]`
`gb.max()`	`[[5, 6], [3, 4]]`
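
The split-apply-combine steps behind these outputs can be sketched in plain Python. This is an illustrative model, not the latpy implementation: group_rows and aggregate are hypothetical helpers, nested lists stand in for Table, and groups are assumed to appear in first-seen order of their keys.

```python
def group_rows(values, row_labels, key=None):
    """Split: collect rows under a group key, keeping first-seen key order."""
    groups = {}
    for label, row in zip(row_labels, values):
        group_key = label if key is None else str(key(label))
        groups.setdefault(group_key, []).append(row)
    return groups


def aggregate(groups, fn):
    """Apply fn to each column of each group, then combine into keys + rows."""
    keys = list(groups)
    rows = [[fn(col) for col in zip(*members)] for members in groups.values()]
    return keys, rows


values = [[1, 2], [3, 4], [5, 6]]
groups = group_rows(values, ["a", "b", "a"])
print(aggregate(groups, sum))   # (['a', 'b'], [[6, 8], [3, 4]])
print(aggregate(groups, len))   # (['a', 'b'], [[2, 2], [1, 1]])
print(aggregate(groups, lambda col: sum(col) / len(col)))   # (['a', 'b'], [[3.0, 4.0], [3.0, 4.0]])
```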

Chaining

GroupBy results are Tables, so they can be further indexed or converted.

# Group → sum → extract one column as NDArray
GroupBy(t, "row").sum()[:, "y"]    # NDArray([8, 4])

# Group → mean → convert to plain Python lists
GroupBy(t, "row").mean().tolist()  # [[3.0, 4.0], [3.0, 4.0]]

# Group → sum → sum across columns (one total per group)
GroupBy(t, "row").sum().sum(axis="cols")  # NDArray([14, 7])

Grouping by callable predicate

When you pass a callable, it receives each row label and the return value is stringified to form the group key.

t = Table.from_list(
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    row_labels=["set_a", "set_b", "set_a", "set_b"],
    col_labels=["x", "y"],
)

gb = GroupBy(t, lambda lbl: lbl.split("_")[1])
gb.sum().tolist()   # [[6, 8], [10, 12]]
gb.sum().rows.labels  # ["a", "b"]
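
Since the callable's return value is stringified, a key function that returns a non-string (for example a bool or an int) still yields string group keys. A stand-alone sketch of that key step, using plain Python and a hypothetical group_keys helper:

```python
def group_keys(row_labels, key):
    """Return the stringified group key for every row label."""
    return [str(key(label)) for label in row_labels]


labels = ["set_a", "set_b", "set_a", "set_b"]
print(group_keys(labels, lambda lbl: lbl.split("_")[1]))   # ['a', 'b', 'a', 'b']
print(group_keys(labels, lambda lbl: lbl.endswith("a")))   # ['True', 'False', 'True', 'False']
print(group_keys(labels, len))                             # ['5', '5', '5', '5']
```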

Edge cases

Group by a string that doesn’t match any row label: In the examples above, by="row" groups rows by their own labels. Any other string that matches no row label puts all rows into a single group named by that string.

t = Table.from_list([[1], [2]], row_labels=["a", "b"], col_labels=["x"])
gb = GroupBy(t, "z")             # no row label "z"
gb.sum().rows.labels             # ["z"]   — all rows grouped together
gb.sum().tolist()                # [[3]]
gb.count().tolist()              # [[2]]

Single-element groups — Groups with exactly one row work naturally; the aggregation returns that row’s values unchanged.

t = Table.from_list([[1, 2], [3, 4]], row_labels=["a", "b"], col_labels=["x", "y"])
GroupBy(t, "row").sum().tolist()    # [[1, 2], [3, 4]]  (each row is its own group)
GroupBy(t, "row").count().tolist()  # [[1, 1], [1, 1]]

All-identical rows — When all rows share the same label, there is one group containing all rows. Aggregation behaves normally.

t = Table.from_list([[1, 2], [3, 4]], row_labels=["a", "a"], col_labels=["x", "y"])
GroupBy(t, "row").sum().tolist()      # [[4, 6]]
GroupBy(t, "row").mean().tolist()     # [[2.0, 3.0]]
GroupBy(t, "row").count().tolist()    # [[2, 2]]

Empty table — If the table has no rows (shape (0, n)), GroupBy produces zero groups and aggregation returns an empty Table with shape (0, n_cols).

Design rationale

Table wraps NDArray with labels, not a full DataFrame clone

The Table deliberately does not replicate pandas DataFrames:

No multi-index — Axes are flat label lists. Hierarchical indexing is left to the application layer.
No joins/merges — Table is for in-memory computation on a single grid. Combining tables is done via NDArray operations.
No mutation — There is no drop, insert, or rename on Table. Create a new Table with the desired axes instead.
1D results are NDArrays, not 1-column Tables — This keeps the API simple and ensures computed vectors can be immediately used in math ops.
Inherits NDArray computation — Methods like where and sum delegate directly to NDArray, so performance characteristics match the underlying array library.

What Table does well

Clean T[rows, cols] syntax with label, slice, list, range, callable, and alias selectors.
Interop with NDArray for computation.
Lightweight — no dependency on pandas; uses only NDArray + standard library.
