latdata — Tabular Data

Named-axis data structures built on top of NDArray.

The latdata module provides lightweight labeled 2D containers. The core design philosophy is that a Table is an NDArray with row and column labels — not a full DataFrame clone. This means:

  • Selection always delegates to NDArray under the hood, and a Table inherits NDArray's computation methods.

  • Label-based indexing is the primary interface, but integer positional indexing also works.

  • There is no separate index object, no multi-index, and no set operations on labels. An Axis is simply an ordered list of names with optional shortcuts (aliases).


Axis

from latpy.latdata import Axis

An Axis is an ordered list of labels with support for name-based lookups and user-defined aliases. Every Table has two Axes (.rows and .cols).

| Signature | Description |
|---|---|
| Axis(name, labels) | Create axis with string name and label list |
| len(axis) | Number of labels |
| axis.has(label) -> bool | Check if label exists |
| axis.pos(label) -> int | Integer position of label (0-based) |
| axis.alias(name, sel) | Register a named selector (slice, int, label list, predicate) |
| axis.resolve(sel) -> tuple | Convert a user selector to internal index form |

Selectors (accepted by resolve):

| Selector type | Example | Meaning |
|---|---|---|
| slice | 1:5:2 | Positional slice (integers only) |
| int | 2 | Direct integer index (supports negative) |
| Label | "NYC" | Single label lookup |
| Sequence[Label] | ["a", "b", "c"] | List of labels |
| (Label, Label) | ("jan", "jun") | Inclusive label range (by axis order, supports reverse) |
| Callable[[Label], bool] | lambda x: x.startswith("t") | Predicate filter |
| Alias name (str) | "top_10" | Expands to previously registered selector |

Basic usage

axis = Axis("city", ["NYC", "LA", "CHI", "HOU"])

len(axis)          # 4
axis.has("LA")     # True
axis.has("SF")     # False
axis.pos("CHI")    # 2

# Resolve various selectors:
axis.resolve(0)            # ("int", 0, ["NYC"])
axis.resolve(-1)           # ("int", 3, ["HOU"])   — negative wraps
axis.resolve("LA")         # ("int", 1, ["LA"])
axis.resolve(["NYC", "HOU"])  # ("list", [0, 3], ["NYC", "HOU"])
axis.resolve(("CHI", "LA"))   # ("slice", slice(2, 0, -1), ["CHI", "LA"]) — reverse range, both ends included
axis.resolve(lambda x: len(x) == 3)  # ("list", [0, 2, 3], ["NYC", "CHI", "HOU"])
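
The dispatch over selector types can be pictured with a small pure-Python sketch. This is not latpy's implementation: resolve_selector is a hypothetical stand-in that works on a plain list of labels, raises KeyError and IndexError instead of ShapeError, and assumes that alias names are checked before labels.

```python
def resolve_selector(labels, sel, aliases=None):
    """Illustrative stand-in for Axis.resolve: returns (kind, index, labels)."""
    aliases = aliases or {}
    n = len(labels)
    index = {}
    for i, label in enumerate(labels):
        index.setdefault(label, i)               # first occurrence wins

    def pos(label):
        if label not in index:
            raise KeyError(f"unknown label {label!r}")
        return index[label]

    if sel is None:                               # None means "all labels"
        sel = slice(None)
    if isinstance(sel, str) and sel in aliases:
        # Assumption: alias names take precedence over labels.
        return resolve_selector(labels, aliases[sel], aliases)
    if isinstance(sel, slice):                    # positional slice
        start, stop, step = sel.indices(n)
        # Read through range() so that negative steps behave correctly.
        picked = [labels[i] for i in range(start, stop, step)]
        return ("slice", slice(start, stop, step), picked)
    if isinstance(sel, int):                      # negative integers wrap
        i = sel + n if sel < 0 else sel
        if not 0 <= i < n:
            raise IndexError("index out of bounds")
        return ("int", i, [labels[i]])
    if isinstance(sel, str):                      # single label
        return ("int", pos(sel), [sel])
    if isinstance(sel, tuple) and len(sel) == 2:  # inclusive label range
        a, b = pos(sel[0]), pos(sel[1])
        if a <= b:
            return ("slice", slice(a, b + 1, 1), labels[a:b + 1])
        stop = b - 1 if b > 0 else None           # None lets the slice reach index 0
        return ("slice", slice(a, stop, -1), labels[a:stop:-1])
    if callable(sel):                             # predicate over labels
        hits = [i for i, label in enumerate(labels) if sel(label)]
        return ("list", hits, [labels[i] for i in hits])
    positions = [pos(label) for label in sel]     # sequence of labels
    return ("list", positions, list(sel))


cities = ["NYC", "LA", "CHI", "HOU"]
reverse_range = resolve_selector(cities, ("CHI", "LA"))
three_letters = resolve_selector(cities, lambda x: len(x) == 3)
```

Treating a 2-tuple as a range and a list as an explicit selection is what keeps ("jan", "jun") and ["jan", "jun"] distinct in this sketch.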

Aliases

Aliases are named shortcuts. They are stored by name and expanded during resolve. This is useful for giving semantic names to frequently used sub-selections, e.g. "features", "top_10", "validation_set".

Rationale: Aliases make code self-documenting. Instead of repeating table[:, ["col1", "col3", "col7"]] everywhere, you define table.cols.alias("features", ["col1", "col3", "col7"]) once and write table[:, "features"] thereafter.

axis = Axis("row", ["train", "val", "test"])
axis.alias("train_val", ["train", "val"])
axis.alias("not_test", lambda x: x != "test")

axis.resolve("train_val")  # ("list", [0, 1], ["train", "val"])
axis.resolve("not_test")   # ("list", [0, 1], ["train", "val"])

Aliases can wrap any selector type: a single label, a list, a slice, a callable, or even another alias name.
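
A minimal sketch of alias expansion, assuming aliases live in a plain dict keyed by name. The helper expand_alias is hypothetical, and the cycle guard is an assumption of this sketch; the text above does not say how latpy treats aliases that refer to each other in a loop.

```python
def expand_alias(aliases, sel):
    """Follow alias names until a non-alias selector is reached."""
    seen = set()
    while isinstance(sel, str) and sel in aliases:
        if sel in seen:
            # Assumed behaviour: refuse to loop forever on a cyclic alias.
            raise ValueError(f"alias cycle through {sel!r}")
        seen.add(sel)
        sel = aliases[sel]
    return sel


aliases = {
    "features": ["col1", "col3", "col7"],
    "inputs": "features",          # an alias that points at another alias
}
expanded = expand_alias(aliases, "inputs")   # ["col1", "col3", "col7"]
untouched = expand_alias(aliases, "col1")    # "col1" (not an alias)
```

Because expansion happens only when a selector is resolved, the alias target is not validated against the axis labels at registration time.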

Edge cases

Empty labels list — Axis allows zero labels. resolve and pos will raise errors on lookups, but len(axis) returns 0 and the Axis can still store aliases.

empty = Axis("x", [])
len(empty)          # 0
empty.has("a")      # False
empty.pos("a")      # ShapeError: Axis[x]: unknown label 'a'

Duplicate labels — Axis does not reject duplicate labels. The labels list keeps every entry, but the internal _index dict stores only the first position of each label, so lookups always resolve to the first occurrence.

ax = Axis("x", ["a", "b", "a"])   # duplicate "a"
len(ax)                 # 3
ax.has("a")             # True
ax.pos("a")             # 0   (first occurrence)
ax.labels               # ["a", "b", "a"]  (duplicates are kept)

This can produce surprising behavior — resolve returns label lists based on the stored labels, but positional lookups always use the first occurrence. Users are encouraged to avoid duplicate labels.
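
The first-occurrence rule, and a check that callers could run before building an Axis, might look like the sketch below. Both helper names are hypothetical and operate on plain lists; they are not part of latpy.

```python
from collections import Counter


def first_position_index(labels):
    """Map each label to the position of its first occurrence."""
    index = {}
    for i, label in enumerate(labels):
        index.setdefault(label, i)   # later duplicates do not overwrite
    return index


def duplicate_labels(labels):
    """Return the labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label in counts if counts[label] > 1]


index = first_position_index(["a", "b", "a"])   # {"a": 0, "b": 1}
dupes = duplicate_labels(["a", "b", "a"])       # ["a"]
```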

Alias with an unknown label — If an alias name is registered but its selector refers to a label that does not exist, the error surfaces at resolve time, not at alias registration time.

ax = Axis("x", ["a", "b"])
ax.alias("bad", ["a", "z"])        # registers fine
ax.resolve("bad")                  # ShapeError: Axis[x]: unknown label 'z'

Resolve with a missing label — Direct label lookup on a non-existent label raises ShapeError.

ax = Axis("x", ["a", "b"])
ax.resolve("z")     # ShapeError: Axis[x]: unknown label 'z'

Resolve with None — Special case: None is treated as slice(None), meaning “all labels”.

ax = Axis("x", ["a", "b", "c"])
ax.resolve(None)  # ("slice", slice(0, 3, 1), ["a", "b", "c"])

Table

from latpy.latdata import Table

A Table is a 2D labeled grid: a single NDArray together with a row Axis and a column Axis. It is not a full DataFrame — it intentionally omits pivot tables, joins, stacked operations, and fancy index manipulation. Its strength is label-based reading and writing with a clean T[rows, cols] syntax.

| Signature | Description |
|---|---|
| Table.from_list(values_2d, row_labels, col_labels, dtype=None, name_rows="rows", name_cols="cols") -> Table | Construct from nested lists |
| .data | NDArray backing store |
| .rows | Row Axis |
| .cols | Column Axis |
| .tolist() -> list[list] | Convert to Python lists |
| table[key] -> scalar \| NDArray \| Table | Index by label / integer / slice |
| .where(mask, a, b) -> Table | Element-wise select (like ndarray.where) |
| .sum(axis="rows") -> NDArray | Sum over rows or columns |

Construction

# From nested lists with explicit labels
T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)
# T.data is NDArray of shape (2, 3), dtype i64
# T.rows → Axis("rows", ["r1", "r2"])
# T.cols → Axis("cols", ["a", "b", "c"])
# T.tolist() → [[1, 2, 3], [4, 5, 6]]

# Auto-generated labels when omitted:
T2 = Table.from_list([[1, 2], [3, 4]])
# T2.rows.labels → ["r0", "r1"]
# T2.cols.labels → ["c0", "c1"]

Indexing — return-type rules

The indexing expression T[rows, cols] follows these rules:

| Rows selector | Cols selector | Return type | Example |
|---|---|---|---|
| single label/int | single label/int | scalar | T["r1", "b"] → 2 |
| single label/int | multiple | 1D NDArray | T["r1", :] → [1, 2, 3] |
| multiple | single label/int | 1D NDArray | T[:, "b"] → [2, 5] |
| : | : | Table (view) | T[:, :] → Table |
| multiple (non-slice) | multiple (non-slice) | Table (copy) | T[["r1"], ["a", "c"]] → Table |

Rationale: 1D results are plain NDArray objects (not 1-column Tables) because a 1D NDArray is simpler, lighter, and interoperates directly with mathematical operations. A 2D result is always wrapped in a Table so that labels are preserved for further chaining.
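
The dimensionality rule can be sketched on nested lists, assuming each selector has already been reduced to either one integer position ("single") or a list of positions ("multiple"). The function select is hypothetical and returns plain Python values where latpy would return an NDArray or a Table.

```python
def select(values, rows, cols):
    """rows / cols: an int (single) or a list of ints (multiple)."""
    row_single = isinstance(rows, int)
    col_single = isinstance(cols, int)
    if row_single and col_single:
        return values[rows][cols]                          # scalar
    if row_single:
        return [values[rows][c] for c in cols]             # 1D, along columns
    if col_single:
        return [values[r][cols] for r in rows]             # 1D, along rows
    return [[values[r][c] for c in cols] for r in rows]    # 2D grid


values = [[1, 2, 3],
          [4, 5, 6]]
scalar = select(values, 0, 1)             # 2
row = select(values, 0, [0, 1, 2])        # [1, 2, 3]
col = select(values, [0, 1], 1)           # [2, 5]
grid = select(values, [0], [0, 2])        # [[1, 3]]
```

Note that a one-element list still counts as "multiple", which is why the last call returns a 1×2 grid rather than a 1D result.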

T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)

# Scalar
T["r1", "b"]      # → 2  (Python int)

# 1D NDArray (single row, all cols)
T["r1", :]        # → NDArray(shape=(3,), dtype=i64, axes=("cols",))
T["r1", :].tolist()  # [1, 2, 3]

# 1D NDArray (all rows, single column)
T[:, "b"]         # → NDArray(shape=(2,), dtype=i64, axes=("rows",))
T[:, "b"].tolist()  # [2, 5]

# Table (inclusive label ranges on both axes)
S = T[("r1", "r2"), ("a", "b")]
# S is Table(shape=(2, 2), rows=rows, cols=cols)
S.tolist()        # [[1, 2], [4, 5]]

# Table (full slice — view)
S = T[:, :]
S.tolist()        # [[1, 2, 3], [4, 5, 6]]

# Column shortcut: T["a"] is T[:, "a"]
T["a"]            # NDArray([1, 4])

# List-of-labels selection
T[:, ["a", "c"]]  # Table(shape=(2, 2), rows=rows, cols=cols)
T[:, ["a", "c"]].tolist()  # [[1, 3], [4, 6]]

Boolean / predicate indexing

When you pass a callable predicate as a selector, the same dimensionality rules apply. If both dimensions resolve to multiple entries, the result is a Table.

T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]],
    row_labels=["train", "val", "test"],
    col_labels=["a", "b", "c"],
)

# Predicate on rows → Table (multiple rows, all cols)
S = T[lambda r: r != "val", :]
# S is Table(shape=(2, 3), rows=rows, cols=cols)
S.rows.labels  # ["train", "test"]
S.tolist()     # [[1, 2, 3], [7, 8, 9]]

# Predicate on cols → Table (all rows, filtered cols)
S2 = T[:, lambda c: c > "a"]
S2.cols.labels  # ["b", "c"]
S2.tolist()     # [[2, 3], [5, 6], [8, 9]]

Edge cases

from_list with empty list — Raises ShapeError because the underlying NDArray cannot infer shape from [].

Table.from_list([], row_labels=[], col_labels=[])
# ShapeError: array(): expected list or nested list

from_list with mismatched label lengths — Raises ShapeError in __post_init__ because the data shape must match the label count.

Table.from_list(
    [[1, 2], [3, 4]],
    row_labels=["r1"],         # length 1, data has 2 rows
    col_labels=["a", "b"],
)
# ShapeError: Table: row labels length must match data.shape[0]
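
When labels come from an external source, a pre-check along these lines can surface a mismatch before the Table is constructed. This is a sketch on plain lists: check_table_shape is a hypothetical helper, it raises ValueError rather than ShapeError, and the ragged-row check is an assumption about what a valid grid requires.

```python
def check_table_shape(values_2d, row_labels, col_labels):
    """Return (n_rows, n_cols) or raise ValueError describing the mismatch."""
    if not values_2d or not isinstance(values_2d[0], list):
        raise ValueError("expected a non-empty nested list")
    n_rows, n_cols = len(values_2d), len(values_2d[0])
    if any(len(row) != n_cols for row in values_2d):
        raise ValueError("rows have unequal lengths")
    if len(row_labels) != n_rows:
        raise ValueError(f"{len(row_labels)} row labels for {n_rows} rows")
    if len(col_labels) != n_cols:
        raise ValueError(f"{len(col_labels)} column labels for {n_cols} columns")
    return n_rows, n_cols


shape = check_table_shape([[1, 2], [3, 4]], ["r1", "r2"], ["a", "b"])   # (2, 2)
```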

Indexing with an unknown label — Raises ShapeError from Axis.pos.

T[:, "z"]   # ShapeError: Axis[cols]: unknown label 'z'

Indexing with out-of-range integer — Also a ShapeError.

T[10, :]    # ShapeError: Axis[rows]: index out of bounds

Mixed label/integer indexing — Labels and integers can be freely mixed. An integer is always treated as a positional index, never as a label name. The examples below use the 2×3 table T from the Construction section.

T[0, "b"]          # scalar: 2  (row 0, column "b")
T["r1", 2]         # scalar: 3  (row "r1", column index 2)
T[1, 0]            # scalar: 4  (row 1, column 0)
T[0, ["a", "c"]]   # NDArray([1, 3])  — row 0, cols "a" and "c"

Computation

sum — Reduces along one axis, returning a 1D NDArray. Compatible axis name aliases: "rows", "row", "r", "0" for axis=0; "cols", "col", "c", "1" for axis=1.

T = Table.from_list(
    [[1, 2, 3],
     [4, 5, 6]],
    row_labels=["r1", "r2"],
    col_labels=["a", "b", "c"],
)

T.sum(axis="rows")   # NDArray([5, 7, 9])   — sum down each column
T.sum(axis="cols")   # NDArray([6, 15])     — sum across each row
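
The axis-name handling can be sketched as a lookup table followed by a reduction over nested lists. The names AXIS_NAMES, normalize_axis, and sum_axis are hypothetical, and the sketch accepts only the string spellings listed above.

```python
AXIS_NAMES = {
    "rows": 0, "row": 0, "r": 0, "0": 0,
    "cols": 1, "col": 1, "c": 1, "1": 1,
}


def normalize_axis(axis):
    """Translate an axis name into 0 (rows) or 1 (cols)."""
    if axis not in AXIS_NAMES:
        raise ValueError(f"unknown axis name {axis!r}")
    return AXIS_NAMES[axis]


def sum_axis(values, axis):
    """Sum a nested list over the given axis, returning a flat list."""
    if normalize_axis(axis) == 0:
        return [sum(column) for column in zip(*values)]   # collapse the rows
    return [sum(row) for row in values]                   # collapse the columns


values = [[1, 2, 3],
          [4, 5, 6]]
down_columns = sum_axis(values, "rows")   # [5, 7, 9]
across_rows = sum_axis(values, "cols")    # [6, 15]
```

The axis argument names the dimension that disappears, so axis="rows" leaves one value per column.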

where — Element-wise select between two values based on a boolean mask. The mask can be:

  • A Table with b1 data

  • A callable (row_label, col_label) -> bool

  • A broadcastable NDArray with dtype b1

T = Table.from_list(
    [[1, 2],
     [3, 4]],
    row_labels=["a", "b"],
    col_labels=["x", "y"],
)

# Mask via callable: put 99 wherever label pair matches
result = T.where(lambda r, c: r == "a" and c == "x", 99, T)
result.tolist()  # [[99, 2], [3, 4]]
# result is a Table with same shape and labels

# Mask via b1 Table
from latpy.latmath.array import zeros
from latpy.latmath.array.dtypes import B1
mask_arr = zeros((2, 2), B1)
mask_arr[0, 0] = 1
mask_t = Table(mask_arr, T.rows, T.cols)
result2 = T.where(mask_t, T, 0)
result2.tolist()  # [[1, 0], [0, 0]]
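
A callable mask can be thought of as being evaluated once per (row label, column label) pair. The sketch below works on nested lists; where_by_label is a hypothetical function, and treating a and b as either a scalar or a same-shaped grid is a simplifying assumption rather than latpy's broadcasting rule.

```python
def where_by_label(values, row_labels, col_labels, mask, a, b):
    """Pick from a where mask(row_label, col_label) is true, else from b."""
    def pick(source, i, j):
        # A nested list is indexed per cell; anything else is used as a scalar.
        return source[i][j] if isinstance(source, list) else source

    return [
        [pick(a, i, j) if mask(r, c) else pick(b, i, j)
         for j, c in enumerate(col_labels)]
        for i, r in enumerate(row_labels)
    ]


values = [[1, 2],
          [3, 4]]
rows, cols = ["a", "b"], ["x", "y"]
is_ax = lambda r, c: r == "a" and c == "x"

replaced = where_by_label(values, rows, cols, is_ax, 99, values)   # [[99, 2], [3, 4]]
kept = where_by_label(values, rows, cols, is_ax, values, 0)        # [[1, 0], [0, 0]]
```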

GroupBy

from latpy.latdata import GroupBy

GroupBy implements split-apply-combine over a Table’s rows. You group rows by a column of the table, by row label, or by a callable predicate, then apply an aggregation (sum, mean, count, min, max).

Rationale: GroupBy is the classic split-apply-combine pattern from SQL/pandas. Splitting by row labels is the most common use case for labeled data, since row labels often encode categorical membership (e.g. "train", "val", "test"). Each aggregation produces a new Table whose row labels are the group keys and whose columns match the original table.

| Signature | Description |
|---|---|
| GroupBy(table, by) | Group rows by label string, label list, or callable predicate |
| .sum() -> Table | Sum of each group along columns |
| .mean() -> Table | Mean of each group along columns |
| .count() -> Table | Row count of each group |
| .min() -> Table | Minimum of each group along columns |
| .max() -> Table | Maximum of each group along columns |

Basic example

t = Table.from_list(
    [[1, 2],
     [3, 4],
     [5, 6]],
    row_labels=["a", "b", "a"],
    col_labels=["x", "y"],
)

gb = GroupBy(t, "row")      # group by row label
result = gb.sum()
result.tolist()              # [[6, 8], [3, 4]]
result.rows.labels           # ["a", "b"]   (group keys become row labels)
result.cols.labels           # ["x", "y"]   (columns preserved)

Output of each aggregation on this data:

| Aggregation | Output |
|---|---|
| gb.sum() | [[6, 8], [3, 4]] |
| gb.mean() | [[3.0, 4.0], [3.0, 4.0]] |
| gb.count() | [[2, 2], [1, 1]] |
| gb.min() | [[1, 2], [3, 4]] |
| gb.max() | [[5, 6], [3, 4]] |
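
The split-apply-combine steps behind these outputs can be sketched on nested lists. The helpers group_rows and aggregate are hypothetical, and keeping groups in first-seen order is an assumption inferred from the ["a", "b"] row labels shown above.

```python
def group_rows(values, keys):
    """Split: collect the rows that share a key, in first-seen key order."""
    groups = {}
    for key, row in zip(keys, values):
        groups.setdefault(key, []).append(row)
    return groups


def aggregate(groups, fn):
    """Apply fn to every column of every group and combine into a 2D list."""
    return [[fn(column) for column in zip(*rows)] for rows in groups.values()]


values = [[1, 2],
          [3, 4],
          [5, 6]]
groups = group_rows(values, ["a", "b", "a"])

sums = aggregate(groups, sum)                                  # [[6, 8], [3, 4]]
means = aggregate(groups, lambda col: sum(col) / len(col))     # [[3.0, 4.0], [3.0, 4.0]]
counts = aggregate(groups, len)                                # [[2, 2], [1, 1]]
mins = aggregate(groups, min)                                  # [[1, 2], [3, 4]]
maxs = aggregate(groups, max)                                  # [[5, 6], [3, 4]]
```

Every aggregation reduces each group column by column, which is why the output always has one row per group and the original column count.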

Chaining

GroupBy results are Tables, so they can be further indexed or converted.

# Group → sum → extract one column as NDArray
GroupBy(t, "row").sum()[:, "y"]    # NDArray([8, 4])

# Group → mean → convert to plain Python lists
GroupBy(t, "row").mean().tolist()  # [[3.0, 4.0], [3.0, 4.0]]

# Group → sum → sum across columns (one total per group)
GroupBy(t, "row").sum().sum(axis="cols")  # NDArray([14, 7])

Grouping by callable predicate

When you pass a callable, it receives each row label and the return value is stringified to form the group key.

t = Table.from_list(
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    row_labels=["set_a", "set_b", "set_a", "set_b"],
    col_labels=["x", "y"],
)

gb = GroupBy(t, lambda lbl: lbl.split("_")[1])
gb.sum().tolist()   # [[6, 8], [10, 12]]
gb.sum().rows.labels  # ["a", "b"]
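
Key derivation for a callable can be sketched as a single pass over the row labels. The helper group_keys is hypothetical; the str() call mirrors the stringification described above, so a callable returning an integer still produces string keys.

```python
def group_keys(row_labels, by):
    """Call by on each row label and stringify the result to get its group key."""
    return [str(by(label)) for label in row_labels]


suffix_keys = group_keys(["set_a", "set_b", "set_a", "set_b"],
                         lambda lbl: lbl.split("_")[1])   # ["a", "b", "a", "b"]
length_keys = group_keys(["x1", "y22"], len)              # ["2", "3"]
```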

Edge cases

Group by string that does not match any row label — In the examples above, by="row" groups rows by their own labels. Any other string that does not match a row label puts all rows into a single group named by that string.

t = Table.from_list([[1], [2]], row_labels=["a", "b"], col_labels=["x"])
gb = GroupBy(t, "z")             # no row label "z"
gb.sum().rows.labels             # ["z"]   — all rows grouped together
gb.sum().tolist()                # [[3]]
gb.count().tolist()              # [[2]]

Single-element groups — Groups with exactly one row work naturally; the aggregation returns that row’s values unchanged.

t = Table.from_list([[1, 2], [3, 4]], row_labels=["a", "b"], col_labels=["x", "y"])
GroupBy(t, "row").sum().tolist()    # [[1, 2], [3, 4]]  (each row is its own group)
GroupBy(t, "row").count().tolist()  # [[1, 1], [1, 1]]

All-identical row labels — When all rows share the same label, there is one group containing all rows. Aggregation behaves normally.

t = Table.from_list([[1, 2], [3, 4]], row_labels=["a", "a"], col_labels=["x", "y"])
GroupBy(t, "row").sum().tolist()      # [[4, 6]]
GroupBy(t, "row").mean().tolist()     # [[2.0, 3.0]]
GroupBy(t, "row").count().tolist()    # [[2, 2]]

Empty table — If the table has no rows (shape (0, n_cols)), GroupBy produces zero groups and each aggregation returns an empty Table with shape (0, n_cols).


Design rationale

Table wraps NDArray with labels, not a full DataFrame clone

The Table deliberately does not replicate pandas DataFrames:

  • No multi-index — Axes are flat label lists. Hierarchical indexing is left to the application layer.

  • No joins/merges — Table is for in-memory computation on a single grid. Combining tables is done via NDArray operations.

  • No mutation — There is no drop, insert, or rename on Table. Create a new Table with the desired axes instead.

  • 1D results are NDArrays, not 1-column Tables — This keeps the API simple and ensures computed vectors can be immediately used in math ops.

  • Inherits NDArray computation — Methods like where and sum delegate directly to NDArray, so performance characteristics match the underlying array library.

What Table does well

  • Clean T[rows, cols] syntax with label, slice, list, range, callable, and alias selectors.

  • Interop with NDArray for computation.

  • Lightweight — no dependency on pandas; uses only NDArray + standard library.