latdata — Tabular Data
Named-axis data structures built on top of NDArray.
The latdata module provides lightweight labeled 2D containers. The core design
philosophy is that a Table is an NDArray with row and column labels — not a
full DataFrame clone. This means:
- Selection always delegates to NDArray under the hood and inherits its computation methods.
- Label-based indexing is the primary interface, but integer positional indexing also works.
- There is no separate index object, no multi-index, and no set operations on labels. The Axis is simply an ordered list of names with optional shortcuts (aliases).
Axis
from latpy.latdata import Axis
An Axis is an ordered list of labels with support for name-based lookups
and user-defined aliases. Every Table has two Axes (.rows and .cols).
| Signature | Description |
|---|---|
| Axis(name, labels) | Create axis with string name and label list |
| len(axis) | Number of labels |
| axis.has(label) | Check if label exists |
| axis.pos(label) | Integer position of label (0-based) |
| axis.alias(name, selector) | Register a named selector (slice, int, label list, predicate) |
| axis.resolve(selector) | Convert a user selector to internal index form |
Selectors (accepted by resolve):
| Selector Type | Example | Meaning |
|---|---|---|
| slice | slice(0, 2) | Positional slice (integers only) |
| int | -1 | Direct integer index (supports negative) |
| str | "LA" | Single label lookup |
| list | ["NYC", "HOU"] | List of labels |
| tuple | ("CHI", "LA") | Inclusive label range (by axis order, supports reverse) |
| callable | lambda x: len(x) == 3 | Predicate filter |
| Alias name (str) | "train_val" | Expands to previously registered selector |
Basic usage
axis = Axis("city", ["NYC", "LA", "CHI", "HOU"])
len(axis) # 4
axis.has("LA") # True
axis.has("SF") # False
axis.pos("CHI") # 2
# Resolve various selectors:
axis.resolve(0) # ("int", 0, ["NYC"])
axis.resolve(-1) # ("int", 3, ["HOU"]) — negative wraps
axis.resolve("LA") # ("int", 1, ["LA"])
axis.resolve(["NYC", "HOU"]) # ("list", [0, 3], ["NYC", "HOU"])
axis.resolve(("CHI", "LA")) # ("slice", slice(2, 0, -1), ["CHI", "LA"]); reverse range
axis.resolve(lambda x: len(x) == 3) # ("list", [0, 2, 3], ["NYC", "CHI", "HOU"])
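The selector rules above can be sketched in plain Python. This is a hypothetical re-implementation for illustration only, not latdata's code: the names resolve_sketch and position are invented here, KeyError and IndexError stand in for ShapeError, and alias expansion is left out.

```python
from typing import List, Sequence, Tuple


def position(labels: Sequence[str], label: str) -> int:
    """0-based position of the first occurrence of `label`."""
    try:
        return list(labels).index(label)
    except ValueError:
        # KeyError is a stand-in; the text reports ShapeError in latdata.
        raise KeyError(f"unknown label {label!r}") from None


def resolve_sketch(labels: Sequence[str], selector) -> Tuple[str, object, List[str]]:
    """Hypothetical resolver: returns (kind, index, selected_labels)."""
    labels = list(labels)
    n = len(labels)
    if selector is None:                      # None means "all labels"
        sl = slice(0, n, 1)
        return ("slice", sl, labels[sl])
    if isinstance(selector, slice):           # positional slice, passed through
        return ("slice", selector, labels[selector])
    if isinstance(selector, int):             # integer index, negatives wrap
        pos = selector + n if selector < 0 else selector
        if not 0 <= pos < n:
            raise IndexError("index out of bounds")
        return ("int", pos, [labels[pos]])
    if isinstance(selector, str):             # single label
        return ("int", position(labels, selector), [selector])
    if isinstance(selector, list):            # list of labels
        return ("list", [position(labels, s) for s in selector], list(selector))
    if isinstance(selector, tuple):           # inclusive label range
        first, last = (position(labels, s) for s in selector)
        if first <= last:
            sl = slice(first, last + 1, 1)
        else:                                 # reverse range; None reaches index 0
            sl = slice(first, last - 1 if last > 0 else None, -1)
        return ("slice", sl, labels[sl])
    if callable(selector):                    # predicate over labels
        hits = [i for i, lab in enumerate(labels) if selector(lab)]
        return ("list", hits, [labels[i] for i in hits])
    raise TypeError(f"unsupported selector: {selector!r}")


cities = ["NYC", "LA", "CHI", "HOU"]
print(resolve_sketch(cities, ("CHI", "LA")))
print(resolve_sketch(cities, lambda x: len(x) == 3))
```

The order of the type checks matters: strings are tested before lists and tuples so that a label is never mistaken for a sequence of characters.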
Aliases
Aliases are named shortcuts. They are stored by name and expanded during
resolve. This is useful for giving semantic names to frequently used
sub-selections, e.g. "features", "top_10", "validation_set".
Rationale: Aliases make code self-documenting. Instead of repeating
table[:, ["col1", "col3", "col7"]] everywhere, you define
table.cols.alias("features", ["col1", "col3", "col7"]) once and write
table[:, "features"] thereafter.
axis = Axis("row", ["train", "val", "test"])
axis.alias("train_val", ["train", "val"])
axis.alias("not_test", lambda x: x != "test")
axis.resolve("train_val") # ("list", [0, 1], ["train", "val"])
axis.resolve("not_test") # ("list", [0, 1], ["train", "val"])
Aliases can wrap any selector type: a single label, a list, a slice, a callable, or even another alias name.
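Alias expansion, including aliases that point at other aliases, can be pictured as a dictionary lookup repeated until a non-alias selector is reached. The sketch below is an illustration under assumptions, not latdata's implementation: expand_alias is a hypothetical helper, and both the rule that a real label takes precedence over an alias of the same name and the cycle check are choices made here.

```python
from typing import Dict, List


def expand_alias(aliases: Dict[str, object], labels: List[str], selector):
    """Follow alias names until the selector is no longer an alias name."""
    seen = set()
    # Assumption: a real label wins over an alias with the same name.
    while isinstance(selector, str) and selector in aliases and selector not in labels:
        if selector in seen:
            raise ValueError(f"alias cycle at {selector!r}")
        seen.add(selector)
        selector = aliases[selector]
    return selector


labels = ["train", "val", "test"]
aliases = {
    "train_val": ["train", "val"],   # alias for a label list
    "fit_set": "train_val",          # alias for another alias
}
print(expand_alias(aliases, labels, "fit_set"))
print(expand_alias(aliases, labels, "test"))
```

Because the stored selector is only looked up here and never checked against the labels, an alias that names a missing label would fail later, when the expanded selector is resolved.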
Edge cases
Empty labels list — Axis allows zero labels. resolve and pos will
raise errors on lookups, but len(axis) returns 0 and the Axis can still
store aliases.
empty = Axis("x", [])
len(empty) # 0
empty.has("a") # False
empty.pos("a") # ShapeError: Axis[x]: unknown label 'a'
Duplicate labels — When constructing an Axis, duplicate labels are
silently accepted, not removed: the first occurrence wins for lookups. The labels list
keeps duplicates but the internal _index dict only stores the first
position.
ax = Axis("x", ["a", "b", "a"]) # duplicate "a"
len(ax) # 3
ax.has("a") # True
ax.pos("a") # 0 (first occurrence)
ax.labels # ["a", "b", "a"] (duplicates are kept)
This can produce surprising behavior — resolve returns label lists
based on the stored labels, but positional lookups always use the first
occurrence. Users are encouraged to avoid duplicate labels.
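The first-occurrence behavior described above is what a position dictionary gives when it is built with setdefault. The following sketch uses hypothetical helpers (build_index, find_duplicates) rather than latdata internals; the second function is one way to check labels before constructing an Axis.

```python
from typing import Dict, List, Sequence


def build_index(labels: Sequence[str]) -> Dict[str, int]:
    """Map each label to the position of its first occurrence."""
    index: Dict[str, int] = {}
    for pos, label in enumerate(labels):
        index.setdefault(label, pos)   # later duplicates do not overwrite
    return index


def find_duplicates(labels: Sequence[str]) -> List[str]:
    """Return labels that occur more than once, in first-seen order."""
    seen, dups = set(), []
    for label in labels:
        if label in seen and label not in dups:
            dups.append(label)
        seen.add(label)
    return dups


labels = ["a", "b", "a"]
print(build_index(labels))      # positions used for lookups
print(find_duplicates(labels))  # labels worth renaming before building an Axis
```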
Alias with an unknown label — If an alias name is registered but its
selector refers to a label that does not exist, the error surfaces at
resolve time, not at alias registration time.
ax = Axis("x", ["a", "b"])
ax.alias("bad", ["a", "z"]) # registers fine
ax.resolve("bad") # ShapeError: Axis[x]: unknown label 'z'
Resolve with a missing label — Direct label lookup on a non-existent
label raises ShapeError.
ax = Axis("x", ["a", "b"])
ax.resolve("z") # ShapeError: Axis[x]: unknown label 'z'
Resolve with None — Special case: None is treated as
slice(None), meaning “all labels”.
ax = Axis("x", ["a", "b", "c"])
ax.resolve(None) # ("slice", slice(0, 3, 1), ["a", "b", "c"])
Table
from latpy.latdata import Table
A Table is a 2D labeled grid: a single NDArray together with a row
Axis and a column Axis. It is not a full DataFrame — it intentionally
omits pivot tables, joins, stacked operations, and fancy index manipulation.
Its strength is label-based reading and writing with a clean T[rows, cols]
syntax.
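| Signature | Description |
|---|---|
| Table.from_list(data, row_labels, col_labels) | Construct from nested lists |
| T.data | Underlying NDArray |
| T.rows | Row Axis |
| T.cols | Column Axis |
| T.tolist() | Convert to Python lists |
| T[rows, cols] | Index by label / integer / slice |
| T.where(mask, a, b) | Element-wise select between two values by mask |
| T.sum(axis) | Sum over rows or columns |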
Construction
# From nested lists with explicit labels
T = Table.from_list(
[[1, 2, 3],
[4, 5, 6]],
row_labels=["r1", "r2"],
col_labels=["a", "b", "c"],
)
# T.data is NDArray of shape (2, 3), dtype i64
# T.rows → Axis("rows", ["r1", "r2"])
# T.cols → Axis("cols", ["a", "b", "c"])
# T.tolist() → [[1, 2, 3], [4, 5, 6]]
# Auto-generated labels when omitted:
T2 = Table.from_list([[1, 2], [3, 4]])
# T2.rows.labels → ["r0", "r1"]
# T2.cols.labels → ["c0", "c1"]
Indexing — return-type rules
The indexing expression T[rows, cols] follows these rules:
| rows selector | cols selector | return type | Example |
|---|---|---|---|
| single label/int | single label/int | scalar | T["r1", "b"] |
| single label/int | multiple | 1D NDArray | T["r1", :] |
| multiple | single label/int | 1D NDArray | T[:, "b"] |
| slice | slice | Table (view) | T[:, :] |
| multiple (non-slice) | multiple (non-slice) | Table (copy) | T[["r1", "r2"], ["a", "c"]] |
Rationale: 1D results are plain NDArray objects (not 1-column Tables)
because a 1D NDArray is simpler, lighter, and interoperates directly with
mathematical operations. A 2D result is always wrapped in a Table so that
labels are preserved for further chaining.
T = Table.from_list(
[[1, 2, 3],
[4, 5, 6]],
row_labels=["r1", "r2"],
col_labels=["a", "b", "c"],
)
# Scalar
T["r1", "b"] # → 2 (Python int)
# 1D NDArray (single row, all cols)
T["r1", :] # → NDArray(shape=(3,), dtype=i64, axes=("cols",))
T["r1", :].tolist() # [1, 2, 3]
# 1D NDArray (all rows, single column)
T[:, "b"] # → NDArray(shape=(2,), dtype=i64, axes=("rows",))
T[:, "b"].tolist() # [2, 5]
# Table (inclusive label ranges, written as tuples)
S = T[("r1", "r2"), ("a", "b")]
# S is Table(shape=(2, 2), rows=rows, cols=cols)
S.tolist() # [[1, 2], [4, 5]]
# Table (full slice — view)
S = T[:, :]
S.tolist() # [[1, 2, 3], [4, 5, 6]]
# Column shortcut: T["a"] is T[:, "a"]
T["a"] # NDArray([1, 4])
# List-of-labels selection
T[:, ["a", "c"]] # Table(shape=(2, 2), rows=rows, cols=cols)
T[:, ["a", "c"]].tolist() # [[1, 3], [4, 6]]
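The return-type rules reduce to a small decision on whether each selector picks a single entry. The sketch below is a hypothetical helper (result_kind), not part of latdata; it takes the kind strings that resolve reports ("int", "slice", "list") and does not model the view/copy distinction.

```python
def result_kind(rows_kind: str, cols_kind: str) -> str:
    """Map resolved selector kinds to the documented result type."""
    single_row = rows_kind == "int"   # a label or integer resolves to "int"
    single_col = cols_kind == "int"
    if single_row and single_col:
        return "scalar"
    if single_row or single_col:
        return "1D NDArray"
    return "Table"


print(result_kind("int", "int"))      # like T["r1", "b"]
print(result_kind("int", "slice"))    # like T["r1", :]
print(result_kind("slice", "list"))   # like T[:, ["a", "c"]]
```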
Boolean / predicate indexing
When you pass a callable predicate as a selector, the same dimensionality rules apply. If both dimensions resolve to multiple entries, the result is a Table.
T = Table.from_list(
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]],
row_labels=["train", "val", "test"],
col_labels=["a", "b", "c"],
)
# Predicate on rows → Table (multiple rows, all cols)
S = T[lambda r: r != "val", :]
# S is Table(shape=(2, 3), rows=rows, cols=cols)
S.rows.labels # ["train", "test"]
S.tolist() # [[1, 2, 3], [7, 8, 9]]
# Predicate on cols → Table (all rows, filtered cols)
S2 = T[:, lambda c: c > "a"]
S2.cols.labels # ["b", "c"]
S2.tolist() # [[2, 3], [5, 6], [8, 9]]
Edge cases
from_list with empty list — Raises ShapeError because the
underlying NDArray cannot infer shape from [].
Table.from_list([], row_labels=[], col_labels=[])
# ShapeError: array(): expected list or nested list
from_list with mismatched label lengths — Raises ShapeError in
__post_init__ because the data shape must match the label count.
Table.from_list(
[[1, 2], [3, 4]],
row_labels=["r1"], # length 1, data has 2 rows
col_labels=["a", "b"],
)
# ShapeError: Table: row labels length must match data.shape[0]
Indexing with out-of-range label — Raises ShapeError from
Axis.pos.
T[:, "z"] # ShapeError: Axis[cols]: unknown label 'z'
Indexing with out-of-range integer — Also a ShapeError.
T[10, :] # ShapeError: Axis[rows]: index out of bounds
Mixed label/integer indexing — Labels and integers can be freely mixed. An integer is treated as a positional index, not a label name.
T[0, "b"] # scalar: 2 (row 0, column "b")
T["r1", 2] # scalar: 3 (row "r1", column index 2)
T[1, 0] # scalar: 4 (row 1, column 0)
T[0, ["a", "c"]] # NDArray([1, 3]) — row 0, cols "a" and "c"
Computation
sum: Reduces along one axis, returning a 1D NDArray. Accepted axis
names: "rows", "row", "r", "0" for axis=0 (sum down each column);
"cols", "col", "c", "1" for axis=1 (sum across each row).
T = Table.from_list(
[[1, 2, 3],
[4, 5, 6]],
row_labels=["r1", "r2"],
col_labels=["a", "b", "c"],
)
T.sum(axis="rows") # NDArray([5, 7, 9]) — sum down each column
T.sum(axis="cols") # NDArray([6, 15]) — sum across each row
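The axis-name handling can be sketched on plain nested lists. This is an illustration only, with hypothetical names (AXIS_NAMES, normalize_axis, sum_sketch); accepting the integers 0 and 1 as well as the strings "0" and "1" is an assumption of the sketch, not something the text states.

```python
AXIS_NAMES = {
    "rows": 0, "row": 0, "r": 0, "0": 0,
    "cols": 1, "col": 1, "c": 1, "1": 1,
}


def normalize_axis(axis) -> int:
    """Translate an axis name into 0 (reduce rows) or 1 (reduce columns)."""
    key = str(axis)   # assumption: ints 0/1 are accepted via their string form
    if key not in AXIS_NAMES:
        raise ValueError(f"unknown axis {axis!r}")
    return AXIS_NAMES[key]


def sum_sketch(data, axis):
    """Sum a rectangular nested list along the given axis."""
    if normalize_axis(axis) == 0:
        return [sum(col) for col in zip(*data)]   # one total per column
    return [sum(row) for row in data]             # one total per row


data = [[1, 2, 3], [4, 5, 6]]
print(sum_sketch(data, "rows"))
print(sum_sketch(data, "c"))
```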
where: Element-wise select between two values based on a boolean
mask. The mask can be:
- A Table with b1 data
- A callable (row_label, col_label) -> bool
- A broadcastable NDArray with dtype b1
T = Table.from_list(
[[1, 2],
[3, 4]],
row_labels=["a", "b"],
col_labels=["x", "y"],
)
# Mask via callable: put 99 wherever label pair matches
result = T.where(lambda r, c: r == "a" and c == "x", 99, T)
result.tolist() # [[99, 2], [3, 4]]
# result is a Table with same shape and labels
# Mask via b1 Table
from latpy.latmath.array import zeros
from latpy.latmath.array.dtypes import B1
mask_arr = zeros((2, 2), B1)
mask_arr[0, 0] = 1
mask_t = Table(mask_arr, T.rows, T.cols)
result2 = T.where(mask_t, T, 0)
result2.tolist() # [[1, 0], [0, 0]]
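The selection logic behind where can be written out on nested lists. The sketch below is not latdata's implementation: where_sketch is a hypothetical function, a nested list of booleans stands in for a b1 Table or NDArray mask, and broadcasting is limited to scalars.

```python
def where_sketch(row_labels, col_labels, mask, if_true, if_false):
    """Pick if_true where the mask holds, if_false elsewhere."""

    def pick(value, i, j):
        # A nested list is indexed per cell; anything else is a scalar.
        return value[i][j] if isinstance(value, list) else value

    out = []
    for i, r in enumerate(row_labels):
        row = []
        for j, c in enumerate(col_labels):
            # Callable masks receive the label pair; list masks are indexed.
            flag = mask(r, c) if callable(mask) else mask[i][j]
            row.append(pick(if_true if flag else if_false, i, j))
        out.append(row)
    return out


data = [[1, 2], [3, 4]]
rows, cols = ["a", "b"], ["x", "y"]
print(where_sketch(rows, cols, lambda r, c: r == "a" and c == "x", 99, data))
print(where_sketch(rows, cols, [[True, False], [False, False]], data, 0))
```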
GroupBy
from latpy.latdata import GroupBy
GroupBy implements split-apply-combine over a Table’s rows. You group rows by a column of the table, by row label, or by a callable predicate, then apply an aggregation (sum, mean, count, min, max).
Rationale: GroupBy is the classic split-apply-combine pattern from
SQL/pandas. Splitting by row labels is the most common use case for labeled
data, since row labels often encode categorical membership (e.g. "train",
"val", "test"). Each aggregation produces a new Table whose row labels
are the group keys and whose columns match the original table.
| Signature | Description |
|---|---|
| GroupBy(table, by) | Group rows by label string, label list, or callable predicate |
| gb.sum() | Sum of each group along columns |
| gb.mean() | Mean of each group along columns |
| gb.count() | Row count of each group |
| gb.min() | Minimum of each group along columns |
| gb.max() | Maximum of each group along columns |
Basic example
t = Table.from_list(
[[1, 2],
[3, 4],
[5, 6]],
row_labels=["a", "b", "a"],
col_labels=["x", "y"],
)
gb = GroupBy(t, "row") # group by row label
result = gb.sum()
result.tolist() # [[6, 8], [3, 4]]
result.rows.labels # ["a", "b"] (group keys become row labels)
result.cols.labels # ["x", "y"] (columns preserved)
Output of each aggregation on this data:
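| Aggregation | Output |
|---|---|
| gb.sum() | [[6, 8], [3, 4]] |
| gb.mean() | [[3.0, 4.0], [3.0, 4.0]] |
| gb.count() | [[2, 2], [1, 1]] |
| gb.min() | [[1, 2], [3, 4]] |
| gb.max() | [[5, 6], [3, 4]] |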
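Split-apply-combine over row labels can be sketched with a dictionary of row lists. This is a minimal illustration on plain Python lists, with hypothetical helpers (split_rows, aggregate) that are not part of latdata; groups are kept in order of first appearance, matching the row labels shown in the basic example.

```python
from typing import Callable, Dict, List, Tuple


def split_rows(data: List[list], row_labels: List[str]) -> Dict[str, List[list]]:
    """Split: collect rows under their label, in first-seen order."""
    groups: Dict[str, List[list]] = {}
    for label, row in zip(row_labels, data):
        groups.setdefault(label, []).append(row)
    return groups


def aggregate(groups: Dict[str, List[list]], fn: Callable) -> Tuple[List[str], List[list]]:
    """Apply fn to each column of each group, then combine into one grid."""
    keys = list(groups)
    values = [[fn(col) for col in zip(*rows)] for rows in groups.values()]
    return keys, values


data = [[1, 2], [3, 4], [5, 6]]
groups = split_rows(data, ["a", "b", "a"])

print(aggregate(groups, sum))
print(aggregate(groups, lambda col: sum(col) / len(col)))  # mean
print(aggregate(groups, len))                              # count
print(aggregate(groups, min))
print(aggregate(groups, max))
```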
Chaining
GroupBy results are Tables, so they can be further indexed or converted.
# Group → sum → extract one column as NDArray
GroupBy(t, "row").sum()[:, "y"] # NDArray([8, 4])
# Group → mean → convert to plain Python lists
GroupBy(t, "row").mean().tolist() # [[3.0, 4.0], [3.0, 4.0]]
# Group → sum → sum across columns (one total per group)
GroupBy(t, "row").sum().sum(axis="cols") # NDArray([14, 7])
Grouping by callable predicate
When you pass a callable, it receives each row label and the return value is stringified to form the group key.
t = Table.from_list(
[[1, 2], [3, 4], [5, 6], [7, 8]],
row_labels=["set_a", "set_b", "set_a", "set_b"],
col_labels=["x", "y"],
)
gb = GroupBy(t, lambda lbl: lbl.split("_")[1])
gb.sum().tolist() # [[6, 8], [10, 12]]
gb.sum().rows.labels # ["a", "b"]
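Key derivation with a callable can be sketched as a map over the row labels followed by stringification. The helper below (group_keys) is hypothetical and only mirrors the behavior described in the text; it shows that non-string return values, such as integers, also end up as string keys.

```python
from typing import Callable, List


def group_keys(row_labels: List[str], key_fn: Callable) -> List[str]:
    """Apply key_fn to each row label and stringify the result."""
    return [str(key_fn(label)) for label in row_labels]


labels = ["set_a", "set_b", "set_a", "set_b"]
keys = group_keys(labels, lambda lbl: lbl.split("_")[1])
unique_keys = list(dict.fromkeys(keys))   # first-seen order, duplicates dropped

print(keys)
print(unique_keys)
print(group_keys(["ab", "c"], len))       # integer results become strings
```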
Edge cases
Group by string that doesn’t match any row label — The by parameter
refers to row labels. If you pass a string (other than the special value "row" used in the basic example) that doesn't match any row
label, all rows fall into a single group named by that string.
t = Table.from_list([[1], [2]], row_labels=["a", "b"], col_labels=["x"])
gb = GroupBy(t, "z") # no row label "z"
gb.sum().rows.labels # ["z"] — all rows grouped together
gb.sum().tolist() # [[3]]
gb.count().tolist() # [[2]]
Single-element groups — Groups with exactly one row work naturally; the aggregation returns that row’s values unchanged.
t = Table.from_list([[1, 2], [3, 4]], row_labels=["a", "b"], col_labels=["x", "y"])
GroupBy(t, "row").sum().tolist() # [[1, 2], [3, 4]] (each row is its own group)
GroupBy(t, "row").count().tolist() # [[1, 1], [1, 1]]
All-identical rows — When all rows share the same label, there is one group containing all rows. Aggregation behaves normally.
t = Table.from_list([[1, 2], [3, 4]], row_labels=["a", "a"], col_labels=["x", "y"])
GroupBy(t, "row").sum().tolist() # [[4, 6]]
GroupBy(t, "row").mean().tolist() # [[2.0, 3.0]]
GroupBy(t, "row").count().tolist() # [[2, 2]]
Empty table — If the table has no rows (shape (0, n)), GroupBy
produces zero groups and aggregation returns an empty Table with shape
(0, n_cols).
Design rationale
Table wraps NDArray with labels, not a full DataFrame clone
The Table deliberately does not replicate pandas DataFrames:
- No multi-index: Axes are flat label lists. Hierarchical indexing is left to the application layer.
- No joins/merges: Table is for in-memory computation on a single grid. Combining tables is done via NDArray operations.
- No mutation: There is no drop, insert, or rename on Table. Create a new Table with the desired axes instead.
- 1D results are NDArrays, not 1-column Tables: This keeps the API simple and ensures computed vectors can be immediately used in math ops.
- Inherits NDArray computation: Methods like where and sum delegate directly to NDArray, so performance characteristics match the underlying array library.
What Table does well
- Clean T[rows, cols] syntax with label, slice, list, range, callable, and alias selectors.
- Interop with NDArray for computation.
- Lightweight: no dependency on pandas; uses only NDArray + standard library.