Wednesday, 9 July 2025

Revisiting Map Reduce

Map reduce is widely used in distributed data processing.

It's an idea from functional programming. The basic idea is that a map function is applied to every item in a list, transforming it or filtering it out, and then a reduce function combines the results into a single summary value.

A Mickey Mouse example to bring this to life: you have a list of fishing vessels, you want to filter for foreign flags and count the matches, and you want to repeat this every 6 hours and store the counts in a time series database.
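
The fishing vessel example can be sketched in a few lines of plain Python. This is only a single-machine illustration of the map and reduce steps, not how Hadoop runs them. The vessel names, the flag codes, the home flag and the function name count_foreign are all made up for the example.

```python
from functools import reduce

# Hypothetical sample data: (vessel name, flag state) pairs.
VESSELS = [
    ("Aurora", "NO"),
    ("Blue Marlin", "GB"),
    ("Celtic Dawn", "IE"),
    ("Dogger", "GB"),
]


def count_foreign(vessels, home_flag):
    """Count the vessels whose flag differs from home_flag."""
    # Map step: turn each vessel into 1 (foreign) or 0 (domestic).
    mapped = map(lambda v: 1 if v[1] != home_flag else 0, vessels)
    # Reduce step: fold the mapped values into a single total.
    return reduce(lambda total, x: total + x, mapped, 0)


if __name__ == "__main__":
    # "GB" as the home flag is an assumption for illustration.
    print(count_foreign(VESSELS, "GB"))
```

In a distributed setting the map step would run in parallel over chunks of the list on different machines, and the reduce step would combine the partial counts. Running the job every 6 hours and writing each count with a timestamp would give the time series.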

A popular open source implementation is Apache Hadoop.

Thursday, 3 July 2025

The Autoregressive Nature of LLM Operations

LLMs are AUTOREGRESSIVE generative models.

An autoregression is a regression of a variable against its own lagged values.

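For example, an AR(1) model predicts the current value from the immediately preceding value, while an AR(n) model uses the n most recent values. LLMs are autoregressive in the analogous sense: each new token is predicted from the tokens that came before it.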
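
To make the AR(n) idea concrete, here is a minimal sketch in plain Python. It assumes the coefficients are already known (fitting them is out of scope), and the function names ar_predict and rollout are made up for the example. The rollout function feeds each prediction back in as input for the next step, which is the same loop an LLM runs over tokens, although an LLM's prediction function is a neural network rather than a weighted sum.

```python
def ar_predict(history, coeffs, intercept=0.0):
    """Predict the next value from the len(coeffs) most recent values.

    coeffs[0] weights the most recent value, coeffs[1] the one before, etc.
    """
    n = len(coeffs)
    if len(history) < n:
        raise ValueError("need at least as many past values as coefficients")
    recent = history[-n:][::-1]  # most recent value first
    return intercept + sum(c * x for c, x in zip(coeffs, recent))


def rollout(history, coeffs, steps, intercept=0.0):
    """Generate `steps` new values, feeding each prediction back as input."""
    values = list(history)
    for _ in range(steps):
        values.append(ar_predict(values, coeffs, intercept))
    return values


if __name__ == "__main__":
    # AR(1) with coefficient 0.5: each value is half the previous one.
    print(rollout([8.0], [0.5], steps=3))
    # AR(2) with both coefficients 1.0: each value is the sum of the last two.
    print(rollout([1.0, 1.0], [1.0, 1.0], steps=3))
```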

One remarkable feature of these models is in-context learning, which has been hypothesised as being Bayesian in nature.

The Python Datasets library

Hugging Face maintains a Python library called datasets, which provides access to natural language training data sets, amongst others.