_index.org

Big Data

Last edited: August 8, 2025

Big Data is a term for datasets large enough that traditional data processing applications are inadequate, i.e. datasets for which non-parallel processing is inadequate.

That is: “Big Data” is when Pandas and SQL are inadequate. With big data, it’s very difficult to go through and process everything sequentially. To make it work, you usually have to perform parallel processing under the hood.
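A minimal sketch of the "split, process in parallel, combine" idea, using only the Python standard library. The function names, chunk size, and worker count are made up for illustration. Threads stand in for workers here to keep the example small; in CPython they will not speed up CPU-bound work, and real big-data systems spread the chunks across processes or machines instead.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_sum_of_squares(values, chunk_size=1000, workers=4):
    """Sum x*x over `values` by reducing chunks concurrently, then combining."""
    chunks = chunked(values, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker reduces one chunk independently of the others.
        partials = list(pool.map(lambda chunk: sum(x * x for x in chunk), chunks))
    # The combine step is cheap: one number per chunk.
    return sum(partials)


data = list(range(10_000))
total = parallel_sum_of_squares(data)
```

The point is the shape of the computation: no chunk depends on any other, so the per-chunk work can run anywhere, and only the small partial results need to be brought together.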

Rules of Thumb of Datasets

  • 1000 Genomes (AWS, 260TB)
  • Common Crawl - the entire web (On PSC! 300-800 TB)
  • GDELT - https://www.gdeltproject.org/ a dataset that contains everything that’s happening in the world right now in terms of news (small!! 2.5 TB per year; however, there are a LOT of fields: 250 Million fields)

Evolution of Big Data

Good Ol’ SQL

  1. schemas are too set in stone (“not a fit for Agile development” — a research scientist)
  2. SQL sharding, when working correctly, is

KV Stores

And this is why we gave up and made Redis (or Amazon DynamoDB, Riak, Memcached), which keep only Key/Value information. We just make the key really, really complicated to support structures: GET cart:joe:15~4...
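A rough sketch of how structure gets packed into the key. A plain Python dict stands in for the store (this is not a Redis client), and the `cart:<user>:<item>` key scheme, along with the helper names, is a hypothetical layout chosen for illustration rather than the exact format of the key above.

```python
# A plain dict stands in for the key/value store; this is not a Redis client.
store = {}


def cart_key(user, item_id):
    """Build a composite key such as 'cart:joe:15' (hypothetical scheme)."""
    return f"cart:{user}:{item_id}"


def put(key, value):
    store[key] = value


def get(key):
    return store.get(key)


def keys_with_prefix(prefix):
    """Emulate a structured lookup by scanning for a key prefix."""
    return sorted(k for k in store if k.startswith(prefix))


put(cart_key("joe", 15), 4)   # joe has 4 of item 15
put(cart_key("joe", 22), 1)
put(cart_key("ann", 15), 2)

joe_cart = keys_with_prefix("cart:joe:")
```

The store itself knows nothing about carts or users; all of the "schema" lives in how the key string is built and parsed.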

binary operation

Last edited: August 8, 2025

A binary operation means that you are taking two things in and you are getting one thing out; for instance:

\begin{equation} f: (\mathbb{F},\mathbb{F}) \to \mathbb{F} \end{equation}

This one is also closed, since the output lands back in \(\mathbb{F}\), but binary operations don’t have to be.
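A small sketch of the closed versus not-closed distinction on a finite set. The set \(\{0,1,2,3,4\}\) and the helper `is_closed` are chosen here purely for illustration: addition mod 5 always lands back in the set, while ordinary addition can leave it.

```python
def is_closed(op, elements):
    """Check whether op(a, b) stays inside `elements` for every pair (a, b)."""
    return all(op(a, b) in elements for a in elements for b in elements)


z5 = {0, 1, 2, 3, 4}


def add_mod5(a, b):
    return (a + b) % 5   # always lands back in z5


def plain_add(a, b):
    return a + b         # can leave z5, e.g. 4 + 4 = 8


closed_mod5 = is_closed(add_mod5, z5)
closed_plain = is_closed(plain_add, z5)
```

Both functions take two things in and give one thing out, so both are binary operations; only the first is closed on this set.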

binomial distribution

Last edited: August 8, 2025

A binomial distribution is a type of distribution whose trials are:

  • Binary
  • Independent
  • Fixed number
  • Same probability: “That means: WITH REPLACEMENT”

Think: “what’s the probability of getting \(k\) heads in \(n\) coin flips, given that the probability of heads is \(p\)”.

constituents

We write:

\begin{equation} X \sim Bin(n,p) \end{equation}

where \(n\) is the number of trials and \(p\) is the probability of success on each trial.

requirements

Here is the probability mass function:

\begin{equation} P(X=k) = {n \choose k} p^{k}(1-p)^{n-k} \end{equation}
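The probability mass function can be evaluated directly. This is a minimal sketch using `math.comb` from the standard library; the function name `binomial_pmf` is just a label picked for this example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 2 heads in 4 fair coin flips: 6 * 0.5**4
two_heads = binomial_pmf(2, 4, 0.5)

# The probabilities over k = 0..n should sum to 1 (up to float rounding).
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
```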

bioinformatics

Last edited: August 8, 2025

bioinformatics is a field of biology that deals with biological information, blending CS, data, strategies, and of course biology into one thing.

First, let’s review genetic information

possible use for bioinformatics

  • Find the start/stop codons of a known gene, and determine the gene and protein length
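A simplified sketch of the start/stop codon search. It assumes a DNA string on the forward strand only, takes the first `ATG` as the start, uses the standard stop codons `TAA`, `TAG`, and `TGA`, and ignores introns. The function names and the example sequence are made up for illustration; real gene finding is considerably more involved.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orf(dna):
    """Return (start, end) of the first ATG...stop span, or None if not found."""
    start = dna.find(START_CODON)
    if start == -1:
        return None
    # Walk codon by codon, staying in the reading frame set by the start codon.
    for i in range(start + 3, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            return start, i + 3   # end is exclusive and includes the stop codon
    return None


def gene_and_protein_length(dna):
    """Gene length in bases (start through stop) and protein length in amino acids."""
    orf = find_orf(dna)
    if orf is None:
        return None
    start, end = orf
    gene_length = end - start
    protein_length = gene_length // 3 - 1   # the stop codon adds no amino acid
    return gene_length, protein_length


seq = "CCATGGCTGAATTTTAGGC"
lengths = gene_and_protein_length(seq)
```

In this sequence, the `TGA` spanning positions 7 to 9 is out of frame, so it is skipped; the in-frame `TAG` is what ends the gene.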

bitmask

Last edited: August 8, 2025

bitmasking is a very helpful technique for creating and manipulating bit vectors.

  • | with a 1-mask is useful for turning things on
  • & with a 0-mask is useful for turning things off (bitvector & not(1-mask))
  • | is useful for set unions
  • & is useful for intersections of bits
  • ^ is useful for flipping isolated bits: 0 is bit preserving, 1 is bit negating
  • ~ is useful for flipping all bits
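The operations above, sketched on 8-bit values. The width and the example bit patterns are arbitrary choices for illustration. Python integers are unbounded, so `~` produces a negative number; masking with `ALL_ONES` keeps the result within 8 bits, which a fixed-width type in C would do implicitly.

```python
WIDTH = 8
ALL_ONES = (1 << WIDTH) - 1   # 0b1111_1111, keeps results within 8 bits

bits = 0b0101_0000
mask = 0b0001_0100            # a "1-mask" selecting bits 2 and 4

turned_on = bits | mask                  # 0b0101_0100: selected bits forced to 1
turned_off = bits & (~mask & ALL_ONES)   # 0b0100_0000: selected bits forced to 0
flipped = bits ^ mask                    # 0b0100_0100: selected bits negated
inverted = ~bits & ALL_ONES              # 0b1010_1111: every bit negated

# Treating bit vectors as sets of positions:
a, b = 0b0011, 0b0110
union = a | b          # 0b0111
intersection = a & b   # 0b0010
```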