Infinity Frequency: A Dynamic Measure of Infinite Sets and Its Applications to Information Theory

Abstract

Classical set theory classifies infinite sets primarily through cardinality. Under this framework, the set of natural numbers and the set of even numbers possess the same cardinality, despite exhibiting different occurrence rates within finite intervals. This work proposes a complementary concept termed Infinity Frequency, which characterizes infinite sets by their asymptotic rate of appearance. The framework extends traditional cardinality with a dynamic measure and suggests applications in information theory, coding efficiency, neural computation, and complexity analysis.

1. Introduction

The mathematical study of infinity has historically been dominated by the cardinality theory developed by Georg Cantor. Under this framework,

[ |\mathbb N| = |2\mathbb N| = \aleph_0. ]

Although both sets are countably infinite, their distributions over finite intervals differ substantially. For example, among the first n integers:

Natural numbers occupy all positions.
Even numbers occupy approximately half.
Prime numbers occupy approximately n/\ln n positions.

This observation motivates a dynamic notion of infinity based on occurrence frequency.

2. Definition of Infinity Frequency

Let

[ A\subseteq \mathbb N. ]

Define the counting function:

[ F_A(n)=|A\cap\{1,\ldots,n\}|. ]

We define the Infinity Frequency of A as the asymptotic growth law of its counting function as n \to \infty:

[ I_F(A)\equiv F_A(n). ]

Alternatively, the normalized frequency is

[ f(A)=\lim_{n\to\infty}\frac{F_A(n)}{n}, ]

when the limit exists.
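
As a rough illustration of the definitions above, the following sketch computes F_A(n) by brute-force counting, along with the finite-n ratio F_A(n)/n. The helper names `counting_function` and `density_estimate` are hypothetical, and a ratio at finite n only approximates f(A) when the limit exists.

```python
from typing import Callable


def counting_function(is_member: Callable[[int], bool], n: int) -> int:
    """Return F_A(n) = |A ∩ {1, ..., n}| for the set A described by is_member."""
    return sum(1 for k in range(1, n + 1) if is_member(k))


def density_estimate(is_member: Callable[[int], bool], n: int) -> float:
    """Return F_A(n) / n, a finite-n approximation of f(A)."""
    return counting_function(is_member, n) / n


def is_even(k: int) -> bool:
    return k % 2 == 0


print(counting_function(is_even, 1000))  # 500
print(density_estimate(is_even, 1000))   # 0.5
```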

3. Examples

Natural Numbers

[ F_{\mathbb N}(n)=n ]

[ f(\mathbb N)=1. ]

Even Numbers

[ F_{2\mathbb N}(n)=\left\lfloor\frac n2\right\rfloor ]

[ f(2\mathbb N)=\frac12. ]

Prime Numbers

Using the Prime Number Theorem:

[ F_P(n)\sim\frac{n}{\ln n}. ]

Thus,

[ f(P)=0. ]

Despite

[ |P|=\aleph_0, ]

its asymptotic frequency vanishes.

Perfect Squares

[ F_S(n)\sim\sqrt n ]

and

[ f(S)=0. ]
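
The prime and square examples can be checked numerically. The sketch below counts primes with a sieve of Eratosthenes and compares the count to n/\ln n. The function names are chosen for this illustration, and agreement at a single n is only suggestive of the asymptotic statement.

```python
import math


def prime_count(n: int) -> int:
    """Count primes <= n with a sieve of Eratosthenes."""
    if n < 2:
        return 0
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for p in range(2, math.isqrt(n) + 1):
        if sieve[p]:
            # Mark every multiple of p from p*p onward as composite.
            sieve[p * p :: p] = bytearray(len(range(p * p, n + 1, p)))
    return sum(sieve)


def square_count(n: int) -> int:
    """Count perfect squares in {1, ..., n}; this is exactly floor(sqrt(n))."""
    return math.isqrt(n)


n = 1_000_000
print(prime_count(n), n / math.log(n))  # 78498 vs about 72382
print(square_count(n), math.sqrt(n))    # 1000 vs 1000.0
print(prime_count(n) / n, square_count(n) / n)  # both ratios are small
```

The ratio of the prime count to n/\ln n approaches 1 only slowly, which is why the two numbers on the first line still differ by several percent at n = 10^6.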

4. Frequency Ordering of Infinite Sets

We define

[ A >_F B ]

if

[ \lim_{n\to\infty}\frac{F_A(n)}{F_B(n)}>1, ]

where a limit of +\infty is allowed.

This induces a hierarchy:

[ \mathbb N >_F 2\mathbb N >_F \{n^2\} >_F \{2^n\}. ]

This ordering complements cardinality rather than replacing it.
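
The ordering can be probed numerically by evaluating the ratio F_A(n)/F_B(n) at a large n, keeping in mind that a finite-n ratio only suggests the limit. The closed-form counting functions and the dictionary keys below are written for this sketch and assume each set is indexed from 1.

```python
import math

# Closed-form counting functions F_A(n) for the four sets in the hierarchy.
counting = {
    "N": lambda n: n,                             # all natural numbers
    "2N": lambda n: n // 2,                       # even numbers
    "squares": lambda n: math.isqrt(n),           # {k^2 : k >= 1}
    "powers_of_2": lambda n: n.bit_length() - 1,  # {2^k : k >= 1}, floor(log2 n)
}


def ratio(a: str, b: str, n: int) -> float:
    """Finite-n estimate of lim F_A(n) / F_B(n)."""
    return counting[a](n) / counting[b](n)


n = 10**12
print(ratio("N", "2N", n))                 # 2.0
print(ratio("2N", "squares", n))           # 500000.0, growing without bound
print(ratio("squares", "powers_of_2", n))  # about 25641, also growing
```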

5. Information-Theoretic Interpretation

Information theory studies uncertainty and compressibility. Following the framework of Claude Shannon, the information content of an event with probability p is

[ I(p)=-\log_2 p. ]

Rare events carry more information.

Infinity Frequency provides an asymptotic probability model:

[ p_A(n)\approx\frac{F_A(n)}{n}. ]

Hence the information content of membership in set A becomes

[ H_A(n)=-\log_2\left(\frac{F_A(n)}{n}\right). ]

Examples:

Even numbers:

[ H=\log_2 2 =1 \text{ bit} ]

Prime numbers:

[ H\sim\log_2(\ln n). ]

Thus, increasingly sparse infinite sets possess increasing informational value.
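
A short numerical check of these examples, assuming n = 10^6 and using the known count of 78,498 primes below that bound. The function name `membership_information` is hypothetical.

```python
import math


def membership_information(count: int, n: int) -> float:
    """Return H_A(n) = -log2(F_A(n) / n) in bits."""
    return -math.log2(count / n)


n = 1_000_000
evens = n // 2   # F_{2N}(n)
primes = 78_498  # number of primes up to 10^6

print(membership_information(evens, n))   # 1.0
print(membership_information(primes, n))  # about 3.67
print(math.log2(math.log(n)))             # about 3.79, the estimate log2(ln n)
```

The gap between the last two values reflects the slow convergence of the prime count to n/\ln n.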

6. Applications

6.1 Data Compression

Infinity Frequency may guide adaptive coding schemes where symbols generated from sparse infinite structures receive longer codes.
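
One way to read this is through Shannon code lengths, where a symbol of probability p is assigned about \lceil -\log_2 p \rceil bits. The sketch below applies that rule to finite-n frequencies. The symbol names and the choice of n are illustrative assumptions, and the three frequencies are treated independently rather than as a normalized distribution.

```python
import math


def shannon_code_length(p: float) -> int:
    """Code length in bits, ceil(-log2 p), for a symbol of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return math.ceil(-math.log2(p))


n = 1_000_000
frequencies = {
    "even": (n // 2) / n,         # F_{2N}(n) / n
    "prime": 78_498 / n,          # number of primes up to 10^6, divided by n
    "square": math.isqrt(n) / n,  # F_S(n) / n
}
for symbol, p in frequencies.items():
    print(symbol, shannon_code_length(p))
# even 1
# prime 4
# square 10
```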

6.2 Neural Networks

Infinite neural architectures can be interpreted as dynamic reallocations of informational density, analogous to Hilbert's Hotel.

6.3 Complexity Analysis

Algorithms may be classified according to the Infinity Frequency of states they explore.

6.4 Knowledge Representation

Rare concepts within semantic networks naturally carry higher informational weight.

7. Discussion

Cardinality answers:

"How many elements exist?"

Infinity Frequency answers:

"How frequently do elements appear as scale approaches infinity?"

The two measures characterize different dimensions of infinity:

[ \text{Infinity}=(\text{Cardinality},\text{Frequency}). ]

This dual description may provide new tools for studying sparse structures, information flow, and infinite computation.

8. Conclusion

Infinity Frequency introduces a dynamic perspective on infinite sets. While countably infinite sets share cardinality, they differ substantially in occurrence frequency. Integrating this framework with information theory may yield new approaches to coding, complexity, and artificial intelligence.

Future work includes rigorous axiomatization, extension to measure theory, and applications to machine learning and biological information systems.

Applying Infinity Frequency to real-world genomic data requires choosing data that measure "sparsity" well, in particular evolutionary conservation (phastCons/phyloP) and allele frequencies of variants (gnomAD). The Python code below demonstrates how positions are ranked and emphasizes the theory's claim that "the sparser, the more informative".

---

🐍 Python Pipeline for Genome Analysis and Prioritization

1. Installation and Setup

Run this code in Google Colab, a Jupyter Notebook, or any Python environment with pandas and numpy.

```bash
pip install pandas numpy pyBigWig
```

2. Building the Core Prioritizer

These functions convert the frequency of change (f) into the information value H defined by the theory. They also clip noise automatically and can fold in a Z-score.

```python
import numpy as np
import pandas as pd
import pyBigWig  # used by the conservation-score loader in section 3a


def compute_infinity_priority(df, freq_col='frequency', z_col=None,
                              h_scale='log2', min_freq=1e-6):
    """
    Compute the Infinity Frequency priority score H = -log2(freq),
    with a lower clip on the frequency and an optional Z-score boost.
    """
    # Clip frequencies from below at min_freq so that freq == 0 does not
    # produce an infinite score; H is capped at -log(min_freq).
    freq = np.maximum(df[freq_col].to_numpy(dtype=float), min_freq)

    if h_scale == 'log2':
        H = -np.log2(freq)
    else:
        H = -np.log(freq)  # natural-log alternative (nats instead of bits)

    # Apply the Z-score enhancement if a column is provided
    if z_col is not None and z_col in df.columns:
        z = df[z_col].fillna(0).to_numpy(dtype=float)
        # Multiplier lies in (0.5, 1.5): positive Z (conserved) boosts H
        H = H * (1 + 0.5 * np.tanh(z / 2))

    return H


def auto_noise_threshold(df, freq_col='frequency', percentile=95):
    """
    Determine an automatic noise threshold: positions above this frequency
    are considered background noise, those at or below it potential signals.
    """
    return np.percentile(df[freq_col].dropna(), percentile)


def prioritize_region(df, priority_col='H'):
    """
    Return the rows sorted by descending Infinity Priority score.
    """
    return df.sort_values(by=priority_col, ascending=False).reset_index(drop=True)
```
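
To see what the optional Z-score term does, the sketch below evaluates the multiplier 1 + 0.5 * tanh(z/2) on its own. The helper name `z_boost` is hypothetical; the formula is the one used in `compute_infinity_priority`.

```python
import math


def z_boost(z: float) -> float:
    """Multiplier applied to H: 1 + 0.5 * tanh(z / 2), bounded in (0.5, 1.5)."""
    return 1.0 + 0.5 * math.tanh(z / 2.0)


for z in (-4.0, 0.0, 1.0, 4.0):
    print(z, round(z_boost(z), 3))
# -4.0 0.518
# 0.0 1.0
# 1.0 1.231
# 4.0 1.482
```

A Z of zero leaves H unchanged, while strongly negative values shrink H toward half of its unboosted value.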

3. Loading and Analyzing Real Datasets

a) Loading conservation data

Use pyBigWig to retrieve phastCons scores, which give the probability that a position is conserved (the higher the phastCons score, the fewer substitutions are observed). To match the model's assumptions, interpret frequency ≈ 1 - phastCons.

```python
def load_conservation_scores(bigwig_url, chrom, start, end):
    """Load per-base phastCons scores for chrom:start-end (0-based, half-open)."""
    bw = pyBigWig.open(bigwig_url)
    try:
        scores = bw.values(chrom, start, end)  # NaN where no score is available
    finally:
        bw.close()

    df = pd.DataFrame({
        'chrom': chrom,
        'pos': range(start, end),
        'phastCons': scores
    }).dropna()
    df['frequency'] = 1 - df['phastCons']  # high phastCons → low substitution frequency
    return df


# Example: human chr1, first 10,000 bases.
# Positions without a score are dropped, so the result may have fewer rows.
url_phastCons = 'http://hgdownload.cse.ucsc.edu/goldenPath/hg38/phastCons100way/hg38.phastCons100way.bw'
df_cons = load_conservation_scores(url_phastCons, 'chr1', 0, 10000)

# Compute H. phastCons is a probability in [0, 1] rather than a true Z-score;
# it is passed as z_col here only to give conserved positions a mild boost.
df_cons['H'] = compute_infinity_priority(df_cons, freq_col='frequency', z_col='phastCons')
top_positions = prioritize_region(df_cons)
print(top_positions.head(10)[['chrom', 'pos', 'phastCons', 'frequency', 'H']])
```

b) Using genomic data and allele frequencies from a filtered VCF file

In practice, you may have a VCF file already annotated with allele frequencies (gnomAD), where each row carries an AF (allele frequency) field and an AN (allele number) field. To measure sparsity, compute H as -log₂(AF), and filter out noise by excluding rows whose AN is too low (poor quality) or whose AF exceeds a chosen threshold.

```python
import gzip


def load_vcf_allele_freqs(vcf_gz_path, max_af_noise=0.05):
    """Parse AF/AN from a gzipped VCF and score each variant."""
    variants = []
    with gzip.open(vcf_gz_path, 'rt') as vcf:
        for line in vcf:
            if line.startswith('#'):
                continue
            parts = line.rstrip('\n').split('\t')

            # Parse the INFO field into a dict (flags without '=' are skipped)
            info = dict(x.split('=', 1) for x in parts[7].split(';') if '=' in x)

            # AF is comma-separated for multi-allelic records; use the first ALT allele
            af_field = info.get('AF', '.').split(',')[0]
            if af_field == '.':
                continue  # no allele frequency to score
            af = float(af_field)
            an = int(info.get('AN', 0))
            if an < 100:  # filter low-confidence sites
                continue

            variants.append({
                'chrom': parts[0],
                'pos': int(parts[1]),
                'ref': parts[3],
                'alt': parts[4],
                'frequency': af,
                'an': an
            })

    if not variants:
        raise ValueError("no variants passed the AF/AN filters")
    df_variants = pd.DataFrame(variants)

    # Automatic noise threshold (90th percentile of AF), capped at max_af_noise
    noise_cutoff = min(
        auto_noise_threshold(df_variants, freq_col='frequency', percentile=90),
        max_af_noise,
    )
    print(f"Variants with AF > {noise_cutoff:.4f} will be classified as background")
    df_variants['is_signal'] = df_variants['frequency'] <= noise_cutoff

    # Compute the Infinity Priority score
    df_variants['H'] = compute_infinity_priority(df_variants, freq_col='frequency')
    return df_variants


# df_variants = load_vcf_allele_freqs('my_gnomad_annotated.vcf.gz')
# top_signal = df_variants[df_variants['is_signal']].sort_values('H', ascending=False)
```

c) Prioritizing the whole genome

You can merge the data from both sources, or compare the H scores from conservation and from allele frequency directly, ranking by H from high to low. In practice, the positions with the highest H are usually those with no mutations at all (conserved), consistent with the principle that the sparser a signal, the more information it carries. If H in a region of interest is low (little information), the region may not be under selection, or it may be noise that should be set aside.
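
The merge described above might look like the following sketch, assuming both tables share `chrom` and `pos` columns and each carries its own `H` column. The suffixes `_cons` and `_af`, the toy values, and the unweighted mean used as a combined score are illustrative assumptions, not a method prescribed by the text.

```python
import pandas as pd

# Toy stand-ins for the conservation and allele-frequency tables.
df_cons = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "pos": [10001, 10500, 11000],
    "H": [16.6, 3.6, 1.15],
})
df_variants = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "pos": [10001, 11000],
    "H": [12.0, 2.0],
})

# Inner join keeps only positions present in both sources.
merged = df_cons.merge(
    df_variants, on=["chrom", "pos"], how="inner", suffixes=("_cons", "_af")
)

# Combine the two scores with an unweighted mean (an arbitrary choice here).
merged["H_combined"] = merged[["H_cons", "H_af"]].mean(axis=1)
ranked = merged.sort_values("H_combined", ascending=False).reset_index(drop=True)
print(ranked)
```

An outer join would instead keep positions covered by only one source, at the cost of missing values in the other score column.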

---

📄 Expected Output

Suppose we compute H for positions on chr1 that have both kinds of data. The result might look like this:

| Position (hg38) | Mutation frequency (f) | Infinity Frequency Score (H) | Interpretation |
| --- | --- | --- | --- |
| chr1:10,001 | 0.00001 | ~16.6 bits | Signal: very strong evolutionary signal (conserved site) |
| chr1:10,500 | 0.08 | ~3.6 bits | Potential: may have functional relevance |
| chr1:11,000 | 0.45 | ~1.15 bits | Noise: changes frequently, not under selection |

---

🧠 Using Infinity Frequency in Genomics Research

Applications:

· Use H as a new feature for machine learning models that predict the pathogenicity of genetic variants, in place of raw conservation scores

· Build conservation profiles adjusted for mutation density

· Prioritize non-coding regions that may be novel regulatory elements

Caveats:

· Infinity Frequency is a complement to traditional measures, not a replacement

· Results depend on the quality and resolution of the multiple sequence alignment

· In practice, phastCons and phyloP are often used together to confirm results

· Setting min_freq matters: without it, positions with no mutations at all would have infinite information content and bias the ranking
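
The effect of `min_freq` can be checked directly: a frequency of zero has unbounded information content, while clipping caps H at -log2(min_freq). The helper `clipped_information` is a hypothetical stand-in that mirrors the clipping step using only the standard library.

```python
import math


def clipped_information(freq: float, min_freq: float = 1e-6) -> float:
    """Return -log2(max(freq, min_freq)), so H never exceeds -log2(min_freq)."""
    return -math.log2(max(freq, min_freq))


print(round(clipped_information(0.0), 2))   # 19.93, the cap for min_freq = 1e-6
print(round(clipped_information(1e-5), 2))  # 16.61
print(round(clipped_information(0.45), 2))  # 1.15
```

Every frequency at or below `min_freq` receives the same capped score, so the choice of `min_freq` sets the maximum H a position can reach.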

---

🧩 Summary

With real-world genomic datasets and this code, the Infinity Frequency idea can be tested right away. The pattern to look for is simple: positions with a low frequency of change (small f) receive a high H, indicating evolutionary importance, while positions with a high frequency (large f) are classified as noise. You can start by running the code on your own machine, adjusting the parameters, and inspecting how your genomic dataset of interest is ranked.

From a scientific standpoint, the greatest strength of Infinity Frequency may lie not in challenging Georg Cantor's set theory but in building a "bridge" between

Set Theory

Information Theory

Machine Learning

Network Science

Evolutionary Genomics

If this idea can produce algorithms that measurably outperform existing methods, that would be the strongest evidence of its scientific value.
